Job Summary:
UCSF Health is a leading university dedicated to promoting health worldwide through advanced biomedical research and education. It is seeking an HPC Systems Engineer to support the development and maintenance of high-performance computing clusters, ensuring the integrity and availability of cyberinfrastructure systems while providing consultation and support to researchers.
Responsibilities:
• Applies advanced systems / infrastructure concepts to define, design, implement, and operate highly complex research cyberinfrastructure (CI) systems, services and technology solutions. Proposes and implements highly complex system or device enhancements such as software, hardware and network configuration, updates and installations for projects or services of broad scope. Sets standards for monitoring and maintaining the health and integrity of CI systems, including upgrading and patching.
• Independently manages systems and services of large-facility, campuswide, medical center, Office of the President and / or institution-wide scope and makes recommendations for purchases or upgrades. Performs complex and advanced analysis to acquire, install, modify and support operating systems, databases, utilities and web-related tools. Selects methods and techniques to obtain solutions. Interacts with senior management. May perform complex network integration tasks and interoperability assessments for interconnected servers or components of clusters for communication. Supports and collaborates with researchers and other key IT (e.g. network and security) and Data Center partners in a timely manner.
• Specifies, writes and executes highly complex software and scripts to support systems management, log analysis, monitoring, deployment, configuration management, and other system administration duties for multiple, highly integrated systems.
• Provides consultation, training, support, and guidance to researchers enabling them to utilize HPC resources effectively.
• Maintains complex security systems. Interprets and adopts campus, medical center or Office of the President, system and regulation-based security policies to control access to networked resources. Provides recommendations and requirements on network access controls.
• Collaborates with, and may provide leadership to, other Systems Engineers within the CI ecosystem/higher-education community. Regularly contributes best-practices documentation, presents at conferences, or publishes in peer-reviewed journals.
• Defines and tracks performance metrics to ensure efficient current and future use of cyberinfrastructure resources.
Qualifications:
Required:
• Bachelor's degree in a related area such as computer science or engineering and 6+ years of experience with large-scale or HPC systems, or 10+ years of related experience with large-scale or HPC systems
• Expert knowledge of HPC systems infrastructure design
• Strong knowledge of high-performance parallel filesystems and storage such as GPFS, Lustre, VAST, or DDN
• Advanced knowledge of computer security best practices and policies, including demonstrated experience securing research cyberinfrastructure systems to meet NIST SP 800-171 / 800-223, HIPAA or IS-3 requirements
• Demonstrated testing and test planning skills. Demonstrated ability to create automated testing.
• Knowledge of the design and operation of HPC job schedulers such as Slurm or PBS
• Demonstrated skill (5+ years) deploying, managing, and troubleshooting Warewulf (or similar) InfiniBand-based clusters
• Ability to elicit and communicate technical and non-technical information in a clear and concise manner.
• Self-motivated; able to work independently and as part of a team. Demonstrates problem-solving skills. Able to learn effectively and meet deadlines.
• Understanding of system performance monitoring and actions that can be taken to improve or correct performance.
• Demonstrated advanced knowledge, skills and abilities associated with system problem identification and resolution. Experience with design, configuration, operation, repair, and tuning of technology systems.
• Advanced experience writing and editing the most complex scripts used to perform system maintenance and administration.
• Ability to write technical documentation in a clear and concise manner. Ability to develop runbooks defining complex technical processes in a clear and concise manner.
Preferred:
• Knowledge of the design, development, and application of technology and systems to meet business needs.
• General knowledge of other areas of IT. Thorough understanding of and experience with systems-related issues and actions that can be taken to improve or correct performance.
• Demonstrated skills associated with adapting equipment and technology to serve user needs. Demonstrated comprehensive understanding of how system management actions affect other systems, system users and dependent/related functions.
Company:
UCSF Health offers healthcare services in the areas of cancer, heart disorders, neurological disorders, organ transplants, and more. Founded in 1990, the company is headquartered in San Francisco, USA, and has 5,001 to 10,000 employees. The company is currently Late Stage.