Job Summary:
NVIDIA has been transforming computer graphics and accelerated computing for over 25 years and is seeking a highly skilled HPC Cluster Engineer. The role involves designing, deploying, and operating GPU compute clusters for EDA and high-performance computing workloads, ensuring their performance, scalability, and reliability while collaborating with researchers and infrastructure teams.
Responsibilities:
• Develop and enhance our ecosystem around GPU-accelerated computing, including scalable automation solutions.
• Continuously improve infrastructure provisioning, management, observability, and day-to-day operations through automation.
• Provide technical leadership and strategic guidance for managing large-scale HPC systems, including the deployment of compute, networking, and storage.
• Foster strong customer and cross-functional partnerships to ensure consistent cluster support and rapidly adapt to evolving user needs.
• Support our researchers in running their EDA workloads, including performance analysis and optimization.
• Conduct root cause analysis and suggest corrective action. Proactively find and fix issues before they occur.
• Build innovative tooling to accelerate researchers' velocity, debugging, and software performance at scale.
Qualifications:
Required:
• Bachelor’s degree in Computer Science, Electrical Engineering or related field or equivalent experience.
• Minimum of 5 years of proven experience designing and operating large-scale compute infrastructure, including cluster configuration management tools such as BCM or Ansible.
• Experience with AI/HPC job schedulers and orchestrators, such as Slurm, LSF, PBS or K8s. Applied experience with AI/HPC workflows that use MPI and NCCL.
• Proficiency with Linux, including Rocky/CentOS/RHEL and/or Ubuntu distributions. A solid understanding of container technologies such as Enroot and Docker.
• Proficiency in Python and Bash.
• Experience analyzing and tuning performance for a variety of EDA workloads. Excellent problem-solving skills, with the ability to analyze complex systems, identify bottlenecks, and implement scalable solutions.
• Excellent communication and collaboration skills, with the ability to work effectively with various teams and individuals.
• Passion for continual learning and staying ahead of new technologies and effective approaches in the HPC infrastructure field.
Preferred:
• Background with NVIDIA GPUs, CUDA Programming, NCCL and MLPerf benchmarking.
• Experience supporting EDA workloads and tools.
• Familiarity with high-speed networking for HPC, including InfiniBand, RDMA, and RoCE.
• Understanding of fast, distributed storage systems such as Lustre and GPFS for AI/HPC workloads.
• Familiarity with metrics collection and visualization at scale with Prometheus, OpenSearch and Grafana.
Company:
NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI. Founded in 1993, the company is headquartered in Santa Clara, USA, and has more than 10,000 employees. The company is currently late stage.