Job Summary:
NVIDIA is a pioneer in accelerated computing, known for inventing the GPU and driving breakthroughs in gaming, computer graphics, high-performance computing, and artificial intelligence. The Senior GPU Supercomputer Scheduler Engineer will design and implement GPU compute clusters to support demanding workloads and improve scheduling features for AI and ML systems.
Responsibilities:
• Design and develop new scheduling features and add-on services to improve GPU compute clusters across many dimensions, such as resource usage fairness, GPU occupancy, GPU waste, application resilience, application performance and power usage.
• Design and develop batch workload management and orchestration services
• Provide support to staff and end users to resolve batch scheduler issues
• Build and improve our ecosystem around GPU-accelerated computing
• Analyze and optimize the performance of deep learning workflows
• Develop large-scale automation solutions
• Perform root cause analysis and suggest corrective actions for problems at both large and small scales
• Identify and fix potential problems before they occur
Qualifications:
Required:
• Bachelor’s degree in Computer Science, Electrical Engineering or related field or equivalent experience
• 5+ years of work experience
• Strong understanding of batch scheduling, preferably with experience in schedulers such as SLURM or K8s batch schedulers (Kueue, Volcano, etc.)
• Significant experience in systems programming languages such as C/C++ and Go, as well as scripting languages such as Python and Bash
• Established experience with the Linux operating system, environment, and tools
• Experience analyzing and tuning performance for a variety of AI workloads
• In-depth understanding of container technologies like Docker, Singularity, Podman
• Flexibility/adaptability for working in a dynamic environment with different frameworks and requirements
• Excellent communication, interpersonal and customer collaboration skills
Preferred:
• Knowledge of high-performance computing
• Open-source software contributions
• Experience with deep learning frameworks like PyTorch and TensorFlow
• Passion for software development processes
Company:
NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI. Founded in 1993, the company is headquartered in Santa Clara, USA, and has more than 10,000 employees. The company is currently late stage.