Job Summary:
NVIDIA is a leading technology company known for its groundbreaking developments in Artificial Intelligence and High-Performance Computing. It is seeking a Senior Software Engineer for its CSP Engagements team to focus on cloud-native stacks for advanced AI/ML datacenters. The engineer will define workflows, prototype enhancements, and debug complex issues, collaborating with various teams to deliver integrated solutions and technical documentation.
Responsibilities:
• Perform deep-dive debugging of multi-rack, multi-tenant clusters, including scheduler behavior, container runtime issues, device-plugin crashes, and RDMA/IB fabric anomalies.
• Gather customer requirements and prototype feature extensions for Kubernetes operators, Slurm plugins, and custom micro-services that expose new GPU capabilities.
• Drive joint architecture reviews and “whiteboard” sessions with CSP and internal platform teams; convert findings into RFCs and upstream pull requests.
• Create reproducible testbeds (Helm/Ansible/Terraform) that mirror customer environments; automate validation and benchmark suites.
• Deliver technical collateral (design docs, how-to guides, demo scripts) and present at customer on-sites, KubeCon, and SlurmUG.
• Collaborate with AE, FAE, and Solution Architect teams to deliver integrated customer solutions and technical documentation.
Qualifications:
Required:
• Strong source-level expertise in Kubernetes internals (scheduler, CRI/CNI/CSI, operators) and Slurm (federation, power-save, plugins).
• Hands-on experience integrating next-gen GPUs (Blackwell/GB200/GB300) or comparable accelerators into containerized clusters.
• Proven track record debugging large-scale, cloud-native stacks across networking (RDMA/RoCE), storage, and control planes.
• Customer-facing engineering or solutions-architect background: requirements gathering, PoC ownership, roadmap influence.
• Familiarity with CI/CD (GitHub Actions, Tekton), observability (Prometheus, OpenTelemetry), and infrastructure-as-code.
• Excellent communication: able to switch between deep technical detail and high-level business impact.
• 10+ years of professional software development experience in distributed systems (Go, Rust, C/C++ or Python for tooling).
• BS or MS (or equivalent experience) in Computer Engineering, Computer Science, or related field.
Preferred:
• Upstream contributions to Kubernetes, Slurm, Volcano, or similar projects.
• Experience with GPU computing (CUDA) and deep learning workloads.
Company:
NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI. Founded in 1993, the company is headquartered in Santa Clara, USA, and has more than 10,000 employees. The company is currently Late Stage.