Job Summary:
NVIDIA is known as 'the AI computing company' and is seeking a Senior Software Engineer to enhance its CI/CD infrastructure for deep learning compiler stacks. The role involves designing scalable CI systems and improving CI reliability while collaborating with compiler, infrastructure, and release teams.
Responsibilities:
• Build, maintain, and improve CI infrastructure that supports development, verification, and release of NVIDIA’s deep learning compiler stacks across GPU and accelerator environments
• Improve CI reliability and signal quality by reducing flakes, improving reproducibility, strengthening diagnostics, and making correctness and performance failures easier to understand and act on
• Apply automation, AI, and agent-based workflows to reduce manual CI operations, speed up failure triage, and improve developer efficiency
• Build reusable and self-service CI platforms that support multiple products, projects, model suites, hardware targets, and software configurations while partnering closely with compiler, infrastructure, and release teams
Qualifications:
Required:
• BS, MS, or PhD (or equivalent experience) in Computer Science, Computer/Electrical Engineering, Mathematics, or a related field
• 5+ years of experience designing, scaling, and operating CI/CD, build/release, or developer infrastructure for complex software systems
• Proven experience building CI platforms end-to-end using systems such as GitLab CI, GitHub Actions, Jenkins, or similar tools, including pipeline orchestration, compute/runner management, artifact and package systems, and observability, with strong emphasis on reliability, reproducibility, and debuggability
• Strong software engineering skills (Python required), with the ability to design, implement, and debug distributed systems end-to-end
• Proven track record of designing, building, and deploying AI/LLM-based systems in real engineering workflows, demonstrating skill in evaluating trade-offs, failure modes, maintainability, and measurable impact on developer productivity, signal quality, or operational efficiency
Preferred:
• Experience building and shipping sophisticated AI/agent-based systems that improve continuous integration or developer efficiency, such as intelligent test selection, automated triage and routing, regression localization, autonomous remediation, and developer-assist workflows
• Experience operating CI for DL/GPU software environments, including multi-GPU / multi-node workloads on Slurm, Kubernetes, or cloud platforms
• Familiarity with compiler IRs and infrastructure such as LLVM/MLIR, XLA/HLO, Triton IR, cuTile, or TileIR, especially in the context of testing, debugging, and validating compiler-driven workloads
Company:
NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI. Founded in 1993, the company is headquartered in Santa Clara, California, USA, and has more than 10,000 employees. The company is currently listed as Late Stage.