Job Summary:
NVIDIA is a leading technology company pioneering artificial intelligence and deep learning systems. It is seeking a Senior Software Engineer to optimize CUDA and deep learning systems, explore novel systems optimizations, and collaborate with AI researchers to improve hardware performance for AI workloads.
Responsibilities:
• Explore, research, and prototype novel systems optimizations for advanced deep learning models at the intersection of high-level DL frameworks and low-level CUDA through modeling, simulation, and silicon prototyping.
• Architect and optimize distributed computing systems that scale seamlessly from a single node to massive, cluster-scale supercomputing environments.
• Design, implement, and optimize custom high-performance CUDA kernels tailored to emerging neural network architectures and workloads.
• Analyze complex hardware-software interactions to identify and resolve performance bottlenecks in both training and inference pipelines.
• Collaborate closely with AI researchers, HW and SW architects, kernel and compiler authors, and CUDA driver experts to co-design systems and algorithms that improve accelerator compute utilization, memory bandwidth, cross-node network communication efficiency, and programmability.
• Develop exploratory tools and runtime systems to profile and accelerate new paradigms in deep learning.
• Write clean, effective, and maintainable code, ensuring exploratory prototypes can smoothly transition into open-source releases, upstream framework integrations, internal tools, or closed-source commercial products.
Qualifications:
Required:
• BS, MS, or PhD degree in Computer Science, Computer Engineering, Electrical Engineering, or related field (or equivalent experience).
• 8+ years of relevant industry experience, or equivalent academic experience, after completing the degree.
• Strong proficiency in C++ and Python programming.
• Solid background in the fundamentals of Deep Learning with a focus on transformers.
• Strong understanding of distributed computing principles, multi-node scaling, and the unique performance challenges of cluster-scale execution.
• Proven experience in systems programming, computer architecture, and low-level systems performance optimization.
• Familiarity with deep learning accelerator architectures such as GPUs, and hands-on experience with CUDA programming and kernel optimization.
• A strong analytical approach with experience using profiling tools to deeply understand software performance on hardware.
• Experience profiling and optimizing innovative vision models, generative AI architectures, or diffusion models.
• Background in deep learning compilers, both graph-level and codegen (e.g., Triton, XLA, torch.compile).
Preferred:
• Deep expertise in the performance internals and execution graphs of major deep learning autograd, training, and inference frameworks (e.g., PyTorch, JAX, TensorRT, vLLM, SGLang, NeMo, Megatron, MaxText).
• Hands-on experience with CUDA, communication libraries (e.g., NCCL, MPI, UCX) and distributed machine learning techniques (e.g., pipeline parallelism, tensor parallelism).
• Knowledge of numerical methods and low-precision arithmetic (e.g., NVFP4, MXFP4, FP8, INT8), and their implications for deep learning model accuracy and performance.
• Familiarity with systems requirements for Reinforcement Learning (RL) or highly parallel simulation environments, and/or a research background in machine learning systems or adjacent fields.
• Experience with machine learning, especially agentic systems, applied to systems problems.
Company:
NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI. Founded in 1993, the company is headquartered in Santa Clara, USA, and has more than 10,000 employees. It is currently a late-stage company.