Senior ML Infrastructure Engineer (PyTorch, Kubernetes, GPU Training)
Short Job Description
We are seeking a Senior ML Infrastructure Engineer to design and scale the infrastructure powering large-scale machine learning training workloads. In this role, you'll build high-performance GPU training platforms, optimize distributed training pipelines, and improve the developer experience for ML researchers.
Responsibilities:
- Design and scale distributed ML training infrastructure for large GPU clusters.
- Build and optimize training pipelines using PyTorch, DeepSpeed, and other distributed training frameworks.
- Develop and maintain job scheduling systems using Kubernetes and/or SLURM.
- Create high-throughput data pipelines for large-scale multimodal datasets.
- Optimize GPU utilization, memory efficiency, and overall system performance.
- Build low-latency inference pipelines for production ML deployments.
Required Skills:
- 7+ years of experience in ML Infrastructure, HPC, or Distributed Systems.
- Strong experience with PyTorch and distributed training frameworks or techniques such as DeepSpeed, FSDP, or ZeRO.
- Hands-on experience with Kubernetes, cloud platforms (AWS or Google Cloud Platform), and containerized environments.
- Strong understanding of distributed systems, GPU optimization, NCCL, memory management, and performance tuning.
- Experience building scalable ML infrastructure from development through production.
Nice to Have:
- Experience with multimodal AI, robotics data pipelines, Triton, TensorRT, custom ML kernels, or ML compiler/runtime optimization.
Location: Redwood City, CA (On-site)
Employment Type: Full-Time