Job Summary:
Tesla is looking for strong software engineers to help scale the next generation of large AI models for Autopilot, Optimus, and Digital Optimus. The role involves optimizing large-scale distributed training systems, improving training efficiency, and collaborating with ML practitioners to enhance model quality and performance.
Responsibilities:
• Optimize large-scale distributed training across thousands of GPUs
• Improve training throughput, utilization, reliability, and scalability
• Develop tooling to identify bottlenecks in compute, networking, memory, and data pipelines
• Design and implement performance optimizations across PyTorch, CUDA, communication libraries, and training frameworks
• Partner with researchers to evaluate how infrastructure changes impact model quality, convergence, and downstream metrics
• Analyze training runs and build dashboards that connect system performance to model outcomes
• Drive improvements in model scaling efficiency, including larger models, longer context lengths, and higher-quality datasets
• Debug complex issues across software, hardware, networking, and machine learning systems
• Build infrastructure that accelerates experimentation and shortens iteration cycles for researchers
Qualifications:
Required:
• Strong software engineering fundamentals in Python and C++
• Experience with distributed systems, high-performance computing, or large-scale infrastructure
• Understanding of machine learning fundamentals, including optimization, training dynamics, and evaluation
• Familiarity with PyTorch and modern deep learning frameworks
• Ability to analyze performance bottlenecks using profiling and observability tools
• Strong debugging and problem-solving skills
• Excellent communication and collaboration skills
Company:
Tesla is an electric vehicle and clean energy company that provides electric cars, solar, and renewable energy solutions. Founded in 2003, the company is headquartered in Austin, USA, with a team of 10001+ employees. The company is currently Late Stage.