Job Summary:
Tesla is seeking a Software Engineer for the Autopilot AI Infrastructure team to enhance and scale its AI research infrastructure. The role involves optimizing training processes and improving efficiency across the components of the AI training software stack.
Responsibilities:
• Reduce wall-clock time to convergence of our training jobs by identifying bottlenecks in the ML stack, from data loading up to the GPU
• Integrate efficient, low-level code with the overall high-level training framework
• Profile our workloads and implement solutions to increase training efficiency
• Optimize workloads for efficient hardware utilization (e.g. CPU and GPU compute, data throughput, networking)
Qualifications:
Required:
• Adaptability to the dynamic requirements of AI research and the ability to contribute across all parts of the AI training software stack, as expected of all members of the Autopilot AI Infrastructure team
• Practical experience programming in Python and/or C/C++
• Experience programming in CUDA, cuDNN or Triton, particularly in the context of operations used in AI workloads
• Experience profiling and optimizing CPU-GPU interactions (pipelining computation with data transfers, etc.)
• Experience working with training frameworks (ideally PyTorch)
• Proficiency in system-level software, in particular hardware-software interactions and resource utilization
• Experience with parallel programming concepts and primitives
• Understanding of modern machine learning concepts and state-of-the-art deep learning
• Experience scaling neural network training jobs across many GPUs
Company:
Tesla is an electric vehicle and clean energy company that provides electric cars, solar, and renewable energy solutions. Founded in 2003, the company is headquartered in Austin, Texas, USA, and has more than 10,000 employees. The company is currently late stage.