Job Summary:
Baseten powers mission-critical inference for leading AI companies and is seeking a Software Engineer to join its Training Infrastructure team. The role involves architecting and developing the training platform, ensuring high performance and reliability for model developers and research engineers.
Responsibilities:
• Design and architect scalable infrastructure systems for our ML training platform (e.g. scheduling, storage, and networking)
• Partner closely with developers and research engineers to translate complex training requirements into technical solutions
• Design and architect a global training scheduler
• Design and architect reinforcement learning systems and continuous learning pipelines
• Drive long-term improvements to system reliability and development velocity
• Partner closely with SRE and Capacity teams to unlock state-of-the-art training infrastructure
• Make critical architectural decisions balancing performance with system reliability
• Lead technical discussions and mentor junior engineers on infrastructure best practices
• Contribute to long-term technical strategy and infrastructure roadmap
Qualifications:
Required:
• Bachelor’s degree or higher in Computer Science or related field
• Proficiency in Go, with Python experience a plus
• Deep expertise with Kubernetes in production environments
• Extensive experience with major cloud providers (AWS, GCP); experience with neo-cloud providers (Crusoe, DigitalOcean, Nebius) a plus
• Advanced understanding of distributed systems concepts and performance tuning
• Proven experience designing observability systems
• Experience with ML/AI workloads and MLOps platforms highly valued
Preferred:
• Experience with distributed storage systems
• Experience with workload orchestration platforms like Temporal or Airflow
• Familiarity or experience with the open source training stack and frameworks (NCCL, PyTorch, Megatron, NemoRL, VeRL, Axolotl, HF Trainer) and distributed training techniques (FSDP, DeepSpeed)
• Experience developing AI products, tooling, or agents
Company:
Baseten is an AI infrastructure company that integrates machine learning into business operations, production, and processes. Founded in 2019, the company is headquartered in San Francisco, USA, with a team of 201-500 employees. The company is currently in the growth stage.