Job Summary:
Baseten powers mission-critical inference for leading AI companies, and they are seeking a Software Engineer to join their Training Infrastructure team. The role involves architecting and developing the training platform, ensuring high performance and reliability for model developers and research engineers.
Responsibilities:
• Design and architect scalable infrastructure systems for our ML training platform (e.g. scheduling, storage, and networking)
• Partner closely with developers and research engineers to translate complex training requirements into technical solutions
• Design and architect a global training scheduler
• Design and architect reinforcement learning systems and continuous learning pipelines
• Drive long-term improvements to system reliability and development velocity
• Partner closely with SRE and Capacity teams to unlock state-of-the-art training infrastructure
• Make critical architectural decisions balancing performance with system reliability
• Lead technical discussions and mentor junior engineers on infrastructure best practices
• Contribute to long-term technical strategy and infrastructure roadmap
Qualifications:
Required:
• Bachelor's degree or higher in Computer Science or a related field
• Proficiency in Go, with Python experience a plus
• Deep expertise with Kubernetes in production environments
• Extensive experience with major cloud providers (AWS, GCP); experience with neo-cloud providers (Crusoe, DigitalOcean, Nebius) a plus
• Advanced understanding of distributed systems concepts and performance tuning
• Proven experience designing observability systems
• Experience with ML/AI workloads and MLOps platforms highly valued
Preferred:
• Experience with distributed storage systems
• Experience with workload orchestration platforms like Temporal or Airflow
• Familiarity or experience with the open source training stack and frameworks (NCCL, PyTorch, Megatron, NemoRL, VeRL, Axolotl, HF Trainer) and distributed training techniques (FSDP, DeepSpeed)
• Experience developing AI products, tooling, or agents
Company:
Baseten provides the necessary infrastructure, tooling, and expertise to integrate AI into business operations. Founded in 2019, the company is headquartered in San Francisco, USA, with a team of 201-500 employees. The company is currently at the growth stage.