Scale AI Jobs (Flexible Options) Near Me

Scale AI is the data foundation for AI, helping organizations build and deploy reliable production AI applications. As a Forward Deployed AI Engineer on the Enterprise team, you'll be the technical ...

Scale AI is the data foundation for AI, helping organizations build and deploy reliable production AI applications. As a Senior Forward Deployed AI Engineer on our Enterprise team, you'll be the ...

Scale AI is the data foundation for AI, helping organizations build and deploy reliable production AI applications. As a Staff Forward Deployed AI Engineer, you will act as a technical bridge between ...

Senior AI Infrastructure Engineer - Training Platform

Scale AI

San Francisco, CA • On-site

Full-time

Posted 15 days ago


Job description

Job Summary:
Scale AI develops reliable AI systems for critical decisions and is seeking a Senior AI Infrastructure Engineer for its Machine Learning Infrastructure team. The role involves architecting a high-performance training platform for large-scale GPU clusters and collaborating closely with researchers to make AI model training more efficient.
Responsibilities:
• Architect and scale a multi-tenant orchestration layer that abstracts away the complexity of GPU clusters, ensuring high utilization and seamless job recovery.
• Design and implement scheduling primitives to optimize the lifecycle of training jobs.
• Build deep observability and automated health checks into the training stack to proactively identify and isolate hardware failures.
• Evaluate and integrate emerging technologies in the CNCF and AI ecosystem (e.g. Ray, Kueue), making data-driven build vs. buy decisions that balance velocity with long-term maintainability.
• Work closely with Finance and Procurement teams to drive our capacity planning process.
• Participate in our team’s on-call rotation to ensure the availability of our services.
• Own projects end-to-end, from requirements, scoping, and design through implementation, in a highly collaborative and cross-functional environment.
Qualifications:
Required:
• 5+ years of experience in backend or infrastructure engineering, with at least 2 years focused on orchestrating ML workloads at scale (100+ GPU nodes).
• Strong programming skills in one or more languages (e.g. Python, Go, Rust, C++).
• Experience with complex compute management systems that cover queueing, quotas, preemption, and gang scheduling.
• Experience with distributed training infrastructure, such as EFA, InfiniBand, and topology-aware scheduling.
• Experience with distributed storage systems (e.g. Lustre, S3) as they relate to training throughput.
• Expert-level knowledge of Kubernetes internals (Custom Resources, Operators, Admission Controllers) and how they interact with device plugins for specialized hardware.
• Familiarity with cloud infrastructure (AWS, GCP) and infrastructure as code (e.g., Terraform).
• Proven ability to solve complex problems and work independently in fast-moving environments.
Preferred:
• Experience with distributed training tools and techniques such as DeepSpeed and FSDP.
• Experience with the NVIDIA software and hardware stack (CUDA, NCCL).
• Experience with PyTorch.
• Familiarity with post-training algorithms such as GRPO, and with Reinforcement Learning.
Company:
Scale’s mission is to develop reliable AI systems for the world’s most important decisions. Founded in 2016, the company is headquartered in San Francisco, USA, and has 501-1000 employees. It is currently a late-stage company.