Scale AI is focused on developing reliable AI systems for critical decisions, and they are seeking a Senior AI Infrastructure Engineer for their Machine Learning Infrastructure team. The role ...
Software Engineer, Frontier AI Infrastructure Scale AI is seeking a highly skilled and motivated Software Engineer, Frontier AI Infrastructure to join our dynamic Public Sector Engineering team. As a ...
Scale AI is dedicated to developing reliable AI systems for critical decisions, and they are seeking a Senior AI Infrastructure Engineer for their Model Serving Platform. The role involves designing ...
Scale AI is dedicated to developing reliable AI systems for critical decision-making. They are seeking a Senior AI Infrastructure Engineer to build a high-performance training platform for large ...
The role involves collaborating with various teams to design and implement solutions for deploying Rack Scale AI products in data centers and labs, while optimizing cloud services and ensuring robust ...
Scale AI is building the infrastructure that makes enterprise AI seamless. They are seeking a Staff Infrastructure Software Engineer to act as a primary technical lead, engineering deployment ...
Scale AI is a company focused on developing reliable AI systems for significant decisions. They are seeking a Senior AI Infrastructure Engineer to architect a high-performance training platform for ...
Scale AI is a leading data and evaluation partner for frontier AI companies, focused on bridging the gap between AI research and global policymakers. The Research Scientist in AI Controls and ...
Scale AI is the data foundation for AI, helping organizations build and deploy reliable production AI applications. The Forward Deployed AI Engineering Manager will act as a technical bridge between ...
Scale AI is a leading data and evaluation partner for frontier AI companies, focusing on policy research to bridge the gap between AI research and global policymakers. The Research Scientist will ...
Scale AI is the data foundation for AI, helping organizations build and deploy reliable production AI applications. As a Senior Forward Deployed AI Engineer on the Enterprise team, you'll be the ...
Frontier Agents Engineer
Manhattan, NY · On-site
Scale AI is the data foundation for AI, helping organizations build and deploy reliable production AI applications. As a Forward Deployed AI Engineer on the Enterprise team, you'll be the technical ...
Scale AI is the data foundation for AI, helping organizations build and deploy reliable production AI applications. As a Staff Forward Deployed AI Engineer on the Enterprise team, you'll serve as a ...
Scale AI is building the infrastructure that makes enterprise AI seamless. They are seeking a Senior or Staff Infrastructure Engineer to lead the engineering of deployment standards for knowledge ...
Scale AI is the data foundation for AI, helping organizations build and deploy reliable production AI applications. As a Senior Staff Forward Deployed AI Engineer on our Enterprise team, you'll be ...
Scale AI is the data foundation for AI, helping organizations build and deploy reliable production AI applications. As a Senior Forward Deployed AI Engineer on our Enterprise team, you'll be the ...
At Scale AI, our mission is to develop reliable AI systems for the world's most important decisions. For the past ten years, Scale has been the leading AI data foundry, supporting some of the most ...
Scale AI is the data foundation for AI, helping organizations build and deploy reliable production AI applications. As a Senior Staff Forward Deployed AI Engineer on the Enterprise team, you'll act ...
Scale AI is the data foundation for AI, helping organizations build and deploy reliable production AI applications. As a Staff Forward Deployed AI Engineer, you will act as a technical bridge between ...
Scale AI information
Full-time
Posted 15 days ago
Job description
Scale AI is focused on developing reliable AI systems for critical decisions, and they are seeking a Senior AI Infrastructure Engineer for their Machine Learning Infrastructure team. The role involves architecting a high-performance training platform for large-scale GPU clusters and collaborating closely with researchers to enhance the efficiency of AI model training.
Responsibilities:
• Architect and scale a multi-tenant orchestration layer that abstracts away the complexity of GPU clusters, ensuring high utilization and seamless job recovery.
• Design and implement scheduling primitives to optimize the lifecycle of training jobs.
• Build deep observability and automated health-checking into the training stack to proactively identify and isolate hardware failures.
• Evaluate and integrate emerging technologies in the CNCF and AI ecosystem (e.g. Ray, Kueue), making data-driven build vs. buy decisions that balance velocity with long-term maintainability.
• Work closely with Finance and Procurement teams to drive our capacity planning process.
• Participate in our team’s on-call rotation to ensure the availability of our services.
• Own projects end-to-end, from requirements, scoping, and design through implementation, in a highly collaborative and cross-functional environment.
Qualifications:
Required:
• 5+ years of experience in backend or infrastructure engineering, with at least 2 years focused on orchestrating ML workloads at scale (100+ GPU nodes).
• Strong programming skills in one or more languages (e.g. Python, Go, Rust, C++).
• Experience with complex compute management systems that cover queueing, quotas, preemption, and gang scheduling.
• Experience with distributed training infrastructure, such as EFA, InfiniBand, and topology-aware scheduling.
• Experience with distributed storage systems (e.g. Lustre, S3) as they relate to training throughput.
• Expert-level knowledge of Kubernetes internals (Custom Resources, Operators, Admission Controllers) and how they interact with device plugins for specialized hardware.
• Familiarity with cloud infrastructure (AWS, GCP) and infrastructure as code (e.g., Terraform).
• Proven ability to solve complex problems and work independently in fast-moving environments.
Preferred:
• Experience with distributed training frameworks and techniques such as DeepSpeed and FSDP.
• Experience with the NVIDIA software and hardware stack (CUDA, NCCL).
• Experience with PyTorch.
• Familiarity with post-training algorithms such as GRPO, and with Reinforcement Learning.
Company:
Scale’s mission is to develop reliable AI systems for the world’s most important decisions. Founded in 2016, the company is headquartered in San Francisco, USA, with a team of 501-1000 employees. The company is currently Late Stage.
About Scale AI
Sourced by ZipRecruiter
Industry: Software development
Company size: 201-500 employees
Headquarters location: San Francisco, CA, US
Year founded: 2016