Machine Learning Infrastructure Engineer Jobs (NOW HIRING)

Machine Learning - Infrastructure

San Francisco, CA · On-site

$126K - $166K/yr

Infrastructure Engineer Our mission is general causal intelligence, AI that is capable of predicting the future and identifying the optimal actions to change that future. To achieve this breakthrough ...

Senior Machine Learning Engineer

Pittsburgh, PA · On-site

$97K - $134K/yr

Design, develop, and support production MLOps pipelines and machine learning infrastructure * Build ... Mentor junior engineers and share technical expertise across the team * Lead technical initiatives ...

Machine Learning Infrastructure Engineer information

Salary scale: $46.5K, $127.1K, $182K

How much do machine learning infrastructure engineer jobs pay per year?

As of Jun 16, 2026, the average yearly pay for a machine learning infrastructure engineer in the United States is $127,066, according to ZipRecruiter salary data. Most workers in this role earn between $107,500 and $141,000 per year, depending on experience, location, and employer.

What are some common challenges faced by Machine Learning Infrastructure Engineers, and how can these be addressed on the job?

Machine Learning Infrastructure Engineers often face challenges such as ensuring infrastructure scalability, managing resource allocation, and maintaining system reliability while supporting rapid experimentation by data science teams. Balancing the need for flexibility in research environments against production-grade stability requires a deep understanding of both engineering best practices and the unique requirements of machine learning workflows. Collaborating with data scientists, communicating clearly about infrastructure capabilities, and staying current with fast-evolving technologies are key strategies for success. Most companies encourage ongoing learning and provide opportunities to contribute to architecture decisions, which makes this a rewarding environment for problem-solvers and innovators.

What are the key skills and qualifications needed to thrive in the Machine Learning Infrastructure Engineer position, and why are they important?

To thrive as a Machine Learning Infrastructure Engineer, you need a strong background in computer science, cloud computing, and distributed systems, plus experience with machine learning frameworks, often supported by a degree in a related field. Familiarity with tools such as Docker, Kubernetes, and Terraform, as well as cloud platforms like AWS, GCP, or Azure, is highly valued, as are certifications in cloud or DevOps technologies. Strong problem-solving abilities, effective communication, and collaboration skills help engineers work seamlessly with data scientists and cross-functional teams. These skills are essential to design, implement, and maintain robust, scalable infrastructure that enables efficient machine learning development and deployment.

What is a Machine Learning Infrastructure Engineer job?

A Machine Learning Infrastructure Engineer designs, builds, and maintains the systems that support the development and deployment of machine learning models. This includes managing data pipelines, optimizing model training and inference, and ensuring scalability and reliability in production environments. They work closely with data scientists, ML engineers, and DevOps teams to create efficient workflows and infrastructure. Key technologies often include cloud platforms, containerization, orchestration tools, and distributed computing frameworks.

More about Machine Learning Infrastructure Engineer jobs
Infographic showing various Machine Learning Infrastructure Engineer job openings in the United States as of June 2026, with employment types broken down into 80% Full Time, 11% Part Time, 3% Temporary, and 6% Contract. Highlights an 87% Physical, 5% Hybrid, and 8% Remote job distribution, with an average salary of $127,066 per year, or $61.1 per hour.
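
The hourly figure quoted above is consistent with dividing the annual average by a standard full-time year. A minimal sketch of that arithmetic, assuming 2,080 working hours per year (40 hours a week for 52 weeks); the divisor and the `hourly_rate` helper are illustrative assumptions, not something the source states:

```python
# Average yearly pay quoted in the listing data above.
ANNUAL_SALARY = 127_066

# Assumption (not stated in the source): a full-time year of 40 h/week * 52 weeks.
HOURS_PER_YEAR = 40 * 52  # 2,080 hours


def hourly_rate(annual_salary: float, hours_per_year: int = HOURS_PER_YEAR) -> float:
    """Convert an annual salary to an hourly rate under the stated assumption."""
    return annual_salary / hours_per_year


rate = hourly_rate(ANNUAL_SALARY)
print(f"${rate:.1f} per hour")  # prints "$61.1 per hour"
```

Under a different assumption about hours worked per year, the hourly equivalent would shift accordingly.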

Machine Learning - Infrastructure

Causal Labs

San Francisco, CA • On-site

$126K - $166K/yr

Other

Posted 10 days ago


Job description

Infrastructure Engineer

Our mission is general causal intelligence, AI that is capable of predicting the future and identifying the optimal actions to change that future.

To achieve this breakthrough, we are building a Large Physics foundation Model (LPM) because domains governed by physics have inherent cause-and-effect relationships, unlike visual or textual data.

Weather is the ideal training ground for an LPM. It is the most well-observed physical system, offering rapid, objective ground truth feedback from sensory observations and data at a scale that dwarfs what is used to train today's LLMs.

Causal Labs is a team of researchers and engineers from self-driving, drug discovery, and robotics - including Google DeepMind, Cruise, Waymo, Insitro, and Nabla Bio - who believe general causal intelligence will be the most important technical breakthrough for civilization.

We look for infrastructure engineers who are excited to tackle unsolved problems.

Our training and inference challenges demand deep expertise in setting up distributed training clusters and optimizing performance for large models. If you have experience building large-scale ML infrastructure in related fields such as language and vision models, robotics, or biology, join us on this mission.

Responsibilities
  • Design, deploy, and maintain large distributed ML training and inference clusters
  • Develop efficient, scalable end-to-end pipelines to manage petabyte-scale datasets and model training throughout the entire ML lifecycle
  • Research and test various training approaches including parallelization techniques and numerical precision trade-offs across different model scales
  • Analyze, profile and debug low-level GPU operations to optimize performance
  • Stay up-to-date on research to bring new ideas to work
What We're Looking For
  • A relentless approach to problem-solving, rapid execution, and the ability to quickly learn in unfamiliar domains
  • Strong grasp of state-of-the-art techniques for optimizing training and inference workloads
  • Demonstrated proficiency with distributed training frameworks (e.g., FSDP, DeepSpeed) to train large foundation models
  • Knowledge of cloud platforms (GCP, AWS, or Azure) and their ML/AI service offerings
  • Familiarity with containerization and orchestration frameworks (e.g., Kubernetes, Docker)
  • Background working on distributed task management systems and scalable model serving & deployment architectures
  • Understanding of monitoring, logging, observability, and version control best practices for ML systems

You don't have to meet every single requirement above.