Machine Learning Infrastructure Engineer Jobs (NOW HIRING)

Machine Learning Infrastructure Engineer information

Salary range: $46.5K to $182K, with an average of $127.1K

How much do machine learning infrastructure engineer jobs pay per year?

As of Jun 16, 2026, the average yearly pay for a machine learning infrastructure engineer in the United States is $127,066, according to ZipRecruiter salary data. Most workers in this role earn between $107,500 and $141,000 per year, depending on experience, location, and employer.

What are some common challenges faced by Machine Learning Infrastructure Engineers, and how can these be addressed on the job?

Machine Learning Infrastructure Engineers often face challenges such as scaling infrastructure, allocating resources, and keeping systems reliable while supporting rapid experimentation by data science teams. Balancing the flexibility that research environments need against production-grade stability requires a deep understanding of both engineering best practices and the particular requirements of machine learning workflows. Collaborating with data scientists, communicating clearly about infrastructure capabilities, and staying current with fast-evolving technologies are key strategies for success. Many companies encourage ongoing learning and provide opportunities to contribute to architecture decisions, which makes this a rewarding environment for problem-solvers and innovators.

What are the key skills and qualifications needed to thrive in the Machine Learning Infrastructure Engineer position, and why are they important?

To thrive as a Machine Learning Infrastructure Engineer, you need a strong background in computer science, cloud computing, and distributed systems, along with experience in machine learning frameworks, often supported by a degree in a related field. Familiarity with tools such as Docker, Kubernetes, and Terraform, and with cloud platforms such as AWS, GCP, or Azure, is highly valued, as are certifications in cloud or DevOps technologies. Strong problem-solving abilities, effective communication, and collaboration skills help engineers work smoothly with data scientists and cross-functional teams. These skills are essential for designing, implementing, and maintaining robust, scalable infrastructure that enables efficient machine learning development and deployment.

What is a Machine Learning Infrastructure Engineer job?

A Machine Learning Infrastructure Engineer designs, builds, and maintains the systems that support the development and deployment of machine learning models. This includes managing data pipelines, optimizing model training and inference, and ensuring scalability and reliability in production environments. They work closely with data scientists, ML engineers, and DevOps teams to create efficient workflows and infrastructure. Key technologies often include cloud platforms, containerization, orchestration tools, and distributed computing frameworks.

More about Machine Learning Infrastructure Engineer jobs
Infographic showing various Machine Learning Infrastructure Engineer job openings in the United States as of June 2026, with employment types broken down into 80% Full Time, 11% Part Time, 3% Temporary, and 6% Contract. Highlights an 87% Physical, 5% Hybrid, and 8% Remote job distribution, with an average salary of $127,066 per year, or $61.1 per hour.
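
The hourly figure quoted above appears consistent with dividing the annual average by a standard full-time year. The sketch below assumes 40 hours per week for 52 weeks (2,080 hours); that convention is an assumption rather than something the listing states, and the function name is purely illustrative.

```python
# Assumption: a full-time year is 40 hours/week * 52 weeks = 2,080 hours.
# The listing does not state which convention it uses.
HOURS_PER_YEAR = 40 * 52


def hourly_from_annual(annual_salary: float, hours_per_year: int = HOURS_PER_YEAR) -> float:
    """Convert an annual salary to an hourly rate, rounded to one decimal place."""
    return round(annual_salary / hours_per_year, 1)


if __name__ == "__main__":
    # Average annual salary taken from the listing.
    print(hourly_from_annual(127_066))  # prints 61.1
```

Under that assumption, $127,066 per year works out to roughly $61.1 per hour, which matches the figure in the infographic.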

Machine Learning Infrastructure Engineer

Mind Robotics

Palo Alto, CA • On-site

$126K - $166K/yr

Other

Posted 9 days ago


Job description

At Mind Robotics, we're building generalized physical AI—robotic systems capable of dexterous, adaptive, and reasoning-intensive work in real-world industrial environments. Our ability to iterate quickly on large-scale models depends on world-class ML infrastructure.

We're looking for a Machine Learning Infrastructure Engineer to build the core systems that enable fast, reliable, and scalable model training—powering everything from experimentation to production deployment.

Responsibilities
  • Design and implement scalable systems for training large ML models
  • Enable efficient workflows for data ingestion, training, and iteration
  • Develop and optimize distributed training systems across hundreds of GPUs
  • Implement strategies for parallelization, sharding, and efficient compute utilization
  • Improve training efficiency through techniques such as attention optimizations, kernel fusion, and memory management
  • Partner closely with modeling teams to accelerate iteration speed and reduce training costs
  • Build internal tools for experiment tracking, monitoring, and debugging
  • Implement systems for tracking training performance, failures, and resource utilization
  • Debug and resolve bottlenecks across the training stack
  • Provide lightweight infrastructure support for deploying and running models in production environments
  • Optimize inference performance and reliability where needed
  • Support core cloud infrastructure needs for training workloads (without heavy DevOps overhead)
  • Manage compute resources efficiently across training jobs
Qualifications
  • Strong experience building infrastructure for large-scale ML training
  • Deep understanding of how modern LLM/VLM systems are trained and scaled
  • Proven experience setting up and scaling distributed training across hundreds of GPUs
  • Strong understanding of parallelization strategies (data, model, pipeline parallelism)
  • Strong proficiency in Python programming
  • Expert-level proficiency in PyTorch and/or JAX
  • Strong understanding of techniques like attention optimization, kernel fusion, and efficient memory usage
Nice to Have
  • Experience supporting inference systems in production
  • Familiarity with robotics or embodied AI workloads
  • Experience building tools for experiment management and researcher productivity