ML Infrastructure Jobs (NOW HIRING)

ML Infrastructure Engineer

Palo Alto, CA · On-site

$126K - $165K/yr

The ML Infrastructure Engineer will be responsible for designing, developing, and maintaining large-scale distributed systems, collaborating with engineering teams, and optimizing the model delivery ...

AI/ML Infrastructure Engineer

San Francisco, CA · On-site

$126K - $166K/yr

The AI Infrastructure team at Zensors builds the engine that powers our visual sensing platform. We ... As a Machine Learning Engineer in ML Runtime & Optimization, you will develop technologies to ...

Senior ML Data Infrastructure Engineer

Sunnyvale, CA · Remote

$129K - $175K/yr

ML Data Infrastructure Engineer LOCATION: Sunnyvale CA or Remote Duration: 12+ Months Rate: DOE Key skills - GCP ML Infrastructure, BigQuery, Dataflow, Airflow (Cloud Composer), Vertex AI ...

ML Infrastructure Engineer

Sunnyvale, CA · Hybrid

$119K - $187K/yr

Hands-on experience in ML platforms * Experience with GPU/TPU optimizations * Experience with Ray framework * Experience with Kubernetes at Scale * Experience infrastructure applications or similar ...

ML Infrastructure information

Salary range: $46.5K to $182K, with an average of $127.1K

How much do ML infrastructure jobs pay per year?

As of Jun 6, 2026, the average yearly pay for ML infrastructure jobs in the United States is $127,066, according to ZipRecruiter salary data. Most workers in this role earn between $107,500 and $141,000 per year, depending on experience, location, and employer.

What are some common challenges faced by professionals working in ML Infrastructure roles?

Professionals in ML Infrastructure often encounter challenges related to scaling systems to handle large volumes of data, ensuring reliable deployment pipelines, and maintaining reproducibility across different environments. They must also collaborate closely with data scientists and engineers to streamline workflows and address issues like version control and model monitoring. Staying updated with rapidly evolving tools and best practices is essential, and balancing stability with innovation is a frequent aspect of the role.

What is the difference between ML Infrastructure and Data Engineer roles?

Aspect | ML Infrastructure | Data Engineer
Required Credentials | Bachelor's in CS, Data Science, or related; knowledge of cloud platforms | Bachelor's in CS, Software Engineering, or related; experience with databases and ETL tools
Work Environment | Focus on deploying and maintaining ML systems, cloud environments, and infrastructure tools | Designing, building, and managing data pipelines and storage solutions
Industry Usage | Used in AI/ML teams to support model deployment and scalability | Used across data-driven organizations for data management and analytics

ML Infrastructure specialists focus on deploying, scaling, and maintaining machine learning systems and infrastructure, while Data Engineers primarily build and manage data pipelines and storage solutions. Both roles require technical skills and often collaborate, but their core responsibilities differ in focus and tools used.

What are the key skills and qualifications needed to thrive as an ML Infrastructure Engineer, and why are they important?

To thrive as an ML Infrastructure Engineer, you need a strong background in software engineering, cloud computing, and machine learning concepts, often supported by a degree in computer science or a related field. Proficiency with containerization tools (like Docker and Kubernetes), cloud platforms (such as AWS, GCP, or Azure), and CI/CD systems is critical. Excellent problem-solving, collaboration, and communication skills help you efficiently work with data scientists and DevOps teams. These skills and qualities are vital for building scalable, reliable ML systems that support rapid experimentation and deployment in production environments.

What is ML Infrastructure?

ML Infrastructure refers to the underlying systems, tools, and processes that enable the development, deployment, and scaling of machine learning models. This includes data storage and management, computing resources, model training and serving environments, monitoring, and automation tools. ML Infrastructure ensures that data scientists and engineers can efficiently build, test, and maintain machine learning applications in a reliable and reproducible manner. It is a crucial foundation for organizations looking to operationalize AI and machine learning solutions at scale.

ML Infrastructure Engineer, Training

Dyna Robotics

Redwood City, CA • On-site

$131K - $172K/yr

Full-time

Posted 2 days ago


Job description

Job Summary:
Dyna Robotics is a pioneering company in AI-driven robotics, known for its innovative embodied AI foundation model. They are seeking an ML Training Infrastructure Engineer to architect and build systems that optimize their multi-cloud GPU fleet for training, ensuring high performance and reproducibility for researchers.
Responsibilities:
• Scale Distributed Training: Architect and own the infrastructure for large-scale GPU clusters. You’ll implement sharding, activation checkpointing, and memory optimization (ZeRO, FSDP) to enable the training of massive multimodal models.
• Optimize Researcher Ergonomics: Build a research codebase and job scheduling system (Kubernetes/SLURM) that prioritizes fast iteration, automated retries, and seamless failure recovery.
• High-Performance Data Handling: Design high-throughput pipelines to ingest and transform terabytes of multimodal robot data (video, proprioception, 3D signals), ensuring dataloaders never starve the GPUs.
• Production Inference: Build low-latency inference pipelines for real-time robot control. You’ll apply quantization, distillation, and model compilation (TensorRT, Triton) to move models from the lab to the physical world.
• Deep Systems Profiling: Dive into the weeds of GPU utilization, I/O bottlenecks, and memory fragmentation to squeeze every bit of performance out of our expanding compute fleet.
Qualifications:
Required:
• 7+ Years of Engineering: With a track record of leading technical projects in high-performance computing (HPC) or ML infrastructure.
• ML Systems Mastery: Deep experience with PyTorch and distributed training frameworks (DeepSpeed, Accelerate). You understand the nuances of mixed precision and gradient accumulation.
• Infrastructure Expertise: Hands-on experience managing cloud GPU environments (GCP/AWS) and container orchestration (Kubernetes).
• Low-Level Intuition: A fundamental understanding of distributed systems, including race conditions, memory management, and NCCL/inter-node communication.
• Ownership Mindset: You don't just 'deploy' code; you design, build, and operate systems end-to-end to unblock fast-moving research.
Preferred:
• Experience with Robotics Data Formats (MCAP, Protobuf) or multimodal models (VLAs).
• Deep ML systems experience: custom kernels (Triton), compilers, or runtime optimization.
• Experience as a founding or early-stage infrastructure hire.
Company:
Dyna Robotics develops advanced robotic manipulation models to automate repetitive and stationary tasks. Founded in 2024, the company is headquartered in Redwood City, USA, with a team of 11-50 employees. The company is currently Early Stage.