
Distributed Systems Engineer Jobs in Milpitas, CA


Distributed Systems Engineer information

Milpitas, CA salary details: $62.3K, $148.3K, $194.6K

How much do distributed systems engineer jobs pay per year?

As of Jun 17, 2026, the average yearly pay for a distributed systems engineer in Milpitas, CA is $148,253, according to ZipRecruiter salary data. Most workers in this role earn between $114,200 and $183,000 per year, depending on experience, location, and employer.

What are the typical daily responsibilities of a Distributed Systems Engineer?

A Distributed Systems Engineer typically spends their days designing, implementing, and testing scalable systems that handle large volumes of data and user requests. You'll collaborate closely with software developers, DevOps engineers, and product managers to architect solutions that ensure reliability, performance, and fault-tolerance. Regular tasks may include reviewing system performance metrics, debugging distributed applications, writing detailed documentation, and participating in code reviews. Engaging in team meetings and cross-functional discussions is also common, as seamless cooperation is vital in this complex and fast-evolving field.

What are the key skills and qualifications needed to thrive in the Distributed Systems Engineer position, and why are they important?

To thrive as a Distributed Systems Engineer, you need a strong background in computer science, experience with large-scale system design, and proficiency in languages such as Java, Go, or Python. Familiarity with cloud platforms (like AWS, GCP, or Azure), container orchestration tools (such as Kubernetes), and distributed databases is commonly required, and certifications in cloud computing can be advantageous. Strong problem-solving abilities, collaboration, and excellent communication skills help you navigate complex issues and work effectively across technical teams. These skills are fundamental for designing, implementing, and maintaining robust distributed systems that perform reliably at scale.

What does a Distributed Systems Engineer do?

A Distributed Systems Engineer designs, builds, and maintains large-scale systems that run across multiple machines or data centers. They ensure reliability, scalability, and fault tolerance by using technologies like cloud computing, containerization, and distributed databases. Their work often involves solving complex problems related to data consistency, network latency, and system coordination.

Infographic showing various Distributed Systems Engineer job openings in Milpitas, CA as of June 2026, with employment types broken down into 1% As Needed, 97% Full Time, and 2% Part Time. Highlights an 86% Physical, 5% Hybrid, and 9% Remote job distribution, with an average salary of $148,253 per year, or $71.3 per hour.

Senior Distributed Systems Engineer

Institute of Foundation Models

Sunnyvale, CA • Hybrid

$122K - $167K/yr

Full-time

Posted 16 days ago


Job description

About the Institute of Foundation Models
The Institute of Foundation Models (IFM) designs and operates ultra-scale GPU supercomputing systems to train next-generation foundation models. We believe performance, fault tolerance, and scalability are co-designed across model architecture, communication systems, runtime, and hardware topology.
This role sits at the core of that effort — driving communication performance, distributed reliability, and cross-layer optimization for large-scale training workloads.
 
The Mission
We are looking for a deeply technical engineer to co-design and optimize the communication stack for large-scale distributed training, including hybrid parallelism and Mixture-of-Experts (MoE) workloads.
This is not a network operations role. This is a systems-level engineering position focused on performance engineering, distributed debugging, and communication-runtime co-design.
·       Design and optimize expert-parallel and hybrid-parallel communication patterns
·       Drive high-performance hierarchical collectives for MoE workloads
·       Co-design runtime orchestration with communication topology awareness
·       Reduce tail latency and improve determinism across thousands of GPUs
·       Architect fault-tolerant distributed execution under real-world cluster failures
Core Technical Scope
·       Communication-compute overlap and topology-aware collective optimization
·       Deep debugging of NCCL, RDMA, and custom communication layers
·       Hybrid expert parallel strategies in modern large-scale MoE systems
·       Elastic and resilient distributed job orchestration concepts
·       Congestion analysis and routing optimization across InfiniBand/RoCE fabrics
·       Microbenchmarking and performance modeling for communication-heavy workloads
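As a rough illustration of the performance modeling mentioned in the scope above, the sketch below uses the textbook alpha-beta cost model for a ring all-reduce. It is not taken from the posting: the function name and the latency and bandwidth values are hypothetical, chosen only to show how the per-step latency term comes to dominate as the rank count grows.

```python
# Alpha-beta cost model for a ring all-reduce (a textbook approximation).
# Every numeric value here is an illustrative assumption, not a measurement.


def ring_allreduce_seconds(num_ranks: int, message_bytes: float,
                           alpha_s: float, bandwidth_bytes_per_s: float) -> float:
    """Estimate ring all-reduce time.

    The ring algorithm takes 2 * (p - 1) steps (reduce-scatter, then
    all-gather); each step sends one chunk of message_bytes / p and pays
    a fixed per-message latency alpha_s.
    """
    if num_ranks < 1:
        raise ValueError("num_ranks must be >= 1")
    if num_ranks == 1:
        return 0.0  # a single rank has nothing to communicate
    steps = 2 * (num_ranks - 1)
    chunk_bytes = message_bytes / num_ranks
    return steps * (alpha_s + chunk_bytes / bandwidth_bytes_per_s)


if __name__ == "__main__":
    # Assumed values: 1 GB message, 5 microseconds per-step latency, 50 GB/s links.
    for ranks in (8, 64, 512, 4096):
        seconds = ring_allreduce_seconds(ranks, 1e9, 5e-6, 50e9)
        print(f"{ranks:5d} ranks: {seconds * 1e3:.2f} ms")
```

Under these assumed numbers the bandwidth term stays close to 2n/B while the latency term grows linearly with rank count, which is one common motivation for hierarchical or topology-aware collectives at large scale.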
Expected Technical Depth
·       Hybrid expert parallel communication for Mixture-of-Experts training
·       Scaling behavior under network pressure
·       Distributed orchestration for elastic, large-scale training
·       Fault detection and recovery in distributed GPU workloads
·       Cross-layer bottlenecks: GPU ↔ NIC ↔ PCIe ↔ NVSwitch ↔ Fabric ↔ Scheduler
Required Background
·       Experience optimizing distributed training at 1,000+ GPU scale (or equivalent depth)
·       Hands-on expertise with RDMA, InfiniBand, RoCE, and GPUDirect RDMA
·       Deep familiarity with NCCL and/or UCX internals
·       Strong systems programming ability (C/C++, Rust, or Go)
·       Strong familiarity with modern model training frameworks such as PyTorch
·       Ability to troubleshoot and profile training performance issues related to communication bottlenecks
·       Ability to translate research ideas into production-grade optimizations
·       Experience debugging distributed hangs, desynchronization, and performance regressions
What We Mean by "Hardcore"
·       You can explain why communication degrades at scale and how to fix it
·       You have improved real cluster throughput via communication redesign
·       You can trace a distributed hang across ranks and identify the root cause
·       You are comfortable working at the boundary between hardware and runtime
Application Requirements
·       Include a link to your GitHub (required)
·       Provide links to relevant distributed systems, HPC, or large-scale training projects
·       Include a list of publications and/or public technical reports (if applicable)
·       Describe the hardest distributed debugging problem you solved
·       Include measurable performance improvements you have delivered
Academic Qualifications
Master’s degree, or Bachelor’s degree plus 1 year of relevant experience.
Visa Sponsorship
This position is eligible for visa sponsorship.
 
Benefits Include
*Comprehensive medical, dental, and vision benefits
*Bonus
*401K Plan
*Generous paid time off, sick leave and holidays
*Paid Parental Leave
*Employee Assistance Program
*Life insurance and disability