
Senior Distributed Systems Engineer Jobs (NOW HIRING)

About the Role We're looking for a distributed systems engineer to expand the reach and effectiveness of our small Shared Services team. The ideal candidate uses their skills, experience, and ...

Sr. Distributed Systems Engineer

San Francisco, CA · On-site

$123K - $168K/yr

Role As a distributed systems engineer, you'll work across the stack to solve problems as they come up and help build Archil volumes. You'll have significant influence over the technical and product ...


Senior Distributed Systems Engineer information

Salary range (from the salary chart): $56K low, $124.7K average, $176K high

How much do senior distributed systems engineer jobs pay per year?

As of Jun 21, 2026, the average yearly pay for a senior distributed systems engineer in the United States is $124,732, according to ZipRecruiter salary data. Most workers in this role earn between $104,500 and $143,000 per year, depending on experience, location, and employer.

What is the difference between Senior Distributed Systems Engineer vs Cloud Solutions Architect?

Aspect | Senior Distributed Systems Engineer | Cloud Solutions Architect
Credentials | Bachelor's/Master's in CS or related, experience with distributed systems | Bachelor's/Master's in CS, IT, or related, cloud certifications (AWS, Azure)
Work Environment | Designing, developing, and maintaining distributed systems in tech companies | Designing cloud infrastructure solutions for clients or internal teams
Industry Usage | Tech, finance, e-commerce, and enterprise sectors | IT consulting, cloud service providers, enterprise IT departments

The Senior Distributed Systems Engineer focuses on building and optimizing distributed computing systems, while the Cloud Solutions Architect designs cloud infrastructure solutions. Both roles require technical expertise and often overlap in cloud environments, but their primary responsibilities differ in scope and focus.

What engineer makes $500,000 a year?

Senior Distributed Systems Engineers can earn $500,000 or more annually, especially with extensive experience, specialized skills in cloud infrastructure, and leadership roles. High compensation often includes bonuses, stock options, and other incentives in large tech companies or startups with significant funding.

What are the key skills and qualifications needed to thrive as a Senior Distributed Systems Engineer, and why are they important?

A Senior Distributed Systems Engineer requires deep expertise in computer science fundamentals, scalable system architecture, and proficiency in programming languages such as Java, Go, or Python, often supported by a relevant degree and significant experience in distributed systems. Familiarity with tools like Kubernetes, Docker, Kafka, and cloud platforms (AWS, GCP, or Azure) is typically expected, along with knowledge of monitoring and CI/CD pipelines. Strong problem-solving, communication, and leadership skills help in tackling complex engineering challenges and collaborating across teams. These skills are crucial for designing robust, scalable, and reliable systems that support organizational growth and high availability.

What is the role of a DCS engineer?

A DCS (Distributed Control System) engineer designs, implements, and maintains control systems used in industrial processes, ensuring reliable and efficient operation. They work with automation tools, programming languages, and system integration to optimize plant performance and safety.

How much do distributed systems engineers make?

Distributed systems engineers typically earn between $100,000 and $160,000 annually, depending on experience, location, and company size. Senior roles with specialized skills in cloud platforms, programming, and system architecture can command higher salaries, often exceeding $180,000.

What are some common challenges Senior Distributed Systems Engineers face when designing scalable systems?

Senior Distributed Systems Engineers often encounter challenges such as managing data consistency, ensuring fault tolerance, and minimizing latency across multiple nodes. Balancing the trade-off between consistency and availability when network partitions occur (as outlined by the CAP theorem) is a frequent consideration. Coordinating between development and operations teams is also crucial for maintaining system reliability and resolving production issues efficiently. Strong communication skills and a deep understanding of distributed architectures help address these complexities effectively.

What are Senior Distributed Systems Engineers?

Senior Distributed Systems Engineers are experienced professionals who design, build, and maintain large-scale computing systems that run across multiple machines or locations. They focus on ensuring reliability, scalability, and performance of distributed applications, often dealing with challenges like data consistency, fault tolerance, and network latency. These engineers typically have deep expertise in distributed computing principles, programming languages, and cloud infrastructure. They also mentor junior team members and help architect robust solutions for complex technical problems.

What engineers make $300,000 a year?

Senior distributed systems engineers, software engineers in specialized fields like machine learning or cloud infrastructure, and senior roles in high-demand tech companies often earn $300,000 or more annually. These positions typically require advanced skills, extensive experience, and expertise in distributed architectures, scalable systems, and relevant tools such as Kubernetes or cloud platforms.
Infographic showing various Senior Distributed Systems Engineer job openings in the United States as of June 2026, with employment types broken down into 1% As Needed, 66% Full Time, 32% Part Time, and 1% Contract. Highlights an 87% Physical, 5% Hybrid, and 8% Remote job distribution, with an average salary of $124,732 per year, or $60 per hour.
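
The hourly figure above is consistent with dividing the average annual salary by a standard full-time year. A minimal sketch of that arithmetic follows; the 2,080-hour year (40 hours per week for 52 weeks) is an assumption, not something the listing states, and `hourly_rate` is a hypothetical helper name.

```python
# Assumption: a full-time year is 40 hours/week for 52 weeks (2,080 hours).
# The listing does not say how its hourly figure was derived.
HOURS_PER_YEAR = 40 * 52


def hourly_rate(annual_salary: float, hours_per_year: int = HOURS_PER_YEAR) -> float:
    """Convert an annual salary to an hourly rate, rounded to cents."""
    return round(annual_salary / hours_per_year, 2)


# Average annual salary quoted in the listing.
print(hourly_rate(124_732))  # 59.97, which rounds to the quoted $60 per hour
```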

Senior Distributed Systems Engineer

Institute of Foundation Models

Sunnyvale, CA • Hybrid

$122K - $167K/yr

Full-time

Posted 20 days ago


Job description

About the Institute of Foundation Models
The Institute of Foundation Models (IFM) designs and operates ultra-scale GPU supercomputing systems to train next-generation foundation models. We believe performance, fault tolerance, and scalability are co-designed across model architecture, communication systems, runtime, and hardware topology.
This role sits at the core of that effort — driving communication performance, distributed reliability, and cross-layer optimization for large-scale training workloads.
 
The Mission
We are looking for a deeply technical engineer to co-design and optimize the communication stack for large-scale distributed training, including hybrid parallelism and Mixture-of-Experts (MoE) workloads.
This is not a network operations role. This is a systems-level engineering position focused on performance engineering, distributed debugging, and communication-runtime co-design.
·       Design and optimize expert-parallel and hybrid-parallel communication patterns
·       Drive high-performance hierarchical collectives for MoE workloads
·       Co-design runtime orchestration with communication topology awareness
·       Reduce tail latency and improve determinism across thousands of GPUs
·       Architect fault-tolerant distributed execution under real-world cluster failures
Core Technical Scope
·       Communication-compute overlap and topology-aware collective optimization
·       Deep debugging of NCCL, RDMA, and custom communication layers
·       Hybrid expert parallel strategies in modern large-scale MoE systems
·       Elastic and resilient distributed job orchestration concepts
·       Congestion analysis and routing optimization across InfiniBand/RoCE fabrics
·       Microbenchmarking and performance modeling for communication-heavy workloads
Expected Technical Depth
·       Hybrid expert parallel communication for Mixture-of-Experts training
·       Scaling behavior under network pressure
·       Distributed orchestration for elastic, large-scale training
·       Fault detection and recovery in distributed GPU workloads
·       Cross-layer bottlenecks: GPU ↔ NIC ↔ PCIe ↔ NVSwitch ↔ Fabric ↔ Scheduler
Required Background
·       Experience optimizing distributed training at 1,000+ GPU scale (or equivalent depth)
·       Hands-on expertise with RDMA, InfiniBand, RoCE, and GPUDirect RDMA
·       Deep familiarity with NCCL and/or UCX internals
·       Strong systems programming ability (C/C++, Rust, or Go)
·       Strong familiarity with modern model training frameworks such as PyTorch
·       Ability to troubleshoot and profile training performance issues related to communication bottlenecks
·       Ability to translate research ideas into production-grade optimizations
·       Experience debugging distributed hangs, desynchronization, and performance regressions
What We Mean by "Hardcore"
·       You can explain why communication degrades at scale and how to fix it
·       You have improved real cluster throughput via communication redesign
·       You can trace a distributed hang across ranks and identify the root cause
·       You are comfortable working at the boundary between hardware and runtime
Application Requirements
·       Include a link to your GitHub (required)
·       Provide links to relevant distributed systems, HPC, or large-scale training projects
·       Include a list of publications and/or public technical reports (if applicable)
·       Describe the hardest distributed debugging problem you solved
·       Include measurable performance improvements you have delivered
Academic Qualifications
Master’s degree, or Bachelor’s degree plus 1 year of relevant experience.
Visa Sponsorship
This position is eligible for visa sponsorship.
 
Benefits Include
*Comprehensive medical, dental, and vision benefits
*Bonus
*401K Plan
*Generous paid time off, sick leave and holidays
*Paid Parental Leave
*Employee Assistance Program
*Life insurance and disability