Remote DevOps SRE Engineer Jobs in San Ramon, CA

Andromeda Cluster, Inc

Staff SRE, AI Infrastructure

San Francisco, CA · On-site +1

$67.25 - $89.25/hr

Staff SRE, AI Infrastructure Location: North America Remote / San Francisco • Full-Time About ... Production Operations of GPU Fleets: Own the day-to-day health of thousands of GPUs across ...

Truv

Senior DevOps Engineer

San Francisco, CA · Remote

$140K - $170K/yr

... DevOps, SRE, or Infrastructure Engineering roles * Deep expertise with AWS, including hands-on ... Fully remote * Competitive salary and equity package * Health, dental, and vision insurance ...

Andromeda Cluster, Inc

Senior Site Reliability Engineer - AI Infrastructure

San Francisco, CA · On-site +1

$67.25 - $89.25/hr

Global Remote / San Francisco • Full-Time About Andromeda Andromeda Cluster was founded by Nat ... Diagnose and resolve fabric-level issues that degrade collective operations. * Observability: Build ...

Five9

Senior DevOps Engineer

San Ramon, CA · On-site +1

$252K/yr

Proven application of SRE practices in high-stakes, always-on environments. * Strong experience ... This role is fully remote for candidates who reside outside the 30 mile radius of one of our ...

Jones Lang LaSalle IP, Inc.

DevFinOps Engineer

San Francisco, CA · On-site +1

$62.25 - $85/hr

... experience in DevOps, Site Reliability Engineering, or a related IT operations discipline ... Remote - Chicago, IL, San Francisco, CA If this resonates with you, we encourage you to apply, even ...

Five9

Senior DevOps Engineer

San Ramon, CA · On-site +1

$145K - $186K/yr

... SRE, and product engineering teams. We are looking for a hands-on senior engineer who thrives in ... This role is fully remote for candidates who reside outside the 30 mile radius of one of our ...

Invisible Technologies

Principal Software Engineer (SRE/DevOps) - Remote

San Francisco, CA · On-site +1

$159K - $213K/yr

Qualifications - We know that if we have a DevOps team we aren't practicing DevOps; both are listed to make it clear that we're looking for a multi-position player who's comfortable with application ...

Menlo

DevOps Engineer

San Francisco, CA · Remote

$54 - $74/hr

The Role As a DevOps Engineer, you will own and evolve the platform that everything at Menlo runs ... Bonus points if you have experimented with agent-assisted SRE workflows or LLM-driven incident ...

Invisible Technologies

Principal Software Engineer (SRE/DevOps) - Remote

San Francisco, CA · Remote

$159K - $213K/yr

Arch Systems

Staff Platform Engineer (U.S.-based required) (Remote)

Palo Alto, CA · Remote

$170K - $218K/yr

... DevOps, SRE, Infrastructure Engineering, Security Engineering, or related roles. * Demonstrated ... Remote-First Flexibility: Embrace a remote-first work culture that provides flexibility and balance ...

Platform9 Systems

Senior Platform Engineer

San Jose, CA · On-site +1

$127K - $172K/yr

US (Remote) About Platform9 Platform9 is the leader in enterprise Private Cloud. Founded by VMware ... Qualifications * 5+ years in a DevOps or SRE role with deep experience in cloud infrastructure and ...

Bitdeer Technologies Group

Sr. SRE Platform Architect (Remote)

San Jose, CA · Remote

Security owns policy and risk acceptance; you own the operational mechanisms they ride. * Pre ... Qualifications * 10+ years of production SRE / platform-engineering / infra-architecture, including ...

Pano

Platform Engineer

San Francisco, CA · On-site +1

$146K - $220K/yr

About Pano: We are a 175+ person growth-stage hybrid-remote start-up, headquartered in San ... Reliability Engineer (SRE) or a DevOps Engineer * 3+ years of hands-on experience with cloud ...

Arlo Technologies, Inc.

Staff Engineer - Engineering Platform

Milpitas, CA · On-site +1

$155K - $225K/yr

Partner with Product, DevOps, SRE, and Security to align engineering initiatives with business goals. What You Bring * 10+ years of backend engineering experience, primarily in Java. * Strong ...

Careerswift

DevOps Engineer

San Francisco, CA · Remote

$62.25 - $85/hr

You will work to improve deployment reliability, scalability and security across our platform ... Remote Seniority Level: Senior

Careerswift

DevOps Engineer

San Francisco, CA · Remote

$54 - $74/hr

You will work to improve deployment reliability, scalability and security across our platform ... Remote Seniority Level: Senior

CYNET SYSTEMS

Development Tooling Engineer - Remote / Telecommute

San Jose, CA · Remote

$55 - $60/hr

Strong documentation and operational process discipline. * Ability to work across security, IT, SRE, developer teams, and vendors. Responsibilities: * Operating GitLab, Jenkins, TeamCity, Gerrit, and ...

Biointellisense

DevOps Engineer

Redwood City, CA · Remote

$140K - $170K/yr

We're a remote-first, lean start-up environment and our global BioTeam colleagues are growth ... Drive reliability and infrastructure optimization for performance, cost, and business continuity ...

Cisco

Customer Reliability Engineer, Hypershield (remote)

San Jose, CA · On-site +1

$120K - $151K/yr

The team owns the hardest break/fix and reliability cases escalated by Cisco TAC, applying Site ... Operational Kubernetes and Helm proficiency, including diagnosing failures beyond the workload ...

Remote DevOps SRE Engineer information

[Chart: hourly pay scale for San Ramon, CA, marked at $12, $71, and $102]

How much do remote DevOps SRE engineer jobs pay per hour?

As of Jul 15, 2026, the average hourly pay for a remote DevOps SRE engineer in San Ramon, CA is $71.23, according to ZipRecruiter salary data. Most workers in this role earn between $61.25 and $81.39 per hour, depending on experience, location, and employer.

What are the key skills and qualifications needed to thrive in the Remote DevOps SRE Engineer position, and why are they important?

To thrive as a Remote DevOps SRE Engineer, you need deep expertise in systems administration, cloud platforms (such as AWS, Azure, or Google Cloud), automation, and infrastructure-as-code, typically backed by experience in scripting languages and a relevant technical degree. Proficiency with tools like Docker, Kubernetes, Terraform, Jenkins, and monitoring solutions, as well as certifications such as AWS Certified DevOps Engineer - Professional or Google Cloud Professional Cloud DevOps Engineer, is highly valued. Strong problem-solving skills, effective communication, and the ability to collaborate virtually are key soft skills in a remote setting. These capabilities ensure reliability, scalability, and seamless operations of critical systems in distributed environments.

What are the primary responsibilities and challenges faced by a Remote DevOps SRE Engineer day-to-day?

Remote DevOps SRE Engineers are responsible for maintaining the availability, scalability, and performance of production systems, often using automation to streamline deployments and incident response. A typical day may involve refining CI/CD pipelines, responding to system alerts, conducting root cause analysis of outages, and collaborating with software engineers to improve system reliability. One of the main challenges is proactively identifying and mitigating potential infrastructure issues before they impact customers, all while communicating effectively across remote teams. The role often requires balancing the implementation of new technologies with maintaining operational stability, making adaptability and strong time management essential for success.

What is a Remote DevOps SRE Engineer job?

A Remote DevOps SRE (Site Reliability Engineer) job involves managing and automating infrastructure, ensuring system reliability, and optimizing deployment processes from a remote location. These professionals work with cloud platforms, CI/CD pipelines, monitoring tools, and configuration management to enhance system performance and availability. They collaborate with development and operations teams to prevent service disruptions and improve scalability. The role requires expertise in coding, automation, troubleshooting, and cloud technologies like AWS, Azure, or GCP.

What cities near San Ramon, CA are hiring for Remote DevOps SRE Engineer jobs?

Cities near San Ramon, CA with the most Remote DevOps SRE Engineer job openings:

Los Angeles

Remote DevOps SRE Engineer jobs near you

Infographic showing various Remote DevOps SRE Engineer job openings in San Ramon, CA as of July 2026, with employment types broken down into 1% Locum Tenens, 91% Full Time, 1% Part Time, and 7% Contract. Highlights a 77% Physical, 6% Hybrid, and 17% Remote job distribution, with an average salary of $148,164 per year, or $71.2 per hour.

Staff SRE, AI Infrastructure

Andromeda Cluster, Inc

San Francisco, CA • On-site, Remote

$67.25 - $89.25/hr

Full-time

Re-posted 25 days ago

Job description

Staff SRE, AI Infrastructure
Location: North America Remote / San Francisco • Full-Time
About Andromeda
Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved only for hyperscalers.
Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it's needed most. Our aim is to become a liquidity layer for global AI compute - routing workloads across providers, GPU generations, and geographies the way financial markets route capital.
We're a small, senior team where one engineer's judgment shapes every customer's experience. You'll join early enough to define how we run infrastructure at scale, work directly with the world's most demanding AI customers, and build a career operating at the frontier of what compute can do.
The Role
We're hiring a Staff SRE to own the reliability of Andromeda's infrastructure end to end - from a node being racked and joined to a cluster, through the schedulers and control planes that place jobs on it, up to the customer-facing surface where a training run either succeeds or doesn't.
We're looking for someone with multiple years of hands-on experience operating GPU infrastructure at scale. You read NVIDIA release notes the day they drop. You have war stories about NCCL, fabric topology choices, and what it takes to keep a multi-thousand-GPU run healthy. You move comfortably from a kernel-level perf trace to a customer incident bridge in the same hour, and you write the postmortem yourself.
What You'll Own

Highest-Priority Incident Leadership: Carry the pager. When a top-customer training run degrades or a multi-cluster incident hits, you're the engineer who walks the stack from PyTorch → NCCL → driver → fabric → hardware until the answer is found. You lead the response, write the postmortem, and ship the systemic fix.
Production Operations of GPU Fleets: Own the day-to-day health of thousands of GPUs across providers and generations. Node lifecycle, burn-in, validation, draining, repair workflows, firmware rollouts, driver upgrades - the unglamorous work that decides whether the platform actually holds up.
Observability & Health Systems: Build and own the telemetry, GPU health checks, fabric monitoring, and automated remediation that let us catch a degraded NVLink or a flaky transceiver before a customer does. Tooling lives on your laptop; you ship it.
On-Call Practice: Define how on-call works at Andromeda - rotations, escalation, runbooks, incident command, blameless review. As the team grows, you set the bar.
Customer-Facing Technical Presence: Be the senior reliability voice in the room with sophisticated AI infra customers and providers. Run incident reviews with a customer's principal engineer. Scope demanding workloads. Sit in on architecture deep-dives and deal cycles where reliability credibility closes the room.
Partnership with Engineering: Work shoulder-to-shoulder with the product team. You design with SLOs, error budgets, and failure modes in mind; they ship features; together you close the loop on every systemic issue. Translate customer pain into actionable priorities for product teams.
Hardware & Buildout Influence: Partner with providers and DC teams on physical design - rack and pod layout, power and cooling envelopes, network topology, burn-in and validation - to keep failure modes out of production before they arrive.
Mentorship as a Daily Practice: Spend real time every day making other engineers better. Incident reviews, pairing on diagnosis, written guidance, hiring.

What We're Looking For

Years in This Space, Not Months: Multiple years building and operating large-scale GPU infrastructure as your primary job. You came up through this industry.
Staff-Level SRE Track Record: A clear history of owning the reliability of load-bearing infrastructure. You've been the senior engineer a team relies on when production is on fire and the failure mode is in a layer no one's touched yet.
GPU Systems Obsession: Deep, hands-on with NVIDIA H100/H200/B200/GB200 (or equivalent) at scale. You understand memory hierarchies, ECC and SBE/DBE behavior, thermal envelopes, NVLink and NVSwitch topology, and hardware failure modes from direct production experience. You also have opinions about what's coming next and why.
High-Performance Networking, in Production: Real production experience with InfiniBand, RoCE, and NVLink fabrics for distributed training. You can diagnose a slow all-reduce, find a degraded link in a fat-tree, reason about congestion control, and design topology for the workloads it'll actually carry.
Distributed Training Internals: Working knowledge of how large training jobs actually run - NCCL, CUDA, PyTorch distributed, FSDP, DeepSpeed, Megatron, and modern checkpointing/recovery patterns. When a 1,000+ GPU job stalls, you know where to look first.
Production-Grade Engineering: Strong Go, Python, or Rust. You build production tooling, controllers, and automation - not throwaway scripts. Comfortable in Kubernetes-with-GPUs (device plugins, topology-aware scheduling, multi-cluster) and/or Slurm/HPC schedulers. Terraform/Helm/Ansible is table stakes.
Linux & Systems Internals: Expert-level: kernel tuning, NVIDIA driver and CUDA toolkit lifecycle, cgroups/namespaces, perf and BPF, firmware management.
On-Call Composure: Comfortable being the senior engineer on a P0 bridge with the customer on the line and the provider listening. You triage calmly, decide fast, and document afterward.
Customer Presence: Comfortable being the senior technical voice in a room with sophisticated AI infra customers, providers, and prospects. You can run an incident review with a customer's principal engineer, then walk into a deal review and frame the same content for a CTO buying compute.

Strong Candidates May Have

Built or significantly contributed to a custom GPU health system, fleet manager, fabric controller, or on-call/incident tooling in production.
Distributed storage depth (VAST, Weka, Lustre, GPFS) and a clear opinion on checkpoint I/O patterns at scale.
Profiling and diagnosis of distributed training - MFU work, straggler mitigation, collective tuning across multi-thousand-GPU runs.
Experience as the senior SRE partner in enterprise relationships for AI infrastructure or HPC.
Open-source contributions in the GPU/AI infra stack (NCCL, Kubernetes scheduler plugins, GPU operators, DCGM tooling, etc.).
Public talks, writing, or community presence in the GPU/AI infra industry.

Why You'll Love It Here
This is the role where one engineer's reliability decisions show up in every customer's training run. You'll have significant autonomy and the leverage of working on infrastructure that the most ambitious AI labs in the world depend on - staying as hands-on as you want in the code, in the room with customers, and on the bridge when it matters.
Andromeda Cluster is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
