Reliability Engineer Manager Jobs in California (NOW HIRING)

Reliability Engineer We are seeking a skilled and detail-oriented Reliability Engineer to join our ... Lead the development, implementation, and management of reliability standards for all suppliers ...

Reliability Engineer

Costa Mesa, CA · On-site

$110K - $138K/yr

Anduril's Reliability Engineering organization is seeking an experienced Reliability Engineer to ... Experience with risk management, change control/change management reviews, and software/firmware ...

Principal Reliability Engineer

Fremont, CA · Hybrid

$121K - $152K/yr

This creates an NPI environment with significant interdependence and the need for proactive risk management. The Principal Reliability Engineering role is pivotal in this process, focusing on design ...

Principal Reliability Engineer

Fremont, CA · On-site

$121K - $152K/yr

This creates an NPI environment with significant interdependence and the need for proactive risk management. The Principal Reliability Engineering role is pivotal in this process, focusing on design ...

Reliability Engineer Manager information

California salary details (low, average, high): $57.6K, $122.2K, $146.8K

How much do reliability engineer manager jobs pay per year?

As of Jun 23, 2026, the average yearly pay for a reliability engineer manager in California is $122,219, according to ZipRecruiter salary data. Most workers in this role earn between $106,800 and $133,800 per year, depending on experience, location, and employer.

How much do SRE managers make in the US?

Reliability Engineer Managers, often called SRE Managers, typically earn between $120,000 and $180,000 annually in the US, depending on experience, location, and company size. They oversee teams responsible for system reliability, incident response, and automation, often requiring skills in cloud platforms, monitoring tools, and leadership. Compensation may also include bonuses and stock options.

What does a Reliability Engineer Manager do?

A Reliability Engineer Manager oversees teams responsible for improving the reliability and performance of systems, machinery, or processes within an organization. They develop maintenance strategies, lead root cause analyses of failures, and implement best practices to minimize downtime and costs. Additionally, they collaborate with other departments to ensure that reliability goals align with business objectives and compliance standards. Their role is crucial in industries such as manufacturing, energy, and technology, where system uptime and safety are critical.

What engineering jobs pay $500,000?

Senior engineering roles such as Reliability Engineer Managers, Petroleum Engineers, and Software Engineering Directors can reach or exceed $500,000 annually, especially with experience, bonuses, and stock options. These positions often require advanced skills, leadership, and industry expertise, typically found in high-demand sectors like energy, technology, and aerospace.

What is the highest salary of SRE?

The highest salary for a Site Reliability Engineer (SRE) can exceed $200,000 annually in high-demand markets, especially for those with extensive experience, advanced skills in automation and cloud platforms, and leadership responsibilities. Senior SREs or SRE Managers often earn higher compensation, including bonuses and stock options, reflecting their expertise and strategic impact on system reliability.

What are some common challenges Reliability Engineer Managers face when balancing long-term reliability improvements with immediate operational demands?

Reliability Engineer Managers often need to prioritize urgent maintenance issues while also driving long-term reliability initiatives. Balancing these competing demands can be challenging, as immediate equipment failures may require quick fixes that temporarily interrupt ongoing improvement projects. Effective managers work closely with operations, maintenance, and engineering teams to communicate priorities, allocate resources, and implement sustainable solutions that address root causes rather than just symptoms. This role typically involves using data-driven decision-making and fostering a culture of proactive maintenance and continuous improvement.

What are the key skills and qualifications needed to thrive as a Reliability Engineer Manager, and why are they important?

To thrive as a Reliability Engineer Manager, you need a strong background in engineering principles, reliability analysis, and maintenance strategies, typically supported by a degree in engineering and experience in reliability roles. Familiarity with reliability-centered maintenance (RCM), failure mode and effects analysis (FMEA), and asset management software such as SAP or Maximo is common, along with certifications like Certified Reliability Engineer (CRE). Leadership, problem-solving, and effective communication are vital soft skills for managing teams and driving cross-functional initiatives. These competencies are crucial for minimizing downtime, optimizing equipment performance, and ensuring long-term operational efficiency.

What is the difference between Reliability Engineer Manager vs Reliability Engineer?

Aspect | Reliability Engineer | Reliability Engineer Manager
Required Credentials | Bachelor's in Engineering or related field; certifications like CRC, CRE | Same as Reliability Engineer, plus leadership experience
Work Environment | Design, analyze, and improve system reliability; often in teams | Oversees Reliability Engineers; manages projects and teams
Employer & Industry Usage | Manufacturing, aerospace, energy, automotive | Same industries, with added managerial responsibilities
Common Search & Comparison | Focuses on technical skills and hands-on reliability tasks | Focuses on leadership, team management, and strategic planning

The main difference between a Reliability Engineer and a Reliability Engineer Manager lies in their responsibilities. The Reliability Engineer focuses on technical analysis and system improvements, while the Reliability Engineer Manager oversees teams, manages projects, and develops strategies to enhance reliability across the organization.

What is a reliability engineering manager?

A reliability engineering manager oversees teams responsible for ensuring the dependability and performance of equipment, systems, or products. They develop maintenance strategies, analyze failure data, and implement improvements to enhance system uptime, often using tools like FMEA and reliability modeling. Strong leadership, technical expertise, and knowledge of industry standards are essential for this role.
Infographic showing various Reliability Engineer Manager job openings in California as of June 2026, with employment types broken down into 97% Full Time, and 3% Part Time. Highlights an 87% Physical, 5% Hybrid, and 8% Remote job distribution, with an average salary of $122,219 per year, or $58.8 per hour.
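
As a rough cross-check of the yearly and hourly figures quoted above, the conversion can be sketched in a few lines. The 2,080-hour work year (40 hours a week for 52 weeks) is an assumption made here for illustration; the page does not say how its hourly figure was derived, and the helper names are hypothetical.

```python
# Assumption (not stated on the page): a full-time year is
# 40 hours/week * 52 weeks = 2,080 paid hours.
HOURS_PER_YEAR = 40 * 52


def annual_to_hourly(annual_salary: float, hours_per_year: int = HOURS_PER_YEAR) -> float:
    """Convert a yearly salary to an hourly rate, rounded to one decimal place."""
    return round(annual_salary / hours_per_year, 1)


def hourly_to_annual(hourly_rate: float, hours_per_year: int = HOURS_PER_YEAR) -> float:
    """Convert an hourly rate to a full-time-equivalent yearly figure."""
    return round(hourly_rate * hours_per_year, 2)


if __name__ == "__main__":
    # Average California salary quoted on the page.
    print(annual_to_hourly(122_219))  # 58.8
    # Hourly range from the contract posting, as a full-time equivalent.
    print(hourly_to_annual(67.25), hourly_to_annual(89.50))
```

Under that assumption, $122,219 per year works out to about $58.8 per hour, which matches the infographic. Contract roles may bill fewer or more hours than 2,080, so the full-time-equivalent figures are only an approximation.
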
Site Reliability Engineer Manager- Hybrid

Calance US

Santa Clara, CA

$67.25 - $89.50/hr

Contractor

Medical, Dental, Vision, Life

Posted 13 days ago


Job description

We are hiring a Site Reliability Engineer Manager (Hybrid) for a contract-to-hire position in Santa Clara, CA.
The Role
You will build and lead the Site Reliability Engineering team, owning the infrastructure that development, validation, and customer-facing deployments run on. This spans colocation facilities, on-premises lab clusters, cloud environments (AWS, Azure, GCP), and the platform services customers use to collaborate on hardware and software deployments.
You are both a people manager and a practicing engineer. You will set technical direction, hire and grow the team, own SLOs for critical systems, and be the senior escalation point when things go wrong. You will work closely with hardware and software development teams to ensure HPC infrastructure meets their workload requirements and partner with the Senior DevOps Lead whose pipelines and automation run on the infrastructure you own.
What You Will Do
Team Leadership & Strategy
Develop and manage a team of 3-5 SRE engineers; establish a culture of operational excellence, ownership, and continuous improvement.
Define the SRE team's technical roadmap: reliability architecture, automation priorities, capacity planning, and on-call model.
Serve as the senior technical escalation for critical incidents: guiding cross-team triage, driving RCA, and ensuring systemic fixes rather than point patches.
Translate operational signals and infrastructure health into clear, actionable narratives for engineering leadership and executive stakeholders.
Partner with hardware and software development teams to understand HPC workload requirements and ensure infrastructure capacity, performance, and reliability meet the needs of silicon and software development programs.
24x7 Infrastructure Reliability & Observability
Own 24x7 reliability across colocation, on-premises lab clusters, cloud, and customer-facing platform services, designing for failure domains, progressive delivery, and strict change control at every tier.
Own the full observability stack (metrics, traces, logs) and define SLOs/SLIs across all SRE systems; use AI-driven detection, correlation, and guided remediation to reduce time to detect, respond, and resolve.
Evolve incident and problem management into a data-driven discipline: automated triage workflows, AI/analytics to identify recurring patterns, and every P0/P1 producing a written RCA with tracked systemic fixes.
Lead FinOps and capacity planning: model TCO across cloud vs. on-prem vs. colo, drive workload placement decisions, and anticipate infrastructure needs for new silicon programs and customer deployments.
Own infrastructure for customer collaboration environments where partners deploy and validate hardware and software.
Automation & Infrastructure as Code
Drive IaC-first discipline across the team: Terraform, Ansible, and production-quality automation for all infrastructure provisioning and lifecycle management.
Build and mature self-healing infrastructure platforms: host lifecycle automation, fleet auto-remediation, and AIOps-driven alerting that reduce manual intervention across the operational lifecycle.
Documentation & Global Collaboration
Build a documentation culture and scale a follow-the-sun on-call model as we expand globally: runbooks, architecture diagrams, and operational playbooks maintained as living artifacts.
Drive POC and POV evaluations for new infrastructure technologies, interconnect fabrics, and platform services relevant to our accelerator roadmap.
What You Will Bring
Required
Bachelor's or Master's in Computer Science, Electrical Engineering, or a related field; 12+ years in SRE, infrastructure engineering, or production engineering preferred (8 years minimum).
3+ years managing SRE or infrastructure teams: hiring, growing, and retaining engineers in a fast-moving environment.
Deep Linux systems expertise: networking (TCP/IP, RDMA, bonding), storage, kernel tuning, and bare-metal operations.
Proven experience operating colocation and on-premises hardware at scale: server lifecycle, power and cooling awareness, rack-level networking.
IaC fluency: Terraform and Ansible at production scale, including module design, remote state, environment isolation, and change governance.
Kubernetes cluster operations: lifecycle management, workload reliability, storage, and RBAC at scale.
Full observability stack ownership: Prometheus, Grafana, and/or Datadog, covering SLO definition, alert design, and E2E signal quality.
Strong Python and/or Go: production services, not just scripts; automation that touches real infrastructure safely.
Track record of reducing MTTR/MTTD through automation, workflow orchestration, and AIOps tooling.
Executive communication: translating infrastructure health and operational risk into clear narratives for senior leadership.
Demonstrated track record of moving teams from reactive, process-heavy operations to automated, technology-focused models, not just managing existing runbooks.
Strongly Preferred
Experience operating customer-facing infrastructure or platform services, with reliability expectations beyond internal tooling.
Knowledge of high-speed interconnect fabrics: InfiniBand, RoCE, or NVLink, including setup, troubleshooting, and performance tuning.
HPC job scheduler experience: Slurm, LSF, or equivalent, including setup, tuning, and integration with infrastructure automation.
Multi-cloud hybrid operations: AWS, Azure, GCP alongside on-prem/colo, with unified observability and IaC across all tiers.
FinOps: cloud spend attribution, TCO modeling across cloud vs. on-prem vs. colo, and translating cost data into workload placement recommendations for engineering and executive audiences.
ITIL knowledge or equivalent structured incident/problem/change management framework experience.
Published technical writing, conference talks, or open-source contributions in reliability, observability, or HPC infrastructure.
Estimated Pay Range: $90-$120/hr