Reliability Program Manager Jobs in Elmhurst, IL

SRE Architect, AI-Powered Reliability

Chicago, IL · On-site

$58.75 - $78/hr

Incident Management * Establish the enterprise incident management framework: severity definitions ... Build and operate chaos engineering programs appropriate for WEX's financial systems, running ...

Site Reliability Engineer

Chicago, IL · On-site +1

$100K - $120K/yr

Strong knowledge of SRE best practices and incident management protocols * Deep experience using ... Education Assistance Program - to help colleagues pursue industry/role-specific certifications

Manage Customer Reliability Engineering activities driving Application Monitoring, Metrics ... Comprehensive, multi-carrier program for medical, dental and vision benefits * Retirement Benefits ...

Customer Reliability Engineer

Lisle, IL · On-site

$101K - $127K/yr

... Vendor Management Programs. Collabera recognizes true potential of human capital and provides ... The Customer Reliability Engineer will use cutting edge predictive analytics software and tools to ...

Mastery of Terraform module design and ArgoCD for managing immutable infrastructure at an ... Our compensation program also includes an annual target bonus opportunity for all employees, as ...

Site Reliability Engineer

Northbrook, IL · On-site

$72K - $158K/yr

Manage cloud services including compute, storage, networking, and identity * Optimize ... This position is eligible for a CVS Health bonus, commission or short-term incentive program in ...

Reliability Program Manager information

Elmhurst, IL salary chart: $61.8K · $117K · $167.8K

How much do reliability program manager jobs pay per year?

As of Jun 9, 2026, the average yearly pay for a reliability program manager in Elmhurst, IL is $117,015.00, according to ZipRecruiter salary data. Most workers in this role earn between $94,100.00 and $139,400.00 per year, depending on experience, location, and employer.

What is a Reliability Program Manager?

A Reliability Program Manager is a professional responsible for developing, implementing, and managing programs that ensure the dependability and optimal performance of systems, equipment, or processes within an organization. They analyze data, identify potential risks, and create strategies to minimize downtime and improve reliability. Their role often involves cross-functional collaboration, continuous improvement initiatives, and adherence to industry standards and best practices. Reliability Program Managers are commonly found in sectors like manufacturing, energy, aerospace, and technology.

How does a Reliability Program Manager typically collaborate with engineering and operations teams to drive improvements?

A Reliability Program Manager works closely with both engineering and operations teams to identify areas where equipment or process reliability can be enhanced. This involves facilitating cross-functional meetings, analyzing failure data, and aligning improvement initiatives with production goals. The manager often acts as a bridge, translating technical findings into actionable plans that operations can implement. Effective communication and the ability to prioritize projects based on impact and resource availability are key to successful collaboration.

What are the key skills and qualifications needed to thrive as a Reliability Program Manager, and why are they important?

A Reliability Program Manager typically needs a solid background in engineering, reliability analysis, and project management, often supported by a relevant degree and certifications like Certified Reliability Engineer (CRE). Familiarity with reliability modeling software, root cause analysis tools, and asset management systems is common in the role. Strong leadership, problem-solving, and communication skills are crucial for effectively coordinating teams and driving reliability initiatives. These skills are essential to ensure equipment uptime, reduce operational risks, and optimize organizational performance.

What is the difference between Reliability Program Manager vs Maintenance Engineer?

Aspect | Reliability Program Manager | Maintenance Engineer
Primary Focus | Developing and implementing reliability strategies to improve equipment uptime | Performing maintenance tasks to repair and maintain equipment
Certifications | Reliability certifications (e.g., RCM, RCPE) often preferred | Mechanical or electrical engineering degrees, maintenance certifications
Work Environment | Office-based planning, cross-department collaboration | Fieldwork, plant or facility maintenance
Industry Usage | Common in manufacturing, energy, and industrial sectors | Common in manufacturing, facilities management, and industrial plants

The Reliability Program Manager focuses on strategic reliability initiatives to prevent failures, while Maintenance Engineers handle hands-on repair and maintenance tasks. Both roles are essential for operational efficiency but differ in scope and responsibilities.

Infographic showing various Reliability Program Manager job openings in Elmhurst, IL as of May 2026, with employment types broken down into 1% Internship, 3% As Needed, 31% Full Time, 54% Part Time, 1% Temporary, and 10% Contract. Highlights a 95% Physical, 1% Hybrid, and 4% Remote job distribution, with an average salary of $117,015 per year, or $56.3 per hour.
SRE Architect, AI-Powered Reliability

WEX

Chicago, IL • On-site

$58.75 - $78/hr

Full-time

Medical, Dental, Vision, Life, Retirement, PTO

Posted 27 days ago


WEX Inc. rating: 8.1 out of 10

Based on 16 frontline employees who took The Breakroom Quiz

7th of 17 rated payment service providers


Job description

About the Team & Role

WEX operates across multiple lines of business (Mobility, Benefits, and Travel), serving enterprise customers globally with payment and technology solutions that demand uncompromising reliability. These are mission-critical systems handling high-volume financial transactions where availability, transactional integrity, and low latency are non-negotiable. Our SRE practice is in its early stages, and the decisions made now will define how we build, operate, and continuously improve reliable systems for years to come.


This person will define and enforce the reliability standards, operational practices, and architectural guardrails that every line of business at WEX must meet, and will use AI as a primary tool to establish, scale, and continuously improve those standards faster than traditional approaches alone can achieve.


This is not a role embedded in a single business unit. It sits at the center of WEX engineering with a mandate that spans all LOBs. You will set the bar, and you will hold it, working with engineering leadership, platform teams, and LOB architects to make reliability a consistent, measurable, and continuously improving property of every system we operate.

How you'll make an impact

Enterprise Standards & Governance

  • Define, publish, and enforce enterprise-wide SRE best practices and operational standards covering observability, incident management, resilience, capacity planning, and reliability architecture, applicable across all WEX lines of business.

  • Define and lead WEX's AI-Powered Reliability Engineering strategy, driving adoption of SRE agents across the software lifecycle, from design and development through deployment and operations, to improve reliability, automation, and operational efficiency.

  • Architect and oversee the implementation of mission-critical systems, ensuring that reliability, availability, and transactional integrity requirements are designed in from the start, not bolted on after the fact.

  • Establish and govern SLO, SLI, and error budget frameworks across LOBs, partnering with engineering leadership to align reliability targets with business and commercial expectations.

  • Own the production readiness review process, defining the criteria every service must meet before going live and driving accountability for remediation when gaps are found.

  • Serve as the primary technical advisor to engineering leadership across WEX on matters of reliability, resilience architecture, and operational excellence.
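
For readers less familiar with the SLO and error budget framework named in the list above, the underlying arithmetic can be sketched as follows. This is only an illustration: the 99.9% target, the 30-day window, and the function names are assumptions chosen for the example, not WEX standards.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    slo_target is a fraction such as 0.999 (an assumed example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # After 21.6 minutes of downtime, about half the budget is left.
    print(round(budget_remaining(0.999, 21.6), 2))
```

In a framework like this, a service that has overspent its budget (a negative remainder) would typically be expected to prioritize reliability work over new releases until the window recovers.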

Observability

  • Define the enterprise observability standard (what good looks like for metrics, distributed tracing, structured logging, and alerting) and hold all LOBs accountable to it.

  • Use AI-powered tooling to move beyond static dashboards: deploy intelligent anomaly detection, dynamic baselining, and automated signal correlation to reduce noise and surface actionable signals at scale.

  • Drive instrumentation practices that give engineering teams genuine insight into the health of high-availability, low-latency systems, including real-time payment flows and transaction pipelines where latency and consistency are critical.

  • Lead the evaluation and adoption of AI-assisted observability platforms that reason across telemetry sources to accelerate detection and diagnosis.

Incident Management

  • Establish the enterprise incident management framework: severity definitions, response playbooks, escalation paths, on-call standards, and cross-LOB communication protocols.

  • Integrate AI into the full incident lifecycle: intelligent triage and automated runbook suggestions at detection, real-time signal correlation during active incidents, and AI-assisted timeline and impact summaries at resolution.

  • Reduce cognitive burden on on-call engineers through tooling that surfaces relevant context, prior incidents, and likely remediation paths automatically during high-pressure situations.

  • Define, track, and report on incident metrics (MTTD, MTTR, recurrence rate) across all LOBs, using trends to drive systemic improvement rather than one-off fixes.
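
As a rough illustration of the incident metrics named in the list above, the sketch below computes MTTD and MTTR from incident timestamps. The `Incident` record and its field names are hypothetical, and the definitions used here (both measured from the start of impact) are one common convention among several; some teams measure MTTR from detection instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    started: datetime   # when customer impact began
    detected: datetime  # when an alert fired or a human noticed
    resolved: datetime  # when impact ended


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean time to detect: average of (detected - started)."""
    return timedelta(seconds=mean(
        (i.detected - i.started).total_seconds() for i in incidents))


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to resolve: average of (resolved - started)."""
    return timedelta(seconds=mean(
        (i.resolved - i.started).total_seconds() for i in incidents))


if __name__ == "__main__":
    sample = [
        Incident(datetime(2026, 1, 5, 10, 0), datetime(2026, 1, 5, 10, 5),
                 datetime(2026, 1, 5, 10, 35)),
        Incident(datetime(2026, 1, 9, 12, 0), datetime(2026, 1, 9, 12, 15),
                 datetime(2026, 1, 9, 13, 5)),
    ]
    print(mttd(sample))  # 0:10:00
    print(mttr(sample))  # 0:50:00
```

Tracking these values per line of business over time, rather than per incident, is what makes trends visible.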

Resilience Engineering & Self-Healing Systems

  • Lead cross-functional initiatives to enhance system resilience and performance across WEX, advocating for circuit breakers, bulkheads, graceful degradation, retry strategies, and fault isolation as enterprise standards.

  • Design self-healing and auto-recovery mechanisms that allow systems to detect, respond to, and recover from common failure modes without human intervention, reducing toil and improving mean time to recovery.

  • Build and operate chaos engineering programs appropriate for WEX's financial systems, running controlled failure experiments that expose resilience gaps safely and systematically before they manifest as production incidents.

  • Use AI to proactively identify resilience risks: analyze production telemetry, deployment signals, and dependency graphs to surface systems most likely to fail under stress before incidents occur.

Capacity Planning & Load Testing

  • Develop enterprise capacity planning strategies, establishing the models, tooling, and review cadences that ensure every LOB can anticipate and provision for demand growth without last-minute scrambles or over-provisioning.

  • Define and enforce load testing standards as a gate in the software delivery lifecycle, ensuring that services can handle peak transactional load, including burst demand on payment and fleet systems, before they reach production.

  • Apply AI-driven forecasting to capacity planning: model historical growth patterns, seasonal demand signals, and business pipeline data to produce reliable capacity outlooks across LOBs.

Cloud Cost Optimization

  • Drive cloud cost optimization and budgeting initiatives across WEX engineering, establishing the frameworks, tooling, and governance processes that ensure cloud spend is rationalized against reliability and performance outcomes.

  • Identify and remediate cost inefficiencies without compromising availability: right-sizing, reserved capacity strategy, workload scheduling, and architecture patterns that reduce waste in high-availability deployments.

  • Partner with LOB engineering and finance leadership to produce credible cloud cost forecasts, and hold teams accountable to efficiency targets.

Blameless Postmortem Culture

  • Design and champion the enterprise blameless postmortem process, creating templates, facilitation standards, and review cadences that make postmortems genuinely useful and consistently practiced across all LOBs.

  • Use AI to accelerate postmortem quality: generate draft timelines from incident telemetry, surface contributing factors from logs and traces, and identify systemic patterns across multiple incidents over time.

  • Build a postmortem knowledge base that is searchable and actionable, so lessons from past incidents actively inform future architectural decisions and operational practices.

  • Close the loop on postmortem action items, tracking completion rates across LOBs and escalating chronic non-compliance to engineering leadership.

Technical Advisory & Cross-LOB Enablement

  • Serve as a technical advisor to engineering leaders and architects across WEX, reviewing system designs for reliability risk, providing guidance on high-availability and low-latency architecture patterns, and advising on operational tradeoffs.

  • Lead cross-functional initiatives that span LOBs, bringing together engineering teams to solve shared reliability challenges, establish common tooling, and align on enterprise standards.

  • Create and deliver internal enablement programs (workshops, documentation, office hours, and design review forums) that build SRE capability across WEX engineering without requiring headcount growth in every team.

  • Communicate clearly and influentially to senior leadership: produce written strategy documents, present reliability trends and investment recommendations, and maintain executive visibility into the state of reliability across the enterprise.

Experience you'll bring

Required

  • 12+ years in SRE, platform engineering, or distributed systems, with a hands-on track record of operating mission-critical systems at scale.

  • Deep practical expertise across observability, incident management, resilience engineering, and capacity planning: not just familiarity, but proven delivery in production environments.

  • Experience with high-availability, low-latency systems where transactional integrity and consistency are critical requirements, such as payment processing, financial platforms, or equivalent.

  • Demonstrated experience using AI tools to solve real reliability problems: anomaly detection, incident triage, noise reduction, postmortem acceleration, capacity forecasting, or auto-remediation.

  • Proven ability to define and enforce technical standards across multiple engineering teams or business units without direct managerial authority.

  • Experience designing self-healing and auto-recovery mechanisms in production distributed systems.

  • Strong background in cloud cost optimization, architecture patterns, governance frameworks, and tooling for managing cloud spend at scale (AWS, GCP, or Azure).

  • Excellent written and verbal communication skills, able to produce authoritative strategy documents, lead cross-LOB forums, and advise VP and C-level engineering leaders.

Preferred

  • Experience in payments, fintech, fleet technology, or benefits administration, including familiarity with the reliability and compliance demands of financial transaction systems.

  • Experience building or maturing an SRE practice from an early stage across a multi-product or multi-LOB organization.

  • Familiarity with AI-native observability or AIOps platforms (Dynatrace, Honeycomb, Coralogix, or similar).

  • Background in chaos engineering (Gremlin, LitmusChaos, AWS Fault Injection Simulator) and controlled failure experimentation in regulated or financial environments.

  • Experience with systems requiring strict transactional consistency, such as distributed databases, event-driven architectures, or payment settlement pipelines.

  • Proficiency with Kubernetes, service mesh (Istio/Linkerd), and OpenTelemetry-based observability stacks.

  • BS/MS in Computer Science, Engineering, or equivalent practical experience.

The base pay range represents the anticipated low and high end of the pay range for this position. Actual pay rates will vary and will be based on various factors, such as your qualifications, skills, competencies, and proficiency for the role. Base pay is one component of WEX's total compensation package. Most sales positions are eligible for commission under the terms of an applicable plan. Non-sales roles are typically eligible for a quarterly or annual bonus based on their role and applicable plan. WEX's comprehensive and market competitive benefits are designed to support your personal and professional well-being. Benefits include health, dental and vision insurances, retirement savings plan, paid time off, health savings account, flexible spending accounts, life insurance, disability insurance, tuition reimbursement, and more. For more information, check out the "About Us" section.

Pay Range: $200,600.00 - $250,400.00
