Production Stability Engineer Jobs (NOW HIRING)

We are seeking a Sr Manager, Product Stability - ARA to own the operational health, reliability ... Monitor, triage, and actively drive production incident resolution across L1, L2, L3, SRE, platform ...

Sr. Mainframe Engineer

Jersey City, NJ · On-site

$127K - $168K/yr

Senior Mainframe Software Engineer The Team The Brokerage Record Technology Tax Engineering team ... Champion production stability by proactively monitoring, diagnosing, and resolving system issues ...

OR · On-site

$134K - $180K/yr

Ensure production stability across multiple independent deployment ecosystems throughout the ... Hands-On Engineering * Write, review, and contribute production-quality code across the services ...

Production Stability Engineer information

$50.5K

$131.7K

$144K

How much do production stability engineer jobs pay per year?

As of Jun 4, 2026, the average yearly pay for a production stability engineer in the United States is $131,667.00, according to ZipRecruiter salary data. Most workers in this role earn around $143,000.00 per year, depending on experience, location, and employer.

What are the key skills and qualifications needed to thrive as a Production Stability Engineer, and why are they important?

To thrive as a Production Stability Engineer, you need strong analytical skills, expertise in incident management, and a background in computer science or IT. Familiarity with monitoring tools (such as Splunk, Datadog, or Prometheus) and automation frameworks is commonly required, as are ITIL certifications. Effective communication, problem-solving, and the ability to remain calm under pressure help you collaborate with teams and resolve critical issues quickly. These skills ensure system reliability, minimize downtime, and support seamless business operations.

What are some typical challenges Production Stability Engineers face, and how can they proactively address them?

Production Stability Engineers often encounter challenges such as diagnosing complex, intermittent incidents and balancing quick response with long-term solutions. To address these, they collaborate closely with development and operations teams to identify root causes, implement monitoring tools, and automate repetitive recovery tasks. Regular post-incident reviews and clear communication channels are critical for continuous improvement and preventing recurring issues. Building strong cross-functional relationships also helps streamline response efforts and fosters a culture of reliability.

What are Production Stability Engineers?

Production Stability Engineers are IT professionals responsible for ensuring the reliability, availability, and overall health of software systems in production environments. They monitor system performance, troubleshoot and resolve incidents, and implement preventative measures to avoid future disruptions. Their goal is to minimize downtime and maintain seamless user experiences by collaborating with development, operations, and support teams. Production Stability Engineers also analyze root causes of issues and help improve system resilience through automation and best practices.

What is the difference between a Production Stability Engineer and a Site Reliability Engineer?

Aspect | Production Stability Engineer | Site Reliability Engineer
Primary Focus | Ensuring system stability and uptime in production environments | Building and maintaining scalable, reliable systems with a focus on automation
Skills & Certifications | Linux, scripting, monitoring tools, certifications like AWS or Google Cloud | DevOps, cloud platforms, automation, similar certifications
Work Environment | Operations teams, production environments, monitoring dashboards | Development and operations teams, cloud infrastructure

Both roles focus on system reliability, but Production Stability Engineers primarily concentrate on maintaining uptime and stability, while Site Reliability Engineers emphasize automation and scalable system design. They often collaborate but have distinct core responsibilities within the same industry.

Infographic showing various Production Stability Engineer job openings in the United States as of May 2026, with employment types broken down into 90% Full Time and 10% Contract. Highlights a 90% In-person and 10% Remote job distribution, with an average salary of $131,667 per year, or $63.3 per hour.
FLEX Senior Manager, Product Stability

Marriott

Bethesda, MD

$135K - $178K/yr

Full-time

Medical, Dental, Vision, Life, Retirement, PTO

Posted 7 days ago


Fairfield By Marriott rating

Company rating: 5.7 out of 10

Based on 156 frontline employees who took The Breakroom Quiz

67th of 105 rated hotels


Job description

This is a temporary position.

We are seeking a Sr Manager, Product Stability - ARA to own the operational health, reliability, and production stability of the Auto Room Assignment (ARA) product, ensuring ARA is a dependable, trusted experience for property associates and a reliable foundation for long-term adoption. This role serves as the single-threaded owner for production stability, responsible for driving incident resolution, enforcing SLAs, coordinating cross-team dependencies, and ensuring production insights inform product and engineering decisions. Reporting into Product Management, this role partners closely with Engineering, SRE, Global Operations, Service Desk, Release Management, and external vendors to protect guest and property experience while enabling delivery teams to remain focused on roadmap execution.

Key Responsibilities

  • Serve as the day-to-day responsible lead for ARA production stability, a core capability of ARA, and end-to-end incident management across ServiceNow and related intake channels.
  • Monitor, triage, and actively drive production incident resolution across L1, L2, L3, SRE, platform teams, and external vendors.
  • Track and enforce SLA response and resolution targets; proactively identify and escalate risks before SLA breaches occur.
  • Act as the primary escalation and communication point for production issues, ensuring clear, timely, and consistent stakeholder updates.
  • Coordinate follow-ups, root cause analyses (RCAs), retrospectives, and corrective actions with dependent internal teams and vendors.
  • Ensure incidents are closed with appropriate validation, documentation, and RCA artifacts to reduce repeat issues and operational toil.
  • Partner with Operations to identify patterns of operational friction or ambiguity and convert them into product, design, or engineering improvements.
  • Partner with Engineering, Operations, and SRE to identify systemic stability issues, process gaps, and opportunities for continuous improvement.
  • Support production readiness and release activities, including pre-release checks, post-release monitoring, and knowledge transfer between teams.

Deliverables (Expected Role Outcomes)

  • Clear and consistent ownership of ARA production stability with no orphaned or aging incidents.
  • Improved SLA compliance, reduced MTTR, and predictable, trusted production behavior that supports associate adoption and confidence.
  • Predictable escalation paths and timely resolution of cross-team dependencies.
  • Actionable operational insights tied to ARA KPIs, including Adoption, Reliability, and Accuracy.
  • Increased confidence from leadership and stakeholders in ARA's production readiness and stability.

Required Qualifications

  • 7+ years of experience in product operations, production support, service reliability, or technology operations roles.
  • Proven experience owning production incidents, SLAs, and cross-team resolution without direct line authority.
  • Strong working knowledge of ITSM tools (e.g., ServiceNow), incident/problem/change management practices, and SLA governance.
  • Demonstrated ability to partner effectively with Product, Engineering, SRE, and external vendors in high-pressure production environments.
  • Excellent written and verbal communication skills, including executive-level status and risk communication.

Preferred Qualifications

  • Experience supporting large-scale, guest-facing or property-facing enterprise platforms.
  • Familiarity with SRE concepts such as toil reduction, error budgets, and reliability metrics.
  • Experience supporting release readiness, cutover activities, or post-deployment stabilization.
  • Experience building or consuming operational dashboards and KPI reporting.

Nice-to-Have Skills

  • Experience with observability or monitoring tooling (e.g., Dynatrace or similar).
  • Experience with Gen AI tools (CoPilot, Rovo, or similar).
  • Exposure to hospitality, travel, or highly transactional systems.
  • Prior experience in a product-aligned operations or product reliability role.

Working Style

  • Operates with a product-first stability mindset, balancing speed of delivery with reliability and risk management.
  • Calm, decisive, and effective under production pressure.
  • Highly collaborative, able to lead through influence with onshore and offshore technology, operations, and support teams.
  • Detail-oriented while maintaining a strong focus on outcomes, trends, and systemic improvements.

At Marriott International, we are dedicated to being an equal opportunity employer, welcoming all and providing access to opportunity. We actively foster an environment where the unique backgrounds of our associates are valued and celebrated. Our greatest strength lies in the rich blend of culture, talent, and experiences of our associates. We are committed to non-discrimination on any protected basis, including disability, veteran status, or other basis protected by applicable law.

All locations offer 401(k) plan, stock purchase plan, discounts at Marriott properties, commuter benefits, employee assistance plan, and childcare discounts. Benefits are subject to terms and conditions, which may include rules regarding eligibility, enrollment, waiting period, contribution, benefit limits, election changes, benefit exclusions, and others. Click here to learn more.

Full-time positions also offer coverage for medical, dental, vision, health care flexible spending account, dependent care flexible spending account, life insurance, disability insurance, accident insurance, adoption expense reimbursements, and paid parental leave.

Washington Applicants Only: Employees will accrue paid sick leave, 0.0384 PTO balance for every hour worked, and be eligible to receive a minimum of 9 holidays annually.

Marriott HQ is committed to a hybrid work environment that enables associates to Be connected. Headquarters-based positions are considered hybrid, for candidates within a commuting distance to Bethesda, MD; candidates outside of commuting distance to Bethesda, MD will be considered for Remote positions.

Marriott International is the world's largest hotel company, with more brands, more hotels and more opportunities for associates to grow and succeed. Be where you can do your best work, begin your purpose, belong to an amazing global team, and become the best version of you.

Employment Type: FULL_TIME
