
Fault Management Engineer Jobs (NOW HIRING)



Fault Management Engineer information

Salary range: $29.5K to $183.5K per year, average $111.1K

How much do fault management engineer jobs pay per year?

As of May 29, 2026, the average yearly pay for a fault management engineer in the United States is $111,144, according to ZipRecruiter salary data. Most workers in this role earn between $75,500 and $143,000 per year, depending on experience, location, and employer.

What are the key skills and qualifications needed to thrive as a Fault Management Engineer, and why are they important?

To thrive as a Fault Management Engineer, you need a solid understanding of networking principles, troubleshooting methodologies, and a relevant degree in engineering or information technology. Familiarity with network management systems (NMS), SNMP, fault monitoring tools like Nagios or SolarWinds, and certifications such as CCNA or CompTIA Network+ are typically required. Analytical thinking, attention to detail, and effective communication are crucial soft skills for diagnosing issues and coordinating resolutions. These skills ensure quick identification and resolution of network faults, maintaining system reliability and minimizing downtime.

How does a Fault Management Engineer typically collaborate with other IT teams during incident resolution?

A Fault Management Engineer works closely with network operations, system administrators, and support teams to swiftly identify and resolve system faults. During incidents, they coordinate troubleshooting efforts, communicate findings, and escalate issues to specialized teams when necessary. This collaboration ensures minimal downtime and helps maintain service reliability. Effective communication and teamwork are essential, as engineers often participate in cross-functional meetings and post-incident reviews to improve future response strategies.

What does a Fault Management Engineer do?

A Fault Management Engineer is responsible for monitoring, detecting, and resolving faults or issues within a network or system to ensure optimal performance and minimal downtime. They use specialized tools to identify problems, analyze incident reports, and coordinate with technical teams for quick resolution. Their duties often include implementing automated monitoring solutions, performing root cause analysis, and documenting incidents to prevent future occurrences. Overall, they play a crucial role in maintaining the reliability and efficiency of IT infrastructure.

What is the difference between a Fault Management Engineer and a Network Operations Center (NOC) Technician?

| Aspect | Fault Management Engineer | Network Operations Center (NOC) Technician |
| --- | --- | --- |
| Certifications | Network+ or CCNA, fault management certifications | Network+ or CCNA, basic troubleshooting certifications |
| Work Environment | Design, analyze, and resolve network faults, often in a technical or engineering setting | Monitor network performance, respond to alerts, and perform troubleshooting in a control room |
| Employer & Industry | Telecom, ISPs, large enterprise networks | Telecom, ISPs, data centers, enterprise IT |

Fault Management Engineers focus on diagnosing and resolving complex network faults, often working on system design and analysis. NOC Technicians monitor network health and handle routine troubleshooting. Both roles are essential in maintaining network reliability but differ in scope and responsibilities.

More about Fault Management Engineer jobs

Staff Systems Engineer, Fault Management

Kodiak

San Francisco, CA

Other

Medical, Dental, Vision, Life, Retirement, PTO

Posted 18 days ago


Job description

The Systems and Safety Engineering team at Kodiak is seeking an experienced Systems Engineer to own the design and execution of Kodiak's next-generation Autonomy Fault Management System. This individual will lead the effort end-to-end: from product and system requirement definition, through architecture and implementation, to verification and validation, and safety case integration. This leader will ensure that the Kodiak Driver handles onboard system faults with the desired, correct, safe response. This role is central to progress toward a scalable driverless deployment and will work closely with autonomy hardware, software, and system safety teams.

This role directly shapes Kodiak's ability to operate sustainably at commercial scale. Fault management is not only a safety system; it is also a primary lever of fleet availability, utilization, and cost per mile. You will own the technical strategies that determine when the system can continue operating safely, when it must degrade, and when it must exit service.


In this role, you will:

  • Lead the end-to-end development of the next-generation Autonomy Fault Management System, coordinating the collaborative effort across hardware, software, system safety, and operations teams.
  • Own the systems and safety engineering execution for fault management across the full V-model lifecycle.
  • Lead the development of systems engineering artifacts, including requirements, traceability, V&V plans, and V&V evidence.
  • Define and lead the fault management architecture and concept of operations, including detection, isolation, response, safe-state definition, and minimum risk conditions.
  • Generate technical evidence in support of the adequacy, coverage, and sufficiency of the Fault Management System as an element of Kodiak's Driverless Safety Case.
  • Support quantitative and qualitative analyses used to set detection thresholds, prioritize hazards, and evaluate risk associated with fault responses and minimum risk maneuvers.
  • Lead and influence system architecture trade studies that impact the fault coverage, system availability, safety risk, and operational continuity.
  • Develop the strategy for managing system availability, degraded operation, and operational continuity through the Fault Management System.  
  • Quantify the commercial and safety impact of false positive and false negative detections.
  • Provide analysis to support complex autonomy system design trade-offs to inform system design decisions affecting safety and performance.
  • Serve as the technical leader to align cross-functional teams around a unified fault management strategy.

What you'll bring:

  • B.S., M.S., or PhD in engineering or related technical field
  • 5+ years of experience with real-time safety-critical applications, preferably in highly automated or autonomous systems (autonomous vehicles, aerospace, nuclear, medical, etc.)
  • Experience with fault management, diagnostic development, safe state identification and development
  • Experience working with agile software engineering teams
  • Ability to read C/C++ code
  • Experience with the systems engineering V-model and its application within the product life cycle
  • Strong verbal and written communication skills
  • Ability to collaborate effectively with technical stakeholders spanning multiple technical disciplines

What we offer:

  • Competitive compensation package including equity and annual bonuses
  • Excellent Medical, Dental, and Vision plans through Kaiser Permanente, Cigna, and MetLife (including a medical plan with infertility benefits)
  • MetLife Legal Services, Identity & Fraud Protection, Hospital Indemnity Insurance, Accident Insurance, & Critical Illness Insurance
  • Flexible PTO, 10 paid holidays, and generous parental leave policies
  • Our office is centrally located in Mountain View, CA
  • Office perks: dog-friendly, free catered lunch, a fully stocked kitchen, and free EV charging
  • Long Term Disability, Short Term Disability, Life Insurance
  • Wellbeing Benefits - Headspace through Cigna, Calm through Kaiser, One Medical, Gympass, Spring Health through Cigna, Rula (mental health navigation) 
  • Fidelity 401(k)
  • Commuter, FSA, Dependent Care FSA, HSA
  • Various incentive programs (referral bonuses, patent bonuses, etc.)