Data Center Reliability Engineer Jobs (NOW HIRING)

Reliability Engineer: We are seeking a skilled and detail-oriented Reliability Engineer to join our ... data center environments. You may be a good fit if you have * Bachelor's or Master's degree in ...

Reliability Engineer

Yocumtown, PA · On-site

$98K - $123K/yr

Experience with AI data center or hyperscale infrastructure. * Familiarity with high-speed copper and hybrid cable technologies (DAC, AOC, internal high-speed harnesses). * Working knowledge of SI ...

Staff Site Reliability Engineer

Austin, TX · On-site

$56.50 - $75/hr

Our footprint consists of 11 data center campuses across seven states, housing advanced ... Title: Staff Site Reliability Engineer. Reports To: Site Reliability Engineering Manager. The Job: We ...

Mainframe and SRE Architect

$58.25 - $77.50/hr

... data centre operations and migration program. Team Qualifications: Required: • Architect and implement Site Reliability Engineering (SRE) frameworks for mainframe and AWS hybrid cloud platforms ...

Site Reliability Engineer (SRE)

San Diego, CA · On-site

$60.50 - $80.50/hr

Experience building systems both on-premises (data center) and on public cloud (AWS, GCP or Azure ... managing SRE teams and supporting mission-critical applications. 3+ years of Hybrid Cloud ...

Reliability Engineer

Costa Mesa, CA · On-site

$110K - $138K/yr

... data streams into a realtime, 3D command and control center. As the world enters an era of ... Anduril's Reliability Engineering organization is seeking an experienced Reliability Engineer to ...

Reliability Engineer

Costa Mesa, CA

$108K - $136K/yr

... data streams into a realtime, 3D command and control center. As the world enters an era of ... Anduril's Reliability Engineering organization is seeking an experienced Reliability Engineer to ...

Data Center Reliability Engineer information

$61K · $118K · $141K

How much do data center reliability engineer jobs pay per year?

As of Jun 7, 2026, the average yearly pay for a data center reliability engineer in the United States is $117,973, according to ZipRecruiter salary data. Most workers in this role earn between $102,500 and $129,000 per year, depending on experience, location, and employer.

What are some typical challenges faced by Data Center Reliability Engineers, and how can I prepare for them?

Data Center Reliability Engineers often encounter challenges such as minimizing downtime, ensuring redundancy, and proactively identifying potential points of failure within complex infrastructures. You may be required to respond quickly to incidents, balance multiple priorities, and collaborate closely with both IT and facilities teams. Preparing for these challenges involves developing strong troubleshooting skills, staying up-to-date with best practices in reliability engineering, and gaining experience with monitoring and automation tools commonly used in data centers.

What are the key skills and qualifications needed to thrive as a Data Center Reliability Engineer, and why are they important?

To thrive as a Data Center Reliability Engineer, you need a strong background in electrical and mechanical systems and critical facility operations, and often a degree in engineering or a related field. Familiarity with Building Management Systems (BMS) and Computerized Maintenance Management Systems (CMMS), along with certifications such as Uptime Institute’s Accredited Tier Specialist or those offered by ASHRAE, are common requirements. Exceptional problem-solving abilities, attention to detail, and effective communication are vital soft skills in this role. These competencies are crucial for maintaining optimal uptime, preventing failures, and ensuring the continuous operation of critical infrastructure.

What does a Data Center Reliability Engineer do?

A Data Center Reliability Engineer is responsible for ensuring the continuous and efficient operation of data center infrastructure. They monitor systems for potential issues, perform maintenance, and implement strategies to prevent downtime. Their work often involves collaborating with IT and facilities teams to optimize performance, improve reliability, and respond to emergencies. By proactively identifying risks, these engineers help maintain critical services and minimize disruptions to business operations.

Reliability Engineer

Etched

Cupertino, CA • On-site

$2K/mo

Full-time

Medical, Dental, Vision

Posted 14 days ago


Job description

About Etched
Etched is building AI chips that are hard-coded for individual model architectures. Our first product (Sohu) only supports transformers, but has an order of magnitude more throughput and lower latency than a B200. With Etched ASICs, you can build products that would be impossible with GPUs, like real-time video generation models and extremely deep chain-of-thought reasoning.
Reliability Engineer
We are seeking a skilled and detail-oriented Reliability Engineer to join our team. As a Reliability Engineer at Etched, you will play a critical role in ensuring that all components and systems meet our rigorous reliability standards, essential for our datacenter applications. This position requires a deep understanding of reliability engineering principles, as well as experience working with suppliers, ODMs, and JDMs.
Representative Projects:
  • Lead the development, implementation, and management of reliability standards for all suppliers working with Etched. Ensure that all components and systems meet or exceed the required reliability benchmarks.
  • Review and verify reliability reports from suppliers, ensuring accuracy and adherence to Etched's standards. Provide guidance and feedback to suppliers to ensure continuous improvement in reliability performance.
  • Collaborate with cross-functional teams to review and recommend component selection criteria based on reliability performance. Ensure that all selected components are capable of meeting the long-term reliability requirements of our datacenter applications.
  • Evaluate and approve reliability test plans proposed by external vendors. Ensure that the test methodologies and conditions are sufficient to validate long-term reliability under expected operating conditions.
  • Conduct in-depth analysis of reliability data provided by suppliers and vendors. Identify trends, potential issues, and areas for improvement to enhance overall reliability.
  • Work closely with ODMs (Original Design Manufacturers) and JDMs (Joint Design Manufacturers) to ensure that all products meet Etched's quality and reliability standards. Provide technical guidance and support to maintain maximum operational uptime and long-term reliability.
  • Review and establish reliability metrics and standards for silicon components, ensuring they meet the stringent requirements for long-term reliability in data center environments.

You may be a good fit if you have:
  • Bachelor's or Master's degree in Reliability Engineering, Electrical Engineering, or a related field.
  • 5+ years of experience in reliability engineering, with a focus on datacenter applications preferred.
  • Strong understanding of reliability standards, testing methodologies, and data analysis techniques. DFMEA / PFMEA / SPC Engineering analysis experience desired.
  • Experience working with suppliers, ODMs, and JDMs in a high-tech environment.
  • Excellent communication skills, with the ability to convey complex technical concepts to diverse stakeholders.
  • Proven ability to manage multiple projects and deliver results in a fast-paced environment.

We encourage you to apply even if you do not believe you meet every single qualification.
How we're different:
Etched believes in the Bitter Lesson. We think most of the progress in the AI field has come from using more FLOPs to train and run models, and the best way to get more FLOPs is to build model-specific hardware. Larger and larger training runs encourage companies to consolidate around fewer model architectures, which creates a market for single-model ASICs.
We are a fully in-person team in Cupertino, and greatly value engineering skills. We do not have boundaries between engineering and research, and we expect all of our technical staff to contribute to both as needed.
Benefits:
  • Full medical, dental, and vision packages, with 100% of premiums covered (90% for dependents)
  • Housing subsidy of $2,000/month for those living within walking distance of the office
  • Daily lunch and dinner in our office
  • Relocation support for those moving to Cupertino