OpenAI

60 OpenAI Software Reliability Engineer Jobs Hiring Near You

About the Role As OpenAI continues to grow, we are looking for experienced, problem-solving ... You will work closely with cross-functional teams, including software engineers, product managers ...

Security Reliability Engineer

San Francisco, CA · On-site

$67.25 - $89.25/hr

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. They are seeking a Security Reliability Engineer to design ...

Reliability/DFX Engineer

San Francisco, CA · On-site

$225K - $445K/yr

About the Team OpenAI's Hardware organization develops silicon and system-level solutions designed ... with software and research partners to co-design hardware tightly integrated with AI models. In ...

Security Reliability Engineer

San Francisco, CA · Hybrid

$67.25 - $89.25/hr

... as OpenAI scales. About the Role We are looking for a Security Reliability Engineer to design ... build, and operate reliable, secure, and scalable infrastructure that underpins identity, access ...

OpenAI Jobs Information

What are the key skills and qualifications needed to thrive as a Software Reliability Engineer, and why are they important?

To thrive as a Software Reliability Engineer, you need a strong background in software development, system architecture, and incident response, often supported by a degree in computer science or a related field. Familiarity with monitoring tools (like Prometheus), cloud platforms (AWS, GCP), and automation frameworks is highly valuable, as are certifications such as AWS Certified DevOps Engineer. Excellent problem-solving, collaboration, and communication skills help you coordinate effectively with cross-functional teams, especially in high-pressure situations. These abilities are crucial for maintaining system uptime, quickly resolving outages, and ensuring the overall reliability of critical software services.

How does a Software Reliability Engineer typically interact with development and operations teams to improve system stability?

Software Reliability Engineers (SREs) work closely with both development and operations teams to ensure that systems are reliable, scalable, and maintainable. They often participate in design reviews, provide input on architectural decisions, and help define service-level objectives. SREs also collaborate with developers to automate deployment processes and create monitoring solutions, and they partner with operations staff to manage incident response and root cause analysis. This collaborative environment enables them to proactively identify potential issues and drive cross-functional improvements.

What are Software Reliability Engineers?

Software Reliability Engineers (SREs) are IT professionals who focus on ensuring that software systems are reliable, scalable, and highly available. They work at the intersection of software development and IT operations, often automating processes, monitoring system performance, and responding to incidents. SREs use engineering principles to solve operational problems, aiming to reduce downtime and improve user experience. Their responsibilities can include building tools, managing infrastructure, and collaborating with development teams to implement best practices for reliability.

What is the difference between a Software Reliability Engineer and a Software Test Engineer?

| Aspect | Software Reliability Engineer | Software Test Engineer |
| --- | --- | --- |
| Primary Focus | Ensuring software reliability, stability, and performance over time | Designing and executing tests to identify bugs and verify functionality |
| Skills & Certifications | Knowledge of reliability engineering, scripting, monitoring tools | Testing methodologies, automation tools, scripting |
| Work Environment | Collaborates with development and operations teams, often in DevOps | Works primarily in QA/testing teams, often in dedicated testing phases |
| Industry Usage | Common in software companies focusing on product stability | Widely used in software development and QA departments |

The main difference is that Software Reliability Engineers focus on maintaining long-term software stability and performance, while Software Test Engineers concentrate on identifying bugs through testing. Both roles require technical skills and often collaborate, but their core objectives differ: reliability versus defect detection.

What other companies are hiring for Software Reliability Engineer jobs?
[Infographic: Software Reliability Engineer job openings at OpenAI in the United States as of May 2026. Employment type: 100% full-time. Work arrangement: 68% physical (on-site), 25% hybrid, 7% remote.]

Software Engineer, Reliability

OpenAI

San Francisco, CA • On-site

Full-time

This job post expired today. Applications are no longer accepted.


Job description

Job Summary:
OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. The role of Software Engineer, Reliability involves ensuring the reliability, scalability, and performance of OpenAI's systems while collaborating with cross-functional teams to build resilient infrastructure that can handle a growing user base.
Responsibilities:
• Design and implement solutions to ensure the scalability of our infrastructure to meet rapidly increasing demands.
• Build and maintain the load, chaos and synthetic testing software leveraged by development teams to make the systems they design and operate more reliable.
• Build and maintain automation tools to streamline repetitive tasks and improve system reliability.
• Build and maintain the platform for CPU/storage, GPU, and network lifecycle management to drive efficiency, accountability and support dynamic optimization of our resources.
• Implement fault-tolerant and resilient design patterns to minimize service disruptions.
• Develop and maintain service level objectives (SLOs) and service level indicators (SLIs) to measure and ensure system reliability.
• Partner with researchers, engineers, product managers, and designers to bring new features and research capabilities to the world.
• Participate in an on-call rotation to respond to critical incidents and ensure 24/7 system availability.
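For readers unfamiliar with the SLO and SLI terms used in the responsibilities above, the following is a minimal Python sketch of one common way the two relate: an availability SLI computed as the fraction of successful requests, compared against an SLO target to see how much error budget remains. The names `SloReport` and `evaluate_slo`, the request counts, and the 99.9% target are all illustrative assumptions; nothing here is taken from the posting or describes OpenAI's actual tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    """Hypothetical summary of one SLO evaluation window."""
    sli: float               # observed fraction of good requests
    error_budget: float      # allowed fraction of bad requests (1 - target)
    budget_remaining: float  # share of the error budget still unspent


def evaluate_slo(good_requests: int, total_requests: int, target: float) -> SloReport:
    """Compare an availability SLI against an SLO target (illustrative only)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= good_requests <= total_requests:
        raise ValueError("good_requests must be between 0 and total_requests")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")

    sli = good_requests / total_requests        # the indicator itself
    error_budget = 1.0 - target                 # failures the SLO tolerates
    budget_spent = (1.0 - sli) / error_budget   # share of budget consumed
    return SloReport(sli, error_budget, 1.0 - budget_spent)


# Assumed example window: 1,000,000 requests, 500 failures, 99.9% target.
report = evaluate_slo(good_requests=999_500, total_requests=1_000_000, target=0.999)
print(f"SLI={report.sli:.4%}, budget remaining={report.budget_remaining:.1%}")
```

A negative `budget_remaining` would mean the window has already violated the target; teams often use that signal to prioritize reliability work over new features.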
Qualifications:
Required:
• Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent work experience).
• Proven experience as a software engineer (SWE) focused on reliability, or in a similar role, at a fast-paced, rapidly scaling company.
• Strong proficiency in cloud infrastructure.
• Proficiency in programming languages.
• Experience with containerization technologies and container orchestration platforms like Kubernetes.
• Knowledge of IaC tools such as Terraform or CloudFormation.
• Excellent problem-solving and troubleshooting skills.
• Strong communication and collaboration skills.
• Experience with observability tools such as Datadog, Prometheus, Grafana, and Splunk.
• Experience with microservices architecture and service mesh technologies.
• Knowledge of security best practices in cloud environments.
Company:
OpenAI is an AI research and deployment company that develops advanced AI models, including ChatGPT. It is a sub-organization of the OpenAI Foundation. Founded in 2015, the company is headquartered in San Francisco, USA, with a team of 1,001-5,000 employees. The company is currently late-stage.