Job Summary:
OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. As a Software Engineer in Platform Systems, you will design and build distributed systems for large-scale AI training, focusing on failure detection, observability, and performance optimization.
Responsibilities:
• Design and build distributed failure detection, tracing, and profiling systems for large-scale AI training jobs
• Develop tooling to identify slow, faulty, or misbehaving nodes and provide actionable visibility into system behavior
• Improve observability, reliability, and performance across OpenAI’s training platform
• Debug and resolve issues in complex, high-throughput distributed systems
• Collaborate with systems, infrastructure, and research teams to evolve platform capabilities
• Extend and adapt failure detection systems or tracing systems to support new training paradigms and workloads
Qualifications:
Required:
• Design and build distributed failure detection, tracing, and profiling systems for large-scale AI training jobs
• Develop tooling to identify slow, faulty, or misbehaving nodes and provide actionable visibility into system behavior
• Improve observability, reliability, and performance across OpenAI’s training platform
• Debug and resolve issues in complex, high-throughput distributed systems
• Collaborate with systems, infrastructure, and research teams to evolve platform capabilities
• Extend and adapt failure detection systems or tracing systems to support new training paradigms and workloads
• Care deeply about performance, stability, and observability in distributed systems
• Enjoy finding and fixing issues in large-scale systems and automating operational workflows
• Have experience writing low-level software where system details matter
• Understand hardware, operating systems, networking, concurrency, and distributed systems
• Have a background in high-performance computing or low-level systems engineering
• Are excited to work on critical infrastructure that powers frontier AI research
Company:
OpenAI is an AI research and deployment company that develops advanced AI models, including ChatGPT. It is a sub-organization of OpenAI Foundation. Founded in 2015, the company is headquartered in San Francisco, USA, with a team of 1001-5000 employees. The company is currently Late Stage.