OpenAI

60 OpenAI Infrastructure Software Engineer Jobs Hiring Near You

Software Engineer, System Enablement

San Francisco, CA · On-site

$203K - $241K/yr

They are seeking a Software Engineer for the Scaling team to design and deliver advanced systems ... backbone of OpenAI's infrastructure. Responsibilities : • Own the end-to-end bring-up and ...

Software Engineer, System Enablement

Seattle, WA · On-site

$196K - $233K/yr

They are seeking a Software Engineer for the Scaling team to design and deliver advanced systems ... backbone of OpenAI's infrastructure. Responsibilities : • Own the end-to-end bring-up and ...

Software Engineer, Compute Infrastructure

New York, NY · On-site

$189K - $224K/yr

... software, agent infrastructure, developer tools, and observability into one coherent experience for ... About the Role We are looking for engineers who want to build the compute platform behind OpenAI ...

Software Engineer, Compute Infrastructure

Seattle, WA · On-site

$196K - $233K/yr

... software, agent infrastructure, developer tools, and observability into one coherent experience for ... About the Role We are looking for engineers who want to build the compute platform behind OpenAI ...

Software Engineer, System Enablement

OpenAI

San Francisco, CA • On-site

$203K - $241K/yr

Full-time

This job post expired today. Applications are no longer accepted.


Job description

Job Summary:
OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. They are seeking a Software Engineer for the Scaling team to design and deliver advanced systems that support the deployment and operation of cutting-edge AI models, focusing on the architectural and engineering backbone of OpenAI’s infrastructure.
Responsibilities:
• Own the end-to-end bring-up and bootstrap path for new systems and compute nodes from bare metal/early access in lab or production/cloud environments to schedulable fleet capacity: image build, user-data/config, cluster join, and readiness gates.
• Build and maintain “first-class” golden image + provisioning workflows across lab and production environments, including working with partner-provided base images and reconciling OS/version requirements.
• Work with partner teams to integrate nodes into our fleet infrastructure and IaC pipelines (Terraform, Chef, etc.), ensuring cloud resources map cleanly onto our internal lifecycle expectations (e.g., VMSS/instance pools, image references).
• Partner with scheduling and platform owners to ensure new hardware is reachable and scheduled (pool definitions, network/WAN connectivity/routing, admission controls, platform-specific quirks), including cases where new SKUs require changes for scheduling integration.
• Drive registration and inventory correctness (e.g., systems that track nodes and their metadata), including hands-on support to get nodes registered and visible end-to-end.
• Collaborate with partner teams to implement baseline health + telemetry bring-up: minimum viable health signals, pass/fail checks, and automated reporting suitable for early ramp decisions.
• Debug issues across layers: PXE/bootloader, UEFI/BIOS, BMC, OS bring-up, NIC/network reachability, kubelet/control-plane connectivity, storage constraints, and early rack/lab realities.
Qualifications:
Required:
• BS in CS/EE (or equivalent practical experience).
• 5+ years of experience in systems SW development and building/operating Linux-based infrastructure in production or pre-production environments.
• Strong, hands-on experience with Kubernetes cluster operations (node lifecycle, bootstrap/join, debugging control-plane connectivity).
• Strong, hands-on experience with Infrastructure-as-Code / config management (Terraform, Chef/Ansible, etc.).
• Strong, hands-on experience with provisioning and imaging (PXE/iPXE, golden images, cloud-init/user-data).
• Strong, hands-on experience with networking fundamentals (L2/L3, routing, DNS, firewalling; comfort debugging reachability).
• Proven ability to write automation in Python/Go/Bash and ship operational tooling/runbooks.
Preferred:
• Experience bringing up new hardware platforms (early silicon/servers/NICs) in a lab setting and turning them into stable fleet capacity.
• Multi-cloud operational experience (Azure/GCP/AWS/OCI), especially with compute pools (e.g., VMSS / instance pools).
• Experience building telemetry/health pipelines (agent-based metrics/logging, health rollups, readiness criteria).
• Familiarity with WAN, peering, and multi-site network concepts for cluster deployments.
Company:
OpenAI is an AI research and deployment company that develops advanced AI models, including ChatGPT. It is a sub-organization of the OpenAI Foundation. Founded in 2015, the company is headquartered in San Francisco, USA, and has 1,001-5,000 employees. It is currently a late-stage company.