Job Summary:
OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. They are seeking a Software Engineer for the Scaling team to design and deliver advanced systems that support the deployment and operation of cutting-edge AI models, focusing on the architectural and engineering backbone of OpenAI’s infrastructure.
Responsibilities:
• Own the end-to-end bring-up and bootstrap path for new systems and compute nodes from bare metal/early access in lab or production/cloud environments to schedulable fleet capacity: image build, user-data/config, cluster join, and readiness gates.
• Build and maintain “first-class” golden image and provisioning workflows across lab and production environments, including working with partner-provided base images and reconciling OS/version requirements.
• Work with partner teams to integrate nodes into our fleet infrastructure and IaC pipelines (Terraform, Chef, etc.), ensuring cloud resources map cleanly onto our internal lifecycle expectations (e.g., VMSS/instance pools, image references).
• Partner with scheduling and platform owners to ensure new hardware is reachable and schedulable (pool definitions, network/WAN connectivity and routing, admission controls, platform-specific quirks), including cases where new SKUs require changes for scheduling integration.
• Drive registration and inventory correctness (e.g., systems that track nodes and their metadata), including hands-on support to get nodes registered and visible end-to-end.
• Collaborate with partner teams to implement baseline health + telemetry bring-up: minimum viable health signals, pass/fail checks, and automated reporting suitable for early ramp decisions.
• Debug issues across layers: PXE/bootloader, UEFI/BIOS, BMC, OS bring-up, NIC/network reachability, kubelet/control-plane connectivity, storage constraints, and early rack/lab realities.
Qualifications:
Required:
• BS in CS/EE (or equivalent practical experience).
• 5+ years of experience in systems software development and in building/operating Linux-based infrastructure in production or pre-production environments.
• Strong, hands-on experience with Kubernetes cluster operations (node lifecycle, bootstrap/join, debugging control-plane connectivity).
• Strong, hands-on experience with Infrastructure-as-Code / config management (Terraform, Chef/Ansible, etc.).
• Strong, hands-on experience with provisioning and imaging (PXE/iPXE, golden images, cloud-init/user-data).
• Strong, hands-on experience with networking fundamentals (L2/L3, routing, DNS, firewalling; comfort debugging reachability).
• Proven ability to write automation in Python/Go/Bash and ship operational tooling/runbooks.
Preferred:
• Experience bringing up new hardware platforms (early silicon/servers/NICs) in a lab setting and turning them into stable fleet capacity.
• Multi-cloud operational experience (Azure/GCP/AWS/OCI), especially with compute pools (e.g., VMSS / instance pools).
• Experience building telemetry/health pipelines (agent-based metrics/logging, health rollups, readiness criteria).
• Familiarity with WAN, peering, and multi-site network concepts for cluster deployments.
Company:
OpenAI is an AI research and deployment company that develops advanced AI models, including ChatGPT. It is a sub-organization of the OpenAI Foundation. Founded in 2015, the company is headquartered in San Francisco, USA, and has 1,001 to 5,000 employees. It is currently a late-stage company.