Job Summary:
AI Foundry is an AI building and consulting business that works with some of the largest companies in the world. It is seeking AI Infrastructure Engineers to build and operate large-scale GPU infrastructure for training and inference workloads. The role centers on hands-on systems work across Linux, networking, storage, orchestration, monitoring, and automation.
Responsibilities:
• Build, configure, and operate GPU cluster infrastructure for training and inference workloads.
• Work across Linux systems, high-performance networking, storage, orchestration, monitoring, and automation.
• Support provisioning, configuration management, job scheduling, workload management, and lifecycle operations.
• Implement observability, alerting, incident response, and operational runbooks for infrastructure health.
• Partner with hardware, facilities, vendors, and engineering teams to resolve performance, reliability, and capacity issues.
• Automate repetitive operational tasks and improve the reliability of cluster operations.
• Help evaluate hardware, networking, storage, and platform components for AI workloads.
• Document systems clearly so global teams can operate and troubleshoot consistently.
• Travel to India 8+ times per year to work directly with infrastructure and client teams.
Qualifications:
Required:
• Strong Linux systems engineering experience and enjoyment of hands-on infrastructure work.
• Understanding of GPU infrastructure, or deep motivation to build expertise in NVIDIA systems, CUDA, NCCL, and distributed AI workloads.
• Experience with networking, storage, Kubernetes, Slurm, observability, automation, or related infrastructure tooling.
• Ability to troubleshoot complex systems methodically and communicate what you are seeing.
• Care about reliability, operational clarity, and repeatable systems.
• Comfortable working with incomplete information and learning quickly.
• Ability to collaborate with infrastructure, facilities, vendor, and engineering teams across time zones.
• Ability to write useful documentation and runbooks.
• Energized by building greenfield infrastructure at serious scale.
• Use AI and modern tools to improve how you build, debug, document, and operate systems.
• Willingness to travel internationally to India 8+ times per year.
Company:
Launching soon. The company is currently early stage, with a team of 2-10 employees.