Job Summary:
Ai2 is a non-profit research institute dedicated to building AI for the common good. The Lead Software Engineer will be responsible for developing the infrastructure that supports high-performance computing for AI research, ensuring efficient scheduling and execution of workloads.
Responsibilities:
• Strategic Leadership: Develop the roadmap for managing large-scale HPC systems, including the deployment of compute, networking, and storage in partnership with leadership.
• Full-Stack Ownership: Lead the design and delivery of critical systems that span the entire stack—from the Beaker job scheduler to the execution runtime.
• System Automation: Build innovative tooling and software-defined infrastructure to accelerate researcher velocity and automate cluster health management.
• Performance Optimization: Conduct root-cause analysis on complex distributed system failures and implement optimizations for distributed workloads.
• Mentorship & Culture: Foster a high-performance culture by reviewing code/design docs, mentoring team members, and driving process improvements across the organization.
• Evangelism: Represent Ai2’s infrastructure work across internal research teams.
Qualifications:
Required:
• 10+ years of professional experience developing business-critical software and operating large-scale compute infrastructure.
• Bachelor’s degree in a related field; a relevant advanced degree may substitute for an equivalent number of years of technical work experience.
• Deep Linux Expertise: Expert-level knowledge of Linux internals and container runtimes like Docker.
• Distributed Systems Mastery: A proven track record of designing, debugging, and optimizing high-scale distributed systems and databases.
• HPC Foundations: Applied experience with workload schedulers (like Kubernetes or Slurm) and high-performance networking (NCCL and InfiniBand).
• Cloud & Hardware Hybridity: Familiarity with the nuances of on-prem GPU cluster management and cloud infrastructure (GCP, AWS).
• Communication: Exceptional writing skills and the ability to drive consensus across diverse groups of researchers and engineers.
• A principled approach to engineering: you care about how systems are built and are excited by the unique constraints and freedoms of a non-profit research environment.
Preferred:
• Proficiency in Go and/or Python.
• Prior experience training or fine-tuning frontier AI models.
• Deep systems administration expertise or a Site Reliability Engineering (SRE) background in an HPC context.
• Experience contributing to open-source infrastructure or orchestration projects.
• Familiarity with on-prem storage systems like WEKA and Ceph.
Company:
We are a non-profit AI research institute founded in 2014 by the late Paul Allen and headquartered in Seattle, USA, with a team of 201-500 employees. The company is currently at the growth stage.