Job Summary:
Fluidstack builds the compute, data centers, and power for artificial superintelligence. The company is seeking a Site Reliability Engineer (SRE) to ensure the reliability and performance of its global GPU cloud infrastructure, partnering with various teams to scale systems for AI workloads.
Responsibilities:
• Deploying clusters of 1,000+ GPUs using custom-written playbooks; modifying these tools as necessary to provide the perfect solution for a customer.
• Validating correctness and performance of underlying compute, storage, and networking infrastructure, and working with providers to optimize these subsystems.
• Migrating petabytes of data from public cloud platforms to local storage as quickly and cost-effectively as possible.
• Debugging issues anywhere in the stack, from “this server’s fan is blocked by a plastic bag” to “optimizing S3 dataloaders from buckets in different regions”.
• Building internal tooling to decrease deployment time and increase cluster reliability, including automation where the customer benefits clearly outweigh the implementation overhead.
Qualifications:
Required:
• 5+ years of SRE, DevOps, Sysadmin, and/or HPC engineering experience.
• Great verbal and written communication skills in English.
• Experience deploying and operating Kubernetes and/or SLURM clusters.
• Experience writing Go, Python, and Bash.
• Experience using Ansible, Terraform, and other automation or IaC tools.
• Strong engineering background, preferably in Computer Science, Software Engineering, Math, Computer Engineering, or similar fields.
Preferred:
• You have built and operated an AI workload at 1,000+ GPU scale.
• You have built multi-tenant, hyperscale Kubernetes-based services.
• You have physically deployed infrastructure in a data center and managed bare-metal hardware via MAAS, NetBox, or similar tools.
• You have deployed and managed multi-tenant InfiniBand or RoCE networks.
• You have deployed and managed petabyte-scale all-flash storage systems, including DDN, VAST, and/or Weka; or Ceph, Lustre, or similar open-source tools.
Company:
Fluidstack provides cloud infrastructure for AI with GPU clusters, orchestration, and monitoring for intensive workloads. Founded in 2017, the company is headquartered in New York, USA, with a team of 51-200 employees. The company is currently at the growth stage.