Fluidstack

60 Fluidstack Site Reliability Engineer Jobs Hiring Near You

Site Reliability Engineer

San Francisco, CA · On-site

$67.25 - $89.25/hr

Fluidstack is a company that builds the compute, data centers, and power for artificial superintelligence. They are seeking a Site Reliability Engineer (SRE) to ensure the reliability and performance ...

About Fluidstack At Fluidstack, we build the compute, data centers, and power that will fuel ... Minimum Requirements 5+ years of SRE, DevOps, Sysadmin, and/or HPC engineering experience. * Great ...

Site Reliability Engineer

San Diego, CA · On-site

$175K - $320K/yr

About Fluidstack At Fluidstack, we build the compute, data centers, and power that will fuel ... Minimum Requirements 5+ years of SRE, DevOps, Sysadmin, and/or HPC engineering experience. * Great ...

Collaborate closely with Networking, Infrastructure, SRE/DevOps, and Software Engineering to embed ... Fluidstack is an Equal Employment Opportunity Employer. All qualified applicants will receive ...

... SRE/DevOps, and Software Engineering to embed security best practices into designs, deployments ... Fluidstack provides cloud infrastructure for AI with GPU clusters, orchestration, and monitoring ...

Site Selection Manager

Austin, TX · On-site

$135K - $235K/yr

About Fluidstack At Fluidstack, we build the compute, data centers, and power that will fuel ... Maintain relationships with brokers, landowners and developers to ensure we never miss a valid site

About Fluidstack At Fluidstack, we build the compute, data centers, and power that will fuel ... Own 24x7 operational accountability ensuring uptime, reliability, security, and availability meet ...

Fluidstack Jobs Information

What are the key skills and qualifications needed to thrive as a Site Reliability Engineer, and why are they important?

To thrive as a Site Reliability Engineer, you need a strong background in computer science, systems administration, and software engineering, often supported by a degree in a technical field. Familiarity with cloud platforms (like AWS or GCP), container orchestration (such as Kubernetes), infrastructure as code (Terraform or Ansible), and monitoring tools (Prometheus, Grafana) is typically expected. Strong problem-solving skills, effective communication, and a proactive mindset help SREs excel at incident management and cross-functional collaboration. These skills are crucial for maintaining system reliability, minimizing downtime, and driving continuous improvement in complex technical environments.

What are some of the most common challenges Site Reliability Engineers face when balancing system reliability with rapid software delivery?

Site Reliability Engineers (SREs) often navigate the challenge of maintaining highly reliable systems while supporting fast-paced software releases. This involves managing incidents, automating processes to reduce manual toil, and working closely with development teams to embed reliability into the software development lifecycle. SREs must carefully prioritize their efforts between proactive improvements and urgent, reactive fire-fighting. Effective communication and collaboration with both operations and development teams are crucial to ensuring service uptime without slowing down innovation.

What is a Site Reliability Engineer?

A Site Reliability Engineer (SRE) is a professional who applies software engineering principles to infrastructure and operations problems. Their primary goal is to create scalable and highly reliable software systems, often bridging the gap between development and IT operations. SREs automate tasks, monitor system health, respond to incidents, and work to improve system reliability and performance. They also help define service level objectives (SLOs) and ensure systems meet customer expectations for uptime and availability.
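As a rough illustration of what an availability SLO implies in practice, the sketch below converts a target percentage into the downtime it permits over a window. The function names, the 30-day window, and the 99.9% figure are assumptions chosen for the example, not details from any Fluidstack posting.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO permits over the window.

    Hypothetical helper: assumes availability is measured as the fraction
    of wall-clock minutes in the window during which the service is up.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def meets_slo(downtime_minutes: float, slo_percent: float, window_days: int = 30) -> bool:
    """True if the observed downtime stays within what the SLO allows."""
    return downtime_minutes <= allowed_downtime_minutes(slo_percent, window_days)


if __name__ == "__main__":
    budget = allowed_downtime_minutes(99.9)
    print(f"Allowed downtime at 99.9% over 30 days: {budget:.1f} minutes")
    print("30 minutes of downtime within SLO:", meets_slo(30, 99.9))
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43.2 minutes of downtime, which is why small changes in the target can have a large effect on how much risk a team can take with releases.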

What is the difference between a Site Reliability Engineer and a DevOps Engineer?

| Aspect | Site Reliability Engineer | DevOps Engineer |
| --- | --- | --- |
| Credentials | Typically requires a computer science degree, certifications like AWS, Google Cloud, or Kubernetes | Similar credentials, often with cloud certifications and scripting skills |
| Work Environment | Focuses on maintaining and improving system reliability, often in large-scale production environments | Works on automation, CI/CD pipelines, and deployment processes across development and operations teams |
| Industry Usage | Common in tech, cloud services, and large-scale enterprise companies | Widely used in software development, cloud, and IT organizations |

Both roles require strong technical skills and cloud knowledge, but SREs focus more on system reliability and uptime, while DevOps engineers emphasize automation and deployment processes. They often collaborate but have distinct primary responsibilities.

Infographic showing Site Reliability Engineer job openings at Fluidstack in the United States as of May 2026. Employment types: 98% Full Time and 2% Contract. Work arrangements: 96% Physical, 1% Hybrid, and 3% Remote.

Site Reliability Engineer

Fluidstack

San Francisco, CA • On-site

$67.25 - $89.25/hr

Full-time

Posted 22 days ago


Job description

Job Summary:
Fluidstack is a company that builds the compute, data centers, and power for artificial superintelligence. They are seeking a Site Reliability Engineer (SRE) to ensure the reliability and performance of their global GPU cloud infrastructure, partnering with various teams to scale systems for AI workloads.
Responsibilities:
• Deploying clusters of 1,000+ GPUs using custom-written playbooks; modifying these tools as necessary to provide the perfect solution for a customer.
• Validating correctness and performance of underlying compute, storage, and networking infrastructure, and working with providers to optimize these subsystems.
• Migrating petabytes of data from public cloud platforms to local storage, as quickly and cost-effectively as possible.
• Debugging issues anywhere in the stack, from “this server’s fan is blocked by a plastic bag” to “optimizing S3 dataloaders from buckets in different regions”.
• Building internal tooling to decrease deployment time and increase cluster reliability, including automation where the customer benefits clearly outweigh the implementation overhead.
Qualifications:
Required:
• 5+ years of SRE, DevOps, Sysadmin, and/or HPC engineering experience.
• Great verbal and written communication skills in English.
• Experience deploying and operating Kubernetes and/or SLURM clusters.
• Experience writing Go, Python, and Bash.
• Experience using Ansible, Terraform, and other automation or IaC tools.
• Strong engineering background, preferably in Computer Science, Software Engineering, Math, Computer Engineering, or similar fields.
Preferred:
• You have built and operated an AI workload at 1,000+ GPU scale.
• You have built multi-tenant, hyperscale Kubernetes-based services.
• You have physically deployed infrastructure in a datacenter, managed bare-metal hardware via MaaS or NetBox, etc.
• You have deployed and managed multi-tenant InfiniBand or RoCE networks.
• You have deployed and managed petabyte-scale all-flash storage systems, including DDN, VAST, and/or Weka; or Ceph, Lustre, or similar open-source tools.
Company:
Fluidstack provides cloud infrastructure for AI with GPU clusters, orchestration, and monitoring for intensive workloads. Founded in 2017, the company is headquartered in New York, USA, with a team of 51-200 employees. The company is currently in the growth stage.