Job Summary:
Rubrik is a leader at the intersection of data protection, cyber resilience, and enterprise AI acceleration. The Staff Site Reliability Engineer will serve as a primary technical leader, responsible for ensuring the reliability and performance of the company's cloud infrastructure while leading the Application-SRE team to resolve complex customer escalations.
Responsibilities:
• Formulate and execute the architectural vision for Rubrik's Cloud Platform, optimizing backend infrastructure systems like Kubernetes, MySQL, and cloud-native services for performance, security, and multi-region scale.
• Build, scale, and maintain sophisticated custom internal tools, platform controllers, and automation frameworks in Go or Python to systematically eliminate operational toil.
• Deploy, scale, and operate the AI infrastructure that powers Rubrik's SaaS offerings, owning the reliability, performance, cost, and security controls required to run AI workloads in multi-tenant, compliance-bound environments.
• Drive the adoption of AI-driven solutions across the SRE charter to reduce toil and multiply the organization's impact, applying agentic and LLM-based approaches to automated triage, incident response, operational analysis, and developer productivity.
• Build the guardrails, controls, and platform patterns that keep Rubrik's SaaS reliable as AI adoption accelerates across product and engineering, ensuring new AI capabilities ship without eroding availability, performance, security, or cost posture.
• Use engineering-wide influence to build technical consensus among component, platform, and security engineering teams, 'shifting left' so that structural resilience, capacity guards, and compliance are embedded from the initial feature design.
• Define, audit, and enforce robust Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets across all critical enterprise platform services, translating telemetry insights into actionable product roadmaps during executive reviews.
• Serve as a primary Incident Commander for high-severity cloud outages, establishing roles, directing mitigation under pressure, and leading comprehensive, blameless post-mortems that drive durable systemic fixes.
• Architect cost-observability tools and attribution frameworks, leading cloud infrastructure capacity forecasting, resource quota optimization, and vendor SLA management.
• Set the technical direction for the Application-SRE team, raising the bar on how the team diagnoses, mitigates, and durably resolves the most complex customer-impacting issues across our platform.
• Champion SRE best practices, mentoring senior and junior individual contributors across the organization, participating in interview frameworks, and actively raising the collective technical bar.
• Participate in on-call rotations.
Qualifications:
Required:
• Must be a US citizen currently residing in the contiguous United States (CONUS); this is a strict regulatory requirement to enable support for federal and FedRAMP environments when required.
• BS, MS, or PhD in Computer Science, Computer Engineering, or a highly related technical discipline.
• 8–12+ years of software engineering and production cloud infrastructure experience, including at least 5 years in a formal SRE, DevOps, or Platform engineering role operating hyperscale SaaS products.
• Comprehensive, hands-on programming expertise in Golang, Python, or Java with a deep grasp of concurrency models, data structures, and test-driven software design patterns.
• Proven proficiency designing, deploying, analyzing, and auditing complex, large-scale distributed systems, database topologies, and high-availability public cloud meshes.
• Authoritative operational command of Unix/Linux operating system environments (process models, file systems, kernels), systems administration, and advanced L4/L7 networking protocols.
• Working knowledge of operating AI systems in production, including model serving, cost trade-offs, and the reliability and safety considerations of LLM- and agent-based workloads, with practical judgment on when AI is the right tool versus deterministic automation.
• Ability to institutionalize the channel that converts patterns from customer escalations and POCs into prioritized product and reliability feedback, partnering directly with Product, Sales Engineering, and Support leadership.
• Track record of partnering directly with Sales, Support, and customers on escalations and POCs, and translating field signals into engineering action.
• Demonstrated history of technical leadership, mapping architectural dependencies, managing multi-team technical projects, and guiding organizations through critical platform shifts with high technical judgment.
Preferred:
• Extensive production experience provisioning, lifecycle-managing, and recovering enterprise-scale Kubernetes deployments (e.g., GKE, EKS) and large-scale relational and non-relational databases (e.g., MySQL).
• Prior experience building, certifying, or auditing infrastructure environments under compliance structures such as FedRAMP (High/Moderate), SOC 2, ISO 27001, or CJIS.
• Fluency in Infrastructure-as-Code (Terraform, Pulumi) module design, multi-tenant state isolation, and enterprise observability fabrics (Prometheus, Grafana, OpenTelemetry).
• Exposure to building AI- or LLM-powered internal tooling and applying it to SRE, operations, or engineering productivity use cases.
• Familiarity with the operational considerations of running AI workloads on cloud and Kubernetes platforms.
Company:
Rubrik is a data security platform that delivers cyber resilience, cyber posture, and cyber recovery solutions. Founded in 2014 and headquartered in Palo Alto, USA, the company has 1,001 to 5,000 employees and is currently a late-stage company.