Job Summary:
Rubrik, a leading company in data protection and AI operations, is seeking a Staff Site Reliability Engineer to ensure the reliability and performance of its enterprise infrastructure services. The role involves technical leadership, driving architectural vision, and managing cross-organizational reliability standards for cloud systems.
Responsibilities:
• Formulate and execute the architectural vision for Rubrik's Cloud Platform, optimizing backend infrastructure systems like Kubernetes, MySQL, and cloud-native services for performance, security, and multi-region scale.
• Build, scale, and maintain sophisticated custom internal tools, platform controllers, and automation frameworks in Go or Python to systematically eliminate operational toil.
• Build technical consensus across component, platform, and security engineering teams, 'shifting left' so that structural resilience, capacity guards, and compliance are embedded from the initial feature design.
• Define, audit, and enforce robust Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets across all critical enterprise platform services, translating telemetry insights into actionable product roadmaps during executive reviews.
• Serve as a primary Incident Commander for high-severity cloud outages, establishing roles, directing mitigation under pressure, and orchestrating comprehensive, blameless post-mortems that drive durable systemic fixes.
• Architect cost-observability tools and attribution frameworks, leading cloud infrastructure capacity forecasting, resource quota optimization, and vendor SLA management.
• Set the technical direction for the Application-SRE team, raising the bar on how the team diagnoses, mitigates, and durably resolves the most complex customer-impacting issues across our platform.
• Champion SRE best practices, mentoring senior and junior individual contributors across the organization, participating in interview frameworks, and actively raising the collective technical bar.
• Participate in on-call rotations.
Qualifications:
Required:
• Must be a US citizen currently residing in the contiguous United States (CONUS); this is a strict regulatory requirement to enable support for federal and FedRAMP environments when required.
• BS, MS, or PhD in Computer Science, Computer Engineering, or a highly related technical discipline.
• 8 to 12 or more years of software engineering and production cloud infrastructure experience, including at least 5 years in a formal SRE, DevOps, or Platform engineering role operating hyperscale SaaS products.
• Comprehensive, hands-on programming expertise in Golang, Python, or Java with a deep grasp of concurrency models, data structures, and test-driven software design patterns.
• Proven proficiency designing, deploying, analyzing, and auditing complex, large-scale distributed systems, database topologies, and high-availability public cloud meshes.
• Authoritative operational command of Unix/Linux operating system environments (process models, file systems, kernels), systems administration, and advanced L4/L7 networking protocols.
• Ability to institutionalize the channel that converts patterns from customer escalations and POCs into prioritized product and reliability feedback, partnering directly with Product, Sales Engineering, and Support leadership.
• Track record of partnering directly with Sales, Support, and customers on escalations and POCs, and translating field signals into engineering action.
• Demonstrated history of technical leadership, mapping architectural dependencies, managing multi-team technical projects, and guiding organizations through critical platform shifts with high technical judgment.
• Willingness to participate in on-call rotations.
Preferred:
• Extensive production experience provisioning, lifecycle-managing, and recovering enterprise-scale Kubernetes deployments (GKE, EKS) and large-scale relational or non-relational databases (such as MySQL).
• Prior experience building, certifying, or auditing infrastructure environments under compliance structures such as FedRAMP (High/Moderate), SOC 2, ISO 27001, or CJIS.
• Fluency in Infrastructure-as-Code (Terraform, Pulumi) module design, multi-tenant state isolation, and enterprise observability fabrics (Prometheus, Grafana, OpenTelemetry).
Company:
Rubrik is a data security platform that delivers cyber resilience, cyber posture, and cyber recovery solutions. Founded in 2014, the company is headquartered in Palo Alto, California, USA, and has between 1,001 and 5,000 employees. It is currently a late-stage company.