Job Summary:
Anduril Industries is a defense technology company focused on transforming military capabilities with advanced technology. The company is seeking a Senior Site Reliability Engineer to ensure the reliability, performance, and scalability of its mission-critical systems, particularly those supporting the Lattice platform. The role involves building resilient systems, implementing monitoring and incident response strategies, and collaborating with engineering teams to enhance operational excellence.
Responsibilities:
• Design and implement comprehensive monitoring, observability, and alerting systems to ensure early detection of reliability issues across the Lattice platform
• Drive incident response and conduct blameless postmortems to identify systemic improvements and prevent recurrence of production issues
• Build and maintain infrastructure automation using tools like Terraform, Kubernetes operators, and custom tooling to manage large-scale distributed systems
• Establish and track Service Level Objectives (SLOs) and Error Budgets to balance feature velocity with system reliability
• Partner with software engineering teams to improve system architecture for reliability, implementing patterns like circuit breakers, graceful degradation, and chaos engineering
• Develop capacity planning models and performance testing frameworks to ensure systems can handle growth and peak operational demands
• Create runbooks, documentation, and training materials to enable teams to operate production systems effectively
• Lead cross-functional efforts to improve deployment safety through progressive rollouts, automated testing, and rollback capabilities
• Implement security best practices and compliance controls for production environments handling sensitive defense data
• Build tooling and automation to reduce toil and improve operational efficiency for the engineering organization
• Participate in on-call rotations and serve as an escalation point for critical production incidents
Qualifications:
Required:
• 7+ years of engineering experience, including at least 3 years focused on SRE, production operations, or infrastructure engineering
• Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
• Deep expertise with Kubernetes in production environments, including operational challenges at scale (100+ nodes)
• Strong programming skills in one or more languages such as Go, Python, Rust, or Java, with the ability to build production-grade tooling
• Proven experience designing and implementing observability stacks (metrics, logging, tracing) using tools like Prometheus, Grafana, ELK/EFK, or equivalent
• Hands-on experience with cloud platforms (AWS, Azure, or GCP) and infrastructure as code practices
• Demonstrated ability to debug complex distributed systems issues across multiple layers of the stack
• Track record of improving system reliability through architectural changes, not just operational band-aids
• Strong incident management and communication skills, with experience leading responses to critical outages
• Must be a U.S. Person due to required access to U.S. export-controlled information or facilities
• Eligible to obtain and maintain an active U.S. Secret security clearance
Preferred:
• Experience with defense, aerospace, or other mission-critical systems where downtime has severe consequences
• Expertise in performance optimization and capacity planning for high-throughput, low-latency systems
• Knowledge of chaos engineering principles and experience implementing resilience testing frameworks
• Experience with service mesh technologies (Istio, Linkerd) and advanced traffic management patterns
• Background in database operations and optimization (PostgreSQL, Cassandra, or similar at scale)
• Familiarity with CI/CD platforms and deployment automation (ArgoCD, FluxCD, Spinnaker, Jenkins)
• Understanding of networking fundamentals including load balancing, DNS, TLS/SSL, and network security
• Experience with configuration management and secrets management solutions (Vault, Sealed Secrets, SOPS)
• Strong written and verbal communication skills, with the ability to explain technical concepts to non-technical stakeholders
• Active Secret or higher security clearance
Company:
Anduril Industries is a defense technology company that specializes in developing advanced autonomous systems to enhance national security. Founded in 2017, the company is headquartered in Costa Mesa, USA, and has 1,001-5,000 employees. It is currently a late-stage company.