Job Summary:
Tesla is building critical applications to enable manufacturing and warehouse management with a strong emphasis on reliability, availability, scalability, speed, and security. As the Lead Site Reliability Engineer, you will be the primary technical owner and leader for the Factory Software team’s reliability, observability, and infrastructure strategy, combining deep hands-on engineering with leadership to ensure the full stack is highly reliable and performant.
Responsibilities:
• Provide technical leadership and set the vision for observability, reliability, and platform standardization across the Factory Software team
• Design and implement end-to-end observability and telemetry solutions (OpenTelemetry, Prometheus, Grafana, Tempo, etc.) while mentoring the team on best practices
• Own the reliability of the full stack: Kubernetes infrastructure, virtual machines, databases, and the middleware applications connecting PLCs, MES, and other factory services
• Define and drive SLIs, SLOs, error budgets, and golden signals across services
• Lead major initiatives to eliminate performance bottlenecks, database contention, and infrastructure issues through proactive monitoring and automation
• Write production-grade code and build tools to reduce toil and improve deployment, monitoring, and operational workflows
• Participate hands-on in on-call rotations, live troubleshooting during outages (NOC bridges), and blameless post-mortems
• Collaborate closely with Platform Engineering, Infrastructure, Controls Engineering, and Software Engineering teams to embed reliability and observability into architecture and development practices
• Mentor and coach engineers on technical excellence, observability, Kubernetes, Linux, networking, and reliable system design
• Drive continuous improvement in incident response, system performance, and engineering standards across the team
Qualifications:
Required:
• 7+ years of experience in Site Reliability Engineering, Platform Engineering, or related systems roles, with significant hands-on experience at scale
• Strong technical expertise in Kubernetes, Docker, Linux administration, and networking (routing, VLANs, firewalls, load balancers)
• Deep experience with observability tools and concepts (Prometheus, Grafana, Tempo, OpenTelemetry, Splunk, etc.)
• Proven track record of designing and implementing reliable, observable distributed systems
• Proficiency in at least one high-level language (Go, Python, or Java) with experience writing production-grade code
• Demonstrated ability to lead technical initiatives and raise the engineering bar without formal people management authority
• Experience with on-call rotations, incident command, and driving reliability improvements through blameless post-mortems
• Strong bias for action, excellent communication skills, and a desire to mentor and uplift other engineers
Preferred:
• Experience in manufacturing, industrial automation, or complex operational environments
Company:
Tesla is an electric vehicle and clean energy company that provides electric cars, solar, and renewable energy solutions. Founded in 2003, the company is headquartered in Austin, Texas, USA, and has more than 10,000 employees. It is a late-stage company.