Role
We are looking for a Staff Site Reliability Engineer (Automation) to join our Engineering team. This is a hybrid role based in San Jose, CA (3 days in office), reporting to the Director, Site Reliability Engineer. You will drive the provisioning and deployment of new infrastructure, with a heavy focus on automation. Your expertise will help manage how customer traffic is routed within the cloud and support effective troubleshooting across hardware and automated systems.
What You'll Do (Role Expectations)
- Manage and maintain large-scale distributed systems using an infrastructure-as-code approach
- Develop and enhance tools to automate the deployment and management of large-scale services, focusing on reliable system architecture and maintaining high code quality
- Diagnose and resolve issues by editing code, adjusting infrastructure configurations, conducting performance and network analysis, and creating reusable tools
- Develop automation solutions and manage services efficiently using version-controlled infrastructure-as-code
- Support mission-critical services and participate in on-call rotations as needed
Who You Are (Success Profile)
- You act like an owner. Your passion for the mission fuels your bias for action. You operate with integrity because you genuinely care about the outcome. You adapt to what's needed, navigating seamlessly between high-level strategy and hands-on execution.
- You are a problem-solver. You seek out challenges because you are energized by finding solutions, knowing that solving the hard problems delivers the biggest impact.
- You are a learner. You have a true growth mindset and never stop developing yourself, actively seeking feedback to become a better partner and a stronger teammate. You love what you do and you do it with purpose.
- You are driven by innovation. You have a deep curiosity for how things work and are energized by solving complex technical challenges. You believe in the power of technology to accelerate transformation and are always looking for a better, more secure, and scalable way.
- You are resilient and adaptable. You view change as an opportunity and setbacks as temporary. You maintain composure and focus in high-pressure situations, guiding yourself and your team through complexity with a steady, positive hand.
What We're Looking For (Minimum Qualifications)
- 5+ years of relevant experience in site reliability or systems engineering
- Proficiency with Python or Ansible for automation tasks, and experience interacting with external APIs
- Demonstrated experience building and maintaining automation solutions
- Strong background in systems administration, specifically with Linux or other major operating systems
- Bachelor's degree in Computer Science, a related field, or equivalent practical experience
What Will Make You Stand Out (Preferred Qualifications)
- Hands-on experience with system provisioning via Kickstart over PXE, and with monitoring and observability tools such as Prometheus, Grafana, or Nagios