Job Summary:
SpaceX is dedicated to making humanity multiplanetary through the development of reusable launch systems. The company is seeking a Site Reliability Engineer to operate and scale mission-critical products for the Guidance, Navigation, and Control (GNC) teams, with a focus on maintaining and improving GNC tools and infrastructure. The role involves collaborating with software engineers and managing computational infrastructure to support SpaceX's ambitious goals.
Responsibilities:
• Deploy, upgrade, operate, and scale a suite of mission-critical GNC products and services
• Provision and maintain virtual and physical servers
• Work with the SpaceX HPC team to monitor and maintain an HPC cluster consisting of tens of thousands of CPUs
• Closely collaborate with GNC software engineers to create highly operable and maintainable products
• Monitor web applications and services and respond to incidents
• Manage the underlying computational infrastructure of GNC in collaboration with IT stakeholders
• Engage in and improve the whole lifecycle of services from whiteboard to operational
• Make data-driven recommendations for future hardware purchases
• Practice sustainable incident response and postmortems
• Provide end-user support to GNC engineering by becoming an expert on analysis applications, helping users troubleshoot, and pointing them to relevant features
• Configure automated deployment pipelines for web apps
• Develop or improve GNC web apps and tools for better usability, maintainability, and robustness
• Demo and document new software changes such as operating system upgrades, shared filesystem changes, or major tool rollouts
• Focus on performance bottlenecks and performance improvement techniques
Qualifications:
Required:
• Bachelor’s degree in computer science, information systems/IT, engineering, math, or a scientific discipline and 2+ years of software development experience, OR 4+ years of professional experience building software with site reliability or DevOps in lieu of a degree
• 1+ years of experience with Linux operating systems
• 1+ years of experience with Python and Python-based development frameworks
Preferred:
• 2+ years of systems administration, site reliability engineering, or DevOps experience
• 2+ years of experience with Python and Python-based development frameworks
• 2+ years of Linux experience
• Expertise with Docker, Vagrant, and Kubernetes or similar technologies
• Extensive experience with configuration management tools such as Ansible, Puppet, or Terraform
• Experience with build systems (Make, Bazel / Pants / Buck, Gradle) and package management tools (pip, npm)
• Strong understanding of virtualization and hypervisor technologies
• Understanding of databases and data modeling
• Experience with automatically managing dozens or hundreds of servers
• Strong networking knowledge of TCP/IP
• Experience scaling web applications and optimizing applications for performance
• Experience with managing on-prem infrastructure, including direct experience managing GPU fleets
• Experience with high-performance computing systems or large-scale data analysis systems
• Must be comfortable working with mission-critical and sensitive systems, with a sense of urgency appropriate to the responsibilities
• Ability and willingness to obtain a Top Secret clearance
Company:
SpaceX designs, manufactures, and launches rockets and spacecraft to facilitate space exploration. Founded in 2002, the company is headquartered in Hawthorne, USA, and has 1001-5000 employees. The company is currently Late Stage.