Job Title: Senior Cloud Reliability Engineer (SRE)
Role Overview
We are seeking a Senior Cloud Reliability Engineer (SRE) to architect and implement the systems that ensure our cloud environment is bulletproof. This role goes beyond standard DevOps; you will act as a software engineer focused on building tools, frameworks, and automation that reduce toil, optimize costs, and prevent incidents before they impact our users.
You will be the technical authority for reliability within our cloud ecosystem. The ideal candidate treats infrastructure as code, writes high-quality production code, and champions the "error budget" mindset across our development teams. If you are passionate about observability, chaos engineering, and building resilient distributed systems, we want to hear from you.
Key Responsibilities
Reliability Engineering & Development
• Platform Reliability: Design, develop, and maintain SRE utilities and automation solutions that minimize toil and drive self-service infrastructure capabilities.
• Infrastructure as Code (IaC): Architect and maintain complex Terraform modules to manage AWS resources. Build cost-efficient design principles into the infrastructure lifecycle.
• Software Engineering: Apply TDD (Test-Driven Development), version control best practices, and code reviews to all reliability solutions. Develop custom APIs and tools to integrate disparate cloud services.
• CI/CD Orchestration: Build and optimize CI/CD pipelines to ensure the rapid, safe delivery of infrastructure updates.
Observability & Incident Governance
• SLO/SLI Management: Define and manage service level objectives and indicators. Own the process of monitoring system health and providing transparency into platform reliability.
• Incident Lifecycle: Participate in an on-call rotation. Lead root-cause analysis (RCA) efforts, produce blameless postmortems, and identify actionable patterns to prevent repeat incidents.
• Proactive Resilience: Conduct resilience testing and chaos engineering experiments to harden our architecture against failures.
Technical Leadership & Collaboration
• Standardization: Establish SRE standards, guidelines, and governance frameworks for adoption across cross-functional teams.
• Agile Integration: Collaborate within Agile/Scaled Agile environments, acting as a force multiplier for security, platform, and infrastructure teams.
• Continuous Innovation: Stay at the forefront of cloud-native development and SRE methodologies, driving the adoption of new AWS services and reliability patterns.
Qualifications & Requirements
Minimum Qualifications
• Experience Baseline: Minimum of seven (7) years of professional software development experience with a heavy focus on platform engineering or reliability.
• Technical Mastery:
  o Advanced Python: Minimum 5 years of experience building enterprise-grade tools, APIs, and automation utilities.
  o AWS Ecosystem: Minimum 3 years of deep, hands-on experience with core AWS services (EC2, VPC, S3, Lambda, IAM, EventBridge, Step Functions).
  o IaC/DevOps: Expert-level proficiency with Terraform (module development/state management) and CI/CD pipeline implementation.
• SRE Competency: Minimum 3 years of direct experience defining SLIs/SLOs, managing error budgets, and practicing SRE fundamentals (observability, toil reduction).
• Education: Bachelor's degree in Computer Science, Information Systems, or a related field.
Preferred Attributes
• Proficiency in Go or additional programming languages.
• Hands-on experience with observability tools (Grafana, CloudWatch, CloudWatch Synthetics canaries).
• Familiarity with ITSM workflows (Incident, Change, and Problem Management).
Equal Opportunity Employer / Disabled / Protected Veterans
The Know Your Rights poster is available here:
https://www.eeoc.gov/sites/default/files/2023-06/22-088_EEOC_KnowYourRights6.12.pdf
The pay transparency policy is available here:
https://www.dol.gov/sites/dolgov/files/ofccp/pdf/pay-transp_%20English_formattedESQA508c.pdf
For temporary assignments lasting 13 weeks or longer, AllSTEM Connections is pleased to offer major medical, dental, vision, 401(k), and any statutory sick pay where required.
We are committed to working with and providing reasonable accommodations to individuals with disabilities. If you need a reasonable accommodation for any part of the employment process, please contact your staffing representative who will reach out to our HR team.
AllSTEM Connections participates in the E-Verify program in certain locations as required by law. Learn more about the E-Verify program here:
https://e-verify.uscis.gov/web/media/resourcesContents/E-Verify_Participation_Poster_ES.pdf
We also consider for employment qualified applicants regardless of criminal histories, consistent with legal requirements, including, if applicable, the City of Los Angeles' Fair Chance Initiative for Hiring Ordinance. Pursuant to applicable state and municipal Fair Chance Laws and Ordinances, we will consider for employment qualified applicants with arrest and conviction records, including, if applicable, the San Francisco Fair Chance Ordinance. For Los Angeles, CA applicants: Qualified applicants with arrest or conviction records will be considered for employment in accordance with the Los Angeles County Fair Chance Ordinance for Employers and the California Fair Chance Act.
AllSTEM Representative Contact Info
Account Executive: Nichols
Branch Phone: (909) 244-1777
Location: Ontario, CA