Overview:
Role: SRE Lead
Location: Alpharetta, GA
Experience Level: Senior / Lead
Role Overview
We are seeking an experienced Site Reliability Engineering (SRE) Lead to own and drive the reliability, scalability, and operational excellence of cloud-native platforms. This role combines hands-on technical depth with people leadership: the SRE Lead manages the SRE team and sets best practices across reliability engineering, automation, observability, and incident management.
The SRE Lead will work closely with engineering, security, and platform teams to ensure systems are resilient, secure, and performant at scale.
Key Responsibilities
Leadership & Ownership
- Lead and manage the SRE team, owning end-to-end SRE responsibilities.
- Define SRE standards, reliability goals (SLIs/SLOs), and operational best practices.
- Mentor engineers and drive a culture of automation, resilience, and continuous improvement.
- Act as a key escalation point during critical incidents and outages.
Cloud & Platform Engineering
- Design, implement, and manage cloud infrastructure using Google Cloud Platform (GCP) services:
  - Compute Engine, GKE, VPC, Cloud IAM, Cloud Storage, Cloud SQL.
- Ensure high availability, fault tolerance, and scalability across environments.
Networking & Connectivity
- Architect and manage:
  - VPC peering, Shared VPCs
  - Firewall rules, Load Balancers, DNS
  - VPN tunnels and secure hybrid connectivity
Security & Identity
- Debug and manage IAM policies and service accounts.
- Implement Workload Identity Federation and least-privilege access models.
- Partner with security teams to enforce cloud security best practices.
Infrastructure as Code & Automation
- Develop and maintain Terraform modules with strong state management and dependency handling.
- Apply DRY principles across infrastructure code.
- Lead infrastructure automation initiatives to reduce manual intervention.
CI/CD & Deployment Strategies
- Design and maintain pipelines using:
  - Jenkins (Declarative & Scripted)
  - GitHub Actions (YAML workflows)
- Implement advanced deployment strategies:
  - Canary releases
  - Blue/Green deployments
- Manage artifacts using Docker and Helm
Linux & Systems Engineering (Must-Have)
- Deep hands-on expertise with RHEL, Ubuntu, and CentOS.
- Kernel tuning, systemd, storage management (LVM).
- OS-level performance optimization and troubleshooting.
Observability & Debugging
- Diagnose and resolve CPU, memory, disk, and I/O bottlenecks.
- Analyze system and application logs.
- Troubleshoot boot issues and low-level system failures.
- Drive root cause analysis and post-incident reviews.
Programming & Scripting (Must-Have)
- Strong proficiency in Python, Go (Golang), or Java for automation and tooling.
Required Skills
- Proven experience leading SRE or Platform Engineering teams.
- Strong expertise in GCP infrastructure and Kubernetes (GKE).
- Advanced Linux systems knowledge.
- Infrastructure-as-Code and CI/CD mastery.
- Strong debugging, incident response, and reliability engineering skills.
Preferred Qualifications
- Certifications:
  - Google Professional Cloud DevOps Engineer
  - Google Professional Cloud Architect
  - CKA (Certified Kubernetes Administrator)
- Experience with large-scale distributed systems and microservices.
- Familiarity with:
  - ITIL processes
  - Change Advisory Board (CAB)
  - Incident and problem management frameworks