Job Summary (Sr. Manager SRE):
- Design, implement, and manage scalable, secure, and fault-tolerant cloud infrastructure using AWS, Azure, or GCP.
- Automate infrastructure provisioning and configuration with IaC tools such as Terraform, CloudFormation, and Ansible.
- Develop automation for anomaly detection, recovery, toil reduction, self-healing, and cloud cost optimization.
- Lead implementation of DevSecOps best practices and maintain secure CI/CD pipelines (e.g., GitLab, Jenkins, Docker).
- Enforce security through IAM, RBAC, vulnerability remediation, and security scanning tools (SAST/DAST/SCA).
- Architect and manage microservices, serverless solutions, and APIs with a focus on fault tolerance and resilience.
- Implement and maintain monitoring, logging, and observability using CloudWatch, Splunk/SignalFx, Dynatrace, and OpenTelemetry.
- Drive incident management, lead root cause analysis and postmortems, and minimize MTTR/MTTD.
- Define and monitor system reliability metrics (SLOs, SLIs, error budgets).
- Conduct chaos engineering and resiliency assessments, and implement self-healing architectures.
- Manage and optimize databases (PostgreSQL, MongoDB, DynamoDB, Oracle, Redshift) and provide production support.
- Participate in on-call rotations and support incident/problem management.
- Collaborate with development, QA, and operations teams to implement shift-left testing practices (BDD, TDD, unit and regression testing).
- Maintain architecture diagrams, knowledge documentation, and disaster recovery plans.
- Communicate effectively with stakeholders and demonstrate strong relationship management across teams.