We are seeking a Site Reliability Engineer (SRE) who will drive stability, reliability, and performance across our Azure- and GCP-based platforms.
This role blends operational excellence, proactive incident management, and strong collaboration with DevOps, Cloud, and Security teams.
The ideal candidate will have hands-on experience with multi-cloud environments (Azure and GCP), IaC (Terraform/Ansible), CI/CD (Jenkins/GitHub Actions), and modern observability and AI-Ops systems. The engineer will also contribute to governance, cost optimization, and automation strategies that reduce toil and prevent issues before they occur. A key aspect of this role is the ability to perform deep-dive troubleshooting of application performance and errors by analyzing logs and traces in platforms like Grafana and Datadog.
This position includes rotational 24/7 support coverage and requires strong ownership of major incidents, RCA processes, and continuous service improvement.
Key Responsibilities
Reliability & Incident Management
- Serve as a key member of the 24/7 on-call rotation, responding to and managing incidents across production and pre-production environments.
- Lead incident bridges, coordinate root cause analysis (RCA), and ensure post-incident reviews drive systemic improvements.
- Maintain clear communication with cross-functional teams and leadership during major incidents.
Monitoring, AI-Ops, Alerts & Prevention
- Build, tune, and maintain observability dashboards (Azure Monitor, GCP Operations Suite, Prometheus, Grafana, Datadog, Log Analytics).
- Perform deep-dive troubleshooting of application and service-level issues using distributed tracing and log analysis (Grafana, Datadog) to pinpoint root causes beyond infrastructure.
- Define SLOs, SLIs, and error budgets to proactively identify and mitigate reliability risks before customer impact.
- Integrate AI-Ops tools for anomaly detection, predictive alerting, and automated incident correlation.
- Continuously enhance alert quality, reduce false positives, and automate runbooks for faster recovery.
- Analyze incident and alert trends to prevent recurring issues, and help teams adopt resilience engineering practices.