The work
As a Site Reliability Engineer, you will play a pivotal role in advancing operational AI adoption within a cutting-edge Hub-and-Spoke architecture. Your primary focus will be ensuring the reliability, scalability, and continuous monitoring of enterprise AI systems that support mission-critical applications and enterprise AI governance.
Key responsibilities:
- Ensure the reliability, scalability, and performance of enterprise AI systems within a modern Hub-and-Spoke architecture
- Lead incident response efforts to minimize downtime and maintain service continuity
- Implement and manage SLOs/SLAs, capacity planning, and performance optimization strategies
- Operate and enhance observability platforms using OpenTelemetry, Prometheus, Grafana, Loki, and Tempo
- Drive FinOps practices to optimize operational costs and resource utilization
- Collaborate with cross-functional teams in AI, DevSecOps, data engineering, platform engineering, and cybersecurity
- Integrate monitoring and continuous feedback mechanisms for mission applications and agentic AI systems
- Support enterprise AI governance and scalable software delivery through robust operational workflows
- Proactively identify and resolve reliability and performance issues in production environments
- Own incident response, performance optimization, and capacity planning, working closely with cross-functional teams to integrate AI, DevSecOps, data engineering, and cybersecurity into seamless operational workflows
- Maintain robust observability operations and support scalable software delivery for agentic AI systems
Here's what you need:
- Experience with OpenTelemetry, Prometheus, Grafana, Loki, and Tempo to enhance system observability and performance
- Hands-on experience with SLO/SLA management, FinOps practices, and advanced monitoring techniques to proactively identify and resolve issues before they impact mission outcomes
- Exposure to complex integration efforts, continuous delivery pipelines, and mission-focused operational environments
- Experience with reliability engineering, incident response, and FinOps
Eligibility requirements:
- Must be a U.S. citizen
- An active TS/SCI clearance is required