Job Summary:
MANTECH seeks a motivated Site Reliability Engineer (SRE) for a new initiative that supports the rapid design and operation of enterprise-scale AI and data capabilities. The role focuses on ensuring operational reliability and optimizing system performance for enterprise AI systems.
Responsibilities:
• Apply core reliability engineering principles to ensure high availability and stability of production systems.
• Manage incident response, root cause analysis, and post-mortem processes for the AI platform.
• Implement and optimize observability using tools such as OpenTelemetry, Prometheus, Grafana, Loki, and Tempo.
• Oversee capacity planning, performance optimization, and FinOps practices.
• Define and continuously monitor Service Level Objectives (SLOs) and Service Level Agreements (SLAs).
Qualifications:
Required:
• Bachelor’s degree in Computer Science, Engineering, or a related technical discipline.
• 5 or more years of experience in Site Reliability Engineering (SRE), DevOps, or production operations.
• Extensive experience with cloud-native infrastructure, particularly Kubernetes.
• Deep knowledge of monitoring, alerting, and logging systems.
• Proven ability to automate operational tasks and reduce toil.
• For onsite work, a TS/SCI clearance with polygraph is required.
• Ability to remain in a stationary position 50% of the time.
• Ability to communicate frequently with co-workers, management, and customers, which may involve delivering presentations.
• Ability to constantly operate a computer and other office productivity machinery.
Preferred:
• Hands-on experience with the full observability stack: OpenTelemetry, Prometheus, Grafana, Loki, and Tempo.
• Experience with FinOps and optimizing cloud resource consumption.
• Experience supporting high-scale distributed systems in a secure environment.
Company:
ManTech is a technology company that offers cyber, IT, and data analytics technologies and solutions for security programs. Founded in 1968, the company is headquartered in Herndon, Virginia, USA, and has between 5,001 and 10,000 employees. It is currently a late-stage company.