AI Reliability Engineer Jobs in Washington (NOW HIRING)

Site Reliability Engineer

Washington, DC · On-site

$64.25 - $85.50/hr

The role focuses on ensuring operational reliability and optimizing system performance for enterprise AI systems. Responsibilities : • Apply core reliability engineering principles to ensure high ...

Manage incident response, root cause analysis, and post-mortem processes for the AI platform ... , DevOps, or production operations. * Extensive experience with cloud-native infrastructure ...

Site Reliability Engineer

Herndon, VA · On-site

$86.80K - $198K/yr

Site Reliability Engineer The Opportunity: Engineering to make a system more resilient and ... Candidate AI Usage Policy AI is a part of our daily work at Booz Allen, and we are committed to the ...

Leidos Digital Modernization sector is seeking an experienced Senior Reliability Engineer to ... Knowledge of AI/ML model serving and deployment. * Experience in participating in Engineering ...

Site Reliability Engineer - Hybrid

Reston, VA · On-site

$59.25 - $78.75/hr

AI/ML: We have certain machine learning projects which the SRE interacts with. So, AI/ML experience is a plus to have. * Previous Fannie Mae experience is a plus. Overall years of experience: 8+ ...

AI Reliability Engineer information

What are the key skills and qualifications needed to thrive as an AI Reliability Engineer, and why are they important?

To thrive as an AI Reliability Engineer, you need a solid background in computer science or engineering, expertise in AI/ML concepts, and experience with software testing and reliability methodologies. Familiarity with tools like TensorFlow, PyTorch, CI/CD pipelines, and reliability testing frameworks, along with certifications in cloud platforms (e.g., AWS Certified Machine Learning), is highly valuable. Analytical thinking, problem-solving abilities, and strong collaboration skills set top performers apart in this role. These skills ensure robust, dependable AI systems that meet performance standards and maintain trust in critical applications.

What are some common challenges AI Reliability Engineers face when ensuring model robustness in production environments?

AI Reliability Engineers often face challenges such as detecting drift or unexpected behavior in model performance, managing data quality issues, and implementing automated alerting for anomalies. In production, AI models must behave consistently and stay reliable as conditions and input data vary. Close collaboration with data scientists, software engineers, and DevOps teams is essential to address these challenges and to keep improving model reliability and uptime.
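
As a rough illustration of the drift monitoring mentioned above, the sketch below computes a Population Stability Index (PSI) between a baseline sample and a live sample of a single numeric feature. The function name, the bin count, and the 0.25 alert threshold are assumptions chosen for illustration (0.25 is only a commonly quoted rule of thumb); none of them come from the job listings on this page.

```python
import math
from typing import Sequence


def population_stability_index(
    baseline: Sequence[float],
    live: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Compare two samples of one feature; larger values suggest more drift."""
    lo, hi = min(baseline), max(baseline)
    # Equal-width bins over the baseline range; fall back to 1.0 for constant data.
    width = (hi - lo) / bins or 1.0

    def bin_fractions(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range values into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm below stays defined.
        return [max(c / len(values), eps) for c in counts]

    expected = bin_fractions(baseline)
    actual = bin_fractions(live)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


# Hypothetical data: a uniform baseline and a live sample shifted upward.
baseline_sample = [i / 100 for i in range(100)]
shifted_sample = [0.5 + i / 200 for i in range(100)]

ALERT_THRESHOLD = 0.25  # assumed rule-of-thumb cutoff, tune per model
if population_stability_index(baseline_sample, shifted_sample) > ALERT_THRESHOLD:
    print("drift alert: live inputs no longer resemble the baseline")
```

In practice a check like this would run on a schedule for each monitored feature, with the result exported to whatever alerting system the team already uses.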

What are AI Reliability Engineers?

AI Reliability Engineers are professionals responsible for ensuring that artificial intelligence systems function reliably, safely, and effectively over time. They work on monitoring AI models in production, identifying and mitigating potential failures, and improving the robustness of AI systems. Their tasks often include testing, validation, performance monitoring, and implementing best practices for maintaining AI infrastructure. By focusing on reliability, they help organizations deploy AI solutions that are dependable and trustworthy in real-world environments.

What is a $900,000 AI job?

A $900,000 AI job typically refers to highly senior roles such as AI executives, chief AI officers, or lead AI engineers at top technology companies, often involving advanced expertise in machine learning, deep learning, and AI strategy. These positions usually require extensive experience, specialized skills, and may include performance-based bonuses or stock options that contribute to the high total compensation.

What is the difference between an AI Reliability Engineer and a Data Scientist?

Aspect | AI Reliability Engineer | Data Scientist
Required Credentials | Bachelor's or master's in CS, engineering, or related; certifications in AI/ML | Bachelor's or master's in CS, statistics, or related; certifications in data analysis or ML
Work Environment | Tech companies, AI-focused teams, engineering departments | Research labs, tech firms, analytics teams
Employer & Industry Usage | AI product development, machine learning systems, reliability testing | Data analysis, predictive modeling, business insights

While both roles involve AI and ML, AI Reliability Engineers focus on keeping AI systems robust and available, whereas Data Scientists analyze data to build models and generate insights. The roles often collaborate but serve different primary functions within AI projects.

Site Reliability Engineer

MANTECH

Washington, DC • On-site

$64.25 - $85.50/hr

Full-time

Posted 6 days ago


ManTech rating

Company rating: 8.8 out of 10

Based on 13 frontline employees who took The Breakroom Quiz

31st of 183 rated software companies


Job description

Job Summary:
MANTECH seeks a motivated Site Reliability Engineer (SRE) for a new initiative that supports the rapid design and operation of enterprise-scale AI and data capabilities. The role focuses on ensuring operational reliability and optimizing system performance for enterprise AI systems.
Responsibilities:
• Apply core reliability engineering principles to ensure high availability and stability of production systems.
• Manage incident response, root cause analysis, and post-mortem processes for the AI platform.
• Implement and optimize observability operations using OpenTelemetry, Prometheus, Grafana, Loki, or Tempo.
• Oversee capacity planning, performance optimization, and FinOps practices.
• Define and continuously monitor Service Level Objectives (SLOs) and Service Level Agreements (SLAs).
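
For readers unfamiliar with how an SLO turns into something measurable, the sketch below converts an availability target into an error budget (the downtime the target permits) and reports how much of that budget is left. The function names, the 30-day window, and the 99.9% target are illustrative assumptions, not details from the posting.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime in minutes that an availability SLO permits over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(
    slo_target: float, downtime_minutes: float, window_days: int = 30
) -> float:
    """Fraction of the error budget still unspent; negative means the SLO is breached."""
    budget = error_budget_minutes(slo_target, window_days)
    return 1.0 - downtime_minutes / budget


# Hypothetical example: a 99.9% target over 30 days with 21.6 minutes of downtime so far.
print(f"budget: {error_budget_minutes(0.999):.1f} min")
print(f"remaining: {budget_remaining(0.999, 21.6):.0%}")
```

A 99.9% target over 30 days allows roughly 43.2 minutes of downtime, so 21.6 minutes of outage would leave about half of the budget.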
Qualifications:
Required:
• Bachelor’s degree in Computer Science, Engineering, or a related technical discipline.
• 5 or more years of experience in Site Reliability Engineering (SRE), DevOps, or production operations.
• Extensive experience with cloud-native infrastructure, particularly Kubernetes.
• Deep knowledge of monitoring, alerting, and logging systems.
• Proven ability to automate operational tasks and reduce toil.
• For onsite work, a TS/SCI clearance with Poly will be required.
• The person in this position must be able to remain in a stationary position 50% of the time.
• Frequently communicates with co-workers, management, and customers, which may involve delivering presentations.
• Constantly operates a computer and other office productivity machinery.
Preferred:
• Hands-on experience with the full observability stack: OpenTelemetry, Prometheus, Grafana, Loki, and Tempo.
• Experience with FinOps and optimizing cloud resource consumption.
• Experience supporting high-scale distributed systems in a secure environment.
Company:
ManTech is a technology company that offers cyber, IT, and data analytics technologies and solutions for security programs. Founded in 1968, the company is headquartered in Herndon, Virginia, USA, and has 5,001-10,000 employees. It is currently a late-stage company.
