#W2 only
Job title: MLOps Platform Engineer
Location: Reston, VA. Interviews are in person, so only candidates local to the East Coast will be considered.
Description:
The Data Modeling Analytics & AI Engineering team is seeking an experienced MLOps
Platform Engineer to design, build, and support enterprise-grade machine learning operations
capabilities. This role will play a key part in enabling scalable, reliable, and secure ML model
development and deployment across our cloud and container platforms.
This is a hands-on engineering role requiring strong expertise in AWS, Kubernetes (EKS),
CI/CD automation, containerization, and ML platform operations. The ideal candidate will have
solid engineering fundamentals combined with practical knowledge of ML workflows,
deployment patterns, and platform reliability.
Key Responsibilities
Platform Engineering & Operations
· Engineer, manage, and support MLOps platform components across AWS and EKS-based
environments.
· Oversee deployment, configuration, and operation of infrastructure used for ML training, batch
inference, and real-time model serving.
· Ensure platform availability, resilience, and performance across dev, test, and production
environments.
· Implement role-based access controls (RBAC), network policies, and scalable namespace
designs within EKS.
Model Deployment & CI/CD Automation
· Build and support CI/CD pipelines (GitLab) for model packaging, container image builds,
vulnerability scanning, and automated deployment flows.
· Enable standardized model release processes including environment promotion, versioning, and
rollback workflows.
· Integrate CI/CD with ML frameworks, model repositories, artifacts, and runtime environments.
Container & Kubernetes Workloads
· Design and manage EKS workloads supporting containerized ML jobs and microservices.
· Implement auto-scaling, resource quotas, cluster optimization, and multi-tenant workload
isolation.
· Support GPU- and CPU-based training and inference workloads.
Monitoring, Observability & Optimization
· Implement logging, monitoring, and alerting for ML pipelines, model endpoints, batch jobs,
and platform components.
· Analyze compute, storage, and data transfer usage to optimize cost efficiency across ML
workloads.
· Perform incident response, root cause analysis, and long-term remediation planning.
Collaboration & Enablement
· Partner with Data Scientists, ML Engineers, and application teams to operationalize end-to-end
machine learning solutions.
· Provide technical guidance on best practices for ML model lifecycle management, deployment
patterns, and scalable architectures.
· Contribute to documentation, runbooks, onboarding materials, and internal knowledge bases.
---
Required Qualifications
· 3+ years of hands-on experience with AWS services, including EKS, EC2, S3, IAM,
CloudWatch, and ECR.
· Strong experience operating and troubleshooting Kubernetes (preferably AWS EKS).
· Proficiency in containerization (Docker) and orchestration concepts.
· Strong programming/scripting experience in Python and Bash.
· Experience building and managing CI/CD pipelines (GitLab or equivalent).
· Familiarity with machine learning workflows, including training, inference, and model
monitoring.
· Experience with infrastructure-as-code (Terraform or CloudFormation).
· Experience supporting production platforms, including incident management and root cause
analysis.