Job Title: Site Reliability Engineer (SRE)
Location: Washington, DC (Onsite)
Clearance: TS/SCI
Position Overview
Seeking a highly motivated Site Reliability Engineer (SRE) to support mission-critical enterprise applications and infrastructure in a high-availability environment. The SRE will be responsible for ensuring system reliability, performance, scalability, and operational efficiency through proactive monitoring, automation, and rapid incident response.
This role bridges development and operations, partnering closely with engineering teams to ensure new capabilities are delivered without compromising production stability. The ideal candidate brings strong Linux expertise, automation skills, and hands-on experience with cloud-native and containerized environments.
Key Responsibilities
Monitoring & Performance
· Monitor system health, availability, and performance using enterprise observability tools
· Analyze metrics and logs to proactively detect and remediate issues
· Tune alerting to reduce noise and prioritize mission impact
Incident Management & Reliability
· Respond to and resolve production incidents across distributed environments
· Perform root cause analysis and lead post-incident reviews
· Implement corrective and preventive actions to improve resilience
· Participate in the on-call rotation for outages, upgrades, and urgent activities
Automation & DevOps Enablement
· Automate repetitive operational tasks to improve efficiency and reduce human error
· Support CI/CD pipelines and automated deployment workflows
· Develop scripts and tooling to improve reliability and repeatability
Platform & Infrastructure Support
· Maintain Linux/Unix systems and containerized workloads
· Support Kubernetes/Docker environments and microservices architectures
· Assist with configuration management and environment standardization
· Ensure secure and compliant system configurations
Collaboration & Continuous Improvement
· Partner with development teams to improve service reliability and performance
· Support backlog refinement and reliability engineering initiatives
· Document runbooks, procedures, and knowledge articles
· Contribute to continuous service improvement efforts
Required Qualifications
Education & Experience
· Bachelor's degree in Computer Science, Engineering, or related technical field
· Minimum 5 years of relevant technical experience
· At least 3 years of systems programming or SRE/DevOps experience
Technical Skills
· Strong proficiency in Python, Bash, or similar scripting languages
· Hands-on experience with Linux/Unix administration
· Experience with Kubernetes and Docker
· Familiarity with cloud platforms (AWS, Azure, or Google Cloud)
· Experience with monitoring and logging tools (e.g., Grafana, Kibana, Prometheus, ELK)
· Working knowledge of CI/CD tools (e.g., GitLab, Jenkins, ArgoCD)
· Understanding of microservices architecture and DevOps practices
· Experience with Git-based workflows
Infrastructure & Networking
· Knowledge of networking fundamentals, load balancers, and firewalls
· Experience with identity and access management (IAM, SSH, VPN, security groups)
· Experience deploying to on-premises or data center environments
Professional Skills
· Strong analytical and troubleshooting abilities
· Excellent time management and ability to work independently
· Effective written and verbal communication skills
· Experience using Jira and Confluence in an Agile environment
Preferred Qualifications
· Experience defining or working with SLIs, SLOs, and error budgets
· Familiarity with Helm and Kubernetes deployment pipelines
· Experience supporting high-availability or mission-critical systems
· Knowledge of security best practices and compliance frameworks