General Information
Ref #: 51550
Department: Data / technology
Job Site: Mission Pet Health
Date Published: 06-10-2026
Pay Class: Full-Time
Job Description
About the Role
We're looking for a Senior DevOps Engineer to own our cloud infrastructure end-to-end - from operating a large multi-tenant Kubernetes environment to building CI/CD pipelines that teams actually trust. You'll work across AWS, drive infrastructure-as-code standards, and lead our migration toward GitLab CI and a Grafana-based observability stack while keeping production environments stable.
What You'll Do
- Operate and scale a multi-tenant AWS EKS cluster where each client runs an isolated set of application services - owning tooling to onboard, scale, and observe hundreds of service instances reliably
- Build and improve CI/CD pipelines in GitLab CI and GitHub Actions with automated testing, static analysis, and build-gated releases; maintain ArgoCD GitOps workflows for production deployments
- Lead the migration from Datadog to a self-managed Grafana observability stack (Grafana, Loki, Mimir/Prometheus, Tempo) - dashboards, SLOs, alert routing, and on-call integration
- Manage secrets, IAM, and security scanning pipelines using AWS KMS, Secrets Manager, external-secrets operator, and Auth0/Dex OIDC - enforcing least-privilege across all environments
- Own and evolve the Redpanda (Kafka-compatible) streaming layer and its integrations with application workers
- Drive cloud cost optimization through right-sizing, autoscaling, and shared infrastructure patterns on EKS
- Document infrastructure with automated tooling (terraform-docs) and maintain standards that scale across teams
- Automate operational toil - certificate renewal, clinic environment provisioning, deployment validation, runbook automation
Responsibilities and Benefits
What We're Looking For
Required
- 5+ years in DevOps or infrastructure engineering
- 3+ years operating Kubernetes in production - AWS EKS preferred - including CSI drivers, cluster autoscaling, network policy (Calico), and pod identity
- 3+ years hands-on with AWS core services (IAM, S3, KMS, Secrets Manager, STS, EKS, Load Balancer Controller, ECR)
- Strong Terraform experience; GitOps experience with ArgoCD or Flux
- Hands-on experience with GitLab CI and/or GitHub Actions
- Scripting proficiency in Python and Bash
- Experience with IAM design and security best practices (SAST/DAST, secret scanning, OIDC federation)
- Familiarity with streaming or message-queue infrastructure (Redpanda, Kafka, or equivalent)
Nice to Have
- Experience migrating from a SaaS observability tool (Datadog, New Relic) to a self-hosted Grafana stack
- Grafana stack depth - Loki for logs, Mimir or Thanos for metrics, Tempo for traces, Alertmanager for routing
- Experience with Redpanda specifically, or deep Kafka operations knowledge
- Background in multi-tenant SaaS platforms or per-customer service isolation patterns
- AWS certification
- Familiarity with chaos engineering tooling (chaos-mesh or LitmusChaos)
- Background in software engineering or scripting-heavy roles
Tech Stack
Current production: AWS (EKS, S3, KMS, Secrets Manager, STS, Load Balancer) • Terraform • GitHub Actions • ArgoCD • Kubernetes • Traefik • Coraza WAF • Redis HA • MongoDB • Auth0 • Dex • external-secrets • Datadog • Docker • Python • Bash • Linux
Where we're going: GitLab CI • Redpanda • Grafana • Loki • Prometheus/Mimir • Tempo • Alertmanager
Platform components you'll operate: ArgoCD • Traefik • Coraza WAF • Auth0 • Dex • Redis HA • MongoDB • API servers • client-facing portals • internal tooling
Why Join Us
- Own infrastructure across a real multi-tenant platform serving production clinic environments
- Lead the observability and streaming migrations - greenfield decisions with lasting impact
- Collaborative engineering culture with high trust and low bureaucracy
- Competitive salary, benefits, and flexible work arrangements