About the Role:
We are looking for a Senior DevOps Engineer to join our DevOps team at K Health. You will own and evolve the infrastructure underpinning a healthcare AI platform serving patients and enterprise health system partners. This is a high-ownership role: you will architect and operate cloud environments across K Health and its enterprise partners, lead complex infrastructure migrations, drive disaster recovery programs, and help build the next generation of AI-powered operations tooling. You will also mentor junior engineers and collaborate closely with product and engineering teams across the company. This is a hybrid role based in New York City (4 days/week in office) and includes participation in a daytime on-call rotation.
This role requires onsite presence in our New York City office 4 days a week and does not provide immigration support.
What you will do:
- Own the design, implementation, and evolution of our GKE-based Kubernetes infrastructure across K Health and enterprise partner environments.
- Build and maintain our Terraform modular infrastructure library, including reusable modules with automated testing, across GCP, Cloudflare, and AWS.
- Architect, build, and maintain GitLab CI/CD shared pipeline templates used by all engineering teams (build, test, security scanning, deployment).
- Own and maintain self-hosted infrastructure software running in-cluster, including GitLab, ArgoCD, Langfuse, DependencyTrack, NGINX Ingress, and others.
- Implement and support security and compliance controls across infrastructure and the software supply chain, including secrets management, pipeline secret detection, container scanning, and SOC2 and HIPAA compliance.
- Drive disaster recovery readiness: design failover scenarios, author runbooks, and lead periodic DR tests.
- Lead development of AI-powered operations tooling and agentic infrastructure.
- Monitor, troubleshoot, and improve production system reliability; respond to incidents during on-call shifts.
- Mentor junior DevOps engineers and establish team-wide engineering standards.
What we are looking for:
- 5+ years of experience in DevOps, platform engineering, or site reliability engineering.
- Deep, hands-on experience with Kubernetes and the surrounding ecosystem - Helm, Helmfile, ArgoCD, Kyverno, cert-manager, and NGINX Ingress.
- Extensive experience with Google Cloud Platform - GKE, Cloud SQL, Memorystore, Cloud Storage, IAM, and Workload Identity.
- Strong Terraform expertise: modular architecture, multi-environment provisioning, and automated testing.
- Advanced knowledge of GitLab CI/CD and GitOps practices.
- Proficiency in Python and/or Go.
Plus:
- Advanced Bash scripting skills.
- Experience with secrets management solutions such as Akeyless or HashiCorp Vault.
- Experience with database administration across PostgreSQL, Redis, and MongoDB - including DR configuration and operational runbooks.
- Experience with Datadog or equivalent observability platform (APM, infrastructure, log management).
- Experience with Cloudflare for DNS, CDN, and security rules management.
- Demonstrated experience designing and executing disaster recovery programs, including failover testing and runbook authorship.
Bonus:
- Experience in highly regulated environments - SOC2 and HIPAA.
- Excellent communication skills with the ability to lead cross-functional infrastructure initiatives.
- Demonstrated leadership experience, including mentoring junior engineers.
- Experience with HPC or GPU cluster infrastructure, including Slurm.
- Experience building or operating AI agents or agentic infrastructure.
- Experience with microservices architecture and API gateway / reverse proxy patterns.
- Experience with AWS.
Benefits & Perks:
- Hybrid work schedule with weekly lunches and stocked fridges
- Monthly social committees for company events
- 18 vacation days, 9 company holidays, 5 sick days, and 2 personal days
- Stock options for every full-time employee
- Paid parental leave
- 401k benefit
- Commuter benefits
- Competitive health, dental, and vision insurance options