Job Summary:
STACK Infrastructure is an award-winning industry leader providing digital infrastructure for innovative companies. The Cloud Infrastructure & Automation Engineer will own the cloud platform and operational infrastructure, ensuring the successful deployment and reliability of automation and data initiatives.
Responsibilities:
• Design, deploy, and manage Azure infrastructure across dual EA subscriptions (Dev/Non-Prod and Production) including Databricks workspaces, AI Search clusters, Cosmos DB instances, ADLS Gen2, Azure OpenAI Service endpoints, and Azure Functions.
• Implement Infrastructure-as-Code using Terraform, Bicep, or ARM templates with modular, version-controlled patterns enabling new workloads to deploy within hours.
• Configure Azure networking (VNets, Private Endpoints, NSGs, Private DNS) for secure, globally distributed platform environments across AMER, EMEA, and APAC.
• Build container-based deployment patterns (Azure Container Apps, AKS) for API serving, agent hosting, model inference, and automation execution.
• Provision and manage LLM/SLM serving infrastructure: Azure OpenAI deployments, model endpoints, token-based scaling, and multi-region failover.
• Design end-to-end CI/CD pipelines (Azure DevOps, GitHub Actions) for application deployment, model promotion, data pipeline orchestration, and automated testing with blue/green and canary patterns.
• Build MLOps pipelines for model registration, versioning, A/B testing, canary deployment, and automated rollback of LLM endpoints and RAG configurations.
• Deploy and manage automation runtime infrastructure: Azure Logic Apps, Power Automate, Azure Functions, Durable Functions, and event-driven triggers for intelligent workflows.
• Maintain agent hosting environments (Chainlit, FastAPI, Teams bots) for the HR PM Agent and future agentic solutions, with auto-scaling and health monitoring.
• Create reusable deployment accelerators (Terraform modules, Helm charts, pipeline templates) to reduce time-to-production for each successive initiative.
• Drive Azure cost optimization: commitment-tier analysis, right-sizing, automated shutdown policies, and token consumption tracking across LLM endpoints.
• Implement RBAC, managed identities, Key Vault integration, and least-privilege access across all platform components.
• Ensure SOX compliance, data residency, and governance using Microsoft Purview, Defender XDR, and Azure Policy.
• Manage secrets, certificates, API key rotation, and Entra ID integration for platform authentication across global regions.
• Produce monthly infrastructure cost and performance reports with spend trends, cost-per-query, and optimization metrics.
Qualifications:
Required:
• 7+ years of cloud infrastructure/DevOps experience, including at least 2 years supporting AI/ML, automation, or data platform workloads at scale.
• Expert-level Azure skills: Databricks, Cosmos DB, Azure Functions, Logic Apps, ADLS Gen2, Azure AI Search, Azure OpenAI Service, Container Apps/AKS, and Azure Monitor.
• Strong IaC proficiency: Terraform (modules, state, workspaces), Bicep, or ARM templates with environment-templated patterns.
• Hands-on CI/CD engineering: Azure DevOps, GitHub Actions, container registries, Helm charts, and blue/green and canary deployment automation.
• Solid Python and Bash skills for infrastructure tooling, automation scripts, and deployment utilities.
• Deep understanding of Azure networking, security (RBAC, managed identities, Key Vault, Private Endpoints, Azure Policy), and cost management.
• Experience with containerization (Docker) and orchestration (AKS or Container Apps) for production workloads and model serving.
• Familiarity with AI platform infrastructure: Databricks provisioning, Cosmos DB scaling, AI Search management, and LLM endpoint deployment.
Preferred:
• Experience deploying RAG platform infrastructure, vector search clusters, and LLM/SLM serving endpoints in production.
• Hands-on MLOps: model registries, experiment tracking, automated deployment pipelines, and A/B testing infrastructure.
• Background in enterprise IT environments with M365, Intune, and Entra ID.
• Azure certifications: AZ-104, AZ-400, AZ-305.
• FinOps certification or demonstrated cloud cost optimization experience delivering measurable savings.
• Experience supporting global operations across AMER, EMEA, and APAC with high-availability requirements.
Company:
STACK Infrastructure provides digital infrastructure solutions, focusing on data centers, colocation, and build-to-suit projects. Founded in 2019, the company is headquartered in Denver, USA, and has 1,001 to 5,000 employees. It is currently a late-stage company.