DevOps Engineer
Location: Remote
We are seeking a highly skilled and experienced Principal Software Engineer focused on Agentic AI and DevOps. The ideal candidate will architect and deliver agentic microservices and platform capabilities, lead cloud-native DevOps at scale, and partner with organizational leaders to communicate strategy, status, and results. Deep hands-on expertise with Azure, Kubernetes, CI/CD, infrastructure as code, and LLM/agent frameworks (LangChain/LangSmith/OpenAI/LiteLLM) is essential. Experience with dataflow orchestration (Apache NiFi), enterprise integrations (ServiceNow/Snowflake/Power BI/SharePoint), and production-grade observability is highly desirable.
What You'll Do:
- Architect, build, and operate agentic AI services and microservices leveraging LangChain, LangSmith, OpenAI/Azure OpenAI, and LiteLLM; implement tool-use orchestration, evaluation, and guardrails.
- Design, build, and maintain CI/CD pipelines using Azure DevOps (ADO) YAML and GitHub Actions; enforce trunk-based workflows, quality gates, progressive delivery, and automated rollbacks.
- Stand up and manage Azure infrastructure (AKS, Service Bus, Event Hubs, Storage Accounts, Key Vault, Bastion); codify environments with Terraform; implement secure networking, secrets, and RBAC.
- Containerize and ship services with Docker/Buildah; operate Kubernetes with CNI networking and Linkerd service mesh; implement canary/blue-green strategies and autoscaling.
- Create and operate Apache NiFi dataflows; deploy and manage NiFi clusters on AKS with VM Scale Sets, enabling resilient, scalable ingestion and orchestration.
- Implement enterprise-grade observability and logging: ELK/EFK (Elasticsearch, Fluentd/Fluent Bit, Kibana), Prometheus metrics, Azure Dashboards, and KQL-based alerting.
- Engineer data and analytics integrations: Azure Databricks, PostgreSQL, Snowflake; operationalize Power BI, SharePoint, and Jupyter-based workflows.
- Build robust platform and app integrations: ServiceNow APIs, REST APIs, SMTP/IMAP/POP email automations; configure and manage NGINX/HAProxy load balancers.
- Lead incident response, root-cause analysis, and postmortems; continuously improve reliability, performance, security, and cost.
- Mentor teams, drive architectural runway, and communicate plans, trade-offs, and outcomes to stakeholders and leadership.
Key Qualifications / Experience Required:
DevOps Experience
- Expert-level hands-on DevOps across Azure and Kubernetes: CI/CD, Git workflows, infrastructure as code, automated testing, monitoring, and secure deployment.
- Proficiency with Azure DevOps (ADO) YAML pipelines and GitHub Actions; experience optimizing pipelines for cloud-native systems.
- Strong Kubernetes operations skills, including CNI networking and service mesh (Linkerd); container build and supply chain experience (Docker, Buildah).
- Observability at scale using ELK/EFK, Prometheus, Fluentd/Fluent Bit, and Azure Monitor dashboards and alerting (KQL).
Automation Skills
- Deep automation with PowerShell, Bash, and Python to eliminate toil across build, release, environment, and operational workflows.
- Infrastructure as Code expertise with Terraform (Azure resources: AKS, Service Bus, Event Hubs, Storage, Key Vault, Bastion).
- Proven track record of reducing manual intervention, increasing repeatability, and lowering MTTR through automation.
Agentic AI Experience
- Practical, production experience delivering agentic AI solutions (task orchestration, tool-use, planning, retrieval, and evaluation).
- Hands-on experience with LangChain, LangSmith (tracing/eval), OpenAI/Azure OpenAI, and LiteLLM integration; familiarity with prompt engineering, safety/guardrails, and LLM observability (e.g., Arize).
- Experience operationalizing AI services within DevOps pipelines and platform governance.
Technical Proficiency
- Apache NiFi expertise: authoring and governing dataflows; deploying and scaling NiFi clusters on AKS with VM Scale Sets.
- Azure services: AKS, Service Bus, Event Hubs (setup and integration), Storage Accounts (setup and integration), Key Vault, Bastion, and Azure Dashboards with Kusto Query Language (KQL).
- Data/analytics: Azure Databricks, PostgreSQL, Snowflake; Power BI and SharePoint integrations; Jupyter Notebook workflows.
- Networking fundamentals: DHCP/DNS; load balancer configuration and operations (NGINX, HAProxy); Kubernetes ingress best practices.
- Messaging and email protocols: SMTP, IMAP/POP.
- Microservices and app frameworks: Python and Node.js microservices (REST APIs), Electron build and packaging.
Required Technical Skills
- Windows PowerShell; Linux/Unix administration; Bash and Python.
- Azure Cloud (architecture, security, cost, RBAC); Azure DevOps (ADO) with YAML; GitHub Actions.
- Docker and Buildah; Kubernetes (CNI), Linkerd; ELK/EFK, Prometheus, Fluentd/Fluent Bit.
- Apache NiFi flow development and clustered operations on Kubernetes (AKS) with VM Scale Sets.
- Azure Databricks; PostgreSQL; Snowflake; REST APIs; ServiceNow APIs; Power BI; SharePoint.
- Azure Service Bus, Azure Event Hubs, Storage Accounts, Key Vault, Bastion.
- Jira; Jupyter Notebook; Azure Dashboards and KQL; SMTP/IMAP/POP.
- Python and Node.js microservice architecture; Electron build.
Project Management Skills
- Plan, schedule, and coordinate multi-team deliveries and releases; manage dependencies, risks, and change.
- Drive execution across platform, app, data, and AI workstreams with clear milestones and success criteria.
- Establish SLOs/SLAs and error budgets; align roadmaps to business priorities.
Communication and Interpersonal Skills
- Communicate architectural decisions, roadmaps, and trade-offs to technical and executive audiences.
- Lead cross-functional ceremonies; produce clear runbooks, architecture docs, and dashboards.
- Foster collaboration across engineering, product, security, and operations.
Analytical and Problem-Solving Abilities
- Rapid diagnosis and resolution of complex production issues; strong RCA and remediation planning.
- Attention to detail in reliability, security, performance, and cost optimization.
Adaptability and Continuous Learning
- Track and adopt evolving best practices in cloud, containers, DevOps, and agentic AI.
- Champion continuous improvement in engineering excellence and platform governance.
Experience and Education
- Typically requires 10–15+ years of experience in software engineering, DevOps/SRE, or platform engineering, with principal-level impact.
- Bachelor's degree in Computer Science, Information Technology, or related field preferred (or equivalent experience).
Secondary Skills and Experience (Desired)
- Design and Development
  - Define and design subsystems and interfaces; allocate responsibilities across services and platforms.
  - Translate non-functional requirements (security, reliability, scalability) into concrete designs.
- Technical Enablement
  - Provide technical enablement for components and subsystems; drive critical design decisions and reviews.
  - Establish patterns and reusable templates for CI/CD, IaC, and agentic service scaffolding.
- Continuous Delivery Pipeline
  - Plan, define, and implement the continuous delivery pipeline with quality gates, progressive delivery, and rollback strategies.
- Architectural Runway
  - Develop the architectural runway to support new features and capabilities; align with Solution and Enterprise Architects and portfolio stakeholders.
- Integration
  - Architect and implement integrations with external components, systems, and platforms (ServiceNow, Snowflake, Power BI, SharePoint, email systems, and enterprise identity/secrets).
Top Skills:
- Windows PowerShell; Linux/Unix administration; Bash and Python
- Azure Cloud (architecture, security, cost, RBAC); Azure DevOps (ADO) with YAML; GitHub Actions