Job Summary:
Vertiv is a global critical infrastructure and data center technology company seeking a skilled Platform Operations Engineer (Site Reliability Engineer) to enhance operational reliability within its Digital organization. The role involves designing and implementing monitoring solutions, managing incident response, and ensuring the performance and resilience of digital platforms.
Responsibilities:
• Own Cross-Platform Monitoring & Observability: Design, implement, and maintain end-to-end monitoring, alerting, and observability solutions across Vertiv’s digital platform ecosystem — including AI platforms, automation tools, and internal applications — ensuring real-time visibility into system health, performance, and availability.
• Lead Incident Response & Management: Serve as the primary escalation point and incident commander for P1/P2 incidents across Digital platforms; lead root cause analysis (RCA), blameless post-mortems, and corrective action tracking to prevent recurrence and reduce mean time to resolution (MTTR).
• Manage Platform SLAs & Reliability Targets: Define, instrument, and enforce service level objectives (SLOs), service level indicators (SLIs), and error budgets across Digital platforms; produce regular SLA performance reports for leadership and drive platform improvements to meet or exceed agreed availability and performance targets.
• Drive Secure Coding & Operational Governance: Champion secure coding practices and DevSecOps standards within Digital delivery teams; conduct operational readiness reviews for new platform deployments, enforce configuration management and change control processes, and partner with IT Security and NPDI to ensure all platforms meet Vertiv’s security and compliance requirements.
• Automate Operations & Reduce Toil: Identify and eliminate manual operational toil through automation, including automated remediation runbooks and anomaly detection built with scripting, IaC tools, and approved automation platforms.
• Capacity Planning & Performance Engineering: Analyze platform utilization trends and conduct capacity planning across Digital environments; proactively identify performance bottlenecks and recommend architectural improvements to ensure platforms scale reliably with business demand.
• CI/CD Pipeline Reliability & Deployment Support: Partner with Digital delivery teams to ensure CI/CD pipelines are instrumented for reliability, deployment risk is managed through progressive rollout strategies, and production deployments are supported with appropriate rollback and health-check capabilities.
• Evaluate & Advance Observability Tooling: Stay current on advancements in observability, AIOps, and SRE tooling; evaluate and recommend new tools and practices that enhance Vertiv’s platform operations maturity, and drive adoption of modern reliability engineering standards across the Digital organization.
Qualifications:
Required:
• Bachelor’s degree in Computer Science, Information Systems, Engineering, or a related field; equivalent practical experience considered.
• 5+ years of professional experience in platform operations, site reliability engineering, DevOps, or a related software/infrastructure engineering discipline.
• 3+ years of hands-on experience with enterprise monitoring and observability platforms (e.g., Datadog, Grafana, Prometheus, Azure Monitor, Splunk, or equivalent) in a multi-platform environment.
• Demonstrated experience owning and managing incident response processes, post-mortem facilitation, and SLA/SLO governance.
• Experience implementing secure coding practices, DevSecOps standards, or operational governance frameworks in an enterprise software delivery environment.
• Proficiency with monitoring and observability tools (Datadog, Grafana, Prometheus, Azure Monitor, Splunk, or equivalent) for cross-platform health and performance tracking.
• Strong knowledge of SRE principles, including SLOs, SLIs, blameless post-mortems, and toil reduction practices.
• Hands-on experience with cloud platforms (AWS preferred) and familiarity with containerized environments (Docker, Kubernetes) and infrastructure-as-code tooling (Terraform, Ansible, or equivalent).
• Proficiency in multiple programming languages (Python, Ruby, PowerShell, Java, JavaScript, C#, etc.) for automation and runbook development.
• Experience with CI/CD platforms (GitLab, Jenkins, GitHub Actions, Azure DevOps, or equivalent) and deployment reliability practices including progressive rollout, feature flags, and automated health checks.
Preferred:
• Google SRE certification, AWS DevOps Professional, Azure certifications, or equivalent SRE/cloud operations certification.
• Experience with AIOps tooling or AI-assisted anomaly detection and automated remediation capabilities.
• Familiarity with the Vertiv digital platform ecosystem: Workato, UiPath, Power Automate, Compass AI, Writer AI, or Cursor.
• Experience applying DevSecOps practices, including SAST/DAST scanning, secrets management, and compliance-as-code in enterprise environments.
• Experience working in Agile/Scrum delivery environments; familiarity with ITIL incident and change management frameworks.
Company:
Vertiv designs, builds, and services critical infrastructure that enables vital applications for data centers and industrial facilities. Founded in 2016, the company is headquartered in Westerville, USA, and has more than 10,000 employees. The company is currently late stage.