Job Summary:
Royal Caribbean Group, a leader in the vacation industry, is seeking a Senior Engineer in Site Reliability. The role owns and evolves the enterprise observability platform, ensures system visibility across a complex technology environment, and drives improvements in service reliability.
Responsibilities:
• Own and evolve the enterprise observability platform spanning Cisco AppDynamics, Splunk, ThousandEyes, and PagerDuty AIOps across AWS and Azure environments.
• Architect and enforce a unified telemetry strategy — metrics, logs, traces, and events — standardized via OpenTelemetry across all application tiers.
• Design and govern telemetry data pipelines including ingestion, filtering, routing, and retention to optimize signal quality and platform cost at enterprise scale.
• Drive full-stack observability coverage across ship and shore environments, including maritime network paths, contact center platforms, and revenue-critical booking systems.
• Define and implement Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets for all critical services across the three Royal Caribbean Group (RCG) brands.
• Build alerting frameworks that minimize noise, surface actionable signals, and integrate cleanly with PagerDuty AIOps on-call workflows.
• Partner with SRE teams to drive MTTR reduction, post-incident observability improvements, and proactive reliability practices.
• Instrument and publish DORA metrics (Deployment Frequency, Lead Time, Change Failure Rate, MTTR) to support engineering productivity and release confidence.
• Drive AI-assisted incident detection, anomaly correlation, and root cause analysis using PagerDuty AIOps and Splunk IT Service Intelligence (ITSI).
• Tune and mature ML-based alert grouping and noise suppression models to reduce alert fatigue and accelerate triage.
• Integrate observability signals with ServiceNow ITSM for automated incident creation, enrichment, and closed-loop resolution workflows.
• Enable and govern Kubernetes observability for EKS and AKS workloads — container health, resource utilization, pod-level tracing, and cluster performance.
• Integrate observability instrumentation into CI/CD pipelines (GitHub Actions) to enable deployment-correlated performance analysis.
• Maintain and extend AWS CloudWatch and Azure Monitor integrations to ensure cloud infrastructure is fully represented in the observability estate.
• Define observability standards, instrumentation best practices, and onboarding frameworks for product and platform engineering teams.
• Mentor junior engineers and serve as the technical authority for observability discipline across SRE and Platform Engineering.
• Lead post-incident reviews (PIRs) and translate findings into observability platform improvements.
• Govern observability cost optimization: telemetry volume management, retention tiering, and platform licensing efficiency.
Qualifications:
Required:
• 6–9+ years in Observability, SRE, or Platform Engineering in enterprise-scale environments.
• Deep hands-on expertise with Cisco AppDynamics — APM configuration, business transaction mapping, code-level diagnostics, and baseline management.
• Strong proficiency with Splunk — SPL query development, ITSI service health trees, KPI configuration, alert policy management, and log pipeline design.
• Experience with Cisco ThousandEyes for network path monitoring, ISP/WAN intelligence, and BGP-level visibility.
• Proficiency with PagerDuty AIOps — intelligent alert grouping, noise suppression, event orchestration, and on-call workflow design.
• Strong command of OpenTelemetry — collector configuration, SDK instrumentation, semantic conventions, and multi-backend exporting.
• Hands-on Kubernetes experience (EKS/AKS) — container observability, resource metrics, and pod-level distributed tracing.
• Experience with AWS CloudWatch and/or Azure Monitor for cloud infrastructure observability.
• Scripting and automation proficiency: Python, Bash, Terraform, and/or Ansible for observability tooling deployment and configuration.
• Experience defining SLIs/SLOs, error budgets, and actionable alerting strategies tied to business service reliability.
• ServiceNow ITSM integration experience — event management, incident auto-creation, and CMDB-enriched alerting.
• Experience with CI/CD observability integration (GitHub Actions or equivalent).
Preferred:
• Experience with Prometheus, Grafana, Loki, or Tempo for supplemental or hybrid observability architectures.
• Familiarity with eBPF-based observability tooling (e.g., Pixie, Cilium) for deep kernel-level and network-layer visibility.
• Experience with synthetic monitoring and real user monitoring (RUM) to capture end-user experience across digital channels.
• Familiarity with Cribl or equivalent telemetry pipeline tooling for data routing, enrichment, and cost governance.
• Exposure to DORA metrics instrumentation and developer experience observability frameworks.
• Experience in large-scale hospitality, travel, maritime, or consumer digital platforms.
• Certifications: Cisco AppDynamics Certified Associate, Splunk Core Certified Power User, AWS Solutions Architect, Kubernetes (CKA/CKAD), or OpenTelemetry Certified Associate (OTCA/CNCF).
Company:
Royal Caribbean Group is a cruise vacation company with a global fleet of 63 ships. Founded in 1968, the company is headquartered in Miami, USA, and has more than 10,000 employees. The company is currently late stage.