Job Title: Grafana Observability SME
Location: Onsite - Poughkeepsie, NY
Job Description:
Role Summary
Own the end-to-end technical design, build, and operationalization of the Grafana Cloud observability platform for a 50-application estate spanning Java, .NET, Go, Python, and Node.js workloads hosted across on-premises data centres, AWS, and Azure. The SME serves as the senior technical authority across all eight in-scope Grafana Cloud modules and is accountable for instrumentation strategy, alerting design, dashboarding standards, and integration into ServiceNow ITOM via native Event Management. Scope is application-level observability only - server and network health remain on SolarWinds, and URL/synthetic monitoring remains on Uptrends.
Key Responsibilities
• Platform architecture and configuration across all eight in-scope Grafana Cloud modules: Grafana 12 (visualization), Mimir (metrics, 13-month retention), Loki (logs), Tempo (distributed tracing via OTLP), Alloy (telemetry collection agent), Beyla (eBPF zero-code auto-instrumentation), Application Observability (OTel-native APM), and Unified Alerting.
• Tenancy and access design - organizations, folders, teams, role-based access control, dashboard variables, template links, and annotations.
• Application instrumentation strategy by technology stack: Beyla eBPF as the default zero-code path for Simple and Medium apps; OpenTelemetry SDKs/agents (Java, .NET, Go, Python, Node.js) for Complex apps requiring deeper traces and custom metrics; JMX Exporter, prometheus_client, and runtime-specific exporters where stack-appropriate.
• Log pipeline engineering via Alloy - structured JSON, Log4j/Logback, Serilog, NLog, Windows Event Log, Winston, Pino, loguru - with parsing rules tuned per stack and LogQL-based dashboards and alerts.
• Alerting design - PromQL/LogQL/TraceQL rules, severity taxonomy, grouping, routing, and notification policies. Build a low-noise, actionable alert feed; tune thresholds iteratively with application owners.
• Single Pane of Glass - design and deliver a tiered SPoG that surfaces Grafana application telemetry alongside contextual links to SolarWinds and Uptrends.
• Business Dashboards and Reporting - partner with the Dashboard Lead to define KPI taxonomy and ensure dashboard-as-code patterns and version control.
• ServiceNow ITOM integration - co-own the design and review of the Grafana → ServiceNow Event Management flow (native inbound integration): event allow-list governance ("deny by default"), enrichment, deduplication, AIOps correlation, automated incident creation with severity mapping and assignment group rules, CMDB CI attachment, and ServiceNow-as-master incident state.
• Quality assurance authority across all technical deliverables - solution architecture document, instrumentation runbooks, dashboard and alert library, integration test results.
• Phased delivery execution - Mobilise & Discover → Application Foundation (ML1) → Onboarding of 40 Simple apps (ML2) → Medium/Complex apps + ITOM Integration (ML2→3) → SPoG, Dashboards & Reporting (ML3→4) → Stabilisation, KT, and post-deployment support (ML4).
• Knowledge transfer - produce platform operating procedures and conduct structured handover to the client's run team.
Required Skills & Experience
• 7+ years in observability/monitoring engineering with deep, recent hands-on Grafana Cloud experience (not just OSS Grafana).
• Production expertise across the full Grafana stack: Mimir, Loki, Tempo, Alloy, Beyla, Grafana Application Observability, Unified Alerting.
• Strong PromQL, LogQL, and TraceQL authoring skills; able to write recording rules and SLO queries from scratch.
• OpenTelemetry practitioner - OTLP, collectors, SDK/agent instrumentation for at least three of Java, .NET, Go, Python, Node.js.
• eBPF-based auto-instrumentation experience with Beyla (or equivalent - Pixie, Cilium Tetragon) in a production context.
• Experience integrating Grafana alerts into ServiceNow Event Management (native inbound integration, not webhook-only patterns); familiarity with ServiceNow ITOM, AIOps event correlation, and CMDB CI attachment.
• Multi-environment hosting fluency - on-prem, AWS, Azure - and Linux/Windows host agent deployment at scale.
• Dashboard-as-code and GitOps patterns (Grafana provisioning, Terraform provider, or Grizzly).
• Excellent written communication - solution architecture documents, runbooks, and stakeholder-facing status reporting.
Nice to Have
• Grafana Certified Professional or equivalent vendor certification.
• Prior experience in a regulated utility, energy, or critical-infrastructure environment.
• Familiarity with SolarWinds and Uptrends (sufficient to design clean boundaries with retained tooling, not to administer them).
• Experience with ServiceNow CSDM and Service Mapping governance.
• Exposure to FinOps for observability - cardinality control, log volume management, retention tuning in Mimir/Loki.
Out of Scope for This Role
• Server health and network monitoring (owned by SolarWinds).
• URL/synthetic endpoint monitoring (owned by Uptrends).
• ServiceNow ITSM workflow ownership - incident lifecycle remains with the client's ITSM/ITOM team; this role designs the integration, not the downstream process.
Years of Experience: 12 years
Regards
Surya
surya@rurisoft.com