Observability Datadog Jobs (NOW HIRING)

The Cloud Engineer - Senior (Observability - Datadog) supports the SEC ISS contract by engineering, operating, and continuously improving the enterprise observability platform across hybrid cloud and ...

San Ramon, CA FACE TO FACE INTERVIEW Datadog - Key Responsibilities Design, implement, and manage observability solutions using Datadog (Logs, Metrics, APM, RUM, Synthetics, etc.) Develop real-time ...

Observability Datadog information

Salary chart scale (hourly pay): $11, $17, $23

How much do Observability Datadog jobs pay per hour?

As of Jun 7, 2026, the average hourly pay for Observability Datadog jobs in the United States is $17.34, according to ZipRecruiter salary data. Most workers in this role earn between $16.35 and $18.03 per hour, depending on experience, location, and employer.

What are the key skills and qualifications needed to thrive as an Observability Engineer specializing in Datadog, and why are they important?

To excel as an Observability Engineer with a focus on Datadog, you need a strong background in IT operations, cloud infrastructure, and monitoring concepts, often supported by relevant degrees or certifications. Familiarity with Datadog's platform, scripting languages (like Python or Bash), and integrations with cloud services (AWS, Azure, GCP) is typically required. Analytical thinking, proactive problem-solving, and the ability to collaborate across teams are vital soft skills in this role. These skills ensure effective system monitoring, rapid incident response, and ongoing performance optimization in complex environments.

What are Observability Datadog roles?

Observability Datadog roles typically refer to professionals who implement, manage, and optimize observability practices using the Datadog platform. These specialists focus on monitoring application performance, infrastructure health, and ensuring real-time visibility into system operations. They configure dashboards, set alerts, and analyze logs, traces, and metrics to detect and resolve issues quickly. Their work helps organizations maintain system reliability, optimize performance, and improve incident response.

How does an Observability Datadog specialist typically collaborate with development and operations teams?

An Observability Datadog specialist works closely with both development and operations teams to ensure that applications and infrastructure are properly monitored. They often participate in sprint planning and incident response meetings, helping teams define meaningful metrics, set up dashboards, and configure alerting policies. Collaboration also includes training team members on best practices for using Datadog and troubleshooting monitoring issues together. This cross-functional role ensures that all stakeholders have visibility into system health and can respond quickly to performance or reliability concerns.

What is the difference between an Observability Datadog role and a Cloud Engineer role?

Aspect | Observability Datadog | Cloud Engineer
Primary Focus | Monitoring, analytics, and visualization of system performance | Designing, implementing, and managing cloud infrastructure
Required Skills | Monitoring tools, scripting, data analysis | Cloud platforms, scripting, infrastructure as code
Certifications | Datadog certifications, cloud provider certifications | AWS, Azure, or GCP certifications
Work Environment | IT operations, DevOps teams | Cloud infrastructure teams, DevOps

While both roles involve cloud technologies, Observability Datadog specialists focus on monitoring and analyzing system performance, whereas Cloud Engineers design and maintain cloud infrastructure. Understanding these differences helps organizations assign the right skills to each role.

Infographic showing various Observability Datadog job openings in the United States as of May 2026, with employment types broken down into 97% Full Time and 3% Contract. Highlights a 74% Physical, 6% Hybrid, and 20% Remote job distribution, with an average salary of $36,065 per year, or $17.3 per hour.

Cloud Engineer - Senior (Observability - Datadog)

Leidos

Remote

$57 - $76.25/hr

Full-time

Posted 17 days ago


Leidos rating

Company rating: 8.4 out of 10

Based on 146 frontline employees who took The Breakroom Quiz

56th of 425 rated business services


Job description

The Cloud Engineer - Senior (Observability - Datadog) supports the SEC ISS contract by engineering, operating, and continuously improving the enterprise observability platform across hybrid cloud and containerized environments. This role is hands-on: the engineer instruments services with distributed tracing, code-level profiling, and custom metrics; builds and tunes Datadog (or comparable) dashboards, alerts, APM, log pipelines, RUM, and synthetic monitors; and then uses that telemetry to solve production performance, reliability, and capacity problems. The engineer partners with cloud, platform, and application teams to embed observability into Azure, AWS, and container platforms (OpenShift/Kubernetes), and drives down alert noise, mean time to detect (MTTD), and mean time to resolve (MTTR). This position provides senior technical leadership for APM/distributed tracing strategy, SLO/SLI engineering, and data-driven operational decision-making in a 24x7x365 operating environment.
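For readers unfamiliar with the MTTD and MTTR figures named above, the sketch below shows one way they might be computed from incident records. The `Incident` record and its field names are hypothetical, and both metrics are measured from the start of the fault, which is one common convention and not necessarily the one used on this contract.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record; the field names are assumptions for illustration.
    started: datetime   # when the fault began
    detected: datetime  # when monitoring raised the alert
    resolved: datetime  # when service was restored


def mean_delta(deltas):
    """Average a non-empty list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents):
    """Mean time to detect: average of (detected - started)."""
    return mean_delta([i.detected - i.started for i in incidents])


def mttr(incidents):
    """Mean time to resolve: average of (resolved - started)."""
    return mean_delta([i.resolved - i.started for i in incidents])


incidents = [
    Incident(datetime(2026, 1, 1, 10, 0), datetime(2026, 1, 1, 10, 4),
             datetime(2026, 1, 1, 10, 40)),
    Incident(datetime(2026, 1, 2, 9, 0), datetime(2026, 1, 2, 9, 6),
             datetime(2026, 1, 2, 10, 20)),
]
print(mttd(incidents))  # 0:05:00
print(mttr(incidents))  # 1:00:00
```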
PRIMARY RESPONSIBILITIES
Observability Platform Engineering
- Engineer and operate the enterprise observability stack (Datadog or comparable), including metrics, logs, traces, APM, RUM, synthetic monitoring, and network performance monitoring.
- Build, tune, and maintain dashboards, monitors, SLOs/SLIs, and alerting policies that produce actionable signal and minimize noise.
- Instrument services, infrastructure, and containerized workloads using agents, OpenTelemetry, and language-specific APM tracers (Java, .NET, Python, Node.js, Go) with consistent span tagging, W3C TraceContext propagation, and unified service tagging across the estate.
- Develop and maintain integrations between observability platforms, ITSM (ServiceNow), CI/CD pipelines, and on-call/paging workflows.
- Define and enforce a unified tagging standard (environment, service, version, team/ownership, data classification, cost center) across metrics, logs, and traces; manage tag cardinality, governance, and custom business tags to keep telemetry queryable, attributable, and cost-controlled.
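The W3C TraceContext propagation mentioned in the list above relies on a `traceparent` HTTP header. As a minimal sketch, assuming only the version-00 layout from the W3C Trace Context specification (version, trace ID, parent ID, and flags as lowercase hex separated by hyphens), a validator might look like the following. The names `TraceParent` and `parse_traceparent` are hypothetical and do not come from any tracer library.

```python
import re
from typing import NamedTuple, Optional

# Version-00 layout: version(2 hex)-trace-id(32 hex)-parent-id(16 hex)-flags(2 hex)
_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is not a valid version-00 value."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    if version != "00":
        return None  # this sketch handles version 00 only
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


parsed = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(parsed.trace_id, parsed.sampled)
```

In practice the language-specific tracers and OpenTelemetry SDKs handle this header themselves. A hand-written check like this is mainly useful when debugging why a trace breaks at a service boundary.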
Cloud and Container Monitoring Engineering
- Design and deliver monitoring coverage for Microsoft Azure and AWS workloads, including PaaS services, serverless, networking, identity, managed databases, and cloud-native data services.
- Engineer managed database observability across AWS RDS/Aurora (MySQL, PostgreSQL, SQL Server, Oracle), Azure SQL/PostgreSQL/MySQL, and NoSQL/cache services (DynamoDB, Cosmos DB, ElastiCache/Redis), including query-level performance analytics, slow-query and execution-plan capture, lock/deadlock/wait analysis, connection pool and session monitoring, replication lag, storage/IOPS saturation, and backup/HA health -- correlating database spans with upstream APM traces.
- Engineer container-platform observability for OpenShift/Kubernetes, covering cluster health, control plane, nodes, pods, namespaces, ingress, service mesh, and workload APM.
- Build standardized, reusable monitoring modules deployable via infrastructure-as-code (Terraform, Bicep, ARM) and CI/CD.
- Support hybrid visibility across on-premises, cloud, and containerized workloads with correlated telemetry.
Performance Engineering and Problem Solving
- Lead data-driven investigation and resolution of complex performance, latency, saturation, and reliability issues across the estate.
- Use APM distributed traces, service/dependency maps, continuous code profiling (CPU, memory, lock contention), database query analytics, exception/error tracking, and RUM-to-backend trace correlation to isolate bottlenecks in applications, platforms, middleware, and downstream dependencies.
- Partner with engineering teams to define and implement remediation, tuning, and architectural improvements based on telemetry evidence.
- Define and implement trace-based SLOs, deployment tracking, and change-correlation workflows so performance regressions are detected and attributed to specific releases, versions, or configuration changes.
- Provide senior technical leadership during major incidents, delivering impact analysis, contributing to root-cause analysis, and owning post-incident observability gaps.
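Attributing a performance regression to a specific release, as described in the list above, can be illustrated with a small comparison of latency percentiles grouped by a `version` tag. This is only a sketch under stated assumptions: the sample data, the nearest-rank p95 method, and the 20% tolerance are all hypothetical choices, not details from the posting or from Datadog.

```python
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) using integers
    return ordered[rank - 1]


def p95_by_version(samples):
    """Group (version, latency_ms) pairs by version and compute p95 for each."""
    grouped = defaultdict(list)
    for version, latency_ms in samples:
        grouped[version].append(latency_ms)
    return {version: p95(latencies) for version, latencies in grouped.items()}


def regressed(samples, baseline, candidate, tolerance=0.20):
    """True if the candidate's p95 exceeds the baseline's p95 by more than tolerance."""
    stats = p95_by_version(samples)
    return stats[candidate] > stats[baseline] * (1 + tolerance)


# Hypothetical latency samples in milliseconds, tagged with the release version.
samples = [("1.4.0", ms) for ms in range(100, 120)]
samples += [("1.5.0", ms) for ms in range(150, 170)]

print(p95_by_version(samples))              # {'1.4.0': 118, '1.5.0': 168}
print(regressed(samples, "1.4.0", "1.5.0"))  # True
```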
Capacity, Reliability, and Continuous Improvement
- Analyze operational telemetry and trend data to identify capacity risks, recurring constraints, and opportunities for efficiency.
- Build and maintain capacity and performance dashboards and reports that communicate posture, risk, and recommendations to technical and leadership stakeholders.
- Define capacity thresholds, alert baselines, and trigger points for scaling, technology refresh, and resource reallocation.
- Drive continuous improvement of observability coverage, alert quality, runbook linkage, and operational maturity aligned to SEC SLA/KPI expectations.
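The capacity thresholds and trigger points described in the list above are often backed by a simple trend projection. The sketch below fits a least-squares line to daily utilization readings and estimates how many days remain before a threshold is crossed. The function name, the daily sampling interval, and the linear-growth assumption are all illustrative. Real capacity forecasting would usually account for seasonality and step changes.

```python
def days_until_threshold(daily_usage_pct, threshold_pct):
    """Project when a linear trend through the readings crosses the threshold.

    Returns days counted from the last reading, 0.0 if the last reading is
    already at or above the threshold, or None if the trend is flat or falling.
    """
    n = len(daily_usage_pct)
    if n < 2:
        raise ValueError("need at least two readings")
    if daily_usage_pct[-1] >= threshold_pct:
        return 0.0

    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_usage_pct) / n
    # Ordinary least-squares slope and intercept.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_usage_pct))
    slope /= sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x

    if slope <= 0:
        return None
    crossing_day = (threshold_pct - intercept) / slope
    return max(0.0, crossing_day - (n - 1))


# Hypothetical disk utilization growing two points per day.
print(days_until_threshold([60, 62, 64, 66, 68], 80))  # 6.0
```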
REQUIRED QUALIFICATIONS
Citizenship/Work Authorization: Must meet contract requirements.
Clearance: Ability to obtain and maintain SEC Public Trust (or higher if required).
EXPERIENCE
- Minimum 8 years of experience in IT infrastructure or platform engineering roles, including 5+ years focused on observability, performance engineering, or site reliability engineering.
- Demonstrated experience engineering and operating an enterprise observability platform (Datadog strongly preferred; equivalent experience with Dynatrace, New Relic, Splunk Observability, or Grafana/Prometheus stacks considered).
- Proven experience building APM and distributed tracing coverage for production multi-tier applications -- including language-specific tracer deployment, custom instrumentation of business transactions, service/dependency mapping, continuous profiling, and RUM-to-backend trace correlation -- across cloud and containerized workloads.
- Proven experience leading complex production performance and reliability problem-solving from telemetry to remediation.
- Hands-on experience monitoring Kubernetes or OpenShift clusters and containerized workloads in production.
TECHNICAL SKILLS
- Enterprise observability platforms (Datadog or comparable): metrics, logs, traces, APM, RUM, synthetic, NPM
- Instrumentation with OpenTelemetry, Datadog agents/SDKs, and language-specific APM tracers (Java, .NET, Python, Node.js, Go) including custom spans, trace sampling strategies, W3C TraceContext propagation, and continuous profiling
- Microsoft Azure and AWS monitoring services and integrations (Azure Monitor, Log Analytics, CloudWatch, AWS X-Ray)
- Container and Kubernetes/OpenShift observability, including cluster, workload, and service mesh telemetry
- Cloud database monitoring: AWS RDS/Aurora (including Performance Insights), Azure SQL/PostgreSQL/MySQL (Query Performance Insight), and NoSQL/cache (DynamoDB, Cosmos DB, ElastiCache/Redis); query-level performance tuning, execution-plan analysis, and Datadog DBM or equivalent deep database APM
- Infrastructure-as-code for monitoring (Terraform, Bicep, ARM) and CI/CD-driven monitor/dashboard deployment
- APM and distributed tracing: service/dependency maps, trace analytics, RUM-to-backend correlation, exception/error tracking, deployment tracking, and trace-based SLOs
- Unified tagging strategy and cardinality governance across metrics/logs/traces (environment, service, version, ownership, data classification, cost center), including custom tag enrichment and tag-driven access/cost controls
- Alert engineering, SLO/SLI design, error budget management, and alert-noise reduction
- Performance engineering, capacity analysis, and telemetry-driven root-cause analysis
- Integration of observability with ITSM (ServiceNow) and on-call/paging workflows
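The unified tagging and cardinality governance skills listed above can be pictured as two checks: does a piece of telemetry carry every required tag, and does any tag key have too many distinct values? The sketch below is an illustration only. The keys `env`, `service`, and `version` follow Datadog's unified service tagging convention, while `team`, `data_classification`, `cost_center`, and the function names are hypothetical stand-ins for the ownership, classification, and cost tags the posting describes.

```python
# Required keys; the last three names are assumptions for illustration.
REQUIRED_TAGS = ("env", "service", "version", "team",
                 "data_classification", "cost_center")


def parse_tags(tags):
    """Turn ["env:prod", "service:api"] into {"env": "prod", "service": "api"}."""
    parsed = {}
    for tag in tags:
        key, sep, value = tag.partition(":")
        if sep and value:
            parsed[key] = value
    return parsed


def missing_tags(tags, required=REQUIRED_TAGS):
    """Return the required keys that are absent, in the order they are required."""
    present = parse_tags(tags)
    return [key for key in required if key not in present]


def high_cardinality_keys(tag_sets, limit):
    """Return tag keys with more than `limit` distinct values across all tag sets."""
    seen = {}
    for tags in tag_sets:
        for key, value in parse_tags(tags).items():
            seen.setdefault(key, set()).add(value)
    return sorted(key for key, values in seen.items() if len(values) > limit)


print(missing_tags(["env:prod", "service:api", "version:1.5.0"]))
# ['team', 'data_classification', 'cost_center']

tag_sets = [["env:prod", "user_id:1"], ["env:prod", "user_id:2"], ["env:prod", "user_id:3"]]
print(high_cardinality_keys(tag_sets, limit=2))  # ['user_id']
```

A check like the second one matters for cost control because every distinct tag value on a custom metric can create a separate time series.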
PREFERRED QUALIFICATIONS
- Experience supporting federal agency IT environments under FISMA/FedRAMP/NIST-aligned security and compliance requirements.
- Datadog certification (Fundamentals and/or Administrator) or comparable enterprise observability certification.
- Hands-on experience with Red Hat OpenShift Virtualization (CNV/KubeVirt) or other KubeVirt-based container virtualization observability.
- Experience with eBPF-based observability tooling and service mesh telemetry (Istio, Linkerd).
- Experience implementing SLOs and error budgets at enterprise scale and integrating them into operational governance.
- Experience with cost-aware observability practices, including telemetry volume optimization and retention tuning.
- Experience integrating observability outputs with executive reporting, SLA/KPI dashboards, and capacity forecasting.
- ITIL 4 Foundation
- AWS Certified Solutions Architect - Associate (or higher)
- Microsoft Certified: Azure Administrator Associate (or higher)
- Red Hat Certified Specialist in OpenShift Administration (or equivalent)
- HashiCorp Terraform Associate
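The SLO and error-budget experience mentioned among the preferred qualifications above comes down to simple arithmetic: an SLO target implies an allowed number of bad events, and the budget is whatever remains after actual failures are subtracted. The sketch below assumes a request-based SLO. The function name, the 99.9% target, and the request counts are hypothetical.

```python
def error_budget(slo_target, total_events, bad_events):
    """Summarize error-budget usage for a request-based SLO.

    slo_target is a fraction such as 0.999 (meaning 99.9% of events must be good).
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    allowed_bad = (1 - slo_target) * total_events
    return {
        "allowed_bad": allowed_bad,
        "remaining": allowed_bad - bad_events,
        # Values above 1.0 mean the budget is exhausted.
        "consumed_fraction": bad_events / allowed_bad if allowed_bad else float("inf"),
    }


# Hypothetical month: 1,000,000 requests, 250 failures, 99.9% target.
budget = error_budget(0.999, 1_000_000, 250)
print(round(budget["allowed_bad"]))           # about 1000 failures allowed
print(round(budget["consumed_fraction"], 2))  # about a quarter of the budget used
```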
WORK ENVIRONMENT / OTHER
Operational Support: Supports a 24x7x365 operating environment; participates in a defined on-call rotation and may require surge support based on operational needs.
Location: Telework
Travel: As required per contract direction.
EDUCATION & EXPERIENCE
BS and 4 - 8 years of prior relevant experience, or Master's with 2 - 6 years of prior relevant experience. Preferred degree in a relevant field (e.g., Information Technology, Computer Science, Engineering).
If you're looking for comfort, keep scrolling. At Leidos, we outthink, outbuild, and outpace the status quo - because the mission demands it. We're not hiring followers. We're recruiting the ones who disrupt, provoke, and refuse to fail. Step 10 is ancient history. We're already at step 30 - and moving faster than anyone else dares.
Original Posting:
May 19, 2026
For U.S. Positions: While subject to change based on business needs, Leidos reasonably anticipates that this job requisition will remain open for at least 3 days with an anticipated close date of no earlier than 3 days after the original posting date as listed above.
Pay Range:
$87,100.00 - $157,450.00
The Leidos pay range for this job level is a general guideline only and not a guarantee of compensation or salary. Additional factors considered in extending an offer include (but are not limited to) responsibilities of the job, education, experience, knowledge, skills, and abilities, as well as internal equity, alignment with market data, applicable bargaining agreement (if any), or other law.

About Leidos

Sourced by ZipRecruiter

At Leidos, we deliver innovative solutions through the efforts of our diverse and talented people who are dedicated to our customers' success. We empower our teams, contribute to our communities, and operate sustainable practices. Everything we do is built on a commitment to do the right thing for our customers, our people, and our community.

Industry

IT services

Company size

10,000+ Employees

Headquarters location

Reston, VA, US
