Director Observability Jobs (NOW HIRING)

Observability Engineer (Splunk) Locations: Jacksonville, FL; Boston, MA; Kansas City, MO | Hybrid 6x ... Communication that is direct, respectful, and documented. This Role May Not Be a Fit If * You ...

The Director of Systems Engineering for the Observability area is a senior technology leader responsible for engineering delivery, platform reliability, and operational excellence of the enterprise ...

Director Observability information

What is the difference between Director Observability vs Site Reliability Engineer?

| Aspect | Director Observability | Site Reliability Engineer |
| --- | --- | --- |
| Primary Focus | Oversees observability strategies, tools, and teams to ensure system visibility and performance | Builds and maintains reliable systems, automates deployment, and manages incident response |
| Credentials | Typically requires advanced knowledge of monitoring, cloud platforms, and leadership experience | Often has software engineering background, with skills in scripting, automation, and systems engineering |
| Work Environment | Leads teams in tech companies, focusing on monitoring and analytics tools | Works closely with development and operations teams to ensure system reliability |

While both roles focus on system performance and reliability, the Director Observability primarily manages observability strategies and teams, whereas the Site Reliability Engineer is hands-on, building and maintaining reliable systems. The roles complement each other in ensuring optimal system performance and uptime.

What are the key skills and qualifications needed to thrive as a Director of Observability, and why are they important?

To thrive as a Director of Observability, you need deep expertise in monitoring, logging, and distributed systems, typically backed by a degree in computer science or a related field and extensive experience in IT or DevOps leadership roles. Proficiency with observability tools such as Prometheus, Grafana, Datadog, Splunk, and APM solutions, along with knowledge of cloud platforms and relevant certifications, is essential. Strong leadership, strategic thinking, and communication skills help drive cross-functional initiatives and foster a culture of reliability. These skills and qualities are crucial for ensuring system health, rapid incident response, and alignment between technical teams and organizational objectives.

How does a Director of Observability typically collaborate with engineering and operations teams to drive organizational goals?

A Director of Observability works closely with engineering and operations teams to ensure that systems are monitored effectively and issues are identified and resolved quickly. This collaboration often involves developing unified monitoring strategies, aligning observability tools and processes, and facilitating incident response post-mortems. The Director also leads cross-functional meetings to establish best practices, set key performance indicators (KPIs), and ensure observability is integrated into the software development lifecycle. By acting as a bridge between technical teams, they help foster a culture of transparency, reliability, and continuous improvement.

What does a Director of Observability do?

A Director of Observability leads the strategy and implementation of monitoring, logging, and tracing systems to ensure the health and performance of technical infrastructure. They work with engineering and operations teams to develop best practices, select appropriate tools, and set standards for observability across the organization. Their goal is to provide visibility into system behavior, quickly identify and resolve incidents, and support continuous improvement in system reliability and performance.
More about Director Observability jobs
Infographic showing various Director Observability job openings in the United States as of June 2026, with employment types broken down into 100% Full Time. Highlights a 100% In-person job distribution.

Senior/Staff/Principal SWE - Observability Engineering

AppGate Cybersecurity, Inc.

New York, NY

$137K - $189K/yr

Full-time

Posted just now


Job description

About AppGate

AppGate secures and protects an organization's most valuable assets with its high-performance Zero Trust Network Access (ZTNA) solution. AppGate is the only direct-routed ZTNA solution built for peak performance, superior protection, and seamless interoperability. AppGate safeguards Fortune 500 enterprises worldwide. Learn more at appgate.com.

About the Role

We're looking for an Observability Engineer (Senior/Staff/Principal level) who has shipped distributed tracing systems, designed high-cardinality pipelines, and knows OpenTelemetry inside and out. You will own the end-to-end design and implementation of the AppGate observability fabric - from telemetry SDKs in our clients and gateways, to the LogForwarder pipeline, to customer-side integrations.

You'll make the foundational technical decisions - transport protocols, sampling strategies, schema design, correlation models - that determine whether our platform scales gracefully to hundreds of millions of events per day. This is a builder's role with a strategist's reach.

Key Responsibilities

Your engineering work will directly enable next-generation capabilities, including:

       OpenTelemetry-Native Telemetry Fabric: Logs and distributed traces from clients, controllers, gateways, and connectors - all correlated by session, user, device, and trace ID across the full ZTNA flow.

       High-Cardinality Data Pipeline: An OTLP-based ingestion and routing layer engineered for 100M+ events per day, with attribute filtering, redaction, and tail-sampling.

       End-to-End Distributed Tracing: Span hierarchies decomposing login and session establishment across posture checks, policy decisions, TLS handshakes, and entitlement resolution - turning hours of triage into seconds.

       On-Demand Packet Capture: Admin-triggered PCAP coordinated across client and gateway, with the workflow fully observable through OTel logs and traces.

       AI-Ready Foundation: Structured, semantically rich telemetry that future LLM-based incident analysis agents can reason over. The schema you design today is the substrate for Phase 3.

       Architect the Observability Platform: Define telemetry schema, correlation model, transport, and sampling strategies spanning client devices, controllers, and gateways.

       Build the Telemetry SDKs and LogForwarder: Instrument AppGate components with OpenTelemetry and implement the enrichment, redaction, batching, and tail-sampling pipeline that scales horizontally under load.

       Validate at Customer Scale: Test in lab environments matching our largest deployments - hundreds of sites, tens of thousands of concurrent sessions - and hunt down cardinality explosions and pipeline backpressure before customers see them.

       Drive Integration Standards: Own the OTLP, Prometheus, and JSON-log compatibility surface and validate ingestion into Datadog, Splunk, Nexthink, and Elastic.

       Raise the Engineering Bar: Establish patterns and review practices the Data + AI team builds on. Mentor engineers and grow the observability discipline inside AppGate.

       Collaborate Cross-Functionally: Work directly with product, R&D, and marquee customers in defense and critical infrastructure to shape requirements and deliver outcomes that matter.

Required Qualifications

       8+ years of engineering experience with at least 4 years dedicated to observability, telemetry, or large-scale data infrastructure (Datadog, Splunk, Elastic, Honeycomb, New Relic, Grafana Labs, or equivalent).

       Deep OpenTelemetry expertise: OTLP, the OTel Collector, semantic conventions, context propagation, and head/tail sampling - you can debate the trade-offs in your sleep.

       Distributed tracing in production: You've designed or significantly contributed to a tracing system handling real customer traffic, not just a side project.

       High-throughput pipeline experience: Hands-on with systems ingesting 100M+ events per day, including back-pressure handling, batching, and storage trade-offs.

       Strong systems programming: Production Go and/or Rust preferred. Comfort across the stack, from agent code to backend services.

       Networking and security fluency: Comfortable with TLS, DNS, TCP, and identity protocols. Prior ZTNA, SASE, or SD-WAN experience is a strong plus.

       Mindset: Pragmatic, opinionated, and impact-driven. You know when to prototype and when to ship.

Our Observability Vision

AppGate secures defense agencies, federal governments, and Fortune 100 enterprises. When a connection traverses our ZTNA fabric - across clients, gateways, controllers, and protected resources - every hop carries real consequences for national security and business continuity. Yet when something breaks, the answer to "Why can't I reach this resource?" is still buried in fragmented logs and tribal knowledge. That ends now.

We are building Observability AI - a purpose-built observability platform for the Zero Trust era. It emits high-fidelity, correlated telemetry across every AppGate component, is OpenTelemetry-native, engineered for 100M+ events per day, and designed to stream into Datadog, Splunk, Nexthink, Elastic, or any OTLP-compatible backend. The roadmap runs from a raw data-feed MVP, through native analytics and root-cause dashboards, to AI-driven incident analysis - LLM agents that read traces and explain failures in AppGate terms - and ultimately to autonomous remediation. This is the nervous system for networks that protect nations.

This is your chance to build the observability platform for networks that protect nations.

If you've shipped observability at scale and want to apply that craft where the stakes are highest, we want to hear from you.

AppGate is an Equal Opportunity/Affirmative Action Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability or veteran status, age, or any other federally protected class. In furtherance of AppGate's policy regarding affirmative action and equal employment opportunity, AppGate has developed a written affirmative action program. This program is available for review upon request by any applicant or employee during normal business hours by contacting the company's EEO Coordinator.