Job Summary:
Okta is a company focused on securing identities in the era of AI. They are seeking a highly technical Senior Observability Site Reliability Engineer to own and evolve their Splunk ecosystem and deliver a comprehensive, scalable Observability Platform.
Responsibilities:
• Design, build, and maintain scalable observability infrastructure using tools like Terraform.
• Optimize the collection, processing, and storage of log data to ensure high reliability and low latency of our Splunk services.
• Participate in on-call rotations and lead post-incident reviews to drive systemic improvements and "observability-driven development."
• Eliminate "toil" by automating the deployment and scaling of observability agents and collectors.
Qualifications:
Required:
• 5+ years of experience managing Splunk Cloud at scale (1000+ SVCs), including Workload Management (WLM) and HEC optimization.
• Expertise in creating intuitive, actionable Splunk dashboards that correlate data across multiple sources.
• 3+ years of experience in an SRE, DevOps, or Systems Engineering role with a focus on high-availability systems.
• Strong coding skills in SPL and Go for building internal tools and automating workflows.
• Deep understanding of Linux internals, networking (TCP/IP, DNS, Load Balancing), and container orchestration (Kubernetes/EKS).
• A data-driven approach to debugging complex, cross-service performance bottlenecks.
• This position requires the ability to access federal environments and/or protected federal data.
• The successful candidate must be able to submit documentation establishing U.S. Person status (e.g., a U.S. Citizen, National, Lawful Permanent Resident, Refugee, or Asylee; see 22 CFR 120.15) upon hire.
• The successful candidate must attend in-person onboarding in our San Francisco office during the first week of employment.
Preferred:
• Hands-on experience with OpenTelemetry (OTel), Vector, or similar frameworks for instrumenting applications.
• Experience implementing a Splunk chargeback app for usage reporting.
• Experience managing native observability tools within AWS or GCP.
Company:
Okta is a management platform that secures critical resources from cloud to ground for workforce and customers. Founded in 2009, the company is headquartered in San Francisco, USA, and has 5,001-10,000 employees. It is currently a late-stage company.