Job Summary:
Zealogics Inc is seeking a Site Reliability Engineer to support Cyber Data Risk & Resilience. The role involves ensuring the reliability and performance of critical cybersecurity platforms, monitoring system health, and improving operational visibility.
Responsibilities:
• Maintain and improve the reliability, availability, scalability, and performance of cybersecurity platforms, services, and supporting infrastructure
• Support day-to-day operational stability by monitoring system health, identifying risks, responding to incidents, and driving timely resolution of service-impacting issues
• Instrument infrastructure, applications, services, APIs, data pipelines, and cloud components to provide end-to-end visibility into system behavior and service health
• Design, build, and continuously refine monitoring, alerting, logging, tracing, and observability capabilities across distributed systems and cloud environments
• Develop meaningful and actionable alerts that reduce noise, improve signal quality, and enable teams to respond quickly to emerging issues
• Define and track key reliability metrics, including availability, latency, throughput, error rates, saturation, service-level indicators, service-level objectives, and operational risk indicators
• Build, maintain, and enhance dashboards for engineering, operations, product, risk, and executive stakeholders, ensuring information is accurate, timely, and decision-ready
• Continuously modify and improve executive dashboards to support regular leadership reviews of service health, reliability trends, incidents, risks, and operational performance
• Partner with engineering, cybersecurity, infrastructure, cloud, and application teams to identify reliability gaps and implement long-term improvements
• Participate in incident response, root-cause analysis, problem management, and post-incident reviews to prevent recurrence and improve operational maturity
• Automate operational tasks, health checks, reporting, deployment validation, and recovery procedures to improve efficiency and reduce manual effort
• Collaborate with application and platform teams to embed reliability, monitoring, and supportability requirements into the software development lifecycle
• Support CI/CD, DevOps, and release management practices by validating operational readiness, monitoring coverage, rollback plans, and production support requirements
• Contribute to resiliency engineering efforts, including capacity planning, performance tuning, failover validation, disaster recovery readiness, and chaos/resilience testing where applicable
• Ensure monitoring, alerting, dashboards, and operational processes align with enterprise security, risk, compliance, and governance standards
Qualifications:
Required:
• 7 to 10+ years of experience in site reliability engineering, systems engineering, software engineering, DevOps, infrastructure engineering, or production operations
• Strong experience supporting highly available, distributed, cloud-based, or mission-critical technology platforms
• Hands-on experience with observability practices, including monitoring, alerting, logging, metrics, tracing, dashboards, and service health reporting
• Experience instrumenting applications, services, APIs, infrastructure, databases, and cloud components to enable end-to-end operational visibility
• Strong understanding of reliability engineering concepts, including SLIs, SLOs, SLAs, error budgets, incident management, capacity management, and operational readiness
• Experience designing actionable alerts that support rapid issue detection, triage, escalation, and resolution
• Experience building and maintaining operational dashboards for technical teams, support teams, and senior/executive stakeholders
• Strong scripting or programming skills using Python, Java, Bash, PowerShell, or similar languages for automation and operational tooling
• Experience with cloud platforms such as AWS, Azure, or GCP
• Experience with Infrastructure-as-Code tools such as Terraform or similar technologies
• Experience working with CI/CD pipelines, DevOps workflows, release processes, and production support models
• Experience troubleshooting distributed systems, REST services, event-driven architectures, messaging platforms, and service-to-service integrations
• Familiarity with relational and non-relational databases, such as PostgreSQL, MSSQL, MongoDB, or similar platforms
• Strong analytical, troubleshooting, and problem-solving skills with the ability to diagnose complex technical issues across multiple layers of the stack
• Strong written and verbal communication skills, including the ability to translate technical issues into clear business and executive-level updates
Preferred:
• Experience supporting cybersecurity, risk, resilience, compliance, or enterprise security platforms
• Experience with observability and monitoring tools such as Splunk, Grafana, Prometheus, Datadog, Dynatrace, New Relic, Azure Monitor, CloudWatch, OpenTelemetry, or similar platforms
• Experience creating executive-level service health dashboards, reliability scorecards, operational risk reporting, or incident trend reporting
• Experience developing automated health checks, synthetic monitoring, service dependency maps, and operational runbooks
• Experience with incident response, major incident management, postmortems, root-cause analysis, and problem management practices
• Experience with containerized and cloud-native environments, including Kubernetes, Docker, serverless services, or managed cloud platforms
• Experience with distributed messaging or streaming platforms such as Apache Kafka
• Familiarity with cloud-native security, governance, and policy tooling such as Azure Policy, AWS Service Control Policies (SCPs), GCP Organization Policy constraints, or related controls
• Familiarity with Cloud Security Posture Management (CSPM) tools such as Wiz, Prisma Cloud, CloudGuard, or similar platforms
• Experience with cloud-based AI services such as Azure AI, Amazon Bedrock, or Google Vertex AI, particularly from an operational monitoring, reliability, or governance perspective
• Experience supporting Linux and Windows environments through scripting, automation, monitoring, and operational troubleshooting
• Exposure to web technologies, APIs, front-end services, or user-facing application monitoring
Company:
Zealogics Inc provides a broad range of IT and engineering services, systems implementation, and application outsourcing services through an optimized global delivery model. Founded in 2012, the company is headquartered in Bridgewater, USA, and has 501-1000 employees. It is currently a late-stage company.