Job Description: As part of the Site Reliability Engineering team within the Reference Data Engineering group, you'll help build a meaningful engineering discipline, combining software and systems to develop creative engineering solutions to runtime problems. You'll take the lead on relevant projects, backed by an organization that provides the support and mentorship you need to learn and grow. As an SRE, you'll be part of the application development organization, building more resilient, self-healing applications that require minimal production operations.
Key Responsibilities:
- Lead and conduct detailed Root Cause Analysis (RCA) for incidents, identifying underlying issues and recommending corrective actions.
- Document and communicate findings from RCA processes, ensuring transparency and knowledge sharing across the organization.
- Develop and maintain incident postmortem reports, providing insights and actionable recommendations to stakeholders.
- Monitor system performance and reliability metrics, proactively identifying potential issues before they escalate.
- Contribute to the design and implementation of automated monitoring and alerting systems to improve incident detection and response times.
- Continuously improve the incident management process, incorporating feedback and lessons learned from RCA activities.
- Participate in incident response activities.
Qualifications:
- Bachelor's degree or equivalent experience in a software engineering discipline
- 5+ years of Software Engineering experience
- Excellent communication skills, with the ability to convey technical findings to both technical and non-technical audiences
- Excellent debugging and troubleshooting skills
- Experience in Site Reliability Engineering, DevOps, or a similar role, with a focus on incident management and RCA.
- Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, Dynatrace).
- Familiarity with containerization technologies (e.g., Docker, Kubernetes).
Job Requirements
Additional Notes: 10/7/2025
Top Skills Required:
- Strong communication and documentation skills
- Proven experience in Root Cause Analysis (RCA) and postmortem reporting
- Hands-on experience managing Java-based systems, with basic understanding of Java and software architectures
- Strong troubleshooting abilities
- Nice to have: Background in Java development
Tasks:
- Perform end-to-end SRE work: identifying a failure, debugging it in Java, fixing it and automating the fix, documenting what happened, and improving systems to prevent future incidents.
- Investigate and analyze system incidents to determine root causes
- Gather and consolidate information, collaborate with developers, and document findings clearly and comprehensively
- Prepare detailed incident and RCA documentation
- Help with automation in the monitoring system
Additional Notes:
- DevOps and infrastructure-heavy candidates are not needed, but may be considered if they are really strong in RCA/Java
Additional Notes: 9/19/2025
- They are not currently using AWS or any cloud platform; candidates must be comfortable with this
- Familiarity with automation (preferably in Java)
Job Update: 08/29/2025
- Strong root cause analysis (RCA) experience.
- Postmortem experience.
- Good communication skills.
- Some software background, Java preferred; must be able to read code, not necessarily write it.
- The role does not prioritize AWS/infrastructure-heavy experience.
- Ideal candidate would be less infra/devops-focused and more investigative and software-aware.