Job Title: Site Reliability Engineer
Location: Cincinnati, OH (Onsite)
Duration: 12+ Months
About the Role:
- Our client is an enterprise-grade embedded finance platform that enables organizations to build, launch, and scale compliant banking, payments, and lending solutions.
- We are seeking a Principal Software Engineer to join our Production Engineering team. This is a hands-on technical leadership role focused on operating, debugging, and improving highly distributed, mission-critical payment systems. The ideal candidate thrives in complex production environments and enjoys solving deep technical challenges across applications, infrastructure, and data systems.
Key Responsibilities:
- Lead production triage and incident response across APIs, payment systems, distributed services, infrastructure, and databases.
- Diagnose and resolve complex production issues spanning code, infrastructure, data, and third-party dependencies.
- Partner with engineering teams to implement permanent fixes and improve platform reliability.
- Design and implement monitoring, alerting, automation, and operational tooling.
- Improve system observability, resiliency, and debuggability.
- Work across a mixed technology stack including Ruby on Rails, Java, AWS, APIs, and SQL databases.
- Develop runbooks and diagnostic workflows for operational excellence.
- Mentor engineers and influence best practices across engineering and SRE teams.
- Participate in architectural discussions to build highly reliable and scalable systems.
Required Skills & Experience:
- 10+ years of experience in Software Engineering, Production Engineering, SRE, or Distributed Systems.
- Strong experience debugging production issues end-to-end (application, infrastructure, data, and dependencies).
- Hands-on experience with:
  - AWS and cloud-native environments
  - Ruby on Rails and/or Java
  - APIs, microservices, and distributed systems
  - SQL and database troubleshooting
  - Observability tools such as Splunk, Datadog, or New Relic
- Deep understanding of:
  - System behavior in production
  - Fault isolation and troubleshooting
  - Performance optimization and resiliency patterns
- Excellent communication and stakeholder management skills.
- Ability to work effectively during incidents and high-pressure situations.
Preferred Qualifications:
- Experience in Payments, FinTech, Banking, or other regulated environments.
- Experience building and operating large-scale, high-availability platforms.
- Bachelor's degree in Computer Science, Engineering, or equivalent practical experience.