Site Reliability Engineer
PartsTech Cambridge, MA

Expired: 21 days ago Applications are no longer accepted.
  • $135,000 to $165,000 Yearly
  • Contractor
Job Description

PartsTech creates automotive e-commerce technology, helping repair shops, auto parts distributors, and manufacturers run their businesses more effectively and profitably through e-commerce and data innovation. We increase efficiency for the automotive aftermarket by connecting repair shops, parts distributors, and manufacturers in one seamless e-commerce platform. PartsTech makes finding and ordering the right parts simple, fast, and accurate.

PartsTech seeks a dynamic Site Reliability Engineer to support our platform and integrations. In this pivotal role, you will ensure that SLAs are exceeded and that we continue providing best-in-class services to our customers as our platform grows.

The ideal candidate will have in-depth experience with SaaS application technologies, especially in production support & incident management processes, and provide guidance to improve MTTD and MTTR for large-scale platforms/cloud-based applications with multiple integrations. You will contribute significantly to our team by sharing your knowledge of end-to-end operations, availability, reliability, and scalability of cloud-based SaaS applications.

Eastern Time Zone - Canada, Remote

What You'll Accomplish:

  • Utilize a variety of monitoring tools to observe system performance and detect anomalies. Tools to be utilized include Synthetic Monitoring, Application Performance Monitoring (APM), Infrastructure Monitoring, and API Monitoring.
  • Review system logs and be able to extract information to share with our partners on a per-request basis.
  • Work with the Integrations team and other internal teams to provide data around partner performance and SLAs.
  • Work with the outbound integrations team on measuring outbound partners' SLAs.
  • Create and manage alerts based on Service Level Agreements (SLAs) and the requirements of specific applications/microservices.
  • Lead cross-functional teams in the identification, triage, and resolution of critical incidents, categorized as severity 1 (critical), severity 2 (urgent), and some severity 3 incidents.
  • Ensure swift restoration and recovery of services, adhering to established SLAs and minimizing business impact.
  • Serve as the primary point of contact for all communications related to incidents.
  • Provide timely updates and escalations to stakeholders, senior management, customers, and partners.
  • Conduct post-mortem analyses for critical-severity incidents.
  • Recommend the implementation of a Correction of Error (COE) process for in-depth root cause analysis and preventive measures to avoid recurrence as the organization matures.
  • Prepare comprehensive incident reports. Keep Sigma regularly updated with daily, weekly, and quarterly views for tracking and analysis.
  • Weekly Business Reviews (WBRs): Conduct meetings between Customer Support, Product, Engineering, and Partner Support to review a real-time report.
  • Monthly Business Reviews (MBRs): Present executives with a monthly end-to-end view of Customer and Partner Support.
  • Quarterly Business Reviews (QBRs): Deliver a snapshot report to the Board of Directors (BoD) and partners.
  • Quarterly Business Review with Partners and Suppliers.

Who You Are:

  • Bachelor's degree in Computer Science, Information Systems, or a related technical field, or comparable work experience, required.
  • Must reside in the Eastern Time Zone in Canada.
  • 5+ years of experience using Application Performance Monitoring (APM), API monitoring, and infrastructure monitoring tools such as AppDynamics, New Relic, Datadog, Grafana, Prometheus, CloudWatch, and Synthetics.
  • Experienced in cloud-based deployment environments.
  • Proficient with programming constructs, especially in engineering frameworks applicable to Kotlin.
  • Demonstrates a high sense of urgency in completing tasks and resolving issues, ensuring projects are delivered with excellence, on time.
  • Possesses strong written and verbal communication skills, and is comfortable engaging with business stakeholders and external clients.
  • Has an in-depth understanding of and experience with Incident Management processes, including detection, recovery, conducting Correction of Error (COE) analysis, and following up on problem tasks.
  • Strong analytical and problem-solving skills.

Bonus Points:

  • Experience building large applications from scratch, complete with CI/CD infrastructure.
  • Experience with at least one of the major cloud providers (Amazon Web Services, Google Cloud, Microsoft Azure).
  • Experience managing Kubernetes clusters or some other container orchestration infrastructure.
  • Experience with observability of large-scale distributed systems (100s+ microservices, 50+ integrations).

Compensation: Contract-to-Hire, Annual Salary Range: $135,000 to $165,000 CAD

Why You Should Join Us:

Our vision is to make it fast and easy for auto repair shops to find the right parts across all of their suppliers with one search. Together, the PartsTech team has helped countless businesses save valuable time so they can focus on their customers, and we're just getting started.

The PartsTech team is a global, distributed group of passionate self-starters based in Cambridge, MA; Hartford, CT; Eastern Europe; and beyond. We are remote-first, privately held, and venture-backed.

PartsTech is proud to be an equal opportunity employer, and values diversity at every level of our company. We do not discriminate based on race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status. We believe you should bring your whole self to work, so come as you are.

PartsTech is an equal-opportunity employer and welcomes applications from candidates of all backgrounds.

Note: The job description provided is a general outline of responsibilities and qualifications for this role at PartsTech. Actual responsibilities and qualifications may vary depending on the specific needs of the company and department.
