$56.25 - $74.75/hr
Other
Medical, Dental, Vision, Retirement
Posted 4 days ago
Job description
If you’re up to the challenge, then take a chance at this rewarding opportunity!

Position Overview:
As a Lead SRE Platform Engineer, you will drive reliability engineering strategy and execution across critical IT Business Solutions platforms. This role focuses on improving uptime, performance, and operational efficiency through software enhancements, observability, automation, and data-driven Root Cause Analysis (RCA). You will serve as the technical lead for SRE practices: establishing monitoring standards, improving MELT (Metrics, Events, Logs, Traces) strategy, influencing tooling decisions, and partnering across infrastructure, development, operations, and vendor teams. This is a high-impact opportunity to build and mature reliability engineering capabilities from the ground up.

Responsibilities:

Reliability & Observability Leadership:
Define and mature SRE best practices across cloud and on-prem environments.
Design and implement comprehensive monitoring strategies using tools such as Dynatrace, Datadog, and Microsoft SCOM.
Develop dashboards, alerts, synthetic testing, and proactive monitoring capabilities.
Establish and evolve a MELT data strategy to improve service reliability.
Provide data-driven RCA investigations and implement preventative solutions.

Platform & Application Reliability:
Support and enhance reliability across the following areas.

Cloud & Infrastructure:
Microsoft Azure (Software, Storage, Azure Local)
Hyper-V and legacy VMware environments
NetApp and Pure Storage platforms
Azure Log Analytics
Infrastructure as Code using Terraform
Migration from Azure DevOps to GitHub (strong GitHub experience required)

Order Management Systems:
Azure-based, internally developed .NET / C# applications
Internal message queuing systems
Logging, analytics, and synthetic testing post-patching
API-based integrations

Workforce & Payroll Platforms:
Workday (Payroll)
ADP Vantage (Timekeeping)

Warehouse & Distribution Systems:
Blue Yonder Warehouse Management System (WMS)
Collect handheld voice picking devices
Network analytics for identifying dead zones and connectivity issues
Barcode scanners and device connectivity troubleshooting

DevSecOps & Automation:
Lead CI / CD reliability improvements (Azure DevOps → GitHub transition critical).
Enhance pipeline automation with embedded security controls.
Advance Infrastructure-as-Code standards (Terraform).
Improve configuration management and change governance.
Drive automation to reduce manual intervention and operational risk.

ITSM & Incident Management:
Work within the BMC ecosystem, including BMC Helix, BMC Remedy, and BMC Server Automation.
Optimize automated incident generation (SCOM → BMC workflows).
Improve triage, escalation, and impact modeling across services.
Monitor vendor performance and escalate appropriately.
Participate in off-hour escalation support when required.

Strategic Impact:
Develop predictive reliability models using statistical techniques.
Identify systemic risk across production systems.
Guide tooling decisions (e.g., Dynatrace vs. Datadog or other observability platforms).
Ensure regulatory and operational compliance standards are met.
Facilitate cross-functional collaboration and document SRE procedures and planning artifacts.

Required Skills:
5-7 years of Software Engineering and Infrastructure / Database Engineering experience.
Deep expertise in DevSecOps practices, observability platforms, API integrations, performance management tools, ITIL principles, ITSM, data analytics, and MELT data collection and analysis.
Experience in Azure cloud environments.
Strong analytical and problem-solving skills.
Demonstrated ability to influence technical direction.
Excellent communication and cross-team collaboration skills.
Continuous improvement mindset focused on reliability engineering.

Preferred Qualifications:
Strong programming experience in .NET / C#, Python, and SQL.
Experience with MSSQL (primary) and Oracle (limited).
Experience with GitHub (critical for upcoming transition).
Agile / Scrum experience.
Knowledge of Reliability-Centered Engineering and maintenance strategies.
Experience with synthetic testing and proactive validation post-deployment.
Bachelor's Degree in a related technical field.

If hired, you will enjoy the following ECLARO Benefits:
401k Retirement Savings Plan administered by Merrill Lynch
Commuter Check Pretax Commuter Benefits
Eligibility to purchase Medical, Dental & Vision Insurance through ECLARO

If interested, you may contact:
Jeanine Hastings
jeanine.hastings@eclaro.com
646-755-9303
Jeanine Hastings | LinkedIn

Equal Opportunity Employer: ECLARO values diversity and does not discriminate based on Race, Color, Religion, Sex, Sexual Orientation, National Origin, Age, Genetic Information, Disability, Protected Veteran Status, or any other legally protected group status, in compliance with all applicable laws.
Frequently asked questions
Q: What skills or qualities help someone succeed as a Site Reliability Engineer?
A: To succeed as a Site Reliability Engineer (SRE), one should possess strong technical skills in areas such as programming languages (e.g., Python, Go), cloud computing platforms (e.g., AWS, GCP), and operating systems (e.g., Linux). Additionally, soft skills like effective communication, problem-solving, and collaboration are crucial, as SREs often work closely with cross-functional teams to identify and resolve complex technical issues. Combining these technical and soft skills helps SREs keep systems reliable, efficient, and scalable, which in turn supports business growth and career advancement.
Q: What is the career path for a Site Reliability Engineer?
A: A Site Reliability Engineer's typical career progression starts in a junior SRE role focused on incident response, monitoring, and troubleshooting. It then advances to a mid-level role as an SRE Lead or Team Lead, overseeing team operations and implementing reliability best practices. At the senior level, SREs often become Technical Leads or Engineering Managers, driving strategic decisions and technical direction for the organization. Throughout their career, SREs can develop skills in areas like cloud computing, containerization, and automation, as well as soft skills like communication, collaboration, and problem-solving, ultimately leading to opportunities in leadership, architecture, or specialized roles like DevOps or Cloud Engineering.
