Role Summary
The Director of Site Reliability Engineering (SRE) is a strategic leadership role responsible for the reliability, scalability, and performance of T. Rowe Price's global technology infrastructure and services. The Director leads a team of skilled engineers to keep systems and services continuously available and performing optimally, while driving innovation and efficiency in operations.
This position plays a critical role in ensuring the overall resilience and efficiency of T. Rowe Price's technology infrastructure, contributing to the organization's success and growth. It offers an exciting opportunity to lead dynamic teams and drive impactful change in a fast-paced environment.
Responsibilities
Leadership and Strategy:
- Develop and execute the overall strategy for site reliability engineering and observability across the enterprise.
- Lead SRE and observability teams to ensure alignment with organizational goals and objectives.
- Establish a proactive approach to service reliability that reduces downtime for critical platforms and services.
- Lead, mentor, and grow a high-performing team of site reliability engineers, fostering a culture of collaboration, innovation, and continuous improvement.
- Collaborate with cross-functional teams, including software development, IT operations, architecture, security, and product management, to ensure seamless integration and delivery of services.
Reliability and Performance:
- Oversee the design, implementation, and maintenance of systems and processes that ensure high availability, reliability, and performance of services.
- Establish and monitor key performance indicators (KPIs) and service level objectives (SLOs) to measure and improve system reliability.
- Proactively identify and mitigate risks to system reliability through capacity planning, incident management, and disaster recovery.
- Lead cross-functional efforts spanning development, engineering, architecture, and operations to identify root causes of instability and drive short- and long-term improvements.
Automation and Efficiency:
- Drive the adoption of automation tools and practices to enhance operational efficiency and reduce manual intervention.
- Implement and refine processes for continuous integration and continuous deployment (CI/CD) to accelerate delivery, improve developer experience, and minimize downtime.
- Promote the use of infrastructure as code (IaC) and other modern practices to streamline operations and improve scalability.
Innovation and Improvement:
- Stay abreast of industry trends and emerging technologies to identify opportunities for innovation and improvement.
- Lead initiatives to optimize system architecture and infrastructure, ensuring scalability and adaptability to future needs.
- Foster a culture of experimentation and learning, encouraging the team to explore new solutions and approaches.
Communication and Collaboration:
- Serve as a key point of contact for stakeholders, providing regular updates on system reliability and performance.
- Facilitate effective communication and collaboration between the SRE team and other departments to ensure alignment and shared understanding.
- Advocate for best practices in site reliability engineering across the organization.
Qualifications
Required:
- Bachelor's or Master's degree (or the equivalent combination of education and relevant experience) and 12+ years in site reliability engineering, cloud engineering, software development, and/or IT operations, with 5+ years of demonstrated technical leadership and team management.
- Established success implementing and scaling DevOps practices, tools, and frameworks (e.g., IaC, CI/CD, Agile).
- Expertise in cloud computing, distributed systems, and modern infrastructure technologies.
- Proven experience deploying and supporting cloud platforms at scale (AWS, Azure, GCP).
- Extensive experience deploying observability platforms (e.g., Dynatrace, Grafana, Splunk, OpenTelemetry).
- Excellent problem-solving skills and the ability to make data-driven decisions.
- Exceptional communication and interpersonal skills, with the ability to influence and collaborate effectively across teams.
Preferred:
- Demonstrated expertise in financial services, specifically investment management.
- Certifications: SRECP, CRP
FINRA Requirements
FINRA licenses are not required and will not be supported for this role.
Work Flexibility
This role is eligible for hybrid work, with up to three days per week from home.