Job Summary:
Together AI is a research-driven artificial intelligence company focused on building next-generation AI infrastructure. They are seeking an Engineering Manager for their Site Reliability Engineering organization to lead a team of engineers in enhancing production infrastructure through automation and effective team development.
Responsibilities:
• Lead and develop a team of ~10 SRE engineers across multiple function areas, partnering with technical leads on direction.
• Drive the team's shift from manual operations to systemic, automated, scalable infrastructure, including making toil visible, capping it, and prioritizing engineering work that reduces it.
• Stay hands-on: code, review architecture, lead incidents, and participate meaningfully in technical decisions.
• Build coaching and feedback rhythms that develop engineers over time, especially around incident leadership, on-call habits, and systemic problem-solving.
• Strengthen on-call practices and incident response, including blameless postmortems that produce real engineering follow-through.
• Partner with the other SRE EM (across time zones) to shape org-wide practices, hiring, and operational maturity.
• Help grow the team: own hiring, leveling, and career development for engineers in your region.
• Plan capacity, prioritize work across function areas in your portfolio, and represent SRE in broader engineering conversations.
Qualifications:
Required:
• Prior experience managing SRE, infrastructure, or platform engineering teams - ideally including time leading through a reliability or culture turnaround.
• Deep technical credibility in at least one of: bare-metal infrastructure with Ansible-based config management, Kubernetes on public cloud, or Kubernetes with virtualization.
• Strong Kubernetes and Terraform fundamentals, with hands-on production experience.
• A genuine player-coach orientation - you want to stay close to the technology and contribute as an engineer, not just review the work of others.
• Experience leading teams through serious production incidents and on-call rotations (PagerDuty or equivalent).
• A track record of coaching engineers and shifting team culture through engineering systems - not just process or frameworks. You can point to specific things you put in place and the change they produced.
• Comfort operating in a matrix structure where technical direction is shared with tech leads.
• Adaptability - the 50/50 management/IC balance shifts based on what the team needs, and you're at home with that.
• Based in (or willing to relocate to) San Francisco, with the ability to be in the office regularly.
Company:
Together AI is a cloud-based platform designed for constructing open-source generative AI and infrastructure for developing AI models. Founded in 2022, the company is headquartered in San Francisco, USA, with a team of 201-500 employees. The company is currently in the growth stage.