Job Summary:
Together AI is a research-driven artificial intelligence company focused on creating innovative AI systems. As an AI Infrastructure Engineer, you will be responsible for maintaining user-facing services and production systems, implementing best practices for availability and scalability, and building monitoring systems to ensure high-quality service.
Responsibilities:
• Participate in the on-call rotation (PagerDuty) to respond to production incidents
• Build and run our infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users
• Build monitoring systems to ensure the highest quality service for our customers
• Design and implement operational processes (such as deployments and upgrades)
• Debug production issues across all services and levels of the stack
• Identify improvements to the product architecture from the perspectives of reliability, performance, and availability
• Plan the growth of Together AI's infrastructure
Qualifications:
Required:
• 5+ years of professional AI infrastructure or related experience
• Bachelor's degree in Computer Science or a related field, or equivalent work experience
• Knowledge of Ansible (roles, playbooks), Terraform, and Kubernetes
• Proficiency in programming/scripting languages
• Direct experience in monitoring and observability practices
• Knowledge of cloud services
• Ability to thrive in a collaborative environment involving different stakeholders and subject matter experts
Company:
Together AI provides a cloud platform for developing, training, fine-tuning, and deploying generative AI models. Founded in 2022, the company is headquartered in San Francisco, USA, with a team of 201-500 employees. The company is currently in the growth stage.