Job Summary:
Google is a leading technology company known for its innovative solutions and services. They are seeking a Systems Debug Engineer Manager to lead a technical team in managing and debugging cloud AI infrastructure, ensuring high performance and reliability while collaborating with various teams to resolve customer challenges.
Responsibilities:
• Drive technical team performance across on-call activities and system management by delivering leadership, mentorship, and career development while collaborating with primary responders to address system issues.
• Debug platform hardware, silicon, and AI/ML workloads to drive root-cause resolution, develop permanent infrastructure improvements, and build tools for faster diagnosis through troubleshooting and reproduction.
• Collaborate cross-functionally with Product, Quality, and Engineering teams to enhance product outcomes, and engage with Site Reliability Engineering (SRE) teams to ensure high-quality production and reliability.
• Resolve customer challenges on AI/ML infrastructure through effective diagnosis, resolution, and the implementation of investigation tools to increase productivity for critical reported issues.
• Serve as a consultant and subject matter expert for internal stakeholders to resolve deployment and operational obstacles across AI infrastructure environments daily.
Qualifications:
Required:
• Bachelor's degree in Computer Science or IT-related field, or equivalent practical experience.
• 8 years of experience with system design.
• 5 years of experience managing or leading a team.
• 5 years of experience with managing technical work, engineering strategy, and roadmaps.
• 5 years of experience with hardware debug (silicon debug, platform debug, IO interface, memory analysis).
• 3 years of experience with organizational design.
Preferred:
• 5 years of experience working with vendors or customers.
• 3 years of experience with leadership development and career growth of employees.
• 3 years of experience in analyzing and troubleshooting distributed systems.
• 2 years of CPU, dGPU, or TPU debug or validation experience.
• Understanding of memory and high-speed IO technologies.
Company:
Google specializes in internet-related services and products, including search, advertising, and software. It is a sub-organization of Alphabet. Founded in 1998, the company is headquartered in Mountain View, USA, with a team of 10001+ employees. The company is currently Late Stage.