Job Summary:
NVIDIA is seeking a world-class Senior Product Manager to architect the operational future of Enterprise AI. In this role, you will define the vision and lead the development of products that transform DGX hardware into a high-availability, self-healing AI Factory.
Responsibilities:
• Set the vision for the Enterprise Operational Gold Standard.
• Define how the world’s most sophisticated companies deploy, manage, and scale their Enterprise AI Factories.
• Productize the On-Prem Lifecycle: Define the "Day 0 through Day 2" experience for DGX SuperPODs.
• Lead the development of products that handle everything from bare-metal provisioning and network fabric configuration to automated "one-click" firmware rollouts.
• Develop a definitive telemetry and diagnostic suite.
• When a job slows down in a private data center, your framework should provide the "one-click" answer—isolating a thermal throttle, a degraded InfiniBand rail, or a cabling fault instantly.
• Lead the integration of DGX systems into the cloud-native ecosystem.
• Ensure that enterprise-grade features like GPU partitioning (MIG), multi-node scaling, and niche scheduling are declarative and seamless.
• You aren't just building scripts; you are building APIs and services.
• Your goal is to eliminate "management snowflakes," ensuring that every enterprise DGX deployment is standardized, repeatable, and resilient.
• Move the needle from reactive maintenance to self-healing infrastructure.
• Thoughtfully define the features for automated health checks that keep the fleet at peak performance without manual intervention.
Qualifications:
Required:
• 12+ years of demonstrated ability in Product Management, with specific experience in on-premises infrastructure, private cloud, or large-scale systems management.
• Bachelor's degree in Computer Science or a related field, or equivalent experience.
• A track record of turning complex hardware operations into software-defined workflows.
• Expert-level understanding of Kubernetes operators, container orchestration, and how to translate physical hardware constraints into declarative code.
• Experience managing large-scale Linux fleets in air-gapped or restricted enterprise environments.
• Deep familiarity with data center networking (InfiniBand/Ethernet), storage architectures, and the firmware-to-OS handshake.
• Ability to define the NVIDIA Datacenter Experience and transition into formal people management as the team expands.
Preferred:
• Experience with infrastructure-as-code (Ansible, Terraform, Pulumi) in a bare-metal context.
• Vision for using AI to manage AI—applying telemetry and machine learning to predict and prevent infrastructure failures.
• Belief that the 'Gold Standard' isn't just about speed—it's about the reliability and simplicity of the Automated Pit Crew.
Company:
NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI. Founded in 1993, the company is headquartered in Santa Clara, USA, and has more than 10,000 employees. The company is currently late stage.