Job Summary:
NVIDIA AI is looking for a Senior Software Engineer to help build the NeMo Platform, a product for developing and operating AI systems at scale. The role involves designing Python-first APIs, building systems for evaluating AI agents, and collaborating with research, product, platform, and infrastructure teams to enhance agentic capabilities.
Responsibilities:
• Design and implement Python-first APIs, SDK workflows, and plugin interfaces for building, measuring, and improving agents across multiple runtimes and product surfaces
• Build reusable systems for observing behavior, measuring progress, detecting regressions, and turning runtime evidence into product decisions
• Build systems for ingesting, normalizing, validating, and analyzing agent execution data and evaluation datasets
• Partner with research, product, platform, and infrastructure teams to integrate agentic capabilities broadly across NVIDIA agent runtimes and developer workflows
• Help turn emerging agent development and improvement techniques into reliable, reusable product capabilities
• Improve reliability, observability, debuggability, and performance across NeMo Platform, SDKs, plugins, jobs, and developer workflows
• Build strong test coverage across unit, integration, E2E, Docker, and Kubernetes workflows
• Drive "speed of light" engineering: fast iteration, high ownership, pragmatic decisions, and performance-minded implementation under production constraints
• Provide senior technical leadership through design reviews, code reviews, mentoring, and ownership of ambiguous cross-component problems
Qualifications:
Required:
• BS, MS, or equivalent experience in Computer Science, Computer Engineering, or a related technical field
• 5+ years of professional software engineering experience building production systems
• Excellent Python engineering skills, including API design, typing, testing, debugging, performance analysis, and maintainable software design
• Experience designing SDKs, libraries, plugins, CLIs, or other developer-facing interfaces
• Experience with distributed systems, cloud-native services, containers, Kubernetes, or job orchestration
• Strong understanding of reliability, scalability, security, and performance tradeoffs in production infrastructure
• Experience with structured data modeling and validation systems such as Pydantic, typed schemas, event/trace models, or SDK-generated types
• Ability to work independently, define technical scope, break down ambiguous problems, and drive work across team boundaries
• Clear communication skills and a track record of collaborating with engineering, product, research, or customer-facing teams
Preferred:
• Experience building, deploying, and iterating on production agentic AI systems where evaluation was used to measure and improve real product outcomes
• Experience designing evaluation workflows for heterogeneous agents, including tool-using agents, RAG agents, workflow agents, coding agents, or long-running autonomous systems
• Experience integrating evaluation capabilities across multiple products, runtimes, or internal platforms, especially through Python SDKs, plugins, or shared developer tooling
• Strong ability to connect technical evaluation work to business outcomes, product quality, user experience, reliability, or operational efficiency
• Experience with enterprise AI systems where measurement, regression testing, observability, governance, and continuous improvement are required for production deployment
Company:
Explore the latest breakthroughs made possible with AI. The company is headquartered in Santa Clara, CA, US, with a team of 10,001+ employees. The company is currently late stage.