Job Summary:
Dyna Robotics is a pioneering company in AI-driven robotics, known for its innovative embodied AI foundation model. They are seeking an ML Training Infrastructure Engineer to architect and build systems that optimize their multi-cloud GPU fleet for training, ensuring high performance and reproducibility for researchers.
Responsibilities:
• Scale Distributed Training: Architect and own the infrastructure for large-scale GPU clusters. You'll implement sharding, activation checkpointing, and memory optimization (ZeRO, FSDP) to enable the training of massive multimodal models.
• Optimize Researcher Ergonomics: Build a research codebase and job scheduling system (Kubernetes/SLURM) that prioritizes fast iteration, automated retries, and seamless failure recovery.
• High-Performance Data Handling: Design high-throughput pipelines to ingest and transform terabytes of multimodal robot data (video, proprioception, 3D signals), ensuring dataloaders never starve the GPUs.
• Production Inference: Build low-latency inference pipelines for real-time robot control. You'll apply quantization, distillation, and model compilation (TensorRT, Triton) to move models from the lab to the physical world.
• Deep Systems Profiling: Dive into the weeds of GPU utilization, I/O bottlenecks, and memory fragmentation to squeeze every bit of performance out of our expanding compute fleet.
Qualifications:
Required:
• 7+ Years of Engineering: With a track record of leading technical projects in high-performance computing (HPC) or ML infrastructure.
• ML Systems Mastery: Deep experience with PyTorch and distributed training frameworks (DeepSpeed, Accelerate). You understand the nuances of mixed precision and gradient accumulation.
• Infrastructure Expertise: Hands-on experience managing cloud GPU environments (GCP/AWS) and container orchestration (Kubernetes).
• Low-Level Intuition: A fundamental understanding of distributed systems, including race conditions, memory management, and NCCL/inter-node communication.
• Ownership Mindset: You don't just 'deploy' code; you design, build, and operate systems end-to-end to unblock fast-moving research.
Preferred:
• Experience with Robotics Data Formats (MCAP, Protobuf) or multimodal models (VLAs).
• Deep ML systems experience: custom kernels (Triton), compilers, or runtime optimization.
• Experience as a founding or early-stage infrastructure hire.
Company:
Dyna Robotics develops advanced robotic manipulation models to automate repetitive and stationary tasks. Founded in 2024, the company is headquartered in Redwood City, USA, with a team of 11-50 employees. The company is currently at an early stage.