Job Summary:
MBZUAI (Mohamed bin Zayed University of Artificial Intelligence) is focused on designing and operating ultra-scale GPU supercomputing systems for training foundation models. The Senior Distributed Systems Engineer will optimize communication stacks for large-scale distributed training, ensuring performance and reliability across GPU workloads.
Responsibilities:
• Design and optimize expert-parallel and hybrid-parallel communication patterns
• Drive high-performance hierarchical collectives for MoE workloads
• Co-design runtime orchestration with communication topology awareness
• Reduce tail latency and improve determinism across thousands of GPUs
• Architect fault-tolerant distributed execution under real-world cluster failures
Technical Focus Areas:
• Communication-compute overlap and topology-aware collective optimization
• Deep debugging of NCCL, RDMA, and custom communication layers
• Hybrid expert parallel strategies in modern large-scale MoE systems
• Elastic and resilient distributed job orchestration concepts
• Congestion analysis and routing optimization across InfiniBand/RoCE fabrics
• Microbenchmarking and performance modeling for communication-heavy workloads
• Hybrid expert parallel communication for Mixture-of-Experts training
• Scaling behavior under network pressure
• Distributed orchestration for elastic, large-scale training
• Fault detection and recovery in distributed GPU workloads
• Cross-layer bottlenecks: GPU ↔ NIC ↔ PCIe ↔ NVSwitch ↔ Fabric ↔ Scheduler
Qualifications:
Required:
• Experience optimizing distributed training at 1,000+ GPU scale (or equivalent depth)
• Hands-on expertise with RDMA, InfiniBand, RoCE, and GPUDirect RDMA
• Deep familiarity with NCCL and/or UCX internals
• Strong systems programming ability (C/C++, Rust, or Go)
• Strong familiarity with modern model training frameworks such as PyTorch
• Ability to troubleshoot and profile training performance issues related to communication bottlenecks
• Ability to translate research ideas into production-grade optimizations
• Experience debugging distributed hangs, desynchronization, and performance regressions
• Master's degree, or Bachelor's degree plus 1 year of relevant experience
Application Requirements:
• Include a link to your GitHub (required)
• Provide links to relevant distributed systems, HPC, or large-scale training projects
• Include a list of publications and/or public technical reports (if applicable)
• Describe the hardest distributed debugging problem you have solved
• Include measurable performance improvements you have delivered
Company:
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) is dedicated to research, innovation, and empowering brilliant minds in AI. Founded in 2019, it is headquartered in Abu Dhabi, United Arab Emirates, has a team of 51-200 employees, and is currently at the growth stage.