Deepspeed Jobs (NOW HIRING)

Research Engineer

San Francisco, CA · On-site

$175K - $275K/yr

Build distributed training infrastructure using PyTorch, FSDP, and DeepSpeed * Work with multimodal data pipelines involving video, sensory inputs, and action sequences * Evaluate model performance ...

Train and fine-tune LLMs using PyTorch, DeepSpeed, and LoRA. * Optimize inference using ONNX, vLLM, TensorRT, and GPU acceleration. * Manage datasets, preprocess data, and implement RAG with vector ...

PyTorch, Python, GitHub, Snowflake, Huggingface Transformers, AWS Sagemaker, Microsoft DeepSpeed, TorchTune, Apache Airflow, SLURM, Kubernetes Compensation $200-220k+ base salary #LI-Remote #LI-DNP ...

Deepspeed information

What are the key skills and qualifications needed to thrive as a DeepSpeed Engineer, and why are they important?

To thrive as a DeepSpeed Engineer, you need a solid background in machine learning, deep learning frameworks (such as PyTorch), and distributed systems, often supported by a degree in computer science or a related field. Proficiency with DeepSpeed, parallel computing libraries, and cloud platforms, along with familiarity with tools like CUDA and NCCL, is typically expected. Strong problem-solving abilities, collaboration, and adaptability are crucial soft skills for optimizing large-scale AI models and working with cross-functional teams. Mastering these skills ensures efficient development and deployment of high-performance, scalable AI solutions in demanding environments.

What are some common challenges faced by engineers working with DeepSpeed and how can they be addressed?

Engineers working with DeepSpeed often encounter challenges in optimizing large-scale model training, such as managing memory efficiency and tuning distributed training parameters. Troubleshooting gradient accumulation and parallelism strategies, and ensuring compatibility with different hardware setups, can be complex. Collaborating closely with data scientists, DevOps, and research teams is essential for addressing these challenges, as is staying current with the latest DeepSpeed releases and documentation. Regular participation in code reviews and knowledge-sharing sessions can also help engineers overcome technical hurdles and continuously improve model performance.
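
As a small illustration of the tuning involved, a common source of confusion is the relationship between global batch size, per-GPU micro-batch size, gradient accumulation steps, and GPU count. The sketch below is plain Python and does not call DeepSpeed; the function name and the example numbers are hypothetical. It assumes the usual relationship in which the global batch size equals micro-batch size times accumulation steps times the number of GPUs.

```python
def gradient_accumulation_steps(train_batch_size: int,
                                micro_batch_per_gpu: int,
                                world_size: int) -> int:
    """Return the accumulation steps needed so that
    train_batch_size == micro_batch_per_gpu * steps * world_size.

    Hypothetical helper for illustration only; not part of any library.
    """
    if micro_batch_per_gpu <= 0 or world_size <= 0:
        raise ValueError("micro_batch_per_gpu and world_size must be positive")

    # Samples processed across all GPUs in one forward/backward pass.
    samples_per_pass = micro_batch_per_gpu * world_size

    if train_batch_size % samples_per_pass != 0:
        raise ValueError(
            f"train_batch_size={train_batch_size} is not divisible by "
            f"micro_batch_per_gpu * world_size = {samples_per_pass}"
        )
    return train_batch_size // samples_per_pass


if __name__ == "__main__":
    # Example: global batch of 512 on 16 GPUs with 4 samples per GPU per pass.
    print(gradient_accumulation_steps(512, 4, 16))  # prints 8
```

Catching a mismatch like this before launching a multi-node job is usually cheaper than discovering it from a failed run.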

What is DeepSpeed?

DeepSpeed is an open-source deep learning optimization library developed by Microsoft, designed to make distributed training of large-scale models efficient. It helps researchers and engineers train models that are too large to fit in the memory of a single GPU by offering features like ZeRO optimization, mixed-precision training, and advanced parallelism techniques. DeepSpeed is widely used in the machine learning community for its scalability and performance improvements, making it easier to train state-of-the-art models on vast datasets. The library integrates with PyTorch and supports training on multiple GPUs and across multiple machines.
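
To make the features above more concrete, here is a minimal sketch of what a DeepSpeed-style configuration enabling ZeRO and mixed precision might look like. It only builds and prints a JSON configuration and does not import DeepSpeed. The helper name `build_ds_config` and the chosen values are hypothetical; the configuration keys follow the DeepSpeed configuration JSON format as commonly documented, so check them against the documentation for the release you use.

```python
import json


def build_ds_config(micro_batch_per_gpu: int,
                    grad_accum_steps: int,
                    zero_stage: int = 2,
                    precision: str = "bf16") -> dict:
    """Build a DeepSpeed-style config dict (hypothetical helper, illustration only)."""
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("zero_stage must be 0, 1, 2, or 3")
    if precision not in ("bf16", "fp16", "fp32"):
        raise ValueError("precision must be 'bf16', 'fp16', or 'fp32'")

    return {
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
        "gradient_accumulation_steps": grad_accum_steps,
        # ZeRO stage: higher stages partition more training state across GPUs.
        "zero_optimization": {"stage": zero_stage},
        # Mixed precision: enable at most one of bf16 / fp16.
        "bf16": {"enabled": precision == "bf16"},
        "fp16": {"enabled": precision == "fp16"},
    }


if __name__ == "__main__":
    config = build_ds_config(micro_batch_per_gpu=4, grad_accum_steps=8)
    print(json.dumps(config, indent=2))
```

A dictionary like this is typically saved as a JSON file or passed to the training setup; the right ZeRO stage and precision depend on the model size and hardware.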

What is the difference between a DeepSpeed role and a Data Scientist role?

Aspect | DeepSpeed | Data Scientist
Required credentials | Knowledge of machine learning frameworks, programming skills in Python, experience with AI model training | Degree in Data Science, Statistics, Computer Science, or related fields; strong analytical skills
Work environment | AI research labs, tech companies, cloud computing environments | Business, tech companies, research institutions
Industry usage | AI model training, deep learning optimization | Data analysis, predictive modeling, business insights

DeepSpeed roles focus on optimizing large-scale AI model training and deep learning performance, while Data Scientists analyze data to generate insights and build predictive models. Both roles require technical skills but serve different purposes within the AI and data ecosystem.

More about Deepspeed jobs
Infographic showing various Deepspeed job openings in the United States as of May 2026, with employment types broken down into 80% Full Time, and 20% Contract. Highlights an 80% In-person, and 20% Remote job distribution.

Machine Learning Infrastructure Engineer

Institute of Foundation Models

Sunnyvale, CA • On-site

$150K - $450K/yr

Full-time

Medical, Dental, Vision, Retirement, PTO

Posted 16 days ago


Job description

About the Institute of Foundation Models
We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.
As part of our team, you'll have the opportunity to work on the core of cutting-edge foundation model training, alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.
The Role
We're looking for a distributed ML infrastructure engineer to help extend and scale our training systems. You'll work side-by-side with world-class researchers and engineers to:
• Extend distributed training frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod)
• Implement distributed optimizers from mathematical specs
• Build robust config + launch systems across multi-node, multi-GPU clusters
• Own experiment tracking, metrics logging, and job monitoring for external visibility
• Improve training system reliability, maintainability, and performance
While much of the work will support large-scale pre-training, pre-training experience is not required. Strong infrastructure and systems experience is what we value most.
Key Responsibilities
• Distributed Framework Ownership - Extend or modify training frameworks (e.g., DeepSpeed, FSDP) to support new use cases and architectures.
• Optimizer Implementation - Translate mathematical optimizer specs into distributed implementations.
• Launch Config & Debugging - Create and debug multi-node launch scripts with flexible batch sizes, parallelism strategies, and hardware targets.
• Metrics & Monitoring - Build systems for experiment tracking, job monitoring, and logging usable by collaborators and researchers.
• Infra Engineering - Write production-quality code and tests for ML infra in PyTorch or JAX; ensure reliability and maintainability at scale.
Qualifications
Must-Haves:
• 5+ years of experience in ML systems, infra, or distributed training
• Experience modifying distributed ML frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod)
• Strong software engineering fundamentals (Python, systems design, testing)
• Proven multi-node experience (e.g., Slurm, Kubernetes, Ray) and debugging skills (e.g., NCCL/GLOO)
• Ability to implement algorithms across GPUs/nodes based on mathematical specs
• Experience working on an ML platform/infrastructure team and/or a distributed inference optimization team
• Experience with large-scale machine learning workloads (strong ML fundamentals)
Nice-to-Haves:
• Exposure to mixed-precision training (e.g., bf16, fp8) with accuracy validation
• Familiarity with performance profiling, kernel fusion, or memory optimization
• Open-source contributions or published research (MLSys, ICML, NeurIPS)
• CUDA or Triton kernel experience
• Experience with large-scale pre-training
• Experience building custom training pipelines at scale and modifying them for custom needs
• Deep familiarity with training infrastructure and performance tuning
$150,000 - $450,000 a year
• Comprehensive medical, dental, and vision
• 401(k) program
• Generous PTO, sick leave, and holidays
• Paid parental leave and family-friendly benefits
• On-site amenities and perks: Complimentary lunch, gym access, and a short walk to the Sunnyvale Caltrain station