DeepSpeed Jobs (NOW HIRING)

Senior Machine Learning Engineer, Model Training and Reinforcement Learning

Palo Alto, CA · On-site +1

$122K - $168K/yr

Build and maintain distributed training and RL infrastructure using frameworks such as Megatron-LM, DeepSpeed, PyTorch FSDP/DTensor, Ray, verl, slime, AReaL, or OpenRLHF. * Implement and debug ...

Machine Learning Infrastructure Engineer

Sunnyvale, CA · On-site

$150K - $450K/yr

You'll work side-by-side with world-class researchers and engineers to: • Extend distributed training frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod) • Implement distributed optimizers ...

Hedra

Research Engineer

San Francisco, CA · On-site

$175K - $275K/yr

Build distributed training infrastructure using PyTorch, FSDP, and DeepSpeed * Work with multimodal data pipelines involving video, sensory inputs, and action sequences * Evaluate model performance ...

Machine Learning Infrastructure Engineer

Sunnyvale, CA · On-site

$150K - $450K/yr

Causal Labs

Member of Technical Staff - Training Infrastructure

San Francisco, CA · On-site

... PyTorch, Megatron-LM, DeepSpeed, XLA) ...

smart folks inc

Senior AI Solution Architect

Minneapolis, MN · On-site

Optimize AI models for speed, scalability, and cost efficiency using frameworks like DeepSpeed, PyTorch Lightning, and Hugging Face Accelerate. * Agentic AI: Drive autonomous AI agents leveraging ...

LLM Pre-training & Distributed Engineer (AI Infrastructure)

San Francisco, CA · On-site

$126K - $166K/yr

Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM. * Optimize networking (InfiniBand/RDMA) and memory management to prevent out-of-memory errors.

LLM Pre-training & Distributed Engineer (AI Infrastructure)

Seattle, WA

$122K - $160K/yr

Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM. * Optimize networking (InfiniBand/RDMA) and memory management to prevent out-of-memory errors.

Embedding VC

Video Generation Models · Training Infrastructure Engineer

San Francisco, CA · On-site

$126K - $166K/yr

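... mainstream training frameworks (Megatron / DeepSpeed / FSDP / TorchTitan), understood at the source-code level. • Hands-on experience training on ≥ 256 GPUs. • Experience building a training platform / cluster scheduling from scratch or from a partial start ...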

ML Systems Engineer, Large-Scale Model Training & RL Infrastructure

Palo Alto, CA · On-site

$195K - $262K/yr

Integrate and extend frameworks such as Megatron-LM, DeepSpeed, PyTorch FSDP/DTensor, Ray, verl, slime, AReaL, OpenRLHF, or equivalent internal systems. * Implement and debug parallelism strategies ...

LLM Pre-training & Distributed Engineer (AI Infrastructure)

Boston, MA

$116K - $153K/yr

Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM. * Optimize networking (InfiniBand/RDMA) and memory management to prevent out-of-memory errors.

MBZUAI (Mohamed bin Zayed University of Artificial Intelligence)

LLM Pre-training & Distributed Engineer (AI Infrastructure)

San Francisco, CA

$126K - $166K/yr

Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM. * Optimize networking (InfiniBand/RDMA) and memory management to prevent out-of-memory errors.

MBZUAI (Mohamed bin Zayed University of Artificial Intelligence)

Research Scientist - Distributed Machine Learning

Sunnyvale, CA · On-site

Responsibilities: • Build and scale distributed pre-training frameworks • Set up DeepSpeed / FSDP / Megatron-LM across multi-node GPU clusters. • Create robust launch scripts, resilient ...

Altos Labs

Scientist / Senior Scientist, Multimodal & Relational Machine Learning Foundation Models

San Diego, CA · On-site

$97K - $132K/yr

... FSDP, DeepSpeed) to train models across multiple GPU nodes. • Gain insights into model performance based on theory, deep research, and the mathematical underpinnings of set-invariant and graph ...

LLM Pre-training & Distributed Engineer (AI Infrastructure)

Seattle, WA · On-site

$122K - $160K/yr

Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM. * Optimize networking (InfiniBand/RDMA) and memory management to prevent out-of-memory errors.