DeepSpeed Jobs (NOW HIRING)

Research Engineer

San Francisco, CA · On-site

$175K - $275K/yr

  • Build distributed training infrastructure using PyTorch, FSDP, and DeepSpeed
  • Work with multimodal data pipelines involving video, sensory inputs, and action sequences
  • Evaluate model performance ...

  • Train and fine-tune LLMs using PyTorch, DeepSpeed, and LoRA.
  • Optimize inference using ONNX, vLLM, TensorRT, and GPU acceleration.
  • Manage datasets, preprocess data, and implement RAG with vector ...

DeepSpeed information

What are some common challenges faced by engineers working with DeepSpeed and how can they be addressed?

Engineers working with DeepSpeed often encounter challenges related to optimizing large-scale model training, such as managing memory efficiency and tuning distributed training parameters. Troubleshooting issues like gradient accumulation, parallelism strategies, and ensuring compatibility with different hardware setups can be complex. Collaborating closely with data scientists, DevOps, and research teams is essential for addressing these challenges, as is staying updated with the latest DeepSpeed releases and documentation. Regular participation in code reviews and knowledge-sharing sessions can also help engineers overcome technical hurdles and continuously improve model performance.
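
One concrete instance of the tuning issues mentioned above is the batch-size relationship: DeepSpeed expects the global `train_batch_size` to equal the per-GPU micro-batch size times the gradient accumulation steps times the number of data-parallel GPUs, and rejects configs where the three disagree. The sketch below is only an illustration. The helper `build_ds_config` and the numbers passed to it are hypothetical; the config keys are standard DeepSpeed config fields, and enabling `bf16` assumes hardware that supports it.

```python
import json


def build_ds_config(micro_batch_per_gpu, grad_accum_steps, world_size, zero_stage=2):
    """Hypothetical helper: build a DeepSpeed-style config dict whose
    batch-size fields are consistent with each other."""
    if min(micro_batch_per_gpu, grad_accum_steps, world_size) < 1:
        raise ValueError("batch size, accumulation steps, and world size must be >= 1")
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("ZeRO stage must be 0, 1, 2, or 3")
    return {
        # Global batch = micro-batch per GPU * accumulation steps * number of GPUs.
        "train_batch_size": micro_batch_per_gpu * grad_accum_steps * world_size,
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
        "gradient_accumulation_steps": grad_accum_steps,
        "zero_optimization": {"stage": zero_stage},
        # Assumption: bf16-capable GPUs; older hardware would use "fp16" instead.
        "bf16": {"enabled": True},
    }


# Illustrative values only: 4 samples per GPU, 8 accumulation steps, 16 GPUs.
config = build_ds_config(micro_batch_per_gpu=4, grad_accum_steps=8, world_size=16)
print(json.dumps(config, indent=2))
```

Raising the accumulation steps is the usual way to keep the global batch size fixed when the micro-batch has to shrink to fit in GPU memory.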

What is DeepSpeed?

DeepSpeed is an open-source deep learning optimization library developed by Microsoft for efficient distributed training of large-scale models. It helps researchers and engineers train models that are too large to fit in the memory of a single GPU by offering features such as ZeRO optimization, mixed-precision training, and advanced parallelism techniques. DeepSpeed is widely used in the machine learning community for its scalability and performance, which make it easier to train state-of-the-art models on very large datasets. The library integrates with PyTorch and supports training on multiple GPUs and across multiple machines.
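
To make the memory benefit of ZeRO concrete, here is a rough back-of-the-envelope sketch. It assumes mixed-precision Adam, where each parameter costs about 2 bytes for half-precision weights, 2 bytes for half-precision gradients, and 12 bytes for fp32 optimizer state, and it ignores activations, buffers, and fragmentation. The function name `zero_model_state_gb` and the example model size are hypothetical, chosen only for illustration.

```python
def zero_model_state_gb(num_params, num_gpus, stage):
    """Approximate per-GPU memory (GB) for model states under ZeRO.

    Assumes mixed-precision Adam: 2 + 2 + 12 bytes per parameter.
    Activations and temporary buffers are NOT included.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("ZeRO stage must be 0, 1, 2, or 3")
    params = 2 * num_params   # half-precision weights
    grads = 2 * num_params    # half-precision gradients
    optim = 12 * num_params   # fp32 weights, momentum, variance
    if stage >= 1:            # stage 1 partitions optimizer states
        optim /= num_gpus
    if stage >= 2:            # stage 2 also partitions gradients
        grads /= num_gpus
    if stage >= 3:            # stage 3 also partitions the weights
        params /= num_gpus
    return (params + grads + optim) / 1e9


# Illustrative case: a 7.5B-parameter model on 64 GPUs.
for stage in range(4):
    gb = zero_model_state_gb(7.5e9, 64, stage)
    print(f"ZeRO stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions the unpartitioned model states need about 120 GB per GPU, which is why such a model does not fit on a single device without partitioning.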

What is the difference between a DeepSpeed role and a Data Scientist?

| Aspect | DeepSpeed | Data Scientist |
| --- | --- | --- |
| Required credentials | Knowledge of machine learning frameworks, programming skills in Python, experience with AI model training | Degree in Data Science, Statistics, Computer Science, or related fields; strong analytical skills |
| Work environment | AI research labs, tech companies, cloud computing environments | Business, tech companies, research institutions |
| Industry usage | AI model training, deep learning optimization | Data analysis, predictive modeling, business insights |

DeepSpeed roles focus on optimizing large-scale AI model training and deep learning performance, while Data Scientists analyze data to generate insights and build predictive models. Both roles require technical skills but serve different purposes within the AI and data ecosystem.

What are the key skills and qualifications needed to thrive as a DeepSpeed Engineer, and why are they important?

To thrive as a DeepSpeed Engineer, you need a solid background in machine learning, deep learning frameworks (such as PyTorch), and distributed systems, often supported by a degree in computer science or a related field. Proficiency with DeepSpeed, parallel computing libraries, and cloud platforms, along with familiarity with tools like CUDA and NCCL, is typically expected. Strong problem-solving abilities, collaboration, and adaptability are crucial soft skills for optimizing large-scale AI models and working with cross-functional teams. Mastering these skills ensures efficient development and deployment of high-performance, scalable AI solutions in demanding environments.
More about DeepSpeed jobs
Infographic: DeepSpeed job openings in the United States as of June 2026. Employment type: 100% full-time. Work arrangement: 75% on-site (physical), 6% hybrid, 19% remote.

Senior ML Infrastructure Engineer (PyTorch, Kubernetes, GPU Training)

Finoit Inc.

Redwood City, CA • On-site

$132K - $180K/yr

This job post expired today. Applications are no longer accepted.

Job description

Short Job Description

We are seeking a Senior ML Infrastructure Engineer to design and scale the infrastructure powering large-scale machine learning training workloads. In this role, you'll build high-performance GPU training platforms, optimize distributed training pipelines, and improve the developer experience for ML researchers.

Responsibilities:

  • Design and scale distributed ML training infrastructure for large GPU clusters.
  • Build and optimize training pipelines using PyTorch, DeepSpeed, and distributed training frameworks.
  • Develop and maintain job scheduling systems using Kubernetes and/or SLURM.
  • Create high-throughput data pipelines for large-scale multimodal datasets.
  • Optimize GPU utilization, memory efficiency, and overall system performance.
  • Build low-latency inference pipelines for production ML deployments.

Required Skills:

  • 7+ years of experience in ML Infrastructure, HPC, or Distributed Systems.
  • Strong experience with PyTorch, DeepSpeed, FSDP, ZeRO, or similar distributed training frameworks.
  • Hands-on experience with Kubernetes, cloud platforms (AWS/Google Cloud Platform), and containerized environments.
  • Strong understanding of distributed systems, GPU optimization, NCCL, memory management, and performance tuning.
  • Experience building scalable ML infrastructure from development through production.

Location: Redwood City, CA (On-site)
Employment Type: Full-Time

Nice to Have:

  • Experience with multimodal AI, robotics data pipelines, Triton, TensorRT, custom ML kernels, or ML compiler/runtime optimization.