DeepSpeed Jobs (NOW HIRING)

DeepSpeed information

What are the key skills and qualifications needed to thrive as a DeepSpeed Engineer, and why are they important?

To thrive as a DeepSpeed Engineer, you need a solid background in machine learning, deep learning frameworks (such as PyTorch), and distributed systems, often supported by a degree in computer science or a related field. Proficiency with DeepSpeed, parallel computing libraries, and cloud platforms, along with familiarity with tools like CUDA and NCCL, is typically expected. Strong problem-solving abilities, collaboration, and adaptability are crucial soft skills for optimizing large-scale AI models and working with cross-functional teams. Mastering these skills ensures efficient development and deployment of high-performance, scalable AI solutions in demanding environments.

What are some common challenges faced by engineers working with DeepSpeed and how can they be addressed?

Engineers working with DeepSpeed often encounter challenges in optimizing large-scale model training, such as managing memory efficiency and tuning distributed training parameters. Configuring gradient accumulation, choosing parallelism strategies, and ensuring compatibility across different hardware setups can all be complex. Collaborating closely with data scientists, DevOps, and research teams is essential for addressing these challenges, as is staying current with the latest DeepSpeed releases and documentation. Regular participation in code reviews and knowledge-sharing sessions can also help engineers overcome technical hurdles and continuously improve model performance.
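One concrete instance of the parameter tuning mentioned above is keeping the batch-size settings consistent. DeepSpeed expects the global batch size to equal the per-GPU micro-batch size times the gradient accumulation steps times the number of GPUs. The sketch below is a minimal illustration of that arithmetic; the helper names `grad_accum_steps` and `batch_config` are hypothetical, and only the three dictionary keys are taken from the DeepSpeed configuration format.

```python
def grad_accum_steps(target_batch: int, micro_batch_per_gpu: int, world_size: int) -> int:
    """Return the accumulation steps that make the three batch settings agree.

    Hypothetical helper: it encodes the relation
    train_batch_size = micro_batch_per_gpu * grad_accum_steps * world_size.
    """
    samples_per_step = micro_batch_per_gpu * world_size
    if target_batch % samples_per_step != 0:
        raise ValueError(
            f"target batch {target_batch} is not a multiple of "
            f"{micro_batch_per_gpu} (micro-batch) x {world_size} (GPUs)"
        )
    return target_batch // samples_per_step


def batch_config(target_batch: int, micro_batch_per_gpu: int, world_size: int) -> dict:
    """Build the batch-related part of a DeepSpeed-style config dictionary."""
    return {
        "train_batch_size": target_batch,
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
        "gradient_accumulation_steps": grad_accum_steps(
            target_batch, micro_batch_per_gpu, world_size
        ),
    }


if __name__ == "__main__":
    # Assumed example values: 16 GPUs, 4 samples per GPU per step, global batch of 512.
    print(batch_config(target_batch=512, micro_batch_per_gpu=4, world_size=16))
```

Failing early on a non-divisible batch size is a design choice for this sketch; it surfaces a mismatch before a multi-node job is launched rather than at startup on the cluster.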

What is DeepSpeed?

DeepSpeed is an open-source deep learning optimization library developed by Microsoft, designed to enable efficient distributed training of large-scale models. It helps researchers and engineers train models that are too large to fit in the memory of a single GPU by offering features like ZeRO optimization, mixed-precision training, and advanced parallelism techniques. DeepSpeed is widely used in the machine learning community for its scalability and performance improvements, making it easier to train state-of-the-art models on vast datasets. The library integrates with PyTorch and supports training on multiple GPUs and across multiple machines.
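To make the memory point concrete, the sketch below estimates per-GPU memory for model states under each ZeRO stage, using the accounting from the ZeRO paper for mixed-precision Adam: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer states. The function name `zero_model_state_bytes` is hypothetical, and the estimate deliberately ignores activations, temporary buffers, and fragmentation, so real usage will be higher.

```python
def zero_model_state_bytes(
    num_params: float,
    num_gpus: int,
    stage: int,
    optimizer_bytes_per_param: int = 12,
) -> float:
    """Estimate per-GPU bytes for model states under a given ZeRO stage.

    Stage 0 replicates everything; stage 1 partitions optimizer states;
    stage 2 also partitions gradients; stage 3 also partitions parameters.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")

    params = 2.0 * num_params                       # fp16 weights
    grads = 2.0 * num_params                        # fp16 gradients
    optim = optimizer_bytes_per_param * num_params  # fp32 weights, momentum, variance

    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus
    return params + grads + optim


if __name__ == "__main__":
    # Assumed example: a 7.5B-parameter model on 64 GPUs (1 GB = 1e9 bytes).
    for stage in range(4):
        gb = zero_model_state_bytes(7.5e9, 64, stage) / 1e9
        print(f"ZeRO stage {stage}: {gb:.1f} GB per GPU")
    # ZeRO stage 0: 120.0 GB per GPU
    # ZeRO stage 1: 31.4 GB per GPU
    # ZeRO stage 2: 16.6 GB per GPU
    # ZeRO stage 3: 1.9 GB per GPU
```

Under these assumptions, the unpartitioned model states alone exceed the memory of a single GPU, while stage 3 brings them down to a small fraction of it.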

What is the difference between DeepSpeed and a Data Scientist?

| Aspect | DeepSpeed | Data Scientist |
| --- | --- | --- |
| Required credentials | Knowledge of machine learning frameworks, programming skills in Python, experience with AI model training | Degree in Data Science, Statistics, Computer Science, or related fields; strong analytical skills |
| Work environment | AI research labs, tech companies, cloud computing environments | Business, tech companies, research institutions |
| Industry usage | AI model training, deep learning optimization | Data analysis, predictive modeling, business insights |

DeepSpeed work focuses on optimizing large-scale AI model training and deep learning performance, while Data Scientists analyze data to generate insights and build predictive models. Both roles require technical skills but serve different purposes within the AI and data ecosystem.

More about DeepSpeed jobs
Infographic: DeepSpeed job openings in the United States as of May 2026. By employment type, 80% are full-time and 20% are contract; by work location, 80% are in-person and 20% are remote.
Senior Cloud Infrastructure Engineer

Gatik AI

Mountain View, CA

$180K - $240K/yr

Other

Posted 12 days ago


Job description

Who we are

Gatik, the leader in autonomous middle-mile logistics, is revolutionizing the B2B supply chain with its autonomous transportation-as-a-service (ATaaS) solution and prioritizing safe, consistent deliveries while streamlining freight movement by reducing congestion. The company focuses on short-haul, B2B logistics for Fortune 500 retailers and in 2021 launched the world's first fully driverless commercial transportation service with Walmart. Gatik's Class 3-7 autonomous trucks are commercially deployed across major markets, including Texas, Arkansas, and Ontario, Canada, driving innovation in freight transportation. 

The company's proprietary Level 4 autonomous technology, Gatik Carrier, is custom-built to transport freight safely and efficiently between pick-up and drop-off locations on the middle mile. With robust capabilities in both highway and urban environments, Gatik Carrier serves as an all-encompassing solution that integrates advanced software and hardware powering the fleet, facilitating effortless integration into customers' logistics operations. 

About the role

We are seeking a Senior Cloud Infrastructure Engineer to architect and manage the large-scale compute and data infrastructure powering our autonomous driving stack. While researchers develop perception, planning, and world models, your mission is to build the high-performance systems and pipelines that make their work possible. You will be the backbone of our AI platform, ensuring that multi-GPU clusters, distributed training frameworks, and automated workflows are scalable, resilient, and cost-effective.

This role is onsite 5 days a week at our Mountain View, CA office!

What you'll do
  • Cloud-Native Orchestration & Kubernetes
    • Advanced K8s Management: Architect and maintain mission-critical Kubernetes clusters optimized for heavy GPU/TPU workloads.
    • GPU Scheduling: Implement and optimize Kubernetes-native GPU scheduling (NVIDIA GPU Operator) to ensure maximum hardware utilization.
    • Infrastructure as Code: Drive the "Everything as Code" philosophy using Terraform, Helm, and cloud-native tools.
    • Self-Healing Infrastructure: Deploy Autonomous AI Agents (LangGraph, CrewAI) to monitor cluster health and enable automated triage of hardware failures and NCCL timeouts.
  • Data Engineering & CI/CD Pipelines
    • Autonomy Data Pipelines: Build large-scale pipelines using Apache Airflow, Kafka, and Spark to process raw sensor data into training-ready formats.
    • GitOps: Implement robust GitOps workflows using Argo CD and GitLab CI/CD to automate the deployment of both infrastructure and model artifacts.
    • Observability: Maintain deep visibility into infrastructure health and model serving performance using Prometheus, Grafana, and OpenTelemetry.
    • Agentic DevOps & CI/CD: Develop agent-driven workflows to optimize the developer experience, such as automated PR reviewers for Terraform and AI agents that proactively suggest Kubernetes resource-limit adjustments based on model training telemetry.
  • Model Management & Lifecycle (MLOps)
    • Experiment & Model Tracking: Design and maintain MLflow and feature store integrations to provide a robust system of record for every model iteration.
    • Workflow Automation: Build complex, automated model lifecycles using Airflow and Kubernetes to streamline the transition from training to simulation.
    • High-Performance Serving: Support the deployment of models into simulation and production environments using Triton Inference Server, Ray Serve, and ONNX Runtime.
  • Distributed Training & ML Systems Support
    • Training Systems Support: Enable researchers to scale models (VLA, World Models) across multi-node setups using PyTorch Distributed (TorchElastic), Ray Train, and Horovod.
    • Networking Optimization: Optimize low-level communication (e.g., NCCL tuning, InfiniBand, or RoCE v2) to minimize latency for 3D Gaussian Splatting (3DGS) and large-scale training.
    • Hardware-Aware Orchestration: Partner with researchers to fine-tune performance across multi-node GPU clusters for FSDP and DeepSpeed workloads.
What we're looking for
  • Experience: 5+ years in Cloud Infrastructure, DevOps, or MLOps supporting high-scale compute environments.
  • Kubernetes Mastery: Deep expertise in K8s, Helm, and container orchestration.
  • Orchestration & Tooling: Strong background in Apache Airflow, Argo Workflows, MLflow, and Terraform.
  • Distributed Systems: Practical experience supporting frameworks like Ray and PyTorch Distributed.
  • Core Skills: Proficiency in Python, Bash scripting, and a solid understanding of IAM/RBAC.
Bonus Qualifications
  • Distributed Training Expertise: Deep understanding of FSDP and DeepSpeed.
  • AI Agent Orchestration: Experience building Agentic Workflows (LangGraph, AutoGen) for infrastructure automation or data curation.
  • Advanced Protocols: Familiarity with Model Context Protocol (MCP) to connect AI agents with infrastructure tools.

Salary Range: $180,000 - $240,000

More about Gatik

Founded in 2017 by experts in autonomous vehicle technology, Gatik has rapidly expanded its presence to Mountain View, Dallas-Fort Worth, Arkansas, and Toronto. As the first and only company to achieve fully driverless middle-mile commercial deliveries, Gatik holds a unique and defensible position in the AV industry, with a clear trajectory toward sustainable growth and profitability.

We have delivered complete, proprietary AV technology, an integration of software and hardware, to enable earlier successes for our clients in constrained Level 4 autonomy. By choosing the middle mile, with its defined point-to-point delivery, we have simplified some of the more complex AV challenges, enabling us to achieve full autonomy ahead of competitors. Given extensive knowledge of Gatik's well-defined, fixed-route ODDs and hybrid architecture, we are able to hyper-optimize our models with exponentially less data, establish gate-keeping mechanisms to maintain explainability, and ensure continued safety of the system for unmanned operations.

Visit us at Gatik for more company information and Careers at Gatik for more open roles.

Notable News
  • Bloomberg: Autonomous Trucking Firm Gatik Inks Contracts Worth $600 Million
  • Forbes: 'Hundreds' Of Gatik Robot Delivery Trucks Headed For U.S. Roads
  • Forbes: Gatik And Loblaw Announce Largest Commercial Deployment Of AV Trucks
  • Forbes: Forget robotaxis. Upstart Gatik sees middle-mile deliveries as the path to profitable AVs
  • Tech Brew: Gatik AI exec unpacks the regulations that could shape the AV industry
  • Business Wire: Gatik Paves the Way for Safe Driverless Operations ('Freight-Only') at Scale with Industry-First Third-Party Safety Assessment Framework
  • Auto Futures: Autonomous Trucking Group Gatik Secures Investment From NIPPON EXPRESS HOLDINGS
  • Automotive News: Gatik foresees hundreds of self-driving trucks on road soon, and that's just the beginning
  • Forbes: Isuzu And Gatik Go All In To Scale Up Driverless Freight Services
  • Bloomberg: Autonomous Vehicle Startup Takes Off by Picking Off Easier Routes
  • Reuters: Driverless vehicles on limited routes bump along despite US robotaxi scrutiny
Taking care of our team

At Gatik, we connect people of extraordinary talent and experience to an opportunity to create a more resilient supply chain and contribute to our environment's sustainability. We are diverse in our backgrounds and perspectives yet united by a bold vision and shared commitment to our values. Our culture emphasizes the importance of collaboration, respect and agility.

We at Gatik strive to create a diverse and inclusive environment where everyone feels they have opportunities to succeed and grow because we know that together we can do great things. We are committed to an inclusive and diverse team. We do not discriminate based on race, color, ethnicity, ancestry, national origin, religion, sex, gender, gender identity, gender expression, sexual orientation, age, disability, veteran status, genetic information, marital status or any legally protected status.