Deep Learning Compression Jobs (NOW HIRING)

Senior Perception Learning Engineer

Sunnyvale, CA · On-site

$122K - $167K/yr

... deep learning approaches. • Expertise in model acceleration, quantization, or compression (TensorRT, ONNX Runtime). • Familiarity with real-time frameworks and middleware such as ROS 2, GStreamer ...

Strong classical computer vision skills (geometry-based methods, feature extraction) complementing deep learning approaches. * Expertise in model acceleration, quantization, or compression (TensorRT ...

About the Role EnCharge AI is looking for an experienced AI Research Engineer to optimize deep learning models for deployment on edge AI platforms. You will work on model compression, quantization ...

Design, implement, and refine deep learning models to ensure efficiency, scalability, and ... Optimize inference performance, model compression, and deployment across various hardware platforms ...

Principal Machine Learning Engineer

Boston, MA · On-site +1

$189K - $312K/yr

You will collaborate with our technical and research teams to develop LLM training and deployment pipelines, implement model compression algorithms, and productize deep learning research. If you are ...

Senior / Staff Machine Learning Engineer

Austin, TX · On-site

$124K - $171K/yr

Optimize inference performance, model compression, and deployment across various hardware platforms. * Explore and Apply Cutting-Edge ML Techniques: Stay current with advancements in deep learning ...

Senior Perception Learning Engineer

Sunnyvale, CA · On-site

$122K - $168K/yr

Strong classical computer vision skills (geometry-based methods, feature extraction) complementing deep learning approaches. * Expertise in model acceleration, quantization, or compression (TensorRT ...

Senior Machine Learning Engineer

Boston, MA · On-site +1

$174K - $287K/yr

You will collaborate with our technical and research teams to develop LLM training and deployment pipelines, implement model compression algorithms, and productize deep learning research. If you are ...

... compression, and low-power operation. If you enjoy great rewards, Ambarella has it all, great ... Training and optimization of deep learning/ML based computer vision algorithm for edge devices.

Deep Learning Compression information

Salary chart values: $11K, $83.9K, $140K

How much do deep learning compression jobs pay per year?

As of Jun 7, 2026, the average yearly pay for deep learning compression jobs in the United States is $83,885, according to ZipRecruiter salary data. Most workers in this role earn between $72,000 and $139,000 per year, depending on experience, location, and employer.

What are the typical challenges faced when working on deep learning compression projects?

Professionals in deep learning compression often encounter challenges balancing model size reduction with maintaining high accuracy. Adapting compression techniques—such as pruning, quantization, or knowledge distillation—to different architectures and datasets requires both strong technical knowledge and experimentation. Collaboration with data scientists and software engineers is common, as solutions must be integrated into production systems without sacrificing performance. Staying up to date with rapid advances in compression research is also essential to remain effective and innovative in this role.

What are the key skills and qualifications needed to thrive as a Deep Learning Compression Engineer, and why are they important?

To thrive as a Deep Learning Compression Engineer, you need a strong background in deep learning, machine learning, and mathematics, typically supported by a degree in computer science or a related field. Proficiency with frameworks like TensorFlow or PyTorch, experience with model compression techniques (such as pruning, quantization, and knowledge distillation), and familiarity with hardware accelerators are essential. Strong problem-solving skills, attention to detail, and effective communication help you innovate and collaborate with research and engineering teams. These skills are critical for developing efficient AI models that meet performance and resource constraints in real-world applications.

What is the difference between Deep Learning Compression and Machine Learning Engineer roles?

| Aspect | Deep Learning Compression | Machine Learning Engineer |
| --- | --- | --- |
| Required Credentials | Bachelor's or Master's in Computer Science, AI, or related fields; knowledge of neural networks | Bachelor's or Master's in Computer Science, AI, or related fields; programming skills |
| Work Environment | Research labs, AI development teams, tech companies focusing on model optimization | Software development teams, AI startups, tech firms building ML applications |
| Industry Usage | AI model deployment, edge computing, mobile AI applications | Developing ML models, data analysis, AI product development |

Deep Learning Compression focuses on reducing model size and improving efficiency of neural networks, often for deployment on limited hardware. Machine Learning Engineers develop, train, and optimize ML models across various applications. While both roles require knowledge of AI and neural networks, Deep Learning Compression specializes in model optimization techniques, whereas Machine Learning Engineers work broadly on model development and deployment.

What is deep learning compression?

Deep learning compression refers to techniques used to reduce the size, memory footprint, and computational requirements of deep neural networks without significantly sacrificing their performance. This is important for deploying models on resource-constrained devices such as smartphones or embedded systems. Common methods include pruning, quantization, knowledge distillation, and low-rank factorization. These approaches help make deep learning models more efficient and practical for real-world applications.
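
As a rough illustration of two of the methods named above, the sketch below applies magnitude pruning and symmetric 8-bit quantization to a small list of weights in plain Python. The function names, the example weights, and the 50% sparsity level are hypothetical choices made for this example; real projects would typically use a framework's own compression tooling rather than code like this.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)  # number of weights to drop
    # Indices of the k smallest-magnitude weights.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map quantized integers back to approximate float weights."""
    return [v * scale for v in q]


# Hypothetical example weights, not taken from any real model.
weights = [0.82, -0.03, 0.40, 0.01, -1.27, 0.15]
pruned = prune_by_magnitude(weights, sparsity=0.5)
q, scale = quantize_int8(pruned)
restored = dequantize(q, scale)
```

Here half of the weights become zero, and each remaining weight is stored as a small integer plus one shared scale factor. The restored values differ from the pruned ones by at most half a quantization step, which is the size-versus-accuracy trade-off described above.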
Infographic: Deep Learning Compression job openings in the United States as of May 2026. Employment type: 100% full-time. Work arrangement: 33% in-person, 67% hybrid. Average salary: $83,885 per year, or $40.3 per hour.
Senior Machine Learning Engineer, Runtime and Serving

Waymo

Mountain View, CA • On-site

$123K - $169K/yr

Other

Posted 8 days ago


Job description

The ML Optimization team at Waymo provides a set of tools to support and automate the lifecycle of the machine learning workflow, including feature and experiment management, model development, optimization and monitoring. These efforts have resulted in making machine learning more accessible to teams at Waymo, including Perception, Planner, Research and Simulation.

We are looking for engineers with ML software and systems expertise to help build the next-generation onboard ML inference engine for Waymo's fundamental model. You'll work across the entire ML stack from the system perspective, including efficient deep learning models, model compression, and ML software (e.g., JAX, XLA, Triton, and CUDA). You will be pleasantly challenged with deploying Waymo ML models on limited compute resources. In this hybrid role, you will report to the Senior Manager of Runtime and Optimization.

You will:

  • Architect and develop an efficient, high-performance ML runtime and serving system tailored for both onboard autonomous vehicle compute and large-scale, offboard data center environments.
  • Lead the integration and feature development for ML inference runtimes across both domains, balancing the strict real-time latency and memory constraints of onboard systems with the high-throughput, highly concurrent demands of offboard serving fleets.
  • Drive the strategic migration of ML workloads toward a JAX-native runtime architecture, which includes extending and modifying underlying ML compilers and runtimes (e.g., OpenXLA/PjRT, TensorRT).
  • Collaborate with world-class Waymo ML practitioners across perception, planner, and research to analyze system-level ML workloads and apply hardware-aware compute optimizations.
  • Design and build robust tooling for profiling, benchmarking, and identifying system-level bottlenecks across the end-to-end ML software stack.

You Have:

  • B.S. or M.S. in CS, EE, Deep Learning, or a related field.
  • 5+ years of professional software engineering experience focused on building, scaling, or maintaining ML systems and infrastructure.
  • 5+ years of production programming in C++.
  • 3+ years of production experience in Python and major deep learning frameworks (e.g., PyTorch, JAX).
  • Experience optimizing ML software for hardware accelerators (e.g., GPUs, TPUs, custom silicon).
  • Experience building low-latency, highly concurrent distributed backend systems.

We Prefer:

  • PhD in CS, EE, Deep Learning or a related field.
  • Experience modifying ML compilers, runtimes, or inference engines (e.g., TensorRT, ONNX Runtime, OpenXLA/PjRT, TVM).
  • Experience building or scaling LLM serving systems, including expertise in distributed inference and performance optimization (e.g., KV/prefix caching, continuous batching).
  • Experience with custom kernel development (e.g., CUDA/CUDA Tile, Triton, JAX/Pallas).
  • Experience architecting unified serving APIs and optimizing tensor buffer management (e.g., zero-copy data transfer, shared memory) for complex, multi-model inference pipelines.