CUDA Machine Learning Performance Engineer Jobs

CUDA Machine Learning Performance Engineer information

Salary range: $109K to $141K

How much do CUDA machine learning performance engineer jobs pay per year?

As of Jun 3, 2026, the average yearly pay for a CUDA machine learning performance engineer in the United States is $139,529, according to ZipRecruiter salary data. Most workers in this role earn about $140,000 per year, with pay varying by experience, location, and employer.

What are the key skills and qualifications needed to thrive as a CUDA Machine Learning Performance Engineer, and why are they important?

To thrive as a CUDA Machine Learning Performance Engineer, you need strong expertise in parallel programming, GPU architectures, and a solid background in computer science or related fields. Familiarity with CUDA, performance profiling tools (like Nsight), and deep learning frameworks such as TensorFlow or PyTorch is typically required. Analytical thinking, problem-solving, and clear communication are crucial soft skills for diagnosing performance bottlenecks and collaborating with cross-functional teams. These skills ensure optimal machine learning implementations, efficient resource utilization, and advancement of high-performance computing solutions.

What are some common challenges faced by a CUDA Machine Learning Performance Engineer when optimizing ML workloads?

CUDA Machine Learning Performance Engineers often encounter challenges in identifying and resolving performance bottlenecks within GPU-accelerated ML pipelines. Balancing memory usage, maximizing parallelism, and minimizing data transfer between the CPU and GPU are key concerns. Engineers must also keep up with rapid advancements in both hardware and software frameworks, which requires continuous learning and adaptation. Collaboration with data scientists and software engineers is frequent, because high-level ML models have to be translated into efficient, scalable GPU implementations.
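To make the CPU-GPU transfer concern concrete, here is a minimal back-of-the-envelope sketch in Python. It estimates the time for one pipeline step when the host-to-device copy and the GPU compute run back to back, and when they fully overlap. The function names and every number (batch size, link bandwidth, compute time) are hypothetical assumptions chosen for illustration, not measurements from any real system.

```python
def transfer_time_s(num_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Idealized time to move num_bytes over a link (ignores latency and overhead)."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return num_bytes / bandwidth_bytes_per_s


def step_time_s(transfer_s: float, compute_s: float, overlapped: bool) -> float:
    """Step time when copy and compute run serially, or when they fully overlap."""
    if overlapped:
        # With perfect overlap the slower of the two stages sets the pace.
        return max(transfer_s, compute_s)
    return transfer_s + compute_s


# Hypothetical inputs: a 256 MiB batch, a 16 GB/s effective link, 10 ms of GPU compute.
batch_bytes = 256 * 1024**2
link_bandwidth = 16e9
compute_s = 0.010

copy_s = transfer_time_s(batch_bytes, link_bandwidth)
serial_s = step_time_s(copy_s, compute_s, overlapped=False)
overlap_s = step_time_s(copy_s, compute_s, overlapped=True)

print(f"copy {copy_s * 1e3:.2f} ms, "
      f"serial step {serial_s * 1e3:.2f} ms, "
      f"overlapped step {overlap_s * 1e3:.2f} ms")
```

Under these assumed numbers the copy takes longer than the compute, so the step is transfer-bound even with perfect overlap. In practice the inputs would come from a profiler rather than from guesses.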

What are CUDA Machine Learning Performance Engineers?

CUDA Machine Learning Performance Engineers are specialized professionals who optimize and accelerate machine learning applications using NVIDIA's CUDA platform. They analyze code performance on GPUs, identify bottlenecks, and implement improvements to maximize computational efficiency. Their work often involves collaborating with data scientists and software developers to ensure machine learning algorithms run efficiently on CUDA-enabled hardware. They are proficient in parallel programming, GPU architectures, and performance profiling tools. Their expertise helps organizations achieve faster model training and inference, leading to more effective use of hardware resources.

What is the difference between a CUDA Machine Learning Performance Engineer and a Data Scientist?

Aspect | CUDA Machine Learning Performance Engineer | Data Scientist
Required Credentials | Knowledge of CUDA, GPU programming, machine learning frameworks | Statistics, programming, data analysis skills, often a degree in data science or related fields
Work Environment | Technical teams focused on optimizing ML models for GPU hardware | Data analysis, model development, business insights
Industry Usage | Tech, AI, high-performance computing sectors | Finance, healthcare, marketing, tech

A CUDA Machine Learning Performance Engineer specializes in optimizing machine learning models for GPU hardware using CUDA, focusing on performance and efficiency. In contrast, a Data Scientist primarily develops and analyzes models to extract insights from data. While both roles require a strong understanding of machine learning, the Performance Engineer emphasizes technical optimization, whereas the Data Scientist focuses on data analysis and model interpretation.

Machine Learning Performance Engineer - CUDA Python

3B Staffing LLC

Saint Louis, MO • Remote

$136.10K/yr

Contractor

Posted 7 days ago


Job description

Rate - + Expenses paid for travel
Three positions are open in total. Candidates need a pre-sales mindset, since they will meet with clients during the pre-sales process, and really strong communication skills.
- Machine Learning Performance Engineer - CUDA Python -
U.S. Citizenship Status: U.S. Citizen; Green Card; Other legal status
Duration: 6 month contract with the likelihood to extend
Location: Remote but candidates must be willing to travel to different customer sites.
*Must be willing to travel
*Must have strong pre-sales abilities, e.g. presentation and communication skills
*Must be willing to help train WWT employees and customers
Position Category: Infrastructure
Job Description: Your part here is optimizing the performance of our models - both training and inference. We care about efficient large-scale training, low-latency inference in real-time systems, and high-throughput inference in research. Part of this is improving straightforward CUDA, but the interesting part needs a whole-systems approach, including storage systems, networking, and host- and GPU-level considerations. Zooming in, we also want to ensure our platform makes sense even at the lowest level - is all that throughput actually goodput? Does loading that vector from the L2 cache really take that long?
• An understanding of modern ML techniques and toolsets
• The experience and systems knowledge required to debug a training run's performance end to end
• Low-level GPU knowledge of PTX, SASS, warps, cooperative groups, Tensor Cores, and the memory hierarchy
• Debugging and optimization experience using tools like CUDA-GDB, Nsight Systems, and Nsight Compute
• Library knowledge of Triton, CUTLASS, CUB, Thrust, cuDNN, and cuBLAS
• Intuition about the latency and throughput characteristics of CUDA graph launch, tensor core arithmetic, warp-level synchronization, and asynchronous memory loads
• Background in InfiniBand, RoCE, GPUDirect, PXN, rail optimization, and NVLink, and how to use these networking technologies to link up GPU clusters
• An understanding of the collective algorithms supporting distributed GPU training in NCCL or MPI
• An inventive approach and the willingness to ask hard questions about whether we're taking the right approaches and using the right tools
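
For readers unfamiliar with the collective algorithms mentioned in the list above, the following is a minimal pure-Python simulation of a ring all-reduce (a reduce-scatter phase followed by an all-gather phase). It is an illustrative sketch only, not how NCCL or MPI implement the operation; those libraries choose among several algorithms and run on real interconnects. The function name `ring_all_reduce` and the list-of-lists representation of per-rank buffers are assumptions made for this example.

```python
def ring_all_reduce(buffers):
    """Simulate a sum all-reduce over a ring; buffers[r] is rank r's list of numbers."""
    n = len(buffers)
    if n == 0:
        return []
    length = len(buffers[0])
    if any(len(b) != length for b in buffers):
        raise ValueError("all ranks must hold buffers of the same length")
    if length % n != 0:
        raise ValueError("buffer length must be divisible by the number of ranks")
    chunk = length // n
    data = [list(b) for b in buffers]  # work on copies, leave the inputs untouched

    def span(c):
        """Slice covering chunk index c."""
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: after n - 1 steps, rank r holds the full sum
    # for chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing payloads first so every rank "sends" simultaneously.
        sends = [((r - step) % n, data[r][span((r - step) % n)]) for r in range(n)]
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            data[dst][span(c)] = [a + b for a, b in zip(data[dst][span(c)], payload)]

    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, data[r][span((r + 1 - step) % n)]) for r in range(n)]
        for r, (c, payload) in enumerate(sends):
            data[(r + 1) % n][span(c)] = payload

    return data


ranks = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
print(ring_all_reduce(ranks))
```

Each rank sends only one chunk per step, so the data moved per rank is 2(n - 1)/n times the buffer size regardless of the number of ranks, which is why ring-style schedules are attractive for bandwidth-bound reductions of large buffers.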