Remote CUDA Developer Jobs (NOW HIRING)

Andromeda Cluster, Inc

Senior Site Reliability Engineer - AI Infrastructure

San Francisco, CA · On-site +1

$67.25 - $89.25/hr

Global Remote / San Francisco • Full-Time About Andromeda Andromeda Cluster was founded by Nat ... Working knowledge of how large training jobs actually run - NCCL, CUDA, PyTorch distributed ...

Reka

Member of Technical Staff (GPU Performance Engineer)

Experience writing and debugging low-level GPU code (CUDA, C++). * Experience scaling up GPU jobs ... Embracing a remote-first approach, our team brings together top talent from around the world. Our ...

Cerence

Sr. Principal Software Engineer

$185K - $280K/yr

QAIRT * Extend and tune inference engines using custom CUDA kernels * Adapt runtimes for ... Remote and/or hybrid work available depending on the position All compensation and benefits are ...

Andromeda Cluster, Inc

Site Reliability Engineer - AI Infrastructure

San Francisco, CA · On-site +1

$67.25 - $89.25/hr

Global Remote / San Francisco • Full-Time About Andromeda Andromeda Cluster was founded by Nat ... Exposure to ML/AI infrastructure or GPU-based systems (CUDA, Slurm, Triton, etc.). * Familiarity ...

De Circle

Hyperbolic Labs - Senior GPU Infrastructure Engineer

San Francisco, CA · On-site +1

$127K - $173K/yr

... remote management, PXE boot, and automated OS deployment workflows * Deep understanding of GPU ... Strong infrastructure and DevOps engineering skills with proficiency in Terraform or Pulumi, CI/CD ...

ARKA Group

Principal AI/ML Engineer (Large Language Model) (TS/SCI) {S}

Aurora, CO · Remote

Apply Large Language Models (LLMs) to a variety of applications within remote sensing such as ... Experience implementing algorithms on the GPU in Python or C++ using CUDA and other CUDA libraries

Careers - Stratagem - Make a Lasting Impact

Principal AI/ML Engineer (Large Language Model) (TS/SCI) {S}

Aurora, CO · Remote

Quantiphi, Inc.

Architect - Platform Engineer

US East/Canada (Remote) Role Overview: We are looking for a highly skilled Architect - Platform ... Enable and optimize the NVIDIA GPU stack (CUDA, cuDNN, NCCL, Triton, RAPIDS, etc.) * Collaborate ...

Careers - Stratagem - Make a Lasting Impact

Principal AI/ML Engineer (Large Language Model) (TS/SCI) {S}

King Of Prussia, PA · Remote

Nebius

Customer Engineer

... CUDA, OpenCL). * Customer-centric approach with a proven ability to build trust and foster ... Remote Work Reimbursement: Up to $85/month for mobile and internet. * Disability & Life Insurance:

Cyngn

Senior DevOps Lead - Cloud & Autonomous System

Mountain View, CA · Remote

$133K - $170K/yr

Expertise in ARM and NVIDIA CUDA platform configurations * Strong programming skills in Python and ... Monthly meal and tech allowances for remote employees We may use artificial intelligence (AI) tools ...

ARKA Group

Principal AI/ML Engineer (Large Language Model) (TS/SCI) {S}

King Of Prussia, PA · Remote

The Pennsylvania State University

Signal Processing Engineer

$121K - $231K/yr

Approval of remote and hybrid work is not guaranteed regardless of work location. For additional ... Python, C++, CUDA, time-series analysis, and signal processing * Ability to express yourself and ...

Achira

Machine Learning Research Engineer (MLRE) - GPUs

New York, NY · On-site +1

$164K - $259K/yr

... remote candidates of exceptional talent who are willing to travel frequently to our two office ... Develop with frameworks like CUDA, Triton, Warp, etc. to accelerate performance critical code ...

Prime Intellect

Member of Technical Staff - Inference

San Francisco, CA · On-site +1

$150 - $300/hr

Comfortable debugging CUDA/NCCL, drivers/kernels, containers, service mesh/networking, and storage ... Flexible work arrangement (remote or San Francisco office) * Full visa sponsorship and relocation ...

Wood Plc

Software Development Engineer

Topeka, KS · On-site +1

NVIDIA, CUDA Python, PyCUDA, etc.) * Field Collection/Mobile Applications (IOS, Android, etc.) and ... Experience with data processing imagery and LiDAR from web services, UAS, or other remote sensing ...

Booz Allen Hamilton

HPC Engineer, Mid

$61K - $141K/yr

Remote Work: No Job Number: R0240542 Location: Beavercreek, OH, US ... Experience with GPU computing, including CUDA and ROCm, or GPU-accelerated workflows * Experience ...

InfinitForm, Inc

Software Engineer, Finite Element Analysis & Structural Optimization

Los Angeles, CA · On-site +1

Full-time | Hybrid (LA / Orange County) or Remote About InfinitForm InfinitForm is building the ... Experience with GPU computing (CUDA, OpenCL, or similar) * Familiarity with parallel computing or ...

MatX

Runtime Engineer

Mountain View, CA · On-site +1

$175K - $362K/yr

Hands-on with at least one accelerator programming model (CUDA, ROCm, oneAPI Level Zero, TPU, or ... Remote Perks We work remotely Monday & Friday, supported by home-tech setup, and remote wifi ...

Lila Sciences

Research Engineer, Frontier Capabilities

Cambridge, MA · On-site +1

C++/CUDA a plus * Experience with distributed ML training frameworks (Megatron-LM, TorchTitan ... Experience training MoE architectures Location San Francisco, CA or Cambridge, MA (Remote, Hybrid ...

Remote CUDA Developer information

Salary chart (values as displayed): $83.5K, $102.5K, $135.5K

How much do remote CUDA developer jobs pay per year?

As of Jul 6, 2026, the average yearly pay for a remote CUDA developer in the United States is $102,500, according to ZipRecruiter salary data. Most workers in this role earn between $90,000 and $115,000 per year, depending on experience, location, and employer.

How does a Remote CUDA Developer typically collaborate with team members across different locations?

As a Remote CUDA Developer, you will frequently collaborate with cross-functional teams such as data scientists, software engineers, and product managers through virtual meetings, code reviews, and collaborative platforms like GitHub or GitLab. Clear communication and thorough documentation are essential since team members may be in different time zones. You can expect to participate in regular stand-ups, sprint planning, and pair programming sessions, which keep your GPU-accelerated code aligned with, and smoothly integrated into, larger projects. Tools like Slack, Zoom, and project management platforms help maintain connectivity and workflow efficiency.

What is a Remote CUDA Developer?

A Remote CUDA Developer is a software engineer who specializes in using NVIDIA's CUDA (Compute Unified Device Architecture) platform to develop parallel computing applications, often for high-performance tasks like machine learning, scientific computing, or data analysis. They work remotely, collaborating with teams online rather than being physically present in an office. These developers write and optimize code to run efficiently on NVIDIA GPUs, enabling applications to process large amounts of data much faster than traditional CPU-only solutions.

What are the key skills and qualifications needed to thrive as a Remote CUDA Developer, and why are they important?

To thrive as a Remote CUDA Developer, you need strong proficiency in C/C++ programming, parallel computing concepts, and a solid understanding of GPU architecture, typically backed by a degree in computer science or a related field. Experience with the NVIDIA CUDA Toolkit, GPU debugging tools, and version control systems like Git is commonly required. Excellent problem-solving skills, self-motivation, and effective remote communication abilities help distinguish high performers in this role. These skills are vital for efficiently delivering high-performance computing solutions and collaborating seamlessly with distributed teams.

What is the difference between a Remote CUDA Developer and a Remote Machine Learning Engineer?

Aspect	Remote CUDA Developer	Remote Machine Learning Engineer
Required Credentials	CUDA programming certifications, computer science degree	Machine learning certifications, data science background
Work Environment	Software development, GPU optimization	Model development, data analysis
Industry Usage	High-performance computing, gaming, AI	AI, data science, predictive modeling

Remote CUDA Developers focus on GPU programming and optimization using CUDA, primarily in high-performance computing and AI applications. Remote Machine Learning Engineers develop and deploy machine learning models, often utilizing GPU resources but with a broader focus on data and algorithms. While both roles may involve GPU expertise, CUDA Developers specialize in low-level programming, whereas Machine Learning Engineers work on model development and deployment.

Remote CUDA Developer jobs near you

Senior Site Reliability Engineer - AI Infrastructure

Andromeda Cluster, Inc

San Francisco, CA • On-site, Remote

$67.25 - $89.25/hr

Full-time

Posted 15 days ago

Job description

Senior Site Reliability Engineer - AI Infrastructure
Location: Global Remote / San Francisco • Full-Time
About Andromeda
Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved only for hyperscalers.
We began with a single managed cluster - but it filled almost instantly. Since then, we've been quietly building the systems, network, and orchestration layer that makes the world's AI infrastructure more accessible.
Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it's needed most. Our platform routes training and inference jobs across global supply, unlocking flexibility and efficiency in one of the fastest-growing markets on earth.
Our long-term vision is to build the liquidity layer for global AI compute - a marketplace that moves the infrastructure and workloads powering AGI in a way not dissimilar to the flows of capital in the world's financial markets.
We are expanding to new frontiers to find the brightest people working in AI infrastructure, research, and engineering.
The Role
This is not a generalist SRE role.
You will design, operate, and debug large-scale GPU infrastructure used for distributed training and inference, working directly with customers pushing the limits of modern AI systems.
We're looking for engineers who have personally run GPU clusters in production, understand the failure modes of distributed training, and can reason about performance from network fabric → kernel → framework.
What You'll Own

GPU Cluster Architecture: Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training. Make topology-aware scheduling, networking, and storage decisions that directly impact training throughput and cost efficiency.
Customer Technical Partnership: Serve as the primary technical point of contact for customers running large-scale training workloads. Onboard, troubleshoot, and optimize, often in real time.
Reliability & Performance Engineering: Define SLOs and error budgets that account for the unique failure modes of GPU infrastructure (ECC errors, NVLink degradation, NCCL timeouts). Own capacity planning across heterogeneous GPU fleets optimized for training throughput.
Networking & Fabric Health: Ensure the health and performance of high-speed interconnects (InfiniBand, RoCE, NVLink) that underpin distributed training. Diagnose and resolve fabric-level issues that degrade collective operations.
Observability: Build deep visibility into GPU utilization, memory pressure, interconnect throughput, training job performance, and hardware health. Go well beyond standard infrastructure metrics.
Automation & Tooling: Build production-grade automation for cluster provisioning, GPU health checks, job scheduling, self-healing, and firmware/driver lifecycle management.
Incident Leadership: Lead incident response for complex, multi-layer failures spanning hardware, networking, orchestration, and ML frameworks. Drive blameless postmortems and systemic fixes.

What We're Looking For

GPU Systems Expertise: Deep, hands-on experience operating large-scale GPU clusters (NVIDIA A100/H100/B200 or equivalent). You understand GPU memory hierarchies, ECC behavior, thermal throttling, and hardware failure modes from direct experience, not documentation.
High-Performance Networking: Production experience with InfiniBand, RoCE, or NVLink fabrics in the context of distributed training. You can diagnose why an all-reduce is slow, identify a degraded link in a fat-tree topology, and reason about congestion control at scale.
Distributed Training & ML Frameworks: Working knowledge of how large training jobs actually run - NCCL, CUDA, PyTorch distributed, DeepSpeed, Megatron, FSDP, or similar. You don't need to write the models, but you need to understand what's happening at the systems level when a 1,000-GPU training run stalls.
Linux & Systems Internals: Expert-level Linux knowledge: kernel tuning, driver management (NVIDIA drivers, CUDA toolkit), cgroup/namespace internals, performance profiling at the syscall and hardware level.
Kubernetes & Orchestration: Strong experience running Kubernetes in production with GPU workloads, including device plugins, topology-aware scheduling, multi-cluster federation, and custom operators. Experience with Slurm or other HPC schedulers is equally valued.
Automation & Software Engineering: Strong engineering skills in Python, Go, or Bash. You build production-grade tools and services, not just scripts. Infrastructure-as-Code proficiency (Terraform, Helm, Ansible, or equivalent).
Observability & Monitoring: Hands-on experience building monitoring and alerting for GPU infrastructure: not just Prometheus/Grafana basics, but GPU-specific telemetry (DCGM, nvidia-smi, fabric manager metrics) integrated into actionable dashboards.
Incident Management: Proven track record leading incident response for complex distributed systems where the failure could be in hardware, firmware, networking, drivers, orchestration, or application code, and you need to narrow it down fast.

Strong Candidates May Have

Distributed Storage: Experience with high-performance parallel file systems (VAST, Weka, Lustre, GPFS) and the checkpoint I/O and data-loading bottlenecks that come with large training runs.
Training Optimization: Experience profiling and optimizing distributed training performance: identifying stragglers, tuning collective communication strategies, improving MFU (Model FLOPs Utilization), and reducing idle GPU time across large runs.
Cluster Buildout & Hardware: Experience with physical cluster design - rack layout, power/cooling constraints, network topology design, and hardware validation/burn-in at scale.
Team Leadership: Experience leading or mentoring a team of infrastructure engineers. We're growing and need people who raise the bar for everyone around them.

Why You'll Love It Here
This is a high-impact, senior builder's role. You'll have significant ownership and autonomy to shape how our systems run at a foundational level, working directly with customers and providers while architecting the infrastructure backbone for reliable, scalable AI compute. You'll influence technical direction and help define what world-class AI infrastructure operations look like.
Andromeda Cluster is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
