... ML inference internals: attention, MLPs, recurrent modules, quantization, sparse operations, etc ... • Hands-on experience with CUDA, GPU programming, and key libraries (cuBLAS, cuDNN, NCCL, etc ...
Discover and configure inference services Interact with ML pipelines and workflows Monitor usage, health, and operational signals Establish best practices around testing, maintainability ...
Senior Product Manager - ROCm & AI/ML Inference Software
Santa Clara, CA · On-site
$179K/yr
... inference requirements and translates market signals into actionable product strategy. Open-Source Community Engagement * Serve as AMD's active presence in the open-source AI/ML community: monitor ...
Deep understanding of ML inference internals: attention, MLPs, recurrent modules, quantization, sparse operations, etc. * Hands-on experience with CUDA, GPU programming, and key libraries (cuBLAS ...
... of ML inference internals: attention, MLPs, recurrent modules, quantization, sparse operations, etc. • Hands-on experience with CUDA, GPU programming, and key libraries (cuBLAS, cuDNN, NCCL, etc ...
The Inference Enablement and Acceleration team is at the forefront of running a wide range of models and supporting novel architecture alongside maximizing their performance for AWS's custom ML ...
Collaborating with research teams on new ML serving capabilities * Driving technical decisions that shape the future of Neuron's inference stack About the team The Neuron Serving team is at the ...
Senior Software Development Engineer, AI/ML, AWS Neuron, Model Inference
Cupertino, CA · On-site
$128K - $177K/yr
The Inference Enablement and Acceleration team is at the forefront of running a wide range of models and supporting novel architecture alongside maximizing their performance for AWS's custom ML ...
Senior Product Manager - ROCm & AI/ML Inference Software
Santa Clara, CA · On-site
$149K - $197K/yr
... inference requirements and translates market signals into actionable product strategy. Open-Source Community Engagement * Serve as AMD's active presence in the open-source AI/ML community: monitor ...
New
Apple's Server ML Frameworks team in GPU, Graphics and Machine Learning works on enabling Apple Intelligence through high-performance, distributed inference of GenAI applications (such as LLMs) on ...
They are seeking a Staff AI Inference & Acceleration Engineer to own the on-board inference ... ML team to define model architecture constraints that are hardware-friendly from the outset. • ...
Our team primarily owns the orchestration layer that runs inference on our datacenter clusters which glues together the cloud components to the ML components. We are often the first team to face ...
ML Inference information
What is the difference between ML Inference and Data Scientist roles?
| Aspect | ML Inference | Data Scientist |
|---|---|---|
| Required Credentials | Knowledge of machine learning models, programming skills | Degree in data science, statistics, or related fields |
| Work Environment | Deploying models in production, real-time data processing | Data analysis, model development, research |
| Industry Usage | AI product deployment, software companies | Research institutions, tech firms, consulting |
ML Inference focuses on deploying trained models to make predictions on new data, often in real-time. Data Scientists develop and analyze models, working primarily in research and development. While both roles require understanding of machine learning, ML Inference emphasizes deployment and operationalization, whereas Data Scientists focus on model creation and analysis.

Full-time
This job post has expired today. Applications are no longer accepted.
Job description
Databricks is the data and AI company that empowers organizations to unify and democratize data, analytics, and AI. They are seeking a Software Engineer for GenAI inference to design, develop, and optimize the inference engine powering their Foundation Model API, working at the intersection of research and production.
Responsibilities:
• Contribute to the design and implementation of the inference engine, and collaborate on a model-serving stack optimized for large-scale LLM inference
• Collaborate with researchers to bring new model architectures or features (sparsity, activation compression, mixture-of-experts) into the engine
• Optimize for latency, throughput, memory efficiency, and hardware utilization across GPUs and accelerators
• Build and maintain instrumentation, profiling, and tracing tooling to uncover bottlenecks and guide optimizations
• Develop and enhance scalable routing, batching, scheduling, memory management, and dynamic loading mechanisms for inference workloads
• Support reliability, reproducibility, and fault tolerance in the inference pipelines, including A/B launches, rollback, and model versioning
• Integrate with federated, distributed inference infrastructure: orchestrate across nodes, balance load, and handle communication overhead
• Collaborate cross-functionally with platform engineers, cloud infrastructure, and security/compliance teams
• Document and share learnings, contributing to internal best practices and open-source efforts when possible
Qualifications:
Required:
• BS/MS/PhD in Computer Science or a related field
• Strong software engineering background (3+ years or equivalent) in performance-critical systems
• Solid understanding of ML inference internals: attention, MLPs, recurrent modules, quantization, sparse operations, etc.
• Hands-on experience with CUDA, GPU programming, and key libraries (cuBLAS, cuDNN, NCCL, etc.)
• Comfortable designing and operating distributed systems, including RPC frameworks, queuing, RPC batching, sharding, and memory partitioning
• Demonstrated ability to uncover and solve performance bottlenecks across layers (kernel, memory, networking, scheduler)
• Experience building instrumentation, tracing, and profiling tools for ML models
• Ability to work closely with ML researchers and translate novel model ideas into production systems
• Ownership mindset and eagerness to dive deep into complex system challenges
Preferred:
• Bonus: published research or open-source contributions in ML systems, inference optimization, or model serving
Company:
Databricks is a data and AI platform that unifies data engineering, analytics, and machine learning on a lakehouse architecture. Founded in 2013, the company is headquartered in San Francisco, USA, and has 5,001-10,000 employees. It is currently a late-stage company.
About Databricks
Sourced by ZipRecruiter
Industry
Software development
Company size
5,001 - 10,000 Employees
Headquarters location
San Francisco, CA, US
Year founded
2013