The AI Scientist will build multimodal foundation models for biological systems, focusing on learning from complex data and developing models that can encode rich representations and generate ...
[EOI] Postdoctoral Associate in Multimodal AI | Professor Saining Xie
New York, NY · On-site
$62K - $125K/yr
Conducting original research in multimodal learning, including model design, training, and evaluation * Developing scalable methods for aligning and integrating diverse data modalities
Help train and develop multimodal learning models using advanced learning techniques including RAG, self-supervised learning, semi-supervised, and transductive learning. Requirements Desired ...
Research Scientist in Multimodal Interaction and World Model - Seed - Graduates - 2027 Start (BS/MS)
San Jose, CA · On-site
Preferred : • Experience in multimodal learning, reinforcement learning, or agent systems through internships is preferred. • Strong problem-solving and collaboration skills. Company : ByteDance ...
Senior Staff Machine Learning Scientist, Assets
OR · On-site +1
$91K - $124K/yr
Design, implement, train, and optimize large-scale vision and multimodal foundation models across ... Proficiency in modern deep learning frameworks such as PyTorch and TensorFlow. * Demonstrated ...
Member of Technical Staff, Multimodal Vision
San Jose, CA · On-site
$180K - $450K/yr
Experience with large-scale machine learning systems and distributed training. * Strong background ... Experience with multimodal systems (vision + text, vision + audio) or real-time AI systems is a ...
Lead R&D portfolio involving machine learning on heterogeneous sensors (e.g., radar, audio, RF, IMU ... Multimodal learning and sensor fusion * High-frequency signal modeling and representation learning
Senior Staff Machine Learning Scientist, Assets
$93K - $127K/yr
Design, implement, train, and optimize large-scale vision and multimodal foundation models across ... Proficiency in modern deep learning frameworks such as PyTorch and TensorFlow. * Demonstrated ...
Senior/Staff Applied Scientist, Multimodal Representation Learning (Oncology)
New York, NY · On-site
$150K - $200K/yr
Frontier AI (representation learning, multimodal learning, alignment, evaluation) * Messy biomedical reality (clinical endpoints, censoring, confounding, missingness, batch effects) * Mechanism ...
2026 Fall Applied Science Internship - Natural Language Processing and Speech Technologies - Unit...
Seattle, WA · On-site
$17 - $22.75/hr
NLP/NLU, LLMs, Reinforcement Learning, Human Feedback/HITL, Deep Learning, Speech Recognition, Conversational AI, Natural Language Modeling, Multimodal Learning. In this role, you will work alongside ...
Machine Learning Research Engineer, SIML - ISE
Cupertino, CA · On-site
$252K/yr
As a Machine Learning Research Engineer, you will help design and develop models and algorithms for multimodal perception and reasoning leveraging Vision-Language Models (VLMs) and Multimodal Large ...
Develop engaging, multimodal course content, including storyboards, scripts, and supporting ... Work within Learning Management Systems (LMS) and create SCORM-compliant content to ensure ...
Multimodal Learning information
| Salary range (per year) | Share of jobs |
|---|---|
| $21K - $29.5K | 10% |
| $29.5K - $38K | 14% |
| $38K - $46.5K | 10% |
| $46.5K - $55K | 12% |
| $55K - $63.5K | 20% |
| $63.5K - $72K | 15% |
| $72K - $80.5K | 4% |
| $80.5K - $89K | 2% |
| $89K - $97.5K | 4% |
| $97.5K - $106K | 0% |
| $106K - $114.5K | 9% |
$39.2K is the 25th percentile. Wages below this are outliers. The median wage is $57K / yr. $68.8K is the 75th percentile. Wages above this are outliers.
Scale markers on the chart: $21K, $61.7K, $114.5K.
What is the difference between a Multimodal Learning role and a Data Scientist role?
| Aspect | Multimodal Learning | Data Scientist |
|---|---|---|
| Required Credentials | Advanced degrees in AI, Machine Learning, or Computer Science | Bachelor's or Master's in Data Science, Statistics, or related fields |
| Work Environment | Research labs, AI development teams, academia | Business, tech companies, analytics teams |
| Industry Usage | AI research, multimedia applications, robotics | Data analysis, predictive modeling, business insights |
Multimodal Learning focuses on developing AI models that process and integrate multiple data types like images, text, and audio. Data Scientists analyze data to extract insights, build models, and support decision-making. While both roles involve data and algorithms, Multimodal Learning is specialized in AI model development for complex data integration, whereas Data Scientists work broadly across data analysis and interpretation.

Other
Posted 10 days ago
Job description
Focus
Multimodal Foundation Models · Representation Learning · Method Innovation
We are looking for strong technical builders and researchers who deeply understand foundation models and representation learning beyond simply applying existing frameworks.
Ideal candidates should have:
- Strong experimental rigor
- Solid systems and modeling intuition
- Hands-on engineering ability
- Interest in scalable multimodal AI systems for real-world autonomy
We value people who can bridge research and production, and who care about robustness, scalability, efficiency, and practical deployment in large-scale autonomous driving systems.
Responsibilities
1. Large-Scale Foundation Model Pretraining
- Develop scalable pretraining pipelines for large-scale multimodal driving data
- Design and optimize training strategies for:
  - Vision-language-action models
  - Video foundation models
  - Long-context temporal modeling
  - Multimodal representation alignment
- Improve:
  - Training stability
  - Data efficiency
  - Scaling efficiency
  - Representation robustness
- Work on distributed training systems and large-scale model optimization using frameworks such as:
  - PyTorch Distributed
  - DeepSpeed
  - Megatron-LM
2. Representation Learning & Method Innovation
- Design and improve self-supervised and multimodal learning methods for real-world autonomous driving systems
- Conduct architecture-level research on:
  - Vision Transformers (ViT)
  - Video / temporal architectures
  - Multimodal fusion and alignment
  - Embedding and retrieval systems
  - Long-context and memory-efficient architectures
- Explore and improve:
  - Pretraining objectives
  - Loss functions
  - Training paradigms
  - Generalization and robustness
- Analyze model behavior through:
  - Rigorous ablation studies
  - Failure case analysis
  - Representation probing and evaluation
3. Efficient Foundation Models & Scalable Deployment
- Improve the efficiency, scalability, and deployability of large multimodal foundation models for real-world autonomous driving systems
- Work on areas such as:
  - Model quantization
  - Knowledge distillation
  - Efficient attention mechanisms
  - Sparse architectures and Mixture-of-Experts (MoE)
  - Long-context and memory-efficient modeling
  - Inference acceleration and serving optimization
  - Training and inference system efficiency
- Optimize model throughput, latency, memory usage, and deployment performance for large-scale production environments
Requirements
- MS or PhD in:
  - Computer Vision
  - Machine Learning
  - Robotics
  - Computer Science
  - Related fields
- Strong understanding of:
  - Foundation models
  - Self-supervised learning
  - Representation learning
  - Multimodal learning
  - Large-scale pretraining
- Hands-on experience with methods such as:
  - CLIP
  - DINO / DINOv2
  - MAE
  - Contrastive learning
  - Masked modeling
  - MoE or scalable transformer architectures
- Experience with one or more of the following is highly valued:
  - Video foundation models
  - Long-context modeling
  - Retrieval systems
  - Efficient inference
  - Distributed training
  - Model compression and deployment optimization
- Strong publication record in top-tier venues is preferred:
  - CVPR
  - ICCV
  - ECCV
  - NeurIPS
  - ICLR
  - ICML