Focus: Multimodal Foundation Models · Representation Learning · Method Innovation. We are looking for strong technical builders and researchers who deeply understand foundation models and representation ...
Machine Learning Researcher
South San Francisco, CA · On-site +1
Design, implement, and train foundation models with multimodal bio data. * Continually improve the ... Expertise in Machine Learning, with deep experience in areas like: * Foundation models * Self ...
Research Scientist Graduate (Multimodal Interaction and World Model) - 2026 Start (PhD)
San Jose, CA · On-site
... learning, multimodal learning, video understanding, or vision-language modeling Preferred : • Expertise in Transformers (Dense and MoE) and familiar with how to scale Transformers on GPUs or TPUs ...
Research Scientist in Multimodal Interaction and World Model - Seed - Graduates - 2027 Start (BS/MS)
San Jose, CA · On-site
Preferred : • Experience in multimodal learning, reinforcement learning, or agent systems through internships is preferred. • Strong problem-solving and collaboration skills. Company : ByteDance ...
Help train and develop multimodal learning models using advanced learning techniques including RAG, self-supervised learning, semi-supervised, and transductive learning. Requirements Desired ...
Member of Technical Staff, Multimodal Vision
San Jose, CA · On-site
$180K - $450K/yr
Experience with large-scale machine learning systems and distributed training. * Strong background ... Experience with multimodal systems (vision + text, vision + audio) or real-time AI systems is a ...
Machine Learning Research Engineer, SIML - ISE
Cupertino, CA · On-site
$252K/yr
As a Machine Learning Research Engineer, you will help design and develop models and algorithms for multimodal perception and reasoning leveraging Vision-Language Models (VLMs) and Multimodal Large ...
[2026] Senior Machine Learning Engineer, Multimodal AI, Computer Vision and Graphics - PhD Early ...
San Mateo, CA · On-site
$159K - $197K/yr
Expertise in one or more areas: computer vision, multimodal learning, 3D Graphics, or large-scale representation learning. * Experience developing and training deep learning models using modern ...
[2026] Senior Machine Learning Engineer, Account Identity - PhD Early Career
San Mateo, CA · On-site
$119K - $163K/yr
Expertise in one or more areas: computer vision, multimodal learning, deepfake detection, facial representation, adversarial machine learning, or VLM/LLM. * Strong coding skills with proficiency in ...
AI Research Scientist (Technical Leadership), Multimodal - Monetization GenAI
Menlo Park, CA · On-site
$219K - $301K/yr
... multimodal learning, or diffusion models • Demonstrated significant industry influence in the ... field of AI and/or recently published research in leading peer-reviewed conferences (e.g., ACL ...
Research expertise in video generation/understanding, multimodal learning, or diffusion models * Demonstrated significant industry influence in the field of AI and/or recently published research in ...
Staff Machine Learning Engineer
San Francisco, CA · On-site +1
Omnitag, our ML-powered multimodal data mining framework, is the engine that powers this discovery. As a Staff Machine Learning Engineer, you will serve as a technical leader defining the roadmap and ...
Machine Learning Research Engineer, SIML - ISE
Cupertino, CA · On-site
$147K - $272K/yr
Description As a Machine Learning Research Engineer, you will help design and develop models and algorithms for multimodal perception and reasoning leveraging Vision-Language Models (VLMs) and ...
Machine Learning: Multimodal Foundation Models
San Francisco, CA · On-site
$200K - $350K/yr
Machine Learning: Multimodal Foundation Models We are building unified foundation models that natively reason across text, image, video, and kinematics to drive intelligent robotic policies. You will ...
Multimodal Learning information
What is the difference between a Multimodal Learning role and a Data Scientist role?
| Aspect | Multimodal Learning | Data Scientist |
|---|---|---|
| Required Credentials | Advanced degrees in AI, Machine Learning, or Computer Science | Bachelor's or Master's in Data Science, Statistics, or related fields |
| Work Environment | Research labs, AI development teams, academia | Business, tech companies, analytics teams |
| Industry Usage | AI research, multimedia applications, robotics | Data analysis, predictive modeling, business insights |
Multimodal Learning focuses on developing AI models that process and integrate multiple data types like images, text, and audio. Data Scientists analyze data to extract insights, build models, and support decision-making. While both roles involve data and algorithms, Multimodal Learning is specialized in AI model development for complex data integration, whereas Data Scientists work broadly across data analysis and interpretation.

Other
Posted 18 days ago
Job description
Focus
Multimodal Foundation Models · Representation Learning · Method Innovation
We are looking for strong technical builders and researchers who deeply understand foundation models and representation learning beyond simply applying existing frameworks.
Ideal candidates should have:
- Strong experimental rigor
- Solid systems and modeling intuition
- Hands-on engineering ability
- Interest in scalable multimodal AI systems for real-world autonomy
We value people who can bridge research and production, and who care about robustness, scalability, efficiency, and practical deployment in large-scale autonomous driving systems.
Responsibilities
1. Large-Scale Foundation Model Pretraining
- Develop scalable pretraining pipelines for large-scale multimodal driving data
- Design and optimize training strategies for:
  - Vision-language-action models
  - Video foundation models
  - Long-context temporal modeling
  - Multimodal representation alignment
- Improve:
  - Training stability
  - Data efficiency
  - Scaling efficiency
  - Representation robustness
- Work on distributed training systems and large-scale model optimization using frameworks such as:
  - PyTorch Distributed
  - DeepSpeed
  - Megatron-LM
2. Representation Learning & Method Innovation
- Design and improve self-supervised and multimodal learning methods for real-world autonomous driving systems
- Conduct architecture-level research on:
  - Vision Transformers (ViT)
  - Video / temporal architectures
  - Multimodal fusion and alignment
  - Embedding and retrieval systems
  - Long-context and memory-efficient architectures
- Explore and improve:
  - Pretraining objectives
  - Loss functions
  - Training paradigms
  - Generalization and robustness
- Analyze model behavior through:
  - Rigorous ablation studies
  - Failure case analysis
  - Representation probing and evaluation
3. Efficient Foundation Models & Scalable Deployment
- Improve the efficiency, scalability, and deployability of large multimodal foundation models for real-world autonomous driving systems
- Work on areas such as:
  - Model quantization
  - Knowledge distillation
  - Efficient attention mechanisms
  - Sparse architectures and Mixture-of-Experts (MoE)
  - Long-context and memory-efficient modeling
  - Inference acceleration and serving optimization
  - Training and inference system efficiency
- Optimize model throughput, latency, memory usage, and deployment performance for large-scale production environments
Requirements
- MS or PhD in:
  - Computer Vision
  - Machine Learning
  - Robotics
  - Computer Science
  - Related fields
- Strong understanding of:
  - Foundation models
  - Self-supervised learning
  - Representation learning
  - Multimodal learning
  - Large-scale pretraining
- Hands-on experience with methods such as:
  - CLIP
  - DINO / DINOv2
  - MAE
  - Contrastive learning
  - Masked modeling
  - MoE or scalable transformer architectures
- Experience with one or more of the following is highly valued:
  - Video foundation models
  - Long-context modeling
  - Retrieval systems
  - Efficient inference
  - Distributed training
  - Model compression and deployment optimization
- Strong publication record in top-tier venues is preferred:
  - CVPR
  - ICCV
  - ECCV
  - NeurIPS
  - ICLR
  - ICML