Multimodal Learning Jobs (NOW HIRING)

Senior Staff Machine Learning Scientist, Assets

OR · On-site +1

$91K - $124K/yr

Design, implement, train, and optimize large-scale vision and multimodal foundation models across ... Proficiency in modern deep learning frameworks such as PyTorch and TensorFlow. * Demonstrated ...

Omnitag, our ML-powered multimodal data mining framework, is the engine that powers this discovery. As a Staff Machine Learning Engineer, you will serve as a technical leader defining the roadmap and ...

Multimodal Learning information

Salary figures shown: $21K, $61.7K (average), $114.5K

How much do multimodal learning jobs pay per year?

As of Jun 28, 2026, the average yearly pay for multimodal learning jobs in the United States is $61,692, according to ZipRecruiter salary data. Most workers in this role earn between $41,000 and $72,000 per year, depending on experience, location, and employer.

What is multimodal learning?

Multimodal learning is an area of machine learning that involves integrating and processing information from multiple types of data, such as text, images, audio, and video. The goal is to create models that can understand and make predictions based on more than one data modality, similar to how humans use various senses. This approach is used in applications like speech recognition with visual cues, image captioning, and video analysis. By combining different data types, multimodal learning systems can achieve better accuracy and more robust understanding.
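
As a rough illustration of combining modalities, the sketch below shows late fusion, one common and simple strategy: each modality gets its own model, and their class probabilities are averaged. The function name `late_fusion`, the weights, and the probability values are hypothetical and chosen only for illustration; real systems often fuse learned features rather than final predictions.

```python
from typing import Dict, List


def late_fusion(per_modality_probs: Dict[str, List[float]],
                weights: Dict[str, float]) -> List[float]:
    """Return the weighted average of class probabilities across modalities."""
    # Normalize by the total weight so the fused scores still sum to 1.
    total = sum(weights[m] for m in per_modality_probs)
    n_classes = len(next(iter(per_modality_probs.values())))
    fused = [0.0] * n_classes
    for modality, probs in per_modality_probs.items():
        for i, p in enumerate(probs):
            fused[i] += weights[modality] * p / total
    return fused


# Hypothetical outputs of an audio model and a video model for two classes.
probs = {"audio": [0.6, 0.4], "video": [0.2, 0.8]}
fused = late_fusion(probs, {"audio": 1.0, "video": 1.0})

# Index of the class with the highest fused probability.
predicted = max(range(len(fused)), key=fused.__getitem__)
```

With equal weights, the audio model's preference for the first class is outweighed by the video model's stronger preference for the second, so the fused prediction is the second class.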

What is the difference between Multimodal Learning and Data Scientist roles?

Aspect | Multimodal Learning | Data Scientist
Required Credentials | Advanced degrees in AI, Machine Learning, or Computer Science | Bachelor's or Master's in Data Science, Statistics, or related fields
Work Environment | Research labs, AI development teams, academia | Business, tech companies, analytics teams
Industry Usage | AI research, multimedia applications, robotics | Data analysis, predictive modeling, business insights

Multimodal Learning focuses on developing AI models that process and integrate multiple data types like images, text, and audio. Data Scientists analyze data to extract insights, build models, and support decision-making. While both roles involve data and algorithms, Multimodal Learning is specialized in AI model development for complex data integration, whereas Data Scientists work broadly across data analysis and interpretation.

What are the key skills and qualifications needed to thrive as a Multimodal Learning Specialist, and why are they important?

To excel as a Multimodal Learning Specialist, you need a solid background in machine learning, data science, and computer vision, often supported by an advanced degree in a related field. Familiarity with deep learning frameworks like TensorFlow or PyTorch, experience integrating data from diverse sources (e.g., text, audio, images), and knowledge of relevant algorithms are crucial. Strong problem-solving abilities, creativity, and effective collaboration are standout soft skills for this role. These competencies are vital for developing innovative models that can process and interpret complex, multi-source data to drive impactful AI solutions.

What are some common challenges faced by professionals working in multimodal learning roles, and how can they be addressed?

Professionals in multimodal learning frequently encounter challenges related to integrating and aligning data from multiple sources, such as text, images, audio, or video. Ensuring data quality and consistency across modalities can be complex, and developing models that effectively combine heterogeneous information often requires advanced technical skills and innovative thinking. Collaboration with domain experts and other data scientists is key to overcoming these obstacles, as is staying up to date with the latest research and tools in machine learning. Regular team meetings and cross-disciplinary workshops can help foster a collaborative environment and promote knowledge sharing.
More about Multimodal Learning jobs
Infographic showing various Multimodal Learning job openings in the United States as of June 2026, with employment types broken down into 33% Internship and 67% Full Time. Highlights a 100% in-person job distribution, with an average salary of $61,692 per year, or $29.7 per hour.

Full-time

Posted 4 days ago


Job description

Job Summary:
The University of Bristol's School of Physiology, Pharmacology and Neuroscience is seeking a Senior Research Associate in Multimodal Learning. The role involves conducting research on audio-visual understanding for smart hearing aids, collaborating with other researchers, and publishing findings in top-tier venues.
Responsibilities:
• Conducting novel research in multimodal audio-visual understanding: designing, training and evaluating audio-visual understanding in conversational settings. This will include hands-on research using the latest deep learning approaches.
• Preparing low-latency API packages that will be integrated with partner demonstrations on a quarterly basis.
• Presenting your work in regular meetings, taking feedback and integrating the goals of the project into your individual research directions.
• Publishing in top-tier venues (conferences and journals). Communicating your work to the best possible audience.
• Collaborating with other researchers (postdocs and faculty) in the WeHear project.
• Co-advising junior PGR students.
Qualifications:
Required:
• PhD [near submission, submitted or graduated] in Multimodal Understanding, preferably with expertise in audio understanding, video understanding or multimodal visual models.
• Prior degree in computer science, engineering or mathematics.
• Detailed knowledge of the video understanding state of the art, approaches, datasets and problems, preferably with expertise in egocentric datasets.
• Prior knowledge of egocentric audio-visual devices that work in real time, like Meta Aria Glasses (Gen1 or Gen2) and Apple Vision Pro.
• Experience in handling audio-video data for learning and inference.
• Experience in modelling deep learning approaches.
• Experience and evidence of publishing at high-calibre conferences and journals (at least one first-author paper in a major venue: CVPR/ICCV/ECCV/ICASSP/NeurIPS/PAMI/IJCV/ICLR, in the past 3 years).
• Excellent programming skills (Python).
• Proficiency in deep learning frameworks (PyTorch).
Company:
Research within the School of Physiology, Pharmacology and Neuroscience is conducted across Neuroscience, Cardiovascular and Cell Signalling. Founded in 1876, the company is headquartered in Bristol, GB, with a team of 51-200 employees. The company is currently at Growth Stage.