Vision Language Model Jobs (NOW HIRING)

MBZUAI (Mohamed bin Zayed University of Artificial Intelligence)

Research Scientist - Vision Language Model

Sunnyvale, CA · On-site

$150K - $450K/yr

As a Research Scientist in the Vision Language Model team, you will advance multimodal foundation models that integrate visual understanding and reasoning, working on the research and development of ...

Senior Vision Language Model Engineer

Santa Clara, CA · On-site

$122.70K - $168.50K/yr

We are seeking a senior vision language model engineer to design and build agentic data and training workflows for Autonomous Vehicles, Robotics, and Medical applications. The right person for this ...

NVIDIA

Senior Vision Language Model Engineer

Santa Clara, CA · On-site

$121.80K - $167.20K/yr

Liquid AI, Inc

Member of Technical Staff - Post Training, Applied (Vision)

San Francisco, CA · On-site +1

The Opportunity This is a rare chance to sit at the intersection of frontier vision-language models and real-world deployment. You'll own applied post-training work for VLMs end-to-end for some of ...

Research Scientist- Vision-Language-Action (VLA) Models

Sunnyvale, CA · On-site

As a Research Scientist- Vision-Language-Action (VLA) Models, you contribute to research projects at the forefront of the ADAS/AD industry. Key responsibilities include: * Conduct research and ...

OR · Hybrid

With demand for AI exploding, particularly in the realm of large language models (LLMs) and vision language models (VLMs, VLAs), we are significantly expanding our team. We're seeking a highly ...

Manager, Large Language Model Inference

Santa Clara, CA · On-site

Manager, Large Language Model Inference

Santa Clara, CA · Hybrid

Senior Research Scientist- Vision-Language-Action (VLA) Models

Sunnyvale, CA · On-site

$115K - $146.50K/yr

As a Senior Research Scientist- Vision-Language-Action (VLA) Models, you contribute to research projects at the forefront of the ADAS/AD industry. Key responsibilities include: * Conduct research and ...

AI Research Scientist, VLM (vision language models)

Menlo Park, CA · On-site

$184K/yr

AI Research Scientist, VLM (vision language models) Responsibilities: * Push state of the art in multimodal generative AI * Explore new techniques for advanced reasoning and multimodal understanding ...

AI Research Scientist, VLM (vision language models)

Bellevue, WA

$184K/yr

Vision Language Model information

Hourly pay chart: $10 / $31 / $67

How much do vision language model jobs pay per hour?

As of Jun 4, 2026, the average hourly pay for vision language model jobs in the United States is $31.37, according to ZipRecruiter salary data. Most workers in these roles earn between $18.99 and $39.18 per hour, depending on experience, location, and employer.

What are the key skills and qualifications needed to thrive as a Vision Language Model Engineer, and why are they important?

To thrive as a Vision Language Model Engineer, you need a strong background in computer vision, natural language processing, machine learning, and often a graduate degree in computer science or a related field. Proficiency with deep learning frameworks such as TensorFlow or PyTorch, experience with large-scale datasets, and familiarity with model deployment tools are typically required. Strong problem-solving skills, creativity, and effective collaboration abilities help you stand out in this rapidly evolving field. These skills are essential for developing advanced AI systems that accurately interpret and generate language grounded in visual data, driving innovation in applications like image captioning and visual question answering.

What are some common challenges faced by professionals working with Vision Language Models, and how can they be addressed?

Professionals working with Vision Language Models often encounter challenges such as aligning visual and textual data, handling large-scale datasets, and ensuring model interpretability. Dealing with noisy or incomplete data from either modality can affect model performance, so strong data preprocessing and augmentation skills are essential. Collaboration with multidisciplinary teams—including data engineers, machine learning researchers, and domain experts—is key to refining models and deploying them effectively. Staying updated with the latest advancements and leveraging open-source resources can also help address these challenges.

What is a Vision Language Model?

A Vision Language Model (VLM) is an artificial intelligence system designed to understand and generate information using both visual data (like images or videos) and textual data (like written language). These models are trained on large datasets containing images paired with descriptive text, allowing them to perform tasks such as image captioning, visual question answering, and multimodal content generation. VLMs use advanced machine learning techniques to learn the relationships between visual elements and language, making them valuable for applications that require an integrated understanding of both modalities. They are widely used in fields such as robotics, accessibility technology, and automated content creation.

What is the difference between a Vision Language Model Engineer and a Computer Vision Engineer?

Aspect	Vision Language Model Engineer	Computer Vision Engineer
Required credentials	Advanced degrees in AI, Machine Learning, or related fields	Degree in Computer Science, Electrical Engineering, or related fields
Work environment	Research labs, AI startups, tech companies focusing on multimodal AI	Tech companies, research institutions, industries applying image analysis
Industry usage	Developing multimodal AI systems combining vision and language	Creating algorithms for image recognition, object detection, and analysis
Search and comparison intent	Understanding roles in AI development involving vision and language	Focus on technical image processing and computer vision applications

While both roles involve working with visual data, a Vision Language Model Engineer specializes in integrating visual and textual information using advanced AI techniques, often in research or product development. In contrast, a Computer Vision Engineer focuses on developing algorithms for analyzing and interpreting visual data, primarily in applications like image recognition and object detection.

Infographic showing Vision Language Model job openings in the United States as of May 2026, with employment types broken down into 1% As Needed, 39% Full Time, 55% Part Time, 1% Temporary, 3% Contract, and 1% Nights. It highlights a job distribution of 91% Physical, 3% Hybrid, and 6% Remote, with an average salary of $65,246 per year, or $31.4 per hour.
