Vision Language Model Jobs (NOW HIRING)

Machine Learning Engineer - Geospatial (TS/SCI)

Springfield, VA · On-site

$175K - $250K/yr

New

Deep Learning Research Intern

Santa Clara, CA

$19 - $65/hr

Vision-Language models. You will use vision language models to generate meta actions for strategic decision making. Your project will also focus on designing and implementing advanced knowledge ...

Machine Learning Engineer - Geospatial (TS/SCI) with Security Clearance

Springfield, VA

$175K - $250K/yr

New

Deep Learning Research Intern

Santa Clara, CA · On-site

$19 - $65/hr

Machine Learning Engineer - Visual Agents - Special Projects

Cupertino, CA · On-site

A successful candidate has hands-on experience with vision-language models, knows how to translate ambiguous product requirements into measurable evaluation criteria, and is excited to work at the ...

ML Engineer

Manhattan, NY · On-site

$170K - $185K/yr

Visia's full-stack physical intelligence platform includes robust sensing systems across imaging modes (cameras, X-rays, cargo X-rays, LiDAR), foundation vision-language models that convert raw ...

Senior Machine Learning Engineer, Computer Vision/VLM

Mountain View, CA · On-site

$204K - $259K/yr

Develop and prototype novel prompting strategies for Vision-Language Models (VLMs) to elicit complex, causal reasoning about driving scenarios. Collaborate closely with the ML Infra, Perception ...

ML Engineer

New York, NY · On-site +1

$170K - $185K/yr

Senior Machine Learning Engineer, Computer Vision/VLM

San Francisco, CA · On-site +1

$204K - $259K/yr

New Grads 2026 - Software Engineer - Computer Vision

San Jose, CA · On-site

$120K - $165K/yr

Develop and deploy cutting-edge perception and deep learning models, including computer vision models, vision-language and large language models (VLMs and LLMs), for real-time integration into our ...

Senior Machine Learning Engineer, Computer Vision/VLM

Mountain View, CA · On-site +1

$204K - $259K/yr

Qualcomm

AI Lead - Autonomous Driving/Reasoning/Vision Language Action Models

San Diego, CA · On-site

$108.80K - $143.30K/yr

Background in vision-language models, policy learning, or autonomous driving foundation models ... Knowledge of model compression techniques (quantization, distillation) for efficient deployment.

AI Research Scientist, VLM (vision language models)

Menlo Park, CA · On-site

Lead, collaborate, and execute on research that pushes forward the state of the art in multimodal reasoning and generation research. Work towards lo ...

Machine Learning Research Engineer, SIML - ISE

$147.40K - $272.10K/yr

This role requires experience in vision-language models, and ability to fine-tune/adapt/distill multi-modal LLMs. You will be part of a fast-paced, impact-driven Applied Research organization working ...

Great American Insurance Group

Senior Data Scientist - Predictive Analytics

Cincinnati, OH · On-site

Leads Vision Language Model initiatives and advanced computer vision projects using OpenCV and cutting-edge Vision AI technologies. Mentors junior data scientists and provides technical guidance on ...

Machine Learning Research Engineer, SIML - ISE

Cupertino, CA · On-site

$147.40K - $272.10K/yr

Machine Learning Engineer - Visual Agents - Special Projects

$126.80K - $220.90K/yr

New Grads 2026 - Software Engineer - Computer Vision

San Jose, CA · On-site

$120K - $165K/yr

American Financial

Senior Data Scientist - Predictive Analytics

Cincinnati, OH · Hybrid

Vision Language Model information

[Salary chart: hourly pay scale marked at $10, $31, and $67]

How much do vision language model jobs pay per hour?

As of Jun 4, 2026, the average hourly pay for vision language model jobs in the United States is $31.37, according to ZipRecruiter salary data. Most workers in this role earn between $18.99 and $39.18 per hour, depending on experience, location, and employer.

What are the key skills and qualifications needed to thrive as a Vision Language Model Engineer, and why are they important?

To thrive as a Vision Language Model Engineer, you need a strong background in computer vision, natural language processing, machine learning, and often a graduate degree in computer science or a related field. Proficiency with deep learning frameworks such as TensorFlow or PyTorch, experience with large-scale datasets, and familiarity with model deployment tools are typically required. Strong problem-solving skills, creativity, and effective collaboration abilities help you stand out in this rapidly evolving field. These skills are essential for developing advanced AI systems that accurately interpret and generate language grounded in visual data, driving innovation in applications like image captioning and visual question answering.

What are some common challenges faced by professionals working with Vision Language Models, and how can they be addressed?

Professionals working with Vision Language Models often encounter challenges such as aligning visual and textual data, handling large-scale datasets, and ensuring model interpretability. Dealing with noisy or incomplete data from either modality can affect model performance, so strong data preprocessing and augmentation skills are essential. Collaboration with multidisciplinary teams—including data engineers, machine learning researchers, and domain experts—is key to refining models and deploying them effectively. Staying updated with the latest advancements and leveraging open-source resources can also help address these challenges.

What is a Vision Language Model?

A Vision Language Model (VLM) is an artificial intelligence system designed to understand and generate information using both visual data (like images or videos) and textual data (like written language). These models are trained on large datasets containing images paired with descriptive text, allowing them to perform tasks such as image captioning, visual question answering, and multimodal content generation. VLMs use advanced machine learning techniques to learn the relationships between visual elements and language, making them valuable for applications that require an integrated understanding of both modalities. They are widely used in fields such as robotics, accessibility technology, and automated content creation.

What is the difference between a Vision Language Model Engineer and a Computer Vision Engineer?

Aspect	Vision Language Model Engineer	Computer Vision Engineer
Required credentials	Advanced degrees in AI, Machine Learning, or related fields	Degree in Computer Science, Electrical Engineering, or related fields
Work environment	Research labs, AI startups, tech companies focusing on multimodal AI	Tech companies, research institutions, industries applying image analysis
Industry usage	Developing multimodal AI systems combining vision and language	Creating algorithms for image recognition, object detection, and analysis
Search and comparison intent	Understanding roles in AI development involving vision and language	Focus on technical image processing and computer vision applications

While both roles involve working with visual data, a Vision Language Model Engineer specializes in integrating visual and textual information using advanced AI techniques, often in research or product development. In contrast, a Computer Vision Engineer focuses on developing algorithms for analyzing and interpreting visual data, primarily in applications like image recognition and object detection.

Infographic showing Vision Language Model job openings in the United States as of May 2026, with employment types broken down into 1% As Needed, 39% Full Time, 55% Part Time, 1% Temporary, 3% Contract, and 1% Nights. It highlights a job distribution of 91% Physical, 3% Hybrid, and 6% Remote, with an average salary of $65,246 per year, or $31.4 per hour.
