Vision Language Model Jobs (NOW HIRING)

Vision-Language models. You will use vision language models to generate meta actions for strategic decision making. Your project will also focus on designing and implementing advanced knowledge ...

ML Engineer

Manhattan, NY · On-site

$170K - $185K/yr

Visia's full-stack physical intelligence platform includes robust sensing systems across imaging modes (cameras, X-rays, cargo X-rays, LiDAR), foundation vision-language models that convert raw ...

ML Engineer

New York, NY · On-site +1

$170K - $185K/yr

Visia's full-stack physical intelligence platform includes robust sensing systems across imaging modes (cameras, X-rays, cargo X-rays, LiDAR), foundation vision-language models that convert raw ...

Vision Language Model information

Hourly pay chart markers: $10, $31, $67

How much do vision language model jobs pay per hour?

As of Jun 4, 2026, the average hourly pay for vision language model jobs in the United States is $31.37, according to ZipRecruiter salary data. Most workers in this role earn between $18.99 and $39.18 per hour, depending on experience, location, and employer.
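
To compare these hourly figures with the annual salaries quoted in the listings, a rough conversion can help. The sketch below assumes a full-time year of 2,080 paid hours (40 hours × 52 weeks); that figure and the function name are illustrative assumptions, not part of the ZipRecruiter data, and part-time roles would come out lower.

```python
# Assumption: a full-time year of 40 hours/week for 52 weeks.
HOURS_PER_YEAR = 40 * 52


def annual_from_hourly(hourly_rate, hours_per_year=HOURS_PER_YEAR):
    """Convert an hourly rate in dollars to an approximate annual figure."""
    return round(hourly_rate * hours_per_year, 2)


# Hourly figures quoted above: typical low end, average, typical high end.
for label, rate in [("low", 18.99), ("average", 31.37), ("high", 39.18)]:
    print(f"{label}: ${annual_from_hourly(rate):,.2f} per year")
```

Under that assumption the average works out to roughly $65,250 per year, and the typical range to about $39,500 to $81,500.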

What are the key skills and qualifications needed to thrive as a Vision Language Model Engineer, and why are they important?

To thrive as a Vision Language Model Engineer, you need a strong background in computer vision, natural language processing, machine learning, and often a graduate degree in computer science or a related field. Proficiency with deep learning frameworks such as TensorFlow or PyTorch, experience with large-scale datasets, and familiarity with model deployment tools are typically required. Strong problem-solving skills, creativity, and effective collaboration abilities help you stand out in this rapidly evolving field. These skills are essential for developing advanced AI systems that accurately interpret and generate language grounded in visual data, driving innovation in applications like image captioning and visual question answering.

What are some common challenges faced by professionals working with Vision Language Models, and how can they be addressed?

Professionals working with Vision Language Models often encounter challenges such as aligning visual and textual data, handling large-scale datasets, and ensuring model interpretability. Dealing with noisy or incomplete data from either modality can affect model performance, so strong data preprocessing and augmentation skills are essential. Collaboration with multidisciplinary teams—including data engineers, machine learning researchers, and domain experts—is key to refining models and deploying them effectively. Staying updated with the latest advancements and leveraging open-source resources can also help address these challenges.

What is a Vision Language Model?

A Vision Language Model (VLM) is an artificial intelligence system designed to understand and generate information using both visual data (like images or videos) and textual data (like written language). These models are trained on large datasets containing images paired with descriptive text, allowing them to perform tasks such as image captioning, visual question answering, and multimodal content generation. VLMs use advanced machine learning techniques to learn the relationships between visual elements and language, making them valuable for applications that require an integrated understanding of both modalities. They are widely used in fields such as robotics, accessibility technology, and automated content creation.
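
One common way such models relate images to text, used by contrastive models like CLIP, is to map both into a shared embedding space and compare them by cosine similarity. The sketch below illustrates only that matching step. The three-dimensional vectors are hand-written, hypothetical stand-ins for what trained image and text encoders would produce, and the function names are invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings: a real VLM would compute these with learned encoders.
image_embedding = [0.9, 0.1, 0.2]
caption_embeddings = {
    "a dog playing fetch": [0.8, 0.2, 0.1],
    "a bowl of soup": [0.1, 0.9, 0.3],
    "a city skyline at night": [0.2, 0.1, 0.9],
}


def best_caption(image_vec, captions):
    """Return the caption whose embedding is most similar to the image's."""
    return max(captions, key=lambda text: cosine(image_vec, captions[text]))


print(best_caption(image_embedding, caption_embeddings))
```

With these toy vectors the image matches "a dog playing fetch". Generative VLMs such as LLaVA go further and feed image features into a language model to produce text, which this sketch does not cover.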

What is the difference between a Vision Language Model Engineer and a Computer Vision Engineer?

| Aspect | Vision Language Model Engineer | Computer Vision Engineer |
| --- | --- | --- |
| Required credentials | Advanced degrees in AI, Machine Learning, or related fields | Degree in Computer Science, Electrical Engineering, or related fields |
| Work environment | Research labs, AI startups, tech companies focusing on multimodal AI | Tech companies, research institutions, industries applying image analysis |
| Industry usage | Developing multimodal AI systems combining vision and language | Creating algorithms for image recognition, object detection, and analysis |
| Search and comparison intent | Understanding roles in AI development involving vision and language | Focus on technical image processing and computer vision applications |

While both roles involve working with visual data, a Vision Language Model Engineer specializes in integrating visual and textual information using advanced AI techniques, often in research or product development. In contrast, a Computer Vision Engineer focuses on developing algorithms for analyzing and interpreting visual data, primarily in applications like image recognition and object detection.

Infographic showing Vision Language Model job openings in the United States as of May 2026, with employment types broken down into 1% As Needed, 39% Full Time, 55% Part Time, 1% Temporary, 3% Contract, and 1% Nights. It highlights a job distribution of 91% Physical, 3% Hybrid, and 6% Remote, with an average salary of $65,246 per year, or $31.40 per hour.

Machine Learning Engineer - Geospatial (TS/SCI)

LaunchCode

Springfield, VA • On-site

$175K - $250K/yr

Full-time

Posted yesterday


Job description

Title: AI/Machine Learning Engineer - Vision Language Models / Multimodal AI (NGA)
Location: Springfield or Herndon, VA (onsite)
Clearance: TS/SCI (CI Poly preferred)
Position Type: Full-Time, Direct Hire
Pay: $175,000 to $250,000 for an SME
Company: The name of our partner organization will be disclosed during the interview process. This is not a direct role with LaunchCode; it is a position through LaunchCode, working with one of our partner companies.
Disclaimer: We are unable to provide work sponsorship for this role
Overview:
We're hiring an AI/Machine Learning Engineer with strong experience in multimodal AI and large-scale model training to support advanced vision-language initiatives in a secure government environment. This role will focus on fine-tuning Vision Language Models (VLMs) on domain-specific geospatial imagery, building scalable AWS training infrastructure, and developing evaluation frameworks for image understanding and spatial reasoning. Ideal candidates will have deep experience with PyTorch, HuggingFace, distributed training, and computer vision, along with the ability to optimize and deploy multimodal models in mission-critical environments.
Hands-on experience taking multimodal models such as CLIP, LLaVA, Qwen-VL, or similar Vision Language Models and fine-tuning them on classified or mission-specific imagery datasets is a huge plus. The ideal candidate can build the AWS infrastructure needed to train and scale these models, evaluate performance improvements across real-world use cases, and deploy solutions into secure government or air-gapped environments.
Key Responsibilities:
  • Design and execute fine-tuning pipelines for Vision Language Models (VLMs) using domain-specific imagery datasets
  • Handle data preprocessing, training orchestration, and hyperparameter optimization for multimodal models
  • Build evaluation frameworks for image understanding, visual question answering, and spatial reasoning tasks
  • Develop scalable AWS-based ML infrastructure using SageMaker and GPU-enabled EC2 for distributed training
  • Create data pipelines for curating, annotating, and transforming geospatial imagery into model-ready datasets
  • Partner with applied scientists and architects on model architecture improvements, LoRA/QLoRA strategies, and inference optimization

Required Qualifications:
  • Active TS/SCI with CI Poly
  • 5+ years of machine learning engineering experience focused on deep learning
  • 1+ year of hands-on experience fine-tuning foundation models (LLMs or VLMs)
  • Experience with LoRA, QLoRA, adapters, supervised fine-tuning, instruction tuning, and RLHF/DPO
  • 4+ years of advanced Python development for ML workloads
  • Strong PyTorch and HuggingFace experience (Transformers, PEFT, Datasets, Accelerate)
  • Experience with distributed training frameworks such as DeepSpeed, FSDP, or Megatron
  • 3+ years working with computer vision or multimodal models
  • Familiarity with vision transformer and vision-language architectures (ViT, CLIP, LLaVA, etc.)
  • Experience processing and augmenting image datasets at scale
  • 3+ years with AWS ML infrastructure including SageMaker, EC2 GPU environments, and S3
  • Experience with ML evaluation pipelines, benchmarking, metrics, and result analysis
  • Strong software engineering fundamentals including version control, testing, and CI/CD

Preferred Qualifications:
  • 2+ years working with geospatial or remote sensing imagery
  • Experience with EO or SAR satellite imagery
  • Understanding of geospatial metadata, coordinate systems, and imagery preprocessing
  • Experience with model quantization / inference optimization (vLLM, TensorRT, ONNX)
  • MLOps tooling experience (MLflow, Weights & Biases, SageMaker Experiments)
  • Familiarity with annotation tools and active learning workflows
  • Containerized ML experience with Docker / ECR / ECS / EKS
  • Experience supporting ATO processes and NIST 800-53 compliance
  • Experience deploying in air-gapped/disconnected environments
  • Familiarity with multimodal evaluation benchmarks (MMMU, MMBench, GQA)
  • Publications or contributions in computer vision, multimodal AI, or VLMs
  • Synthetic data generation experience for training augmentation