VLM (Vision Language Model) is a broad term for any model that processes information from both visual and linguistic modalities.
FAQ
Are CLIP and Moondream/LLaVA both Vision Language Models?
- Yes, just different types: CLIP is a contrastive/embedding model, while Moondream/LLaVA are generative models; both fit the broad VLM definition above.
People say to use CLIP for captioning, is that true?
- CLIP can't generate text.
- If you were generating captions, you were probably using something else (a generative model) on top of or instead of CLIP.
- One popular library for this was: https://github.com/pharmapsychotic/clip-interrogator
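Since CLIP can't caption, a generative model is the usual tool. A minimal sketch using a BLIP captioning checkpoint via transformers (the checkpoint name and file path are just illustrative choices):

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Salesforce/blip-image-captioning-base is one commonly used captioning checkpoint
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("photo.jpg")  # hypothetical local image
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```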
Types
Which languages these vision models support depends on the underlying text encoder/decoder. Eg. someone wanted to try a ColPali + VLM pipeline for Turkish; the issue was that ColPali (PaliGemma) isn't trained on Turkish, so ColQwen & Qwen2VL were recommended instead, and that worked. Eg. voyage-multimodal-3 is also multilingual, but how much support each language gets obviously differs.
Pre-training with Contrastive Objective (image-text) (Eg. CLIP)
- Main purpose/training objective: comparison (putting images and text in the same latent space)
- https://github.com/mlfoundations/open_clip (maintains both CLIP and SigLIP)
- https://github.com/jina-ai/clip-as-service
Training CLIP
- Note: plain CLIP training is just the two encoders plus a contrastive loss (see the sketch below); the feature-fusion, language-guided selection, cross-modality decoder and DETR-loss steps in these notes look like they come from grounded detection models (e.g. GLIP / Grounding DINO) rather than vanilla CLIP.
- Feature encoder: fuse text and image features using a self-attention mechanism
- Enhance encoding: combines text and image
- Language-guided selection (don't understand yet)
- Cross-Modality Decoder (don't understand yet)
- Calculate loss
- Contrastive loss: compare text and visual features
- DETR loss: bounding boxes, detect objects of interest
- More stuff
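A minimal sketch of the symmetric contrastive (InfoNCE-style) loss vanilla CLIP trains with, assuming the image and text features already come out of the two encoders (PyTorch; the fixed temperature here is illustrative, CLIP learns it):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss over a batch of matching (image, text) pairs.
    image_feats, text_feats: (batch, dim) outputs of the two encoders."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    # (batch, batch) cosine-similarity matrix; the diagonal holds the true pairs
    logits = image_feats @ text_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> its matching text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text  -> its matching image
    return (loss_i2t + loss_t2i) / 2
```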
CLIP
- dual-encoder VLM (the image is encoded and the text is encoded separately), focused on learning joint image-text embeddings.
- 2 parts/components:
  - text encoder
  - image encoder: some ViT
- Use: https://github.com/rom1504/clip-retrieval
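A minimal usage sketch of the dual encoders via the transformers CLIP API (checkpoint name and image path are just common/illustrative choices):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # hypothetical local image
texts = ["a dog on a beach", "a plate of pasta"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image: similarity of the image against each text prompt
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```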
SigLIP
Also, as mentioned in the original comment, I think using SigLIP might just be better than CLIP: laion2b is a pretty messy dataset and, to my knowledge, Google did a really good job with SigLIP. A lot of multimodal models use SigLIP as the encoder. Probably a straightforward improvement in some regards.
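The headline difference is the loss: SigLIP replaces CLIP's batch softmax with a pairwise sigmoid loss. A rough sketch (the temperature and bias are learned in the paper; the fixed values here are only illustrative):

```python
import torch
import torch.nn.functional as F

def siglip_sigmoid_loss(image_feats, text_feats, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss: every (image, text) pair in the batch is an
    independent binary classification instead of a softmax over the batch."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() * temperature + bias
    # +1 on the diagonal (matching pairs), -1 everywhere else
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1
    return -F.logsigmoid(labels * logits).mean()
```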
Other CLIP variations
- Jina CLIP v2: Multilingual Multimodal Embeddings for Text and Images
- open clip with the Laion2B dataset https://github.com/mlfoundations/open_clip
- long clip (bigger token lengths during training) https://github.com/beichenzbc/Long-CLIP
- meta-clip https://github.com/facebookresearch/metaclip
- google siglip https://huggingface.co/docs/transformers/main/en/model_doc/siglip (different loss function, see the SigLIP section above)
- apple dfn https://huggingface.co/apple/DFN5B-CLIP-ViT-H-14 (if I recall, the 5B uses a higher image resolution, 378×378, amongst a number of other innovations)
- BLIP 1 & 2 by Salesforce, which I have also worked with: https://huggingface.co/docs/transformers/en/model_doc/blip
Pre-training with Contrastive Objective (image-only) (Eg. DINO)
- If you just want to do image classification, this is probably better than the other VLM techniques
- Also see Image replacement in Canva designs using reverse image search - Canva Engineering Blog
DINO
- If you don't need image-text pairs, you might even try experimenting with self-supervised models like DINOv2.
- In my testing on retrieval of similar off-road driving images, DINO works better than CLIP thanks to purely visual pre-training; CLIP suffers due to language grounding (it returns images that might have been captioned similarly but have different colors).
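A minimal sketch of DINOv2-based image-to-image retrieval with the transformers checkpoint (file names are hypothetical):

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base")

def embed(path: str) -> torch.Tensor:
    """Return one L2-normalised global embedding per image (the CLS token)."""
    inputs = processor(images=Image.open(path), return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return F.normalize(out.last_hidden_state[:, 0], dim=-1)

query = embed("query.jpg")                                     # hypothetical files
gallery = torch.cat([embed(p) for p in ["a.jpg", "b.jpg", "c.jpg"]])
scores = (query @ gallery.t()).squeeze(0)                      # cosine similarities
print(scores.argsort(descending=True))                         # best matches first
```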
DreamSim
- DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data
- It stacks DINO and CLIP embeddings, but also fine-tunes these models to act as a perceptual metric.
- It's much better at finding visually similar images, rather than images with similar content that look quite different.
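A usage sketch assuming the pip-installable dreamsim package and the loader shown in its README; treat the exact API (function names, device handling) as an assumption to verify against the repo:

```python
# pip install dreamsim  (assumption: the package exposes a dreamsim() loader as in its README)
from PIL import Image
from dreamsim import dreamsim

device = "cpu"  # or "cuda"
model, preprocess = dreamsim(pretrained=True, device=device)

img1 = preprocess(Image.open("img1.png")).to(device)  # hypothetical images
img2 = preprocess(Image.open("img2.png")).to(device)

# lower distance = more perceptually similar (DINO + CLIP backbones fine-tuned as a perceptual metric)
distance = model(img1, img2)
print(distance)
```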
Pre-training with Contrastive Objective + Interleaving (image-text) (Eg. ColPali)
Leaderboard: https://huggingface.co/spaces/vidore/vidore-leaderboard
META: Understanding Interleaving and Modality Gap
- Interleaving (missing info)
This is not a CLIP problem but more of a document-processing problem: if you had a document with text & images, you'd parse the text (eg. PDF to markdown), embed the images in the document, and also embed the text using CLIP. The spatial info (which image sat next to which text) is now lost. Preserving that mixed structure is what interleaving is about.
This allows mixing of different types of data (modalities) within a single input sequence or document. This is also sometimes called “interleaving”. Imagine a tutorial page: a paragraph of text, then an image illustrating a step, then more text explaining the image, then another image. This is interleaved data.
- Modality Gap (skewed representation)
All CLIP-like models perform poorly on mixed-modality search due to a phenomenon known as the modality gap. As illustrated in the figure below, the closest vector to the snippet “I address you, members of the Seventy-Seventh Congress…” is not its screenshot, but other texts. This leads to search results that are skewed towards items of the same modality; in other words, text vectors will be closer to irrelevant texts than relevant images in the embedding space.
See this blogpost for more details, and also this thread.
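A rough way to see the gap yourself with any CLIP-like model: compare text-to-text vs. text-to-image cosine similarities (checkpoint and inputs below are illustrative):

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

caption = "I address you, members of the Seventy-Seventh Congress..."
other_text = "Quarterly financial results for the last fiscal year"
screenshot = Image.open("speech_screenshot.png")  # hypothetical screenshot of the speech

with torch.no_grad():
    text_inputs = processor(text=[caption, other_text], return_tensors="pt", padding=True)
    t = F.normalize(model.get_text_features(**text_inputs), dim=-1)
    image_inputs = processor(images=screenshot, return_tensors="pt")
    i = F.normalize(model.get_image_features(**image_inputs), dim=-1)

print("caption vs unrelated text :", (t[0] @ t[1]).item())
print("caption vs its screenshot:", (t[0] @ i[0]).item())
# With CLIP-like models the text-text score tends to come out higher,
# even though the screenshot is the relevant item: that's the modality gap.
```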
ColBERT-like models (ColPali, ColQwen2, ColSmolVLM, ColiVara)
- See ColBERT in NLP (Natural Language Processing)
- Used for retrieval over PDF pages (it is not extraction, see below); also works for images
- ColPali is enabled by the latest advances in Vision Language Models & LLMs
- Col: leverages multi-vector retrieval through late interaction mechanisms, as proposed in ColBERT by Omar Khattab (author of DSPy).
- Pali: the PaliGemma model from the Google Zürich team (See Computer Vision)
- More info
- It's not an extraction (OCR) replacement (I wondered the same thing); it's retrieval that can bypass extraction.
- I'd misunderstood it as a vision LLM that could extract information from PDFs; it's actually more of an embedding model that represents a page of a PDF as a set of vectors, which means "will it just refuse to run" isn't actually a concern (unlike Claude 3 Vision etc.).
- In essence, ColPali is just an adapter on PaliGemma for the retrieval task, while PaliGemma itself can be used for many other tasks. As the authors point out, you can remove (or not use) the adapter and have it use its general capabilities to read the page.
- During indexing, the complex PDF parsing steps are replaced by using “screenshots” of the PDF pages directly.
- These screenshots are then embedded with the VLM. At inference time, the query is embedded and matched with a late interaction mechanism to retrieve the most similar document pages.
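The late-interaction (MaxSim) scoring itself is tiny; a sketch with made-up shapes (the embedding dimension and patch counts below are for illustration only):

```python
import torch
import torch.nn.functional as F

def late_interaction_score(query_embs: torch.Tensor, page_embs: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim: for each query token take its best-matching page patch,
    then sum over query tokens.
    query_embs: (num_query_tokens, dim), page_embs: (num_page_patches, dim)"""
    sim = query_embs @ page_embs.t()     # (query tokens, page patches) similarities
    return sim.max(dim=1).values.sum()   # max over patches, sum over query tokens

# made-up shapes: a 12-token query scored against 3 pages of patch embeddings
query = F.normalize(torch.randn(12, 128), dim=-1)
pages = [F.normalize(torch.randn(1024, 128), dim=-1) for _ in range(3)]
scores = torch.stack([late_interaction_score(query, p) for p in pages])
print(scores.argmax())  # index of the best-matching page
```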
- Resources
- pdf retrieval with ColQwen2 vlm Vespa cloud - Vespa python API
- https://www.analyticsvidhya.com/blog/2024/10/multimodal-retrieval-with-colqwen-vespa/ 🌟
- https://huggingface.co/blog/manu/colpali
- https://blog.vespa.ai/retrieval-with-vision-language-models-colpali/ 🌟
- https://github.com/merveenoyan/smol-vision/blob/main/ColPali_%2B_Qwen2_VL.ipynb
Voyage-Multimodal-3
- This is like the hosted version of ColPali
- In contrast to CLIP, this uses a unified encoder (the same encoder for text & image).
- The architecture is similar to how LLaVA works (post-projection), but the use case/objective is comparison.
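A usage sketch assuming the voyageai Python client's multimodal embedding endpoint; the method name, argument format (interleaved text and PIL images), and return shape are assumptions to verify against the Voyage docs:

```python
# assumption: a voyageai client version that ships multimodal_embed; verify against Voyage's docs
import voyageai
from PIL import Image

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

# interleaved text + image in a single input, handled by the unified encoder
doc = ["A chart of quarterly revenue.", Image.open("report_page.png")]  # hypothetical inputs
query = ["what was revenue in Q3?"]

doc_emb = vo.multimodal_embed(inputs=[doc], model="voyage-multimodal-3", input_type="document").embeddings[0]
q_emb = vo.multimodal_embed(inputs=[query], model="voyage-multimodal-3", input_type="query").embeddings[0]

# dot product as a similarity score
print(sum(a * b for a, b in zip(q_emb, doc_emb)))
```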
Document Screenshot Embedding (DSE, MCDSE)
See [2406.11251] Unifying Multimodal Retrieval via Document Screenshot Embedding
Pre-training with Generative Objective (Eg. Moondream/LLaVa/Gemini/OAI Multimodal)
- Turns an existing LLM into a vision assistant
- https://github.com/jiayev/GPT4V-Image-Captioner
LLaVA/Moondream
- encoder-decoder VLM (where the vision & text encoders feed into an LLM decoder)
- visual instruction-following model
- eg. Moondream depends on Phi-1.5 (LLM) and SigLIP (a visual encoder like CLIP) etc.
- 4 parts/components:
  - image encoder: a ViT (usually from CLIP, SigLIP etc., typically skipping the vision tower's own final projection)
  - projection layer:
    - The projection layer in VLMs like LLaVA acts as a learned translator, adapting visual features from a vision encoder into a format (dimensionally compatible embeddings) that a Large Language Model (LLM) can understand.
    - It's trained, often with the LLM frozen initially, to map these visual features into the LLM's input space using image-text datasets.
    - This enables the LLM to process and reason over both visual and textual information for tasks like visual Q&A.
  - text encoder: LLM embedding
  - text decoder: LLM
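A rough sketch of how the four pieces compose; the dimensions, the 2-layer MLP projector, and the concatenation step are assumptions in the spirit of LLaVA, not its exact implementation:

```python
import torch
import torch.nn as nn

# assumed sizes, for illustration only
VIT_DIM, LLM_DIM, NUM_PATCHES, PROMPT_LEN, VOCAB = 1024, 4096, 576, 32, 32000

# projection layer: a small MLP mapping ViT patch features into the LLM's embedding space
projector = nn.Sequential(nn.Linear(VIT_DIM, LLM_DIM), nn.GELU(), nn.Linear(LLM_DIM, LLM_DIM))

vit_features = torch.randn(1, NUM_PATCHES, VIT_DIM)   # stand-in for ViT(image.png)
visual_tokens = projector(vit_features)               # (1, 576, LLM_DIM) visual pseudo-tokens

text_embed = nn.Embedding(VOCAB, LLM_DIM)             # stand-in for the LLM's embedding table
prompt_ids = torch.randint(0, VOCAB, (1, PROMPT_LEN))
text_tokens = text_embed(prompt_ids)                  # (1, 32, LLM_DIM)

# llm( projector(ViT(image)) + embed(text prompt) ): the two sequences are concatenated
# and fed to the LLM decoder, which then generates the answer token by token
llm_input = torch.cat([visual_tokens, text_tokens], dim=1)  # (1, 608, LLM_DIM)
print(llm_input.shape)
```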
Understanding how LLaVA/Moondream etc. is different from CLIP/ColPali
“Because LLMs such as Gemini — and other causal language models more broadly — are trained on next token prediction, the vectors that you get from pooling the output token embeddings aren’t that useful for RAG or semantic search compared to what you get from actual embedding models.”
llm( projector(ViT(image.png)) + embed(text prompt) ) = text output
- Here both of the arguments sent to llm() are dimensionally compatible embeddings after projection.
- Imagine the projection layer's output (visual pseudo-tokens) as a report about the image written in a very specific technical shorthand that the "Committee Pre-processor" understands.
- Imagine the LLM's initial text embeddings for your prompt as your question, also translated into that same technical shorthand by the "Committee Pre-processor".
- Use case
  - Generative (GOOD)
    - You just send these 2 embeddings into the LLM and you should get a nice text output.
  - Comparison/Similarity (BAD/DON'T DO)
    - See Embeddings FAQ for understanding that embedding != similarity.
    - Here we're trying to use models like LLaVA for something they are not built for.
    - We CAN probably take the embedding generated by the projection layer and the embedding of the text prompt and run cosine similarity on them, since the embedding dimensionalities match, right? That'd be the same as CLIP, right? Well, NO (see the sketch below).
    - This most probably won't work as expected, because there's no direct training pressure in LLaVA to make these two specific sets of initial embeddings (projected visual vs. prompt text) align in a way that globally maximizes their cosine similarity for matching pairs.
    - Coming back to our example, these shorthand reports are just the starting point for the main Committee (LLM layers). The Committee's job is to take both, discuss them deeply, cross-reference, infer, and then produce a final considered statement.
    - i.e. for any further understanding of this embedding pair (image & text), we must pass it through the LLM. This is unlike CLIP, where the embeddings themselves carry intrinsic meaning about where they belong in the shared space.
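A small sketch of the anti-pattern the notes warn against, with made-up dimensions and random weights standing in for a LLaVA-style model; the point is that nothing in training ties these two embedding spaces together the way CLIP's contrastive loss does:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LLM_DIM = 4096  # assumed LLM embedding size, for illustration only

# stand-ins for a LLaVA-style model's pieces (random weights, no training)
projector = nn.Linear(1024, LLM_DIM)         # projection layer output: visual pseudo-tokens
text_embed = nn.Embedding(32000, LLM_DIM)    # the LLM's input embedding table

visual_tokens = projector(torch.randn(576, 1024))            # (576, LLM_DIM)
prompt_tokens = text_embed(torch.randint(0, 32000, (12,)))   # (12, LLM_DIM)

# The anti-pattern: pool each side and take cosine similarity, CLIP-style.
img_vec = F.normalize(visual_tokens.mean(dim=0), dim=-1)
txt_vec = F.normalize(prompt_tokens.mean(dim=0), dim=-1)
score = img_vec @ txt_vec
# The dimensions match, so this runs, but the score is not a meaningful relevance
# signal: LLaVA's training never pushes matching image/text pairs together in this
# space. CLIP-style similarity needs a contrastive objective; here the pair is only
# "understood" after passing through the LLM's decoder layers.
print(score.item())
```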
Pre-training with Unified Understanding Objective (Eg. BLIP)
This does both embedding and captioning. Sort of like a middle child.
- Example: BLIP, BLIP-2
- [2205.01917] CoCa: Contrastive Captioners are Image-Text Foundation Models
Pre-training with No text, only Visual Thinking
- This is very new (May ’25)
- See: [2505.11409] Visual Planning: Let’s Think Only with Images
Video Embeddings?
TODO Finetuning and Augmenting VLMs
Finetuning diffusion models (image generation models) is different; see StableDiffusion for that.
- Vision Language Models Explained 🌟
- Paper page - mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data
- Google Colab
- FastVLM: Dramatically Faster Vision Language Model from Apple | Hacker News
- Trying out QvQ – Qwen’s new visual reasoning model | Hacker News
- PaliGemma 2: Powerful Vision-Language Models, Simple Fine-Tuning | Hacker News
- Fine-tune SmolVLM on Visual Question Answering using Consumer GPU with QLoRA