tags : Machine Learning, Image Compression, Computer Vision
Comparison
| Type | Name | Description |
|---|---|---|
| Service | Claude / OpenAI / AWS | Hosted APIs |
| LSTM-CNN | Tesseract | |
| PP-OCR (DB + CRNN) | PaddleOCR | Handles rotated text |
| | EasyOCR | |
| Toolbox, modular models | doctr | Some people report it works better than PaddleOCR and Tesseract |
| PyTorch + MMLab | MMOCR | Might be nice if you already use the mmdetection stack |
| | surya | Documents only, doesn't handle handwriting. Faster than Tesseract, broad language support, tries to guess the proper reading order |
| VLM | MGP-STR | New kid (2024) |
| VLM | GOT | New kid (2024) |
| VLM | TrOCR | |
| VLM | DONUT | |
| VLM | InternVL | |
| VLM | Idefics2 | |
ColPali
ColPali combines:
- Col -> the contextualized late interaction mechanism introduced in ColBERT
- Pali -> a Vision Language Model (VLM), in this case PaliGemma
- See ColBERT in NLP (Natural Language Processing)
- Used for PDF extraction
- ColPali is enabled by the latest advances in Vision Language Models
- notably the PaliGemma model from the Google Zürich team (See Computer Vision)
- and leverages multi-vector retrieval through late interaction mechanisms, as proposed in ColBERT by Omar Khattab (author of DSPy).
- Comments
- It’s not an extraction replacement (I wondered the same thing). It’s retrieval that can bypass extraction.
- I’d misunderstood it as a vision LLM that could extract information from PDFs; it’s more of an embedding model that represents a page of a PDF as a set of vectors, which means “will it just refuse to run” isn’t actually a concern (unlike Claude 3 Vision etc.)
- In essence, ColPali is just an adapter on PaliGemma for the retrieval task, while PaliGemma itself can be used for many other tasks. As the authors point out, you can drop the adapter and use the base model’s general capabilities to read the page.
How does it work?
- During indexing, the complex PDF parsing steps are replaced by using “screenshots” of the PDF pages directly.
- These screenshots are then embedded with the VLM. At inference time, the query is embedded and matched with a late interaction mechanism to retrieve the most similar document pages.
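A minimal sketch of that late-interaction (MaxSim) scoring step, in plain PyTorch. The shapes and random embeddings below are illustrative assumptions; in practice the multi-vector embeddings come from a ColPali-style model (e.g. via the colpali-engine package).

```python
import torch

def late_interaction_scores(query_emb: torch.Tensor, page_embs: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim scoring.

    query_emb: (n_query_tokens, dim)      - multi-vector embedding of the query
    page_embs: (n_pages, n_patches, dim)  - multi-vector embeddings of page screenshots
    Returns one relevance score per page.
    """
    # Similarity of every query token against every patch of every page
    sim = torch.einsum("qd,npd->nqp", query_emb, page_embs)
    # For each query token keep its best-matching patch, then sum over query tokens
    return sim.max(dim=-1).values.sum(dim=-1)

# Toy example with random stand-in embeddings (illustrative only)
torch.manual_seed(0)
query = torch.randn(12, 128)       # 12 query tokens, 128-dim vectors
pages = torch.randn(5, 1024, 128)  # 5 page screenshots, 1024 patches each
print(late_interaction_scores(query, pages).argsort(descending=True))  # most relevant pages first
```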
Resources for ColPali
- https://www.analyticsvidhya.com/blog/2024/10/multimodal-retrieval-with-colqwen-vespa/ 🌟
- https://huggingface.co/blog/manu/colpali
- https://blog.vespa.ai/retrieval-with-vision-language-models-colpali/ 🌟
- https://antaripasaha.notion.site/ColPali-Document-Retrieval-with-Vision-Language-Models-10f5314a5639803d94d0d7ac191bb5b1
- https://github.com/merveenoyan/smol-vision/blob/main/ColPali_%2B_Qwen2_VL.ipynb
Multilang support
Someone wanted to try a ColPali + VLM pipeline for Turkish; the issue is that ColPali (PaliGemma) isn’t trained on it, so ColQwen & Qwen2-VL were recommended instead, and that combination works.
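A hedged sketch of that ColQwen2 setup with the colpali-engine package; the class names, model id (vidore/colqwen2-v1.0), and processor methods follow the package’s published usage but are assumptions here and may differ across versions.

```python
import torch
from PIL import Image
from colpali_engine.models import ColQwen2, ColQwen2Processor  # assumed colpali-engine API

model_id = "vidore/colqwen2-v1.0"  # assumed checkpoint name
model = ColQwen2.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto").eval()
processor = ColQwen2Processor.from_pretrained(model_id)

# Index side: embed page screenshots. Query side: embed the (e.g. Turkish) question.
pages = [Image.open("page_1.png"), Image.open("page_2.png")]
queries = ["Fatura toplam tutarı nedir?"]  # "What is the invoice total?"

batch_pages = processor.process_images(pages).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    page_embeddings = model(**batch_pages)
    query_embeddings = model(**batch_queries)

# Late-interaction (MaxSim) scores: one row per query, one column per page
scores = processor.score_multi_vector(query_embeddings, page_embeddings)
print(scores)
```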
Resources
- RolmOCR-7B follows the same recipe as OlmOCR but builds on Qwen2.5-VL
- MGP-STR: seems to perform better than EasyOCR
- stepfun-ai/GOT-OCR2_0 · Hugging Face
- Due to the use of OPT-125 and a few other components, it is not allowed for commercial use. They also provide inference code on the HF model card.
- Show HN: Gogosseract, a Go Lib for CGo-Free Tesseract OCR via Wazero | Hacker News
- Qwen2-VL-7B Instruct model gets 100% accuracy extracting text from this handwritten document
- Mistral OCR | Hacker News
- https://github.com/facebookresearch/nougat
- https://github.com/VikParuchuri/marker
- Run a job queue for GOT-OCR | Modal Docs
- Benchmarking vision-language models on OCR in dynamic video environments | Hacker News 🌟
- OCR4all | Hacker News
- Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual) | Hacker News
- edge
- We use CLIP embeddings in our first product for a lot of our search capabilities, but they’re mostly interesting because of their carryover to other parts of the app.
- We also do a lot of OCR work, so to filter down candidate images and speed up preprocessing, we trained a small MLP that takes the precomputed CLIP embeddings (instead of raw images) and predicts whether an image contains text (see the sketch after this list).
- The classifier has an F1 score of 0.98. It takes 2-3 minutes to train on a consumer laptop (on our dataset of 30k real + 60k synthetic embeddings), it’s 300 KB on device, and it runs at 10k fps, so it can absolutely rip through a photo library.
- So now, instead of running useless OCR vision requests on images with no legible text, we can skip them up front. For example, of the last 5k photos in my library, nearly 2k have no text and can be skipped entirely. The fastest way to speed up work is to do no work at all!
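A minimal sketch of that kind of filter, assuming precomputed CLIP embeddings; the 512-dim input, layer widths, and training setup are illustrative assumptions, not the commenter’s actual model.

```python
import torch
import torch.nn as nn

# Tiny MLP that maps a precomputed CLIP image embedding to P(image contains text).
# 512-dim input matches e.g. CLIP ViT-B/32; adjust for other encoders.
class TextPresenceClassifier(nn.Module):
    def __init__(self, embed_dim: int = 512, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.net(emb).squeeze(-1)  # logits

model = TextPresenceClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

# Illustrative training loop on random stand-in data; in practice `embs` would be
# cached CLIP embeddings and `labels` would mark which images contain legible text.
embs = torch.randn(1024, 512)
labels = (torch.rand(1024) > 0.5).float()

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(embs), labels)
    loss.backward()
    optimizer.step()

# At inference time: only send images predicted to contain text to the OCR step.
with torch.no_grad():
    keep_for_ocr = torch.sigmoid(model(embs)) > 0.5
```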