tags : Machine Learning, Image Compression, Computer Vision
Comparison
| Type | Name | Description |
|---|---|---|
| Service | Claude / OpenAI / AWS | Hosted APIs |
| LSTM-CNN | Tesseract | |
| PP-OCR (DB + CRNN) | PaddleOCR | Handles rotated text |
| | EasyOCR | |
| Toolbox, modular models | doctr | Some people report it works better than PaddleOCR and Tesseract |
| PyTorch + MMLab | MMOCR | Might be nice if you already use the mmdetection stack |
| | surya | Documents only, doesn't handle handwriting. Faster than Tesseract, broad language support, tries to guess the proper reading order |
| VLM | MGP-STR | New kid (2024) |
| VLM | GOT | New kid (2024) |
| VLM | TrOCR | |
| VLM | DONUT | |
| VLM | InternVL | |
| VLM | Idefics2 | |
ColPali
ColPali combines:
- Col -> the contextualized late interaction mechanism introduced in ColBERT
- Pali -> a Vision Language Model (VLM), in this case PaliGemma
- See ColBERT in NLP (Natural Language Processing)
- Used for PDF extraction
- ColPali is enabled by the latest advances in Vision Language Models
- notably the PaliGemma model from the Google Zürich team (See Computer Vision)
- and leverages multi-vector retrieval through late interaction mechanisms, as proposed in ColBERT by Omar Khattab (author of DSPy).
- Comments
- It’s not an extraction replacement (I wondered the same thing). It’s retrieval that can bypass extraction.
- I’d misunderstood it as a vision LLM that could extract information from PDFs; it’s more of an embedding model that represents a page of a PDF as a set of vectors, which means “will it just refuse to run” isn’t actually a concern (unlike Claude 3 Vision etc.)
- In essence, ColPali is just an adapter on PaliGemma for the retrieval task, while PaliGemma itself can be used for many other tasks. As the authors point out, you can drop the adapter and use the base model’s general capabilities to read the page.
How does it work?
- During indexing, the complex PDF parsing steps are replaced by using “screenshots” of the PDF pages directly.
- These screenshots are then embedded with the VLM. At inference time, the query is embedded and matched with a late interaction mechanism to retrieve the most similar document pages.
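A minimal sketch of that late-interaction (MaxSim) scoring step, in plain PyTorch. The shapes and random embeddings below are illustrative assumptions; in practice the multi-vector embeddings come from a ColPali-style model (e.g. via the colpali-engine package).

```python
import torch

def late_interaction_scores(query_emb: torch.Tensor, page_embs: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim scoring.

    query_emb: (n_query_tokens, dim)      - multi-vector embedding of the query
    page_embs: (n_pages, n_patches, dim)  - multi-vector embeddings of page screenshots
    Returns one relevance score per page.
    """
    # Similarity of every query token against every patch of every page
    sim = torch.einsum("qd,npd->nqp", query_emb, page_embs)
    # For each query token keep its best-matching patch, then sum over query tokens
    return sim.max(dim=-1).values.sum(dim=-1)

# Toy example with random stand-in embeddings (illustrative only)
torch.manual_seed(0)
query = torch.randn(12, 128)       # 12 query tokens, 128-dim vectors
pages = torch.randn(5, 1024, 128)  # 5 page screenshots, 1024 patches each
print(late_interaction_scores(query, pages).argsort(descending=True))  # most relevant pages first
```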
Resources for ColPali
- https://www.analyticsvidhya.com/blog/2024/10/multimodal-retrieval-with-colqwen-vespa/ 🌟
- https://huggingface.co/blog/manu/colpali
- https://blog.vespa.ai/retrieval-with-vision-language-models-colpali/ 🌟
- https://antaripasaha.notion.site/ColPali-Document-Retrieval-with-Vision-Language-Models-10f5314a5639803d94d0d7ac191bb5b1
- https://github.com/merveenoyan/smol-vision/blob/main/ColPali_%2B_Qwen2_VL.ipynb
Multilang support
Someone wanted to try a ColPali + VLM pipeline for Turkish; the issue is that ColPali (PaliGemma) isn’t trained on it, so ColQwen & Qwen2-VL were recommended instead, and that combination works.
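A hedged sketch of that ColQwen2 setup with the colpali-engine package; the class names, model id (vidore/colqwen2-v1.0), and processor methods follow the package’s published usage but are assumptions here and may differ across versions.

```python
import torch
from PIL import Image
from colpali_engine.models import ColQwen2, ColQwen2Processor  # assumed colpali-engine API

model_id = "vidore/colqwen2-v1.0"  # assumed checkpoint name
model = ColQwen2.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto").eval()
processor = ColQwen2Processor.from_pretrained(model_id)

# Index side: embed page screenshots. Query side: embed the (e.g. Turkish) question.
pages = [Image.open("page_1.png"), Image.open("page_2.png")]
queries = ["Fatura toplam tutarı nedir?"]  # "What is the invoice total?"

batch_pages = processor.process_images(pages).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    page_embeddings = model(**batch_pages)
    query_embeddings = model(**batch_queries)

# Late-interaction (MaxSim) scores: one row per query, one column per page
scores = processor.score_multi_vector(query_embeddings, page_embeddings)
print(scores)
```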
Resources
- RolmOCR-7B follows the same recipe as OlmOCR but builds on Qwen2.5-VL
- MGP-STR: seems to perform better than EasyOCR
- stepfun-ai/GOT-OCR2_0 · Hugging Face
- Due to the use of OPT-125 and a few other components, it is not allowed for commercial use. They also provide inference code on the HF model card.
- Show HN: Gogosseract, a Go Lib for CGo-Free Tesseract OCR via Wazero | Hacker News
- Qwen2-VL-7B Instruct model gets 100% accuracy extracting text from this handwritten document
- Mistral OCR | Hacker News
- https://github.com/facebookresearch/nougat
- https://github.com/VikParuchuri/marker
- Run a job queue for GOT-OCR | Modal Docs
- Benchmarking vision-language models on OCR in dynamic video environments | Hacker News 🌟
- OCR4all | Hacker News
- Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual) | Hacker News
- edge
- We use CLIP embeddings in our first product for a lot of our search capabilities, but they’re mostly interesting because of their carryover to other parts of the app.
- We also do a lot of OCR work, so to filter down candidate images and speed up preprocessing, we trained a small MLP that takes the precomputed CLIP embeddings (instead of raw images) and predicts whether an image contains text (see the sketch after this list).
- The classifier has an F1 score of 0.98. It takes 2-3 minutes to train on a consumer laptop (on our dataset of 30k real + 60k synthetic embeddings), it’s 300 KB on device, and it runs at 10k fps, so it can absolutely rip through a photo library.
- So now, instead of running useless OCR vision requests on images with no legible text, we can skip them up front. For example, of the last 5k photos in my library, nearly 2k have no text and can be skipped entirely. The fastest way to speed up work is to do no work at all!
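A minimal sketch of that kind of filter, assuming precomputed CLIP embeddings; the 512-dim input, layer widths, and training setup are illustrative assumptions, not the commenter’s actual model.

```python
import torch
import torch.nn as nn

# Tiny MLP that maps a precomputed CLIP image embedding to P(image contains text).
# 512-dim input matches e.g. CLIP ViT-B/32; adjust for other encoders.
class TextPresenceClassifier(nn.Module):
    def __init__(self, embed_dim: int = 512, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.net(emb).squeeze(-1)  # logits

model = TextPresenceClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

# Illustrative training loop on random stand-in data; in practice `embs` would be
# cached CLIP embeddings and `labels` would mark which images contain legible text.
embs = torch.randn(1024, 512)
labels = (torch.rand(1024) > 0.5).float()

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(embs), labels)
    loss.backward()
    optimizer.step()

# At inference time: only send images predicted to contain text to the OCR step.
with torch.no_grad():
    keep_for_ocr = torch.sigmoid(model(embs)) > 0.5
```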