tags : Machine Learning, Image Compression, Computer Vision

Comparison

| Type | Name | Description |
|---|---|---|
| Service | Claude/OpenAI/AWS | They have APIs |
| LSTM-CNN | Tesseract | |
| PP-OCR (DB+CRNN) | PaddleOCR | Works with rotated text |
| | EasyOCR | |
| Toolbox, modular models | doctr | Some people mention it works better than PaddleOCR and Tesseract. |
| PyTorch + mmlabs | MMOCR | Might be nice if already using mmdetection |
| | surya | Only for documents; doesn’t work on handwriting. Faster than Tesseract. Language support. Tries to guess the proper reading order. |
| VLM | MGP-STR | new kid (2024) |
| VLM | GOT | new kid (2024) |
| VLM | TrOCR | |
| VLM | DONUT | |
| VLM | InternVL | |
| VLM | Idefics2 | |

ColPali

ColPali combines:

  • Col -> the contextualized late interaction mechanism introduced in ColBERT
  • Pali -> with a Vision Language Model (VLM), in this case, PaliGemma
  • See ColBERT in NLP (Natural Language Processing)
  • Used for retrieval over PDF pages (it can bypass extraction, see Comments)
  • ColPali is enabled by the latest advances in Vision Language Models
    • notably the PaliGemma model from the Google Zürich team (See Computer Vision)
    • and leverages multi-vector retrieval through late interaction mechanisms, as proposed in ColBERT by Omar Khattab (author of DSPy).
  • Comments
    • It’s not an extraction replacement (I wondered the same thing). It’s retrieval that can bypass extraction.
    • I’d misunderstood it as a vision LLM that could extract information from PDFs. It looks like it’s more of an embedding model that represents a PDF page as a set of vectors, which means “will it just refuse to run” isn’t actually a concern (unlike Claude 3 Vision etc.).
    • In essence, ColPali is just an adapter on PaliGemma for the retrieval task, while PaliGemma itself can be used for many other tasks. As the authors point out, you can remove or skip the adapter and use the model’s general capabilities to read the page.

How does it work?

  • During indexing, the complex PDF parsing steps are replaced by using “screenshots” of the PDF pages directly.
  • These screenshots are then embedded with the VLM. At inference time, the query is embedded and matched with a late interaction mechanism to retrieve the most similar document pages.
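
A minimal sketch of the late interaction (MaxSim) scoring step described above, in pure Python. It assumes each page and each query has already been embedded into a list of vectors by the VLM. The 2-d vectors, the page ids, and the helper names (`late_interaction_score`, `rank_pages`) are made up for illustration and are not part of any ColPali library.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def late_interaction_score(query_vecs: List[Vector], page_vecs: List[Vector]) -> float:
    """ColBERT-style MaxSim: for each query vector, take its best match
    among the page vectors, then sum those maxima over the query."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(
    query_vecs: List[Vector],
    index: Dict[str, List[Vector]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every indexed page against the query and return the top_k
    (page_id, score) pairs, best first."""
    scored = [
        (page_id, late_interaction_score(query_vecs, page_vecs))
        for page_id, page_vecs in index.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy "index": one entry per page screenshot, each a set of vectors.
# Real embeddings would come from the VLM; these values are hypothetical.
index = {
    "page_1": [[1.0, 0.0], [0.0, 1.0]],
    "page_2": [[0.5, 0.5], [0.5, 0.5]],
    "page_3": [[-1.0, 0.0], [0.0, -1.0]],
}

# Toy query embedding: one vector per query token (also hypothetical).
query = [[1.0, 0.0], [0.0, 1.0]]

ranking = rank_pages(query, index)
print(ranking)
```

Because every page keeps a whole set of vectors rather than a single pooled one, a query token only has to match some region of the page well to contribute to the score. In practice the vectors are usually normalized and the scoring is batched on a GPU; the nested loops here only show the logic.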

Resources for ColPali

Multilang support

Someone wanted to try a ColPali + VLM pipeline for the Turkish language. The issue was that ColPali (PaliGemma) isn’t trained on it, so ColQwen and Qwen2-VL were recommended instead, and that worked.

Resources