tags : Machine Learning, Image Compression, Computer Vision, Deploying ML applications (applied ML)
Comparison
Type | Name | Description |
---|---|---|
Service | Claude / OpenAI / AWS | Hosted commercial APIs. |
LSTM-CNN | Tesseract | |
PP-OCR (DB + CRNN) | PaddleOCR | Handles rotated text. |
CRAFT + CRNN | EasyOCR | |
Toolbox, modular models | docTR | Some people report it works better than PaddleOCR and Tesseract. |
PyTorch + OpenMMLab | MMOCR | Might be nice if you're already using the MMDetection stack. |
Toolkit | surya | Documents only; doesn't handle handwriting. Faster than Tesseract, broad language support, tries to infer the proper reading order. |
VLM | MGP-STR | New kid (2024). |
VLM | GOT | New kid (2024). |
VLM | olmOCR | "olmOCR – Open-Source OCR for Accurate Document Conversion" (has a comparison to GOT). |
VLM | RolmOCR | A better and faster olmOCR. |
VLM | TrOCR | |
VLM | DONUT | |
VLM | InternVL | |
VLM | Idefics2 | |
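To make the comparison concrete, the classic (non-VLM) libraries are each a few lines to smoke-test. A minimal sketch, assuming `pytesseract`, `easyocr`, and `paddleocr` are installed; APIs as of the PaddleOCR 2.x line, and they do drift between releases:

```python
# Quick smoke test of three non-VLM options on one image.
from PIL import Image
import pytesseract               # pip install pytesseract (needs the tesseract binary)
import easyocr                   # pip install easyocr
from paddleocr import PaddleOCR  # pip install paddleocr

image_path = "sample.png"  # any test image with text

# Tesseract: plain text out, fastest to wire up.
print(pytesseract.image_to_string(Image.open(image_path)))

# EasyOCR: returns (bbox, text, confidence) triples.
reader = easyocr.Reader(["en"])
for bbox, text, conf in reader.readtext(image_path):
    print(f"{conf:.2f} {text}")

# PaddleOCR: the angle classifier is what handles the rotated text noted above.
ocr = PaddleOCR(use_angle_cls=True, lang="en")
for bbox, (text, conf) in ocr.ocr(image_path, cls=True)[0]:
    print(f"{conf:.2f} {text}")
```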
olmOCR
introduces a technique they call "document anchoring", where the quality of the extracted text is improved by feeding any text and metadata already present in the PDF file to the model alongside the rendered page image.
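A minimal sketch of the idea, not olmOCR's actual pipeline: `pypdf` for pulling the embedded text layer and the prompt wording are my own placeholders.

```python
# Sketch of "document anchoring": hand the VLM whatever text the PDF already
# carries so it can anchor its transcription on it. Illustration only; the
# prompt and the use of pypdf are assumptions, not olmOCR's implementation.
from pypdf import PdfReader

def build_anchored_prompt(pdf_path: str, page_index: int) -> str:
    page = PdfReader(pdf_path).pages[page_index]
    embedded_text = page.extract_text() or ""  # empty for pure scans
    return (
        "Transcribe this page to plain text.\n"
        "Text already present in the PDF's text layer "
        "(may be incomplete or out of order):\n"
        f"{embedded_text[:4000]}"
    )

# The prompt is then sent to the VLM together with the rendered page image.
```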
Resources
- RolmOCR-7B follows the same recipe as olmOCR, but builds on Qwen2.5-VL.
- MGP-STR: seems to be better than EasyOCR.
- stepfun-ai/GOT-OCR2_0 · Hugging Face
    - Due to the use of OPT-125 and a few other components, it is not allowed for commercial use. Otherwise, they also provide inference code on the HF model card.
- Show HN: Gogosseract, a Go Lib for CGo-Free Tesseract OCR via Wazero | Hacker News
- Qwen2-VL-7B Instruct model gets 100% accuracy extracting text from this handwritten document
- Mistral OCR | Hacker News
- https://github.com/facebookresearch/nougat
- https://github.com/VikParuchuri/marker
- Run a job queue for GOT-OCR | Modal Docs
- Benchmarking vision-language models on OCR in dynamic video environments | Hacker News 🌟
- OCR4all | Hacker News
- Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual) | Hacker News
- edge
    - We use CLIP embeddings in our first product for a lot of our search capabilities, but it's mostly interesting because of its carryover to other parts of the app.
    - We also do a lot of OCR work, so to filter down candidate images and speed up preprocessing, we trained a small MLP that takes the precomputed CLIP embeddings (instead of raw images) and predicts whether an image contains text (see the sketch after this list).
    - The classifier has an F1 score of 0.98. It takes 2–3 min to train on a consumer laptop (given our dataset of 30k embeddings + 60k synthetic), it's 300 KB on device, and it runs at 10k fps, so it can absolutely rip through a photo library.
    - So now, instead of running useless OCR vision requests on images with no legible text, we can just skip them up front. For example, of my last 5k photos, nearly 2k have no text and can be skipped entirely. The fastest way to speed up work is to do no work at all!
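A minimal sketch of that text/no-text gate, using scikit-learn's `MLPClassifier` over saved CLIP image embeddings. The file names, hidden size, and split are made up, since the comment doesn't specify the architecture:

```python
# Tiny MLP gate over precomputed CLIP embeddings instead of raw pixels.
# Paths and hyperparameters below are placeholders for illustration.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# X: (n, 512) CLIP image embeddings; y: 1 if the image contains legible text.
X = np.load("clip_embeddings.npy")   # placeholder path
y = np.load("has_text_labels.npy")   # placeholder path

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300)
clf.fit(X_tr, y_tr)
print("F1:", f1_score(y_te, clf.predict(X_te)))

# At inference time, only run the (expensive) OCR pass on images the gate keeps:
# keep = clf.predict(new_embeddings) == 1
```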