tags : Machine Learning, NLP (Natural Language Processing)
Vision tasks
Task | Variations |
---|---|
Object detection | Real time, non-realtime, rotated, captioning, classification |
Object tracking(video) | Similar to object detection but with some additional checks |
Segmentation | |
Contrastive Learning | |
Distillation | |
VDU (Visual Document Understanding) | OCR, OCR-free |
VQA (Visual QA) | |
Pose estimation | |
Image generation | |
Object Detection
Architecture | Type | Name | Use | Other notes |
---|---|---|---|---|
Transformer | VLM | LLaVA | Visual Q/A, alternative to GPT-4V | |
Transformer | VLM | moondream | Same as LLaVA | Can’t do OCR, probably |
Transformer | VLM | CogVLM | Same as LLaVA | Better than LLaVA at captioning |
Transformer | ViT | CLIP | Text-guided image generation, classification, captioning | |
Transformer | ViT | BLIP | Same as CLIP, better than CLIP at captioning | Considered faster than CLIP? |
Transformer | ViT | DETIC | | |
Transformer | ViT | GDINO | Better at detection than CLIP | Similar to YOLO but slower |
CNN | 1 stage | YOLO | Realtime object identification | No involvement of anything NLP-like (VLMs) |
CNN | 2 stage | Detectron-2 | | Apache license, Fast-RCNN |
CNN | | EfficientNetV2 | Classification | |
Theory
- What are some foundational papers in CV that every newcomer should read? : computervision
- Object Detection | Papers With Code
- A Dive into Vision-Language Models
- Closed-set: Detect from trained stuff. Eg. Find all dogs in the image
- Open-set: Detects un-trained stuff. Eg. Find the right-most dog, is a person holding a dog? (Transformer based models work nicely here)
  - In other words, allows us to do zero-shot object detection
CNN based
- CNN uses pixel arrays
YOLO! (?)
Lots of crazy politics: anyone keeps coming up with something called “YOLO”, and a newer version number doesn’t mean a newer version of the same thing, which is super confusing. The original author left the project long back for ethical reasons.
“The YOLOv5, YOLOv6, and YOLOv7 teams all say their versions are faster and more accurate than the rest, and for what it’s worth, the teams for v4 and v7 overlap, and the implementation for v7 is based on v5. At the end of the day, the only benchmarks that really matter to us are ones using our data and hardware”
- The YOLO algorithm treats object detection as a regression problem, using a single convolutional neural network that predicts spatially separated bounding boxes and their associated class probabilities in one pass (a minimal inference sketch follows the version table below).
- YOLO is a family of detection algorithms made by, at times, totally different groups of people.
- See Programming Comments - Darknet FAQ
- See [2304.00501] A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS
Year | Name | Description |
---|---|---|
| Darknet/YOLO | Someone says it’s faster than the newer versions, idk tf they talking about. Goes up to YOLOv7 |
2015 | YOLOv1 | Improved on R-CNN/Fast R-CNN by using a single CNN architecture, made things real fast |
2016 | YOLOv2 | Improved on v1 (Darknet-19), anchor boxes |
2018 | YOLOv3 | Improved on v2 (Darknet-53), NMS was added |
| YOLOv4 | Added CSPNet, k-means, GHM loss etc. |
| YOLOv5 | |
| YOLOv6 | |
| YOLOv7 | |
| YOLOv8 | Variants: YOLOv8-n, YOLOv8-s, YOLOv8-m |
| YOLOv10 | |
| YOLO-X | Based on YOLOv3 but adds features such as anchor-free detection and other things |
| YOLO-NAS | For detecting small objects, suitable for edge devices |
| YOLO-World | |
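Since detection is a single forward pass of one network, inference is minimal to set up. A rough sketch using the `ultralytics` package (YOLOv8 family); the weight file and image path are assumptions:

```python
# Minimal YOLOv8 inference sketch with the ultralytics package.
# "yolov8n.pt" (nano, COCO-pretrained) and "image.jpg" are assumed names.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
results = model("image.jpg")        # one forward pass -> boxes, classes, confidences
for r in results:
    print(r.boxes.xyxy, r.boxes.cls, r.boxes.conf)
```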
YOLO vs older CNN based models
From a reddit comment
- the R-CNN family:
  - Find the interesting regions
  - For every interesting region: What object is in the region?
  - Remove overlapping and low score detections
- YOLO/SSD:
  - Come up with a fixed grid of regions
  - Predict N objects in every region all at once
  - same as above
MMLabs 🌟
- mmdetection(MMLabs)
- More like a framework for vision models.
- Good choice if you’re just experimenting with a model right now. SOTA models can be trained via config alone (see the sketch below).
- Has both CNN and Transformer based stuff
- It also has YOLO model variants: https://github.com/open-mmlab/mmyolo
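A rough sketch of the config-driven workflow with mmdetection; the config/checkpoint paths below are assumed placeholders, but any model zoo pair follows the same pattern:

```python
# Config-driven inference with mmdetection; paths are assumed placeholders.
from mmdet.apis import init_detector, inference_detector

config = "configs/faster_rcnn/faster-rcnn_r50_fpn_1x_coco.py"   # assumed path
checkpoint = "checkpoints/faster_rcnn_r50_fpn_1x_coco.pth"      # assumed path

model = init_detector(config, checkpoint, device="cpu")
result = inference_detector(model, "demo.jpg")                   # detections for one image
print(result)
```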
Doubts
- ResNet?
Transformer
- ViT splits the input images into visual tokens
- divides an image into fixed-size patches
- flattens and linearly embeds each of them
- includes positional embedding as an input to the transformer encoder
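A toy sketch of that front-end (patchify, linearly embed, add positional embeddings); the patch size and dimensions are illustrative, and real ViTs also prepend a [CLS] token:

```python
# Toy ViT front-end: split an image into fixed-size patches, linearly embed each
# patch ("visual tokens"), and add positional embeddings. Dims are illustrative.
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)                 # (B, C, H, W)
patch = 16                                          # 224/16 = 14 -> 196 patches

# Unfold into patches: (1, 3, 14, 14, 16, 16) -> (1, 196, 16*16*3)
patches = image.unfold(2, patch, patch).unfold(3, patch, patch)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 196, 3 * patch * patch)

embed = nn.Linear(3 * patch * patch, 768)           # patch -> visual token embedding
pos = nn.Parameter(torch.zeros(1, 196, 768))        # learned positional embedding
tokens = embed(patches) + pos                       # (1, 196, 768) -> transformer encoder input
print(tokens.shape)
```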
Vision Encoder (Transformer)
- The OG here is ViT(Google Brain team)
- Downstream variants include: BEiT, DeiT, Swin, CSWIn(better than Swin), MAE, DINO(Improved DETR)
- Prior work before ViT was DETR
- Has outperformed CNN models in certain cases
- ViT models have been reported to outperform the then-SOTA CNNs by almost 4x in terms of computational efficiency and accuracy.
- What we can do
  - Ask what an image is about, i.e. Input(Image) = Output(Text)
  - Return a set of images given a text, i.e. Input(Text) = Output(Image(s))
- Examples of implementation: CLIP, GDINO
- ViT is what CLIP uses. In other words, CLIP is possible because of ViT.
- GroundingDINO(GDINO) uses DINO
- Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs - YouTube (combining DINO+CLIP for better results)
Architecture of ViT
- This blogpost has a simple explanation of the architecture
Architecture of CLIP/GDINO
- Components
  - Text backbone: e.g. BERT / OpenAI embeddings
  - Image backbone: e.g. Swin / ViT
- Steps
- Feature encoder: Fuse text and image into one using self attention mechanism
- Enhance encoding: Combines text and image
- Language-guided query selection (Don’t understand)
- Cross-Modality Decoder (Don’t understand)
- Calculate loss
- Contrastive loss: Compare text and visual features
- DETR loss: Bounding box, detect object of interest
- More stuff
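To make the contrastive loss mentioned above concrete, a minimal CLIP-style sketch (symmetric cross-entropy over an image-text similarity matrix); batch size, embedding dim, and temperature are made up:

```python
# CLIP-style contrastive loss over a batch of paired image/text embeddings.
# Matching pairs sit on the diagonal of the similarity matrix.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, temperature: float = 0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(len(img_emb))                # diagonal = correct pairs
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```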
Similarity search
- With such models we can look for similarity between the text and the image embedding
- Because the inner product is a similarity metric
- Process
- Create a bunch of prompts out of your classes (e.g. “A photo of a Dog”) and run them through the text encoder to get an embedding.
- Create the image embedding with the image encoder
- Multiply it with each text embedding to get the similarity of the texts and the images.
- The one with the highest similarity is your predicted class.
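One way to run these steps with Hugging Face transformers’ CLIP; the checkpoint name, labels, and image path are assumptions:

```python
# Zero-shot classification via text-image similarity with CLIP.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
image = Image.open("example.jpg")

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
probs = out.logits_per_image.softmax(dim=-1)[0]     # similarities -> probabilities
print(dict(zip(labels, probs.tolist())))            # highest = predicted class
```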
VLM(Vision Language Model)
- MOONDREAM EXAMPLE PROMPT ENG: https://gist.github.com/geekodour/e9bb4c0c35957cabdb2dc3618be1f4e1
- These are technically a special use case of combining a vision encoder with an LLM
- Capabilities keep improving/differ
- Examples: LLaVa, moondream2 etc.
- Example project using a VLM to do captioning: jiayev/GPT4V-Image-Captioner
- See https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation
Architecture of VLM
- Components
visual encoder
- VLMs need a visual encoder such as CLIP (ViT); this can be used in
- Plugging ViT into task-specific fine-tuning
- Combining ViT with V&L pre-training and transferring to downstream tasks
- So the visual understanding of VLM will only be as good as the ViT.
- In other words, when you attach a “decoder only LLM” to a vision encoder such as CLIP, you get a VLM.
LLM
- VLMs also depend on some LLM
  - Example: moondream depends on Phi-1.5 (LLM) and SigLIP (a visual encoder like CLIP)
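A hypothetical, simplified sketch of that wiring (not any real model’s code): a ViT produces patch features, a projector maps them into the LLM’s embedding space, and the decoder-only LLM consumes image tokens followed by text tokens.

```python
# Hypothetical VLM wiring sketch; all shapes/dims and the dummy modules are made up.
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    def __init__(self, vision_encoder, llm, vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder              # e.g. a frozen ViT (CLIP/SigLIP)
        self.projector = nn.Linear(vision_dim, llm_dim)   # image features -> LLM token space
        self.llm = llm                                    # decoder-only LLM (moondream uses Phi-1.5)

    def forward(self, pixel_values, text_embeds):
        patch_feats = self.vision_encoder(pixel_values)   # (B, num_patches, vision_dim)
        image_tokens = self.projector(patch_feats)        # (B, num_patches, llm_dim)
        # Prepend the projected image tokens to the text embeddings and run the LLM.
        return self.llm(torch.cat([image_tokens, text_embeds], dim=1))

# Dummy stand-ins so the sketch runs end to end.
class DummyViT(nn.Module):
    def forward(self, x):                                 # (B, 3, H, W) -> (B, 49, 768)
        return torch.randn(x.shape[0], 49, 768)

class DummyLLM(nn.Module):
    def forward(self, seq):                               # fake next-token logits
        return torch.randn(seq.shape[0], seq.shape[1], 32000)

vlm = TinyVLM(DummyViT(), DummyLLM(), vision_dim=768, llm_dim=2048)
print(vlm(torch.randn(1, 3, 224, 224), torch.randn(1, 10, 2048)).shape)  # (1, 59, 32000)
```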
Fine tuning VLM
- Eg. moondream does not support OCR ootb, but we can finetune it to support OCR. Results also depend on the LLM used. Eg. from a reddit comment: “However would not recommend moondream for ocr due to it’s PHI tokenizer not splitting digits into separate tokens. It struggles to pick up sequential digits reliably. For example: 222 is likely to become 22 due to this issue.”
- https://github.com/BradyFU/Woodpecker (correction)
Combining Transformer based + CNN based
This is only useful if you need super fast inference, low-compute inference, etc. Otherwise, for most cases GDINO/CLIP etc. are the goto.
- Now CNN based inference is faster than transformer based. So something like YOLO is still more preferable for realtime stuff.
- But we can use GDINO to generate labels for our training dataset, and then we can use those to train our YOLO models, which will be fast.
- Essentially, use Transformer based detection for labeling & training the CNN model
- Use the CNN model to do fast inference in production
- Basically using foundation models to train fine-tuned models. The foundation model acts as an automatic labeling tool, then you can use that model to get your dataset.
- https://github.com/autodistill/autodistill allows you to do exactly this (sketched below).
- see https://www.youtube.com/@Roboflow/videos
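A hedged sketch of the label-with-GDINO-then-train-YOLO flow, following the pattern in the autodistill README; the ontology, paths, and epoch count are assumptions, and exact package names/signatures may differ between versions:

```python
# Foundation model auto-labels the data, fast CNN model is trained on the result.
from autodistill.detection import CaptionOntology
from autodistill_grounding_dino import GroundingDINO
from autodistill_yolov8 import YOLOv8

# 1. Use GDINO to auto-label raw images (prompt -> class name mapping is assumed).
base_model = GroundingDINO(ontology=CaptionOntology({"a shipping container": "container"}))
base_model.label("./images", extension=".jpg")            # writes a YOLO-format dataset

# 2. Train YOLOv8 on the auto-labeled dataset for fast production inference.
target_model = YOLOv8("yolov8n.pt")
target_model.train("./dataset/data.yaml", epochs=100)
```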
Segmentation
Name | Description |
---|---|
SEEM | |
SAM | |
- To improve segmentation we can tune the params; alternatively, we can use some kind of object detection (eg. YOLO) to draw bounding boxes before we apply segmentation. See this thread for more info.
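A hedged sketch of that detect-then-segment idea, feeding a detector’s box to SAM as a box prompt; the checkpoint path, image path, and box coordinates are assumptions:

```python
# Box-prompted segmentation with segment-anything (SAM).
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

image = np.array(Image.open("example.jpg").convert("RGB"))

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")   # assumed local checkpoint
predictor = SamPredictor(sam)
predictor.set_image(image)

# Box from an upstream detector (e.g. YOLO/GDINO), in xyxy pixel coordinates (made up here).
box = np.array([50, 40, 320, 280])
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
print(masks.shape, scores)                                       # (1, H, W) boolean mask
```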
Visual Document Understanding (VDU)
- OCR
- 2-stage pipeline: Usually when trying to understand a document, we’d do OCR and then run the output through another process for the understanding (a rough sketch follows this list).
- Issue: With OCR alone, the result might not be what you want, e.g. no spatial understanding (even across different lines etc.). Using an OCR-free approach might help.
- See OCR
- OCR-free
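A rough sketch of the 2-stage pipeline, using pytesseract for the OCR stage; the file name is assumed and the “understanding” stage is just a hypothetical placeholder:

```python
# Stage 1: OCR the document image. Stage 2: a separate "understanding" step.
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(Image.open("invoice.png"))   # stage 1: raw text

def understand(raw_text: str) -> dict:
    # Stage 2 placeholder: in practice an LLM / layout model / rule engine goes here.
    return {"num_chars": len(raw_text), "first_lines": raw_text.splitlines()[:5]}

print(understand(text))
```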
Meta
Datasets
- ImageNet
- COCO