tags : Machine Learning, NLP (Natural Language Processing)

Vision tasks

Task | Variations
--- | ---
Object detection | Real time, non-realtime, rotated, captioning, classification
Object tracking (video) | Similar to object detection but with some additional checks
Segmentation |
Contrastive Learning |
Distillation |
VDU (Visual Document Understanding) | OCR, OCR-free
VQA (Visual QA) |
Pose estimation |

Image generation

Object Detection

Architecture | Type | Name | Use | Other notes
--- | --- | --- | --- | ---
Transformer | VLM | LLaVA | Visual Q/A, alternative to GPT-4V |
Transformer | VLM | moondream | Same as LLaVA | Probably can't do OCR
Transformer | VLM | CogVLM | Same as LLaVA | Better than LLaVA at captioning
Transformer | ViT | CLIP | Text-guided image generation, classification, captioning |
Transformer | ViT | BLIP | Same as CLIP, better than CLIP at captioning | Considered faster than CLIP?
Transformer | ViT | DETIC | |
Transformer | ViT | GDINO | Better at detection than CLIP | Similar to YOLO but slower
CNN | 1 stage | YOLO | Realtime object detection | No NLP component involved (unlike VLMs)
CNN | 2 stage | Detectron-2 | | Apache license, Fast-RCNN
CNN | | EfficientNetV2 | Classification |

Theory

CNN based

  • CNN uses pixel arrays

  • YOLO! (?)

    There's a lot of messy politics here: anyone can release a model under the YOLO name, so a newer version number doesn't necessarily mean a newer version of the same thing. Super confusing. The original author stopped working on it long ago, citing ethical concerns.

    “The YOLOv5, YOLOv6, and YOLOv7 teams all say their versions are faster and more accurate than the rest, and for what it’s worth, the teams for v4 and v7 overlap, and the implementation for v7 is based on v5. At the end of the day, the only benchmarks that really matter to us are ones using our data and hardware”

    Year | Name | Description
    --- | --- | ---
    | Darknet/YOLO | The original implementation; some claim it's faster than the newer versions (unverified). Goes up to YOLOv7
    2015 | YOLOv1 | Improved on R-CNN/Fast R-CNN by using a single CNN architecture, made things real fast
    2016 | YOLOv2 | Improved on v1 (Darknet-19), anchor boxes
    2018 | YOLOv3 | Improved on v2 (Darknet-53), NMS was added
    | YOLOv4 | Added CSPNet, k-means, GHM loss etc.
    | YOLOv5 |
    | YOLOv6 |
    | YOLOv7 |
    | YOLOv8 |
    | YOLOv8-n |
    | YOLOv8-s |
    | YOLOv8-m |
    | YOLOv10 |
    | YOLO-X | Based on YOLOv3 but adds features such as anchor-free detection, among other things
    | YOLO NAS | For detecting small objects, suitable for edge devices
    | YOLO-World |

    • YOLO vs older CNN based models

      From a reddit comment

      • the R-CNN family:
        • Find the interesting regions
        • For every interesting region: What object is in the region?
        • Remove overlapping and low score detections
      • YOLO/SSD:
        • Come up with a fixed grid of regions
        • Predict N objects in every region all at once
        • Remove overlapping and low-score detections (same as above)
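    Whichever version you end up with, most recent YOLO variants can be run through the ultralytics package. A minimal inference sketch, assuming ultralytics is installed; the weights file and image names are just examples:

```python
# Minimal YOLO inference sketch using the ultralytics package.
# Assumes `pip install ultralytics`; "yolov8n.pt" and "street.jpg" are example names.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")        # load a small pretrained checkpoint
results = model("street.jpg")     # run detection on an image

for r in results:
    for box in r.boxes:           # one entry per detected object
        cls_id = int(box.cls)     # predicted class index
        conf = float(box.conf)    # confidence score
        print(model.names[cls_id], conf, box.xyxy.tolist())
```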
  • MMLabs 🌟

    • mmdetection (MMLabs)
    • More like a framework for vision models.
    • Good choice if you're just experimenting with a model right now; SOTA models can be trained via config alone.
      • Has both CNN and Transformer based stuff
    • It also has YOLO model variants: https://github.com/open-mmlab/mmyolo
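    A minimal sketch of running a pretrained mmdetection model, using the mmdet 2.x-style API (the config and checkpoint paths below are placeholders; pick real ones from the model zoo, and note the 3.x API also offers a DetInferencer):

```python
# Minimal mmdetection inference sketch (mmdet 2.x-style API).
# Config and checkpoint paths are placeholders taken from the model zoo layout.
from mmdet.apis import init_detector, inference_detector

config_file = "configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py"   # example config
checkpoint_file = "checkpoints/faster_rcnn_r50_fpn_1x_coco.pth"      # example weights

model = init_detector(config_file, checkpoint_file, device="cuda:0")
result = inference_detector(model, "demo.jpg")   # per-class bounding boxes + scores
```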
  • Doubts

    • ResNet?

Transformer

  • ViT splits the input image into visual tokens
    • divides an image into fixed-size patches
    • linearly embeds each of them
    • adds positional embeddings and feeds the resulting sequence to the transformer encoder
  • Vision Encoder (Transformer)

    • The OG here is ViT (Google Brain team)
      • Downstream variants include: BEiT, DeiT, Swin, CSWin (better than Swin), MAE, DINO (improved DETR)
      • Prior work before ViT was DETR
    • Has outperformed CNN models in certain cases
      • Reported results claim ViT models outperform SOTA CNNs by roughly 4x in computational efficiency while matching or beating their accuracy.
    • What we can do
      • Ask what an image is about, i.e. Input(Image) = Output(Text)
      • Return a set of images given a text query, i.e. Input(Text) = Output(Image(s))
    • Examples of implementation: CLIP, GDINO
    • Architecture of ViT
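      A minimal sketch of the patchify + embed + positional-embedding step described above, in plain PyTorch (sizes are illustrative; the class token is omitted for brevity):

```python
# Minimal ViT-style patch embedding sketch (illustrative sizes, plain PyTorch).
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)            # (batch, channels, height, width)
patch_size, dim = 16, 768

# Split the image into fixed-size patches and linearly embed each one.
# A Conv2d with kernel == stride == patch_size is the standard trick for this.
to_patch_embedding = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
patches = to_patch_embedding(img)            # (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)  # (1, 196, 768) -> one token per patch

# Add learned positional embeddings so the encoder knows where each patch came from.
pos_embedding = nn.Parameter(torch.randn(1, tokens.shape[1], dim))
encoder_input = tokens + pos_embedding       # this sequence goes into the transformer encoder
print(encoder_input.shape)                   # torch.Size([1, 196, 768])
```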

    • Architecture of CLIP/GDINO

      • Components
        • Text backbone: Eg. BERT/ OpenAI Embeddings
        • Image backbone: SWIN/ViT etc.
      • Steps
        1. Feature encoder: fuse text and image features into one using a self-attention mechanism
        2. Enhance encoding: combine the text and image features further
        3. Language-guided query selection: pick the image features most relevant to the text to initialize the decoder queries
        4. Cross-modality decoder: cross-attention between image and text features to refine the box/label predictions
        5. Calculate loss (see the loss sketch below)
          • Contrastive loss: compare text and visual features
          • DETR-style loss: bounding box regression, detect the object of interest
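      A minimal sketch of the contrastive-loss part (CLIP-style: match each image with its own text within a batch; dimensions and the temperature value are illustrative):

```python
# Minimal CLIP-style contrastive loss sketch (illustrative, plain PyTorch).
import torch
import torch.nn.functional as F

batch = 8
image_features = F.normalize(torch.randn(batch, 512), dim=-1)  # from the image backbone
text_features = F.normalize(torch.randn(batch, 512), dim=-1)   # from the text backbone

temperature = 0.07
# Inner products between every image and every text in the batch.
logits = image_features @ text_features.t() / temperature      # (batch, batch)

# The matching pairs sit on the diagonal, so the "class" of row i is i.
targets = torch.arange(batch)
loss = (F.cross_entropy(logits, targets) +          # image -> text direction
        F.cross_entropy(logits.t(), targets)) / 2   # text -> image direction
```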
      • More stuff
    • Similarity search

      • With such models we can look for similarity between the text and the image embedding
        • Because the inner product is a similarity metric
      • Process
        • Create a bunch of prompts out of your classes (e.g. “A photo of a Dog”) and run them through the text encoder to get an embedding.
        • Create the image embedding with the image encoder
        • Multiply it with each text embedding to get the similarity of the texts and the images.
        • The one with the highest similarity is your predicted class.
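      A minimal sketch of this zero-shot flow with the Hugging Face transformers CLIP classes (the model name, labels, and image path are just examples):

```python
# Minimal zero-shot classification sketch with CLIP via Hugging Face transformers.
# The checkpoint name and labels are examples; any CLIP checkpoint works the same way.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
image = Image.open("pet.jpg")

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)   # similarities -> probabilities
print(labels[probs.argmax().item()])              # highest similarity = predicted class
```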
  • VLM (Vision Language Model)

    • Architecture of VLM

      • Components
        • visual encoder
          • VLMs need a visual encoder such as CLIP (a ViT); it can be used by
          • plugging the ViT into task-specific fine-tuning
          • combining the ViT with V&L pre-training and transferring to downstream tasks
          • So the visual understanding of a VLM will only be as good as its ViT.
          • In other words, when you attach a “decoder only LLM” to a vision encoder such as CLIP, you get a VLM.
        • LLM
          • The VLM also depends on an LLM to generate text from the combined inputs.
      • Example
        • eg. moondream depends on Phi1.5(LLM) and SigLIP(visual encoder like CLIP) etc.
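      A toy sketch of that wiring, not any real model's code, just the shape of the idea: encode the image, project it into the LLM's embedding space, and prepend it to the text tokens (every module below is a stand-in):

```python
# Toy VLM wiring sketch: vision encoder -> projector -> decoder-only LLM.
# Every module here is a stand-in; real VLMs plug in CLIP/SigLIP and Phi/Llama etc.
import torch
import torch.nn as nn

vision_dim, llm_dim, vocab_size = 768, 2048, 32000

projector = nn.Linear(vision_dim, llm_dim)           # bridges the two embedding spaces
token_embedding = nn.Embedding(vocab_size, llm_dim)  # stand-in for the LLM's token embeddings
llm_decoder = nn.TransformerEncoder(                 # stand-in for a decoder-only LLM stack
    nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True), num_layers=2)

image_features = torch.randn(1, 196, vision_dim)     # patch features from the vision encoder
text_ids = torch.randint(0, vocab_size, (1, 12))     # the user's prompt, tokenized

# Project image features into the LLM embedding space and prepend them to the prompt.
image_tokens = projector(image_features)             # (1, 196, llm_dim)
text_tokens = token_embedding(text_ids)              # (1, 12, llm_dim)
sequence = torch.cat([image_tokens, text_tokens], dim=1)
hidden = llm_decoder(sequence)                       # the LLM now "sees" the image as tokens
print(hidden.shape)                                  # torch.Size([1, 208, 2048])
```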
    • Fine-tuning VLMs

      • E.g. moondream does not support OCR out of the box, but we can fine-tune it to do so. Results also depend on the LLM used. E.g. from a reddit comment: “However would not recommend moondream for ocr due to it’s PHI tokenizer not splitting digits into separate tokens. It struggles to pick up sequential digits reliably. For example: 222 is likely to become 22 due to this issue.”
      • https://github.com/BradyFU/Woodpecker (hallucination correction for VLM outputs)

Combining Transformer based + CNN based

This is only useful if you need very fast or low-compute inference. Otherwise, for most cases, GDINO/CLIP etc. are the go-to.

  • CNN-based inference is faster than transformer-based, so something like YOLO is still preferable for realtime use.
  • But we can use GDINO to generate labels for our training dataset and then use those to train our YOLO model, which will be fast.
    • Essentially, use transformer-based detection for labeling & training the CNN model
    • Use the CNN model to do fast inference in production
  • Basically, use foundation models to train fine-tuned models: the foundation model acts as an automatic labeling tool that you use to build your dataset.
  • https://github.com/autodistill/autodistill allows you to do exactly this (see the sketch below).
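  A sketch of that flow with autodistill. The package/class names follow its README from memory and may have drifted, and the ontology, folder names, and checkpoint are all examples; double-check against the current docs:

```python
# Autodistill-style flow sketch: a foundation model auto-labels images,
# then a small CNN detector is trained on the result for fast inference.
# API details follow the autodistill README from memory; verify before use.
from autodistill.detection import CaptionOntology
from autodistill_grounding_dino import GroundingDINO
from autodistill_yolov8 import YOLOv8

# 1. Use a big zero-shot model (GDINO) to auto-label raw images with our classes.
#    The ontology maps a text prompt to the class name you want in the dataset.
base_model = GroundingDINO(ontology=CaptionOntology({"forklift": "forklift", "person": "person"}))
base_model.label(input_folder="./images", output_folder="./dataset")

# 2. Train a fast CNN detector (YOLOv8) on the auto-labeled dataset for production.
target_model = YOLOv8("yolov8n.pt")
target_model.train("./dataset/data.yaml", epochs=50)
```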

Segmentation

Name | Description
--- | ---
SEEM | Segment Everything Everywhere Model
SAM | Segment Anything Model (Meta)
  • To improve segmentation we can tune the params, or we can use some kind of object detection (e.g. YOLO) to draw bounding boxes before applying segmentation. See this thread for more info.
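  A sketch of that detect-then-segment idea with Meta's segment-anything package. The checkpoint path and the box coordinates are placeholders; in practice the box would come from the detector's output:

```python
# Detect-then-segment sketch: feed a detector's bounding box to SAM as a prompt.
# Checkpoint path and box coordinates below are placeholders.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

image = cv2.cvtColor(cv2.imread("scene.jpg"), cv2.COLOR_BGR2RGB)

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)

# Box from an object detector (e.g. YOLO), in xyxy pixel coordinates.
box = np.array([100, 150, 420, 380])
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
print(masks.shape)   # (1, H, W) boolean mask for the boxed object
```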

Visual Document Understanding (VDU)

  • OCR
    • 2-stage pipeline: usually when trying to understand a document, we’d do OCR and then run the result through another process for the understanding.
    • Issue: the OCR result often isn’t what you want, e.g. no spatial/layout understanding (text from different lines gets mixed together). Using an OCR-free approach might help.
    • See OCR
  • OCR-free
    • 1-stage pipeline: OCR and understanding in one
    • E.g. Donut (Document Understanding Transformer), LayoutLM (receipt understanding, though LayoutLM still consumes OCR output)
      • Some of the VLMs can do this as well.
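    A sketch of OCR-free document QA with Donut via Hugging Face transformers. The model name and task-prompt format follow the DocVQA checkpoint's documented usage; treat the details as approximate:

```python
# OCR-free document QA sketch with Donut (VisionEncoderDecoderModel) via transformers.
# Model name and prompt format follow the DocVQA checkpoint's documented usage.
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")

image = Image.open("invoice.png").convert("RGB")
question = "What is the total amount?"
task_prompt = f"<s_docvqa><s_question>{question}</s_question><s_answer>"

pixel_values = processor(image, return_tensors="pt").pixel_values
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt").input_ids

outputs = model.generate(pixel_values, decoder_input_ids=decoder_input_ids, max_length=512)
answer = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(answer)   # no separate OCR step: the model reads the pixels directly
```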

Meta

Datasets

  • ImageNet
  • COCO