tags : Open Source LLMs, NLP (Natural Language Processing), Machine Learning, Modern AI Stack, Information Retrieval

Embeddings are closely related to what we discuss under “Autoencoders” in NLP (Natural Language Processing)

FAQ

Can we reverse an embedding?

  • Only approximately: reversing an embedding back to its original input is probabilistic, not exact.

What about RAG?

See RAG

Does having embeddings also mean we’ll be able to do meaningful similarity comparisons?

NO.

  • Mathematically, yes: you can always calculate a distance or similarity between two vectors (e.g., Euclidean distance, cosine similarity, Manhattan distance). The math will always work.
  • But does that calculated distance/similarity mean anything useful or reliable in the context of the problem you’re trying to solve?
  • Only if the training process shaped the embedding space so that closeness corresponds to semantic similarity does the number carry meaning (see the sketch below).
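
A minimal sketch of the “the math always works” point: cosine similarity is just a formula over two vectors, whether or not the space was trained to make the result meaningful (pure numpy; the vector values are only illustrative).

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity = dot(a, b) / (|a| * |b|); always computable for non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two arbitrary vectors: the formula returns a number either way.
# Whether that number reflects semantic similarity depends entirely on
# how the embedding space was trained.
v1 = np.array([0.1, 0.8, -0.3])
v2 = np.array([0.2, 0.7, -0.1])
print(cosine_similarity(v1, v2))
```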

There are different kinds of embeddings, e.g.:

  • OpenAI’s embedding models (e.g., text-embedding-3) are separate from the internal token embeddings used in LLMs like GPT-4/O3.
  • Embedding models are optimized for semantic tasks like search and similarity, while LLM embeddings serve as internal input processing.
  • They may share architectural ideas but are trained and used for different purposes.
  • This is the same reason we can’t do real similarity search with LLaVA and instead need something like CLIP.

More Clarity

  • Token embeddings and the embedding vectors output by embedding models are separate concepts (a short sketch contrasting the two follows this list).
  • Token embeddings
    • numerous token embeddings (one per token), which become contextualized as they propagate through the transformer
  • Embedding vectors
    • a single vector/embedding that is output by an embedding model
    • one per input item, such as a long text, a photo, or a document screenshot
    • There are also embedding models that output multiple vectors depending on the use case (e.g., BGE M3).
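
A rough sketch of that distinction, assuming Hugging Face transformers and sentence-transformers are installed; the model names (bert-base-uncased, BAAI/bge-small-en-v1.5) are just illustrative choices:

```python
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer

text = "Embeddings map inputs to vectors."

# Token embeddings inside an encoder/LLM: one (contextualized) vector per token.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer(text, return_tensors="pt")
token_vectors = encoder(**inputs).last_hidden_state   # shape: (1, num_tokens, hidden_dim)
print(token_vectors.shape)

# Embedding model: one vector for the whole input.
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")
sentence_vector = embedder.encode(text)               # shape: (embedding_dim,)
print(sentence_vector.shape)
```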

Types of embedding

By Modality

Text Embedding

Finding the Best Open-Source Embedding Model for RAG | Timescale

Image Embedding

MultiModal Embedding

  • Multimodal embeddings map image and text representations into a common space (see the sketch below).
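
A hedged sketch of the shared-space idea using CLIP via Hugging Face transformers; the model name and image path are illustrative, and the point is only that text and image features land in the same vector space and can be compared directly:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")                      # illustrative path
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
text_features = model.get_text_features(
    input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
)
image_features = model.get_image_features(pixel_values=inputs["pixel_values"])

# Both live in the same space, so cosine similarity across modalities is meaningful.
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
print(image_features @ text_features.T)            # similarity of the image to each caption
```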

By Purpose

Comparison

LLM Generation

numerous token embeddings (one per token) which become contextualized as they propagate through the transformer

By Language Support

Multilingual

Multilingual Embedding leaderboard: MTEB Leaderboard - a Hugging Face Space by mteb

By Architecture

SAE

Multi-Vector

(ColBERT, ModernBERT, BGE M3)
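
A minimal numpy sketch of the ColBERT-style late-interaction (“MaxSim”) scoring that multi-vector retrieval uses; the toy matrices stand in for per-token query and document embeddings:

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token vector, take the max
    similarity over all document token vectors, then sum over query tokens."""
    # Normalize so the dot product equals cosine similarity.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T                     # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

# Toy example: 3 query token vectors, 5 document token vectors, dimension 4.
rng = np.random.default_rng(0)
query_vecs = rng.normal(size=(3, 4))
doc_vecs = rng.normal(size=(5, 4))
print(maxsim_score(query_vecs, doc_vecs))
```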

Deployment

Storing embeddings

VectorDBs / Vector Stores

“Binding generated embeddings to source data, so the vectors automatically update when the data changes is exactly how things should be.”

| Vector Store | Type | Search Algorithm | Performance Characteristics |
| --- | --- | --- | --- |
| vectorlite | SQLite extension | HNSW (approximate nearest neighbors) | Faster for large vector sets at the cost of some accuracy |
| sqlite-vec | SQLite extension | Brute force | More accurate but slower with large vector sets |
| usearch | SQLite extension | Brute force | Similar to sqlite-vec; only exposes vector distance functions |
| Qdrant | Standalone vector DB | Not specified | Works well but “heavier” for many applications |
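
A hedged sketch of the sqlite-vec row above in practice, assuming the sqlite-vec Python package is installed; the table/column names and dimensions are illustrative, and the exact KNN query syntax may differ between versions:

```python
import sqlite3
import sqlite_vec  # pip install sqlite-vec (assumed available)

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vec.load(db)
db.enable_load_extension(False)

# Virtual table holding 4-dimensional float vectors (dimension is illustrative).
db.execute("CREATE VIRTUAL TABLE vec_items USING vec0(embedding float[4])")

items = {1: [0.1, 0.2, 0.3, 0.4], 2: [0.9, 0.8, 0.7, 0.6]}
for rowid, vec in items.items():
    db.execute(
        "INSERT INTO vec_items(rowid, embedding) VALUES (?, ?)",
        (rowid, sqlite_vec.serialize_float32(vec)),
    )

# Brute-force nearest-neighbour query against a query vector.
query = sqlite_vec.serialize_float32([0.1, 0.2, 0.3, 0.35])
rows = db.execute(
    "SELECT rowid, distance FROM vec_items WHERE embedding MATCH ? ORDER BY distance LIMIT 2",
    (query,),
).fetchall()
print(rows)
```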
  • sqlite-vec

    The problem with Parquet is it’s static. Not good for use cases that involve continuous writes and updates. Although I have had good results with DuckDB and Parquet files in object storage. Fast load times.

    If you host your own embedding model, then you can transmit numpy float32 compressed arrays as bytes, then decode back into numpy arrays.

    Personally I prefer using SQLite with usearch extension. Binary vectors then rerank top 100 with float32. It’s about 2 ms for ~20k items, which beats LanceDB in my tests. Maybe Lance wins on bigger collections. But for my use case it works great, as each user has their own dedicated SQLite file.
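
    The comment above mentions transmitting numpy float32 arrays as raw bytes; a minimal round-trip sketch (the dtype and shape handling is the part that is easy to get wrong):

```python
import numpy as np

embedding = np.random.rand(384).astype(np.float32)  # 384 dimensions is illustrative

# Encode: raw float32 bytes (optionally compress further, e.g. with zlib).
payload = embedding.tobytes()

# Decode: the receiver must know the dtype (and shape, if multi-dimensional).
restored = np.frombuffer(payload, dtype=np.float32)

assert np.array_equal(embedding, restored)
```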

Serving embeddings

See Deploying ML applications (applied ML)

Providers

Self-hosting

Additionally, there’s also: https://github.com/michaelfeil/infinity

| Feature/Aspect | Text Embeddings Inference | Ollama | vLLM |
| --- | --- | --- | --- |
| Primary Use Case | Production embedding serving | Local development & testing | LLM inference with embedding support |
| Implementation | Rust | Go | Python |
| Setup Complexity | Low | Very Low | High |
| Resource Usage | Minimal | Moderate | High |
| GPU Support | Yes | Yes | Yes (optimized) |
| CPU Support | Yes | Yes | Limited |
| Model Types | Embedding only | Both LLM and embeddings | Both LLM and embeddings |
| Production Ready | Yes | Limited | Yes |
| Deployment Type | Microservice | Local/container | Distributed service |
| Customization | Limited | High | High |
| Throughput | Very high (embeddings) | Moderate | High (both) |
| Community Support | Growing | Active | Very active |
| Architecture Support | x86, ARM | x86, ARM | Primarily x86 |
| Container Support | Yes | Yes | Yes |
| Monitoring/Metrics | Basic | Basic | Extensive |
| Hot-reload Support | No | Yes | No |
| Memory Efficiency | High | Moderate | Varies (KV-cache focused) |
| Documentation Quality | Good | Excellent | Excellent |
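
A hedged sketch of calling a running Text Embeddings Inference server over HTTP; the /embed route and request shape follow TEI’s documented API, but the port and model choice here are assumptions of this example:

```python
import requests

# Assumes TEI is already running locally, e.g.:
#   docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-latest \
#       --model-id BAAI/bge-small-en-v1.5
resp = requests.post(
    "http://localhost:8080/embed",
    json={"inputs": ["What are embeddings?", "How do I store vectors?"]},
    timeout=30,
)
resp.raise_for_status()
vectors = resp.json()   # list of embedding vectors, one per input string
print(len(vectors), len(vectors[0]))
```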

Learning resources