tags : Open Source LLMs, NLP (Natural Language Processing), Machine Learning, Modern AI Stack, Information Retrieval
Embeddings are closely related to “Autoencoders”; see NLP (Natural Language Processing).
- Multilingual Embedding leaderboard: MTEB Leaderboard - a Hugging Face Space by mteb
What
- Reversing embeddings is probabilistic: inversion yields a likely reconstruction of the source text, with no guarantee of an exact match.
RAG
See RAG
VectorDBs / Vector Stores
“Binding generated embeddings to source data, so the vectors automatically update when the data changes is exactly how things should be.”
Vector Store | Type | Search Algorithm | Performance Characteristics |
---|---|---|---|
vectorlite | SQLite extension | HNSW (Approximate Nearest Neighbors) | Faster for large vector sets at the cost of some accuracy |
sqlite-vec | SQLite extension | Brute force | More accurate but slower with large vector sets |
usearch | SQLite extension | Brute force | Similar to sqlite-vec, only exposes vector distance functions |
Qdrant | Standalone vector DB | Not specified | Works well but “heavier” for many applications |
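“Brute force” in the table means scoring the query against every stored vector, which is exact but scales linearly with collection size. A minimal numpy sketch of that idea follows; `brute_force_top_k` is a hypothetical helper, cosine similarity is an assumed metric, and this is not the API of sqlite-vec or usearch.

```python
import numpy as np

def brute_force_top_k(query, corpus, k=5):
    """Exact nearest neighbours by cosine similarity (linear scan).

    Assumes no row of `corpus` and no `query` is the zero vector.
    """
    corpus = np.asarray(corpus, dtype=np.float32)
    query = np.asarray(query, dtype=np.float32)
    # Normalise so that a dot product equals cosine similarity.
    corpus_unit = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    sims = corpus_unit @ query_unit
    # Sort every score; fine for small collections, costly for large ones.
    top = np.argsort(-sims)[:k]
    return top, sims[top]

if __name__ == "__main__":
    corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    indices, scores = brute_force_top_k([1.0, 0.1], corpus, k=2)
    print(indices, scores)
```

An ANN index such as HNSW avoids the full scan by visiting only a subset of vectors, which is where the accuracy trade-off in the table comes from.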
sqlite-vec
The problem with Parquet is that it’s static, so it’s a poor fit for use cases that involve continuous writes and updates. That said, I have had good results with DuckDB and Parquet files in object storage: load times are fast.
If you host your own embedding model, then you can transmit compressed numpy float32 arrays as bytes, then decode them back into numpy arrays.
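One way to do this, as a sketch: serialize with `np.save` (which records dtype and shape) and compress with `zlib`. The helper names `encode_vectors` and `decode_vectors` are hypothetical, and the choice of `zlib` is an assumption; any byte-level compressor would fit the same pattern.

```python
import io
import zlib

import numpy as np

def encode_vectors(vectors: np.ndarray) -> bytes:
    """Serialize a float32 array (dtype and shape included), then compress."""
    buf = io.BytesIO()
    np.save(buf, np.asarray(vectors, dtype=np.float32), allow_pickle=False)
    return zlib.compress(buf.getvalue())

def decode_vectors(payload: bytes) -> np.ndarray:
    """Reverse of encode_vectors: decompress, then load the array."""
    return np.load(io.BytesIO(zlib.decompress(payload)), allow_pickle=False)

if __name__ == "__main__":
    vecs = np.random.default_rng(0).normal(size=(4, 8)).astype(np.float32)
    payload = encode_vectors(vecs)
    restored = decode_vectors(payload)
    print(len(payload), restored.shape, restored.dtype)
```

Float embeddings tend to compress poorly, so it is worth measuring whether the compression step pays for itself on your data.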
Personally I prefer using SQLite with the usearch extension: search over binary vectors, then rerank the top 100 with float32. It’s about 2 ms for ~20k items, which beats LanceDB in my tests. Maybe Lance wins on bigger collections. But for my use case it works great, as each user has their own dedicated SQLite file.
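The binary-then-rerank approach from that comment can be sketched in plain numpy. This illustrates the technique only and is not the usearch extension’s API: `binarize`, `hamming_distances`, and `search` are hypothetical helpers, and sign-based binarization plus cosine reranking are assumptions about how the vectors are produced and scored.

```python
import numpy as np

def binarize(vectors):
    """Keep one bit per dimension (1 if the value is positive), 8 dims per byte."""
    return np.packbits(np.asarray(vectors) > 0, axis=-1)

def hamming_distances(query_bits, corpus_bits):
    """Count differing bits between the query and every corpus row."""
    xor = np.bitwise_xor(corpus_bits, query_bits)
    return np.unpackbits(xor, axis=-1).sum(axis=-1)

def search(query, corpus, corpus_bits, k=10, n_candidates=100):
    """Coarse pass on binary vectors, then exact float32 rerank of the shortlist."""
    q_bits = binarize(query[None, :])[0]
    dists = hamming_distances(q_bits, corpus_bits)
    n_candidates = min(n_candidates, len(corpus))
    # Indices of the n_candidates smallest Hamming distances (unordered).
    candidates = np.argpartition(dists, n_candidates - 1)[:n_candidates]
    # Rerank the shortlist with cosine similarity on the original float32 vectors.
    cand = corpus[candidates]
    sims = cand @ query / (np.linalg.norm(cand, axis=1) * np.linalg.norm(query))
    order = np.argsort(-sims)[:k]
    return candidates[order]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    corpus = rng.normal(size=(20_000, 256)).astype(np.float32)
    corpus_bits = binarize(corpus)  # 32 bytes per vector instead of 1024
    print(search(corpus[5], corpus, corpus_bits, k=3))
```

The binary pass touches 32x less data than a float32 scan, and only the shortlist needs the full-precision vectors.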
- sqlite-vec v0.1.0 Launch Party Recording! - YouTube
- https://news.ycombinator.com/item?id=40244090: initial v0.1.0 release will only have linear scans, but I want to support ANN indexes like IVF/HNSW in the future!
Polars
Vector Tile
PostgreSQL (pgvector)
DuckDB
Learning resources
- https://github.com/erikbern/ann-benchmarks
- https://huggingface.co/hkunlp/instructor-xl (embeddings) 🌟
- Don’t use cosine similarity carelessly | Hacker News
- The secret ingredients of word2vec (2016) | Hacker News
- Evaluating Similarity Methods: Speed vs. Precision 🌟
- Nomic Blog: Data Maps, Part 2: Embeddings Are For So Much More Than RAG
- Understanding pgvector’s HNSW Index Storage in Postgres | Lantern Blog
- sqlite-vec
- How does cosine similarity work? | Hacker News
- https://simonwillison.net/2023/Oct/23/embeddings/
- Binary vector embeddings are so cool | Lobsters
- Embeddings are underrated | Lobsters
- Embeddings are underrated | Hacker News
- Bengaluru System Meetup: Understanding sqlite-vec - YouTube
- https://pamacha.observablehq.cloud/spherical-umap/?s=35
- Exploring Hacker News by mapping and analyzing 40 million posts and comments for fun | Wilson Lin
- Embedding + CRDT (https://x.com/JungleSilicon/status/1867603691005706515)
- Hacker News Data Map [180MB] | Hacker News
Selfhosting Embeddings
See Deploying ML applications (applied ML)
There’s also: https://github.com/michaelfeil/infinity
Feature/Aspect | Text Embeddings Inference | Ollama | vLLM |
---|---|---|---|
Primary Use Case | Production embedding serving | Local development & testing | LLM inference with embedding support |
Implementation | Rust | Go | Python |
Setup Complexity | Low | Very Low | High |
Resource Usage | Minimal | Moderate | High |
GPU Support | Yes | Yes | Yes (Optimized) |
CPU Support | Yes | Yes | Limited |
Model Types | Embedding only | Both LLM and Embeddings | Both LLM and Embeddings |
Production Ready | Yes | Limited | Yes |
Deployment Type | Microservice | Local/Container | Distributed Service |
Customization | Limited | High | High |
Throughput | Very High (embeddings) | Moderate | High (both) |
Community Support | Growing | Active | Very Active |
Architecture Support | x86, ARM | x86, ARM | Primarily x86 |
Container Support | Yes | Yes | Yes |
Monitoring/Metrics | Basic | Basic | Extensive |
Hot-reload Support | No | Yes | No |
Memory Efficiency | High | Moderate | Varies (KV-cache focused) |
Documentation Quality | Good | Excellent | Excellent |
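Whichever server is chosen, the client side can stay small. The sketch below assumes the server exposes an OpenAI-style `/v1/embeddings` route, which is worth verifying for the version you deploy. The base URL and model name are placeholders, and `build_embedding_request`, `parse_embedding_response`, and `embed` are hypothetical helpers built on the standard library.

```python
import json
import urllib.request

def build_embedding_request(base_url, model, texts):
    """Build a POST request in the OpenAI embeddings format."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def parse_embedding_response(payload):
    """Return embeddings ordered by the `index` field of each item."""
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]

def embed(base_url, model, texts, timeout=30.0):
    """Send texts to the server and return one embedding per text."""
    request = build_embedding_request(base_url, model, texts)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_embedding_response(json.load(response))

if __name__ == "__main__":
    # Placeholder address and model name; replace with your own deployment.
    vectors = embed("http://localhost:8080", "my-embedding-model", ["hello world"])
    print(len(vectors), len(vectors[0]))
```

Keeping the request builder and response parser separate from the network call makes it easy to swap servers without touching the rest of the pipeline.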
Examples
- https://blog.brunk.io/posts/similarity-search-with-duckdb/
- https://simonwillison.net/2024/May/10/exploring-hacker-news-by-mapping-and-analyzing-40-million-posts/
- https://modal.com/blog/embedding-wikipedia
- https://modal.com/blog/fine-tuning-embeddings
- https://modal.com/docs/examples/text_embeddings_inference
- https://docs.vllm.ai/en/latest/getting_started/examples/openai_embedding_client.html