tags : Information Theory, NLP (Natural Language Processing), Machine Learning
See GitHub - kuutsav/information-retrieval: Neural information retrieval / semantic-search / Bi-Encoders
Theory
Forward Index
- E.g. Doc123 => “the”, “apple”, “in”, “tree”
- Keywords can be marked as more or less relevant, etc.
- Problems: capitalization, phrases, alternate spellings, other normalization issues
- An inverted index can be generated from it (sketched below)
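A minimal sketch of both structures (toy doc ids and terms, Python just for illustration):

```python
# Forward index: doc id -> list of terms (toy data, already lowercased/tokenized).
forward_index = {
    "Doc123": ["the", "apple", "in", "tree"],
    "Doc456": ["the", "tree", "fell"],
}

# Invert it: term -> set of doc ids that contain the term.
inverted_index = {}
for doc_id, terms in forward_index.items():
    for term in terms:
        inverted_index.setdefault(term, set()).add(doc_id)

print(inverted_index["tree"])  # e.g. {'Doc123', 'Doc456'}
```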
Ranking
TF-IDF (Term Freq. and Inverse Doc. Freq)
- Not used much on its own these days, but it's the OG ranking function
Formula
- Term Freq: How often the word appears in the doc
- Doc Freq: How often the word occurs across the whole set of documents (tells us that “is” is pretty common)
- Relevancy = Term Freq * (1 / Doc Freq) = Term Freq * Inverse Doc Freq = TF-IDF
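A toy sketch of the formula above (made-up corpus; uses the plain 1/DF form from the note, though real systems usually log-scale the IDF):

```python
from collections import Counter

# Toy corpus (made-up docs) to illustrate the Relevancy formula above.
docs = {
    "Doc1": ["the", "apple", "is", "in", "the", "tree"],
    "Doc2": ["the", "tree", "is", "green"],
    "Doc3": ["the", "sky", "is", "blue"],
}

# Document frequency: in how many docs each term appears.
df = Counter()
for terms in docs.values():
    df.update(set(terms))

def relevancy(term, doc_id):
    tf = docs[doc_id].count(term)  # Term Freq: occurrences in this doc
    idf = 1 / df[term]             # Inverse Doc Freq (plain form; often log(N / df) in practice)
    return tf * idf

print(relevancy("the", "Doc1"))    # 2 * 1/3 ~= 0.67 -- common word, discounted
print(relevancy("apple", "Doc1"))  # 1 * 1/1 = 1.0   -- rare word, boosted
```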
PageRank
- Also not used much on its own anymore; it plays a similar ranking role to TF-IDF, but the formula is built from backlinks and a damping factor (rough sketch below)
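A minimal power-iteration sketch on a made-up link graph, showing the backlinks and damping factor:

```python
# Toy link graph: page -> pages it links out to (names are made up).
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def pagerank(links, d=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # Sum contributions from every page that backlinks to p,
            # each one split evenly across its outgoing links.
            incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            # Damping factor d models a surfer randomly jumping to any page.
            new_rank[p] = (1 - d) / n + d * incoming
        rank = new_rank
    return rank

print(pagerank(links))  # 'C' ranks highest: it receives the most backlink weight
```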
Search Engines
You can decompose a “search engine” into a few big components: Gather, Search Index, Ranking, Query Engine
Gather
- Crawling/fetching the web content that gets indexed
Search Index
- Database cache of web content
- aka building the “search index”
- See Database, NLP (Natural Language Processing)
Ranking
- Algorithm for scoring/weighting/ranking pages
Query engine
- Translating user input into a query and returning the most “relevant” pages (rough sketch after this list)
- See Query Engines
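A hand-wavy end-to-end sketch of the Search Index + Ranking + Query Engine pieces (toy data, made-up names, TF-IDF-ish scoring as above):

```python
# Toy "search index": doc id -> terms, plus the inverted index built from it.
docs = {
    "Doc1": ["the", "apple", "tree"],
    "Doc2": ["the", "green", "tree"],
}
inverted_index = {}
for doc_id, terms in docs.items():
    for term in terms:
        inverted_index.setdefault(term, set()).add(doc_id)

def search(query):
    # Query engine: tokenize the user input the same way the docs were tokenized.
    terms = query.lower().split()
    # Gather candidate docs from the inverted index.
    candidates = set()
    for term in terms:
        candidates |= inverted_index.get(term, set())
    # Ranking: sum simple TF * (1 / DF) weights per query term.
    df = {t: len(inverted_index.get(t, set())) for t in terms}
    def score(doc_id):
        return sum(docs[doc_id].count(t) / df[t] for t in terms if df[t])
    return sorted(candidates, key=score, reverse=True)

print(search("apple tree"))  # ['Doc1', 'Doc2'] -- Doc1 wins on the rarer term "apple"
```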
Semantic Search Resources
- Spotify-Inspired: Elevating Meilisearch with Hybrid Search and Rust
- https://github.com/josephrmartinez/recipe-dataset/blob/main/tutorial.md
TODO Full Text Search (FTS)
- Think Wordebut issue has more links
- Structure of FTS5 Index in SQLite | Lobsters
- Why Full Text Search is Hard
- Full text search over Postgres: Elasticsearch vs. alternatives | Hacker News
- Meilisearch is too slow
- Postgres as a search engine