tags : Information Theory, NLP (Natural Language Processing), Machine Learning, Embeddings

- See GitHub - kuutsav/information-retrieval: Neural information retrieval / semantic-search / Bi-Encoders
Theory
General idea
- IR is about scale: you have a huge amount of data and you should be able to get the relevant things back out of it.
- Traditional methods are still around, but more recently people have also used BERT, embeddings, and even LLMs for IR tasks.
- The use of IR can be task specific:
- For search, you might just want a limited retrieval where you return the top n ranked responses
- For something like fact checking, you’d want it to retrieve everything, run the analysis, and come back with the result
Forward Index
- Eg. Doc123 => “the”,“apple”,“in”,“tree”
- Keywords can be marked as more or less relevant, etc.
- Problems: capitalization, phrases, alternate spellings, other stuff.
- An inverted index can be generated out of it (see the sketch below)
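A minimal sketch of turning a forward index like the Doc123 example into an inverted index (my own toy illustration; the doc IDs and tokens are made up):

```python
from collections import defaultdict

# Forward index: doc id -> list of terms, as in the Doc123 example above.
forward_index = {
    "Doc123": ["the", "apple", "in", "tree"],
    "Doc456": ["apple", "pie", "recipe"],
}

def build_inverted_index(forward):
    """Flip doc -> terms into term -> set of docs containing that term."""
    inverted = defaultdict(set)
    for doc_id, terms in forward.items():
        for term in terms:
            inverted[term.lower()].add(doc_id)  # lowercasing sidesteps the capitalization problem
    return inverted

inverted_index = build_inverted_index(forward_index)
print(inverted_index["apple"])  # {'Doc123', 'Doc456'}
```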
Ranking
TF-IDF (Term Freq. and Inverse Doc. Freq)
- Not used too much on its own these days, but it’s the OG
Formula
- Term Freq: How often the word appears in the doc
- Doc Freq: How often the word occurs across ALL the documents (tells us that “is” is pretty common)
- Relevancy = Term Freq × (1 / Doc Freq)
- i.e. Term Freq × Inverse Doc Freq
- i.e. TF-IDF
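A toy sketch of the formula above (my own illustration, not from the source); it uses the common log-smoothed variant of Inverse Doc Freq, log(N / Doc Freq), rather than a bare 1 / Doc Freq:

```python
import math
from collections import Counter

docs = {
    "Doc123": ["the", "apple", "in", "the", "tree"],
    "Doc456": ["the", "apple", "pie", "recipe"],
    "Doc789": ["the", "pie", "is", "in", "the", "oven"],
}

# Doc Freq: number of documents each term appears in.
doc_freq = Counter()
for terms in docs.values():
    doc_freq.update(set(terms))

def tf_idf(term, doc_id):
    """Relevancy = Term Freq * Inverse Doc Freq."""
    tf = docs[doc_id].count(term)               # how often the word appears in this doc
    idf = math.log(len(docs) / doc_freq[term])  # common words like "the" go to ~0
    return tf * idf

print(tf_idf("apple", "Doc123"))  # > 0: "apple" is somewhat distinctive
print(tf_idf("the", "Doc123"))    # 0.0: "the" appears in every doc
```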
Page Rank
- Again not used a lot anymore. It plays a similar role to TF-IDF in ranking, but the score is built from backlinks: a page’s rank comes from the ranks of the pages linking to it, combined with a damping factor (a sketch follows).
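A minimal power-iteration sketch of the PageRank idea (my own illustration; the link graph and numbers are made up):

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: page -> list of pages it links out to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # Rank flowing into p from every page q that links to it (p's backlinks).
            incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            new_rank[p] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(graph))  # "c" scores highest: it has the most backlinks
```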
Recommendation Engines
Search Engines
You can decompose a “search engine” into multiple big components: Gather, Search Index, Ranking, Query (a skeletal sketch follows the breakdown below)
Gather
- Crawling / collecting the raw content that will be indexed
Search Index
- Database cache of web content
- aka building the “search index”
- See Database, NLP (Natural Language Processing)
Ranking
- Algorithm of scoring/weighing/ranking of pages
Query engine
- Translates user input into a query and returns the most “relevant” pages
- See Query Engines
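A skeletal sketch (purely illustrative, every name here is hypothetical) of how those four components might hang together:

```python
class SearchEngine:
    """Toy decomposition: gather -> search index -> ranking -> query."""

    def __init__(self):
        self.documents = {}        # doc_id -> raw text (the "database cache of web content")
        self.inverted_index = {}   # term -> set of doc_ids

    def gather(self, doc_id, text):
        # Gather: collect and cache the raw content.
        self.documents[doc_id] = text

    def build_index(self):
        # Search index: tokenize the cache into an inverted index.
        self.inverted_index = {}
        for doc_id, text in self.documents.items():
            for term in text.lower().split():
                self.inverted_index.setdefault(term, set()).add(doc_id)

    def rank(self, doc_ids, terms):
        # Ranking: score by how many query terms each doc contains (stand-in for TF-IDF etc.).
        score = lambda d: sum(t in self.documents[d].lower().split() for t in terms)
        return sorted(doc_ids, key=score, reverse=True)

    def query(self, user_input):
        # Query engine: translate user input into index lookups, then rank the hits.
        terms = user_input.lower().split()
        hits = set().union(*(self.inverted_index.get(t, set()) for t in terms)) if terms else set()
        return self.rank(hits, terms)

engine = SearchEngine()
engine.gather("Doc123", "the apple in the tree")
engine.gather("Doc456", "apple pie recipe")
engine.build_index()
print(engine.query("apple tree"))  # ['Doc123', 'Doc456']
```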
Semantic Search Resources
- See Embeddings
- Spotify-Inspired: Elevating Meilisearch with Hybrid Search and Rust
- https://github.com/josephrmartinez/recipe-dataset/blob/main/tutorial.md
TODO Full Text Search (FTS)
Postgres
- You would have to build two different indexes, one with pg_trgm (trigram matching) and one with tsvector (full-text search). Alternatively you could use the partial-match feature of FTS, with :*, but that doesn’t play well with stemming or with synonyms (sketch below).
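A hedged sketch of that two-index setup, assuming a hypothetical docs(id, body) table and psycopg2; the schema, connection string, and similarity threshold are all made up:

```python
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
cur = conn.cursor()

# Index 1: trigram index via pg_trgm, for fuzzy / substring-style matching.
cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm")
cur.execute("CREATE INDEX IF NOT EXISTS docs_body_trgm ON docs USING gin (body gin_trgm_ops)")

# Index 2: full-text (tsvector) index, which is what gives you stemming.
cur.execute("CREATE INDEX IF NOT EXISTS docs_body_fts ON docs USING gin (to_tsvector('english', body))")
conn.commit()

# FTS query using the :* partial-match syntax mentioned above (prefix match on the term).
cur.execute(
    "SELECT id FROM docs WHERE to_tsvector('english', body) @@ to_tsquery('english', %s)",
    ("recip:*",),
)
print(cur.fetchall())

# Trigram similarity query for the cases FTS misses (typos, partial words).
cur.execute(
    "SELECT id FROM docs WHERE similarity(body, %s) > 0.3 ORDER BY similarity(body, %s) DESC",
    ("aple pie", "aple pie"),
)
print(cur.fetchall())
```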
Meilisearch
Debugging
Ideally, when running in a development environment, we should have the Meilisearch UI available at MEILI_HTTP_ADDR, but I am not sure why it’s not exposed. As a workaround for manual debugging/checks etc., we can use this: https://github.com/riccox/meilisearch-ui
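Another manual-check option (my own sketch, assuming a local instance, a hypothetical movies index, and a master key) is to hit the Meilisearch REST API directly:

```python
import requests

MEILI_URL = "http://localhost:7700"              # whatever MEILI_HTTP_ADDR points at
HEADERS = {"Authorization": "Bearer masterKey"}  # hypothetical API key

# Liveness check.
print(requests.get(f"{MEILI_URL}/health").json())  # {'status': 'available'}

# Poke an index manually to confirm documents are searchable.
resp = requests.post(
    f"{MEILI_URL}/indexes/movies/search",
    headers=HEADERS,
    json={"q": "carol", "limit": 5},
)
print(resp.json()["hits"])
```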
Resources
- Build a search engine, not a vector DB | Hacker News
- Meilisearch Expands Search Power with Arroy’s Filtered Disk ANN
- Supercharge vector search with ColBERT rerank in PostgreSQL | Hacker News
- How do we evaluate vector-based code retrieval? – Voyage AI 🌟
- Hard problems that reduce to document ranking | Hacker News
- Improving recommendation systems and search in the age of LLMs | Hacker News