tags : Information Theory, NLP (Natural Language Processing), Machine Learning, Embeddings

![](/ox-hugo/shapes at 25-05-24 20.14.10.png)

Theory

General idea

  • IR is about scale, you have a huge data you should be able to get things out.
  • There have been traditional methods used but more recently people have also use BERT, Embeddings, even LLMs to do IR tasks.
  • The use of IR can be task specific
    • For search, you might just want a limited retrival where you return n ranked reponses
    • For something like fact checking, you’d want it to get everything, run analysis and get back with the result

Forward Index

  • Eg. Doc123 => “the”,“apple”,“in”,“tree”
  • Keywords can be marked more or less relevant etc
  • Problems: capitalization, phases, alternate spellings, other stuff.
  • Inverted index can be generated out of it

Ranking

TF-IDF (Term Freq. and Inverse Doc. Freq)

  • Not used too much these days but old OG
  • Formula

    • Term Freq: How often word appears in doc
    • Doc Freq: How often word occurs in ALL set of documents. (Tells us that “is” is pretty common)
    • Relevancy =
      • i.e Term Freq * 1/Doc Freq
      • i.e Term Freq * Inverse Doc Freq
      • i.e TF-IDF

Page Rank

  • Again not used a lot anymore but the algorithm was similar to TF-IDF but includes backlinks and a damping factor into the eqn.

Recommendation Engines

Search Engines

You can decompose a “search engine” into multiple big components: Gather, Search Index, Ranking, Query

Gather

Search Index

Ranking

  • Algorithm of scoring/weighing/ranking of pages

Query engine

  • Translating user inputs into returning the most “relevant” pages
  • See Query Engines

Semantic Search Resources

TODO Full Text Search (FTS)

Postgres

  • You would have to build two different indexes, one with pg_trgm and one with tsvector. Alternatively you could use the partial match feature of FTS, with :*, but that doesn’t play well with stemming or with synonyms.

Meilisearch

Debugging

Ideally when run in development env, we should have the meilisearch UI running at MEILI_HTTP_ADDR but I am not sure why it’s not exposed. As a workaround for manual debugging/checks etc. We can use this: https://github.com/riccox/meilisearch-ui

Resources