tags : Information Theory, NLP (Natural Language Processing), Machine Learning

See GitHub - kuutsav/information-retrieval: Neural information retrieval / semantic-search / Bi-Encoders

Theory

Forward Index

  • E.g. Doc123 => “the”, “apple”, “in”, “tree”
  • Keywords can be weighted as more or less relevant, etc.
  • Problems: capitalization, phrases, alternate spellings, etc.
  • An inverted index can be generated from it
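The forward-to-inverted flip above can be sketched with a toy corpus (doc IDs and terms are made up):

```python
from collections import defaultdict

# Toy forward index: doc ID -> list of terms (made-up data)
forward_index = {
    "Doc123": ["the", "apple", "in", "tree"],
    "Doc456": ["the", "tree", "house"],
}

# Flip it: term -> set of doc IDs containing that term
inverted_index = defaultdict(set)
for doc_id, terms in forward_index.items():
    for term in terms:
        inverted_index[term].add(doc_id)

print(inverted_index["tree"])  # both docs contain "tree"
```

The inverted form is what makes lookup fast: a query term maps straight to its posting list of documents instead of scanning every doc.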

Ranking

TF-IDF (Term Freq. and Inverse Doc. Freq)

  • Not used much on its own these days, but it’s the OG ranking formula
  • Formula

    • Term Freq (TF): how often the word appears in the doc
    • Doc Freq (DF): in how many documents of the whole collection the word appears (tells us that “is” is pretty common)
    • Relevancy = TF * (1 / DF)
      • i.e. Term Freq * Inverse Doc Freq
      • i.e. TF-IDF
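A minimal sketch of the formula above, on a made-up corpus (real systems use smoothed/log variants, e.g. BM25 or scikit-learn’s TfidfVectorizer):

```python
import math

# Made-up corpus of tokenized docs
docs = [
    ["the", "apple", "in", "the", "tree"],
    ["the", "tree", "house"],
    ["the", "apple", "pie"],
]

def tf(term, doc):
    # Term Freq: how often the word appears in this doc (length-normalized)
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Inverse Doc Freq: the rarer the word across the corpus, the higher the weight
    df = sum(term in doc for doc in corpus)
    return math.log(len(corpus) / df)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "the" appears in every doc, so its IDF (and thus TF-IDF) is zero;
# "apple" is rarer and scores higher.
```

This is exactly the “common words get discounted” effect: a term in every document contributes nothing to relevancy.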

Page Rank

  • Again, not used much on its own anymore. Unlike TF-IDF (which scores term/doc matches), PageRank scores pages by their backlinks, with a damping factor in the equation.
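A rough power-iteration sketch of the idea on a hypothetical 3-page link graph (`d` is the damping factor; graph and values are made up):

```python
# Hypothetical 3-page web: page -> pages it links to
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
pages = list(links)
d = 0.85  # damping factor
rank = {p: 1 / len(pages) for p in pages}  # start uniform

for _ in range(50):  # power iteration until roughly converged
    rank = {
        p: (1 - d) / len(pages)
        + d * sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        for p in pages
    }

# C has backlinks from both A and B, so it ends up ranked highest.
```

Each page splits its rank evenly across its outlinks; the damping term models a surfer randomly jumping to any page, which keeps the iteration from getting stuck.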

Search Engines

You can decompose a “search engine” into multiple big components: Gather, Search Index, Ranking, Query

Gather

Search Index

Ranking

  • Algorithms for scoring/weighting/ranking pages

Query Engine

  • Translating user inputs into returning the most “relevant” pages
  • See Query Engines

Semantic Search Resources

TODO Full Text Search (FTS)

Resources