tags : Machine Learning, Data Engineering, Open Source LLMs

History of NLP

  • Attention mechanisms already existed in earlier deep neural network models (e.g., RNN-based seq2seq).
  • The 2017 paper “Attention Is All You Need” argued that attention alone is enough, introducing the Transformer architecture.
  • AllenNLP - Demo

General ideas in NLP

Tokenization

Subword tokenization

Token count

  • Large language models such as GPT-3/4, LLaMA and PaLM work in terms of tokens.
  • 32k tokens ~ 24k words (1 token ≈ ¾ of a word) ~ roughly 100 double-spaced pages
  • See Understanding GPT tokenizers
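
A quick way to see token counts (and the subword pieces) is OpenAI's tiktoken library; a minimal sketch, where the model name and example text are just placeholders:

```python
# pip install tiktoken
import tiktoken

# Tokenizer used by a given OpenAI model (model name is just an example)
enc = tiktoken.encoding_for_model("gpt-4")

text = "I love cats"
tokens = enc.encode(text)                  # list of integer token ids
print(len(tokens), "tokens")               # token count for this string
print([enc.decode([t]) for t in tokens])   # the subword pieces each id maps to
```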

Embeddings vs Tokens

  • The Illustrated Word2vec - A Gentle Intro to Word Embeddings in Machine Learning - YouTube
  • Tokens
    • These are inputs
    • The basic units of text that the language model operates on
    • “I love cats” becomes the tokens [“I”, “love”, “cats”] with word-level tokenization.
  • Embeddings
    • Embeddings are learned as part of the model training process.
    • Refer to the vector representations of tokens in a continuous vector space.
    • The model maps each token to an embedding vector that captures its semantic properties.
    • As a result, two tokens with similar embeddings tend to have similar meanings.
    • Any deep learning model that takes tokens as input has an embedding layer somewhere, so in that sense it is an embedding model.
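
A minimal sketch of the tokens-to-embeddings step in PyTorch, using a toy word-level vocabulary (the vocabulary and dimensions are made up for illustration):

```python
import torch

# Toy word-level vocabulary: token string -> integer id
vocab = {"I": 0, "love": 1, "cats": 2}
token_ids = torch.tensor([vocab[w] for w in "I love cats".split()])  # tensor([0, 1, 2])

# Embedding table: one learned vector per token id (trained jointly with the model)
embedding = torch.nn.Embedding(num_embeddings=len(vocab), embedding_dim=4)

vectors = embedding(token_ids)   # shape (3, 4): one 4-d vector per token
print(vectors.shape)
```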

LIMA-style training?

LLM

  • Language models with > 100M parameters
  • They don’t have to use Transformers, but many do
  • They take text, convert it into tokens (integers), then predict which tokens should come next.
  • Pre-trained on large text corpora before any task-specific fine-tuning
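
A minimal sketch of that text → tokens → next-token loop, using Hugging Face transformers with GPT-2 as a stand-in model:

```python
# pip install transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Text -> integer tokens
inputs = tok("The capital of France is", return_tensors="pt")

# Model predicts which tokens should come next
out = model.generate(**inputs, max_new_tokens=5)

# Tokens -> text
print(tok.decode(out[0], skip_special_tokens=True))
```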

LLM Types

LLM Implementations/Architectures

LLM Training

Generally, LLMs are pre-trained for roughly one epoch (a single pass over the training data).

Gradient Accumulation
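
A minimal PyTorch sketch of the idea: accumulate gradients over several small batches before each optimizer step, so the effective batch size is larger than what fits in memory (the model, data, and step counts below are placeholders):

```python
import torch

accum_steps = 4  # effective batch size = accum_steps * per-step batch size

model = torch.nn.Linear(10, 1)              # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

for step in range(16):                      # placeholder training loop
    x, y = torch.randn(8, 10), torch.randn(8, 1)    # placeholder batch
    loss = loss_fn(model(x), y) / accum_steps       # scale so gradients average out
    loss.backward()                                  # gradients accumulate in .grad

    if (step + 1) % accum_steps == 0:
        optimizer.step()                             # one update per accum_steps batches
        optimizer.zero_grad()
```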

Gradient Checkpointing
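
A sketch of enabling it in plain PyTorch: activations of the wrapped block are recomputed during the backward pass instead of being stored, trading compute for memory (the layer and input are placeholders):

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Linear(512, 512)            # placeholder layer
x = torch.randn(4, 512, requires_grad=True)

# Wrap an expensive block; its activations are recomputed on backward
y = checkpoint(layer, x, use_reentrant=False)
y.sum().backward()

# Hugging Face models expose a one-liner for the same idea:
# model.gradient_checkpointing_enable()
```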

Mixed-Precision
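
A minimal sketch with torch.cuda.amp: run the forward pass in half precision and scale the loss to avoid gradient underflow (placeholder model and data; assumes a CUDA device is available):

```python
import torch

model = torch.nn.Linear(10, 1).cuda()        # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()         # rescales the loss to keep fp16 gradients stable

for _ in range(10):                          # placeholder training loop
    x, y = torch.randn(8, 10).cuda(), torch.randn(8, 1).cuda()
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():          # forward pass runs in reduced precision where safe
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```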

Dynamic Padding & Uniform-Length Batching
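
A sketch of dynamic padding with Hugging Face's DataCollatorWithPadding: each batch is padded only to its longest example rather than a global maximum (the tokenizer and texts are just examples). Uniform-length batching then groups similar-length examples together so that even this per-batch padding stays small.

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token              # gpt2 has no pad token by default

texts = ["short", "a noticeably longer sentence that needs more tokens"]
features = [tok(t) for t in texts]

collator = DataCollatorWithPadding(tokenizer=tok, return_tensors="pt")
batch = collator(features)                 # padded only to the longest item in this batch
print(batch["input_ids"].shape, batch["attention_mask"].shape)
```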

PEFT with Low-Rank Adaptation
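
A sketch using the peft library: freeze the base model and train small low-rank adapter matrices injected into the attention projections (the base model and target module names are examples and vary by architecture):

```python
# pip install peft transformers
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")

config = LoraConfig(
    r=8,                         # rank of the low-rank update matrices
    lora_alpha=16,               # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["c_attn"],   # gpt2's fused attention projection; differs per model
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()   # only the adapter weights are trainable
```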

RLHF

  • This is the secret sauce behind the recent wave of instruction-following LLMs
  • Reinforcement learning from human feedback

GPT in production

Embedding search + vectorDB

Basic idea

  • Split internal data into chunks (sized by token count with the LLM’s tokenizer), embed each chunk, and load the embeddings into a vector DB
  • Then query the vector DB for the information most relevant to the user’s query
  • Add the retrieved chunks into the context window.

When the documents/corpus are too big to fit into the prompt, e.g. because of token limits:

  • Obtain relevant chunks by running a similarity search with the query embedding against the vector DB
  • Find the top-k most similar chunk embeddings.
  • Stuff as many of the top-k chunks as fit into the prompt and run the query
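
A minimal sketch of that retrieve-and-stuff flow with sentence-transformers and NumPy; the embedding model name, chunks, and top-k value are placeholders, and a real setup would use a vector DB instead of an in-memory array:

```python
# pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # example embedding model

chunks = ["chunk one ...", "chunk two ...", "chunk three ..."]   # placeholder chunks
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)  # stand-in for a vector DB

query = "what does chunk two talk about?"
query_vec = embedder.encode([query], normalize_embeddings=True)[0]

# Cosine similarity (vectors are normalized, so a dot product suffices)
scores = chunk_vecs @ query_vec
top_k = np.argsort(scores)[::-1][:2]       # indices of the top-k most similar chunks

context = "\n\n".join(chunks[i] for i in top_k)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)
```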

Example

  • Imagine you have an LLM with a token limit of 8k tokens.
  • Split the original document or corpus into 4k token chunks.
  • Leaf nodes of a “chunk tree” are set to these 4k chunks.
    • Run your query over these leaf nodes by summarizing them pair-wise (two at a time).
    • The resulting summaries become the parent nodes of the leaf nodes.
    • You now have a layer above the leaf nodes.
    • Repeat until you reach a single root node.
    • That root node is the result of tree-summarizing your document using LLMs.
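
A sketch of that pair-wise tree reduction; `summarize_pair` is a placeholder for an actual LLM call and here just truncates, purely to keep the example runnable:

```python
def summarize_pair(a: str, b: str) -> str:
    """Placeholder: in practice, prompt the LLM to summarize `a` and `b` together."""
    return (a + " " + b)[:200]   # stand-in "summary"

def tree_summarize(chunks: list[str]) -> str:
    nodes = chunks                      # leaf nodes: the ~4k-token chunks
    while len(nodes) > 1:
        parents = []
        for i in range(0, len(nodes), 2):
            pair = nodes[i:i + 2]
            parents.append(summarize_pair(*pair) if len(pair) == 2 else pair[0])
        nodes = parents                 # one layer closer to the root
    return nodes[0]                     # root node = summary of the whole document

print(tree_summarize(["chunk A ...", "chunk B ...", "chunk C ..."]))
```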

Tools

Prompt tuning

  • The idea is similar to the embedding-search approach, but here you insert learned prompt embeddings (soft prompts) directly into the LLM’s input.
  • This is not currently possible with OpenAI’s API
  • This is claimed to work better than manual prompt search
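
A rough sketch of the soft-prompt idea with GPT-2 as a stand-in: learnable prompt embeddings are prepended to the token embeddings and fed to the model via `inputs_embeds`. The sizes and base model are illustrative; during actual prompt tuning the base model is frozen and only `soft_prompt` is updated.

```python
# pip install transformers torch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

embed_layer = model.get_input_embeddings()
hidden = embed_layer.embedding_dim

# Learnable "soft prompt": n_virtual embedding vectors, not tied to any real tokens
n_virtual = 8
soft_prompt = torch.nn.Parameter(torch.randn(1, n_virtual, hidden) * 0.02)

ids = tok("Classify the sentiment: I love cats", return_tensors="pt").input_ids
token_embeds = embed_layer(ids)                               # (1, seq_len, hidden)
inputs_embeds = torch.cat([soft_prompt, token_embeds], dim=1)

out = model(inputs_embeds=inputs_embeds)                      # logits for the combined sequence
print(out.logits.shape)
```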

Finetune

Fine-tune the model on examples of how it should respond, so you don’t have to specify that behavior in the prompt.

LLMOps

  • The ongoing cost of LLMOps is dominated by inference.
  • Input (prompt) tokens can be processed in parallel; output tokens are generated sequentially, one at a time.