tags : Machine Learning, Data Engineering, Open Source LLMs
History of NLP
- Attention mechanisms already existed in earlier deep neural network models (e.g. RNN-based encoder-decoders)
- The 2017 paper “Attention Is All You Need” showed that attention alone is enough, introducing the Transformer architecture
- AllenNLP - Demo
General ideas in NLP
Tokenization
Subword
- See GitHub - google/sentencepiece
- Uses byte-pair encoding (BPE); see the sketch below
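A minimal sketch of subword tokenization with sentencepiece, assuming the package is installed and a plain-text file `corpus.txt` (hypothetical path) is available for training:

```python
# Train a small BPE model with sentencepiece and tokenize a sentence.
# Assumes `pip install sentencepiece` and a one-sentence-per-line file `corpus.txt`.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",     # training text
    model_prefix="bpe",     # writes bpe.model / bpe.vocab
    vocab_size=8000,
    model_type="bpe",       # byte-pair encoding
)

sp = spm.SentencePieceProcessor(model_file="bpe.model")
print(sp.encode("Subword tokenization handles rare words.", out_type=str))  # subword pieces
print(sp.encode("Subword tokenization handles rare words."))                # token ids
```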
Token count
- Large language models such as GPT-3/4, LLaMA and PaLM work in terms of tokens.
- 32k tokens ≈ 25k words ≈ 50 single-spaced pages (at roughly 500 words per page)
- See Understanding GPT tokenizers
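A minimal sketch of counting tokens with OpenAI’s tiktoken library (assuming `pip install tiktoken`):

```python
# Count tokens the way a GPT-style model would see them.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
text = "Large language models work in terms of tokens, not words."
tokens = enc.encode(text)
print(len(tokens), tokens[:10])  # token count and the first few token ids
```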
Embeddings vs Tokens
- See Embeddings
- The Illustrated Word2vec - A Gentle Intro to Word Embeddings in Machine Learning - YouTube
- Tokens
- These are inputs
- The basic units of text that the language model operates on
- “I love cats” would become the tokens [“I”, “love”, “cats”] with word-level tokenization.
- Embeddings
- Embeddings are learned as part of the model training process.
- They are the vector representations of tokens in a continuous vector space.
- The model maps each token to an embedding vector that captures its semantic properties.
- As a result, two tokens with similar embeddings tend to have similar meanings.
- Any deep learning model that takes tokens as input contains an embedding layer at some point, so it is in part an embedding model (see the sketch below).
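A minimal PyTorch sketch of the distinction: token ids are integers, and an embedding layer maps each id to a learned vector (the tiny vocabulary below is made up for illustration):

```python
# Tokens are integer ids; an embedding layer maps each id to a vector.
import torch
import torch.nn as nn

vocab = {"I": 0, "love": 1, "cats": 2}
token_ids = torch.tensor([vocab[w] for w in ["I", "love", "cats"]])  # tokens: [0, 1, 2]

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=4)  # weights are learned during training
vectors = embedding(token_ids)
print(vectors.shape)  # torch.Size([3, 4]) -> one 4-d vector per token
```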
LIMA style training?
LLM
- Language models with > 100M parameters
- They don’t have to use Transformers, but many do
- They take text, convert it into tokens (integers), then predict which tokens should come next (see the sketch below)
- Pre-trained on large text corpora
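A minimal sketch of that text → tokens → next-token loop, using GPT-2 via Hugging Face transformers (assuming `transformers` and `torch` are installed; GPT-2 is just a small, convenient stand-in):

```python
# Text is tokenized into integer ids; the model scores every vocabulary token as the next one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids  # text -> tokens
with torch.no_grad():
    logits = model(input_ids).logits      # scores over the vocabulary at each position
next_id = int(logits[0, -1].argmax())     # most likely next token
print(tokenizer.decode([next_id]))
```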
LLM Types
- Autoencoder (encoder-only) LLMs, e.g. BERT, are efficient at encoding (“understanding”) natural language
- Autoregressive (decoder-only) LLMs, e.g. GPT, can both encode and generate natural language, but generation is sequential and therefore slower
- Discussion on encoder-decoder: A BERT for laptops, from scratch | Hacker News
LLM Implementations/Architectures
LLM Training
Generally, LLMs are pre-trained for only a single epoch (one pass over the training data)
Gradient Accumulation
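A minimal sketch, assuming a standard PyTorch loop where `model`, `loader`, `optimizer`, and `criterion` already exist: gradients from several small batches are accumulated before one optimizer step, simulating a larger effective batch size.

```python
# Gradient accumulation: sum gradients over several mini-batches, then step once.
# `model`, `loader`, `optimizer`, and `criterion` are assumed to exist.
accumulation_steps = 4

optimizer.zero_grad()
for step, (inputs, labels) in enumerate(loader):
    loss = criterion(model(inputs), labels) / accumulation_steps  # scale so gradients average correctly
    loss.backward()                                               # gradients add up across iterations
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```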
Gradient Checkpointing
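A minimal sketch using `torch.utils.checkpoint`: intermediate activations are not stored in the forward pass and are recomputed during backward, trading extra compute for lower memory.

```python
# Gradient checkpointing: recompute activations on the backward pass instead of storing them.
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU())
x = torch.randn(8, 512, requires_grad=True)

y = checkpoint(layer, x, use_reentrant=False)  # activations for `layer` are recomputed on backward
y.sum().backward()

# Hugging Face models expose the same idea via: model.gradient_checkpointing_enable()
```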
Mixed-Precision
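A minimal sketch of mixed-precision training with `torch.cuda.amp`, again assuming `model`, `loader`, `optimizer`, and `criterion` exist and run on a CUDA device:

```python
# Mixed precision: run the forward/backward pass in float16 where safe,
# with a GradScaler to keep small fp16 gradients from underflowing.
import torch

scaler = torch.cuda.amp.GradScaler()

for inputs, labels in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():          # ops run in reduced precision where safe
        loss = criterion(model(inputs), labels)
    scaler.scale(loss).backward()            # scale the loss before backward
    scaler.step(optimizer)                   # unscale gradients and step
    scaler.update()
```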
Dynamic Padding & Uniform-Length Batching
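A minimal sketch of both ideas together: sort examples by length so each batch holds similar-length sequences (uniform-length batching), then pad only to the longest sequence in that batch (dynamic padding). The `tokenized` input format here is an assumption.

```python
# Dynamic padding + uniform-length batching.
# `tokenized` is assumed to be a list of dicts, each with an "input_ids" list.
import torch

def make_batches(tokenized, batch_size, pad_id=0):
    examples = sorted(tokenized, key=lambda ex: len(ex["input_ids"]))   # group similar lengths
    for i in range(0, len(examples), batch_size):
        batch = examples[i:i + batch_size]
        max_len = max(len(ex["input_ids"]) for ex in batch)             # pad only within this batch
        yield torch.tensor([
            ex["input_ids"] + [pad_id] * (max_len - len(ex["input_ids"]))
            for ex in batch
        ])
```

Hugging Face’s `DataCollatorWithPadding` implements the dynamic-padding half of this out of the box.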
PEFT with Low-Rank Adaptation
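A minimal sketch using the `peft` library: the base model is frozen and only small low-rank adapter matrices (LoRA) are trained. GPT-2 is used here just as a small example model; the hyperparameters are illustrative.

```python
# LoRA fine-tuning setup: wrap a frozen base model with small trainable low-rank adapters.
# pip install peft transformers
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,               # rank of the low-rank update matrices
    lora_alpha=16,     # scaling factor for the updates
    lora_dropout=0.05,
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```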
RLHF
- This is the secret sauce in all new LLMs
- Reinforcement Learning from Human Feedback
GPT in production
Embedding search + vectorDB
Basic idea
- Split internal data into chunks, embed each chunk with an embedding model, and load the vectors into a vector DB
- Then query the vector DB for the chunks most relevant to the user query
- Add those chunks to the context window.
Use this when the documents/corpus are too big to fit into the prompt, e.g. because of token limits.
- Obtain relevant chunks by running a similarity search on the query embedding against the vector DB
- Find the top-k most similar chunk embeddings.
- Stuff as many of the top-k chunks as fit into the prompt and run the query (see the sketch below)
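A minimal sketch of the retrieval step with plain NumPy; `embed()` is a hypothetical function returning one vector per text (e.g. an embeddings API or model), and a real setup would let the vector DB do the similarity search:

```python
# Embed chunks, rank them by cosine similarity to the query, keep the top-k, stuff into the prompt.
import numpy as np

def top_k_chunks(query, chunks, embed, k=3):
    chunk_vecs = np.array([embed(c) for c in chunks])
    q = np.array(embed(query))
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))  # cosine similarity
    best = np.argsort(-sims)[:k]                                                      # indices of top-k chunks
    return [chunks[i] for i in best]

# context = "\n\n".join(top_k_chunks(question, chunks, embed))
# prompt  = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```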
Example
- Imagine you have an LLM with a token limit of 8k tokens.
- Split the original document or corpus into 4k token chunks.
- These 4k-token chunks become the leaf nodes of a “chunk tree”.
- To run your query, summarize these nodes pair-wise (two at a time)
- Each pair’s summary becomes a parent node of the leaf nodes.
- You now have a layer above the leaf nodes.
- Repeat until you reach a single root node.
- That node is the result of tree-summarizing your document using LLMs.
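A minimal sketch of that pairwise tree summarization; `llm_summarize()` is a hypothetical function that calls an LLM with a summarization (or query-focused) prompt:

```python
# Tree summarization: summarize chunks two at a time, then repeat on the summaries
# until a single root summary remains.
def tree_summarize(chunks, llm_summarize):
    nodes = list(chunks)                      # leaf nodes: the 4k-token chunks
    while len(nodes) > 1:
        parents = []
        for i in range(0, len(nodes), 2):
            pair = nodes[i:i + 2]             # two children (or one, if the count is odd)
            parents.append(llm_summarize("\n\n".join(pair)))
        nodes = parents                       # move one layer up the tree
    return nodes[0]                           # root node: summary of the whole document
```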
Tools
- LlamaIndex and LangChain provide tooling for this
- OpenAI cookbook suggests this approach, see gpt4langchain
- Pinecone embedding search
- GPT 4: Superpower results with search - YouTube
Prompt tuning
- The idea is similar to embedding search, but here you insert learned prompt embeddings (“soft prompts”) directly into the LLM’s input
- This is not currently possible with OpenAI’s API
- This approach is claimed to perform better than manual prompt search (see the sketch below)
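A minimal sketch of prompt tuning with the `peft` library: a small set of “virtual token” embeddings is learned and prepended to every input while the base model stays frozen. This needs access to the model’s embedding layer, which is why it isn’t possible through OpenAI’s API. GPT-2 and the token count are illustrative choices.

```python
# Prompt tuning: train only a handful of virtual-token embeddings, keep the base model frozen.
from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")
config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=8,       # 8 learned embedding vectors prepended to every prompt
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the virtual-token embeddings are trainable
```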
Finetune
Train a model on how to respond, so you don’t have to specify that in your prompt.
LLMOps
- The main cost of LLMOps is in inference.
- Input tokens can be processed in parallel (prefill); output tokens are generated sequentially (decode)