tags : Machine Learning, Data Engineering, Open Source LLMs, Information Retrieval, Knowledge

NLP Tasks

| Task | Description |
| --- | --- |
| Document classification | Supervised learning; assigns documents to known categories |
| Clustering | Unsupervised learning; groups similar documents together based on patterns; discovers natural groupings in data |
| Sentiment Analysis | |

More on document classification vs clustering

|  | Clustering | Document Classification |
| --- | --- | --- |
| Use case | Organize articles, customer segmentation on feedback, themes in social media | Spam detection, sentiment analysis, topic categorization, language identification |
| Evaluation | Silhouette score, Davies-Bouldin index, inertia, inter/intra-cluster distance | Accuracy, precision, recall, F1-score, confusion matrix |
| Algorithms | K-means, hierarchical clustering, DBSCAN, mean-shift | Naive Bayes, support vector machines, random forest, neural networks |

Flow

  • Clustering:
    • Raw Documents → Feature Extraction → Clustering Algorithm → Document Groups (K-means, hierarchical)
  • Classification:
    • Training: Labeled Documents → Feature Extraction → Train Classifier → Model
    • Testing: New Document → Feature Extraction → Trained Model → Predicted Label
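A minimal sketch of both flows with scikit-learn (TF-IDF features; the toy documents and labels are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

docs = ["free offer, click now", "meeting at 10am", "win a prize today", "project status update"]
labels = ["spam", "ham", "spam", "ham"]  # only needed for classification

# Feature extraction (shared by both flows)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Clustering: no labels, discover groups
groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Classification: train on labeled documents, then predict a label for a new one
clf = LogisticRegression().fit(X, labels)
predicted = clf.predict(vectorizer.transform(["claim your free prize"]))
```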

Sentiment Analysis

Averaging out sentiments may be erroneous. The important aspect is to look at the totality of sentiments, the degree of sentiment, and also mixed sentiments. Here are two examples -

eg1. Your product is bad (moderate degree)

Your product is very bad (higher degree of bad)

Your product is terrible (highest degree of bad)

eg2. Your product is great but support is terrible.

another one

> (sentiment analysis) on news articles, should I do it sentence by sentence?

No.

Sentiment in one sentence often depends heavily on context from the other sentences.

I think you’d be better off breaking it up into overlapping chunks. Perhaps for each sentence include both the preceding two sentences and the following sentence.

> news articles

That makes it even harder.

Often news articles have the patterns

  • Things were going wonderfully well, until they turned out horribly bad. Or
  • Things were going badly, but then they ended up mostly OK (which is happy considering the circumstances, but would otherwise be sad).

The approach of averaging sentences will miss the main idea in both of those scenarios.

Instead of averaging them, you might be better off looking for trends of which direction the sentiment was heading from beginning to end.

Consider a news article with sentences like

  • “She was a straight A student in high school.” [positive sentiment]
  • ”…[a couple more background paragraphs]…”
  • “Her kidnappers tortured her for 6 months.” [whoa - very negative]
  • ”…[a couple more recent paragraphs]…”
  • “She took his gun and shot him and fled.” [tough for BERT to guess unless it has context]

Unless BERT knows to value the life of the victim more than the suspect, and has the context of who was who, it’ll do extremely poorly.
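A minimal sketch of the overlapping-chunk idea from the answer above (naive sentence splitting; window of two preceding and one following sentence, as suggested):

```python
def overlapping_windows(sentences, before=2, after=1):
    """For each sentence, build a chunk that also carries surrounding context."""
    chunks = []
    for i in range(len(sentences)):
        start = max(0, i - before)
        end = min(len(sentences), i + after + 1)
        chunks.append(" ".join(sentences[start:end]))
    return chunks

article = "She was a straight A student. Her kidnappers tortured her for 6 months. She took his gun and shot him and fled."
sentences = [s.strip() + "." for s in article.split(".") if s.strip()]
for chunk in overlapping_windows(sentences):
    print(chunk)  # feed each chunk (not a lone sentence) to the sentiment model
```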

History of NLP

  • We already had attention mechanisms when we were using DNNs (RNN-based seq2seq models)
  • But the 2017 paper suggested that “attention is all you need”, aka Transformers
  • AllenNLP - Demo

General ideas in NLP

Tokenization

Byte Latent Transformer: Patches Scale Better Than Tokens | Hacker News (possible future of tokenization)

What?

Tokenization is string manipulation. It is basically a for loop over a string with a bunch of if-else conditions and dictionary lookups. There is no way this could be sped up using a GPU. Basically, the only thing a GPU can do is tensor multiplication and addition. Only problems that can be formulated using tensor operations can be accelerated on a GPU.

The default tokenizers in Huggingface Transformers are implemented in Python. There is a faster version that is implemented in Rust.
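A small sketch of switching between the two in transformers (the model name here is just an example):

```python
from transformers import AutoTokenizer

# use_fast=True selects the Rust-backed tokenizer (from the `tokenizers` library);
# use_fast=False falls back to the pure-Python implementation.
fast_tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
slow_tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)

print(fast_tok.is_fast, slow_tok.is_fast)  # True False
print(fast_tok("Tokenization is string manipulation.")["input_ids"])
```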

Subword

Token count

  • Large language models such as GPT-3/4, LLaMA and PaLM work in terms of tokens.
  • 32k tokens ~ 25k words ~ 100 single spaced pages
  • See Understanding GPT tokenizers
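To get a feel for the token-to-word ratio, a quick sketch with OpenAI's tiktoken (ratios vary by tokenizer and by text):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the encoding used by GPT-3.5/GPT-4
text = "Large language models such as GPT-4, LLaMA and PaLM work in terms of tokens."
tokens = enc.encode(text)
print(len(text.split()), "words ->", len(tokens), "tokens")
```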

Embeddings vs Tokens

  • See Embeddings
  • The Illustrated Word2vec - A Gentle Intro to Word Embeddings in Machine Learning - YouTube
  • Tokens
    • These are inputs
    • The basic units of text that the language model operates on
    • “I love cats” would be tokens: [“I”, “love”, “cats”] if using word level tokenization.
  • Embeddings
    • Embeddings are learned as part of the model training process.
    • Refer to the vector representations of tokens in a continuous vector space.
    • The model maps each token to an embedding vector representing its semantic properties.
    • As a result, two tokens with similar embeddings have similar meanings.
    • Any deep learning model that takes tokens as input has an embedding layer somewhere, so at some point it is an embedding model (see the sketch after this list).
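A small sketch of the token → embedding step using transformers (the model name is just an example; real tokenizers use subwords rather than whole words):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("I love cats", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))  # the tokens (as strings)

# The embedding layer maps each token id to a dense vector.
with torch.no_grad():
    token_embeddings = model.get_input_embeddings()(inputs["input_ids"])
print(token_embeddings.shape)  # (1, num_tokens, hidden_size)
```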

TODO LIMA style training?

Old Age

See Information Retrieval

New Age (post 2018 tech)

  • Autoencoder LLMs are efficient for encoding (“understanding”) NL
  • Autoregressive LLMs can encode and generate NL, but may be slower

Meta Ideas

encoder-only, encoder-decoder, decoder-only

There have mainly been three overarching paradigms of model architecture in the past couple of years.

unidirectional and bidirectional

LLMs like GPT-4 and Sonnet (and RNN language models) are usually unidirectional, which suits generation; bidirectional models (e.g. BERT, MPNet) are better suited to understanding. (A toy sketch of the two attention masks follows the examples below.)

  • Unidirectional (left-to-right) example

    • Sentence: “The cat chased the mouse.”
    • Word representation for “cat”
    • Only considers the preceding context: “The”
  • Unidirectional (right-to-left) example

    • Sentence: “The cat chased the mouse.”
    • Word representation for “cat”
    • Only considers the following context: “chased the mouse”
  • Bidirectional (BERT) example

    • Sentence: “The cat chased the mouse.”
    • Word representation for “cat”
    • Considers the full context: “The cat chased the mouse”
  • Cross encoders
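A toy sketch (plain PyTorch, not any particular model's code) of the attention masks behind the unidirectional vs bidirectional distinction:

```python
import torch

seq_len = 5  # "The cat chased the mouse"

# Unidirectional (causal, left-to-right): each token attends only to itself
# and the tokens before it.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Bidirectional (BERT-style): every token attends to every other token.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

print(causal_mask.int())
print(bidirectional_mask.int())
```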

PLM (Permuted Language Model)

Key Characteristics of PLM:

  • Considers multiple possible permutations of the input sequence during training
  • Can see all unmasked tokens regardless of their positions
  • Combines advantages of both:
    • Masked Language Modeling (like BERT)
    • Autoregressive Language Modeling (like GPT)

Eg.

Original Sequence: "The cat sat on the mat"

Possible Permutations:
[The] [cat] [sat] [on] [the] [mat]
[cat] [The] [on] [mat] [sat] [the]
[mat] [sat] [The] [cat] [on] [the]
... (other permutations)

Used by MPNet
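A hedged, purely illustrative sketch of sampling a random factorization order for permuted language modeling (not MPNet's actual implementation):

```python
import random

tokens = ["The", "cat", "sat", "on", "the", "mat"]

# Sample a random permutation of positions; the model then predicts tokens in this
# order, each conditioned on the tokens that came earlier in the permutation.
order = list(range(len(tokens)))
random.shuffle(order)

for step, pos in enumerate(order):
    visible = sorted(order[:step])
    context = [tokens[p] for p in visible]
    print(f"predict '{tokens[pos]}' at position {pos} given {context}")
```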

Semantic Search & Embeddings

  • closest text or texts to a target sentence
  • For embeddings, we could use OpenAI embeddings, but we could also use sentence-transformers. We can start with the following and move to OpenAI embeddings if they don’t work well (see the sketch after this list).
    • multi-qa-mpnet-base-dot-v1
    • gtr-t5-large
    • all-mpnet-base-v2
    • paraphrase-multilingual-mpnet-base-v2
  • “Your baseline for RAG apps should NOT be OpenAI embeddings. It should be ColBERT.” (single vector vs multi-vector)
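A minimal sketch with sentence-transformers (the model name is one of the candidates above; swapping in OpenAI embeddings would only change the encoding step):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

corpus = ["The cat sat on the mat.", "Quarterly revenue grew 10%.", "A kitten rested on the rug."]
query = "Where did the cat sleep?"

corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Cosine-similarity search: the closest texts to the target sentence.
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```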

Massive Text Embedding Benchmark (MTEB) Leaderboard

  • The MTEB leaderboard is also very helpful if you know how to use it. The basic way to look at it is: of all the datasets in the benchmark, which one is closest to the type of data you’re working with? Sort by model performance on that dataset.
  • https://huggingface.co/spaces/mteb/leaderboard

AutoEncoders (Understanding)

Usecase: semantic textual similarity (STS), Classification

Also see Embeddings

Base Models

FAQ

  • Differences between SBERT and MPNet

    | Aspect | SBERT | MPNet |
    | --- | --- | --- |
    | Training objective | Siamese network, triplet loss, paired sentences, similarity focus | Permuted modeling, masked prediction, full token visibility, position-aware |
    | Performance | Fast inference, high similarity accuracy, optimal for short text, production-ready | Better benchmarks, long dependencies, compute-heavy, optimal for long text |
    | Use cases | Search, clustering, retrieval, paraphrasing | Classification, long text, research |
  • Differences between BERT and SBERT

    SBERT models do contrastive learning on top of a pretrained BERT model.

    • BERT outputs one contextualized vector per token.
    • SBERT pools these into a single sentence-level embedding vector.
    • Usecase of “classification”

      • To use BERT as a classifier, you need to reduce the per-token outputs to a single vector, either by selecting the CLS token or by using a pooling strategy (i.e. some head on top of BERT is needed).
      • Typically you would not freeze all the layers when you train a classifier on top of BERT, because the pretraining / pooling will adapt better if they aren’t frozen.
      • Embedding models like SBERT regularize the vector space better from their pretraining methods and are better suited toward direct classification. They may not perform better than training on an unfrozen BERT model, however.
  • BGE-M3 vs ModernBERT

    • BGE-M3 is a fine-tuned embedding model. This means they’ve taken a base language model, which was trained just for language modeling, and then applied further fine-tuning to make it useful for a given application, in this case retrieval.
    • ModernBERT is one step back earlier in the pipeline: it’s the language model that application-specific models such as M3 build on.
  • What does context length have to do in an embedding/encoder model?

    • This is based on how we plan to query: the denser the content, the smaller the chunk size.
    • So if we have dense content, smaller chunk sizes are better and we don’t need bigger context sizes, right? Hmm, not entirely true.
    • Bigger context sizes can be useful:
      • We don’t have to reduce a long context to a single embedding vector.
      • We make use of multi-vectors and pooling.
      • We compute the token embeddings of a long context and then pool those into sentence embeddings.
      • The benefit is:
        • Each sentence’s embedding is informed by all of the other sentences in the context.
        • So when a sentence refers to “The company” for example, the sentence embedding will have captured which company that is based on the other sentences in the context. (This is called late chunking, coming from late interaction)
  • What does “pooling” mean?

    “Pooling” is just aggregation methods. It could mean taking max or average values, or more exotic methods like attention pooling. It’s meant to reduce the one-per-token dimensionality to one per passage or document.

  • How does “semantic chunking” relate to “late chunking”?

    See RAG. You want to partition the document into chunks. Late chunking pairs really well with semantic chunking because it can use late chunking’s improved sentence embeddings to find semantically more cohesive chunks. In fact, you can cast this as a binary integer programming problem and find the ‘best’ chunks this way. See RAGLite [1] for an implementation of both techniques, including the formulation of semantic chunking as an optimization problem.

    Finally, you have a sequence of document chunks, each represented as a multi-vector sequence of sentence embeddings. You could choose to pool these sentence embeddings into a single embedding vector per chunk. Or, you could leave the multi-vector chunk embeddings as-is and apply a more advanced querying technique like ColBERT’s MaxSim [2].

  • What does “late chunking” really mean? does it actually chunk?

    See RAG. The name ‘late chunking’ is indeed somewhat of a misnomer in the sense that the technique does not partition documents into document chunks. What it actually does is pool token embeddings (of a large context) into, say, sentence embeddings. The result is that your document is now represented as a sequence of sentence embeddings, each of which is informed by the other sentences in the document. (A rough sketch follows this FAQ.)

  • What is xlang-ai/instructor-embedding ?

    • It’s a way to adapt embedding models (independently of whether they have multiple representations, like M3) to specific domains.
    • To use M3 + instructor-style embeddings, you would then need to re-train M3 with instructor’s instruction prefixes included.
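A rough sketch of the pooling / late-chunking idea from the FAQ above (mean-pooling contextualized token embeddings into per-sentence embeddings; the model and the naive sentence-boundary handling are illustrative assumptions):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["Acme Corp released its quarterly earnings.", "The company beat expectations."]
document = " ".join(sentences)

# Encode the whole document at once so every token embedding is informed by the full context.
inputs = tokenizer(document, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state[0]  # (num_tokens, hidden)

# "Late chunking": pool the contextualized token embeddings into sentence embeddings.
# Sentence boundaries are recovered naively by re-tokenizing each sentence.
sentence_embeddings, offset = [], 1  # offset 1 skips [CLS]
for sent in sentences:
    n = len(tokenizer.tokenize(sent))
    sentence_embeddings.append(token_embeddings[offset:offset + n].mean(dim=0))
    offset += n
print(torch.stack(sentence_embeddings).shape)  # (num_sentences, hidden_size)
```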

AutoRegressive (Generative)

Usecase: Generation

LLM

  • Language models with > 100M parameters
  • They don’t have to use Transformers, but many do
  • They take text, convert it into tokens (integers), then predict which tokens should come next.
  • Pre-trained

  • LLM Implementations/Architectures

  • LLM Training

    Generally LLMs are trained for just 1 epoch. (A sketch combining several of the techniques below follows this list.)

    • Gradient Accumulation

    • Gradient Checkpointing

    • Mixed-Precision

    • Dynamic Padding & Uniform-Length Batching

    • PEFT with Low-Rank Adaptation
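A hedged sketch tying several of these techniques together with Hugging Face transformers + peft (the base model and hyperparameters are placeholder assumptions):

```python
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")

# PEFT with Low-Rank Adaptation: train small adapter matrices instead of all weights.
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                         target_modules=["c_attn"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Gradient accumulation, gradient checkpointing and mixed precision via TrainingArguments.
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # gradient accumulation
    gradient_checkpointing=True,    # gradient checkpointing
    fp16=True,                      # mixed precision (requires a GPU)
    num_train_epochs=1,
)
```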

RLHF

GPT in production

See Deploying ML applications (applied ML)

Embedding search + vectorDB

  • Basic idea

    • Chunk internal data (e.g. by token count using the LLM tokenizer), embed the chunks, and load them into a vector DB
    • Then query the vector DB for the most relevant information
    • Add into the context window.
  • When documents/corpus are too big to fit into prompt. Eg. Because of token limits.

    • Obtain relevant chunks by similarity search on query from vector DB
    • Find top k most similar chunk embeddings.
    • Stuff as many of the top-k chunks as you can into the prompt and run the query (see the sketch after this list)
  • Example

    • Imagine you have an LLM with a token limit of 8k tokens.
    • Split the original document or corpus into 4k token chunks.
    • Leaf nodes of a “chunk tree” are set to these 4k chunks.
      • Run your query by summarizing these nodes, pair-wise (two at a time)
      • Generate parent nodes of the leaf nodes.
      • You now have a layer above the leaf nodes.
      • Repeat until you reach a single root node.
      • That node is the result of tree-summarizing your document using LLMs.
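A bare-bones sketch of the retrieval step (numpy cosine similarity stands in for a real vector DB; the chunks and embedding model are placeholders):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")  # placeholder embedding model

chunks = ["Refund policy: refunds are issued within 30 days.",
          "Shipping takes 5-7 business days.",
          "Support hours are 9am-5pm on weekdays."]
chunk_vectors = model.encode(chunks, normalize_embeddings=True)  # would live in the vector DB

def top_k_chunks(query, k=2):
    """Return the k chunks most similar to the query (cosine similarity on normalized vectors)."""
    q = model.encode(query, normalize_embeddings=True)
    scores = chunk_vectors @ q
    return [chunks[i] for i in np.argsort(-scores)[:k]]

question = "How long do refunds take?"
context = "\n".join(top_k_chunks(question))
prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
```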

Prompt tuning

  • The idea is similar to the embedding-search approach, but here you are allowed to insert prompt embeddings directly into the LLM (see the toy sketch after this list).
  • This is not currently possible with OpenAI’s API
  • This is claimed to perform better than prompt search
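A toy sketch of the idea: trainable “soft prompt” embeddings prepended to the input embeddings (the model and sizes are illustrative; as noted, this is not possible through OpenAI’s API):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

num_virtual_tokens = 8
embed_dim = model.get_input_embeddings().weight.shape[1]

# Trainable prompt embeddings; during prompt tuning only these are optimized.
soft_prompt = torch.nn.Parameter(torch.randn(1, num_virtual_tokens, embed_dim) * 0.02)

inputs = tokenizer("Classify the sentiment:", return_tensors="pt")
token_embeds = model.get_input_embeddings()(inputs["input_ids"])

# Prepend the soft prompt and feed embeddings directly instead of token ids.
inputs_embeds = torch.cat([soft_prompt, token_embeds], dim=1)
outputs = model(inputs_embeds=inputs_embeds)
```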

Finetune

Train a model on how to respond, so you don’t have to specify that in your prompt.

LLMOps

  • The cost of LLMOps is in inference.
  • Input tokens can be processed in parallel (prefill); output tokens are generated sequentially (decode)