Clustering / Classification

FAQ

Classification is a supervised learning task where we train a model to assign predefined labels to observations based on their features. The key aspect is that we have labeled training data that provides the “ground truth” of which class each observation belongs to. The algorithm learns to map features to these known categories.
Clustering is an unsupervised learning task where the algorithm identifies natural groupings in data without predetermined labels. The algorithm discovers the inherent structure based solely on feature similarity or distance metrics. The groups themselves are not known in advance.

To illustrate this with a concrete example:

Classification: Given images labeled as “cat” or “dog,” learn to predict which label applies to new images.
Clustering: Given unlabeled images, discover that there appear to be natural groupings (which might correspond to cats and dogs, but the algorithm doesn’t “know” these concepts).
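
The same contrast can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the 2-D points stand in for image features, and the "cat"/"dog" labels, the helper names (`fit_nearest_centroid`, `predict`, `kmeans`), and the choice of a nearest-centroid classifier and k-means are all picked here for the example, not taken from any particular library or from the notes above.

```python
import random

def centroid(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def sq_dist(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

# Classification: labels are given, so we learn one centroid per known class.
def fit_nearest_centroid(points, labels):
    by_label = {}
    for p, y in zip(points, labels):
        by_label.setdefault(y, []).append(p)
    return {y: centroid(ps) for y, ps in by_label.items()}

def predict(model, point):
    """Return the label whose centroid is closest to the point."""
    return min(model, key=lambda y: sq_dist(model[y], point))

# Clustering: no labels, so k-means has to discover the groups itself.
def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: sq_dist(centers[i], p))
            groups[idx].append(p)
        # Keep the old center if a group ended up empty.
        centers = [centroid(g) if g else centers[i] for i, g in enumerate(groups)]
    return [min(range(k), key=lambda i: sq_dist(centers[i], p)) for p in points]

# Toy feature vectors (hypothetical stand-ins for image features).
points = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]

model = fit_nearest_centroid(points, labels)
prediction = predict(model, (4.9, 5.3))  # a named class from the training labels
cluster_ids = kmeans(points, k=2)        # anonymous group ids, one per point

print(prediction)
print(cluster_ids)
```

The classifier returns a name it was taught, while k-means returns arbitrary integer ids: it can separate the two groups, but nothing in its output says which group is "cat" and which is "dog".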

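BERT can be used effectively for both classification and clustering, though in different ways. For classification, it is typically fine-tuned on labeled examples with a classification head on top. For clustering, its embeddings serve as features for a separate clustering algorithm such as k-means.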

LLM-based options (for classification you would usually use a non-LLM model such as BERT, but the right choice really depends on your evals):
- Mistral Classifier is available.
- https://arxiv.org/abs/1812.05774
- https://github.com/codelion/adaptive-classifier

Focus of BERT: BERT models are primarily designed for understanding existing text rather than generating new text.[5][9] While text generation isn’t impossible with BERT, it’s not its primary strength, and other architectures are generally recommended for such tasks.[5]