tags : Machine Learning, Statistics, Embeddings, NLP (Natural Language Processing)
FAQ
Clustering vs Classification
- Classification is a supervised learning task where we train a model to assign predefined labels to observations based on their features. The key aspect is that we have labeled training data that provides the “ground truth” of which class each observation belongs to. The algorithm learns to map features to these known categories.
- Clustering is an unsupervised learning task where the algorithm identifies natural groupings in data without predetermined labels. The algorithm discovers the inherent structure based solely on feature similarity or distance metrics. The groups themselves are not known in advance.
To illustrate this with a concrete example:
- Classification: Given images labeled as “cat” or “dog,” learn to predict which label applies to new images.
- Clustering: Given unlabeled images, discover that there appear to be natural groupings (which might correspond to cats and dogs, but the algorithm doesn’t “know” these concepts).
BERT can be effectively used for both classification and clustering tasks, though in different ways:
Learning resources
- Here are 3 that I like (Kaggle)
- Titanic for classification, House Prices for regression, and Digit Recognizer for computer vision.
- https://blog.codingconfessions.com/p/the-cap-theorem-of-clustering
- https://arxiv.org/abs/1812.05774
- Ask HN: Better ways to extract skills from job postings? | Hacker News
- https://docs.mistral.ai/capabilities/finetuning/classifier_factory/
- https://hookedondata.org/posts/2019-08-05_clustering-bob-ross-paintings/
- https://blog.wilsonl.in/hackerverse/
- K-means is like the McDonald’s of clustering algorithms. It’s fast, it’s conveni… | Hacker News
- Hierarchical Clustering | Hacker News