Challenge 07: Clustering in Machine Learning

Estimated Time

20-30 min | Cost: Free | Domain: Machine Learning on Azure (15-20%)

Exam skills covered

  • Identify clustering machine learning scenarios
  • Describe how clustering differs from classification
  • Understand unsupervised learning concepts
  • Identify appropriate use cases for clustering

Overview

Clustering is the machine learning technique used to group similar items together when you don't have predefined categories. Unlike classification, where you know the categories in advance (for example, spam/not-spam), clustering discovers natural groupings in the data on its own.

Think of clustering like organizing a messy drawer. You dump out 100 items and start grouping things that seem similar: pens go together, batteries go together, cables go together. Nobody told you these categories in advance — you discovered them by looking at similarities. That's clustering.

The critical distinction: classification is supervised learning (you provide labeled examples), while clustering is unsupervised learning (no labels needed). Clustering finds patterns and groupings that you might not have known existed.

Explore

Task 1: Classification vs Clustering

The difference between classification and clustering is one of the most-tested concepts:

Aspect | Classification | Clustering
Learning type | Supervised | Unsupervised
Labels needed? | Yes: training data has known categories | No: no labels required
Categories | Predefined (you specify them) | Discovered (the algorithm finds them)
Goal | Assign items to KNOWN groups | Discover UNKNOWN groups
Example | "This email is spam" (label known) | "These customers behave similarly" (groups discovered)
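The contrast in the table can be made concrete with a small sketch. Everything here is illustrative: the data, the function names `classify` and `split_into_two_groups`, and the choice of techniques (a 1-nearest-neighbor lookup for classification, and a toy one-dimensional split at the widest gap for clustering) are assumptions made for this example, not something the exam or Azure ML prescribes.

```python
# Toy contrast between classification and clustering (illustrative data only).

# Classification: every training example carries a known label.
labeled_emails = [
    (9, "spam"),      # (number of suspicious words, label)
    (8, "spam"),
    (1, "not-spam"),
    (0, "not-spam"),
]

def classify(suspicious_words):
    """Return the label of the closest labeled example (1-nearest neighbor)."""
    nearest = min(labeled_emails, key=lambda ex: abs(ex[0] - suspicious_words))
    return nearest[1]

# Clustering: the data has no labels, only feature values.
monthly_spend = [12, 15, 11, 240, 260, 255]

def split_into_two_groups(values):
    """Split the sorted values at the widest gap. Groups have no names."""
    ordered = sorted(values)
    gaps = [ordered[i + 1] - ordered[i] for i in range(len(ordered) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]

print(classify(7))                           # a label that was known in advance
print(split_into_two_groups(monthly_spend))  # two unnamed groups
```

The classifier can only answer with a label it was given during training, while the grouping function returns unnamed groups that a person still has to interpret (for example, as "low spenders" and "high spenders").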

Task 2: Identify clustering scenarios

Scenario | Why it's clustering
Customer segmentation | Group customers by purchasing behavior to discover segments you didn't know existed
Document grouping | Organize articles into topic groups without predefined categories
Anomaly detection | Items that don't fit any cluster may be outliers
Gene expression analysis | Group genes with similar expression patterns
Image compression | Group similar colors together to reduce the number of unique colors

NOT clustering (these are classification):

  • Sorting emails into spam/not-spam (labels are known)
  • Diagnosing a disease as Type A, B, or C (categories predefined by doctors)
  • Grading student essays as A/B/C/D/F (grades are predetermined)

Task 3: Understand K-Means clustering

K-Means is the most common clustering algorithm and the one referenced in the exam:

  1. Choose K — decide how many clusters you want (e.g., K=3 for 3 groups)
  2. Initialize — place K random center points (centroids)
  3. Assign — each data point joins the nearest centroid's cluster
  4. Update — move each centroid to the center of its assigned points
  5. Repeat — keep assigning and updating until clusters stabilize
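The five steps above can be sketched in plain Python. This is a minimal illustration, not the implementation Azure ML uses: the function name `k_means`, the sample points, the fixed random seed, and the choice to initialize centroids by picking K random data points are all assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, max_iterations=100, seed=0):
    """Return (assignments, centroids) for the given points and K."""
    rng = random.Random(seed)
    # Step 2: initialize by picking K distinct data points as centroids
    # (one common choice; an assumption for this sketch).
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Step 3: each point joins the cluster of its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 5: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 4: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids

# Step 1: choose K. Here K=2 for two visibly separate groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
assignments, centroids = k_means(points, k=2)
print(assignments)
print(centroids)
```

The cluster numbers (0 or 1) are arbitrary and depend on the random initialization; what matters is which points end up sharing a number.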

Key decisions:

  • How many clusters (K)? There's no perfect answer. You try different values and evaluate which makes the most business sense.
  • What features to use? The features you include determine what "similar" means.
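To see how feature choice changes what "similar" means, here is a small sketch using three hypothetical customers. The names, values, and the helper functions `distance` and `most_similar` are invented for illustration only.

```python
import math

# Hypothetical customers with two features each.
customers = {
    "Ana":   {"age": 25, "monthly_spend": 400},
    "Ben":   {"age": 27, "monthly_spend": 50},
    "Chloe": {"age": 60, "monthly_spend": 410},
}

def distance(a, b, features):
    """Euclidean distance computed only over the chosen features."""
    return math.sqrt(sum((a[f] - b[f]) ** 2 for f in features))

def most_similar(name, features):
    """Return the other customer closest to `name` on the chosen features."""
    others = (n for n in customers if n != name)
    return min(others, key=lambda n: distance(customers[name], customers[n], features))

print(most_similar("Ana", ["age"]))            # Ben
print(most_similar("Ana", ["monthly_spend"]))  # Chloe
```

Ana's closest match changes from Ben to Chloe purely because a different feature was used, so a clustering algorithm fed different features can produce different groups from the same customers.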

Task 4: Clustering in Azure Machine Learning

In Azure ML, you can build clustering models with Azure ML Designer, a drag-and-drop pipeline builder:

  • Use the "K-Means Clustering" module
  • Connect to a dataset (no label column needed!)
  • Configure the number of clusters
  • Evaluate results with metrics like silhouette score

Key metrics for clustering:

  • Silhouette score: Measures how similar each item is to its own cluster compared with other clusters (-1 to 1, higher is better)
  • Inertia: Sum of squared distances from points to their cluster center (lower is better)

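Both metrics can be computed by hand on a tiny dataset. The sketch below is plain Python written for illustration, not the code Azure ML runs; the function names, the sample points, and the two hand-made labelings (`good_labels`, `bad_labels`) are assumptions. It scores a sensible grouping against a deliberately mixed-up one.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inertia(points, labels):
    """Sum of squared distances from each point to its cluster's centroid."""
    total = 0.0
    for label in set(labels):
        members = [p for p, l in zip(points, labels) if l == label]
        centroid = [sum(dim) / len(members) for dim in zip(*members)]
        total += sum(euclidean(p, centroid) ** 2 for p in members)
    return total

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs at least 2 clusters)."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        same = [euclidean(p, q)
                for j, (q, l) in enumerate(zip(points, labels))
                if l == own and j != i]
        if not same:  # convention: a single-point cluster scores 0
            scores.append(0.0)
            continue
        a = sum(same) / len(same)  # mean distance to own cluster
        b = min(                   # mean distance to the nearest other cluster
            sum(euclidean(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != own
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
good_labels = [0, 0, 0, 1, 1, 1]  # matches the two visible groups
bad_labels = [0, 1, 0, 1, 0, 1]   # mixes the groups together

for name, labels in [("good", good_labels), ("bad", bad_labels)]:
    print(name, round(silhouette_score(points, labels), 3), round(inertia(points, labels), 3))
```

The sensible grouping should show a silhouette score close to 1 and a much lower inertia than the mixed-up one, which is the pattern to look for when comparing candidate values of K.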
Exam strategy

Exam triggers for clustering: "no labels", "discover groups", or "segment customers". If the scenario says "we don't know the categories yet" or "find natural groupings", the answer is clustering. If the categories are already known, the answer is classification.

Key Concepts

Concept | Definition
Clustering | Unsupervised ML technique that groups similar data points together
Unsupervised learning | ML approach that finds patterns without labeled training data
Supervised learning | ML approach that uses labeled training data (classification, regression)
K-Means | Popular clustering algorithm that divides data into K groups based on distance to centroids
Centroid | The center point of a cluster
K (number of clusters) | A parameter you choose: how many groups the algorithm should create
Silhouette score | Metric measuring how well-separated clusters are (-1 to 1)
Customer segmentation | Common use case: grouping customers by behavior to discover market segments

Common Misconceptions

Misconception | Reality
"Clustering and classification are the same thing" | Classification assigns items to KNOWN categories using labeled data. Clustering DISCOVERS unknown groups without labels. The presence or absence of predefined labels is the key difference.
"Clustering tells you what each group means" | Clustering finds groups of similar items, but interpreting what each group represents is a human task. The algorithm says "these items are similar"; you decide the meaning.
"You must know the number of clusters beforehand" | While K-Means requires you to specify K, you typically try multiple values and use metrics (silhouette score) or business logic to pick the best number.
"Clustering requires large datasets" | Clustering can work with smaller datasets, though the quality of discovered groups improves with more data. Even a few hundred points can form meaningful clusters.
"Unsupervised means no human involvement" | Unsupervised means no labels in the data. Humans still choose features, set parameters (like K), interpret results, and validate that clusters are meaningful.

Knowledge Check

1. A marketing team wants to group their customers into segments based on purchasing behavior, but they do not have predefined customer categories. Which ML technique should they use?

2. What is the KEY difference between clustering and classification?

3. In K-Means clustering, what does "K" represent?

4. Which of the following is NOT a clustering scenario?

5. A clustering algorithm groups data based on similarity. Who determines what the discovered groups MEAN or represent?
