Challenge 05: Regression in Machine Learning
25-35 min | Cost: Free | Domain: Machine Learning on Azure (15-20%)
Exam skills covered
- Identify regression machine learning scenarios
- Describe how training data is used in regression
- Identify features and labels in a dataset
- Understand model evaluation metrics for regression
Overview
Regression is the machine learning technique used to predict a numeric value. Whenever the answer to your question is a number — a price, a temperature, a duration, a quantity — you're looking at a regression problem.
Think of regression like drawing the best-fit line through a scatter plot of data points. If you plot house sizes on one axis and their prices on the other, regression finds the pattern (line or curve) that lets you predict the price of a new house based on its size. The model learns: "for every extra 100 square feet, the price increases by approximately $X."
The key vocabulary: features are the input data (square footage, number of rooms, location), and the label is what you're predicting (the price). Training data is historical examples where both features AND the label are known — the model learns the relationship between them.
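The best-fit-line idea can be sketched in a few lines of plain Python. This is a minimal illustration, not how Azure ML trains models: the four house records are made up, and `fit_line` is a hypothetical helper that applies ordinary least squares to a single feature.

```python
# Hypothetical training data: each row has a feature (square feet) and a known label (price).
training_data = [
    {"sqft": 1000, "price": 200_000},
    {"sqft": 1500, "price": 275_000},
    {"sqft": 2000, "price": 350_000},
    {"sqft": 2500, "price": 425_000},
]

features = [row["sqft"] for row in training_data]  # inputs the model learns from
labels = [row["price"] for row in training_data]   # values the model learns to predict


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(features, labels)
print(f"Each extra 100 sq ft adds about ${slope * 100:,.0f}")
print(f"Predicted price for 1,800 sq ft: ${slope * 1800 + intercept:,.0f}")
```

The slope is the "for every extra 100 square feet" pattern described above, and the last line applies it to a house size that was not in the training data.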
Explore
Task 1: Understand regression terminology
| Term | Definition | Example (predicting house price) |
|---|---|---|
| Features | Input variables used for prediction | Square footage, bedrooms, zip code, year built |
| Label | The value being predicted (output) | Sale price ($) |
| Training data | Historical examples with known features AND labels | Past house sales with all details |
| Model | The mathematical relationship learned from training data | "Price = $150 × sqft + $20,000 × bedrooms + ..." |
| Prediction | The model's output for new, unseen data | Estimated price for a house not yet sold |
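To make the Model and Prediction rows concrete, the sketch below treats the table's illustrative formula as a Python function. It uses only the two terms shown in the table; whatever the "..." stands for is left out, and the `new_house` values are hypothetical.

```python
# Coefficients copied from the table's illustrative model. The terms elided
# by "..." in the table are deliberately not included in this sketch.
PRICE_PER_SQFT = 150
PRICE_PER_BEDROOM = 20_000


def predict_price(sqft, bedrooms):
    """Apply the learned relationship to one house's features."""
    return PRICE_PER_SQFT * sqft + PRICE_PER_BEDROOM * bedrooms


# A house not yet sold (hypothetical values): the features are known, the label is not.
new_house = {"sqft": 1800, "bedrooms": 3}
estimate = predict_price(new_house["sqft"], new_house["bedrooms"])
print(f"Estimated price: ${estimate:,}")
```

In a real project the coefficients would come from training rather than being typed in by hand; the point here is only that a trained regression model is a function from features to a number.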
Task 2: Identify regression scenarios
Which of these are regression problems? (Answer: all the ones predicting a NUMBER)
| Scenario | Regression? | Why |
|---|---|---|
| Predicting tomorrow's high temperature | ✅ Yes | Output is a numeric value (degrees) |
| Predicting a student's exam score | ✅ Yes | Output is a number (0-100) |
| Determining if an email is spam | ❌ No | Output is a category (spam/not-spam) — this is classification |
| Predicting how long a delivery will take | ✅ Yes | Output is a number (minutes/hours) |
| Sorting photos into "cat" or "dog" | ❌ No | Output is a category — classification |
| Estimating a car's fuel efficiency (MPG) | ✅ Yes | Output is a numeric value (miles per gallon) |
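The "is the output a number?" test from the table can be written as a small check. This is a rough heuristic under simple assumptions: `problem_type` is a hypothetical helper, the example label values are made up, and a numeric code that really stands for a category (such as a zip code) would fool it.

```python
from numbers import Real


def problem_type(example_label):
    """Return 'regression' for numeric labels and 'classification' for categories."""
    # bool is a subclass of int in Python, so treat True/False as categories.
    if isinstance(example_label, Real) and not isinstance(example_label, bool):
        return "regression"
    return "classification"


# One example label value per scenario from the table above.
scenarios = {
    "Tomorrow's high temperature": 31.5,
    "Student's exam score": 87,
    "Is this email spam?": "spam",
    "Delivery time in minutes": 42.0,
    "Photo contents": "cat",
    "Fuel efficiency (MPG)": 28.4,
}

for scenario, label in scenarios.items():
    print(f"{scenario}: {problem_type(label)}")
```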
Task 3: Explore Azure ML Designer sample regression
- Visit Azure Machine Learning Studio
- If you don't have a workspace, review this sample pipeline conceptually:
  - Dataset: Automobile price data (features: make, body-style, engine-size, horsepower, etc.)
  - Algorithm: Linear Regression
  - Goal: Predict the price of a car based on its features
- The Designer provides a drag-and-drop experience to build ML pipelines without code
- Sample pipelines demonstrate regression with real datasets
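For readers without a workspace, the flow of such a pipeline can be approximated in code. The sketch below is not the Designer sample itself: the function names are hypothetical, the car records are made up, and a single feature (engine size) stands in for the full automobile dataset. It only illustrates the kind of stages a regression pipeline chains together: clean the data, split it, train, score, and evaluate.

```python
def clean_missing(rows, columns):
    """Drop rows where any required column is missing (None)."""
    return [r for r in rows if all(r.get(c) is not None for c in columns)]


def split_data(rows, every_nth=4):
    """Hold out every nth row for testing; the rest becomes training data."""
    train = [r for i, r in enumerate(rows) if (i + 1) % every_nth != 0]
    test = [r for i, r in enumerate(rows) if (i + 1) % every_nth == 0]
    return train, test


def train_linear(rows, feature, label):
    """Least-squares fit of label = slope * feature + intercept."""
    xs = [r[feature] for r in rows]
    ys = [r[label] for r in rows]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def score(model, rows, feature):
    """Predict the label for each row using the trained (slope, intercept)."""
    slope, intercept = model
    return [slope * r[feature] + intercept for r in rows]


# Made-up records; one has a missing price and is dropped during cleaning.
cars = [
    {"engine_size": 90, "price": 6_500},
    {"engine_size": 110, "price": 9_000},
    {"engine_size": 120, "price": None},
    {"engine_size": 130, "price": 12_500},
    {"engine_size": 150, "price": 15_000},
    {"engine_size": 180, "price": 21_000},
    {"engine_size": 200, "price": 24_500},
    {"engine_size": 220, "price": 28_000},
    {"engine_size": 250, "price": 33_500},
]

clean = clean_missing(cars, ["engine_size", "price"])
train, test = split_data(clean)
model = train_linear(train, "engine_size", "price")
predictions = score(model, test, "engine_size")
mae = sum(abs(p - r["price"]) for p, r in zip(predictions, test)) / len(test)
print(f"Trained on {len(train)} cars, tested on {len(test)}; MAE = ${mae:,.0f}")
```

Evaluating on rows the model never saw during training is the important design choice: it estimates how the model will behave on genuinely new cars.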
Task 4: Understand regression evaluation metrics
After training a regression model, you evaluate how good its predictions are:
| Metric | What it measures | Good value |
|---|---|---|
| MAE (Mean Absolute Error) | Average absolute difference between predicted and actual values, in the same units as the label | Lower is better |
| RMSE (Root Mean Squared Error) | Square root of the average squared error; penalizes large mistakes more heavily than MAE | Lower is better |
| R² (R-squared) | How much of the variation the model explains | Closer to 1.0 is better |
Example: If a model predicts house prices with an MAE of $15,000, its predictions are off by $15,000 on average, whether too high or too low.
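The three metrics are short enough to compute by hand. The sketch below uses the standard textbook definitions on four made-up house prices, chosen so that the MAE works out to the $15,000 from the example above; the function names are illustrative and are not Azure ML APIs.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error: average size of the errors, ignoring direction."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Squared Error: squaring makes large errors count for more."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """Share of the variation in the actual values that the predictions explain."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical actual sale prices and model predictions.
actual = [200_000, 300_000, 400_000, 500_000]
predicted = [210_000, 290_000, 395_000, 535_000]

print(f"MAE:       ${mae(actual, predicted):,.0f}")
print(f"RMSE:      ${rmse(actual, predicted):,.0f}")
print(f"R-squared: {r_squared(actual, predicted):.3f}")
```

RMSE comes out higher than MAE here because one prediction misses by $35,000, and squaring gives that single large error extra weight.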
The exam tests whether you can IDENTIFY regression scenarios, not whether you can calculate metrics. The key question: "Is the output a number?" If yes → regression. If it's a category → classification.
Key Concepts
| Concept | Definition |
|---|---|
| Regression | ML technique that predicts a continuous numeric value |
| Features | Input variables (predictors) used by the model |
| Label | The target value being predicted |
| Training data | Historical data with known features and labels used to train the model |
| Linear regression | Simplest regression — finds a straight-line relationship between features and label |
| Mean Absolute Error (MAE) | Average magnitude of errors in predictions |
| R-squared (R²) | Proportion of variance in the label explained by the model (typically 0 to 1; it can be negative when the model predicts worse than always guessing the mean) |
| Overfitting | Model memorizes training data instead of learning general patterns |
Common Misconceptions
| Misconception | Reality |
|---|---|
| "Regression means the data goes down (regresses)" | In ML, regression means predicting a numeric value. The term comes from statistics ("regression to the mean") — it has nothing to do with declining trends |
| "Regression can only predict future values" | Regression predicts any numeric value — past, present, or future. Predicting the age of a fossil or the price of a painting are both regression |
| "More features always make a better model" | Irrelevant features add noise and can worsen predictions. Feature selection — choosing the RIGHT inputs — is crucial |
| "Regression can only model straight-line relationships" | Linear regression fits a straight-line relationship, but it is only one option. Azure ML offers other regression algorithms (decision trees, neural networks) that can model complex curves |
| "A high R² always means the model is good" | A very high R² on training data might indicate overfitting — the model memorized the training data but won't generalize to new data |
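The last row is easiest to see with numbers. The sketch below compares a deliberately overfit "model" that just memorizes the training rows with a simple least-squares line. All data is made up, and `memorize` and `fit_line` are hypothetical helpers written for this illustration.

```python
def r_squared(actual, predicted):
    """Proportion of variance in the actual values explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def fit_line(rows):
    """Least-squares line through (sqft, price) pairs; returns a predict function."""
    xs, ys = zip(*rows)
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in rows) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return lambda sqft: slope * sqft + intercept


def memorize(rows):
    """A deliberately overfit 'model': return the price of the closest training row."""
    return lambda sqft: min(rows, key=lambda row: abs(row[0] - sqft))[1]


# Hypothetical (sqft, price) pairs; the test rows are never shown to either model.
train = [(1000, 210_000), (1500, 260_000), (2000, 365_000), (2500, 410_000)]
test = [(1200, 228_000), (1800, 322_000), (2300, 398_000)]

results = {}
for name, model in {"memorizer": memorize(train), "line": fit_line(train)}.items():
    for split_name, rows in {"train": train, "test": test}.items():
        actual = [price for _, price in rows]
        predicted = [model(sqft) for sqft, _ in rows]
        results[(name, split_name)] = r_squared(actual, predicted)
        print(f"{name:9} {split_name:5} R-squared = {results[(name, split_name)]:.3f}")
```

On this made-up data the memorizing model scores a perfect R² on the training rows but a lower R² than the fitted line on the held-out rows, which is why R² should be judged on data the model has not seen.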
Knowledge Check
1. A company wants to predict how many units of a product they will sell next month based on historical sales data, advertising spend, and seasonal trends. What type of ML problem is this?
2. In a dataset used to predict house prices, which of the following would be the LABEL?
3. A regression model has an R-squared value of 0.92. What does this tell you?
4. Which scenario is NOT a regression problem?
5. What is the role of training data in a regression model?