
Challenge 11: Object Detection

Estimated Time

25-35 min | Cost: Free | Domain: Computer Vision on Azure (15-20%)

Exam skills covered

  • Identify features of object detection solutions
  • Understand bounding boxes and confidence scores
  • Differentiate object detection from image classification
  • Identify use cases for object detection

Overview

Object detection goes beyond image classification by not only identifying WHAT objects are in an image but also WHERE they are located. For each detected object, the model returns a bounding box (rectangle coordinates) and a confidence score. One image can contain multiple objects of different types.

Think of object detection like a wildlife photographer cataloging animals in a photo. Classification says "this photo contains elephants." Object detection says "there are 3 elephants: one in the upper-left, one in the center, and one in the lower-right" — each marked with a rectangle and a confidence level.

The key difference from classification: classification labels the entire image as one thing. Object detection finds multiple individual objects within the image and tells you exactly where each one is. This is critical for applications like autonomous driving (where is each car, pedestrian, and traffic sign?) or retail analytics (how many people are in each aisle?).

Explore

Task 1: Understanding bounding boxes

A bounding box defines the location of a detected object using coordinates:

[Figure: Challenge 11 - Image Analysis Pipeline]

Each detection includes:

  • Class/label: What the object is ("dog", "cat")
  • Confidence score: How certain the model is (0.94 = 94%)
  • Bounding box: Coordinates defining the rectangle (x, y, width, height)
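
The three fields above can be sketched as a small Python record. This is an illustrative sketch only, not the output format of any particular service: the `Detection` class and its field names are hypothetical, and it assumes that `x` and `y` mark the top-left corner of the box in pixels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected object: a label, a confidence score, and a bounding box."""
    label: str
    confidence: float  # 0.0 to 1.0, e.g. 0.94 means 94%
    x: int             # assumed: left edge of the box, in pixels
    y: int             # assumed: top edge of the box, in pixels
    width: int
    height: int

    def corners(self) -> tuple[int, int, int, int]:
        """Return the box as (left, top, right, bottom)."""
        return (self.x, self.y, self.x + self.width, self.y + self.height)

    def area(self) -> int:
        """Return the box area in square pixels."""
        return self.width * self.height


# Hypothetical values chosen for illustration
dog = Detection(label="dog", confidence=0.94, x=100, y=200, width=150, height=120)
print(dog.corners())  # (100, 200, 250, 320)
print(dog.area())     # 18000
```

Storing the box as (x, y, width, height) and deriving the corners on demand keeps the record close to the description above while still supporting drawing or overlap checks.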

Task 2: Object detection vs classification vs segmentation

Technique | Question answered | Output | Example
Image Classification | "What is this image?" | Label(s) for the whole image | "This is a beach scene"
Object Detection | "What objects are here and WHERE?" | Labels + bounding boxes | "Car at (100,200), person at (400,300)"
Instance Segmentation | "What shape is each object?" | Labels + pixel-level outlines | Exact outline of each car, person

For the exam: Focus on the classification vs detection distinction. The key differentiator is bounding boxes/localization.

Task 3: Explore object detection demos

  1. Visit the Azure AI Vision demo
  2. Try the Dense Captioning or Object Detection features
  3. Upload an image with multiple objects (e.g., a street scene)
  4. Observe:
    • Multiple objects detected in one image
    • Each object has a bounding box drawn around it
    • Confidence scores vary per object
    • The model can detect the SAME type of object multiple times (3 cars, 2 people)
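
Because every instance gets its own entry in the results, counting objects of one type reduces to tallying labels. The sketch below assumes the results have already been parsed into a list of dictionaries. The `detections` data and the `count_by_label` helper are hypothetical, and the values are made up for illustration.

```python
from collections import Counter

# Hypothetical detection results for a street scene (illustrative values only)
detections = [
    {"label": "car", "confidence": 0.91},
    {"label": "car", "confidence": 0.84},
    {"label": "car", "confidence": 0.67},
    {"label": "person", "confidence": 0.88},
    {"label": "person", "confidence": 0.73},
]


def count_by_label(results):
    """Tally how many times each label appears in the detection results."""
    return Counter(d["label"] for d in results)


counts = count_by_label(detections)
print(counts["car"], counts["person"])  # 3 2
```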

Task 4: Real-world object detection use cases

Industry | Use case | What's detected
Retail | Customer counting and flow analysis | People in store aisles
Autonomous vehicles | Navigating safely | Cars, pedestrians, signs, lanes
Manufacturing | Quality inspection | Defects, components, alignment issues
Security | Surveillance alerts | People, vehicles, weapons
Agriculture | Crop monitoring | Weeds, pests, ripe fruit
Healthcare | Medical imaging | Tumors, fractures, anomalies

Custom Object Detection with Azure Custom Vision:

  • Train with YOUR images and YOUR object types
  • Label objects by drawing bounding boxes on training images
  • Need at least 15 tagged images per object type
  • The model learns to find YOUR specific objects in new images

Exam strategy

Look for these keywords in exam scenarios:

  • "Locate", "find where", "bounding box", "position" → Object Detection
  • "How many of X are in the image" → Object Detection (counting requires locating each instance)
  • "What is this image of?" (whole image) → Classification

Key Concepts

Concept | Definition
Object detection | Identifying and locating multiple objects within an image using bounding boxes
Bounding box | Rectangle defined by coordinates (x, y, width, height) that frames a detected object
Confidence threshold | Minimum confidence score required to accept a detection as valid
IoU (Intersection over Union) | Metric measuring how much a predicted bounding box overlaps with the true location
Multiple detections | One image can contain many objects; each gets its own box and label
Custom Vision (Object Detection) | Azure service to train custom object detectors with your own labeled images
Real-time detection | Processing video frames in real-time to detect objects continuously
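
IoU can be made concrete with a short calculation: divide the area where the two boxes overlap by the total area they cover together. The sketch below is a minimal version that assumes boxes are given as (x, y, width, height) with (x, y) as the top-left corner. The `iou` function name and the sample boxes are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union for two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b

    # Width and height of the overlapping rectangle (0 if the boxes do not touch)
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h

    # Union = both areas minus the part counted twice
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


predicted = (50, 50, 100, 100)   # hypothetical model output
truth = (100, 50, 100, 100)      # hypothetical labeled location
print(round(iou(predicted, truth), 2))  # 0.33
```

An IoU of 1.0 means the predicted box matches the true location exactly, and 0.0 means the boxes do not overlap at all.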

Common Misconceptions

Misconception | Reality
"Object detection is just image classification with locations" | They are related but distinct. Classification labels the whole image. Object detection finds and locates individual objects; it handles multiple objects, overlapping objects, and objects of different types in one image
"Object detection can only find one object at a time" | Object detection finds ALL objects in an image simultaneously. A street scene might return 5 cars, 3 people, 2 traffic lights, all with separate bounding boxes
"Bounding boxes are always perfectly aligned with objects" | Bounding boxes are rectangles; they approximate the object's location. For irregular shapes, the box includes some background. Instance segmentation provides pixel-precise outlines
"You need video for object detection" | Object detection works on single images. When applied to video, it processes individual frames. Real-time video is just fast image processing
"Higher confidence threshold is always better" | Higher thresholds mean fewer false positives but more missed detections. The right threshold depends on the use case: a self-driving car needs to detect ALL pedestrians (lower threshold, higher recall)

Knowledge Check

1. A retail store wants to count how many customers are in each department at any given time using security cameras. Which computer vision technique is most appropriate?

2. What information does a bounding box provide in object detection?

3. An autonomous vehicle system detects a pedestrian with 0.55 confidence and the safety threshold is set to 0.30. What should the system do?

4. What is the KEY feature that distinguishes object detection from image classification?

5. A single image processed by an object detection model shows a street scene. Which result is most likely?

Learn More