Challenge 26: Custom Vision - Object Detection
Estimated Time
60 min | Cost: $2-5 (estimated) | Domain: Implement Computer Vision Solutions (10-15%)
Exam skills covered
- Train a custom image model for object detection
- Label images with bounding box regions
- Evaluate detection metrics (mAP)
- Publish and consume an object detection model
Overview
Object detection locates and classifies multiple objects within an image using bounding boxes. Unlike classification (which answers "what is this image?"), detection answers "what objects are here and where?"
Key concepts:
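- Bounding box: Rectangle given as (left, top, width, height) in normalized coordinates (0.0–1.0), measured from the top-left corner of the image
- IoU (Intersection over Union): Area of overlap between a predicted box and the labeled (ground-truth) box, divided by the area of their union; 1.0 is a perfect match and 0.0 means no overlap
- mAP (mean Average Precision): Primary detection metric; the mean of the per-class Average Precision (AP) values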
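To make the IoU definition concrete, here is a minimal sketch in plain Python. The function name `iou` and the sample boxes are illustrative assumptions; boxes use the same (left, top, width, height) normalized layout described above.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (left, top, width, height) boxes."""
    a_left, a_top, a_w, a_h = box_a
    b_left, b_top, b_w, b_h = box_b
    # Width and height of the overlapping rectangle (zero if the boxes are disjoint)
    inter_w = max(0.0, min(a_left + a_w, b_left + b_w) - max(a_left, b_left))
    inter_h = max(0.0, min(a_top + a_h, b_top + b_h) - max(a_top, b_top))
    intersection = inter_w * inter_h
    union = a_w * a_h + b_w * b_h - intersection
    return intersection / union if union > 0 else 0.0


predicted = (0.10, 0.30, 0.25, 0.20)  # hypothetical model output
actual = (0.15, 0.30, 0.25, 0.20)     # hypothetical labeled region
print(f"IoU: {iou(predicted, actual):.3f}")  # IoU: 0.667
```

A detection is usually counted as correct only when its IoU with a labeled box exceeds a chosen overlap threshold.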
Prerequisites
- Azure subscription
- Custom Vision Training + Prediction resources
- Python 3.9+
- Package: azure-cognitiveservices-vision-customvision
Implementation
Task 1: Create Object Detection Project
- Python SDK
import os
import time
from azure.cognitiveservices.vision.customvision.training import CustomVisionTrainingClient
from azure.cognitiveservices.vision.customvision.training.models import (
    ImageUrlCreateEntry, Region
)
from azure.cognitiveservices.vision.customvision.prediction import CustomVisionPredictionClient
from msrest.authentication import ApiKeyCredentials
training_key = os.environ["CUSTOM_VISION_TRAINING_KEY"]
training_endpoint = os.environ["CUSTOM_VISION_TRAINING_ENDPOINT"]
credentials = ApiKeyCredentials(in_headers={"Training-key": training_key})
trainer = CustomVisionTrainingClient(training_endpoint, credentials)
# Find the Object Detection domain
domains = trainer.get_domains()
obj_detection_domain = next(d for d in domains if d.type == "ObjectDetection" and not d.exportable)
print(f"Domain: {obj_detection_domain.name} ({obj_detection_domain.id})")
# Create object detection project
project = trainer.create_project(
    name="Vehicle-Detector",
    domain_id=obj_detection_domain.id
)
print(f"Created project: {project.name} ({project.id})")
# Create tags for objects to detect
car_tag = trainer.create_tag(project.id, "car")
truck_tag = trainer.create_tag(project.id, "truck")
bicycle_tag = trainer.create_tag(project.id, "bicycle")
print(f"Tags: car={car_tag.id}, truck={truck_tag.id}, bicycle={bicycle_tag.id}")
Task 2: Upload Images with Bounding Box Regions
- Python SDK
# Regions use normalized coordinates (0.0 to 1.0 relative to image dimensions)
# Format: Region(tag_id, left, top, width, height)
training_images = [
    {
        "url": "https://example.com/traffic1.jpg",
        "regions": [
            Region(tag_id=car_tag.id, left=0.1, top=0.3, width=0.25, height=0.2),
            Region(tag_id=car_tag.id, left=0.5, top=0.35, width=0.2, height=0.18),
            Region(tag_id=truck_tag.id, left=0.7, top=0.2, width=0.28, height=0.3),
        ]
    },
    {
        "url": "https://example.com/traffic2.jpg",
        "regions": [
            Region(tag_id=bicycle_tag.id, left=0.05, top=0.4, width=0.15, height=0.25),
            Region(tag_id=car_tag.id, left=0.4, top=0.3, width=0.3, height=0.22),
        ]
    }
]
# Upload images with regions
image_entries = []
for img in training_images:
    entry = ImageUrlCreateEntry(
        url=img["url"],
        regions=img["regions"]
    )
    image_entries.append(entry)
upload_result = trainer.create_images_from_urls(
    project.id,
    images=image_entries
)
print(f"Upload success: {upload_result.is_batch_successful}")
for image in upload_result.images:
    print(f" {image.source_url}: {image.status}")
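Labeling tools often report boxes in pixels, while the regions above are normalized. The following is a minimal sketch of the conversion, with a range check so that out-of-bounds boxes are caught before upload. The helper name `pixel_box_to_normalized` and the sample numbers are assumptions for illustration and are not part of the SDK.

```python
def pixel_box_to_normalized(x, y, box_width, box_height, image_width, image_height):
    """Convert a pixel box (top-left x, y plus size) to normalized 0.0-1.0 coordinates."""
    if image_width <= 0 or image_height <= 0:
        raise ValueError("image dimensions must be positive")
    left = x / image_width
    top = y / image_height
    width = box_width / image_width
    height = box_height / image_height
    # Reject boxes that leave the image; the small tolerance absorbs float rounding
    eps = 1e-9
    if left < 0 or top < 0 or width <= 0 or height <= 0:
        raise ValueError("box must have positive size and start inside the image")
    if left + width > 1 + eps or top + height > 1 + eps:
        raise ValueError("box extends past the image edge")
    return {"left": left, "top": top, "width": width, "height": height}


# Hypothetical 480x216 pixel box at (192, 324) in a 1920x1080 image
coords = pixel_box_to_normalized(192, 324, 480, 216, image_width=1920, image_height=1080)
print(coords)  # {'left': 0.1, 'top': 0.3, 'width': 0.25, 'height': 0.2}
```

The resulting dictionary could then be unpacked into a region, for example `Region(tag_id=car_tag.id, **coords)`.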
Task 3: Train and Evaluate Object Detection Model
- Python SDK
# Train the model
# Note: the two images from Task 2 only illustrate the upload format. Custom Vision
# requires at least 15 tagged images per tag before it will train an object detector.
print("Training object detection model...")
iteration = trainer.train_project(project.id)
# Poll until training finishes; stop on "Failed" so the loop cannot spin forever
while iteration.status not in ("Completed", "Failed"):
    time.sleep(10)
    iteration = trainer.get_iteration(project.id, iteration.id)
    print(f" Status: {iteration.status}")
if iteration.status == "Failed":
    raise RuntimeError(f"Training failed for iteration {iteration.id}")
print(f"Training complete: {iteration.id}")
# Evaluate performance
performance = trainer.get_iteration_performance(project.id, iteration.id)
print("\nDetection Metrics:")
print(f" Precision: {performance.precision:.4f}")
print(f" Recall: {performance.recall:.4f}")
print(f" mAP: {performance.average_precision:.4f}")
for tag_perf in performance.per_tag_performance:
    print(f" '{tag_perf.name}': precision={tag_perf.precision:.3f}, "
          f"recall={tag_perf.recall:.3f}, AP={tag_perf.average_precision:.3f}")
# Publish (replace <sub-id> with your subscription ID)
prediction_resource_id = "/subscriptions/<sub-id>/resourceGroups/rg-ai102-customvision/providers/Microsoft.CognitiveServices/accounts/cv-prediction-ai102"
publish_name = "vehicle-detector-v1"
trainer.publish_iteration(project.id, iteration.id, publish_name, prediction_resource_id)
print(f"\nPublished as: {publish_name}")
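The service computes AP and mAP for you, but it can help to see how such numbers arise. Below is a generic, non-interpolated sketch of Average Precision for one class and of mAP as the mean across classes. It is not necessarily the exact procedure Custom Vision uses. The function names and the sample detections are assumptions, and each detection is assumed to have already been marked as a true or false positive using an IoU threshold.

```python
def average_precision(scored_matches, total_ground_truth):
    """AP for one class.

    scored_matches: list of (confidence, is_true_positive), one per detection.
    total_ground_truth: number of labeled objects of this class.
    """
    if total_ground_truth == 0:
        return 0.0
    # Rank detections from most to least confident
    ranked = sorted(scored_matches, key=lambda m: m[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            true_positives += 1
            precision_sum += true_positives / rank  # precision at this recall step
    return precision_sum / total_ground_truth


def mean_average_precision(per_class_ap):
    """mAP: unweighted mean of the per-class AP values."""
    return sum(per_class_ap.values()) / len(per_class_ap)


# Hypothetical detections: (confidence, matched a labeled box?)
car_ap = average_precision([(0.95, True), (0.87, True), (0.60, False), (0.55, True)], 4)
truck_ap = average_precision([(0.82, True), (0.40, False)], 1)
print(car_ap, truck_ap)                                              # 0.6875 1.0
print(mean_average_precision({"car": car_ap, "truck": truck_ap}))    # 0.84375
```

Note how the missed fourth car (only three of four labeled cars were found) lowers the car AP even though the early detections were correct.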
Task 4: Run Object Detection Predictions
- Python SDK
prediction_key = os.environ["CUSTOM_VISION_PREDICTION_KEY"]
prediction_endpoint = os.environ["CUSTOM_VISION_PREDICTION_ENDPOINT"]
pred_credentials = ApiKeyCredentials(in_headers={"Prediction-key": prediction_key})
predictor = CustomVisionPredictionClient(prediction_endpoint, pred_credentials)
# Detect objects in a new image
test_url = "https://example.com/street-scene.jpg"
results = predictor.detect_image_url(project.id, publish_name, url=test_url)
print("\nDetection Results:")
print(f"Objects found: {len(results.predictions)}")
for detection in results.predictions:
    if detection.probability > 0.5:  # Confidence threshold
        bbox = detection.bounding_box
        print(f" {detection.tag_name} ({detection.probability:.1%})")
        print(f" Box: left={bbox.left:.3f}, top={bbox.top:.3f}, "
              f"width={bbox.width:.3f}, height={bbox.height:.3f}")
# Convert normalized to pixel coordinates (for a 1920x1080 image)
image_width, image_height = 1920, 1080
for detection in results.predictions:
    if detection.probability > 0.5:
        bbox = detection.bounding_box
        pixel_left = int(bbox.left * image_width)
        pixel_top = int(bbox.top * image_height)
        pixel_width = int(bbox.width * image_width)
        pixel_height = int(bbox.height * image_height)
        print(f" {detection.tag_name}: ({pixel_left}, {pixel_top}) -> ({pixel_left+pixel_width}, {pixel_top+pixel_height})")
- REST API
PREDICTION_ENDPOINT="https://<resource>.cognitiveservices.azure.com"
PREDICTION_KEY="<key>"
PROJECT_ID="<project-id>"
curl -s "${PREDICTION_ENDPOINT}/customvision/v3.0/prediction/${PROJECT_ID}/detect/iterations/vehicle-detector-v1/url" \
  -H "Prediction-Key: ${PREDICTION_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/street-scene.jpg"}' \
  | jq '.predictions[] | select(.probability > 0.5) | {tag: .tagName, probability: .probability, boundingBox}'
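If `jq` is not available, the same filtering can be done in Python on the JSON body returned by the REST call. This is a minimal sketch: the helper name `confident_detections` is an assumption, and `sample_response` is a hand-written, abbreviated stand-in for a real response that keeps only the fields the jq filter above reads.

```python
import json

# Abbreviated, made-up response body; a real response carries more fields
sample_response = """
{
  "predictions": [
    {"tagName": "car", "probability": 0.952,
     "boundingBox": {"left": 0.102, "top": 0.298, "width": 0.245, "height": 0.198}},
    {"tagName": "truck", "probability": 0.821,
     "boundingBox": {"left": 0.720, "top": 0.180, "width": 0.260, "height": 0.310}},
    {"tagName": "bicycle", "probability": 0.120,
     "boundingBox": {"left": 0.400, "top": 0.500, "width": 0.100, "height": 0.150}}
  ]
}
"""


def confident_detections(response_text, threshold=0.5):
    """Keep predictions above the threshold, reshaped like the jq filter's output."""
    body = json.loads(response_text)
    return [
        {"tag": p["tagName"], "probability": p["probability"], "boundingBox": p["boundingBox"]}
        for p in body.get("predictions", [])
        if p["probability"] > threshold
    ]


for det in confident_detections(sample_response):
    print(det["tag"], det["probability"], det["boundingBox"])
```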
Expected Output
Domain: General (<domain-id>)
Created project: Vehicle-Detector (<project-id>)
Tags: car=..., truck=..., bicycle=...
Upload success: True
Training object detection model...
Status: Training
Status: Completed
Training complete: iter-67890
Detection Metrics:
Precision: 0.8850
Recall: 0.8200
mAP: 0.8734
'car': precision=0.920, recall=0.880, AP=0.910
'truck': precision=0.870, recall=0.790, AP=0.850
'bicycle': precision=0.865, recall=0.770, AP=0.860
Published as: vehicle-detector-v1
Detection Results:
Objects found: 4
car (95.2%)
Box: left=0.102, top=0.298, width=0.245, height=0.198
car (87.3%)
Box: left=0.510, top=0.320, width=0.190, height=0.175
truck (82.1%)
Box: left=0.720, top=0.180, width=0.260, height=0.310
Break & fix
| Scenario | Symptom | Root Cause | Fix |
|---|---|---|---|
| Regions rejected | Invalid region coordinates | Coordinates outside 0.0–1.0 range | Normalize: left+width ≤ 1.0, top+height ≤ 1.0 |
| Low mAP | Poor detection accuracy | Inconsistent bounding box labeling or too few examples | Re-label with tight, consistent boxes; add more training data |
| Overlapping detections | Duplicate predictions for the same object | Several overlapping boxes returned and no filtering applied by the client | Apply a confidence threshold; run Non-Maximum Suppression (NMS) on the results |
| Training fails | BadRequestImageRegions | Regions too small or missing | Minimum region size ~5% of image area |
| Wrong endpoint | 404 on detection | Using classify endpoint for detection | Use /detect/ not /classify/ in prediction URL |
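The Non-Maximum Suppression fix mentioned in the table can be applied client-side after predictions come back. The following is a minimal greedy sketch, not a service feature. The function names, the dictionary layout of each detection, and the 0.5 overlap threshold are all assumptions chosen for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (left, top, width, height) boxes."""
    a_left, a_top, a_w, a_h = box_a
    b_left, b_top, b_w, b_h = box_b
    inter_w = max(0.0, min(a_left + a_w, b_left + b_w) - max(a_left, b_left))
    inter_h = max(0.0, min(a_top + a_h, b_top + b_h) - max(a_top, b_top))
    intersection = inter_w * inter_h
    union = a_w * a_h + b_w * b_h - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Greedy NMS: keep the most confident box, drop same-tag boxes that overlap it."""
    kept = []
    for det in sorted(detections, key=lambda d: d["probability"], reverse=True):
        duplicate = any(
            k["tag"] == det["tag"] and iou(k["box"], det["box"]) > iou_threshold
            for k in kept
        )
        if not duplicate:
            kept.append(det)
    return kept


# Hypothetical predictions: two near-identical car boxes and one truck
detections = [
    {"tag": "car", "probability": 0.70, "box": (0.11, 0.31, 0.25, 0.20)},
    {"tag": "car", "probability": 0.95, "box": (0.10, 0.30, 0.25, 0.20)},
    {"tag": "truck", "probability": 0.82, "box": (0.70, 0.20, 0.28, 0.30)},
]
for det in non_max_suppression(detections):
    print(det["tag"], det["probability"])
```

Suppression here is per tag, so a car and a truck that overlap are both kept; whether that is desirable depends on the scene.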
Knowledge Check
1. How are bounding box coordinates represented in Custom Vision object detection?
2. What does mAP (mean Average Precision) measure in object detection?
3. What is the key difference between the classify and detect prediction endpoints?
4. What is IoU (Intersection over Union) used for?
5. When labeling training images for object detection, what coordinates do you need for each object?
Cleanup
az group delete --name rg-ai102-customvision --yes --no-wait