Challenge 12: Optical Character Recognition (OCR)
25-35 min | Cost: Free | Domain: Computer Vision on Azure (15-20%)
Exam skills covered
- Identify features of optical character recognition (OCR) solutions
- Understand the difference between OCR and document intelligence
- Identify Azure services for reading text from images
- Describe capabilities of the Read API
Overview
Optical Character Recognition (OCR) is the technology that extracts text from images and documents. Any time you photograph a document, scan a receipt, or point your phone at a sign and it "reads" the text — that's OCR in action.
Think of OCR like teaching a computer to read. When you look at a photo of a restaurant menu, you instantly recognize letters and words. OCR does the same thing — it identifies the shapes of characters in an image and converts them into machine-readable text that applications can process, search, and store.
Azure provides OCR through two main services: Azure AI Vision (Read API) for general text extraction from images, and Azure AI Document Intelligence for structured document processing. The Read API handles printed and handwritten text from any image. Document Intelligence goes further — it understands document structure (fields, tables, key-value pairs) from specific document types like invoices, receipts, and forms.
Explore
Task 1: OCR vs Document Intelligence
| Feature | Azure AI Vision (Read API) | Azure AI Document Intelligence |
|---|---|---|
| What it extracts | Raw text from images | Structured fields, tables, and key-value pairs |
| Input | Any image with text | Documents (invoices, receipts, forms, IDs) |
| Output | Lines and words with positions | Named fields (e.g., "InvoiceTotal: $1,234.56") |
| Use case | Read a sign, extract text from a screenshot | Process 10,000 invoices and extract totals, dates, vendors |
| Analogy | Reading text out loud | Filling in a spreadsheet from a form |
Key distinction: OCR reads the text in an image but does not interpret it. Document Intelligence understands document structure: it knows which number is the "total" and which is the "date."
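To make the distinction concrete, here is a minimal sketch that uses made-up sample data rather than real service output. The `ocr_lines` list, the `find_total` helper, and the `invoice_fields` dictionary are all hypothetical names chosen for illustration. With raw OCR text, your own code has to locate the total. With structured output, the field is already labeled.

```python
import re

# Hypothetical OCR-style output: plain lines of text in reading order.
ocr_lines = [
    "Contoso Ltd",
    "Invoice Date: 2024-03-01",
    "Total: $1,234.56",
]


def find_total(lines):
    """Search raw OCR lines for a total; the caller must know the layout."""
    for line in lines:
        match = re.search(r"Total:\s*(\$[\d,]+\.\d{2})", line)
        if match:
            return match.group(1)
    return None


# Hypothetical structured output: each value arrives with a field name.
invoice_fields = {
    "VendorName": "Contoso Ltd",
    "InvoiceDate": "2024-03-01",
    "InvoiceTotal": "$1,234.56",
}

print(find_total(ocr_lines))           # found by pattern matching on raw text
print(invoice_fields["InvoiceTotal"])  # looked up by field name
```

The pattern-matching approach breaks as soon as a vendor prints "Amount due" instead of "Total", which is the kind of variation structured extraction is meant to absorb.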
Task 2: Try the Azure AI Vision OCR demo
- Visit the Azure AI Vision demo
- Select the "Extract text from images" option
- Try with a sample image or upload your own (photo of a sign, document, or handwriting)
- Observe the results:
  - Text is extracted line by line
  - Each word has position coordinates (bounding polygon)
  - Both printed and handwritten text can be detected
  - The text is returned in reading order
Task 3: Understand the Read API response structure
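The Read API returns a hierarchical structure: the result is organized into pages (or blocks, depending on the API version), each of which contains lines, and each line contains words. Lines and words carry their text and a bounding polygon, and each word also carries a confidence score.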
Key features of the Read API:
- Handles printed and handwritten text
- Supports more than 120 languages
- Works with rotated and skewed text
- Processes multi-page documents (PDF, TIFF)
- Returns confidence scores for each word
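The sketch below shows one way to walk a Read-style result. The `sample_result` dictionary is a hypothetical, simplified shape (pages, then lines, then words) and does not reproduce the exact JSON the service returns. The field names `page_number`, `polygon`, and `confidence` and the helper functions are assumptions made for illustration.

```python
# Hypothetical, simplified Read-style result: pages -> lines -> words.
sample_result = {
    "pages": [
        {
            "page_number": 1,
            "lines": [
                {
                    "text": "OPEN 24 HOURS",
                    "words": [
                        {"text": "OPEN", "polygon": [10, 10, 60, 10, 60, 30, 10, 30], "confidence": 0.99},
                        {"text": "24", "polygon": [66, 10, 90, 10, 90, 30, 66, 30], "confidence": 0.97},
                        {"text": "HOURS", "polygon": [96, 10, 160, 10, 160, 30, 96, 30], "confidence": 0.62},
                    ],
                },
            ],
        },
    ],
}


def iter_words(result):
    """Yield (page number, line text, word dict) for every word in the result."""
    for page in result["pages"]:
        for line in page["lines"]:
            for word in line["words"]:
                yield page["page_number"], line["text"], word


def low_confidence_words(result, threshold=0.8):
    """Return the text of words whose confidence falls below the threshold."""
    return [word["text"] for _, _, word in iter_words(result) if word["confidence"] < threshold]


# Joining the line texts gives the page content in reading order.
full_text = "\n".join(
    line["text"] for page in sample_result["pages"] for line in page["lines"]
)

print(full_text)
print(low_confidence_words(sample_result))
```

Flagging low-confidence words for human review is a common pattern. The 0.8 threshold here is arbitrary and would need tuning for real documents.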
Task 4: Document Intelligence prebuilt models
Azure AI Document Intelligence offers prebuilt models for common document types:
| Prebuilt model | What it extracts |
|---|---|
| Invoice | Vendor name, invoice total, due date, line items |
| Receipt | Merchant, date, total, tax, items purchased |
| ID Document | Name, date of birth, document number, expiration |
| Business Card | Name, company, email, phone number |
| W-2 Tax form | Employee info, wages, taxes withheld |
| Health Insurance Card | Member info, plan details, group number |
Custom models: If your documents don't match prebuilt models, you can train Document Intelligence with your own document samples.
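The choice between a prebuilt and a custom model can be sketched as a simple lookup. The model IDs below follow the naming used in the Document Intelligence documentation, but availability varies by API version, so verify them against the version you target. The dictionary keys, the `choose_model` helper, and the custom model ID are hypothetical.

```python
# Model IDs as named in the Document Intelligence documentation.
# Not every prebuilt model is offered in every API version.
PREBUILT_MODEL_IDS = {
    "invoice": "prebuilt-invoice",
    "receipt": "prebuilt-receipt",
    "id_document": "prebuilt-idDocument",
    "business_card": "prebuilt-businessCard",
    "w2": "prebuilt-tax.us.w2",
    "health_insurance_card": "prebuilt-healthInsuranceCard.us",
}


def choose_model(document_type, custom_model_id=None):
    """Prefer a prebuilt model; fall back to a custom model you trained."""
    if document_type in PREBUILT_MODEL_IDS:
        return PREBUILT_MODEL_IDS[document_type]
    if custom_model_id is not None:
        return custom_model_id
    raise ValueError(
        f"No prebuilt model for {document_type!r}; train a custom model first."
    )


print(choose_model("invoice"))
print(choose_model("shipping_manifest", custom_model_id="my-manifest-model"))
```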
# Look up the endpoint of your Azure AI services resource
az cognitiveservices account show \
  --name my-ai-services \
  --resource-group my-rg \
  --query "properties.endpoint"
# Document Intelligence is accessed via the REST API, for example:
# POST {endpoint}/documentintelligence/documentModels/prebuilt-invoice:analyze?api-version=2024-11-30
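As a rough sketch, the REST call above could be assembled in Python with only the standard library. The code builds the request but does not send it. The `build_analyze_request` helper, the endpoint host, the placeholder key, the sample document URL, and the default `api_version` are all assumptions, so substitute the values and API version your resource supports.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_analyze_request(endpoint, key, document_url,
                          model_id="prebuilt-invoice",
                          api_version="2024-11-30"):
    """Build (but do not send) a POST request to the analyze operation."""
    url = (
        f"{endpoint.rstrip('/')}/documentintelligence/documentModels/"
        f"{quote(model_id)}:analyze?{urlencode({'api-version': api_version})}"
    )
    # The document to analyze is passed by URL in the JSON body.
    body = json.dumps({"urlSource": document_url}).encode("utf-8")
    headers = {
        "Ocp-Apim-Subscription-Key": key,  # resource key; keep it out of source control
        "Content-Type": "application/json",
    }
    return Request(url, data=body, headers=headers, method="POST")


# Hypothetical endpoint, key, and document URL for illustration only.
request = build_analyze_request(
    "https://my-ai-services.cognitiveservices.azure.com/",
    "<your-key>",
    "https://example.com/invoice.pdf",
)
print(request.get_method(), request.full_url)
```

The analyze operation is asynchronous: a successful POST returns 202 Accepted with an `Operation-Location` header, and you poll that URL until the result is ready.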
Key Concepts
| Concept | Definition |
|---|---|
| OCR (Optical Character Recognition) | Technology that extracts text from images and scanned documents |
| Read API | Azure AI Vision capability that extracts printed and handwritten text |
| Azure AI Document Intelligence | Service that extracts structured data (fields, tables) from documents |
| Bounding box/polygon | Coordinates indicating where each word/line appears in the image |
| Printed text | Machine-generated text (fonts) — higher accuracy |
| Handwritten text | Human-written text — more challenging, lower accuracy |
| Prebuilt model | Pre-trained Document Intelligence model for specific document types |
| Custom model | User-trained Document Intelligence model for unique document formats |
| Confidence score | Reliability measure (0-1) for each extracted word |
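A bounding polygon is a list of corner points, and it need not be axis-aligned because text can be rotated or skewed. The sketch below converts a polygon into a simple upright box, for example to crop or highlight a word. It assumes the polygon is a flat list of alternating x and y values, which is one common layout; some API versions return a list of points instead. The `polygon_to_box` helper and the sample coordinates are hypothetical.

```python
def polygon_to_box(polygon):
    """Convert a flat [x1, y1, x2, y2, ...] polygon to an upright box."""
    xs = polygon[0::2]  # every even index is an x coordinate
    ys = polygon[1::2]  # every odd index is a y coordinate
    left, top = min(xs), min(ys)
    return {
        "left": left,
        "top": top,
        "width": max(xs) - left,
        "height": max(ys) - top,
    }


# A slightly skewed word: the four corners do not line up exactly.
word_polygon = [12, 40, 98, 36, 100, 62, 14, 66]
print(polygon_to_box(word_polygon))
```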
Common Misconceptions
| Misconception | Reality |
|---|---|
| "OCR and Document Intelligence are the same thing" | OCR extracts raw text (characters and words). Document Intelligence understands document STRUCTURE — it knows which text is a date, which is a total, and which is a vendor name |
| "OCR only works with printed text" | Azure's Read API handles both printed and handwritten text. Printed text typically has higher accuracy, but handwriting recognition has improved dramatically |
| "OCR requires perfectly clear, straight images" | Modern OCR handles rotated, skewed, and even partially obscured text. The Read API compensates for imperfect image quality |
| "Document Intelligence requires custom training for every document type" | Prebuilt models work immediately for common documents (invoices, receipts, IDs). Custom training is only needed for unique/proprietary document formats |
| "OCR gives you structured data directly" | OCR gives you raw text in reading order. For structured data (key-value pairs, tables), you need Document Intelligence, which builds on OCR but adds document understanding |
Knowledge Check
1. A company receives thousands of paper invoices and needs to automatically extract the vendor name, invoice date, and total amount into their accounting system. Which Azure service is most appropriate?
2. A developer needs to extract all text from photographs of street signs in multiple languages. Which Azure capability should they use?
3. What does the Read API return in addition to the extracted text?
4. Which of the following can the Azure AI Vision Read API handle?
5. What is the key difference between OCR (Read API) and Document Intelligence?