Challenge 32: PII Detection and Redaction

Estimated Time

40 min | Cost: $1-2 (estimated) | Domain: Implement NLP Solutions (15-20%)

Exam skills covered

  • Detect PII (Personally Identifiable Information) in text
  • Redact sensitive data from documents
  • Configure PII categories for targeted detection

Overview

PII Detection identifies and optionally redacts sensitive information in text. Categories include:

Category | Examples
Person | Names
Email | email@domain.com
PhoneNumber | +1-555-123-4567
Address | Street addresses
SSN | Social Security Numbers (US)
CreditCardNumber | Credit card numbers
IPAddress | IP addresses
Organization | Company names (when PII)
DateTime | Dates of birth

The API returns both the detected entities and a redacted version of the text in which each PII span is masked, by default with asterisks (*).
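
How the entity list relates to the redacted text can be sketched offline, without calling the service. The example below is a minimal illustration, not the service's implementation: `Span` and `mask_spans` are hypothetical names, and the offsets are hand-computed for the sample sentence. Each span is overwritten with a mask character repeated to the span's length, so the masked text keeps the original length.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for a detected entity (offset and length only)."""
    offset: int
    length: int


def mask_spans(text: str, spans: list[Span], mask_char: str = "*") -> str:
    """Overwrite each span with mask_char, keeping the text length unchanged."""
    chars = list(text)
    for span in spans:
        end = span.offset + span.length
        # Assigning a string to a list slice replaces it character by character
        chars[span.offset:end] = mask_char * span.length
    return "".join(chars)


text = "My name is John Smith and my email is john.smith@contoso.com."
# Hand-computed spans for 'John Smith' and 'john.smith@contoso.com'
spans = [Span(offset=11, length=10), Span(offset=38, length=22)]
print(mask_spans(text, spans))
```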

Prerequisites

  • Azure subscription
  • Azure AI Language resource
  • Python 3.9+ or .NET 8
  • Package: azure-ai-textanalytics (v5.3+)

Implementation

Task 1: Detect PII in Text

import os
from azure.ai.textanalytics import TextAnalyticsClient, PiiEntityCategory
from azure.core.credentials import AzureKeyCredential

# Endpoint and key are read from environment variables
client = TextAnalyticsClient(
    endpoint=os.environ["AZURE_AI_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_AI_KEY"]),
)

documents = [
    "My name is John Smith and my email is john.smith@contoso.com. "
    "My SSN is 123-45-6789 and I live at 123 Main Street, Seattle, WA 98101. "
    "You can reach me at 555-123-4567.",

    "Patient Jane Doe (DOB: 03/15/1985) was seen at the clinic. "
    "Insurance ID: ABC-123456789. Credit card ending in 4532.",
]

# Detect all PII
results = client.recognize_pii_entities(documents, language="en")

for idx, result in enumerate(results):
    # Each document succeeds or fails independently
    if result.is_error:
        print(f"Error: {result.error.message}")
        continue

    print(f"Document {idx}:")
    print(f" Redacted: {result.redacted_text}")
    print(f" Entities found: {len(result.entities)}")

    for entity in result.entities:
        print(f" [{entity.category}] '{entity.text}' "
              f"(confidence: {entity.confidence_score:.3f}, "
              f"offset: {entity.offset}, length: {entity.length})")
    print()
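
Task 1 sends every document in one call. The service limits how many documents a single request may carry, so a larger collection has to be split into batches first. A minimal sketch of that split follows; `chunked` is a hypothetical helper (not part of the SDK), and the default batch size of 5 is an assumption to check against the current service limits.

```python
from typing import Iterator


def chunked(items: list[str], size: int = 5) -> Iterator[list[str]]:
    """Yield consecutive slices of items, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Twelve placeholder documents, split using the assumed per-request limit of 5
docs = [f"document {i}" for i in range(12)]
batches = list(chunked(docs))
print([len(batch) for batch in batches])
```

Each batch could then be passed to client.recognize_pii_entities(batch, language="en") in turn, with the per-document results collected in order.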

Task 2: Filter by Specific PII Categories

# Detect only specific PII categories
# (reuses client and documents from Task 1)
results = client.recognize_pii_entities(
    documents,
    language="en",
    categories_filter=[
        PiiEntityCategory.US_SOCIAL_SECURITY_NUMBER,
        PiiEntityCategory.CREDIT_CARD_NUMBER,
        PiiEntityCategory.EMAIL,
        PiiEntityCategory.PHONE_NUMBER,
    ],
)

print("=== FILTERED PII (SSN, CC, Email, Phone only) ===")
for idx, result in enumerate(results):
    if not result.is_error:
        print(f"Doc {idx} redacted: {result.redacted_text}")
        for entity in result.entities:
            print(f" [{entity.category}] '{entity.text}'")
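
categories_filter narrows detection on the service side. A confidence cutoff, by contrast, is something you apply in your own code after the results come back. The sketch below shows one way to do that. `Entity` is a hypothetical stand-in that mirrors the attribute names printed above, `keep_confident` is an illustrative helper, and the 0.8 cutoff and the sample scores are arbitrary.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """Hypothetical stand-in mirroring the attributes used in the tasks above."""
    text: str
    category: str
    confidence_score: float


def keep_confident(entities: list[Entity], min_score: float = 0.8) -> list[Entity]:
    """Return only the entities whose confidence is at or above min_score."""
    return [e for e in entities if e.confidence_score >= min_score]


# Made-up sample values for illustration only
sample = [
    Entity("john.smith@contoso.com", "Email", 0.99),
    Entity("Main", "Person", 0.42),
    Entity("555-123-4567", "PhoneNumber", 0.80),
]
for entity in keep_confident(sample):
    print(f"[{entity.category}] '{entity.text}'")
```

Raising the cutoff reduces over-redaction at the cost of missing more true PII, so the value is a policy decision rather than a technical one.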

Task 3: Domain-Specific PII (PHI for Healthcare)

# Use healthcare domain for PHI (Protected Health Information)
# (reuses client from Task 1)
healthcare_docs = [
    "Patient John Doe, MRN: 12345, was diagnosed with diabetes on 01/15/2024. "
    "Prescribed metformin 500mg. Next appointment: 02/15/2024 with Dr. Smith."
]

results = client.recognize_pii_entities(
    healthcare_docs,
    language="en",
    domain_filter="phi",  # Protected Health Information domain
)

print("=== PHI DETECTION (Healthcare Domain) ===")
for result in results:
    if not result.is_error:
        print(f"Redacted: {result.redacted_text}\n")
        for entity in result.entities:
            print(f" [{entity.category}] '{entity.text}' ({entity.confidence_score:.3f})")

Expected Output

Document 0:
Redacted: My name is ********** and my email is **********************.
My SSN is *********** and I live at **********************************.
You can reach me at ************.
Entities found: 5
[Person] 'John Smith' (confidence: 0.950, offset: 11, length: 10)
[Email] 'john.smith@contoso.com' (confidence: 0.990, offset: 38, length: 22)
[USSocialSecurityNumber] '123-45-6789' (confidence: 0.980, offset: 72, length: 11)
[Address] '123 Main Street, Seattle, WA 98101' (confidence: 0.920, offset: 98, length: 34)
[PhoneNumber] '555-123-4567' (confidence: 0.970, offset: 154, length: 12)

=== FILTERED PII (SSN, CC, Email, Phone only) ===
Doc 0 redacted: My name is John Smith and my email is **********************.
My SSN is *********** and I live at 123 Main Street, Seattle, WA 98101.
You can reach me at ************.

=== PHI DETECTION (Healthcare Domain) ===
Redacted: Patient ********, MRN: *****, was diagnosed with diabetes on **********...
[Person] 'John Doe' (0.980)
[MedicalRecordNumber] '12345' (0.920)
[DateTime] '01/15/2024' (0.990)
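
Offsets and lengths index into the original document string. The Python SDK reports them in Unicode code points by default, which matches Python string indexing, so slicing the document with an entity's offset and length should give back the entity text. The sketch below runs that check offline on the first sample document. `locate` is a hypothetical helper that finds the position with str.index instead of calling the service.

```python
doc = (
    "My name is John Smith and my email is john.smith@contoso.com. "
    "My SSN is 123-45-6789 and I live at 123 Main Street, Seattle, WA 98101. "
    "You can reach me at 555-123-4567."
)


def locate(text: str, value: str) -> tuple[int, int]:
    """Return (offset, length) of the first occurrence of value in text."""
    return text.index(value), len(value)


values = [
    "John Smith",
    "john.smith@contoso.com",
    "123-45-6789",
    "123 Main Street, Seattle, WA 98101",
    "555-123-4567",
]
for value in values:
    offset, length = locate(doc, value)
    # Slicing with offset and length must recover the exact text
    assert doc[offset:offset + length] == value
    print(f"'{value}' offset: {offset}, length: {length}")
```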

Break & fix

Scenario | Symptom | Root Cause | Fix
PII not detected | Entity missed | Unusual format or low confidence | Check supported formats and language; relax any confidence cutoff applied in your own code
Over-redaction | Non-PII text removed | Broad category detection | Use categories_filter to target specific PII types
Wrong category | Email detected as URL | Overlapping categories and ambiguous patterns | Compare confidence_score values and narrow detection with categories_filter
Redaction format wrong | Stars instead of labels | Default redaction uses * characters | Expected behavior: redacted_text masks PII with asterisks; to get category labels, rebuild the text from each entity's offset, length, and category
PHI not detected | Healthcare entities missed | Using default domain | Set domain_filter="phi" for healthcare-specific detection
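
For the "Redaction format wrong" row, label-style output can be built client-side from the entity list. The sketch below is one possible approach under stated assumptions: `Entity` is a hypothetical stand-in carrying only category, offset, and length; the spans are assumed not to overlap; and the offsets are hand-computed for the sample sentence. Replacements run from right to left so that earlier offsets stay valid as the string changes length.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """Hypothetical stand-in for a detected entity."""
    category: str
    offset: int
    length: int


def redact_with_labels(text: str, entities: list[Entity]) -> str:
    """Replace each entity span with its category in brackets.

    Assumes the spans do not overlap.
    """
    redacted = text
    # Work right to left so offsets to the left are not shifted
    for e in sorted(entities, key=lambda e: e.offset, reverse=True):
        redacted = (
            redacted[:e.offset] + f"[{e.category}]" + redacted[e.offset + e.length:]
        )
    return redacted


text = "My name is John Smith and my email is john.smith@contoso.com."
entities = [Entity("Person", 11, 10), Entity("Email", 38, 22)]
print(redact_with_labels(text, entities))
```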

Knowledge Check

1. What does the redacted_text property contain?

2. How do you limit PII detection to only specific categories like SSN and email?

3. What is the 'phi' domain filter used for?

4. What information does each detected PII entity include?

5. Can PII detection process multiple documents in a single request?

Cleanup

az group delete --name rg-ai102-nlp --yes --no-wait
