PII/PHI Detection and Data Masking

Identify sensitive personal and health information in user input and model output, apply policy-driven actions, and mask detected values before downstream processing.

Overview

PII/PHI detection runs as a middleware step before agent or LLM processing, and optionally on model output. It integrates Microsoft Presidio Analyzer to identify sensitive entities in text, supports configurable confidence thresholds and per-entity policies, and complements entity detection with regex-based PHI pattern matching for healthcare identifiers.

Use this guardrail when you need fine-grained control over how each data type is handled (flagged, blocked, sent for review, masked, or passed through) while maintaining an audit trail of every detection event.

Prerequisites

PII/PHI detection requires the spaCy English language model. Install it after installing the Elsai Guardrails package:

bash

pip install --extra-index-url https://elsai-core-package.optisolbusiness.com/root/elsai-guardrails/ elsai-guardrails==0.1.3
python -m spacy download en_core_web_lg

This download is a one-time setup step and is required before enabling PII/PHI detection in your configuration.
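
To confirm the model is available before enabling detection, a quick check like the one below can help. It is a minimal sketch that relies only on the fact that spaCy models install as importable Python packages; the helper name `spacy_model_installed` is illustrative and not part of the Elsai Guardrails package.

```python
import importlib.util


def spacy_model_installed(model_name: str = "en_core_web_lg") -> bool:
    """Return True if the spaCy model package can be found on the import path."""
    return importlib.util.find_spec(model_name) is not None


if __name__ == "__main__":
    if spacy_model_installed():
        print("en_core_web_lg is installed")
    else:
        print("en_core_web_lg is missing; run: python -m spacy download en_core_web_lg")
```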

How It Works

1. Text is analyzed before agent or LLM processing (and optionally on output).
2. Presidio identifies configured entity types with confidence scores.
3. Additional regex-based patterns detect PHI identifiers such as medical record numbers and patient IDs.
4. Each entity is evaluated against global and per-entity confidence thresholds.
5. The configured policy action is applied: flag, block, review, or pass.
6. When masking is enabled, detected values are replaced before downstream processing.
7. Detection events are logged with entity type, confidence score, action taken, session ID, and timestamp.
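
The regex-based PHI step can be pictured roughly as follows. This is a sketch under stated assumptions: the two patterns, the fixed score of 0.85, and the `Detection` and `detect_phi` names are hypothetical and chosen for illustration; the patterns the package actually ships are not documented here.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns for illustration only; not the package's real patterns.
PHI_PATTERNS = {
    "PHI_MRN": re.compile(r"\bMRN[\s:#-]*\d{6,10}\b", re.IGNORECASE),
    "PHI_PATIENT_ID": re.compile(
        r"\bpatient\s+id[\s:#-]*[A-Z0-9]{5,12}\b", re.IGNORECASE
    ),
}


@dataclass
class Detection:
    entity_type: str
    start: int
    end: int
    score: float


def detect_phi(text: str, score: float = 0.85) -> list[Detection]:
    """Return regex matches as detections, ordered by position in the text."""
    found = []
    for entity_type, pattern in PHI_PATTERNS.items():
        for match in pattern.finditer(text):
            # Regex matches carry no model confidence, so a fixed score is assumed.
            found.append(Detection(entity_type, match.start(), match.end(), score))
    return sorted(found, key=lambda d: d.start)


if __name__ == "__main__":
    sample = "Patient ID: AB12345 was admitted, MRN 00123456."
    for d in detect_phi(sample):
        print(d.entity_type, repr(sample[d.start:d.end]), d.score)
```

Returning character offsets alongside the entity type lets later steps (threshold checks, masking, logging) work from the same detection records.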

Configuration

Enable PII/PHI Detection

yaml

guardrails:
  pii:
    enabled: true
    input_checks: true
    output_checks: true
    language: en

Supported Entity Types

The following entity types can be detected and configured individually:

Entity Type	Description
`PERSON`	Personal names
`LOCATION`	Geographic locations
`EMAIL_ADDRESS`	Email addresses
`PHONE_NUMBER`	Phone numbers
`CREDIT_CARD`	Credit card numbers
`NRP`	Nationalities, religious, or political groups
`MEDICAL_LICENSE`	Medical license numbers
`US_SSN`	U.S. Social Security numbers
`IBAN_CODE`	International bank account numbers
`IP_ADDRESS`	IP addresses
`PHI_MRN`	Medical record numbers (regex-based PHI detection)
`PHI_PATIENT_ID`	Patient identifiers (regex-based PHI detection)

Specify which types to scan:

yaml

guardrails:
  pii:
    entity_types:
      - PERSON
      - LOCATION
      - EMAIL_ADDRESS
      - PHONE_NUMBER
      - CREDIT_CARD
      - NRP
      - MEDICAL_LICENSE
      - US_SSN
      - IBAN_CODE
      - IP_ADDRESS

Confidence Thresholds

Set a global default threshold and override it for specific entity types. Entities below the applicable threshold are handled according to `below_threshold_action`.

yaml

guardrails:
  pii:
    default_confidence_threshold: 0.5
    below_threshold_action: flag
    entity_thresholds:
      PERSON: 0.7
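
The lookup order described above (per-entity override first, then the global default) can be sketched in a few lines of Python. The function names are illustrative, and treating a score equal to the threshold as meeting it is an assumption; the package may compare differently.

```python
def applicable_threshold(entity_type: str, pii_config: dict) -> float:
    """Per-entity override if present, otherwise the global default."""
    overrides = pii_config.get("entity_thresholds") or {}
    default = pii_config.get("default_confidence_threshold", 0.5)
    return overrides.get(entity_type, default)


def is_below_threshold(entity_type: str, score: float, pii_config: dict) -> bool:
    # Assumption: a score equal to the threshold counts as meeting it.
    return score < applicable_threshold(entity_type, pii_config)


pii_config = {
    "default_confidence_threshold": 0.5,
    "below_threshold_action": "flag",
    "entity_thresholds": {"PERSON": 0.7},
}

print(is_below_threshold("PERSON", 0.6, pii_config))         # True
print(is_below_threshold("EMAIL_ADDRESS", 0.6, pii_config))  # False
```

With the configuration shown, a `PERSON` detection scored 0.6 falls under its 0.7 override and would be handled by `below_threshold_action`, while an `EMAIL_ADDRESS` at the same score clears the global 0.5 default.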

Policy-Based Actions

Each entity type can be assigned an action and optional masking behavior:

Action	Behavior
`flag`	Record the detection and allow processing to continue
`block`	Stop processing and reject the request
`review`	Mark for human review while allowing or holding the request
`pass`	Allow the entity through without intervention

yaml

guardrails:
  pii:
    default_action: flag
    default_mask: true
    enable_phi_detection: true
    entity_policies:
      CREDIT_CARD:
        action: block
        mask: true
      US_SSN:
        action: block
        mask: true
      EMAIL_ADDRESS:
        action: flag
        mask: true
      PHONE_NUMBER:
        action: flag
        mask: true
      PHI_MRN:
        action: review
        mask: true
      PHI_PATIENT_ID:
        action: review
        mask: true
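
One way to read this configuration is that each detected entity resolves to an action and a masking flag, falling back to `default_action` and `default_mask` when no entry exists under `entity_policies`. The sketch below illustrates that fallback; `Policy` and `resolve_policy` are hypothetical names, not the package's API.

```python
from typing import NamedTuple

VALID_ACTIONS = {"flag", "block", "review", "pass"}


class Policy(NamedTuple):
    action: str
    mask: bool


def resolve_policy(entity_type: str, pii_config: dict) -> Policy:
    """Use the per-entity policy if defined, otherwise the configured defaults."""
    entry = (pii_config.get("entity_policies") or {}).get(entity_type, {})
    action = entry.get("action", pii_config.get("default_action", "flag"))
    mask = entry.get("mask", pii_config.get("default_mask", True))
    if action not in VALID_ACTIONS:
        raise ValueError(f"Unknown action {action!r} for {entity_type}")
    return Policy(action, mask)


pii_config = {
    "default_action": "flag",
    "default_mask": True,
    "entity_policies": {
        "CREDIT_CARD": {"action": "block", "mask": True},
        "PHI_MRN": {"action": "review", "mask": True},
    },
}

print(resolve_policy("CREDIT_CARD", pii_config))  # Policy(action='block', mask=True)
print(resolve_policy("PERSON", pii_config))       # Policy(action='flag', mask=True)
```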

Complete Example

yaml

guardrails:
  input_checks: true
  output_checks: true
  check_toxicity: true
  check_sensitive_data: true
  check_semantic: true

  pii:
    enabled: true
    input_checks: true
    output_checks: true
    language: en
    default_confidence_threshold: 0.5
    below_threshold_action: flag
    default_action: flag
    default_mask: true
    enable_phi_detection: true
    entity_types:
      - PERSON
      - LOCATION
      - EMAIL_ADDRESS
      - PHONE_NUMBER
      - CREDIT_CARD
      - NRP
      - MEDICAL_LICENSE
      - US_SSN
      - IBAN_CODE
      - IP_ADDRESS
    entity_thresholds:
      PERSON: 0.7
    entity_policies:
      CREDIT_CARD:
        action: block
        mask: true
      US_SSN:
        action: block
        mask: true
      EMAIL_ADDRESS:
        action: flag
        mask: true
      PHONE_NUMBER:
        action: flag
        mask: true
      PHI_MRN:
        action: review
        mask: true
      PHI_PATIENT_ID:
        action: review
        mask: true

Configuration Reference

Option	Type	Default	Description
`enabled`	bool	`false`	Enable PII/PHI detection
`input_checks`	bool	`true`	Run detection on user input
`output_checks`	bool	`true`	Run detection on model output
`language`	str	`"en"`	Language code for entity analysis
`default_confidence_threshold`	float	`0.5`	Global minimum confidence for entity recognition
`below_threshold_action`	str	`"flag"`	Action for entities below their threshold
`default_action`	str	`"flag"`	Default action when no entity policy is defined
`default_mask`	bool	`true`	Mask detected values by default
`enable_phi_detection`	bool	`true`	Enable regex-based PHI pattern detection
`entity_types`	list	—	Entity types to detect
`entity_thresholds`	dict	—	Per-entity confidence overrides
`entity_policies`	dict	—	Per-entity action and masking rules

Each key under `entity_policies` supports:

Field	Type	Values	Description
`action`	str	`flag`, `block`, `review`, `pass`	Policy action applied when the entity is detected
`mask`	bool	`true`, `false`	Whether to mask the detected value before downstream processing

Audit Logging

Each detection event is logged with the following fields:

entity_type — The detected entity category
confidence_score — Model confidence for the detection
action_taken — Policy action applied (flag, block, review, pass)
session_id — Session identifier for traceability
timestamp — Time of the detection event

Use these logs for compliance reporting, security monitoring, and tuning confidence thresholds over time.
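
For orientation, a detection event with the fields listed above might be assembled as shown below. The JSON serialization, the ISO 8601 UTC timestamp, and the `build_audit_record` helper are assumptions made for illustration; the actual log format produced by the package may differ.

```python
import json
from datetime import datetime, timezone


def build_audit_record(
    entity_type: str, confidence_score: float, action_taken: str, session_id: str
) -> dict:
    """Assemble one detection event using the documented field names."""
    return {
        "entity_type": entity_type,
        "confidence_score": confidence_score,
        "action_taken": action_taken,
        "session_id": session_id,
        # Assumed format: ISO 8601 timestamp in UTC.
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


record = build_audit_record("US_SSN", 0.85, "block", "session-123")
print(json.dumps(record))
```

Note that none of the documented fields carries the detected value itself, so a record of this shape can be stored without copying the sensitive data into the log.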

Data Masking

When `mask: true` is set (globally via `default_mask` or per entity in `entity_policies`), detected values are replaced in the text before it reaches the agent or LLM. This reduces exposure of sensitive data in prompts, logs, and downstream systems while still allowing the request to proceed when the policy permits.
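
The replacement step can be sketched as a span substitution. The `<ENTITY_TYPE>` placeholder format and the `mask_text` helper are hypothetical; this page does not specify what the masked value looks like. The sketch also assumes the detected spans do not overlap.

```python
def mask_text(text: str, detections: list[tuple[str, int, int]]) -> str:
    """Replace each (entity_type, start, end) span with a placeholder.

    Spans are processed from right to left so that earlier offsets stay
    valid while the string changes length. Assumes non-overlapping spans.
    """
    masked = text
    for entity_type, start, end in sorted(
        detections, key=lambda d: d[1], reverse=True
    ):
        masked = masked[:start] + f"<{entity_type}>" + masked[end:]
    return masked


text = "Email jane@example.com or call 555-0100."
email, phone = "jane@example.com", "555-0100"
detections = [
    ("EMAIL_ADDRESS", text.index(email), text.index(email) + len(email)),
    ("PHONE_NUMBER", text.index(phone), text.index(phone) + len(phone)),
]

print(mask_text(text, detections))
# Email <EMAIL_ADDRESS> or call <PHONE_NUMBER>.
```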

Input vs Output Checks

The `input_checks` and `output_checks` settings under `pii` control where PII/PHI detection runs, independently of the global `input_checks` and `output_checks` settings:

yaml

guardrails:
  pii:
    enabled: true
    input_checks: true   # Scan user messages
    output_checks: true  # Scan model responses

For applications that only need to protect inbound user data, enable input checks alone. For applications that must prevent the model from leaking sensitive information, enable both.
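
As a rough illustration of how these flags gate each stage, the helper below reads only the `pii` section and applies the defaults listed in the Configuration Reference. The function name is hypothetical, and ignoring the global flags entirely is an assumption based on the description above.

```python
def pii_scan_enabled(stage: str, guardrails_config: dict) -> bool:
    """Decide whether PII/PHI detection runs for 'input' or 'output'."""
    if stage not in ("input", "output"):
        raise ValueError(f"stage must be 'input' or 'output', got {stage!r}")
    pii = guardrails_config.get("pii") or {}
    # Documented defaults: enabled=false, input_checks=true, output_checks=true.
    if not pii.get("enabled", False):
        return False
    return bool(pii.get(f"{stage}_checks", True))


inbound_only = {"pii": {"enabled": True, "input_checks": True, "output_checks": False}}
print(pii_scan_enabled("input", inbound_only))   # True
print(pii_scan_enabled("output", inbound_only))  # False
```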

Use Cases

Healthcare Applications

Block or review PHI while masking patient identifiers:

yaml

guardrails:
  pii:
    enabled: true
    enable_phi_detection: true
    entity_policies:
      PHI_MRN:
        action: review
        mask: true
      PHI_PATIENT_ID:
        action: review
        mask: true
      US_SSN:
        action: block
        mask: true

Financial Services

Strict blocking for high-risk financial identifiers:

yaml

guardrails:
  pii:
    enabled: true
    entity_policies:
      CREDIT_CARD:
        action: block
        mask: true
      IBAN_CODE:
        action: block
        mask: true

Monitoring Mode

Detect and log without blocking:

yaml

guardrails:
  pii:
    enabled: true
    default_action: flag
    default_mask: false
    entity_policies:
      EMAIL_ADDRESS:
        action: flag
        mask: false

Best Practices

Start with flag and mask — Use non-blocking actions while tuning thresholds before enforcing blocks in production.
Set entity-specific thresholds — Names and locations often need higher thresholds than structured identifiers like email addresses.
Enable PHI detection for healthcare — Turn on enable_phi_detection when handling medical or patient-related content.
Review audit logs regularly — Use logged confidence scores to refine thresholds and policies.
Combine with existing checks — PII/PHI detection complements Sensitive Data Detection and Toxicity Detection for layered protection.

Next Steps

Token Budget Enforcement — Limit request and run token usage
Sensitive Data Detection — Pattern-based sensitive data checks
Guardrails Configuration — Full configuration reference
YAML Configuration — Complete configuration examples
