PII/PHI Detection and Data Masking

Identify sensitive personal and health information in user input and model output, apply policy-driven actions, and mask detected values before downstream processing.

Overview

PII/PHI detection runs as a middleware step before agent or LLM processing and, optionally, on model output. It uses Microsoft Presidio Analyzer to identify sensitive entities in text, supports a global confidence threshold with per-entity overrides and per-entity policies, and adds regex-based pattern matching for healthcare identifiers such as medical record numbers and patient IDs.

Use this guardrail when you need fine-grained control over how each data type is handled (flagged, blocked, sent for review, or passed through, with optional masking) while maintaining an audit trail of every detection event.

Prerequisites

PII/PHI detection requires the spaCy English language model (en_core_web_lg). Install it after installing the Elsai Guardrails package:

bash
pip install --extra-index-url https://elsai-core-package.optisolbusiness.com/root/elsai-guardrails/ elsai-guardrails==0.1.3
python -m spacy download en_core_web_lg

This download is a one-time setup step and is required before enabling PII/PHI detection in your configuration.

How It Works

  1. Text is analyzed before agent or LLM processing (and optionally on output).
  2. Presidio identifies configured entity types with confidence scores.
  3. Additional regex-based patterns detect PHI identifiers such as medical record numbers and patient IDs.
  4. Each entity is evaluated against global and per-entity confidence thresholds.
  5. The configured policy action is applied: flag, block, review, or pass.
  6. When masking is enabled, detected values are replaced before downstream processing.
  7. Detection events are logged with entity type, confidence score, action taken, session ID, and timestamp.
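
To make step 3 more concrete, the following is a minimal sketch of regex-based PHI matching in plain Python. The patterns, the fixed score of 0.85, and the `Detection` and `detect_phi` names are illustrative assumptions, not the patterns or API shipped with Elsai Guardrails; real MRN and patient ID formats vary by institution.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns for illustration only. The package's own
# patterns are not documented here and may differ.
PHI_PATTERNS = {
    "PHI_MRN": re.compile(r"\bMRN[:#\s-]*(\d{6,10})\b", re.IGNORECASE),
    "PHI_PATIENT_ID": re.compile(
        r"\bpatient\s+id[:#\s-]*([A-Z]{0,3}\d{4,10})\b", re.IGNORECASE
    ),
}


@dataclass(frozen=True)
class Detection:
    entity_type: str
    start: int    # start offset of the identifier value in the text
    end: int      # end offset (exclusive)
    score: float  # assumed fixed confidence for a regex hit


def detect_phi(text: str, score: float = 0.85) -> list[Detection]:
    """Return one Detection per regex match, ordered by position."""
    found = []
    for entity_type, pattern in PHI_PATTERNS.items():
        for match in pattern.finditer(text):
            start, end = match.span(1)  # span of the captured value only
            found.append(Detection(entity_type, start, end, score))
    return sorted(found, key=lambda d: d.start)


text = "Patient ID: AB12345, MRN: 00123456."
for d in detect_phi(text):
    print(d.entity_type, text[d.start:d.end], d.score)
```

Capturing only the identifier value (group 1) keeps the label such as "MRN:" readable if the value is later masked.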

Configuration

Enable PII/PHI Detection

yaml
guardrails:
  pii:
    enabled: true
    input_checks: true
    output_checks: true
    language: en

Supported Entity Types

The following entity types can be detected and configured individually:

| Entity Type | Description |
| --- | --- |
| PERSON | Personal names |
| LOCATION | Geographic locations |
| EMAIL_ADDRESS | Email addresses |
| PHONE_NUMBER | Phone numbers |
| CREDIT_CARD | Credit card numbers |
| NRP | Nationalities, religious, or political groups |
| MEDICAL_LICENSE | Medical license numbers |
| US_SSN | U.S. Social Security numbers |
| IBAN_CODE | International bank account numbers |
| IP_ADDRESS | IP addresses |
| PHI_MRN | Medical record numbers (regex-based PHI detection) |
| PHI_PATIENT_ID | Patient identifiers (regex-based PHI detection) |

Specify which types to scan:

yaml
guardrails:
  pii:
    entity_types:
      - PERSON
      - LOCATION
      - EMAIL_ADDRESS
      - PHONE_NUMBER
      - CREDIT_CARD
      - NRP
      - MEDICAL_LICENSE
      - US_SSN
      - IBAN_CODE
      - IP_ADDRESS

Confidence Thresholds

Set a global default threshold and override it for specific entity types. Entities below the applicable threshold are handled according to below_threshold_action.

yaml
guardrails:
  pii:
    default_confidence_threshold: 0.5
    below_threshold_action: flag
    entity_thresholds:
      PERSON: 0.7
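
The lookup implied by this configuration can be sketched in a few lines of Python. The function names are hypothetical and the values mirror the YAML above; this is an illustration of the override logic, not the package's internal code.

```python
DEFAULT_CONFIDENCE_THRESHOLD = 0.5
ENTITY_THRESHOLDS = {"PERSON": 0.7}


def threshold_for(entity_type: str) -> float:
    """Per-entity override if present, otherwise the global default."""
    return ENTITY_THRESHOLDS.get(entity_type, DEFAULT_CONFIDENCE_THRESHOLD)


def is_below_threshold(entity_type: str, score: float) -> bool:
    """True when the detection falls under below_threshold_action."""
    return score < threshold_for(entity_type)


# A 0.6-confidence PERSON is below its 0.7 override,
# while a 0.6-confidence EMAIL_ADDRESS clears the 0.5 default.
print(is_below_threshold("PERSON", 0.6))
print(is_below_threshold("EMAIL_ADDRESS", 0.6))
```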

Policy-Based Actions

Each entity type can be assigned an action and optional masking behavior:

| Action | Behavior |
| --- | --- |
| flag | Record the detection and allow processing to continue |
| block | Stop processing and reject the request |
| review | Mark for human review while allowing or holding the request |
| pass | Allow the entity through without intervention |

yaml
guardrails:
  pii:
    default_action: flag
    default_mask: true
    enable_phi_detection: true
    entity_policies:
      CREDIT_CARD:
        action: block
        mask: true
      US_SSN:
        action: block
        mask: true
      EMAIL_ADDRESS:
        action: flag
        mask: true
      PHONE_NUMBER:
        action: flag
        mask: true
      PHI_MRN:
        action: review
        mask: true
      PHI_PATIENT_ID:
        action: review
        mask: true
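
As a rough illustration of how a per-entity policy falls back to the defaults, consider the sketch below. The `resolve_policy` and `strictest` helpers are hypothetical. In particular, the precedence used when one message contains several entities (block over review over flag over pass) is an assumption made for this example; the documentation above does not specify it.

```python
CONFIG = {
    "default_action": "flag",
    "default_mask": True,
    "entity_policies": {
        "CREDIT_CARD": {"action": "block", "mask": True},
        "PHI_MRN": {"action": "review", "mask": True},
    },
}


def resolve_policy(entity_type: str, config: dict = CONFIG) -> tuple[str, bool]:
    """Return (action, mask) for an entity, falling back to the defaults."""
    policy = config.get("entity_policies", {}).get(entity_type, {})
    action = policy.get("action", config["default_action"])
    mask = policy.get("mask", config["default_mask"])
    return action, mask


# Assumed precedence when several entities appear in one message.
SEVERITY = {"pass": 0, "flag": 1, "review": 2, "block": 3}


def strictest(actions: list[str]) -> str:
    """Pick the most restrictive action; 'pass' if nothing was detected."""
    return max(actions, key=SEVERITY.__getitem__, default="pass")


print(resolve_policy("CREDIT_CARD"))  # explicit policy
print(resolve_policy("LOCATION"))     # falls back to the defaults
print(strictest(["flag", "review", "block"]))
```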

Complete Example

yaml
guardrails:
  input_checks: true
  output_checks: true
  check_toxicity: true
  check_sensitive_data: true
  check_semantic: true

  pii:
    enabled: true
    input_checks: true
    output_checks: true
    language: en
    default_confidence_threshold: 0.5
    below_threshold_action: flag
    default_action: flag
    default_mask: true
    enable_phi_detection: true
    entity_types:
      - PERSON
      - LOCATION
      - EMAIL_ADDRESS
      - PHONE_NUMBER
      - CREDIT_CARD
      - NRP
      - MEDICAL_LICENSE
      - US_SSN
      - IBAN_CODE
      - IP_ADDRESS
    entity_thresholds:
      PERSON: 0.7
    entity_policies:
      CREDIT_CARD:
        action: block
        mask: true
      US_SSN:
        action: block
        mask: true
      EMAIL_ADDRESS:
        action: flag
        mask: true
      PHONE_NUMBER:
        action: flag
        mask: true
      PHI_MRN:
        action: review
        mask: true
      PHI_PATIENT_ID:
        action: review
        mask: true

Configuration Reference

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | false | Enable PII/PHI detection |
| input_checks | bool | true | Run detection on user input |
| output_checks | bool | true | Run detection on model output |
| language | str | "en" | Language code for entity analysis |
| default_confidence_threshold | float | 0.5 | Global minimum confidence for entity recognition |
| below_threshold_action | str | "flag" | Action for entities below their threshold |
| default_action | str | "flag" | Default action when no entity policy is defined |
| default_mask | bool | true | Mask detected values by default |
| enable_phi_detection | bool | true | Enable regex-based PHI pattern detection |
| entity_types | list | | Entity types to detect |
| entity_thresholds | dict | | Per-entity confidence overrides |
| entity_policies | dict | | Per-entity action and masking rules |

Each key under entity_policies supports:

| Field | Type | Values | Description |
| --- | --- | --- | --- |
| action | str | flag, block, review, pass | Policy action applied when the entity is detected |
| mask | bool | true, false | Whether to mask the detected value before downstream processing |
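
If you generate or validate this configuration from application code, the reference above maps naturally onto a typed structure. The sketch below is one hypothetical way to do that with a dataclass; `PiiConfig` is not a class exported by Elsai Guardrails. The reference lists no defaults for `entity_types`, `entity_thresholds`, and `entity_policies`, so empty containers are an assumption here.

```python
from dataclasses import dataclass, field

VALID_ACTIONS = {"flag", "block", "review", "pass"}


@dataclass
class PiiConfig:
    # Defaults copied from the reference table above.
    enabled: bool = False
    input_checks: bool = True
    output_checks: bool = True
    language: str = "en"
    default_confidence_threshold: float = 0.5
    below_threshold_action: str = "flag"
    default_action: str = "flag"
    default_mask: bool = True
    enable_phi_detection: bool = True
    # Assumed empty defaults; the reference does not list any.
    entity_types: list = field(default_factory=list)
    entity_thresholds: dict = field(default_factory=dict)
    entity_policies: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject action names outside the documented set.
        actions = [self.default_action]
        actions += [p["action"] for p in self.entity_policies.values() if "action" in p]
        for action in actions:
            if action not in VALID_ACTIONS:
                raise ValueError(f"unknown action: {action!r}")


# The dict stands in for the parsed `guardrails.pii` section of the YAML.
raw = {"enabled": True, "entity_policies": {"US_SSN": {"action": "block", "mask": True}}}
cfg = PiiConfig(**raw)
print(cfg.default_confidence_threshold, cfg.entity_policies["US_SSN"]["action"])
```

A misspelled option name raises a `TypeError` at construction time, which surfaces typos before the configuration is used.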

Audit Logging

Each detection event is logged with the following fields:

  • entity_type — The detected entity category
  • confidence_score — Model confidence for the detection
  • action_taken — Policy action applied (flag, block, review, pass)
  • session_id — Session identifier for traceability
  • timestamp — Time of the detection event

Use these logs for compliance reporting, security monitoring, and tuning confidence thresholds over time.
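
For teams that forward these events to their own log pipeline, a record with the five fields might look like the sketch below. The JSON serialization, the ISO 8601 UTC timestamp, and the `audit_record` helper are assumptions for illustration; the actual log format produced by the package may differ.

```python
import json
from datetime import datetime, timezone


def audit_record(entity_type: str, confidence_score: float,
                 action_taken: str, session_id: str) -> dict:
    """Build one detection event with the documented fields."""
    return {
        "entity_type": entity_type,
        "confidence_score": confidence_score,
        "action_taken": action_taken,
        "session_id": session_id,
        # Assumed format: timezone-aware ISO 8601 in UTC.
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


record = audit_record("US_SSN", 0.85, "block", "session-123")
print(json.dumps(record))
```

The record carries the entity type and score but not the detected value itself, so the audit trail does not become a second copy of the sensitive data.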

Data Masking

When mask: true is set (globally via default_mask or per entity in entity_policies), detected values are replaced in the text before it reaches the agent or LLM. This reduces exposure of sensitive data in prompts, logs, and downstream systems while still allowing the request to proceed when the policy permits.
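
Conceptually, masking is a span replacement over the detected offsets. The sketch below shows the idea in plain Python. The `<ENTITY_TYPE>` placeholder format and the `mask_text` helper are assumptions for illustration, and the sketch assumes the spans do not overlap.

```python
def mask_text(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Replace each (start, end, entity_type) span with a placeholder.

    Spans are applied from right to left so that earlier offsets stay
    valid while the string changes length. Overlapping spans are not
    handled in this sketch.
    """
    masked = text
    for start, end, entity_type in sorted(spans, reverse=True):
        masked = masked[:start] + f"<{entity_type}>" + masked[end:]
    return masked


text = "Email jane@example.com or call 555-0100"
email, phone = "jane@example.com", "555-0100"
spans = [
    (text.index(email), text.index(email) + len(email), "EMAIL_ADDRESS"),
    (text.index(phone), text.index(phone) + len(phone), "PHONE_NUMBER"),
]
print(mask_text(text, spans))
```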

Input vs Output Checks

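The pii block has its own input_checks and output_checks settings, so you can control where PII/PHI detection runs independently of the top-level guardrails.input_checks and guardrails.output_checks settings: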

yaml
guardrails:
  pii:
    enabled: true
    input_checks: true   # Scan user messages
    output_checks: true  # Scan model responses

For applications that only need to protect inbound user data, enable input checks alone. For applications that must prevent the model from leaking sensitive information, enable both.
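
The difference between the two settings can be pictured as a wrapper around the model call. The sketch below uses a stand-in email regex as the scanner and a plain function as the model; `scan` and `guarded_call` are hypothetical names, not part of the Elsai Guardrails API.

```python
import re
from typing import Callable

# Stand-in scanner: masks email addresses only, for illustration.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def scan(text: str) -> str:
    return EMAIL.sub("<EMAIL_ADDRESS>", text)


def guarded_call(user_text: str, model: Callable[[str], str],
                 input_checks: bool = True, output_checks: bool = True) -> str:
    """Scan the prompt before the model sees it and the reply before it returns."""
    prompt = scan(user_text) if input_checks else user_text
    reply = model(prompt)
    return scan(reply) if output_checks else reply


def leaky_model(prompt: str) -> str:
    # Simulates a model that introduces an address of its own.
    return prompt + " Reply to bob@example.org"


print(guarded_call("I am amy@example.com", leaky_model))
print(guarded_call("I am amy@example.com", leaky_model, output_checks=False))
```

With output checks disabled, the address introduced by the model reaches the caller unmasked, which is the leak that enabling both checks is meant to prevent.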

Use Cases

Healthcare Applications

Send medical record numbers and patient IDs for review, block Social Security numbers, and mask all three:

yaml
guardrails:
  pii:
    enabled: true
    enable_phi_detection: true
    entity_policies:
      PHI_MRN:
        action: review
        mask: true
      PHI_PATIENT_ID:
        action: review
        mask: true
      US_SSN:
        action: block
        mask: true

Financial Services

Strict blocking for high-risk financial identifiers:

yaml
guardrails:
  pii:
    enabled: true
    entity_policies:
      CREDIT_CARD:
        action: block
        mask: true
      IBAN_CODE:
        action: block
        mask: true

Monitoring Mode

Detect and log without blocking:

yaml
guardrails:
  pii:
    enabled: true
    default_action: flag
    default_mask: false
    entity_policies:
      EMAIL_ADDRESS:
        action: flag
        mask: false

Best Practices

  1. Start with flag and mask — Use non-blocking actions while tuning thresholds before enforcing blocks in production.
  2. Set entity-specific thresholds — Names and locations often need higher thresholds than structured identifiers like email addresses.
  3. Enable PHI detection for healthcare — Turn on enable_phi_detection when handling medical or patient-related content.
  4. Review audit logs regularly — Use logged confidence scores to refine thresholds and policies.
  5. Combine with existing checks — PII/PHI detection complements Sensitive Data Detection and Toxicity Detection for layered protection.
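
As one way to act on practice 4, the sketch below summarizes logged confidence scores per entity type. It assumes audit events are available as dictionaries with the `entity_type` and `confidence_score` fields described under Audit Logging; the `summarize` helper and the sample events are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def summarize(events: list[dict]) -> dict:
    """Group confidence scores by entity type and report count, min, and mean."""
    scores = defaultdict(list)
    for event in events:
        scores[event["entity_type"]].append(event["confidence_score"])
    return {
        entity_type: {
            "count": len(values),
            "min": min(values),
            "mean": round(mean(values), 3),
        }
        for entity_type, values in scores.items()
    }


# Sample events invented for this example.
events = [
    {"entity_type": "PERSON", "confidence_score": 0.55},
    {"entity_type": "PERSON", "confidence_score": 0.9},
    {"entity_type": "PERSON", "confidence_score": 0.65},
    {"entity_type": "EMAIL_ADDRESS", "confidence_score": 1.0},
]
for entity_type, stats in summarize(events).items():
    print(entity_type, stats)
```

A wide gap between the minimum and mean for an entity type suggests that many low-confidence detections are being logged, which is a hint to review its entry in entity_thresholds.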

Released under the MIT License.