
How to de-identify PHI before it reaches your LLM

By Mat Steinlin, Head of Information Security

Last updated: April 2026

Most developers encounter de-identification as a compliance question: do I need to de-identify this data? The answer is usually no: if you have a valid Business Associate Agreement with your LLM provider, you can send PHI to the model without de-identifying first, and many teams stop there.

That framing treats de-identification as a checkbox to evaluate and skip. This chapter argues for a different frame: de-identification as a security control that reduces risk even when it's not required, and that's meaningfully harder to implement correctly than most developers expect.

For the compliance baseline (when de-identification is required under HIPAA, the safe harbor standard, and the basic structured identifier approach), see HIPAA-Compliant AI. This chapter covers what that one doesn't: the security argument for de-identification when you already have a BAA, and how to handle unstructured clinical text, which is the majority of what healthcare AI features actually process.

De-identification isn’t just a compliance workaround

The common mental model: de-identification is what you do when you cannot get a BAA. Get the BAA, skip the de-identification.

This misses what de-identification actually does. A BAA is a legal agreement about the provider's liability for how they handle your data. It doesn't change what data the provider's infrastructure processes, stores during inference, caches, or logs for abuse monitoring. The provider's obligations under the BAA are real and contractually enforced, but the data still passes through their systems in identifiable form.

De-identification before sending to the model means the provider's infrastructure never sees the PHI; the token [PATIENT_1] passes through inference, not the patient's name. If the provider has a security incident, the data in their systems at the time is de-identified. If their abuse monitoring logs retain more than the BAA allows, what they retained is tokens. If a misconfigured feature at the provider caches requests in plain text, it caches tokens.

Beyond the provider, de-identification has infrastructure-level security value on your own side. Your audit logs contain every prompt and response you send to the LLM. Those logs are a long-lived, queryable store of data that, for compliance, must be retained for years. An audit log containing de-identified prompts has a meaningfully smaller blast radius if that storage is ever compromised than one containing raw PHI. The same logic applies to any caching, debugging infrastructure, or log drain that handles LLM traffic.

When de-identification is worth the complexity

De-identification is not free. It adds latency (30–500ms depending on approach), implementation complexity, re-identification risk (which is its own failure mode), and ongoing maintenance. Whether the tradeoff is worth it depends on your specific use case.

Cases where it adds meaningful security value

Processing full clinical notes or discharge summaries. When your AI feature sends complete free-text clinical records to the LLM, the PHI exposure surface is large. Names, dates, locations, diagnoses, and medication details are all present in identifiable form. De-identification here significantly reduces what the provider's infrastructure sees and what lives in your audit logs.

Features with high data sensitivity. Prior authorization assistance, diagnostic support, and anything involving mental health records or substance use disorders carry both elevated sensitivity and, in some cases, additional regulatory requirements (42 CFR Part 2 for substance use records adds requirements beyond standard HIPAA). Defense in depth is appropriate here.

Evaluating a new model provider. When you are testing a model from a provider you have not worked with before, have not yet signed a BAA with, or have not yet built operational trust in, de-identification limits what you expose during the evaluation period.

When your audit log retention is extensive. If your logs are retained for years and your log storage security posture is not perfect, de-identified logs reduce the value of a compromise significantly.

Cases where it does not justify the overhead

Narrow structured data fields. If your LLM call takes three specific fields (age range, diagnosis code, medication name) and those fields are already de-identified or not individually identifying, the overhead is not warranted.

Internal tooling with controlled access. Developer tools and internal dashboards where all users are authenticated employees with appropriate access controls and where no raw patient records are involved have a lower risk profile.

Latency-sensitive patient-facing features. Adding 200–400ms to a synchronous patient-facing interaction for marginal security benefit is often the wrong tradeoff. Async workflows and batch processing are better candidates.

When you are using synthetic test data. De-identification is unnecessary and counterproductive in development environments using synthetic or anonymized test datasets.

Structured identifiers: the basic approach

HIPAA-Compliant AI covers the structured identifier approach: regex-based detection and replacement of the 18 HIPAA Safe Harbor identifiers (Social Security numbers, MRNs, dates of birth, phone numbers, and similar fields that have recognizable patterns).

That approach works for structured fields and programmatically-generated data. It doesn't work for unstructured text. A clinical note that says "The patient, John Daniels, was seen on Tuesday by Dr. Rivera at the downtown clinic" contains a name, a date reference, a provider name, and a location, none of which regex pattern matching will reliably detect in free-form prose. The rest of this chapter covers the tools and architectures for that problem.

To meet the Safe Harbor standard, all 18 identifiers must be removed, not most of them. A clinical note with 17 identifier types scrubbed but a patient name remaining is not de-identified under HIPAA. The structured regex approach handles several of the 18 reliably. It does not handle names, date references in prose, or contextual identifiers, which is why NLP is necessary for unstructured clinical text rather than optional.
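To make the gap concrete, here is a minimal sketch (these two patterns are illustrative, not the full Safe Harbor set) showing structured-identifier regexes finding nothing to anchor on in that example sentence:

```python
import re

# Two representative structured-identifier patterns: SSN and US phone number.
ssn = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')
phone = re.compile(r'\b\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b')

note = ("The patient, John Daniels, was seen on Tuesday "
        "by Dr. Rivera at the downtown clinic")

# Neither pattern fires: the name, the weekday reference, the provider,
# and the location carry no recognizable structure for a regex to match.
assert ssn.search(note) is None
assert phone.search(note) is None
```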

Unstructured clinical text: the hard problem

Unstructured clinical text is the default medium for clinical documentation. Physician notes, discharge summaries, nursing assessments, referral letters, and patient messages are written in natural language with abbreviations, misspellings, non-standard terminology, and institutional conventions that vary by provider and by clinician.

De-identifying this text requires identifying PHI in context, not by pattern. The same challenge that makes NLP-based de-identification necessary also makes it imperfect: natural language understanding is probabilistic, and clinical text is harder than general text because it is densely abbreviated, domain-specific, and often grammatically irregular.

This has a practical implication that the section on verification will address: NLP-based de-identification has a false negative rate, and some PHI won't be detected and replaced. Implementing de-identification for clinical text isn't the same as guaranteeing no PHI reaches the model. Design accordingly.

spaCy with a clinical NER model

spaCy is an open-source Python NLP library with strong support for named entity recognition. For clinical text, scispaCy (developed by the Allen Institute for AI) extends spaCy with models trained on biomedical literature and provides entity types relevant to clinical text.

scispaCy's models recognize biomedical entities but are not trained specifically for PHI detection. For production de-identification use, you need either a custom NER model trained on annotated clinical text (the i2b2/n2c2 de-identification datasets are the standard training resource) or a model fine-tuned from scispaCy's base models on PHI-labeled data.

# pip install spacy scispacy
# pip install https://s3-us-west-2.amazonaws.com/ai2-s3-scispacy/releases/v0.5.4/en_core_sci_lg-0.5.4.tar.gz

import re
import spacy
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PHISpan:
    text: str
    label: str      # e.g. "PERSON", "DATE", "ORG", "GPE"
    start: int
    end: int

# PHI-relevant entity types in clinical text.
# Standard spaCy/scispaCy labels that commonly correspond to PHI:
PHI_ENTITY_LABELS = {
    "PERSON",     # Patient and provider names
    "DATE",       # Dates of birth, visit dates, event dates
    "TIME",       # Times that may narrow date identification
    "GPE",        # Cities, states, countries (geographic identifiers)
    "LOC",        # Locations, facilities
    "ORG",        # Hospitals, clinics, institutions
    "CARDINAL",   # Numbers that may be MRNs or identifiers in context
}

# Supplement NER with regex for structured identifiers that NER misses.
SUPPLEMENTAL_PHI_PATTERNS = {
    "PHONE":    re.compile(r'\b(\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b'),
    "SSN":      re.compile(r'\b\d{3}-\d{2}-\d{4}\b'),
    "EMAIL":    re.compile(r'\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b'),
    "ZIP":      re.compile(r'\b\d{5}(?:-\d{4})?\b'),
    "MRN":      re.compile(r'\bMRN\s*[:#]?\s*\d{5,10}\b', re.IGNORECASE),
    "NPI":      re.compile(r'\bNPI\s*[:#]?\s*\d{10}\b', re.IGNORECASE),
    "URL":      re.compile(r'https?://\S+'),
}

class SpacyPHIDetector:
    """
    PHI detector using spaCy NER for unstructured clinical text.

    For production use, replace the base model with a model trained
    or fine-tuned on PHI-labeled clinical text (i2b2/n2c2 datasets).
    The base scispaCy models detect many PHI-adjacent entities but
    are not optimized for de-identification specifically.

    Always validate detection quality on representative samples from
    your actual clinical text before deploying.
    """

    def __init__(self, model_name: str = "en_core_sci_lg"):
        self.nlp = spacy.load(model_name)

    def detect(self, text: str) -> list[PHISpan]:
        doc = self.nlp(text)
        spans: list[PHISpan] = []

        # NER-detected entities
        for ent in doc.ents:
            if ent.label_ in PHI_ENTITY_LABELS:
                spans.append(PHISpan(
                    text=ent.text,
                    label=ent.label_,
                    start=ent.start_char,
                    end=ent.end_char,
                ))

        # Regex-detected structured identifiers
        for label, pattern in SUPPLEMENTAL_PHI_PATTERNS.items():
            for match in pattern.finditer(text):
                spans.append(PHISpan(
                    text=match.group(),
                    label=label,
                    start=match.start(),
                    end=match.end(),
                ))

        # Remove duplicates and sort by position
        seen = set()
        unique_spans = []
        for span in sorted(spans, key=lambda s: s.start):
            key = (span.start, span.end)
            if key not in seen:
                seen.add(key)
                unique_spans.append(span)

        return unique_spans


The hybrid approach (NER plus regex) is important: NER handles contextual identification, while regex reliably catches structured identifiers with fixed formats. Neither alone is sufficient.

Practical considerations:

  • Model loading takes 1–3 seconds; load once at application startup, not per request

  • Inference time for a 500-word clinical note is typically 30–80ms, depending on hardware

  • False negative rate on out-of-domain clinical text can be significant; validate on your actual data
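The load-once guidance can be sketched with a cached factory. The counter here exists only to demonstrate that the expensive load runs once per process; in the real application the function body would call spacy.load (or construct the SpacyPHIDetector defined above) instead of returning a placeholder:

```python
from functools import lru_cache

load_count = 0  # instrumentation only, to show the load happens once

@lru_cache(maxsize=None)
def get_detector():
    """Process-wide detector instance. The body runs on the first call;
    every subsequent call returns the cached object without reloading."""
    global load_count
    load_count += 1
    return object()  # in the real app: SpacyPHIDetector("en_core_sci_lg")

# Repeated calls reuse the same instance instead of paying the 1-3s load.
d1 = get_detector()
d2 = get_detector()
assert d1 is d2 and load_count == 1
```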

AWS Comprehend Medical

AWS Comprehend Medical provides a managed PHI detection API trained specifically on clinical text. The DetectPHI operation returns entity types directly relevant to HIPAA de-identification: NAME, AGE, ADDRESS, PROFESSION, PHONE, ID, EMAIL, DATE, LOCATION_OTHER, and URL.

import boto3
from dataclasses import dataclass

@dataclass
class PHIEntity:
    text: str
    entity_type: str    # Comprehend Medical PHI type
    begin_offset: int
    end_offset: int
    score: float        # Confidence score 0.0 – 1.0

def detect_phi_comprehend(
    text: str,
    region: str = "us-east-1",
    min_score: float = 0.80,
) -> list[PHIEntity]:
    """
    Detect PHI in clinical text using AWS Comprehend Medical.

    Filters to entities above min_score confidence. The default threshold
    of 0.80 balances recall against false positives; lower it to be more
    conservative (catch more PHI at the cost of more false positives).

    Note on text length: Comprehend Medical has a limit of 20,000 UTF-8
    characters per request. Split longer documents before calling.
    """
    if len(text.encode("utf-8")) > 20_000:
        raise ValueError(
            "Text exceeds Comprehend Medical's 20,000-byte limit. "
            "Split the document and call this function per chunk."
        )

    client = boto3.client("comprehendmedical", region_name=region)
    response = client.detect_phi(Text=text)

    return [
        PHIEntity(
            text=entity["Text"],
            entity_type=entity["Type"],
            begin_offset=entity["BeginOffset"],
            end_offset=entity["EndOffset"],
            score=entity["Score"],
        )
        for entity in response["Entities"]
        if entity["Score"] >= min_score
    ]

def chunk_and_detect_phi(text: str, region: str = "us-east-1") -> list[PHIEntity]:
    """
    Handle documents longer than Comprehend Medical's 20,000-byte limit
    by splitting on paragraph boundaries and adjusting offsets.
    """
    MAX_BYTES = 18_000  # Conservative limit to stay under the 20,000-byte cap
    chunks = []
    current_chunk = []
    current_size = 0

    for paragraph in text.split("\n\n"):
        para_bytes = len(paragraph.encode("utf-8"))
        if current_size + para_bytes > MAX_BYTES and current_chunk:
            chunks.append("\n\n".join(current_chunk))
            current_chunk = [paragraph]
            current_size = para_bytes
        else:
            current_chunk.append(paragraph)
            current_size += para_bytes

    if current_chunk:
        chunks.append("\n\n".join(current_chunk))

    all_entities = []
    offset = 0
    for chunk in chunks:
        entities = detect_phi_comprehend(chunk, region=region)
        for entity in entities:
            all_entities.append(PHIEntity(
                text=entity.text,
                entity_type=entity.entity_type,
                begin_offset=entity.begin_offset + offset,
                end_offset=entity.end_offset + offset,
                score=entity.score,
            ))
        offset += len(chunk) + 2  # +2 for the "\n\n" separator

    return all_entities


Practical considerations:

  • Latency: 150–400ms per API call, depending on text length and region

  • Cost: $0.01 per 100 characters (verify current pricing on the AWS pricing page)

  • No model training required; Comprehend Medical is production-ready out of the box

  • The managed service means no infrastructure to maintain, but the text leaves your environment for AWS inference

Microsoft Presidio

Microsoft Presidio is an open-source de-identification framework that combines spaCy-based NER with an extensible recognizer architecture. It ships with recognizers for common PII types and can be extended with custom recognizers for healthcare-specific identifiers.

# pip install presidio-analyzer presidio-anonymizer
# python -m spacy download en_core_web_lg

from presidio_analyzer import AnalyzerEngine, RecognizerResult
from presidio_analyzer.nlp_engine import NlpEngineProvider
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
from presidio_analyzer import PatternRecognizer, Pattern

def build_medical_analyzer() -> AnalyzerEngine:
    """
    Build a Presidio analyzer with clinical-text-appropriate configuration.
    Add custom recognizers for healthcare identifiers beyond the defaults.
    """
    configuration = {
        "nlp_engine_name": "spacy",
        "models": [{"lang_code": "en", "model_name": "en_core_web_lg"}],
    }
    provider = NlpEngineProvider(nlp_configuration=configuration)
    nlp_engine = provider.create_engine()
    analyzer = AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=["en"])

    # Custom recognizer for Medical Record Numbers
    mrn_recognizer = PatternRecognizer(
        supported_entity="MEDICAL_RECORD_NUMBER",
        patterns=[
            Pattern(name="MRN", regex=r'\bMRN\s*[:#]?\s*\d{5,10}\b', score=0.9),
            Pattern(name="MRN_bare", regex=r'\b[Mm]edical\s+[Rr]ecord\s+[Nn]umber\s*:?\s*\d{5,10}\b', score=0.85),
        ],
    )
    analyzer.registry.add_recognizer(mrn_recognizer)

    # Custom recognizer for National Provider Identifiers
    npi_recognizer = PatternRecognizer(
        supported_entity="NPI",
        patterns=[
            Pattern(name="NPI", regex=r'\bNPI\s*[:#]?\s*\d{10}\b', score=0.95),
        ],
    )
    analyzer.registry.add_recognizer(npi_recognizer)

    return analyzer

def deidentify_clinical_text(
    text: str,
    analyzer: AnalyzerEngine,
    anonymizer: AnonymizerEngine,
) -> tuple[str, list[RecognizerResult]]:
    """
    De-identify clinical text using Presidio.
    Returns the de-identified text and the list of detected entities
    (needed for re-identification mapping).
    """
    phi_entities = [
        "PERSON", "DATE_TIME", "LOCATION", "PHONE_NUMBER",
        "EMAIL_ADDRESS", "US_SSN", "URL", "US_PASSPORT",
        "MEDICAL_RECORD_NUMBER", "NPI",
    ]

    results = analyzer.analyze(text=text, entities=phi_entities, language="en")

    # Replace detected PHI with entity-type tokens
    anonymized = anonymizer.anonymize(
        text=text,
        analyzer_results=results,
        operators={
            "DEFAULT": OperatorConfig("replace", {"new_value": "<PHI>"}),
            "PERSON": OperatorConfig("replace", {"new_value": "<PERSON>"}),
            "DATE_TIME": OperatorConfig("replace", {"new_value": "<DATE>"}),
            "LOCATION": OperatorConfig("replace", {"new_value": "<LOCATION>"}),
        },
    )

    return anonymized.text, results


Practical considerations:

  • Runs locally; PHI detection does not leave your infrastructure

  • Highly extensible: adding a custom recognizer for a new entity type is straightforward

  • Default models are general-purpose NLP, not clinical-specific; accuracy on clinical abbreviations and non-standard terminology is lower than Comprehend Medical's clinical-trained models

  • No per-call cost, but infrastructure and maintenance overhead

Tool options for de-identification

For teams implementing de-identification in application code, the three main options are:


| Criterion | spaCy + clinical model | AWS Comprehend Medical | Microsoft Presidio |
|---|---|---|---|
| Training required | Yes (for production accuracy) | No | No (but customization recommended) |
| Clinical text accuracy | High (with appropriate model) | High (trained on clinical text) | Moderate (general NLP base) |
| Latency | 30–80ms local | 150–400ms API call | 50–150ms local |
| PHI leaves your infrastructure | No | Yes (to AWS) | No |
| Cost | Infrastructure only | Per character | Infrastructure only |
| Operational complexity | High (model training, updates) | Low | Medium |
| Custom entity types | Full flexibility | Limited to built-in types | Extensible via recognizers |

For teams using Aptible AI Gateway, de-identification will be available as a managed infrastructure feature, applying automatically to traffic through the gateway without requiring any application code changes. The gateway is available in beta independently of Aptible's PaaS. If you want de-identification handled at the infrastructure layer rather than in application code, request beta access.

The token mapping architecture

De-identification is only half of the problem. Most healthcare AI features need the model's output to be interpretable in context: if the model responds that <PERSON> should take <PHI> twice daily, that response needs to be translated back before it reaches a clinician. That requires re-identification: restoring original values from tokens.

The HIPAA-Compliant AI guide covers the basic token mapping pattern. This section covers the production considerations that the basic pattern omits.

Token collision avoidance

The basic pattern assigns sequential tokens by position ([PATIENT_1], [PATIENT_2]). Two failure modes follow: the same name appearing twice in a prompt can receive two different tokens, so the model loses the fact that both references are the same person, and predictable sequential tokens are easy to confuse across requests.

The fix is to assign tokens by value within the document rather than by position: the same value always maps to the same token, distinct values always map to distinct tokens, and the token identifiers are non-guessable:

import hashlib
import secrets
import threading
from datetime import datetime, timezone, timedelta
from typing import Optional

class ProductionTokenMapper:
    """
    Thread-safe token mapper with collision avoidance and lifecycle management.

    Tokens are scoped to a single request. The mapping is held in memory
    for the duration of the request and must be explicitly cleared after
    re-identification is complete.

    Do not persist this mapping to your audit logs alongside the de-identified
    prompt. The mapping is what makes the de-identified data re-identifiable.
    Store it separately, encrypted, with a short TTL.
    """

    def __init__(self, request_id: str):
        self.request_id = request_id
        self._lock = threading.Lock()
        # value -> token (for detecting repeated values)
        self._value_to_token: dict[str, str] = {}
        # token -> original value (for re-identification)
        self._token_to_value: dict[str, str] = {}
        self._created_at = datetime.now(timezone.utc)

    def tokenize(self, value: str, entity_type: str) -> str:
        """
        Return a stable token for a value within this request.
        The same value always gets the same token (within this request).
        Different values always get different tokens.
        """
        with self._lock:
            if value in self._value_to_token:
                return self._value_to_token[value]

            # Generate a short, non-guessable token
            token_id = secrets.token_hex(4).upper()
            token = f"[{entity_type}_{token_id}]"

            # Avoid collisions with existing tokens (extremely rare but handle explicitly)
            while token in self._token_to_value:
                token_id = secrets.token_hex(4).upper()
                token = f"[{entity_type}_{token_id}]"

            self._value_to_token[value] = token
            self._token_to_value[token] = value
            return token

    def reidentify(self, text: str) -> str:
        """Restore original values from tokens in the model's response."""
        with self._lock:
            result = text
            # Sort by token length descending to avoid partial replacements
            for token, value in sorted(
                self._token_to_value.items(), key=lambda x: len(x[0]), reverse=True
            ):
                result = result.replace(token, value)
            return result

    def clear(self) -> None:
        """
        Securely clear all mappings from memory.
        Call this after re-identification is complete.
        The mapping must not persist beyond the request lifecycle.
        """
        with self._lock:
            self._value_to_token.clear()
            self._token_to_value.clear()

    def export_encrypted(self, encryption_key: bytes) -> bytes:
        """
        Export the mapping for short-term encrypted storage if the request
        is async and re-identification happens in a separate process.
        Use Fernet or similar symmetric encryption; never store unencrypted.
        """
        from cryptography.fernet import Fernet  # pip install cryptography
        import json

        with self._lock:
            data = json.dumps(self._token_to_value).encode("utf-8")
            return Fernet(encryption_key).encrypt(data)

    @classmethod
    def from_encrypted(
        cls, encrypted: bytes, encryption_key: bytes, request_id: str = "restored"
    ) -> "ProductionTokenMapper":
        """Reconstruct a mapper from an encrypted export (async re-identification step)."""
        from cryptography.fernet import Fernet
        import json

        mapper = cls(request_id=request_id)
        token_to_value = json.loads(Fernet(encryption_key).decrypt(encrypted))
        mapper._token_to_value = token_to_value
        mapper._value_to_token = {v: t for t, v in token_to_value.items()}
        return mapper

def deidentify_with_mapping(
    text: str,
    phi_spans: list,          # List of PHISpan or PHIEntity from your detector
    mapper: ProductionTokenMapper,
) -> str:
    """
    Replace PHI spans with tokens, using the mapper for stable token assignment.
    Processes spans in reverse order to preserve character offsets.
    """
    spans_sorted = sorted(phi_spans, key=lambda s: s.begin_offset if hasattr(s, 'begin_offset') else s.start, reverse=True)
    result = text

    for span in spans_sorted:
        start = span.begin_offset if hasattr(span, 'begin_offset') else span.start
        end = span.end_offset if hasattr(span, 'end_offset') else span.end
        entity_type = span.entity_type if hasattr(span, 'entity_type') else span.label
        original = span.text

        token = mapper.tokenize(original, entity_type)
        result = result[:start] + token + result[end:]

    return result

Re-identification failure modes

Re-identification is higher-stakes than de-identification. A de-identification false negative means some PHI reached the model (bad, but contained by your BAA and the security controls around it; de-identification is a defense-in-depth layer, not the primary protection). A re-identification error means the model's output is mapped back incorrectly, which in a clinical context means the wrong information reaching the wrong clinician or being associated with the wrong patient.

The most common failure mode: the model's response includes a token in a different grammatical form or case than it appeared in the mapping, and the re-identification step doesn't find a match, leaving the token in the response as a literal string. The clinician sees [PERSON_A3F2] where the patient's name should be.

Mitigations:

  • Validate that every token that appears in the model's response has a corresponding mapping entry; if any do not, treat the response as an error rather than returning partially re-identified output

  • For critical clinical workflows (medication dosing, care plan updates), add a human review step before the re-identified output is acted upon

  • Log re-identification failures and alert on them; they indicate a gap in your token handling

Token deletion

Token deletion closes the full lifecycle. A token mapping that persists indefinitely reintroduces the PHI exposure that de-identification was intended to reduce: anyone with access to both the de-identified audit log and the mapping can reconstruct the original data. Deleting the mapping as soon as re-identification is complete is not optional.

For synchronous requests, call mapper.clear() immediately after re-identification:

# Full lifecycle for a synchronous request
mapper = ProductionTokenMapper(request_id=request_id)
deidentified_text = deidentify_with_mapping(clinical_note, phi_spans, mapper)

response = llm_client.complete(model=model, messages=[{"role": "user", "content": deidentified_text}])

reidentified_response = mapper.reidentify(response.choices[0].message.content)
mapper.clear()  # Mapping no longer needed; clear it now


For async workflows where the mapping must be persisted between steps, set a short TTL (minutes, not hours) and delete explicitly after re-identification:

# Async: persist mapping with TTL, delete after re-identification
encrypted_mapping = mapper.export_encrypted(encryption_key)
cache.set(f"token-map:{request_id}", encrypted_mapping, ttl_seconds=300)
mapper.clear()  # Clear in-memory copy immediately

# ... in the re-identification step, separate process ...
encrypted_mapping = cache.get(f"token-map:{request_id}")
cache.delete(f"token-map:{request_id}")  # Delete before using, not after
mapper = ProductionTokenMapper.from_encrypted(encrypted_mapping, encryption_key)
reidentified = mapper.reidentify(llm_response)
mapper.clear()


Deleting before using (not after) ensures the mapping is removed even if the re-identification step raises an exception. Log mapping deletions (the deletion event, not the mapping content) so you can confirm the full lifecycle completed during an audit or incident investigation.

Verifying your de-identification is working

NLP-based de-identification has a false negative rate. Building the pipeline and running it isn't the same as knowing it's working correctly on your actual clinical text. The gap between "implemented de-identification" and "verified de-identification" is where incidents occur.

import re
from dataclasses import dataclass

@dataclass
class DeidentificationTestCase:
    input_text: str
    known_phi_values: list[str]   # Values that must not appear in the output
    description: str

def run_deidentification_verification(
    detector,
    test_cases: list[DeidentificationTestCase],
    verbose: bool = True,
) -> dict:
    """
    Verify that the de-identification pipeline removes known PHI values.

    Use test cases that reflect real patterns in your clinical text —
    not just obvious examples. Clinical text contains PHI in abbreviations,
    misspellings, and non-standard formats that may not be detected.

    A passing test suite does not guarantee zero false negatives in production;
    it confirms the pipeline handles known patterns correctly.
    """
    results = {"passed": 0, "failed": 0, "failures": []}

    for case in test_cases:
        phi_spans = detector.detect(case.input_text)
        deidentified = deidentify_with_mapping(
            case.input_text,
            phi_spans,
            ProductionTokenMapper(request_id="test"),
        )

        case_passed = True
        for phi_value in case.known_phi_values:
            # Check exact and case-insensitive match
            if phi_value.lower() in deidentified.lower():
                results["failures"].append({
                    "description": case.description,
                    "missed_phi": phi_value,
                    "deidentified_output": deidentified,
                })
                case_passed = False

        if case_passed:
            results["passed"] += 1
        else:
            results["failed"] += 1
            if verbose:
                print(f"FAIL: {case.description}")

    return results

# Example test cases drawn from realistic clinical note patterns
SAMPLE_TEST_CASES = [
    DeidentificationTestCase(
        input_text="Pt: Maria Gonzalez, DOB 04/12/1978. Seen by Dr. Chen on Tuesday.",
        known_phi_values=["Maria Gonzalez", "04/12/1978", "Dr. Chen"],
        description="Standard name and date in clinical note header",
    ),
    DeidentificationTestCase(
        input_text="f/u w/ pt (M. Gonzalez, MRN 8847291) re: HTN mgmt",
        known_phi_values=["M. Gonzalez", "8847291"],
        description="Abbreviated name and MRN in follow-up note",
    ),
    DeidentificationTestCase(
        input_text="Called pt at 617-555-0192, no answer. Left voicemail.",
        known_phi_values=["617-555-0192"],
        description="Phone number in call note",
    ),
    DeidentificationTestCase(
        input_text="Pt lives at 142 Oak Street, Cambridge. Caregiver: son James.",
        known_phi_values=["142 Oak Street", "Cambridge", "James"],
        description="Address and family member name",
    ),
]


Run this verification suite against a representative sample of real (appropriately consented or already de-identified) clinical notes from your specific use case. Pay particular attention to abbreviations, provider-specific shorthand, and edge cases your clinical staff have surfaced. The test suite should grow as you find new patterns the detector misses.

For ongoing verification in production, consider adding a canary pattern to a small percentage of requests: a synthetic PHI value that should always be detected, embedded in a real-feeling clinical phrase. If the canary is not detected, alert.
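One way to sketch that canary check (names and the sample rate are illustrative; the deidentify callable stands in for whatever detector your pipeline uses):

```python
import random

# Hypothetical synthetic canary: a name guaranteed never to match a real
# patient, embedded in a real-feeling clinical phrase.
CANARY_NAME = "Zelda Quimbytest"
CANARY_SENTENCE = f" Pt {CANARY_NAME} verbalized understanding of plan."

def check_canary(deidentify, note: str, sample_rate: float = 0.01) -> bool:
    """Append the canary to a sampled request and confirm the detector
    removes it. Returns False (trigger an alert) if the canary survives.
    Unsampled requests return True, i.e. "nothing to check"."""
    if random.random() >= sample_rate:
        return True
    deidentified = deidentify(note + CANARY_SENTENCE)
    return CANARY_NAME.lower() not in deidentified.lower()
```

Run the check on a copy of the request text rather than mutating what you send to the model, and emit a metric on every False so a degraded detector surfaces within minutes rather than at the next audit.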

De-identification and audit logging

When de-identification is applied before sending a request to the LLM, your audit logs contain de-identified prompts rather than raw PHI. This has a useful implication and a compliance consideration.

The implication: de-identified audit logs have a smaller blast radius if your log storage is ever compromised. The PHI is not there to expose. This is one of the compounding security benefits of de-identification even when a BAA is in place.

The compliance consideration: HIPAA's audit control standard (45 CFR 164.312(b)) requires the ability to record and examine PHI-related activity. A log entry like "[PERSON_A3F2] was prescribed [PHI_B891]", with no way to map those tokens back to the original PHI, does not satisfy that requirement for incident investigation purposes.

The architecture that satisfies both: log the de-identified prompt and response, and separately retain the token mapping (encrypted, short TTL) in a way that can be correlated with the audit log entry via the request_id. During an investigation, the audit log provides the record of activity; the token mapping (if still within retention) provides the re-identification key to determine what PHI was involved.

The token mapping is itself sensitive data. It must be stored encrypted, access-controlled, and not co-located with the audit logs in the same unencrypted store. The request_id is the only link between them that should appear in the audit log.
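A minimal sketch of that separation, assuming encryption is injected rather than implemented here (production code would use AES-GCM or Fernet from the cryptography library; the class and method names below are illustrative):

```python
import json
import time
from typing import Callable, Dict, Optional, Tuple

class AuditSplitStore:
    """Audit entries and token mappings live in separate stores,
    linked only by request_id. The audit log is long-retention and
    de-identified; the mapping store is encrypted and short-TTL."""

    def __init__(self, encrypt: Callable[[bytes], bytes],
                 decrypt: Callable[[bytes], bytes],
                 ttl_seconds: int = 3600):
        self.audit_log: list = []  # in production: append-only log store
        self._mappings: Dict[str, Tuple[float, bytes]] = {}
        self._encrypt, self._decrypt = encrypt, decrypt
        self._ttl = ttl_seconds

    def record(self, request_id: str, deidentified_prompt: str,
               deidentified_response: str, token_mapping: dict) -> None:
        # Only the request_id links the two stores; no PHI in the log.
        self.audit_log.append({
            "request_id": request_id,
            "prompt": deidentified_prompt,
            "response": deidentified_response,
            "ts": time.time(),
        })
        blob = self._encrypt(json.dumps(token_mapping).encode())
        self._mappings[request_id] = (time.time() + self._ttl, blob)

    def mapping_for(self, request_id: str) -> Optional[dict]:
        """Re-identification key for incident investigation;
        returns None once the TTL has lapsed."""
        entry = self._mappings.get(request_id)
        if entry is None or time.time() > entry[0]:
            self._mappings.pop(request_id, None)
            return None
        return json.loads(self._decrypt(entry[1]).decode())
```

The design choice worth noting: an investigator who compromises only the audit log learns which requests occurred, not whose PHI was involved; recovering that requires separate access to the mapping store and its key.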

FAQs

Is de-identification 100% effective?

No. NLP-based de-identification has false negative rates that vary by model, training data, and the clinical text it is applied to. Published research on de-identification systems typically reports recall in the 90–99% range for common PHI types on benchmark datasets. Performance on your specific clinical text may be higher or lower. De-identification is defense in depth, not a guarantee: it significantly reduces the amount of PHI that reaches the model, but it does not reduce it to zero.

Do we need to de-identify if we are running the model on our own infrastructure?

If you are self-hosting an open-source model on your own infrastructure, no third party receives the data, so the BAA question doesn't arise. De-identification still has value if the model's responses are logged, cached, or otherwise persisted, because it limits what PHI lives in those stores.

Can we use de-identification to avoid getting a BAA?

It depends on the standard you're using. HIPAA defines de-identification under 45 CFR 164.514 via two methods: Safe Harbor (removing all 18 specified identifiers) and Expert Determination (a qualified expert determines re-identification risk is very small). De-identification that meets the Safe Harbor standard means the data is no longer PHI under HIPAA, and therefore does not require a BAA.

NLP-based de-identification rarely meets the Safe Harbor standard in practice, because false negatives mean some identifiers remain in the text. If your goal is avoiding a BAA, you need to demonstrate that your de-identification meets the Safe Harbor or Expert Determination standard. That's a compliance and legal determination, not just a technical one. Consult your compliance and legal counsel.

For clinical note processing where the output needs to be re-identifiable, the token mapping approach means the data is coded rather than de-identified under HIPAA, and a BAA is still required.

The same logic extends to 42 CFR Part 2 records (substance use disorder treatment data), which this guide references in the section on when de-identification adds meaningful security value. Part 2 has its own de-identification standard: records that do not identify a patient and for which there is no reason to believe the information could be used to identify a patient fall outside Part 2's restrictions. Meeting HIPAA's Safe Harbor standard does not automatically mean you've met Part 2's standard — they are separate determinations. If your pipeline processes substance use disorder records, get a legal opinion on whether your de-identification approach satisfies Part 2 specifically.

How do we handle de-identification in async workflows?

In synchronous workflows, the token mapping lives in memory for the duration of the request and is cleared after re-identification. In async workflows (where the de-identification, LLM call, and re-identification happen in separate processes or at different times), the mapping must be persisted between steps.

Use the export_encrypted method in the ProductionTokenMapper above to serialize the mapping with symmetric encryption. Store it with a short TTL (minutes to hours, not days). Pass the request_id through the async pipeline so each step can retrieve the corresponding mapping. Delete the mapping explicitly after re-identification is complete or after the TTL expires, whichever comes first.

Next steps

De-identification addresses PHI in your sanctioned infrastructure. The next chapter, on shadow AI, addresses PHI that leaves your infrastructure entirely, through tools your compliance posture doesn't cover.