Why We Scrub PHI Before the AI Even Starts Thinking

Inside Audit Sentinel’s Pass 1 architecture and the HIPAA Safe Harbor method

Most AI-powered healthcare tools treat data privacy as a policy layer — a set of rules about who can access what, enforced by permissions and logging. Audit Sentinel treats it as an architecture layer. The very first thing our pipeline does with every clinical note, before a single line of coding logic executes, is strip it of protected health information. We call this Pass 1: the PHI Scrubber. It runs a fast, lightweight frontier language model on Google Cloud Vertex AI whose only job is entity recognition and redaction. It doesn’t analyze MDM complexity. It doesn’t validate ICD-10 codes. It reads the note, identifies every Safe Harbor identifier, replaces each one with a standardized placeholder, and emits a de-identified version of the note. That’s it. That de-identified note is the only artifact that Pass 2 and Pass 3 ever see.

The method behind the redaction is the HIPAA Safe Harbor standard defined at 45 CFR § 164.514(b)(2). Safe Harbor specifies 18 categories of identifiers that must be removed for data to be considered de-identified under the Privacy Rule: names, geographic subdivisions smaller than a state, all elements of dates (except year) directly related to an individual, phone numbers, fax numbers, email addresses, Social Security numbers, medical record numbers, health plan beneficiary numbers, account numbers, certificate and license numbers, vehicle identifiers, device identifiers and serial numbers, URLs, IP addresses, biometric identifiers, full-face photographs and comparable images, and any other unique identifying number, characteristic, or code. Our Pass 1 model targets all 18 categories and replaces each with a typed placeholder, such as [REDACTED_NAME], [REDACTED_MRN], or [REDACTED_DATE], so that downstream passes can still recognize that an entity existed in the note without knowing its value. A sentence like “[REDACTED_NAME] presented with chest pain” preserves the clinical structure: the auditor model knows a patient presented, it just doesn’t know who.
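To make the typed-placeholder idea concrete, here is a minimal Python sketch. It assumes the entity recognizer has already returned character spans with a category label; the EntitySpan type, the apply_placeholders function, and the sample note are hypothetical illustrations, not Audit Sentinel's actual code or output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitySpan:
    """A detected identifier (hypothetical shape for a recognizer's output)."""
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # category suffix, e.g. "NAME", "MRN", "DATE"


def apply_placeholders(note: str, spans: list[EntitySpan]) -> str:
    """Replace each detected span with a typed [REDACTED_<LABEL>] placeholder."""
    out = note
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        out = out[:span.start] + f"[REDACTED_{span.label}]" + out[span.end:]
    return out


# Made-up note text; the offsets below were counted by hand for this string.
note = "John Smith (MRN 4471902) presented with chest pain on 03/14/2024."
spans = [
    EntitySpan(0, 10, "NAME"),
    EntitySpan(16, 23, "MRN"),
    EntitySpan(54, 64, "DATE"),
]
print(apply_placeholders(note, spans))
# [REDACTED_NAME] (MRN [REDACTED_MRN]) presented with chest pain on [REDACTED_DATE].
```

The sentence structure survives intact, which is what lets a downstream pass reason about the encounter without ever seeing the underlying values.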

A reasonable question is: why not just use regex or a rule-based NER system? The answer is recall. Clinical notes are messy — dictated, templated, copy-forwarded, littered with abbreviations and non-standard formatting. Rule-based systems excel at structured fields (an MRN that always appears in a header, a date in MM/DD/YYYY format) but struggle with free-text identifiers embedded in narrative paragraphs, unusual name spellings, or identifiers that appear in unexpected locations like an assessment or plan section. A frontier language model brings contextual understanding: it recognizes that “Dr. Patel discussed the case with the patient’s daughter, Maria” contains two names that need redaction, even though neither appears in a labeled field. That said, we are transparent that no automated system is perfect. Our product UI and customer documentation instruct users not to paste actual patient names, real MRNs, or full street addresses into the submission field. The scrubber is a defense-in-depth layer, not a license to submit raw identifiers.
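The recall gap is easy to reproduce with a toy rule-based scrubber. The two patterns below are illustrative assumptions (a labeled MRN and an MM/DD/YYYY date), not a real rule set: they handle the structured header line but leave the narrative sentence, names included, untouched.

```python
import re

# Hypothetical rule-based patterns for two well-structured identifier types.
PATTERNS = {
    "DATE": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "MRN": re.compile(r"\bMRN[:#]?\s*\d{6,10}\b"),
}


def regex_scrub(text: str) -> str:
    """Replace every pattern match with its typed placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text


header = "MRN: 00482913, visit date 03/14/2024"
narrative = "Dr. Patel discussed the case with the patient's daughter, Maria."

print(regex_scrub(header))     # [REDACTED_MRN], visit date [REDACTED_DATE]
print(regex_scrub(narrative))  # unchanged: both names slip through
```

Adding a name pattern does not rescue this approach. Any rule loose enough to catch "Maria" in running text will also fire on capitalized clinical terms, and that is the gap a context-aware model is meant to close.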

What happens to the raw note after Pass 1? It’s gone. The original text is held only in volatile memory for the duration of the scrubbing inference and is discarded the moment the de-identified output is emitted. It is not written to any database, not logged, not cached, and not available to any Audit Sentinel engineer or support agent. The de-identified note is persisted as part of the audit record; the raw note is not. Customer submissions are also never used to train, fine-tune, or update the underlying foundation models — a commitment backed by our sub-processor agreement with Google Cloud. We built Pass 1 this way because we believe the strongest privacy posture isn’t “we promise not to look at your PHI.” It’s “we architecturally cannot, because it doesn’t exist past the first ten seconds.”
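The data flow described above can be sketched in Python as follows. Every name here (AuditRecord, run_audit, and the scrub, audit, and persist callables) is a hypothetical stand-in rather than the production pipeline. The point it illustrates is structural: only the scrubbing step receives the raw text, and the record handed to storage has no field that could hold it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AuditRecord:
    deidentified_note: str  # the only note text that is ever persisted
    findings: list[str]


def run_audit(
    raw_note: str,
    scrub: Callable[[str], str],
    audit: Callable[[str], list[str]],
    persist: Callable[[AuditRecord], None],
) -> AuditRecord:
    deidentified = scrub(raw_note)   # Pass 1: the only step that reads raw text
    del raw_note                     # drop this function's reference to the raw note
    findings = audit(deidentified)   # later passes see only the de-identified note
    record = AuditRecord(deidentified, findings)
    persist(record)                  # storage never receives the raw note
    return record


# Usage with stub callables standing in for the real passes and database.
stored: list[AuditRecord] = []
record = run_audit(
    "Jane Doe presented with chest pain.",
    scrub=lambda text: text.replace("Jane Doe", "[REDACTED_NAME]"),
    audit=lambda note: ["placeholder finding"],
    persist=stored.append,
)
print(stored[0].deidentified_note)  # [REDACTED_NAME] presented with chest pain.
```

Note that del only removes one reference. In a real service the guarantee also depends on the request handler, logging middleware, and crash dumps never capturing the raw string, which a sketch like this cannot show.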

Audit Sentinel AI is an educational and advisory audit tool. It is not a substitute for a certified coder, licensed attorney, or payer determination. For methodology details, see our Audit Methodology White Paper.
