Published April 23, 2026 | https://doi.org/10.59350/eh5s6-t1s24

Privacy filter: OpenAI's open-source PII scrubber

Creators & Contributors

  • 1. ROR icon University of Virginia
Feature image

Yesterday OpenAI released Privacy Filter under Apache 2.0 on Hugging Face and GitHub (announcement). It detects and masks eight categories of PII in text: names, addresses, emails, phone numbers, URLs, dates, account numbers, and secrets like API keys. 1.5B total parameters, 50M active (mixture-of-experts), 128k context, 96% F1 on PII-Masking-300k.

You can run it pretty easily with uv, as described in a previous post.

Here's a python script that runs using uv with dependencies declared inline (also copied below).

First run pulls ~2.8 GB to your HF cache. After that it's pretty fast.

Use cases: pre-scrubbing free-text fields, clinical notes, or support transcripts before they hit a frontier model provider. Caveat: this isn't a standalone anonymization guarantee. OpenAI says so plainly in the model card. Missed identifiers and over-redaction both happen.

Here's an example:

./privacy-filter.py "My name is Stephen Turner. Social is 123-45-6789. You can reach me at 434-555-1234 or notmyrealemail@example.com."

Output:

My name is [PRIVATE_PERSON]. Social is [ACCOUNT_NUMBER]. You can reach me at [PRIVATE_PHONE] or [PRIVATE_EMAIL].

Subscribe now

You might try to reach for Ollama or llama.cpp. Don't. Privacy Filter is not a generative model. It's a bidirectional token classifier with a Viterbi decoder on top, built on a gpt-oss-style backbone with the LM head swapped for a BIOES span classifier over 33 labels. You feed it text, it returns labeled spans. No chat, no completions. In other words, Ollama and llama.cpp can't run it. No GGUF exists because GGUF is for causal generative models. You need the transformers library.

Here's the privacy-filter.py script used above:

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.10"
# dependencies = [
#   "transformers>=4.50",
#   "torch",
# ]
# ///
"""
Run openai/privacy-filter over text passed as the first argument.

Usage: ./privacy_filter.py "My name is Stephen and my phone is 555-1234"
"""
import sys
from transformers import pipeline

if len(sys.argv) < 2:
    sys.exit("usage: privacy_filter.py TEXT")

text = sys.argv[1]

classifier = pipeline(
    task="token-classification",
    model="openai/privacy-filter",
    aggregation_strategy="simple",
)

spans = classifier(text)

print(f"Input: {text}\n")

if not spans:
    print("No PII detected.")
    sys.exit(0)

print(f"Detected {len(spans)} span(s):")
for s in spans:
    span_text = text[s["start"]:s["end"]]
    print(f"  [{s['entity_group']}] {span_text!r}  (score={s['score']:.3f})")

redacted = text
for s in sorted(spans, key=lambda x: x["start"], reverse=True):
    redacted = redacted[:s["start"]] + f"[{s['entity_group'].upper()}]" + redacted[s["end"]:]

print(f"\nRedacted: {redacted}")

Subscribe now

Additional details

Description

OpenAI's Apache-licensed PII scrubber and a uv/Python script to run it.

Dates

Issued
2026-04-23T11:29:50
Updated
2026-04-23T11:29:50