2024-07-10 · HuggingFace

Experimenting with Automatic PII Detection on the Hub using Presidio

research

Source: HuggingFace · Date: 2024-07-10 · URL: https://huggingface.co/blog/presidio-pii-detection

Summary

Feature announcement: Hugging Face is experimenting with automatic PII detection on the Dataset Hub using Microsoft’s Presidio. The feature generates per-dataset reports that estimate the presence of PII (emails, sensitive info) to inform training decisions. It targets both annotated PII datasets and web-crawled pre-training corpora, where PII can slip through filtering. The post shows an example report for allenai/c4.
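To make the idea of a per-dataset PII report concrete, here is a minimal sketch of how such a summary might be aggregated over rows of text. It is not the Hub's implementation: the function names (`pii_report`, `regex_email_detector`), the report fields, and the sample rows are all hypothetical, and the default detector is a simple email regex standing in for Presidio so that the example runs with the standard library alone. A comment shows how Presidio's `AnalyzerEngine` could be plugged in instead.

```python
import re
from collections import Counter
from typing import Callable, Iterable, List, Tuple

# A detector maps a text to a list of (entity_type, start, end) spans.
Detector = Callable[[str], List[Tuple[str, int, int]]]

# Stand-in detector (assumption): a rough email regex, NOT Presidio.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def regex_email_detector(text: str) -> List[Tuple[str, int, int]]:
    """Return one EMAIL_ADDRESS span per regex match in the text."""
    return [("EMAIL_ADDRESS", m.start(), m.end()) for m in EMAIL_RE.finditer(text)]


# With Presidio installed, a detector could look like this (not run here):
#   from presidio_analyzer import AnalyzerEngine
#   analyzer = AnalyzerEngine()
#   def presidio_detector(text):
#       return [(r.entity_type, r.start, r.end)
#               for r in analyzer.analyze(text=text, language="en")]


def pii_report(rows: Iterable[str], detect: Detector = regex_email_detector) -> dict:
    """Aggregate detector hits over rows into a small summary (hypothetical format)."""
    counts: Counter = Counter()
    rows_scanned = 0
    rows_with_pii = 0
    for text in rows:
        rows_scanned += 1
        spans = detect(text)
        if spans:
            rows_with_pii += 1
        counts.update(entity for entity, _, _ in spans)
    return {
        "rows_scanned": rows_scanned,
        "rows_with_pii": rows_with_pii,
        "entity_counts": dict(counts),
    }


# Made-up sample rows for illustration only.
sample = [
    "Contact jane.doe@example.com for details.",
    "No personal data in this row.",
    "Write to a@b.org or c@d.net.",
]
report = pii_report(sample)
print(report)
```

Passing the detector as a parameter keeps the aggregation independent of the detection backend, so a regex, Presidio, or any other recognizer can feed the same report.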

Implications

Thread: HF as open-source ML hub. PII reporting at the dataset level is a meaningful data governance addition: the risk isn’t just privacy harm, but also that models learn statistical associations with sensitive attributes and reproduce PII in their outputs. Surfacing Presidio reports on the Hub card reduces the due-diligence burden on practitioners, who currently have no automated signal about a dataset’s PII exposure. The “experiment” framing suggests the reports are not yet applied consistently to all datasets, but establishing the pattern matters: Hub datasets that document their own data risks are a step toward responsible data infrastructure.
