talk-data.com

Justine BEL-LETOILE

Speaker

Talks & appearances

Balancing Privacy and Utility: Efficient PII Detection and Replacement in Textual Data

Anonymizing free-text data is harder than it seems. While structured databases have well-established anonymization techniques, textual data (such as invoices, resumes, or medical records) poses unique challenges. Personally identifiable information (PII) can appear anywhere and in unpredictable formats, so how do you modify it while preserving the dataset's usefulness?

Let's explore a practical, open-source, two-step approach to text anonymization: (1) detecting PII with named entity recognition (NER) models and (2) replacing it while preserving key dataset characteristics (e.g. document formatting, statistical distributions). We will demonstrate how to build a robust pipeline using tools such as pre-trained PII detection models, GLiNER for fine-tuning, and Faker for generating meaningful replacements.
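The two steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the pipeline shown in the talk: a pair of regular expressions stands in for the NER detector, and a seeded random generator stands in for Faker. All names (`PII_PATTERNS`, `detect_pii`, `fake_value`, `anonymize`) are hypothetical.

```python
import random
import re

# Assumption: regexes stand in for an NER model and cover only two PII types.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d .-]{7,}\d"),
}


def detect_pii(text):
    """Step 1: return (start, end, label) spans sorted by position."""
    spans = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def fake_value(original, label, rng):
    """Build a surrogate that keeps the rough shape of the original value."""
    if label == "PHONE":
        # Keep separators and length so document formatting is preserved.
        return "".join(
            str(rng.randint(0, 9)) if ch.isdigit() else ch for ch in original
        )
    if label == "EMAIL":
        user = "".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(8))
        return f"{user}@example.com"
    return f"<{label}>"


def anonymize(text, seed=0):
    """Step 2: replace each detected span, reusing one surrogate per value."""
    rng = random.Random(seed)
    mapping = {}
    pieces = []
    cursor = 0
    for start, end, label in detect_pii(text):
        if start < cursor:
            continue  # skip spans overlapping one already replaced
        original = text[start:end]
        if original not in mapping:
            mapping[original] = fake_value(original, label, rng)
        pieces.append(text[cursor:start])
        pieces.append(mapping[original])
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), mapping


if __name__ == "__main__":
    sample = "Contact Jane at jane.doe@corp.org or +33 6 12 34 56 78."
    anonymized, table = anonymize(sample)
    print(anonymized)
```

Reusing the same surrogate for repeated values keeps references consistent within a document, which is one way to preserve some of the dataset's usefulness. Note that the person's name is left untouched here: catching names and other free-form entities is where an NER model would be needed.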

Ideal for those with a basic understanding of NLP, this session offers practical insights for anyone working with sensitive textual data.

For some natural language processing (NLP) tasks, depending on your production constraints, a simpler custom model can be a strong alternative to off-the-shelf large language models (LLMs), as long as you have enough high-quality data to build it. The stumbling block is how to obtain such data. Going over some practical cases, we will see how LLMs can help during this phase of an NLP project. How can they help us select the data to work on, or (pre-)annotate it? Which model is suitable for which task? What are the common pitfalls, and where should you focus your efforts?