talk-data.com

Meetup workshop 2025-03-27 at 16:00

Data Prep Kit Workshop: Clean and Prepare High-Quality Datasets

Description

A hands-on workshop on using Data Prep Kit to improve dataset quality: extract content from PDF and HTML files, clean up the data, remove spam, score and remove low-quality documents, identify and remove PII (personally identifiable information), and detect and remove HAP (hate, abuse, and profanity) speech. The code will be run in Google Colab using Python.
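
To give a rough sense of what the filtering steps in such a pipeline involve, here is a minimal sketch in plain Python. It does not use Data Prep Kit's own transforms or API, and it skips PDF/HTML extraction. Every name, pattern, and threshold in it (`SPAM_MARKERS`, `BLOCKLIST`, `min_quality`, and the helper functions) is a hypothetical stand-in chosen for illustration; a real pipeline would likely rely on trained classifiers and far more thorough PII detection.

```python
import re

# All patterns and thresholds below are illustrative assumptions,
# not part of Data Prep Kit.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SPAM_MARKERS = ("click here", "buy now", "free money")
BLOCKLIST = {"idiot"}  # stand-in for a real HAP lexicon or classifier


def redact_pii(text):
    """Replace email addresses (one simple kind of PII) with a placeholder."""
    return EMAIL_RE.sub("<EMAIL>", text)


def is_spam(text):
    """Flag a document that contains any of the known spam phrases."""
    lowered = text.lower()
    return any(marker in lowered for marker in SPAM_MARKERS)


def contains_hap(text):
    """Flag a document that contains any blocklisted word."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return bool(tokens & BLOCKLIST)


def quality_score(text):
    """Score a document as the share of alphabetic non-space characters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalpha() for c in chars) / len(chars)


def clean_corpus(docs, min_quality=0.7):
    """Drop spam, HAP, and low-quality documents, then redact PII in the rest."""
    kept = []
    for doc in docs:
        if is_spam(doc) or contains_hap(doc):
            continue
        if quality_score(doc) < min_quality:
            continue
        kept.append(redact_pii(doc))
    return kept


docs = [
    "Contact alice@example.com for the quarterly report.",
    "BUY NOW!!! Click here for free money",
    "#### 1234 ---- 5678 ####",
    "You are an idiot.",
]
cleaned = clean_corpus(docs)
print(cleaned)
```

Of the four sample documents, only the first survives, with its email address replaced by `<EMAIL>`. The filters run before redaction so that the quality score is computed on the original text.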