talk-data.com
Meetup
workshop
2025-03-27 at 16:00
Data Prep Kit Workshop: Clean and Prepare High-Quality Datasets
Description
Hands-on workshop on using Data Prep Kit to improve dataset quality: extract content from PDF and HTML files, clean the data, remove spam, score and remove low-quality documents, identify and remove PII (personally identifiable information), and detect and remove HAP (hate, abuse, and profanity) content. Code runs in Google Colab using Python.
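
To give a feel for the kind of steps the workshop covers, here is a minimal sketch in plain Python of a filter-and-redact pass over a list of documents. It does not use Data Prep Kit and does not reflect its API: the helper names (`redact_pii`, `is_low_quality`, `contains_hap`, `clean`), the email regex, the word-count threshold, and the one-word blocklist are all assumptions made for illustration. A real pipeline would likely rely on trained classifiers and far broader PII detection.

```python
import re

# Hypothetical helpers for illustration only; these are NOT Data Prep Kit APIs.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
BLOCKLIST = {"badword"}  # placeholder standing in for a real HAP classifier


def redact_pii(text: str) -> str:
    """Replace email addresses (one narrow kind of PII) with a fixed token."""
    return EMAIL_RE.sub("<EMAIL>", text)


def is_low_quality(text: str, min_words: int = 5) -> bool:
    """Crude quality heuristic: documents with too few words are low quality."""
    return len(text.split()) < min_words


def contains_hap(text: str) -> bool:
    """Flag a document if any of its words appears in the placeholder blocklist."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return not words.isdisjoint(BLOCKLIST)


def clean(docs: list[str]) -> list[str]:
    """Drop low-quality or flagged documents, then redact PII in the rest."""
    kept = []
    for doc in docs:
        if is_low_quality(doc) or contains_hap(doc):
            continue
        kept.append(redact_pii(doc))
    return kept
```

For example, calling `clean` on a list holding one normal sentence with an email address, one two-word fragment, and one sentence containing the blocklisted word would keep only the first document, with its email address replaced by `<EMAIL>`.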