talk-data.com

Topic

data prep kit

13 tagged

Activity Trend: 1 peak/qtr, 2020-Q1 to 2026-Q1

Activities

13 activities · Newest first

Hands-on workshop on cleaning and preparing high-quality datasets with Data Prep Kit. Topics include extracting content from PDFs and HTML, cleaning up markup, detecting and removing spam, scoring and removing low-quality documents, identifying and removing PII, and detecting and removing HAP (hate, abuse, and profanity) speech. More about Data Prep Kit: https://github.com/IBM/data-prep-kit
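To make the PII step concrete, here is a minimal sketch in plain Python. It does not use Data Prep Kit's own API. The two regular expressions and the `redact_pii` helper are assumptions made for illustration, and real PII detection covers many more entity types than emails and phone-like numbers.

```python
import re

# Hypothetical patterns for illustration only; they catch common email and
# phone-number shapes and will miss many real-world variants.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    return text


if __name__ == "__main__":
    print(redact_pii("Contact jane@example.com or +1 (555) 123-4567."))
```

Replacing matches with typed placeholders, rather than deleting them, keeps sentences readable for downstream training or retrieval.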

Hands-on workshop on using Data Prep Kit to improve dataset quality: extract content from PDFs and HTML, clean up data, remove spam, score and remove low-quality documents, identify and remove PII, and detect and remove HAP (hate, abuse, and profanity) speech. Code runs in Google Colab using Python.

Hands-on workshop on using Data Prep Kit to clean and prepare high-quality datasets: extract content from PDFs and HTML, clean up markup, remove spam, score and filter low-quality documents, identify and remove PII, and detect hateful or abusive language. Prerequisites: comfortable with Python. The workshop runs in Google Colab.

Hands-on session to explore Data Prep Kit and accelerate data preparation for building robust LLM applications. Topics include getting started with Data Prep Kit, extracting content from PDFs, DOCX, and HTML, cleaning up excess markup, detecting and removing duplicate documents, and removing low-quality and spam documents. Attendees should be comfortable with Python; workshop code will run in Google Colab.
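The duplicate-removal topic can be sketched with the standard library alone. The example below is an assumption about one simple approach (exact matching on a hash of normalized text), not Data Prep Kit's implementation. The `normalize` and `drop_exact_duplicates` names are hypothetical, and the sketch does not catch near-duplicates.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations hash the same."""
    return " ".join(text.lower().split())


def drop_exact_duplicates(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, compared by content hash."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


if __name__ == "__main__":
    print(drop_exact_duplicates(["Hello  World", "hello world", "Other"]))
```

Storing digests instead of full texts keeps memory use roughly constant per document, which matters once a corpus grows beyond a few thousand files.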

Hands-on workshop to explore IBM Data Prep Kit for data preparation, including getting started, extracting content from PDFs, DOCX, and HTML, cleaning markup, deduplicating data, and removing low-quality or spam documents. The session will be run in Google Colab and is suitable for LLM app developers, data scientists, and data engineers. Prerequisites: comfortable with Python.
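For a sense of what HTML extraction and markup cleanup involve, the following is a rough sketch built on Python's standard `html.parser` module. It is not how Data Prep Kit performs extraction. The `TextExtractor` class and `html_to_text` function are hypothetical names, and the sketch only drops tags plus `script` and `style` contents.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    print(html_to_text("<h1>Title</h1><p>Hello <b>world</b></p>"))
```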