talk-data.com

Hands-on workshop on cleaning and preparing high-quality datasets with Data Prep Kit. Topics include extracting content from PDFs and HTML, cleaning up markup, detecting and removing spam, scoring and removing low-quality documents, identifying and removing PII, and detecting and removing HAP (hate, abuse, profanity) speech. More about Data Prep Kit: https://github.com/IBM/data-prep-kit

Data Prep Kit Workshop: Clean and Prepare High-Quality Datasets

2025-03-27 · [AI Alliance] Workshop: Preparing High Quality Datasets with Data Prep Kit

workshop

Python google colab html parsing pdf parsing

Hands-on workshop on using Data Prep Kit to improve dataset quality: extract content from PDFs and HTML, clean up data, remove spam, score and remove low-quality documents, identify and remove PII, and detect and remove HAP (hate, abuse, profanity) speech. The code runs in Python on Google Colab.

Data Prep Kit Workshop: Data wrangling for ML and data apps

2025-03-27 · [AI Alliance] Workshop: Preparing High Quality Datasets with Data Prep Kit

workshop

Python google colab

Hands-on workshop on using Data Prep Kit to clean and prepare high-quality datasets: extract content from PDFs and HTML, clean up markup, remove spam, score and filter low-quality documents, identify and remove PII, and detect hateful or abusive language. Prerequisite: comfort with Python. The workshop runs in Google Colab.

Data Prep Kit Hands-on Workshop

2025-03-20 · [AI Alliance] Workshop: Hands-on with Data Prep Kit

workshop

Python google colab

Hands-on session to explore Data Prep Kit and accelerate data preparation for building robust LLM applications. Topics include getting started with Data Prep Kit, extracting content from PDFs, DOCX, and HTML, cleaning up excess markup, detecting and removing duplicate documents, and removing low-quality and spam documents. Attendees should be comfortable with Python; the workshop code runs in Google Colab.

Data Prep Kit Workshop

2025-03-20 · [AI Alliance] Workshop: Hands-on with Data Prep Kit

Hands-on workshop

Python google colab

Hands-on workshop to explore IBM Data Prep Kit for data preparation: getting started, extracting content from PDFs, DOCX, and HTML, cleaning up markup, deduplicating data, and removing low-quality or spam documents. The session runs in Google Colab and is suitable for LLM app developers, data scientists, and data engineers. Prerequisite: comfort with Python.

Hands-on workshop: Data Prep Kit for data preparation and LLM applications

2025-03-20 · [AI Alliance] Workshop: Hands-on with Data Prep Kit

workshop

Python google colab

Hands-on session to explore Data Prep Kit and learn how to accelerate data preparation for LLM applications. The workshop covers getting started with Data Prep Kit, extracting content from PDFs, DOCX, and HTML, cleaning up markup, deduplicating content, and detecting and removing low-quality or spam documents.