talk-data.com

Event

[AI Alliance] Workshop: Hands-on with Data Prep Kit

2025-03-20 Meetup

Overview When building machine learning and data applications, a significant portion of your time will be dedicated to data wrangling - from content extraction and cleaning to de-duplication and filtering out problematic data. In this hands-on session we will explore Data Prep Kit, an open source toolkit designed to streamline these essential tasks. Attendees will learn firsthand how to use Data Prep Kit to accelerate data preparation, improve overall data quality, and enhance the efficiency of building robust LLM applications.

Description Data Prep Kit is a comprehensive Python library that democratizes and accelerates data preparation by providing out-of-the-box solutions for common tasks. Engineered to scale from a single laptop to large cloud clusters, it has been successfully used to process terabytes of data for training IBM Granite Large Language Models (LLMs).

Data Prep Kit offers a robust feature set including duplicate elimination, advanced document and code handling, language detection (for both spoken and programming languages), removal of personally identifiable information (PII), as well as spam, hate speech, and malware detection.

More about Data Prep Kit: https://github.com/IBM/data-prep-kit

Join us for this hands-on session to explore how to use Data Prep Kit to accelerate data preparation and enhance data quality.

In this workshop we will do the following:

  • Get started with Data Prep Kit
  • Extract content from various documents (PDFs, DOCX, HTML)
  • Clean up documents by removing excess markup
  • Detect and remove duplicate documents
  • Detect and remove low quality and spam documents
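
The cleaning, de-duplication, and filtering steps listed above can be illustrated with a minimal sketch that uses only the Python standard library. This is not Data Prep Kit's API: every function name below (`strip_markup`, `drop_exact_duplicates`, `looks_low_quality`, `prepare`) and every threshold is a hypothetical stand-in chosen for illustration, and the sketch only handles HTML input and exact duplicates.

```python
import hashlib
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_markup(html):
    """Hypothetical helper: reduce an HTML string to whitespace-normalized text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def content_key(text):
    """Hash of the lowercased, whitespace-normalized text (exact-match key)."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_exact_duplicates(texts):
    """Keep the first occurrence of each distinct document, preserving order."""
    seen = set()
    unique = []
    for text in texts:
        key = content_key(text)
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


def looks_low_quality(text, min_words=5, max_repeat_ratio=0.5):
    """Assumed heuristic: flag very short or highly repetitive text."""
    words = text.lower().split()
    if len(words) < min_words:
        return True
    most_common = max(words.count(w) for w in set(words))
    return most_common / len(words) > max_repeat_ratio


def prepare(raw_html_docs):
    """Clean markup, remove exact duplicates, then drop low-quality documents."""
    cleaned = [strip_markup(doc) for doc in raw_html_docs]
    unique = drop_exact_duplicates(cleaned)
    return [text for text in unique if not looks_low_quality(text)]
```

Hashing normalized text only catches exact duplicates; near-duplicate detection and real spam classification need more than this sketch shows, which is the kind of work the workshop addresses with Data Prep Kit itself.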

What do you need to participate in this workshop?

  • Comfort with the Python programming language
  • We will run the workshop code using Google Colab (free) - no other setup is needed!

Session Type Hands-on workshop

Audience LLM app developers, data scientists, data engineers

Technical Level Beginner - Intermediate

Prerequisites

  • Comfort with the Python programming language
  • We will run the workshop using Google Colab (free) - no other setup is needed!

Duration 60 mins

Industry Cross industry

Speaker Bio https://sujee.dev/bio

About the AI Alliance The AI Alliance is an international community of researchers, developers and organizational leaders committed to supporting and enhancing open innovation across the AI technology landscape to accelerate progress, improve safety, security and trust in AI, and maximize benefits to people and society everywhere. Members of the AI Alliance believe that open innovation is essential to develop and achieve safe and responsible AI that benefits society rather than a select few big players.

Sessions & talks

Hands-on workshop: Data Prep Kit for data preparation and LLM applications

2025-03-20
workshop

Hands-on session to explore Data Prep Kit and how to accelerate data preparation for LLM applications. The workshop covers getting started with Data Prep Kit, extracting content from PDFs, DOCX, and HTML, cleaning markup, deduplicating content, and detecting/removing low-quality or spam documents.