Hands-on workshop on data engineering for large language models using the Data Prep Kit.
Hands-on workshop on cleaning and preparing high-quality datasets using Data Prep Kit. Topics include extracting content from PDFs and HTML, cleaning up markup, detecting and removing SPAM content, scoring and removing low-quality documents, identifying and removing PII data, and detecting and removing HAP (Hate Abuse Profanity) speech. More about Data Prep Kit: https://github.com/IBM/data-prep-kit
Hands-on workshop on using Data Prep Kit to extract content from PDFs/HTML, clean up data, remove SPAM, score and remove low-quality documents, identify and remove PII data, and detect and remove HAP (Hate Abuse Profanity) speech to improve dataset quality. Code will be run in Google Colab using Python.
Hands-on workshop on using Data Prep Kit to clean and prepare high-quality datasets: extract content from PDFs/HTML, clean up markup, remove spam, score and filter low-quality documents, identify and remove PII, and detect hate and abusive language. Prerequisites: comfortable with Python. The workshop runs in Google Colab.
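To give a feel for the "identify and remove PII" step named in the workshop descriptions above, here is a minimal sketch in plain Python. It does not use Data Prep Kit's own API; the function name `redact_pii`, the two regular expressions, and the placeholder tokens are assumptions made purely for illustration, and real PII detection needs far broader coverage than these patterns.

```python
import re

# Illustrative patterns only (assumptions, not Data Prep Kit's implementation).
# They cover simple email addresses and US-style 10-digit phone numbers.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact_pii(text: str) -> str:
    """Replace matched emails and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    return text


if __name__ == "__main__":
    sample = "Contact jane@example.com or 555-123-4567."
    print(redact_pii(sample))
```

Replacing matches with typed placeholders rather than deleting them keeps sentences readable for downstream training or retrieval, which is one common design choice for this kind of step.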
Hands-on session to explore Data Prep Kit and accelerate data preparation for building robust LLM applications. Topics include getting started with Data Prep Kit, extracting content from PDFs, DOCX, and HTML, cleaning up excess markup, detecting and removing duplicate documents, and removing low-quality and spam documents. Attendees should be comfortable with Python; workshop code will run in Google Colab.
Hands-on workshop to explore IBM Data Prep Kit for data preparation, including getting started, extracting content from PDFs, DOCX, and HTML, cleaning markup, deduplicating data, and removing low-quality or spam documents. The session will be run in Google Colab and is suitable for LLM app developers, data scientists, and data engineers. Prerequisites: comfortable with Python.
Hands-on session to explore Data Prep Kit and how to accelerate data preparation for LLM applications. The workshop covers getting started with Data Prep Kit, extracting content from PDFs, DOCX, and HTML, cleaning markup, deduplicating content, and detecting/removing low-quality or spam documents.
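The deduplication and low-quality filtering steps mentioned in these sessions can be sketched with the Python standard library alone. This is not Data Prep Kit's API; the helper names (`normalize`, `dedupe`, `is_low_quality`, `clean_corpus`) and the thresholds are hypothetical, and the sketch handles only exact duplicates after whitespace and case normalization, not near-duplicates.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def dedupe(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, compared by SHA-256 digest."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def is_low_quality(text: str, min_words: int = 5, min_alpha_ratio: float = 0.6) -> bool:
    """Flag very short documents or ones dominated by non-letter characters.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    words = text.split()
    if len(words) < min_words:
        return True
    alpha = sum(ch.isalpha() for ch in text)
    non_space = sum(not ch.isspace() for ch in text)
    return alpha / non_space < min_alpha_ratio


def clean_corpus(docs: list[str]) -> list[str]:
    """Deduplicate first, then drop documents flagged as low quality."""
    return [doc for doc in dedupe(docs) if not is_low_quality(doc)]


if __name__ == "__main__":
    corpus = [
        "Data Prep Kit helps prepare documents for LLMs.",
        "data  prep kit helps prepare documents for llms.",
        "$$$ !!! 123 ### 456 ??? ***",
        "too short",
    ]
    print(clean_corpus(corpus))
```

Running deduplication before the quality check means each distinct document is scored only once, which matters when a corpus contains many repeated pages.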
In this session we will review the following data preparation tools and techniques we have discussed in the previous sessions: Data Prep Kit; Docling; Open source RAG with Data Prep Kit + Milvus + Llama.
In this session we will review the following data preparation tools and techniques we have discussed in the previous sessions: Data Prep Kit, Docling, Open source RAG with Data Prep Kit + Milvus + Llama.
In this workshop, we will demonstrate implementing an end-to-end RAG pipeline using open source technologies: Data Prep Kit for processing documents; Milvus as vector database; Granite 3 as the LLM.
A detailed look at Data Prep Kit, its features and usage.
In this talk, I will introduce the capabilities of Data Prep Kit and Docling, walk you through their key features, and demonstrate how to get started with these powerful tools to streamline your data preparation workflows.