This project delivers a fully automated software pipeline that converts raw sustainability reports into ESRS-tagged, XBRL-ready disclosures for CSRD compliance. The tool ingests diverse file formats (PDF, iXBRL, CSV), classifies content using a fine-tuned BERT model, validates completeness and consistency against ESRS rules, and exports compliant XBRL packages. By automating what is traditionally a 6–12-week manual process, the tool reduces turnaround to 1–2 days and lowers costs by up to €500K.
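The validation stage described above (completeness and consistency checks on tagged facts) could look roughly like the following minimal sketch. It is an illustration under stated assumptions, not the project's implementation: the `Fact` type, the tag names, and the rule that total emissions equal the sum of the three scopes are hypothetical stand-ins, not actual ESRS taxonomy elements or rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    tag: str      # hypothetical tag name, NOT a real ESRS taxonomy element
    value: float
    unit: str


# Assumed rule set for illustration only: these four data points must be present.
SCOPE_TAGS = ("Scope1GHG", "Scope2GHG", "Scope3GHG")
REQUIRED_TAGS = set(SCOPE_TAGS) | {"TotalGHG"}


def validate(facts, tolerance=1e-6):
    """Return a list of human-readable issues; an empty list means the checks passed."""
    issues = []
    by_tag = {f.tag: f for f in facts}

    # Completeness: every required tag must appear at least once.
    for tag in sorted(REQUIRED_TAGS - set(by_tag)):
        issues.append(f"missing: {tag}")

    # Consistency: only checked when all inputs are present.
    if REQUIRED_TAGS <= set(by_tag):
        parts = sum(by_tag[t].value for t in SCOPE_TAGS)
        total = by_tag["TotalGHG"].value
        if abs(parts - total) > tolerance * max(1.0, abs(total)):
            issues.append(f"inconsistent: TotalGHG={total} but scopes sum to {parts}")
    return issues
```

Returning a list of issues rather than raising on the first failure lets a report be checked in one pass, which suits a batch pipeline where reviewers want the full set of gaps at once.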
Hands-on workshop on cleaning and preparing high-quality datasets using Data Prep Kit. Topics include extracting content from PDFs and HTML, cleaning up markup, detecting and removing spam, scoring and removing low-quality documents, identifying and removing PII, and detecting and removing HAP (hate, abuse, profanity) content. More about Data Prep Kit: https://github.com/IBM/data-prep-kit
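Two of the steps listed above, PII removal and low-quality filtering, can be sketched with the Python standard library alone. This is a simplified illustration and does not use Data Prep Kit's API: the regular expressions, the `quality_score` heuristic, and the `min_score` threshold are all assumptions chosen for the example.

```python
import re

# Deliberately simple patterns (assumptions); real PII detection needs far more coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text):
    """Replace e-mail addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return PHONE_RE.sub("<PHONE>", text)


def quality_score(text):
    """Fraction of whitespace-separated tokens that are purely alphabetic
    after stripping common trailing/leading punctuation."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(t.strip(".,;:!?").isalpha() for t in tokens) / len(tokens)


def clean_corpus(docs, min_score=0.5):
    """Drop documents scoring below min_score, then redact PII in the rest."""
    return [redact_pii(d) for d in docs if quality_score(d) >= min_score]
```

Scoring runs before redaction here so that the placeholder tokens do not lower a document's score.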
Hands-on session exploring how to use Docling for data extraction and cleanup across PDFs, HTML, and DOCX. Includes getting started with Docling, extracting content from documents, handling table and image data, and extracting content from scanned PDF documents using OCR.
Hands-on workshop on using Docling to extract and clean data from PDF and HTML documents, with OCR for scanned PDFs. Key activities: getting started with Docling; extracting content from PDFs and HTML; handling table and image data; extracting content from scanned PDFs using OCR.