talk-data.com
Event
PyData Paris 2025
Activities tracked
7
Top Topics
Sessions & talks
Showing 1–7 of 7 · Newest first
Repetita Non Iuvant: Why Generative AI Models Cannot Feed Themselves
As AI floods the digital landscape with content, what happens when it starts repeating itself? This talk explores model collapse, a progressive erosion where LLMs and image generators loop on their own results, hindering the creation of novel output.
We will show how self-training leads to bias and loss of diversity, examine the causes of this degradation, and quantify its impact on model creativity. Finally, we will also present concrete strategies to safeguard the future of generative AI, emphasizing the critical need to preserve innovation and originality.
By the end of this talk, attendees will gain insights into the practical implications of model collapse, understanding its impact on content diversity and the long-term viability of AI.
Documents Meet LLMs: Tales from the Trenches
Processing documents with LLMs comes with unexpected challenges: handling long inputs, enforcing structured outputs, catching hallucinations, and recovering from partial failures. In this talk, we’ll cover why large context windows are not a silver bullet, why chunking is deceptively hard and how to design input and output that allow for intelligent retrial. We'll also share practical prompting strategies, discuss OCR and parsing tools, compare different LLMs (and their cloud APIs) and highlight real-world insights from our experience developing production GenAI applications with multiple document processing scenarios.
Comment exploiter tout le potentiel de la GenAI tout en protégeant un corpus documentaire sensible et critique.
Building Data Science Tools for Sustainable Transformation
The current AI hype, driven by generative AI and particularly large language models, is creating excitement, fear, and inflated expectations. In this keynote, we'll explore geographic & mobility data science tools (such as GeoPandas and MovingPandas) to transform this hype into sustainable and positive development that empowers users.
ActiveTigger: A Collaborative Text Annotation Research Tool for Computational Social Sciences
The exponential growth of textual data—ranging from social media posts and digital news archives to speech-to-text transcripts—has opened new frontiers for research in the social sciences. Tasks such as stance detection, topic classification, and information extraction have become increasingly common. At the same time, the rapid evolution of Natural Language Processing, especially pretrained language models and generative AI, has largely been led by the computer science community, often leaving a gap in accessibility for social scientists.
To address this, we initiated since 2023 the development of ActiveTigger, a lightweight, open-source Python application (with a web frontend in React) designed to accelerate annotation process and manage large-scale datasets through the integration of fine-tuned models. It aims to support computational social science for a large public both within and outside social sciences. Already used by a dynamic community in social sciences, the stable version is planned for early June 2025.
From a more technical prospect, the API is designed to manage the complete workflow from project creation, embeddings computation, exploration of the text corpus, human annotation with active learning, fine-tuning of pre-trained models (BERT-like), prediction on a larger corpus, and export. It also integrates LLM-as-a-service capabilities for prompt-based annotation and information extraction, offering a flexible approach for hybrid manual/automatic labeling. Accessible both with a web frontend and a Python client, ActiveTigger encourages customization and adaptation to specific research contexts and practices.
In this talk, we will delve into the motivations behind the creation of ActiveTigger, outline its technical architecture, and walk through its core functionalities. Drawing on several ongoing research projects within the Computational Social Science (CSS) group at CREST, we will illustrate concrete use cases where ActiveTigger has accelerated data annotation, enabled scalable workflows, and fostered collaborations. Beyond the technical demonstration, the talk will also open a broader reflection on the challenges and opportunities brought by generative AI in academic research—especially in terms of reliability, transparency, and methodological adaptation for qualitative and quantitative inquiries.
The repository of the project : https://github.com/emilienschultz/activetigger/
The development of this software is funded by the DRARI Ile-de-France and supported by Progédo.