Gardez le contrôle sur l'IA: choisissez votre moteur (OpenAI, Gemini, etc.) et personnalisez les prompts pour une transparence totale.
talk-data.com
Event
PyData Paris 2025
Activities tracked
9
Top Topics
Sessions & talks
Showing 1–9 of 9 · Newest first
RAG avancé : combiner LLM et recherche intelligente pour pertinence, traçabilité et fiabilité, avec architecture robuste à grande échelle.
This talk explores the disconnect between MLOps fundamental principles and their practical application in designing, operating and maintaining machine learning pipelines. We’ll break down these principles, examine their influence on pipeline architecture, and conclude with a straightforward, vendor-agnostic mind-map, offering a roadmap to build resilient MLOps systems for any project or technology stack. Despite the surge in tools and platforms, many teams still struggle with the same underlying issues: brittle data dependencies, poor observability, unclear ownership, and pipelines that silently break once deployed. Architecture alone isn't the answer — systems thinking is.
We'll use concrete examples to walk through common failure modes in ML pipelines, highlight where analogies fall apart, and show how to build systems that tolerate failure, adapt to change, and support iteration without regressions.
Topics covered include: - Common failure modes in ML pipelines - Modular design: feature, training, inference - Built-in observability, versioning, reuse - Orchestration across batch, real-time, LLMs - Platform-agnostic patterns that scale
Key takeaways: - Resilience > diagrams - Separate concerns, embrace change - Metadata is your backbone - Infra should support iteration, not block it
Repetita Non Iuvant: Why Generative AI Models Cannot Feed Themselves
As AI floods the digital landscape with content, what happens when it starts repeating itself? This talk explores model collapse, a progressive erosion where LLMs and image generators loop on their own results, hindering the creation of novel output.
We will show how self-training leads to bias and loss of diversity, examine the causes of this degradation, and quantify its impact on model creativity. Finally, we will also present concrete strategies to safeguard the future of generative AI, emphasizing the critical need to preserve innovation and originality.
By the end of this talk, attendees will gain insights into the practical implications of model collapse, understanding its impact on content diversity and the long-term viability of AI.
Documents Meet LLMs: Tales from the Trenches
Processing documents with LLMs comes with unexpected challenges: handling long inputs, enforcing structured outputs, catching hallucinations, and recovering from partial failures. In this talk, we’ll cover why large context windows are not a silver bullet, why chunking is deceptively hard and how to design input and output that allow for intelligent retrial. We'll also share practical prompting strategies, discuss OCR and parsing tools, compare different LLMs (and their cloud APIs) and highlight real-world insights from our experience developing production GenAI applications with multiple document processing scenarios.
Et si le plus important en 2025 c’était la donnée, et non le modèle de LLM ?
Et si nous parlions qualités - et non pas qualité - de la donnée ?
Beyond Prototyping: Building Production-Level Apps with Streamlit
Streamlit is a great tool for prototyping data apps, but is it also fit for complex, production-level apps? In this talk, the Streamlit team will showcase new features, LLM integrations, and deployment options that can help you effectively use Streamlit in your company, whether it’s a small startup or a large enterprise.
Gardez le contrôle sur l'IA: choisissez votre moteur (OpenAI, Gemini, etc.) et personnalisez les prompts pour une transparence totale.
ActiveTigger: A Collaborative Text Annotation Research Tool for Computational Social Sciences
The exponential growth of textual data—ranging from social media posts and digital news archives to speech-to-text transcripts—has opened new frontiers for research in the social sciences. Tasks such as stance detection, topic classification, and information extraction have become increasingly common. At the same time, the rapid evolution of Natural Language Processing, especially pretrained language models and generative AI, has largely been led by the computer science community, often leaving a gap in accessibility for social scientists.
To address this, we initiated since 2023 the development of ActiveTigger, a lightweight, open-source Python application (with a web frontend in React) designed to accelerate annotation process and manage large-scale datasets through the integration of fine-tuned models. It aims to support computational social science for a large public both within and outside social sciences. Already used by a dynamic community in social sciences, the stable version is planned for early June 2025.
From a more technical prospect, the API is designed to manage the complete workflow from project creation, embeddings computation, exploration of the text corpus, human annotation with active learning, fine-tuning of pre-trained models (BERT-like), prediction on a larger corpus, and export. It also integrates LLM-as-a-service capabilities for prompt-based annotation and information extraction, offering a flexible approach for hybrid manual/automatic labeling. Accessible both with a web frontend and a Python client, ActiveTigger encourages customization and adaptation to specific research contexts and practices.
In this talk, we will delve into the motivations behind the creation of ActiveTigger, outline its technical architecture, and walk through its core functionalities. Drawing on several ongoing research projects within the Computational Social Science (CSS) group at CREST, we will illustrate concrete use cases where ActiveTigger has accelerated data annotation, enabled scalable workflows, and fostered collaborations. Beyond the technical demonstration, the talk will also open a broader reflection on the challenges and opportunities brought by generative AI in academic research—especially in terms of reliability, transparency, and methodological adaptation for qualitative and quantitative inquiries.
The repository of the project : https://github.com/emilienschultz/activetigger/
The development of this software is funded by the DRARI Ile-de-France and supported by Progédo.