PyData Paris 2024

Unveiling Mamba 2.0: The Future of Fast Package Management

2024-09-25

talk

Johan Mabille, Julien Jerphanion

In this presentation, we introduce Mamba 2.0, the latest version of the multi-platform, language-agnostic package manager that has garnered significant adoption within the scientific open-source community for its speed and efficiency.

Bridging the worlds: pixi reimplements pip and conda in Rust

2024-09-25

talk

Nichita Morcotilo

Bash CI/CD GitHub Rust

Pixi goes further than existing conda-based package managers in many ways:

Implemented from scratch in Rust and shipped as a single binary
Integrates a new SAT solver called resolvo
Supports lockfiles like poetry / yarn / cargo do
Cross-platform task system (simple bash-like syntax)
Interoperability with PyPI packages by integrating uv
It's 100% open-source with a permissive licence

We’re looking forward to taking a deep dive together into what conda and PyPI packages are and how we are seamlessly integrating the two worlds in pixi.

We will show you how you can easily set up your new project using just one configuration file and always have a reproducible setup in your pocket. This means that it will always run the same for your contributors, users, and CI machines (no more "but it worked on my machine!").

Using pixi's powerful cross-platform task system you can replace your Makefile and a ton of developer documentation with just pixi run task!

We’ll also look at benchmarks and explain more about the differences between the conda and PyPI ecosystems.

This talk is for everyone who has ever dealt with dependency hell.

More information about Pixi:

https://pixi.sh https://prefix.dev https://github.com/prefix-dev/pixi

Building Large Scale ETL Pipelines with Dask

2024-09-25

talk

Patrick Hoefler

Cloud Computing ETL/ELT

Building scalable ETL pipelines and deploying them in the cloud can seem daunting. It shouldn't be. Leveraging the right technologies can make this process easy. We will discuss the whole process of developing a composable and scalable ETL pipeline centred around Dask that is built entirely with open-source tools, and how we can deploy it to the cloud.

Unveiling new maps of biology with Squidpy

2024-09-25

talk

Inácio Medeiros

Spatial Transcriptomics, named the method of the year by Nature in 2020, offers remarkable visuals of gene expression across tissues and organs, providing valuable insights into biological processes. This talk presents the Squidpy library for analyzing and visualizing spatial molecular data, including demonstrations of gene expression visualization in mouse brain tissue.

Lunch Break

2024-09-25

talk

JupyterLite, Emscripten-forge, Xeus, and Mamba -- The computational quartet for in browser interactive computing

2024-09-25

talk

Thorsten Beier, Jeremy Tuloup, Ian Thomas (Publicis Spine)

Cloud Computing GitHub JavaScript

JupyterLite is a JupyterLab distribution that runs entirely in the web browser, backed by in-browser language kernels. Unlike standard JupyterLab, where kernels run in separate processes and communicate with the client by message passing, JupyterLite uses kernels that run entirely in the browser, based on JavaScript and WebAssembly.

This means JupyterLite deployments can be scaled to millions of users without the need for individual containers for each user session: only static files need to be served, which can be done with a simple web server such as GitHub Pages.

This opens up new possibilities for large-scale deployments, eliminating the need for complex cloud computing infrastructure. JupyterLite is versatile and supports a wide range of languages, with the majority of its kernels implemented using Xeus, a C++ library for developing language-specific kernels.

In conjunction with JupyterLite, we present Emscripten-forge, a conda/mamba based distribution for WebAssembly packages. Conda-forge is a community effort and a GitHub organization which contains repositories of conda recipes and thus provides conda packages for a wide range of software and platforms. However, targeting WebAssembly is not supported by conda-forge. Emscripten-forge addresses this gap by providing conda packages for WebAssembly, making it possible to create custom JupyterLite deployments with tailored conda environments containing the required kernels and packages.

In this talk, we delve deep into the JupyterLite ecosystem, exploring its integration with Xeus, Mamba, and Emscripten-forge.

We will demonstrate how this can be used to create sophisticated JupyterLite deployments with custom conda environments and give an outlook for future developments like R packages and runtime package resolution.

Leveraging LLMs to build supervised datasets suitable for smaller models

2024-09-25

talk

Cérès Carton, Justine BEL-LETOILE

LLM NLP

For some natural language processing (NLP) tasks, depending on your production constraints, a simpler custom model can be a good contender to off-the-shelf large language models (LLMs), as long as you have enough high-quality data to build it. The stumbling block is how to obtain such data. Going over some practical cases, we will see how we can leverage the help of LLMs during this phase of an NLP project. How can they help us select the data to work on, or (pre)annotate it? Which model is suitable for which task? What are the common pitfalls, and where should you put your efforts and focus?

Solara: Pure Python web apps beyond prototypes and dashboards

2024-09-25

talk

Maarten Breddels, Iisakki Rotko

API JavaScript Python

Many Python frameworks are suitable for creating basic dashboards or prototypes but struggle with more complex ones. Taking lessons from the JavaScript community, the experts on building UIs, we created a new framework called Solara. Solara scales to much more complex apps and compute-intensive dashboards. Built on the Jupyter stack, Solara apps and their reusable components run both in the Jupyter notebook and on Solara's own production-quality server based on Starlette/FastAPI.

Solara has a declarative API that is designed for dynamic and complex UIs yet is easy to write. Reactive variables power our state management, which automatically triggers rerenders. Our component-centric architecture stimulates code reusability, and hot reloading promotes efficient workflows. With our rich set of UI and data-focused components, Solara spans the entire spectrum from rapid prototyping to robust, complex dashboards.

Elevating Data Stories: Exploring Quarto Dashboard for impactful and visual communication

2024-09-25

talk

Christophe Dervieux

Dashboard

Embark on a journey to explore how Quarto Dashboard can enhance the narrative of your analysis from your Jupyter Notebook. This talk will show how to create cool interactive charts and graphs that bring your data to life, by using Quarto - an open-source scientific and technical publishing system.

Learn how to make your data communications more engaging and dynamic using Quarto Dashboard. Practical examples and simple explanations will guide you through the process, making it easy to understand and apply to your projects.

Enhancing RAG-based apps by constructing and leveraging knowledge graphs with open-weights LLMs

2024-09-25

talk

Alonso Silva

LLM RAG

Graph Retrieval Augmented Generation (Graph RAG) is emerging as a powerful addition to traditional vector search retrieval methods. Graphs are great at representing and storing heterogeneous and interconnected information in a structured manner, effortlessly capturing complex relationships and attributes across different data types. Using open-weights LLMs removes the dependency on an external LLM provider while retaining complete control over the data flows and how the data is being shared and stored. In this talk, we construct knowledge graphs and leverage the structured nature of graph databases, which organize data as nodes and relationships, to increase the depth and contextuality of retrieved information and thereby enhance RAG-based applications with open-weights LLMs. We will show these capabilities with a demo.
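
As a rough illustration of the "nodes and relationships" idea above, here is a minimal sketch in plain Python. It is not the speaker's implementation and uses no graph database: the triples, `build_index`, `retrieve_context`, and `to_prompt_context` are all hypothetical names chosen for this example. It shows how facts within a few hops of an entity could be gathered and turned into context text for an LLM prompt.

```python
from collections import defaultdict

# Hypothetical toy triples (subject, relation, object), for illustration only.
TRIPLES = [
    ("Solara", "built_on", "Jupyter"),
    ("JupyterLite", "distribution_of", "JupyterLab"),
    ("JupyterLab", "part_of", "Jupyter"),
    ("Xeus", "provides_kernels_for", "JupyterLite"),
]


def build_index(triples):
    """Index every triple under both of its endpoint nodes."""
    index = defaultdict(list)
    for triple in triples:
        subject, _, obj = triple
        index[subject].append(triple)
        index[obj].append(triple)
    return index


def retrieve_context(index, entity, hops=1):
    """Collect the triples reachable within `hops` edges of `entity`."""
    seen = {entity}
    frontier = [entity]
    facts = []
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            for triple in index.get(node, []):
                if triple not in facts:
                    facts.append(triple)
                # Queue newly discovered neighbours for the next hop.
                for other in (triple[0], triple[2]):
                    if other not in seen:
                        seen.add(other)
                        next_frontier.append(other)
        frontier = next_frontier
    return facts


def to_prompt_context(facts):
    """Render triples as plain lines that could be prepended to a prompt."""
    return "\n".join(f"{s} {r.replace('_', ' ')} {o}" for s, r, o in facts)
```

Calling `retrieve_context(build_index(TRIPLES), "JupyterLite", hops=2)` gathers the facts around one entity; a real system would typically combine such graph lookups with vector search rather than replace it.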

Jupylates: spaced repetition for teaching with Jupyter

2024-09-25

talk

Nicolas M. Thiéry, Chiara Marmo

AI/ML GitLab Pandas

Jupyter-based environments are getting a lot of traction for teaching computing, programming, and data sciences. The narrative structure of notebooks has indeed proven its value for guiding each student at their own pace to the discovery and understanding of new concepts or new idioms (e.g. how do I extract a column in pandas?). But then these new pieces of knowledge tend to quickly fade out and be forgotten. Indeed, long-term acquisition of knowledge and skills takes reinforcement by repetition. This is the foundation of many online learning platforms like Webwork or WIMS that offer exercises with randomization and automatic feedback. It is also the foundation of popular "AI-powered" apps (e.g. to learn foreign languages) that use spaced repetition algorithms designed by educational sciences and neuroscience to deliver just the right amount of repetition.

What if you could author such exercises as notebooks, to benefit from everything that Jupyter can offer (think rich narratives, computations, visualization, interactions)? What if you could integrate such exercises right into your Jupyter-based course? What if a learner could get personalized exercise recommendations based on their past learning records, without having to give these sensitive pieces of information away?

That's Jupylates (work in progress). And thanks to the open source scientific stack, it's just a small Jupyter extension.
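
To make the idea of spaced repetition concrete, here is a minimal sketch of a Leitner-style scheduler. It is only an illustration of the general technique, not the algorithm Jupylates uses: the `Card` class, the `review` and `due_cards` helpers, and the interval values are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumed review intervals (in days) per Leitner box; illustrative values only.
INTERVALS = [1, 2, 4, 8, 16]


@dataclass
class Card:
    prompt: str
    box: int = 0
    due: date = date(2024, 9, 25)


def review(card: Card, correct: bool, today: date) -> Card:
    """Promote the card one box on success; send it back to box 0 on failure."""
    box = min(card.box + 1, len(INTERVALS) - 1) if correct else 0
    return Card(card.prompt, box, today + timedelta(days=INTERVALS[box]))


def due_cards(cards, today: date):
    """Return the cards whose next review date has been reached."""
    return [card for card in cards if card.due <= today]
```

Each correct answer pushes the next review further into the future, while a mistake brings the exercise back the next day, so well-known items are repeated less often than shaky ones.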

Collaborative editing in Jupyter

2024-09-25

talk

David Brochart

AI/ML

The Jupyter stack has undergone a significant transformation in recent years with the integration of collaborative editing features: users can now modify a shared document and see each other's changes in real time, with a user experience akin to that of Google Docs. The underlying technology relies on a special family of data structures called Conflict-free Replicated Data Types (CRDTs), which automatically resolve conflicts when concurrent changes are made. This allows data to be distributed rather than centralized in a server, letting clients work as if data were local rather than remote. In this talk, we look at new possibilities that CRDTs can unlock, and how they are redefining Jupyter's architecture. Different use cases are presented: a suggestion system similar to Google Docs', a chat system allowing collaboration with an AI agent, an execution model allowing full notebook state recovery, and a collaborative widget model. We also look at the benefits of using CRDTs in JupyterLite, where users can interact without a server. This may be a great example of a distributed system where every user owns their data and shares them with their peers.
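
For readers unfamiliar with CRDTs, the following minimal sketch shows one of the simplest textbook examples, a grow-only counter (G-Counter). It is not the data structure Jupyter uses for shared documents; the `GCounter` class is a hypothetical name used here only to show how replicas can accept changes independently and still converge when merged.

```python
class GCounter:
    """Grow-only counter: each replica only increments its own slot."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> highest count seen for that replica

    def increment(self, amount=1):
        current = self.counts.get(self.replica_id, 0)
        self.counts[self.replica_id] = current + amount

    def merge(self, other):
        """Take the per-replica maximum; commutative, associative, idempotent."""
        for replica_id, count in other.counts.items():
            self.counts[replica_id] = max(self.counts.get(replica_id, 0), count)

    @property
    def value(self):
        return sum(self.counts.values())


# Two replicas change their local state concurrently, then exchange state.
alice, bob = GCounter("alice"), GCounter("bob")
alice.increment(2)
bob.increment(3)
alice.merge(bob)
bob.merge(alice)
```

Because merging is order-independent and repeatable, no central server has to decide which change wins, which is the property that lets clients treat shared data as if it were local.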

Evaluating the evaluator: RAG eval libraries under the loop

2024-09-25

talk

Maria Knorps, Nour El Mawass

LLM RAG

Retrieval-augmented generation (RAG) has become a key application for large language models (LLMs), enhancing their responses with information from external databases. However, RAG systems are prone to errors, and their complexity has made evaluation a critical and challenging area. Various libraries (like RAGAS and TruLens) have introduced evaluation tools and metrics for RAGs, but these evaluations involve using one LLM to assess another, raising questions about their reliability. Our study examines the stability and usefulness of these evaluation methods across different datasets and domains, focusing on the effects of the choice of the evaluation LLM, query reformulation, and dataset characteristics on RAG performance. It also assesses the stability of the metrics on multiple runs of the evaluation and how metrics correlate with each other. The talk aims to guide users in selecting and interpreting LLM-based evaluations effectively.

High Performance Data Visualization for the Web

2024-09-25

talk

Tim Paine

DataViz JavaScript Data Streaming

Are you looking for a high performance visualization component for the web? Need to filter, sort, pivot, and aggregate static or streaming data in real time? Daunted by the massive JS ecosystem? In this talk, we’ll build a high performance web frontend using the open source library Perspective.

Coffee Break

2024-09-25

talk

Building web-based engineering applications with JupyterLab components.

2024-09-25

talk

Trung Le

Cloud Computing

In the past few years, web-based engineering software has been steadily gaining momentum over traditional desktop-based applications. It represents a significant shift in how engineers access, collaborate, and utilize software tools for design, analysis, and simulation tasks. However, converting desktop-based applications to web applications presents considerable challenges, especially in translating the functionality of desktop interfaces to the web. It requires careful planning and design expertise to ensure intuitive navigation and responsiveness.

JupyterLab provides a flexible, interactive environment for scientific computing. Despite its popularity among data scientists and researchers, the full potential of JupyterLab as a platform for building scientific web applications has yet to be realized.

In this talk, we will explore how its modular architecture and extensive ecosystem facilitate the seamless integration of components for diverse functionalities: from rich user interfaces, accessibility, and real-time collaboration to cloud deployment options. To illustrate the platform's capabilities, we will demo JupyterCAD, a parametric 3D modeler built on top of JupyterLab components.

Python 3.12's new monitoring and debugging API

2024-09-25

talk

Johannes Bechberger

API Python

Python 3.12 introduced a new low-impact monitoring API with PEP 669, which can be used to implement far faster debuggers than ever before. This talk covers the main advantages of this API and how you can use it to develop small tools.
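
As a taste of what a small tool built on this API might look like, here is a minimal sketch that counts how often each Python function starts executing. It requires Python 3.12 or later and is not taken from the talk; `count_calls` is a hypothetical helper, and reusing the predefined `PROFILER_ID` slot is an assumption that no other profiler is active in the process.

```python
import sys


def count_calls(func, *args, **kwargs):
    """Run func and count PY_START events per function name (Python 3.12+)."""
    mon = sys.monitoring
    tool_id = mon.PROFILER_ID  # assumes this tool slot is currently free
    counts = {}

    def on_py_start(code, instruction_offset):
        # Called each time a Python function begins executing.
        counts[code.co_name] = counts.get(code.co_name, 0) + 1

    mon.use_tool_id(tool_id, "call-counter")
    mon.register_callback(tool_id, mon.events.PY_START, on_py_start)
    mon.set_events(tool_id, mon.events.PY_START)
    try:
        result = func(*args, **kwargs)
    finally:
        # Always switch monitoring off again and release the tool slot.
        mon.set_events(tool_id, mon.events.NO_EVENTS)
        mon.register_callback(tool_id, mon.events.PY_START, None)
        mon.free_tool_id(tool_id)
    return result, counts
```

Calling `count_calls(some_function, some_argument)` returns the function's result together with a dictionary of call counts. Only the events that are explicitly enabled incur any cost, which is where the low overhead of this API comes from.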

Would you rely on ChatGPT to dial 911? A talk on balancing determinism and probabilism in production machine learning systems

2024-09-25

talk

Nicolas Guenon des Mesnards

AI/ML GenAI LLM NLP

In the last year there hasn’t been a day that passed without us hearing about a new generative AI innovation that will enhance some aspect of our lives. On a number of tasks large probabilistic systems are now outperforming humans, or at least they do so “on average”. “On average” means most of the time, but in many real life scenarios “average” performance is not enough: we need correctness ALL of the time, for example when you ask the system to dial 911.

In this talk we will explore the synergy between deterministic and probabilistic models to enhance the robustness and controllability of machine learning systems. Tailored for ML engineers, data scientists, and researchers, the presentation delves into the necessity of using both deterministic algorithms and probabilistic model types across various ML systems, from straightforward classification to advanced Generative AI models.

You will learn about the unique advantages each paradigm offers and gain insights into how to most effectively combine them for optimal performance in real-world applications. I will walk you through my past and current experiences in working with simple and complex NLP models, and show you what kind of pitfalls, shortcuts, and tricks are possible to deliver models that are both competent and reliable.

The session will be structured into a brief introduction to both model types, followed by case studies in classification and generative AI, concluding with a Q&A segment.

Keynote: DIY Personalization: When, how and why to build your own models

2024-09-25

talk

Katharine Jarmul (Cape Privacy)

AI/ML

With increased ease of smaller "AI" models, better chips and on-device learning, is it possible now to build and train your own models for your own use? In this keynote, we'll explore learnings of small, medium and large-sized model personalization, but driven by yourself and for yourself. A walk through what's possible, what's not and what we should prioritize if we'd like AI & ML to be made for everyone.

Introductory PyData Paris

2024-09-25

talk

Breakfast

2024-09-25

talk
