PyData Paris 2024

Foundational Models for Time Series Forecasting: are we there yet?

2024-09-26

talk

Luca Baggi , Gabriele Orlandi

LLM NLP

Transformers are everywhere: NLP, computer vision, sound generation and even protein folding. Why not in forecasting? After all, what ChatGPT does is predict the next word. Why isn't this architecture state-of-the-art in the time series domain?

In this talk, you will learn how Amazon's Chronos and Salesforce's Moirai transformer-based forecasting models work, which datasets were used to train them, and how to evaluate them to see whether they are a good fit for your use case.
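As an illustration of the evaluation step, here is a minimal sketch of one widely used scale-free metric, the Mean Absolute Scaled Error (MASE). The abstract does not say which metrics the talk uses, so the choice of MASE, the function name and the toy numbers are assumptions.

```python
def mase(train, actual, forecast, season=1):
    """Mean Absolute Scaled Error.

    Forecast MAE divided by the in-sample MAE of a seasonal-naive
    forecast (each value predicted by the value `season` steps earlier).
    Values below 1 mean the forecast beats the naive baseline.
    """
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if len(train) <= season:
        raise ValueError("train must be longer than the seasonal period")
    naive_errors = [abs(train[i] - train[i - season])
                    for i in range(season, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("naive baseline has zero error; MASE is undefined")
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale


# Toy data (hypothetical): history, held-out values, model forecast.
score = mase(train=[1, 2, 3, 4, 5], actual=[6, 7], forecast=[5, 5])
print(score)
```

Because the error is scaled by a naive baseline on each series, scores can be averaged across series with very different magnitudes, which is useful when comparing a pretrained model against simple per-series baselines.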

Is your marketing effective? Let Bayes decide!

2024-09-26

talk

Emanuele Fabbiani

AI/ML Analytics HTML Marketing Python

Understanding the effectiveness of various marketing channels is crucial to maximise the return on investment (ROI). However, the limitation of third-party cookies and an ever-growing focus on privacy make it difficult to rely on basic analytics. This talk discusses a pioneering project where a Bayesian model was employed to assess the marketing media mix effectiveness of WeRoad, the fastest-growing Italian tour operator.

The Bayesian approach allows for the incorporation of prior knowledge, seamlessly updating it with new data to provide robust, actionable insights. This project leveraged a Bayesian model to unravel the complex interactions between marketing channels such as online ads, social media, and promotions. We'll dive deep into how the Bayesian model was designed, discussing how we provided the AI system with expert knowledge, and presenting how delays and saturation were modelled.

We will also cover the technical implementation, discussing how Python, PyMC, and Streamlit provided all the tools we needed to develop an effective, efficient, and user-friendly system.

Attendees will walk away with:

A simple understanding of the Bayesian approach and why it matters.
Concrete examples of the transformative impact on WeRoad's marketing strategy.
A blueprint to harness predictive models in their business strategies.
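The delays and saturation mentioned in the abstract are often modelled in media mix models with an adstock (carry-over) transform followed by a saturation curve. Below is a minimal sketch, assuming geometric adstock and a simple hyperbolic saturation. The abstract does not say which functional forms the WeRoad model used, and the function names and parameter values are illustrative only.

```python
def geometric_adstock(spend, decay):
    """Delayed effect: each period keeps a fraction `decay` of the
    previous period's accumulated effect, plus the new spend."""
    carried = 0.0
    effect = []
    for x in spend:
        carried = x + decay * carried
        effect.append(carried)
    return effect


def saturate(x, half_saturation):
    """Diminishing returns: the response grows towards 1 as x grows,
    and equals 0.5 when x == half_saturation."""
    return x / (x + half_saturation)


# Hypothetical weekly spend on one channel: a single burst, then nothing.
adstocked = geometric_adstock([100, 0, 0], decay=0.5)
response = [saturate(x, half_saturation=50.0) for x in adstocked]
print(adstocked, response)
```

In a Bayesian setting, `decay` and `half_saturation` would not be fixed constants as they are here. They would be given prior distributions (for example encoding expert knowledge about each channel) and inferred from the data.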

Visualization of the sky in Notebooks: the ipyaladin widget extension

2024-09-26

talk

Matthieu Baumann , Manon Marchand

API GitHub Python

Aladin lets you visualize images of the sky or planetary surfaces, much like an astronomical "OpenStreetMap" app. The view can be panned and explored interactively. The ipyaladin widget, which brings Aladin into the Jupyter Notebook environment, extends these abilities with a Python API. Users can send astronomical data in standard formats back and forth between the viewer and their Python code. Such data can be images of the sky at different wavelengths, but also tabular data, complex shapes that characterize telescope observation regions, or even special sky features (such as the probability region for the origin of a gravitational event).

With these existing features, and the work we are currently doing with the new development framework anywidget, ipyaladin is close to version 1.0.0. Its beta version is already used in several experimental science platforms, for example in the ESCAPE (European Science Cluster of Astronomy & Particle Physics) project and in the experimental SKA (Square Kilometre Array, a telescope for radio astronomy) analysis platform.

In this presentation, we will share our experience of developing a widget with anywidget compared to the bare ipywidgets framework, and demonstrate the widget's functionality through scientific use cases.

Coffee Break

2024-09-26

talk

Color-composite images from the James Webb Space Telescope

2024-09-26

talk

Camilla Pacifici , Jesse Averbukh

Cloud Computing Python

The astronomical community has built a good amount of software to visualize and analyze the images obtained with the James Webb Space Telescope (JWST). In this talk, I will present the open-source Python package Jdaviz. I will show you how to visualize publicly available JWST images and build the pretty color images that we have all seen in the media. Half the talk will be an introduction to JWST and Jdaviz, and half will be a hands-on session on a cloud platform (you will only need to create an account) or on your own machine (the package is available on PyPI).

Fast NetworkX and How Accelerated Backends Are Changing Graph Analytics

2024-09-26

talk

Erik Welch , Rick Ratzel

Analytics Pandas Python

NetworkX is arguably the most popular graph analytics library available today, but one of its greatest strengths, the pure-Python implementation, is also possibly its biggest weakness. Whether you're a seasoned data scientist or a new student of the fascinating field of graph analytics, you're probably familiar with NetworkX and interested in how to make this extremely easy-to-use library powerful enough to handle realistically large graph workflows that often exceed the limitations of its pure-Python implementation.

This talk will describe a relatively new capability of NetworkX: support for accelerated backends, and how they can benefit NetworkX users by allowing it to finally be both easy to use and fast. Through the use of backends, NetworkX can also be incorporated into workflows that take advantage of similar accelerators, such as Accelerated Pandas (cudf.pandas), to finally make these easy-to-use solutions scale to larger problems.

Attend this talk to learn about how you can leverage the various backends available to NetworkX today to seamlessly run graph analytics on GPUs, use GraphBLAS implementations, and more, all without leaving the comfort and convenience of the most popular graph analytics library available.

Unpack business metrics to explain their evolution

2024-09-26

talk

Max Halford

Analytics KPI Python

One of the more mundane tasks in the business analytics world is to measure KPIs: averages, sums, ratios, etc. Typically, these are measured period over period, to see how they trend. If you're a data analyst, you've likely been asked to debug/explain a metric, because a stakeholder wants to understand why a number has changed.

This topic isn't well grounded in theory, and the answers we come up with can be lacklustre. In this talk, we discuss solutions to this very common problem. We will look at a methodology we have developed at Carbonfact, and the open-source Python tool we are sharing.
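To make the problem concrete, here is a minimal sketch of one simple way to explain a change in a sum metric: split each segment's contribution into a volume effect (the count changed) and a mean effect (the average value changed). This is an illustrative decomposition only, not necessarily the methodology developed at Carbonfact, and the segment names and numbers are hypothetical.

```python
def explain_sum_change(before, after):
    """Decompose the change of a total (sum over segments of count * mean).

    `before` and `after` map segment -> (count, mean). For each segment:
        volume effect = (n1 - n0) * m0
        mean effect   = n1 * (m1 - m0)
    The two effects add up exactly to n1 * m1 - n0 * m0.
    """
    effects = {}
    for segment in sorted(set(before) | set(after)):
        n0, m0 = before.get(segment, (0, 0.0))
        n1, m1 = after.get(segment, (0, 0.0))
        effects[segment] = ((n1 - n0) * m0, n1 * (m1 - m0))
    return effects


# Hypothetical revenue by sales channel: (number of orders, average basket).
last_month = {"web": (10, 5.0), "store": (4, 20.0)}
this_month = {"web": (12, 5.0), "store": (4, 25.0)}

for segment, (volume, mean) in explain_sum_change(last_month, this_month).items():
    print(f"{segment}: volume effect {volume:+.1f}, mean effect {mean:+.1f}")
```

Here the total moves from 130 to 160: the web channel contributes through more orders at the same basket size, the store channel through a larger basket at the same number of orders.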

Keynote: Building with Mistral

2024-09-26

talk

Sophia Yang

AI/ML LLM

In the rapidly evolving landscape of Artificial Intelligence (AI), open source and openness in AI have emerged as crucial factors in fostering innovation, transparency, and accountability. Mistral AI's release of the open-weight Mistral 7B model has sparked significant adoption and demand, highlighting the importance of open source and customization in building AI applications. This talk focuses on the Mistral AI model landscape, the benefits of open source and customization, and the opportunities for building AI applications using Mistral models.

Forewords

2024-09-26

talk

Breakfast

2024-09-26

talk

Community

2024-09-25

talk

Lightning Talks

2024-09-25

talk

Keynote: Open-source AI: why it matters and how to get started

2024-09-25

talk

Merve Noyan (Hugging Face)

AI/ML

In this talk, we will go through everything open-source AI: the state of open-source AI, why it matters, the future of it and how you can get started with it.

Coffee Break

2024-09-25

talk

On the structure and reproducibility of Python packages - data crunch

2024-09-25

talk

Zhihan Zhang , Maria Knorps

Python

Did you know that all top PyPI packages declare their 3rd party dependencies? In contrast, only about 53% of scientific projects do the same. The question arises: How can we reproduce Python-based scientific experiments if we're unaware of the necessary libraries for our environment? In this talk, we delve into the Python packaging ecosystem and employ a data-driven approach to analyze the structure and reproducibility of packages. We compare two distinct groups of Python packages: the most popular ones on PyPI, which we anticipate to adhere more closely to best practices, and a selection from biomedical experiments. Through our analysis, we uncover common development patterns in Python projects and utilize our open-source library, FawltyDeps, to identify undeclared dependencies and assess the reproducibility of these projects. This discussion is especially valuable for enthusiasts of clean Python code, as well as for data scientists and engineers eager to adopt best practices and enhance reproducibility. Attendees will depart with actionable insights on enhancing the transparency and reliability of their Python projects, thereby advancing the cause of reproducible scientific research.

Polars Plugins: how you (yes, you!) can extend Polars

2024-09-25

talk

Marco Gorelli

API Polars Python Rust

Polars is a dataframe library taking the world by storm. It is very efficient in both runtime and memory, and comes with a clean and expressive API. Sometimes, however, the built-in API isn't enough. And that's where its killer feature comes in: plugins. You can extend Polars and solve practically any problem.

This talk is aimed at data practitioners. No prior Rust experience is required, but intermediate Python and programming experience is. By the end of the talk, you will know how to write your own Polars plugin!

xsimd: from xtensor to firefox

2024-09-25

talk

Serge « sans » Paille

Arrow

Almost all modern CPUs have a vector processing unit, making it possible to write faster code for a large category of problems, at the cost of portability: there are many different instruction sets in the wild! The xsimd library makes it possible to write portable C++ code that targets different architectures and sub-architectures. The specialization choice can be made at compile time or at runtime, using a provided dispatching mechanism. Intel, ARM, RISC-V and WebAssembly are supported, and the library has already been adopted by xtensor, Pythran, Apache Arrow and Firefox.

Geoscience at Massive Scale

2024-09-25

talk

Hendrik Makait

API Cloud Computing NumPy

When scaling geoscience workloads to large datasets, many scientists and developers reach for Dask, a library for distributed computing that plugs seamlessly into Xarray and offers an Array API that wraps NumPy. Featuring a distributed environment capable of running your workload on large clusters, Dask promises to make it easy to scale from prototyping on your laptop to analyzing petabyte-scale datasets.

Dask has been the de-facto standard for scaling geoscience, but it hasn’t entirely lived up to its promise of operating effortlessly at massive scale. This comes up in a few ways:

- Correctly chunking your dataset has a significant impact on Dask’s ability to scale
- Workers accidentally run out of memory due to:
  - Data being loaded too eagerly
  - Rechunking
  - Unmanaged memory

Over the last few months, Dask has addressed many of those pains and continues to do so through:

- Improvements to its scheduling algorithms
- A faster and more memory-stable method for rechunking
- First-of-its-kind logical optimization layer for a distributed array framework (ongoing)

Join us as we dive into real-world geoscience workloads, exploring how Dask empowers scientists and developers to run their analyses at massive scale. Discover the impact of improvements made to Dask, ongoing challenges, and future plans for making it truly effortless to scale from your laptop to the cloud.
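The chunking concern raised in the abstract comes down to simple arithmetic: a chunk's memory footprint is the product of its dimensions times the element size, and the number of chunks determines how many tasks must be scheduled. The sketch below uses only the standard library. The array shape and chunk shapes are hypothetical and not taken from the talk.

```python
from math import prod


def chunk_nbytes(chunk_shape, itemsize):
    """Memory footprint in bytes of one full chunk."""
    return prod(chunk_shape) * itemsize


def n_chunks(array_shape, chunk_shape):
    """Number of chunks needed to cover the array (ceil division per axis)."""
    return prod(-(-dim // size) for dim, size in zip(array_shape, chunk_shape))


# Hypothetical dataset: one year of daily fields on a 721 x 1440 grid,
# stored as float32 (4 bytes per element).
shape = (365, 721, 1440)

for chunks in [(1, 721, 1440), (73, 721, 1440)]:
    size_mb = chunk_nbytes(chunks, itemsize=4) / 1e6
    print(f"chunks={chunks}: {n_chunks(shape, chunks)} chunks of {size_mb:.1f} MB")
```

Many small chunks mean more tasks and more scheduling overhead, while a few very large chunks make it more likely that a worker runs out of memory, so a chunking scheme is usually a compromise between the two.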

The expanding Apache Arrow universe - standardizing and accelerating tabular data access and interchange

2024-09-25

talk

Joris Van den Bossche

Arrow

Apache Arrow has become a de-facto standard for efficient in-memory columnar data representation. Beyond the standardized and language-independent columnar memory format for tabular data, the Apache Arrow project also has a growing set of supplementary specifications and language implementations. This talk will give an overview of the recent developments in the Apache Arrow ecosystem, including ADBC, nanoarrow, new data types, and the Arrow PyCapsule protocol.

talk-data.com
