Data governance is a key cultural transformation in the era of Big Data and AI. This talk explores how to make it a …
Event: PyData Paris 2025
Sessions & talks
Discover how to take action today at our demo session at Big Data & IA Paris.
From Big Data to instant video: the customer experience reinvented by PULP'IN
Discover how to generate data-driven experiences at scale …
A Journey Through a Geospatial Data Pipeline: From Raw Coordinates to Actionable Insights
Every dataset has a story — and when it comes to geospatial data, it’s a story deeply rooted in space and scale. But working with geospatial information is often a hidden challenge: massive file sizes, strange formats, projections, and pipelines that don't scale easily.
In this talk, we'll follow the life of a real-world geospatial dataset, from its raw collection in the field to its transformation into meaningful insights. Along the way, we’ll uncover the key steps of building a robust, scalable open-source geospatial pipeline.
Drawing on years of experience at Camptocamp, we’ll explore:
- How raw spatial data is ingested and cleaned
- How vector and raster data are efficiently stored and indexed (PostGIS, Cloud Optimized GeoTIFFs, Zarr)
- How modern tools like Dask, GeoServer, and STAC (SpatioTemporal Asset Catalogs) help process and serve geospatial data
- How to design pipelines that handle both "small data" (local shapefiles) and "big data" (terabytes of satellite imagery), illustrated by the sketch below
- Common pitfalls and how to avoid them when moving from prototypes to production
This journey will show how the open-source ecosystem has matured to make geospatial big data accessible — and how spatial thinking can enrich almost any data project, whether you are building dashboards, doing analytics, or setting the stage for machine learning later on.
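As a rough illustration of the "small data" versus "big data" ends of such a pipeline (not code from the talk), the sketch below assumes geopandas, geoalchemy2, rioxarray, and Dask are available; every file path, URL, table name, and connection string is a placeholder.

```python
# Minimal sketch of both ends of a geospatial pipeline; all paths/URLs are placeholders.
import geopandas as gpd
import rioxarray
from sqlalchemy import create_engine

# "Small data": ingest a local shapefile, normalise its projection, load it into PostGIS.
gdf = gpd.read_file("parcels.shp")                 # read vector data
gdf = gdf.to_crs(epsg=4326)                        # reproject to a common CRS
engine = create_engine("postgresql://user:password@localhost:5432/gis")
gdf.to_postgis("parcels", engine, if_exists="replace")  # write to a PostGIS table

# "Big data": open a Cloud Optimized GeoTIFF lazily with Dask-backed chunks.
cog = rioxarray.open_rasterio(
    "https://example.com/imagery/scene.tif",       # COGs can be read remotely via HTTP range requests
    chunks={"x": 2048, "y": 2048},                 # each chunk becomes a Dask task
)
band_mean = cog.sel(band=1).mean()                 # lazy: nothing is downloaded yet
print(float(band_mean.compute()))                  # compute() fetches only the chunks it needs
```

The same lazy-chunking idea carries over to Zarr stores and to STAC-driven collections of many scenes.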
CodeCommons: Towards transparent, richer and sustainable datasets for code generation model training
Built on top of Software Heritage, the largest public archive of source code, the CodeCommons collaboration is building a large-scale, metadata-rich source code dataset designed to make training AI models on code more transparent, sustainable, and fair. Code will be enriched with contextual information such as issues, pull request discussions, licensing data, and provenance. In this presentation, we will present the goals and structure of both the Software Heritage and CodeCommons projects, and discuss our particular contribution to CodeCommons' big data infrastructure.
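To make the notion of an "enriched" code sample concrete, here is a purely hypothetical sketch of what one record in such a dataset might carry; none of the field names come from CodeCommons itself.

```python
# Hypothetical shape of one metadata-enriched code sample; all field names are
# illustrative, not the actual CodeCommons schema.
enriched_sample = {
    "swhid": "swh:1:cnt:...",                      # Software Heritage persistent identifier (truncated placeholder)
    "code": "def parse(line): ...",                # the archived source content
    "license": "MIT",                              # licensing data detected at the origin
    "provenance": {
        "origin": "https://example.org/project.git",
        "revision": "...",                         # commit the content was seen in (placeholder)
    },
    "context": {
        "issues": ["Parser crashes on empty input"],                  # linked issue discussion
        "pull_request_comments": ["Please add a test for the empty case"],
    },
}
```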