Topic

PySpark

big_data distributed_computing python

Activities

2

tagged

Activity Trend: peak of 14 per quarter (2020-Q1 to 2026-Q1)

Top Events

O'Reilly Data Engineering Books 19
Databricks DATA + AI Summit 2023 16
Data + AI Summit 2025 13
Data Engineering Podcast 4
O'Reilly Data Science Books 2
PyData Berlin 2025 2
PyData Cardiff - July 2025 1
From a Fintech lens: MCP server live-coding & feature selection data hacks 1
dbt Coalesce 2025 1
PyData Seattle 2025 1
PyConDE & PyData Berlin 2023 1
SciPy 2025 1

Top Speakers

Tobias Macey 4
Marco Gorelli (Narwhals) 3
Denny Lee (Databricks) 3
Pramod Singh 3
Sundar Krishnan 2
Tomasz Drabas 2
Raju Kumar Mishra 2
Allison Wang (Databricks) 2
Ramcharan Kakarla 2
Xiao Li (Databricks) 2
Stuart Moncada (Google Cloud) 1
Benjamin Bengfort 1

Activities

Filtering by: PyData Berlin 2025

From Manual to LLMs: Scaling Product Categorization

2025-09-02 · PyData Berlin 2025

talk

by Ansgar Grüne, Giampaolo Casolla

AI/ML API LLM

How can you use LLMs to categorize hundreds of thousands of products into 1,000 categories at scale? Learn about our journey from manual and rule-based methods, via fine-tuned semantic models, to a robust multi-step process that uses embeddings and LLMs via the OpenAI APIs. This talk offers data scientists and AI practitioners lessons and best practices for putting such a complex LLM-based system into production. These include prompt development, balancing cost against accuracy through model selection, testing multi-case vs. single-case prompts, and saving costs by using the OpenAI Batch API and a smart early-stopping approach. We also describe our automation and monitoring in a PySpark environment.
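The abstract does not spell out the pipeline, so the following is only a minimal sketch of what a multi-step "embeddings first, LLM second" categorizer could look like. Everything here is an assumption made for illustration: the function names, the `confident_at` threshold, and in particular the reading of "early stopping" as skipping the LLM call when the embedding match is already confident. The `ask_llm` callable is a hypothetical stand-in for a real API call, and the vectors are assumed to be precomputed embeddings.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def shortlist(product_vec: Vector, category_vecs: Dict[str, Vector], k: int) -> List[str]:
    """Step 1: rank categories by embedding similarity and keep the top k."""
    ranked = sorted(
        category_vecs,
        key=lambda name: cosine(product_vec, category_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


def categorize(
    product_vec: Vector,
    category_vecs: Dict[str, Vector],
    ask_llm: Callable[[List[str]], str],  # hypothetical stand-in for an LLM call
    k: int = 5,
    confident_at: float = 0.98,  # assumed threshold, not taken from the talk
) -> Tuple[str, bool]:
    """Return (category, llm_was_called)."""
    candidates = shortlist(product_vec, category_vecs, k)
    if not candidates:
        raise ValueError("no categories to choose from")
    top = candidates[0]
    # Assumed early-stopping rule: a very close embedding match is accepted
    # as-is, which saves the cost of an LLM call.
    if cosine(product_vec, category_vecs[top]) >= confident_at:
        return top, False
    # Step 2: otherwise let the LLM choose among the shortlisted candidates.
    return ask_llm(candidates), True
```

In a production setting, `ask_llm` would presumably wrap a prompt and an API request, and the shortlist keeps that prompt small: the model chooses among a handful of candidates rather than all 1,000 categories.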

Narwhals: enabling universal dataframe support

2025-09-02 · PyData Berlin 2025

talk

by Marco Gorelli (Narwhals)

Data Science DuckDB Pandas Plotly Polars

Ever tried passing a Polars DataFrame to a data science library and found that it...just works? No errors, no panics, no noticeable overhead, just...results? This is becoming increasingly common in 2025, yet only two years ago it was mostly unheard of. So, what changed? A large part of the answer is: Narwhals.

Narwhals is a lightweight compatibility layer between dataframe libraries which lets your code work seamlessly across Polars, pandas, PySpark, DuckDB, and more! And it's not just a theoretical possibility: with ~30 million monthly downloads and a place as a required dependency of Altair, Bokeh, Marimo, Plotly, Shiny, and more, it's clear that Narwhals is reshaping the data science landscape. By the end of the talk, you'll understand why writing generic dataframe code was such a headache (and why it isn't anymore), how Narwhals works and how its community operates, and how you can use it in your projects today. The talk will be technical yet accessible and light-hearted.
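To make the compatibility-layer idea concrete, here is a toy sketch in plain Python. It does not use Narwhals, and none of the names below (`Frame`, `from_native`, `scale_column`) are Narwhals' actual API; they are hypothetical and only illustrate the general pattern of wrapping different native table layouts behind one interface and handing back the caller's original layout. The two "backends" are assumed stand-ins: a dict of column lists and a list of row dicts.

```python
from typing import Any, Dict, List, Union

Native = Union[Dict[str, List[Any]], List[Dict[str, Any]]]


class Frame:
    """Backend-neutral wrapper (hypothetical; not a Narwhals class)."""

    def __init__(self, columns: Dict[str, List[Any]], kind: str) -> None:
        self._columns = columns
        self._kind = kind  # remembers which native layout the data came from

    def with_scaled(self, name: str, factor: float) -> "Frame":
        """Return a new Frame with one column multiplied by factor."""
        new_columns = {key: list(values) for key, values in self._columns.items()}
        new_columns[name] = [value * factor for value in new_columns[name]]
        return Frame(new_columns, self._kind)

    def to_native(self) -> Native:
        """Convert back to the layout the caller originally passed in."""
        if self._kind == "columns":
            return {key: list(values) for key, values in self._columns.items()}
        n_rows = len(next(iter(self._columns.values()), []))
        return [
            {key: values[i] for key, values in self._columns.items()}
            for i in range(n_rows)
        ]


def from_native(data: Native) -> Frame:
    """Wrap either supported layout in the common Frame interface."""
    if isinstance(data, dict):
        return Frame({key: list(values) for key, values in data.items()}, "columns")
    if isinstance(data, list):
        names = list(data[0]) if data else []
        return Frame({name: [row[name] for row in data] for name in names}, "rows")
    raise TypeError(f"unsupported table type: {type(data).__name__}")


def scale_column(data: Native, name: str, factor: float) -> Native:
    """Library-style function: written once, accepts and returns either layout."""
    return from_native(data).with_scaled(name, factor).to_native()
```

The point of the pattern is that `scale_column` is written once against the wrapper, while callers keep passing and receiving whatever table type they already use.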