talk-data.com

Topic

Pandas

Tags: data_manipulation · data_analysis · python

Activity Trend: peak of 17 activities per quarter, 2020-Q1 through 2026-Q1

Activities

18 activities · Newest first

Efficient Time-Series Forecasting with Thousands of Local Models on Databricks

In industries like energy and retail, forecasting often requires local models, one per time series, because each series has unique behavior. However, training and managing thousands of such models presents scalability and operational challenges. This talk shows how we scaled local models on Databricks by leveraging the Pandas API on Spark, and shares practical lessons on storage, reuse, and scaling to make this approach efficient when it's truly needed.
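
The talk's own code isn't reproduced here, but the core pattern, training one small model per series inside a grouped pandas function on Spark, might look like the following sketch. The table and column names (sales, series_id, ds, y) and the choice of an ARIMA model are illustrative assumptions, not details from the talk.

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sales = spark.read.table("sales")  # hypothetical table: series_id, ds, y

def fit_and_forecast(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each group is a single time series, small enough to handle in pandas.
    from statsmodels.tsa.arima.model import ARIMA
    pdf = pdf.sort_values("ds")
    model = ARIMA(pdf["y"], order=(1, 1, 1)).fit()
    forecast = model.forecast(steps=7)
    return pd.DataFrame({
        "series_id": pdf["series_id"].iloc[0],
        "step": range(1, 8),
        "yhat": forecast.to_numpy(),
    })

# Spark fans the groups out across the cluster; each worker trains
# its own local models in parallel.
result = sales.groupBy("series_id").applyInPandas(
    fit_and_forecast,
    schema="series_id string, step int, yhat double",
)
```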

How to do real TDD in data science? A journey from pandas to polars with pelage!

In the world of data, inconsistencies and inaccuracies often present a major challenge to extracting valuable insights, yet the number of robust tools and practices to address these issues remains limited. In particular, test-driven development (TDD), a standard practice in classic software development, remains quite difficult in data science, partly because of poorly adapted tools and frameworks.

To address this issue we released Pelage, an open-source Python package that facilitates data exploration and testing, building on Polars' intuitive syntax and speed. Pelage empowers data scientists and analysts to streamline data transformation, enhance data quality, and improve code clarity.

We will demonstrate, in a test-first approach, how you can use this library in a meaningful data science workflow to gain greater confidence in your data transformations.

See website: https://alixtc.github.io/pelage/
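
Pelage's actual API is documented at the site above; as a flavor of the underlying test-first workflow the talk advocates, here is a minimal pytest + Polars sketch in which the test is written before the transformation. All function and column names are illustrative.

```python
# test_cleaning.py -- run with `pytest test_cleaning.py`
import polars as pl

def drop_incomplete_orders(df: pl.DataFrame) -> pl.DataFrame:
    """Keep only rows where both customer_id and amount are present."""
    return df.drop_nulls(subset=["customer_id", "amount"])

def test_drop_incomplete_orders():
    raw = pl.DataFrame({
        "customer_id": ["a", None, "c"],
        "amount": [10.0, 5.0, None],
    })
    clean = drop_incomplete_orders(raw)
    # Written first, this test pins down the contract of the transformation.
    assert clean.height == 1
    assert clean["customer_id"].null_count() == 0
```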

Advanced Polars: Lazy Queries and Streaming Mode

Do you find yourself struggling with Pandas' limitations when handling massive datasets or real-time data streams?

Discover Polars, the lightning-fast DataFrame library built in Rust. This talk presents two advanced features of the next-generation dataframe library: lazy queries and streaming mode.

Lazy evaluation in Polars allows you to build complex data pipelines without the performance bottlenecks of eager execution. By deferring computation, Polars optimises your queries using techniques like predicate and projection pushdown, reducing unnecessary computations and memory overhead. This leads to significant performance improvements, particularly with datasets larger than your system’s physical memory.

Polars' LazyFrames form the foundation of the library’s streaming mode, enabling efficient streaming pipelines, real-time transformations, and seamless integration with various data sinks.

This session will explore use cases and technical implementations of both lazy queries and streaming mode. We’ll also include live-coding demonstrations to introduce the tool, showcase best practices, and highlight common pitfalls.

Attendees will walk away with practical knowledge of lazy queries and streaming mode, ready to apply these tools in their daily work as data engineers or data scientists.
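
As a taste of both features, here is a minimal sketch; the file and column names are illustrative assumptions, and the exact streaming flag varies by Polars version (newer releases use collect(engine="streaming")).

```python
import polars as pl

# Lazy query: nothing is read or computed yet. Polars builds a plan and
# applies predicate/projection pushdown, so only the needed rows and
# columns of the Parquet file are ever scanned.
lazy = (
    pl.scan_parquet("events.parquet")
    .filter(pl.col("country") == "NL")
    .group_by("user_id")
    .agg(pl.col("amount").sum().alias("total"))
)
print(lazy.explain())  # inspect the optimized query plan

# Streaming mode: execute the same plan in chunks, so datasets larger
# than physical memory can be processed.
df = lazy.collect(streaming=True)
```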

Narwhals: enabling universal dataframe support

Ever tried passing a Polars DataFrame to a data science library and found that it...just works? No errors, no panics, no noticeable overhead, just...results? This is becoming increasingly common in 2025, yet only 2 years ago, it was mostly unheard of. So, what changed? A large part of the answer is: Narwhals.

Narwhals is a lightweight compatibility layer between dataframe libraries which lets your code work seamlessly across Polars, pandas, PySpark, DuckDB, and more! And it's not just a theoretical possibility: with ~30 million monthly downloads and set as a required dependency of Altair, Bokeh, Marimo, Plotly, Shiny, and more, it's clear that it's reshaping the data science landscape. By the end of the talk, you'll understand why writing generic dataframe code was such a headache (and why it isn't anymore), how Narwhals works and how its community operates, and how you can use it in your projects today. The talk will be technical yet accessible and light-hearted.
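
As a minimal sketch of what "generic dataframe code" looks like with Narwhals (the function name and data are illustrative): the same function accepts pandas, Polars, or any other supported input, and hands back a result in the caller's own dataframe type.

```python
import narwhals as nw
import pandas as pd
import polars as pl

def null_report(df_native):
    # Wrap whatever came in (pandas, Polars, ...) in the Narwhals API.
    df = nw.from_native(df_native)
    out = df.select(*[nw.col(c).is_null().sum().alias(c) for c in df.columns])
    return out.to_native()  # hand back the caller's own dataframe type

print(null_report(pd.DataFrame({"a": [1, None]})))  # pandas in, pandas out
print(null_report(pl.DataFrame({"a": [1, None]})))  # Polars in, Polars out
```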

More than DataFrames: Data Pipelines with the Swiss Army Knife DuckDB

Most Python developers reach for Pandas or Polars when working with tabular data—but DuckDB offers a powerful alternative that’s more than just another DataFrame library. In this tutorial, you’ll learn how to use DuckDB as an in-process analytical database: building data pipelines, caching datasets, and running complex queries with SQL—all without leaving Python. We’ll cover common use cases like ETL, lightweight data orchestration, and interactive analytics workflows. You’ll leave with a solid mental model for using DuckDB effectively as the “SQLite for analytics.”
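
As a minimal sketch of that pipeline style (file and table names are illustrative assumptions): an on-disk DuckDB database doubles as a dataset cache, transformation happens in SQL, and results come back to Python as a pandas DataFrame.

```python
import duckdb

con = duckdb.connect("pipeline.duckdb")  # persistent, in-process database

# Extract and transform in one SQL step, reading straight from Parquet.
con.execute("""
    CREATE TABLE IF NOT EXISTS daily_sales AS
    SELECT order_date, SUM(amount) AS revenue
    FROM read_parquet('orders.parquet')
    GROUP BY order_date
""")

# Interactive analytics: fetch results as a pandas DataFrame.
top_days = con.execute(
    "SELECT * FROM daily_sales ORDER BY revenue DESC LIMIT 5"
).df()
print(top_days)
```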

No-Code Change in Your Python UDF for Arrow Optimization

Apache Spark™ has introduced Arrow-optimized APIs such as Pandas UDFs and the Pandas Functions API, providing high performance for Python workloads. Yet, many users continue to rely on regular Python UDFs due to their simple interface, especially when advanced Python expertise is not readily available. This talk introduces a powerful new feature in Apache Spark that brings Arrow optimization to regular Python UDFs. With this enhancement, users can leverage performance gains without modifying their existing UDFs — simply by enabling a configuration setting or toggling a UDF-level parameter. Additionally, we will dive into practical tips and features for using Arrow-optimized Python UDFs effectively, exploring their strengths and limitations. Whether you’re a Spark beginner or an experienced user, this session will allow you to achieve the best of both simplicity and performance in your workflows with regular Python UDFs.
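
A minimal sketch of the no-code-change path described above: the UDF body stays exactly as it was, and Arrow optimization is toggled either globally via configuration or per UDF. The option names below match recent Spark releases but may vary by version.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()

# Option 1: enable Arrow optimization for all regular Python UDFs.
spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", "true")

@udf(returnType="string")
def greet(name):  # an ordinary row-at-a-time Python UDF, unchanged
    return f"Hello, {name}"

# Option 2: opt in per UDF instead of globally.
greet_arrow = udf(lambda name: f"Hello, {name}", "string", useArrow=True)

df = spark.createDataFrame([("Ada",), ("Grace",)], ["name"])
df.select(greet("name"), greet_arrow("name")).show()
```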

Polars, DuckDB, PySpark, PyArrow, pandas, cuDF: how Narwhals has brought them all together!

Suppose you want to write a data science tool to do feature engineering. Your experience may go like this:

- Expectation: you can focus on state-of-the-art techniques for feature engineering.
- Reality: you keep having to make your codebase more complex because a new dataframe library has come out and users are demanding support for it.

Or rather, it might have gone like that in the pre-Narwhals era. Because now, you can focus on solving the problems which your tool set out to do, and let Narwhals handle the subtle differences between different kinds of dataframe inputs!
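
As a minimal sketch of that division of labor (using Narwhals' narwhalify decorator; the column names are illustrative assumptions): the feature logic is written once, and Narwhals adapts it to whichever dataframe type the user passes in.

```python
import narwhals as nw
import pandas as pd
import polars as pl

@nw.narwhalify  # wraps the input for Narwhals and unwraps the output
def add_price_per_unit(df):
    return df.with_columns(
        (nw.col("revenue") / nw.col("quantity")).alias("price_per_unit")
    )

pdf = pd.DataFrame({"revenue": [10.0, 9.0], "quantity": [2, 3]})
print(add_price_per_unit(pdf))                   # works on pandas...
print(add_price_per_unit(pl.from_pandas(pdf)))   # ...and on Polars, unchanged
```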

Jay Alexander Clifford: A 101 in Time Series Analytics with Apache Arrow, Pandas and Parquet

Join Jay Alexander Clifford in a deep dive into Time Series Analytics with Apache Arrow, Pandas, and Parquet. 📈🐍 Explore the power of columnar databases, and learn how to build efficient and scalable analytics applications for time series data using open-source tools. 🚀 #TimeSeries #analytics
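
As a taste of the stack the session covers, here is a minimal sketch: Parquet storage read through Apache Arrow into pandas, followed by a time-series resample. The file and column names are illustrative assumptions.

```python
import pandas as pd

# pandas delegates to pyarrow here, keeping columnar types intact.
readings = pd.read_parquet("sensor_readings.parquet", engine="pyarrow")
readings["time"] = pd.to_datetime(readings["time"])

# Downsample raw readings to hourly means per sensor.
hourly = (
    readings.set_index("time")
    .groupby("sensor_id")
    .resample("1h")["value"]
    .mean()
)
print(hourly.head())
```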

✨ H I G H L I G H T S ✨

🙌 A huge shoutout to all the incredible participants who made Big Data Conference Europe 2023 in Vilnius, Lithuania, from November 21-24, an absolute triumph! 🎉 Your attendance and active participation were instrumental in making this event so special. 🌍

Don't forget to check out the session recordings from the conference to relive the valuable insights and knowledge shared! 📽️

Once again, THANK YOU for playing a pivotal role in the success of Big Data Conference Europe 2023. 🚀 See you next year for another unforgettable conference! 📅 #BigDataConference #SeeYouNextYear

If a Duck Quacks in the Forest and Everyone Hears, Should You Care?

YES! "Duck posting" has become an internet meme for praising DuckDB on Twitter. Nearly every quack using DuckDB has done it once or twice. But, why all the fuss? With advances in CPUs, memory, SSDs, and the software that enables it all, our personal machines are powerful beasts relegated to handling a few Chrome tabs and sitting 90% idle. As data engineers and data analysts, this seems like a waste that's not only expensive, but also impacting the environment.

In this session, you will see how DuckDB brings SQL analytics capabilities that until recently required a large cluster to a 2MB standalone executable on your laptop. This session will explain the architecture of DuckDB that enables high-performance analytics on a laptop: great query optimization, vectorized execution, continuous improvements in compression, and more. We will show its capabilities using live demos, from the pandas library to WASM to the command line. We'll demonstrate performance on large datasets, and talk about how we're exploring using the laptop to augment cloud analytics workloads.
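
One of the demo paths mentioned above, DuckDB querying a pandas DataFrame in place, might look like this minimal sketch (the data is illustrative): no server, no import step, no copy into a database.

```python
import duckdb
import pandas as pd

trips = pd.DataFrame({
    "city": ["Vilnius", "Kaunas", "Vilnius"],
    "distance_km": [12.5, 8.0, 3.2],
})

# DuckDB resolves `trips` directly from the surrounding Python scope.
result = duckdb.sql("""
    SELECT city, AVG(distance_km) AS avg_km
    FROM trips
    GROUP BY city
    ORDER BY avg_km DESC
""").df()
print(result)
```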

Talk by: Ryan Boyd


Python with Spark Connect

PySpark has reached many milestones, such as Project Zen, and its adoption keeps growing. We introduced the pandas API on Spark, hugely improved usability in areas such as error messages and type hints, and PySpark has become close to the standard for distributed computing in Python. With this trend, PySpark use cases have also become very complicated, especially for modern data applications such as notebooks, IDEs, and even smart home devices leveraging the power of data, which effectively need a lightweight, separate client. However, today's PySpark client is considerably heavy and does not allow separation from the scheduler, optimizer, and analyzer, for example.

In Apache Spark 3.4, one of the key features we introduced in PySpark is the Python client for Spark Connect, a decoupled client-server architecture for Apache Spark that allows remote connectivity to Spark clusters using the DataFrame API, with unresolved logical plans as the protocol. The separation between client and server allows Apache Spark and its open ecosystem to be leveraged from everywhere, and it can be embedded in modern data applications. In this talk, we will introduce what Spark Connect is, the internals of Spark Connect with Python, how to use Spark Connect with Python from the end user's perspective, and what's next beyond Apache Spark 3.4.
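
Connecting from the thin client is a one-liner in Spark 3.4+; in this minimal sketch the endpoint address is a hypothetical placeholder.

```python
from pyspark.sql import SparkSession

# Only the lightweight client runs here; analysis, optimization, and
# execution all happen on the remote cluster.
spark = SparkSession.builder.remote("sc://spark-cluster:15002").getOrCreate()

df = spark.range(10).filter("id % 2 = 0")
df.show()  # the unresolved plan is shipped over gRPC and executed remotely
```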

Talk by: Hyukjin Kwon and Ruifeng Zheng


Writing Data-Sharing Apps Using Node.js and Delta Sharing

JavaScript remains the top programming language today, with most code repositories on GitHub written in JavaScript. However, JavaScript is evolving beyond just a language for web application development into a language built for tomorrow. Everyday tasks like data wrangling, data analysis, and predictive analytics are possible today directly from a web browser. For example, many popular data analytics libraries, such as TensorFlow.js, now offer JavaScript SDKs.

Another popular library, Danfo.js, makes it possible to wrangle data using familiar pandas-like operations, shortening the learning curve and arming the typical data engineer or data scientist with another data tool in their toolbox. In this presentation, we’ll explore using the Node.js connector for Delta Sharing to build a data analytics app that summarizes a Twitter dataset.
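
The talk itself builds on the Node.js connector; as a flavor of the same Delta Sharing flow in this listing's usual language, here is a minimal sketch with the Python delta-sharing client. The profile file and the share/schema/table coordinates are illustrative assumptions.

```python
import delta_sharing

# A profile file issued by the data provider holds the endpoint and token.
profile = "config.share"

# Coordinates take the form <profile>#<share>.<schema>.<table>.
tweets = delta_sharing.load_as_pandas(f"{profile}#tweets_share.social.tweets")
print(tweets.head())  # a regular pandas DataFrame, ready for analysis
```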

Talk by: Will Girten


Near Real-Time Analytics with Event Streaming, Live Tables, and Delta Sharing

Microservices is an increasingly popular architecture much loved by application teams, for it allows services to be developed and scaled independently. Data teams, though, often need a centralized repository where all data from different services come together to join and aggregate. The data platform can serve as a single source of company facts, enable near real time analytics, and secure sharing of massive data sets across clouds.

A viable microservices ingestion pattern is Change Data Capture (CDC), using AWS Database Migration Service or Debezium. CDC proves to be a scalable solution ideal for stable platforms, but it has several challenges for evolving services: frequent schema changes; complex, unsupported DDL during migration; and automated deployments are but a few. An event streaming architecture can address these challenges.

Confluent, for example, provides a schema registry service where all services can register their event schemas. Schema registration helps with verifying that the events are being published based on the agreed contracts between data producers and consumers. It also provides a separation between internal service logic and the data consumed downstream. The services write their events to Kafka using the registered schemas with a specific topic based on the type of the event.

Data teams can leverage Spark jobs to ingest Kafka topics into Bronze tables in the Delta Lake. On ingestion, the registered schema from the schema registry is used to validate the data based on the provided version. A merge operation is sometimes invoked to translate events into the final state of each record, per business requirements.

Data teams can take advantage of Delta Live Tables on streaming datasets to produce Silver and Gold tables in near real time. Each input data source also has a set of expectations to ensure data quality and business rules. The pipeline allows Engineering and Analytics to collaborate by mixing Python and SQL. The refined data sets are then fed into Auto ML for discovery and baseline modeling.

To expose Gold tables to more consumers, especially non-Spark users across clouds, data teams can implement Delta Sharing. Recipients can access Silver tables from a different cloud and build their own analytics datasets. Analytics teams can also access Gold tables via the pandas Delta Sharing client and BI tools.
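
The Delta Live Tables step described above might look like this minimal sketch: a streaming Silver table with declarative expectations guarding data quality. The table, column, and rule names are illustrative assumptions.

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Orders cleaned from the Bronze event stream")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")  # drop bad rows
@dlt.expect("positive_amount", "amount > 0")  # track violations, keep rows
def silver_orders():
    return (
        dlt.read_stream("bronze_orders")
        .select("order_id", "customer_id", col("amount").cast("double"))
    )
```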


Auto Encoder Decoder-Based Anomaly Detection with the Lakehouse Paradigm

Auto-Encoder-Decoder is a type of deep learning neural network architecture with an hourglass shape, high dimensional inputs are compressed to latent space through the encoder. The decoder mirrors the encoder architecture and reconstructs the input data from the latent space. Auto-Encoder-Decoder models are commonly used for anomaly detection, after training, the reconstructed error of normal data is minimized thus anomaly can be detected if its reconstructed error gets higher than the “normal threshold”. This presentation will demonstrate an Auto-Encoder-Decoder anomaly detection solution built with the Lakehouse Paradigm, from data management to after-deployment monitoring, to explain the entire model life cycle. It will also highlight the flexibility and scalability that MLflow custom model and Pandas UDF can bring when a large number of individual models need to be trained, deployed, and monitored in parallel.


PySpark in Apache Spark 3.3 and Beyond

PySpark has rapidly evolved with the momentum of Project Zen, introduced in Apache Spark 3.0. We improved error messages, added type hints for autocompletion, implemented visualization, and more. Most importantly, the Pandas API on Spark was introduced in Apache Spark 3.2, exposing the pandas API on top of Apache Spark, and it has gained a lot of popularity.

In Apache Spark 3.3, the effort of Project Zen continued, and PySpark brings many notable changes such as broader API coverage and a faster default index in the Pandas API on Spark, datetime.timedelta support, a new PyArrow batch interface, better autocompletion, a Python and Pandas UDF profiler, and new error classification.

In this talk, we will introduce what is new in PySpark in Apache Spark 3.3, and what comes next beyond Apache Spark 3.3, covering the current effort and roadmap of PySpark.


Deep Dive into the New Features of Apache Spark 3.2 and 3.3

Apache Spark has become the most widely-used engine for executing data engineering, data science and machine learning on single-node machines or clusters. The number of monthly maven downloads of Spark has rapidly increased to 20 million.

We will talk about the higher-level features and improvements in Spark 3.2 and 3.3. The talk also dives deeper into the following features:

+ Introducing the pandas API on Apache Spark to unify the small data API and big data API (a quick taste follows this list)
+ Completing the ANSI SQL compatibility mode to simplify migration of SQL workloads
+ Productionizing adaptive query execution to speed up Spark SQL at runtime
+ Introducing the RocksDB state store to make state processing more scalable
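
As promised above, a quick taste of the first item: pandas syntax executed by Spark. The file and column names are illustrative assumptions.

```python
import pyspark.pandas as ps

# Looks like pandas, but the work is distributed across the Spark cluster.
df = ps.read_parquet("events.parquet")
summary = df.groupby("country")["amount"].sum().sort_values(ascending=False)
print(summary.head())
```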


FugueSQL—The Enhanced SQL Interface for Pandas and Spark DataFrames

SQL users working with Pandas and Spark quickly realize that SQL is a second-class interface, invoked in fragments between predominantly Python code.

We will introduce FugueSQL, an enhanced SQL interface that allows SQL lovers to express end-to-end workflows predominantly in SQL. With a Jupyter notebook extension, SQL commands can be used in Databricks notebooks for interactive handling of in-memory datasets. This allows heavy SQL users to fully leverage Spark in their preferred grammar.
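
An end-to-end FugueSQL snippet might look like this minimal sketch, using the fsql entry point (available in Fugue releases of the era; the DataFrame and column names are illustrative assumptions). Swapping the engine argument runs the same workflow on Spark.

```python
import pandas as pd
from fugue_sql import fsql

orders = pd.DataFrame({"city": ["NYC", "SF", "NYC"], "amount": [10, 20, 5]})

# FugueSQL picks up the local `orders` DataFrame by name; PRINT is one of
# its workflow extensions beyond standard SQL.
fsql("""
SELECT city, SUM(amount) AS total
FROM orders
GROUP BY city
PRINT
""").run()  # .run("spark") would execute the same workflow on Spark
```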
