Topic

PySpark

big_data distributed_computing python

Activities: 2 tagged

Activity Trend: peak of 14 per quarter, 2020-Q1 to 2026-Q2

Top Events

O'Reilly Data Engineering Books: 19
Databricks DATA + AI Summit 2023: 16
Data + AI Summit 2025: 13
Data Engineering Podcast: 4
O'Reilly Data Science Books: 2
PyData Berlin 2025: 2
PyData Cardiff - July 2025: 1
From a Fintech lens: MCP server live-coding & feature selection data hacks: 1
dbt Coalesce 2025: 1
PyData Seattle 2025: 1
PyConDE & PyData Berlin 2023: 1
SciPy 2025: 1

Top Speakers

Tobias Macey: 4
Marco Gorelli (Narwhals): 3
Denny Lee (Databricks): 3
Pramod Singh: 3
Sundar Krishnan: 2
Tomasz Drabas: 2
Raju Kumar Mishra: 2
Allison Wang (Databricks): 2
Ramcharan Kakarla: 2
Xiao Li (Databricks): 2
Stuart Moncada (Google Cloud): 1
Benjamin Bengfort: 1

Activities

Showing filtered results

Filtering by: Allison Wang

Polars on Spark: Unlocking Performance with Arrow Python UDFs

2025-11-07 · PyData Seattle 2025

talk

by Shujing Yang, Allison Wang (Databricks)

Arrow Polars Python Rust Spark

PySpark’s Arrow-based Python UDFs open the door to dramatically faster data processing by avoiding expensive serialization overhead. At the same time, Polars, a high-performance DataFrame library built on Rust, offers zero-copy interoperability with Apache Arrow. This talk shows how combining the two unlocks new performance gains: writing Arrow UDFs with Polars in PySpark can deliver speedups over regular Python UDFs. Attendees will learn how Arrow UDFs work in PySpark, how they can be used with other data processing libraries, and how to apply this approach to real-world Spark pipelines for faster, more efficient workloads.

Bridging Big Data and AI: Empowering PySpark With Lance Format for Multi-Modal AI Data Pipelines

2025-06-11 · Data + AI Summit 2025

lightning_talk

by LU QIU (LanceDB), Allison Wang (Databricks)

AI/ML Analytics API Big Data Data Analytics Lance Python Spark SQL

PySpark has long been a cornerstone of big data processing, excelling in data preparation, analytics and machine learning tasks within traditional data lakes. However, the rise of multimodal AI and vector search introduces challenges beyond its capabilities. Spark’s new Python data source API enables integration with emerging AI data lakes built on the multimodal Lance format. Lance offers zero-copy schema evolution and robust support for large records (e.g., images, tensors, and embeddings), simplifying multimodal data storage. Its advanced indexing for semantic and full-text search, combined with rapid random access, enables high-performance AI data analytics at the level of SQL. By unifying PySpark’s robust processing capabilities with Lance’s AI-optimized storage, data engineers and scientists can efficiently manage and analyze the diverse data types required for cutting-edge AI applications within a familiar big data framework.