Topic

Arrow

Apache Arrow

data_processing columnar_memory_format big_data

Activities: 2 tagged

Activity Trend: peak of 6 per quarter, 2020-Q1 to 2026-Q1

Top Events

Data Engineering Podcast: 10
Data Council Austin 2024 - Day 1: 5
Databricks DATA + AI Summit 2023: 4
PyConDE & PyData Berlin 2023: 3
PyData Paris 2025: 3
The Analytics Engineering Podcast: 2
Data Council 2023: 2
O'Reilly Data Engineering Books: 2
PyData Paris 2024: 2
Data + AI Summit 2025: 2
Data Skeptic: 1
Making Data Simple: 1

Top Speakers

Tobias Macey: 10
Matthew Topol (Voltron Data): 3
Wes McKinney (Posit): 3
Julien Le Dem (Astronomer): 3
Alenka Frim (United.Cloud): 2
Joris Van den Bossche: 2
Tristan Handy (dbt Labs): 2
Julia Schottenstein (dbt Labs): 2
Kyle Polich: 1
Hyukjin Kwon (Databricks): 1
Thomas Bierhance: 1
Yann LECHELLE (:PROBABL.): 1

Activities

Showing filtered results

Filtering by: Data + AI Summit 2025

Petabyte-Scale On-Chain Insights: Real-Time Intelligence for the Next-Gen Financial Backbone

2025-06-10 · Data + AI Summit 2025

lightning_talk

by Leo Liang (CipherOwl Inc)

AI/ML Analytics Blockchain Data Lakehouse Delta Kafka Spark SQL

We’ll explore how CipherOwl Inc. constructed a near real-time, multi-chain data lakehouse to power anti-money laundering (AML) monitoring at petabyte scale. We will walk through the end-to-end architecture, which integrates open-source technologies and AI-driven analytics to handle massive on-chain data volumes. Off-chain intelligence complements this to meet rigorous AML requirements. At the core of our solution is ChainStorage, an open-source project started by Coinbase that provides robust blockchain data ingestion and block-level serving. We enhanced it with Apache Spark™ and Arrow™ for high-throughput processing and efficient data serialization, backed by Delta Lake and Kafka. For the serving layer, we employ StarRocks to deliver fast SQL analytics over vast datasets. Finally, our system incorporates machine learning and AI agents for continuous data curation and near real-time insights, which are crucial for tackling on-chain AML challenges.

No-Code Change in Your Python UDF for Arrow Optimization

2025-06-10 · Data + AI Summit 2025

lightning_talk

by Hyukjin Kwon (Databricks)

API Pandas Python Spark

Apache Spark™ has introduced Arrow-optimized APIs such as Pandas UDFs and the Pandas Functions API, providing high performance for Python workloads. Yet, many users continue to rely on regular Python UDFs due to their simple interface, especially when advanced Python expertise is not readily available. This talk introduces a powerful new feature in Apache Spark that brings Arrow optimization to regular Python UDFs. With this enhancement, users can leverage performance gains without modifying their existing UDFs — simply by enabling a configuration setting or toggling a UDF-level parameter. Additionally, we will dive into practical tips and features for using Arrow-optimized Python UDFs effectively, exploring their strengths and limitations. Whether you’re a Spark beginner or an experienced user, this session will allow you to achieve the best of both simplicity and performance in your workflows with regular Python UDFs.
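The two switches the abstract mentions (a configuration setting and a UDF-level parameter) can be sketched roughly as follows. This is an illustration, not code from the talk: it assumes PySpark 3.5 or later with PyArrow installed, and the names `normalize_tag` and `run_with_arrow_udf` are hypothetical. The UDF body stays plain Python; only the registration changes.

```python
from typing import Optional


def normalize_tag(tag: Optional[str]) -> Optional[str]:
    """Plain Python logic; nothing here is Spark- or Arrow-specific."""
    if tag is None:
        return None
    return tag.strip().lower()


def run_with_arrow_udf():
    """Hypothetical driver; assumes PySpark 3.5+ and PyArrow are installed."""
    # Imported lazily so the pure-Python function above works without Spark.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("arrow-python-udf-sketch").getOrCreate()

    # Option 1: session-wide setting that applies to regular Python UDFs.
    spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", "true")

    # Option 2: per-UDF switch; the wrapped function is left unmodified.
    normalize_udf = udf(normalize_tag, returnType=StringType(), useArrow=True)

    df = spark.createDataFrame([(" Spark ",), ("ARROW",), (None,)], ["tag"])
    rows = df.select(normalize_udf(col("tag")).alias("tag")).collect()
    spark.stop()
    return [row["tag"] for row in rows]
```

Either option alone should be enough; the per-UDF parameter is useful when only some UDFs in a job should take the Arrow path. Type coercion can differ between the pickled and Arrow-based paths, so results are worth comparing on a sample before switching a production job.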