talk-data.com

Topic: Apache Arrow

Tags: data_processing, columnar_memory_format, big_data

18 tagged activities, newest first

Activity Trend: peak of 6 activities per quarter, 2020-Q1 to 2026-Q1

Open-source Business

Challenges in economics and governance models for open-source scientific projects

In this presentation, the CEOs of two companies at the forefront of open-source scientific software development - Sylvain Corlay of QuantStack and Yann Lechelle of Probabl - examine the intricate challenges of open-source funding and governance and reflect on how these two aspects interconnect.

We start by reflecting on the origins of the open-source movement within the scientific community, and delve into the contemporary challenges of operating businesses and identifying sustainable economic models that both leverage and contribute to open-source software.

In particular, we highlight the unique approaches and experiences of QuantStack and Probabl, which primarily contribute to multi-stakeholder scientific projects such as scikit-learn, Jupyter, Apache Arrow, or conda-forge.

You Don’t Have to Be an Expert: Stories from the Open Source Frontlines

Four years ago, I had no idea what PyArrow was—or how open source development worked. But through mentorship, collaboration, and learning in public, I found not just a place in the community, but a sense of how open source evolves and connects.

In this keynote, I’ll share my experience on how complex projects like Apache Arrow evolve through shared protocols, cross-project conversations, and the people behind them. Along the way, we’ll look at the human side of technical work, the quiet strength of standards, and how imposter syndrome, while uncomfortable, has sharpened my curiosity and helped me find my own way of contributing.

Petabyte-Scale On-Chain Insights: Real-Time Intelligence for the Next-Gen Financial Backbone

We’ll explore how CipherOwl Inc. constructed a near real-time, multi-chain data lakehouse to power anti-money laundering (AML) monitoring at petabyte scale. We will walk through the end-to-end architecture, which integrates open-source technologies and AI-driven analytics to handle massive on-chain data volumes. Off-chain intelligence complements this to meet rigorous AML requirements. At the core of our solution is ChainStorage, an open-source project started by Coinbase that provides robust blockchain data ingestion and block-level serving. We enhanced it with Apache Spark™ and Arrow™, coupled for high-throughput processing and efficient data serialization, backed by Delta Lake and Kafka. For the serving layer, we employ StarRocks to deliver fast SQL analytics over vast datasets. Finally, our system incorporates machine learning and AI agents for continuous data curation and near real-time insights, which are crucial for tackling on-chain AML challenges.

No-Code Change in Your Python UDF for Arrow Optimization

Apache Spark™ has introduced Arrow-optimized APIs such as Pandas UDFs and the Pandas Functions API, providing high performance for Python workloads. Yet, many users continue to rely on regular Python UDFs due to their simple interface, especially when advanced Python expertise is not readily available. This talk introduces a powerful new feature in Apache Spark that brings Arrow optimization to regular Python UDFs. With this enhancement, users can leverage performance gains without modifying their existing UDFs — simply by enabling a configuration setting or toggling a UDF-level parameter. Additionally, we will dive into practical tips and features for using Arrow-optimized Python UDFs effectively, exploring their strengths and limitations. Whether you’re a Spark beginner or an experienced user, this session will allow you to achieve the best of both simplicity and performance in your workflows with regular Python UDFs.
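As a rough sketch of the two switches the abstract mentions, the snippet below wraps one unchanged Python function both ways. It assumes PySpark 3.5 or later; the abstract does not name the setting or the parameter, so `spark.sql.execution.pythonUDF.arrow.enabled` and `useArrow` are taken from the PySpark API as I understand it, and `name_length` and `build_arrow_udfs` are hypothetical names chosen for illustration. PySpark is imported inside the helper so that the plain function can be exercised without a Spark session.

```python
from typing import Optional


def name_length(name: Optional[str]) -> Optional[int]:
    """A regular row-at-a-time UDF body; nothing here is Arrow-specific."""
    return None if name is None else len(name)


def build_arrow_udfs(spark):
    """Wrap the unchanged function in two ways (assumes PySpark 3.5+).

    `spark` is an existing SparkSession supplied by the caller.
    """
    # Imported lazily so the module loads even where PySpark is absent.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    # Option 1: a session-wide configuration setting.
    spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", "true")
    via_config = udf(name_length, returnType=IntegerType())

    # Option 2: a per-UDF parameter, independent of the session setting.
    via_param = udf(name_length, returnType=IntegerType(), useArrow=True)

    return via_config, via_param
```

Either returned object can then be applied to a DataFrame column in the usual way, for example `df.select(via_param("name"))`, where `df` and its `name` column are placeholders.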

Jay Alexander Clifford: A 101 in Time Series Analytics with Apache Arrow, Pandas and Parquet

Join Jay Alexander Clifford in a deep dive into Time Series Analytics with Apache Arrow, Pandas, and Parquet. 📈🐍 Explore the power of columnar databases, and learn how to build efficient and scalable analytics applications for time series data using open-source tools. 🚀 #TimeSeries #analytics

Presented at Big Data Conference Europe 2023 in Vilnius, Lithuania, November 21-24.

Delta-rs, Apache Arrow, Polars, WASM: Is Rust the Future of Analytics?

Rust is a unique language whose traits make it very appealing for data engineering. In this session, we'll walk through the different aspects of the language that make it such a good fit for big data processing including: how it improves performance and how it provides greater safety guarantees and compatibility with a wide range of existing tools that make it well positioned to become a major building block for the future of analytics.

We will also take a hands-on look through real code examples at a few emerging technologies built on top of Rust that utilize these capabilities, and learn how to apply them to our modern lakehouse architecture.

Talk by: Oz Katz

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Ten years of building open source standards: From Parquet to Arrow to OpenLineage | Astronomer

ABOUT THE TALK: Julien Le Dem shares the story of his contributions to successful open source projects in the data ecosystem and what made their success possible. He covers the ideation process and early growth of the Apache Parquet columnar format, and how this led to the creation of its in-memory alter ego, Apache Arrow. Julien ends by showing how this experience enabled the success of OpenLineage, an LF AI & Data project that brings observability to the data ecosystem.

ABOUT THE SPEAKER: Julien Le Dem is the Chief Architect of Astronomer and Co-Founder of Datakin. He co-created Apache Parquet and is involved in several open source projects including OpenLineage, Marquez (LF AI & Data), Apache Arrow, Apache Iceberg and a few others. Previously, he was a senior principal at WeWork; principal architect at Dremio; tech lead for Twitter’s data processing tools; and principal engineer working on content platforms at Yahoo, where he received his Hadoop initiation.

ABOUT DATA COUNCIL: Data Council (https://www.datacouncil.ai/) is a community and conference series that provides data professionals with the learning and networking opportunities they need to grow their careers.

Making Moves with Arrow Data: Introducing Arrow Database Connectivity (ADBC) | Voltron Data

ABOUT THE TALK: In this talk, we'll dive into one of the newest Apache Arrow subprojects, Arrow Database Connectivity (ADBC), an API specification for Arrow-based database access.

Over the course of this session, you’ll get a crash course in ADBC: its goals, use cases, and examples of how it communicates with different data APIs (such as Arrow Flight SQL and Postgres) using Arrow-native in-memory data. By the end, you’ll understand the use cases it can address and know where to find the resources you need to get started.

ABOUT THE SPEAKER: Matthew Topol is a committer on the Apache Arrow project who frequently contributes to the Golang Arrow and Parquet libraries, among other enhancements, and helps grow the Arrow community. Matt recently joined Voltron Data to work on the Apache Arrow libraries full time and grow the Arrow Golang community. His first book, "In-Memory Analytics with Apache Arrow", was published in June 2022 and is the first (and currently only) book on Apache Arrow.

Apache Arrow Flight SQL: High Performance, Simplicity, and Interoperability for Data Transfers

Network protocols for transferring data generally have one of two problems: they’re slow for large data transfers but have simple APIs (e.g. JDBC) or they’re fast for large data transfers but have complex APIs specific to the system. Apache Arrow Flight addresses the former by providing high performance data transfers and half of the latter by having a standard API independent of systems. However, while the Arrow Flight API is performant and an open standard, it can be more complex to use than simpler APIs like JDBC.

Arrow Flight SQL rounds out the solution, providing both great performance and a simple universal API.

In this talk, we’ll show the performance benefits of Arrow Flight, the client difference between interacting with Arrow Flight and Arrow Flight SQL, and an overview of a JDBC driver built on Arrow Flight SQL, enabling clients to take advantage of this increased performance with zero application changes.

DataFusion and Arrow: Supercharge Your Data Analytical Tool with a Rusty Query Engine

Learn how Rust, the Apache Arrow project, and the DataFusion query engine are increasingly being used to accelerate the creation of modern data stacks.

Polars: Blazingly Fast DataFrames in Rust and Python

This talk will introduce Polars, a blazingly fast DataFrame library written in Rust on top of Apache Arrow. It is a DataFrame library that brings exploratory data analysis closer to the lessons learned in database research.

Today's CPUs come with many cores, and their superscalar designs and SIMD registers allow for even more parallelism. Polars is written from the ground up to fully utilize CPUs of this generation.

Besides blazingly fast algorithms, a cache-efficient memory layout, and multi-threading, Polars includes a lazy query engine that allows it to apply several optimizations that may improve query time and memory usage.
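To give a feel for what a lazy query engine buys, here is a toy sketch in plain Python. It is not the Polars API and not how Polars is implemented; `LazyTable` and its methods are hypothetical names. Calls to `filter` and `select` only record the plan, and `collect` then runs every recorded step in one pass over the data without building intermediate tables. As a simplification, filters always see the full input row, even if `select` was called first.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]


@dataclass
class LazyTable:
    """Toy lazy table: operations are recorded, not executed."""

    rows: List[Row]
    predicates: List[Callable[[Row], bool]] = field(default_factory=list)
    columns: Optional[List[str]] = None

    def filter(self, predicate: Callable[[Row], bool]) -> "LazyTable":
        # Record the predicate; no rows are touched yet.
        return LazyTable(self.rows, self.predicates + [predicate], self.columns)

    def select(self, *columns: str) -> "LazyTable":
        # Record the projection; no rows are touched yet.
        return LazyTable(self.rows, self.predicates, list(columns))

    def collect(self) -> List[Row]:
        # Execute the whole plan in a single pass: apply every filter,
        # then copy only the requested columns of the surviving rows.
        result = []
        for row in self.rows:
            if all(predicate(row) for predicate in self.predicates):
                keep = self.columns if self.columns is not None else list(row)
                result.append({name: row[name] for name in keep})
        return result


trips = LazyTable([
    {"city": "Vilnius", "km": 12.0, "driver": "a"},
    {"city": "Paris", "km": 3.5, "driver": "b"},
    {"city": "Vilnius", "km": 1.2, "driver": "c"},
])

# Building the query does no work on the rows.
query = (
    trips.filter(lambda r: r["city"] == "Vilnius")
    .filter(lambda r: r["km"] > 5)
    .select("city", "km")
)

# Work happens only here.
result = query.collect()
```

Because the whole plan is known before execution, an engine is free to reorder or fuse steps; this toy only fuses them, which is the simplest case of the kind of optimization the abstract refers to.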

Read more:

https://github.com/pola-rs/polars https://www.ritchievink.com/blog/2021/02/28/i-wrote-one-of-the-fastest-dataframe-libraries/

Join the talk to learn more.
