Speaker

DB Tsai

Activities: 2 talks

Senior Engineering Manager, Databricks

DB Tsai is an engineering leader on the Spark team at Databricks. He is an Apache Spark Project Management Committee (PMC) member and committer, and he enjoys building teams with great cultures focused on large-scale distributed data infrastructure. Before moving into a leadership role, he implemented several algorithms in the Apache Spark project, including Linear Regression and Binary/Multinomial Logistic Regression with Elastic-Net (L1/L2) regularization, using the L-BFGS and OWL-QN optimizers.

Bio from: Data + AI Summit 2025


Talks & appearances


Apache Spark — Ask Us Anything

2025-06-11 · Data + AI Summit 2025

Lightning talk

with DB Tsai (Databricks), Jules S. Damji (Anyscale Inc), Allison Wang (Databricks)

API · Big Data · Spark

Join us for an interactive Ask Me Anything (AMA) session on the latest advancements in Apache Spark 4, including Spark Connect — the new client-server architecture enabling seamless integration with IDEs, notebooks and custom applications. Learn about performance improvements, enhanced APIs and best practices for leveraging Spark’s next-generation features. Whether you're a data engineer, Spark developer or big data enthusiast, bring your questions on architecture, real-world use cases and how these innovations can optimize your workflows. Don’t miss this chance to dive deep into the future of distributed computing with Spark!

The Upcoming Apache Spark 4.1: The Next Chapter in Unified Analytics

2025-06-11 · Data + AI Summit 2025

talk

with DB Tsai (Databricks), Xiao Li (Databricks)

Analytics · API · Data Quality · ETL/ELT · PySpark · Python

Apache Spark has long been recognized as the leading open-source unified analytics engine, combining a simple yet powerful API with a rich ecosystem and top-notch performance. In the upcoming Spark 4.1 release, the community reimagines Spark to excel at both massive cluster deployments and local laptop development. We’ll start with new single-node optimizations that make PySpark even more efficient for smaller datasets. Next, we’ll delve into a major “Pythonizing” overhaul — simpler installation, clearer error messages and Pythonic APIs. On the ETL side, we’ll explore greater data source flexibility (including the simplified Python Data Source API) and a thriving UDF ecosystem. We’ll also highlight enhanced support for real-time use cases, built-in data quality checks and the expanding Spark Connect ecosystem — bridging local workflows with fully distributed execution. Don’t miss this chance to see Spark’s next chapter!