talk-data.com

Hyukjin Kwon

Speaker · 3 talks

Staff Software Engineer, Databricks

Hyukjin is a software engineer at Databricks and the tech lead of the OSS PySpark team, an ASF member, and an Apache Spark PMC member and committer. He works on many areas of Apache Spark, such as PySpark, Spark SQL, SparkR, and infrastructure. He is the top contributor to Apache Spark and leads efforts such as Project Zen, the pandas API on Spark, and Python Spark Connect.

Bio from: Databricks DATA + AI Summit 2023


Talks & appearances

3 activities

No-Code Change in Your Python UDF for Arrow Optimization

Apache Spark™ has introduced Arrow-optimized APIs such as Pandas UDFs and the Pandas Functions API, providing high performance for Python workloads. Yet many users continue to rely on regular Python UDFs because of their simple interface, especially when advanced Python expertise is not readily available. This talk introduces a powerful new feature in Apache Spark that brings Arrow optimization to regular Python UDFs. With this enhancement, users can gain the performance benefits without modifying their existing UDFs, simply by enabling a configuration setting or toggling a UDF-level parameter. We will also dive into practical tips and features for using Arrow-optimized Python UDFs effectively, exploring their strengths and limitations. Whether you’re a Spark beginner or an experienced user, this session will help you get the best of both simplicity and performance from regular Python UDFs.
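
As a rough illustration (not taken from the talk), here is a minimal sketch of the two toggles the abstract refers to, assuming Apache Spark 3.5+, where the session-wide spark.sql.execution.pythonUDF.arrow.enabled setting and the per-UDF useArrow parameter are available:

# Minimal sketch: enabling Arrow optimization for a regular Python UDF.
# Assumes Apache Spark 3.5+; the UDF body itself is unchanged in both cases.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()

# Option 1: enable Arrow optimization for all regular Python UDFs in the session.
spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", "true")

# Option 2: toggle Arrow optimization for a single UDF via the useArrow parameter.
@udf(returnType="long", useArrow=True)
def plus_one(v):
    return v + 1

spark.range(10).select(plus_one("id")).show()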

Python with Spark Connect

PySpark has accomplished many milestones, such as Project Zen, and continues to grow. We introduced the pandas API on Spark and greatly improved usability (error messages, type hints, etc.), and PySpark has become close to the de facto standard for distributed computing in Python. With this trend, PySpark use cases have also become more complex, especially in modern data applications such as notebooks, IDEs, and even devices such as smart home devices leveraging the power of data, which effectively need a lightweight, separate client. However, today’s PySpark client is considerably heavy and cannot, for example, be separated from the scheduler, optimizer, and analyzer.

In Apache Spark 3.4, one of the key features we introduced in PySpark is the Python client for Spark Connect, a decoupled client-server architecture for Apache Spark that allows remote connectivity to Spark clusters using the DataFrame API and unresolved logical plans as the protocol. The separation between client and server allows Apache Spark and its open ecosystem to be leveraged from everywhere, and it can be embedded in modern data applications. In this talk, we will introduce what Spark Connect is, the internals of Spark Connect with Python, how to use Spark Connect with Python from the end user’s perspective, and what comes next beyond Apache Spark 3.4.
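
As a rough illustration (not part of the abstract), here is a minimal sketch of connecting from the lightweight Python client over Spark Connect, assuming Apache Spark 3.4+ and a Spark Connect server reachable at the hypothetical endpoint sc://localhost:15002:

# Minimal sketch: using the Python client for Spark Connect (Apache Spark 3.4+).
# "sc://localhost:15002" is a placeholder for your Spark Connect server endpoint.
from pyspark.sql import SparkSession

# The client builds unresolved logical plans locally and sends them to the server;
# analysis, optimization, and execution all happen on the remote cluster.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = spark.range(5).selectExpr("id", "id * 2 AS doubled")
df.show()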

Talk by: Hyukjin Kwon and Ruifeng Zheng


Lakehouse / Spark AMA

Have some great questions about Apache Spark™ and Lakehouses? Come by and ask the experts!

Talk by: Martin Grund, Hyukjin Kwon, and Wenchen Fan
