talk-data.com

Denny Lee

Speaker

4 talks

PM Director, Developer Relations, Databricks

Denny Lee is a Product Management Director and Principal Developer Advocate at Databricks with deep expertise in distributed systems, AI, and data science. He is a longtime contributor to Apache Spark and MLflow and a maintainer of Unity Catalog and Delta Lake. He previously worked at Microsoft on Azure Cosmos DB, HDInsight, and SQL Server, and led data science engineering at SAP Concur. He holds a Master’s in Biomedical Informatics from OHSU and has built large-scale data platforms for healthcare and enterprise use. His technical focus includes AI, Apache Iceberg, Delta Lake, Apache Spark, ML, and genomics.

Bio from: Databricks DATA + AI Summit 2023

Talks & appearances

Showing 4 of 13 activities

Delta Lake: The Definitive Guide

Ready to simplify the process of building data lakehouses and data pipelines at scale? In this practical guide, learn how Delta Lake is helping data engineers, data scientists, and data analysts overcome key data reliability challenges with modern data engineering and management techniques. Authors Denny Lee, Tristen Wentling, Scott Haines, and Prashanth Babu (with contributions from Delta Lake maintainer R. Tyler Croy) share expert insights on all things Delta Lake, including how to run batch and streaming jobs concurrently and accelerate the usability of your data. You'll also uncover how ACID transactions bring reliability to data lakehouses at scale.

This book helps you:

- Understand key data reliability challenges and how Delta Lake solves them
- Explain the critical role of Delta transaction logs as a single source of truth
- Learn the Delta Lake ecosystem with technologies like Apache Flink, Kafka, and Trino
- Architect data lakehouses with the medallion architecture
- Optimize Delta Lake performance with features like deletion vectors and liquid clustering

Learning Spark, 2nd Edition

Data is bigger, arrives faster, and comes in a variety of formats, and it all needs to be processed at scale for analytics or machine learning. But how can you process such varied workloads efficiently? Enter Apache Spark. Updated to include Spark 3.0, this second edition shows data engineers and data scientists why structure and unification in Spark matter. Specifically, this book explains how to perform simple and complex data analytics and employ machine learning algorithms.

Through step-by-step walk-throughs, code snippets, and notebooks, you'll be able to:

- Learn Python, SQL, Scala, or Java high-level Structured APIs
- Understand Spark operations and SQL Engine
- Inspect, tune, and debug Spark operations with Spark configurations and Spark UI
- Connect to data sources: JSON, Parquet, CSV, Avro, ORC, Hive, S3, or Kafka
- Perform analytics on batch and streaming data using Structured Streaming
- Build reliable data pipelines with open source Delta Lake and Spark
- Develop machine learning pipelines with MLlib and productionize models using MLflow

PySpark Cookbook

Dive into the world of big data processing and analytics with the "PySpark Cookbook". This book provides over 60 hands-on recipes for implementing efficient data-intensive solutions using Apache Spark and Python. By mastering these recipes, you'll be equipped to tackle challenges in large-scale data processing, machine learning, and stream analytics.

What this book will help me do

- Set up and configure PySpark environments effectively, including working with Jupyter for enhanced interactivity.
- Understand and utilize DataFrames for data manipulation, analysis, and transformation tasks.
- Develop end-to-end machine learning solutions using the ML and MLlib modules in PySpark.
- Implement structured streaming and graph-processing solutions to analyze and visualize data streams and relationships.
- Deploy PySpark applications to cloud infrastructure efficiently using best practices.

Author(s)

This book is co-authored by Denny Lee and Tomasz Drabas, who are experienced professionals in data processing and analytics leveraging Python and Apache Spark. With their deep technical expertise and a passion for teaching through practical examples, they aim to make the complex concepts of PySpark accessible to developers of varied experience levels.

Who is it for?

This book is ideal for Python developers who are keen to delve into the Apache Spark ecosystem. Whether you're just starting with big data or have some experience with Spark, this book provides practical recipes to enhance your skills. Readers looking to solve real-world data-intensive challenges using PySpark will find this resource invaluable.

Learning PySpark

"Learning PySpark" guides you through mastering the integration of Python with Apache Spark to build scalable and efficient data applications. You'll delve into Spark 2.0's architecture, efficiently process data, and explore PySpark's capabilities ranging from machine learning to structured streaming. By the end, you'll be equipped to craft and deploy robust data pipelines and applications.

What this book will help me do

- Master the Spark 2.0 architecture and its Python integration with PySpark.
- Leverage PySpark DataFrames and RDDs for effective data manipulation and analysis.
- Develop scalable machine learning models using PySpark's ML and MLlib libraries.
- Understand advanced PySpark features such as GraphFrames for graph processing and TensorFrames for deep learning models.
- Gain expertise in deploying PySpark applications locally and on the cloud for production-ready solutions.

Author(s)

Authors Tomasz Drabas and Denny Lee bring extensive experience in data engineering and Python programming. They combine a practical, example-driven approach with deep insights into Apache Spark's ecosystem. Their expertise and clarity in writing make this book accessible for individuals aiming to excel in big data technologies with Python.

Who is it for?

This book is best suited for Python developers who want to integrate Apache Spark 2.0 into their workflow to process large-scale data. Ideal readers will have foundational knowledge of Python and seek to build scalable data-intensive applications using Spark, regardless of prior experience with Spark itself.