talk-data.com

Sandy Ryza

Speaker · Software Engineer, Databricks

5 talks

Sandy is a software engineer at Databricks, working on declarative pipelines. Previously, he was the lead engineer on the Dagster project and led ML teams at KeepTruckin and Clover Health. He is a committer on Spark and Hadoop and co-authored O'Reilly's Advanced Analytics with Spark.

Bio from: Data + AI Summit 2025


Talks & appearances

5 activities · Newest first

Lightning talk
with Sandy Ryza (Databricks), Denny Lee (Databricks), Xiao Li (Databricks)

Join us for an insightful Ask Me Anything (AMA) session on Declarative Pipelines, a powerful approach to simplifying and optimizing data workflows. Learn how to define data transformations using high-level, SQL-like semantics, reducing boilerplate code while improving performance and maintainability. Whether you're building ETL processes, feature engineering pipelines, or analytical workflows, this session will cover best practices, real-world use cases, and how Declarative Pipelines can streamline your data applications. Bring your questions and discover how to make your data processing more intuitive and efficient!
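
For a flavor of the approach, here is a minimal sketch of what a declarative pipeline definition can look like in Python, assuming the DLT-style decorator API (`import dlt`); the table names, columns, and storage path are hypothetical, and `spark` is provided by the pipeline runtime:

```python
# A minimal declarative-pipeline sketch, assuming the DLT-style Python API.
# Table names, columns, and the input path are hypothetical placeholders;
# `spark` is supplied by the pipeline runtime.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw events ingested from cloud storage")
def raw_events():
    return spark.read.format("json").load("/data/events/")

@dlt.table(comment="Events with nulls dropped and a derived date column")
def clean_events():
    return (
        dlt.read("raw_events")
        .where(F.col("event_type").isNotNull())
        .withColumn("event_date", F.to_date("event_ts"))
    )
```

Each decorated function declares a dataset and its upstream dependencies; the framework infers execution order and handles orchestration, which is the boilerplate reduction the session describes.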

Declarative Pipelines: What’s Next for the Apache Spark Ecosystem

Lakeflow Declarative Pipelines has made it dramatically easier to build production-grade Spark pipelines, using a framework that abstracts away orchestration and complexity. It's become a go-to solution for teams who want reliable, maintainable pipelines without reinventing the wheel. But we're just getting started. In this session, we'll take a step back and share a broader vision for the future of Spark Declarative Pipelines, one that opens the door to a new level of openness, standardization, and community momentum. We'll cover the core concepts behind Declarative Pipelines, where the architecture is headed, and what this shift means for both existing Lakeflow users and Spark engineers building procedural code. Don't miss this session; we'll be sharing something new that sets the direction for what comes next.

Advanced Analytics with PySpark

The amount of data being generated today is staggering and growing. Apache Spark has emerged as the de facto tool to analyze big data and is now a critical part of the data science toolbox. Updated for Spark 3.0, this practical guide brings together Spark, statistical methods, and real-world datasets to teach you how to approach analytics problems using PySpark, Spark's Python API, and other best practices in Spark programming. Data scientists Akash Tandon, Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills offer an introduction to the Spark ecosystem, then dive into patterns that apply common techniques (including classification, clustering, collaborative filtering, and anomaly detection) to fields such as genomics, security, and finance. This updated edition also covers NLP and image processing. If you have a basic understanding of machine learning and statistics and you program in Python, this book will get you started with large-scale data analysis. With this book, you will:

- Familiarize yourself with Spark's programming model and ecosystem
- Learn general approaches in data science
- Examine complete implementations that analyze large public datasets
- Discover which machine learning tools make sense for particular problems
- Explore code that can be adapted to many uses
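
To illustrate the kind of pattern the book walks through, here is a minimal, self-contained PySpark clustering sketch; the input file and column names are hypothetical placeholders, not an example taken from the book:

```python
# A minimal PySpark clustering sketch in the spirit of the book's patterns.
# The input path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("clustering-sketch").getOrCreate()

# Load a tabular dataset and assemble numeric columns into a feature vector.
df = spark.read.csv("events.csv", header=True, inferSchema=True)
assembler = VectorAssembler(inputCols=["duration", "bytes_sent"],
                            outputCol="features")
features = assembler.transform(df)

# Fit k-means and inspect the cluster assignment for a few rows.
model = KMeans(k=3, seed=42, featuresCol="features").fit(features)
model.transform(features).select("features", "prediction").show(5)
```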

Advanced Analytics with Spark, 2nd Edition

In the second edition of this practical book, four Cloudera data scientists present a set of self-contained patterns for performing large-scale data analysis with Spark. The authors bring Spark, statistical methods, and real-world data sets together to teach you how to approach analytics problems by example. Updated for Spark 2.1, this edition acts as an introduction to these techniques and other best practices in Spark programming. You'll start with an introduction to Spark and its ecosystem, and then dive into patterns that apply common techniques (including classification, clustering, collaborative filtering, and anomaly detection) to fields such as genomics, security, and finance. If you have an entry-level understanding of machine learning and statistics, and you program in Java, Python, or Scala, you'll find the book's patterns useful for working on your own data applications. With this book, you will:

- Familiarize yourself with the Spark programming model
- Become comfortable within the Spark ecosystem
- Learn general approaches in data science
- Examine complete implementations that analyze large public data sets
- Discover which machine learning tools make sense for particular problems
- Acquire code that can be adapted to many uses

Advanced Analytics with Spark

In this practical book, four Cloudera data scientists present a set of self-contained patterns for performing large-scale data analysis with Spark. The authors bring Spark, statistical methods, and real-world data sets together to teach you how to approach analytics problems by example. You’ll start with an introduction to Spark and its ecosystem, and then dive into patterns that apply common techniques—classification, collaborative filtering, and anomaly detection among others—to fields such as genomics, security, and finance. If you have an entry-level understanding of machine learning and statistics, and you program in Java, Python, or Scala, you’ll find these patterns useful for working on your own data applications.