Discover how to build a powerful AI Lakehouse and unified data fabric natively on Google Cloud. Leverage BigQuery's serverless scale and robust analytics capabilities as the core, seamlessly integrating open data formats via Apache Iceberg and efficient processing with managed Spark environments such as Dataproc. Explore the essential components of this modern data environment, including data architecture best practices, robust integration strategies, strong data quality assurance, and efficient metadata management with Google Cloud Data Catalog. Learn how Google Cloud's comprehensive ecosystem accelerates advanced analytics, preparing your data for sophisticated machine learning initiatives and enabling direct connection to services like Vertex AI.
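As a minimal sketch of the "BigQuery as the core" idea, the snippet below queries a lakehouse table with the BigQuery Python client. The project, dataset, and table names are placeholders; the table could be a native BigQuery table or a BigLake table over Iceberg files in Cloud Storage.

```python
# Minimal sketch, assuming a hypothetical `example-project.lake.orders` table
# and the google-cloud-bigquery client library.
from google.cloud import bigquery

client = bigquery.Client(project="example-project")

query = """
    SELECT symbol, SUM(amount) AS total_amount
    FROM `example-project.lake.orders`
    GROUP BY symbol
    ORDER BY total_amount DESC
    LIMIT 10
"""

# BigQuery executes the query serverlessly; we only iterate over result rows.
for row in client.query(query).result():
    print(row.symbol, row.total_amount)
```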
talk-data.com
Topic: Dataproc (Google Cloud Dataproc), 11 items tagged

Top Events
Get the inside story of Yahoo’s data lake transformation. For a Hadoop pioneer like Yahoo, the move to Google Cloud marks a significant shift in data strategy. Explore the business drivers behind this transformation, the technical hurdles encountered, and the strategic partnership with Google Cloud that enabled a seamless migration. We’ll uncover key lessons, best practices for data lake modernization, and how Yahoo is using BigQuery, Dataproc, Pub/Sub, and other services to drive business value, enhance operational efficiency, and fuel their AI initiatives.
Overwhelmed by the complexities of building a robust and scalable data pipeline for algo trading with AlloyDB? This session provides the Google Cloud services, tools, recommendations, and best practices you need to succeed. We'll explore battle-tested strategies for implementing a low-latency, high-volume trading platform using AlloyDB and Spark Streaming on Dataproc.
Leverage Cloud Composer orchestration to create a scalable, efficient pipeline that meets the demands of algo trading, using the elasticity of Google Cloud services to absorb growing data volumes and trading activity.
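A loose sketch of this ingest pattern, under stated assumptions: trade events arrive on a Kafka topic, and Spark Structured Streaming on Dataproc flushes micro-batches into AlloyDB through its PostgreSQL-compatible JDBC interface. The broker, topic, host, table, and credentials below are all illustrative placeholders, not values from the session.

```python
# Hedged sketch, assuming a Kafka source of trade events and the PostgreSQL
# JDBC driver on the cluster; AlloyDB speaks the PostgreSQL wire protocol.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("algo-trading-ingest-sketch").getOrCreate()

trades = (
    spark.readStream.format("kafka")                   # requires the Spark-Kafka connector
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "trades")
    .load()
)

def write_to_alloydb(batch_df, batch_id):
    # Structured Streaming has no direct JDBC sink, so each micro-batch
    # is written with a plain batch JDBC write via foreachBatch.
    (
        batch_df.selectExpr("CAST(key AS STRING) AS symbol",
                            "CAST(value AS STRING) AS payload")
        .write.format("jdbc")
        .option("url", "jdbc:postgresql://10.0.0.5:5432/trading")  # placeholder AlloyDB private IP
        .option("dbtable", "raw_trades")
        .option("user", "pipeline")
        .option("password", "change-me")  # use Secret Manager in practice
        .mode("append")
        .save()
    )

trades.writeStream.foreachBatch(write_to_alloydb).start().awaitTermination()
```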
Modern analytics and AI workloads demand a unified storage layer for structured and unstructured data. Learn how Cloud Storage simplifies building data lakes based on Apache Iceberg. We’ll discuss storage best practices and new capabilities that enable high performance and cost efficiency. We’ll also guide you through real-world examples, including Iceberg data lakes with BigQuery or third-party solutions, data preparation for AI pipelines with Dataproc and Apache Spark, and how customers have built unified analytics and AI solutions on Cloud Storage.
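As a rough sketch of the Iceberg-on-Cloud-Storage setup described above (bucket, catalog, and table names are invented for illustration), Spark on Dataproc can manage Iceberg tables directly over a GCS warehouse path, assuming the iceberg-spark-runtime jar is available on the cluster.

```python
# Minimal sketch, assuming the Iceberg Spark runtime jar is installed and
# gs://example-bucket is writable; all names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-on-gcs-sketch")
    # Register a Hadoop-style Iceberg catalog rooted in a Cloud Storage path.
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "gs://example-bucket/warehouse")
    .getOrCreate()
)

# Create an Iceberg table and append a small batch of rows.
spark.sql("CREATE TABLE IF NOT EXISTS lake.sales.orders "
          "(id BIGINT, symbol STRING, amount DOUBLE) USING iceberg")
spark.createDataFrame([(1, "bike", 9.99), (2, "kayak", 24.50)],
                      ["id", "symbol", "amount"]) \
    .writeTo("lake.sales.orders").append()

# Downstream AI data prep can then read the same open-format table.
spark.table("lake.sales.orders").filter("amount > 0").show()
```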
NVIDIA GPUs accelerate batch ETL workloads with significant cost savings and performance gains. In this session, we will delve into optimizing Apache Spark on GCP Dataproc using the G2 accelerator-optimized machine series with L4 GPUs and the RAPIDS Accelerator for Apache Spark, showcasing up to 14x speedups and 80% cost reductions for Spark applications. We will demonstrate this acceleration through a reference AI architecture for financial transaction fraud detection and walk through performance measurements.
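A hedged sketch of enabling the RAPIDS Accelerator in a Spark session. It assumes a Dataproc cluster with L4 GPUs (G2 series) and the RAPIDS plugin jar already installed, e.g., via an initialization action; the property names follow the RAPIDS plugin documentation, and the paths are placeholders.

```python
# Sketch only: the plugin jar must be on the cluster for this to take effect.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("gpu-etl-sketch")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")  # route Spark SQL to GPUs
    .config("spark.rapids.sql.enabled", "true")
    .config("spark.executor.resource.gpu.amount", "1")      # one L4 per executor
    .config("spark.task.resource.gpu.amount", "0.25")       # 4 concurrent tasks per GPU
    .getOrCreate()
)

# Unchanged DataFrame code then runs GPU-accelerated where operations are supported.
df = spark.read.parquet("gs://example-bucket/transactions/")
df.groupBy("merchant").sum("amount").write.parquet("gs://example-bucket/agg/")
```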
Unstructured data makes up the majority of all new data, a trend that has been growing exponentially since 2018. At these volumes, vector embeddings require trained indexes so that nearest neighbors can be efficiently approximated, avoiding exhaustive lookups. However, training these indexes puts intense demand on vector databases to maintain high ingest throughput. In this session, we will explain how the NVIDIA cuVS library is turbocharging vector database ingest with GPUs, providing speedups of 5-20x and improving data readiness.
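A loosely hedged sketch of GPU-accelerated ANN index building with cuVS. Function and parameter names follow the cuVS Python API as of recent RAPIDS releases and may differ by version; the random vectors stand in for real embeddings.

```python
# Sketch only: cuVS API details are version-dependent; verify against the
# installed release. Requires a CUDA-capable GPU with cuvs and cupy installed.
import cupy as cp
from cuvs.neighbors import cagra

vectors = cp.random.random((100_000, 768), dtype=cp.float32)  # stand-in embeddings

# Build a CAGRA graph index on the GPU, then query approximate neighbors.
index = cagra.build(cagra.IndexParams(metric="sqeuclidean"), vectors)
distances, neighbors = cagra.search(cagra.SearchParams(), index, vectors[:10], 5)
print(cp.asarray(neighbors))
```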
This session will detail the process of architecting enterprise-grade big data pipelines, encompassing the orchestration of ephemeral Dataproc clusters, customization through custom images, and the strategic incorporation of GPU resources. Real-world use cases, best practices, challenges, and future trends in this domain will also be discussed, providing actionable insights for implementing cutting-edge big data solutions.
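The ephemeral-cluster pattern is commonly expressed as an Airflow DAG on Cloud Composer: create a short-lived cluster, run the job, and tear the cluster down even if the job fails. The sketch below uses the Google provider's Dataproc operators; project, region, bucket, and machine settings are placeholders, and a production cluster config would carry more detail.

```python
# Hedged sketch of an ephemeral Dataproc pipeline for Cloud Composer/Airflow.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)

PROJECT, REGION, CLUSTER = "example-project", "us-central1", "ephemeral-etl"

with DAG("ephemeral_dataproc_etl", start_date=datetime(2024, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    create = DataprocCreateClusterOperator(
        task_id="create_cluster", project_id=PROJECT, region=REGION,
        cluster_name=CLUSTER,
        # Minimal illustrative config; real clusters add custom images, GPUs, etc.
        cluster_config={
            "worker_config": {"num_instances": 2,
                              "machine_type_uri": "n2-standard-4"},
        },
    )
    run_job = DataprocSubmitJobOperator(
        task_id="run_spark_job", project_id=PROJECT, region=REGION,
        job={
            "placement": {"cluster_name": CLUSTER},
            "pyspark_job": {"main_python_file_uri": "gs://example-bucket/jobs/etl.py"},
        },
    )
    delete = DataprocDeleteClusterOperator(
        task_id="delete_cluster", project_id=PROJECT, region=REGION,
        cluster_name=CLUSTER,
        trigger_rule="all_done",  # tear down even when the job fails
    )
    create >> run_job >> delete
```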
Learn how Dataproc can support your hybrid multicloud strategy and help you meet your business goals for your big data open source analytics workloads. Discover how LiveRamp achieved performance boosts and cost reductions by migrating to Dataproc. Learn their migration secrets, overcome common hurdles, and leverage Dataproc's hidden gems for a seamless transition.
This book is your practical and comprehensive guide to learning Google Cloud Platform (GCP) for data science, using only the free tier services offered by the platform. Data science and machine learning are increasingly becoming critical to businesses of all sizes, and the cloud provides a powerful platform for these applications. GCP offers a range of data science services that can be used to store, process, and analyze large datasets, and to train and deploy machine learning models. The book is organized into seven chapters covering topics such as GCP account setup, Google Colaboratory, Big Data and Machine Learning, Data Visualization and Business Intelligence, Data Processing and Transformation, Data Analytics and Storage, and Advanced Topics. Each chapter provides step-by-step instructions and examples illustrating how to use GCP services for data science and big data projects. Readers will learn how to set up a Google Colaboratory account and run Jupyter notebooks, access GCP services and data from Colaboratory, use BigQuery for data analytics, and deploy machine learning models using Vertex AI. The book also covers how to visualize data using Looker Data Studio, run data processing pipelines using Google Cloud Dataflow and Dataprep, and store data using Google Cloud Storage and SQL.

What You Will Learn
- Set up a GCP account and project
- Explore BigQuery and its use cases, including machine learning
- Understand Google Cloud AI Platform and its capabilities
- Use Vertex AI for training and deploying machine learning models
- Explore Google Cloud Dataproc and its use cases for big data processing
- Create and share data visualizations and reports with Looker Data Studio
- Explore Google Cloud Dataflow and its use cases for batch and stream data processing
- Run data processing pipelines on Cloud Dataflow
- Explore Google Cloud Storage and its use cases for data storage
- Get an introduction to Google Cloud SQL and its use cases for relational databases
- Get an introduction to Google Cloud Pub/Sub and its use cases for real-time data streaming

Who This Book Is For
Data scientists, machine learning engineers, and analysts who want to learn how to use Google Cloud Platform (GCP) for their data science and big data projects
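For flavor, here is a minimal notebook-style sketch of the Colaboratory-to-BigQuery workflow the book walks through. The project ID is a placeholder, and the query runs against a public dataset.

```python
# Minimal Colab sketch: authenticate the notebook, then query BigQuery
# into a pandas DataFrame. "example-project" is a placeholder project ID.
from google.colab import auth
auth.authenticate_user()  # grants this notebook your user credentials

from google.cloud import bigquery

client = bigquery.Client(project="example-project")
df = client.query("""
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
""").to_dataframe()
print(df)
```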
Today I’m chatting with Bruno Aziza, Head of Data & Analytics at Google Cloud. Bruno leads a team of outbound product managers in charge of BigQuery, Dataproc, Dataflow, and Looker, and we dive deep into what Bruno looks for in terms of skills for these leaders. Bruno describes the three patterns of operational alignment he’s observed in data product management, as well as why he feels ownership and customer obsession are two of the most important qualities a good product manager can have. Bruno and I also dive into how to effectively abstract the core problem you’re solving, as well as how to determine whether a problem might be solved in a better way.
Highlights / Skip to:
- Bruno introduces himself and explains how he created his “CarCast” podcast (00:45)
- Bruno describes his role at Google, the product managers he leads, and the specific Google Cloud products in his portfolio (02:36)
- What Bruno feels are the most important attributes to look for in a good data product manager (03:59)
- Bruno details how a good product manager focuses on not only the core problem, but how the problem is currently solved and whether or not that’s acceptable (07:20)
- What effectively abstracting the problem looks like in Bruno’s view and why he positions product management as a way to help users move forward in their career (12:38)
- Why Bruno sees extracting value from data as the number one pain point for data teams and their respective companies (17:55)
- Bruno gives his definition of a data product (21:42)
- The three patterns Bruno has observed of operational alignment when it comes to data product management (27:57)
- Bruno explains the best practices he’s seen for cross-team goal setting and problem-framing (35:30)
Quotes from Today’s Episode
“What’s happening in the industry is really interesting. For people that are running data teams today and listening to us, the makeup of their teams is starting to look more like what we do [in] product management.” — Bruno Aziza (04:29)
“The problem is the problem, so focus on the problem, decompose the problem, look at the frictions that are acceptable, look at the frictions that are not acceptable, and look at how by assembling a solution, you can make it most seamless for the individual to go out and get the job done.” – Bruno Aziza (11:28)
“As a product manager, yes, we’re in the business of software, but in fact, I think you’re in the career management business. Your job is to make sure that whatever your customer’s job is that you’re making it so much easier that they, in fact, get so much more done, and by doing so they will get promoted, get the next job.” – Bruno Aziza (15:41)
“I think that is the task of any technology company, of any product manager that’s helping these technology companies: don’t be building a product that’s looking for a problem. Just start with the problem back and solution from that. Just make sure you understand the problem very well.” – Bruno Aziza (19:52)
“If you’re a data product manager today, you look at your data estate and you ask yourself, ‘What am I building to save money? When am I building to make money?’ If you can do both, that’s absolutely awesome. And so, the data product is an asset that has been built repeatedly by a team and generates value out of data.” – Bruno Aziza (23:12)
“[Machine learning is] hard because multiple teams have to work together, right? You got your business analyst over here, you’ve got your data scientists over there, they’re not even the same team. And so, sometimes you’re struggling with just the human aspect of it.” – Bruno Aziza (30:30)
“As a data leader, an IT leader, you got to think about those soft ways to accomplish the stuff that’s binary, that’s the hard [stuff], right? I always joke, the hard stuff is the soft stuff for people like us because we think about data, we think about logic, we think, ‘Okay if it makes sense, it will be implemented.’ For most of us, getting stuff done is through people. And people are emotional, how can you express the feeling of achieving that goal in emotional value?” – Bruno Aziza (37:36)
Links
- “Good Product Manager/Bad Product Manager,” as referenced by Bruno: https://a16z.com/2012/06/15/good-product-managerbad-product-manager/
- Bruno’s LinkedIn: https://www.linkedin.com/in/brunoaziza/
- Bruno’s Medium article on Competing Against Luck by Clayton M. Christensen: https://brunoaziza.medium.com/competing-against-luck-3daeee1c45d4
- The Data CarCast on YouTube: https://www.youtube.com/playlist?list=PLRXGFo1urN648lrm8NOKXfrCHzvIHeYyw
Learn how easy it is to apply sophisticated statistical and machine learning methods to real-world problems when you build using Google Cloud Platform (GCP). This hands-on guide shows data engineers and data scientists how to implement an end-to-end data pipeline with cloud-native tools on GCP. Throughout this updated second edition, you'll work through a sample business decision by employing a variety of data science approaches. Follow along by building a data pipeline in your own project on GCP, and discover how to solve data science problems in a transformative and more collaborative way.

You'll learn how to:
- Employ best practices in building highly scalable data and ML pipelines on Google Cloud
- Automate and schedule data ingest using Cloud Run
- Create and populate a dashboard in Data Studio
- Build a real-time analytics pipeline using Pub/Sub, Dataflow, and BigQuery
- Conduct interactive data exploration with BigQuery
- Create a Bayesian model with Spark on Cloud Dataproc
- Forecast time series and do anomaly detection with BigQuery ML
- Aggregate within time windows with Dataflow
- Train explainable machine learning models with Vertex AI
- Operationalize ML with Vertex AI Pipelines
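As a small illustration of one technique from that list, aggregating within time windows, the sketch below uses Apache Beam, the SDK behind Dataflow. It runs locally on the DirectRunner, and the timestamped events are made up for the example.

```python
# Minimal Beam sketch: per-key sums within fixed 60-second event-time windows.
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

with beam.Pipeline() as p:
    (
        p
        | beam.Create([("bike", 1, 0.0), ("bike", 2, 45.0), ("bike", 3, 75.0)])
        # Attach event-time timestamps (seconds) so windowing has a time axis.
        | beam.Map(lambda e: TimestampedValue((e[0], e[1]), e[2]))
        | beam.WindowInto(FixedWindows(60))  # 60-second fixed windows
        | beam.CombinePerKey(sum)            # per-key sum inside each window
        | beam.Map(print)                    # emits ("bike", 3) for each window
    )
```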
Build an open, secure, and integrated AI data platform. Manage the end-to-end data life cycle with built-in governance.