talk-data.com

Topic

Dataproc

Google Cloud Dataproc

big_data hadoop spark

Activities

tagged

Activity Trend

1 peak/qtr

2020-Q1 2026-Q1

Top Events

Google Cloud Next '25 5 Google Cloud Next '24 2 O'Reilly Data Science Books 2 Big Data LDN 2025 1 Experiencing Data w/ Brian T. O’Neill (AI & data product management leadership—powered by UX design) 1

Top Speakers

Dana Soltani (Google Cloud) 3 Felix Cheung (NVIDIA) 1 Gareth Williams (Digital Health and Care Wales) 1 Sachin Pawar (Google) 1 Sandika S. Sukhdeve 1 Edward Yang (Two Sigma) 1 Ravichandran Jagannathan (American Express) 1 Akshay Sarma (Yahoo) 1 Bruno Aziza (Google Cloud) 1 Corey Nolet (Nvidia) 1 Apurva Desai (Google Cloud) 1 Ramnik Kaur (LiveRamp) 1

Activities

Showing filtered results

All Video Podcast Book

Filtering by: Google Cloud Next '25 ×

Hadoop pioneer to cloud innovator: Yahoo’s data lake modernization journey

2025-04-11 · Google Cloud Next '25

session

by Akshay Sarma (Yahoo) , Ayyappan Arasu (Yahoo) , Dana Soltani (Google Cloud)

AI/ML BigQuery Cloud Computing Data Lake GCP Hadoop Pub/Sub

Get the inside story of Yahoo’s data lake transformation. As a Hadoop pioneer, Yahoo’s move to Google Cloud is a significant shift in data strategy. Explore the business drivers behind this transformation, technical hurdles encountered, and strategic partnership with Google Cloud that enabled a seamless migration. We’ll uncover key lessons, best practices for data lake modernization, and how Yahoo is using BigQuery, Dataproc, Pub/Sub, and other services to drive business value, enhance operational efficiency, and fuel their AI initiatives.

Construct a scalable, high-volume trading platform with low latency using AlloyDB and Spark Streaming on Dataproc

2025-04-10 · Google Cloud Next '25

session

by Sachin Pawar (Google) , Surjit Singh (Google)

Cloud Computing GCP Spark Data Streaming

Overwhelmed by the complexities of building a robust and scalable data pipeline for algo trading with AlloyDB? This session provides the Google Cloud services, tools, recommendations, and best practices you need to succeed. We'll explore battle-tested strategies for implementing a low-latency, high-volume trading platform using AlloyDB and Spark Streaming on Dataproc.

Leverage Composer Orchestration to create a scalable and efficient data pipeline that meets the demands of algo trading and can handle increasing data volumes and trading activity by utilizing the scalability of Google Cloud services.

Under the Iceberg: Simple, unified Cloud Storage for analytics data lakes

2025-04-10 · Google Cloud Next '25

session

by Edward Yang (Two Sigma) , Vivek Saraswat (Google Cloud) , Dave Stiver (Google Cloud)

AI/ML Analytics BigQuery Cloud Computing Cloud Storage Iceberg Spark

Modern analytics and AI workloads demand a unified storage layer for structured and unstructured data. Learn how Cloud Storage simplifies building data lakes based on Apache Iceberg. We’ll discuss storage best practices and new capabilities that enable high performance and cost efficiency. We’ll also guide you through real-world examples, including Iceberg data lakes with BigQuery or third-party solutions, data preparation for AI pipelines with Dataproc and Apache Spark, and how customers have built unified analytics and AI solutions on Cloud Storage.

Drive AI workloads with GPU-accelerated data processing, vector indexing and search

2025-04-10 · Google Cloud Next '25

session

by Felix Cheung (NVIDIA) , Corey Nolet (Nvidia)

AI/ML Cloud Computing ETL/ELT GCP Spark Vector DB

NVIDIA GPUs accelerate batch ETL workloads at significant cost savings and performance. In this session, we will delve into optimizing Apache Spark on GCP Dataproc using the G2 accelerator-optimized series with L4 GPUs via RAPIDS Accelerator For Apache Spark, showcasing up to 14x speedups and 80% cost reductions for Spark applications. We will demonstrate this acceleration through a reference AI architecture on financial transaction fraud detection, and go through performance measurements.

Unstructured data makes up the majority of all new data; a trend that's been growing exponentially since 2018. At these volumes, vector embeddings require indexes to be trained so that nearest neighbors can be efficiently approximated, avoiding the need for exhaustive lookups. However, training these indexes puts intense demand on vector databases to maintain a high ingest throughput. In this session, we will explain how the NVIDIA cuVS library is turbo charging vector database ingest with GPUs, providing speedups from 5-20x and improving data readiness.

This Session is hosted by a Google Cloud Next Sponsor.
Visit your registration profile at g.co/cloudnext to opt out of sharing your contact information with the sponsor hosting this session.

Create a unified data platform with BigQuery

· Google Cloud Next '25

demo

AI/ML BigQuery product-biglake product-bigquery product-dataplex product-dataproc

Build an open, secure, and integrated AI data platform. Manage the end-to-end data life cycle with built-in governance.