This session dives into the world of on-demand Apache Spark on Google Cloud. We explore its native integration with BigQuery, its new capabilities, and the benefits of using Spark for AI and machine learning (ML) workloads. We'll discuss why Spark is a good choice for large-scale data processing, distributed training, and distributed inference. We'll also hear from Trivago about how they used Spark and BigQuery together to simplify their AI and ML workflows.
Overwhelmed by the complexities of building a robust and scalable data pipeline for algo trading with AlloyDB? This session covers the Google Cloud services, tools, recommendations, and best practices you need to succeed. We'll explore battle-tested strategies for implementing a low-latency, high-volume trading platform using AlloyDB and Spark Streaming on Dataproc.
Learn how to use Cloud Composer orchestration to build a scalable, efficient data pipeline that meets the demands of algo trading and, by drawing on the scalability of Google Cloud services, keeps pace with growing data volumes and trading activity.
This session provides a comprehensive guide to building a secure and unified AI lakehouse on BigQuery with the power of open source software (OSS). We’ll explore essential components, including data ingestion, storage, and management; AI and machine learning workflows; pipeline orchestration; data governance; and operational efficiency. Learn about the newest features that support both Apache Spark and Apache Iceberg.
Modern analytics and AI workloads demand a unified storage layer for structured and unstructured data. Learn how Cloud Storage simplifies building data lakes based on Apache Iceberg. We’ll discuss storage best practices and new capabilities that enable high performance and cost efficiency. We’ll also guide you through real-world examples, including Iceberg data lakes with BigQuery or third-party solutions, data preparation for AI pipelines with Dataproc and Apache Spark, and how customers have built unified analytics and AI solutions on Cloud Storage.
NVIDIA GPUs accelerate batch ETL workloads, delivering significant cost savings and performance gains. In this session, we will delve into optimizing Apache Spark on GCP Dataproc using the G2 accelerator-optimized series with L4 GPUs and the RAPIDS Accelerator for Apache Spark, showcasing up to 14x speedups and 80% cost reductions for Spark applications. We will demonstrate this acceleration through a reference AI architecture for financial transaction fraud detection and walk through the performance measurements.
Unstructured data makes up the majority of all new data, a trend that has been growing exponentially since 2018. At these volumes, vector embeddings require indexes to be trained so that nearest neighbors can be efficiently approximated, avoiding the need for exhaustive lookups. However, training these indexes puts intense demand on vector databases to maintain high ingest throughput. In this session, we will explain how the NVIDIA cuVS library is turbocharging vector database ingest with GPUs, providing speedups of 5x to 20x and improving data readiness.
This session is hosted by a Google Cloud Next sponsor.