talk-data.com

Topic: Apache Spark (Spark)

Tags: big_data, distributed_computing, analytics

66 activities tagged

Activity Trend: peak of 71 per quarter, 2020-Q1 to 2026-Q1

Activities

Showing results filtered by: Data + AI Summit 2025
Lightning talk by DB Tsai (Databricks), Jules S. Damji (Anyscale Inc) and Allison Wang (Databricks)

Join us for an interactive Ask Me Anything (AMA) session on the latest advancements in Apache Spark 4, including Spark Connect — the new client-server architecture enabling seamless integration with IDEs, notebooks and custom applications. Learn about performance improvements, enhanced APIs and best practices for leveraging Spark’s next-generation features. Whether you're a data engineer, Spark developer or big data enthusiast, bring your questions on architecture, real-world use cases and how these innovations can optimize your workflows. Don’t miss this chance to dive deep into the future of distributed computing with Spark!
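As context for the Spark Connect discussion, here is a minimal, hedged sketch of connecting a local Python client to a Spark Connect server; the endpoint URL is a placeholder and the `pyspark[connect]` extra is assumed to be installed.

```python
# Minimal Spark Connect client sketch (PySpark 3.4+ / 4.x); the endpoint is a placeholder.
from pyspark.sql import SparkSession

# Connect to a remote Spark Connect server instead of launching a local JVM.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# DataFrame operations are sent over gRPC and executed on the remote cluster.
df = spark.range(10).withColumnRenamed("id", "n")
df.filter("n % 2 = 0").show()
```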

PDF Document Ingestion Accelerator for GenAI Applications

Databricks Financial Services customers in the GenAI space share a common use case: ingesting and processing unstructured documents (PDFs and images), then performing downstream GenAI tasks such as entity extraction and RAG-based knowledge Q&A. The pain points for these use cases are: (1) the quality of the PDF/image documents varies, since many older physical documents were scanned into electronic form; (2) the complexity of the documents varies, and many contain tables and images with embedded information that require slower Tesseract OCR; and (3) customers want to streamline post-processing for downstream workloads. In this talk we will present an optimized Structured Streaming workflow for complex PDF ingestion. The key techniques include Apache Spark™ optimization, multi-threading, PDF object extraction, skew handling and automatic retry logic.
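As a rough illustration of the ingestion pattern only (not the optimized streaming workflow presented in the talk), the sketch below reads PDFs with Spark's binaryFile source and applies a placeholder extraction function; `extract_pdf_text` and the input path are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

def extract_pdf_text(content: bytes) -> str:
    # Placeholder: plug in pdfplumber / Tesseract OCR here, with retry logic around it.
    return ""

extract_udf = udf(extract_pdf_text, StringType())

pdfs = (
    spark.read.format("binaryFile")
    .option("pathGlobFilter", "*.pdf")
    .load("/path/to/raw_documents")          # placeholder input directory
)
extracted = pdfs.select("path", extract_udf("content").alias("text"))
extracted.show(truncate=80)
```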

Breaking Barriers: Building Custom Spark 4.0 Data Connectors with Python

Building a custom Spark data source connector once required Java or Scala expertise, making it complex and limiting. This left many proprietary data sources without public SDKs disconnected from Spark. Additionally, data sources with Python SDKs couldn't harness Spark’s distributed power. Spark 4.0 changes this with a new Python API for data source connectors, allowing developers to build fully functional connectors without Java or Scala. This unlocks new possibilities, from integrating proprietary systems to leveraging untapped data sources. Supporting both batch and streaming, this API makes data ingestion more flexible than ever. In this talk, we’ll demonstrate how to build a Spark connector for Excel using Python, showcasing schema inference, data reads/writes and streaming support. Whether you're a data engineer or Spark enthusiast, you’ll gain the knowledge to integrate Spark with any data source — entirely in Python.
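For readers unfamiliar with the Spark 4.0 Python Data Source API, here is a minimal reader-only sketch; the actual Excel parsing (e.g. via openpyxl) is omitted and the rows are hard-coded, so treat it as an outline rather than the connector demonstrated in the session.

```python
# Reader-only sketch of a custom Python data source (Spark 4.0). Rows are hard-coded;
# a real connector would parse the workbook at self.options["path"] with openpyxl.
from pyspark.sql import SparkSession
from pyspark.sql.datasource import DataSource, DataSourceReader

class ExcelDemoDataSource(DataSource):
    @classmethod
    def name(cls):
        return "excel_demo"

    def schema(self):
        return "sheet string, value double"

    def reader(self, schema):
        return ExcelDemoReader(self.options)

class ExcelDemoReader(DataSourceReader):
    def __init__(self, options):
        self.path = options.get("path")

    def read(self, partition):
        # Yield one tuple per row; in practice, iterate over worksheet rows here.
        yield ("Sheet1", 1.0)
        yield ("Sheet1", 2.0)

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(ExcelDemoDataSource)
spark.read.format("excel_demo").load("/tmp/example.xlsx").show()
```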

Serverless as the New "Easy Button": How HP Inc. Used Serverless to Turbocharge Their Data Pipeline

How do you wrangle over 8TB of granular “hit-level” website analytics data with hundreds of columns, all while eliminating the overhead of cluster management, decreasing runtime and saving money? In this session, we’ll dive into how we helped HP Inc. use Databricks serverless compute and Lakeflow Declarative Pipelines to streamline Adobe Analytics data ingestion while making it faster, cheaper and easier to operate. We’ll walk you through our full migration story — from managing unwieldy custom-defined AWS-based Apache Spark™ clusters to spinning up Databricks serverless pipelines and workflows with on-demand scalability and near-zero overhead. If you want to simplify infrastructure, optimize performance and get more out of your Databricks workloads, this session is for you.
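As a hedged sketch of what a Lakeflow Declarative Pipelines (formerly Delta Live Tables) ingestion definition can look like, assuming Databricks Auto Loader and placeholder table and path names (this is not HP's actual pipeline):

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw Adobe Analytics hit-level data, ingested incrementally")
def adobe_hits_raw():
    # `spark` and `dlt` are provided by the Databricks pipeline runtime.
    return (
        spark.readStream.format("cloudFiles")            # Databricks Auto Loader
        .option("cloudFiles.format", "parquet")
        .load("/Volumes/analytics/adobe/hits/")          # placeholder landing path
    )

@dlt.table(comment="Cleaned hits, keeping only rows downstream teams can use")
def adobe_hits_clean():
    return dlt.read_stream("adobe_hits_raw").where(col("visit_id").isNotNull())
```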

The Upcoming Apache Spark 4.1: The Next Chapter in Unified Analytics

Apache Spark has long been recognized as the leading open-source unified analytics engine, combining a simple yet powerful API with a rich ecosystem and top-notch performance. In the upcoming Spark 4.1 release, the community reimagines Spark to excel at both massive cluster deployments and local laptop development. We’ll start with new single-node optimizations that make PySpark even more efficient for smaller datasets. Next, we’ll delve into a major “Pythonizing” overhaul — simpler installation, clearer error messages and Pythonic APIs. On the ETL side, we’ll explore greater data source flexibility (including the simplified Python Data Source API) and a thriving UDF ecosystem. We’ll also highlight enhanced support for real-time use cases, built-in data quality checks and the expanding Spark Connect ecosystem — bridging local workflows with fully distributed execution. Don’t miss this chance to see Spark’s next chapter!

Unity Catalog Lakeguard: Secure and Efficient Compute for Your Enterprise

Modern data workloads span multiple sources — data lakes, databases, apps like Salesforce and services like cloud functions. But as teams scale, secure data access and governance across shared compute becomes critical. In this session, learn how to confidently integrate external data and services into your workloads using Spark and Unity Catalog on Databricks. We'll explore compute options like serverless, clusters, workflows and SQL warehouses, and show how Unity Catalog’s Lakeguard enforces fine-grained governance — even when compute is shared concurrently by multiple users. Walk away ready to choose the right compute model for your team’s needs — without sacrificing security or efficiency.

Bridging Big Data and AI: Empowering PySpark With Lance Format for Multi-Modal AI Data Pipelines

PySpark has long been a cornerstone of big data processing, excelling in data preparation, analytics and machine learning tasks within traditional data lakes. However, the rise of multimodal AI and vector search introduces challenges beyond its capabilities. Spark’s new Python data source API enables integration with emerging AI data lakes built on the multi-modal Lance format. Lance delivers unparalleled value with its zero-copy schema evolution capability and robust support for large record-size data (e.g., images, tensors and embeddings), simplifying multimodal data storage. Its advanced indexing for semantic and full-text search, combined with rapid random access, enables high-performance AI data analytics on par with SQL. By unifying PySpark's robust processing capabilities with Lance's AI-optimized storage, data engineers and scientists can efficiently manage and analyze the diverse data types required for cutting-edge AI applications within a familiar big data framework.
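As a small, driver-side illustration of handing a Lance dataset to Spark (the talk's approach goes further and reads Lance in parallel through the Python Data Source API); the dataset path and column names are placeholders:

```python
import lance                                   # pylance package
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

ds = lance.dataset("/data/captions.lance")     # placeholder dataset path
arrow_table = ds.to_table(columns=["id", "caption", "embedding"])

# Hand off to Spark for familiar DataFrame-style processing.
df = spark.createDataFrame(arrow_table.to_pandas())
df.printSchema()
```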

Leveraging GenAI for Synthetic Data Generation to Improve Spark Testing and Performance in Big Data

Testing Spark jobs in local environments is often difficult due to the lack of suitable datasets, especially under tight timelines. This creates challenges when jobs work in development clusters but fail in production, or when they run locally but encounter issues in staging clusters due to inadequate documentation or checks. In this session, we’ll discuss how these challenges can be overcome by leveraging Generative AI to create custom synthetic datasets for local testing. By incorporating variations and sampling, a testing framework can be introduced to solve some of these challenges, allowing for the generation of realistic data to aid in performance and load testing. We’ll show how this approach helps identify performance bottlenecks early, optimize job performance and recognize scalability issues while keeping costs low. This methodology fosters better deployment practices and enhances the reliability of Spark jobs across environments.
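A minimal sketch of the idea, with a plain random generator standing in for the GenAI call and illustrative column names; the framework described in the talk adds schema-aware variation and sampling on top of this pattern:

```python
import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()

def generate_rows(n: int):
    # Placeholder: in practice, prompt an LLM for schema-aware, realistic variations.
    return [(i, random.choice(["US", "DE", "IN"]), round(random.uniform(1, 500), 2))
            for i in range(n)]

df = spark.createDataFrame(generate_rows(10_000), "order_id long, country string, amount double")

# The job under test runs locally against synthetic data before it ever hits a cluster.
result = df.groupBy("country").sum("amount")
assert result.count() == 3
```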

Delta and Databricks as a Performant Exabyte-Scale Application Backend

The Delta Lake architecture promises to provide a single, highly functional, and high-scale copy of data that can be leveraged by a variety of tools to satisfy a broad range of use cases. To date, most use cases have focused on interactive data warehousing, ETL, model training, and streaming. Real-time access is generally delegated to costly and sometimes difficult-to-scale NoSQL, indexed storage, and domain-specific specialty solutions, which provide limited functionality compared to Spark on Delta Lake. In this session, we will explore the Delta data-skipping and optimization model and discuss how Capital One leveraged it along with Databricks Photon and Spark Connect to implement a real-time web application backend. We’ll share how we built a highly functional and performant security information and event management user experience (SIEM UX) that is cost-effective.
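For context on the data-skipping model, a hedged example of laying out a Delta table so file-level statistics can prune most files at query time; the table and column names are invented for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Co-locate frequently filtered columns so Delta's file-level min/max statistics
# can skip most files at query time.
spark.sql("OPTIMIZE security_events ZORDER BY (event_time, source_ip)")

# A point lookup then prunes files using the statistics on the Z-ordered columns.
spark.sql("""
    SELECT * FROM security_events
    WHERE source_ip = '10.0.0.42'
      AND event_time >= current_timestamp() - INTERVAL 1 HOUR
""").show()
```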

Simplify Data Ingest and Egress with the New Python Data Source API

Data engineering teams are frequently tasked with building bespoke ingest and/or egress solutions for myriad custom, proprietary, or industry-specific data sources or sinks. Many teams find this work cumbersome and time-consuming. Recognizing these challenges, Databricks interviewed numerous companies across different industries to better understand their diverse data integration needs. This comprehensive feedback led us to develop the Python Data Source API for Apache Spark™.
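To complement the reader sketch earlier in this listing, here is a hedged writer-side outline using the same Python Data Source API; the REST sink, its "endpoint" option and the commit message class are invented for illustration and contact no real service:

```python
from dataclasses import dataclass
from pyspark.sql import SparkSession
from pyspark.sql.datasource import DataSource, DataSourceWriter, WriterCommitMessage

@dataclass
class SimpleCommitMessage(WriterCommitMessage):
    rows_written: int

class RestSinkDataSource(DataSource):
    @classmethod
    def name(cls):
        return "rest_sink_demo"

    def writer(self, schema, overwrite):
        return RestSinkWriter(self.options)

class RestSinkWriter(DataSourceWriter):
    def __init__(self, options):
        self.endpoint = options.get("endpoint")

    def write(self, iterator):
        # Each task would batch its partition and POST it to self.endpoint.
        count = sum(1 for _ in iterator)
        return SimpleCommitMessage(rows_written=count)

    def commit(self, messages):
        print(f"Committed {sum(m.rows_written for m in messages)} rows")

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(RestSinkDataSource)
spark.range(100).write.format("rest_sink_demo").option("endpoint", "https://example.com/ingest").mode("append").save()
```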

Highways and Hexagons: Processing Large Geospatial Datasets With H3

The problem of matching GPS locations to roads and local government areas (LGAs) involves handling large datasets and a number of geospatial operations. In this deep dive, we will outline the challenges of developing scalable solutions for these tasks. We will discuss our multi-step approach, first focusing on the use of H3 indexing to isolate matches with single candidates, then explaining the use of different geospatial computation techniques to accurately match points with multiple candidates. From a technical perspective, the talk will showcase the use of broadcasting and partitioning techniques and their effect on autoscaling, memory usage and effective data parallelization. This session is for anyone interested in geospatial data, Spark performance optimization and the real-world challenges of large-scale data engineering.
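A minimal sketch of the first step, bucketing points and candidate roads by H3 cell so most matches become a simple equi-join; it assumes the h3-py v4 API, and the coordinates, resolution and cell id are illustrative:

```python
import h3
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

@udf(StringType())
def to_h3(lat: float, lng: float) -> str:
    return h3.latlng_to_cell(lat, lng, 9)   # resolution 9: roughly 0.1 km^2 hexagons

points = spark.createDataFrame([(1, -37.8136, 144.9631)], "point_id long, lat double, lng double")
roads = spark.createDataFrame([("M1", "89be63562c3ffff")], "road_id string, cell string")  # placeholder cell id

# Single-candidate cells resolve directly; multi-candidate cells go to exact geometry checks.
candidates = points.withColumn("cell", to_h3("lat", "lng")).join(roads, "cell", "left")
candidates.show()
```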

Real-Time Mode Technical Deep Dive: How We Built Sub-300 Millisecond Streaming Into Apache Spark™

Real-time mode is a new low-latency execution mode for Apache Spark™ Structured Streaming. It can consistently provide p99 latencies less than 300 milliseconds for a broad set of stateless and stateful streaming queries. Our talk focuses on the technical aspects of making this possible in Spark. We’ll dive into the core architecture that enables these dramatic latency improvements, including a concurrent stage scheduler and a non-blocking shuffle. We’ll explore how we maintained Spark’s fault-tolerance guarantees, and we’ll also share specific optimizations we made to our streaming SQL operators. These architectural improvements have already enabled Databricks customers to build workloads with latencies up to 10x lower than before. Early adopters in our Private Preview have successfully implemented real-time enrichment pipelines and feature engineering for machine learning — use cases that were previously impossible at these latencies.

Saving Millions From Millions: Navigating Towards Cost-Efficiency in Pinterest's Spark Jobs

While Spark offers powerful processing capabilities for massive data volumes, cost efficiency remains a persistent challenge for users operating at large scale. At Pinterest, where we run millions of Spark jobs monthly, maintaining infrastructure cost efficiency is crucial to supporting our rapid business growth. To tackle this challenge, we have developed several strategies that have saved us tens of millions of dollars across numerous job instances. We will share our analytical methodology for identifying performance bottlenecks and the technical solutions we built to overcome various challenges. Our approach includes extracting insights from billions of collected metrics and leveraging remote shuffle services to address shuffle slowness, improve memory utilization and reduce costs while hosting hundreds of millions of pods. The presentation aims to encourage broader discussion of Apache Spark cost efficiency and help the community tackle this common challenge.

The Future of DSv2 in Apache Spark™

DSv2, Spark's next-generation Catalog API, is gaining traction among data source developers. It shifts complexity to Apache Spark™, improves connector reliability and unlocks new functionality such as catalog federation, MERGE operations, storage-partitioned joins, aggregate pushdown, stored procedures and more. This session covers the design of DSv2, its current strengths and gaps, and its evolving roadmap. It's intended for Spark users and developers working with data sources, whether custom-built or off-the-shelf.
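As a user-facing illustration of one capability that DSv2-based connectors can expose, a MERGE statement issued from PySpark against a table backed by such a connector (e.g. Delta Lake or Apache Iceberg); the table names are placeholders, and this is not the connector-developer API itself:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative only: upsert incoming updates into a DSv2-backed table.
spark.sql("""
    MERGE INTO customers AS t
    USING customer_updates AS s
      ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET t.email = s.email
    WHEN NOT MATCHED THEN INSERT *
""")
```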

Data Intelligence for Cybersecurity Forum: Insights From SAP, Anvilogic, Capital One, and Wiz

Join cybersecurity leaders from SAP, Anvilogic, Capital One, Wiz, and Databricks to explore how modern data intelligence is transforming security operations. Discover how SAP adopted a modular, AI-powered detection engineering lifecycle using Anvilogic on Databricks. Learn how Capital One built a detection and correlation engine leveraging Delta Lake, Apache Spark Streaming, and Databricks to process millions of cybersecurity events per second. Finally, see how Wiz and Databricks’ partnership enhances cloud security with seamless threat visibility. Through expert insights and live demos, gain strategies to build scalable, efficient cybersecurity powered by data and AI.

From Code to Insights: Leveraging Advanced Infrastructure and AI Capabilities

In this talk, we will explore how AI and advanced infrastructure are transforming Insulet's development and operations. We'll highlight how our innovations have reduced scrap part costs through manufacturing analytics, showcasing efficiency and cost savings. On the productivity side, Databricks AI solutions not only identify errors but also fix code and assist in writing complex queries, going beyond suggestions to provide actual solutions. On the infrastructure side, integrating Spark with Databricks simplifies setup and reduces costs. Additionally, Databricks Lakeflow Connect enables real-time updates with little coding as we integrate with Salesforce. We'll also discuss real-time processing of patient data, demonstrating how Databricks drives efficiency and productivity. Join us to learn how these innovations enhance efficiency, cost savings and performance.

Kernel, Catalog, Action! Reimagining our Delta-Spark Connector with DSv2

Delta Lake is redesigning its Spark connector through the combination of three key technologies: First, we're updating our Spark APIs to DSv2 to achieve deeper catalog integration and improved integration with the Spark optimizer. Second, we're fully integrating on top of Delta Kernel to take advantage of its simplified abstraction of Delta protocol complexities, accelerating feature adoption and improving maintainability. Third, we are transforming Delta to become a catalog-aware lakehouse format with Catalog Commits, enabling more efficient metadata management, governance and query performance. Join us to explore how we're advancing Delta Lake's architecture, pushing the boundaries of metadata management and creating a more intelligent, performant data lakehouse platform.

Empowering Progress: Building a Personalized Training Goal Ecosystem with Databricks

Tonal is the ultimate strength training system, giving you the expertise of a personal trainer and a full gym in your home. Through user interviews and social media feedback, we identified a consistent challenge: members found it difficult to measure their progress in their fitness journey. To address this, we developed the Training Goal (TG) ecosystem, a four-part solution that introduced new preference options to capture users' fitness aspirations, implemented weekly metrics that accumulate as members complete workouts, defined personalized weekly targets to guide progress, and enhanced workout details to show how each session contributes toward individual goals. We present how we leveraged Spark, MLflow, and Workflows within the Databricks ecosystem to compute TG metrics, manage model development, and orchestrate data pipelines. These tools allowed us to launch the TG system on schedule, supporting scalability, reliability, and a more meaningful, personalized way for members to track their progress.

From Days to Seconds — Reducing Query Times on Large Geospatial Datasets by 99%

The Global Water Security Center translates environmental science into actionable insights for the U.S. Department of Defense. Prior to incorporating Databricks, responding to analysis requests required querying approximately five hundred thousand raster files representing over five hundred billion points. By leveraging lakehouse architecture, Databricks Auto Loader, Spark Streaming, Databricks Spatial SQL, H3 geospatial indexing and Databricks Liquid Clustering, we were able to drastically reduce our “time to analysis” from multiple business days to a matter of seconds. Now, our data scientists execute queries on pre-computed tables in Databricks, resulting in a “time to analysis” that is 99% faster, giving our teams more time for deeper analysis of the data. Additionally, we’ve incorporated Databricks Workflows, Databricks Asset Bundles, Git and GitHub Actions to support CI/CD across workspaces. We completed this work in close partnership with Databricks.
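A hedged, Databricks-specific sketch of the precomputed-table pattern described here, using the built-in H3 SQL functions and liquid clustering; all table, column and resolution choices are placeholders rather than the team's actual schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Precompute H3 cell ids once at ingest time and cluster the table so that
# point-in-cell queries touch only a handful of files.
spark.sql("""
    CREATE TABLE IF NOT EXISTS env_obs_h3 (
        cell BIGINT,
        obs_date DATE,
        variable STRING,
        value DOUBLE
    ) CLUSTER BY (cell, obs_date)
""")

spark.sql("""
    INSERT INTO env_obs_h3
    SELECT h3_longlatash3(lon, lat, 7) AS cell, obs_date, variable, value
    FROM raw_raster_points
""")

# Interactive analysis becomes a narrow lookup on the pre-computed table.
spark.sql("SELECT avg(value) FROM env_obs_h3 WHERE cell = h3_longlatash3(-87.65, 41.85, 7)").show()
```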