talk-data.com talk-data.com

Event

Databricks DATA + AI Summit 2023

2026-01-11 YouTube Visit website ↗

Activities tracked

70

Filtering by: Data Streaming ×

Sessions & talks

Showing 26–50 of 70 · Newest first

Search within this event →
Unlocking the Value of Data Sharing in Financial Services with Lakehouse

Unlocking the Value of Data Sharing in Financial Services with Lakehouse

2023-07-26 Watch
video
Spencer Cook (Databricks)

The emergence of secure data sharing is already having a tremendous economic impact, in large part due to the increasing ease and safety of sharing financial data. McKinsey predicts that the impact of open financial data will be 1-4.5% of GDP globally by 2030. This indicates there is a narrowing window on a massive opportunity for financial institutions and it is critical that they prioritize data sharing. This session will first address the ways in which Delta Sharing and Unity Catalog on a Databricks Lakehouse architecture provides a simple and open framework for building a Secure Data Sharing platform in the financial services industry. Next we will use a Databricks environment to walk through different use cases for open banking data and secure data sharing, demonstrating how they will be implemented using Delta Sharing, Unity Catalog, and other parts of the Lakehouse platform. The use cases will include examples of new product features such as Databricks to Databricks sharing, change data feed and streaming on Delta Sharing, table/column lineage, and the Delta Sharing Excel plugin to demonstrate state of the art sharing capabilities.

In this session, we will discuss secure data sharing on Databricks Lakehouse and will demonstrate architecture and code for common sharing use cases in the finance industry.

Talk by: Spencer Cook

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Leveraging IoT Data at Scale to Mitigate Global Water Risks Using Apache Spark™ Streaming and Delta

Leveraging IoT Data at Scale to Mitigate Global Water Risks Using Apache Spark™ Streaming and Delta

2023-07-26 Watch
video

Every year, billions of dollars are lost due to water risks from storms, floods, and droughts. Water data scarcity and excess are issues that risk models cannot overcome, creating a world of uncertainty. Divirod is building a platform of water data by normalizing diverse data sources of varying velocity into one unified data asset. In addition to publicly available third-party datasets, we are rapidly deploying our own IoT sensors. These sensors ingest signals at a rate of about 100,000 messages per hour into preprocessing, signal-processing, analytics, and postprocessing workloads in one spark-streaming pipeline to enable critical real-time decision-making processes. By leveraging streaming architecture, we were able to reduce end-to-end latency from tens of minutes to just a few seconds.

We are leveraging Delta Lake to provide a single query interface across multiple tables of this continuously changing data. This enables data science and analytics workloads to always use the most current and comprehensive information available. In addition to the obvious schema transformations, we implement data quality metrics and datum conversions to provide a trustworthy unified dataset.

Talk by: Adam Wilson and Heiko Udluft

Here’s more to explore: Big Book of Data Engineering: 2nd Edition: https://dbricks.co/3XpPgNV The Data Team's Guide to the Databricks Lakehouse Platform: https://dbricks.co/46nuDpI

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Sponsored by: Striim | Powering a Delightful Travel Experience with a Real-Time Operational Data Hub

Sponsored by: Striim | Powering a Delightful Travel Experience with a Real-Time Operational Data Hub

2023-07-26 Watch
video

American Airlines champions operational excellence in airline operations to provide the most delightful experience to our customers with on-time flights and meticulously maintained aircraft. To modernize and scale technical operations with real-time, data-driven processes, we delivered a DataHub that connects data from multiple sources and delivers it to analytics engines and systems of engagement in real-time. This enables operational teams to use any kind of aircraft data from almost any source imaginable and turn it into meaningful and actionable insights with speed and ease. This empowers maintenance hubs to choose the best service and determine the most effective ways to utilize resources that can impact maintenance outcomes and costs. The end-product is a smooth and scalable operation that results in a better experience for travelers. In this session, you will learn how we combine an operational data store (MongoDB) and a fully managed streaming engine (Striim) to enable analytics teams using Databricks with real-time operational data.

Talk by: John Kutay and Ganesh Deivarayan

Here’s more to explore: Big Book of Data Engineering: 2nd Edition: https://dbricks.co/3XpPgNV The Data Team's Guide to the Databricks Lakehouse Platform: https://dbricks.co/46nuDpI

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksin

Sponsored by: Toptal | Enable Data Streaming within Multicloud Strategies

Sponsored by: Toptal | Enable Data Streaming within Multicloud Strategies

2023-07-26 Watch
video

Join Toptal as we discuss how we can help organizations handle their data streaming needs in an environment utilizing multiple cloud providers. We will delve into the data scientist and data engineering perspective on this challenge. Embracing an open format, utilizing open source technologies while managing the solution through code are the keys to success.

Talk by: Christina Taylor and Matt Kroon

Here’s more to explore: Big Book of Data Engineering: 2nd Edition: https://dbricks.co/3XpPgNV The Data Team's Guide to the Databricks Lakehouse Platform: https://dbricks.co/46nuDpI

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksin

US Army Corp of Engineers Enhanced Commerce & National Sec Through Data-Driven Geospatial Insight

US Army Corp of Engineers Enhanced Commerce & National Sec Through Data-Driven Geospatial Insight

2023-07-26 Watch
video

The US Army Corps of Engineers (USACE) is responsible for maintaining and improving nearly 12,000 miles of shallow-draft (9'-14') inland and intracoastal waterways, 13,000 miles of deep-draft (14' and greater) coastal channels, and 400 ports, harbors, and turning basins throughout the United States. Because these components of the national waterway network are considered assets to both US commerce and national security, they must be carefully managed to keep marine traffic operating safely and efficiently.

The National DQM Program is tasked with providing USACE a nationally standardized remote monitoring and documentation system across multiple vessel types with timely data access, reporting, dredge certifications, data quality control, and data management. Government systems have often lagged commercial systems in modernization efforts, and the emergence of the cloud and Data Lakehouse Architectures have empowered USACE to successfully move into the modern data era.

This session incorporates aspects of these topics: Data Lakehouse Architecture: Delta Lake, platform security and privacy, serverless, administration, data warehouse, Data Lake, Apache Iceberg, Data Mesh GIS: H3, MOSAIC, spatial analysis data engineering: data pipelines, orchestration, CDC, medallion architecture, Databricks Workflows, data munging, ETL/ELT, lakehouses, data lakes, Parquet, Data Mesh, Apache Spark™ internals. Data Streaming: Apache Spark Structured Streaming, real-time ingestion, real-time ETL, real-time ML, real-time analytics, and real-time applications, Delta Live Tables. ML: PyTorch, TensorFlow, Keras, scikit-learn, Python and R ecosystems data governance: security, compliance, RMF, NIST data sharing: sharing and collaboration, delta sharing, data cleanliness, APIs.

Talk by: Jeff Mroz

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

High Volume Intelligent Streaming with Sub-Minute SLA for Near Real-Time Data Replication

High Volume Intelligent Streaming with Sub-Minute SLA for Near Real-Time Data Replication

2023-07-26 Watch
video

Attend this session and learn about an innovative solution built around Databricks structured streaming and Delta Live Tables (DLT) to replicate thousands of tables from on-premises to cloud-based relational databases. A highly desirable pattern for many enterprises across the industries to replicate on-premises data to cloud-based data lakes and data stores in near real time for consumption.

This powerful architecture can offload legacy platform workloads and accelerate cloud journey. The intelligent cost-efficient solution leverages thread-pools, multi-task jobs, Kafka, Apache Spark™ structured streaming and DLT. This session will go into detail about problems, solutions, lessons-learned and best practices.

Talk by: Suneel Konidala and Murali Madireddi

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Journey to Real-Time ML: A Look at Feature Platforms & Modern RT ML Architectures Using Tecton

Journey to Real-Time ML: A Look at Feature Platforms & Modern RT ML Architectures Using Tecton

2023-07-26 Watch
video

Are you struggling to keep up with the demands of real-time machine learning? Like most organizations building real-time ML, you’re probably looking for a better way to: Manage the lifecycle of ML models and features, Implement batch, streaming, and real-time data pipelines, Generate accurate training datasets and serve models and data online with strict SLAs, supporting millisecond latencies and high query volumes. Look no further. In this session, we will unveil a modern technical architecture that simplifies the process of managing real-time ML models and features.

Using MLflow and Tecton, we’ll show you how to build a robust MLOps platform on Databricks that can easily handle the unique challenges of real-time data processing. Join us to discover how to streamline the lifecycle of ML models and features, implement data pipelines with ease, and generate accurate training datasets with minimal effort. See how to serve models and data online with mission-critical speed and reliability, supporting millisecond latencies and high query volumes.

Take a firsthand look at how FanDuel uses this solution to power their real-time ML applications, from responsible gaming to content recommendations and marketing optimization. See for yourself how this system can be used to define features, train models, process streaming data, and serve both models and features online for real-time inference with a live demo. Join us to learn how to build a modern MLOps platform for your real-time ML use cases.

Talk by: Mike Del Balso and Morgan Hsu

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Apache Spark™ Streaming and Delta Live Tables Accelerates KPMG Clients For Real Time IoT Insights

Apache Spark™ Streaming and Delta Live Tables Accelerates KPMG Clients For Real Time IoT Insights

2023-07-26 Watch
video

Unplanned downtime in manufacturing costs firms up to a trillion dollars annually. Time that materials spend sitting on a production line is lost revenue. Even just 15 hours of downtime a week adds up to over 800 hours of downtime yearly. The use of Internet of Things or IoT devices can cut this time down by providing details of machine metrics. However, IoT predictive maintenance is challenged by the lack of effective, scalable infrastructure and machine learning solutions. IoT data can be the size of multiple terabytes per day and can come in a variety of formats. Furthermore, without any insights and analysis, this data becomes just another table.

The KPMG Databricks IoT Accelerator is a comprehensive solution enabling manufacturing plant operators to have a bird’s eye view of their machines’ health and empowers proactive machine maintenance across their portfolio of IoT devices. The Databricks Accelerator ingests IoT streaming data at scale and implements the Databricks Medallion architecture while leveraging Delta Live Tables to clean and process data. Real time machine learning models are developed from IoT machine measurements and are managed in MLflow. The AI predictions and IoT device readings are compiled in the gold table powering downstream dashboards like Tableau. Dashboards inform machine operators of not only machines’ ailments, but action they can take to mitigate issues before they arise. Operators can see fault history to aid in understanding failure trends, and can filter dashboards by fault type, machine, or specific sensor reading. 

Talk by: MacGregor Winegard

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksin

Sponsored: Matillion | Using Matillion to Boost Productivity w/ Lakehouse and your Full Data Stack

Sponsored: Matillion | Using Matillion to Boost Productivity w/ Lakehouse and your Full Data Stack

2023-07-26 Watch
video
Rick Wear , Sarah Pollitt (Matillion)

In this presentation, Matillion’s Sarah Pollitt, Group Product Manager for ETL, will discuss how you can use Matillion to load data from popular data sources such as Salesforce, SAP, and over a hundred out-of-the-box connectors into your data lakehouse. You can quickly transform this data using powerful tools like Matillion or dbt, or your own custom notebooks, to derive valuable insights. She will also explore how you can run streaming pipelines to ensure real-time data processing, and how you can extract and manage this data using popular governance tools such as Alation or Collibra, ensuring compliance and data quality. Finally, Sarah will showcase how you can seamlessly integrate this data into your analytics tools of choice, such as Thoughtspot, PowerBI, or any other analytics tool that fits your organization's needs.

Talk by: Rick Wear

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

How We Made a Unified Talent Solution Using Databricks Machine Learning, Fine-Tuned LLM & Dolly 2.0

How We Made a Unified Talent Solution Using Databricks Machine Learning, Fine-Tuned LLM & Dolly 2.0

2023-07-26 Watch
video

Using Databricks, we built a “Unified Talent Solution” backed by a robust data and AI engine for analyzing skills of a combined pool of permanent employees, contractors, part-time employees and vendors, inferring skill gaps, future trends and recommended priority areas to bridge talent gaps, which ultimately greatly improved operational efficiency, transparency, commercial model, and talent experience of our client. We leveraged a variety of ML algorithms such as boosting, neural networks and NLP transformers to provide better AI-driven insights.

One inevitable part of developing these models within a typical DS workflow is iteration. Databricks' end-to-end ML/DS workflow service, MLflow, helped streamline this process by organizing them into experiments that tracked the data used for training/testing, model artifacts, lineage and the corresponding results/metrics. For checking the health of our models using drift detection, bias and explainability techniques, MLflow's deploying, and monitoring services were leveraged extensively.

Our solution built on Databricks platform, simplified ML by defining a data-centric workflow that unified best practices from DevOps, DataOps, and ModelOps. Databricks Feature Store allowed us to productionize our models and features jointly. Insights were done with visually appealing charts and graphs using PowerBI, plotly, matplotlib, that answer business questions most relevant to clients. We built our own advanced custom analytics platform on top of delta lake as Delta’s ACID guarantees allows us to build a real-time reporting app that displays consistent and reliable data - React (for front-end), Structured Streaming for ingesting data from Delta table with live query analytics on real time data ML predictions based on analytics data.

Talk by: Nitu Nivedita

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Sponsored by: Avanade | Enabling Real-Time Analytics with Structured Streaming and Delta Live Tables

Sponsored by: Avanade | Enabling Real-Time Analytics with Structured Streaming and Delta Live Tables

2023-07-26 Watch
video

Join the panel to hear how Avanade is helping clients enable real-time analytics and tackle the people and process problems that accompany technology, powered by Azure Databricks.

Talk by: Thomas Kim, Dael Williamson, Zoé Durand

Here’s more to explore: Data, Analytics, and AI Governance: https://dbricks.co/44gu3YU

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Structured Streaming: Demystifying Arbitrary Stateful Operations

Structured Streaming: Demystifying Arbitrary Stateful Operations

2023-07-26 Watch
video
Angela Chu (Databricks)

Let’s face it -- data is messy. And your company’s business requirements? Even messier. You’re staring at your screen, knowing there is a tool that will let you give your business partners the information they need as quickly as they need it. There’s even a Python version of it now. But…it looks kind of scary. You’ve never used it before, and you don’t know where to start. Yes, we’re talking about the dreaded flatMapGroupsWithState. But fear not - we’ve got you covered.

In this session, we’ll take a real-word use case and use it to show you how to break down flatMapGroupsWithState into its basic building blocks. We’ll explain each piece in both Scala and the newly-released Python, and at the end we’ll illustrate how it all comes together to enable the implementation of arbitrary stateful operations with Spark Structured Streaming.

Talk by: Angela Chu

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Taking Control of Streaming Healthcare Data

Taking Control of Streaming Healthcare Data

2023-07-25 Watch
video

Chesapeake Regional Information System for our Patients (CRISP), a nonprofit healthcare information exchange (HIE), initially partnered with Slalom to build a Databricks data lakehouse architecture in response to the analytics demands of the COVID-19 pandemic, since then they have expanded the platform to additional use cases. Recently they have worked together to engineer streaming data pipelines to process healthcare messages, such as HL7, to help CRISP become vendor independent.

This session will focus on the improvements CRISP has made to their data lakehouse platform to support streaming use cases and the impact these changes have had for the organization. We will touch on using Databricks Auto Loader to efficiently ingest incoming files, ensuring data quality with Delta Live Tables, and sharing data internally with a SQL warehouse, as well as some of the work CRISP has done to parse and standardize HL7 messages from hundreds of sources. These efforts have allowed CRISP to stream over 4 million messages daily in near real-time with the scalability it needs to continue to onboard new healthcare providers so it can continue to facilitate care and improve health outcomes.

Talk by: Andy Hanks and Chris Mantz

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Build Your Data Lakehouse with a Modern Data Stack on Databricks

Build Your Data Lakehouse with a Modern Data Stack on Databricks

2023-07-25 Watch
video
Pearl Ubaru (Databricks) , Ari Kaplan (Databricks)

Are you looking for an introduction to the Lakehouse and what the related technology is all about? This session is for you. This session explains the value that lakehouses bring to the table using examples of companies that are actually modernizing their data, showing demos throughout. The data lakehouse is the future for modern data teams that want to simplify data workloads, ease collaboration, and maintain the flexibility and openness to stay agile as a company scales.

Come to this session and learn about the full stack, including data engineering, data warehousing in a lakehouse, data streaming, governance, and data science and AI. Learn how you can create modern data solutions of your own.

Talk by: Ari Kaplan and Pearl Ubaru

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Top Mistakes to Avoid in Streaming Applications

Top Mistakes to Avoid in Streaming Applications

2023-07-25 Watch
video

Are you a data engineer seeking to enhance the performance of your streaming applications? Join our session where we will share valuable insights and best practices gained from handling diverse customer streaming use cases using Apache Spark™ Structured Streaming.

In this session, we will delve into the common pitfalls that can hinder your streaming workflows. Learn practical tips and techniques to overcome these challenges during different stages of application development. By avoiding these errors, you can unlock faster performance, improved data reliability, and smoother data processing.

Don't miss out on this opportunity to level up your streaming skills and excel in your data engineering journey. Join us to gain valuable knowledge and practical techniques that will empower you to optimize your streaming applications and drive exceptional results.

Talk by: Vikas Reddy Aravabhumi

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Introduction to Data Engineering on the Lakehouse

Introduction to Data Engineering on the Lakehouse

2023-07-25 Watch
video

Data engineering is a requirement for any data, analytics or AI workload. With the increased complexity of data pipelines, the need to handle real-time streaming data and the challenges of orchestrating reliable pipelines, data engineers require the best tools to help them achieve their goals. The Databricks Lakehouse Platform offers a unified platform to ingest, transform and orchestrate data and simplifies the task of building reliable ETL pipelines.

This session will provide an introductory overview of the end-to-end data engineering capabilities of the platform, including Delta Live Tables and Databricks Workflows. We’ll see how these capabilities come together to provide a complete data engineering solution and how they are used in the real world by organizations leveraging the lakehouse turning raw data into insights.

Talk by: Jibreal Hamenoo and Ori Zohar

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Introduction to Data Streaming on the Lakehouse

Introduction to Data Streaming on the Lakehouse

2023-07-25 Watch
video

Streaming is the future of all data pipelines and applications. It enables businesses to make data-driven decisions sooner and react faster, develop data-driven applications considered previously impossible, and deliver new and differentiated experiences to customers. However, many organizations have not realized the promise of streaming to its full potential because it requires them to completely redevelop their data pipelines and applications on new, complex, proprietary, and disjointed technology stacks.

The Databricks Lakehouse Platform is a simple, unified, and open platform that supports all streaming workloads ranging from ingestion, ETL to event processing, event-driven application, and ML inference. In this session, we will discuss the streaming capabilities of the Databricks Lakehouse Platform and demonstrate how easy it is to build end-to-end, scalable streaming pipelines and applications, to fulfill the promise of streaming for your business.

Talk by: Zoe Durand and Yue Zhang

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Building a Lakehouse for Data Science at DoorDash

Building a Lakehouse for Data Science at DoorDash

2022-07-19 Watch
video

DoorDash was using a data warehouse but found that they needed more data transparency, lower costs, and the ability to handle streaming data as well as batch data. With an engineering team rooted in big data backgrounds at Uber and LinkedIn, they moved to a Lakehouse architecture intuitively, without knowing about the term. In this session, learn more about how they arrived at that architecture, the process of making the move, and the results they have seen. While addressing both data analysts and data scientists from their lakehouse, this session will focus on their machine learning operations, and how their efficiencies are enabling them to tackle more advanced use cases such as NLP and image classification.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Connecting the Dots with DataHub: Lakehouse and Beyond

Connecting the Dots with DataHub: Lakehouse and Beyond

2022-07-19 Watch
video

You’ve successfully built your data lakehouse. Congratulations! But what happens when your operational data stores, streaming systems like Apache Kafka or data ingestion systems produce bad data into the lakehouse? Can you be proactive when it comes to preventing bad data from affecting your business? How can you take advantage of automation to ensure that raw data assets become well maintained data products (clear ownership, documentation and sensitivity classification) without requiring people to do redundant work across operational, ingestion and lakehouse systems? How do you get live and historical visibility into your entire data ecosystem (schemas, pipelines, data lineage, models, features and dashboards) within and across your production services, ingestion pipelines and data lakehouse? Data engineers struggle with data quality and data governance issues constantly interrupting their day and limiting their upside impact on the business.

In this talk, we will share how data engineers from our 3K+ strong DataHub community are using DataHub to track lineage, understand data quality, and prevent failures from impacting their important dashboards, ML models and features. The talk will include details of how DataHub extracts lineage automatically from Spark, schema and statistics from Delta Lake and shift-left strategies for developer-led governance.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Serving Near Real-Time Features at Scale

Serving Near Real-Time Features at Scale

2022-07-19 Watch
video

This presentation will first introduce the use case, which generates the price adjustments based on the network effect, and the corresponding model relies on the 108 near real-time features computed by Flink pipelines with the raw demand and supply events. Here is the simplified computation logic:

-The pipelines need to process the raw real-time events at the rate of 300k/s including both demand and supply -Each event needs to be computed on the geospatial, temporal and other dimensions -Each event contributes to the computation on the original hexagon and the 1K+ neighbours due to the fan-out effect of Kring smooth -Each event contributions to the aggregation on multiple window sizes up to 32 minutes, sliding by 1 minute, or 63 windows in total

Next the presentation will briefly go through the DAG of the Flink pipeline before optimization and the issues we faced: the pipeline could not run stably due to OOM and backpressure. The presentation will discuss how to optimize a streaming pipeline with the generic performance tuning framework, which focuses on three areas: Network, CPU and Memory, and five domains: Parallelism, Partition, Remote Call, Algorithm and Garbage Collector. The presentation will also show some example techniques being applied onto the pipelines by following the performance tuning framework.

Then the presentation will discuss one particular optimization technique: Customized Sliding Window.

Powering machine learning models with near real-time features can be quite challenging, due to computation logic complexity, write throughput, serving SLA, etc. In this talk, we have introduced some of the problems that we faced and our solutions to them, in the hope of aiding our peers in similar use cases.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Improving Apache Spark Application Processing Time by Configurations, Code Optimizations, etc.

Improving Apache Spark Application Processing Time by Configurations, Code Optimizations, etc.

2022-07-19 Watch
video

In this session, we'll go over several use-cases and describe the process of improving our spark structured streaming application micro-batch time from ~55 to ~30 seconds in several steps.

Our app is processing ~ 700 MB/s of compressed data, it has very strict KPIs, and it is using several technologies and frameworks such as: Spark 3.1, Kafka, Azure Blob Storage, AKS and Java 11.

We'll share our work and experience in those fields, and go over a few tips to create better Spark structured streaming applications.

The main areas that will be discussed are: Spark Configuration changes, code optimizations and the implementation of the Spark custom data source.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Near Real-Time Analytics with Event Streaming, Live Tables, and Delta Sharing

Near Real-Time Analytics with Event Streaming, Live Tables, and Delta Sharing

2022-07-19 Watch
video

Microservices is an increasingly popular architecture much loved by application teams, for it allows services to be developed and scaled independently. Data teams, though, often need a centralized repository where all data from different services come together to join and aggregate. The data platform can serve as a single source of company facts, enable near real time analytics, and secure sharing of massive data sets across clouds.

A viable microservices ingestion pattern is Change Data Capture, using AWS Database Migration Services or Debezium. CDC proves to be a scalable solution ideal for stable platforms, but it has several challenges for evolving services: Frequent schema changes, complex, unsupported DDL during migration, and automated deployments are but a few. An event streaming architecture can address these challenges.

Confluent, for example, provides a schema registry service where all services can register their event schemas. Schema registration helps with verifying that the events are being published based on the agreed contracts between data producers and consumers. It also provides a separation between internal service logic and the data consumed downstream. The services write their events to Kafka using the registered schemas with a specific topic based on the type of the event.

Data teams can leverage Spark jobs to ingest Kafka topics into Bronze tables in the Delta Lake. On ingestion, the registered schema from schema registry is used to validate the schema based on the provided version. A merge operation is sometimes called to translate events into final states of the records per business requirements.

Data teams can take advantage of Delta Live Tables on streaming datasets to produce Silver and Gold tables in near real time. Each input data source also has a set of expectations to ensure data quality and business rules. The pipeline allows Engineering and Analytics to collaborate by mixing Python and SQL. The refined data sets are then fed into Auto ML for discovery and baseline modeling.

To expose Gold tables to more consumers, especially non spark users across clouds, data teams can implement Delta Sharing. Recipients can accesses Silver tables from a different cloud and build their own analytics data sets. Analytics teams can also access Gold tables via pandas Delta Sharing client and BI tools.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Optimizing Speed and Scale of User-Facing Analytics Using Apache Kafka and Pinot

Optimizing Speed and Scale of User-Facing Analytics Using Apache Kafka and Pinot

2022-07-19 Watch
video
Karin Wolok (StarTree) , Neha Power (StarTree)

Apache Kafka is the de facto standard for real-time event streaming, but what do you do if you want to perform user-facing, ad-hoc, real-time analytics too? That's where Apache Pinot comes in.

Apache Pinot is a realtime distributed OLAP datastore, which is used to deliver scalable real time analytics with low latency. It can ingest data from batch data sources (S3, HDFS, Azure Data Lake, Google Cloud Storage) as well as streaming sources such as Kafka. Pinot is used extensively at LinkedIn and Uber to power many analytical applications such as Who Viewed My Profile, Ad Analytics, Talent Analytics, Uber Eats and many more serving 100k+ queries per second while ingesting 1Million+ events per second.

Apache Kafka's highly performant, distributed, fault-tolerant, real-time publish-subscribe messaging platform powers big data solutions at Airbnb, LinkedIn, MailChimp, Netflix, the New York Times, Oracle, PayPal, Pinterest, Spotify, Twitter, Uber, Wikimedia Foundation, and countless other businesses.

Come hear from Neha Power, Founding Engineer at a StarTree and PMC and committer of Apache Pinot, and Karin Wolok, Head of Developer Community at StarTree, on an introduction to both systems and a view of how they work together.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

An Advanced S3 Connector for Spark to Hunt for Cyber Attacks

An Advanced S3 Connector for Spark to Hunt for Cyber Attacks

2022-07-19 Watch
video

Working with S3 is different from doing so with HDFS: The architecture of the Object store makes the standard Spark file connector inefficient to work with S3.

There is a way to tackle this problem with a message queue for listening to changes in a bucket. What if an additional message queue is not an option and you need to use Spark-streaming? You can use a standard file connector, but you quickly face performance degradation with a number of files in the source path.

We have seen this happen at Hunters, a security operations platform that works with a wide range of data sources.

We want to share a description of the problem and the solution we will open-source. The audience will learn how to configure it and make the best use of it. We will also discuss how to use metadata to boost the performance of discovering new files in the stream and show the use case of utilizing time metadata of CloudTrail to efficiently collect logs for hunting cyber attacks.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Backfill Streaming Data Pipelines in Kappa Architecture

Backfill Streaming Data Pipelines in Kappa Architecture

2022-07-19 Watch
video

Streaming data pipelines can fail due to various reasons. Since the source data, such as Kafka topics, often have limited retention, prolonged job failures can lead to data loss. Thus, streaming jobs need to be backfillable at all times to prevent data loss in case of failures. One solution is to increase the source's retention so that backfilling is simply replaying source streams, but extending Kafka retention is very costly for Netflix's data sizes. Another solution is to utilize source data stored in DWH, commonly known as the Lambda architecture. However, this method introduces significant code duplication, as it requires engineers to maintain a separate equivalent batch job. At Netflix, we have created the Iceberg Source Connector to provide backfilling capabilities to Flink streaming applications. It allows Flink to stream data stored in Apache Iceberg while mirroring Kafka's ordering semantics, enabling us to backfill large-scale stateful Flink pipelines at low retention cost.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/