talk-data.com

Topic: Data Streaming

Tags: realtime, event_processing, data_flow

739 activities tagged · Newest first

Activity Trend: 70 peak/qtr, 2020-Q1 to 2026-Q1

Summary

Data systems are inherently complex and often require integration of multiple technologies. Orchestrators are centralized utilities that control the execution and sequencing of interdependent operations. This offers a single location for managing visibility and error handling so that data platform engineers can manage complexity. In this episode Nick Schrock, creator of Dagster, shares his perspective on the state of data orchestration technology and its application to help inform its implementation in your environment.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

  • Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
  • This episode is brought to you by Datafold, a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
  • You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
  • Your host is Tobias Macey and today I'm welcoming back Nick Schrock to talk about the state of the ecosystem for data orchestration.

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by defining what data orchestration is and how it differs from other types of orchestration systems? (e.g. container orchestration, generalized workflow orchestration, etc.)
  • What are the misconceptions about the applications of/need for/cost to implement data orchestration?
      • How do those challenges of customer education change across roles/personas?
  • Because of the multi-faceted nature of data in an organization, how does that influence the capabilities and interfaces that are needed in an orchestration engine?
  • You have been working on Dagster for five years now. How have the requirements/adoption/application for orchestrators changed in that time?
  • One of the challenges for any orchestration engine is to balance the need for robust and extensible core capabilities with a rich suite of integrations to the broader data ecosystem. What are the factors that you have seen make the most influence in driving adoption of a given engine?
  • What are the most interesting, innovative, or unexpected ways that you have seen data orchestration implemented and/or used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working o

Summary

Cloud data warehouses and the introduction of the ELT paradigm have led to the creation of multiple options for flexible data integration, with a roughly equal distribution of commercial and open source options. The challenge is that most of those options are complex to operate and exist in their own silo. The dlt project was created to eliminate overhead and bring data integration into your full control as a library component of your overall data system. In this episode Adrian Brudaru explains how it works, the benefits that it provides over other data integration solutions, and how you can start building pipelines today.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey and today I'm interviewing Adrian Brudaru about dlt, an open source Python library for data loading.

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what dlt is and the story behind it?
      • What is the problem you want to solve with dlt?
      • Who is the target audience?
  • The obvious comparison is with systems like Singer/Meltano/Airbyte in the open source space, or Fivetran/Matillion/etc. in the commercial space. What are the complexities or limitations of those tools that leave an opening for dlt?
  • Can you describe how dlt is implemented?
  • What are the benefits of building it in Python?
  • How have the design and goals of the project changed since you first started working on it?
  • How does that language choice influence the performance and scaling characteristics?
  • What problems do users solve with dlt?
  • What are the interfaces available for extending/customizing/integrating with dlt?
  • Can you talk through the process of adding a new source/destination?
  • What is the workflow for someone building a pipeline with dlt?
  • How does the experience scale when supporting multiple connections?
  • Given the limited scope of extract and load, and the composable design of dlt it seems like a purpose built companion to dbt (down to th

Summary

Data persistence is one of the most challenging aspects of computer systems. In the era of the cloud most developers rely on hosted services to manage their databases, but what if you are a cloud service? In this episode Vignesh Ravichandran explains how his team at Cloudflare provides PostgreSQL as a service to their developers for low latency and high uptime services at global scale. This is an interesting and insightful look at pragmatic engineering for reliability and scale.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey and today I'm interviewing Vignesh Ravichandran about building an internal database as a service platform at Cloudflare.

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing the different database workloads that you have at Cloudflare?
      • What are the different methods that you have used for managing database instances?
  • What are the requirements and constraints that you had to account for in designing your current system?
  • Why Postgres?
      • optimizations for Postgres
      • simplification from not supporting multiple engines
  • limitations in Postgres that make multi-tenancy challenging
  • scale of operation (data volume, request rate)
  • What are the most interesting, innovative, or unexpected ways that you have seen your DBaaS used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on your internal database platform?
  • When is an internal database as a service the wrong choice?
  • What do you have planned for the future of Postgres hosting at Cloudflare?

Contact Info

LinkedIn Website

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Mac

Summary

Generative AI has unlocked a massive opportunity for content creation. There is also an unfulfilled need for experts to be able to share their knowledge and build communities. Illumidesk was built to take advantage of this intersection. In this episode Greg Werner explains how they are using generative AI as an assistive tool for creating educational material, as well as building a data driven experience for learners.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey and today I'm interviewing Greg Werner about building IllumiDesk, a data-driven and AI powered online learning platform.

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Illumidesk is and the story behind it?
  • What are the challenges that educators and content creators face in developing and maintaining digital course materials for their target audiences?
  • How are you leaning on data integrations and AI to reduce the initial time investment required to deliver courseware?
  • What are the opportunities for collecting and collating learner interactions with the course materials to provide feedback to the instructors?
  • What are some of the ways that you are incorporating pedagogical strategies into the measurement and evaluation methods that you use for reports?
  • What are the different categories of insights that you need to provide across the different stakeholders/personas who are interacting with the platform and learning content?
  • Can you describe how you have architected the Illumidesk platform?
  • How have the design and goals shifted since you first began working on it?
  • What are the strategies that you have used to allow for evolution and adaptation of the system in order to keep pace with the ecosystem of generative AI capabilities?
  • What are the failure modes of the content generation that you need to account for?
  • What are the most interesting, innovative, or unexpected ways that you have seen Illumidesk us

Apache Iceberg Merge-On-Read: Streaming CDC - Victoria Bukta, Shopify | Crunch Conference 2022

This talk was recorded at Crunch Conference 2022. Victoria Bukta from Shopify spoke about "Apache Iceberg Merge-On-Read: Streaming CDC".

The event was organized by Crafthub.

You can watch the rest of the conference talks on our channel.

If you are interested in more speakers, tickets and details of the conference, check out our website: https://crunchconf.com/ If you are interested in more events from our company: https://crafthub.events/

Improving Hospital Operations with Streaming Data and Real Time AI/ML

Over the past two years, Providence has developed a robust streaming data platform (SDP) leveraging Databricks in Azure. The SDP enables us to ingest and process real-time data reflecting clinical operations across our 52 hospitals and roughly 1000 ambulatory clinics. The HL7 messages generated by Epic are parsed using Databricks in our secure cloud environment and used to generate an up-to-the-minute picture of exactly what is happening at the point of care.

We are already leveraging this information to minimize hospital overcrowding and have been actively integrating AI/ML to accurately forecast future conditions (e.g., arrivals, length of stay, acuity, and discharge requirements). This allows us both to improve resource utilization (e.g., nurse staffing levels) and to optimize patient throughput. The result is both improved patient care and operational efficiency.

In this session, we will share how these outcomes are only possible with the power and elegance afforded by our investments in Azure, Databricks, and increasingly Lakehouse. We will demonstrate Providence's blueprint for enabling real-time analytics which can be generalized to other healthcare providers.

Talk by: Lindsay Mico and Deylo Woo

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Jet Streaming Data & Predictive Analytics: How Collins Aerospace Keeps Aircraft Flying

Most have experienced the frustration and disappointment of a flight delay or cancelation due to aircraft issues. The Collins Aerospace business unit at Raytheon Technologies is committed to redefining aerospace by using data to deliver a more reliable, sustainable, efficient, and enjoyable aviation industry.

Ascentia is one such product, focused on helping airlines make smarter and more sustainable decisions by anticipating aircraft maintenance issues in advance, leading to more reliable flight schedules and fewer delays. Over the past five years, a variety of products from the Databricks technology suite have been employed to achieve this. Leveraging cloud infrastructure and harnessing the Databricks Lakehouse, Apache Spark™ development, and Databricks’ dynamic platform, Collins has been able to accelerate development and deployment of predictive health monitoring (PHM) analytics to generate Ascentia’s aircraft maintenance recommendations.

Labcorp Data Platform Journey: From Selection to Go-Live in Six Months

Join this session to learn about the Labcorp data platform transformation from on-premises Hadoop to AWS Databricks Lakehouse. We will share best practices and lessons learned from cloud-native data platform selection, implementation, and migration from Hadoop (within six months) with Unity Catalog.

We will share the steps taken to retire several legacy on-premises technologies and leverage Databricks-native features such as Spark streaming, workflows, job pools, cluster policies, and Spark JDBC within the Databricks platform. We will also cover lessons learned in implementing Unity Catalog and building a security and governance model that scales across applications. Demos will walk you through the batch frameworks, streaming frameworks, and data comparison tools used across several applications to improve data quality and speed of delivery.

Discover how we have improved operational efficiency and resiliency, reduced TCO, and scaled the building of workspaces and associated cloud infrastructure using the Terraform provider.

Talk by: Mohan Kolli and Sreekanth Ratakonda

Nebula: The Journey of Scaling Instacart’s Data Pipelines with Apache Spark™ and Lakehouse

Instacart has gone through immense growth during the pandemic, and the trend continues. Instacart ads is no exception in this growth story. We have launched many new product lines, including display and video ads covering the full advertising funnel, to address the increasing demand of our retail partners. We have built advanced models to auto-suggest optimal bidding to increase the ROI for our CPG partners. Advertisers’ trust is our utmost priority, hence our quest to build a top-class ads measurement platform.

Ads data processing requires complex data verifications to update ads serving stats. In ETL pipelines, these were implemented through files containing thousands of lines of raw SQL, which were hard to scale, test, and iterate upon. Our data engineers used to spend hours testing small changes due to a lack of local testing mechanisms. These pain points stressed our need for better tools. After some research, we chose Apache Spark™ as our preferred tool to rebuild ETLs, and the Databricks platform made this move easier. In this session, we'll share our journey to move our pipelines to Spark and Delta Lake on Databricks. With Spark, Scala, and Delta we solved many problems which were slowing the team’s productivity. Some key areas that will be covered include:

  • Modular and composable code
  • Unit testing framework
  • Incremental event processing with Spark Structured Streaming
  • Granular resource tuning for better performance and cost efficiency

Other than the domain business logic, the problems discussed here are quite common when processing data at scale. We hope that sharing our learnings will benefit others who are going through similar growth challenges or migrating to Lakehouse.

Talk by: Devlina Das and Arthur Li

Practical Pipelines: A Houseplant Alerting System with ksqlDB

Taking care of houseplants can be difficult; in many cases, over-watering and under-watering can have the same symptoms. Remove the guesswork involved in caring for your houseplants while also gaining valuable experience in building a practical, event-driven pipeline in your own home! This session explores the process of building a houseplant monitoring and alerting system using a Raspberry Pi and Apache Kafka. Moisture and temperature readings are captured from sensors in the soil and streamed into Kafka. From there, we use stream processing to transform the data, create a summary view of the current state, and drive real-time push alerts through Telegram.

In this session, we will talk about how to ingest the data, followed by the tools, including ksqlDB and Kafka Connect, that help transform the raw data into useful information. Finally, you'll be shown how to use Kafka Producers and Consumers to make the entire application more interactive. By the end of this session, you’ll have everything you need to start building practical streaming pipelines in your own home. Roll up your sleeves – let’s get our hands dirty!
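
As a rough illustration of the transform, summarize, and alert steps described above, the following Python sketch keeps the latest reading per plant and flags any plant whose moisture falls below a threshold. It is plain Python rather than ksqlDB or Kafka code; the `Reading` fields, the plant names, and the 30 percent threshold are all assumptions made for the example, not details from the talk.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Reading:
    """One sensor reading (hypothetical schema, not taken from the talk)."""
    plant_id: str
    moisture_pct: float
    temperature_c: float


def latest_by_plant(readings: Iterable[Reading]) -> Dict[str, Reading]:
    """Build a summary view: the most recent reading seen for each plant."""
    state: Dict[str, Reading] = {}
    for reading in readings:  # readings are assumed to arrive in time order
        state[reading.plant_id] = reading
    return state


def dry_plant_alerts(state: Dict[str, Reading], min_moisture_pct: float = 30.0) -> List[str]:
    """Return one alert message per plant whose latest moisture is too low."""
    return [
        f"{plant_id} needs water (moisture {reading.moisture_pct:.0f}%)"
        for plant_id, reading in sorted(state.items())
        if reading.moisture_pct < min_moisture_pct
    ]


# A tiny in-memory stand-in for the sensor stream (made-up values).
stream = [
    Reading("fern", 45.0, 21.5),
    Reading("cactus", 12.0, 24.0),
    Reading("fern", 28.0, 21.0),
]
alerts = dry_plant_alerts(latest_by_plant(stream))
```

In a real pipeline the summary view would typically be maintained by the stream processor and the alert list pushed to a messaging service; here both are ordinary function calls so the logic can be read in isolation.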

Talk by: Danica Fine

Here’s more to explore: Big Book of Data Engineering: 2nd Edition: https://dbricks.co/3XpPgNV The Data Team's Guide to the Databricks Lakehouse Platform: https://dbricks.co/46nuDpI

Real-Time ML in Marketplace at Lyft

Lyft is a ride-sharing company that operates a two-sided marketplace, balancing supply and demand using various levers (passenger pricing, driver incentives, etc.) to maintain an efficient system. Lyft has built a real-time optimization platform that helps to build products faster. This complex system makes real-time decisions using various data sources, machine learning models, and a streaming infrastructure designed for low latency, reliability, and scalability. This infrastructure consumes a massive number of events from different sources to make real-time product decisions.

In this session, we will discuss how Lyft organically evolved and scaled the streaming platform that provides a consistent view of the marketplace, helping individual teams run their optimizations independently. The platform offers online and offline feature access that helps teams backtest their models. It provides various other powerful capabilities such as replaying production ML features in PyNotebook, feature validation, near real-time model training, executing multiple layers of models in a DAG, etc. The speaker will elaborate on what helped him scale the systems to process millions of events per minute and power T0 products with tighter latency SLAs.

The Future is Open: Data Streaming in an Omni-Cloud Reality

This session begins with data warehouse trivia and lessons learned from production implementations of multicloud data architecture. You will learn to design future-proof low latency data systems that focus on openness and interoperability. You will also gain a gentle introduction to Cloud FinOps principles that can help your organization reduce compute spend and increase efficiency. 

Most enterprises today are multicloud. While an assortment of low-code connectors boasts the ability to make data available for analytics in real time, they pose long-lasting challenges:

  • Inefficient EDW targets
  • Inability to evolve schema
  • Forbiddingly expensive data exports due to cloud and vendor lock-in

The alternative is an open data lake that unifies batch and streaming workloads. Bronze landing zones in open format eliminate the data extraction costs required by proprietary EDWs. Apache Spark™ Structured Streaming provides a unified ingestion interface. Streaming triggers allow us to switch back and forth between batch and stream with one-line code changes. Streaming aggregation enables us to incrementally compute on data that arrive close together in time.
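
To make the idea of incremental aggregation concrete, here is a minimal plain-Python sketch (not the Spark Structured Streaming API) that folds events into per-window counts. The event shape, the 60-second tumbling window, and the micro-batch size of two are assumptions chosen for illustration. The same update function is run once over all events and then over small micro-batches, which mirrors the claim that batch and streaming can share one code path.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, List, Tuple

Event = Tuple[int, str]      # (event time in seconds, key); hypothetical shape
WindowKey = Tuple[int, str]  # (window start in seconds, key)


def update_counts(
    counts: DefaultDict[WindowKey, int],
    batch: Iterable[Event],
    window_seconds: int = 60,
) -> None:
    """Fold one batch of events into running per-window counts, in place."""
    for event_time, key in batch:
        # Align the event to the start of its tumbling window.
        window_start = event_time - (event_time % window_seconds)
        counts[(window_start, key)] += 1


events: List[Event] = [(5, "a"), (42, "a"), (61, "b"), (75, "a"), (119, "b")]

# "Batch" style: process everything in one pass.
batch_counts: DefaultDict[WindowKey, int] = defaultdict(int)
update_counts(batch_counts, events)

# "Streaming" style: process the same events as micro-batches of two.
stream_counts: DefaultDict[WindowKey, int] = defaultdict(int)
for start in range(0, len(events), 2):
    update_counts(stream_counts, events[start:start + 2])
```

Both runs end with the same counts because the aggregation state is updated incrementally rather than recomputed; a production engine additionally has to handle late data and state cleanup, which this sketch ignores.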

Specific examples show how to use Auto Loader to discover newly arrived data and ensure exactly-once, incremental processing; how DLT can be configured effectively to further simplify streaming jobs and accelerate the development cycle; and how to apply SWE best practices to Workflows and integrate with popular Git providers, using either the Databricks Project or the Databricks Terraform provider.

Talk by: Christina Taylor

Optimizing Batch and Streaming Aggregations

A client recently asked us to optimize their batch and streaming workloads. These turned out to be aggregations using the DataFrame.groupby operation with a custom Scala UDAF over a data stream from Kafka. A single simple-looking request turned into a months-long hunt for a more performant query execution plan than ObjectHashAggregateExec, which kept falling back to sort-based aggregation (i.e., the worst possible aggregation runtime performance). It quickly taught us that an aggregation using a custom Scala UDAF cannot be planned as anything other than ObjectHashAggregateExec, but at least tasks don't always have to fall back. And that's just batch workloads. When you throw in streaming semantics and think of the different output modes, windowing, and streaming watermarks, optimizing aggregation can take a long time to get right.

Talk by: Jacek Laskowski

Unlocking Near Real Time Data Replication with CDC, Apache Spark™ Streaming, and Delta Lake

Tune into DoorDash's journey to migrate from a flaky ETL system with 24-hour data delays, to standardizing a CDC streaming pattern across more than 150 databases to produce near real-time data in a scalable, configurable, and reliable manner.

During this journey, understand how we use Delta Lake to build a self-serve, read-optimized data lake with data latencies of 15, whilst reducing operational overhead. Furthermore, understand how certain tradeoffs, like conceding to a non-real-time system, allow for multiple optimizations but still permit OLTP query use cases, and the benefits this provides.

Talk by: Ivan Peng and Phani Nalluri

Deploying the Lakehouse to Improve the Viewer Experience on Discovery+

In this session, we will discuss how real-time data streaming can be used to gain insights into user behavior and preferences, and how this data is being used to provide personalized content and recommendations on Discovery+. We will examine techniques that enable faster decision making and insights on accurate real-time data, including data masking and data validation. To enable a wide set of data consumers, from data engineers to data scientists to data analysts, we will discuss how Unity Catalog is leveraged for secure data access and sharing while still allowing teams flexibility.

Operating at this scale requires examining the value created by the data being processed and optimizing along the way, and we will share some of our successes in this area.

Talk by: Deepa Paranjpe

How Coinbase Built and Optimized SOON, a Streaming Ingestion Framework

Data with low latency is important for real-time incident analysis and metrics. Though we have up-to-date data in OLTP databases, they cannot support those scenarios. Data need to be replicated to a data warehouse to serve queries using GroupBy and Join across multiple tables from different systems. At Coinbase, we designed SOON (Spark cOntinuOus iNgestion) based on Kafka, Kafka Connect, and Apache Spark™ as an incremental table replication solution to replicate tables of any size from any database to Delta Lake in a timely manner. It also supports Kafka events ingestion naturally.

SOON incrementally ingests Kafka events as appends, updates, and deletes to an existing table on Delta Lake. The events are grouped into two categories: CDC (change data capture) events generated by Kafka Connect source connectors, and non-CDC events by the frontend or backend services. Both types can be appended or merged into the Delta Lake. Non-CDC events can be in any format, but CDC events must be in the standard SOON CDC schema. We implemented Kafka Connect SMTs to transform raw CDC events into this standardized format. SOON unifies all streaming ingestion scenarios such that users only need to learn one onboarding experience and the team only needs to maintain one framework.
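To make the append/update/delete semantics above concrete, here is a minimal sketch in plain Python. It is not SOON's implementation and does not use Spark or Delta Lake: the event shape (`op`, `key`, `ts`, `row`) and the function name `apply_cdc_batch` are assumptions made for illustration, and the standard SOON CDC schema is not described in the abstract.

```python
from typing import Any, Dict, Iterable


def apply_cdc_batch(table: Dict[Any, dict], events: Iterable[dict]) -> Dict[Any, dict]:
    """Merge a batch of CDC events into a keyed table.

    Hypothetical event shape (an assumption, not the SOON CDC schema):
    {"op": "insert" | "update" | "delete", "key": ..., "ts": int, "row": {...}}
    """
    # Step 1: keep only the most recent event per key within the batch.
    latest: Dict[Any, dict] = {}
    for event in events:
        current = latest.get(event["key"])
        # On equal timestamps, the event that appears later in the batch wins.
        if current is None or event["ts"] >= current["ts"]:
            latest[event["key"]] = event

    # Step 2: apply the surviving events to a copy of the target table.
    merged = dict(table)
    for key, event in latest.items():
        if event["op"] == "delete":
            merged.pop(key, None)  # deleting a missing key is a no-op
        else:
            merged[key] = event["row"]  # insert and update are both upserts
    return merged
```

Reducing the batch to one event per key before applying it mirrors the usual requirement that a merge sees at most one source row per target row.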

We care about ingestion performance. The biggest append-only table onboarded has ingress traffic of hundreds of thousands of events per second; the biggest CDC-merge table onboarded has a snapshot size of a few TBs and CDC update traffic of hundreds of thousands of events per second. Many innovative ideas are incorporated in SOON to improve its performance, such as min-max range merge optimization, KMeans merge optimization, no-update merge for deduplication, and generated columns as partitions.
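The abstract names a min-max range merge optimization without describing it. One plausible reading, sketched below in plain Python, is that the merge only needs to touch target files whose key range overlaps the key range of the incoming batch. The `FileStats` type, its fields, and `files_to_scan` are hypothetical names for illustration; this is an assumption about the general idea, not Coinbase's code.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file statistics for the merge key column."""
    path: str
    min_key: int
    max_key: int


def files_to_scan(files: Sequence[FileStats], batch_keys: Sequence[int]) -> List[str]:
    """Return the files whose key range overlaps the batch's [min, max] key range."""
    if not batch_keys:
        return []  # an empty batch cannot match any target row
    lo, hi = min(batch_keys), max(batch_keys)
    # Two closed ranges overlap when each one starts before the other ends.
    return [f.path for f in files if f.max_key >= lo and f.min_key <= hi]
```

With files covering keys 0-99, 100-199, and 200-299, a batch containing keys 150 and 210 would skip the first file entirely. The benefit depends on how well the merge key correlates with file layout.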

Talk by: Chen Guo

JetBlue’s Real-Time AI & ML Digital Twin Journey Using Databricks

Over the past year, JetBlue has embarked on an AI and ML transformation. Databricks has been instrumental in this transformation due to the ability to integrate streaming pipelines, ML training using MLflow, ML API serving using ML registry and more in one cohesive platform. Real-time streams of weather, aircraft sensors, FAA data feeds, JetBlue operations and more feed the world's first AI and ML operating system orchestrating a digital twin, known as BlueSky, for efficient and safe operations. JetBlue has over 10 ML products (with multiple models per product) in production across multiple verticals including dynamic pricing, customer recommendation engines, supply chain optimization, customer sentiment NLP and several more.

The core JetBlue data science and analytics team consists of Operations Data Science, Commercial Data Science, AI and ML engineering and Business Intelligence. To facilitate rapid growth and a faster go-to-market strategy, the team has built an internal Data Catalog + AutoML + AutoDeploy wrapper called BlueML using Databricks features to empower data scientists, including advanced analysts, to train and deploy ML models in less than five lines of code.

Talk by: Derrick Olson and Rob Bajra

Rapidly Implementing Major Retailer API at the Hershey Company

Accurate, reliable, and timely data is critical for CPG companies to stay ahead in highly competitive retailer relationships, and for a company like the Hershey Company, the commercial relationship with Walmart is one of the most important. The team at Hershey found themselves with a looming deadline for their legacy analytics services and targeted a migration to the brand new Walmart Luminate API. Working in partnership with Advancing Analytics, the Hershey Company leveraged a metadata-driven Lakehouse Architecture to rapidly onboard the new Luminate API, helping the category management teams to overhaul how they measure, predict, and plan their business operations.

In this session, we will discuss the impact Luminate has had on Hershey's business covering key areas such as sales, supply chain, and retail field execution, and the technical building blocks that can be used to rapidly provision business users with the data they need, when they need it. We will discuss how key technologies enable this rapid approach, with Databricks Autoloader ingesting and shaping our data, Delta Streaming processing the data through the lakehouse and Databricks SQL providing a responsive serving layer. The session will include commentary as well as cover the technical journey.

Talk by: Simon Whiteley and Jordan Donmoyer

Streaming Schema Drift Discovery and Controlled Mitigation

When creating streaming workloads with Databricks, it can sometimes be difficult to capture and understand the current structure of your source data. For example, what happens if you are ingesting JSON events from a vendor, and the keys are very sparsely populated, or contain dynamic content? Ideally, data engineers want to "lock in" a target schema in order to minimize complexity and maximize performance for known access patterns. What do you do when your data sources just don't cooperate with that vision? The first step is to quantify how far your current source data is drifting from your established Delta table. But how?

This session will demonstrate a way to capture and visualize drift across all your streaming tables. The next question is, "Now that I see all of the data I'm missing, how do I selectively promote some of these keys into DataFrame columns?" The second half of this session will demonstrate precisely how to do a schema migration with minimal job downtime.
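As a rough illustration of what quantifying drift can mean, the sketch below counts how often top-level JSON keys appear that are absent from an established target schema. It is plain Python rather than the session's Databricks approach, and the function name `drift_report` and the sample column names are assumptions for illustration only.

```python
import json
from collections import Counter
from typing import Iterable, Set


def drift_report(raw_events: Iterable[str], target_columns: Set[str]) -> Counter:
    """Count occurrences of top-level JSON keys that the target schema lacks.

    Assumes every event is a JSON object; nested keys are not inspected.
    """
    unseen: Counter = Counter()
    for raw in raw_events:
        event = json.loads(raw)
        # Record each key that has no matching column in the target table.
        unseen.update(key for key in event if key not in target_columns)
    return unseen


# Example: two vendor events measured against a target table with columns id and name.
events = [
    '{"id": 1, "name": "a", "promo": "x"}',
    '{"id": 2, "promo": "y", "region": "eu"}',
]
report = drift_report(events, {"id", "name"})
```

Keys with high counts are natural candidates for promotion into real columns, while rare keys may be better left in a raw payload column.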

Talk by: Alexander Vanadio

Using Cisco Spaces Firehose API as a Stream of Data for Real-Time Occupancy Modeling

Honeywell manages the control of equipment for hundreds of thousands of buildings worldwide. Many of our outcomes relating to energy and comfort rely on knowing where people are in the building at any one time, so that we can target health and comfort conditions more suitably to areas that are more densely populated. Many of these buildings have Cisco IT infrastructure in them. Using their Wi-Fi access points and the RSSI (signal strength) from people’s laptops and phones, Cisco can calculate the number of people in each area of the building. Cisco Spaces offers this data as a real-time streaming source. Honeywell HBT has utilized this stream by writing Delta Live Tables (DLT) pipelines to consume it.

Honeywell buildings can now receive this firehose data from hundreds of concurrent customers and provide this occupancy data as a service to our vertical offerings in commercial, health, real estate and education. We will discuss the benefits of using DLT to handle this sort of incoming stream data, and illustrate the pain points we encountered and how we resolved them to receive the stream of Cisco data successfully. We will illustrate how our DLT pipeline was designed, and how it scaled to deal with huge quantities of real-time streaming data.

Talk by: Paul Mracek and Chris Inkpen
