talk-data.com

Topic: Data Streaming

Tags: realtime, event_processing, data_flow (70 activities tagged)

Activity Trend: 70 peak/qtr, 2020-Q1 to 2026-Q1

Activities

Showing filtered results

Filtering by: Databricks DATA + AI Summit 2023
Delta Live Tables: Modern Software Engineering and Management for ETL

Data engineers have the difficult task of cleansing complex, diverse data and transforming it into a usable source to drive data analytics, data science, and machine learning. They need to know the data infrastructure platform in depth, build complex queries in various languages, and stitch them together for production. Join this talk to learn how Delta Live Tables (DLT) simplifies the complexity of data transformation and ETL. DLT is the first ETL framework to use modern software engineering practices to deliver reliable and trusted data pipelines at any scale. Discover how:

  1. Analysts and data engineers can innovate rapidly with simple pipeline development and maintenance
  2. Operational complexity can be removed by automating administrative tasks and gaining visibility into pipeline operations
  3. Built-in quality controls and monitoring ensure accurate BI, data science, and ML
  4. Simplified batch and streaming can be implemented with self-optimizing and auto-scaling data pipelines
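To make the declarative idea above concrete, here is a minimal pure-Python analogue of the pattern the talk describes: datasets are declared as functions, and quality "expectations" filter out bad records before they reach downstream consumers. This is not the DLT API itself; the `table` and `expect` names are purely illustrative.

```python
# Hypothetical sketch of a declarative pipeline with quality expectations.
# NOT the Delta Live Tables API: `table` and `expect` are illustrative names.

registry = {}

def table(fn):
    """Register a function as a named dataset definition."""
    registry[fn.__name__] = fn
    return fn

def expect(predicate):
    """Drop rows that fail a quality check, mimicking a DLT expectation."""
    def decorator(fn):
        def wrapper():
            return [row for row in fn() if predicate(row)]
        registry[fn.__name__] = wrapper
        return wrapper
    return decorator

@table
def raw_orders():
    return [{"id": 1, "amount": 40}, {"id": 2, "amount": -5}]

@expect(lambda row: row["amount"] > 0)
def clean_orders():
    return registry["raw_orders"]()

print(clean_orders())  # only the valid order survives: [{'id': 1, 'amount': 40}]
```

The point of the real framework is that such declarations, plus the dependency graph between them, are enough for the platform to manage execution, retries, and monitoring.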

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

ÀLaSpark: Gousto's Recipe for Building Scalable PySpark Pipelines

Find out how Gousto is developing its data pipelines at scale in a repeatable manner. At Gousto, we've developed Goustospark, a wrapper around PySpark that allows us to quickly and easily build data pipelines that are deployed into our Databricks environment.

This wrapper abstracts repetitive components of all data pipelines such as spark configurations and metastore interactions. This allows a developer to simply specify the blueprints of the pipeline before turning their attention to more pressing issues, such as data quality and data governance, whilst enjoying a high level of performance and reliability.

In this session we will deep dive into the design patterns we followed and some unique approaches we've taken to structuring pipelines, and show a live demo of implementing a new Spark streaming pipeline in Databricks from scratch. We will even share some example Python code and snippets to help you build your own.
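The wrapper idea described above can be sketched in a few lines of plain Python: a developer specifies only the blueprint of the pipeline (its named steps), while repetitive setup is absorbed centrally. All names here are illustrative assumptions, not Gousto's actual Goustospark API, and the real wrapper would absorb Spark configuration and metastore interactions rather than a trivial step list.

```python
# Hypothetical sketch of a pipeline-wrapper "blueprint" pattern.
# Illustrative only; not Goustospark's real API.

class Pipeline:
    def __init__(self, name):
        self.name = name
        self.steps = []  # real wrapper: also Spark config, metastore setup

    def step(self, fn):
        """Decorator: register a transformation step in declaration order."""
        self.steps.append(fn)
        return fn

    def run(self, data):
        """Run all registered steps in order over the input."""
        for fn in self.steps:
            data = fn(data)
        return data

pipeline = Pipeline("orders")

@pipeline.step
def parse(rows):
    return [int(r) for r in rows]

@pipeline.step
def keep_positive(rows):
    return [r for r in rows if r > 0]

print(pipeline.run(["3", "-1", "7"]))  # [3, 7]
```

Because the boilerplate lives in the wrapper, each new pipeline is reduced to declaring its steps, which is what makes the approach repeatable at scale.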

Apache Spark Community Update | Reynold Xin Streaming Lakehouse | Karthik Ramasamy

Data + AI Summit Keynote talks from Reynold Xin and Karthik Ramasamy

How McAfee Leverages Databricks on AWS at Scale

McAfee, a global leader in online protection, enables home users and businesses to stay ahead of fileless attacks, viruses, malware, and other online threats. Learn how McAfee leverages Databricks on AWS to create a centralized data platform as a single source of truth to power customer insights. We will also describe how McAfee uses additional AWS services, specifically Kinesis and CloudWatch, to provide real-time data streaming and to monitor and optimize their Databricks on AWS deployment. Finally, we'll discuss the business benefits and lessons learned during McAfee's petabyte-scale migration to Databricks on AWS, using Databricks Delta clone technology coupled with network, compute, and storage optimizations on AWS.

How Robinhood Built a Streaming Lakehouse to Bring Data Freshness from 24h to Less Than 15 Mins

Robinhood's data lake is the bedrock foundation that powers business analytics, product experimentation, and machine learning applications throughout our organization. Come join this session where we will share our journey of building a scalable streaming data lakehouse with Spark, Postgres, and other leading open source technologies.

We will lay out our architecture in depth and describe how we perform CDC streaming ingestion and incremental processing of thousands of Postgres tables into our data lake.
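The core of CDC-based incremental processing is that each change event carries an operation type, and only changed rows are applied to the target. Robinhood's actual pipeline uses Spark and Postgres; the following is just a hedged, minimal sketch of the merge logic, with an assumed event shape (`op`, `id`, `row`).

```python
# Hypothetical sketch of applying a batch of CDC events to a keyed table.
# Event shape (op/id/row) is an assumption for illustration.

def apply_cdc(table, events):
    """Apply CDC events in order, keyed by primary key."""
    for event in events:
        key = event["id"]
        if event["op"] == "delete":
            table.pop(key, None)
        else:
            # insert and update are both upserts against the key
            table[key] = event["row"]
    return table

table = {1: {"symbol": "AAPL", "qty": 10}}
events = [
    {"op": "update", "id": 1, "row": {"symbol": "AAPL", "qty": 12}},
    {"op": "insert", "id": 2, "row": {"symbol": "TSLA", "qty": 3}},
    {"op": "delete", "id": 1},
]
print(apply_cdc(table, events))  # {2: {'symbol': 'TSLA', 'qty': 3}}
```

Streaming these event batches continuously, rather than re-snapshotting whole tables nightly, is what moves freshness from 24 hours toward minutes.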

Day 1 Morning Keynote | Data + AI Summit 2022

Welcome & "Destination Lakehouse" | Ali Ghodsi
Apache Spark Community Update | Reynold Xin
Streaming Lakehouse | Karthik Ramasamy
Delta Lake | Michael Armbrust
How Adobe migrated to a unified and open data Lakehouse to deliver personalization at unprecedented scale | Dave Weinstein
Data Governance and Sharing on Lakehouse | Matei Zaharia
Analytics Engineering and the Great Convergence | Tristan Handy
Data Warehousing | Shant Hovespian
Unlocking the power of data, AI & analytics: Amgen's journey to the Lakehouse | Kerby Johnson

Get insights on how to launch a successful lakehouse architecture in Rise of the Data Lakehouse by Bill Inmon, the father of the data warehouse. Download the ebook: https://dbricks.co/3ER9Y0K

Optimizing Incremental Ingestion in the Context of a Lakehouse

Incremental ingestion of data is often trickier than one would assume, particularly when it comes to maintaining data consistency: for example, specific challenges arise depending on whether the data is ingested in a streaming or a batched fashion. In this session we want to share the real-life challenges encountered when setting up an incremental ingestion pipeline in the context of a Lakehouse architecture.

In this session we outline how we used recently introduced Databricks features, such as Autoloader and Change Data Feed, in addition to more mature features, such as Spark Structured Streaming and the Trigger Once functionality. These features allowed us to transform batch processes into a "streaming" setup without needing the cluster to run continuously. This setup, which we are keen to share with the community, does not require reloading large amounts of data, and therefore represents a computationally, and consequently economically, cheaper solution.

In our presentation we dive deeper into each aspect of the setup, with extra focus on essential Autoloader functionalities such as schema inference, recovery mechanisms, and file discovery modes.
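The incremental idea above boils down to tracking which files a previous run already processed, so each "Trigger Once"-style run handles only new arrivals instead of reloading everything. Autoloader maintains this state in a persistent checkpoint; in this hedged sketch a plain Python set stands in for it, and all names are illustrative.

```python
# Hypothetical sketch of checkpointed incremental file discovery.
# A set stands in for Autoloader's persistent checkpoint; names are illustrative.

def run_once(all_files, processed):
    """One 'Trigger Once' run: process only files not seen before."""
    new_files = sorted(set(all_files) - processed)
    for path in new_files:
        pass  # real pipeline: read the file, transform, append to the table
    processed.update(new_files)
    return new_files

processed = set()
print(run_once(["a.json", "b.json"], processed))             # ['a.json', 'b.json']
print(run_once(["a.json", "b.json", "c.json"], processed))   # ['c.json']
```

Because each run touches only the delta, the cluster can be started on a schedule, drain the backlog once, and shut down, which is exactly the cost advantage the abstract describes.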

Supercharging our data architecture at Coinbase using Databricks Lakehouse | Eric Sun

Coinbase is neither simply a finance company nor a tech company — it’s a crypto company. This distinction has big implications for how we work with the Blockchain, Product and Financial data that we need to drive our hypergrowth. We’ve recently enabled a Lakehouse architecture based upon Databricks to unify these complex and varied data sets, to deliver a high performance, continuous ingestion framework at an unprecedented scale. We can now support both ETL and ML workloads on one platform to deliver innovative batch and streaming use cases, and democratize data much faster by enabling teams to use the tools of their choice, while greatly reducing end-to-end latency and simplifying maintenance and operations. In this keynote, we will share our journey to the Lakehouse, and some of the lessons learned as we built an open data architecture at scale.

Automating Business Decisions Using Event Streams

Today's real-time solutions demand continuity, autonomy, and observability. Data streams have evolved to guarantee only continuity; thus, streams alone will never satisfy this demand. Industries instead crave a properly end-to-end streaming architecture backing their applications and services, a concept that has narrowly evaded realization until now.

In this session, Rohit Bose will demonstrate how such architectures cleanly solve complex problems. This will require two parts:

  1. Building an industry-specific application that continuously generates insights and reports them over dynamically-scoped real-time streams
  2. Discussing the advantages and generalizations of the application's design

The demo will utilize the Swim platform to expose thousands of streaming APIs seeded by an Apache Kafka firehose, enabling both real-time map visualizations and decision-making clients to instantly observe changes across distributed entities with zero unnecessary subscriptions.
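The "zero unnecessary subscriptions" claim above is a property of scoped streams: clients subscribe only to the entities they care about, so updates for other entities are never delivered to them. The following is plain-Python pub-sub to illustrate the scoping idea, not the Swim platform's API; all names are illustrative.

```python
# Hypothetical sketch of entity-scoped streaming subscriptions.
# Plain pub-sub for illustration; not the Swim platform API.

from collections import defaultdict

class StreamBroker:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, entity_id, callback):
        """Scope a subscription to a single entity."""
        self.subscribers[entity_id].append(callback)

    def publish(self, entity_id, update):
        """Deliver an update only to subscribers scoped to this entity."""
        for callback in self.subscribers[entity_id]:
            callback(update)

broker = StreamBroker()
received = []
broker.subscribe("vehicle-42", received.append)
broker.publish("vehicle-42", {"speed": 55})
broker.publish("vehicle-99", {"speed": 70})  # no subscriber: not delivered
print(received)  # [{'speed': 55}]
```

In a real deployment the broker side would be fed by the Kafka firehose mentioned above, with one stateful endpoint per distributed entity.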

Using Feast Feature Store with Apache Spark for Self-Served Data Sharing and Analysis for Streaming

In this presentation we will talk about how we use available NER-based sensitive-data detection methods and automated record-of-activity processing, on top of Spark and Feast, for collaborative intelligent analytics and governed data sharing. Information sharing is key to successful business outcomes, but it is complicated by sensitive information, both user-centric and business-centric.

Our presentation is motivated by the need to share key KPIs and outcomes for health screening data collected from various surveys, in order to improve care and assistance. In particular, collaborative information sharing was needed to help with health data management and with KPIs for early detection and prevention of disease. We will present the framework and approach we have used for these purposes.
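A full NER model is beyond the scope of a sketch, so the following hedged example uses simple regex patterns to show where a sensitive-data detector slots into a governed sharing workflow: records are scanned and matching values are masked before they leave the platform. The patterns and field names are illustrative assumptions, not the method used in the talk.

```python
# Hypothetical sketch: regex-based stand-in for NER sensitive-data detection.
# Patterns and field names are illustrative only.

import re

PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(record):
    """Mask values matching a sensitive-data pattern before sharing."""
    clean = {}
    for field, value in record.items():
        for label, pattern in PATTERNS.items():
            if isinstance(value, str) and pattern.search(value):
                value = f"<{label} redacted>"
        clean[field] = value
    return clean

print(redact({"name": "A. Patient", "contact": "a.patient@example.com"}))
# {'name': 'A. Patient', 'contact': '<email redacted>'}
```

An NER model generalizes this to entities that have no fixed pattern (names, addresses), but the governance step, detect then mask before sharing, is the same.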
