Databricks DATA + AI Summit 2023

MLOps on Databricks: A How-To Guide

2022-07-19

AI/ML Databricks MLOps

As companies roll out ML pervasively, operational concerns become the primary source of complexity. Machine Learning Operations (MLOps) has emerged as a practice to manage this complexity. At Databricks, we see firsthand how customers develop their MLOps approaches across a huge variety of teams and businesses. In this session, we will show how your organization can build robust MLOps practices incrementally. We will unpack general principles which can guide your organization’s decisions for MLOps, presenting the most common target architectures we observe across customers. Combining our experiences designing and implementing MLOps solutions for Databricks customers, we will walk through our recommended approaches to deploying ML models and pipelines on Databricks. You will come away with a deeper understanding of how to scale deployment of ML models across your organization, as well as a practical, coded example illustrating how to implement an MLOps workflow on Databricks.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Modern Architecture of a Cloud-Enabled Data and Analytics Platform

2022-07-19

Agile/Scrum Analytics Cloud Computing Databricks

In today’s IT organization, whether the task is delivering a sophisticated analytical model, making a product advancement decision, or understanding the behavior of a customer, we rely on data to make good, informed decisions. Given this backdrop, an architecture that can efficiently collect data from a wide range of sources within the company remains an important goal for every data organization.

In this session we will explain how Bayer has deployed a hybrid data platform that integrates key legacy data systems while taking full advantage of what a modern cloud data platform has to offer in terms of scalability and flexibility. We will also elaborate on the use of its most significant component, Databricks, which provides not only a very sophisticated data pipelining solution but also a complete ecosystem for teams to create data and analytical solutions in a flexible and agile way.

Monitoring and Quality Assurance of Complex ML Deployments via Assertions

2022-07-19

Daniel Kang

AI/ML Analytics Databricks

Machine Learning (ML) is increasingly being deployed in complex situations by teams. While much research effort has focused on the training and validation stages, other parts have been neglected by the research community.

In this talk, Daniel Kang will describe two abstractions (model assertions and learned observation assertions) that allow users to input domain knowledge to find errors at deployment time and in labeling pipelines. He will show real-world errors in labels and ML models deployed in autonomous vehicles, visual analytics, and ECG classification that these abstractions can find. He will further describe how they can be used to improve model quality by up to 2x at a fixed labeling budget. This work is being conducted jointly with researchers from Stanford University and Toyota Research Institute.
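
As a rough illustration of what a model assertion could look like, the sketch below checks a video detector's output for "flicker": an object that is detected, disappears for exactly one frame, and then reappears. The function name, the per-frame dictionary format, and the object id are all hypothetical choices made for this example; this is not the speaker's implementation.

```python
from typing import Dict, List


def flicker_assertion(frames: List[Dict[str, bool]], object_id: str) -> int:
    """Count frames where an object vanishes for a single frame and reappears.

    Each frame is a hypothetical mapping of object id -> detected flag.
    A non-zero return value flags a likely model error worth reviewing.
    """
    present = [frame.get(object_id, False) for frame in frames]
    flickers = 0
    for i in range(1, len(present) - 1):
        # Domain knowledge: real objects do not blink out for one frame.
        if present[i - 1] and not present[i] and present[i + 1]:
            flickers += 1
    return flickers


detections = [
    {"car_1": True},
    {"car_1": True},
    {},  # car_1 is missing for one frame
    {"car_1": True},
    {"car_1": True},
]
severity = flicker_assertion(detections, "car_1")
print(severity)
```

Frames that trip an assertion like this one are natural candidates for relabeling, which is one way such checks could feed back into training data.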

Mosaic: A Framework for Geospatial Analytics at Scale

2022-07-19

Analytics API Databricks HTML Java PySpark

In this session we’ll present Mosaic, a new Databricks Labs project with a geospatial flavour.

Mosaic provides users of Spark and Databricks with a unified framework for distributing geospatial analytics. Users can choose to employ existing Java-based tools such as JTS or Esri's Geometry API for Java and Mosaic will handle the task of parallelizing these tools' operations: e.g. efficiently reading and writing geospatial data and performing spatial functions on geometries. Mosaic helps users scale these operations by providing spatial indexing capabilities (using, for example, Uber's H3 library) and advanced techniques for optimising common point-in-polygon and polygon-polygon intersection operations.

The development of Mosaic builds upon techniques developed with Ordnance Survey (the central hub for geospatial data across UK Government) and described in this blog post: https://databricks.com/blog/2021/10/11/efficient-point-in-polygon-joins-via-pyspark-and-bng-geospatial-indexing.html

Multimodal Deep Learning Applied to E-commerce Big Data

2022-07-19

AI/ML Big Data Databricks MLOps Spark

At Mirakl, we empower marketplaces with Artificial Intelligence solutions. Catalog data is an extremely rich source of information about the products of e-commerce sellers and marketplaces, including images, descriptions, brands, prices and attributes (for example, size, gender, material or color). Such big volumes of data are suitable for training multimodal deep learning models and present several technical Machine Learning and MLOps challenges to tackle.

We will dive deep into two key use cases: deduplication and categorization of products. For categorization, the creation of quality multimodal embeddings plays a crucial role and is achieved through experimentation with transfer learning techniques on state-of-the-art models. Finding very similar or almost identical products among many millions can be a very difficult problem, and that is where our deduplication algorithm provides a fast and computationally efficient solution.

Furthermore we will show how we deal with big volumes of products using robust and efficient pipelines, Spark for distributed and parallel computing, TFRecords to stream and ingest data optimally on multiple machines avoiding memory issues, and MLflow for tracking experiments and metrics of our models.

Near Real-Time Analytics with Event Streaming, Live Tables, and Delta Sharing

2022-07-19

AI/ML Analytics AWS BI Cloud Computing Data Quality

Microservices is an increasingly popular architecture much loved by application teams, for it allows services to be developed and scaled independently. Data teams, though, often need a centralized repository where all data from different services come together to join and aggregate. The data platform can serve as a single source of company facts, enable near real time analytics, and secure sharing of massive data sets across clouds.

A viable microservices ingestion pattern is Change Data Capture (CDC), using AWS Database Migration Service or Debezium. CDC proves to be a scalable solution ideal for stable platforms, but it poses several challenges for evolving services: frequent schema changes, complex, unsupported DDL during migration, and automated deployments are but a few. An event streaming architecture can address these challenges.

Confluent, for example, provides a schema registry service where all services can register their event schemas. Schema registration helps with verifying that the events are being published based on the agreed contracts between data producers and consumers. It also provides a separation between internal service logic and the data consumed downstream. The services write their events to Kafka using the registered schemas with a specific topic based on the type of the event.

Data teams can leverage Spark jobs to ingest Kafka topics into Bronze tables in the Delta Lake. On ingestion, the registered schema from schema registry is used to validate the schema based on the provided version. A merge operation is sometimes called to translate events into final states of the records per business requirements.
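
The merge step described above can be sketched in plain Python. This is only a toy model of the idea, assuming each event carries a hypothetical `id`, a monotonically increasing `seq`, an `op` of `create`, `update`, or `delete`, and a `data` payload; a real pipeline would express this as a merge into a Delta table rather than a dictionary update.

```python
from typing import Any, Dict, Iterable

Event = Dict[str, Any]


def merge_events(state: Dict[str, Dict[str, Any]],
                 events: Iterable[Event]) -> Dict[str, Dict[str, Any]]:
    """Fold change events into the latest state per record id.

    Events are applied in `seq` order so late or shuffled arrivals
    within a batch still produce the correct final record.
    """
    for event in sorted(events, key=lambda e: e["seq"]):
        key = event["id"]
        if event["op"] == "delete":
            state.pop(key, None)  # record no longer exists downstream
        else:
            # create/update: overlay the new fields on the current record
            state[key] = {**state.get(key, {}), **event["data"]}
    return state


events = [
    {"id": "o1", "seq": 2, "op": "update", "data": {"status": "shipped"}},
    {"id": "o1", "seq": 1, "op": "create", "data": {"status": "new", "total": 10}},
    {"id": "o2", "seq": 3, "op": "create", "data": {"status": "new"}},
    {"id": "o2", "seq": 4, "op": "delete", "data": {}},
]
final_state = merge_events({}, events)
print(final_state)
```

Sorting by sequence number before applying is the design choice that matters here: it keeps the result independent of the order in which events were read from the topic.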

Data teams can take advantage of Delta Live Tables on streaming datasets to produce Silver and Gold tables in near real time. Each input data source also has a set of expectations to ensure data quality and business rules. The pipeline allows Engineering and Analytics to collaborate by mixing Python and SQL. The refined data sets are then fed into Auto ML for discovery and baseline modeling.

To expose Gold tables to more consumers, especially non-Spark users across clouds, data teams can implement Delta Sharing. Recipients can access Silver tables from a different cloud and build their own analytics data sets. Analytics teams can also access Gold tables via the pandas Delta Sharing client and BI tools.

Nixtla: Deep Learning for Time Series Forecasting

2022-07-19

Databricks GitHub IoT Python PyTorch

Time series forecasting has a wide range of applications: finance, retail, healthcare, IoT, etc. Recently, deep learning models such as ESRNN or N-BEATS have proven to have state-of-the-art performance in these tasks. Nixtlats is a Python library that we have developed to make these state-of-the-art models easier for data scientists and developers to use in production environments. Written in PyTorch, its design is focused on usability and reproducibility of experiments. For this purpose, nixtlats has several modules:

Data: contains datasets from various time series competitions.
Models: includes state-of-the-art models.
Evaluation: has various loss functions and evaluation metrics.

Objective:

To introduce attendees to the challenges of time series forecasting with deep learning.
Commercial applications of time series forecasting.
Describe nixtlats, its components, and best practices for training and deploying state-of-the-art models in production.
Reproduction of state-of-the-art results using nixtlats from the winning model of the M4 time series competition (ESRNN).

Project repository: https://github.com/Nixtla/nixtlats.

Opening the Floodgates: Enabling Fast, Unmediated End User Access to Trillion-Row Datasets with SQL

2022-07-19

Analytics API ClickHouse Databricks Druid JSON

Spreadsheets revolutionized IT by giving end users the ability to create their own analytics. Providing direct end user access to trillion-row datasets generated in financial markets or digital marketing is much harder. New SQL data warehouses like ClickHouse and Druid can provide fixed latency with constant cost on very large datasets, which opens up new possibilities.

Our talk walks through recent experience on analytic apps developed by ClickHouse users that enable end users like market traders to develop their own analytics directly off raw data. We’ll cover the following topics.

Characteristics of new open source column databases and how they enable low-latency analytics at constant cost.
Idiomatic ways to validate new apps by building MVPs that support a wide range of queries on source data including storing source JSON, schema design, applying compression on columns, and building indexes for needle-in-a-haystack queries.
Incrementally identifying hotspots and applying easy optimizations to bring query performance into line with long term latency and cost requirements.
Methods of building accessible interfaces, including traditional dashboards, imitating existing APIs that are already known, and creating app-specific visualizations.

We’ll finish by summarizing a few of the benefits we’ve observed and also touch on ways that analytic infrastructure could be improved to make end user access even more productive. The lessons are as general as possible so that they can be applied across a wide range of analytic systems, not just ClickHouse.

Optimizing Speed and Scale of User-Facing Analytics Using Apache Kafka and Pinot

2022-07-19

Karin Wolok (StarTree), Neha Power (StarTree)

Analytics Azure Big Data Cloud Computing Cloud Storage Data Lake

Apache Kafka is the de facto standard for real-time event streaming, but what do you do if you want to perform user-facing, ad-hoc, real-time analytics too? That's where Apache Pinot comes in.

Apache Pinot is a real-time distributed OLAP datastore used to deliver scalable real-time analytics with low latency. It can ingest data from batch data sources (S3, HDFS, Azure Data Lake, Google Cloud Storage) as well as streaming sources such as Kafka. Pinot is used extensively at LinkedIn and Uber to power many analytical applications such as Who Viewed My Profile, Ad Analytics, Talent Analytics, Uber Eats and many more, serving 100k+ queries per second while ingesting 1 million+ events per second.

Apache Kafka's highly performant, distributed, fault-tolerant, real-time publish-subscribe messaging platform powers big data solutions at Airbnb, LinkedIn, MailChimp, Netflix, the New York Times, Oracle, PayPal, Pinterest, Spotify, Twitter, Uber, Wikimedia Foundation, and countless other businesses.

Come hear from Neha Power, Founding Engineer at StarTree and PMC member and committer of Apache Pinot, and Karin Wolok, Head of Developer Community at StarTree, for an introduction to both systems and a view of how they work together.

Orchestration Made Easy with Databricks Workflows

2022-07-19

AI/ML Analytics Cloud Computing Data Lakehouse Databricks

Orchestrating and managing end-to-end production pipelines has remained a bottleneck for many organizations. Data teams spend too much time stitching pipeline tasks together and manually managing and monitoring the orchestration process, with heavy reliance on external or cloud-specific orchestration solutions, all of which slows down the delivery of new data. In this session, we introduce you to Databricks Workflows: a fully managed orchestration service for all your data, analytics, and AI, built into the Databricks Lakehouse Platform. Join us as we dive deep into the new workflow capabilities and the integration with the underlying platform. You will learn how to create and run reliable production workflows, centrally manage and monitor them, and implement recovery actions such as repair and run, as well as other new features.

OvalEdge: End-To-End Data Governance

2022-07-19

Data Governance Data Quality Databricks

OvalEdge presents a progressive solution for Data Governance and is the only platform that provides an end-to-end data governance experience. Data Governance is all about access, data literacy, lineage, better business processes, data privacy and compliance controls, and data quality. What makes OvalEdge successful is having all of these features in a central platform that is accessible and beneficial for all data users.

Scaling ML at CashApp with Tecton

2022-07-19

AI/ML Databricks

This is a joint talk given by CashApp and Tecton. CashApp’s mobile payment product has stringent technical requirements: scale, reliability, speed. ML-based recommendations are at the core of this service and pose a significant engineering challenge. This talk describes CashApp’s journey through various generations of its core ML capabilities, covering the technical and organizational challenges associated with building large-scale production recommendation systems. The talk finishes with a look at the latest generation of CashApp’s ML platform and highlights how Tecton’s real-time Feature Platform helps CashApp deliver world-class recommendations.

Accidentally Building a Petabyte-Scale Cybersecurity Data Mesh in Azure With Delta Lake at HSBC

2022-07-19

Ryan Harris (HSBC)

AI/ML Analytics Azure Data Lake Databricks Delta

Due to the unique cybersecurity challenges that HSBC faces daily, from high data volumes to untrustworthy sources to the privacy and security restrictions of a highly regulated industry, the resulting architecture was an unwieldy set of disparate data silos. So, how do we build a cybersecurity advanced analytics environment to enrich and transform these myriad data sources into a unified, well-documented, robust, resilient, repeatable, scalable, maintainable platform that will empower the cyber analysts of the future, while remaining cost-effective and enabling everyone from the less-technical junior reporting user to the senior machine learning engineer?

In this session, Ryan Harris, Principal Cybersecurity Engineer at HSBC, dives into the infrastructure and architecture employed, ranging from the landing zone concepts, secure access workstations, data lake structure, and isolated data ingestion, to the enterprise integration layer. In the process of building the data pipelines and lakehouses, we ended up building a hybrid data mesh leveraging Delta Lake. The result is a flexible, secure, self-service environment that is unlocking the capabilities of our humans.

An Advanced S3 Connector for Spark to Hunt for Cyber Attacks

2022-07-19

Databricks HDFS S3 Cyber Security Spark Data Streaming

Working with S3 is different from doing so with HDFS: The architecture of the Object store makes the standard Spark file connector inefficient to work with S3.

One way to tackle this problem is to use a message queue that listens for changes in a bucket. But what if an additional message queue is not an option and you need to use Spark Streaming? You can use the standard file connector, but performance degrades quickly as the number of files in the source path grows.

We have seen this happen at Hunters, a security operations platform that works with a wide range of data sources.

We want to share a description of the problem and the solution we will open-source. The audience will learn how to configure it and make the best use of it. We will also discuss how to use metadata to boost the performance of discovering new files in the stream and show the use case of utilizing time metadata of CloudTrail to efficiently collect logs for hunting cyber attacks.
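
The abstract does not describe the connector's internals, so the following is only a guess at the general idea of using time metadata: if object keys are partitioned by date (CloudTrail writes logs under `YYYY/MM/DD` paths), a reader can list only the prefixes for days it has not processed yet instead of scanning the whole bucket. The function name and the base prefix, including the account id, are hypothetical.

```python
from datetime import date, timedelta
from typing import List


def daily_prefixes(base: str, start: date, end: date) -> List[str]:
    """Return one 'base/YYYY/MM/DD/' prefix per day from start to end inclusive.

    Listing only these prefixes bounds the work per micro-batch by the
    number of new days, not by the total number of files in the bucket.
    """
    days = (end - start).days
    return [
        f"{base}/{(start + timedelta(days=offset)):%Y/%m/%d}/"
        for offset in range(days + 1)
    ]


# Hypothetical base path; the account id and region are placeholders.
base_prefix = "AWSLogs/123456789012/CloudTrail/us-east-1"
prefixes = daily_prefixes(base_prefix, date(2022, 6, 29), date(2022, 7, 1))
for prefix in prefixes:
    print(prefix)
```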

Announcing General Availability of Databricks Terraform Provider

2022-07-19

CI/CD Data Engineering Data Science Databricks DevOps Cyber Security

We all live in exciting times, amid the hype of Distributed Data Mesh (or just mess). This talk will cover a couple of architectural and organizational approaches to achieving Distributed Data Mesh, which is essentially a combination of mindset, fully automated infrastructure, continuous integration for data pipelines, dedicated team collaborative environments, and security enforcement. As a Data Leader, you’ll learn what to pay attention to when starting (or reviving) a modern Data Engineering and Data Science strategy, and how Databricks Unity Catalog may help you automate that. As a DevOps engineer, you’ll learn about the best practices and pitfalls of Continuous Deployment on Databricks with Terraform and Continuous Integration with Databricks Repos. You’ll be excited by how you can automate Data Security with Unity Catalog and Terraform. As a Data Scientist, you’ll learn how you can get relevant infrastructure into “production” faster.

Apache Arrow Flight SQL: High Performance, Simplicity, and Interoperability for Data Transfers

2022-07-19

API Arrow Databricks SQL

Network protocols for transferring data generally have one of two problems: they’re slow for large data transfers but have simple APIs (e.g. JDBC) or they’re fast for large data transfers but have complex APIs specific to the system. Apache Arrow Flight addresses the former by providing high performance data transfers and half of the latter by having a standard API independent of systems. However, while the Arrow Flight API is performant and an open standard, it can be more complex to use than simpler APIs like JDBC.

Arrow Flight SQL rounds out the solution, providing both great performance and a simple universal API.

In this talk, we’ll show the performance benefits of Arrow Flight, the client difference between interacting with Arrow Flight and Arrow Flight SQL, and an overview of a JDBC driver built on Arrow Flight SQL, enabling clients to take advantage of this increased performance with zero application changes.

Apache Spark on Kubernetes—Lessons Learned from Launching Millions of Spark Executors

2022-07-19

Zhou (Apple), Aaruna (Apple)

Cloud Computing Databricks Kubernetes Spark

At Apple, data scientists and engineers are running enormous Spark workloads to deliver amazing cloud services. Apple Cloud Service supports the ever-increasing scale of Spark workloads and resource requirements with great user experience: from code to deployment management, one interface for all compute backends.

In this talk, Aaruna and Zhou will walk through the lessons learned and pitfalls encountered in supporting the service at Apple scale. We will share how Apple Cloud Services effectively orchestrates Spark applications, as well as the seamless switchover among different resource managers, be it Mesos or Kubernetes, on private or on-premise infrastructure. We will also cover the monitoring system and how it helps tune Spark resource requirements with actual execution analysis.

Apache Spark SQL Aggregate Improvement at Meta (Facebook)

2022-07-19

Databricks ORC Spark SQL

Aggregate (group-by) is one of the most important SQL operations in data warehouses. It is required when we want to get aggregated insights from input datasets. Over the last year, we added a series of aggregate optimizations to Spark SQL internally at Facebook, and we recently started contributing them back to Apache Spark.

(1) Sort aggregate (SPARK-32461): add code generation to improve query performance, replace hash with sort aggregate when child is sorted, etc.
(2) Object hash aggregate (SPARK-34286): adaptive sort-based fallback based on JVM heap memory usage during query execution.
(3) Hash aggregate (SPARK-31973): adaptive bypass of partial aggregate when aggregate reduction ratio is low.
(4) Data source aggregate push down (SPARK-34960): aggregate push down to ORC data source by utilizing column statistics.
(5) Files statistics aggregate: aggregate output files (and all columns) statistics distributively when writing query output.

We’ll take a deep dive into the above features and the lessons learned.
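
To make item (3) concrete, here is a toy sketch of adaptive bypass for a map-side partial SUM. It is not Spark's implementation: the sampling rule, the `sample_size` and `min_reduction` parameters, and the function names are assumptions chosen for illustration. The idea is to pre-aggregate a sample of rows, and if grouping barely shrinks them (most keys are distinct), stop hashing and pass the remaining rows straight through to the final aggregate.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int]


def partial_sum(rows: Iterable[Row], sample_size: int = 4,
                min_reduction: float = 0.5) -> List[Row]:
    """Partial SUM per key that bypasses itself when it is not reducing rows."""
    buffer: Dict[str, int] = defaultdict(int)
    output: List[Row] = []
    seen = 0
    bypass = False
    for key, value in rows:
        if bypass:
            output.append((key, value))  # pass through untouched
            continue
        buffer[key] += value
        seen += 1
        if seen == sample_size:
            # Fraction of sampled rows eliminated by grouping.
            reduction = 1 - len(buffer) / seen
            bypass = reduction < min_reduction
    output.extend(buffer.items())  # flush whatever was pre-aggregated
    return output


def final_sum(rows: Iterable[Row]) -> Dict[str, int]:
    """Final aggregate: correct whether or not the partial step bypassed."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)


low_cardinality = [("a", 1)] * 6
high_cardinality = [("a", 1), ("b", 1), ("c", 1), ("d", 1), ("e", 1), ("a", 1)]
print(partial_sum(low_cardinality))
print(len(partial_sum(high_cardinality)))
print(final_sum(partial_sum(high_cardinality)))
```

Because the final aggregate sums whatever it receives, bypassing changes only how much work the partial step does, never the result.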

Auto Encoder Decoder-Based Anomaly Detection with the Lakehouse Paradigm

2022-07-19

Data Lakehouse Data Management Databricks Pandas

An Auto-Encoder-Decoder is a deep learning neural network architecture with an hourglass shape: high-dimensional inputs are compressed to a latent space by the encoder, and the decoder, which mirrors the encoder architecture, reconstructs the input data from that latent space. Auto-Encoder-Decoder models are commonly used for anomaly detection. Training minimizes the reconstruction error of normal data, so an anomaly can be detected when its reconstruction error exceeds the “normal threshold”. This presentation will demonstrate an Auto-Encoder-Decoder anomaly detection solution built with the Lakehouse Paradigm, from data management to after-deployment monitoring, to explain the entire model life cycle. It will also highlight the flexibility and scalability that MLflow custom models and Pandas UDFs can bring when a large number of individual models need to be trained, deployed, and monitored in parallel.
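
The thresholding logic can be shown without a neural network. In the sketch below, a fixed linear map stands in for a trained model: the "encoder" compresses a 2-D point to one number and the "decoder" expands it back, so points near the diagonal reconstruct well and others do not. All function names and the `margin` parameter are hypothetical, and deriving the threshold from the worst error on normal data is just one simple choice.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def encode(point: Point) -> float:
    """Stand-in encoder: compress 2-D input to a 1-D latent value."""
    return (point[0] + point[1]) / 2


def decode(latent: float) -> Point:
    """Stand-in decoder: expand the latent value back to 2-D."""
    return (latent, latent)


def reconstruction_error(point: Point) -> float:
    """Euclidean distance between the input and its reconstruction."""
    return math.dist(point, decode(encode(point)))


def fit_threshold(normal: Sequence[Point], margin: float = 1.5) -> float:
    """Set the 'normal threshold' above the worst error seen on normal data."""
    return margin * max(reconstruction_error(p) for p in normal)


def is_anomaly(point: Point, threshold: float) -> bool:
    return reconstruction_error(point) > threshold


normal_data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2)]
threshold = fit_threshold(normal_data)
print(is_anomaly((2.0, 2.1), threshold))
print(is_anomaly((1.0, 4.0), threshold))
```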

Automating Model Lifecycle Orchestration with Jenkins

2022-07-19

AI/ML CI/CD Databricks Jenkins MLOps

A key part of the model lifecycle involves bringing a model to production. In regular software systems, this is accomplished via a CI/CD pipeline such as one built with Jenkins. However, integrating Jenkins into a typical DS/ML workflow is not straightforward for X, Y, Z reasons. In this hands-on talk, I will discuss what Jenkins and CI/CD practices can bring to your ML workflows, demonstrate a few of these workflows, and share some best practices on how a bit of Jenkins can level up your MLOps processes.

Backfill Streaming Data Pipelines in Kappa Architecture

2022-07-19

Flink AWS Lambda Databricks DWH Iceberg Kafka

Streaming data pipelines can fail due to various reasons. Since the source data, such as Kafka topics, often have limited retention, prolonged job failures can lead to data loss. Thus, streaming jobs need to be backfillable at all times to prevent data loss in case of failures. One solution is to increase the source's retention so that backfilling is simply replaying source streams, but extending Kafka retention is very costly for Netflix's data sizes. Another solution is to utilize source data stored in DWH, commonly known as the Lambda architecture. However, this method introduces significant code duplication, as it requires engineers to maintain a separate equivalent batch job. At Netflix, we have created the Iceberg Source Connector to provide backfilling capabilities to Flink streaming applications. It allows Flink to stream data stored in Apache Iceberg while mirroring Kafka's ordering semantics, enabling us to backfill large-scale stateful Flink pipelines at low retention cost.

Batches, Streams, and Everything in between: Unifying Batch and Stream Storage with Apache Pulsar

2022-07-19

Data Lakehouse Databricks Delta Data Streaming

Delta Lake and Lakehouse architectures have been instrumental technologies in providing a better foundation for dealing with streaming and data deltas via an open-industry standard. The rapid growth of the ecosystem is a testament to the success of this approach. However, challenges still remain in building a data platform that allows teams to process all data via streams, regardless of the age of data, while also being able to view all streams as tables without exporting data out of the streaming system. In this talk, we will take a hands-on look at how Apache Pulsar is building its core storage engine on the concepts of Lakehouse architectures, allowing teams to build data platforms that can manage data over its entire lifecycle and enabling data to be consumed as either a stream or a table. With these capabilities, we will show how Pulsar + Delta Lake empowers teams, regardless of toolset, to better focus on driving value from data, not just managing it.

Beyond Daily Batch Processing: Operational Trade-Offs of Microbatch, Incremental, and Real-Time

2022-07-19

Flink Data Science Databricks Data Streaming

Are you considering converting some daily batch pipelines to a realtime system? Perhaps restating multiple days of batch data is becoming unscalable for your pipelines. Maybe a short SLA is music to your stakeholders' ears. If you're Flink-curious or possibly just sick of pondering your late-arriving data, this discussion is for you.

On the Streaming Data Science and Engineering team at Netflix we support business-critical daily batch, hourly batch, incremental, and realtime pipelines with a rotating on-call system. In this presentation I'll discuss tradeoffs we experience between these systems with an emphasis on operational support when things go sideways. I'll also share some learnings about "goodness of fit" per processing type amongst various workloads with an eye for keeping your data timely and your colleagues sane.

Beyond Monitoring: The Rise of Data Observability

2022-07-19 Watch

video

Barr Moses (Monte Carlo)

Cloud Computing Dashboard Data Engineering Data Lake Databricks Monte Carlo

"Why did our dashboard break?" "What happened to my data?" "Why is this column missing?" If you've been on the receiving end of these messages (and many others!) from downstream stakeholders, you're not alone. Data engineering teams spend 40 percent or more of their time tackling data downtime, or periods of time when data is missing, erroneous, or otherwise inaccurate, and as data systems become increasingly complex and distributed, this number will only increase. To address this problem, data observability is becoming an increasingly important part of the cloud data stack, helping engineers and analysts reduce time to detection and resolution for data incidents caused by faulty data, code, and operational environments. But what does data observability actually look like in practice? During this presentation, Barr Moses, CEO and co-founder of Monte Carlo, will present on how some of today's best data leaders implement observability across their data lake ecosystem and share best practices for data teams seeking to achieve end-to-end visibility into their data at scale. Topics addressed will include: building automated lineage for Apache Spark, applying data reliability workflows, and extending beyond testing and monitoring to solve for unknown unknowns in your data pipelines.
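As a rough illustration of the "data downtime" idea described above (not Monte Carlo's product or method), the following minimal Python sketch flags two of the symptoms the abstract mentions: a missing column and stale data. The function name `find_data_incidents`, the `updated_at` field, and the 24-hour threshold are all hypothetical choices made for this example.

```python
from datetime import datetime, timedelta


def find_data_incidents(rows, expected_columns, max_age, now):
    """Return human-readable incident descriptions for one table snapshot.

    Hypothetical helper for illustration only. `rows` is a list of dicts,
    and each row is assumed to carry an `updated_at` datetime.
    """
    if not rows:
        return ["table is empty"]

    incidents = []

    # Schema check: report every expected column that no row contains.
    present = set().union(*(row.keys() for row in rows))
    for column in sorted(set(expected_columns) - present):
        incidents.append(f"missing column: {column}")

    # Freshness check: compare the newest timestamp against the allowed age.
    timestamps = [row["updated_at"] for row in rows if row.get("updated_at")]
    if timestamps:
        age = now - max(timestamps)
        if age > max_age:
            hours = age.total_seconds() / 3600
            incidents.append(f"stale data: newest row is {hours:.1f} hours old")

    return incidents


# Example snapshot: the `amount` column is absent and the data is over a day old.
snapshot = [
    {"id": 1, "updated_at": datetime(2022, 7, 18, 0, 0)},
    {"id": 2, "updated_at": datetime(2022, 7, 18, 6, 0)},
]
incidents = find_data_incidents(
    snapshot,
    expected_columns=["id", "amount", "updated_at"],
    max_age=timedelta(hours=24),  # assumed freshness threshold
    now=datetime(2022, 7, 19, 12, 0),
)
for incident in incidents:
    print(incident)
```

Hand-written checks like these only catch failures someone anticipated; the talk's stated focus is on going beyond such testing and monitoring to surface "unknown unknowns".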

Build an Enterprise Lakehouse for Free with Trino and Delta Lake

2022-07-19 Watch

video

Claudius, Tom

Data Lake Data Lakehouse Databricks Delta SQL Trino

Delta Lake usage has grown quickly across data lakes, driven by a growing number of use cases that require the DML capabilities Delta Lake brings. Beyond support for ACID transactions, users want the ability to query the data in their data lake interactively. This is where a query engine like Trino (formerly PrestoSQL) comes in. Starburst provides an enterprise version of the popular Trino MPP SQL query engine and has recently open-sourced its Delta Lake connector.

In this talk, Tom and Claudius will discuss the connector, its features, and how their users are expanding the functionality of their data lakes with improved performance and the ability to handle colliding modifications. Get started with this feature-rich, open stack without the need for a multimillion-dollar budget.
