talk-data.com

Topic: Databricks
Tags: big_data · analytics · spark
1286 activities tagged

Activity trend: peak of 515 activities/quarter, 2020-Q1 to 2026-Q1

Activities: 1286 · Newest first

Migrate Your Existing DAGs to Databricks Workflows

In this session, you will learn the benefits of orchestrating your business-critical ETL and ML workloads within the lakehouse, as well as how to migrate and consolidate your existing workflows to Databricks Workflows - a fully managed lakehouse orchestration service that allows you to run workflows on any cloud. We’ll walk you through different migration scenarios and share lessons learned and recommendations to help you reap the benefits of orchestration with Databricks Workflows.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Migrating Complex SAS Processes to Databricks - Case Study

Many federal agencies use SAS software for critical operational data processes. While SAS has historically been a leader in analytics, data analysts have often used it for ETL purposes as well. However, modern data science demands on ever-increasing volumes and types of data require a shift to modern cloud architectures and data management tools and paradigms for ETL/ELT. In this presentation, we will provide a case study from the Centers for Medicare and Medicaid Services (CMS) detailing the approach and results of migrating a large, complex legacy SAS process to modern, open-source/open-standard technology - Spark SQL and Databricks. The migration produced results ~75% faster, without reliance on proprietary constructs of the SAS language, with more scalability, and in a manner that more easily ingests existing rules and better governs the inclusion of new rules and data definitions. The session also describes the significant technical and business benefits derived from this modernization effort.


ML on the Lakehouse: Bringing Data and ML Together to Accelerate AI Use Cases

Discover the latest innovations from Databricks that can help you build and operationalize the next generation of machine learning solutions. This session will dive into Databricks Machine Learning, a data-centric AI platform that spans the full machine learning lifecycle - from data ingestion and model training to production MLOps. You'll learn about key capabilities that you can leverage in your ML use cases and see the product in action. You will also directly hear how Databricks ML is being used to maximize supply chain logistics and keep millions of Coca-Cola products on the shelf.


MLOps at DoorDash

MLOps is one of the most widely discussed topics in the ML practitioner community. Streamlining ML development and productionizing ML are essential ingredients in realizing the power of ML, but they require a vast and complex infrastructure: the ROI of an ML project begins only once it is in production. The journey to implementing MLOps is unique to each company. At DoorDash, we’ve been applying MLOps for a couple of years to support a diverse set of ML use cases and to perform large-scale predictions at low latency.

This session will share our approach to MLOps, as well as some of the learnings and challenges. In addition, it will share some details about the DoorDash ML stack, which consists of a mixture of homegrown solutions, open source solutions and vendor solutions like Databricks.


MLOps on Databricks: A How-To Guide

As companies roll out ML pervasively, operational concerns become the primary source of complexity. Machine Learning Operations (MLOps) has emerged as a practice to manage this complexity. At Databricks, we see firsthand how customers develop their MLOps approaches across a huge variety of teams and businesses. In this session, we will show how your organization can build robust MLOps practices incrementally. We will unpack general principles which can guide your organization’s decisions for MLOps, presenting the most common target architectures we observe across customers. Combining our experiences designing and implementing MLOps solutions for Databricks customers, we will walk through our recommended approaches to deploying ML models and pipelines on Databricks. You will come away with a deeper understanding of how to scale deployment of ML models across your organization, as well as a practical, coded example illustrating how to implement an MLOps workflow on Databricks.


Modern Architecture of a Cloud-Enabled Data and Analytics Platform

In today’s modern IT organization, whether we are delivering a sophisticated analytical model, making a product advancement decision, or understanding the behavior of a customer, the fact remains that in every instance we rely on data to make good, informed decisions. Given this backdrop, an architecture that supports efficiently collecting data from a wide range of sources within the company remains an important goal of all data organizations.

In this session we will explain how Bayer has deployed a hybrid data platform that strives to integrate key legacy data systems of the past while taking full advantage of what a modern cloud data platform has to offer in terms of scalability and flexibility. We will elaborate on the use of its most significant component, Databricks, which provides not only a sophisticated data pipelining solution but also a complete ecosystem for teams to create data and analytical solutions in a flexible and agile way.


Monitoring and Quality Assurance of Complex ML Deployments via Assertions

Machine Learning (ML) is increasingly being deployed by teams in complex situations. While much research effort has focused on the training and validation stages, other parts of the deployment lifecycle have been neglected by the research community.

In this talk, Daniel Kang will describe two abstractions (model assertions and learned observation assertions) that allow users to input domain knowledge to find errors at deployment time and in labeling pipelines. He will show real-world errors in labels and ML models deployed in autonomous vehicles, visual analytics, and ECG classification that these abstractions can find. He will further describe how they can be used to improve model quality by up to 2x at a fixed labeling budget. This work is being conducted jointly with researchers from Stanford University and the Toyota Research Institute.
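To make the idea concrete, here is a minimal sketch of what a model assertion can look like in plain Python. This is a hypothetical example, not the authors' actual API: a "flickering" assertion flags frames where an object detector's output disappears for a single frame, a pattern that usually indicates a missed detection.

```python
def flickering_assertion(frame_detections):
    """Flag frames where an object class present before and after is missing,
    which often indicates a missed detection rather than reality.
    frame_detections: list of per-frame lists of detected class names."""
    errors = []
    for i in range(1, len(frame_detections) - 1):
        prev = set(frame_detections[i - 1])
        cur = set(frame_detections[i])
        nxt = set(frame_detections[i + 1])
        # Classes seen both before and after, but not now, look like errors.
        flickered = (prev & nxt) - cur
        if flickered:
            errors.append((i, sorted(flickered)))
    return errors

# A car vanishes for exactly one frame -> flagged for review or relabeling.
frames = [["car"], ["car", "person"], ["person"], ["car", "person"]]
print(flickering_assertion(frames))  # [(2, ['car'])]
```

Flagged frames can then be routed to labelers, which is how assertions tie into the fixed-labeling-budget improvements described above.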


Mosaic: A Framework for Geospatial Analytics at Scale

In this session we’ll present Mosaic, a new Databricks Labs project with a geospatial flavour.

Mosaic provides users of Spark and Databricks with a unified framework for distributing geospatial analytics. Users can choose to employ existing Java-based tools such as JTS or Esri's Geometry API for Java and Mosaic will handle the task of parallelizing these tools' operations: e.g. efficiently reading and writing geospatial data and performing spatial functions on geometries. Mosaic helps users scale these operations by providing spatial indexing capabilities (using, for example, Uber's H3 library) and advanced techniques for optimising common point-in-polygon and polygon-polygon intersection operations.
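The spatial-indexing idea behind the point-in-polygon optimisation can be sketched in plain Python. This is illustrative only, using a simple square grid in place of H3 (Mosaic itself delegates the geometry operations to JTS or Esri's library): index points by cell, then run the exact test only on points whose cell overlaps the polygon's bounding box.

```python
def cell_of(x, y, size):
    """Map a coordinate to its grid cell (a crude stand-in for an H3 index)."""
    return (int(x // size), int(y // size))

def point_in_polygon(x, y, poly):
    """Ray-casting test for a point against a polygon given as [(x, y), ...]."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            if x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
                inside = not inside
    return inside

def indexed_pip(points, poly, size=1.0):
    """Bucket points by grid cell, then test only points in cells that
    overlap the polygon's bounding box, skipping the rest entirely."""
    index = {}
    for p in points:
        index.setdefault(cell_of(p[0], p[1], size), []).append(p)
    xs, ys = [v[0] for v in poly], [v[1] for v in poly]
    cmin = cell_of(min(xs), min(ys), size)
    cmax = cell_of(max(xs), max(ys), size)
    hits = []
    for (cx, cy), pts in index.items():
        if cmin[0] <= cx <= cmax[0] and cmin[1] <= cy <= cmax[1]:
            hits.extend(p for p in pts if point_in_polygon(p[0], p[1], poly))
    return hits

square = [(0, 0), (2, 0), (2, 2), (0, 2)]
pts = [(1, 1), (5, 5), (1.5, 0.5), (3, 1)]
print(sorted(indexed_pip(pts, square)))  # [(1, 1), (1.5, 0.5)]
```

In a distributed setting this is what makes the join scalable: the cell id becomes a join key, so most candidate pairs are pruned before any exact geometry test runs.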

The development of Mosaic builds upon techniques developed with Ordnance Survey (the central hub for geospatial data across UK Government) and described in this blog post: https://databricks.com/blog/2021/10/11/efficient-point-in-polygon-joins-via-pyspark-and-bng-geospatial-indexing.html


Multimodal Deep Learning Applied to E-commerce Big Data

At Mirakl, we empower marketplaces with Artificial Intelligence solutions. Catalog data is an extremely rich source of information about e-commerce sellers' and marketplaces' products, including images, descriptions, brands, prices, and attributes (for example, size, gender, material, or color). Such big volumes of data are well suited for training multimodal deep learning models, and they present several technical Machine Learning and MLOps challenges to tackle.

We will dive deep into two key use cases: deduplication and categorization of products. For categorization, the creation of quality multimodal embeddings plays a crucial role and is achieved through experimentation with transfer learning techniques on state-of-the-art models. Finding very similar or almost identical products among many millions can be a very difficult problem, and that is where our deduplication algorithm comes in, bringing a fast and computationally efficient solution.
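As an illustrative sketch (not Mirakl's actual algorithm), embedding-based deduplication can be reduced to its core: compare product embeddings with cosine similarity and group pairs above a threshold with union-find, so each cluster of near-identical products can keep one canonical entry.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def dedup(embeddings, threshold=0.95):
    """Group indices of near-duplicate embeddings via union-find."""
    parent = list(range(len(embeddings)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # O(n^2) pairwise pass for clarity; at catalog scale this is where
    # blocking or approximate nearest-neighbour search would come in.
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(embeddings)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

embs = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
print(dedup(embs))  # [[0, 1], [2]]
```

The quality of the multimodal embeddings is what makes the threshold meaningful, which is why the categorization and deduplication use cases reinforce each other.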

Furthermore, we will show how we handle big volumes of products using robust and efficient pipelines: Spark for distributed and parallel computing, TFRecords to stream and ingest data optimally across multiple machines while avoiding memory issues, and MLflow for tracking the experiments and metrics of our models.


Near Real-Time Analytics with Event Streaming, Live Tables, and Delta Sharing

Microservices is an increasingly popular architecture, much loved by application teams because it allows services to be developed and scaled independently. Data teams, though, often need a centralized repository where data from the different services comes together to be joined and aggregated. The data platform can serve as a single source of company facts, enable near real-time analytics, and secure the sharing of massive data sets across clouds.

A viable microservices ingestion pattern is Change Data Capture (CDC), using AWS Database Migration Service or Debezium. CDC proves to be a scalable solution ideal for stable platforms, but it has several challenges for evolving services: frequent schema changes, complex unsupported DDL during migration, and automated deployments are but a few. An event streaming architecture can address these challenges.

Confluent, for example, provides a schema registry service where all services can register their event schemas. Schema registration helps verify that events are published according to the agreed contracts between data producers and consumers. It also provides a separation between internal service logic and the data consumed downstream. Services write their events to Kafka using the registered schemas, to a topic determined by the type of the event.

Data teams can leverage Spark jobs to ingest Kafka topics into Bronze tables in the Delta Lake. On ingestion, the registered schema from the schema registry is used to validate events against the provided schema version. A merge operation is sometimes applied to translate events into the final states of the records, per business requirements.
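The validate-then-merge step can be sketched conceptually in plain Python. The registry format, subject names, and field names below are all hypothetical; in practice the validation would run against the schema registry and the merge would be a Delta MERGE INTO on the Bronze table.

```python
# Hypothetical registry: (subject, version) -> set of required fields.
REGISTRY = {
    ("orders", 1): {"order_id", "status", "updated_at"},
}

def validate(event, subject, version):
    """Reject events that violate the registered schema contract."""
    missing = REGISTRY[(subject, version)] - event.keys()
    if missing:
        raise ValueError(f"event missing fields: {sorted(missing)}")
    return event

def merge_latest(events):
    """Keep only the newest event per key, like MERGE ... WHEN MATCHED UPDATE."""
    state = {}
    for e in events:
        key = e["order_id"]
        if key not in state or e["updated_at"] > state[key]["updated_at"]:
            state[key] = e
    return state

stream = [
    {"order_id": 1, "status": "created", "updated_at": 1},
    {"order_id": 1, "status": "shipped", "updated_at": 2},
    {"order_id": 2, "status": "created", "updated_at": 1},
]
valid = [validate(e, "orders", 1) for e in stream]
print(merge_latest(valid)[1]["status"])  # shipped
```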

Data teams can take advantage of Delta Live Tables on streaming datasets to produce Silver and Gold tables in near real time. Each input data source also has a set of expectations to enforce data quality and business rules. The pipeline allows Engineering and Analytics to collaborate by mixing Python and SQL. The refined data sets are then fed into AutoML for discovery and baseline modeling.
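The "expectations" idea reduces to named predicates over rows, where failing rows are dropped and counted so data quality stays visible per rule. The sketch below is plain Python for illustration, not the Delta Live Tables API (which expresses the same thing with decorators such as expect_or_drop):

```python
def apply_expectations(rows, expectations):
    """Keep rows passing every rule; count drops per rule for observability."""
    kept = []
    dropped = {name: 0 for name in expectations}
    for row in rows:
        failed = [name for name, pred in expectations.items() if not pred(row)]
        for name in failed:
            dropped[name] += 1
        if not failed:
            kept.append(row)
    return kept, dropped

rules = {
    "valid_amount": lambda r: r["amount"] > 0,
    "has_user": lambda r: r.get("user_id") is not None,
}
rows = [
    {"amount": 10, "user_id": "a"},
    {"amount": -5, "user_id": "b"},
    {"amount": 3, "user_id": None},
]
kept, dropped = apply_expectations(rows, rules)
print(len(kept), dropped)  # 1 {'valid_amount': 1, 'has_user': 1}
```

The per-rule counters are the point: they turn silent data loss into a metric the pipeline can alert on.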

To expose Gold tables to more consumers, especially non-Spark users across clouds, data teams can implement Delta Sharing. Recipients can access Silver tables from a different cloud and build their own analytics data sets. Analytics teams can also access Gold tables via the pandas Delta Sharing client and BI tools.


Nixtla: Deep Learning for Time Series Forecasting

Time series forecasting has a wide range of applications: finance, retail, healthcare, IoT, etc. Recently, deep learning models such as ESRNN or N-BEATS have proven to deliver state-of-the-art performance on these tasks. nixtlats is a Python library we have developed to make these state-of-the-art models accessible to data scientists and developers so that they can use them in production environments. Written in PyTorch, its design focuses on usability and reproducibility of experiments. For this purpose, nixtlats has several modules:

  • Data: datasets from various time series competitions.
  • Models: state-of-the-art model implementations.
  • Evaluation: loss functions and evaluation metrics.

Objectives:

  • Introduce attendees to the challenges of time series forecasting with deep learning.
  • Present commercial applications of time series forecasting.
  • Describe nixtlats, its components, and best practices for training and deploying state-of-the-art models in production.
  • Reproduce state-of-the-art results with nixtlats using the winning model of the M4 time series competition (ESRNN).

Project repository: https://github.com/Nixtla/nixtlats.
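For orientation, a minimal baseline in the spirit of the M4 evaluation can be written in a few lines (a sketch only; nixtlats provides the full models and metrics): a seasonal-naive forecast scored with sMAPE, the headline metric of the M4 competition.

```python
def seasonal_naive(series, season, horizon):
    """Forecast by repeating the last full season of observations."""
    last = series[-season:]
    return [last[i % season] for i in range(horizon)]

def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent."""
    terms = [abs(f - a) / ((abs(a) + abs(f)) / 2)
             for a, f in zip(actual, forecast) if abs(a) + abs(f) > 0]
    return 100 * sum(terms) / len(actual)

history = [10, 20, 30, 12, 22, 32]  # season length 3
fcst = seasonal_naive(history, season=3, horizon=3)
print(fcst)  # [12, 22, 32]
print(round(smape([13, 21, 33], fcst), 2))
```

Models such as ESRNN are judged by how far below baselines like this they can push the error.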


Opening the Floodgates: Enabling Fast, Unmediated End User Access to Trillion-Row Datasets with SQL

Spreadsheets revolutionized IT by giving end users the ability to create their own analytics. Providing direct end user access to trillion-row datasets generated in financial markets or digital marketing is much harder. New SQL data warehouses like ClickHouse and Druid can provide fixed latency with constant cost on very large datasets, which opens up new possibilities.

Our talk walks through recent experience on analytic apps developed by ClickHouse users that enable end users like market traders to develop their own analytics directly off raw data. We’ll cover the following topics.

  1. Characteristics of new open source column databases and how they enable low-latency analytics at constant cost.

  2. Idiomatic ways to validate new apps by building MVPs that support a wide range of queries on source data: storing source JSON, schema design, applying compression to columns, and building indexes for needle-in-a-haystack queries.

  3. Incrementally identifying hotspots and applying easy optimizations to bring query performance into line with long-term latency and cost requirements.

  4. Methods of building accessible interfaces, including traditional dashboards, imitating existing APIs that are already known, and creating app-specific visualizations.
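The "indexes for needle-in-a-haystack queries" idea from point 2 can be sketched generically: an inverted index from token to row ids lets a selective lookup touch only matching rows instead of scanning the full dataset. (Illustrative plain Python; ClickHouse implements the equivalent internally with skip indexes and bloom filters.)

```python
from collections import defaultdict

def build_index(rows, field):
    """Map each whitespace token in rows[field] to the set of row ids."""
    index = defaultdict(set)
    for rid, row in enumerate(rows):
        for token in row[field].split():
            index[token].add(rid)
    return index

def lookup(rows, index, token):
    """Fetch only the rows containing the token, in row order."""
    return [rows[rid] for rid in sorted(index.get(token, ()))]

rows = [
    {"msg": "login failed for admin"},
    {"msg": "payment ok"},
    {"msg": "login ok for alice"},
]
idx = build_index(rows, "msg")
print([r["msg"] for r in lookup(rows, idx, "login")])
```

On a trillion-row dataset the same principle is what turns a full scan into a bounded read, which is how fixed-latency queries stay at constant cost.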

We’ll finish by summarizing a few of the benefits we’ve observed and also touch on ways that analytic infrastructure could be improved to make end user access even more productive. The lessons are as general as possible so that they can be applied across a wide range of analytic systems, not just ClickHouse.


Optimizing Speed and Scale of User-Facing Analytics Using Apache Kafka and Pinot

Apache Kafka is the de facto standard for real-time event streaming, but what do you do if you want to perform user-facing, ad-hoc, real-time analytics too? That's where Apache Pinot comes in.

Apache Pinot is a real-time distributed OLAP datastore used to deliver scalable real-time analytics with low latency. It can ingest data from batch data sources (S3, HDFS, Azure Data Lake, Google Cloud Storage) as well as streaming sources such as Kafka. Pinot is used extensively at LinkedIn and Uber to power many analytical applications, such as Who Viewed My Profile, Ad Analytics, Talent Analytics, Uber Eats, and many more, serving 100k+ queries per second while ingesting 1 million+ events per second.

Apache Kafka's highly performant, distributed, fault-tolerant, real-time publish-subscribe messaging platform powers big data solutions at Airbnb, LinkedIn, MailChimp, Netflix, the New York Times, Oracle, PayPal, Pinterest, Spotify, Twitter, Uber, Wikimedia Foundation, and countless other businesses.

Come hear from Neha Power, Founding Engineer at StarTree and a PMC member and committer of Apache Pinot, and Karin Wolok, Head of Developer Community at StarTree, for an introduction to both systems and a view of how they work together.


Orchestration Made Easy with Databricks Workflows

Orchestrating and managing end-to-end production pipelines has remained a bottleneck for many organizations. Data teams spend too much time stitching pipeline tasks together and manually managing and monitoring the orchestration process, with heavy reliance on external or cloud-specific orchestration solutions, all of which slows down the delivery of new data. In this session, we introduce you to Databricks Workflows: a fully managed orchestration service for all your data, analytics, and AI, built into the Databricks Lakehouse Platform. Join us as we dive deep into the new workflow capabilities and the integration with the underlying platform. You will learn how to create and run reliable production workflows, centrally manage and monitor them, and implement recovery actions such as repair and run, along with other new features.


OvalEdge: End-To-End Data Governance

OvalEdge presents a progressive solution for Data Governance and is the only platform that provides an end-to-end data governance experience. Data Governance is all about access, data literacy, lineage, better business processes, data privacy and compliance controls, and data quality. What makes OvalEdge successful is having all of these features in a central platform that is accessible and beneficial for all data users.


Scaling ML at CashApp with Tecton

This is a joint talk given by CashApp and Tecton. CashApp’s mobile payment product has stringent technical requirements: scale, reliability, speed. ML-based recommendations are at the core of this service and pose a significant engineering challenge. This talk describes CashApp’s journey through various generations of its core ML capabilities, covering the technical and organizational challenges associated with building large-scale production recommendation systems. The talk finishes with a look at the latest generation of CashApp’s ML platform and highlights how Tecton’s real-time Feature Platform helps CashApp deliver world-class recommendations.


Accidentally Building a Petabyte-Scale Cybersecurity Data Mesh in Azure With Delta Lake at HSBC

Due to the unique cybersecurity challenges that HSBC faces daily - from high data volumes to untrustworthy sources to the privacy and security restrictions of a highly regulated industry - the resulting architecture was an unwieldy set of disparate data silos. So, how do we build a cybersecurity advanced analytics environment to enrich and transform these myriad data sources into a unified, well-documented, robust, resilient, repeatable, scalable, maintainable platform that will empower the cyber analysts of the future, while at the same time remaining cost-effective and enabling everyone from the less-technical junior reporting user to the senior machine learning engineer?

In this session, Ryan Harris, Principal Cybersecurity Engineer at HSBC, dives into the infrastructure and architecture employed, ranging from landing zone concepts, secure access workstations, data lake structure, and isolated data ingestion to the enterprise integration layer. In the process of building the data pipelines and lakehouses, we ended up building a hybrid data mesh leveraging Delta Lake. The result is a flexible, secure, self-service environment that is unlocking the capabilities of our people.


An Advanced S3 Connector for Spark to Hunt for Cyber Attacks

Working with S3 is different from working with HDFS: the architecture of the object store makes the standard Spark file connector inefficient with S3.

One way to tackle this problem is a message queue listening for changes in a bucket. But what if an additional message queue is not an option and you need to use Spark Streaming? You can use the standard file connector, but you quickly face performance degradation as the number of files in the source path grows.

We have seen this happen at Hunters, a security operations platform that works with a wide range of data sources.

We want to share a description of the problem and the solution, which we will open-source. The audience will learn how to configure it and make the best use of it. We will also discuss how to use metadata to boost the performance of discovering new files in the stream, and show a use case of utilizing the time metadata of CloudTrail to efficiently collect logs for hunting cyber attacks.
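The time-metadata idea rests on the fact that CloudTrail object keys embed the event date (e.g. .../CloudTrail/us-east-1/2023/06/01/log.json.gz), so instead of listing the whole bucket you can enumerate only the date prefixes newer than your checkpoint. The sketch below simulates this in plain Python; a real implementation would issue one list-objects call per generated prefix, and the prefix layout shown is illustrative.

```python
from datetime import date, timedelta

def date_prefixes(base, since, until):
    """Yield per-day key prefixes from the checkpoint date to 'until'."""
    d = since
    while d <= until:
        yield f"{base}/{d.year:04d}/{d.month:02d}/{d.day:02d}/"
        d += timedelta(days=1)

def discover(keys, base, since, until):
    """Return only keys under the recent date prefixes, skipping old data."""
    prefixes = list(date_prefixes(base, since, until))
    return [k for k in keys if any(k.startswith(p) for p in prefixes)]

keys = [
    "CloudTrail/us-east-1/2023/05/31/a.json.gz",
    "CloudTrail/us-east-1/2023/06/01/b.json.gz",
    "CloudTrail/us-east-1/2023/06/02/c.json.gz",
]
new = discover(keys, "CloudTrail/us-east-1", date(2023, 6, 1), date(2023, 6, 2))
print(new)  # only the June files
```

Because the number of prefixes grows with the checkpoint window rather than with the total file count, discovery cost stays flat as the bucket ages.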


Announcing General Availability of Databricks Terraform Provider

We all live in exciting times, amid the hype of the Distributed Data Mesh (or just mess). This talk will cover a couple of architectural and organizational approaches to achieving a Distributed Data Mesh, which is essentially a combination of mindset, fully automated infrastructure, continuous integration for data pipelines, dedicated team collaboration environments, and security enforcement. As a data leader, you'll learn what to pay attention to when starting (or reviving) a modern Data Engineering and Data Science strategy, and how Databricks Unity Catalog may help you automate it. As a DevOps engineer, you'll learn the best practices and pitfalls of continuous deployment on Databricks with Terraform and continuous integration with Databricks Repos, and how you can automate data security with Unity Catalog and Terraform. As a data scientist, you'll learn how you can get relevant infrastructure into "production" faster.
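As a flavour of the infrastructure-as-code approach, here is a minimal Terraform sketch using the Databricks provider; the resource name and variable names are illustrative, not a recommended production configuration.

```hcl
terraform {
  required_providers {
    databricks = {
      source = "databricks/databricks"
    }
  }
}

provider "databricks" {
  host = var.workspace_url
}

# Declaratively manage a shared autoscaling cluster (values illustrative).
resource "databricks_cluster" "shared" {
  cluster_name            = "shared-autoscaling"
  spark_version           = var.spark_version
  node_type_id            = var.node_type
  autotermination_minutes = 20

  autoscale {
    min_workers = 1
    max_workers = 4
  }
}
```

The same pattern extends to jobs, repos, and Unity Catalog grants, which is what makes fully automated, reviewable deployments possible.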


Apache Arrow Flight SQL: High Performance, Simplicity, and Interoperability for Data Transfers

Network protocols for transferring data generally have one of two problems: they're slow for large data transfers but have simple APIs (e.g., JDBC), or they're fast for large data transfers but have complex APIs specific to the system. Apache Arrow Flight addresses the former by providing high-performance data transfers, and half of the latter by defining a standard API independent of any one system. However, while the Arrow Flight API is performant and an open standard, it can be more complex to use than simpler APIs like JDBC.

Arrow Flight SQL rounds out the solution, providing both great performance and a simple universal API.

In this talk, we’ll show the performance benefits of Arrow Flight, the client difference between interacting with Arrow Flight and Arrow Flight SQL, and an overview of a JDBC driver built on Arrow Flight SQL, enabling clients to take advantage of this increased performance with zero application changes.
