talk-data.com

Event

Databricks Data + AI Summit 2023

2026-01-11 · YouTube

Activities tracked

582

Sessions & talks

Showing 551–575 of 582 · Newest first

Optimizing Incremental Ingestion in the Context of a Lakehouse

2022-07-19 · Watch video

Incremental ingestion of data is often trickier than one would assume, particularly when it comes to maintaining data consistency: for example, specific challenges arise depending on whether the data is ingested in a streaming or a batched fashion. In this session we share the real-life challenges we encountered when setting up an incremental ingestion pipeline in the context of a Lakehouse architecture.

In this session we outline how we used recently introduced Databricks features, such as Auto Loader and Change Data Feed, alongside more mature features, such as Spark Structured Streaming and the Trigger Once functionality. Together, these allowed us to transform batch processes into a “streaming” setup without requiring a cluster to run continuously. This setup, which we are keen to share with the community, does not require reloading large amounts of data, and therefore represents a computationally, and consequently economically, cheaper solution.

In our presentation we dive deeper into each aspect of the setup, with extra focus on essential Auto Loader functionalities such as schema inference, recovery mechanisms and file discovery modes.
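For readers curious what this pattern looks like in practice, here is a minimal PySpark sketch of an Auto Loader stream combined with Trigger Once; the paths and table name are hypothetical placeholders, and the cloudFiles source assumes a Databricks runtime:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Auto Loader incrementally discovers new files via the cloudFiles source;
# the schemaLocation option persists inferred-schema state between runs.
stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/landing/_schemas/orders")
    .load("/mnt/landing/orders")  # hypothetical landing path
)

# trigger(once=True) processes everything new since the last checkpoint and
# then stops, so no cluster has to stay up between runs.
(
    stream.writeStream
    .option("checkpointLocation", "/mnt/bronze/_checkpoints/orders")
    .trigger(once=True)
    .toTable("bronze.orders")  # hypothetical Delta target table
)
```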

Partner Connect & Ecosystem Strategy

2022-07-19 · Watch video
Zaheera Valani (Databricks) , Francois Ajenstat , George Fraser (Fivetran)

Data + AI Summit Keynotes from:
Partner Connect & Ecosystem Strategy (Zaheera Valani)
What are ELT and CDC, and why are all the cool kids doing it? (George Fraser)
Analytics without Compromise (Francois Ajenstat)

Polars: Blazingly Fast DataFrames in Rust and Python

2022-07-19 · Watch video

This talk will introduce Polars, a blazingly fast DataFrame library written in Rust on top of Apache Arrow. It's a DataFrame library that brings exploratory data analysis closer to the lessons learned in database research.

Today's CPUs come with many cores, and their superscalar designs and SIMD registers allow for even more parallelism. Polars is written from the ground up to fully utilize CPUs of this generation.

Besides blazingly fast algorithms, a cache-efficient memory layout and multi-threading, it includes a lazy query engine, allowing Polars to apply several optimizations that may improve query time and memory usage.
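As a rough illustration of that lazy engine, here is a small Polars sketch; the file and column names are made up for the example:

```python
import polars as pl

# scan_csv is lazy: it builds a query plan instead of reading the file, so
# the optimizer can push the filter and projection down to the scan.
# (Older Polars releases spell group_by as groupby.)
lazy_plan = (
    pl.scan_csv("trips.csv")
    .filter(pl.col("distance_km") > 2.0)
    .group_by("passenger_count")
    .agg(pl.col("fare").mean().alias("avg_fare"))
)

df = lazy_plan.collect()  # executes the optimized, multi-threaded plan
print(df)
```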

Read more:

https://github.com/pola-rs/polars
https://www.ritchievink.com/blog/2021/02/28/i-wrote-one-of-the-fastest-dataframe-libraries/

Join the talk to learn more.

Powering Geospatial Data Science with Graph Machine Learning

2022-07-19 · Watch video

At Iggy we provide easy access to hundreds of geospatial features to help companies make sense of ‘place’. We believe that incorporating ‘place’ into data science and machine learning pipelines can have a huge impact on predictive capabilities in a wide range of fields such as travel, real estate, healthcare, logistics and many more.

Traditionally this data was accessible in a tabular form, but recently we have been experimenting with converting our tabular data into a graph representation, and applying graph machine learning to build derived products. This allows us to leverage the power of graphs and to more effectively model the relationships between different entities in our data.
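As a hedged, hypothetical illustration of that conversion (not Iggy's actual pipeline), tabular place records can become graph nodes with proximity edges, which a graph ML model would then consume:

```python
import math
import networkx as nx

# Hypothetical tabular rows: (place_id, lat, lon).
places = [
    ("cafe_1", 40.7128, -74.0060),
    ("park_1", 40.7138, -74.0065),
    ("gym_1", 40.7306, -73.9866),
]

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in km."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

g = nx.Graph()
for pid, lat, lon in places:
    g.add_node(pid, lat=lat, lon=lon)

# Edges connect places within 1 km; a GNN could then learn over this graph.
for i, (id1, lat1, lon1) in enumerate(places):
    for id2, lat2, lon2 in places[i + 1:]:
        d = haversine_km(lat1, lon1, lat2, lon2)
        if d < 1.0:
            g.add_edge(id1, id2, distance_km=d)

print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")
```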

In this talk we will present:
1. What is geospatial data?
2. Why are we interested in graph representations of geospatial data?
3. What do our graph representations look like?
4. How are we applying graph machine learning?
5. What are some use cases and derived products that we are building using graph machine learning?

Presto 101: An Introduction to Open Source Presto

2022-07-19 · Watch video

Presto is a widely adopted distributed SQL engine for data lake analytics. With Presto, you can perform ad hoc querying of data in place, which helps reduce both the time to discover data and the time it takes to do ad hoc analysis. Additionally, new features like the disaggregated coordinator, Presto-on-Spark, scan optimizations, a reusable native engine, and a Pinot connector enable added benefits around performance, scale, and ecosystem.

In this session, Philip and Rohan will introduce the Presto technology and share why it's becoming so popular: in fact, companies like Facebook, Uber, Twitter, Alibaba and many more use Presto for interactive ad hoc queries, reporting and dashboarding on data lake analytics, and much more. We'll also show a quick demo of getting Presto running on AWS.
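For a flavor of ad hoc querying from Python, here is a minimal sketch using the presto-python-client package; the host, catalog and table names are placeholders:

```python
import prestodb  # presto-python-client package

conn = prestodb.dbapi.connect(
    host="localhost", port=8080, user="demo",
    catalog="hive", schema="default",
)
cur = conn.cursor()
# Query the data in place on the lake; no separate loading step required.
cur.execute("SELECT region, count(*) AS n FROM events GROUP BY region")
for row in cur.fetchall():
    print(row)
```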

Privacy Preserving Machine Learning and Big Data Analytics Using Apache Spark

2022-07-19 · Watch video

In recent years, new privacy laws and regulations have brought a fundamental shift in the protection of data and privacy, placing new challenges before data applications. To resolve these privacy and security challenges in the big data ecosystem without impacting existing applications, several hardware TEE (Trusted Execution Environment) solutions have been proposed for Apache Spark, e.g., PySpark with SCONE, and Opaque. However, to the best of our knowledge, none of them provides full protection for the data pipelines in Spark applications: an adversary may still obtain sensitive information from unprotected components and stages. Furthermore, some of them greatly narrow the range of supported applications, e.g., supporting only Spark SQL.

In this presentation, we will present a new PPMLA (privacy preserving machine learning and analytics) solution built on top of Apache Spark, BigDL, Occlum and Intel SGX. It ensures all Spark components and pipelines are fully protected by Intel SGX, and existing Spark applications written in Scala, Java or Python can be migrated to our platform without any code change. We will demonstrate how to build distributed end-to-end SparkML/Spark SQL workloads with our solution in an untrusted cloud environment, and share real-world use cases for PPMLA.

Revolutionizing agriculture with AI: Delivering smart industrial solutions built upon a Lakehouse

2022-07-19 · Watch video

John Deere is leveraging big data and AI to deliver ‘smart’ industrial solutions that are revolutionizing agriculture and construction, driving sustainability and ultimately helping to feed the world. The John Deere Data Factory that is built upon the Databricks Lakehouse Platform is at the core of this innovation. It ingests petabytes of data and trillions of records to give data teams fast, reliable access to standardized data sets supporting 100s of ML and analytics use cases across the organization. From IoT sensor-enabled equipment driving proactive alerts that prevent failures, to precision agriculture that maximizes field output, to optimizing operations in the supply chain, finance and marketing, John Deere is providing advanced products, technology and services for customers who cultivate, harvest, transform, enrich, and build upon the land.

ROAPI: Serve Not So Big Data Pipeline Outputs Online with Modern APIs

2022-07-19 · Watch video

Data is the key component of any analytics, AI or ML platform. Organizations may not be successful without a platform that can source, transform, quality-check and present data in a reportable format that drives actionable insights.

This session will focus on how the Capital One HR team built a low-cost data movement ecosystem that can source data, transform it at scale and build data storage (Redshift) in a form that can be easily consumed by AI/ML programs, using AWS services in combination with open-source software (Spark) and Enterprise Edition Hydrograph (a UI-based ETL tool with Spark as the backend). This presentation mainly demonstrates the flexibility that Apache Spark provides for various types of ETL data pipelines.

We have been running three types of pipelines for over six years, with 400+ nightly batch jobs for about $1,000/month: (1) Spark on EC2, (2) a UI-based ETL tool with a Spark backend (on the same EC2) and (3) Spark on EMR. We have a CI/CD pipeline that supports easy integration and code deployment in all non-prod and prod regions (it even supports automated unit testing). We will also demonstrate how this ecosystem can fail over to a different region in less than 15 minutes, making our application highly resilient.

Scalable XGBoost on GPU Clusters

2022-07-19 · Watch video

XGBoost is a popular open-source implementation of gradient boosting tree algorithms. In this talk, we walk through some of the new features in XGBoost that help us train better models, and explain how to scale up the pipeline to larger datasets with GPU clusters.

It is challenging to train gradient boosting models with the growing size and complexity of data. The latest XGBoost introduces categorical data support to help data scientists work with non-numerical data without the need for encoding. It can also train multi-output models to handle datasets with non-exclusive class labels and multi-target regression. XGBoost has also introduced a new AUC implementation that supports more model types and features a robust approximation in distributed environments.

The latest XGBoost has significantly improved its built-in GPU support for scalability and performance. Data loading and processing have been optimized for memory efficiency, enabling users to handle larger datasets, and GPU-based model training is over 2x faster than in past versions. The performance improvement extends to model explanation as well: XGBoost added GPU-based SHAP value computation, obtaining more than a 10x speedup over the traditional CPU-based method. On Spark GPU clusters, end-to-end pipelines can now be accelerated on GPUs, from feature engineering in ETL to model training and inference in XGBoost.
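A hedged sketch of these features using the open-source xgboost Python package follows; the synthetic data is for illustration only, and a CUDA-capable GPU is assumed:

```python
import numpy as np
import pandas as pd
import xgboost as xgb

rng = np.random.default_rng(0)
X = pd.DataFrame({
    # pd.Categorical marks the column as categorical for XGBoost.
    "color": pd.Categorical(rng.choice(["red", "green", "blue"], size=1000)),
    "size": rng.normal(size=1000),
})
y = rng.integers(0, 2, size=1000)

clf = xgb.XGBClassifier(
    tree_method="gpu_hist",   # GPU-accelerated histogram algorithm
    enable_categorical=True,  # native categorical support, no manual encoding
    n_estimators=100,
)
clf.fit(X, y)

# GPU-accelerated SHAP values (per-feature contributions) via the booster.
booster = clf.get_booster()
booster.set_param({"predictor": "gpu_predictor"})
shap_values = booster.predict(
    xgb.DMatrix(X, enable_categorical=True), pred_contribs=True
)
```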

We will walk through these XGBoost improvements with the newly released XGBoost packages from DMLC. Benchmark results will be shared. Example applications and notebooks will be provided for audiences to learn these new features on the cloud.

Self-Serve, Automated and Robust CDC pipeline using AWS DMS, DynamoDB Streams and Databricks Delta

2022-07-19 · Watch video

Many companies are trying to solve the challenges of ingesting transactional data into a data lake while dealing with late-arriving updates and deletes.

To address this at Swiggy, we built a CDC (Change Data Capture) system, an incremental processing framework that powers all business-critical data pipelines at low latency and high efficiency.

It offers:
- Freshness: it operates in near real time with configurable latency requirements.
- Performance: optimized read and write performance through tuned compaction parameters, partitioning and Delta table optimization.
- Consistency: it supports reconciliation based on transaction types, applying inserts, updates and deletes to existing data.

To implement this system, AWS DMS helped us with initial bootstrapping and CDC replication for MySQL sources, while AWS Lambda and DynamoDB Streams solved bootstrapping and CDC replication for DynamoDB sources.

After setting up the bootstrap and CDC replication process, we used Databricks Delta merge to reconcile the data based on the transaction types.

To support the merge we implemented several supporting features (see the sketch after this list):
- Deduplicating multiple mutations of the same record using log offset and timestamp.
- Choosing optimal partitioning for the data set.
- Inferring the schema and applying proper, backward-compatible schema evolution.
- Extending the Delta table snapshot generation technique to create consistent partitions for partitioned Delta tables.
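A minimal PySpark sketch of the reconciliation step is below; the table and column names are hypothetical, not Swiggy's actual schema:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# CDC output landed by DMS / DynamoDB Streams (hypothetical path and schema:
# order_id, op in {'I','U','D'}, log_offset, event_ts, payload columns).
cdc = spark.read.format("parquet").load("/mnt/cdc/orders")

# Deduplicate: keep only the latest mutation per key, ordered by log offset
# and timestamp, as described above.
w = Window.partitionBy("order_id").orderBy(
    F.col("log_offset").desc(), F.col("event_ts").desc()
)
latest = cdc.withColumn("rn", F.row_number().over(w)).filter("rn = 1").drop("rn")

# Reconcile by transaction type with a Delta MERGE.
target = DeltaTable.forName(spark, "bronze.orders")
(
    target.alias("t")
    .merge(latest.alias("s"), "t.order_id = s.order_id")
    .whenMatchedDelete(condition="s.op = 'D'")
    .whenMatchedUpdateAll(condition="s.op = 'U'")
    .whenNotMatchedInsertAll(condition="s.op = 'I'")
    .execute()
)
```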

Finally, to read the data we use Spark SQL with the Hive metastore, and Snowflake. Delta tables read with Spark SQL have implicit support for the Hive metastore. We built our own implementation of a Snowflake sync process to create external tables, internal tables and materialized views on Snowflake.

Stats: 500M CDC logs/day, 600+ tables.

Simplify Global DataOps and MLOps Using Okta’s FIG Automation Library

2022-07-19 · Watch video

Think for a moment about an ML pipeline that you have created. Was it tedious to write? Did you have to familiarize yourself with technology outside your normal domain? Did you find many bugs? Did you give up with a “good enough” solution? Even simple ML pipelines are tedious, and with complex ones, even teams that include Data Engineers and ML Engineers still end up with delays and bugs.

Okta's FIG (Feature Infrastructure Generator) simplifies this with a configuration language for Data Scientists that produces scalable and correct ML pipelines, even highly complex ones. FIG is “just a library” in the sense that you can pip install it. Once installed, FIG will configure your AWS account, creating ETL jobs, workflows, and ML training and scoring jobs. Data Scientists then use FIG's configuration language to specify features and model integrations. With a single function call, FIG will run an ML pipeline to generate feature data, train models, and create scoring data. Feature generation is performed in a scalable, efficient, and temporally correct manner. Model training artifacts and scoring are automatically labeled and traced. This greatly simplifies the ML prototyping experience. Once it is time to productionize a model, FIG can use the same configuration to coordinate with Okta's deployment infrastructure to configure production AWS accounts, register build and model artifacts, and set up monitoring.

This talk will show a demo of using FIG in the development of Okta's next-generation security infrastructure. The demo includes a walkthrough of the configuration language and how it is translated into AWS during a prototyping session. The demo will also briefly cover how FIG interacts with Okta's deployment system to make productionization seamless.

Supercharging our data architecture at Coinbase using Databricks Lakehouse

2022-07-19 · Watch video
Eric Sun (Coinbase)

Coinbase is neither simply a finance company nor a tech company — it’s a crypto company. This distinction has big implications for how we work with the Blockchain, Product and Financial data that we need to drive our hypergrowth. We’ve recently enabled a Lakehouse architecture based upon Databricks to unify these complex and varied data sets, to deliver a high performance, continuous ingestion framework at an unprecedented scale. We can now support both ETL and ML workloads on one platform to deliver innovative batch and streaming use cases, and democratize data much faster by enabling teams to use the tools of their choice, while greatly reducing end-to-end latency and simplifying maintenance and operations. In this keynote, we will share our journey to the Lakehouse, and some of the lessons learned as we built an open data architecture at scale.

Time Series Forecasting with PyCaret

2022-07-19 · Watch video

PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that speeds up the experiment cycle exponentially and makes you more productive.

This presentation will demo the time series forecasting use case using PyCaret's new low-code time series forecasting module.
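A minimal sketch of that module is shown below; the airline dataset ships with PyCaret, so this should run as-is in a recent release:

```python
from pycaret.datasets import get_data
from pycaret.time_series import setup, compare_models, predict_model

data = get_data("airline")          # classic monthly air passengers series
setup(data, fh=12, session_id=42)   # hold out a 12-period forecast horizon
best = compare_models()             # train and rank many forecasters
print(predict_model(best))          # out-of-sample forecast for the horizon
```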

Towards a Modular Future: Reimagining and Rebuilding Kedro-viz for Visualizing Modular Pipelines

2022-07-19 · Watch video

Kedro is an open-source framework for creating portable pipelines through modular data science code, and provides a powerful interactive visualisation tool called ‘Kedro-Viz’, a web app that magically generates a rich, informative visualisation of the pipeline.

In 2020, the Kedro project introduced an important set of features to support modular pipelines, which allow users to set up a series of pipelines that are logically isolated and re-usable to form higher-level pipelines.

With this paradigm shift came the need to reimagine pipeline visualisation in Kedro-Viz, introducing a series of redesigns and new features to support this new representation of pipeline structure.

As a core contributor and team member on the Kedro-Viz project throughout the past year, I have witnessed this transition first-hand while shipping the core features for modular pipelines in Kedro-Viz.

This talk will focus on my experience as a front-end developer as I walk through the unique architecture and data ingestion setup for this project. I will deep-dive into the unique set of problems and assumptions we had to make in accommodating this new modular pipeline setup, and our approach to solving them within a front-end (React + Redux) context.

Needless to say, I will also share the mistakes and learnings along the way, and how they paved the path towards the app architecture choices for our next set of features in ML experiment tracking.

This talk is for the curious data practitioner who is up for exposure to a fresh set of problems beyond the typical data science domain, and for those who are up for a ride through the mind-boggling details of the unique setup of front-end development and data visualisation for data science.

Transforming Drug Discovery using Digital Biology

2022-07-19 · Watch video
Daphne Koller

Modern medicine has given us effective tools to treat some of the most significant and burdensome diseases. At the same time, it is becoming consistently more challenging and more expensive to develop new therapeutics. A key factor in this trend is that the drug development process involves multiple steps, each of which involves a complex and protracted experiment that often fails. We believe that, for many of these phases, it is possible to develop machine learning models to help predict the outcome of these experiments, and that those models, while inevitably imperfect, can outperform predictions based on traditional heuristics.

To achieve this goal, we are bringing together high-quality data from human cohorts, while also developing cutting-edge methods in high-throughput biology and chemistry that can produce massive amounts of in vitro data relevant to human disease and therapeutic interventions. Those are then used to train machine learning models that make predictions about novel targets, coherent patient segments, and the clinical effect of molecules. Our ultimate goal is to develop a new approach to drug development that uses high-quality data and ML models to design novel, safe, and effective therapies that help more people, faster, and at a lower cost.

UIMeta: A 10X Faster Cloud-Native Apache Spark History Server

2022-07-19 · Watch video

The Spark history server is an essential tool for monitoring, analyzing and optimizing Spark jobs.

The original history server is based on Spark's event log mechanism. A running Spark job continuously produces many kinds of events that describe the job's status. All the events are serialized into JSON and appended to a file, the event log. The history server has to replay the event log and rebuild the in-memory store needed for the UI. In a cluster, the history server also needs to periodically scan the event log directory and cache all the files' metadata in memory.
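For context, here is a minimal sketch of the standard event-log setup that the original history server relies on; the paths are placeholders, and any shared filesystem such as HDFS or S3 works:

```python
from pyspark.sql import SparkSession

# Enable the JSON event log that the history server later replays.
spark = (
    SparkSession.builder
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "hdfs:///spark-logs")
    .getOrCreate()
)

# The history server is then pointed at the same directory, e.g.:
#   SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs:///spark-logs" \
#     ./sbin/start-history-server.sh
```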

In practice, an event log contains much redundant information for a history server. A long-running application can produce a huge event log that is costly to maintain and slow to replay. In large-scale production, the sheer number of jobs places a heavy burden on history servers, and building a scalable history server service requires additional development.

In this talk, we introduce a new history server based on UIMeta. UIMeta is a wrapper around the KVStore objects needed by a Spark UI. A job produces a UIMeta log by serializing UIMeta in stages. A UIMeta log is approximately 10x smaller and 10x faster to replay than the original event log file. Benefiting from this performance, we developed a new stateless history server that needs no directory scan. UIMeta Service has replaced the original history server and now serves millions of jobs per day at ByteDance.

What to Know about Data Science and Machine Learning in 2022

2022-07-19 · Watch video
Peter Norvig (Google)

After writing an AI textbook in 2020 and a Data Science textbook in 2022, Peter Norvig reflects on how the way we teach and learn about these fields has changed in recent years. New data and new algorithms have become available, but more importantly, ethical and societal issues have become more prominent, and the question of what exactly we are trying to optimize is paramount.

Workflows & Delta Live Tables | Keynotes Data + AI Summit 2022

2022-07-19 · Watch video
Michael Armbrust (Databricks) , Stacy Kerkela (Databricks)

Automating Business Decisions Using Event Streams

2022-07-19 · Watch video

Today's real-time solutions demand continuousness, autonomy, and observability. Data streams have evolved to guarantee only continuousness; thus, streams alone will never satisfy this demand. Industries instead crave a properly end-to-end streaming architecture backing their applications and services, a concept that has narrowly evaded realization until now.

In this session, Rohit Bose will demonstrate how such architectures cleanly solve complex problems. This will require two parts:

  1. Building an industry-specific application that continuously generates insights and reports them over dynamically-scoped real-time streams
  2. Discussing the advantages and generalizations of the application's design

The demo will utilize the Swim platform to expose thousands of streaming APIs seeded by an Apache Kafka firehose, enabling both real-time map visualizations and decision-making clients to instantly observe changes across distributed entities with zero unnecessary subscriptions.
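Swim's own APIs are not sketched here; purely as a hedged illustration of consuming the kind of Kafka firehose that seeds the demo, a generic kafka-python consumer might look like this (the topic and brokers are made up):

```python
import json
from kafka import KafkaConsumer  # kafka-python package

# Hypothetical topic and broker; each message is assumed to be a JSON event.
consumer = KafkaConsumer(
    "vehicle-positions",
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for msg in consumer:
    event = msg.value
    # In the demo's architecture, each event would update a stateful,
    # per-entity streaming API rather than simply being printed.
    print(event.get("vehicle_id"), event.get("lat"), event.get("lon"))
```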

Data Warehousing | Keynote Data + AI Summit 2022

2022-07-19 · Watch video

Data + AI Summit Keynote talk from Shant Hovsepian

Day 2 Morning Keynote | Data + AI Summit 2022

2022-07-19 · Watch video
Ganesh Jayaram , Manish Amde (Intuit) , Kasey Uhlenhuth (Databricks) , Peter Norvig (Google) , Andrew Ng , Hilary Mason (Hidden Door) , Michael Armbrust (Databricks) , Stacy Kerkela (Databricks) , Patrick Wendell (Databricks) , Alon Amit (Intuit)

Day 2 Morning Keynote | Data + AI Summit 2022
Production Machine Learning | Patrick Wendell
MLflow 2.0 | Kasey Uhlenhuth
Revolutionizing agriculture with AI: Delivering smart industrial solutions built upon a Lakehouse architecture | Ganesh Jayaram
Intuit’s Data Journey to the Lakehouse: Developing Smart, Personalized Financial Products for 100M+ Consumers & Small Businesses | Alon Amit and Manish Amde
Workflows | Stacy Kerkela
Delta Live Tables | Michael Armbrust
AI and creativity, and building data products where there's no quantitative metric for success, such as in games, or web-scale search, or content discovery | Hilary Mason
What to Know about Data Science and Machine Learning in 2022 | Peter Norvig
Data-centric AI development: From Big Data to Good Data | Andrew Ng

Delta Lake | Keynote Data + AI Summit 2022

2022-07-19 · Watch video
Michael Armbrust (Databricks)

Data + AI Summit Keynote talk from Michael Armbrust

Intuit’s Data Journey to the Lakehouse

2022-07-19 · Watch video
Manish Amde (Intuit) , Alon Amit (Intuit)

Intuit is the global technology platform that helps 100M consumers and small businesses overcome their most important financial challenges. In 2020-21, Intuit QuickBooks Capital facilitated more than $1.4B in loans to approximately 40,000 small businesses to help manage their cash flow through the pandemic, by harnessing the power of data and AI.

Pivotal to Intuit’s success is a lakehouse data architecture, catalyzed by the adoption of Databricks, for collecting, processing, and transforming petabytes of raw data into a unified mesh of high-quality data. Altogether, this enables the company to accelerate delivery of awesome AI-driven personalized customer experiences at scale with products such as TurboTax, QuickBooks and Mint.

In this talk, Intuit’s AI+Data Vice President of Product, Alon Amit and Director of Engineering, Manish Amde, will provide insight into the company’s migration to a lakehouse architecture, highlight use cases to illustrate its value, and share lessons learned.

On Large Language Models for Understanding Human Language

2022-07-19 · Watch video
Christopher Manning (Stanford University)

Spline: Central Data-Lineage Tracking, Not Only For Spark

2022-07-19 · Watch video

Data lineage tracking continues to be a major problem for many organizations. The variety of data tools and frameworks used in big companies, and the lack of standards and universal lineage-tracking solutions (especially open-source ones), makes it very difficult or sometimes even impossible to reliably track and visualize dataflows end to end. Spline is one of very few open-source solutions available today that tries to address that problem. Spline started as a data-lineage tracking tool for Apache Spark, but it now offers a generic API and model capable of aggregating lineage metadata gathered from different data tools, wiring it all together and providing a full end-to-end representation of how data flows through the pipelines and how it transforms along the way.

In this presentation we will explain how Spline can be used as a central data-lineage tracking tool for the organization. We'll briefly cover the high-level architecture and design ideas, outline challenges and limitations of the current solution, and talk about deployment options. We'll also discuss how Spline compares to some other open-source tools, and how the OpenLineage standard can be leveraged to integrate with them.
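As a hedged sketch, codeless initialization of the Spline Spark agent looks roughly like this; the listener class and property names follow the Spline agent documentation, but verify them against the agent version you deploy, and note that the agent bundle JAR must be on the Spark classpath:

```python
from pyspark.sql import SparkSession

# Register the Spline listener and point it at a (placeholder) Spline server.
spark = (
    SparkSession.builder
    .config(
        "spark.sql.queryExecutionListeners",
        "za.co.absa.spline.harvester.listener.SplineQueryExecutionEventListener",
    )
    .config(
        "spark.spline.lineageDispatcher.http.producer.url",
        "http://localhost:8080/producer",  # Spline server REST endpoint
    )
    .getOrCreate()
)

# Any Spark SQL execution in this session now reports its plan to the Spline
# server, which stitches the plans into end-to-end lineage graphs.
spark.sql("CREATE TABLE IF NOT EXISTS result AS SELECT 1 AS id")
```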
