Databricks

Financial Services Experience at Data + AI Summit 2022

The future of Financial Services is open with data and AI at its core. Welcome data teams and executives in Financial Services! This year’s Data + AI Summit is jam-packed with talks, demos and discussions on how Financial Services leaders are harnessing the power of data and analytics to digitally transform, minimize risk, accelerate time to market and drive sustainable value creation. To help you take full advantage of the Financial Services industry experience at Summit, we’ve curated all the programs in one place.

Highlights at this year’s Summit:

Financial Services Industry Forum: Our flagship event for Financial Services attendees at Summit featuring keynotes and panel discussions with ADP, Northwestern Mutual, Point72 Asset Management, S&P Global and EY, followed by networking. More details in the agenda below.
Financial Services Lounge: Stop by our lounge located outside the Expo floor to meet with Databricks’ industry experts and see solutions from our partners including Accenture, Avanade, Deloitte and others.
Session Talks: Over 15 technical talks and demos on topics including hyper-personalization, AI-fueled forecasting, enterprise analytics in cloud, scaling privacy and cybersecurity, MLOps in cryptocurrency, ethical credit scoring and more.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Fireside Chat with Zhamak Dehghani and Arsalan Tavakoli | Keynote Data + AI Summit 2022

Join Zhamak Dehghani, creator of Data Mesh, and Arsalan Tavakoli, Co-founder and SVP of Field Engineering at Databricks, for a fireside chat.

Health Care and Life Sciences Experience at Data + AI Summit 2022

Welcome data teams and executives in the Healthcare and Life Sciences industry! This year’s Data + AI Summit is jam-packed with talks, demos and discussions on the biggest innovations in patient care and drug R&D. To help you take full advantage of the Healthcare and Life Sciences experience at Summit, we’ve curated all the programs in one place.

Highlights at this year’s Summit:

Healthcare and Life Sciences Industry Forum: Our capstone event for Healthcare and Life Sciences attendees at Summit featuring keynotes and panel discussions with Walgreens, Takeda, Optum, and Humana, followed by networking. More details in the agenda below.
Healthcare and Life Sciences Lounge: Stop by our industry lounge located outside the Expo floor to meet with Databricks’ industry experts and see solutions from our partners including ZS Associates, John Snow Labs and others.
Session Talks: Over 10 technical talks on topics including healthcare NLP, knowledge graphs for R&D, commercial analytics, and predicting hospital readmissions.

How Adobe migrated to a unified and open data Lakehouse to deliver personalization at scale.

In this keynote talk, David Weinstein, VP of Engineering for Adobe Experience Cloud, will share Adobe’s journey from a simple data lake to a unified, open Lakehouse architecture with Databricks. Adobe can now deliver personalized experiences at scale to diverse customers with greater speed, operational efficiency and faster innovation across the Experience Cloud portfolio. Learn why Adobe chose to migrate from Iceberg to Delta Lake to drive its open standard development and accelerate innovation of its Lakehouse, and how the Delta Lake table format has enabled techniques that support change data capture and significantly improve operational efficiency.

Manufacturing Experience at Data + AI Summit 2022

Welcome data teams and executives in the Manufacturing industry! This year’s Data + AI Summit is jam-packed with talks, demos and discussions on the biggest innovations around improving manufacturing operations, building agile supply chains and enabling an AI-driven business. To help you take full advantage of the Manufacturing experience at Summit, we’ve curated all the programs in one place.

Highlights at this year’s Summit:

Manufacturing Industry Forum: Our capstone event for Manufacturing attendees at Summit featuring keynotes and panel discussions with John Deere, Honeywell and Collins Aerospace, followed by networking. More details in the agenda below.
Manufacturing Lounge: Stop by our lounge located outside the Expo floor to meet with Databricks’ industry experts and see solutions from The Global Solution Integrator and Tredence.
Session Talks: Insightful talks on predicting and preventing machine downtime, real-time process optimization and leveraging informational and operational technology data to make enterprise decisions.

Media and Entertainment Experience at Data + AI Summit 2022

Welcome data teams and executives in Media and Entertainment! This year’s Data + AI Summit is jam-packed with talks, demos and discussions focused on how organizations are using data to personalize, monetize and innovate the audience experience. To help you take full advantage of the Communications, Media & Entertainment experience at Summit, we’ve curated all the programs in one place.

Highlights at this year’s Summit:

Communications, Media & Entertainment Forum: Our capstone event for the industry at Summit featuring fireside chats and panel discussions with HBO, Warner Bros. Discovery, LaLiga, and Condé Nast, followed by networking. More details in the agenda below.
Industry Lounge: Stop by our lounge located outside the Expo floor to meet with Databricks’ industry experts and see solutions from our partners including Cognizant, Fivetran, Labelbox, and Lovelytics.
Session Talks: Over 10 technical talks on topics including Telecommunication Data Lake Management at AT&T, Data-driven Futbol Analysis from LaLiga, Improving Recommendations with Graph Neural Networks from Condé Nast, Tools for Assisted Spark Version Migrations at Netflix, Real-Time Cost Reduction Monitoring and Alerting with HuuugeGames and much more.

More Context, Less Chaos: How Atlan and Unity Catalog Power Column-Level Lineage and Active Metadata

“What does this mean? Who created it? How is it being used? Is it up to date?” Ever fielded these types of questions about your Databricks assets?

Today, context is a huge challenge for data teams. Everyone wants to use your company’s data, but often only a few experts know all of its tribal knowledge and context. The result — they get bombarded with endless questions and requests.

Atlan — the active metadata platform for modern data teams, recently named a Leader in The Forrester Wave: Enterprise Data Catalogs for DataOps — has launched an integration with Databricks Unity Catalog. By connecting to UC’s REST API, Atlan extracts metadata from Databricks clusters and workspaces, generates column-level lineage, and pairs it with metadata from the rest of your data assets to create true end-to-end lineage and visibility across your data stack.

In this session, Prukalpa Sankar (Co-Founder at Atlan and a lifelong data practitioner) and Todd Greenstein (Product Manager with Databricks) will do a live product demo to show how Atlan and Databricks work together to power modern data governance, cataloging, and collaboration.

Moving from Apache Spark 2 to Apache Spark 3: Spark Version Upgrade at Scale in Pinterest

Apache Spark has become Pinterest’s dominant distributed batch processing framework, and Pinterest is migrating its Spark Platform and most production Spark jobs to Spark 3.

In this talk, we’ll share how Pinterest performed the Spark 3 version migration at scale. Moving to Spark 3 is a huge version upgrade that brings many incompatibilities and major differences compared with Spark 2. We’ll first introduce the motivation for the migration, then cover the major challenges, the approaches we took, and how we handled different Spark job types. We’ll also explain how we addressed incompatibilities between Spark 2 and Spark 3, such as Scala version support, and how we efficiently and safely migrated our existing production Spark jobs at scale without impacting stability and SLOs, with the help of the Auto Migration Service (AMS). Finally, we’ll discuss our current performance improvements and cost savings, as well as future plans and improvements.

Opening Production Machine Learning | Patrick Wendell MLflow 2.0 | Kasey Uhlenhuth

Opening Production Machine Learning | Patrick Wendell MLflow 2.0 | Kasey Uhlenhuth | Keynotes Data + AI Summit 2022

Optimizing Incremental Ingestion in the Context of a Lakehouse

Incremental ingestion of data is often trickier than one would assume, particularly when it comes to maintaining data consistency. For example, specific challenges arise depending on whether the data is ingested in a streaming or a batched fashion. In this session we share the real-life challenges we encountered when setting up an incremental ingestion pipeline in the context of a Lakehouse architecture.

We outline how we used recently introduced Databricks features, such as Auto Loader and Change Data Feed, alongside more mature features, such as Spark Structured Streaming and the Trigger Once functionality. These allowed us to transform batch processes into a “streaming” setup without needing the cluster to run continuously. This setup, which we are keen to share with the community, does not require reloading large amounts of data and is therefore computationally, and consequently economically, cheaper.

In our presentation we dive deeper into each aspect of the setup, with extra focus on essential Auto Loader functionality such as schema inference, recovery mechanisms and file discovery modes.
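
The core idea of running a "streaming" job only when needed can be sketched without Spark. The example below is a plain-Python stand-in, not the Auto Loader API: it assumes a hypothetical landing directory of JSON files and a hypothetical checkpoint file, discovers files by directory listing, processes only the ones not yet recorded, and then stops, so nothing has to keep running between invocations.

```python
import json
import tempfile
from pathlib import Path


def ingest_new_files(source_dir: Path, checkpoint_file: Path) -> list[str]:
    """Process only files not yet recorded in the checkpoint, then stop."""
    # Names of files handled by earlier runs (empty on the first run).
    seen = set()
    if checkpoint_file.exists():
        seen = set(json.loads(checkpoint_file.read_text()))

    # File discovery by directory listing (an assumption for this sketch).
    new_files = sorted(
        p.name for p in source_dir.glob("*.json") if p.name not in seen
    )

    for name in new_files:
        records = json.loads((source_dir / name).read_text())
        # Placeholder for the real write into the target table.
        print(f"ingested {len(records)} records from {name}")

    # Persist progress only after the whole batch succeeded, so a failed
    # run is retried from the same point instead of skipping files.
    checkpoint_file.write_text(json.dumps(sorted(seen | set(new_files))))
    return new_files


# Two "scheduled" runs against a temporary landing directory.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "landing"
    source.mkdir()
    checkpoint = Path(tmp) / "checkpoint.json"

    (source / "batch_001.json").write_text(json.dumps([{"id": 1}, {"id": 2}]))
    first_run = ingest_new_files(source, checkpoint)

    (source / "batch_002.json").write_text(json.dumps([{"id": 3}]))
    second_run = ingest_new_files(source, checkpoint)
```

The second run picks up only the file that arrived after the first one, which is why no large reload is needed.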

Partner Connect & Ecosystem Strategy

Data + AI Summit Keynotes from:
Partner Connect & Ecosystem Strategy (Zaheera Valani)
What are ELT and CDC, and why are all the cool kids doing it? (George Fraser)
Analytics without Compromise (Francois Ajenstat)

Polars: Blazingly Fast DataFrames in Rust and Python

This talk will introduce Polars, a blazingly fast DataFrame library written in Rust on top of Apache Arrow. It’s a DataFrame library that brings exploratory data analysis closer to the lessons learned in database research.

Today’s CPUs come with many cores, and their superscalar designs and SIMD registers allow for even more parallelism. Polars is written from the ground up to fully utilize the CPUs of this generation.

Besides blazingly fast algorithms, a cache-efficient memory layout and multi-threading, it includes a lazy query engine, which allows Polars to apply several optimizations that may improve query time and memory usage.

Read more:

https://github.com/pola-rs/polars
https://www.ritchievink.com/blog/2021/02/28/i-wrote-one-of-the-fastest-dataframe-libraries/

Join the talk to learn more.

Powering Geospatial Data Science with Graph Machine Learning

At Iggy we provide easy access to hundreds of geospatial features to help companies make sense of ‘place’. We believe that incorporating ‘place’ into data science and machine learning pipelines can have a huge impact on predictive capabilities in a wide range of fields such as travel, real estate, healthcare, logistics and many more.

Traditionally this data was accessible in a tabular form, but recently we have been experimenting with converting our tabular data into a graph representation, and applying graph machine learning to build derived products. This allows us to leverage the power of graphs and to more effectively model the relationships between different entities in our data.

In this talk we will present:
1. What is geospatial data?
2. Why are we interested in graph representations of geospatial data?
3. What do our graph representations look like?
4. How are we applying graph machine learning?
5. What are some use cases and derived products that we are building using graph machine learning?
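
As a rough illustration of turning tabular geospatial data into a graph, the sketch below links places that lie within a distance threshold of each other. The abstract does not say how Iggy defines nodes or edges, so the proximity rule, the column names and the sample rows are all assumptions made for this example.

```python
import math
from itertools import combinations


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def build_graph(rows: list[dict], max_km: float) -> dict[str, set[str]]:
    """Adjacency sets: one node per row, one edge per pair within max_km."""
    graph = {row["name"]: set() for row in rows}
    for a, b in combinations(rows, 2):
        if haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_km:
            graph[a["name"]].add(b["name"])
            graph[b["name"]].add(a["name"])
    return graph


# Hypothetical tabular input: one row per place.
places = [
    {"name": "cafe", "lat": 0.0, "lon": 0.0},
    {"name": "park", "lat": 0.0, "lon": 0.01},      # about 1.1 km from the cafe
    {"name": "airport", "lat": 10.0, "lon": 10.0},  # far from both
]

graph = build_graph(places, max_km=2.0)
```

An adjacency structure like this is the kind of input a graph machine learning library would consume; the pairwise loop is quadratic and only suitable for small examples.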

Presto 101: An Introduction to Open Source Presto

Presto is a widely adopted distributed SQL engine for data lake analytics. With Presto, you can perform ad hoc querying of data in place, which helps solve challenges around time to discover and the amount of time it takes to do ad hoc analysis. Additionally, new features like the disaggregated coordinator, Presto-on-Spark, scan optimizations, a reusable native engine, and a Pinot connector enable added benefits around performance, scale, and ecosystem.

In this session, Philip and Rohan will introduce the Presto technology and share why it’s becoming so popular. In fact, companies like Facebook, Uber, Twitter, Alibaba and many more use Presto for interactive ad hoc queries, reporting and dashboarding, data lake analytics, and much more. We’ll also show a quick demo on getting Presto running in AWS.

Privacy Preserving Machine Learning and Big Data Analytics Using Apache Spark

In recent years, new privacy laws and regulations have brought a fundamental shift in the protection of data and privacy, posing new challenges for data applications. To resolve these privacy and security challenges in the big data ecosystem without impacting existing applications, several hardware TEE (Trusted Execution Environment) solutions have been proposed for Apache Spark, e.g., PySpark with Scone and Opaque. However, to the best of our knowledge, none of them provides full protection for data pipelines in Spark applications. An adversary may still obtain sensitive information from unprotected components and stages. Furthermore, some of them greatly narrow the range of supported applications, e.g., supporting only SparkSQL.

In this presentation, we will present a new PPMLA (privacy preserving machine learning and analytics) solution built on top of Apache Spark, BigDL, Occlum and Intel SGX. It ensures all Spark components and pipelines are fully protected by Intel SGX, and existing Spark applications written in Scala, Java or Python can be migrated to our platform without any code change. We will demonstrate how to build distributed end-to-end SparkML/SparkSQL workloads with our solution in an untrusted cloud environment and share real-world use cases for PPMLA.

Revolutionizing agriculture with AI: Delivering smart industrial solutions built upon a Lakehouse

John Deere is leveraging big data and AI to deliver ‘smart’ industrial solutions that are revolutionizing agriculture and construction, driving sustainability and ultimately helping to feed the world. The John Deere Data Factory that is built upon the Databricks Lakehouse Platform is at the core of this innovation. It ingests petabytes of data and trillions of records to give data teams fast, reliable access to standardized data sets supporting 100s of ML and analytics use cases across the organization. From IoT sensor-enabled equipment driving proactive alerts that prevent failures, to precision agriculture that maximizes field output, to optimizing operations in the supply chain, finance and marketing, John Deere is providing advanced products, technology and services for customers who cultivate, harvest, transform, enrich, and build upon the land.

ROAPI: Serve Not So Big Data Pipeline Outputs Online with Modern APIs

Data is the key component of any analytics, AI or ML platform. Organizations may not be successful without a platform that can source, transform and quality-check data and present it in a reportable format that can drive actionable insights.

This session will focus on how the Capital One HR team built a low-cost data movement ecosystem that can source data, transform it at scale and build the data storage (Redshift) at a level that can be easily consumed by AI/ML programs, using AWS services in combination with open source software (Spark) and Enterprise Edition Hydrograph (a UI-based ETL tool with Spark as the backend). The presentation mainly demonstrates the flexibility that Apache Spark provides for various types of ETL data pipelines when we code in Spark.

We have been running three types of pipelines for 6+ years, with 400+ nightly batch jobs for $1000/mo: (1) Spark on EC2, (2) a UI-based ETL tool with a Spark backend (on the same EC2), and (3) Spark on EMR. We have a CI/CD pipeline that supports easy integration and code deployment in all non-prod and prod regions (it even supports automated unit testing). We will also demonstrate how this ecosystem can fail over to a different region in less than 15 minutes, making our application highly resilient.

Scalable XGBoost on GPU Clusters

XGBoost is a popular open-source implementation of gradient boosting tree algorithms. In this talk, we walk through some of the new features in XGBoost that help us train better models, and explain how to scale up the pipeline to larger datasets with GPU clusters.

It is challenging to train gradient boosting models with the growing size and complexity of data. The latest XGBoost introduces categorical data support to help data scientists work with non-numerical data without the need for encoding. The new XGBoost can train multi-output models to handle datasets with non-exclusive class labels and multi-target regression. XGBoost has also introduced a new AUC implementation that supports more model types and features a robust approximation in distributed environments.

The latest XGBoost has significantly improved its built-in GPU support for scalability and performance. Data loading and processing have been improved for increased memory efficiency, enabling users to handle larger datasets. GPU-based model training is over 2x faster compared to past versions. The performance improvement has also been extended to model explanation: XGBoost added GPU-based SHAP value computation, obtaining more than 10x speedup compared to the traditional CPU-based method. On Spark GPU clusters, end-to-end pipelines can now be accelerated on GPU, from feature engineering in ETL to model training and inference in XGBoost.

We will walk through these XGBoost improvements with the newly released XGBoost packages from DMLC. Benchmark results will be shared. Example applications and notebooks will be provided for audiences to learn these new features on the cloud.

Self-Serve, Automated and Robust CDC pipeline using AWS DMS, DynamoDB Streams and Databricks Delta

Many companies are trying to solve the challenges of ingesting transactional data into a data lake and dealing with late-arriving updates and deletes.

To address this at Swiggy, we have built a CDC (Change Data Capture) system, an incremental processing framework to power all business-critical data pipelines at low latency and high efficiency.

It offers:
Freshness: It operates in near real-time with configurable latency requirements.
Performance: Optimized read and write performance with tuned compaction parameters, partitions and Delta table optimization.
Consistency: It supports reconciliation based on transaction types, that is, applying inserts, updates and deletes to existing data.

To implement this system, AWS DMS helped us with initial bootstrapping and CDC replication for MySQL sources. AWS Lambda and DynamoDB Streams helped us solve the bootstrapping and CDC replication for DynamoDB sources.

After setting up the bootstrap and CDC replication process, we used Databricks Delta merge to reconcile the data based on the transaction types.

To support the merge, we have implemented supporting features:
* Deduplicating multiple mutations of the same record using log offset and timestamp.
* Adding optimal partitioning of the data set.
* Inferring the schema and applying proper schema evolutions (backward-compatible schema).
* Extending the Delta table snapshot generation technique to create a consistent partition for partitioned Delta tables.
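
The reconciliation step can be illustrated in plain Python. This is a minimal sketch of the logic only, not Swiggy's implementation or the Delta Lake merge API: the field names (`id`, `op`, `log_offset`, `ts`, `row`) and the sample data are hypothetical. It first keeps the latest mutation per record, ordered by log offset and then timestamp, and then applies inserts, updates and deletes to the existing table.

```python
def deduplicate(changes: list[dict]) -> list[dict]:
    """Keep only the latest mutation per key, ordered by (log_offset, ts)."""
    latest: dict = {}
    for change in changes:
        key = change["id"]
        order = (change["log_offset"], change["ts"])
        current = latest.get(key)
        if current is None or order > (current["log_offset"], current["ts"]):
            latest[key] = change
    return list(latest.values())


def merge(table: dict, changes: list[dict]) -> dict:
    """Apply deduplicated inserts, updates and deletes to a keyed table."""
    result = dict(table)
    for change in deduplicate(changes):
        if change["op"] == "delete":
            result.pop(change["id"], None)
        else:
            # Inserts and updates are both treated as an upsert.
            result[change["id"]] = change["row"]
    return result


# Existing table keyed by record id, plus a batch of CDC log entries.
table = {1: {"status": "placed"}, 2: {"status": "placed"}}
changes = [
    {"id": 1, "op": "update", "log_offset": 10, "ts": 100, "row": {"status": "packed"}},
    {"id": 1, "op": "update", "log_offset": 11, "ts": 101, "row": {"status": "delivered"}},
    {"id": 2, "op": "delete", "log_offset": 12, "ts": 102, "row": None},
    {"id": 3, "op": "insert", "log_offset": 13, "ts": 103, "row": {"status": "placed"}},
]

merged = merge(table, changes)
```

Deduplicating before the merge matters because applying two mutations of the same record in one batch would otherwise make the outcome depend on processing order.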

Finally, to read the data we are using Spark SQL with Hive metastore and Snowflake. Delta tables read with Spark SQL have implicit support for the Hive metastore. We have built our own implementation of the Snowflake sync process to create external and internal tables and materialized views on Snowflake.

Stats: 500M CDC logs/day, 600+ tables

Simplify Global DataOps and MLOps Using Okta’s FIG Automation Library

Think for a moment about an ML pipeline that you have created. Was it tedious to write? Did you have to familiarize yourself with technology outside your normal domain? Did you find many bugs? Did you give up with a “good enough” solution? Even simple ML pipelines are tedious. With complex ML pipelines, even teams that include Data Engineers and ML Engineers still end up with delays and bugs.

Okta’s FIG (Feature Infrastructure Generator) simplifies this with a configuration language for Data Scientists that produces scalable and correct ML pipelines, even highly complex ones. FIG is “just a library” in the sense that you can pip install it. Once installed, FIG will configure your AWS account, creating ETL jobs, workflows, and ML training and scoring jobs. Data Scientists then use FIG’s configuration language to specify features and model integrations. With a single function call, FIG will run an ML pipeline to generate feature data, train models, and create scoring data. Feature generation is performed in a scalable, efficient, and temporally correct manner. Model training artifacts and scoring are automatically labeled and traced. This greatly simplifies the ML prototyping experience.

Once it is time to productionize a model, FIG is able to use the same configuration to coordinate with Okta’s deployment infrastructure to configure production AWS accounts, register build and model artifacts, and set up monitoring.

This talk will show a demo of using FIG in the development of Okta’s next generation security infrastructure. The demo includes a walkthrough of the configuration language and how that is translated into AWS during a prototyping session. The demo will also briefly cover how FIG interacts with Okta’s deployment system to make productionization seamless.
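
The abstract does not describe how FIG achieves temporal correctness, and the sketch below is not FIG's configuration language or API. It only illustrates the general idea, assuming "temporally correct" means that a feature for a training example is computed from events that happened strictly before that example's timestamp, so no future information leaks into training. The event log, user names and feature are hypothetical.

```python
from datetime import datetime

# Hypothetical event log: one row per user event.
events = [
    {"user": "ana", "ts": datetime(2022, 6, 1, 9, 0)},
    {"user": "ana", "ts": datetime(2022, 6, 1, 9, 5)},
    {"user": "ana", "ts": datetime(2022, 6, 2, 8, 0)},
    {"user": "bo", "ts": datetime(2022, 6, 1, 12, 0)},
]


def event_count_as_of(events: list[dict], user: str, as_of: datetime) -> int:
    """Count a user's events that occurred strictly before `as_of`."""
    return sum(1 for e in events if e["user"] == user and e["ts"] < as_of)


# Each training example carries the time at which its label was observed.
label_times = [
    ("ana", datetime(2022, 6, 1, 10, 0)),
    ("bo", datetime(2022, 6, 1, 10, 0)),
]

training_rows = [
    {"user": user, "as_of": as_of, "event_count": event_count_as_of(events, user, as_of)}
    for user, as_of in label_times
]
```

Events after 10:00 on June 1 exist in the log for both users, but they are excluded from the features because they would not have been known at labeling time.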
