talk-data.com

Topic

Databricks

big_data analytics spark

561 tagged

Activity Trend

515 peak/qtr (2020-Q1 to 2026-Q1)

Activities

Showing filtered results

Filtering by: Databricks DATA + AI Summit 2023
Simplify Global DataOps and MLOps Using Okta’s FIG Automation Library

Think for a moment about an ML pipeline that you have created. Was it tedious to write? Did you have to familiarize yourself with technology outside your normal domain? Did you find many bugs? Did you give up with a “good enough” solution? Even simple ML pipelines are tedious, and even for teams that include Data Engineers and ML Engineers, complex ML pipelines still end up with delays and bugs.

Okta’s FIG (Feature Infrastructure Generator) simplifies this with a configuration language for Data Scientists that produces scalable and correct ML pipelines, even highly complex ones. FIG is “just a library” in the sense that you can pip install it. Once installed, FIG will configure your AWS account, creating ETL jobs, workflows, and ML training and scoring jobs. Data Scientists then use FIG’s configuration language to specify features and model integrations. With a single function call, FIG will run an ML pipeline to generate feature data, train models, and create scoring data. Feature generation is performed in a scalable, efficient, and temporally correct manner. Model training artifacts and scoring outputs are automatically labeled and traced. This greatly simplifies the ML prototyping experience.

Once it is time to productionize a model, FIG can use the same configuration to coordinate with Okta’s deployment infrastructure to configure production AWS accounts, register build and model artifacts, and set up monitoring.

This talk will show a demo of using FIG in the development of Okta’s next-generation security infrastructure. The demo includes a walkthrough of the configuration language and how it is translated into AWS resources during a prototyping session. The demo will also briefly cover how FIG interacts with Okta’s deployment system to make productionization seamless.
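FIG itself is internal to Okta and its configuration language is not shown in the abstract, so the following is only an illustrative sketch of the general idea: a declarative config (all class, field, and feature names here are hypothetical) that a library expands into an ordered set of pipeline steps.

```python
from dataclasses import dataclass, field

@dataclass
class FeatureSpec:
    name: str
    source_table: str
    aggregation: str      # e.g. "count", "sum"
    window_days: int

@dataclass
class PipelineConfig:
    features: list = field(default_factory=list)
    model: str = "xgboost"

def build_pipeline(config: PipelineConfig) -> list:
    """Expand the declarative config into an ordered list of pipeline steps."""
    steps = [f"etl:{f.source_table}" for f in config.features]
    steps += [f"feature:{f.name}({f.aggregation},{f.window_days}d)"
              for f in config.features]
    steps += [f"train:{config.model}", "score"]
    return steps

cfg = PipelineConfig(features=[
    FeatureSpec("login_count_7d", "auth_events", "count", 7),
    FeatureSpec("failed_logins_30d", "auth_events", "count", 30),
])
print(build_pipeline(cfg))
```

In a real system such as FIG, each generated step would correspond to an actual AWS resource (an ETL job, a workflow, a training job) rather than a string; the point is that the data scientist writes only the config, and the "single function call" does the rest.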

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Supercharging our data architecture at Coinbase using Databricks Lakehouse | Eric Sun

Coinbase is neither simply a finance company nor a tech company — it’s a crypto company. This distinction has big implications for how we work with the blockchain, product, and financial data that we need to drive our hypergrowth. We’ve recently enabled a Lakehouse architecture based on Databricks to unify these complex and varied data sets and to deliver a high-performance, continuous ingestion framework at an unprecedented scale. We can now support both ETL and ML workloads on one platform to deliver innovative batch and streaming use cases, and democratize data much faster by enabling teams to use the tools of their choice, while greatly reducing end-to-end latency and simplifying maintenance and operations. In this keynote, we will share our journey to the Lakehouse, and some of the lessons learned as we built an open data architecture at scale.

Time Series Forecasting with PyCaret

PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that dramatically shortens the experiment cycle and makes you more productive.

This presentation will demo the time series forecasting use case using PyCaret's new low-code time series forecasting module.

Towards a Modular Future: Reimagining and Rebuilding Kedro-viz for Visualizing Modular Pipelines

Kedro is an open-source framework for creating portable pipelines through modular data science code. It provides an interactive visualisation tool called ‘Kedro-Viz’, a web app that automatically generates a rich, informative visualisation of the pipeline.

In 2020, the Kedro project introduced an important set of features to support modular pipelines, which allow users to set up a series of pipelines that are logically isolated, reusable, and composable into higher-level pipelines.

With this paradigm shift comes the need to reimagine pipeline visualization in Kedro-Viz, which requires a series of redesigns and new features to support this new representation of pipeline structure.

As a core contributor to the Kedro-Viz project over the past year, I have witnessed this transition first-hand while shipping the core features for modular pipelines in Kedro-Viz.

This talk will focus on my experience as a front-end developer as I walk through the unique architecture and data ingestion setup for this project. I will deep-dive into the unique set of problems and assumptions we had to make in accommodating this new modular pipeline setup, and our approach to solving them within a front-end (React + Redux) context.

Needless to say, I will also share the mistakes and learnings along the way, and how they paved the path toward the app-architecture choices for our next set of features in ML experiment tracking.

This talk is for the curious data practitioner who wants exposure to a fresh set of problems beyond the typical data science domain, and for anyone up for a ride through the mind-boggling details of front-end development and data visualisation for data science.
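To make the visualization problem concrete: Kedro names nodes inside modular pipelines with dot-separated namespaces, and a folded view has to group nodes by those namespaces. The sketch below is not Kedro-Viz code (which is a React + Redux app); it is a minimal illustration of the grouping idea, with hypothetical node names.

```python
from collections import defaultdict

# Fully-qualified node names, in the dot-namespace style Kedro uses for
# modular pipelines (the names themselves are made up for illustration)
nodes = [
    "data_processing.clean_raw",
    "data_processing.join_tables",
    "data_science.train.split_data",
    "data_science.train.fit_model",
    "data_science.evaluate",
    "report_results",
]

def group_by_namespace(node_names, depth=1):
    """Collapse nodes into their top-level modular pipeline for a folded view."""
    tree = defaultdict(list)
    for name in node_names:
        parts = name.split(".")
        key = ".".join(parts[:depth]) if len(parts) > depth else "__root__"
        tree[key].append(name)
    return dict(tree)

print(group_by_namespace(nodes))
```

Expanding or collapsing a modular pipeline in the UI then amounts to re-rendering the graph at a different grouping depth, which is one reason the front-end data ingestion and state management had to be redesigned.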

Transforming Drug Discovery using Digital Biology | Daphne Koller

Modern medicine has given us effective tools to treat some of the most significant and burdensome diseases. At the same time, it is becoming steadily more challenging and more expensive to develop new therapeutics. A key factor in this trend is that the drug development process involves multiple steps, each of which involves a complex and protracted experiment that often fails. We believe that, for many of these phases, it is possible to develop machine learning models to help predict the outcome of these experiments, and that those models, while inevitably imperfect, can outperform predictions based on traditional heuristics. To achieve this goal, we are bringing together high-quality data from human cohorts, while also developing cutting-edge methods in high-throughput biology and chemistry that can produce massive amounts of in vitro data relevant to human disease and therapeutic interventions. Those are then used to train machine learning models that make predictions about novel targets, coherent patient segments, and the clinical effect of molecules. Our ultimate goal is to develop a new approach to drug development that uses high-quality data and ML models to design novel, safe, and effective therapies that help more people, faster, and at a lower cost.

UIMeta: A 10X Faster Cloud-Native Apache Spark History Server

The Spark history server is an essential tool for monitoring, analyzing, and optimizing Spark jobs.

The original history server is based on Spark's event log mechanism. A running Spark job continuously produces many kinds of events that describe the job's status. All the events are serialized into JSON and appended to a file: the event log. The history server has to replay the event log and rebuild the in-memory store needed for the UI. In a cluster, the history server also needs to periodically scan the event log directory and cache all the files' metadata in memory.

In practice, an event log contains far more information than a history server needs. A long-running application can produce a huge event log that is costly to maintain and slow to replay. In large-scale production, the number of jobs can be large, placing a heavy burden on history servers, and building a scalable history server service requires additional development.

In this talk, we introduce a new history server based on UIMeta. UIMeta is a wrapper around the KVStore objects needed by a Spark UI. A job produces a UIMeta log by serializing UIMeta in stages. A UIMeta log is approximately 10x smaller and 10x faster to replay than the original event log file. Building on this performance, we developed a new stateless history server that needs no directory scan. UIMeta Service has now replaced the original history server and serves millions of jobs per day at ByteDance.
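The replay cost described above can be sketched in miniature. Spark's event log really is JSON lines of listener events, but the toy log and replay loop below are heavily simplified (real events carry far more fields, and the real store is a KVStore, not a dict); the point is that the original history server must parse and apply every event just to recover final job state, whereas UIMeta persists that final state directly.

```python
import json
from io import StringIO

# A tiny synthetic event log in Spark's JSON-lines style (the event names are
# real Spark listener events; the payloads are trimmed for illustration)
raw_log = StringIO("\n".join(json.dumps(e) for e in [
    {"Event": "SparkListenerJobStart", "Job ID": 0},
    {"Event": "SparkListenerJobEnd", "Job ID": 0,
     "Job Result": {"Result": "JobSucceeded"}},
    {"Event": "SparkListenerJobStart", "Job ID": 1},
]))

def replay(log):
    """Replay every event to rebuild the UI's view of job state, as the
    original history server must do each time it loads an application."""
    jobs = {}
    for line in log:
        event = json.loads(line)
        if event["Event"] == "SparkListenerJobStart":
            jobs[event["Job ID"]] = "RUNNING"
        elif event["Event"] == "SparkListenerJobEnd":
            jobs[event["Job ID"]] = event["Job Result"]["Result"]
    return jobs

state = replay(raw_log)
print(state)
```

For a long-running job with millions of task-level events, this full parse-and-apply pass is what makes replay slow; serializing the resulting store ("state" here) instead of the events is the essence of the UIMeta approach.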

What to Know about Data Science and Machine Learning in 2022 | Peter Norvig

After writing an AI textbook in 2020 and a Data Science textbook in 2022, Peter Norvig reflects on how the way we teach and learn about these fields has changed in recent years. New data and new algorithms have become available, but more importantly, ethical and societal issues have become more prominent, and the question of what exactly we are trying to optimize is paramount.

Automating Business Decisions Using Event Streams

Today's real-time solutions demand continuity, autonomy, and observability. Data streams have evolved to guarantee only continuity; thus, streams alone will never satisfy this demand. Industries instead crave a properly end-to-end streaming architecture backing their applications and services, a concept that has narrowly evaded realization until now.

In this session, Rohit Bose will demonstrate how such architectures cleanly solve complex problems. This will require two parts:

  1. Building an industry-specific application that continuously generates insights and reports them over dynamically-scoped real-time streams
  2. Discussing the advantages and generalizations of the application's design

The demo will utilize the Swim platform to expose thousands of streaming APIs seeded by an Apache Kafka firehose, enabling both real-time map visualizations and decision-making clients to instantly observe changes across distributed entities with zero unnecessary subscriptions.

Day 2 Morning Keynote | Data + AI Summit 2022

  • Production Machine Learning | Patrick Wendell
  • MLflow 2.0 | Kasey Uhlenhuth
  • Revolutionizing agriculture with AI: Delivering smart industrial solutions built upon a Lakehouse architecture | Ganesh Jayaram
  • Intuit’s Data Journey to the Lakehouse: Developing Smart, Personalized Financial Products for 100M+ Consumers & Small Businesses | Alon Amit and Manish Amde
  • Workflows | Stacy Kerkela
  • Delta Live Tables | Michael Armbrust
  • AI and creativity, and building data products where there's no quantitative metric for success, such as in games, web-scale search, or content discovery | Hilary Mason
  • What to Know about Data Science and Machine Learning in 2022 | Peter Norvig
  • Data-centric AI development: From Big Data to Good Data | Andrew Ng

Intuit’s Data Journey to the Lakehouse

Intuit is the global technology platform that helps 100M consumers and small businesses overcome their most important financial challenges. In 2020-21, Intuit QuickBooks Capital facilitated more than $1.4B in loans to approximately 40,000 small businesses to help manage their cash flow through the pandemic, by harnessing the power of data and AI.

Pivotal to Intuit’s success is a lakehouse data architecture, catalyzed by the adoption of Databricks, for collecting, processing, and transforming petabytes of raw data into a unified mesh of high-quality data. Altogether, this enables the company to accelerate delivery of AI-driven, personalized customer experiences at scale with products such as TurboTax, QuickBooks, and Mint.

In this talk, Intuit’s AI+Data Vice President of Product, Alon Amit, and Director of Engineering, Manish Amde, will provide insight into the company’s migration to a lakehouse architecture, highlight use cases that illustrate its value, and share lessons learned.

Spline: Central Data-Lineage Tracking, Not Only For Spark

Data lineage tracking continues to be a major problem for many organizations. The variety of data tools and frameworks used in big companies, and the lack of standards and universal lineage-tracking solutions (especially open-source ones), make it very difficult or sometimes even impossible to reliably track and visualize dataflows end to end. Spline is one of very few open-source solutions available today that tries to address that problem. Spline started as a data-lineage tracking tool for Apache Spark, but it now offers a generic API and model capable of aggregating lineage metadata gathered from different data tools and wiring it all together, providing a full end-to-end representation of how data flows through the pipelines and how it transforms along the way.

In this presentation we will explain how Spline can be used as a central data-lineage tracking tool for an organization. We’ll briefly cover the high-level architecture and design ideas, outline challenges and limitations of the current solution, and talk about deployment options. We’ll also discuss how Spline compares to some other open-source tools, and how the OpenLineage standard can be leveraged to integrate with them.
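The "wire it all together" idea can be illustrated in miniature. This is not Spline's actual data model (Spline records full execution plans, and OpenLineage defines its own event schema); it is a hedged sketch, with made-up dataset URIs, of how lineage fragments reported by different tools can be joined on shared dataset identifiers into an end-to-end flow.

```python
# Each tool reports lineage as (input, operation, output) triples; joining
# them on shared dataset identifiers yields the end-to-end dataflow.
fragments = [
    ("s3://raw/events", "spark:clean_job", "s3://staged/events"),
    ("s3://staged/events", "spark:aggregate_job", "warehouse.daily_stats"),
    ("warehouse.daily_stats", "bi:refresh_dashboard", "dashboard:usage"),
]

def trace(target, fragments):
    """Walk lineage backwards from a dataset to its ultimate source."""
    producers = {out: (inp, op) for inp, op, out in fragments}
    path = []
    while target in producers:
        inp, op = producers[target]
        path.append((inp, op, target))
        target = inp
    return list(reversed(path))

lineage = trace("dashboard:usage", fragments)
for inp, op, out in lineage:
    print(f"{inp} -[{op}]-> {out}")
```

Note that the Spark job and the BI tool each know only their own fragment; the end-to-end view exists only after a central service aggregates them, which is the role Spline proposes to play.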

Swedbank: Enterprise Analytics in Cloud

Swedbank is the largest bank in Sweden and the third largest in the Nordics, with about 7-8M customers across retail, mortgage, and investment (pensions). One of the bank's key drivers was to look at data across all silos and build analytics to drive its ML models, which it could not do. That's when Swedbank made a strategic decision to go to the cloud and place its bets on Databricks, Immuta, and Azure.

Enterprise Analytics in Cloud is an initiative to move Swedbank's on-premises Hadoop-based data lake into the cloud to provide improved analytical capabilities at scale. The strategic goals of the “Analytics Data Lake” are:

  • Advanced analytics - Improve analytical capabilities in terms of functionality, reduce analytics time to market, and enable better predictive modelling.
  • A Catalyst for Sharing Data - Make data Visible, Accessible, Understandable, Linked, and Trusted.
  • Technical advancements - Future-proof the platform with the ability to add new tools/libraries and support for 3rd-party solutions for Deep Learning/AI.

To achieve these goals, Swedbank had to migrate existing capabilities and application services to Azure Databricks and implement Immuta as its unified access control plane. A “data discovery” space was created where data scientists can come and scan (new) data, then develop, train, and operationalise ML models. To meet these goals, Swedbank requires dynamic and granular data access controls that mitigate data exposure (due to compromised accounts, attackers monitoring a network, and other threats) while empowering users via self-service data discovery and analytics. Protection of sensitive data is key to enabling Swedbank to support key financial services use cases.

The presentation will focus on this journey, calling out key technical challenges, learnings, and benefits observed.

Unlocking the power of data, AI & analytics: Amgen’s journey to the Lakehouse | Kerby Johnson

In this keynote, you will learn more about Amgen's data platform journey from data warehouse to data lakehouse. They'll discuss their decision process and the challenges they faced with legacy architectures, and how they designed and implemented a sustainable platform strategy with Databricks Lakehouse, accelerating their ability to democratize data to thousands of users.

Today, Amgen has implemented 400+ data science and analytics projects covering use cases like clinical trial optimization, supply chain management, and commercial sales reporting, with more to come as they complete their digital transformation and unlock the power of data across the company.

US Air Force: Safeguarding Personnel Data at Enterprise Scale

The US Air Force VAULT platform is a cloud-native enterprise data platform designed to provide the Department of the Air Force (DAF) with a robust, interoperable, and secure data environment. The strategic goals of VAULT include:

  • Leading Data Culture - Increase data use and literacy to improve efficiency and effectiveness of decisions, readiness, mission operations, and cybersecurity.
  • A Catalyst for Sharing Data - Make data Visible, Accessible, Understandable, Linked, and Trusted (VAULT).
  • Driving Data Capabilities - Increase access to the right combination of state-of-the-art technologies needed to best utilize data.

To achieve these goals, the VAULT team created a self-service platform to onboard, extract, transform, and load data; perform data analytics, machine learning, and visualization; and manage data governance. Supporting over 50 tenants across NIPR and SIPR adds complexity to maintaining data security while ensuring data can be shared and utilized for analytics. To meet these goals, VAULT requires dynamic and granular data access controls that mitigate data exposure (due to compromised accounts, attackers monitoring a network, and other threats) while empowering users via self-service analytics. Protection of sensitive data is key to enabling VAULT to support key use cases such as personnel readiness: optimally placing Airmen trainees to meet production goals, increase readiness, and match trainees to their preferences.

Using Feast Feature Store with Apache Spark for Self-Served Data Sharing and Analysis for Streaming

In this presentation we will talk about how we use available NER-based sensitive-data detection methods and automated record-of-activity processing on top of Spark and Feast for collaborative, intelligent analytics and governed data sharing. Information sharing is the key to successful business outcomes, but it is complicated by sensitive information, both user-centric and business-centric.

Our presentation is motivated by the need to share key KPIs and outcomes for health screening data collected from various surveys, in order to improve care and assistance. In particular, collaborative information sharing was needed to help with health data management and with KPIs for early detection and prevention of disease. We will present the framework and approach we have used for these purposes.

Vision AI—Animal Health Industry Use Cases Using Databricks on Azure

Vision AI and Azure Cognitive Services can be applied in a variety of ways in healthcare, especially in animal health. The animal diagnostics market was valued at over USD 4.5 billion in 2020 and is expected to grow at a CAGR of 8.5% from 2021 to 2027 (MarketsandMarkets study).

The overall livestock advanced-monitoring market is expected to grow from USD 1.4 billion in 2021 to USD 2.3 billion by 2026, at a CAGR of 10.4% during 2021-2026.

We hope to showcase various uses of AI/ML for the care of livestock and companion animals to help assist vets and farm owners. Live demos will include real-life case studies and forward-looking applications using reinforcement learning techniques and services.
