talk-data.com

Topic: Databricks
Tags: big_data, analytics, spark
1286 activities tagged

Activity Trend: 515 peak/qtr (2020-Q1 to 2026-Q1)

Activities

1286 activities · Newest first

Discover Data Lakehouse With End-to-End Lineage

Data Lineage is key for managing change, ensuring data quality, and implementing Data Governance in an organization. There are a few major use cases for Data Lineage. Data Governance: for compliance and regulatory purposes, our customers are required to prove that the data and reports they submit come from a trusted and verified source.

This typically means identifying the tables and data sets used in a report or dashboard and tracing those tables and fields back to their sources. Another governance use case is understanding the spread of sensitive data within the lakehouse. Data Discovery: data analysts looking to self-serve and build their own analytics and models typically spend time exploring and understanding the data in their lakehouse.

Lineage is a key piece of information that enhances the understanding and trustworthiness of the data the analyst plans to use. Problem Identification: data teams are often called in to resolve errors in analysts' dashboards and reports (“Why is the total number of widgets different in this report than in the one I built?”). This usually leads to an expensive forensic exercise by the data engineering team to understand the sources of the data and the transformations applied to it before it reaches the report. Change Management: it is not uncommon for data sources to change; a source may stop delivering data, or a field in the source system may change its semantics.

In this scenario, the data engineering team would like to understand the downstream impact of the change: how many datasets and users will be affected. This helps them determine the scope of the change, manage user expectations, and address issues ahead of time. In this talk, we will describe in detail how we capture table and column lineage for Spark, Delta, and Unity Catalog for our customers, and how users can leverage data lineage to serve the use cases mentioned above.
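The talk itself contains no code, but as a rough sketch of the kind of downstream-impact question lineage answers, here is a minimal PySpark query against Unity Catalog's lineage system tables. The table and column names follow the Databricks system-tables documentation and the source table name is invented, so treat the details as assumptions:

    # Sketch: list everything directly downstream of a source table, using
    # Unity Catalog's lineage system table (requires system tables enabled;
    # names per Databricks docs and may vary by release).
    downstream = spark.sql("""
        SELECT DISTINCT target_table_full_name
        FROM system.access.table_lineage
        WHERE source_table_full_name = 'main.sales.orders'
          AND target_table_full_name IS NOT NULL
    """)
    downstream.show(truncate=False)

Walking this result recursively gives the full blast radius of a schema or semantic change before it ships.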

Doubling the Capacity of the Data Platform Without Doubling the Cost

The data and ML platform at Scribd is growing. I am responsible for understanding and managing its cost, while enabling the business to solve new and interesting problems with our data. In this talk we'll discuss each of the following concepts and how they apply at Scribd and more broadly to other Databricks customers.

Embedding Privacy by Design Into Data Infrastructure Through Open-Source, Extensible Tooling

The systemic privacy issues in our digital infrastructure stem largely from a fundamental design flaw: privacy is only considered reactively, once personal data is already flowing. Consumer trust is more valuable than ever, and the legal stakes for respecting personal data continue to climb. Appointing a privacy engineer to check boxes at the time of deployment won't cut it; the status quo for data context and data control, in other words, privacy controls, needs to change.

Analogous to AppSec's shift left, privacy responsibility now lies with the builders and maintainers of data and software systems. This requires resources that let developers embrace that role, evaluating privacy risk with minimal friction and in a way that is compatible with the array of modern data infrastructure. Cillian will share actionable steps to implement Privacy by Design and offer one example of what it could look like in action, with open-source devtools for automated privacy checks in the CI pipeline.
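The talk's specific tooling isn't named in this abstract, so the following is only a hypothetical sketch of what a CI privacy check could reduce to: a script that fails the build when columns that look like personal data carry no declared privacy category. Every name and the annotation format are invented for illustration:

    # Hypothetical CI privacy gate: exit non-zero (failing the CI job) if a
    # column name suggests personal data but no privacy category is declared.
    import re
    import sys

    SENSITIVE = re.compile(r"(email|ssn|phone|dob|address|ip_address)", re.I)

    def undeclared(schema):
        """schema maps column name -> declared privacy category ('' if none)."""
        return [col for col, category in schema.items()
                if SENSITIVE.search(col) and not category]

    if __name__ == "__main__":
        # In CI this would be loaded from the repo's dataset annotations.
        schema = {"user_email": "", "order_total": "", "ship_address": "contact"}
        violations = undeclared(schema)
        if violations:
            print(f"Undeclared personal-data columns: {violations}")
            sys.exit(1)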

Enable Production ML with Databricks Feature Store

Productionizing ML models is hard. In fact, very few ML projects make it to production, and one of the hardest problems is data! Most AI platforms are disconnected from the data platform, making it challenging to keep features constantly updated and available in real time. Offline/online skew prevents models from being used in real time or, worse, introduces bugs and biases in production. Building systems to enable real-time inference requires valuable production engineering resources. As a result of these challenges, most ML models never see the light of day.

Learn how you can simplify production ML using Databricks Feature Store, the first feature store built on the data lakehouse. Data sources for features are drawn from a central data lakehouse, and the feature tables themselves are tables in the lakehouse, accessible in Spark and SQL for both machine learning and analytics use cases. Features, data pipelines, source data, and models can all be co-governed in a central platform. Feature Store is seamlessly integrated with Apache Spark™, enabling automatic lineage tracking, and with MLflow, enabling models to look up feature values automatically at inference time. See these capabilities in action and learn how you can use them for your ML projects.
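As a minimal sketch of that workflow using the databricks.feature_store client as documented by Databricks (the table, key, and column names here are illustrative, and the two input DataFrames are assumed to exist):

    # Sketch: register a feature table, then build a training set that joins
    # features to labels; the lookup metadata is logged with the model via
    # MLflow so it can fetch feature values itself at inference time.
    from databricks.feature_store import FeatureStoreClient, FeatureLookup

    fs = FeatureStoreClient()

    fs.create_table(
        name="main.ml.user_features",       # feature table in the lakehouse
        primary_keys=["user_id"],
        df=user_features_df,                # assumed precomputed features
        description="Aggregated user activity features",
    )

    training_set = fs.create_training_set(
        df=labels_df,                       # assumed DataFrame of user_id + label
        feature_lookups=[FeatureLookup(table_name="main.ml.user_features",
                                       lookup_key="user_id")],
        label="label",
    )
    train_df = training_set.load_df()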

Enabling Business Users to Perform Interactive Ad-Hoc Analysis over Delta Lake with No Code

In this talk, we'll first introduce Sigma Workbooks along with its technical design motivations and architectural details. Sigma Workbooks is an interactive visual data analytics system that enables business users to easily perform complex ad-hoc analysis over data in cloud data warehouses (CDWs). We'll then demonstrate the expressivity, scalability, and ease of use of Sigma Workbooks through real-life use cases over datasets stored in Delta Lake. We'll conclude by sharing the lessons we have learned throughout the design and implementation iterations of Sigma Workbooks.

Evolution of Data Architectures and How to Build a Lakehouse

Data architectures are a key part of the larger picture of building robust analytical and AI applications. One must take a holistic view of the entire data analytics realm when planning data science initiatives.

Through this talk, learn about the evolution of the data landscape and why lakehouses are becoming the de facto standard for organizations building scalable data architectures. A lakehouse architecture combines the data management capabilities of the data warehouse, including reliability, integrity, and quality, with the low cost and open approach of data lakes, and supports all data workloads, including BI and AI.

Data practitioners will also learn some core concepts for building an efficient lakehouse with Delta Lake.
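To make those concepts concrete, here is a minimal Delta Lake sketch, assuming an existing DataFrame df and an illustrative table name: write a managed Delta table, apply an ACID update, and time-travel to an earlier version.

    # Sketch: basic Delta Lake operations underpinning a lakehouse.
    from delta.tables import DeltaTable

    df.write.format("delta").mode("overwrite").saveAsTable("main.demo.events")

    tbl = DeltaTable.forName(spark, "main.demo.events")
    tbl.update(condition="status = 'stale'", set={"status": "'archived'"})

    # Read the table as of version 0 for reproducibility or debugging.
    v0 = spark.read.option("versionAsOf", 0).table("main.demo.events")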

Fugue Tune: Distributed Hybrid Hyperparameter Tuning

Hyperparameter optimization on Spark is commonly memory-bound, where model training is done on data that doesn't fit on a single machine. We introduce Fugue-tune, an intuitive interface focused on compute-bound hyperparameter tuning that scales Hyperopt and Optuna by allowing them to leverage Spark and Dask without code changes.
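Fugue-tune's own API isn't shown in this abstract, so rather than guess at it, here is a minimal sketch of the compute-bound distributed tuning it builds on, using Hyperopt's documented SparkTrials backend; the objective and search space are illustrative toys:

    # Sketch: each hyperparameter evaluation runs as a Spark task, so the
    # search parallelizes across the cluster while the data stays small.
    from hyperopt import fmin, tpe, hp, SparkTrials

    def objective(params):
        # Stand-in for training a model and returning its validation loss.
        return (params["x"] - 3.0) ** 2

    space = {"x": hp.uniform("x", -10, 10)}
    trials = SparkTrials(parallelism=8)  # concurrent evaluations
    best = fmin(objective, space, algo=tpe.suggest, max_evals=64, trials=trials)
    print(best)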

FutureMetrics: Using Deep Learning to Create a Multivariate Time Series Forecasting Platform

Liquidity forecasting is one of the most essential activities at any bank. TD Bank, the largest of the Big Five, has to provide liquidity for half a trillion dollars in products and forecast it to remain within a $5BN buffer.

The use case was to predict liquidity growth over short to moderate time horizons: 90 days to 18 months. The model must perform reliably within a strict regulatory framework, so validating it to the required standards is a key area of focus for this talk. While univariate models are widely used for this reason, their performance is capped, preventing future improvement on these types of problems.

The most challenging aspect of this problem is that the data is shallow (p ≫ n): the primary cadence is monthly, and the chaotic nature of economic systems results in poor connectivity of behavior across transitions. The goal is to create an MLOps platform for these types of time series forecasting metrics across the enterprise.

Measuring the Success of Your Algorithm Using a Shadow System

How do you determine whether your new data product is a success if you cannot use A/B testing techniques?

At Gousto we recently implemented our newest algorithm for routing orders to sites. Comparing it to the previous algorithm using classic A/B testing techniques was not possible, because the algorithm requires the full set of orders to optimise over and to ensure that the volume we send to sites remains stable. A routing algorithm is a high-impact product, so to ensure confidence in ours before go-live, we came up with a different experimentation strategy: building a full-blown shadow system. To measure its performance, we built a set of data pipelines (including ETL) using Databricks.

Sometimes an A/B test cannot do the job. This talk will outline the challenges and benefits of building a shadow system, giving the audience an alternative to A/B testing and an overview of the relevant considerations when choosing and building this experiment design.
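The measurement step at the heart of a shadow system, lining up live decisions against shadow decisions for the same inputs, can be sketched in PySpark roughly as follows; the table and column names are hypothetical, not Gousto's:

    # Hypothetical sketch: join production and shadow routing decisions on
    # order_id and compute a simple agreement rate.
    from pyspark.sql import functions as F

    prod = spark.table("routing.prod_decisions")      # order_id, site
    shadow = spark.table("routing.shadow_decisions")  # order_id, site

    comparison = (prod.alias("p")
                  .join(shadow.alias("s"), "order_id")
                  .withColumn("agrees", F.col("p.site") == F.col("s.site")))

    comparison.agg(F.avg(F.col("agrees").cast("double"))
                   .alias("agreement_rate")).show()

From there, per-site volume shifts and cost deltas follow the same join-and-aggregate pattern.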

ÀLaSpark: Gousto's Recipe for Building Scalable PySpark Pipelines

Find out how Gousto develops its data pipelines at scale in a repeatable manner. At Gousto, we've developed Goustospark, a wrapper around PySpark that allows us to quickly and easily build data pipelines that are deployed into our Databricks environment.

This wrapper abstracts the repetitive components of all data pipelines, such as Spark configuration and metastore interactions. This lets a developer simply specify the blueprint of a pipeline before turning their attention to more pressing issues, such as data quality and data governance, while enjoying a high level of performance and reliability.

In this session we will deep-dive into the design patterns we followed and some unique approaches we've taken to structuring pipelines, and show a live demo of implementing a new Spark streaming pipeline in Databricks from scratch. We will even share some example Python code and snippets to help you build your own.
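Goustospark itself is internal, so as a purely hypothetical sketch of the wrapper pattern described above (every name is invented): a base class owns the Spark session and metastore I/O, and each pipeline only declares its transformation.

    # Hypothetical pipeline wrapper: boilerplate lives in the base class,
    # each pipeline only declares its transform.
    from pyspark.sql import DataFrame, SparkSession

    class Pipeline:
        def __init__(self, app_name: str):
            self.spark = SparkSession.builder.appName(app_name).getOrCreate()

        def read(self, table: str) -> DataFrame:
            return self.spark.table(table)

        def write(self, df: DataFrame, table: str) -> None:
            df.write.mode("overwrite").saveAsTable(table)

        def transform(self, df: DataFrame) -> DataFrame:
            raise NotImplementedError  # the one method a pipeline overrides

        def run(self, source: str, sink: str) -> None:
            self.write(self.transform(self.read(source)), sink)

    class CompletedOrders(Pipeline):
        def transform(self, df: DataFrame) -> DataFrame:
            return df.filter("status = 'complete'")

    CompletedOrders("completed_orders").run("raw.orders", "curated.completed_orders")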

How AARP Services, Inc. automated SAS transformation to Databricks using LeapLogic

While SAS has been a standard for analytics and data science use cases, it is not cloud-native and does not scale well. Join us to learn how AARP automated the conversion of hundreds of complex data processing, model scoring, and campaign workloads to Databricks using LeapLogic, an intelligent code transformation accelerator that can transform any and all legacy ETL, analytics, data warehouse, and Hadoop workloads to modern data platforms.

In this session, experts from AARP and Impetus will share how they collaborated with Databricks and were able to:

• Automate modernization of SAS marketing analytics based on coding best practices
• Establish a rich library of Spark and Python equivalent functions on Databricks with the same capabilities as SAS procedures, DATA step operations, macros, and functions (a sketch of such a mapping follows this list)
• Leverage Databricks-native services like Delta Live Tables to implement waterfall techniques for campaign execution and simplify pipeline monitoring
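Neither vendor's generated code is public, so the following is only a hypothetical illustration of the kind of DATA-step-to-PySpark mapping such a converter performs; the SAS original appears in the comments and all names are invented:

    # Hypothetical SAS-to-PySpark conversion. Original SAS DATA step:
    #   data work.high_value;
    #       set work.orders;
    #       where amount > 100;
    #       discounted = amount * 0.9;
    #   run;
    from pyspark.sql import functions as F

    orders = spark.table("work.orders")
    high_value = (orders
                  .filter(F.col("amount") > 100)                    # WHERE
                  .withColumn("discounted", F.col("amount") * 0.9)) # assignment
    high_value.write.mode("overwrite").saveAsTable("work.high_value")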

How to Automate the Modernization and Migration of Your Data Warehousing Workloads to Databricks

The logic in your data systems is the heartbeat of your organization's reports, analytics, dashboards, and applications. But that logic is often trapped in antiquated technologies that can't take advantage of the massive scalability of the Databricks Lakehouse.

In this session BladeBridge will show how to automate the conversion of this metadata and code into Databricks PySpark and DBSQL. BladeBridge will demonstrate the flexibility of configuring for N legacy technologies, facilitating an automated path not just for a single modernization project but for a factory approach to corporate-wide modernization.

BladeBridge will also present how you can empirically size your migration project to determine the level of effort required.

In this session you will learn:

• What BladeBridge Converter is
• What BladeBridge Analyzer is
• How BladeBridge configures Readers and Writers
• How to size a conversion effort
• How to accelerate adoption of Databricks

Welcome & Destination Lakehouse | Ali Ghodsi | Keynote | Data + AI Summit 2022

Join the Day 1 keynote to hear from Databricks co-founders, and original creators of Apache Spark and Delta Lake, Ali Ghodsi, Matei Zaharia, and Reynold Xin on how Databricks and the open source community are taking on the biggest challenges in data. The talks will address the latest updates on the Apache Spark and Delta Lake projects, the evolution of data lakehouse architecture, and how companies like Adobe and Amgen are using lakehouse architecture to advance their data goals.

Adversarial AI—The Nature of the Threat, Impacts, and Mitigation Strategies

Adversarial AI/ML is an emerging research area focused on the vulnerabilities of Artificial Intelligence (AI)/Machine Learning (ML) models to adversarial exploitation such as data poisoning, adversarial perturbations, and inference and extraction attacks. This research area is of particular interest to domains where AI/ML models play an essential role in mission-critical decision-making processes. In this presentation, we will review the four principal categories of Adversarial AI, discuss each of them with relevant and interesting examples, and consider the future implications. We will present our research in Adversarial NLP in greater depth, backed by specific data poisoning and adversarial perturbation attacks on NLP classifiers. We will conclude by discussing current mitigation approaches and methods and offering some general recommendations for how best to address Adversarial AI exploits.
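The talk's specific attacks aren't reproduced in this abstract, but as a generic, minimal illustration of one category it covers, data poisoning, here is a self-contained sketch that flips a fraction of training labels and compares the damage on a toy text classifier; the corpus and model are deliberately trivial:

    # Sketch: label-flipping data poisoning against a toy sentiment classifier.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    texts = ["good film", "great plot", "loved it",
             "bad acting", "awful movie", "hated it"] * 40
    labels = np.array([1, 1, 1, 0, 0, 0] * 40)
    X_tr, X_te, y_tr, y_te = train_test_split(texts, labels, random_state=0)

    def accuracy(train_labels):
        vec = TfidfVectorizer().fit(X_tr)
        clf = LogisticRegression().fit(vec.transform(X_tr), train_labels)
        return clf.score(vec.transform(X_te), y_te)

    rng = np.random.default_rng(0)
    poisoned = y_tr.copy()
    idx = rng.choice(len(poisoned), int(0.3 * len(poisoned)), replace=False)
    poisoned[idx] = 1 - poisoned[idx]  # flip 30% of training labels

    print("clean:", accuracy(y_tr), "poisoned:", accuracy(poisoned))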

AI-Fueled Forecasting: The Next Generation of Financial Planning

In an age of data abundance and digital disruption, CFOs are adopting next-generation planning capabilities to drive strategic decision making in real time. The future of forecasting is AI-driven. PrecisionView™, Deloitte's proprietary forecasting solution, combines data aggregation technologies with predictive analytics and machine learning capabilities to help businesses achieve improved forecasting accuracy.

Attend this webinar to hear about:

• AI-powered financial planning that helps generate high-impact insights by incorporating the organization's internal data and a myriad of external macroeconomic factors
• Examples of how companies have achieved success using scenario modelling
• Databricks' compute capabilities that allow for parallel processing, which helps generate near-real-time forecasts at the most granular levels (a common pattern for this is sketched below)
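PrecisionView's pipeline isn't public; a common Databricks pattern for the fine-grained parallel forecasting the last bullet describes is a grouped pandas UDF, one model fit per business unit, sketched here with invented table and column names and a deliberately naive trend model:

    # Sketch: fit one forecast per unit in parallel via applyInPandas.
    import numpy as np
    import pandas as pd

    history = spark.table("finance.monthly_actuals")  # unit, month, revenue

    def forecast(pdf: pd.DataFrame) -> pd.DataFrame:
        pdf = pdf.sort_values("month")
        t = np.arange(len(pdf))
        slope, intercept = np.polyfit(t, pdf["revenue"], 1)  # naive linear trend
        return pd.DataFrame({"unit": [pdf["unit"].iloc[0]],
                             "forecast_next_month": [slope * len(pdf) + intercept]})

    result = history.groupBy("unit").applyInPandas(
        forecast, schema="unit string, forecast_next_month double")
    result.show()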

Apache Spark Community Update | Reynold Xin Streaming Lakehouse | Karthik Ramasamy

Data + AI Summit Keynote talks from Reynold Xin and Karthik Ramasamy

Best Practices of Maintaining High-Quality Data

Data sits at the heart of machine learning algorithms, and your model is only as good as the data governance policies of the organization. The talk will cover multiple data governance frameworks, and then go in depth into one of the key areas of data governance policy: data quality. The session will cover the significance of data quality, the definition of goodness, and the key benefits and impact of maintaining high-quality data and processes. Not merely theoretical, the talk focuses on practical techniques and guidelines for maintaining data quality.
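As one minimal, generic example of such a practical technique, a pipeline can assert quality expectations before publishing a table; the thresholds and table names below are illustrative:

    # Sketch: a simple data-quality gate that fails fast on bad batches.
    from pyspark.sql import functions as F

    df = spark.table("staging.customers")
    total = df.count()

    null_emails = df.filter(F.col("email").isNull()).count()
    dup_ids = total - df.select("customer_id").distinct().count()

    assert null_emails / total < 0.01, f"email null rate: {null_emails}/{total}"
    assert dup_ids == 0, f"duplicate customer_id rows: {dup_ids}"

    df.write.mode("overwrite").saveAsTable("curated.customers")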

Big Data in the Age of Moneyball

Data and predictions have permeated sports, and our conversations around them, since the beginning. Who will win the big game this weekend? How many points will your favorite player score? How much money will be guaranteed in the next free-agent contract? One could argue that data-driven decisions in sports started with Moneyball in baseball, in 2003. In the two decades since, data and technology have exploded onto the scene. The Texas Rangers are using modern cloud software, such as Databricks, to help make sense of this data and provide actionable information to create a World Series team on the field. From computer vision, pose analytics, and player tracking, to pitch design, base-stealing likelihood, and more, come see how the Texas Rangers are using innovative cloud technologies to create action-driven reports from the current sea of Big Data. Finally, this talk will demonstrate how the Texas Rangers use MLflow and the Model Registry inside Databricks to organize their predictive models.
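The Rangers' models themselves aren't shown in this abstract; as a generic sketch of the MLflow Model Registry workflow mentioned at the end, registering a model version looks roughly like this (the features and model name are toy stand-ins):

    # Sketch: log a model and register it in the MLflow Model Registry.
    import mlflow
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.random.rand(100, 4)       # stand-in for pitch/runner features
    y = (X[:, 0] > 0.5).astype(int)  # stand-in label

    with mlflow.start_run():
        model = LogisticRegression().fit(X, y)
        # registered_model_name creates the model (or a new version) in the registry.
        mlflow.sklearn.log_model(model, artifact_path="model",
                                 registered_model_name="steal_likelihood")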

Building a Data Science as a Service Platform in Azure with Databricks

Machine learning in the enterprise is rarely delivered by a single team. To enable machine learning across an organisation, you need to target a variety of different skills, processes, technologies, and maturities. Doing this is incredibly hard and requires a composite of different techniques to deliver a single platform that empowers all users to build and deploy machine learning models.

In this session we discuss how Databricks enabled a data science as a service platform for one of the UK's largest household insurers. We look at how this platform is empowering users of all abilities to build models, deploy models, and realise a return on investment earlier.

Building Recommendation Systems Using Graph Neural Networks

RECKON (RECommendation systems using KnOwledge Networks) is a machine learning project centred around improving entity intelligence.

We represent the dataset of our site interactions as a heterogeneous graph. The nodes represent the various entities in the underlying data (Users, Articles, Authors, etc.). Edges between nodes represent interactions between these entities (User u has read Article v, Article u was written by Author v, etc.).

RECKON uses a GNN-based encoder-decoder architecture to learn representations for the important entities in our data by leveraging both their individual features and the interactions between them, through repeated graph convolutions.

Personalized recommendations play an important role in improving our users' experience and retaining them. We would like to take this opportunity to walk through some of the techniques we have incorporated in RECKON and the end-to-end build of this product on Databricks, along with a demo.
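RECKON's model isn't public, so here is only a minimal sketch of the heterogeneous graph-convolution encoder pattern the abstract describes, written with PyTorch Geometric's HeteroConv; the node and edge types mirror the abstract, while the sizes and reverse-edge names are assumptions:

    # Sketch: repeated heterogeneous graph convolutions over users, articles,
    # and authors, producing per-entity embeddings for a recommendation decoder.
    import torch
    from torch_geometric.nn import HeteroConv, SAGEConv

    class Encoder(torch.nn.Module):
        def __init__(self, hidden=64, layers=2):
            super().__init__()
            self.convs = torch.nn.ModuleList([
                HeteroConv({
                    ("user", "reads", "article"): SAGEConv((-1, -1), hidden),
                    ("article", "rev_reads", "user"): SAGEConv((-1, -1), hidden),
                    ("author", "writes", "article"): SAGEConv((-1, -1), hidden),
                    ("article", "rev_writes", "author"): SAGEConv((-1, -1), hidden),
                }, aggr="sum")
                for _ in range(layers)  # the "repeated graph convolutions"
            ])

        def forward(self, x_dict, edge_index_dict):
            for conv in self.convs:
                x_dict = {k: v.relu() for k, v in conv(x_dict, edge_index_dict).items()}
            return x_dict  # a decoder then scores (user, article) pairs

A dot-product decoder over the returned user and article embeddings is the simplest way to turn these representations into ranked recommendations.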
