talk-data.com

Topic: Databricks
Tags: big_data, analytics, spark
1286 activities tagged

Activity Trend: 515 peak/qtr (2020-Q1 to 2026-Q1)

Activities

1286 activities · Newest first

Simplifying Migrations to Lakehouse—the Databricks Way

Customers around the world are experiencing tremendous success migrating from legacy on-premises Hadoop architectures to a modern Databricks Lakehouse in the cloud. At Databricks, we have formulated a migration methodology that helps customers sail through this migration journey with ease. In this talk, we will touch upon some of the key elements that minimize risks and simplify the process of migrating to Databricks, and will walk through some of the customer journeys and use cases.

Smart Manufacturing: Real-time Process Optimization with Databricks

Learn more about how a Fortune 500 aluminium rolled-stock manufacturer is leveraging a Tredence-Databricks Lakehouse-based Industrial Internet of Things (IIoT) solution to improve productivity by more than 20%.

Spark Data Source V2 Performance Improvement: Aggregate Push Down

Spark applications often need to query external data sources, such as file-based or relational sources. To do this, Spark provides Data Source APIs for accessing structured data through Spark SQL.

Data Source APIs include optimization rules such as filter push down and column pruning, which reduce the amount of data that needs to be processed and thereby improve query performance. As part of our ongoing project to provide generic Data Source V2 push down APIs, we have introduced partial aggregate push down, which significantly speeds up Spark jobs by dramatically reducing the amount of data transferred between data sources and Spark. We have implemented aggregate push down for both JDBC and Parquet.
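
To make this concrete, here is a minimal sketch (connection details and option values are illustrative placeholders, not from the talk) of how aggregate push down surfaces through the JDBC source in Spark 3.2+:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{count, max}

    val spark = SparkSession.builder().appName("agg-pushdown-sketch").getOrCreate()
    import spark.implicits._

    val orders = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/sales") // placeholder connection
      .option("dbtable", "orders")                           // placeholder table
      .option("pushDownAggregate", "true")                   // opt in to aggregate push down
      .load()

    // MAX and COUNT can now be evaluated inside the database; only the
    // (partial) aggregate results travel between the source and Spark.
    val summary = orders.groupBy($"region").agg(max($"amount"), count($"amount"))
    summary.explain() // a pushed-down plan lists the aggregates in the scan node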

Spark Inception: Exploiting the Apache Spark REPL to Build Streaming Notebooks

Join Scott Haines (Databricks Beacon) as he teaches you to write your own notebook-style service (like Jupyter, Zeppelin, or Databricks) for both fun (and profit?). Because haven't we all been a little curious about how notebook environments work? From the outside, things probably seem magical; just below the surface, however, there is a literal world of possibilities waiting to be exploited (both figuratively and literally) to assist in the building of unimaginable new creations. Curiosity is, of course, the foundation of creativity and novel ideation, and armed with the knowledge you'll pick up in this session, you'll gain an additional perspective and way of thinking (a mental model) for solving complex problems using dynamic, procedural (on-the-fly) code compilation.

Did I mention you'll use Spark Structured Streaming in order to generate a "live" communication channel between your Notebook service and the "outside world"?

Overview: During this session you'll learn to build your own notebook-style service on top of Apache Spark and the Scala ILoop. Along the way, you'll uncover how to harness the SparkContext to manage, drive, and scale your own procedurally defined Apache Spark applications by mixing core configuration with other "magic". As we move through the steps necessary to achieve this end result, you'll learn to run individual paragraphs, or an entire synchronous waterfall of paragraphs, leading to the dynamic generation of applications.

Take a deep dive into the world of possibilities that opens up from a solid understanding of procedurally generated, on-the-fly code compilation (live injection) and its security ramifications (because of course this is unsafe!), and come away with a new mental model for architecting composite or auto-generated applications.
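
For a taste of the core trick, here is a minimal sketch (assuming Scala 2.12; the names and wiring are illustrative, not the speaker's actual service) of embedding the Scala REPL to compile and run a notebook "paragraph" on the fly:

    import scala.tools.nsc.Settings
    import scala.tools.nsc.interpreter.IMain

    val settings = new Settings
    settings.usejavacp.value = true // resolve classes from the JVM classpath

    val interpreter = new IMain(settings)

    // A notebook "paragraph" is just a string handed to the embedded compiler.
    val paragraph =
      """
      val xs = (1 to 10).toList
      println("sum = " + xs.sum)
      """
    interpreter.interpret(paragraph) // compiles and executes, printing "sum = 55"
    interpreter.close()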

Streaming ML Enrichment Framework Using Advanced Delta Table Features

This talk is about the challenge of building a scalable framework for data scientists and ML engineers that can accommodate hundreds of generic or customer-specific ML models, running in both streaming and batch, and capable of processing 100+ million records per day from social media networks.

The goal was achieved using Spark and Delta. Our framework is built on clever use of Delta features such as change data feed, selective merge, and Spark Structured Streaming from and into Delta tables. The data is saved in multiple Delta tables, each structured to reflect a particular step in the overall flow. This brings great efficiency: the downstream processing performs very few transformations, so even people without extensive experience writing ML pipelines and jobs can use the framework easily. At the heart of the framework is a series of Spark Structured Streaming jobs continuously evaluating rules and determining which social media content should be processed by which model. These rules can be updated by users at any time, and the framework automatically adjusts its processing. In an environment like this, the ability to track records throughout the whole process and the atomicity of operations are of utmost importance, and Delta tables provide all of this out of the box.

In this talk we focus on the ideas behind the framework and on efficiently combining Structured Streaming with Delta tables. Key takeaways include some of the lesser-known Delta table features and real-life experience from building an ML framework on scalable big data technologies, showing how capable and fast such a solution can be, even with minimal hardware resources.
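
As a flavor of the Delta features involved, here is a minimal sketch (paths and table names are placeholders, not the authors' code) of streaming a table's change data feed and writing enriched output back to Delta:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("enrichment-sketch").getOrCreate()

    // Stream only the changes; the source table must have the table property
    // delta.enableChangeDataFeed = true set.
    val changes = spark.readStream
      .format("delta")
      .option("readChangeFeed", "true")
      .load("/tables/social_content") // placeholder path

    // Score only newly inserted records (model invocation elided).
    val enriched = changes.filter("_change_type = 'insert'")

    enriched.writeStream
      .format("delta")
      .option("checkpointLocation", "/checkpoints/enrichment") // placeholder
      .start("/tables/enriched_content")                       // placeholder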

Supercharge your SaaS applications with a modern, cloud-native database

Today’s world demands modern applications that process data at faster speeds and deliver real-time insights. Yet the challenge for most businesses is that their data infrastructure isn't designed for data intensity: the idea that high volumes of data should be quickly ingested and processed, no matter how complex or diverse the data sets. How do you meet the demands of a data-intensive application? It starts with the right database. This session gives you a roadmap with key criteria for powering modern, data-intensive applications with a cloud-native database, and shows how three customers drove up to 100x better performance for their applications.

Survey of Production ML Tech Stacks

Production machine learning demands stitching together many tools, ranging from open source standards to cloud-specific and third-party solutions. This session surveys the current ML deployment technology landscape to contextualize which tools address which features of production ML systems, such as CI/CD, REST endpoints, and monitoring. It will help answer the questions: What tools are out there? Where do I start with the MLOps tech stack for my application? What are the pros and cons of open source versus managed solutions? The talk takes a feature-driven approach to tool selection for MLOps stacks, offering best practices in this rapidly evolving corner of data science.

Take Databricks Lakehouse to the Max with Informatica​

The hard part of ML and analytics is not building data models. It’s getting the data right and into production. Join us to learn how Informatica’s Intelligent Data Management Cloud (IDMC) helps you maximize the benefits of the Databricks Unified Analytics Platform. Learn how our cloud-native capabilities can shorten your time to results. See how to enable more data users to easily load data and develop data engineering workflows on Databricks in ELT mode at scale. Find out how Informatica delivers all the necessary governance and compliance guardrails you need to operate analytics, AI, and ML. Accelerate adoption and maximize agility while maintaining control of your data and lowering risk.

The Future is Open - a Look at Google Cloud’s Open Data Ecosystem

Join Anagha Khanolkar and Mansi Maharana, both Cloud Customer Engineers specializing in Advanced Analytics, to learn about Open Data Analytics on Google Cloud. This session will cover Google Data Cloud's Open Data Analytics portfolio, value proposition, customer stories, trends, and more, including Databricks on GCP.

The Future of Data - What’s Next with Google Cloud

Join Bruno Aziza, Head of Data and Analytics, Google Cloud, for an in-depth look at what he is seeing in the future of data and emerging trends. He will also cover Google Cloud’s data analytics practice, including insights into the Data Cloud Alliance, BigLake, and our strategic partnership with Databricks.

The Modern Metadata Platform: What, Why, and How?

Recently there has been a lot of buzz in the data community on the topic of metadata management. It’s often discussed in the context of data discovery, data provenance, data governance, and data privacy. Even Gartner and Forrester have created the new Active Metadata Management and Enterprise Data Fabric categories to highlight the development in this area.

However, metadata management isn't actually a new problem. It has just taken on a whole new dimension with the widespread adoption of the Modern Data Stack. What used to be a small, esoteric issue that concerned only the core data team has exploded into a complex, organizational challenge that plagues companies large and small.

In this talk, we’ll explain how a Modern Metadata Platform (MMP) can help solve these new challenges and the key ingredients to building a scalable and extensible MMP.

The Semantics of Biology—Vaccine and Drug Research with Knowledge Graphs and Logical Inferencing

From the organization of the tree of life to the tissues and structures of living organisms, trees and graphs are recurring data structures in biology. Given the tree-like relationships between biological entities, Knowledge Graphs are emerging as the ideal way to store and retrieve biological data.

In our first Data + AI talk (https://www.youtube.com/watch?v=Kj5bZ2afWSU), we presented the Bellman open source library (https://github.com/gsk-aiops/bellman). Bellman was developed to translate SPARQL queries into Apache Spark Dataset operations so that scientists can submit graph queries in familiar environments like Jupyter and Databricks notebooks.

In this talk, we present the new logical inferencing capabilities we've built into the Bellman OSS library. We will demonstrate how connections between biological entities that are not explicitly connected in the data can be deduced from ontologies. These inferred connections are returned to the scientist to aid the discovery of new connections, with the intent of accelerating gene-to-disease research. To demonstrate these capabilities, we will take a deep dive into the "subclassOf" logical entailment, retrieving all subclasses of a biological entity. We will also compare the performance characteristics of inference algorithms such as forward and backward chaining.
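
To illustrate the entailment itself, here is a toy sketch (not Bellman's API) of forward chaining over subclass relationships, which amounts to computing a transitive closure until a fixpoint is reached:

    // Derive (a, c) whenever (a, b) and (b, c) are present, until nothing new appears.
    def forwardChain(subClassOf: Set[(String, String)]): Set[(String, String)] = {
      val derived = for {
        (a, b)  <- subClassOf
        (b2, c) <- subClassOf
        if b == b2
      } yield (a, c)
      val next = subClassOf ++ derived
      if (next == subClassOf) next else forwardChain(next)
    }

    val ontology = Set(("Neuron", "Cell"), ("Cell", "AnatomicalEntity")) // toy data
    // Inference adds ("Neuron", "AnatomicalEntity") even though that triple is
    // never stated explicitly in the data.
    println(forwardChain(ontology))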

Tools for Assisted Apache Spark Version Migrations, From 2.1 to 3.2+

This talk looks at the current state of tools for automating library and language upgrades in Python and Scala, and applies them to upgrading to new versions of Apache Spark. A very informal survey suggests that many users are stuck on no-longer-supported versions of Spark, so this talk expands on the first attempt at automating upgrades (2.4 to 3.0) to explore the problem all the way back to 2.1.
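
As a hypothetical example of the kind of mechanical rewrite such tooling automates: HiveContext, deprecated in Spark 2.0 and removed in 3.0, has to become a SparkSession with Hive support:

    import org.apache.spark.sql.SparkSession

    // Before (Spark 2.x era): val hc = new org.apache.spark.sql.hive.HiveContext(sc)
    // After (Spark 3.x): the session below replaces HiveContext entirely.
    val spark = SparkSession.builder()
      .appName("migrated-job")
      .enableHiveSupport()
      .getOrCreate()

    val df = spark.sql("SELECT 1 AS ok") // queries formerly issued via hc.sql(...)
    df.show()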

Towards Dynamic Microstructure: The Role of Machine Learning in the Next Generation of Exchanges

What role will AI and machine learning play in ensuring the efficiency and transparency of the next generation of markets?

In this session, Douglas Hamilton (AVP, Machine Intelligence Lab) and Michael O’Rourke (SVP, Engineering & AI/ML) will show attendees how Nasdaq is building dynamic microstructures that reduce the inherent frictions associated with trading, and give insights into their application across industries.

Turbocharge your AI/ML Databricks workflows with Precisely

Trusted analytics and predictive data models require accurate, consistent, and contextual data. The more attributes used to fuel models, the more accurate their results. However, building comprehensive models with trusted data is not easy. Accessing data from multiple disparate sources, making spatial data consumable, and enriching models with reliable third-party data is challenging.

In response to these challenges, Precisely has developed tools to facilitate a location-enabled lakehouse on the Databricks platform, helping users get more out of their data. Come see live demos and learn how to build your own location-enabled lakehouse by:

• Organizing and managing address data and assigning a unique and persistent identifier
• Enriching addresses with standard and dynamic attributes from our curated data portfolio
• Analyzing enriched data to uncover relationships and create dashboard visualizations
• Understanding the high-level solution architecture

Turning Big Biology Data into Insights on Disease – The Power of Circulating Biomarkers

Profiling small molecules in human blood across global populations gives rise to a greater understanding of the varied biological pathways and processes that contribute to human health and diseases. Herein, we describe the development of a comprehensive Human Biology Database, derived from nontargeted molecular profiling of over 300,000 human blood samples from individuals across diverse backgrounds, demographics, geographical locations, lifestyles, diseases, and medication regimens, and its applications to inform drug development.

Approximately 11,000 circulating molecules have been captured and measured per sample using Sapient’s high-throughput, high-specificity rapid liquid chromatography-mass spectrometry (rLC-MS) platform. The samples come from cohorts with adjudicated clinical outcomes from prospective studies lasting 10-25 years, along with data on individuals’ diet, nutrition, physical exercise, and mental health. Genetic information for a subset of subjects is also included, and we have added microbiome sequencing data from over 150,000 human samples across diverse diseases.

An efficient data science environment has been established to enable effective health insight mining across this vast database. Built on a customized AWS and Databricks “infrastructure-as-code” Terraform configuration, we employ streamlined data ETL and machine learning-based approaches for rapid rLC-MS data extraction. In mining the database, we have been able to identify circulating molecules potentially causal to disease; illuminate the impact of human exposures, such as diet and environment, on disease development, aging, and mortality over decades; and support drug development efforts through the identification of biomarkers of target engagement, pharmacodynamics, safety, efficacy, and more.

Unifying Data Science and Business: AI Augmentation/Integration in Production Business Applications

Why is it so hard to integrate machine learning into real business applications? In 2019, Gartner predicted that AI augmentation would solve this problem, creating $2.9 trillion of business value and 6.2 billion hours of worker productivity in 2021. A new realm of business science methods, encompassing AI-powered analytics that allow people with domain expertise to make smarter decisions faster and with more confidence, has also emerged as a solution to this problem. Dr. Harvey will demystify why integration challenges still account for $30.2 billion in annual global losses, and will discuss what it takes to integrate AI/ML code or algorithms into real business applications and the effort that goes into making each component (data collection, preparation, training, and serving) production-ready, enabling organizations to use the results of integrated models repeatedly with minimal user intervention. Finally, Dr. Harvey will discuss AISquared’s integration with Databricks and MLflow to accelerate the integration of AI by unifying data science with business. By adding five lines of code to your model, users can leverage AISquared’s model integration API framework, which provides a quick and easy way to integrate models directly into live business applications.

Unity Catalog: Journey to Unified Governance for Your Data and AI Assets on Lakehouse

Modern data assets take many forms: not just files or tables, but dashboards, ML models, and unstructured data like video and images, none of which can be governed and managed by legacy data governance solutions. Join this session to learn how data teams can use Unity Catalog to centrally manage all data and AI assets with a common governance model based on familiar ANSI SQL, ensuring better native performance and security. Built-in automated data lineage provides end-to-end visibility into how data flows from source to consumption, so organizations can identify and diagnose the impact of data changes. Unity Catalog delivers the flexibility to leverage existing data catalogs and solutions and to establish future-proof, centralized governance without expensive migration costs. It also creates detailed audit reports for data compliance and security, while ensuring data teams can quickly discover and reference data for BI, analytics, and ML workloads, accelerating time to value.
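
For a sense of the familiar SQL surface, a brief sketch (catalog, schema, and group names are placeholders; assumes an active SparkSession on a Unity Catalog-enabled workspace):

    // Three-level namespace (catalog.schema.table) plus ANSI-style grants.
    spark.sql("CREATE CATALOG IF NOT EXISTS analytics")
    spark.sql("CREATE SCHEMA IF NOT EXISTS analytics.sales")
    spark.sql("GRANT SELECT ON TABLE analytics.sales.orders TO `data-analysts`")
    spark.sql("SHOW GRANTS ON TABLE analytics.sales.orders").show()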

What to Do When Your Job Goes OOM in the Night (Flowcharts!)

Have you ever had a Spark job just stop working, with no idea where to start debugging? Or a job that used to complete in minutes now taking hours? Or are you just tired of answering user questions? Come join us for a fun detour into the world of out-of-memory exceptions, slow jobs, and other things that make our lives sad, and leave with techniques to make them happy again. The flowchart builds on Anya's initial Spark tuning flowchart, updated with our collective experience fixing broken Spark jobs. The talk will wrap up with the methodology we used and how you can contribute to the flowchart (a.k.a. how we guilt you into writing pull requests).
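
In the flowchart's spirit, here is a first-pass sketch of the memory-related settings typically inspected when an executor dies with an OOM (the values are illustrative, not recommendations):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("oom-debug-sketch")
      .config("spark.executor.memory", "8g")          // JVM heap per executor
      .config("spark.executor.memoryOverhead", "2g")  // off-heap headroom (shuffle, netty, Python)
      .config("spark.sql.shuffle.partitions", "400")  // more partitions -> smaller per-task footprint
      .getOrCreate()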

Why a Data Lakehouse is Critical During the Manufacturing Apocalypse

COVID has changed the way we work and the way we must do business. Supply chain disruptions have impacted manufacturers’ ability to make and distribute products. Logistics challenges and labor shortages have forced us to staff differently. The existential threat is real, and we must change the way we analyze data and solve problems in real time in order to stay relevant.

In this session, you’ll learn about our journey, why the data lake and digital technology are essential to survival in this new world, practical examples of how machine learning and data pipelines enable faster decision making, and why businesses cannot survive without these capabilities.
