talk-data.com

Topic: Databricks
Tags: big_data, analytics, spark
1286 tagged

Activity Trend: 515 peak/qtr (2020-Q1 to 2026-Q1)

Activities

1286 activities · Newest first

From Insights to Recommendations: How SkyWatch Predicts Demand for Satellite Imagery Using Databricks

SkyWatch is on a mission to democratize earth observation data and make it simple for anyone to use.

In this session, you will learn how SkyWatch aggregates demand signals for the EO market and turns them into monetizable recommendations for satellite operators. SkyWatch's Data & Platform Engineer, Aayush Patel, will share how the team built a serverless architecture that synthesizes customer requests for satellite images and identifies geographic locations with high demand, helping satellite operators maximize revenue while satisfying a broad range of EO-data-hungry consumers.

This session will cover:

  • Challenges with fulfillment in the Earth Observation ecosystem
  • Processing large-scale geospatial data with Databricks
  • Databricks' built-in H3 functions
  • Using Delta Lake to store data efficiently, leveraging optimization techniques like Z-Ordering
  • Data lakehouse architecture with serverless SQL endpoints and AWS Step Functions
  • Building tasking recommendations for satellite operators
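
The Z-Ordering mentioned above clusters rows so that records close in several dimensions land in the same files. As a rough, self-contained illustration of the idea (not Databricks' actual implementation), the classic trick is to interleave the bits of two columns into a single "Morton" key and sort by it:

```python
# Bit-interleaving ("Morton code") illustration of the idea behind Z-Ordering.
# Delta Lake's OPTIMIZE ... ZORDER BY handles this internally at the file
# level; this standalone sketch only shows why interleaving co-locates
# points that are close in several dimensions at once.

def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # bits of x at even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # bits of y at odd positions
    return z

# Sorting by the interleaved key clusters nearby (x, y) points together,
# which is the property that data skipping relies on.
points = [(3, 5), (4, 4), (100, 200), (3, 4)]
print(sorted(points, key=lambda p: interleave_bits(*p)))
# → [(3, 4), (3, 5), (4, 4), (100, 200)]
```

On Databricks the equivalent is plain SQL, e.g. `OPTIMIZE events ZORDER BY (lat, lon)` (table and column names hypothetical); the engine builds the keys and rewrites the files itself.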

Talk by: Aayush Patel

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

How the Texas Rangers Revolutionized Baseball Analytics with a Modern Data Lakehouse

Don't miss this session, where we demonstrate how we at the Texas Rangers organized our predictive models using MLflow and the Model Registry inside Databricks. We started using Databricks as a simple solution for centralizing our development on the cloud, which lessened the issue of siloed development in our team and allowed us to leverage the benefits of distributed cloud computing.

But we quickly found that Databricks was also a perfect solution to another problem we faced in our data engineering stack. Specifically, cost, complexity, and scalability issues hampered our data architecture development for years, and we decided we needed to modernize our stack by migrating to a lakehouse. With ad hoc analytics, ETL operations, and MLOps all living within the Databricks Lakehouse, development at scale has never been easier for our team.

Going forward, we hope to fully eliminate development silos and remove the disconnect between our analytics and data engineering teams. From computer vision, pose analytics, and player tracking to pitch design, base-stealing likelihood, and more, come see how the Texas Rangers are using innovative cloud technologies to create action-driven reports from the current sea of big data.

Talk by: Alexander Booth and Oliver Dykstra

Improve Apache Spark™ DS v2 Query Planning Using Column Stats

When running the TPC-DS benchmark against an external v2 data source, we observed that for several of the queries, DS v1 produces better join plans than DS v2. The main reason is that DS v1 uses column statistics, especially the number of distinct values (NDV), for query optimization. Currently, Spark™ DS v2 only has interfaces for data sources to report table statistics such as size in bytes and number of rows. To use column stats in DS v2, we have added new interfaces that allow external data sources to report column stats to Spark.

For a data source with huge data volumes, it is always challenging to compute column stats, especially the NDV. We plan to calculate NDV using the Apache DataSketches Theta sketch and save the serialized compact sketch in the statistics file. The NDV and other column stats will be reported to Spark for query plan optimization.
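
The Theta sketch itself comes from the Apache DataSketches library; as a self-contained sketch of the underlying principle only (a "k minimum values" estimator over a uniform hash, not the actual Theta implementation), estimating NDV from hashed samples looks roughly like this:

```python
# Minimal "k minimum values" (KMV) NDV estimator in pure Python, illustrating
# the principle behind sketch-based distinct counting. The Theta sketch in
# Apache DataSketches generalizes this; none of the code below is that library.
import hashlib

def hash64(v) -> int:
    """Map a value to a 64-bit integer, approximately uniform over the space."""
    return int.from_bytes(hashlib.blake2b(str(v).encode(), digest_size=8).digest(), "big")

def ndv_estimate(values, k=1024):
    """Estimate the number of distinct values from the k smallest hashes seen."""
    mins = sorted({hash64(v) for v in values})[:k]
    if len(mins) < k:             # fewer than k distinct hashes: count is exact
        return len(mins)
    # The k-th smallest hash sits at roughly fraction k/NDV of the hash space.
    return int((k - 1) * 2 ** 64 / mins[-1])

exact = 50_000
est = ndv_estimate(f"user-{i % exact}" for i in range(200_000))
print(exact, est)  # the estimate typically lands within a few percent
```

Because only the k smallest hashes are retained, the memory cost is fixed regardless of input size, which is what makes this approach practical for huge data sources.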

Talk by: Huaxin Gao and Parth Chandra

Here’s more to explore: Why the Data Lakehouse Is Your next Data Warehouse: https://dbricks.co/3Pt5unq Lakehouse Fundamentals Training: https://dbricks.co/44ancQs

Leveraging IoT Data at Scale to Mitigate Global Water Risks Using Apache Spark™ Streaming and Delta

Every year, billions of dollars are lost due to water risks from storms, floods, and droughts. Water data scarcity and excess are issues that risk models cannot overcome, creating a world of uncertainty. Divirod is building a platform of water data by normalizing diverse data sources of varying velocity into one unified data asset. In addition to publicly available third-party datasets, we are rapidly deploying our own IoT sensors. These sensors ingest signals at a rate of about 100,000 messages per hour into preprocessing, signal-processing, analytics, and postprocessing workloads in a single Spark Streaming pipeline to enable critical real-time decision-making. By leveraging a streaming architecture, we were able to reduce end-to-end latency from tens of minutes to just a few seconds.
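
The latency win reported above follows largely from arithmetic: under periodic batching, a message waits on average half a batch interval before processing even starts. A back-of-the-envelope sketch (the processing times below are assumptions, not Divirod's measured figures):

```python
# Back-of-the-envelope arithmetic (assumed processing times, not Divirod's
# measured figures) for why shrinking the batch interval cuts latency:
# a message waits on average half an interval before its batch even starts.

def avg_latency_s(batch_interval_s: float, processing_s: float) -> float:
    """Mean end-to-end latency: average wait for the next batch + processing."""
    return batch_interval_s / 2 + processing_s

print(avg_latency_s(3600, 120))  # hourly batch: 1920.0 s, roughly half an hour
print(avg_latency_s(5, 2))       # 5 s micro-batch: 4.5 s
```

This is why moving the same workloads into short micro-batches, as Structured Streaming does, drops latency by orders of magnitude even when per-record processing cost is unchanged.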

We are leveraging Delta Lake to provide a single query interface across multiple tables of this continuously changing data. This enables data science and analytics workloads to always use the most current and comprehensive information available. In addition to the obvious schema transformations, we implement data quality metrics and datum conversions to provide a trustworthy unified dataset.

Talk by: Adam Wilson and Heiko Udluft

Here’s more to explore: Big Book of Data Engineering: 2nd Edition: https://dbricks.co/3XpPgNV The Data Team's Guide to the Databricks Lakehouse Platform: https://dbricks.co/46nuDpI

Making the Shift to Application-Driven Intelligence

In the digital economy, application-driven intelligence delivered against live, real-time data will become a core capability of successful enterprises. It has the potential to improve the experience that you provide to your customers and deepen their engagement. But to make application-driven intelligence a reality, you can no longer rely only on copying live application data out of operational systems into analytics stores. Rather, it takes the unique real-time application-serving layer of a MongoDB database combined with the scale and real-time capabilities of a Databricks Lakehouse to automate and operationalize complex and AI-enhanced applications at scale.

In this session, we will show how seamless it can be for developers and data scientists to automate decisioning and actions on fresh application data, and we'll deliver a practical demonstration of how operational data can be integrated in real time to run complex machine learning pipelines.

Talk by: Mat Keep and Ashwin Gangadhar

Processing Delta Lake Tables on AWS Using AWS Glue, Amazon Athena, and Amazon Redshift

Delta Lake is an open source project that helps implement modern data lake architectures, commonly built on cloud storage. With Delta Lake, you can achieve ACID transactions, time travel queries, CDC, and other common use cases on the cloud.

There are many use cases for Delta tables on AWS. AWS has invested heavily in this technology, and Delta Lake is now available with multiple AWS services, such as AWS Glue Spark jobs, Amazon EMR, Amazon Athena, and Amazon Redshift Spectrum. AWS Glue is a serverless, scalable data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources. With AWS Glue, you can easily ingest data from multiple data sources such as on-premises databases, Amazon RDS, DynamoDB, and MongoDB into Delta Lake on Amazon S3, even without coding expertise.

This session will demonstrate how to get started with processing Delta Lake tables on Amazon S3 using AWS Glue, and how to query them from Amazon Athena and Amazon Redshift. The session also covers recent AWS service updates related to Delta Lake.

Talk by: Noritaka Sekiyama and Akira Ajisaka

Python with Spark Connect

PySpark has accomplished many milestones, such as Project Zen, and has been growing steadily. We introduced the pandas API on Spark, greatly improved usability (error messages, type hints, etc.), and PySpark has become virtually the standard for distributed computing in Python. With this trend, PySpark use cases have also become much more varied, especially for modern data applications such as notebooks, IDEs, and even smart home devices leveraging the power of data, which effectively need a lightweight, separate client. However, today's PySpark client is considerably heavy and does not allow separation from, for example, its scheduler, optimizer, and analyzer.

In Apache Spark 3.4, one of the key features we introduced in PySpark is the Python client for Spark Connect, a decoupled client-server architecture for Apache Spark that allows remote connectivity to Spark clusters using the DataFrame API and unresolved logical plans as the protocol. The separation between client and server allows Apache Spark and its open ecosystem to be leveraged from everywhere, and it can be embedded in modern data applications. In this talk, we will introduce what Spark Connect is, the internals of Spark Connect with Python, how to use Spark Connect with Python from the end-user perspective, and what's next beyond Apache Spark 3.4.

Talk by: Hyukjin Kwon and Ruifeng Zheng

Scaling MLOps for Demand Forecasting Across Multiple Markets for a Large CPG

In this session, we look at how one of the world's largest CPG companies set up a scalable MLOps pipeline for a demand forecasting use case that predicted demand for 100,000+ DFUs (demand forecasting units) on a weekly basis across more than 20 markets. This implementation resulted in significant cost savings through improved productivity, reduced cloud usage, and faster time to value, among other benefits. You will leave this session with a clearer picture of the following:

  • Best practices for scaling MLOps with Databricks and Azure for a demand forecasting use case with a multi-market, multi-region roll-out.
  • Best practices for model refactoring and setting up standard CI/CD pipelines for MLOps.
  • Pitfalls to avoid in such scenarios.

Talk by: Sunil Ranganathan and Vinit Doshi

Simon + Denny Live: Ask Us Anything

Simon and Denny have been discussing and debating all things Delta, Lakehouse, and Apache Spark™ on their regular web show. Whether you want advice on lake structures, want to hear their opinions on the latest trends and hype in the data world, or simply have a tech implementation question to throw at two seasoned experts, these two will have something to say on the matter. In their previous shows, Simon and Denny focused on building out a sample lakehouse architecture, refactoring and tinkering as new features came out, but now we're throwing the doors open for any and every question you might have.

So if you've had a persistent question and think these two can help, this is the session for you. A question submission form will be shared prior to the event, so the team will be prepped with a whole bunch of topics to talk through. Simon and Denny want to hear your questions, which they can field by drawing on a wealth of industry experience, wide-ranging community engagement, and their differing perspectives as external consultant and Databricks insider, respectively. There's also a chance they'll get distracted and go way off track talking about coffee, sci-fi, nerdery, or the English weather. It happens.

Talk by: Simon Whiteley and Denny Lee

Sponsored by: Qlik | Extracting the Full Potential of SAP Data for Global Automotive Manufacturing

Every year, organizations lose millions of dollars due to equipment failure, unscheduled downtime, or unoptimized supply chains because business and operational data are not integrated. During this session you will hear from experts at Qlik and Databricks on how global luxury automotive manufacturers are accelerating the discovery and availability of complex data sets like SAP. Learn how Qlik, Microsoft, and Databricks together deliver an integrated solution that combines the automated data delivery capabilities of Qlik Data Integration with the agility and openness of the Databricks Lakehouse platform and AI on Azure Synapse.

We'll explore how to leverage IT and OT data convergence to extract the full potential of business-critical SAP data, lower IT costs, and deliver real-time prescriptive insights at scale for more resilient, predictable, and sustainable supply chains. Learn how organizations can track and manage inventory levels, predict demand, optimize production, and identify opportunities for improvement.

Talk by: Matthew Hayes and Bala Amavasai

Sponsored by: Striim | Powering a Delightful Travel Experience with a Real-Time Operational Data Hub

American Airlines champions operational excellence in airline operations to provide the most delightful experience to our customers, with on-time flights and meticulously maintained aircraft. To modernize and scale technical operations with real-time, data-driven processes, we delivered a DataHub that connects data from multiple sources and delivers it to analytics engines and systems of engagement in real time. This enables operational teams to use any kind of aircraft data from almost any source imaginable and turn it into meaningful, actionable insights with speed and ease. It also empowers maintenance hubs to choose the best service and determine the most effective ways to utilize resources that can impact maintenance outcomes and costs. The end product is a smooth and scalable operation that results in a better experience for travelers. In this session, you will learn how we combine an operational data store (MongoDB) and a fully managed streaming engine (Striim) to enable analytics teams using Databricks with real-time operational data.

Talk by: John Kutay and Ganesh Deivarayan

Sponsored by: Toptal | Enable Data Streaming within Multicloud Strategies

Join Toptal as we discuss how we help organizations handle their data streaming needs in environments that span multiple cloud providers. We will delve into the data science and data engineering perspectives on this challenge. Embracing open formats and open source technologies while managing the solution through code are the keys to success.

Talk by: Christina Taylor and Matt Kroon

Streamlining API Deployment of ML Models Across Multiple Brands: Ahold Delhaize's Experience on Serverless

At Ahold Delhaize, we have 19 local brands. Most of our brands share common goals, such as providing personalized offers to their customers, building better search engines for their e-commerce websites, and using forecasting models to reduce food waste and ensure availability. As a central team, our goal is to standardize the way of working across all of these brands, including the deployment of machine learning models. To this end, we have adopted Databricks as our standard platform for our batch inference models.

However, API deployment for real-time inference models remained challenging due to the varying capabilities of our brands. Our attempts to standardize API deployments with different tools failed due to the complexity of our organization. Fortunately, Databricks has recently introduced a new feature: serverless API deployment. Since all our brands already use Databricks, this feature was easy to adopt. It allows us to easily reuse API deployment across all of our brands, significantly reducing time to market (from 6-12 months to one month), increasing efficiency, and reducing costs. In this session, you will see the solution architecture, a sample use case (a cross-sell model deployed to four different brands), and API deployment using the Databricks serverless API with a custom model.

Talk by: Maria Vechtomova and Basak Eskili

Unlock the Next Evolution of the Modern Data Stack With the Lakehouse Revolution -- with Live Demos

As the data landscape evolves, organizations are seeking innovative solutions that provide enhanced value and scalability without exploding costs. In this session, we will explore the exciting frontier of the Modern Data Stack on Databricks Lakehouse, a game-changing alternative to traditional Data Cloud offerings. Learn how Databricks Lakehouse empowers you to harness the full potential of Fivetran, dbt, and Tableau, while optimizing your data investments and delivering unmatched performance.

We will showcase real-world demos that highlight the seamless integration of these modern data tools on the Databricks Lakehouse platform, enabling you to unlock faster and more efficient insights. Witness firsthand how the synergy of Lakehouse and the Modern Data Stack outperforms traditional solutions, propelling your organization into the future of data-driven innovation. Don't miss this opportunity to revolutionize your data strategy and unleash unparalleled value with the lakehouse revolution.

Talk by: Kyle Hale and Roberto Salcido

US Army Corps of Engineers: Enhanced Commerce & National Security Through Data-Driven Geospatial Insight

The US Army Corps of Engineers (USACE) is responsible for maintaining and improving nearly 12,000 miles of shallow-draft (9'-14') inland and intracoastal waterways, 13,000 miles of deep-draft (14' and greater) coastal channels, and 400 ports, harbors, and turning basins throughout the United States. Because these components of the national waterway network are considered assets to both US commerce and national security, they must be carefully managed to keep marine traffic operating safely and efficiently.

The National DQM Program is tasked with providing USACE a nationally standardized remote monitoring and documentation system across multiple vessel types with timely data access, reporting, dredge certifications, data quality control, and data management. Government systems have often lagged commercial systems in modernization efforts, and the emergence of the cloud and Data Lakehouse Architectures have empowered USACE to successfully move into the modern data era.

This session incorporates aspects of the following topics:

  • Data lakehouse architecture: Delta Lake, platform security and privacy, serverless, administration, data warehouse, data lake, Apache Iceberg, Data Mesh
  • GIS: H3, Mosaic, spatial analysis
  • Data engineering: data pipelines, orchestration, CDC, medallion architecture, Databricks Workflows, data munging, ETL/ELT, lakehouses, data lakes, Parquet, Data Mesh, Apache Spark™ internals
  • Data streaming: Apache Spark Structured Streaming, real-time ingestion, real-time ETL, real-time ML, real-time analytics, real-time applications, Delta Live Tables
  • ML: PyTorch, TensorFlow, Keras, scikit-learn, Python and R ecosystems
  • Data governance: security, compliance, RMF, NIST
  • Data sharing: sharing and collaboration, Delta Sharing, data cleanliness, APIs

Talk by: Jeff Mroz

Accelerating the Development of Viewership Personas with a Unified Feature Store

With the proliferation of video content and flourishing consumer demand, there is an enormous opportunity for customer-centric video entertainment companies to use data and analytics to understand what their viewers want and deliver more of the content that meets their needs.

At DIRECTV, our Data Science Center of Excellence is constantly looking to push the boundary of innovation in how we can better and more quickly understand the needs of our customers and leverage those actionable insights to deliver business impact. One way we do so is through the development of Viewership Personas: cluster analysis at scale that groups our customers by the types of content they enjoy watching. This process is significantly accelerated by a unified feature store, which contains a wide array of features that capture key information on viewing preferences.

This talk will focus on how the DIRECTV Data Science team uses Databricks to develop a unified feature store, and how we leverage that feature store to accelerate running machine learning algorithms that find meaningful viewership clusters.
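
As a toy illustration of the cluster-analysis step (the feature vectors and the plain-Python k-means below are made up for exposition; at DIRECTV scale this would run on Spark with feature-store data):

```python
# Toy k-means sketch of persona clustering. The feature vectors and this
# plain-Python implementation are made up for exposition; at production
# scale this step would run on Spark ML or scikit-learn over feature-store data.
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm on tuples of numbers; returns (centers, clusters)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        centers = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

# Hypothetical weekly viewing-hour features: (sports, drama)
viewers = [(9, 1), (8, 2), (10, 0), (1, 9), (2, 8), (0, 10)]
centers, clusters = kmeans(viewers, k=2)
print(centers)  # one sports-heavy centroid, one drama-heavy centroid
```

Each resulting centroid summarizes a persona (here, sports-leaning vs. drama-leaning viewers); with a shared feature store, the same feature vectors feed this clustering and any downstream models without being rebuilt per project.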

Talk by: Malav Shah and Taylor Hosbach

Advanced Governance with Collibra on Databricks

A data lake is only as good as its governance. Understanding what data you have, performing classification, defining and applying security policies, and auditing how data is used make up the data governance lifecycle. Unity Catalog, with its rich ecosystem of supported tools, simplifies all stages of the data governance lifecycle. Learn how metadata can be hydrated into Collibra directly from Unity Catalog. Once the metadata is available in Collibra, we will demonstrate classification, defining security policies on the data, and pushing those policies into Databricks. All access and usage of data is automatically audited, with real-time lineage provided in the data explorer as well as in system tables.

Talk by: Leon Eller and Antonio Castelo

A Fireside Chat: Building Your Startup on Databricks

Are you interested in learning how leading startups build applications on Databricks and leverage the power of the lakehouse? Join us for a fireside chat with cutting-edge startups as we discuss real-world insights and best practices for building on the Databricks Lakehouse, as well as successes and challenges encountered along the way. This conversation will provide an opportunity to learn from and ask questions of panelists spanning all sectors.

Talk by: Chris Hecht, Derek Slager, Uri May, and Edward Chiu

AI-Accelerated Delta Tables: Faster, Easier, Cheaper

In this session, learn about recent releases for Delta Tables and the upcoming roadmap. Learn how to leverage AI to get blazing fast performance from Delta, without requiring users to do time-consuming and complicated tuning themselves. Recent releases like Predictive I/O and Auto Tuning for Optimal File Sizes will be covered, as well as the exciting roadmap of even more intelligent capabilities.

Talk by: Sirui Sun and Vijayan Prabhakaran

Best Practices for Setting Up Databricks SQL at Enterprise Scale

To learn more, visit the Databricks Security and Trust Center: https://www.databricks.com/trust

In this session, we will talk about best practices for setting up Databricks to run at large enterprise scale, with thousands of users, departmental security and governance, and end-to-end lineage from ingestion to BI tools. We'll showcase the power of Unity Catalog and Databricks SQL as the core of your modern data stack and how to achieve data, environment, and financial governance while empowering your users to quickly find and access the data they need.

Talk by: Siddharth Bhai, Paul Roome, Jeremy Lewallen, and Samrat Ray
