API

The Evolution of Delta Lake from Data + AI Summit 2024

2024-06-17 · Databricks DATA + AI Summit 2023 Watch

video

by Shant Hovsepian (Databricks)

AI/ML Data Lakehouse Databricks Delta DuckDB DWH Hudi Iceberg

Shant Hovsepian, Chief Technology Officer of Data Warehousing at Databricks, explains why Delta Lake is the most adopted open lakehouse format.

Includes:

Delta Lake UniForm GA (support for and compatibility with Hudi, Apache Iceberg, Delta)
Delta Lake Liquid Clustering
Delta Lake production-ready catalog (Iceberg REST API)
The growth and strength of the Delta ecosystem
Delta Kernel
DuckDB integration with Delta
Delta 4.0

JetBlue’s Real-Time AI & ML Digital Twin Journey Using Databricks

2023-07-26 · Databricks DATA + AI Summit 2023 Watch

video

by Rob Bajra , Derrick Olson

AI/ML Analytics BI Data Science Databricks NLP Data Streaming

Over the past year, JetBlue has embarked on an AI and ML transformation. Databricks has been instrumental in this transformation because it integrates streaming pipelines, ML training using MLflow, ML API serving using ML registry, and more in one cohesive platform. Real-time streams of weather, aircraft sensors, FAA data feeds, JetBlue operations, and more feed the world's first AI and ML operating system orchestrating a digital twin, known as BlueSky, for efficient and safe operations. JetBlue has over 10 ML products (each with multiple models) in production across multiple verticals, including dynamic pricing, customer recommendation engines, supply chain optimization, customer sentiment NLP, and several more.

The core JetBlue data science and analytics team consists of Operations Data Science, Commercial Data Science, AI and ML engineering and Business Intelligence. To facilitate the rapid growth and faster go-to-market strategy, the team has built an internal Data Catalog + AutoML + AutoDeploy wrapper called BlueML using Databricks features to empower data scientists including advanced analysts with the ability to train and deploy ML models in less than five lines of code.

Talk by: Derrick Olson and Rob Bajra

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Rapidly Implementing Major Retailer API at the Hershey Company

2023-07-26 · Databricks DATA + AI Summit 2023 Watch

video

by Simon Whiteley (Advancing Analytics) , Jordan Donmoyer

Analytics Data Lakehouse Databricks Delta SQL Data Streaming

Accurate, reliable, and timely data is critical for CPG companies to stay ahead in highly competitive retailer relationships, and for a company like the Hershey Company, the commercial relationship with Walmart is one of the most important. The team at Hershey found themselves with a looming deadline for their legacy analytics services and targeted a migration to the brand new Walmart Luminate API. Working in partnership with Advancing Analytics, the Hershey Company leveraged a metadata-driven Lakehouse Architecture to rapidly onboard the new Luminate API, helping the category management teams to overhaul how they measure, predict, and plan their business operations.

In this session, we will discuss the impact Luminate has had on Hershey's business covering key areas such as sales, supply chain, and retail field execution, and the technical building blocks that can be used to rapidly provision business users with the data they need, when they need it. We will discuss how key technologies enable this rapid approach, with Databricks Autoloader ingesting and shaping our data, Delta Streaming processing the data through the lakehouse and Databricks SQL providing a responsive serving layer. The session will include commentary as well as cover the technical journey.

Talk by: Simon Whiteley and Jordan Donmoyer

Using Cisco Spaces Firehose API as a Stream of Data for Real-Time Occupancy Modeling

2023-07-26 · Databricks DATA + AI Summit 2023 Watch

video

by Chris Inkpen , Paul Mracek

Data Engineering Data Lakehouse Databricks Delta Data Streaming

Honeywell manages the control of equipment for hundreds of thousands of buildings worldwide. Many of our outcomes relating to energy and comfort rely on knowing where people are in the building at any one time, so that we can target health and comfort conditions at the areas that are more densely populated. Many of these buildings have Cisco IT infrastructure in them. Using their Wi-Fi access points and the RSSI (signal strength) readings from people’s laptops and phones, Cisco can calculate the number of people in each area of the building. Cisco Spaces offers this data as a real-time streaming source. Honeywell HBT has utilized this stream by writing Delta Live Tables pipelines to consume it.

Honeywell buildings can now receive this firehose data from hundreds of concurrent customers and provide this occupancy data as a service to our vertical offerings in commercial, health, real estate and education. We will discuss the benefits of using DLT to handle this sort of incoming stream data, and illustrate the pain points we had and the resolutions we undertook in successfully receiving the stream of Cisco data. We will illustrate how our DLT pipeline was designed, and how it scaled to deal with huge quantities of real-time streaming data.

Talk by: Paul Mracek and Chris Inkpen

Here’s more to explore: Big Book of Data Engineering: 2nd Edition: https://dbricks.co/3XpPgNV The Data Team's Guide to the Databricks Lakehouse Platform: https://dbricks.co/46nuDpI

D-Lite: Integrating a Lightweight ChatGPT-Like Model Based on Dolly into Organizational Workflows

2023-07-26 · Databricks DATA + AI Summit 2023 Watch

video

by Ian Sotnek , Jacob Renn

AI/ML Analytics Databricks LLM MLOps

DLite is a new instruction-following model developed by AI Squared by fine-tuning the smallest GPT-2 model on the Alpaca dataset. Despite having only 124 million parameters, DLite exhibits impressive ChatGPT-like interactivity and can be fine-tuned on a single T4 GPU for less than $15.00. Due to its small relative size, DLite can be run locally on a wide variety of compute environments, including laptop CPUs, and can be used without sending data to any third-party API. This lightweight property of DLite makes it highly accessible for personal use, empowering users to integrate machine learning models and advanced analytics into their workflows quickly, securely, and cost-effectively.

Leveraging DLite within AI Squared's platform can empower organizations to orchestrate the integration of Dolly/DLite into business workflows, create personalized versions of Dolly/DLite, chain models or analytics to contextualize Dolly/DLite responses and prompts, and curate new datasets from real-time feedback.

Talk by: Jacob Renn and Ian Sotnek

Here’s more to explore: LLM Compact Guide: https://dbricks.co/43WuQyb Big Book of MLOps: https://dbricks.co/3r0Pqiz

Event Driven Real-Time Supply Chain Ecosystem Powered by Lakehouse

2023-07-26 · Databricks DATA + AI Summit 2023 Watch

video

by Harsh Mishra , Deepak Sekar

Data Lakehouse Databricks Delta IoT KPI Data Streaming

As the backbone of Australia’s supply chain, the Australian Rail Track Corporation (ARTC) plays a vital role in the management and monitoring of goods transportation across the 8,500 km of its rail network throughout Australia. ARTC provides weighbridges along its track that read train weights as trains pass at speeds of up to 60 kilometers an hour. This information is highly valuable and is required by both ARTC and its customers to provide accurate haulage weight details, analyze technical equipment, and help ensure wagons have been loaded correctly.

A total of 750 trains run across the 8,500 km network each day and generate real-time data at approximately 50 sensor platforms. With the help of Structured Streaming and Delta Lake, ARTC was able to analyze and store:

Precise train location
Weight of the train in real-time
Train crossing time to the second level
Train speed, temperature, sound frequency, and friction
Train schedule lookups

Once all the IoT data has been pulled together from an IoT event hub, it is processed in real time using Structured Streaming and stored in Delta Lake. To determine each train's GPS location, API calls are made per minute per train from the lakehouse. Further API calls are made in real time to a separate scheduling system to look up customer information. Once the processed and enriched data is stored in Delta Lake, an API layer on top of it exposes this data to all consumers.

The outcomes: increased transparency on weight data, which is now made available to customers; a digital data ecosystem that ARTC’s customers now use to meet their KPIs and planning needs; and the ability to determine temporary speed restrictions across the network, which improves train scheduling accuracy and allows network maintenance to be scheduled around train schedules and speed.

Talk by: Deepak Sekar and Harsh Mishra

Lineage System Table in Unity Catalog

2023-07-26 · Databricks DATA + AI Summit 2023 Watch

video

by Menglei Sun

Data Governance Data Lakehouse Databricks Delta Python Scala SQL

Unity Catalog provides fully automated data lineage for all workloads in SQL, R, Python, and Scala, across all asset types at Databricks. The aggregated view has been available to end users through Data Explorer and the API. In this session, we are excited to share that lineage is now also available as a Delta table in the customer's Unity Catalog metastore. The table stores the full history of recent lineage records in near real time, and customers can query it through a standard SQL interface. With that, customers can get significant operational insights about their workloads for impact analysis, troubleshooting, quality assurance, data discovery, and data governance.

Together with the system table platform effort, which provides query history, job run operational data, audit logs, and more, the lineage table will be a critical piece for linking all data assets and entity assets together, providing better lakehouse observability and unification to customers.
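As a rough illustration of the impact analysis use case, the sketch below walks lineage edges downstream from a changed table. The rows are hypothetical stand-ins for records queried from the lineage table; the table names and the (source, target) shape are assumptions for illustration, not the actual schema.

```python
from collections import defaultdict, deque

# Hypothetical lineage records: (source_table, target_table).
# In practice these would come from a SQL query against the lineage table.
LINEAGE_ROWS = [
    ("raw.orders", "silver.orders"),
    ("raw.customers", "silver.customers"),
    ("silver.orders", "gold.daily_revenue"),
    ("silver.customers", "gold.daily_revenue"),
    ("gold.daily_revenue", "reports.exec_dashboard"),
]


def downstream_tables(rows, start):
    """Return every table reachable from `start`, in breadth-first order."""
    children = defaultdict(list)
    for source, target in rows:
        children[source].append(target)

    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        table = queue.popleft()
        for child in children[table]:
            if child not in seen:  # guard against revisiting shared targets
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


impacted = downstream_tables(LINEAGE_ROWS, "raw.orders")
print(impacted)
```

Running the same traversal with the edges reversed would answer the troubleshooting question instead: which upstream tables feed a broken report.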

Talk by: Menglei Sun

Processing Prescriptions at Scale at Walgreens

2023-07-26 · Databricks DATA + AI Summit 2023 Watch

video

by Daniel Zafar

Cosmos Data Engineering Data Lakehouse Databricks Microsoft Spark Data Streaming

We designed a scalable Spark Streaming job to manage hundreds of millions of prescription-related operations per day with an end-to-end SLA of a few minutes and a lookup time of one second using CosmosDB.

In this session, we will share not only the architecture, but also the challenges of using the Spark Cosmos connector at scale and our solutions to them. We will discuss usages of the Aggregator API, custom implementations of the CosmosDB connector, and the major roadblocks we encountered along with the solutions we engineered. In addition, we collaborated closely with the Cosmos development team at Microsoft and will share the new features that resulted. If you ever plan to use Spark with Cosmos, you won't want to miss these gotchas!

Talk by: Daniel Zafar

An API for Deep Learning Inferencing on Apache Spark™

2023-07-26 · Databricks DATA + AI Summit 2023 Watch

video

by Lee Yang

AWS Glue Big Data Databricks ETL/ELT LLM MLOps PySpark Spark

Apache Spark is a popular distributed framework for big data processing. It is commonly used for ETL (extract, transform, and load) across large datasets. Today, the transform stage can often include the application of deep learning models to the data. For example, common models can be used for classification of images, sentiment analysis of text, language translation, anomaly detection, and many other use cases. Applying these models within Spark can be done today with the combination of PySpark, pandas_udf, and a lot of glue code. Often, that glue code can be difficult to get right, because it requires expertise across multiple domains: deep learning frameworks, PySpark APIs, pandas_udf internal behavior, and performance optimization.

In this session, we introduce a new, simplified API for deep learning inferencing on Spark, introduced in SPARK-40264 as a collaboration between NVIDIA and Databricks, which seeks to standardize and open source this glue code to make deep learning inference integrations easier for everyone. We discuss its design and demonstrate its usage across multiple deep learning frameworks and models.
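To make the "glue code" concrete, here is a minimal pure-Python sketch of the pattern such code usually has to get right: load the model once per worker, then score rows in batches. It does not use Spark or the API from SPARK-40264; every name in it (`make_predict_fn`, `batched`, `infer_partition`) and the toy keyword "model" are assumptions made for illustration.

```python
from typing import Callable, Iterable, Iterator, List

LOAD_COUNT = 0  # counts how often the (expensive) model load runs


def make_predict_fn() -> Callable[[List[str]], List[int]]:
    """Hypothetical model factory: load once, return a batch predict function."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    positive_words = {"good", "great", "love"}  # stand-in for real model weights

    def predict(batch: List[str]) -> List[int]:
        # One call scores a whole batch, as a deep learning framework would.
        return [
            int(any(word in positive_words for word in text.lower().split()))
            for text in batch
        ]

    return predict


def batched(rows: Iterable, batch_size: int) -> Iterator[list]:
    """Group an iterable of rows into lists of at most `batch_size` items."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


def infer_partition(rows: Iterable[str], batch_size: int = 2) -> List[int]:
    """Score one partition: load the model once, then predict batch by batch."""
    predict = make_predict_fn()
    results: List[int] = []
    for batch in batched(rows, batch_size):
        results.extend(predict(batch))
    return results


partition = ["Great flight", "lost my bag", "I love the crew"]
scores = infer_partition(partition)
print(scores, LOAD_COUNT)
```

The point of the factory is that the costly load happens once per partition rather than once per row, which is the kind of detail a standardized API can take off the user's hands.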

Talk by: Lee Yang

Building Apps on the Lakehouse with Databricks SQL

2023-07-26 · Databricks DATA + AI Summit 2023 Watch

video

by Adriana Ispas (Databricks) , Chris Stevens

BI CRM Data Lakehouse Databricks DWH JavaScript Python SaaS SQL

BI applications are undoubtedly one of the major consumers of a data warehouse. Nevertheless, the prospect of accessing data using standard SQL is appealing to many more stakeholders than just the data analysts. We’ve heard from customers that they experience an increasing demand to provide access to data in their lakehouse platforms from external applications beyond BI, such as e-commerce platforms, CRM systems, SaaS applications, or custom data applications developed in-house. These applications require an “always on” experience, which makes Databricks SQL Serverless a great fit.

In this session, we give an overview of the approaches available to application developers to connect to Databricks SQL and create modern data applications tailored to the needs of users across an entire organization. We discuss when to choose one of the Databricks native client libraries for languages such as Python, Go, or Node.js and when to use the SQL Statement Execution API, the newest addition to the toolset. We also explain when ODBC and JDBC might not be the best for the task and when they are your best friends. Live demos are included.
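For a sense of what calling the SQL Statement Execution API from an application might look like, the sketch below builds (but does not send) an HTTP request using only the Python standard library. The endpoint path and the `warehouse_id`, `statement`, and `wait_timeout` fields reflect our reading of the public REST documentation and should be verified against it; the host, token, and warehouse ID are placeholders.

```python
import json
import urllib.request


def build_statement_request(host: str, token: str, warehouse_id: str, sql: str) -> urllib.request.Request:
    """Build (but do not send) an HTTP request that submits one SQL statement."""
    payload = {
        "warehouse_id": warehouse_id,  # SQL warehouse that runs the statement
        "statement": sql,
        "wait_timeout": "30s",  # assumed: wait up to 30 seconds for the result
    }
    return urllib.request.Request(
        url=f"https://{host}/api/2.0/sql/statements",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; none of these identify a real workspace.
request = build_statement_request(
    host="example.cloud.databricks.com",
    token="<personal-access-token>",
    warehouse_id="<warehouse-id>",
    sql="SELECT 1",
)
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

A plain HTTP call like this needs no driver installation, which is one reason a REST interface can suit lightweight or serverless applications where ODBC or JDBC would be awkward.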

Talk by: Adriana Ispas and Chris Stevens

Python with Spark Connect

2023-07-26 · Databricks DATA + AI Summit 2023 Watch

video

by Ruifeng Zheng , Hyukjin Kwon (Databricks)

Databricks Pandas PySpark Python Spark

PySpark has reached many milestones, such as Project Zen, and continues to grow. We introduced the pandas API on Spark and greatly improved usability through better error messages, type hints, and more, and PySpark has become close to the standard for distributed computing in Python. With this trend, PySpark use cases have also become more complex, especially for modern data applications such as notebooks, IDEs, and even smart home devices that leverage the power of data and need a lightweight, separate client. However, today’s PySpark client is considerably heavy and cannot be separated from, for example, the scheduler, optimizer, and analyzer.

In Apache Spark 3.4, one of the key features we introduced in PySpark is the Python client for Spark Connect, a decoupled client-server architecture for Apache Spark that allows remote connectivity to Spark clusters using the DataFrame API and unresolved logical plans as the protocol. The separation between client and server allows Apache Spark and its open ecosystem to be leveraged from everywhere, and the client can be embedded in modern data applications. In this talk, we will introduce what Spark Connect is, the internals of Spark Connect with Python, how to use Spark Connect with Python from the end-user perspective, and what’s next beyond Apache Spark 3.4.

Talk by: Hyukjin Kwon and Ruifeng Zheng

Streamlining API Deploy ML Models Across Multiple Brands: Ahold Delhaize's Experience on Serverless

2023-07-26 · Databricks DATA + AI Summit 2023 Watch

video

by Basak Eskili , Maria Vechtomova (Marvelous MLOps)

AI/ML Databricks

At Ahold Delhaize, we have 19 local brands. Most of our brands have common goals, such as providing personalized offers to their customers, a better search engine on e-commerce websites, and forecasting models to reduce food waste and ensure availability. As a central team, our goal is to standardize the way of working across all of these brands, including the deployment of machine learning models. To this end, we have adopted Databricks as our standard platform for our batch inference models.

However, API deployment for real-time inference models remained challenging due to the varying capabilities of our brands. Our attempts to standardize API deployments with different tools failed due to the complexity of our organization. Fortunately, Databricks has recently introduced a new feature: serverless API deployment. Since all our brands already use Databricks, this feature was easy to adopt. It allows us to easily reuse API deployment across all of our brands, significantly reducing time to market (from 6-12 months to one month), increasing efficiency, and reducing costs. In this session, you will see the solution architecture, a sample use case (a cross-sell model deployed to four different brands), and API deployment using Databricks Serverless API with a custom model.

Talk by: Maria Vechtomova and Basak Eskili

US Army Corp of Engineers Enhanced Commerce & National Sec Through Data-Driven Geospatial Insight

2023-07-26 · Databricks DATA + AI Summit 2023 Watch

video

by Jeff Mroz

AI/ML Analytics Cloud Computing Data Engineering Data Governance Data Lake Data Lakehouse Data Management Data Quality Databricks Delta DWH +12 more

The US Army Corps of Engineers (USACE) is responsible for maintaining and improving nearly 12,000 miles of shallow-draft (9'-14') inland and intracoastal waterways, 13,000 miles of deep-draft (14' and greater) coastal channels, and 400 ports, harbors, and turning basins throughout the United States. Because these components of the national waterway network are considered assets to both US commerce and national security, they must be carefully managed to keep marine traffic operating safely and efficiently.

The National DQM Program is tasked with providing USACE a nationally standardized remote monitoring and documentation system across multiple vessel types with timely data access, reporting, dredge certifications, data quality control, and data management. Government systems have often lagged commercial systems in modernization efforts, and the emergence of the cloud and Data Lakehouse Architectures has empowered USACE to successfully move into the modern data era.

This session incorporates aspects of these topics:

Data Lakehouse Architecture: Delta Lake, platform security and privacy, serverless, administration, data warehouse, Data Lake, Apache Iceberg, Data Mesh
GIS: H3, MOSAIC, spatial analysis
Data engineering: data pipelines, orchestration, CDC, medallion architecture, Databricks Workflows, data munging, ETL/ELT, lakehouses, data lakes, Parquet, Data Mesh, Apache Spark™ internals
Data streaming: Apache Spark Structured Streaming, real-time ingestion, real-time ETL, real-time ML, real-time analytics, real-time applications, Delta Live Tables
ML: PyTorch, TensorFlow, Keras, scikit-learn, Python and R ecosystems
Data governance: security, compliance, RMF, NIST
Data sharing: sharing and collaboration, Delta Sharing, data cleanliness, APIs

Talk by: Jeff Mroz

Data & AI Products on Databricks: Making Data Engineering & Consumption Self-Service Data Platforms

2023-07-26 · Databricks DATA + AI Summit 2023 Watch

video

by Ankit Sharma

AI/ML Data Engineering Databricks Delta

Our client, a large IT and business consulting firm, embarked on a journey to create “Data As a Product” for both their internal and external stakeholders. In this project, Infosys took a data platform approach and leveraged Delta Sharing, API endpoints, and Unity Catalog to effectively create a realization of Data and AI Products (Data Mesh) architecture. This session presents the three primary design patterns used, providing valuable insights for your evolution toward a no-code/low-code approach.

Talk by: Ankit Sharma

Use Apache Spark™ from Anywhere: Remote Connectivity with Spark Connect

2023-07-26 · Databricks DATA + AI Summit 2023 Watch

video

by Martin Grund (Databricks) , Stefania Leone (Databricks)

Databricks Spark SQL

Over the past decade, developers, researchers, and the community at large have successfully built tens of thousands of data applications using Apache Spark™. Since then, use cases and requirements of data applications have evolved. Today, every application, from web services that run in application servers and interactive environments such as notebooks and IDEs to phones and edge devices such as smart home devices, wants to leverage the power of data. However, Spark's driver architecture is monolithic, running client applications on top of a scheduler, optimizer, and analyzer. This architecture makes it hard to address these new requirements, as there is no built-in capability to remotely connect to a Spark cluster from languages other than SQL.

Spark Connect introduces a decoupled client-server architecture for Apache Spark that allows remote connectivity to Spark clusters using the DataFrame API and unresolved logical plans as the protocol. The separation between client and server allows Spark and its open ecosystem to be leveraged from everywhere. It can be embedded in modern data applications, in IDEs, notebooks and programming languages. This session highlights how simple it is to connect to Spark using Spark Connect from any data applications or IDEs. We will do a deep dive into the architecture of Spark Connect and provide an outlook on how the community can participate in the extension of Spark Connect for new programming languages and frameworks bringing the power of Spark everywhere.
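The idea of "unresolved logical plans as the protocol" can be sketched with a toy model in plain Python. This is not Spark Connect itself; the classes, the JSON wire format, and the in-memory catalog below are all assumptions chosen to keep the example small. The client only records operations as a plan tree, and the server resolves names and executes.

```python
import json


# --- Client side: builds a plan only; holds no data and no engine ---
class RemoteFrame:
    def __init__(self, plan):
        self.plan = plan

    @staticmethod
    def table(name):
        # The table name is not checked here; it stays unresolved.
        return RemoteFrame({"op": "read", "table": name})

    def filter(self, column, minimum):
        return RemoteFrame(
            {"op": "filter", "column": column, "min": minimum, "child": self.plan}
        )

    def select(self, *columns):
        return RemoteFrame(
            {"op": "project", "columns": list(columns), "child": self.plan}
        )

    def to_wire(self):
        """Serialize the plan tree; this string is all the client sends."""
        return json.dumps(self.plan)


# --- Server side: owns the catalog, resolves names, runs the plan ---
CATALOG = {
    "trips": [
        {"city": "Oslo", "km": 12},
        {"city": "Rome", "km": 3},
        {"city": "Lima", "km": 40},
    ]
}


def execute(wire_plan):
    return _run(json.loads(wire_plan))


def _run(node):
    if node["op"] == "read":
        if node["table"] not in CATALOG:  # resolution happens on the server
            raise KeyError(f"unknown table: {node['table']}")
        return [dict(row) for row in CATALOG[node["table"]]]
    rows = _run(node["child"])
    if node["op"] == "filter":
        return [row for row in rows if row[node["column"]] >= node["min"]]
    if node["op"] == "project":
        return [{col: row[col] for col in node["columns"]} for row in rows]
    raise ValueError(f"unknown operation: {node['op']}")


plan = RemoteFrame.table("trips").filter("km", 10).select("city")
result = execute(plan.to_wire())
print(result)
```

Because the client never touches the catalog or the executor, it can stay small enough to embed in an IDE plugin or an edge device, which is the motivation the abstract describes.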

Talk by: Martin Grund and Stefania Leone

Delta Kernel: Simplifying Building Connectors for Delta

2023-07-26 · Databricks DATA + AI Summit 2023 Watch

video

by Denny Lee (Databricks) , Tathagata Das (Databricks)

Flink Data Lakehouse Databricks Delta PySpark Rust Trino

Since the release of Delta 2.0, the project has been growing at a breakneck speed. In this session, we will cover all the latest capabilities that make Delta Lake the best format for the lakehouse. Based on lessons learned from this past year, we will introduce Project Aqueduct and how we will simplify building Delta Lake APIs from Rust and Go to Trino, Flink, and PySpark.

Talk by: Tathagata Das and Denny Lee

Learn How to Reliably Monitor Your Data and Model Quality in the Lakehouse

2023-07-25 · Databricks DATA + AI Summit 2023 Watch

video

by Kasey Uhlenhuth (Databricks) , Alkis Polyzotis

AI/ML Data Engineering Data Lakehouse Databricks

Developing and maintaining production data engineering and machine learning pipelines is a challenging process for many data teams. Even more challenging is monitoring the quality of your data and models once they go into production. Building upon untrustworthy data can cause many complications for data teams. Without a monitoring service, it is challenging to proactively discover when your ML models degrade over time and to find the root causes behind the degradation. Furthermore, without lineage tracking, it is even more painful to debug errors in your models and data. Databricks Lakehouse Monitoring offers a unified service to monitor the quality of all your data and ML assets.

In this session, you’ll learn how to:

Use one unified tool to monitor the quality of any data product: data or AI
Quickly diagnose errors in your data products with root cause analysis
Set up a monitor with low friction, requiring only a button click or a single API call to start and automatically generate out-of-the-box metrics
Enable self-serve experiences for data analysts by providing reliability status for every data asset

Talk by: Kasey Uhlenhuth and Alkis Polyzotis

Advancements in Open Source LLM Tooling, Including MLflow

2023-07-25 · Databricks DATA + AI Summit 2023 Watch

video

by Corey Zumar (Databricks) , Ben Wilson (Databricks)

AI/ML Data Lakehouse Databricks GenAI LLM

MLflow is one of the most used open source machine learning frameworks, with over 13 million monthly downloads. With the recent advancements in generative AI, MLflow has been rapidly integrating support for many of the popular AI tools in use, such as Hugging Face, LangChain, and OpenAI. This means that it’s becoming easier than ever to build AI pipelines with your data as the foundation while expanding your capabilities with the incredible advancements of the AI community.

Come to this session to learn how MLflow can help you:

Easily grab open source models from Hugging Face and use Transformers pipelines in MLflow
Integrate LangChain for more advanced services and to add context into your model pipelines
Bring in OpenAI APIs as part of your pipelines
Quickly track and deploy models on the lakehouse using MLflow

Talk by: Corey Zumar and Ben Wilson

Sound Data Engineering in Rust—From Bits to DataFrames

2022-07-19 · Databricks DATA + AI Summit 2023 Watch

video

Data Engineering Databricks Parquet Rust Spark SQL

Spark applications often need to query external data sources such as file-based data sources or relational data sources. In order to do this, Spark provides Data Source APIs to access structured data through Spark SQL.

Data Source APIs have optimization rules such as filter push down and column pruning that reduce the amount of data to be processed and so improve query performance. As part of our ongoing project to provide generic Data Source V2 push down APIs, we have introduced partial aggregate push down, which significantly speeds up Spark jobs by dramatically reducing the amount of data transferred between data sources and Spark. We have implemented aggregate push down in both JDBC and Parquet.
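The effect of partial aggregate push down on data transfer can be sketched in plain Python. This is a toy model, not the Data Source V2 API: the partitions, function names, and the choice of SUM as the aggregate are assumptions for illustration. Each partition computes a partial sum per key inside the "source", and the "engine" only merges those partials.

```python
# Hypothetical data source split into three partitions of (key, value) rows.
PARTITIONS = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("a", 5)],
    [("b", 6), ("b", 7), ("a", 8)],
]


def scan_without_pushdown(partitions):
    """The source ships every row; the engine does all the aggregation."""
    transferred = [row for part in partitions for row in part]
    totals = {}
    for key, value in transferred:
        totals[key] = totals.get(key, 0) + value
    return totals, len(transferred)


def partial_sum(partition):
    """Runs inside the data source: one partial SUM per key per partition."""
    partial = {}
    for key, value in partition:
        partial[key] = partial.get(key, 0) + value
    return partial


def scan_with_pushdown(partitions):
    """The source ships only partial sums; the engine merges them."""
    partials = [partial_sum(part) for part in partitions]
    transferred = sum(len(partial) for partial in partials)
    totals = {}
    for partial in partials:
        for key, value in partial.items():
            totals[key] = totals.get(key, 0) + value
    return totals, transferred


plain_totals, plain_rows = scan_without_pushdown(PARTITIONS)
pushed_totals, pushed_rows = scan_with_pushdown(PARTITIONS)
print(plain_totals == pushed_totals, plain_rows, pushed_rows)
```

In this tiny example the saving is small, but the transferred volume with push down is bounded by the number of distinct keys per partition rather than the number of rows, so the gap widens as partitions grow.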

Mosaic: A Framework for Geospatial Analytics at Scale

2022-07-19 · Databricks DATA + AI Summit 2023 Watch

video

Analytics Databricks HTML Java PySpark Spark

In this session we’ll present Mosaic, a new Databricks Labs project with a geospatial flavour.

Mosaic provides users of Spark and Databricks with a unified framework for distributing geospatial analytics. Users can choose to employ existing Java-based tools such as JTS or Esri's Geometry API for Java and Mosaic will handle the task of parallelizing these tools' operations: e.g. efficiently reading and writing geospatial data and performing spatial functions on geometries. Mosaic helps users scale these operations by providing spatial indexing capabilities (using, for example, Uber's H3 library) and advanced techniques for optimising common point-in-polygon and polygon-polygon intersection operations.
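How a spatial index speeds up point-in-polygon lookups can be sketched in plain Python. This does not use Mosaic, JTS, or H3: a simple square grid stands in for the index, the exact test is a standard ray-casting check, and all names and sample geometries are assumptions for illustration.

```python
from collections import defaultdict
import math

CELL = 1.0  # grid cell size; a square grid stands in for an index such as H3


def cell_of(x, y):
    """Return the grid cell that contains the point (x, y)."""
    return (math.floor(x / CELL), math.floor(y / CELL))


def point_in_polygon(x, y, polygon):
    """Ray-casting test: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def build_index(polygons):
    """Map each grid cell to the polygons whose bounding box touches it."""
    index = defaultdict(list)
    for name, polygon in polygons.items():
        xs = [p[0] for p in polygon]
        ys = [p[1] for p in polygon]
        cx0, cy0 = cell_of(min(xs), min(ys))
        cx1, cy1 = cell_of(max(xs), max(ys))
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                index[(cx, cy)].append(name)
    return index


def locate(point, polygons, index):
    """Run the exact test only on polygons indexed under the point's cell."""
    candidates = index.get(cell_of(*point), [])
    return [
        name
        for name in candidates
        if point_in_polygon(point[0], point[1], polygons[name])
    ]


POLYGONS = {
    "square": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "triangle": [(3, 0), (6, 0), (3, 3)],
}
INDEX = build_index(POLYGONS)
print(locate((1.5, 0.5), POLYGONS, INDEX), locate((3.5, 1.0), POLYGONS, INDEX))
```

The index turns the join into a key lookup followed by a few exact tests, and because the cell ID is an ordinary join key, the work can be distributed across a cluster like any other equi-join.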

The development of Mosaic builds upon techniques developed with Ordnance Survey (the central hub for geospatial data across UK Government) and described in this blog post: https://databricks.com/blog/2021/10/11/efficient-point-in-polygon-joins-via-pyspark-and-bng-geospatial-indexing.html
