talk-data.com

Event

Databricks DATA + AI Summit 2023

2026-01-11 YouTube

Activities tracked

582

Sessions & talks

Showing 276–300 of 582 · Newest first

Live from the Lakehouse: industry outlook from Simon Whiteley & AI policy from Matteo Quattrocchi


2023-07-14 Watch
video
Matteo Quattrocchi (BSA | The Software Alliance) , Simon Whiteley (Advancing Analytics)

Hear from two guests. First, Simon Whiteley (co-owner, Advancing Analytics) on his reaction to industry announcements, where he sees the industry heading, and an introduction to his community at Advancing Analytics. Second guest, Matteo Quattrocchi (Director - Policy, EMEA at BSA | The Software Alliance) on the current state of AI policies set by international governments, global committees, and individual companies. Hosted by Ari Kaplan (Head of Evangelism, Databricks) and Pearl Ubaru (Sr Technical Marketing Engineer, Databricks).

Live from the Lakehouse: Lakehouse observability, and Delta Lake. With Michael Milirud and Denny Lee


2023-07-14 Watch
video
Denny Lee (Databricks) , Michael Milirud (Databricks)

Hear from two guests. First, Michael Milirud (Sr Manager, Product Management, Databricks) on Lakehouse monitoring and observability. Second guest, Denny Lee (Sr Staff Developer Advocate, Databricks), discusses Delta Lake. Hosted by Holly Smith (Sr Resident Solutions Architect, Databricks) and Jimmy Obeyeni (Strategic Account Executive, Databricks)

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Live from the Lakehouse: LLMs, AutoML, modern data stacks: Ben Lorica, Conor Jensen, & Franco Patano


2023-07-14 Watch
video
Conor Jensen (Dataiku) , Ben Lorica (Gradient Flow) , Franco Patano (Databricks)

Hear from three guests. First, Ben Lorica (Principal, Gradient Flow) on AI and LLMs. Second guest, Conor Jensen (Field CDO, Dataiku), discusses democratizing AI through AutoML, LLMs, and the role of Field CDOs. Third guest, Franco Patano (Lead Product Specialist, Databricks), on modern data stacks and the technology community. Hosted by Holly Smith (Sr Resident Solutions Architect, Databricks) and Jimmy Obeyeni (Strategic Account Executive, Databricks).


Live from the Lakehouse: LLMs, LangChain, and analytics engineering workflow with dbt Labs


2023-07-14 Watch
video
Drew Banin (Fishtown Analytics) , Nicolas Palaez (Databricks) , Harrison Chase (LangChain)

Hear from three guests. Harrison Chase (CEO, LangChain) and Nicolas Palaez (Sr. Technical Marketing Manager, Databricks) on LLMs and generative AI. Third guest, Drew Banin (co-founder, dbt Labs), discusses analytics engineering workflow with his company dbt Labs, how he started the company, and how they provide value with the Databricks partnership. Hosted by Ari Kaplan (Head of Evangelism, Databricks) and Pearl Ubaru (Sr Technical Marketing Engineer, Databricks)


Live from the Lakehouse: Machine Learning, LLM, Delta Lake, and data engineering


2023-07-14 Watch
video
Jason Pohl (Databricks) , Caryl Yuhas (Databricks)

Hear from two guests. First, Caryl Yuhas (Global Practice Lead, Solutions Architect, Databricks) on Machine Learning & LLMs. Second guest, Jason Pohl (Sr. Director, Field Engineering), discusses Delta Lake and data engineering. Hosted by Holly Smith (Sr Resident Solutions Architect, Databricks) and Jimmy Obeyeni (Strategic Account Executive, Databricks)


Live from the Lakehouse: Machine Learning, LLM & market changes over the past decade & data strategy


2023-07-14 Watch
video
Richard Garris (Databricks) , Robin Sutara (Databricks)

Hear from two guests. First, Richard Garris (Global Product Specialists Leader, Databricks) on Machine Learning, LLMs, and his decade journey at Databricks. Second guest, Robin Sutara (Field CTO, Databricks) on data strategy, and the learnings from her role as Field CTO. Hosted by Ari Kaplan (Head of Evangelism, Databricks) and Pearl Ubaru (Sr Technical Marketing Engineer, Databricks)


Live from the Lakehouse: pre-show sideline reporting, from the Data & AI Summit by Databricks


2023-07-14 Watch
video
Pearl Ubaru (Databricks) , Ari Kaplan (Databricks)

With 75k attendees (and 12k in person at the sold-out show), the conference is kicked off by Ari Kaplan (Head of Evangelism, Databricks) and Pearl Ubaru (Sr Technical Marketing Engineer, Databricks). Hear what to expect on the state of data and AI, Databricks, the community, and why the theme is "Generation AI". WE are the generation to make AI a reality, and we all can have a part in shaping this new phase of technology and humanity.

Data + AI Summit Keynote Thursday


2023-06-29 Watch
video
Marc Andreessen (Andreessen Horowitz) , Arsalan , Lin Qiao , Jitendra Malik (University of California, Berkeley) , Eric Schmidt (Google (Alphabet)) , Ali Ghodsi (Databricks) , Reynold Xin (Databricks) , Hannes Muhleisen , Matei Zaharia (Databricks) , Michael Armbrust (Databricks) , Harrison Chase (LangChain)

0:00 Open
6:08 Ali Ghodsi & Marc Andreessen
32:06 Reynold Xin
48:09 Michael Armbrust
1:00:00 Matei Zaharia & Panel
1:27:10 Hannes Muhleisen
01:37:43 Harrison Chase
01:49:15 Lin Qiao
02:05:03 Jitendra Malik
02:21:15 Arsalan & Eric Schmidt

Data + AI Summit Keynote Wednesday


2023-06-29 Watch
video
Larry Feinsmith (JP Morgan Chase) , Kasey Uhlenhuth (Databricks) , Zaheera Valani (Databricks) , Wassym Bensaid (Rivian) , Satya Nadella (Microsoft) , Weston Hutchins (Databricks) , Naveen Rao (MosaicML) , Ali Ghodsi (Databricks) , Reynold Xin (Databricks) , Sai Pradhan Ravuru (Jetblue) , Matei Zaharia (Databricks) , Caryl Yuhas (Databricks) , Patrick Wendell (Databricks)

0:00 Opener
01:18 Ali Ghodsi, Databricks
06:53 Satya Nadella, Microsoft
15:50 Ali Ghodsi, Databricks
20:40 Larry Feinsmith, JP Morgan Chase
41:09 Ali Ghodsi, Databricks
45:07 Matei Zaharia, Databricks
52:31 Weston Hutchins, Databricks
58:36 Ali Ghodsi, Databricks
1:02:05 Naveen Rao, MosaicML
1:12:15 Patrick Wendell, Databricks
1:27:57 Kasey Uhlenhuth, Databricks
1:39:18 Sai Pradhan Ravuru, Jetblue
01:47 Ali Ghodsi, Databricks
1:49:20 Reynold Xin, Databricks
2:05:07 Ali Ghodsi, Databricks
2:09:26 Matei Zaharia, Databricks
2:17:24 Caryl Yuhas, Databricks
2:24:12 Zaheera Valani, Databricks
2:39:55 Wassym Bensaid, Rivian

Data+AI Summit 2022 Highlights


2022-08-16 Watch
video

Check out all of the amazing conference highlights from this year's Data+AI Summit that took place at Moscone Center, San Francisco and virtually.

Cutting the Edge in Fighting Cybercrime: Reverse-Engineering a Search Language to Cross-Compile


2022-07-22 Watch
video

Traditional Security Information and Event Management (SIEM) approaches do not scale well for data sources producing 30TiB per day, which led HSBC to create a Cybersecurity Lakehouse with Delta and Spark. The platform was built to overcome several conventional technical constraints: traditional platforms limit the amount of data available for long-term analytics, and their query languages are difficult to scale and time-consuming to run. In this talk, we’ll learn how to implement (or actually reverse-engineer) a language with Scala and translate it into what Apache Spark’s Catalyst engine understands. We’ll guide you through the technical journey of building equivalents of a query language in Spark. We’ll also learn how HSBC’s business benefited from this innovation: less time and fewer resources for Cyber data processing migration, improved Cyber threat Incident Response, and fast onboarding of HSBC Cyber Analysts on Spark with the Cybersecurity Lakehouse platform.
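
To make the cross-compilation idea concrete, here is a minimal sketch in Python rather than the Scala used in the talk. The search syntax (`field=value ... | head N`), the `translate` function and the `events` table name are all hypothetical and are not the language HSBC reverse-engineered. The sketch emits SQL text instead of building Catalyst expressions directly; Spark would parse such a string into a Catalyst plan.

```python
import re

# Hypothetical toy search syntax: 'field=value field2>10 | head 5'.
# One predicate is a field name, a comparison operator and a literal.
PREDICATE = re.compile(r'(\w+)\s*(=|!=|>=|<=|>|<)\s*("[^"]*"|\S+)')
NUMBER = re.compile(r"-?\d+(\.\d+)?")


def translate(query: str, table: str = "events") -> str:
    """Translate a toy search query into a SQL string."""
    search, _, tail = query.partition("|")
    predicates = []
    for field, op, value in PREDICATE.findall(search):
        value = value.strip('"')
        # Quote anything that is not a plain number, escaping single quotes.
        if not NUMBER.fullmatch(value):
            value = "'" + value.replace("'", "''") + "'"
        predicates.append(f"{field} {op} {value}")

    sql = f"SELECT * FROM {table}"
    if predicates:
        sql += " WHERE " + " AND ".join(predicates)

    # The only pipe command this sketch understands is 'head N'.
    limit = re.fullmatch(r"\s*head\s+(\d+)\s*", tail)
    if limit:
        sql += f" LIMIT {limit.group(1)}"
    return sql
```

With a live Spark session, the returned string could be handed to `spark.sql(...)`. A production translator would need a real grammar and would more likely construct the logical plan directly, as the abstract suggests.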


Applied Predictive Maintenance in Aviation: Without Sensor Data


2022-07-19 Watch
video

We will show how Azure Databricks Lakehouse is modernizing our data and analytics environment, giving us a new capability to create custom predictive models for hundreds of families of aircraft components without sensor data. We currently have a success rate of over 95%, with over $1.3 million in avoided operational impact costs in FY21.


Introducing Zipline: An Open Source Feature Engineering Platform


2022-07-19 Watch
video

This talk will introduce Zipline, a declarative feature engineering platform developed at Airbnb, which will be open-sourced in March. We will talk about the core capabilities and concepts of Zipline, and will include a 5-minute demo of developing real-time features for training and for inference.

This talk will mainly focus on guiding the audience on how to deploy and use Zipline in their organizations.


Accelerating Hybrid Data Mesh Implementation


2022-07-19 Watch
video

How do you get started on your hybrid data mesh journey? Where does Databricks fit into new decentralized analytical data estates? How does it fuel the next generation of domain-driven data products and data marketplaces? In this session, you'll hear practical examples, informed by our client experience, of how to frame the business challenge and value, define the required building blocks (architecture, governance, automation, people and organization), and jumpstart incremental adoption.

Building a Lakehouse for Data Science at DoorDash


2022-07-19 Watch
video

DoorDash was using a data warehouse but found that they needed more data transparency, lower costs, and the ability to handle streaming data as well as batch data. With an engineering team rooted in big data backgrounds at Uber and LinkedIn, they moved to a Lakehouse architecture intuitively, without knowing about the term. In this session, learn more about how they arrived at that architecture, the process of making the move, and the results they have seen. While addressing both data analysts and data scientists from their lakehouse, this session will focus on their machine learning operations, and how their efficiencies are enabling them to tackle more advanced use cases such as NLP and image classification.


Cloud and Data Science Modernization of Veterans Affairs Financial Service Center with Azure Databri


2022-07-19 Watch
video

The Department of Veterans Affairs (VA) is home to over 420,000 employees, provides health care for 9.16 million enrollees and manages the benefits of 5.75 million recipients. The VA also hosts an array of financial management, professional, and administrative services at their Financial Service Center (FSC), located in Austin, Texas. The FSC is divided into various service groups organized around revenue centers and product lines, including the Data Analytics Service (DAS). To support the VA mission, in 2021 FSC DAS continued to press forward with their cloud modernization efforts, successfully achieving four key accomplishments:

- Office of Community Care (OCC) Financial Time Series Forecast - Financial forecasting enhancements to predict claims
- CFO Dashboard - Productivity and capability enhancements for financial and audit analytics
- Datasets Migrated to the Cloud - Migration of on-prem datasets to the cloud for down-stream analytics (includes a supply chain proof-of-concept)
- Data Science Hackathon - A hackathon to predict bad claims codes and demonstrate DAS abilities to accelerate a ML use case using Databricks AutoML

This talk discusses FSC DAS’ cloud and data science modernization accomplishments in 2021, lessons learned, and what’s ahead.


Connecting the Dots with DataHub: Lakehouse and Beyond


2022-07-19 Watch
video

You’ve successfully built your data lakehouse. Congratulations! But what happens when your operational data stores, streaming systems like Apache Kafka or data ingestion systems produce bad data into the lakehouse? Can you be proactive when it comes to preventing bad data from affecting your business? How can you take advantage of automation to ensure that raw data assets become well maintained data products (clear ownership, documentation and sensitivity classification) without requiring people to do redundant work across operational, ingestion and lakehouse systems? How do you get live and historical visibility into your entire data ecosystem (schemas, pipelines, data lineage, models, features and dashboards) within and across your production services, ingestion pipelines and data lakehouse? Data engineers struggle with data quality and data governance issues constantly interrupting their day and limiting their upside impact on the business.

In this talk, we will share how data engineers from our 3K+ strong DataHub community are using DataHub to track lineage, understand data quality, and prevent failures from impacting their important dashboards, ML models and features. The talk will include details of how DataHub extracts lineage automatically from Spark, schema and statistics from Delta Lake and shift-left strategies for developer-led governance.


Mapping Data Quality Concerns to Data Lake Zones


2022-07-19 Watch
video

A common pattern in Data Lake and Lakehouse design is structuring data into zones, with Bronze, Silver and Gold being typical labels. Each zone is suitable for different workloads and different consumers: for instance, machine learning algorithms typically process against Bronze or Silver, while analytic dashboards often query Gold. This prompts the question: which layer is best suited for applying data quality rules and actions? Our answer: all of them.

In this session, we’ll expand on our answer by describing the purposes of the different zones, and mapping the categories of data quality relevant for each by assessing its qualitative requirements. We’ll describe Data Enrichment: the practice of making observed anomalies available as inputs to downstream data pipelines, and provide recommendations for when to merely alert, when to quarantine data, when to halt pipelines, and when to apply automated corrective actions.
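
As a rough illustration of the four responses named above (alert, quarantine, halt, automated correction), the sketch below attaches one action to each data quality rule and applies the rules per zone. The `Rule`, `Action`, `PipelineHalted` and `apply_rules` names are hypothetical and do not come from the talk; which action suits which zone is exactly the judgment the session discusses, so the mapping is left to the caller.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Action(Enum):
    ALERT = "alert"            # record the anomaly, keep the row
    QUARANTINE = "quarantine"  # divert the row for later inspection
    HALT = "halt"              # stop the pipeline
    CORRECT = "correct"        # apply an automated fix, keep the row


@dataclass
class Rule:
    name: str
    zone: str                              # "bronze", "silver" or "gold"
    is_valid: Callable[[dict], bool]
    action: Action
    fix: Optional[Callable[[dict], dict]] = None


class PipelineHalted(Exception):
    """Raised when a rule with Action.HALT fails."""


def apply_rules(rows, rules, zone):
    """Apply the rules for one zone; return (kept, quarantined, alerts)."""
    kept, quarantined, alerts = [], [], []
    zone_rules = [r for r in rules if r.zone == zone]
    for row in rows:
        keep = True
        for rule in zone_rules:
            if rule.is_valid(row):
                continue
            if rule.action is Action.HALT:
                raise PipelineHalted(rule.name)
            if rule.action is Action.QUARANTINE:
                quarantined.append(row)
                keep = False
                break
            if rule.action is Action.CORRECT and rule.fix is not None:
                row = rule.fix(row)
            else:
                # ALERT, or CORRECT without a fix: note it and move on.
                alerts.append((rule.name, row))
        if keep:
            kept.append(row)
    return kept, quarantined, alerts
```

The returned `alerts` list is one simple way to make observed anomalies available to downstream pipelines, in the spirit of the Data Enrichment practice the abstract describes.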


Open Source Powers the Modern Data Stack


2022-07-19 Watch
video

Lakehouses like Databricks’ Delta Lake are becoming the central brain for all data systems. But Lakehouses are only one component of the data stack. There are many building blocks required for tackling data needs, including data integrations, data transformation, data quality, observability, orchestration etc.

In this session, we will present how open source powers companies' approach to building a modern data stack. We will talk about technologies like Airbyte, Airflow, dbt, Preset, and how to connect them in order to build a customized and extensible data platform centered around Databricks.


Powering Up the Business with a Lakehouse


2022-07-19 Watch
video

Within Wehkamp we required a uniform way to provide reliable and on-time data to the business, while making this access compliant with GDPR. Unlocking all the data sources that we have scattered across the company and democratizing data access were of the utmost importance, allowing us to empower the business with more, better and faster data.

Focusing on open source technologies, we've built a data platform almost from the ground up around three levels of data curation (bronze, silver and gold), following the Lakehouse architecture. The PII fields are pseudonymized during ingestion into bronze, which makes use of the data within the Delta lake compliant. Since there is no visible user data, everyone can use the entire Delta lake for exploration and new use cases. Naturally, specific teams are allowed to see some user data that is necessary for their use cases. Besides the standard architecture, we've developed a library that allows us to ingest new data sources by adding a JSON config file with their characteristics. This, combined with the ACID transactions that Delta provides and the efficient Structured Streaming provided through Auto Loader, has allowed a small team to maintain 100+ streams with insignificant downtime.
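
A minimal sketch of config-driven pseudonymization at bronze ingestion, under stated assumptions: the abstract does not say which technique Wehkamp uses, so keyed hashing with HMAC-SHA256 is a stand-in, and the config keys (`source`, `pii_fields`) and function names are hypothetical.

```python
import hashlib
import hmac
import json

# Hypothetical per-source config; the real library's schema is not described.
CONFIG_JSON = """
{
  "source": "orders",
  "format": "json",
  "pii_fields": ["email", "phone"]
}
"""


def pseudonymize(value: str, key: bytes) -> str:
    """Replace a value with a keyed hash: stable per key, not reversible without it."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def to_bronze(record: dict, config: dict, key: bytes) -> dict:
    """Return a copy of the record with the configured PII fields pseudonymized."""
    out = dict(record)
    for field in config["pii_fields"]:
        if out.get(field) is not None:
            out[field] = pseudonymize(str(out[field]), key)
    return out


config = json.loads(CONFIG_JSON)
```

Because the hash is deterministic for a given key, pseudonymized columns can still be joined and counted across tables, which is one reason keyed hashing is a common choice for this kind of exploration-friendly masking.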

Some other components of this platform are the following:
- Alerting to Slack
- Data quality checks
- CI/CD
- Stream processing with the delta engine

The feedback so far has been encouraging, as more and more teams across the company are starting to use the new platform and taking advantage of all its perks. It is still a long time until we get to turn off some of the components of the old data platform, but it has come a long way.


Scaling Privacy: Practical Architectures and Experiences


2022-07-19 Watch
video

At Spark Data & AI 2021, we presented a use case around privacy in an insurance landscape using Privacera: Scaling Privacy in a Spark Ecosystem (https://www.youtube.com/watch?v=cjJEMlNcg5k). In the year since, privacy and security have taken off as a major need, and the ability to embed them into business processes to empower data democratization has become mandatory. The concept that data is a product is now commonplace, and the ability to rapidly innovate on those products hinges on balancing a dual mandate. First mandate: move fast. Second mandate: manage privacy and security. How do we make this happen? Let's dig into the real details and experiences and show the blueprint for success.


Secure Data Distribution and Insights with Databricks on AWS


2022-07-19 Watch
video

Every industry must meet some form of compliance or data security requirement in order to operate. As data becomes more mission-critical to the organization, so does the need to protect and secure it.

Public Sector organizations are responsible for securing sensitive data sets and complying with regulatory programs such as HIPAA, FedRAMP, and StateRAMP.

This does not come as a surprise given the many different attacks targeted at the industry and the extremely sensitive nature of the large volumes of data stored and analyzed. For a product owner or DBA, this can be extremely overwhelming with a security team issuing more restrictions and data access becoming more of a common request among business users. It can be difficult finding an effective governance model to democratize data while also managing compliance across your hybrid estate.

In this session, we will discuss challenges faced in the public sector when expanding to AWS cloud. We will review best practices for managing access and data integrity for a cloud-based data lakehouse with Databricks, and discuss recommended approaches for securing your AWS Cloud environment. We will highlight ways to enable compliance by developing a continuous monitoring strategy and providing tips for implementation of defense in depth. This guide will provide critical questions to ask, an overall strategy, and specific recommendations to serve all security leaders and data engineers in the Public Sector.

This talk is intended to educate on security design considerations when extending your data warehouse to the cloud. This guidance is expected to grow and evolve as new standards and offerings emerge for local, state, and federal government.


Serving Near Real-Time Features at Scale


2022-07-19 Watch
video

This presentation will first introduce the use case, which generates price adjustments based on the network effect. The corresponding model relies on 108 near real-time features computed by Flink pipelines from the raw demand and supply events. Here is the simplified computation logic:

- The pipelines need to process the raw real-time events at a rate of 300k/s, including both demand and supply
- Each event needs to be computed on the geospatial, temporal and other dimensions
- Each event contributes to the computation on the original hexagon and the 1K+ neighbours due to the fan-out effect of k-ring smoothing
- Each event contributes to the aggregation on multiple window sizes up to 32 minutes, sliding by 1 minute, or 63 windows in total
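
The figures above can be combined into a back-of-the-envelope estimate of the write fan-out. Two assumptions are not stated in the abstract: the window sizes are taken to be powers of two from 1 to 32 minutes (chosen because they add up to the quoted 63 windows), and the k-ring radius is taken to be 18, the smallest radius whose hexagonal neighbourhood exceeds 1,000 cells. The cell count 3k(k+1)+1 is the standard size of a k-ring on a hexagonal grid.

```python
def kring_cells(k: int) -> int:
    """Cells within k rings of a hexagon, including the centre: 3k(k+1)+1."""
    return 3 * k * (k + 1) + 1


def windows_per_event(sizes_min, slide_min=1):
    """With a fixed slide, an event falls into size/slide windows per window size."""
    return sum(size // slide_min for size in sizes_min)


EVENTS_PER_SEC = 300_000            # from the abstract
WINDOW_SIZES = [1, 2, 4, 8, 16, 32]  # assumed; sums to the quoted 63
K = 18                               # assumed; 1,026 neighbours plus the centre

cells = kring_cells(K)
windows = windows_per_event(WINDOW_SIZES)

# Naive upper bound on aggregate updates per second, before any optimization.
updates_per_sec = EVENTS_PER_SEC * cells * windows

print(f"{cells} cells x {windows} windows -> {updates_per_sec:,} updates/s")
```

Under these assumptions the naive count is on the order of tens of billions of updates per second, which gives a sense of why the unoptimized pipeline ran into memory and backpressure problems.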

Next the presentation will briefly go through the DAG of the Flink pipeline before optimization and the issues we faced: the pipeline could not run stably due to OOM and backpressure. The presentation will discuss how to optimize a streaming pipeline with the generic performance tuning framework, which focuses on three areas: Network, CPU and Memory, and five domains: Parallelism, Partition, Remote Call, Algorithm and Garbage Collector. The presentation will also show some example techniques being applied onto the pipelines by following the performance tuning framework.

Then the presentation will discuss one particular optimization technique: Customized Sliding Window.

Powering machine learning models with near real-time features can be quite challenging, due to computation logic complexity, write throughput, serving SLA, etc. In this talk, we have introduced some of the problems that we faced and our solutions to them, in the hope of aiding our peers in similar use cases.


So Fresh and So Clean: Learn How to Build Real-Time Warehouses on Lakehouse


2022-07-19 Watch
video

Warehouses? Where we are going, we won't need warehouses! Join Dillon, Franco, and Shannon as they take an industry-standard Data Warehouse integration benchmark, called TPC-DI, which is a typical 80s style data warehouse, and bring it into the future. We will review how to implement standard data warehousing practices on Lakehouse, and show you how to deliver optimal price/performance in the cloud and keep your data so fresh and so clean. We will take an assortment of structured, semi-structured, and unstructured data in the form of CSV, TXT, XML, and Fixed-Width files, and transform them warehouse-style into Lakehouse with a historical load and incremental CDC loads.


Sound Data Engineering in Rust—From Bits to DataFrames


2022-07-19 Watch
video

Spark applications often need to query external data sources such as file-based data sources or relational data sources. In order to do this, Spark provides Data Source APIs to access structured data through Spark SQL.

Data Source APIs have optimization rules such as filter push down and column pruning, which reduce the amount of data that needs to be processed and so improve query performance. As part of our ongoing project to provide generic Data Source V2 push down APIs, we have introduced partial aggregate push down, which significantly speeds up Spark jobs by dramatically reducing the amount of data transferred between data sources and Spark. We have implemented aggregate push down in both JDBC and Parquet.
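
The idea behind partial aggregate push down can be simulated in plain Python without Spark. In this sketch (function names are hypothetical, and this is not the Data Source V2 API), each source partition returns one small `(sum, count)` pair per group instead of its raw rows, and the engine merges those partial results into the final answer.

```python
from collections import defaultdict


def source_partial_aggregate(partition, group_key, value_key):
    """Runs at the data source: per-group (sum, count) for one partition."""
    acc = defaultdict(lambda: [0, 0])
    for row in partition:
        slot = acc[row[group_key]]
        slot[0] += row[value_key]
        slot[1] += 1
    return {key: tuple(pair) for key, pair in acc.items()}


def engine_final_aggregate(partials):
    """Runs in the engine: merge partial results into SUM, COUNT and AVG."""
    merged = defaultdict(lambda: [0, 0])
    for partial in partials:
        for key, (total, count) in partial.items():
            merged[key][0] += total
            merged[key][1] += count
    return {
        key: {"sum": total, "count": count, "avg": total / count}
        for key, (total, count) in merged.items()
    }


# Two partitions of a toy table; only the partial results cross the boundary.
partitions = [
    [{"dept": "a", "pay": 10}, {"dept": "b", "pay": 30}],
    [{"dept": "a", "pay": 20}],
]
partials = [source_partial_aggregate(p, "dept", "pay") for p in partitions]
result = engine_final_aggregate(partials)
```

AVG is carried as a sum and a count rather than as a per-partition average, because partition averages cannot be merged correctly without their counts.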
