talk-data.com

Topic: Data Management
Tags: data_governance, data_quality, metadata_management
26 tagged activities

Activity Trend: peak of 88 activities per quarter, 2020-Q1 to 2026-Q1

Activities

Showing filtered results

Filtering by: Databricks DATA + AI Summit 2023
Databricks Customers at Data + AI Summit

At this year's event, over 250 customers shared their data and AI journeys, showcasing a wide variety of use cases, best practices, and lessons from their leadership and innovation with the latest data and AI technologies.

See how enterprises are leveraging generative AI in their data operations and how innovative data management and data governance are fueling organizations as they race to develop GenAI applications. https://www.databricks.com/blog/how-real-world-enterprises-are-leveraging-generative-ai

To see more real-world use cases and customer success stories, visit: https://www.databricks.com/customers

Data + AI Summit Keynote Day 1 - Full video
by Patrick Wendell (Databricks), Fei-Fei Li (Stanford University), Brian Ames (General Motors), Ken Wong (Databricks), Ali Ghodsi (Databricks), Jackie Brosamer (Block), Reynold Xin (Databricks), Jensen Huang (NVIDIA)

Databricks Data + AI Summit 2024 Keynote Day 1

Experts, researchers, and open source contributors from Databricks and across the data and AI community gathered in San Francisco June 10-13, 2024, to discuss the latest technologies in data management, data warehousing, data governance, generative AI for the enterprise, and data in the era of AI.

Hear from Databricks Co-founder and CEO Ali Ghodsi on building generative AI applications, putting your data to work, and how data + AI leads to data intelligence.

Plus a fireside chat between Ali Ghodsi and NVIDIA Co-founder and CEO Jensen Huang on the expanded partnership between NVIDIA and Databricks to accelerate enterprise data for the era of generative AI.

Product announcements in the video include:

  • Databricks Data Intelligence Platform
  • Native support for NVIDIA GPU acceleration on the Databricks Data Intelligence Platform
  • Databricks open source model DBRX available as an NVIDIA NIM microservice
  • Shutterstock Image AI powered by Databricks
  • Databricks AI/BI
  • Databricks LakeFlow
  • Databricks Mosaic AI
  • Mosaic AI Agent Framework
  • Mosaic AI Agent Evaluation
  • Mosaic AI Tools Catalog
  • Mosaic AI Model Training
  • Mosaic AI Gateway

In this keynote, hear from:

  • Ali Ghodsi, Co-founder and CEO, Databricks (1:45)
  • Brian Ames, General Motors (29:55)
  • Patrick Wendell, Co-founder and VP of Engineering, Databricks (38:00)
  • Jackie Brosamer, Head of AI, Data and Analytics, Block (1:14:42)
  • Fei-Fei Li, Professor, Stanford University and Denning Co-Director, Stanford Institute for Human-Centered AI (1:23:15)
  • Jensen Huang, Co-founder and CEO, NVIDIA, with Ali Ghodsi, Co-founder and CEO, Databricks (1:42:27)
  • Reynold Xin, Co-founder and Chief Architect, Databricks (2:07:43)
  • Ken Wong, Senior Director, Product Management, Databricks (2:31:15)
  • Ali Ghodsi, Co-founder and CEO, Databricks (2:48:16)

Sponsored: EY | Business Value Unleashed: Real-World Accelerating AI & Data-Centric Transformation

Data and AI are revolutionizing industries and transforming businesses at an unprecedented pace. These advancements pave the way for groundbreaking outcomes such as fresh revenue streams, optimized working capital, and captivating, personalized customer experiences.

Join Hugh Burgin, Luke Pritchard and Dan Diasio as we explore a range of real-world examples of AI and data-driven transformation opportunities being powered by Databricks, including business value realized and technical solutions implemented. We will focus on how to integrate and leverage business insights, a diverse network of cloud-based solutions and Databricks to unleash new business value opportunities. By highlighting real-world use cases we will discuss:

  • Examples of how Manufacturing, Retail, Financial Services and other sectors are using Databricks services to scale AI, gain insights that matter and secure their data
  • The ways data monetization is changing how companies view data and incentivizing better data management
  • Examples of Generative AI and LLMs changing how businesses operate, how their customers engage, and what you can do about it

Talk by: Hugh Burgin and Luke Pritchard

Here’s more to explore: State of Data + AI Report: https://dbricks.co/44i2HBp The Data Team's Guide to the Databricks Lakehouse Platform: https://dbricks.co/46nuDpI

Sponsored by: Labelbox | Unlocking Enterprise AI with Your Proprietary Data and Foundation Models

We are starting to see a paradigm shift in how AI systems are built across enterprises. In 2023 and beyond, this shift is being propelled by the era of foundation models. Foundation models can be seen as the next evolution of "pre-trained" models and transfer learning. To fully leverage these breakthrough models, we've seen a common formula for success: leading AI teams within enterprises need to be able to harness their own store of unstructured data and pair it with the right model in order to ship intelligent applications that deliver next-generation experiences to their customers.

In this session you will learn how to incorporate foundation models into your data and machine learning workflows so that anyone can build AI faster and, in many cases, get the business outcome without needing to build AI models at all. We will cover which foundation models can be used to pre-label and enrich data, which data pipeline (data engine) enables this, and real-world use cases of when to incorporate large language models and fine-tuning to improve machine learning models in real time. Discover the power of leveraging both Labelbox and Databricks to streamline this data management and model deployment process.
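
Below is a minimal sketch of the pre-labeling idea described above, using an off-the-shelf Hugging Face zero-shot classifier rather than Labelbox's actual product APIs; the model checkpoint and the label set are illustrative assumptions.

```python
# A hedged sketch of foundation-model pre-labeling: attach a provisional label
# and confidence to raw text before human review. Not Labelbox's API; the model
# name and candidate labels are illustrative assumptions.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

candidate_labels = ["billing question", "bug report", "feature request"]

def pre_label(texts):
    """Return provisional labels and scores for a batch of raw text records."""
    results = []
    for text in texts:
        pred = classifier(text, candidate_labels)
        results.append({
            "text": text,
            "pre_label": pred["labels"][0],   # highest-scoring candidate label
            "confidence": pred["scores"][0],  # score for that label
        })
    return results

print(pre_label(["The invoice amount looks wrong this month."]))
```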

Talk by: Manu Sharma

Here’s more to explore: LLM Compact Guide: https://dbricks.co/43WuQyb Big Book of MLOps: https://dbricks.co/3r0Pqiz

Scaling Deep Learning Using Delta Lake Storage Format on Databricks

Delta Lake is an open-source storage format that is well suited to storing the large-scale datasets used for single-node and distributed training of deep learning models. The Delta Lake storage format gives deep learning practitioners unique data management capabilities for working with their datasets. The challenge is that, as of now, it is not possible to train PyTorch models directly from Delta Lake.

The PyTorch community has recently introduced the TorchData library for efficient data loading. This library supports many formats out of the box, but not Delta Lake. This talk will demonstrate using the Delta Lake storage format for single-node and distributed PyTorch training with the torchdata framework and the standalone delta-rs Delta Lake implementation.
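
As a rough illustration of the approach (a sketch under assumptions, not the speaker's implementation), the delta-rs Python bindings can read a Delta table into Arrow batches, which a torchdata pipeline then turns into training tensors. The table path and column names below are made up for the example.

```python
# Minimal single-node sketch: Delta table -> delta-rs -> torchdata -> DataLoader.
import torch
from deltalake import DeltaTable                      # delta-rs Python bindings
from torchdata.datapipes.iter import IterableWrapper
from torch.utils.data import DataLoader

def delta_rows(table_path: str):
    """Yield rows of a Delta table as plain dicts, one Arrow batch at a time."""
    dataset = DeltaTable(table_path).to_pyarrow_dataset()
    for batch in dataset.to_batches():
        for row in batch.to_pylist():
            yield row

def to_tensors(row: dict):
    # Column names "features" and "label" are assumptions for this sketch.
    x = torch.tensor(row["features"], dtype=torch.float32)
    y = torch.tensor(row["label"], dtype=torch.long)
    return x, y

# deepcopy=False because a generator cannot be copied; rebuild the pipe per epoch.
pipe = IterableWrapper(delta_rows("/tmp/my_delta_table"), deepcopy=False).map(to_tensors)
loader = DataLoader(pipe, batch_size=32)

for features, labels in loader:
    pass  # standard PyTorch training step goes here
```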

Talk by: Michael Shtelma

Self-Service Data Analytics and Governance at Enterprise Scale with Unity Catalog

This session focuses on one of the first Unity Catalog implementations for a large-scale enterprise. The scenario is a cloud-scale analytics platform, based on the lakehouse approach, with 7,500 active users and a potential 1,500 further users who are subject to special governance rules. They consume more than 600 TB of data stored in Delta Lake, continuously growing at more than 1 TB per day, and this may grow further with local country data. The existing data platform therefore had to be extended so that users can combine global data with local data from their countries. A new data management approach was required that reflects strict information security rules on a need-to-know basis. The core requirements: read-only access to global data, write access to local data, and the ability to share the results.

Because of a very pronounced information security awareness and a lack of technological possibilities, it had previously been difficult or impossible to analyze and exchange data across disciplines. As a result, a great deal of business potential and gains could not be identified or realized.

With recent developments in the underlying technology and on the basis of the lakehouse approach, Unity Catalog allowed us to develop a solution that meets high requirements for security and process while enabling globally secured, interdisciplinary data exchange and analysis at scale. The solution democratizes the data: it not only yields better insights for business management, but also enables entirely new business cases and products that require a higher degree of data integration, and it encourages cultural change. We highlight technical challenges and solutions, present best practices, and point out the benefits of implementing Unity Catalog for enterprises.
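
As a hedged sketch of the "read global, write local, share the results" requirement described above, the grants below show how it might be expressed with Unity Catalog SQL from a Databricks notebook (where `spark` is predefined). The catalog, schema, and group names are illustrative assumptions, not the presenters' actual object names.

```python
# Read-only access to global data for a country analyst group.
spark.sql("GRANT USE CATALOG ON CATALOG global_cat TO `country_de_analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA global_cat.curated TO `country_de_analysts`")
spark.sql("GRANT SELECT ON SCHEMA global_cat.curated TO `country_de_analysts`")

# Full read/write access to the country-local catalog where results are produced.
spark.sql("GRANT USE CATALOG ON CATALOG local_de TO `country_de_analysts`")
spark.sql("GRANT ALL PRIVILEGES ON SCHEMA local_de.analytics TO `country_de_analysts`")

# Results written locally are shared back via a dedicated schema that a central
# consumer group can read.
spark.sql("GRANT USE CATALOG ON CATALOG local_de TO `global_consumers`")
spark.sql("GRANT USE SCHEMA ON SCHEMA local_de.shared_results TO `global_consumers`")
spark.sql("GRANT SELECT ON SCHEMA local_de.shared_results TO `global_consumers`")
```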

Talk by: Artem Meshcheryakov and Pascal van Bellen

Here’s more to explore: Data, Analytics, and AI Governance: https://dbricks.co/44gu3YU

Sponsored by: Avanade | Accelerating Adoption of Modern Analytics and Governance at Scale

To unlock all the competitive advantage Databricks offers your organization, you might need to update your strategy and methodology for the platform. With more than 1,000 Databricks projects completed globally in the last 18 months, we are going to share our insights on the best building blocks to target as you search for efficiency and competitive advantage.

These building blocks include enterprise metadata and data management services, a data management foundation, and data services and products that enable business units to fully use their data and analytics at scale.

In this session, Avanade data leaders will highlight how Databricks' modern data stack fits into the Azure PaaS and SaaS ecosystem (such as Microsoft Fabric), how Unity Catalog metadata supports automated data operations scenarios, and how we are helping clients measure the business impact and value of modern analytics and governance.

Talk by: Alan Grogan and Timur Mehmedbasic

Here's more to explore: State of Data + AI Report: https://dbricks.co/44i2HBp Databricks named a Leader in 2022 Gartner® Magic Quadrant™ CDBMS: https://dbricks.co/3phw20d

Increasing Data Trust: Enabling Data Governance on Databricks Using Unity Catalog & ML-Driven MDM

As part of Comcast Effectv's transformation into a completely digital advertising agency, it was key to develop an approach to managing and remediating data quality issues in customer data so that the sales organization uses reliable data for data-driven decision making. As at many organizations, Effectv's customer lifecycle processes are spread across many systems with various integrations between them. This results in key challenges, such as duplicate and redundant customer data that requires rationalization and remediation. Data is at the core of Effectv's modernization journey, with the intended result of winning more business, accelerating order fulfillment, reducing make-goods, and identifying revenue.

In partnership with Slalom Consulting, Comcast Effectv built a traditional lakehouse on Databricks to ingest data from all of these systems, but with a twist: they anchored every engineering decision in how it would enable their data governance program.

In this session, we will touch upon the data transformation journey at Effectv and dive deeper into the implementation of data governance leveraging Databricks solutions such as Delta Lake, Unity Catalog, and DB SQL. Key focus areas include how we baked master data management into our pipelines by automating the matching and survivorship process, and how we bring it all together for data consumers via DB SQL so they can use our certified assets in the bronze, silver, and gold layers.

By making thoughtful decisions about structuring data in Unity Catalog and baking MDM into ETL pipelines, you can greatly increase the quality, reliability, and adoption of single-source-of-truth data so your business users can stop spending cycles on wrangling data and spend more time developing actionable insights for your business.
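
To make the matching-and-survivorship idea concrete, here is a minimal PySpark sketch (an illustration under assumptions, not Effectv's actual pipeline): after candidate customer records have been matched into groups, a window function keeps the most recently updated, highest-priority record in each group as the "golden" record.

```python
# Hedged survivorship sketch; table and column names are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

matched = spark.table("silver.customers_matched")   # output of the matching step

# Rank records within each match group by recency, then by source-system priority.
w = Window.partitionBy("match_group_id").orderBy(
    F.col("updated_at").desc(), F.col("source_priority").asc()
)

golden = (
    matched
    .withColumn("rank", F.row_number().over(w))
    .filter(F.col("rank") == 1)    # the surviving record per match group
    .drop("rank")
)

golden.write.mode("overwrite").saveAsTable("gold.customers_golden")
```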

Talk by: Maggie Davis and Risha Ravindranath

Feeding the World One Plant at a Time

Join this session to learn how the CVML and Data Platform team at Blue River Technology utilized Databricks to maximize savings on herbicide usage and revolutionize precision agriculture.

Blue River Technology is an agricultural technology company that uses computer vision and machine learning (CVML) to revolutionize the way crops are grown and harvested. BRT's See & Spray technology uses CVML to determine precisely whether a plant is a weed or a crop so it can deliver a small, targeted dose of herbicide directly to the plant while leaving the crop unharmed. By using this approach, Blue River reduces the amount of herbicide used in agriculture by over 70%, with a positive impact on the environment and human health.

The technical challenges we seek to overcome are:

  • Processing petabytes of proprietary data at scale and in real time; equipment in the field can generate up to 40 TB of data per hour per machine.
  • Aggregating, curating, and visualizing data at scale, which can often be convoluted, error-prone, and complex.
  • Streamlining pipeline runs from weeks to hours to ensure continuous delivery of data.
  • Abstracting and automating the infrastructure, deployment, and data management for each program.
  • Building downstream data products based on descriptive, predictive, or prescriptive analysis to drive machine behavior.

The business questions and needs we seek to address for any machine are:

  • Are we getting the spray savings we anticipated?
  • Are we reducing the use of herbicide at the scale we expected?
  • Are spraying nozzles performing at the expected rate?
  • Finding the relevant data to troubleshoot new edge conditions.
  • Providing a simple interface for data exploration to both technical and non-technical personas to help improve our models.
  • Identifying repetitive and new faults in our machines.
  • Filtering out data based on certain incidents.
  • Identifying anomalies, e.g. a sudden drop in spray savings or a frequency of broad spray that is suddenly too high.

How we are addressing, and plan to address, these challenges:

  • Designating Databricks as our purpose-built database for all data, using the bronze, silver, and gold layer standards.
  • Processing new machine logs using Delta Live Tables as a source, in both batch and incremental modes (see the sketch below).
  • Democratizing access via notebooks for quick development, as well as real-time dashboards, for data scientists, product managers, and data engineers who are not proficient with the robotics software stack.
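
A hedged Delta Live Tables sketch of the bronze/silver/gold pattern described above; the table, column, and path names are illustrative assumptions, not Blue River's actual schema.

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Bronze: raw machine logs ingested incrementally from cloud storage.")
def machine_logs_bronze():
    return (
        spark.readStream.format("cloudFiles")       # Auto Loader
        .option("cloudFiles.format", "json")
        .load("/mnt/raw/machine_logs/")             # landing path is an assumption
    )

@dlt.table(comment="Silver: parsed spray events with a basic quality expectation.")
@dlt.expect_or_drop("valid_nozzle", "nozzle_id IS NOT NULL")
def spray_events_silver():
    return (
        dlt.read_stream("machine_logs_bronze")
        .where(F.col("event_type") == "spray")
        .select("machine_id", "nozzle_id", "event_time", "herbicide_ml")
    )

@dlt.table(comment="Gold: daily herbicide usage per machine for dashboards.")
def spray_usage_gold():
    return (
        dlt.read("spray_events_silver")
        .groupBy("machine_id", F.to_date("event_time").alias("day"))
        .agg(F.sum("herbicide_ml").alias("herbicide_ml_used"))
    )
```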

Talk by: Fahad Khan and Naveed Farooqui

Here’s more to explore: LLM Compact Guide: https://dbricks.co/43WuQyb Big Book of MLOps: https://dbricks.co/3r0Pqiz

Combining Privacy Solutions to Solve Data Access at Scale

The same trend that has made data easier to collect and analyze has also aggravated privacy risks. Luckily, a range of privacy technologies has emerged to enable private data management: differential privacy, synthetic data, and confidential computing. In isolation, these technologies have had limited impact because they did not always bring the 10x improvement expected by data leaders.

Combining these privacy technologies has been the real game changer. We will demonstrate that the right mix of technologies brings the optimal balance of privacy and flexibility at the scale of the data warehouse. We will illustrate this with real-life applications of Sarus in three domains:

  • Healthcare: how to make hospital data available for research at scale in full compliance
  • Finance: how to pool data between several banks to fight criminal transactions
  • Marketing: how to build insights on combined data from partners and distributors

The examples will be illustrated using data stored in Databricks and queried using the Sarus differential privacy engine.
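
As a toy illustration of the differential privacy building block the session relies on (this is not the Sarus engine, just the underlying idea), an aggregate query can be released with calibrated Laplace noise so that no single record can be inferred from the answer:

```python
# Toy Laplace mechanism for a counting query; the sensitivity of a count is 1.
import numpy as np

def dp_count(records, epsilon=1.0):
    """Return a noisy count whose privacy loss is bounded by epsilon."""
    true_count = len(records)
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Example: size of a patient cohort released privately (stand-in data).
cohort = list(range(1187))
print(dp_count(cohort, epsilon=0.5))
```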

Talk by: Maxime Agostini

US Army Corps of Engineers Enhanced Commerce & National Sec Through Data-Driven Geospatial Insight

The US Army Corps of Engineers (USACE) is responsible for maintaining and improving nearly 12,000 miles of shallow-draft (9'-14') inland and intracoastal waterways, 13,000 miles of deep-draft (14' and greater) coastal channels, and 400 ports, harbors, and turning basins throughout the United States. Because these components of the national waterway network are considered assets to both US commerce and national security, they must be carefully managed to keep marine traffic operating safely and efficiently.

The National DQM Program is tasked with providing USACE a nationally standardized remote monitoring and documentation system across multiple vessel types, with timely data access, reporting, dredge certifications, data quality control, and data management. Government systems have often lagged commercial systems in modernization efforts, but the emergence of the cloud and data lakehouse architectures has empowered USACE to successfully move into the modern data era.

This session incorporates aspects of these topics:

  • Data lakehouse architecture: Delta Lake, platform security and privacy, serverless, administration, data warehouse, data lake, Apache Iceberg, Data Mesh
  • GIS: H3, Mosaic, spatial analysis
  • Data engineering: data pipelines, orchestration, CDC, medallion architecture, Databricks Workflows, data munging, ETL/ELT, lakehouses, data lakes, Parquet, Data Mesh, Apache Spark™ internals
  • Data streaming: Apache Spark Structured Streaming, real-time ingestion, real-time ETL, real-time ML, real-time analytics, real-time applications, Delta Live Tables
  • ML: PyTorch, TensorFlow, Keras, scikit-learn, Python and R ecosystems
  • Data governance: security, compliance, RMF, NIST
  • Data sharing: sharing and collaboration, Delta Sharing, data cleanliness, APIs

Talk by: Jeff Mroz

Sponsored: Accenture | Databricks Enables Employee Data Domain to Align People w/ Business Outcomes

A global franchise retailer was struggling to understand the value of its employees and had not fostered a data-driven enterprise. During the journey to use facts as the basis for decision making, Databricks became the facilitator of DataMesh and created the pipelines, analytics and source engine for a three-layer — bronze, silver, gold — lakehouse that supports the HR domain and drives the integration of multiple additional domains: sales, customer satisfaction, product quality and more. In this talk, we will walk through:

  • The business rationale and drivers
  • The core data sources
  • The data products, analytics and pipelines
  • The adoption of Unity Catalog for data privacy compliance/adherence and data management
  • Data quality metrics

Join us to see the analytic product and the design behind this innovative view of employees and their business outcomes.

Talk by: Rebecca Bucnis

Here's more to explore: State of Data + AI Report: https://dbricks.co/44i2HBp Databricks named a Leader in 2022 Gartner® Magic Quadrant™ CDBMS: https://dbricks.co/3phw20d

Sponsored by: Fivetran | Fivetran and Catalyst Enable Businesses & Solve Critical Market Challenges

Fivetran helps enterprise and commercial companies improve the efficiency of their data movement, infrastructure, and analysis by providing a secure, scalable platform for high-volume data movement. In this fireside chat, we will dive into the pain points that drove Catalyst, a cloud-based platform that helps software companies grow revenue with advanced insights and workflows that strengthen customer adoption, retention, expansion, and advocacy, to search for a partnership that would automate and simplify data management, as well as the pivotal success driven by the implementation of Fivetran and Databricks.

Discover how together Fivetran and Databricks:

  • Deliver scalable, real-time analytics to customers with minimal configuration and centralize customer data into customer success tools.
  • Improve Catalyst’s visibility into customer health, opportunities, and risks across all teams.
  • Turn data into revenue-driving insights around digital customer behavior with improved targeting and AI/machine learning.
  • Provide a robust and scalable data infrastructure that supports Catalyst’s growing data needs, with improvements in data availability, data quality, and overall efficiency in data operations.

Talk by: Edward Chiu and Lauren Schwartz

PII Detection at Scale on the Lakehouse

SEEK is Australia's largest online employment marketplace and a market leader spanning ten countries across Asia Pacific and Latin America. SEEK provides employment opportunities for roughly 16 million monthly active users and processes 25 million candidate applications to listings. Processing millions of resumes involves handling and managing highly sensitive candidate information, usually supplied in an unstructured format. With recent high-profile data leaks in Australia, personally identifiable information (PII) protection has become a major focus area for large digital organizations.

The first step is detection, and SEEK has developed a custom framework built using Hugging Face transformers fine-tuned for the nuances of employment data. For example, “Software Engineer at Databricks” is not PII, but “CEO at Databricks” is. After identifying and anonymizing PII in stream and batch data, SEEK uses Unity Catalog’s data lineage to track PII through reporting, ETL, and other downstream ML use cases, and to govern access control, achieving an organization-wide data management capability driven by deep learning and enforced using Databricks.
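
For a rough sense of the detection step (a hedged sketch, not SEEK's proprietary fine-tuned framework), a Hugging Face token-classification pipeline can flag entity spans for downstream anonymization. The public model checkpoint and the entity types kept here are illustrative assumptions.

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",        # public checkpoint, not SEEK's model
    aggregation_strategy="simple",      # merge word pieces into whole entities
)

def detect_pii(text: str):
    """Return detected entity spans that a downstream step would redact."""
    return [
        {"text": e["word"], "label": e["entity_group"], "score": float(e["score"])}
        for e in ner(text)
        if e["entity_group"] in {"PER", "LOC"}   # kept entity types are assumptions
    ]

print(detect_pii("Jane Citizen, CEO at Databricks, based in Sydney."))
```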

Talk by: Ajmal Aziz and Rachael Straiton

MLOps at Gucci: From Zero to Hero

A Modern Approach to Big Data for Finance
  • There are unique challenges associated with working with big data for finance (volume of data, disparate storage, variable sharing protocols, etc.)
  • Leveraging open source technologies, like Databricks' Delta Sharing, in combination with a flexible data management stack, can allow organizations to be more nimble in testing and deploying more strategies
  • Live demonstration of Delta Sharing in combination with Nasdaq Data Fabric (see the recipient-side sketch below)
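
As a hedged, recipient-side sketch of what a Delta Sharing consumer does (not the Nasdaq Data Fabric demo itself), the open-source delta-sharing connector can list and load shared tables; the profile file and share/schema/table names are illustrative assumptions.

```python
import delta_sharing

# The profile file is issued by the data provider and contains the sharing
# endpoint and bearer token.
profile = "config.share"

# Addressing is "<profile>#<share>.<schema>.<table>".
table_url = profile + "#market_data.equities.daily_prices"

# Discover everything shared with this recipient.
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# Load one shared table directly into a pandas DataFrame.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```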

Migrating Complex SAS Processes to Databricks - Case Study

Many federal agencies use SAS software for critical operational data processes. While SAS has historically been a leader in analytics, it has often been used by data analysts for ETL purposes as well. However, the demands of modern data science on ever-increasing volumes and types of data require a shift to modern cloud architectures and data management tools and paradigms for ETL/ELT. In this presentation, we will provide a case study from the Centers for Medicare and Medicaid Services (CMS) detailing the approach and results of migrating a large, complex legacy SAS process to modern, open-source/open-standard technology, Spark SQL and Databricks, to produce results ~75% faster without reliance on proprietary constructs of the SAS language, with more scalability, and in a manner that can more easily ingest old rules and better govern the inclusion of new rules and data definitions. Significant technical and business benefits derived from this modernization effort are described in this session.
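
For a generic flavor of the kind of rewrite involved (a hedged illustration, not the CMS codebase), a SAS PROC SQL-style aggregation maps fairly directly onto Spark SQL on Databricks; the table and column names are made up for the example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Roughly equivalent to a legacy SAS step such as:
#   proc sql;
#     create table claim_totals as
#     select provider_id, sum(paid_amount) as total_paid
#     from claims
#     where claim_year = 2022
#     group by provider_id;
#   quit;
claim_totals = spark.sql("""
    SELECT provider_id,
           SUM(paid_amount) AS total_paid
    FROM   claims
    WHERE  claim_year = 2022
    GROUP  BY provider_id
""")

claim_totals.write.mode("overwrite").saveAsTable("claim_totals")
```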

Auto Encoder Decoder-Based Anomaly Detection with the Lakehouse Paradigm

An auto-encoder-decoder is a type of deep learning neural network architecture with an hourglass shape: high-dimensional inputs are compressed to a latent space through the encoder, and the decoder mirrors the encoder architecture to reconstruct the input data from the latent space. Auto-encoder-decoder models are commonly used for anomaly detection: after training, the reconstruction error of normal data is minimized, so an anomaly can be detected when its reconstruction error rises above the “normal threshold”. This presentation will demonstrate an auto-encoder-decoder anomaly detection solution built with the lakehouse paradigm, from data management to post-deployment monitoring, to explain the entire model life cycle. It will also highlight the flexibility and scalability that MLflow custom models and Pandas UDFs can bring when a large number of individual models need to be trained, deployed, and monitored in parallel.
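
The core mechanism is easy to see in a minimal PyTorch sketch (dimensions, training data, and the threshold rule below are illustrative assumptions, not the presenters' production setup): train on normal data only, then flag records whose reconstruction error exceeds a threshold derived from the training distribution.

```python
import torch
from torch import nn

class AutoEncoder(nn.Module):
    """Hourglass-shaped encoder/decoder over tabular feature vectors."""
    def __init__(self, n_features: int, latent_dim: int = 4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 16), nn.ReLU(),
                                     nn.Linear(16, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 16), nn.ReLU(),
                                     nn.Linear(16, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder(n_features=8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

normal_data = torch.randn(1024, 8)        # stand-in for "normal" sensor readings
for _ in range(50):                       # short training loop for the sketch
    optimizer.zero_grad()
    loss = loss_fn(model(normal_data), normal_data)
    loss.backward()
    optimizer.step()

with torch.no_grad():
    # Per-record reconstruction error on normal data defines the threshold.
    errors = ((model(normal_data) - normal_data) ** 2).mean(dim=1)
    threshold = errors.mean() + 3 * errors.std()

    new_batch = torch.randn(16, 8) * 2.5  # stand-in for incoming records
    new_errors = ((model(new_batch) - new_batch) ** 2).mean(dim=1)
    is_anomaly = new_errors > threshold   # True where reconstruction is poor
print(is_anomaly)
```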

Rethinking Orchestration as Reconciliation: Software-Defined Assets in Dagster

This talk discusses “software-defined assets”, a declarative approach to orchestration and data management that makes it drastically easier to trust and evolve datasets and ML models. Dagster is an open source orchestrator built for maintaining software-defined assets.

In traditional data platforms, code and data are only loosely coupled. As a consequence, deploying changes to data feels dangerous, backfills are error-prone and irreversible, and it’s difficult to trust data, because you don’t know where it comes from or how it’s intended to be maintained. Each time you run a job that mutates a data asset, you add a new variable to account for when debugging problems.

Dagster proposes an alternative approach to data management that tightly couples data assets to code - each table or ML model corresponds to the function that’s responsible for generating it. This results in a “Data as Code” approach that mimics the “Infrastructure as Code” approach that’s central to modern DevOps. Your git repo becomes your source of truth on your data, so pushing data changes feels as safe as pushing code changes. Backfills become easy to reason about. You trust your data assets because you know how they’re computed and can reproduce them at any time. The role of the orchestrator is to ensure that physical assets in the data warehouse match the logical assets that are defined in code, so each job run is a step towards order.

Software-defined assets are a natural approach to orchestration for the modern data stack, in part because dbt models are a type of software-defined asset.

Attendees of this session will learn how to build and maintain lakehouses of software-defined assets with Dagster.
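
A small hedged sketch of the idea (asset names and toy data are illustrative assumptions, not a production Dagster project): each asset is a function that produces a table-like object, and dependencies are declared simply by referencing upstream assets as parameters, so the orchestrator can reconcile stored data with the definitions in code.

```python
import pandas as pd
from dagster import asset, materialize

@asset
def raw_orders() -> pd.DataFrame:
    # In practice this would read from storage; here it is an in-memory stand-in.
    return pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 25.0, 7.5]})

@asset
def daily_revenue(raw_orders: pd.DataFrame) -> pd.DataFrame:
    # Downstream asset defined purely in terms of its upstream input.
    return pd.DataFrame({"revenue": [raw_orders["amount"].sum()]})

if __name__ == "__main__":
    # The orchestrator's job is to reconcile stored assets with these definitions.
    materialize([raw_orders, daily_revenue])
```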

Running a Low Cost, Versatile Data Management Ecosystem with Apache Spark at Core

Data is the key component of any analytics, AI, or ML platform. Organizations may not be successful without a platform that can source, transform, quality-check, and present data in a reportable format that drives actionable insights.

This session will focus on how the Capital One HR team built a low-cost data movement ecosystem that can source data, transform it at scale, and build data storage (Redshift) in a form that can be easily consumed by AI/ML programs, using AWS services in combination with open source software (Spark) and Enterprise Edition Hydrograph (a UI-based ETL tool with Spark as the backend). This presentation mainly demonstrates the flexibility that Apache Spark provides for various types of ETL data pipelines when we code in Spark.

We have been running three types of pipelines for over six years, with more than 400 nightly batch jobs for about $1,000/month: (1) Spark on EC2, (2) a UI-based ETL tool with a Spark backend (on the same EC2), and (3) Spark on EMR. We have a CI/CD pipeline that supports easy integration and code deployment in all non-prod and prod regions (and even supports automated unit testing). We will also demonstrate how this ecosystem can fail over to a different region in less than 15 minutes, making our application highly resilient.
