talk-data.com

Topic: DWH (Data Warehouse)

Tags: analytics · business_intelligence · data_storage

568 tagged activities

Activity Trend: peak of 35 activities per quarter (2020-Q1 to 2026-Q1)

Activities

568 activities · Newest first

AWS re:Invent 2024 - Accelerate value from data: Migrating from batch to stream processing (ANT324)

Growing business needs to incorporate real-time insights into conventional use cases are pushing data transformation from batch processing to streaming. From gaming to clickstream to generative AI, workloads that once ran as batch analytics now demand high throughput, low latency, and simplified ingestion for real-time insights and visualizations. Join this session to hear from experts on how to successfully migrate from batch to stream processing using AWS streaming services that provide scalable integrations and real-time capabilities across services such as Amazon Redshift for real-time data warehousing analytics and ELT pipelines.
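For a concrete sense of what the Redshift side of such a migration can look like, here is a minimal sketch of streaming ingestion from a Kinesis data stream into a Redshift materialized view, following the publicly documented pattern; the stream name, IAM role ARN, and connection details are placeholders, not values from the session.

```python
# Hedged sketch: Redshift streaming ingestion from Kinesis via the redshift_connector
# Python driver. All identifiers and credentials below are placeholders.
import redshift_connector

conn = redshift_connector.connect(
    host="my-cluster.example.us-east-1.redshift.amazonaws.com",  # placeholder
    database="dev",
    user="analytics",
    password="...",
)
cur = conn.cursor()

# Expose the Kinesis stream to Redshift through an external schema.
cur.execute("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS kinesis_src
    FROM KINESIS
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-streaming-role'
""")

# Materialized view over the stream; Redshift refreshes it automatically.
cur.execute("""
    CREATE MATERIALIZED VIEW clickstream_live AUTO REFRESH YES AS
    SELECT approximate_arrival_timestamp,
           JSON_PARSE(kinesis_data) AS event
    FROM kinesis_src."clickstream-events"
""")
conn.commit()
```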

Learn more: AWS re:Invent: https://go.aws/reinvent. More AWS events: https://go.aws/3kss9CP


About AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

#AWSreInvent #AWSreInvent2024

Think Inside the Box: Constraints Drive Data Warehousing Innovation

As a Head of Data or a one-person data team, keeping the lights on for the business while running all things data-related as efficiently as possible is no small feat. This talk will focus on tactics and strategies to manage within and around constraints, including monetary costs, time and resources, and data volumes.

📓 Resources Big Data is Dead: https://motherduck.com/blog/big-data-... Small Data Manifesto: https://motherduck.com/blog/small-dat... Why Small Data?: https://benn.substack.com/p/is-excel-... Small Data SF: https://www.smalldatasf.com/

➡️ Follow Us LinkedIn: https://linkedin.com/company/motherduck
X/Twitter: https://twitter.com/motherduck
Blog: https://motherduck.com/blog/


Learn how your data team can drive innovation and maximize ROI by embracing constraints, drawing inspiration from SpaceX's revolutionary cost-effective approach. This video challenges the "abundance mindset" prevalent in the modern data stack, where easily scalable cloud data warehouses and a surplus of tools often lead to unmanageable data models and underutilized dashboards. We explore a focused data strategy for extracting maximum value from small data, shifting the paradigm from "more data" to more impact.

To maximize value, data teams must move beyond being order-takers and practice strategic stakeholder management. Discover how to use frameworks like the stakeholder engagement matrix to prioritize high-impact business leaders and align your work with core business goals. This involves speaking the language of business growth models, not technical jargon about data pipelines or orchestration, ensuring your data engineering efforts resonate with key decision-makers and directly contribute to revenue-generating activities.

Embracing constraints is key to innovation and effective data project management. We introduce the Iron Triangle—a fundamental engineering concept balancing scope, cost, and time—as a powerful tool for planning data projects and having transparent conversations with the business. By treating constraints not as limitations but as opportunities, data engineers and analysts can deliver higher-quality data products without succumbing to scope creep or uncontrolled costs.

A critical component of this strategy is understanding the Total Cost of Ownership (TCO), which goes far beyond initial compute costs to include ongoing maintenance, downtime, and the risk of vendor pricing changes. Learn how modern, efficient tools like DuckDB and MotherDuck are designed for cost containment from the ground up, enabling teams to build scalable, cost-effective data platforms. By making the true cost of data requests visible, you can foster accountability and make smarter architectural choices. Ultimately, this guide provides a blueprint for resisting data stack bloat and turning cost and constraints into your greatest assets for innovation.
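As a rough illustration of the TCO point (not figures from the talk), the sketch below totals the line items that usually get left out of a "compute plus storage" estimate; every number is a made-up placeholder.

```python
# Hypothetical TCO model: the visible bill plus the costs that rarely appear on it.
# All figures are placeholders for illustration only.
def annual_tco(compute_usd, storage_usd, licenses_usd,
               maintenance_hours, hourly_rate_usd,
               downtime_hours, revenue_per_hour_usd):
    maintenance = maintenance_hours * hourly_rate_usd    # engineers keeping it running
    downtime = downtime_hours * revenue_per_hour_usd     # business cost of outages
    return compute_usd + storage_usd + licenses_usd + maintenance + downtime

sticker_price = 60_000 + 12_000 + 25_000                 # compute + storage + licenses
total = annual_tco(60_000, 12_000, 25_000,
                   maintenance_hours=800, hourly_rate_usd=90,
                   downtime_hours=20, revenue_per_hour_usd=1_500)
print(sticker_price, total)  # 97000 vs 199000: hidden costs roughly double the bill
```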

We’re improving DataFramed, and we need your help! We want to hear what you have to say about the show, and how we can make it more enjoyable for you—find out more here.

Integrating generative AI with robust databases is becoming essential. As organizations face a plethora of database options and AI tools, making informed decisions is crucial for enhancing customer experiences and operational efficiency. How do you ensure your AI systems are powered by high-quality data? And how can these choices impact your organization's success?

Gerrit Kazmaier is the VP and GM of Data Analytics at Google Cloud. Gerrit leads the development and design of Google Cloud’s data technology, which includes data warehousing and analytics. Gerrit’s mission is to build a unified data platform for all types of data processing as the foundation for the digital enterprise. Before joining Google, Gerrit served as President of the HANA & Analytics team at SAP in Germany and led the global Product, Solution & Engineering teams for Databases, Data Warehousing and Analytics. In 2015, Gerrit served as the Vice President of SAP Analytics Cloud in Vancouver, Canada.

In this episode, Richie and Gerrit explore the transformative role of AI in data tools, the evolution of dashboards, the integration of AI with existing workflows, the challenges and opportunities in SQL code generation, the importance of a unified data platform, leveraging unstructured data, and much more.

Links Mentioned in the Show:
Google Cloud
Connect with Gerrit
Thinking Fast and Slow by Daniel Kahneman
Course: Introduction to GCP
Related Episode: Not Only Vector Databases: Putting Databases at the Heart of AI, with Andi Gutmans, VP and GM of Databases at Google
Rewatch sessions from RADAR: Forward Edition

New to DataCamp? Learn on the go using the DataCamp mobile app. Empower your business with world-class data and AI skills with DataCamp for business.

Big Data is Dead: Long Live Hot Data 🔥

Over the last decade, Big Data was everywhere. Let's set the record straight on what is and isn't Big Data. We have been consumed by a conversation about data volumes when we should focus more on the immediate task at hand: Simplifying our work.

Some of us may have Big Data, but our quest to derive insights from it is measured in small slices of work that fit on your laptop or in your hand. Easy data is here— let's make the most of it.

📓 Resources Big Data is Dead: https://motherduck.com/blog/big-data-is-dead/ Small Data Manifesto: https://motherduck.com/blog/small-data-manifesto/ Small Data SF: https://www.smalldatasf.com/

➡️ Follow Us LinkedIn: https://linkedin.com/company/motherduck X/Twitter: https://twitter.com/motherduck Blog: https://motherduck.com/blog/


Explore the "Small Data" movement, a counter-narrative to the prevailing big data conference hype. This talk challenges the assumption that data scale is the most important feature of every workload, defining big data as any dataset too large for a single machine. We'll unpack why this distinction is crucial for modern data engineering and analytics, setting the stage for a new perspective on data architecture.

Delve into the history of big data systems, starting with the non-linear hardware costs that plagued early data practitioners. Discover how Google's foundational papers on GFS, MapReduce, and Bigtable led to the creation of Hadoop, fundamentally changing how we scale data processing. We'll break down the "big data tax"—the inherent latency and system complexity overhead required for distributed systems to function, a critical concept for anyone evaluating data platforms.

Learn about the architectural cornerstone of the modern cloud data warehouse: the separation of storage and compute. This design, popularized by systems like Snowflake and Google BigQuery, allows storage to scale almost infinitely while compute resources are provisioned on-demand. Understand how this model paved the way for massive data lakes but also introduced new complexities and cost considerations that are often overlooked.

We examine the cracks appearing in the big data paradigm, especially for OLAP workloads. While systems like Snowflake are still dominant, the rise of powerful alternatives like DuckDB signals a shift. We reveal the hidden costs of big data analytics, exemplified by a petabyte-scale query costing nearly $6,000, and argue that for most use cases, it's too expensive to run computations over massive datasets.
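The figure of nearly $6,000 is easy to sanity-check with back-of-the-envelope math; assuming an on-demand rate of roughly $6 per TB scanned (an illustrative rate, since vendor pricing varies), a full scan of one petabyte lands in exactly that ballpark.

```python
# Back-of-the-envelope cost of scanning a full petabyte at ~$6 per TB (illustrative rate).
price_per_tb_usd = 6.0
tb_in_a_petabyte = 1_000
print(f"~${price_per_tb_usd * tb_in_a_petabyte:,.0f} per full-table query")  # ~$6,000
```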

The key to efficient data processing isn't your total data size, but the size of your "hot data" or working set. This talk argues that the revenge of the single node is here, as modern hardware can often handle the actual data queried without the overhead of the big data tax. This is a crucial optimization technique for reducing cost and improving performance in any data warehouse.

Discover the core principles for designing systems in a post-big data world. We'll show that since only 1 in 500 users run true big data queries, prioritizing simplicity over premature scaling is key. For low latency, process data close to the user with tools like DuckDB and SQLite. This local-first approach offers a compelling alternative to cloud-centric models, enabling faster, more cost-effective, and innovative data architectures.
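As a small illustration of this local-first approach, the sketch below runs an analytical query over a Parquet file directly on a laptop with DuckDB; the file and column names are placeholders.

```python
# Minimal local-first sketch: query the "hot" slice of data on a single machine
# with DuckDB, no warehouse cluster required. File and column names are placeholders.
import duckdb

duckdb.sql("""
    SELECT user_id, count(*) AS events
    FROM 'events.parquet'                        -- DuckDB scans Parquet in place
    WHERE event_time >= now() - INTERVAL 30 DAY  -- only the recent, "hot" working set
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10
""").show()
```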

Summary

Gleb Mezhanskiy, CEO and co-founder of Datafold, joins Tobias Macey to discuss the challenges and innovations in data migrations. Gleb shares his experiences building and scaling data platforms at companies like Autodesk and Lyft, and how these experiences inspired the creation of Datafold to address data quality issues across teams. He outlines the complexities of data migrations, including common pitfalls such as technical debt and the importance of achieving parity between old and new systems. Gleb also discusses Datafold's innovative use of AI and large language models (LLMs) to automate translation and reconciliation processes in data migrations, reducing the time and effort required for migrations.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today!

Your host is Tobias Macey and today I'm welcoming back Gleb Mezhanskiy to talk about Datafold's experience bringing AI to bear on the problem of migrating your data stack.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what the Data Migration Agent is and the story behind it?
What is the core problem that you are targeting with the agent?
What are the biggest time sinks in the process of database and tooling migration that teams run into?
Can you describe the architecture of your agent?
What was your selection and evaluation process for the LLM that you are using?
What were some of the main unknowns that you had to discover going into the project?
What are some of the evolutions in the ecosystem that occurred either during the development process or since your initial launch that have caused you to second-guess elements of the design?
In terms of SQL translation there are libraries such as SQLGlot and the work being done with SDF that aim to address that through AST parsing and subsequent dialect generation. What are the ways that approach is insufficient in the context of a platform migration?
How does the approach you are taking with the combination of data-diffing and automated translation help build confidence in the migration target?
What are the most interesting, innovative, or unexpected ways that you have seen the Data Migration Agent used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on building an AI-powered migration assistant?
When is the data migration agent the wrong choice?
What do you have planned for the future of applications of AI at Datafold?

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.

Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links

Datafold · Datafold Migration Agent · Datafold data-diff · Datafold Reconciliation Podcast Episode · SQLGlot · Lark parser · Claude 3.5 Sonnet · Looker · Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
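Since the interview contrasts AST-based translation with the LLM-assisted approach, here is a minimal sketch of what deterministic dialect transpilation with SQLGlot looks like; the query itself is a hypothetical example, not one from the episode.

```python
# Deterministic SQL dialect translation with SQLGlot (parse to AST, re-generate).
# The query is a hypothetical example.
import sqlglot

legacy_sql = "SELECT TOP 10 customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id"

# Parse T-SQL and emit the equivalent Snowflake SQL (TOP becomes LIMIT, etc.).
print(sqlglot.transpile(legacy_sql, read="tsql", write="snowflake")[0])
```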

Coalesce 2024: Unlocking low latency analytics with Firebolt and dbt integration

In this session Connor will dive into optimizing compute resources, accelerating query performance, and simplifying data transformations with dbt, covering in detail:
- SQL-based data transformation, and why it is gaining traction as the preferred language among data engineers
- Life cycle management for native objects like fact tables, dimension tables, primary indexes, aggregating indexes, join indexes, and others
- Declarative, version-controlled data modeling
- Auto-generated data lineage and documentation

Learn about incremental models, custom materializations, and column-level lineage. Discover practical examples and real-world use cases of how Firebolt enables data engineers to efficiently manage complex tasks and optimize data operations while achieving high efficiency and low latency on their data warehouse workloads.

Speaker: Connor Carreras, Solutions Architect, Firebolt

Read the blog to learn about the latest dbt Cloud features announced at Coalesce, designed to help organizations embrace analytics best practices at scale https://www.getdbt.com/blog/coalesce-2024-product-announcements

Coalesce 2024: How Roche is redefining data, analytics & genAI at scale with dbt

Centralize, harmonize, and streamline. That’s how Roche delivers self-service analytics to the thousands of people in its pharma commercial sector. dbt powers the backend solution that combines over 60 transactional systems into a harmonized, simplified data model. By adopting a version-controlled approach and enabling end-to-end lineage tracking, we achieved a significant reduction in duplication and accelerated time-to-insight for data-driven decision-making. The transition from a heterogeneous technology stack to standardized ways of working has fostered greater flexibility in allocating resources across the organization to address diverse use cases. Additionally, the scalable nature of this platform allows us to easily replicate successful data solutions globally. We further augmented our capabilities by integrating generative AI into our Redshift data warehouse, empowering the creation of innovative data products using dbt. This presentation will share practical lessons learned, architectural insights, and the tangible business impact realized from this data platform modernization.

Speaker: João Antunes, Lead Engineer, Hoffmann-La Roche

Read the blog to learn about the latest dbt Cloud features announced at Coalesce, designed to help organizations embrace analytics best practices at scale https://www.getdbt.com/blog/coalesce-2024-product-announcements

Coalesce 2024: Food + data for better lives: Modernizing the Houston Food Bank's data stack with dbt

The Houston Food Bank (HFB) is the largest food bank in the country, serving 18 southeast Texas counties and distributing over 120 million meals in the last fiscal year through our network of 1,600+ community partners to the 1 million-plus food-insecure persons in the region.

Over the last 2+ years, HFB has leveraged dbt to modernize our data stack. Initially working with dbt Core, our data team's engineers centralized, streamlined, and automated data pipelines to provide critical KPIs to HFB Leadership. Fast-forward to today, our data team of 10, which includes engineers, analysts, and other specialists, uses dbt Cloud to manage all data transformations in our data warehouse, which now supports 30+ integrations and 70+ reports that deliver 180+ metrics to stakeholders across the organization. This organizational transformation has saved countless hours for our staff, improved organizational trust in data significantly by identifying and managing sources of truth, and delivered key insights to stakeholders across our entire organization.

A handful of examples include:
- Identifying corporate donor opportunities by mining donor and volunteer data
- Increasing the number of opportunities for federal and grant-based funding by being able to generate metrics across an ever-increasing number of data sources
- Assessing the efficiency of school-based programs by analyzing the proportion and volume of students served relative to the food-insecure population of each school

HFB is committed to being a data leader in the food banking space, and we’re hoping our journey using dbt can inspire other non-profits to leverage the platform as well.

Speakers: Erwin Kristel, Data Analyst, Houston Food Bank

Benjamin Herndon-Miller, Data Engineer, Houston Food Bank

Susan Quiros, Data Analyst II, Houston Food Bank

Read the blog to learn about the latest dbt Cloud features announced at Coalesce, designed to help organizations embrace analytics best practices at scale https://www.getdbt.com/blog/coalesce-2024-product-announcements

Coalesce 2024: Surfing the LLM wave: We can't opt out and neither can you

This session is a practical guide to changing how you operate in response to the Cambrian explosion of AI and LLM technologies. In your future, everyone at the company will have access to an LLM with unfettered access to your data warehouse. Do you feel afraid? The Data team at Hex did too. They'll share how they had to change how they worked to adapt, and what data leaders and practitioners need to be thinking about for their own teams.

Speakers: Amanda Fioritto, Senior Analytics Engineer, Hex Technologies

Erika Pullum, Analytics Engineer, Hex Technologies

Read the blog to learn about the latest dbt Cloud features announced at Coalesce, designed to help organizations embrace analytics best practices at scale https://www.getdbt.com/blog/coalesce-2024-product-announcements

Coalesce 2024: How SurveyMonkey sharpens dbt performance and governance with data observability

The data team at SurveyMonkey, the global leader in survey software, oversees heavy data transformation in dbt Cloud — both to power current business-critical projects, and also to migrate legacy workloads. Much of that transformation work is taking raw data — either from legacy databases or their cloud data warehouse (Snowflake) — and making it accessible and useful for downstream users. And to Samiksha Gour, Senior Data Engineering Manager at SurveyMonkey, each of these projects is not considered complete unless the proper checks, monitors, and alerts are in place.

Join Samiksha in this informative session as she walks through how her team uses dbt and their data observability platform Monte Carlo to ensure proper governance, gain efficiencies by eliminating duplicate testing and monitoring, and use data lineage to ensure upstream and downstream continuity for users and stakeholders.

Speaker: Samiksha Gour, Senior Data Engineering Manager, SurveyMonkey

Read the blog to learn about the latest dbt Cloud features announced at Coalesce, designed to help organizations embrace analytics best practices at scale https://www.getdbt.com/blog/coalesce-2024-product-announcements

The Data Product Management In Action podcast, brought to you by Soda and executive producer Scott Hirleman, is a platform for data product management practitioners to share insights and experiences. In Season 01, Episode 19, host Nadiem von Heydebrand interviews Pradeep Fernando, who leads the data and metadata management initiative at Swisscom. They explore key topics in data product management, including the definition and categorization of data products, the role of AI, prioritization strategies, and the application of product management principles. Pradeep shares valuable insights and experiences on successfully implementing data product management within organizations.

About our host Nadiem von Heydebrand: Nadiem is the CEO and Co-Founder of Mindfuel. In 2019, he merged his passion for data science with product management, becoming a thought leader in data product management. Nadiem is dedicated to demonstrating the true value contribution of data. With over a decade of experience in the data industry, Nadiem leverages his expertise to scale data platforms, implement data mesh concepts, and transform AI performance into business performance, delighting consumers at global organizations that include Volkswagen, Munich Re, Allianz, Red Bull, and Vorwerk. Connect with Nadiem on LinkedIn.

About our guest Pradeep Fernando: Pradeep is a seasoned data product leader with over 6 years of data product leadership experience and over 10 years of product management experience. He leads or is a key contributor to several company-wide data & analytics initiatives at Swisscom such as Data as a Product (Data Mesh), One Data Platform, Machine Learning (Factory), MetaData management, Self-service data & analytics, BI Tooling Strategy, Cloud Transformation, Big Data platforms, and Data warehousing. Previously, he was a product manager at both Swisscom's B2B and Innovation units, both building new products and optimizing mature products (profitability) in the domains of enterprise mobile fleet management and cyber- and mobile-device security. Pradeep is also passionate about and experienced in leading the development of data products and transforming IT delivery teams into empowered, agile product teams. And he is always happy to engage in a conversation about lean product management or "heavier" topics such as humanity's future or our past. Connect with Pradeep on LinkedIn.

All views and opinions expressed are those of the individuals and do not necessarily reflect their employers or anyone else. Join the conversation on LinkedIn. Apply to be a guest or nominate someone that you know. Do you love what you're listening to? Please rate and review the podcast, and share it with fellow practitioners you know. Your support helps us reach more listeners and continue providing valuable insights!

The data landscape is fickle, and once-coveted roles like "DBA" and "Data Scientist" have faced challenges. Now, the spotlight shines on Data Engineers, but will they suffer the same fate? 

This talk dives into historical trends.

In the early 2010s, DBA/data warehouse was the sexiest job. Data Warehousing became the “No Team.”

In the mid-2010s, data scientist was the sexiest job. Data Science became the “mistaken for” team.

Now, data engineering is the sexiest job. Data Engineering became the “confused” team. Confusion runs rampant with questions about the industry: What is a data engineer? What do they do? Should we have all kinds of nuanced titles for variations? Just how technical should they be?

Together, let’s look back at history for ways data engineering can avoid the same fate as data warehousing and data science.

This talk provides a thought-provoking discussion on navigating the exciting yet challenging world of data engineering. Let's avoid the pitfalls of the past and shape a future where data engineers thrive as essential drivers of innovation and success.

Main Takeaways:

● We need to look back at the history of data teams to avoid repeating their mistakes

● Data Engineering is repeating the same mistakes as Data Science and Data Warehousing

● Learn actionable insights that can help data engineering avoid a similar fate

The next big innovation in data management after the separation of compute and storage is open table formats. These formats have truly commoditized storage, allowing you to store data anywhere and run multiple compute workloads without vendor lock-in. This innovation addresses the biggest challenges of cloud data warehousing — performance, usability, and high costs — ushering in the era of the data lakehouse architecture.

In this session, discover how an AI-powered data lakehouse:

• Unlocks data for modern AI use cases

• Enhances performance and enables real-time analytics

• Reduces total cost of ownership (TCO) by up to 75%

• Delivers increased interoperability across the entire data landscape

Join us to explore how the integration of AI with the lakehouse architecture can transform your approach to data management and analytics.
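To make the "store once, run many compute workloads" idea above concrete, here is a hedged sketch using Apache Iceberg (one common open table format; the session abstract does not name a specific one), read via PyIceberg and then queried with DuckDB; the catalog URI and table name are placeholders.

```python
# Hedged sketch of one copy of data serving multiple engines via an open table
# format (Apache Iceberg here). Catalog URI and table name are placeholders.
import duckdb
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default", uri="http://localhost:8181")  # e.g. a REST catalog
orders = catalog.load_table("analytics.orders")

# Workload 1: pull the table into Arrow for a Python / ML pipeline.
arrow_tbl = orders.scan().to_arrow()

# Workload 2: run warehouse-style SQL over the same data with DuckDB.
print(duckdb.sql("SELECT status, count(*) AS n FROM arrow_tbl GROUP BY status"))
```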

Overcome the limitations of your legacy data warehouse or BI systems and reap the benefits of a cloud-native stack with LeapLogic, Impetus’ automated cloud migration accelerator. Join our session to explore how LeapLogic’s end-to-end automated capabilities can fast-track and streamline the transformation of legacy data warehouse, ETL, Hadoop, analytics, and reporting workloads to the cloud. Gain actionable insights from real-world success stories of Fortune 500 enterprises that have successfully modernised their legacy workloads, positioning them at the forefront of the GenAI revolution. 

In today's data-driven landscape, the ability to efficiently harness the power of AI is crucial for businesses seeking to unlock valuable insights and drive innovation. This session will explore how BigQuery, Google Cloud's leading data warehouse solution, can accelerate your AI initiatives. Discover how BigQuery's serverless architecture, built-in machine learning capabilities, and seamless integration with Google Cloud's AI ecosystem empower you to build, train, and deploy ML models at scale. Whether you're a data scientist, engineer, or business leader, this session will provide you with actionable insights and strategies to supercharge your AI efforts with BigQuery.
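As a small, hedged illustration of the built-in machine learning the session highlights, the sketch below trains a BigQuery ML model with plain SQL through the Python client; the project, dataset, table, and column names are placeholders.

```python
# Minimal BigQuery ML sketch: train a model with SQL via the Python client.
# Project, dataset, table, and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

client.query("""
    CREATE OR REPLACE MODEL `my_dataset.churn_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT tenure_months, monthly_spend, support_tickets, churned
    FROM `my_dataset.customers`
""").result()   # blocks until the serverless training job finishes
```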

DuckDB in Action

Dive into DuckDB and start processing gigabytes of data with ease—all with no data warehouse. DuckDB is a cutting-edge SQL database that makes it incredibly easy to analyze big data sets right from your laptop. In DuckDB in Action you’ll learn everything you need to know to get the most out of this awesome tool, keep your data secure on prem, and save you hundreds on your cloud bill. From data ingestion to advanced data pipelines, you’ll learn everything you need to get the most out of DuckDB—all through hands-on examples.

Open up DuckDB in Action and learn how to:
- Read and process data from CSV, JSON, and Parquet sources, both local and remote
- Write analytical SQL queries, including aggregations, common table expressions, window functions, special types of joins, and pivot tables
- Use DuckDB from Python, both with SQL and its "Relational" API, interacting with databases as well as data frames
- Prepare, ingest, and query large datasets
- Build cloud data pipelines
- Extend DuckDB with custom functionality

Pragmatic and comprehensive, DuckDB in Action introduces the DuckDB database and shows you how to use it to solve common data workflow problems. You won’t need to read through pages of documentation—you’ll learn as you work. Get to grips with DuckDB's unique SQL dialect, learning to seamlessly load, prepare, and analyze data using SQL queries. Extend DuckDB with both Python and built-in tools such as MotherDuck, and gain practical insights into building robust and automated data pipelines.

About the Technology
DuckDB makes data analytics fast and fun! You don’t need to set up a Spark cluster or run a cloud data warehouse just to process a few hundred gigabytes of data. DuckDB is easily embeddable in any data analytics application, runs on a laptop, and processes data from almost any source, including JSON, CSV, Parquet, SQLite, and Postgres.

About the Book
DuckDB in Action guides you example by example from setup, through your first SQL query, to advanced topics like building data pipelines and embedding DuckDB as a local data store for a Streamlit web app. You’ll explore DuckDB’s handy SQL extensions, get to grips with aggregation, analysis, and data without persistence, and use Python to customize DuckDB. A hands-on project accompanies each new topic, so you can see DuckDB in action.

What's Inside
- Prepare, ingest, and query large datasets
- Build cloud data pipelines
- Extend DuckDB with custom functionality
- Fast-paced SQL recap: from simple queries to advanced analytics

About the Reader
For data pros comfortable with Python and CLI tools.

About the Authors
Mark Needham is a blogger and video creator at @LearnDataWithMark. Michael Hunger leads product innovation for the Neo4j graph database. Michael Simons is a Java Champion, author, and Engineer at Neo4j.

Quotes
"I use DuckDB every day, and I still learned a lot about how DuckDB makes things that are hard in most databases easy!" - Jordan Tigani, Founder, MotherDuck
"An excellent resource! Unlocks possibilities for storing, processing, analyzing, and summarizing data at the edge using DuckDB." - Pramod Sadalage, Director, Thoughtworks
"Clear and accessible. A comprehensive resource for harnessing the power of DuckDB for both novices and experienced professionals." - Qiusheng Wu, Associate Professor, University of Tennessee
"Excellent! The book all we ducklings have been waiting for!" - Gunnar Morling, Decodable
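As a taste of the Python "Relational" API the book covers alongside SQL, here is a short sketch; the CSV file and column names are placeholders.

```python
# DuckDB's Python relational API: build a query as a chain of relation operations.
# The CSV file and column names are placeholders.
import duckdb

rel = duckdb.read_csv("sales.csv")            # relation over the file, evaluated lazily
top_regions = (rel.filter("amount > 0")
                  .aggregate("region, sum(amount) AS revenue")
                  .order("revenue DESC")
                  .limit(5))
top_regions.show()      # print the result
df = top_regions.df()   # or hand it off as a pandas DataFrame
```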

Summary

In this episode of the Data Engineering Podcast, host Tobias Macey welcomes back Chris Bergh, CEO of DataKitchen, to discuss his ongoing mission to simplify the lives of data engineers. Chris explains the challenges faced by data engineers, such as constant system failures, the need for rapid changes, and high customer demands. Chris delves into the concept of DataOps, its evolution, and the misappropriation of related terms like data mesh and data observability. He emphasizes the importance of focusing on processes and systems rather than just tools to improve data engineering workflows. Chris also introduces DataKitchen's open source tools, DataOps TestGen and DataOps Observability, designed to automate data quality validation and monitor data journeys in production.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

Your host is Tobias Macey and today I'm interviewing Chris Bergh about his tireless quest to simplify the lives of data engineers.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what DataKitchen is and the story behind it?
You helped to define and popularize "DataOps", which then went through a journey of misappropriation similar to "DevOps", and has since faded in use. What is your view on the realities of "DataOps" today?
Out of the popularized wave of "DataOps" tools came subsequent trends in data observability, data reliability engineering, etc. How have those cycles influenced the way that you think about the work that you are doing at DataKitchen?
The data ecosystem went through a massive growth period over the past ~7 years, and we are now entering a cycle of consolidation. What are the fundamental shifts that we have gone through as an industry in the management and application of data?
What are the challenges that never went away?
You recently open sourced the dataops-testgen and dataops-observability tools. What are the outcomes that you are trying to produce with those projects?
What are the areas of overlap with existing tools and what are the unique capabilities that you are offering?
Can you talk through the technical implementation of your new observability and quality testing platform?
What does the onboarding and integration process look like?
Once a team has one or both tools set up, what are the typical points of interaction that they will have over the course of their workday?
What are the most interesting, innovative, or unexpected ways that you have seen dataops-observability/testgen used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on promoting DataOps?
What do you have planned for the future of your work at DataKitchen?

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

DataKitchen · Podcast Episode · NASA · DataOps Manifesto · Data Reliability Engineering · Data Observability · dbt · DevOps Enterprise Summit · Building The Data Warehouse by Bill Inmon (affiliate link) · dataops-testgen, dataops-observability · Free Data Quality and Data Observability Certification · Databricks · DORA Metrics · DORA for data

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
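For readers curious what "automated data quality validation" can look like in practice, here is a deliberately simplified, hypothetical sketch of the kind of checks such tools generate and run; it is not the DataOps TestGen API.

```python
# Hypothetical illustration only -- not the DataOps TestGen API. Shows the kind of
# volume, completeness, and uniqueness checks a data quality tool can automate.
def run_checks(rows, key="order_id"):
    """rows: list of dicts representing a table pulled from the warehouse."""
    failures = []
    if not rows:
        failures.append("volume: table is empty")
    if any(r.get(key) is None for r in rows):
        failures.append(f"completeness: null {key} found")
    if len({r.get(key) for r in rows}) != len(rows):
        failures.append(f"uniqueness: duplicate {key} values")
    return failures

print(run_checks([{"order_id": 1}, {"order_id": 1}]))  # ['uniqueness: duplicate order_id values']
```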