Accelerated Computing in Modern Data Centers With Datapelago

2025-03-08 · Data Engineering Podcast Listen

podcast_episode

by Rajan Goyal (Datapelago) , Tobias Macey

AI/ML Data Engineering Data Management Datafold Iceberg Spark Trino

Summary In this episode of the Data Engineering Podcast Rajan Goyal, CEO and co-founder of Datapelago, talks about improving efficiencies in data processing by reimagining system architecture. Rajan explains the shift from hyperconverged to disaggregated and composable infrastructure, highlighting the importance of accelerated computing in modern data centers. He discusses the evolution from proprietary to open, composable stacks, emphasizing the role of open table formats and the need for a universal data processing engine, and outlines Datapelago's strategy to leverage existing frameworks like Spark and Trino while providing accelerated computing benefits.

Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementData migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.Your host is Tobias Macey and today I'm interviewing Rajan Goyal about how to drastically improve efficiencies in data processing by re-imagining the system architectureInterview IntroductionHow did you get involved in the area of data management?Can you start by outlining the main factors that contribute to performance challenges in data lake environments?The different components of open data processing systems have evolved from different starting points with different objectives. In your experience, how has that un-planned and un-synchronized evolution of the ecosystem hindered the capabilities and adoption of open technologies?The introduction of a new cross-cutting capability (e.g. Iceberg) has typically taken a substantial amount of time to gain support across different engines and ecosystems. What do you see as the point of highest leverage to improve the capabilities of the entire stack with the least amount of co-ordination?What was the motivating insight that led you to invest in the technology that powers Datapelago?Can you describe the system design of Datapelago and how it integrates with existing data engines?The growth in the generation and application of unstructured data is a notable shift in the work being done by data teams. What are the areas of overlap in the fundamental nature of data (whether structured, semi-structured, or unstructured) that you are able to exploit to bridge the processing gap?What are the most interesting, innovative, or unexpected ways that you have seen Datapelago used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on Datapelago?When is Datapelago the wrong choice?What do you have planned for the future of Datapelago?Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Links DatapelagoMIPS ArchitectureARM ArchitectureAWS NitroMellanoxNvidiaVon Neumann ArchitectureTPU == Tensor Processing UnitFPGA == Field-Programmable Gate ArraySparkTrinoIcebergPodcast EpisodeDelta LakePodcast EpisodeHudiPodcast EpisodeApache GlutenIntermediate RepresentationTuring CompletenessLLVMAmdahl's LawLSTM == Long Short-Term MemoryThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

AWS re:Invent 2024 - [NEW LAUNCH] Amazon SageMaker Lakehouse: Accelerate analytics & AI (ANT354-NEW)

2024-12-05 · AWS re:Invent 2024 Watch

video

by Neeraja Rentachintala (Amazon) , Mahesh Mishra (Amazon Web Services)

Agile/Scrum AI/ML Analytics AWS Cloud Computing Data Lakehouse Iceberg Redshift S3 Amazon SageMaker Cyber Security

Data warehouses, data lakes, or both? Explore how Amazon SageMaker Lakehouse, a unified, open, and secure data lake house simplifies analytics and AI. This session unveils how SageMaker Lakehouse provides unified access to data across Amazon S3 data lakes, Amazon Redshift data warehouses, and third-party sources without altering your existing architecture. Learn how it breaks down data silos and opens your data estate with Apache Iceberg compatibility, offering flexibility to use preferred query engines and tools that accelerate your time to insights. Discover robust security features, including consistent fine-grained access controls, that help democratize data without compromises.

Learn more: AWS re:Invent: https://go.aws/reinvent. More AWS events: https://go.aws/3kss9CP

Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4

About AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

AWSreInvent #AWSreInvent2024

AWS re:Invent 2024 - A practitioner’s guide to data for generative AI (DAT319)

2024-12-03 · AWS re:Invent 2024 Watch

video

by Jonathan Katz (Amazon Redshift) , Siva Raghupathy (AWS)

Agile/Scrum AI/ML AWS Aurora AWS Glue Kinesis Cloud Computing Data Quality GenAI RAG Data Streaming

In this session, gain the skills needed to deploy end-to-end generative AI applications using your most valuable data. While this session focuses on the Retrieval Augmented Generation (RAG) process, the concepts also apply to other methods of customizing generative AI applications. Discover best practice architectures using AWS database services like Amazon Aurora, Amazon OpenSearch Service, or Amazon MemoryDB along with data processing services like AWS Glue and streaming data services like Amazon Kinesis. Learn data lake, governance, and data quality concepts and how Amazon Bedrock Knowledge Bases, Amazon Bedrock Agents, and other features tie solution components together.

Learn more: AWS re:Invent: https://go.aws/reinvent. More AWS events: https://go.aws/3kss9CP

Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4

About AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

AWSreInvent #AWSreInvent2024

Data Engineering with AWS Cookbook

2024-11-29 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Viquar Khan , Trâm Ngọc Phạm , Gonzalo Herreros González , Huda Nofal

Analytics Athena AWS Amazon EMR AWS Glue Big Data Cloud Computing Data Engineering ETL/ELT QuickSight Redshift data +1 more

Data Engineering with AWS Cookbook serves as a comprehensive practical guide for building scalable and efficient data engineering solutions using AWS. With this book, you will master implementing data lakes, orchestrating data pipelines, and creating serving layers using AWS's robust services, such as Glue, EMR, Redshift, and Athena. With hands-on exercises and practical recipes, you will enhance your AWS-based data engineering projects. What this Book will help me do Gain the skills to design centralized data lake solutions and manage them securely at scale. Develop expertise in crafting data pipelines with AWS's ETL technologies like Glue and EMR. Learn to implement and automate governance, orchestration, and monitoring for data platforms. Build high-performance data serving layers using AWS analytics tools like Redshift and QuickSight. Effectively plan and execute data migrations to AWS from on-premises infrastructure. Author(s) Trâm Ngọc Phạm, Gonzalo Herreros González, Viquar Khan, and Huda Nofal bring together years of collective experience in data engineering and AWS cloud solutions. Each author's deep knowledge and passion for cloud technology have shaped this book into a valuable resource, geared towards practical learning and real-world application. Their approach ensures readers are not just learning but building tangible, impactful solutions. Who is it for? This book is geared towards data engineers and big data professionals engaged in or transitioning to cloud-based environments, specifically on AWS. Ideal readers are those looking to optimize workflows and master AWS tools to create scalable, efficient solutions. The content assumes a basic familiarity with AWS concepts like IAM roles and a command-line interface, ensuring all examples are accessible yet meaningful for those seeking advancement in AWS data engineering.

Ingest govern and secure your data with OneLake | BRK201

2024-11-21 · Microsoft Ignite 2023 Watch

video

Cloud Computing Microsoft Fabric

With the massive growth in the volume of data and data sources, managing an entire organization-wide data estate is becoming increasingly complex. This session will explore the latest capabilities coming to OneLake, Fabric’s multi-cloud data lake, that can help you bring in data from any source and then govern and secure that data. Discover how new mirroring and governance tools in Fabric can help you manage your data estate and unlock deeper insights.

𝗦𝗽𝗲𝗮𝗸𝗲𝗿𝘀: * Joshua Caplan * Wilson Lee * Adi Regev

𝗦𝗲𝘀𝘀𝗶𝗼𝗻 𝗜𝗻𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻: This is one of many sessions from the Microsoft Ignite 2024 event. View even more sessions on-demand and learn about Microsoft Ignite at https://ignite.microsoft.com

BRK201 | English (US) | Data

MSIgnite

Building a Data Mesh

2024-11-20 · Data Product Management in Action: The Practitioner's Podcast Listen

podcast_episode

by Soheil Mirchi (Temedica) , Frannie Helforoush (RBC Global Asset Management)

Data Contracts Data Management

Building a Data Mesh The Data Product Management In Action podcast, brought to you by Soda and executive producer Scott Hirleman, is a platform for data product management practitioners to share insights and experiences. In Episode 23 of Data Product Management in Action, our host Frannie Helforoush is joined by Soheil Mirchi, a technical product manager. Soheil discusses his company’s shift from a centralized data lake to a decentralized data mesh architecture. He outlines the three types of data products—source-aligned, aggregated, and customer-facing—and highlights the importance of data contracts and testing. Learn about strategies for measuring success through metrics and customer feedback, along with lessons on starting small and fostering data democratization. Tune in for essential insights on effective data management! About our host Frannie Helforoush: Frannie's journey began as a software engineer and evolved into a strategic product manager. Now, as a data product manager, she leverages her expertise in both fields to create impactful solutions. Frannie thrives on making data accessible and actionable, driving product innovation, and ensuring product thinking is integral to data management. Connect with Frannie on LinkedIn. About our guest Soheil Mirchi :Soheil is a Technical Product Manager at Temedica, a health insights company focused on transforming complex healthcare and pharmaceutical data into actionable insights. Leading a team of data engineers, scientists, and analysts, Soheil drives the development of cutting-edge data products while guiding the company’s transition to a data mesh architecture. He is passionate about empowering teams with the autonomy to manage their own data products and believes in a collaborative approach to driving innovation in the health tech space. Connect with Soheil on LinkedIn. All views and opinions expressed are those of the individuals and do not necessarily reflect their employers or anyone else. Join the conversation on LinkedIn. Apply to be a guest or nominate someone that you know. Do you love what you're listening to? Please rate and review the podcast, and share it with fellow practitioners you know. Your support helps us reach more listeners and continue providing valuable insights!

Bring Vector Search And Storage To The Data Lake With Lance

2024-10-20 · Data Engineering Podcast Listen

podcast_episode

by Weston Pace , Tobias Macey

AI/ML Arrow Cloud Computing Cloud Storage Data Engineering Data Lakehouse Data Management Data Modelling Datafold Delta GenAI Iceberg +3 more

Summary The rapid growth of generative AI applications has prompted a surge of investment in vector databases. While there are numerous engines available now, Lance is designed to integrate with data lake and lakehouse architectures. In this episode Weston Pace explains the inner workings of the Lance format for table definitions and file storage, and the optimizations that they have made to allow for fast random access and efficient schema evolution. In addition to integrating well with data lakes, Lance is also a first-class participant in the Arrow ecosystem, making it easy to use with your existing ML and AI toolchains. This is a fascinating conversation about a technology that is focused on expanding the range of options for working with vector data. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementImagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today!Your host is Tobias Macey and today I'm interviewing Weston Pace about the Lance file and table format for column-oriented vector storageInterview IntroductionHow did you get involved in the area of data management?Can you describe what Lance is and the story behind it?What are the core problems that Lance is designed to solve?What is explicitly out of scope?The README mentions that it is straightforward to convert to Lance from Parquet. What is the motivation for this compatibility/conversion support?What formats does Lance replace or obviate?In terms of data modeling Lance obviously adds a vector type, what are the features and constraints that engineers should be aware of when modeling their embeddings or arbitrary vectors?Are there any practical or hard limitations on vector dimensionality?When generating Lance files/datasets, what are some considerations to be aware of for balancing file/chunk sizes for I/O efficiency and random access in cloud storage?I noticed that the file specification has space for feature flags. How has that aided in enabling experimentation in new capabilities and optimizations?What are some of the engineering and design decisions that were most challenging and/or had the biggest impact on the performance and utility of Lance?The most obvious interface for reading and writing Lance files is through LanceDB. Can you describe the use cases that it focuses on and its notable features?What are the other main integrations for Lance?What are the opportunities or roadblocks in adding support for Lance and vector storage/indexes in e.g. Iceberg or Delta to enable its use in data lake environments?What are the most interesting, innovative, or unexpected ways that you have seen Lance used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on the Lance format?When is Lance the wrong choice?What do you have planned for the future of Lance?Contact Info LinkedInGitHubParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Links Lance FormatLanceDBSubstraitPyArrowFAISSPineconePodcast EpisodeParquetIcebergPodcast EpisodeDelta LakePodcast EpisodePyLanceHilbert CurvesSIFT VectorsS3 ExpressWekaDataFusionRay DataTorch Data LoaderHNSW == Hierarchical Navigable Small Worlds vector indexIVFPQ vector indexGeoJSONPolarsThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

The Role of Python in Shaping the Future of Data Platforms with DLT

2024-10-13 · Data Engineering Podcast Listen

podcast_episode

by Marcin Rudolf (DLT Hub) , Tobias Macey , Adrian Brudaru (dlthub)

AI/ML API Arrow Data Engineering Data Lakehouse Data Management Datafold DuckDB GenAI Python

Summary In this episode of the Data Engineering Podcast, Adrian Broderieux and Marcin Rudolph, co-founders of DLT Hub, delve into the principles guiding DLT's development, emphasizing its role as a library rather than a platform, and its integration with lakehouse architectures and AI application frameworks. The episode explores the impact of the Python ecosystem's growth on DLT, highlighting integrations with high-performance libraries and the benefits of Arrow and DuckDB. The episode concludes with a discussion on the future of DLT, including plans for a portable data lake and the importance of interoperability in data management tools. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementImagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today!Your host is Tobias Macey and today I'm interviewing Adrian Brudaru and Marcin Rudolf, cofounders at dltHub, about the growth of dlt and the numerous ways that you can use it to address the complexities of data integrationInterview IntroductionHow did you get involved in the area of data management?Can you describe what dlt is and how it has evolved since we last spoke (September 2023)?What are the core principles that guide your work on dlt and dlthub?You have taken a very opinionated stance against managed extract/load services. What are the shortcomings of those platforms, and when would you argue in their favor?The landscape of data movement has undergone some interesting changes over the past year. Most notably, the growth of PyAirbyte and the rapid shifts around the needs of generative AI stacks (vector stores, unstructured data processing, etc.). How has that informed your product development and positioning?The Python ecosystem, and in particular data-oriented Python, has also undergone substantial evolution. What are the developments in the libraries and frameworks that you have been able to benefit from?What are some of the notable investments that you have made in the developer experience for building dlt pipelines?How have the interfaces for source/destination development improved?You recently published a post about the idea of a portable data lake. What are the missing pieces that would make that possible, and what are the developments/technologies that put that idea within reach?What is your strategy for building a sustainable product on top of dlt?How does that strategy help to form a "virtuous cycle" of improving the open source foundation?What are the most interesting, innovative, or unexpected ways that you have seen dlt used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on dlt?When is dlt the wrong choice?What do you have planned for the future of dlt/dlthub?Contact Info AdrianLinkedInMarcinLinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.Links dltPodcast EpisodePyArrowPolarsIbisDuckDBPodcast Episodedlt Data ContractsRAG == Retrieval Augmented GenerationAI Engineering Podcast EpisodePyAirbyteOpenAI o1 ModelLanceDBQDrant EmbeddedAirflowGitHub ActionsArrow DataFusionApache ArrowPyIcebergDelta-RSSCD2 == Slowly Changing DimensionsSQLAlchemySQLGlotFSSpecPydanticSpacyEntity RecognitionParquet File FormatPython DecoratorREST API ToolkitOpenAPI Connector GeneratorConnectorXPython no-GILDelta LakePodcast EpisodeSQLMeshPodcast EpisodeHamiltonTabularPostHogPodcast.init EpisodeAsyncIOCursor.AIData MeshPodcast EpisodeFastAPILangChainGraphRAGAI Engineering Podcast EpisodeProperty GraphPython uvThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

The Inevitable Shift to Modern Data Lake Architecture

2024-09-18 · Big Data LDN 2024

Face To Face

by Guy Fighel (Hetz Ventures)

Data Management Iceberg

In the next five years, we are poised to witness a significant transformation towards modern data lake architecture across industries. This shift is driven by an urgent need for a unified, flexible, and scalable data management solution. Such a solution must address the challenges of siloed data environments and the increasing complexity of data sources while balancing the benefits of data mesh principles with centralized governance and semantic consistency.

In this talk, we will cover latest trends and benefits in this field, as well as usage of open formats like Iceberg, lower costs of data movement, & multiple engines to support different workloads that ultimately helps in getting into a single source of truth.

Lessons Learnt From Supporting Modern Data Lake Formats at Snowplow

2024-09-18 · Big Data LDN 2024

Face To Face

by Jordan Peck (/ Snowplow)

Snowplow

Snowplow has been a leader in developing first-class support for data lakes and modern day open table formats. In this talk, Jordan will discuss the most popular data lake formats, why they have grown in popularity in the last 12-18 months, and best practices when deciding to adopt open lake formats.

Self Service Analytics - Open Your Data Lake to Ad-Hoc Queries

2024-09-18 · Big Data LDN 2024

Face To Face

by Julian Schaefer , Graham Evans

Analytics Dremio Cyber Security

Like many large businesses Maersk relies on a centrally managed Data Lake to deliver customer insights and business reports. But in today's fast-paced business environment, timely and accurate decision-making relies heavily on the ability to access and analyse data at will. Opening up your Data Lake and allowing business users the ability to query the data directly, could bring major benefits; but also brings with it technical and security challenges.

Join Graham and Julian as they explore how Maersk has used Dremio software to deliver Self Service Analytics and their checklist of the essential ‘must have’ technical capabilities.

Self Service Analytics - Open Your Data Lake to Ad-Hoc Queries

2024-09-18 · Exhibitors' Events - Auto created

Face To Face

by Graham Evans Senior Enterprise Architect - Maersk , Julian Schaefer Senior Enterprise Architect - Maersk

Analytics Dremio

Explore how Maersk has used Dremio software to deliver Self Service Analytics and essential ‘must have’ technical capabilities.

#252: The Ever-Shifting Operating Environment of the Data Professional

2024-08-20 · The Analytics Power Hour Listen

podcast_episode

by Val Kroll , Julie Hoyer , Tim Wilson (Analytics Power Hour - Columbus (OH) , Katie Bauer (GlossGenius) , Moe Kiss (Canva) , Michael Helbling (Search Discovery)

AI/ML Big Data SQL

Broadly writ, we're all in the business of data work in some form, right? It's almost like we're all swimming around in a big data lake, and our peers are swimming around it, too, and so are our business partners. There might be some HiPPOs and some SLOTHs splashing around in the shallow end, and the contours of the lake keep changing. Is lifeguarding…or writing SQL…or prompt engineering to get AI to write SQL…or identifying business problems a job or a skill? Does it matter? Aren't we all just trying to get to the Insights Water Slide? Katie Bauer, Head of Data at Gloss Genius and thought-provoker at Wrong But Useful, joined Michael, Julie, and Val for a much less metaphorically tortured exploration of the ever-shifting landscape in which the modern data professional operates. Or swims. Or sinks? For complete show notes, including links to items mentioned in this episode and a transcript of the show, visit the show page.

The Evolution of DataOps: Insights from DataKitchen's CEO

2024-08-04 · Data Engineering Podcast Listen

podcast_episode

by Chris Berg (DataKitchen) , Tobias Macey

Data Engineering Data Lakehouse Data Management Data Quality DataOps Delta DevOps DWH Hive Iceberg Trino

Summary In this episode of the Data Engineering Podcast, host Tobias Macey welcomes back Chris Berg, CEO of DataKitchen, to discuss his ongoing mission to simplify the lives of data engineers. Chris explains the challenges faced by data engineers, such as constant system failures, the need for rapid changes, and high customer demands. Chris delves into the concept of DataOps, its evolution, and the misappropriation of related terms like data mesh and data observability. He emphasizes the importance of focusing on processes and systems rather than just tools to improve data engineering workflows. Chris also introduces DataKitchen's open-source tools, DataOps TestGen and DataOps Observability, designed to automate data quality validation and monitor data journeys in production. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementData lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.Your host is Tobias Macey and today I'm interviewing Chris Bergh about his tireless quest to simplify the lives of data engineersInterview IntroductionHow did you get involved in the area of data management?Can you describe what DataKitchen is and the story behind it?You helped to define and popularize "DataOps", which then went through a journey of misappropriation similar to "DevOps", and has since faded in use. What is your view on the realities of "DataOps" today?Out of the popularized wave of "DataOps" tools came subsequent trends in data observability, data reliability engineering, etc. How have those cycles influenced the way that you think about the work that you are doing at DataKitchen?The data ecosystem went through a massive growth period over the past ~7 years, and we are now entering a cycle of consolidation. What are the fundamental shifts that we have gone through as an industry in the management and application of data?What are the challenges that never went away?You recently open sourced the dataops-testgen and dataops-observability tools. What are the outcomes that you are trying to produce with those projects?What are the areas of overlap with existing tools and what are the unique capabilities that you are offering?Can you talk through the technical implementation of your new obserability and quality testing platform?What does the onboarding and integration process look like?Once a team has one or both tools set up, what are the typical points of interaction that they will have over the course of their workday?What are the most interesting, innovative, or unexpected ways that you have seen dataops-observability/testgen used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on promoting DataOps?What do you have planned for the future of your work at DataKitchen?Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Links DataKitchenPodcast EpisodeNASADataOps ManifestoData Reliability EngineeringData ObservabilitydbtDevOps Enterprise SummitBuilding The Data Warehouse by Bill Inmon (affiliate link)dataops-testgen, dataops-observabilityFree Data Quality and Data Observability CertificationDatabricksDORA MetricsDORA for dataThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Achieving Data Reliability: The Role of Data Contracts in Modern Data Management

2024-07-28 · Data Engineering Podcast Listen

podcast_episode

by Tom Baeyens (Soda Data) , Tobias Macey

AI/ML API Cloud Computing Data Contracts Data Engineering Data Lakehouse Data Management Data Quality dbt Delta GenAI Hive +3 more

Summary Data contracts are both an enforcement mechanism for data quality, and a promise to downstream consumers. In this episode Tom Baeyens returns to discuss the purpose and scope of data contracts, emphasizing their importance in achieving reliable analytical data and preventing issues before they arise. He explains how data contracts can be used to enforce guarantees and requirements, and how they fit into the broader context of data observability and quality monitoring. The discussion also covers the challenges and benefits of implementing data contracts, the organizational impact, and the potential for standardization in the field.

Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementData lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.At Outshift, the incubation engine from Cisco, they are driving innovation in AI, cloud, and quantum technologies with the powerful combination of enterprise strength and startup agility. Their latest innovation for the AI ecosystem is Motific, addressing a critical gap in going from prototype to production with generative AI. Motific is your vendor and model-agnostic platform for building safe, trustworthy, and cost-effective generative AI solutions in days instead of months. Motific provides easy integration with your organizational data, combined with advanced, customizable policy controls and observability to help ensure compliance throughout the entire process. Move beyond the constraints of traditional AI implementation and ensure your projects are launched quickly and with a firm foundation of trust and efficiency. Go to motific.ai today to learn more!Your host is Tobias Macey and today I'm interviewing Tom Baeyens about using data contracts to build a clearer API for your dataInterview IntroductionHow did you get involved in the area of data management?Can you describe the scope and purpose of data contracts in the context of this conversation?In what way(s) do they differ from data quality/data observability?Data contracts are also known as the API for data, can you elaborate on this?What are the types of guarantees and requirements that you can enforce with these data contracts?What are some examples of constraints or guarantees that cannot be represented in these contracts?Are data contracts related to the shift-left?Data contracts are also known as the API for data, can you elaborate on this?The obvious application of data contracts are in the context of pipeline execution flows to prevent failing checks from propagating further in the data flow. What are some of the other ways that these contracts can be integrated into an organization's data ecosystem?How did you approach the design of the syntax and implementation for Soda's data contracts?Guarantees and constraints around data in different contexts have been implemented in numerous tools and systems. What are the areas of overlap in e.g. dbt, great expectations?Are there any emerging standards or design patterns around data contracts/guarantees that will help encourage portability and integration across tooling/platform contexts?What are the most interesting, innovative, or unexpected ways that you have seen data contracts used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on data contracts at Soda?When are data contracts the wrong choice?What do you have planned for the future of data contracts?Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.Links SodaPodcast EpisodeJBossData ContractAirflowUnit TestingIntegration TestingOpenAPIGraphQLCircuit Breaker PatternSodaCLSoda Data ContractsData MeshGreat Expectationsdbt Unit TestsOpen Data ContractsODCS == Open Data Contract StandardODPS == Open Data Product SpecificationThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

How Generative AI Is Impacting Data Engineering Teams

2024-07-21 · Data Engineering Podcast Listen

podcast_episode

by Tobias Macey , Lior Gavish (Monte Carlo)

AI/ML Data Engineering Data Lakehouse Data Management Delta GenAI Hive Iceberg LLM Monte Carlo Python RAG +2 more

Summary Generative AI has rapidly gained adoption for numerous use cases. To support those applications, organizational data platforms need to add new features and data teams have increased responsibility. In this episode Lior Gavish, co-founder of Monte Carlo, discusses the various ways that data teams are evolving to support AI powered features and how they are incorporating AI into their work. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementData lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.Your host is Tobias Macey and today I'm interviewing Lior Gavish about the impact of AI on data engineersInterview IntroductionHow did you get involved in the area of data management?Can you start by clarifying what we are discussing when we say "AI"?Previous generations of machine learning (e.g. deep learning, reinforcement learning, etc.) required new features in the data platform. What new demands is the current generation of AI introducing?Generative AI also has the potential to be incorporated in the creation/execution of data pipelines. What are the risk/reward tradeoffs that you have seen in practice?What are the areas where LLMs have proven useful/effective in data engineering?Vector embeddings have rapidly become a ubiquitous data format as a result of the growth in retrieval augmented generation (RAG) for AI applications. What are the end-to-end operational requirements to support this use case effectively?As with all data, the reliability and quality of the vectors will impact the viability of the AI application. What are the different failure modes/quality metrics/error conditions that they are subject to?As much as vectors, vector databases, RAG, etc. seem exotic and new, it is all ultimately shades of the same work that we have been doing for years. What are the areas of overlap in the work required for running the current generation of AI, and what are the areas where it diverges?What new skills do data teams need to acquire to be effective in supporting AI applications?What are the most interesting, innovative, or unexpected ways that you have seen AI impact data engineering teams?What are the most interesting, unexpected, or challenging lessons that you have learned while working with the current generation of AI?When is AI the wrong choice?What are your predictions for the future impact of AI on data engineering teams?Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your Links Monte CarloPodcast EpisodeNLP == Natural Language ProcessingLarge Language ModelsGenerative AIMLOpsML EngineerFeature StoreRetrieval Augmented Generation (RAG)LangchainThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

The Role of Product Managers in Data-Centric Organizations

2024-07-13 · Data Engineering Podcast Listen

podcast_episode

by Praveen Gujar (LinkedIn) , Tobias Macey

AI/ML Cloud Computing Data Engineering Data Lakehouse Data Management Delta Hive Iceberg Python Trino

Summary In this episode Praveen Gujar, Director of Product at LinkedIn, talks about the intricacies of product management for data and analytical platforms. Praveen shares his journey from Amazon to Twitter and now LinkedIn, highlighting his extensive experience in building data products and platforms, digital advertising, AI, and cloud services. He discusses the evolving role of product managers in data-centric environments, emphasizing the importance of clean, reliable, and compliant data. Praveen also delves into the challenges of building scalable data platforms, the need for organizational and cultural alignment, and the critical role of product managers in bridging the gap between engineering and business teams. He provides insights into the complexities of platformization, the significance of long-term planning, and the necessity of having a strong relationship with engineering teams. The episode concludes with Praveen offering advice for aspiring product managers and discussing the future of data management in the context of AI and regulatory compliance.

Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementData lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.Your host is Tobias Macey and today I'm interviewing Praveen Gujar about product management for data and analytical platformsInterview IntroductionHow did you get involved in the area of data management?Product management is typically thought of as being oriented toward customer facing functionality and features. What is involved in being a product manager for data systems?Many data-oriented products that are customer facing require substantial technical capacity to serve those use cases. How does that influence the process of determining what features to provide/create?investment in technical capacity/platformsidentifying groupings of features that can be served by a common platform investmentmanaging organizational pressures between engineering, product, business, finance, etc.What are the most interesting, innovative, or unexpected ways that you have seen "Data Products & Platforms @ Big-tech" used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on "Building Data Products & Platforms for Big-tech"?When is "Data Products & Platforms @ Big-tech" the wrong choice?What do you have planned for the future of "Data Products & Platforms @ Big-tech"?Contact Info LinkedInWebsiteParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.Links DataHubPodcast EpisodeRAG == Retrieval Augmented GenerationThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Neon: A Serverless And Developer Friendly Postgres

2024-07-08 · Data Engineering Podcast Listen

podcast_episode

by Nikita Shamgunov (Neon) , Tobias Macey

AI/ML Cloud Computing Data Engineering Data Lakehouse Data Management Delta Hive Iceberg Python Trino postgresql

Summary Postgres is one of the most widely respected and liked database engines ever. To make it even easier to use for developers to use, Nikita Shamgunov decided to makee it serverless, so that it can scale from zero to infinity. In this episode he explains the engineering involved to make that possible, as well as the numerous details that he and his team are packing into the Neon service to make it even more attractive for anyone who wants to build on top of Postgres. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementData lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.Your host is Tobias Macey and today I'm interviewing Nikita Shamgunov about his work on making Postgres a serverless database at Neon.Interview IntroductionHow did you get involved in the area of data management?Can you describe what Neon is and the story behind it?The ecosystem around Postgres is large and varied. What are the pain points that you are trying to address with Neon? What does it mean for a database to be serverless?What kinds of products and services are unlocked by making Postgres a serverless database?How does your vision for Neon compare/contrast with what you know of PlanetScale?Postgres is known for having a large ecosystem of plugins that add a lot of interesting and useful features, but the storage layer has not been as easily extensible historically. How have architectural changes in recent Postgres releases enabled your work on Neon?What are the core pieces of engineering that you have had to complete to make Neon possible?How have the design and goals of the project evolved since you first started working on it?The separation of storage and compute is one of the most fundamental promises of the cloud. What new capabilities does that enable in Postgres?How does the branching functionality change the ways that development teams are able to deliver and debug features?Because the storage is now a networked system, what new performance/latency challenges does that introduce? How have you addressed them in Neon?Anyone who has ever operated a Postgres instance has had to tackle the upgrade process. How does Neon address that process for end users?The rampant growth of AI has touched almost every aspect of computing, and Postgres is no exception. How does the introduction of pgvector and semantic/similarity search functionality impact the adoption and usage patterns of Postgres/Neon?What new challenges does that introduce for you as an operator and business owner?What are the lessons that you learned from MemSQL/SingleStore that have been most helpful in your work at Neon?What are the most interesting, innovative, or unexpected ways that you have seen Neon used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on Neon?When is Neon the wrong choice? Postgres?What do you have planned for the future of Neon?Contact Info @nikitabase on TwitterLinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.Links NeonPostgreSQLNeon GithubPHPMySQLSQL ServerSingleStorePodcast EpisodeAWS AuroraKhosla VenturesYugabyteDBPodcast EpisodeCockroachDBPodcast EpisodePlanetScalePodcast EpisodeClickhousePodcast EpisodeDuckDBPodcast EpisodeWAL == Write-Ahead LogPgBouncerPureStoragePaxos)HNSW IndexIVF Flat IndexRAG == Retrieval Augmented GenerationAlloyDBNeon Serverless DriverDevinmagic.devThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Optimizing Critical Operations: Enhancing Robinhood's Workflow Journey with Airflow

2024-07-01 · Airflow Summit 2024

session

by Palanieppan Muthiah , Peiqiu Tian , Kevin Wang

Airflow Analytics Kubernetes

Airflow is widely used within Robinhood. In addition to traditional offline analytics use cases (to schedule ingestion and analytics workloads that populate our data lake), we also use Airflow in our backend services to orchestrate various workflows that are highly critical for the business, e.g: compliance and regulatory reporting, user facing reports and more. As part of this, we have evolved what we believe is a unique deployment architecture for Airflow. We have central schedulers that are responsible for workloads from multiple different teams, but the workflow tasks themselves run on workers owned by respective teams that are highly coupled with their backend services and codebase. Furthermore, Robinhood augmented Airflow with a bunch of customizations — airflow worker template for Kubernetes, enhanced observability, enhanced SLA detection, and a collection of operators, sensors, and plugins to tailor Airflow to their exact needs. This session is going to walk through how we grew our architecture and adapted Airflow to fit Robinhood’s variety of needs and use cases.

Improve Data Quality Through Engineering Rigor And Business Engagement With Synq

2024-06-30 · Data Engineering Podcast Listen

podcast_episode

by Petr Janda (SYNQ) , Tobias Macey

AI/ML Data Engineering Data Governance Data Lakehouse Data Management Data Quality Delta Hive Iceberg Python Trino

Summary This episode features an insightful conversation with Petr Janda, the CEO and founder of Synq. Petr shares his journey from being an engineer to founding Synq, emphasizing the importance of treating data systems with the same rigor as engineering systems. He discusses the challenges and solutions in data reliability, including the need for transparency and ownership in data systems. Synq's platform helps data teams manage incidents, understand data dependencies, and ensure data quality by providing insights and automation capabilities. Petr emphasizes the need for a holistic approach to data reliability, integrating data systems into broader business processes. He highlights the role of data teams in modern organizations and how Synq is empowering them to achieve this. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementData lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.Your host is Tobias Macey and today I'm interviewing Petr Janda about Synq, a data reliability platform focused on leveling up data teams by supporting a culture of engineering rigorInterview IntroductionHow did you get involved in the area of data management?Can you describe what Synq is and the story behind it? Data observability/reliability is a category that grew rapidly over the past ~5 years and has several vendors focused on different elements of the problem. What are the capabilities that you saw as lacking in the ecosystem which you are looking to address?Operational/infrastructure engineers have spent the past decade honing their approach to incident management and uptime commitments. How do those concepts map to the responsibilities and workflows of data teams? Tooling only plays a small part in SLAs and incident management. How does Synq help to support the cultural transformation that is necessary?What does an on-call rotation for a data engineer/data platform engineer look like as compared with an application-focused team?How does the focus on data assets/data products shift your approach to observability as compared to a table/pipeline centric approach?With the focus on sharing ownership beyond the boundaries on the data team there is a strong correlation with data governance principles. How do you see organizations incorporating Synq into their approach to data governance/compliance?Can you describe how Synq is designed/implemented? How have the scope and goals of the product changed since you first started working on it?For a team who is onboarding onto Synq, what are the steps required to get it integrated into their technology stack and workflows?What are the types of incidents/errors that you are able to identify and alert on? What does a typical incident/error resolution process look like with Synq?What are the most interesting, innovative, or unexpected ways that you have seen Synq used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on Synq?When is Synq the wrong choice?What do you have planned for the future of Synq?Contact Info LinkedInSubstackParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.Links SynqIncident ManagementSLA == Service Level AgreementData GovernancePodcast EpisodePagerDutyOpsGenieClickhousePodcast EpisodedbtPodcast EpisodeSQLMeshPodcast EpisodeThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

talk-data.com

Data Lake

Activity Trend

Top Events

Top Speakers

Accelerated Computing in Modern Data Centers With Datapelago

AWS re:Invent 2024 - [NEW LAUNCH] Amazon SageMaker Lakehouse: Accelerate analytics & AI (ANT354-NEW)

AWSreInvent #AWSreInvent2024

AWS re:Invent 2024 - A practitioner’s guide to data for generative AI (DAT319)

AWSreInvent #AWSreInvent2024

Data Engineering with AWS Cookbook

Ingest govern and secure your data with OneLake | BRK201

MSIgnite

Building a Data Mesh

Bring Vector Search And Storage To The Data Lake With Lance

The Role of Python in Shaping the Future of Data Platforms with DLT

The Inevitable Shift to Modern Data Lake Architecture

Lessons Learnt From Supporting Modern Data Lake Formats at Snowplow

Self Service Analytics - Open Your Data Lake to Ad-Hoc Queries

Self Service Analytics - Open Your Data Lake to Ad-Hoc Queries

#252: The Ever-Shifting Operating Environment of the Data Professional

The Evolution of DataOps: Insights from DataKitchen's CEO

Achieving Data Reliability: The Role of Data Contracts in Modern Data Management

How Generative AI Is Impacting Data Engineering Teams

The Role of Product Managers in Data-Centric Organizations

Neon: A Serverless And Developer Friendly Postgres

Optimizing Critical Operations: Enhancing Robinhood's Workflow Journey with Airflow

Improve Data Quality Through Engineering Rigor And Business Engagement With Synq