Databricks

Aligning Business and Data: The Essential Role of Data Modeling

2025-09-01 · Data Engineering Podcast Listen

podcast_episode

by Serge Gershkovich (SqlDBM) , Tobias Macey

AI/ML Analytics Cloud Computing Data Engineering Data Lakehouse Data Management Data Modelling Data Vault Datafold DWH LLM Snowflake

Summary

In this episode of the Data Engineering Podcast, Serge Gershkovich, head of product at SqlDBM, talks about the sociotechnical aspects of data modeling. Serge shares his background in data modeling and highlights its importance as a collaborative process between business stakeholders and data teams. He debunks common misconceptions that data modeling is optional or secondary, emphasizing its crucial role in ensuring alignment between business requirements and data structures. The conversation covers challenges in complex environments, the impact of technical decisions on data strategy, and the evolving role of AI in data management. Serge stresses the need for business stakeholders' involvement in data initiatives and a systematic approach to data modeling, warning against relying solely on technical expertise without considering business alignment.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

Enterprises today face an enormous challenge: they’re investing billions into Snowflake and Databricks, but without strong foundations, those investments risk becoming fragmented, expensive, and hard to govern. And that’s especially evident in large, complex enterprise data environments. That’s why companies like DirecTV and Pfizer rely on SqlDBM. Data modeling may be one of the most traditional practices in IT, but it remains the backbone of enterprise data strategy. In today’s cloud era, that backbone needs a modern approach built natively for the cloud, with direct connections to the very platforms driving your business forward. Without strong modeling, data management becomes chaotic, analytics lose trust, and AI initiatives fail to scale. SqlDBM ensures enterprises don’t just move to the cloud; they maximize their ROI by creating governed, scalable, and business-aligned data environments. If global enterprises are using SqlDBM to tackle the biggest challenges in data management, analytics, and AI, isn’t it worth exploring what it can do for yours? Visit dataengineeringpodcast.com/sqldbm to learn more.

Your host is Tobias Macey and today I'm interviewing Serge Gershkovich about how and why data modeling is a sociotechnical endeavor.

Interview

Introduction

How did you get involved in the area of data management?

Can you start by describing the activities that you think of when someone says the term "data modeling"?

What are the main groupings of incomplete or inaccurate definitions that you typically encounter in conversation on the topic?

How do those conceptions of the problem lead to challenges and bottlenecks in execution?

Data modeling is often associated with data warehouse design, but it also extends to source systems and unstructured/semi-structured assets. How does the inclusion of other data localities help in the overall success of a data/domain modeling effort?

Another aspect of data modeling that often consumes a substantial amount of debate is which pattern to adhere to (star/snowflake, data vault, one big table, anchor modeling, etc.). What are some of the ways that you have found effective to remove that as a stumbling block when first developing an organizational domain representation?

While the overall purpose of data modeling is to provide a digital representation of the business processes, there are inevitable technical decisions to be made. What are the most significant ways that the underlying technical systems can help or hinder the goals of building a digital twin of the business?

What impact (positive and negative) are you seeing from the introduction of LLMs into the workflow of data modeling?

How does tool use (e.g. MCP connection to warehouse/lakehouse) help when developing the transformation logic for achieving a given domain representation?

What are the most interesting, innovative, or unexpected ways that you have seen organizations address the data modeling lifecycle?

What are the most interesting, unexpected, or challenging lessons that you have learned while working with organizations implementing a data modeling effort?

What are the overall trends in the ecosystem that you are monitoring related to data modeling practices?

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

SqlDBM
SAP
Joe Reis
ERD == Entity Relationship Diagram
Master Data Management
dbt
Data Contracts
Data Modeling With Snowflake book by Serge (affiliate link)
Type 2 Dimension
Data Vault
Star Schema
Anchor Modeling
Ralph Kimball
Bill Inmon
Sixth Normal Form
MCP == Model Context Protocol

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Leveling Up: Analytics in the Gaming Industry (w/ Carly Taylor)

2025-07-18 · Mavens of Data Listen

podcast_episode

by Carly Taylor (ggAI)

Analytics Data Analytics KPI

In this episode, we'll chat with Carly Taylor, Field CTO of Gaming at Databricks, to explore the fascinating world of data analytics in the gaming industry, where every click, quest, and respawn generates insights that shape the games we love. Carly shares her experience working in gaming to help harness data for better gameplay and smarter monetization. She'll break down what analysts, data scientists, and sales engineers actually do in gaming and how teams turn raw data into real-time decisions. Whether you're a player, a data nerd, or someone who wants to turn both into a career, this episode is your walkthrough guide to data in gaming.

What You'll Learn:
How gaming companies use data to optimize player experience and business outcomes
What it's like to work in a field engineering or customer-facing analyst role
The tools, KPIs, and best practices for success
How to break into a data role in gaming and what skills to focus on

Stay updated with Carly's latest by subscribing to her Substack

Register for free to be part of the next live session: https://bit.ly/3XB3A8b

Follow us on Socials: LinkedIn YouTube Instagram (Mavens of Data) Instagram (Maven Analytics) TikTok Facebook Medium X/Twitter

Data Engineering Central Podcast - Episode 8

2025-07-10 · Data Engineering Central Podcast Listen

podcast_episode

Data Engineering DuckDB Iceberg SQL

This is a free preview of a paid episode. To hear more, visit dataengineeringcentral.substack.com

Hello! A new episode of the Data Engineering Central Podcast is dropping today, and we will be covering a few hot topics!
* Apache Iceberg Catalogs
* new Boring Catalog
* new full Iceberg support from Databricks/Unity Catalog
* Databricks SQL Scripting
* DuckDB coming to a Lake House near you
* Lakebase from Databricks

Going to be a great show, come along for the ride! Thanks …

Robin Sutara on Responsible AI, Governance, Diversity, and People Behind Data

2025-05-23 · Future of Data and AI Listen

podcast_episode

by Robin Sutara (Databricks)

AI/ML Data Lakehouse Microsoft RAG

🎙️ Future of Data and AI Podcast: Episode 06 with Robin Sutara

What do Apache, Excel, Microsoft, and Databricks have in common? Robin Sutara! From being a technician for Apache helicopters to leading global data strategy at Microsoft and now Databricks, Robin Sutara’s journey is anything but ordinary. In this episode, she shares how enterprises are adopting AI in practical, secure, and responsible ways without getting lost in the hype.

We dive into how Databricks is evolving beyond the Lakehouse to power the next wave of enterprise AI, supporting custom models, Retrieval-Augmented Generation (RAG), and compound AI systems that balance innovation with governance, transparency, and risk management.

Robin also breaks down the real challenges to AI adoption, which are not technical but cultural. She explains why companies must invest in change management, empower non-technical teams, and embrace diverse perspectives to make AI truly work at scale. Her take on job evolution, bias in AI, and the human side of automation is both refreshing and deeply relevant.

A sharp, insightful conversation for anyone building or scaling AI inside the enterprise, especially in regulated industries where trust and explainability matter as much as innovation.

From Data Discovery to AI: The Evolution of Semantic Layers

2025-05-21 · Data Engineering Podcast Listen

podcast_episode

by Shinji Kim (Select Star) , Tobias Macey

AI/ML BI Data Engineering Data Management Datafold DWH LLM Modern Data Stack Python Snowflake

Summary

In this episode of the Data Engineering Podcast, host Tobias Macey welcomes back Shinji Kim to discuss the evolving role of semantic layers in the era of AI. As they explore the challenges of managing vast data ecosystems and providing context to data users, they delve into the significance of semantic layers for AI applications. They dive into the nuances of semantic modeling, the impact of AI on data accessibility, and the importance of business logic in semantic models. Shinji shares her insights on how Select Star is helping teams navigate these complexities, and together they cover the future of semantic modeling as a native construct in data systems. Join them for an in-depth conversation on the evolving landscape of data engineering and its intersection with AI.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

Your host is Tobias Macey and today I'm interviewing Shinji Kim about the role of semantic layers in the era of AI.

Interview

Introduction

How did you get involved in the area of data management?

Semantic modeling gained a lot of attention ~4-5 years ago in the context of the "modern data stack". What is your motivation for revisiting that topic today?

There are several overlapping concepts – "semantic layer," "metrics layer," "headless BI." How do you define these terms, and what are the key distinctions and overlaps?

Do you see these concepts converging, or do they serve distinct long-term purposes?

Data warehousing and business intelligence have been around for decades now. What new value does semantic modeling provide beyond practices like star schemas, OLAP cubes, etc.?

What benefits does a semantic model provide when integrating your data platform into AI use cases?

How does using AI as an interface to your analytical use cases differ from powering customer-facing AI applications with your data?

The effort required to create and maintain a set of semantic models is non-zero. What role can LLMs play in helping to propose and construct those models?

For teams who have already invested in building this capability, what additional context and metadata is necessary to provide guidance to LLMs when working with their models?

What's the most effective way to create a semantic layer without turning it into a massive project?

There are several technologies available for building and serving these models. What are the selection criteria that you recommend for teams who are starting down this path?

What are the most interesting, innovative, or unexpected ways that you have seen semantic models used?

What are the most interesting, unexpected, or challenging lessons that you have learned while working with semantic modeling?

When is semantic modeling the wrong choice?

What do you predict for the future of semantic modeling?

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.

Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.

If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links

SelectStar
Sun Microsystems
Markov Chain Monte Carlo
Semantic Modeling
Semantic Layer
Metrics Layer
Headless BI
Cube
Podcast Episode
AtScale
Star Schema
Data Vault
OLAP Cube
RAG == Retrieval Augmented Generation
AI Engineering Podcast Episode
KNN == K-Nearest Neighbors
HNSW == Hierarchical Navigable Small World
dbt Metrics Layer
Soda Data
LookML
Hex
PowerBI
Tableau
Semantic View (Snowflake)
Databricks Genie
Snowflake Cortex Analyst
Malloy

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

#295 How To Get Hired As A Data Or AI Engineer with Deepak Goyal, CEO & Founder at Azurelib Academy

2025-03-31 · DataFramed Listen

podcast_episode

by Deepak Goyal (Azurelib Academy) , Richie (DataCamp)

AI/ML Azure Cloud Computing Data Engineering GenAI Microsoft

The role of data and AI engineers is more critical than ever. With organizations collecting massive amounts of data, the challenge lies in building efficient data infrastructures that can support AI systems and deliver actionable insights. But what does it take to become a successful data or AI engineer? How do you navigate the complex landscape of data tools and technologies? And what are the key skills and strategies needed to excel in this field?

Deepak Goyal is a globally recognized authority in Cloud Data Engineering and AI. As the Founder & CEO of Azurelib Academy, he has built a trusted platform for advanced cloud education, empowering over 100,000 professionals and influencing data strategies across Fortune 500 companies. With over 17 years of leadership experience, Deepak has been at the forefront of designing and implementing scalable, real-world data solutions using cutting-edge technologies like Microsoft Azure, Databricks, and Generative AI.

In the episode, Richie and Deepak explore the fundamentals of data engineering, the critical skills needed, the intersection with AI roles, career paths, and essential soft skills. They also discuss the hiring process, interview tips, and the importance of continuous learning in a rapidly evolving field, and much more.

Links Mentioned in the Show:
AzureLib
AzureLib Academy
Connect with Deepak
Get Certified! Azure Fundamentals
Related Episode: Effective Data Engineering with Liya Aizenberg, Director of Data Engineering at Away
Sign up to attend RADAR: Skills Edition

New to DataCamp? Learn on the go using the DataCamp mobile app

Empower your business with world-class data and AI skills with DataCamp for business

Serhii Sokolenko: Building a Python-First Compute Platform for Data Engineers

2025-03-17 · Straight Data Talk Listen

podcast_episode

by Yuliia Tkachova (Masthead Data) , Serhii Sokolenko (Tower Dev)

AI/ML Cloud Computing Data Engineering GCP Python Snowflake

Serhii Sokolenko, founder at Tower Dev and former product manager at tech giants like Google Cloud, Snowflake, and Databricks, joined Yuliia to discuss his journey building a next-generation compute platform. Tower Dev aims to simplify data processing for data engineers who work with Python. Serhii explains how Tower addresses three key market trends: the integration of data engineering with AI through Python, the movement away from complex distributed processing frameworks, and users' desire for flexibility across different data platforms. He explains how Tower makes Python data applications more accessible by eliminating the need to learn complex frameworks while automatically scaling infrastructure. Serhii also shares his perspective on the future of data engineering, noting the ways in which AI will transform the profession.

Tower Dev - https://tower.dev/

Serhii's LinkedIn - https://www.linkedin.com/in/ssokolenko/

Data Engineering Central Podcast - 06

2025-02-13 · Data Engineering Central Podcast Listen

podcast_episode

AWS AWS Lambda Data Engineering Data Quality Delta DuckDB Iceberg IaC Polars Terraform

It’s time for another episode of the Data Engineering Central Podcast. In this episode, we cover:
* AWS Lambda + DuckDB and Delta Lake (Polars, Daft, etc).
* IaC - Long Live Terraform.
* Databricks Data Quality with DQX.
* Unity Catalog releases for DuckDB and Polars
* Bespoke vs Managed Data Platforms
* Delta Lake vs. Iceberg and UniForm for a single table.

Thanks for b…

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit dataengineeringcentral.substack.com/subscribe

Data Engineering Central Podcast - 05

2024-12-20 · Data Engineering Central Podcast Listen

podcast_episode

by Daniel Beach

AWS Data Contracts Data Engineering S3

In today's episode of the Data Engineering Central Podcast we talk about a few hot topics: AWS S3 Tables, Databricks raising money, are Data Contracts dead, and the Lake House storage format battle! It's a good one, buckle up!

Data Engineering Central Podcast - 04

2024-11-20 · Data Engineering Central Podcast Listen

podcast_episode

by Daniel Beach

Airflow Data Engineering DuckDB Polars

It’s time for another episode of the Data Engineering Central Podcast. In this episode we cover:
* Apache Airflow vs Databricks Workflows
* End-of-Year Engineering Planning for 2025
* 10 Billion Row Challenge with DuckDB vs Daft vs Polars
* Raw Data Ingestion

As usual, the full episode is available to paid subscribers, and a shortened version to you freeloaders out there. Don’t worry, I still love you though.

Streaming Data Into The Lakehouse With Iceberg And Trino At Going

2024-11-18 · Data Engineering Podcast Listen

podcast_episode

by Ken Pickering (Going) , Tobias Macey

Data Lakehouse Data Quality Iceberg Data Streaming Trino

In this episode, I had the pleasure of speaking with Ken Pickering, VP of Engineering at Going, about the intricacies of streaming data into a Trino and Iceberg lakehouse. Ken shared his journey from product engineering to becoming deeply involved in data-centric roles, highlighting his experiences in ecommerce and InsurTech. At Going, Ken leads the data platform team, focusing on finding travel deals for consumers, a task that involves handling massive volumes of flight data and event stream information.

Ken explained the dual approach of passive and active search strategies used by Going to manage the vast data landscape. Passive search involves aggregating data from global distribution systems, while active search is more transactional, querying specific flight prices. This approach helps Going sift through approximately 50 petabytes of data annually to identify the best travel deals.

We delved into the technical architecture supporting these operations, including the use of Confluent for data streaming, Starburst Galaxy for transformation, and Databricks for modeling. Ken emphasized the importance of an open lakehouse architecture, which allows for flexibility and scalability as the business grows.

Ken also discussed the composition of Going's engineering and data teams, highlighting the collaborative nature of their work and the reliance on vendor tooling to streamline operations. He shared insights into the challenges and strategies of managing data life cycles, ensuring data quality, and maintaining uptime for consumer-facing applications.

Throughout our conversation, Ken provided a glimpse into the future of Going's data architecture, including potential expansions into other travel modes and the integration of large language models for enhanced customer interaction. This episode offers a comprehensive look at the complexities and innovations in building a data-driven travel advisory service.

Data Engineering Central Podcast - 03

2024-10-16 · Data Engineering Central Podcast Listen

podcast_episode

by Daniel Beach

Data Engineering Delta DuckDB Polars Spark

It’s time for another episode of Data Engineering Central Podcast, our third one! Topics in this episode:
* Should you use DuckDB or Polars?
* Small Engineering Changes (PR Reviews)
* Daft vs Spark on Databricks with Unity Catalog (Delta Lake)
* Primary and Foreign keys in the Lake House

Enjoy!

#252 Is Big Data Dead? MotherDuck and the Small Data Manifesto with Ryan Boyd Co-Founder at MotherDuck

2024-10-14 · DataFramed Listen

podcast_episode

by Ryan Boyd (MotherDuck) , Richie (DataCamp)

AI/ML Big Data Data Engineering Data Science Marketing Motherduck Neo4j SQL

Businesses are collecting more data than ever before. But is bigger always better? Many companies are starting to question whether massive datasets and complex infrastructure are truly delivering results or just adding unnecessary costs and complications. How can you make sure your data strategy is aligned with your actual needs? What if focusing on smaller, more manageable datasets could improve your efficiency and save resources, all while delivering the same insights?

Ryan Boyd is the Co-Founder & VP, Marketing + DevRel at MotherDuck. Ryan started his career as a software engineer, but has since led DevRel teams for 15+ years at Google, Databricks and Neo4j, where he developed and executed numerous marketing and DevRel programs. Prior to MotherDuck, Ryan worked at Databricks and focussed the team on building an online community during the pandemic, helping to organize the content and experience for an online Data + AI Summit, establishing a regular cadence of video and blog content, launching the Databricks Beacons ambassador program, improving the time to an “aha” moment in the online trial and launching a University Alliance program to help professors teach the latest in data science, machine learning and data engineering.

In the episode, Richie and Ryan explore data growth and computation, the data 1%, the small data movement, data storage and usage, the shift to local and hybrid computing, modern data tools, the challenges of big data, transactional vs analytical databases, SQL language enhancements, simple and ergonomic data solutions and much more.

Links Mentioned in the Show:
MotherDuck
The Small Data Manifesto
Connect with Ryan
Small Data SF conference
Related Episode: Effective Data Engineering with Liya Aizenberg, Director of Data Engineering at Away
Rewatch sessions from RADAR: AI Edition

New to DataCamp? Learn on the go using the DataCamp mobile app

Empower your business with world-class data and AI skills with DataCamp for business

Data Engineering Central Podcast

2024-09-17 · Data Engineering Central Podcast Listen

podcast_episode

Data Engineering Snowflake Spark

Welcome to the Data Engineering Central Podcast, a no-holds-barred discussion on the Data Landscape. Welcome to Episode 01. In today’s episode we will talk about the following topics from the Data Engineering perspective:
* Snowflake vs Databricks.
* Is Apache Spark being replaced??
* Notebooks in Production. Bad.

#234 High Performance Generative AI Applications with Ram Sriharsha, CTO at Pinecone

2024-08-12 · DataFramed Listen

podcast_episode

by Ram Sriharsha (Pinecone) , Richie (DataCamp)

AI/ML GenAI LLM Pinecone RAG Spark Splunk Vector DB

Perhaps the biggest complaint about generative AI is hallucination. If the text you want to generate involves facts, for example, a chatbot that answers questions, then hallucination is a problem. The solution to this is to make use of a technique called retrieval augmented generation, where you store facts in a vector database and retrieve the most appropriate ones to send to the large language model to help it give accurate responses. So, what goes into building vector databases and how do they improve LLM performance so much?

Ram Sriharsha is currently the CTO at Pinecone. Before this role, he was the Director of Engineering at Pinecone and previously served as Vice President of Engineering at Splunk. He also worked as a Product Manager at Databricks. With a long history in the software development industry, Ram has held positions as an architect, lead product developer, and senior software engineer at various companies. Ram is also a long-time contributor to Apache Spark.

In the episode, Richie and Ram explore common use-cases for vector databases, RAG in chatbots, steps to create a chatbot, static vs dynamic data, testing chatbot success, handling dynamic data, choosing language models, knowledge graphs, implementing vector databases, innovations in vector databases, the future of LLMs and much more.

Links Mentioned in the Show:
Pinecone
Webinar - Charting the Path: What the Future Holds for Generative AI
Course - Vector Databases for Embeddings with Pinecone
Related Episode: The Power of Vector Databases and Semantic Search with Elan Dekel, VP of Product at Pinecone
Rewatch sessions from RADAR: AI Edition

New to DataCamp? Learn on the go using the DataCamp mobile app

Empower your business with world-class data and AI skills with DataCamp for business
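
The retrieval augmented generation flow described in this episode summary (store facts, retrieve the closest ones, send them to the model along with the question) can be sketched roughly as follows. This is only an illustration under stated assumptions, not Pinecone's API and not anything shown in the episode: `embed` is a toy bag-of-words stand-in for a real embedding model, `ToyVectorStore` and `build_prompt` are hypothetical names, and no language model is actually called.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding (assumption): word counts instead of a learned vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Hypothetical in-memory stand-in for a vector database."""

    def __init__(self):
        self.items = []  # list of (fact, vector) pairs

    def add(self, fact):
        self.items.append((fact, embed(fact)))

    def search(self, query, k=2):
        """Return the k stored facts most similar to the query."""
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [fact for fact, _ in ranked[:k]]


def build_prompt(question, facts):
    """Assemble the text that would be sent to a language model."""
    context = "\n".join(f"- {fact}" for fact in facts)
    return f"Answer using only these facts:\n{context}\n\nQuestion: {question}"


store = ToyVectorStore()
store.add("Pinecone is a vector database.")
store.add("Apache Spark is a distributed data processing engine.")
store.add("Retrieval augmented generation sends retrieved facts to the model.")

question = "What is Pinecone?"
top_facts = store.search(question, k=1)
prompt = build_prompt(question, top_facts)
print(prompt)
```

In a real system the word-count vectors would be replaced by embeddings from a model and the list scan by an approximate nearest-neighbor index; the store, retrieve, then prompt shape stays the same.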

#55 Can AI Predict Euro 2024 Winners?

2024-06-20 · DataTopics: All Things Data, AI & Tech Listen

podcast_episode

by Vitale , David

AI/ML Delta GitHub Iceberg Microsoft Snowflake

Welcome to the cozy corner of the tech world where ones and zeros mingle with casual chit-chat. Datatopics Unplugged is your go-to spot for relaxed discussions around tech, news, data, and society. Dive into conversations that should flow as smoothly as your morning coffee (but don’t), where industry insights meet laid-back banter. Whether you’re a data aficionado or just someone curious about the digital age, pull up a chair, relax, and let’s get into the heart of data, unplugged style!

In this episode, join us along with guests Vitale and David as we explore:
Euro 2024 Predictions with AI: Using Snowflake's machine learning models for data-driven predictions and sharing our own predictions. Can animals predict wins better than ML models?
Tech in football: From VAR to connected ball technology, is it all a good idea?
Nvidia overtaking Apple and Microsoft as the biggest tech corporation? Discussing Nvidia's leap to surpass Apple and Microsoft, and the implications for the GPU market and AI development.
Unity Catalog vs. Polaris: Comparing Unity+Delta with Polaris+Iceberg and their roles in data cataloging and management. Explore the details on GitHub Unity Catalog, YouTube, and insights on LinkedIn.
Databricks Data and AI Summit recap: Discussing the biggest announcements from the summit, including Mosaic AI integration, serverless options, and the open-source Unity Catalog.
Exploring BM25: Discussing the BM25 algorithm and its advancements over traditional TF-IDF for document classification.

Being Data Driven At Stripe With Trino And Iceberg

2024-06-16 · Data Engineering Podcast Listen

podcast_episode

by Kevin Liu (Stripe) , Tobias Macey

AI/ML Flink Athena AWS Data Engineering Data Lake Data Lakehouse Data Management Delta Hive Hudi Iceberg

Summary

Stripe is a company that relies on data to power their products and business. To support that functionality they have invested in Trino and Iceberg for their analytical workloads. In this episode Kevin Liu shares some of the interesting features that they have built by combining those technologies, as well as the challenges that they face in supporting the myriad workloads that are thrown at this layer of their data platform.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

Your host is Tobias Macey and today I'm interviewing Kevin Liu about his use of Trino and Iceberg for Stripe's data lakehouse.

Interview

Introduction

How did you get involved in the area of data management?

Can you describe what role Trino and Iceberg play in Stripe's data architecture?

What are the ways in which your job responsibilities intersect with Stripe's lakehouse infrastructure?

What were the requirements and selection criteria that led to the selection of that combination of technologies?

What are the other systems that feed into and rely on the Trino/Iceberg service?

What kinds of questions are you answering with table metadata?

What use case/team does that support?

Comparative utility of the Iceberg REST catalog

What are the shortcomings of Trino and Iceberg?

What are the most interesting, innovative, or unexpected ways that you have seen Iceberg/Trino used?

What are the most interesting, unexpected, or challenging lessons that you have learned while working on Stripe's data infrastructure?

When is a lakehouse on Trino/Iceberg the wrong choice?

What do you have planned for the future of Trino and Iceberg at Stripe?

Contact Info

Substack LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links

Trino Iceberg Stripe Spark Redshift Hive Metastore Python Iceberg Python Iceberg REST Catalog Trino Metadata Table Flink

Podcast Episode

Tabular

Podcast Episode

Delta Table

Podcast Episode

Databricks Unity Catalog Starburst AWS Athena Kevin Trinofest Presentation Alluxio

Podcast Episode

Parquet Hudi Trino Project Tardigrade Trino On Ice

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By: Starburst

This episode is brought to you by Starburst - an end-to-end data lakehouse platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, the query engine Apache Iceberg was designed for, Starburst is an open platform with support for all table formats including Apache Iceberg, Hive, and Delta Lake.

Trusted by the teams at Comcast and Doordash, Starburst del

From Moneyball to Gen AI

2024-05-12 · The Analytics Engineering Podcast Listen

podcast_episode

by Tristan Handy (dbt Labs), Eric Avidon (TechTarget)

AI/ML Analytics Data Quality GenAI Snowflake

Eric Avidon is a journalist at TechTarget who's interviewed Tristan a few times, and now Tristan gets to flip the script and interview Eric. Eric is a veteran journalist who has covered everything from finance to the Boston Red Sox, but now he spends a lot of time with vendors in the data space and has a broad view of what's going on. Eric and Tristan discuss AI and analytics and how mature these features really are today, data quality and its importance, the AI strategies of Snowflake and Databricks, and a lot more. Plus, partway through you can hear Tristan reacting to a mild earthquake that hit the East Coast. For full show notes and to read 6+ years of back issues of the podcast's companion newsletter, head to https://roundup.getdbt.com.

#49 How Will the EU AI Act Affect the Future of AI?

2024-05-08 · DataTopics: All Things Data, AI & Tech Listen

podcast_episode

by Maryam Ilyas

AI/ML PySpark Python SQL

Welcome to the cozy corner of the tech world where ones and zeros mingle with casual chit-chat. Datatopics Unplugged is your go-to spot for relaxed discussions around tech, news, data, and society. Dive into conversations that should flow as smoothly as your morning coffee (but don't), where industry insights meet laid-back banter. Whether you're a data aficionado or just someone curious about the digital age, pull up a chair, relax, and let's get into the heart of data, unplugged style!

In this episode, we're joined by special guest Maryam Ilyas as we delve into a variety of topics that shape our digital world:

Women’s Healthcare Insights: Exploring the Oura ring's commitment during Women's Health Awareness Month and its role in addressing the underrepresentation of female health conditions in research.

A Deep Dive into the EU AI Act: Examining the AI Act’s implications, including its classification of AI systems (prohibited, high-risk, limited-risk, and minimal-risk), ethical concerns, regulatory challenges, and the act's impact on AI usage, particularly regarding mass surveillance at the Paris Olympics.

The Evolution of Music and AI: Reviewing the AI-generated music video for "The Hardest Part" by Washed Out, directed by Paul Trillo, showcasing AI’s growing role in the arts.

Hot Takes on Data Tools: Is combining SQL, PySpark (and Python) in Databricks the most powerful tool in the data space? Let's dissect the possibilities and limitations.

Don't forget to check us out on YouTube too, where you can find a lot more content beyond the podcast!

#48 How Can We Define DevRel in the Tech World? Tech Insights with Mehdi Ouazza

2024-05-02 · DataTopics: All Things Data, AI & Tech Listen

podcast_episode

by Mehdi Ouazza (MotherDuck)

AI/ML Data Engineering DuckDB GenAI GitHub IBM LLM Motherduck Terraform

Welcome to the cozy corner of the tech world where ones and zeros mingle with casual chit-chat. Datatopics Unplugged is your go-to spot for relaxed discussions around tech, news, data, and society. Dive into conversations that should flow as smoothly as your morning coffee (but don't), where industry insights meet laid-back banter. Whether you're a data aficionado or just someone curious about the digital age, pull up a chair, relax, and let's get into the heart of data, unplugged style!

In this episode, we're thrilled to have special guest Mehdi Ouazza diving into a plethora of hot tech topics:

Mehdi Ouazza's insights into his career, online community, and working with DuckDB and MotherDuck.

Demystifying DevRel: Definitions and distinctions in the realm of tech influence.

Terraform's Licensing Shift: Reactions to HashiCorp's recent changes and its new IBM collaboration.

GitHub Copilot Workspace: Exploring the latest in AI-powered coding assistance, comparing with devin.ai and Cody.

Snowflake's Arctic LLM: Discussing the latest enterprise AI capabilities and their real-world applications, including what Arctic excels at and how its performance was measured.

More legal kerfuffle in the GenAI realm: The ongoing legal debates around AI's use in creative industries, highlighted by a dispute over Drake’s use of late rapper Tupac’s AI-generated voice in a diss track, and the licensing deal between the Financial Times and OpenAI.

Future of Data Engineering: Examining the integration of LLMs into data engineering tools, with insights on prompt-based feature engineering and Databricks' English SDK.

AI in Music Creation: A little bonus with an AI-generated song about Murilo, created with Suno.
