talk-data.com

Topic: Data Engineering
Tags: etl, data_pipelines, big_data
1127 activities tagged

Activity Trend: peak of 127 activities per quarter, 2020-Q1 to 2026-Q1

Activities

1127 activities · Newest first

Supercharge your lakehouse with Azure Databricks and Microsoft Fabric | BRK203

Azure Databricks enhances the lakehouse experience in Azure by seamlessly integrating data and AI solutions for faster time to value. Data, schemas, and tables cataloged in Unity Catalog are readily available, supporting data engineering, data science, real-time intelligence, and optimized performance, and delivering blazing-fast insights with Power BI.

Speakers: * Lindsey Allen * Robert Saxby

Session Information: This is one of many sessions from the Microsoft Ignite 2024 event. View even more sessions on-demand and learn about Microsoft Ignite at https://ignite.microsoft.com

BRK203 | English (US) | Data

MSIgnite

Think Inside the Box: Constraints Drive Data Warehousing Innovation

As a Head of Data or a one-person data team, keeping the lights on for the business while running all things data-related as efficiently as possible is no small feat. This talk will focus on tactics and strategies to manage within and around constraints, including monetary costs, time and resources, and data volumes.

📓 Resources
Big Data is Dead: https://motherduck.com/blog/big-data-...
Small Data Manifesto: https://motherduck.com/blog/small-dat...
Why Small Data?: https://benn.substack.com/p/is-excel-...
Small Data SF: https://www.smalldatasf.com/

➡️ Follow Us
LinkedIn: https://linkedin.com/company/motherduck
X/Twitter: https://twitter.com/motherduck
Blog: https://motherduck.com/blog/


Learn how your data team can drive innovation and maximize ROI by embracing constraints, drawing inspiration from SpaceX's revolutionary cost-effective approach. This video challenges the "abundance mindset" prevalent in the modern data stack, where easily scalable cloud data warehouses and a surplus of tools often lead to unmanageable data models and underutilized dashboards. We explore a focused data strategy for extracting maximum value from small data, shifting the paradigm from "more data" to more impact.

To maximize value, data teams must move beyond being order-takers and practice strategic stakeholder management. Discover how to use frameworks like the stakeholder engagement matrix to prioritize high-impact business leaders and align your work with core business goals. This involves speaking the language of business growth models, not technical jargon about data pipelines or orchestration, ensuring your data engineering efforts resonate with key decision-makers and directly contribute to revenue-generating activities.

Embracing constraints is key to innovation and effective data project management. We introduce the Iron Triangle—a fundamental engineering concept balancing scope, cost, and time—as a powerful tool for planning data projects and having transparent conversations with the business. By treating constraints not as limitations but as opportunities, data engineers and analysts can deliver higher-quality data products without succumbing to scope creep or uncontrolled costs.

A critical component of this strategy is understanding the Total Cost of Ownership (TCO), which goes far beyond initial compute costs to include ongoing maintenance, downtime, and the risk of vendor pricing changes. Learn how modern, efficient tools like DuckDB and MotherDuck are designed for cost containment from the ground up, enabling teams to build scalable, cost-effective data platforms. By making the true cost of data requests visible, you can foster accountability and make smarter architectural choices. Ultimately, this guide provides a blueprint for resisting data stack bloat and turning cost and constraints into your greatest assets for innovation.
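As a rough illustration of the TCO framing above, here is a hedged back-of-the-envelope sketch in Python. Every figure is an invented placeholder rather than vendor pricing, and the cost categories are only the ones named in the description (compute, storage, maintenance, downtime).

```python
# Hypothetical back-of-the-envelope TCO estimate for a data platform.
# All figures are illustrative placeholders, not vendor pricing.

def annual_tco(compute_cost, storage_cost, maintenance_hours_per_month,
               hourly_rate, downtime_hours, downtime_cost_per_hour):
    """Sum the visible bill plus the costs that usually stay hidden."""
    people = maintenance_hours_per_month * 12 * hourly_rate
    downtime = downtime_hours * downtime_cost_per_hour
    return compute_cost + storage_cost + people + downtime

# Example: a platform whose compute bill looks cheap until
# maintenance and downtime are counted.
total = annual_tco(
    compute_cost=24_000,            # annual warehouse compute
    storage_cost=3_000,             # annual object storage
    maintenance_hours_per_month=20,
    hourly_rate=90,                 # loaded engineering cost per hour
    downtime_hours=16,
    downtime_cost_per_hour=500,
)
print(f"Estimated annual TCO: ${total:,.0f}")  # -> $56,600
```

Making a calculation like this visible to stakeholders is one concrete way to "make the true cost of data requests visible," as the talk suggests.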

It’s time for another episode of the Data Engineering Central Podcast. In this episode we cover … * Apache Airflow vs Databricks Workflows * End-of-Year Engineering Planning for 2025 * 10 Billion Row Challenge with DuckDB vs Daft vs Polars * Raw Data Ingestion. As usual, the full episode is available to paid subscribers, and a shortened version goes out to the freeloaders out there (don’t worry, I still love you).

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit dataengineeringcentral.substack.com/subscribe

We’re improving DataFramed, and we need your help! We want to hear what you have to say about the show, and how we can make it more enjoyable for you—find out more here.

We’re often caught chasing the dream of “self-serve” data—a place where data empowers stakeholders to answer their questions without a data expert at every turn. But what does it take to reach that point? How do you shape tools that empower teams to explore and act on data without the usual bottlenecks? And with the growing presence of natural language tools and AI, is true self-service within reach, or is there still more to the journey?

Sameer Al-Sakran is the CEO at Metabase, a low-code self-service analytics company. Sameer has a background in both data science and data engineering, so he's got a practitioner's perspective as well as executive insight. Previously, he was CTO at Expa and Blackjet, and the founder of SimpleHadoop and Adopilot.

In the episode, Richie and Sameer explore self-serve analytics, the evolution of data tools, GenAI vs AI agents, semantic layers, the challenges of implementing self-serve analytics, the problem with data-driven culture, encouraging efficiency in data teams, the parallels between UX and data projects, exciting trends in analytics, and much more.

Links Mentioned in the Show:
Metabase
Connect with Sameer
Articles from Metabase on jargon, information budgets, analytics mistakes, and data model mistakes
Course: Introduction to Data Culture
Related Episode: Towards Self-Service Data Engineering with Taylor Brown, Co-Founder and COO at Fivetran
Rewatch Sessions from RADAR: Forward Edition

New to DataCamp?
Learn on the go using the DataCamp mobile app
Empower your business with world-class data and AI skills with DataCamp for business

People often ask me for career advice. In a tough job market where people are sending out thousands of resumes and hearing nothing back, I notice a lot of people have weak networks and are unknown to the companies they're applying to. This results in lots of frustration and disappointment for job seekers.

Is there a better way? Yes. People need to know who you are. Obscurity is your enemy.

Also, the name of the Friday show changed because I can't seem to keep things to five minutes ;)

My works:

📕Fundamentals of Data Engineering: https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/

🎥 Deeplearning.ai Data Engineering Certificate: https://www.coursera.org/professional-certificates/data-engineering

🔥Practical Data Modeling: https://practicaldatamodeling.substack.com/

🤓 My SubStack: https://joereis.substack.com/

Big Data is Dead: Long Live Hot Data 🔥

Over the last decade, Big Data was everywhere. Let's set the record straight on what is and isn't Big Data. We have been consumed by a conversation about data volumes when we should focus more on the immediate task at hand: Simplifying our work.

Some of us may have Big Data, but our quest to derive insights from it is measured in small slices of work that fit on your laptop or in your hand. Easy data is here—let's make the most of it.

📓 Resources
Big Data is Dead: https://motherduck.com/blog/big-data-is-dead/
Small Data Manifesto: https://motherduck.com/blog/small-data-manifesto/
Small Data SF: https://www.smalldatasf.com/

➡️ Follow Us
LinkedIn: https://linkedin.com/company/motherduck
X/Twitter: https://twitter.com/motherduck
Blog: https://motherduck.com/blog/


Explore the "Small Data" movement, a counter-narrative to the prevailing big data conference hype. This talk challenges the assumption that data scale is the most important feature of every workload, defining big data as any dataset too large for a single machine. We'll unpack why this distinction is crucial for modern data engineering and analytics, setting the stage for a new perspective on data architecture.

Delve into the history of big data systems, starting with the non-linear hardware costs that plagued early data practitioners. Discover how Google's foundational papers on GFS, MapReduce, and Bigtable led to the creation of Hadoop, fundamentally changing how we scale data processing. We'll break down the "big data tax"—the inherent latency and system complexity overhead required for distributed systems to function, a critical concept for anyone evaluating data platforms.

Learn about the architectural cornerstone of the modern cloud data warehouse: the separation of storage and compute. This design, popularized by systems like Snowflake and Google BigQuery, allows storage to scale almost infinitely while compute resources are provisioned on-demand. Understand how this model paved the way for massive data lakes but also introduced new complexities and cost considerations that are often overlooked.
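To make the storage/compute split concrete at small scale, here is a hedged sketch using DuckDB's httpfs extension to query Parquet files sitting in object storage: storage lives in the bucket, and compute is a local process that exists only while the query runs. The bucket path and column names are hypothetical, and credential/region configuration is omitted.

```python
# Small-scale illustration of separated storage and compute:
# data lives in object storage as Parquet; compute is a local DuckDB
# process spun up only when a query runs.
# The bucket path and column names are hypothetical placeholders,
# and AWS credential/region setup is omitted.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")   # enables reading s3:// URLs
con.execute("LOAD httpfs")

result = con.sql("""
    SELECT event_date, count(*) AS events
    FROM read_parquet('s3://example-bucket/events/*.parquet')
    GROUP BY event_date
    ORDER BY event_date
""").df()
print(result.head())
```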

We examine the cracks appearing in the big data paradigm, especially for OLAP workloads. While systems like Snowflake are still dominant, the rise of powerful alternatives like DuckDB signals a shift. We reveal the hidden costs of big data analytics, exemplified by a petabyte-scale query costing nearly $6,000, and argue that for most use cases, it's too expensive to run computations over massive datasets.
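The arithmetic behind that figure is easy to sanity-check. The sketch below assumes an illustrative on-demand price of roughly $6.25 per TB scanned; the exact rate varies by vendor and contract.

```python
# Rough arithmetic behind the "nearly $6,000 per petabyte" figure,
# assuming an on-demand scan price of about $6.25 per TB (illustrative).
price_per_tb = 6.25          # assumed $/TB scanned
tb_per_pb = 1_000            # decimal petabyte

print(f"Full scan of 1 PB ~ ${price_per_tb * tb_per_pb:,.0f}")        # ~ $6,250

# Scanning only a hot working set is a different story.
hot_gb = 50
print(f"Scan of {hot_gb} GB ~ ${price_per_tb * hot_gb / 1_000:.2f}")  # ~ $0.31
```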

The key to efficient data processing isn't your total data size, but the size of your "hot data" or working set. This talk argues that the revenge of the single node is here, as modern hardware can often handle the actual data queried without the overhead of the big data tax. This is a crucial optimization technique for reducing cost and improving performance in any data warehouse.

Discover the core principles for designing systems in a post-big data world. We'll show that since only 1 in 500 users run true big data queries, prioritizing simplicity over premature scaling is key. For low latency, process data close to the user with tools like DuckDB and SQLite. This local-first approach offers a compelling alternative to cloud-centric models, enabling faster, more cost-effective, and innovative data architectures.
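As a hedged sketch of that local-first, hot-data pattern: cache only the recent working set from the lake into a local DuckDB file, then run every subsequent query on a single node. The bucket path, column names, and 30-day window below are hypothetical.

```python
# Hot-data sketch: pull only the recent working set out of the lake,
# cache it locally, and serve queries from a single node with DuckDB.
# Paths, column names, and the 30-day window are hypothetical.
import duckdb

con = duckdb.connect("local_cache.duckdb")   # persistent local database file
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Materialize just the last 30 days (the hot set), not the full history.
con.execute("""
    CREATE OR REPLACE TABLE hot_events AS
    SELECT *
    FROM read_parquet('s3://example-bucket/events/*.parquet')
    WHERE event_date >= current_date - INTERVAL 30 DAY
""")

# All later analysis runs locally, with no distributed-system overhead.
print(con.sql("SELECT count(*) AS rows_in_working_set FROM hot_events").df())
```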

Send us a text. Want to be featured as a guest on Making Data Simple? Reach out to us at [[email protected]] and tell us why you should be next.

Synopsis
What if hiring wasn’t about flipping through endless CVs but instead focused solely on skills? In this episode of Making Data Simple, we sit down with Tim Freestone, founder of Alooba, the groundbreaking platform revolutionizing how businesses hire for analytics, data science, and engineering roles. Tim shares how Alooba eliminates bias, speeds up hiring, and ensures candidates are evaluated based on what really matters—their capabilities. From his journey as an economics teacher to leading data teams, Tim’s insights are a must-hear for anyone tackling hiring challenges in today’s competitive job market. Learn how Alooba’s data-driven approach is transforming recruitment and why the future of hiring might just leave resumes in the dust.

Show Notes
4:46 – How do you go from economics teacher to head of business intelligence?
7:53 – Do CVs matter anymore?
13:22 – What business problem is Alooba solving?
16:05 – Do you have any data that supports your theory?
19:01 – Why analytics, data science, data engineering?
20:26 – What do you do that others don’t?
23:50 – How does Alooba define success?
25:42 – Who’s your target client base?
32:40 – Is there a customer you can talk about?
36:24 – What does Alooba mean?
Alooba

Connect with the Team
Executive Producer Kate Mayne - LinkedIn. Host Al Martin - LinkedIn and Twitter.

Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Summary
The challenges of integrating all of the tools in the modern data stack have led to a new generation of tools that focus on a fully integrated workflow. At the same time, there have been many approaches to how much of the workflow is driven by code vs. not. Burak Karakan is of the opinion that a fully integrated workflow that is driven entirely by code offers a beneficial and productive means of generating useful analytical outcomes. In this episode he shares how Bruin builds on those opinions and how you can use it to build your own analytics without having to cobble together a suite of tools with conflicting abstractions.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today!
Your host is Tobias Macey and today I'm interviewing Burak Karakan about the benefits of building code-only data systems.

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Bruin is and the story behind it?
Who is your target audience?
There are numerous tools that address the ETL workflow for analytical data. What are the pain points that you are focused on for your target users?
How does a code-only approach to data pipelines help in addressing the pain points of analytical workflows?
How might it act as a limiting factor for organizational involvement?
Can you describe how Bruin is designed?
How have the design and scope of Bruin evolved since you first started working on it?
You call out the ability to mix SQL and Python for transformation pipelines. What are the components that allow for that functionality?
What are some of the ways that the combination of Python and SQL improves ergonomics of transformation workflows?
What are the key features of Bruin that help to streamline the efforts of organizations building analytical systems?
Can you describe the workflow of someone going from source data to warehouse and dashboard using Bruin and Ingestr?
What are the opportunities for contributions to Bruin and Ingestr to expand their capabilities?
What are the most interesting, innovative, or unexpected ways that you have seen Bruin and Ingestr used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Bruin?
When is Bruin the wrong choice?
What do you have planned for the future of Bruin?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
Bruin
Fivetran
Stitch
Ingestr
Bruin CLI
Meltano
SQLGlot
dbt
SQLMesh (Podcast Episode)
SDF (Podcast Episode)
Airflow
Dagster
Snowpark
Atlan
Evidence

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
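For readers unfamiliar with what "mixing SQL and Python in one pipeline" can look like, here is a generic, hedged sketch using plain Python and DuckDB. It is not Bruin's actual syntax or API; it only illustrates the pattern the episode discusses: a Python asset producing data that a SQL asset then transforms, all defined as code in one repository.

```python
# Generic sketch of a code-only pipeline mixing Python and SQL assets.
# NOT Bruin's syntax; this only illustrates the pattern under discussion:
# Python for ingestion, SQL for transformation, both versioned as code.
import duckdb
import pandas as pd

def extract_orders() -> pd.DataFrame:
    """Python asset: pull raw data from a source system (stubbed here)."""
    return pd.DataFrame({
        "order_id": [1, 2, 3],
        "customer": ["a", "b", "a"],
        "amount": [10.0, 25.0, 7.5],
    })

TRANSFORM_SQL = """
    -- SQL asset: aggregate the raw orders into a reporting table.
    SELECT customer, sum(amount) AS total_spent
    FROM raw_orders
    GROUP BY customer
"""

con = duckdb.connect()
con.register("raw_orders", extract_orders())   # hand the Python output to SQL
report = con.sql(TRANSFORM_SQL).df()
print(report)
```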

Let's do things the right way, not just the fast way.

My works:

📕Fundamentals of Data Engineering: https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/

🎥 Deeplearning.ai Data Engineering Certificate: https://www.coursera.org/professional-certificates/data-engineering

🔥Practical Data Modeling: https://practicaldatamodeling.substack.com/

🤓 My SubStack: https://joereis.substack.com/

Summary
In this episode of the Data Engineering Podcast, the creators of Feldera talk about their incremental compute engine designed for continuous computation of data, machine learning, and AI workloads. The discussion covers the concept of incremental computation, the origins of Feldera, and its unique ability to handle both streaming and batch data seamlessly. The guests explore Feldera's architecture, applications in real-time machine learning and AI, and challenges in educating users about incremental computation. They also discuss the balance between open-source and enterprise offerings, and the broader implications of incremental computation for the future of data management, predicting a shift towards unified systems that handle both batch and streaming data efficiently.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today!
As a listener of the Data Engineering Podcast you clearly care about data and how it affects your organization and the world. For even more perspective on the ways that data impacts everything around us you should listen to Data Citizens® Dialogues, the forward-thinking podcast from the folks at Collibra. You'll get further insights from industry leaders, innovators, and executives in the world's largest companies on the topics that are top of mind for everyone. They address questions around AI governance, data sharing, and working at global scale. In particular I appreciate the ability to hear about the challenges that enterprise scale businesses are tackling in this fast-moving field. While data is shaping our world, Data Citizens Dialogues is shaping the conversation. Subscribe to Data Citizens Dialogues on Apple, Spotify, YouTube, or wherever you get your podcasts.
Your host is Tobias Macey and today I'm interviewing Leonid Ryzhyk, Lalith Suresh, and Mihai Budiu about Feldera, an incremental compute engine for continuous computation of data, ML, and AI workloads.

Interview
Introduction
Can you describe what Feldera is and the story behind it?
DBSP (the theory behind Feldera) has won multiple awards from the database research community. Can you explain what it is and how it solves the incremental computation problem?
Depending on which angle you look at it, Feldera has attributes of data warehouses, federated query engines, and stream processors. What are the unique use cases that Feldera is designed to address?
In what situations would you replace another technology with Feldera?
When is it an additive technology?
Can you describe the architecture of Feldera?
How have the design and scope evolved since you first started working on it?
What are the state storage interfaces available in Feldera?
What are the opportunities for integrating with or building on top of open table formats like Iceberg, Lance, Hudi, etc.?
Can you describe a typical workflow for an engineer building with Feldera?
You advertise Feldera's utility in ML and AI use cases in addition to data management. What are the features that make it conducive to those applications?
What is your philosophy toward the community growth and engagement with the open source aspects of Feldera and how you're balancing that with sustainability of the project and business?
What are the most interesting, innovative, or unexpected ways that you have seen Feldera used?
What are the most interesting, unexpected, or challenging lessons that
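For readers new to the idea, here is a toy, hedged illustration of incremental computation in Python. It is not Feldera's API or DBSP itself; it only shows the core intuition that each change carries a weight (+1 for an insert, -1 for a delete) applied to maintained state, rather than recomputing the view over the full table on every change.

```python
# Toy illustration of incremental view maintenance (not Feldera's API):
# apply each delta to maintained state instead of recomputing from scratch.
from collections import defaultdict

class IncrementalCountByKey:
    def __init__(self):
        self.counts = defaultdict(int)   # maintained view: key -> count

    def apply_delta(self, key, weight):
        """weight is +1 for an inserted row, -1 for a deleted row."""
        self.counts[key] += weight
        if self.counts[key] == 0:
            del self.counts[key]         # keep the view minimal

view = IncrementalCountByKey()
for key, weight in [("clicks", +1), ("clicks", +1), ("views", +1), ("clicks", -1)]:
    view.apply_delta(key, weight)

print(dict(view.counts))   # {'clicks': 1, 'views': 1}
```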

We talked about:

00:00 DataTalks.Club intro
01:56 Using data to create livable cities
02:52 Rachel's career journey: from geography to urban data science
04:20 What does a transport scientist do?
05:34 Short-term and long-term transportation planning
06:14 Data sources for transportation planning in Singapore
08:38 Rachel's motivation for combining geography and data science
10:19 Urban design and its connection to geography
13:12 Defining a livable city
15:30 Livability of Singapore and urban planning
18:24 Role of data science in urban and transportation planning
20:31 Predicting travel patterns for future transportation needs
22:02 Data collection and processing in transportation systems
24:02 Use of real-time data for traffic management
27:06 Incorporating generative AI into data engineering
30:09 Data analysis for transportation policies
33:19 Technologies used in text-to-SQL projects
36:12 Handling large datasets and transportation data in Singapore
42:17 Generative AI applications beyond text-to-SQL
45:26 Publishing public data and maintaining privacy
45:52 Recommended datasets and projects for data engineering beginners
49:16 Recommended resources for learning urban data science

About the speaker:

Rachel is an urban data scientist dedicated to creating liveable cities through the innovative use of data. With a background in geography and a master's in urban data science, she blends qualitative and quantitative analysis to tackle urban challenges. Her aim is to integrate data-driven techniques with urban design to foster sustainable and equitable urban environments.

Links: - https://datamall.lta.gov.sg/content/datamall/en/dynamic-data.html


Join our Slack: https://datatalks.club/slack.html

I speak at a lot of conferences, and I've lost track of how many questions I've answered. Since conferences are top of mind for me right now, here are some tips for asking good (and bad) questions of speakers.

My works:

📕Fundamentals of Data Engineering: https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/

🎥 Deeplearning.ai Data Engineering Certificate: https://www.coursera.org/professional-certificates/data-engineering

🔥Practical Data Modeling: https://practicaldatamodeling.substack.com/

🤓 My SubStack: https://joereis.substack.com/

Apache Airflow Best Practices

"Apache Airflow Best Practices" is your go-to guide for mastering data workflow orchestration using Apache Airflow. This book introduces you to core concepts and features of Airflow and helps you efficiently design, deploy, and manage workflows. With detailed examples and hands-on tutorials, you'll learn how to tackle real-world challenges in data engineering. What this Book will help me do Understand and utilize the features and updates introduced in Apache Airflow 2.x. Design and implement robust, scalable, and efficient data pipelines and workflows. Learn best practices for deploying Apache Airflow in cloud environments such as AWS and GCP. Extend Airflow's functionality with custom plugins and advanced configuration. Monitor, maintain, and scale your Airflow deployment effectively for high availability. Author(s) Dylan Intorf, Dylan Storey, and Kendrick van Doorn are seasoned professionals in data engineering, data strategy, and software development. Between them, they bring decades of experience working in diverse industries like finance, tech, and life sciences. They bring their expertise into this practical guide to help practitioners understand and master Apache Airflow. Who is it for? This book is tailored for data professionals such as data engineers, scientists, and system administrators, offering valuable insights for new learners and experienced users. If you're starting with workflow orchestration, seeking to optimize your current Airflow implementation, or scaling efforts, this book aligns with your goals. Readers should have a basic knowledge of Python programming and data engineering principles.

Delta Lake: The Definitive Guide

Ready to simplify the process of building data lakehouses and data pipelines at scale? In this practical guide, learn how Delta Lake is helping data engineers, data scientists, and data analysts overcome key data reliability challenges with modern data engineering and management techniques. Authors Denny Lee, Tristen Wentling, Scott Haines, and Prashanth Babu (with contributions from Delta Lake maintainer R. Tyler Croy) share expert insights on all things Delta Lake--including how to run batch and streaming jobs concurrently and accelerate the usability of your data. You'll also uncover how ACID transactions bring reliability to data lakehouses at scale.

This book helps you:
Understand key data reliability challenges and how Delta Lake solves them
Explain the critical role of Delta transaction logs as a single source of truth
Learn the Delta Lake ecosystem with technologies like Apache Flink, Kafka, and Trino
Architect data lakehouses with the medallion architecture
Optimize Delta Lake performance with features like deletion vectors and liquid clustering
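To show the transaction-log behavior the book describes in miniature, here is a hedged sketch using the deltalake (delta-rs) Python package; the table path and data are hypothetical.

```python
# Small Delta Lake read/write sketch with the deltalake (delta-rs) package.
# Table path and data are hypothetical placeholders.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Write an initial version of the table, then append a second batch;
# each write becomes a new commit in the Delta transaction log.
write_deltalake("./bronze/orders", pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 25.0]}))
write_deltalake("./bronze/orders", pd.DataFrame({"order_id": [3], "amount": [7.5]}), mode="append")

dt = DeltaTable("./bronze/orders")
print(dt.version())       # latest table version (1 after the append)
print(dt.to_pandas())     # read the current snapshot

# Time travel to the first commit via the transaction log.
print(DeltaTable("./bronze/orders", version=0).to_pandas())
```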

Help us become the #1 Data Podcast by leaving a rating & review! We are 67 reviews away!

Big changes are happening in the data world, and it’s not just about AI! It’s a mix of challenges and new chances in the data field. Let’s dig into what’s happening and why now’s the time to rethink your next career move.

💌 Join 30k+ aspiring data analysts & get my tips in your inbox weekly 👉 https://www.datacareerjumpstart.com/newsletter
🆘 Feeling stuck in your data journey? Come to my next free "How to Land Your First Data Job" training 👉 https://www.datacareerjumpstart.com/training
👩‍💻 Want to land a data job in less than 90 days? 👉 https://www.datacareerjumpstart.com/daa
👔 Ace The Interview with Confidence 👉 https://www.datacareerjumpstart.com//interviewsimulator
🔗 LIVE DATA TECHNOLOGIES: https://www.livedatatechnologies.com/

⌚ TIMESTAMPS
01:10 - Data-Driven Insights on the Job Market
02:18 - The Rise of Data Engineering
03:49 - AI's Impact on Data Roles
04:44 - Data Analyst Jobs Are Still Growing
06:27 - Job Hopping in Data Roles

🔗 CONNECT WITH AVERY
🎥 YouTube Channel
🤝 LinkedIn
📸 Instagram
🎵 TikTok
💻 Website

Mentioned in this episode: Join the last cohort of 2025! The LAST cohort of The Data Analytics Accelerator for 2025 kicks off on Monday, December 8th and enrollment is officially open!

To celebrate the end of the year, we’re running a special End-of-Year Sale, where you’ll get:
✅ A discount on your enrollment
🎁 6 bonus gifts, including job listings, interview prep, AI tools + more

If your goal is to land a data job in 2026, this is your chance to get ahead of the competition and start strong.

👉 Join the December Cohort & Claim Your Bonuses: https://www.datacareerjumpstart.com/daa

Summary
Gleb Mezhanskiy, CEO and co-founder of Datafold, joins Tobias Macey to discuss the challenges and innovations in data migrations. Gleb shares his experiences building and scaling data platforms at companies like Autodesk and Lyft, and how these experiences inspired the creation of Datafold to address data quality issues across teams. He outlines the complexities of data migrations, including common pitfalls such as technical debt and the importance of achieving parity between old and new systems. Gleb also discusses Datafold's innovative use of AI and large language models (LLMs) to automate translation and reconciliation processes in data migrations, reducing the time and effort required for migrations.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today!
Your host is Tobias Macey and today I'm welcoming back Gleb Mezhanskiy to talk about Datafold's experience bringing AI to bear on the problem of migrating your data stack.

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what the Data Migration Agent is and the story behind it?
What is the core problem that you are targeting with the agent?
What are the biggest time sinks in the process of database and tooling migration that teams run into?
Can you describe the architecture of your agent?
What was your selection and evaluation process for the LLM that you are using?
What were some of the main unknowns that you had to discover going into the project?
What are some of the evolutions in the ecosystem that occurred either during the development process or since your initial launch that have caused you to second-guess elements of the design?
In terms of SQL translation there are libraries such as SQLGlot and the work being done with SDF that aim to address that through AST parsing and subsequent dialect generation. What are the ways that approach is insufficient in the context of a platform migration?
How does the approach you are taking with the combination of data-diffing and automated translation help build confidence in the migration target?
What are the most interesting, innovative, or unexpected ways that you have seen the Data Migration Agent used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on building an AI powered migration assistant?
When is the data migration agent the wrong choice?
What do you have planned for the future of applications of AI at Datafold?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
Datafold
Datafold Migration Agent
Datafold data-diff
Datafold Reconciliation (Podcast Episode)
SQLGlot
Lark parser
Claude 3.5 Sonnet
Looker (Podcast Episode)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
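For context on the SQL-translation piece discussed above, here is a hedged snippet using SQLGlot, the AST-based transpiler named in the show notes. The Snowflake query is a made-up example, and this illustrates only dialect translation, not Datafold's Data Migration Agent.

```python
# SQL dialect translation with SQLGlot: parse into an AST, re-render
# in another dialect. The query below is a hypothetical example.
import sqlglot

snowflake_sql = """
    SELECT DATEADD(day, -7, CURRENT_DATE) AS week_ago,
           IFF(amount > 100, 'big', 'small') AS bucket
    FROM orders
"""

# Translate Snowflake SQL into DuckDB SQL.
print(sqlglot.transpile(snowflake_sql, read="snowflake", write="duckdb")[0])
```

AST-level translation like this handles function and syntax differences; the diffing step the episode describes is about verifying that results in the old and new systems actually match afterward.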

I've seen a TON of horror stories with tech debt and code migrations. It's estimated that 15% to 60% of every dollar in IT spend goes toward tech debt (that's a big range, I know). Regardless, most of this tech debt will not be paid down without a radical change in how we do things. Might AI be the Hail Mary we need to pay down tech debt? I don't see why not...

My works:

📕Fundamentals of Data Engineering: https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/

🎥 Deeplearning.ai Data Engineering Certificate: https://www.coursera.org/professional-certificates/data-engineering

🔥Practical Data Modeling: https://practicaldatamodeling.substack.com/

🤓 My SubStack: https://joereis.substack.com/

Summary
The rapid growth of generative AI applications has prompted a surge of investment in vector databases. While there are numerous engines available now, Lance is designed to integrate with data lake and lakehouse architectures. In this episode Weston Pace explains the inner workings of the Lance format for table definitions and file storage, and the optimizations that they have made to allow for fast random access and efficient schema evolution. In addition to integrating well with data lakes, Lance is also a first-class participant in the Arrow ecosystem, making it easy to use with your existing ML and AI toolchains. This is a fascinating conversation about a technology that is focused on expanding the range of options for working with vector data.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today!
Your host is Tobias Macey and today I'm interviewing Weston Pace about the Lance file and table format for column-oriented vector storage.

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Lance is and the story behind it?
What are the core problems that Lance is designed to solve?
What is explicitly out of scope?
The README mentions that it is straightforward to convert to Lance from Parquet. What is the motivation for this compatibility/conversion support?
What formats does Lance replace or obviate?
In terms of data modeling Lance obviously adds a vector type, what are the features and constraints that engineers should be aware of when modeling their embeddings or arbitrary vectors?
Are there any practical or hard limitations on vector dimensionality?
When generating Lance files/datasets, what are some considerations to be aware of for balancing file/chunk sizes for I/O efficiency and random access in cloud storage?
I noticed that the file specification has space for feature flags. How has that aided in enabling experimentation in new capabilities and optimizations?
What are some of the engineering and design decisions that were most challenging and/or had the biggest impact on the performance and utility of Lance?
The most obvious interface for reading and writing Lance files is through LanceDB. Can you describe the use cases that it focuses on and its notable features?
What are the other main integrations for Lance?
What are the opportunities or roadblocks in adding support for Lance and vector storage/indexes in e.g. Iceberg or Delta to enable its use in data lake environments?
What are the most interesting, innovative, or unexpected ways that you have seen Lance used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on the Lance format?
When is Lance the wrong choice?
What do you have planned for the future of Lance?

Contact Info
LinkedIn
GitHub

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
Lance Format
LanceDB
Substrait
PyArrow
FAISS
Pinecone (Podcast Episode)
Parquet
Iceberg (Podcast Episode)
Delta Lake (Podcast Episode)
PyLance
Hilbert Curves
SIFT Vectors
S3 Express
Weka
DataFusion
Ray Data
Torch Data Loader
HNSW == Hierarchical Navigable Small Worlds vector index
IVFPQ vector index
GeoJSON
Polars

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
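As a hedged, minimal taste of the vector-search workflow Lance targets, here is a LanceDB sketch in Python; the path, table name, and two-dimensional vectors are toy placeholders.

```python
# Minimal LanceDB sketch: store rows with a vector column in a local
# Lance dataset, then run a nearest-neighbor search.
# The path, table name, and 2-D vectors are toy placeholders.
import lancedb

db = lancedb.connect("./lance_data")        # Lance datasets stored on local disk
table = db.create_table(
    "docs",
    data=[
        {"id": 1, "text": "duck",  "vector": [0.1, 0.9]},
        {"id": 2, "text": "goose", "vector": [0.2, 0.8]},
        {"id": 3, "text": "lake",  "vector": [0.9, 0.1]},
    ],
)

# Nearest-neighbor search over the vector column.
print(table.search([0.15, 0.85]).limit(2).to_pandas())
```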

SQL is one of the most widely used data analysis tools around, often discussed as a cornerstone for Data Analysis, Data Science, and Data Engineering careers. In this episode, Thais Cooke talks about how she leverages SQL in her role as a Data Analyst and shares practical tips you can use to take your SQL game to the next level. You'll leave the show with an insider's perspective on where SQL adds the most value, and where you should focus if you want to build SQL skills that will advance your career.

What You'll Learn:
What makes SQL such a valuable skill set for so many roles
Some of the most valuable ways you can use SQL on the job
Where you can focus if you want to build job-ready SQL skills

Register for free to be part of the next live session: https://bit.ly/3XB3A8b

About our guest: Thais Cooke is a Data Analyst proficient in Excel, SQL, and Python with a background in Clinical Healthcare.
SQL for Healthcare Professionals Course
Follow Thais on LinkedIn
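As one hedged example of the kind of practical SQL the episode points to, the sketch below uses a window function to rank each patient's visits by date, run locally through DuckDB on toy data; the table and columns are hypothetical.

```python
# Example of an analysis-style SQL query: rank each patient's visits by
# date with a window function. Toy data and schema, run via DuckDB.
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE visits AS
    SELECT * FROM (VALUES
        (1, DATE '2024-01-05', 120.0),
        (1, DATE '2024-02-10',  80.0),
        (2, DATE '2024-01-20', 200.0)
    ) AS t(patient_id, visit_date, charge)
""")

print(con.sql("""
    SELECT patient_id, visit_date, charge,
           row_number() OVER (PARTITION BY patient_id ORDER BY visit_date) AS visit_number
    FROM visits
""").df())
```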

Follow us on Socials: LinkedIn YouTube Instagram (Mavens of Data) Instagram (Maven Analytics) TikTok Facebook Medium X/Twitter