In this episode, we're joined by Sam Debruyn and Dorian Van den Heede, who reflect on their talks at SQL Bits 2025 and dive into the technical content they presented. Sam walks through how dbt integrates with Microsoft Fabric, explaining how it improves lakehouse and warehouse workflows by adding modularity, testing, and documentation to SQL development. He also touches on Fusion’s SQL optimization features and how it compares to tools like SQLMesh. Dorian shares his MLOps demo, which simulates beating football bookmakers using historical data, showing how to build a full pipeline with Azure ML, from feature engineering to model deployment. They discuss the role of Python modeling in dbt, orchestration with Azure ML, and the practical challenges of implementing MLOps in real-world scenarios. Toward the end, they explore how AI tools like Copilot are changing the way engineers learn and debug code, raising questions about explainability, skill development, and the future of junior roles in tech. It’s a rich conversation covering dbt, MLOps, Python, Azure ML, and the evolving role of AI in engineering.
In this episode, I sit down with Wendy Turner-Williams, a distinguished tech leader and executive with a deep history at companies like Microsoft and Salesforce. She's one of the original minds behind what became Azure Data Factory, among other foundational tech. In this wide-ranging conversation, Wendy charts the trajectory from the early days of the Internet to the current AI-driven hype cycle and looming crisis. She explains how these tools of innovation are now being turned against the workforce and why this technological revolution is fundamentally more disruptive than anything that has come before. This episode is a candid, unfiltered discussion about the real-world impact of AI on jobs, the economy, and our collective future, and a call for leaders to act before it's too late. Timestamps: 00:22 - Catching up: The tough job market and writing new books. 05:49 - Wendy's impressive career history at Microsoft, Salesforce, and Tableau. 06:17 - The origin story of Azure Data Factory and other foundational projects at Microsoft. 09:18 - A personal story about the challenges of being a woman in Big Tech in the early days. 13:02 - A look back at a favorite early-career project: Digitizing physical maps with nascent GPS technology in 2001. 18:11 - The state of the tech industry: "Tech is cannibalizing itself because of AI." 20:31 - The massive, impending shock to the job market and why AI is different from previous industrial revolutions. 27:26 - Why the "human in the loop" is a temporary and misleading solution. 29:55 - Breaking down the numbers: The staggering quantity of white-collar jobs projected to be eliminated. 36:37 - Why leaders are failing to act and conversations are happening behind closed doors without solutions. 38:25 - Discussing potential solutions: Should companies have quotas for their human workforce? 45:21 - The need for "truth tellers" and leaders who are willing to question the current path and drive human-centric transformation.
53:15 - The grim reality for recent graduates with computer science degrees who can't find jobs. 56:22 - The risk of IP hoarding and engineers deliberately crippling systems to protect their jobs. 01:00:20 - Final thoughts: Are we waiting for a "let them eat cake" moment before we see real change?
Elliot Foreman and Andrew DeLave from ProsperOps joined Yuliia and Dumky to discuss automated cloud cost optimization through commitment management. As ProsperOps' Google go-to-market director and senior FinOps specialist, respectively, they explain how their platform manages over $4 billion in cloud spend by automating reserved instances, committed use discounts, and savings plans across AWS, Azure, and Google Cloud. The conversation covers the psychology behind commitment hesitation, break-even point mathematics for cloud discounts, workload volatility optimization, and why they avoid AI in favor of deterministic algorithms for financial decisions. They share insights on managing complex multi-cloud environments, the human vs automation debate in FinOps, and practical strategies for reducing cloud costs while mitigating commitment risks.
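The break-even point mathematics mentioned above reduces to a simple ratio: a commitment is billed for every hour of its term whether the resource runs or not, so it only beats on-demand pricing once utilization exceeds the committed rate divided by the on-demand rate. A minimal sketch with hypothetical rates (the function and the numbers are illustrative, not ProsperOps' actual model):

```python
def breakeven_utilization(on_demand_rate: float, committed_rate: float) -> float:
    """Fraction of the commitment term a resource must actually run
    for the commitment (billed 24/7) to cost less than on-demand."""
    return committed_rate / on_demand_rate

# Hypothetical rates: $0.10/hr on-demand vs. $0.07/hr with a 1-year commitment.
util = breakeven_utilization(on_demand_rate=0.10, committed_rate=0.07)
print(f"break-even utilization: {util:.0%}")  # prints "break-even utilization: 70%"
```

In other words, a 30% discount only pays off for workloads running more than 70% of the term; anything more volatile is cheaper on-demand, which is the risk calculus the guests describe automating.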
Supported by Our Partners • Statsig — The unified platform for flags, analytics, experiments, and more. • Sinch — Connect with customers at every step of their journey. • Modal — The cloud platform for building AI applications. — How has Microsoft changed since its founding in 1975, especially in how it builds tools for developers? In this episode of The Pragmatic Engineer, I sit down with Scott Guthrie, Executive Vice President of Cloud and AI at Microsoft. Scott has been with the company for 28 years. He built the first prototype of ASP.NET, led the Windows Phone team, headed up Azure, and helped shape many of Microsoft’s most important developer platforms. We talk about Microsoft’s journey from building early dev tools to becoming a top cloud provider—and how it actively worked to win back and grow its developer base. In this episode, we cover: • Microsoft’s early years building developer tools • Why Visual Basic faced resistance from devs back in the day, even though it simplified development at the time • How .NET helped bring a new generation of server-side developers into Microsoft’s ecosystem • Why Windows Phone didn’t succeed • The 90s Microsoft dev stack: docs, debuggers, and more • How Microsoft Azure went from being the #7 cloud provider to the #2 spot today • Why Microsoft created VS Code • How VS Code and open source led to the acquisition of GitHub • What Scott’s excited about in the future of developer tools and AI • And much more!
— Timestamps (00:00) Intro (02:25) Microsoft’s early years building developer tools (06:15) How Microsoft’s developer tools helped Windows succeed (08:00) Microsoft’s first tools were built to allow less technically savvy people to build things (11:00) A case for embracing the technology that’s coming (14:11) Why Microsoft built Visual Studio and .NET (19:54) Steve Ballmer’s speech about .NET (22:04) The origins of C# and Anders Hejlsberg’s impact on Microsoft (25:29) The 90’s Microsoft stack, including documentation, debuggers, and more (30:17) How productivity has changed over the past 10 years (32:50) Why Gergely was a fan of Windows Phone—and Scott’s thoughts on why it didn’t last (36:43) Lessons from working on (and fixing) Azure under Satya Nadella (42:50) Codeplex and the acquisition of GitHub (48:52) 2014: Three bold projects to win the hearts of developers (55:40) What Scott’s excited about in new developer tools and cloud computing (59:50) Why Scott thinks AI will enhance productivity but create more engineering jobs — The Pragmatic Engineer deepdives relevant for this episode: • Microsoft is dogfooding AI dev tools’ future • Microsoft’s developer tools roots • Why are Cloud Development Environments spiking in popularity, now? • Engineering career paths at Big Tech and scaleups • How Linux is built with Greg Kroah-Hartman — See the transcript and other references from the episode at https://newsletter.pragmaticengineer.com/podcast — Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email [email protected].
Get full access to The Pragmatic Engineer at newsletter.pragmaticengineer.com/subscribe
In this podcast episode, we talked with Eddy Zulkifly about "From Supply Chain Management to Digital Warehousing and FinOps"
About the Speaker: Eddy Zulkifly is a Staff Data Engineer at Kinaxis, building robust data platforms across Google Cloud, Azure, and AWS. With a decade of experience in data, he actively shares his expertise as a Mentor on ADPList and Teaching Assistant at Uplimit. Previously, he was a Senior Data Engineer at Home Depot, specializing in e-commerce and supply chain analytics. Currently pursuing a Master’s in Analytics at the Georgia Institute of Technology, Eddy is also passionate about open-source data projects and enjoys watching/exploring the analytics behind the Fantasy Premier League.
In this episode, we dive into the world of data engineering and FinOps with Eddy Zulkifly, Staff Data Engineer at Kinaxis. Eddy shares his unconventional career journey—from optimizing physical warehouses with Excel to building digital data platforms in the cloud.
🕒 TIMECODES 0:00 Eddy’s career journey: From supply chain to data engineering 8:18 Tools & learning: Excel, Docker, and transitioning to data engineering 21:57 Physical vs. digital warehousing: Analogies and key differences 31:40 Introduction to FinOps: Cloud cost optimization and vendor negotiations 40:18 Resources for FinOps: Certifications and the FinOps Foundation 45:12 Standardizing cloud cost reporting across AWS/GCP/Azure 50:04 Eddy’s master’s degree and closing thoughts
🔗 CONNECT WITH EDDY Twitter - https://x.com/eddarief Linkedin - https://www.linkedin.com/in/eddyzulkifly/ Github: https://github.com/eyzyly/eyzyly ADPList: https://adplist.org/mentors/eddy-zulkifly
🔗 CONNECT WITH DataTalksClub Join the community - https://datatalks.club/slack.html Subscribe to our Google calendar to have all our events in your calendar - https://calendar.google.com/calendar/r?cid=ZjhxaWRqbnEwamhzY3A4ODA5azFlZ2hzNjBAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ
Check other upcoming events - https://lu.ma/dtc-events LinkedIn - https://www.linkedin.com/company/datatalks-club/ Twitter - https://twitter.com/DataTalksClub Website - https://datatalks.club/
The role of data and AI engineers is more critical than ever. With organizations collecting massive amounts of data, the challenge lies in building efficient data infrastructures that can support AI systems and deliver actionable insights. But what does it take to become a successful data or AI engineer? How do you navigate the complex landscape of data tools and technologies? And what are the key skills and strategies needed to excel in this field? Deepak Goyal is a globally recognized authority in Cloud Data Engineering and AI. As the Founder & CEO of Azurelib Academy, he has built a trusted platform for advanced cloud education, empowering over 100,000 professionals and influencing data strategies across Fortune 500 companies. With over 17 years of leadership experience, Deepak has been at the forefront of designing and implementing scalable, real-world data solutions using cutting-edge technologies like Microsoft Azure, Databricks, and Generative AI. In the episode, Richie and Deepak explore the fundamentals of data engineering, the critical skills needed, the intersection with AI roles, career paths, and essential soft skills. They also discuss the hiring process, interview tips, and the importance of continuous learning in a rapidly evolving field, and much more.
Links Mentioned in the Show:
AzureLib
AzureLib Academy
Connect with Deepak
Get Certified! Azure Fundamentals
Related Episode: Effective Data Engineering with Liya Aizenberg, Director of Data Engineering at Away
Sign up to attend RADAR: Skills Edition
New to DataCamp? Learn on the go using the DataCamp mobile app
Empower your business with world-class data and AI skills with DataCamp for business
The rapid expansion of data centers is reshaping the industry, requiring new approaches to design, safety, and leadership.
We’re excited to have Doug Mouton, former Senior Eng Lead, Datacenter Design Engineering and Construction at Meta, as a guest on this latest episode of the “Data Center Revolution” podcast. Doug joins us with key insights into leadership, adaptability, and the evolution of hyperscale data-center construction. He also shares his journey from military service to leading large-scale infrastructure projects in the data center industry, highlighting key transferable skills along the way.
Key Takeaways:
(07:54) Military mindset builds strong leaders. (14:25) Veterans thrive in high-pressure environments. (25:32) Katrina exposed disaster preparedness gaps. (35:16) Microsoft shifted to cost-effective data center designs. (43:56) Data centers face growing energy challenges. (54:26) Safety-first culture boosts efficiency and morale. (01:21:43) Data centers must transition to hybrid cooling solutions. (01:42:09) AI needs ethical guardrails.
Resources Mentioned:
Fidelis New Energy | Website - https://www.fidelisinfra.com
Microsoft Azure - https://azure.microsoft.com/en-us/
Meta - https://about.meta.com/
Jacobs - https://www.jacobs.com/
National Guard - https://nationalguard.com/
Jones Lang LaSalle - https://www.us.jll.com/
Thank you for listening to “Data Center Revolution.” Don’t forget to leave us a review and subscribe so you don’t miss an episode. To learn more about Overwatch, visit us at https://linktr.ee/overwatchmissioncritical
#DataCenterIndustry #NuclearEnergy #FutureOfDataCenters #AI
On this episode of the Data Unchained podcast, Desiree Campbell, Managing Director for HPC Americas from the Azure team at Microsoft, joins us to discuss Women in HPC, architecting and orchestrating data workflows across hybrid environments, and the importance of being a mentor in the tech industry.
#podcast #ai #data #innovation #datascience #datastorage #datacloudtechnology #global #international #hybridcloud #cloud #dataorchestration #hightech #tech #technology #technologynews
@Microsoft @MicrosoftAzure https://azure.microsoft.com/ Cyberpunk by jiglr | https://soundcloud.com/jiglrmusic Music promoted by https://www.free-stock-music.com Creative Commons Attribution 3.0 Unported License https://creativecommons.org/licenses/by/3.0/deed.en_US Hosted on Acast. See acast.com/privacy for more information.
By now, many of us are convinced that generative AI chatbots like ChatGPT are useful at work. However, many executives are rightfully worried about the risks from having business and customer conversations recorded by AI chatbot platforms. Some privacy and security-conscious organizations are going so far as to block these AI platforms completely. For organizations such as EY, a company that derives value from its intellectual property, leaders need to strike a balance between privacy and productivity. John Thompson runs the department for the ideation, design, development, implementation, & use of innovative Generative AI, Traditional AI, & Causal AI solutions, across all of EY's service lines, operating functions, geographies, & for EY's clients. His team has built the world's largest, secure, private LLM-based chat environment. John also runs the Marketing Sciences consultancy, advising clients on monetization strategies for data. He is the author of four books on data, including "Data for All" and "Causal Artificial Intelligence". Previously, he was the Global Head of AI at CSL Behring, an Adjunct Professor at Lake Forest Graduate School of Management, and an Executive Partner at Gartner. In the episode, Richie and John explore the adoption of GenAI at EY, data privacy and security, GenAI use cases and productivity improvements, GenAI for decision making, causal AI and synthetic data, industry trends and predictions and much more.
Links Mentioned in the Show:
Azure OpenAI
Causality by Judea Pearl
[Course] AI Ethics
Related Episode: Data & AI at Tesco with Venkat Raghavan, Director of Analytics and Science at Tesco
Catch John talking about AI Maturity this September
Rewatch sessions from RADAR: AI Edition
New to DataCamp? Learn on the go using the DataCamp mobile app
Empower your business with world-class data and AI skills with DataCamp for business
Summary
Data lakehouse architectures have been gaining significant adoption. To accelerate adoption in the enterprise Microsoft has created the Fabric platform, based on their OneLake architecture. In this episode Dipti Borkar shares her experiences working on the Fabric product team and explains the various use cases for the Fabric service.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

Your host is Tobias Macey and today I'm interviewing Dipti Borkar about her work on Microsoft Fabric and performing analytics on data without…
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Microsoft Fabric is and the story behind it?
Data lakes in various forms have been gaining significant popularity as a unified interface to an organization's analytics. What are the motivating factors that you see for that trend?
Microsoft has been investing heavily in open source in recent years, and the Fabric platform relies on several open components. What are the benefits of layering on top of existing technologies rather than building a fully custom solution?
What are the elements of Fabric that were engineered specifically for the service?
What are the most interesting/complicated integration challenges?
How has your prior experience with Ahana and Presto informed your current work at Microsoft?
AI plays a substantial role in the product. What are the benefits of embedding Copilot into the data engine?
What are the challenges in terms of safety and reliability?
What are the most interesting, innovative, or unexpected ways that you have seen the Fabric platform used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data lakes generally, and Fabric specifically?
When is Fabric the wrong choice?
What do you have planned for the future of data lake analytics?
Contact Info
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
Links
Microsoft Fabric
Ahana episode
DB2 Distributed
Spark
Presto
Azure Data
MAD Landscape
  Podcast Episode
  ML Podcast Episode
Tableau
dbt
Medallion Architecture
Microsoft OneLake
ORC
Parquet
Avro
Delta Lake
Iceberg
  Podcast Episode
Hudi
  Podcast Episode
Hadoop
PowerBI
  Podcast Episode
Velox
Gluten
Apache XTable
GraphQL
Formula 1
McLaren
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
Starburst: 
This episode is brought to you by Starburst - an end-to-end data lakehouse platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino…
Welcome to the cozy corner of the tech world where ones and zeros mingle with casual chit-chat. Datatopics Unplugged is your go-to spot for relaxed discussions around tech, news, data, and society. Dive into conversations that should flow as smoothly as your morning coffee (but don't), where industry insights meet laid-back banter. Whether you're a data aficionado or just someone curious about the digital age, pull up a chair, relax, and let's get into the heart of data, unplugged style! In this episode #42, titled "Unraveling the Fabric of Data: Microsoft's Ecosystem and Beyond," we're joined once again by the tech maestro and newly minted Microsoft MVP, Sam Debruyn. Sam brings to the table a bevy of updates from his recent accolades to the intricacies of Microsoft's data platforms and the world of SQL.
Biz Buzz: From Reddit's IPO to the performance versus utility debate in database selection, we dissect the big moves shaking up the business side of tech. Read about Reddit's IPO.
Microsoft's Fabric Unraveled: Get the lowdown on Microsoft's Fabric, the one-stop AI platform, as Sam Debruyn gives us a deep dive into its capabilities and integration with Azure Databricks and Power BI. Discover more about Fabric and dive into Sam's blog.
dbt Developments: Sam talks dbt and the exciting new SQL tool for data pipeline building with upcoming unit testing capabilities.
Polaris Project: Delving into Microsoft's internal storage projects, including insights on Polaris and its integration with Synapse SQL. Read the paper here.
AI Advances: From the release of Grok-1 and Apple's MM1 AI model to GPT-4's trillion parameters, we discuss the leaps in artificial intelligence.
Stability in Motion: After OpenAI's Sora, we look at Stability AI's new venture into motion with Stable Video. Check out Stable Video.
Benchmarking Debate: A critical look at performance benchmarks in database selection and the ongoing search for the 'best' database. Contemplate benchmarking perspectives.
Versioning Philosophy: Hot takes on semantic versioning and what stability really means in software development. Dive into Semantic Versioning.
We’ve heard so much about the value and capabilities of generative AI over the past year, and we’ve all become accustomed to the chat interfaces of our preferred models. One of the main concerns many of us have had has been privacy. Is OpenAI keeping the data and information I give to ChatGPT secure? One of the touted solutions to this problem is running LLMs locally on your own machine, but with the hardware cost that comes with it, running LLMs locally has not been possible for many of us. That might now be starting to change. Nuri Cankaya is VP of AI Marketing at Intel. Prior to Intel, Nuri spent 16 years at Microsoft, starting out as a Technical Evangelist, and leaving the organization as the Senior Director of Product Marketing. He ran the GTM team that helped generate adoption of GPT in Microsoft Azure products. La Tiffaney Santucci is Intel’s AI Marketing Director, specializing in their Edge and Client products. La Tiffaney has spent over a decade at Intel, focusing on partnerships with Dell, Google, Amazon, and Microsoft. In the episode, Richie, Nuri and La Tiffaney explore AI’s impact on marketing analytics, the adoption of AI in the enterprise, how AI is being integrated into existing products, the workflow for implementing AI into business processes and the challenges that come with it, the importance of edge AI for instant decision-making in use cases like self-driving cars, the emergence of AI engineering as a distinct field of work, the democratization of AI, what the state of AGI might look like in the near future and much more. About the AI and the Modern Data Stack DataFramed Series This week we’re releasing 4 episodes focused on how AI is changing the modern data stack and the analytics profession at large. The modern data stack is often an ambiguous and all-encompassing term, so we intentionally wanted to cover the impact of AI on the modern data stack from different angles.
Here’s what you can expect:
Why the Future of AI in Data will be Weird with Benn Stancil, CTO at Mode & Field CTO at ThoughtSpot — Covering how AI will change analytics workflows and tools
How Databricks is Transforming Data Warehousing and AI with Ari Kaplan, Head Evangelist & Robin Sutara, Field CTO at Databricks — Covering Databricks, data intelligence and how AI tools are changing data democratization
Adding AI to the Data Warehouse with Sridhar Ramaswamy, CEO at Snowflake — Covering Snowflake and its uses, how generative AI is changing the attitudes of leaders towards data, and how to improve your data management
Accelerating AI Workflows with Nuri Cankaya, VP of AI Marketing & La Tiffaney Santucci, AI Marketing Director at Intel — Covering AI’s impact on marketing analytics, how AI is being integrated into existing products, and the democratization of AI
Links Mentioned in the Show:
Intel OpenVINO™ toolkit
Intel Developer Clouds for Accelerated Computing
AWS Re:Invent
[Course] Implementing AI Solutions in Business
Related Episode: Intel CTO Steve Orrin on How Governments Can Navigate the Data & AI Revolution
Sign up to attend RADAR: Analytics Edition (https://www.datacamp.com/radar-analytics-edition)
You may have heard about the launch of the new Brazilian public cloud, which stirred up plenty of rumors in the tech market. By popular demand from the community, you now have the chance to learn the strategy, and a bit more, behind Magalu Cloud.
In this episode of Data Hackers — the largest AI and Data Science community in Brazil — we invited Vaner Vendramini, Field CTO at Magalu Cloud, to demystify everything behind Magalu's launch of the first Brazilian hyperscale cloud.
Remember that you can find all Data Hackers community podcasts on Spotify, iTunes, Google Podcasts, Castbox, and many other platforms. If you like, you can also listen to the episode right here in this post!
Meet our guest:
Vaner Vendramini — Field CTO at Magalu Cloud
Our Data Hackers panel:
Monique Femme — Head of Community Management at Data Hackers. Allan Senne — Co-founder of Data Hackers and Co-founder & CTO at Dadosfera.
Paulo Vasconcellos — Co-founder of Data Hackers and Principal Data Scientist at Hotmart. Gabriel Lages — Co-founder of Data Hackers and Data & Analytics Sr. Director at Hotmart.
Mentioned in the episode — reference links:
About the Magalu Cloud launch event: https://www.magazineluiza.com.br/blog-da-lu/c/dl/dldc/magalu-cloud-a-nuvem-do-magazine-luiza/12434/
The German cloud mentioned by Vaner: https://www.stackit.de/en/
McKinsey study on the cloud computing market in 2030: https://www.mckinsey.com/br/our-insights/all-insights/computacao-em-nuvem-2030
Cloud market-share progression from 2018 to 2021, from Digital Cloud Training: https://digitalcloud.training/comparison-of-aws-vs-azure-vs-google/
Magalu Cloud partners page: https://magalu.cloud/solucoes/
Summary
A significant portion of the time spent by data engineering teams is on managing the workflows and operations of their pipelines. DataOps has arisen as a parallel set of practices to that of DevOps teams as a means of reducing wasted effort. Agile Data Engine is a platform designed to handle the infrastructure side of the DataOps equation, as well as providing the insights that you need to manage the human side of the workflow. In this episode Tevje Olin explains how the platform is implemented, the features that it provides to reduce the amount of effort required to keep your pipelines running, and how you can start using it in your own team.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.

RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack

Your host is Tobias Macey and today I'm interviewing Tevje Olin about Agile Data Engine, a platform that combines data modeling, transformations, continuous delivery and workload orchestration to help you manage your data products and the whole lifecycle of your warehouse.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Agile Data Engine is and the story behind it?
What are some of the tools and architectures that an organization might be able to replace with Agile Data Engine?
How does the unified experience of Agile Data Engine change the way that teams think about the lifecycle of their data?
What are some of the types of experiments that are enabled by reduced operational overhead?
What does CI/CD look like for a data warehouse?
How is it different from CI/CD for software applications?
Can you describe how Agile Data Engine is architected?
How have the design and goals of the system changed since you first started working on it?
What are the components that you needed to develop in-house to enable your platform goals?
What are the changes in the broader data ecosystem that have had the most influence on your product goals and customer adoption?
Can you describe the workflow for a team that is using Agile Data Engine to power their business analytics?
What are some of the insights that you generate to help your customers understand how to improve their processes or identify new opportunities?
In your "about" page it mentions the unique approaches that you take for warehouse automation. How do your practices differ from the rest of the industry?
How have changes in the adoption/implementation of ML and AI impacted the ways that your customers exercise your platform?
What are the most interesting, innovative, or unexpected ways that you have seen the Agile Data Engine platform used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Agile Data Engine?
When is Agile Data Engine the wrong choice?
What do you have planned for the future of Agile Data Engine?
Guest Contact Info
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
About Agile Data Engine
Agile Data Engine unlocks the potential of your data to drive business value - in a rapidly changing world. Agile Data Engine is a DataOps Management platform for designing, deploying, operating and managing data products, and managing the whole lifecycle of a data warehouse. It combines data modeling, transformations, continuous delivery and workload orchestration into the same platform.
Links
Agile Data Engine
Bill Inmon
Ralph Kimball
Snowflake
Redshift
BigQuery
Azure Synapse
Airflow
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
Rudderstack: 
RudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.
RudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team.
RudderStack also supports real-time use cases. You can implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again.
Visit dataengineeringpodcast.com/rudderstack to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.

Support Data Engineering Podcast
Summary The problems that are easiest to fix are the ones that you prevent from happening in the first place. Sifflet is a platform that brings your entire data stack into focus to improve the reliability of your data assets and empower collaboration across your teams. In this episode CEO and founder Salma Bakouk shares her views on the causes and impacts of "data entropy" and how you can tame it before it leads to failures.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.

RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control.
Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder. Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $5,000 when you become a customer. Your host is Tobias Macey and today I’m interviewing Salma Bakouk about achieving data reliability and reducing entropy within your data stack with Sifflet
Interview
Introduction How did you get involved in the area of data management? Can you describe what Sifflet is and the st
Summary CreditKarma builds data products that help consumers take advantage of their credit and financial capabilities. To make that possible they need a reliable data platform that empowers all of the organization’s stakeholders. In this episode Vishnu Venkataraman shares the journey that he and his team have taken to build and evolve their systems and improve the product offerings that they are able to support.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey and today I’m interviewing Vishnu Venkataraman about building the data platform at CreditKarma and the forces that shaped the design
Interview
Introduction How did you get involved in the area of data management? Can you describe what CreditKarma is and the role
Summary Despite the best efforts of data engineers, data is as messy as the real world. Entity resolution and fuzzy matching are powerful utilities for cleaning up data from disconnected sources, but they have typically required custom development and training of machine learning models. Sonal Goyal created and open-sourced Zingg as a generalized tool for data mastering and entity resolution to reduce the effort involved in adopting those practices. In this episode she shares the story behind the project, the details of how it is implemented, and how you can use it for your own data projects.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey and today I’m interviewing Sonal Goyal about Zingg, an open source entity resolution framework
Summary One of the most impactful technologies for data analytics in recent years has been dbt. It’s hard to have a conversation about data engineering or analysis without mentioning it. Despite its widespread adoption there are still rough edges in its workflow that cause friction for data analysts. To help simplify the adoption and management of dbt projects Nandam Karthik helped create Optimus. In this episode he shares his experiences working with organizations to adopt analytics engineering patterns and the ways that Optimus and dbt were combined to let data analysts deliver insights without the roadblocks of complex pipeline management.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey and today I’m interviewing Nandam Karthik
Summary The database market has seen unprecedented activity in recent years, with new options addressing a variety of needs being introduced on a nearly constant basis. Despite that, there are a handful of databases that continue to be adopted due to their proven reliability and robust features. MariaDB is one of those default options that has continued to grow and innovate while offering a familiar and stable experience. In this episode field CTO Manjot Singh shares his experiences as an early user of MySQL and MariaDB and explains how the suite of products being built on top of the open source foundation address the growing needs for advanced storage and analytical capabilities.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management. You wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis, and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free "In Data We Trust World Tour" t-shirt. RudderStack helps you build a customer data platform on your warehouse or data lake. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free, or just get the free t-shirt for being a listener of the Data Engineering Podcast, at dataengineeringpodcast.com/rudder.
Summary The "data lakehouse" architecture balances the scalability and flexibility of data lakes with the ease of use and transaction support of data warehouses. Dremio is one of the companies leading the development of products and services that support the open lakehouse. In this episode Jason Hughes explains what it means for a lakehouse to be "open" and describes the different components that the Dremio team build and contribute to.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey and today I’m interviewing Jason Hughes about the work that Dremio is doing to support the open lakehouse
Interview
Introduction How did you get involved in the area of data management? Can you d