talk-data.com

Topic

GCP

Google Cloud Platform (GCP)

cloud cloud_provider infrastructure services

55 tagged

Activity Trend: 31 peak/qtr (2020-Q1 to 2026-Q1)

Activities

55 activities · Newest first

Brought to You By:

• Statsig — The unified platform for flags, analytics, experiments, and more. Something interesting is happening with the latest generation of tech giants. Rather than building advanced experimentation tools themselves, companies like Anthropic, Figma, Notion and a bunch of others… are just using Statsig. Statsig has rebuilt this entire suite of data tools that was available at maybe 10 or 15 giants until now. Check out Statsig.

• Linear — The system for modern product development. Linear is just so fast to use – and it enables velocity in product workflows. Companies like Perplexity and OpenAI have already switched over, because simplicity scales. Go ahead and check out Linear and see why it feels like a breeze to use.

What is it really like to be an engineer at Google? In this special deep dive episode, we unpack how engineering at Google actually works. We spent months researching the engineering culture of the search giant, and talked with 20+ current and former Googlers to bring you this deepdive with Elin Nilsson, tech industry researcher for The Pragmatic Engineer and a former Google intern.

Google has always been an engineering-driven organization. We talk about its custom stack and tools, the design-doc culture, and the performance and promotion systems that define career growth. We also explore the culture that feels built for engineers: generous perks, a surprisingly light on-call setup often considered the best in the industry, and a deep focus on solving technical problems at scale. If you are thinking about applying to Google or are curious about how the company’s engineering culture has evolved, this episode takes a clear look at what it was like to work at Google in the past versus today, and who is a good fit for today’s Google.

Jump to interesting parts:
(13:50) Tech stack
(1:05:08) Performance reviews (GRAD)
(2:07:03) The culture of continuously rewriting things

Timestamps:
(00:00) Intro
(01:44) Stats about Google
(11:41) The shared culture across Google
(13:50) Tech stack
(34:33) Internal developer tools and monorepo
(43:17) The downsides of having so many internal tools at Google
(45:29) Perks
(55:37) Engineering roles
(1:02:32) Levels at Google
(1:05:08) Performance reviews (GRAD)
(1:13:05) Readability
(1:16:18) Promotions
(1:25:46) Design docs
(1:32:30) OKRs
(1:44:43) Googlers, Nooglers, ReGooglers
(1:57:27) Google Cloud
(2:03:49) Internal transfers
(2:07:03) Rewrites
(2:10:19) Open source
(2:14:57) Culture shift
(2:31:10) Making the most of Google, as an engineer
(2:39:25) Landing a job at Google

The Pragmatic Engineer deepdives relevant for this episode:
• Inside Google’s engineering culture
• Oncall at Google
• Performance calibrations at tech companies
• Promotions and tooling at Google
• How Kubernetes is built
• The man behind the Big Tech comics: Google cartoonist Manu Cornet

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email [email protected].

Get full access to The Pragmatic Engineer at newsletter.pragmaticengineer.com/subscribe

Elliot Foreman and Andrew DeLave from ProsperOps joined Yuliia and Dumky to discuss automated cloud cost optimization through commitment management. As ProsperOps' Google go-to-market director and senior FinOps specialist, respectively, they explain how their platform manages over $4 billion in cloud spend by automating reserved instances, committed use discounts, and savings plans across AWS, Azure, and Google Cloud. The conversation covers the psychology behind commitment hesitation, the break-even math of cloud discounts, optimizing commitments for workload volatility, and why they avoid AI in favor of deterministic algorithms for financial decisions. They share insights on managing complex multi-cloud environments, the human-vs-automation debate in FinOps, and practical strategies for reducing cloud costs while mitigating commitment risks.
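To make the break-even point concrete, here is a minimal sketch of the arithmetic behind commitment discounts. The discount rates are illustrative placeholders, not ProsperOps numbers or any provider's actual pricing.

```python
# Illustrative break-even calculation for a cloud commitment discount.
# Discount rates below are hypothetical; real rates vary by provider and term.

def breakeven_utilization(discount_rate: float) -> float:
    """Minimum fraction of a commitment you must actually use for it
    to beat on-demand pricing.

    Committed cost is fixed: (1 - discount) * commitment_size.
    On-demand cost scales with usage: utilization * commitment_size.
    Setting them equal gives break-even utilization = 1 - discount.
    """
    return 1.0 - discount_rate

# A commitment at a 37% discount only pays off if you actually use
# at least 63% of what you committed to.
for discount in (0.20, 0.37, 0.55):
    print(f"{discount:.0%} discount -> break-even at "
          f"{breakeven_utilization(discount):.0%} utilization")
```

This is also why workload volatility matters: the spikier the usage, the lower the safe commitment level before utilization drops under the break-even line.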

Try Keboola 👉 https://www.keboola.com/mcp?utm_campaign=FY25_Q2_RoW_Marketing_Events_Webinar_Keboola_MCP_Server_Launch_June&utm_source=Youtube&utm_medium=Avery

Today, we'll create an entire data pipeline from scratch without writing a single line of code! Using the Keboola MCP server and Claude AI, we’ll extract data from my FindADataJob.com RSS feed, transform it, load it into Google BigQuery, and visualize it with Streamlit. This is the future of data engineering!

Keboola MCP Integration: https://mcp.connection.us-east4.gcp.keboola.com/sse

I Analyzed Data Analyst Jobs to Find Out What Skills You ACTUALLY Need: https://www.youtube.com/watch?v=lo3VU1srV1E&t=212s

💌 Join 10k+ aspiring data analysts & get my tips in your inbox weekly 👉 https://www.datacareerjumpstart.com/newsletter
🆘 Feeling stuck in your data journey? Come to my next free "How to Land Your First Data Job" training 👉 https://www.datacareerjumpstart.com/training
👩‍💻 Want to land a data job in less than 90 days? 👉 https://www.datacareerjumpstart.com/daa
👔 Ace The Interview with Confidence 👉 https://www.datacareerjumpstart.com/interviewsimulator

⌚ TIMESTAMPS
00:00 - Introduction
00:54 - Definition of Basic Data Engineering Terms
02:26 - Keboola MCP and Its Capabilities
07:48 - Extracting Data from RSS Feed
12:43 - Transforming and Cleaning the Data
19:19 - Aggregating and Analyzing Data
23:19 - Scheduling and Automating the Pipeline
25:04 - Visualizing Data with Streamlit
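For readers curious what the extract-and-load step looks like under the hood, here is a hedged, hand-rolled sketch of the same idea (the episode itself does it with no code, via Keboola MCP and Claude). The feed URL, project, dataset, and table names are assumptions for illustration only.

```python
# Minimal RSS -> BigQuery load, a hand-rolled stand-in for the no-code
# pipeline built in the episode. All resource names are hypothetical.
import feedparser
from google.cloud import bigquery

feed = feedparser.parse("https://findadatajob.com/feed")  # assumed feed URL
rows = [
    {"title": e.get("title"), "link": e.get("link"), "published": e.get("published")}
    for e in feed.entries
]

client = bigquery.Client()  # picks up default GCP credentials
table_id = "my-project.jobs_dataset.raw_rss_jobs"  # hypothetical dataset/table
errors = client.insert_rows_json(table_id, rows)   # streaming insert
print(errors if errors else f"Loaded {len(rows)} rows into {table_id}")
```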

🔗 CONNECT WITH AVERY 🎥 YouTube Channel: https://www.youtube.com/@averysmith 🤝 LinkedIn: https://www.linkedin.com/in/averyjsmith/ 📸 Instagram: https://instagram.com/datacareerjumpstart 🎵 TikTok: https://www.tiktok.com/@verydata 💻 Website: https://www.datacareerjumpstart.com/ Mentioned in this episode: Join the last cohort of 2025! The LAST cohort of The Data Analytics Accelerator for 2025 kicks off on Monday, December 8th and enrollment is officially open!

To celebrate the end of the year, we’re running a special End-of-Year Sale, where you’ll get: ✅ A discount on your enrollment 🎁 6 bonus gifts, including job listings, interview prep, AI tools + more

If your goal is to land a data job in 2026, this is your chance to get ahead of the competition and start strong.

👉 Join the December Cohort & Claim Your Bonuses: https://www.datacareerjumpstart.com/daa

Just wrapped up a whirlwind tour, giving a workshop in Atlanta and then attending Google Cloud Next. Back-to-back nonstop action, and I'm glad to be home for a bit.

While at Next, I had a conversation with another tech old-timer friend. We talked about how much fun we're having using AI as a coding assistant. I'm having fun coming up with wild stuff and seeing if it's possible to build with code. AI's made coding fun again!

📈 This episode is brought to you by GoodData. Design and deploy custom data applications and integrate AI-assisted analytics capabilities wherever your users need them.

For more information, visit https://www.gooddata.com

In this podcast episode, we talked with Eddy Zulkifly about "From Supply Chain Management to Digital Warehousing and FinOps."

About the Speaker: Eddy Zulkifly is a Staff Data Engineer at Kinaxis, building robust data platforms across Google Cloud, Azure, and AWS. With a decade of experience in data, he actively shares his expertise as a Mentor on ADPList and Teaching Assistant at Uplimit. Previously, he was a Senior Data Engineer at Home Depot, specializing in e-commerce and supply chain analytics. Currently pursuing a Master’s in Analytics at the Georgia Institute of Technology, Eddy is also passionate about open-source data projects and enjoys watching/exploring the analytics behind the Fantasy Premier League.

In this episode, we dive into the world of data engineering and FinOps with Eddy Zulkifly, Staff Data Engineer at Kinaxis. Eddy shares his unconventional career journey—from optimizing physical warehouses with Excel to building digital data platforms in the cloud.

🕒 TIMECODES
0:00 Eddy’s career journey: From supply chain to data engineering
8:18 Tools & learning: Excel, Docker, and transitioning to data engineering
21:57 Physical vs. digital warehousing: Analogies and key differences
31:40 Introduction to FinOps: Cloud cost optimization and vendor negotiations
40:18 Resources for FinOps: Certifications and the FinOps Foundation
45:12 Standardizing cloud cost reporting across AWS/GCP/Azure (see the sketch below)
50:04 Eddy’s master’s degree and closing thoughts
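The standardization topic at 45:12 is easy to picture with a small sketch: each provider's billing export uses different column names, so a common first step is mapping them into one shared schema. The column names below are simplified stand-ins for the real export fields, which are far richer.

```python
# Hedged sketch: normalizing AWS/GCP/Azure billing exports into one schema.
import pandas as pd

# Illustrative column mappings; real export schemas have many more fields.
COLUMN_MAPS = {
    "aws": {"line_item_usage_account_id": "account",
            "line_item_unblended_cost": "cost",
            "product_servicecode": "service"},
    "gcp": {"project_id": "account",
            "cost": "cost",
            "service_description": "service"},
    "azure": {"SubscriptionId": "account",
              "CostInBillingCurrency": "cost",
              "MeterCategory": "service"},
}

def normalize(df: pd.DataFrame, provider: str) -> pd.DataFrame:
    """Rename provider-specific billing columns to one shared schema."""
    mapping = COLUMN_MAPS[provider]
    out = df[list(mapping)].rename(columns=mapping)
    out["provider"] = provider
    return out

# combined = pd.concat([normalize(aws_df, "aws"), normalize(gcp_df, "gcp")])
```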

🔗 CONNECT WITH EDDY Twitter - https://x.com/eddarief LinkedIn - https://www.linkedin.com/in/eddyzulkifly/ GitHub - https://github.com/eyzyly/eyzyly ADPList - https://adplist.org/mentors/eddy-zulkifly

🔗 CONNECT WITH DataTalksClub Join the community - https://datatalks.club/slack.html Subscribe to our Google calendar to have all our events in your calendar - https://calendar.google.com/calendar/r?cid=ZjhxaWRqbnEwamhzY3A4ODA5azFlZ2hzNjBAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ

Check other upcoming events - https://lu.ma/dtc-events LinkedIn - https://www.linkedin.com/company/datatalks-club/ Twitter - https://twitter.com/DataTalksClub Website - https://datatalks.club/

Kir Titievsky, Product Manager at Google Cloud with extensive experience in streaming and storage infrastructure, joined Yuliia and Dumky to talk about streaming. Drawing from his work with Apache Kafka, Cloud Pub/Sub, Dataflow and Cloud Storage since 2015, Kir explains the fundamental differences between streaming and micro-batch processing. He challenges common misconceptions about streaming costs, explaining how streaming can be significantly less expensive than batch processing for many use cases. Kir shares insights on the "service bus architecture" revival, discussing how modern distributed messaging systems have solved historic bottlenecks while creating new opportunities for business and performance needs.
Kir's Medium - https://medium.com/@kir-gcp
Kir's LinkedIn page - https://www.linkedin.com/in/kir-titievsky-%F0%9F%87%BA%F0%9F%87%A6-7775052/
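Kir's distinction between streaming and micro-batch is easier to see in code. Here is a minimal Cloud Pub/Sub sketch, with hypothetical project, topic, and subscription names: the consumer callback fires per message as it arrives, with no batch window to wait out.

```python
# Minimal publish/consume sketch with Cloud Pub/Sub.
from google.cloud import pubsub_v1

project, topic, subscription = "my-project", "events", "events-sub"  # hypothetical

# Producer: each event is published individually, no batch accumulation.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project, topic)
publisher.publish(topic_path, b"order_created", order_id="42").result()

# Consumer: the callback runs per message, as soon as it arrives.
subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(project, subscription)

def handle(message):
    print("processed on arrival:", message.data)  # no batch window
    message.ack()

streaming_pull = subscriber.subscribe(sub_path, callback=handle)
# streaming_pull.result(timeout=30)  # block the main thread to receive
```

A micro-batch system would instead accumulate messages and process them on a fixed interval, which is exactly the latency (and often cost) difference discussed in the episode.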

Serhii Sokolenko, founder at Tower Dev and former product manager at tech giants like Google Cloud, Snowflake, and Databricks, joined Yuliia to discuss his journey building a next-generation compute platform. Tower Dev aims to simplify data processing for data engineers who work with Python. Serhii explains how Tower addresses three key market trends: the integration of data engineering with AI through Python, the movement away from complex distributed processing frameworks, and users' desire for flexibility across different data platforms. He explains how Tower makes Python data applications more accessible by eliminating the need to learn complex frameworks while automatically scaling infrastructure. Serhii also shares his perspective on the future of data engineering, noting the ways AI will transform the profession.
Tower Dev - https://tower.dev/
Serhii's LinkedIn - https://www.linkedin.com/in/ssokolenko/

Deepti Srivastava, Founder of Snow Leopard AI and former Spanner Product Lead at Google Cloud, joined Yuliia to chat about what's wrong with current approaches to AI integration. Deepti introduces a paradigm shift away from ETL pipelines towards federated, real-time data access for AI applications. She explains how Snow Leopard's intelligent data retrieval platform enables enterprises to connect AI systems directly to operational data sources without compromising security or freshness. Through practical examples, Deepti explains why conventional RAG approaches with vector stores are not good enough for business-critical AI applications, and how a systems-thinking approach to AI infrastructure can unlock greater value while reducing unnecessary data movement.
Deepti's LinkedIn - https://www.linkedin.com/in/thedeepti/
Snowleopard.ai - http://snowleopard.ai/
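A toy sketch of the pattern Deepti describes (not Snow Leopard's actual API): instead of embedding a stale copy of operational data into a vector store, the AI application calls a live query function at request time. sqlite3 stands in for any operational database; the table and file names are hypothetical.

```python
# Live federated lookup instead of a pre-built RAG copy of the data.
import sqlite3  # stand-in for any operational database

def live_order_status(order_id: int) -> dict:
    """Fetch fresh operational data at question time; no ETL copy involved."""
    with sqlite3.connect("orders.db") as conn:  # hypothetical database
        row = conn.execute(
            "SELECT status, updated_at FROM orders WHERE id = ?", (order_id,)
        ).fetchone()
    return {"status": row[0], "updated_at": row[1]} if row else {}

# An LLM application would register live_order_status as a tool and invoke
# it when asked "where is order 42?", so the answer is never stale.
```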

podcast_episode
by Richard He (Fundamenta – Data Consultancy), Yuliia Tkachova (Masthead Data)

Richard He, Founder at Fundamenta – Data Consultancy, former Engineering Director at Virgin Media O2, and creator of the Practical GCP YouTube channel, joined Yuliia to discuss the critical concept of the "cost of inaction" in data infrastructure modernization. Based on his two decades of experience, including several successful migrations, Richard emphasized the importance of proactive platform evolution over reactive large-scale migrations. He shared valuable insights on measuring ROI for platform teams and bridging the gap between technical execution and business strategy.
Richard's LinkedIn - https://www.linkedin.com/in/shenghuahe/

We’re improving DataFramed, and we need your help! We want to hear what you have to say about the show, and how we can make it more enjoyable for you—find out more here. Integrating generative AI with robust databases is becoming essential. As organizations face a plethora of database options and AI tools, making informed decisions is crucial for enhancing customer experiences and operational efficiency. How do you ensure your AI systems are powered by high-quality data? And how can these choices impact your organization's success? Gerrit Kazmaier is the VP and GM of Data Analytics at Google Cloud. Gerrit leads the development and design of Google Cloud’s data technology, which includes data warehousing and analytics. Gerrit’s mission is to build a unified data platform for all types of data processing as the foundation for the digital enterprise. Before joining Google, Gerrit served as President of the HANA & Analytics team at SAP in Germany and led the global Product, Solution & Engineering teams for Databases, Data Warehousing and Analytics. In 2015, Gerrit served as the Vice President of SAP Analytics Cloud in Vancouver, Canada. In this episode, Richie and Gerrit explore the transformative role of AI in data tools, the evolution of dashboards, the integration of AI with existing workflows, the challenges and opportunities in SQL code generation, the importance of a unified data platform, leveraging unstructured data, and much more.

Links Mentioned in the Show:
Google Cloud
Connect with Gerrit
Thinking Fast and Slow by Daniel Kahneman
Course: Introduction to GCP
Related Episode: Not Only Vector Databases: Putting Databases at the Heart of AI, with Andi Gutmans, VP and GM of Databases at Google
Rewatch sessions from RADAR: Forward Edition

New to DataCamp?
Learn on the go using the DataCamp mobile app
Empower your business with world-class data and AI skills with DataCamp for business

Generative AI and data are more interconnected than ever. If you want quality in your AI product, you need to be connected to a database with high quality data. But with so many database options and new AI tools emerging, how do you ensure you’re making the right choices for your organization? Whether it’s enhancing customer experiences or improving operational efficiency, understanding the role of your databases in powering AI is crucial. Andi Gutmans is the General Manager and Vice President for Databases at Google. Andi’s focus is on building, managing, and scaling the most innovative database services to deliver the industry’s leading data platform for businesses. Prior to joining Google, Andi was VP Analytics at AWS running services such as Amazon Redshift. Prior to his tenure at AWS, Andi served as CEO and co-founder of Zend Technologies, the commercial backer of open-source PHP. Andi has over 20 years of experience as an open source contributor and leader. He co-authored open source PHP. He is an emeritus member of the Apache Software Foundation and served on the Eclipse Foundation’s board of directors. He holds a bachelor’s degree in computer science from the Technion, Israel Institute of Technology. In the episode, Richie and Andi explore databases and their relationship with AI and GenAI, key features needed in databases for AI, GCP database services, AlloyDB, federated queries in Google Cloud, vector databases, graph databases, practical use cases of AI in databases and much more.

Links Mentioned in the Show:
GCP
Connect with Andi
AlloyDB for PostgreSQL
Course: Responsible AI Data Management
Related Episode: The Power of Vector Databases and Semantic Search with Elan Dekel, VP of Product at Pinecone
Sign up to RADAR: Forward Edition

New to DataCamp?
Learn on the go using the DataCamp mobile app
Empower your business with world-class data and AI skills with DataCamp for business
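The federated queries mentioned in the episode let BigQuery read live rows from an operational database (Cloud SQL or AlloyDB) through EXTERNAL_QUERY. Here is a hedged sketch of what that looks like; the connection resource and table names are hypothetical.

```python
# Federated query sketch: BigQuery joins its own table with live rows
# pulled from a Postgres database via a connection resource.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT c.customer_id, c.name, o.total
FROM EXTERNAL_QUERY(
  'my-project.us.crm-postgres-connection',   -- hypothetical connection resource
  'SELECT customer_id, name FROM customers'  -- executed in the external database
) AS c
JOIN `my-project.sales.orders` AS o          -- native BigQuery table
  ON c.customer_id = o.customer_id
"""
for row in client.query(sql).result():
    print(row.customer_id, row.total)
```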

Sayle Matthews leads the North American GCP Data Practice at DoiT International. Over the past year and a half, he has focused almost exclusively on BigQuery, helping hundreds of GCP customers optimize their usage and solve some of their biggest 'Big Data' challenges. Given his extensive experience with Google BigQuery billing, we sat down with him to discuss the recent billing changes and, most importantly, the impact these changes have had on the market, as he has observed while working with hundreds of clients of various sizes at DoiT.
Sayle's LinkedIn page - https://www.linkedin.com/in/sayle-matthews-522a795/
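At bottom, BigQuery billing decisions come down to paying per byte scanned (on-demand) versus paying for reserved slot capacity. A back-of-envelope sketch with placeholder prices shows the break-even; the numbers below are illustrative, not current list prices.

```python
# On-demand vs. reserved-capacity break-even for BigQuery-style billing.
# All prices are illustrative placeholders, not current list prices.
ON_DEMAND_PER_TIB = 6.25   # $ per TiB scanned (illustrative)
SLOT_HOUR_PRICE = 0.06     # $ per slot-hour (illustrative)
RESERVED_SLOTS = 100
HOURS_PER_MONTH = 730

capacity_cost = SLOT_HOUR_PRICE * RESERVED_SLOTS * HOURS_PER_MONTH
breakeven_tib = capacity_cost / ON_DEMAND_PER_TIB
print(f"Capacity: ${capacity_cost:,.0f}/month; it beats on-demand once "
      f"you scan more than ~{breakeven_tib:,.0f} TiB per month")
```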

Welcome to the cozy corner of the tech world where ones and zeros mingle with casual chit-chat. Datatopics Unplugged is your go-to spot for relaxed discussions around tech, news, data, and society. Dive into conversations that flow like your morning coffee, where industry insights meet laid-back banter. Whether you're a data aficionado or just curious about the digital age, pull up a chair and let's explore the heart of data, unplugged style!

Stack Overflow and OpenAI Deal Controversy: Discussing the partnership controversy, with users protesting the lack of an opt-out option and how this could reshape the platform. Look into Phind here.
Apple and OpenAI Rumors - could ChatGPT be the new Siri? Examining rumors of ChatGPT potentially replacing Siri, and Apple's AI strategy compared to Microsoft’s MAI-1. Check out more community opinions here.
Hello GPT-4o: Exploring the new era with OpenAI's GPT-4o that blends video, text, and audio for more dynamic human-AI interactions. Discussing AI's challenges under the European AI Act and ChatGPT’s use in daily life and dating apps like Bumble.
Claude Takes Europe: Claude 3 now available in the EU. How does it compare to ChatGPT in coding and conversation?
ElevenLabs' Music Generation AI: A look at ElevenLabs' AI for generating music and the broader AI music landscape. How are these algorithms transforming music creation? Check out the AI Song Contest here.
Google Cloud’s Big Oops with UniSuper: Unpack the shocking story of how Google Cloud accidentally wiped out UniSuper’s account. What does this mean for data security and redundancy strategies?
The Great CLI Debate: Is Python really the right choice for CLI tools? We spark the debate over Python vs. Go and Rust in building efficient CLI tools.

Today I’m chatting with Bruno Aziza, Head of Data & Analytics at Google Cloud. Bruno leads a team of outbound product managers in charge of BigQuery, Dataproc, Dataflow and Looker and we dive deep on what Bruno looks for in terms of skills for these leaders. Bruno describes the three patterns of operational alignment he’s observed in data product management, as well as why he feels ownership and customer obsession are two of the most important qualities a good product manager can have. Bruno and I also dive into how to effectively abstract the core problem you’re solving, as well as how to determine whether a problem might be solved in a better way. 

Highlights / Skip to:

Bruno introduces himself and explains how he created his “CarCast” podcast (00:45)
Bruno describes his role at Google, the product managers he leads, and the specific Google Cloud products in his portfolio (02:36)
What Bruno feels are the most important attributes to look for in a good data product manager (03:59)
Bruno details how a good product manager focuses on not only the core problem, but how the problem is currently solved and whether or not that’s acceptable (07:20)
What effectively abstracting the problem looks like in Bruno’s view and why he positions product management as a way to help users move forward in their career (12:38)
Why Bruno sees extracting value from data as the number one pain point for data teams and their respective companies (17:55)
Bruno gives his definition of a data product (21:42)
The three patterns Bruno has observed of operational alignment when it comes to data product management (27:57)
Bruno explains the best practices he’s seen for cross-team goal setting and problem-framing (35:30)

Quotes from Today’s Episode  

“What’s happening in the industry is really interesting. For people that are running data teams today and listening to us, the makeup of their teams is starting to look more like what we do [in] product management.” — Bruno Aziza (04:29)

“The problem is the problem, so focus on the problem, decompose the problem, look at the frictions that are acceptable, look at the frictions that are not acceptable, and look at how by assembling a solution, you can make it most seamless for the individual to go out and get the job done.” – Bruno Aziza (11:28)

“As a product manager, yes, we’re in the business of software, but in fact, I think you’re in the career management business. Your job is to make sure that whatever your customer’s job is that you’re making it so much easier that they, in fact, get so much more done, and by doing so they will get promoted, get the next job.” – Bruno Aziza (15:41)

“I think that is the task of any technology company, of any product manager that’s helping these technology companies: don’t be building a product that’s looking for a problem. Just start with the problem back and solution from that. Just make sure you understand the problem very well.” – Bruno Aziza (19:52)

“If you’re a data product manager today, you look at your data estate and you ask yourself, ‘What am I building to save money? When am I building to make money?’ If you can do both, that’s absolutely awesome. And so, the data product is an asset that has been built repeatedly by a team and generates value out of data.” – Bruno Aziza (23:12)

“[Machine learning is] hard because multiple teams have to work together, right? You got your business analyst over here, you’ve got your data scientists over there, they’re not even the same team. And so, sometimes you’re struggling with just the human aspect of it.” – Bruno Aziza (30:30)

“As a data leader, an IT leader, you got to think about those soft ways to accomplish the stuff that’s binary, that’s the hard [stuff], right? I always joke, the hard stuff is the soft stuff for people like us because we think about data, we think about logic, we think, ‘Okay if it makes sense, it will be implemented.’ For most of us, getting stuff done is through people. And people are emotional, how can you express the feeling of achieving that goal in emotional value?” – Bruno Aziza (37:36)

Links:
As referenced by Bruno, “Good Product Manager/Bad Product Manager”: https://a16z.com/2012/06/15/good-product-managerbad-product-manager/
LinkedIn: https://www.linkedin.com/in/brunoaziza/
Bruno’s Medium article on Competing Against Luck by Clayton M. Christensen: https://brunoaziza.medium.com/competing-against-luck-3daeee1c45d4
The Data CarCast on YouTube: https://www.youtube.com/playlist?list=PLRXGFo1urN648lrm8NOKXfrCHzvIHeYyw

Summary The problems that are easiest to fix are the ones that you prevent from happening in the first place. Sifflet is a platform that brings your entire data stack into focus to improve the reliability of your data assets and empower collaboration across your teams. In this episode CEO and founder Salma Bakouk shares her views on the causes and impacts of "data entropy" and how you can tame it before it leads to failures.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Salma Bakouk about achieving data reliability and reducing entropy within your data stack with Sifflet

Interview

Introduction How did you get involved in the area of data management? Can you describe what Sifflet is and the st

Summary CreditKarma builds data products that help consumers take advantage of their credit and financial capabilities. To make that possible they need a reliable data platform that empowers all of the organization’s stakeholders. In this episode Vishnu Venkataraman shares the journey that he and his team have taken to build and evolve their systems and improve the product offerings that they are able to support.

Your host is Tobias Macey and today I’m interviewing Vishnu Venkataraman about building the data platform at CreditKarma and the forces that shaped the design

Interview

Introduction How did you get involved in the area of data management? Can you describe what CreditKarma is and the role

Summary Despite the best efforts of data engineers, data is as messy as the real world. Entity resolution and fuzzy matching are powerful utilities for cleaning up data from disconnected sources, but it has typically required custom development and training machine learning models. Sonal Goyal created and open-sourced Zingg as a generalized tool for data mastering and entity resolution to reduce the effort involved in adopting those practices. In this episode she shares the story behind the project, the details of how it is implemented, and how you can use it for your own data projects.
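To see the core idea of entity resolution in miniature, here is a toy pair-scoring sketch using Python's standard difflib. It is not Zingg's algorithm: real tools learn field weights from labeled examples and block candidate pairs first; the records and the interpretation threshold here are arbitrary.

```python
# Toy fuzzy-match scoring for candidate record pairs.
from difflib import SequenceMatcher

def field_sim(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(r1: dict, r2: dict) -> float:
    # Equal-weight average over shared fields; tools like Zingg learn
    # field weights from labeled pairs instead of hardcoding them.
    fields = r1.keys() & r2.keys()
    return sum(field_sim(str(r1[f]), str(r2[f])) for f in fields) / len(fields)

a = {"name": "Jon Smith", "city": "New York"}
b = {"name": "John Smith", "city": "NYC"}
print(f"match score = {match_score(a, b):.2f}")  # higher -> more likely duplicates
```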

Your host is Tobias Macey and today I’m interviewing Sonal Goyal about Zingg, an open source entity resolution framework

Summary One of the most impactful technologies for data analytics in recent years has been dbt. It’s hard to have a conversation about data engineering or analysis without mentioning it. Despite its widespread adoption there are still rough edges in its workflow that cause friction for data analysts. To help simplify the adoption and management of dbt projects Nandam Karthik helped create Optimus. In this episode he shares his experiences working with organizations to adopt analytics engineering patterns and the ways that Optimus and dbt were combined to let data analysts deliver insights without the roadblocks of complex pipeline management.

Your host is Tobias Macey and today I’m interviewing Nandam Karthik

Summary The database market has seen unprecedented activity in recent years, with new options addressing a variety of needs being introduced on a nearly constant basis. Despite that, there are a handful of databases that continue to be adopted due to their proven reliability and robust features. MariaDB is one of those default options that has continued to grow and innovate while offering a familiar and stable experience. In this episode field CTO Manjot Singh shares his experiences as an early user of MySQL and MariaDB and explains how the suite of products being built on top of the open source foundation address the growing needs for advanced storage and analytical capabilities.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
You wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis, and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free "In Data We Trust World Tour" t-shirt.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.

Summary The "data lakehouse" architecture balances the scalability and flexibility of data lakes with the ease of use and transaction support of data warehouses. Dremio is one of the companies leading the development of products and services that support the open lakehouse. In this episode Jason Hughes explains what it means for a lakehouse to be "open" and describes the different components that the Dremio team build and contribute to.

Your host is Tobias Macey and today I’m interviewing Jason Hughes about the work that Dremio is doing to support the open lakehouse

Interview

Introduction How did you get involved in the area of data management? Can you d