talk-data.com

Topic

Azure

Microsoft Azure

cloud cloud_provider microsoft infrastructure

723 tagged

Activity Trend

278 peak/qtr (2020-Q1 to 2026-Q1)

Activities

723 activities · Newest first

Real-Time Streaming Solution for Call Center Analytics: Business Challenges and Technical Enablement

A large international client with a business footprint in North America, Europe, and Africa approached us to design and implement a real-time streaming solution for its call center, which handles incoming and outgoing client calls. The client had a previous bad experience with another vendor, who overpromised and underdelivered on the latency of the streaming solution: the pipeline they delivered was over-complex, integrating many services, and data took over five minutes to reach the visualization layer.

Our immediate challenges involved gaining the client's trust and proving that our design and implementation quality would surpass their previous experience. To resolve the immediate problem of the overly complicated pipeline design, we deployed a Databricks Lakehouse architecture with Azure Databricks at the center of the solution. Our reference architecture integrated Genesys Cloud → App Services → Event Hub → Databricks → Data Lake → Power BI.

The streaming solution proved to be low latency (seconds) during the proof-of-value (POV) stage, which led to productionalizing the pipeline: deploying jobs and a Delta Live Tables (DLT) pipeline, including a multi-notebook workflow, along with the business and performance metrics dashboards that call center staff rely on for day-to-day performance monitoring and improvement.
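
The talk's actual code is not public, but the Event Hub → Databricks hop in an architecture like this is commonly implemented with Spark Structured Streaming reading from Event Hubs' Kafka-compatible endpoint. A minimal sketch of that pattern, with placeholder names, paths, and connection string:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # provided automatically on Databricks

# Placeholders: the namespace, hub name, and connection string are illustrative only.
EH_NAMESPACE = "callcenter-ns"
EH_CONN = "Endpoint=sb://callcenter-ns.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=..."

raw = (
    spark.readStream.format("kafka")
    # Event Hubs exposes a Kafka-compatible endpoint on port 9093
    .option("kafka.bootstrap.servers", f"{EH_NAMESPACE}.servicebus.windows.net:9093")
    .option("subscribe", "call-events")  # the Event Hub name acts as the Kafka topic
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option(
        "kafka.sasl.jaas.config",
        # On Databricks the Kafka client classes are shaded under "kafkashaded."
        "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="$ConnectionString" password="{EH_CONN}";',
    )
    .load()
)

# Land raw events in a bronze Delta table; curated layers feed the Power BI dashboards.
(
    raw.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/datalake/_checkpoints/call_events")
    .start("/mnt/datalake/bronze/call_events")
)
```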

Talk by: Natalia Demidova

Delta Sharing: The Key Data Mesh Enabler

Data Mesh is an emerging architecture pattern that challenges the centralized data platform approach by empowering different engineering teams to own the data products in a specific business domain. One of the keys to the success of any Data Mesh initiative is selecting the right protocol for Data Sharing between different business data domains that could potentially be implemented through different technologies and cloud providers.

In this session you will learn how the Delta Sharing protocol and the Delta table format have enabled the historically stuck-in-the-past energy and construction industry to be catapulted into the 21st century by way of a modern Data Mesh implementation based on Azure Databricks.
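
For context on what consuming a share looks like, here is a minimal sketch using the open source delta-sharing Python client; the profile file comes from the data provider, and the share, schema, and table names below are hypothetical.

```python
import delta_sharing

# Profile file issued by the data provider; it contains the sharing
# server endpoint and a bearer token.
profile_file = "config.share"

# Enumerate everything the provider has shared with us
client = delta_sharing.SharingClient(profile_file)
print(client.list_all_tables())

# Tables are addressed as <profile>#<share>.<schema>.<table>
table_url = f"{profile_file}#energy_share.operations.meter_readings"

# Load the shared Delta table into pandas, regardless of which cloud or
# platform the provider runs on
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```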

Talk by: Francesco Pizzolon

Here’s more to explore: A New Approach to Data Sharing: https://dbricks.co/44eUnT1

Scaling MLOps for a Demand Forecasting Across Multiple Markets for a Large CPG

In this session, we look at how one of the world’s largest CPG companies set up a scalable MLOps pipeline for a demand forecasting use case that predicted demand at 100,000+ DFUs (demand forecasting units) on a weekly basis across more than 20 markets. This implementation resulted in significant cost savings through improved productivity, reduced cloud usage, and faster time to value, among other benefits. You will leave this session with a clearer picture of the following:

  • Best practices for scaling MLOps with Databricks and Azure for a demand forecasting use case with a multi-market, multi-region roll-out.
  • Best practices for model refactoring and for setting up standard CI/CD pipelines for MLOps (a sketch of the registration step follows this list).
  • Pitfalls to avoid in such scenarios.
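
As a minimal illustration of the model-registration step such a CI/CD pipeline might automate (a toy model and hypothetical names, not the presenters' actual code):

```python
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.linear_model import LinearRegression

mlflow.set_experiment("demand_forecasting_demo")  # hypothetical experiment name

with mlflow.start_run() as run:
    # Stand-in for the real weekly training step at each DFU/market grain
    model = LinearRegression().fit(np.arange(10).reshape(-1, 1), np.arange(10))
    mlflow.sklearn.log_model(model, artifact_path="model")
    mlflow.log_metric("wmape", 0.18)  # placeholder accuracy metric

# Register the run's model so a CI/CD job, not a human, promotes it between
# environments (e.g. one registered model per market)
mlflow.register_model(f"runs:/{run.info.run_id}/model", "demand_forecast_demo")
```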

Talk by: Sunil Ranganathan and Vinit Doshi

Sponsored by: Qlik | Extracting the Full Potential of SAP Data for Global Automotive Manufacturing

Every year, organizations lose millions of dollars to equipment failure, unscheduled downtime, and unoptimized supply chains because business and operational data are not integrated. During this session you will hear from experts at Qlik and Databricks on how global luxury automotive manufacturers are accelerating the discovery and availability of complex data sets like SAP. Learn how Qlik, Microsoft, and Databricks together are delivering an integrated solution that combines the automated data delivery capabilities of Qlik Data Integration with the agility and openness of the Databricks Lakehouse Platform and AI on Azure Synapse.

We'll explore how to leverage IT and OT data convergence to extract the full potential of business-critical SAP data, lower IT costs, and deliver real-time prescriptive insights, at scale, for more resilient, predictable, and sustainable supply chains. Learn how organizations can track and manage inventory levels, predict demand, optimize production, and identify opportunities for improvement.

Talk by: Matthew Hayes and Bala Amavasai

Sponsored by: Avanade | Enabling Real-Time Analytics with Structured Streaming and Delta Live Tables

Join the panel to hear how Avanade is helping clients enable real-time analytics and tackle the people and process problems that accompany technology, powered by Azure Databricks.

Talk by: Thomas Kim, Dael Williamson, Zoé Durand

Here’s more to explore: Data, Analytics, and AI Governance: https://dbricks.co/44gu3YU

Taking Your Cloud Vendor to the Next Level: Solving Complex Challenges with Azure Databricks

Akamai's content delivery network (CDN) processes about 30% of the internet's daily traffic, resulting in a massive amount of data that presents engineering challenges, both internally and with cloud vendors. In this session, we will discuss the barriers we faced while building a data infrastructure on Azure, Databricks, and Kafka to meet strict SLAs, hitting the limits of some of our cloud vendors’ services. We will describe the iterative process of re-architecting a massive-scale data platform using these technologies.

We will also delve into how, today, Akamai is able to quickly ingest terabytes of data and make them available to customers, as well as efficiently query petabytes of data and return results within 10 seconds for most queries. This discussion will provide valuable insights for attendees and organizations seeking to effectively process and analyze large amounts of data.

Talk by: Tomer Patel and Itai Yaffe

Learn how to provision Redis Cloud alongside other cloud resources on AWS, Azure, and Google Cloud using your favorite programming languages and the Redis Cloud provider for Pulumi. Topics include getting started with NoSQL using Redis Cloud, defining and deploying Redis Cloud resources with code, and integrating Pulumi with Redis.
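
A minimal sketch of the Pulumi side in Python, assuming the pulumi_rediscloud package generated from Redis' Terraform provider; the resource and argument names below mirror that provider's schema and should be verified against the current docs, and all identifiers are placeholders.

```python
import pulumi
import pulumi_rediscloud as rediscloud  # assumed package name for the Redis Cloud provider

# NOTE: resource/argument names mirror the Terraform-based provider schema
# and are assumptions; check the provider documentation before use.
card = rediscloud.get_payment_method(card_type="Visa")

subscription = rediscloud.Subscription(
    "demo-subscription",
    name="pulumi-demo",
    payment_method_id=card.id,
    memory_storage="ram",
    cloud_provider=rediscloud.SubscriptionCloudProviderArgs(
        provider="AWS",  # GCP is analogous; Azure is offered through separate plans
        regions=[
            rediscloud.SubscriptionCloudProviderRegionArgs(
                region="us-east-1",
                networking_deployment_cidr="10.0.0.0/24",
            )
        ],
    ),
    creation_plan=rediscloud.SubscriptionCreationPlanArgs(
        memory_limit_in_gb=1,
        quantity=1,
        replication=False,
        throughput_measurement_by="operations-per-second",
        throughput_measurement_value=1000,
    ),
)

database = rediscloud.SubscriptionDatabase(
    "demo-db",
    subscription_id=subscription.id,
    name="demo-db",
    memory_limit_in_gb=1,
    data_persistence="none",
    throughput_measurement_by="operations-per-second",
    throughput_measurement_value=1000,
)

pulumi.export("redis_endpoint", database.public_endpoint)
```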

Apache Airflow is scalable, dynamic, extensible, and elegant. Can it be a lot more? We have taken Airflow to the next level, using it as a hybrid cloud data service that accelerates our transformation. In this talk we present the implementation of Airflow as an orchestration solution spanning legacy, private, and public cloud (AWS/Azure): a comparison of public and private offerings, harnessing the power of a hybrid cloud orchestrator to meet regulatory requirements (European financial institutions), and real production use cases.
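
The speakers' implementation isn't public, but the core pattern can be sketched: one DAG whose tasks are routed to Celery workers running inside the legacy, private, and public environments via queue assignments. Queue names and commands below are hypothetical, and this assumes the CeleryExecutor.

```python
import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hybrid_cloud_pipeline",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule="@daily",
) as dag:
    extract_legacy = BashOperator(
        task_id="extract_from_legacy",
        bash_command="echo 'pull from on-prem system'",
        queue="private_cloud",  # routed to a worker inside the private network
    )
    load_aws = BashOperator(
        task_id="load_to_s3",
        bash_command="echo 'copy to S3'",
        queue="aws_workers",  # worker pool running in AWS
    )
    load_azure = BashOperator(
        task_id="load_to_adls",
        bash_command="echo 'copy to ADLS'",
        queue="azure_workers",  # worker pool running in Azure
    )
    extract_legacy >> [load_aws, load_azure]
```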

As users of Airflow, we often use DagRun.conf attributes to control the content and flow of a DAG run. Previously the Airflow UI only allowed triggering via JSON. This was technically feasible but not user friendly: a user had to model, check, and understand the JSON and enter parameters manually, with no option to validate before triggering. Much like Jenkins or GitHub/Azure pipelines, we wanted a UI option to trigger a DAG while specifying parameters. With Airflow 2.6.0, DAG.params are now used to render a nice entry form, and with a few options a user-friendly trigger UI can be implemented. This session shows how the new feature works and provides examples of how to use it for your own purposes.
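
A minimal sketch of the feature on Airflow 2.6+: typed Param objects declared on the DAG render as a validated entry form in the trigger UI, and the submitted values reach tasks through the params context.

```python
import pendulum
from airflow import DAG
from airflow.models.param import Param
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="trigger_form_demo",
    start_date=pendulum.datetime(2023, 5, 1, tz="UTC"),
    schedule=None,  # triggered manually, via the UI form
    params={
        # Each Param becomes a validated form field in the trigger UI
        "environment": Param("dev", type="string", enum=["dev", "staging", "prod"]),
        "batch_size": Param(100, type="integer", minimum=1, maximum=10000),
        "dry_run": Param(True, type="boolean"),
    },
) as dag:

    def show_params(**context):
        # Values entered in the trigger form arrive via the params context
        print(context["params"])

    PythonOperator(task_id="show_params", python_callable=show_params)
```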

Summary

A significant portion of the time spent by data engineering teams is on managing the workflows and operations of their pipelines. DataOps has arisen as a parallel set of practices to that of DevOps teams as a means of reducing wasted effort. Agile Data Engine is a platform designed to handle the infrastructure side of the DataOps equation, as well as providing the insights that you need to manage the human side of the workflow. In this episode Tevje Olin explains how the platform is implemented, the features that it provides to reduce the amount of effort required to keep your pipelines running, and how you can start using it in your own team.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack

Your host is Tobias Macey and today I'm interviewing Tevje Olin about Agile Data Engine, a platform that combines data modeling, transformations, continuous delivery and workload orchestration to help you manage your data products and the whole lifecycle of your warehouse.

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Agile Data Engine is and the story behind it?
  • What are some of the tools and architectures that an organization might be able to replace with Agile Data Engine?
  • How does the unified experience of Agile Data Engine change the way that teams think about the lifecycle of their data?
  • What are some of the types of experiments that are enabled by reduced operational overhead?
  • What does CI/CD look like for a data warehouse?
  • How is it different from CI/CD for software applications?
  • Can you describe how Agile Data Engine is architected?
  • How have the design and goals of the system changed since you first started working on it?
  • What are the components that you needed to develop in-house to enable your platform goals?
  • What are the changes in the broader data ecosystem that have had the most influence on your product goals and customer adoption?
  • Can you describe the workflow for a team that is using Agile Data Engine to power their business analytics?
  • What are some of the insights that you generate to help your customers understand how to improve their processes or identify new opportunities?
  • In your "about" page it mentions the unique approaches that you take for warehouse automation. How do your practices differ from the rest of the industry?
  • How have changes in the adoption/implementation of ML and AI impacted the ways that your customers exercise your platform?
  • What are the most interesting, innovative, or unexpected ways that you have seen the Agile Data Engine platform used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Agile Data Engine?
  • When is Agile Data Engine the wrong choice?
  • What do you have planned for the future of Agile Data Engine?

Guest Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

About Agile Data Engine

Agile Data Engine unlocks the potential of your data to drive business value - in a rapidly changing world. Agile Data Engine is a DataOps Management platform for designing, deploying, operating and managing data products, and managing the whole lifecycle of a data warehouse. It combines data modeling, transformations, continuous delivery and workload orchestration into the same platform.

Links

  • Agile Data Engine
  • Bill Inmon
  • Ralph Kimball
  • Snowflake
  • Redshift
  • BigQuery
  • Azure Synapse
  • Airflow

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA.

Sponsored by: RudderStack

RudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.

RudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team.

RudderStack also supports real-time use cases. You can implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again.

Visit dataengineeringpodcast.com/rudderstack to sign up for free today, and snag a free T-shirt just for being a Data Engineering Podcast listener. Support Data Engineering Podcast

SQL Server 2022 Administration Inside Out

Conquer SQL Server 2022 and Azure SQL administration from the inside out! Dive into SQL Server 2022 administration and grow your Microsoft SQL Server data platform skillset. This well-organized reference packs in timesaving solutions, tips, and workarounds: all you need to plan, implement, deploy, provision, manage, and secure SQL Server 2022 in any environment (on-premises, cloud, or hybrid), including detailed, dedicated chapters on Azure SQL Database and Azure SQL Managed Instance. Nine experts thoroughly tour DBA capabilities available in the SQL Server 2022 Database Engine, SQL Server Data Tools, SQL Server Management Studio, PowerShell, and much more. You'll find extensive new coverage of Azure SQL Database and Azure SQL Managed Instance, both as a cloud platform for SQL Server and in their new integrations with SQL Server 2022, information available in no other book. Discover how experts tackle today's essential tasks and challenge yourself to new levels of mastery.

  • Identify low-hanging fruit and practical, easy wins for improving SQL Server administration
  • Get started with modern SQL Server tools, including SQL Server Management Studio and Azure Data Studio
  • Upgrade your SQL Server administration skillset to new features of SQL Server 2022, Azure SQL Database, Azure SQL Managed Instance, and SQL Server on Linux
  • Design and implement modern on-premises database infrastructure, including Kubernetes
  • Leverage data virtualization of third-party or non-relational data sources
  • Monitor SQL instances for corruption, index activity, fragmentation, and extended events
  • Automate maintenance plans, database mail, jobs, alerts, proxies, and event forwarding
  • Protect data through encryption, privacy, and auditing
  • Provision, manage, scale, secure, and bidirectionally synchronize Microsoft's powerful Azure SQL Managed Instance
  • Understand and enable new Intelligent Query Processing features to increase query concurrency
  • Prepare a best-practice runbook for disaster recovery
  • Use SQL Server 2022 features to span infrastructure across hybrid environments ...

Azure SQL Hyperscale Revealed: High-performance Scalable Solutions for Critical Data Workloads

Take a deep dive into the Azure SQL Database Hyperscale service tier and discover a new form of cloud architecture from Microsoft that supports massive databases. The new horizontally scalable architecture, formerly code-named Socrates, allows you to decouple compute nodes from storage layers. This radically different approach dramatically increases the scalability of the service. This book shows you how to leverage Hyperscale to provide next-level scalability, high throughput, and fast performance from large databases in your environment. The book begins by showing how Hyperscale helps you eliminate many of the problems of traditional high-availability and disaster recovery architecture. You’ll learn how Hyperscale overcomes storage capacity limitations and issues with scale-up times and costs. With Hyperscale, your costs do not increase linearly with database size, and you can manage more data than ever at a lower cost. The book teaches you how to deploy, configure, and monitor an Azure SQL Hyperscale database in a production environment. The book also covers migrating your current workloads from traditional architecture to Azure SQL Hyperscale.

What You Will Learn:
  • Understand the advantages of Hyperscale over traditional architecture
  • Deploy a Hyperscale database on the Azure cloud (interactively and with code)
  • Configure the advanced features of the Hyperscale database tier
  • Monitor and scale database performance to suit your needs
  • Back up and restore your Azure SQL Hyperscale databases
  • Implement disaster recovery and failover capability
  • Compare performance of Hyperscale vs. traditional architecture
  • Migrate existing databases to the Hyperscale service tier

Who This Book Is For: SQL architects, data engineers, and DBAs who want the most efficient and cost-effective cloud technologies to run their critical data workloads, and those seeking rapid scalability, high performance, and high throughput while utilizing large databases.
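
As one example of the "with code" deployment path the book mentions, creating a Hyperscale database can be scripted through the Azure CLI; the resource names below are placeholders and assume an existing resource group and logical server, plus a logged-in az session.

```python
import subprocess

# Placeholder names; assumes `az login` has been run and the resource group
# and logical SQL server already exist.
cmd = [
    "az", "sql", "db", "create",
    "--resource-group", "rg-hyperscale-demo",
    "--server", "sql-hyperscale-demo",
    "--name", "hsdb01",
    "--edition", "Hyperscale",  # Hyperscale service tier
    "--family", "Gen5",         # compute generation
    "--capacity", "2",          # number of vCores
]
subprocess.run(cmd, check=True)
```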

Expert Performance Indexing in Azure SQL and SQL Server 2022: Toward Faster Results and Lower Maintenance Both on Premises and in the Cloud

Take a deep dive into perhaps the single most important facet of query performance: indexes, and how best to use them. Newly updated for SQL Server 2022 and Azure SQL, this fourth edition includes new guidance and features related to columnstore indexes, improved and consolidated content on Query Store, deeper content around Intelligent Query Processing, and other updates to help you optimize query execution and make performance improvements to even the most challenging workloads. The book begins with explanations of the types of indexes and how they are stored in a database. Moving further into the book, you will learn how statistics are critical for optimal index usage and how the Index Advisor can assist in reviewing and optimizing index health. This book helps you build a clear understanding of how indexes work, how to implement and use them, and the many options available to tame even the largest and most complex workloads.

What You Will Learn:
  • Properly index row store, columnstore, and memory-optimized tables
  • Make use of Intelligent Query Processing for faster query results
  • Review statistics to understand indexing choices made by the optimizer
  • Apply indexing strategies such as covering indexes, included columns, and index intersections
  • Recognize and remove unnecessary indexes
  • Design effective indexes for full-text, spatial, and XML data types

Who This Book Is For: Azure SQL and SQL Server administrators and developers who are ready to improve the performance of their database environment by thoughtfully building indexes to speed up the queries that matter most and make a difference to the business.

Microsoft Power Platform Enterprise Architecture - Second Edition

Microsoft Power Platform Enterprise Architecture is your essential guide to designing powerful, enterprise-grade solutions using Microsoft technology. This thoroughly structured book equips you with the architectural insights, methodologies, and best practices necessary to optimize solutions using the Microsoft Power Platform and integrate them seamlessly with M365 and Azure.

What this book will help you do:
  • Design robust enterprise solutions leveraging Microsoft Power Platform and Dynamics 365.
  • Integrate Power Platform tools with Microsoft 365 and Azure effectively for comprehensive solutions.
  • Implement advanced security, extensibility, and lifecycle management methodologies.
  • Migrate and manage data efficiently within the Power Platform ecosystem.
  • Overcome architectural challenges in multi-system integration using proven techniques.

About the author: Robert Rybaric is an experienced enterprise architect specializing in Microsoft technologies. With years of expertise in designing enterprise systems, Robert brings practical insights into crafting effective solutions using Microsoft Power Platform. His approach emphasizes clarity and practicality, ensuring concepts are both illustrative and applicable for readers.

Who is it for? This book is ideal for enterprise architects and technical decision-makers aiming to design and deploy complex solutions using the Microsoft Power Platform. It is tailored for professionals familiar with Microsoft Power Platform and Azure services who wish to refine their enterprise architecture skills to meet growing business demands efficiently.

Pro SQL Server 2022 Wait Statistics: A Practical Guide to Analyzing Performance in SQL Server and Azure SQL Database

Use this practical guide to analyze and troubleshoot SQL Server performance using wait statistics. You'll learn to identify precisely why your queries are running slowly. And you'll know how to measure the amount of time consumed by each bottleneck, so that you can focus attention on making the largest improvements first. This edition is updated to cover analysis of wait statistics current with SQL Server 2022. Whether you are new to wait statistics or already familiar with them, this book provides a deeper understanding of how wait statistics are generated and what they mean for your SQL Server instance’s performance. The book goes beyond the most common wait types into the more complex and performance-threatening wait types. You’ll learn about per-query wait statistics and session-based wait statistics, and the types of problems they can help you solve. The different wait types are categorized by their area of impact, including CPU, IO, latching, locking, and many more. Clear examples are included to help you gain practical knowledge of why and how specific wait times increase or decrease, how they impact your SQL Server’s performance, and what you can do to improve performance. After reading this book, you won’t want to be without the valuable information that wait statistics provide regarding where you should be spending your limited tuning time to maximize performance and value to your business.

What You'll Learn:
  • Understand how the SQL Server engine processes requests
  • Identify resource bottlenecks in a running SQL Server instance
  • Locate wait statistics information inside DMVs and Query Store
  • Analyze the root cause of sub-optimal performance
  • Diagnose I/O contention and locking contention
  • Benchmark SQL Server performance
  • Improve database performance by lowering overall wait time

Who This Book Is For: Database administrators who want to identify and resolve performance bottlenecks, those who want to learn more about how the SQL Server engine accesses and uses resources inside SQL Server, and administrators concerned with achieving, and knowing they have achieved, optimal performance.
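
For a flavor of the DMV-based analysis the book teaches, aggregate wait statistics can be pulled from sys.dm_os_wait_stats with any SQL client; a minimal sketch using pyodbc, with placeholder connection details:

```python
import pyodbc

# Placeholder connection string; adjust driver, server, and auth for your setup.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=localhost;"
    "DATABASE=master;Trusted_Connection=yes;TrustServerCertificate=yes;"
)

# Top waits by total wait time: the usual starting point for wait analysis
rows = conn.execute(
    """
    SELECT TOP 10 wait_type, wait_time_ms, waiting_tasks_count
    FROM sys.dm_os_wait_stats
    ORDER BY wait_time_ms DESC
    """
).fetchall()

for wait_type, wait_ms, tasks in rows:
    print(f"{wait_type:40s} {wait_ms:>12d} ms over {tasks} waits")
```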

Pro SQL Server 2022 Administration: A Guide for the Modern DBA

Get your daily work done efficiently using this comprehensive guide for SQL Server DBAs, covering everything a practicing database administrator needs to know. Updated for SQL Server 2022, this edition includes coverage of new features, such as Ledger, which provides an immutable record of table history to protect you against malicious data tampering, and integration with cloud providers to support hybrid cloud scenarios. You’ll also find new content on performance optimizations, such as query plan feedback, and security controls, such as new database roles, restructured for modern ways of working. Coverage also includes Query Store, installation on Linux, and the use of containerized SQL. Pro SQL Server 2022 Administration takes DBAs on a journey that begins with planning their SQL Server deployment and runs through installing and configuring the instance, administering and optimizing database objects, and ensuring that data is secure and highly available. Readers will learn how to perform advanced maintenance and tuning techniques, and discover SQL Server's hybrid cloud functionality. This book teaches you how to make the most of new SQL Server 2022 functionality, including integration for hybrid cloud scenarios. The book promotes best-practice installation, shows how to configure for scalability and high availability, and demonstrates the gamut of database-level maintenance tasks, such as index maintenance, database consistency checks, and table optimizations.

What You Will Learn:
  • Integrate SQL Server with Azure for hybrid cloud scenarios
  • Audit changes and prevent malicious data changes with SQL Server’s Ledger
  • Secure and encrypt data to protect against embarrassing data breaches
  • Ensure 24x7x365 access through high availability and disaster recovery features in today’s hybrid world
  • Use Azure tooling, including Arc, to gain insight into and manage your SQL Server enterprise
  • Install and configure SQL Server on Windows, Linux, and in containers
  • Perform routine maintenance tasks, such as backups and database consistency checks
  • Optimize performance and undertake troubleshooting in the Database Engine

Who This Book Is For: SQL Server DBAs who manage on-premises installations of SQL Server. This book is also useful for DBAs who wish to learn advanced features, such as integration with Azure, Query Store, Extended Events, and Policy-Based Management, or those who need to install SQL Server in a variety of environments.

Summary

The problems that are easiest to fix are the ones that you prevent from happening in the first place. Sifflet is a platform that brings your entire data stack into focus to improve the reliability of your data assets and empower collaboration across your teams. In this episode CEO and founder Salma Bakouk shares her views on the causes and impacts of "data entropy" and how you can tame it before it leads to failures.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.

RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder

Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.

Your host is Tobias Macey and today I’m interviewing Salma Bakouk about achieving data reliability and reducing entropy within your data stack with Sifflet.

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Sifflet is and the st

Summary

CreditKarma builds data products that help consumers take advantage of their credit and financial capabilities. To make that possible they need a reliable data platform that empowers all of the organization’s stakeholders. In this episode Vishnu Venkataraman shares the journey that he and his team have taken to build and evolve their systems and improve the product offerings that they are able to support.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.

RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder

Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.

Your host is Tobias Macey and today I’m interviewing Vishnu Venkataraman about building the data platform at CreditKarma and the forces that shaped the design.

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what CreditKarma is and the role

Summary

Despite the best efforts of data engineers, data is as messy as the real world. Entity resolution and fuzzy matching are powerful utilities for cleaning up data from disconnected sources, but they have typically required custom development and training machine learning models. Sonal Goyal created and open-sourced Zingg as a generalized tool for data mastering and entity resolution to reduce the effort involved in adopting those practices. In this episode she shares the story behind the project, the details of how it is implemented, and how you can use it for your own data projects.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!

RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.

Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.

Your host is Tobias Macey and today I’m interviewing Sonal Goyal about Zingg, an open source entity resolution frame