talk-data.com

Topic

Master Data Management

data_governance data_quality data_integration


Activity Trend: 2020-Q1 to 2026-Q1 (peak of 3 activities per quarter)

Activities

20 activities · Newest first

Summary In this episode of the Data Engineering Podcast Ariel Pohoryles, head of product marketing for Boomi's data management offerings, talks about a recent survey of 300 data leaders on how organizations are investing in data to scale AI. He shares a paradox uncovered in the research: while 77% of leaders trust the data feeding their AI systems, only 50% trust their organization's data overall. Ariel explains why truly productionizing AI demands broader, continuously refreshed data with stronger automation and governance, and highlights the challenges posed by unstructured data and vector stores. The conversation covers the need to shift from manual reviews to automated pipelines, the resurgence of metadata and master data management, and the importance of guardrails, traceability, and agent governance. Ariel also predicts a growing convergence between data teams and application integration teams and advises leaders to focus on high-value use cases, aggressive pipeline automation, and cataloging and governing the coming sprawl of AI agents, all while using AI to accelerate data engineering itself.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.

Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.

Your host is Tobias Macey and today I'm interviewing Ariel Pohoryles about data management investments that organizations are making to enable them to scale AI implementations.

Interview

Introduction

How did you get involved in the area of data management?

Can you start by describing the motivation and scope of your recent survey on data management investments for AI across your respondents?

What are the key takeaways that were most significant to you?

The survey reveals a fascinating paradox: 77% of leaders trust the data used by their AI systems, yet only half trust their organization's overall data quality. For our data engineering audience, what does this suggest about how companies are currently sourcing data for AI? Does it imply they are using narrow, manually-curated "golden datasets," and what are the technical challenges and risks of that approach as they try to scale?

The report highlights a heavy reliance on manual data quality processes, with one expert noting companies feel it's "not reliable to fully automate validation" for external or customer data. At the same time, maturity in "Automated tools for data integration and cleansing" is low, at only 42%. What specific technical hurdles or organizational inertia are preventing teams from adopting more automation in their data quality and integration pipelines?

There was a significant point made that with generative AI, "biases can scale much faster," making automated governance essential. From a data engineering perspective, how does the data management strategy need to evolve to support generative AI versus traditional ML models? What new types of data quality checks, lineage tracking, or monitoring for feedback loops are required when the model itself is generating new content based on its own outputs?

The report champions a "centralized data management platform" as the "connective tissue" for reliable AI. How do you see the scale and data maturity impacting the realities of that effort?

How do architectural patterns in the shape of cloud warehouses, lakehouses, data mesh, data products, etc. factor into that need for centralized/unified platforms?

A surprising finding was that a third of respondents have not fully grasped the risk of significant inaccuracies in their AI models if they fail to prioritize data management. In your experience, what are the biggest blind spots for data and analytics leaders?

Looking at the maturity charts, companies rate themselves highly on "Developing a data management strategy" (65%) but lag significantly in areas like "Automated tools for data integration and cleansing" (42%) and "Conducting bias-detection audits" (24%). If you were advising a data engineering team lead based on these findings, what would you tell them to prioritize in the next 6-12 months to bridge the gap between strategy and a truly scalable, trustworthy data foundation for AI?

The report states that 83% of companies expect to integrate more data sources for their AI in the next year. For a data engineer on the ground, what is the most important capability they need to build into their platform to handle this influx?

What are the most interesting, innovative, or unexpected ways that you have seen teams addressing the new and accelerated data needs for AI applications?

What are some of the noteworthy trends or predictions that you have for the near-term future of the impact that AI is having or will have on data teams and systems?

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links

Boomi
Data Management
Integration & Automation Demo
Agentstudio
Data Connector Agent Webinar
Survey Results
Data Governance
Shadow IT
Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

In this episode of Data Unchained, we sit down with Malcolm Hawker, former Gartner analyst and Chief Data Officer at Profisee, to expose the real barriers to AI adoption. We explore why Master Data Management (MDM) is the foundation enterprises overlook, how decentralized systems and unstructured data derail governance, and why CDOs must evolve their role or risk irrelevance. This conversation challenges the myth of a single source of truth, breaks down the politics of data ownership, and offers a new vision for aligning data strategy with AI innovation.

#AIReadiness #MasterDataManagement #DataGovernance #CDOInsights #EnterpriseAI #DataStrategy #UnstructuredData #DataInfrastructure #DigitalTransformation #AILeadership #DataUnchained #Profisee #MalcolmHawker #MollyPresley #TechInnovation

Cyberpunk by jiglr | https://soundcloud.com/jiglrmusic Music promoted by https://www.free-stock-music.com Creative Commons Attribution 3.0 Unported License https://creativecommons.org/licenses/by/3.0/deed.en_US Hosted on Acast. See acast.com/privacy for more information.

Summary In this episode of the Data Engineering Podcast Dan Bruckner, co-founder and CTO of Tamr, talks about the application of machine learning (ML) and artificial intelligence (AI) in master data management (MDM). Dan shares his journey from working at CERN to becoming a data expert and discusses the challenges of reconciling large-scale organizational data. He explains how data silos arise from independent teams and highlights the importance of combining traditional techniques with modern AI to address the nuances of data reconciliation. Dan emphasizes the transformative potential of large language models (LLMs) in creating more natural user experiences, improving trust in AI-driven data solutions, and simplifying complex data management processes. He also discusses the balance between using AI for complex data problems and the necessity of human oversight to ensure accuracy and trust.
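For listeners who want a concrete feel for the kind of record matching discussed here, the short Python sketch below pairs a cheap blocking key with fuzzy name similarity to surface candidate duplicates. The records, blocking key, and threshold are invented for illustration and are not Tamr's algorithm.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Toy source records; in practice these come from many independent systems.
records = [
    {"id": 1, "name": "Acme Corporation", "city": "Boston"},
    {"id": 2, "name": "ACME Corp.", "city": "Boston"},
    {"id": 3, "name": "Globex Inc", "city": "Springfield"},
]

def blocking_key(rec):
    """Cheap key so we only score pairs that could plausibly match."""
    return (rec["name"][:3].lower(), rec["city"].lower())

def name_similarity(a, b):
    """Fuzzy similarity between 0 and 1 on the normalized name."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

blocks = defaultdict(list)
for rec in records:
    blocks[blocking_key(rec)].append(rec)

matches = []
for block in blocks.values():
    for i in range(len(block)):
        for j in range(i + 1, len(block)):
            score = name_similarity(block[i], block[j])
            if score >= 0.6:  # threshold is an assumption; tune per domain
                matches.append((block[i]["id"], block[j]["id"], round(score, 2)))

print(matches)  # roughly [(1, 2, 0.69)]: the Acme pair matches, Globex stands alone
```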

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

As a listener of the Data Engineering Podcast you clearly care about data and how it affects your organization and the world. For even more perspective on the ways that data impacts everything around us don't miss Data Citizens® Dialogues, the forward-thinking podcast brought to you by Collibra. You'll get further insights from industry leaders, innovators, and executives in the world's largest companies on the topics that are top of mind for everyone. In every episode of Data Citizens® Dialogues, industry leaders unpack data's impact on the world; like in their episode "The Secret Sauce Behind McDonald's Data Strategy", which digs into how AI-driven tools can be used to support crew efficiency and customer interactions. In particular I appreciate the ability to hear about the challenges that enterprise scale businesses are tackling in this fast-moving field. The Data Citizens Dialogues podcast is bringing the data conversation to you, so start listening now! Follow Data Citizens Dialogues on Apple, Spotify, YouTube, or wherever you get your podcasts.

Your host is Tobias Macey and today I'm interviewing Dan Bruckner about the application of ML and AI techniques to the challenge of reconciling data at the scale of business.

Interview

Introduction

How did you get involved in the area of data management?

Can you start by giving an overview of the different ways that organizational data becomes unwieldy and needs to be consolidated and reconciled?

How does that reconciliation relate to the practice of "master data management"?

What are the scaling challenges with the current set of practices for reconciling data?

ML has been applied to data cleaning for a long time in the form of entity resolution, etc. How has the landscape evolved or matured in recent years?

What (if any) transformative capabilities do LLMs introduce?

What are the missing pieces/improvements that are necessary to make current AI systems usable out-of-the-box for data cleaning?

What are the strategic decisions that need to be addressed when implementing ML/AI techniques in the data cleaning/reconciliation process?

What are the risks involved in bringing ML to bear on data cleaning for inexperienced teams?

What are the most interesting, innovative, or unexpected ways that you have seen ML techniques used in data resolution?

What are the most interesting, unexpected, or challenging lessons that you have learned while working on using ML/AI in master data management?

When is ML/AI the wrong choice for data cleaning/reconciliation?

What are your hopes/predictions for the future of ML/AI applications in MDM and data cleaning?

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links

Tamr
Master Data Management
CERN
LHC
Michael Stonebraker
Conway's Law
Expert Systems
Information Retrieval
Active Learning

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Designed and implemented well, automated workflows can make the modern business just a little less chaotic and complex. This blog explores the opportunity for automated workflows to help cross-functional teams collaborate and standardize organizational master data. Published at: https://www.eckerson.com/articles/master-data-management-and-operational-workflows-two-modern-use-cases

Send us a text Part 2 : Malcolm Hawker, Head of Data Strategy for Profisee, and thought leader in the field of Master Data Management (MDM) and Data Governance. If you're an MDM zealot, let's go deep!

Show Notes:
00:32 The Head of Data Strategy
02:00 Make it Easy, Accurate, and Scale
05:48 What is Real vs Hype
09:00 Profisee's Differentiator
14:26 Reach out to Malcolm
15:28 How to become a CDO
18:50 Focusing on outcomes
24:52 The end of the world

LinkedIn: https://www.linkedin.com/in/malhawker
Website: https://profisee.com/

Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Send us a text Part 1 : Welcome Malcolm Hawker, Head of Data Strategy for Profisee, and thought leader in the field of Master Data Management (MDM) and Data Governance. If you're an MDM zealot, let's go deep.

Show Notes
01:28 Meet Malcolm Hawker
06:33 MDM's future
09:48 A unique view on data fabric
14:07 The rise and fall of AOL
19:46 The definition of MDM
26:28 MDM reference architecture

LinkedIn: https://www.linkedin.com/in/malhawker
Website: https://profisee.com/

Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Summary The most complicated part of data engineering is the effort involved in making the raw data fit into the narrative of the business. Master Data Management (MDM) is the process of building consensus around what the information actually means in the context of the business and then shaping the data to match those semantics. In this episode Malcolm Hawker shares his years of experience working in this domain to explore the combination of technical and social skills that are necessary to make an MDM project successful both at the outset and over the long term.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show! Random data doesn’t do it — and production data is not safe (or legal) for developers to use. What if you could mimic your entire production database to create a realistic dataset with zero sensitive data? Tonic.ai does exactly that. With Tonic, you can generate fake data that looks, acts, and behaves like production because it’s made from production. Using universal data connectors and a flexible API, Tonic integrates seamlessly into your existing pipelines and allows you to shape and size your data to the scale, realism, and degree of privacy that you need. The platform offers advanced subsetting, secure de-identification, and ML-driven data synthesis to create targeted test data for all of your pre-production environments. Your newly mimicked datasets are safe to share with developers, QA, data scientists—heck, even distributed teams around the world. Shorten development cycles, eliminate the need for cumbersome data pipeline work, and mathematically guarantee the privacy of your data, with Tonic.ai. Data Engineering Podcast listeners can sign up for a free 2-week sandbox account, go to dataengineeringpodcast.com/tonic today to give it a try! RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder. Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure

COVID, inflation, broken supply chains, and not-so-distant war make this a turbulent time for the modern consumer. During times like these, families tend to their nests, which leads to lots of home-improvement projects…which means lots of painting.

Today we explore the case study of a Fortune 500 producer of the paints and stains that coat many households, consumer products, and even mechanical vehicles. While business expands, this company needs to carefully align the records that track hundreds of suppliers, thousands of storefronts, and millions of customers.

Business expansion and complex supply chains make it particularly important—and challenging—for enterprises such as this paint producer, which we'll call Bright Colors, to accurately describe the entities that make up their business. They need governed, validated data to describe entities such as their products, locations, and customers. Master data management, also known as MDM, streamlines operations and assists data governance by reconciling disparate data records into golden records and ideally a single source of truth.
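As a rough illustration of what "reconciling disparate data records into golden records" can look like in the small, here is a minimal survivorship sketch in Python. The supplier records, field names, and newest-non-empty-value rule are assumptions made up for the example, not details from the episode.

```python
from datetime import date

# Two duplicate supplier records from different systems (values are invented).
duplicates = [
    {"name": "Brite Paint Supply", "phone": "", "address": "12 Elm St",
     "updated": date(2021, 3, 1)},
    {"name": "Brite Paint Supply Inc.", "phone": "555-0100", "address": "",
     "updated": date(2022, 6, 15)},
]

def golden_record(records):
    """Merge duplicates field by field: the newest non-empty value survives."""
    ordered = sorted(records, key=lambda r: r["updated"], reverse=True)
    fields = [k for k in ordered[0] if k != "updated"]
    return {f: next((r[f] for r in ordered if r[f]), "") for f in fields}

print(golden_record(duplicates))
# {'name': 'Brite Paint Supply Inc.', 'phone': '555-0100', 'address': '12 Elm St'}
```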

We’re excited to share our conversation with an industry expert that helps Bright Colors and other Fortune 2000 enterprises navigate turbulent times with effective strategies for MDM and data governance.

Dave Wilkinson is chief technology officer with D3Clarity, a global strategy and implementation services firm that seeks to ensure digital certainty, security, and trust. D3Clarity is a partner of Semarchy, whose Intelligent Data Hub software helps enterprises govern and manage master data, reference data, data quality, enrichment, and workflows. Semarchy sponsored this podcast.

Fast-casual restaurants offer a fascinating microcosm of the turbulent forces confronting enterprises today—and the pivotal role that data plays in helping them maintain competitive advantage. COVID prompted customers to order their Chipotle burritos, Shake Shack milkshakes, and Bruegger’s Bagels for home delivery, and this trend continues in 2022. Supply-chain disruptions, meanwhile, force fast-casual restaurants to make some fast pivots between suppliers in order to keep their shelves stocked. And the market continues to grow as these companies win customers, add locations, and expand delivery partnerships.

These three industry trends—home delivery, supply-chain disruptions, and market expansion—all depend on governed, accurate data to describe entities such as orders, ingredients, and locations. Data quality and master data management therefore play a more pivotal role than ever in the success of fast-casual restaurants. Master data management, also known as MDM, streamlines operations and assists data governance by reconciling disparate data records into a golden record and source of truth. If you’re looking for an ideal case study for how MDM drives enterprise reinvention, agility, and growth, this is it.

We’re excited to talk with an industry expert that helps fast-casual restaurants handle these turbulent forces with effective strategies for managing data and especially master data. Matt Zingariello is Vice President of Data Strategy Services with Keyrus, a global consultancy that helps enterprises use data assets to optimize their digital strategies and customer experience. Matt leads a team that provides industry-specific advisory and implementation services to help enterprises address challenges such as data governance and MDM.

Keyrus is a partner of Semarchy, whose Intelligent Data Hub software helps enterprises govern and manage master data, reference data, data quality, enrichment, and workflows. Semarchy sponsored this podcast.

In our podcast, we'll define data quality and MDM as part of data governance. We’ll explore why enterprises need data quality and MDM, and how they can craft effective data quality and MDM strategies, with a focus on fast-casual restaurants as a case study.

It's hard to find a data discipline today that is under more pressure than data governance. On one side, the supply of data is exploding. As enterprises transform their business to compete in the 2020s, they digitize myriad events and interactions, which creates mountains of data that they need to control. On the other side, demand for data is exploding. Business owners at all levels of the enterprise need to inform their decisions and drive their operations with data.

Under these pressures, data governance teams must ensure business owners access and consume the right, high-quality data. This requires master data management—the reconciliation of disparate data records into a golden record and source of truth—which assists data governance at many modern enterprises.

In this episode, our host Kevin Petrie, VP of Research at Eckerson Group, talks with our guests Felicia Perez, Managing Director, Information as a Product Program at National Student Clearinghouse, and Patrick O'Halloran, enterprise data scientist, as they define what data quality and MDM are, why you need them, and how best to achieve effective data quality and MDM.
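To make the data quality side of this conversation concrete, here is a tiny sketch of automated checks (field completeness and key uniqueness) of the sort that feed MDM and governance work. The records, fields, and values are invented for illustration.

```python
# Toy batch of customer records from an upstream system.
records = [
    {"customer_id": "C001", "email": "ana@example.com", "segment": "retail"},
    {"customer_id": "C002", "email": "", "segment": "retail"},
    {"customer_id": "C002", "email": "ben@example.com", "segment": None},
]

def completeness(rows, field):
    """Share of rows where the field is present and non-empty."""
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)

def duplicate_keys(rows, key):
    """Key values that appear more than once."""
    seen, dupes = set(), set()
    for r in rows:
        (dupes if r[key] in seen else seen).add(r[key])
    return dupes

checks = {
    "email completeness": completeness(records, "email"),        # ~0.67
    "segment completeness": completeness(records, "segment"),    # ~0.67
    "duplicate customer_id": duplicate_keys(records, "customer_id"),  # {'C002'}
}
print(checks)
```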

Master Data Management is no shiny object. But like many traditional IT practices, MDM is being severely tested – and rendered all the more strategic – by digitalization and rising data volumes.

Originally published at https://www.eckerson.com/articles/five-master-data-management-best-practices-for-enterprises

Summary Data warehouses have gone through many transformations, from standard relational databases on powerful hardware, to column oriented storage engines, to the current generation of cloud-native analytical engines. SnowflakeDB has been leading the charge to take advantage of cloud services that simplify the separation of compute and storage. In this episode Kent Graziano, chief technical evangelist for SnowflakeDB, explains how it is differentiated from other managed platforms and traditional data warehouse engines, the features that allow you to scale your usage dynamically, and how it allows for a shift in your workflow from ETL to ELT. If you are evaluating your options for building or migrating a data platform, then this is definitely worth a listen.
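As a concrete (if simplified) picture of the ETL-to-ELT shift mentioned above, the sketch below lands raw rows first and then transforms them with SQL inside the engine. SQLite is used purely as a stand-in for a cloud warehouse, and the table and column names are made up; none of this is Snowflake-specific syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: land the source rows in a raw staging table, untransformed.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("a1", "19.99", "us"), ("a2", "5.00", "US"), ("a3", "7.50", "de")],
)

# Transform: cleanse and aggregate with SQL inside the engine, after loading.
conn.execute("""
    CREATE TABLE orders_by_country AS
    SELECT UPPER(country) AS country,
           ROUND(SUM(CAST(amount AS REAL)), 2) AS total_amount
    FROM raw_orders
    GROUP BY UPPER(country)
""")

print(conn.execute("SELECT * FROM orders_by_country ORDER BY country").fetchall())
# Expected output: [('DE', 7.5), ('US', 24.99)]
```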

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media and the Python Software Foundation. Upcoming events include the Software Architecture Conference in NYC and PyCOn US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Kent Graziano about SnowflakeDB, the cloud-native data warehouse

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what SnowflakeDB is for anyone who isn’t familiar with it?

How does it compare to the other available platforms for data warehousing? How does it differ from traditional data warehouses?

How does the performance and flexibility affect the data modeling requirements?

Snowflake is one of the data stores that is enabling the shift from an ETL to an ELT workflow. What are the features that allow for that approach and what are some of the challenges that it introduces? Can you describe how the platform is architected and some of the ways that it has evolved as it has grown in popularity?

What are some of the current limitations that you are struggling with?

For someone getting started with Snowflake what is involved with loading data into the platform?

What is their workflow for allocating and scaling compute capacity and running analyses?

One of the interesting features enabled by your architecture is data sharing. What are some of the most interesting or unexpected uses of that capability that you have seen? What are some other features or use cases for Snowflake that are not as well known or publicized which you think users should know about? When is SnowflakeDB the wrong choice? What are some of the plans for the future of SnowflakeDB?

Contact Info

LinkedIn Website @KentGraziano on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

SnowflakeDB

Free Trial Stack Overflow

Data Warehouse Oracle DB MPP == Massively Parallel Processing Shared Nothing Architecture Multi-Cluster Shared Data Architecture Google BigQuery AWS Redshift AWS Redshift Spectrum Presto

Podcast Episode

SnowflakeDB Semi-Structured Data Types Hive ACID == Atomicity, Consistency, Isolation, Durability 3rd Normal Form Data Vault Modeling Dimensional Modeling JSON AVRO Parquet SnowflakeDB Virtual Warehouses CRM == Customer Relationship Management Master Data Management

Podcast Episode

FoundationDB

Podcast Episode

Apache Spark

Podcast Episode

SSIS == SQL Server Integration Services Talend Informatica Fivetran

Podcast Episode

Matillion Apache Kafka Snowpipe Snowflake Data Exchange OLTP == Online Transaction Processing GeoJSON Snowflake Documentation SnowAlert Splunk Data Catalog

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Summary Data integration is one of the most challenging aspects of any data platform, especially as the variety of data sources and formats grow. Enterprise organizations feel this acutely due to the silos that occur naturally across business units. The CluedIn team experienced this issue first-hand in their previous roles, leading them to build a business aimed at building a managed data fabric for the enterprise. In this episode Tim Ward, CEO of CluedIn, joins me to explain how their platform is architected, how they manage the task of integrating with third-party platforms, automating entity extraction and master data management, and the work of providing multiple views of the same data for different use cases. I highly recommend listening closely to his explanation of how they manage consistency of the data that they process across different storage backends.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems. Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support. You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Tim Ward about CluedIn, an integration platform for implementing your companies data fabric

Interview

Introduction

How did you get involved in the area of data management?

Why is Data Quality still an issue after all these years? To get an answer to the prevalent question, Wayne Eckerson and Jason Beard engage in a dynamic exchange of questions which lead us to the root cause of data quality and data governance problems. Using examples from his past projects, Jason shows the value of business process mapping and how it exposes the hidden problems which go undetected under the standard IT lens.

In his most recent role as Vice President of Process & Data Management at Wiley, a book publisher, he was responsible for master data setup and governance, process optimization, business continuity planning, and change management for new and emerging business models. Jason has led business intelligence, data governance, master data management, process improvement, business transformation, and ERP projects in a variety of industries, including scientific and trade publishing, educational technology, consumer goods, banking, investments, and insurance.

Send us a text Happy holidays from the Making Data Simple team! Enjoy a rebroadcast of a conversation with Seth Dobrin, Vice President and Chief Data Officer for IBM Analytics, as he and Al explore the strategies and people your company needs to disrupt and succeed in the year ahead. Do you or your team members need new credentials to work in data? Seth also discusses what you need in your toolkit to be a data scientist at IBM.

Show Notes
00.30 Connect with Al Martin on Twitter and LinkedIn.
01.00 Connect with Seth Dobrin on Twitter and LinkedIn.
01.40 Read "What IBM looks for in a Data Scientist" by Seth Dobrin and Jean-Francois Puget.
06.00 Learn more about GDPR.
13.00 Learn more about master data management.
13.05 Learn more about unified governance and integration.
13.25 Learn more about machine learning.
14.00 Connect and learn more about Ginni Rometty.
14.40 Learn more about cognitive computing.
19.35 Connect with Rob Thomas on Twitter and LinkedIn.
21.00 Connect with Jean-Francois Puget on Twitter and LinkedIn.
Follow @IBMAnalytics

Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Summary

With the proliferation of data sources to give a more comprehensive view of the information critical to your business it is even more important to have a canonical view of the entities that you care about. Is customer number 342 in your ERP the same as Bob Smith on Twitter? Using master data management to build a data catalog helps you answer these questions reliably and simplify the process of building your business intelligence reports. In this episode the head of product at Tamr, Mark Marinelli, discusses the challenges of building a master data set, why you should have one, and some of the techniques that modern platforms and systems provide for maintaining it.
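To make the "customer number 342 versus Bob Smith" question concrete, here is a minimal sketch of turning pairwise matches into canonical entity clusters with union-find. The record identifiers and match pairs are invented; in a real pipeline they would come from matching rules or a model.

```python
# Map each record to its current parent for union-find.
parent = {}

def find(x):
    """Return the canonical representative for record x."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(a, b):
    """Declare that records a and b refer to the same real-world entity."""
    parent[find(a)] = find(b)

# Pairwise matches produced upstream (rules, ML, or both) -- invented here.
matches = [
    ("erp:342", "crm:bob-smith"),
    ("crm:bob-smith", "twitter:@bsmith"),
    ("erp:901", "crm:acme-co"),
]

for a, b in matches:
    union(a, b)

clusters = {}
for rec in {r for pair in matches for r in pair}:
    clusters.setdefault(find(rec), []).append(rec)

for canonical, members in clusters.items():
    print(canonical, "->", sorted(members))
# One cluster for the Bob Smith records, one for the Acme records.
```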

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Mark Marinelli about data mastering for modern platforms

Interview

Introduction How did you get involved in the area of data management? Can you start by establishing a definition of data mastering that we can work from?

How does the master data set get used within the overall analytical and processing systems of an organization?

What is the traditional workflow for creating a master data set?

What has changed in the current landscape of businesses and technology platforms that makes that approach impractical? What are the steps that an organization can take to evolve toward an agile approach to data mastering?

At what scale of company or project does it make sense to start building a master data set? What are the limitations of using ML/AI to merge data sets? What are the limitations of a golden master data set in practice?

Are there particular formats of data or types of entities that pose a greater challenge when creating a canonical format for them? Are there specific problem domains that are more likely to benefit from a master data set?

Once a golden master has been established, how are changes to that information handled in practice? (e.g. versioning of the data) What storage mechanisms are typically used for managing a master data set?
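One common answer to the versioning question above is to never overwrite a golden record: close out the current version and append a new one with validity dates. The sketch below illustrates that pattern; the record shape and field names are assumptions for the example.

```python
from datetime import date

# Version history for one golden record; the shape is invented for the example.
history = [
    {"entity_id": "cust-342", "email": "b.smith@old-domain.example",
     "valid_from": date(2021, 1, 1), "valid_to": None},
]

def apply_change(history, entity_id, field, new_value, effective):
    """Close the open version of the entity and append the updated version."""
    current = next(
        r for r in history if r["entity_id"] == entity_id and r["valid_to"] is None
    )
    current["valid_to"] = effective
    history.append({**current, field: new_value,
                    "valid_from": effective, "valid_to": None})

apply_change(history, "cust-342", "email", "b.smith@new-domain.example",
             date(2023, 5, 2))
for row in history:
    print(row)
# The old row now carries a valid_to date; the new row holds the updated email.
```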

Are there particular security, auditing, or access concerns that engineers should be considering when managing their golden master that goes beyond the rest of their data infrastructure? How do you manage latency issues when trying to reference the same entities from multiple disparate systems?

What have you found to be the most common stumbling blocks for a group that is implementing a master data platform?

What suggestions do you have to help prevent such a project from being derailed?

What resources do you recommend for someone looking to learn more about the theoretical and practical aspects of