Countless companies invest in their data quality, but the effort behind that investment is often not fully realized in the output. Despite the critical importance of data quality, data governance seems to be suffering from a branding issue: it is sometimes viewed as the data police, but this is far from the truth. So how can we change perspectives and introduce fun into data governance? Tiankai Feng is a Principal Data Consultant and Data Strategy & Data Governance Lead at Thoughtworks. He also works part-time as the Head of Marketing at DAMA Germany. Tiankai has worn many data hats in his career: marketing data analyst, data product owner, analytics capability lead, and, for the last few years, data governance leader. He has found a passion for the human side of data: how to collaborate, coordinate, and communicate around data. Tiankai often uses his music and humor to make data more approachable and fun. In the episode, Adel and Tiankai explore the importance of data governance in data-driven organizations, the challenges of data governance, how to define success criteria and measure the ROI of governance initiatives, non-invasive and creative approaches to data governance, the implications of generative AI on data governance, regulatory considerations, organizational culture, and much more.
Links Mentioned in the Show:
Tiankai’s YouTube Channel
Data Governance Fundamentals Cheat Sheet
[Webinar] Unpacking the Fun in Data Governance: The Key to Scaling Data Quality
[Course] Data Governance Concepts
Rewatch sessions from RADAR: The Analytics Edition
Data Quality
Summary
Databases come in a variety of formats for different use cases. The default association with the term "database" is relational engines, but non-relational engines are also used quite widely. In this episode Oren Eini, CEO and creator of RavenDB, explores the nuances of relational vs. non-relational engines, and the strategies for designing a non-relational database.
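To make the relational/non-relational contrast concrete, here is a minimal Python sketch (our illustration, not code from the episode) of the same order modeled as normalized relational rows versus as a single self-contained document of the kind a document store like RavenDB holds; all names and fields are hypothetical:

```python
# Hypothetical order data modeled two ways (illustrative only).

# Relational style: normalized rows in separate tables, linked by keys.
orders = [{"order_id": 1, "customer_id": 42}]
order_lines = [
    {"order_id": 1, "product": "widget", "qty": 3},
    {"order_id": 1, "product": "gadget", "qty": 1},
]

def load_order(order_id):
    """Reassemble the aggregate with an application-side join."""
    order = dict(next(o for o in orders if o["order_id"] == order_id))
    order["lines"] = [l for l in order_lines if l["order_id"] == order_id]
    return order

# Document style: the whole aggregate is one self-contained document,
# so a single read returns the order with no joins at all.
order_document = {
    "order_id": 1,
    "customer_id": 42,
    "lines": [
        {"product": "widget", "qty": 3},
        {"product": "gadget", "qty": 1},
    ],
}

assert load_order(1)["lines"] == order_document["lines"]
```

The trade-off the episode digs into follows directly from this shape: the document model makes aggregate reads and writes cheap, while the relational model makes cross-entity queries and constraints easier.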
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
This episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold.
Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
Data lakes are notoriously complex. For data engineers who battle to build and scale high-quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs, ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake, and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey and today I'm interviewing Oren Eini about the work of designing and building a NoSQL database engine.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what constitutes a NoSQL database?
How have the requirements and applications of NoSQL engines changed since they first became popular ~15 years ago?
What are the factors that convince teams to use a NoSQL vs. SQL database?
NoSQL is a generalized term that encompasses a number of different data models. How does the underlying representation (e.g. document, K/V, graph) change that calculus?
How has the evolution in data formats (e.g. N-dimensional vectors, point clouds, etc.) changed the landscape for NoSQL engines?
When designing and building a database, what is the initial set of questions that needs to be answered?
How many "core capabilities" can you reasonably design around before they conflict with each other?
How have you approached the evolution of RavenDB as you add new capabilities and mature the project?
What are some of the early decisions that had to be unwound to enable new capabilities?
If you were to start from scratch today, what database would you build?
What are the most interesting, innovative, or unexpected ways that you have seen RavenDB/NoSQL databases used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on RavenDB?
Governance is difficult for an organization of any size, and many struggle to execute on data management in an efficient manner. At Assurance, the team has utilized Starburst Galaxy to embed ownership within the data mesh framework, transforming the way the organization handles data. By granting data owners complete control and visibility over their data, Assurance enables a more nuanced and effective approach to data management. This approach not only fosters a sense of responsibility but also ensures that data is relevant, up to date, and aligned with the evolving needs of the organization. In this presentation, Shen Weng and Mitchell Polsons will discuss the strategic implementation of compute ownership in Starburst Galaxy, showing how it empowers teams to identify and resolve issues quickly, significantly improving the uptime of key computing operations. This approach is vital for achieving operational excellence, characterized by enhanced efficiency, reliability, and quality. Additionally, the new data setup has enabled the Assurance team to simplify data transformation processes using dbt and to improve data quality monitoring with Monte Carlo, further streamlining and strengthening their data management practices.
Have you ever wondered how a data company does data? In this session, Isaac Obezo, Staff Data Engineer at Starburst, will take you for a peek behind the curtain into Starburst’s own data architecture built to support batch processing of telemetry data within Galaxy data pipelines. Isaac will walk you through our architecture utilizing tools like git, dbt, and Starburst Galaxy to create a CI/CD process allowing our data engineering team to iterate quickly to deploy new models, develop and land data, and create and improve existing models in the data lake. Isaac will also discuss Starburst’s mentality toward data quality, the use of data products, and the process toward delivering quality analytics.
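As a rough sketch of the kind of CI gate such a git/dbt workflow can use, here is a short Python wrapper; the `state:modified+` selector and `--defer`/`--state` flags are standard dbt, but the script itself is our assumption, not Starburst's actual pipeline code:

```python
"""Hypothetical CI gate: build only the dbt models changed relative to
production state (plus their downstream dependents), and fail the check
if any model or test fails. Illustrative sketch only."""
import subprocess
import sys

# Assumed directory holding the production run's manifest.json,
# downloaded by an earlier CI step.
PROD_ARTIFACTS = "prod-run-artifacts"

result = subprocess.run([
    "dbt", "build",
    "--select", "state:modified+",   # changed models + downstream
    "--defer", "--state", PROD_ARTIFACTS,
])
sys.exit(result.returncode)  # a non-zero exit blocks the merge
```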
Join the team from Moody's Analytics as they take you on a personal journey of optimizing their data pipelines for data quality and governance. Like many data practitioners, Ryan understands the frustration and anxiety that come with accidentally introducing bad code into production pipelines; he's spent countless hours putting out fires caused by these unexpected changes. In this session, Ryan will recount his experiences with a previous data stack that lacked standardized testing methods and visibility into the impact of code changes on production data. He'll also share how their new data stack is safeguarded by Datafold's data diffing and continuous integration (CI) capabilities, which enable his team to work with greater confidence, peace of mind, and speed.
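To show the idea behind data diffing in miniature (Datafold's implementation works across databases and at far larger scale), here is a toy Python sketch that compares two versions of a hypothetical payments table by primary key:

```python
"""Toy value-level diff between two same-schema tables, keyed by primary
key. Table and column names are hypothetical; this only sketches the idea
behind production data diffing tools."""
import sqlite3

def diff_tables(conn, old, new, key="id", cols=("amount", "status")):
    """Return keys of rows that were added, removed, or changed."""
    def snapshot(table):
        rows = conn.execute(f"SELECT {key}, {', '.join(cols)} FROM {table}")
        return {r[0]: r[1:] for r in rows}
    a, b = snapshot(old), snapshot(new)
    changed = {k for k in a.keys() & b.keys() if a[k] != b[k]}
    return set(b) - set(a), set(a) - set(b), changed

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payments_prod (id INTEGER, amount REAL, status TEXT);
    CREATE TABLE payments_dev  (id INTEGER, amount REAL, status TEXT);
    INSERT INTO payments_prod VALUES (1, 9.99, 'paid'), (2, 5.00, 'open');
    INSERT INTO payments_dev  VALUES (1, 9.99, 'paid'), (2, 5.00, 'void'),
                                     (3, 1.25, 'open');
""")
added, removed, changed = diff_tables(conn, "payments_prod", "payments_dev")
print(added, removed, changed)  # {3} set() {2}
```

Run in CI against a staging build of the pipeline, a non-empty diff becomes a review checkpoint before a code change reaches production data.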
While everyone's talking about AI, far fewer have deployed it successfully or turned technology into business outcomes. This panel changes that, bringing together Generative AI experts for a deep dive into the practical application of generative AI. From building new customer offerings to refreshing internal processes, the panellists will reflect on the importance of data quality, data security, the responsible use of data as well as change management when it comes to embedding generative AI into the business strategy.
Drawing on his 2023 book ‘Confident Data Science’, Adam Nelson will show you how to measure your organization's data culture. Learn how to use this key metric to understand how well your organization’s culture performs along four key dimensions: Offering access to quality information about the data it has; providing the right access to the right people at the right time; investing in data skills development; and maintaining high data quality standards.
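As a purely illustrative sketch (the 0-5 scale, equal weights, and rollup below are our assumptions, not the scoring method from the book), the four dimensions might be combined into a single culture metric like this:

```python
# Hypothetical rollup of the four data-culture dimensions named above.
# Scores, scale, and equal weighting are illustrative assumptions.
dimensions = {
    "quality_information_about_data": 4,  # metadata/catalog maturity
    "right_access_right_people": 3,       # access management
    "data_skills_investment": 2,          # skills development
    "data_quality_standards": 5,          # maintained quality standards
}
MAX_SCORE = 5
score = sum(dimensions.values()) / (MAX_SCORE * len(dimensions))
print(f"Data culture score: {score:.0%}")  # -> 70%
```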
As organizations explore and expand their AI capabilities, Chief Data Officers are now responsible for governing data for responsible and trustworthy AI. This session will cover five key principles for ensuring the successful adoption and scaling of AI initiatives that align with the company's business strategy. From data quality to advocating for ethical AI practices, the Chief Data Officer's mandate has expanded to include compliance with new AI regulations.
Peggy Tsai, Chief Data Officer at BigID and adjunct faculty member at Carnegie Mellon University for the Chief Data Officer executive program, will provide insights into the AI governance strategies and outcomes crucial for cultivating an AI-first organization. Drawing on her extensive experience in data governance and AI, she will offer invaluable guidance for participants aiming to adopt industry-leading practices.
Spend less time prepping data and more time gaining insights with Gemini in BigQuery. In this session, you'll discover how to visually transform your data with AI for streamlined analysis. Witness a live demo of BigQuery data preparation. Seattle Children's will demonstrate the transformative effect of AI on data engineer productivity and accelerating development. Plus, get a sneak peek into the exciting roadmap of features including expanded connectivity, continuous integration and delivery workflows, and robust data quality.
Data quality is the most important attribute of a successful data platform: it accelerates data adoption and empowers an organization to make data-driven decisions. However, traditional profiling-based, count-based, and business-rule-based approaches to data quality are outdated and impractical for petabyte-scale data platforms where billions of rows are processed every day. In this talk, Sandhya Devineni and Rajesh Gundugollu will present a framework for using machine learning to detect data quality issues at scale in data products. The two data leaders at Asurion will highlight the lessons learned over years of crafting an advanced state of data quality using machine learning at scale, and discuss the pain points and blind spots of traditional data quality processes. After sharing lessons learned, the pair will dive into their implemented framework, which can be used to improve the accuracy and reliability of data-driven decisions by identifying bad-quality data records and revolutionizing how organizations approach data-driven decision making.
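The talk covers Asurion's own framework; as a generic illustration of the ML-based approach (feature choices, contamination rate, and data below are our assumptions), here is a small sketch that flags anomalous records with an Isolation Forest:

```python
"""Sketch of ML-based data quality detection: train an Isolation Forest on
numeric features of a data product and flag outlying records for review.
Illustrative only; not Asurion's framework."""
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Hypothetical features per record, e.g. (order_total, item_count).
normal = rng.normal(loc=[50.0, 3.0], scale=[10.0, 1.0], size=(10_000, 2))
corrupt = np.array([[-500.0, 0.0], [9_999.0, 1.0]])  # bad loads
rows = np.vstack([normal, corrupt])

# contamination = expected share of bad records (an assumption).
model = IsolationForest(contamination=0.001, random_state=0).fit(rows)
flags = model.predict(rows)           # -1 marks suspected bad records
print(f"{(flags == -1).sum()} suspect rows out of {len(rows)}")
```

Unlike hand-written rules, the model's notion of "normal" is learned from the data itself, which is what makes the approach tractable at billions of rows.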
Join us for an insightful session on the evolving landscape of Data Quality and Observability practices, transitioning from manual to augmented approaches driven by semantics and GenAI. Discover the framework enabling organisations to build the architecture for conversational data quality, leaving behind the limitations of traditional, resource-heavy methods and legacy technology. Learn why context is paramount in data quality and observability, and leave with actionable insights to propel your organisation into the future of data management.
When data is the most valuable asset of your company, protecting it is a non-negotiable. While Information Security professionals are focused on Bad Actors, we have data operations and data governance professionals focused on Bad Data… Are they one and the same? What’s similar and what’s different between the worlds of data integrity and data security?
Drawing from a wealth of experience and real-world challenges, Gorkem will shed light on the pivotal role of data quality in the forefront of information security. We’ll discuss opportunities for early detection, auto-detection, and the establishment of tiered rules to manage and remediate bad data effectively. Learn how proactive governance and observability can transform data management from a reactive stance to a formidable defense mechanism, ensuring the integrity and security of your data ecosystem.
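To make the tiered-rules idea concrete, here is a minimal Python sketch (our illustration; the rule names, tiers, and actions are assumptions, not Gorkem's framework):

```python
"""Tiered data quality rules: each rule carries a severity tier that
decides whether failing records block the pipeline, are quarantined for
remediation, or merely raise a warning. Illustrative sketch only."""
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    tier: int  # 1 = block pipeline, 2 = quarantine record, 3 = warn only
    check: Callable[[dict], bool]  # True means the record passes

RULES = [
    Rule("non_null_id", 1, lambda r: r.get("id") is not None),
    Rule("amount_in_range", 2, lambda r: 0 <= r.get("amount", -1) <= 10_000),
    Rule("known_country", 3, lambda r: r.get("country") in {"US", "DE", "GB"}),
]

def triage(record: dict):
    """Apply all rules and escalate to the most severe failing tier."""
    failures = [rule for rule in RULES if not rule.check(record)]
    worst = min((rule.tier for rule in failures), default=None)
    action = {1: "block", 2: "quarantine", 3: "warn", None: "pass"}[worst]
    return action, [rule.name for rule in failures]

print(triage({"id": 7, "amount": 12.5, "country": "US"}))    # ('pass', [])
print(triage({"id": None, "amount": 12.5, "country": "US"})) # blocks
print(triage({"id": 8, "amount": 12.5, "country": "FR"}))    # warns only
```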
It’s a tale as old as time: a data migration that was supposed to take months turns into years turns into something that no longer has an end date—all while going over budget and increasing in complexity every day. In this session, Gleb is going deep on the methods, tooling, and hard lessons learned during a years-long migration at Lyft. Specifically, he'll share how you can leverage data quality testing methodologies like cross-database diffing to accelerate a data migration without sacrificing data quality. You should walk away with practices that will allow your data team to plan, move, and audit database objects with speed and confidence during a migration.
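As a toy version of the cross-database diffing idea (production tools do this efficiently across heterogeneous engines; everything below is a hypothetical stand-in), here is a sketch comparing row counts and order-independent fingerprints between a legacy table and its migrated copy:

```python
"""Toy cross-database parity check for a migration: compare row counts and
an order-independent fingerprint of each table before cutover. Names and
schema are hypothetical; real diffing tools are far more thorough."""
import hashlib
import sqlite3

def fingerprint(conn, table):
    """Row count plus an XOR of per-row hashes (insensitive to row order)."""
    count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    acc = 0
    for row in conn.execute(f"SELECT * FROM {table}"):
        digest = hashlib.sha256(repr(row).encode()).digest()
        acc ^= int.from_bytes(digest[:8], "big")
    return count, acc

# Stand-ins for the legacy and target systems.
legacy, target = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
for db in (legacy, target):
    db.execute("CREATE TABLE rides (id INTEGER, fare REAL)")
    db.executemany("INSERT INTO rides VALUES (?, ?)", [(1, 12.5), (2, 30.0)])
target.execute("UPDATE rides SET fare = 31.0 WHERE id = 2")  # silent drift

if fingerprint(legacy, "rides") == fingerprint(target, "rides"):
    print("tables match: safe to proceed")
else:
    print("discrepancy found: audit before cutover")
```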
In today's data-driven world, organizations face the challenge of not only harnessing the power of data but also ensuring its responsible and effective use. This panel discussion will delve into the critical components of embedding data governance and data literacy into the fabric of organizational culture. Data governance forms the foundation of a robust data strategy, encompassing policies, processes, and frameworks to ensure data quality, integrity, and security. However, effective governance requires more than just frameworks; it necessitates a cultural shift where data stewardship is ingrained into every aspect of organizational operations. Moreover, data literacy is paramount in enabling individuals across an organization to effectively interpret, analyze, and derive insights from data. By cultivating a culture of data literacy, organizations empower employees to make informed decisions, driving innovation and growth. This panel will explore strategies for fostering a culture of accountability, collaboration, and trust around data practices driving sustainable success in today's dynamic business landscape.
In a classic cart-before-the-horse scenario, many companies have jumped at leveraging generative AI and other AI technologies. However, most of those same companies haven't completed the core work of building a reliable and secure foundation that provides data accessibility and analytics speed while ensuring data quality. The resulting risk for leaders is overinvestment in AI programs that may not have accurate and secure data access, further exposing the business to harm. It is a case of slowing down to speed up: ensure the foundation is solid before you build the house. In this talk, Starburst CEO Justin Borgman and Subodh Kumar, Head of Partner Solutions Architecture, Data & Analytics - AI/ML at AWS, will cover the essential data foundations for AI success: the foundation, the plumbing, and the framing that will set businesses up for AI success.
The need for an executive responsible for an organization’s information assets today may seem obvious. But some organizations still struggle with making a business case for the role. And even existing chief data officers can be confounded about how to formally justify their existence. This session will share eye-popping findings and analyses from Mr. Laney’s study of hundreds of organizations with and without a CDO.
As any good scientist knows, and any good data scientist should know, most discoveries begin with a hypothesis. We see a lot of surveys about the CDO role, but they rarely have much of a point to make or examine the impact a CDO actually has. This study examined over 500 organizations to determine how businesses with a CDO operate differently.
Drawing from the study's conclusions, attendees will learn about the benefits of a CDO, and how having one affects data quality, governance, data democratization and monetization. We'll explore whether having a CDO affects an organization's ability to value its data and how investors perceive it, and look at the career path of CDOs to better understand what makes an actual C-level CDO.
Welcome to the cozy corner of the tech world where ones and zeros mingle with casual chit-chat. Datatopics Unplugged is your go-to spot for relaxed discussions around tech, news, data, and society.
Dive into conversations that should flow as smoothly as your morning coffee (but don't), where industry insights meet laid-back banter. Whether you're a data aficionado or just someone curious about the digital age, pull up a chair, relax, and let's explore the complex intersections of data, unplugged style!
In this episode #44, titled "Unpacking Open Source: A Look at GX, Monetization, Ruff's Controversy, the xz Hack & more", we're thrilled to have Paolo Léonard joining us to unpack the latest in technology and open-source discussions. Get ready for an enlightening journey through innovation, challenges, and hot takes in the digital realm.
GX Cloud Unveiled: Paolo gives his first impression of the latest cloud data quality tool, GX Cloud.
Open Source Monetization: Delving into the trade-offs between open-source projects, their managed counterparts, and other strategies for making open source financially sustainable, with Astral, FastAPI, and Prefect's roles in this space.
The Open Source Controversy with Ruff: A discussion of the ethical considerations when open-source projects turn profit-focused, highlighted by Ruff.
Addressing the xz Hack: Delving into the challenges highlighted by the xz backdoor discovery and how the community responds to these security threats.
Jumping on the Mojo Train?: A conversation on Mojo's decision to open source its standard library and its impact on the future of modular machine learning.
Becoming 'Clout' Certified: Hot takes on the value and impact of clout certification in the tech industry.
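For readers unfamiliar with the open-source project behind GX Cloud, here is a flavor of Great Expectations using its classic pandas API; note that GX Cloud and newer GX releases expose different interfaces, so treat this as an illustration of the expectation concept rather than current GX Cloud usage:

```python
"""Flavor of Great Expectations' classic pandas API (pre-1.0 style).
Sample data is hypothetical; newer GX versions use different entry points."""
import pandas as pd
import great_expectations as ge

df = ge.from_pandas(pd.DataFrame({
    "user_id": [1, 2, None, 4],
    "age": [34, 29, 41, 250],
}))

# Expectations are declarative quality checks with pass/fail results.
r1 = df.expect_column_values_to_not_be_null("user_id")
r2 = df.expect_column_values_to_be_between("age", min_value=0, max_value=120)
print(r1["success"], r2["success"])  # False False -> both checks fail here
```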
Summary
Maintaining a single source of truth for your data is the biggest challenge in data engineering. Different roles and tasks in the business need their own ways to access and analyze the data in the organization. In order to enable this use case, while maintaining a single point of access, the semantic layer has evolved as a technological solution to the problem. In this episode Artyom Keydunov, creator of Cube, discusses the evolution and applications of the semantic layer as a component of your data platform, and how Cube provides speed and cost optimization for your data consumers.
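To illustrate the core idea in miniature, here is a toy Python sketch of a semantic layer: metrics defined once, centrally, with the same SQL generated for every consumer. This is our illustration of the concept, not Cube's implementation (Cube models are typically defined in YAML or JavaScript):

```python
"""Toy semantic layer: one central metric definition, compiled to SQL on
demand, so every consumer shares a single source of truth. Metric and
table names are hypothetical."""

METRICS = {
    # metric name -> (SQL aggregation, source table)
    "revenue": ("SUM(amount)", "orders"),
    "order_count": ("COUNT(*)", "orders"),
}

def compile_query(metric: str, dimension: str) -> str:
    """Translate a semantic request into SQL; 'revenue' means the same
    thing no matter which dashboard or notebook asks for it."""
    agg, table = METRICS[metric]
    return (f"SELECT {dimension}, {agg} AS {metric}\n"
            f"FROM {table}\nGROUP BY {dimension}")

# A BI tool and a notebook requesting revenue by month get identical SQL.
print(compile_query("revenue", "order_month"))
```

A dedicated layer like Cube adds the parts this sketch omits: access control, caching and pre-aggregations for speed and cost optimization, and SQL/REST/GraphQL APIs for each consumer.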
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Your host is Tobias Macey and today I'm interviewing Artyom Keydunov about the role of the semantic layer in your data platform.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by outlining the technical elements of what it means to have a "semantic layer"?
In the past couple of years there was a rapid hype cycle around the "metrics layer" and "headless BI", which has largely faded. Can you give your assessment of the current state of the industry around the adoption/implementation of these concepts?
What are the benefits of having a discrete service that offers the business metrics/semantic mappings as opposed to implementing those concepts as part of a more general system? (e.g. dbt, BI, warehouse marts, etc.)
At what point does it become necessary/beneficial for a team to adopt such a service?
What are the challenges involved in retrofitting a semantic layer into a production data system?
Evolution of requirements/usage patterns
Technical complexities/performance and cost optimization
What are the most interesting, innovative, or unexpected ways that you have seen Cube used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Cube?
Driving trust with data is essential to succeeding with analytics. However, time and time again, data quality remains an issue for most organizations today. In this session, Esther Munyi, Chief Data Officer at Sasfin, Amy Grace, Director, Military Engines Digital Strategy at Pratt & Whitney, Stefaan Verhulst, Chief Research & Development Officer, Director of Data Program at NYU Governance Lab, and Malarvizhi Veerappan, Program Manager and Senior Data Scientist at the World Bank will focus on strategies for improving data quality, fostering a culture of trust around data, and balancing robust governance with the need for accessible, high-quality data.