talk-data.com

Topic

Data Quality

data_management data_cleansing data_validation

537 tagged

Activity Trend

Peak of 82 activities per quarter, 2020-Q1 to 2026-Q1

Activities

537 activities · Newest first

There are many things to think about on your Generative AI journey, but in this talk we’ll focus on two key ones. First, have you identified use cases that will solve real business problems? Second, is your data platform prepared?

In this session, Cynozure Solution Architect Tom Wilson will share his experience of what it takes to successfully integrate GenAI across your data platform, touching on data quality, governance and model management. He’ll also share the practical applications of GenAI, both now and in the near future, that will positively shake up the way we do business and deliver value to the organisations that embrace them.

Join us to learn about: 

• The important role your data platform plays in unlocking GenAI’s potential 

• Lessons, experiences and watch-outs from doing this 

• Use cases GenAI is best suited to right now 

• The complex and evolved use cases we can expect to see going forward  

This talk will explore a platform strategy that emphasizes decentralizing data and analytics to strike an optimal balance between autonomy and governance, increasing iteration and innovation speed while ensuring regulatory compliance. Attendees will learn how to support the entire data product lifecycle, enabling teams to operate independently while adhering to governance and architectural standards.

The discussion will highlight the following key areas:

1. Autonomy and Innovation: How decentralized data platforms empower teams to innovate faster by reducing dependencies and bottlenecks. Examples of successful implementations will be provided, illustrating how autonomy can lead to increased iteration and innovation speed.

2. Governance and Compliance: Strategies for maintaining robust governance frameworks that ensure data quality, security, and compliance with regulations such as GDPR and HIPAA. The talk will cover tools and best practices for monitoring and enforcing compliance in a decentralized environment.

3. Data Product Lifecycle: A comprehensive approach to supporting the data product lifecycle, from prototyping through operations, monitoring, and change management.

4. Adoption: Real-world scenarios where organizations have navigated the trade-offs between autonomy and governance, creating the right conditions for platform adoption.

The success of any AI strategy hinges on the quality, accessibility, and relevance of the data that powers it. Data products play a crucial role in this context by transforming raw data into valuable, trusted, and purpose-built data assets that fuel AI-driven innovation and decision-making.

By integrating data products into our AI initiatives, we can:

- Accelerate AI Development

- Enhance Decision-Making

- Foster Innovation

- Ensure Data Quality

Join us to learn how Starburst Data Products are feeding data-hungry AI strategies across the enterprise to improve productivity, unlock new opportunities, drive competitive advantage, and lead in the era of intelligent business.

In this episode, host Jason Foster sits down with Anthony Deighton, CEO at Tamr, to delve into the complexities of data quality and analytics. They explore the challenges organisations face in managing and improving data quality, the pivotal role of AI in addressing these challenges, and strategies for aligning data quality initiatives with business objectives. They also explore the evolving role of central data teams, led by Chief Data Officers, in spearheading enterprise-wide data quality initiatives and how businesses can effectively tackle key challenges.


Cynozure is a leading data, analytics and AI company that helps organisations reach their data potential. They work with clients on data and AI strategy, data management, data architecture and engineering, analytics and AI, data culture and literacy, and change management and leadership. The company was named one of The Sunday Times' fastest-growing private companies in 2022 and 2023, and the Best Place to Work in Data by DataIQ in 2023.

One of the prerequisites for being able to do great data analyses is that the data is well structured, clean, and high quality. For individual projects, this is often annoying to get right. On a corporate level, it’s often a huge blocker to productivity. And then there’s healthcare data. When you consider all the healthcare records across the USA, or any other country for that matter, there are so many data formats created by so many different organizations that it’s frankly a horrendous mess. This is a big problem because there’s a treasure trove of data that researchers and analysts can’t make use of to answer questions about which medical interventions work or not. Bad data is holding back progress on improving everyone’s health.

Terry Myerson is the CEO and Co-Founder of Truveta. Truveta enables scientifically rigorous research on more than 18% of the clinical care in the U.S. from a growing collective of more than 30 health systems. Previously, Terry enjoyed a 21-year career at Microsoft. As Executive Vice President, he led the development of Windows, Surface, Xbox, and the early days of Office 365, while serving on the Senior Leadership Team of the company. Prior to Microsoft, he co-founded Intersé, one of the earliest Internet companies, which Microsoft acquired in 1997.

In the episode, Richie and Terry explore the current state of health records, challenges when working with health records, data challenges including privacy and accessibility, data silos and fragmentation, AI and NLP for fragmented data, regulatory grade AI, ongoing data integration efforts in healthcare, the future of healthcare and much more.

Links Mentioned in the Show:
Truveta
Connect with Terry
HIPAA
Course - Introduction to Data Privacy
Related Episode: Using AI to Improve Data Quality in Healthcare
Rewatch sessions from RADAR: AI Edition

New to DataCamp? Learn on the go using the DataCamp mobile app. Empower your business with world-class data and AI skills with DataCamp for business.

Summary
In this episode of the Data Engineering Podcast, host Tobias Macey welcomes back Chris Bergh, CEO of DataKitchen, to discuss his ongoing mission to simplify the lives of data engineers. Chris explains the challenges faced by data engineers, such as constant system failures, the need for rapid changes, and high customer demands. Chris delves into the concept of DataOps, its evolution, and the misappropriation of related terms like data mesh and data observability. He emphasizes the importance of focusing on processes and systems rather than just tools to improve data engineering workflows. Chris also introduces DataKitchen's open-source tools, DataOps TestGen and DataOps Observability, designed to automate data quality validation and monitor data journeys in production.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey and today I'm interviewing Chris Bergh about his tireless quest to simplify the lives of data engineers.

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what DataKitchen is and the story behind it?
You helped to define and popularize "DataOps", which then went through a journey of misappropriation similar to "DevOps", and has since faded in use. What is your view on the realities of "DataOps" today?
Out of the popularized wave of "DataOps" tools came subsequent trends in data observability, data reliability engineering, etc. How have those cycles influenced the way that you think about the work that you are doing at DataKitchen?
The data ecosystem went through a massive growth period over the past ~7 years, and we are now entering a cycle of consolidation. What are the fundamental shifts that we have gone through as an industry in the management and application of data?
What are the challenges that never went away?
You recently open sourced the dataops-testgen and dataops-observability tools. What are the outcomes that you are trying to produce with those projects?
What are the areas of overlap with existing tools and what are the unique capabilities that you are offering?
Can you talk through the technical implementation of your new observability and quality testing platform?
What does the onboarding and integration process look like?
Once a team has one or both tools set up, what are the typical points of interaction that they will have over the course of their workday?
What are the most interesting, innovative, or unexpected ways that you have seen dataops-observability/testgen used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on promoting DataOps?
What do you have planned for the future of your work at DataKitchen?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
DataKitchen
Podcast Episode
NASA
DataOps Manifesto
Data Reliability Engineering
Data Observability
dbt
DevOps Enterprise Summit
Building The Data Warehouse by Bill Inmon (affiliate link)
dataops-testgen, dataops-observability
Free Data Quality and Data Observability Certification
Databricks
DORA Metrics
DORA for data

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

In Analytics and Data Science departments, we've got a pretty good sense for why investing in data is important for any organization. But how well could you pitch your company to spend its precious resources on improving data quality or better data management practices? Could you tell that data story to the right stakeholders when it matters?

In this episode, you'll hear from The Data Whisperer, Scott Taylor, sharing his best advice and practical tips for becoming a better storyteller and getting people to take action.

What You'll Learn:
Why storytelling is a key skill for anyone who works in data
The importance of data management, and what that really means
Practical tips and frameworks for telling an effective data story

Register for free to be part of the next live session: https://bit.ly/3XB3A8b

About our guest: Scott Taylor
The Data Whisperer, Scott Taylor, has helped countless companies by enlightening business executives to the strategic value of master data and proper data management. He focuses on business alignment and the "strategic WHY" rather than system implementation and the "technical HOW." At MetaMeta Consulting he works with Enterprise Data Leadership teams and Innovative Tech Brands to tell their data story.
Get Scott's book: Telling Your Data Story: Data Storytelling for Data Management
Follow Scott on LinkedIn

Follow us on Socials: LinkedIn YouTube Instagram (Mavens of Data) Instagram (Maven Analytics) TikTok Facebook Medium X/Twitter

One of the most annoying conversations about data that happens far too often is: “Can you do an analysis and answer this business problem for me?” “Sure, where’s the data?” “I don’t know. Probably in one of our databases.” At this point more time is spent hunting for data than actually analyzing it. Rather than grumbling about it, it would obviously be more productive to learn how to solve data discoverability issues. What’s the best way to properly document data sets? How can you avoid spending all your time maintaining dashboards that no one actually uses?

Shinji Kim is the Founder & CEO of Select Star, an automated data discovery platform that helps you understand your data. Previously, she was the CEO of Concord Systems (concord.io), a NYC-based data infrastructure startup acquired by Akamai Technologies in 2016. She led building Akamai’s new IoT data platform for real-time messaging, log processing, and edge computing. Prior to Concord, Shinji was the first Product Manager hired at Yieldmo, where she led the Ad Format Lab, A/B testing, and yield optimization. Before Yieldmo, she was analyzing data and building enterprise applications at Deloitte Consulting, Facebook, Sun Microsystems, and Barclays Capital. Shinji studied Software Engineering at University of Waterloo and General Management at Stanford GSB. She advises early stage startups on product strategy, customer development, and company building.

In the episode, Richie and Shinji explore the importance of data governance, the utilization of data, data quality, challenges in data usage, why documentation matters, metadata and data lineage, improving collaboration between data and business teams, data governance trends to look forward to, and much more.

Links Mentioned in the Show:
Select Star
Connect with Shinji
[Course] Data Governance Concepts
Related Episode: Making Data Governance Fun with Tiankai Feng, Data Strategy & Data Governance Lead at ThoughtWorks
Rewatch sessions from RADAR: AI Edition

New to DataCamp? Learn on the go using the DataCamp mobile app. Empower your business with world-class data and AI skills with DataCamp for business.

Summary
Data contracts are both an enforcement mechanism for data quality, and a promise to downstream consumers. In this episode Tom Baeyens returns to discuss the purpose and scope of data contracts, emphasizing their importance in achieving reliable analytical data and preventing issues before they arise. He explains how data contracts can be used to enforce guarantees and requirements, and how they fit into the broader context of data observability and quality monitoring. The discussion also covers the challenges and benefits of implementing data contracts, the organizational impact, and the potential for standardization in the field.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
At Outshift, the incubation engine from Cisco, they are driving innovation in AI, cloud, and quantum technologies with the powerful combination of enterprise strength and startup agility. Their latest innovation for the AI ecosystem is Motific, addressing a critical gap in going from prototype to production with generative AI. Motific is your vendor and model-agnostic platform for building safe, trustworthy, and cost-effective generative AI solutions in days instead of months. Motific provides easy integration with your organizational data, combined with advanced, customizable policy controls and observability to help ensure compliance throughout the entire process. Move beyond the constraints of traditional AI implementation and ensure your projects are launched quickly and with a firm foundation of trust and efficiency. Go to motific.ai today to learn more!
Your host is Tobias Macey and today I'm interviewing Tom Baeyens about using data contracts to build a clearer API for your data.

Interview
Introduction
How did you get involved in the area of data management?
Can you describe the scope and purpose of data contracts in the context of this conversation?
In what way(s) do they differ from data quality/data observability?
Data contracts are also known as the API for data, can you elaborate on this?
What are the types of guarantees and requirements that you can enforce with these data contracts?
What are some examples of constraints or guarantees that cannot be represented in these contracts?
Are data contracts related to the shift-left?
The obvious application of data contracts is in the context of pipeline execution flows to prevent failing checks from propagating further in the data flow. What are some of the other ways that these contracts can be integrated into an organization's data ecosystem?
How did you approach the design of the syntax and implementation for Soda's data contracts?
Guarantees and constraints around data in different contexts have been implemented in numerous tools and systems. What are the areas of overlap in e.g. dbt, great expectations?
Are there any emerging standards or design patterns around data contracts/guarantees that will help encourage portability and integration across tooling/platform contexts?
What are the most interesting, innovative, or unexpected ways that you have seen data contracts used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data contracts at Soda?
When are data contracts the wrong choice?
What do you have planned for the future of data contracts?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
Soda
Podcast Episode
JBoss
Data Contract
Airflow
Unit Testing
Integration Testing
OpenAPI
GraphQL
Circuit Breaker Pattern
SodaCL
Soda Data Contracts
Data Mesh
Great Expectations
dbt Unit Tests
Open Data Contracts
ODCS == Open Data Contract Standard
ODPS == Open Data Product Specification

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
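
To make the episode's core idea concrete, here is a minimal, hypothetical sketch of what a data contract can enforce: column-level guarantees (presence, type, simple value constraints) checked against a batch of records before it reaches downstream consumers. The contract dictionary and the validate_batch helper are illustrative only and are not Soda's actual contract syntax.

```python
# Illustrative data contract: expected columns, their types, and simple constraints.
contract = {
    "columns": {
        "customer_id": {"type": str, "required": True},
        "signup_date": {"type": str, "required": True},
        "lifetime_value": {"type": float, "required": False, "min": 0.0},
    }
}

def validate_batch(rows, contract):
    """Return a list of human-readable contract violations for a batch of records."""
    violations = []
    for i, row in enumerate(rows):
        for name, rules in contract["columns"].items():
            value = row.get(name)
            if value is None:
                if rules.get("required"):
                    violations.append(f"row {i}: missing required column '{name}'")
                continue
            if not isinstance(value, rules["type"]):
                violations.append(f"row {i}: '{name}' has unexpected type {type(value).__name__}")
            elif "min" in rules and value < rules["min"]:
                violations.append(f"row {i}: '{name}' is below minimum {rules['min']}")
    return violations

batch = [
    {"customer_id": "c-001", "signup_date": "2024-05-01", "lifetime_value": 120.5},
    {"customer_id": "c-002", "lifetime_value": -3.0},  # missing signup_date, negative value
]
for violation in validate_batch(batch, contract):
    print(violation)
```

In practice a check like this would run at the boundary of a pipeline, so a failing batch is stopped before it propagates further into the data flow.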

Ready for more ideas about UX for AI and LLM applications in enterprise environments? In part 2 of my topic on UX considerations for LLMs, I explore how an LLM might be used for a fictitious use case at an insurance company—specifically, to help internal tools teams get rapid access to primary qualitative user research. (Yes, it’s a little “meta”, and I’m also trying to nudge you with this hypothetical example—no secret!) ;-) My goal with these episodes is to share questions you might want to ask yourself such that any use of an LLM is actually contributing to a positive UX outcome. Join me as I cover the implications for design, the importance of foundational data quality, the balance between creative inspiration and factual accuracy, and the never-ending discussion of how we might handle hallucinations and errors posing as “facts”—all with a UX angle. At the end, I also share a personal story where I used an LLM to help me do some shopping for my favorite product: TRIP INSURANCE! (NOT!)

Highlights/ Skip to:

(1:05) I introduce a hypothetical internal LLM tool and what the goal of the tool is for the team who would use it
(5:31) Improving access to primary research findings for better UX
(10:19) What “quality data” means in a UX context
(12:18) When LLM accuracy maybe doesn’t matter as much
(14:03) How AI and LLMs are opening the door for fresh visioning work
(15:38) Brian’s overall take on LLMs inside enterprise software as of right now
(18:56) Final thoughts on UX design for LLMs, particularly in the enterprise
(20:25) My inspiration for these 2 episodes—and how I had to use ChatGPT to help me complete a purchase on a website that could have integrated this capability right into their website

Quotes from Today’s Episode

“If we accept that the goal of most product and user experience research is to accelerate the production of quality services, products, and experiences, the question is whether or not using an LLM for these types of questions is moving the needle in that direction at all. And secondly, are the potential downsides like hallucinations and occasional fabricated findings, is that all worth it? So, this is a design for AI problem.” - Brian T. O’Neill (8:09)

“What’s in our data? Can the right people change it when the LLM is wrong? The data product managers and AI leaders reading this or listening know that the not-so-secret path to the best AI is in the foundational data that the models are trained on. But what does the word quality mean from a product standpoint and a risk reduction one, as seen from an end-users’ perspective? Somebody who’s trying to get work done? This is a different type of quality measurement.” - Brian T. O’Neill (10:40)

“When we think about fact retrieval use cases in particular, how easily can product teams—internal or otherwise—and end-users understand the confidence of responses? When responses are wrong, how easily, if at all, can users and product teams update the model’s responses? Errors in large language models may be a significant design consideration when we design probabilistic solutions, and we no longer control what exactly our products and software are going to show to users. If bad UX can include leading people down the wrong path unknowingly, then AI is kind of like the team on the other side of the tug of war that we’re playing.” - Brian T. O’Neill (11:22) “As somebody who writes a lot for my consulting business, and composes music in another, one of the hardest parts for creators can be the zero-to-one problem of getting started—the blank page—and this is a place where I think LLMs have great potential. But it also means we need to do the proper research to understand our audience, and when or where they’re doing truly generative or creative work—such that we can take a generative UX to the next level that goes beyond delivering banal and obviously derivative content.” - Brian T. O’Neill (13:31) “One thing I actually like about the hype, investment, and excitement around GenAI and LLMs in the enterprise is that there is an opportunity for organizations here to do some fresh visioning work. And this is a place that designers and user experience professionals can help data teams as we bring design into the AI space.” - Brian T. O’Neill (14:04)

“If there was ever a time to do some new visioning work, I think now is one of those times. However, we need highly skilled design leaders to help facilitate this in order for this to be effective. Part of that skill is knowing who to include in exercises like this, and my perspective, one of those people, for sure, should be somebody who understands the data science side as well, not just the engineering perspective. And as I posited in my seminar that I teach, the AI and analytical data product teams probably need a fourth member. It’s a quartet and not a trio. And that quartet includes a data expert, as well as that engineering lead.” - Brian T. O’Neill (14:38)

Links
Perplexity.ai: https://perplexity.ai
Ideaflow: https://www.amazon.com/Ideaflow-Only-Business-Metric-Matters/dp/0593420586
My article that inspired this episode

Data quality is the foundation of everything we do as Data Analysts and Data Scientists. So why do so many organizations suffer from dirty data? And what can you do to clean it up? In this session, we'll share some of the best data cleaning strategies and real, actionable advice from The Classification Guru, Susan Walsh. You'll leave with a solid plan to start identifying problems with your data and, most importantly, to start fixing them on the path to clean data.

What You'll Learn:
Why dirty data is such a big problem, and the benefits of cleaning it up
The most common types of dirty data you should be on the lookout for
Where you should focus your data cleaning efforts to make the biggest impact

Register for free to be part of the next live session: https://bit.ly/3XB3A8b

About our guest: Susan Walsh is a specialist in data classification, taxonomy customisation and data cleansing. She also created the COAT philosophy, which is at the core of The Classification Guru's work. By bringing clarity and accuracy to data and procurement, Susan helps teams work more effectively and efficiently. More than a numbers gal, Susan is also an industry thought leader, TEDx speaker and author of 'Between the Spreadsheets: Classifying and Fixing Dirty Data'. She has spoken globally at events such as ProcureCon, Big Data LDN, and Big Data & AI World, and she cuts through the jargon to address the issues of dirty data and its consequences in an entertaining and engaging way.
Fix your dirty data now: www.theclassificationguru.co

Follow us on Socials: LinkedIn YouTube Instagram (Mavens of Data) Instagram (Maven Analytics) TikTok Facebook Medium X/Twitter
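
As a flavor of the kind of cleanup a session like this covers, here is a small illustrative sketch (not Susan Walsh's COAT methodology) that tackles three common types of dirty data with pandas: stray whitespace, inconsistent casing, and duplicate rows. The toy supplier-spend data is invented for the example.

```python
import pandas as pd

# Toy supplier spend data with common dirty-data problems:
# stray whitespace, inconsistent casing, and duplicate rows.
df = pd.DataFrame({
    "supplier": [" Acme Corp", "acme corp", "ACME CORP ", "Globex"],
    "category": ["IT Hardware", "it hardware", "IT hardware", "Facilities"],
    "spend": [1200.0, 1200.0, 850.0, 430.0],
})

# Normalise text columns so the same supplier/category collapses to one value
for col in ["supplier", "category"]:
    df[col] = df[col].str.strip().str.lower()

# Drop exact duplicates created by repeated loads
df = df.drop_duplicates()

# Spend per supplier is now reported against a single, consistent name
print(df.groupby("supplier")["spend"].sum())
```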

Andrew Jones, principal engineer at GoCardless, is the author of the book "Driving Data Quality with Data Contracts." During this session, we talked a lot about what a data platform is, who data platform engineers are, what it takes to make a data platform reliable, and, most importantly, how Andrew and his team managed to build a reliable platform at GoCardless. Sure enough, we also touched a little on data contracts, their implementation, and the possibility of vendors doing the same as Andrew's team did.
Andrew's LinkedIn - https://www.linkedin.com/in/andrewrhysjones/

Generative AI's transformative power underscores the critical need for high-quality data. In this session, Barr Moses, CEO of Monte Carlo Data, Prukalpa Sankar, Cofounder at Atlan, and George Fraser, CEO at Fivetran, discuss the nuances of scaling data quality for generative AI applications, highlighting the unique challenges and considerations that come into play. Throughout the session, they share best practices for data and AI leaders to navigate these challenges, ensuring that governance remains a focal point even amid the AI hype cycle.

Links Mentioned in the Show:
Rewatch Session from RADAR: AI Edition

New to DataCamp? Learn on the go using the DataCamp mobile app. Empower your business with world-class data and AI skills with DataCamp for business.

Every data team out there is being asked by their business stakeholders about Generative AI. Taking LLM-centric workloads to production is not a trivial task. At the foundational level, there is a set of challenges around data delivery, data quality, and data ingestion that mirror traditional data engineering problems. Once you're past those, there is a set of challenges related to the underlying use case you're trying to solve. Thankfully, because of how Airflow was already being used at these companies for data engineering and MLOps use cases, it has become the de facto orchestration layer behind many GenAI use cases for startups and Fortune 500s. This talk will be a tour of various methods, best practices, and considerations used in the Airflow community when taking GenAI use cases to production. We'll focus on four primary use cases: RAG, fine-tuning, resource management, and batch inference, and walk through patterns different members of the community have used to productionize this new, exciting technology.
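
As a rough illustration of the RAG pattern mentioned here, the sketch below uses Airflow's TaskFlow API (Airflow 2.x assumed) to chain document extraction, chunking/embedding, and loading into a vector store. The task bodies are placeholders invented for the example; a real pipeline would call an embedding model and a vector database rather than returning stub values.

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def rag_ingestion():
    @task
    def extract_documents():
        # Placeholder: pull newly arrived documents from a source system
        return ["doc1.txt", "doc2.txt"]

    @task
    def chunk_and_embed(paths):
        # Placeholder: split each document into chunks and compute embeddings
        return [{"path": p, "n_chunks": 10} for p in paths]

    @task
    def load_to_vector_store(chunks):
        # Placeholder: upsert the chunk embeddings into a vector database
        print(f"Loaded {len(chunks)} documents into the vector store")

    # Wire the tasks together; Airflow infers dependencies from the data flow
    load_to_vector_store(chunk_and_embed(extract_documents()))

rag_ingestion()
```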

In this session, Steve Sawyer will discuss a case study of how IBM Data Observability with Databand collects metadata to build historical baselines, detect anomalies, and triage alerts to remediate data quality issues in your data pipelines and warehouses. Additionally, he will provide a product perspective on the technologies IBM is building to meet data observability needs across the enterprise, and how they relate to our investments in AI and Data Fabric.
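
The general idea of baselining and anomaly detection described in this session can be sketched in a few lines: compare the latest value of a pipeline metric (such as a daily row count) against the mean and standard deviation of its history. This is an illustrative z-score check only, not how Databand itself is implemented.

```python
from statistics import mean, stdev

def is_anomalous(history, latest, z_threshold=3.0):
    """Flag a pipeline metric that deviates from its historical baseline
    by more than z_threshold standard deviations."""
    if len(history) < 2:
        return False  # not enough history to form a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold

# Example: daily row counts from the last seven runs, then a suspicious new run
row_counts = [10_120, 9_980, 10_230, 10_050, 9_870, 10_310, 10_140]
print(is_anomalous(row_counts, 4_200))  # True: a sharp drop worth alerting on
```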

Summary
This episode features an insightful conversation with Petr Janda, the CEO and founder of Synq. Petr shares his journey from being an engineer to founding Synq, emphasizing the importance of treating data systems with the same rigor as engineering systems. He discusses the challenges and solutions in data reliability, including the need for transparency and ownership in data systems. Synq's platform helps data teams manage incidents, understand data dependencies, and ensure data quality by providing insights and automation capabilities. Petr emphasizes the need for a holistic approach to data reliability, integrating data systems into broader business processes. He highlights the role of data teams in modern organizations and how Synq is empowering them to achieve this.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey and today I'm interviewing Petr Janda about Synq, a data reliability platform focused on leveling up data teams by supporting a culture of engineering rigor.

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Synq is and the story behind it?
Data observability/reliability is a category that grew rapidly over the past ~5 years and has several vendors focused on different elements of the problem. What are the capabilities that you saw as lacking in the ecosystem which you are looking to address?
Operational/infrastructure engineers have spent the past decade honing their approach to incident management and uptime commitments. How do those concepts map to the responsibilities and workflows of data teams?
Tooling only plays a small part in SLAs and incident management. How does Synq help to support the cultural transformation that is necessary?
What does an on-call rotation for a data engineer/data platform engineer look like as compared with an application-focused team?
How does the focus on data assets/data products shift your approach to observability as compared to a table/pipeline centric approach?
With the focus on sharing ownership beyond the boundaries of the data team there is a strong correlation with data governance principles. How do you see organizations incorporating Synq into their approach to data governance/compliance?
Can you describe how Synq is designed/implemented? How have the scope and goals of the product changed since you first started working on it?
For a team who is onboarding onto Synq, what are the steps required to get it integrated into their technology stack and workflows?
What are the types of incidents/errors that you are able to identify and alert on?
What does a typical incident/error resolution process look like with Synq?
What are the most interesting, innovative, or unexpected ways that you have seen Synq used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Synq?
When is Synq the wrong choice?
What do you have planned for the future of Synq?

Contact Info
LinkedIn
Substack

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
Synq
Incident Management
SLA == Service Level Agreement
Data Governance
Podcast Episode
PagerDuty
OpsGenie
Clickhouse
Podcast Episode
dbt
Podcast Episode
SQLMesh
Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Data Quality Score: How We Evolved the Data Quality Strategy at Airbnb

Speaker: Clark Wright (Staff Analytics Engineer at Airbnb)

This tech talk is a part of the Data Engineering Open Forum at Netflix 2024. Recently, Airbnb published a post to their Tech Blog called Data Quality Score: The next chapter of data quality at Airbnb. In this talk, Clark Wright shares the narrative of how data practitioners at Airbnb recognized the need for higher-quality data and then proposed, conceptualized, and launched Airbnb’s first Data Quality Score.
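
As a hypothetical illustration of the idea behind a single data quality score, the sketch below collapses per-dimension scores for a dataset into one weighted number. The dimension names and weights are placeholders for the example and are not Airbnb's actual scoring model.

```python
# Placeholder dimensions and weights; a real scoring model would define
# these from measurable checks (freshness SLAs, validation coverage, ownership, etc.).
weights = {"accuracy": 0.3, "reliability": 0.3, "stewardship": 0.2, "usability": 0.2}

def data_quality_score(dimension_scores, weights):
    """Collapse per-dimension scores (each 0-100) into a single 0-100 quality score."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights should sum to 1"
    return sum(weights[d] * dimension_scores[d] for d in weights)

dataset_scores = {"accuracy": 90, "reliability": 75, "stewardship": 60, "usability": 80}
print(round(data_quality_score(dataset_scores, weights), 1))  # 77.5
```

A single number like this makes it easy to rank datasets, set minimum thresholds for critical tables, and track quality trends over time.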

If you are interested in attending a future Data Engineering Open Forum, we highly recommend you join our Google Group (https://groups.google.com/g/data-engineering-open-forum) to stay tuned to event announcements.

In the fast-paced work environments we are used to, the ability to quickly find and understand data is essential. Data professionals can often spend more time searching for data than analyzing it, which can hinder business progress. Innovations like data catalogs and automated lineage systems are transforming data management, making it easier to ensure data quality, trust, and compliance. By creating a strong metadata foundation and integrating these tools into existing workflows, organizations can enhance decision-making and operational efficiency. But how did this all come to be, and who is driving better access and collaboration through data?

Prukalpa Sankar is the Co-founder of Atlan. Atlan is a modern data collaboration workspace (like GitHub for engineering or Figma for design). By acting as a virtual hub for data assets ranging from tables and dashboards to models & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Slack, BI tools, data science tools and more. A pioneer in the space, Atlan was recognized by Gartner as a Cool Vendor in DataOps, as one of the top 3 companies globally. Prukalpa previously co-founded SocialCops, a world-leading data-for-good company (New York Times Global Visionary, World Economic Forum Tech Pioneer). SocialCops is behind landmark data projects including India's National Data Platform and SDGs global monitoring in collaboration with the United Nations. She was awarded Economic Times Emerging Entrepreneur for the Year, Forbes 30u30, Fortune 40u40, and Top 10 CNBC Young Business Women 2016, and is a TED Speaker.

In the episode, Richie and Prukalpa explore challenges within data discoverability, the inception of Atlan, the importance of a data catalog, personalization in data catalogs, data lineage, building data lineage, implementing data governance, human collaboration in data governance, skills for effective data governance, product design for diverse audiences, regulatory compliance, the future of data management and much more.

Links Mentioned in the Show:
Atlan
Connect with Prukalpa
[Course] Artificial Intelligence (AI) Strategy
Related Episode: Adding AI to the Data Warehouse with Sridhar Ramaswamy, CEO at Snowflake
Sign up to RADAR: AI Edition

New to DataCamp? Learn on the go using the DataCamp mobile app. Empower your business with world-class data and AI skills with DataCamp for business.

Databricks ML in Action

Dive into the Databricks Data Intelligence Platform and learn how to harness its full potential for creating, deploying, and maintaining machine learning solutions. This book covers everything from setting up your workspace to integrating state-of-the-art tools such as AutoML and VectorSearch, imparting practical skills through detailed examples and code.

What this book will help me do:
Set up and manage a Databricks workspace tailored for effective data science workflows.
Implement monitoring to ensure data quality and detect drift efficiently.
Build, fine-tune, and deploy machine learning models seamlessly using Databricks tools.
Operationalize AI projects including feature engineering, data pipelines, and workflows on the Databricks Lakehouse architecture.
Leverage integrations with popular tools like OpenAI's ChatGPT to expand your AI project capabilities.

Author(s): This book is authored by Stephanie Rivera, Anastasia Prokaieva, Amanda Baker, and Hayley Horn, seasoned experts in data science and machine learning from Databricks. Their collective years of expertise in big data and AI technologies ensure a rich and insightful perspective. Through their work, they strive to make complex concepts accessible and actionable.

Who is it for? This book serves as an ideal guide for machine learning engineers, data scientists, and technically inclined managers. It's well-suited for those transitioning to the Databricks environment or seeking to deepen their Databricks-based machine learning implementation skills. Whether you're an ambitious beginner or an experienced professional, this book provides clear pathways to success.

Eric Avidon is a journalist at TechTarget who has interviewed Tristan a few times, and now Tristan gets to flip the script and interview Eric. Eric is a veteran journalist, having covered everything from finance to the Boston Red Sox, but now he spends a lot of time with vendors in the data space and has a broad view of what's going on. Eric and Tristan discuss AI and analytics and how mature these features really are today, data quality and its importance, the AI strategies of Snowflake and Databricks, and a lot more. Plus, partway through you can hear Tristan reacting to a mild earthquake that hit the East Coast. For full show notes and to read 6+ years of back issues of the podcast's companion newsletter, head to https://roundup.getdbt.com.