talk-data.com talk-data.com

Topic

Data Quality

data_management data_cleansing data_validation

537

tagged

Activity Trend

82 peak/qtr
2020-Q1 2026-Q1

Activities

537 activities · Newest first

We talked about:

Meryam's background The constant evolution of startups How Meryam became interested in LLMs What is an LLM (generative vs non-generative models)? Why LLMs are important Open source models vs API models What TitanML does How fine-tuning a model helps in LLM use cases Fine-tuning generative models How generative models change the landscape of human work How to adjust models over time Vector databases and LLMs How to choose an open source LLM or an API Measuring input data quality Meryam's resource recommendations

Links:

Website: https://www.titanml.co/ Beta docs: https://titanml.gitbook.io/iris-documentation/overview/guide-to-titanml... Using llama2.0 in TitanML Blog: https://medium.com/@TitanML/the-easiest-way-to-fine-tune-and-inference-llama-2-0-8d8900a57d57 Discord: https://discord.gg/83RmHTjZgf Meryem LinkedIn: https://www.linkedin.com/in/meryemarik/

Free MLOps course: https://github.com/DataTalksClub/mlops-zoomcamp Join DataTalks.Club: https://datatalks.club/slack.html Our events: https://datatalks.club/events.html

Cross-Platform Data Lineage with OpenLineage

There are more data tools available than ever before, and it is easier to build a pipeline than it has ever been. These tools and advancements have created an explosion of innovation, resulting in data within today's organizations becoming increasingly distributed and can't be contained within a single brain, a single team, or a single platform. Data lineage can help by tracing the relationships between datasets and providing a map of your entire data universe.

OpenLineage provides a standard for lineage collection that spans multiple platforms, including Apache Airflow, Apache Spark™, Flink®, and dbt. This empowers teams to diagnose and address widespread data quality and efficiency issues in real time. In this session, we will show how to trace data lineage across Apache Spark and Apache Airflow. There will be a walk-through of the OpenLineage architecture and a live demo of a running pipeline with real-time data lineage.

Talk by: Julien Le Dem,Willy Lulciuc

Here’s more to explore: Data, Analytics, and AI Governance: https://dbricks.co/44gu3YU

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Labcorp Data Platform Journey: From Selection to Go-Live in Six Months

Join this session to learn about the Labcorp data platform transformation from on-premises Hadoop to AWS Databricks Lakehouse. We will share best practices and lessons learned from cloud-native data platform selection, implementation, and migration from Hadoop (within six months) with Unity Catalog.

We will share steps taken to retire several legacy on-premises technologies and leverage Databricks native features like Spark streaming, workflows, job pools, cluster policies and Spark JDBC within Databricks platform. Lessons learned in Implementing Unity Catalog and building a security and governance model that scales across applications. We will show demos that walk you through batch frameworks, streaming frameworks, data compare tools used across several applications to improve data quality and speed of delivery.

Discover how we have improved operational efficiency, resiliency and reduced TCO, and how we scaled building workspaces and associated cloud infrastructure using Terraform provider.

Talk by: Mohan Kolli and Sreekanth Ratakonda

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Sponsored: Lightup Data | How McDonald's Leveraged Lightup Data Quality

As one of the world's largest fast-food chains, McDonald's manages massive amounts of data for customers, sales, inventory, marketing, and more. And at that scale, ensuring the accuracy, reliability, and quality of all that data comes with a new set of complex challenges. Developing manual data quality checks with legacy tools was too time consuming and resource-intensive, requiring developer support and data domain expertise. Ultimately, they struggled to scale their checks across their enterprise data pipelines.

Join our featured customer session, where you’ll hear from Matt Sandler, Senior Director of Data and Analytics at McDonald’s, about how they use the Lightup Deep Data Quality platform to deploy pushdown data quality checks in minutes, not months — without developer support. From reactive to proactive, the McDonald’s data team leverages Lightup to scale their data quality checks across petabytes of data, ensuring high-quality data and reliable analytics for their products and services. During the session, you’ll learn:

  • The key challenges of scaling Data Quality checks with legacy tools
  • Why fixing data quality (fast) was critical to launching their new loyalty program and personalized marketing initiatives
  • How quickly McDonald’s ramped up with Lightup, transforming their data quality struggles into success

After the session, you’ll understand:

  • Why McDonald’s phased out their legacy Data Quality tools
  • The benefits of using pushdown data quality checks, AI-powered anomaly detection, and incident alerts
  • Best practices for scaling data quality checks in your own organization

Talk by: Matt Sandler and Manu Bansal

Here’s more to explore: Data, Analytics, and AI Governance: https://dbricks.co/44gu3YU

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Increasing Data Trust: Enabling Data Governance on Databricks Using Unity Catalog & ML-Driven MDM

As part of Comcast Effectv’s transformation into a completely digital advertising agency, it was key to develop an approach to manage and remediate data quality issues related to customer data so that the sales organization is using reliable data to enable data-driven decision making. Like many organizations, Effectv's customer lifecycle processes are spread across many systems utilizing various integrations between them. This results in key challenges like duplicate and redundant customer data that requires rationalization and remediation. Data is at the core of Effectv’s modernization journey with the intended result of winning more business, accelerating order fulfillment, reducing make-goods and identifying revenue.

In partnership with Slalom Consulting, Comcast Effectv built a traditional lakehouse on Databricks to ingest data from all of these systems but with a twist; they anchored every engineering decision in how it will enable their data governance program.

In this session, we will touch upon the data transformation journey at Effectv and dive deeper into the implementation of data governance leveraging Databricks solutions such as Delta Lake, Unity Catalog and DB SQL. Key focus areas include how we baked master data management into our pipelines by automating the matching and survivorship process, and bringing it all together for the data consumer via DBSQL to use our certified assets in bronze, silver and gold layers.

By making thoughtful decisions about structuring data in Unity Catalog and baking MDM into ETL pipelines, you can greatly increase the quality, reliability, and adoption of single-source-of-truth data so your business users can stop spending cycles on wrangling data and spend more time developing actionable insights for your business.

Talk by: Maggie Davis and Risha Ravindranath

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

De-Risking Language Models for Faster Adoption

Language models are incredible engineering breakthroughs but require auditing and risk management before productization. These systems raise concerns about toxicity, transparency and reproducibility, intellectual property licensing and ownership, disinformation and misinformation, supply chains, and more. How can your organization leverage these new tools without taking on undue or unknown risks? While language models and associated risk management are in their infancy, a small number of best practices in governance and risk are starting to emerge. If you have a language model use case in mind, want to understand your risks, and do something about them, this presentation is for you! We'll be covering the following: 

  • Studying past incidents in the AI Incident Database and using this information to guide debugging.
  • Adhering to authoritative standards, like the NIST AI Risk Management Framework. 
  • Finding and fixing common data quality issues.
  • Applying general public tools and benchmarks as appropriate (e.g., BBQ, Winogender, TruthfulQA).
  • Binarizing specific tasks and debugging them using traditional model assessment and bias testing.
  • Engineering adversarial prompts with strategies like counterfactual reasoning, role-playing, and content exhaustion. 
  • Conducting random attacks: random sequences of attacks, prompts, or other tests that may evoke unexpected responses. 
  • Countering prompt injection attacks, auditing for backdoors and data poisoning, ensuring endpoints are protected with authentication and throttling, and analyzing third-party dependencies. 
  • Engaging stakeholders to help find problems system designers and developers cannot see. 
  • Everyone knows that generative AI is going to be huge. Don't let inadequate risk management ruin the party at your organization!

Talk by: Patrick Hall

Here’s more to explore: LLM Compact Guide: https://dbricks.co/43WuQyb Big Book of MLOps: https://dbricks.co/3r0Pqiz

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Leveraging IoT Data at Scale to Mitigate Global Water Risks Using Apache Spark™ Streaming and Delta

Every year, billions of dollars are lost due to water risks from storms, floods, and droughts. Water data scarcity and excess are issues that risk models cannot overcome, creating a world of uncertainty. Divirod is building a platform of water data by normalizing diverse data sources of varying velocity into one unified data asset. In addition to publicly available third-party datasets, we are rapidly deploying our own IoT sensors. These sensors ingest signals at a rate of about 100,000 messages per hour into preprocessing, signal-processing, analytics, and postprocessing workloads in one spark-streaming pipeline to enable critical real-time decision-making processes. By leveraging streaming architecture, we were able to reduce end-to-end latency from tens of minutes to just a few seconds.

We are leveraging Delta Lake to provide a single query interface across multiple tables of this continuously changing data. This enables data science and analytics workloads to always use the most current and comprehensive information available. In addition to the obvious schema transformations, we implement data quality metrics and datum conversions to provide a trustworthy unified dataset.

Talk by: Adam Wilson and Heiko Udluft

Here’s more to explore: Big Book of Data Engineering: 2nd Edition: https://dbricks.co/3XpPgNV The Data Team's Guide to the Databricks Lakehouse Platform: https://dbricks.co/46nuDpI

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

US Army Corp of Engineers Enhanced Commerce & National Sec Through Data-Driven Geospatial Insight

The US Army Corps of Engineers (USACE) is responsible for maintaining and improving nearly 12,000 miles of shallow-draft (9'-14') inland and intracoastal waterways, 13,000 miles of deep-draft (14' and greater) coastal channels, and 400 ports, harbors, and turning basins throughout the United States. Because these components of the national waterway network are considered assets to both US commerce and national security, they must be carefully managed to keep marine traffic operating safely and efficiently.

The National DQM Program is tasked with providing USACE a nationally standardized remote monitoring and documentation system across multiple vessel types with timely data access, reporting, dredge certifications, data quality control, and data management. Government systems have often lagged commercial systems in modernization efforts, and the emergence of the cloud and Data Lakehouse Architectures have empowered USACE to successfully move into the modern data era.

This session incorporates aspects of these topics: Data Lakehouse Architecture: Delta Lake, platform security and privacy, serverless, administration, data warehouse, Data Lake, Apache Iceberg, Data Mesh GIS: H3, MOSAIC, spatial analysis data engineering: data pipelines, orchestration, CDC, medallion architecture, Databricks Workflows, data munging, ETL/ELT, lakehouses, data lakes, Parquet, Data Mesh, Apache Spark™ internals. Data Streaming: Apache Spark Structured Streaming, real-time ingestion, real-time ETL, real-time ML, real-time analytics, and real-time applications, Delta Live Tables. ML: PyTorch, TensorFlow, Keras, scikit-learn, Python and R ecosystems data governance: security, compliance, RMF, NIST data sharing: sharing and collaboration, delta sharing, data cleanliness, APIs.

Talk by: Jeff Mroz

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Sponsored: Matillion | Using Matillion to Boost Productivity w/ Lakehouse and your Full Data Stack

In this presentation, Matillion’s Sarah Pollitt, Group Product Manager for ETL, will discuss how you can use Matillion to load data from popular data sources such as Salesforce, SAP, and over a hundred out-of-the-box connectors into your data lakehouse. You can quickly transform this data using powerful tools like Matillion or dbt, or your own custom notebooks, to derive valuable insights. She will also explore how you can run streaming pipelines to ensure real-time data processing, and how you can extract and manage this data using popular governance tools such as Alation or Collibra, ensuring compliance and data quality. Finally, Sarah will showcase how you can seamlessly integrate this data into your analytics tools of choice, such as Thoughtspot, PowerBI, or any other analytics tool that fits your organization's needs.

Talk by: Rick Wear

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Sponsored: Accenture | Databricks Enables Employee Data Domain to Align People w/ Business Outcomes

A global franchise retailer was struggling to understand the value of its employees and had not fostered a data-driven enterprise. During the journey to use facts as the basis for decision making, Databricks became the facilitator of DataMesh and created the pipelines, analytics and source engine for a three-layer — bronze, silver, gold — lakehouse that supports the HR domain and drives the integration of multiple additional domains: sales, customer satisfaction, product quality and more. In this talk, we will walk through:

  • The business rationale and drivers
  • The core data sources
  • The data products, analytics and pipelines
  • The adoption of Unity Catalog for data privacy compliance /adherence and data management
  • Data quality metrics

Join us to see the analytic product and the design behind this innovative view of employees and their business outcomes.

Talk by: Rebecca Bucnis

Here’s more to explore: State of Data + AI Report: https://dbricks.co/44i2HBp Databricks named a Leader in 2022 Gartner® Magic QuadrantTM CDBMS: https://dbricks.co/3phw20d

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Sponsored by: Anomalo | Scaling Data Quality with Unsupervised Machine Learning Methods

The challenge is no longer how big, diverse, or distributed your data is. It's that you can't trust it. Companies are utilizing rules and metrics to monitor data quality, but they’re tedious to set up and maintain. We will present a set of fully unsupervised machine learning algorithms for monitoring data quality at scale, which requires no setup, catching unexpected issues and preventing alert fatigue by minimizing false positives. At the end of this talk, participants will be equipped with insight into unsupervised data quality monitoring, its advantages and limitations, and how it can help scale trust in your data.

Talk by: Vicky Andonova

Here’s more to explore: LLM Compact Guide: https://dbricks.co/43WuQyb Big Book of MLOps: https://dbricks.co/3r0Pqiz

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Sponsored by: Fivetran | Fivetran and Catalyst Enable Businesses & Solve Critical Market Challenges

Fivetran helps Enterprise and Commercial companies improve the efficiency of their data movement, infrastructure, and analysis by providing a secure, scalable platform for high-volume data movement. In this fireside chat, we will dive into the pain points that drove Catalyst, a cloud-based platform that helps software companies grow revenue with advanced insights and workflows that strengthen customer adoption, retention, expansion and advocacy, to begin their search for a partnership that would automate and simplify data management along with the pivotal success driven by the implementation of Fivetran and Databricks. 

Discover how together Fivetran and Databricks:

  • Deliver scalable, real-time analytics to customers with minimal configuration and centralize customer data into customer success tools.
  • Improve Catalyst’s visibility into customer health, opportunities, and risks across all teams.
  • Turn data into revenue-driving insights around digital customer behavior with improved targeting and Ai/ Machine learning.
  • Provide a robust and scalable data infrastructure that supports Catalyst’s growing data needs, with improvements in data availability, data quality, and overall efficiency in data operations.

Talk by: Edward Chiu and Lauren Schwartz

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Taking Control of Streaming Healthcare Data

Chesapeake Regional Information System for our Patients (CRISP), a nonprofit healthcare information exchange (HIE), initially partnered with Slalom to build a Databricks data lakehouse architecture in response to the analytics demands of the COVID-19 pandemic, since then they have expanded the platform to additional use cases. Recently they have worked together to engineer streaming data pipelines to process healthcare messages, such as HL7, to help CRISP become vendor independent.

This session will focus on the improvements CRISP has made to their data lakehouse platform to support streaming use cases and the impact these changes have had for the organization. We will touch on using Databricks Auto Loader to efficiently ingest incoming files, ensuring data quality with Delta Live Tables, and sharing data internally with a SQL warehouse, as well as some of the work CRISP has done to parse and standardize HL7 messages from hundreds of sources. These efforts have allowed CRISP to stream over 4 million messages daily in near real-time with the scalability it needs to continue to onboard new healthcare providers so it can continue to facilitate care and improve health outcomes.

Talk by: Andy Hanks and Chris Mantz

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

You’ve got your pipelines flowing … how much do you know about the data inside? Most teams have some coverage with unit/contract/expectations tests, and you might have other quality checks. But it can be very ad-hoc and disorganized. You want to do more to beef up data quality and observability … does that mean you just need to write more tests and assertions? Come learn about the best way to see your data’s quality alongside DAGs in a familiar context. We’ll review 3 common tools to get a handle on quality in a cohesive way across all your DAGs: Great Expectations Monte Carlo Data Databand

Discover PepsiCo’s dynamic data quality strategy in a multi-cloud landscape. Join me, the Director of Data Engineering, as I unveil our Airflow utilization, custom operator integration, and the power of Great Expectations. Learn how we’ve harmonized Data Mesh into our decentralized development for seamless data integration. Explore our journey to maintain quality and enhance data as a strategic asset at PepsiCo.

Data contracts have been much discussed in the community of late, with a lot of curiosity around how to approach this concept in practice. We believe data contracts need a harmonizing layer to manage data quality in a uniform manner across a fragmented stack. We are calling this harmonizing layer the Control Plane for Data - powered by the common thread across these systems: metadata. For teams already orchestrating pipelines with Airflow, data contacts can be an effective way to process data that meets preset quality standards. With a control plane as a connecting layer, producers can build data contracts that consumers can rely on, ensuring DAGs only run when a contract is valid. Producers can govern how workflows should behave, and consumers receive the tooling they need to only opt into high quality data. Learn how to use data contracts and DataHub to make your Airflow pipelines more reliable - as well as other use cases that can help build a simpler, more flexible data stack.

LLMs are hugely popular with data engineers because they boost productivity. But companies must adapt their data governance programs to control risks related to data quality, privacy, intellectual property, fai-Datarness, and explainability. Published at: https://www.eckerson.com/articles/should-ai-bots-build-your-data-pipelines-part-ii-risks-and-governance-approaches-for-data-engineers-to-use-large-language-models

A robust data workflow testing strategy helps ensure the accuracy and reliability of data processed within a pipeline. Use this checklist to meet your organization’s data quality requirements according to the dimensions of accuracy, completeness, conformity, consistency, integrity, precision, timeliness, and uniqueness. Published at: https://www.eckerson.com/articles/developing-a-robust-data-quality-strategy-for-your-data-pipeline-workflows

Today I’m chatting with Phil Harvey, co-author of Data: A Guide to Humans and a technology professional with 23 years of experience working with AI and startups. In his book, Phil describes his philosophy of how empathy leads to more successful outcomes in data product development and the journey he took to arrive at this perspective. But what does empathy mean, and how do you measure its success? Brian and Phil dig into those questions, and Phil explains why he feels cognitive empathy is a learnable skill that one can develop and apply. Phil describes some leading indicators that empathy is needed on a data team, as well as leading indicators that a more empathetic approach to product development is working. While I use the term “design” or “UX” to describe a lot of what Phil is talking about, Phil actually has some strong opinions about UX and shares those on this episode. Phil also reveals why he decided to write Data: A Guide to Humans and some of the experiences that helped shape the book’s philosophy. 

Highlights/ Skip to:

Phil introduces himself and explains how he landed on the name for his book (00:54)  How Phil met his co-author, Noelia Jimenez Martinez, and the reason they started writing Data: A Guide to Humans (02:31) Phil unpacks his understanding of how he defines empathy, why it leads to success on AI projects, and what success means to him (03:54) Phil walks through a couple scenarios where empathy for users and stakeholders was lacking and the impacts it had (07:53) The work Phil has done internally to get comfortable doing the non-technical work required to make ML/AI/data products successful  (13:45) Phil describes some indicators that data teams can look for to know their design strategy is working (17:10) How Phil sees the methodology in his book relating to the world of UX (user experience) design (21:49) Phil walks through what an abstract concept like “empathy” means to him in his work and how it can be learned and applied as a practical skill (29:00)

Quotes from Today’s Episode “If you take success in itself, this is about achieving your intended outcomes. And if you do that with empathy, your outcomes will be aligned to the needs of the people the outcomes are for. Your outcomes will be accepted by stakeholders because they’ll understand them.” — Phil Harvey (05:05)

“Where there’s people not discussing and not considering the needs and feelings of others, you start to get this breakdown, data quality issues, all that.” – Phil Harvey (11:10)

“I wanted to write code; I didn’t want to deal with people. And you feel when you can do technical things, whether it’s machine-learning or these things, you end up with the ‘I’ve got a hammer and now everything looks like a nail problem.’ But you also have the [attitude] that my programming will solve everything.” – Phil Harvey (14:48)

“This is what startup-land really taught me—you can’t do everything. It’s very easy to think that you can and then burn yourself out. You need a team of people.” – Phil Harvey (15:09)

“Let’s listen to the users. Let’s bring that perspective in as opposed to thinking about aligning the two perspectives. Because any product is a change. You don’t ride a horse then jump in a car and expect the car to work like the horse.” – Phil Harvey (22:41)

“Let’s say you’re a leader in this space. … Listen out carefully for who’s complaining about who’s not listening to them. That’s a first early signal that there’s work to be done from an empathy perspective.” – Phil Harvey (25:00)

“The perspective of the book that Noelia and I have written is that empathy—and cognitive empathy particularly—is also a learnable skill. There are concrete and real things you can practice and do to improve in those skills.” – Phil Harvey (29:09)

Links Data: A Guide to Humans: https://www.amazon.com/Data-A-Guide-to-Humans/dp/1783528648 Twitter: https://twitter.com/codebeard LinkedIn: https://www.linkedin.com/in/philipdavidharvey/ Mastodon: https://mastodonapp.uk/@codebeard

Incident Management for Data People | Bigeye

ABOUT THE TALK: Incident management is a key practice used by DevOps and SRE teams to keep software reliable—but it's still uncommon among data teams! Datadog says incident management can "streamline their response procedures, reducing mean time to repair (MTTR) and minimizing any impact on end users."

In this talk, Kyle Kirwan, co-founder of data observability company Bigeye, will explain the basics of incident management and how data teams can use it to reduce disruptions to analytics and machine learning applications.

ABOUT THE SPEAKER: Kyle Kirwan is the co-founder and CEO of Bigeye. He began his career as a data scientist, went on to lead the development of Uber's internal data catalog/lineage/quality tools, and now helps data teams use data observability to improve pipeline reliability and data quality.

ABOUT DATA COUNCIL: Data Council (https://www.datacouncil.ai/) is a community and conference series that provides data professionals with the learning and networking opportunities they need to grow their careers.

Make sure to subscribe to our channel for the most up-to-date talks from technical professionals on data related topics including data infrastructure, data engineering, ML systems, analytics and AI from top startups and tech companies.

FOLLOW DATA COUNCIL: Twitter: https://twitter.com/DataCouncilAI LinkedIn: https://www.linkedin.com/company/datacouncil-ai/