Streaming data is hard and costly. That's the default opinion, but it doesn't have to be. In this session, discover how SEGA simplified complex streaming pipelines and turned them into a competitive edge. SEGA sees over 40,000 events per second. That's no easy task, but enabling personalised gaming experiences for over 50 million gamers drives a huge competitive advantage. If you're wrestling with streaming challenges, this talk is your next checkpoint. We'll unpack how Lakeflow Declarative Pipelines helped SEGA, from automated schema evolution and simple data quality management to seamless streaming reliability. Learn how Lakeflow Declarative Pipelines drives value by transforming Chaos Emeralds into clarity, delivering results for a global gaming powerhouse. We'll step through the architecture, approach and challenges we overcame. Join Craig Porteous, Microsoft MVP from Advancing Analytics, and Felix Baker, Head of Data Services at SEGA Europe, for a fast-paced, hands-on journey into Lakeflow Declarative Pipelines' unique powers.
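As context for the declarative data quality management the talk highlights, here is a minimal sketch of a Lakeflow Declarative Pipelines (Delta Live Tables) flow in Python that ingests streaming JSON events and drops rows failing declared expectations. The path, table and column names (raw_game_events, player_id, event_ts) are illustrative assumptions, not SEGA's actual pipeline.

```python
import dlt
from pyspark.sql import functions as F

# Hypothetical landing path for game telemetry files; `spark` is provided by
# the pipeline runtime.
RAW_EVENTS_PATH = "/Volumes/games/telemetry/raw_events"

@dlt.table(comment="Raw game telemetry ingested incrementally from cloud storage.")
def raw_game_events():
    # Auto Loader handles schema inference and evolution for incoming files.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load(RAW_EVENTS_PATH)
    )

@dlt.table(comment="Validated events; rows failing the expectations are dropped.")
@dlt.expect_or_drop("valid_player", "player_id IS NOT NULL")
@dlt.expect_or_drop("valid_timestamp", "event_ts IS NOT NULL")
def clean_game_events():
    return dlt.read_stream("raw_game_events").withColumn("ingested_at", F.current_timestamp())
```

Expectation results surface in the pipeline's event log, which is where the "simple data quality management" described above typically gets monitored.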
Topic: Data Quality (537 tagged)
Top Events
Agentic AI is the next evolution in artificial intelligence, with the potential to revolutionize the industry. However, its potential is matched only by its risk: without high-quality, trustworthy data, agentic AI can be exponentially dangerous. Join Barr Moses, CEO and Co-Founder of Monte Carlo, to explore how to leverage Databricks' powerful platform to ensure your agentic AI initiatives are underpinned by reliable, high-quality data. Barr will share:
- How data quality impacts agentic AI performance at every stage of the pipeline
- Strategies for implementing data observability to detect and resolve data issues in real time
- Best practices for building robust, error-resilient agentic AI models on Databricks
- Real-world examples of businesses harnessing Databricks' scalability and Monte Carlo's observability to drive trustworthy AI outcomes
Learn how your organization can deliver more reliable agentic AI and turn the promise of autonomous intelligence into a strategic advantage. Audio for this session is delivered in the conference mobile app; you must bring your own headphones to listen.
In the rapidly evolving life sciences and healthcare industry, leveraging data-as-a-product is crucial for driving innovation and achieving business objectives. Join us to explore how Deloitte is revolutionizing data strategy solutions by overcoming challenges such as data silos, poor data quality, and lack of real-time insights with the Databricks Data Intelligence Platform. Learn how effective data governance, seamless data integration, and scalable architectures support personalized medicine, regulatory compliance, and operational efficiency. This session will highlight how these strategies enable biopharma companies to transform data into actionable insights, accelerate breakthroughs and enhance life sciences outcomes.
Modern insurers require agile, integrated data systems to harness AI. This framework for a global insurer uses Azure Databricks to unify legacy systems into a governed lakehouse medallion architecture (bronze/silver/gold layers), eliminating silos and enabling real-time analytics. The solution employs:
- Medallion architecture for incremental data quality improvement
- Unity Catalog for centralized governance, row/column security, and audit compliance
- Azure encryption/confidential computing for data mesh security
- Automated ingestion/semantic/DevOps pipelines for scalability
By combining Databricks' distributed infrastructure with Azure's security, the insurer achieves regulatory compliance while enabling AI-driven innovation (e.g., underwriting, claims). The framework establishes a future-proof foundation for mergers and acquisitions (M&A) and cross-functional data products, balancing governance with agility.
In 2020, Delaware implemented a state-of-the-art, event-driven architecture for EFSA, enabling a highly decoupled system landscape, presented at the Data & AI Summit 2021. By centrally brokering events in near real time, consumer applications react instantly to events from producer applications as they occur. Event producers are decoupled from consumers via a publisher/subscriber mechanism. Over the past years, we noticed some drawbacks: the processing of these custom events, primarily intended for process integration, didn't cover all edge cases; data quality was not always optimal due to missing events; and we needed complex logic to build SCD2 tables. Lakeflow Connect allows us to extract the data directly from the source without the complex architecture in between, avoiding data loss and the resulting data quality issues, and with some simple adjustments an SCD2 table is created automatically. Lakeflow Connect enables more efficient and intelligent data provisioning.
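As a rough illustration of the automatic SCD2 handling mentioned above (not Delaware's actual code), the sketch below uses the declarative pipelines Python API to apply a change feed into a Type 2 table. The source table and the customer_id/change_ts columns are hypothetical; in practice the change records would be landed by Lakeflow Connect.

```python
import dlt

# Hypothetical CDC feed; in practice Lakeflow Connect lands these change
# records from the source system. `spark` is provided by the pipeline runtime.
@dlt.view
def customer_changes():
    return spark.readStream.table("raw.customer_cdc_feed")

# Target streaming table that will hold the SCD Type 2 history.
dlt.create_streaming_table("customers_scd2")

# Declaratively apply the change feed as SCD2: prior versions are closed out
# and new versions appended, keyed on customer_id and ordered by change_ts.
dlt.apply_changes(
    target="customers_scd2",
    source="customer_changes",
    keys=["customer_id"],
    sequence_by="change_ts",
    stored_as_scd_type=2,
)
```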
The next era of data transformation has arrived. AI is enhancing developer workflows, enabling downstream teams to collaborate effectively through governed self-service. Additionally, SQL comprehension is producing detailed metadata that boosts developer efficiency while ensuring data quality and cost optimization. Experience this firsthand with dbt’s data control plane, a centralized platform that provides organizations with repeatable, scalable, and governed methods to succeed with Databricks in the modern age.
Join us for an introductory session on Databricks DQX, a Python-based framework designed to validate the quality of PySpark DataFrames. Discover how DQX can empower you to proactively tackle data quality challenges, enhance pipeline reliability and make more informed business decisions with confidence. Traditional data quality tools often fall short by providing limited, actionable insights, relying heavily on post-factum monitoring, and being restricted to batch processing. DQX overcomes these limitations by enabling real-time quality checks at the point of data entry, supporting both batch and streaming data validation and delivering granular insights at the row and column level. If you’re seeking a simple yet powerful data quality framework that integrates seamlessly with Databricks, this session is for you.
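DQX's own API isn't reproduced here; as a framework-agnostic sketch of the row-level check-and-quarantine pattern the session describes, the PySpark snippet below evaluates a couple of hypothetical rules per row, records the reasons for failures, and splits valid rows from quarantined ones.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Small illustrative dataset with one null currency and one negative amount.
orders = spark.createDataFrame(
    [(1, "EUR", 120.0), (2, None, 35.5), (3, "GBP", -10.0)],
    "order_id INT, currency STRING, amount DOUBLE",
)

# Each rule yields a reason string when the row fails, NULL when it passes.
rules = {
    "currency_not_null": F.when(F.col("currency").isNull(), F.lit("currency is null")),
    "amount_positive": F.when(F.col("amount") <= 0, F.lit("amount must be > 0")),
}

# Collect the reasons for every failed rule into one array column per row.
checked = orders.withColumn(
    "dq_errors",
    F.filter(F.array(*rules.values()), lambda x: x.isNotNull()),
)

valid = checked.where(F.size("dq_errors") == 0).drop("dq_errors")
quarantine = checked.where(F.size("dq_errors") > 0)

valid.show()
quarantine.show(truncate=False)
```

The same split can be applied to a streaming DataFrame, which is the batch-plus-streaming coverage the session calls out.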
A big challenge in LLM development and synthetic data generation is ensuring data quality and diversity. While data incorporating varied perspectives and reasoning traces consistently improves model performance, procuring such data remains impossible for most enterprises. Human-annotated data struggles to scale, while purely LLM-based generation often suffers from distribution clipping and low entropy. In a novel compound AI approach, we combine LLMs with probabilistic graphical models and other tools to generate synthetic personas grounded in real demographic statistics. The approach allows us to address major limitations in bias, licensing, and persona skew of existing methods. We release the first open-source dataset aligned with real-world distributions and show how enterprises can leverage it with Gretel Data Designer (now part of NVIDIA) to bring diversity and quality to model training on the Databricks platform, all while addressing model collapse and data provenance concerns head-on.
Auto Loader is the definitive tool for ingesting data from cloud storage into your lakehouse. In this session, we’ll unveil new features and best practices that simplify every aspect of cloud storage ingestion. We’ll demo out-of-the-box observability for pipeline health and data quality, walk through improvements for schema management, introduce a series of new data formats and highlight recent strides in Auto Loader performance. Along the way, we’ll provide examples and best practices for optimizing cost and performance. Finally, we’ll introduce a preview of what’s coming next — including a REST API for pushing files directly to Delta, a UI for creating cloud storage pipelines and more. Join us to help shape the future of file ingestion on Databricks.
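For orientation, a minimal Auto Loader sketch is shown below: an incremental read from cloud storage with schema tracking and rescue of non-conforming records, written to a bronze table. The paths and table name are placeholders, and the snippet assumes a Databricks runtime where `spark` is provided.

```python
# Minimal Auto Loader sketch (runs in a Databricks notebook or job).
stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Auto Loader tracks and evolves the inferred schema at this location.
    .option("cloudFiles.schemaLocation", "/Volumes/landing/_schemas/events")
    # Keep records that don't match the schema instead of failing the stream.
    .option("cloudFiles.schemaEvolutionMode", "rescue")
    .load("/Volumes/landing/events/")
)

(
    stream.writeStream
    .option("checkpointLocation", "/Volumes/landing/_checkpoints/events")
    .trigger(availableNow=True)  # process available files, then stop
    .toTable("bronze.events")
)
```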
GovTech is an agency in the Singapore Government focused on tech for good. The GovTech Chief Data Office (CDO) has built the GovTech Data Platform with Databricks at the core. As the government tech agency, we safeguard national-level government and citizen data. A comprehensive data strategy is essential to uplifting data maturity. GovTech has adopted the service model approach where data services are offered to stakeholders based on their data maturity. Their maturity is uplifted through partnership, readying them for more advanced data analytics. CDO offers a plethora of data assets in a “data restaurant” ranging from raw data to data products, all delivered via Databricks and enabled through fine-grained access control, underpinned by data management best practices such as data quality, security and governance. Within our first year on Databricks, CDO was able to save 8,000 man-hours, democratize data across 50% of the agency and achieve six-figure savings through BI consolidation.
Industrial data is the foundation for operational excellence, but sharing and leveraging this data across systems presents significant challenges. Fragmented approaches create delays in decision-making, increase maintenance costs, and erode trust in data quality. This session explores how the partnership between AVEVA and Databricks addresses these issues through CONNECT, which integrates directly with Databricks via Delta Sharing. By accelerating time to value, eliminating data wrangling, ensuring high data quality, and reducing maintenance costs, this solution drives faster, more confident decision-making and greater user adoption. We will showcase how Agnico Eagle Mines—the world’s third-largest gold producer with 10 mines across Canada, Australia, Mexico, and Finland—is leveraging this capability to overcome data intelligence barriers at scale. With this solution, Agnico Eagle is making insights more accessible and actionable across its entire organization.
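As background on how a Delta Sharing integration like this is typically consumed, here is a small sketch using the open-source delta-sharing Python client; the profile path and the share/schema/table and column names are placeholders, not specifics of AVEVA CONNECT or Agnico Eagle's data.

```python
import delta_sharing

# Credential/profile file issued by the data provider (placeholder path).
profile = "/dbfs/FileStore/configs/connect_share.share"

# Fully qualified name: <share>.<schema>.<table> (placeholder names).
table_url = f"{profile}#industrial_share.operations.sensor_readings"

# Small tables can be pulled straight into pandas...
df = delta_sharing.load_as_pandas(table_url)
print(df.head())

# ...or read as a Spark DataFrame for larger volumes (requires the
# delta-sharing Spark connector on the cluster). Column names are hypothetical.
sdf = delta_sharing.load_as_spark(table_url)
sdf.select("site", "sensor_id", "value").show(5)
```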
In this course, you’ll learn how to incrementally process data to power analytic insights with Structured Streaming and Auto Loader, and how to apply design patterns for designing workloads to perform ETL on the Data Intelligence Platform with Lakeflow Declarative Pipelines. First, we’ll cover topics including ingesting raw streaming data, enforcing data quality, implementing CDC, and exploring and tuning state information. Then, we’ll cover options to perform a streaming read on a source, requirements for end-to-end fault tolerance, options to perform a streaming write to a sink, and creating an aggregation and watermark on a streaming dataset (see the sketch after this description).
Prerequisites:
- Ability to perform basic code development tasks using the Databricks workspace (create clusters, run code in notebooks, use basic notebook operations, import repos from git, etc.)
- Intermediate programming experience with SQL and PySpark (extract data from a variety of file formats and data sources, apply a number of common transformations to clean data, reshape and manipulate complex data using advanced built-in functions)
- Intermediate programming experience with Delta Lake (create tables, perform complete and incremental updates, compact files, restore previous versions, etc.)
- Beginner experience with streaming workloads and familiarity with Lakeflow Declarative Pipelines
Labs: No
Certification Path: Databricks Certified Data Engineer Professional
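As a taste of that final topic, the sketch below shows a watermarked streaming aggregation in PySpark Structured Streaming. The source and target table names and columns (bronze.events, event_time, event_type) are hypothetical, and `spark` is assumed to be provided by the Databricks runtime.

```python
from pyspark.sql import functions as F

# Hypothetical bronze table of events with an event_time timestamp column.
events = spark.readStream.table("bronze.events")

# The 10-minute watermark bounds how late data may arrive, letting Spark
# discard old aggregation state; counts are computed per 5-minute window.
counts = (
    events
    .withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"), "event_type")
    .count()
)

(
    counts.writeStream
    .outputMode("append")  # emit only finalized windows
    .option("checkpointLocation", "/Volumes/chk/event_counts")
    .toTable("silver.event_counts_5m")
)
```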
Organizations struggle to make sense of numerous programs and projects that overlap or operate in silos. This research will weave together data and analytics governance, MDM and data quality into one organized initiative that every CDAO should be interested in.
Data architects are increasingly tasked with provisioning quality unstructured data to support AI models. However, little has been done to manage unstructured data beyond data security and privacy requirements. This session will look at what it takes to improve the quality of unstructured data and the emerging best practices in this space.
As AI evolves into more agentic forms, capable of autonomous decision-making and complex interactions, the readiness of your data becomes a mission-critical priority. This roundtable gathers data & analytics leaders to explore the unique challenges of preparing data ecosystems for agentic AI. Discussions will focus on overcoming barriers such as data quality gaps, governance complexities, and scalability issues, while highlighting the transformative role of technologies like generative AI, data fabrics, and metadata-driven governance.
Traditional approaches and thinking around data quality are out of date and insufficient in the era of AI. Data, analytics and AI leaders will need to reconsider their approach to data quality, going beyond the traditional six data quality dimensions. This session will help data leaders learn to think about data quality in a holistic way that supports making data AI-ready.
Summary: In this episode of the Data Engineering Podcast, Mai-Lan Tomsen Bukovec, Vice President of Technology at AWS, talks about the evolution of Amazon S3 and its profound impact on data architecture. From her work on compute systems to leading the development and operations of S3, Mai-Lan shares insights on how S3 has become a foundational element in modern data systems, enabling scalable and cost-effective data lakes since its launch alongside Hadoop in 2006. She discusses the architectural patterns enabled by S3, the importance of metadata in data management, and how S3's evolution has been driven by customer needs, leading to innovations like strong consistency and S3 Tables.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
This is a pharmaceutical ad for Soda Data Quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of Undiagnosed Data Quality Syndrome — also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business — automatically detecting anomalies before your CEO does. It's 70% more accurate than industry benchmarks, and the fastest in the category, analyzing 1.1 billion rows in just 64 seconds. And with Collaborative Data Contracts, engineers and business can finally agree on what "done" looks like — so you can stop fighting over column names, and start trusting your data again. Whether you're a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you. Side effects of implementing Soda may include: increased trust in your metrics, reduced late-night Slack emergencies, spontaneous high-fives across departments, fewer meetings and less back-and-forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a $1000+ custom mechanical keyboard. Visit dataengineeringpodcast.com/soda to sign up and follow Soda's launch week. It starts June 9th.
Your host is Tobias Macey and today I'm interviewing Mai-Lan Tomsen Bukovec about the evolution of S3 and how it has transformed data architecture.
Interview
- Introduction
- How did you get involved in the area of data management?
- Most everyone listening knows what S3 is, but can you start by giving a quick summary of what roles it plays in the data ecosystem?
- What are the major generational epochs in S3, with a particular focus on analytical/ML data systems?
- The first major driver of analytical usage for S3 was the Hadoop ecosystem. What are the other elements of the data ecosystem that helped shape the product direction of S3?
- Data storage and retrieval have been core primitives in computing since its inception. What are the characteristics of S3 and all of its copycats that led to such a difference in architectural patterns vs. other shared data technologies (e.g. NFS, Gluster, Ceph, Samba, etc.)?
- How does the unified pool of storage that is exemplified by S3 help to blur the boundaries between application data, analytical data, and ML/AI data?
- What are some of the default patterns for storage and retrieval across those three buckets that can lead to anti-patterns which add friction when trying to unify those use cases?
- The age of AI is leading to a massive potential for unlocking unstructured data, for which S3 has been a massive dumping ground over the years. How is that changing the ways that your customers think about the value of the assets they have been hoarding for so long?
- What new architectural patterns is that generating?
- What are the most interesting, innovative, or unexpected ways that you have seen S3 used for analytical/ML/AI applications?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on S3?
- When is S3 the wrong choice?
- What do you have planned for the future of S3?
Contact Info
- LinkedIn
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
Links
- AWS S3, Kinesis, Kafka, SQS, EMR, Drupal, WordPress, Netflix Blog on S3 as a Source of Truth, Hadoop, MapReduce, NASA JPL, FINRA (Financial Industry Regulatory Authority), S3 Object Versioning, S3 Cross Region, S3 Tables, Iceberg, Parquet, AWS KMS, Iceberg REST, DuckDB, NFS (Network File System), Samba, GlusterFS, Ceph, MinIO, S3 Metadata, Photoshop Generative Fill, Adobe Firefly, TurboTax AI Assistant, AWS Access Analyzer, Data Products, S3 Access Point, AWS Nova Models, LexisNexis Protege, S3 Intelligent-Tiering, S3 Principal Engineering Tenets
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA.
The case study will focus on the approach used to drive an organization-wide data quality uplift initiative with clearly defined objectives, enabling appropriate rule coverage, the right DQ checks, mastering of DQ rules, automated anomaly detection, automated DQ issue remediation, centralized monitoring and collaborative accountability.
It's now easier than ever for less technical users to access, manage and analyze data without needing help from IT. But self-service data management isn't always straightforward, and there are plenty of pitfalls, like data quality issues, skills gaps and governance concerns. This session will cover practical ways to make self-service data management work.
Patrick Thompson, co-founder of Clarify and former co-founder of Iteratively (acquired by Amplitude), joined Yuliia and Dumky to discuss the evolution from data quality to decision quality. Patrick shares his experience building data contracts solutions at Atlassian and later developing analytics tracking tools. Patrick challenges the assumption that AI will eliminate the need for structured data. He argues that while LLMs excel at understanding unstructured data, businesses still need deterministic systems for automation and decision-making. Patrick shares insights on why enforcing data quality at the source remains critical, even in an AI-first world, and explains his shift from analytics to CRM while maintaining focus on customer data unification and business impact over technical perfectionism. Tune in!