talk-data.com

Topic: Data Quality

Tags: data_management, data_cleansing, data_validation

537 activities tagged

Activity Trend

Peak of 82 activities per quarter (2020-Q1 to 2026-Q1)

Activities

537 activities · Newest first

Summary

Working with data is a complicated process, with numerous chances for something to go wrong. Identifying and accounting for those errors is critical to building organizational trust that your data is accurate and up to date. While there are numerous products available to provide that visibility, each focuses on different technologies and workflows. To bring observability to dbt projects, the team at Elementary embedded themselves into the workflow. In this episode Maayan Salom explores the approach that she has taken to bring observability, enhanced testing capabilities, and anomaly detection into every step of the dbt developer experience.
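As a rough illustration of metadata-based anomaly detection of the kind discussed in this episode (a sketch of the general technique, not Elementary's implementation), the check below flags a dbt model whose latest row count deviates sharply from recent runs; the counts and threshold are invented.

```python
import statistics

def is_anomalous(history: list[int], latest: int, threshold: float = 3.0) -> bool:
    """Flag the latest row count if it deviates from recent history.

    A simplified z-score check in the spirit of metadata-based anomaly
    detection; production tools also model seasonality and trends.
    """
    if len(history) < 5:
        return False  # not enough history to judge
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > threshold

# Example: daily row counts for a hypothetical dbt model
recent_counts = [10_120, 10_340, 9_980, 10_205, 10_290, 10_150]
print(is_anomalous(recent_counts, latest=4_512))  # True: volume dropped sharply
```

Production tools layer seasonality, trends, and configurable training windows on top of this basic z-score idea.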

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!

This episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold.

Your host is Tobias Macey and today I'm interviewing Maayan Salom about how to incorporate observability into a dbt-oriented workflow and how Elementary can help

Interview

Introduction
How did you get involved in the area of data management?
Can you start by outlining what elements of observability are most relevant for dbt projects?
What are some of the common ad-hoc/DIY methods that teams develop to acquire those insights?

What are the challenges/shortcomings associated with those approaches?

Over the past ~3 years, numerous data observability systems/products have been created. What are some of the ways that the specifics of dbt workflows are not covered by those generalized tools?

What are the insights that can be more easily generated by embedding into the dbt toolchain and development cycle?

Can you describe what Elementary is and how it is designed to enhance the development and maintenance work in dbt projects?
How is Elementary designed/implemented?

How have the scope and goals of the project changed since you started working on it?
What are the engineering ch

Fundamentals of Analytics Engineering

Master the art and science of analytics engineering with 'Fundamentals of Analytics Engineering.' This book takes you on a comprehensive journey from understanding foundational concepts to implementing end-to-end analytics solutions. You'll gain not just theoretical knowledge but practical expertise in building scalable, robust data platforms to meet organizational needs.

What this Book will help me do: Design and implement effective data pipelines leveraging modern tools like Airbyte, BigQuery, and dbt. Adopt best practices for data modeling and schema design to enhance system performance and develop clearer data structures. Learn advanced techniques for ensuring data quality, governance, and observability in your data solutions. Master collaborative coding practices, including version control with Git and strategies for maintaining well-documented codebases. Automate and manage data workflows efficiently using CI/CD pipelines and workflow orchestrators.

Author(s): Dumky De Wilde, alongside six co-authors (experienced professionals from various facets of the analytics field), delivers a cohesive exploration of analytics engineering. The authors blend their expertise in software development, data analysis, and engineering to offer actionable advice and insights. Their approachable style makes complex concepts understandable.

Who is it for? This book is a perfect fit for data analysts and engineers curious about transitioning into analytics engineering. Aspiring professionals as well as seasoned analytics engineers looking to deepen their understanding of modern practices will find guidance here. It's tailored for individuals aiming to boost their career trajectory in data engineering roles, addressing fundamental to advanced topics.

The Definitive Guide to Data Integration

Master the modern data stack with 'The Definitive Guide to Data Integration.' This comprehensive book covers the key aspects of data integration, including data sources, storage, transformation, governance, and more. Equip yourself with the knowledge and hands-on skills to manage complex datasets and unlock your data's full potential.

What this Book will help me do: Understand how to integrate diverse datasets efficiently using modern tools. Develop expertise in designing and implementing robust data integration workflows. Gain insights into real-time data processing and cloud-based data architectures. Learn best practices for data quality, governance, and compliance in integration. Master the use of APIs, workflows, and transformation patterns in practice.

Author(s): The authors, Bonnefoy, Chaize, Raphaël Mansuy, and Mehdi Tazi, are seasoned experts in data engineering and integration. They bring years of experience in modern data technologies and consulting. Their approachable writing style ensures that readers at various skill levels can grasp complex concepts effectively.

Who is it for? This book is ideal for data engineers, architects, analysts, and IT professionals. Whether you're new to data integration or looking to deepen your expertise, this guide caters to individuals seeking to navigate the challenges of the modern data stack.

There is a concept in software engineering called 'shifting left', which focuses on testing software much earlier in the development lifecycle than you would normally expect. This helps teams building the software create better rituals and processes, while also ensuring quality and usability are key aspects to evaluate as the software is being built. We know this works in software development, but what happens when these practices are applied to building AI tools? (A sketch of the shift-left idea applied to data follows below.)

Saurabh Gupta is a seasoned technology executive and is currently Chief Strategy & Revenue Officer at The Modern Data Company. With over 25 years of experience in tech, data and strategy, he has led many strategy and modernization initiatives across industries and disciplines. Through his career, he has worked with various international organizations and NGOs, and public and private sector organizations. Before joining TMDC, he was the Head of Data Strategy & Governance at ThoughtWorks and CDO/Director for the Washington DC government, where he developed the digital/data modernization strategy for education data. Prior to DC Gov he played leadership and strategic roles at organizations including the IMF and World Bank, where he was responsible for their data strategy and led their OpenData initiatives. He has also worked closely with the African Development Bank, OECD, EuroStat, ECB, UN and FAO as part of inter-organization working groups on data and development goals. As part of the taskforce for international data cooperation under the G20 Data Gaps initiative, he chaired the technical working group on data standards and exchange. He also played an advisory role to the African Development Bank on their data democratization efforts under the Africa Information Highway.

In the episode, Adel and Saurabh explore the importance of data quality and how 'shifting left' can improve data quality practices, the role of data governance, the emergence of data product managers, operationalizing 'shift left' strategies through collaboration and data governance, the challenges faced when implementing data governance, future trends in data quality and governance, and much more.

Links Mentioned in the Show:
The Modern Data Company
Monte Carlo: The Annual State of Data Quality Survey
[Course] Data Governance Concepts
[Webinar] Crafting a Lean and Effective Data Governance Strategy
Related Episode: Building Trust in Data with Data Governance

New to DataCamp? Learn on the go using the DataCamp mobile app. Empower your business with world-class data and AI skills with DataCamp for business.
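To make 'shifting left' concrete for data work (a hedged sketch, not anything prescribed in the episode), a cheap structural check like the one below can run in CI against a sample extract before any downstream model or dashboard is built; the record shape and field names are invented.

```python
def validate_order_records(records: list[dict]) -> list[str]:
    """Run cheap structural checks early, before data enters the pipeline."""
    errors = []
    for i, rec in enumerate(records):
        if rec.get("order_id") is None:
            errors.append(f"row {i}: missing order_id")
        if not isinstance(rec.get("amount"), (int, float)) or rec["amount"] < 0:
            errors.append(f"row {i}: invalid amount {rec.get('amount')!r}")
    return errors

# In CI, fail the build before deployment if the sample data is bad.
sample = [{"order_id": 1, "amount": 42.0}, {"order_id": None, "amount": -5}]
for problem in validate_order_records(sample):
    print(problem)
```

Catching these defects at the extract stage is the data analogue of failing a unit test before merge, rather than discovering the breakage in a production dashboard.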

Generative AI has made a mark everywhere, including BI platforms, but how can you combine AI and BI? What effects can this have across organizations? With constituent aspects such as data quality, your AI strategy, and the specific use-case you're trying to solve, it's important to get the full picture and tread with intent. What are the subtleties that we need to get right in order for this marriage to work to its full potential?

Nick Magnuson is the Head of AI at Qlik, executing the organization's AI strategy, solution development, and innovation. Prior to Qlik, Nick was the CEO of Big Squid, which was acquired by Qlik in 2021. Nick has previously held executive roles in customer success, product, and engineering in the field of machine learning and predictive analytics. As a practitioner in this field for over 20 years, Nick has published original research in these areas, as well as on cognitive bias and other quantitative topics. He has also served as an advisor to other analytics platforms and start-ups. A long-time investment professional, Nick continues to hold his Chartered Financial Analyst designation and is a past member of the Chicago Quantitative Alliance and Society of Quantitative Analysts.

In the episode, Richie and Nick explore what Qlik offers, including products like Sense and Staige, how Staige uses AI to enhance customer capabilities, use cases of generative AI, advice on data privacy and security when using AI, data quality and its effect on the success of AI tools, AI strategy and leadership, how data roles are changing and the emergence of new positions, and much more.

Links Mentioned in the Show:
Qlik
Qlik Staige
Qlik Sense
[Skill Track] AI Fundamentals
Related Episode: Adapting to the AI Era with Jason Feifer, Editor in Chief of Entrepreneur Magazine
Sign up to RADAR: The Analytics Edition

New to DataCamp? Learn on the go using the DataCamp mobile app. Empower your business with world-class data and AI skills with DataCamp for business.

Despite the critical role of analytics in guiding business decisions, organizations continue to face significant challenges in harnessing its full potential. As data sets expand and deadlines shrink, the urgency to scale analytics processes becomes paramount. What data leaders now need to focus on are essential strategies for analytics at scale, including fostering a culture of continuous learning, prioritizing data governance, and leveraging generative AI.

Libby Duane Adams is the Chief Advocacy Officer and co-founder of Alteryx. She is responsible for strengthening upskilling and reskilling efforts for Alteryx customers to enable a culture of analytics, scaling the presence of the Alteryx SparkED education program and furthering diversity and inclusion in the workplace. As the former Chief Customer Officer, Libby has helped many Fortune 100 executives to identify and seize market opportunities, outsmart their competitors, and drive more revenue from their current businesses using analytics.

In the episode, Richie and Libby explore the differences between analytics and business intelligence, analytics as a team sport, the importance of speed in analytics, generative AI and its implications in analytics, the role of data quality and governance, Alteryx's AI platform, data skills as a workplace necessity, using AI to automate documentation and insights, success stories and mistakes within analytics, and much more.

Links Mentioned in the Show:
Alteryx
Alteryx SparkED Program
[Course] Introduction to Alteryx
Related Episode: From Data Literacy to AI Literacy with Cindi Howson, Chief Data Strategy Officer at ThoughtSpot
Sign up to RADAR: The Analytics Edition

New to DataCamp? Learn on the go using the DataCamp mobile app. Empower your business with world-class data and AI skills with DataCamp for business.

Data Cleaning with Power BI

Delve into the powerful world of data cleaning with Microsoft Power BI in this detailed guide. You'll learn how to connect, transform, and optimize data from various sources, setting a strong foundation for insightful data-driven decisions. Equip yourself with the skills to master data quality, leverage DAX and Power Query, and produce actionable insights with improved efficiency.

What this Book will help me do: Master connecting to various data sources and importing data effectively into Power BI. Learn to use the Query Editor to clean and transform data efficiently. Understand how to use the M language to perform advanced data transformations. Gain expertise in creating optimized data models and handling relationships within Power BI. Explore insights-driven exploratory data analysis using Power BI's powerful tools.

Author(s): Frazer is an experienced data professional with a deep knowledge of business intelligence tools and analytics processes. With a strong background in data science and years of hands-on experience using Power BI, Frazer brings practical advice to help users improve their data preparation and analysis skills. Known for creating resources that are both comprehensive and approachable, Frazer is dedicated to empowering readers in their data journey.

Who is it for? This book is ideal for data analysts, business intelligence professionals, and business analysts who work regularly with data. If you have a basic understanding of BI tools and concepts and are looking to deepen your skills, especially in Power BI, this book will guide you effectively. It will also help data scientists and other professionals interested in data cleaning to build a robust basis for data quality and analysis. Whether you're addressing common data challenges or seeking to enhance your BI capabilities, this guide is tailored to accommodate your needs.

Snowflake has been foundational in the data space for years. In the mid-2010s, the platform was a major driver of moving data to the cloud. More recently, it's become apparent that combining data and AI in the cloud is key to accelerating innovation. Snowflake has been rapidly adding AI features to provide value to the modern data stack, but what's really been going on under the hood?

At the time of recording, Sridhar Ramaswamy was the SVP of AI at Snowflake, before being appointed CEO of Snowflake in February 2024. Sridhar was formerly co-founder of Neeva, acquired in 2023 by Snowflake. Before founding Neeva, Ramaswamy oversaw Google's advertising products, including search, display, video advertising, analytics, shopping, payments, and travel. He joined Google in 2003 and was part of the growth of AdWords and Google's overall advertising business. He spent more than 15 years at Google, where he started as a software engineer and rose to SVP of Ads & Commerce.

In the episode, Richie and Sridhar explore Snowflake and its uses, how generative AI is changing the attitudes of leaders towards data, how NLP and AI have impacted enterprise business operations as well as new applications of AI in an enterprise environment, the challenges of enterprise search, the importance of data quality and management and the role of semantic layers in the effective use of AI, a look into Snowflake's products including Snowpilot and Cortex, the collaboration required for successful data and AI projects, advice for organizations looking to improve their data management, and much more.

About the AI and the Modern Data Stack DataFramed Series: This week we're releasing 4 episodes focused on how AI is changing the modern data stack and the analytics profession at large. The modern data stack is often an ambiguous and all-encompassing term, so we intentionally wanted to cover the impact of AI on the modern data stack from different angles. Here's what you can expect:
Why the Future of AI in Data will be Weird with Benn Stancil, CTO at Mode & Field CTO at ThoughtSpot — Covering how AI will change analytics workflows and tools
How Databricks is Transforming Data Warehousing and AI with Ari Kaplan, Head Evangelist & Robin Sutara, Field CTO at Databricks — Covering Databricks, data intelligence and how AI tools are changing data democratization
Adding AI to the Data Warehouse with Sridhar Ramaswamy, CEO at Snowflake — Covering Snowflake and its uses, how generative AI is changing the attitudes of leaders towards data, and how to improve your data management
Accelerating AI Workflows with Nuri Cankaya, VP of AI Marketing & La Tiffaney Santucci, AI Marketing Director at Intel — Covering AI's impact on marketing analytics, how AI is being integrated into existing products, and the democratization of AI

Links Mentioned in the Show:
Snowflake
Snowflake acquires Neeva to accelerate search in the Data Cloud through generative AI
Use AI in Seconds with Snowflake Cortex
[Course] Introduction to Snowflake
Related Episode: Why AI will Change Everything—with Former Snowflake CEO, Bob Muglia
Sign up to a...

Join host Jason Foster in a captivating conversation with Danette McGilvray, the President and Principal of Granite Falls Consulting. Delve into the transformative power of data quality as they discuss practical approaches to address data issues at any organisational level. Explore responsibilities and understand the crucial role individuals play in this insightful dialogue. Tune in to discover why and how "Data quality can save the world."

Most statistics in the industry indicate that a significant number of AI projects are not generating ROI. In this presentation, Media.Monks experts Julien Coquet and Ahmed Tarek will discuss common errors and pitfalls encountered in AI projects: using ML models out of context with the business model, no clear activation strategy, data quality and consistency issues, lack of infrastructure to deploy models, no MLOps or model monitoring after deployment, and more. Julien and Ahmed will offer solutions to these pAIn points.

Data Observability for Data Engineering

"Data Observability for Data Engineering" introduces you to the foundational concepts of observing and validating data pipeline health. With real-world projects and Python code examples, you'll gain hands-on experience in improving data quality and minimizing risks, enabling you to implement strategies that ensure accuracy and reliability in your data systems. What this Book will help me do Master data observability techniques to monitor and validate data pipelines effectively. Learn to collect and analyze meaningful metrics to gauge and improve data quality. Develop skills in Python programming specific to applying data concepts such as observable data state. Address scalability challenges using state-of-the-art observability frameworks and practices. Enhance your ability to manage and optimize data workflows ensuring seamless operation from start to end. Author(s) Authors Michele Pinto and Sammy El Khammal bring a wealth of experience in data engineering and observing scalable data systems. Pinto specializes in constructing robust analytics platforms while Khammal offers insights into integrating software observability into massive pipelines. Their collaborative writing style ensures readers find both practical advice and theoretical foundations. Who is it for? This book is geared toward data engineers, architects, and scientists who seek to confidently handle pipeline challenges. Whether you're addressing specific issues or wish to introduce proactive measures in your team, this guide meets the needs of those ready to leverage observability as a key practice.

Summary

The first step of data pipelines is to move the data to a place where you can process and prepare it for its eventual purpose. Data transfer systems are a critical component of data enablement, and building them to support large volumes of information is a complex endeavor. Andrei Tserakhau has dedicated his career to this problem, and in this episode he shares the lessons that he has learned and the work he is doing on his most recent data transfer system at DoubleCloud.
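For intuition about the simplest form of data transfer (a high-water-mark poll, a much simpler cousin of the log-based change-data capture discussed in this episode), here is a hedged sketch; the table, columns, and the assumption that the cursor column comes first are all invented for illustration.

```python
import sqlite3

def incremental_copy(src_conn, dst_conn, table: str, cursor_col: str, last_seen):
    """Copy only rows newer than the last high-water mark.

    Log-based CDC reads the database's write-ahead log instead; this
    polling approach is the simplest baseline for incremental transfer.
    """
    rows = src_conn.execute(
        f"SELECT * FROM {table} WHERE {cursor_col} > ? ORDER BY {cursor_col}",
        (last_seen,),
    ).fetchall()
    if rows:
        placeholders = ", ".join("?" for _ in rows[0])
        dst_conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
        dst_conn.commit()
        last_seen = rows[-1][0]  # assumes cursor_col is the first column
    return last_seen

# Demo: two in-memory databases standing in for source and destination.
src, dst = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
for c in (src, dst):
    c.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
src.execute("INSERT INTO events VALUES (1, 'a'), (2, 'b')")
mark = incremental_copy(src, dst, "events", "id", last_seen=0)
print(mark, dst.execute("SELECT COUNT(*) FROM events").fetchone())
```

Polling like this misses deletes and intra-poll updates, which is exactly the gap that log-based CDC systems are built to close.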

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack

You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It's the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it's real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!

This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues for every part of your data workflow, from migration to deployment. Datafold has recently launched a 3-in-1 product experience to support accelerated data migrations. With Datafold, you can seamlessly plan, translate, and validate data across systems, massively accelerating your migration project. Datafold leverages cross-database diffing to compare tables across environments in seconds, column-level lineage for smarter migration planning, and a SQL translator to make moving your SQL scripts easier. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold today!

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

Your host is Tobias Macey and today I'm interviewing Andrei Tserakhau about operationalizing high bandwidth and low-latency change-data capture

Interview

Introduction
How did you get involved in the area of data management?
Your most recent project involves operationalizing a generalized data transfer service. What was the original problem that you were trying to solve?

What were the shortcomings of other options in the ecosystem that led you to building a new system?

What was the design of your initial solution to the problem?

What are the sharp edges that you had to deal with to operate and use that i

Send us a text

Welcome to another engaging episode of Datatopics Unplugged, the podcast where tech and relaxation intersect. Today, we're excited to host two special guests, Paolo and Tim, who bring their unique perspectives to our cozy corner.

Guests of Today:
Paolo: An enthusiast of fantasy and sci-fi reading, Paolo is on a personal mission to reduce his coffee consumption. He has a unique way of measuring his height, at 0.89 Sams tall. With over two and a half years of experience as a data engineer at dataroots, Paolo contributes a rich professional perspective. His hobbies extend to playing field hockey and a preference for the warmer summer season.
Tim: Occasionally known as Dr. Dunkenstein, Tim brings a mix of humor and insight. He measures his height at 0.87 Sams tall. As the Head of Bizdev, he prefers to steer clear of grand titles, revealing his views on hierarchical structures and monarchies.

Topics:
Biz Corner:
Kyutai: We delve into France's answer to OpenAI with Paolo Leonard, exploring the implications and future of Kyutai: https://techcrunch.com/2023/11/17/kyutai-is-an-french-ai-research-lab-with-a-330-million-budget-that-will-make-everything-open-source/
GPT-NL: A discussion led by Bart Smeets on the Netherlands' own open language model and its potential impact: https://www.computerweekly.com/news/366558412/Netherlands-starts-building-its-own-AI-language-model
Tech Corner:
Data Quality Insights: A blog post by Paolo on data quality vs. data validation. We'll explore when and why data quality is essential, and evaluate tools like dbt, soda, deequ, and great_expectations (a small plain-Python sketch of the pattern those tools share follows below): https://dataroots.io/blog/state-of-data-quality-october-2023
Soda Data Contracts: An overview of the newly released OSS Data Contract Engine by Soda: https://docs.soda.io/soda/data-contracts.html
Food for Thought Corner:
Hare - A 100-Year Programming Language: Bart starts a discussion on the ambition of Hare to remain relevant for a century: https://harelang.org/blog/2023-11-08-100-year-language/

Join us for this mix of expert insights and light-hearted moments. Whether you're deeply embedded in the tech world or just dipping your toes in, this episode promises to be both informative and entertaining!

And, yes. There is a voucher, go to dataroots.io and navigate to the shop (top right) and use voucher code murilos_bargain_blast for a 25EUR discount!
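Following up on the data quality vs. data validation discussion above: tools like dbt tests, Soda, Deequ, and Great Expectations all share a declarative shape: named checks evaluated against data, with failures counted and reported. Below is a toy sketch of that shared pattern in plain Python; it deliberately avoids any specific tool's API, and the rows and checks are invented.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Check:
    name: str
    predicate: Callable[[dict], bool]  # returns True if the row passes

def run_checks(rows: list[dict], checks: list[Check]) -> dict[str, int]:
    """Return the number of failing rows per check, dbt-test style."""
    return {c.name: sum(1 for r in rows if not c.predicate(r)) for c in checks}

rows = [{"email": "a@x.io", "age": 31}, {"email": None, "age": -2}]
checks = [
    Check("email_not_null", lambda r: r["email"] is not None),
    Check("age_non_negative", lambda r: r["age"] >= 0),
]
print(run_checks(rows, checks))  # {'email_not_null': 1, 'age_non_negative': 1}
```

The tools differ mainly in where the checks run (warehouse SQL vs. Spark vs. Python) and how results are stored and alerted on, not in this basic contract.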

Fundamentals of Data Science

Fundamentals of Data Science: Theory and Practice presents basic and advanced concepts in data science along with real-life applications. The book provides students, researchers and professionals at different levels a good understanding of the concepts of data science, machine learning, data mining and analytics. Users will find the authors' research experiences and achievements in data science applications, along with in-depth discussions on topics that are essential for data science projects, including pre-processing (carried out before applying predictive and descriptive data analysis tasks) and proximity measures for numeric, categorical and mixed-type data. The book includes a systematic presentation of many predictive and descriptive learning algorithms, including recent developments that have successfully handled large datasets with high accuracy, as well as a number of descriptive learning tasks.

Presents the foundational concepts of data science along with advanced concepts and real-life applications for applied learning
Includes coverage of a number of key topics such as data quality and pre-processing, proximity and validation, predictive data science, descriptive data science, ensemble learning, association rule mining, Big Data analytics, as well as incremental and distributed learning
Provides updates on key applications of data science techniques in areas such as Computational Biology, Network Intrusion Detection, Natural Language Processing, Software Clone Detection, Financial Data Analysis, and Scientific Time Series Data Analysis
Covers computer program code for implementing descriptive and predictive algorithms
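Since the book highlights proximity measures for numeric, categorical and mixed-type data, here is a small Gower-style distance sketch for mixed records (an illustration of the concept, not the book's code); the fields and normalization ranges are invented.

```python
def mixed_distance(a: dict, b: dict, numeric_ranges: dict) -> float:
    """Gower-style dissimilarity: numeric fields contribute range-normalized
    absolute differences; categorical fields contribute a 0/1 mismatch."""
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            span = numeric_ranges[key] or 1.0  # avoid division by zero
            total += abs(a[key] - b[key]) / span
        else:
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)  # average over all fields, in [0, 1]

x = {"age": 30, "income": 52_000, "city": "Ghent"}
y = {"age": 40, "income": 48_000, "city": "Leuven"}
print(mixed_distance(x, y, numeric_ranges={"age": 50, "income": 100_000}))  # ~0.41
```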

Summary

Software development involves an interesting balance of creativity and repetition of patterns. Generative AI has accelerated the ability of developer tools to provide useful suggestions that speed up the work of engineers. Tabnine is one of the main platforms offering an AI powered assistant for software engineers. In this episode Eran Yahav shares the journey that he has taken in building this product and the ways that it enhances the ability of humans to get their work done, and when the humans have to adapt to the tool.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack

This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It's the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it's real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!

Your host is Tobias Macey and today I'm interviewing Eran Yahav about building an AI powered developer assistant at Tabnine

Interview

Introduction
How did you get involved in machine learning?
Can you describe what Tabnine is and the story behind it?
What are the individual and organizational motivations for using AI to generate code?

What are the real-world limitations of generative AI for creating software? (e.g. size/complexity of the outputs, naming conventions, etc.)
What are the elements of skepticism/overs

Poor data engineering is like building a shaky foundation for a house—it leads to unreliable information, wasted time and money, and even legal problems, making everything less dependable and more troublesome in our digital world. In the retail industry specifically, data engineering is particularly important for managing and analyzing large volumes of sales, inventory, and customer data, enabling better demand forecasting, inventory optimization, and personalized customer experiences. It helps retailers make informed decisions, streamline operations, and remain competitive in a rapidly evolving market. Insights and frameworks learned from data engineering practices can be applied to a multitude of people and problems, and in turn, learning from someone who has been at the forefront of data engineering is invaluable.

Mohammad Sabah is SVP of Engineering and Data at Thrive Market, and was appointed to this role in 2018. He joined the company from The Honest Company where he served as VP of Engineering & Chief Data Scientist. Sabah joined The Honest Company following its acquisition of Insnap, which he co-founded in 2015. Over the course of his career, Sabah has held various data science and engineering roles at companies including Facebook, Workday, Netflix, and Yahoo!

In the episode, Richie and Mo explore the importance of using AI to identify patterns and proactively address common errors, the use of tools like dbt and SODA for data pipeline abstraction and stakeholder involvement in data quality, data governance and data quality as foundations for strong data engineering, validation layers at each step of the data pipeline to ensure data quality (sketched below), collaboration between data analysts and data engineers for holistic problem-solving and reusability of patterns, ownership mentality in data engineering, and much more.

Links from the show:
PagerDuty
Domo
OpsGene
Career Track: Data Engineer
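As a rough sketch of the 'validation layers at each step of the pipeline' idea mentioned above (an illustration of the pattern only, not Thrive Market's stack), each stage below is wrapped with a checkpoint that must pass before data moves on; the stage functions and field names are invented.

```python
from typing import Callable

def with_validation(stage: Callable, check: Callable, name: str) -> Callable:
    """Wrap a pipeline stage so its output is validated before moving on."""
    def run(data):
        out = stage(data)
        bad = [row for row in out if not check(row)]
        if bad:
            raise ValueError(f"{name}: {len(bad)} rows failed validation")
        return out
    return run

# Placeholder stages: ingest raw rows, then normalize amounts to floats.
ingest = with_validation(lambda _: [{"amount": "12.5"}, {"amount": "3.0"}],
                         lambda r: "amount" in r, "ingest")
transform = with_validation(lambda rows: [{"amount": float(r["amount"])} for r in rows],
                            lambda r: r["amount"] >= 0, "transform")
print(transform(ingest(None)))  # [{'amount': 12.5}, {'amount': 3.0}]
```

Failing loudly at the layer that produced the bad rows localizes the fault, instead of letting it surface downstream in a report.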

Summary

Databases are the core of most applications, but they are often treated as inscrutable black boxes. When an application is slow, there is a good probability that the database needs some attention. In this episode Lukas Fittl shares some hard-won wisdom about the causes and solutions of many performance bottlenecks and the work that he is doing to shine some light on PostgreSQL to make it easier to understand how to keep it running smoothly.
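As one concrete way to shine light on Postgres before reaching for deeper tuning (a sketch, not any specific product's implementation): the stock pg_stat_statements extension exposes per-statement timing, and pulling the top consumers of total execution time is often the first diagnostic step. The DSN below is a placeholder, and the column names are the Postgres 13+ ones (older versions use total_time and mean_time).

```python
import psycopg2

# Placeholder DSN; pg_stat_statements must be enabled on the server.
conn = psycopg2.connect("dbname=app user=postgres host=localhost")

with conn.cursor() as cur:
    # mean_exec_time / total_exec_time are the Postgres 13+ column names.
    cur.execute(
        """
        SELECT query, calls, mean_exec_time, total_exec_time
        FROM pg_stat_statements
        ORDER BY total_exec_time DESC
        LIMIT 10
        """
    )
    for query, calls, mean_ms, total_ms in cur.fetchall():
        print(f"{total_ms:10.1f} ms total | {mean_ms:8.2f} ms avg | x{calls} | {query[:60]}")
```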

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack

You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It's the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it's real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold

Your host is Tobias Macey and today I'm interviewing Lukas Fittl about optimizing your database performance and tips for tuning Postgres

Interview

Introduction
How did you get involved in the area of data management?
What are the different ways that database performance problems impact the business?
What are the most common contributors to performance issues?
What are the useful signals that indicate performance challenges in the database?

For a given symptom, what are the steps that you recommend for determining the proximate cause?

What are the potential negative impacts to be aware of when tu

Summary

Databases are the core of most applications, whether transactional or analytical. In recent years the selection of database products has exploded, making the critical decision of which engine(s) to use even more difficult. In this episode Tanya Bragin shares her experiences as a product manager for two major vendors and the lessons that she has learned about how teams should approach the process of tool selection.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack

You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It's the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it's real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!

This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold

Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains, even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at dataengineeringpodcast.com/miro. That's three free boards at dataengineeringpodcast.com/miro.

Your host is Tobias Macey and today I'm interviewing Tanya Bragin about her views on the database products market

Interview

Introduction
How did you get involved in the area of data management?
What are the aspects of the database market that keep you interested as a VP of product?

How have your experiences at Elastic informed your current work at Clickhouse?

What are the main product categories for databases today?

What are the industry trends that have the most impact on the development and growth of different product categories? Which categories do you see growing the fastest?

When a team is selecting a database technology for a given task, what are the types of questions that they should be asking?
Transactional engines like Postgres, SQL Server, Oracle, etc. were long used

Identifying novel data issues that go undetected through CI/CD with dbt and Datafold - Coalesce 2023

Join the team from Moody's Analytics as they take you on a personal journey of optimizing their data pipelines for data quality and governance. Like many data practitioners, Ryan and Ravi understand the frustration and anxiety that come with accidentally introducing bad code into production pipelines—they've spent countless hours putting out fires caused by these unexpected changes.

In this session, Ryan and Ravi recount their experiences with a previous data stack that lacked standardized testing methods and visibility into the impact of code changes on production data. They also share how their new data stack is safeguarded by Datafold's data diffing and continuous integration (CI) capabilities, which enables their team to work with greater confidence, peace of mind, and speed.
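Data diffing of the kind described here can be pictured as a keyed comparison of the same table across two environments. Below is a toy sketch of the idea (not Datafold's implementation): rows are matched by primary key and bucketed into added, removed, and changed.

```python
def diff_tables(prod: dict, dev: dict) -> dict:
    """Compare two {primary_key: row} mappings and bucket the differences."""
    added = [k for k in dev if k not in prod]
    removed = [k for k in prod if k not in dev]
    changed = [k for k in prod.keys() & dev.keys() if prod[k] != dev[k]]
    return {"added": added, "removed": removed, "changed": changed}

prod = {1: ("alice", 10.0), 2: ("bob", 20.0)}
dev = {2: ("bob", 21.5), 3: ("carol", 5.0)}  # a code change altered row 2
print(diff_tables(prod, dev))
# {'added': [3], 'removed': [1], 'changed': [2]}
```

Running a comparison like this in CI, before a dbt change merges, is what surfaces the "novel data issues" the session title refers to: changes that are syntactically valid but alter row-level results.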

Speakers: Gleb Mezhanskiy, CEO, Datafold; Ravi Ramadoss, Director of Data Engineering, Moody's Analytics CRE; Ryan Kelly, Data Engineer, Moody's Analytics CRE

Register for Coalesce at https://coalesce.getdbt.com