Ford Motor Company operates extensively across various nations. The Data Operations (DataOps) team for Advanced Driver Assistance Systems (ADAS) at Ford is tasked with processing terabyte-scale daily data from lidar, radar, and video. To manage this, the DataOps team must orchestrate diverse, compute-intensive pipelines across both on-premises infrastructure and GCP, and handle sensitive customer data in both environments. The team is also responsible for facilitating the execution of on-demand, compute-intensive algorithms at scale. To achieve these objectives, the team places Astronomer/Airflow at the core of its strategy: multiple Astronomer/Airflow deployments integrate seamlessly and securely (via Apigee) to initiate batch data processing and ML jobs on the cloud, as well as compute-intensive computer vision tasks on-premises, with essential alerting provided through the ELK stack. This presentation will delve into the architecture and planning behind the hybrid batch router, highlighting its pivotal role in enabling rapid innovation and scalability in the development of ADAS features.
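As a rough illustration of the routing concept only (not Ford's actual implementation), here is a minimal Airflow sketch of a hybrid batch router; the DAG, task, and `job_type` parameter names are hypothetical:

```python
# Hypothetical sketch of a hybrid batch router DAG; all names are
# illustrative, not Ford's actual operators or job types.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator, PythonOperator


def route_job(**context):
    # Route on workload type: ML batch jobs go to the cloud, while
    # compute-intensive computer vision jobs stay on-premises.
    job_type = context["dag_run"].conf.get("job_type", "ml_batch")
    return "submit_cloud_batch" if job_type == "ml_batch" else "run_onprem_cv_job"


with DAG(
    dag_id="hybrid_batch_router",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # triggered on demand, e.g. via a REST call fronted by Apigee
) as dag:
    router = BranchPythonOperator(task_id="route_job", python_callable=route_job)
    cloud = PythonOperator(
        task_id="submit_cloud_batch",
        python_callable=lambda: print("submitting batch/ML job to GCP"),
    )
    onprem = PythonOperator(
        task_id="run_onprem_cv_job",
        python_callable=lambda: print("submitting CV job to on-prem compute"),
    )
    done = EmptyOperator(task_id="done", trigger_rule="none_failed_min_one_success")

    router >> [cloud, onprem] >> done
```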
Doug Needham is an OG DBA and data architect who built DataOps workflows back in Desert Storm (!) and has stayed current with data ever since. We talk about data architecture war stories, the hard work of doing generative AI in the enterprise, and much more. Enjoy!
In the fast-paced work environments we are used to, the ability to quickly find and understand data is essential. Data professionals can often spend more time searching for data than analyzing it, which can hinder business progress. Innovations like data catalogs and automated lineage systems are transforming data management, making it easier to ensure data quality, trust, and compliance. By creating a strong metadata foundation and integrating these tools into existing workflows, organizations can enhance decision-making and operational efficiency. But how did this all come to be, and who is driving better access and collaboration through data? Prukalpa Sankar is the Co-founder of Atlan. Atlan is a modern data collaboration workspace (like GitHub for engineering or Figma for design). By acting as a virtual hub for data assets ranging from tables and dashboards to models and code, Atlan enables teams to create a single source of truth for all their data assets and collaborate across the modern data stack through deep integrations with tools like Slack, BI tools, data science tools, and more. A pioneer in the space, Atlan was recognized by Gartner as a Cool Vendor in DataOps, one of the top three companies globally. Prukalpa previously co-founded SocialCops, a world-leading data-for-good company (New York Times Global Visionary, World Economic Forum Tech Pioneer). SocialCops is behind landmark data projects including India’s National Data Platform and global SDG monitoring in collaboration with the United Nations. She was awarded Economic Times Emerging Entrepreneur of the Year, Forbes 30u30, Fortune 40u40, and Top 10 CNBC Young Business Women 2016, and is a TED Speaker. In the episode, Richie and Prukalpa explore challenges within data discoverability, the inception of Atlan, the importance of a data catalog, personalization in data catalogs, data lineage, building data lineage, implementing data governance, human collaboration in data governance, skills for effective data governance, product design for diverse audiences, regulatory compliance, the future of data management, and much more.
Links Mentioned in the Show:
Atlan
Connect with Prukalpa
[Course] Artificial Intelligence (AI) Strategy
Related Episode: Adding AI to the Data Warehouse with Sridhar Ramaswamy, CEO at Snowflake
Sign up to RADAR: AI Edition
New to DataCamp? Learn on the go using the DataCamp mobile app. Empower your business with world-class data and AI skills with DataCamp for Business.
In "Data Engineering with Databricks Cookbook," you'll learn how to efficiently build and manage data pipelines using Apache Spark, Delta Lake, and Databricks. This recipe-based guide offers techniques to transform, optimize, and orchestrate your data workflows. What this Book will help me do Master Apache Spark for data ingestion, transformation, and analysis. Learn to optimize data processing and improve query performance with Delta Lake. Manage streaming data processing with Spark Structured Streaming capabilities. Implement DataOps and DevOps workflows tailored for Databricks. Enforce data governance policies using Unity Catalog for scalable solutions. Author(s) Pulkit Chadha, the author of this book, is a Senior Solutions Architect at Databricks. With extensive experience in data engineering and big data applications, he brings practical insights into implementing modern data solutions. His educational writings focus on empowering data professionals with actionable knowledge. Who is it for? This book is ideal for data engineers, data scientists, and analysts who want to deepen their knowledge in managing and transforming large datasets. Readers should have an intermediate understanding of SQL, Python programming, and basic data architecture concepts. It is especially well-suited for professionals working with Databricks or similar cloud-based data platforms.
Explore the reasons for data engineers to collaborate with data scientists, machine learning (ML) engineers, and developers on DataOps initiatives that support GenAI. Published at: https://www.eckerson.com/articles/dataops-for-generative-ai-data-pipelines-part-iii-team-collaboration
Companies that adopt DataOps increase the odds of success by making GenAI data pipelines what they should be: modular, scalable, robust, flexible, and governed. Published: https://www.eckerson.com/articles/dataops-for-generative-ai-data-pipelines-part-ii-must-have-characteristics
There's the interview you think you're going to have, then there's the interview you get. This is one of those, in the best way possible. I expected to chat about his time at Snowflake. We didn't even get past his early days building data warehouses because it was so fascinating. Did you know Kent is arguably one of the very first practitioners (probably an accidental inventor) of DataOps?
This is sort of a "prequel" episode. Kent Graziano and I chat about his early days as a data practitioner.
Project Nessie is an open-source project that provides a Git-like approach to version control for data lakehouse tables. This makes it possible to track data changes over time and revert to previous versions if necessary.
In a lakehouse environment, catalog versioning is essential for ensuring the accuracy and reliability of data. By tracking changes to the catalog, you can ensure that everyone is working with the same data version. This can help to prevent errors and inconsistencies.
Project Nessie can be used to implement catalog versioning in a lakehouse environment. This can be done by backing the catalog with a Nessie repository and then tracking changes through Nessie's Git-like commits, branches, and tags.
This presentation will discuss the benefits of using Project Nessie for catalog versioning in a lakehouse environment. We will also discuss how to implement catalog versioning using Project Nessie.
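As a hedged preview of what such an implementation can look like, the sketch below uses Nessie's Spark SQL extensions; the branch name `etl` and table `nessie.db.events` are illustrative, and the catalog configuration is omitted:

```python
from pyspark.sql import SparkSession

# A real session also needs the Iceberg + Nessie runtime jars and the
# spark.sql.catalog.nessie.* settings, omitted here for brevity.
spark = SparkSession.builder.appName("nessie-branching-demo").getOrCreate()

# Create an isolated branch of the catalog and switch to it.
spark.sql("CREATE BRANCH IF NOT EXISTS etl IN nessie FROM main")
spark.sql("USE REFERENCE etl IN nessie")

# Changes land on the `etl` branch without affecting readers of `main`...
spark.sql("INSERT INTO nessie.db.events VALUES (1, 'click')")

# ...and become visible atomically once the branch is merged.
spark.sql("MERGE BRANCH etl INTO main IN nessie")
```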
The success of Generative AI depends on fundamental disciplines like DataOps. Published at: https://www.eckerson.com/articles/dataops-for-generative-ai-data-pipelines-part-i-what-and-why
In this episode, I’m chatting with former Gartner analyst Sanjeev Mohan who is the Co-Author of Data Products for Dummies. Throughout our conversation, Sanjeev shares his expertise on the evolution of data products, and what he’s seen as a result of implementing practices that prioritize solving for use cases and business value. Sanjeev also shares a new approach of structuring organizations to best implement ownership and accountability of data product outcomes. Sanjeev and I also explore the common challenges of product adoption and who is responsible for user experience. I purposefully had Sanjeev on the show because I think we have pretty different perspectives from which we see the data product space.
Highlights / Skip to:
I introduce Sanjeev Mohan, co-author of Data Products for Dummies (00:39)
Sanjeev expands more on the concept of writing a “for Dummies” book (00:53)
Sanjeev shares his definition of a data product, including both a technical and a business definition (01:59)
Why Sanjeev believes organizational changes and accountability are the keys to preventing the acceleration of shipping data products with little to no tangible value (05:45)
How Sanjeev recommends getting buy-in for data product ownership from other departments in an organization (11:05)
Sanjeev and I explore adoption challenges and the topic of user experience (13:23)
Sanjeev explains what role is responsible for user experience and design (19:03)
Who should be responsible for defining the metrics that determine business value (28:58)
Sanjeev shares some case studies of companies who have adopted this approach to data products and their outcomes (30:29)
Where companies are finding data product managers currently (34:19)
Sanjeev expands on his perspective regarding the importance of prioritizing business value and use cases (40:52)
Where listeners can get Data Products for Dummies, and learn more about Sanjeev’s work (44:33)
Quotes from Today’s Episode
“You may slap a label of data product on existing artifact; it does not make it a data product because there’s no sense of accountability. In a data product, because they are following product management best practices, there must be a data product owner or a data product manager. There’s a single person [responsible for the result].” — Sanjeev Mohan (09:31)
“I haven’t even mentioned the word data mesh because data mesh and data products, they don’t always have to go hand-in-hand. I can build data products, but I don’t need to go into the—do all of data mesh principles.” – Sanjeev Mohan (26:45)
“We need to have the right organization, we need to have a set of processes, and then we need a simplified technology which is standardized across different teams. So, this way, we have the benefit of reusing the same technology. Maybe it is Snowflake for storage, DBT for modeling, and so on. And the idea is that different teams should have the ability to bring their own analytical engine.” – Sanjeev Mohan (27:58)
“Generative AI, right now as we are recording, is still in a prototyping phase. Maybe in 2024, it’ll go heavy-duty production. We are not in prototyping phase for data products for a lot of companies. They’ve already been experimenting for a year or two, and now they’re actually using them in production. So, we’ve crossed that tipping point for data products.” – Sanjeev Mohan (33:15)
“Low adoption is a problem that’s not just limited to data products. How long have we had data catalogs, but they have low adoption. So, it’s a common problem.” – Sanjeev Mohan (39:10)
“That emphasis on technology first is a wrong approach. I tell people that I’m sorry to burst your bubble, but there are no technology projects, there are only business projects. Technology is an enabler. You don’t do technology for the sake of technology; you have to serve a business cause, so let’s start with that and keep that front and center.” – Sanjeev Mohan (43:03)
Links
Data Products for Dummies: https://www.dataops.live/dataproductsfordummies
“What Exactly is A Data Product” article: https://medium.com/data-mesh-learning/what-exactly-is-a-data-product-7f6935a17912
It Depends: https://www.youtube.com/@SanjeevMohan
Chief Data Analytics and Product Officer of Equifax: https://www.youtube.com/watch?v=kFY7WGc-jFM
SanjMo Consulting: https://www.sanjmo.com/
dataops.live: https://dataops.live
dataops.live/dataproductsfordummies: https://dataops.live/dataproductsfordummies
LinkedIn: https://www.linkedin.com/in/sanjmo/
Medium articles: https://sanjmo.medium.com
Explore the fusion of DevOps and DataOps with Sascha Giese, SolarWinds Global Tech Evangelist. Learn to optimize data efficiency, mitigate risks, and transform your database management for reliability and scalability. 🚀📊 #DataOps #devops
✨ H I G H L I G H T S ✨
🙌 A huge shoutout to all the incredible participants who made Big Data Conference Europe 2023 in Vilnius, Lithuania, from November 21-24, an absolute triumph! 🎉 Your attendance and active participation were instrumental in making this event so special. 🌍
Don't forget to check out the session recordings from the conference to relive the valuable insights and knowledge shared! 📽️
Once again, THANK YOU for playing a pivotal role in the success of Big Data Conference Europe 2023. 🚀 See you next year for another unforgettable conference! 📅 #BigDataConference #SeeYouNextYear
This book provides Kafka administrators, site reliability engineers, and DataOps and DevOps practitioners with a list of real production issues that can occur in Kafka clusters and how to solve them. The production issues covered are assembled into a comprehensive troubleshooting guide for those engineers who are responsible for the stability and performance of Kafka clusters in production, whether those clusters are deployed in the cloud or on-premises. This book teaches you how to detect and troubleshoot the issues, and eventually how to prevent them. Kafka stability is hard to achieve, especially in high throughput environments, and the purpose of this book is not only to make troubleshooting easier, but also to prevent production issues from occurring in the first place. The guidance in this book is drawn from the author's years of experience in helping clients and internal customers diagnose and resolve knotty production problems and stabilize their Kafka environments. The book is organized into recipe-style troubleshooting checklists that field engineers can easily follow when under pressure to fix an unstable cluster. This is the book you will want by your side when the stakes are high, and your job is on the line.
What You Will Learn:
Monitor and resolve production issues in your Kafka clusters
Provision Kafka clusters with the lowest costs and still handle the required loads
Perform root cause analyses of issues affecting your Kafka clusters
Know the ways in which your Kafka cluster can affect its consumers and producers
Prevent or minimize data loss and delays in data streaming
Forestall production issues through an understanding of common failure points
Create checklists for troubleshooting your Kafka clusters when problems occur
Who This Book Is For:
Site reliability engineers tasked with maintaining stability of Kafka clusters, Kafka administrators who troubleshoot production issues around Kafka, DevOps and DataOps experts who are involved with provisioning Kafka (whether on-premises or in the cloud), developers of Kafka consumers and producers who wish to learn more about Kafka
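As a small taste of the checklist style, here is a hedged Python sketch of one common check, detecting under-replicated partitions, using the confluent-kafka AdminClient; the bootstrap address is illustrative, and this particular snippet is not taken from the book:

```python
from confluent_kafka.admin import AdminClient

# Connect to the cluster; replace with your real bootstrap servers.
admin = AdminClient({"bootstrap.servers": "localhost:9092"})
metadata = admin.list_topics(timeout=10)

for topic in metadata.topics.values():
    for partition in topic.partitions.values():
        # An in-sync replica set smaller than the replica set is a
        # common early warning sign of broker or replication trouble.
        if len(partition.isrs) < len(partition.replicas):
            print(f"{topic.topic}[{partition.id}] is under-replicated")
```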
Behind any good DataOps within a Modern Data Stack (MDS) architecture is a solid DevOps design! This is particularly pressing when building an MDS solution at scale, as the reliability, quality, and availability of data require a very high degree of process automation while remaining fast, agile, and resilient to change when addressing business needs.
While DevOps in data engineering is nothing new, a broad-spectrum solution that includes the data warehouse, BI, and more has seemed either out of reach due to overall complexity and cost, or simply overlooked due to perceived scaling issues often attributed to the challenges of automating CI/CD processes. However, this has been changing fast with tools such as dbt, whose features allow a very high degree of autonomy in CI/CD processes with relative ease, including flexible, cutting-edge capabilities around pre-commits, Slim CI, and more (a minimal sketch of the Slim CI pattern follows below).
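For illustration only, this sketch runs a Slim CI build using the programmatic runner available in dbt-core 1.5+; the `./prod-artifacts` path, which would hold the production manifest, is an assumption:

```python
from dbt.cli.main import dbtRunner

# Slim CI: build only models changed relative to the production state,
# deferring unchanged upstream refs to the existing production objects.
result = dbtRunner().invoke([
    "build",
    "--select", "state:modified+",  # changed models plus their descendants
    "--defer",                      # resolve unchanged refs to prod relations
    "--state", "./prod-artifacts",  # directory holding the prod manifest.json
])
print("CI build succeeded:", result.success)
```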
In this session, Datatonic covers the challenges around building and deploying enterprise-grade MDS solutions for analytics at scale and how they have used dbt to address those - especially around near-complete autonomy to the CI/CD processes!
Speaker: Ash Sultan, Lead Data Architect, Datatonic
Register for Coalesce at https://coalesce.getdbt.com
Using Databricks, we built a “Unified Talent Solution” backed by a robust data and AI engine that analyzes the skills of a combined pool of permanent employees, contractors, part-time employees, and vendors; infers skill gaps, future trends, and recommended priority areas to bridge talent gaps; and ultimately greatly improved the operational efficiency, transparency, commercial model, and talent experience of our client. We leveraged a variety of ML algorithms, such as boosting, neural networks, and NLP transformers, to provide better AI-driven insights.
One inevitable part of developing these models within a typical data science workflow is iteration. MLflow, the end-to-end ML lifecycle service on Databricks, helped streamline this process by organizing runs into experiments that tracked the data used for training and testing, model artifacts, lineage, and the corresponding results and metrics. For checking the health of our models with drift detection, bias, and explainability techniques, MLflow's deployment and monitoring services were leveraged extensively.
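As a hedged sketch (not the team's actual code), the core MLflow tracking loop looks roughly like this; the experiment name, parameters, and metric values are illustrative:

```python
import mlflow

# Group related runs into a named experiment.
mlflow.set_experiment("talent-skill-models")

with mlflow.start_run(run_name="boosting-baseline"):
    # Parameters, metrics, and artifacts are versioned per run.
    mlflow.log_param("max_depth", 6)
    mlflow.log_param("n_estimators", 300)
    # ... train and evaluate the model here ...
    mlflow.log_metric("val_auc", 0.91)
    # Model binaries and plots can be logged alongside metrics, e.g.:
    # mlflow.sklearn.log_model(model, "model")
```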
Our solution, built on the Databricks platform, simplified ML by defining a data-centric workflow that unified best practices from DevOps, DataOps, and ModelOps. Databricks Feature Store allowed us to productionize our models and features jointly. Insights were delivered through visually appealing charts and graphs, built with Power BI, Plotly, and Matplotlib, that answer the business questions most relevant to clients. We also built our own advanced custom analytics platform on top of Delta Lake: Delta's ACID guarantees allowed us to build a real-time reporting app that displays consistent and reliable data, with React on the front end and Structured Streaming ingesting data from Delta tables to serve live query analytics and real-time ML predictions on top of the analytics data.
Talk by: Nitu Nivedita
Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc
How can you maximize data collaboration across your organization without having to build integrations between individual applications, systems, and other data sources? Data collaboration architectures that don't depend on integrations aren't a new idea, but they've assumed greater urgency as organizations increasingly struggle to manage the ever-growing numbers of data sources that exist inside their IT estates. In this report, Cinchy cofounders Dan DeMers and Karanjot Jaswal show CIOs, CTOs, CDOs, and other IT leaders how to rethink their organization's approach to data architectures, data management, and data governance. You'll learn about different approaches to creating data platforms that liberate and autonomize data, enable agile data management, apply consistent data access controls, and maximize visibility without requiring application-specific integrations. With this report, you'll discover: Why data integration is often handled piecemeal—combining one app with another rather than integrating all apps together How data collaboration platforms enable data sharing across all apps, systems, and sources without application-specific integrations Four major platforms you can use to make data available to all applications and services: Cinchy, K2View, Microsoft Dataverse, and The Modern Data Company Principles and practices for deploying the data collaboration platform of your choice Dan DeMers is the CEO and cofounder of Cinchy. Karanjot Jaswal is cofounder and CTO of Cinchy.
The unbundling of the data ecosystem is causing organizations to “duct tape” products and frameworks together to build their solutions and data delivery processes. Organizations fail to build and deploy end-to-end, automated, repeatable data-driven systems, ignoring data engineering & dataops principles as well as best practices. Published at: https://www.eckerson.com/articles/dataops-in-data-engineering
Summary
Data transformation is a key activity for all of the organizational roles that interact with data. Because of its importance and outsized impact on what is possible for downstream data consumers, it is critical that everyone is able to collaborate seamlessly. SQLMesh was designed as a unifying tool that is simple to work with but powerful enough for large-scale transformations and complex projects. In this episode Toby Mao explains how it works, the importance of automatic column-level lineage tracking, and how you can start using it today.
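SQLMesh builds its column-level lineage on the SQLGlot parser; as a rough, illustrative sketch of that underlying capability (the schema and query here are invented), SQLGlot can trace a projected column back to its source columns:

```python
# Minimal column-level lineage with SQLGlot, the parser SQLMesh is
# built on; the schema and query are invented for illustration.
from sqlglot.lineage import lineage

node = lineage(
    column="total",
    sql="SELECT price * quantity AS total FROM orders",
    schema={"orders": {"price": "double", "quantity": "int"}},
)

# Walk the lineage graph from the projected column back to its sources.
for n in node.walk():
    print(n.name)
```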
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management. RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack. Your host is Tobias Macey and today I'm interviewing Toby Mao about SQLMesh, an open source DataOps framework designed to scale data transformations with ease of collaboration and validation built in.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what SQLMesh is and the story behind it?
DataOps is a term that has been co-opted and overloaded. What are the concepts that you are trying to convey with that term in the context of SQLMesh?
What are the rough edges in existing toolchains/workflows that you are trying to address with SQLMesh?
How do those rough edges impact the productivity and effectiveness of teams using those toolchains?
Can you describe how SQLMesh is implemented?
How have the design and goals evolved since you first started working on it?
What are the lessons that you have learned from dbt which have informed the design and functionality of SQLMesh?
For teams who have already invested in dbt, what is the migration path from or integration with dbt?
You have some built-in integration with/awareness of orchestrators (currently Airflow). What are the benefits of making the transformation tool aware of the orchestrator?
What do you see as the potential benefits of integration with e.g. data-diff?
What are the second-order benefits of using a tool such as SQLMesh that addresses the more mechanical aspects of managing transformation workflows and the associated dependency chains?
What are the most interesting, innovative, or unexpected ways that you have seen SQLMesh used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on SQLMesh?
When is SQLMesh the wrong choice?
What do you have planned for the future of SQLMesh?
Contact Info
tobymao on GitHub
@captaintobs on Twitter
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.
Links
SQLMesh
Tobiko Data
SAS
AirBnB
Minerva
SQLGlot
Cron
AST == Abstract Syntax Tree
Pandas
Terraform
dbt
Podcast Episode
SQLFluff
Podcast.init Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
We talked about:
Santona's background
Focusing on data workflows
Upsolver vs DBT
ML pipelines vs Data pipelines
MLOps vs DataOps
Tools used for data pipelines and ML pipelines
The “modern data stack” and today's data ecosystem
Staging the data and the concept of a “lakehouse”
Transforming the data after staging
What happens after the modeling phase
Human-centric vs Machine-centric pipeline
Applying skills learned in academia to ML engineering
Crafting user personas based on real stories
A framework of curiosity
Santona's book and resource recommendations
Links:
LinkedIn: https://www.linkedin.com/in/santona-tuli/
Upsolver website: upsolver.com
Why we built a SQL-based solution to unify batch and stream workflows: https://www.upsolver.com/blog/why-we-built-a-sql-based-solution-to-unify-batch-and-stream-workflows
Free MLOps course: https://github.com/DataTalksClub/mlops-zoomcamp
Join DataTalks.Club: https://datatalks.club/slack.html
Our events: https://datatalks.club/events.html
Summary
A significant portion of the time spent by data engineering teams is on managing the workflows and operations of their pipelines. DataOps has arisen as a parallel set of practices to that of DevOps teams as a means of reducing wasted effort. Agile Data Engine is a platform designed to handle the infrastructure side of the DataOps equation, as well as providing the insights that you need to manage the human side of the workflow. In this episode Tevje Olin explains how the platform is implemented, the features that it provides to reduce the amount of effort required to keep your pipelines running, and how you can start using it in your own team.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management. RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack. Your host is Tobias Macey and today I'm interviewing Tevje Olin about Agile Data Engine, a platform that combines data modeling, transformations, continuous delivery and workload orchestration to help you manage your data products and the whole lifecycle of your warehouse.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Agile Data Engine is and the story behind it?
What are some of the tools and architectures that an organization might be able to replace with Agile Data Engine?
How does the unified experience of Agile Data Engine change the way that teams think about the lifecycle of their data? What are some of the types of experiments that are enabled by reduced operational overhead?
What does CI/CD look like for a data warehouse?
How is it different from CI/CD for software applications?
Can you describe how Agile Data Engine is architected?
How have the design and goals of the system changed since you first started working on it? What are the components that you needed to develop in-house to enable your platform goals?
What are the changes in the broader data ecosystem that have had the most influence on your product goals and customer adoption? Can you describe the workflow for a team that is using Agile Data Engine to power their business analytics?
What are some of the insights that you generate to help your customers understand how to improve their processes or identify new opportunities?
In your "about" page it mentions the unique approaches that you take for warehouse automation. How do your practices differ from the rest of the industry? How have changes in the adoption/implementation of ML and AI impacted the ways that your customers exercise your platform? What are the most interesting, innovative, or unexpected ways that you have seen the Agile Data Engine platform used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Agile Data Engine? When is Agile Data Engine the wrong choice? What do you have planned for the future of Agile Data Engine?
Guest Contact Info
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
About Agile Data Engine
Agile Data Engine unlocks the potential of your data to drive business value - in a rapidly changing world. Agile Data Engine is a DataOps Management platform for designing, deploying, operating and managing data products, and managing the whole lifecycle of a data warehouse. It combines data modeling, transformations, continuous delivery and workload orchestration into the same platform.
Links
Agile Data Engine
Bill Inmon
Ralph Kimball
Snowflake
Redshift
BigQuery
Azure Synapse
Airflow
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
Rudderstack: 
RudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.
RudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team.
RudderStack also supports real-time use cases. You can implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again.
Visit dataengineeringpodcast.com/rudderstack to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.
Support Data Engineering Podcast
We talked about:
Bart's background
What is data governance?
Data dictionaries and data lineage
Data access management
How to learn about data governance
What skills are needed to do data governance effectively
When an organization needs to start thinking about data governance
Good data access management processes
Data masking and the importance of automating data access
DPO and CISO roles
How data access management works with a data mesh approach
Avoiding the role explosion problem
The importance of data governance integration in DataOps
Terraform as a stepping stone to data governance
How Raito can help an organization with data governance
Open-source data governance tools
Links:
LinkedIn: https://www.linkedin.com/in/bartvandekerckhove/
Twitter: https://twitter.com/Bart_H_VDK
Github: https://github.com/raito-io
Website: https://www.raito.io/
Data Mesh Learning Slack: https://data-mesh-learning.slack.com/join/shared_invite/zt-1qs976pm9-ci7lU8CTmc4QD5y4uKYtAA#/shared-invite/email
DataQG Website: https://dataqg.com/
DataQG Slack: https://dataqgcommunitygroup.slack.com/join/shared_invite/zt-12n0333gg-iTZAjbOBeUyAwWr8I~2qfg#/shared-invite/email
DMBOK (Data Management Book of Knowledge): https://www.dama.org/cpages/body-of-knowledge
DMBOK Wheel describing the data governance activities: https://www.dama.org/cpages/dmbok-2-wheel-images
Free MLOps course: https://github.com/DataTalksClub/mlops-zoomcamp
Join DataTalks.Club: https://datatalks.club/slack.html
Our events: https://datatalks.club/events.html