talk-data.com talk-data.com

Topic

CI/CD

Continuous Integration/Continuous Delivery (CI/CD)

devops automation software_development ci_cd

262

tagged

Activity Trend

21 peak/qtr
2020-Q1 2026-Q1

Activities

262 activities · Newest first

Summary

Artificial intelligence applications require substantial high quality data, which is provided through ETL pipelines. Now that AI has reached the level of sophistication seen in the various generative models it is being used to build new ETL workflows. In this episode Jay Mishra shares his experiences and insights building ETL pipelines with the help of generative AI.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free! As more people start using AI for projects, two things are clear: It’s a rapidly advancing field, but it’s tough to navigate. How can you get the best results for your use case? Instead of being subjected to a bunch of buzzword bingo, hear directly from pioneers in the developer and data science space on how they use graph tech to build AI-powered apps. . Attend the dev and ML talks at NODES 2023, a free online conference on October 26 featuring some of the brightest minds in tech. Check out the agenda and register at Neo4j.com/NODES. Your host is Tobias Macey and today I'm interviewing Jay Mishra about the applications for generative AI in the ETL process

Interview

Introduction How did you get involved in the area of data management? What are the different aspects/types of ETL that you are seeing generative AI applied to?

What kind of impact are you seeing in terms of time spent/quality of output/etc.?

What kinds of projects are most likely to benefit from the application of generative AI? Can you describe what a typical workflow of using AI to build ETL workflows looks like?

What are some of the types of errors that you are likely to experience from the AI? Once the pipeline is defined, what does the ongoing maintenance look like? Is the AI required to operate within the pipeline in perpetuity?

For individuals/teams/organizations who are experimenting with AI in their data engineering workflows, what are the concerns/questions that they are trying to address? What are the most interesting, innovative, or unexpected w

Abstract : Les composants des chaines devops (ou pipelines CI/CD) sont désormais des composants critiques du développement des applications cloud native et contiennent des informations hyper sensibles et des secrets. Ils peuvent être compromis et être un vecteur d'attaque sur la software supply chain. Nous présenterons les enjeux de de sécurité relatifs à ces pipelines CI/CD et ferons un focus sur ce nouveau projet OWASP, né en 2022 de la collaboration d'experts Appsec, en l'illustrant par des exemples.

Summary

The rapid growth of machine learning, especially large language models, have led to a commensurate growth in the need to store and compare vectors. In this episode Louis Brandy discusses the applications for vector search capabilities both in and outside of AI, as well as the challenges of maintaining real-time indexes of vector data.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free! If you’re a data person, you probably have to jump between different tools to run queries, build visualizations, write Python, and send around a lot of spreadsheets and CSV files. Hex brings everything together. Its powerful notebook UI lets you analyze data in SQL, Python, or no-code, in any combination, and work together with live multiplayer and version control. And now, Hex’s magical AI tools can generate queries and code, create visualizations, and even kickstart a whole analysis for you – all from natural language prompts. It’s like having an analytics co-pilot built right into where you’re already doing your work. Then, when you’re ready to share, you can use Hex’s drag-and-drop app builder to configure beautiful reports or dashboards that anyone can use. Join the hundreds of data teams like Notion, AllTrails, Loom, Mixpanel and Algolia using Hex every day to make their work more impactful. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial of the Hex Team plan! Your host is Tobias Macey and today I'm interviewing Louis Brandy about building vector indexes in real-time for analytics and AI applications

Interview

Introduction How did you get involved in the area of data management? Can you describe what vector search is and how it differs from other search technologies?

What are the technical challenges related to providing vector search? What are the applications for vector search that merit the added complexity?

Vector databases have been gaining a lot of attention recently with the proliferation of LLM applicati

Summary

A significant amount of time in data engineering is dedicated to building connections and semantic meaning around pieces of information. Linked data technologies provide a means of tightly coupling metadata with raw information. In this episode Brian Platz explains how JSON-LD can be used as a shared representation of linked data for building semantic data products.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free! If you’re a data person, you probably have to jump between different tools to run queries, build visualizations, write Python, and send around a lot of spreadsheets and CSV files. Hex brings everything together. Its powerful notebook UI lets you analyze data in SQL, Python, or no-code, in any combination, and work together with live multiplayer and version control. And now, Hex’s magical AI tools can generate queries and code, create visualizations, and even kickstart a whole analysis for you – all from natural language prompts. It’s like having an analytics co-pilot built right into where you’re already doing your work. Then, when you’re ready to share, you can use Hex’s drag-and-drop app builder to configure beautiful reports or dashboards that anyone can use. Join the hundreds of data teams like Notion, AllTrails, Loom, Mixpanel and Algolia using Hex every day to make their work more impactful. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial of the Hex Team plan! Your host is Tobias Macey and today I'm interviewing Brian Platz about using JSON-LD for building linked-data products

Interview

Introduction How did you get involved in the area of data management? Can you describe what the term "linked data product" means and some examples of when you might build one?

What is the overlap between knowledge graphs and "linked data products"?

What is JSON-LD?

What are the domains in which it is typically used? How does it assist in developing linked data products?

what are the characterist

Summary

Data systems are inherently complex and often require integration of multiple technologies. Orchestrators are centralized utilities that control the execution and sequencing of interdependent operations. This offers a single location for managing visibility and error handling so that data platform engineers can manage complexity. In this episode Nick Schrock, creator of Dagster, shares his perspective on the state of data orchestration technology and its application to help inform its implementation in your environment.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free! Your host is Tobias Macey and today I'm welcoming back Nick Schrock to talk about the state of the ecosystem for data orchestration

Interview

Introduction How did you get involved in the area of data management? Can you start by defining what data orchestration is and how it differs from other types of orchestration systems? (e.g. container orchestration, generalized workflow orchestration, etc.) What are the misconceptions about the applications of/need for/cost to implement data orchestration?

How do those challenges of customer education change across roles/personas?

Because of the multi-faceted nature of data in an organization, how does that influence the capabilities and interfaces that are needed in an orchestration engine? You have been working on Dagster for five years now. How have the requirements/adoption/application for orchestrators changed in that time? One of the challenges for any orchestration engine is to balance the need for robust and extensible core capabilities with a rich suite of integrations to the broader data ecosystem. What are the factors that you have seen make the most influence in driving adoption of a given engine? What are the most interesting, innovative, or unexpected ways that you have seen data orchestration implemented and/or used? What are the most interesting, unexpected, or challenging lessons that you have learned while working o

Summary

Cloud data warehouses and the introduction of the ELT paradigm has led to the creation of multiple options for flexible data integration, with a roughly equal distribution of commercial and open source options. The challenge is that most of those options are complex to operate and exist in their own silo. The dlt project was created to eliminate overhead and bring data integration into your full control as a library component of your overall data system. In this episode Adrian Brudaru explains how it works, the benefits that it provides over other data integration solutions, and how you can start building pipelines today.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free! This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold Your host is Tobias Macey and today I'm interviewing Adrian Brudaru about dlt, an open source python library for data loading

Interview

Introduction How did you get involved in the area of data management? Can you describe what dlt is and the story behind it?

What is the problem you want to solve with dlt? Who is the target audience?

The obvious comparison is with systems like Singer/Meltano/Airbyte in the open source space, or Fivetran/Matillion/etc. in the commercial space. What are the complexities or limitations of those tools that leave an opening for dlt? Can you describe how dlt is implemented? What are the benefits of building it in Python? How have the design and goals of the project changed since you first started working on it? How does that language choice influence the performance and scaling characteristics? What problems do users solve with dlt? What are the interfaces available for extending/customizing/integrating with dlt? Can you talk through the process of adding a new source/destination? What is the workflow for someone building a pipeline with dlt? How does the experience scale when supporting multiple connections? Given the limited scope of extract and load, and the composable design of dlt it seems like a purpose built companion to dbt (down to th

Summary

Data persistence is one of the most challenging aspects of computer systems. In the era of the cloud most developers rely on hosted services to manage their databases, but what if you are a cloud service? In this episode Vignesh Ravichandran explains how his team at Cloudflare provides PostgreSQL as a service to their developers for low latency and high uptime services at global scale. This is an interesting and insightful look at pragmatic engineering for reliability and scale.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free! Your host is Tobias Macey and today I'm interviewing Vignesh Ravichandran about building an internal database as a service platform at Cloudflare

Interview

Introduction How did you get involved in the area of data management? Can you start by describing the different database workloads that you have at Cloudflare?

What are the different methods that you have used for managing database instances?

What are the requirements and constraints that you had to account for in designing your current system? Why Postgres? optimizations for Postgres

simplification from not supporting multiple engines

limitations in postgres that make multi-tenancy challenging scale of operation (data volume, request rate What are the most interesting, innovative, or unexpected ways that you have seen your DBaaS used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on your internal database platform? When is an internal database as a service the wrong choice? What do you have planned for the future of Postgres hosting at Cloudflare?

Contact Info

LinkedIn Website

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Mac

Summary

Generative AI has unlocked a massive opportunity for content creation. There is also an unfulfilled need for experts to be able to share their knowledge and build communities. Illumidesk was built to take advantage of this intersection. In this episode Greg Werner explains how they are using generative AI as an assistive tool for creating educational material, as well as building a data driven experience for learners.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free! Your host is Tobias Macey and today I'm interviewing Greg Werner about building IllumiDesk, a data-driven and AI powered online learning platform

Interview

Introduction How did you get involved in the area of data management? Can you describe what Illumidesk is and the story behind it? What are the challenges that educators and content creators face in developing and maintaining digital course materials for their target audiences? How are you leaning on data integrations and AI to reduce the initial time investment required to deliver courseware? What are the opportunities for collecting and collating learner interactions with the course materials to provide feedback to the instructors? What are some of the ways that you are incorporating pedagogical strategies into the measurement and evaluation methods that you use for reports? What are the different categories of insights that you need to provide across the different stakeholders/personas who are interacting with the platform and learning content? Can you describe how you have architected the Illumidesk platform? How have the design and goals shifted since you first began working on it? What are the strategies that you have used to allow for evolution and adaptation of the system in order to keep pace with the ecosystem of generative AI capabilities? What are the failure modes of the content generation that you need to account for? What are the most interesting, innovative, or unexpected ways that you have seen Illumidesk us

Summary

Data pipelines are the core of every data product, ML model, and business intelligence dashboard. If you're not careful you will end up spending all of your time on maintenance and fire-fighting. The folks at Rivery distilled the seven principles of modern data pipelines that will help you stay out of trouble and be productive with your data. In this episode Ariel Pohoryles explains what they are and how they work together to increase your chances of success.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold Your host is Tobias Macey and today I'm interviewing Ariel Pohoryles about the seven principles of modern data pipelines

Interview

Introduction How did you get involved in the area of data management? Can you start by defining what you mean by a "modern" data pipeline? At Rivery you published a white paper identifying seven principles of modern data pipelines:

Zero infrastructure management ELT-first mindset Speaks SQL and Python Dynamic multi-storage layers Reverse ETL & operational analytics Full transparency Faster time to value

What are the applications of data that you focused on while identifying these principles? How do the application of these principles influence the ability of organizations and their data teams to encourage and keep pace with the use of data in the business? What are the technical components of a pipeline infrastructure that are necessary to support a "modern" workflow? How do the technologies involved impact the organizational involvement with how data is applied throughout the business? When using managed services, what are the ways that the pricing model acts to encourage/discourage experimentation/exploration with data? What are the most interesting, innovative, or unexpected ways that you have seen these seven principles implemented/applied? What are the most interesting, unexpected, or challenging lessons that you have learned while working with customers to adapt to these principles? What are the cases where some/all of these principles are undesirable/impractical to implement? What are the opportunities for further advancement/sophistication in the ways that teams work with and gain value from data?

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned somethi

Sponsored: dbt Labs | Modernizing the Data Stack: Lessons Learned From Evolution at Zurich Insurance

In this session, we will explore the path Zurich Insurance took to modernize its data stack and data engineering practices, and the lessons learned along the way. We'll touch on how and why the team chose to:

  • Adopt community standards in code quality, code coverage, code reusability, and CI/CD
  • Rebuild the way data engineering collaborates with business teams
  • Explore data tools accessible to non-engineering users, with considerations for code-first and no-code interfaces
  • Structure our dbt project and orchestration — and the factors that played into our decisions

Talk by: Jose L Sanchez Ros and Gerard Sola

Here’s more to explore: Why the Data Lakehouse Is Your next Data Warehouse: https://dbricks.co/3Pt5unq Lakehouse Fundamentals Training: https://dbricks.co/44ancQs

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Feeding the World One Plant at a Time

Join this session to learn how the CVML and Data Platform team at BlueRiver Technology utilized Databricks to maximize savings on herbicide usage and revolutionize Precision Agriculture.

Blue River Technology is an agricultural technology company that uses computer vision and machine learning (CVML) to revolutionize the way crops are grown and harvested. BRT’s See & Spray technology, which uses CVML to identify and precisely determine whether the plant is a weed or a crop so it can deliver a small, targeted dose of herbicide directly to the plant, while leaving the crop unharmed. By using this approach, Blue River significantly reduces the amount of herbicides used in agriculture by over 70% and has a positive impact on the environment and human health.

The technical challenges we seek to overcome are:  - Processing massive petabytes of proprietary data at scale and in real time. Equipment in the field can generate up to 40TBs of data per hour per machine. - Aggregating, curating and visualizing at scale data can often be convoluted, error-prone and complex.  - Streamlining pipelines runs from weeks to hours to ensure continuous delivery of data.  - Abstracting and automating  the infra, deployment and data management from each program. - Building downstream data products based on descriptive analysis, predictive analysis or prescriptive analysis to drive the machine behavior.

The business questions we seek to answer for any machine are:  - Are we getting the spray savings we anticipated? - Are we reducing the use of herbicide at the scale we expected? - Are spraying nozzles performing at the expected rate? - Finding the relevant data to troubleshoot new edge conditions.  - Providing a simple interface for data exploration to both technical and non-technical personas to help improve our model. - Identifying repetitive and new faults in our machines. - Filtering out data based on certain incidents. - Identifying anomalies for e.g. sudden drop in spray saving, like frequency of broad spray suddenly is too high.

How we are addressing and plan to address these challenges: - Designating Databricks as our purposeful DB for all data - using the bronze, silver and gold layer standards. - Processing new machine logs using a Delta Live table as a source both in batch and incremental manner. - Democratize access for data scientists, product managers, data engineers who are not proficient with the robotic software stack via notebooks for quick development as well as real time dashboards.

Talk by: Fahad Khan and Naveed Farooqui

Here’s more to explore: LLM Compact Guide: https://dbricks.co/43WuQyb Big Book of MLOps: https://dbricks.co/3r0Pqiz

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksin

Scaling MLOps for a Demand Forecasting Across Multiple Markets for a Large CPG

In this session, we look at how one of the world’s largest CPG company setup a scalable MLOps pipeline for a demand forecasting use case that predicted demand at 100,000+ DFUs (demand forecasting units) on a weekly basis across more than 20 markets. This implementation resulted in significant cost savings in terms of improved productivity, reduced cloud usage and faster time to value amongst other benefits. You will leave this session with a clearer picture on the following:

  • Best practices in scaling MLOps with Databricks and Azure for a demand forecasting use case with a multi-market and multi-region roll-out.
  • Best practices related to model re-factoring and setting up standard CI-CD pipelines for MLOps.
  • What are some of the pitfalls to avoid in such scenarios?

Talk by: Sunil Ranganathan and Vinit Doshi

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Bridging the Production Gap: Develop and Deploy Code Easily With IDEs

Hear from customers how they are using software development best practices to combine the best of Integrated Development Environments (IDEs) with Databricks. See the latest developments that unlock key productivity gains from IDEs like code linters, AI code assistants and integrations with CI/CD tools to make going to production smoother and more reliable.

Attend this session to learn how to use IDEs with Databricks and take advantage of:

  • Native development - Write code, edit files and run on Databricks with the familiarity of your favorite IDE with DB Connect
  • Interactive debugging - Step through code in a cluster to quickly pinpoint and fix errors so that code is more robust and easily maintained
  • CI/CD pipelines - Set up and manage your CI/CD pipelines using the new CLI
  • IDE ecosystems - Use familiar integrations to streamline code reviews and deploy code faster

Sign up today to boost your productivity by combining your favorite IDE with the scale of Databricks.

Talk by: Saad Ansari and Fabian Jakobs

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksin

Databricks Asset Bundles: A Standard, Unified Approach to Deploying Data Products on Databricks

In this session, we will introduce Databricks Asset Bundles, provide a demonstration of how they work for a variety of data products, and how to fit them into an overall CICD strategy for the well-architected Lakehouse.

Data teams produce a variety of assets; datasets, reports and dashboards, ML models, and business applications. These assets depend upon code (notebooks, repos, queries, pipelines), infrastructure (clusters, SQL warehouses, serverless endpoints), and supporting services/resources like Unity Catalog, Databricks Workflows, and DBSQL dashboards. Today, each organization must figure out a deployment strategy for the variety of data products they build on Databricks as there is no consistent way to describe the infrastructure and services associated with project code.

Databricks Asset Bundles is a new capability on Databricks that standardizes and unifies the deployment strategy for all data products developed on the platform. It allows developers to describe the infrastructure and resources of their project through a YAML configuration file, regardless of whether they are producing a report, dashboard, online ML model, or Delta Live Tables pipeline. Behind the scenes, these configuration files use Terraform to manage resources in a Databricks workspace, but knowledge of Terraform is not required to use Databricks Asset Bundles.

Talk by: Rafi Kurlansik and Pieter Noordhuis

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

In this presentation, we discuss how we built a fully managed workflow orchestration system at Salesforce using Apache Airflow to facilitate dependable data lake infrastructure on the public cloud. We touch upon how we utilized kubernetes for increased scalability and resilience, as well as the most effective approaches for managing and scaling data pipelines. We will also talk about how we addressed data security and privacy, multitenancy, and interoperability with other internal systems. We discuss how we use this system to empower users with the ability to effortlessly build reliable pipelines that incorporate failure detection, alerting, and monitoring for deep insights through monitoring, removing the undifferentiated heavy lifting associated with running and managing their own orchestration engines. Lastly, we elaborate on how we integrated our in-house CI/CD pipelines to enable effective DAG and dependency management, further enhancing the system’s capabilities.

Representing the Murex Reporting team at UniCredit we would like to present our journey with Airflow, and how over the past two years it enabled us to automate and simplify our batch workflows. Comparing to our previous rigid mainframe scheduling approach, we have created a robust and scalable framework complete with a CI/CD process, bringing our time to market of scheduling changes down from 3 days to 1. Basing our solution on DAG networks joined by ResumeDagRunOperators and an array of custom-built plugins (such as static time predecessors) we were able to replicate the scheduling of our overnight ETL processes (consisting of approx. 8000 tasks with many-to-many dependencies) in Airflow, satisfying our bank reporting SLAs without performance regression and gaining massively improved process visibility and control. Our presentation will illustrate our journey and explore some of these customizations, which venture outside of Airflow’s core functionalities.

Change management in data teams can be challenging to say the least. Not only you have to evolve your data pipelines, data structures, and datasets themselves across environments, you also have to keep data exploration and visualizations tools in sync. In this talk, we’ll be exploring how to do this best across environments (ie: dev, staging and prod), talking about how CI/CD can help, implementing good data ops practices and cranking up the level of rigor where it matters. We’ll also talk about rigor-vs-speed tradeoffs, where clearly not all data pipelines are born equal, and how to think about to evolve the level of rigor over time in places where it matters most.

A steady rise in users and business critical workflows poses challenges to development and production workflows. The solution: enable multi-tenancy on our single Airflow instance. We needed to enable teams to manage their python requirements, and ensure DAGs were insulated from each other. To achieve this we divided our monolithic setup into three parts: Infrastructure (with common code packaging), Workspace Creation, and CI/CD to manage deployments. Backstage templates enable teams to create isolated development environments that resemble our production environment, ensuring consistency. Distributing common code via a private pypi gives teams more control over what code their DAGs run. And a PythonOperator Shim in production utilizes virtualenv to run Python code with each team’s defined requirements for their DAG. In doing these things we enable effective multi-tenancy, and facilitate easier development and production workflows for Airflow.

Apache Airflow has over 650 Python dependencies. In case you did not know already, dependencies in Python are difficult subject. And Airflow has its own, custom ways of managing the dependencies. Airflow has a rather complex system to manage dependencies in their CI system, but this talk is not about it. This talk is directed to the users of Airflow who want to keep their dependencies updated, describing ways they can do it. This presentation will explain how to effectively manage and handle custom dependencies in Airflow. Jarek will guide you through practical solutions and best practices to make your Airflow experience with dependencies - yes you guessed it - a breeze.

Summary

A significant portion of the time spent by data engineering teams is on managing the workflows and operations of their pipelines. DataOps has arisen as a parallel set of practices to that of DevOps teams as a means of reducing wasted effort. Agile Data Engine is a platform designed to handle the infrastructure side of the DataOps equation, as well as providing the insights that you need to manage the human side of the workflow. In this episode Tevje Olin explains how the platform is implemented, the features that it provides to reduce the amount of effort required to keep your pipelines running, and how you can start using it in your own team.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack Your host is Tobias Macey and today I'm interviewing Tevje Olin about Agile Data Engine, a platform that combines data modeling, transformations, continuous delivery and workload orchestration to help you manage your data products and the whole lifecycle of your warehouse

Interview

Introduction How did you get involved in the area of data management? Can you describe what Agile Data Engine is and the story behind it? What are some of the tools and architectures that an organization might be able to replace with Agile Data Engine?

How does the unified experience of Agile Data Engine change the way that teams think about the lifecycle of their data? What are some of the types of experiments that are enabled by reduced operational overhead?

What does CI/CD look like for a data warehouse?

How is it different from CI/CD for software applications?

Can you describe how Agile Data Engine is architected?

How have the design and goals of the system changed since you first started working on it? What are the components that you needed to develop in-house to enable your platform goals?

What are the changes in the broader data ecosystem that have had the most influence on your product goals and customer adoption? Can you describe the workflow for a team that is using Agile Data Engine to power their business analytics?

What are some of the insights that you generate to help your customers understand how to improve their processes or identify new opportunities?

In your "about" page it mentions the unique approaches that you take for warehouse automation. How do your practices differ from the rest of the industry? How have changes in the adoption/implementation of ML and AI impacted the ways that your customers exercise your platform? What are the most interesting, innovative, or unexpected ways that you have seen the Agile Data Engine platform used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Agile Data Engine? When is Agile Data Engine the wrong choice? What do you have planned for the future of Agile Data Engine?

Guest Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

About Agile Data Engine

Agile Data Engine unlocks the potential of your data to drive business value - in a rapidly changing world. Agile Data Engine is a DataOps Management platform for designing, deploying, operating and managing data products, and managing the whole lifecycle of a data warehouse. It combines data modeling, transformations, continuous delivery and workload orchestration into the same platform.

Links

Agile Data Engine Bill Inmon Ralph Kimball Snowflake Redshift BigQuery Azure Synapse Airflow

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Sponsored By: Rudderstack: Rudderstack

RudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.

RudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team.

RudderStack also supports real-time use cases. You can Implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again.

Visit dataengineeringpodcast.com/rudderstack to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.Support Data Engineering Podcast