CI/CD

Data Engineering with Scala and Spark

2024-01-31 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Rupam Bhattacharjee , Eric Tome , David Radford

API Data Engineering Scala Spark jvm-languages programming-languages software-development

Data Engineering with Scala and Spark guides you through building robust data pipelines that process massive datasets efficiently. You will learn practical techniques leveraging Scala and Spark with a hands-on approach to mastering data engineering tasks including ingestion, transformation, and orchestration.

What this Book will help me do

Set up a data pipeline development environment using Scala

Utilize Spark APIs like DataFrame and Dataset for effective data processing

Implement CI/CD and testing strategies for pipeline maintainability

Optimize pipeline performance through tuning techniques

Apply data profiling and quality enforcement using tools like Deequ

Author(s)

Eric Tome, Rupam Bhattacharjee, and David Radford bring decades of combined experience in data engineering and distributed systems. Their work spans cutting-edge data processing solutions using Scala and Spark. They aim to help professionals excel in building reliable, scalable pipelines.

Who is it for?

This book is tailored for working data engineers familiar with data workflow processes who desire to enhance their expertise in Scala and Spark. If you aspire to build scalable, high-performance data solutions or transition raw data into strategic assets, this book is ideal.

Ricardo Sueiras: Using Modern Application Principals to Automate Your Apache Airflow Data Pipelines

2023-12-04 · DATA MINER Big Data Europe Conference 2020 Watch

video

by Ricardo Sueiras (AWS)

Airflow Big Data

Discover the power of modern application principles in automating your Apache Airflow Data Pipelines with Ricardo Sueiras. 🚀🐍 Learn how to leverage CI/CD for seamless development, testing, and deployment, and say goodbye to manual cron job management! 💻📈 #ApacheAirflow #automation

✨ H I G H L I G H T S ✨

🙌 A huge shoutout to all the incredible participants who made Big Data Conference Europe 2023 in Vilnius, Lithuania, from November 21-24, an absolute triumph! 🎉 Your attendance and active participation were instrumental in making this event so special. 🌍

Don't forget to check out the session recordings from the conference to relive the valuable insights and knowledge shared! 📽️

Once again, THANK YOU for playing a pivotal role in the success of Big Data Conference Europe 2023. 🚀 See you next year for another unforgettable conference! 📅 #BigDataConference #SeeYouNextYear

Enhancing The Abilities Of Software Engineers With Generative AI At Tabnine

2023-11-13 · Data Engineering Podcast Listen

podcast_episode

by Eran Yahav (Technion – Israel Institute of Technology) , Tobias Macey

AI/ML Analytics BI Cloud Computing Data Engineering Data Lake Data Lakehouse Data Management Data Quality Datafold dbt Delta +8 more

Summary

Software development involves an interesting balance of creativity and repetition of patterns. Generative AI has accelerated the ability of developer tools to provide useful suggestions that speed up the work of engineers. Tabnine is one of the main platforms offering an AI powered assistant for software engineers. In this episode Eran Yahav shares the journey that he has taken in building this product and the ways that it enhances the ability of humans to get their work done, and when the humans have to adapt to the tool.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management

Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack

This episode is brought to you by Datafold, a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!

Your host is Tobias Macey and today I'm interviewing Eran Yahav about building an AI powered developer assistant at Tabnine

Interview

Introduction

How did you get involved in machine learning?

Can you describe what Tabnine is and the story behind it?

What are the individual and organizational motivations for using AI to generate code?

What are the real-world limitations of generative AI for creating software? (e.g. size/complexity of the outputs, naming conventions, etc.)

What are the elements of skepticism/overs

From Development to Production: How Mobile DevOps can accelerate mobile development

2023-11-08 · BLN DevOps November edition #43

talk

Analytics DevOps

In this talk I will explore how Mobile DevOps can significantly accelerate the mobile development lifecycle. I will dive deep into the strategies, tools, and best practices that empower mobile development teams to seamlessly transition from the development phase to production, all while maintaining the highest standards of quality and reliability.

During this talk, you will discover:

The Mobile DevOps mindset: Understand the core principles and mindset shifts that are essential for integrating DevOps practices into your mobile development workflow.

Streamlining development workflows: Learn how to optimize your development process to reduce bottlenecks and streamline collaboration between development, QA, and operations teams.

Automation and Continuous Integration/Continuous Deployment (CI/CD): Explore how automation tools and CI/CD pipelines can help you automate repetitive tasks, increase efficiency, and ensure consistent app delivery.

Monitoring and feedback loops: Discover the importance of real-time monitoring, performance analytics, and user feedback in shaping a continuous improvement cycle for your mobile apps.

This talk will equip you with the knowledge and insights needed to harness the power of Mobile DevOps and accelerate your mobile app development journey.

Cracking the Data Engineering Interview

2023-11-07 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Kedeisha Bryan , Taamir Ransome

Cloud Computing Data Engineering Data Modelling ETL/ELT Python Cyber Security SQL data data-engineering

"Cracking the Data Engineering Interview" is your essential guide to mastering the data engineering interview process. This book offers practical insights and techniques to build your resume, refine your skills in Python, SQL, data modeling, and ETL, and confidently tackle over 100 mock interview questions. Gain the knowledge and confidence to land your dream role in data engineering.

What this Book will help me do

Craft a compelling data engineering portfolio to stand out to employers.

Refresh and deepen understanding of essential topics like Python, SQL, and ETL.

Master over 100 interview questions that cover both technical and behavioral aspects.

Understand data engineering concepts such as data modeling, security, and CI/CD.

Develop negotiation, networking, and personal branding skills crucial for job applications.

Author(s)

Kedeisha Bryan and Taamir Ransome are seasoned authors with a wealth of experience in data engineering and professional development. Drawing from their extensive industry backgrounds, they provide actionable strategies for aspiring data engineers. Their approachable writing style and real-world insights make complex topics accessible to readers.

Who is it for?

This book is ideal for aspiring data engineers looking to navigate the job application process effectively. Readers should be familiar with data engineering fundamentals, including Python, SQL, cloud data platforms, and ETL processes. It's tailored for professionals aiming to enhance their portfolios, tackle challenging interviews, and boost their chances of landing a data engineering role.

How to Become a Data Scientist in 2024 - Data Hackers Podcast 75

2023-11-06 · Data Hackers Listen

podcast_episode

by Nilton Ueda (AB-Inbev/Ambev) , Monique Femme (PUCRS) , Paulo Vasconcellos , Mikaeri Ohana (CI&T) , Gabriel Lages

AI/ML Data Science Microsoft Tableau

If you dream of diving into the world of data, we explore the strategies and skills needed to follow the path to becoming a data scientist in 2024. Find out how to prepare for the opportunities of the future and master the world of data science!

In this episode of Data Hackers, the largest AI and Data Science community in Brazil, meet this pair of experts:

Mikaeri Ohana: AI and ML Lead at CI&T, content creator at Explica Mi, recognized by Google as a Google Developer Expert in ML and by Microsoft as a Microsoft Most Valuable Professional in AI, master's student at Unicamp, and founder of Escola Tesseract.

Nilton Ueda: Global Data Product Manager at @AB-Inbev/Ambev, MBA professor at FIAP/MACKENZIE/IMPACTA/IBMEC, 3x @LATAM Tableau Ambassador.

Remember that you can find all of the Data Hackers community podcasts on Spotify, iTunes, Google Podcast, Castbox, and many other platforms. If you prefer, you can also listen to the episode right here in the post!


Meet our guests:

Mikaeri Ohana

Nilton Ueda

Data Hackers panel:

Paulo Vasconcellos

Monique Femme

Gabriel Lages

Discussed in the episode:

Reference links:

Take part in the State of Data survey: http://www.stateofdata.com.br/podcast

Where to find Mikaeri: Http://Instagram.com/explicami https://medium.com/@mikaeriohana https://www.linkedin.com/in/mikaeriohana

Where to find Nilton: https://www.linkedin.com/in/niltonkazuyukiueda/

Shining Some Light In The Black Box Of PostgreSQL Performance

2023-11-06 · Data Engineering Podcast Listen

podcast_episode

by Lukas Fittl , Tobias Macey

AI/ML Analytics BI Cloud Computing Data Engineering Data Lake Data Lakehouse Data Management Data Quality Datafold dbt Delta +8 more

Summary

Databases are the core of most applications, but they are often treated as inscrutable black boxes. When an application is slow, there is a good probability that the database needs some attention. In this episode Lukas Fittl shares some hard-won wisdom about the causes and solution of many performance bottlenecks and the work that he is doing to shine some light on PostgreSQL to make it easier to understand how to keep it running smoothly.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management

Your host is Tobias Macey and today I'm interviewing Lukas Fittl about optimizing your database performance and tips for tuning Postgres

Interview

Introduction

How did you get involved in the area of data management?

What are the different ways that database performance problems impact the business?

What are the most common contributors to performance issues?

What are the useful signals that indicate performance challenges in the database?

For a given symptom, what are the steps that you recommend for determining the proximate cause?

What are the potential negative impacts to be aware of when tu

Surveying The Market Of Database Products

2023-10-30 · Data Engineering Podcast Listen

podcast_episode

by Tanya Bragin (ClickHouse) , Tobias Macey

Analytics BI ClickHouse Cloud Computing Data Engineering Data Management Data Quality Datafold dbt ELK Modern Data Stack Oracle +4 more

Summary

Databases are the core of most applications, whether transactional or analytical. In recent years the selection of database products has exploded, making the critical decision of which engine(s) to use even more difficult. In this episode Tanya Bragin shares her experiences as a product manager for two major vendors and the lessons that she has learned about how teams should approach the process of tool selection.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management

Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at dataengineeringpodcast.com/miro. That’s three free boards at dataengineeringpodcast.com/miro.

Your host is Tobias Macey and today I'm interviewing Tanya Bragin about her views on the database products market

Interview

Introduction

How did you get involved in the area of data management?

What are the aspects of the database market that keep you interested as a VP of product?

How have your experiences at Elastic informed your current work at ClickHouse?

What are the main product categories for databases today?

What are the industry trends that have the most impact on the development and growth of different product categories? Which categories do you see growing the fastest?

When a team is selecting a database technology for a given task, what are the types of questions that they should be asking?

Transactional engines like Postgres, SQL Server, Oracle, etc. were long used

Supercharging analytics engineers to balance quality & speed via automated CI checks - Coalesce 2023

2023-10-27 · dbt Coalesce 2023 Watch

video

by Jorrit Posor (FINN GmbH) , Chiel Fernhout (Datafold) , Felix Kreitschmann (FINN Auto)

Analytics Analytics Engineering Data Engineering Datafold

Supercharge your analytics engineering with the power of automated CI checks. Learn how FINN, a global car subscription service, has harnessed the capabilities of automated CI checks to maintain the delicate balance between swift development and robust data pipeline quality as they've scaled their data teams. Dive into insights and strategies to ensure quality without sacrificing speed and discover how to improve your data operations.

Speakers: Chiel Fernhout, Software Engineer, Datafold; Jorrit Posor, Tech Lead Data Engineering, FINN GmbH; Felix Kreitschmann, Senior PM, Data, FINN Auto

Register for Coalesce at https://coalesce.getdbt.com

Enterprise MDS deployment at scale: dbt & DevOps - Coalesce 2023

2023-10-27 · dbt Coalesce 2023 Watch

video

by Ash Sultan (Datatonic)

Agile/Scrum Analytics BI Data Engineering DataOps dbt DevOps DWH Modern Data Stack

Behind any good DataOps within a Modern Data Stack (MDS) architecture is a solid DevOps design! This is particularly pressing when building an MDS solution at scale, as reliability, quality and availability of data requires a very high degree of process automation while remaining fast, agile and resilient to change when addressing business needs.

DevOps in Data Engineering is nothing new. For a broad-spectrum solution that includes the data warehouse, BI, and so on, however, it has often seemed either out of reach because of overall complexity and cost, or has simply been overlooked because of perceived scaling issues, often attributed to the challenges of automating CI/CD processes. This has been changing fast: tools such as dbt now offer flexible features such as pre-commit hooks and Slim CI that allow a very high degree of autonomy in CI/CD processes with relative ease.

In this session, Datatonic covers the challenges around building and deploying enterprise-grade MDS solutions for analytics at scale and how they have used dbt to address those - especially around near-complete autonomy to the CI/CD processes!

Speaker: Ash Sultan, Lead Data Architect, Datatonic


Hands-on tips to get started with CI in dbt Cloud - Coalesce 2023

2023-10-27 · dbt Coalesce 2023 Watch

video

by Joel Labes (dbt Labs)

Cloud Computing dbt

Learn best practices for improving your data workflows at scale. In this session, the dbt Labs team shares tactical ideas for setting up CI for the first time and shipping with confidence, as well as tips to take your implementation to the next level.

Speaker: Joel Labes, Senior Developer Experience Advocate, dbt Labs


Identifying novel data issues that go undetected through CI/CD with dbt and Datafold - Coalesce 2023

2023-10-25 · dbt Coalesce 2023 Watch

video

by Ravi Ramadoss (Moody's Analytics CRE) , Gleb Mezhanskiy (Datafold) , Ryan Kelly (Moody's Analytics CRE)

Analytics Data Engineering Data Quality Datafold dbt

Join the team from Moody's Analytics as they take you on a personal journey of optimizing their data pipelines for data quality and governance. Like many data practitioners, Ryan and Ravi understand the frustration and anxiety that come with accidentally introducing bad code into production pipelines. They have spent countless hours putting out fires caused by these unexpected changes.

In this session, Ryan and Ravi recount their experiences with a previous data stack that lacked standardized testing methods and visibility into the impact of code changes on production data. They also share how their new data stack is safeguarded by Datafold's data diffing and continuous integration (CI) capabilities, which enable their team to work with greater confidence, peace of mind, and speed.
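As a rough illustration of the data diffing idea mentioned above, the following is a minimal sketch that compares a production and a development version of the same table by primary key. It is not Datafold's implementation; the function name `diff_tables`, the row format, and the sample data are all hypothetical.

```python
from typing import Any

# A row is modelled as a plain dict of column name -> value (an assumption
# made for this sketch; real tools diff inside the warehouse).
Row = dict[str, Any]


def diff_tables(prod: list[Row], dev: list[Row], key: str) -> dict[str, list]:
    """Compare two versions of a table by primary key.

    Returns the keys that were added in dev, removed from prod,
    or present in both but with different column values.
    """
    prod_by_key = {row[key]: row for row in prod}
    dev_by_key = {row[key]: row for row in dev}

    added = sorted(dev_by_key.keys() - prod_by_key.keys())
    removed = sorted(prod_by_key.keys() - dev_by_key.keys())
    changed = sorted(
        k
        for k in prod_by_key.keys() & dev_by_key.keys()
        if prod_by_key[k] != dev_by_key[k]
    )
    return {"added": added, "removed": removed, "changed": changed}


if __name__ == "__main__":
    # Hypothetical sample data: one changed row, one removed, one added.
    prod_rows = [
        {"id": 1, "amount": 10},
        {"id": 2, "amount": 20},
        {"id": 3, "amount": 30},
    ]
    dev_rows = [
        {"id": 1, "amount": 10},
        {"id": 2, "amount": 25},
        {"id": 4, "amount": 40},
    ]
    print(diff_tables(prod_rows, dev_rows, key="id"))
```

In a CI job, a non-empty `changed` or `removed` list could be surfaced on the pull request so reviewers see the data impact of a code change before it is merged.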

Speakers: Gleb Mezhanskiy, CEO, Datafold; Ravi Ramadoss, Director of Data Engineering, Moody's Analytics CRE; Ryan Kelly, Data Engineer, Moody's Analytics CRE


Powering MuleSoft's (a Salesforce Company) modern data analytics framework with dbt - Coalesce 2023

2023-10-25 · dbt Coalesce 2023 Watch

video

by Dakota Kelley (phData) , Yijun Cao (Salesforce)

Analytics Data Analytics dbt

In this session, Yijun Cao and Dakota Kelley highlight how dbt plays a key role in MuleSoft’s modern data analytics framework. As MuleSoft continues to grow, the data platform engineering team encountered some challenges with the data transformations. The session dives deeper into the specific challenges faced by the team and demonstrates how dbt’s features including CI/CD, testing, documentation and lineage help overcome these challenges.

Speakers: Yijun Cao, Data Platform Engineer, Salesforce; Dakota Kelley, Solution Architect, phData


Better CI for better data quality - Coalesce 2023

2023-10-25 · dbt Coalesce 2023 Watch

video

by Grace Goheen (dbt Labs)

Analytics Cloud Computing Data Quality dbt

Continuous Integration (CI) in dbt Cloud makes it easy to test every change you make prior to deploying. It’s a hallmark of mature analytics workflows. We’ve made some major improvements to dbt Cloud CI, so it’s easier than ever to prevent breaking changes, save on costs, and keep those pesky stakeholders happy.

Join the dbt Labs product team on this magical journey to a world of better data quality, and see for yourself what CI can do for you.

Speaker: Grace Goheen, Product Manager, dbt Labs


Data and monolith: Scaling a computationally slim 1500+ model beast - Coalesce 2023

2023-10-24 · dbt Coalesce 2023 Watch

video

by Michael Revelo (ClickUp)

dbt DWH Marketing Snowflake

Learn how ClickUp uses dbt, dbt packages, and Snowflake to save on storage and compute costs using Slim CI and how they empower a data warehouse centric culture across Sales, Marketing, Product Growth, Finance, and RevOps all while maintaining one monolithic dbt build job.

Speaker: Michael Revelo, Data Platform Lead, ClickUp


On the benefits and virtues of drilling pilot holes - Coalesce 2023

2023-10-24 · dbt Coalesce 2023 Watch

video

by Leo Folsom (Datafold)

Cloud Computing Datafold dbt Git

A significant proportion of dbt Cloud users do not have a dbt CI job set up. Among those who do, many don’t leverage powerful functionality like state comparison and deferral to implement Slim CI, likely causing teams to miss errors and build unnecessary tables. Setting up Slim CI in dbt Cloud can be especially challenging for larger-scale data organizations that have multiple data environments, git branches, and targets. Watch this session to learn how you can build and evolve a strong, lasting data environment using Slim CI.
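To make state comparison and deferral more concrete, here is a minimal sketch of how a CI script might assemble a Slim CI invocation. It assumes dbt's `state:modified+` selector together with the `--defer` and `--state` flags; the helper name `slim_ci_command` and the artifacts directory `prod-run-artifacts` are hypothetical and not taken from the session.

```python
import shlex


def slim_ci_command(state_dir: str, include_children: bool = True) -> list[str]:
    """Build a dbt invocation that only builds models changed relative to
    the manifest saved in `state_dir` (for example, from the last production run).

    `state:modified` selects changed nodes; the trailing `+` also selects
    their downstream dependents. `--defer` resolves references to unbuilt
    upstream models against the environment described by that saved state.
    """
    selector = "state:modified+" if include_children else "state:modified"
    return [
        "dbt", "build",
        "--select", selector,
        "--defer",
        "--state", state_dir,
    ]


if __name__ == "__main__":
    # Hypothetical directory holding the production manifest.json.
    cmd = slim_ci_command("prod-run-artifacts")
    # Print the command; a CI script could pass `cmd` to subprocess.run instead.
    print(shlex.join(cmd))
```

Building only modified models and their children is what keeps the CI run from rebuilding unnecessary tables, while deferral avoids having to rebuild unchanged upstream models in the CI schema.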

Speaker: Leo Folsom, Solutions Engineer, Datafold


The more, the merrier: Managing a dynamic, expanding, self-service dbt project - Coalesce 2023

2023-10-24 · dbt Coalesce 2023 Watch

video

by Alice Leach (Whatnot)

dbt

In the past year at Whatnot, the team has watched their dbt project grow from three developers and fewer than 50 models to 50 developers and more than 1000 models. In this talk, Alice Leach, a data engineer at Whatnot, discusses the knowledge the team has acquired during this time and the solutions they have built to address the challenges that come with scaling projects. This is broken down into three “G”s: guard rails (robust CI/CD processes, model monitoring and automated clean up); guidelines (modular workspaces, macros and documentation); and gadgets (dbt code generation and the operations and hooks in place to allow the dbt project to interface with other tools).

Speaker: Alice Leach, Data Engineer, Whatnot


Defining A Strategy For Your Data Products

2023-10-23 · Data Engineering Podcast Listen

podcast_episode

by Ranjith Raghunath , Tobias Macey

AI/ML Analytics BI Cloud Computing Data Engineering Data Management Data Quality Data Science Datafold dbt Modern Data Stack Neo4j +3 more

Summary

The primary application of data has moved beyond analytics. With the broader audience comes the need to present data in a more approachable format. This has led to the broad adoption of data products being the delivery mechanism for information. In this episode Ranjith Raghunath shares his thoughts on how to build a strategy for the development, delivery, and evolution of data products.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management

As more people start using AI for projects, two things are clear: It’s a rapidly advancing field, but it’s tough to navigate. How can you get the best results for your use case? Instead of being subjected to a bunch of buzzword bingo, hear directly from pioneers in the developer and data science space on how they use graph tech to build AI-powered apps. Attend the dev and ML talks at NODES 2023, a free online conference on October 26 featuring some of the brightest minds in tech. Check out the agenda and register today at Neo4j.com/NODES.

Your host is Tobias Macey and today I'm interviewing Ranjith Raghunath about tactical elements of a data product strategy

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what is encompassed by the idea of a data product strategy?

Which roles in an organization need to be involved in the planning and implementation of that strategy?

Order of operations:

strategy -> platform design -> implementation/adoption
platform implementation -> product strategy -> interface development

Managing grain of data in products
Team organization to support product development/deployment
Customer communications: what questions to ask? Requirements gathering, helping to understand "the art of the possible"
What are the most interesting, innovative, or unexpected ways that you have seen organizations approach data product strategies?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on

Reducing The Barrier To Entry For Building Stream Processing Applications With Decodable

2023-10-15 · Data Engineering Podcast Listen

podcast_episode

by Eric Sammer (Decodable) , Tobias Macey

AI/ML Airbyte Analytics Flink API Kinesis BI Cloud Computing Data Engineering Data Management Data Quality Data Science +21 more

Summary

Building streaming applications has gotten substantially easier over the past several years. Despite this, it is still operationally challenging to deploy and maintain your own stream processing infrastructure. Decodable was built with a mission of eliminating all of the painful aspects of developing and deploying stream processing systems for engineering teams. In this episode Eric Sammer discusses why more companies are including real-time capabilities in their products and the ways that Decodable makes it faster and easier.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Your host is Tobias Macey and today I'm interviewing Eric Sammer about starting your stream processing journey with Decodable.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Decodable is and the story behind it?

What are the notable changes to the Decodable platform since we last spoke? (October 2021)
What are the industry shifts that have influenced the product direction?

What are the problems that customers are trying to solve when they come to Decodable?
When you launched your focus was on SQL transformations of streaming data. What was the process for adding full Java support in addition to SQL?
What are the developer experience challenges that are particular to working with streaming data?

How have you worked to address that in the Decodable platform and interfaces?

As you evolve the technical and product direction, what is your heuristic for balancing the unification of interfaces and system integration against the ability to swap different components or interfaces as new technologies are introduced?
What are the most interesting, innovative, or unexpected ways that you have seen Decodable used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Decodable?
When is Decodable the wrong choice?
What do you have planned for the future of Decodable?

Contact Info

esammer on GitHub LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
To help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

Decodable (Podcast Episode)
Understanding the Apache Flink Journey
Flink (Podcast Episode)
Debezium (Podcast Episode)
Kafka
Redpanda (Podcast Episode)
Kinesis
PostgreSQL (Podcast Episode)
Snowflake (Podcast Episode)
Databricks
Startree
Pinot (Podcast Episode)
Rockset (Podcast Episode)
Druid
InfluxDB
Samza
Storm
Pulsar (Podcast Episode)
ksqlDB (Podcast Episode)
dbt
GitHub Actions
Airbyte
Singer
Splunk
Outbox Pattern

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

Neo4J:

NODES 2023 is a free online conference focused on graph-driven innovations with content for all skill levels. Its 24 hours are packed with 90 interactive technical sessions from top developers and data scientists across the world covering a broad range of topics and use cases. The event tracks:

Intelligent Applications: APIs, Libraries, and Frameworks. Tools and best practices for creating graph-powered applications and APIs with any software stack and programming language, including Java, Python, and JavaScript
Machine Learning and AI. How graph technology provides context for your data and enhances the accuracy of your AI and ML projects (e.g.: graph neural networks, responsible AI)
Visualization: Tools, Techniques, and Best Practices. Techniques and tools for exploring hidden and unknown patterns in your data and presenting complex relationships (knowledge graphs, ethical data practices, and data representation)

Don't miss your chance to hear about the latest graph-powered implementations and best practices for free on October 26 at NODES 2023. Go to Neo4j.com/NODES today to see the full agenda and register!

Rudderstack:

Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack

Materialize:

You shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date.

That is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable, all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing.

Go to materialize.com today and get 2 weeks free!

Datafold:

This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare…

Using Data To Illuminate The Intentionally Opaque Insurance Industry

2023-10-09 · Data Engineering Podcast Listen

podcast_episode

by Max Cho (CoverageCat) , Tobias Macey

AI/ML Analytics BI Cloud Computing Data Collection Data Engineering Data Management Data Quality Data Science Datafold dbt Modern Data Stack +4 more

Summary

The insurance industry is notoriously opaque and hard to navigate. Max Cho found that fact frustrating enough that he decided to build a business of making policy selection more navigable. In this episode he shares his journey of data collection and analysis and the challenges of automating an intentionally manual industry.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Your host is Tobias Macey and today I'm interviewing Max Cho about the wild world of insurance companies and the challenges of collecting quality data for this opaque industry.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what CoverageCat is and the story behind it?
What are the different sources of data that you work with?

What are the most challenging aspects of collecting that data?
Can you describe the formats and characteristics (3 Vs) of that data?

What are some of the ways that the operational model of insurance companies has contributed to its opacity as an industry from a data perspective?
Can you describe how you have architected your data platform?

How have the design and goals changed since you first started working on it?
What are you optimizing for in your selection and implementation process?

What are the sharp edges/weak points that you worry about in your existing data flows?

How do you guard against those flaws in your day-to-day operations?

What are the

