Maxime Beauchemin

Blurring Lines: Data, AI, and the New Playbook for Team Velocity

2025-11-24 · Data Engineering Podcast Listen

podcast_episode

with Maxime Beauchemin (Preset), Tobias Macey

AI/ML Cloud Computing Data Engineering Data Management Data Quality Datafold

Summary In this crossover episode, Max Beauchemin explores how multiplayer, multi‑agent engineering is transforming the way individuals and teams build data and AI systems. He digs into the shifting boundary between data and AI engineering, the rise of “context as code,” and how just‑in‑time retrieval via MCP and CLIs lets agents gather what they need without bloating context windows. Max shares hard‑won practices from going “AI‑first” for most tasks, where humans focus on orchestration and taste, and the new bottlenecks that appear — code review, QA, async coordination — when execution accelerates 2–10x. He also dives deep into Agor, his open‑source agent orchestration platform: a spatial, multiplayer workspace that manages Git worktrees and live dev environments, templatizes prompts by workflow zones, supports session forking and sub‑sessions, and exposes an internal MCP so agents can schedule, monitor, and even coordinate other agents.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.

Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.

Your host is Tobias Macey and today I'm interviewing Maxime Beauchemin about the impact of multi-player multi-agent engineering on individual and team velocity for building better data systems.

Interview

Introduction

How did you get involved in the area of data management?

Can you start by giving an overview of the types of work that you are relying on AI development agents for?

As you bring agents into the mix for software engineering, what are the bottlenecks that start to show up?

In my own experience there are a finite number of agents that I can manage in parallel. How does Agor help to increase that limit?

How does making multi-agent management a multi-player experience change the dynamics of how you apply agentic engineering workflows?

Contact Info

LinkedIn

Links

Agor, Apache Airflow, Apache Superset, Preset, Claude Code, Codex, Playwright MCP, Tmux, Git Worktrees, Opencode.ai, GitHub Codespaces, Ona

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Why Data Teams Keep Reinventing the Wheel: The Struggle for Code Reuse in the Data Transformation La

2025-07-01 · Airflow Summit 2025

session

Modern Data Stack

Data teams have a bad habit: reinventing the wheel. Despite the explosion of open-source tooling, best practices, and managed services, teams still find themselves building bespoke data platforms from scratch—often hitting the same roadblocks as those before them. Why does this keep happening, and more importantly, how can we break the cycle? In this talk, we’ll unpack the key reasons data teams default to building rather than adopting, from technical nuances to cultural and organizational dynamics. We’ll discuss why fragmentation in the modern data stack, the pressure to “own” infrastructure, and the allure of in-house solutions make this problem so persistent. Using real-world examples, we’ll explore strategies to help data teams focus on delivering business value rather than endlessly rebuilding foundational infrastructure. Whether you’re an engineer, a data leader, or an open-source contributor, this session will provide insights into navigating the build-vs-buy tradeoff more effectively.

An Exploration Of The Impediments To Reusable Data Pipelines

2024-12-08 · Data Engineering Podcast Listen

podcast_episode

with Maxime Beauchemin (Preset), Tobias Macey

Activity Schema AI/ML Data Engineering Data Management Data Modelling Datafold

Summary In this episode of the Data Engineering Podcast the inimitable Max Beauchemin talks about reusability in data pipelines. The conversation explores the "write everything twice" problem, where similar pipelines are built without code reuse, and discusses the challenges of managing different SQL dialects and relational databases. Max also touches on the evolving role of data engineers, drawing parallels with front-end engineering, and suggests that generative AI could facilitate knowledge capture and distribution in data engineering. He encourages the community to share reference implementations and templates to foster collaboration and innovation, and expresses hopes for a future where code reuse becomes more prevalent.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

Your host is Tobias Macey and today I'm joined again by Max Beauchemin to talk about the challenges of reusability in data pipelines.

Interview

Introduction

How did you get involved in the area of data management?

Can you start by sharing your current thesis on the opportunities and shortcomings of code and component reusability in the data context?

What are some ways that you think about what constitutes a "component" in this context?

The data ecosystem has arguably grown more varied and nuanced in recent years. At the same time, the number and maturity of tools has grown. What is your view on the current trend in productivity for data teams and practitioners?

What do you see as the core impediments to building more reusable and general-purpose solutions in data engineering?

How can we balance the actual needs of data consumers against their requests (whether well- or un-informed) to help increase our ability to better design our workflows for reuse?

In data engineering there are two broad approaches: code-focused or SQL-focused pipelines. In principle one would think that code-focused environments would have better composability. What are you seeing as the realities in your personal experience and what you hear from other teams?

When it comes to SQL dialects, dbt offers the option of Jinja macros, whereas SDF and SQLMesh offer automatic translation. There are also tools like PRQL and Malloy that aim to abstract away the underlying SQL. What are the tradeoffs across those options that help or hinder the portability of transformation logic?

Which layers of the data stack/steps in the data journey do you see the greatest opportunity for improving the creation of more broadly usable abstractions/reusable elements?

low/no code systems for code reuse; impact of LLMs on reusability/composition; impact of background on industry practices (e.g. DBAs, sysadmins, analysts vs. SWE, etc.); polymorphic data models (e.g. activity schema)

What are the most interesting, innovative, or unexpected ways that you have seen teams address composability and reusability of data components?

What are the most interesting, unexpected, or challenging lessons that you have learned while working on data-oriented tools and utilities?

What are your hopes and predictions for sharing of code and logic in the future of data engineering?

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links

Max's Blog Post, Airflow, Superset, Tableau, Looker, PowerBI, Cohort Analysis, NextJS, Airbyte (Podcast Episode), Fivetran (Podcast Episode), Segment, dbt, SQLMesh (Podcast Episode), Spark, LAMP Stack, PHP, Relational Algebra, Knowledge Graph, Python Marshmallow, Data Warehouse Lifecycle Toolkit (affiliate link), Entity Centric Data Modeling Blog Post, Amplitude, OSACon presentation, ol-data-platform (Tobias' team's data platform code)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

AI Reality Checkpoint: The Good, the Bad, and the Overhyped

2024-07-01 · Airflow Summit 2024

session

AI/ML LLM Marketing

In the past 18 months, artificial intelligence has not just entered our workspaces; it has taken over. As we stand at the crossroads of innovation and automation, it’s time for a candid reflection on how AI has reshaped our professional lives, and to talk about where it’s been a game changer, where it’s falling short, and what’s about to shift dramatically in the short term. Since the release of ChatGPT in late 2022, I’ve developed a “first-reflex” to augment and accelerate nearly every task with AI. As a founder and CEO, this spans a wide array of responsibilities from fundraising, internal communications, legal, operations, product marketing, finance, and beyond. In this keynote, I’ll cover diverse use cases across all areas of business, offering a comprehensive view of AI’s impact. Join me as I sort through this new reality and try to forecast the future of AI in our work. It’s time for a radical checkpoint. Everything’s changing fast. In some areas, AI has been a slam dunk; in others, it’s been frustrating as hell. And once a few key challenges are tackled, we’re on the cusp of a tsunami of transformation. Three major milestones are right around the corner: top-human-level reasoning, solid memory accumulation and recall, and proper executive skills. How is this going to affect all of us?

Reduce Friction In Your Business Analytics Through Entity Centric Data Modeling

2023-07-09 · Data Engineering Podcast Listen

podcast_episode

with Maxime Beauchemin (Preset), Tobias Macey

Activity Schema AI/ML Airflow Analytics Data Engineering Data Management

Summary

For business analytics the way that you model the data in your warehouse has a lasting impact on what types of questions can be answered quickly and easily. The major strategies in use today were created decades ago when the software and hardware for warehouse databases were far more constrained. In this episode Maxime Beauchemin of Airflow and Superset fame shares his vision for the entity-centric data model and how you can incorporate it into your own warehouse design.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack.

Your host is Tobias Macey and today I'm interviewing Max Beauchemin about the concept of entity-centric data modeling for analytical use cases.

Interview

Introduction

How did you get involved in the area of data management?

Can you describe what entity-centric modeling (ECM) is and the story behind it?

How does it compare to dimensional modeling strategies?

What are some of the other competing methods?

Comparison to activity schema

What impact does this have on ML teams? (e.g. feature engineering)

What role does the tooling of a team have in the ways that they end up thinking about modeling? (e.g. dbt vs. informatica vs. ETL scripts, etc.)

What is the impact of the underlying compute engine on the modeling strategies used?

What are some examples of data sources or problem domains for which this approach is well suited?

What are some cases where entity centric modeling techniques might be counterproductive?

What are the ways that the benefits of ECM manifest in use cases that are down-stream from the warehouse?

What are some concrete tactical steps that teams should be thinking about to implement a workable domain model using entity-centric principles?

How does this work across business domains within a given organization (especially at "enterprise" scale)?

What are the most interesting, innovative, or unexpected ways that you have seen ECM used?

What are the most interesting, unexpected, or challenging lessons that you have learned while working on ECM?

When is ECM the wrong choice?

What are your predictions for the future direction/adoption of ECM or other modeling techniques?

Contact Info

mistercrunch on GitHub, LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

Entity Centric Modeling Blog Post

Max's Previous Appearances: Defining Data Engineering with Maxime Beauchemin; Self Service Data Exploration And Dashboarding With Superset; Exploring The Evolving Role Of Data Engineers; Alumni Of AirBnB's Early Years Reflect On What They Learned About Building Data Driven Organizations

Apache Airflow, Apache Superset, Preset, Ubisoft, Ralph Kimball, The Rise Of The Data Engineer, The Downfall Of The Data Engineer, The Rise Of The Data Scientist, Dimensional Data Modeling, Star Schema, Databas

Change Management Done Right Across Environments and Tools: DAGs, datasets and visualizations

2023-07-01 · Airflow Summit 2023

session

CI/CD

Change management in data teams can be challenging, to say the least. Not only do you have to evolve your data pipelines, data structures, and the datasets themselves across environments, you also have to keep data exploration and visualization tools in sync. In this talk, we’ll explore how best to do this across environments (i.e., dev, staging, and prod), covering how CI/CD can help, how to implement good data ops practices, and how to crank up the level of rigor where it matters. We’ll also talk about rigor-vs-speed tradeoffs, since clearly not all data pipelines are born equal, and about how to evolve the level of rigor over time in the places where it matters most.

The tale of a startup's data journey and its growing need for orchestration

2022-07-01 · Airflow Summit 2022

session

Airflow Analytics BigQuery dbt Fivetran Modern Data Stack

This talk tells the story of how we have approached data and analytics as a startup at Preset and how the need for a data orchestrator grew over time. Our stack is (loosely) Fivetran/Segment/dbt/BigQuery/Hightouch, and we finally got to a place where we suffer quite a bit from not having an orchestrator and are bringing in Airflow to address our orchestration needs. This talk is about how startups approach solving data challenges, the shifting role of the orchestrator in the modern data stack, and the growing need for an orchestrator as your data platform becomes more complex.

Exploring The Evolving Role Of Data Engineers

2021-12-27 · Data Engineering Podcast Listen

podcast_episode

with Maxime Beauchemin (Preset), Tobias Macey

BI Data Engineering Data Management DWH ETL/ELT Kubernetes

Summary Data Engineering is still a relatively new field that is going through a continued evolution as new technologies are introduced and new requirements are understood. In this episode Maxime Beauchemin returns to revisit what it means to be a data engineer and how the role has changed over the past 5 years.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.

Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.

Your host is Tobias Macey and today I’m interviewing Maxime Beauchemin about the impacts that the evolution of the modern data stack has had on the role and responsibilities of data engineers.

Interview

Introduction

How did you get involved in the area of data management?

What is your current working definition of a data engineer?

How has that definition changed since your article on the "rise of the data engineer" and episode 3 of this show about "defining data engineering"?

How has the growing availability of data infrastructure services shifted foundational skills and knowledge that are necessary to be effective?

How should a new/aspiring data engineer focus their time and energy to become effective?

One of the core themes in this current spate of technologies is "democratization of data". In your post on the downfall of the data engineer you called out the pressure on data engineers to maintain control with so many contributors with varying levels of skill and understanding. How well is the "modern data stack" balancing these concerns?

An interesting impact of the growing usage of data is the constrained availability of data engineers. How do you see the effects of the job market on driving evolution of tooling and services?

With the explosion of tools and services for working with data, a new problem has evolved of which ones to use for a given organization. What do you see as

Operating contexts: patterns around defining how a DAG should behave in dev, staging, prod & beyond

2021-07-01 · Airflow Summit 2021

session

As people define and publish a DAG, it can be really useful to make it clear how this DAG should behave under different “operating contexts”. Common operating contexts may match your different environments (dev / staging / prod) and/or match your operating needs (quick run, full backfill, test run, …). Over the years, patterns have emerged among workflow authors, teams, and organizations, but little has been shared about how to approach this. In this talk, we’ll discuss what an “operating context” is, why it’s useful, and describe common patterns and best practices around this topic.
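To make the idea concrete, here is a minimal sketch of one way an "operating context" could be modeled. It is not taken from the talk and does not use any Airflow API: the `OperatingContext` class, its fields, the `OPERATING_CONTEXT` environment variable, and the schema names are all hypothetical, chosen only to show a single pipeline definition behaving differently per context.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingContext:
    """Settings a DAG might vary per context (all field names are illustrative)."""
    name: str
    target_schema: str      # where output tables are written
    sample_fraction: float  # share of source rows to process
    send_alerts: bool       # whether failures notify anyone


# Hypothetical contexts mixing environments (dev / staging / prod) with an
# operating need (a quick run on a very small sample).
CONTEXTS = {
    "dev": OperatingContext("dev", "scratch", 0.01, False),
    "staging": OperatingContext("staging", "staging", 0.10, False),
    "prod": OperatingContext("prod", "analytics", 1.0, True),
    "quick_run": OperatingContext("quick_run", "scratch", 0.001, False),
}


def resolve_context(env=None):
    """Pick the context named by OPERATING_CONTEXT, defaulting to 'dev'."""
    env = os.environ if env is None else env
    name = env.get("OPERATING_CONTEXT", "dev")
    if name not in CONTEXTS:
        raise ValueError(f"unknown operating context: {name!r}")
    return CONTEXTS[name]


def load_task_params(ctx, table):
    """Derive the parameters one load task would use under a given context."""
    return {
        "destination": f"{ctx.target_schema}.{table}",
        "sample_fraction": ctx.sample_fraction,
        "on_failure": "page_on_call" if ctx.send_alerts else "log_only",
    }


if __name__ == "__main__":
    ctx = resolve_context({"OPERATING_CONTEXT": "prod"})
    print(load_task_params(ctx, "orders"))
```

Keeping every context-dependent decision in one small object, rather than scattering `if env == "prod"` checks through task code, is the design choice this sketch is meant to illustrate; a real DAG would pass the resolved context into its task definitions.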

The Grand Vision And Present Reality of DataOps

2021-05-04 · Data Engineering Podcast Listen

podcast_episode

with Kevin Stumpf (Tecton), Maxime Beauchemin (Preset), Tobias Macey, Lior Gavish (Monte Carlo)

Airflow BI BigQuery CI/CD Cloud Computing Data Engineering

Summary The Data industry is changing rapidly, and one of the most active areas of growth is automation of data workflows. Taking cues from the DevOps movement of the past decade data professionals are orienting around the concept of DataOps. More than just a collection of tools, there are a number of organizational and conceptual changes that a proper DataOps approach depends on. In this episode Kevin Stumpf, CTO of Tecton, Maxime Beauchemin, CEO of Preset, and Lior Gavish, CTO of Monte Carlo, discuss the grand vision and present realities of DataOps. They explain how to think about your data systems in a holistic and maintainable fashion, the security challenges that threaten to derail your efforts, and the power of using metadata as the foundation of everything that you do. If you are wondering how to get control of your data platforms and bring all of your stakeholders onto the same page then this conversation is for you.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.

RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.

Your host is Tobias Macey and today I’m interviewing Max Beauchemin, Lior Gavish, and Kevin Stumpf about the real world challenges of embracing DataOps practices and systems, and how to keep things secure as you scale.

Interview

Introduction

How did you get involved in the area of data management?

Before we get started, can you each give your definition of what "DataOps" means to you?

How does this differ from "business as usual" in the data industry? What are some of the things that DataOps isn’t (despite what marketers might say)?

What are the biggest difficulties that you have faced in going from concept to production with a workflow or system intended to power self-serve access to other membe

Self Service Data Exploration And Dashboarding With Superset

2021-04-27 · Data Engineering Podcast Listen

podcast_episode

with Maxime Beauchemin (Preset), Tobias Macey

Airflow Analytics BI BigQuery CI/CD Cloud Computing

Summary The reason for collecting, cleaning, and organizing data is to make it usable by the organization. One of the most common and widely used methods of access is through a business intelligence dashboard. Superset is an open source option that has been gaining popularity due to its flexibility and extensible feature set. In this episode Maxime Beauchemin discusses how data engineers can use Superset to provide self service access to data and deliver analytics. He digs into how it integrates with your data stack, how you can extend it to fit your use case, and why open source systems are a good choice for your business intelligence. If you haven’t already tried out Superset then this conversation is well worth your time. Give it a listen and then take it for a test drive today.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.

RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.

Your host is Tobias Macey and today I’m interviewing Max Beauchemin about Superset, an open source platform for data exploration, dashboards, and business intelligence.

Interview

Introduction

How did you get involved in the area of data management?

Can you start by describing what Superset is?

Superset is becoming part of the reference architecture for a modern data stack. What are the factors that have contributed to its popularity over other tools such as Redash, Metabase, Looker, etc.?

Where do dashboarding and exploration tools like Superset fit in the responsibilities and workflow of a data engineer?

What are some of the challenges that Superset faces in being performant when working with large data sources?

Which data sources have you found to be the most challenging to work with?

What are some anti-patterns that users of Superset mig

Advanced Apache Superset for Data Engineers

2020-07-01 · Airflow Summit 2020

session

Superset

Superset is the leading open source data exploration and visualization platform. In this talk, we’ll present Superset largely through a live demo of the product, with a deeper dive into the advanced topics most relevant to Data Engineers. This is a sponsored talk, presented by Preset.

Improving Airflow's user experience

2020-07-01 · Airflow Summit 2020

session

with Ry Walker (Astronomer) , Maxime Beauchemin (Preset) , Viraj Parekh

Airflow Astronomer Cyber Security

Astronomer is focused on improving Airflow’s user experience through the entire lifecycle: authoring and testing DAGs, building containers and deploying the DAGs, and running and monitoring both the DAGs and the infrastructure they operate within, with an eye towards increased security and governance as well. In this talk we walk you through some current UX challenges, give an overview of how the Astronomer platform addresses the major challenges, and provide a sneak peek at the things we’re working on in the coming months to improve Airflow’s user experience. This is a sponsored talk, presented by Astronomer.

Keynote: Airflow then and now

2020-07-01 · Airflow Summit 2020

session

with Bolke de Bruin , Maxime Beauchemin (Preset)

Airflow

Bolke and Maxime tell us about the past and present of Apache Airflow.

Defining Data Engineering with Maxime Beauchemin - Episode 3

2017-03-05 · Data Engineering Podcast Listen

podcast_episode

with Maxime Beauchemin (Preset) , Tobias Macey

Airflow Beam Data Engineering Data Management Data Modelling Datadog

Summary

What exactly is data engineering? How has it evolved in recent years and where is it going? How do you get started in the field? In this episode, Maxime Beauchemin joins me to discuss these questions and more.

Transcript provided by CastSource

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers. Your host is Tobias Macey and today I’m interviewing Maxime Beauchemin.

Questions

Introduction

How did you get involved in the field of data engineering?

How do you define data engineering and how has that changed in recent years?

Do you think that the DevOps movement over the past few years has had any impact on the discipline of data engineering? If so, what kinds of cross-over have you seen?

For someone who wants to get started in the field of data engineering what are some of the necessary skills?

What do you see as the biggest challenges facing data engineers currently?

At what scale does it become necessary to differentiate between someone who does data engineering vs data infrastructure and what are the differences in terms of skill set and problem domain?

How much analytical knowledge is necessary for a typical data engineer?

What are some of the most important considerations when establishing new data sources to ensure that the resulting information is of sufficient quality?

You have commented on the fact that data engineering borrows a number of elements from software engineering. Where does the concept of unit testing fit in data management and what are some of the most effective patterns for implementing that practice?

How has the work done by data engineers and managers of data infrastructure bled back into mainstream software and systems engineering in terms of tools and best practices?

How do you see the role of data engineers evolving in the next few years?

Keep In Touch

@mistercrunch on Twitter

mistercrunch on GitHub

Medium

Links

Datadog

Airflow

The Rise of the Data Engineer

Druid.io

Luigi

Apache Beam

Samza

Hive

Data Modeling

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Introducing The Show

2017-01-08 · Data Engineering Podcast Listen

podcast_episode

with Maxime Beauchemin (Preset) , Tobias Macey

Data Engineering Data Management DevOps

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes or Google Play Music, share it on social media, and tell your friends and co-workers. I’m your host, Tobias Macey, and today I’m speaking with Maxime Beauchemin about what it means to be a data engineer.

Interview

Who am I: systems administrator and software engineer, now DevOps, focus on automation; host of Podcast.init

How did I get involved in data management

Why am I starting a podcast about Data Engineering: interesting area with a lot of activity; not currently any shows focused on data engineering

What kinds of topics do I want to cover: data stores, pipelines, tooling, automation, monitoring, testing, best practices, common challenges, defining the role/job hunting, relationship with data engineers/data analysts

Get in touch and subscribe: website, newsletter, Twitter, email

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

talk-data.com
