Ramp's $8 Billion Data Strategy (W/ Ian Macomber and Ryan Delgado)

2023-08-11 · The Analytics Engineering Podcast Listen

podcast_episode

by Tristan Handy (dbt Labs) , Ian Macomber (Ramp) , Julia Schottenstein (dbt labs) , Ryan Delgado (Ramp)

Analytics Analytics Engineering Data Engineering Data Science

Ian Macomber, head of analytics engineering and data science at Ramp and formerly the VP of analytics and data engineering at Drizly, and Ryan Delgado, a staff software engineer at Ramp, have played pivotal roles in establishing Ramp's data team from the ground up and are spearheading the development of their comprehensive roadmap. In this conversation with Tristan and Julia, Ian and Ryan share insights on how Ramp's data team transformed unstructured data from contracts into valuable insights to enable faster decision-making. The $8 billion company values speed and empowers teams to build, ship, and measure products quickly. Ian and Ryan also talked about their approach to adopting new tech and elevating data as an equal player alongside product engineering and design. For full show notes and to read 6+ years of back issues of the podcast's companion newsletter, head to https://roundup.getdbt.com. The Analytics Engineering Podcast is sponsored by dbt Labs.

Quantifying The Return On Investment For Your Data Team

2023-08-06 · Data Engineering Podcast Listen

podcast_episode

by Barr Moses (Monte Carlo) , Anna Filippova (dbt Labs) , Tobias Macey

AI/ML Analytics Data Engineering Data Management GenAI Modern Data Stack Monte Carlo Python SaaS Snowflake SQL

Summary

As businesses increasingly invest in technology and talent focused on data engineering and analytics, they want to know whether they are benefiting. So how do you calculate the return on investment for data? In this episode Barr Moses and Anna Filippova explore that question and provide useful exercises to start answering that in your company.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack Your host is Tobias Macey and today I'm interviewing Barr Moses and Anna Filippova about how and whether to measure the ROI of your data team

Interview

Introduction How did you get involved in the area of data management? What are the typical motivations for measuring and tracking the ROI for a data team?

Who is responsible for collecting that information? How is that information used and by whom?

What are some of the downsides/risks of tracking this metric? (law of unintended consequences) What are the inputs to the number that constitutes the "investment"? infrastructure, payroll of employees on team, time spent working with other teams? What are the aspects of data work and its impact on the business that complicate a calculation of the "return" that is generated? How should teams think about measuring data team ROI? What are some concrete ROI metrics data teams can use?

What level of detail is useful? What dimensions should be used for segmenting the calculations?

How can visibility into this ROI metric be best used to inform the priorities and project scopes of the team? With so many tools in the modern data stack today, what is the role of technology in helping drive or measure this impact? How do your respective solutions, Monte Carlo and dbt, help teams measure and scale data value? With generative AI on the upswing of the hype cycle, what are the impacts that you see it having on data teams?

What are the unrealistic expectations that it will produce? How can it speed up time to delivery?

What are the most interesting, innovative, or unexpected ways that you have seen data team ROI calculated and/or used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on measuring the ROI of data teams? When is measuring ROI the wrong choice?

Contact Info

Barr

Anna

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

Monte Carlo

Podcast Episode

dbt

Podcast Episode

JetBlue Snowflake Con Presentation Generative AI Large Language Models

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Sponsored By: Rudderstack:

Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guessw

Data Warehousing using Fivetran, dbt and DBSQL

2023-08-03 · Databricks DATA + AI Summit 2023 Watch

video

AI/ML Data Lakehouse Databricks DWH Fivetran HTML SQL

In this video you will learn how to use Fivetran to ingest data from Salesforce into your Lakehouse. After the data has been ingested, you will then learn how you can transform your data using dbt. Then we will use Databricks SQL to query, visualize and govern your data. Lastly, we will show you how you can use AI functions in Databricks SQL to call language learning models.

Read more about Databricks SQL https://docs.databricks.com/en/sql/index.html#what-is-databricks-sql

Strategies For A Successful Data Platform Migration

2023-07-31 · Data Engineering Podcast Listen

podcast_episode

by Rob Goretsky , Gleb Mezhanskiy (Datafold) , Tobias Macey

AI/ML Airflow Analytics Amazon EMR BigQuery Dagster Data Engineering Data Management Data Science Datafold ELK GitHub +9 more

Summary

All software systems are in a constant state of evolution. This makes it impossible to select a truly future-proof technology stack for your data platform, making an eventual migration inevitable. In this episode Gleb Mezhanskiy and Rob Goretsky share their experiences leading various data platform migrations, and the hard-won lessons that they learned so that you don't have to.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack Modern data teams are using Hex to 10x their data impact. Hex combines a notebook style UI with an interactive report builder. This allows data teams to both dive deep to find insights and then share their work in an easy-to-read format to the whole org. In Hex you can use SQL, Python, R, and no-code visualization together to explore, transform, and model data. Hex also has AI built directly into the workflow to help you generate, edit, explain and document your code. The best data teams in the world such as the ones at Notion, AngelList, and Anthropic use Hex for ad hoc investigations, creating machine learning models, and building operational dashboards for the rest of their company. Hex makes it easy for data analysts and data scientists to collaborate together and produce work that has an impact. Make your data team unstoppable with Hex. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial for your team! Your host is Tobias Macey and today I'm interviewing Gleb Mezhanskiy and Rob Goretsky about when and how to think about migrating your data stack

Interview

Introduction How did you get involved in the area of data management? A migration can be anything from a minor task to a major undertaking. Can you start by describing what constitutes a migration for the purposes of this conversation? Is it possible to completely avoid having to invest in a migration? What are the signals that point to the need for a migration?

What are some of the sources of cost that need to be accounted for when considering a migration? (both in terms of doing one, and the costs of not doing one) What are some signals that a migration is not the right solution for a perceived problem?

Once the decision has been made that a migration is necessary, what are the questions that the team should be asking to determine the technologies to move to and the sequencing of execution? What are the preceding tasks that should be completed before starting the migration to ensure there is no breakage downstream of the changing component(s)? What are some of the ways that a migration effort might fail? What are the major pitfalls that teams need to be aware of as they work through a data platform migration? What are the opportunities for automation during the migration process? What are the most interesting, innovative, or unexpected ways that you have seen teams approach a platform migration? What are the most interesting, unexpected, or challenging lessons that you have learned while working on data platform migrations? What are some ways that the technologies and patterns that we use can be evolved to reduce the cost/impact/need for migraitons?

Contact Info

Gleb

LinkedIn @glebmm on Twitter

Rob

LinkedIn RobGoretsky on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

Datafold

Podcast Episode

Informatica Airflow Snowflake

Podcast Episode

Redshift Eventbrite Teradata BigQuery Trino EMR == Elastic Map-Reduce Shadow IT

Podcast Episode

Mode Analytics Looker Sunk Cost Fallacy data-diff

Podcast Episode

SQLGlot Dagster dbt

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Sponsored By: Hex: Hex Tech Logo

Hex is a collaborative workspace for data science and analytics. A single place for teams to explore, transform, and visualize data into beautiful interactive reports. Use SQL, Python, R, no-code and AI to find and share insights across your organization. Empower everyone in an organization to make an impact with data. Sign up today at [dataengineeringpodcast.com/hex](https://www.dataengineeringpodcast.com/hex} and get 30 days free!Rudderstack:

Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstackSupport Data Engineering Podcast

dbt Labs on dbt (w/ Daniel Le)

2023-07-28 · The Analytics Engineering Podcast Listen

podcast_episode

by Daniel Le (dbt Labs) , Julia Schottenstein (dbt labs)

Analytics Analytics Engineering Cloud Computing SaaS

Daniel Le is the CFO at dbt Labs where he has built multiple teams. He is also the former head of FP&A and operations at Zoom, and he helped scale FP&A as the former finance director at Okta. In this conversation with Julia, Daniel shares his view as CFO on the challenges SaaS companies face and the importance of finance teams creating a holistic view of their business. Daniel gives advice to data leaders about how they can automate business processes with dbt Cloud and use self-service analytics to automate revenue recognition, generate consistent headcount analytics, and more to impact their organization. Read more about Daniel's story here. For full show notes and to read 6+ years of back issues of the podcast's companion newsletter, head to https://roundup.getdbt.com. The Analytics Engineering Podcast is sponsored by dbt Labs.

Cross-Platform Data Lineage with OpenLineage

2023-07-28 · Databricks DATA + AI Summit 2023 Watch

video

by Willy Lulciuc (WeWork) , Julien Le Dem (Astronomer)

AI/ML Airflow Analytics Flink Data Quality Databricks Spark

There are more data tools available than ever before, and it is easier to build a pipeline than it has ever been. These tools and advancements have created an explosion of innovation, resulting in data within today's organizations becoming increasingly distributed and can't be contained within a single brain, a single team, or a single platform. Data lineage can help by tracing the relationships between datasets and providing a map of your entire data universe.

OpenLineage provides a standard for lineage collection that spans multiple platforms, including Apache Airflow, Apache Spark™, Flink®, and dbt. This empowers teams to diagnose and address widespread data quality and efficiency issues in real time. In this session, we will show how to trace data lineage across Apache Spark and Apache Airflow. There will be a walk-through of the OpenLineage architecture and a live demo of a running pipeline with real-time data lineage.

Talk by: Julien Le Dem,Willy Lulciuc

Here’s more to explore: Data, Analytics, and AI Governance: https://dbricks.co/44gu3YU

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Five Things You Didn't Know You Could Do with Databricks Workflows

2023-07-27 · Databricks DATA + AI Summit 2023 Watch

video

by Prashanth Babu (Databricks)

AI/ML AWS Azure Cloud Computing Data Lakehouse Databricks Delta GCP Python Spark SQL

Databricks workflows has come a long way since the initial days of orchestrating simple notebooks and jar/wheel files. Now we can orchestrate multi-task jobs and create a chain of tasks with lineage and DAG with either fan-in or fan-out among multiple other patterns or even run another Databricks job directly inside another job.

Databricks workflows takes its tag: “orchestrate anything anywhere” pretty seriously and is a truly fully-managed, cloud-native orchestrator to orchestrate diverse workloads like Delta Live Tables, SQL, Notebooks, Jars, Python Wheels, dbt, SQL, Apache Spark™, ML pipelines with excellent monitoring, alerting and observability capabilities as well. Basically, it is a one-stop product for all orchestration needs for an efficient lakehouse. And what is even better is, it gives full flexibility of running your jobs in a cloud-agnostic and cloud-independent way and is available across AWS, Azure and GCP.

In this session, we will discuss and deep dive on some of the very interesting features and will showcase end-to-end demos of the features which will allow you to take full advantage of Databricks workflows for orchestrating the lakehouse.

Talk by: Prashanth Babu

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Real-Time Reporting and Analytics for Construction Data Powered by Delta Lake and DBSQL

2023-07-26 · Databricks DATA + AI Summit 2023 Watch

video

by Jay Yang (RiskLab, UBS) , Hari Rajaram

Analytics Data Lakehouse Databricks Delta Kafka Spark SQL Data Streaming

Procore is a construction project management software that helps construction professionals efficiently manage their projects and collaborate with their teams. Our mission is to connect everyone in construction on a global platform.

Procore is the system of record for all construction projects. Our customers need to access the data in near real-time for construction insights. Enhanced reporting is a self-service operational reporting module that allows quick data access with consistency to thousands of tables and reports.

Procore data platform rebuilt the module (originally built on the relational database) using Databricks and Delta lake. We used Apache Spark™ streaming to maintain the consistent state on the ingestion side from Kafka and plan to leverage the fully capable functionalities of DBSQL using the serverless SQL warehouse to read the medallion models (built via DBT) in Delta Lake. In addition, the Unity Catalog and the Delta share features helped us share the data across regions seamlessly. This design enabled us to improve the p95 and p99 read time by xx% (which were initially timing out).

Attend this session to hear about the learnings and experience of building a Data Lakehouse architecture.

Talk by: Jay Yang and Hari Rajaram

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Databricks SQL: Why the Best Serverless Data Warehouse is a Lakehouse

2023-07-26 · Databricks DATA + AI Summit 2023 Watch

video

by Cyrielle Simeone , Miranda Luna (Databricks)

AI/ML Analytics BI Cloud Computing Data Lake Data Lakehouse Databricks DWH ELK Fivetran Power BI SQL +1 more

Many organizations rely on complex cloud data architectures that create silos between applications, users and data. This fragmentation makes it difficult to access accurate, up-to-date information for analytics, often resulting in the use of outdated data. Enter the lakehouse, a modern data architecture that unifies data, AI, and analytics in a single location.

This session explores why the lakehouse is the best data warehouse, featuring success stories, use cases and best practices from industry experts. You'll discover how to unify and govern business-critical data at scale to build a curated data lake for data warehousing, SQL and BI. Additionally, you'll learn how Databricks SQL can help lower costs and get started in seconds with on-demand, elastic SQL serverless warehouses, and how to empower analytics engineers and analysts to quickly find and share new insights using their preferred BI and SQL tools such as Fivetran, dbt, Tableau, or Power BI.

Talk by: Miranda Luna and Cyrielle Simeone

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Sponsored: dbt Labs | Modernizing the Data Stack: Lessons Learned From Evolution at Zurich Insurance

Unlock the Next Evolution of the Modern Data Stack With the Lakehouse Revolution -- with Live Demos

2023-07-26 · Databricks DATA + AI Summit 2023 Watch

video

by Kyle Hale (Databricks) , Roberto Salcido (Databricks)

Cloud Computing Data Lakehouse Databricks Fivetran Modern Data Stack Tableau

As the data landscape evolves, organizations are seeking innovative solutions that provide enhanced value and scalability without exploding costs. In this session, we will explore the exciting frontier of the Modern Data Stack on Databricks Lakehouse, a game-changing alternative to traditional Data Cloud offerings. Learn how Databricks Lakehouse empowers you to harness the full potential of Fivetran, dbt, and Tableau, while optimizing your data investments and delivering unmatched performance.

We will showcase real-world demos that highlight the seamless integration of these modern data tools on the Databricks Lakehouse platform, enabling you to unlock faster and more efficient insights. Witness firsthand how the synergy of Lakehouse and the Modern Data Stack outperforms traditional solutions, propelling your organization into the future of data-driven innovation. Don't miss this opportunity to revolutionize your data strategy and unleash unparalleled value with the lakehouse revolution.

Talk by: Kyle Hale and Roberto Salcido

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Sponsored: Matillion | Using Matillion to Boost Productivity w/ Lakehouse and your Full Data Stack

Live from the Lakehouse: LLMs, LangChain, and analytics engineering workflow with dbt Labs

2023-07-14 · Databricks DATA + AI Summit 2023 Watch

video

by Drew Banin (Fishtown Analytics) , Nicolas Palaez (Databricks) , Harrison Chase (LangChain)

AI/ML Analytics Analytics Engineering Data Lakehouse Databricks GenAI LLM Marketing

Hear from three guests. Harrison Chase (CEO, LangChain) and Nicolas Palaez (Sr. Technical Marketing Manager, Databricks) on LLMs and generative AI. Third guest, Drew Banin (co-founder, dbt Labs), discusses analytics engineering workflow with his company dbt Labs, how he started the company, and how they provide value with the Databricks partnership. Hosted by Ari Kaplan (Head of Evangelism, Databricks) and Pearl Ubaru (Sr Technical Marketing Engineer, Databricks)

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

The Arc of Data Innovation (w/ Bob Muglia, former CEO of Snowflake)

2023-07-12 · The Analytics Engineering Podcast Listen

podcast_episode

by Tristan Handy (dbt Labs) , Bob Muglia (Snowflake; Microsoft) , Julia Schottenstein (dbt labs)

Analytics Analytics Engineering Data Engineering LLM Microsoft RDBMS Snowflake

Bob Muglia likely needs no introduction. The former CEO of Snowflake led the company during its early, transformational years after a long career at Microsoft and Juniper. Bob recently released the book The Datapreneurs about the arc of innovation in the data industry, starting with the first relational databases all the way to the present craze of LLMs and beyond. In this conversation with Tristan and Julia, Bob shares insights into the future of data engineering and its potential business impact while offering a glimpse into his professional journey. For full show notes and to read 6+ years of back issues of the podcast's companion newsletter, head to https://roundup.getdbt.com. The Analytics Engineering Podcast is sponsored by dbt Labs.

Reduce Friction In Your Business Analytics Through Entity Centric Data Modeling

2023-07-09 · Data Engineering Podcast Listen

podcast_episode

by Maxime Beauchemin (Preset) , Tobias Macey

Activity Schema AI/ML Airflow Analytics Data Engineering Data Management Data Modelling ETL/ELT GitHub Informatica dimensional modeling Python +3 more

Summary

For business analytics the way that you model the data in your warehouse has a lasting impact on what types of questions can be answered quickly and easily. The major strategies in use today were created decades ago when the software and hardware for warehouse databases were far more constrained. In this episode Maxime Beauchemin of Airflow and Superset fame shares his vision for the entity-centric data model and how you can incorporate it into your own warehouse design.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack Your host is Tobias Macey and today I'm interviewing Max Beauchemin about the concept of entity-centric data modeling for analytical use cases

Interview

Introduction How did you get involved in the area of data management? Can you describe what entity-centric modeling (ECM) is and the story behind it?

How does it compare to dimensional modeling strategies? What are some of the other competing methods Comparison to activity schema

What impact does this have on ML teams? (e.g. feature engineering)

What role does the tooling of a team have in the ways that they end up thinking about modeling? (e.g. dbt vs. informatica vs. ETL scripts, etc.)

What is the impact on the underlying compute engine on the modeling strategies used?

What are some examples of data sources or problem domains for which this approach is well suited?

What are some cases where entity centric modeling techniques might be counterproductive?

What are the ways that the benefits of ECM manifest in use cases that are down-stream from the warehouse?

What are some concrete tactical steps that teams should be thinking about to implement a workable domain model using entity-centric principles?

How does this work across business domains within a given organization (especially at "enterprise" scale)?

What are the most interesting, innovative, or unexpected ways that you have seen ECM used?

What are the most interesting, unexpected, or challenging lessons that you have learned while working on ECM?

When is ECM the wrong choice?

What are your predictions for the future direction/adoption of ECM or other modeling techniques?

Contact Info

mistercrunch on GitHub LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

Entity Centric Modeling Blog Post Max's Previous Apperances

Defining Data Engineering with Maxime Beauchemin Self Service Data Exploration And Dashboarding With Superset Exploring The Evolving Role Of Data Engineers Alumni Of AirBnB's Early Years Reflect On What They Learned About Building Data Driven Organizations

Apache Airflow Apache Superset Preset Ubisoft Ralph Kimball The Rise Of The Data Engineer The Downfall Of The Data Engineer The Rise Of The Data Scientist Dimensional Data Modeling Star Schema Databas

Tristan Handy - Balancing Competing Tensions and Handling Complexity

2023-07-06 · The Joe Reis Show Listen

podcast_episode

by Tristan Handy (dbt Labs) , Joe Reis (DeepLearning.AI)

Data Engineering

Tristan Handy and I chat about balancing competing tensions, both personally and leading dbt Labs. We also discuss the power of organizational behavior, naming problems to solve, and home remodeling.

This is different from the normal interviews you'll hear with Tristan, and I hope you enjoy it!

dbtlabs #data #analyticsengineering

If you like this show, give it a 5-star rating on your favorite podcast platform.

Purchase Fundamentals of Data Engineering at your favorite bookseller.

Subscribe to my Substack: https://joereis.substack.com/

Airflow at StyleSeat: Our journey, challenges & results

2023-07-01 · Airflow Summit 2023

session

by Ramya Pappu , Kunal Haria

Airflow AWS CloudWatch AWS Lambda Docker Python

We will share the case study of Airflow at StyleSeat, where within a year our data grew from 2 million data points per day to 200 million. Our original solution for orchestrating this data was not enough, so we migrated to an Airflow based solution. Previous implementation Our tasks were orchestrated with hourly triggers on AWS Cloudwatch rules in their own log groups. Each task was a lambda individually defined as a task and executed python code from a docker image. As complexity increased, there were frequent downtimes and manual executions for failed tasks and their downstream dependencies.With every downtime, our business stakeholders started losing trust in Data and recovery times were longer with each downtime. We needed a modern orchestration platform which would enable our team to define and instrument complex pipelines as code, provide visibility into executions and define retry criteria on failures. Airflow was identified as a crucial & critical piece in modernizing our orchestration which would help us further onboard DBT. We wanted a managed solution and a partner who could help guide us to a successful migration.

A Single Pane of Glass on Airflow using Astro Python SDK, Snowflake, dbt, and Cosmos

2023-07-01 · Airflow Summit 2023

session

by Luan Moreno Medeiros Maciel (Pythian)

Airflow Astronomer Cosmos DWH ETL/ELT Python Snowflake SQL

ETL data pipelines are the bread and butter of data teams that must design, develop, and author DAGs to accommodate the various business requirements. dbt is becoming one of the most used tools to perform SQL transformations on the Data Warehouse, allowing teams to harness the power of queries at scale. Airflow users are constantly finding new ways to integrate dbt with the Airflow ecosystem and build a single pane of glass where Data Engineers can manage and administer their pipelines. Astronomer Cosmos, an open-source product, has been introduced to integrate Airflow with dbt Core seamlessly. Now you can easily see your dbt pipelines fully integrated on Airflow. You will learn the following: How to integrate dbt Core with Airflow How to use Cosmos How to build data pipelines at scale

Data Product DAGs

2023-07-01 · Airflow Summit 2023

session

by Andrea Bombino

Airflow Data Contracts Data Engineering Kubernetes

This talk will cover in high overview the architecture of a data product DAG, the benefits in a data mesh world and how to implement it easily. Airflow is the de-facto orchestrator we use at Astrafy for all our data engineering projects. Over the years we have developed deep expertise in orchestrating data jobs and recently we have adopted the “data mesh” paradigm of having one Airlfow DAG per data product. Our standard data product DAGs contain the following stages: Data contract: check integrity of data before transforming the data Data transformation: applies dbt transformation via a kubernetes pod operator Data distribution: mainly informing downstream applications that new data is available to be consumed For use cases where different data products need to be finished before triggering another data product, we have a mechanism with an engine in between that keeps track of finished dags and triggers DAGs based on a mapping table containing data products dependencies.

Empowering Collaborative Data Workflows with Airflow and Cloud Services

2023-07-01 · Airflow Summit 2023

session

by Hoa Nguyen , Stanisław Smyl

Airflow Cloud Computing Kubernetes Python YAML

Productive cross-team collaboration between data engineers and analysts is the goal of all data teams, however, fulfilling on that mission can be challenging given the diverse set of skills that each group brings. In this talk we present an example of how one team tackled this topic by creating a flexible, dynamic and extensible framework using Airflow and cloud services that allowed engineers and analysts to jointly create data-centric micro-services to serve up projections and other robust analysis for use in the organization. The framework, which utilized dynamic DAG generation configured using yaml files, Kubernetes jobs and dbt transformations, abstracted away many of the details associated with workflow orchestration, allowing analysts to focus on their Python or R code and data processing logic while enabling data engineers to monitor the pipelines and ensure their scalability.

talk-data.com

dbt

Activity Trend

Top Events

Top Speakers

Ramp's $8 Billion Data Strategy (W/ Ian Macomber and Ryan Delgado)

Quantifying The Return On Investment For Your Data Team

Data Warehousing using Fivetran, dbt and DBSQL

Strategies For A Successful Data Platform Migration

dbt Labs on dbt (w/ Daniel Le)

Cross-Platform Data Lineage with OpenLineage

Five Things You Didn't Know You Could Do with Databricks Workflows

Real-Time Reporting and Analytics for Construction Data Powered by Delta Lake and DBSQL

Databricks SQL: Why the Best Serverless Data Warehouse is a Lakehouse

Sponsored: dbt Labs | Modernizing the Data Stack: Lessons Learned From Evolution at Zurich Insurance

Unlock the Next Evolution of the Modern Data Stack With the Lakehouse Revolution -- with Live Demos

Sponsored: Matillion | Using Matillion to Boost Productivity w/ Lakehouse and your Full Data Stack

Live from the Lakehouse: LLMs, LangChain, and analytics engineering workflow with dbt Labs

The Arc of Data Innovation (w/ Bob Muglia, former CEO of Snowflake)

Reduce Friction In Your Business Analytics Through Entity Centric Data Modeling

Tristan Handy - Balancing Competing Tensions and Handling Complexity

dbtlabs #data #analyticsengineering

Airflow at StyleSeat: Our journey, challenges & results

A Single Pane of Glass on Airflow using Astro Python SDK, Snowflake, dbt, and Cosmos

Data Product DAGs

Empowering Collaborative Data Workflows with Airflow and Cloud Services