talk-data.com

Topic

YAML

YAML Ain't Markup Language (YAML)

data_serialization configuration_file_format human_readable file_format


Activity Trend

Peak of 9 activities per quarter, 2020-Q1 to 2026-Q1

Activities

29 activities · Newest first

dbt for rapid deployment of a data product - Coalesce 2023

The team at nib Health maintains internal projects that contain standardized packages for running a dbt project, such as pipeline management, data testing, and data modeling macros. In this talk, they share how they used the YAML documentation files in dbt to create standardized tagging for data security (PII), project, and product domain tags that get pushed into Snowflake, Immuta, and Select Star.
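As a rough sketch of that pattern, dbt properties files let you attach tags and metadata at both the model and column level, which downstream tools such as Snowflake, Immuta, and Select Star can then consume. The model, column, and tag names below are hypothetical, not nib's actual taxonomy:

```yaml
# models/schema.yml -- hypothetical example of standardized tagging in dbt YAML docs
version: 2

models:
  - name: dim_members                      # hypothetical model name
    description: "One row per insured member"
    config:
      tags: ["project:member_360", "domain:health_insurance"]
    columns:
      - name: member_id
        description: "Surrogate key for the member"
      - name: member_email
        description: "Member contact email"
        meta:
          contains_pii: true               # picked up by governance tooling
        tags: ["pii"]
```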

Speaker: Pip Sidaway, Data Product Manager, nib

Register for Coalesce at https://coalesce.getdbt.com

Unlocking model governance and multi-project deployments with dbt-meshify - Coalesce 2023

Join us for story hour as we follow two intrepid analytics engineers on a journey to meshify their large dbt project, with help from a ✨special guest✨

Along the way, learn about dbt-meshify - a new CLI tool to automate the creation of model governance and cross-project lineage features in your dbt project. dbt-meshify refactors your code for you, helping you add model contracts, versions, groups, access, cross-project lineage, and more -- all in a matter of minutes! No bespoke YAML writing needed.
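For reference, the governance features listed above are ultimately expressed in dbt's own YAML spec; the sketch below shows the kind of properties dbt-meshify can write on your behalf. The model, group, and column names are made up for illustration:

```yaml
# models/_governance.yml -- illustrative governance config of the kind dbt-meshify automates
version: 2

groups:
  - name: finance
    owner:
      name: Finance Analytics

models:
  - name: fct_orders
    access: public              # callable from other groups / projects
    group: finance
    latest_version: 2
    config:
      contract:
        enforced: true          # column names and types enforced at build time
    columns:
      - name: order_id
        data_type: integer
      - name: order_total
        data_type: numeric
    versions:
      - v: 1
      - v: 2
```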

Speakers: Grace Goheen, Product Manager, dbt Labs; Nicholas Yager, Principal Analytics Engineer, HubSpot; Dave Connors, DX, dbt Labs

Register for Coalesce at https://coalesce.getdbt.com

One to many: Moving from a monolithic dbt project to multi-project collaboration - Coalesce 2023

At the beginning of this year, Cityblock Health was afforded a unique opportunity: to rebuild their existing dbt project from scratch.

Launched in mid-2019, the legacy project had grown organically into a tangled mess of 1800+ models, with further development becoming more and more difficult.

Faced with the challenge of retroactively imposing order on the existing project, their leadership gave them the opportunity to start fresh instead.

They jumped at the chance, and began applying many of the lessons they learned at Coalesce 2022 to set the new project up for success:
- SQL linting, with SQLFluff
- YAML linting, with yamllint
- dbt best practices, with dbt-checkpoint and dbt-project-evaluator
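As one small, hypothetical example of those guardrails, a yamllint configuration lives in a .yamllint file at the repo root; the rule values below are illustrative rather than Cityblock's actual settings:

```yaml
# .yamllint -- minimal linting rules for the YAML in a dbt project (illustrative values)
extends: default

rules:
  line-length:
    max: 120        # allow longer descriptions in schema files
  indentation:
    spaces: 2
  truthy:
    allowed-values: ["true", "false"]
```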

As a result, this core project has become the model for multi-project collaboration at Cityblock. Rather than a single monolithic project, the new state features a collection of smaller projects, each governed by a high bar for code quality.

Speakers: Katie Claiborne, Staff analytics engineer, Cityblock Health; Nathaniel Burren, Analytics Engineer, Cityblock Health

Register for Coalesce at https://coalesce.getdbt.com

Databricks Asset Bundles: A Standard, Unified Approach to Deploying Data Products on Databricks

In this session, we will introduce Databricks Asset Bundles, demonstrate how they work for a variety of data products, and show how to fit them into an overall CI/CD strategy for the well-architected Lakehouse.

Data teams produce a variety of assets: datasets, reports and dashboards, ML models, and business applications. These assets depend upon code (notebooks, repos, queries, pipelines), infrastructure (clusters, SQL warehouses, serverless endpoints), and supporting services/resources like Unity Catalog, Databricks Workflows, and DBSQL dashboards. Today, each organization must figure out a deployment strategy for the variety of data products they build on Databricks, as there is no consistent way to describe the infrastructure and services associated with project code.

Databricks Asset Bundles is a new capability on Databricks that standardizes and unifies the deployment strategy for all data products developed on the platform. It allows developers to describe the infrastructure and resources of their project through a YAML configuration file, regardless of whether they are producing a report, dashboard, online ML model, or Delta Live Tables pipeline. Behind the scenes, these configuration files use Terraform to manage resources in a Databricks workspace, but knowledge of Terraform is not required to use Databricks Asset Bundles.
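To make that concrete, a bundle is defined by a databricks.yml file at the project root; the sketch below deploys a single notebook job, with all names, paths, and cluster settings chosen purely for illustration:

```yaml
# databricks.yml -- minimal asset bundle sketch (illustrative names and settings)
bundle:
  name: sales_reporting

resources:
  jobs:
    nightly_refresh:
      name: nightly-refresh
      tasks:
        - task_key: transform
          notebook_task:
            notebook_path: ./notebooks/transform
          new_cluster:
            spark_version: 13.3.x-scala2.12
            node_type_id: i3.xlarge
            num_workers: 2

targets:
  dev:
    default: true
    workspace:
      host: https://<your-workspace>.cloud.databricks.com
```

Deploying the bundle to the dev target is then a matter of running the Databricks CLI against this file; as noted above, the CLI drives Terraform behind the scenes, but none of that surfaces in the configuration.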

Talk by: Rafi Kurlansik and Pieter Noordhuis

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Productive cross-team collaboration between data engineers and analysts is the goal of all data teams; however, fulfilling that mission can be challenging given the diverse set of skills that each group brings. In this talk we present an example of how one team tackled this topic by creating a flexible, dynamic, and extensible framework using Airflow and cloud services that allowed engineers and analysts to jointly create data-centric micro-services to serve up projections and other robust analyses for use across the organization. The framework, which used dynamic DAG generation configured with YAML files, Kubernetes jobs, and dbt transformations, abstracted away many of the details of workflow orchestration, allowing analysts to focus on their Python or R code and data processing logic while enabling data engineers to monitor the pipelines and ensure their scalability.
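The framework itself is internal and its configuration schema is not public, but a hypothetical pipeline definition in that style might look like the following, with every field name below being an assumption for illustration:

```yaml
# pipelines/churn_projection.yml -- hypothetical config; field names are illustrative only
dag:
  dag_id: churn_projection
  schedule: "0 6 * * *"
  owner: analytics
tasks:
  - name: run_model
    type: kubernetes_job                       # analyst Python/R code packaged as a container
    image: registry.internal/churn-model:latest
    command: ["Rscript", "projection.R"]
  - name: publish_marts
    type: dbt_run                              # dbt transformations over the model output
    select: "marts.churn"
    depends_on: [run_model]
```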

Why Metrics Are Even More Valuable Than You Think They Are

Creating or migrating metric metadata to dbt can be a pain because of the level of underlying data knowledge required to create the YAML files properly. You might have found yourself wondering, “is this worth it just to standardize metric definitions?” This talk will tell you why it is definitely worth it… because the functionality you unlock goes beyond standard metric definitions. Adopting the dbt standard metric syntax unlocks three additional possibilities for your data:

  1. Automated time-aware metric calculations

  2. Dynamic drill downs and segmentation to empower slice and dice analysis

  3. Self-service dynamic transforms using templated SQL
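For orientation, here is roughly what a definition looks like under the dbt metrics spec this talk builds on; the model and column names are illustrative, and the exact syntax has since evolved with dbt's Semantic Layer:

```yaml
# models/metrics.yml -- sketch of a metric definition in dbt's metrics spec
version: 2

metrics:
  - name: total_revenue
    label: Total Revenue
    model: ref('fct_orders')          # illustrative model name
    calculation_method: sum
    expression: order_total
    timestamp: ordered_at             # enables automated time-aware calculations
    time_grains: [day, week, month, quarter]
    dimensions: [customer_segment, region]   # enables drill downs and slice-and-dice
```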

Check slides here: https://docs.google.com/presentation/d/1nJHP2E6NGZ-KHG4_gNiI6w2lq4kjWgQIanAC9yf3cng

Coalesce 2023 is coming! Register for free at https://coalesce.getdbt.com/.

In this talk I will introduce a DAG authoring and editing tool for Airflow that we have built. Installed as a plugin, this tool allows users to author DAGs by composing existing operators and hooks with virtually no Python experience. We walk through a demo of DAG authorship and deployment, and spend time reviewing the underlying open-source standards used and the general approach taken to develop the code. In addition to allowing DAGs to be created in a visual editor, the underlying tech enables Airflow DAGs to be described programmatically in YAML or JSON. DAGs described this way can be saved in backing databases instead of Python files.
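The plugin itself is not open source, so the schema below is only a hypothetical illustration of what a YAML-described DAG of this kind could look like, using operator paths that do exist in Airflow:

```yaml
# hypothetical YAML DAG description of the kind the editor could store in a database
orders_ingest:
  schedule_interval: "@daily"
  default_args:
    owner: data-platform
    retries: 2
  tasks:
    extract:
      operator: airflow.providers.http.operators.http.SimpleHttpOperator
      endpoint: /api/v1/orders/export
    load:
      operator: airflow.operators.bash.BashOperator
      bash_command: "python load_orders.py"
      dependencies: [extract]
```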

Summary

The theory behind how a tool is supposed to work and the realities of putting it into practice are often at odds with each other. Learning the pitfalls and best practices from someone who has gained that knowledge the hard way can save you from wasted time and frustration. In this episode James Meickle discusses his recent experience building a new installation of Airflow. He points out the strengths, design flaws, and areas of improvement for the framework. He also describes the design patterns and workflows that his team has built to allow them to use Airflow as the basis of their data science platform.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing James Meickle about his experiences building a new Airflow installation.

Interview

Introduction

How did you get involved in the area of data management? What was your initial project requirement?

What tooling did you consider in addition to Airflow? What aspects of the Airflow platform led you to choose it as your implementation target?

Can you describe your current deployment architecture?

How many engineers are involved in writing tasks for your Airflow installation?

What resources were the most helpful while learning about Airflow design patterns?

How have you architected your DAGs for deployment and extensibility?

What kinds of tests and automation have you put in place to support the ongoing stability of your deployment?

What are some of the dead-ends or other pitfalls that you encountered during the course of this project?

What aspects of Airflow have you found to be lacking that you would like to see improved?

What did you wish someone had told you before you started work on your Airflow installation?

If you were to start over would you make the same choice? If Airflow wasn’t available what would be your second choice?

What are your next steps for improvements and fixes?

Contact Info

@eronarn on Twitter Website eronarn on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Quantopian Harvard Brain Science Initiative DevOps Days Boston Google Maps API Cron ETL (Extract, Transform, Load) Azkaban Luigi AWS Glue Airflow Pachyderm

Podcast Interview

AirBnB Python YAML Ansible REST (Representational State Transfer) SAML (Security Assertion Markup Language) RBAC (Role-Based Access Control) Maxime Beauchemin

Medium Blog

Celery Dask

Podcast Interview

PostgreSQL

Podcast Interview

Redis Cloudformation Jupyter Notebook Qubole Astronomer

Podcast Interview

Gunicorn Kubernetes Airflow Improvement Proposals Python Enhancement Proposals (PEP)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

In this session, we’ll introduce the concept of Pulumi Components - packages that can be authored in one language and consumed in any other language. This enables platform engineering teams to create powerful patterns for reuse across their organization, such as sharing infrastructure libraries written in common programming languages that can easily be instantiated from a simple YAML file.
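For example, a Pulumi program written in YAML can instantiate such a shared component with a handful of lines; the component type (acme:web:StaticSite) and its properties below are hypothetical stand-ins for whatever a platform team publishes:

```yaml
# Pulumi.yaml -- consuming a shared component from YAML (component name is hypothetical)
name: marketing-site
runtime: yaml

resources:
  site:
    type: acme:web:StaticSite        # component authored in another language
    properties:
      indexDocument: index.html
      domainName: www.example.com

outputs:
  url: ${site.url}
```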