Unlike transforming staged data into marts, ingesting data into staging requires robustness to changes in data volume and type, schema evolution, and data drift. Especially when performing change data capture (CDC) on large databases (~100 tables per database), we'll ideally reinforce our dbt models with automatic:
- Mapping of dynamic columns and data types between the source and the target staging tables
- Evolution of staging table schemas at pace with incoming data, including nested data structures
- Parsing and flattening of arrays and JSON structs in the data (a minimal sketch follows this list)
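As an illustration of the third point, here is a minimal sketch of a hand-written dbt staging model that flattens a nested CDC payload. It assumes a Snowflake warehouse, a hypothetical `raw.orders` source whose rows land as a single VARIANT column named `record`, and hypothetical nested field names (`line_items`, `sku`, `qty`); none of these come from the talk itself.

```sql
-- models/staging/stg_orders.sql (hypothetical example, Snowflake syntax)

with source as (

    -- assumed source: one VARIANT column `record` per CDC event
    select record
    from {{ source('raw', 'orders') }}

),

flattened as (

    select
        record:id::number               as order_id,
        record:updated_at::timestamp_tz as updated_at,
        item.value:sku::string          as sku,       -- assumed nested field
        item.value:qty::number          as quantity   -- assumed nested field
    from source,
        lateral flatten(input => source.record:line_items) as item

)

select * from flattened
```

Hand-writing and maintaining one such model per table, with hard-coded columns and casts, is exactly what breaks down at ~100 tables as schemas drift, which is why the talk focuses on automating the mapping, evolution, and flattening instead.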
Manually performing these tasks at scale is a tall order, given the many ways CDC data can deviate. Deferring them to mart transformation models doesn't reduce the complexity and is potentially detrimental to the business. Santona Tuli shares learnings from integrating dbt Core into high-scale data ingestion workloads, including the trade-offs between ease of use and scale.
Speaker: Santona Tuli, Head/Director of Data, Upsolver
Register for Coalesce at https://coalesce.getdbt.com