How to use OpenLineage for DIY lineage with the dbt-ol package
talk-data.com
Speaker: Kevin Hu (7 talks)
Data Observability at Datadog; focuses on data lineage and observability.
Bio from: dbt Coalesce 2022
Talks & appearances
7 activities · Newest first
SpotOn works with FIS (formerly WorldPay) to handle payment processing, allowing for more detailed transaction management than other processors. Our data team took on the challenge of transitioning to FIS to gain better control over transaction details.
The legacy data pipelines we inherited were problematic and unreliable. They consisted of an SFTP file server, cron jobs, and Python/Shell scripts that moved data from SFTP to S3 and then processed it into Postgres. These systems were fragile, often breaking when new or different data arrived, requiring manual intervention and frequent restarts.
We recognized the need for a better solution. Our team decided to use Snowpipe and dbt to streamline our data processing. This approach allowed us to manage and parse complex data formats efficiently. We used dbt to create models that could handle the varied and detailed specifications provided by FIS, ensuring that as updates came in, they could be easily integrated.
With this new setup, we have significantly reduced the fragility of our pipelines. Using dbt Cloud, we've improved collaboration and error detection, ensuring data integrity and better insights into usage patterns. This new system supports not only payment processing but also other critical functions like customer loyalty and marketing, aggregating and cleaning data from various sources.
As we continue migrating from older systems like TSYS, we see the clear benefits of this modernization. Our experience with dbt has proven invaluable in supporting our business-critical data operations and ensuring smooth transitions and reliable data handling.
Speakers: Kevin Hu, CEO, Metaplane
Daniel Corley, Senior Analytics Engineer, SpotOn
Read the blog to learn about the latest dbt Cloud features announced at Coalesce, designed to help organizations embrace analytics best practices at scale: https://www.getdbt.com/blog/coalesce-2024-product-announcements
Whenever Kevin and I get together, we "nerd snipe" each other. This time is no different: a wide-ranging conversation about how the data landscape evolves alongside LLMs, education, startup mentorship, and the possible (looming?) startup mass extinction.
Kevin's LinkedIn: https://www.linkedin.com/in/kevinzenghu/
Metaplane: https://metaplane.dev/
There's a funny (and painful) paradigm in the data world: more data, more problems. As companies grow their data, there are more broken dashboards to find, more impacted users to chase, and more data to cobble together. But what if there was tooling to make the scaling process a little less painful and a lot more scalable?
Join Marion Pavillet (MUX) and Kevin Hu (Metaplane) as they share stories about scaling data stacks at large and small companies, demonstrate how they used modern tools to improve data lineage and accuracy, and discuss the power of metadata in improving data products over time. Attendees can expect to leave this session with new concepts for framing the problems faced by rapidly growing companies, a blueprint for implementing DataOps, and solutions they can map to their own situations.
Check the slides here: https://docs.google.com/presentation/d/1jL2OfAyPFJl0Dq6evmdwnpK6xrebNxEbjVj1nIcwXsU/
Coalesce 2023 is coming! Register for free at https://coalesce.getdbt.com/.
How do you support exponentially growing companies without breaking as a data team? The answer is increasing your leverage with tools and processes. This session centers around four principles to achieve this goal: 1. don’t reinvent the wheel, 2. make your own job easier, 3. save time for innovation, and 4. invest in onboarding.
First, the first data leader at Vendr, the SaaS buying platform with customers like GitLab, Brex, and The Washington Post, will share his learnings on building a stack and team that scaled as the company grew 10x from 30 to 300 employees in under two years.
Second, we’ll give a demo of how Metaplane pulls lineage and metadata from a modern data stack that is centered around dbt. By the end of the demo, you’ll know how to set up tests, extract lineage throughout your data stack, and triage data quality alerts. More details coming soon!
Check the slides here: https://docs.google.com/presentation/d/15dQJIGeGhG0WGO6MLXtxWhmf8neY-u0c8ZLRG9GJB-s/edit?usp=sharing
As a PhD candidate at MIT, Kevin (and friends) published Sherlock, a data type detection engine (a surprisingly bedeviling problem) for data cleaning + data discovery. Now as co-founder and CEO of Metaplane, a data observability startup, Kevin applies these same automated data discovery methods to help data teams keep their data healthy. In this conversation with Tristan & Julia, Kevin wins the coveted award for "most crystal-clear explanations of complex technical concepts through physics analogy." For full show notes and to read 6+ years of back issues of the podcast's companion newsletter, head to https://roundup.getdbt.com.
Summary
Data observability is a set of technical and organizational capabilities related to understanding how your data is being processed and used so that you can proactively identify and fix errors in your workflows. In this episode Metaplane founder Kevin Hu shares his working definition of the term and explains the work that he and his team are doing to cut down on the time to adoption for this new set of practices. He discusses the factors that influenced his decision to start with the data warehouse, the potential shortcomings of that approach, and where he plans to go from there. This is a great exploration of what it means to treat your data platform as a living system and apply state-of-the-art engineering to it.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode is sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Your host is Tobias Macey and today I’m interviewing Kevin Hu about Metaplane, a platform aiming to provide observability for modern data stacks, from warehouses to BI dashboards and everything in between.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Metaplane is and the story behind it?
Data observability is an area that has seen a huge amount of activity over the past couple of years. What is your working definition of that term?
What are the areas of differentiation that you see across vendors in the space?
Can you describe how the Metaplane platform is architected?
How have the design and goals of Metaplane changed or evolved since you started working on it?
establishing seasonality in data metrics
blind spots from operating at the level of the data warehouse
What are the most interesting, innovative, or unexpected ways that you have seen Metaplane used?
What are the most interesti