Data Engineering

Karin Wolok - All Things DevRel and More

2023-11-29 · The Joe Reis Show Listen

podcast_episode

by Karin Wolok (StarTree) , Joe Reis (DeepLearning.AI)

Karin Wolok and I chat about all things devrel - what it is, why it matters, the good the bad, and everything in between. We also talk about her time in the music industry, which is a fascinating side thread. Karin is amazingly smart, energetic, and fun to talk with. Enjoy!

Note - this was recorded right after the awesome DEWCon, the premier data engineering conference in Bangalore, India. Shoutout to everyone there!

Addressing The Challenges Of Component Integration In Data Platform Architectures

2023-11-27 · Data Engineering Podcast Listen

podcast_episode

by Tobias Macey

AI/ML Airflow Analytics AWS AWS Lambda BI Cloud Computing Data Lake Data Lakehouse Data Management dbt Delta +10 more

Summary

Building a data platform that is enjoyable and accessible for all of its end users is a substantial challenge. One of the core complexities that needs to be addressed is the fractal set of integrations that need to be managed across the individual components. In this episode Tobias Macey shares his thoughts on the challenges that he is facing as he prepares to build the next set of architectural layers for his data platform to enable a larger audience to start accessing the data being managed by his team.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free! Developing event-driven pipelines is going to be a lot easier - Meet Functions! Memphis functions enable developers and data engineers to build an organizational toolbox of functions to process, transform, and enrich ingested events “on the fly” in a serverless manner using AWS Lambda syntax, without boilerplate, orchestration, error handling, and infrastructure in almost any language, including Go, Python, JS, .NET, Java, SQL, and more. Go to dataengineeringpodcast.com/memphis today to get started! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'll be sharing an update on my own journey of building a data platform, with a particular focus on the challenges of tool integration and maintaining a single source of truth

Interview

Introduction How did you get involved in the area of data management? data sharing weight of history

existing integrations with dbt switching cost for e.g. SQLMesh de facto standard of Airflow

Single source of truth

permissions management across application layers Database engine Storage layer in a lakehouse Presentation/access layer (BI) Data flows dbt -> table level lineage orchestration engine -> pipeline flows

task based vs. asset based

Metadata platform as the logical place for horizontal view

Contact Info

LinkedIn Website

Parting Questio

Unlocking Your dbt Projects With Practical Advice For Practitioners

2023-11-20 · Data Engineering Podcast Listen

podcast_episode

by Dustin Dorsey (Onix) , Cameron Cyr , Tobias Macey

AI/ML Analytics Cloud Computing Data Lake Data Lakehouse Data Management dbt Delta Hudi Iceberg SaaS SQL +2 more

Summary

The dbt project has become overwhelmingly popular across analytics and data engineering teams. While it is easy to adopt, there are many potential pitfalls. Dustin Dorsey and Cameron Cyr co-authored a practical guide to building your dbt project. In this episode they share their hard-won wisdom about how to build and scale your dbt projects.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at dataengineeringpodcast.com/miro. That’s three free boards at dataengineeringpodcast.com/miro. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Dustin Dorsey and Cameron Cyr about how to design your dbt projects

Interview

Introduction How did you get involved in the area of data management? What was your path to adoption of dbt?

What did you use prior to its existence? When/why/how did you start using it?

What are some of the common challenges that teams experience when getting started with dbt?

How does prior experience in analytics and/or software engineering impact those outcomes?

You recently wrote a book to give a crash course in best practices for dbt. What motivated you to invest that time and effort?

What new lessons did you learn about dbt in the process of writing the book?

The introduction of dbt is largely res

Enhancing The Abilities Of Software Engineers With Generative AI At Tabnine

2023-11-13 · Data Engineering Podcast Listen

podcast_episode

by Eran Yahav (Technion – Israel Institute of Technology) , Tobias Macey

AI/ML Analytics BI CI/CD Cloud Computing Data Lake Data Lakehouse Data Management Data Quality Datafold dbt Delta +8 more

Summary

Software development involves an interesting balance of creativity and repetition of patterns. Generative AI has accelerated the ability of developer tools to provide useful suggestions that speed up the work of engineers. Tabnine is one of the main platforms offering an AI powered assistant for software engineers. In this episode Eran Yahav shares the journey that he has taken in building this product and the ways that it enhances the ability of humans to get their work done, and when the humans have to adapt to the tool.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free! Your host is Tobias Macey and today I'm interviewing Eran Yahav about building an AI powered developer assistant at Tabnine

Interview

Introduction How did you get involved in machine learning? Can you describe what Tabnine is and the story behind it? What are the individual and organizational motivations for using AI to generate code?

What are the real-world limitations of generative AI for creating software? (e.g. size/complexity of the outputs, naming conventions, etc.) What are the elements of skepticism/overs

Cracking the Data Engineering Interview

2023-11-07 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Kedeisha Bryan , Taamir Ransome

CI/CD Cloud Computing Data Modelling ETL/ELT Python Cyber Security SQL data data-engineering

"Cracking the Data Engineering Interview" is your essential guide to mastering the data engineering interview process. This book offers practical insights and techniques to build your resume, refine your skills in Python, SQL, data modeling, and ETL, and confidently tackle over 100 mock interview questions. Gain the knowledge and confidence to land your dream role in data engineering. What this Book will help me do Craft a compelling data engineering portfolio to stand out to employers. Refresh and deepen understanding of essential topics like Python, SQL, and ETL. Master over 100 interview questions that cover both technical and behavioral aspects. Understand data engineering concepts such as data modeling, security, and CI/CD. Develop negotiation, networking, and personal branding skills crucial for job applications. Author(s) None Bryan and None Ransome are seasoned authors with a wealth of experience in data engineering and professional development. Drawing from their extensive industry backgrounds, they provide actionable strategies for aspiring data engineers. Their approachable writing style and real-world insights make complex topics accessible to readers. Who is it for? This book is ideal for aspiring data engineers looking to navigate the job application process effectively. Readers should be familiar with data engineering fundamentals, including Python, SQL, cloud data platforms, and ETL processes. It's tailored for professionals aiming to enhance their portfolios, tackle challenging interviews, and boost their chances of landing a data engineering role.

#162 Scaling Data Engineering in Retail with Mohammad Sabah, SVP of Engineering & Data at Thrive Market

2023-11-06 · DataFramed Listen

podcast_episode

by Mohammad Sabah (Thrive Market) , Richie (DataCamp)

AI/ML Data Governance Data Quality Data Science dbt

Poor data engineering is like building a shaky foundation for a house—it leads to unreliable information, wasted time and money, and even legal problems, making everything less dependable and more troublesome in our digital world. In the retail industry specifically, data engineering is particularly important for managing and analyzing large volumes of sales, inventory, and customer data, enabling better demand forecasting, inventory optimization, and personalized customer experiences. It helps retailers make informed decisions, streamline operations, and remain competitive in a rapidly evolving market. Insight and frameworks learned from data engineering practices can be applied to a multitude of people and problems, and in turn, learning from someone who has been at the forefront of data engineering is invaluable. Mohammad Sabah is SVP of Engineering and Data at Thrive Market, and was appointed to this role in 2018. He joined the company from The Honest Company where he served as VP of Engineering & Chief Data Scientist. Sabah joined The Honest Company following its acquisition of Insnap, which he co-founded in 2015. Over the course of his career, Sabah has held various data science and engineering roles at companies including Facebook, Workday, Netflix, and Yahoo! In the episode, Richie and Mo explore the importance of using AI to identify patterns and proactively address common errors, the use of tools like dbt and SODA for data pipeline abstraction and stakeholder involvement in data quality, data governance and data quality as foundations for strong data engineering, validation layers at each step of the data pipeline to ensure data quality, collaboration between data analysts and data engineers for holistic problem-solving and reusability of patterns, ownership mentality in data engineering and much more. Links from the show: PagerDutyDomoOpsGeneCareer Track: Data Engineer

Shining Some Light In The Black Box Of PostgreSQL Performance

2023-11-06 · Data Engineering Podcast Listen

podcast_episode

by Lukas Fittl , Tobias Macey

AI/ML Analytics BI CI/CD Cloud Computing Data Lake Data Lakehouse Data Management Data Quality Datafold dbt Delta +8 more

Summary

Databases are the core of most applications, but they are often treated as inscrutable black boxes. When an application is slow, there is a good probability that the database needs some attention. In this episode Lukas Fittl shares some hard-won wisdom about the causes and solution of many performance bottlenecks and the work that he is doing to shine some light on PostgreSQL to make it easier to understand how to keep it running smoothly.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold Your host is Tobias Macey and today I'm interviewing Lukas Fittl about optimizing your database performance and tips for tuning Postgres

Interview

Introduction How did you get involved in the area of data management? What are the different ways that database performance problems impact the business? What are the most common contributors to performance issues? What are the useful signals that indicate performance challenges in the database?

For a given symptom, what are the steps that you recommend for determining the proximate cause?

What are the potential negative impacts to be aware of when tu

The Enterprise Big Data Framework

2023-11-03 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Jan-Willem Middelburg

Big Data data data-engineering

Transform enterprise Big Data into valuable assets with this comprehensive guide to data analysis, data engineering, algorithm design and data architecture.

Data Science: The Hard Parts

2023-11-01 · O'Reilly Data Science Books O'Reilly Amazon

book

by Daniel Vaughan

AI/ML Data Science KPI data data-science

This practical guide provides a collection of techniques and best practices that are generally overlooked in most data engineering and data science pedagogy. A common misconception is that great data scientists are experts in the "big themes" of the discipline—machine learning and programming. But most of the time, these tools can only take us so far. In practice, the smaller tools and skills really separate a great data scientist from a not-so-great one. Taken as a whole, the lessons in this book make the difference between an average data scientist candidate and a qualified data scientist working in the field. Author Daniel Vaughan has collected, extended, and used these skills to create value and train data scientists from different companies and industries. With this book, you will: Understand how data science creates value Deliver compelling narratives to sell your data science project Build a business case using unit economics principles Create new features for a ML model using storytelling Learn how to decompose KPIs Perform growth decompositions to find root causes for changes in a metric Daniel Vaughan is head of data at Clip, the leading paytech company in Mexico. He's the author of Analytical Skills for AI and Data Science (O'Reilly).

Data Engineering with AWS - Second Edition

2023-10-31 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Gareth Eagar

Analytics AWS AWS Glue Data Governance QuickSight Redshift S3 Cyber Security data data-engineering

Learn data engineering and modern data pipeline design with AWS in this comprehensive guide! You will explore key AWS services like S3, Glue, Redshift, and QuickSight to ingest, transform, and analyze data, and you'll gain hands-on experience creating robust, scalable solutions. What this Book will help me do Understand and implement data ingestion and transformation processes using AWS tools. Optimize data for analytics with advanced AWS-powered workflows. Build end-to-end modern data pipelines leveraging cutting-edge AWS technologies. Design data governance strategies using AWS services for security and compliance. Visualize data and extract insights using Amazon QuickSight and other tools. Author(s) Gareth Eagar is a Senior Data Architect with over 25 years of experience in designing and implementing data solutions across various industries. He combines his deep technical expertise with a passion for teaching, aiming to make complex concepts approachable for learners at all levels. Who is it for? This book is intended for current or aspiring data engineers, data architects, and analysts seeking to leverage AWS for data engineering. It suits beginners with a basic understanding of data concepts who want to gain practical experience as well as intermediate professionals aiming to expand into AWS-based systems.

Surveying The Market Of Database Products

2023-10-30 · Data Engineering Podcast Listen

podcast_episode

by Tanya Bragin (ClickHouse) , Tobias Macey

Analytics BI CI/CD ClickHouse Cloud Computing Data Management Data Quality Datafold dbt ELK Modern Data Stack Oracle +4 more

Summary

Databases are the core of most applications, whether transactional or analytical. In recent years the selection of database products has exploded, making the critical decision of which engine(s) to use even more difficult. In this episode Tanya Bragin shares her experiences as a product manager for two major vendors and the lessons that she has learned about how teams should approach the process of tool selection.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free! This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at dataengineeringpodcast.com/miro. That’s three free boards at dataengineeringpodcast.com/miro. Your host is Tobias Macey and today I'm interviewing Tanya Bragin about her views on the database products market

Interview

Introduction How did you get involved in the area of data management? What are the aspects of the database market that keep you interested as a VP of product?

How have your experiences at Elastic informed your current work at Clickhouse?

What are the main product categories for databases today?

What are the industry trends that have the most impact on the development and growth of different product categories? Which categories do you see growing the fastest?

When a team is selecting a database technology for a given task, what are the types of questions that they should be asking? Transactional engines like Postgres, SQL Server, Oracle, etc. were long used

Supercharging analytics engineers to balance quality & speed via automated CI checks - Coalesce 2023

2023-10-27 · dbt Coalesce 2023 Watch

video

by Jorrit Posor (FINN GmbH) , Chiel Fernhout (Datafold) , Felix Kreitschmann (FINN Auto)

Analytics Analytics Engineering CI/CD Datafold

Supercharge your analytics engineering with the power of automated CI checks. Learn how FINN, a global car subscription service, has harnessed the capabilities of automated CI checks to maintain the delicate balance between swift development and robust data pipeline quality as they've scaled their data teams. Dive into insights and strategies to ensure quality without sacrificing speed and discover how to improve your data operations.

Speakers: Chiel Fernhout, Software Engineer, Datafold; Jorrit Posor, Tech Lead Data Engineering, FINN GmbH; Felix Kreitschmann, Senior PM, Data, FINN Auto

Register for Coalesce at https://coalesce.getdbt.com

Enterprise MDS deployment at scale: dbt & DevOps - Coalesce 2023

2023-10-27 · dbt Coalesce 2023 Watch

video

by Ash Sultan (Datatonic)

Agile/Scrum Analytics BI CI/CD DataOps dbt DevOps DWH Modern Data Stack

Behind any good DataOps within a Modern Data Stack (MDS) architecture is a solid DevOps design! This is particularly pressing when building an MDS solution at scale, as reliability, quality and availability of data requires a very high degree of process automation while remaining fast, agile and resilient to change when addressing business needs.

While DevOps in Data Engineering is nothing new - for a broad-spectrum solution that includes data warehouse, BI, etc seemed either a bit out of reach due to overall complexity and cost - or simply overlooked due to perceived issues around scaling often attributed to the challenges of automation in CI/CD processes. However, this has been fast changing with tools such as dbt having super cool features which allow a very high degree of autonomy in the CI/CD processes with relative ease, with flexible and cutting edge features around pre-commits, Slim CI, etc.

In this session, Datatonic covers the challenges around building and deploying enterprise-grade MDS solutions for analytics at scale and how they have used dbt to address those - especially around near-complete autonomy to the CI/CD processes!

Speaker: Ash Sultan, Lead Data Architect, Datatonic

Register for Coalesce at https://coalesce.getdbt.com

Panel discussion: Fixing the data eng lifecycle - Coalesce 2023

2023-10-27 · dbt Coalesce 2023 Watch

video

by Sung Won Chung (Datafold) , Mehdi Ouazza (MotherDuck) , Matt Housley (Halfpipe Systems)

Analytics Data Analytics Datafold Modern Data Stack Motherduck

As Joe Reis recently opined, if you want to know what’s next in data engineering, just look at the software engineer. The MDS-in-a-box pattern has been a game changer for applying software engineering principles to local data development– improving the ability to share data, collaborate on modeling work and data analysis the same way we build and share open source tooling.

This panel brings together experts in data engineering, data analytics and software engineering to explore the current state of the pattern, pieces that remain missing today and how emerging tools and data engineering testing capabilities can refine the transition from local development to production workflows.

Speakers: Matt Housley, CTO, Halfpipe Systems; Mehdi Ouazza, Developer Advocate, MotherDuck; Sung Won Chung, Solutions Engineer, Datafold; Louise de Leyritz, Host, The Data Couch podcast

Register for Coalesce at https://coalesce.getdbt.com

Driving scalability at Carsales: The democratization of our data platform - Coalesce 2023

2023-10-27 · dbt Coalesce 2023 Watch

video

by Adam Carbone (Carsales)

Analytics SQL

Two years ago, the data team at Carsales embarked on a journey to modernize their data platform, transforming their approach to analytics, reporting, and data-driven decision making. Navigating through an ever-evolving data landscape, Carsales has prioritized tools and methodologies that boost operational efficiency. This talk delves into their strategic transition towards a decentralized data platform, emphasizing the design to elevate scalability, reliability, and accessibility by embracing best-in-class software engineering practices. By enabling the application of SQL-based infrastructure, they've ensured that even individuals with minimal data engineering background can extract valuable insights while upholding system reliability.

Speaker: Adam Carbone, Data Engineering Lead, Carsales

Register for Coalesce at https://coalesce.getdbt.com

My (almost) musical career and RMIT’s journey adopting dbt - Coalesce 2023

2023-10-27 · dbt Coalesce 2023 Watch

video

by Sarah Taylor (RMIT University) , Darren Ware (RMIT University)

dbt Modern Data Stack

In this presentation, Sarah and Darren discuss RMIT University's journey to implementing the modern data stack with dbt. They bring tales of their musical successes and misadventures, lessons learned with both music and data engineering, and how these seemingly disparate worlds overlap.

Speakers: Darren Ware, Senior Data Engineer, RMIT University; Sarah Taylor, Lead Data Engineer, RMIT University

Register for Coalesce at https://coalesce.getdbt.com

Cost centers to cash cows: Data teams as growth drivers - Coalesce 2023

2023-10-25 · dbt Coalesce 2023 Watch

video

by Tejas Manohar (Hightouch) , Brandon Beidel (Red Ventures)

dbt

Hear from Red Ventures about how through the creation of standard implementation playbooks, their data team is able to quickly roll out modern tools like dbt and Hightouch across businesses to drive growth.

Speakers: Tejas Manohar, cofounder/co-CEO, Hightouch; Brandon Beidel, Director of Data Engineering, Red Ventures

Register for Coalesce at https://coalesce.getdbt.com

Identifying novel data issues that go undetected through CI/CD with dbt and Datafold - Coalesce 2023

2023-10-25 · dbt Coalesce 2023 Watch

video

by Ravi Ramadoss (Moody's Analytics CRE) , Gleb Mezhanskiy (Datafold) , Ryan Kelly (Moody's Analytics CRE)

Analytics CI/CD Data Quality Datafold dbt

Join the team from Moody's Analytics as they take you on a personal journey of optimizing their data pipelines for data quality and governance. Like many data practitioners, Ryan and Ravi understand the frustration and anxiety that comes with accidentally introducing bad code into production pipelines—they've spent countless hours putting out fires caused from these unexpected changes.

In this session, Ryan and Ravi recount their experiences with a previous data stack that lacked standardized testing methods and visibility into the impact of code changes on production data. They also share how their new data stack is safeguarded by Datafold's data diffing and continuous integration (CI) capabilities, which enables their team to work with greater confidence, peace of mind, and speed.

Speakers: Gleb Mezhanskiy, CEO, Datafold; Ravi Ramadoss, Director of Data Engineering, Moody's Analytics CRE; Ryan Kelly, Data Engineer, Moody's Analytics CRE

Register for Coalesce at https://coalesce.getdbt.com

Operationalizing Ramp’s data with dbt and Materialize - Coalesce 2023

2023-10-25 · dbt Coalesce 2023 Watch

video

by Ryan Delgado (Ramp) , Nikhil Benesch (Materialize)

Analytics Data Modelling dbt DWH SaaS

Traditional data warehouses excel at churning through terabytes of data for historical analysis. But for real-time, business-critical use cases, traditional data warehouses can’t produce results fast enough—and they still rack up a huge bill in the process.

So when Ramp’s data engineering team needed to serve complex analytics queries on the critical path of their production application, they knew they needed a new tool for the job. Enter Materialize, the first operational data warehouse. Like a traditional data warehouse, Materialize centralizes the data from all of a business’s production systems, from application databases to SaaS tools. But unlike a traditional data warehouse, Materialize enables taking immediate and automatic action when that data changes. Queries that once took hours or minutes to run are up-to-date in Materialize within seconds.

This talk presents how Ramp is unlocking new real-time use cases using Materialize as their operational data warehouse. The best part? The team still uses dbt for data modeling and deployment management, just like they are able to with their traditional batch workloads.

Speakers: Nikhil Benesch, CTO, Materialize; Ryan Delgado, Staff Software Engineer, Data Platform, Ramp

Register for Coalesce at https://coalesce.getdbt.com

Using data pipeline contract to prevent breakage in analytics reporting - Coalesce 2023

2023-10-25 · dbt Coalesce 2023 Watch

video

by Jisan Zaman (Xometry)

Analytics DWH Fivetran Snowflake postgresql

It’s 2023, why are software engineers still breaking analytics reporting? We’ve all been there, being alerted by an analyst or C-level stakeholders, saying “this report is broken”, only to spend hours determining that an engineer deleted a column on the source database that is now breaking your pipeline and reporting.

At Xometry, the data engineering team wanted to fix this problem at its root and give the engineering teams a clear and repeatable process that allowed them to be the owners of their own database data. Xometry named the process DPICT (data pipeline contract) and built several internal tools that integrated seamlessly with their developer’s microservice toolsets.

Their software engineers mostly build their database microservices using Postgres, and bring in the data using Fivetran. Using that as the baseline, the team created a set of tools that would allow the engineers to quickly build the staging layer of their database in the data warehouse (Snowflake), but also alert them of the consequences of removing a table or column in downstream reporting.

In this talk, Jisan shares the nuts and bolts of the designed solution and process that allowed the team to onboard 13 different microservices seamlessly, working with multiple domains and dozens of developers. The process also helped software engineers to own their own data and realize their impact. The team has saved hundreds hours of data engineering time and resources not having to chase down what changed upstream to break data. Overall, this process has helped to bring transparency to the whole data ecosystem.

Speaker: Jisan Zaman, Data Engineering Manager, Xometry

Register for Coalesce at https://coalesce.getdbt.com

talk-data.com

Activity Trend

Top Events

Top Speakers

Karin Wolok - All Things DevRel and More

Addressing The Challenges Of Component Integration In Data Platform Architectures

Unlocking Your dbt Projects With Practical Advice For Practitioners

Enhancing The Abilities Of Software Engineers With Generative AI At Tabnine

Cracking the Data Engineering Interview

#162 Scaling Data Engineering in Retail with Mohammad Sabah, SVP of Engineering & Data at Thrive Market

Shining Some Light In The Black Box Of PostgreSQL Performance

The Enterprise Big Data Framework

Data Science: The Hard Parts

Data Engineering with AWS - Second Edition

Surveying The Market Of Database Products

Supercharging analytics engineers to balance quality & speed via automated CI checks - Coalesce 2023

Enterprise MDS deployment at scale: dbt & DevOps - Coalesce 2023

Panel discussion: Fixing the data eng lifecycle - Coalesce 2023

Driving scalability at Carsales: The democratization of our data platform - Coalesce 2023

My (almost) musical career and RMIT’s journey adopting dbt - Coalesce 2023

Cost centers to cash cows: Data teams as growth drivers - Coalesce 2023

Identifying novel data issues that go undetected through CI/CD with dbt and Datafold - Coalesce 2023

Operationalizing Ramp’s data with dbt and Materialize - Coalesce 2023

Using data pipeline contract to prevent breakage in analytics reporting - Coalesce 2023