talk-data.com

Topic

BI

Business Intelligence (BI)

data_visualization reporting analytics

1211

tagged

Activity Trend: 111 peak/qtr, 2020-Q1 to 2026-Q1

Activities

1211 activities · Newest first

Summary

Databases are the core of most applications, whether transactional or analytical. In recent years the selection of database products has exploded, making the critical decision of which engine(s) to use even more difficult. In this episode Tanya Bragin shares her experiences as a product manager for two major vendors and the lessons that she has learned about how teams should approach the process of tool selection.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack

You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!

This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold

Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains, even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at dataengineeringpodcast.com/miro. That’s three free boards at dataengineeringpodcast.com/miro.

Your host is Tobias Macey and today I'm interviewing Tanya Bragin about her views on the database products market.

Interview

Introduction

How did you get involved in the area of data management?

What are the aspects of the database market that keep you interested as a VP of product?

How have your experiences at Elastic informed your current work at ClickHouse?

What are the main product categories for databases today?

What are the industry trends that have the most impact on the development and growth of different product categories? Which categories do you see growing the fastest?

When a team is selecting a database technology for a given task, what are the types of questions that they should be asking?

Transactional engines like Postgres, SQL Server, Oracle, etc. were long used

Enterprise MDS deployment at scale: dbt & DevOps - Coalesce 2023

Behind any good DataOps within a Modern Data Stack (MDS) architecture is a solid DevOps design! This is particularly pressing when building an MDS solution at scale, as reliability, quality and availability of data require a very high degree of process automation while remaining fast, agile and resilient to change when addressing business needs.

While DevOps in data engineering is nothing new, applying it to a broad-spectrum solution that includes the data warehouse, BI, etc. has seemed either a bit out of reach due to overall complexity and cost, or has simply been overlooked due to perceived scaling issues often attributed to the challenges of automating CI/CD processes. However, this is changing fast: tools such as dbt offer flexible, cutting-edge features around pre-commits, Slim CI, etc. that allow a very high degree of autonomy in CI/CD processes with relative ease.

In this session, Datatonic covers the challenges of building and deploying enterprise-grade MDS solutions for analytics at scale and how they have used dbt to address them, especially to bring near-complete autonomy to the CI/CD processes!

Speaker: Ash Sultan, Lead Data Architect, Datatonic

Register for Coalesce at https://coalesce.getdbt.com

How Canadian Football League’s data team runs marketing plays with dbt & RudderStack - Coalesce 2023

Leveraging the power of RudderStack and dbt, Canadian Football League’s (CFL) data team abstracts the complexity of data in the warehouse and provides their marketing team with highly targeted audiences across a large variety of platforms and data sources. During this session we’ll hear how CFL went from manually sharing CSV files to modeling targeted segments, directly focused on OKRs, in their warehouse.

Speakers: Eric Dodds, Head of Product Marketing, RudderStack; Dave Musambi, Sr. Director, Business Intelligence, Canadian Football League

Lazy devs unite! Building a data ecosystem that spoils data engineers - Coalesce 2023

Join Ryan Dolley and Jan Soubusta for a journey into the world of end-to-end analytics pipelines and how they can be a data engineer's best friend.

Learn how to automate boring tasks and create a safe haven for data engineers using the dynamic duo: dbt for transformative magic and GoodData for analytics awesomeness.

Combined with data extraction and orchestration tools, they form the Voltron of easy-to-automate end-to-end analytics flows, bringing data from source systems all the way through BI to your end users.

And don't miss the grand finale where they reveal an alternative deployment on dbt Cloud that's so easy to orchestrate your coffee mug could do it. Prepare to laugh, learn, and level up your data game!

Speakers: Ryan Dolley, VP of Product Strategy, GoodData; Jan Soubusta, Distinguished Software Engineer, GoodData

Need for speed (and less spending): The story of finance data at Snowflake - Coalesce 2023

Take a look at how the finance data team at Snowflake leverages dbt to drive strategic decision-making within the finance and accounting organizations. This talk spans topics including development velocity, cost governance, and data stewardship.

Speakers: Sandra Herchen, Analytics Engineering Manager, Snowflake; Jack Peele, Business Intelligence Analyst, Snowflake

Self serve: The (elusive) holy grail of data teams - Coalesce 2023

The self-serve revolution remains unrealized despite BI tools from Business Objects to Tableau and Looker all aiming for that holy grail. What went wrong? This talk argues that for self-serve to work, it has to work the way humans already work.

Speaker: Paul Blankley, CTO / Co-founder, Zenlytic

The need for a new approach to the semantic layer - Coalesce 2023

Artyom Keydunov, Co-founder & CEO of Cube, discusses the future of the semantic layer, uniting BI tools, embedded analytics, and AI agents. This new approach brings together data, enabling data tools to introspect data model definitions and seamlessly interoperate within the data stack. By embracing this new approach, data practitioners will reap the benefits with enhanced user experiences, quicker data delivery, and streamlined workloads. See how a semantic layer can harmonize your data tools to drive maximum impact for yourself and end-users.

Speaker: Artyom Keydunov, Co-founder & CEO, Cube Dev

Taming BI sprawl with dbt: From 0 to 600 models in 18 months - Coalesce 2023

Hear how the data team at Element Biosciences is tackling BI sprawl while rapidly scaling data models using dbt. They share three use cases along with the tools that enabled their teams to align on crucial metrics to make data-driven decisions and unlock tangible business benefits.

Speaker: Matthew Hoss, Sr. Manager, Business Systems, Element Biosciences

Is AI the new AE? - Coalesce 2023

ChatGPT is the talk of the town, with everyone asking: are the robots finally coming for our jobs? Join this panel of data leaders to hear why they believe that ChatGPT and other AI tools are actually an analytics engineer's new best friend.

Speakers: Kate Schiffelbein, Head of Business Intelligence, Northbeam; Lindsay Murphy, Head of Data, Secoda; Patrick Ross, Solutions Architect, Data Clymer

Register for Coalesce at https://coalesce.getdbt.com/

Summary

The primary application of data has moved beyond analytics. With the broader audience comes the need to present data in a more approachable format. This has led to the broad adoption of data products as the delivery mechanism for information. In this episode Ranjith Raghunath shares his thoughts on how to build a strategy for the development, delivery, and evolution of data products.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

As more people start using AI for projects, two things are clear: It’s a rapidly advancing field, but it’s tough to navigate. How can you get the best results for your use case? Instead of being subjected to a bunch of buzzword bingo, hear directly from pioneers in the developer and data science space on how they use graph tech to build AI-powered apps. Attend the dev and ML talks at NODES 2023, a free online conference on October 26 featuring some of the brightest minds in tech. Check out the agenda and register today at Neo4j.com/NODES.

Your host is Tobias Macey and today I'm interviewing Ranjith Raghunath about tactical elements of a data product strategy.

Interview

Introduction

How did you get involved in the area of data management?

Can you describe what is encompassed by the idea of a data product strategy?

Which roles in an organization need to be involved in the planning and implementation of that strategy?

order of operations:

strategy -> platform design -> implementation/adoption

platform implementation -> product strategy -> interface development

managing grain of data in products

team organization to support product development/deployment

customer communications - what questions to ask?

requirements gathering, helping to understand "the art of the possible"

What are the most interesting, innovative, or unexpected ways that you have seen organizations approach data product strategies?

What are the most interesting, unexpected, or challenging lessons that you have learned while working on

Summary

Building streaming applications has gotten substantially easier over the past several years. Despite this, it is still operationally challenging to deploy and maintain your own stream processing infrastructure. Decodable was built with a mission of eliminating all of the painful aspects of developing and deploying stream processing systems for engineering teams. In this episode Eric Sammer discusses why more companies are including real-time capabilities in their products and the ways that Decodable makes it faster and easier.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Your host is Tobias Macey and today I'm interviewing Eric Sammer about starting your stream processing journey with Decodable.

Interview

Introduction

How did you get involved in the area of data management?

Can you describe what Decodable is and the story behind it?

What are the notable changes to the Decodable platform since we last spoke? (October 2021)

What are the industry shifts that have influenced the product direction?

What are the problems that customers are trying to solve when they come to Decodable?

When you launched your focus was on SQL transformations of streaming data. What was the process for adding full Java support in addition to SQL?

What are the developer experience challenges that are particular to working with streaming data?

How have you worked to address that in the Decodable platform and interfaces?

As you evolve the technical and product direction, what is your heuristic for balancing the unification of interfaces and system integration against the ability to swap different components or interfaces as new technologies are introduced?

What are the most interesting, innovative, or unexpected ways that you have seen Decodable used?

What are the most interesting, unexpected, or challenging lessons that you have learned while working on Decodable?

When is Decodable the wrong choice?

What do you have planned for the future of Decodable?

Contact Info

esammer on GitHub

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

Decodable (Podcast Episode)

Understanding the Apache Flink Journey

Flink (Podcast Episode)

Debezium (Podcast Episode)

Kafka

Redpanda (Podcast Episode)

Kinesis

PostgreSQL (Podcast Episode)

Snowflake (Podcast Episode)

Databricks

Startree

Pinot (Podcast Episode)

Rockset (Podcast Episode)

Druid

InfluxDB

Samza

Storm

Pulsar (Podcast Episode)

ksqlDB (Podcast Episode)

dbt

GitHub Actions

Airbyte

Singer

Splunk

Outbox Pattern

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

Neo4J

NODES 2023 is a free online conference focused on graph-driven innovations with content for all skill levels. Its 24 hours are packed with 90 interactive technical sessions from top developers and data scientists across the world covering a broad range of topics and use cases. The event tracks:

- Intelligent Applications: APIs, Libraries, and Frameworks – Tools and best practices for creating graph-powered applications and APIs with any software stack and programming language, including Java, Python, and JavaScript
- Machine Learning and AI – How graph technology provides context for your data and enhances the accuracy of your AI and ML projects (e.g.: graph neural networks, responsible AI)
- Visualization: Tools, Techniques, and Best Practices – Techniques and tools for exploring hidden and unknown patterns in your data and presenting complex relationships (knowledge graphs, ethical data practices, and data representation)

Don’t miss your chance to hear about the latest graph-powered implementations and best practices for free on October 26 at NODES 2023. Go to Neo4j.com/NODES today to see the full agenda and register!

Rudderstack

Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack

Materialize

You shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date.

That is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing.

Go to materialize.com today and get 2 weeks free!

Datafold

This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare…

Join Avery Smith and Jordan Temple in this episode of the Data Career Podcast.

Jordan shares his insights on breaking into the data industry with a non-technical background, leveraging skills in Power BI and the power of a solid online presence.

Tune in now to gain inspiration and guidance for your data career journey.

Connect with Jordan Temple:

🤝 Connect on LinkedIn

🤝 Ace your data analyst interview with the interview simulator

📩 Get my weekly email with helpful data career tips

📊 Come to my next free “How to Land Your First Data Job” training

🏫 Check out my 10-week data analytics bootcamp

Timestamps:

(07:02) - Excel + Power BI

(12:20) - Landing a job with a DM

(15:10) - Hybrid > Remote

Connect with Avery:

📺 Subscribe on YouTube

🎙Listen to My Podcast

👔 Connect with me on LinkedIn

📸 Instagram

🎵 TikTok

Mentioned in this episode: Join the last cohort of 2025! The LAST cohort of The Data Analytics Accelerator for 2025 kicks off on Monday, December 8th and enrollment is officially open!

To celebrate the end of the year, we’re running a special End-of-Year Sale, where you’ll get: ✅ A discount on your enrollment 🎁 6 bonus gifts, including job listings, interview prep, AI tools + more

If your goal is to land a data job in 2026, this is your chance to get ahead of the competition and start strong.

👉 Join the December Cohort & Claim Your Bonuses: https://www.datacareerjumpstart.com/daa

Summary

The insurance industry is notoriously opaque and hard to navigate. Max Cho found that fact frustrating enough that he decided to build a business of making policy selection more navigable. In this episode he shares his journey of data collection and analysis and the challenges of automating an intentionally manual industry.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Your host is Tobias Macey and today I'm interviewing Max Cho about the wild world of insurance companies and the challenges of collecting quality data for this opaque industry.

Interview

Introduction

How did you get involved in the area of data management?

Can you describe what CoverageCat is and the story behind it?

What are the different sources of data that you work with?

What are the most challenging aspects of collecting that data?

Can you describe the formats and characteristics (3 Vs) of that data?

What are some of the ways that the operational model of insurance companies has contributed to its opacity as an industry from a data perspective?

Can you describe how you have architected your data platform?

How have the design and goals changed since you first started working on it?

What are you optimizing for in your selection and implementation process?

What are the sharp edges/weak points that you worry about in your existing data flows?

How do you guard against those flaws in your day-to-day operations?

What are the

Join Avery for a captivating episode with data consulting expert Leon Gordon as they delve into the world of data consulting and the skills needed to excel in this field.

Don't miss out on this fascinating conversation filled with valuable insights and advice for data professionals looking to make a difference as consultants.

Tune in now to better understand the data consulting industry and how you can navigate your way to a successful career!

Connect with Leon Gordon:

🤝 Connect on LinkedIn

🎒 Learn About Onyx Data

🤝 Ace your data analyst interview with the interview simulator

📩 Get my weekly email with helpful data career tips

📊 Come to my next free “How to Land Your First Data Job” training

🏫 Check out my 10-week data analytics bootcamp

Timestamps:

(5:56) - Data Consulting

(9:45) - Power BI

(25:00) - Data Projects

Summary

Artificial intelligence applications require substantial high-quality data, which is provided through ETL pipelines. Now that AI has reached the level of sophistication seen in the various generative models, it is being used to build new ETL workflows. In this episode Jay Mishra shares his experiences and insights building ETL pipelines with the help of generative AI.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Your host is Tobias Macey and today I'm interviewing Jay Mishra about the applications for generative AI in the ETL process.

Interview

Introduction
How did you get involved in the area of data management?
What are the different aspects/types of ETL that you are seeing generative AI applied to?
What kind of impact are you seeing in terms of time spent/quality of output/etc.?
What kinds of projects are most likely to benefit from the application of generative AI?
Can you describe what a typical workflow of using AI to build ETL workflows looks like?
What are some of the types of errors that you are likely to experience from the AI?
Once the pipeline is defined, what does the ongoing maintenance look like?
Is the AI required to operate within the pipeline in perpetuity?
For individuals/teams/organizations who are experimenting with AI in their data engineering workflows, what are the concerns/questions that they are trying to address?
What are the most interesting, innovative, or unexpected w

Summary

The rapid growth of machine learning, especially large language models, has led to a commensurate growth in the need to store and compare vectors. In this episode Louis Brandy discusses the applications for vector search capabilities both in and outside of AI, as well as the challenges of maintaining real-time indexes of vector data.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
If you’re a data person, you probably have to jump between different tools to run queries, build visualizations, write Python, and send around a lot of spreadsheets and CSV files. Hex brings everything together. Its powerful notebook UI lets you analyze data in SQL, Python, or no-code, in any combination, and work together with live multiplayer and version control. And now, Hex’s magical AI tools can generate queries and code, create visualizations, and even kickstart a whole analysis for you – all from natural language prompts. It’s like having an analytics co-pilot built right into where you’re already doing your work. Then, when you’re ready to share, you can use Hex’s drag-and-drop app builder to configure beautiful reports or dashboards that anyone can use. Join the hundreds of data teams like Notion, AllTrails, Loom, Mixpanel and Algolia using Hex every day to make their work more impactful. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial of the Hex Team plan!
Your host is Tobias Macey and today I'm interviewing Louis Brandy about building vector indexes in real-time for analytics and AI applications.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what vector search is and how it differs from other search technologies?
What are the technical challenges related to providing vector search?
What are the applications for vector search that merit the added complexity?

Vector databases have been gaining a lot of attention recently with the proliferation of LLM applicati

Learning and Operating Presto

The Presto community has mushroomed since its origins at Facebook in 2012. But ramping up this open source distributed SQL query engine can be challenging even for the most experienced engineers. With this practical book, data engineers and architects, platform engineers, cloud engineers, and software engineers will learn how to use Presto operations at your organization to derive insights on datasets wherever they reside. Authors Angelica Lo Duca, Tim Meehan, Vivek Bharathan, and Ying Su explain what Presto is, where it came from, and how it differs from other data warehousing solutions. You'll discover why Facebook, Uber, Alibaba Cloud, Hewlett Packard Enterprise, IBM, Intel, and many more use Presto and how you can quickly deploy Presto in production.

With this book, you will:
Learn how to install and configure Presto
Use Presto with business intelligence tools
Understand how to connect Presto to a variety of data sources
Extend Presto for real-time business insight
Learn how to apply best practices and tuning
Get troubleshooting tips for logs, error messages, and more
Explore Presto's architectural concepts and usage patterns
Understand Presto security and administration

Today I’m joined by Anthony Deighton, General Manager of Data Products at Tamr. Throughout our conversation, Anthony unpacks his definition of a data product and we discuss whether or not he feels that Tamr itself is actually a data product. Anthony shares his views on why it’s so critical to focus on solving for customer needs and not simply the newest and shiniest technology. We also discuss the challenges that come with building a product that’s designed to facilitate the creation of better internal data products, as well as where we are in this new wave of data product management, and the evolution of the role.

Highlights/ Skip to:

I introduce Anthony, General Manager of Data Products at Tamr, and the topics we’ll be discussing today (00:37)
Anthony shares his observations on how BI analytics are an inch deep and a mile wide due to the data that’s being input (02:31)
Tamr’s focus on data products and how that reflects in Anthony’s recent job change from Chief Product Officer to General Manager of Data Products (04:35)
Anthony’s definition of a data product (07:42)
Anthony and I explore whether he feels that decision support is necessary for a data product (13:48)
Whether or not Anthony feels that Tamr qualifies as a data product (17:08)
Anthony speaks to the importance of focusing on outcomes and benefits as opposed to endlessly knitting together features and products (19:42)
The challenges Anthony sees with metrics like Propensity to Churn (21:56)
How Anthony thinks about design in a product like Tamr (30:43)
Anthony shares how data science at Tamr is a tool in his toolkit and not viewed as a “fourth” leg of the product triad/stool (36:01)
Anthony’s views on where we are in the evolution of the DPM role (41:25)
What Anthony would do differently if he could start over at Tamr knowing what he knows now (43:43)

Links
Tamr: https://www.tamr.com/
Innovating: https://www.amazon.com/Innovating-short-guide-making-things/dp/B0C8R79PVB
The Mom Test: https://www.amazon.com/The-Mom-Test-Rob-Fitzpatrick-audiobook/dp/B07RJZKZ7F
LinkedIn: https://www.linkedin.com/in/anthonydeighton/

In data science, the push for unbiased machine learning models is evident. So much effort goes into ensuring the products we create are built thoughtfully and correctly, but are we investing the same effort in ensuring our teams, the very architects of these models, are diverse and inclusive? Bias in data can lead to skewed results, and similarly, a lack of diversity in teams can result in narrow perspectives. As we prioritize building diversity and inclusion into our data, it's equally crucial to embed these principles within our teams. So, who is best equipped to guide us in integrating DEI from a data perspective?

Tracy Daniels is the Chief Data Officer for Truist Financial Corporation. She leads the team responsible for Truist’s enterprise data capabilities, including strategy, governance, data platform delivery, client, master & reference data, and the centers of excellence for business intelligence visualization and artificial intelligence & machine learning. She is also the executive sponsor for Truist’s Enterprise Technology & Operations Diversity Council. Daniels joined Truist in 2018. She has more than 25 years of banking and technology experience leading high-performing technology portfolio, development, infrastructure and global operations organizations. Tracy enjoys participating in civic and philanthropic endeavors including serving on the Georgia State University Foundation Board of Trustees. She has been recognized as a National 2013 WOC STEM Rising Star award recipient, the 2017 Working Mother magazine Mother of the Year recipient, and a 2021 Women In Technology (WIT) Women of the Year in STEAM finalist.
In the episode Tracy and Richie discuss Truist's approach to Diversity, Equity, and Inclusion (DEI) and its alignment with the company's purpose and values, the distinction between diversity and inclusion, the positive outcomes of implementing DEI correctly, the importance of not missing opportunities both externally with customers and internally with talent, the significance of aligning diversity programs with business metrics and hiring to promote DEI, considerations for job advertisements that appeal to a diverse audience, and much more.

Links mentioned in the show:
McKinsey on Diversity and Inclusion
Brookings Piece on Mitigating Bias in Data
Algorithmic Justice League
European Legislation on Data and Diversity
Course: AI Ethics
Radar: Data & AI Literacy Edition

Summary

A significant amount of time in data engineering is dedicated to building connections and semantic meaning around pieces of information. Linked data technologies provide a means of tightly coupling metadata with raw information. In this episode Brian Platz explains how JSON-LD can be used as a shared representation of linked data for building semantic data products.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
If you’re a data person, you probably have to jump between different tools to run queries, build visualizations, write Python, and send around a lot of spreadsheets and CSV files. Hex brings everything together. Its powerful notebook UI lets you analyze data in SQL, Python, or no-code, in any combination, and work together with live multiplayer and version control. And now, Hex’s magical AI tools can generate queries and code, create visualizations, and even kickstart a whole analysis for you – all from natural language prompts. It’s like having an analytics co-pilot built right into where you’re already doing your work. Then, when you’re ready to share, you can use Hex’s drag-and-drop app builder to configure beautiful reports or dashboards that anyone can use. Join the hundreds of data teams like Notion, AllTrails, Loom, Mixpanel and Algolia using Hex every day to make their work more impactful. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial of the Hex Team plan!
Your host is Tobias Macey and today I'm interviewing Brian Platz about using JSON-LD for building linked-data products.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what the term "linked data product" means and some examples of when you might build one?
What is the overlap between knowledge graphs and "linked data products"?
What is JSON-LD?
What are the domains in which it is typically used?
How does it assist in developing linked data products?

what are the characterist