talk-data.com

Topic

Analytics

data_analysis insights metrics

4552 tagged

Activity Trend: 398 peak/qtr, 2020-Q1 to 2026-Q1

Activities

4552 activities · Newest first

Summary Working with unstructured data has typically been a motivation for a data lake. The challenge is imposing enough order on the platform to make it useful. Kirk Marple has spent years working with data systems and the media industry, which inspired him to build a platform for automatically organizing your unstructured assets to make them more valuable. In this episode he shares the goals of the Unstruk Data Warehouse, how it is architected to extract asset metadata and build a searchable knowledge graph from the information, and the myriad ways that the system can be used. If you are wondering how to deal with all of the information that doesn’t fit in your databases or data warehouses, then this episode is for you.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial.
If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription. Your host is Tobias Macey and today I’m interviewing Kirk Marple about Unstruk Data, a company that is building a data warehouse for unstructured data that offers automated data preparation via metadata enrichment, integrated compute, and graph-based search.

Interview

Introduction How did you get involved in the area of data management? Can you describe what Unstruk Data is and the story behind it? What would you classify as "unstructured data"?

What are some examples of industries that rely on large or varied sets of unstructured data? What are the challenges for analytics that are posed by the different categories of unstructured data?

What is the current state of the industry for working with unstructured data?

What are the unique capabilities that Unstruk provides and how does it integrate with the rest of the ecosystem? Where does it sit in the overall landscape of data tools?

Can you describe how the Unstruk data warehouse is implemented?

What are the assumptions that you had at the start of this project that have been challenged as you started working through the technical implementation and customer trials? How has the design and architecture evolved or changed since you began working on it?

How do you handle versioning of data, give

Check out "Telling Your Data Story" by Scott Taylor | https://amzn.to/3qaNakb

Want to attend the Master Data Marathon 3.0 hosted by Scott Taylor? Use code MDM50 for 50% off right now! https://thinklinkers.com/events/master_data_marathon_2021


Super exciting episode today. I got to interview Scott Taylor - The Data Whisperer. What a cool guy. We talked about data governance and data management. What are they? What does that even mean? We talked about Scott’s decades of experience in the data industry and how his use of branding has helped him build his career.

I’m always impressed with Scott’s branding. Let’s start with his name; the data whisperer. Great slogan, catchy, gives you an idea of what he’s about. Then there’s his truth hat, and we talk about that in the episode, and then he has puppets.


Want to break into data science? Check out my new course coming out later this summer: Data Career Jumpstart - https://www.datacareerjumpstart.com

Want to leave a question for the Ask Avery Show?

Written Mailbag: https://forms.gle/78zD544drpDAcTRV9

Audio Mailbag: https://anchor.fm/datacareerpodcast/message

Want to be on The Ask Avery Show? Sign up for a spot here:

https://calendly.com/datacareer/ask-avery?month=2021-05

Watch The Ask Avery Show Live Tuesdays at 8 PM: https://www.datacareerjumpstart.com/AskAvery

Add The Ask Avery Show to your calendar: https://calendar.google.com/calendar/ical/c_u2rk36mj5mgqg5g42glm9a741c%40group.calendar.google.com/public/basic.ics

Subscribe on YouTube: https://www.youtube.com/channel/UCuyfszBAd3gUt9vAbC1dfqA

Mentioned in this episode: Join the last cohort of 2025! The LAST cohort of The Data Analytics Accelerator for 2025 kicks off on Monday, December 8th and enrollment is officially open!

To celebrate the end of the year, we’re running a special End-of-Year Sale, where you’ll get: ✅ A discount on your enrollment 🎁 6 bonus gifts, including job listings, interview prep, AI tools + more

If your goal is to land a data job in 2026, this is your chance to get ahead of the competition and start strong.

👉 Join the December Cohort & Claim Your Bonuses: https://www.datacareerjumpstart.com/daa

Behavioral Data Analysis with R and Python

Harness the full power of the behavioral data in your company by learning tools specifically designed for behavioral data analysis. Common data science algorithms and predictive analytics tools treat customer behavioral data, such as clicks on a website or purchases in a supermarket, the same as any other data. Instead, this practical guide introduces powerful methods specifically tailored for behavioral data analysis. Advanced experimental design helps you get the most out of your A/B tests, while causal diagrams allow you to tease out the causes of behaviors even when you can't run experiments. Written in an accessible style for data scientists, business analysts, and behavioral scientists, this practical book provides complete examples and exercises in R and Python to help you gain more insight from your data immediately.

Understand the specifics of behavioral data
Explore the differences between measurement and prediction
Learn how to clean and prepare behavioral data
Design and analyze experiments to drive optimal business decisions
Use behavioral data to understand and measure cause and effect
Segment customers in a transparent and insightful way

On this episode, we chat with Tommaso Rocchi, a 2020 Master of Arts graduate of The Global Entertainment and Music Business program at Berklee College of Music in Valencia, Spain. A former college radio Music Director at the University of Padua in northern Italy, Rocchi moved on to Berklee to focus on copyright law, new business models, and data analytics.

In September 2020, Rocchi penned a Chartmetric article entitled “How Data is Redefining the Role of A&R in the Music Industry Today,” based on his research at Berklee. It traces how the field of A&R has moved from its analog roots into the digital era, and argues that data plays a significant role but should never replace the human side of how professionals operate.

Rocchi is currently a Project Manager for Data and Analytics at Netherlands-based classical music label PENTATONE.

Connect with Tommaso: LinkedIn | Twitter | Instagram Read "How Data is Redefining the Role of A&R in the Music Industry Today" here. If you want more free insights, follow our podcast, our blog, and our socials. If you're an artist with a free Chartmetric account, sign up for the artist plan, made exclusively for you, here. If you're new to Chartmetric, follow the URL above after creating a free account here.

We can watch (sort of) what users do on our sites. That's web analytics. We can ask them how they felt about the experience. That's voice of the customer. But, can we (and should we?) actually analyze their emotional reactions? On this episode, Michael and Tim sat down with Dr. Liraz Margalit, Head of Digital Behavioral Research at Clicktale, to bend their brains a bit around that very topic. And, they left the discussion thinking differently about conversion rates, and even realizing that scroll tracking might just have a valuable application! For complete show notes, including links to items mentioned in this episode and a transcript of the show, visit the show page.

This episode originally aired on June 20, 2017.

In this episode, I share my quick experience leaving for 17 days to go do service in the Dominican Republic. 

Summary Google pioneered an impressive number of the architectural underpinnings of the broader big data ecosystem. Now they offer the technologies that they run internally to external users of their cloud platform. In this episode Lak Lakshmanan enumerates the variety of services that are available for building your various data processing and analytical systems. He shares some of the common patterns for building pipelines to power business intelligence dashboards, machine learning applications, and data warehouses. If you’ve ever been overwhelmed or confused by the array of services available in the Google Cloud Platform then this episode is for you.

Announcements

Your host is Tobias Macey and today I’m interviewing Lak Lakshmanan about the suite of services for data and analytics in Google Cloud Platform.

Interview

Introduction How did you get involved in the area of data management? Can you start by giving an overview of the tools and products that are offered as part of Google Cloud for data and analytics?

How do the various systems relate to each other for building a full workflow? How do you balance the need for clean integration between services with the need to make them useful in isolation when used as a single component of a data platform?

What have you found to be the primary motivators for customers who are adopting GCP for some or all of their data workloads? What are some of the challenges that new users of GCP encounter when working with the data and analytics products that it offers? What are the systems that you have found to be easiest to work with?

Which are the most challenging to work with, whether due to the kinds of problems that they are solving for, or due to their user experience design?

How has your work with customers fed back into the products that you are building on top of? What are some examples of architectural or software patterns that are unique to the GCP product suite? What are the most interesting, innovative, or unexpected ways that y

podcast_episode
by Mike Brisson (Moody's Analytics), Mark Zandi (Moody's Analytics)

Mike Brisson, Senior Economist at Moody's Analytics, joins Mark Zandi and the Moody's Analytics team to discuss the latest CPI report, the labor market, homebuyer perceptions, and the vehicle market.

Questions or comments? Please email us at [email protected]. We would love to hear from you. To stay informed and follow the insights of Moody's Analytics economists, visit Economic View.

Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.

Expert Data Modeling with Power BI

Expert Data Modeling with Power BI provides a comprehensive guide to creating effective and optimized data models using Microsoft Power BI. This book will teach you everything you need to know, from connecting to data sources to setting up complex models that enable insightful reporting and business analytics.

What this Book will help me do
Gain expertise in implementing virtual tables and time intelligence functionalities in Power BI's DAX language.
Identify and correctly set up Dimension and Fact tables using the Power Query Editor interface.
Master advanced data preparation techniques to build efficient Star Schemas for modeling.
Apply best practices for preparing and modeling data for real-world business cases.
Become proficient in advanced features like aggregations, incremental refresh, and row-level security.

Author(s)
Soheil Bakhshi is a seasoned Power BI expert and author with years of experience in business intelligence and analytics. His practical knowledge of data modeling and approachable writing style make complex concepts understandable. Soheil's passion for empowering users to harness the full potential of Power BI is evident through his clear guidance and real-world examples.

Who is it for?
This book is perfect for business intelligence developers, data analysts, and advanced users of Power BI who aim to deepen their understanding of data modeling. It assumes a familiarity with Power BI's basic functions and core concepts like Star Schema. If you're looking to refine your modeling practices and create versatile, dynamic solutions, this resource is for you.

podcast_episode
by Dante DeAntonio (Moody's Analytics), Cris deRitis, Mark Zandi (Moody's Analytics), Ryan Sweet

Dante DeAntonio, Senior Economist at Moody's Analytics, joins Mark Zandi to discuss the May U.S. employment report and the state of the labor market.

Summary SQL is the most widely used language for working with data, and yet the tools available for writing and collaborating on it are still clunky and inefficient. Frustrated with the lack of a modern IDE and collaborative workflow for managing the SQL queries and analysis of their big data environments, the team at Pinterest created Querybook. In this episode Justin Mejorada-Pier and Charlie Gu share the story of how the initial prototype for a data catalog ended up as one of their most widely used interfaces to their analytical data. They also discuss the unique combination of features that it offers, how it is implemented, and the path to releasing it as open source. Querybook is an impressive and unique piece of technology that is well worth exploring, so listen and try it out today.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Firebolt is the fastest cloud data warehouse. Visit dataengineeringpodcast.com/firebolt to get started. The first 25 visitors will receive a Firebolt t-shirt. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription. Your host is Tobias Macey and today I’m interviewing Justin Mejorada-Pier and Charlie Gu about Querybook, an open source IDE for your big data projects.

Interview

Introduction How did you get involved in the area of data management? Can you describe what Querybook is and the story behind it? What are the main use cases or workflows that Querybook is designed for?

What are the shortcomings of dashboarding/BI tools that make something like Querybook necessary?

The tag line calls out the fact that Querybook is an IDE for "big data". What are the manifestations of that focus in the feature set and user experience? Who are the target users of Querybook and how does that inform the feature priorities and user experience? Can you describe how Querybook is architected?

How have the goals and design changed or evolved since you first began working on it? What were some of the assumptions or design choices that you had to unwind in the process of open sourcing it?

What is the workflow for someone building a DataDoc with Querybook?

What is the experience of working as a collaborator on an analysis?

How do you handle lifecycle management of query results? What are your thoughts on the potential for extending Querybook beyond SQL-oriented analysis and integrating something like Jupyter kernels? What are the most interesting, innovative, or unexpected ways that you have seen Querybook used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Querybook? When is Querybook the wrong choice? What do you have planned for the future of Querybook?

Contact Info

Justin

Link

On this episode, we talk to Adam Kanwal about the article he wrote for the Chartmetric blog, "How to Promote Your Music in Southeast Asian Trigger Cities." Kanwal is a Digital Marketing and Analytics Consultant, working with artists including Amorphous, Still Woozy, Remi Wolf, Suzuki Saint, and Miss Madeline to analyze TikTok and YouTube trajectories, building campaigns from the ground up. He’s formed partnerships with over one hundred influencers globally, and has also served as the Digital Marketing Specialist for Shifted Recording in New York City.

He’s a 2021 graduate of Cornell University in New York, with a background in Human Development and minors in International Relations and Music. His primary interests are in creating psychologically smart, culturally relevant, and globally reverberating digital marketing campaigns for up-and-coming musical artists. Read "How to Promote Your Music in Southeast Asian Trigger Cities" here. If you want more free insights, follow our podcast, our blog, and our socials. If you're an artist with a free Chartmetric account, sign up for the artist plan, made exclusively for you, here. If you're new to Chartmetric, follow the URL above after creating a free account here.

Summary Every part of the business relies on data, yet only a small team has the context and expertise to build and maintain workflows and data pipelines to transform, clean, and integrate it. In order for the true value of your data to be realized without burning out your engineers, you need a way for everyone to get access to the information they care about. To help make that a more tractable problem, Blake Burch co-founded Shipyard. In this episode he explains the utility of a low-code solution that lets non-engineers create their own self-serve pipelines, how the Shipyard platform is designed to make that possible, and how it allows engineers to create reusable tasks to satisfy the specific needs of the business. This is an interesting conversation about how to make data more accessible and more useful by improving the user experience of the tools that we create.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and Zendesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today. When it comes to serving data for AI and ML projects, do you feel like you have to rebuild the plane while you’re flying it across the ocean? Molecula is an enterprise feature store that operationalizes advanced analytics and AI in a format designed for massive machine-scale projects without having to manage endless one-off information requests. With Molecula, data engineers manage one single feature store that serves the entire organization with millisecond query performance whether in the cloud or at your data center. And since it is implemented as an overlay, Molecula doesn’t disrupt legacy systems.
High-growth startups use Molecula’s feature store because of its unprecedented speed, cost savings, and simplified access to all enterprise data. From feature extraction to model training to production, the Molecula feature store provides continuously updated feature access, reuse, and sharing without the need to pre-process data. If you need to deliver unprecedented speed, cost savings, and simplified access to large scale, real-time data, visit dataengineeringpodcast.com/molecula and request a demo. Mention that you’re a Data Engineering Podcast listener, and they’ll send you a free t-shirt. Your host is Tobias Macey and today I’m interviewing Blake Burch about Shipyard, and his mission to create the easiest way for data teams to launch, monitor, and share resilient pipelines with less engineering.

Interview

Introduction How did you get involved in the area of data management? Can you describe what you are building at Shipyard and the story behind it? What are the main goals that you have for Shipyard?

How does it compare to other data orchestration frameworks in the market?

Who are

Mastering Tableau 2021 - Third Edition

Tableau 2021 brings a wide range of tools and techniques for mastering data visualization and business intelligence. In this book, you will delve into the advanced methodologies to fully utilize Tableau's capabilities. Whether you're dealing with geo-spatial, time-series analytics, or complex dashboards, this resource provides expertise through real-world data challenges.

What this Book will help me do
Draw connections between multiple databases and create insightful Tableau dashboards.
Master advanced data visualization techniques that lead to impactful storytelling.
Understand Tableau's integration with programming languages such as Python and R.
Analyze datasets with time-series and geo-spatial methods to gain predictive insights.
Leverage Tableau Prep Builder for efficient data cleaning and transformation processes.

Author(s)
Marleen Meier and David Baldwin are seasoned professionals in business intelligence and data analytics. They bring years of practical experience and have helped numerous organizations worldwide transform their data visualization strategies using Tableau. Their collaborative approach ensures a comprehensive, beginner to advanced learning experience.

Who is it for?
This book is perfect for business intelligence analysts, data analysts, and industry professionals who are already familiar with Tableau's basics and wish to expand their knowledge. It provides advanced techniques and implementations of Tableau for improving data storytelling and dashboard performance. Readers seeking to connect Tableau with external programming tools will also greatly benefit from this guide.

podcast_episode
by Cris deRitis, Mark Zandi (Moody's Analytics), Jim Parrott (Urban Institute), Ryan Sweet

Jim Parrott of the Urban Institute joins Mark Zandi and the Moody's Analytics team to discuss the state of the U.S. housing market and the supply issues it's facing. We also discuss this week's key economic data.

Summary The data warehouse has become the focal point of the modern data platform. With increased usage of data across businesses, and a diversity of locations and environments where data needs to be managed, the warehouse engine needs to be fast and easy to manage. Yellowbrick is a data warehouse platform that was built from the ground up for speed, and can work across clouds and all the way to the edge. In this episode CTO Mark Cusack explains how the engine is architected, the benefits that speed and predictable pricing have for the organization, and how you can simplify your platform by putting the warehouse close to the data, instead of the other way around.

Announcements

Your host is Tobias Macey and today I’m interviewing Mark Cusack about Yellowbrick, a data warehouse designed for distributed clouds.

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what Yellowbrick is and some of the story behind it? What does the term "distributed cloud" signify and what challenges are associated with it? How would you characterize Yellowbrick’s position in the database/DWH market? How is Yellowbrick architected?

How have the goals and design of the platform changed or evolved over time?

How does Yellowbrick maintain visibility across the different data locations that it is responsible for?

What capabilities does it offer for being able to join across the disparate "clouds"?

What are some data modeling strategies that users should consider when designing their deployment of Yellowbrick?

What are some of the capabilities of Yellowbrick that you find most useful or technically interesting?

For someone who is adopting Yellowbrick, what is the process for getting it integrated into their data systems?

What are the most underutilized, overlooked, or misunderstood features of Yellowbrick?

What are the most interesting, innovative, or unexpected ways that you have seen Yellowbrick used?

What are the most interesting, unexpected, or challenging lessons that you have learned while working on and with Yellowbrick?

When is Yellowbrick the wrong choice?

What do you have planned for the future of the product?

Contact Info

LinkedIn

@markcusack on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Yellowbrick

Teradata

Rainstor

Distributed Cloud

Hybrid Cloud

SwimOS

Podcast Episode

K

Summary Machine learning models use vectors as the natural mechanism for representing their internal state. The problem is that in order for the models to integrate with external systems their internal state has to be translated into a lower dimension. To eliminate this impedance mismatch Edo Liberty founded Pinecone to build a database that works natively with vectors. In this episode he explains how this technology will allow teams to accelerate the speed of innovation, how vectors make it possible to build more advanced search functionality, and how Pinecone is architected. This is an interesting conversation about how reconsidering the architecture of your systems can unlock impressive new capabilities.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.

When it comes to serving data for AI and ML projects, do you feel like you have to rebuild the plane while you’re flying it across the ocean? Molecula is an enterprise feature store that operationalizes advanced analytics and AI in a format designed for massive machine-scale projects without having to manage endless one-off information requests. With Molecula, data engineers manage one single feature store that serves the entire organization with millisecond query performance whether in the cloud or at your data center. And since it is implemented as an overlay, Molecula doesn’t disrupt legacy systems. High-growth startups use Molecula’s feature store because of its unprecedented speed, cost savings, and simplified access to all enterprise data. From feature extraction to model training to production, the Molecula feature store provides continuously updated feature access, reuse, and sharing without the need to pre-process data. If you need to deliver unprecedented speed, cost savings, and simplified access to large scale, real-time data, visit dataengineeringpodcast.com/molecula and request a demo. Mention that you’re a Data Engineering Podcast listener, and they’ll send you a free t-shirt.

Your host is Tobias Macey and today I’m interviewing Edo Liberty about Pinecone, a vector database for powering machine learning and similarity search.

Interview

Introduction

How did you get involved in the area of data management?

Can you start by describing what Pinecone is and the story behind it?

What are some of the contexts where someone would want to perform a similarity search?

What are the considerations that someone should be aware of when deciding between Pinecone and Solr/Lucene for a search oriented use case?

What are some of the other use cases that Pinecone enables?

In the absence of Pinecone, what kinds of systems and solutions are people b

Architecting Data-Intensive SaaS Applications

Through explosive growth in the past decade, data now drives significant portions of our lives, from crowdsourced restaurant recommendations to AI systems identifying effective medical treatments. Software developers have an unprecedented opportunity to build data applications that generate value from massive datasets across use cases such as customer 360, application health and security analytics, the IoT, machine learning, and embedded analytics. With this report, product managers, architects, and engineering teams will learn how to make key technical decisions when building data-intensive applications, including how to implement extensible data pipelines and share data securely. The report includes design considerations for making these decisions and uses the Snowflake Data Cloud to illustrate best practices.

This report explores:

Why data applications matter: Get an introduction to data applications and some of the most common use cases

Evaluating platforms for building data apps: Evaluate modern data platforms to confidently consider the merits of potential solutions

Building scalable data applications: Learn design patterns and best practices for storage, compute, and security

Handling and processing data: Explore techniques and real-world examples for building data pipelines to support data applications

Designing for data sharing: Learn best practices for sharing data in modern data applications

An interview I did with Matt Sharp, currently a data engineer at MX and a LinkedIn data personality. We talk about Matt’s journey from chemical engineering to working at Intel Micron, becoming a data scientist, and finally switching into FinTech as a data engineer. We also cover the differences between small and big companies and the pros and cons of each, what data engineering even is, the importance of projects, how LinkedIn can help you be an intrapreneur and network within your company, and more!

Connect with Matthew on LinkedIn: https://www.linkedin.com/in/matthew-sharp-813b1846/

Subscribe on YouTube: https://www.youtube.com/channel/UCuyfszBAd3gUt9vAbC1dfqA

Want to leave a question for the Ask Avery Show?

Written Mailbag: https://forms.gle/78zD544drpDAcTRV9

Audio Mailbag: https://anchor.fm/datacareerpodcast/message

Want to be on The Ask Avery Show? Sign up for a spot here:

https://calendly.com/datacareer/ask-avery?month=2021-05

Watch The Ask Avery Show live Tuesdays at 8 PM: https://www.datacareerjumpstart.com/AskAvery

Add The Ask Avery Show to your calendar: https://calendar.google.com/calendar/ical/c_u2rk36mj5mgqg5g42glm9a741c%40group.calendar.google.com/public/basic.ics

Mentioned in this episode: Join the last cohort of 2025! The LAST cohort of The Data Analytics Accelerator for 2025 kicks off on Monday, December 8th and enrollment is officially open!

To celebrate the end of the year, we’re running a special End-of-Year Sale, where you’ll get:

✅ A discount on your enrollment

🎁 6 bonus gifts, including job listings, interview prep, AI tools + more

If your goal is to land a data job in 2026, this is your chance to get ahead of the competition and start strong.

👉 Join the December Cohort & Claim Your Bonuses: https://www.datacareerjumpstart.com/daa