talk-data.com

Topic: Data Engineering

Tags: etl, data_pipelines, big_data

Activity Trend: peak of 127 per quarter (2020-Q1 to 2026-Q1)

Activities

1127 activities · Newest first

John O'Gorman and I discussed his career as a very early data OG, how he might have created one of the first analytical stores, the fundamentals of semantics, data products, and much more.


If you like this show, give it a 5-star rating on your favorite podcast platform.

Purchase Fundamentals of Data Engineering at your favorite bookseller.

Check out my substack: https://joereis.substack.com/

Extreme Self-Service: Turning Data Consumers into Data Constructors | Whatnot

ABOUT THE TALK: Small data teams face supply and demand problems. Triaging and prioritizing data work can be overwhelming. But what if data consumers could create their own products with minimal training?

Learn how to empower data consumers without disrupting others. Discover lessons from an 'extreme' self-service analytics approach: best practices, fostering a data community, promoting SQL literacy, and establishing solid guard rails.

ABOUT THE SPEAKER: Alice Leach is a Data Engineer at Whatnot Inc., a live stream platform and marketplace that enables collectors and enthusiasts to connect, buy, and sell verified products. She transitioned from academia to data in 2021, working first as a data scientist then data engineer. Her current work at Whatnot focuses on designing and building robust, self-service data workflows using a modern data stack.

ABOUT DATA COUNCIL: Data Council (https://www.datacouncil.ai/) is a community and conference series that provides data professionals with the learning and networking opportunities they need to grow their careers.

Make sure to subscribe to our channel for the most up-to-date talks from technical professionals on data related topics including data infrastructure, data engineering, ML systems, analytics and AI from top startups and tech companies.

FOLLOW DATA COUNCIL: Twitter: https://twitter.com/DataCouncilAI LinkedIn: https://www.linkedin.com/company/datacouncil

Modern Data Management: How to Set Your Data Team Up for Success | Select Star

ABOUT THE TALK: Got your Modern Data Stack set up, now what? A mature data practice goes beyond setting up the data pipeline and ensures there are both systems and processes in place to make it easy for everyone to find and understand data.

Learn how Select Star enables data discovery, making knowledge searchable and understandable for all. Uncover best practices for setting up a data discovery portal as your single source of truth.

ABOUT THE SPEAKER: Alec Bialosky is currently the Director of Business Operations at Select Star where he spends the majority of his time working with prospects and customers to help them achieve their data discovery goals with Select Star.

The State of Cross Company Data Exchange | General Folders

ABOUT THE TALK: Data exchange is vital for business partnerships, but current practices are manual, prone to leaks, and hard to validate, monitor, and audit.

Tune in to this talk for an overview of data sharing methods and how they compare on security, simplicity, and speed. Discover best practices and solutions to overcome challenges.

ABOUT THE SPEAKER: Pardis Noorzad is CEO at General Folders. She led a data team at Twitter, covering a variety of consumer products. Pardis has also built products at growth-stage fintech and digital health companies and at early-stage AI platform companies.

Job Ready SQL

Learn the most important SQL skills and apply them in your job, quickly and efficiently! SQL (Structured Query Language) is the modern language that almost every relational database system supports for adding data, retrieving data, and modifying data in a database. Although basic visual tools are available to help end-users input common commands, data scientists, business intelligence analysts, Cloud engineers, Machine Learning programmers, and other professionals routinely need to query a database using SQL. Job Ready SQL provides you with the foundational skills necessary to work with data of any kind. Offering a straightforward ‘learn-by-doing’ approach, this concise and highly practical guide teaches you all the basics of SQL so you can apply your knowledge in real-world environments immediately. Throughout the book, each lesson includes clear explanations of key concepts and hands-on exercises that mirror real-world SQL tasks.

Teaches the basics of SQL database creation and management using easy-to-understand language
Helps readers develop an understanding of fundamental concepts and more advanced applications such as data engineering and data science
Discusses the key types of SQL commands, including Data Definition Language (DDL) commands and Data Manipulation Language (DML) commands
Includes useful reference information on querying SQL-based databases

Job Ready SQL is a must-have resource for students and working professionals looking to quickly get up to speed with SQL and take their relational database skills to the next level.

Sarah Floris (aka The Dutch Engineer) is a prolific creator of content aimed at DataOps and data engineering. In this wide-ranging chat, we cover content platforms for technical creators, podcasting, data engineering vs ML engineering, why DataOps is awesome, courses, layoffs, and much more.


Summary

Every business has customers, and a critical element of success is understanding who they are and how they are using the company's products or services. The challenge is that most companies have a multitude of systems that contain fragments of the customer's interactions, and stitching that together is complex and time-consuming. Segment created the Unify product to reduce the burden of building a comprehensive view of customers and synchronizing it to all of the systems that need it. In this episode Kevin Niparko and Hanhan Wang share the details of how it is implemented and how you can use it to build and maintain rich customer profiles.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack
Your host is Tobias Macey and today I'm interviewing Kevin Niparko and Hanhan Wang about Segment's new Unify product for building and syncing comprehensive customer profiles across your data systems.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Segment Unify is and the story behind it?
What are the net-new capabilities that it brings to the Segment product suite?
What are some of the categories of attributes that need to be managed in a prototypical customer profile?
What are the different use cases that are enabled/simplified by the availability of a comprehensive customer profile?

What is the potential impact of more detailed customer profiles on LTV?

How do you manage permissions/auditability of updating or amending profile data?
Can you describe how the Unify product is implemented?

What are the technical challenges that you had to address while developing/launching this product?

What is the workflow for a team who is adopting the Unify product?

What are the other Segment products that need to be in use to take advantage of Unify?

What are some of the most complex edge cases to address in identity resolution?
How does reverse ETL factor into the enrichment process for profile data?
What are some of the issues that you have to account for in synchronizing profiles across platforms/products?

How do you mitigate the impact of "regression to the mean" for systems that don't support all of the attributes that you want to maintain in a profile record?

What are some of the data modeling considerations that you have had to account for to support historical changes (e.g. slowly changing dimensions)?
What are the most interesting, innovative, or unexpected ways that you have seen Segment Unify used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Segment Unify?
When is Segment Unify the wrong choice?
What do you have planned for the future of Segment Unify?

Contact Info

Kevin

LinkedIn Blog

Hanhan

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your

I've been spending some time in Europe, and I returned from Germany the other day. There are definitely some differences between the US and Europe, particularly in how each conducts business and regulation. Tune in for my thoughts on these differences.

#data #dataengineering #datascience


Podcast episode by Katharine Jarmul (Cape Privacy), Joe Reis (DeepLearning.AI)

Katharine Jarmul (Principal data scientist at Thoughtworks and author of Practical Data Privacy (O’Reilly, 2023)) and I chat about all things data privacy. She brings battle-tested experience and unique perspectives in the areas of ML/AI privacy, AI risk, regulation, and much more. I learned a ton, and I hope you do too!

LinkedIn: https://www.linkedin.com/in/katharinejarmul/

Twitter: https://twitter.com/kjam

Probably Private newsletter: https://probablyprivate.com/


The economic downturn is affecting the tech and data industry in a major way. Lots of layoffs, consolidation, and pain. But is this also a good thing for the industry? Listen and get my opinion in this nerdy rant about the economy, interest rates, and doing more with less.

#data #dataengineering #datascience


Scalable & Sustainable Feature Engineering with Hamilton | DAGWorks

ABOUT THE TALK: Hamilton is a novel open-source framework for developing and maintaining scalable feature engineering dataflows.

We introduce the framework, discuss its motivations and initial successes at Stitch Fix, showcase its lightweight data lineage and catalog abilities, and share recent extensions that seamlessly integrate it with distributed compute offerings, such as Dask, Ray, and Spark.
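The dataflow idea behind this talk can be sketched in plain Python. The block below is not the Hamilton API. It is a toy resolver, written under the assumption that each function is named after the feature it produces and that its parameter names identify the inputs or other features it depends on. The feature names (`spend_mean`, `spend_zero_mean`, `spend_per_signup`) and the `execute` helper are hypothetical and exist only for illustration.

```python
import inspect
from typing import Any, Callable, Dict, List, Tuple


# Hypothetical feature functions: the function name is the feature it produces,
# and every parameter name is either a raw input or another feature.
def spend_mean(spend: List[float]) -> float:
    return sum(spend) / len(spend)


def spend_zero_mean(spend: List[float], spend_mean: float) -> List[float]:
    return [s - spend_mean for s in spend]


def spend_per_signup(spend: List[float], signups: List[int]) -> List[float]:
    return [s / n for s, n in zip(spend, signups)]


def execute(
    functions: List[Callable[..., Any]],
    inputs: Dict[str, Any],
    outputs: List[str],
) -> Tuple[Dict[str, Any], Dict[str, List[str]]]:
    """Compute the requested outputs and record which names each one used."""
    registry = {fn.__name__: fn for fn in functions}
    cache: Dict[str, Any] = dict(inputs)      # raw inputs are already "computed"
    lineage: Dict[str, List[str]] = {}        # feature name -> direct dependencies

    def resolve(name: str) -> Any:
        if name in cache:
            return cache[name]
        if name not in registry:
            raise KeyError(f"no input or function named {name!r}")
        fn = registry[name]
        deps = list(inspect.signature(fn).parameters)
        lineage[name] = deps
        # Resolve dependencies first (no cycle detection in this toy version).
        cache[name] = fn(**{dep: resolve(dep) for dep in deps})
        return cache[name]

    return {name: resolve(name) for name in outputs}, lineage


features, lineage = execute(
    [spend_mean, spend_zero_mean, spend_per_signup],
    inputs={"spend": [10.0, 20.0, 30.0], "signups": [1, 2, 3]},
    outputs=["spend_zero_mean", "spend_per_signup"],
)
```

Because dependencies are declared by name, the `lineage` dictionary falls out of resolution for free, which hints at how a lightweight lineage and catalog view could be derived from the function definitions alone.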

ABOUT THE SPEAKER: Elijah Ben Izzy has always enjoyed working at the intersection of math and engineering. He has more recently focused on building tools to make data scientists and researchers more productive.

He built infrastructure to help quantitative researchers efficiently turn ideas into production trading models at Two Sigma and ran the Model Lifecycle team at Stitch Fix.

He is now the CTO at DAGWorks, which aims to solve the problem of building and maintaining complex ETLs for machine learning.

Neelesh Salian joins the show to discuss the rise of data engineering, where the data landscape is heading, and career growth as an engineer. Neelesh has a ton of experience in the technology space, and you'll learn a lot from his wisdom.

LinkedIn: https://www.linkedin.com/in/neeleshsalian/

Substack: https://hysterical.substack.com/

Twitter: https://twitter.com/neelesh_salian

#dataengineering #data #careeradvice #softwareengineering #swe


Summary

Real-time capabilities have quickly become an expectation for consumers. The complexity of providing those capabilities is still high, however, making it more difficult for small teams to compete. Meroxa was created to enable teams of all sizes to deliver real-time data applications. In this episode DeVaris Brown discusses the types of applications that are possible when teams don't have to manage the complex infrastructure necessary to support continuous data flows.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Your host is Tobias Macey and today I'm interviewing DeVaris Brown about the impact of real-time data on business opportunities and risk profiles.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Meroxa is and the story behind it?

How have the focus and goals of the platform and company evolved over the past 2 years?

Who are the target customers for Meroxa?

What problems are they trying to solve when they come to your platform?

Applications powered by real-time data were the exclusive domain of large and/or sophisticated tech companies for several years due to the inherent complexities involved. What are the shifts that have made them more accessible to a wider variety of teams?

What are some of the remaining blockers for teams who want to start using real-time data?

With the democratization of real-time data, what are the new categories of products and applications that are being unlocked?

How are organizations thinking about the potential value that those types of apps/services can provide?

With data flowing constantly, there are new challenges around oversight and accuracy. How does real-time data change the risk profile for applications that are consuming it?

What are some of the technical controls that are available for organizations that are risk-averse?

What skills do developers need to be able to effectively design, develop, and deploy real-time data applications?

How does this differ when talking about internal vs. consumer/end-user facing applications?

What are the most interesting, innovative, or unexpected ways that you have seen Meroxa used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Meroxa?
When is Meroxa the wrong choice?
What do you have planned for the future of Meroxa?

Contact Info

LinkedIn @devarispbrown on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

Meroxa

Podcast Episode

Kafka
Kafka Connect
Conduit - golang Kafka connect replacement
Pulsar
Redpanda
Flink
Beam
Clickhouse
Druid
Pinot

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC

We talked about:

Johannes’s background
Johannes’s Open Source Spotlight demos – Refinery and Bricks
The difficulties of working with natural language processing (NLP)
Incorporating ChatGPT into a process as a heuristic
What is Bricks?
The process of starting a startup – Kern
Making the decision to go with open source
Pros and cons of launching as open source
Kern’s business model
Working with enterprises
Johannes as a salesperson
The team at Kern
Johannes’s role at Kern
How Johannes and Henrik separate responsibilities at Kern
Working with very niche use cases
The short story of how Kern got its funding
Johannes’s resource recommendation

Links:

Refinery's GitHub repo: https://github.com/code-kern-ai/refinery
Bricks' GitHub repo: https://github.com/code-kern-ai/bricks
Bricks Open Source Spotlight demo: https://www.youtube.com/watch?v=r3rXzoLQy2U
Refinery Open Source Spotlight demo: https://www.youtube.com/watch?v=LlMhN2f7YDg
Discord: https://discord.com/invite/qf4rGCEphW
Kern's Website: https://www.kern.ai

Free data engineering course: https://github.com/DataTalksClub/data-engineering-zoomcamp

Join DataTalks.Club: https://datatalks.club/slack.html

Our events: https://datatalks.club/events.html

I teach data at the University of Utah, and I’m also on the board of advisors for my department. What curriculum and approach do I advise universities to use for teaching data and technology? Listen and find out.


The Modern Data Stack has brought a lot of new buzzwords into the data engineering lexicon: "data mesh", "data observability", "reverse ETL", "data lineage", "analytics engineering". In this light-hearted talk we will demystify the evolving revolution that will define the future of data analytics & engineering teams.

Our journey begins with the PyData Stack: pandas pipelines powering ETL workflows...clean code, tested code, data validation, perfect for in-memory workflows. As demand for self-serve analytics grows, new data sources bring more APIs to model, more code to maintain, DAG workflow orchestration tools, new nuances to capture ("the tax team defines revenue differently"), more dashboards, more not-quite-bugs ("but my number says this...").

This data maturity journey is a well-trodden path with common pitfalls & opportunities. After dashboards comes predictive modelling ("what will happen"), prescriptive modelling ("what should we do?"), perhaps eventually automated decision making. Getting there is much easier with the advent of the Python Powered Modern Data Stack.

In this talk, we will cover the shift from ETL to ELT, the open-source Modern Data Stack tools you should know, with a focus on how dbt's new Python integration is changing how data pipelines are built, run, tested & maintained. By understanding the latest trends & buzzwords, attendees will gain a deeper insight into Python's role at the core of the future of data engineering.
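The shift from ETL to ELT mentioned above can be illustrated with a small sketch. It is only an illustration under stated assumptions: an in-memory SQLite database stands in for the warehouse, and the table names (`raw_orders`, `customer_revenue`) and sample rows are hypothetical. It does not use dbt. The point is the ordering: raw records are loaded unchanged, and the cleaning and aggregation happen afterwards as SQL inside the warehouse.

```python
import sqlite3

# Hypothetical raw records, loaded exactly as received:
# amounts are still strings and status values have inconsistent casing.
RAW_ORDERS = [
    ("o1", "alice", "19.50", "PAID"),
    ("o2", "bob", "5.00", "refunded"),
    ("o3", "alice", "30.50", "paid"),
]


def load_raw(conn: sqlite3.Connection) -> None:
    """Extract + Load: copy the raw rows into the warehouse without cleaning them."""
    conn.execute(
        "CREATE TABLE raw_orders "
        "(order_id TEXT, customer TEXT, amount TEXT, status TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", RAW_ORDERS)


def transform(conn: sqlite3.Connection) -> None:
    """Transform: build a modeled table from the raw table using SQL in the warehouse."""
    conn.execute(
        """
        CREATE TABLE customer_revenue AS
        SELECT customer,
               ROUND(SUM(CAST(amount AS REAL)), 2) AS revenue
        FROM raw_orders
        WHERE LOWER(status) = 'paid'
        GROUP BY customer
        """
    )


conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load_raw(conn)
transform(conn)
rows = conn.execute(
    "SELECT customer, revenue FROM customer_revenue ORDER BY customer"
).fetchall()
```

Keeping the raw table intact means a changed business definition (for example, a different notion of revenue) only requires rerunning the transform step, not re-extracting from the source system.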

Ken Jee joins the show to chat about how he makes awesome content, podcasting and being authentic, jiu jitsu, maximizing your time, and adapting to AI.

#datascience #kenjee #data #ai


Debugging is hard. Distributed debugging is hell.

Dask is a popular library for parallel and distributed computing in Python. Dask is commonly used in data science, actual science, data engineering, and machine learning to distribute workloads onto clusters of many hundreds of workers with ease.

However, when things go wrong life can become difficult due to all of the moving parts. These parts include your code, other PyData libraries like NumPy/pandas, the machines you’re running on, the network between them, storage, the cloud, and of course issues with Dask itself. It can be difficult to understand what is going on, especially when things seem slower than they should be or fail unexpectedly. Observability is the key to sanity and success.

In this talk, we describe the tools Dask offers to help you observe your distributed cluster, analyze performance, and monitor your cluster to react to unexpected changes quickly. We will dive into distributed logging, automated metrics, event-based monitoring, and root-causing problems with diagnostic tooling. Throughout the talk, we will leverage real-world use cases to show how these tools help to identify and solve problems for large-scale users in the wild.

This talk should be particularly insightful for Dask users, but the approaches to observing distributed systems should be relevant to anyone operating at scale in production.

Summary

Business intelligence has been chasing the promise of self-serve data for decades. As the capabilities of these systems have improved and become more accessible, the target of what self-serve means has changed. With the availability of AI powered by large language models combined with the evolution of semantic layers, the team at Zenlytic have taken aim at this problem again. In this episode Paul Blankley and Ryan Janssen explore the power of natural language driven data exploration combined with semantic modeling that enables an intuitive way for everyone in the business to access the data that they need to succeed in their work.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Your host is Tobias Macey and today I'm interviewing Paul Blankley and Ryan Janssen about Zenlytic, a no-code business intelligence tool focused on emerging commerce brands.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Zenlytic is and the story behind it?
Business intelligence is a crowded market. What was your process for defining the problem you are focused on solving and the method to achieve that outcome?
Self-serve data exploration has been attempted in myriad ways over successive generations of BI and data platforms. What are the barriers that have been the most challenging to overcome in that effort?

What are the elements that are coming together now that give you confidence in being able to deliver on that?

Can you describe how Zenlytic is implemented?

What are the evolutions in the understanding and implementation of semantic layers that provide a sufficient substrate for operating on?
How have the recent breakthroughs in large language models (LLMs) improved your ability to build features in Zenlytic?
What is your process for adding domain semantics to the operational aspect of your LLM?

For someone using Zenlytic, what is the process for getting it set up and integrated with their data?
Once it is operational, can you describe some typical workflows for using Zenlytic in a business context?

Who are the target users?
What are the collaboration options available?

What are the most complex engineering/data challenges that you have had to address in building Zenlytic?
What are the most interesting, innovative, or unexpected ways that you have seen Zenlytic used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Zenlytic?
When is Zenlytic the wrong choice?
What do you have planned for the future of Zenlytic?

Contact Info

Paul Blankley (LinkedIn)

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

Zenlytic
OLAP Cube
Large Language Model
Starburst
Pr

ChatGPT was the iPhone moment for AI, and things are moving insanely quickly. What do generative AI models mean for us, especially children, who are arguably the last of the Pre-AI generation? I dive into some thoughts this week about how we need to work alongside the machines, the impact of generative AI on kids, and so on. Buckle up. We are in for a very interesting next few years as we sort out where AI fits into our day-to-day lives.

#data #datascience #dataengineering #chatgpt #ai

