talk-data.com

Topic: Python
Tags: programming_language, data_science, web_development
1446 tagged activities

Activity Trend: 185 peak/qtr (2020-Q1 to 2026-Q1)

Activities

1446 activities · Newest first

Practical Data Science with Python

Practical Data Science with Python guides you through the entire process of leveraging Python tools to analyze and gain insights from data. You'll start with foundational concepts and coding essentials, progressing through statistical analysis, machine learning techniques, and ethical considerations.

What this Book will help me do
Clean, prepare, and explore data using pandas and NumPy. Understand and implement machine learning models such as random forests and support vector machines. Perform statistical tests and analyze distributions to enhance data insights. Utilize SQL with Python for efficient data interaction. Generate automated reports and dashboards for data storytelling.

Author(s)
Nathan George has extensive professional experience as a data scientist and Python developer. He specializes in the application of machine learning and statistical methods to solve real-world problems. His writing combines technical depth with an approachable style, aiming to provide readers with actionable knowledge and skills.

Who is it for?
This book is perfect for data science beginners who have a basic understanding of Python and want to build practical data analysis skills. Students in analytics programs or professionals looking to transition into a data science role will find value in its approachable yet comprehensive coverage. Aspiring data analysts and career changers will gain firsthand exposure to Python-based data science best practices. If you're eager to develop practical, hands-on experience in the data science field, this is the guide for you.
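
For readers wondering what that workflow looks like in practice, here is a minimal sketch (not taken from the book; the file and column names are hypothetical) of cleaning a dataset with pandas and running a statistical test with SciPy:

```python
# Illustrative sketch: clean a hypothetical "sales.csv" with pandas,
# summarize it, then compare two groups with a t-test from SciPy.
import pandas as pd
from scipy import stats

df = pd.read_csv("sales.csv")                      # hypothetical file
df = df.drop_duplicates()
df["revenue"] = df["revenue"].fillna(df["revenue"].median())

print(df.describe())                               # quick summary statistics

# Compare mean revenue between two hypothetical customer segments
a = df.loc[df["segment"] == "new", "revenue"]
b = df.loc[df["segment"] == "returning", "revenue"]
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```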

Summary Python has become the de facto language for working with data. That has brought with it a number of challenges having to do with the speed and scalability of working with large volumes of information. There have been many projects and strategies for overcoming these challenges, each with their own set of tradeoffs. In this episode Ehsan Totoni explains how he built the Bodo project to bring the speed and processing power of HPC techniques to the Python data ecosystem without requiring any re-work.
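
As a rough illustration of the pattern discussed in the episode, the sketch below shows Bodo's decorator-based approach of JIT-compiling ordinary pandas code for parallel execution. Treat it as an assumption-laden example rather than a recipe from the episode: the dataset path and column names are made up, and the exact API details should be checked against the Bodo documentation.

```python
# Minimal sketch of the usage pattern: Bodo's JIT decorator compiles
# plain pandas code so it can run in parallel across cores or nodes.
import bodo
import pandas as pd

@bodo.jit
def daily_totals(path):
    df = pd.read_parquet(path)                 # hypothetical dataset
    return df.groupby("day")["amount"].sum()

# Typically launched with MPI to use multiple processes, e.g.:
#   mpiexec -n 8 python this_script.py
print(daily_totals("events.parquet"))
```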

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, Creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it! Your host is Tobias Macey and today I’m interviewing Ehsan Totoni about Bodo, a system for automatically optimizing and parallelizing python code for massively parallel data processing and analytics

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Bodo is and the story behind it?
What are the techniques/technologies that teams might use to optimize or scale out their data processing workflows?
Why have you focused your efforts on the Python language and toolchain?

Do you see any potential for expanding into other language communities?
What are the shortcomings of projects such as Dask and Ray for scaling out Python data projects?

Many people are familiar with the principle of HPC architectures, but can you share an overview of the current state of the art for HPC?

What are the tradeoffs of HPC vs scale-out distributed systems?

Can you d

Join host Avery Smith on this episode of the Data Career Podcast for an exciting 'Ask Avery' session! We cover various topics, including the roles and differences between data analysts, data engineers, and data scientists, as well as transitioning careers, essential skills for data engineering, forecasting techniques, and more.

If you have questions about data visualization, Python, or breaking into data science, this episode has got you covered.

Tune in for valuable insights and professional advice to boost your data career!

Mentioned in this episode: Join the last cohort of 2025! The LAST cohort of The Data Analytics Accelerator for 2025 kicks off on Monday, December 8th and enrollment is officially open!

To celebrate the end of the year, we’re running a special End-of-Year Sale, where you’ll get: ✅ A discount on your enrollment 🎁 6 bonus gifts, including job listings, interview prep, AI tools + more

If your goal is to land a data job in 2026, this is your chance to get ahead of the competition and start strong.

👉 Join the December Cohort & Claim Your Bonuses: https://DataCareerJumpstart.com/daa

Pandas in Action

Take the next steps in your data science career! This friendly and hands-on guide shows you how to start mastering Pandas with skills you already know from spreadsheet software.

In Pandas in Action you will learn how to:
Import datasets, identify issues with their data structures, and optimize them for efficiency
Sort, filter, pivot, and draw conclusions from a dataset and its subsets
Identify trends from text-based and time-based data
Organize, group, merge, and join separate datasets
Use a GroupBy object to store multiple DataFrames

Pandas has rapidly become one of Python's most popular data analysis libraries. In Pandas in Action, a friendly and example-rich introduction, author Boris Paskhaver shows you how to master this versatile tool and take the next steps in your data science career. You'll learn how easy Pandas makes it to efficiently sort, analyze, filter and munge almost any type of data.

About the Technology
Data analysis with Python doesn't have to be hard. If you can use a spreadsheet, you can learn pandas! While its grid-style layouts may remind you of Excel, pandas is far more flexible and powerful. This Python library quickly performs operations on millions of rows, and it interfaces easily with other tools in the Python data ecosystem. It's a perfect way to up your data game.

About the Book
Pandas in Action introduces Python-based data analysis using the amazing pandas library. You'll learn to automate repetitive operations and gain deeper insights into your data that would be impractical—or impossible—in Excel. Each chapter is a self-contained tutorial. Realistic downloadable datasets help you learn from the kind of messy data you'll find in the real world.

What's Inside
Organize, group, merge, split, and join datasets
Find trends in text-based and time-based data
Sort, filter, pivot, optimize, and draw conclusions
Apply aggregate operations

About the Reader
For readers experienced with spreadsheets and basic Python programming.

About the Author
Boris Paskhaver is a software engineer, Agile consultant, and online educator. His programming courses have been taken by 300,000 students across 190 countries.

Quotes
"Of all the introductory pandas books I've read—and I did read a few—this is the best, by a mile." - Erico Lendzian, idibu.com
"This approachable guide will get you up and running quickly with all the basics you need to analyze your data." - Jonathan Sharley, SiriusXM Media
"Understanding and putting in practice the concepts of this book will help you increase productivity and make you look like a pro." - Jose Apablaza, Steadfast Networks
"Teaches both novice and expert Python users the essential concepts required for data analysis and data science." - Ben McNamara, DataGeek
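
A short sketch of the kind of operations the blurb lists (sorting, filtering, merging, and grouping); the DataFrames here are invented for illustration and are not examples from the book:

```python
# Illustrative pandas sketch: filter, sort, merge, and group two tiny
# hypothetical DataFrames of orders and customers.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer": ["ann", "bob", "ann", "cat"],
    "total": [20.0, 35.5, 12.0, 50.0],
})
customers = pd.DataFrame({
    "customer": ["ann", "bob", "cat"],
    "region": ["east", "west", "east"],
})

big_orders = orders[orders["total"] > 15].sort_values("total", ascending=False)
joined = big_orders.merge(customers, on="customer", how="left")
print(joined.groupby("region")["total"].sum())
```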

Caitlin Colgrove is Co-founder & CTO at Hex, a data workspace that allows teams to collaborate in both SQL and Python to publish interactive data apps. In this conversation, Tristan, Julia and Caitlin dive into the possibilities that real-time collaborative notebooks unlock for data teams — what if our collaboration style looked more like Google Docs than a Git workflow? For full show notes and to read 6+ years of back issues of the podcast's companion newsletter, head to https://roundup.getdbt.com. The Analytics Engineering Podcast is sponsored by dbt Labs.

Ryan Wade joins us on AOF today to talk about how to use advanced analytics in your organization! Ryan has been in the analytics game for the last 20 years and is now a Senior Solution Consultant at Blue Granite, based in Indianapolis, Indiana. He recently authored the amazing must-read book, Advanced Analytics in Power BI with R and Python, and in today's chat, we get to hear all about why he wrote the book, who it is for and how you can use it to accelerate your data journey! I met Ryan while speaking at a few conferences and was always impressed with his knowledge and great sense of humor! A professional football player turned data scientist, Ryan has a passion for breaking down advanced analytics in a way anyone can understand. Whether you're already using advanced analytics or researching how to get started, Ryan's knowledge on the topic will help you. Tune in with a pencil and paper in hand!

In this episode, you'll learn:
[0:09:22] The rise of the R and Python programming languages in the data world.
[0:16:44] The necessary, well-thought-out preparatory steps for a project utilizing advanced analytics.
[0:19:39] Why attention-grabbing visuals are not the most important part of data storytelling!
[0:23:13] Creating a sufficient team for data analytics and the vital roles of the database administrator, active directory administrator, and more!
[0:39:07] Client conversations around shortcomings and hurdles in advanced analytics.

For full show notes and the links mentioned, visit: https://bibrainz.com/podcast/82

Enjoyed the Show? Please leave us a review on iTunes.

Data Science for Marketing Analytics - Second Edition

In 'Data Science for Marketing Analytics', you'll embark on a journey that integrates the power of data analytics with strategic marketing. With a focus on practical application, this guide walks you through using Python to analyze datasets, implement machine learning models, and derive data-driven insights.

What this Book will help me do
Gain expertise in cleaning, exploring, and visualizing marketing data using Python. Build machine learning models to predict customer behavior and sales outcomes. Leverage unsupervised learning techniques for effective customer segmentation. Compare and optimize predictive models using advanced evaluation methods. Master Python libraries like pandas and Matplotlib for data manipulation and visualization.

Author(s)
Mirza Rahim Baig, Gururajan Govindan, and Vishwesh Ravi Shrimali combine their extensive expertise in data analytics and marketing to bring you this comprehensive guide. Drawing from years of applying analytics in real-world marketing scenarios, they provide a hands-on approach to learning data science tools and techniques.

Who is it for?
This book is perfect for marketing professionals and analysts eager to harness the capabilities of Python to enhance their data-driven strategies. It is also ideal for data scientists looking to apply their skills in marketing across various roles. While a basic understanding of data analysis and Python will help, all key concepts are introduced comprehensively for beginners.
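
As a hedged illustration of the customer segmentation theme (not an excerpt from the book; the file and column names are hypothetical), unsupervised clustering with scikit-learn might look like this:

```python
# Illustrative sketch: segment customers with KMeans on two hypothetical
# behavioral features, after standardizing them.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

customers = pd.read_csv("customers.csv")            # hypothetical file
features = customers[["annual_spend", "visits_per_month"]]

X = StandardScaler().fit_transform(features)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

customers["segment"] = kmeans.labels_
print(customers.groupby("segment")[["annual_spend", "visits_per_month"]].mean())
```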

In this episode of DataFramed, we speak with Rick Scavetta and Boyan Angelov about their new book, Python and R for the Modern Data Scientist: The Best of Both Worlds, and how it marks the dawn of a new bilingual data science community.

Throughout the episode, Rick and Boyan discuss the history of Python and R, what led them to write the book, how Python and R can be interoperable, the advantages of each language and where to use it, how beginner data scientists should think about learning programming languages, how experienced data scientists can take it to the next level by learning a language they’re not necessarily comfortable with, and more. 

Relevant links from the interview:

We’d love your feedback! Let us know which topics you’d like us to cover and what you think of DataFramed by answering this 30-second survey
Check out Rick and Boyan’s book
Check out Rick’s courses on DataCamp
Check out Boyan's other books
Connect with Rick on LinkedIn
Connect with Boyan on LinkedIn

Pandas Brain Teasers

This book contains 25 short programs that will challenge your understanding of Pandas. Like any big project, the Pandas developers had to make some design decisions that at times seem surprising. This book uses those quirks as a teaching opportunity. By understanding the gaps in your knowledge, you'll become better at what you do. Some of the teasers are from the author's experience shipping bugs to production, and some from others doing the same. Teasers and puzzles are fun, and learning how to solve them can teach you to avoid programming mistakes and maybe even impress your colleagues and future employers. Working with data is central to nearly everything we do, from disease contact tracing and analyzing health records to smart meters that track utility consumption behavior. With the power of Python's pandas library, you can process and analyze this data in a highly efficient and simple-to-understand way. And with 25 brain teasers designed to turn this technology's quirks into a teaching opportunity, you'll be honing your data science skills while having fun at the same time. Following a simple format, you'll challenge yourself and your understanding of pandas. Read a short Python program that uses pandas, try to guess the output, run the code yourself, and then go to the next page for an explanation of the solution. From common pitfalls and hidden gotchas to unexpected twists and turns, you'll deepen your understanding of pandas, learn to write more efficient code, and reduce the number of bugs in the software you develop. You may even impress your colleagues and your employers, both present and future. Learn the tricks of the trade with Python's pandas, in one of the most fun and creative ways around.

What You Need:
To run the code you'll need Python 3.8 or later and pandas 1.0 or later installed. We use Python version 3.8.3 and pandas version 1.0.5; the output might change in future versions.
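
In the spirit of the book, here is a small teaser of our own (not one of the 25): guess both printed values before running it.

```python
# Teaser: do these two filters together cover every row of the Series?
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

print(len(s[s > 1]))    # 1 -> only 3.0
print(len(s[s <= 1]))   # 1 -> only 1.0; the NaN row falls through BOTH
                        # filters, because comparisons with NaN are False
```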

SQL Pocket Guide, 4th Edition

If you use SQL in your day-to-day work as a data analyst, data scientist, or data engineer, this popular pocket guide is your ideal on-the-job reference. You'll find many examples that address the language's complexities, along with key aspects of SQL used in Microsoft SQL Server, MySQL, Oracle Database, PostgreSQL, and SQLite. In this updated edition, author Alice Zhao describes how these database management systems implement SQL syntax for both querying and making changes to a database. You'll find details on data types and conversions, regular expression syntax, window functions, pivoting and unpivoting, and more.

Quickly look up how to perform specific tasks using SQL
Apply the book's syntax examples to your own queries
Update SQL queries to work in five different database management systems
NEW: Connect Python and R to a relational database
NEW: Look up frequently asked SQL questions in the "How Do I?" chapter
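
The "Connect Python ... to a relational database" topic can be sketched with the standard-library sqlite3 driver and pandas; this is an illustrative example rather than one of the book's, and the database, table, and column names are made up.

```python
# Illustrative sketch: create a tiny SQLite table, then read it back
# into a pandas DataFrame with a SQL query.
import sqlite3
import pandas as pd

conn = sqlite3.connect("example.db")          # hypothetical database file
conn.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'ada'), (2, 'grace')")
conn.commit()

df = pd.read_sql("SELECT id, name FROM users ORDER BY id", conn)
print(df)
conn.close()
```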

Summary The technological and social ecosystem of data engineering and data management has been reaching a stage of maturity recently. As part of this stage in our collective journey the focus has been shifting toward operation and automation of the infrastructure and workflows that power our analytical workloads. It is an encouraging sign for the industry, but it is still a complex and challenging undertaking. In order to make this world of DataOps more accessible and manageable the team at Nexla has built a platform that decouples the logical unit of data from the underlying mechanisms so that you can focus on the problems that really matter to your business. In this episode Saket Saurabh (CEO) and Avinash Shahdadpuri (CTO) share the story behind the Nexla platform, discuss the technical underpinnings, and describe how their concept of a Nexset simplifies the work of building data products for sharing within and between organizations.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advanced notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today. We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial. Your host is Tobias Macey and today I’m interviewing Saket Saurabh and Avinash Shahdadpuri about Nexla, a platform for powering data operations and sharing within and across businesses

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Nexla is and the story behind it?
What are the major problems that Nexla is aiming to solve?

What are the components of a data platform that Nexla might replace?

What are the use cases and benefits of being able to publish data sets for use outside and across organizations?
What are the different elements involved in implementing DataOps?
How is the Nexla platform implemented?

What have been the most complex engineering challenges?
How has the architecture changed or evolved since you first began working on it?
What are some of the assumpt

Getting Started with Streamlit for Data Science

Getting Started with Streamlit for Data Science is your essential guide to quickly and efficiently building dynamic data science web applications in Python using Streamlit. Whether you're embedding machine learning models, visualizing data, or deploying projects, this book helps you excel in creating and sharing interactive apps with ease.

What this Book will help me do
Set up a development environment to create your first Streamlit application. Implement and visualize dynamic data workflows by integrating various Python libraries into Streamlit. Develop and showcase machine learning models within Streamlit for clear and interactive presentations. Deploy your projects effortlessly using platforms like Streamlit Sharing, Heroku, and AWS. Utilize tools like Streamlit Components and themes to enhance the aesthetics and usability of your apps.

Author(s)
Tyler Richards is a data science expert with extensive experience in leveraging technology to present complex data models in an understandable way. He brings practical solutions to readers, aiming to empower them with the tools they need to succeed in the field of data science. Tyler adopts a hands-on teaching method with illustrative examples to ensure clarity and easy learning.

Who is it for?
This book is designed for anyone involved in data science, from beginners just starting in the field to experienced professionals who want to learn to create interactive web applications using Streamlit. Ideal for those with a working knowledge of Python, this resource will help you streamline your workflows and enhance your project presentations.
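
A minimal Streamlit app of the kind the book starts from might look like the sketch below; the data is synthetic and the app itself is not taken from the book.

```python
# app.py - minimal Streamlit sketch with a slider and a chart.
# Run with:  streamlit run app.py
import numpy as np
import pandas as pd
import streamlit as st

st.title("Random walk demo")

steps = st.slider("Number of steps", 100, 1000, 250)
walk = pd.DataFrame({"value": np.random.randn(steps).cumsum()})
st.line_chart(walk)
```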

Summary A major concern that comes up when selecting a vendor or technology for storing and managing your data is vendor lock-in. What happens if the vendor fails? What if the technology can’t do what I need it to? Compilerworks set out to reduce the pain and complexity of migrating between platforms, and in the process added an advanced lineage tracking capability. In this episode Shevek, CTO of Compilerworks, takes us on an interesting journey through the many technical and social complexities that are involved in evolving your data platform and the system that they have built to make it a manageable task.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advanced notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today. We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial. Your host is Tobias Macey and today I’m interviewing Shevek about Compilerworks and his work on writing compilers to automate data lineage tracking from your SQL code

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Compilerworks is and the story behind it?
What is a compiler?

How are you applying compilers to the challenges of data processing systems?

What are some use cases that Compilerworks is uniquely well suited to?
There are a number of other methods and systems available for tracking and/or computing data lineage. What are the benefits of the approach that you are taking with Compilerworks?
Can you describe the design and implementation of the Compilerworks platform?

How has the system changed or evolved since you first began working on it?

What programming languages and SQL dialects do you currently support?

Which have been the most challenging to work with?
How do you handle verification/validation of the algebraic representation of SQL code given the variability of implementations and the flexibility of the specification?

Can you talk through the process of getting Compilerworks

Data Science at the Command Line, 2nd Edition

This thoroughly revised guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You'll learn how to combine small yet powerful command-line tools to quickly obtain, scrub, explore, and model your data. To get you started, author Jeroen Janssens provides a Docker image packed with over 100 Unix power tools--useful whether you work with Windows, macOS, or Linux. You'll quickly discover why the command line is an agile, scalable, and extensible technology. Even if you're comfortable processing data with Python or R, you'll learn how to greatly improve your data science workflow by leveraging the command line's power. This book is ideal for data scientists, analysts, engineers, system administrators, and researchers.

Obtain data from websites, APIs, databases, and spreadsheets
Perform scrub operations on text, CSV, HTML, XML, and JSON files
Explore data, compute descriptive statistics, and create visualizations
Manage your data science workflow
Create your own tools from one-liners and existing Python or R code
Parallelize and distribute data-intensive pipelines
Model data with dimensionality reduction, regression, and classification algorithms
Leverage the command line from Python, Jupyter, R, RStudio, and Apache Spark
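
Although the book centers on Unix tools, its last point (driving the command line from Python) can be sketched with the standard library's subprocess module; the shell commands and file names below are hypothetical and assume a POSIX environment.

```python
# Illustrative sketch: run shell commands from Python and pipe their
# output straight into pandas, without temporary files.
import io
import subprocess
import pandas as pd

# Count lines per CSV file in the current directory.
result = subprocess.run(
    "wc -l *.csv", shell=True, capture_output=True, text=True, check=True
)
print(result.stdout)

# Stream the first 1000 rows of a (hypothetical) file into a DataFrame.
filtered = subprocess.run(
    "head -n 1000 data.csv", shell=True, capture_output=True, text=True, check=True
)
df = pd.read_csv(io.StringIO(filtered.stdout))
print(df.shape)
```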

Summary All of the fancy data platform tools and shiny dashboards that you use are pointless if the consumers of your analysis don’t have trust in the answers. Stemma helps you establish and maintain that trust by giving visibility into who is using what data, annotating the reports with useful context, and understanding who is responsible for keeping it up to date. In this episode Mark Grover explains what he is building at Stemma, how it expands on the success of the Amundsen project, and why trust is the most important asset for data teams.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today. We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial. Your host is Tobias Macey and today I’m interviewing Mark Grover about his work at Stemma to bring the Amundsen project to a wider audience and increase trust in their data.

Interview

Introduction
Can you describe what Stemma is and the story behind it?
Can you give me more context into how and why Stemma fits into the current data engineering world?
Among the popular tools of today for data warehousing and other products that stitch data together – what is Stemma’s place? Where does it fit into the workflow?
How has the explosion in options for data cataloging and discovery influenced your thinking on the necessary feature set for that class of tools?
How do you compare to your competitors?
With how long we have been using data and building systems to analyze it, why do you think that trust in the results is still such a momentous problem?
Tell me more about Stemma and how it compares to Amundsen.
Can you tell me more about the impact of Stemma/Amundsen to companies that use it?
What are the opportunities for innovating on top of Stemma to help organizations streamline communication between data producers and consumers?
Beyond the technological capabilities of a data platform, the bigger question is usually the social/organizational patterns around data. How have the "best practices" around the people side of data changed in the recent past?

What are the points of friction that

Summary Data lake architectures have largely been biased toward batch processing workflows due to the volume of data that they are designed for. With more real-time requirements and the increasing use of streaming data there has been a struggle to merge fast, incremental updates with large, historical analysis. Vinoth Chandar helped to create the Hudi project while at Uber to address this challenge. By adding support for small, incremental inserts into large table structures, and building support for arbitrary update and delete operations the Hudi project brings the best of both worlds together. In this episode Vinoth shares the history of the project, how its architecture allows for building more frequently updated analytical queries, and the work being done to add a more polished experience to the data lake paradigm.
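
The incremental-upsert pattern described above can be sketched from PySpark using Hudi's Spark datasource. The option names follow the Hudi documentation, but treat the paths, field names, and overall configuration as assumptions rather than a recipe from the episode.

```python
# Illustrative sketch: apply a batch of updates as an upsert into a
# Hudi-managed table in a data lake, via the Spark datasource API.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert-sketch").getOrCreate()
updates = spark.read.parquet("s3://bucket/new_events/")   # hypothetical path

(updates.write.format("hudi")
    .option("hoodie.table.name", "events")
    .option("hoodie.datasource.write.recordkey.field", "event_id")
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("s3://bucket/lake/events/"))
```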

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underly everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy! When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today. We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial. Your host is Tobias Macey and today I’m interviewing Vinoth Chandar about Apache Hudi, a data lake management layer for supporting fast and incremental updates to your tables.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Hudi is and the story behind it?
What are the use cases that it is focused on supporting?
There have been a number of alternative table formats introduced for data lakes recently. How does Hudi compare to projects like Iceberg, Delta Lake, Hive, etc.?
Can you describe how Hudi is architected?

How have the goals and design of Hudi changed or evolved since you first began working on it?
If you were to start the whole project over today, what would you do differently?

Can you talk through the lifecycle of a data record as it is ingested, compacted, and queried in a Hudi deployment?
One of the capabilities that is interesting to explore is support for arbitrary record deletion. Can you talk through why this is a challenging operation in data lake architectures?

How does Hudi make that a tractable problem?

What are the data platform components that are needed to support an installation of Hudi?
What is involved in migrating an existing data lake to use Hudi?

How would someone approach supporting heterogeneous table formats in their lake?

As someone who has invested a lot of time in technologies for supporting data lakes, what are your thoughts on the tradeoffs of data lake vs data warehouse and the current trajectory of the ecosystem?
What are the most interesting, innovative, or unexpected ways that you have seen Hudi used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Hudi?
When is Hudi the wrong choice?
What do you have planned for the future of Hudi?

Contact Info

Linkedin Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

Hudi Docs
Hudi Design & Architecture
Incremental Processing
CDC == Change Data Capture

Podcast Episodes

Oracle GoldenGate
Voldemort
Kafka
Hadoop
Spark
HBase
Parquet
Iceberg Table Format

Data Engineering Episode

Hive ACID Apache Kudu

Podcast Episode

Vertica Delta Lake

Podcast Episode

Optimistic Concurrency Control
MVCC == Multi-Version Concurrency Control
Presto
Flink

Podcast Episode

Trino

Podcast Episode

Gobblin LakeFS

Podcast Episode

Nessie

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Data Science Projects with Python - Second Edition

Data Science Projects with Python offers a hands-on, project-based approach to learning data science using real-world data sets and tools. You will explore data using Python libraries like pandas and Matplotlib, build machine learning models with scikit-learn, and apply advanced techniques like XGBoost and SHAP values. This book equips you to confidently extract insights, evaluate models, and deliver results with clarity.

What this Book will help me do
Learn to load, clean, and preprocess data using Python and pandas. Build and evaluate predictive models, including logistic regression and random forests. Visualize data effectively using Python libraries like Matplotlib. Master advanced techniques like XGBoost and algorithmic fairness. Communicate data-driven insights to aid decision making in practical scenarios.

Author(s)
Stephen Klosterman is an experienced data scientist with a strong focus on practical applications of machine learning in business. Combining a rich academic background with hands-on industry experience, he excels at explaining complex concepts in an approachable way. As the author of 'Data Science Projects with Python,' his goal is to provide learners with the skills needed for real-world data science challenges.

Who is it for?
This book is ideal for beginners in data science and machine learning who have some basic programming knowledge in Python. Aspiring data scientists will benefit from its practical, end-to-end examples. Professionals seeking to expand their skillset in predictive modeling and delivering business insights will find this book invaluable. Some foundation in statistics and programming is recommended.
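
For a sense of the modeling loop the book walks through, here is a compact scikit-learn sketch on synthetic data (not an example from the book):

```python
# Illustrative sketch: train two predictive models and compare them
# with ROC AUC on a held-out test set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LogisticRegression(max_iter=1000), RandomForestClassifier()):
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(type(model).__name__, round(auc, 3))
```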

Summary Everyone expects data to be transmitted, processed, and updated instantly as more and more products integrate streaming data. The technology to make that possible has been around for a number of years, but the barriers to adoption have still been high due to the level of technical understanding and operational capacity that have been required to run at scale. Datastax has recently introduced a new managed offering for Pulsar workloads in the form of Astra Streaming that lowers those barriers and makes streaming workloads accessible to a wider audience. In this episode Prabhat Jha and Jonathan Ellis share the work that they have been doing to integrate streaming data into their managed Cassandra service. They explain how Pulsar is being used by their customers, the work that they have done to scale the administrative workload for multi-tenant environments, and the challenges of operating such a data intensive service at large scale. This is a fascinating conversation with a lot of useful lessons for anyone who wants to understand the operational aspects of Pulsar and the benefits that it can provide to data workloads.
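
For context on what a Pulsar workload looks like from Python, here is a minimal producer/consumer sketch using the pulsar-client library against a hypothetical local broker; Astra Streaming layers its own connection URL and token authentication on top of this, which is omitted here.

```python
# Illustrative sketch: produce one message to a Pulsar topic and
# consume it back with an acknowledged subscription.
import pulsar

client = pulsar.Client("pulsar://localhost:6650")   # hypothetical broker

producer = client.create_producer("orders")
producer.send(b'{"order_id": 1, "total": 20.0}')

consumer = client.subscribe("orders", subscription_name="analytics")
msg = consumer.receive()
print(msg.data())
consumer.acknowledge(msg)

client.close()
```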

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underly everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy! When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today. We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial. Your host is Tobias Macey and today I’m interviewing Prabhat Jha and Jonathan Ellis about Astra Streaming, a cloud-native streaming platform built on Apache Pulsar

Interview

Introduction

How did you get involved in the area of data management?

Can you describe what the Astra platform is and the story behind it?

How does streaming fit into your overall product vision and the needs of your customers?

What was your selection process/criteria for adopting a streaming engine to complement your existing technology investment?

What are the core use cases that you are aiming to support with Astra Streaming?

Can you describe the architecture and automation of your hosted platform for Pulsar?

What are the integration points that you have built to make it work well with Cassandra?

What are some of the additional tools that you have added to your distribution of Pulsar to simplify operation and use?

What are some of the sharp edges that you have had to sand down as you have scaled up your usage of Pulsar?

What is the process for someone to adopt and integrate with your Astra Streaming service?

How do you handle migrating existing projects, particularly if they are using Kafka currently?

One of the capabilities that you highlight on the product page for Astra Streaming is the ability to execute machine learning workflows on data in flight. What are some of the supporting systems that are necessary to power that workflow?

What are the capabilities that are built into Pulsar that simplify the operational aspects of streaming ML?

What are the ways that you are engaging with and supporting the Pulsar community?

What are the near to medium term elements of the Pulsar roadmap that you are working toward and excited to incorporate into Astra?

What are the most interesting, innovative, or unexpected ways that you have seen Astra used?

What are the most interesting, unexpected, or challenging lessons that you have learned while working on Astra?

When is Astra the wrong choice?

What do you have planned for the future of Astra?

Contact Info

Prabhat

LinkedIn @prabhatja on Twitter prabhatja on GitHub

Jonathan

LinkedIn @spyced on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Pulsar

Podcast Episode Streamnative Episode

Datastax Astra Streaming
Datastax Astra DB
Luna Streaming Distribution
Datastax Cassandra
Kesque (formerly Kafkaesque)
Kafka
RabbitMQ
Prometheus
Grafana
Pulsar Heartbeat
Pulsar Summit
Pulsar Summit Presentation on Kafka Connectors
Replicated
Chaos Engineering
Fallout chaos engineering tools
Jepsen

Podcast Episode

Jack VanLightly

BookKeeper TLA+ Model

Change Data Capture

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Summary Data quality is a concern that has been gaining attention alongside the rising importance of analytics for business success. Many solutions rely on hand-coded rules for catching known bugs, or statistical analysis of records to detect anomalies retroactively. While those are useful tools, it is far better to prevent data errors before they become an outsized issue. In this episode Gleb Mezhanskiy shares some strategies for adding quality checks at every stage of your development and deployment workflow to identify and fix problematic changes to your data before they get to production.
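
As a toy illustration of the proactive, "diff your data before you deploy" idea (this is not Datafold's API), one could compare a production snapshot against the output of a proposed change with pandas; the file names below are hypothetical.

```python
# Illustrative sketch: compare row counts and per-column null rates
# between the current table and the output of a proposed change.
import pandas as pd

prod = pd.read_parquet("orders_prod.parquet")       # hypothetical snapshots
staged = pd.read_parquet("orders_staged.parquet")

report = pd.DataFrame({
    "prod_nulls": prod.isna().mean(),
    "staged_nulls": staged.isna().mean(),
})
report["null_rate_change"] = report["staged_nulls"] - report["prod_nulls"]

print(f"row count: {len(prod)} -> {len(staged)}")
print(report.sort_values("null_rate_change", ascending=False).head())
```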

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underly everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy! When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today. We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial. Your host is Tobias Macey and today I’m interviewing Gleb Mezhanskiy about strategies for proactive data quality management and his work at Datafold to help provide tools for implementing them

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what you are building at Datafold and the story behind it?
What are the biggest factors that you see contributing to data quality issues?

How are teams identifying and addressing those failures?

How does the data platform architecture impact the potential for introducing quality problems?
What are some of the potential risks or consequences of introducing errors in data processing?
How can organizations shift to being proactive in their data quality management?

How much of a role does tooling play in addressing the introduct

Advanced Analytics with Transact-SQL: Exploring Hidden Patterns and Rules in Your Data

Learn about business intelligence (BI) features in T-SQL and how they can help you with data science and analytics efforts without the need to bring in other languages such as R and Python. This book shows you how to compute statistical measures using your existing skills in T-SQL. You will learn how to calculate descriptive statistics, including centers, spreads, skewness, and kurtosis of distributions. You will also learn to find associations between pairs of variables, including calculating linear regression formulas and confidence levels with definite integration. No analysis is good without data quality. Advanced Analytics with Transact-SQL introduces data quality issues and shows you how to check for completeness and accuracy, and measure improvements in data quality over time. The book also explains how to optimize queries involving temporal data, such as when you search for overlapping intervals. More advanced time-oriented information in the book includes hazard and survival analysis. Forecasting with exponential moving averages and autoregression is covered as well. Every web/retail shop wants to know the products customers tend to buy together. Trying to predict the target discrete or continuous variable with few input variables is important for practically every type of business. This book helps you understand data science and the advanced algorithms use to analyze data, and terms such as data mining, machine learning, and text mining. Key to many of the solutions in this book are T-SQL window functions. Author Dejan Sarka demonstrates efficient statistical queries that are based on window functions and optimized through algorithms built using mathematical knowledge and creativity. The formulas and usage of those statistical procedures are explained so you can understand and modify the techniques presented. T-SQL is supported in SQL Server,Azure SQL Database, and in Azure Synapse Analytics. There are so many BI features in T-SQL that it might become your primary analytic database language. If you want to learn how to get information from your data with the T-SQL language that you already are familiar with, then this is the book for you. What You Will Learn Describe distribution of variables with statistical measures Find associations between pairs of variables Evaluate the quality of the data you are analyzing Perform time-series analysis on your data Forecast values of a continuous variable Perform market-basket analysis to predict customer purchasing patterns Predict target variable outcomes from one or more input variables Categorize passages of text by extracting and analyzing keywords Who This Book Is For Database developers and database administrators who want to translate their T-SQL skills into the world of business intelligence (BI) and data science. For readers who want to analyze large amounts of data efficiently by using their existing knowledge of T-SQL and Microsoft’s various database platforms such as SQL Server and Azure SQL Database. Also for readers who want to improve their querying by learning new and original optimization techniques.