talk-data.com


Do you work with tabular data? Learn how to clean, prepare, and organise datasets properly in Python.

Data Cleaning with Python Pandas

Working with real data means dealing with missing values, errors, duplicates, and inconsistent formats. Before any analysis or machine learning, data must be cleaned and prepared properly. Data cleaning is one of the most important and time-consuming tasks in data work. This session gives a clear and practical introduction to data cleaning using Python and Pandas. It focuses on common real-world problems and shows simple, correct ways to fix them.
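For a flavour of what this looks like in practice, here is a minimal pandas sketch on made-up data (illustrative only, not code from the session) covering the three problems named above: missing values, duplicates, and inconsistent formats.

```python
import numpy as np
import pandas as pd

# Toy dataset (made up for illustration) showing the problems described above:
# a missing value, an exact duplicate row, and inconsistent text formatting.
df = pd.DataFrame({
    "name": ["Alice", "bob ", "Alice", None],
    "age": [34.0, np.nan, 34.0, 29.0],
})

df = df.drop_duplicates()                          # remove the duplicate row
df["name"] = df["name"].str.strip().str.title()    # normalise text formatting
df["age"] = df["age"].fillna(df["age"].median())   # impute missing ages
```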

Who is this for?

Students, developers, and anyone who works with data and needs to clean and prepare datasets using Python. This session is useful if you work with messy files such as CSV or Excel, want to understand how Pandas handles missing or incorrect data, and want to build reliable data analysis pipelines.

Who is leading the session?

The session is led by Dr. Stelios Sotiriadis, CEO of Warestack and Associate Professor and MSc Programme Director at Birkbeck, University of London.

He works in data processing, distributed systems, cloud computing, and Python-based analytics. He holds a PhD from the University of Derby, completed a postdoctoral fellowship at the University of Toronto, and has worked with Huawei, IBM, Autodesk, and several startups. Since 2018, he has been teaching at Birkbeck and founded Warestack in 2021.

What we will cover

This is a hands-on introduction with real examples and short exercises. Topics include loading data with Pandas, inspecting datasets, handling missing values, fixing data types, removing duplicates, cleaning text data, filtering and transforming columns, combining datasets, and common data cleaning mistakes to avoid.
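Several of these topics fit in one short sketch. The CSV below is hypothetical, invented to show loading, deduplication, type fixing, text cleaning, and filtering in sequence:

```python
import io
import pandas as pd

# A hypothetical messy CSV of the kind the session covers.
raw = io.StringIO(
    "id,price,city\n"
    "1,10.5,london\n"
    "2,N/A,Paris\n"
    "2,N/A,Paris\n"
    "3,7.0,PARIS\n"
)
df = pd.read_csv(raw, na_values=["N/A"])   # load data, mark missing values

df = df.drop_duplicates()                  # remove duplicate rows
df["price"] = df["price"].astype(float)    # fix data types
df["city"] = df["city"].str.title()        # clean text data
cheap = df[df["price"] < 10]               # filter rows
```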

Requirements

A laptop with Python installed (Windows, macOS, or Linux), Visual Studio Code, and Python pip. Lab computers can be used if needed.

Format

A 1.5-hour live session with short explanations, live coding, and guided exercises. The session runs in person, with streaming available for remote participants.

Prerequisites

Basic to intermediate Python knowledge, including functions, loops, and basic data structures. Some familiarity with Pandas is helpful but not required.

Data Cleaning with Python Pandas

In this course, you’ll learn the fundamentals of preparing data for machine learning using Databricks. We’ll cover topics like exploring, cleaning, and organizing data tailored for traditional machine learning applications. We’ll also cover data visualization, feature engineering, and optimal feature storage strategies. By building a strong foundation in data preparation, this course equips you with the essential skills to create high-quality datasets that can power accurate and reliable machine learning and AI models. Whether you're developing predictive models or enabling downstream AI applications, these capabilities are critical for delivering impactful, data-driven solutions. Prerequisites: familiarity with the Databricks workspace, notebooks, and Unity Catalog; intermediate-level knowledge of Python (scikit-learn, Matplotlib), Pandas, and PySpark; and familiarity with the concepts of exploratory data analysis, feature engineering, standardization, and imputation methods. Labs: Yes Certification Path: Databricks Certified Machine Learning Associate
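Two of the prerequisite concepts, imputation and standardization, can be sketched in plain pandas (a minimal stand-in with an invented feature column; the course itself works in Databricks with PySpark and scikit-learn):

```python
import numpy as np
import pandas as pd

# A single hypothetical feature column with a missing value.
s = pd.Series([2.0, 4.0, np.nan, 6.0])

imputed = s.fillna(s.mean())                                     # mean imputation
standardized = (imputed - imputed.mean()) / imputed.std(ddof=0)  # z-score standardization
```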

AI/ML DataViz Databricks Matplotlib Pandas PySpark Python Scikit-learn
Data + AI Summit 2025

Discover all-practical implementations of the key algorithms and models for handling unlabeled data. Full of case studies demonstrating how to apply each technique to real-world problems. In Data Without Labels you’ll learn: Fundamental building blocks and concepts of machine learning and unsupervised learning Data cleaning for structured and unstructured data like text and images Clustering algorithms like K-means, hierarchical clustering, DBSCAN, Gaussian Mixture Models, and Spectral clustering Dimensionality reduction methods like Principal Component Analysis (PCA), SVD, Multidimensional scaling, and t-SNE Association rule algorithms like aPriori, ECLAT, and SPADE Unsupervised time series clustering, Gaussian Mixture models, and statistical methods Building neural networks such as GANs and autoencoders Working with Python tools and libraries like scikit-learn, NumPy, Pandas, Matplotlib, Seaborn, Keras, TensorFlow, and Flask How to interpret the results of unsupervised learning Choosing the right algorithm for your problem Deploying unsupervised learning to production Maintenance and refresh of an ML solution Data Without Labels introduces mathematical techniques, key algorithms, and Python implementations that will help you build machine learning models for unannotated data. You’ll discover hands-off and unsupervised machine learning approaches that can still untangle raw, real-world datasets and support sound strategic decisions for your business. Don’t get bogged down in theory—the book bridges the gap between complex math and practical Python implementations, covering end-to-end model development all the way through to production deployment. You’ll discover the business use cases for machine learning and unsupervised learning, and access insightful research papers to complete your knowledge.
About the Technology Generative AI, predictive algorithms, fraud detection, and many other analysis tasks rely on cheap and plentiful unlabeled data. Machine learning on data without labels—or unsupervised learning—turns raw text, images, and numbers into insights about your customers, accurate computer vision, and high-quality datasets for training AI models. This book will show you how. About the Book Data Without Labels is a comprehensive guide to unsupervised learning, offering a deep dive into its mathematical foundations, algorithms, and practical applications. It presents practical examples from retail, aviation, and banking using fully annotated Python code. You’ll explore core techniques like clustering and dimensionality reduction along with advanced topics like autoencoders and GANs. As you go, you’ll learn where to apply unsupervised learning in business applications and discover how to develop your own machine learning models end-to-end. What's Inside Master unsupervised learning algorithms Real-world business applications Curate AI training datasets Explore autoencoders and GANs applications About the Reader Intended for data science professionals. Assumes knowledge of Python and basic machine learning. About the Author Vaibhav Verdhan is a seasoned data science professional with extensive experience working on data science projects in a large pharmaceutical company. Quotes An invaluable resource for anyone navigating the complexities of unsupervised learning. A must-have. - Ganna Pogrebna, The Alan Turing Institute Empowers the reader to unlock the hidden potential within their data. - Sonny Shergill, AstraZeneca A must-have for teams working with unstructured data. Cuts through the fog of theory. Explains the theory and delivers practical solutions. - Leonardo Gomes da Silva, onGRID Sports Technology The Bible for unsupervised learning! Full of real-world applications, clear explanations, and excellent Python implementations.
- Gary Bake, Falconhurst Technologies
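As a small taste of the clustering material the book covers, here is a bare-bones K-means sketch in plain NumPy on synthetic data (illustrative only; this is not code from the book, and the deterministic initialisation is chosen purely to keep the toy example reproducible):

```python
import numpy as np

rng = np.random.default_rng(42)
# Two synthetic, well-separated 2-D blobs standing in for real data.
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))])

def kmeans(X, init_idx, iters=10):
    """Bare-bones K-means: assign each point to its nearest centre, recompute."""
    centers = X[init_idx].copy()
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(len(centers))])
    return labels, centers

# Seed one centre in each blob so the toy run is deterministic.
labels, centers = kmeans(X, init_idx=[0, 50])
```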

data data-science data-science-tools Pandas AI/ML Data Science GenAI Keras Matplotlib NumPy Python Seaborn TensorFlow
Pandas Workout 2024-06-05

Practice makes perfect pandas! Work out your pandas skills against dozens of real-world challenges, each carefully designed to build an intuitive knowledge of essential pandas tasks. In Pandas Workout you’ll learn how to: Clean your data for accurate analysis Work with rows and columns for retrieving and assigning data Handle indexes, including hierarchical indexes Read and write data with a number of common formats, such as CSV and JSON Process and manipulate textual data from within pandas Work with dates and times in pandas Perform aggregate calculations on selected subsets of data Produce attractive and useful visualizations that make your data come alive Pandas Workout hones your pandas skills to a professional level through two hundred exercises, each designed to strengthen your pandas skills. You’ll test your abilities against common pandas challenges such as importing and exporting, data cleaning, visualization, and performance optimization. Each exercise utilizes a real-world scenario based on real-world data, from tracking the parking tickets in New York City, to working out which country makes the best wines. You’ll soon find your pandas skills becoming second nature—no more trips to StackOverflow for what is now a natural part of your skillset. About the Technology Python’s pandas library can massively reduce the time you spend analyzing, cleaning, exploring, and manipulating data. And the only path to pandas mastery is practice, practice, and, you guessed it, more practice. In this book, Python guru Reuven Lerner is your personal trainer and guide through over 200 exercises guaranteed to boost your pandas skills. About the Book Pandas Workout is a thoughtful collection of practice problems, challenges, and mini-projects designed to build your data analysis skills using Python and pandas. The workouts use realistic data from many sources: the New York taxi fleet, Olympic athletes, SAT scores, oil prices, and more.
Each can be completed in ten minutes or less. You’ll explore pandas’ rich functionality for string and date/time handling, complex indexing, and visualization, along with practical tips for every stage of a data analysis project. What's Inside Clean data with less manual labor Retrieving and assigning data Process and manipulate text Calculations on selected data subsets About the Reader For Python programmers and data analysts. About the Author Reuven M. Lerner teaches Python and data science around the world and publishes the “Bamboo Weekly” newsletter. He is the author of Manning’s Python Workout (2020). Quotes A carefully crafted tour through the pandas library, jam-packed with wisdom that will help you become a better pandas user and a better data scientist. - Kevin Markham, Founder of Data School, Creator of pandas in 30 days Will help you apply pandas to real problems and push you to the next level. - Michael Driscoll, RFA Engineering, creator of Teach Me Python The explanations, paired with Reuven’s storytelling and personal tone, make the concepts simple. I’ll never get them wrong again! - Rodrigo Girão Serrão, Python developer and educator The definitive source! - Kiran Anantha, Amazon
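Two of the exercise themes, aggregating on subsets and date/time handling, can be previewed in miniature (hypothetical ticket-style records; the book's exercises use real datasets like the NYC parking data):

```python
import pandas as pd

# Hypothetical ticket-style records, invented for illustration.
df = pd.DataFrame({
    "issued": pd.to_datetime(["2024-01-03", "2024-01-15", "2024-02-02", "2024-02-20"]),
    "borough": ["Queens", "Queens", "Bronx", "Bronx"],
    "fine": [115, 65, 115, 50],
})

per_borough = df.groupby("borough")["fine"].sum()             # aggregate on subsets
per_month = df.groupby(df["issued"].dt.month)["fine"].mean()  # date/time handling
```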

data data-science data-science-tools Pandas CSV Data Science JSON Python

Sue Bayes shows us how to analyse data with Python starting with the basics and focussing on actionable knowledge. If you are interested in understanding what Python is and how to use it, this talk is for you. Topics:

  • Introduction to Python and data analysis
  • Python basics: syntax, variables, data types, control structures, functions and libraries
  • The Pandas library: dataframes and series, data import/export, filter, sort
  • Data cleaning & visualisation e.g., handling missing data
  • Visualisation with the matplotlib/seaborn packages
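A few of the bullets above in miniature, on a small made-up table (not data from the talk): handling missing values, then sorting and filtering a DataFrame.

```python
import numpy as np
import pandas as pd

# Small made-up table, not data from the talk.
df = pd.DataFrame({"product": ["A", "B", "C"], "sales": [200.0, np.nan, 150.0]})

df["sales"] = df["sales"].fillna(0)                  # handle missing data
ranked = df.sort_values("sales", ascending=False)    # sort
big = df[df["sales"] > 100]                          # filter
```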

This session aims to unlock Python’s potential for data analytics, making it accessible and practical for professionals seeking to enhance their analytical skills.

Speaker Bio Our speaker, Sue Bayes, is a Microsoft-certified Azure Enterprise Data Analyst Associate and Power BI Data Analyst Associate. She has over five years of independent work as a Power BI developer and data analyst, and has delivered comprehensive reporting solutions across both public and private sectors. Her expertise spans a diverse range of areas including project management, financial reporting, sector-specific analysis, and bespoke data cleansing.

Her technical proficiency encompasses a broad spectrum of tools and languages including R, Python, SQL, C#, as well as M and DAX. Prior to founding DataBayes Ltd., she spent 15 years as a lecturer in Business and Computing, a role that honed her teaching skills and deepened her passion for data analytics.

Python, in particular, has been a focus of her teaching. It is an exceptionally useful language for data analytics, owing to its simplicity and the powerful suite of tools it offers.

Python for Data Analysis - Sue Bayes

Please register using the zoom link to get a reminder:

https://us02web.zoom.us/webinar/register/1316984993505/WN_i227Ph51SHu9znRh6BAKyg

This workshop will be a hands-on tutorial for the Python pandas library. pandas is one of the most popular tools for manipulating, cleaning, integrating, and wrangling tabular data. Data scientists spend a significant amount of their time on such operations. This workshop introduces how pandas can be used in data analysis by working on real datasets.
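The integration and wrangling the workshop covers boils down to operations like the merge-then-aggregate below (two hypothetical tables, invented for illustration):

```python
import pandas as pd

# Two hypothetical tables to integrate, mirroring the workshop's wrangling topics.
orders = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [10, 11, 10]})
customers = pd.DataFrame({"customer_id": [10, 11], "name": ["Ada", "Grace"]})

merged = orders.merge(customers, on="customer_id", how="left")  # integrate tables
per_customer = merged.groupby("name")["order_id"].count()       # summarise
```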

The workshop will be held in a Jupyter notebook. One easy way to install Jupyter is through the Anaconda platform.

https://www.anaconda.com/products/individual

Agenda:

(PST) 10:25 am - 10:30 am: Arrival, socializing, and opening
(PST) 10:30 am - 12:20 pm: Dr. Yasin Ceran, "Data Manipulation with Pandas"
(PST) 12:20 pm - 12:30 pm: Q&A

About Dr. Yasin Ceran:

Yasin Ceran is passionate about all things data and has extensive experience in data analysis, mathematical modeling, Apache Spark, SQL, Python, and R. He is currently an associate professor at KAIST, South Korea, and also teaches at San Jose State University in the heart of Silicon Valley. Yasin has worked on a wide array of data-related projects encompassing data mining, statistics, and modeling, and is dedicated to sharing his experience and expertise with learners.

https://us02web.zoom.us/webinar/register/1316984993505/WN_i227Ph51SHu9znRh6BAKyg

Webinar Passcode 356741

Data Manipulation with Pandas

Please register using the zoom link to get a reminder:

https://us02web.zoom.us/webinar/register/1316984993505/WN_rmJ7nMIzQWK76evLIozZOg

This workshop will be a hands-on tutorial for the Python pandas library. pandas is one of the most popular tools for manipulating, cleaning, integrating, and wrangling tabular data. Data scientists spend a significant amount of their time on such operations. This workshop introduces how pandas can be used in data analysis by working on real datasets.

The workshop will be held in a Jupyter notebook. One easy way to install Jupyter is through the Anaconda platform.

https://www.anaconda.com/products/individual

Agenda:

(PST) 10:25 am - 10:30 am: Arrival, socializing, and opening
(PST) 10:30 am - 12:20 pm: Dr. Yasin Ceran, "Data Manipulation with Pandas"
(PST) 12:20 pm - 12:30 pm: Q&A

About Dr. Yasin Ceran:

Yasin Ceran is passionate about all things data and has extensive experience in data analysis, mathematical modeling, Apache Spark, SQL, Python, and R. He is currently an associate professor at KAIST, South Korea, and also teaches at San Jose State University in the heart of Silicon Valley. Yasin has worked on a wide array of data-related projects encompassing data mining, statistics, and modeling, and is dedicated to sharing his experience and expertise with learners.

https://us02web.zoom.us/webinar/register/1316984993505/WN_rmJ7nMIzQWK76evLIozZOg

Webinar Passcode 356741

Data Manipulation with Pandas

Sam Lau – author , Joseph Gonzalez – author , Deborah Nolan – author

As an aspiring data scientist, you appreciate why organizations rely on data for important decisions—whether it's for companies designing websites, cities deciding how to improve services, or scientists discovering how to stop the spread of disease. And you want the skills required to distill a messy pile of data into actionable insights. We call this the data science lifecycle: the process of collecting, wrangling, analyzing, and drawing conclusions from data. Learning Data Science is the first book to cover foundational skills in both programming and statistics that encompass this entire lifecycle. It's aimed at those who wish to become data scientists or who already work with data scientists, and at data analysts who wish to cross the "technical/nontechnical" divide. If you have a basic knowledge of Python programming, you'll learn how to work with data using industry-standard tools like pandas. Refine a question of interest to one that can be studied with data Pursue data collection that may involve text processing, web scraping, etc. Glean valuable insights about data through data cleaning, exploration, and visualization Learn how to use modeling to describe the data Generalize findings beyond the data
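The lifecycle stages described above (wrangle, explore, model) can be compressed into a tiny made-up example, assuming nothing from the book itself:

```python
import numpy as np
import pandas as pd

# A tiny made-up dataset standing in for the lifecycle stages described above.
df = pd.DataFrame({"x": [1, 2, 3, 4, np.nan], "y": [2.1, 3.9, 6.2, 8.1, 5.0]})

clean = df.dropna()                                        # wrangle/clean
summary = clean.describe()                                 # explore
slope, intercept = np.polyfit(clean["x"], clean["y"], 1)   # model: least-squares line
```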

data data-science Data Collection Data Science Pandas Python
O'Reilly Data Science Books
Dr. Sefer Baday – Assistant Professor @ Informatics Institute of Istanbul Technical University

A hands-on tutorial for the Python pandas library covering data manipulation, cleaning, integration, and wrangling of tabular data.

Pandas Python jupyter notebook
Mastering Data Manipulation with Pandas

Please register using the zoom link to get a reminder:

https://us02web.zoom.us/webinar/register/4616893679805/WN_LGk9QFbJS_qRAC5ifQNlHw

This workshop will be a hands-on tutorial for the Python pandas library. pandas is one of the most popular tools for manipulating, cleaning, integrating, and wrangling tabular data. Data scientists spend a significant amount of their time on such operations. This workshop introduces how pandas can be used in data analysis by working on real datasets.

The workshop will be held in a Jupyter notebook. One easy way to install Jupyter is through the Anaconda platform. https://www.anaconda.com/products/individual

Agenda:

11:45 am - 11:55 am: Arrival, socializing, and opening
11:55 am - 1:00 pm: Dr. Sefer Baday, "Mastering Data Manipulation with Pandas"
1:00 pm - 1:10 pm: Q&A

About Dr. Sefer Baday:

Dr. Baday works as an assistant professor at the Informatics Institute of Istanbul Technical University, Turkey. He holds a BS in chemical engineering and an MS in computational science and engineering from Bogazici and Koc Universities in Turkey. He obtained his PhD from the University of Basel, Switzerland. Prior to his current appointment, he worked as a researcher at the University of Cambridge, UK. His research applies molecular simulation and informatics approaches to drug discovery. He has taught various data-related courses, such as data analysis and visualization, and data science.

Please register using the zoom link to get a reminder:

https://us02web.zoom.us/webinar/register/4616893679805/WN_LGk9QFbJS_qRAC5ifQNlHw

Mastering Data Manipulation with Pandas
Daniel Y. Chen – author

Manage and Automate Data Analysis with Pandas in Python Today, analysts must manage data characterized by extraordinary variety, velocity, and volume. Using the open source Pandas library, you can use Python to rapidly automate and perform virtually any data analysis task, no matter how large or complex. Pandas can help you ensure the veracity of your data, visualize it for effective decision-making, and reliably reproduce analyses across multiple data sets. Pandas for Everyone, 2nd Edition, brings together practical knowledge and insight for solving real problems with Pandas, even if you're new to Python data analysis. Daniel Y. Chen introduces key concepts through simple but practical examples, incrementally building on them to solve more difficult, real-world data science problems such as using regularization to prevent data overfitting, or when to use unsupervised machine learning methods to find the underlying structure in a data set. New features in the second edition include: Extended coverage of plotting and the seaborn data visualization library Expanded examples and resources Updated Python 3.9 code and packages coverage, including statsmodels and scikit-learn libraries Online bonus material on geopandas, Dask, and creating interactive graphics with Altair Chen gives you a jumpstart on using Pandas with a realistic data set and covers combining data sets, handling missing data, and structuring data sets for easier analysis and visualization. He demonstrates powerful data cleaning techniques, from basic string manipulation to applying functions simultaneously across dataframes. Once your data is ready, Chen guides you through fitting models for prediction, clustering, inference, and exploration. He provides tips on performance and scalability and introduces you to the wider Python data analysis ecosystem.
Work with DataFrames and Series, and import or export data Create plots with matplotlib, seaborn, and pandas Combine data sets and handle missing data Reshape, tidy, and clean data sets so they're easier to work with Convert data types and manipulate text strings Apply functions to scale data manipulations Aggregate, transform, and filter large data sets with groupby Leverage Pandas' advanced date and time capabilities Fit linear models using statsmodels and scikit-learn libraries Use generalized linear modeling to fit models with different response variables Compare multiple models to select the best one Regularize to overcome overfitting and improve performance Use clustering in unsupervised machine learning ...
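The "reshape, tidy, and clean" theme above looks roughly like this in practice (a hypothetical wide-format table, not an example from the book):

```python
import pandas as pd

# A wide-format table (hypothetical) to tidy, as in the book's reshaping chapters.
wide = pd.DataFrame({"country": ["NO", "SE"], "2022": [5.4, 10.4], "2023": [5.5, 10.5]})

tidy = wide.melt(id_vars="country", var_name="year", value_name="pop")  # wide -> long
tidy["year"] = tidy["year"].astype(int)          # convert data types
avg = tidy.groupby("country")["pop"].mean()      # aggregate with groupby
```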

data data-science data-science-tools Pandas AI/ML Data Science DataViz Matplotlib Python Scikit-learn Seaborn
O'Reilly Data Science Books
Kyran Dale – author

How do you turn raw, unprocessed, or malformed data into dynamic, interactive web visualizations? In this practical book, author Kyran Dale shows data scientists and analysts--as well as Python and JavaScript developers--how to create the ideal toolchain for the job. By providing engaging examples and stressing hard-earned best practices, this guide teaches you how to leverage the power of best-of-breed Python and JavaScript libraries. Python provides accessible, powerful, and mature libraries for scraping, cleaning, and processing data. And while JavaScript is the best language when it comes to programming web visualizations, its data processing abilities can't compare with Python's. Together, these two languages are a perfect complement for creating a modern web-visualization toolchain. This book gets you started. You'll learn how to: Obtain data you need programmatically, using scraping tools or web APIs: Requests, Scrapy, Beautiful Soup Clean and process data using Python's heavyweight data processing libraries within the NumPy ecosystem: Jupyter notebooks with pandas+Matplotlib+Seaborn Deliver the data to a browser with static files or by using Flask, the lightweight Python server, and a RESTful API Pick up enough web development skills (HTML, CSS, JS) to get your visualized data on the web Use the data you've mined and refined to create web charts and visualizations with Plotly, D3, Leaflet, and other libraries

data data-science data-science-tasks data-visualization API DataViz HTML JavaScript Matplotlib NumPy Pandas Plotly Python Seaborn
O'Reilly Data Visualization Books

Python is a first-class tool for many researchers, primarily because of its libraries for storing, manipulating, and gaining insight from data. Several resources exist for individual pieces of this data science stack, but only with the new edition of Python Data Science Handbook do you get them all—IPython, NumPy, pandas, Matplotlib, Scikit-Learn, and other related tools. Working scientists and data crunchers familiar with reading and writing Python code will find the second edition of this comprehensive desk reference ideal for tackling day-to-day issues: manipulating, transforming, and cleaning data; visualizing different types of data; and using data to build statistical or machine learning models. Quite simply, this is the must-have reference for scientific computing in Python. With this handbook, you'll learn how: IPython and Jupyter provide computational environments for scientists using Python NumPy includes the ndarray for efficient storage and manipulation of dense data arrays Pandas contains the DataFrame for efficient storage and manipulation of labeled/columnar data Matplotlib includes capabilities for a flexible range of data visualizations Scikit-learn helps you build efficient and clean Python implementations of the most important and established machine learning algorithms
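The ndarray/DataFrame relationship described above can be shown in a few lines (toy values, not an example from the handbook):

```python
import numpy as np
import pandas as pd

# NumPy's ndarray stores dense numeric data and supports fast array maths...
arr = np.arange(12).reshape(3, 4)
col_means = arr.mean(axis=0)

# ...while a pandas DataFrame layers row and column labels on top of it.
df = pd.DataFrame(arr, columns=["a", "b", "c", "d"], index=["x", "y", "z"])
value = df.loc["y", "c"]
```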

software-development programming-languages Python AI/ML Data Science Matplotlib NumPy Pandas Scikit-learn
O'Reilly Data Science Books
Wes McKinney – author

Get the definitive handbook for manipulating, processing, cleaning, and crunching datasets in Python. Updated for Python 3.10 and pandas 1.4, the third edition of this hands-on guide is packed with practical case studies that show you how to solve a broad set of data analysis problems effectively. You'll learn the latest versions of pandas, NumPy, and Jupyter in the process. Written by Wes McKinney, the creator of the Python pandas project, this book is a practical, modern introduction to data science tools in Python. It's ideal for analysts new to Python and for Python programmers new to data science and scientific computing. Data files and related material are available on GitHub. Use the Jupyter notebook and IPython shell for exploratory computing Learn basic and advanced features in NumPy Get started with data analysis tools in the pandas library Use flexible tools to load, clean, transform, merge, and reshape data Create informative visualizations with matplotlib Apply the pandas groupby facility to slice, dice, and summarize datasets Analyze and manipulate regular and irregular time series data Learn how to solve real-world data analysis problems with thorough, detailed examples
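The groupby slice-dice-summarize and reshape facilities mentioned above, in miniature (illustrative records; the book's case studies use much larger real datasets):

```python
import pandas as pd

# Illustrative records, invented for this sketch.
df = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "year": [2023, 2024, 2023, 2024],
    "value": [1.0, 2.0, 3.0, 4.0],
})

summary = df.groupby("key")["value"].agg(["sum", "mean"])            # slice, dice, summarize
table = df.pivot_table(values="value", index="key", columns="year")  # reshape
```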

data data-science Data Science GitHub Matplotlib NumPy Pandas Python
O'Reilly Data Engineering Books
Kevin Kho – core contributor @ Fugue , Tobias Macey – host

Summary Python has grown to be one of the top languages used for all aspects of data, from collection and cleaning, to analysis and machine learning. Along with that growth has come an explosion of tools and engines that help power these workflows, which introduces a great deal of complexity when scaling from single machines and exploratory development to massively parallel distributed computation. In answer to that challenge the Fugue project offers an interface to automatically translate across Pandas, Spark, and Dask execution environments without having to modify your logic. In this episode core contributor Kevin Kho explains how the slight differences in the underlying engines can lead to big problems, how Fugue works to hide those differences from the developer, and how you can start using it in your own work today.
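The "write once, run anywhere" idea behind Fugue can be sketched with plain pandas; the Fugue call itself is shown only as a comment, since the exact API shape is an assumption here and the project docs are authoritative:

```python
import pandas as pd

# Logic written once against plain pandas; Fugue's pitch is that the same
# function can then run on Spark or Dask without rewrites.
def add_total(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(total=df["price"] * df["qty"])

pdf = pd.DataFrame({"price": [2.0, 3.0], "qty": [5, 4]})
local = add_total(pdf)  # executes directly on pandas

# With Fugue the call might look like this (sketch only; see the project docs):
#   from fugue import transform
#   transform(pdf, add_total, schema="*, total:double", engine="spark")
```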

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription. The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Every data project starts with collecting the information that will provide answers to your questions or inputs to your models. The web is the largest trove of information on the planet and Oxylabs helps you unlock its potential. With the Oxylabs scraper APIs you can extract data from even JavaScript-heavy websites. Combined with their residential proxies you can be sure that you’ll have reliable and high-quality data whenever you need it. Go to dataengineeringpodcast.com/oxylabs today and use code DEP25 to get your special discount on residential proxies. Your host is Tobias Macey and today I’m interviewing Kevin Kho about Fugue, a library that offers a unified interface for distributed computing that lets users execute Python, pandas, and SQL code on Spark and Dask without rewrites.

Interview

Introduction How did you get involved in the area of data management? Can you describe what Fugue is and the story behind it? What are the core goals of the Fugue project? Who are the target users for Fugue and how does that influence the feature priorities and API design? How does Fugue compare to projects such as Modin, etc. for abst

AI/ML API BigEye Data Engineering Data Management GitHub JavaScript Kubernetes Looker Modern Data Stack Pandas Python Snowflake Spark SQL
Matt Harrison – Python expert , Tobias Macey – host

Summary Pandas is a powerful tool for cleaning, transforming, manipulating, or enriching data, among many other potential uses. As a result it has become a standard tool for data engineers for a wide range of applications. Matt Harrison is a Python expert with a long history of working with data who now spends his time on consulting and training. He recently wrote a book on effective patterns for Pandas code, and in this episode he shares advice on how to write efficient data processing routines that will scale with your data volumes, while being understandable and maintainable.
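One pattern often cited for keeping pandas pipelines "understandable and maintainable" is method chaining; a minimal sketch on invented data (not code from the book or the episode):

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({"city": [" NYC", "la ", "NYC"], "temp": [70.0, np.nan, 68.0]})

# Method chaining keeps every cleaning step visible in one readable pipeline
# (a commonly recommended pattern; this snippet is illustrative only).
clean = (
    raw
    .assign(
        city=lambda d: d["city"].str.strip().str.upper(),
        temp=lambda d: d["temp"].fillna(d["temp"].mean()),
    )
    .query("temp >= 68")
)
```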

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Today’s episode is sponsored by Prophecy.io, the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design and deploy data pipelines on Apache Spark and Apache Airflow. Now all data users can apply software engineering best practices (git, tests, and continuous deployment) with a simple-to-use visual designer. How does it work? You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built-in metadata search and column-level lineage. Finally, if you have existing workflows in AbInitio, Informatica, or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy, making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.

Your host is Tobias Macey, and today I’m interviewing Matt Harrison about useful tips for using Pandas for data engineering projects.

Interview

Introduction
How did you get involved in the area of data management?
What are the main tasks that you have seen Pandas used for in a data engineering context?
What are some of the common mistakes that can lead to poor performance when scaling to large data sets?
What are some of the utility features that you have found most helpful for data processing?
One of the interesting add-ons to Pandas is its integration with Arrow. What are some of the considerations for how and when to use the Arrow capabilities vs. out-of-the-box Pandas?
Pandas is a tool that spans data processing and data science. What are some of the ways that data engineers should think about writing their code to make it accessible to data scientists for supporting collaboration across data workflows?
Pandas is often used for transformation logic. What are some of the ways that engineers should approach the design of their code to make it understandable and maintainable?
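The "common mistakes" question above usually surfaces row-wise iteration as the first offender. A minimal sketch on toy data (the frame and its columns are invented) contrasting `iterrows()` with the vectorized equivalent:

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Anti-pattern: a per-row Python loop, which defeats pandas' columnar engine.
slow = [row["price"] * row["qty"] for _, row in df.iterrows()]

# Idiomatic alternative: one vectorized operation over whole columns.
fast = (df["price"] * df["qty"]).tolist()

assert slow == fast  # both are [10.0, 40.0, 90.0]
```

On three rows the difference is invisible; on millions of rows the vectorized form is typically orders of magnitude faster.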

Airflow API Arrow BigEye Cloud Computing Data Engineering Data Management Data Science ETL/ELT Git Informatica Kubernetes Pandas Python Spark

In 'Data Science for Marketing Analytics', you'll embark on a journey that integrates the power of data analytics with strategic marketing. With a focus on practical application, this guide walks you through using Python to analyze datasets, implement machine learning models, and derive data-driven insights.

What this Book will help me do
Gain expertise in cleaning, exploring, and visualizing marketing data using Python.
Build machine learning models to predict customer behavior and sales outcomes.
Leverage unsupervised learning techniques for effective customer segmentation.
Compare and optimize predictive models using advanced evaluation methods.
Master Python libraries like pandas and Matplotlib for data manipulation and visualization.

Author(s)
Mirza Rahim Baig, Gururajan Govindan, and Vishwesh Ravi Shrimali combine their extensive expertise in data analytics and marketing to bring you this comprehensive guide. Drawing from years of applying analytics in real-world marketing scenarios, they provide a hands-on approach to learning data science tools and techniques.

Who is it for?
This book is perfect for marketing professionals and analysts eager to harness the capabilities of Python to enhance their data-driven strategies. It is also ideal for data scientists looking to apply their skills in marketing across various roles. While a basic understanding of data analysis and Python will help, all key concepts are introduced comprehensively for beginners.
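As a taste of the customer segmentation topic the blurb mentions (the book uses unsupervised learning; this stand-in uses only pandas, and the customers and spend figures are invented), spend can be bucketed into quantile-based segments:

```python
import pandas as pd

customers = pd.DataFrame({
    "customer": ["A", "B", "C", "D", "E", "F"],
    "spend": [10, 200, 35, 500, 120, 60],
})

# qcut splits on quantiles, so each segment gets roughly equal counts.
customers["segment"] = pd.qcut(customers["spend"], q=3,
                               labels=["low", "mid", "high"])
print(customers.sort_values("spend")[["customer", "segment"]].to_string(index=False))
```

A clustering approach (e.g. k-means on several behavioral features) would replace the single-column quantile split, but the output shape is the same: one segment label per customer.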

data data-science AI/ML Analytics Data Analytics Data Science Marketing Matplotlib Pandas Python
David Mertz – author

Dive into the intricacies of data cleaning, a crucial aspect of any data science and machine learning pipeline, with 'Cleaning Data for Effective Data Science.' This comprehensive guide walks you through tools and methodologies like Python, R, and command-line utilities to prepare raw data for analysis. Learn practical strategies to manage, clean, and refine data encountered in the real world.

What this Book will help me do
Understand and utilize various data formats such as JSON, SQL, and PDF for data ingestion and processing.
Master key tools like pandas, SciPy, and Tidyverse to manipulate and analyze datasets efficiently.
Develop heuristics and methodologies for assessing data quality, detecting bias, and identifying irregularities.
Apply advanced techniques like feature engineering and statistical adjustments to enhance data usability.
Gain confidence in handling time series data by employing methods for de-trending and interpolating missing values.

Author(s)
David Mertz has years of experience as a Python programmer and data scientist. Known for his engaging and accessible teaching style, David has authored numerous technical articles and books. He emphasizes not only the technicalities of data science tools but also the critical thinking that approaches solutions creatively and effectively.

Who is it for?
'Cleaning Data for Effective Data Science' is designed for data scientists, software developers, and educators dealing with data preparation. Whether you're an aspiring data enthusiast or an experienced professional looking to refine your skills, this book provides essential tools and frameworks. Prior programming knowledge, particularly in Python or R, coupled with an understanding of statistical fundamentals, will help you make the most of this resource.
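The time-series interpolation of missing values mentioned above takes only a few lines in pandas; the daily series below is invented for illustration:

```python
import pandas as pd

# A daily series with a two-day gap of missing observations.
idx = pd.date_range("2023-01-01", periods=5, freq="D")
s = pd.Series([1.0, None, None, 4.0, 5.0], index=idx)

# method="time" weights the fill by the actual time gaps between points,
# which matters when observations are irregularly spaced.
filled = s.interpolate(method="time")
print(filled.round(6).tolist())
```

For equally spaced points this matches plain linear interpolation; de-trending would then typically subtract a rolling mean or a fitted trend from the filled series.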

data data-science AI/ML Data Quality Data Science JSON Pandas Python SciPy SQL

The Data Wrangling Workshop is your beginner's guide to the essential techniques and practices of data manipulation using Python. Throughout the book, you will progressively build your skills, learning key concepts such as extracting, cleaning, and transforming data into actionable insights. By the end, you'll be confident in handling various data wrangling tasks efficiently.

What this Book will help me do
Understand and apply the fundamentals of data wrangling using Python.
Combine and aggregate data from diverse sources like web data, SQL databases, and spreadsheets.
Use descriptive statistics and plotting to examine dataset properties.
Handle missing or incorrect data effectively to maintain data quality.
Gain hands-on experience with Python's powerful data science libraries like Pandas, NumPy, and Matplotlib.

Author(s)
Brian Lipp, Shubhadeep Roychowdhury, and Dr. Tirthajyoti Sarkar are experienced educators and professionals in the fields of data science and engineering. Their collective expertise spans years of teaching and working with data technologies. They aim to make data wrangling accessible and comprehensible, focusing on practical examples to equip learners with real-world skills.

Who is it for?
The Data Wrangling Workshop is ideal for developers, data analysts, and business analysts aiming to become data scientists or analytics experts. If you're just getting started with Python, you will find this book guiding you step-by-step. A basic understanding of Python programming, as well as relational databases and SQL, is recommended for smooth learning.
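Handling missing or incorrect data, as the blurb describes, usually combines a handful of pandas calls; a toy sketch (the frame and its column names are invented):

```python
import pandas as pd

# Toy messy frame: inconsistent casing, an exact duplicate, missing values.
df = pd.DataFrame({
    "city": ["London", "london", "Paris", None],
    "sales": [100, 100, None, 50],
})

cleaned = (
    df.assign(city=df["city"].str.title())  # normalize text casing
      .drop_duplicates()                    # row 1 now duplicates row 0
      .dropna(subset=["city"])              # require a city value
      .fillna({"sales": 0})                 # default the missing sales figure
)
print(cleaned.to_dict("records"))  # only the London and Paris rows remain
```

The order matters: normalizing the text first is what makes the near-duplicate row an exact duplicate that `drop_duplicates()` can catch.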

data data-science data-science-tools Pandas Analytics Data Quality Data Science Matplotlib NumPy Python RDBMS SQL