In this course, you’ll learn the fundamentals of preparing data for machine learning using Databricks. We’ll cover exploring, cleaning, and organizing data for traditional machine learning applications, along with data visualization, feature engineering, and optimal feature storage strategies. By building a strong foundation in data preparation, this course equips you with the essential skills to create high-quality datasets that can power accurate and reliable machine learning and AI models. Whether you're developing predictive models or enabling downstream AI applications, these capabilities are critical for delivering impactful, data-driven solutions. Pre-requisites: Familiarity with the Databricks workspace, notebooks, and Unity Catalog; intermediate-level knowledge of Python (scikit-learn, Matplotlib), Pandas, and PySpark; and familiarity with the concepts of exploratory data analysis, feature engineering, standardization, and imputation methods. Labs: Yes Certification Path: Databricks Certified Machine Learning Associate
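As a taste of the standardization and imputation concepts listed in the prerequisites, here is a minimal sketch using scikit-learn and Pandas; the toy dataset and column names are hypothetical, not taken from the course materials.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with a missing value
df = pd.DataFrame({"age": [25, 32, np.nan, 51],
                   "income": [40_000, 58_000, 61_000, 90_000]})

# Impute missing values with the median, then standardize to zero mean and unit variance
prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
print(prep.fit_transform(df))
```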
This course equips professional-level machine learning practitioners with knowledge and hands-on experience in using Apache Spark™ for machine learning, including model fine-tuning. It also covers using the Pandas library for scalable machine learning tasks. The first section of the course focuses on the fundamentals of Apache Spark™ and its machine learning capabilities. The second section delves into fine-tuning models with the Hyperopt library. The final section covers the Pandas API on Apache Spark™, including guidance on Pandas UDFs (User-Defined Functions) and the Functions API for model inference. Pre-requisites: Familiarity with the Databricks workspace and notebooks; knowledge of machine learning model development and deployment with MLflow (e.g. a basic understanding of DS/ML concepts, common model metrics, and Python libraries, as well as a basic understanding of scaling workloads with Spark). Labs: Yes Certification Path: Databricks Certified Machine Learning Professional
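To illustrate the kind of Pandas UDF mentioned above, here is a minimal, hedged PySpark sketch; the column name and the stand-in "model" are hypothetical and not taken from the course.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(pd.DataFrame({"x": [1.0, 2.0, 3.0]}))

# A pandas UDF receives a pandas Series per batch and returns a Series,
# so vectorized inference (e.g. model.predict) can run on each partition.
@pandas_udf("double")
def predict(x: pd.Series) -> pd.Series:
    return x * 2.0  # stand-in for a real model's predictions

sdf.select(predict("x").alias("prediction")).show()
```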
Suppose you want to write a data science tool to do feature engineering. Your experience may go like this: - Expectation: you can focus on state-of-the-art techniques for feature engineering. - Reality: you keep having to make your codebase more complex because a new dataframe library has come out and users are demanding support for it.
Or rather, it might have gone like that in the pre-Narwhals era. Because now, you can focus on solving the problems which your tool set out to do, and let Narwhals handle the subtle differences between different kinds of dataframe inputs!
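For instance, a dataframe-agnostic function written with Narwhals might look like the following minimal sketch; the column names are made up for illustration.

```python
import narwhals as nw
import pandas as pd
import polars as pl

# One implementation of a feature-engineering step that accepts pandas,
# Polars, and other supported dataframes, returning the same native type it was given.
@nw.narwhalify
def add_ratio(df):
    return df.with_columns((nw.col("a") / nw.col("b")).alias("a_over_b"))

print(add_ratio(pd.DataFrame({"a": [1, 2], "b": [2, 4]})))  # pandas in, pandas out
print(add_ratio(pl.DataFrame({"a": [1, 2], "b": [2, 4]})))  # Polars in, Polars out
```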
Discover all-practical implementations of the key algorithms and models for handling unlabeled data. Full of case studies demonstrating how to apply each technique to real-world problems. In Data Without Labels you’ll learn:
Fundamental building blocks and concepts of machine learning and unsupervised learning
Data cleaning for structured and unstructured data like text and images
Clustering algorithms like K-means, hierarchical clustering, DBSCAN, Gaussian Mixture Models, and spectral clustering
Dimensionality reduction methods like Principal Component Analysis (PCA), SVD, multidimensional scaling, and t-SNE
Association rule algorithms like Apriori, ECLAT, and SPADE
Unsupervised time series clustering, Gaussian Mixture Models, and statistical methods
Building neural networks such as GANs and autoencoders
Working with Python tools and libraries like scikit-learn, NumPy, Pandas, Matplotlib, Seaborn, Keras, TensorFlow, and Flask
How to interpret the results of unsupervised learning
Choosing the right algorithm for your problem
Deploying unsupervised learning to production
Maintenance and refresh of an ML solution
Data Without Labels introduces mathematical techniques, key algorithms, and Python implementations that will help you build machine learning models for unannotated data. You’ll discover hands-off and unsupervised machine learning approaches that can still untangle raw, real-world datasets and support sound strategic decisions for your business. Don’t get bogged down in theory—the book bridges the gap between complex math and practical Python implementations, covering end-to-end model development all the way through to production deployment. You’ll discover the business use cases for machine learning and unsupervised learning, and access insightful research papers to complete your knowledge.
About the Technology Generative AI, predictive algorithms, fraud detection, and many other analysis tasks rely on cheap and plentiful unlabeled data. Machine learning on data without labels—or unsupervised learning—turns raw text, images, and numbers into insights about your customers, accurate computer vision, and high-quality datasets for training AI models. This book will show you how.
About the Book Data Without Labels is a comprehensive guide to unsupervised learning, offering a deep dive into its mathematical foundations, algorithms, and practical applications. It presents practical examples from retail, aviation, and banking using fully annotated Python code. You’ll explore core techniques like clustering and dimensionality reduction along with advanced topics like autoencoders and GANs. As you go, you’ll learn where to apply unsupervised learning in business applications and discover how to develop your own machine learning models end-to-end.
What's Inside
Master unsupervised learning algorithms
Real-world business applications
Curate AI training datasets
Explore autoencoder and GAN applications
About the Reader Intended for data science professionals. Assumes knowledge of Python and basic machine learning.
About the Author Vaibhav Verdhan is a seasoned data science professional with extensive experience working on data science projects in a large pharmaceutical company.
Quotes An invaluable resource for anyone navigating the complexities of unsupervised learning. A must-have.
- Ganna Pogrebna, The Alan Turing Institute
Empowers the reader to unlock the hidden potential within their data. - Sonny Shergill, AstraZeneca
A must-have for teams working with unstructured data. Cuts through the fog of theory: explains the concepts and delivers practical solutions. - Leonardo Gomes da Silva, onGRID Sports Technology
The Bible for unsupervised learning! Full of real-world applications, clear explanations, and excellent Python implementations. - Gary Bake, Falconhurst Technologies
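As a flavor of the clustering and dimensionality-reduction techniques the book covers, here is a minimal, illustrative sketch with scikit-learn; the synthetic data is made up and not drawn from the book's case studies.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Unlabeled synthetic data: three groups in five dimensions
X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)

# Reduce to two dimensions with PCA, then cluster with K-means (no labels used)
X_2d = PCA(n_components=2).fit_transform(X)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)
print(clusters[:10])
```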
If you know how to program, you have the skills to turn data into knowledge. This thoroughly revised edition presents statistical concepts computationally, rather than mathematically, using programs written in Python. Through practical examples and exercises based on real-world datasets, you'll learn the entire process of exploratory data analysis—from wrangling data and generating statistics to identifying patterns and testing hypotheses. Whether you're a data scientist, software engineer, or data enthusiast, you'll get up to speed on commonly used tools including NumPy, SciPy, and Pandas. You'll explore distributions, relationships between variables, visualization, and many other concepts. And all chapters are available as Jupyter notebooks, so you can read the text, run the code, and work on exercises all in one place.
Analyze data distributions and visualize patterns using Python libraries
Improve predictions and insights with regression models
Dive into specialized topics like time series analysis and survival analysis
Integrate statistical techniques and tools for validation, inference, and more
Communicate findings with effective data visualization
Troubleshoot common data analysis challenges
Boost reproducibility and collaboration in data analysis projects with interactive notebooks
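As a small taste of that exploratory workflow, the hedged sketch below summarizes and plots a distribution with Pandas and Matplotlib; the sample data is synthetic, not from the book.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic sample standing in for a real-world dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.normal(loc=10, scale=2, size=1_000)})

print(df["value"].describe())   # count, mean, std, quartiles
df["value"].hist(bins=30)       # quick look at the distribution's shape
plt.xlabel("value")
plt.ylabel("frequency")
plt.show()
```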
Learn how to speed up popular data science libraries such as pandas and scikit-learn by up to 50x in Google Colab using pre-installed NVIDIA RAPIDS Python libraries. Boost both speed and scale for your workflows by simply selecting a GPU runtime in Colab – no code changes required. In addition, Gemini helps Colab users incorporate GPUs and generate pandas code from simple natural language prompts.
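In practice, the zero-code-change path in Colab is typically the cudf.pandas extension; the sketch below assumes a GPU runtime and a hypothetical data.csv with category and amount columns.

```python
# Load the accelerator before importing pandas; existing pandas code stays unchanged.
%load_ext cudf.pandas

import pandas as pd

df = pd.read_csv("data.csv")                      # hypothetical file
print(df.groupby("category")["amount"].mean())    # dispatched to the GPU where supported
```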
This session is hosted by a Google Cloud Next sponsor.
Day 1 focuses on building and training neural networks with PyTorch: learn to implement a neural network for image classification from scratch.
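A minimal sketch of such a from-scratch classifier is shown below; the architecture and 28x28 input size are illustrative assumptions, not the workshop's actual notebook.

```python
import torch
from torch import nn

# A tiny image classifier: flatten a 28x28 grayscale image, then apply two linear layers
class TinyClassifier(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.net(x)

model = TinyClassifier()
images = torch.randn(32, 1, 28, 28)              # random batch standing in for real data
labels = torch.randint(0, 10, (32,))
loss = nn.CrossEntropyLoss()(model(images), labels)
loss.backward()                                  # gradients for one training step
```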
Focus on building and training neural networks with PyTorch.
Focus on visual dataset curation with FiftyOne and iterative improvement of image classification models.
Master data storytelling with Python using Pandas, Matplotlib, Seaborn, and Plotly. Gain hands-on insights into data analysis and visualization with Jupyter Notebook in VS Code.
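As one tiny example of the storytelling style described, the sketch below pairs a Seaborn line chart with a headline-style title; the sales figures are invented for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented monthly sales data for a simple narrative chart
df = pd.DataFrame({"month": ["Jan", "Feb", "Mar", "Apr"],
                   "sales": [120, 135, 128, 160]})

sns.lineplot(data=df, x="month", y="sales", marker="o")
plt.title("Sales picked up sharply in April")   # the headline carries the story
plt.show()
```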
DuckDB, an open source in-process database created for OLAP workloads, provides key advantages over more mainstream OLAP solutions: it's embeddable and optimized for analytics. It also integrates well with Python and is compatible with SQL, giving you the performance and flexibility of SQL right within your Python environment. This handy guide shows you how to get started with this versatile and powerful tool. Author Wei-Meng Lee takes developers and data professionals through DuckDB's primary features and functions, best practices, and practical examples of how you can use DuckDB for a variety of data analytics tasks. You'll also dive into specific topics, including how to import data into DuckDB, work with tables, perform exploratory data analysis, visualize data, perform spatial analysis, and use DuckDB with JSON files, Polars, and JupySQL.
Understand the purpose of DuckDB and its main functions
Conduct data analytics tasks using DuckDB
Integrate DuckDB with pandas, Polars, and JupySQL
Use DuckDB to query your data
Perform spatial analytics using DuckDB's spatial extension
Work with a diverse range of data including Parquet, CSV, and JSON
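To give a flavor of the in-process, Python-friendly workflow the book describes, here is a minimal sketch querying a pandas DataFrame with DuckDB; the table contents are made up.

```python
import duckdb
import pandas as pd

df = pd.DataFrame({"city": ["Oslo", "Lima", "Oslo"], "sales": [10, 20, 5]})

# DuckDB can query an in-memory pandas DataFrame by name, no loading step required
result = duckdb.sql("SELECT city, SUM(sales) AS total FROM df GROUP BY city").df()
print(result)
```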
Welcome to DataFramed Industry Roundups! In this series of episodes, Adel & Richie sit down to discuss the latest and greatest in data & AI. In this episode, we touch upon AI agents for data work, whether the full-stack data scientist will make a return, old languages making a comeback, Python's increase in performance, what they're both thankful for, and much more.
Links mentioned in the show:
Fractal’s Data Science Agent: Arya
Article: What Makes a True AI Agent? Rethinking the Pursuit of Autonomy
Cassie Kozyrkov on DataFramed
TIOBE Index for November 2024
Community discussion on Fortran
Tutorial: High Performance Data Manipulation in Python: pandas 2.0 vs. polars
Discover the power of pandas for your data analysis tasks. Pandas Cookbook provides practical, hands-on recipes for mastering pandas 2.x, guiding you through real-world scenarios quickly and effectively.
What this Book will help me do
Efficiently manipulate and clean data using pandas.
Perform advanced grouping and aggregation operations.
Handle time series data with pandas' robust functions.
Optimize pandas code for better performance.
Integrate pandas with tools like NumPy and databases.
Author(s) William Ayd and Matthew Harrison co-authored this insightful cookbook. With years of practical experience in data science and Python development, both authors aim to make data analysis accessible and efficient using pandas.
Who is it for? This book is perfect for Python developers and data analysts looking to enhance their data manipulation skills. Whether you're a beginner aiming to understand pandas or a professional seeking advanced insights, this book is tailored for anyone handling structured data.
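For example, a grouping-and-aggregation recipe of the kind the book covers might look like the following minimal sketch; the store data is invented.

```python
import pandas as pd

df = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "units": [3, 5, 2, 7],
    "price": [10.0, 10.0, 12.5, 12.5],
})

# Named aggregation: one groupby pass produces clearly labeled summary columns per store
summary = df.groupby("store").agg(
    total_units=("units", "sum"),
    avg_price=("price", "mean"),
)
print(summary)
```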
Wes McKinney and I chat about Positron, Arrow, how he created Pandas and Arrow, and what makes him tick.
This book covers modern data engineering functions and important Python libraries, to help you develop state-of-the-art ML pipelines and integration code. The book begins by explaining data analytics and transformation, delving into the Pandas library, its capabilities, and nuances. It then explores emerging libraries such as Polars and cuDF, providing insights into GPU-based computing and cutting-edge data manipulation techniques. The text discusses the importance of data validation in engineering processes, introducing tools such as Great Expectations and Pandera to ensure data quality and reliability. The book delves into API design and development, with a specific focus on leveraging the power of FastAPI. It covers authentication, authorization, and real-world applications, enabling you to construct efficient and secure APIs using FastAPI. Also explored is concurrency in data engineering, examining Dask's capabilities from basic setup to crafting advanced machine learning pipelines. The book includes development and delivery of data engineering pipelines using leading cloud platforms such as AWS, Google Cloud, and Microsoft Azure. The concluding chapters concentrate on real-time and streaming data engineering pipelines, emphasizing Apache Kafka and workflow orchestration in data engineering. Workflow tools such as Airflow and Prefect are introduced to seamlessly manage and automate complex data workflows. What sets this book apart is its blend of theoretical knowledge and practical application, a structured path from basic to advanced concepts, and insights into using state-of-the-art tools. With this book, you gain access to cutting-edge techniques and insights that are reshaping the industry. This book is not just an educational tool; it is a career catalyst and an investment in your future as a data engineering expert, poised to meet the challenges of today's data-driven world.
What You Will Learn
Elevate your data wrangling jobs by utilizing the power of both CPU and GPU computing, and learn to process data using Pandas 2.0, Polars, and cuDF at unprecedented speeds
Design data validation pipelines, construct efficient data service APIs, develop real-time streaming pipelines, and master the art of workflow orchestration to streamline your engineering projects
Leverage concurrent programming to develop machine learning pipelines and get hands-on experience in development and deployment of machine learning pipelines across AWS, GCP, and Azure
Who This Book Is For
Data analysts, data engineers, data scientists, machine learning engineers, and MLOps specialists
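As a quick illustration of the data-validation theme, here is a minimal, hedged Pandera sketch; the schema and column names are hypothetical and not taken from the book.

```python
import pandas as pd
import pandera as pa

# Hypothetical schema: 'id' must be a non-negative integer, 'amount' a positive float
schema = pa.DataFrameSchema({
    "id": pa.Column(int, pa.Check.ge(0)),
    "amount": pa.Column(float, pa.Check.gt(0)),
})

df = pd.DataFrame({"id": [1, 2, 3], "amount": [9.99, 14.50, 3.25]})
validated = schema.validate(df)   # raises a SchemaError if any check fails
print(validated)
```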
Learn how to leverage the scientific computing and data analysis capabilities of Python, its standard library, and popular open-source numerical Python packages like NumPy, SymPy, SciPy, matplotlib, and more. This book demonstrates how to work with mathematical modeling and solve problems with numerical, symbolic, and visualization techniques. It explores applications in science, engineering, data analytics, and more. Numerical Python, Third Edition, presents many case study examples of applications in fundamental scientific computing disciplines, as well as in data science and statistics. This fully revised edition, updated for each library's latest version, demonstrates Python's power for rapid development and exploratory computing due to its simple and high-level syntax and many powerful libraries and tools for computation and data analysis. After reading this book, readers will be familiar with many computing techniques, including array-based and symbolic computing, visualization and numerical file I/O, equation solving, optimization, interpolation and integration, and domain-specific computational problems, such as differential equation solving, data analysis, statistical modeling, and machine learning.
What You'll Learn
Work with vectors and matrices using NumPy
Review symbolic computing with SymPy
Plot and visualize data with Matplotlib
Perform data analysis tasks with Pandas and SciPy
Understand statistical modeling and machine learning with statsmodels and scikit-learn
Optimize Python code using Numba and Cython
Who This Book Is For
Developers who want to understand how to use Python and its ecosystem of libraries for scientific computing and data analysis.
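The hedged sketch below pairs symbolic differentiation in SymPy with numerical optimization in SciPy, in the spirit of the techniques listed above; the example function is chosen arbitrarily.

```python
import sympy
from scipy import optimize

# Symbolic computing: differentiate f(x) = x**2 * exp(-x) exactly with SymPy
x = sympy.symbols("x")
f = x**2 * sympy.exp(-x)
print(sympy.diff(f, x))            # 2*x*exp(-x) - x**2*exp(-x)

# Numerical computing: minimize the same function on [-1, 1] with SciPy
g = sympy.lambdify(x, f, "numpy")
res = optimize.minimize_scalar(g, bounds=(-1, 1), method="bounded")
print(res.x)                       # close to 0, where f is smallest on the interval
```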
Welcome to the cozy corner of the tech world where ones and zeros mingle with casual chit-chat. Datatopics Unplugged is your go-to spot for relaxed discussions around tech, news, data, and society. We dive into conversations smoother than your morning coffee (but let’s be honest, just as caffeinated) where industry insights meet light-hearted banter. Whether you’re a data wizard or just curious about the digital chaos around us, kick back and get ready to talk shop—unplugged style! In this episode:
Farewell Pandas, Hello Future: Pandas is out, and Ibis is in. We're talking faster, smarter data processing—featuring the rise of DuckDB and the powerhouse that is Polars. Is this the end of an era for Pandas?
UV vs. Rye: Forget pip—are these new Python package managers built in Rust the future? We break down UV, Rye, and what it all means for your next Python project.
AI-Generated Podcasts: Is AI about to take over your favorite podcasts? We explore the potential of Google’s Notebook LM to transform content into audio gold.
When AI Steals Your Voice: Jeff Geerling’s voice gets cloned by AI—without his consent. We dive into the wild world of voice cloning, the ethics, and the future of AI-generated media.
Hacking AI with Prompt Injection: Could you outsmart AI? We share some wild strategies from the game Gandalf that challenge your prompt injection skills and teach you how to jailbreak even the toughest guardrails.
Jony Ive’s New Gadget Rumor: Is Jony Ive plotting an Apple killer? Rumors are swirling about a new AI-powered handheld device that could shake up the smartphone market.
Zero-Downtime Deployments with Kamal Proxy: No more downtime! We geek out over Kamal Proxy, the sleek HTTP tool designed for effortless Docker deployments.
Function Calling and LLMs: Get ready for the next evolution in AI—function calling. We discuss its rise in LLMs and dive into the Gorilla project, the leaderboard testing the future of smart APIs.
Open Source Software, the backbone of today’s digital infrastructure, must be sustainable for the long-term. Qureshi and Fang (2011) find that motivating, engaging, and retaining new contributors is what makes open source projects sustainable.
Yet, as Steinmacher et al. (2015) identify, first-time open source contributors often lack timely answers to questions, newcomer orientation, mentors, and clear documentation. Moreover, since the term was first coined in 1998, open source has lagged far behind other technical domains in participant diversity. Trinkenreich et al. (2022) report that only about 5% of projects have women as core developers and that women authored less than 5% of pull requests, yet their pull requests were accepted at similar or even higher rates than men's. So, how can we achieve more diversity in open source communities and projects?
Bloomberg’s Women in Technology (BWIT) community, Open Source Program Office (OSPO), and Corporate Philanthropy team collaborated with NumFOCUS to develop a volunteer incentive model that aligns business value, philanthropic impact, and individual technical growth. Through it, participating Bloomberg engineers were given the opportunity to convert their hours spent contributing to the pandas open source project into a charitable donation to a non-profit of their choice.
The presenters will discuss how we wove together differing viewpoints: non-profit foundation and for-profit corporation, corporate philanthropy and engineers, first-time contributors and core devs. They will showcase why and how we converted technical contributions into charitable dollars, the difference this community-building model had in terms of creating a diverse and sustained group of new open source contributors, and the viability of extending this to other open source projects and corporate partners to contribute to the long-term sustainability of open source—thereby demonstrating the true convergence of tech and social impact.
NOTE: [1] Qureshi, I., and Fang, Y. "Socialization in open source software projects: A growth mixture modeling approach." 2011. [2] Steinmacher, I., et al. "Social barriers faced by newcomers placing their first contribution in open source software projects." 2015. [3] Trinkenreich, B., et al. "Women’s participation in open source software: A survey of the literature." 2022.
NetworkX is arguably the most popular graph analytics library available today, but one of its greatest strengths - the pure-Python implementation - is also possibly its biggest weakness. If you're a seasoned data scientist or a new student of the fascinating field of graph analytics, you're probably familiar with NetworkX and interested in how to make this extremely easy-to-use library powerful enough to handle realistically large graph workflows that often exceed the limitations of its pure-Python implementation.
This talk will describe a relatively new capability of NetworkX: support for accelerated backends, and how those backends benefit NetworkX users by allowing the library to finally be both easy to use and fast. Through the use of backends, NetworkX can also be incorporated into workflows that take advantage of similar accelerators, such as Accelerated Pandas (cudf.pandas), to finally make these easy-to-use solutions scale to larger problems.
Attend this talk to learn how you can leverage the various backends available to NetworkX today to seamlessly run graph analytics on GPUs, use GraphBLAS implementations, and more, all without leaving the comfort and convenience of the most popular graph analytics library available.
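A minimal sketch of the backend dispatch mechanism is shown below; it assumes a backend package such as nx-cugraph and a compatible GPU are installed, which is not part of a default NetworkX install.

```python
import networkx as nx

G = nx.erdos_renyi_graph(1_000, 0.01, seed=42)

# Without a backend argument, NetworkX uses its pure-Python implementation.
centrality_cpu = nx.betweenness_centrality(G)

# With an accelerated backend installed (e.g. nx-cugraph), the same call can be
# dispatched to a GPU implementation via the backend keyword.
centrality_gpu = nx.betweenness_centrality(G, backend="cugraph")
```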
Jupyter-based environments are getting a lot of traction for teaching computing, programming, and data science. The narrative structure of notebooks has indeed proven its value for guiding each student at their own pace to the discovery and understanding of new concepts or new idioms (e.g. how do I extract a column in pandas?). But these new pieces of knowledge tend to quickly fade and be forgotten: long-term acquisition of knowledge and skills takes reinforcement by repetition. This is the foundation of many online learning platforms like Webwork or WIMS that offer exercises with randomization and automatic feedback, and of popular "AI-powered" apps -- e.g. for learning foreign languages -- that use spaced repetition algorithms designed by educational science and neuroscience to deliver just the right amount of repetition.
What if you could author such exercises as notebooks, to benefit from everything that Jupyter can offer (think rich narratives, computations, visualization, interactions)? What if you could integrate such exercises right into your Jupyter-based course? What if a learner could get personalized exercise recommendations based on their past learning records, without having to give these sensitive pieces of information away?
That's Jupylates (work in progress). And thanks to the open source scientific stack, it's just a small Jupyter extension.