Pandas

Investing for Programmers

2025-09-29 · O'Reilly Data Science Books O'Reilly Amazon

book

by Stefan Papp

AI/ML API GenAI LLM Matplotlib NumPy Python data data-science data-science-tools

Maximize your portfolio, analyze markets, and make data-driven investment decisions using Python and generative AI. Investing for Programmers shows you how you can turn your existing skills as a programmer into a knack for making sharper investment choices. You’ll learn how to use the Python ecosystem, modern analytic methods, and cutting-edge AI tools to make better decisions and improve the odds of long-term financial success. In Investing for Programmers you’ll learn how to: Build stock analysis tools and predictive models Identify market-beating investment opportunities Design and evaluate algorithmic trading strategies Use AI to automate investment research Analyze market sentiments with media data mining In Investing for Programmers you'll learn the basics of financial investment as you conduct real market analysis, connect with trading APIs to automate buy-sell, and develop a systematic approach to risk management. Don’t worry—there’s no dodgy financial advice or flimsy get-rich-quick schemes. Real-life examples help you build your own intuition about financial markets, and make better decisions for retirement, financial independence, and getting more from your hard-earned money. About the Technology A programmer has a unique edge when it comes to investing. Using open-source Python libraries and AI tools, you can perform sophisticated analysis normally reserved for expensive financial professionals. This book guides you step-by-step through building your own stock analysis tools, forecasting models, and more so you can make smart, data-driven investment decisions. About the Book Investing for Programmers shows you how to analyze investment opportunities using Python and machine learning. In this easy-to-read handbook, experienced algorithmic investor Stefan Papp shows you how to use Pandas, NumPy, and Matplotlib to dissect stock market data, uncover patterns, and build your own trading models. You’ll also discover how to use AI agents and LLMs to enhance your financial research and decision-making process. What's Inside Build stock analysis tools and predictive models Design algorithmic trading strategies Use AI to automate investment research Analyze market sentiment with media data mining About the Reader For professional and hobbyist Python programmers with basic personal finance experience. About the Author Stefan Papp combines 20 years of investment experience in stocks, cryptocurrency, and bonds with decades of work as a data engineer, architect, and software consultant. Quotes Especially valuable for anyone looking to improve their investing. - Armen Kherlopian, Covenant Venture Capital A great breadth of topics—from basic finance concepts to cutting-edge technology. - Ilya Kipnis, Quantstrat Trader A top tip for people who want to leverage development skills to improve their investment possibilities. - Michael Zambiasi, Raiffeisen Digital Bank Brilliantly bridges the worlds of coding and finance. - Thomas Wiecki, PyMC Labs

Meet Docling: The “Pandas” for document AI

2025-09-24 · PyData Amsterdam 2025

talk

by Panos Vagenas (Docling / IBM Research) , Mingxuan Zhao

AI/ML

A workshop session to show you the basics on how to use Docling to enhance document ingestion in your AI workflow.

Narwhals: enabling universal dataframe support

2025-09-02 · PyData Berlin 2025 Watch

talk

by Marco Gorelli (Narwhals)

Data Science DuckDB Plotly Polars PySpark

Ever tried passing a Polars Dataframe to a data science library and found that it...just works? No errors, no panics, no noticeable overhead, just...results? This is becoming increasingly common in 2025, yet only 2 years ago, it was mostly unheard of. So, what changed? A large part of the answer is: Narwhals.

Narwhals is a lightweight compatibility layer between dataframe libraries which lets your code work seamlessly across Polars, pandas, PySpark, DuckDB, and more! And it's not just a theoretical possibility: with ~30 million monthly downloads and set as a required dependency of Altair, Bokeh, Marimo, Plotly, Shiny, and more, it's clear that it's reshaping the data science landscape. By the end of the talk, you'll understand why writing generic dataframe code was such a headache (and why it isn't anymore), how Narwhals works and how its community operates, and how you can use it in your projects today. The talk will be technical yet accessible and light-hearted.

More than DataFrames: Data Pipelines with the Swiss Army Knife DuckDB

2025-09-01 · PyData Berlin 2025 Watch

talk

by Mehdi Ouazza (MotherDuck)

Analytics DuckDB ETL/ELT Polars Python SQL

Most Python developers reach for Pandas or Polars when working with tabular data—but DuckDB offers a powerful alternative that’s more than just another DataFrame library. In this tutorial, you’ll learn how to use DuckDB as an in-process analytical database: building data pipelines, caching datasets, and running complex queries with SQL—all without leaving Python. We’ll cover common use cases like ETL, lightweight data orchestration, and interactive analytics workflows. You’ll leave with a solid mental model for using DuckDB effectively as the “SQLite for analytics.”

Addressing the Spark and Pandas duality. Smart feature creation with Stile

2025-08-26 · From a Fintech lens: MCP server live-coding & feature selection data hacks

talk

by Gilles Verbockhaven (ING Retail Banking Analytics)

PySpark Spark stile

The presentation will introduce the “Stile eco-system” developed at ING Analytics to speed up the time to market of machine learning models for the instant lending domain. The main issue to solve is the duality between Spark and Pandas for feature generation. Spark is used for development while dealing with the billions of transactions stored in the data warehouse. Pandas is used in production when applications are scored one by one in a real-time situation. During the presentation, Gilles will explain how the template for model development works, with a specific focus on feature creation. Additionally, Gilles will highlight how Pandas and PySpark are integrated in common functionalities, and the user-friendly testing framework developed to ensure consistency between the two worlds, and, finally, how to easily trim the code to only produce the features required for the final model. Gilles Verbockhaven is Chapter Lead at ING Retail Banking Analytics and manages a team of five Data Scientists. He has been working at ING for 20 years now and has experience in various domains, ranging from market risk to modelling. Since 2017, he has been working in the Machine Learning area and has specialized in designing analytic solutions for collections and pricing. In his free time, he spends his energy running and biking.

Narwhals: A lightweight compatibility layer between dataframe libraries

2025-07-31 · PyData Cardiff - July 2025

talk

by Marco Gorelli (Narwhals)

DuckDB Polars PySpark narwhals

Narwhals is a lightweight compatibility layer between dataframe libraries which lets your code work seamlessly across Polars, Pandas, PySpark, DuckDB and more.

Looking at the Data with Buckaroo

2025-07-30 · July Meetup: Building Better Workflows with Buckaroo & PyTorch

talk

Polars automatic cleaning buckaroo dataframe diffing low-code interface pandera

Explore Buckaroo, an open-source dataframe UI for pandas and polars that lets you scroll, search, and summarize dataframes directly in your notebook — no boilerplate code needed. You’ll learn how to use Buckaroo in daily workflows, from essential UI features to advanced capabilities like the low-code interface, dataframe diffing, automatic cleaning, and Pandera integration. This talk is aimed at data scientists with a basic understanding of pandas and Jupyter notebooks.

Accelerated DataFrames for all: Bringing GPU acceleration to pandas and Polars

2025-07-10 · SciPy 2025

talk

by Vyas Ramasubramani

Analytics Data Analytics Polars Python

In Python, data analytics users often prioritize convenience, flexibility, and familiarity over pure performance. The cuDF DataFrame library provides a pandas-like experience with from 10x up to 50x performance improvements, but subtle differences prevent it from being a true drop-in replacement for many users. This talk will showcase the evolution of this library to provide zero-code change experiences, first for pandas users and now for Polars. We will provide examples of this usage and a high level overview of how users can make use of these today. We will then delve into the details of how GPU acceleration is implemented differently in pandas and Polars, along with a deep dive into some of the different technical challenges encountered for each. This talk will have something for both data practitioners and library developers.

Advanced Machine Learning Techniques for Predicting Properties of Synthetic Aviation Fuels using Python

2025-07-10 · SciPy 2025

talk

by Ana Comesana

AI/ML NumPy Python Scikit-learn

Synthetic aviation fuels (SAFs) offer a pathway to improving efficiency, but high cost and volume requirements hinder property testing and increase risk of developing low-performing fuels. To promote productive SAF research, we used Fourier Transform Infrared (FTIR) spectra to train accurate, interpretable fuel property models. In this presentation, we will discuss how we leveraged standard Python libraries – NumPy, pandas, and scikit-learn – and Non-negative Matrix Factorization to decompose FTIR spectra and develop predictive models. Specifically, we will review the pipeline developed for preprocessing FTIR data, the ensemble models used for property prediction, and how the features correlate with physicochemical properties.

Generative AI in Engineering Education: A Tool for Learning, Not a Replacement for Skills

2025-07-09 · SciPy 2025

talk

by Ryan C Cooper

AI/ML GenAI NumPy SciPy

Generative Artificial Intelligence (AI) is reshaping engineering education by offering students new ways to engage with complex concepts and content. Ethical concerns including bias, intellectual property, and plagiarism make Generative AI a controversial educational tool. Overreliance on AI may also lead to academic integrity issues, necessitating clear student codes of conduct that define acceptable use. As educators we should carefully design learning objectives to align with transferrable career skills in our fields. By practicing backward design with a focus on career-readiness skills, we can incorporate useful prompt engineering, rapid prototyping, and critical reasoning skills that incorporate generative AI. Engineering students want to develop essential career skills such as critical thinking, communication, and technology. This talk will focus on case studies for using generative AI and rapid prototyping for scientific computing in engineering courses for physics, programming, and technical writing. These courses include assignments and reading examples using NumPy, SciPy, Pandas, etc. in Jupyter notebooks. Embracing generative AI tools has helped students compare, evaluate, and discuss work that was inaccessible before generative AI. This talk explores strategies for using AI in engineering education while accomplishing learning objectives and giving students opportunities to practice career readiness skills.

Dynamic Data with Matplotlib

2025-07-09 · SciPy 2025

talk

by Kyle Sunden

DataViz Matplotlib Python

Matplotlib is already a favorite plotting library for creating static data visualizations in Python. Here, we discuss the development of a new DataContainer interface and accompanying transformation pipeline which enable easier dynamic data visualization in Matplotlib. This improves the experience of plotting pure functions, automatically recomputing when you pan and zoom. Data containers can ingest data from a variety of sources, including structured data such as Pandas Dataframes or Xarrays, up to live updating data from web services or databases. The flexible transformation pipeline allows for control over how your data is encoded into a plot.

Building machine learning pipelines that scale: a case study using Ibis and IbisML

2025-07-07 · SciPy 2025

talk

by Anjali Datta , Deepyaman Datta

AI/ML Analytics Data Engineering Python Scikit-learn SQL

Pandas and scikit-learn have become staples in the machine learning toolkit for processing and modeling tabular data in Python. However, when data size scales up, these tools become slow or run out of memory. Ibis provides a unified, Pythonic, dataframe-like interface to 20+ execution backends, including dataframe libraries, databases, and analytics engines. Ibis enables users to leverage these powerful tools without rewriting their data engineering code (or learning SQL). IbisML extends the benefits of using Ibis to the ML workflow by letting users preprocess their data at scale on any Ibis-supported backend.

In this tutorial, you'll build an end-to-end machine learning project to predict the live win probability after each move during chess games.

Introduction to Data Analysis Using Pandas

2025-07-07 · SciPy 2025

talk

by Stefanie Molin

Matplotlib Python Seaborn

Working with data can be challenging: it often doesn’t come in the best format for analysis, and understanding it well enough to extract insights requires both time and the skills to filter, aggregate, reshape, and visualize it. This session will equip you with the knowledge you need to effectively use pandas – a powerful library for data analysis in Python – to make this process easier.

Pandas makes it possible to work with tabular data and perform all parts of the analysis from collection and manipulation through aggregation and visualization. While most of this session focuses on pandas, during our discussion of visualization, we will also introduce at a high level Matplotlib (the library that pandas uses for its visualization features, which when used directly makes it possible to create custom layouts, add annotations, etc.) and Seaborn (another plotting library, which features additional plot types and the ability to visualize long-format data).

The-Silmaril: Practice #ontology engineering with Python (and other languages).

2025-07-07 · SciPy 2025

talk

by Shaurya Agarwal

PySpark Python SciPy

Ontologies provide a powerful way to structure knowledge, enable reasoning, and support more meaningful queries compared to traditional data models. Recently, interest in ontologies has resurged, driven by advancements in language models, reasoning capabilities, and the growing adoption of platforms like Palantir Foundry.

In this hands-on tutorial, participants will explore ontology development across multiple domains using a variety of Python-based tools such as rdflib, Owlready2, PySpark, Pandas, and SciPy. They will learn how ontologies facilitate semantic reasoning, improve data interoperability, and enhance query capabilities.
Additionally, attendees will build a rudimentary reasoning engine to better understand inference mechanisms.
The tutorial emphasizes practical applications and comparisons with conventional data representations, making it ideal for researchers, data engineers, and developers interested in knowledge representation and reasoning.

All the SQL a Pythonista needs to know: an introduction to SQL and DataFrames with DuckDB

2025-07-07 · SciPy 2025

talk

by Guen Prawiroatmodjo , Jacob Matson (MotherDuck) , Alex Monahan (MotherDuck)

Cloud Computing DuckDB HTML Polars Python SQL

Structured Query Language (or SQL for short) is a programming language to manage data in a database system and an essential part of any data engineer’s tool kit. In this tutorial, you will learn how to use SQL to create databases, tables, insert data into them and extract, filter, join data or make calculations using queries. We will use DuckDB, a new open source embedded in-process database system that combines cutting edge database research with dataframe-inspired ease of use. DuckDB is only a pip install away (with zero dependencies), and runs right on your laptop. You will learn how to use DuckDB with your existing Python tools like Pandas, Polars, and Ibis to simplify and speed up your pipelines. Lastly, you will learn how to use SQL to create fast, interactive data visualizations, and how to teach your data how to fly and share it via the Cloud.

5 Simple Strategies To Enhance Your DAGs For Data Processing

2025-07-01 · Airflow Summit 2025

session

by William Orgertrice

Airflow Data Engineering

Take your DAGs in Apache Airflow to the next level? This is an insightful session where we’ll uncover 5 transformative strategies to enhance your data workflows. Whether you’re a data engineering pro or just getting started, this presentation is packed with practical tips and actionable insights that you can apply right away. We’ll dive into the magic of using powerful libraries like Pandas, share techniques to trim down data volumes for faster processing, and highlight the importance of modularizing your code for easier maintenance. Plus, you’ll discover efficient ways to monitor and debug your DAGs, and how to make the most of Airflow’s built-in features. By the end of this session, you’ll have a toolkit of strategies to boost the efficiency and performance of your DAGs, making your data processing tasks smoother and more effective. Don’t miss out on this opportunity to elevate your Airflow DAGs!

Lessons from Airflow gone wrong: How to set yourself up to scale successfully

2025-07-01 · Airflow Summit 2025

session

by Caitlin Petro , Annie Friedman (Astronomer)

Airflow Data Quality

Ever seen a DAG go rogue and deploy itself? Or try to time travel back to 1999? Join us for a light-hearted yet painfully relatable look at how not to scale your Airflow deployment to avoid chaos and debugging nightmares. We’ll cover the classics: hardcoded secrets, unbounded retries (hello, immortal task!), and the infamous spaghetti DAG where 200 tasks are lovingly connected by hand and no one dares open the Airflow UI anymore. If you’ve ever used datetime.now() in your DAG definition and watched your backfills implode, this talk is for you. From the BashOperator that became sentient to the XCom that tried to pass a whole Pandas DataFrame and the key to your mother’s house, we’ll walk through real-world bloopers with practical takeaways. You’ll learn why overusing PythonOperator is a recipe for mess, how not to use sensors unless you enjoy resource starvation, and why scheduling in local timezones is basically asking for a daylight savings time horror story. Other highlights include: Over-provisioning resources in KubernetesPodOperator: many teams allocate excessive memory/CPU “just in case”, leading to cluster contention and resource waste. Dynamic task mapping gone wild: 10,000 mapped tasks later… the scheduler is still crying. SLAs used as data quality guarantees: creating alerts so noisy, nobody listens. Design-free DAGs: no docs, no comments, no idea why a task has a 3-day timeout. Finally, we’ll round it out with some dos and don’ts: using environment variables, avoiding memory-hungry monolith DAGs, skipping global imports, and not allocating 10x more memory “just in case.” Whether you’re new to Airflow or battle-hardened from a thousand failed backfills, come learn how to scale your pipelines without losing your mind (or your cluster).

Definitive Python Polars with Jeroen Janssens and Thijs Nieuwdorp

2025-06-25 · The Joe Reis Show Listen

podcast_episode

by Thijs Nieuwdorp (VodafoneZiggo) , Joe Reis (DeepLearning.AI) , Dr. Jeroen Janssens (Posit)

Polars Python

Jeroen Janssens and Thijs Nieuwdo join me to chat about all things Polars. We discuss the evolution of the Polars library, its advantages over pandas, and their journey of writing 'Python Polars: The Definitive Guide.'

Hands-on ML pipeline bootcamp: from raw data to model evaluation

2025-06-21 · Introduction AI Mini Bootcamp - Dr. Yasin Ceran

workshop

NumPy cross-validation data transformation imputation techniques linear models

Date: Jun 14, 2025 | In this hands-on workshop, you will build a complete machine learning pipeline—from raw data to model evaluation. You’ll learn to prepare and clean data using NumPy and Pandas, handle missing values with imputation techniques, apply data transformation methods for effective modeling, train linear models with hyperparameter tuning, and use cross-validation to assess and improve model performance.

No-Code Change in Your Python UDF for Arrow Optimization

2025-06-10 · Data + AI Summit 2025 Watch

lightning_talk

by Hyukjin Kwon (Databricks)

API Arrow Python Spark

Apache Spark™ has introduced Arrow-optimized APIs such as Pandas UDFs and the Pandas Functions API, providing high performance for Python workloads. Yet, many users continue to rely on regular Python UDFs due to their simple interface, especially when advanced Python expertise is not readily available. This talk introduces a powerful new feature in Apache Spark that brings Arrow optimization to regular Python UDFs. With this enhancement, users can leverage performance gains without modifying their existing UDFs — simply by enabling a configuration setting or toggling a UDF-level parameter. Additionally, we will dive into practical tips and features for using Arrow-optimized Python UDFs effectively, exploring their strengths and limitations. Whether you’re a Spark beginner or an experienced user, this session will allow you to achieve the best of both simplicity and performance in your workflows with regular Python UDFs.

talk-data.com

Activity Trend

Top Events

Top Speakers

Investing for Programmers

Meet Docling: The “Pandas” for document AI

Narwhals: enabling universal dataframe support

More than DataFrames: Data Pipelines with the Swiss Army Knife DuckDB

Addressing the Spark and Pandas duality. Smart feature creation with Stile

Narwhals: A lightweight compatibility layer between dataframe libraries

Looking at the Data with Buckaroo

Accelerated DataFrames for all: Bringing GPU acceleration to pandas and Polars

Advanced Machine Learning Techniques for Predicting Properties of Synthetic Aviation Fuels using Python

Generative AI in Engineering Education: A Tool for Learning, Not a Replacement for Skills

Dynamic Data with Matplotlib

Building machine learning pipelines that scale: a case study using Ibis and IbisML

Introduction to Data Analysis Using Pandas

The-Silmaril: Practice #ontology engineering with Python (and other languages).

All the SQL a Pythonista needs to know: an introduction to SQL and DataFrames with DuckDB

5 Simple Strategies To Enhance Your DAGs For Data Processing

Lessons from Airflow gone wrong: How to set yourself up to scale successfully

Definitive Python Polars with Jeroen Janssens and Thijs Nieuwdorp

Hands-on ML pipeline bootcamp: from raw data to model evaluation

No-Code Change in Your Python UDF for Arrow Optimization