Search – talk-data.com

Title & Speakers	Event
Event SciPy 2025 2025-07-12

Towards Robust Security in Scientific Open Source Projects 2025-07-12 · 00:45

Juanita Gomez

In the open-source community, the security of software packages is a critical concern since it constitutes a significant portion of the global digital infrastructure. This BoF session will focus on the supply chain security of open-source software in scientific computing. We aim to bring together maintainers and contributors of scientific Python packages to discuss current security practices, identify common vulnerabilities, and explore tools and strategies to enhance the security of the ecosystem. Join us to share your experiences, challenges, and ideas on fortifying our open-source projects against potential threats and ensuring the integrity of scientific research.

Python Cyber Security

SciPy 2025 Sprint Prep BoF 2025-07-12 · 00:45

Come join the BoF to do a practice run on contributing to a GitHub project. We will walk through how to open a pull request for a bugfix, using the workflow that most libraries participating in the weekend sprints use (hosted by the sprint chairs).

GitHub SciPy

Agentic-Ai and latency implications 2025-07-12 · 00:45

Anil Sharma

Agent processing takes significant time, so what happens to the induced latency when agentic AI is added to an existing workflow? What are the latency challenges? What key strategies could overcome them? How should we change user expectations? What should be done to maintain or enhance the user experience? What trade-offs should be considered between performance, latency, cost, and so on?

AI/ML

SciPy 2026 2025-07-11 · 23:40

Madicken , Julie Hollek

Come share your ideas for next year's SciPy. Participants will have an opportunity to sign up to be on next year's organizing committee.

SciPy

Real-world Impacts of Generative AI in the Research Software Engineer and Data Scientist Workplace 2025-07-11 · 23:40

Steve Van Tuyl

Recent breakthroughs in large language model-based artificial intelligence (AI) have captured the public’s interest in AI more broadly. With the growing adoption of these technologies in professional and educational settings, public dialog about their potential impacts on the workforce has been ubiquitous. It is, however, difficult to separate the public dialog about the potential impact of the technology from the experienced impact of the technology in the research software engineer and data science workplace. Likewise, it is challenging to separate the generalized anxiety about AI from its specific impacts on individuals working in specialized work settings.

As research software engineers (RSEs) and those in adjacent computational fields engage with AI in the workplace, the realities of the impacts of this technology are becoming clearer. However, much of the dialog has been limited to high-level discussion around general intra-institutional impacts, and lacks the nuance required to provide helpful guidance to RSE practitioners in research settings, specifically. Surprisingly, many RSEs are not involved in career discussions on what the rise of AI means for their professions.

During this BoF, we will hold a structured, interactive discussion session with the goal of identifying critical areas of engagement with AI in the workplace including: current use of AI, AI assistance and automation, AI skills and workforce development, AI and open science, and AI futures. This BoF will represent the first of a series of discussions held jointly by the Academic Data Science Alliance and the US Research Software Engineer Association over the coming year, with support from Schmidt Sciences. The insights gathered from these sessions will inform the development of guidance resources on these topic areas for the broader RSE and computational data practitioner communities.

AI/ML Data Science GenAI LLM

GPU Accelerated Python 2025-07-11 · 23:40

Katrina Riehl

If you are interested in NumPy, SciPy, Signal Processing, Simulation, DataFrames, Linear Programming (LP), Vehicle Routing Problems (VRP), or Graph Analysis, we'd love to hear what performance you're seeing and how you're measuring it.

NumPy Python SciPy

Lightning Talks 2025-07-11 · 22:30

Lightning talks are 5-minute talks on any topic of interest for the SciPy community. We encourage spontaneous and prepared talks from everyone, but we can’t guarantee spots. Sign ups are at the NumFOCUS booth during the conference.

SciPy

Break 2025-07-11 · 22:05

Remote development for students and indie researchers with Spyder 2025-07-11 · 21:35

C.A.M. Gerlach , Carlos Cordoba

PhD students, postdocs, and independent researchers often struggle when they try to run locally developed code in the cloud or on HPC clusters for better performance. This is even more difficult if they can't count on IT staff to set up the necessary infrastructure on the remote machine, which is common in third-world countries. Spyder 6.1 will come with a set of improvements to address that limitation, from automatically setting up a server that runs code remotely on behalf of users to managing remote Conda environments and the remote file system from the comfort of a local Spyder installation.

Cloud Computing

Dive into Flytekit's Internals: A Python SDK to Quickly Bring your Code Into Production 2025-07-11 · 21:35

Thomas J. Fan

Flyte is a Linux Foundation OSS orchestrator built for data and machine learning workflows, focused on scalability, reliability, and developer productivity. Flyte’s Python SDK, Flytekit, empowers developers by shipping their code from their local environments onto a cluster with one simple CLI command. In this talk, you will learn about the design and implementation details that power Flytekit’s core features, such as “fast registration” and “type transformers”, and a plugin system that enables Dask, Ray, or distributed GPU workflows.

AI/ML Linux Python

Accelerating scientific data releases: Automated metadata generation with LLM agents 2025-07-11 · 21:35

Tudor Garbulet , Chirag Shah

The rapid growth of scientific data repositories demands innovative solutions for efficient metadata creation. In this talk, we present our open-source project that leverages large language models to automate the generation of standard-compliant metadata files from raw scientific datasets. Our approach uses pre-trained open-source models, fine-tuned with domain-specific data and integrated with LangGraph, to orchestrate a modular, end-to-end pipeline capable of ingesting heterogeneous raw data files and outputting metadata that conforms to specific standards.

The methodology involves a multi-stage process where raw data is first parsed and analyzed by the LLM to extract relevant scientific and contextual information. This information is then structured into metadata templates that adhere strictly to recognized standards, thereby reducing human error and accelerating the data release cycle. We demonstrate the effectiveness of our approach using the USGS ScienceBase repository, where we have successfully generated metadata for a variety of scientific datasets, including images, time series, and text data.

Beyond its immediate application to the USGS ScienceBase repository, our open-source framework is designed to be extensible, allowing adaptation to other data release processes across various scientific domains. We will discuss the technical challenges encountered, such as managing diverse data formats and ensuring metadata quality, and outline strategies for community-driven enhancements. This work not only streamlines the metadata creation workflow but also sets the stage for broader adoption of generative AI in scientific data management.

Additional Material:
- Project supported by USGS and ORNL
- Codebase will be available on GitHub after paper publication
- Fine-tuned LLMs will be available on Hugging Face after paper publication

AI/ML Data Management GenAI GitHub LLM

From the outside, in: How the napari community supports users and empowers transition to contribution 2025-07-11 · 20:55

Tim Monko

Napari, an open-source viewer for scientific data, has an inviting and well-established community that encourages contribution to its own project and the broader bioimage analysis community. This talk will explore how napari supports non-traditional contributors—especially those without formal software development experience—through its welcoming community, human-centered documentation, and rich plugin ecosystem.
As someone with a pure biology background, I will share my journey into computational bioimage analysis, the scientific Python world, and contributing to napari's community. By sharing my experience writing a plugin and contributing to the core project, I will show how community-driven projects like napari lower barriers to entry, empower scientists, and cultivate a diverse, engaged research and developer community.

Python

From Model to Trust: Building upon tamper-proof ML metadata records 2025-07-11 · 20:55

Mihai Maruseac

The increasing prevalence of AI models necessitates robust mechanisms to ensure their trustworthiness. This talk introduces a standardized, PKI-agnostic approach to verifying the origins and integrity of machine learning models, as built by the OpenSSF Model Signing project. We extend this methodology beyond models to encompass datasets and other associated files, offering a holistic solution for maintaining data provenance and integrity.
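As a rough illustration of the integrity half of this idea, the sketch below hashes every file in a model directory into a manifest and later reports which files no longer match. It is not the OpenSSF Model Signing API or its on-disk format; the function names `build_manifest`, `manifest_digest`, and `changed_files` are hypothetical, and the signing step itself (whatever PKI or keyless scheme is used) is deliberately left out.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def build_manifest(root):
    """Map each file under `root` (relative POSIX path) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def manifest_digest(manifest):
    """Digest of the canonical manifest: the value a signer would sign (assumption)."""
    payload = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def changed_files(root, manifest):
    """Return paths that were added, removed, or modified since the manifest was built."""
    current = build_manifest(root)
    return sorted(
        path
        for path in manifest.keys() | current.keys()
        if manifest.get(path) != current.get(path)
    )


# Demo on a throwaway directory standing in for a model and its associated files.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)
    (model_dir / "weights.bin").write_bytes(b"\x00\x01\x02")
    (model_dir / "config.json").write_text('{"layers": 2}')
    manifest = build_manifest(model_dir)
    before = changed_files(model_dir, manifest)
    (model_dir / "weights.bin").write_bytes(b"\x00\x01\x03")  # simulated tampering
    after = changed_files(model_dir, manifest)

print(before, after)
```

Because the manifest treats every file the same way, the same approach covers datasets and other files stored next to the model, which is the extension the abstract describes.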

AI/ML

marimo: an open-source reactive Python notebook 2025-07-11 · 20:55

Akshay Agrawal – guest @ Marimo

Python notebooks are a workhorse of scientific computing. But traditional notebooks have problems — they suffer from a reproducibility crisis; they are difficult to use with interactive widgets; their file format does not play well with Git; and they aren't reusable like regular Python scripts or modules.

This talk presents marimo, an open-source reactive Python notebook that addresses these concerns by modeling notebooks as dataflow graphs and storing them as Python files. We discuss design decisions and their tradeoffs, and show how these decisions make marimo notebooks reproducible in execution and packaging, Git-friendly, executable as scripts, and shareable as apps.
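To make the dataflow-graph idea concrete, here is a minimal sketch of how a reactive notebook might decide which cells are stale after an edit. It is not marimo's implementation: the helpers `analyze`, `build_graph`, and `cells_to_rerun` are hypothetical, cells are plain strings, and name analysis is simplified to top-level variable reads and writes found with the standard `ast` module.

```python
import ast
from collections import deque


def analyze(source):
    """Return (defs, refs): names a cell assigns and outside names it reads."""
    defs, refs = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defs.add(node.id)
            else:
                refs.add(node.id)
    return defs, refs - defs


def build_graph(cells):
    """Map each cell to the set of cells that read a name it defines."""
    info = {name: analyze(source) for name, source in cells.items()}
    graph = {name: set() for name in cells}
    for producer, (defs, _) in info.items():
        for consumer, (_, refs) in info.items():
            if producer != consumer and defs & refs:
                graph[producer].add(consumer)
    return graph


def cells_to_rerun(graph, changed):
    """Breadth-first walk: the edited cell and everything downstream of it."""
    stale, seen, queue = [], {changed}, deque([changed])
    while queue:
        cell = queue.popleft()
        stale.append(cell)
        for child in sorted(graph[cell]):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return stale


cells = {
    "load": "data = [1, 2, 3]",
    "scale": "scaled = [x * 10 for x in data]",
    "total": "result = sum(scaled)",
    "unrelated": "greeting = 'hello'",
}
graph = build_graph(cells)
print(cells_to_rerun(graph, "load"))
```

Editing `load` marks `scale` and `total` as stale while `unrelated` is left alone. The list is in discovery order; a fuller implementation would sort the stale cells topologically before running them.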

Dataflow Git Python

From One Notebook to Many Reports: Automating with Quarto 2025-07-11 · 20:15

Charlotte Wickham

Would you rather read a “Climate summary” or a “Climate summary for exactly where you live”? Producing documents that tailor your scientific results to an individual or their situation increases understanding, engagement, and connection. But producing many reports can be onerous.

If you are looking for a way to automate producing many reports, or you produce reports like this but find yourself in copy-and-paste hell, come along to learn how Quarto solves this problem with parameterized reports - you create a single Python notebook, but you generate many beautiful customized PDFs.
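The one-notebook, many-reports workflow can be sketched as a loop that calls the Quarto command line once per parameter set. This is an assumption-laden example rather than the speaker's code: the notebook name `climate.ipynb`, the `city` parameter, and the output naming scheme are hypothetical, and the notebook is assumed to have a cell tagged `parameters` that defines `city`. The `-P name:value`, `--to`, and `--output` options are Quarto's documented render flags.

```python
import subprocess


def render_command(notebook, params, output):
    """Build one `quarto render` invocation for a single set of parameters."""
    cmd = ["quarto", "render", notebook, "--to", "pdf", "--output", output]
    for name, value in params.items():
        cmd += ["-P", f"{name}:{value}"]
    return cmd


def render_all(notebook, param_sets, run=subprocess.run):
    """Render the same notebook once per parameter set, one PDF each."""
    for params in param_sets:
        slug = params["city"].lower().replace(" ", "-")
        run(render_command(notebook, params, f"climate-{slug}.pdf"), check=True)


# Dry run: print the commands instead of executing them, so Quarto is not required.
render_all(
    "climate.ipynb",  # hypothetical notebook
    [{"city": "Tacoma"}, {"city": "Corvallis"}],
    run=lambda cmd, check: print(" ".join(cmd)),
)
```

Dropping the `run=` argument makes `render_all` call the real `quarto` executable, which must then be installed and on the `PATH`.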

Slides

GitHub Python

Learning the art of fostering open-source communities 2025-07-11 · 20:15

Sanket Verma

Open-source projects are intricate ecosystems that consist of humans contributing in a diverse manner. These contributions are one of the essential elements driving the projects and must be encouraged. The humans behind these contributions play a vital role in constituting the lively and diverse community of the project. Both the humans and their contributions must be preserved and handled with utmost care for the success and evolution of the project.

As with every community, certain best practices should be followed to maintain its health, and certain pitfalls should be avoided. In this talk, I’ll share what I have learned from maintaining the vibrant and wonderful Zarr project and its community over the years.

Real-time ML: Accelerating Python for inference (< 10ms) at scale 2025-07-11 · 20:15

Elliot Marx

Real-time machine learning depends on features and data that by definition can’t be pre-computed. Detecting fraud or acute diseases like sepsis requires processing events that emerged seconds ago. How do we build an infrastructure platform that executes complex data pipelines (< 10ms) end-to-end and on demand, all while meeting data teams where they are: in Python, the language of ML? Learn how we built a symbolic interpreter that accelerates ML pipelines by transpiling Python into DAGs of static expressions. These expressions are optimized in C++ and eventually run in production workloads at scale with Velox, an OSS (~4k stars) unified query engine (C++) from Meta.
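One common way to turn ordinary Python into a graph of static expressions is to call the function with symbolic placeholders that record operations instead of computing them. The sketch below shows that technique in pure Python; it is not the speaker's interpreter and does not involve Velox or C++. The names `Expr`, `col`, `lift`, `evaluate`, and `risk_feature` are hypothetical, and only `+`, `*`, and `>` with the symbolic value on the left are supported.

```python
import operator
from dataclasses import dataclass


@dataclass(frozen=True)
class Expr:
    """A node in the expression DAG: an operation name plus its arguments."""
    op: str
    args: tuple = ()

    def __add__(self, other):
        return Expr("add", (self, lift(other)))

    def __mul__(self, other):
        return Expr("mul", (self, lift(other)))

    def __gt__(self, other):
        return Expr("gt", (self, lift(other)))


def lift(value):
    """Wrap plain Python constants so every argument is an Expr."""
    return value if isinstance(value, Expr) else Expr("const", (value,))


def col(name):
    """Symbolic reference to an input column."""
    return Expr("col", (name,))


OPS = {"add": operator.add, "mul": operator.mul, "gt": operator.gt}


def evaluate(expr, row):
    """Reference evaluator; a real engine would compile the DAG instead."""
    if expr.op == "const":
        return expr.args[0]
    if expr.op == "col":
        return row[expr.args[0]]
    left, right = (evaluate(arg, row) for arg in expr.args)
    return OPS[expr.op](left, right)


def risk_feature(amount, n_recent):
    # Ordinary-looking Python; with symbolic inputs it builds a graph.
    return amount * 0.01 + n_recent * 2 > 5


graph = risk_feature(col("amount"), col("n_recent"))
print(graph.op)
print(evaluate(graph, {"amount": 250.0, "n_recent": 2}))
```

Once the function has been traced, the resulting `graph` is plain data that can be inspected, optimized, or handed to a faster backend without running the original Python again.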

AI/ML Python

Lunch 2025-07-11 · 19:00

(Exclusively on Zoom) Not Remotely Fun: Virtual Lightning Talks 2025-07-11 · 19:00

Sign up for the CHANCE to give a 5-minute lightning talk by messaging David Nicholson or Rebecca BurWei on Slack. Or, show up to the Zoom on time and we'll take names for the first 5 minutes. Talks will be randomly selected. Virtual surprises await! Virtual and in-person conference attendees welcome!

Zoom: https://numfocus-org.zoom.us/j/82704423021?pwd=rJSUmdWwGaqIL8WKY4s6l7B6049rBM.1 2025-07-11 12:00 until 2026-07-11 13:00

Processing Cloud-optimized data in Python (Dataplug) 2025-07-11 · 18:25

Universitat Rovira i Virgili (Pedro Garcia Lopez) , Enrique Molina Giménez

The elasticity of the cloud is very appealing for processing large scientific data. However, enormous volumes of unstructured research data, totaling petabytes, remain untapped in data repositories because efficient parallel data access is lacking. Partitioning these data evenly to enable parallel processing requires a complete rewrite to storage, which becomes prohibitively expensive at high volumes. In this article we present Dataplug, an extensible framework that enables fine-grained parallel access to unstructured scientific data in object storage. Dataplug employs read-only, format-aware indexing, which allows dynamically sized partitions to be defined using various partitioning strategies. This approach avoids writing the partitioned dataset back to storage, enabling distributed workers to fetch data partitions on the fly directly from large data blobs, efficiently leveraging the high bandwidth of object storage. Validations on genomic (FASTQGZip) and geospatial (LiDAR) data formats demonstrate that Dataplug considerably lowers pre-processing compute costs (by 65.5% to 71.31%) without imposing significant overheads.
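The indexing idea can be illustrated with a toy example: scan a blob once to record where each record starts, then describe partitions as byte ranges over the untouched blob. This is only a sketch of the general technique, not Dataplug's API. The functions `build_index`, `plan_partitions`, and `read_partition` are hypothetical, the "format" is newline-delimited text, and an in-memory `io.BytesIO` stands in for an object in cloud storage.

```python
import io


def build_index(blob):
    """One read-only pass: byte offset where each newline-delimited record starts."""
    blob.seek(0)
    offsets, position = [], 0
    for line in blob:
        offsets.append(position)
        position += len(line)
    return offsets, position  # record starts and total size in bytes


def plan_partitions(offsets, size, workers):
    """Choose record-aligned byte ranges for `workers` without copying any data."""
    per_worker = max(1, -(-len(offsets) // workers))  # ceiling division
    starts = offsets[::per_worker]
    ends = starts[1:] + [size]
    return list(zip(starts, ends))


def read_partition(blob, byte_range):
    """Stand-in for a ranged GET against object storage."""
    start, end = byte_range
    blob.seek(start)
    return blob.read(end - start)


blob = io.BytesIO(b"r1,AAAA\nr2,CC\nr3,GGGGGG\nr4,T\nr5,ACGT\n")
offsets, size = build_index(blob)
ranges = plan_partitions(offsets, size, workers=2)
for byte_range in ranges:
    print(byte_range, read_partition(blob, byte_range))
```

Changing `workers` produces a different set of ranges from the same index, so the partitioning can be chosen per job while the stored blob is never rewritten.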

Cloud Computing Python
