In the open-source community, the security of software packages is a critical concern, since open-source software constitutes a significant portion of the global digital infrastructure. This BoF session will focus on the supply chain security of open-source software in scientific computing. We aim to bring together maintainers and contributors of scientific Python packages to discuss current security practices, identify common vulnerabilities, and explore tools and strategies to enhance the security of the ecosystem. Join us to share your experiences, challenges, and ideas on fortifying our open-source projects against potential threats and ensuring the integrity of scientific research.
If you are interested in NumPy, SciPy, Signal Processing, Simulation, DataFrames, Linear Programming (LP), Vehicle Routing Problems (VRP), or Graph Analysis, we'd love to hear what performance you're seeing and how you're measuring it.
Flyte is a Linux Foundation OSS orchestrator built for data and machine learning workflows, focused on scalability, reliability, and developer productivity. Flyte's Python SDK, Flytekit, empowers developers by shipping their code from their local environments onto a cluster with one simple CLI command. In this talk, you will learn about the design and implementation details that power Flytekit's core features, such as "fast registration" and "type transformers", and its plugin system that enables Dask, Ray, or distributed GPU workflows.
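The decorator-driven shape of this API looks roughly like the sketch below. The `@task` and `@workflow` decorators and the `pyflyte` CLI are part of Flytekit's public API; the tasks themselves are illustrative:

```python
# Minimal Flytekit sketch: type annotations are what Flytekit's "type
# transformers" use to move values through Flyte's typed data plane.
from typing import List

from flytekit import task, workflow

@task
def mean(values: List[float]) -> float:
    return sum(values) / len(values)

@task
def report(avg: float) -> str:
    return f"average = {avg:.2f}"

@workflow
def pipeline(values: List[float]) -> str:
    return report(avg=mean(values=values))

# Shipping this onto a cluster is one CLI command, e.g.:
#   pyflyte run --remote pipeline.py pipeline --values '[1.0, 2.0, 3.0]'
```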
Napari, an open-source viewer for scientific data, has an inviting and well-established community that encourages contributions both to the project itself and to the broader bioimage analysis community. This talk will explore how napari supports non-traditional contributors, especially those without formal software development experience, through its welcoming community, human-centered documentation, and rich plugin ecosystem.
As someone with a pure biology background, I will share my journey into computational bioimage analysis and the scientific Python world, and how I came to contribute to napari's community. By sharing my experience writing a plugin and contributing to the core project, I will show how community-driven projects like napari lower barriers to entry, empower scientists, and cultivate a diverse, engaged research and developer community.
Python notebooks are a workhorse of scientific computing. But traditional notebooks have problems — they suffer from a reproducibility crisis; they are difficult to use with interactive widgets; their file format does not play well with Git; and they aren't reusable like regular Python scripts or modules.
This talk presents marimo, an open-source reactive Python notebook that addresses these concerns by modeling notebooks as dataflow graphs and storing them as Python files. We discuss design decisions and their tradeoffs, and show how these decisions make marimo notebooks reproducible in execution and packaging, Git-friendly, executable as scripts, and shareable as apps.
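To give a sense of what "notebooks as Python files" means, here is a sketch in the shape of marimo's documented file format (the cell contents are illustrative):

```python
# A marimo notebook is a plain Python file: each cell is a function whose
# parameters and return values encode the dataflow graph between cells.
import marimo

app = marimo.App()

@app.cell
def _():
    import marimo as mo
    return (mo,)

@app.cell
def _(mo):
    n = mo.ui.slider(1, 100, value=10)
    n  # the cell's last expression is displayed
    return (n,)

@app.cell
def _(n):
    # marimo knows this cell reads `n`, so it re-runs when the slider moves.
    [i**2 for i in range(n.value)]
    return

if __name__ == "__main__":
    app.run()
```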
Would you rather read a “Climate summary” or a “Climate summary for exactly where you live”? Producing documents that tailor your scientific results to an individual or their situation increases understanding, engagement, and connection. But producing many reports can be onerous.
If you are looking for a way to automate producing many reports, or you produce reports like this but find yourself in copy-and-paste hell, come along to learn how Quarto solves this problem with parameterized reports: you create a single Python notebook, but you generate many beautiful customized PDFs.
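As a hedged sketch of how this looks in practice: Quarto reads parameter defaults from a notebook cell tagged "parameters" (the Papermill convention) and lets you override them at render time with -P, so a short loop produces one PDF per recipient. The notebook name and the "city" parameter below are hypothetical:

```python
# Render one customized PDF per city from a single parameterized notebook.
import subprocess

for city in ["Tacoma", "Austin", "Boston"]:
    subprocess.run(
        [
            "quarto", "render", "climate_summary.ipynb",
            "--to", "pdf",
            "-P", f"city:{city}",                  # overrides the "parameters" cell
            "--output", f"climate_{city.lower()}.pdf",
        ],
        check=True,
    )
```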
Real-time machine learning depends on features and data that by definition can't be pre-computed. Detecting fraud or acute diseases like sepsis requires processing events that emerged seconds ago. How do we build an infrastructure platform that executes complex data pipelines end-to-end, on demand, in under 10 ms? All while meeting data teams where they are: in Python, the language of ML! Learn how we built a symbolic interpreter that accelerates ML pipelines by transpiling Python into DAGs of static expressions. These expressions are optimized in C++ and eventually run in production workloads at scale with Velox, an OSS (~4k stars) unified query engine (C++) from Meta.
The elasticity of the Cloud is very appealing for processing large scientific data. However, enormous volumes of unstructured research data, totaling petabytes, remain untapped in data repositories due to the lack of efficient parallel data access. Even-sized partitioning of these data to enable their parallel processing requires a complete re-write to storage, becoming prohibitively expensive at high volumes. In this article we present Dataplug, an extensible framework that enables fine-grained parallel data access to unstructured scientific data in object storage. Dataplug employs read-only, format-aware indexing, allowing users to define dynamically-sized partitions using various partitioning strategies. This approach avoids writing the partitioned dataset back to storage, enabling distributed workers to fetch data partitions on-the-fly directly from large data blobs, efficiently leveraging the high bandwidth capability of object storage. Validations on genomic (FASTQGZip) and geospatial (LiDAR) data formats demonstrate that Dataplug considerably lowers pre-processing compute costs (between 65.5% and 71.31% less) without imposing significant overheads.
The SciPy library provides objects representing well over 100 univariate probability distributions. These have served the scientific Python ecosystem for decades, but they are built upon an infrastructure that has not kept up with the demands of today’s users. To address its shortcomings, SciPy 1.15 includes a new infrastructure for working with probability distributions. This talk will introduce users to the new infrastructure and demonstrate its many advantages in terms of usability, flexibility, accuracy, and performance.
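As a quick taste of the new interface (available from SciPy 1.15; treat the details as a sketch, since the surface may still evolve), distributions are now first-class objects rather than frozen instances of a generic class:

```python
# Sketch of SciPy's new distribution infrastructure (SciPy >= 1.15).
import numpy as np
from scipy import stats

X = stats.Normal(mu=0.0, sigma=1.0)   # a distribution object
print(X.pdf(0.0))                     # density at a point
print(X.cdf(1.96))                    # cumulative probability
print(X.mean(), X.variance())         # moments as methods
samples = X.sample(1000, rng=np.random.default_rng(0))
```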
The SciPy Proceedings (https://proceedings.scipy.org) have long served as a cornerstone for publishing research in the scientific Python community, with over 330 peer-reviewed articles published over the last 17 years. In 2024, the SciPy Proceedings underwent a significant transformation, adopting MyST Markdown (https://mystmd.org) and Curvenote (https://curvenote.com) to enhance accessibility, interactivity, and reproducibility, including publishing of Jupyter Notebooks. The new proceedings articles are web-first, providing features such as deep-dive links for cross-references and previews of GitHub content, interactive 3D visualizations, and rich rendering of Jupyter Notebooks. In this talk, we will (1) present the new authoring & reading capabilities introduced in 2024; (2) highlight connections to prominent open-science initiatives and their impact on advancing computational research publishing; and (3) demonstrate the underlying technologies, how they enhance integrations with SciPy packages, and how to use these tools in your own communication workflows.
Our presentation will give an overview of the revised authoring process for SciPy Proceedings; how we improve metadata standards in a similar way to code-linting and continuous integration; and the integration of live previews of the articles, including auto-generated PDFs and JATS XML (a standard used in scientific publishing). The peer-review process for the proceedings currently happens using GitHub's peer-review commenting, in a similar fashion to the Journal of Open Source Software; we will demonstrate this process as well as showcase opportunities for working with distributed review services such as PREreview (https://prereview.org). The open publishing pipeline has streamlined the submission, review, and revision processes while maintaining high scientific quality and improving the completeness of scholarly metadata. Finally, we will present how this work connects to other high-profile scientific publishing initiatives that have incorporated Jupyter Notebooks, live computational figures, and interactive displays of large-scale data. These initiatives include Notebooks Now! by the American Geophysical Union, which focuses on ensuring that Jupyter Notebooks can be properly integrated into the scholarly record, and the Microscopy Society of America's work on interactive publishing of large-scale microscopy data with interactive visualizations. These initiatives and the SciPy Proceedings are enabled by recent improvements in open-source tools, including MyST Markdown, JupyterLab, BinderHub, and Curvenote, which enable new ways to share executable research content. Collectively, they aim to improve the reproducibility, interactivity, and accessibility of research by providing improved connections between data, software, and narrative research articles.
By embracing open science principles and modern technologies, the SciPy Proceedings exemplify how computational research can be more transparent, reproducible, and accessible. The shift to computational publishing, especially in the context of the scientific python community, opens new opportunities for researchers to publish not only their final results but also the computational workflows, datasets, and interactive visualizations that underpin them. This transformation aligns with broader efforts in open science infrastructure, such as integrating persistent identifiers (DOIs, ORCID, ROR), and adopting FAIR (Findable, Accessible, Interoperable, Reusable) principles for computational content. Building on these foundations, as well as open tools like MyST Markdown and Curvenote, provides a scalable model for open scientific publishing that bridges the gap between computational research and scholarly communication, fostering a more collaborative, iterative, and continuous approach to scientific knowledge dissemination.
Neuroscientists record brain activity using probes that capture rapid voltage changes ('spikes') from neurons. Spike sorting, the process of isolating these signals and attributing them to specific neurons, faces significant challenges: incompatible file formats, diverse algorithms, and inconsistent quality control. SpikeInterface provides a unified Python framework that standardizes data handling across technologies and enables reproducibility. In this talk, we will discuss: 1) SpikeInterface's modular components for I/O, processing, and sorting; 2) containerized dependency management that eliminates complex installation conflicts between diverse spike sorters; and 3) parallelization tools optimized for the memory-intensive nature of large-scale electrophysiology recordings.
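To make the three components concrete, here is a minimal hedged sketch of the SpikeInterface flow (module and function names follow SpikeInterface's public API; the data path and sorter choice are illustrative):

```python
# I/O -> preprocessing -> containerized sorting with SpikeInterface.
import spikeinterface.extractors as se
import spikeinterface.preprocessing as spre
import spikeinterface.sorters as ss

recording = se.read_spikeglx("/path/to/session")  # format-specific reader
recording = spre.bandpass_filter(recording, freq_min=300, freq_max=6000)

# Running the sorter in a container sidesteps local install conflicts.
sorting = ss.run_sorter("kilosort4", recording, docker_image=True)
```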
After two decades of planning, Rubin Observatory is finally observing the sky. Built to image the entire southern hemisphere every few nights with a 3.2-gigapixel camera, Rubin will produce a time-lapse of the Universe, revealing moving asteroids, pulsing stars, supernovae, and rare transients that you only catch if you're always watching.
In this talk, I'll share the “first look” images from Rubin Observatory as well as what it took to get here: from scalable algorithms to infrastructure that moves data from a mountaintop in Chile to scientists around the world in seconds. I'll reflect on what we learned building the data management system in Python over the years, including stories of choices that impacted scalability, interfaces, and maintainability. Rubin Observatory is here. And it's for you.
Research software engineer (RSE) communities of practice specific to a given science are crucial social structures connecting developers, maintainers, and users, prompting naturally occurring peer-mentoring opportunities, software improvements through collaborative contributions, and sharing of best practices and lessons learned from challenges specific to that scientific discipline. Members of such communities benefit from the vast resources and support available through other RSEs in their own scientific field, and users of that software benefit from a more capable and user-friendly product.
While the US-RSE (us-rse.org) advocates for recognition of the overall RSE community, provides individual RSEs with a sense of belonging (e.g., inclusivity), and provides helpful resources, it lacks the science-specific support possible in more focused communities of practice. This session features short scene-setting presentations, followed by an open panel discussion with leaders of science-specific communities of practice for RSEs (e.g., the Python in Heliophysics Community (PyHC), PlanetaryPy, earthaccess, and Pangeo) on the benefits of and lessons learned from leading those groups in comparison to more general RSE communities. Example discussion topics include the benefits of science-specific RSE communities, development of science-specific software standards, encouraging psychological safety, and community creation and sustainability.
This BoF aims to host a discussion about best practices for maintaining executable tutorials that are reproducible and reliable. The BoF is also intended to be a platform for collecting tips and tricks for CI/CD practices. The moderators recently put together a repository (https://scientific-python.github.io/executable-tutorials/) that builds on their experience maintaining numerous tutorial repositories. It covers some of the use cases, but we are well aware that there are still user scenarios and use cases that are not well covered.
The BoF complements both the Teaching & Learning and Maintainers tracks; none of the talks in those tracks seems to focus on the technical challenges around tutorials.
In Python, data analytics users often prioritize convenience, flexibility, and familiarity over pure performance. The cuDF DataFrame library provides a pandas-like experience with 10x to 50x performance improvements, but subtle differences prevent it from being a true drop-in replacement for many users. This talk will showcase the evolution of this library to provide zero-code-change experiences, first for pandas users and now for Polars. We will provide examples of this usage and a high-level overview of how users can make use of these today. We will then delve into the details of how GPU acceleration is implemented differently in pandas and Polars, along with a deep dive into some of the different technical challenges encountered for each. This talk will have something for both data practitioners and library developers.
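For reference, the two zero-code-change entry points look roughly like this (a sketch; the Polars frame contents are illustrative, and GPU execution requires the corresponding cuDF/Polars GPU packages):

```python
# pandas: run existing code unchanged under the accelerator, e.g.
#   python -m cudf.pandas my_script.py
# or, in Jupyter:
#   %load_ext cudf.pandas
#   import pandas as pd   # pandas calls now dispatch to the GPU when possible

# Polars: request the GPU engine when collecting a lazy query;
# unsupported queries fall back to the CPU engine.
import polars as pl

lf = pl.LazyFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
df = lf.with_columns((pl.col("a") * pl.col("b")).alias("ab")).collect(engine="gpu")
```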
High energy physics (HEP) research is going through fundamental changes as we move to collect larger amounts of data from the Large Hadron Collider (LHC). Analysis facilities and distributed computing, through high-throughput computing (HTC), have come together to create the next Pythonic generation of analysis by utilizing htcdaskgateway, a Dask Gateway extension that allows users to spawn workers compatible with both their analysis and heterogeneous clusters, in line with authentication requirements. This is enabling physicists to engage with scientific Python in ways they had not before because of domain-specific C++ tools. An example of htcdaskgateway's use is Fermilab's Elastic Analysis Facility.
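htcdaskgateway builds on the upstream Dask Gateway client pattern, which looks roughly like this (a generic dask_gateway sketch; the gateway address is hypothetical, and htcdaskgateway layers HTCondor-specific worker spawning and authentication on top):

```python
from dask_gateway import Gateway

gateway = Gateway("https://dask-gateway.example.org")  # hypothetical address
cluster = gateway.new_cluster()   # workers spawned on the facility's batch system
cluster.scale(20)
client = cluster.get_client()     # a standard dask.distributed client
```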
Working with data in grids or spreadsheets is great for collaboration, as there are many different tools to view and edit the files. Data science workflows often include packages like openpyxl to create, load, edit, and export spreadsheets that are then shared with others, who can use tools like Excel, Google Sheets, or IDEs to view them. The new Python in Excel feature, as well as the Anaconda Toolbox add-in, provides the tools to run Python directly in cells in a spreadsheet, making it easier for Pythonistas to access and collaborate on code. This talk will introduce how these features work, demo collaborating on Python code in a worksheet, and cover some case studies where these tools have been used to teach and collaborate with Python.
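The openpyxl side of that workflow takes only a few lines (a minimal sketch; the file name and values are illustrative):

```python
# Create a spreadsheet in Python that collaborators can open in Excel,
# Google Sheets, or an IDE.
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.title = "results"
ws.append(["sample", "value"])                       # header row
for i, v in enumerate([0.12, 0.98, 0.47], start=1):
    ws.append([f"s{i}", v])
wb.save("results.xlsx")
```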
The Issaquah Robotics Society (IRS) has been teaching Python and data analysis to high school students since 2016. Our presentation will summarize what we’ve learned from nine years of combining Python, competitive robotics, and high school students with no prior programming experience. We’ll focus on the importance of keeping it fun, learning the tools, and how to provide useful feedback without making learning Python feel like just another class. We’ll also explain how Python helps us win robotics competitions.
The best way to distribute large scientific datasets is via the Cloud, in Cloud-Optimized formats. But often this data is stuck in archival pre-Cloud file formats such as netCDF.
VirtualiZarr makes it easy to create "Virtual" Zarr datacubes, allowing performant access to huge archival datasets as if they were in the Cloud-Optimized Zarr format, without duplicating any of the original data.
We will demonstrate using VirtualiZarr to generate references to archival files, combine them into one array datacube using xarray-like syntax, commit them to Icechunk, and read the data back with zarr-python v3.
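That workflow looks roughly like the following (a hedged sketch: VirtualiZarr's API is evolving across versions, the file names are illustrative, and the Icechunk and read-back steps are shown schematically):

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

# Generate references to the archival netCDF files (no data is copied).
vds1 = open_virtual_dataset("day1.nc")
vds2 = open_virtual_dataset("day2.nc")

# Combine into one virtual datacube with xarray-like syntax.
combined = xr.concat([vds1, vds2], dim="time")

# Commit the references to an Icechunk store (session setup omitted):
#   combined.virtualize.to_icechunk(session.store)
# ...then read the data back lazily via zarr-python v3 / xarray:
#   ds = xr.open_zarr(session.store, consolidated=False)
```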
Camera traps are an essential tool for wildlife research. Zamba is an open source Python package that leverages machine learning and computer vision to automate time-intensive processing tasks for wildlife camera trap data. This talk will dive into Zamba's capabilities and key factors that influenced its design and development. Topics will include the importance of code-free custom model training, Zamba’s origins in an open machine learning competition, and the technical challenges of processing video data. Attendees will walk away with a better understanding of how machine learning and Python tools can support conservation efforts.