talk-data.com

Event

SciPy 2025

2025-07-07 – 2025-07-13 PyData

Activities tracked

7

Filtering by: Cloud Computing

Sessions & talks

Showing 1–7 of 7 · Newest first


Remote development for students and indie researchers with Spyder

2025-07-11
talk

PhD students, postdocs, and independent researchers often struggle to execute code developed locally on cloud or HPC clusters for better performance. This is even more difficult when they can't count on IT staff to set up the necessary infrastructure on the remote machine, which is common in developing countries. Spyder 6.1 will ship a whole set of improvements to address that limitation: automatically setting up a server to run code remotely on behalf of users, and managing remote Conda environments and the remote file system from the comfort of a local Spyder installation.

From Legacy to Leading-Edge: Revamping NCEI Software for the Cloud Era

2025-07-11
talk

Extreme weather events threaten industries and economic stability. NOAA’s National Centers for Environmental Information (NCEI) addresses this through the Industry Proving Grounds (IPG), which modernizes data delivery by collaborating with sectors like re/insurance and retail to develop practical, data-driven solutions. This presentation explores IPG’s technical innovations, including implementing Polars for efficient data processing, AWS for scalability, and CI/CD pipelines for streamlined deployment. These tools enhance data accessibility, reduce latency, and support real-time decision-making. By integrating scientific computing, cloud technology, and DevOps, NCEI improves climate resilience and provides a model for leveraging open-source tools to address global challenges.
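
As a hedged illustration of the Polars piece of that stack, the sketch below aggregates a hypothetical station-observation dataset with Polars' lazy API; the bucket path and column names are invented for the example, not NCEI's actual schema.

```python
import polars as pl

# Hypothetical station-observation dataset; the S3 path and column
# names are illustrative, not NCEI's real schema.
lazy = pl.scan_parquet("s3://example-bucket/ncei/observations/*.parquet")

summary = (
    lazy
    .filter(pl.col("element") == "TMAX")              # keep daily max temperature
    .group_by("station_id")
    .agg(pl.col("value").mean().alias("mean_tmax"))
    .collect()                                        # execute the optimized lazy plan
)
print(summary.head())
```

The lazy API lets Polars push the filter down and read only the needed columns and row groups from object storage, which is the low-latency access pattern the talk describes.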

Processing Cloud-optimized data in Python (Dataplug)

2025-07-11
talk

The elasticity of the Cloud is very appealing for processing large scientific data. However, enormous volumes of unstructured research data, totaling petabytes, remain untapped in data repositories due to the lack of efficient parallel data access. Evenly-sized partitioning of this data to enable parallel processing requires a complete re-write to storage, which becomes prohibitively expensive at high volumes. In this talk we present Dataplug, an extensible framework that enables fine-grained parallel data access to unstructured scientific data in object storage. Dataplug employs read-only, format-aware indexing, which allows dynamically-sized partitions to be defined using various partitioning strategies. This approach avoids writing the partitioned dataset back to storage, enabling distributed workers to fetch data partitions on-the-fly directly from large data blobs, efficiently leveraging the high bandwidth of object storage. Validations on genomic (FASTQGZip) and geospatial (LiDAR) data formats demonstrate that Dataplug considerably lowers pre-processing compute costs (by 65.5% to 71.31%) without imposing significant overheads.
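
A minimal sketch of that workflow is below. The names follow Dataplug's published examples, but module paths and helper names are assumptions that may differ across versions.

```python
# Sketch of the workflow described above; module paths and helper names
# follow Dataplug's documented examples but may vary between releases.
from dataplug import CloudObject
from dataplug.formats.genomics.fastq import FASTQGZip, partition_reads_batches

# Point at a compressed FASTQ blob in object storage (illustrative URI).
co = CloudObject.from_s3(FASTQGZip, "s3://example-bucket/reads.fastq.gz")

# Build the read-only, format-aware index once; the original blob is untouched.
co.preprocess()

# Define dynamically-sized partitions without rewriting the dataset...
slices = co.partition(partition_reads_batches, num_batches=32)

# ...and let each distributed worker fetch its partition on the fly
# via ranged GETs against object storage.
for s in slices:
    chunk = s.get()
```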

VirtualiZarr and Icechunk: How to build a cloud-optimised datacube of archival files in 3 lines of xarray

2025-07-10
talk

The best way to distribute large scientific datasets is via the Cloud, in Cloud-Optimized formats. But often this data is stuck in archival pre-Cloud file formats such as netCDF.

VirtualiZarr makes it easy to create "Virtual" Zarr datacubes, allowing performant access to huge archival datasets as if they were in the Cloud-Optimized Zarr format, without duplicating any of the original data.

We will demonstrate using VirtualiZarr to generate references to archival files, combine them into one array datacube using xarray-like syntax, commit them to Icechunk, and read the data back with zarr-python v3.
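
A hedged sketch of that end-to-end flow follows. The names track recent VirtualiZarr and Icechunk releases, but both APIs are still evolving, so treat them as indicative rather than definitive; the file paths are illustrative.

```python
# Sketch of the advertised workflow; VirtualiZarr and Icechunk APIs are
# evolving, so exact names may differ in your installed versions.
import xarray as xr
import icechunk
from virtualizarr import open_virtual_dataset

# 1. Generate references to the archival netCDF files (no data is copied).
vds = [open_virtual_dataset(p) for p in ["day1.nc", "day2.nc"]]  # illustrative paths

# 2. Combine the virtual datasets with ordinary xarray syntax.
combined = xr.concat(vds, dim="time")

# 3. Commit the references to an Icechunk repository.
repo = icechunk.Repository.create(icechunk.local_filesystem_storage("./datacube"))
session = repo.writable_session("main")
combined.virtualize.to_icechunk(session.store)
session.commit("Add virtual datacube over archival files")

# 4. Read it back as one Zarr datacube via zarr-python v3.
ds = xr.open_zarr(session.store, consolidated=False)
```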

Cubed: Scalable array processing with bounded-memory in Python

2025-07-09
talk

Cubed is a framework for distributed processing of large arrays without a cluster. Designed to respect memory constraints at all times, Cubed can express any NumPy-like array operation as a series of embarrassingly-parallel, bounded-memory steps. By using Zarr as persistent storage between steps, Cubed can run in a serverless fashion on both a local machine and on a range of Cloud platforms. After explaining Cubed’s model, we will show how Cubed has been integrated with Xarray and demonstrate its performance on various large array geoscience workloads.
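
To make the bounded-memory model concrete, here is a minimal sketch using Cubed's NumPy-like array API; the memory budget and chunk sizes are arbitrary example values.

```python
# Minimal sketch of Cubed's model: every step must fit in allowed_mem,
# and Zarr in work_dir persists intermediates between parallel steps.
import cubed
import cubed.array_api as xp

spec = cubed.Spec(work_dir="/tmp/cubed", allowed_mem="2GB")

a = xp.ones((20000, 20000), chunks=(1000, 1000), spec=spec)
b = a * 2                   # each operation becomes a bounded-memory step
c = xp.sum(a + b)

# Runs without a cluster: locally here, or on serverless cloud executors.
print(c.compute())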

Processing Cloud-optimized data in Python with Serverless Functions (Lithops, Dataplug)

2025-07-08
talk

Cloud-optimized (CO) data formats are designed to efficiently store and access data directly from cloud storage without needing to download the entire dataset. These formats enable faster data retrieval, scalability, and cost-effectiveness by allowing users to fetch only the necessary subsets of data. They also allow for efficient parallel data processing using on-the-fly partitioning, which can considerably accelerate data management operations. In this sense, cloud-optimized data is a natural fit for data-parallel jobs using serverless functions. FaaS provides a data-driven, scalable, and cost-efficient experience with practically no management burden. Each serverless function reads and processes a small portion of the cloud-optimized dataset, fetched in parallel directly from object storage, significantly speeding up the job.

In this talk, you will learn how to process cloud-optimized data formats in Python using the Lithops toolkit. Lithops is a serverless data processing toolkit specially designed to process data from Cloud Object Storage using serverless functions. We will also demonstrate the Dataplug library, which enables cloud-optimized data management in scientific settings such as genomics, metabolomics, and geospatial data. We will show different data processing pipelines in the Cloud that demonstrate the benefits of cloud-optimized data management.
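
The sketch below combines Lithops' map-style executor with on-the-fly partitions in the spirit of the talk. The Lithops calls are its documented API; the Dataplug imports mirror the earlier example and, like the bucket path, are assumptions for illustration.

```python
import lithops
from dataplug import CloudObject
from dataplug.formats.genomics.fastq import FASTQGZip, partition_reads_batches  # assumed paths

def count_reads(data_slice):
    # Each serverless function fetches only its partition's byte range.
    return data_slice.get().count(b"@")   # rough stand-in for real analysis

co = CloudObject.from_s3(FASTQGZip, "s3://example-bucket/reads.fastq.gz")
slices = co.partition(partition_reads_batches, num_batches=128)

fexec = lithops.FunctionExecutor()        # backend and runtime come from lithops config
futures = fexec.map(count_reads, slices)  # one function per on-the-fly partition
print(sum(fexec.get_result(futures)))
```

Because each function performs an independent ranged read against object storage, the job scales out with the number of partitions rather than being bottlenecked on a single download.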

All the SQL a Pythonista needs to know: an introduction to SQL and DataFrames with DuckDB

2025-07-07
talk

Structured Query Language (or SQL for short) is a programming language for managing data in a database system and an essential part of any data engineer's tool kit. In this tutorial, you will learn how to use SQL to create databases and tables, insert data into them, and extract, filter, and join data or perform calculations using queries. We will use DuckDB, a new open-source, in-process database system that combines cutting-edge database research with dataframe-inspired ease of use. DuckDB is only a pip install away (with zero dependencies), and runs right on your laptop. You will learn how to use DuckDB with your existing Python tools like Pandas, Polars, and Ibis to simplify and speed up your pipelines. Lastly, you will learn how to use SQL to create fast, interactive data visualizations, and how to teach your data how to fly and share it via the Cloud.
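
A small taste of the DuckDB-and-DataFrames interplay the tutorial covers, using DuckDB's stable Python API with made-up example data:

```python
import duckdb
import pandas as pd

con = duckdb.connect()  # in-process database, zero dependencies

# Create a table and insert rows with plain SQL.
con.execute("CREATE TABLE talks (title VARCHAR, day DATE)")
con.execute(
    "INSERT INTO talks VALUES "
    "('Cubed', DATE '2025-07-09'), ('DuckDB tutorial', DATE '2025-07-07')"
)

# Query straight back into a Pandas DataFrame...
df = con.sql("SELECT title, day FROM talks ORDER BY day").df()

# ...and query that DataFrame by name, as if it were a table.
print(con.sql("SELECT count(*) AS n FROM df").fetchall())
```

The last query works via DuckDB's replacement scans, which let SQL reference in-scope DataFrames directly, so existing Pandas or Polars pipelines can adopt SQL one query at a time.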