Topic

NumPy

scientific_computing numerical_analysis python

Activities: 2 tagged

Activity Trend: peak of 16 activities per quarter, 2020-Q1 to 2026-Q1

Top Events

- O'Reilly Data Science Books: 41
- SciPy 2025: 11
- O'Reilly Data Visualization Books: 8
- O'Reilly Data Engineering Books: 6
- ADSP: Algorithms + Data Structures = Programs: 3
- PyData Paris 2025: 3
- PyConDE & PyData Berlin 2023: 2
- PyData Seattle 2025: 2
- Data Engineering Podcast: 2
- Introduction AI Mini Bootcamp - Dr. Yasin Ceran: 1
- [Online] Contributing to the NumPy Documentation: 1
- PyData Paris 2024: 1

Top Speakers

- Wes McKinney (Posit): 3
- Conor Hoekstra: 3
- Bryce Adelstein Lelbach (NVIDIA): 3
- Kyran Dale: 2
- Ralf Gommers (Quansight Labs): 2
- Ivan Idris: 2
- Hendrik Makait: 2
- Fabio Nelli: 2
- Robert Johansson: 2
- Martin Czygan: 2
- Jake VanderPlas: 2
- Ashwin Pajankar: 2

Activities

Showing results filtered by: Hendrik Makait

Geoscience at Massive Scale

2024-09-25 · PyData Paris 2024

talk

by Hendrik Makait

API Cloud Computing

When scaling geoscience workloads to large datasets, many scientists and developers reach for Dask, a library for distributed computing that plugs seamlessly into Xarray and offers an Array API that wraps NumPy. Featuring a distributed environment capable of running your workload on large clusters, Dask promises to make it easy to scale from prototyping on your laptop to analyzing petabyte-scale datasets.

Dask has been the de-facto standard for scaling geoscience, but it hasn’t entirely lived up to its promise of operating effortlessly at massive scale. This comes up in a few ways:

- Correctly chunking your dataset has a significant impact on Dask’s ability to scale
- Workers accidentally run out of memory due to:
  - Data being loaded too eagerly
  - Rechunking
  - Unmanaged memory
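
The chunking trade-off named in this abstract can be illustrated with plain arithmetic. The sketch below is not part of the talk and does not use Dask's API: the helper `chunk_stats`, the dataset shape, and the candidate chunk sizes are all hypothetical, chosen only to show how chunk shape drives both the number of chunks a scheduler must track and the memory each chunk needs on a worker.

```python
import math


def chunk_stats(shape, chunks, itemsize=8):
    """Return (number of chunks, bytes per full-size chunk) for a regular chunking.

    `itemsize` is the size of one element in bytes (8 for float64).
    """
    if len(shape) != len(chunks):
        raise ValueError("shape and chunks must have the same number of dimensions")
    # Chunks along each axis, counting a smaller trailing chunk as one chunk.
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    # Memory footprint of one full-size chunk.
    chunk_bytes = math.prod(chunks) * itemsize
    return n_chunks, chunk_bytes


# Hypothetical dataset: (time, lat, lon) array of float64 values.
shape = (3650, 721, 1440)

# Three illustrative chunkings: one time step per chunk, one year per chunk,
# and a blocked layout that splits every axis.
for chunks in [(1, 721, 1440), (365, 721, 1440), (30, 361, 720)]:
    n, nbytes = chunk_stats(shape, chunks)
    print(f"chunks={chunks}: {n} chunks, {nbytes / 2**20:.1f} MiB each")
```

Very small chunks produce many pieces to schedule, while very large chunks may not fit comfortably in a worker's memory, which is one plausible reading of why chunking has such an effect on scaling.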

Over the last few months, Dask has addressed many of those pains and continues to do so through:

- Improvements to its scheduling algorithms
- A faster and more memory-stable method for rechunking
- First-of-its-kind logical optimization layer for a distributed array framework (ongoing)

Join us as we dive into real-world geoscience workloads, exploring how Dask empowers scientists and developers to run their analyses at massive scale. Discover the impact of improvements made to Dask, ongoing challenges, and future plans for making it truly effortless to scale from your laptop to the cloud.

Observability for Distributed Computing with Dask

2023-04-18 · PyConDE & PyData Berlin 2023

talk

by Hendrik Makait

AI/ML Cloud Computing Data Engineering Data Science Pandas Python React

Debugging is hard. Distributed debugging is hell.

Dask is a popular library for parallel and distributed computing in Python. Dask is commonly used in data science, actual science, data engineering, and machine learning to distribute workloads onto clusters of many hundreds of workers with ease.

However, when things go wrong, life can become difficult because of all the moving parts. These parts include your code, other PyData libraries like NumPy/pandas, the machines you’re running on, the network between them, storage, the cloud, and of course issues with Dask itself. It can be difficult to understand what is going on, especially when things seem slower than they should be or fail unexpectedly. Observability is the key to sanity and success.

In this talk, we describe the tools Dask offers to help you observe your distributed cluster, analyze performance, and monitor your cluster to react to unexpected changes quickly. We will dive into distributed logging, automated metrics, event-based monitoring, and root-causing problems with diagnostic tooling. Throughout the talk, we will leverage real-world use cases to show how these tools help to identify and solve problems for large-scale users in the wild.

This talk should be particularly insightful for Dask users, but the approaches to observing distributed systems should be relevant to anyone operating at scale in production.