talk-data.com

Topic: Python

Tags: programming_language, data_science, web_development

49 tagged activities

Activity Trend

Peak of 185 activities per quarter (2020-Q1 to 2026-Q1)

Activities

Showing filtered results

Filtering by: PyConDE & PyData Berlin 2023

Over the past decade, developers, researchers, and the community have successfully built tens of thousands of data applications using Spark. Since then, the use cases and requirements of data applications have evolved: today, every application, from web services running in application servers and interactive environments such as notebooks and IDEs to phones and edge devices such as smart home devices, wants to leverage the power of data.

However, Spark's driver architecture is monolithic, running client applications on top of a scheduler, optimizer and analyzer. This architecture makes it hard to address these new requirements: there is no built-in capability to remotely connect to a Spark cluster from languages other than SQL.

Spark Connect introduces a decoupled client-server architecture for Apache Spark that allows remote connectivity to Spark clusters using the DataFrame API and unresolved logical plans as the protocol. The separation between client and server allows Spark and its open ecosystem to be leveraged from everywhere. It can be embedded in modern data applications, in IDEs, notebooks, and programming languages.

This talk highlights how simple it is to connect to Spark from any data application or IDE using Spark Connect. We will take a deep dive into the architecture of Spark Connect and give an outlook on how the community can participate in extending Spark Connect to new programming languages and frameworks - to bring the power of Spark everywhere.
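To give a flavor of the client side, here is a minimal sketch of connecting through Spark Connect using PySpark 3.4+; the endpoint URL is a placeholder for your own cluster.

```python
# Minimal sketch: a Spark Connect client session (PySpark 3.4+).
# "sc://localhost:15002" is a placeholder endpoint for your own cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = spark.range(10)                       # built client-side as an unresolved logical plan
print(df.filter(df.id % 2 == 0).count())   # plan is sent to the server and executed there
```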

FastKafka is a Python library that makes it easy to connect to Apache Kafka queues and send and receive messages. In this talk, we will introduce the library and its features for working with Kafka queues in Python. We will discuss the motivations for creating the library, how it compares to other Kafka client libraries, and how to use its decorators to define functions for consuming and producing messages. We will also demonstrate how to use these functions to build a simple application that sends and receives messages from the queue. This talk will be of interest to Python developers looking for an easy-to-use solution for working with Kafka.

The documentation of the library can be found here: https://fastkafka.airt.ai/
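As a taste of the decorator style described above, here is a small sketch following the patterns in the FastKafka documentation; exact signatures may differ between versions, and the broker details and message model are illustrative.

```python
# Sketch of FastKafka's decorator-based API, following the project docs;
# broker details and the message model are illustrative.
from pydantic import BaseModel
from fastkafka import FastKafka

class HelloWorld(BaseModel):
    msg: str

kafka_brokers = {"demo": {"url": "localhost", "port": 9092}}
app = FastKafka(title="Demo app", kafka_brokers=kafka_brokers)

@app.consumes()  # consumes from the "hello_world" topic, derived from the function name
async def on_hello_world(msg: HelloWorld):
    print(f"Got: {msg.msg}")

@app.produces()  # produces to the "hello_world" topic
async def to_hello_world(msg: str) -> HelloWorld:
    return HelloWorld(msg=msg)
```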

PyData Berlin are excited to bring you this open source workshop dedicated to contributing to pandas. This tutorial is 3 hours. We will have a break and continue with the same group of people.

pandas is a data wrangling platform for Python widely adopted in the scientific computing community. In this session, you will be guided on how you can make your own contributions to the project, no prior experience contributing required! Not only will this teach you new skills and boost your CV, you'll also likely get a nice adrenaline rush when your contribution is accepted!

If you don’t finish your contribution during the event, we hope you will continue to work on it after the tutorial. pandas offers regular new contributor meetings and has a Slack workspace to provide ongoing support for new contributors. For more details, see our contributor community page: http://pandas.pydata.org/docs/dev/development/community.html .

In modern software engineering, plugin systems are a ubiquitous way to extend and modify the behavior of applications and libraries. When software is written in a plugin-friendly way, it encourages a modular organization in which the contracts between the core software and its plugins have been well thought out. In this talk, we cover exactly how to define this contract and how you can start designing your software to be more plugin-friendly.

Throughout the talk, we will build our own plugin-friendly application using the pluggy library to show these design principles in action. At the end of the talk, I also cover a real-life case study of how the package manager conda is currently making its 10-year-old codebase more plugin-friendly, to illustrate how to retrofit an existing project.
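For illustration, here is a minimal sketch of such a contract expressed with pluggy: the host defines a hook specification, and plugins provide implementations. Project and hook names here are made up for the example.

```python
# Minimal pluggy sketch: the host defines the contract (hookspec),
# plugins fulfill it (hookimpl). Project and hook names are illustrative.
import pluggy

hookspec = pluggy.HookspecMarker("myapp")
hookimpl = pluggy.HookimplMarker("myapp")

class MyAppSpec:
    @hookspec
    def process(self, text):
        """Transform `text` and return the result."""

class ShoutPlugin:
    @hookimpl
    def process(self, text):
        return text.upper()

pm = pluggy.PluginManager("myapp")
pm.add_hookspecs(MyAppSpec)
pm.register(ShoutPlugin())

print(pm.hook.process(text="hello"))  # ['HELLO'] -- one result per registered plugin
```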

Writing efficient data pipelines in Python can be tricky. The standard recommendation is to use vectorized functions implemented in NumPy, Pandas, or the like. But what do you do when the processing task does not fit these libraries? Using plain Python for processing can result in poor performance, in particular when handling large data sets.

Rust is a modern, performance-oriented programming language that is already widely used by the Python community. Augmenting data processing steps with Rust can result in substantial speed-ups. In this talk, we will present strategies for using Rust in a larger Python data processing pipeline, with a particular focus on pragmatism and minimizing integration effort.
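As a purely hypothetical sketch of the shape this takes: Python keeps the orchestration, while a compiled Rust extension handles the hot loop. The module fast_ops and its function are made up for illustration (e.g. something built with PyO3/maturin).

```python
# Hypothetical sketch: "fast_ops" stands in for a Rust extension module
# (e.g. built with PyO3/maturin); module and function names are made up.
import fast_ops  # compiled Rust extension (hypothetical)

def process(records: list[dict]) -> list[float]:
    # Keep I/O, validation, and glue code in plain Python...
    values = [float(r["value"]) for r in records]
    # ...and push the performance-critical inner loop down into Rust.
    return fast_ops.rolling_score(values)  # hypothetical Rust-implemented function
```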

At the semiconductor division of Carl Zeiss, it's our mission to continuously make computer chips faster and more energy-efficient. To do so, we go to the very limits of what is possible, both physically and technologically. This is only possible through massive research and development efforts.

In this talk, we tell the story of how Python became a central tool for our R&D activities. This includes technical aspects as well as organization and culture: How do you make sure that hundreds of people work in consistent environments? How do you get everyone on board to work together with Python? And when you have lots of domain experts without much software background, how do you prevent them from creating a mess as projects get larger?

Keeping in mind the Pythonic principle that “simple is better than complex”, we'll see how to create a web map with the Python-based web framework Django, using its GeoDjango module to store geographic data in your local database and run geospatial queries against it.
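A minimal sketch of the pieces involved, assuming a spatially enabled database such as PostGIS; the model and names are illustrative.

```python
# Minimal GeoDjango sketch, assuming a spatially enabled database (e.g. PostGIS).
# Model and field names are illustrative.
from django.contrib.gis.db import models

class Cafe(models.Model):
    name = models.CharField(max_length=100)
    location = models.PointField()  # geographic point stored in the database

# A geospatial query: cafes within 2 km of a point (longitude, latitude).
from django.contrib.gis.geos import Point
from django.contrib.gis.measure import D

here = Point(13.405, 52.52)  # Berlin
nearby = Cafe.objects.filter(location__distance_lte=(here, D(km=2)))
```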

Debugging is hard. Distributed debugging is hell.

Dask is a popular library for parallel and distributed computing in Python. Dask is commonly used in data science, actual science, data engineering, and machine learning to distribute workloads onto clusters of many hundreds of workers with ease.

However, when things go wrong life can become difficult due to all of the moving parts. These parts include your code, other PyData libraries like NumPy/pandas, the machines you’re running on, the network between them, storage, the cloud, and of course issues with Dask itself. It can be difficult to understand what is going on, especially when things seem slower than they should be or fail unexpectedly. Observability is the key to sanity and success.

In this talk, we describe the tools Dask offers to help you observe your distributed cluster, analyze performance, and monitor your cluster to react to unexpected changes quickly. We will dive into distributed logging, automated metrics, event-based monitoring, and root-causing problems with diagnostic tooling. Throughout the talk, we will leverage real-world use cases to show how these tools help to identify and solve problems for large-scale users in the wild.
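One concrete example from this toolbox: capturing a performance report for a computation, here on a throwaway local cluster (the workload is illustrative).

```python
# Capture a performance report for a Dask computation; the local cluster
# and the workload are illustrative.
import dask.array as da
from dask.distributed import Client, performance_report

client = Client()  # local cluster; in production, connect to your scheduler

x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))

with performance_report(filename="dask-report.html"):
    x.T.dot(x).mean().compute()  # task stream, transfers, and worker metrics are recorded
```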

This talk should be particularly insightful for Dask users, but the approaches to observing distributed systems should be relevant to anyone operating at scale in production.

Bluetooth Low Energy (BLE) is a part of the Bluetooth standard aimed at bringing wireless technology to low-power devices, and it's getting into everything: lightbulbs, robots, personal health and fitness devices, and plenty more. One of the main advantages of BLE is that anyone can integrate these devices into their own tools or projects.

However, BLE is not the most developer-friendly protocol, and these devices usually don't come with good documentation. In addition, there are not many good open-source tools, examples, or tutorials on how to use Python with BLE, especially if one wants to build both sides of the communication.

In this talk, I will introduce the concepts and properties used in BLE interactions and look at how we can use the Linux Bluetooth stack (BlueZ) to communicate with other devices. We will walk through a simple example and learn along the way about common pitfalls and debugging options when working with BLE and Python.
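For a flavor of what this looks like from Python, here is a small sketch using the bleak library, which drives BlueZ on Linux; whether the talk uses bleak specifically is an assumption, and the characteristic UUID shown is the standard Device Name.

```python
# Sketch using bleak (which talks to BlueZ on Linux); whether the talk uses
# bleak specifically is an assumption. 0x2A00 is the standard Device Name
# characteristic.
import asyncio
from bleak import BleakScanner, BleakClient

async def main():
    devices = await BleakScanner.discover(timeout=5.0)
    for d in devices:
        print(d.address, d.name)
    if devices:
        async with BleakClient(devices[0].address) as client:
            name = await client.read_gatt_char("00002a00-0000-1000-8000-00805f9b34fb")
            print(name.decode(errors="replace"))

asyncio.run(main())
```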

This talk is for everybody who has a basic understanding of Python and wants a deeper understanding of how BLE works and how to use it in a private project.

Python is a very expressive and powerful language, but it is not always the fastest option for performance-critical parts of an application. Rust, on the other hand, is known for its lightning-fast runtime and low-level control, making it an attractive option for speeding up performance-sensitive portions of Python programs.

In this talk, we will present a case study of using Rust to speed up a critical component of a Python application. We will cover the following topics:

  • An overview of Rust and its benefits for Python developers
  • Profiling and identifying performance bottlenecks in a Python application (see the profiling sketch after this list)
  • Implementing a solution in Rust and integrating it with the Python application using PyO3
  • Measuring the performance improvements and comparing them to other optimization techniques

Attendees will learn about the potential for using Rust to boost the performance of their Python programs and how to go about doing so in their own projects.
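A minimal sketch of the profiling step from the outline above, using only the standard library (the workload function is illustrative):

```python
# Profile a hot function with the standard library; hot_path() is illustrative.
import cProfile
import pstats

def hot_path(n: int) -> int:
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
hot_path(10_000_000)
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)  # top 10 by cumulative time
```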

The aspect-oriented programming paradigm can support the separation of cross-cutting concerns such as logging, caching, or checking of permissions. This can improve code modularity and maintainability. Python offers decorators to implement reusable code for cross-cutting tasks.

This tutorial is an in-depth introduction to decorators. It covers the usage of decorators and how to implement simple and more advanced ones. Use cases demonstrate how to work with decorators. In addition to showing how functions can use closures to create decorators, the tutorial introduces callable class instances as an alternative. Class decorators can solve problems that used to be tasks for metaclasses. The tutorial provides use cases for class decorators.
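A minimal sketch of the two implementation styles mentioned here, a closure-based decorator and a callable class instance:

```python
# Two ways to implement a decorator: a closure and a callable class instance.
import functools

def log_calls(func):
    """Closure-based decorator: wrapper() closes over `func`."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__}")
        return func(*args, **kwargs)
    return wrapper

class CountCalls:
    """Class-based decorator: state lives on the instance."""
    def __init__(self, func):
        functools.update_wrapper(self, func)
        self.func = func
        self.count = 0
    def __call__(self, *args, **kwargs):
        self.count += 1
        return self.func(*args, **kwargs)

@log_calls
def greet(name):
    return f"Hello, {name}!"

@CountCalls
def add(a, b):
    return a + b

greet("Ada")          # prints: calling greet
add(1, 2); add(3, 4)
print(add.count)      # 2
```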

While the focus is on best practices and practical applications, the tutorial also provides deeper insight into how Python works behind the scenes. After the tutorial, participants will feel comfortable with functions that take functions and return new functions.

In this tutorial, you will learn about the various Python modules for processing geospatial data, including GDAL, Rasterio, Pyproj, Shapely, Folium, Fiona, OSMnx, Libpysal, Geopandas, Pydeck, Whitebox, ESDA, and Leaflet. You will gain hands-on experience working with real-world geospatial data and learn how to perform tasks such as reading and writing spatial data, reprojecting data, performing spatial analyses, and creating interactive maps. This tutorial is suitable for beginners as well as intermediate Python users who want to expand their knowledge in the field of geospatial data processing.
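As a small taste of the workflow, here is a sketch using GeoPandas, one of the listed modules; the file name and target CRS are placeholders.

```python
# Read, reproject, analyze, and map spatial data with GeoPandas;
# the file name and target CRS are placeholders.
import geopandas as gpd

gdf = gpd.read_file("districts.geojson")    # reading vector data (via Fiona/GDAL)
gdf = gdf.to_crs(epsg=25833)                # reprojecting (via pyproj)
gdf["area_km2"] = gdf.geometry.area / 1e6   # a simple spatial computation
gdf.explore()                               # interactive Leaflet map (via folium)
```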

As the number of production machine learning use cases increases, we find ourselves facing new and bigger challenges where more is at stake. Because of this, it's critical to identify the key areas to focus our efforts on, so we can ensure our machine learning pipelines are reliable and scalable. In this talk, we dive into the state of production machine learning in the Python ecosystem and cover the concepts that make production machine learning so challenging, as well as some of the recommended tools available to tackle these challenges.

This talk will cover key principles, patterns, and frameworks around the open-source frameworks powering single or multiple phases of the end-to-end ML lifecycle, including model training, deployment, and monitoring. We will give a high-level overview of the production ML ecosystem and dive into best practices that have been abstracted from production use cases of machine learning operations at scale, as well as how to leverage tools that will allow us to deploy, explain, secure, monitor, and scale production machine learning systems.

In this talk, we will explore the use of Python's typing.Protocol, Scala's typeclasses, and Rust's traits. They all offer a powerful and elegant mechanism for abstracting over various concepts (such as serialization) in a modular manner. We will compare and contrast the syntax and implementation of these constructs in each language and discuss their strengths and weaknesses. We will also look at real-world examples of how these features are used in each language to specify behavior, and consider differences in terms of type system expressiveness and effectiveness. By the end of the talk, attendees will have a better understanding of the differences and similarities between these three language features, and will be able to make informed decisions about which one is best suited for their needs.
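On the Python side, a minimal sketch of what this looks like with typing.Protocol; the names are illustrative.

```python
# Structural typing with typing.Protocol: User conforms to Serializable
# without inheriting from it. Names are illustrative.
from typing import Protocol, runtime_checkable

@runtime_checkable
class Serializable(Protocol):
    def serialize(self) -> bytes: ...

class User:
    def __init__(self, name: str) -> None:
        self.name = name
    def serialize(self) -> bytes:
        return self.name.encode("utf-8")

def save(obj: Serializable) -> None:  # accepts anything with a matching serialize()
    print(obj.serialize())

save(User("Ada"))
assert isinstance(User("Ada"), Serializable)  # runtime check looks at methods only
```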

In this talk, we will report on our experiences switching from pandas to Polars in a real-world ML project. Polars is a new high-performance dataframe library for Python, based on Apache Arrow and written in Rust. We will compare the performance of Polars with the popular pandas library and show how Polars can provide significant speed improvements for data manipulation and analysis tasks. We will also discuss the unique features of Polars, such as its ability to handle large datasets that do not fit into memory, and how it feels in practice to make the switch from pandas. This talk is aimed at data scientists, analysts, and anyone interested in fast and efficient data processing in Python.
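For a flavor of the API, here is a sketch of a lazy Polars query; file and column names are placeholders, and API details vary across versions.

```python
# A lazy Polars query: the plan is optimized before any data is read.
# File and column names are placeholders.
import polars as pl

result = (
    pl.scan_csv("events.csv")              # lazy: nothing is read yet
      .filter(pl.col("value") > 0)
      .group_by("category")                # called `groupby` in older Polars versions
      .agg(pl.col("value").mean().alias("mean_value"))
      .collect()                           # optimize the plan, then execute
)
print(result)
```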

We have recently open-sourced a pure-Python implementation of Cyclic Boosting, a family of general-purpose, supervised machine learning algorithms. Its predictions are fully explainable at the individual sample level, and yet Cyclic Boosting can deliver highly accurate and robust models. For this, it requires little hyperparameter tuning and minimal data pre-processing (including support for missing information and categorical variables of high cardinality), making it an ideal off-the-shelf method for structured, heterogeneous data sets. Furthermore, it is computationally inexpensive and fast, allowing for rapid improvement iterations. The modeling process, especially the infamous but unavoidable feature engineering, is facilitated by automatic creation of an extensive set of visualizations for data dependencies and training results. In this presentation, we will provide an overview of the inner workings of Cyclic Boosting, along with a few sample use cases, and demonstrate the usage of the new Python library.

You can find Cyclic Boosting on GitHub: https://github.com/Blue-Yonder-OSS/cyclic-boosting
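A usage sketch following the scikit-learn-style pipeline helpers shown in the project README at the time of writing; the entry-point name and defaults may change across versions, and the data here is synthetic.

```python
# Hypothetical usage sketch based on the project README; entry-point names
# may differ across versions, and the data is synthetic.
import numpy as np
import pandas as pd
from cyclic_boosting.pipelines import pipeline_CBPoissonRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "store": rng.integers(0, 10, 1000),      # categorical feature
    "dayofweek": rng.integers(0, 7, 1000),   # categorical feature
})
y = rng.poisson(lam=2.0, size=1000)          # count target for a Poisson model

model = pipeline_CBPoissonRegressor()        # binning + Cyclic Boosting in one pipeline
model.fit(X, y)
predictions = model.predict(X)
```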

Learn how to build and analyze heterogeneous graphs using PyG, a graph machine learning library for Python. This workshop will provide a practical introduction to the concept of heterogeneous graphs and their applications, including their ability to capture the complexity and diversity of real-world systems. Participants will gain experience in creating a heterogeneous graph from multiple data tables, preparing a dataset, and implementing and training a model using PyG.
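A minimal sketch of what building such a graph looks like in PyG; node counts, feature sizes, and the edge type are made up for illustration.

```python
# Build a small heterogeneous graph with PyG's HeteroData;
# sizes and the ("user", "rates", "movie") edge type are illustrative.
import torch
from torch_geometric.data import HeteroData

data = HeteroData()
data["user"].x = torch.randn(100, 16)   # 100 user nodes with 16 features each
data["movie"].x = torch.randn(50, 32)   # 50 movie nodes with 32 features each

src = torch.randint(0, 100, (400,))     # user indices
dst = torch.randint(0, 50, (400,))      # movie indices
data["user", "rates", "movie"].edge_index = torch.stack([src, dst])

print(data)  # summarizes node types, edge types, and tensor shapes
```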