PyConDE & PyData Berlin 2023

Use Spark from anywhere: A Spark client in Python powered by Spark Connect

2023-04-18

talk

Martin Grund

API Python Spark SQL

Over the past decade, developers, researchers, and the community have successfully built tens of thousands of data applications using Spark. Since then, the use cases and requirements of data applications have evolved: today, every application wants to leverage the power of data, from web services that run in application servers, through interactive environments such as notebooks and IDEs, to phones and edge devices such as smart home devices.

However, Spark's driver architecture is monolithic, running client applications on top of a scheduler, optimizer and analyzer. This architecture makes it hard to address these new requirements: there is no built-in capability to remotely connect to a Spark cluster from languages other than SQL.

Spark Connect introduces a decoupled client-server architecture for Apache Spark that allows remote connectivity to Spark clusters using the DataFrame API and unresolved logical plans as the protocol. The separation between client and server allows Spark and its open ecosystem to be leveraged from everywhere. It can be embedded in modern data applications, IDEs, notebooks and programming languages.

This talk highlights how simple it is to connect to Spark using Spark Connect from any data application or IDE. We will do a deep dive into the architecture of Spark Connect and give an outlook on how the community can participate in extending Spark Connect to new programming languages and frameworks, to bring the power of Spark everywhere.
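As a rough illustration of the client side described above, here is a minimal sketch of connecting to a Spark Connect server from Python. It assumes PySpark 3.4 or later with the Spark Connect client installed and a server reachable on the default port 15002; the helper names `connect_url` and `even_ids` are made up for this example and do not come from the talk.

```python
def connect_url(host: str = "localhost", port: int = 15002) -> str:
    """Build a Spark Connect connection string of the form sc://host:port."""
    return f"sc://{host}:{port}"


def even_ids(url: str, limit: int = 5) -> list[int]:
    """Run a small DataFrame query against a remote Spark Connect server."""
    # Imported lazily so the URL helper works without pyspark installed.
    from pyspark.sql import SparkSession

    # remote() selects the Spark Connect client instead of a local driver.
    spark = SparkSession.builder.remote(url).getOrCreate()
    try:
        # The client only describes the query here; the server analyzes,
        # optimizes and executes the plan once collect() is called.
        df = spark.range(100).filter("id % 2 = 0").limit(limit)
        return [row.id for row in df.collect()]
    finally:
        spark.stop()
```

With a server running, `even_ids(connect_url())` would send the plan to the cluster and return the collected rows; the client process itself needs no scheduler, optimizer or analyzer.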

Enabling Machine Learning: How to Optimize Infrastructure, Tools and Teams for ML Workflows

2023-04-18

talk

Yann Lemonnier

AI/ML

In this talk, we will explore the role of a machine learning enabler engineer in facilitating the development and deployment of machine learning models. We will discuss best practices for optimizing infrastructure and tools to streamline the machine learning workflow, reduce time to deployment, and enable data scientists to extract insights and value from data more efficiently. We will also examine case studies and examples of successful machine learning enabler engineering projects and share practical tips and insights for anyone interested in this field.

Introducing FastKafka

2023-04-18

talk

Tvrtko Sternak

AI/ML Kafka Python

FastKafka is a Python library that makes it easy to connect to Apache Kafka queues and send and receive messages. In this talk, we will introduce the library and its features for working with Kafka queues in Python. We will discuss the motivations for creating the library, how it compares to other Kafka client libraries, and how to use its decorators to define functions for consuming and producing messages. We will also demonstrate how to use these functions to build a simple application that sends and receives messages from the queue. This talk will be of interest to Python developers looking for an easy-to-use solution for working with Kafka.

The documentation of the library can be found here: https://fastkafka.airt.ai/
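To make the decorator idea concrete, the sketch below shows the general pattern of registering consumer and producer functions through decorators. It is a toy, in-memory stand-in written for this listing and is not FastKafka's API: the `ToyApp` class, its method names and the topic names are all hypothetical, and no Kafka broker is involved.

```python
from collections import defaultdict
from functools import wraps
from typing import Callable


class ToyApp:
    """In-memory stand-in for a broker-backed app (hypothetical, not FastKafka)."""

    def __init__(self) -> None:
        self.handlers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)
        self.sent: dict[str, list[dict]] = defaultdict(list)

    def consumes(self, topic: str):
        """Register the decorated function as a handler for `topic`."""
        def decorator(func):
            self.handlers[topic].append(func)
            return func
        return decorator

    def produces(self, topic: str):
        """Publish whatever the decorated function returns to `topic`."""
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                message = func(*args, **kwargs)
                self.publish(topic, message)
                return message
            return wrapper
        return decorator

    def publish(self, topic: str, message: dict) -> None:
        """Record the message and deliver it to every registered handler."""
        self.sent[topic].append(message)
        for handler in self.handlers[topic]:
            handler(message)


app = ToyApp()


@app.produces(topic="greetings")
def to_greetings(name: str) -> dict:
    return {"text": f"Hello, {name}!"}


@app.consumes(topic="requests")
def on_request(message: dict) -> None:
    # Consuming a request triggers the producer for the reply topic.
    to_greetings(message["name"])


app.publish("requests", {"name": "Berlin"})
```

The appeal of this style is that the business logic stays in plain functions, while the decorators carry the wiring to topics. For the real decorators and their parameters, refer to the library documentation linked above.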

MLOps in practice: our journey from batch to real-time inference

2023-04-18

talk

Theodore Meynard

AI/ML CI/CD MLOps

I will present the challenges we encountered while migrating an ML model from batch to real-time predictions and how we handled them. In particular, I will focus on the design decisions and open-source tools we built to test the code, data and models as part of the CI/CD pipeline and enable us to ship fast with confidence.

PyLadies Panel Session. Tech Illusions and the Unbalanced Society: Finding Solutions for a Better Future

2023-04-18

talk

During this panel, we’ll discuss the significant role PyLadies chapters around the world have played in advocating for gender representation and leadership and combating biases and the gender pay gap.

The bumps in the road: A retrospective on my data visualisation mistakes

2023-04-18

talk

Artem Kislovskiy

CI/CD DataViz Matplotlib

We will delve into the importance of effective data visualisation in today's world. We will explore how it can help convey insights from data using Matplotlib and best practices for creating informative visualisations. We will also discuss the limitations of static visualisations and examine the role of continuous integration in streamlining the process and avoiding common pitfalls. By the end of this talk, you will have gained valuable insights and techniques for creating informative and accurate data visualisations, no matter what tools you're using.

Data Kata: Ensemble programming with Pydantic #2

2023-04-18

talk

Lev Konstantinovskiy, Nitsan Avni, Gregor Riegler

Pydantic

Write code as an ensemble to solve a data validation problem using Pydantic. Working together is not just about code - learn how to listen to colleagues, make typos in front of everyone, become a supportive team member, defend your ideas and maybe even accept criticism.

Let's contribute to pandas (3 hours) #2

2023-04-18

talk

Patrick Hoefler, Noa Tamir

HTML Pandas Python

PyData Berlin are excited to bring you this open source workshop dedicated to contributing to pandas. This tutorial is 3 hours. We will have a break and continue with the same group of people.

pandas is a data wrangling platform for Python widely adopted in the scientific computing community. In this session, you will be guided on how you can make your own contributions to the project; no prior experience of contributing is required! Not only will this teach you new skills and boost your CV, you'll also likely get a nice adrenaline rush when your contribution is accepted!

If you don’t finish your contribution during the event, we hope you will continue to work on it after the tutorial. pandas offers regular new contributor meetings and has a Slack space to provide ongoing support for new contributors. For more details, see our contributor community page: http://pandas.pydata.org/docs/dev/development/community.html

Coffee Break

2023-04-18

talk

Accelerating Public Consultations with Large Language Models: A Case Study from the UK Planning Inspectorate

2023-04-18

talk

Andreas Leed, Michele Dallachiesa

Cloud Computing LLM React

Local Planning Authorities (LPAs) in the UK rely on written representations from the community to inform their Local Plans, which outline development needs for their area. With an average of 2000 representations per consultation and 4 rounds of consultation per Local Plan, the volume of information can be overwhelming for both LPAs and the Planning Inspectorate, which is tasked with examining the legality and soundness of plans. In this study, we investigate the potential for Large Language Models (LLMs) to streamline representation analysis.

We find that LLMs have the potential to significantly reduce the time and effort required to analyse representations, with simulations on historical Local Plans projecting a reduction in processing time by over 30%, and experiments showing classification accuracy of up to 90%.

In this presentation, we discuss our experimental process which used a distributed experimentation environment with Jupyter Lab and cloud resources to evaluate the performance of the BERT, RoBERTa, DistilBERT, and XLNet models. We also discuss the design and prototyping of web applications to support the aided processing of representations using Voilà, FastAPI, and React. Finally, we highlight successes and challenges encountered and suggest areas for future improvement.

Delivering AI at Scale

2023-04-18

talk

Severin Schmitt, Anna Achenbach, Thorsten Kranz

Agile/Scrum AI/ML Cloud Computing Data Science Kubernetes MLOps

Everybody knows our yellow vans, trucks and planes around the world. But do you know how data drives our business and how we leverage algorithms and technology in our core operations? We will share some “behind the scenes” insights on Deutsche Post DHL Group’s journey towards a Data-Driven Company.

• Large-Scale Use Cases: Challenging and high impact Use Cases in all major areas of logistics, including Computer Vision and NLP
• Fancy Algorithms: Deep-Neural Networks, TSP Solvers and the standard toolkit of a Data Scientist
• Modern Tooling: Cloud Platforms, Kubernetes, Kubeflow, Auto ML
• No rusty working mode: small, self-organized, agile project teams, combining state of the art Machine Learning with MLOps best practices
• A young, motivated and international team – German skills are only “nice to have”

But we have more to offer than slides filled with buzzwords. We will demonstrate our passion for our work, deep dive into our largest use cases that impact your everyday life and share our approach for a timeseries forecasting library - combining data science, software engineering and technology for efficient and easy to maintain machine learning projects.

Visualizing your computer vision data is not a luxury, it's a necessity: without it, your models are blind and so do you.

2023-04-18

talk

Arnault Chazareix

DataViz

Are you ready to take your Computer Vision projects to the next level? Then don't miss this talk!

Data visualization is a crucial ingredient for the success of any computer vision project. It allows you to assess the quality of your data, grasp the intricacies of your project, and communicate effectively with stakeholders.

In this talk, we'll showcase the power of data visualization with compelling examples. You'll learn about the benefits of data visualization and discover practical methods and tools to elevate your projects.

Don't let this opportunity pass you by: join us and learn how to make data visualization a core feature of your Computer Vision projects.

When A/B testing isn’t an option: an introduction to quasi-experimental methods

2023-04-18

talk

Inga Janczuk

Running experiments to identify causal relationships is not always possible. This talk discusses an alternative approach: quasi-experimental frameworks. I will also present how to adjust well-known machine-learning algorithms so they can be used to quantify causal relationships.
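The abstract does not say which quasi-experimental methods are covered. As one well-known example of the family, the sketch below computes a difference-in-differences estimate; the choice of method, the function name `diff_in_diff` and all numbers are assumptions made for illustration only.

```python
from statistics import mean


def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Difference-in-differences estimate of a treatment effect.

    Relies on the parallel-trends assumption: without treatment, the
    treated group would have changed by the same amount as the control.
    """
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


# Made-up outcome measurements for two groups; only the first was treated.
effect = diff_in_diff(
    treated_before=[10, 12, 11],
    treated_after=[15, 17, 16],
    control_before=[9, 10, 11],
    control_after=[11, 12, 13],
)
```

Here the treated group's mean rises by 5 and the control group's by 2, so the estimated effect is 3: the part of the change that the shared trend does not explain.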

Writing Plugin Friendly Python Applications

2023-04-18

talk

Travis Hathaway

GitHub Python

In modern software engineering, plugin systems are a ubiquitous way to extend and modify the behavior of applications and libraries. When software is written in a way that is plugin friendly, it encourages the use of modular organization where the contracts between the core software and the plugin have been well thought out. In this talk, we cover exactly how to define this contract and how you can start designing your software to be more plugin friendly.

Throughout the talk we will be creating our own plugin friendly application using the pluggy library to show these design principles in action. At the end of the talk, I also cover a real-life case study of how the package manager conda is currently making its 10-year-old code more plugin friendly, to illustrate how to retrofit an existing project.
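The idea of a contract between the core software and its plugins can be sketched with the standard library alone. The example below is a hand-rolled illustration and deliberately does not use pluggy's API; `GreeterPlugin`, `HookRegistry` and the two sample plugins are hypothetical names invented for this sketch.

```python
from typing import Protocol


class GreeterPlugin(Protocol):
    """The contract: every plugin exposes a name and a greet() hook."""

    name: str

    def greet(self, user: str) -> str: ...


class HookRegistry:
    """Keeps registered plugins and calls the hook on each of them."""

    def __init__(self) -> None:
        self._plugins: dict[str, GreeterPlugin] = {}

    def register(self, plugin: GreeterPlugin) -> None:
        # Reject duplicates so one plugin cannot silently replace another.
        if plugin.name in self._plugins:
            raise ValueError(f"plugin {plugin.name!r} is already registered")
        self._plugins[plugin.name] = plugin

    def call_greet(self, user: str) -> list[str]:
        # Results are returned in registration order.
        return [plugin.greet(user) for plugin in self._plugins.values()]


class English:
    name = "english"

    def greet(self, user: str) -> str:
        return f"Hello, {user}!"


class German:
    name = "german"

    def greet(self, user: str) -> str:
        return f"Hallo, {user}!"


registry = HookRegistry()
registry.register(English())
registry.register(German())
greetings = registry.call_greet("PyData")
```

The core application depends only on the `GreeterPlugin` contract, so new behavior can be added by registering another class without touching the core. A library such as pluggy adds hook specifications, validation and call ordering on top of this basic idea.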

Data-driven design for the Dask scheduler

2023-04-18

talk

Guido Imperiale

Historically, changes in the scheduling algorithm of Dask have often been based on theory, single use cases, or even gut feeling. Coiled has now moved to using hard, comprehensive performance metrics for all changes - and it's been a turning point!

Getting started with JAX

2023-04-18

talk

Simon Pressler

PyTorch TensorFlow

DeepMind's JAX ecosystem provides deep learning practitioners with an appealing alternative to TensorFlow and PyTorch. Among its strengths are features such as native TPU support, as well as easy vectorization and parallelization. Nevertheless, taking your first steps in JAX can feel complicated given some of its idiosyncrasies. This talk helps new users get started in this promising ecosystem by sharing practical tips and best practices.

Methods for Text Style Transfer: Text Detoxification Case

2023-04-18

talk

Daryna Dementieva

Global access to the Internet has enabled the spread of information throughout the world and has offered many new possibilities. On the other hand, alongside the advantages, the exponential and uncontrolled growth of user-generated content on the Internet has also facilitated the spread of toxicity and hate speech. Much work has been done on offensive speech detection. However, there is another, more proactive way to fight toxic speech: showing the user a suggestion in the form of a detoxified version of the message. In this presentation, we will provide an overview of how the text detoxification task can be solved. The proposed approaches can be reused for any text style transfer task, for both monolingual and multilingual use cases.

Pragmatic ways of using Rust in your data project

2023-04-18

talk

Christopher Prohm

NumPy Pandas Python Rust

Writing efficient data pipelines in Python can be tricky. The standard recommendation is to use vectorized functions implemented in NumPy, pandas, or the like. But what should you do when the processing task does not fit these libraries? Using plain Python for processing can result in poor performance, in particular when handling large data sets.

Rust is a modern, performance-oriented programming language that is already widely used by the Python community. Augmenting data processing steps with Rust can result in substantial speedups. In this talk, I will present strategies for using Rust in a larger Python data processing pipeline, with a particular focus on pragmatism and minimizing integration effort.

You are what you read: Building a personal internet front-page with spaCy and Prodigy

2023-04-18

talk

Victoria Slocum

NLP

Sometimes the internet can be a bit overwhelming, so I thought I would make a tool to create a personalized summary of it! In this talk, I'll demonstrate a personal front-page project that allows me to filter info on the internet on a certain topic, built using spaCy, an open-source library for NLP, and Prodigy, a scriptable annotation tool. With this project, I learned about the power of working with tools that provide extensive customizability without sacrificing ease of use. Throughout the talk, I'll also discuss how design concepts of developer tools can improve the development experience when building complex and adaptable software.

Data Kata: Ensemble programming with Pydantic #1

2023-04-18

talk

Lev Konstantinovskiy, Nitsan Avni, Gregor Riegler

Pydantic

Write code as an ensemble to solve a data validation problem with Pydantic. Working together is not just about code - learn how to listen to colleagues, make typos in front of everyone, become a supportive team member, defend your ideas and maybe even accept criticism.

Let's contribute to pandas (3 hours) #1

2023-04-18

talk

Patrick Hoefler, Noa Tamir

HTML Pandas Python

PyData Berlin are excited to bring you this open source workshop dedicated to contributing to pandas. This tutorial is 3 hours. We will have a break and continue with the same group of people.

pandas is a data wrangling platform for Python widely adopted in the scientific computing community. In this session, you will be guided on how you can make your own contributions to the project; no prior experience of contributing is required! Not only will this teach you new skills and boost your CV, you'll also likely get a nice adrenaline rush when your contribution is accepted!

If you don’t finish your contribution during the event, we hope you will continue to work on it after the tutorial. pandas offers regular new contributor meetings and has a Slack space to provide ongoing support for new contributors. For more details, see our contributor community page: http://pandas.pydata.org/docs/dev/development/community.html
