Database management systems (DBMSs) are the backbone for managing large volumes of data efficiently and thus play a central role in business and science today. To provide high performance, many of the most complex DBMS components, such as query optimizers or schedulers, must solve non-trivial problems. To tackle such problems, recent work has outlined a new direction of so-called learned DBMSs, where core parts of a DBMS are replaced by machine learning (ML) models, which has been shown to provide significant performance benefits. However, a major drawback of current approaches to learned DBMS components is that they not only cause very high overhead for training an ML model to replace a DBMS component, but that this overhead occurs repeatedly, which renders these approaches far from practical. Hence, in this talk, I present my vision of Learned DBMS Components 2.0 to tackle these issues. First, I will introduce data-driven learning, where the idea is to learn the data distribution over a complex relational schema. In contrast to workload-driven learning, no large workload has to be executed on the database to gather training data. While data-driven learning has many applications, such as cardinality estimation or approximate query processing, many DBMS tasks, such as physical cost estimation, cannot be supported. I therefore propose a second technique called zero-shot learning, which is a general paradigm for learned DBMS components. Here, the idea is to train models that generalize to unseen databases out of the box, without having to be retrained for each new database.
talk-data.com
Event
PyConDE & PyData Berlin 2023
Activities tracked
191
Sessions & talks
Showing 101–125 of 191 · Newest first
Announcements
Lunch
An exchange of views on FastAPI in practice.
FastAPI is great, it helps many developers create REST APIs based on the OpenAPI standard and run them asynchronously. It has a thriving community and educational documentation.
FastAPI does a great job of getting people started with APIs quickly.
This talk will point out some obstacles and dark spots that we wish we had known about before, and highlight solutions to them.
At the semiconductor division of Carl Zeiss it's our mission to continuously make computer chips faster and more energy efficient. To do so, we go to the very limits of what is possible, both physically and technologically. This is only possible through massive research and development efforts.
In this talk, we tell the story of how Python became a central tool for our R&D activities. This includes technical aspects as well as organization and culture:
- How do you make sure that hundreds of people work in consistent environments?
- How do you get all people on board to work together with Python?
- You have lots of domain experts without much software background. How do you prevent them from creating a mess when projects get larger?
Keeping in mind the Pythonic principle that “simple is better than complex”, we will see how to create a web map with the Python-based web framework Django, using its GeoDjango module to store geographic data in a local database and run geospatial queries against it.
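GeoDjango's spatial lookups (e.g. filtering by distance) ultimately reduce to great-circle distance computations in the database. As a minimal, library-free illustration of the kind of query involved, here is a stdlib-only haversine sketch; the place names and coordinates are made-up sample data, not part of the talk:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    R = 6371.0  # mean Earth radius in km
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * R * asin(sqrt(a))

# "Find all places within 15 km of Alexanderplatz" -- the pure-Python analogue
# of a GeoDjango distance filter (sample coordinates only)
center = (52.5219, 13.4132)
places = {"Potsdam": (52.3906, 13.0645), "Kreuzberg": (52.4996, 13.4033)}
nearby = [name for name, (lat, lon) in places.items()
          if haversine_km(center[0], center[1], lat, lon) <= 15]
```

In a real GeoDjango project the database (e.g. PostGIS) performs this computation with spatial indexes, so the Python code only expresses the query.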
Observability for Distributed Computing with Dask
Debugging is hard. Distributed debugging is hell.
Dask is a popular library for parallel and distributed computing in Python. Dask is commonly used in data science, actual science, data engineering, and machine learning to distribute workloads onto clusters of many hundreds of workers with ease.
However, when things go wrong life can become difficult due to all of the moving parts. These parts include your code, other PyData libraries like NumPy/pandas, the machines you’re running on, the network between them, storage, the cloud, and of course issues with Dask itself. It can be difficult to understand what is going on, especially when things seem slower than they should be or fail unexpectedly. Observability is the key to sanity and success.
In this talk, we describe the tools Dask offers to help you observe your distributed cluster, analyze performance, and monitor your cluster to react to unexpected changes quickly. We will dive into distributed logging, automated metrics, event-based monitoring, and root-causing problems with diagnostic tooling. Throughout the talk, we will leverage real-world use cases to show how these tools help to identify and solve problems for large-scale users in the wild.
This talk should be particularly insightful for Dask users, but the approaches to observing distributed systems should be relevant to anyone operating at scale in production.
“Got an NLP problem nowadays? Use transformers! Just download a pretrained model from the hub!” - every blog article ever
As if it’s that easy, because nearly all pretrained models have a very annoying limitation: they can only process short input sequences. Not every NLP practitioner happens to work on tweets, but instead many of us have to deal with longer input sequences. What started as a minor design choice for BERT, got cemented by the research community over the years and now turns out to be my biggest headache: the 512 tokens limit.
In this talk, we’ll ask a lot of dumb questions and get an equal number of unsatisfying answers:
- How much text actually fits into 512 tokens? Spoiler: not enough to solve my use case, and I bet a lot of your use cases, too.
- I can feed a sequence of any length into an RNN, why do transformers even have a limit? We’ll look into the architecture in more detail to understand that.
- Somebody smart must have thought about this sequence length issue before, or not? Prepare yourself for a rant about benchmarks in NLP research.
- So what can we do to handle longer input sequences? Enjoy my collection of mediocre workarounds.
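One common workaround for the 512-token limit, splitting long inputs into overlapping windows, can be sketched in a few lines of plain Python. The `max_len` and `stride` parameters are named after the analogous options in Hugging Face tokenizers, but this standalone version makes no library assumptions:

```python
def chunk_tokens(tokens, max_len=512, stride=128):
    """Split a token sequence into overlapping windows of at most max_len tokens.

    Consecutive windows share `stride` tokens of context (stride < max_len),
    so no sentence is cut off without any surrounding context.
    """
    if len(tokens) <= max_len:
        return [tokens]
    chunks = []
    step = max_len - stride  # how far the window advances each time
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # last window already reaches the end of the sequence
    return chunks
```

Each chunk is then fed through the model separately and the per-chunk outputs are aggregated (e.g. max over logits), which is exactly the "mediocre workaround" trade-off: cheap to implement, but the model never attends across chunk boundaries.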
Actionable Machine Learning in the Browser with PyScript
PyScript brings the full PyData stack to the browser, opening up unprecedented use cases for interactive, data-intensive applications. In this scenario, the web browser becomes a ubiquitous computing platform, operating within a (nearly) zero-installation, serverless environment.
In this talk, we will explore how to create full-fledged interactive front-end machine learning applications using PyScript. We will dive into the main features of the PyScript platform (e.g. built-in JavaScript integration and local modules), discussing the new data and design patterns (e.g. loading heterogeneous data in the browser) required to adapt to and overcome the limitations imposed by the new operating environment (i.e. the browser).
BLE and Python: How to build a simple BLE project on Linux with Python
Bluetooth Low Energy (BLE) is a part of the Bluetooth standard aimed at bringing wireless technology to low-power devices, and it's getting into everything - lightbulbs, robots, personal health and fitness devices, and plenty more. One of the main advantages of BLE is that everybody can integrate those devices into their tools or projects.
However, BLE is not the most developer-friendly protocol, and these devices often don't come with good documentation. In addition, there are few good open-source tools, examples, and tutorials on how to use Python with BLE, especially if one wants to build both sides of the communication.
In this talk, I will introduce the concepts and properties used in BLE interactions and look at how we can use the Linux Bluetooth Stack (Bluez) to communicate with other devices. We will look at a simple example and learn along the way about common pitfalls and debugging options while working with BLE and Python.
This talk is for everybody who has a basic understanding of Python and wants a deeper understanding of how BLE works and how one could use it in a private project.
Chatbots are fun to use, ranging from simple chit-chat (“How are you today?”) to more sophisticated use cases like shopping assistants, or the diagnosis of technical or medical problems. Despite their mostly simple user interaction, chatbots must combine various complex NLP concepts to deliver convincing, intelligent, or even witty results.
With the advancing development of machine learning models and the availability of open source frameworks and libraries, chatbots are becoming more powerful every day and at the same time easier to implement. Yet, depending on the concrete use case, the implementation must be approached in specific ways. In the design process of chatbots it is crucial to define the language processing tasks thoroughly and to choose from a variety of techniques wisely.
In this talk, we will look together at common concepts and techniques in modern chatbot implementation as well as practical experiences from an E-mobility bot that was developed using the Rasa framework.
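The intent-to-response flow at the heart of such a bot can be illustrated without any framework. The following toy classifier uses keyword overlap where Rasa would use a trained NLU model; all intents, keywords, and responses below are invented for illustration:

```python
# Toy chatbot core: classify a message into an intent, then answer from a
# response table. A real Rasa pipeline replaces classify() with an ML model.
INTENTS = {
    "greet": {"hello", "hi", "hey"},
    "charging_station": {"charge", "charging", "station", "plug"},
    "goodbye": {"bye", "goodbye"},
}
RESPONSES = {
    "greet": "Hi! How can I help you with e-mobility today?",
    "charging_station": "The nearest charging station is 500 m away.",
    "goodbye": "Goodbye!",
    None: "Sorry, I did not understand that.",  # fallback intent
}

def classify(message):
    """Return the intent whose keyword set overlaps the message most, or None."""
    words = set(message.lower().split())
    scores = {intent: len(words & keywords) for intent, keywords in INTENTS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def reply(message):
    return RESPONSES[classify(message)]
```

Even this sketch shows why the talk stresses defining the language-processing tasks thoroughly: the fallback path, intent granularity, and response selection are design decisions independent of the ML model used.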
Python is a very expressive and powerful language, but it is not always the fastest option for performance-critical parts of an application. Rust, on the other hand, is known for its lightning-fast runtime and low-level control, making it an attractive option for speeding up performance-sensitive portions of Python programs.
In this talk, we will present a case study of using Rust to speed up a critical component of a Python application. We will cover the following topics:
- An overview of Rust and its benefits for Python developers
- Profiling and identifying performance bottlenecks in a Python application
- Implementing a solution in Rust and integrating it with the Python application using PyO3
- Measuring the performance improvements and comparing them to other optimization techniques
Attendees will learn about the potential for using Rust to boost the performance of their Python programs and how to go about doing so in their own projects.
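Before porting anything to Rust, the profiling step from the outline can be done entirely with Python's built-in tools. A minimal sketch, where `slow_sum` is a made-up stand-in for a real hot path worth rewriting:

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Deliberately naive hot loop: the kind of pure-Python code that
    # profiling would flag as a candidate for a Rust/PyO3 extension
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100_000)
profiler.disable()

# Render the top functions by cumulative time into a string report
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
```

Only after a report like this confirms where time is actually spent does it pay off to reach for PyO3, which keeps the Rust effort focused on the measured bottleneck rather than on guesses.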
“Who is an NLP expert?” - Lessons Learned from building an in-house QA-system
Innovations such as sentence-transformers, neural search and vector databases have fueled very fast development of question-answering systems recently. At scieneers, we wanted to test those components to satisfy our own information needs using a Slack bot that answers our questions by reading through our internal documents and Slack conversations. We therefore leveraged the Haystack QA framework in combination with a Weaviate vector database and many fine-tuned NLP models. This talk will give you insights into both the technical challenges we faced and the organizational lessons we learned.
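At the core of the vector-database retrieval described here is nearest-neighbour search by cosine similarity over document embeddings. A minimal stdlib sketch with made-up three-dimensional "embeddings" (a real system like the one in the talk would use sentence-transformer vectors with hundreds of dimensions):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings of internal documents (invented sample data)
docs = {
    "onboarding guide": [0.9, 0.1, 0.0],
    "expense policy":   [0.1, 0.8, 0.2],
    "vacation rules":   [0.0, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]  # embedding of the user's question
best = max(docs, key=lambda name: cosine(query, docs[name]))
```

A vector database such as Weaviate does exactly this lookup, but with approximate-nearest-neighbour indexes so it stays fast over millions of documents; the retrieved passages are then handed to a reader model for answer extraction.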
The aspect-oriented programming paradigm can support the separation of cross-cutting concerns such as logging, caching, or checking of permissions. This can improve code modularity and maintainability. Python offers decorators to implement reusable code for cross-cutting tasks.
This tutorial is an in-depth introduction to decorators. It covers the usage of decorators and how to implement simple and more advanced decorators. Use cases demonstrate how to work with decorators. In addition to showing how functions can use closures to create decorators, the tutorial introduces callable class instances as an alternative. Class decorators can solve problems that used to be tasks for metaclasses. The tutorial provides use cases for class decorators.
While the focus is on best practices and practical applications, the tutorial also provides deeper insight into how Python works behind the scenes. After the tutorial, participants will feel comfortable with functions that take functions and return new functions.
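As a taste of the two decorator styles mentioned, here is a minimal sketch of a closure-based function decorator (caching) and a callable class instance used as a decorator (call counting); the example functions are invented for illustration:

```python
import functools

def memoize(func):
    """Closure-based decorator: cache results keyed by positional arguments."""
    cache = {}

    @functools.wraps(func)  # preserve __name__ and docstring of the wrapped function
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    return wrapper

class CountCalls:
    """Callable class instance as decorator: track how often a function runs."""

    def __init__(self, func):
        functools.update_wrapper(self, func)
        self.func = func
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        return self.func(*args, **kwargs)

@memoize
def fib(n):
    # Exponential without the cache, linear with it
    return n if n < 2 else fib(n - 1) + fib(n - 2)

@CountCalls
def greet(name):
    return f"Hello, {name}!"
```

The closure form suits stateless wrapping, while the class form is handy when the decorator itself carries state (here, the `calls` counter) that callers may want to inspect.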
Bayesian Marketing Science: Solving Marketing's 3 Biggest Problems
In this talk I will present two new open-source packages that make up a powerful and state-of-the-art marketing analytics toolbox. Specifically, PyMC-Marketing is a new library built on top of the popular Bayesian modeling library PyMC. PyMC-Marketing allows robust estimation of customer acquisition costs (via media mix modeling) as well as customer lifetime value. In addition, I will show how we can estimate the effectiveness of marketing campaigns using a new Bayesian causal inference package called CausalPy. The talk is application-focused, featuring a real-world case study and many code examples. Special emphasis will be placed on the interplay between these tools and how they can be combined to make optimal marketing budget decisions in complex scenarios.
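The Bayesian core idea behind such tooling can be illustrated without PyMC using a conjugate Beta-Binomial update, the simplest possible model of a campaign conversion rate; the click counts below are invented sample data, not from the talk:

```python
# Conceptual sketch only: PyMC-Marketing fits far richer models (media mix,
# lifetime value), but the underlying move is the same: prior + data -> posterior.
def posterior_beta(successes, trials, prior_a=1.0, prior_b=1.0):
    """Update a Beta(prior_a, prior_b) prior with binomial outcome data.

    Returns the posterior parameters and the posterior mean conversion rate.
    """
    a = prior_a + successes
    b = prior_b + trials - successes
    mean = a / (a + b)
    return a, b, mean

# Hypothetical A/B campaign data: clicks out of impressions
a_a, b_a, mean_a = posterior_beta(120, 1000)  # campaign A
a_b, b_b, mean_b = posterior_beta(150, 1000)  # campaign B
better = "B" if mean_b > mean_a else "A"
```

Because the posterior is a full distribution rather than a point estimate, budget decisions can weigh uncertainty as well, which is exactly what the MCMC-based models in PyMC provide for the harder, non-conjugate cases.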
Geospatial Data Processing with Python: A Comprehensive Tutorial
In this tutorial, you will learn about the various Python modules for processing geospatial data, including GDAL, Rasterio, Pyproj, Shapely, Folium, Fiona, OSMnx, Libpysal, Geopandas, Pydeck, Whitebox, ESDA, and Leaflet. You will gain hands-on experience working with real-world geospatial data and learn how to perform tasks such as reading and writing spatial data, reprojecting data, performing spatial analyses, and creating interactive maps. This tutorial is suitable for beginners as well as intermediate Python users who want to expand their knowledge in the field of geospatial data processing.
Improving Machine Learning from Human Feedback
Large generative models rely upon massive data sets that are collected automatically. For example, GPT-3 was trained with data from “Common Crawl” and “Web Text”, among other sources. As the saying goes, bigger isn’t always better. While powerful, these data sets (and the models trained on them) often come at a cost, bringing their “internet-scale biases” along with their “internet-trained models.” This raises the question: is unsupervised learning the best future for machine learning?
ML researchers have developed new model-tuning techniques to address the known biases within existing models and improve their performance (as measured by response preference, truthfulness, toxicity, and result generalization). All of this at a fraction of the initial training cost. In this talk, we will explore these techniques, known as Reinforcement Learning from Human Feedback (RLHF), and how open-source machine learning tools like PyTorch and Label Studio can be used to tune off-the-shelf models using direct human feedback.
Even if every data science work is special, a lot can be learned from similar problems solved in the past. In this talk, I will share some specific software design concepts that data scientists can use to build better data products.
As the number of production machine learning use-cases increase, we find ourselves facing new and bigger challenges where more is at stake. Because of this, it's critical to identify the key areas to focus our efforts, so we can ensure our machine learning pipelines are reliable and scalable. In this talk we dive into the state of production machine learning in the Python Ecosystem, and we will cover the concepts that make production machine learning so challenging, as well as some of the recommended tools available to tackle these challenges.
This talk will cover key principles, patterns and frameworks around the open-source frameworks powering single or multiple phases of the end-to-end ML lifecycle, including model training, deployment, monitoring, etc. We will cover a high-level overview of the production ML ecosystem and dive into best practices that have been abstracted from production use cases of machine learning operations at scale, as well as how to leverage tools that will allow us to deploy, explain, secure, monitor and scale production machine learning systems.