talk-data.com


Activities & events


Welcome to the PyData Berlin November meetup!

We would like to welcome you all starting from 18:45. There will be food and drinks. The talks begin around 19:30, at which point the doors will close, so make sure to arrive on time!

Please provide your first and last name for registration, as this is required by the venue's entry policy. If you cannot attend, please cancel your spot so others can join; space is limited.

Host: Bonial is excited to welcome you to this month's edition of PyData.

The lineup for the evening:

Talk 1: Running Python data transformations at scale with dbt and Astronomer Cosmos Abstract: I will discuss how we built a tool using dbt and Astronomer Cosmos to orchestrate Python data transformations at scale. As part of a global team, we faced the challenge of developing and scaling data transformations for our entities across more than 50 countries. Our data scientists write these transformations to improve our machine learning models, and the need to manage such a large number of entities required an efficient and scalable solution. This tool streamlines the entire process, enabling data scientists to quickly develop data transformations, leverage built-in dbt tests for data validation, and seamlessly deploy these transformations to production environments. The integration of dbt and Astronomer Cosmos has significantly accelerated our workflows, ensuring robust and scalable data operations while also empowering our data scientists to deliver more value, faster.

Speaker: Galuh Sahid currently works as a Senior Machine Learning Engineer at Delivery Hero. She is also recognized as a Google Developer Expert in Machine Learning. She previously worked at Twitter and Gojek. She has developed and productionized various ML applications, including fraud detection, content moderation, and marketing using traditional ML, NLP, and computer vision. In her free time, Galuh enjoys painting and hiking.
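To make the orchestration pattern from Talk 1 concrete, here is a minimal sketch of an Astronomer Cosmos DAG wrapping a dbt project, in which Cosmos renders each dbt model and its tests as Airflow tasks. The project path, connection ID, profile names, and schedule are hypothetical placeholders for illustration, not the speaker's actual setup.

```python
from datetime import datetime

from cosmos import DbtDag, ProjectConfig, ProfileConfig
from cosmos.profiles import PostgresUserPasswordProfileMapping

# Hypothetical warehouse connection and dbt profile, for illustration only.
profile_config = ProfileConfig(
    profile_name="analytics",
    target_name="prod",
    profile_mapping=PostgresUserPasswordProfileMapping(
        conn_id="warehouse_conn",
        profile_args={"schema": "analytics"},
    ),
)

# Cosmos turns each dbt model (and its built-in tests) into an Airflow
# task, so data validation runs as part of the orchestrated pipeline.
transformations_dag = DbtDag(
    project_config=ProjectConfig("/opt/airflow/dbt/my_project"),
    profile_config=profile_config,
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    dag_id="dbt_transformations",
)
```

Because the dbt tests become first-class tasks, a failing data validation stops downstream models from running, which is one way a setup like this scales across many entities.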

Talk 2: Anomaly Detection in Track Scenes Abstract: In the “Digitale Schiene Deutschland” initiative, Deutsche Bahn is developing an automated train driving system. To support this, we collaborated with them to create a machine learning solution that detects anomalous objects on and around tracks using onboard RGB cameras. Rather than recognizing specific object classes (e.g., people, signals), this system identifies any object and ranks it by anomaly. This presentation covers challenges, approaches, and the final solution: a unique pipeline using multiple machine learning components, including monocular depth estimation, segmentation, image embedding, and anomaly detection. The OSDAR23 dataset, containing 45 scenes with RGB, infrared, radar, and lidar data, aids in model finetuning and evaluation. Additionally, unannotated data was used for self-supervised learning.

Speaker: Maximilian Trescher studied physics in Berlin and Paris and completed a PhD in theoretical physics at Freie Universität Berlin in 2018. He then worked for four years (2018-2022) as a software engineer (Java, databases, etc.). Since 2022 he has been a Machine Learning Scientist at dida (www.dida.do).

Lightning talks: There will be slots for 2-3 lightning talks (3-5 minutes each). Kindly let us know at the start of the meetup if you would like to present something :)

NumFOCUS Code of Conduct (the short version): Be kind to others. Do not insult or put down others. Behave professionally. Remember that harassment and sexist, racist, or exclusionary jokes are not appropriate for NumFOCUS. All communication should be appropriate for a professional audience including people of many different backgrounds. Sexual language and imagery are not appropriate. NumFOCUS is dedicated to providing a harassment-free community for everyone, regardless of gender, sexual orientation, gender identity and expression, disability, physical appearance, body size, race, or religion. We do not tolerate harassment of community members in any form. Thank you for helping make this a welcoming, friendly community for all. If you haven't yet, please read the detailed version here: https://numfocus.org/code-of-conduct

PyData Berlin 2024 November Meetup
Project Sprints 2024-09-27 · 07:30
PyData Paris 2024

On the day following the PyData Paris conference, open-source contributors and maintainers will gather at Carrefour Numérique to contribute to the open-source data science stack.

Note: Sprints are open to everyone, not just attendees of the conference. However, if you are interested in attending the PyData Paris conference, you can check out details and sign up at https://pydata.org/paris2024. PyData Paris will be the main gathering of the open-source data-science and AI/ML community.

Contributors to the following projects have already signed up for the sprint day!

  • pandas: Data analysis and manipulation tool
  • pyarrow: Columnar in-memory analytics
  • narwhals: Lightweight compatibility layer between dataframe libraries
  • scikit-learn: Machine Learning in Python
  • skrub: Prepping tables for machine learning
  • geopandas: Geospatial data analysis extending pandas
  • Project Jupyter: Interactive Scientific Computing
  • Make Open Data: Modern Data Stack Open Source for public data

You can also check out the PyData Paris page on the sprint day, which will receive frequent updates on sprint subjects: https://pydata.org/paris2024/sprints.

We will welcome participants starting at 9:30 AM at the Carrefour Numérique, located at 30 Av. Corentin Cariou, 75019 Paris.

Sprint Day at PyData Paris 2024
Event PyData Paris 2024 2024-09-26
Farewell 2024-09-26 · 15:15
Lightning Talks 2024-09-26 · 14:30

Machine Learning practitioners build predictive models from "noisy" data resulting in uncertain predictions. But what does "noise" mean in a machine learning context?

AI/ML
Coffee Break 2024-09-26 · 13:30

Rising concerns over IT's carbon footprint necessitate tools that gauge and mitigate these impacts. This session introduces CodeCarbon, an open-source tool that estimates computing's carbon emissions by measuring energy use across hardware components. Aimed at AI researchers and data scientists, CodeCarbon provides actionable insights into the environmental costs of computational projects, supporting efforts towards sustainability without requiring deep technical expertise.

This talk from the main contributors of Code Carbon will cover the environmental impact of IT, the possibilities to estimate it and a demo of CodeCarbon.

AI/ML
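At its core, the estimate a tool like CodeCarbon automates is simple arithmetic: measured energy use (kWh) multiplied by the carbon intensity of the local electricity grid (kgCO2e/kWh). A minimal sketch with illustrative, made-up numbers (the power draw and grid intensity below are placeholders, not real measurements):

```python
def estimate_emissions_kg(energy_kwh: float, grid_intensity_kg_per_kwh: float) -> float:
    """Estimate CO2-equivalent emissions from measured energy use.

    This is the core calculation a tool like CodeCarbon automates,
    after measuring CPU/GPU/RAM energy draw and looking up the
    carbon intensity of the local electricity grid.
    """
    return energy_kwh * grid_intensity_kg_per_kwh

# Illustrative values: a 4-hour GPU job drawing ~0.3 kW on a grid
# with an assumed intensity of 0.4 kgCO2e/kWh.
energy = 4 * 0.3          # 1.2 kWh
emissions = estimate_emissions_kg(energy, 0.4)
print(f"{emissions:.2f} kgCO2e")  # 0.48 kgCO2e
```

The hard part, which CodeCarbon handles, is measuring the energy term per hardware component and looking up a realistic intensity for the user's location.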
Stefanie Sabine Senger, Guillaume Lemaitre – scikit-learn maintainers @ scikit-learn

In this talk, we provide an update on the latest scikit-learn features that have been implemented in versions 1.4 and 1.5. We will particularly discuss the following features:

  • the metadata routing API, which allows passing metadata through composite estimators;
  • TunedThresholdClassifierCV, which allows tuning the operational decision threshold with a custom metric;
  • better support for categorical features and missing values;
  • improved interoperability with array and dataframe libraries.
API Scikit-learn

Scaling machine learning at large organizations like Renault Group presents unique challenges in terms of scale, legal requirements, and diversity of use cases. Data scientists require streamlined workflows and automated processes to efficiently deploy models into production. We present an MLOps pipeline based on Python, Kubeflow, and the GCP Vertex AI API designed specifically for this purpose. It enables data scientists to focus on code development for pre-processing, training, evaluation, and prediction. This MLOps pipeline is a cornerstone of the AI@Scale program, which aims to roll out AI across the Group.

We chose a Python-first approach, allowing data scientists to focus purely on writing preprocessing or ML-oriented Python code, while also supporting data retrieval through SQL queries. The pipeline addresses key questions such as prediction type (batch or API), model versioning, resource allocation, drift monitoring, and alert generation. It favors faster time to market through automated deployment and infrastructure management. Although we encountered pitfalls and design difficulties, which we will discuss during the presentation, the pipeline integrates with a CI/CD process, ensuring efficient and automated model deployment and serving.

Finally, this MLOps solution empowers Renault data scientists to seamlessly translate innovative models into production and smooths the development of scalable, impactful AI-driven solutions.

AI/ML API CI/CD GCP MLOps Python SQL

The EU Commission is likely to vote on the Cyber Resilience Act (CRA) later this year. In this talk we will look at the timeline for the new legislation, any critical discussions happening around implementation and most importantly, the new responsibilities outlined by the CRA. We’ll also discuss what the PSF is doing for CPython and for PyPI and what each of us in the Python ecosystem might want to do to get ready for a new era of increased certainty – and liability – around security.

Python Cyber Security

Adaptive prediction intervals, which represent prediction uncertainty, are crucial for practitioners involved in decision-making. Having an adaptivity feature is challenging yet essential, as an uncertainty measure must reflect the model's confidence for each observation. Attendees will learn about state-of-the-art algorithms for constructing adaptive prediction intervals, which is an active area of research.

The MedTech industry is undergoing a revolutionary transformation, with continuous innovations promising greater precision, efficiency, and accessibility. Oncology in particular, the branch of medicine that focuses on cancer, will benefit immensely from these new technologies, which may enable clinicians to detect cancer earlier and increase chances of survival. Detecting cancerous cells in microscopic photographs of cells (Whole Slide Images, aka WSIs) is usually done with segmentation algorithms, which neural networks (NNs) are very good at. While using ML and NNs for image segmentation is a fairly standard task with established solutions, doing it on WSIs is a different kettle of fish. Most training pipelines and systems have been designed for analytics, meaning huge columns of small individual records. In the case of WSIs, a single image is so huge that its file can be dozens of gigabytes. To allow innovation in medical imaging with AI, we need efficient and affordable ways to store and process these WSIs at scale.

AI/ML Analytics Cloud Computing
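Why is this hard in practice? A whole-slide image is far too large to load at once, so pipelines typically process it as a grid of tiles. A toy sketch of that tiling access pattern (plain NumPy, with a tiny array standing in for a gigapixel slide; real pipelines read tiles lazily from pyramidal file formats):

```python
import numpy as np

def iter_tiles(image: np.ndarray, tile: int):
    """Yield (row, col, patch) tiles covering the image.

    Real WSI pipelines read tiles lazily from pyramidal formats
    rather than holding the full slide in memory; this toy version
    only illustrates the access pattern.
    """
    h, w = image.shape[:2]
    for r in range(0, h, tile):
        for c in range(0, w, tile):
            yield r, c, image[r:r + tile, c:c + tile]

# A small stand-in "slide": 6x8 pixels, tiled into 4x4 patches
# (edge tiles are smaller).
slide = np.arange(48).reshape(6, 8)
tiles = list(iter_tiles(slide, 4))
print(len(tiles))  # 4 tiles (2 rows x 2 cols of patches)
```

Each tile then flows through the segmentation model independently, which is what makes storage layout and tile-serving throughput the real bottlenecks at scale.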

Open Source Software, the backbone of today’s digital infrastructure, must be sustainable for the long-term. Qureshi and Fang (2011) find that motivating, engaging, and retaining new contributors is what makes open source projects sustainable.

Yet, as Steinmacher et al. (2015) identify, first-time open source contributors often lack timely answers to questions, newcomer orientation, mentors, and clear documentation. Moreover, since the term was first coined in 1998, open source has lagged far behind other technical domains in participant diversity. Trinkenreich et al. (2022) report that only about 5% of projects have women as core developers, and women authored less than 5% of pull requests, yet had similar or even higher rates of pull request acceptance than men. So, how can we achieve more diversity in open source communities and projects?

Bloomberg’s Women in Technology (BWIT) community, Open Source Program Office (OSPO), and Corporate Philanthropy team collaborated with NumFOCUS to develop a volunteer incentive model that aligns business value, philanthropic impact, and individual technical growth. Through it, participating Bloomberg engineers were given the opportunity to convert their hours spent contributing to the pandas open source project into a charitable donation to a non-profit of their choice.

The presenters will discuss how we wove together differing viewpoints: non-profit foundation and for-profit corporation, corporate philanthropy and engineers, first-time contributors and core devs. They will showcase why and how we converted technical contributions into charitable dollars, the difference this community-building model had in terms of creating a diverse and sustained group of new open source contributors, and the viability of extending this to other open source projects and corporate partners to contribute to the long-term sustainability of open source—thereby demonstrating the true convergence of tech and social impact.

NOTE: [1] Qureshi, I, and Fang, Y. "Socialization in open source software projects: A growth mixture modeling approach." 2011. [2] Steinmacher, I., et al. "Social barriers faced by newcomers placing their first contribution in open source software projects." 2015. [3] Trinkenreich, B., et al. "Women’s participation in open source software: A survey of the literature." 2022.

Pandas

Retrieval is the process of searching a large database for items (images, text, …) that are similar to one or more query items. A classical approach is to transform the database items and the query item into vectors (also called embeddings) with a trained model so that they can be compared via a distance metric. It has many applications across fields, e.g. building a visual recommendation system like Google Lens, or RAG (Retrieval-Augmented Generation), a technique used to inject specific knowledge into LLMs depending on the query. Vector databases ease the management, serving, and retrieval of these vectors in production and implement efficient indexes to rapidly search through millions of vectors. They have gained a lot of attention over the past year due to the rise of LLMs and RAG.

Although people working with LLMs are increasingly familiar with the basic principles of vector databases, the finer details and nuances often remain obscure. This lack of clarity hinders the ability to make optimal use of these systems.

In this talk, we will detail two examples of real-life projects (deduplication of real estate adverts using the image embedding model DINOv2, and RAG for a medical company using the text embedding model Ada-2) and deep dive into retrieval and vector databases to demystify the key aspects and highlight the limitations: the HNSW index, a comparison of providers, metadata filtering (the related drop in performance when filtering out too many nodes, and how partial indexing mitigates it), partitioning, reciprocal rank fusion, and the performance and limitations of the representations created by SOTA image and text embedding models, …

LLM RAG Vector DB
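The retrieval step described above reduces, at its simplest, to a nearest-neighbor search over normalized embeddings; a vector database adds an index such as HNSW so the search scales past brute force. A brute-force sketch, with random vectors standing in for embeddings from a model like DINOv2 or Ada-2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for embeddings produced by an image or text model.
database = rng.normal(size=(10_000, 128))
query = rng.normal(size=(128,))

# Normalize so that the dot product equals cosine similarity.
database /= np.linalg.norm(database, axis=1, keepdims=True)
query /= np.linalg.norm(query)

# Brute-force top-5 retrieval; an HNSW index approximates this
# search in sub-linear time for millions of vectors.
scores = database @ query
top5 = np.argsort(scores)[::-1][:5]
print(top5, scores[top5])
```

Metadata filtering, partitioning, and reciprocal rank fusion are all refinements layered on top of this basic similarity search.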

MAPIE (Model Agnostic Prediction Interval Estimator) is your go-to solution for managing uncertainties and risks in machine learning models. This Python library, nestled within scikit-learn-contrib, offers a way to calculate prediction intervals with controlled coverage rates for regression, classification, and even time series analysis. But it doesn't stop there - MAPIE can also be used to handle more complex tasks like multi-label classification and semantic segmentation in computer vision, ensuring probabilistic guarantees on crucial metrics like recall and precision. MAPIE can be integrated with any model - whether it's scikit-learn, TensorFlow, or PyTorch. Join us as we delve into the world of conformal predictions and how to quickly manage your uncertainties using MAPIE.

Link to Github: https://github.com/scikit-learn-contrib/MAPIE

AI/ML GitHub Python PyTorch Scikit-learn TensorFlow
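MAPIE's regression intervals build on conformal prediction. The basic split-conformal recipe it generalizes can be sketched in a few lines; this is a plain NumPy illustration of the idea, not MAPIE's actual implementation, and the toy model and data are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression data with noise: y = 2x + noise.
x = rng.uniform(0, 10, size=500)
y = 2.0 * x + rng.normal(scale=1.0, size=500)

# Split: fit a model on one half, calibrate on the other.
x_fit, y_fit = x[:250], y[:250]
x_cal, y_cal = x[250:], y[250:]
slope = np.sum(x_fit * y_fit) / np.sum(x_fit ** 2)  # least squares through origin

# Calibration residuals give the interval half-width: the
# (1 - alpha) quantile of absolute errors on held-out data.
alpha = 0.1
residuals = np.abs(y_cal - slope * x_cal)
q = np.quantile(residuals, 1 - alpha)

# Interval for a new point: prediction +/- q, targeting ~90% coverage.
x_new = 5.0
lower, upper = slope * x_new - q, slope * x_new + q
print(lower, upper)
```

MAPIE automates this machinery for arbitrary scikit-learn-compatible models and adds adaptive variants whose interval width varies per observation, which the fixed half-width above cannot do.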
Lunch Break 2024-09-26 · 10:10