talk-data.com

Alexandre Abraham

Speaker · 2 talks

Talks & appearances

2 activities · Newest first

Move beyond academia: Introducing an industry-first tabular benchmark

Discover a new benchmark designed for real-world impact. Built on authentic private-company data and carefully chosen public datasets, it reflects real industry challenges such as product categorization, basket prediction, and personalized recommendation, offering a realistic testing ground for both classic baselines (e.g., gradient boosting) and the latest models such as CARTE, TabICL, and TabPFN. By bridging the gap between academic research and industrial needs, this benchmark brings model evaluation closer to the decisions and constraints faced in practice.

This shift has tangible consequences: models are tested on problems that matter to businesses, using metrics that reflect real-world priorities (e.g., Precision@K, Recall@K, MAP@K). Judging models on these tasks and metrics enables more relevant model selection, exposes where lab-only academic approaches fall short, and fosters solutions that are not just novel but deployable, accelerating the journey from innovation to deployment.
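For concreteness, here is a minimal sketch of how these ranking metrics are commonly computed; the function names and the binary-relevance formulation are illustrative assumptions, not the benchmark's actual evaluation code.

    import numpy as np

    def precision_at_k(relevant: set, ranked: list, k: int) -> float:
        """Fraction of the top-k ranked items that are relevant."""
        return sum(item in relevant for item in ranked[:k]) / k

    def recall_at_k(relevant: set, ranked: list, k: int) -> float:
        """Fraction of all relevant items retrieved within the top k."""
        return sum(item in relevant for item in ranked[:k]) / len(relevant)

    def average_precision_at_k(relevant: set, ranked: list, k: int) -> float:
        """Mean of precision@i over the ranks i where a relevant item appears."""
        hits, score = 0, 0.0
        for i, item in enumerate(ranked[:k], start=1):
            if item in relevant:
                hits += 1
                score += hits / i
        return score / min(len(relevant), k) if relevant else 0.0

    # MAP@K averages the per-query average precision over all users/queries.
    queries = [({"a", "c"}, ["a", "b", "c", "d"]), ({"b"}, ["a", "b", "c", "d"])]
    print(np.mean([average_precision_at_k(rel, rk, 3) for rel, rk in queries]))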

In their seminal paper "Why Propensity Scores Should Not Be Used for Matching," King and Nielsen (2019) highlighted the shortcomings of Propensity Score Matching (PSM). Despite these concerns, PSM remains a prevalent tool for mitigating selection bias in numerous retrospective medical studies each year and continues to be endorsed by health authorities. Guidelines for mitigating these issues have been proposed, but many researchers struggle both to adhere to them and to thoroughly document the entire process.
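For readers unfamiliar with the procedure, here is a minimal sketch of textbook PSM, using a logistic-regression propensity score and one-to-one nearest-neighbor matching; it illustrates the general technique being critiqued, not code from the talk.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import NearestNeighbors

    def psm_match(X: np.ndarray, treated: np.ndarray) -> np.ndarray:
        """Return, for each treated unit, the index of its matched control."""
        # Propensity score: P(treated | covariates) from a logistic regression.
        ps = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]
        t_idx, c_idx = np.where(treated == 1)[0], np.where(treated == 0)[0]
        # One-to-one nearest-neighbor matching on the score
        # (with replacement, no caliper -- the simplest variant).
        nn = NearestNeighbors(n_neighbors=1).fit(ps[c_idx].reshape(-1, 1))
        _, pairs = nn.kneighbors(ps[t_idx].reshape(-1, 1))
        return c_idx[pairs.ravel()]

After matching, covariate balance between the treated and matched-control samples is checked before any treatment effect is estimated, which is where the SMD condition discussed next comes in.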

In this presentation, I show the inherent variability in outcomes that results from the commonly accepted validation condition of a Standardized Mean Difference (SMD) below 10%. This variability can significantly affect treatment comparisons, potentially leading to misleading conclusions. To address this issue, I introduce A2A, a novel metric computed on a task specifically designed for the problem at hand. By integrating A2A with SMD, our approach reduces the variability of predicted Average Treatment Effects (ATE) by up to 90% across validated matching techniques.
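For reference, the SMD behind this validation condition is conventionally computed per covariate as the absolute mean difference divided by the pooled standard deviation; the sketch below uses that standard definition (the pooled-variance choice and the 0.10 threshold reflect common practice, not necessarily the talk's exact variant).

    import numpy as np

    def smd(x_treated: np.ndarray, x_control: np.ndarray) -> float:
        """Standardized mean difference for one covariate after matching."""
        pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2)
        return abs(x_treated.mean() - x_control.mean()) / pooled_sd

    # Common validation rule: accept the match when every covariate has
    # SMD below 0.10 (10%). Many distinct matchings can pass this check
    # while yielding very different ATE estimates -- the variability at issue.
    rng = np.random.default_rng(0)
    xt, xc = rng.normal(0.1, 1.0, 200), rng.normal(0.0, 1.0, 200)
    print(smd(xt, xc) < 0.10)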

These findings collectively enhance the reliability of PSM outcomes and lay the groundwork for a comprehensive automated bias-correction procedure. Additionally, to facilitate seamless adoption across programming languages, I have integrated these methods into "popmatch," a Python package that not only implements these techniques but also offers a convenient Python interface to R's MatchIt methods.