talk-data.com talk-data.com

Simeon Carstens

Speaker

Simeon Carstens

2

talks

Filter by Event / Source

Talks & appearances

2 activities · Newest first

Search activities →
CodeCommons: Towards transparent, richer and sustainable datasets for code generation model training

Built on top of Software Heritage - the largest public archive of source code - the CodeCommons collaboration is building a large-scale, meta-data rich source code dataset designed to make training AI models on code more transparent, sustainable, and fair. Code will be enriched with contextual information such as issues, pull request discussions, licensing data, and provenance. In this presentation, we will present the goals and structure of both Software Heritage and CodeCommons projects, and discuss our particular contribution to CodeCommon's big data infrastructure.

Markov chain Monte Carlo (MCMC) methods, a class of iterative algorithms that allow sampling almost arbitrary probability distributions, have become increasingly popular and accessible to statisticians and scientists. But they run into difficulties when applied to multimodal probability distributions. These occur, for example, in Bayesian data analysis, when multiple regions in the parameter space explain the data equally well or when some parameters are redundant. Inaccurate sampling then results in incomplete and misleading parameter estimates. Markov chain Monte Carlo (MCMC) methods, a very popular class of iterative algorithms that allow sampling almost arbitrary probability distributions, run into difficulties when applied to multimodal probability distributions. These occur, for example, in Bayesian data analysis, when multiple regions in the parameter space explain the data equally well or when some parameters are redundant. In this talk, intended for data scientists and statisticians with basic knowledge of MCMC and probabilistic programming, I present Chainsail, an open-source web service written entirely in Python. It implements Replica Exchange, an advanced MCMC method designed specifically to improve sampling of multimodal distributions. Chainsail makes this algorithm easily accessible to users of probabilistic programming libraries by automatically tuning important parameters and exploiting easy on-demand provisioning of the (increased) computing resources necessary for running Replica Exchange.