talk-data.com

Simeon Carstens

Speaker

Event: PyData Paris 2025

Talks & appearances

CodeCommons: Towards transparent, richer and sustainable datasets for code generation model training

Built on top of Software Heritage, the largest public archive of source code, the CodeCommons collaboration is building a large-scale, metadata-rich source code dataset designed to make training AI models on code more transparent, sustainable, and fair. Code will be enriched with contextual information such as issues, pull request discussions, licensing data, and provenance. In this presentation, we will describe the goals and structure of both the Software Heritage and CodeCommons projects, and discuss our particular contribution to CodeCommons' big data infrastructure.