talk-data.com

Speaker

Rania Talbi

Talks & appearances

CodeCommons: Towards transparent, richer and sustainable datasets for code generation model training

Built on top of Software Heritage, the largest public archive of source code, the CodeCommons collaboration is building a large-scale, metadata-rich source code dataset designed to make training AI models on code more transparent, sustainable, and fair. Code will be enriched with contextual information such as issues, pull request discussions, licensing data, and provenance. In this presentation, we present the goals and structure of both the Software Heritage and CodeCommons projects, and discuss our particular contribution to CodeCommons' big data infrastructure.