talk-data.com

Small Data SF talk 2025-11-05 at 16:30

Explore23: Web application for exploration of a large genomic research cohort

Description

We introduce Explore23, a privacy-forward, secure, extensible, and easy-to-use web application for browsing the multimodal data collected as part of the 23andMe, Inc. research cohort. It is built heavily on the DuckDB ecosystem. The 23andMe Research program has collected a large number of data types from the more than 11 million customers who have consented to participate, but there has not yet been a comprehensive tool for exploring and visualizing the cohort, a capability that is invaluable for genomics-driven target discovery and validation. Any such tool also needed to be extensible to future data types and applications, scale to large participant and variant cohorts, be comprehensible to non-experts and external parties, and, most importantly, protect research participant privacy. Explore23 uses DuckDB and its extension ecosystem throughout the lifecycle of the data used in the showcase. A combination of pre-processing, backend result generation, and WASM-powered Mosaic integrations enables rapid search and visualization of the wide range of datasets collected. These datasets span the stages of the 23andMe research "pipeline": raw survey questions, curated condition-based cohorts, genetic variants, and GWAS results. Of particular interest are the variant browser, which enables rapid, in-browser visualization of the more than 170 million imputed and genotyped genetic variants in the 23andMe genetic panels, and the phenotypic pedigree summaries, which merge columnar datasets and graph queries (via DuckPGQ) to rapidly identify related participants in the 23andMe research cohort who share specific conditions. Each feature posed challenges, both internal and external, in finding and contextualizing specific datasets for groups not already well acquainted with the data (even browsing surveys, for example) and in managing data scale.
The front-end serves data that has been pre-processed through rigorous masking logic to protect participant privacy. In sum, Explore23 is an invaluable tool for research scientists exploring the immense complexity and diversity of the 23andMe research cohort data. It highlights the versatility of the DuckDB ecosystem in unifying data access from raw result processing through in-browser visualization.
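
The abstract does not say how the masking logic works. As a purely illustrative sketch, one common privacy technique for aggregate data is small-cell suppression, where any group count below a minimum size is hidden before it reaches the front-end. The function name `mask_counts`, the threshold of 5, and the sample counts below are all assumptions made for illustration, not details of Explore23.

```python
from typing import Dict, Optional

# Hypothetical threshold: the talk does not state the actual masking rules.
MIN_CELL_SIZE = 5


def mask_counts(
    counts: Dict[str, int], min_cell_size: int = MIN_CELL_SIZE
) -> Dict[str, Optional[int]]:
    """Return a copy of counts with every value below min_cell_size set to None."""
    return {
        label: (n if n >= min_cell_size else None)
        for label, n in counts.items()
    }


# Made-up example data, not real cohort figures.
raw = {"condition_a": 1200, "condition_b": 3, "condition_c": 47}
masked = mask_counts(raw)
print(masked)
```

Applying a step like this during pre-processing, rather than in the browser, would mean that unmasked small counts never leave the backend, which is consistent with the abstract's statement that the front-end only serves pre-masked data.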