We introduce Explore23, a privacy-forward, secure, extensible, easy-to-use web application for browsing the multimodal data collected as part of the 23andMe, Inc. Research cohort, built heavily on the DuckDB ecosystem. The 23andMe Research program has collected a large number of data types from the >11M customers who have consented to participate, but there has not yet been a comprehensive tool for exploring and visualizing the cohort, a capability that is invaluable for genomics-driven target discovery and validation. Any such tool also needed to be extensible to future data types and applications, scale to large participant and variant cohorts, be comprehensible to non-experts and external parties, and, most importantly, protect research participant privacy.
Explore23 uses DuckDB and the DuckDB extension ecosystem extensively throughout the lifecycle of the data it showcases. A combination of pre-processing, backend result generation, and WASM-powered Mosaic integrations enables rapid search and visualization of the wide range of datasets collected. This integrates data from the various stages of the 23andMe research "pipeline": raw survey questions, curated condition-based cohorts, genetic variants, and GWAS results. Of particular interest are the variant browser, which enables rapid, in-browser visualization of the more than 170 million imputed and genotyped genetic variants in the 23andMe genetic panels, and the phenotypic pedigree summaries, which merge columnar datasets and graph queries (via DuckPGQ) to rapidly identify related participants in the 23andMe research cohort who share specific conditions. Each feature posed challenges, both internal and external: finding and contextualizing specific datasets for groups not already well acquainted with the data (even browsing surveys, for example), and managing data scale.
The front-end serves only data that has been pre-processed through rigorous masking logic to protect participant privacy. In sum, Explore23 is an invaluable tool for research scientists exploring the immense complexity and diversity of the whole 23andMe research cohort. It highlights the versatility of the DuckDB ecosystem in unifying data access from raw result processing up through in-browser visualizations.
Speaker
Teague Sterling
Teague is Director of Computational Biology Systems at 23andMe, making the breadth of the multimodal 23andMe research-consented cohort explorable and interpretable to researchers of different backgrounds. His team has used DuckDB and DuckDB-based tools extensively to process, summarize, and ultimately visualize the many billions of datapoints in the dataset.
Previously, he worked in genetics and informatics at BioMarin Pharmaceutical and was the lead architect and developer of the decidedly not-so-small ZINC chemical database at UCSF. He has also begun authoring multiple DuckDB community extensions, including duckdb_mcp (for AI integrations), webbed (for XML and HTML parsing), and duckdb_yaml (for YAML file compatibility).
Despite a career in massive biological datasets, he maintains that the right small dataset beats a big one every time.
Bio from: Small Data SF 2025