Event Date:
February 27th 9:30 AM - 10:30 AM
Biomedical & Health Informatics PhD trainee Xinyu Sun presenting.
Title:
Evaluation of Sequence-to-Function Models Across Diverse Ancestries in the MAGENTA Dataset
Abstract:
Deep learning Sequence-to-Function (S2F) models, such as Borzoi and AlphaGenome, have greatly advanced our ability to predict regulatory effects directly from DNA sequence. However, because these models are trained predominantly on European-centric datasets such as ENCODE and on reference genomes derived primarily from individuals of European ancestry, whether they perform equitably across diverse populations remains an open question. This presentation evaluates the ancestry-specific performance of S2F models using whole-blood eQTL data from the MAGENTA cohort, which comprises African American (AA), Caribbean Hispanic (CH), and Non-Hispanic White (NHW) populations. We first assessed model performance on nominally significant eQTLs (per-gene FDR < 0.05 and P < 2.02 × 10^-5) and found minimal predictive power: correlations between predicted and observed effect sizes were negligible (Spearman r < 0.05), and direction concordance hovered near chance (~50–51%). In contrast, performance improved substantially when we focused on SuSiE fine-mapped variants with high posterior inclusion probability (PIP), which are likely to represent true causal variants. Using a distance-matched AUROC analysis with 100× random subsampling to control for confounding by distance to the transcription start site (TSS) and for class imbalance, we observed ancestry-specific differences at PIP ≥ 0.9: AA variants achieved the highest AUROC (0.83–0.84), followed by NHW (0.75) and CH (0.69–0.75). The AA vs. CH and AA vs. NHW differences were statistically significant (p < 0.001). Our findings demonstrate that S2F models perform markedly better on fine-mapped causal variants than on nominally significant eQTLs, and that model performance is influenced not only by reference genome composition but also by population-specific factors such as fine-mapping power and genetic architecture. This highlights the complexity of evaluating AI/ML genomics tools across diverse populations.
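For readers unfamiliar with the distance-matched AUROC analysis mentioned in the abstract, the general idea can be sketched as follows. This is only an illustration under stated assumptions, not the presenter's pipeline: the abstract does not specify the distance bins, the negative (background) variant set, the model score, or the matching ratio, so the function names, the 1:1 matching within TSS-distance bins, and the bin edges here are all hypothetical.

```python
import bisect
import random
import statistics
from collections import defaultdict


def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


def bin_by_distance(distances, bin_edges):
    """Group variant indices by the bin their absolute TSS distance falls into."""
    bins = defaultdict(list)
    for i, d in enumerate(distances):
        bins[bisect.bisect_right(bin_edges, abs(d))].append(i)
    return bins


def distance_matched_auroc(pos_scores, pos_dist, neg_scores, neg_dist,
                           bin_edges, n_iter=100, seed=0):
    """Mean and standard deviation of AUROC over repeated distance-matched subsamples.

    Assumption: in each iteration, every distance bin contributes the same
    number of positives and negatives, drawn at random without replacement,
    so each subsample is balanced and matched on TSS distance.
    """
    rng = random.Random(seed)
    pos_bins = bin_by_distance(pos_dist, bin_edges)
    neg_bins = bin_by_distance(neg_dist, bin_edges)
    aurocs = []
    for _ in range(n_iter):
        pos_sample, neg_sample = [], []
        for b, p_idx in pos_bins.items():
            n_idx = neg_bins.get(b, [])
            k = min(len(p_idx), len(n_idx))  # bins lacking either class add nothing
            pos_sample += [pos_scores[i] for i in rng.sample(p_idx, k)]
            neg_sample += [neg_scores[i] for i in rng.sample(n_idx, k)]
        if not pos_sample:
            raise ValueError("no distance bin contains both classes")
        aurocs.append(auroc(pos_sample, neg_sample))
    return statistics.mean(aurocs), statistics.pstdev(aurocs)


if __name__ == "__main__":
    # Synthetic toy data, for illustration only (not MAGENTA results).
    pos_scores = [0.9, 0.8, 0.7]            # e.g. |predicted effect| for PIP >= 0.9 variants
    pos_dist = [100, -5000, 60000]          # signed distance to TSS in bp
    neg_scores = [0.1, 0.2, 0.3, 0.4, 0.1, 0.2]
    neg_dist = [200, 300, 4000, -8000, 70000, 90000]
    edges = [1000, 10000, 100000]           # hypothetical bin edges in bp
    print(distance_matched_auroc(pos_scores, pos_dist, neg_scores, neg_dist, edges))
```

Repeating the subsampling (100 times by default here) gives a distribution of AUROC values per ancestry group, which is one plausible basis for comparing groups; the statistical test used for the reported p-values is not stated in the abstract.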
If you are unable to attend in person in Biomedical Research Building room 105, you may join via Zoom:
Meeting ID: 958 2937 2435
Passcode: 087450