Shijie Zhang, Yukun He, Huanhuan Liu, Haoyu Zhai, Dandan Huang, Xianfu Yi, Xiaobao Dong, Zhao Wang, Ke Zhao, Yao Zhou, Jianhua Wang, Hongcheng Yao, Hang Xu, Zhenglu Yang, Pak Chung Sham, Kexin Chen, Mulin Jun Li.
Abstract
Predicting functional or pathogenic regulatory variants in the human non-coding genome facilitates the interpretation of disease causation. While numerous prediction methods are available, their performance is inconsistent or restricted to specific tasks, which creates a need for a comprehensive integration of these methods. Here, we compile a whole-genome, base-wise aggregation, regBase, that incorporates the largest collection of prediction scores. Building on different assumptions of causality, we train three composite models to score functional, pathogenic and cancer driver non-coding regulatory variants, respectively. We demonstrate the superior and stable performance of our models using independent benchmarks and show that they successfully fine-map causal regulatory variants at specific loci or at base-wise resolution. We believe that the regBase database, together with the three composite models, will be useful in different areas of human genetic studies, such as annotation-based causal variant fine-mapping, pathogenic variant discovery and cancer driver mutation identification. regBase is freely available at https://github.com/mulinlab/regBase.
Year: 2019 PMID: 31511901 PMCID: PMC6868349 DOI: 10.1093/nar/gkz774
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Study workflow and correlation analysis of prediction scores among 23 regBase Common integrated tools. (A) A flowchart showing the workflow of our regBase study. (B) Pearson correlation of 23 regBase Common integrated functional scores on three known functional/pathogenic regulatory variant datasets. Positive correlations are displayed in blue and negative correlations in red. Color intensity and the size of the square are proportional to the correlation coefficients. Non-significant correlations (P-value > 0.05) are marked with a cross. (C) Hierarchical clustering of regBase Common integrated tools on three known functional/pathogenic regulatory variant datasets. HGMD, the Human Gene Mutation Database functional regulatory variants dataset; ClinVar, the ClinVar pathogenic and benign regulatory variants dataset; MPRA, the expression-modulating variants dataset identified by massively parallel reporter assay.
Figure 2. Receiver operating characteristic (ROC) curve and area under the receiver operating characteristic curve (AUC) for different prediction models using 10-fold cross-validation. (A) ROC and AUC of 23 integrated tools and 10-fold cross-validation result for regBase_REG_Common model. (B) ROC and AUC of 13 integrated tools and 10-fold cross-validation result for regBase_REG model. (C) ROC and AUC of 13 integrated tools and 10-fold cross-validation result for regBase_PAT model. (D) ROC and AUC of 13 integrated tools and 10-fold cross-validation result for regBase_CAN model.
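As a rough illustration of the AUC metric reported in Figure 2, the sketch below computes it in plain Python through the rank-sum (Mann-Whitney U) identity. It is not the authors' evaluation pipeline; the function name `auroc` and the toy labels and scores are hypothetical and chosen only for illustration.

```python
def auroc(labels, scores):
    """AUROC via the rank-sum (Mann-Whitney U) identity.

    labels: iterable of 0/1 (1 = functional/pathogenic variant, hypothetical coding)
    scores: iterable of prediction scores, higher = more likely positive
    Tied scores receive the average rank of their tie group.
    """
    pairs = sorted(zip(scores, labels))  # ascending by score
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # extend j to the end of the current group of tied scores
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")

    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2  # Mann-Whitney U for positives
    return u / (n_pos * n_neg)


# Toy data (made up): three positives and three negatives.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.1]
toy_auc = auroc(toy_labels, toy_scores)
print(round(toy_auc, 3))
```

The value equals the fraction of positive/negative pairs in which the positive variant receives the higher score, with ties counted as half. In a 10-fold cross-validation setting one would typically compute it on each held-out fold.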
Figure 3. Distribution of area-under-curve scores for eight independent benchmarks. (A) regBase_REG_Common model. (B) regBase_REG model. (C) regBase_PAT model. (D) regBase_CAN model. Brown_eQTL, 11 tissue/cell type-specific eQTLs fine-mapping data profiled by Brown and colleagues; GTEx_eQTL, 44 tissue-specific eQTLs within fine-mapped credible sets from GTEx V6; MPRA_eQTL, significant expression-modulating variants identified by MPRA in lymphoblastoid cell lines; GWAS_5E-8, GWAS disease-associated regulatory variants with P-value < 5E-8 from GWAS Catalog; GWAS_1E-5, GWAS disease-associated regulatory variants with P-value < 1E-5 from GWAS Catalog; Somatic_eQTL, recurrent somatic mutations within significant flanking intervals per somatic eGene; Rare_Patho_SNV, rare pathogenic regulatory variants for inherited diseases; ASD_denovo_SNV, de novo pathogenic regulatory mutations for autism spectrum disorder.
Figure 4. Evaluation results of individual prediction tools on six independent testing datasets. (A) Performance on Brown_eQTL dataset. (B) Performance on GTEx_eQTL dataset. (C) Performance on GWAS_5E-8 dataset. (D) Performance on Rare_Patho_SNV dataset. (E) Performance on Somatic_eQTL dataset. (F) Performance on ASD_denovo_SNV dataset. AUPR, area under the precision-recall curve; AUROC, area under the receiver operating characteristic curve; bubble size is proportional to the Pearson correlation coefficient between predicted and true labels for each evaluation.
Figure 5. Non-coding regulatory variant prioritization at the 5p15.33 TERT region. (A) GWAS significant SNPs and regional PHRED-scaled score distribution of our four composite models across the 5p15.33 TERT region. The LocusZoom plot is generated using the most significant SNP rs10069690 as lead and the EUR LD structure. (B) Comparison of regional PHRED scores among our composite models and all integrated methods for 22 fine-mapping SNPs at the 5p15.33 TERT gene. Tools that obtain more than 25% equal scores in the evaluation are excluded. (C) LocusZoom plots for regional PHRED-scaled scores of 22 fine-mapping SNPs. The top prioritized SNP rs2853669 in the regBase_REG_Common model and the top prioritized SNP rs13172201 in the regBase_CAN model are selected as leads.
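For readers unfamiliar with PHRED-scaled scores, the sketch below shows one common convention, -10 * log10(rank / N), where rank 1 is the highest raw score among N scored positions. This record does not state how regBase derives its PHRED-scaled values, so the convention, the function name `phred_scale` and the toy data are all assumptions made for illustration.

```python
import math


def phred_scale(raw_scores):
    """Convert raw scores to PHRED-like values: -10 * log10(rank / N).

    Assumed convention (not confirmed by the source): rank 1 is the highest
    raw score. Ties are broken by input order here, which is a simplification.
    """
    n = len(raw_scores)
    # indices of the scores ordered from highest to lowest raw value
    order = sorted(range(n), key=lambda i: raw_scores[i], reverse=True)
    phred = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        phred[idx] = -10.0 * math.log10(rank / n)
    return phred


# Toy data (made up): 1000 positions with raw scores 0..999.
toy_raw = list(range(1000))
toy_phred = phred_scale(toy_raw)
print(round(toy_phred[999], 2), round(toy_phred[990], 2), round(toy_phred[900], 2))
```

Under this convention a value of 10 marks the top 10% of positions, 20 the top 1% and 30 the top 0.1%, which makes scores from models with different raw scales comparable within a region.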
Figure 6. Causal regulatory allele discrimination at base-wise resolution. (A) The uniqueness of prediction scores of 13 regBase-incorporated tools in the 259 bp ALDOB enhancer. (B) Prediction scores overlaid with expression fold changes (gray bars) for an ALDOB enhancer as determined with a saturation mutagenesis assay. Pearson correlation values for this region are provided in parentheses for each method. (C) The proportion of discriminable scores among 13 regBase-incorporated tools for 55 453 simulated sites. (D) Degree of discrimination for pathogenic and non-pathogenic alleles of top prioritized variants among qualified prediction models.