Li Liu, Maxwell D Sanderford, Ravi Patel, Pramod Chandrashekar, Greg Gibson, Sudhir Kumar.
Abstract
Computational prediction of the phenotypic propensities of noncoding single nucleotide variants typically combines annotation of genomic, functional and evolutionary attributes into a single score. Here, we evaluate whether the claimed excellent accuracies of these predictions translate into high rates of success in addressing questions important in biological research, such as fine mapping causal variants, distinguishing pathogenic allele(s) at a given position, and prioritizing variants for genetic risk assessment. We find a significant disconnect between the statistical modelling and the biological performance of predictive approaches. We discuss fundamental reasons underlying these deficiencies and suggest that future improvements of computational predictions need to address the confounding of allelic, positional and regional effects, as well as the imbalance in the proportion of true positive variants in candidate lists.
Year: 2019 PMID: 30659175 PMCID: PMC6338804 DOI: 10.1038/s41467-018-08270-y
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Properties of predictive models for six tools
| Method | Assumption of pathogenicity | Predictors | Modeling approaches | Performance (AUROC)^a |
|---|---|---|---|---|
| CADD | Evolutionary fitness | Evolutionary parameters, ENCODE summaries, functional annotations, population frequencies | Support vector machines | 0.92^b |
| CATO | Molecular functions | Cell type- and tissue-specific assays, evolutionary parameters, functional annotations | Logistic regression | NA^c |
| DeepSEA | Molecular functions | Local sequences, evolutionary parameters | Deep learning, logistic regression | 0.85 |
| EIGEN | None^d | Evolutionary parameters, ENCODE summaries, population frequencies | Unsupervised learning | 0.79 |
| GWAVA | DAVs vs. CPPs | Evolutionary parameters, ENCODE summaries, population frequencies | Random forests | 0.97 |
| LINSIGHT | Evolutionary fitness | Evolutionary parameters, ENCODE summaries, functional annotations | Generalized linear model | 0.96 |

AUROC = area under the receiver operator characteristic curve, DAV = disease-associated variant, CPP = common population polymorphism
^a Highest AUROC values in classifying DAVs and CPPs reported in the original publications
^b CADD reported AUROC values that mixed coding and noncoding variants
^c CATO predicts transcription factor occupancy instead of pathogenicity
^d EIGEN uses an unsupervised learning approach and thus makes no assumption of pathogenicity during training
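
For readers unfamiliar with the AUROC values reported in the table, the following minimal sketch shows one standard way to compute the statistic: the probability that a randomly chosen positive (here, a DAV) receives a higher score than a randomly chosen negative (a CPP), with ties counted as half. The function name and the example scores are hypothetical illustrations, not data or code from the study.

```python
def auroc(pos_scores, neg_scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical impact scores (assumed for illustration only).
dav_scores = [0.9, 0.8, 0.4]   # disease-associated variants
cpp_scores = [0.7, 0.3, 0.2]   # common population polymorphisms

print(auroc(dav_scores, cpp_scores))  # 8 of 9 pairs are ranked correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.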
Fig. 1. Performance of tested methods on detecting pathogenic variants in position-matched ncSNVs. a ROC curves with AUROC values displayed for each method. b Cumulative distribution of CADD scores. c Cumulative distribution of DeepSEA scores. Pathogenic ncSNVs were limited to variants not observed in human or any of the 57 non-human placental mammals
Fig. 2. Performance of tested methods on region-matched pathogenic and non-pathogenic ncSNVs. a Violin plots show distributions of the impact score difference between nearby pathogenic and non-pathogenic ncSNVs. Variants were grouped into bins based on distances measured in base pairs. Positive values imply that the pathogenic variant has the higher score. b Correlation of impact scores for ncSNVs located within given genomic distances. Pearson correlation coefficients are displayed. c Success rates of ranking a pathogenic ncSNV higher than a non-pathogenic ncSNV located within 1000 bp. Data are stratified by conservation of the position harboring the pathogenic ncSNV. Since pathogenic and non-pathogenic variants are evaluated in pairs, the random expectation of the success rate is 50%, as represented by the red line. d Success rates stratified by the genomic context of pathogenic ncSNVs, including promoters, 5' and 3' untranslated regions (UTRs), introns, near-gene (±100 kbp) regions and gene-desert regions. GWAVA represents GWAVA region-matched scores
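
The paired evaluation described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and example scores are hypothetical, and ties are counted as failures, a choice the caption does not specify.

```python
def score_differences(pairs):
    """Pathogenic minus non-pathogenic score for each nearby pair (panel a)."""
    return [patho - benign for patho, benign in pairs]


def paired_success_rate(pairs):
    """Fraction of pairs in which the pathogenic ncSNV scores higher (panel c).

    pairs: list of (pathogenic_score, non_pathogenic_score) tuples.
    Assumption: a tie counts as a failure.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    wins = sum(1 for patho, benign in pairs if patho > benign)
    return wins / len(pairs)


# Hypothetical impact scores for four region-matched pairs.
pairs = [(0.9, 0.2), (0.5, 0.6), (0.7, 0.7), (0.8, 0.1)]
print(paired_success_rate(pairs))  # 2 wins out of 4 pairs
```

Because each comparison is a coin flip under random scoring, a success rate near 0.5 means the tool does no better than chance at separating a pathogenic variant from its neighbor.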
Fig. 3. Performance of tested methods on prioritizing GWAS hits. a The fraction of pathogenic ncSNVs in the top 10 percentile of impact scores declines exponentially as the mixing ratio increases. b The AUPRC value decreases as the mixing ratio increases. The gray dotted line represents AUPRC = 0.5, corresponding to random predictions. The black dotted line represents AUPRC = 0.8, corresponding to a desired performance in practice. c Maximum mixing ratio at which each tool achieves AUPRC > 0.8 when pathogenic ncSNVs disrupt ultra-, well-, or least-conserved positions. d Maximum mixing ratio at which each tool achieves AUPRC > 0.8 when pathogenic ncSNVs are inside different types of genomic regions. Since the highest AUPRC of CATO in any category was lower than 0.8, we did not include it in panels c and d. In this task, we used GWAVA unmatched scores
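
To make the mixing-ratio experiment concrete, here is a minimal sketch that mixes a fixed set of positives with an increasing number of negatives and reports two of the quantities named in the caption. Everything in it is an assumption for illustration: scores are simulated from two normal distributions, average precision stands in as the AUPRC estimate, and the function names are hypothetical. It does not reproduce the study's data or pipeline.

```python
import random


def average_precision(scores, labels):
    """Average precision, a common step-wise estimate of AUPRC.

    labels: 1 for pathogenic, 0 for non-pathogenic. Assumes no tied scores.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    n_pos = sum(labels)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


def top_fraction_positive(scores, labels, fraction=0.10):
    """Share of pathogenic variants among the top-scoring `fraction` of variants."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return sum(label for _, label in ranked[:k]) / k


def simulate(ratio, n_pos=50, seed=0):
    """Mix n_pos simulated positives with n_pos * ratio simulated negatives."""
    rng = random.Random(seed)
    pos = [rng.gauss(1.5, 1.0) for _ in range(n_pos)]          # assumed distribution
    neg = [rng.gauss(0.0, 1.0) for _ in range(n_pos * ratio)]  # assumed distribution
    scores = pos + neg
    labels = [1] * len(pos) + [0] * len(neg)
    return average_precision(scores, labels), top_fraction_positive(scores, labels)


for ratio in (1, 10, 100):
    ap, top = simulate(ratio)
    print(f"non-pathogenic:pathogenic = {ratio}:1  AP = {ap:.3f}  top-10% positives = {top:.3f}")
```

With a fixed score distribution, adding more negatives leaves each tool's ranking ability unchanged but dilutes the positives near the top of the list, which is the effect the panels summarize.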
Performance of six tools on three biological tasks
Column groups: Task 1^a; Task 2 (T2): positional diagnosis^b; Task 3 (T3): diagnosis with noisy background^c.

| Method | Task 1^a | T2 Adj^d | T2 UC | T2 WC | T2 LC | T2 Pro | T2 5'U | T2 ups | T2 int | T3 UC | T3 WC | T3 LC | T3 Pro | T3 5'U | T3 ups | T3 int |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CADD | o | >10 bp | +++ | ++ | + | ++ | + | ++ | +++ | ++ | + | − | + | + | o | + |
| CATO | – | >1 kbp | + | + | + | + | + | + | + | − | − | − | − | − | − | − |
| DeepSEA | o | >10 bp | +++ | +++ | + | +++ | ++ | +++ | +++ | +++ | + | o | ++ | + | + | + |
| EIGEN | – | >100 bp | ++ | ++ | + | ++ | + | ++ | ++ | + | + | o | o | + | + | o |
| GWAVA | – | >50 bp | − | − | − | − | − | − | − | + | + | + | ++ | ++ | + | − |
| LINSIGHT | – | >10 bp | +++ | ++ | + | +++ | + | ++ | ++ | +++ | + | − | ++ | + | + | + |

Pathogenic variants are defined by location in ultra-conserved (UC), well-conserved (WC), or least-conserved (LC) intervals, or by location in the promoter (Pro), 5'UTR (5'U), upstream gene region (ups) or intron (int)
^a – indicates pathogenic scores are not specific to alleles at the same position, o indicates allele-specific scores but with low discriminative power
^b Success rates are indicated by: − (<50%), o (50–60%), + (60–80%), ++ (80–90%), +++ (>90%)
^c Non-pathogenic vs. pathogenic ratios are indicated by: − (<1:1), o (1–2:1), + (2–10:1), ++ (10–20:1), +++ (>20:1)
^d Adj refers to adjacency, corresponding to the minimum distance between pairs of pathogenic and non-pathogenic variants for which significantly different scores are produced in Task 2
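
As an illustration of how the Task 2 symbols encode success rates, the sketch below maps a percentage to its symbol. The legend's bins share endpoints (for example, 60% appears in both "o" and "+"), so the boundary handling here is an assumption: each shared endpoint is assigned to the higher bin, except 90%, which stays in "++" because "+++" is defined as strictly greater than 90%. The function name is hypothetical.

```python
def success_symbol(rate_percent):
    """Map a Task 2 success rate (in percent) to the symbol used in the table.

    Assumed boundary handling: 50 -> "o", 60 -> "+", 80 -> "++", 90 -> "++".
    """
    if rate_percent < 50:
        return "−"
    if rate_percent < 60:
        return "o"
    if rate_percent < 80:
        return "+"
    if rate_percent <= 90:
        return "++"
    return "+++"


for rate in (45, 55, 75, 85, 95):
    print(rate, success_symbol(rate))
```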
Fig. 4. Correlation of ncSNVs located within specific genomic distances. Given a pathogenic ncSNV, all variants located within 10 bp, 100 bp, 200 bp, 500 bp, 1000 bp, 10,000 bp, and 50,000 bp of its flanking region were retrieved. For each group, the average correlation coefficient was computed for each predictor value and colored according to the legend bar. A total of 63 predictors were organized into five categories. Evolutionary predictors include 10 scores computed by GERP, PhyloP, and PhastCons. Motif predictors include six similarity scores to known transcription factor binding sites. DNA structure predictors include four scores predicting nucleotide secondary structure. ENCODE predictors include 34 scores from UCSC regulatory tracks. Genomic composition predictors include nine scores of sequence complexity
Fig. 5. Performance of six methods for diagnosing eSNVs. To determine the cutoff scores beyond which GTEx variants are classified as non-neutral, we selected, for each tool, the score cutoff that maximized balanced accuracy (TPR + TNR) on the DAV-vs-CPP dataset (Supplementary Table 2). a Proportion of eSNVs predicted as functionally non-neutral by each method and by at least one (1+), two (2+), or three (3+) methods. b Violin plots showing distributions of positional conservation, as measured by evolutionary rate. True positives are eSNVs predicted as functionally non-neutral by at least one method. False negatives are eSNVs predicted as functionally neutral by all six methods. Student's t-tests were performed to compare evolutionary rates between true positive and false negative samples in each set. p-values are displayed
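
The cutoff selection and the "at least k methods" tally described in this caption can be sketched as below. This is an illustrative sketch, not the authors' code: the function names and scores are hypothetical, variants scoring at or above the cutoff are assumed to be called non-neutral, and ties in TPR + TNR are resolved in favor of the lowest cutoff.

```python
def best_cutoff(pos_scores, neg_scores):
    """Return (cutoff, TPR + TNR) for the cutoff that maximizes TPR + TNR.

    Assumption: a variant is called non-neutral when its score >= cutoff.
    Candidate cutoffs are the observed scores; the lowest best cutoff wins ties.
    """
    best = (None, -1.0)
    for cutoff in sorted(set(pos_scores) | set(neg_scores)):
        tpr = sum(s >= cutoff for s in pos_scores) / len(pos_scores)
        tnr = sum(s < cutoff for s in neg_scores) / len(neg_scores)
        if tpr + tnr > best[1]:
            best = (cutoff, tpr + tnr)
    return best


def fraction_called_by_at_least(calls_per_variant, k):
    """Share of variants called non-neutral by at least k methods.

    calls_per_variant: one list of booleans per eSNV, one boolean per method.
    """
    n_called = sum(sum(calls) >= k for calls in calls_per_variant)
    return n_called / len(calls_per_variant)


# Hypothetical DAV and CPP scores for one tool.
dav = [0.9, 0.8, 0.6, 0.3]
cpp = [0.7, 0.4, 0.2, 0.1]
print(best_cutoff(dav, cpp))

# Hypothetical per-method calls for four eSNVs and three methods.
calls = [
    [True, False, False],
    [True, True, False],
    [False, False, False],
    [True, True, True],
]
for k in (1, 2, 3):
    print(k, fraction_called_by_at_least(calls, k))
```

Maximizing TPR + TNR is equivalent to maximizing their mean, so the chosen cutoff is the same whether balanced accuracy is written as the sum or as the average of the two rates.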