Zishuo Zeng, Yana Bromberg.
Abstract
Recent advances in high-throughput experimentation have put the exploration of genome sequences at the forefront of precision medicine. In an effort to interpret the sequencing data, numerous computational methods have been developed for evaluating the effects of genome variants. Interestingly, despite the fact that every person has as many synonymous (sSNV) as non-synonymous single nucleotide variants, our ability to predict their effects is limited. The paucity of experimentally tested sSNV effects appears to be the limiting factor in the development of such methods. Here, we summarize the details and evaluate the performance of nine existing computational methods capable of predicting sSNV effects. We used a set of observed and artificially generated variants to approximate large-scale performance expectations of these tools. We note that the distribution of these variants across amino acid and codon types suggests that purifying evolutionary selection keeps generated variants out of the observed set; i.e., we expect the generated set to be enriched for deleterious variants. Closer inspection of the relationship between the observed variant frequencies and the associated prediction scores identifies predictor-specific scoring thresholds of reliable effect predictions. Notably, across all predictors, the variants scoring above these thresholds were significantly more often generated than observed, which confirms our assumption that the generated set is enriched for deleterious variants. Finally, we find that while the methods differ in their ability to identify severe sSNV effects, no predictor appears capable of definitively recognizing subtle effects of such variants on a large scale.
Keywords: effect predictors; machine learning; synonymous variants; variant frequency; variant functional effect
Year: 2019 PMID: 31649718 PMCID: PMC6791167 DOI: 10.3389/fgene.2019.00914
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. Possible mechanisms of sSNV impact on biological function. Yellow triangles represent sSNV sites and the dashed lines indicate aberrant processes. sSNVs may affect (A) transcription factor binding, (B) splicing of pre-mRNA, (C) mRNA secondary structure and stability, (D) wobble-based tRNA binding, and (E) cotranslational folding (and thus the protein structure). Figure was created with BioRender.com.
Summary of sSNV-specific predictors.
| Ref/Tool name | Training data | Model | Features | Performance |
|---|---|---|---|---|
| ( | 33 deleterious from literature, 785 neutral from one 1000 Genomes Project individual | Random forest with 1,001 trees and default number of features | 26 in total | |
| ( | 75 DM from literature and OMIM and 402 | Random forest with 1,000 trees, each with | 20 in total | |
| ( | ∼655 DM from HGMD and ∼655 NM from 1000G | Random forest with 51 trees and 35 features at each node | 455 in total | |
| ( | 592 DM from HGMD and 10,925 putatively benign from 1000G | Support vector machine with radial basis function kernel | 54 in total (including all of the 26 features used in SilVA) | |
| ( | 300 DM from dbDSM and 300 NM from VariSNP | Random forest with 500 trees and 3 features at each split | 10 in total | |
DM, disease/deleterious mutations; NM, neutral mutations; HGMD, Human Gene Mutation Database; 1000G, 1000 Genomes Project; OMIM, Online Mendelian Inheritance in Man; AUC, area under the ROC curve (axes in Eqn. 1).
Summary of generalized SNV predictors.
| Ref/Tool name | Training data | Model | Features | Performance |
|---|---|---|---|---|
| ( | 13,141,299 SNVs, 627,071 insertions, and 926,968 deletions from simulated and observed variant sets | SVM with linear kernel | 63 in total | |
| ( | 13,302,220 observed variants; 13,302,220 simulated variants selected from CADD data | Neural network with three 1,000-node hidden layers | 63 features from CADD | |
| ( | 1,073 coding DM from HGMD and 1,073 coding NM from 1000G for 10-feature-group model; 3,000 coding DM from HGMD and 3,000 coding NM from 1000G for 4-feature-group model | Multiple kernel learning | 1,281 in total | |
| ( | 122,238 DM from ClinVar and HGMD; 6,807,269 NM from 1000G | Bayesian classifier | ∼7 (not explicitly stated) in total | |
DM, disease/deleterious mutations; NM, neutral mutations; HGMD, Human Gene Mutation Database; 1000G, 1000 Genomes Project; AUC, area under the receiver operating characteristic curve.
Figure 2. Ratios of observed and generated sSNVs vary across codons and amino acids. Ratios of observed to generated sSNVs (barplot, left axis) affecting specific (A) amino acids and (B) codons in the human transcriptome differ. Lines (right axis) in plots indicate the fractions of (A) amino acids and (B) codons (“*” denotes a stop codon). Trivially, 2-codon amino acids are generally enriched for observed sSNVs, while codons of higher-degeneracy amino acids are depleted. However, there is a significant difference between the most and least frequent 2-codon amino acid sSNVs. Codons with an NCG pattern (N = any nucleotide) are most often affected by sSNVs. On the other hand, codons with a CGN pattern (also CpG) are relatively rarely affected. Note that amino acid degeneracy is correlated with % composition, although a single codon is often responsible for coding most of each of these amino acids (e.g., Leucine CTG and Valine GTG).
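To make the role of codon degeneracy concrete, the following minimal sketch enumerates the single-nucleotide substitutions of a codon that leave the encoded amino acid unchanged under the standard genetic code. It is an illustration only and not the authors' variant-generation pipeline; the helper name `synonymous_neighbors` is hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order ('*' marks stop codons).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_neighbors(codon):
    """Return codons one substitution away that encode the same residue."""
    target = CODON_TABLE[codon]
    neighbors = []
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue  # not a substitution
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == target:
                neighbors.append(mutant)
    return sorted(neighbors)


if __name__ == "__main__":
    for codon in ("CTG", "GTG", "ATG"):
        print(codon, CODON_TABLE[codon], synonymous_neighbors(codon))
```

Under this table, a codon of a 6-fold degenerate amino acid such as leucine has more possible synonymous substitutions than methionine's single codon, which has none. This is one reason raw counts of possible sSNVs differ across amino acids before any selection acts.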
Figure 3. Predictor scores correlate somewhat, but do not differentiate observed vs. generated sSNVs. Panel (A) shows the amount of agreement (i.e., FCBP) for any pair of predictors. High FCBP values indicate that two predictors agree in assigning binary (neutral/deleterious) predictions to variants. Panel (B) shows the Pearson correlations among the prediction scores. (C–I) Violin/box plots of prediction score distributions across predictors: CADD raw, CADD phred-scaled, DANN, FATHMM-MKL, SilVA, TraP, and DDIG-SN, respectively.
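The two pairwise comparisons described in panels (A) and (B) can be sketched as follows. This is a hedged illustration: the agreement measure below is simply the fraction of variants on which two predictors make the same binary call, in the spirit of FCBP as described in the caption, and the per-predictor cutoffs (`cutoff_a`, `cutoff_b`) are hypothetical inputs rather than values from the paper.

```python
import math


def fraction_agreeing(scores_a, scores_b, cutoff_a, cutoff_b):
    """Fraction of variants on which two predictors make the same binary call.

    A variant is called deleterious when its score exceeds the predictor's
    cutoff (assumed convention: higher score means more deleterious).
    """
    agree = sum(
        (a > cutoff_a) == (b > cutoff_b) for a, b in zip(scores_a, scores_b)
    )
    return agree / len(scores_a)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy scores for four variants from two hypothetical predictors.
    a = [0.1, 0.4, 0.6, 0.9]
    b = [1.0, 3.0, 2.0, 4.0]
    print(fraction_agreeing(a, b, cutoff_a=0.5, cutoff_b=2.5))
    print(pearson(a, b))
```

Binary agreement and score correlation answer different questions: two predictors can correlate well overall yet disagree near their cutoffs, which is why the figure reports both.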
AUCs of the predictors on sSNVs and nsSNVs.
| Predictor | AUC on sSNVs | Average of AUCs ± SD on sSNVs* | AUC on nsSNVs |
|---|---|---|---|
| CADD raw score | 0.518 | 0.517±0.0012 | 0.564 |
| CADD phred-scaled score | 0.518 | 0.518±0.0013 | 0.564 |
| DANN | 0.506 | 0.506±0.0023 | 0.491 |
| FATHMM-MKL | 0.540 | 0.540±0.0013 | 0.555 |
| SilVA | 0.527 | 0.527±0.0009 | |
| TraP | 0.495 | 0.496±0.0038 | |
| DDIG-SN | 0.535 | 0.535±0.0012 | |
*Test set was sampled 20 times (each with 100,000 observed and 100,000 generated variants) to produce averages and standard deviations (SD) of AUCs for sSNVs.
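The resampling procedure in the footnote can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: generated variants are treated as the positive class (following the expectation that the generated set is enriched for deleterious variants), AUC is computed with the rank-based (Mann-Whitney) formulation, and the sample standard deviation is used. The function names are hypothetical.

```python
import random
import statistics


def auc(pos_scores, neg_scores):
    """Rank-based AUC: chance that a positive outscores a negative (ties = 0.5)."""
    labeled = sorted([(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores])
    rank_sum = 0.0
    i = 0
    while i < len(labeled):
        j = i
        while j < len(labeled) and labeled[j][0] == labeled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1..j
        rank_sum += avg_rank * sum(label for _, label in labeled[i:j])
        i = j
    n_pos, n_neg = len(pos_scores), len(neg_scores)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def resampled_auc(generated, observed, n_rounds=20, sample_size=100_000, seed=0):
    """Mean and SD of AUC over repeated equal-sized random samples."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(n_rounds):
        gen = rng.sample(generated, min(sample_size, len(generated)))
        obs = rng.sample(observed, min(sample_size, len(observed)))
        aucs.append(auc(gen, obs))
    return statistics.mean(aucs), statistics.stdev(aucs)


if __name__ == "__main__":
    rng = random.Random(1)
    # Toy scores: the generated set is shifted slightly upward.
    generated = [rng.gauss(0.1, 1.0) for _ in range(5000)]
    observed = [rng.gauss(0.0, 1.0) for _ in range(5000)]
    print(resampled_auc(generated, observed, n_rounds=20, sample_size=1000))
```

An AUC near 0.5, as in most rows of the table, means the scores rank a random generated variant above a random observed variant only about half the time.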
Figure 4. Predictor scores correlate, but do not clearly differentiate observed vs. generated nsSNVs. Panel (A) shows the amount of agreement (i.e., FCBP) for any pair of predictors. High FCBP values indicate that two predictors agree in assigning binary (neutral/deleterious) predictions to variants. Panel (B) shows the Pearson correlations among the prediction scores. (C–F) Violin/box plots of prediction score distributions across predictors: CADD raw, CADD phred-scaled, DANN, and FATHMM-MKL, respectively.
Figure 5. Some predictors assign higher scores to rare variants. In all panels, the scatterplots display the density of observed variant prediction scores vs. log10(allele frequency). A scoring threshold (red dashed line) for each predictor marks scores above it as reliable. The threshold is placed at the score that is higher than 99% of common (allele frequency > 0.01) variant scores. Panels (A–G) show the scatterplots for CADD raw, CADD phred-scaled, DANN, FATHMM-MKL, SilVA, TraP, and DDIG-SN, respectively.
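The thresholding rule in the caption can be sketched as follows. This is an illustration under assumptions: the 99% cut is implemented as a nearest-rank percentile over the scores of common variants, which may differ in detail from the authors' implementation, and the function names are hypothetical.

```python
def reliability_threshold(scores, allele_freqs, common_af=0.01, percentile=99):
    """Score at the given percentile of common-variant scores (nearest rank).

    At most (100 - percentile)% of common variants score strictly above
    the returned value.
    """
    common = sorted(s for s, af in zip(scores, allele_freqs) if af > common_af)
    if not common:
        raise ValueError("no common variants to derive a threshold from")
    # Integer ceiling of percentile * n / 100, converted to a 0-based index.
    index = -(-percentile * len(common) // 100) - 1
    return common[index]


def percent_above(scores, threshold):
    """Percentage of scores strictly above the threshold."""
    return 100.0 * sum(s > threshold for s in scores) / len(scores)


if __name__ == "__main__":
    # Toy data: 100 common variants scoring 1..100, two high-scoring rare ones.
    scores = list(range(1, 101)) + [500, 600]
    freqs = [0.5] * 100 + [0.0001, 0.0001]
    cutoff = reliability_threshold(scores, freqs)
    print(cutoff, percent_above(scores, cutoff))
```

Because the cutoff is derived from common variants only, any excess of above-threshold scores among rare or generated variants reflects how the predictor treats those variants rather than how the threshold was set.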
Percentage of sSNVs scoring above threshold and the corresponding predictor resolutions.
| Predictor | % above-the-threshold sSNVs in observed set | % above-the-threshold sSNVs in generated set | Resolution |
|---|---|---|---|
| CADD raw score | 0.871 | 1.981 | 2.274 |
| CADD phred-scaled score | 0.868 | 1.979 | 2.280 |
| DANN | 1.594 | 2.156 | 1.352 |
| FATHMM-MKL | 1.639 | 2.522 | 1.538 |
| SilVA | 4.902 | 6.015 | 1.227 |
| TraP | 2.376 | 2.912 | 1.226 |
| DDIG-SN | 1.764 | 2.414 | 1.368 |
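As a consistency check on the table, the Resolution column matches the ratio of the second percentage column to the first to within rounding of the printed values. The sketch below recomputes it from the table. It assumes the first percentage column refers to observed variants and the second to generated variants, which is consistent with the abstract's statement that above-threshold variants were more often generated than observed.

```python
# (first % column, second % column, reported resolution), copied from the table.
# Assumption: first column = observed set, second column = generated set.
TABLE = {
    "CADD raw score": (0.871, 1.981, 2.274),
    "CADD phred-scaled score": (0.868, 1.979, 2.280),
    "DANN": (1.594, 2.156, 1.352),
    "FATHMM-MKL": (1.639, 2.522, 1.538),
    "SilVA": (4.902, 6.015, 1.227),
    "TraP": (2.376, 2.912, 1.226),
    "DDIG-SN": (1.764, 2.414, 1.368),
}


def resolution(pct_observed, pct_generated):
    """Ratio of above-threshold percentages, generated over observed."""
    return pct_generated / pct_observed


if __name__ == "__main__":
    for name, (obs, gen, reported) in TABLE.items():
        print(f"{name}: {resolution(obs, gen):.3f} (reported {reported})")
```

A resolution above 1 means the predictor places proportionally more generated than observed sSNVs above its reliability threshold. By this measure the two CADD scores separate the sets most strongly.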