Shuai Zeng, Jing Yang, Brian Hon-Yin Chung, Yu Lung Lau, Wanling Yang.
Abstract
BACKGROUND: Predicting the functional impact of amino acid substitutions (AAS) caused by nonsynonymous single nucleotide polymorphisms (nsSNPs) is becoming increasingly important as more novel variants are discovered. Each genome carries thousands of rare or private AAS, of which only a very small number are related to an underlying disease, so bioinformatics analysis is essential for prioritizing AAS that may cause or contribute to human disease for further study. Existing algorithms in this field still have a high false prediction rate, and new development is needed to take full advantage of the vast amount of genomic data available.
Year: 2014 PMID: 24916671 PMCID: PMC4061446 DOI: 10.1186/1471-2164-15-455
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Flow chart of EFIN.
The features used by EFIN
| Name | Description | Value and range |
|---|---|---|
| Reference amino acid (AAref) | The reference amino acid at the query position | Nominal (A,R,N…V)* |
| Mutant amino acid (AAmut) | The mutant amino acid at the query position | Nominal (A,R,N…V)* |
| Frequency of reference amino acid (Fref) | Frequency of the reference amino acid at the query position in each block | Interval [0,1]; 1 means perfect conservation of the reference amino acid |
| Frequency of mutant amino acid (Fmut) | Frequency of the mutant amino acid at the query position in each block | Interval [0,1]; 1 means that all sequences have the mutant amino acid at the position |
| Shannon entropy (H) | Shannon entropy at the query position in each block | Interval [0,4.322]; 0 means no diversity, and larger values mean more diversity at the position |
| NAS of the first sequence in each block (NASfirst) | Normalized alignment score (NAS) of the first sequence in each block | Interval [0,1]; 1 means a sequence identical to the query human protein |
| Number of sequences in each block (No_all) | Total number of sequences in each block | Interval [0,5000]; 5000 is the cutoff for each MSA |
| Number of sequences which cover the query position in each block (No_qp) | Number of sequences that cover the query position in each block | Interval [0,5000]; 5000 is the cutoff for each MSA |
| No_qp/No_all (RatioNN) | The ratio of No_qp to No_all | Interval [0,1] |
| Lowest conserved block | The lowest block for which all of its sequences, together with all sequences in the blocks above it, have the reference amino acid perfectly conserved | Ordinal (primate block, non-primate mammal block, non-mammal vertebrate block, invertebrate block, other species block) |
*The 20 amino acids in human proteins.
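The per-block features Fref, Fmut, and H in the table above can be illustrated with a minimal sketch. The function name `column_features`, the input layout (a list of residues observed at the query position within one block), and the decision to ignore gaps and non-standard symbols are assumptions made for illustration, not EFIN's actual implementation. The base-2 logarithm is inferred from the stated upper bound of 4.322, which is approximately log2(20).

```python
import math
from collections import Counter

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def column_features(column, aa_ref, aa_mut):
    """Compute Fref, Fmut and Shannon entropy for one block.

    `column` holds the residues observed at the query position in the
    sequences of a single block. Gaps and non-standard symbols are
    skipped (an assumption; the source does not say how they are handled).
    """
    residues = [r for r in column if r in AMINO_ACIDS]
    if not residues:
        # No sequence covers the position: return zeros (assumed convention).
        return {"Fref": 0.0, "Fmut": 0.0, "H": 0.0}

    counts = Counter(residues)
    n = len(residues)
    # Shannon entropy in bits; ranges from 0 to log2(20) for 20 residue types.
    entropy = sum(-(c / n) * math.log2(c / n) for c in counts.values())
    return {
        "Fref": counts[aa_ref] / n,
        "Fmut": counts[aa_mut] / n,
        "H": entropy,
    }


features = column_features(["R", "R", "K", "R", "-"], aa_ref="R", aa_mut="K")
print(features)
```

A perfectly conserved column gives H = 0, and a column in which all 20 amino acids are equally frequent gives H = log2(20), which matches the range listed in the table.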
Figure 2. Box plot of NAS differences between adjacent sequences belonging to either the same block or two adjacent blocks. NAS differences larger than 0.4 were all treated as 0.4.
Figure 3. Distribution of NAS of the first sequence in each block. Shown are the general distribution of NAS (boxplot) and those from two proteins, IL10RA (dashed line) and GNAS (intermittent dashed line). The general distribution of NAS of the first sequence in each block was generated from 12,000 randomly selected human proteins in UniProt. IL10RA, which encodes a subunit of the interleukin-10 receptor, and GNAS (Guanine nucleotide-binding protein G(s) subunit alpha isoform short), a housekeeping signal transduction molecule, are presented here as examples of proteins with different evolutionary rates.
Figure 4. The ratio of neutral to damaging mutations in relation to the lowest conserved block. (The paralog block was not considered for this feature.)
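As a rough illustration of how the "lowest conserved block" feature could be derived, the sketch below walks the blocks from the closest to the most distant species and stops at the first block that contains a non-reference residue. The block labels follow the ordinal scale in the feature table; the function name, the dictionary input, the skipping of empty blocks, and the `None` return value when even the primate block is not conserved are assumptions for illustration only.

```python
# Blocks ordered from closest to most distant relative to human,
# following the ordinal scale in the feature table.
BLOCK_ORDER = [
    "primate",
    "non-primate mammal",
    "non-mammal vertebrate",
    "invertebrate",
    "other species",
]


def lowest_conserved_block(block_columns, aa_ref):
    """Return the most distant block such that it and every block
    above it carry only the reference amino acid at the query position.

    `block_columns` maps a block name to the residues observed at the
    query position in that block. Returns None if the first block
    already contains a different residue (hypothetical convention).
    """
    lowest = None
    for name in BLOCK_ORDER:
        residues = block_columns.get(name, [])
        if any(r != aa_ref for r in residues):
            break  # conservation is broken here; stop descending
        if residues:  # empty blocks are skipped (assumption)
            lowest = name
    return lowest


blocks = {
    "primate": ["R", "R", "R"],
    "non-primate mammal": ["R", "R"],
    "non-mammal vertebrate": ["R", "K"],
    "invertebrate": ["R"],
}
print(lowest_conserved_block(blocks, "R"))
```

In this example the invertebrate block is conserved on its own, but it does not count because the non-mammal vertebrate block above it already contains a substitution.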
Figure 5. Receiver operating characteristic (ROC) curves for predictions made by EFIN, SIFT, MutationTaster, phyloP, and GERP++ on the Swiss-Prot dataset. The ROC curve of Swiss-Prot-trained EFIN, labeled EFIN (Swiss-Prot) in the figure, is the average result of 10-fold cross-validation.
Comparison of algorithms tested on Swiss-Prot dataset*
| Tool | AUC | Accuracy | Precision | Sensitivity | Specificity |
|---|---|---|---|---|---|
| EFIN (Swiss-Prot)** | 90.7%(1.0%) | 83.7%(0.98%) | 86.7%(3.3%) | 86.4%(0.9%) | 79.5%(3.0%) |
| GERP++ | 76.10% | 52.80% | 45.27% | 96.78% | 24.36% |
| phyloP | 76.30% | 54.47% | 46.18% | 96.47% | 27.33% |
| MutationTaster | 85.40% | 79.47% | 69.07% | 86.42% | 74.98% |
| SIFT | 83.60% | 76.58% | 67.29% | 78.52% | 75.32% |
*This test is based on the subset of Swiss-Prot mutations that can be processed by all five tools, comprising 18660 damaging variants and 28863 neutral variants.
**Swiss-Prot-trained EFIN was validated by 10-fold cross-validation, and all of its statistical measures are averages across folds. The standard deviation is given in parentheses after each measure.
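For readers who want to relate the columns of this table to one another, the sketch below computes accuracy, precision, sensitivity, and specificity from a confusion matrix, treating damaging variants as the positive class. That class convention is an assumption, although it is consistent with the single-run rows of the table. The counts are not reported in the source; they are back-calculated approximately from the GERP++ row and the variant totals in the footnote, so they should be read as illustrative only.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification measures from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),     # fraction of predicted damaging that are damaging
        "sensitivity": tp / (tp + fn),   # fraction of damaging variants recovered
        "specificity": tn / (tn + fp),   # fraction of neutral variants recovered
    }


# Variant totals from the table footnote.
n_damaging, n_neutral = 18660, 28863

# Approximate counts reconstructed from the GERP++ sensitivity (96.78%)
# and specificity (24.36%); these are not values given in the source.
tp = round(0.9678 * n_damaging)
tn = round(0.2436 * n_neutral)
fn = n_damaging - tp
fp = n_neutral - tn

metrics = classification_metrics(tp, fp, tn, fn)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

With these reconstructed counts, the computed precision and accuracy agree with the GERP++ row to within rounding, which shows how a tool can combine very high sensitivity with low precision when its specificity is poor. The EFIN row is an average over cross-validation folds, so its measures need not satisfy these identities exactly.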
Comparison of EFIN with PolyPhen-2 on a subset of Swiss-Prot variants
| Tools (Training set) | AUC | Accuracy | Precision | Sensitivity | Specificity |
|---|---|---|---|---|---|
| Testing set: Swiss-Prot dataset with HumDiv mutations excluded (36998 neutral variants, 17643 damaging variants)* | | | | | |
| EFIN (HumDiv) | 86.96% | 80.71% | 85.96% | 84.96% | 72.16% |
| PolyPhen-2 (HumDiv) | 85.26% | 78.35% | 63.27% | 78.59% | 78.24% |
| Testing set: Swiss-Prot dataset with HumVar mutations excluded (15819 neutral variants, 2284 damaging variants)* | | | | | |
| EFIN (Swiss-Prot ∩ HumVar)** | 84.91% | 71.37% | 96.72% | 69.60% | 83.64% |
| PolyPhen-2 (HumVar) | 80.60% | 78.09% | 32.28% | 67.12% | 79.67% |
*This test is based on the subset of Swiss-Prot mutations that can be processed by both EFIN and PolyPhen-2.
**EFIN was trained on the intersection of the HumVar and Swiss-Prot datasets.