Branislava Gemovic, Vladimir Perovic, Sanja Glisic, Nevena Veljkovic.
Abstract
There are more than 500 amino acid substitutions in each human genome, and bioinformatics tools are indispensable for determining their functional effects. We have developed a feature-based algorithm for the detection of mutations outside conserved functional domains (CFDs) and compared its classification efficacy with that of the most commonly used phylogeny-based tools, PolyPhen-2 and SIFT. The new algorithm is based on the informational spectrum method (ISM), a feature-based technique, and on statistical analysis. Our dataset contained neutral polymorphisms and mutations associated with myeloid malignancies from the epigenetic regulators ASXL1, DNMT3A, EZH2, and TET2. PolyPhen-2 and SIFT predicted the effects of amino acid substitutions outside CFDs with significantly lower accuracy than expected, and with especially low sensitivity. In contrast, only the ISM algorithm achieved statistically significant classification of these sequences. It outperformed PolyPhen-2 and SIFT by 15% and 13%, respectively. These results suggest that feature-based methods, like ISM, are more suitable for the classification of amino acid substitutions outside CFDs than phylogeny-based tools.
Year: 2013 PMID: 24348198 PMCID: PMC3855963 DOI: 10.1155/2013/948617
Source DB: PubMed Journal: ScientificWorldJournal ISSN: 1537-744X
Sequences, their UniProt IDs, CFDs, and the relevant literature.
| Protein | UniProt ID | CFD | Position | Reference |
|---|---|---|---|---|
| ASXL1 | Q8IXJ9 | HARE | 11–83 | |
| | | ASXH | 241–369 | |
| | | PHD | 1506–1541 | |
| EZH2 | Q15910 | SANT1 | 159–250 | |
| | | SANT2 | 433–481 | |
| | | SET | 617–738 | |
| DNMT3A | Q9Y6K1 | PWWP | 290–348 | |
| | | PHD | 536–589 | |
| | | MTase | 638–908 | |
| TET2 | Q6N021 | BOX1 | 1104–1478 | |
| | | BOX2 | 1845–2002 | |
Figure 1. Scheme for the ISM procedure.
Abbreviations and EIIP values for amino acids.
| Amino acid | Letter code | Numerical code EIIP (Ry) |
|---|---|---|
| Leucine | L | 0.0000 |
| Isoleucine | I | 0.0000 |
| Asparagine | N | 0.0036 |
| Glycine | G | 0.0050 |
| Valine | V | 0.0057 |
| Glutamic acid | E | 0.0058 |
| Proline | P | 0.0198 |
| Histidine | H | 0.0242 |
| Lysine | K | 0.0371 |
| Alanine | A | 0.0373 |
| Tyrosine | Y | 0.0516 |
| Tryptophan | W | 0.0548 |
| Glutamine | Q | 0.0761 |
| Methionine | M | 0.0823 |
| Serine | S | 0.0829 |
| Cysteine | C | 0.0829 |
| Threonine | T | 0.0941 |
| Phenylalanine | F | 0.0954 |
| Arginine | R | 0.0956 |
| Aspartic acid | D | 0.1263 |
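The EIIP values in the table above are the numerical encoding on which the ISM operates. The source does not spell out the individual steps, so the following is only a minimal sketch of the ISM as it is commonly described: each residue is replaced by its EIIP value, the resulting numerical signal is passed through a discrete Fourier transform, and the squared amplitudes at frequencies up to 0.5 form the informational spectrum. The function names (`encode`, `informational_spectrum`, `substitute`) and the toy peptide are hypothetical, and the sketch does not reproduce the authors' frequency selection or scoring.

```python
import numpy as np

# EIIP values (Ry) copied from the table above.
EIIP = {
    "L": 0.0000, "I": 0.0000, "N": 0.0036, "G": 0.0050, "V": 0.0057,
    "E": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0954, "R": 0.0956, "D": 0.1263,
}


def encode(sequence):
    """Map a protein sequence to its EIIP numerical signal."""
    return np.array([EIIP[aa] for aa in sequence.upper()], dtype=float)


def informational_spectrum(sequence):
    """Return (frequencies, energies) for DFT components 1..N/2.

    The zero-frequency (mean) component is skipped, so the
    frequencies lie in the interval (0, 0.5].
    """
    signal = encode(sequence)
    n = len(signal)
    energy = np.abs(np.fft.fft(signal)) ** 2
    half = n // 2
    freqs = np.arange(1, half + 1) / n
    return freqs, energy[1:half + 1]


def substitute(sequence, position, new_residue):
    """Return the sequence with a 1-based position replaced."""
    return sequence[:position - 1] + new_residue + sequence[position:]


# Toy peptide (hypothetical, not taken from the dataset).
wild_type = "MKTAYIAKQR"
variant = substitute(wild_type, 6, "D")  # I6D

freqs, wt_spectrum = informational_spectrum(wild_type)
_, var_spectrum = informational_spectrum(variant)
for f, a, b in zip(freqs, wt_spectrum, var_spectrum):
    print(f"{f:.2f}  {a:.5f}  {b:.5f}")
```

Comparing the two spectra frequency by frequency shows how a single substitution changes the signal; which frequencies are informative for a given protein is a separate selection step that this sketch leaves out.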
Number of SNPs and mutations (MUTs) in the dataset.
| Gene | SNPs (nCFDs) | SNPs (CFDs) | MUTs (nCFDs) | MUTs (CFDs) |
|---|---|---|---|---|
| ASXL1 | 59 | 4 | 12 | 1 |
| EZH2 | 4 | 2 | 6 | 13 |
| DNMT3A | 3 | 3 | 6 | 35 |
| TET2 | 42 | 3 | 27 | 94 |
| Total | 108 | 12 | 51 | 143 |
Figure 2. Performance of PolyPhen-2 and SIFT on the entire dataset (CFDs and nCFDs) and on the subset of variations outside CFDs (nCFDs).
Figure 3. Process for the selection of significant frequencies from the spectra of ASXL1 (a), EZH2 (b), DNMT3A (c), and TET2 (d).
Figure 4. Distribution of ISM scores.
Figure 5. ROC curves on the ISM, PolyPhen-2 and SIFT scores for nCFD variations.
Performance statistics of PolyPhen-2, SIFT, and ISM binary classification of AASs outside CFDs.
| Method | Accuracy | Precision | Sensitivity | Specificity | NPV | AUC |
|---|---|---|---|---|---|---|
| PolyPhen-2 | 0.52 | 0.31 | 0.39 | 0.58 | 0.67 | 0.49 |
| SIFT | 0.57 | 0.37 | 0.51 | 0.59 | 0.72 | 0.55 |
| ISM | 0.69 | 0.51 | 0.65 | 0.70 | 0.81 | 0.68 |
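The threshold-based columns of this table follow the standard confusion-matrix definitions, with mutations treated as the positive class and neutral polymorphisms as the negative class. The sketch below shows how they are computed. The source does not report confusion-matrix counts, so the counts used here are hypothetical: they were chosen only to sum to the 51 nCFD mutations and 108 nCFD SNPs in the dataset and to give rounded values close to the ISM row. The function name `binary_metrics` is likewise an assumption.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),      # positive predictive value
        "sensitivity": tp / (tp + fn),    # true positive rate
        "specificity": tn / (tn + fp),    # true negative rate
        "npv": tn / (tn + fn),            # negative predictive value
    }


# Hypothetical counts (NOT reported in the source):
# 51 mutations = 33 TP + 18 FN, 108 SNPs = 76 TN + 32 FP.
metrics = binary_metrics(tp=33, fp=32, tn=76, fn=18)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

AUC is not included because it depends on the full ranking of scores rather than on a single classification threshold.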
Figure 6. ROC curves for binary classification.