Huiying Zhao, Yuedong Yang, Hai Lin, Xinjun Zhang, Matthew Mort, David N Cooper, Yunlong Liu, Yaoqi Zhou.
Abstract
Micro-indels (insertions or deletions shorter than 21 bp) constitute the second most frequent class of human gene mutation after single nucleotide variants. Despite the relative abundance of non-frameshifting indels, their damaging effect on protein structure and function has gone largely unstudied. We have developed a support vector machine-based method named DDIG-in (Detecting disease-causing genetic variations due to indels) to prioritize non-frameshifting indels by comparing disease-associated mutations with putatively neutral mutations from the 1,000 Genomes Project. The final model gives good discrimination for indels and is robust against annotation errors. A webserver implementing DDIG-in is available at http://sparks-lab.org/ddig.
Year: 2013 PMID: 23497682 PMCID: PMC4053752 DOI: 10.1186/gb-2013-14-3-r23
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
List of all features considered.
| Features | Description |
|---|---|
| Microdeletion/microinsertion positions (2) | Distances to nearest 5' and 3' splicing positions |
| DNA conservation scores (3) | Maximum, minimum, average |
| Evolution feature (30) | Maximum, minimum, average values (7 transition probabilities between match (M), microdeletion (D), and microinsertion (I) (MM, MI, MD, IM, II, DM, DD), 3 effective numbers of match/microinsertion/microdeletion) |
| Length (4) | Protein length, Microdeletion/microinsertion length, Distances to terminals |
| ΔS (1) | The indel-induced change to the HMM match score |
| Disorder score (3) | Maximum, minimum, average |
| Secondary structure (12) | Maximum, minimum, average probability (C, H, E), Predicted secondary structure (C, H, E) |
| Accessible surface area (3) | Maximum, minimum, average |
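Several feature groups in the list above are maximum, minimum, and average summaries of a per-residue score. The following is a minimal sketch of how such a summary could be computed over the affected residues plus a flanking window. The function name `window_summary`, the window convention (the same number of flanking positions on each side, clipped at the sequence ends), and the toy scores are assumptions for illustration, not the authors' implementation.

```python
from statistics import mean

def window_summary(scores, start, end, half_window=0):
    """Return (max, min, average) of per-position scores over an indel region.

    scores      -- per-residue values (e.g. predicted disorder), 0-based
    start, end  -- inclusive indices of the affected residues
    half_window -- flanking positions added on each side (assumed convention)
    """
    if not 0 <= start <= end < len(scores):
        raise ValueError("indel region lies outside the sequence")
    lo = max(0, start - half_window)                 # clip at the N-terminus
    hi = min(len(scores) - 1, end + half_window)     # clip at the C-terminus
    region = scores[lo:hi + 1]
    return max(region), min(region), mean(region)

# Toy disorder profile (hypothetical values).
disorder = [0.10, 0.20, 0.80, 0.90, 0.70, 0.30, 0.10]
summary = window_summary(disorder, start=2, end=3, half_window=1)
```

The same helper would apply unchanged to any other per-residue profile, such as accessible surface area or a conservation score.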
Top five performing features for microdeletion and microinsertion discrimination.
| Set | Features | MCC | AUC | Precision | Recall |
|---|---|---|---|---|---|
| Microdeletion | Disorder (Min, Ave, Max) | 0.558, 0.557, 0.551 | 0.824, 0.825, 0.818 | 74%, 74%, 73% | 85%, 85%, 84% |
| Microdeletion | ASA (Min, Ave, Max) | 0.542, 0.47, 0.302 | 0.81, 0.781, 0.659 | 73%, 71%, 68% | 88%, 81%, 57% |
| Microdeletion | DNA conservation (Max, Ave, Min) | 0.468, 0.367, 0.144 | 0.781, 0.742, 0.561 | 68%, 72%, 66% | 79%, 71%, 23% |
| Microdeletion | Neff (Min, Ave, Max) | 0.449, 0.439, 0.43 | 0.735, 0.749, 0.729 | 68%, 66%, 67% | 85%, 87%, 85% |
| Microdeletion | Probability of sheet (Max, Min, Ave) | 0.32, 0.305, 0.284 | 0.678, 0.658, 0.632 | 69%, 69%, 64% | 60%, 53%, 51% |
| Microinsertion | Disorder (Min, Max, Ave) | 0.556, 0.546, 0.545 | 0.813, 0.816, 0.80 | 78%, 80%, 79% | 75%, 74%, 75% |
| Microinsertion | ASA (Min, Ave, Max) | 0.501, 0.454, 0.317 | 0.80, 0.78, 0.670 | 71%, 78%, 71% | 85%, 65%, 52% |
| Microinsertion | Neff (Min, Ave, Max) | 0.467, 0.455, 0.438 | 0.751, 0.747, 0.742 | 68%, 68%, 67% | 86%, 85%, 84% |
| Microinsertion | DNA conservation (Max, Ave, Min) | 0.453, 0.422, 0.234 | 0.758, 0.752, 0.597 | 72%, 74%, 76% | 75%, 65%, 27% |
| Microinsertion | Transition probability of microinsertion to match (Min) | 0.372 | 0.708 | 72% | 62% |
Note: Max, min, and ave are arranged in the order of MCC values.
ASA, Solvent accessible surface area; AUC, Area under the curve; MCC, Matthews correlation coefficient; Neff, the number of effective homologous sequences aligned to residues, irrespective of residue type.
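For reference, the MCC, precision, and recall columns follow the standard confusion-matrix definitions, with MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). A minimal sketch is given below. The function name `binary_metrics` and the counts in the example are hypothetical and are not taken from the paper's data.

```python
from math import sqrt

def binary_metrics(tp, fp, tn, fn):
    """Return (mcc, precision, recall) from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is reported as 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return mcc, precision, recall

# Hypothetical counts: 100 disease-associated and 100 neutral indels.
mcc, precision, recall = binary_metrics(tp=85, fp=30, tn=70, fn=15)
```

Unlike precision and recall, MCC uses all four cells of the confusion matrix, so it does not reward a classifier for labeling everything as disease-causing.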
Figure 1. Distributions of the average DNA conservation score from phyloP (phylogenetic .
List of selected features for different training sets
| Deletions | Insertions | indels | Non-redundant indels |
|---|---|---|---|
| Disorder(min) | Disorder(min) | Disorder(min) | Disorder(min) |
| DNA conservation(max) | DNA conservation(max) | DNA conservation(max) | DNA conservation(max) |
| Deletion length | P(m-i)e (min) | ΔSd | ΔSd |
| ASAa (min) | ΔSd | Neffc(ave) | Neffc(min) |
| P(m-d)b(ave) | P(m-i)e (ave) | indel length | ASAa (ave) |
| Neffc(min) | Disorder(ave) | Distance to the nearest splicing site (upstream) | indel length |
| Distance to the nearest splicing site (downstream) | Helical probability(max) | ASAa (max) | ASAa (max) |
| ASAa(max) | P(m-m)f(ave) | Neffc(min) | P(m-m)f(max) |
| ΔSd | DNA conservation(ave) | | |
| ASA(ave) | | | |
aASA, solvent accessible surface area. bP(m-d), match-to-deletion transition probability. cNeff: the number of effective homologous sequences aligned to residues. dΔS, indel-induced change to alignment score. eP(m-i), match-to-insertion transition probability. fP(m-m), match-to-match transition probability.
Figure 2. ROC curves for the microdeletion (top) and microinsertion (bottom) sets: 10-fold cross-validation on the set itself (black), 10-fold cross-validation on insertions and deletions combined (red), independent test after training on the microinsertions (top) or microdeletions (bottom) (blue), the disorder feature only (orange), and the DNA conservation score only (purple).
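The area under a ROC curve such as those in Figure 2 equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison. The function name `roc_auc`, the label coding (1 for disease-associated, 0 for neutral), and the toy scores are assumptions, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count positive/negative pairs that are ranked correctly.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy predictions (hypothetical): three of the four pairs are ranked correctly.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation.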
Figure 3. The average predicted disease-causing probability as a function of the average allele frequency in the neutral indel dataset derived from the 1,000 Genomes Project data, obtained by dividing the allele frequencies into 20 bins. The dashed line is a linear regression fit; the correlation coefficient is -0.84.
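The binning behind Figure 3 can be sketched as follows. The caption does not say whether the bins are equal-width or equal-count, nor whether the correlation was computed on the binned averages, so the equal-width bins and the Pearson correlation on bin means used here are assumptions. The function names and the toy data are hypothetical.

```python
from statistics import mean

def binned_means(freqs, probs, n_bins=20):
    """Average allele frequency and predicted probability per equal-width bin."""
    lo, hi = min(freqs), max(freqs)
    width = (hi - lo) / n_bins or 1.0  # if all values are equal, everything lands in bin 0
    bins = [[] for _ in range(n_bins)]
    for f, p in zip(freqs, probs):
        idx = min(int((f - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        bins[idx].append((f, p))
    # Empty bins are skipped rather than reported as zero.
    return [(mean([f for f, _ in b]), mean([p for _, p in b])) for b in bins if b]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5

# Toy data (hypothetical): predicted probability falls linearly with frequency.
freqs = [i / 100 for i in range(1, 101)]
probs = [0.6 - 0.4 * f for f in freqs]
points = binned_means(freqs, probs)
r = pearson_r([f for f, _ in points], [p for _, p in points])
```

A negative coefficient, as reported in the caption, means that rarer neutral indels tend to receive higher predicted disease-causing probabilities.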
Figure 4. Ten-fold cross-validated Matthews correlation coefficient for the non-frameshifting (NFS) indel set as a function of the SVM gamma and cost parameters and the half window size when trained on all features. A logarithmic scale is used for the gamma and cost parameters, and log2(gamma) and log2(cost) are shifted to facilitate comparison.
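A parameter scan of the kind summarized in Figure 4 can be sketched as an exhaustive search over log2(gamma), log2(cost), and the half window size. The grid ranges, the function names, and the scoring function below are all hypothetical. In particular, `toy_score` is only a placeholder for a routine that would train the SVM and return the cross-validated MCC.

```python
from itertools import product
from math import log2

def grid_search(score_fn, log2_gammas, log2_costs, half_windows):
    """Return (best_score, params) over every combination of the three settings."""
    best = None
    for lg, lc, hw in product(log2_gammas, log2_costs, half_windows):
        params = {"gamma": 2.0 ** lg, "cost": 2.0 ** lc, "half_window": hw}
        score = score_fn(**params)
        if best is None or score > best[0]:
            best = (score, params)
    return best

def toy_score(gamma, cost, half_window):
    # Placeholder objective with a single peak at gamma=2^-3, cost=2^1, half_window=2.
    # A real scorer would run cross-validation and return the MCC instead.
    return -((log2(gamma) + 3) ** 2 + (log2(cost) - 1) ** 2 + (half_window - 2) ** 2)

best_score, best_params = grid_search(
    toy_score,
    log2_gammas=range(-7, 2, 2),   # assumed grid: 2^-7 ... 2^1
    log2_costs=range(-3, 6, 2),    # assumed grid: 2^-3 ... 2^5
    half_windows=range(0, 5),      # assumed grid: 0 ... 4 residues
)
```

Searching in powers of two keeps the grid evenly spaced on the logarithmic scale mentioned in the caption.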