Richard J Dobson, Patricia B Munroe, Mark J Caulfield, Mansoor As Saqi.
Abstract
BACKGROUND: There has been an explosion in the number of single nucleotide polymorphisms (SNPs) within public databases. In this study we focused on non-synonymous protein-coding single nucleotide polymorphisms (nsSNPs), some associated with disease and others thought to be neutral. We describe the distribution of both types of nsSNPs using structural and sequence-based features, and assess the relative value of these attributes as predictors of function using machine learning methods. We also address the common problem of class imbalance in machine learning and show its effect on nsSNP function prediction. We show that nsSNP function prediction can be significantly improved by 100% undersampling of the majority class. The learnt rules were then applied to make predictions of function on all nsSNPs within Ensembl.
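The undersampling step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that "100% undersampling" means randomly discarding majority-class examples until both classes are the same size; the abstract does not spell this out, and the function name `undersample_majority` and the toy data are hypothetical.

```python
import random
from collections import Counter


def undersample_majority(rows, labels, seed=0):
    """Randomly drop rows so every class is as small as the smallest class.

    Assumption: this equal-size reading of "100% undersampling" is ours,
    not a definition taken from the abstract.
    """
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    # Size of the minority class sets the target for every class.
    target = min(len(members) for members in by_class.values())
    balanced = []
    for label, members in by_class.items():
        kept = rng.sample(members, target)
        balanced.extend((row, label) for row in kept)
    rng.shuffle(balanced)
    return balanced


# Toy data: 7 disease and 3 polymorphism examples (row ids only).
rows = list(range(10))
labels = ["disease"] * 7 + ["polymorphism"] * 3
balanced = undersample_majority(rows, labels)
counts = Counter(label for _, label in balanced)
print(counts)
```

Only the training data would normally be rebalanced this way; evaluating on data with the original class ratio keeps the reported performance honest.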
Year: 2006 PMID: 16630345 PMCID: PMC1489951 DOI: 10.1186/1471-2105-7-217
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
The number of disease and polymorphism nsSNPs within SWISSPROT feature table sites containing > 90% disease nsSNPs.
| Site | Disease | Polymorphisms | Percentage (odds ratio) of nsSNPs within these sites that are disease |
| ACT_SITE | 25 | 1 | 96.15 (10.12) |
| BINDING | 13 | 0 | 100 (-) |
| DNA_BIND | 352 | 20 | 94.62 (7.12) |
| METAL | 38 | 0 | 100 (-) |
| MOD_RES | 34 | 3 | 91.89 (4.59) |
| MUTAGEN | 111 | 10 | 91.74 (4.49) |
| NP_BIND | 108 | 8 | 93.1 (5.46) |
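The odds ratios in this table are not defined in the text shown here. As a hedged worked example, the sketch below assumes each ratio is the disease-to-polymorphism odds inside the site divided by the same odds over the whole training set (10,419 disease and 4217 polymorphism nsSNPs, from the training dataset summary). Under that assumption the tabulated values are reproduced; the function name `odds_ratio` is hypothetical.

```python
def odds_ratio(site_disease, site_poly, all_disease, all_poly):
    """Odds of disease within a site relative to the background odds.

    Returns None when the site has no polymorphisms, matching the "-"
    entries in the table (the ratio is undefined there).
    """
    if site_poly == 0 or all_poly == 0:
        return None
    return (site_disease / site_poly) / (all_disease / all_poly)


# Background counts taken from the training dataset summary.
ALL_DISEASE, ALL_POLY = 10419, 4217

# (disease, polymorphism) counts per SWISSPROT feature, from the table.
sites = {
    "ACT_SITE": (25, 1),
    "BINDING": (13, 0),
    "DNA_BIND": (352, 20),
    "METAL": (38, 0),
    "MOD_RES": (34, 3),
    "MUTAGEN": (111, 10),
    "NP_BIND": (108, 8),
}

for name, (disease, poly) in sites.items():
    percent = 100 * disease / (disease + poly)
    ratio = odds_ratio(disease, poly, ALL_DISEASE, ALL_POLY)
    shown = "-" if ratio is None else f"{ratio:.2f}"
    print(f"{name}: {percent:.2f} ({shown})")
```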
Distribution of positions containing disease and neutral nsSNPs within BIND and MMDBBIND. Some sites may contain multiple nsSNPs.
| | Interacting sites: % (num) [odds ratio] | Non-interacting sites: % (num) [odds ratio] |
| Disease (BIND) | 71.7%(736) [1.29] | 58.6%(431) [0.72] |
| Polymorphism (BIND) | 28.3%(290) | 41.4%(304) |
| Disease (MMDBBIND) | 86.0%(294) [1.29] | 82.0%(1818) [0.97] |
| Polymorphism (MMDBBIND) | 14.0%(48) | 18.0%(398) |
Top 10 attributes ranked by 1R, using 10-fold cross-validation and a bucket size of 14
| 1R Rank | Attribute |
| 72.82 | conservation score (PSIC) |
| 67.49 | norm relative accessibility |
| 63.46 | MMDBBIND |
| 62.64 | mass change |
| 62.56 | relative accessibility |
| 62.23 | exposure |
| 61.41 | PAM score |
| 60.67 | mutation residue |
| 60.34 | volume change |
| 59.19 | wild type residue |
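For readers unfamiliar with 1R, the idea behind this ranking can be sketched as follows: build a rule from a single attribute that predicts the most common class for each attribute value, then score the attribute by how often that rule is right. The sketch below is a simplification and not the procedure behind the table; it handles nominal attributes only, scores on the training data rather than with 10-fold cross-validation, and omits the bucket-based discretization of numeric attributes. The function name and toy data are hypothetical.

```python
from collections import Counter, defaultdict


def one_r_accuracy(values, labels):
    """Training accuracy of the rule 'predict the majority class per value'."""
    class_counts = defaultdict(Counter)
    for value, label in zip(values, labels):
        class_counts[value][label] += 1
    # For each attribute value, the rule gets the majority-class rows right.
    correct = sum(max(counts.values()) for counts in class_counts.values())
    return correct / len(labels)


# Toy example: wild-type residue as the single attribute.
wild_type = ["W", "W", "W", "A", "A", "A"]
labels = ["disease", "disease", "polymorphism",
          "polymorphism", "polymorphism", "disease"]
print(one_r_accuracy(wild_type, labels))
```

Ranking attributes then amounts to computing this score for each attribute and sorting in descending order.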
Top 10 information gain attributes
| Info gain (bits) | Attribute |
| 0.2 | conservation score (PSIC) |
| 0.1 | norm relative accessibility |
| 0.09 | wild type residue |
| 0.07 | relative accessibility |
| 0.06 | PAM score |
| 0.06 | mass change |
| 0.05 | mutation residue |
| 0.05 | exposure |
| 0.04 | volume change |
| 0.04 | hydrophobicity difference |
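Information gain in bits is the reduction in class entropy obtained by splitting on an attribute. The sketch below shows the standard calculation for a nominal attribute; continuous attributes such as the conservation score would first need to be discretized, which is not shown. The function names and toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def information_gain(values, labels):
    """Class entropy minus the weighted entropy after splitting on values."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(
        (len(group) / total) * entropy(group) for group in groups.values()
    )
    return entropy(labels) - conditional


# Toy example: an attribute that separates the two classes perfectly.
exposure = ["buried", "buried", "exposed", "exposed"]
labels = ["disease", "disease", "polymorphism", "polymorphism"]
print(information_gain(exposure, labels))
```

With two balanced classes the maximum possible gain is 1 bit, which puts the tabulated values of 0.2 bits and below in context.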
Figure 1. Matthews Correlation Coefficient (MCC) measure of predictive quality for five attribute subsets. Set 1: all variables (3,821 nsSNPs). Set 2: structurally dependent variables (3,821 nsSNPs). Set 3: all non-structurally dependent attributes (14,636 nsSNPs). Set 4: non-structurally dependent variables excluding the conservation score (14,636 nsSNPs). Set 5: the conservation score alone (14,636 nsSNPs).
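The MCC used in this figure can be computed from a binary confusion matrix with the standard formula. The sketch below is a generic implementation, not the authors' code; returning 0 when the denominator is zero is a common convention that is assumed here. The second example uses the class counts from the training dataset summary to show why MCC is informative on imbalanced data.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Convention (assumed): an undefined MCC is reported as 0.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# A perfect classifier on a balanced toy set.
print(mcc(tp=50, tn=50, fp=0, fn=0))

# A classifier that labels every nsSNP "disease" on the 10,419 / 4217 split:
# accuracy is about 71%, yet the MCC is 0 because nothing is discriminated.
accuracy = 10419 / (10419 + 4217)
print(round(accuracy, 3), mcc(tp=10419, tn=0, fp=4217, fn=0))
```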
Summary of Training Dataset
| | Disease | Polymorphism | Total |
| Number of nsSNPs | 10,419 | 4,217 | 14,636 |
| Number of nsSNPs within proteins with structural homologs | 3,212 | 609 | 3,821 |
| Number of proteins with nsSNPs | 893 | 1,256 | 2,149 |
| Number of proteins with nsSNPs having structural homologs | 299 | 295 | 594 |