Suganthi Balasubramanian, Yu Xia, Elizaveta Freinkman, Mark Gerstein.
Abstract
We assessed the disease-causing potential of single nucleotide polymorphisms (SNPs) based on a simple set of sequence-based features. We focused on SNPs from the dbSNP database in G-protein-coupled receptors (GPCRs), a large class of important transmembrane (TM) proteins. Apart from the location of the SNP in the protein, we evaluated the predictive power of three major classes of features to differentiate between disease-causing mutations and neutral changes: (i) properties derived from amino-acid scales, such as volume and hydrophobicity; (ii) position-specific phylogenetic features reflecting evolutionary conservation, such as normalized site entropy, residue frequency and SIFT score; and (iii) substitution-matrix scores, such as those derived from the BLOSUM62, GRANTHAM and PHAT matrices. We validated our approach using a control dataset consisting of known disease-causing mutations and neutral variations. Logistic regression analyses indicated that position-specific phylogenetic features that describe the conservation of an amino acid at a specific site are the best discriminators of disease mutations versus neutral variations, and integration of all our features improves discrimination power. Overall, we identify 115 SNPs in GPCRs from dbSNP that are likely to be associated with disease and thus are good candidates for genotyping in association studies.
Year: 2005 PMID: 15784611 PMCID: PMC1069129 DOI: 10.1093/nar/gki311
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
This table summarizes the different sequence-based features that have been used for identifying amino acid substitutions that could be deleterious to the protein and the results obtained from these studies
| Sequence-based features | Comment | Reference |
|---|---|---|
| Properties based on amino acid scale | ||
| Mass, volume, surface area, side-chain properties (charge, polarity), partial specific volume, hydrophobicity, alpha helix propensity, relative occurrence, percent buried, pKa. | The physicochemical properties were used as features in a Bayesian framework to predict the pathogenicity of an amino acid variation. Change in hydrophobicity coupled with low positional entropy was shown to be a good predictor. | ( |
| Position-specific phylogenetic features | ||
| Positional entropy, modified Shannon entropy and normalized site entropy | Substitutions at evolutionarily conserved sites have been shown to be strongly correlated with disease-causing mutations. Conservation at a position in a protein sequence has been assessed using slightly modified versions of sequence entropy from MSAs. | ( |
| Change in residue frequency | Residue frequency at a given amino acid position was calculated for both variants from multiple-sequence alignments. Change in residue frequency in conjunction with hydrophobicity correlated with the observed phenotype. | ( |
| Conservation related to allele frequency | Absolutely conserved residues between at least three mammalian orthologs were identified and variations at these positions were shown to be underrepresented at high allele frequencies compared to variations at unconserved sites. | ( |
| Degree of conservation using tree method | The number of substitutions at a given position in a sequence was estimated based on known phylogenetic relationships between species. Disease-associated mutations were more prevalent at conserved sites. | ( |
| SIFT | Calculates a conservation index based on MSA. Normalized probabilities for all possible substitutions at a given amino acid position are obtained from the MSA and substitutions with probabilities below a certain cutoff are deemed intolerant to the protein. | ( |
| Substitution matrices | ||
| BLOSUM, PAM and GRANTHAM | It was shown that ∼40% of disease-causing changes had highly unfavorable BLOSUM62 scores. Similar general trends were seen for PAM matrix scores ( | ( |
| | A clear correlation between BLOSUM62 and allele frequency of nonsynonymous SNPs was not seen in a study of SNPs in membrane-transporter genes ( | |
| | BLOSUM62 scores were able to distinguish tolerant from intolerant substitutions in a variety of proteins with total prediction accuracies ranging from 47 to 70% ( | |
| | About 40% balanced classification error was reported by Saunders | |
| | Miller | |
Distribution of the various amino acid changes among the TM, extracellular and intracellular regions for the disease-causing, neutral variations and SNPs from dbSNP
| Domain | Disease | Neutral | dbSNP |
|---|---|---|---|
| Transmembrane | 164 (93) | 90 (86) | 112 (158) |
| Extracellular | 80 (111) | 96 (82) | 200 (159) |
| Intracellular | 40 (80) | 61 (79) | 152 (126) |
The numbers in parentheses are the expected numbers based on a Poisson distribution; the numbers to their left are the observed numbers of variations in the corresponding domain.
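The source does not state how the gap between observed and expected counts was tested. As a rough illustration only, assuming each observed count is compared against a Poisson distribution whose mean is the tabulated expected count, the tail probabilities can be sketched as follows. The function names are hypothetical and not taken from the paper.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam), computed in log space for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def upper_tail(observed: int, lam: float) -> float:
    """P(X >= observed): relevant when a domain looks enriched."""
    total = 0.0
    k = observed
    while True:
        term = poisson_pmf(k, lam)
        total += term
        # Past the mean the terms shrink; stop once they no longer matter.
        if k > lam and term < 1e-18 * max(total, 1e-300):
            break
        k += 1
    return min(total, 1.0)


def lower_tail(observed: int, lam: float) -> float:
    """P(X <= observed): relevant when a domain looks depleted."""
    return min(sum(poisson_pmf(k, lam) for k in range(observed + 1)), 1.0)


# Counts from the table: disease variations in TM regions (164 observed,
# 93 expected) and dbSNP variations in TM regions (112 observed, 158 expected).
p_disease_tm_enriched = upper_tail(164, 93.0)
p_dbsnp_tm_depleted = lower_tail(112, 158.0)
```

Under this assumption, a very small upper-tail value for the disease set would be consistent with an excess of disease-causing variations in TM regions, and a small lower-tail value for dbSNP with a deficit there.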
Figure 1. (a) Histogram of BLOSUM62 scores. (b) Histogram of GRANTHAM scores. Here, the black bars represent disease variations, white bars indicate neutral variations and the shaded bars are dbSNP variations.
Total error rate of misclassification of disease-causing and neutral variation when each feature was assessed by itself in the logistic regression analysis
| Feature | Error rate (%) |
|---|---|
| SIFT conservation score | 18.41 |
| Normalized site entropy | 18.60 |
| Change in residue frequency | 19.92 |
| BLOSUM62 score | 27.70 |
| GRANTHAM score | 31.31 |
| Change in volume | 34.91 |
| Change in hydrophobicity | 37.95 |
| Location of variation (i.e. TM or non-TM) | 39.47 |
| BLOSUM62 score (TM only) | 22.53 |
| PHAT (TM only) | 24.90 |
| Change in free energy of hydropathy (TM only) | 27.27 |
Here, phylogenetic features refer to SIFT score, normalized site entropy and change in residue frequency.
Figure 2. Histogram of change in residue frequency for the disease-causing, neutral and dbSNP variation datasets. The absolute value of change in residue frequency is shown. The black bars represent disease variations, white bars indicate neutral variations and the shaded bars are dbSNP variations.
Figure 3. Frequency distribution of normalized site entropy values for the disease-causing, neutral and dbSNP variation datasets. The black bars represent disease variations, white bars indicate neutral variations and the shaded bars are dbSNP variations.
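To make the two quantities plotted in Figures 2 and 3 concrete, here is a minimal sketch that computes them from a single column of a multiple-sequence alignment. The exact normalization used in the study is not given in this record; the sketch assumes Shannon entropy divided by the maximum possible entropy for 20 amino acids, and it ignores gaps and non-standard characters. The function names and the example column are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def _residues(column: str) -> list:
    """Keep only standard amino acids; gaps and unknowns are dropped."""
    kept = [r for r in column.upper() if r in AMINO_ACIDS]
    if not kept:
        raise ValueError("alignment column contains no standard residues")
    return kept


def normalized_site_entropy(column: str) -> float:
    """Shannon entropy of the column scaled to [0, 1] (assumed normalization)."""
    residues = _residues(column)
    n = len(residues)
    entropy = 0.0
    for count in Counter(residues).values():
        p = count / n
        entropy -= p * math.log(p)
    return entropy / math.log(len(AMINO_ACIDS))


def change_in_residue_frequency(column: str, wild_type: str, variant: str) -> float:
    """Absolute difference between wild-type and variant residue frequencies."""
    residues = _residues(column)
    n = len(residues)
    return abs(residues.count(wild_type) - residues.count(variant)) / n


# Hypothetical column: mostly alanine, two glycines, one gap.
column = "AAAAAAAAGG-"
site_entropy = normalized_site_entropy(column)
delta_freq = change_in_residue_frequency(column, "A", "G")
```

A fully conserved column gives an entropy of 0 and a column with all 20 residues equally represented gives 1, so low values mark the conserved sites where substitutions are expected to be more damaging.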
The results of logistic regression analysis of all variations using the features common to both TM and non-TM regions
| | All features (excluding position-specific phylogenetic features) | | SIFT only | | Position-specific phylogenetic features only | | All features | |
|---|---|---|---|---|---|---|---|---|
| | Disease | Neutral | Disease | Neutral | Disease | Neutral | Disease | Neutral |
| Correct classification | 221 | 167 | 257 | 173 | 247 | 217 | 249 | 219 |
| Wrong classification | 62 | 77 | 26 | 71 | 36 | 27 | 34 | 25 |
| Total number of errors | 139 (26.38%) | | 97 (18.41%) | | 63 (11.95%) | | 59 (11.20%) | |
Here, phylogenetic features refer to SIFT score, normalized site entropy and change in residue frequency.
SIFT only: the classification obtained by logistic regression analysis using only the SIFT score as the determining feature.
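The percentages in the "Total number of errors" row follow directly from the counts above them: wrong classifications in both classes divided by all classified variations. A small sketch of that arithmetic, using the counts from the table (the function name and the per-class accuracy fields are additions for illustration):

```python
def summarize(disease_correct: int, disease_wrong: int,
              neutral_correct: int, neutral_wrong: int) -> dict:
    """Total error rate and per-class accuracy from classification counts."""
    total = disease_correct + disease_wrong + neutral_correct + neutral_wrong
    errors = disease_wrong + neutral_wrong
    return {
        "errors": errors,
        "error_rate_pct": 100.0 * errors / total,
        # Share of each class that was classified correctly.
        "disease_correct_pct": 100.0 * disease_correct / (disease_correct + disease_wrong),
        "neutral_correct_pct": 100.0 * neutral_correct / (neutral_correct + neutral_wrong),
    }


# Counts from the table, one entry per feature set.
results = {
    "all features excluding phylogenetic": summarize(221, 62, 167, 77),
    "SIFT only": summarize(257, 26, 173, 71),
    "phylogenetic features only": summarize(247, 36, 217, 27),
    "all features": summarize(249, 34, 219, 25),
}

for name, stats in results.items():
    print(f"{name}: {stats['errors']} errors ({stats['error_rate_pct']:.2f}%)")
```

The per-class figures make one pattern in the table easier to see: SIFT alone misclassifies far more neutral variations (71) than disease variations (26), whereas the combined feature sets are more balanced.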
The results of logistic regression analysis of variations in TM regions
| | All features excluding position-specific phylogenetic features | | SIFT only | | Position-specific phylogenetic features only | | All features | |
|---|---|---|---|---|---|---|---|---|
| | Disease | Neutral | Disease | Neutral | Disease | Neutral | Disease | Neutral |
| Correct classification | 143 | 58 | 157 | 68 | 155 | 71 | 155 | 80 |
| Wrong classification | 21 | 31 | 7 | 21 | 9 | 18 | 9 | 9 |
| Total number of errors | 52 (20.55%) | | 28 (11.07%) | | 27 (10.67%) | | 18 (7.11%) | |
Here, phylogenetic features refer to SIFT score, normalized site entropy and change in residue frequency.
The results of logistic regression analysis of variations in non-TM regions
| | All features excluding position-specific phylogenetic features | | SIFT only | | Position-specific phylogenetic features only | | All features | |
|---|---|---|---|---|---|---|---|---|
| | Disease | Neutral | Disease | Neutral | Disease | Neutral | Disease | Neutral |
| Correct classification | 77 | 117 | 100 | 114 | 93 | 142 | 94 | 143 |
| Wrong classification | 42 | 38 | 19 | 41 | 26 | 13 | 25 | 12 |
| Total number of errors | 80 (29.20%) | | 60 (21.90%) | | 39 (14.23%) | | 37 (13.50%) | |
Here, phylogenetic features refer to SIFT score, normalized site entropy and change in residue frequency.
The error rate of misclassification of disease-causing and neutral variations using the SIFT score, normalized site entropy and change in residue frequency individually as predictors in the logistic regression analysis
| Dataset | SIFT score (%) | Normalized site entropy (%) | Change in residue frequency (%) | Combining all three features (%) |
|---|---|---|---|---|
| All variations | 18.41 | 18.60 | 19.92 | 11.95 |
| TM only | 11.07 | 19.37 | 19.37 | 10.67 |
| Non-TM only | 21.90 | 19.71 | 20.07 | 14.23 |
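The error rates above come from logistic regression models fitted on one or more features. The record does not describe the fitting procedure or software, so the following is only a minimal sketch of the general technique: plain gradient descent on the logistic loss, applied to a tiny, entirely hypothetical set of centered feature values (not data from the study). The feature names, learning rate and step count are assumptions chosen for illustration.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.5, steps=2000):
    """Fit weights and bias by batch gradient descent on the logistic loss."""
    n, n_features = len(xs), len(xs[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # derivative of the loss w.r.t. the linear score
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def error_rate_pct(xs, ys, weights, bias) -> float:
    """Percentage of variations on the wrong side of the 0.5 threshold."""
    wrong = 0
    for x, y in zip(xs, ys):
        p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
        wrong += int((p >= 0.5) != bool(y))
    return 100.0 * wrong / len(ys)


# HYPOTHETICAL toy data: (centered conservation score, centered substitution
# penalty); label 1 = disease-causing, 0 = neutral.
features = [(0.4, 0.3), (0.3, 0.1), (0.2, 0.4), (0.1, 0.2),
            (-0.4, -0.3), (-0.3, -0.1), (-0.2, -0.4), (-0.1, -0.2)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = fit_logistic(features, labels)
toy_error = error_rate_pct(features, labels, weights, bias)
```

Fitting the same kind of model once per feature and once on the combined features is what allows the single-feature and combined error rates in the table to be compared on equal footing.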