Tai-Sung Lee, Steven J Potts, Matthew J McGinniss, Charles M Strom.
Abstract
Computational prediction of the impact of a mutation on protein function is still not accurate enough for clinical diagnostics without additional human expert analysis. Sequence alignment-based methods have been used extensively, but their results depend strongly on the quality of the input alignments and the choice of sequences. Incorporating structural information with alignments improves prediction accuracy. Here, we present a method for mutation prediction based on the conservation of amino acid properties, Multiple Properties Tolerance Analysis (MuTA), and a new strategy, MuTA/S, that incorporates the solvent accessible surface (SAS) property into MuTA. Instead of combining multiple features by machine learning or mathematical methods, an intuitive strategy is used to divide the residues of a protein into different groups, and the properties used are adjusted for each group. The results for LacI, lysozyme, and HIV protease show that MuTA performs as well as the widely used SIFT algorithm, while MuTA/S outperforms SIFT and MuTA by 2%-25% in terms of prediction accuracy. By incorporating the SAS term alone, the alignment dependency of overall prediction accuracy is significantly reduced. MuTA/S also defines a new way to incorporate any structural features and knowledge and may lead to more accurate predictions.
Year: 2007 PMID: 19455225 PMCID: PMC2674666
Source DB: PubMed Journal: Evol Bioinform Online ISSN: 1176-9343 Impact factor: 1.625
The physicochemical properties used for the 20 natural amino acid residues (single letter codes are shown). The units are not shown since they are irrelevant in the calculations.
| Property | A | C | D | E | F | G | H | I | K | L |
|---|---|---|---|---|---|---|---|---|---|---|
| Kyte-Doolittle Hydrophobicity | 1.8 | 2.5 | −3.5 | −3.5 | 2.8 | −0.4 | −3.2 | 4.5 | −3.9 | 3.8 |
| Hopp-Woods Hydrophobicity | −0.5 | −1 | 3 | 3 | −2.5 | 0 | −0.5 | −1.8 | 3 | −1.8 |
| pKa value for free amino acid carboxylate | 2.3 | 1.8 | 2 | 2.2 | 1.8 | 2.4 | 1.8 | 2.4 | 2.2 | 2.4 |
| Number of sulfur atoms in amino acid | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Average accessible area in proteins | 31.5 | 13.9 | 60.9 | 72.3 | 28.7 | 25.2 | 46.7 | 23 | 110.3 | 29 |
| Volume | 88.6 | 108.5 | 111.1 | 138.4 | 189.9 | 60.1 | 153.2 | 166.7 | 168.6 | 166.7 |
| Side Chain Charge | 0 | 0 | −1 | −1 | 0 | 0 | 0 | 0 | 1 | 0 |
| Polarity | 0 | 1.48 | 49.7 | 49.9 | 0.35 | 0 | 51.6 | 0.13 | 49.5 | 0.13 |
| Aromatic residue | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Solvation Energy | 1.93 | −1.24 | −6.7 | −6.47 | −0.89 | 1.00 | −10.25 | 2.15 | −4.29 | 2.28 |

| Property | M | N | P | Q | R | S | T | V | W | Y |
|---|---|---|---|---|---|---|---|---|---|---|
| Kyte-Doolittle Hydrophobicity | 1.9 | −3.5 | −1.6 | −3.5 | −4.5 | −0.8 | −0.7 | 4.2 | −0.9 | −1.3 |
| Hopp-Woods Hydrophobicity | −1.3 | 0.2 | 0 | 0.2 | 3 | 0.3 | −0.4 | −1.5 | −3.4 | −2.3 |
| pKa value for free amino acid carboxylate | 2.3 | 2 | 2 | 2.2 | 1.8 | 2.1 | 2.6 | 2.3 | 2.4 | 2.2 |
| Number of sulfur atoms in amino acid | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Average accessible area in proteins | 30.5 | 62.2 | 53.7 | 74 | 93.8 | 44.2 | 46 | 23.5 | 41.7 | 59.1 |
| Volume | 162.9 | 114.1 | 112.7 | 143.8 | 173.4 | 89 | 116.1 | 140 | 227.8 | 193.6 |
| Side Chain Charge | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Polarity | 1.43 | 3.38 | 1.58 | 3.53 | 52 | 1.67 | 1.66 | 0.13 | 2.1 | 1.61 |
| Aromatic residue | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| Solvation Energy | −1.49 | −9.71 | 2.00 | −9.41 | −10.91 | −5.11 | −5.01 | 1.98 | −5.91 | −6.13 |
The principal components (PCn) used in the calculations. The entries (in percentage) are the contributions from different properties.
| Property | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Kyte-Doolittle Hydrophobicity | 20.85 | 0.18 | 0.68 | 1.32 | 0.02 | 15.36 | 3.50 | 1.23 | 43.16 | 13.68 |
| Hopp-Woods Hydrophobicity | 17.57 | 9.89 | 0.09 | 0.44 | 0.82 | 4.34 | 11.42 | 0.16 | 27.43 | 27.85 |
| pKa value for free amino acid carboxylate | 4.45 | 0.47 | 41.16 | 3.93 | 37.74 | 4.31 | 0.02 | 7.60 | 0.31 | 0.00 |
| Number of sulfur atoms in amino acid | 2.30 | 2.03 | 51.54 | 0.23 | 30.00 | 5.80 | 7.04 | 0.57 | 0.08 | 0.41 |
| Average accessible area in proteins | 19.87 | 1.86 | 2.07 | 4.91 | 1.56 | 0.16 | 13.63 | 20.32 | 0.01 | 35.60 |
| Volume | 0.00 | 39.56 | 0.39 | 0.39 | 16.61 | 11.19 | 3.96 | 13.47 | 0.97 | 13.45 |
| Side Chain Charge | 18.00 | 0.03 | 0.83 | 0.66 | 4.51 | 30.26 | 3.37 | 32.37 | 2.22 | 7.73 |
| Polarity | 15.04 | 4.33 | 0.82 | 4.78 | 0.83 | 27.04 | 26.75 | 0.33 | 19.64 | 0.44 |
| Aromatic residue | 1.44 | 36.68 | 0.09 | 8.09 | 1.97 | 0.21 | 29.86 | 15.58 | 5.98 | 0.10 |
| Solvation Energy | 0.48 | 4.95 | 2.31 | 75.25 | 5.94 | 1.33 | 0.45 | 8.36 | 0.19 | 0.74 |
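
Each PC column in the table above sums to 100, which is what one would expect if a property's contribution is its squared loading in a unit-length eigenvector. The following is a minimal sketch of that reading, not the authors' procedure. It assumes z-scored properties, uses NumPy, and takes only three of the ten properties from the property table to stay short. The function name `pc_contributions` is hypothetical.

```python
import numpy as np

# Three of the ten properties from the property table, in the residue order
# A C D E F G H I K L M N P Q R S T V W Y.
kyte_doolittle = [1.8, 2.5, -3.5, -3.5, 2.8, -0.4, -3.2, 4.5, -3.9, 3.8,
                  1.9, -3.5, -1.6, -3.5, -4.5, -0.8, -0.7, 4.2, -0.9, -1.3]
volume = [88.6, 108.5, 111.1, 138.4, 189.9, 60.1, 153.2, 166.7, 168.6, 166.7,
          162.9, 114.1, 112.7, 143.8, 173.4, 89.0, 116.1, 140.0, 227.8, 193.6]
side_chain_charge = [0, 0, -1, -1, 0, 0, 0, 0, 1, 0,
                     0, 0, 0, 0, 1, 0, 0, 0, 0, 0]


def pc_contributions(properties):
    """Percent contribution of each property (rows) to each PC (columns).

    Assumption: properties are z-scored before the decomposition; the
    source does not state how the properties were scaled.
    """
    x = np.asarray(properties, dtype=float).T      # shape: (residues, properties)
    z = (x - x.mean(axis=0)) / x.std(axis=0)       # z-score each property
    cov = np.cov(z, rowvar=False)                  # property-by-property matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    eigvecs = eigvecs[:, np.argsort(eigvals)[::-1]]  # PC1 = largest variance
    return 100.0 * eigvecs ** 2                    # squared loadings as percentages


contrib = pc_contributions([kyte_doolittle, volume, side_chain_charge])
print(np.round(contrib, 2))
```

Because the eigenvectors are orthonormal, every column of `contrib` sums to 100, matching the layout of the table.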
The correlation coefficients between the relative SAS and the standard deviation of ten physicochemical properties.
| Property | Correlation coefficient |
|---|---|
| Kyte-Doolittle Hydrophobicity | 0.29 |
| Hopp-Woods Hydrophobicity | 0.45 |
| pKa value for free amino acid carboxylate | 0.39 |
| Number of sulfur atoms in amino acid | −0.21 |
| Average accessible area in proteins | 0.54 |
| Volume | 0.28 |
| Side Chain Charge | 0.54 |
| Polarity | 0.42 |
| Aromatic residue | −0.12 |
| Solvation Energy | 0.52 |
The standard deviations of properties are calculated from the human curated alignment of LacI.
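As an illustration of how such a correlation could be computed, the sketch below takes the standard deviation of one property (Kyte-Doolittle hydrophobicity) in each alignment column and correlates it with the relative SAS of the corresponding residue. The alignment columns and SAS values are invented toy data, the helper names are hypothetical, and the use of the population standard deviation and the Pearson coefficient is an assumption; the source does not specify either.

```python
from math import sqrt
from statistics import pstdev

# Kyte-Doolittle hydrophobicity values from the property table.
KYTE_DOOLITTLE = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8,
    "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def column_property_std(column, scale=KYTE_DOOLITTLE):
    """Population standard deviation of a property in one alignment column."""
    values = [scale[aa] for aa in column if aa in scale]  # gaps are skipped
    return pstdev(values) if len(values) > 1 else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: four alignment columns and the relative SAS
# of the corresponding residues in the query structure.
columns = ["LLLLI", "KRKEQ", "AAAAA", "DENKS"]
relative_sas = [0.05, 0.80, 0.10, 0.65]

stds = [column_property_std(col) for col in columns]
r = pearson(relative_sas, stds)
print(f"r = {r:.2f}")
```

A positive coefficient under this reading means that more exposed positions tolerate a wider spread of the property, which is the direction reported for most properties in the table.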
Comparison of prediction accuracy for SIFT, MAPP, MuTA, and MuTA/S with different regions.
| Method | Lysozyme: Benign | Lysozyme: Deleterious | Lysozyme: Prediction Accuracy (%) | LacI: Benign | LacI: Deleterious | LacI: Prediction Accuracy (%) |
|---|---|---|---|---|---|---|
| SIFT | 864/1377 | 169/175 | 79.66 | 1764/2267 | 767/1166 | 71.80 |
| MAPP | 826/1377 | 170/175 | 78.57 | 1795/2267 | 834/1166 | 75.35 |
| MuTA | 790/1377 | 169/175 | 76.97 | 1819/2267 | 762/1166 | 72.79 |
| R1 | 957/1377 | 166/175 | 82.18 | 1901/2267 | 753/1166 | 74.22 |
| R1+R2 | 1079/1377 | 166/175 | 86.61 | 1964/2267 | 725/1166 | 74.41 |
| R1+R2+R3 | 1112/1377 | 166/175 | 87.81 | 2000/2267 | 707/1166 | 74.43 |
| R1+R2+R3+R4 | 1075/1377 | 174/175 | 88.75 | 1925/2267 | 788/1166 | 76.25 |
For the “Benign” entries, the first number is the number of benign predictions and the second number is the total number of experimentally confirmed benign mutations. “Deleterious” entries have a similar meaning but are for deleterious mutations.
The entries of “Prediction Accuracy” are the prediction percentages calculated by averaging the benign and deleterious percentages.
Rn means MuTA/S with Region n, defined in the System and Method section.
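The averaging described in these footnotes can be checked with a few lines of Python. This is an illustrative sketch only: the function name `prediction_accuracy` is ours, and the inputs are the SIFT counts from the table above.

```python
def prediction_accuracy(benign_correct, benign_total,
                        deleterious_correct, deleterious_total):
    """Average of the benign and deleterious percentages (hypothetical helper)."""
    benign_pct = 100.0 * benign_correct / benign_total
    deleterious_pct = 100.0 * deleterious_correct / deleterious_total
    return (benign_pct + deleterious_pct) / 2.0


# SIFT rows of the table: lysozyme and LacI.
sift_lysozyme = prediction_accuracy(864, 1377, 169, 175)
sift_laci = prediction_accuracy(1764, 2267, 767, 1166)
print(f"{sift_lysozyme:.2f} {sift_laci:.2f}")  # prints "79.66 71.80"
```

Because the two classes are weighted equally regardless of their size, a predictor that labels every mutation as deleterious scores about 50% on this measure, not the much higher or lower figure that raw accuracy would give on an unbalanced set.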
Prediction results from SIFT, MuTA, and MuTA/S using different alignments for LacI. Percentage entries are in the format of “overall percentage (benign percentage, deleterious percentage)”. The overall percentage is the average of the benign percentage and the deleterious percentage.
| LacI | n | I | SIFT | MuTA | MuTA/S |
|---|---|---|---|---|---|
| Manual | 9 | 1.01 | 71.80 (77.81,65.78) | 72.79 (80.24,65.35) | 76.25 (84.91,67.58) |
| SwissProt | 6 | 0.84 | 71.58 (79.00,64.15) | 72.22 (68.02,76.42) | 77.52 (79.31,75.73) |
| NR.7 | 7 | 0.01 | 49.47 (0.75,98.20) | 49.65 (1.28,98.03) | 70.68 (51.83,89.54) |
| NR.14 | 14 | 0.01 | 50.26 (0.53,100) | 49.61 (1.01,98.20) | 70.70 (51.52,89.88) |
| NR.29 | 29 | 0.54 | 72.93 (64.12,81.73) | 72.83 (69.08,76.59) | 76.94 (77.81,76.07) |
| Swiss+NR.29 | 34 | 0.71 | 72.2 (68.58,75.81) | 71.96 (68.37,75.56) | 77.86 (77.33,78.39) |
Manual is human-curated alignment.
SwissProt is the alignment from PsiBLAST search results from the SwissProt database.
NR is the alignment from PsiBLAST search results from NCBI’s non-redundant database. NR.7 stands for the first 7 sequences.
Swiss+NR is the alignment combined with the SwissProt and NR sequences. Redundant sequences are not removed.
n is the number of sequences in the alignment, including the query sequence.
I is the average entropy.
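The exact entropy definition is not given here. A common choice is the Shannon entropy of the residue distribution in each alignment column, averaged over columns, and the sketch below assumes that definition with natural logarithms and gaps ignored. The function names and the toy alignment are hypothetical.

```python
from collections import Counter
from math import log


def column_entropy(column):
    """Shannon entropy (natural log) of the residues in one alignment column."""
    residues = [aa for aa in column if aa != "-"]  # assumption: gaps are ignored
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((c / total) * log(c / total) for c in counts.values())


def average_entropy(alignment):
    """Mean column entropy of an alignment given as equal-length strings."""
    entropies = [column_entropy(col) for col in zip(*alignment)]
    return sum(entropies) / len(entropies)


# Hypothetical toy alignment: the first row is the query sequence.
toy_alignment = [
    "MKLV",
    "MKIV",
    "MRLV",
    "MK-A",
]
print(f"I = {average_entropy(toy_alignment):.2f}")
```

Under this definition, an alignment of near-identical sequences has an average entropy close to zero, which is consistent with the low I values of the NR alignments taken from the top PsiBLAST hits.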
Prediction results from SIFT, MuTA, and MuTA/S using different alignments for Lysozyme. Percentage entries are in the format of “overall percentage (benign percentage, deleterious percentage)”. The overall percentage is the average of the benign percentage and the deleterious percentage.
| Lysozyme | n | I | SIFT | MuTA | MuTA/S |
|---|---|---|---|---|---|
| Manual | 8 | 0.71 | 79.66 (62.75,96.57) | 76.97 (57.37,96.57) | 88.75 (78.07,99.43) |
| SwissProt | 3 | 0.69 | 75.61 (68.36,82.86) | 69.87 (44.88,94.86) | 85.19 (75.53,94.86) |
| NR.80 | 80 | 0.04 | 50.44 (0.87,100) | 51.65 (9.01,94.29) | 77.04 (57.52,96.57) |
| NR.165 | 165 | 0.05 | 50.29 (0.58,100) | 54.82 (13.07,96.57) | 78.31 (59.48,97.14) |
| NR.329 | 329 | 0.13 | 56.21 (12.42,100) | 61.92 (29.56,94.29) | 81.70 (67.97,95.43) |
| Swiss+NR.329 | 331 | 0.15 | 58.50 (16.99,100) | 63.87 (36.31,91.43) | 82.48 (71.24,93.71) |
Manual is human-curated alignment.
SwissProt is the alignment from PsiBLAST search results from the SwissProt database.
NR is the alignment from PsiBLAST search results from NCBI’s non-redundant database. NR.80 stands for the first 80 sequences.
Swiss+NR is the alignment combined with the SwissProt and NR sequences. Redundant sequences are not removed.
n is the number of sequences in the alignment, including the query sequence.
I is the average entropy.
Prediction results from SIFT, MuTA, and MuTA/S using different alignments for HIV-1 protease. Percentage entries are in the format of “overall percentage (benign percentage, deleterious percentage)”. The overall percentage is the average of the benign percentage and the deleterious percentage.
| HIV-PR | n | I | SIFT | MuTA | MuTA/S |
|---|---|---|---|---|---|
| Manual | 48 | 1.55 | 62.71 (84.55,40.88) | 68.24 (100,36.48) | 69.81 (100,39.62) |
| HV1 | 20 | 0.25 | 57.03 (15.32,98.74) | 59.06 (22.52,95.60) | 76.12 (79.28,72.96) |
| HV1HV2 | 30 | 0.54 | 69.88 (50.45,89.31) | 67.28 (54.05,80.50) | 76.89 (86.49,67.30) |
| SwissProt | 43 | 0.78 | 79.61 (71.17,88.05) | 75.45 (81.08,69.81) | 77.62 (95.5,59.75) |
| NR.50 | 50 | 0.04 | 54.05 (8.11,100) | 56.71 (15.32,98.11) | 77.69 (79.28,76.10) |
| NR.400 | 400 | 0.05 | 51.80 (3.60,100) | 59.92 (26.13,93.71) | 76.79 (77.48,76.10) |
| NR.100 | 100 | 0.04 | 51.80 (3.60,100) | 57.62 (17.12,98.11) | 78.32 (79.28,77.36) |
| Swiss+NR.400 | 442 | 0.19 | 70.32 (45.05,95.6) | 73.58 (60.36,86.79) | 80.27 (90.09,70.44) |
Manual is human-curated alignment.
HV1 is the alignment consisting of only HIV-type 1 sequences from the NR database.
HV1HV2 is the alignment consisting of HIV-type 1 and type 2 sequences from the NR database.
SwissProt is the alignment from PsiBLAST search results from the SwissProt database.
NR is the alignment from PsiBLAST search results from NCBI’s non-redundant database. NR.50 stands for the first 50 sequences.
Swiss+NR is the alignment combined with the SwissProt and NR sequences. Redundant sequences are not removed.
n is the number of sequences in the alignment, including the query sequence.
I is the average entropy.
Prediction results from SIFT, MuTA, and MuTA/S for CFTR and G6PD. Entries are in the format of “overall percentage (benign percentage, deleterious percentage)”. The overall percentage is the average of the benign percentage and the deleterious percentage.
| Protein | n | SIFT | MuTA | MuTA/S |
|---|---|---|---|---|
| CFTR | 61 | 65.47 (35.29,95.65) | 62.02 (58.82,65.22) | 78.01 (64.71,91.3) |
| G6PD | 34 | 53.67 (20.00,87.34) | 60.81 (52.00,69.62) | 67.45 (64.00,70.89) |
Alignments are human-curated.
The structure for CFTR is taken from PDB ID:2F9Q; the structure for G6PD is taken from PDB ID:1QKI.