Maad Shatnawi, Nazar Zaki, Paul D Yoo.
Abstract
BACKGROUND: Protein chains are generally long and consist of multiple domains. Domains are distinct structural units of a protein that can evolve and function independently. The accurate prediction of protein domain linkers and boundaries is often regarded as the initial step of protein tertiary structure and function predictions. Such information not only enhances protein-targeted drug development but also reduces the experimental cost of protein analysis by allowing researchers to work on a set of smaller and independent units. In this study, we propose a novel and accurate domain-linker prediction approach based on protein primary structure information only. We utilize a nature-inspired machine-learning model called Random Forest along with a novel domain-linker profile that contains physicochemical and domain-linker information of amino acid sequences.
Year: 2014 PMID: 25521329 PMCID: PMC4290662 DOI: 10.1186/1471-2105-15-S16-S8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Representation of a protein sequence by AA features and a sliding window. Each residue of each sequence in the dataset is replaced by its corresponding property values. These property values are then averaged over a window that slides along the length of each protein sequence.
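
The windowed averaging described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `windowed_profile`, the centered odd-sized window, and the truncation of the window at the sequence termini are all assumptions, since the caption does not say how edges are handled. The few scale values used are taken from the linker index table.

```python
from typing import Dict, List

# A few linker index values from the "Amino acid composition and linker index"
# table, used purely for illustration.
LINKER_INDEX: Dict[str, float] = {"P": -0.478, "S": -0.177, "G": 0.331, "W": 0.564}


def windowed_profile(seq: str, scale: Dict[str, float], window: int) -> List[float]:
    """Replace each residue by its scale value, then average over a centered window.

    Assumption: near the termini the window is truncated to the residues that
    exist, so every position still receives a value.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    values = [scale[aa] for aa in seq]
    half = window // 2
    profile = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        profile.append(sum(values[lo:hi]) / (hi - lo))
    return profile


if __name__ == "__main__":
    print(windowed_profile("PSGW", LINKER_INDEX, 3))
```

With a window of 1 the profile is simply the per-residue scale value; larger windows smooth the signal at the cost of positional resolution, which is the trade-off the window-size optimization in Figures 3 and 4 explores.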
Amino acid composition and linker index.
| Amino acid | Linker (%) | Domain (%) | Linker index |
|---|---|---|---|
| P | 7.95 | 4.93 | -0.478 |
| S | 8.32 | 6.97 | -0.177 |
| T | 6.68 | 5.67 | -0.163 |
| E | 7.53 | 6.62 | -0.128 |
| K | 6.30 | 5.64 | -0.112 |
| Q | 4.35 | 4.04 | -0.073 |
| A | 7.03 | 6.64 | -0.058 |
| V | 7.33 | 6.96 | -0.052 |
| R | 5.39 | 5.39 | 0.000 |
| D | 5.39 | 5.47 | 0.016 |
| N | 4.29 | 4.41 | 0.027 |
| I | 4.86 | 5.16 | 0.060 |
| L | 7.62 | 8.75 | 0.138 |
| H | 2.13 | 2.59 | 0.195 |
| F | 2.92 | 3.71 | 0.240 |
| M | 1.47 | 1.94 | 0.275 |
| Y | 2.49 | 3.44 | 0.322 |
| G | 5.46 | 7.60 | 0.331 |
| C | 1.62 | 2.53 | 0.447 |
| W | 0.89 | 1.56 | 0.564 |
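
The index column above appears to be the negative natural logarithm of the ratio of linker to domain frequency, which is how the DomCut linker index is usually defined. That formula is not stated in this section, so treat it as an assumption; the sketch below only checks that it reproduces a few table rows to within the rounding of the published percentages.

```python
import math


def linker_index(linker_pct: float, domain_pct: float) -> float:
    """Assumed definition: S = -ln(f_linker / f_domain).

    Negative values mark residues enriched in linkers, positive values
    residues enriched in domains.
    """
    return -math.log(linker_pct / domain_pct)


if __name__ == "__main__":
    # (amino acid, linker %, domain %, index reported in the table)
    rows = [("P", 7.95, 4.93, -0.478), ("R", 5.39, 5.39, 0.000), ("G", 5.46, 7.60, 0.331)]
    for aa, f_linker, f_domain, reported in rows:
        print(aa, round(linker_index(f_linker, f_domain), 3), reported)
```

Small differences in the third decimal place are expected for rare residues, because the percentages in the table are themselves rounded to two decimals.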
Hydrophobicity index (kcal/mol) of amino acids in a distribution from non-polar to polar at pH = 7.
| Amino acid | Hydrophobicity index | Amino acid | Hydrophobicity index |
|---|---|---|---|
| I | 4.92 | Y | -0.14 |
| L | 4.92 | T | -2.57 |
| V | 4.04 | S | -3.40 |
| P | 4.04 | H | -4.66 |
| F | 2.98 | Q | -5.54 |
| M | 2.35 | K | -5.55 |
| W | 2.33 | N | -6.64 |
| A | 1.81 | E | -6.81 |
| C | 1.28 | D | -8.72 |
| G | 0.94 | R | -14.92 |
Rose hydrophobicity scale.
| Amino acid | Hydrophobicity index | Amino acid | Hydrophobicity index |
|---|---|---|---|
| A | 0.74 | L | 0.85 |
| R | 0.64 | K | 0.52 |
| N | 0.63 | M | 0.85 |
| D | 0.62 | F | 0.88 |
| C | 0.91 | P | 0.64 |
| Q | 0.62 | S | 0.66 |
| E | 0.62 | T | 0.70 |
| G | 0.72 | W | 0.85 |
| H | 0.78 | Y | 0.76 |
| I | 0.88 | V | 0.86 |
The scale is correlated to the average area of buried AAs in globular proteins.
SARAH1 hydrophobicity scale.
| Amino acid | Five-bit code | Amino acid | Five-bit code |
|---|---|---|---|
| C | 1,1,0,0,0 | G | 0,0,0,-1,-1 |
| F | 1,0,1,0,0 | T | 0,0,-1,0,-1 |
| I | 1,0,0,1,0 | S | 0,0,-1,-1,0 |
| V | 1,0,0,0,1 | R | 0,-1,0,0,-1 |
| L | 0,1,1,0,0 | P | 0,-1,0,-1,0 |
| W | 0,1,0,1,0 | N | 0,-1,-1,0,0 |
| M | 0,1,0,0,1 | D | -1,0,0,0,-1 |
| H | 0,0,1,1,0 | Q | -1,0,0,-1,0 |
| Y | 0,0,1,0,1 | E | -1,0,-1,0,0 |
| A | 0,0,0,1,1 | K | -1,-1,0,0,0 |
Each AA is assigned a five-bit code, listed in descending order of the code's binary value; the right half of the table is the negative mirror image of the left half. The 10 most hydrophobic residues have positive codes, and the 10 least hydrophobic residues have negative codes.
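
As a minimal sketch of how the SARAH1 table can be used, the code below stores the codes exactly as tabulated, encodes a sequence as a flat feature vector, and checks the mirror property described above. The names `SARAH1`, `encode`, and `is_negative_mirror` are illustrative and do not come from the source.

```python
from typing import Dict, List, Tuple

# Codes copied from the SARAH1 table; left column first, then right column.
SARAH1: Dict[str, Tuple[int, ...]] = {
    "C": (1, 1, 0, 0, 0), "F": (1, 0, 1, 0, 0), "I": (1, 0, 0, 1, 0),
    "V": (1, 0, 0, 0, 1), "L": (0, 1, 1, 0, 0), "W": (0, 1, 0, 1, 0),
    "M": (0, 1, 0, 0, 1), "H": (0, 0, 1, 1, 0), "Y": (0, 0, 1, 0, 1),
    "A": (0, 0, 0, 1, 1),
    "G": (0, 0, 0, -1, -1), "T": (0, 0, -1, 0, -1), "S": (0, 0, -1, -1, 0),
    "R": (0, -1, 0, 0, -1), "P": (0, -1, 0, -1, 0), "N": (0, -1, -1, 0, 0),
    "D": (-1, 0, 0, 0, -1), "Q": (-1, 0, 0, -1, 0), "E": (-1, 0, -1, 0, 0),
    "K": (-1, -1, 0, 0, 0),
}

LEFT_COLUMN = "CFIVLWMHYA"   # most hydrophobic residues, top to bottom
RIGHT_COLUMN = "GTSRPNDQEK"  # least hydrophobic residues, top to bottom


def encode(seq: str) -> List[int]:
    """Concatenate the five-bit codes of all residues in the sequence."""
    return [bit for aa in seq for bit in SARAH1[aa]]


def is_negative_mirror() -> bool:
    """Check that row i of the left column negates row (9 - i) of the right column."""
    return all(
        SARAH1[a] == tuple(-b for b in SARAH1[m])
        for a, m in zip(LEFT_COLUMN, reversed(RIGHT_COLUMN))
    )


if __name__ == "__main__":
    print(encode("CK"))
    print(is_negative_mirror())
```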
Classification of amino acids according to their physicochemical properties.
| Property | Value | Amino acids |
|---|---|---|
| Charge | Positive | H, K, R |
| Charge | Negative | D, E |
| Charge | Neutral | A, C, F, G, I, L, M, N, P, Q, S, T, V, W, Y |
| Polarity | Polar | C, D, E, H, K, N, Q, R, S, T, Y |
| Polarity | Non-polar | A, F, G, I, L, M, P, V, W |
| Aromaticity | Aliphatic | I, L, V |
| Aromaticity | Aromatic | F, H, W, Y |
| Aromaticity | Neutral | A, C, D, E, G, K, M, N, P, Q, R, S, T |
| Size | Small | A, G, P, S |
| Size | Medium | D, N, T |
| Size | Large | C, E, F, H, I, K, L, M, Q, R, V, W, Y |
| Electronic | Strong donor | A, D, E, P |
| Electronic | Weak donor | I, L, V |
| Electronic | Neutral | C, G, H, S, W |
| Electronic | Weak acceptor | F, M, Q, T, Y |
| Electronic | Strong acceptor | K, N, R |
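
One way to turn this classification into per-residue categorical features is to invert each property into a residue-to-class lookup. The sketch below does this for charge and polarity only; the helper names `invert` and `categorical_features` are hypothetical, and how the authors actually encoded these categories is not stated here.

```python
from typing import Dict, List, Tuple

# Class memberships copied from the classification table (charge and polarity rows).
CHARGE = {"Positive": "HKR", "Negative": "DE", "Neutral": "ACFGILMNPQSTVWY"}
POLARITY = {"Polar": "CDEHKNQRSTY", "Non-polar": "AFGILMPVW"}


def invert(classes: Dict[str, str]) -> Dict[str, str]:
    """Build a residue -> class lookup, rejecting residues listed in two classes."""
    lookup: Dict[str, str] = {}
    for label, residues in classes.items():
        for aa in residues:
            if aa in lookup:
                raise ValueError(f"{aa} appears in more than one class")
            lookup[aa] = label
    return lookup


CHARGE_OF = invert(CHARGE)
POLARITY_OF = invert(POLARITY)


def categorical_features(seq: str) -> List[Tuple[str, str]]:
    """Return the (charge, polarity) class of every residue in the sequence."""
    return [(CHARGE_OF[aa], POLARITY_OF[aa]) for aa in seq]


if __name__ == "__main__":
    print(categorical_features("KAD"))
```

The same inversion applies to the aromaticity, size, and electronic rows; each property partitions all 20 amino acids, so every residue receives exactly one class per property.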
Figure 2. Random Forest algorithm.
Figure 3. Averaging window optimization. Recall, precision, and F1-score at different window sizes for fifty protein sequences from the DS-All dataset.
Figure 4. Averaging window optimization. Recall, precision, and F1-score at different window sizes for fifty protein sequences from the DomCut dataset.
Figure 5. Optimization of the number of generated trees. Recall, precision, and F-measure for different numbers of generated trees on the DS-All dataset.
Figure 6. Performance comparison. Recall, precision, and F-measure of six currently available domain boundary/linker predictors compared with our approach on the DS-All dataset.
Recall, precision, and F-measure on the Swiss-Prot/DomCut dataset.
| Approach | Recall | Precision | F1 |
|---|---|---|---|
| Our Approach | 0.71 | 0.98 | 0.82 |
| Shatnawi and Zaki | 0.56 | 0.84 | 0.67 |
| DomCut | 0.54 | 0.50 | 0.52 |
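
The F1 column is consistent with the standard harmonic mean of recall and precision. The short check below recomputes it from the recall and precision values in the table; the function name `f1_score` is illustrative.

```python
def f1_score(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision; defined as 0.0 when both are zero."""
    if recall + precision == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # (approach, recall, precision) as reported in the table above
    rows = [
        ("Our Approach", 0.71, 0.98),
        ("Shatnawi and Zaki", 0.56, 0.84),
        ("DomCut", 0.54, 0.50),
    ]
    for name, recall, precision in rows:
        print(f"{name}: F1 = {f1_score(recall, precision):.2f}")
```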
Figure 7. FADD human protein. The protein chain contains 208 residues and has two domains and a linker in the interval 83-96.
Prediction measures on the DS-All dataset after cumulatively removing the features with the lowest information gain.
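
For a per-residue classifier, an annotation like the one in Figure 7 has to be expanded into one label per position. The sketch below does this for the FADD example, assuming the interval 83-96 is 1-based and inclusive (the caption does not say so explicitly) and using hypothetical label characters `L` (linker) and `D` (domain).

```python
from typing import List


def residue_labels(length: int, linker_start: int, linker_end: int) -> List[str]:
    """Label each residue 'L' (linker) or 'D' (domain).

    Assumption: positions are 1-based and the linker interval is inclusive.
    """
    if not 1 <= linker_start <= linker_end <= length:
        raise ValueError("linker interval must lie within the sequence")
    return [
        "L" if linker_start <= pos <= linker_end else "D"
        for pos in range(1, length + 1)
    ]


if __name__ == "__main__":
    labels = residue_labels(208, 83, 96)
    print("".join(labels[78:100]))  # residues 79-100, spanning the linker
    print(labels.count("L"), "linker residues of", len(labels))
```

Under these assumptions only 14 of the 208 residues are linker residues, which illustrates the class imbalance a domain-linker predictor has to cope with.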
| Features Removed | Recall | Precision | F1 |
|---|---|---|---|
| None | 0.675 | 0.987 | 0.802 |
| Polarity | 0.673 | 0.984 | 0.799 |
| Charge and Polarity | 0.645 | 0.983 | 0.779 |
| Size and all the above | 0.602 | 0.980 | 0.746 |
| Electronic and all the above | 0.455 | 0.968 | 0.619 |
| Aromaticity and all the above | 0.325 | 0.916 | 0.480 |
| Hydrophobicity and all the above | 0.169 | 0.204 | 0.185 |