Samuel D Chapman, Christoph Adami, Claus O Wilke, Dukka B Kc.
Abstract
Predicting protein structure from sequence remains a major open problem in protein biochemistry. One component of predicting complete structures is the prediction of inter-residue contact patterns (contact maps). Here, we discuss protein contact map prediction by machine learning. We describe a novel method for contact map prediction that uses the evolution of logic circuits. These logic circuits operate on feature data and output whether or not two amino acids in a protein are in contact. We show that such a method is feasible and that evolution allows the logic circuits to be trained on the dataset in an unbiased manner, so that the method can be used both for contact map prediction and for selecting relevant features in a dataset.
Keywords: Evolutionary computation; Feature selection; Machine learning; Markov networks; Protein contact map prediction
Year: 2017 PMID: 28439455 PMCID: PMC5398280 DOI: 10.7717/peerj.3139
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Gate logic table.
The characteristic logic table for a deterministic gate with two inputs and two outputs.
| Inputs | Outputs |
|---|---|
| 1 1 | 0 1 |
| 0 1 | 0 0 |
| 1 0 | 1 1 |
| 0 0 | 0 1 |
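
As a minimal sketch, a deterministic gate of this kind can be treated as a plain lookup table. The dictionary below copies the four rows of the table above; the names `GATE_TABLE` and `fire_gate` are illustrative and not taken from the original work.

```python
# Truth table copied from the gate logic table above:
# (input_1, input_2) -> (output_1, output_2)
GATE_TABLE = {
    (1, 1): (0, 1),
    (0, 1): (0, 0),
    (1, 0): (1, 1),
    (0, 0): (0, 1),
}


def fire_gate(inputs):
    """Return the output bits a deterministic gate produces for the given input bits."""
    return GATE_TABLE[tuple(inputs)]


# Print every row of the table by querying the gate.
for bits in [(1, 1), (0, 1), (1, 0), (0, 0)]:
    print(bits, "->", fire_gate(bits))
```

Because the gate is deterministic, each input pattern maps to exactly one output pattern, so a dictionary is enough to represent it.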
Dataset features.
A description of the features of the dataset used in this study.
| Feature (s) | Number | Binary | Description |
|---|---|---|---|
| Cosine similarity | 1 | No | Cosine similarity of amino acid profiles in positions |
| Correlation measure | 1 | No | Correlation measure of amino acid profiles in positions |
| Mutual information | 1 | No | Mutual information of amino acid profiles in positions |
| Amino acid types | 10 | Yes | Gives all types of amino acid in pair among nonpolar, polar, acidic, and basic. |
| Levitt’s contact potential | 1 | No | Amino acid pair energy measure. |
| Jernigan’s pairwise potential | 1 | No | Amino acid pair energy measure. |
| Braun’s pairwise potential | 1 | No | Amino acid pair energy measure. |
| MSA amino acid profiles | 483 | No | Profile of each of the 20 amino acids, plus gap, in the 18 sliding window positions and five central segment positions. |
| MSA entropy | 23 | No | Profile entropy of each of the 18 sliding window positions and five central segment positions. |
| Solvent accessibility | 46 | Yes | Solvent accessibility of the amino acid (buried or exposed) of each of the 18 sliding window positions and five central segment positions. |
| Secondary structure | 69 | Yes | Secondary structure of the amino acid (helix, sheet, or coil) of each of the 18 sliding window positions and five central segment positions. |
| Central segment amino acid compositions | 21 | No | Overall proportions of each of the 20 amino acids, plus gap, across all central segments. |
| Central segment secondary structure compositions | 3 | No | Overall proportion of the three secondary structures across the central segments. |
| Central segment solvent accessibility compositions | 2 | No | Overall proportion of the two solvent accessibilities across the central segments. |
| Amino acid sequence separation | 16 | Yes | Amino acid sequence separation using bins <6, 6, 7, 8, 9, 10, 11, 12, 13, 14, <19, <24, ≤29, ≤39, ≤49, and ≥50. |
| Protein secondary structure composition | 3 | No | Overall secondary structure composition of the protein of the contact pair. |
| Protein length | 4 | Yes | Length of the protein of the contact pair using bins ≤50, ≤100, ≤150, >150. |
| Protein solvent accessibility composition | 2 | No | Overall solvent accessibility composition of the protein of the contact pair. |
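
The per-row counts in the table can be cross-checked against the bit totals quoted in the figure captions (6,880 bits at 10 bits per feature, 1,376 bits at two bits per feature). The short tally below is only arithmetic on the numbers listed above; the variable names are illustrative.

```python
# Feature counts copied from the "Number" column of the dataset features table.
FEATURE_COUNTS = {
    "Cosine similarity": 1,
    "Correlation measure": 1,
    "Mutual information": 1,
    "Amino acid types": 10,
    "Levitt's contact potential": 1,
    "Jernigan's pairwise potential": 1,
    "Braun's pairwise potential": 1,
    "MSA amino acid profiles": 483,
    "MSA entropy": 23,
    "Solvent accessibility": 46,
    "Secondary structure": 69,
    "Central segment amino acid compositions": 21,
    "Central segment secondary structure compositions": 3,
    "Central segment solvent accessibility compositions": 2,
    "Amino acid sequence separation": 16,
    "Protein secondary structure composition": 3,
    "Protein length": 4,
    "Protein solvent accessibility composition": 2,
}

# 18 sliding window positions plus five central segment positions.
WINDOW_POSITIONS = 18 + 5

# Per-position features scale with the number of window positions.
assert FEATURE_COUNTS["MSA amino acid profiles"] == 21 * WINDOW_POSITIONS
assert FEATURE_COUNTS["MSA entropy"] == WINDOW_POSITIONS
assert FEATURE_COUNTS["Solvent accessibility"] == 2 * WINDOW_POSITIONS
assert FEATURE_COUNTS["Secondary structure"] == 3 * WINDOW_POSITIONS

TOTAL_FEATURES = sum(FEATURE_COUNTS.values())
print("total features:", TOTAL_FEATURES)
print("bits at 10 bits/feature:", TOTAL_FEATURES * 10)
print("bits at 2 bits/feature:", TOTAL_FEATURES * 2)
```

The total comes to 688 features, which matches both quoted bit totals (688 × 10 = 6,880 and 688 × 2 = 1,376).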
Parameters.
Parameters for the evolutionary algorithm.
| Parameter | Value |
|---|---|
| Updates | 100,000 |
| Population size | 500 |
| Starting gates | 100 |
| Inputs per gate | 4 |
| Outputs per gate | 4 |
| Gene duplication rate per update | 0.05 |
| Gene deletion rate per update | 0.05 |
| Site mutation rate per update | 0.001 |
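
To make the three rates concrete, the sketch below applies them to a genome. Everything beyond the parameter values is an assumption for illustration: the genome is taken to be a list of bytes, duplication and deletion act on one fixed-length segment per call, and the names `EvolutionParams`, `mutate`, and `segment_len` are hypothetical. The actual operators used in the study may differ.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class EvolutionParams:
    """Values copied from the parameter table; field names are illustrative."""
    updates: int = 100_000
    population_size: int = 500
    starting_gates: int = 100
    inputs_per_gate: int = 4
    outputs_per_gate: int = 4
    duplication_rate: float = 0.05
    deletion_rate: float = 0.05
    site_mutation_rate: float = 0.001


def mutate(genome, params, rng, segment_len=32):
    """Return a mutated copy of a byte genome (hypothetical representation)."""
    child = list(genome)
    # Site mutation: each byte is replaced by a random byte at the site rate.
    for i in range(len(child)):
        if rng.random() < params.site_mutation_rate:
            child[i] = rng.randrange(256)
    # Duplication: copy one random segment and insert it at a random position.
    if len(child) >= segment_len and rng.random() < params.duplication_rate:
        start = rng.randrange(len(child) - segment_len + 1)
        segment = child[start:start + segment_len]
        insert_at = rng.randrange(len(child) + 1)
        child[insert_at:insert_at] = segment
    # Deletion: remove one random segment, never emptying the genome.
    if len(child) >= 2 * segment_len and rng.random() < params.deletion_rate:
        start = rng.randrange(len(child) - segment_len + 1)
        del child[start:start + segment_len]
    return child


params = EvolutionParams()
rng = random.Random(0)
parent = [0] * 1000
child = mutate(parent, params, rng)
print(len(parent), "->", len(child))
```

With equal duplication and deletion rates, genome length has no systematic drift under this sketch; selection alone would decide whether larger or smaller networks persist.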
Figure 1. Treatment results.
Results for the split of 10 with 10 bits per feature (6,880 total bits). With 60 committee members, Fmax peaks at 75 k updates, at 0.098.
Figure 5. Treatment results.
Results for the split of four with two bits per feature (1,376 total bits). With 60 committee members, Fmax peaks at 75 k updates, at 0.102.
Figure 6. All treatment results.
Results at 75 k updates for all five split treatments. The highest Fmax is achieved by the split of 16, four bits per feature encoding, with an Fmax of 0.103.
Figure 7. Treatment results detail.
Specificity and sensitivity results at 75 k updates for the split of four with two bits per feature treatment. Specificity at 60 committee members was 0.14, and sensitivity was 0.35.
Markov network (MN) and SVMcon comparison.
Mean specificity, sensitivity, and Fmax of Markov networks (all guesses and top-L guesses) and SVMcon across different sequence separations.
| Method | Spec. (sep. ≥6) | Sen. (sep. ≥6) | Fmax (sep. ≥6) | Spec. (sep. ≥12) | Sen. (sep. ≥12) | Fmax (sep. ≥12) | Spec. (sep. ≥24) | Sen. (sep. ≥24) | Fmax (sep. ≥24) |
|---|---|---|---|---|---|---|---|---|---|
| MN, all guesses | 0.144 | 0.349 | 0.102 | 0.132 | 0.281 | 0.090 | 0.108 | 0.209 | 0.071 |
| MN, top-L | 0.268 | 0.136 | 0.090 | 0.218 | 0.123 | 0.079 | 0.155 | 0.119 | 0.067 |
| SVMcon | 0.37 | 0.21 | 0.13 | 0.30 | 0.20 | 0.12 | 0.21 | 0.19 | 0.01 |
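
The Fmax values in the table above appear to follow from the specificity and sensitivity columns as spec × sen / (spec + sen), that is, half their harmonic mean. This is an observation about the tabulated numbers, not a definition stated here, and the function name below is illustrative. The one entry the relation does not reproduce is SVMcon at sep. ≥24, listed as 0.01, where the relation gives about 0.10.

```python
def half_harmonic_mean(spec, sen):
    """spec * sen / (spec + sen): an assumed relation inferred from the table."""
    return spec * sen / (spec + sen)


# (method, minimum separation) -> (specificity, sensitivity, listed Fmax),
# copied from the comparison table above.
ROWS = {
    ("MN, all guesses", 6): (0.144, 0.349, 0.102),
    ("MN, all guesses", 12): (0.132, 0.281, 0.090),
    ("MN, all guesses", 24): (0.108, 0.209, 0.071),
    ("MN, top-L", 6): (0.268, 0.136, 0.090),
    ("MN, top-L", 12): (0.218, 0.123, 0.079),
    ("MN, top-L", 24): (0.155, 0.119, 0.067),
    ("SVMcon", 6): (0.37, 0.21, 0.13),
    ("SVMcon", 12): (0.30, 0.20, 0.12),
    ("SVMcon", 24): (0.21, 0.19, 0.01),
}

for (method, sep), (spec, sen, listed) in ROWS.items():
    computed = half_harmonic_mean(spec, sen)
    print(f"{method:16s} sep>={sep:<2d} computed={computed:.3f} listed={listed}")
```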
Markov network (MN) and SVMcon comparison.
Fmax comparisons between Markov network (top-L guesses) and SVMcon of different SCOP protein classes and sequence separations.
| SCOP class | Number | Fmax, sep. ≥6 (MN, top-L) | Fmax, sep. ≥6 (SVMcon) | Fmax, sep. ≥12 (MN, top-L) | Fmax, sep. ≥12 (SVMcon) | Fmax, sep. ≥24 (MN, top-L) | Fmax, sep. ≥24 (SVMcon) |
|---|---|---|---|---|---|---|---|
| Alpha | 11 | 0.075 | 0.120 | 0.012 | 0.087 | 0.009 | 0.050 |
| Beta | 10 | 0.077 | 0.117 | 0.073 | 0.111 | 0.057 | 0.096 |
| Alpha+beta | 15 | 0.102 | 0.161 | 0.089 | 0.146 | 0.067 | 0.110 |
| Alpha/beta | 7 | 0.099 | 0.126 | 0.096 | 0.121 | 0.097 | 0.117 |
| Small | 4 | 0.085 | 0.120 | 0.060 | 0.113 | 0.023 | 0.063 |
| Coiled coil | 1 | 0.091 | 0.142 | 0.030 | 0.025 | N/A | N/A |
CASP 10 results comparison.
Specificity results on the 123 CASP 10 targets for short-range, medium-range, and long-range contacts with guess numbers of L/10, L/5, and L/2 where L is the protein length.
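| Method | Seq. sep. of [6,12], L/10 | Seq. sep. of [6,12], L/5 | Seq. sep. of [6,12], L/2 | Seq. sep. of [12,24), L/10 | Seq. sep. of [12,24), L/5 | Seq. sep. of [12,24), L/2 | Seq. sep. of >24, L/10 | Seq. sep. of >24, L/5 | Seq. sep. of >24, L/2 |
|---|---|---|---|---|---|---|---|---|---|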
| Markov networks | 0.292 | 0.250 | 0.194 | 0.297 | 0.261 | 0.223 | 0.153 | 0.135 | 0.111 |
| CoinDCA | 0.517 | 0.435 | 0.311 | 0.500 | 0.440 | 0.340 | 0.412 | 0.351 | 0.279 |
| PSICOV | 0.234 | 0.191 | 0.140 | 0.310 | 0.259 | 0.192 | 0.276 | 0.225 | 0.168 |
| plmDCA | 0.264 | 0.218 | 0.152 | 0.344 | 0.289 | 0.214 | 0.326 | 0.280 | 0.213 |
| NNcon | 0.499 | 0.399 | 0.275 | 0.393 | 0.334 | 0.226 | 0.239 | 0.188 | 0.001 |
| GREMLIN | 0.256 | 0.212 | 0.161 | 0.343 | 0.280 | 0.229 | 0.320 | 0.278 | 0.159 |
| CMAPpro | 0.437 | 0.368 | 0.253 | 0.414 | 0.363 | 0.276 | 0.336 | 0.297 | 0.227 |
| EVfold | 0.193 | 0.165 | 0.130 | 0.294 | 0.249 | 0.188 | 0.257 | 0.225 | 0.171 |
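
A minimal sketch of how a score at L/10, L/5, or L/2 guesses can be computed for one protein and one separation range follows. It assumes that "specificity" here means the fraction of the top-ranked guesses that are true contacts, and that the separation range is half-open; both are assumptions, as are the function and parameter names.

```python
def top_fraction_specificity(scores, contacts, length, divisor,
                             min_sep, max_sep=None):
    """Fraction of the top length // divisor predictions that are true contacts.

    scores:   dict mapping residue pairs (i, j) to a predicted contact score
    contacts: set of residue pairs (i, j) that are true contacts
    Only pairs with min_sep <= |i - j| < max_sep are considered
    (no upper bound when max_sep is None).
    """
    candidates = []
    for (i, j), score in scores.items():
        sep = abs(j - i)
        if sep >= min_sep and (max_sep is None or sep < max_sep):
            candidates.append(((i, j), score))
    # Rank candidate pairs from the most to the least confident prediction.
    candidates.sort(key=lambda item: item[1], reverse=True)
    n_guesses = max(1, length // divisor)
    top = candidates[:n_guesses]
    if not top:
        return 0.0
    hits = sum(1 for pair, _ in top if pair in contacts)
    return hits / len(top)


# Toy example with made-up scores for a protein of length 20.
toy_scores = {(0, 7): 0.9, (1, 9): 0.8, (2, 10): 0.7, (0, 19): 0.6}
toy_contacts = {(0, 7), (2, 10)}
print(top_fraction_specificity(toy_scores, toy_contacts, 20, 10, 6, 12))
```

Per-target values computed this way would then be averaged over all targets to give one table entry.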
CASP 11 results comparison.
Specificity results on the 105 CASP 11 targets for short-range, medium-range, and long-range contacts with guess numbers of L/10, L/5, and L/2 where L is the protein length.
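| Method | Seq. sep. of [6,12], L/10 | Seq. sep. of [6,12], L/5 | Seq. sep. of [6,12], L/2 | Seq. sep. of (12,24], L/10 | Seq. sep. of (12,24], L/5 | Seq. sep. of (12,24], L/2 | Seq. sep. of >24, L/10 | Seq. sep. of >24, L/5 | Seq. sep. of >24, L/2 |
|---|---|---|---|---|---|---|---|---|---|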
| Markov networks | 0.287 | 0.254 | 0.197 | 0.270 | 0.235 | 0.195 | 0.142 | 0.127 | 0.108 |
| CoinDCA | 0.452 | 0.391 | 0.286 | 0.430 | 0.365 | 0.254 | 0.279 | 0.240 | 0.186 |
| PSICOV | 0.190 | 0.144 | 0.112 | 0.196 | 0.163 | 0.115 | 0.198 | 0.172 | 0.127 |
| plmDCA | 0.185 | 0.144 | 0.107 | 0.208 | 0.165 | 0.122 | 0.226 | 0.214 | 0.161 |
| GREMLIN | 0.183 | 0.145 | 0.106 | 0.193 | 0.162 | 0.121 | 0.215 | 0.206 | 0.160 |
| EVfold | 0.159 | 0.137 | 0.100 | 0.197 | 0.163 | 0.113 | 0.193 | 0.163 | 0.132 |
Figure 8. Sample Markov network.
A sample network diagram taken from the split of four, two bits/feature treatment at 75 k updates. Of the 1,376 possible input bits, the network has evolved to recognize only 118. Input bits are green, gates are red, and outputs are blue. The inputs and gates are unordered. Note that two outputs (the maximum) have evolved to represent the positive contact answer, whereas only one has evolved for the negative contact answer.
Figure 9. Network recognition.
A histogram showing how many features are recognized by a certain number of networks (encoding split of four with two bits/feature at 75 k updates). A network only has to have input from one bit of a feature to recognize it.
Networks per feature.
The mean and median number of networks that recognize each feature for each treatment.
| Treatment | Mean networks recognizing each feature | Median networks recognizing each feature |
|---|---|---|
| Split 10, 10 bits/feature | 7.5 | 5 |
| Split 16, 16 bits/feature | 7.1 | 4 |
| Split 4, 4 bits/feature | 7.2 | 5 |
| Split 16, 4 bits/feature | 6.5 | 4 |
| Split 4, 2 bits/feature | 7.8 | 5 |
Feature recognition.
Feature recognition values of the networks from each treatment at 75 k updates. For a network to recognize a feature, it only has to connect to one bit for that feature.
| Treatment | Mean bits recognized | Median bits recognized | Mean features recognized | Median features recognized |
|---|---|---|---|---|
| Split 10, 10 bits/feature | 98.9 | 98.0 | 85.6 | 84.5 |
| Split 16, 16 bits/feature | 93.5 | 89.5 | 80.9 | 75.5 |
| Split 4, 4 bits/feature | 91.9 | 95.0 | 82.0 | 84.5 |
| Split 16, 4 bits/feature | 84.1 | 85.5 | 74.9 | 76.5 |
| Split 4, 2 bits/feature | 97.6 | 98.0 | 90.0 | 90.5 |
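
The difference between bits recognized and features recognized can be illustrated with a short sketch. It assumes that the bits of each feature are stored contiguously, so that bit index `b` belongs to feature `b // bits_per_feature`; this layout and the function names are assumptions, not details given here.

```python
from statistics import mean, median


def features_recognized(input_bits, bits_per_feature):
    """Map the input bits a network reads to the set of features they belong to.

    One connected bit is enough for a feature to count as recognized.
    """
    return {bit // bits_per_feature for bit in input_bits}


def recognition_stats(networks, bits_per_feature):
    """Mean and median bits and features recognized across a list of networks."""
    bit_counts = [len(set(bits)) for bits in networks]
    feature_counts = [len(features_recognized(bits, bits_per_feature))
                      for bits in networks]
    return {
        "mean_bits": mean(bit_counts),
        "median_bits": median(bit_counts),
        "mean_features": mean(feature_counts),
        "median_features": median(feature_counts),
    }


# Toy example: three networks, two bits per feature (made-up bit indices).
toy_networks = [{0, 1, 5}, {4, 5, 9}, {1375}]
print(recognition_stats(toy_networks, bits_per_feature=2))
```

A network that reads both bits of one feature adds two to its bit count but only one to its feature count, which is why the feature columns in the table are lower than the bit columns.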
Most-recognized features.
The 12 features most recognized by Markov networks for the split of 4 with 2 bits/feature at 75 k updates.
| Feature | Networks |
|---|---|
| Contact pair sequence separation ≥ 50 | 59 |
| C-terminus amino acid window position 5, sheet secondary structure | 58 |
| N-terminus amino acid window position 5, sheet secondary structure | 54 |
| Amino acid central segment window position 4, coil secondary structure | 50 |
| Amino acid central segment window position 4, sheet secondary structure | 49 |
| Contact pair sequence separation ≤ 49 | 48 |
| Amino acid central segment window position 5, coil secondary structure | 45 |
| N-terminus amino acid window position 5, exposed solvent accessibility | 45 |
| Amino acid central segment window position 5, sheet secondary structure | 43 |
| Contact pair sequence separation of 6 | 42 |
| N-terminus amino acid window position 5, buried solvent accessibility | 42 |
| C-terminus amino acid window position 5, buried solvent accessibility | 40 |
Treatment top features.
Number of top-12 features from the split 4, 2 bit/feature treatment that are also in the top-12 features of the other 4 treatments.
| Number of treatments | Top-12 features |
|---|---|
| Five treatments | 5 |
| Four treatments | 7 |
| Three treatments | 8 |
| Two treatments | 10 |
Figure 10. Secondary structure recognition.
Number of networks out of the 60 that evolved to recognize each kind of secondary structure along the two size-9 sliding windows. Encoding was split of four, two bits/feature.
Figure 11. Amino acid separation recognition.
Number of networks out of the 60 that evolved to recognize the amino acid pair separation features. Encoding was split of four, two bits/feature. Each tick shown is a different contact separation feature.
Figure 12. Comparison of full vs. reduced feature set performance.
Fmax of the original split of four, two bits per feature encoding with all features, compared with an otherwise identical run on the reduced feature set, which kept only features recognized by at least six of the networks from the first run.
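
The selection rule in this caption (keep a feature if at least six networks recognize it) can be sketched as a simple count-and-threshold step. The input format, one set of recognized feature indices per evolved network, and the function names are assumptions for illustration.

```python
from collections import Counter


def networks_per_feature(recognized_by_network):
    """Count, for each feature, how many networks recognize it."""
    counts = Counter()
    for features in recognized_by_network:
        # Each network contributes at most once per feature.
        counts.update(set(features))
    return counts


def reduced_feature_set(recognized_by_network, min_networks=6):
    """Return the sorted features recognized by at least min_networks networks."""
    counts = networks_per_feature(recognized_by_network)
    return sorted(f for f, n in counts.items() if n >= min_networks)


# Toy example with seven networks and made-up feature indices.
toy = [{1, 2}, {1, 2}, {1, 3}, {1, 2}, {1, 2}, {1, 2}, {2, 4}]
print(reduced_feature_set(toy))
```

In the toy data, features 1 and 2 are each recognized by six networks and are kept, while features 3 and 4 are dropped.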