Kolja Stahl, Michael Schneider, Oliver Brock.
Abstract
BACKGROUND: Accurately predicted contacts make it possible to compute the 3D structure of a protein. Since the solution space of native residue-residue contact pairs is very large, it is necessary to leverage information to identify relevant regions of the solution space, i.e. correct contacts. Every additional source of information can contribute to narrowing down candidate regions. Therefore, recent methods combined either evolutionary and sequence-based information or evolutionary and physicochemical information. We develop a new contact predictor (EPSILON-CP) that goes beyond current methods by combining evolutionary, physicochemical, and sequence-based information. The problems resulting from the increased dimensionality and complexity of the learning problem are addressed with a careful feature analysis, which results in a drastically reduced feature set. The different information sources are combined using deep neural networks.
Keywords: Contact prediction; Deep learning; Meta algorithms
Year: 2017 PMID: 28623886 PMCID: PMC5474060 DOI: 10.1186/s12859-017-1713-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Overview: Leveraged information per algorithm
| Algorithm | Physicochemistry | Evolutionary | Sequence | Features |
|---|---|---|---|---|
| EPC-map | ✓ | ✓(1) | | 228 |
| MemBrain | | ✓(1) | ✓ | 400/200 |
| BCL::Contact | ✓ | ✓ | 90 | |
| PhyCMAP | | ✓(2) | ✓ | ≈300 |
| PConsC | | ✓(2) | ✓ | 252 |
| MetaPSICOV | | ✓(3) | ✓ | 672/731 |
| EPSILON-CP | ✓ | ✓(5) | ✓ | 171 |
The feature counts given for MetaPSICOV refer to Stage 1/Stage 2. The feature counts given for MemBrain refer to the serial/parallel combination. The first part of the table contains methods that combine the features/high-level predictions linearly, the second part methods that combine them non-linearly. The numbers in parentheses for evolutionary information denote the number of different methods utilized
Fig. 1 Results for long-range contacts. EPSILON-CP outperforms the other predictors on all three benchmark sets
Mean precision for long-range contacts on 21 CASP11 FM hard targets
| Method | L/10 | L/5 | L/2 | L | 1.5L |
|---|---|---|---|---|---|
| EPC-map | 0.206 | 0.18 | 0.129 | 0.103 | 0.086 |
| MetaPSICOV (stage 1) | 0.317 | 0.268 | 0.2 | 0.151 | 0.129 |
| MetaPSICOV (stage 2) | 0.322 | 0.284 | 0.216 | 0.159 | 0.132 |
| EPSILON-CP | 0.357 | 0.305 | 0.235 | 0.182 | 0.155 |
| CCMpred | 0.221 | 0.182 | 0.145 | 0.111 | 0.092 |
| GaussDCA | 0.209 | 0.186 | 0.135 | 0.104 | 0.087 |
| GREMLIN | 0.207 | 0.165 | 0.12 | 0.086 | 0.078 |
| PSICOV | 0.189 | 0.147 | 0.112 | 0.087 | 0.074 |
Precision is computed on the 26 domains of these targets for the top predictions relative to the sequence length L
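The precision of the top L/k long-range predictions reported in these tables can be sketched as follows. This is a minimal illustration and not the authors' evaluation code: the function name, the input format, and the minimum sequence separation of 24 residues used here to define long-range contacts are assumptions, since the threshold is not stated in this record.

```python
from typing import Iterable, Set, Tuple


def top_fraction_precision(
    scored_pairs: Iterable[Tuple[int, int, float]],
    native_contacts: Set[Tuple[int, int]],
    seq_len: int,
    factor: float,
    min_separation: int = 24,  # assumed long-range threshold, not from the source
) -> float:
    """Precision of the top (factor * L) long-range predictions.

    scored_pairs holds (i, j, score) triples; native_contacts holds
    (i, j) pairs with i < j. Both formats are hypothetical.
    """
    # Keep only pairs whose sequence separation qualifies as long-range.
    long_range = [(i, j, s) for i, j, s in scored_pairs
                  if abs(i - j) >= min_separation]
    # Rank by predicted score, highest first.
    long_range.sort(key=lambda t: t[2], reverse=True)
    n_top = max(1, int(seq_len * factor))
    top = long_range[:n_top]
    if not top:
        return 0.0
    hits = sum(1 for i, j, _ in top
               if (min(i, j), max(i, j)) in native_contacts)
    # Divide by the number of predictions actually evaluated.
    return hits / len(top)


# Toy example with made-up scores for a protein of length L = 50.
scored = [(0, 30, 0.9), (1, 40, 0.8), (2, 45, 0.7), (3, 35, 0.6),
          (5, 49, 0.5), (10, 12, 0.99), (4, 44, 0.1)]
native = {(0, 30), (2, 45), (5, 49), (10, 12)}

print(top_fraction_precision(scored, native, seq_len=50, factor=0.1))  # 0.6
print(top_fraction_precision(scored, native, seq_len=50, factor=0.2))  # 0.5
```

The pair (10, 12) is a native contact with the highest score, but it is excluded because it is not long-range under the assumed threshold. Averaging this per-protein value over a benchmark set would give a mean precision of the kind tabulated here.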
Mean precision for long-range contacts on the NOUMENON data set
| Method | L/10 | L/5 | L/2 | L | 1.5L |
|---|---|---|---|---|---|
| EPC-map | 0.487 | 0.445 | 0.355 | 0.265 | 0.21 |
| MetaPSICOV (stage 1) | 0.473 | 0.403 | 0.287 | 0.217 | 0.184 |
| MetaPSICOV (stage 2) | 0.419 | 0.341 | 0.243 | 0.183 | 0.149 |
| EPSILON-CP | 0.576 | 0.531 | 0.435 | 0.328 | 0.264 |
| CCMpred | 0.095 | 0.095 | 0.083 | 0.074 | 0.066 |
| GaussDCA | 0.084 | 0.083 | 0.077 | 0.07 | 0.065 |
| GREMLIN | 0.071 | 0.073 | 0.065 | 0.059 | 0.056 |
| PSICOV | 0.083 | 0.069 | 0.063 | 0.055 | 0.053 |
Precision of the top predictions relative to the sequence length L
Mean precision for long-range contacts for proteins from D329, SVMcon Test, PSICOV and EPC-map_test
| Method | L/10 | L/5 | L/2 | L | 1.5L |
|---|---|---|---|---|---|
| EPC-map | 0.656 | 0.591 | 0.459 | 0.335 | 0.263 |
| MetaPSICOV (stage 1) | 0.639 | 0.57 | 0.444 | 0.333 | 0.268 |
| MetaPSICOV (stage 2) | 0.658 | 0.599 | 0.483 | 0.368 | 0.3 |
| EPSILON-CP | 0.723 | 0.665 | 0.542 | 0.409 | 0.325 |
| CCMpred | 0.511 | 0.456 | 0.345 | 0.249 | 0.197 |
| GaussDCA | 0.481 | 0.423 | 0.322 | 0.238 | 0.192 |
| GREMLIN | 0.5 | 0.448 | 0.338 | 0.243 | 0.192 |
| PSICOV | 0.452 | 0.39 | 0.285 | 0.203 | 0.163 |
Precision of the top predictions relative to the sequence length L
Fig. 2 Simplified and aggregated depiction of the feature importance as emitted by XGBoost. The amino acid composition is assigned the least importance, although it makes up roughly 75% of the features. The different co-evolutionary information entries correspond to different co-evolutionary methods. The feature importance depicted here is the average over a 5-fold cross-validation
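The aggregation described in this caption, averaging per-feature importance over cross-validation folds and summing it per feature group, can be sketched in plain Python. This is an illustrative sketch under stated assumptions: the function name, the feature names, the group labels, and all importance values below are hypothetical, and the per-fold scores are assumed to have already been exported from the trained models.

```python
from collections import defaultdict
from typing import Dict, List


def mean_group_importance(
    fold_importances: List[Dict[str, float]],
    feature_groups: Dict[str, str],
) -> Dict[str, float]:
    """Sum importance per feature group, then average over the folds."""
    totals: Dict[str, float] = defaultdict(float)
    for fold in fold_importances:
        for feature, score in fold.items():
            # Map each individual feature to its aggregated group.
            totals[feature_groups[feature]] += score
    n_folds = len(fold_importances)
    return {group: total / n_folds for group, total in totals.items()}


# Hypothetical feature-to-group mapping and made-up scores for two folds.
groups = {
    "gaussdca": "co-evolution",
    "aa_comp_A": "amino acid composition",
    "aa_comp_C": "amino acid composition",
    "sec_struct": "secondary structure",
}
folds = [
    {"gaussdca": 0.5, "aa_comp_A": 0.1, "aa_comp_C": 0.1, "sec_struct": 0.3},
    {"gaussdca": 0.7, "aa_comp_A": 0.05, "aa_comp_C": 0.05, "sec_struct": 0.2},
]

averaged = mean_group_importance(folds, groups)
for group, value in sorted(averaged.items(), key=lambda kv: -kv[1]):
    print(f"{group}: {value:.2f}")
```

A group made up of many features can still receive a low aggregated score, which is the situation the caption describes for the amino acid composition.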
Fig. 3 Comparison of three neural networks with identical architecture on EPC-map_test (long-range contacts). The baseline network (square marker) uses the full feature set and is trained on 657 proteins. The training proteins are a mix of EPC-map_train and MetaPSICOV proteins. The square marker denotes the neural network that is trained without the amino acid composition but on the same data set. The second network (circle marker) shows the performance of the neural network after increasing the training set size from 657 to 1479 proteins, which was possible because dropping the amino acid composition reduced the dimensionality of the learning problem. Note that most of the new proteins are much more complex and may not be helpful for predicting proteins in EPC-map_test
Fig. 4 Performance comparison of different information types on long-range contacts on the EPC-map_test data set. S(equence), E(volutionary), P(hysicochemical) and the respective combinations. S uses the feature set described in Features minus EPC-map and the co-evolutionary methods. E (GaussDCA) is the best evolutionary method in our experiments. For P the representative is EPC-map