Fardina Fathmiul Alam, Taseef Rahman, Amarda Shehu.
Abstract
Rapid growth in molecular structure data is renewing interest in featurizing structure. Featurizations that retain information on biological activity are particularly sought for protein molecules, where decades of research have shown that structure indeed encodes function. Research on featurization of protein structure is active; here we assess the promise of autoencoders. Motivated by rapid progress in neural network research, we investigate and evaluate autoencoders for yielding linear and nonlinear featurizations of protein tertiary structures. An additional reason we focus on autoencoders as the engine for obtaining featurizations is the versatility of their architectures and the ease with which changes to architecture yield linear versus nonlinear features. While open-source neural network libraries such as Keras, which we employ here, greatly facilitate constructing, training, and evaluating autoencoder architectures and conducting model search, autoencoders have not yet gained popularity in the structural biology community. Here we demonstrate their utility in a practical context. Employing autoencoder-based featurizations, we address the classic problem of decoy selection in protein structure prediction. Utilizing off-the-shelf supervised learning methods, we demonstrate that the featurizations are indeed meaningful and allow detecting active tertiary structures, thus opening the way for further avenues of research.
Keywords: autoencoder; decoy selection; featurization; protein modeling; tertiary structure
Year: 2020 PMID: 32143444 PMCID: PMC7179114 DOI: 10.3390/molecules25051146
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Figure 1. (Top panel) In a vanilla AE, the encoder maps x directly to y, and the decoder maps y directly to z. (Bottom panel) In a deep, non-stacked architecture, the encoder and decoder contain several hidden layers.
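The vanilla AE in the top panel (an encoder mapping x to a latent y, a decoder mapping y back to a reconstruction z) can be illustrated with a minimal numpy sketch. The single linear layer per side, the toy data, and the training hyperparameters below are placeholder assumptions, not the paper's Keras setup; only the encoder/decoder structure and the reconstruction-MSE objective follow the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for flattened CA coordinates: 200 "structures", 30 dims each.
X = rng.normal(size=(200, 30))
X -= X.mean(axis=0)

d_in, d_lat = X.shape[1], 2
W_enc = rng.normal(scale=0.1, size=(d_in, d_lat))  # encoder: x -> y
W_dec = rng.normal(scale=0.1, size=(d_lat, d_in))  # decoder: y -> z

def recon_mse(X, W_enc, W_dec):
    Z = (X @ W_enc) @ W_dec
    return float(np.mean((Z - X) ** 2))

mse0 = recon_mse(X, W_enc, W_dec)  # error at random initialization

lr = 1e-3
for _ in range(500):
    Y = X @ W_enc          # latent features y
    E = Y @ W_dec - X      # reconstruction error z - x
    # Gradient descent on the mean squared reconstruction error.
    g_dec = (Y.T @ E) / len(X)
    g_enc = (X.T @ (E @ W_dec.T)) / len(X)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

mse = recon_mse(X, W_enc, W_dec)  # lower than mse0 after training
```

A deep, non-stacked AE as in the bottom panel would interleave several nonlinear hidden layers on both the encoder and decoder sides; with purely linear layers, as here, the learned features span the same subspace a linear method like PCA would find.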
Testing dataset (* denotes proteins with a predominant fold and a short helix). The chain extracted from a multi-chain PDB entry (shown in Column 2) to be used as the native structure is shown in parentheses. The CATH fold and architecture [41] of the known native structure is shown in Column 3. The length of the protein sequence (#aas) is shown in Column 4. The size of the Rosetta-generated decoy dataset is shown in Column 5. Column 6 shows the minimum lRMSD over decoys from the known native structure. Column 7 shows the percentage of near-native decoy structures (within some threshold lRMSD of the known native structure).
| Difficulty | PDB id | CATH Fold and Architecture | # AAs | # Decoys | Min lRMSD (Å) | % Native |
|---|---|---|---|---|---|---|
| Easy | 1ail | Mainly | 70 | 58,491 | | |
| | 1dtd(B) | | 61 | 58,745 | | |
| | 1wap(A) | Mainly | 68 | 68,000 | | |
| | 1tig | | 88 | 60,000 | | |
| | 1dtj(A) | | 74 | 60,500 | | |
| Medium | 1hz6(A) | | 64 | 60,000 | | |
| | 1c8c(A) | Mainly | 64 | 65,000 | | |
| | 2ci2 | | 65 | 60,000 | | |
| | 1bq9 | Mainly | 53 | 61,000 | | |
| | 1hhp | Mainly | 99 | 60,000 | | |
| | 1fwp | | 69 | 51,724 | | |
| | 1sap | Mainly | 66 | 66,000 | | |
| Hard | 2h5n(D) | Mainly | 123 | 54,795 | | |
| | 2ezk | Mainly | 93 | 54,626 | | |
| | 1aoy | Mainly | 78 | 57,000 | | |
| | 1cc5 | Mainly | 83 | 55,000 | | |
| | 1isu(A) | | 62 | 60,000 | | |
| | 1aly | Mainly | 146 | 53,000 | | |
Figure 2. Tertiary structures obtained with Rosetta and prepared for input as described in Section 3 are projected onto the two features learned via (a) PCA, (b) Isomap, (c) best-performing vAE architecture, (d) best-performing dAE architecture, (e) best-performing oAE architecture, (f) PCA + t-SNE, or (g) best-performing dAE model + t-SNE; in (c–e), two feature spaces are shown to demonstrate the variability of model parameters converged onto from the learning process. (a–g) Featurized structures are drawn as disks, color-coded by their lRMSD (of the corresponding CA-represented structures) from the native structure in a blue-to-red color scheme indicating lower-to-higher lRMSDs.
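As a point of reference for panel (a), projecting structures onto the top two principal components takes only a few lines. This is a generic numpy sketch with random placeholder data in place of the Rosetta decoys, and the helper name `pca_2d` is illustrative, not from the paper.

```python
import numpy as np

def pca_2d(X):
    """Project rows of X onto the top two principal components."""
    Xc = X - X.mean(axis=0)
    # Right singular vectors of the centered data are the PCs,
    # ordered by decreasing explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

rng = np.random.default_rng(1)
decoys = rng.normal(size=(100, 60))  # placeholder for flattened structures
feats = pca_2d(decoys)               # 100 x 2, ready for a scatter plot
```

The first output column carries at least as much variance as the second, which is why PCA scatter plots conventionally put PC1 on the horizontal axis.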
Comparison of MSEs of PCA and the AE architectures over testing datasets of target proteins; each protein is identified by the PDB id of its known native structure, with the chain shown in parentheses. Each AE architecture is trained three times, starting from random initial weights and biases; this yields different models, whose MSE variances are shown. Mean MSE values are rounded to the second digit after the decimal point. Higher precision is used for the variance results, with values below that precision rounded to 0.
| PDB ID | PCA | vAE | | | |
|---|---|---|---|---|---|
| | Mean | Mean (Var) | Mean (Var) | Mean (Var) | Mean (Var) |
| 1ail | | | | | |
| 1dtd(B) | | | | | |
| 1wap(A) | | | | | |
| 1tig | | | | | |
| 1dtj(A) | | | | | |
| 1hz6(A) | | | | | |
| 1c8c(A) | | | | | |
| 2ci2 | | | | | |
| 1bq9 | | | | | |
| 1hhp | | | | | |
| 1fwp | | | | | |
| 1sap | | | | | |
| 2h5n(D) | | | | | |
| 2ezk | | | | | |
| 1aoy | | | | | |
| 1cc5 | | | | | |
| 1isu(A) | | | | | |
| 1aly | | | | | |
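For the deterministic PCA baseline in the comparison above, the reconstruction MSE of a k-component projection can be computed directly, with no restarts and hence a single mean and no variance. The sketch below uses random placeholder data and an illustrative helper name, not the paper's decoy datasets.

```python
import numpy as np

def pca_recon_mse(X, k):
    """Mean squared error of reconstructing X from its top-k PCs."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:k].T @ Vt[:k]  # project onto k PCs and back
    return float(np.mean((Xc - Z) ** 2))

rng = np.random.default_rng(4)
decoys = rng.normal(size=(150, 40))  # placeholder decoy featurization input
mse_2 = pca_recon_mse(decoys, 2)
mse_20 = pca_recon_mse(decoys, 20)   # more components, lower error
```

An AE, in contrast, reaches a different local optimum from each random initialization, which is why the table reports a variance over three trained models per architecture.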
Figure 3. (a) The 6 drawn structures correspond to a horizontal, left-to-right walk in the latent space of PDB id 1dtj(A). (b) The 6 structures correspond to a vertical, bottom-to-top walk in the same latent space. They are shown from a different viewpoint to highlight structural changes. (c,d) 50 structures (generated via motions along each of the top two PCs) are superimposed to visualize the latent PC space. (e) The native structure is drawn for reference. (a–e) All structures are rendered with the VMD software [46].
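A latent-space walk like the ones in panels (a,b) amounts to sampling points along one latent axis and pushing each through the decoder. The sketch below uses a random linear map as a stand-in for the trained decoder; the sizes and the six-point walk are placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(2)
d_lat, d_out = 2, 30
W_dec = rng.normal(size=(d_lat, d_out))  # placeholder for a trained decoder

# Horizontal, left-to-right walk: vary the first latent coordinate,
# hold the second fixed, and decode each sampled point.
walk = np.array([[t, 0.0] for t in np.linspace(-3.0, 3.0, 6)])
structures = walk @ W_dec  # one decoded "structure" per latent point
```

A vertical walk as in panel (b) would instead fix the first coordinate and vary the second.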
Performance of the Easy model. MSE and variance values are rounded to the third digit after the decimal point.

| | dAE | | PCA | | Isomap | |
|---|---|---|---|---|---|---|
| | Perceptron | Linear Reg. | Perceptron | Linear Reg. | Perceptron | Linear Reg. |
| | MSE (Var) | MSE (Var) | MSE (Var) | MSE (Var) | MSE (Var) | MSE (Var) |
| 2D | | | | | | |
| 5D | | | | | | |
| 10D | | | | | | |
| 20D | | | | | | |
Performance of the Medium model. MSE and variance values are rounded to the third digit after the decimal point.

| | dAE | | PCA | | Isomap | |
|---|---|---|---|---|---|---|
| | Perceptron | Linear Reg. | Perceptron | Linear Reg. | Perceptron | Linear Reg. |
| | MSE (Var) | MSE (Var) | MSE (Var) | MSE (Var) | MSE (Var) | MSE (Var) |
| 2D | | | | | | |
| 5D | | | | | | |
| 10D | | | | | | |
| 20D | | | | | | |
Performance of the Hard model. MSE and variance values are rounded to the third digit after the decimal point.

| | dAE | | PCA | | Isomap | |
|---|---|---|---|---|---|---|
| | Perceptron | Linear Reg. | Perceptron | Linear Reg. | Perceptron | Linear Reg. |
| | MSE (Var) | MSE (Var) | MSE (Var) | MSE (Var) | MSE (Var) | MSE (Var) |
| 2D | | | | | | |
| 5D | | | | | | |
| 10D | | | | | | |
| 20D | | | | | | |
Performance of the Combined model. MSE and variance values are rounded to the third digit after the decimal point.

| | dAE | | PCA | | Isomap | |
|---|---|---|---|---|---|---|
| | Perceptron | Linear Reg. | Perceptron | Linear Reg. | Perceptron | Linear Reg. |
| | MSE (Var) | MSE (Var) | MSE (Var) | MSE (Var) | MSE (Var) | MSE (Var) |
| 2D | | | | | | |
| 5D | | | | | | |
| 10D | | | | | | |
| 20D | | | | | | |
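The linear-regression half of the evaluations in these tables can be sketched as a plain least-squares fit from a featurization to lRMSD. Everything below is a synthetic placeholder: the data are random, the 5D featurization is arbitrary, and the numpy fit is a stand-in for, not a reproduction of, the paper's off-the-shelf regressor or perceptron.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 5
feats = rng.normal(size=(n, d))  # e.g. a 5D featurization of decoys
# Synthetic target: lRMSD as a noisy linear function of the features.
lrmsd = feats @ rng.normal(size=d) + rng.normal(scale=0.1, size=n)

# Least-squares linear regression with an intercept column appended.
A = np.hstack([feats, np.ones((n, 1))])
coef, *_ = np.linalg.lstsq(A, lrmsd, rcond=None)
mse = float(np.mean((A @ coef - lrmsd) ** 2))
```

If the featurization carries information about nativeness, the fit's MSE falls well below the raw variance of the lRMSD target, which is the sense in which low values in these tables indicate a meaningful featurization.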