A Unified Multitask Architecture for Predicting Local Protein Properties
Yanjun Qi, Merja Oja, Jason Weston, William Stafford Noble.
Abstract
A variety of functionally important protein properties, such as secondary structure, transmembrane topology and solvent accessibility, can be encoded as a labeling of amino acids. Indeed, the prediction of such properties from the primary amino acid sequence is one of the core projects of computational biology. Accordingly, a panoply of approaches have been developed for predicting such properties; however, most such approaches focus on solving a single task at a time. Motivated by recent, successful work in natural language processing, we propose to use multitask learning to train a single, joint model that exploits the dependencies among these various labeling tasks. We describe a deep neural network architecture that, given a protein sequence, outputs a host of predicted local properties, including secondary structure, solvent accessibility, transmembrane topology, signal peptides and DNA-binding residues. The network is trained jointly on all these tasks in a supervised fashion, augmented with a novel form of semi-supervised learning in which the model is trained to distinguish between local patterns from natural and synthetic protein sequences. The task-independent architecture of the network obviates the need for task-specific feature engineering. We demonstrate that, for all of the tasks that we considered, our approach leads to statistically significant improvements in performance, relative to a single task neural network approach, and that the resulting model achieves state-of-the-art performance.
Year: 2012 PMID: 22461885 PMCID: PMC3312883 DOI: 10.1371/journal.pone.0032235
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Deep neural network architecture.
Given an input amino acid sequence, the neural network outputs a posterior distribution over the class labels for the central amino acid of the input window. This general deep network architecture is suitable for all of our prediction tasks. The network is characterized by three parts: (1) an amino acid feature extraction layer, (2) a sequential feature extraction layer, and (3) a series of classical neural network layers. The first layer consists of a PSI-BLAST feature module and an amino acid embedding module. Given a sliding window of input residues, the amino acid embedding module outputs a series of real-valued vectors. Similarly, the PSI-BLAST module derives the 20-dimensional PSI-BLAST feature vectors corresponding to the amino acids. These vectors are then concatenated in the sequential feature extraction layer of the network. Finally, the derived vector is fed into the classical neural network layers. The final softmax layer allows us to interpret the outputs as probabilities for each class.
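The forward pass described in this caption can be sketched as follows, with NumPy standing in for the trained network; the window length, embedding size, hidden-layer width, and class count here are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 25     # amino acid alphabet plus padding/ambiguity codes (assumed)
EMB_DIM = 15   # embedding dimensions per residue (assumed)
WIN = 13       # sliding-window length (assumed)
PSI_DIM = 20   # PSI-BLAST profile values per residue
HID = 50       # hidden units (assumed)
CLASSES = 3    # e.g. helix / strand / coil for secondary structure

# Part 1: amino acid feature extraction (the embedding table; the
# PSI-BLAST profile arrives as precomputed input data).
embedding = rng.normal(0.0, 0.1, (VOCAB, EMB_DIM))
# Part 3: classical neural network layers ending in a softmax.
W1 = rng.normal(0.0, 0.1, (WIN * (EMB_DIM + PSI_DIM), HID))
b1 = np.zeros(HID)
W2 = rng.normal(0.0, 0.1, (HID, CLASSES))
b2 = np.zeros(CLASSES)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict(window_ids, psiblast_profile):
    """window_ids: (WIN,) residue indices; psiblast_profile: (WIN, PSI_DIM)."""
    # Part 2: concatenate the per-residue vectors into one sequential feature vector.
    feats = np.concatenate([embedding[window_ids], psiblast_profile], axis=1)
    h = np.tanh(feats.reshape(-1) @ W1 + b1)
    return softmax(h @ W2 + b2)  # posterior over class labels for the central residue

probs = predict(rng.integers(0, VOCAB, WIN), rng.normal(size=(WIN, PSI_DIM)))
```

Sliding this window along the sequence yields one class distribution per residue.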
Figure 2. Multitask learning with weight sharing between multiple deep neural networks.
In this figure, two related tasks are trained simultaneously using the network architecture from Figure 1. Only the very last layers of the network are task-specific.
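A minimal sketch of this hard parameter sharing, assuming a NumPy trunk as in the previous sketch: both tasks pass through the same lower layers, and only the final softmax layer is per-task. The task names and class counts below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
IN, HID = 455, 50  # input and hidden sizes (assumed)

# Lower layers are shared by every task.
shared_W = rng.normal(0.0, 0.1, (IN, HID))
shared_b = np.zeros(HID)

# Only the final softmax layer is task-specific.
heads = {
    "ss": (rng.normal(0.0, 0.1, (HID, 3)), np.zeros(3)),  # 3 secondary-structure classes
    "tm": (rng.normal(0.0, 0.1, (HID, 6)), np.zeros(6)),  # 6 transmembrane labels
}

def forward(x, task):
    h = np.tanh(x @ shared_W + shared_b)  # shared trunk
    W, b = heads[task]                    # task-specific output layer
    z = h @ W + b
    e = np.exp(z - z.max())
    return e / e.sum()

# In training, one alternates between tasks: each gradient step updates the
# shared trunk plus the head of whichever task supplied the example.
x = rng.normal(size=IN)
p_ss, p_tm = forward(x, "ss"), forward(x, "tm")
```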
Summary of data sets.
| Name | Task | Prot Num | AA Num | CV | Composition (%) |
| ss | Secondary structure | 11 765 | 2 518 596 | 5 | 41.7 = C, 21.6 = E, 36.7 = H |
| cb513ss | Secondary structure | 497 | 83 707 | 7 | 42.8 = C, 22.7 = E, 34.5 = H |
| dssp | Secondary structure, DSSP | 11 765 | 2 518 596 | 5 | 33.3 = H, 20.4 = E, 20.1 = L, 11.2 = T, 9.5 = S, 3.5 = G, 1.1 = B, 0.02 = I |
| sar | Relative solvent accessibility | 11 765 | 2 518 596 | 5 | 51.1 = B, 48.9 = A |
| saa | Absolute solvent accessibility | 11 795 | 2 518 596 | 5 | 64.9 = B, 35.1 = A |
| dna | DNA binding | 693 | 127 064 | 3 | 81.2 = N, 18.8 = B |
| sp | Signal peptide | 2816 | 1 058 598 | 10 | 30.8 = O, 4.0 = S, 65.2 = N |
| tm | Transmembrane topology | 1457 | 460 780 | 10 | 82.1 = O, 9.6 = I, 7.5 = M, 0.3 = S, 0.1 = R, 0.4 = N |
| cc | Coiled coil | 765 | 444 138 | 10 | 69.8 = N, 4.3 = each of a/b/c/d/e/f/g |
| ppi | Protein-protein interaction | 1129 | 188 676 | 3 | 73.4 = P, 26.6 = N |
For each data set, we list the number of protein sequences, the number of amino acids, the number of cross-validation folds, and the proportion of amino acids assigned to each label.
Figure 3. Network architecture for training the “natural protein” auxiliary task.
The “natural protein” auxiliary task aims to model the local patterns of amino acids that naturally occur in protein sequences. Using local windows from unlabeled protein sequences as positive examples and randomly modified windows as negative examples, the network learns a feature representation for each amino acid. In contrast to the network illustrated in Figure 1, this network contains only the amino acid embedding module in its first layer. The learned embedding is encoded in the real-valued parameter matrix of the amino acid feature extraction layer.
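Generating the training pairs for this auxiliary task can be sketched as below. The corruption scheme (substituting only the central residue) and the example sequence are assumptions for illustration; the caption says only that negatives are randomly modified windows.

```python
import numpy as np

rng = np.random.default_rng(2)
ALPHABET = list("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids

def training_pairs(sequence, win=13):
    """Yield (natural window, corrupted window) pairs from one sequence."""
    half = win // 2
    for i in range(half, len(sequence) - half):
        pos = list(sequence[i - half:i + half + 1])
        neg = pos.copy()
        # Corrupt the window: swap the central residue for a different random one.
        choices = [a for a in ALPHABET if a != pos[half]]
        neg[half] = rng.choice(choices)
        yield "".join(pos), "".join(neg)

# Hypothetical unlabeled sequence, used only to exercise the generator.
pairs = list(training_pairs("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```

The network is then trained (e.g. with a ranking or classification loss) to score each natural window above its corrupted counterpart, which is what shapes the embedding matrix.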
Comparison of learning strategies based on percent accuracy.
| Embedding? | | ✓ | | ✓ | * | * | * | | | | |
| Multitask? | | | ✓ | ✓ | | | ✓ | | | | |
| Natural protein? | | | | | ✓ | ✓ | ✓ | | | | |
| Task (%) | Single | Embed | Multi | Multi-Emb | NP | NP only | All3 | All3+Vit | p-value | CV | Previous |
| ss | 79.1 | 79.6 | 80.5 | 81.3 | 79.7 | 67.7 | 81.4 | | 1e-4 | 100 | – |
| cb513ss | 76.1 | 74.5 | 79.8 | 80.2 | 74.8 | 65.8 | 80.2 | | 1e-3 | 100 | 80.0 |
| dssp | 65.5 | 66.3 | 67.1 | 68.1 | 66.3 | 54.3 | 68.2 | | 1e-4 | 100 | – |
| sar | 78.4 | 79.8 | 79.2 | 81.0 | 79.8 | 73.1 | 81.0 | | 1e-4 | 100 | – |
| saa | 80.7 | 81.3 | 81.7 | 82.6 | 81.3 | 74.2 | | | 1e-4 | 100 | – |
| dna | 82.4 | 82.2 | 85.3 | 87.0 | 82.3 | 81.1 | 88.6 | | 1e-4 | 66.7 | 89.0 |
| sp | 80.9 | 80.7 | 83.6 | 83.9 | 80.7 | 69.4 | 84.1 | | 1e-4 | 100 | – |
| sp (prot) | 99.5 | 99.5 | 99.8 | 99.8 | 99.8 | 99.8 | 99.7 | 99.8 | 5e-2 | – | 97.0 |
| tm | 87.1 | 87.5 | 89.0 | 89.3 | 87.7 | 85.8 | 89.4 | | 1e-4 | 100 | – |
| tm (seg) | 91.0 | 96.9 | 97.4 | 98.3 | 96.7 | 92.7 | 96.5 | | 1e-4 | – | 94.0 |
| cc | 88.6 | 89.9 | 93.1 | 94.2 | 90.7 | 87.3 | 94.4 | | 1e-4 | 100 | – |
| cc (seg) | 90.7 | 91.9 | 94.5 | 95.6 | 92.0 | 89.7 | 95.7 | | 1e-4 | – | 94.0 |
| ppi | 73.6 | 73.6 | 73.1 | 73.6 | 71.0 | 74.3 | 75.6 | | 1e-4 | 66.7 | – |
The table lists, for each prediction task, the per-residue percent accuracy achieved via single-task training of the neural network with just the PSI-BLAST features (“Single”), single-task training that includes the amino acid embedding (“Embed”), multitask training using just the PSI-BLAST features (“Multi”), multitask training including the amino acid embedding (“Multi-Emb”), multitask training of one task along with the natural protein task (“NP”), multitask training without the PSI-BLAST feature module but with the amino acid embedding initialized from the natural protein task (“NP only”), multitask training including the natural protein task (“All3”), “All3” with Viterbi post-processing (“All3+Vit”) and a previously reported method (“Previous”). Each row corresponds to a single task. The p-value column indicates whether the difference between “Single” and “All3+Vit” is significant, according to a Z-test. The “CV” column, computed from the accuracies of each cross-validation fold separately, gives the percentage of folds in which the “All3+Vit” method outperforms the “Single” method. Rows labeled “(prot)” or “(seg)” report protein- or segment-level accuracy, rather than residue-level accuracy. For the “NP” setting, the “*” in the “Embedding?” row indicates that this network uses the pre-trained embedding layer from the natural protein task.
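The significance test mentioned in this caption can be illustrated with a standard two-proportion Z-test. The counts below are invented for the example, and the paper's exact test statistic may differ in detail:

```python
import math

def z_stat(correct_a, correct_b, n):
    """Two-proportion Z-statistic comparing accuracies over the same n residues."""
    p_a, p_b = correct_a / n, correct_b / n
    pooled = (correct_a + correct_b) / (2 * n)   # pooled success proportion
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    return (p_b - p_a) / se

# Invented counts: method B correct on 81 400 of 100 000 residues vs 79 100 for A.
z = z_stat(79_100, 81_400, 100_000)  # a large z means the gap is unlikely by chance
```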
Figure 4. A learned amino acid embedding.
The figure shows an approximation of a 15-dimensional embedding of amino acids, learned by a neural network trained on the natural protein task. The projection to 2D is accomplished via principal component analysis.
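A projection of this kind takes only a few lines; in the sketch below a random matrix stands in for the learned 15-dimensional embedding, and PCA is done via an SVD of the mean-centered matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
embedding = rng.normal(size=(20, 15))  # stand-in: 20 amino acids x 15 dimensions

# Center each dimension, then take the top two principal axes from the SVD.
centered = embedding - embedding.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords2d = centered @ vt[:2].T  # (20, 2) coordinates, one point per amino acid
```

Scatter-plotting `coords2d`, with one label per amino acid, reproduces the style of visualization shown in the figure.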