| Literature DB >> 29028931 |
Maxat Kulmanov, Mohammed Asif Khan, Robert Hoehndorf, Jonathan Wren.
Abstract
Motivation: A large number of protein sequences are becoming available through the application of novel high-throughput sequencing technologies. Experimental functional characterization of these proteins is time-consuming and expensive, and is often only done rigorously for a few selected model organisms. Computational function prediction approaches have been suggested to fill this gap. The functions of proteins are classified using the Gene Ontology (GO), which contains over 40 000 classes. Additionally, proteins have multiple functions, making function prediction a large-scale, multi-class, multi-label problem.
Year: 2018 PMID: 29028931 PMCID: PMC5860606 DOI: 10.1093/bioinformatics/btx624
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Convolutional neural network architecture. (1) The input of the model is a list of integer indexes of trigrams generated from the protein sequence, together with a vector of size 256 representing the protein in the PPI network. The trigram indexes are passed to an embedding layer which provides a vector representation of size 128 for each trigram. The output of the embedding layer is a matrix of size 1000 × 128, to which we apply convolution and max-pooling. We flatten the output of the max-pooling layer and concatenate the resulting vector with the PPI network embedding. This feature vector is then passed to hierarchically structured classification layers. (2) The hierarchically structured classification layers form a directed acyclic graph following the taxonomic structure of GO for is-a relations. For each GO class we generate one fully connected layer with a sigmoid activation function that predicts whether the input should be classified with this GO class. To ensure consistency, each non-leaf node in the graph uses a maximum merge layer (rounded purple square) which outputs the maximum of the classification results of all child nodes and the internal node's own classification result. The output vector of the model is the concatenation of the maximum merge layers of the internal nodes and the classification layers of the leaf nodes.
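The maximum-merge consistency step in this caption can be sketched in a few lines. The toy GO fragment and scores below are hypothetical, and this pure-Python sketch only illustrates the idea, not the authors' Keras implementation:

```python
# Hypothetical toy GO fragment (is-a children), listed in topological
# order so that every child appears before its parent.
children = {
    "leaf_a": [],
    "leaf_b": [],
    "mid": ["leaf_a", "leaf_b"],
    "root": ["mid"],
}

def max_merge(scores, children):
    """Propagate child scores upward, mirroring the maximum-merge
    layers in Fig. 1: an internal node's output is the maximum of its
    own sigmoid score and its children's (already merged) outputs."""
    merged = dict(scores)
    for node in children:  # children dict is topologically ordered
        for child in children[node]:
            merged[node] = max(merged[node], merged[child])
    return merged

raw = {"leaf_a": 0.9, "leaf_b": 0.2, "mid": 0.4, "root": 0.1}
consistent = max_merge(raw, children)
# After merging, a parent's score is never below any descendant's,
# so predictions respect the GO hierarchy.
```

This is why the model's output cannot annotate a protein with a leaf class while rejecting one of its ancestors.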
Overview of our model’s performance and comparison to BLAST baseline
| Method | BP Fmax | BP AvgPr | BP AvgRc | BP AUC | BP MCC | MF Fmax | MF AvgPr | MF AvgRc | MF AUC | MF MCC | CC Fmax | CC AvgPr | CC AvgRc | CC AUC | CC MCC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BLAST | 0.314 | 0.302 | 0.327 | 0.372 | 0.367 | 0.377 | 0.362 | 0.321 | 0.417 | ||||||
| DeepGOSeq | 0.293 | 0.304 | 0.282 | 0.814 | 0.266 | 0.364 | 0.453 | 0.304 | 0.875 | 0.328 | 0.568 | 0.602 | 0.538 | 0.924 | 0.520 |
| DeepGOFlat | 0.387 | 0.393 | 0.382 | 0.899 | 0.395 | 0.451 | 0.529 | 0.393 | 0.925 | 0.428 | 0.632 | 0.635 | 0.629 | 0.966 | 0.595 |
| DeepGO | 0.412 | 0.379 | 0.896 | 0.397 | 0.577 | 0.397 | 0.928 | 0.438 | 0.643 | 0.624 | 0.967 | 0.592 | |||
| BLAST (selected) | 0.344 | 0.376 | 0.317 | 0.615 | 0.483 | 0.497 | 0.506 | 0.489 | |||||||
| DeepGOSeq (selected) | 0.322 | 0.319 | 0.324 | 0.814 | 0.266 | 0.392 | 0.453 | 0.346 | 0.875 | 0.328 | 0.574 | 0.602 | 0.548 | 0.924 | 0.520 |
| DeepGOFlat (selected) | 0.425 | 0.415 | 0.436 | 0.899 | 0.396 | 0.483 | 0.579 | 0.414 | 0.925 | 0.432 | 0.638 | 0.635 | 0.641 | 0.966 | 0.595 |
| DeepGO (selected) | 0.444 | 0.426 | 0.896 | 0.399 | 0.503 | 0.577 | 0.447 | 0.928 | 0.438 | 0.643 | 0.635 | 0.967 | 0.592 | ||
Note: The DeepGOSeq model uses only sequence information. DeepGOFlat uses both the protein sequence and network interactions as input, but instead of hierarchically structured classification layers it has a single fully connected layer with a sigmoid activation function to generate output predictions. Our final DeepGO model uses sequence and interaction networks with hierarchical classification layers. The first part of the evaluation shows performance results when considering all GO annotations (even those that our model cannot predict), while the second part focuses on the selected terms for which our model can generate predictions. Best-performing models are highlighted in bold.
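The Fmax values in these tables follow the protein-centric, threshold-maximizing evaluation used in CAFA. A minimal sketch, assuming per-protein truth sets and per-class score dictionaries (the function name and data layout are our own, not from the paper):

```python
def fmax(truth, scores, thresholds=None):
    """Protein-centric Fmax (CAFA-style, sketch).

    truth:  {protein: set of true GO classes}
    scores: {protein: {GO class: predicted score in [0, 1]}}
    For each threshold t, precision is averaged over proteins with at
    least one prediction above t, recall over all proteins; Fmax is
    the best F1 over all thresholds.
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(101)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for protein, true_set in truth.items():
            pred = {c for c, s in scores.get(protein, {}).items() if s >= t}
            if pred:
                precisions.append(len(pred & true_set) / len(pred))
            recalls.append(len(pred & true_set) / len(true_set) if true_set else 0.0)
        if not precisions:
            continue  # no protein predicted at this threshold
        pr = sum(precisions) / len(precisions)
        rc = sum(recalls) / len(recalls)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best
```

For example, a model that scores every true class highly and nothing else reaches Fmax = 1.0 at some threshold, even if its precision degrades at other thresholds.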
Performance of our method broken down by organism
| Organism | BP Fmax | BP AvgPr | BP AvgRc | BP AUC | BP MCC | MF Fmax | MF AvgPr | MF AvgRc | MF AUC | MF MCC | CC Fmax | CC AvgPr | CC AvgRc | CC AUC | CC MCC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.40 | 0.41 | 0.39 | 0.89 | 0.40 | 0.48 | 0.59 | 0.41 | 0.93 | 0.45 | 0.63 | 0.64 | 0.62 | 0.96 | 0.59 | |
| Human | 0.42 | 0.46 | 0.39 | 0.89 | 0.42 | 0.51 | 0.64 | 0.42 | 0.94 | 0.46 | 0.60 | 0.58 | 0.61 | 0.96 | 0.56 |
| Mouse | 0.39 | 0.42 | 0.36 | 0.88 | 0.40 | 0.51 | 0.60 | 0.45 | 0.95 | 0.48 | 0.59 | 0.69 | 0.51 | 0.95 | 0.55 |
| Rat | 0.38 | 0.39 | 0.37 | 0.88 | 0.37 | 0.52 | 0.61 | 0.45 | 0.94 | 0.49 | 0.53 | 0.50 | 0.58 | 0.94 | 0.48 |
| Fruit Fly | 0.38 | 0.41 | 0.35 | 0.89 | 0.40 | 0.51 | 0.63 | 0.42 | 0.94 | 0.48 | 0.57 | 0.54 | 0.59 | 0.96 | 0.56 |
| Yeast | 0.45 | 0.46 | 0.43 | 0.93 | 0.46 | 0.42 | 0.49 | 0.37 | 0.91 | 0.38 | 0.57 | 0.55 | 0.59 | 0.96 | 0.56 |
| Fission Yeast | 0.42 | 0.43 | 0.41 | 0.91 | 0.41 | 0.40 | 0.40 | 0.39 | 0.91 | 0.35 | 0.77 | 0.77 | 0.78 | 0.98 | 0.74 |
| Zebrafish | 0.40 | 0.44 | 0.37 | 0.90 | 0.38 | 0.60 | 0.74 | 0.51 | 0.95 | 0.55 | 0.65 | 0.74 | 0.59 | 0.97 | 0.66 |
| 0.37 | 0.40 | 0.34 | 0.90 | 0.38 | 0.39 | 0.45 | 0.34 | 0.90 | 0.36 | 0.69 | 0.71 | 0.67 | 0.98 | 0.62 | |
| 0.40 | 0.42 | 0.38 | 0.93 | 0.42 | 0.40 | 0.47 | 0.35 | 0.93 | 0.38 | 0.73 | 0.76 | 0.70 | 0.99 | 0.66 | |
| Mycobacterium tuberculosis | 0.29 | 0.28 | 0.31 | 0.88 | 0.24 | 0.38 | 0.45 | 0.33 | 0.91 | 0.35 | 0.68 | 0.65 | 0.71 | 0.99 | 0.63 |
| 0.52 | 0.57 | 0.47 | 0.93 | 0.55 | 0.42 | 0.65 | 0.31 | 0.91 | 0.41 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | |
| 0.36 | 0.50 | 0.29 | 0.87 | 0.34 | 0.39 | 0.43 | 0.36 | 0.91 | 0.33 | 0.50 | 0.64 | 0.42 | 0.97 | 0.53 | |
Note: We use the DeepGO model that combines both sequence and network information for this prediction. Best performance values are highlighted in bold.
Evaluation of DeepGO, FFPred3 and GoFDR methods on a CAFA3 preliminary evaluation set
| Method | BP Fmax | BP AvgPr | BP AvgRc | BP AUC | BP MCC | MF Fmax | MF AvgPr | MF AvgRc | MF AUC | MF MCC | CC Fmax | CC AvgPr | CC AvgRc | CC AUC | CC MCC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FFPred3 | 0.26 | 0.30 | 0.23 | 0.83 | 0.23 | 0.38 | 0.35 | 0.86 | 0.29 | 0.44 | 0.46 | 0.43 | 0.89 | 0.39 | |
| GoFDR | 0.20 | 0.27 | 0.15 | 0.61 | 0.00 | 0.36 | 0.84 | 0.40 | 0.40 | 0.41 | 0.72 | 0.31 | |||
| DeepGO | 0.47 | 0.61 | 0.39 | 0.37 | |||||||||||
Evaluation of DeepGO on a dataset split by sequence identity
| Ontology | Fmax | AvgPr | AvgRc | AUC | MCC |
|---|---|---|---|---|---|
| BP | 0.397 | 0.437 | 0.364 | 0.900 | 0.395 |
| MF | 0.403 | 0.495 | 0.339 | 0.908 | 0.359 |
| CC | 0.625 | 0.654 | 0.598 | 0.963 | 0.598 |
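The MCC column in these tables is the standard Matthews correlation coefficient over the binary per-class predictions. A minimal sketch from confusion-matrix counts (the counts in the test example are illustrative, not from the paper):

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.
    Returns 0.0 when any marginal sum is zero (the usual convention)."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Unlike Fmax, MCC accounts for true negatives, which matters here because most GO classes are negative for any given protein.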
Fig. 2. Term-centric performance. These plots show the performance of our model for each term in our subset of GO as a function of the number of proteins in the test set annotated with that term.