Literature DB >> 33323113

Predicting functions of maize proteins using graph convolutional network.

Guangjie Zhou1,2, Jun Wang2, Xiangliang Zhang3, Maozu Guo4, Guoxian Yu5,6,7.   

Abstract

BACKGROUND: Maize (Zea mays ssp. mays L.) is the most widely grown and highest-yielding crop in the world, as well as an important model organism for fundamental research on gene function. The functions of Maize proteins are annotated using the Gene Ontology (GO), which has more than 40,000 terms and organizes them in a directed acyclic graph (DAG). Accurately annotating a Maize protein with the relevant GO terms from such a large number of candidates is a huge challenge. Several deep learning models have been proposed to predict protein function, but their effectiveness remains unsatisfactory; one major reason is that they inadequately utilize the GO hierarchy.
RESULTS: To exploit the knowledge encoded in the GO hierarchy, we propose a deep Graph Convolutional Network (GCN) based model (DeepGOA) to predict GO annotations of proteins. DeepGOA first quantifies the correlations (edges) between GO terms and updates the edge weights of the DAG by leveraging GO annotations and the hierarchy, then learns the semantic representations and latent inter-relations of GO terms by applying a GCN on the updated DAG. Meanwhile, a Convolutional Neural Network (CNN) learns feature representations of amino acid sequences aligned with these semantic representations. DeepGOA then computes the dot product of the two representations, which allows the whole network to be trained end-to-end coherently. Extensive experiments show that DeepGOA effectively integrates GO structural information with amino acid information and thus annotates proteins accurately.
CONCLUSIONS: Experiments on the Maize PH207 inbred line and Human protein sequence datasets show that DeepGOA outperforms state-of-the-art deep learning based methods. An ablation study proves that the GCN can exploit the knowledge of GO and boost performance. Code and datasets are available at http://mlda.swu.edu.cn/codes.php?name=DeepGOA.

Keywords:  Convolutional neural network; GO terms; Gene ontology; Graph convolutional network; Maize; Protein function prediction

Year:  2020        PMID: 33323113      PMCID: PMC7739465          DOI: 10.1186/s12859-020-03745-6

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169


Background

Maize (Zea mays ssp. mays L.) has been subjected to cultivation and selection for over 10,000 years [1, 2]. Advances in sequencing technology have led to a large and rapidly increasing amount of Maize proteomic data (e.g., amino acid sequences and interaction networks). Knowledge of protein sequences is useful for many applications, such as yield and quality improvement and disease resistance. Moreover, understanding the behavior of biological systems also requires determining the functions of proteins [3, 4]. The functional annotation of proteins has not kept pace with the explosion of sequence data. Therefore, accurately annotating the functions of Maize proteins is crucial for all forms of basic and applied research [4-6]. However, because of the bias of botanists' research interests, and because identifying protein function usually requires in vitro or in vivo experiments, only a very small fraction of newly obtained sequences have experimentally validated GO annotations [7-9]. Annotating proteins by wet-lab techniques (e.g., gene knockout and RNAi) is low-throughput and cannot keep pace with the rapid influx of proteomic data. Therefore, automatic methods have become increasingly important [4, 10]. The Gene Ontology (GO) is a controlled vocabulary of terms for describing the biological roles of genes and their products [11], and it has been extensively used as a gold standard [12]. GO annotations of proteins are originally collected from published (or unpublished) experimental data by GO curators. GO includes a great many terms, and each GO term describes a distinct biological concept [13]. If a protein is annotated with a GO term, the protein has the function represented by that term. Furthermore, many proteins do not have a single function but carry out multiple different functions, making automated function prediction (AFP) a multi-label problem.
Additionally, GO contains strong, formally defined relations between GO terms that need to be accounted for when predicting protein function. To date, GO contains over 40,000 terms, covering three sub-ontologies, namely Biological Process (BP), Molecular Function (MF) and Cellular Component (CC). GO structurally organizes each sub-ontology's terms in a directed acyclic graph (DAG). In the DAG, each node corresponds to a GO term and each edge describes the relationship between terms. If a protein is annotated with a term, then the protein is also annotated with that term's ancestor terms (if any). Conversely, if a protein is not annotated with a GO term, the protein is not annotated with any of its descendant terms. This rule is known as the True Path Rule [11, 14]: a child term is a further refinement of the function of its parental term. Figure 1 gives an example of the GO annotations of the Maize protein ‘Zm00008a000131-p01’.
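The True Path Rule propagation described above can be sketched in a few lines of Python; the `parents` map below is a toy fragment of the CC hierarchy around the example of Fig. 1 (the exact edges are illustrative assumptions, not the full GO):

```python
# Sketch of True Path Rule propagation: close a set of annotated GO terms
# under the ancestor relation, given a term -> direct-parents map.
def propagate(annotations, parents):
    closed = set(annotations)
    stack = list(annotations)
    while stack:
        term = stack.pop()
        for p in parents.get(term, ()):
            if p not in closed:
                closed.add(p)
                stack.append(p)   # also visit the parent's own parents
    return closed

# Toy fragment of the CC hierarchy (edges are illustrative):
parents = {
    "GO:0005886": {"GO:0071944", "GO:0016020"},
    "GO:0071944": {"GO:0044464"},
    "GO:0016020": {"GO:0044464"},
    "GO:0044464": {"GO:0005623"},
    "GO:0005623": {"GO:0005575"},
}
ancestors = propagate({"GO:0005886"}, parents)
```

With these toy edges, annotating ‘GO:0005886’ yields the five ancestor terms listed in Fig. 1 as well.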
Fig. 1

An example of hierarchical GO annotations of proteins. ‘Zm00008a000131-p01’ is a Maize protein, it is annotated with ‘GO:0005886’. According to the True Path Rule, the protein ‘Zm00008a000131-p01’ is also annotated with their ancestor terms (‘GO:0071944’, ‘GO:0044464’, ‘GO:0005623’, ‘GO:0016020’ and ‘GO:0005575’)

A protein is typically annotated with multiple GO terms at the same time, since it usually participates in different life processes and executes multiple biological functions. The function of a protein is not isolated: multiple proteins form a biological pathway to implement biological functions, such as apoptosis and nerve impulses. Therefore, protein function prediction can be regarded as a multi-label learning problem [15-18]. However, due to the large amount of un-validated GO annotations of proteins, existing multi-label learning based function prediction methods face the issue of insufficient annotations and massive numbers of candidate GO terms. Furthermore, deep terms in the GO DAG describe more refined biological functions, while shallow terms describe broad functions. The missing GO annotations of proteins usually correspond to deep terms, which makes accurately predicting GO annotations more difficult than traditional multi-label learning. Some efforts have been made toward utilizing the knowledge of GO. To name a few, Valentini [14] adjusted the predictions made by per-term binary classifiers using the GO hierarchy. Pandey et al. [19] first defined a taxonomic similarity through the knowledge of the GO hierarchy, used it to measure the correlations between GO terms, and then improved the prediction of deep GO terms via these correlations. Yu et al. [18] viewed the GO structure as a graph and applied downward random walks (dRW) on the GO hierarchy.
This method used the terms already annotated to a protein as the initial walkers to predict new GO annotations of that protein, and also identified its negative GO annotations [20]. Yu et al. [21] introduced a hybrid graph based on dRW, composed of two types of nodes (proteins and GO terms), to encode interactions between proteins, the GO hierarchy and the available annotations of proteins, and then predicted GO annotations through a bi-random walk algorithm on the hybrid graph. Recently, Zhao et al. [22, 23] used a hierarchy-preserving hashing technique to keep the hierarchical order between GO terms, optimized a series of hashing functions to encode massive numbers of GO terms via compact binary codes, and then performed protein function prediction in the compressed hashing space, obtaining promising accuracy. All the above methods can be regarded as shallow solutions, which have difficulty mining the deep (non-linear) relationships between proteins and GO terms. In recent years, deep learning has significantly advanced image recognition and speech recognition [24]. The huge and complex output space is a big challenge faced by deep learning models in protein function prediction. Wehrmann et al. [25] established a series of fully connected neural networks for the GO terms at different levels of the GO hierarchy, using each network as a classifier to predict a certain number of GO terms separately. Since the frequency with which GO terms at the same level annotate proteins also varies, which impacts the performance of a deep model, Zilke et al. [26] grouped GO terms based on their level and number of annotations and established a fully connected neural network for each group. Based on the fully connected neural network, Rifaioglu et al.
[27] used conjoint triad [28], pseudo amino acid composition [29] and subsequence profile map [30] features to represent protein sequences, which further improves prediction accuracy. These two deep learning based approaches separate GO terms into groups and thus cannot respect the connections between GO terms that fall in different groups. Kulmanov et al. [31] first utilized Convolutional Neural Networks to encode amino acids and incorporated the GO structure into the output layer. They generated a fully connected layer with a Sigmoid activation function for each GO term, which predicts whether the protein should be annotated with that term, and further used a maximum merge layer, which outputs the maximum of the classification results of all child nodes and the internal node, to predict non-leaf terms in the GO DAG. Kulmanov et al. [32] later removed the maximum merge layers and increased the number of convolution kernels to obtain better prediction accuracy. These deep models optimistically assume that they are suitable for multiple GO terms; in fact, they do not make good use of the hierarchical relationships between GO terms and still suffer from the gap between amino acids and GO annotations, often termed the semantic gap in image classification [33]. In this paper, we use a deep neural network to learn the knowledge of the Gene Ontology and to reduce the semantic gap between amino acid sequences and GO annotations. Particularly, the proposed DeepGOA extracts feature vectors of amino acids using a Convolutional Neural Network (CNN), and learns semantic representations of GO terms with a Graph Convolutional Network (GCN) [34], referring to the GO hierarchy and the known annotations related to these GO terms. Then, DeepGOA learns a mapping from sequence features to the semantic space of GO terms.
The mapping is learned by a multi-layer neural network, guided by the known GO annotations of proteins. We observe that DeepGOA outperforms existing state-of-the-art methods [27, 31, 32, 35] on the Maize PH207 inbred line and Human protein sequence datasets, and in addition retains more GO structural information. It is important to highlight that, to the best of our knowledge, deep learning models incorporating the Gene Ontology structure are still rarely studied in computational protein function prediction. A short conference version of DeepGOA [36], as a showcase of CNN and GCN for mining amino acids and the Gene Ontology for protein function prediction, was published as part of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2019). In this extended version, we updated the background, problem definition, method description, results and their analysis.

Results and discussion

In this section, we briefly introduce several widely-used protein function prediction evaluation criteria for performance comparison, and the recommended configuration of experiments. Then, we analyze and discuss the experimental results and compare our results with related and competitive approaches.

Evaluation metrics

For a comprehensive evaluation, we use five widely-used evaluation metrics: AUC, AUPRC, PR50, Fmax and Smin [37]. AUPRC (area under the precision-recall curve) and AUC (area under the receiver operating characteristic curve) are widely adopted for binary classification. Here we compute the AUPRC and AUC for each term and then take the average over all terms. AUPRC is more sensitive to class imbalance than AUC. PR50 is the average precision over all GO terms at a recall of 50%. Fmax is the overall maximum harmonic mean of precision and recall across all possible thresholds on the predicted protein-term association matrix. Smin uses information-theoretic analogs of precision and recall based on the GO hierarchy to measure the minimum semantic distance between the predictions and the ground truths across all possible thresholds. The first three metrics are term-centric and the last two are protein-centric. These metrics quantify the performance of protein function prediction from different perspectives, and it is difficult for one approach to outperform another consistently across all of them. It is worthwhile to point out that, unlike the other evaluation metrics, a smaller Smin indicates better performance.

Fmax is a protein-centric F-measure computed over all prediction thresholds θ. Let T_i and P_i(θ) denote the true and predicted (at threshold θ) GO terms of the i-th protein, and let TP_i = |T_i ∩ P_i(θ)|, FP_i = |P_i(θ) \ T_i| and FN_i = |T_i \ P_i(θ)| be the numbers of true positives, false positives and false negatives for that protein; N is the total number of proteins. The average precision and recall at threshold θ are

    pr(θ) = (1 / m(θ)) Σ_i TP_i / (TP_i + FP_i),    rc(θ) = (1 / N) Σ_i TP_i / (TP_i + FN_i),

where m(θ) is the number of proteins for which at least one term is predicted with probability greater than or equal to θ; pr(θ) is the average precision of those m(θ) proteins and rc(θ) the average recall of all proteins at threshold θ. Then, over all possible thresholds,

    Fmax = max_θ { 2 · pr(θ) · rc(θ) / (pr(θ) + rc(θ)) }.

Smin computes the semantic distance between real and predicted annotations based on the information content of the terms, where the information content IC(t) is computed by Eq. (14). It is computed using

    ru(θ) = (1 / N) Σ_i Σ_{t ∈ T_i \ P_i(θ)} IC(t),    mi(θ) = (1 / N) Σ_i Σ_{t ∈ P_i(θ) \ T_i} IC(t),

    Smin = min_θ sqrt( ru(θ)² + mi(θ)² ),

where P_i(θ) is the set of terms whose predicted probability is greater than or equal to θ and T_i is the set of true annotations.
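The protein-centric Fmax described above can be sketched in NumPy as follows; the function and variable names are ours, not the paper's code, and the threshold grid is an illustrative choice:

```python
import numpy as np

def fmax(Y, S, thresholds=np.linspace(0.0, 1.0, 101)):
    """Protein-centric Fmax: Y is the true binary label matrix
    (proteins x terms), S holds predicted scores in [0, 1]."""
    best = 0.0
    for t in thresholds:
        P = S >= t
        covered = P.sum(axis=1) > 0             # proteins with >= 1 prediction
        if not covered.any():
            continue
        tp = (P & (Y == 1)).sum(axis=1)
        # precision averaged over the m(t) covered proteins only
        pr = (tp[covered] / P[covered].sum(axis=1)).mean()
        # recall averaged over all N proteins
        rc = (tp / np.maximum(Y.sum(axis=1), 1)).mean()
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best
```

Smin can be computed analogously by replacing the counts with information-content sums over the set differences T_i \ P_i(θ) and P_i(θ) \ T_i.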

Experimental setup

Our approach is implemented on the PyTorch platform https://pytorch.org/. We conduct experiments on the GO annotations and amino acid sequences of Maize and Human. We first sort GO terms in descending order by the number of proteins annotated with each term, and then select the most frequent terms for our experiments. Particularly, we select 117, 251 and 112 GO terms in BP, CC and MF for the experiments on Maize, and 1190, 661 and 540 GO terms in BP, MF and CC for the experiments on Human. After that, we use the information content of each GO term and the frequency of term annotations to convert the selected GO terms into term matrices and adjacency matrices. Meanwhile, we convert each amino acid into a one-hot encoding and use the combination of one-hot vectors to represent the protein sequence. We then train the CNN and GCN on a graphics processing unit (GPU). Finally, we fuse the two networks to predict association probabilities, and train them using the annotation information of the training protein sequences. In the following experiments, we randomly partition the proteins into a training set (80%) and a validation set (20%). All experiments are performed on a server with the following configuration: CentOS 7.3, 256 GB RAM, Intel Xeon E5-2678 v3 and NVIDIA GK110BGL (Tesla K40).

Results of protein function prediction

For experimental validation, we compare DeepGOA against Naive [4, 10], BLAST [35], Deepred [27], DeepGO [31] and DeepGOPlus [32]. Naive assigns the same GO terms to all proteins based on annotation frequencies. The idea of BLAST is to find similar sequences in the training data and transfer GO terms from the most similar ones. All input parameters are the same as those reported by the authors or optimized within the recommended ranges. Since DeepGOPlus has too many parameters to run in our experimental environment, we reduce its number of convolution kernels from 512 to 128. Table 1 reports the prediction results of DeepGOA and the comparing methods over 10 rounds of independent partitions.
Table 1

Experimental results of predicting GO annotations of Maize and Human genome

                    Maize                                       Human
                    PR50   AUC    AUPRC  Smin    Fmax           PR50   AUC    AUPRC  Smin     Fmax
BP  DeepGOA         82.44  89.56  70.18  0.9965  67.53          55.20  69.79  62.20  19.7772  38.52
    DeepGOPlus      76.47  89.52  69.64  1.2193  59.61          53.61  68.74  60.75  20.0152  36.23
    DeepGO          64.41  85.39  62.91  1.2586  59.49          50.25  63.85  57.13  20.6061  32.71
    Deepred         67.39  84.95  63.02  1.3509  58.21          55.60  68.33  56.96  19.9538  38.07
    BLAST           32.61  71.77  28.96  1.1745  61.10          46.50  57.72  48.94  20.2695  33.92
    Naive           27.08  49.93  27.67  1.8957  29.32          51.94  49.98  56.61  20.4729  34.45
CC  DeepGOA         96.33  87.73  75.78  0.6603  75.74          50.88  75.69  49.97   4.9029  62.92
    DeepGOPlus      91.21  82.51  77.84  0.8105  70.82          50.18  65.15  48.70   4.9488  62.75
    DeepGO          86.57  82.91  72.07  0.7759  71.08          43.60  69.51  44.81   5.1828  58.86
    Deepred         84.77  86.74  73.86  0.6952  69.85          44.58  75.94  44.58   5.8166  61.77
    BLAST           39.48  70.82  39.18  0.7904  62.02          21.25  56.27  26.91   5.0593  44.18
    Naive           48.14  49.98  43.74  1.2458  49.84          36.27  48.69  37.70   5.4474  55.15
MF  DeepGOA         83.63  92.51  68.63  1.7024  58.10          68.64  82.03  70.98   4.7571  47.71
    DeepGOPlus      72.70  83.67  64.42  1.6777  51.25          67.84  81.86  69.38   4.8426  46.82
    DeepGO          68.78  88.22  59.91  1.8551  52.82          54.56  75.98  62.47   5.2581  40.43
    Deepred         62.89  89.73  57.65  2.287   45.49          62.68  81.30  62.01   5.1711  45.14
    BLAST           27.40  67.76  32.92  1.8274  51.40          42.33  62.34  46.11   4.9195  41.07
    Naive           28.44  51.04  28.84  2.7430  26.13          46.86  49.87  52.77   5.7466  32.59

The best results for each metric are in boldface

Among the five evaluation metrics, DeepGOA consistently achieves better performance than the comparing methods. The improvement of DeepGOA over them with respect to AUPRC and PR50 is particularly prominent, which shows that by introducing the GO structure DeepGOA can effectively deal with the imbalance of GO terms. Besides, DeepGOA performs better on the Maize protein dataset than on the Human one, because the annotations of Maize proteins are sparser than those of Human proteins: through the introduction of the GO structure, DeepGOA achieves better performance on relatively sparse data than the other methods, and the semantic representation of GO terms helps this effectiveness. DeepGO uses the structure between parent and child terms in its final output layer, but still falls behind DeepGOA, which shows that the GCN we adopt for GO hierarchy representation learning is more effective. DeepGOPlus does not use any GO structural information, yet gains better performance than DeepGO. This fact suggests that the structural regularization in the final layer of DeepGO does not make full use of the GO hierarchy. The performance margin between DeepGOA and DeepGOPlus again indicates the effectiveness of our coherent learning of the semantic representations of GO terms and the feature representations of amino acids. Deepred does not use a convolutional structure to learn the local features of the sequence but learns the protein sequence with fully connected layers. Due to the sparseness of protein annotations, this method produces many false-negative predictions, resulting in a relatively high AUC but a poor AUPRC.
The AUC of Naive is always close to 0.5, since it predicts the GO annotations of a protein based on the frequency of GO terms and tends to assign the most frequent GO terms to every protein. Mostly, BLAST is inferior to the other comparing methods (except Naive). This fact demonstrates the effectiveness of learning representations of amino acids via CNN for protein function prediction. We choose one protein (Zm00008a011322-p01) from our Maize protein dataset to illustrate the effectiveness of DeepGOA in the CC sub-ontology. Table 2 lists the GO annotations predicted by DeepGOA and the other competing deep learning methods. The real annotations have been supplemented via the True Path Rule. Due to its maximum merge layers, when DeepGO annotates a GO term to a protein it automatically annotates all ancestor terms of that term to the protein as well; but these maximum merge layers increase the false positive rate of the model. Compared with DeepGO, DeepGOplus uses a more reasonable convolutional structure and can mine deep terms. However, it cannot achieve the expected performance on strongly correlated GO terms because it ignores GO structural information. Deepred attempts to learn the overall features of the sequence with a fully connected network, which leads to many annotations not being predicted. These results again confirm that DeepGOA performs better than the other compared methods.
Table 2

The prediction of the Maize protein (Zm00008a011322-p01) with different methods

    Real annotation  DeepGOA      DeepGOplus   DeepGO       Deepred
CC  GO:0005622       GO:0005622   GO:0005622   GO:0005622   GO:0005622
    GO:0044464       GO:0044464   GO:0044464   GO:0044464   GO:0044464
    GO:0005623       GO:0005623   GO:0005623   GO:0005623   GO:0005623
    GO:0044424       GO:0044424   GO:0044424   GO:0044424
    GO:0043229       GO:0043229   GO:0005737   GO:0043229
    GO:0005737       GO:0005737   GO:0005737
    GO:0043231       GO:0043231   GO:0043231
    GO:0043227       GO:0043227
    GO:0005634

Component and hyper-parameters analysis

In order to investigate which components of DeepGOA contribute to its improved performance, we introduce three variants: DeepGOA-GO, which only uses the GO hierarchy; DeepGOA-Label, which only uses the co-annotation patterns without the GO hierarchy; and DeepGOA-CNN, which directly uses the representation of amino acids and the dot product to make function predictions, without using the semantic representation of GO terms. Table 3 lists the results of DeepGOA and its three variants on the Human genome. The experimental configuration is the same as in the previous section.
Table 3

Prediction results of DeepGOA and its variants

                     AUC    AUPRC  Smin     Fmax
BP  DeepGOA          69.79  62.20  19.7772  38.52
    DeepGOA-GO       69.72  60.69  20.1579  36.79
    DeepGOA-Label    70.12  61.72  20.2206  38.14
    DeepGOA-CNN      69.19  61.06  20.2332  36.12
CC  DeepGOA          75.69  49.97   4.9029  62.92
    DeepGOA-GO       75.94  48.64   4.9127  62.43
    DeepGOA-Label    76.83  55.87   4.9707  62.67
    DeepGOA-CNN      74.85  49.19   5.0134  61.43
MF  DeepGOA          82.03  70.98   4.7571  47.71
    DeepGOA-GO       81.75  70.28   4.8201  46.98
    DeepGOA-Label    81.46  70.81   4.9661  46.88
    DeepGOA-CNN      77.65  63.12   5.2867  41.54

The best results for each metric are in boldface

DeepGOA generally performs better than its three variants, owing to the contribution of more valid information. Under the same experimental setting, DeepGOA-GO and DeepGOA-Label perform better than DeepGOA-CNN. This observation proves that it is important and beneficial to learn the semantic representation of GO terms and to optimize the mapping from the feature representation of amino acids to that semantic representation. DeepGOA-GO achieves better results than DeepGOA-Label with respect to Smin, since it utilizes the GO hierarchy, with respect to which Smin is defined, while DeepGOA-Label mainly uses the co-annotation pattern of GO terms on the same proteins. On the other hand, DeepGOA-Label obtains better AUPRC and AUC by modeling GO term co-annotation. DeepGOA leverages both the GO hierarchy and the co-annotation pattern of GO terms, and thus obtains better results than the three variants. This ablation study further confirms the necessity of incorporating the GCN for exploring and exploiting the latent hierarchical relationships between GO terms, and thus improving prediction accuracy. DeepGOA computes the predicted association probabilities as the dot product of the low-dimensional representations of the amino acid sequences and of the GO terms. If the dimensionality of these low-dimensional representations is too low, effective information is lost; if it is too high, many parameters are generated, degrading training efficiency. Figure 2 shows that, on the CC sub-ontology of the Maize data, the AUPRC and AUC of DeepGOA increase as the dimension grows from 16 to 256 and then stabilize. In our experiments, in order to accommodate more GO terms while avoiding wasting computing resources, we chose 128 as the low-dimensional vector dimension.
Fig. 2

The AUC and AUPRC under different values of low-dimensional vector dimension


Conclusions and future work

Protein function prediction is one of the fundamental challenges of the post-genomic era. The firmly and formally defined relationships between functions encoded in the GO structure can improve prediction performance. To this end, we developed DeepGOA based on a GCN and a CNN. DeepGOA utilizes the GCN to learn semantic representations of GO terms from the GO hierarchy and the annotations related to them, and the CNN to learn representations of amino acid sequences by combining their long- and short-range features. DeepGOA then jointly seeks the mapping from the amino acid feature representations to the GO term semantic representations, and completes protein function prediction in an end-to-end and coherent manner. Experimental results on archived GO annotation datasets of Maize and Human show that DeepGOA outperforms existing deep learning based protein function prediction models. Our ablation study further confirms that learning semantic representations of GO terms is beneficial for function prediction. We will extend our work to predict the functional roles of diverse protein isoforms and noncoding RNAs.

Methods

In protein function prediction, effectively mining the GO hierarchy and the known annotations is important [12, 13, 22, 23]. The semantic and structural information of GO can largely assist computational models in determining the functions of proteins. Recently, deep learning has been widely used in the field of protein function prediction [25, 26, 31]. However, properly using the knowledge of GO in deep models remains a huge challenge. Most deep models simply try to learn a direct mapping from protein sequences to GO terms, without respecting the GO hierarchy when optimizing the mapping. Different from these methods, DeepGOA first learns semantic representations of Gene Ontology terms via a GCN while simultaneously optimizing representations of protein sequences through a CNN. DeepGOA then computes the dot product of the outputs of these two sub-networks to learn the mapping from feature representations to semantic representations in an end-to-end style. At the same time, it utilizes the collected annotations of proteins and back propagation to refine the mapping coefficients and to obtain coherent representations. Figure 3 illustrates the basic architecture of our model.
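The dot-product fusion of the two sub-networks can be sketched in PyTorch as follows; the tensor names, dimensions and random inputs are illustrative assumptions, not the released DeepGOA code:

```python
import torch

# Protein embeddings from the CNN branch (n x d) are matched against
# GO-term embeddings from the GCN branch (c x d) via a single dot product.
n_proteins, n_terms, d = 4, 6, 128
protein_emb = torch.randn(n_proteins, d)   # stand-in for the CNN output
term_emb = torch.randn(n_terms, d)         # stand-in for the GCN output

logits = protein_emb @ term_emb.t()        # (n_proteins, n_terms)
probs = torch.sigmoid(logits)              # association probabilities

# A multi-label loss against the binary annotation matrix back-propagates
# through this one dot product into both sub-networks, training them
# end-to-end coherently.
target = torch.randint(0, 2, (n_proteins, n_terms)).float()
loss = torch.nn.functional.binary_cross_entropy(probs, target)
```

Because the gradient of the loss flows through `probs` into both `protein_emb` and `term_emb`, the sequence features and the term semantics are refined jointly rather than in separate stages.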
Fig. 3

The network architecture of DeepGOA. The upper yellow subnetwork is the convolutional part: features of the amino acid sequence are extracted by convolution kernels of different sizes, and a fully connected layer learns the mapping from sequence features to the semantic representations of GO terms. The lower blue subnetwork is the graph convolution part: it uses the GO hierarchy and the empirical correlations between GO terms stored in the correlation matrix to learn the semantic representation of each GO term. The dot product is finally used to guide the mapping between proteins and GO terms and to reversely adjust the representations of both. In this way, the associations between GO terms and proteins are also predicted


Datasets

For our experiments, we downloaded the Gene Ontology data (June 2019) from the official GO site. The GO data has three branches and 44,786 terms: 4169 terms in CC, 29,462 in BP and 11,155 in MF. We use the Maize PH207 inbred line [38] sequence dataset to evaluate our approach and, to demonstrate the generality of our model, also the Human protein sequence dataset. We collected the protein sequences and GO annotation data of the Maize PH207 inbred line from Phytozome; this dataset contains 18,533 protein sequences annotated with one or more GO terms. We collected the reviewed, manually annotated Human protein sequences with GO annotations from SwissProt, which contains 20,431 protein sequences. For each sub-ontology of GO, we train a separate model to learn the knowledge of the GO structure. Particularly, we rank GO terms by their number of annotations and select terms with at least 25, 150 and 25 annotations for CC, BP and MF, respectively. These cutoff values are only half of those used by DeepGO [31], and thus our datasets include many more deep GO terms, which describe more refined biological functions. Then, we propagate annotations by applying the True Path Rule: if a protein is annotated with a GO term, it is also annotated with all of that term's ancestors. We convert the annotations of each protein into a binary label vector. If a protein sequence is annotated with a GO term from our list of selected terms, we assign 1 to that term's position in the binary vector and use the protein as a positive sample for this GO term; otherwise, we assign 0 and use it as a negative sample. In model training, we exclude proteins not annotated with any of the selected GO terms. In this paper, n denotes the number of proteins in the training set and c the number of selected GO terms.
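The label-matrix construction described above can be sketched as follows; the term list and annotations are toy stand-ins, not the actual datasets:

```python
import numpy as np

# Build the binary label matrix Y (kept proteins x selected terms).
terms = ["GO:0005622", "GO:0044464", "GO:0005623"]   # toy selected terms
term_idx = {t: i for i, t in enumerate(terms)}

annotations = {                                       # toy annotations
    "protein_a": {"GO:0005622", "GO:0044464"},
    "protein_b": {"GO:0005623"},
    "protein_c": set(),   # annotated with no selected term -> excluded
}

# Exclude proteins not annotated with any selected GO term.
kept = [p for p, ts in annotations.items() if ts & set(terms)]

Y = np.zeros((len(kept), len(terms)), dtype=np.int8)
for row, p in enumerate(kept):
    for t in annotations[p] & set(terms):
        Y[row, term_idx[t]] = 1   # 1 = positive sample for this term
```

Each row of `Y` is the binary label vector of one protein; zeros serve as negative samples for the corresponding terms.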

Extracting features from amino acids via CNN

Computers cannot directly process amino acid sequences, and different proteins have different peptide-chain structures and numbers of amino acids, so each sequence must be numerically encoded while retaining its characteristics. Kulmanov et al. [32] confirmed that one-hot encoding works well as input to deep networks, so the input of our model is the one-hot encoding of amino acids. Each amino acid is represented by a one-hot vector of length 21: twenty positions for the twenty standard amino acids, plus one additional position for undetermined amino acids that occur at certain sequence positions. We transform each amino acid into a one-hot vector and stack these vectors to represent the primary structure of a protein. To make the model inputs equal in length, we keep the first 2000 amino acids of sequences longer than 2000 and zero-pad sequences shorter than 2000, which yields a feature matrix of size 2000×21 for each protein. Each amino acid sequence can thus be represented by a matrix

X_i = [x_{i,1}, x_{i,2}, …, x_{i,2000}]^T ∈ {0,1}^{2000×21}

where X_i represents the i-th protein in the dataset and x_{i,j} is the one-hot encoding of the j-th amino acid of the i-th protein.

For each protein feature matrix, we use a CNN to learn a low-dimensional representation. A Convolutional Neural Network (CNN) is a feedforward neural network with convolutional computation and a deep structure; it is one of the representative deep learning algorithms and has a strong ability to extract features from fixed-size inputs. We therefore use a convolutional network to extract features from amino acid sequences and mine the deep information they contain. In addition, a protein has not only a primary structure but also a secondary structure (α-helices and β-sheets) and a tertiary structure, so amino acids that are adjacent in the sequence do not necessarily participate in a biological function together. To capture the impact of secondary and tertiary structure on function, we choose four sizes of 1D convolution kernels (8, 16, 24 and 32) and set different sliding steps. The convolution part takes X as input and extracts protein sequence features with this series of differently sized kernels. A convolution kernel is W_c ∈ R^{h×21}, where h is the sliding-window length, and the convolution operation is defined as

c_j = f(W_c ∗ x_{j:j+h−1} + b_c)

where ∗ is the convolution operation, W_c is a convolution kernel, f(·) is a non-linear activation, x_{j:j+h−1} is a window of the input matrix X, and k is the input length. The new feature vector is

c = [c_1, c_2, …, c_p], with p = k − h + 1.

In this way, we obtain the feature representation of each protein. Since our deep network has many parameters and the loss function is optimized on the training data, the network easily reaches high precision on the training data but poor results on the test data; the unequal lengths of protein sequences and the huge output space make over-fitting likely. To address this, we add two dropout layers in the fully connected part of the convolution module. A dropout layer deactivates each neuron with a certain probability p during forward propagation, which improves generalization by preventing the model from relying too much on local features. Protein function prediction is a multi-label learning problem, and the activation function easily falls into its saturation region, causing vanishing gradients. To address this, a batch normalization layer is added after the convolution layer; it normalizes the feature maps produced by the convolution layer so that activations approximately follow a normal distribution.
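The encoding and convolution steps above can be illustrated with a minimal NumPy sketch. The alphabet ordering, kernel count, and initialization are illustrative assumptions, not the paper's implementation; the sketch only shows the 2000×21 one-hot encoding and the per-kernel operation c_j = f(W ∗ x_{j:j+h−1} + b) followed by global max pooling.

```python
import numpy as np

# 20 standard amino acids plus one extra symbol 'X' for undetermined residues.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY" + "X"
AA_INDEX = {aa: i for i, aa in enumerate(ALPHABET)}
MAX_LEN = 2000

def one_hot_encode(seq, max_len=MAX_LEN):
    """Encode a sequence as a max_len x 21 matrix: truncate long
    sequences, zero-pad short ones, map unknown residues to 'X'."""
    X = np.zeros((max_len, len(ALPHABET)), dtype=np.float32)
    for j, aa in enumerate(seq[:max_len]):
        X[j, AA_INDEX.get(aa, AA_INDEX["X"])] = 1.0
    return X

def conv1d_maxpool(X, W, b):
    """One 1D convolution (kernel W of shape h x 21), then ReLU and
    global max pooling, following c_j = f(W * x_{j:j+h-1} + b)."""
    h, k = W.shape[0], X.shape[0]
    c = np.empty(k - h + 1)              # p = k - h + 1 positions
    for j in range(k - h + 1):
        c[j] = np.sum(W * X[j:j + h]) + b
    return np.maximum(c, 0.0).max()      # ReLU, then global max pool

rng = np.random.default_rng(0)
X = one_hot_encode("MKV" * 1000)         # 3000 residues -> truncated to 2000
# Four kernel widths, a few kernels each (toy sizes for illustration).
features = [conv1d_maxpool(X, rng.standard_normal((h, 21)) * 0.1, 0.0)
            for h in (8, 16, 24, 32)
            for _ in range(4)]
print(X.shape, len(features))            # (2000, 21) 16
```

Concatenating the pooled outputs of all kernels gives the fixed-length sequence feature vector that the dense layers (with dropout and batch normalization) consume.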

Graph convolutional network

Many existing protein function prediction methods exploit the GO structure (or correlations) between terms and show improved performance [21, 22, 31]. However, incorporating the GO structure into a deep model is challenging: traditional deep learning models perform poorly on graph-structured data, because they are designed for grids or simple sequences such as images and text. A Graph Convolutional Network (GCN) [34] can learn node representations of a graph (or network) from the graph structure. The core idea of GCN is to generate representations of GO terms by propagating information between GO terms over their neighborhoods. Unlike standard convolution, which operates on fixed-size inputs, GCN takes the one-hot feature descriptions of GO terms and the corresponding correlation matrix A as input and updates the term representations. The operation of a GCN layer is defined as

H^{(l+1)} = f(Â H^{(l)} W^{(l)})

where Â is the normalized version of the correlation matrix A, which will be given later, f(·) is a non-linear activation, and W^{(l)} is a transformation matrix to be learned. By stacking GCN layers, we can learn deeper information about GO terms on the GO DAG. The frequency with which two terms are annotated to the same protein is often used to estimate the correlation between GO terms, and this has been widely adopted in multi-label-learning-based protein function prediction [15-17]. However, this simple estimate cannot reflect the underlying correlation well, because the available annotations of proteins are imbalanced and incomplete. Furthermore, the GO hierarchy is independent of any particular species, yet it provides important guidance for accurate protein function prediction; this simple estimation process overlooks it. In the Gene Ontology, deeper terms describe more refined biological functions.
Therefore, the difference in information content between GO terms is also key to estimating their correlation. Given that, we combine the GO hierarchy and the collected annotations of proteins to estimate the correlation between a parental term t and its child term s as

A_{ts} = (n_s / n_t) · (IC(s) / Σ_{s′ ∈ ch(t)} IC(s′))

where ch(t) is the set of all direct child terms of t, and n_s and n_t represent the numbers of proteins annotated with term s and term t, respectively. IC(t) is the information content of t, measured as

IC(t) = −log(|desc(t)| / N)

where desc(t) includes all the descendants of t and t itself, and N is the total number of GO terms. The semantic similarity between GO terms is widely measured using this type of information content [20, 39, 40]. Since t has many descendant GO terms, each conveying a more specific biological function than t, the larger desc(t) is, the smaller the information content of t. This GO-structure-based measure is independent of the known GO annotations of proteins, so it is less affected by their incompleteness and sparsity. In this way, we can differentiate the edges between parental terms and their child terms.
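The edge-weight estimation and one GCN propagation step can be sketched with NumPy on a toy GO fragment. This is a minimal sketch under stated assumptions: it assumes a structure-based information content of the form IC(t) = −log(|desc(t)|/N) and an edge weight combining the annotation frequency n_s/n_t with an information-content ratio; the paper's exact formulas may differ, and the toy graph and counts are invented for illustration.

```python
import numpy as np

# Toy GO fragment: term 0 is the root, 1 and 2 are its children,
# 3 is a child of 1. Edges go parent -> child.
children = {0: [1, 2], 1: [3], 2: [], 3: []}
n_terms = 4
n_annot = np.array([100.0, 60.0, 40.0, 25.0])  # proteins annotated per term

def descendants(t):
    """All descendants of t, including t itself."""
    out = {t}
    for c in children[t]:
        out |= descendants(c)
    return out

# Structure-based information content: the more descendants a term
# has, the less specific it is (assumed form, see lead-in).
IC = np.array([-np.log(len(descendants(t)) / n_terms) for t in range(n_terms)])

# Edge weights between a parent t and each child s, combining annotation
# frequency with information content (assumed form, see lead-in).
A = np.eye(n_terms)                      # self-loops on the diagonal
for t, chs in children.items():
    ic_sum = sum(IC[s] for s in chs) or 1.0
    for s in chs:
        A[t, s] = A[s, t] = (n_annot[s] / n_annot[t]) * (IC[s] / ic_sum)

# One GCN propagation step H' = f(A_hat H W) with symmetric normalization.
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
A_hat = D_inv_sqrt @ A @ D_inv_sqrt
H = np.eye(n_terms)                      # one-hot term features
W = np.random.default_rng(0).standard_normal((n_terms, 8)) * 0.1
H_next = np.maximum(A_hat @ H @ W, 0.0)  # ReLU
print(H_next.shape)                      # (4, 8)
```

Note that the root has IC = 0 (it subsumes every term), while leaves carry the largest information content, so edges to more specific children receive larger weights, as the text intends.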

DeepGOA classifier learning

Till now, we have obtained the representation H of GO terms via the GCN, and the representation Z of the n protein sequences (after the dense layer of the CNN in Fig. 3) in the d-dimensional semantic space encoded by H. Finally, we take the dot product of Z and H as the predicted associations:

Ŷ = Z H^T

Since predicting the association between a GO term and a protein is a binary problem, and the semantic representation H already encodes the latent relationships between GO terms, our multi-label loss function can be defined by cross-entropy as

L = −Σ_s [ y_s log(σ(ŷ_s)) + (1 − y_s) log(1 − σ(ŷ_s)) ]

where y stores the ground-truth annotations of a protein, y_s ∈ {0,1} denotes whether GO term s is annotated to the protein or not, and σ(·) is the sigmoid activation function. By minimizing this loss and back-propagating it both to the subnetwork that learns H and to the subnetwork that learns Z, we optimize H and Z, and achieve protein function prediction in the semantic space, in a coherent end-to-end fashion.
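The dot-product scoring and multi-label cross-entropy loss above can be sketched as follows. The dimensions are toy values and the random representations stand in for the CNN and GCN outputs; only the Ŷ = Z Hᵀ scoring and the per-term binary cross-entropy are the point of the sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n, d, C = 5, 8, 4                  # proteins, semantic dim, GO terms (toy sizes)
Z = rng.standard_normal((n, d))    # protein representations (CNN side)
H = rng.standard_normal((C, d))    # GO-term representations (GCN side)

# Predicted association scores: dot product of the two representations,
# mapped to probabilities with the sigmoid.
P = sigmoid(Z @ H.T)               # shape (n, C)

# Multi-label cross-entropy against binary ground-truth annotations.
Y = rng.integers(0, 2, size=(n, C)).astype(float)
eps = 1e-12                        # numerical guard inside the logs
loss = -np.mean(Y * np.log(P + eps) + (1 - Y) * np.log(1 - P + eps))
print(P.shape, loss > 0)           # (5, 4) True
```

Because both Z and H enter the loss through the same dot product, gradients flow back into the CNN and GCN subnetworks simultaneously, which is what makes the end-to-end training coherent.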
References (18 in total)

1.  Detecting protein function and protein-protein interactions from genome sequences.

Authors:  E M Marcotte; M Pellegrini; H L Ng; D W Rice; T O Yeates; D Eisenberg
Journal:  Science       Date:  1999-07-30       Impact factor: 47.728

2.  The B73 maize genome: complexity, diversity, and dynamics.

Authors:  Patrick S Schnable; Doreen Ware; Robert S Fulton; Joshua C Stein; Fusheng Wei; Shiran Pasternak; Chengzhi Liang; Jianwei Zhang; Lucinda Fulton; Tina A Graves; Patrick Minx; Amy Denise Reily; Laura Courtney; Scott S Kruchowski; Chad Tomlinson; Cindy Strong; Kim Delehaunty; Catrina Fronick; Bill Courtney; Susan M Rock; Eddie Belter; Feiyu Du; Kyung Kim; Rachel M Abbott; Marc Cotton; Andy Levy; Pamela Marchetto; Kerri Ochoa; Stephanie M Jackson; Barbara Gillam; Weizu Chen; Le Yan; Jamey Higginbotham; Marco Cardenas; Jason Waligorski; Elizabeth Applebaum; Lindsey Phelps; Jason Falcone; Krishna Kanchi; Thynn Thane; Adam Scimone; Nay Thane; Jessica Henke; Tom Wang; Jessica Ruppert; Neha Shah; Kelsi Rotter; Jennifer Hodges; Elizabeth Ingenthron; Matt Cordes; Sara Kohlberg; Jennifer Sgro; Brandon Delgado; Kelly Mead; Asif Chinwalla; Shawn Leonard; Kevin Crouse; Kristi Collura; Dave Kudrna; Jennifer Currie; Ruifeng He; Angelina Angelova; Shanmugam Rajasekar; Teri Mueller; Rene Lomeli; Gabriel Scara; Ara Ko; Krista Delaney; Marina Wissotski; Georgina Lopez; David Campos; Michele Braidotti; Elizabeth Ashley; Wolfgang Golser; HyeRan Kim; Seunghee Lee; Jinke Lin; Zeljko Dujmic; Woojin Kim; Jayson Talag; Andrea Zuccolo; Chuanzhu Fan; Aswathy Sebastian; Melissa Kramer; Lori Spiegel; Lidia Nascimento; Theresa Zutavern; Beth Miller; Claude Ambroise; Stephanie Muller; Will Spooner; Apurva Narechania; Liya Ren; Sharon Wei; Sunita Kumari; Ben Faga; Michael J Levy; Linda McMahan; Peter Van Buren; Matthew W Vaughn; Kai Ying; Cheng-Ting Yeh; Scott J Emrich; Yi Jia; Ananth Kalyanaraman; An-Ping Hsia; W Brad Barbazuk; Regina S Baucom; Thomas P Brutnell; Nicholas C Carpita; Cristian Chaparro; Jer-Ming Chia; Jean-Marc Deragon; James C Estill; Yan Fu; Jeffrey A Jeddeloh; Yujun Han; Hyeran Lee; Pinghua Li; Damon R Lisch; Sanzhen Liu; Zhijie Liu; Dawn Holligan Nagel; Maureen C McCann; Phillip SanMiguel; Alan M Myers; Dan Nettleton; John Nguyen; Bryan W Penning; Lalit Ponnala; Kevin L 
Schneider; David C Schwartz; Anupma Sharma; Carol Soderlund; Nathan M Springer; Qi Sun; Hao Wang; Michael Waterman; Richard Westerman; Thomas K Wolfgruber; Lixing Yang; Yeisoo Yu; Lifang Zhang; Shiguo Zhou; Qihui Zhu; Jeffrey L Bennetzen; R Kelly Dawe; Jiming Jiang; Ning Jiang; Gernot G Presting; Susan R Wessler; Srinivas Aluru; Robert A Martienssen; Sandra W Clifton; W Richard McCombie; Rod A Wing; Richard K Wilson
Journal:  Science       Date:  2009-11-20       Impact factor: 47.728

3.  NegGOA: negative GO annotations selection using ontology structure.

Authors:  Guangyuan Fu; Jun Wang; Bo Yang; Guoxian Yu
Journal:  Bioinformatics       Date:  2016-06-17       Impact factor: 6.937

4.  The effects of artificial selection on the maize genome.

Authors:  Stephen I Wright; Irie Vroh Bi; Steve G Schroeder; Masanori Yamasaki; John F Doebley; Michael D McMullen; Brandon S Gaut
Journal:  Science       Date:  2005-05-27       Impact factor: 47.728

5.  Predicting protein functions using incomplete hierarchical labels.

Authors:  Guoxian Yu; Hailong Zhu; Carlotta Domeniconi
Journal:  BMC Bioinformatics       Date:  2015-01-16       Impact factor: 3.169

6.  An expanded evaluation of protein function prediction methods shows an improvement in accuracy.

Authors:  Yuxiang Jiang; Tal Ronnen Oron; Wyatt T Clark; Asma R Bankapur; Daniel D'Andrea; Rosalba Lepore; Christopher S Funk; Indika Kahanda; Karin M Verspoor; Asa Ben-Hur; Da Chen Emily Koo; Duncan Penfold-Brown; Dennis Shasha; Noah Youngs; Richard Bonneau; Alexandra Lin; Sayed M E Sahraeian; Pier Luigi Martelli; Giuseppe Profiti; Rita Casadio; Renzhi Cao; Zhaolong Zhong; Jianlin Cheng; Adrian Altenhoff; Nives Skunca; Christophe Dessimoz; Tunca Dogan; Kai Hakala; Suwisa Kaewphan; Farrokh Mehryary; Tapio Salakoski; Filip Ginter; Hai Fang; Ben Smithers; Matt Oates; Julian Gough; Petri Törönen; Patrik Koskinen; Liisa Holm; Ching-Tai Chen; Wen-Lian Hsu; Kevin Bryson; Domenico Cozzetto; Federico Minneci; David T Jones; Samuel Chapman; Dukka Bkc; Ishita K Khan; Daisuke Kihara; Dan Ofer; Nadav Rappoport; Amos Stern; Elena Cibrian-Uhalte; Paul Denny; Rebecca E Foulger; Reija Hieta; Duncan Legge; Ruth C Lovering; Michele Magrane; Anna N Melidoni; Prudence Mutowo-Meullenet; Klemens Pichler; Aleksandra Shypitsyna; Biao Li; Pooya Zakeri; Sarah ElShal; Léon-Charles Tranchevent; Sayoni Das; Natalie L Dawson; David Lee; Jonathan G Lees; Ian Sillitoe; Prajwal Bhat; Tamás Nepusz; Alfonso E Romero; Rajkumar Sasidharan; Haixuan Yang; Alberto Paccanaro; Jesse Gillis; Adriana E Sedeño-Cortés; Paul Pavlidis; Shou Feng; Juan M Cejuela; Tatyana Goldberg; Tobias Hamp; Lothar Richter; Asaf Salamov; Toni Gabaldon; Marina Marcet-Houben; Fran Supek; Qingtian Gong; Wei Ning; Yuanpeng Zhou; Weidong Tian; Marco Falda; Paolo Fontana; Enrico Lavezzo; Stefano Toppo; Carlo Ferrari; Manuel Giollo; Damiano Piovesan; Silvio C E Tosatto; Angela Del Pozo; José M Fernández; Paolo Maietta; Alfonso Valencia; Michael L Tress; Alfredo Benso; Stefano Di Carlo; Gianfranco Politano; Alessandro Savino; Hafeez Ur Rehman; Matteo Re; Marco Mesiti; Giorgio Valentini; Joachim W Bargsten; Aalt D J van Dijk; Branislava Gemovic; Sanja Glisic; Vladmir Perovic; Veljko Veljkovic; Nevena Veljkovic; Danillo C 
Almeida-E-Silva; Ricardo Z N Vencio; Malvika Sharan; Jörg Vogel; Lakesh Kansakar; Shanshan Zhang; Slobodan Vucetic; Zheng Wang; Michael J E Sternberg; Mark N Wass; Rachael P Huntley; Maria J Martin; Claire O'Donovan; Peter N Robinson; Yves Moreau; Anna Tramontano; Patricia C Babbitt; Steven E Brenner; Michal Linial; Christine A Orengo; Burkhard Rost; Casey S Greene; Sean D Mooney; Iddo Friedberg; Predrag Radivojac
Journal:  Genome Biol       Date:  2016-09-07       Impact factor: 13.583

7.  Improved maize reference genome with single-molecule technologies.

Authors:  Yinping Jiao; Paul Peluso; Jinghua Shi; Tiffany Liang; Michelle C Stitzer; Bo Wang; Michael S Campbell; Joshua C Stein; Xuehong Wei; Chen-Shan Chin; Katherine Guill; Michael Regulski; Sunita Kumari; Andrew Olson; Jonathan Gent; Kevin L Schneider; Thomas K Wolfgruber; Michael R May; Nathan M Springer; Eric Antoniou; W Richard McCombie; Gernot G Presting; Michael McMullen; Jeffrey Ross-Ibarra; R Kelly Dawe; Alex Hastie; David R Rank; Doreen Ware
Journal:  Nature       Date:  2017-06-12       Impact factor: 49.962

8.  A large-scale evaluation of computational protein function prediction.

Authors:  Predrag Radivojac; Wyatt T Clark; Tal Ronnen Oron; Alexandra M Schnoes; Tobias Wittkop; Artem Sokolov; Kiley Graim; Christopher Funk; Karin Verspoor; Asa Ben-Hur; Gaurav Pandey; Jeffrey M Yunes; Ameet S Talwalkar; Susanna Repo; Michael L Souza; Damiano Piovesan; Rita Casadio; Zheng Wang; Jianlin Cheng; Hai Fang; Julian Gough; Patrik Koskinen; Petri Törönen; Jussi Nokso-Koivisto; Liisa Holm; Domenico Cozzetto; Daniel W A Buchan; Kevin Bryson; David T Jones; Bhakti Limaye; Harshal Inamdar; Avik Datta; Sunitha K Manjari; Rajendra Joshi; Meghana Chitale; Daisuke Kihara; Andreas M Lisewski; Serkan Erdin; Eric Venner; Olivier Lichtarge; Robert Rentzsch; Haixuan Yang; Alfonso E Romero; Prajwal Bhat; Alberto Paccanaro; Tobias Hamp; Rebecca Kaßner; Stefan Seemayer; Esmeralda Vicedo; Christian Schaefer; Dominik Achten; Florian Auer; Ariane Boehm; Tatjana Braun; Maximilian Hecht; Mark Heron; Peter Hönigschmid; Thomas A Hopf; Stefanie Kaufmann; Michael Kiening; Denis Krompass; Cedric Landerer; Yannick Mahlich; Manfred Roos; Jari Björne; Tapio Salakoski; Andrew Wong; Hagit Shatkay; Fanny Gatzmann; Ingolf Sommer; Mark N Wass; Michael J E Sternberg; Nives Škunca; Fran Supek; Matko Bošnjak; Panče Panov; Sašo Džeroski; Tomislav Šmuc; Yiannis A I Kourmpetis; Aalt D J van Dijk; Cajo J F ter Braak; Yuanpeng Zhou; Qingtian Gong; Xinran Dong; Weidong Tian; Marco Falda; Paolo Fontana; Enrico Lavezzo; Barbara Di Camillo; Stefano Toppo; Liang Lan; Nemanja Djuric; Yuhong Guo; Slobodan Vucetic; Amos Bairoch; Michal Linial; Patricia C Babbitt; Steven E Brenner; Christine Orengo; Burkhard Rost; Sean D Mooney; Iddo Friedberg
Journal:  Nat Methods       Date:  2013-01-27       Impact factor: 28.547

9.  Incorporating functional inter-relationships into protein function prediction algorithms.

Authors:  Gaurav Pandey; Chad L Myers; Vipin Kumar
Journal:  BMC Bioinformatics       Date:  2009-05-12       Impact factor: 3.169

10.  Understanding how and why the Gene Ontology and its annotations evolve: the GO within UniProt.

Authors:  Rachael P Huntley; Tony Sawford; Maria J Martin; Claire O'Donovan
Journal:  Gigascience       Date:  2014-03-18       Impact factor: 6.524

