
Gene function prediction based on combining gene ontology hierarchy with multi-instance multi-label learning.

Zejun Li¹,², Bo Liao¹, Yun Li¹, Wenhua Liu², Min Chen¹,², Lijun Cai¹.

Abstract

Gene function annotation is an important part of genome annotation and a major challenge in the post-genomic era. The Human Genome Project produced whole-genome data that provide abundant biological information for the study of gene function annotation. To obtain useful knowledge from such a large amount of data, a potential strategy is to apply machine learning methods to mine these data and predict gene function. In this study, we improved multi-instance hierarchical clustering by using the gene ontology hierarchy to annotate gene function, combining the gene ontology hierarchy with the multi-instance multi-label learning framework. We then used the multi-label support vector machine (MLSVM) and multi-label k-nearest neighbor (MLKNN) algorithms to predict gene function. Finally, we verified our method on four yeast expression datasets. The experimental results show that our method is efficient. This journal is © The Royal Society of Chemistry.


Year: 2018 PMID: 35542493 PMCID: PMC9083914 DOI: 10.1039/c8ra05122d

Source DB: PubMed Journal: RSC Adv ISSN: 2046-2069 Impact factor: 4.036

Introduction

In the post-genomic era, predicting the functions of genes is one of the biggest challenges of genome function annotation. With the rapid advancement of high-throughput technologies, such as microarray expression profiling, a large amount of biological data has been produced.[1,2] These data provide valuable information for predicting gene functions. Recently, time-series gene expression profile datasets have been widely used to predict gene function, on the assumption that genes with similar expression patterns may have similar functions.[3] Many efforts have been made to address this task based on this assumption. Zhao et al.[4] presented a technique, Annotating Genes with Positive Samples (AGPS), for defining negative samples in gene function prediction. Barutcuoglu et al.[5] developed a Bayesian framework for combining multiple classifiers based on functional taxonomy constraints; their experiments show that, on a 105-node sub-hierarchy of the gene ontology (GO), the Bayesian framework improves predictions for 93 nodes. Vinayagam et al.[6] developed a large-scale annotation system that provides annotations through GO terms and applies multiple SVMs to classify predictions as correct or false. Pei et al.[7] proposed a method for the function annotation of new biological sequences using variable-precision rough set theory. Doniger et al.[8] proposed a tool called MAPPFinder, which creates a global gene-expression profile across all areas of biology by integrating the annotations of the GO project. Huang et al.[9] surveyed gene annotation enrichment analysis tools, collecting approximately 68 tools currently available in the community and categorizing them into three major classes according to their underlying enrichment algorithms. Zhang et al.[10] created GOTree Machine (GOTM), a web-based tool for analyzing and visualizing sets of genes. Although this tool was originally intended to analyze sets of co-regulated genes identified from microarray analysis, it can be adapted to gene sets from other high-throughput analyses. Draghici et al.[11] developed Onto-Express, a tool that automatically translates differentially regulated genes into functional profiles characterizing the impact of the condition studied.

Despite the good performance of machine learning techniques, two characteristics distinguish the function-prediction task from common machine learning tasks: (1) a single gene may have multiple functions; and (2) the functions are organized in a hierarchy, i.e., a gene that is related to a function is automatically related to all of that function's ancestors (this is called the hierarchy constraint).[12] Therefore, we combined the multi-label learning framework with the gene ontology hierarchy[13] to address this task. In this study, we improved multi-instance hierarchical clustering (MIHC)[14] with the gene ontology hierarchy. Then, MLSVM and MLKNN classifiers were used to predict the functions of genes from time-course gene expression profiles.[1] Numerous classification methods have been used in bioinformatics.[15-18] The Materials and methods section introduces the prediction task, the MLSVM and MLKNN algorithms, and the MIHC method. The Experiment and results section describes the application of the MIHC method to real data from the GO database to examine its effectiveness. Numerical results show that the proposed method has better precision, recall and harmonic mean value.

Materials and methods

Gene function prediction task

The goal of the gene function annotation task is to find the functions of unannotated genes. The general computational approach is to use a variety of biological models to calculate the relationships between genes and functions, and thereby predict the functions of unannotated genes.[19-23] In terms of the correspondence between genes and their functions, a gene can be transcribed and translated into various proteins and can carry out many different functions.[24-27] Similarly, an in vivo biological process is not carried out by a single gene, but by multiple genes working together.[15] Therefore, the relationship between genes and their functions is a many-to-many (N to N) mapping. Among learning frameworks, the multi-instance multi-label learning framework is well suited to an N to N mapping. There is a certain degree of correlation among genes and among gene functions. However, the degeneration step in the multi-instance multi-label learning framework destroys these correlations. Therefore, it is necessary to preserve them when the multi-instance multi-label learning framework is applied.

Gene ontology hierarchy

The GO database is a standard model with a hierarchical structure, designed to standardize biological knowledge of genes and their products. Overall, the GO database is a directed acyclic graph (DAG) covering three branches: Biological Process (BP), Molecular Function (MF) and Cellular Component (CC). No two of the three branches intersect. Moreover, the GO database contains gene annotations for most microorganism, plant and animal species, and GO terms can be used in multiple databases, which keeps the description of each gene consistent across databases. In the DAG, GO terms are the nodes and the relationships between GO terms are the edges. The GO DAG also borrows the vocabulary of tree structures, such as root node, parent node, child node, leaf node and level, which makes it easier to understand. In the GO DAG, a parent node closer to the root is coarser (more general), while child nodes further from the root are finer (more specific). Therefore, genes are annotated with GO terms at the highest possible level of detail, which corresponds to the lowest level of abstraction.[28]

There are three semantic relations between GO terms, namely is_a, part_of and regulates. Among them, is_a and part_of are transitive, while a regulates relationship can be read in two directions: one term regulates another, or is regulated by it. We briefly describe the three relationships as follows. (1) The is_a relationship is a simple inclusion relationship: A is_a B means that A is a subset of B. It is transitive, so A is_a B can be inferred from A is_a C and C is_a B. We write this derivation rule as is_a × is_a → is_a. (2) The part_of relationship is similar to the is_a relationship. A part_of B indicates that whenever A is present it is a part of B, but A does not necessarily occur. Like is_a, the part_of relationship is transitive, so its derivation rule can be written as part_of × part_of → part_of. (3) The regulates relationship is slightly different from the previous two. In GO semantics, if A can directly affect B, then A has a regulatory role on B, i.e., A regulates B. The three relationships as expressed in the GO database are shown in Fig. 1, where the letter “I” represents the is_a relationship, “P” represents the part_of relationship and “R” represents the regulates relationship.
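The derivation rules is_a × is_a → is_a and part_of × part_of → part_of, together with the hierarchy constraint, can be illustrated with a small sketch. The edge list and the gene name below are hypothetical toy values, not real GO terms, and the function names are illustrative assumptions rather than part of any GO library.

```python
from collections import defaultdict

# Hypothetical toy edges (child, relation, parent); these are not real GO terms.
EDGES = [
    ("A", "is_a", "C"),
    ("C", "is_a", "B"),
    ("D", "part_of", "A"),
    ("E", "regulates", "D"),
]

def build_parents(edges, relation):
    """Map each term to its direct parents under one relation type."""
    parents = defaultdict(set)
    for child, rel, parent in edges:
        if rel == relation:
            parents[child].add(parent)
    return parents

def ancestors(term, parents):
    """Transitive closure: every term reachable upward from `term`."""
    seen, stack = set(), [term]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def propagate(annotations, parents):
    """Hierarchy constraint: a gene annotated with a term inherits its ancestors."""
    return {gene: set(terms) | {a for t in terms for a in ancestors(t, parents)}
            for gene, terms in annotations.items()}

is_a = build_parents(EDGES, "is_a")
print(sorted(ancestors("A", is_a)))   # A is_a C and C is_a B, so A is_a B
print(propagate({"gene1": {"A"}}, is_a))
```

Only one relation type is followed at a time here, because the text gives composition rules for is_a with is_a and part_of with part_of only.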

Fig. 1

Part of the gene ontology hierarchy relationships.

MIML learning framework

The multi-instance multi-label learning (MIML) framework was proposed by Zhou et al.[29] Formally, MIML can be defined as follows. Let x = {x_1, x_2, …, x_n} represent the set of instances and y = {y_1, y_2, …, y_m} denote the set of labels. Given the dataset {(X_1,Y_1), (X_2,Y_2), …, (X_N,Y_N)}, the goal of the learning task is to learn a mapping f: 2^x → 2^y, where X_i ⊂ x is a bag of instances and Y_i ⊂ y is a subset of labels. In this study, we solved the MIML task by a degeneration approach. First, the MIML task was degenerated into multi-instance learning (MIL) or multi-label learning (MLL). Then, MIL or MLL was further degenerated into single-instance single-label learning (SISL). The relationships among these learning frameworks are shown in Fig. 2. As shown in Fig. 2(b), a gene is annotated by multiple GO terms. Fig. 2(c) shows that a GO term corresponds to a set of genes. The relationship shown in (b) is called multi-label,[30] and that shown in (c) is called multi-instance.[31] We used (b) or (c) as a bridge to degenerate (d) into (a). In this study, we used MLL algorithms to predict the annotation of novel genes. The pseudocode of the two MLL algorithms, MLSVM[32,33] and MLKNN,[34,35] is shown in Tables 1 and 2.
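The views in Fig. 2 can be made concrete with a small data-structure sketch. It starts from the multi-label view (b), a gene mapped to a set of GO terms, derives the multi-instance view (c), and then the per-label binary tasks that correspond to (a). The gene and term names are hypothetical, and the +1/−1 encoding is an assumption about how the binary targets are represented.

```python
from collections import defaultdict

def multi_instance_view(annotations):
    """(c): invert gene -> terms into term -> bag of genes."""
    bags = defaultdict(set)
    for gene, terms in annotations.items():
        for term in terms:
            bags[term].add(gene)
    return dict(bags)

def binary_tasks(annotations):
    """(a): one single-label task per term, with +1 / -1 targets per gene."""
    labels = set().union(*annotations.values())
    return {y: {g: (1 if y in terms else -1) for g, terms in annotations.items()}
            for y in labels}

# (b): hypothetical multi-label annotations, gene -> set of GO terms
annotations = {"g1": {"T1", "T2"}, "g2": {"T2"}}
print(multi_instance_view(annotations))
print(binary_tasks(annotations))
```

Splitting into independent binary tasks is what discards the correlations between labels, which is the loss the GO hierarchy is meant to compensate for.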

Fig. 2

Four types of machine learning frames.

Table 1

The MLSVM algorithm

Input: S, the training set; T, the test samples
Output: Y_T, the set of predicted labels of T
1	For the training set S = {(x_i,Y_i) | i = 1,2,…,N}, calculate the kernel matrix
2	For each label y ∈ Y, Y = {Y_i | i = 1,2,…,N}
3	Produce a sub-training set S_y = {(x_i,ψ(x_i,y)) | i = 1,2,…,N}
4	Train an SVM model M_y = svmtrain(S_y)
5	For each test sample t_i ∈ T
6	Its labels are obtained by Y_ti = {y | M_y(t_i) ≥ 0}
7	End
8	End
9	Y_T = {Y_ti | i = 1,2,…,N}
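The one-model-per-label structure of the MLSVM procedure above can be sketched as follows. This is a minimal illustration of the decomposition only: a nearest-centroid scorer stands in for svmtrain (it is not an SVM and uses no kernel matrix), ψ is assumed to return +1 when the label belongs to the sample and −1 otherwise, and all names and data are hypothetical.

```python
def psi(label_set, y):
    """Assumed meaning of psi: +1 if y is one of the sample's labels, else -1."""
    return 1 if y in label_set else -1

def train_centroid(samples):
    """Stand-in for svmtrain: returns a scorer that is >= 0 on the positive side."""
    pos = [x for x, s in samples if s > 0]
    neg = [x for x, s in samples if s < 0]
    if not pos or not neg:                      # degenerate task: constant answer
        const = 1.0 if pos else -1.0
        return lambda t: const
    mu_p = [sum(c) / len(pos) for c in zip(*pos)]
    mu_n = [sum(c) / len(neg) for c in zip(*neg)]

    def score(t):
        d_p = sum((a - b) ** 2 for a, b in zip(t, mu_p))
        d_n = sum((a - b) ** 2 for a, b in zip(t, mu_n))
        return d_n - d_p                        # closer to positives -> score >= 0
    return score

def fit_predict(S, T, train_binary=train_centroid):
    """Steps 2-9: train one binary model per label, then label each test sample."""
    label_space = set().union(*(Y for _, Y in S))
    models = {y: train_binary([(x, psi(Y, y)) for x, Y in S]) for y in label_space}
    return [{y for y, m in models.items() if m(t) >= 0} for t in T]

S = [((0, 0), {"a"}), ((1, 0), {"a"}), ((10, 10), {"b"}), ((11, 10), {"b"})]
print(fit_predict(S, [(0.5, 0), (10, 11)]))
```

Any binary learner that returns a signed decision value can be passed as `train_binary`, so a real SVM trainer could replace the stand-in without changing the loop.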

Table 2

The MLKNN algorithm

Input: S, the training set; T, the test samples
Output: Y_T, the set of predicted labels of T
1	For each test sample t_i ∈ T
2	Calculate S_ti ∈ KNN(t_i), the k-nearest neighbors of t_i in S
3	The candidate classes of t_i are obtained by Y_tc = {y | y ∈ Y and ψ(S_ti,Y) = 1}
4	For each label y ∈ Y_tc
5	Calculate simScore(t_i,s_i), the similarity score of s_i to t_i
6	Calculate the likelihood score Score(t_i,y) of t_i to y
7	t_i is labeled by Y_ti = {y | Score(t_i,y) ≥ 0}
8	End
9	End
10	Y_T = {Y_ti | i = 1,2,…,N}
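The control flow of the MLKNN procedure above (neighbors, candidate labels, per-label score, threshold at zero) can be sketched as below. The similarity 1/(1 + distance) and the signed weighted vote used as the score are placeholders chosen for illustration; they are assumptions, not the scoring rule of the published ML-KNN method, and the data are hypothetical.

```python
import math

def knn_multilabel(S, T, k=3):
    """Neighbour-vote sketch: S is a list of (vector, label_set), T a list of vectors."""
    predictions = []
    for t in T:
        # Step 2: the k training samples closest to t (Euclidean distance)
        neighbours = sorted(S, key=lambda s: math.dist(s[0], t))[:k]
        # Step 3: candidate labels are those carried by at least one neighbour
        candidates = set().union(*(Y for _, Y in neighbours))
        labels = set()
        for y in candidates:
            score = 0.0
            for x, Y in neighbours:
                sim = 1.0 / (1.0 + math.dist(x, t))   # assumed simScore
                score += sim if y in Y else -sim      # assumed Score: signed vote
            if score >= 0:                            # Step 7: keep labels with Score >= 0
                labels.add(y)
        predictions.append(labels)
    return predictions

S = [((0, 0), {"a"}), ((1, 0), {"a", "b"}), ((0, 1), {"a"}), ((10, 10), {"c"})]
print(knn_multilabel(S, [(0, 0)], k=3))
```

With this placeholder score, a label is kept when the similarity-weighted neighbours that carry it outweigh those that do not.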

MIHC+ algorithm

Although the MIHC algorithm has made progress on the gene annotation task, it has a limitation: MIHC does not consider the GO DAG when it clusters GO terms. In this study, we addressed this issue by using the GO hierarchy when clustering GO terms. Hierarchical clustering[36] is a widely used machine learning technique. A general hierarchical clustering algorithm can be described as follows. Step 1: determine the dissimilarities between all objects by calculating the distance between each pair of objects, e.g., the Euclidean distance. Step 2: find the two closest objects or clusters and merge them into one cluster. Step 3: recalculate the dissimilarities between the new cluster and the remaining clusters or objects. Step 4: return to Step 2 until a stopping condition is satisfied or a given number of clusters has been generated. Clustering all objects into a single class would be meaningless, so the algorithm stops when the stopping condition is satisfied. The stopping condition of the MIHC algorithm is that no new cluster is generated. We still used the formulae in MIHC to calculate the distance between bags of instances. The flowchart of the MIHC+ algorithm is shown in Fig. 3. The merge condition is the most important part of the algorithm. As in MIHC, we defined the distance-based merge condition as D(GO_i,GO_j) ≤ max(D(GO_i),D(GO_j)), but the MIHC+ algorithm requires a second condition to merge two GO terms: following the GO hierarchy, we propagated the two GO terms upward and merged them if they share a common ancestor in the GO database. When no more GO terms can be merged, the algorithm ends. In other words, the MIHC+ algorithm fully respects the knowledge encoded in the GO hierarchy.
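The general Steps 1 to 4, with the rule that clustering stops when no new cluster can be generated, can be sketched as below. This is a generic agglomerative loop, not the MIHC+ implementation: average linkage over Euclidean distance replaces the MIHC bag distance, and `can_merge` is a hypothetical hook where a distance test and a common-ancestor test would be combined.

```python
import math
from itertools import combinations

def average_linkage(c1, c2):
    """Mean pairwise Euclidean distance between two clusters of points."""
    return sum(math.dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def agglomerate(points, can_merge):
    clusters = [[p] for p in points]                  # every object starts alone
    while len(clusters) > 1:
        # Steps 1 and 3: (re)compute all pairwise cluster dissimilarities
        pairs = sorted(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        # Step 2: merge the closest pair that satisfies the merge condition
        for i, j in pairs:
            d = average_linkage(clusters[i], clusters[j])
            if can_merge(clusters[i], clusters[j], d):
                merged = clusters[i] + clusters[j]
                clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
                clusters.append(merged)
                break
        else:
            break   # Step 4: no admissible merge, so no new cluster is generated
    return clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(agglomerate(points, lambda a, b, d: d <= 2))
```

In this toy run the predicate is a plain distance threshold; an MIHC+-style predicate would additionally require the two GO terms to pass the hierarchy check before returning true.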

Fig. 3

MIHC+ algorithm flowchart.

Experiment and results

Experiment

The time-series expression datasets used in the experiment were obtained from ref. 37 and can be downloaded from ref. 38. These four datasets are yeast cell cycle expression data with different time points and circumstances. Gene annotation data were obtained from the GO database, which can be downloaded from ref. 39. We used the method in ref. 40 to preprocess the raw data so that the first value of each series is always 0. Then, the average transformation t = (t + t)/2 was used to smooth out spikes. After preprocessing, we used the method in ref. 14 to select genes that are significantly correlated with each other within the same function. The resulting noise-free system of expression data and annotations is represented as S = {(G_i,GO_i) | i = 1,…,M}. Subsequently, the MIHC+ algorithm was used to construct the learning system. Finally, MLSVM and MLKNN classifiers were used to verify the performance of the learning system. The flow chart of the gene function prediction is shown in Fig. 4.

Fig. 4

The flow chart of gene function prediction.

Leave-one-out and leave-a-percent-out cross validation[41] approaches are commonly used for evaluating the performance of function prediction algorithms. We selected the latter to evaluate the MIHC+ method. To measure the performance accurately, the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) were used to quantify the results. Classifiers often output continuous scores, so the predicted class of a sample depends on the decision threshold. In other words, the true and false positive rates (TPR and FPR, respectively) vary with the threshold. The ROC curve plots TPR against FPR as the threshold varies. TPR and FPR were calculated by eqn (3) and (4):

TPR = TP/(TP + FN) (3)

FPR = FP/(FP + TN) (4)

where TP, FP, TN and FN represent the number of true positive, false positive, true negative and false negative predictions, respectively. TPR therefore reflects the sensitivity of the prediction, and FPR equals one minus its specificity. AUC was calculated to summarize the ROC curve in a single value. A reliable and valid AUC estimate can be interpreted as the probability that the classifier will assign a higher score to a randomly chosen positive sample than to a randomly chosen negative sample.
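The quantities above can be computed directly from labels and scores. The sketch below evaluates TPR and FPR at one threshold and computes AUC through its probabilistic interpretation, as the fraction of positive-negative pairs in which the positive sample receives the higher score (ties count as one half). The labels and scores are hypothetical.

```python
def rates(labels, scores, threshold):
    """TPR and FPR when samples with score >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    return tp / (tp + fn), fp / (fp + tn)

def auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(rates(labels, scores, 0.5))  # (0.5, 0.5)
print(auc(labels, scores))         # 0.75
```

The pairwise form is quadratic in the number of samples, which is acceptable for a sketch; rank-based formulas give the same value more efficiently.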

Results

The four yeast time-course expression datasets are alpha, cdc15, cdc28 and elution, which record mRNA levels at 18, 24, 17 and 14 time points, respectively, over the whole cell cycle under different circumstances. For each expression dataset, MIHC+ and three other methods from ref. 14 (GNC, GOLC and MIHC) were used to construct learning systems. All learning systems were then tested with the MLSVM and MLKNN classifiers. In the classification task, the multi-label learning task is decomposed into a series of binary classification tasks. The experimental settings are the same as those in ref. 14. For each expression dataset, the average results obtained from each learning system by the MLSVM classifier are shown in Fig. 5–8. The data in these figures indicate that the MIHC+ learning system performs similarly to MIHC. The results for the cdc28 dataset are shown in Table 3. However, the MIHC+ method can provide more biological information. From the MIHC+ learning system, we found that the GO term ‘GO:0009987’ appears in all of these datasets, and that only 7 genes annotated with ‘GO:0009987’ appear in both the cdc28 and cdc15 datasets. In GO, ‘GO:0009987’, named “cellular process”, is defined as “any process that is carried out at the cellular level, but not necessarily restricted to a single cell”. For example, cell communication occurs among more than one cell, but at the cellular level.
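The overlap between the cdc15 and cdc28 datasets can be checked against the gene lists tabulated for these two datasets in this article. The sketch below copies those two lists as they are printed and intersects them; it only covers the genes shown in the tables, which are described as a selection ("some GO terms and their genes").

```python
# Genes listed in the cdc15 table of this article
cdc15 = {
    "YBR048W", "YGL030W", "YKL006W", "YLR333C", "YOL120C",
    "YDL061C", "YGL103W", "YKR057W", "YLR367W", "YOL127W",
    "YDL083C", "YGR034W", "YKR094C", "YLR388W", "YOR063W",
    "YDR064W", "YGR214W", "YLR075W", "YML073C", "YOR167C",
    "YDR418W", "YHR203C", "YLR167W", "YML091C", "YPL131W",
    "YER102W", "YIL069C", "YLR185W", "YNL209W",
    "YFR031C-A", "YIL133C", "YLR264W", "YOL040C",
}

# Genes listed in the cdc28 table of this article
cdc28 = {
    "YBL027W", "YDL191W", "YJR145C", "YLR367W", "YOR312C",
    "YBL087C", "YDR025W", "YKL180W", "YNL162W", "YPL143W",
    "YBR048W", "YDR064W", "YKR057W", "YNL302C", "YPL198W",
    "YBR084C-A", "YDR447C", "YKR094C", "YOL120C", "YPR132W",
    "YBR181C", "YHL001W", "YLR185W", "YOL121C",
    "YDL075W", "YJL189W", "YLR287C-A", "YOR234C",
}

shared = sorted(cdc15 & cdc28)
print(len(shared), shared)
```

The intersection of the two printed lists contains seven genes, which agrees with the count stated in the text.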

Table 3

The results of the cdc28 dataset by MLSVM

Method	Parameter	n% = 10	20	30	40	50	60	70	80	90
GNC	λ = 10	0.559	0.606	0.631	0.644	0.656	0.671	0.670	0.679	0.703
	λ = 20	0.569	0.603	0.607	0.625	0.638	0.646	0.650	0.650	0.661
	λ = 30	0.571	0.583	0.599	0.612	0.618	0.629	0.628	0.632	0.646
	λ = 40	0.529	0.543	0.552	0.567	0.577	0.574	0.578	0.589	0.602
	λ = 50	0.532	0.535	0.558	0.569	0.572	0.579	0.588	0.592	0.596
GOLC	ι = 1	0.594	0.617	0.634	0.635	0.648	0.643	0.641	0.651	0.653
	ι = 2	0.609	0.623	0.624	0.631	0.624	0.632	0.635	0.640	0.643
	ι = 3	0.601	0.644	0.657	0.658	0.654	0.661	0.656	0.643	0.668
	ι = 4	0.601	0.638	0.647	0.651	0.654	0.663	0.658	0.656	0.662
MIHC		0.621	0.666	0.727	0.767	0.800	0.794	0.817	0.828	0.838
MIHC⁺		0.644	0.665	0.735	0.740	0.796	0.811	0.822	0.831	0.840

The results on the four datasets indicate that the genes involved in the same biological process may vary with the external environment. Ref. 37 supports this view: when yeast cells are in different growth environments, they turn the expression of certain genes on or off in order to adapt. Some GO terms and their genes for the four datasets are presented in Tables 4–7. In each table, all genes in the “Genes” rows are annotated with the GO terms in the “GO terms” rows.

Conclusion

In this study, we improved the MIHC method with the gene ontology hierarchy (the MIHC+ method) to construct a learning system. Our method was verified on four yeast gene expression datasets. The MIHC+ method treats the gene ontology hierarchy as the relationship between gene annotations, and hierarchical clustering then follows the GO hierarchy to cluster them. Compared with the other learning systems employed in this study, the MIHC+ method obtained more biological knowledge from the time-series expression datasets, while performing similarly to the MIHC method. In future research, we will combine gene annotation information with other biological information (e.g., single nucleotide polymorphism[42,43] and miRNA[44-52]) to diagnose complex diseases more accurately.

Conflicts of interest

There are no conflicts to declare.

Table 4

Some GO terms and their genes in the alpha dataset

Environment	alpha
Genes	YBR189W	YGL189C	YGR214W	YJR123W	YOL121C
	YER025W	YGR094W	YGR285C	YNL178W
	YGL123W	YGR118W	YHR064C	YNL209W
GO terms	GO:0008152	GO:0009987	GO:0044237	GO:0044238	GO:0071704

Table 5

Some GO terms and their genes in the cdc15 dataset

Environment	cdc15
Genes	YBR048W	YGL030W	YKL006W	YLR333C	YOL120C
	YDL061C	YGL103W	YKR057W	YLR367W	YOL127W
	YDL083C	YGR034W	YKR094C	YLR388W	YOR063W
	YDR064W	YGR214W	YLR075W	YML073C	YOR167C
	YDR418W	YHR203C	YLR167W	YML091C	YPL131W
	YER102W	YIL069C	YLR185W	YNL209W
	YFR031C-A	YIL133C	YLR264W	YOL040C
GO terms	GO:0000462	GO:0006396	GO:0016072	GO:0042274	GO:0071704
	GO:0000469	GO:0006725	GO:0022613	GO:0043170	GO:0071840
	GO:0000478	GO:0006807	GO:0030490	GO:0044085	GO:0090304
	GO:0000479	GO:0008152	GO:0034470	GO:0044237	GO:0090305
	GO:0000480	GO:0009987	GO:0034641	GO:0044238	GO:0090501
	GO:0006139	GO:0010467	GO:0034660	GO:0044260	GO:0090502
	GO:0006364	GO:0016070	GO:0042254	GO:0046483	GO:1901360

Table 6

Some GO terms and their genes in the cdc28 dataset

Environment	cdc28
Genes	YBL027W	YDL191W	YJR145C	YLR367W	YOR312C
	YBL087C	YDR025W	YKL180W	YNL162W	YPL143W
	YBR048W	YDR064W	YKR057W	YNL302C	YPL198W
	YBR084C-A	YDR447C	YKR094C	YOL120C	YPR132W
	YBR181C	YHL001W	YLR185W	YOL121C
	YDL075W	YJL189W	YLR287C-A	YOR234C
GO terms	GO:0009987	GO:0044699	GO:0044763

Table 7

Some GO terms and their genes in the elution dataset

Environment	elution
Genes	YBL047C	YEL048C	YJL154C	YLR361C	YNL192W
	YBL099W	YER096W	YJR017C	YLR371W	YOR273C
	YBR038W	YFL038C	YJR032W	YLR417W	YOR332W
	YBR127C	YFR026C	YJR121W	YML034W	YPR156C
	YCR069W	YGR106C	YKL002W	YML078W	YPR165W
	YDL089W	YGR138C	YKL080W	YMR054W
	YDR304C	YHL006C	YKL203C	YMR089C
	YDR519W	YHR079C	YLR106C	YNL026W
GO terms	GO:0009987

37 in total

1. Sorin Draghici; Purvesh Khatri; Rui P Martins; G Charles Ostermeier; Stephen A Krawetz. Global functional profiling of gene expression. Genomics, 2003-02.

2. Jason Ernst; Gerard J Nau; Ziv Bar-Joseph. Clustering short time series gene expression data. Bioinformatics, 2005-06.

3. Purvesh Khatri; Sorin Drăghici. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics, 2005-06-30.

4. Zafer Barutcuoglu; Robert E Schapire; Olga G Troyanskaya. Hierarchical multi-label prediction of gene function. Bioinformatics, 2006-01-12.

5. Xing Chen; Gui-Ying Yan. Novel human lncRNA-disease association inference based on lncRNA expression profiles. Bioinformatics, 2013-09-02.

6. Xiong Li; Bo Liao; Lijun Cai; Zhi Cao; Wen Zhu. Informative SNPs selection based on two-locus and multilocus linkage disequilibrium: criteria of max-correlation and min-redundancy. IEEE/ACM Trans Comput Biol Bioinform, 2013 May-Jun.
