Gene function prediction based on combining gene ontology hierarchy with multi-instance multi-label learning.

Zejun Li1,2, Bo Liao1, Yun Li1, Wenhua Liu2, Min Chen1,2, Lijun Cai1.   

Abstract

Gene function annotation, an important part of genome annotation, is a main challenge of the post-genome era. The Human Genome Project produced whole-genome data that provide abundant biological information for the study of gene function annotation. To obtain useful knowledge from such a large amount of data, a potential strategy is to apply machine learning methods to mine these data and predict gene function. In this study, we improved multi-instance hierarchical clustering by using the gene ontology hierarchy to annotate gene function, combining the gene ontology hierarchy with the multi-instance multi-label learning framework. Then, we used the multi-label support vector machine (MLSVM) and multi-label k-nearest neighbor (MLKNN) algorithms to predict gene function. Finally, we verified our method on four yeast expression datasets. The experimental results showed that our method is efficient. This journal is © The Royal Society of Chemistry.

Year:  2018        PMID: 35542493      PMCID: PMC9083914          DOI: 10.1039/c8ra05122d

Source DB:  PubMed          Journal:  RSC Adv        ISSN: 2046-2069            Impact factor:   4.036


Introduction

In the post-genomic era, predicting the functions of genes is one of the biggest challenges of genome function annotation. With the rapid advancement of high-throughput biotechnologies, such as microarray expression profiling, a large amount of biological data has been produced.[1,2] These data provide valuable information for predicting gene functions. Recently, time-series gene expression profile datasets have been widely used to predict gene function, under the assumption that genes with similar expression patterns may have similar functions.[3] Many efforts have been made to address this task based on this assumption. Zhao et al.[4] presented a technique, namely, Annotating Genes with Positive Samples (AGPS), for defining negative samples in gene function prediction. Barutcuoglu et al.[5] developed a Bayesian framework for combining multiple classifiers based on the functional taxonomy constraints; in their experiments on a 105-node sub-hierarchy of the gene ontology (GO), the Bayesian framework improved predictions for 93 nodes. Vinayagam et al.[6] developed a large-scale annotation system in which annotations were provided through GO terms and multiple SVMs were applied to classify predictions as correct or false. Pei et al.[7] proposed a method for the function annotation of new biological sequences by using variable-precision rough set theory. Doniger et al.[8] proposed a tool called MAPPFinder, which creates a global gene-expression profile across all areas of biology by integrating the annotations of the GO project. Huang et al.[9] surveyed gene annotation enrichment analysis tools, collecting approximately 68 tools currently available in the community and categorizing them into three major classes according to their underlying enrichment algorithms. Zhang et al.[10] created a web-based tool for the analysis and visualization of gene sets called GOTree Machine (GOTM). Although this tool was originally intended to analyze sets of co-regulated genes identified from microarray analysis, it can be adapted to gene sets from other high-throughput analyses. Draghici et al.[11] developed Onto-Express, a tool capable of automatically translating differentially regulated genes into functional profiles that characterize the impact of the condition studied.

Despite the good performance of machine learning techniques, two characteristics distinguish the function-prediction task from common machine learning tasks: (1) a single gene may have multiple functions; and (2) the functions are organized in a hierarchy, i.e., a gene that is related to some function is automatically related to all of its ancestral functions (this is called the hierarchy constraint).[12] Therefore, we combined the multi-label learning framework with the gene ontology hierarchy[13] to address this task. In this study, we improved multi-instance hierarchical clustering (MIHC)[14] with the gene ontology hierarchy. Then, MLSVM and MLKNN classifiers were used to predict the functions of genes in time-course gene expression profiles.[1] Numerous classification methods have been used in bioinformatics.[15-18] In the Materials and methods section, we introduce the prediction task, the MLSVM and MLKNN algorithms, and the improved MIHC method. In the Experiment and results section, we describe the application of the method to real data from the GO database to examine its effectiveness. Numerical results show that the proposed method has better precision, recall and harmonic mean values.

Materials and methods

Gene function prediction task

The goal of the gene function annotation task is to find the functions of unannotated genes. The general approach is to compute the relationships between genes and the various functions using a variety of biological models, and then to predict the functions of the unannotated genes.[19-23] In terms of the correspondence between genes and their functions, a gene can be transcribed and translated into various proteins and can thus execute many different functions.[24-27] Similarly, an in vivo biological process is carried out not by a single gene, but by multiple genes working together.[15] Therefore, the relationship between genes and their corresponding functions is an N-to-N mapping. Among the learning frameworks, the multi-instance multi-label learning framework is well suited to N-to-N mappings. There is a certain degree of correlation among genes and among gene functions. However, the degeneration step in the multi-instance multi-label learning framework destroys these correlations. Therefore, it is necessary to preserve these correlations when the multi-instance multi-label learning framework is implemented.

Gene ontology hierarchy

The GO database is a standard model with a hierarchical structure, designed to standardize biological knowledge of genes and their products. Overall, the GO database is a directed acyclic graph (DAG) covering three branches: Biological Process (BP), Molecular Function (MF) and Cellular Component (CC). There is no intersection between any two of the three branches. Moreover, the GO database contains gene annotations for most microorganism, plant and animal species, and GO terms can be used in multiple databases, which keeps the description of each gene consistent across databases. The GO DAG treats GO terms as nodes and the relationships between GO terms as edges. It also borrows the vocabulary of tree structures, such as root node, parent nodes, child nodes, leaf nodes and levels, which makes the GO DAG easier to understand. In the GO DAG, parent nodes closer to the root are coarser, while child nodes further from the root are finer. Therefore, genes are annotated with GO terms at the highest possible level of detail, which corresponds to the lowest level of abstraction.[28]

There are three semantic relations between GO terms, namely, is_a, part_of and regulates. Among them, is_a and part_of are transitive, while regulates relationships can be divided into regulating and being regulated. We briefly describe these three relationships as follows. (1) The is_a relationship expresses simple inclusion: A is_a B means that A is a subset of B. It is transitive, i.e., A is_a B can be inferred from A is_a C and C is_a B. We formulate this derivation as is_a × is_a → is_a. (2) The part_of relationship is similar to the is_a relationship. A part_of B indicates that if A is present, then A is a part of B, but A does not necessarily occur. Like is_a, part_of is transitive, so its derivation can be formulated as part_of × part_of → part_of. (3) The regulates relationship is slightly different from the previous two. In GO semantics, if A can directly affect B, then A is said to have a regulatory role on B, i.e., A regulates B. The three relationships as expressed in the GO database are shown in Fig. 1, where the letter “I” represents the is_a relationship, “P” the part_of relationship and “R” the regulates relationship.
Fig. 1 Part of the gene ontology hierarchy relationships.
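
The transitivity of is_a and part_of is what makes the hierarchy constraint computable: the ancestors of a term are found by repeatedly following these two relations upwards. The following is a minimal sketch under stated assumptions. The term names and edges are hypothetical toy values, not taken from the GO database, and the helper names `build_parents`, `ancestors` and `propagate` are illustrative only.

```python
from collections import defaultdict

# Hypothetical toy edges (child, relation, parent); not real GO terms.
EDGES = [
    ("A", "is_a", "C"),
    ("C", "is_a", "B"),
    ("D", "part_of", "A"),
    ("E", "regulates", "B"),
]


def build_parents(edges, relations=("is_a", "part_of")):
    """Map each term to its direct parents, keeping only the transitive relations."""
    parents = defaultdict(set)
    for child, relation, parent in edges:
        if relation in relations:
            parents[child].add(parent)
    return parents


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links upwards."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(annotations, parents):
    """Hierarchy constraint: a gene annotated with a term also receives all ancestors of that term."""
    result = {}
    for gene, terms in annotations.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        result[gene] = full
    return result


PARENTS = build_parents(EDGES)
EXAMPLE = propagate({"gene1": {"D"}}, PARENTS)
```

Here the regulates edge is deliberately excluded from the upward walk, since only is_a and part_of are described above as transitive.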

MIML learning framework

The multi-instance multi-label learning (MIML) framework was proposed by Zhou et al.[29] Formally, MIML can be defined as follows: let x = {x1, x2, …, xn} represent the set of instances and y = {y1, y2, …, ym} denote the set of labels. Given the dataset {(X1,Y1), (X2,Y2), …, (XN,YN)}, the goal of the learning task is to learn a mapping f: 2^x → 2^y, where Xi ⊂ x is a bag of instances and Yi ⊂ y is a subset of labels. In this study, we solved the MIML task by a degeneration approach. First, the MIML task was degenerated into multi-instance learning (MIL) or multi-label learning (MLL). Then, MIL or MLL was further degenerated into single-instance single-label learning (SISL). The relationships among these learning frameworks are shown in Fig. 2. As shown in subgraph (b), a gene is annotated by multiple GO terms. Fig. 2(c) shows that a GO term is a set of genes. The relationship shown in (b) is called multi-label,[30] and that shown in (c) is called multi-instance.[31] We used (b) or (c) as a bridge to degenerate (d) into (a). In this study, we used MLL algorithms to predict the annotation of novel genes. The pseudo code of the MLL algorithms, MLSVM[32,33] and MLKNN,[34,35] is shown in Tables 1 and 2.
Fig. 2 Four types of machine learning frames.
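
The multi-label view (a gene annotated by several GO terms) and the multi-instance view (a GO term as a set of genes) are two readings of the same gene-to-term relation. A minimal sketch of switching between them is given below; the gene and term identifiers are hypothetical placeholders, and `invert` is an illustrative helper name rather than part of any published implementation.

```python
from collections import defaultdict

# Multi-label view: each gene maps to the set of GO terms annotating it.
# Identifiers are hypothetical placeholders.
GENE_TO_TERMS = {
    "gene1": {"termA", "termB"},
    "gene2": {"termB"},
    "gene3": {"termA", "termC"},
}


def invert(gene_to_terms):
    """Build the multi-instance view: each GO term maps to its bag of genes."""
    term_to_genes = defaultdict(set)
    for gene, terms in gene_to_terms.items():
        for term in terms:
            term_to_genes[term].add(gene)
    return dict(term_to_genes)


TERM_TO_GENES = invert(GENE_TO_TERMS)
```

Inverting twice returns the original mapping, which illustrates that the two views carry the same N-to-N information.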

Table 1 The MLSVM algorithm

Input: S, the training set; T, the test samples
Output: YT, the set of predicted labels of T
For training set S = {(xi,Yi) | i = 1,2,…,N}, calculate the kernel matrix
For each label y ∈ Y, Y = {Yi | i = 1,2,…,N}
    Produce a sub-training set Sy = {(xi,ψ(xi,y)) | i = 1,2,…,N}
    Train an SVM model My = svmtrain(Sy)
    For a test sample ti ∈ T
        Its labels are obtained by Yti = {y | My(ti) ≥ 0}
    End
End
YT = {Yti | i = 1,2,…,N}
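
The per-label decomposition in the MLSVM pseudo code can be sketched as follows. This is only an illustration of the decomposition, not of the SVM itself: a nearest-centroid scorer is swapped in for svmtrain so that the example stays dependency-free, and all data and function names are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def train_binary(samples, targets):
    """Stand-in for svmtrain (an assumption): returns a scorer where score >= 0 means positive."""
    pos = [x for x, t in zip(samples, targets) if t]
    neg = [x for x, t in zip(samples, targets) if not t]
    if not pos:
        return lambda x: -1.0
    if not neg:
        return lambda x: 1.0
    c_pos, c_neg = centroid(pos), centroid(neg)
    # Positive when x is closer to the positive centroid than to the negative one.
    return lambda x: math.dist(x, c_neg) - math.dist(x, c_pos)


def train_multilabel(training_set):
    """One binary model per label; training_set is a list of (x, Y) with Y a set of labels."""
    xs = [x for x, _ in training_set]
    labels = set().union(*(Y for _, Y in training_set))
    return {y: train_binary(xs, [y in Y for _, Y in training_set]) for y in sorted(labels)}


def predict(models, t):
    """Collect every label whose binary model scores t at or above zero."""
    return {y for y, model in models.items() if model(t) >= 0}


# Hypothetical toy data: 2-D feature vectors with label sets.
S = [
    ((0.0, 0.0), {"a"}),
    ((0.1, 0.2), {"a", "b"}),
    ((1.0, 1.0), {"b"}),
    ((0.9, 1.1), {"b", "c"}),
]
MODELS = train_multilabel(S)
```

Replacing `train_binary` with a real SVM trainer would leave `train_multilabel` and `predict` unchanged, which is the point of the decomposition.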

Table 2 The MLKNN algorithm

Input: S, the training set; T, the test samples
Output: YT, the set of predicted labels of T
For a test sample ti ∈ T
    Calculate Sti ∈ KNN(ti), the k-nearest neighbors of ti in S
    The candidate classes of ti are obtained by Ytc = {y | y ∈ Y and ψ(Sti,Y) = 1}
    For each label y ∈ Ytc
        Calculate simScore(ti,si), the similarity score of si to ti
        Calculate the likelihood score Score(ti,y) of ti to y
        ti is labeled by Yti = {y | Score(ti,y) ≥ 0}
    End
End
YT = {Yti | i = 1,2,…,N}
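
A simplified neighbor-vote reading of the MLKNN pseudo code is sketched below. The likelihood score used here is an assumption made for illustration (the similarity-weighted share of neighbors carrying a label, shifted by a threshold so that the decision rule is Score ≥ 0); it is not the score used in the cited MLKNN work. The similarity function and the toy data are likewise hypothetical.

```python
import math


def knn(t, training_set, k):
    """Return the k training pairs (x, Y) whose x is closest to t in Euclidean distance."""
    return sorted(training_set, key=lambda pair: math.dist(t, pair[0]))[:k]


def sim_score(t, x):
    """Assumed similarity: decreases from 1 towards 0 as the distance grows."""
    return 1.0 / (1.0 + math.dist(t, x))


def predict_labels(t, training_set, k=3, threshold=0.5):
    """Label t with every candidate label whose weighted neighbor support reaches the threshold."""
    neighbors = knn(t, training_set, k)
    # Candidate labels are those carried by at least one neighbor.
    candidates = set().union(*(Y for _, Y in neighbors))
    total = sum(sim_score(t, x) for x, _ in neighbors)
    predicted = set()
    for y in candidates:
        support = sum(sim_score(t, x) for x, Y in neighbors if y in Y)
        score = support / total - threshold
        if score >= 0:
            predicted.add(y)
    return predicted


# Hypothetical toy data: 2-D feature vectors with label sets.
S = [
    ((0.0, 0.0), {"a"}),
    ((0.1, 0.2), {"a", "b"}),
    ((1.0, 1.0), {"b"}),
    ((0.9, 1.1), {"b", "c"}),
]
```

With these toy values, a query at the origin with k = 3 draws its candidates from the three nearest samples only, so label "c" is never considered.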

MIHC+ algorithm

Although the MIHC algorithm has advanced the gene annotation task, it has some limitations. MIHC does not consider the GO DAG when it clusters GO terms. In this study, we addressed this issue by using the GO hierarchy when GO terms are clustered. Hierarchical clustering[36] is a widely used machine learning technique. The general hierarchical clustering algorithm can be described as follows. Step 1: determine the dissimilarities of all objects by calculating the distance between each pair of objects, e.g., the Euclidean distance. Step 2: select the two closest objects or clusters and merge them into one class. Step 3: recalculate all dissimilarities between the new clusters or objects. Step 4: return to Step 2 until certain conditions are satisfied or a certain number of clusters has been generated. Clustering all objects into a single class would be meaningless; hence, the algorithm is stopped when certain conditions are satisfied. The end condition of the MIHC algorithm is that no new cluster is generated. We still used the formulae in MIHC to calculate the distance between bags of instances. The flowchart of the MIHC+ algorithm is shown in Fig. 3. In the algorithm, the merge condition is the most important part. As in MIHC, we defined the merge condition as D(GOi,GOj) ≤ max(D(GOi),D(GOj)), but the MIHC+ algorithm requires a second condition to merge two GO terms. Following the GO hierarchy, we propagated the two GO terms upwards; if they share a common ancestor in the GO database, we merge them. If no more GO terms need to be merged, the algorithm comes to an end. In other words, the MIHC+ algorithm fully respects the knowledge encoded in the GO hierarchy.
Fig. 3 MIHC+ algorithm flowchart.
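
The four generic clustering steps, together with a pluggable merge condition, can be sketched as follows. This is an illustrative sketch only: it uses single-linkage Euclidean distance instead of the MIHC bag distance, it tries candidate pairs from closest to farthest (an assumption about how a rejected pair is handled), and the `can_merge` predicate is a hypothetical hook where a distance condition and a common-ancestor check could be combined.

```python
import math
from itertools import combinations


def cluster_distance(a, b):
    """Single-linkage distance: smallest Euclidean distance between members of a and b."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerate(points, can_merge):
    """Merge clusters, closest admissible pair first, until no new cluster is generated."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        # Steps 1 and 3: (re)compute the dissimilarity of every pair of clusters.
        pairs = sorted(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        # Step 2: merge the closest pair that satisfies the merge condition.
        for i, j in pairs:
            if can_merge(clusters[i], clusters[j]):
                merged = clusters[i] + clusters[j]
                clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
                clusters.append(merged)
                break
        else:
            # Step 4 end condition: no pair could be merged, so stop.
            break
    return clusters


# Hypothetical 1-D toy points and a simple distance-threshold merge condition.
POINTS = [(0.0,), (0.2,), (5.0,), (5.1,)]
RESULT = agglomerate(POINTS, lambda a, b: cluster_distance(a, b) <= 1.0)
```

On the toy points the two tight pairs are merged and the loop then stops, because the remaining pair of clusters fails the merge condition.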

Experiment and results

Experiment

The time-series expression datasets used in the experiment were obtained from ref. 37 and can be downloaded from ref. 38. These four datasets are yeast cell cycle expression data with different time points and circumstances. Gene annotation data were obtained from the GO database, which can be downloaded from ref. 39. We used the method in ref. 40 to preprocess the raw data so that the first value is always 0. Then, the average transformation t = (t + t)/2 was used to smooth out spikes. After this data processing, we used the method in ref. 14 to select genes that are significantly correlated with each other within the same function. The resulting noise-free system of expression data and annotations is represented as S = {(Gi,GOi) | i = 1,…,M}. Subsequently, the MIHC+ algorithm was used to construct the learning system. Finally, MLSVM and MLKNN classifiers were used to verify the performance of the learning system. The flow chart of the gene function prediction is shown in Fig. 4.
Fig. 4 The flow chart of gene function prediction.

Leave-one-out and leave-a-percent-out cross validation[41] are two approaches for evaluating the performance of a function prediction algorithm. We selected the latter to evaluate the MIHC+ method. To measure the performance accurately, the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) were used to quantify the results. Classification is often based on continuous random variables, so the predicted class membership varies with the decision threshold. In other words, the true and false positive rates (TPR and FPR, respectively) vary with the threshold. The ROC curve plots TPR versus FPR parametrically as the threshold varies. TPR and FPR were calculated by eqn (3) and (4):

TPR = TP/(TP + FN)    (3)

FPR = FP/(FP + TN)    (4)

where TP, FP, TN and FN represent the numbers of true positive, false positive, true negative and false negative predictions, respectively. TPR is therefore the sensitivity of the prediction, and FPR equals one minus its specificity. AUC was calculated to summarize the ROC curve. A reliable and valid AUC estimate can be interpreted as the probability that the classifier will assign a higher score to a randomly chosen positive sample than to a randomly chosen negative sample.
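
The quantities above can be computed directly from a list of classifier scores and true labels. The sketch below is a minimal illustration: the scores and labels are hypothetical, and the AUC is computed from its probabilistic interpretation (the fraction of positive/negative pairs ranked correctly, with ties counted as one half) rather than by integrating the ROC curve.

```python
def rates(scores, labels, threshold):
    """Return (TPR, FPR) when samples with score >= threshold are predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and not y)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


def auc(scores, labels):
    """Probability that a random positive sample is scored above a random negative one."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores and ground-truth labels (1 = positive, 0 = negative).
SCORES = [0.9, 0.8, 0.4, 0.3]
LABELS = [1, 0, 1, 0]
```

Sweeping `threshold` over the observed scores and collecting the resulting (FPR, TPR) pairs traces out the ROC curve.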

Results

The four yeast time-course expression datasets are alpha, cdc15, cdc28 and elution, which record mRNA levels at 18, 24, 17 and 14 time points, respectively, over the whole cell cycle under different circumstances. For each expression dataset, MIHC+ and three other methods (GNC, GOLC and MIHC) from ref. 14 were used to construct learning systems. Then, all learning systems were tested with the MLSVM and MLKNN classifiers. In the classification task, the multi-label learning task is decomposed into a series of binary classification tasks. The experimental settings are the same as those in ref. 14. For each expression dataset, the average results obtained from each learning system by the MLSVM classifier are shown in Fig. 5–8. The data in these figures indicate that the MIHC+ learning system performs similarly to MIHC. The results for the cdc28 dataset are shown in Table 3. However, the MIHC+ method can give more biological information. From the MIHC+ learning system, we found that the GO term GO:0009987 appears in all of these datasets, and that only 7 genes annotated with GO:0009987 appear in both the cdc28 and cdc15 datasets. In GO, GO:0009987, named “cellular process”, is defined as “any process that is carried out at the cellular level, but not necessarily restricted to a single cell”. For example, cell communication occurs among more than one cell, but at the cellular level.
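
The overlap between two datasets can be checked with a simple set intersection. The sketch below uses the cdc15 and cdc28 gene lists reported in the tables of this article; since those tables list only some of the genes, the result illustrates the computation rather than reproducing the full analysis.

```python
# Gene identifiers copied from the cdc15 and cdc28 tables of this article.
CDC15 = set("""
YBR048W YGL030W YKL006W YLR333C YOL120C
YDL061C YGL103W YKR057W YLR367W YOL127W
YDL083C YGR034W YKR094C YLR388W YOR063W
YDR064W YGR214W YLR075W YML073C YOR167C
YDR418W YHR203C YLR167W YML091C YPL131W
YER102W YIL069C YLR185W YNL209W
YFR031C-A YIL133C YLR264W YOL040C
""".split())

CDC28 = set("""
YBL027W YDL191W YJR145C YLR367W YOR312C
YBL087C YDR025W YKL180W YNL162W YPL143W
YBR048W YDR064W YKR057W YNL302C YPL198W
YBR084C-A YDR447C YKR094C YOL120C YPR132W
YBR181C YHL001W YLR185W YOL121C
YDL075W YJL189W YLR287C-A YOR234C
""".split())

# Genes listed for both environments, in sorted order.
SHARED = sorted(CDC15 & CDC28)
```

For the listed genes, `SHARED` contains seven identifiers.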

Table 3 The results of the cdc28 dataset by MLSVM

Method            n%
                  10     20     30     40     50     60     70     80     90
GNC    λ = 10     0.559  0.606  0.631  0.644  0.656  0.671  0.670  0.679  0.703
       λ = 20     0.569  0.603  0.607  0.625  0.638  0.646  0.650  0.650  0.661
       λ = 30     0.571  0.583  0.599  0.612  0.618  0.629  0.628  0.632  0.646
       λ = 40     0.529  0.543  0.552  0.567  0.577  0.574  0.578  0.589  0.602
       λ = 50     0.532  0.535  0.558  0.569  0.572  0.579  0.588  0.592  0.596
GOLC   ι = 1      0.594  0.617  0.634  0.635  0.648  0.643  0.641  0.651  0.653
       ι = 2      0.609  0.623  0.624  0.631  0.624  0.632  0.635  0.640  0.643
       ι = 3      0.601  0.644  0.657  0.658  0.654  0.661  0.656  0.643  0.668
       ι = 4      0.601  0.638  0.647  0.651  0.654  0.663  0.658  0.656  0.662
MIHC              0.621  0.666  0.727  0.767  0.800  0.794  0.817  0.828  0.838
MIHC+             0.644  0.665  0.735  0.740  0.796  0.811  0.822  0.831  0.840

The results of the experiments on the four datasets indicate that the genes involved in the same biological process may vary with the external environment. Ref. 37 supports this view: when yeast cells are in different growth environments, they automatically turn the expression of certain genes on or off in order to adapt to the external environment. We present some GO terms and their genes for the four datasets in Tables 4–7. In these tables, every gene listed in the “Genes” cell is annotated with the GO terms listed in the corresponding “GO terms” cell.

Conclusion

In this study, we improved the MIHC method with the gene ontology hierarchy (the MIHC+ method) to construct a learning system. Our method was verified on four yeast gene expression datasets. The MIHC+ method treats the gene ontology hierarchy as the relationship between gene annotations, and hierarchical clustering then follows the GO hierarchy to cluster them. Compared with the other learning systems employed in this study, the MIHC+ method obtained more biological knowledge from the time-series expression datasets, while performing similarly to the MIHC method. In future research, we will combine gene annotation information with other biological information (e.g., single nucleotide polymorphisms[42,43] and miRNA[44-52]) to diagnose complex diseases more accurately.

Conflicts of interest

There are no conflicts to declare.

Table 4 Some GO terms and their genes in the alpha dataset

Environment  alpha
Genes        YBR189W     YGL189C     YGR214W     YJR123W     YOL121C
             YER025W     YGR094W     YGR285C     YNL178W
             YGL123W     YGR118W     YHR064C     YNL209W
GO terms     GO:0008152  GO:0009987  GO:0044237  GO:0044238  GO:0071704

Table 5 Some GO terms and their genes in the cdc15 dataset

Environment  cdc15
Genes        YBR048W     YGL030W     YKL006W     YLR333C     YOL120C
             YDL061C     YGL103W     YKR057W     YLR367W     YOL127W
             YDL083C     YGR034W     YKR094C     YLR388W     YOR063W
             YDR064W     YGR214W     YLR075W     YML073C     YOR167C
             YDR418W     YHR203C     YLR167W     YML091C     YPL131W
             YER102W     YIL069C     YLR185W     YNL209W
             YFR031C-A   YIL133C     YLR264W     YOL040C
GO terms     GO:0000462  GO:0006396  GO:0016072  GO:0042274  GO:0071704
             GO:0000469  GO:0006725  GO:0022613  GO:0043170  GO:0071840
             GO:0000478  GO:0006807  GO:0030490  GO:0044085  GO:0090304
             GO:0000479  GO:0008152  GO:0034470  GO:0044237  GO:0090305
             GO:0000480  GO:0009987  GO:0034641  GO:0044238  GO:0090501
             GO:0006139  GO:0010467  GO:0034660  GO:0044260  GO:0090502
             GO:0006364  GO:0016070  GO:0042254  GO:0046483  GO:1901360

Table 6 Some GO terms and their genes in the cdc28 dataset

Environment  cdc28
Genes        YBL027W     YDL191W     YJR145C     YLR367W     YOR312C
             YBL087C     YDR025W     YKL180W     YNL162W     YPL143W
             YBR048W     YDR064W     YKR057W     YNL302C     YPL198W
             YBR084C-A   YDR447C     YKR094C     YOL120C     YPR132W
             YBR181C     YHL001W     YLR185W     YOL121C
             YDL075W     YJL189W     YLR287C-A   YOR234C
GO terms     GO:0009987  GO:0044699  GO:0044763

Table 7 Some GO terms and their genes in the elution dataset

Environment  elution
Genes        YBL047C     YEL048C     YJL154C     YLR361C     YNL192W
             YBL099W     YER096W     YJR017C     YLR371W     YOR273C
             YBR038W     YFL038C     YJR032W     YLR417W     YOR332W
             YBR127C     YFR026C     YJR121W     YML034W     YPR156C
             YCR069W     YGR106C     YKL002W     YML078W     YPR165W
             YDL089W     YGR138C     YKL080W     YMR054W
             YDR304C     YHL006C     YKL203C     YMR089C
             YDR519W     YHR079C     YLR106C     YNL026W
GO terms     GO:0009987
  37 in total

1.  Global functional profiling of gene expression.

Authors:  Sorin Draghici; Purvesh Khatri; Rui P Martins; G Charles Ostermeier; Stephen A Krawetz
Journal:  Genomics       Date:  2003-02       Impact factor: 5.736

2.  Clustering short time series gene expression data.

Authors:  Jason Ernst; Gerard J Nau; Ziv Bar-Joseph
Journal:  Bioinformatics       Date:  2005-06       Impact factor: 6.937

3.  Ontological analysis of gene expression data: current tools, limitations, and open problems.

Authors:  Purvesh Khatri; Sorin Drăghici
Journal:  Bioinformatics       Date:  2005-06-30       Impact factor: 6.937

4.  Hierarchical multi-label prediction of gene function.

Authors:  Zafer Barutcuoglu; Robert E Schapire; Olga G Troyanskaya
Journal:  Bioinformatics       Date:  2006-01-12       Impact factor: 6.937

5.  Novel human lncRNA-disease association inference based on lncRNA expression profiles.

Authors:  Xing Chen; Gui-Ying Yan
Journal:  Bioinformatics       Date:  2013-09-02       Impact factor: 6.937

6.  Informative SNPs selection based on two-locus and multilocus linkage disequilibrium: criteria of max-correlation and min-redundancy.

Authors:  Xiong Li; Bo Liao; Lijun Cai; Zhi Cao; Wen Zhu
Journal:  IEEE/ACM Trans Comput Biol Bioinform       Date:  2013 May-Jun       Impact factor: 3.710

7.  Learning a weighted meta-sample based parameter free sparse representation classification for microarray data.

Authors:  Bo Liao; Yan Jiang; Guanqun Yuan; Wen Zhu; Lijun Cai; Zhi Cao
Journal:  PLoS One       Date:  2014-08-12       Impact factor: 3.240

8.  EGBMMDA: Extreme Gradient Boosting Machine for MiRNA-Disease Association prediction.

Authors:  Xing Chen; Li Huang; Di Xie; Qi Zhao
Journal:  Cell Death Dis       Date:  2018-01-05       Impact factor: 8.469

9.  Predicting gene function using hierarchical multi-label decision tree ensembles.

Authors:  Leander Schietgat; Celine Vens; Jan Struyf; Hendrik Blockeel; Dragi Kocev; Saso Dzeroski
Journal:  BMC Bioinformatics       Date:  2010-01-02       Impact factor: 3.169

10.  Applying Support Vector Machines for Gene Ontology based gene function prediction.

Authors:  Arunachalam Vinayagam; Rainer König; Jutta Moormann; Falk Schubert; Roland Eils; Karl-Heinz Glatting; Sándor Suhai
Journal:  BMC Bioinformatics       Date:  2004-08-26       Impact factor: 3.169
