Xing-Ming Zhao, Yong Wang, Luonan Chen, Kazuyuki Aihara.
Abstract
BACKGROUND: Gene function prediction is generally formalized as a classification problem solved with machine learning techniques. Usually, both labeled positive and negative samples are needed to train the classifier. For gene function prediction, however, only positive samples are available. In other words, we know which genes have the function of interest, but it is generally unclear which genes do not have it, i.e. the negative samples. If all genes outside the target functional family are treated as negative samples, a class imbalance problem arises because only a relatively small number of genes are annotated in each family. Furthermore, the classifier may be degraded by false negatives in the heuristically generated negative samples.
Year: 2008 PMID: 18221567 PMCID: PMC2275242 DOI: 10.1186/1471-2105-9-57
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
The functional categories and genes used in this paper
| Functional categories | Number of genes |
| 01 metabolism | 967 |
| 02 energy | 241 |
| 10 cell cycle and DNA processing | 727 |
| 11 transcription | 829 |
| 12 protein synthesis | 364 |
| 14 protein fate | 680 |
| 20 cellular transport | 726 |
| 30 cellular communication | 86 |
| 32 cell rescue, defense and virulence | 307 |
| 34 interaction with the environment | 332 |
| 40 cell fate | 201 |
| 42 biogenesis of cellular components | 471 |
| 43 cell type differentiation | 354 |
The number of features used for each class in the paper
| Functional categories | AGPS | PSoL | one-class SVMs | two-class SVMs | two-class SVMs_balanced |
| 01 metabolism | 295 | 110 | 10 | 10 | 10 |
| 02 energy | 115 | 210 | 10 | 60 | 145 |
| 10 cell cycle and DNA processing | 175 | 160 | 10 | 10 | 10 |
| 11 transcription | 280 | 210 | 10 | 260 | 10 |
| 12 protein synthesis | 25 | 160 | 260 | 260 | 10 |
| 14 protein fate | 160 | 160 | 10 | 10 | 295 |
| 20 cellular transport | 190 | 260 | 10 | 60 | 10 |
| 30 cellular communication | 70 | 160 | 10 | 110 | 70 |
| 32 cell rescue, defense and virulence | 295 | 160 | 10 | 110 | 250 |
| 34 interaction with the environment | 85 | 210 | 10 | 10 | 295 |
| 40 cell fate | 55 | 260 | 10 | 10 | 100 |
| 42 biogenesis of cellular components | 25 | 260 | 10 | 10 | 190 |
| 43 cell type differentiation | 40 | 210 | 10 | 10 | 40 |
Annotating Genes with Positive Samples (AGPS)
| - positive training data |
| - validation set |
| - unlabeled data |
| - unknown gene |
| - Prediction results |
| Stage 1.1: Initial negative set generation |
| - Construct classifier |
| - Classify |
| Stage 1.2: Negative set expansion |
| - Classifier set |
| - repeat |
| - Construct classifier |
| - Classify |
| - until |
| Stage 1.3: Classifier and negative set selection |
| - Classify |
| - Return negative set |
| Classify |
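The staged outline above can be sketched in a few lines of Python. This is only an illustration of the control flow (seed a negative set, expand it iteratively, then select a classifier and negative set), not the authors' implementation. Several details are assumptions, because the formulas of the algorithm table did not survive here: a nearest-centroid rule stands in for the SVM, the initial negatives are taken to be the unlabeled points farthest from the positive centroid, expansion stops when no new negatives are found, and selection uses recall on a validation set assumed to contain only known positives. The names `agps_sketch`, `make_classifier`, `n_initial` and `max_rounds` are hypothetical.

```python
import math

def centroid(points):
    """Mean vector of a non-empty list of equal-length feature vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def make_classifier(positives, negatives):
    """Nearest-centroid stand-in for the SVM; True means 'predicted positive'."""
    cp, cn = centroid(positives), centroid(negatives)
    def predict(x):
        return math.dist(x, cp) <= math.dist(x, cn)
    return predict

def agps_sketch(positives, validation, unlabeled, n_initial=2, max_rounds=10):
    # Stage 1.1 (assumed heuristic): seed the negative set with the unlabeled
    # points farthest from the centroid of the positive training data.
    cp = centroid(positives)
    ranked = sorted(unlabeled, key=lambda x: math.dist(x, cp), reverse=True)
    negatives, remaining = ranked[:n_initial], ranked[n_initial:]

    # Stage 1.2: retrain, then move newly predicted negatives out of the
    # unlabeled pool; stop when a round finds no new negatives (assumed rule).
    history = []
    for _ in range(max_rounds):
        clf = make_classifier(positives, negatives)
        history.append((clf, list(negatives)))
        new_negatives = [x for x in remaining if not clf(x)]
        if not new_negatives:
            break
        negatives = negatives + new_negatives
        remaining = [x for x in remaining if clf(x)]

    # Stage 1.3 (assumed criterion): keep the classifier with the best recall
    # on the validation positives; ties go to the larger negative set.
    def recall(clf):
        return sum(clf(x) for x in validation) / len(validation)
    return max(history, key=lambda pair: (recall(pair[0]), len(pair[1])))

# Toy two-dimensional data, invented for illustration only.
toy_positives = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]
toy_validation = [[1.1, 1.0]]
toy_unlabeled = [[1.0, 1.2], [5.0, 5.0], [5.2, 4.8], [4.9, 5.1], [0.8, 0.9]]
classifier, negative_set = agps_sketch(toy_positives, toy_validation, toy_unlabeled)
prediction = classifier([1.05, 0.95])  # final step: classify an unknown gene
```

Keeping every intermediate classifier together with its negative set mirrors the "Classifier set" step of the outline, so that the selection stage can choose among them rather than always taking the last one.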
The results of 10-fold cross-validation by the five methods averaged over 13 classes
| Methods | | | |
| AGPS | 68 | 61 | 61 |
| PSoL | 68 | 37 | 47 |
| two-class SVMs | 45 | 24 | 33 |
| two-class SVMs_balanced | 61 | 70 | 69 |
| one-class SVMs | 50 | 21 | 31 |
| kernel integration | 58 | 28 | 37 |
| kernel integration_balanced | 64 | 47 | 52 |
The prediction results by the five methods averaged over 13 classes
| Methods | | | | | coverage |
| AGPS | 15 | 66 | 22 | 0.61 | 13 (13) |
| PSoL | 20 | 18 | 19 | 0.55 | 12 (13) |
| two-class SVMs | 28 | 10 | 16 | 0.53 | 11 (13) |
| two-class SVMs_balanced | 18 | 36 | 29 | 0.57 | 10 (13) |
| one-class SVMs | 10 | 42 | 15 | 0.53 | 13 (13) |
| kernel integration | 39 | 16 | 23 | 0.56 | 11 (13) |
| kernel integration_balanced | 11 | 32 | 24 | 0.59 | 6 (13) |
The number in parentheses is the true number of functional classes, whereas the number outside is the number of classes that the corresponding method can predict.
Figure 1. The number of genes predicted correctly for the 13 functional classes. Prediction results obtained by the five methods: AGPS, PSoL, two-class SVMs, one-class SVMs and kernel integration, where two-class SVMs_balanced denotes two-class SVMs trained on balanced data, and likewise for kernel integration. The height of each bar is the number of genes that the corresponding method recovers correctly from the unlabeled genes for each functional class.
Figure 2. Comparison of the five methods class by class. Performance of the five methods, where two-class SVMs_balanced denotes two-class SVMs trained on balanced data, and likewise for kernel integration. For each ROC score threshold, the number of classes reaching that threshold is counted; a higher curve means a better result.
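The curve described in this caption can be computed as a simple count per threshold. The sketch below assumes that a class counts toward a threshold when its ROC score is greater than or equal to it; the function name `classes_above_threshold` is hypothetical, and the scores are made-up placeholders, not values from the paper.

```python
def classes_above_threshold(roc_scores, thresholds):
    """For each threshold, count the classes whose ROC score is >= that threshold."""
    return [sum(score >= t for score in roc_scores.values()) for t in thresholds]

# Made-up ROC scores for three hypothetical classes (not values from the paper).
example_scores = {"class A": 0.62, "class B": 0.55, "class C": 0.71}
curve = classes_above_threshold(example_scores, [0.5, 0.6, 0.7])
print(curve)  # [3, 2, 1]
```

Plotting one such list per method against the thresholds gives curves of the kind the caption describes, where a method whose curve lies higher reaches each ROC score in more classes.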
Figure 3. Comparison of AGPS and PSoL class by class. Performance of the two single-class methods, AGPS and PSoL, compared class by class using the ROC scores obtained for each functional class.
Predicted annotations by AGPS algorithm versus annotations from GO
| MIPS functional categories | Gene ontology | genes annotated by GO | genes predicted by AGPS that match GO annotation |
| 01 metabolism | GO:0008152 | YEL044W YHL029C YGL185C YMR010W | YHL029C YGL185C YMR010W |
| 10 cell cycle and DNA processing | GO:0007067 GO:0006260 GO:0006281 | YDR106W YGL168W YER038C | YDR168W YER038C |
| 11 transcription | GO:0006364 GO:0006396 | YLR196W YLR204W | YLR196W |
| 12 protein synthesis | GO:0043037 | YFR032C YLR287C | YFR032C YLR287C |
| 14 protein fate | GO:0006457 | YNL310C | YNL310C |
| 20 cellular transport | GO:0006888 | YDL099W | YDL099W |
| 32 cell rescue, defense and virulence | GO:0006974 GO:0006950 GO:0006979 | YOL063C YMR251W YDR346C | YOL063C YMR251W YDR346C |
| 42 biogenesis of cellular components | GO:0019898 GO:0007005 GO:0007047 | YPL005W YDR339C YNL149C YKR100C YNL310C YOR060C | YKR100C YOR060C |
Predicted terms versus GO terms; only the predicted annotations that match GO terms for the corresponding MIPS annotations are shown.
Figure 4. Schematic flow chart of the proposed method. First, the protein interaction data, gene expression profiles and protein complex data for yeast genes are integrated into one functional linkage graph. Then, the SVD technique is used to project the gene vectors into a low-dimensional feature space by uncovering the dominant structure of the functional linkage graph. Finally, the AGPS algorithm is used to predict the functions of genes.
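The projection step can be illustrated with a small, dependency-free sketch. It finds the leading right singular vectors of a linkage matrix by power iteration with deflation and uses them as low-dimensional coordinates for each gene. This is an assumption-laden illustration rather than the authors' pipeline: how the three data sources are weighted into one graph and how many dimensions are kept are not stated here, the 4-gene `linkage` matrix is invented, and the sketch assumes the matrix has at least `k` non-zero singular values. The function names are hypothetical.

```python
import math

def matvec(m, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def top_right_singular_vectors(a, k, iters=200):
    """Top-k right singular vectors of `a`, by power iteration on A^T A."""
    at = [list(col) for col in zip(*a)]
    n = len(at)
    vectors = []
    for _ in range(k):
        v = [1.0 / (i + 1) for i in range(n)]  # fixed, asymmetric start vector
        for _ in range(iters):
            # Deflation: remove the components along vectors already found.
            for u in vectors:
                dot = sum(x * y for x, y in zip(v, u))
                v = [x - dot * y for x, y in zip(v, u)]
            w = matvec(at, matvec(a, v))
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        vectors.append(v)
    return vectors

def project(a, vectors):
    """Coordinates of each row of `a` in the space spanned by `vectors`."""
    return [[sum(x * y for x, y in zip(row, v)) for v in vectors] for row in a]

# Invented linkage weights for four hypothetical genes forming two groups.
linkage = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
]
basis = top_right_singular_vectors(linkage, k=2)
gene_features = project(linkage, basis)  # one 2-dimensional vector per gene
```

In this toy graph, genes 0 and 1 receive the same coordinates and genes 2 and 3 receive the same coordinates, which is the sense in which the projection exposes the dominant structure of the graph. In practice a library routine for truncated SVD would replace the hand-written power iteration.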