Abstract
BACKGROUND: Recent biochemical advances have led to inexpensive, time-efficient production of massive volumes of raw genomic data. Traditional machine learning approaches to genome annotation typically rely on large amounts of labeled data. The process of labeling data can be expensive, as it requires domain knowledge and expert involvement. Semi-supervised learning approaches that can make use of unlabeled data, in addition to small amounts of labeled data, can help reduce the costs associated with labeling. In this context, we focus on the problem of predicting splice sites in a genome using semi-supervised learning approaches. This is a challenging problem, due to the highly imbalanced distribution of the data, i.e., the small number of splice sites compared to the number of non-splice sites. To address this challenge, we propose to use ensembles of semi-supervised classifiers, specifically self-training and co-training classifiers.
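The self-training idea the abstract refers to can be sketched as follows. This is a minimal, self-contained illustration — the toy centroid learner, the confidence heuristic, and the function names (`self_train`, `train_centroid`, `predict_proba`) are hypothetical stand-ins, not the authors' actual base classifiers or ensemble setup:

```python
def train_centroid(X, y):
    """Toy base learner: per-class centroids over numeric feature vectors."""
    cents = {}
    for c in set(y):
        pts = [x for x, lab in zip(X, y) if lab == c]
        cents[c] = [sum(col) / len(pts) for col in zip(*pts)]
    return cents

def predict_proba(cents, x):
    """Predict the nearest-centroid class with an inverse-distance confidence."""
    dists = {c: sum((a - b) ** 2 for a, b in zip(x, m)) ** 0.5
             for c, m in cents.items()}
    best = min(dists, key=dists.get)
    total = sum(1.0 / (d + 1e-9) for d in dists.values())
    return best, (1.0 / (dists[best] + 1e-9)) / total

def self_train(X_lab, y_lab, X_unlab, threshold=0.8, max_iter=10):
    """Self-training loop: repeatedly train on the labeled set, pseudo-label
    the most confident unlabeled instances, and fold them into the training
    data until nothing clears the confidence threshold."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_iter):
        model = train_centroid(X_lab, y_lab)
        confident = [(x,) + (predict_proba(model, x)[0],)
                     for x in pool if predict_proba(model, x)[1] >= threshold]
        if not confident:
            break
        for x, lab in confident:
            X_lab.append(x)
            y_lab.append(lab)
            pool.remove(x)
    return train_centroid(X_lab, y_lab)
```

Co-training follows the same loop but trains two learners on two different feature views, each labeling instances for the other.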
Year: 2015 PMID: 26356316 PMCID: PMC4565116 DOI: 10.1186/1752-0509-9-S5-S1
Source DB: PubMed Journal: BMC Syst Biol ISSN: 1752-0509
Figure 1. Acceptor Splice Site: Each instance is a 141-nt window around the splice site, with the "AG" dimer starting at position 61. The sequence is used to generate two views for co-training: one based on nucleotides and another one based on 3-mers.
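The two views in the caption can be sketched as below. The caption does not specify the exact encodings, so one-hot nucleotides and 3-mer counts are assumed here as common choices; all helper names (`is_acceptor_window`, `nucleotide_view`, `kmer_view`) are hypothetical:

```python
from itertools import product

def is_acceptor_window(seq):
    """Check the Figure 1 layout: a 141-nt window with the "AG" dimer
    starting at position 61 (1-based), i.e. string indices 60-61."""
    return len(seq) == 141 and seq[60:62] == "AG"

def nucleotide_view(seq):
    """View 1: one-hot encoding of each nucleotide in the window."""
    table = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0),
             "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}
    return [bit for nt in seq for bit in table[nt]]

def kmer_view(seq, k=3):
    """View 2: counts of overlapping k-mers (3-mers by default), reported
    in a fixed lexicographic order over all 4**k possible k-mers."""
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    return [counts[kmer] for kmer in sorted(counts)]
```

Because the two views encode independent descriptions of the same window, each co-training learner can pseudo-label instances for the other.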
Table of Results.
| Imbalance | LBE | CTEO | STEO | CTEP | STEP | CTEOD | STEOD | CTEPD | STEPD |
|---|---|---|---|---|---|---|---|---|---|
| 1-to-5 | 0.452 | ||||||||
| 1-to-10 | 0.434 | 0.343† | |||||||
| 1-to-20 | 0.437 | 0.434 | 0.292◇ | ||||||
| 1-to-25 | 0.437 | 0.384◇ | 0.423◇ | 0.245* | |||||
| 1-to-30 | 0.430 | 0.336* | 0.408◇ | 0.239* | |||||
| 1-to-40 | 0.443 | 0.404† | 0.409 | 0.222† | |||||
| 1-to-50 | 0.450 | 0.372† | 0.409◇ | 0.236* | |||||
| 1-to-60 | 0.471 | 0.388† | 0.398 | 0.195† | 0.423 | ||||
| 1-to-70 | 0.450 | 0.392† | 0.411 | 0.207† | 0.444 | ||||
| 1-to-75 | 0.454 | 0.388 | 0.399◇ | 0.249† | 0.435 | ||||
| 1-to-80 | 0.449 | 0.353† | 0.386† | 0.436 | 0.204* | 0.421◇ | |||
| 1-to-90 | 0.453 | 0.359† | 0.410 | 0.449 | 0.242 | 0.423 | |||
| 1-to-99 | 0.446 | 0.376 | 0.389◇ | 0.440† | 0.226† | 0.414 |
The values represent averages of auPRC values for the positive class over the five organisms when the class imbalance degree varies from 1-to-5 to 1-to-99 and the number of labeled instances represents less than 1% of the training data. LBE is the ensemble-based supervised lower bound. CTEO and STEO are the co-training-based and self-training-based ensembles inspired by the original approach in [11]. CTEP and STEP are the co-training-based and self-training-based ensembles that use the "dynamic balancing" approach introduced in [15], in which only positive instances are used in semi-supervised iterations to augment the originally labeled training data. CTEOD and STEOD add positive and negative instances but distribute them among all subclassifiers, such that the balance and diversity of each subclassifier's labeled subset is maintained. CTEPD and STEPD use "dynamic balancing" but also distribute instances among all subclassifiers. The bold font denotes the semi-supervised experiments that outperform the lower bound. The starred (*) values denote experiments whose variation in comparison to the lower bound was found to be statistically significant by the paired t-test in all five organisms. The values marked with a dagger (†) indicate experiments that the paired t-test found to be statistically significant in four out of five organisms. The values marked with a diamond (◇) indicate experiments that the paired t-test found to be statistically significant in three out of five organisms.
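The "dynamic balancing" and distribution steps distinguishing the P and D variants can be sketched as follows. This is an illustrative reading of the footnote, not the authors' implementation; the function names and the `(instance, predicted_label, confidence)` tuple format are assumptions:

```python
def select_dynamic_balancing(pseudo, top_k):
    """'Dynamic balancing' (the P variants): from pseudo-labeled tuples of
    (instance, predicted_label, confidence), keep only the top_k most
    confident *positive* predictions, so each semi-supervised iteration
    enriches the minority (splice-site) class."""
    positives = [p for p in pseudo if p[1] == 1]
    positives.sort(key=lambda p: p[2], reverse=True)
    return positives[:top_k]

def distribute(instances, n_learners):
    """Distribution step (the D variants): deal the selected instances
    round-robin across subclassifiers, so each learner's labeled subset
    stays balanced while the learners receive different instances,
    preserving ensemble diversity."""
    buckets = [[] for _ in range(n_learners)]
    for i, inst in enumerate(instances):
        buckets[i % n_learners].append(inst)
    return buckets
```

Under this reading, CTEPD/STEPD compose the two steps: first filter to confident positives, then deal the survivors across the subclassifiers.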