
Biological sequence classification utilizing positive and unlabeled data.

Yuanyuan Xiao, Mark R Segal.

Abstract

MOTIVATION: In the genomics setting, an increasingly common data configuration consists of a small set of sequences possessing a targeted property (positive instances) amongst a large set of sequences for which class membership is unknown (unlabeled instances). Traditional two-class classification methods do not effectively handle such data.
RESULTS: Here, we develop a novel method, likely positive-iterative classification (LP-IC), for this problem and contrast its performance with the few existing methods, most of which were devised and used in the text classification context. LP-IC employs an iterative classification scheme and introduces a class dispersion measure, adopted from unsupervised clustering approaches, to monitor the model selection process. Using two case studies (prediction of HLA binding, and alternative splicing conservation between human and mouse), we show that LP-IC provides superior performance to existing methodologies in terms of: (i) combined accuracy and precision in positive identification from the unlabeled set; and (ii) predictive performance of the resultant classifiers on independent test data.
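The abstract does not give the details of LP-IC, so the following is only a minimal sketch of the general idea of iterative classification on positive and unlabeled data, not the authors' method. It assumes a nearest-centroid rule as a stand-in classifier and a within-class sum of squares as a stand-in dispersion measure; the function names `iterative_pu_sketch` and `within_class_dispersion` are hypothetical.

```python
import numpy as np


def within_class_dispersion(X, labels):
    """Sum of squared distances of each point to its class centroid.

    Assumption: a simple clustering-style dispersion, used here only
    to illustrate monitoring; the paper's measure may differ.
    """
    total = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        total += float(((pts - pts.mean(axis=0)) ** 2).sum())
    return total


def iterative_pu_sketch(X_pos, X_unl, max_iter=20):
    """Toy iterative positive/unlabeled scheme (not LP-IC itself).

    Returns (likely_pos_mask, dispersion_history), where the mask
    flags unlabeled rows that end up assigned to the positive class.
    """
    X_pos = np.asarray(X_pos, dtype=float)
    X_unl = np.asarray(X_unl, dtype=float)
    X_all = np.vstack([X_pos, X_unl])

    # Start by treating every unlabeled instance as negative.
    likely_pos = np.zeros(len(X_unl), dtype=bool)
    history = []

    for _ in range(max_iter):
        pos_pts = np.vstack([X_pos, X_unl[likely_pos]])
        neg_pts = X_unl[~likely_pos]
        if len(neg_pts) == 0:
            break  # nothing left in the negative class

        # Nearest-centroid rule: reassign each unlabeled instance.
        c_pos = pos_pts.mean(axis=0)
        c_neg = neg_pts.mean(axis=0)
        d_pos = np.linalg.norm(X_unl - c_pos, axis=1)
        d_neg = np.linalg.norm(X_unl - c_neg, axis=1)
        new_mask = d_pos < d_neg

        # Record dispersion of the current two-class partition.
        labels = np.concatenate(
            [np.ones(len(X_pos), dtype=int), new_mask.astype(int)]
        )
        history.append(within_class_dispersion(X_all, labels))

        if np.array_equal(new_mask, likely_pos):
            break  # assignments have stabilized
        likely_pos = new_mask

    return likely_pos, history
```

In a sequence setting, each row of `X_pos` and `X_unl` would be a numeric encoding of a sequence; the choice of encoding is left open here because the abstract does not specify one.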

Year: 2008    PMID: 18344247    DOI: 10.1093/bioinformatics/btn089

Source DB: PubMed    Journal: Bioinformatics    ISSN: 1367-4803    Impact factor: 6.937


  1 in total

1.  Towards site-based protein functional annotations.

Authors:  Seak Fei Lei; Jun Huan
Journal:  Int J Data Min Bioinform       Date:  2010       Impact factor: 0.667
