
Classifier assessment and feature selection for recognizing short coding sequences of human genes.

Kai Song, Ze Zhang, Tuo-Peng Tong, Fang Wu.

Abstract

With the ever-increasing pace of genome sequencing, there is a great need for fast and accurate computational tools that automatically identify genes in these genomes. Although gene-finding algorithms have progressed greatly over the past decades, there is still room for improvement. In particular, recognizing short exons in eukaryotes has not yet been solved satisfactorily. This article assesses various linear and kernel-based classification algorithms and selects the best combination of Z-curve features to improve short-exon recognition. Eight state-of-the-art linear and kernel-based supervised pattern recognition techniques were used to identify short (21-192 bp) coding sequences of human genes. Judged by prediction accuracy, the tradeoff between sensitivity and specificity, and time consumption, the partial least squares (PLS) and kernel partial least squares (KPLS) algorithms were found to be the best linear and kernel-based classifiers, respectively. A surprising result was that, by exploiting the interpretability of the PLS and Z-curve methods, a set of 93 Z-curve features was identified as the best combination. Using these features, KPLS improved the average recognition accuracy by as much as 7.7% compared with Fisher discriminant analysis using 189 Z-curve variables (Gao and Zhang, 2004).

The codes used are freely available from the following sources (implemented in MATLAB and supported on Linux and MS Windows): (1) SVM: http://www.support-vector-machines.org/SVM_soft.html. (2) GP: http://www.gaussianprocess.org. (3) KPLS and KFDA: Taylor, J.S., and Cristianini, N. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK. (4) PLS: Wise, B.M., and Gallagher, N.B. 2011. PLS-Toolbox for use with MATLAB: ver 1.5.2. Eigenvector Technologies, Manson, WA.
Supplementary Material for this article is available at www.liebertonline.com/cmb.
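
The abstract refers to Z-curve features without defining them. As a rough illustration only, the sketch below computes the nine phase-specific Z-curve components of a sequence, assuming the standard Z-curve transform applied to base frequencies at each of the three codon positions. The function name `zcurve_phase_features` is hypothetical, the code is Python rather than the authors' MATLAB, and it does not reproduce the 93-feature or 189-variable sets used in the article.

```python
from collections import Counter


def zcurve_phase_features(seq):
    """Return 9 phase-specific Z-curve components (x, y, z per codon position).

    Illustrative sketch: assumes the sequence starts in frame and ignores
    any character that is not A, C, G, or T.
    """
    seq = seq.upper()
    features = []
    for phase in range(3):
        # Bases found at codon position `phase` (0, 1, or 2).
        bases = [b for b in seq[phase::3] if b in "ACGT"]
        n = len(bases)
        counts = Counter(bases)
        # Relative frequency of each base at this codon position.
        a, c, g, t = (counts[b] / n if n else 0.0 for b in "ACGT")
        features.extend([
            (a + g) - (c + t),  # x: purine vs. pyrimidine
            (a + c) - (g + t),  # y: amino vs. keto
            (a + t) - (g + c),  # z: weak vs. strong hydrogen bonding
        ])
    return features


if __name__ == "__main__":
    print(zcurve_phase_features("ATGGCCAAGTGA"))
```

Each component lies in [-1, 1], so the resulting vector can be passed to a linear or kernel-based classifier without further scaling, although whether the authors rescaled their features is not stated here.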

Year:  2012        PMID: 22401589      PMCID: PMC3298678          DOI: 10.1089/cmb.2011.0078

Source DB:  PubMed          Journal:  J Comput Biol        ISSN: 1066-5277            Impact factor:   1.479


