The impact of sequence length and number of sequences on promoter prediction performance.

Sávio G Carvalho, Renata Guerra-Sá, Luiz H de C Merschmann.

Abstract

BACKGROUND: The rapid growth in genome sequencing capacity has highlighted the need to automate data analysis in order to speed up genomic annotation and reduce its cost. Because promoter identification is an important step in functional genomic annotation, several studies have proposed computational approaches to predict promoters. Different classifiers and characteristics of promoter sequences have been used to address this prediction problem. However, several works in the literature have addressed it using datasets containing sequences of 250 nucleotides or more. Because the sequence length determines the number of dataset attributes, datasets with a large number of attributes are generated for training classifiers even when only a limited number of properties is used to characterize the sequences. Since high-dimensional datasets can degrade the predictive performance of classifiers or even require infeasible processing time, training classifiers on datasets with a reduced number of attributes is essential for obtaining good predictive performance at low computational cost. To the best of our knowledge, no work in the literature has systematically examined the relationship between sequence length and the predictive performance of classifiers. Thus, in this work, we evaluated the impact of sequence length variation and training dataset size (number of sequences) on the predictive performance of classifiers.
RESULTS: We built sixteen datasets composed of sequences of different lengths (ranging from 12 to 301 nucleotides) and evaluated them using the SVM, Random Forest and k-NN classifiers. The best predictive performances reached by SVM and Random Forest remained relatively stable for datasets composed of sequences varying in length from 301 down to 41 nucleotides, while k-NN achieved its best performance for the dataset composed of 101-nucleotide sequences. Using sequences of only 41 nucleotides, we also analyzed how increasing the number of sequences in a dataset affects the predictive performance of the same three classifiers. Datasets containing 14,000, 80,000, 100,000 and 120,000 sequences were built and evaluated. All classifiers achieved better predictive performance for datasets containing 80,000 sequences or more.
CONCLUSION: The experimental results show that several datasets composed of shorter sequences achieved better predictive performance than datasets composed of longer sequences, and also required significantly less processing time. Furthermore, increasing the number of sequences in a dataset proved beneficial to the predictive power of the classifiers.
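
The link between sequence length and the number of dataset attributes can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the abstract does not say how shorter sequences were cut or which properties were used, so the symmetric window (`center_window`) and the one-hot encoding (`one_hot`, four binary attributes per nucleotide) are assumptions chosen only to show how the attribute count scales with length.

```python
from typing import List

NUCLEOTIDES = "ACGT"


def center_window(seq: str, length: int) -> str:
    """Return the central `length` nucleotides of `seq`.

    Assumption: shorter sequences are cut symmetrically around the
    middle of the longer one; the abstract does not specify this.
    """
    if length > len(seq):
        raise ValueError("window is longer than the sequence")
    start = (len(seq) - length) // 2
    return seq[start:start + length]


def one_hot(seq: str) -> List[int]:
    """Encode each nucleotide as 4 binary attributes (A, C, G, T).

    Assumption: one-hot encoding stands in for whatever sequence
    properties a real study would use.
    """
    return [int(base == n) for base in seq for n in NUCLEOTIDES]


if __name__ == "__main__":
    full = "ACGT" * 75 + "A"  # toy 301-nucleotide sequence
    for length in (301, 101, 41, 12):
        window = center_window(full, length)
        # Attribute count grows linearly with sequence length.
        print(length, len(one_hot(window)))
```

Under this encoding a 301-nucleotide sequence yields 1,204 attributes, while a 41-nucleotide sequence yields 164, which is the kind of dimensionality reduction the abstract argues for.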

Substances: Oligonucleotides

Year: 2015 PMID: 26695879 PMCID: PMC4686783 DOI: 10.1186/1471-2105-16-S19-S5

Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
