CLEANUP: a fast computer program for removing redundancies from nucleotide sequence databases.

G Grillo, M Attimonelli, S Liuni, G Pesole.

Abstract

A key concept in comparing sequence collections is the issue of redundancy. The production of sequence collections free from redundancy is undoubtedly very useful, both in performing statistical analyses and in accelerating extensive database searching on nucleotide sequences. Indeed, publicly available databases contain multiple entries of identical or almost identical sequences. Performing statistical analysis on such biased data carries a high risk of assigning high significance to non-significant patterns. In order to carry out unbiased statistical analysis as well as more efficient database searching, it is thus necessary to analyse sequence data that have been purged of redundancy. Given that an unambiguous definition of redundancy is impracticable for biological sequence data, the present program uses a quantitative description of redundancy based on the measure of sequence similarity. A sequence is considered redundant if it shows a degree of similarity and overlap with a longer sequence in the database greater than a threshold fixed by the user. In this paper we present a new algorithm based on an "approximate string matching" procedure, which is able to determine the overall degree of similarity between each pair of sequences contained in a nucleotide sequence database and to generate automatically nucleotide sequence collections free from redundancies.
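
The redundancy criterion described in the abstract (similarity and overlap with a longer sequence above user-set thresholds) can be illustrated with a minimal Python sketch. This is not the CLEANUP algorithm itself: the abstract does not specify its approximate string matching procedure, so the standard-library `difflib.SequenceMatcher` is swapped in as a stand-in. The function names, the way overlap and similarity are computed, and the 0.9 default thresholds are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def overlap_and_similarity(longer, shorter):
    """Return (overlap, similarity) of `shorter` against `longer`.

    Assumed definitions (not taken from the paper):
      overlap    = fraction of `shorter` spanned by the matched region
      similarity = fraction of that span made up of identical characters
    """
    matcher = SequenceMatcher(None, longer, shorter, autojunk=False)
    # The final block returned by get_matching_blocks() is a zero-size sentinel.
    blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
    if not blocks:
        return 0.0, 0.0
    start = blocks[0].b                      # first matched position in `shorter`
    end = blocks[-1].b + blocks[-1].size     # one past the last matched position
    span = end - start
    matched = sum(b.size for b in blocks)
    return span / len(shorter), matched / span


def remove_redundant(sequences, min_similarity=0.9, min_overlap=0.9):
    """Return the names of sequences kept after purging redundant ones.

    `sequences` maps a name to a nucleotide string. Sequences are visited
    from longest to shortest, so each one is only compared with sequences
    at least as long as itself. Thresholds are hypothetical defaults.
    """
    ordered = sorted(sequences.items(), key=lambda kv: len(kv[1]), reverse=True)
    kept = []
    for name, seq in ordered:
        redundant = False
        for _, longer in kept:
            overlap, similarity = overlap_and_similarity(longer, seq)
            # ">=" is a choice made here; the abstract says "greater than".
            if overlap >= min_overlap and similarity >= min_similarity:
                redundant = True
                break
        if not redundant:
            kept.append((name, seq))
    return [name for name, _ in kept]


if __name__ == "__main__":
    example = {
        "long": "ACGTACGTTAGCCGATAGGCTA",
        "dup": "ACGTTAGCCGATAGG",        # exact fragment of "long"
        "near": "ACGTTAGCCGTTAGG",       # same fragment with one mismatch
        "other": "TTTTTTTTTTGGGGGGGGGG",  # unrelated sequence
    }
    print(remove_redundant(example))
```

This all-against-all comparison is quadratic in the number of sequences and is meant only to make the definition concrete; a program intended for whole databases would need a faster matching strategy.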

Year:  1996        PMID: 8670613     DOI: 10.1093/bioinformatics/12.1.1

Source DB:  PubMed          Journal:  Comput Appl Biosci        ISSN: 0266-7061


  33 in total

1.  UTRdb and UTRsite: specialized databases of sequences and functional elements of 5' and 3' untranslated regions of eukaryotic mRNAs.

Authors:  G Pesole; S Liuni; G Grillo; F Licciulli; A Larizza; W Makalowski; C Saccone
Journal:  Nucleic Acids Res       Date:  2000-01-01       Impact factor: 16.971

2.  Transterm: a database of messenger RNA components and signals.

Authors:  G H Jacobs; P A Stockwell; M J Schrieber; W P Tate; C M Brown
Journal:  Nucleic Acids Res       Date:  2000-01-01       Impact factor: 16.971

3.  GBuilder--an application for the visualization and integration of EST cluster data.

Authors:  J Muilu; P Rodriguez-Tomé; A Robinson
Journal:  Genome Res       Date:  2001-01       Impact factor: 9.043

4.  ARED: human AU-rich element-containing mRNA database reveals an unexpectedly diverse functional repertoire of encoded proteins.

Authors:  T Bakheet; M Frevel; B R Williams; W Greer; K S Khabar
Journal:  Nucleic Acids Res       Date:  2001-01-01       Impact factor: 16.971

5.  UTRdb and UTRsite: specialized databases of sequences and functional elements of 5' and 3' untranslated regions of eukaryotic mRNAs. Update 2002.

Authors:  Graziano Pesole; Sabino Liuni; Giorgio Grillo; Flavio Licciulli; Flavio Mignone; Carmela Gissi; Cecilia Saccone
Journal:  Nucleic Acids Res       Date:  2002-01-01       Impact factor: 16.971

6.  Compositional gene landscapes in vertebrates.

Authors:  Stéphane Cruveiller; Kamel Jabbari; Oliver Clay; Giorgio Bernardi
Journal:  Genome Res       Date:  2004-05       Impact factor: 9.043

7.  Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs.

Authors:  Bastien Chevreux; Thomas Pfisterer; Bernd Drescher; Albert J Driesel; Werner E G Müller; Thomas Wetter; Sándor Suhai
Journal:  Genome Res       Date:  2004-05-12       Impact factor: 9.043

Review 8.  Searching for IRES.

Authors:  Stephen D Baird; Marcel Turcotte; Robert G Korneluk; Martin Holcik
Journal:  RNA       Date:  2006-09-06       Impact factor: 4.942

9.  UTRdb: a specialized database of 5' and 3' untranslated regions of eukaryotic mRNAs.

Authors:  G Pesole; S Liuni; G Grillo; M Ippedico; A Larizza; W Makalowski; C Saccone
Journal:  Nucleic Acids Res       Date:  1999-01-01       Impact factor: 16.971

10.  Distribution of genes in the genome of Arabidopsis thaliana and its implications for the genome organization of plants.

Authors:  A Barakat; G Matassi; G Bernardi
Journal:  Proc Natl Acad Sci U S A       Date:  1998-08-18       Impact factor: 11.205
