Olivier Croce (1), Michaël Lamarre, Richard Christen. (1) Laboratoire de Biologie Virtuelle, UMR 6543, CNRS & University of Nice Sophia-Antipolis, Centre de Biochimie, Parc Valrose, Nice, F06108, France. croce@unice.fr
Abstract
BACKGROUND: High-throughput technologies often require the retrieval of large sets of sequences. Retrieving EMBL or GenBank entries by keyword is easy with tools such as ACNUC, Entrez or SRS, but these tools have limitations, particularly for queries with complex keywords.
RESULTS: We show that Entrez has severe limitations when retrieving subsequences. SRS works well with simple keywords but not with keywords composed of several terms, and it has problems with complex queries. ACNUC works well but does not allow precise queries on the Feature qualifiers. We developed dedicated Perl scripts that precisely retrieve subsequences defined by complex descriptors in the Feature qualifiers of EMBL entries. We improved parts of the BioPerl library to allow parsing of large data files, and we embedded the scripts in a user-friendly, OS-independent interface.
CONCLUSION: Although slower than the public tools, which use prebuilt indexes, parsing the complete entries with a script is often necessary to retrieve exactly the data sought. Embedding the scripts in a user-friendly interface lets biologists use them, and bioinformaticians can easily modify them for unforeseen needs.
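To make the idea of "parsing the complete entries" concrete, here is a minimal sketch in Python, not the authors' Perl/BioPerl scripts. It reads an EMBL flat file held in a string, collects the feature table (FT lines) and the sequence (SQ block) of each entry, and returns the subsequences of features whose qualifier matches a pattern. The function names `parse_embl`, `subsequence` and `select` are hypothetical, and the sketch assumes the standard fixed-column FT layout and handles only simple locations (`a..b` and `complement(a..b)`).

```python
import re
from typing import Dict, Iterator, List, Tuple

Feature = Tuple[str, str, Dict[str, str]]  # (feature key, location, qualifiers)
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def parse_embl(text: str) -> Iterator[Tuple[List[Feature], str]]:
    """Yield (features, sequence) for each entry of an EMBL flat-file string."""
    features: List[Feature] = []
    seq_parts: List[str] = []
    in_seq, last_qual = False, None
    for line in text.splitlines():
        if line.startswith("//"):            # end of entry
            yield features, "".join(seq_parts)
            features, seq_parts, in_seq, last_qual = [], [], False, None
        elif line.startswith("SQ"):          # sequence block starts after this line
            in_seq = True
        elif in_seq:                         # drop spacing and position counters
            seq_parts.append(re.sub(r"[^A-Za-z]", "", line))
        elif line.startswith("FT"):
            # Assumed layout: feature key in columns 6-20, the rest from column 22.
            key, rest = line[5:20].strip(), line[21:].strip()
            if key:                          # a new feature with its location
                features.append((key, rest, {}))
                last_qual = None
            elif rest.startswith("/") and features:   # a new qualifier
                name, _, value = rest[1:].partition("=")
                features[-1][2][name] = value.strip('"')
                last_qual = name
            elif features and last_qual:     # wrapped qualifier value
                quals = features[-1][2]
                quals[last_qual] = (quals[last_qual] + " " + rest).strip('"')
            # Wrapped locations are ignored in this sketch.


def subsequence(seq: str, location: str) -> str:
    """Slice seq for a simple location: 'a..b' or 'complement(a..b)'."""
    m = re.fullmatch(r"(complement\()?<?(\d+)\.\.>?(\d+)\)?", location)
    if m is None:
        raise ValueError(f"unsupported location: {location}")
    start, end = int(m.group(2)), int(m.group(3))
    part = seq[start - 1:end]                # EMBL coordinates are 1-based, inclusive
    if m.group(1):                           # reverse-complement the minus strand
        part = part.translate(COMPLEMENT)[::-1]
    return part


def select(text: str, feature_key: str, qualifier: str, pattern: str) -> Iterator[str]:
    """Yield subsequences of features whose qualifier value matches pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    for features, seq in parse_embl(text):
        for key, location, quals in features:
            if key == feature_key and regex.search(quals.get(qualifier, "")):
                try:
                    part = subsequence(seq, location)
                except ValueError:
                    continue                 # e.g. join(...) is skipped here
                yield part
```

A call such as `select(text, "rRNA", "product", r"16S ribosomal RNA")` would then return only the annotated rRNA regions rather than whole entries. Because every entry is scanned line by line, this approach is slower than an indexed lookup, which is the trade-off the abstract describes.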