A RAPID algorithm for sequence database comparisons: application to the identification of vector contamination in the EMBL databases.

C Miller, J Gurd, A Brass.

Abstract

MOTIVATION: Word-matching algorithms such as BLAST are routinely used for sequence comparison. These algorithms typically use areas of matching words to seed alignments, which are then used to assess the degree of sequence similarity. In this paper, we show that by formally separating the word-matching and sequence-alignment processes, and using information about word frequencies to generate alignments and similarity scores, we can create a new sequence-comparison algorithm which is both fast and sensitive. The formal split between word searching and alignment allows users to select an appropriate alignment method without affecting the underlying similarity search. The algorithm has been used to develop software for identifying entries in DNA sequence databases which are contaminated with vector sequence.
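The idea of scoring similarity from word frequencies, with no alignment step, can be illustrated with a minimal sketch. This is not the RAPID implementation: the function names, the `-log(frequency)` weighting, and the toy sequences are all assumptions chosen for illustration, and the word size of 3 is used only to keep the example small.

```python
import math
from collections import Counter


def words(seq, k):
    """Yield every overlapping word (k-mer) of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def word_weights(database, k):
    """Weight each word by -log of its observed frequency in the database.

    Rare words receive large weights, common words small ones.
    (Assumed weighting scheme, for illustration only.)
    """
    counts = Counter(w for seq in database for w in words(seq, k))
    total = sum(counts.values())
    return {w: -math.log(c / total) for w, c in counts.items()}


def score(query, target, weights, k):
    """Sum the weights of the distinct words shared by query and target."""
    shared = set(words(query, k)) & set(words(target, k))
    return sum(weights.get(w, 0.0) for w in shared)


# Toy database (hypothetical sequences, not real vector data).
database = ["ACGTACGT", "TTTTTTTT", "ACGTTTTT"]
K = 3
weights = word_weights(database, K)

rare_match = score("GTAC", "GTAC", weights, K)    # shares the rare words GTA, TAC
common_match = score("TTTT", "TTTT", weights, K)  # shares only the common word TTT
no_match = score("ACGT", "TTTT", weights, K)      # no shared words
```

Because the score depends only on shared words, an alignment method could be applied afterwards to the highest-scoring pairs without changing the search itself.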
RESULTS: We present three algorithms, RAPID, PHAT and SPLAT, which together allow vector contaminations to be found and assessed extremely rapidly. RAPID is a word search algorithm which uses probabilities to modify the significance attached to different words; PHAT and SPLAT are alignment algorithms. An initial implementation has been shown to be approximately an order of magnitude faster than BLAST. The formal split between word searching and alignment not only offers considerable gains in performance, but also allows alignment generation to be viewed as a user interface problem, allowing the most useful output method to be selected without affecting the underlying similarity search. Receiver Operator Characteristic (ROC) analysis of an artificial test set allows the optimal score threshold for identifying vector contamination to be determined. ROC curves were also used to determine the optimum word size (nine) for finding vector contamination. An analysis of the entire expressed sequence tag (EST) subset of EMBL found a contamination rate of 0.27%. A more detailed analysis of the 50 000 ESTs in est10.dat (an EST subset of EMBL) finds an error rate of 0.86%, principally due to two large-scale projects.
AVAILABILITY: A Web page for the software exists at http://bioinf.man.ac.uk/rapid, or it can be downloaded from ftp://ftp.bioinf.man.ac.uk/RAPID
CONTACT: crispin@cs.man.ac.uk
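The abstract does not say how the "optimal" score threshold was read off the ROC curve. As a hedged sketch, one common choice is the threshold that maximises Youden's J (true-positive rate minus false-positive rate). The criterion, the function names, and the toy scores below are all assumptions, not values or methods taken from the paper.

```python
def roc_points(scores, labels):
    """Return (threshold, TPR, FPR) for each distinct score used as a threshold.

    A sequence is called contaminated when its score is >= threshold.
    labels: 1 for known-contaminated, 0 for known-clean.
    Requires at least one example of each class.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= threshold)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
        points.append((threshold, tp / positives, fp / negatives))
    return points


def best_threshold(scores, labels):
    """Pick the threshold maximising Youden's J = TPR - FPR (assumed criterion)."""
    return max(roc_points(scores, labels), key=lambda p: p[1] - p[2])[0]


# Toy labelled test set (hypothetical scores, for illustration only).
scores = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
labels = [0, 0, 0, 1, 1, 1]
threshold = best_threshold(scores, labels)
```

The same procedure could be repeated once per candidate word size, comparing the resulting curves to choose the size that separates the two classes best.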

Year:  1999        PMID: 10089196     DOI: 10.1093/bioinformatics/15.2.111

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  7 in total

1.  Gene2EST: a BLAST2 server for searching expressed sequence tag (EST) databases with eukaryotic gene-sized queries.

Authors:  C Gemünd; C Ramu; B Altenberg-Greulich; T J Gibson
Journal:  Nucleic Acids Res       Date:  2001-03-15       Impact factor: 16.971

2.  SSAHA: a fast search method for large DNA databases.

Authors:  Z Ning; A J Cox; J C Mullikin
Journal:  Genome Res       Date:  2001-10       Impact factor: 9.043

3.  VecScreen_plus_taxonomy: imposing a tax(onomy) increase on vector contamination screening.

Authors:  Alejandro A Schäffer; Eric P Nawrocki; Yoon Choi; Paul A Kitts; Ilene Karsch-Mizrachi; Richard McVeigh
Journal:  Bioinformatics       Date:  2018-03-01       Impact factor: 6.937

4.  Cgaln: fast and space-efficient whole-genome alignment.

Authors:  Ryuichiro Nakato; Osamu Gotoh
Journal:  BMC Bioinformatics       Date:  2010-04-30       Impact factor: 3.169

5.  An annotation infrastructure for the analysis and interpretation of Affymetrix exon array data.

Authors:  Michał J Okoniewski; Tim Yates; Siân Dibben; Crispin J Miller
Journal:  Genome Biol       Date:  2007       Impact factor: 13.583

6.  Towards computational improvement of DNA database indexing and short DNA query searching.

Authors:  Done Stojanov; Sašo Koceski; Aleksandra Mileva; Nataša Koceska; Cveta Martinovska Bande
Journal:  Biotechnol Biotechnol Equip       Date:  2014-10-31       Impact factor: 1.632

7.  An optimized procedure greatly improves EST vector contamination removal.

Authors:  Yi-An Chen; Chang-Chun Lin; Chin-Di Wang; Huan-Bin Wu; Pei-Ing Hwang
Journal:  BMC Genomics       Date:  2007-11-13       Impact factor: 3.969
