
Separating significant matches from spurious matches in DNA sequences.

Hugo Devillers, Sophie Schbath.

Abstract

Word matches are widely used to compare genomic sequences. Whole-genome alignment methods often use matches as anchors for building their alignments, and various alignment-free approaches that characterize similarities between large sequences are also based on word matches. Some of the matches retrieved when two genomic sequences are compared may be spurious matches (SMs), that is, matches that arise by chance rather than from homology. The number of SMs depends on the minimal match length (ℓ) that must be set in the algorithm used to retrieve them. If ℓ is too small, many matches are recovered but most of them are SMs. Conversely, if ℓ is too large, fewer matches are retrieved and many shorter significant matches are missed. To date, ℓ has mostly been chosen from empirical threshold values rather than by robust statistical methods. To overcome this problem, we propose a statistical approach that uses a mixture model of geometric distributions to characterize the distribution of match lengths obtained from the comparison of two genomic sequences.
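
To make the idea of a geometric mixture over match lengths concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not state the number of components, the estimation procedure, or how a threshold on ℓ is derived. The sketch assumes two components (spurious and significant), fits them by expectation-maximization, and models a match of length L as min_len + X with P(X = x) = q^x (1 - q), where q is a per-base extension probability. The function names, the starting values (0.25 is the chance that two uniformly random bases agree) and the simulated data are all hypothetical.

```python
import math
import random
from collections import Counter


def _posterior_first(x, w, q1, q2):
    """Return (P(component 1 | x), log P(x)), computed in log space."""
    a = math.log(w) + x * math.log(q1) + math.log1p(-q1)
    b = math.log1p(-w) + x * math.log(q2) + math.log1p(-q2)
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb), m + math.log(ea + eb)


def fit_geometric_mixture(lengths, min_len, init=(0.5, 0.25, 0.9),
                          max_iter=500, tol=1e-10):
    """EM fit of w*Geom(q_sp) + (1 - w)*Geom(q_sig) to match lengths.

    Assumption of this sketch: exactly two components.
    Returns (w, q_sp, q_sig) ordered so that q_sp <= q_sig.
    """
    counts = Counter(length - min_len for length in lengths)
    if not counts or min(counts) < 0:
        raise ValueError("need at least one length, all >= min_len")
    n = sum(counts.values())
    eps = 1e-9

    def clamp(v):
        # Keep parameters strictly inside (0, 1) so the logs stay finite.
        return min(max(v, eps), 1.0 - eps)

    w, q1, q2 = (clamp(v) for v in init)
    prev_ll = -math.inf
    for _ in range(max_iter):
        sum_r = sum_rx = sum_sx = ll = 0.0
        # E-step, accumulated over the histogram of shifted lengths.
        for x, c in counts.items():
            r, log_px = _posterior_first(x, w, q1, q2)
            ll += c * log_px
            sum_r += c * r
            sum_rx += c * r * x
            sum_sx += c * (1.0 - r) * x
        # M-step: weighted maximum-likelihood updates, q = mean / (1 + mean).
        w = clamp(sum_r / n)
        q1 = clamp(sum_rx / max(sum_rx + sum_r, eps))
        q2 = clamp(sum_sx / max(sum_sx + (n - sum_r), eps))
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    if q1 > q2:  # label the slowly extending component as "spurious"
        w, q1, q2 = 1.0 - w, q2, q1
    return w, q1, q2


def posterior_spurious(length, min_len, w, q_sp, q_sig):
    """Posterior probability that a match of this length is spurious."""
    return _posterior_first(length - min_len, w, q_sp, q_sig)[0]


def simulate_lengths(n, q, min_len, rng):
    """Draw n lengths min_len + X with P(X >= k) = q**k (inverse CDF)."""
    return [min_len + int(math.log(1.0 - rng.random()) / math.log(q))
            for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical data: many short chance matches, fewer long ones.
    data = (simulate_lengths(5000, 0.25, 8, rng)
            + simulate_lengths(1000, 0.95, 8, rng))
    w, q_sp, q_sig = fit_geometric_mixture(data, min_len=8)
    print(f"w={w:.3f} q_sp={q_sp:.3f} q_sig={q_sig:.3f}")
    for length in (8, 12, 16, 24):
        print(length, round(posterior_spurious(length, 8, w, q_sp, q_sig), 4))
```

Under these assumptions, one possible way to choose ℓ is to take the smallest length at which `posterior_spurious` falls below a chosen tolerance. That rule is an illustration only and may differ from the criterion used in the paper.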

Year: 2011   PMID: 22149632   PMCID: PMC3244807   DOI: 10.1089/cmb.2011.0070

Source DB: PubMed   Journal: J Comput Biol   ISSN: 1066-5277   Impact factor: 1.479

