
Uncertainty in homology inferences: assessing and improving genomic sequence alignment.

Gerton Lunter, Andrea Rocco, Naila Mimouni, Andreas Heger, Alexandre Caldeira, Jotun Hein.

Abstract

Sequence alignment underpins all of comparative genomics, yet it remains an incompletely solved problem. In particular, the statistical uncertainty within inferred alignments is often disregarded, while parametric or phylogenetic inferences are considered meaningless without confidence estimates. Here, we report on a theoretical and simulation study of pairwise alignments of genomic DNA at human-mouse divergence. We find that >15% of aligned bases are incorrect in existing whole-genome alignments, and we identify three types of alignment error, each leading to systematic biases in all algorithms considered. Careful modeling of the evolutionary process improves alignment quality; however, these improvements are modest compared with the remaining alignment errors, even with exact knowledge of the evolutionary model, emphasizing the need for statistical approaches to account for uncertainty. We develop a new algorithm, Marginalized Posterior Decoding (MPD), which explicitly accounts for uncertainties, is less biased and more accurate than other algorithms we consider, and reduces the proportion of misaligned bases by a third compared with the best existing algorithm. To our knowledge, this is the first nonheuristic algorithm for DNA sequence alignment to show robust improvements over the classic Needleman-Wunsch algorithm. Despite this, considerable uncertainty remains even in the improved alignments. We conclude that a probabilistic treatment is essential, both to improve alignment quality and to quantify the remaining uncertainty. This is becoming increasingly relevant with the growing appreciation of the importance of noncoding DNA, whose study relies heavily on alignments. Alignment errors are inevitable, and should be considered when drawing conclusions from alignments. Software and alignments to assist researchers in doing this are provided at http://genserv.anat.ox.ac.uk/grape/.
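As a rough illustration of what a probabilistic treatment of alignment uncertainty can look like, the sketch below computes, for every pair of positions, the posterior probability that the two bases are aligned, summing over all global alignments rather than reporting a single best one. This is not the MPD algorithm reported above, whose details are not given in the abstract. The function name `match_posteriors`, the single-state linear-gap model, and the match, mismatch and gap scores are all assumptions chosen for brevity.

```python
import math


def match_posteriors(x, y, match=1.0, mismatch=-1.0, gap=-2.0):
    """Posterior probability that x[i] is aligned to y[j].

    Every global alignment is weighted by exp(total score); the scores
    are illustrative assumptions, not parameters from the paper.
    """
    n, m = len(x), len(y)
    w_gap = math.exp(gap)

    def w_pair(i, j):
        # Weight of aligning x[i] with y[j] (0-based indices).
        return math.exp(match if x[i] == y[j] else mismatch)

    # fwd[i][j]: total weight of all alignments of x[:i] with y[:j].
    fwd = [[0.0] * (m + 1) for _ in range(n + 1)]
    fwd[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:
                fwd[i][j] += fwd[i - 1][j] * w_gap
            if j > 0:
                fwd[i][j] += fwd[i][j - 1] * w_gap
            if i > 0 and j > 0:
                fwd[i][j] += fwd[i - 1][j - 1] * w_pair(i - 1, j - 1)

    # bwd[i][j]: total weight of all alignments of x[i:] with y[j:].
    bwd = [[0.0] * (m + 1) for _ in range(n + 1)]
    bwd[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i < n:
                bwd[i][j] += bwd[i + 1][j] * w_gap
            if j < m:
                bwd[i][j] += bwd[i][j + 1] * w_gap
            if i < n and j < m:
                bwd[i][j] += bwd[i + 1][j + 1] * w_pair(i, j)

    z = fwd[n][m]  # total weight of all alignments of x with y
    return [
        [fwd[i][j] * w_pair(i, j) * bwd[i + 1][j + 1] / z for j in range(m)]
        for i in range(n)
    ]


if __name__ == "__main__":
    for row in match_posteriors("ACGT", "AGT"):
        print(" ".join(f"{p:.3f}" for p in row))
```

Each row of the returned matrix sums to at most 1; the remainder is the probability that the base is aligned to a gap. Rows with no entry close to 1 mark positions where the alignment is uncertain under this toy model.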

Year:  2007        PMID: 18073381      PMCID: PMC2203628          DOI: 10.1101/gr.6725608

Source DB:  PubMed          Journal:  Genome Res        ISSN: 1088-9051            Impact factor:   9.043


