A machine-learning approach to combined evidence validation of genome assemblies.

Jeong-Hyeon Choi, Sun Kim, Haixu Tang, Justen Andrews, Don G Gilbert, John K Colbourne.

Abstract

MOTIVATION: While it is common to refer to 'the genome sequence' as if it were a single, complete and contiguous DNA string, it is in fact an assembly of millions of small, partially overlapping DNA fragments. Sophisticated computer algorithms (assemblers and scaffolders) merge these DNA fragments into contigs, and place these contigs into sequence scaffolds using the paired-end sequences derived from large-insert DNA libraries. Each step in this automated process is susceptible to producing errors; hence, the resulting draft assembly represents (in practice) only a likely assembly that requires further validation. Knowing which parts of the draft assembly are likely free of errors is critical if researchers are to draw reliable conclusions from the assembled sequence data.
RESULTS: We develop a machine-learning method to detect assembly errors in sequence assemblies. Several in silico measures for assembly validation have been proposed by various researchers. Using three benchmarking Drosophila draft genomes, we evaluate these techniques along with some new measures that we propose, including the good-minus-bad coverage (GMB), the good-to-bad-ratio (RGB), the average Z-score (AZ) and the average absolute Z-score (ASZ). Our results show that the GMB measure performs better than the others in both sensitivity and specificity for assembly error detection. Nevertheless, no single method performs well enough to reliably detect genomic regions that require further experimental verification. To utilize the advantages of all these measures, we develop a novel machine-learning approach that combines these individual measures to achieve a higher prediction accuracy (i.e. greater than 90%). Our combined evidence approach avoids the difficult and often ad hoc selection of the many parameters that the individual measures require, and significantly improves the overall precision on the benchmarking data sets.
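ILLUSTRATION: The abstract names the four mate-pair measures but does not define them, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that a clone is "good" when its assembled insert size lies within `z_cut` standard deviations of the library mean and "bad" otherwise, and that each measure is evaluated over the clones spanning a single assembly position. The names `MatePair`, `window_measures` and `z_cut` are hypothetical, and the paper's actual definitions (for example, whether mate orientation is also checked) may differ.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class MatePair:
    """A clone placed on the assembly (hypothetical representation)."""
    start: int  # leftmost assembled coordinate covered by the clone
    end: int    # rightmost assembled coordinate (exclusive)


def z_score(pair, lib_mean, lib_sd):
    """Deviation of the assembled insert size from the library mean, in SDs."""
    return ((pair.end - pair.start) - lib_mean) / lib_sd


def window_measures(pairs, pos, lib_mean, lib_sd, z_cut=3.0):
    """Compute assumed GMB, RGB, AZ and ASZ values at assembly position `pos`."""
    # Only clones whose span covers the position contribute evidence.
    spanning = [p for p in pairs if p.start <= pos < p.end]
    zs = [z_score(p, lib_mean, lib_sd) for p in spanning]

    # Assumption: "good" means the insert size is consistent with the library.
    good = sum(1 for z in zs if abs(z) <= z_cut)
    bad = len(zs) - good

    return {
        "GMB": good - bad,                                   # good-minus-bad coverage
        "RGB": good / bad if bad else float("inf"),          # good-to-bad ratio
        "AZ": mean(zs) if zs else 0.0,                       # average Z-score
        "ASZ": mean([abs(z) for z in zs]) if zs else 0.0,    # average absolute Z-score
    }
```

In this sketch, a position spanned mostly by stretched or compressed clones receives a low GMB and a high ASZ. The combined-evidence approach described above would feed such per-position values to a classifier instead of thresholding each one separately.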

Year: 2008 PMID: 18204064 DOI: 10.1093/bioinformatics/btm608

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

Cited

6 in total

1. Genome assembly quality: assessment and improvement using the neutral indel model.

Authors: Stephen Meader; LaDeana W Hillier; Devin Locke; Chris P Ponting; Gerton Lunter
Journal: Genome Res Date: 2010-03-19 Impact factor: 9.043

2. Genome assembly reborn: recent computational challenges.

Authors: Mihai Pop
Journal: Brief Bioinform Date: 2009-05-29 Impact factor: 11.622

3. Sequence assembly demystified. (Review)

Authors: Niranjan Nagarajan; Mihai Pop
Journal: Nat Rev Genet Date: 2013-01-29 Impact factor: 53.242

4. Detection and correction of false segmental duplications caused by genome mis-assembly.

Authors: David R Kelley; Steven L Salzberg
Journal: Genome Biol Date: 2010-03-10 Impact factor: 13.583

5. Positional information resolves structural variations and uncovers an evolutionarily divergent genetic locus in accessions of Arabidopsis thaliana.

Authors: Alvina G Lai; Matthew Denton-Giles; Bernd Mueller-Roeber; Jos H M Schippers; Paul P Dijkwel
Journal: Genome Biol Evol Date: 2011-05-27 Impact factor: 3.416

6. Extensive error in the number of genes inferred from draft genome assemblies.

Authors: James F Denton; Jose Lugo-Martinez; Abraham E Tucker; Daniel R Schrider; Wesley C Warren; Matthew W Hahn
Journal: PLoS Comput Biol Date: 2014-12-04 Impact factor: 4.475
