M Fujimoto, Paul M Bodily, Nozomu Okuda, Mark J Clement, Quinn Snell.
Abstract
BACKGROUND: Error correction is an important step in increasing the quality of next-generation sequencing data for downstream analysis and use. Polymorphic datasets are a challenge for many bioinformatic software packages that are designed for or assume homozygosity of an input dataset. This assumption ignores the true genomic composition of many organisms that are diploid or polyploid. In this survey, two different error correction packages, Quake and ECHO, are examined to see how they perform on next-generation sequence data from heterozygous genomes.
Year: 2014 PMID: 25077414 PMCID: PMC4110727 DOI: 10.1186/1471-2105-15-S7-S3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Errors at heterozygous and homozygous positions (all errors) as treated by Quake and ECHO. Rate of occurrence is defined as how often an error is treated in a specified way out of all heterozygous and homozygous errors. The haploid genome size was used when running Quake.
Figure 2. Errors at only heterozygous positions as treated by Quake and ECHO. Performance of the two error correction programs on errors at heterozygous positions, given the heterozygous dataset at ≈ 3.7% error rate. Rate corrected is defined as the number of corrections made at erroneous heterozygous bases out of all erroneous heterozygous bases in the dataset. The haploid genome size was used when running Quake.
The SOAPdenovo2 E. coli assembly.
| Correction Algorithm | How Corrected | Contigs | N50 | Largest Contig |
|---|---|---|---|---|
| Raw reads | Not corrected | 486825 | 100 | 7110 |
| Quake | Corrected together | 25642 | 661 | 28841 |
| Quake | Corrected separately | 18153 | 1143 | 36891 |
| ECHO | Corrected together | 392668 | 100 | 10094 |
| ECHO | Corrected separately | 348885 | 100 | 9563 |
Five different assemblies were produced. The assembly of raw reads involved no correction. For both Quake and ECHO, the reads were either corrected separately by strain or corrected together with reads from both strains present, and then assembled. The number of contigs, the N50 size, and the length of the largest contig of each assembly are shown.
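The three columns of the table can be computed from a list of contig lengths. The following is a minimal sketch, assuming the common definition of N50 (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length). The function names `n50` and `assembly_summary` are illustrative and are not taken from SOAPdenovo2 or from the paper.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def assembly_summary(contig_lengths):
    """Summarize an assembly with the three statistics reported in the table."""
    return {
        "contigs": len(contig_lengths),
        "n50": n50(contig_lengths),
        "largest": max(contig_lengths, default=0),
    }
```

Under this definition, an assembly with many short contigs (for example, an N50 of 100 alongside hundreds of thousands of contigs) indicates a highly fragmented result even when one long contig is present.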
Figure 3. Introduced errors at non-error heterozygous positions. Results for the heterozygous dataset at ≈ 3.7% error rate, where errors were introduced at non-error positions. Introduced errors consisted of non-error bases at heterozygous positions that were corrected to the wrong or neither haplotype. Rate of introduced errors is defined as the number of mis-corrections at non-error heterozygous positions out of all the non-error heterozygous positions in the dataset.
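The two rates defined in the captions for Figures 2 and 3 can be illustrated with a small sketch. It assumes that, for every heterozygous position in a read, the true base, the raw (sequenced) base, and the base after correction are all known, as they would be for simulated data. It also assumes that a "correction" at an erroneous base means the true base was restored. Both assumptions, and the function name `heterozygous_rates`, are illustrative rather than taken from the paper.

```python
def heterozygous_rates(true_bases, raw_bases, corrected_bases):
    """Return (rate_corrected, rate_introduced) over heterozygous positions.

    The three arguments are aligned sequences holding, for each heterozygous
    position, the true base, the sequenced base, and the base after correction.
    """
    erroneous = fixed = clean = introduced = 0
    for truth, raw, corrected in zip(true_bases, raw_bases, corrected_bases):
        if raw != truth:
            # Sequencing error at a heterozygous position.
            erroneous += 1
            if corrected == truth:
                fixed += 1
        else:
            # Non-error position: any change is an introduced error.
            clean += 1
            if corrected != truth:
                introduced += 1
    rate_corrected = fixed / erroneous if erroneous else 0.0
    rate_introduced = introduced / clean if clean else 0.0
    return rate_corrected, rate_introduced
```

The denominators differ: the corrected rate is taken over erroneous heterozygous bases only, while the introduced-error rate is taken over heterozygous bases that were correct before correction.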
Figure 4. Chimeric reads after correction. The rate of chimeric reads is defined as the number of chimeric reads out of all reads that have > 1 heterozygous marker, given a heterozygous dataset at ≈ 3.7% error rate.
Figure 5. Reads with > 1 heterozygous marker after correction. These are reads from the heterozygous dataset at ≈ 3.7% error rate. The rate of reads with > 1 heterozygous marker is defined as the number of reads that have > 1 heterozygous marker out of all the reads in the dataset.
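The rates in the last two captions share a structure that a short sketch can make explicit. It assumes each read is represented by the list of haplotype labels observed at the heterozygous markers it covers, and that a read counts as chimeric when those labels disagree (markers from more than one haplotype on the same read). This representation, the definition of chimeric, and the function name `marker_read_rates` are assumptions made for illustration.

```python
def marker_read_rates(reads):
    """Return (rate_multi_marker, rate_chimeric) for a collection of reads.

    `reads` is a list in which each read is a list of haplotype labels,
    one label per heterozygous marker the read covers.
    """
    # Reads with more than one heterozygous marker.
    multi = [labels for labels in reads if len(labels) > 1]
    # Among those, reads whose markers come from more than one haplotype.
    chimeric = [labels for labels in multi if len(set(labels)) > 1]
    rate_multi = len(multi) / len(reads) if reads else 0.0
    rate_chimeric = len(chimeric) / len(multi) if multi else 0.0
    return rate_multi, rate_chimeric
```

Only reads with more than one marker can be recognized as chimeric, which is why the chimeric rate is taken over that subset rather than over all reads.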