Thomas D Otto, Mandy Sanders, Matthew Berriman, Chris Newbold.
Abstract
MOTIVATION: The accuracy of reference genomes is important for downstream analysis but a low error rate requires expensive manual interrogation of the sequence. Here, we describe a novel algorithm (Iterative Correction of Reference Nucleotides) that iteratively aligns deep coverage of short sequencing reads to correct errors in reference genome sequences and evaluate their accuracy.
Year: 2010 PMID: 20562415 PMCID: PMC2894513 DOI: 10.1093/bioinformatics/btq269
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Flow chart of iCORN.
Fig. 2. Example of correction of a region of chromosome one of P. falciparum 3D7. The upper plot shows the coverage per iteration of the SSAHA mapping. The lower plot represents the coverage of perfectly mapping reads (SNP-o-matic, http://snpomatic.sourceforge.net/). The vertical bars show the positions of the corrections. The actual corrections made at each iteration are shown in the multiple sequence alignment below.
Table 1. Application of iCORN to prokaryotic and eukaryotic genome projects in various stages of completion
| Organism | Sequence quality | Sequencing method | Genome size (Mb) | SNPs | Indels | Number rejected | Genome covered before (%) | Genome covered after (%) | New mappable reads | Iterations |
|---|---|---|---|---|---|---|---|---|---|---|
| | A | Capillary | 23 | 1906 | 368 | 30 | 97.20 | 97.56 | 24 698 | 6 |
| | B | Capillary | 110 | 5508 | 2520 | 2140 | 48.89 | 49.11 | 1 023 315 | 5 |
| | B | Capillary | 33 | 594 | 1061 | 122 | 98.52 | 98.62 | 313 | 6 |
| | B | Capillary | 32 | 2770 | 1878 | 320 | 89.26 | 89.72 | 5629 | 8 |
| | B | Capillary | 21 | 1431 | 238 | 1081 | 91.27 | 91.42 | 6368 | 4 |
| | B | 454 | 18 | 25 976 | 33 860 | 5639 | 88.65 | 95.38 | 140 788 | 7 |
| | B | Capillary | 22 | 1901 | 3818 | 538 | 97.18 | 97.48 | 23 805 | 7 |
| | B | Capillary | 1.0 | 487 | 16 | 18 | 99.86 | 99.997 | 9734 | 4 |
| | B | 454 | 4.1 | 61 | 1652 | 32 | 99.30 | 99.43 | 1708 | 6 |
| | B | RNAseq | 2.0 | 13 | 5 | 1 | 64.23 | 64.23 | 6 | 3 |
| | A | Capillary | 2.1 | 2 | 1 | 0 | 98.84 | 98.85 | 15 | 2 |
| | A | Capillary | 2.0 | 0 | 0 | 0 | 99.7626 | 99.7626 | 0 | 1 |
| Salmonella Dublin Strain | B | 454 | 5.0 | 13 | 45 | 18 | 96.84 | 96.85 | 207 | 7 |
| | B | Capillary | 5.0 | 25 | 235 | 6 | 99.96 | 99.97 | 131 796 | 3 |
Sequence quality: ‘A’ indicates manually finished and published genomes and ‘B’ indicates a draft assembly. SNPs and Indels show the total number called between the first and last iteration. Number rejected indicates the total number of changes that were rejected because they decreased the number of perfectly mapping reads at that location. Genome covered indicates the percentage of bases covered at least five times by perfectly mapping reads, before and after the correction. New mappable reads indicates the additional number of reads that could be mapped by SSAHA between the first and last iteration. Further information can be found in Supplementary Table S3.
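The two quantities defined in the table note can be illustrated with a minimal Python sketch. This is not the iCORN implementation: the function names `percent_covered` and `accept_correction` are hypothetical, the per-base depths are assumed to have been computed already from perfectly mapping reads, and treating an unchanged read count as acceptable is an assumption, since the note only states that changes which decreased the count were rejected.

```python
from typing import Sequence

# Depth threshold taken from the table note ("covered at least five times").
MIN_DEPTH = 5


def percent_covered(depths: Sequence[int], min_depth: int = MIN_DEPTH) -> float:
    """Percentage of positions covered at least `min_depth` times.

    `depths` is assumed to hold, for each base, the number of perfectly
    mapping reads overlapping it (hypothetical input format).
    """
    if not depths:
        return 0.0
    covered = sum(1 for d in depths if d >= min_depth)
    return 100.0 * covered / len(depths)


def accept_correction(perfect_before: int, perfect_after: int) -> bool:
    """Keep a candidate change unless it lowers the perfect-read count.

    Assumption: a change that leaves the count unchanged is kept; the
    source only says that changes decreasing the count were rejected.
    """
    return perfect_after >= perfect_before


if __name__ == "__main__":
    # Toy depths for a five-base region: three bases reach depth >= 5.
    print(percent_covered([0, 3, 5, 8, 12]))
    # A change that raises the perfect-read count from 10 to 14 is kept.
    print(accept_correction(10, 14))
    # A change that lowers it from 10 to 7 is rejected.
    print(accept_correction(10, 7))
```

Comparing `percent_covered` on the depths before and after correction would give a before/after pair analogous to the two "Genome covered" columns, under the assumptions stated above.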
Fig. 3. Examples of corrections of homopolymer length errors in assemblies from 454 sequencing. Details of the reads used can be found in Table 1. Figures are Artemis screenshots that show the three different reading frames in the direction of the gene. Black vertical lines are stop codons. Filled coloured boxes denote open reading frames. (A) Correction of a region of an assembly of P. berghei 454 reads. (B) Correction of a region of a 454 assembly of C. difficile.