Scott Diguistini, Nancy Y Liao, Darren Platt, Gordon Robertson, Michael Seidel, Simon K Chan, T Roderick Docking, Inanc Birol, Robert A Holt, Martin Hirst, Elaine Mardis, Marco A Marra, Richard C Hamelin, Jörg Bohlmann, Colette Breuil, Steven JM Jones.
Abstract
Sequencing-by-synthesis technologies can reduce the cost of generating de novo genome assemblies. We report a method for assembling draft genome sequences of eukaryotic organisms that integrates sequence information from different sources, and demonstrate its effectiveness by assembling an approximately 32.5 Mb draft genome sequence for the forest pathogen Grosmannia clavigera, an ascomycete fungus. We also developed a method for assessing draft assemblies using Illumina paired-end read data and demonstrate how we are using it to guide future sequence finishing. Our results demonstrate that eukaryotic genome sequences can be accurately assembled by combining Illumina, 454 and Sanger sequence data.
Year: 2009 PMID: 19747388 PMCID: PMC2768983 DOI: 10.1186/gb-2009-10-9-r94
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Figure 1. Assembly process overview. Overview of the process for producing de novo assemblies.
Velvet assemblies
| Total contigs | 6,945 | 8,637 | 19,118 | 39,488 |
| N50 contig | 24,566 (N/A) | 10,706 | 2,902 | 1,299 |
| Total DNA (bp) | 26,721,397 | 26,466,756 | 25,854,719 | 24,812,690 |
| EST analysis* | 6,585/29 | 6,204/24 | 4,657/11 | 2,923/9 |
*EST alignments are given as: Complete alignments/Misassemblies (see Materials and methods).
Velvet assemblies were generated from Illumina GAii read data. Assembly T42 was generated from the untrimmed, no-call- and shadow-filtered Illumina PE reads. Assemblies T38 and T36 were generated by trimming the last 4 and 6 bp, respectively, from the T42 read set. Assembly T36, QRL(Q10) = 28 was generated from the T36 read set after removing reads that failed the QRL(Q10) = 28 quality region length filter (see Materials and methods).
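The read preparation described in this note can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, and because the QRL definition is deferred to Materials and methods (not reproduced here), the sketch assumes that QRL(Q10) means the longest contiguous run of bases with Phred quality of at least 10.

```python
def trim_read(seq, qual, n_trim):
    """Drop the last n_trim bases (and their quality scores) from a read."""
    if n_trim <= 0:
        return seq, qual
    return seq[:-n_trim], qual[:-n_trim]


def longest_quality_run(qual, min_q=10):
    """Length of the longest contiguous run of scores >= min_q.

    ASSUMPTION: this is how 'quality region length' is interpreted here;
    the source does not define QRL in this section.
    """
    best = run = 0
    for q in qual:
        run = run + 1 if q >= min_q else 0
        best = max(best, run)
    return best


def passes_qrl(qual, min_q=10, min_len=28):
    """Keep a read only if its assumed QRL(min_q) is at least min_len."""
    return longest_quality_run(qual, min_q) >= min_len


# Example: a 42-bp read trimmed by 6 bp, as for the T36 read set.
seq, qual = trim_read("A" * 42, [30] * 42, 6)
keep = passes_qrl(qual)  # every base is >= Q10, so the run is 36 >= 28
```

Under this reading, trimming by 4 or 6 bp yields the T38 and T36 read sets, and the filter would then be applied to the T36 reads.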
Forge assemblies
| Total scaffolds* | 7,860 | 4,805 | 2,307 | 1,443 |
| N50 contig (scaffold) | 5,773 (N/A) | 7,440 (289,760) | 31,821 (557,565) | 164,278 (187,326) |
| Total DNA (bp)† | 29,484,877 | 34,841,371 | 39,238,044 | 29,522,629 |
| Number of scaffolds with gaps‡ | 0 | 656 | 163 | 17 |
| Augustus predictions | 10,555 | 10,230 | 8,912 | 8,476 |
| EST analysis§ | 5,544/25 | 5,747/60 | 6,314/40 | 6,685/33 |
*Scaffolds included in this calculation contained two or more reads and were longer than 500 bp.
†Total DNA was calculated excluding gaps and was performed on scaffolds that contained two or more reads and were longer than 500 bp.
‡Gaps included in this calculation were longer than 50 bp.
§EST alignments are given as: Complete alignments/Misassemblies (see Materials and methods).
Forge assemblies were generated using Illumina, 454 and Sanger read data. The '454' assembly was generated using only 454 SE read data. The 'Sanger-454' assembly was generated by combining the Sanger PE and 454 SE read collections. The 'Sanger-454-IlluminaPA' assembly was generated by combining the Sanger PE and 454 SE read collections with preassembled (PA) contigs generated from Illumina PE reads with Velvet. The 'Sanger-454-IlluminaDA' assembly was generated by combining the Sanger PE and 454 SE read collections with Illumina PE reads (DA = direct assembly).
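For readers who want to reproduce summary statistics of the kind reported in these tables, the following Python sketch computes an N50, a gap-free DNA total and a count of scaffolds with gaps. It rests on assumptions that are not taken from the source: scaffolds are given as plain sequence strings, gaps are encoded as runs of `N`, and N50 is computed over full scaffold lengths. The "two or more reads" criterion cannot be checked from sequence alone and is omitted. All function names are hypothetical.

```python
import re

# ASSUMPTION: gaps are runs of 'N'; only runs longer than 50 bp are counted.
GAP_RUN = re.compile(r"N{51,}")


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def summarize(scaffolds, min_len=500):
    """Summary statistics over scaffolds longer than min_len bp."""
    kept = [s.upper() for s in scaffolds if len(s) > min_len]
    return {
        "scaffolds": len(kept),
        # Total DNA excludes gap characters.
        "total_dna": sum(len(s) - s.count("N") for s in kept),
        "scaffolds_with_gaps": sum(1 for s in kept if GAP_RUN.search(s)),
        "n50": n50([len(s) for s in kept]),
    }


example = ["A" * 600, "A" * 300 + "N" * 60 + "A" * 300, "A" * 400]
stats = summarize(example)
```

In the example, the 400-bp sequence is dropped by the length filter, and the 60-bp `N` run in the second scaffold counts as a gap but not towards total DNA.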
Figure 2. Consensus sequence quality. The proportion of 454 read data within the total read collection affected the number of small insertions and deletions (indels) based on analysis of 7,169 unique EST-to-genome alignments. The relative proportions of insertions (blue) and deletions (orange) in the assembly sequence are shown in the inset pie chart. Assemblies are described in Tables 1 and 2; those including 454 read data were assembled with Forge; the Illumina-only assembly was generated with Velvet.
Figure 3. Comparison of Forge Sanger/454/Illumina assemblies against GCgb1. Alignments of scaffolds greater than 100 kb (y-axis) against the manually finished genome sequence, GCgb1 (x-axis), for (a) 'Sanger/454/IlluminaDA' (approximately 24 Mb on 80 scaffolds) and (b) 'Sanger/454/IlluminaPA' (approximately 28.7 Mb on 46 scaffolds).
Figure 4. Assessing the discovery of unique read information between the Illumina and 454 platforms. (a) Raw reads were processed into overlapping 28-bp k-mers, and any k-mer that varied from all other k-mers by at least 1 bp was accepted as new sequence information. The analysis was done separately for unique k-mers and those that occurred at least twice (2× k-mers). (b) MAQ was then used to map these k-mers to the reference genome sequence and the rate at which new coverage was generated was plotted against the number of k-mers examined.
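The k-mer step in panel (a) can be illustrated with a small Python sketch. It is a simplification under stated assumptions rather than the authors' analysis: "new sequence information" is taken to mean a k-mer string not seen before, reverse complements are not merged, the MAQ mapping in panel (b) is not reproduced, and the function names are hypothetical.

```python
from collections import Counter


def kmers(read, k=28):
    """Yield every overlapping k-mer of a read (none if the read is shorter than k)."""
    return (read[i:i + k] for i in range(len(read) - k + 1))


def kmer_classes(reads, k=28):
    """Split distinct k-mers into those seen once and those seen at least twice."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    unique = {km for km, c in counts.items() if c == 1}
    repeated = {km for km, c in counts.items() if c >= 2}  # the '2x k-mers'
    return unique, repeated


def discovery_curve(reads, k=28):
    """(k-mers examined, distinct k-mers seen so far) after each k-mer.

    ASSUMPTION: a k-mer counts as new if its exact string is unseen.
    """
    seen = set()
    curve = []
    examined = 0
    for read in reads:
        for km in kmers(read, k):
            examined += 1
            seen.add(km)
            curve.append((examined, len(seen)))
    return curve


# Toy example with k=4 so the result is easy to follow by hand.
toy_reads = ["ACGTAC", "ACGTAC", "TTTTGG"]
unique, repeated = kmer_classes(toy_reads, k=4)
curve = discovery_curve(toy_reads, k=4)
```

In the toy example the second read repeats the first, so the curve stays flat across its three k-mers and rises again only when the third read contributes new k-mers.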