Marco Ferrarini, Marco Moretto, Judson A Ward, Nada Šurbanovski, Vladimir Stevanović, Lara Giongo, Roberto Viola, Duccio Cavalieri, Riccardo Velasco, Alessandro Cestaro, Daniel J Sargent.
Abstract
BACKGROUND: Second generation sequencing has permitted detailed sequence characterisation at the whole genome level of a growing number of non-model organisms, but the data produced have short read lengths and biased genome coverage, leading to fragmented genome assemblies. The PacBio RS long-read sequencing platform offers the promise of increased read length and unbiased genome coverage, and thus the potential to produce genome sequence data of a finished quality containing fewer gaps and longer contigs. However, these advantages come at a much greater cost per nucleotide and with a perceived increase in error rate. In this investigation, we evaluated the performance of the PacBio RS sequencing platform through the sequencing and de novo assembly of the Potentilla micrantha chloroplast genome.
Year: 2013 PMID: 24083400 PMCID: PMC3853357 DOI: 10.1186/1471-2164-14-670
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Sequence data coverage of the chloroplast genome. Schematic diagram showing the coverage of the P. micrantha chloroplast genome by the seven Illumina contigs (black) and a single PacBio contig (green) following assembly using ABySS and Celera Assembler respectively. The red line across the top of the schematic represents the P. micrantha chloroplast genome sequence; bold blue sections indicate the inverted repeat (IR) regions of the genome. Sections of contig 1 from both the Illumina and PacBio assemblies corresponding to the non-unique section of the IR are shown in red. Illumina contig 1 spans the start/end point of the linear representation of the circular chloroplast genome.
Chloroplast sequencing data statistics
| Statistic | PacBio RS | Illumina HiSeq2000 |
|---|---|---|
| Number of raw reads¹ | 56,770 | 7,164,496 (paired reads) |
| Total nucleotides (raw data)¹ | 223,483,907 | 1,421,726,349 |
| Mean read length (raw data)¹ | 3,937 | 99 |
| Total nucleotides (error-corrected data) | 54,492,250 | n.a. |
| Mean read length (error-corrected data) | 1,902 | n.a. |
| Pre-assembly error rate² | 1.3% | 0.117% |
| Ambiguous bases post-assembly³ | 0% | 0.12% |
| Assembled genome coverage | 100% | 90.59% |
| Average depth of coverage | 320× | 9,111× |
| Number of contigs | 1 | 7 |
| Total genome coverage (bp) | 154,959 | 148,776 |
Summary statistics for the assembly of the P. micrantha chloroplast genome using PacBio RS and Illumina HiSeq2000 sequencing data.
¹ Trimmed Illumina reads.
² Error-corrected PacBio reads and raw Illumina reads.
³ In comparison to the chloroplast consensus sequence.
Figure 2. Base-per-base coverage of the chloroplast genome. Graph showing the base-per-base depth of sequencing coverage across the P. micrantha chloroplast genome with (a) Illumina (black) and PacBio (green) data and (b) PacBio data only, revealing a more uniform coverage of PacBio data across the genome despite the substantially lower depth of coverage, and regions of the genome with poor or zero coverage in the Illumina dataset. The two regions of significantly greater coverage in both datasets represent the two inverted repeat regions.
Figure 3. Determination of percentage GC bias in the Illumina and PacBio datasets. Percentage of mean depth of coverage across 987 windows of 157 nucleotides plotted as a function of percentage GC content for (a) Illumina (black) and (b) PacBio (green) data, showing a much stronger positive dependency within the Illumina data (Pearson's correlation coefficient = 0.61, p-value = 2.2e-16) than in the PacBio data (Pearson's correlation coefficient = 0.23, p-value = 5.675e-09). For the purposes of the calculation, high coverage data from the two inverted repeat regions were excluded.
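The windowed GC-bias calculation described in the Figure 3 caption can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names (`gc_percent`, `window_stats`, `pearson`), the use of non-overlapping windows, the normalisation against the mean depth of the retained windows, and the rule of skipping any window that overlaps an excluded interval are all assumptions made for illustration. Note that 987 × 157 = 154,959, the assembled length reported for the PacBio contig, which is consistent with non-overlapping windows.

```python
from math import sqrt


def gc_percent(seq):
    """Percentage of G and C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def window_stats(genome, depth, window=157, exclude=()):
    """Return (GC %, depth as % of mean depth) for each retained window.

    genome  : reference sequence as a string
    depth   : per-base depth of coverage, same length as genome
    window  : window size in nucleotides (157 in the caption)
    exclude : hypothetical list of half-open (start, end) intervals to skip,
              e.g. the inverted repeat regions
    """
    if len(genome) != len(depth):
        raise ValueError("genome and depth must have the same length")
    gc, mean_depths = [], []
    # Non-overlapping full-length windows; a trailing partial window is dropped.
    for start in range(0, len(genome) - window + 1, window):
        end = start + window
        # Skip windows that overlap any excluded interval.
        if any(start < ex_end and ex_start < end for ex_start, ex_end in exclude):
            continue
        gc.append(gc_percent(genome[start:end]))
        mean_depths.append(sum(depth[start:end]) / window)
    # Assumption: normalise against the mean depth of the retained windows.
    overall = sum(mean_depths) / len(mean_depths)
    relative = [100.0 * d / overall for d in mean_depths]
    return gc, relative


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Toy data (not from the paper): depth rises linearly with GC content.
toy_genome = "AAAAGGGGACGT"
toy_depth = [10] * 4 + [30] * 4 + [20] * 4
toy_gc, toy_rel = window_stats(toy_genome, toy_depth, window=4)
toy_r = pearson(toy_gc, toy_rel)
```

On real data, `depth` would come from a per-base coverage file and `exclude` would hold the coordinates of the two inverted repeat regions. A coefficient near 0 would indicate little GC dependence, whereas the caption reports 0.61 for Illumina and 0.23 for PacBio.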
Figure 4. The chloroplast genome sequence. Structural organisation of gene content of the P. micrantha chloroplast genome, detailing genes transcribed clockwise inside the circle and genes transcribed counter-clockwise outside the circle. Genes are coloured according to functional categorisation; the inner circle indicates mean percentage GC content across the genome. IRa and IRb denote inverted repeat regions; LSC and SSC denote long and short single copy regions respectively. Genome map plotted using OGDRAW [15].