Sagar M Utturkar, Dawn M Klingeman, José M Bruno-Barcena, Mari S Chinn, Amy M Grunden, Michael Köpke, Steven D Brown.
Abstract
During the past decade, DNA sequencing output has been dominated by second generation sequencing platforms such as Illumina, which are characterized by low cost, high throughput and shorter read lengths. The emergence and development of so-called third generation sequencing platforms such as PacBio has permitted exceptionally long reads (over 20 kb) to be generated. Due to read length increases, algorithm improvements and hybrid assembly approaches, the concept of one chromosome, one contig and automated finishing of microbial genomes is now a realistic and achievable task for many microbial laboratories. In this paper, we describe high quality sequence datasets which span three generations of sequencing technologies, containing six types of data from four NGS platforms and originating from a single microorganism, Clostridium autoethanogenum. The dataset reported here will be useful for the scientific community to evaluate upcoming NGS platforms, will enable comparison of existing and novel bioinformatics approaches, and will encourage interest in the development of innovative experimental and computational methods for NGS data.
Year: 2015 PMID: 25977818 PMCID: PMC4409012 DOI: 10.1038/sdata.2015.14
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Summary of datasets
Datasets described in this manuscript, which can be accessed using the accession numbers provided.
| Data type | Description | Accession | Size |
|---|---|---|---|
| Accession linking all SRA data for this project | | SRP030033 | — |
| Roche 454 shotgun | Raw data in SFF format | SRR1748017 | 1.5 Gb |
| Roche 454 3 kb | Raw data in SFF format | SRR989497 | 1.4 Gb |
| Illumina | Raw data in fastq format | SRR989790 | (669x2) Mb* |
| Ion Torrent | Raw data in SFF format | SRR1748018 | 858 Mb |
| PacBio RS II | Filtered subreads in fastq format | SRR1740585 | 1.2 Gb |
| Dryad doi linking all depositions for this project | | | — |
| PacBio RS II | Raw PacBio data in tar.gz format | | 8.5 Gb |
| PacBio RS II | DNA methylation motifs in gff format | | 1.99 Mb |
| Sanger Sequencing | Chromatogram files in ABI format | | 4.39 Mb |
*There are two files for the Illumina data, corresponding to read_1 and read_2. Detailed instructions for downloading data from SRA are provided in the Supplementary Information.
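The supplementary download instructions are not reproduced here. As a rough sketch only, the SRA runs listed in the table could be fetched with the NCBI SRA Toolkit, assuming its `prefetch`, `fastq-dump` and `sff-dump` programs are installed and on the PATH. The helper names `sra_commands` and `render_script`, and the choice of dump tool per format, are illustrative assumptions and not the authors' procedure. The code only builds the command lines and does not run them.

```python
import shlex

# Run accessions and deposited formats, copied from the dataset table.
RUNS = {
    "SRR1748017": "sff",    # Roche 454 shotgun
    "SRR989497": "sff",     # Roche 454 3 kb
    "SRR989790": "fastq",   # Illumina (read_1 and read_2)
    "SRR1748018": "sff",    # Ion Torrent
    "SRR1740585": "fastq",  # PacBio RS II filtered subreads
}


def sra_commands(accession, fmt):
    """Return SRA Toolkit command lines for one run (hypothetical helper)."""
    fetch = ["prefetch", accession]
    if fmt == "fastq":
        # --split-files writes paired reads to separate _1/_2 files.
        dump = ["fastq-dump", "--split-files", accession]
    else:
        # Assumption: SFF-format runs are extracted with sff-dump.
        dump = ["sff-dump", accession]
    return [fetch, dump]


def render_script(runs=RUNS):
    """Join all commands into a shell-style script, one command per line."""
    lines = []
    for accession, fmt in runs.items():
        lines.extend(shlex.join(cmd) for cmd in sra_commands(accession, fmt))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_script())
```

Printing the commands first, instead of executing them directly, makes it easy to check the accessions against the table before starting a multi-gigabyte download.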
Summary of DNA methylation motif patterns discovered across the C. autoethanogenum genome
Modified base within each motif is shown in bold.
| Motif | Modified position | Modification type | % motifs detected | # motifs detected | # motifs in genome | Mean modification QV | Mean motif coverage |
|---|---|---|---|---|---|---|---|
| CAAAA | 6 | m6A | 95.44 | 4,190 | 4,390 | 68.4 | 56.8 |
| GWTA | 5 | m6A | 93.87 | 7,975 | 8,496 | 78.5 | 58.1 |
| SNNGCA | 7 | m6A | 85.27 | 3,242 | 3,802 | 75.9 | 57.8 |
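The percentages in the table are consistent with the ratio of the two counts that follow them, on the assumption that those counts are the number of motif occurrences detected as modified and the number of occurrences in the genome. A minimal check, with the helper name `percent_detected` chosen here for illustration:

```python
# (motif as printed, motifs detected, motifs in genome, reported % detected)
MOTIFS = [
    ("CAAAA", 4190, 4390, 95.44),
    ("GWTA", 7975, 8496, 93.87),
    ("SNNGCA", 3242, 3802, 85.27),
]


def percent_detected(detected, in_genome):
    """Percentage of motif occurrences detected as modified, to 2 decimals."""
    return round(100.0 * detected / in_genome, 2)


if __name__ == "__main__":
    for motif, detected, in_genome, reported in MOTIFS:
        computed = percent_detected(detected, in_genome)
        print(f"{motif}: computed {computed}, reported {reported}")
```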
Summary of quality trimming statistics for Illumina, 454 and Ion Torrent data
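| Platform | Read type | Reads before trimming | Mean read length before trimming (bp) | Reads after trimming | Mean read length after trimming (bp) | Total bases after trimming | Coverage |
|---|---|---|---|---|---|---|---|
| Roche 454 | Singletons* | 128,856 / 764,756 / 462,052 | 275 / 151 / 289 | 128,806 / 764,744 / 458,340 | 261 / 144 / 249 | 33,631,416 / 110,124,864 / 114,126,660 | 46x / 26x |
| Ion Torrent | Single end reads | 453,686 | 215 | 419,010 | 188 | 78,773,880 | 18x |
| Illumina | Paired end reads | 3,689,644 | 150 | 3,682,655 | 149 | 549,756,956 | 126x |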
*The singleton sequences are generated from the 454 3 kb sequencing run.
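As a consistency check on the table, the post-trimming base totals should be close to the number of reads after trimming multiplied by the mean read length after trimming. The sketch below assumes that reading of the columns. The three Roche 454 entries take the three values listed in each cell of that row, in order, and their labels here are placeholders. Small gaps are expected because the mean lengths are rounded to whole bases.

```python
# (label, reads after trimming, mean length after trimming, reported total bases)
ROWS = [
    ("Roche 454 (1st value)", 128_806, 261, 33_631_416),
    ("Roche 454 (2nd value)", 764_744, 144, 110_124_864),
    ("Roche 454 (3rd value)", 458_340, 249, 114_126_660),
    ("Ion Torrent", 419_010, 188, 78_773_880),
    ("Illumina", 3_682_655, 149, 549_756_956),
]


def relative_gap(reads, mean_len, reported_bases):
    """Relative difference between reads * mean length and the reported total."""
    estimate = reads * mean_len
    return abs(estimate - reported_bases) / reported_bases


if __name__ == "__main__":
    for label, reads, mean_len, reported in ROWS:
        gap = relative_gap(reads, mean_len, reported)
        print(f"{label}: {reads * mean_len:,} estimated vs {reported:,} reported "
              f"({gap:.4%} gap)")
```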
Post-filter quality statistics for PacBio data.
| Platform | Read type | Number of reads | | | Total bases | Coverage |
|---|---|---|---|---|---|---|
| PacBio RSII | Single end reads | 94,408 | 9,196 | 26,777 | 631,598,400 | 145x |
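The coverage figures reported for each platform can be approximated as total bases divided by genome size. The genome size is not stated in this section, so the sketch below assumes roughly 4.35 Mb, which is an assumption made for illustration. The base totals are the large counts copied from the Ion Torrent, Illumina and PacBio rows, taken here to be total bases.

```python
# Assumption: genome size of about 4.35 Mb; this value is not given in the text.
ASSUMED_GENOME_SIZE_BP = 4_350_000

# (platform, total bases from the tables, reported fold coverage)
DATASETS = [
    ("Ion Torrent", 78_773_880, 18),
    ("Illumina", 549_756_956, 126),
    ("PacBio RSII", 631_598_400, 145),
]


def fold_coverage(total_bases, genome_size=ASSUMED_GENOME_SIZE_BP):
    """Average sequencing depth: sequenced bases per genome position."""
    return total_bases / genome_size


if __name__ == "__main__":
    for platform, bases, reported in DATASETS:
        print(f"{platform}: {fold_coverage(bases):.1f}x estimated, {reported}x reported")
```

Under that assumed genome size, the rounded estimates agree with the reported values for all three platforms.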
Figure 1. PHRED quality score distribution.
The average PHRED quality score is plotted on the X-axis and the percentage of sequences on the Y-axis for (a) 454 single end shotgun data, (b) 454 3 kb paired end data, (c) Ion Torrent single end data and (d) Illumina paired end data. The distributions show that more than 95% of reads in each dataset have an average PHRED score above 20.
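A PHRED score Q corresponds to a base-call error probability of 10^(-Q/10), so a score of 20 means a 1% chance of error. The sketch below shows one way the per-read average and the fraction of reads above a threshold could be computed from FASTQ quality strings. It assumes Phred+33 encoding and a plain arithmetic mean of per-base scores, and the function names are illustrative. The source does not state how the averages in the figure were calculated, and SFF data would first need converting to FASTQ.

```python
def error_probability(q):
    """Base-call error probability implied by PHRED score q."""
    return 10 ** (-q / 10)


def mean_phred(quality_string, offset=33):
    """Arithmetic mean of per-base PHRED scores (Phred+33 assumed)."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


def fraction_above(quality_strings, threshold=20.0):
    """Fraction of non-empty reads whose mean PHRED score exceeds threshold."""
    means = [mean_phred(q) for q in quality_strings if q]
    return sum(m > threshold for m in means) / len(means)


if __name__ == "__main__":
    # Toy quality strings, not taken from the datasets.
    reads = ["IIII", "5555", "++++"]
    for q in reads:
        print(q, mean_phred(q))
    print("fraction with mean PHRED > 20:", fraction_above(reads))
```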
Figure 2. Mapped subread concordance and coverage.
The distributions of mapped subread concordance and mapped subread coverage are plotted with (a) the C. autoethanogenum DSM 10061 finished genome and (b) C. ljungdahlii DSM 13528 as reference. These graphs suggest good agreement between the reads and the reference genomes.