Kristi E Kim, Paul Peluso, Primo Babayan, P Jane Yeadon, Charles Yu, William W Fisher, Chen-Shan Chin, Nicole A Rapicavoli, David R Rank, Joachim Li, David E A Catcheside, Susan E Celniker, Adam M Phillippy, Casey M Bergman, Jane M Landolin.
Abstract
Single molecule, real-time (SMRT) sequencing from Pacific Biosciences is increasingly used in many areas of biological research including de novo genome assembly, structural-variant identification, haplotype phasing, mRNA isoform discovery, and base-modification analyses. High-quality, public datasets of SMRT sequences can spur development of analytic tools that can accommodate unique characteristics of SMRT data (long read lengths, lack of GC or amplification bias, and a random error profile leading to high consensus accuracy). In this paper, we describe eight high-coverage SMRT sequence datasets from five organisms (Escherichia coli, Saccharomyces cerevisiae, Neurospora crassa, Arabidopsis thaliana, and Drosophila melanogaster) that have been publicly released to the general scientific community (NCBI Sequence Read Archive ID SRP040522). Data were generated using two sequencing chemistries (P4C2 and P5C3) on the PacBio RS II instrument. The datasets reported here can be used without restriction by the research community to generate whole-genome assemblies, test new algorithms, investigate genome structure and evolution, and identify base modifications in some of the most widely studied model systems in biological research.
Year: 2014 PMID: 25977796 PMCID: PMC4365909 DOI: 10.1038/sdata.2014.45
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Summary of DNA samples.

The NCBI sample ID associated with each dataset is provided. DNA was extracted in a species-specific manner, yielding genomic DNA of various sizes. All DNA was size selected using the Blue Pippin system (Sage Sciences), and select samples were sheared with g-TUBEs (Covaris).

| NCBI sample ID | DNA extraction method | DNA size (kb) | Shearing | Size selection |
|---|---|---|---|---|
| SAMN02951645 | ammonium acetate or SDS, proteinase K, phenol-chloroform | 17 | none | Blue Pippin (7 kb) |
| SAMN02743420 | ammonium acetate or SDS, proteinase K, phenol-chloroform | 17 | none | Blue Pippin (7 kb) |
| SAMN02731377 | Qiagen genomic DNA buffer set | >40 | g-TUBE | Blue Pippin (17 kb) |
| SAMN02724975 | BashingBeads, Zymo Research kit | 6 | none | Blue Pippin (4 kb) |
| SAMN02724976 | SDS, proteinase K, phenol-chloroform, RNase, isopropanol | 15 | none | Blue Pippin (7 kb) |
| SAMN02731378 | CTAB, chloroform:isoamyl, isopropanol precip. | >40 | g-TUBE | Blue Pippin (7 kb) |
| SAMN02724977 | CTAB, chloroform:isoamyl, isopropanol precip. | >40 | g-TUBE | Blue Pippin (15 kb) |
| SAMN02614627 | SDS, phenol-chloroform, CsCl banding, ethanol precip. | >40 | g-TUBE | Blue Pippin (17 kb) |
Summary of datasets.

Eight datasets from five organisms are described in this paper. Data can be accessed from SRA using the accession numbers provided.

| Organism | Strain | Source | Chemistry | SRA accession | |
|---|---|---|---|---|---|
| E. coli | MG1655 | Lofstrand Labs | P4C2 | SRX669475 | 6.0 |
| E. coli | MG1655 | Lofstrand Labs | P5C3 | SRX533603 | 3.8 |
| S. cerevisiae | 9464 | J. Li | P4C2 | SRX533604 | 38 |
| N. crassa | OR74A | FGSC | P4C2 | SRX533605 | 29 |
| N. crassa | T1 | D. Catcheside | P4C2 | SRX533606 | 143 |
| A. thaliana | Ler-0 | Lehle Seeds | P4C2 | SRX533608 | 263 |
| A. thaliana | Ler-0 | Lehle Seeds | P5C3 | SRX533607 | 252 |
| D. melanogaster | ISO1 | S. Celniker | P5C3 | SRX499318 | 187 |
Summary statistics of filtered data.

Results shown for each dataset are based on output of SMRT Portal analysis using the default filtering parameters (see text for details). Fold coverage is calculated relative to the estimated genome size.

| Organism | | | | Total bases | Genome size (Mb) | Fold coverage |
|---|---|---|---|---|---|---|
| E. coli | 61,019 | 7,586 | 22,609 | 331,516,965 | 5 | 66X |
| E. coli | 43,063 | 12,041 | 28,647 | 373,874,428 | 5 | 75X |
| S. cerevisiae | 269,145 | 8,821 | 30,164 | 1,597,871,118 | 12 | 133X |
| N. crassa | 175,926 | 7,617 | 30,845 | 981,884,113 | 40 | 25X |
| N. crassa | 210,480 | 10,462 | 36,227 | 11,497,185,440 | 40 | 287X |
| A. thaliana | 1,338,320 | 8,769 | 41,753 | 8,129,670,483 | 120 | 68X |
| A. thaliana | 2,067,212 | 12,188 | 47,445 | 17,714,447,516 | 120 | 148X |
| D. melanogaster | 1,561,929 | 14,214 | 44,766 | 15,194,174,294 | 160 | 95X |
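
The caption states that fold coverage is calculated relative to the estimated genome size. As a minimal sketch, the Python below recomputes that ratio from the last three columns of the table above. It assumes those columns are, in order, total filtered bases, estimated genome size in Mb, and reported fold coverage; the table itself does not label them, and the helper name `fold_coverage` is introduced here only for illustration.

```python
# Values copied from the last three columns of the summary-statistics table:
# (total bases, estimated genome size in Mb, reported fold coverage).
# ASSUMPTION: these column meanings are inferred from the caption.
ROWS = [
    (331_516_965, 5, 66),
    (373_874_428, 5, 75),
    (1_597_871_118, 12, 133),
    (981_884_113, 40, 25),
    (11_497_185_440, 40, 287),
    (8_129_670_483, 120, 68),
    (17_714_447_516, 120, 148),
    (15_194_174_294, 160, 95),
]


def fold_coverage(total_bases: int, genome_size_mb: float) -> float:
    """Return sequenced bases divided by genome size (given in megabases)."""
    return total_bases / (genome_size_mb * 1_000_000)


# Print the recomputed value next to the value reported in the table.
for total_bases, genome_mb, reported in ROWS:
    computed = fold_coverage(total_bases, genome_mb)
    print(f"{genome_mb:>4} Mb  computed {computed:6.1f}X  reported {reported}X")
```

Under these assumptions, each recomputed value rounds to the coverage reported in the table, which supports reading the third-from-last column as total bases and the second-from-last as genome size in megabases.
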
Figure 1. Mapped Subread Concordance and Coverage.
The distributions of mapped subread concordance and mapped subread coverage are plotted for E. coli MG1655 P4C2 (a), S. cerevisiae 9464 P4C2 (b), and D. melanogaster ISO1 P5C3 (c). The coverage distribution is similar among all chromosomes in S. cerevisiae, whereas in D. melanogaster the coverage of chrX (50X) is half that of the autosomes (100X). ChrU and chrUextra are assembled contigs that could not be placed on physical chromosomes, and they generally have very low coverage.