Ting Hon, Kristin Mars, Greg Young, Yu-Chih Tsai, Joseph W Karalius, Jane M Landolin, Nicholas Maurer, David Kudrna, Michael A Hardigan, Cynthia C Steiner, Steven J Knapp, Doreen Ware, Beth Shapiro, Paul Peluso, David R Rank.
Abstract
The PacBio® HiFi sequencing method yields highly accurate long-read sequencing datasets with read lengths averaging 10-25 kb and accuracies greater than 99.5%. These accurate long reads can be used to improve results for complex applications such as single nucleotide and structural variant detection, genome assembly, assembly of difficult polyploid or highly repetitive genomes, and assembly of metagenomes. Currently, there is a need for sample datasets both to evaluate the benefits of these long accurate reads and to develop bioinformatic tools including genome assemblers, variant callers, and haplotyping algorithms. We present deep coverage HiFi datasets for five complex samples including the two inbred model genomes Mus musculus and Zea mays, as well as two complex genomes, octoploid Fragaria × ananassa and the diploid anuran Rana muscosa. Additionally, we release sequence data from a mock metagenome community. The datasets reported here can be used without restriction to develop new algorithms and explore complex genome structure and evolution. Data were generated on the PacBio Sequel II System.
Year: 2020 PMID: 33203859 PMCID: PMC7673114 DOI: 10.1038/s41597-020-00743-4
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Fig. 1 Flowchart of HiFi sequence read generation and downstream applications.
Sample description: strain names, origins, available reference sequences, and SRA BioSample IDs are detailed for each HiFi dataset.
| Sample | Strain (Cultivar/ Cell line) | Sample Origin | Sequence Reference | SRA BioSample ID |
|---|---|---|---|---|
| Mus musculus | C57BL/6J | Jackson Labs | GRCm38.p6 | SAMN14691541 |
| Zea mays | B73 | M. Hufford | Zm-B73-REFERENCE-NAM-5.0 | ERS3371164 |
| Fragaria × ananassa | Royal Royce | S. Knapp | N/A | SAMN14691544 |
| Rana muscosa | KB 21384; ISIS # 916035 | San Diego Zoo Global | N/A | SAMN14691543 |
| Metagenome Std | MSA-1003 | ATCC | See Supplementary Table 1 | SAMN14691545 |
Background genomic information for each sample: strain or sample ID, expected ploidy level, inbred status, and haploid genome size for each HiFi read dataset.
| Organism | Strain | Ploidy | Inbred | Haploid Genome Size (Mb) |
|---|---|---|---|---|
| Mus musculus | C57BL/6J | 2n | Yes | 2,700 |
| Zea mays | B73 | 2n | Yes | 2,200 |
| Fragaria × ananassa | Royal Royce | 8n | No | 800* |
| Rana muscosa | KB 21384; ISIS # 916035 | 2n | No | 9,000ǂ |
| Metagenome Std | MSA-1003 | N/A | N/A | 67 |
*The estimated haploid genome size of F. × ananassa ‘Royal Royce’ is based on the size of the sequenced F. × ananassa ‘Camarosa’[26].
ǂThe haploid genome size of R. muscosa is estimated at 9 Gb based on the estimated genome sizes of 8,600 to 9,100 Mb for two closely related species (R. aurora and R. cascadae)[27] as well as the size estimate provided by our k-mer analysis.
Library molecule sizes, sequencing metrics, and SRA accession numbers for each HiFi read dataset.
| Organism | HiFi library size (kb) | Sequel II Runs (number) | Bases > RQ20 (Gb) | Average RL (kb) | Reads (Millions) | Quality Value* (avg) | Data Record |
|---|---|---|---|---|---|---|---|
| Mus musculus | 15.9 | 2 | 66.5 | 16.4 | 4.1 | 31 | SRR11606870 |
| Zea mays | 15.0 | 2 | 48.1 | 15.6 | 3.1 | 30 | SRR11606869 |
| Fragaria × ananassa | 23.0 | 1 | 29.7 | 21.7 | 1.4 | 28 | SRR11606867 |
| Rana muscosa | 15.8 | 8 | 189.1 | 15.7 | 12.1 | 31 | SRR11606868 |
| ATCC MSA-1003 | 14.1 | 2 | 59.1 | 10.5 | 5.6 | 35 | SRR11606871 |
*Predicted RQ values from the PacBio software are on the Phred quality scale, Q = −10 × log10(P), where P is the probability of error.
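The Phred relationship in the footnote above can be inverted to turn a quality value into an error probability or a percent accuracy. The following is a minimal Python sketch of that arithmetic; the function names are illustrative and are not part of any PacBio software.

```python
import math


def phred_to_error(q: float) -> float:
    """Error probability P for a Phred quality value Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred quality value Q for an error probability P (0 < P <= 1)."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


def phred_to_accuracy_percent(q: float) -> float:
    """Percent accuracy implied by a Phred quality value."""
    return 100 * (1 - phred_to_error(q))


if __name__ == "__main__":
    for q in (20, 30):
        print(f"Q{q}: error {phred_to_error(q):.4f}, "
              f"accuracy {phred_to_accuracy_percent(q):.1f}%")
```

On this scale the RQ20 threshold used in the table corresponds to a predicted accuracy of 99%, and Q30 corresponds to 99.9%.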
Technical validation summary: k-mer based genome coverage estimates, average mapped HiFi read coverage for samples with reference genomes[59,61], and median and mean mapped HiFi read accuracy for each dataset.
| Sample | K-mer based Genome Coverage (fold) | Reference Mapped Genome Coverage (fold) | Median Read Accuracy (percent) | Mean Read Accuracy (percent) |
|---|---|---|---|---|
| Mus musculus | 25 | 27 | 99.869 | 99.176 |
| Zea mays | 21 | 23 | 99.844 | 99.686 |
| Fragaria × ananassa | 17/37/74/109 | N/A# | N/A# | N/A# |
| Rana muscosa | 20 | N/A# | N/A# | N/A# |
| ATCC MSA-1003 | 2–4,000 | 1–8,000§ | 99.995 | 99.733 |
#No published reference.
§See Supplementary Table 1 for reference genome file names and locations.
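As a rough cross-check of the coverage values, the expected fold coverage can be approximated as HiFi yield divided by haploid genome size. The sketch below uses the yields (Bases > RQ20, in Gb) and haploid genome sizes (in Mb) reported in the tables above. It assumes that the rows of those tables follow the organism order of the sample description table, and the dictionary and function names are illustrative only.

```python
# Yield (Gb of bases > RQ20) and haploid genome size (Mb), copied from the
# tables above. Row-to-organism assignment is an assumption based on row order.
DATASETS = {
    "Mus musculus": (66.5, 2700),
    "Zea mays": (48.1, 2200),
    "Fragaria x ananassa": (29.7, 800),
    "Rana muscosa": (189.1, 9000),
}


def expected_coverage(yield_gb: float, genome_mb: float) -> float:
    """Approximate fold coverage: total sequenced bases / haploid genome size."""
    return yield_gb * 1000 / genome_mb


if __name__ == "__main__":
    for organism, (yield_gb, genome_mb) in DATASETS.items():
        fold = expected_coverage(yield_gb, genome_mb)
        print(f"{organism}: ~{fold:.1f}-fold")
```

This gives roughly 24.6, 21.9, 37.1, and 21.0-fold, close to the k-mer based estimates of 25, 21, and 20 for the three diploid samples. For the octoploid F. × ananassa, the approximately 37-fold value coincides with one of the four reported k-mer peaks.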
Fig. 2 Read length and quality distributions for the three sequenced samples with high quality finished sequence references. M. musculus read length (a) and accuracy (b), Z. mays read length (c) and accuracy (d), and mock metagenome community ATCC MSA-1003 read length (e) and accuracy (f). All data were mapped to the genomic references (Table 1 and Supplementary Table 1) using minimap2. Accuracies are reported in Phred read quality space (Q value) = −10 × log10(P) where P is the measured error rate.
Fig. 3 K-mer (length 21) distribution for all HiFi reads for each sequencing dataset. (a) M. musculus (b) Z. mays (c) F. × ananassa (d) R. muscosa (e) Mock metagenome community ATCC MSA-1003.
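A k-mer distribution of the kind plotted in Fig. 3 is a histogram of how many distinct k-mers occur at each multiplicity across the reads. The following toy Python sketch illustrates the idea for k = 21. It is not the tool used to produce the figure, it counts canonical k-mers (an assumption on our part), and it is far too slow and memory-hungry for datasets of this size, where a dedicated k-mer counter would be used instead.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_VALID = set("ACGT")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_spectrum(reads, k: int = 21) -> Counter:
    """Map each multiplicity to the number of distinct canonical k-mers seen that often."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= _VALID:
                counts[canonical(kmer)] += 1
    # Histogram: multiplicity -> number of distinct k-mers with that multiplicity.
    return Counter(counts.values())


if __name__ == "__main__":
    toy_reads = ["ACGTAC", "ACGTAC"]
    print(dict(kmer_spectrum(toy_reads, k=3)))
```

In a real spectrum the position of the main peak approximates the fold coverage of the genome, which is how the k-mer based coverage values in the validation table can be read off such a histogram.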
| Measurement(s) | DNA • genome • Metagenome |
| Technology Type(s) | DNA sequencing • PacBio Sequel System |
| Factor Type(s) | organism that had its genome sequenced |
| Sample Characteristic - Organism | Mus musculus • Rana muscosa • Fragaria x ananassa • Zea mays |