Pirita Paajanen1,2, George Kettleborough1, Elena López-Girona3,4, Michael Giolai1, Darren Heavens1, David Baker1, Ashleigh Lister1, Fiorella Cugliandolo1, Gail Wilde3, Ingo Hein3, Iain Macaulay1, Glenn J Bryan3, Matthew D Clark1,5.
Abstract
BACKGROUND: A high-quality genome sequence of any model organism is an essential starting point for genetic and other studies. Older clone-based methods are slow and expensive, whereas faster, cheaper short-read-only assemblies can be incomplete and highly fragmented, which limits their usefulness. The last few years have seen the introduction of many new technologies for genome assembly. These new technologies and their associated algorithms are typically benchmarked on microbial genomes or, if they scale appropriately, on larger (e.g., human) genomes. However, plant genomes can be much larger and more repetitive than the human genome, and plant biochemistry often makes it difficult to obtain high-quality DNA that is free from contaminants. Reflecting this challenging nature, we observe that plant genome assembly statistics are typically poorer than those for vertebrates.
Keywords: 10x Genomics; PacBio; Pacific Biosciences; assembly; long reads; optical mapping; short reads
Year: 2019 PMID: 30624602 PMCID: PMC6423373 DOI: 10.1093/gigascience/giy163
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Figure 1: Comparison of contig/scaffold lengths and total assembly sizes of the various S. verrucosum assemblies.
Assembly statistics of Illumina and PacBio assemblies, with a minimum contig/scaffold size of 1 kbp
| Assembly | Number of contigs | N50 (kbp) | Max length (kbp) | Total length (Mbp) |
|---|---|---|---|---|
|  | 33,146 | 75 | 642 | 702 |
|  | 21,376 | 331 | 2,288 | 712 |
|  | 25,216 | 77 | 498 | 646 |
|  | 8,074 | 858 | 4,266 | 665 |
|  | 5,446 | 585 | 4,876 | 716 |
|  | 8,138 | 290 | 4,701 | 722 |
|  | 2,442 | 712 | 5,738 | 659 |
abyss uses the TALL library; discovar uses the Discovar library; and hgap, canu, and falcon use the PacBio library. For a more comprehensive summary, see Supplementary Table S3.1.
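The N50 column above can be illustrated with a minimal sketch, assuming the usual definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The function name `n50`, the `min_length` parameter, and the toy lengths are hypothetical and are not taken from the paper; the 1 kbp cutoff mirrors the minimum contig/scaffold size stated in the table title.

```python
def n50(lengths, min_length=0):
    """Return the N50 of a collection of contig or scaffold lengths.

    Sequences shorter than min_length are discarded first, mirroring
    a minimum contig/scaffold size filter.
    """
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    if not kept:
        raise ValueError("no sequences at or above min_length")
    half_total = sum(kept) / 2
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for n in kept:
        running += n
        if running >= half_total:
            return n


# Toy lengths in bp (illustrative only, not data from the paper).
toy_lengths = [800, 5000, 3000, 2000, 1000]
print(n50(toy_lengths, min_length=1000))  # prints 3000
```

Because N50 depends on the total length after filtering, the same assembly can report different N50 values under different minimum-length cutoffs, which is why the cutoff is stated alongside the table.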
Figure 2: k-mer spectra plots from the k-mer Analysis Toolkit comparing three S. verrucosum contig assemblies. The heights of the bars indicate how many k-mers of each multiplicity appear in the raw Discovar reads. The colors indicate how many times those k-mers appear in the respective assemblies, with black being zero times and red being one time. A colored bar at zero multiplicity indicates k-mers appearing in the assembly that do not appear in the reads. The Falcon assembly has been polished with the Illumina reads using Pilon to reduce the effect of using a different sequencing platform.
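The quantity plotted in such a spectrum can be sketched as a two-way tally: for each distinct k-mer, its multiplicity in the reads and its copy number in the assembly. The sketch below is a simplified illustration, not the k-mer Analysis Toolkit's implementation; it counts forward-strand k-mers only, holds everything in memory, and the names `kmer_counts` and `spectra_cn` and the toy sequences are hypothetical.

```python
from collections import Counter


def kmer_counts(seqs, k):
    """Count every forward-strand k-mer across a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def spectra_cn(reads, assembly, k):
    """Tally distinct k-mers by (read multiplicity, assembly copy number)."""
    read_counts = kmer_counts(reads, k)
    asm_counts = kmer_counts(assembly, k)
    matrix = Counter()
    # A Counter returns 0 for k-mers it has never seen.
    for kmer in read_counts.keys() | asm_counts.keys():
        matrix[(read_counts[kmer], asm_counts[kmer])] += 1
    return matrix


# Toy data (illustrative only).
reads = ["ACGTA", "ACGTA", "CGTAC"]
assembly = ["ACGTT"]
matrix = spectra_cn(reads, assembly, k=3)
for (multiplicity, copies), n in sorted(matrix.items()):
    print(multiplicity, copies, n)
```

In this tally, entries with an assembly copy number of zero correspond to the black portion of each bar, and entries with a read multiplicity of zero correspond to the colored bar at zero multiplicity.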
Figure 3: BUSCO analysis of supernova-bn, discovar-mp-dt-bn, and falcon-dt-bn using the plant gene dataset.
Figure 4: Box and whisker plot showing completeness of the S. tuberosum transcripts in supernova-bn, discovar-mp-dt-bn, and falcon-dt-bn with various levels of minimum percentage identity.
Figure 5: A difficult region of the genome that is contiguously assembled with a PacBio BAC but in none of our whole-genome assemblies. The region was correctly scaffolded by Dovetail. The figure shows various alignments and information with respect to the BAC assembly. The top track shows the contigs that appear in the discovar, falcon, and supernova assemblies. The paired-end track shows read coverage of the Discovar paired-end library. The mate-pair and Dovetail tracks show physical/fragment coverage of the mate-pair and Dovetail libraries, respectively. The bottom track shows GC content of the sequence as well as homopolymer sequences of at least 5 bp, where A, C, G, and T are colored red, blue, yellow, and green, respectively.
Figure 6: MUMmer plots showing alignment to chromosome 11 of the S. tuberosum reference version 4.03. The S. tuberosum reference is shown on the x-axis and assembly scaffolds on the y-axis. Alignments shown are at least 10 kbp long and 90% identical.
Material requirements for each library
| Library | Tissue type | Material/DNA amount | HMW | Fragment length (bp) |
|---|---|---|---|---|
| TALL | Frozen | 3 μg | No | 700 |
| Discovar | Frozen | 0.6 μg | No | 500 |
| Mate-pair | Frozen | 4 μg | No | 10,000 |
| PacBio | Young frozen | 5 g | No | 20,000 |
| BioNano | Young fresh | 2.5 μg | Yes | >100,000 |
| Dovetail | Fresh | 20 g | Yes | >100,000 |
| Chromium | Flash frozen | 0.5 g | Yes | >100,000 |
Amounts in grams are for fresh/frozen material and amounts in micrograms for DNA. In each case where frozen or flash frozen is stated, fresh material is also acceptable.
The overall cost of each assembly project
| Assembly | Paired-end | Mate-pair | PacBio | Chromium | Dovetail | BioNano | HiSeq 2500 | MiSeq | PacBio RSII | Total (USD) |
|---|---|---|---|---|---|---|---|---|---|---|
|  | ✗ |  |  |  |  |  | ✗ |  |  | 3,273 |
|  | ✗ | ✗ |  |  |  |  | ✗ | ✗ |  | 7,854 |
|  | ✗ | ✗ |  |  |  | ✗ | ✗ | ✗ |  | 8,803 |
|  | ✗ | ✗ |  |  | ✗ |  | ✗✗ | ✗ |  | 32,793 |
|  | ✗ | ✗ |  |  | ✗ | ✗ | ✗✗ | ✗ |  | 33,742 |
|  |  |  | ✗ |  |  |  |  |  | ✗ | 25,499 |
|  |  |  | ✗ |  |  | ✗ |  |  | ✗ | 26,448 |
|  |  |  | ✗ |  | ✗ |  | ✗ |  | ✗ | 50,438 |
|  |  |  | ✗ |  | ✗ | ✗ | ✗ |  | ✗ | 51,387 |
|  |  |  |  | ✗ |  |  | ✗ |  |  | 4,299 |
|  |  |  |  | ✗ |  | ✗ | ✗ |  |  | 5,248 |
| Individual cost | 209 | 595 | 474 | 1,235* | 21,875 | 949* | 3,064 | 3,986 | 25,025 |  |
We show which library preparations and sequencing runs are required for each assembly with a mark (✗); a double mark (✗✗) indicates that two runs are required. Individual costs are given at the bottom, and total costs of each assembly are on the right. All costs are according to Duke University as of April 2017 and in US dollars (USD), except those marked with an asterisk (*), which were according to the Earlham Institute and converted from British pounds (GBP) to US dollars at an exchange rate of 0.804 GBP/USD. Paired-end, mate-pair, PacBio, and Chromium are library preparations including DNA extraction. Dovetail includes Chicago library preparation and HiRise scaffolding. BioNano is the cost of building the optical map. HiSeq 2500 is for a rapid run half flowcell (one lane) with 250 bp reads. MiSeq is for two runs with 300 bp reads. PacBio RSII is for 65 SMRT cells.
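The totals in the right-hand column can be checked against the individual costs with a short sketch. The unit costs are copied from the bottom row of the table. The component lists for the four example projects are an assumption: they were inferred by matching combinations of unit costs to the stated totals, with a double mark read as two runs. The names `UNIT_COST` and `project_total` are hypothetical.

```python
# Individual costs in USD, as given in the bottom row of the cost table.
UNIT_COST = {
    "Paired-end": 209,
    "Mate-pair": 595,
    "PacBio": 474,
    "Chromium": 1235,
    "Dovetail": 21875,
    "BioNano": 949,
    "HiSeq 2500": 3064,
    "MiSeq": 3986,
    "PacBio RSII": 25025,
}


def project_total(items):
    """Sum unit costs for a mapping of component name -> number of units."""
    return sum(UNIT_COST[name] * count for name, count in items.items())


# (assumed components, total stated in the table)
examples = [
    ({"Paired-end": 1, "HiSeq 2500": 1}, 3273),
    ({"Paired-end": 1, "Mate-pair": 1, "Dovetail": 1,
      "HiSeq 2500": 2, "MiSeq": 1}, 32793),
    ({"PacBio": 1, "PacBio RSII": 1}, 25499),
    ({"Chromium": 1, "HiSeq 2500": 1}, 4299),
]

for items, stated in examples:
    computed = project_total(items)
    print(computed, stated, computed == stated)
```

Under these assumed component lists, each computed sum equals the stated total, which suggests the totals are plain sums of library preparation and sequencing costs with no additional overhead.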
Computational requirements
| Name of assembly | Approximate runtime | Peak memory (GB) | Average memory (GB) | System |
|---|---|---|---|---|
| Supernova | 3 days | 1 300 | Large memory | |
| Canu (Uncorr) | 12 days | 47 | 20 | HPC cluster |
| Canu (Corr) | 4 days | 34 | 14 | HPC cluster |
| Falcon | 5 days | 120 | 60 | Large memory |
| HGAP | 2 minutes | 280 | -- | Large memory |
| Discovar | 22 hours | 260 | 134 | Large memory |
| ABySS | 1 week | 64 | -- | HPC cluster |
| BioNano (Asm) | 8 hours | 64 | 64 | HPC cluster |
| BioNano (Scaf) | 1 day | 64 | 64 | HPC cluster |