Luigi Faino, Michael F Seidl, Erwin Datema, Grardy C M van den Berg, Antoine Janssen, Alexander H J Wittenberg, Bart P H J Thomma.
Abstract
Next-generation sequencing (NGS) technologies have increased the scalability, speed, and resolution of genomic sequencing and, thus, have revolutionized genomic studies. However, eukaryotic genome sequencing initiatives typically yield considerably fragmented genome assemblies. Here, we assessed various state-of-the-art sequencing and assembly strategies in order to produce a contiguous and complete eukaryotic genome assembly, focusing on the filamentous fungus Verticillium dahliae. Compared with Illumina-based assemblies of the V. dahliae genome, hybrid assemblies that also include PacBio-generated long reads establish superior contiguity. Intriguingly, provided that sufficient sequence depth is reached, assemblies solely based on PacBio reads outperform hybrid assemblies and even result in fully assembled chromosomes. Furthermore, the addition of optical map data allowed us to produce a gapless and complete V. dahliae genome assembly of the expected eight chromosomes from telomere to telomere. Consequently, we can now study genomic regions that were previously not assembled or poorly assembled, including regions that are populated by repetitive sequences, such as transposons, allowing us to fully appreciate an organism's biological complexity. Our data show that a combination of PacBio-generated long reads and optical mapping can be used to generate complete and gapless assemblies of fungal genomes.

IMPORTANCE: Studying whole-genome sequences has become an important aspect of biological research. The advent of next-generation sequencing (NGS) technologies has nowadays brought genomic science within reach of most research laboratories, including those that study nonmodel organisms. However, most genome sequencing initiatives typically yield (highly) fragmented genome assemblies. Nevertheless, considerable relevant information related to genome structure and evolution is likely hidden in those nonassembled regions. Here, we investigated a diverse set of strategies to obtain gapless genome assemblies, using the genome of a typical ascomycete fungus as the template. Eventually, we were able to show that a combination of PacBio-generated long reads and optical mapping yields a gapless telomere-to-telomere genome assembly, allowing in-depth genome analyses to facilitate functional studies into an organism's biology.
Year: 2015 PMID: 26286689 PMCID: PMC4542186 DOI: 10.1128/mBio.00936-15
Source DB: PubMed Journal: mBio Impact factor: 7.867
Statistics for the various genome assemblies of Verticillium dahliae strain JR2
| Source of data, metric | VerdaJR2v1.5 | VerdaJR2v1.5 | VerdaJR2v1.5 | SPAdes 3.0 | SPAdes 3.0 | A5 | A5 | VDAG_JR2v4.0 |
|---|---|---|---|---|---|---|---|---|
| PE library | X | X | X | X | X | X | X | |
| MP library | X | X | X | X | X | X | X | |
| PacBio P4-C2 | X | X | X | X | ||||
| PacBio P5-C3 | X | X | X | X | ||||
| Optical map | X | X | X | X | ||||
| Contig metrics | ||||||||
| No. of contigs (≥0 bp) | 4,514 | 515 | 533 | 2,335 | 2,463 | 1,013 | 1,195 | 8 |
| No. of contigs (≥1,000 bp) | 3,262 | 324 | 338 | 1,579 | 1,570 | 415 | 419 | 8 |
| Longest contig (bp) | 99,830 | 2,178,335 | 2,251,806 | 227,026 | 543,223 | 2,308,962 | 2,304,878 | 9,275,483 |
| Total length (bp) | 33,523,879 | 35,178,480 | 35,520,228 | 34,886,730 | 35,110,786 | 36,248,419 | 36,213,197 | 36,150,287 |
| N50 (bp) | 17,466 | 662,062 | 649,303 | 46,943 | 50,038 | 598,861 | 512,741 | 4,168,633 |
| No. of Ns/100 kb | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Scaffold metrics | ||||||||
| No. of scaffolds (≥0 bp) | 9 | 9 | 9 | 1,334 | 1,510 | 606 | 599 | 8 |
| No. of scaffolds (≥1,000 bp) | 9 | 9 | 9 | 659 | 702 | 298 | 285 | 8 |
| Longest scaffold | 9,141,183 | 9,180,926 | 9,215,033 | 1,263,620 | 1,066,798 | 2,912,494 | 2,937,429 | 9,275,483 |
| Total length | 37,537,096 | 38,353,192 | 38,703,526 | 34,780,691 | 34,969,668 | 36,425,691 | 36,548,884 | 36,150,287 |
| N50 (bp) | 4,064,734 | 4,091,407 | 4,087,047 | 350,075 | 306,662 | 781,486 | 808,031 | 4,168,633 |
| No. of Ns/100 kb | 10,691.34 | 8,277.57 | 8,224.83 | 109.82 | 100.64 | 652.57 | 1,082.47 | 0 |
Ns, unknown nucleotides.
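The contig and scaffold rows above are standard assembly summary statistics. As a minimal sketch (not the authors' pipeline; the function name `assembly_stats` and the toy sequences are hypothetical), the following Python shows how the number of sequences, the longest sequence, the total length, N50, and the number of Ns per 100 kb could be computed from a list of sequences. N50 is used here in its usual sense: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length.

```python
def assembly_stats(seqs, min_len=0):
    """Return basic summary statistics for contig or scaffold sequences."""
    # Keep only sequences at or above the length cutoff (e.g. 0 or 1,000 bp).
    kept = [s for s in seqs if len(s) >= min_len]
    lengths = sorted((len(s) for s in kept), reverse=True)
    total = sum(lengths)

    # N50: walk from the longest sequence down until half the total is covered.
    n50, running = 0, 0
    for length in lengths:
        running += length
        if 2 * running >= total:
            n50 = length
            break

    # Unknown nucleotides (N), normalized per 100 kb of sequence.
    n_count = sum(s.upper().count("N") for s in kept)
    return {
        "count": len(lengths),
        "longest": max(lengths, default=0),
        "total_length": total,
        "n50": n50,
        "ns_per_100kb": 100_000 * n_count / total if total else 0.0,
    }


# Toy input (hypothetical): three sequences of 100, 50, and 20 bp.
toy = ["ACGT" * 25, "ACGTNNNNNN" * 5, "AC" * 10]
stats = assembly_stats(toy)
print(stats)
```

Under these assumptions, the "≥0 bp" and "≥1,000 bp" rows would correspond to calling the function with `min_len=0` and `min_len=1000`, respectively.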
Verticillium dahliae strain JR2 assemblies based on different amounts of PacBio long reads
| Metric | SMRT.4 | SMRT.6 | SMRT.8 | SMRT.10 | SMRT.12 | SMRT.14 | SMRT.18 |
|---|---|---|---|---|---|---|---|
| No. of SMRT cells used | 4 | 6 | 8 | 10 | 12 | 14 | 18 |
| Coverage | 46.4× | 72.1× | 96.1× | 120× | 143.7× | 167.1× | 248× |
| Contig metrics with HGAP | |||||||
| No. of contigs ≥0 bp | 246 | 45 | 49 | 41 | 41 | 34 | 35 |
| No. of contigs ≥1,000 bp | 246 | 45 | 49 | 41 | 41 | 34 | 35 |
| Largest contig (bp) | 957,075 | 8,522,516 | 9,231,296 | 5,501,910 | 5,496,487 | 5,496,279 | 9,231,737 |
| Total length (bp) | 36,019,813 | 36,390,158 | 36,496,508 | 36,298,366 | 36,439,073 | 36,407,468 | 36,472,797 |
| N50 (bp) | 271,892 | 3,085,282 | 4,168,662 | 2,910,158 | 3,361,205 | 3,361,230 | 3,399,208 |
| No. of Ns/100 kb | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Contig metrics with MHAP | |||||||
| No. of contigs ≥0 bp | 95 | 132 | 77 | 55 | 50 | 47 | 48 |
| No. of contigs ≥1,000 bp | 95 | 132 | 77 | 55 | 50 | 47 | 48 |
| Largest contig (bp) | 3,355,274 | 1,305,931 | 3,215,544 | 3,814,805 | 5,484,470 | 4,267,138 | 5,486,069 |
| Total length (bp) | 36,785,530 | 35,897,226 | 36,545,821 | 36,523,003 | 36,589,360 | 36,382,335 | 36,635,502 |
| N50 (bp) | 1,816,396 | 567,445 | 1,167,265 | 2,569,351 | 3,068,688 | 2,330,944 | 3,358,862 |
| No. of Ns/100 kb | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
HGAP, MHAP: software used for genome assemblies.
Ns, unknown nucleotides.
FIG 1 A gapless genome assembly of Verticillium dahliae strain JR2. (A) Alignment of the gapless genome assembly of V. dahliae strain JR2 with the optical map displays nearly perfect agreement. Represented in blue with blue lines is the genome assembly, while the optical map is represented in red with blue lines. Each blue line represents an NheI restriction site. Black lines represent alignments between the assembly and the optical map. Indicated in black and green boxes are length discrepancies between the assembly and the optical map due to the collapse of repetitive elements in the assembly. (B) Data for rRNA gene cluster located on the distal end of chromosome 1 (see green box in panel A). Local high read coverage (>1,000×) compared with the genomewide average of 15× coverage indicates the collapse of this region during the genome assembly. A single repeat unit of the V. dahliae rRNA gene is displayed, and its location in the assembly is marked.
Summary of transposable elements and other types of repetitive elements identified in V. dahliae strains JR2 and VdLs17
| Type of element | JR2: No. in genome | JR2: Coverage (bp) | JR2: Coverage (%) | VdLs17: No. in genome | VdLs17: Coverage (bp) | VdLs17: Coverage (%) |
|---|---|---|---|---|---|---|
| TEs | ||||||
| SINEs | 15 | 665 | 0 | 16 | 811 | 0 |
| LINEs | 324 | 124,209 | 0.34 | 311 | 167,003 | 0.46 |
| LTR elements | 1,071 | 2,428,443 | 6.72 | 1,006 | 2,430,766 | 6.76 |
| DNA elements | 269 | 114,336 | 0.32 | 272 | 150,768 | 0.42 |
| Unclassified | 1,557 | 1,286,043 | 3.56 | 1,351 | 1,098,298 | 3.05 |
| Summary of TEs | | 3,953,696 | 10.94 | | 3,847,646 | 10.7 |
| Other repeats | ||||||
| Small RNA | 125 | 22,942 | 0.06 | 114 | 18,050 | 0.05 |
| Satellites | 74 | 7,336 | 0.02 | 71 | 7,003 | 0.02 |
| Simple repeats | 10,210 | 423,998 | 1.17 | 10,208 | 424,918 | 1.18 |
| Low complexity | 832 | 40,802 | 0.11 | 835 | 40,340 | 0.11 |
| Total amt of repeats | | 4,446,122 | 12.3 | | 4,336,001 | 12.05 |
TEs, transposable elements; SINEs, short interspersed elements; LINEs, long interspersed elements; LTR, long terminal repeat.
Coverage (bp), total bases matching the element.
Coverage (%), % of genome covered by the element.
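The "Summary of TEs" row can be checked against the per-class rows with simple arithmetic. The sketch below is an illustration, not the authors' procedure: it sums the coverage (bp) values of the five TE classes and expresses the sum as a percentage of genome size. The genome sizes are an assumption, taken from the total lengths of the PacBio plus optical mapping assemblies reported in the assembly tables (36,150,287 bp for JR2 and 35,973,870 bp for VdLs17); the names `te_bp`, `genome_bp`, and `te_summary` are hypothetical.

```python
# Coverage (bp) per TE class, copied from the table above.
te_bp = {
    "JR2": {
        "SINEs": 665,
        "LINEs": 124_209,
        "LTR elements": 2_428_443,
        "DNA elements": 114_336,
        "Unclassified": 1_286_043,
    },
    "VdLs17": {
        "SINEs": 811,
        "LINEs": 167_003,
        "LTR elements": 2_430_766,
        "DNA elements": 150_768,
        "Unclassified": 1_098_298,
    },
}

# Assumption: genome size = total length of the finished assembly of each strain.
genome_bp = {"JR2": 36_150_287, "VdLs17": 35_973_870}


def te_summary(strain):
    """Return (total TE bases, % of genome covered, rounded to 2 decimals)."""
    total = sum(te_bp[strain].values())
    return total, round(100 * total / genome_bp[strain], 2)


for strain in te_bp:
    total, pct = te_summary(strain)
    print(f"{strain}: {total:,} bp in TEs, {pct}% of genome")
```

With these inputs the per-class sums reproduce the summary row (3,953,696 bp and 3,847,646 bp), which suggests that the TE classes are reported without overlap; the same is not exactly true for the "Total amt of repeats" row, whose value is slightly lower than the plain sum of all listed categories.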
Statistics for the various genome assemblies of Verticillium dahliae strain JR2 generated using Quast software and using VDAG_JR2v4.0 as the reference genome
| Assembly, software used | Data sources marked (of PE library, MP library, PacBio P4-C2, PacBio P5-C3, optical map) | Avg PacBio coverage | SNPs (no.) | Misassemblies (no.) | Complete genes (no.) | Partial genes (no.) | Missing genes (no.) |
|---|---|---|---|---|---|---|---|
| VerdaJR2v1.5 | X X X | | 1,146 | 493 | 10,855 | 570 | 5 |
| VerdaJR2v1.5 | X X X | 46 | 988 | 544 | 11,271 | 158 | 1 |
| VerdaJR2v1.5 | X X X | 19 | 1,018 | 511 | 11,266 | 160 | 4 |
| SPAdes 3.0 | X X | 46 | 636 | 251 | 10,982 | 447 | 1 |
| SPAdes 3.0 | X X | 19 | 775 | 223 | 10,959 | 463 | 8 |
| A5 | X X | 46 | 661 | 369 | 11,349 | 78 | 3 |
| A5 | X X | 19 | 768 | 365 | 11,344 | 81 | 5 |
| HGAP | | | | | | | |
| SMRT.4 | X | 46 | 1,089 | 21 | 11,149 | 167 | 114 |
| SMRT.6 | X | 72 | 283 | 16 | 11,429 | 1 | 0 |
| SMRT.8 | X | 96 | 175 | 12 | 11,429 | 1 | 0 |
| SMRT.10 | X | 120 | 160 | 18 | 11,424 | 1 | 5 |
| SMRT.12 | X | 143 | 146 | 21 | 11,429 | 1 | 0 |
| SMRT.14 | X | 167 | 75 | 5 | 11,430 | 0 | 0 |
| SMRT.18 | X X | 248 | 41 | 13 | 11,430 | 0 | 0 |
| VDAG_JR2v4.0 | X X X | 248 | 113 | 0 | 11,430 | 0 | 0 |
| MHAP | | | | | | | |
| SMRT.4 | X | 46 | 10,270 | 15 | 11,410 | 16 | 4 |
| SMRT.6 | X | 72 | 15,683 | 12 | 11,256 | 81 | 93 |
| SMRT.8 | X | 96 | 11,256 | 15 | 11,397 | 25 | 8 |
| SMRT.10 | X | 120 | 8,521 | 12 | 11,411 | 13 | 6 |
| SMRT.12 | X | 143 | 7,579 | 13 | 11,425 | 4 | 1 |
| SMRT.14 | X | 167 | 6,535 | 14 | 11,405 | 12 | 13 |
HGAP 2.0 (12) and MHAP 1.5b1 (28) were used to generate assemblies.
Avg PacBio coverage, average genome coverage of the PacBio data set used for the assembly.
SNPs, single-nucleotide polymorphisms.
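The complete, partial, and missing gene counts in each row of the table above add up to the same gene set, so they can be turned into a completeness percentage that is easier to compare across assemblies. The following is a minimal sketch of that arithmetic for four rows copied from the table; the row labels and the helper name `completeness` are hypothetical, and this is not a description of how Quast itself reports these values.

```python
# (complete, partial, missing) gene counts copied from selected rows above.
rows = {
    "VerdaJR2v1.5 (row without PacBio coverage)": (10_855, 570, 5),
    "HGAP SMRT.4": (11_149, 167, 114),
    "HGAP SMRT.14": (11_430, 0, 0),
    "MHAP SMRT.6": (11_256, 81, 93),
}


def completeness(complete, partial, missing):
    """Return (total genes assessed, % of genes recovered completely)."""
    total = complete + partial + missing
    return total, 100 * complete / total


summary = {name: completeness(*counts) for name, counts in rows.items()}
for name, (total, pct) in summary.items():
    print(f"{name}: {total} genes assessed, {pct:.2f}% complete")
```

Each of these rows sums to 11,430 genes, the number recovered completely in the reference assembly VDAG_JR2v4.0.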
Statistics of Verticillium dahliae strain VdLs17 genome assemblies
| Metric | Sanger sequencing + optical mapping | PacBio only | PacBio + optical mapping |
|---|---|---|---|
| Contig metrics | |||
| No. of contigs: | |||
| ≥0 bp | 1,562 | 119 | 8 |
| ≥1,000 bp | 1,525 | 118 | 8 |
| Largest contig (bp) | 216,594 | 2,545,020 | 6,210,300 |
| Total length (bp) | 32,902,348 | 36,288,516 | 35,973,870 |
| N50 (bp) | 43,309 | 711,766 | 5,894,008 |
| No. of Ns/100 kb | 0 | 0 | 0 |
| Scaffold metrics | |||
| No. of contigs: | |||
| ≥0 bp | 9 | 119 | 8 |
| ≥1,000 bp | 9 | 118 | 8 |
| Largest contig (bp) | 6,048,892 | 2,545,020 | 6,210,300 |
| Total length (bp) | 36,874,636 | 36,288,516 | 35,973,870 |
| N50 (bp) | 4,180,501 | 711,766 | 5,894,008 |
| No. of Ns/100 kb | 10,770.33 | 0 | 0 |
Sanger sequencing + optical mapping, genome assembly described in reference 23.
Ns, unknown nucleotides.