Gina V Filloramo, Bruce A Curtis, Emma Blanche, John M Archibald.
Abstract
BACKGROUND: The marine diatoms Thalassiosira pseudonana and Phaeodactylum tricornutum are valuable model organisms for exploring the evolution, diversity and ecology of this important algal group. Their reference genomes, published in 2004 and 2008, respectively, were the product of traditional Sanger sequencing. In the case of T. pseudonana, optical restriction site mapping was employed to further clarify and contextualize chromosome-level scaffolds. While both genomes are considered highly accurate and reasonably contiguous, they still contain many unresolved regions and unordered/unlinked scaffolds.
Keywords: Bionano optical mapping; Diatom genomics; Long-terminal repeat retrotransposons; Oxford Nanopore long-read sequencing; Phaeodactylum tricornutum; Thalassiosira pseudonana
Year: 2021 PMID: 34030633 PMCID: PMC8147415 DOI: 10.1186/s12864-021-07666-3
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Raw read data summary for unfiltered, Albacore “passed” and filtered Oxford Nanopore long-read sequencing datasets for Thalassiosira pseudonana and Phaeodactylum tricornutum. The unfiltered data include all sequence reads, including passed (q-score>7) and failed (q-score<7) reads as determined by Albacore. The Albacore “pass” data include all reads with a quality-score >7. The filtered datasets for T. pseudonana and P. tricornutum included reads with read length ≥30 kb and ≥20 kb, respectively
| Statistic | Unfiltered Data | Albacore “pass” Data | Filtered Data |
|---|---|---|---|
| *Phaeodactylum tricornutum* | | | |
| Total bases (Gbp) | 8.2 | 7.5 | 2.7 |
| No. of reads | 986,604 | 820,187 | 84,445 |
| Mean read length (bp) | 8,311.1 | 9,144.7 | 31,973.9 |
| Mean read quality | 8.5 | 9.2 | 9.6 |
| Read length N50 (bp) | 18,756 | 19,261 | 32,648 |
| Estimated genome coverage | ~300x | ~273x | ~100x |
| Percentage of reads mapped to JGI reference | 76.8% | 87.8% | 92.6% |
| Average percent identity of reads to JGI reference | 73.7% | | |
| *Thalassiosira pseudonana* | | | |
| Total bases (Gbp) | 7.5 | 7.0 | 1.8 |
| No. of reads | 701,596 | 580,845 | 46,708 |
| Mean read length (bp) | 10,611.8 | 12,029.7 | 37,942.3 |
| Mean read quality | 9.9 | 10.8 | 10.9 |
| Read length N50 (bp) | 20,088 | 20,514 | 37,303 |
| Estimated genome coverage | ~230x | ~215x | ~50x |
| Percentage of reads mapped to JGI reference | 76.6% | 89.1% | 93.5% |
| Average percent identity of reads to JGI reference | 71.5% | | |
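The "Estimated genome coverage" rows can be sanity-checked with simple arithmetic. The following is a minimal sketch, assuming that coverage is total sequenced bases divided by genome size, and assuming the genome sizes are the reference assembly lengths of 27.4 Mbp and 32.4 Mbp reported in the assembly statistics table. The source does not state which formula or genome size was used, and the function name is illustrative.

```python
def estimated_coverage(total_bases_gbp: float, genome_size_mbp: float) -> float:
    """Return fold coverage: sequenced bases divided by genome size."""
    return (total_bases_gbp * 1e9) / (genome_size_mbp * 1e6)


# Total bases (Gbp) per dataset, copied from the two blocks of the table above.
first_block = {"unfiltered": 8.2, "pass": 7.5, "filtered": 2.7}
second_block = {"unfiltered": 7.5, "pass": 7.0, "filtered": 1.8}

# Assumed genome sizes (Mbp): reference assembly lengths, not stated in the caption.
for label, block, size_mbp in (
    ("first block", first_block, 27.4),
    ("second block", second_block, 32.4),
):
    for dataset, gbp in block.items():
        print(f"{label} {dataset}: ~{estimated_coverage(gbp, size_mbp):.0f}x")
```

Under these assumptions the results land close to the tabulated estimates. For example, 8.2 Gbp over 27.4 Mbp is roughly 299x against the reported ~300x.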
Assembly statistics for the original reference genomes and de novo long-read derived genomes for Thalassiosira pseudonana and Phaeodactylum tricornutum. The mitochondrial and organellar genomes for both diatoms were assembled by Canu and Flye and were excluded from assembly statistical analyses. All Canu and Flye assemblies were corrected first by long-reads using Racon and Nanopolish followed by Illumina short-reads using Pilon (see Methods and Materials for more details). The BUSCO odb9 eukaryotic database (303 genes) was used to assess the different assemblies. The BUSCO scores are reported for the total gene completeness (C), complete single-copy (S), complete duplicated (D) and fragmented (F) orthologs.
| Assembly | Total length (Mbp) | Read depth coverage | No. contigs | Largest contig (Mbp) | Contig N50 (Mbp) | Contig L50 | No. scaffolds | Largest scaffold (Mbp) | Scaffold N50 (Mbp) | Scaffold L50 | G+C content (%) | % identity to reference | BUSCO | ALE score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Phaeodactylum tricornutum* | | | | | | | | | | | | | | |
| Reference (Bowler et al. 2008) | 27.4 | 9.6x | 179 | n/a | 0.42 | 20 | 88ᵃ | 2.53 | 0.95 | 11 | 48.8 | n/a | C:82.5% F:5.9% | n/a |
| Canu | 57.0 | 40x | 291 | 2.51 | 0.25 | 43 | n/a | n/a | n/a | n/a | 48.7 | 99.3 | C:85.4% F:3.0% | -734,959,595 |
| Flye | 33.5 | 72x | 196 | 1.66 | 0.36 | 24 | n/a | n/a | n/a | n/a | 48.7 | 99.1 | C:80.9% F:4.6% | -781,367,384 |
| Canu-Bionano hybrid | 66.8 | n/a | n/a | n/a | n/a | n/a | 219ᵇ | 2.78 | 1.06 | n/a | n/a | n/a | n/a | n/a |
| *Thalassiosira pseudonana* | | | | | | | | | | | | | | |
| Reference (Armbrust et al. 2004, Bowler et al. 2008) | 32.4 | n/a | 115 | n/a | 1.27 | 8 | 64ᶜ | 3.04 | 1.99 | 7 | 46.9 | n/a | C:81.2% F:5.3% | n/a |
| Canu | 47.3 | 40x | 222 | 2.77 | 0.98 | 14 | n/a | n/a | n/a | n/a | 46.9 | 99.4 | C:79.2% F:6.6% | -1,238,092,187 |
| Flye | 33.8 | 48x | 52 | 2.76 | 1.38 | 8 | n/a | n/a | n/a | n/a | 47.0 | 99.4 | C:80.6% F:5.6% | -1,047,071,217 |
ᵃ The number of scaffolds reflects the 33 chromosome-level scaffolds and 55 unplaced, smaller contigs.
ᵇ The number of scaffolds for the Canu-Bionano hybrid includes both the 49 scaffolds that were assembled from the 138 long-read contigs that met the minimum length requirement (≥150 kb) for Bionano optical map anchoring and the 155 unanchored contigs <150 kb.
ᶜ The number of scaffolds reflects the 27 chromosome-level scaffolds and 37 unplaced, smaller contigs.
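The contig and scaffold N50 and L50 columns follow the usual definitions: N50 is the length of the sequence at which the cumulative length, counted from the longest sequence down, first reaches half of the total assembly length, and L50 is the number of sequences needed to get there. The following is a minimal sketch of that calculation; the function name and the toy lengths are illustrative and are not taken from the assemblies in the table.

```python
from typing import Iterable, Tuple


def n50_l50(lengths: Iterable[int]) -> Tuple[int, int]:
    """Return (N50, L50) for a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # The first sequence that pushes the running total to half of the
        # assembly length defines both statistics.
        if running >= half_total:
            return length, count
    raise AssertionError("unreachable for non-empty input")


# Toy example with hypothetical contig lengths (in kb).
toy_contigs = [40, 30, 20, 10]
n50, l50 = n50_l50(toy_contigs)
print(f"N50 = {n50} kb, L50 = {l50}")
```

A higher N50 paired with a lower L50 indicates a more contiguous assembly, which is how the contig columns in the table are typically read.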
Fig. 1 Genome completeness using single-copy orthologs (BUSCO eukaryota_odb9 database) was assessed for the Thalassiosira pseudonana (a) and Phaeodactylum tricornutum (b) reference genomes as well as the unpolished and polished versions of the Canu and Flye de novo assemblies for both diatom species. Note that the BUSCO analysis was performed after each step of the polishing pipeline, which included two rounds of Racon, followed by Nanopolish and, finally, multiple iterations of Pilon.
Fig. 2 Assemblytics output plots showing six classes of structural variants between the Thalassiosira pseudonana reference genome and the final polished de novo long-read Flye assembly (a) and the Phaeodactylum tricornutum reference genome and the final polished de novo long-read Canu assembly (b). Dot plots comparing the Thalassiosira pseudonana reference chromosome-level scaffolds and unanchored contigs to the Flye assembly contigs (c) and the Phaeodactylum tricornutum chromosome-level scaffolds to the Canu assembly contigs (d) were also generated by Assemblytics.
Summary of complete ribosomal operon (rRNA) statistics for Phaeodactylum tricornutum and Thalassiosira pseudonana
| Statistic | Phaeodactylum tricornutum | Thalassiosira pseudonana |
|---|---|---|
| Number of complete tandem rRNA copies per contig | 2 (contig 2792-chr7); 5 (contig74-chr13) | 5 (contig3-chr17) |
| Average complete rRNA length | 5,935.6 bp | 5,826.8 bp |
| Average length of sequence between rRNA copies on same contig | 15,611 bp (contig 2792-chr7); 8,058 bp (contig74-chr13) | 4,521.8 bp |
| Percent identity between copies on same contig | 99.9 (contig 2792-chr7); 99.9 [99.8-100] (contig74-chr13) | 99.6-99.7 |
| Percent identity between copies on different contigs | 99.5 [96.5-100] | n/a |
| Average Illumina read depth at rRNA loci (avg read depth across genome) | 82.5x (66x) | 638x (148x) |
Orthologous group (OG) statistics for eight diatom genomes
| Diatom species | Total protein coding genes | Proteins classified into OGs | Proteins not classified into OGs | OGs shared among all diatom species | Proteins in diatom-shared OGs |
|---|---|---|---|---|---|
| | 18,111 | 14,312 | 3,799 | 3,731 | 6,741 |
| | 20,429 | 17,899 | 2,530 | 3,731 | 10,693 |
| | 19,703 | 14,123 | 5,580 | 3,731 | 6,266 |
| | 12,039 | 10,675 | 1,364 | 3,731 | 5,726 |
| | 12,178 | 10,278 | 1,900 | 3,731 | 5,886 |
| | 27,337 | 17,403 | 9,934 | 3,731 | 9,326 |
| | 34,642 | 16,486 | 18,156 | 3,731 | 8,781 |
| | 16,491 | 13,799 | 2,692 | 3,731 | 7,512 |
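The orthologous group counts are internally related: proteins classified into OGs plus proteins not classified should equal the total number of protein coding genes. The following is a minimal sketch that checks this and derives two percentages per genome. The rows are copied from the table above in table order, since the species names are not given here; the function name and the derived percentages are illustrative and do not appear in the source.

```python
from typing import List, Tuple

# (total genes, classified into OGs, not classified, proteins in diatom-shared OGs),
# one tuple per row of the table above, in table order.
ROWS: List[Tuple[int, int, int, int]] = [
    (18111, 14312, 3799, 6741),
    (20429, 17899, 2530, 10693),
    (19703, 14123, 5580, 6266),
    (12039, 10675, 1364, 5726),
    (12178, 10278, 1900, 5886),
    (27337, 17403, 9934, 9326),
    (34642, 16486, 18156, 8781),
    (16491, 13799, 2692, 7512),
]


def summarize(total: int, classified: int, unclassified: int,
              in_shared: int) -> Tuple[float, float]:
    """Return (% of genes classified into OGs, % of genes in diatom-shared OGs)."""
    if classified + unclassified != total:
        raise ValueError("classified + unclassified does not equal the total")
    return 100 * classified / total, 100 * in_shared / total


summaries = [summarize(*row) for row in ROWS]
for index, (pct_classified, pct_shared) in enumerate(summaries, start=1):
    print(f"row {index}: {pct_classified:.1f}% classified, "
          f"{pct_shared:.1f}% in shared OGs")
```

Every row passes the consistency check, which also confirms how the flattened columns line up.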
Fig. 3 Full-length CoDi long-terminal repeat retrotransposon content in the Phaeodactylum tricornutum genome [11]. LTR-RT characterization is presented for the Bowler et al. reference genome (a) and Canu assembly (b). The relative abundance of full-length LTR-RTs assigned to each CoDi group is presented for the Bowler et al. reference genome (c) and our de novo Canu assembly (d). Characterization of the LTR-RT loci resolved per CoDi group is shown for the reference genome (e) and Canu assembly (f). LTR-RTs are characterized as either “previously reported loci” (i.e., loci homologous to those previously reported by Maumus et al. [14] in the reference genome), “overlooked loci” (i.e., loci homologous to those present in the reference genome but not reported) or “novel loci” (i.e., loci detected in our long-read assembly but without a homologous insertion in the reference genome).