Marcela Uliano-Silva, Francesco Dondero, Thomas Dan Otto, Igor Costa, Nicholas Costa Barroso Lima, Juliana Alves Americo, Camila Junqueira Mazzoni, Francisco Prosdocimi, Mauro de Freitas Rebelo.
Abstract
Background: For more than 25 years, the golden mussel, Limnoperna fortunei, has aggressively invaded South American freshwaters, travelling more than 5000 km upstream across 5 countries. Along the way, it has outcompeted native species and caused economic harm to aquaculture, hydroelectric power generation, and ship transit. We sequenced the complete genome of the golden mussel to understand the molecular basis of its invasiveness and to search for ways to control it.

Findings: We assembled the 1.6-Gb genome into 20 548 scaffolds with an N50 length of 312 Kb using a hybrid and hierarchical assembly strategy that combines short and long DNA reads with transcriptomes. A total of 60 717 coding genes were inferred from a customized transcriptome-trained AUGUSTUS run. We also compared the predicted protein set with those of complete molluscan genomes, revealing an expansion of protein-binding domains in L. fortunei.

Conclusions: We built one of the best bivalve genome assemblies available with a cost-effective approach that combines Illumina paired-end, mate-paired, and PacBio long reads. We expect that the continuous and careful annotation of L. fortunei's genome will contribute to the investigation of bivalve genetics, evolution, and invasiveness, as well as to the development of biotechnological tools for aquatic pest control.
Year: 2018 PMID: 29267857 PMCID: PMC5836269 DOI: 10.1093/gigascience/gix128
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
DNA reads produced for L. fortunei genome assembly
| Library technology | Reads insert size | Pairs | Raw data: number of reads | Raw data: number of bases | Trimmed data*: number of reads | Trimmed data*: number of bases |
|---|---|---|---|---|---|---|
| Illumina Nextera | Paired-end – 180 bp | R1 | 209 542 721 | 21 060 365 702 | 209 036 571 | 21 001 101 404 |
| Illumina Nextera | Paired-end – 180 bp | R2 | 209 542 721 | 21 049 308 698 | 209 036 571 | 20 991 650 008 |
| Illumina Nextera | Paired-end – 500 bp | R1 | 153 948 902 | 15 472 966 961 | 153 482 290 | 15 423 123 500 |
| Illumina Nextera | Paired-end – 500 bp | R2 | 153 948 902 | 15 462 883 157 | 153 482 290 | 15 414 813 589 |
| Illumina Nextera | Mate-paired 3 – 12 Kb | R1 | 178 392 944 | 18 017 687 344 | 58 157 933 | 5 822 572 152 |
| Illumina Nextera | Mate-paired 3 – 12 Kb | R2 | 178 392 944 | 18 017 687 344 | 58 157 933 | 5 811 310 412 |
| Pacific Biosciences | P4C – 10/SMTRC | Subreads | 1 663 730 | 11 171 487 485 | | |
*Trimmomatic parameters for Illumina reads: ILLUMINACLIP:NexteraPE-PE.fa:2:30:10 SLIDINGWINDOW:4:2 LEADING:10 TRAILING:10 CROP:101 HEADCROP:0 MINLEN:80.
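The effect of trimming on each Illumina library can be read off the table by dividing trimmed read counts by raw read counts. The following is a minimal sketch of that arithmetic; the read counts are copied from the table, while the dictionary layout and the function name `retention_percent` are illustrative choices, not part of the original analysis.

```python
# Read counts per library (R1 only; R2 has identical counts in the table).
# Keys and structure are illustrative; the numbers come from the table above.
READ_COUNTS = {
    "Paired-end 180 bp": {"raw": 209_542_721, "trimmed": 209_036_571},
    "Paired-end 500 bp": {"raw": 153_948_902, "trimmed": 153_482_290},
    "Mate-paired 3-12 Kb": {"raw": 178_392_944, "trimmed": 58_157_933},
}


def retention_percent(raw: int, trimmed: int) -> float:
    """Percentage of reads that survive trimming."""
    if raw <= 0:
        raise ValueError("raw read count must be positive")
    return 100.0 * trimmed / raw


if __name__ == "__main__":
    for library, counts in READ_COUNTS.items():
        pct = retention_percent(counts["raw"], counts["trimmed"])
        print(f"{library}: {pct:.2f}% of reads retained")
```

Computed this way, the paired-end libraries keep well over 99% of their reads, whereas the mate-paired library keeps roughly a third.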
Trinity assembled transcripts used in the assembly and annotation of L. fortunei genome
| Sample | Pooled tissues | Number of reads prior to assembly | Number of Trinity transcripts | Number of Trinity genes | Average contig length | GC, % |
|---|---|---|---|---|---|---|
| Mussel 1 | Gills, mantle, digestive gland, foot | 406 589 144 | 433 197 | 303 172 | 854 | 34 |
| Mussel 2 | Gills, mantle, digestive gland, foot | 376 577 660 | 435 054 | 298 117 | 824 | 34 |
| Mussel 3 | Gills, mantle, digestive gland, foot | 334 316 116 | 499 392 | 351 649 | 844 | 34 |
Figure 1: K-mer distribution of Limnoperna fortunei Illumina DNA reads (Table 1).
Figure 2: Hierarchical assembly strategy employed for the golden mussel genome. Trimmed Illumina reads were assembled into contigs with the Sparse Assembler algorithm (Step 1). Illumina contigs and PacBio reads were then used to build scaffolds with the DBG2OLC assembler, which anchors Illumina contigs to erroneous PacBio subreads, correcting them and building longer scaffolds (Step 2). Scaffolds were then joined using transcriptome evidence with L_RNA_scaffolder (Step 3). Final scaffolds were corrected by re-aligning all Illumina DNA and RNA-seq reads back to them and calling the consensus with Pilon (Step 4). The bioinformatics software used in each step is shown in bold. Red blocks indicate PacBio errors, represented by insertions and/or deletions, which are found in approximately 12% of PacBio subreads.
Assembly statistics for Limnoperna fortunei’s genome
| Parameter | Value |
|---|---|
| Estimated genome size by kmer analysis, Gb | 1.6 |
| Total size of assembled genome, Gb | 1.673 |
| Number of scaffolds | 20 548 |
| Number of contigs | 61 093 |
| Scaffold N50, Kb | 312 |
| Maximum scaffold length, Mb | 2.72 |
| Percentage of genome in scaffolds >50 Kb | 82.55 |
| Masked percentage of total genome | 33 |
| Mapping percentage of Illumina reads back to scaffolds | 91 |
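Two of the statistics above, scaffold N50 and the percentage of the genome in scaffolds longer than a cutoff, are simple functions of the list of scaffold lengths. The sketch below uses the standard definition of N50 (the length of the scaffold at which the running sum of lengths, sorted from longest to shortest, first reaches half of the total assembly size). The function names and the toy length list are made up for illustration; they are not the paper's data or code.

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Length L such that scaffolds of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("need at least one scaffold length")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-empty input")


def percent_in_scaffolds_longer_than(lengths: Sequence[int], cutoff: int) -> float:
    """Percentage of assembled bases found in scaffolds strictly longer than cutoff."""
    total = sum(lengths)
    if total <= 0:
        raise ValueError("total assembly size must be positive")
    return 100.0 * sum(l for l in lengths if l > cutoff) / total


if __name__ == "__main__":
    toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # hypothetical lengths, not real scaffolds
    print(n50(toy_lengths))
    print(percent_in_scaffolds_longer_than(toy_lengths, 5))
```

For a real assembly, the lengths would be read from the scaffold FASTA file, and the cutoff would be 50 000 to match the ">50 Kb" row.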
Summary statistics of BUSCO analysis for L. fortunei genome run for Metazoans
| Categories | Number of genes | Percentage |
|---|---|---|
| Total BUSCO groups searched | 978 | – |
| Complete BUSCOs | 801 | 81.9 |
| Complete and single-copy BUSCOs | 769 | 78.62 |
| Complete and duplicated BUSCOs | 32 | 3.27 |
| Fragmented BUSCOs | 72 | 7.36 |
| Missing BUSCOs | 105 | 10.73 |
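The percentages in this table follow directly from the counts: each category is divided by the 978 BUSCO groups searched, and the complete, fragmented, and missing categories together account for all groups. A minimal sketch of that check, with the counts copied from the table and all variable and function names chosen here for illustration:

```python
TOTAL_GROUPS = 978

# Counts copied from the table above.
BUSCO_COUNTS = {
    "complete_single_copy": 769,
    "complete_duplicated": 32,
    "fragmented": 72,
    "missing": 105,
}


def busco_percentages(counts: dict, total: int) -> dict:
    """Return each category as a percentage of the total groups searched."""
    if sum(counts.values()) != total:
        raise ValueError("categories do not add up to the total searched")
    result = {name: 100.0 * n / total for name, n in counts.items()}
    # "Complete" is the sum of the single-copy and duplicated categories.
    complete = counts["complete_single_copy"] + counts["complete_duplicated"]
    result["complete"] = 100.0 * complete / total
    return result


if __name__ == "__main__":
    for name, pct in busco_percentages(BUSCO_COUNTS, TOTAL_GROUPS).items():
        print(f"{name}: {pct:.2f}%")
```

Recomputed values agree with the table to within 0.01 percentage points; the small differences in the single-copy and missing rows are consistent with the table truncating rather than rounding the second decimal.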
Comparison of genome assembly statistics for molluscan genomes
| Parameter | | | | | | | | | | | L. fortunei |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Estimated genome size | 1.65 Gb | 359.5 Mb | 1.8 Gb | 1.37 Gb | 1.43 Gb | 545 Mb | 1.15 Gb | 1.6 Gb | 1.64 Gb | 2.38 Gb | 1.6 Gb |
| Number of scaffolds | 80 032 | 4475 | 8766 | 223 851 | 82 731 | 11 969 | 7997 | 1 746 447 | 65 664 | 74 575 | 20 548 |
| Total size of scaffolds | 1 865 475 499 | 359 512 207 | 715 791 924 | 2 561 070 351 | 987 685 017 | 558 601 156 | 915 721 316 | 1 599 211 957 | 1 659 280 971 | 2 629 649 654 | 1 673 125 894 |
| Longest scaffold | 2 207 537 | 9 386 848 | 1 784 514 | 572 939 | 7 498 238 | 1 964 558 | 5 897 787 | 67 529 | 2 790 175 | 715 382 | 2 720 304 |
| Shortest scaffold | 854 | 1000 | 5001 | 500 | 200 | 100 | 1807 | 100 | 292 | 205 | 558 |
| Number of scaffolds >1 K nt (%) | 79 923 (99.9) | 4471 (99.9) | 8766 (100) | 138 771 (61.9) | 16 004 (19.3) | 5788 (48.4) | 7997 (100) | 393 685 (22.5) | 38 704 (58.9) | 44 921 (60.2) | 20 547 (100) |
| Number of scaffolds >1 M nt (%) | 67 (0.1) | 98 (2.2) | 27 (0.3) | 0 (0.0) | 248 (0.3) | 60 (0.5) | 27 (0.3) | 0 (0.0) | 164 (0.2) | 0 (0) | 95 (0.5) |
| Mean scaffold size | 23 309 | 80 338 | 81 655 | 11 441 | 11 939 | 46 671 | 114 508 | 916 | 25 269 | 35 262 | 81 425 |
| Median scaffold size | 1697 | 3622 | 13 763 | 1327 | 362 | 824 | 14 683 | 258 | 1284 | 13 722 | 22 134 |
| N50 scaffold length | 200 099 | 1 870 055 | 264 327 | 48 447 | 803 631 | 401 319 | 345 846 | 2651 | 343 373 | 100 161 | 312 020 |
| Sequencing coverage | ×322 | ×8.87 | ×11 | ×39.7 | ×297 | ×155 | ×234 | ×32 | ×319 | ×209.5 | ×60 |
| Sequencing Technology | Illumina + PacBio | Sanger | Sanger | Illumina | Illumina | Illumina | Illumina + BACs | Illumina | Illumina | Illumina | Illumina + PacBio |
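Sequencing coverage is conventionally the total number of sequenced bases divided by the genome size. The source does not state how the ×60 figure for L. fortunei was derived, so the following is only a plausibility check under one assumption: that coverage was computed from the trimmed Illumina bases plus the PacBio subread bases listed in the DNA reads table, over the 1.6-Gb k-mer estimate. Variable and function names are illustrative.

```python
# Base counts copied from the DNA reads table (trimmed Illumina data, R1 and R2).
TRIMMED_ILLUMINA_BASES = [
    21_001_101_404, 20_991_650_008,  # paired-end, 180 bp insert
    15_423_123_500, 15_414_813_589,  # paired-end, 500 bp insert
    5_822_572_152, 5_811_310_412,    # mate-paired, 3-12 Kb
]
PACBIO_SUBREAD_BASES = 11_171_487_485
ESTIMATED_GENOME_SIZE = 1_600_000_000  # 1.6 Gb, from k-mer analysis


def coverage(total_bases: int, genome_size: int) -> float:
    """Average sequencing depth: sequenced bases per genome base."""
    if genome_size <= 0:
        raise ValueError("genome size must be positive")
    return total_bases / genome_size


if __name__ == "__main__":
    total = sum(TRIMMED_ILLUMINA_BASES) + PACBIO_SUBREAD_BASES
    print(f"total bases: {total}")
    print(f"approximate coverage: x{coverage(total, ESTIMATED_GENOME_SIZE):.1f}")
```

Under this assumption the result lands just below ×60, which is close to the value reported in the table.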
Summary of gene annotation against various databases for L. fortunei whole-genome-predicted genes
| Parameter | Value |
|---|---|
| Total number of genes | 60 717 |
| Total number of exons | 220 058 |
| Total number of proteins | 60 717 |
| Average protein size, aa | 304 |
| Number of proteins with BLAST hits* in UniProt | 26 198 |
| Number of proteins with BLAST hits* in NCBI NR (no hits in UniProt) | 14 810 |
| Number of proteins with HMMER hits* in Pfam-A | 24 513 |
| Number of proteins with KO assigned by KEGG | 8387 |
| Number of proteins with BLAST hits* in EggNOG | 36 868 |
*All considered hits had an e-value of 1e-05 or lower.
Figure 3: (A) Gene families assigned with OrthoMCL for the total set of proteins predicted from 5 mussel genome projects. Species names are shown outside the Venn diagram; below each name are the number of proteins, the number of clustered proteins, and the number of clusters. (B) Phylogeny of the concatenated dataset using 44 single-copy orthologs extracted from 10 molluscan genomes. The VT model was estimated to be the best-fitting substitution model with ProtTest 3.4.2. We reconstructed the phylogeny using PhyML with 100 bootstrap replicates.
Figure 4: Gene family representation analysis in the L. fortunei genome. (A) Pfam hierarchical clustering, heatmap. Features were selected according to a model based on the Poisson cumulative distribution of each Pfam count in the golden mussel genome vs the normalized average values found in the other 9 molluscan genomes (Bonferroni correction, P ≤ 0.05). Transposable elements were included in the analysis. Colors depict the log2 ratio between the Pfam counts found in each single genome and the corresponding mean values. The hierarchical clustering used the average dot product for the data matrix and complete linkage for branching. Abbreviations: Bp: Bathymodiolus platifrons; Cg: Crassostrea gigas; Hd: Haliotis discus hannai; Lf: L. fortunei; Lg: Lottia gigantea; Mg: Mytilus galloprovincialis; Mp: Modiolus philippinarum; Pf: Pinctada fucata; Py: Patinopecten yessoensis; Rp: Ruditapes philippinarum. (B) Gene Ontology analysis of expanded gene families, semantic scatter plot. Shown are cluster representatives after redundancy reduction, placed in a 2-dimensional space by applying multidimensional scaling to a matrix of semantic similarities of GO terms. Color indicates the GO enrichment level (legend in upper left-hand corner); size indicates the relative frequency of each term in the UniProt database (larger bubbles represent less specific processes).
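The selection rule in panel A can be illustrated with a small Poisson tail test. The caption does not give the exact model or normalization, so this is a sketch under assumptions: the mean count in the other genomes is taken as the Poisson rate, the p-value is the upper tail probability of the observed count, and Bonferroni correction multiplies it by the number of Pfam families tested. The function names and the example numbers are hypothetical.

```python
import math


def poisson_sf(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam), by direct summation of the CDF.

    Adequate for a sketch; very small tail probabilities lose precision
    and very large rates underflow in exp(-lam).
    """
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def is_expanded(count: int, mean_other: float, n_tests: int, alpha: float = 0.05) -> bool:
    """True if the count exceeds the mean significantly after Bonferroni correction."""
    adjusted = min(1.0, poisson_sf(count, mean_other) * n_tests)
    return adjusted <= alpha


def log2_ratio(count: float, mean: float) -> float:
    """Heatmap value: log2 of count over mean (both must be positive)."""
    return math.log2(count / mean)


if __name__ == "__main__":
    # Hypothetical family: 30 copies in one genome vs a mean of 5 elsewhere,
    # tested alongside 1000 families in total.
    print(is_expanded(30, 5.0, n_tests=1000), log2_ratio(30, 5.0))
```

A library routine such as SciPy's Poisson survival function would be the more robust choice for real data; the hand-written sum is used here only to keep the sketch free of dependencies.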