Huimin Cai1, Qiye Li2,3,4, Xiaodong Fang5, Ji Li2,3,4, Nicholas E Curtis6, Andreas Altenburger7, Tomoko Shibata8, Mingji Feng5, Taro Maeda8, Julie A Schwartz9, Shuji Shigenobu8,10, Nina Lundholm7, Tomoaki Nishiyama11, Huanming Yang2,12, Mitsuyasu Hasebe8,10, Shuaicheng Li1, Sidney K Pierce9,13, Jian Wang2,12.
Abstract
Elysia chlorotica, a sacoglossan sea slug found off the East Coast of the United States, is well-known for its ability to sequester chloroplasts from its algal prey and survive by photosynthesis for up to 12 months in the absence of food supply. Here we present a draft genome assembly of E. chlorotica that was generated using a hybrid assembly strategy with Illumina short reads and PacBio long reads. The genome assembly comprised 9,989 scaffolds, with a total length of 557 Mb and a scaffold N50 of 442 kb. BUSCO assessment indicated that 93.3% of the expected metazoan genes were completely present in the genome assembly. Annotation of the E. chlorotica genome assembly identified 176 Mb (32.6%) of repetitive sequences and a total of 24,980 protein-coding genes. We anticipate that the annotated draft genome assembly of the E. chlorotica sea slug will promote the investigation of sacoglossan genetics, evolution, and particularly, the genetic signatures accounting for the long-term functioning of algal chloroplasts in an animal.
Year: 2019 PMID: 30778257 PMCID: PMC6380222 DOI: 10.1038/sdata.2019.22
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Figure 1. A photograph of an adult Elysia chlorotica (image courtesy of Patrick Krug).
Statistics of DNA reads produced for the E. chlorotica genome in this study.
| Platform | Insert size (bp) | No. of libraries | Read length (bp) | Raw data: total bases (Gb) | Raw data: sequencing coverage (X) | Raw data: physical coverage (X) | Clean data: total bases (Gb) | Clean data: sequencing coverage (X) | Clean data: physical coverage (X) |
|---|---|---|---|---|---|---|---|---|---|
| Illumina | 170 | 2 | 100 | 35.55 | 61.83 | 52.55 | 28.67 | 49.86 | 41.56 |
| Illumina | 500 | 2 | 100 | 32.44 | 56.42 | 141.06 | 18.12 | 31.51 | 84.04 |
| Illumina | 800 | 2 | 100 | 34.51 | 60.02 | 240.04 | 21.01 | 36.54 | 158.32 |
| Illumina | 2,500 | 2 | 49;90 | 76.60 | 133.22 | 1,752.62 | 43.77 | 76.12 | 1,301.05 |
| Illumina | 5,000 | 2 | 49;90 | 83.67 | 145.51 | 4,771.57 | 46.30 | 80.52 | 2,767.88 |
| Illumina | 10,000 | 1 | 49;90 | 33.97 | 59.08 | 3,841.34 | 18.45 | 32.09 | 2,174.84 |
| Illumina | Total | 11 | — | 296.73 | 516.05 | 10,799.18 | 176.32 | 306.64 | 6,527.68 |
| PacBio | 6,000 | 3 | 1,224 | 9.45 | 16.43 | — | 5.36 | 9.32 | — |

Note: Each of the five Illumina mate-pair libraries (2.5 kb × 2, 5 kb × 2 and 10 kb × 1) was run on two lanes with read lengths of 49 bp and 90 bp, respectively. Coverage calculation was based on the estimated genome size of 575 Mb according to
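
The coverage columns of the read statistics can be approximately reproduced from the total bases and the 575 Mb genome size estimate. The sketch below assumes the conventional definitions, which the source does not spell out: sequencing coverage is total bases divided by genome size, and physical coverage is the number of read pairs multiplied by the insert size, divided by genome size. The function names are illustrative only, and only the three short-insert libraries with a single read length are covered.

```python
GENOME_SIZE_BP = 575e6  # estimated genome size quoted in the table note


def sequencing_coverage(total_bases_gb, genome_size_bp=GENOME_SIZE_BP):
    """Sequenced bases per genome position (assumed definition)."""
    return total_bases_gb * 1e9 / genome_size_bp


def physical_coverage(total_bases_gb, read_length_bp, insert_size_bp,
                      genome_size_bp=GENOME_SIZE_BP):
    """Bases spanned by whole fragments per genome position (assumed definition)."""
    n_pairs = total_bases_gb * 1e9 / (2 * read_length_bp)  # paired-end reads
    return n_pairs * insert_size_bp / genome_size_bp


# Raw data of the short-insert libraries: (insert size bp, read length bp, total bases Gb)
SHORT_INSERT_RAW = [(170, 100, 35.55), (500, 100, 32.44), (800, 100, 34.51)]

for insert, read_len, gb in SHORT_INSERT_RAW:
    seq_cov = sequencing_coverage(gb)
    phys_cov = physical_coverage(gb, read_len, insert)
    print(f"{insert} bp: sequencing {seq_cov:.2f}X, physical {phys_cov:.2f}X")
```

Under these assumptions the sequencing coverage matches the tabulated values, and the physical coverage agrees to within a few hundredths, which is consistent with the total bases being rounded to two decimals in the table.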
Estimation of genome size and heterozygosity of E. chlorotica by k-mer analysis.
| k | Total number of k-mers | Minimum coverage (X) | Number of erroneous k-mers | Homozygous peak | Estimated genome size (Mb) | Estimated heterozygosity (%) |
|---|---|---|---|---|---|---|
| 17 | 51,187,863,592 | 13 | 1,410,585,877 | 86 | 579 | 3.59 |
| 19 | 49,735,232,800 | 11 | 1,804,614,643 | 84 | 571 | 3.93 |
| 21 | 48,282,601,880 | 11 | 2,010,206,114 | 80 | 578 | 3.90 |
| 23 | 46,829,970,960 | 11 | 2,152,586,758 | 78 | 573 | 3.79 |
| 25 | 45,377,340,040 | 10 | 2,235,304,391 | 75 | 575 | 3.69 |
| 27 | 43,924,709,120 | 10 | 2,327,591,479 | 72 | 578 | 3.57 |
| 29 | 42,472,078,200 | 9 | 2,370,847,012 | 70 | 573 | 3.47 |
| 31 | 41,019,447,280 | 9 | 2,433,307,151 | 67 | 576 | 3.36 |
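
The genome size column of the k-mer table can be recomputed from the other columns. The sketch below assumes the commonly used relation, genome size = (total k-mers − erroneous k-mers) / homozygous peak depth. The source does not state this formula, but it reproduces every tabulated estimate after rounding to the nearest Mb. The function and variable names are illustrative.

```python
# (k, total k-mers, erroneous k-mers, homozygous peak depth, reported genome size in Mb)
KMER_TABLE = [
    (17, 51_187_863_592, 1_410_585_877, 86, 579),
    (19, 49_735_232_800, 1_804_614_643, 84, 571),
    (21, 48_282_601_880, 2_010_206_114, 80, 578),
    (23, 46_829_970_960, 2_152_586_758, 78, 573),
    (25, 45_377_340_040, 2_235_304_391, 75, 575),
    (27, 43_924_709_120, 2_327_591_479, 72, 578),
    (29, 42_472_078_200, 2_370_847_012, 70, 573),
    (31, 41_019_447_280, 2_433_307_151, 67, 576),
]


def estimate_genome_size_mb(total_kmers, erroneous_kmers, homozygous_peak):
    """Genome size in Mb: error-free k-mers divided by the homozygous peak depth."""
    return (total_kmers - erroneous_kmers) / homozygous_peak / 1e6


for k, total, erroneous, peak, reported in KMER_TABLE:
    estimate = estimate_genome_size_mb(total, erroneous, peak)
    print(f"k={k}: estimated {estimate:.1f} Mb (table: {reported} Mb)")
```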
Figure 2. A 17-mer frequency distribution of E. chlorotica based on 62.8 Gb Illumina data.
The first peak at coverage 43X corresponds to the heterozygous peak. The second peak at coverage 86X corresponds to the homozygous peak.
Improvement in continuity and completeness of genome assembly generated by each of the eight assembly steps as stated in main text.
| Step | Assembly size (Mb) | Contig N50 (kb) | Scaffold N50 (kb) | Gap ratio (%) | Mapping rate (%) | Mapping rate in proper pairs (%) | Complete BUSCOs (%) | Fragmented BUSCOs (%) | Missing BUSCOs (%) |
|---|---|---|---|---|---|---|---|---|---|
| i | 776 | 1.7 | NA | 0 | 95.27 | 65.07 | 30.2 | 37.5 | 32.3 |
| ii | 575 | 1.9 | NA | 0 | 93.26 | 63.91 | 29.2 | 38.9 | 31.9 |
| iii | 469 | 4.4 | NA | 0 | 81.42 | 79.45 | 54.5 | 26.6 | 18.9 |
| iv | 535 | 5.1 | NA | 0 | 95.20 | 79.79 | 64.9 | 25.3 | 9.8 |
| v | 583 | 5.6 | 457.2 | 8.27 | 95.35 | 82.77 | 92.0 | 2.0 | 6.0 |
| vi | 584 | 27.6 | 455.6 | 3.03 | 96.39 | 84.12 | 92.8 | 1.6 | 5.6 |
| vii | 560 | 28.5 | 457.0 | 3.03 | 96.06 | 83.89 | 93.2 | 1.5 | 5.3 |
| viii | 557 | 28.5 | 442.0 | 3.04 | 95.93 | 83.87 | 93.3 | 1.4 | 5.3 |

Note: Columns 2 to 5 are assembly statistics, columns 6 and 7 are the read mapping assessment, and columns 8 to 10 are the BUSCO assessment. For the read mapping assessment, 500,000 pairs of clean reads were randomly selected from each of the six short-insert libraries, for a total of 3 M pairs of clean reads. These were aligned to each assembly with BWA-MEM (v0.7.16), and mapping rates were then counted with samtools flagstat (SAMtools v1.7). For the BUSCO assessment, the percentages of complete, fragmented and missing BUSCOs were calculated by BUSCO (v3.0.2) for all the assemblies using 978 genes that are expected to be present in all metazoans.
Comparison of assembly continuity and completeness for available mollusc genomes.
| Species | Sequencing technology | Genome coverage (X) | Assembly size (Mb) | Contig N50 (kb) | Scaffold N50 (kb) | Gap ratio (%) | Complete BUSCOs (%) | Fragmented BUSCOs (%) | Assembly Data Citation |
|---|---|---|---|---|---|---|---|---|---|
| | Illumina | 66 | 927.31 | 9.59 | 917.54 | 20.44 | 92.5 | 2.0 | 8 |
| | Illumina | 319 | 1,658.19 | 10.74 | 343.34 | 11.77 | 93.6 | 2.5 | 9 |
| | 454 | 28 | 916.39 | 7.30 | 48.06 | 1.91 | 88.9 | 4.9 | 10 |
| | Illumina + Fosmid | 100 | 557.74 | 31.24 | 401.69 | 11.81 | 95.2 | 1.1 | 11 |
| | Illumina + PacBio | 322 | 1,865.48 | 14.19 | 200.10 | 6.25 | 91.6 | 4.9 | 12 |
| | Illumina + PacBio | 60 | 1,673.22 | 32.17 | 309.12 | 0.23 | 81.9 | 7.3 | 13 |
| | Sanger | 9 | 359.51 | 93.95 | 1,870.06 | 16.86 | 95.9 | 0.9 | 14 |
| | Illumina | 209 | 2,629.56 | 13.66 | 100.16 | 4.84 | 89.8 | 5.0 | 15 |
| | Illumina | 92 | 2,338.19 | 5.53 | 475.18 | 15.13 | 90.4 | 3.6 | 16 |
| | Illumina | 297 | 987.59 | 37.58 | 803.63 | 8.10 | 94.3 | 1.3 | 17 |
| | Illumina + BACs + RAD-seq | 150 | 990.98 | 21.52 | 59,032.46 | 11.18 | 87.8 | 3.5 | 18 |
| | Illumina + PacBio + Hi-C | 60 | 440.16 | 1,072.86 | 31,531.29 | 0.02 | 95.8 | 0.7 | 19 |
| | Illumina | 72 | 909.76 | 16.26 | 578.73 | 6.42 | 93.2 | 1.5 | 20 |
| | Illumina | 300 | 788.10 | 39.54 | 804.23 | 5.27 | 91.9 | 3.8 | 21 |
| Elysia chlorotica | Illumina + PacBio | 316 | 557.48 | 28.55 | 441.95 | 3.04 | 93.3 | 1.4 | 6,7 |

Note: Sequencing technology and genome coverage were retrieved from the indicated reference or data citation for each species. Assembly size, contig N50, scaffold N50 and gap ratio were calculated with an in-house script from assemblies downloaded from NCBI or GigaDB with the indicated Data Citations. The percentages of complete and fragmented BUSCOs were calculated by BUSCO (v3.0.2) for all the assemblies using 978 genes that are expected to be present in all metazoans.
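
The continuity statistics in this comparison (assembly size, contig N50, scaffold N50 and gap ratio) were computed by the authors with an in-house script that is not reproduced here. As a rough illustration of what such a script might do, the sketch below uses the standard definitions: N50 is the length of the sequence at which the running total of lengths, sorted from longest to shortest, first reaches half of the total; contigs are obtained by splitting scaffolds at runs of N; and the gap ratio is the percentage of N bases. All function names and the toy sequences are hypothetical.

```python
import re


def n50(lengths):
    """Length L such that sequences of length >= L cover at least half of the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def contig_lengths(scaffolds):
    """Split each scaffold at runs of N and return the lengths of the pieces."""
    pieces = []
    for seq in scaffolds:
        pieces.extend(p for p in re.split(r"N+", seq.upper()) if p)
    return [len(p) for p in pieces]


def gap_ratio(scaffolds):
    """Percentage of assembly positions that are N."""
    total = sum(len(s) for s in scaffolds)
    gaps = sum(s.upper().count("N") for s in scaffolds)
    return 100 * gaps / total if total else 0.0


# Toy scaffolds, not real data
toy = ["ACGTNNNNACGTAC", "ACG"]
print("scaffold N50:", n50([len(s) for s in toy]))
print("contig N50:", n50(contig_lengths(toy)))
print(f"gap ratio: {gap_ratio(toy):.2f}%")
```

For a real assembly the sequences would be read from a FASTA file; that step is omitted to keep the sketch self-contained.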
Statistics for repetitive sequences identified in the E. chlorotica genome assembly according to detection method and biological category.
| Tool (according to method) | Total repeat length (bp) | % of assembly | Category (according to category) | Total repeat length (bp) | % of assembly |
|---|---|---|---|---|---|
| RepeatMasker | 51,434,719 | 9.52 | DNA | 33,515,133 | 6.20 |
| RepeatProteinMask | 12,318,674 | 2.28 | LINE | 30,286,412 | 5.60 |
| RepeatModeler | 127,879,238 | 23.66 | SINE | 19,423,541 | 3.59 |
| Tandem Repeats Finder | 55,758,776 | 10.32 | LTR | 14,375,566 | 2.66 |
| Combined | 176,039,101 | 32.57 | Tandem repeats | 55,758,776 | 10.32 |
Summary of protein-coding gene annotation for the E. chlorotica genome assembly.
| Statistic | Value |
|---|---|
| Total number of protein-coding genes | 24,980 |
| Gene space (exon + intron; Mb) | 233.5 (41.9% of assembly) |
| Mean gene size (bp) | 9,634 |
| Mean CDS length (bp) | 1,344 |
| Exon space (Mb) | 33.2 (6.0% of assembly) |
| Mean exon number per gene | 6.8 |
| Mean exon length (bp) | 198 |
| Mean intron length (bp) | 1,433 |
| % of proteins with hits in UniProtKB/Swiss-Prot | 61.3 |
| % of proteins with hits in NCBI nr database | 84.1 |
| % of proteins with signatures assigned by InterProScan | 68.8 |
| % of proteins with KO assigned by KEGG | 64.7 |
| % of proteins with functional annotation | 85.9 |
Figure 3. Mapping quality distribution of the E. chlorotica genome assembly.
The distribution was generated by Qualimap 2 (v2.2.1) with the BWA-MEM (v0.7.16) alignment of 62.8 Gb short-insert Illumina clean data as input.
Figure 4. Per-position coverage distributions of the initial ALLPATHS-LG assembly and the final genome assembly.
Per-position coverage was counted based on the BWA-MEM (v0.7.16) alignment of 62.8 Gb short-insert Illumina clean data with PCR duplicates removed by Picard (v2.10.10).
Figure 5. Fragment size distributions for all the Illumina libraries.
The distributions were generated by Picard CollectInsertSizeMetrics (v2.10.10; setting MINIMUM_PCT = 0.5) with BWA-MEM (v0.7.16) alignment of read pairs from each library as input.
Figure 6. Fragment coverage distribution of the 10 kb mate-pair library data generated by REAPR (v1.0.18).