Bharat Bhusan Patnaik1,2, Tae Hun Wang1, Se Won Kang1, Hee-Ju Hwang1, So Young Park1, Eun Bi Park1, Jong Min Chung1, Dae Kwon Song1, Changmu Kim3, Soonok Kim3, Jun Sang Lee4, Yeon Soo Han5, Hong Seog Park6, Yong Seok Lee1.
Abstract
BACKGROUND: The freshwater mussel Cristaria plicata (Bivalvia: Eulamellibranchia: Unionidae) is an economically important species in molluscan aquaculture because of its use in pearl farming. The species has been listed as endangered in South Korea owing to the loss of natural habitats caused by anthropogenic activities. The declining population and the lack of genomic information on the species are of concern to environmentalists and conservationists. In this study, we conducted de novo transcriptome sequencing and annotation analysis of C. plicata using Illumina HiSeq 2500 next-generation sequencing (NGS) technology, the Trinity assembler, and bioinformatics databases to prepare a sustainable resource for the identification of candidate genes involved in immunity, defense, and reproduction.
Year: 2016 PMID: 26872384 PMCID: PMC4752248 DOI: 10.1371/journal.pone.0148622
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Schematic representation of transcriptome assembly and annotation.
A C. plicata visceral mass transcriptome was obtained using an Illumina HiSeq 2500 NGS platform. The raw reads were preprocessed using the Sickle software tool (quality, 20; length, 40) and Fastq_filter software to obtain clean reads. Trinity assembly (k-mer, 25; minimum contig length, 200) and TGICL clustering (identity, 94%; overlap, 30 bp) generated 374,794 unigenes. The unigenes were used for functional annotation against the PANM, UniGene, COG, GO, and KEGG databases and for structural annotation by SSR detection.
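To make the read-cleaning thresholds concrete, the following is a minimal Python sketch of quality and length filtering with the values quoted in the caption (quality 20, length 40). It is not Sickle's actual algorithm: the function names, the Phred+33 encoding, the window size of 5, and the 3'-only trimming rule are all assumptions made for illustration.

```python
def phred33(qual):
    """Decode a Phred+33 quality string into integer scores (encoding assumed)."""
    return [ord(c) - 33 for c in qual]


def trim_read(seq, qual, min_q=20, min_len=40, window=5):
    """Trim the 3' end where the windowed mean quality first drops below min_q.

    Returns the trimmed (seq, qual) pair, or None when the surviving read
    is shorter than min_len. Simplified illustration, not Sickle itself.
    """
    scores = phred33(qual)
    if not scores:
        return None
    cut = len(scores)
    # Slide a fixed window from the 5' end and stop at the first bad window.
    for start in range(0, max(len(scores) - window + 1, 1)):
        win = scores[start:start + window]
        if sum(win) / len(win) < min_q:
            cut = start
            break
    if cut < min_len:
        return None
    return seq[:cut], qual[:cut]
```

A read whose tail degrades is shortened, and a read that falls below 40 bases after trimming is discarded, which is the general behavior the two thresholds describe.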
Transcriptome assembly statistics of C. plicata visceral mass using the Trinity analysis.
| Category | Description | Statistics |
|---|---|---|
| Raw reads | Number of bases | 36,055,225,584 |
| Raw reads | Mean length (bp) | 126 |
| Clean reads | Number of bases | 34,909,374,303 |
| Clean reads | Mean length (bp) | 124.1 |
| Clean reads | N50 length (bp) | 126 |
| Clean reads | High quality reads (%) | 98.31 (sequences), 96.82 (bases) |
| Contigs | Number of bases | 331,930,879 |
| Contigs | Mean length of contig (bp) | 731.2 |
| Contigs | N50 length of contig (bp) | 1,254 |
| Contigs | GC% of contig | 36.62 |
| Contigs | Largest contig (bp) | 36,440 |
| Contigs | No. of large contigs (≥500 bp) | 151,695 |
| Unigenes | Number of bases | 276,264,683 |
| Unigenes | Mean length of unigene (bp) | 737.1 |
| Unigenes | N50 length of unigene (bp) | 1,262 |
| Unigenes | GC% of unigene | 36.47 |
| Unigenes | Length ranges (bp) | 212–68,788 |
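The mean length, N50, and GC% reported above are standard assembly summaries. As a rough illustration of how such values are usually derived, here is a small Python sketch; the function names are hypothetical, and the N50 definition used (the length at which the cumulative sum of lengths, sorted longest first, reaches half of the total bases) is the common convention, assumed rather than confirmed to be the exact one used in the study.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


def assembly_stats(sequences):
    """Summarize a list of nucleotide strings (contigs or unigenes)."""
    lengths = [len(s) for s in sequences]
    total = sum(lengths)
    gc = sum(s.upper().count("G") + s.upper().count("C") for s in sequences)
    return {
        "n_sequences": len(sequences),
        "total_bases": total,
        "mean_length": total / len(sequences),
        "n50": n50(lengths),
        "gc_percent": 100.0 * gc / total,
    }
```

Because N50 weights long sequences more heavily than the mean does, an N50 well above the mean length, as in the contig and unigene rows, is typical of an assembly with many short sequences and a smaller number of long ones.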
Fig 2. Summary of C. plicata visceral mass unigene (≥ 200 bp) sequences after Trinity assembly.
Summary of molluscan transcriptomics in the last three years using Next Generation Sequencing (NGS) platforms.
R- raw reads, C- clean reads;
| Species (Tissue) | NGS platform | Reads (n) | Assembler | Contigs (n) | Contigs (mean length in bp) | Contigs (N50 length in bp) | Others | SRA Accession | Sequencing objectives | Reference |
|---|---|---|---|---|---|---|---|---|---|---|
| Cristaria plicata (visceral mass) | Illumina HiSeq2500 | R- 286,152,584 C- 281,322,837 | Trinity | 453,931 | 731.2 | 1,254 | 374,794 | SRP062467 PRJNA293023 | Endangered species | This study |
| | Roche 454GS FLX | R- 1,595,855 C- 1,405,240 | GS Denovo Assembler v2.6 | 41,472 | 958 | 1,571 | — | SRR949615 | Genetic selection | [ |
| | Illumina HiSeq2000 | R- 133,156,930 | Trinity | 185,546 | 74 | 363 | — | PRJNA252890 | Molecular correlates of behavior | [ |
| | Illumina GAIIx | R- 150,302,926 C- 127,019,711 | Trinity | 254,506 | 669 | 1,632 | 87,408 | SRP043705 | Molecular basis of immune defense | [ |
| | Illumina | R- 57,059,700 C- 52,770,704 | Trinity | 21,193 | 771 | 1,010 | — | SRP011280.2 | Molecular markers for toxin accumulation | [ |
| | Illumina Genome Analyzer IIx | C- 544,272,542 | Trinity | 233,257 | 1,264 | 2,868 | — | SRP043984 | Molecular aspects of toxicity responses | [ |
| | Illumina HiSeq2000 | C- 27,000,000 | Trinity | 75,024 | 505 | 597 | 13,507 | PRJNA210944 | Physiological response to environment stress | [ |
| | Illumina HiSeq2000 | R- 49,500,748 | Trinity | 108,704 | 407 | — | — | — | Development of EST-SSR markers | [ |
| | Illumina HiSeq2000 | R- 61,000,000 | Trinity | 115,211 | 453 | 492 | — | SRP041635 | Thermal adaptations | [ |
| | Illumina HiSeq2000 | R- 1,335,123,074 | SOAPdenovo | 26,064 | 1,011 | — | — | SRP040427 | Shell production | [ |
| | Illumina HiSeq2000 | R- 216,444,674 | CLC Genomics Workbench | 73,752 | 502.6 | — | — | SRR1009240, SRR1009241, SRR1009242 | Immunity | [ |
| | Illumina GAIIx | R- 67,087,130 C- 62,250,336 | Velvet & Oases | 134,684 | 791.06 | 1,264 | — | SRA062349 | Exploration as environmental test organism | [ |
| | Illumina HiSeq2000 | R- 59,918,916 C- 55,122,820 | Trinity | 188,629 | 249 | 306 | — | — | Shell production | [ |
| | Roche 454GS FLX | C- 859,313 | Newbler2.7 | 16,323 | 1,376 | — | — | GALB01000000 | Molecular markers for growth | [ |
| | Illumina GAIIx | R- 112,265,296 | Velvet & Oases | 217,190 | 436 | — | — | SRR653778 | Molecular response to heavy metals | [ |
a Number of contigs no shorter than 500 bp;
b Number of unigenes.
# For a summary of molluscan transcriptome analyses prior to 2013, please refer to [25].
Functional annotation of unigenes of the Cristaria plicata transcriptome.
| Databases | All annotated transcripts | ≤300 bp | 300–1000 bp | ≥1000 bp |
|---|---|---|---|---|
| PANM | 79,960 | 14,480 | 30,748 | 34,732 |
| UniGene | 13,934 | 1,848 | 3,721 | 8,365 |
| COG | 40,196 | 4,763 | 11,445 | 23,988 |
| GO | 23,246 | 2,593 | 5,625 | 15,028 |
| KEGG | 4,776 | 483 | 927 | 3,366 |
| All annotated | 84,274 | 15,700 | 33,108 | 35,466 |
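The annotation counts become easier to interpret as fractions of the 374,794 assembled unigenes. The Python sketch below copies the values from the table above, checks that the three length bins add up to each database total, and converts each total to a percentage; the variable and function names are illustrative only and are not part of the study's pipeline.

```python
TOTAL_UNIGENES = 374_794  # unigene count reported for the assembly

# database: (all annotated, <=300 bp, 300-1000 bp, >=1000 bp), from the table
ANNOTATED = {
    "PANM": (79_960, 14_480, 30_748, 34_732),
    "UniGene": (13_934, 1_848, 3_721, 8_365),
    "COG": (40_196, 4_763, 11_445, 23_988),
    "GO": (23_246, 2_593, 5_625, 15_028),
    "KEGG": (4_776, 483, 927, 3_366),
    "All annotated": (84_274, 15_700, 33_108, 35_466),
}


def annotation_rates(table, total):
    """Return the percentage of unigenes annotated per database."""
    rates = {}
    for db, (all_hits, short, mid, long_) in table.items():
        # Sanity check: the length bins must sum to the reported total.
        if short + mid + long_ != all_hits:
            raise ValueError(f"length bins do not sum to total for {db}")
        rates[db] = 100.0 * all_hits / total
    return rates


if __name__ == "__main__":
    for db, pct in annotation_rates(ANNOTATED, TOTAL_UNIGENES).items():
        print(f"{db}: {pct:.2f}% of unigenes annotated")
```

Computed this way, fewer than a quarter of the unigenes receive any annotation, and PANM accounts for most of the annotated set.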
Fig 3. Statistical summary of homology search of assembled unigenes against the PANM protein database.
(A) Score distribution of BLAST hits for each unigene with a cutoff E-value of 1E-5. (B) E-value distribution of BLAST hits for each unigene with a cutoff E-value of 1E-5. (C) Identity distribution of the top BLAST hits for each unigene. (D) Similarity distribution of the top BLAST hits for each unigene. (E) Lengths of unigenes compared with the presence or absence of BLAST hits.
Fig 4. Top-hit species distribution of C. plicata visceral mass unigenes against the PANM database (a custom-devised, curatable database of mollusc, arthropod, and nematode protein sequences downloaded from the NCBI nr database).
An E-value cutoff of 1E-5 was applied, and the hit distribution shows high homology to known genome sequences of the phylum Mollusca.
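As an illustration of how an E-value cutoff and a top-hit selection of this kind can be applied, the Python sketch below filters BLAST results at 1E-5 and keeps the highest-scoring hit per unigene. It assumes the standard 12-column tabular BLAST output (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score); the source does not state which output format was used, and the function name and sample identifiers are hypothetical.

```python
def best_hits(lines, max_evalue=1e-5):
    """Keep, for each query, the hit with the highest bit score at or below max_evalue."""
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        identity = float(fields[2])
        evalue = float(fields[10])
        bitscore = float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query]["bitscore"]:
            best[query] = {
                "subject": subject,
                "identity": identity,
                "evalue": evalue,
                "bitscore": bitscore,
            }
    return best


if __name__ == "__main__":
    # Made-up rows for demonstration only.
    sample = [
        "uni1\tprotA\t85.0\t200\t30\t0\t1\t600\t1\t200\t1e-50\t350",
        "uni1\tprotB\t60.0\t150\t60\t2\t1\t450\t5\t155\t1e-20\t120",
        "uni2\tprotC\t35.0\t50\t30\t3\t1\t150\t1\t50\t0.01\t30",
    ]
    for query, hit in best_hits(sample).items():
        print(query, hit["subject"], hit["evalue"])
```

Unigenes whose only hits exceed the cutoff, such as the third sample row, drop out entirely, which corresponds to the "absence of BLAST hits" class in panel E of Fig 3.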
List of the top 40 InterPro domains in the C. plicata transcriptome.
| InterPro domain | Description | Unigenes |
|---|---|---|
| IPR015880 | Zinc finger, C2H2-like domain | 1374 |
| IPR027417 | P-loop containing nucleoside triphosphate hydrolase domain | 1126 |
| IPR000477 | Reverse transcriptase domain | 637 |
| IPR012337 | Ribonuclease H-like domain | 496 |
| IPR011042 | Six-bladed beta-propeller, TolB-like domain | 491 |
| IPR013783 | Immunoglobulin-like fold domain | 465 |
| IPR001841 | Zinc finger, RING-type domain | 405 |
| IPR002110 | Ankyrin repeat | 404 |
| IPR005135 | Endonuclease/Exonuclease/phosphatase domain | 388 |
| IPR000315 | B-box-type zinc finger domain | 350 |
| IPR000276 | G protein-coupled receptor, rhodopsin-like family | 320 |
| IPR002290 | Serine/threonine/dual specificity protein kinase, catalytic domain | 272 |
| IPR001370 | BIR repeat | 229 |
| IPR003599 | Immunoglobulin subtype domain | 223 |
| IPR024810 | Mab-21 domain | 219 |
| IPR000504 | RNA recognition motif domain | 214 |
| IPR000242 | PTP type protein phosphatase domain | 214 |
| IPR002035 | von Willebrand factor, type A domain | 213 |
| IPR003615 | HNH nuclease domain | 211 |
| IPR027124 | SWR1-complex protein 5/Craniofacial development protein family | 209 |
| IPR002048 | EF-hand domain | 209 |
| IPR013083 | Zinc finger, RING/FYVE/PHD-type domain | 196 |
| IPR000742 | EGF-like domain | 195 |
| IPR008979 | Galactose-binding domain-like | 193 |
| IPR000157 | Toll/interleukin-1 receptor homology (TIR) domain | 193 |
| IPR002126 | Cadherin domain | 189 |
| IPR003591 | Leucine-rich repeat, typical subtype repeat | 179 |
| IPR000436 | Sushi/SCR/CCP domain | 175 |
| IPR001680 | WD40 repeat | 173 |
| IPR003593 | AAA+ ATPase domain | 172 |
| IPR001304 | C-type lectin domain | 169 |
| IPR011029 | Death-like domain | 163 |
| IPR013087 | Zinc finger C2H2-type/integrase DNA-binding domain | 159 |
| IPR011701 | Major facilitator superfamily | 158 |
| IPR015943 | WD40/YVTN repeat-like-containing domain | 150 |
| IPR001128 | Cytochrome P450 family | 150 |
| IPR019734 | Tetratricopeptide repeat | 145 |
| IPR013126 | Heat shock protein 70 family | 145 |
| IPR000210 | BTB/POZ domain | 136 |
Fig 5. Clusters of orthologous groups (COG) classification of unigenes.
Out of 374,794 assembled unigenes, 40,196 sequences had a COG classification from among the 25 COG categories (excluding the multi category).
Fig 6. Functional annotation of C. plicata visceral mass assembled sequences based on gene ontology (GO) categorization.
(A) An overlap model of the annotated unigenes assigned to biological processes, molecular functions, and cellular components based on GO function. (B) Numbers of unigenes assigned to GO term annotations.
Fig 7. GO classifications of the C. plicata transcriptome at level 2.
GO analyses were performed for three major classification categories: (A) biological processes, (B) cellular components, and (C) molecular functions.
Fig 8. Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis.
The C. plicata visceral mass unigenes were assigned to KEGG pathways (inner circle). The total number of enzymes ascribed within each KEGG pathway is shown in the outer circle. Each pathway is represented by a different color.
Summary of simple sequence repeat (SSR) types based on the number of repeat units.
| Repeat numbers | di- | tri- | tetra- | penta- | hexa- | Total |
|---|---|---|---|---|---|---|
| 5 | 0 | 1936 | 526 | 17 | 0 | 2479 |
| 6 | 2541 | 713 | 380 | 1 | 0 | 3635 |
| 7 | 1648 | 417 | 44 | 2 | 0 | 2111 |
| 8 | 1211 | 401 | 66 | 1 | 0 | 1679 |
| 9 | 751 | 104 | 60 | 2 | 0 | 917 |
| 10 | 652 | 117 | 50 | 4 | 0 | 823 |
| 11 | 853 | 102 | 29 | 2 | 0 | 986 |
| 12 | 618 | 79 | 37 | 1 | 0 | 735 |
| 13 | 189 | 57 | 44 | 2 | 0 | 292 |
| 14 | 263 | 86 | 32 | 3 | 0 | 384 |
| 15 | 237 | 46 | 44 | 1 | 1 | 329 |
| 16 | 220 | 48 | 45 | 0 | 0 | 313 |
| 17 | 201 | 52 | 39 | 0 | 0 | 292 |
| 18 | 204 | 38 | 32 | 2 | 0 | 276 |
| 19 | 127 | 43 | 19 | 0 | 0 | 189 |
| 20 | 157 | 46 | 13 | 0 | 0 | 216 |
| ≥21 | 1430 | 96 | 67 | 2 | 0 | 1595 |
| Total | 11302 | 4381 | 1527 | 40 | 1 | 17251 |
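SSR detection of the kind summarized above can be sketched with regular expressions. The Python example below is a minimal illustration, not the tool used in the study: the function name is hypothetical, and the minimum repeat counts (6 for di-, 5 for tri- to hexa-nucleotide motifs) are assumptions inferred from the lowest populated rows of the table.

```python
import re

# motif length -> minimum number of tandem repeats (assumed thresholds)
MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq):
    """Return (motif, repeat_count, start) for perfect tandem repeats in seq."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # One captured motif followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AAA", "ATAT"),
            # so each repeat is reported only under its shortest motif length.
            if any(
                unit_len % k == 0 and motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
            ):
                continue
            hits.append((motif, len(match.group(0)) // unit_len, match.start()))
    return hits


if __name__ == "__main__":
    demo = "GG" + "AT" * 7 + "GG" + "AGC" * 5 + "TT"
    for motif, count, start in find_ssrs(demo):
        print(motif, count, start)
```

Tallying the returned hits by motif length and repeat count yields a table with the same layout as the one above. The reported motif depends on where the match starts, so rotations such as AGC and CAG may need to be grouped before counting motif types.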
Fig 9. Frequency distribution of simple sequence repeats (SSRs) based on motif types found in C. plicata visceral mass unigene sequences.