Literature DB >> 26872384

Sequencing, De Novo Assembly, and Annotation of the Transcriptome of the Endangered Freshwater Pearl Bivalve, Cristaria plicata, Provides Novel Insights into Functional Genes and Marker Discovery.

Bharat Bhusan Patnaik^1,2, Tae Hun Wang¹, Se Won Kang¹, Hee-Ju Hwang¹, So Young Park¹, Eun Bi Park¹, Jong Min Chung¹, Dae Kwon Song¹, Changmu Kim³, Soonok Kim³, Jun Sang Lee⁴, Yeon Soo Han⁵, Hong Seog Park⁶, Yong Seok Lee¹.

Abstract

BACKGROUND: The freshwater mussel Cristaria plicata (Bivalvia: Eulamellibranchia: Unionidae), is an economically important species in molluscan aquaculture due to its use in pearl farming. The species have been listed as endangered in South Korea due to the loss of natural habitats caused by anthropogenic activities. The decreasing population and a lack of genomic information on the species is concerning for environmentalists and conservationists. In this study, we conducted a de novo transcriptome sequencing and annotation analysis of C. plicata using Illumina HiSeq 2500 next-generation sequencing (NGS) technology, the Trinity assembler, and bioinformatics databases to prepare a sustainable resource for the identification of candidate genes involved in immunity, defense, and reproduction.
RESULTS: The C. plicata transcriptome analysis included a total of 286,152,584 raw reads and 281,322,837 clean reads. The de novo assembly identified a total of 453,931 contigs and 374,794 non-redundant unigenes with average lengths of 731.2 and 737.1 bp, respectively. Furthermore, 100% coverage of C. plicata mitochondrial genes within two unigenes supported the quality of the assembler. In total, 84,274 unigenes showed homology to entries in at least one database, and 23,246 unigenes were allocated to one or more Gene Ontology (GO) terms. The most prominent GO biological process, cellular component, and molecular function categories (level 2) were cellular process, membrane, and binding, respectively. A total of 4,776 unigenes were mapped to 123 biological pathways in the KEGG database. Based on the GO terms and KEGG annotation, the unigenes were suggested to be involved in immunity, stress responses, sex-determination, and reproduction. A total of 17,251 cDNA simple sequence repeats (cSSRs) were identified from 61,141 unigenes (size of >1 kb) with the most abundant being dinucleotide repeats.
CONCLUSIONS: This dataset represents the first transcriptome analysis of the endangered mollusc, C. plicata. The transcriptome provides a comprehensive sequence resource for the conservation of genetic information in this species and enrichment of the genetic database. The development of molecular markers will assist in the genetic improvement of C. plicata.

Entities: Chemical Disease Gene Species

Mesh：

Year: 2016 PMID： 26872384 PMCID： PMC4752248 DOI： 10.1371/journal.pone.0148622

Source DB: PubMed Journal: PLoS One ISSN： 1932-6203 Impact factor: 3.240

Introduction

Cristaria plicata (Leach, 1815), a well-known “freshwater pearl bivalve”, belongs to the order Unionoida and family Unionidae under the phylum Mollusca. The species has restricted geographic distribution in Russia, Japan, Vietnam, Laos Republic, Thailand, Cambodia, and a wider presence in China, where it is used for freshwater pearl farming, medicinal purposes, and as a model for aquaculture industries [1-3]. In South Korea, C. plicata is found in the middle and lower sections of the Nakdong River, in Asan Lake in Chungcheongnam-do, and in Gosean Lake in Chungcheogbuk-do. The species has been classified as vulnerable owing to loss of natural habitats caused by river development, reduced host fish populations, and indiscriminate collection. Due to a rapid decrease in its population in recent years, C. plicata has been listed in the Korean Red List of Threatened Species under the endangered wildlife category by the Ministry of Environment and is protected by law. Under the International Union for Conservation of Nature and Natural Resources (IUCN) Red List of Threatened species, C. plicata has been assessed as data deficient with indications of localized decreases in the population [4]. Due to limited sample resources and genomic information, an exhaustive survey of novel candidate genes involved in local adaptation, the immune system, and reproduction for C. plicata is absent. The complete mitochondrial genome sequence and functional analysis of a few oxidative stress and immunity- related genes are the only available reports of C. plicata genetic information [5-9]. Although the available information increases our understanding of the phylogeny and molecular basis of the innate immune response in C. plicata, it is insufficient to address the sustainability and conservation of the species. To identify strategies for the local adaptation of the species, knowledge of the genes and pathways involved in the immune system and reproduction are required. C. plicata molecular markers, which are required for marker-assisted selection programs in aquaculture, remain poorly explored. The discovery of molecular markers generally acts as a catalyst for the study of genetic diversity and population structure. The identification of novel genomic resources using a rapid and cost-efficient approach for the conservation of C. plicata in its natural habitat is important. High-throughput next-generation sequencing (NGS) technologies, with their improved efficiency, cost benefits, and rapid data production, have been useful for understanding the mechanisms underlying the diversity of non-model organisms including American bullfrog (Rana catesbeiana) [10], polychaetes (Hermodice carunculata) [11], amphipods (Melita plumulosa) [12], green odorous frog (Odorana margaretae) [13], fish (Salmo salar) [14], shrimp (Litopenaeus vannamei) [15], and giant freshwater prawn (Macrobrachium rosenbergii) [16]. The Solexa/Illumina and 454/Roche NGS technologies have been revolutionary for understanding the rich transcriptomes of the molluscs. Many sequencing projects involving molluscs have used the Roche 454 Genome Sequencing FLX technology due to its faster production of accurate datasets. These have included transcriptome datasets for a bivalve mussel (Limnoperna fortunei) [17], small abalone (Haliotis diversicolor) [18], blue mussel (Mytilus edulis) [19], Manila clam (Ruditapes philippinarum) [20], Yesso scallop (Patinopecten yessoensis) [21], and an Antarctic bivalve (Laternula elliptica) [22], among others. Illumina sequencing technology, which is more efficient, provides shorter reads, and provides greater coverage, has been used for many molluscan transcriptome sequencing projects including blood cockle (Anadara trapezia) [23], Japanese scallop (Mizuhopecten yessoensis) [24], Eastern Oyster (Crassostrea virginica) [25], South African abalone (Haliotis midae) [26], and a snail species (Radix balthica) [27]. Furthermore, in a mollusc phylogenomics study, the matrix completeness of Illumina data was shown to be superior to that of 454 data [28]. Advances in assembly algorithms and relatively inexpensive work-flow have made Illumina sequencing the preferred choice with respect to transcriptome studies of endangered species [29,30]. De novo transcriptome analysis of C. plicata using Illumina short read sequencing and annotation of a high-quality transcriptome assembly can be used to increase our understanding of the diversity of genes. C. plicata is an endangered species and new genomic resources may serve as an important public information platform for conservation of the species in Korea and in progressive pearl culture production in the farming communities of other countries. Our transcriptome dataset provides the first characterization of expressed sequences in the pearl mollusc, C. plicata, including the identification of candidate genes involved in immunity and reproduction. Furthermore, simple sequence repeats (SSRs) generated from the transcriptome data may be useful for genetic improvement of this species.

Materials and Methods

Ethics statement

This study has been accorded permission (Ref. No. 2014–10) from the Guem River Basin Environmental Office.

Biological samples and RNA extraction

A single C. plicata specimen was used for RNA sequencing in this study due to restricted use of the species for experimental purposes as declared by the Ministry of Environment, South Korea. The specimen was collected from Sapgyoho Lake, Asan-si, Chungnam, South Korea. After transferring the specimen to the laboratory, the visceral pouch tissue was dissected and immediately placed into liquid nitrogen until RNA preparation. For RNA extraction, the snap-frozen C. plicata visceral mass tissue was homogenized using Trizol Reagent (Invitrogen) according to the manufacturer’s instructions. The purity and integrity of RNA preparations were determined using a NanoDrop-2000 spectrophotometer (Thermo, USA) and a Bioanalyzer 2100 (Agilent Technologies, USA).

Construction of the mRNA-seq library and Illumina Sequencing

An mRNA-seq library was constructed using the mRNA-seq sample preparation kit (Illumina, San Diego, CA) following the manufacturer’s instructions. Briefly, poly (A)+ mRNA was purified from total RNA samples with oligo(dT) magnetic beads and fragmented using an RNA fragmentation kit (Ambion, Austin, TX) prior to cDNA synthesis. The short mRNA fragments were reverse-transcribed into first-strand cDNA using reverse-transcriptase (Invitrogen, Carlsbad, CA) and random hexamer-primers. Second-strand cDNA synthesis was accomplished using DNA polymerase I (New England BioLabs, Ipswich, MA) and RNase H (Invitrogen). The double-stranded cDNA was end-repaired using T4 DNA polymerase (New England BioLabs), the Klenow fragment (New England BioLabs), and T4 polynucleotide kinase (New England BioLabs). The end-repaired cDNA fragments were ligated with PE Adapter Oligo Mix using T4 DNA ligase (New England BioLabs) at room temperature for 15 min. The ligated products were purified and separated by size on an agarose gel. DNA fragments of the desired size (200 ± 25 bp) were excised and, after validation, were sequenced on the Illumina HiSeq 2500 sequencing platform.

De novo assembly

Before de novo transcriptome assembly, the raw reads were cleaned by removing adaptor-only reads (recognized adaptor length ≤ 13 nucleotides and remaining adaptor-excluded length ≤ 35 nucleotides), repeated reads, and low-quality reads (Phred quality score < 20) using the Sickle software tool (http://github.com/najoshi/sickle) [31] and Fastq_filter software (part of the Galaxy toolshed) [32]. The remaining high-quality reads were assembled using the short read assembling program Trinity with 100 GB of memory and a path reinforcement distance of 50 [33]. The Trinity program (the default options and a minimum allowed length of 200 bp) first assembles reads of a certain length that overlap to form longer fragments without gaps called contigs. The total number of contigs and the mean length, N50 length, and GC% were recorded for the de novo assembly. These contigs were connected using the sequence clustering software TGICL [34] to obtain sequences that could no longer be extended on either end. Such sequences were defined as unigenes. They represent expressed assembled sequences, but are not characterized sufficiently to be represented as a gene.

Transcriptome annotation and discovery

For the annotation profile of C. plicata unigenes, we first constructed a unique reference dataset that combined protein sequence data of Arthropoda, Nematoda, and Mollusca downloaded from the Taxonomy browser of the NCBI nr database. The sequences were converted to multi-FASTA format and stored in the PANM reference database (PANM DB) [35] using the formatdb program (downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/2.2.26/). PANM DB is freely downloadable from the amino acid database BLAST web-interface of the Malacological Society of Korea (http://malacol.or.kr/blast/aminoacid.html). The assembled C. plicata unigene sequences were searched against the PANM DB reference database using the BLASTx algorithm [36] with an E-value threshold of 1.0E -5 to identify putative functional mRNA transcripts. Subsequently, the BLASTx hits of the assembled sequences against the UniGene DB were also recorded. The BLAST2GO software suite [37] was used to predict Gene Ontology (GO) terms [38], assign the assembled sequences to the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways [39], and to identify protein domains against the InterPro databases using the InterProScan tool [40]. Annotations using BLAST2GO were conducted with 1.0E -6 as the E-value hit filter, 55 as the annotation cut-off and 5 as the GO weight. No HSP-hit coverage cut-off was considered. The GO terms were classified into three categories, hence, we generated separate graphs (pie chart at level 2) for biological process, cellular component, and molecular functions. The unigenes were also predicted by a query of the NCBI Clusters of Orthologous Groups (COG) DB (BLASTx, E-value cutoff of 1E -5) [41].

Identification of candidate genes related to immune responses and reproduction

Identification of candidate C. plicata genes involved in immune responses, sex-determination, and reproduction was performed using a keyword search of our BLASTx annotation results in the PANM DB. A set of keywords, composed of a series of representative innate immunity and oxidative stress genes, was used to predict immune response genes based on annotation results. Similarly, representative sex-determination and sex-differentiation genes were used to search for reproduction- related unigenes from the annotation results. In addition, the GO terms and KEGG classification information were required to identify important candidate genes. GO terms such as “immune system process”, “response to stimulus”, “signaling” under the biological process domain and “anti-oxidant activity” under the molecular function domain were used to scan for the immune response genes. Similarly, the GO terms “reproduction” and “reproductive process” were used to select candidate genes involved in reproduction of the mollusc. In addition, the KEGG category “immune system” was also used to identify candidate immune response genes.

Microsatellite marker discovery

The assembled C. plicata unigenes were searched for simple sequence repeat (SSR) motifs using the MIcroSAtellite (MISA) software [42]. For SSR identification only >1 kb sequences of unigenes were considered. All types of SSRs, from dinucleotide to hexanucleotide repeats were examined. The searches were run with the minimum repeat number of six for dinucleotide repeats and five for all other repeat motifs.

Results and Discussion

Illumina Hi-Seq 2500 sequencing and assembly evaluation

Transcriptome information for the endangered freshwater mollusc, C. plicata, was characterized by constructing a cDNA library prepared from purified mRNA isolated from the visceral mass tissue. The Illumina Hi-Seq 2500 sequencing platform generated a total of 286,152,584 raw reads with 36,055,225,584 bases. The raw reads were filtered to remove adaptor sequences, low quality reads (reads with more than 50% of bases having a Q-value ≤ 20), and ambiguous bases. A total of 281,322,837 clean reads (Phred quality ≥ Q20) with an average length of 124.1 bases was obtained. The transcriptome sequencing, assembly, and annotation scheme used for C. plicata are depicted in Fig 1. The clean reads obtained using the Illumina Hi-Seq 2500 transcriptome sequencing platform constituted 98.31% of the raw reads. The C. plicata transcriptome sequencing and assembly statistics are shown in Table 1. The Illumina sequence data for C. plicata were deposited in the NCBI Sequence Read Archive (SRA) under accession number SRP062467. The raw, untrimmed Illumina HiSeq2500 data along with the transcriptome assembly have been included as NCBI BioProject PRJNA293023.

Fig 1

Schematic representation of transcriptome assembly and annotation.

Table 1

Transcriptome assembly statistics of C. plicata visceral mass using the Trinity analysis.

Description	Statistics
Total number of raw read sequences	286,152,584
Number of bases	36,055,225,584
Mean length (bp)	126
Total number of clean read sequences	281,322,837
Number of bases	34,909,374,303
Mean length (bp)	124.1
N50 length	126
High quality reads (%)	98.31 (sequences), 96.82 (bases)
Total number of contigs	453,931
Number of bases	331,930,879
Mean length of contig (bp)	731.2
N50 length of contig (bp)	1,254
GC% of contig	36.62
Largest contig (bp)	36,440
No. of large contigs (≥500 bp)	151,695
Total number of unigenes	374,794
Number of bases	276,264,683
Mean length of unigene (bp)	737.1
N50 length of unigene (bp)	1,262
GC% of unigene	36.47
Length ranges (bp)	212–68,788

Schematic representation of transcriptome assembly and annotation.

A C. plicata visceral mass transcriptome was obtained using an Illumina HiSeq2500 NGS platform. The raw reads obtained were preprocessed using the Sickle software tool (quality: 20, length: 40) and Fastq_filter software to obtain clean reads. Trinity assembly (K-mer, 25; minimum contig length; 200) and TGICL clustering (Identity; 94%; overlap; 30 bp) generated 374,794 unigenes. The unigenes were used for functional annotation using the PANM, Unigene, COG, GO, and KEGG databases and structural annotation for SSR detection. High-quality reads generated from the transcriptome sequencing of the C. plicata visceral mass tissue were subsequently assembled using the Trinity program because assembled and annotated genomic sequence information for the Cristaria species were not available. The Trinity de novo assembler is used for the assembly of trimmed reads with an optimal K-mer length of 25. Trinity is the first program designed specifically for de novo transcriptome assembly and utilizes a novel method for the reconstruction of transcriptomes from RNA-seq data using three sequential software modules; namely, Inchworm, Chrysalis, and Butterfly [43]. Other short-read transcriptome assemblers such as Oases [44], Trans-ABySS [45], SOAPdenovo-Trans [46], and Rnnotator [47] are available, which are essentially modifications from the genome assembly. The Trinity assembler generated a total of 453,931 contigs with 331,930,879 bases (N50 length, 1,254 bp; mean length, 731.2 bp) and a GC% of 36.62%. The contigs were assembled into a total of 374,794 unigenes with 276,264,683 bases. The N50 length and the mean length of unigenes produced were 1,262 and 737.1 bp, respectively, with a GC% of 36.47. The lengths of the smallest to largest unigenes in the C. plicata transcriptome ranged from 212–68,788 bp. Among these unigenes, 125,484 unigenes (33.48%) were no more than 300 bp, 188,221 unigenes (50.22%) were 301–1,000 bp, 32,700 (8.72%) unigenes were 1,001–2,000 bp, and 28,389 unigenes (7.57%) were greater than 2,000 bp (Fig 2).

Fig 2

Summary of C. plicata visceral mass unigene (≥ 200 bp) sequences after Trinity assembly.

Complete coverage of the C. plicata mitochondrial transcriptome was demonstrated with the Trinity assembler (S1 Table). The 100% sequence coverage of the 13 protein-coding genes of the C. plicata mitochondrial genome using only two assembled unigenes (unigene Cp_000887 and unigene Cp_009974) demonstrated the integrity and completeness of Trinity de novo assembler. Several assemblers have been tested to map mitochondrial protein-coding genes in assembled contig sequences with a lesser degree of coverage [25,27]. The coverage of mitochondrial DNA genes is a direct measure of the quality of assembled sequences. An earlier study reported that the Oases analysis pipeline with a K-mer size of 23 was the best program for assembling the de novo transcriptome of Crassostrea virginica compared to the SOAPdenovo-Trans (K-mer sizes of 41 and 51) and Trinity (K-mer size of 25) programs based on the N50 length of contigs, the number of contigs longer than 500 bp, and the alignment coverage [25]. The Trinity program used for the transcriptome assembly of C. plicata RNA-seq reads with average contig and unigene lengths of 731.2 bp and 737.1 bp, respectively, was found better than or similar to most other Illumina sequenced assemblies: 260 and 434bp average contig lengths in H. midae [26] and R. balthica [27], respectively, and 706 and 580bp average unigene lengths in the endangered species Chinese sturgeon, (Acipenser sinensis) [30] and Chinese salamander (Hynobius chinensis) [29], respectively. A comprehensive summary of molluscan transcriptomes in the last three years using NGS platforms (Table 2) shows a preference for Illumina technology combined with the Trinity assembly process. Transcriptome data on endangered or endemic molluscs obtained using NGS platforms would increase our understanding of their genomic attributes and provide information for species conservation in their natural environment. Thus, the C. plicata transcriptome and annotation of valuable genes will be useful for functional genomics research and the development of molecular markers, and will serve as reference information for closely related species.

Table 2

Summary of molluscan transcriptomics in the last three years using Next Generation Sequencing (NGS) platforms.

R- raw reads, C- clean reads;

Species (Tissue)	NGS platform	Reads (n)	Assembler	Contigs (n)	Contigs (mean length in bp)	Contigs (N50 length in bp)	Others	SRA Accession	Sequencing objectives	Reference
Cristaria plicata	Illumina HiSeq2500	R- 286,152,584 C- 281,322,837	Trinity	453,931	731.2	1,254	374,794^b	SRP062467 PRJNA293023	Endangered species	This study
Crassostrea hongkongensis	Roche 454GS FLX	R- 1,595,855 C- 1,405,240	GS Denovo Assembler v2.6	41,472	958	1,571	—	SRR949615	Genetic selection	[48]
Tritonia diomedea (Central Nervous System)	Illumina HiSeq2000	R- 133,156,930	Trinity	185,546	74	363	—	PRJNA252890	Molecular correlates of behavior	[49]
Octopus vulgaris (hemocytes)	Illumina GAIIx	R- 150,302,926 C- 127,019,711	Trinity	254,506	669	1,632	87,408^a	SRP043705	Molecular basis of immune defense	[50]
Mytilus galloprovincialis (Digestive gland)	Illumina	R- 57,059,700 C- 52,770,704	Trinity	21,193	771	1,010	—	SRP011280.2	Molecular markers for toxin accumulation	[51]
Perna viridis (adductor muscle, gills, hepatopancreas)	Illumina Genome Analyzer IIx	C- 544,272,542	Trinity	233,257	1,264	2,868	—	SRP043984	Molecular aspects of toxicity responses	[52]
Anadara trapezia (Multiple tissue types)	Illumina HiSeq2000	C- 27,000,000	Trinity	75,024	505	597	13,507	PRJNA210944	Physiological response to environment stress	[23]
Pinctada maxima (mantle)	Illumina HiSeq2000	R- 49,500,748	Trinity	108,704^b	407	—	—	—	Development of EST-SSR markers	[53]
Echinolittorina malaccana	Illumina HiSeq2000	R- 61,000,000	Trinity	115,211^b	453	492	—	SRP041635	Thermal adaptations	[54]
Pecten maximus (mantle)	Illumina HiSeq2000	R- 1,335,123,074	SOAPdenovo	26,064	1,011	—	—	SRP040427	Shell production	[55]
Pecten maximus (hemocyte)	Illumina HiSeq2000	R- 216,444,674	CLC Genomic workbench	73,752	502.6	—	—	SRR1009240, SRR1009241, SRR1009242	Immunity	[56]
Corbicula fluminea (Multiple tissue types)	Illumina GAIIx	R- 67,087,130 C- 62,250,336	Velvet & Oasis	134,684^b	791.06	1,264	—	SRA062349	Exploration as environmental test organism	[57]
Chlamys farreri (mantle)	Illumina HiSeq2000	R- 59,918,916 C- 55,122,820	Trinity	188,629	249	306	—	—	Shell production	[58]
Sinonovacula constricta (Multiple tissue types)	Roche 454GS FLX	C- 859,313	Newbler2.7	16,323	1,376	—	—	GALB01000000	Molecular markers for growth	[59]
Mizuhopecten yessoensis (Multiple tissue types)	Illumina GAIIx	R- 112,265,296	Velvet & Oases	217,190	436	—	—	SRR653778	Molecular response to heavy metals	[24]

a number of contigs no less than 500 bp;

b number of unigenes

# For a summary of molluscan transcriptome analysis prior to 2013, please refer [25]

Summary of molluscan transcriptomics in the last three years using Next Generation Sequencing (NGS) platforms.

R- raw reads, C- clean reads; a number of contigs no less than 500 bp; b number of unigenes # For a summary of molluscan transcriptome analysis prior to 2013, please refer [25]

Sequence annotation of unigenes

The assembled unigenes in the C. plicata transcriptome were used to conduct a BLASTx search (E-value ≤ 1E-5) against the curated PANM DB, UniGene DB, and COG DB for validation and annotation of genes. Of 374,794 unigenes, 79,960 (21.33%), 40,196 (10.72%), and 13,934 (3.72%) unigenes were similar to sequences in the PANM, COG, and UniGene DBs, respectively (Table 3). The majority of unigenes annotated to homologous sequences in the DBs had lengths ≥1000 bp. A total of 39,682 (10.59%) and 11,368 (3.03%) unigenes had common homologous matches in the PANM with COG DBs, and the PANM with UniGene DBs, respectively. A total of 9,820 (2.62%) unigenes were annotated simultaneously by all three DBs. In total, 84,274 (22.49%) annotations were found within the clustered unigenes of the C. plicata transcriptome. The non-annotated unigene sequences were less likely to produce BLAST hits in the protein databases possibly due to their shorter sequences and their lack of a representative protein domain.

Table 3

Functional annotation of unigenes of the Cristaria plicata transcriptome.

Databases	All annotated transcripts	≤300 bp	300–1000 bp	≥1000 bp
PANM	79,960	14,480	30,748	34,732
UniGene	13,934	1,848	3,721	8,365
COG	40,196	4,763	11,445	23,988
GO	23,246	2,593	5,625	15,028
KEGG	4,776	483	927	3,366
All annotated	84,274	15,700	33,108	35,466

Homology characteristics and functional annotation of unigenes

Characteristics of the homology search of assembled unigenes against the PANM DB are summarized in Fig 3. The score distribution, which represents the quality of the BLAST alignment, showed that 38,142 (47.70%) unigenes had scores between 50 and 100 and 27,457 unigenes had scores between 100 and 500 (Fig 3A). Only 4,809 (6.01%) unigenes had a score < 50 reflecting the quality of the assembly and sequence annotation process. The E-value distribution revealed that 55,740 unigenes (69.71%) showed significant homology to deposited sequences (1E- 50 to 1E- 5, Fig 3B). The identity distribution showed that 33,781 (42.25%) and 16,884 (21.12%) unigenes showed identities of 40–60% and > 60%, respectively, to deposited sequences (Fig 3C). In addition, the similarity distribution showed that 47,027 (58.81%) unigenes had similarities greater than 60% (Fig 3D). The lengths of unigenes were directly related to the presence or absence of BLAST hits (Fig 3E). This is understandable since the longer sequences are more likely to contain protein domain characteristics and are more likely to have BLAST matches in the protein database. Another basis for understanding unigene characteristics is the BLAST top-hit species distribution which shows putative homology of the annotated sequence across species in the PANM DB (Fig 4). Based on our analysis, the highest homology was observed with the oyster, Crassostrea gigas (32,609 unigenes, 40.78%), followed by the owl limpet, Lottia gigantea (12,065 unigenes, 15.09%). As expected, the majority of unigene hits belonged to molluscan and other arthropod proteins. A summary of the top-hit InterPro domains identified 1,374 unigenes with zinc finger, C2H2-like domains. Zinc finger domains participate in important cell processing functions including signal transduction and transcriptional regulation and are a common feature in molluscs, insects, and other crustacean groups [60,61]. The C2H2-like zinc finger proteins are the most common DNA-binding motifs present in prokaryotic and eukaryotic transcription factors [62]. Other top domains identified based on unigene homology included the Toll/interleukin-1 receptor homolog (TIR) domain, C-type lectin domain, death-like domain, and heat shock protein 70 family domain, which are putative candidates for involvement in immune signaling processes in C. plicata. The 40 top-hit InterPro domains in the C. plicata transcriptome are summarized in Table 4.

Fig 3

Statistical summary of homology search of assembled unigenes against the PANM protein database.

(A) Score distribution of BLAST hits for each unigene with a cutoff E-value of 1E -5. (B) E-value distribution of each unigene using BLAST hits with a cutoff E-value of 1E -5. (C) Identity distribution of the top BLAST hits for each unigene. (D) Similarity distribution of the top BLAST hits for each unigene. (E) Lengths of unigenes compared with the presence or absence of BLAST hits.

Fig 4

Top-hit species distribution of C. plicata visceral mass unigenes against the PANM database (custom-devised curatable database of mollusc, arthropod, and nematode protein sequences downloaded from the NCBI nr database).

An E-value cutoff of 1E -5 was maintained and the hit distribution shows high homology to known genome sequences of the Mollusca phylum.

Table 4

List of the top-hit 40 InterPro domains in C. plicata transcriptome.

InterPro domain	Description	Unigenes
IPR015880	Zinc finger, C2H2-like domain	1374
IPR027417	P-loop containing nucleoside triphosphate hydrolase domain	1126
IPR000477	Reverse transcriptase domain	637
IPR012337	Ribonuclease H-like domain	496
IPR011042	Six-bladed beta-propeller, TolB-like domain	491
IPR013783	Immunoglobulin-like fold domain	465
IPR001841	Zinc finger, RING-type domain	405
IPR002110	Ankyrin repeat	404
IPR005135	Endonuclease/Exonuclease/phosphatase domain	388
IPR000315	B-box-type zinc finger domain	350
IPR000276	G protein-coupled receptor, rhodopsin-like family	320
IPR002290	Serine/threonine/dual specificity protein kinase, catalytic domain	272
IPR001370	BIR repeat	229
IPR003599	Immunoglobulin subtype domain	223
IPR024810	Mab-21 domain	219
IPR000504	RNA recognition motif domain	214
IPR000242	PTP type protein phosphatase domain	214
IPR002035	von Willebrand factor, type A domain	213
IPR003615	HNH nuclease domain	211
IPR027124	SWR1-complex protein 5/Craniofacial development protein family	209
IPR002048	EF-hand domain	209
IPR013083	Zinc finger, RING/FYVE/PHD-type domain	196
IPR000742	EGF-like domain	195
IPR008979	Toll/interleukin-1 receptor homology (TIR) domain	193
IPR000157	Galactose-binding domain-like	193
IPR002126	Cadherin domain	189
IPR003591	Leucine-rich repeat, typical subtype repeat	179
IPR000436	Sushi/SCR/CCP domain	175
IPR001680	WD40 repeat	173
IPR003593	AAA+ ATPase domain	172
IPR001304	C-type lectin domain	169
IPR011029	Death-like domain	163
IPR013087	Zinc finger C2H2-type/integrase DNA-binding domain	159
IPR011701	Major facilitator superfamily	158
IPR015943	WD40/YVTN repeat-like-containing domain	150
IPR001128	Cytochrome P450 family	150
IPR019734	Tetratricopeptide repeat	145
IPR013126	Heat shock protein 70 family	145
IPR000210	BTB/POZ domain	136

Statistical summary of homology search of assembled unigenes against the PANM protein database.

Top-hit species distribution of C. plicata visceral mass unigenes against the PANM database (custom-devised curatable database of mollusc, arthropod, and nematode protein sequences downloaded from the NCBI nr database).

An E-value cutoff of 1E -5 was maintained and the hit distribution shows high homology to known genome sequences of the Mollusca phylum. We subjected all unigenes to a search against the COG DB to make functional predictions. The unigenes were distributed among 25 functionally classified categories (excluding the multi category) (Fig 5). Among the 25 COG categories, the “general function prediction” cluster constituted the largest group (9,549; 23.75%), followed by “signal transduction mechanisms” (4,916; 12.23%), “post-translational modification, protein turnover, chaperones” (3,291; 8.19%), “function unknown” (2,557; 6.4%), “transcription” (1,610; 4%), “cytoskeleton” 1,344; 3.3%), and “RNA processing and modification” (1,002; 2.5%). A greater number of unigenes (6,107; 15.19%) were also allocated to the multi assignment category.

Fig 5

Clusters of orthologous groups (COG) classification of unigenes.

Out of 374,794 annotated unigenes, 40,196 sequences had a COG classification from among the 25 COG categories (excluding the multi category).

Clusters of orthologous groups (COG) classification of unigenes.

Out of 374,794 annotated unigenes, 40,196 sequences had a COG classification from among the 25 COG categories (excluding the multi category). Gene Ontology (GO)-based annotation is an internationally standardized gene functional classification system that describes gene products in terms of their associated biological processes, cellular components, and molecular functions. To make functional predictions for the C. plicata unigenes, we mapped the associated GO terms to 79,960 unigenes that had BLAST matches. The GO annotation was based on the BLASTx results against the nr database. Protein domains and motif information were retrieved using the InterProScan sequence search tool via BLAST2GO and the annotation was merged with already existing GO terms. After merging, the 23,246 unigenes (11,419 for biological process, 6,391 for cellular component, and 21,189 for molecular function) were assigned one or more GO terms based on sequence similarity. Furthermore, 10,016 (43.09% of the 23,246 unigenes) were annotated with both biological processes and cellular components, 5,058 (21.76%) were annotated with both cellular components and molecular function, 4,484 (19.29%), annotated with both biological process and cellular components, while 3,805 (16.37%) were assigned to all three categories (Fig 6A). A calculation of the number of unigenes associated with GO terms suggested that 7,892 and 7,781 sequences were associated with two or one GO term annotations, respectively (Fig 6B). As expected, the evidence code distribution showed an over-representation of electronic annotations that have not been created manually and may contain higher false positives. The evidence code ‘inferred from electronic annotation’ (IEA) like others, such as ‘inferred from sequence or structural similarity’ (ISS), ‘inferred from reviewed computational analysis’ (RCA), and ‘inferred from genomic context’ (IGC) belong to computational source of evidence; those that constitute over 95% of the total GO annotation analysis [63]. Hence, with respect to our GO term annotation results, all of the GO terms are not of equal validity and, based on this, the interpretation of unigenes relates to only the predicted function.

Fig 6

Functional annotation of C. plicata visceral mass assembled sequences based on gene ontology (GO) categorization.

Functional annotation of C. plicata visceral mass assembled sequences based on gene ontology (GO) categorization.

(A) An overlap model of the annotated unigenes assigned to biological processes, molecular functions, and cellular components based on GO function. (B) Numbers of unigenes assigned to GO term annotations. Among the 23,246 unigenes for which we obtained GO terms, we observed a wide diversity of functional categories represented on level 2 of the GO database. Fig 7 shows a total of 19 biological process, 10 cellular components, and 12 molecular function GO level-2 classes in which the unigenes were predicted to function. Within the biological processes category, genes involved in cellular processes (GO:0009987) and metabolic processes (GO:0008152) were represented prominently, followed by single-organism processes (GO:0044699) and biological regulation (GO:0065007) (Fig 7A). In the cellular component category, membrane (GO:0016020), cell (GO:0005623), and organelle (GO:0043226) represented the majority of terms (Fig 7B), while in the molecular function category, the top-represented GO terms included binding (GO:0005488) and catalytic activity (GO:0003824) (Fig 7C). The GO classifications suggested for C. plicata showed similarities with those for sequenced P. yessoensis [21] and C. hongkongensis [48].

Fig 7

GO classifications of the C. plicata transcriptome at level 2.

GO analyses were performed for three major classification categories: (A) biological processes; (B) cellular components and (C) molecular functions.

GO classifications of the C. plicata transcriptome at level 2.

GO analyses were performed for three major classification categories: (A) biological processes; (B) cellular components and (C) molecular functions. We also performed a search of all unigenes against the KEGG database to identify the active biological pathways in C. plicata Using BLASTx, we found 4,776 unigenes that shared homology with known enzymes in the KEGG database (Fig 8). The unigene sequences mapped to 123 KEGG pathways. Among them, 709 unigenes possessing an Enzyme Commission number were assigned to these pathways. The KEGG pathways were related to metabolism (4,315 unigenes), genetic information processing (53 unigenes), environmental information processing (80 unigenes), and organismal systems (328 unigenes). Predominantly, the unigenes were enriched in “nucleotide metabolism pathway” (1,147 unigenes of the 4,776 unigenes) followed by “metabolism of co-factors and vitamins” (951 unigenes), “xenobiotic biodegradation and metabolism” (554 unigenes), and “immune system pathways” (328 unigenes) categories. The C. plicata unigenes annotated to KEGG pathways are presented in S2 Table. Overall, both GO term and KEGG analyses identified, based on similarity to sequences already identified, transcribed region potentially involved in C. plicata stress responses as well as immunity and reproduction that could be crucial for determining adaptation in the natural environment and promoting conservation by means of genetic improvement programs.

Fig 8

Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis.

The C. plicata visceral mass unigenes were assigned to KEGG pathways (inner circle). The total number of enzymes ascribed within each KEGG pathway is shown in the outer circle. Each pathway is represented by a different color.

Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis.

Genes and pathways related to the innate immune system

A combined annotation profile of the various methods allowed for an analysis of the C. plicata unigene sequences involved in immunity and stress mechanisms. We identified unigene sequences associated functionally with innate immunity and oxidative stress mechanisms. A summary of unigene sequences list of immunity and defense mechanisms is presented in S3 Table. A majority of unigenes showed homology to lectins, toll-like receptors (TLRs), and cathepsin, followed by complement C1q, scavenger receptor, caspase, and heat shock proteins (HSP). Overall, the C. plicata unigene sequences covered the major pathways and provided an extensive coverage of the immune gene repertoire in the species. The innate immune system of molluscs consists of cellular and humoral components that operate in a coordinated manner to defend against a multitude of pathogens [64]. Pattern recognition proteins (PRPs) recognize microbial surfaces and initiate a signaling cascade which culminates in the regulatory release of antimicrobial effectors and forms the humoral defense response. The role of lectins as PRPs and in phagocytosis mechanisms has been characterized in molluscs [65,66] and is attributable mainly to the presence of carbohydrate-binding domains (CRDs) [67, 68]. The C. plicata transcriptome data shows the presence of putative lectin sequences including tandem-repeat galectin, C-type lectin, sialic-acid binding lectin, fucolectin, and immulectin-3. Tandem-repeat galectins are known to act as an acute-phase protein implicated in the immune defense of R. philippinarum, pearl oyster, and Pinctada fucata against Vibrio species [69,70]. C-type lectins have numerous roles in bivalve organisms such as non-self recognition, microbe agglutination, induction of phagocytosis and encapsulation, and anti-bacterial properties [71]. Transcriptome analysis of the related species, C. virginica, identified 8 galectins and 140 C-type lectin domain proteins [25]. A cross-species comparison analysis of C-type lectin domain proteins and CRD proteins showed an overrepresentation of such innate immune factors [25]. A comprehensive repertoire of carbohydrate-binding molecules has been analyzed in the common periwinkle, Littorina littorea, with unigene sequences corresponding to C-type lectins, fucolectins, galectins, chitinase-like lectins, and I-type lectins [72]. Fucolectins formed the largest group of CRD proteins in the C. plicata transcriptome. Fucolectins have been characterized in the mussel Mytilus galloprovincialis [73] and the sea cucumber Apostichopus japonicus [74], and have further characterized in detail [75]. Peptidoglycan recognition proteins (PGRPs), apolipophorin, and CD63, which are known for their recognition-mediated immune functions, were also identified among the C. plicata unigene sequences and may confer a selective advantage to the species under threatening conditions. Among the PGRPs, we identified both the short and long forms that bind to and hydrolyze bacterial peptidoglycans and activate the Toll or IMD signal transduction pathways in invertebrates [76,77]. PGRP homologs have been identified in squid (Euprymna scolopes) cDNA library [78] and the assembled transcriptomes of C. virginica [25] and Octopus vulgaris [50]. Apolipophorin III has been implicated in pattern recognition of beta-1, 3 glucans and responds to intracellular pathogens in insects [79,80], but requires a more extensive understanding in molluscs. TLRs are membrane-bound pathogen recognition receptors implicated in intracellular signaling and they regulate the production of the effector antimicrobial peptides [81]. TLRs contain leucine-rich repeat motifs, a transmembrane region, and a cytoplasmic Toll/interleukin-1 receptor (TIR) domain which interacts with myeloid differentiation factor 88 (MyD88) or other adaptors, leading to activation of intracellular Toll signaling cascades. We identified TLR-2, -3, -4, -6, -7, -13, and other TLR precursors in the C. plicata transcriptome. In M. edulis, 27 TLRs have been described, indicative of diversity and an advanced immune system [19]. With a comprehensive array of TLRs identified from the transcriptome, it would be interesting to explore the innate immunity functions using a targeted gene approach. In addition, genes encoding intracellular Toll pathway proteins including MyD88, IRAK1 protein, relish, Janus kinase (JNK), and p38 have been identified from the rich transcriptome datasets, suggesting that the TLR pathway is conserved in C. plicata. Our results are consistent with TLR pathway genes identified in M. galloprovincialis (Illumina reads) [82], M. edulis (454 contigs and new assemblies) [83], and C. gigas (Illumina reads) [84]. Based on the TLR pathway in the related species C. virginica, the association of the TIR domain of TLR and MyD88 releases signals through IRAK and TRAF6 to induce the p38 signaling pathway (mediated by MEKK1, MKKs, or JNK) or the NF-κB-like relish protein. The identification of these intracellular components will increase our understanding of the TLR mediated immune system in C. plicata. MyD88 is an adaptor in the TLR/IL-1R signaling pathway and is highly expressed in bivalves in response to both Gram-positive and Gram-negative challenge. Recently, five genes encoding MyD88, which acts as an acute phase protein after infection with microbes, were characterized from P. yessoensis [85]. In an oyster model, TLR-based intracellular signaling is well represented, with mediators including MyD88, TNF (tissue necrosis factor) receptor associated factor 6 (TRAF6), and nuclear factor-κB (NF-κB) factors involved in the regulation of antimicrobial function [86]. Many gene families encoding proteins involved in immune responses to biotic and abiotic challenges have been identified from the C. gigas genome project including TLRs, MyD88, C-type lectins, fibrinogen-related proteins (FREP), superoxide dismutase (SOD), and globular head C1q domain containing protein (C1q) [87]. Transcriptome analysis of C. virginica [25] also revealed a rich set of genes related to the TLR pathway including MyD88, SARM, IRAK, TRAF6, MKKs, JNK, p38, AP-1, and NF-κB. We identified immune signaling candidates and immune effectors such as big defensins and defensins in the C. plicata transcriptome. Defensins are effectors of innate immunity present in marine bivalves [88] and some freshwater bivalves such as Lamellidens marginalis [89], Hyriopsis schlegelii [90], and Hyriopsis cumingii [91]. The evolution of antimicrobial peptide genes including defensins has been rapid in Crassostrea species, suggestive of molecular diversification of the effectors to cope with environmental challenges [92]. Cathepsins identified in this study are involved in the maintenance of homeostasis, regulation of antigen presentation and degradation, immune responses, and intracellular protein degradation. Eight putative cathepsin sequences (cathepsin B, C, D, F, I, L, S, and Z) present in the transcriptome may be required for several of highly regulated life processes. Multiple homologs and cDNA sequences of cathepsin L have been identified from C. gigas and P. fucata as well as a cloned cDNA from C. plicata [9,93,94]. The identification of candidate genes for environmental adaptation indicates that the dynamics of species survival should be further explored. Some unigene sequences in the C. plicata transcriptome showed homology to oxidative stress enzymes such as SOD (Mn-SOD and Cu-Zn SOD), glutathione-S-transferase (GST alpha, -mu, -omega, -pi, -sigma, -theta), catalase (CAT), glutathione peroxidase (GPX), and glutathione synthetase. The transcriptional stability of these genes is crucial under anthropogenic and pathogenic stresses and regulates cell homeostasis, as demonstrated in the mussel M. galloprovincialis [95]. SOD, CAT, and GPX were also identified in the C. virginica [25] and L. fortunei transcriptomes [17]. The heat shock protein (HSP) genes are key indicators of the physiological robustness of an organism and provide molecular insights into the response to environmental challenges [96]. The identification of HSP70 multigene family members and small HSP and HSP90 class genes reveals bivalve specialization for environmental sustainability. In the C. plicata transcriptome, we identified unigenes related putatively to HSP60, HSP70, and HSP90, and the small HSPs (HSP10, HSP20, and HSP40). HSP70 is associated with acclimation robustness in molluscs and is found to be highly expressed in response to biotic stressors [17,97]. Differential, tissue- and time-specific expression of HSP70 in C. hongkongensis [98] and HSP60 in the mussel Perna viridis [99] has been reported. HSP chaperones are important factors for the establishment of mussels such as M. galloprovincialis and M. trossulus in North America, and are proposed to serve as a biochemical marker [100]. Analysis of the C. plicata transcriptome also revealed several key genes encoding proteins related to apoptosis and programmed cell death including caspases (caspase-1, -2, -3, -7, -8, -9 like isoform, -10), Bcl-2, and Bax, suggestive of the significance of apoptosis in the immune response of the organism. The abundance of unigenes showing homology to caspases was apparent from the identification of initiator caspase group 2 and 8 and executioner caspase group 3 and 7. The appearance of multiple executioner caspases and the divergence in their sequences compared to the conserved initiator caspases reflects the importance of the executioner phase of apoptosis in bivalves [101]. A similar set of key apoptosis genes was identified in the related species, C. virginica, suggestive of a versatile apoptosis-pathway mediating system [25]. In accordance with an earlier report of the presence of C1q domain proteins in the C. virginica, M. galloprovincialis, and M. edulis transcriptomes, we identified a large number of C1q domain proteins in the C. plicata transcriptome, which supports the expansion of such proteins in bivalve molluscs, possibly in support of adaptation strategies [19,25,102].

Sex-determination and reproduction-related genes

Genes involved in the sex determination process determine the sex of an organism by directing the development of gonadal structures, such as the ovary or testis. Sex-differentiation genes regulate the development of the ovaries and the testis from the undifferentiated gonad. Gonad transcriptome analysis studies have taken advantage of the diverse reproduction strategies of molluscs to identify a large number of sex determination/differentiation genes [48,103-106]. The economic advantages of C. plicata for aquaculture industries and its endangered status make exploring the molecular mechanisms underlying gametogenesis important. We identified unigene sequences showing homology to sex determination, sex differentiation, and reproduction-related genes based on a keyword search of the PANM-DB using our BLASTx annotation results (S4 Table). The C. plicata unigenes showed homology to sex-determination genes including SRY-related HMG-box domain (SOX) family members (SOX5, SOX6, SOX9, SOX11, SOX15, and SOXB2), Doublesex, and mab-3 related transcription factors (DMRT3), WNT member 4 (WNT4), Fem-1 like protein (FEM1), steroidogenic factor 1 (SF1), sex-determining region on the Y-chromosome (SRY), dosage-sensitive sex reversal, adrenal hypoplasia critical region, and on chromosome X (gene 1, DAX1). Sex determination and differentiation genes that were conspicuously absent from the C. plicata transcriptome included Wilm’s tumor suppressor gene-1 (WT-1), forkhead box L2 (FOXL2), aromatase (CYP19A1), and anti-mullerian hormone. While WT-1 homologs have been reported in fish (Danio rerio), sturgeons (A. sinensis and A. naccarii), amphibians (H. chinensis), and mouse (Mus musculus), they have not been reported in oyster (C. hongkongensis) [48]. Homologs of aromatase and anti-mullerian hormone from oyster species have also not been reported. FOXL2 is an ovarian determination gene in vertebrates and it functions to suppress genes involved in testis differentiation. Homologs of FOXL2 show higher expression in ovaries of invertebrates, including molluscs [25,105,107]. Homologs may be present in C. plicata, suggestive of a role in the earlier stages of gonad development. In C. hongkongensis, sex determination genes such as DMRT1 and SRY were absent while homologs of other DMRT and SOX family members were present [48]. Genes including DMRT, SOX9, Fem1, and FOXL2 are known to participate in the regulatory processes underlying sex determination/differentiation [108,109]. SOX family proteins are a conserved group of transcription regulators involved in development and differentiation which possess a high mobility group (HMG)-box domain [29,110]. SRY (founding member of the SOX genes and the master switch in sex determination) with SOX9 activates the male determining pathway by inhibiting ovary development through the induction of anti-mullerian hormone in Sertoli cells [111,112]. A DMRT-1 testis-specific (zinc finger DM domain protein) unigene sequence was identified among the C. plicata unigenes, and may (along with other members of the family) promote male-specific development. This finding is supported by reports of testis-specific DMRT genes in other molluscs [105,113,114]. Other genes such as FEM1 and WNT4 are not involved in the sex-determining pathway or maintenance of mature gonads but may have a role at an earlier stage of sex-specific expression [25]. For the conservation of endangered species such as C. plicata, it is important to understand reproduction-specific unigenes from the transcriptome. We identified unigene sequences homologous to spermatogenesis-associated protein (SPATA1, 4, 5, 6 and 7), sperm flagellar protein 1/2, sperm motility kinase, nuclear autoantigenic sperm protein, spermidine synthase, and spermine oxidase potentially expressed in the male reproductive tissues. As expected, the majority of male reproduction-related genes are associated with sperm motility, which is crucial for the reproductive success of an organism. Moreover, we identified C. plicata unigenes showing homology to oocyte zinc finger protein, ovary development protein, and vitellogenin (Vg) which are other putative reproduction-related genes. Vg has a strong transcript presence in the ovary compared to the testis and is an important protein for oocyte maturation in molluscan species such as C. gigas [115], P. yessoensis [116], and Argopecten purpuratus [117]. Overall, the C. plicata transcriptome was useful for the discovery of homologs related to sex-determination and reproduction from among the assembled unigene sequences.

Characterization of cSSR markers in the C. plicata transcriptome

SSRs are tandem repeated motifs characterized by high-levels of polymorphism and are commonly used as marker systems in genetic diversity assessments, population structure dynamics studies, conservation genomics, and genetic linkage mapping for a variety of organisms [118-121]. These microsatellite sequences can be important for the characterization of invasive species with reduced genetic diversity [122-123] and can resolve structural dynamics of closely related populations [124]. Unfortunately, due to limitations of time and cost of development, microsatellite isolation from non-model organisms remains limited. Recently, with the availability of genomic and transcriptome sequences using the high-throughput and cost-efficient NGS platforms, de novo screening of large sets of microsatellites such as SSRs and single nucleotide polymorphism (SNPs) has become more common in non-model organisms [125-126], including mollusc species [48, 127]. Due to the extensive use of C. plicata as a pearl culture resource in China, few polymorphic microsatellite loci were developed using the dinucleotide-enriched genomic library for the improvement of species [128]. To address the conservation efforts of C. plicata, which is endangered in Korea, the development of SSRs is highly desirable. We obtained a set of SSRs from the transcriptomic dataset using the MISA program. A total of 61,141 unigene sequences (>1 kb) containing 17,251 SSRs were identified, with 3269 sequences containing more than one cSSR. We screened for dinucleotide repeats with a minimum of six iterations, and all other repeat types (tri- to hexanucleotides) repeated at least five times. The most abundant SSRs included dinucleotide repeats (11,302), followed by tri- (4381), tetra- (1527), penta- (40), and hexanucleotide repeats (1). Consistent with our results, dinucleotide repeats were the most abundant repeats in the cSSR profiles of M. rosenbergii [16], endangered A. sinensis [30], and Chinese salamander (H. chinensis) [29]. Conversely, in the invasive L. fortunei, tetranucleotide repeats were more common while dinucleotide repeats were less common [17]. A summary of SSRs based on the number of repeat is presented in Table 5. Six tandem repeats (3635) were predominant, followed by five (2479) and seven (2111) tandem repeats. The cSSRs also contained repeats with 20 (216) and ≥ 21 (1595) random reiterations consisting mostly of dinucleotide repeats (89.66%). An analysis of the frequency distribution of cSSRs based on motif sequence types is shown in Fig 9. The C. plicata transcriptome is rich in AC/GT (4992; 28.94%), AT/AT (3536; 20.50%) and AG/CT (2720; 15.77) repeats. The abundant repeats in the cDNA-associated SSRs of C. hongkongensis were identified as AG followed by AT, AC, and ATC [48]. The prominent trinucleotide repeat types common among cSSRs of C. plicata included AAT/ATT (1779; 10.31%), followed by ATC/ATG (693; 4.02%), AAC/GTT (647; 3.75%), ACT/AGT (361; 2.09%), and AAG/CTT (295; 1.71%). A tetranucleotide repeat ACAT/ATGT (771; 4.47%) was also identified within the top prominent SSR types. Generally, microsatellite repeat types exhibit species-specific differences, as has been reported for Crustacean species [16,129]. SSRs identified in this study can be valuable for genetic improvement programs and for the quantification of genetic diversity within and among populations of endangered C. plicata.

Table 5

Summary of simple sequence repeat (SSR) types based on the number of repeat units.

Repeat numbers	Motif length					Total
	di-	tri-	tetra-	penta-	hexa-	Total
5	0	1936	526	17	0	2479
6	2541	713	380	1	0	3635
7	1648	417	44	2	0	2111
8	1211	401	66	1	0	1679
9	751	104	60	2	0	917
10	652	117	50	4	0	823
11	853	102	29	2	0	986
12	618	79	37	1	0	735
13	189	57	44	2	0	292
14	263	86	32	3	0	384
15	237	46	44	1	1	329
16	220	48	45	0	0	313
17	201	52	39	0	0	292
18	204	38	32	2	0	276
19	127	43	19	0	0	189
20	157	46	13	0	0	216
≥21	1430	96	67	2	0	1595
Total	11302	4381	1527	40	1	17251

Fig 9

Frequency distribution of simple sequence repeats (SSRs) based on motif types found in C. plicata visceral mass unigene sequences.

Conclusions

This study is the first exhaustive investigation of the transcriptome of the endangered freshwater pearl bivalve, C. plicata. Using the Illumina HiSeq 2500 NGS platform and the Trinity assembler, we assembled approximately 79,960 unigenes and assigned them to 23,246 GO and 4,776 KEGG pathway annotations. A number of candidate unigenes involved in immunity, sex determination/differentiation, and reproduction were identified. Transcriptome sequence and annotation data are valuable because C. plicata has been listed as endangered in the Korean Red List of Threatened Species due to a decline in their natural habitat over time. Genetic markers in the form of cSSRs may assist in the development of genetic improvement programs for C. plicata.

Mapping of the annotated unigenes against C. plicata mitochondrial protein genes using BLASTn at a cutoff E-value of 1E-5.

(DOCX) Click here for additional data file.

Annotation of C. plicata unigenes to KEGG pathways.

(XLSX) Click here for additional data file.

Genes of interest for immune signaling response and defense mechanisms in C. plicata transcriptome.

(DOCX) Click here for additional data file.

Candidate genes for sex-determination and reproduction in C. plicata transcriptome.

(DOCX) Click here for additional data file.

109 in total

1. Molecular characterization and expression analysis of cathepsin L1 cysteine protease from pearl oyster Pinctada fucata.

Authors: Jianjun Ma; Dianchang Zhang; Jingjing Jiang; Shuge Cui; Hanlin Pu; Shigui Jiang
Journal: Fish Shellfish Immunol Date: 2010-05-26 Impact factor: 4.581

2. De novo assembly and analysis of RNA-seq data.

Authors: Gordon Robertson; Jacqueline Schein; Readman Chiu; Richard Corbett; Matthew Field; Shaun D Jackman; Karen Mungall; Sam Lee; Hisanaga Mark Okada; Jenny Q Qian; Malachi Griffith; Anthony Raymond; Nina Thiessen; Timothee Cezard; Yaron S Butterfield; Richard Newsome; Simon K Chan; Rong She; Richard Varhol; Baljit Kamoh; Anna-Liisa Prabhu; Angela Tam; YongJun Zhao; Richard A Moore; Martin Hirst; Marco A Marra; Steven J M Jones; Pamela A Hoodless; Inanc Birol
Journal: Nat Methods Date: 2010-10-10 Impact factor: 28.547

3. Six defensins from the triangle-shell pearl mussel Hyriopsis cumingii.

Authors: Qian Ren; Meng Li; Chi-Yu Zhang; Ke-Ping Chen
Journal: Fish Shellfish Immunol Date: 2011-08-04 Impact factor: 4.581

4. SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads.

Authors: Yinlong Xie; Gengxiong Wu; Jingbo Tang; Ruibang Luo; Jordan Patterson; Shanlin Liu; Weihua Huang; Guangzhu He; Shengchang Gu; Shengkang Li; Xin Zhou; Tak-Wah Lam; Yingrui Li; Xun Xu; Gane Ka-Shu Wong; Jun Wang
Journal: Bioinformatics Date: 2014-02-13 Impact factor: 6.937

5. Characterization of the Zhikong scallop (Chlamys farreri) mantle transcriptome and identification of biomineralization-related genes.

Authors: Mingjun Shi; Ya Lin; Guangrui Xu; Liping Xie; Xiaoli Hu; Zhenmin Bao; Rongqing Zhang
Journal: Mar Biotechnol (NY) Date: 2013-07-17 Impact factor: 3.619

6. Molecular characterization of a cDNA encoding putative vitellogenin from the Pacific oyster Crassostrea gigas.

Authors: Toshie Matsumoto; Akihumi M Nakamura; Katsuyoshi Mori; Toshiaki Kayano
Journal: Zoolog Sci Date: 2003-01 Impact factor: 0.931

Review 7. Control of cell fate and differentiation by Sry-related high-mobility-group box (Sox) transcription factors.

Authors: Véronique Lefebvre; Bogdan Dumitriu; Alfredo Penzo-Méndez; Yu Han; Bhattaram Pallavi
Journal: Int J Biochem Cell Biol Date: 2007-06-06 Impact factor: 5.085

8. Transcriptome analysis of the oriental river prawn, Macrobrachium nipponense using 454 pyrosequencing for discovery of genes and markers.

Authors: Keyi Ma; Gaofeng Qiu; Jianbin Feng; Jiale Li
Journal: PLoS One Date: 2012-06-20 Impact factor: 3.240

9. Transcriptome characterization of the South African abalone Haliotis midae using sequencing-by-synthesis.

Authors: Paolo Franchini; Mathilde van der Merwe; Rouvay Roodt-Wilding
Journal: BMC Res Notes Date: 2011-03-11

10. Cloning, characterization and effect of TmPGRP-LE gene silencing on survival of Tenebrio molitor against Listeria monocytogenes infection.

Authors: Hamisi Tindwa; Bharat Bhusan Patnaik; Dong Hyun Kim; Seulgi Mun; Yong Hun Jo; Bok Luel Lee; Yong Seok Lee; Nam Jung Kim; Yeon Soo Han
Journal: Int J Mol Sci Date: 2013-11-14 Impact factor: 5.923

18 in total

1. Transcriptome analysis of the freshwater pearl mussel (Cristaria plicata) mantle unravels genes involved in the formation of shell and pearl.

Authors: Xuefeng Wang; Zhiming Liu; Wenjian Wu
Journal: Mol Genet Genomics Date: 2016-12-16 Impact factor: 3.291

2. Transcriptome sequencing and de novo characterization of Korean endemic land snail, Koreanohadra kurodana for functional transcripts and SSR markers.

Authors: Se Won Kang; Bharat Bhusan Patnaik; Hee-Ju Hwang; So Young Park; Jong Min Chung; Dae Kwon Song; Hongray Howrelia Patnaik; Jae Bong Lee; Changmu Kim; Soonok Kim; Hong Seog Park; Yeon Soo Han; Jun Sang Lee; Yong Seok Lee
Journal: Mol Genet Genomics Date: 2016-08-09 Impact factor: 3.291

Review 3. Comparative immunogenomics of molluscs.

Authors: Jonathan H Schultz; Coen M Adema
Journal: Dev Comp Immunol Date: 2017-03-18 Impact factor: 3.636

4. Transcriptome analysis of the threatened snail Ellobium chinense reveals candidate genes for adaptation and identifies SSRs for conservation genetics.

Authors: Se Won Kang; Bharat Bhusan Patnaik; So Young Park; Hee-Ju Hwang; Jong Min Chung; Min Kyu Sang; Hye Rin Min; Jie Eun Park; Jiyeon Seong; Yong Hun Jo; Mi Young Noh; Jong Dae Lee; Ki Yoon Jung; Hong Seog Park; Yeon Soo Han; Jun Sang Lee; Yong Seok Lee
Journal: Genes Genomics Date: 2017-12-01 Impact factor: 1.839

5. Transcriptome Analysis of the Tadpole Shrimp (Triops longicaudatus) by Illumina Paired-End Sequencing: Assembly, Annotation, and Marker Discovery.

Authors: Jiyeon Seong; Se Won Kang; Bharat Bhusan Patnaik; So Young Park; Hee Ju Hwang; Jong Min Chung; Dae Kwon Song; Mi Young Noh; Seung-Hwan Park; Gwang Joo Jeon; Hong Sik Kong; Soonok Kim; Ui Wook Hwang; Hong Seog Park; Yeon Soo Han; Yong Seok Lee
Journal: Genes (Basel) Date: 2016-12-02 Impact factor: 4.096

6. Deciphering the Link between Doubly Uniparental Inheritance of mtDNA and Sex Determination in Bivalves: Clues from Comparative Transcriptomics.

Authors: Charlotte Capt; Sébastien Renaut; Fabrizio Ghiselli; Liliana Milani; Nathan A Johnson; Bernard E Sietman; Donald T Stewart; Sophie Breton
Journal: Genome Biol Evol Date: 2018-02-01 Impact factor: 3.416

7. The Whole-Genome and Transcriptome of the Manila Clam (Ruditapes philippinarum).

Authors: Seyoung Mun; Yun-Ji Kim; Kesavan Markkandan; Wonseok Shin; Sumin Oh; Jiyoung Woo; Jongsu Yoo; Hyesuck An; Kyudong Han
Journal: Genome Biol Evol Date: 2017-06-01 Impact factor: 3.416

8. Draft Genome Sequence of Escherichia Phage PGN829.1, Active against Highly Drug-Resistant Uropathogenic Escherichia coli.

Authors: Naveen Chaudhary; Balvinder Mohan; Neelam Taneja
Journal: Microbiol Resour Announc Date: 2018-11-21

9. Transcriptomic Analysis of the Endangered Neritid Species Clithon retropictus: De Novo Assembly, Functional Annotation, and Marker Discovery.

Authors: So Young Park; Bharat Bhusan Patnaik; Se Won Kang; Hee-Ju Hwang; Jong Min Chung; Dae Kwon Song; Min Kyu Sang; Hongray Howrelia Patnaik; Jae Bong Lee; Mi Young Noh; Changmu Kim; Soonok Kim; Hong Seog Park; Jun Sang Lee; Yeon Soo Han; Yong Seok Lee
Journal: Genes (Basel) Date: 2016-07-22 Impact factor: 4.096

10. FOXL2 and DMRT1L Are Yin and Yang Genes for Determining Timing of Sex Differentiation in the Bivalve Mollusk Patinopecten yessoensis.

Authors: Ruojiao Li; Lingling Zhang; Wanru Li; Yang Zhang; Yangping Li; Meiwei Zhang; Liang Zhao; Xiaoli Hu; Shi Wang; Zhenmin Bao
Journal: Front Physiol Date: 2018-08-22 Impact factor: 4.566