Literature DB >> 30802266

Comparative analysis of the transcriptome of the Amazonian fish species Colossoma macropomum (tambaqui) and hybrid tambacu by next generation sequencing.

Fátima Gomes¹, Luciana Watanabe¹, João Vianez², Márcio Nunes², Jedson Cardoso^2,3, Clayton Lima², Horacio Schneider¹, Iracilda Sampaio¹.

Abstract

BACKGROUND: The C. macropomum is a characiform fish from the Amazon basin that has been hybridized with other pacu species to produce commercial hybrids, such as the tambacu. However, little is known of the functional genomics of the parental species or these hybrid forms. The transcriptome of C. macropomum and tambacu were sequenced using 454 Roche platform (pyrosequencing) techniques to characterize the domains of Gene Ontology (GO) and to evaluate the levels of gene expression in the two organisms.
RESULTS: The 8,188,945 reads were assembled into 400,845 contigs. A total of 58,322 contigs were annotated with a predominance of biological processes for both organisms, as determined by Gene Ontology (GO). Similar numbers of metabolic pathways were identified in both the C. macropomum and the tambacu, with the metabolism category presenting the largest number of transcripts. The BUSCO analysis indicated that our assembly was more than 40% complete. We identified 21,986 genes for the two fishes. The P and Log2FC values indicated significant differences in the levels of gene expression, with a total of 600 up-regulated genes.
CONCLUSION: In spite of the lack of a reference genome, the functional annotation was successful, and confirmed a considerable difference in the specificity and levels of gene expression between the two organisms. This report provides a comprehensive baseline for the genetic management of these commercially important fishes, in particular for the identification of specific genes that may represent markers involved in the immunity, growth, and fertility of these organisms, with potential practical applications in aquaculture management.

Entities: Chemical Disease Gene Species

Mesh：

Year: 2019 PMID： 30802266 PMCID： PMC6388931 DOI： 10.1371/journal.pone.0212755

Source DB: PubMed Journal: PLoS One ISSN： 1932-6203 Impact factor: 3.240

Background

Next Generation Sequencing (NGS) technologies have provided a range of potential markers for both model and non-model species through the discovery of new genes from an enormous database [1-3]. Over the past few years, NGS has been used increasingly to characterize the transcripts of a number of different species and tissues, in addition to contributing to the evaluation of gene expression in commercial lineages. Fish are especially important, given that they comprise a highly diversified group, with many economically important species, but little information on their functional genomics. The FLX (GS FLX) Genome sequencing system, which uses the 454 sequencing approach, is one of the most widely-used and successful approaches to the evaluation of the fish transcriptome [4-6]. The tambaqui (Colossoma macropomum) is an important fishery resource in the Amazon region, and is distributed naturally in both the Amazon and Orinoco river basins. Given its aquacultural performance, due to its rapid growth and robustness, the C. macropomum is currently the most common species raised on Brazilian fish farms [7, 8], and is becoming increasingly popular in other tropical countries. As the second largest South American freshwater fish, reaching 90 cm and 30 kg, the C. macropomum is especially suitable for tropical aquaculture, especially as it is also extremely fertile and robust, and can adapt easily to polyculture systems [7]. This species has also often been hybridized with other pacus, in particular Piaractus mesopotamicus [9, 10], producing a hybrid known as the tambacu. The tambacu results from the induced crossing of the female C. macropomum with the male P. mesopotamicus, and has been raised ever more intensively in recent years, replacing its parental species in tropical aquaculture operations [8]. The few comparative studies available for the C. macropomum and tambacu have focused on growth, genetics, biochemistry, and the immune system [11-13], while no comparative analyses of their functional genomics have yet been published. In fact, few data are available on the functional genomics of C. macropomum. However, Prado-Lima and Val [14] did analyze the transcriptome of the muscle of juvenile t C. macropomum, subjected to variation in temperature. In this experiment, significant differences were observed in the expression of specific groups of genes in responses to changes in the ambient conditions. Gomes et al. [15] characterized the miRNAs of the liver and muscle of juvenile C. macropomum, and identified a number of conserved miRNAs that were shared between the two types of tissue, albeit with distinct levels of expression. Studying 30 C. macropomum specimens, Martinez et al. [16] analyzed nuclear Single Nucleotide Polymorphisms (SNPs) using Next Generation Sequencing-based genotyping. Using the genotyping by sequencing method, Nunes et al. [17] identified a large number of SNPs in C. macropomum, and constructed a high-resolution genetic linkage map for a full-sibling family of 124 individuals, including their parents. Apart from these studies, there are no data on the transcriptome of the tambacu or any comparative analyses involving its ancestral species. The transcriptome is the initial product of genome expression and is the principal focus of the investigation of gene function [18, 19]. Thus, we compared the transcriptome of C. macropomum and tambacu to characterize the Gene Ontology (GO) domains and metabolic pathways, as well as evaluating the levels of gene expression in the two organisms. Two types of tissue–skin and muscle–were chosen for these analyses. The skin represents the main defensive barrier of a vertebrate, and presents unique biological processes related to its complex structure and cell composition [20, 21]. Skeletal muscle represents about half of the body mass of a fish, and is associated with the swimming capabilities of these animals [22]. The results of this study are important for the conservation of the species, and the markers may also have practical applications for management of farmed stocks.

Material and methods

Ethics statement

The samples analyzed in the present study were obtained from fish farms and collected with the permission of the Brazilian federal environmental, authorized by the Chico Mendes Institute for Biodiversity (agency ICMBio) through license number 12.773–1 granted to Iracilda Sampaio. The animal experimentation conducted was in accordance with the National control Board of Animal Experimentation (CONCEA), Brazilian federal law no. 11.794, and the International guidelines defined by the U.K. Animals (Scientific Procedures) Act, 1986. The specimens were then euthanized by immersion in anesthetic Benzocaine Hydrochloride, as recommended by the Brazilian Federal Council of Veterinary Medicine (federal law 5,517, resolution CFMV no. 1000/2012). All efforts were made to minimize the intensity and duration of distress and exposure to pain.

Sample collection

The C. macropomum and tambacu specimens were collected from the Andrera fish farm in Bonito, in Pará state, Brazil. All specimens were identified prior to collection, based on morphological traits, such as the shape of the body, and the number of rakers on the first gill arch. However, given the considerable morphological similarities between the C. macropomum and its hybrids, genetic tests were conducted to confirm the identity of the specimens. As result, eight adults from C. macropomum and tambacu were selected, with both groups presenting a mean body length of 0.6 ± 0.2 m and weight of 3 ± 0.3 kg. The specimens were euthanized for the removal of fragments of the skin and muscle. The tissue samples were stored immediately in 1 mL of RNAlater (Ambion, Life Technologies, USA) at 4°C for 24 hours and then transferred to an ultrafreezer at -80°C. To ensure the identification of specimens (pure vs. hybrids), fragments of muscle were also stored in 100% ethanol at -20°C for DNA extraction. It is important to note that the C. macropomum is a rheophilic species that is unable to breed in captivity [7]. These fish present obvious sexual dimorphism only during the breeding period, when the females acquire a soft, rounded abdomen, as well as a prominent, reddish urogenital papilla, while the male expel fluid semen after gentle abdominal massage [23]. As the specimens were collected during the non-breeding period, none of the individuals presented these characteristics, which impeded the differentiation of males and females.

DNA extraction, PCRs and sequencing

The total DNA was isolated using phenol-chloroform and precipitation in ethanol [24]. The integrity of the DNA samples was verified by electrophoresis in 1% agarose gel stained with GelRed, followed by visualization under UV light. The isolation and amplification of the mitochondrial Control Region (D-loop) was based on Polymerase Chain Reaction (PCR), as described by Gomes et al. [13]. The products of the amplification were sequenced using the Big Dye Terminator v. 3.1 Cycle Sequencing kit (Applied Biosystems), followed by electrophoresis in an automatic sequencer (Genetic Analyzer 3500). The sequences were then edited and aligned in BIOEDIT 7.2.5 [25]. The identification of the mitochondrial maternal lineage was based on comparisons with the sequences of the native species provided by Gomes et al. [13] and the sequences available in GenBank (Colossoma macropomum DQ480074 and Piaractus mesopotamicus AF283959).

Multiplex PCR

The C. macropomum and hybrids individuals were also identified by multiplex PCR using the nuclear α-tropomiosin, as described by Gomes et al. [13]. We, then, used the universal primer Tropo Serra F (5’-GAGTTGGATCGGGCTCAG-3’) and species-specific primers (Tropo Cm R 5’-ATACAACAATGCCATCGCT-3’ and Tropo Pm R 5’-CTTCAGCTGGATCTCCTGA-3’).

Extraction of the total RNA

The total RNA was isolated using the PureLink RNA mini kit (Ambion, Life Technologies, USA) according to the manufacturer’s instructions. Equal concentrations and volumes of the tissue sampled from different individuals were combined to produce four RNA pools, separated by organism and type of tissue. The samples were treated with RNAse-free DNAse (Invitrogen, CA, USA) to remove any DNA contaminants. The amount of RNA extracted was determined using a PicoDrop spectrophotometer (Picodrop, United Kingdom), and the quality, with a Bioanalyzer 2100 (Agilent Technologies). The integrity of the samples was confirmed by electrophoresis in 1.5% agarose gel. All the RNA samples were treated with RiboMinus (Invitrogen, CA, USA), according to the manufacturer’s instructions, for the selective depletion of the ribosomal RNA (rRNA) transcripts from the total RNA. The RNA samples were then stored at –80ºC prior to the pyrosequencing reaction.

Construction of the cDNA library

Four C. macropomum and tambacu libraries were established. Complementary DNA (cDNA) libraries were constricted in two steps, firstly, by the fragmentation of the total RNA and then the synthesis of the double-stranded cDNA. The initial quantity of RNA used in the fragmentation was approximately 600 ng/μ, with zinc chloride (ZnCl2) and Tris-HCL being added to initiate the fragmentation process. The integrity of the fragments was verified in the Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA, USA), using the RNA 6000 Pico kit. The double strands of the cDNA were then obtained using the cDNA Synthesis System kit (Roche), with the following steps: denaturation of the RNA by the addition of a randomic primer, synthesis of the first cDNA strand, synthesis of the second cDNA strand, and the purification of the double cDNA strands. The irregular ends of the cDNA fragments generated by this process were then repaired. The cDNA fragments were quantified using a TBS 380 QuantiFluor fluorometer (Promega, USA), according to the manufacturer’s instructions. The size of the cDNA fragments was verified in a bioanalyzer (Agilent Technologies, Santa Clara, CA, USA) using the High Sensitivity DNA chip.

Amplification of the cDNA library and sequencing

The cDNA fragments of all the samples (C. macropomum and tambacu) were pyrosequenced in the 454 GS FLX Titanium platform (Roche, Branford, CT, USA). The emulsion PCR (emPCR) was based on the enrichment, purification, and preparation of the Pico Titer Plate (PTP), conducted according to the manufacturer’s instructions. All the libraries were sequenced five times, with each run using a PTP with two regions. The raw sequence reads have been submitted in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) database (Bioproject: PRJNA353388/ SRX2375124, SRX2375122, SRX2375120, SRX2375119).

Computational analysis

Pre-processing

The raw Sff files were converted into FastQ in Geneious v. 7.1.7 [26], and the adapters were removed. The ribosomal reads were then filtered using the default parameters and compared with the database available in SortMeRNA v. 2.0 [27]. The subsequent non-rRNA reads were separated by size (100 bps), quality (>20) and homopolymers (<12) in Mothur v. 1.35.1 [28].

De novo assembly

The contigs were assembled in Newbler v. 3.0 [29] and Mira v. 4.0.2 [30]. Both packages adopt the overlap-layout-consensus (OLC) algorithm [31]. The contigs obtained for each tissue and each assembler were clustered in CD-HIT-EST v. 4.6.4 [32] to remove redundant elements based on a minimum overlapping length of 60 bps and minimal overlapping similarity of 95%. A reference transcriptome was generated from the entire set of assembled contigs.

Assessment of transcriptome assembly completeness

The BUSCO v3 (Benchmarking Universal Single-Copy Orthologs) program [33] was used to evaluate transcriptome completeness. The transcriptomes were compared with a predefined set of eukaryote and metazoan single-copy orthologs from the OrthoDB v9 database [34], assuming a cut-off of 1E-3. This analysis was based on the complete database of the transcripts of all the samples (skin and muscle) from both organisms.

Functional annotation

The contigs were compared with non-redundant protein sequences of the Swissprot protein (Swiss-Prot) dataset using BlastX, assuming a cut-off of 1E-3 and similarity of at least 30%. The Swiss-Prot is a highly accurate protein database with high rates of annotation, reduced redundancy, and high levels of integration with other datasets [35]. The XML outputs from BlastX were annotated in Blast2GO v. 3.3 [36], being assigned to one of the three main categories of Gene Ontology (GO): Biological Processes (BP), Molecular Function (MF), and Cell Components (CC). The contigs were then mapped according to their known metabolic and molecular pathways in the vertebrates, using the Kyoto Encyclopedia of Genes and Genomes (KEGG) [37].

Differential expression

The total reads of each library were mapped in STAR 2.5 [38] based on the reference transcriptome (the entire set of assembled contigs). The individual counts of the transcripts identified in each tissue were obtained in HTSeq 0.6.1 [39], and were used as input to estimate the abundance of transcripts in RPKM (Reads per Kilobase per Million mapped reads) in the R statistical package. The read counts were normalized using the RPKM function of the EdgeR package [40]. The online software IDEG6 (p value <0.001) [41] was used to identify the differentially expressed genes (DEGs), confirmed by Fold-change log 2 (Log2FC) values ≤ -1, ≥ 1. The Heatmap graph was based on the RPKM values of the DEGs, as identified by the IDEG6 (≤ 0.001) and Log2FC values. The CIM function of the mixOmics package [42] was used to run the correlation analysis, and all transcripts that presented read numbers of zero in at least one library were removed from the analysis. The Volcano Plot was based on the significant p (≤ 0.001) and Log2FC (≤ -1, ≥ 1) values. The heatmaps were produced in Heatmap.2 [43] and the Volcano plots were obtained in Ggplot [44] from the R package.

Results and discussion

Genetic identification of the organisms

The morphological similarities between the C. macropomum and its hybrid forms have long been an impediment on the precise identification of the species reared in tropical aquaculture. Therefore, genetic markers are required to ensure the reliable identification of specimens. The markers used in the present study were the mitochondrial D-loop sequences and the multiplex PCR of the nuclear α-Tropomiosin gene. The multiplex PCR is widely used as a highly effective, low-cost tool for the rapid identification of interspecific hybrids and parent species [13, 45]. As the mitochondrial DNA is inherited maternally in most eukaryotes [46, 47], the sequencing of the D-loop confirmed that the female parents of all the specimens analyzed were C. macropomum (S1 Table). The hybrids were discriminated from the C. macropomum by multiplex PCR, using the parental species (C. macropomum and P. mesopotamicus) as control. The results of the PCR yielded distinct band profiles for P. mesopotamicus, with fragments of 300 base pairs (bps), and C. macropomum, which presented bands of only 200 bps. The multiplex PCR produced a single band of 200 bps in eight of the 16 specimens analyzed, confirming that they belong to C. macropomum (Fig 1A). The other eight specimens had two distinct bands (one of 300 bps and the other of 200 bps), confirming their identification as tambacu hybrids. The 300-bp fragment is inherited from the P. mesopotamicus father and the 200-bp fragment from the C. macropomum mother (Fig 1B).

Fig 1

Molecular identification of the organisms based on multiplex PCR of α–tropomiosin gene.

Fig 1A. Samples 1 and 2 represent the control species P. mesopotamicus (300 bps) and C. macropomum (200 bps), respectively. Samples 3–10 represent the pure C. macropomum individuals sampled in the present study. Fig 1B. Samples 1 and 2 represent the control species P. mesopotamicus (300 bps) and C. macropomum (200 bps), respectively. Samples 3–10 represent the hybrid tambacu individuals sampled in the present study.

Molecular identification of the organisms based on multiplex PCR of α–tropomiosin gene.

Results of the sequencing and assembly

A total of 8,188,945 raw reads were obtained from the four libraries derived from the samples of the skin and muscle tissue of C. macropomum and the tambacu, using the Roche 454 platform (Table 1). After filtering, the remaining reads obtained from each type of tissue were assembled separately using MIRA and Newbler, providing a total of 572,760 contigs. The dataset was analyzed in CD-Hit-Est to remove redundant contigs and putative assembly errors, resulting in a final number of 400,845 contigs (Table 1). The use of two assemblers enriched both the number and the size of contigs, and is a successful, widely-used strategy [48, 49]. The contigs ranged in size from 200 to 49,961 bps. The largest contig was assembled from the C. macropomum muscle samples (S1 Fig). The N50 values for each type of tissue and other assembly metrics are shown in Table 1.

Table 1

De novo assembly and contigs obtained for each type of tissue in the C. macropomum and tambacu samples.

Organism	C. macropomum (Skin)	tambacu (Skin)	C. macropomum (Muscle)	tambacu (Muscle)
Total Reads	2,359,900	1,902,668	2,043,773	1,882,604
Filtered Reads	778909	430979	641023	843675
Mean Quality	32.4	32.82	30.74	31.08
Average Length Reads	442.5	363.54	409.82	401.16
De novo assembly	NEWBLER MIRA	NEWBLER MIRA	NEWBLER MIRA	NEWBLER MIRA
Assembled Reads	225,669 276,699	57,499 77,521	282,652 251,282	329,908 336,303
Total Contigs	167,404 94,624	62,468 26,030	42,235 31,030	92,284 56,685
Largest Contig	16,759 10,530	7,664 5,909	49,960 22,203	43,182 49,465
N50	826 971	743 858	837 846	796 836
Average Length	817 910.5	753 776.8	833 768.5	817 786.8
Contigs after CD-Hit-Est	179,830	66,583	47,985	106,447
Annotated Contigs	18,200	7,932	11,316	20,875

Assessment of the completeness of the transcriptome

The completeness of the transcriptome assembly of the C. macropomum and tambacu samples was assessed through comparison with the Benchmarking Universal Single-Copy Orthologs (BUSCO) for the metazoan and eukaryote gene sets. The genes sets were classified according to BUSCO parameters (Table 2) as either Complete (C), Single (S), Duplicated (D), Fragmented (F) or Missing (M). In this analysis, including the total set of transcripts of C. macropomum and the tambacu, 41.1% of the complete genes were observed in comparison with the metazoan gene set, and 42.3% in comparison with the eukaryote set. If the fragmented genes are included here, the percentage of genes recovered increases to 75.5% in comparison with the metazoans and 77.9% for the eukaryotes. On the other hand, 24.55% of the genes were classified as missing in comparison with the metazoan set, and 22.1% in comparison with the eukaryote set, which may reflect biological innovations in the C. macropomum and tambacu [33]. Although this cannot be evaluated due to the lack of data on the complete genome of these organisms, or even this result may be a consequence of problems in the data assembled.

Table 2

BUSCO statistics for the completeness of the C. macropomum and tambacu transcriptome assembly in comparison with the metazoan and eukaryote gene sets.

BUSCO	Metazoan Odb9 (%)	Eukaryote Odb9 (%)
Complete	41.1	42.3
Single	28.9	29.4
Duplicated	12.2	12.9
Fragmented	34.4	35.6
Missing	24.5	22.1

Functional annotation

When the contigs from the present study were compared to the Swiss-Prot database using BlastX, 100,298 (25%) of the transcripts having significant hits. Of this total, 58,322 contigs were annotated (S2 Fig). Overall, 19.4% of the transcripts annotated in C. macropomum referred to the muscle, and 31.2% to the skin. In the tambacu 35.8% referred to the muscle and 13.8% to the skin. The top hits in BlastX revealed that C. macropomum and its hybrid shared transcripts with a number of other species of fish, amphibian, and mammal (S3 and S4 Figs).

Gene Ontology (GO)

The Gene Oncology (GO) project provides a useful bioinformatics tool for the representation of genes or their products in different species and are especially informative for large datasets [50, 51]. In the present study, the GO analysis of the three principal domains, Biological Processes (BP), Molecular Function (MF), and Cell Components (CC), revealed a large number of functions for C. macropomum (15,634 GO terms) and tambacu (15,140 GO terms). The BP domain had the highest number of functions, with 10,491 (67.1% of the total) terms in the C. macropomum and 10,104 (66.7%) in the tambacu. This domain include a number of distinct systems, ranging from metabolic pathways to the physiological and behavioral features of an organism [52]. This domain was followed by MF, with 3,675 (23.5%) terms in the C.macropomum and 3,584 (23.7%) in the tambacu, while the CC had 1,469 (9.4%) in the C. macropomum and 1,452 (9.6%) in the tambacu (S5 Fig). Similar results were obtained for the C. macropomum and tambacu in relation to the main functions observed in the BP, MF, and CC domains (Fig 2). The GO term “cellular process” (GO:0009987) was predominant in the BP domain (>12% in the four libraries), whereas in the MF domain, “binding” (GO:0005488) was the most dominant, with nearly half of the transcripts, followed by “catalytic activity” (GO:0003824), responsible for another quarter. In the case of the CC domain, the “cell” (GO:0005623) and “organelle” (GO:0043226) were the most common (45% of the transcripts in the four libraries). The distribution of the most expressed GO terms from each domain were similar to those reported in the transcriptomes of other vertebrate species [53-55].

Fig 2

Top 20 GO terms.

Processes with the highest number of transcripts within Biological Processes (BP), Molecular Function (MF) and Cell Components (CC) domains in the muscle and skin of C. macropomum and tambacu.

Top 20 GO terms.

Processes with the highest number of transcripts within Biological Processes (BP), Molecular Function (MF) and Cell Components (CC) domains in the muscle and skin of C. macropomum and tambacu.

Kyoto Encyclopedia of Genes and Genomes (KEGG)

The KEGG analysis provides plots of the biochemical pathways, and is useful for the systematic analysis of gene function, focusing in particular on the gene and the molecular network [56]. In the present study, we combined the tissues from each organism, resulting in 132 signaling pathways for the C. macropomum and 133 for the tambacu, corresponding to 8276 (1337 enzymes) and 16,366 (1334 enzymes) sequences, respectively (S2 and S3 Tables). The pathways were classified in four categories: Metabolism, Genetic information processing, Environmental information processing, and Organismal system (Fig 3). The largest numbers of transcripts (7,620 in C. macropomum and 15,281 of those in the tambacu) were allocated to the Metabolism category. Within this category, a number of differences were found between the two organisms in the distribution of the subcategories. With 22.7% of the C. macropomum transcripts being involved with carbohydrate metabolism and 12.8% with lipid metabolism, whereas in the tambacu, 19.0% of the transcripts were involved with nucleotide metabolism and 17.7% with the metabolism of cofactors and vitamins.

Fig 3

Pathway assignment based on the Kyoto Encyclopedia of Genes and Genomes (KEGG).

(A) Metabolism categories, (B) Genetic information processing, (C) Environmental information processing, and (D) Organismal systems.

Pathway assignment based on the Kyoto Encyclopedia of Genes and Genomes (KEGG).

(A) Metabolism categories, (B) Genetic information processing, (C) Environmental information processing, and (D) Organismal systems. We identified some differences in the most common pathways in the C. macropomum and tambacu, although in both organisms, the purine metabolism pathways were predominant, with 689 transcripts in the C. macropomum, and 2850 in the tambacu (S6 and S7 Figs). Previous reports on the fish transcriptome produced similar results, such as those recorded for the hybrid pufferfish (Takifugu rubripes x Takifugu flavidus) [57], for example, and the closely-related species, Piaractus mesopotamicus [6] and Piaractus brachypomus [58].

Differential expression

The 454 FLX Genome sequencer (GS FLX) is suitable for the de novo sequencing of the transcriptome of non-model organisms [59-61], as confirmed in a number of previous studies [5, 6, 62]. As the organisms analyzed did not have a reference genome, we prepared a reference transcriptome, based on the entire set of assembled contigs. These sequences were mapped in the Star software, and counted in Htseq. Using rigorous parameters, we obtained a mapping index of 64.6% for C. macropomum, and 65.4% for the tambacu. The present study provides the first comparative analysis of the transcriptomes of C. macropomum and the tambacu. Thus, based on the 400,845 contigs compiled in the four libraries, 21,986 genes were identified, including isophorms (S4 Table), with 16,632 being identified for C. macropomum and 14,835 for tambacu. The dataset from the muscle from both organisms comprised 13,951 genes, with 2,352 expressed only in the C. macropomum and 6,630 only in the tambacu. The dataset for the skin was composed of 16,857 genes, of which, 9,371 were expressed only in the C. macropomum and 2,772 only in the tambacu. The differences in the number of exclusive and shared genes support the hypothesis that the expressed transcripts depend on the development stage or genetic background of the sample specimens, given that genes may be expressed differentially in the same tissue [63, 64]. The results of the heatmap analysis (Fig 4) indicate the clustering of tissues, with one clade formed by the muscle transcriptome of the C. macropomum and tambacu, and the other by the skin samples of the two organisms. The similar size of branches in both clades indicates equivalent patterns of divergence in both tissues in each organism. Each library was also characterized by a number of different groups of distinct and overexpressed genes, considering only the DEGs with RPKM <100 (all the genes included in the analysis are shown in S5 Table). The IDEG6 analysis (p < 0,001) together with the Log2FC values (≥ 1 and ≤ -1) revealed a number of differentially-expressed genes (see the DEGs in S6 Table). A total of 168 up-regulated genes were found in the library of the C. macropomum muscle, 222 in the tambacu muscle, 95 in the C. macropomum skin and 115 in the tambacu skin (Figs 5 and 6). These results reinforce the differences found in gene expression among organs, tissues or even different types of cell from the same organ [65].

Fig 4

Heatmap graph.

Constructed with RPKM values of differentially expressed genes for C. macropomum and tambacu (P Value ≤ 0,001; Log2FC ≥ 1 e ≤ -1).

Fig 5

Volcano plot graph.

Fig 6

Volcano plot graph.

Comparison of C. macropomum skin DEGs (P ≤0,001; Log2FC ≥1) with tambacu skin (P ≤ 0,001; Log2FC ≤ -1), in red color the transcripts with significant values and in color black the no significant.

Heatmap graph.

Constructed with RPKM values of differentially expressed genes for C. macropomum and tambacu (P Value ≤ 0,001; Log2FC ≥ 1 e ≤ -1).

Volcano plot graph.

Comparison of C. macropomum muscle DEGs (P ≤0,001; Log2FC ≥1) and tambacu muscle (P ≤ 0,001; Log2FC: ≤ -1), in red color the transcripts with significant values and in color blue the no significant. Comparison of C. macropomum skin DEGs (P ≤0,001; Log2FC ≥1) with tambacu skin (P ≤ 0,001; Log2FC ≤ -1), in red color the transcripts with significant values and in color black the no significant.

Genes expressed differentially in the muscle of the C. macropomum and tambacu

The largest contig with functional annotation (54,960 bps) was observed in the muscle of the C. macropomum, while the same transcript in the tambacu was slightly shorter, at 54,465 bps, but was nevertheless the most abundant protein (in reads) in this organism. This transcript refers to the gene involved in the synthesis of Titin (ttn), which is the largest and the most abundant protein in the skeletal and cardiac muscle of humans [66, 67]. In C. macropomum, the most expressed gene was the Myosin heavy chain fast skeletal muscle (myss), which is related to muscle contraction, and was also up-regulated in this species (Table 3).

Table 3

Some of the genes with the highest level of expression identified in the muscle and skin of C. macropomum and the tambacu.

Genes	Length	Over-Express	tissue	RPKM C.macropomum	RPKM tambacu	FC	Log2FC	P Value
Myosin heavy chain, fast skeletal muscle	7110	C. macropomum	Muscle	12424,47	5090,61	2,44	1,28	<0,001
Parvalbumin beta	2857	C. macropomum	Muscle	10822,37	832,01	13,007	3,71	<0,001
Beta-enolase	1650	C. macropomum	Muscle	3422,18	722,15	4,738	2,24	<0,001
Actin, alpha skeletal muscle	1135	C. macropomum	Muscle	1889,28	322,40	5,859	2,55	<0,001
Myosin-2	1379	C. macropomum	Muscle	768,64	35,08	21,905	4,45	<0,001
ADP/ATP translocase 1	1398	tambacu	Muscle	146,74	1213,59	0,12	-3,04	<0,001
Elongation factor 1-alpha	2219	tambacu	Muscle	356,60	962,20	0,37	-1,43	<0,001
4F2 cell-surface antigen heavy chain	2078	tambacu	Muscle	21,15	852,84	0,02	-5,33	<0,001
Alpha-crystallin B chain	2344	tambacu	Muscle	20,83	845,08	0,02	-5,34	<0,001
Tripartite motif-containing protein 54	1689	tambacu	Muscle	52,84	668,71	0,07	-3,66	<0,001
LINE-1 retrotransposable element ORF2 protein	3900	C.macropomum	Skin	558,90	256,92	2,17	1,12	<0,001
Collagen alpha-1 (I) chain	5865	C.macropomum	Skin	511,16	33,03	15,4	3,95	<0,001
Collagen alpha-3 (VI) chain	9460	C.macropomum	Skin	334,63	28,08	11,9	3,57	<0,001
LINE-1 reverse transcriptase homolog	3533	C.macropomum	Skin	307,53	117,51	2,61	1,38	<0,001
Cation channel sperm-associated protein	4478	C.macropomum	Skin	85,37	9,88	8,63	3,10	<0,001
Keratin, type II cytoskeletal 8	2829	tambacu	Skin	160,02	1473,49	0,11	-3,21	<0,001
Keratin, type I cytoskeletal 13	1687	tambacu	Skin	47,70	826,93	0,05	-4,11	<0,001
Ictacalcin	809	tambacu	Skin	111,91	650,07	0,17	-2,53	<0,001
Desmoplakin	2927	tambacu	Skin	18,33	455,80	0,04	-4,63	<0,001
Keratin, type I cytoskeletal 50 kDa	1992	tambacu	Skin	10,10	333,48	0,03	-5,04	<0,001

Of the 168 up-regulated genes in the muscle of C. macropomum, those of the myosin gene family are the most prominent, including Myosin-1 (myh1), Myosin-2 (myh2), Myosin-3 (myh3), Myosin-4 (myh4), Myosin-7 (myh7), Myosin-7B (my7b), Myosin-8 (myh8), and Myosin-13, myh13 (Table 3, S6 Table). This gene family encodes motor proteins, which are responsible for the contraction mechanism, as well as being involved in several mechanochemical processes in eukaryotes [68, 69]. In addition to this gene family, Actin (acts/acta1) was also expressed prominently in the muscle, and is particularly active in the eukaryote cell, where it has a synergistic effect with myosin in the control of mobility and muscle contraction [70], as well as the genes that encode Myomesin 1 (myom1) and Myomesin 2 (myom2), the main components of M-line [71] (Table 3, S6 Table). It is interesting to note that all these up-regulated genes in C. macropomum are related to muscular contraction. This may be related to the fact that this species is rheophilic, and needs to make an enormous effort to swim upstream, against the current, to mature sexually. The individuals raised in captivity (as those used in the present study) can to mature sexually, although the final stage of the spawning process does not occur, given the absence of the external factors necessary to trigger the reproductive process [7]. The genes responsible for the expression of catalytic proteins were also abundant in the DEGs of the muscle of the C. macropomum, including the enzymes Beta Enolase (enob/ eno3), Triosephosphate Isomerase B (tpisb/tip1b), NAD(P)H dehydrogenase (quinone) 1 (nqo1), Enolase (eno), and Phosphoglucomutase-1 (pgm1), most of which are related to the synthesis or degradation of glucose and the metabolism of carbohydrates [72-74] (Table 3, S6 Table). This finding may account for the predominance of the carbohydrate metabolism subcategory identified in the KEGG analysis for C. macropomum. The results obtained for the transcriptome of the tambacu muscle included several up-regulated genes related to processes such as aminoacid metabolism, stress, and catalysis. In the KEGG analysis of the tambacu muscle, nucleotide metabolism was the most common subcategory (involving 19% of the transcripts) in the “metabolism” category. This finding can be accounted for by the presence of a number of differentially-expressed genes related to the metabolism of aminoacids, such as Elongation factor 1-alpha (eef1a), 4F2 cell-surface antigen heavy chain (4f2/ slc3a2), and Large neutral amino acids transporter small subunit 4 (lat4/ slc43a2) [75-77] (Table 3, S6 Table). The DEGs observed in the muscle of the tambacu also included genes of the ubiquitin family and similar genes, such as Ubiquitin-40S ribosomal protein S27a (rps27a), F-box only protein 32 (fbxo32), AN1-type zinc finger protein 5 (zfand5) (Table 3, S6 Table). These genes are involved primarily in the regulation of proteins [78]. Genes of the Heat shock protein (HSP) family were also prominent in the up-regulated genes of the muscle of the tambacu: Heat shock protein beta-1 (hspb1), Heat shock protein beta-11 (hspb11/hspbb), Heat shock 70 kDa protein (hsp70), Heat shock 70 kDa protein 1 (hsp71), Heat shock protein beta-6 (hspb6), Heat shock protein beta-8 (hspb8) (S6 Table). The expression of this protein family is related mainly to environmental and physiopathological stress [79, 80]. The Alpha-crystallin B (cryab) proteins are also prominent in this group (Table 3), being characterized by their presence in a number of different types of tissue, including the skeletal muscle [81, 82]. Like other HSPs, these proteins act as molecular chaperones to prevent the incorrect association and aggregation of the polypeptide chain [83, 84]. The presence of a number of HSPs with prominent over-expression in the muscle of tambacu is noteworthy, given that the expression of these genes is usually a response to stress factors, including environmental conditions and disease [79, 80, 85]. The marked expression of the genes related to these proteins potentially indicates environmental effects and adaptive processes in the hybrid tambacu. In fact, the HSPs are helpful for the repair of protein damages, given that over-expression is usually sufficient to protect the cell from exposure to stressful conditions, such as high temperatures [80, 86]. Some transcripts related to processes such as catalytic action, defense, and oxidative stress were also relatively common in the DEGs of the tambacu muscle, including AMP desaminase 3 (ampd3), AMP desaminase 1 (ampd1) Phosphoglycerate mutase 2 (pgam2), Sestrin-1 (Sesn-1), and Thioredoxin-interacting protein (txnip) [Proc Nat Acad Sci U S A. 1994 ">87-89] (S6 Table).

Genes expressed differentially in the skin of the C. macropomum and tambacu

The skin is a complex type of tissue that plays a key role in sensorial activities, thermoregulation, hormone metabolism, and defense [20, 90, 91]. In addition to these functions, the morphological similarities between the C. macropomum and the tambacu also justifies the analysis of this tissue. Overall, 210 DEGs were identified in the skin of C. macropomum and tambacu. In the skin of the C. macropomum, the transcripts with the largest number of mapped reads corresponded to the transposable elements: Transposable element Tcb1 transposase (tcb1), Transposable element Tc1 transposase (tc1a), and Transposable element Tcb2 transposase (tcb2). These elements are DNA segments capable of moving through the genome of a cell (“jumping genes”), thus affecting the evolution and genome composition of plants and animals, inducing processes ranging from pathologies to genetic differences in these organisms [92-94]. As in the muscle, the genes that encode collagen proteins were also found to be over-expressed in the skin of C. macropomum, including: Collagen alpha-1(I) chain (co1a1), Collagen alpha-3(VI) chain (co6a3) and Collagen alpha-2(I) chain (co1a2). These proteins are abundant in animals and are important for the molecular architecture and the shape and mechanical properties of tissues [95, 96]. Another important gene family, that should be mentioned, is the Cation channel sperm-associated protein (catsper1/ctsr1 and catsper3/ctsr3). Despite presenting RPKM values lower than 100 (Table 3, S6 Table), this gene group was up-regulated in the skin of C. macropomum. However, the contigs equivalent to these transcripts were not assembled in the tambacu. The proteins encoded by this gene form a complex ion channel that permits the Ca2+ to enter the flagellum of the spermatozoa, which ensures their hyperactivation and plays a key role in male fertility [97]. The lack of this transcript in the tambacu may be related to the infertility or subfertility of these hybrids. Although some authors have observed that tambacu males produce small amounts of spermatozoa following abdominal pressure, which indicates that these fish can be crossed with their parent species (unpublished data), it remains unclear whether the tambacu is effectively fertile. In addition, the presence of up-regulated transcripts of catsper1 and catsper3 in the C. macropomum skin may represent a potential genetic marker for sex differentiation of the specie, which would be important, given the lack of sexual dimorphism in these fish during non-breeding periods, as occurred in the present study. With regard to the differential expression of genes in the skin of the tambacu, the transcript with the largest number of mapped reads was related to the keratin, type II cytoskeletal 8 (krt8) protein, which was also present in the DEGs. In fact, other transcripts that encode proteins of this large keratin family (classes I and II) were also found in the DEGs of the tambacu skin (Table 3, S6 Table). The products of this gene family are structural proteins of the epithelial cells, which are expressed differentially according to the cell layer and tissue. The functions of these proteins have not yet been determined fully, but they are known to be involved in a number of processes, including structural support, cytoarchitecture, the response to stress, the regulation of signaling pathways related to apoptosis and protein synthesis, the distribution of organelles, and tissue repair and proliferation [98-100]. In contrast with the pattern observed in the C. macropomum, only a single gene of the collagen family–the collagen alpha-1(XVII) chain (Col17a1)–was observed in the DEGs of the tambacu skin. This transmembrane protein is a structural component of the hemidesmosomes in the epithelial cells and is associated with the keratinocytes [101]. One other important gene is the dsp, which encodes desmoplakin (Table 3). This is a key desmosome protein, which plays a major role in the attachment, assembly and stabilization of intermediate filaments, in addition to interacting with keratin proteins [102, 103]. Some authors have suggested that hybrids are more resistant and grow faster than their parent species [11, 104]. However, few studies have demonstrated any actual advantages in terms of increased productivity, and in fact, some data indicate that hybrids may be more susceptible to parasites [105, 106]. In the present study, major differences were found in the specificity and levels of gene expression in the muscle and skin of the C. macropomum and tambacu, which may contribute to the identification of the specific genes involved in the immunity, growth and fertility of these organisms. These findings raise a number of important questions that must be settled in order to confirm whether hybrids have any real advantages over their parent species, and whether they constitute a threat to the stocks of native species. We would further emphasize here the various up-regulated genes linked to stress found in the tambacu, which may be relevant to the farming of this hybrid, given that the stress of the captive environment may have a negative effect on the development of the fish. Given this, it will be very important to identify the factors that provoke this condition, not only for the organism itself, but also fish farming (the region in which this hybrid is farmed should also be evaluated).

Conclusion

This is the first comparative analysis of the C. macropomum and its hybrid form, the tambacu. The results indicated that the NGS platform used in the study is a powerful tool for the identification of genes and molecular markers in non-model species, in addition to confirming that the combination of distinct assembly methods can enrich the dataset considerably. Despite of the lack of a reference genome, we obtained a satisfactory number of transcripts with functional annotation, associated typically with biological processes, with the skin presenting the greatest number of GO terms. The KEGG analysis also identified similarities in the metabolic pathways of the two organisms, with several transcripts in the metabolism category. There was a degree divergence in the number of exclusive genes, and the levels of gene expression and function between the C. macropomum and tambacu tissues. However, a significant number of genes remained uncharacterized, and are also of particular interest for future studies on the physiology, conservation, and genetic improvement of the native species, as well as the management of hybrid stocks. This research is thus an important contribution to the investigation of genes that contribute to potential advantages or disadvantages in the productivity of the hybrids, their implications for the conservation and management of C. macropomum, and provide a valuable tool for future research in functional genomics.

Distribution of the transcript lengths recorded in the C. macropomum and the tambacu.