| Literature DB >> 34072576 |
Javier Cordoba1, Emilie Perez1,2, Mick Van Vlierberghe2, Amandine R Bertrand2, Valérian Lupo2, Pierre Cardol1, Denis Baurain2.
Abstract
Euglena gracilis is a well-known photosynthetic microeukaryote considered as the product of a secondary endosymbiosis between a green alga and a phagotrophic unicellular belonging to the same eukaryotic phylum as the parasitic trypanosomatids. As its nuclear genome has proven difficult to sequence, reliable transcriptomes are important for functional studies. In this work, we assembled a new consensus transcriptome by combining sequencing reads from five independent studies. Based on a detailed comparison with two previously released transcriptomes, our consensus transcriptome appears to be the most complete so far. Remapping the reads on it allowed us to compare the expression of the transcripts across multiple culture conditions at once and to infer a functionally annotated network of co-expressed genes. Although the emergence of meaningful gene clusters indicates that some biological signal lies in gene expression levels, our analyses confirm that gene regulation in euglenozoans is not primarily controlled at the transcriptional level. Regarding the origin of E. gracilis, we observe a heavily mixed gene ancestry, as previously reported, and rule out sequence contamination as a possible explanation for these observations. Instead, they indicate that this complex alga has evolved through a convoluted process involving much more than two partners.Entities:
Keywords: co-expression network; database contamination; gene expression; kleptoplastidy; ontology network; taxonomic analysis; transcriptional regulation; transcriptome assembly
Mesh:
Year: 2021 PMID: 34072576 PMCID: PMC8227486 DOI: 10.3390/genes12060842
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Representation of the collected data and overview of the experimental design. Exp. Code: letter assigned to each experiment (one letter per study). Study Acc.: public accession number of the BioProject. Sample Code: first letter corresponds to the experiment, first digit to experimental conditions of the samples, and second digit (if any) to the replicates. Run Acc.: public accession number of read FASTQ files. Temp.: estimated Celsius degrees of cell culture temperature. Medium: type of cell culture medium, rich (R) or mineral (M) plus carbon source (+C). Light: estimated light experimental conditions, darkness (D), low-light (LL) and high-light (HL). Shaking: rpm of shaker incubator. Cult. Cond.: trophic regime, fermentative (F), heterotrophic (H), phototrophic (P) or mixotrophic (M). Harvest Phase: development stage of the culture when collected, exponential phase (Exp) or stationary phase (Stat).
| Exp. Code | Study Acc. | Sample | Run Acc. | Temp. | Medium | Light | Shaking | Cult Cond. | Harvest Phase | Reference |
|---|---|---|---|---|---|---|---|---|---|---|
| A | PRJNA310762 | A.1.1 | SRR3159774 | 25 | R + C | D | 0 | H | Exp | [ |
| A.1.2 | SRR3159775 | 25 | R + C | D | 0 | H | Exp | |||
| A.1.3 | SRR3159776 | 25 | R + C | D | 0 | H | Exp | |||
| A.2.1 | SRR3159777 | 25 | R + C | LL | 0 | M | Exp | |||
| A.2.2 | SRR3159778 | 25 | R + C | LL | 0 | M | Exp | |||
| A.2.3 | SRR3159779 | 25 | R + C | LL | 0 | M | Exp | |||
| B | PRJEB10085 | B.1 | ERR974915 | 21 | M + C | LL | 0 | M | Stat | [ |
| B.2 | ERR974916 | 30 | R+C | D | 200 | H | Stat | |||
| C | PRJNA298469 | C.0 | SRR2628535 | 25 | M | LL | 0 | M | Stat | [ |
| D | PRJNA289402 | D.0 | SRR3195326 | 26 | R+C | HL | 120 | M | Stat | [ |
| D.1.1 | SRR3195327 | 26 | R+C | HL | 120 | M | Stat | |||
| D.1.2 | SRR3195329 | 26 | R+C | HL | 120 | M | Stat | |||
| D.1.3 | SRR3195331 | 26 | R+C | HL | 120 | M | Stat | |||
| D.2.1 | SRR3195332 | 26 | R+C | HL | 120 | F | Stat | |||
| D.2.2 | SRR3195334 | 26 | R+C | HL | 120 | F | Stat | |||
| D.2.3 | SRR3195335 | 26 | R+C | HL | 120 | F | Stat | |||
| D.3.1 | SRR3195338 | 26 | R+C | HL | 120 | F | Stat | |||
| D.3.2 | SRR3195339 | 26 | R+C | HL | 120 | F | Stat | |||
| D.3.3 | SRR3195340 | 26 | R+C | HL | 120 | F | Stat | |||
| E | PRJEB38787 | E.1 | ERR4227585 | 25 | M | LL | 100 | P | Exp | This study |
| E.2 | ERR4227586 | 25 | M+C | D | 100 | H | Exp | |||
| E.3 | ERR4227587 | 25 | M+C | LL | 100 | M | Exp | |||
| E.4 | ERR4227588 | 25 | M+C | HL | 100 | M | Exp |
Figure 1Schematic representation of our de novo transcriptome meta-assembly pipeline.
Basic statistics based on transcript properties of reconstructed transcriptomes from collected data. ACC: study accession, REF: bibliographic reference, RAW: number of downloaded reads, PRE: number of good reads after pre-processing, CNT: number of reads removed after pre-processing considered as contamination (reads per million; rpm), SEQ: number of transcripts, MIN: minimal sequence length, MAX: maximal sequence length, MEAN: mean sequence length, TOTAL: combined sequence length, SEQ < 200: number of transcripts under 200 n, SEQ > 1 k: number of transcripts over 1000 nt, SEQ > 10 k: number of transcripts over 10,000 nt, ORF: number of sequences with a predicted open reading frame, ORF (%): for contigs with an ORF, the mean % of the contig covered by the ORF, N[z]: minimum contig length needed to cover [z]% of the transcriptome. GC (%): percentage of guanine-cytosine content, PART and PART (%): number and percentage of sequences contributed to the final consensus transcriptome (see below). In PRJEB10085 (B) (filtered), sequences <500 nt were further discarded (see text).
| Statistic | A | B | B (Filtered) | C | D | E |
|---|---|---|---|---|---|---|
| ACC | PRJNA310762 | PRJEB10085 | PRJEB10085 | PRJNA298469 | PRJNA289402 | PRJEB38787 |
| REF | [ | This study | ||||
| RAW | 61,531,862 | 383,416,636 | 383,416,636 | 27,096,926 | 285,148,782 | 1,902,226,200 |
| PRE | 57,862,467 | 310,302,570 | 310,302,570 | 25,244,887 | 267,779,751 | 875,299,135 |
| CNT | 740 (12 rpm) | 9080 (29 rpm) | 9080 (29 rpm) | 1750 (68 rpm) | 1191 (4 rpm) | 2403 (2 rpm) |
| SEQ | 38,559 | 95,490 | 36,287 | 42,363 | 37,425 | 51,021 |
| MIN | 101 | 101 | 500 | 101 | 101 | 101 |
| MAX | 13,929 | 21,744 | 21,744 | 11,354 | 26,839 | 10,795 |
| MEAN | 1043 | 647 | 1312 | 810 | 1120 | 610 |
| TOTAL | 40,861,413 | 64,426,688 | 47,615,807 | 34,438,742 | 42,382,170 | 31,671,589 |
| SEQ < 200 | 4330 | 17,074 | 0 | 782 | 3051 | 3989 |
| SEQ > 1 k | 16,289 | 18,638 | 18,638 | 10,932 | 17,048 | 7104 |
| SEQ > 10 k | 4 | 15 | 15 | 1 | 13 | 1 |
| ORF | 24,757 | 29,060 | 27,842 | 27,063 | 24,817 | 26,882 |
| ORF (%) | 88% | 82% | 83% | 89% | 87% | 93% |
| N90 | 576 | 347 | 654 | 419 | 606 | 367 |
| N70 | 1140 | 667 | 1101 | 686 | 1187 | 528 |
| N50 | 1607 | 1282 | 1574 | 1014 | 1658 | 753 |
| N30 | 2257 | 2033 | 2243 | 1452 | 2318 | 1090 |
| N10 | 3600 | 4026 | 3707 | 2358 | 3812 | 1850 |
| GC (%) | 64% | 58% | 62% | 64% | 64% | 64% |
| PART | 22,234 | - | 27,730 | 10,129 | 19,663 | 11,602 |
| PART (%) | 24.3% | - | 30.3% | 11.1% | 21.5% | 12.7% |
Basic statistics of transcript properties computed for the three public transcriptome assemblies, including the consensus transcriptome generated in the present work, and completed with data retrieved from the publications of Ebenezer et al. (2019) [7] and Yoshida et al. (2016) [53]. Row titles are as in Table 2, except for CDS: number of unique coding sequences (i.e., ORFs or UNIGENEs), GMS-T and GMS-T (%): number and percentage of predicted protein coding regions calculated by GeneMarkS-T.
| Statistic | GEFR01 | GDJR01 | HBDM01 |
|---|---|---|---|
| REF | [ | This study | |
| SEQ | 72,506 | 113,152 | 91,040 1 |
| MIN | 202 | 201 | 201 |
| MAX | 25,763 | 21,553 | 26,839 |
| MEAN | 869 | 1087 | 1096 |
| TOTAL | 63,049,595 | 122,976,775 | 100,187,451 |
| SEQ < 200 | 0 1 | 0 1 | 0 1 |
| SEQ > 1 k | 19,740 | 49,277 | 37,294 |
| SEQ > 10 k | 25 | 27 | 24 |
| ORF 2 | 30,467 | 65,943 | 62,287 |
| ORF (%) | 79% | 73% | 85% |
| N90 | 374 | 523 | 545 |
| N70 | 704 | 1130 | 965 |
| N50 | 1242 | 1604 | 1432 |
| N30 | 1916 | 2181 | 2049 |
| N10 | 3344 | 3347 | 3410 |
| GC (%) | 61% | 63% | 63% |
| CDS | 32,128 | 49,826 | 49,922 |
| GMS-T | 35,929 | 63,432 | 58,542 |
| GMS-T (%) | 49% | 56% | 64% |
1 Submission tools for sequence repositories do not accept transcripts ≤ 200 nt. Hence, the number of sequences in the public version of HBDM01 is lower than reported elsewhere in this work. 2 ORFs were determined with TransDecoder, whereas CDS were determined with EvidentialGene (or a similar tool, depending on the study).
Mapping fraction of pre-processed reads from each collected dataset (rows) to the three public transcriptome assemblies (columns), GEFR01 [7], GDJR01 [53] and HBDM01 (this study).
| Code | Accession | Reference | GEFR01 | GDJR01 | HBDM01 |
|---|---|---|---|---|---|
| A | PRJNA310762 | [ | 87.40% | 92.51% | 93.38% |
| B | PRJEB10085 | [ | 84.68% | 90.13% | 91.49% |
| C | PRJNA298469 | [ | 80.26% | 91.66% | 90.39% |
| D | PRJNA289402 | [ | 85.25% | 95.04% | 94.28% |
| E | PRJEB38787 | This study | 75.28% | 51.39% | 80.76% |
Figure 2BUSCO-generated charts showing the relative completeness of the three public transcriptome assemblies, GEFR01 [7], GDJR01 [53] and HBDM01 (this study). BUSCO datasets were based on odb9. (a) “Eukaryota” (303 BUSCOs); (b) “Protists ensembl” (215 BUSCOs).
TransRate and Detonate assembly scores for the three public transcriptome assemblies, GEFR01 [7], GDJR01 [53] and HBDM01 (this study). Scores indicate how well transcripts are supported by the RNA-Seq data.
| Assembly Score | GEFR01 | GDJR01 | HBDM01 |
|---|---|---|---|
| TransRate Score | 0.1789 | 0.0304 | 0.0430 |
| TransRate Optimal Score | 0.2051 | 0.1729 | 0.0764 |
| Detonate Score | −97,461 × 106 | −97,561 × 106 | −97,459 × 106 |
SL-sequence related statistics for the three public transcriptome assemblies, GEFR01 [7], GDJR01 [53] and HBDM01 (this study). These correspond to exact matches limited to the first 40 nucleotides of each transcript.
| Threshold (nt) | Statistic | GEFR01 | GDJR01 | HBDM01 |
|---|---|---|---|---|
| 24 | Forward matches | 24 | 5 | 86 |
| Reverse matches | 21 | 0 | 114 | |
| Total matches | 45 | 5 | 200 | |
| Average length (nt) | 24.00 | 24.00 | 24.00 | |
| 14 | Forward matches | 176 | 16,580 | 3370 |
| Reverse matches | 200 | 12,999 | 3265 | |
| Total matches | 376 | 29,579 | 6635 | |
| Average length (nt) | 16.28 | 15.57 | 15.59 | |
| 12 | Forward matches | 749 | 18,322 | 4403 |
| Reverse matches | 766 | 16,016 | 5397 | |
| Total matches | 1515 | 34,338 | 9800 | |
| Average length (nt) | 13.37 | 15.19 | 14.68 |
Figure 3Taxonomic analysis of reconstructed transcripts (BLASTX MEGAN-like affiliations). The Krona chart is a zoom on the 13,850 transcripts to which a taxonomy could be associated, i.e., 28% of the 49,922 reconstructed transcripts. Among this classified fraction, 937 (7%) correspond to organisms to which we cannot assign a specific taxon (“other cellular organisms”). The thin blue slice is labelled “Archaea” (0.2%). The interactive chart is available as HTML Supplementary File S1.
Figure 4Annotation network of ontological terms showing the functional organization and relationships between the 2500 most variable genes. GO and KEGG terms were considered as a large pool in which the genes could be associated with 0 to N terms. Such associations served as the basis to infer the network (see text). Colours correspond to ontological terms (or groups of related ontological terms).
Figure 5Selected co-expression clusters computed on the 2500 most variable genes. Only the five clusters characterized by significantly overrepresented ontological terms (featuring 631 transcripts) are shown. Heat maps and trees regroup samples behaving similarly across genes on the horizontal axis and genes behaving similarly across samples on the vertical axis; gene expression is vertically clustered to facilitate visualization (see text). Samples are colour-coded both by condition (F = fermentative, M = mixotrophic, H = heterotrophic, P = phototrophic) and by study (A = PRJNA310762, B = PRJEB10085, C = PRJNA298469, D = PRJNA289402, E = PRJEB38787). (a) Cluster 1; (b) cluster 4; (c) cluster 19; (d) cluster 24; (e) cluster 25.
Figure 6Expression heat map of 133 genes involved in electron transport chains. Heat maps and trees regroup samples behaving similarly across genes on the horizontal axis and genes behaving similarly across samples on the vertical axis (see text). Samples are colour-coded both by condition (F = fermentative, M = mixotrophic, H = heterotrophic, P = phototrophic) and by study (A = PRJNA310762, B = PRJEB10085, C = PRJNA298469, D = PRJNA289402, E = PRJEB38787). Genes are colour-coded by organelle (CP = chloroplast; MT = mitochondrion).