Quentin Carradec, Eric Pelletier, Corinne Da Silva, Adriana Alberti, Yoann Seeleuthner, Romain Blanc-Mathieu, Gipsi Lima-Mendez, Fabio Rocha, Leila Tirichine, Karine Labadie, Amos Kirilovsky, Alexis Bertrand, Stefan Engelen, Mohammed-Amin Madoui, Raphaël Méheust, Julie Poulain, Sarah Romac, Daniel J Richter, Genki Yoshikawa, Céline Dimier, Stefanie Kandels-Lewis, Marc Picheral, Sarah Searson, Olivier Jaillon, Jean-Marc Aury, Eric Karsenti, Matthew B Sullivan, Shinichi Sunagawa, Peer Bork, Fabrice Not, Pascal Hingamp, Jeroen Raes, Lionel Guidi, Hiroyuki Ogata, Colomban de Vargas, Daniele Iudicone, Chris Bowler, Patrick Wincker.
Abstract
While our knowledge about the roles of microbes and viruses in the ocean has increased tremendously due to recent advances in genomics and metagenomics, research on marine microbial eukaryotes and zooplankton has benefited much less from these new technologies because of their larger genomes, their enormous diversity, and largely unexplored physiologies. Here, we use a metatranscriptomics approach to capture expressed genes in open ocean Tara Oceans stations across four organismal size fractions. The individual sequence reads cluster into 116 million unigenes representing the largest reference collection of eukaryotic transcripts from any single biome. The catalog is used to unveil functions expressed by eukaryotic marine plankton, and to assess their functional biogeography. Almost half of the sequences have no similarity with known proteins, and a great number belong to new gene families with a restricted distribution in the ocean. Overall, the resource provides the foundations for exploring the roles of marine eukaryotes in ocean ecology and biogeochemistry.
Year: 2018 PMID: 29371626 PMCID: PMC5785536 DOI: 10.1038/s41467-017-02342-1
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 The Tara Oceans eukaryote gene catalog. a Sampling map. Geographic distribution of 68 sampling stations at which seawater from the surface (SRF) and/or the deep chlorophyll maximum (DCM) was collected and size fractionated into four main groups: 0.8–5 µm (blue), 5–20 µm (red), 20–180 µm (green), and 180–2000 µm (orange). Availability of sequence data sets is indicated by the colored boxes at each sampling station. Two stations (TARA_40 and TARA_153) containing only atypical size fractions are shown on this map with empty boxes. b Rarefaction curves of detected genes. Top panel: rarefaction curves of 441 eukaryotic samples (red curve) compared to 139 prokaryotic samples (green curve) derived from Sunagawa et al.[9]. Other panels: rarefaction curves of eukaryotic samples by oceanic region (IO, Indian Ocean; MS, Mediterranean Sea; NAO, North Atlantic Ocean; NPO, North Pacific Ocean; SAO, South Atlantic Ocean; SO, Southern Ocean; SPO, South Pacific Ocean), size fraction, and depth (SRF or DCM). For each curve, the sampling order was permuted 10 times. c Estimated number of transcriptomes in eukaryotic samples. Left panel: distribution of the total number of transcriptomes estimated for each size fraction, computed from the number of unigenes similar to a catalog of 24 single-copy ribosomal proteins. Right panel: distribution of the number of transcriptomes in each sample (small dashes), grouped by size fraction.
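The rarefaction curves in panel b can be illustrated with a minimal sketch. It assumes each sample is reduced to the set of unigene identifiers detected in it, and it reads "10-fold permuted" as averaging the cumulative gene count over ten random sample orderings. The function name `rarefaction_curve`, the seed, and the toy samples are hypothetical and are not taken from the paper.

```python
import random

def rarefaction_curve(samples, n_permutations=10, seed=0):
    """Mean number of distinct genes seen after adding 1..n samples.

    samples: list of sets of gene identifiers (one set per sample).
    The sample order is shuffled n_permutations times and the
    cumulative counts are averaged position by position.
    """
    rng = random.Random(seed)
    totals = [0] * len(samples)
    for _ in range(n_permutations):
        order = list(samples)
        rng.shuffle(order)
        seen = set()
        for k, genes in enumerate(order):
            seen |= genes              # genes detected so far in this ordering
            totals[k] += len(seen)
    return [t / n_permutations for t in totals]

# Hypothetical toy data: three samples with overlapping gene content.
toy_samples = [{"g1", "g2"}, {"g2", "g3"}, {"g1", "g4", "g5"}]
curve = rarefaction_curve(toy_samples)
print(curve)
```

The last point of the curve always equals the total number of distinct genes across all samples, whatever the ordering; only the earlier points depend on the permutations.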
Fig. 2 Taxonomic composition of the gene catalog. a Origin of the best similarity sequence match as a fraction of the total in the circular diagram (MMETSP[14]: release of August 2014, with manual curation; UniRef90[42]: release of September 2014; “Others”: other reference transcriptomes that were added to offset the lack of knowledge about organisms in large size fractions, in particular copepods and Rhizaria; Methods section). Unigenes without significant matches (i.e., those with an e-value >10⁻⁵ for their best similarity match) are tagged as “No match”. The proportion of unigenes affiliated to each major taxonomic group is indicated in the right column. O/U, other or unassigned. b Proportion of each major taxonomic group across Tara Oceans stations based on the mean number of unigenes classified as one of 24 different single-copy ribosomal proteins detected in each sample (IO, Indian Ocean; MS, Mediterranean Sea; NAO, North Atlantic Ocean; NPO, North Pacific Ocean; SAO, South Atlantic Ocean; SO, Southern Ocean; SPO, South Pacific Ocean). c Eukaryotic viral unigenes. NCLDV unigenes are classified at the family level.
Fig. 3 Characterization of highly expressed gene families. a Major Pfam domains present in different size fractions and in different taxonomic groups. Among the highly expressed Pfam domains (Supplementary Fig. 4), those with specific patterns are shown. The relative expression of Pfam domains in the four filter sizes (left panel) and the contribution of each taxonomic group to the total expression of the Pfam domain (right panel) are shown as an average of all Tara Oceans SRF and DCM samples. O/U, other or unassigned. b Unrooted phylogenetic tree of type-I rhodopsin subfamilies (PF01036) obtained by sampling 300 sequences of the three largest MCL clusters (see details in Supplementary Fig. 5b). The vertical size of the triangles represents the number of unigenes in each cluster (explicitly indicated in white) and their width represents the maximum branch length of 95% of sequences in the cluster. Taxonomic assignments of reference sequences (inner ring) and unigenes (outer ring) are indicated for each cluster with the color code of panel a. The number of reference sequences in each cluster is indicated in the center in bold, with the number of eukaryotic sequences in parentheses. c Logo consensus sequences, based on the global alignment of each cluster. Two regions of interest (helices C and G and their neighborhoods) containing functional and conserved residues are represented[25]. Specific functional residues are indicated with arrows. Red: proton donor (D65) and acceptor (E76); green: residue specific to green light-sensitive proteorhodopsins; blue: amino acid specific to blue light-sensitive proteorhodopsins; yellow: lysine residue linked to retinal. Predicted transmembrane helices are represented as gray boxes.
Fig. 4 Eukaryote gene catalog clustering and characterization of novel genes. a Global distribution of unigenes based on the gene catalog clustering. Unigenes were considered singletons if they are in clusters of fewer than three units. Gene families (GFs) are novel (nGF), taxonomically assigned (tGF), functionally assigned (fGF), or both (ftGF) (Methods). Numbers above each bar indicate the numbers of unigenes per cluster. b Distribution of unknown unigenes in the different categories described in a. c Ratio of tGFs vs. ftGFs in the main taxonomic groups. The total number of GFs assigned to each taxonomic group is indicated on the right. d Distribution of GF occupancy for the three main GF categories. GFs are classified according to their size (x-axis) and the y-axis indicates the number of stations where the GF is expressed (at least one unigene detected with a coverage of more than 80% of the unigene length). Kolmogorov-Smirnov tests with p < 10⁻⁵ between occupancy distributions are indicated with red stars. e Distribution of mean expression levels of the three different categories of GFs among all samples. GFs are classified according to their size (x-axis). The expression of a GF in a sample was determined by the sum of the expression of its unigenes in RPKM.
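The two quantities defined in panels d and e can be sketched as follows, under stated assumptions: GF expression in a sample is the sum of its unigenes' RPKM values, and a GF counts as expressed at a station when at least one of its unigenes is covered over more than 80% of its length. The function names, the dictionary layouts, and the toy values are hypothetical; the paper's own implementation is not shown in the source.

```python
from collections import defaultdict

def gf_expression(unigene_rpkm, unigene_to_gf):
    """Sum unigene RPKM values per gene family for one sample."""
    expression = defaultdict(float)
    for unigene, rpkm in unigene_rpkm.items():
        gf = unigene_to_gf.get(unigene)
        if gf is not None:             # skip unigenes with no GF assignment
            expression[gf] += rpkm
    return dict(expression)

def gf_occupancy(coverage_by_station, unigene_to_gf, min_coverage=0.8):
    """Count stations where each GF has a unigene covered over > min_coverage."""
    stations = defaultdict(set)
    for station, coverage in coverage_by_station.items():
        for unigene, fraction in coverage.items():
            if fraction > min_coverage and unigene in unigene_to_gf:
                stations[unigene_to_gf[unigene]].add(station)
    return {gf: len(found) for gf, found in stations.items()}

# Hypothetical toy data: three unigenes in two gene families.
toy_gf = {"u1": "GF_A", "u2": "GF_A", "u3": "GF_B"}
toy_rpkm = {"u1": 1.5, "u2": 2.5, "u3": 4.0}
toy_coverage = {
    "station_1": {"u1": 0.9, "u3": 0.5},
    "station_2": {"u2": 0.85, "u3": 0.95},
}
expression = gf_expression(toy_rpkm, toy_gf)
occupancy = gf_occupancy(toy_coverage, toy_gf)
print(expression, occupancy)
```

The comparison uses a strict `>` because the caption says "more than 80%"; a unigene covered at exactly 80% would not count in this sketch.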
Fig. 5 New gene families expressed in the 20–180 µm size fraction. a Graph representation of protein group number 14079. Each GF of the protein group is represented by a node with a diameter proportional to the number of unigenes in the GF. Protein matches between GFs are represented by an edge. b Mean expression of GFs in different size fractions and depths. Each color corresponds to a GF of protein group 14079. c World map representation of protein group 14079 expression in the 20–180 µm size fraction. SRF and DCM samples have been pooled. Circle diameters represent the relative expression of the protein group in RPKM. The contribution of each GF to expression is represented by the different colors. d Sequence logo of the multiple alignment of protein group 14079. Forty-five ORFs (153 amino acids on average) of protein group 14079 were aligned, and positions with more than 50% gaps were removed. Mean numbers of amino acids in unaligned regions of the protein are indicated in gray boxes. A signal peptide cleavage site, indicated on the left part of the sequence logo, was predicted in 21 sequences.
Fig. 6 Ratios of differential gene abundance and relative expression of ferredoxin vs. flavodoxin in the five major photosynthetic groups. a Representation of the relative abundance (left) and expression (right) of the two genes identified in surface samples for Chlorophyta, Pelagophyceae, Haptophyceae (from the 0.8–5 µm filters), Bacillariophyta and Dinophyceae (from the 5–20 µm filters). The circle colors, from red to blue, represent the relative expression of one gene compared to the other, with the color code given in the top diagram. The sum of the expression levels of the two genes affiliated to each taxonomic group is represented by the circle diameter as a percentage of the total expression of these genes. b Distribution of the relative abundance (left) or expression (right) of ferredoxin in low-iron stations (<0.02 µmol m⁻³, 15 stations, dark gray) or iron-rich stations (>0.2 µmol m⁻³, 31 stations, light gray) according to a model of iron concentration in the oceans (Supplementary Data 5). Significant differences of expression between low-iron and iron-rich stations are indicated with red stars (non-parametric Wilcoxon rank-sum test, p < 10⁻³). c Correlations between the relative metagenome (MetaG) abundance and metatranscriptome (MetaT) expression of ferredoxin in SRF and DCM samples, expressed as a percentage of the total value of ferredoxin + flavodoxin. Pearson correlation coefficients (r) and their statistical significance (p) are indicated in each graph. Ferredoxins and flavodoxins were identified using the Pfams PF00111 and PF00258, respectively.
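The relative measure used in panel c can be written out as a short sketch: ferredoxin is expressed as a percentage of ferredoxin + flavodoxin, and the MetaG and MetaT percentages are then compared with a Pearson correlation coefficient. The function names and the per-sample values below are hypothetical, and the significance test (p) shown in the figure is not reproduced here.

```python
import math

def ferredoxin_percent(ferredoxin, flavodoxin):
    """Ferredoxin as a percentage of ferredoxin + flavodoxin.

    Returns None when neither gene is detected, so that the sample
    can be excluded rather than treated as 0%.
    """
    total = ferredoxin + flavodoxin
    if total == 0:
        return None
    return 100.0 * ferredoxin / total

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-sample (ferredoxin, flavodoxin) values, not from the paper.
metag = [(3.0, 1.0), (1.0, 1.0), (1.0, 3.0)]     # gene abundance
metat = [(9.0, 1.0), (6.0, 4.0), (2.0, 8.0)]     # gene expression
metag_pct = [ferredoxin_percent(fd, fl) for fd, fl in metag]
metat_pct = [ferredoxin_percent(fd, fl) for fd, fl in metat]
r = pearson_r(metag_pct, metat_pct)
print(metag_pct, metat_pct, r)
```

Expressing both layers as a percentage of the same two-gene total makes the values comparable across samples with different sequencing depth, which is presumably why the caption uses this normalization.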