Broadly sampled orthologous groups of eukaryotic proteins for the phylogenetic study of plastid-bearing lineages.

Mick Van Vlierberghe, Hervé Philippe, Denis Baurain.

Abstract

OBJECTIVES: Identifying orthology relationships among sequences is essential to understanding evolution, the diversity of life and ancestry among organisms. To build alignments of orthologous sequences, phylogenomic pipelines often start with all-vs-all similarity searches, followed by a clustering step. For the resulting protein clusters (orthogroups) to be as accurate as possible, high-quality proteomes are needed. Here, our objective is to assemble a data set especially suited to the phylogenomic study of algae and formerly photosynthetic eukaryotes. This implies the proper integration of organellar data, so that several copies of one gene (paralogs) can be distinguished, taking into account their cellular compartment if necessary.

DATA DESCRIPTION: We submitted 73 top-quality and taxonomically diverse proteomes to OrthoFinder. We obtained 47,266 orthogroups, of which 11,775 contained at least two algae. Whenever possible, sequences were functionally annotated with eggNOG and tagged according to their genomic and target compartment(s). We then aligned the orthogroups and computed phylogenetic trees with IQ-TREE. Finally, these trees were further processed by identifying and pruning the subtrees exclusively composed of plastid-bearing organisms, yielding a set of 31,784 clans suitable for studying the genome evolution of photosynthetic organisms.

Keywords:  Algae; CASH; Contamination; Endosymbiotic gene transfer (EGT); Eukaryotic evolution; Horizontal or lateral gene transfer (HGT/LGT); Kleptoplasty; Organelles; Orthology; Phylogenomics; Proteomes

Year: 2021; PMID: 33865444; PMCID: PMC8052839; DOI: 10.1186/s13104-021-05553-4

Source DB: PubMed; Journal: BMC Res Notes; ISSN: 1756-0500


Objective

Our main objective is to analyse the phylogenetic origin of plastid-targeted genes in complex algae [1-3] in a fully automated fashion. To do so, we designed and developed a series of strategies and tools around a large-scale single-gene tree analysis pipeline. The first step was to build alignments of orthologous sequences with OrthoFinder, a high-accuracy orthogroup (OG) inference algorithm [4]. We focused on top-quality proteomes, in particular those with high completeness, which is essential to obtain the most complete and balanced OGs possible [5, 6]. To maximize completeness and facilitate the phylogenetic analysis, we first complemented the proteomes that had no, or only incomplete, plastid and/or nucleomorph sequences. We then processed the resulting OGs, first by isolating those containing photosynthetic organisms, and second by separating the gene copies shared by plastid-bearing algae from their paralogs. To this end, we built trees using IQ-TREE [7] and used our own tool (tree-clan-splitter.pl) to detect and prune the subtree(s) of interest.

Data description

We collected 73 top-quality eukaryotic proteomes (i.e., conceptually translated genomes; Data file 1, Data set 1, Data set 2) with high completeness (Data file 2) [5, 6] and low contamination levels (Data set 3) [8, 9] (Table 1). They were selected to be taxonomically diverse, covering all photosynthetic phyla [10, 11], along with some non-photosynthetic organisms to be used as beacons by our clan-identifying algorithm. The proteomes were complemented with organellar (i.e., plastid and nucleomorph) proteins whenever these were partly or fully missing from the original source: 16 were complemented with plastid proteomes and two with nucleomorph proteomes. All proteomes (complemented or not) were dereplicated with CD-HIT [12]. In addition, to facilitate subsequent phylogenetic analyses, we used tag-loc-ids.pl, a custom tool designed to tag sequence identifiers according to their encoding genome and cellular localization, such as nuclear-encoded-and-plastid-targeted (nucpt#), nuclear-encoded-periplastid-compartment-targeted (nuppct#), plastid-encoded-plastid-targeted (cpcpt#), nucleomorph-encoded (nm#), and mitochondrion-encoded (mt#).
We then used OrthoFinder [4] for orthology inference, which resulted in 47,266 OGs (Data file 3, Data set 4), each composed of two or more sequences. The algae fall into eleven main taxonomic groups (according to NCBI Taxonomy [13]), classified either as “primary algae” (Glaucocystophyceae, Rhodophyta, Viridiplantae) or as “complex algae” (Apicomplexa, Colpodellida, Dinophyceae, Cryptophyceae, Euglenozoa, Ochrophyta (including Pelagophyceae), Haptophyta, and Chlorarachniophyceae). Using the script classify-mcl-out.pl, OGs were then sorted into three categories: “two-algae” (at least one complex alga from two different groups or at least one complex alga and one primary alga, n = 11,775), “one-alga” (at least one alga, n = 18,844) and “zero-algae” (no algae, n = 16,647).
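The three-way sorting of OGs can be illustrated with a minimal Python sketch. It is not the classify-mcl-out.pl script itself: the function name `classify_og` and the input format (one taxonomic group name per sequence) are assumptions made for illustration, and the rule simply follows the category definitions given in the text, with “one-alga” read as any remaining OG that contains at least one alga.

```python
# Illustrative sketch only; not the authors' classify-mcl-out.pl.
# Group names are taken from the text; the input format is an assumption.
PRIMARY_ALGAE = {"Glaucocystophyceae", "Rhodophyta", "Viridiplantae"}
COMPLEX_ALGAE = {
    "Apicomplexa", "Colpodellida", "Dinophyceae", "Cryptophyceae",
    "Euglenozoa", "Ochrophyta", "Haptophyta", "Chlorarachniophyceae",
}


def classify_og(groups):
    """Return the category of one OG.

    groups: iterable with the taxonomic group of each sequence in the OG
            (any name outside the eleven algal groups counts as non-algal).
    """
    present = set(groups)
    complex_present = present & COMPLEX_ALGAE
    primary_present = present & PRIMARY_ALGAE
    # "two-algae": complex algae from two different groups,
    # or at least one complex alga together with one primary alga.
    if len(complex_present) >= 2 or (complex_present and primary_present):
        return "two-algae"
    # "one-alga": any other OG with at least one alga.
    if complex_present or primary_present:
        return "one-alga"
    return "zero-algae"


print(classify_og(["Rhodophyta", "Haptophyta", "Metazoa"]))  # two-algae
print(classify_og(["Haptophyta", "Haptophyta"]))             # one-alga
print(classify_og(["Metazoa", "Fungi"]))                     # zero-algae
```

As a consistency check on the reported counts, 11,775 + 18,844 + 16,647 = 47,266, so the three categories partition the full set of OGs.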
To address the issue of multiple-copy genes (paralogs), we developed a strategy to isolate subtrees (“clans”) of interest, i.e., subtrees including only plastid-bearing organisms. Briefly, we computed trees with IQ-TREE [7] for the 11,775 “two-algae” OGs whenever possible (i.e., ≥ 3 sequences, n = 11,499) and developed a tool for identifying and pruning subtrees that fulfil user-specified taxonomic filters (tree-clan-splitter.pl). In this way, we obtained 31,784 “photosynthetic” clans (Data set 5) composed only of plastid-bearing organisms (including species with a non-photosynthetic plastid, such as Plasmodium falciparum). Additionally, we provide detailed annotation reports obtained with eggNOG [14].
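The idea of isolating clans can be sketched in a few lines of Python. This is not tree-clan-splitter.pl: the nested-tuple tree representation, the function names and the `min_size` parameter are assumptions made for illustration. The sketch treats the tree as unrooted, so a clan is the set of leaves on one side of any branch, and it keeps the largest such sets in which every leaf passes a user-supplied filter.

```python
# Illustrative sketch only; not the authors' tree-clan-splitter.pl.
# A tree is a nested tuple whose leaves are strings (hypothetical format).

def _leaf_sets(tree):
    """Return (all leaves, list of leaf sets found below every node)."""
    below = []

    def walk(node):
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(walk(child) for child in node))
        below.append(leaves)
        return leaves

    return walk(tree), below


def find_clans(tree, is_target, min_size=2):
    """Return the maximal clans whose leaves all satisfy is_target."""
    all_leaves, below = _leaf_sets(tree)
    if all(is_target(leaf) for leaf in all_leaves):
        return [sorted(all_leaves)] if len(all_leaves) >= min_size else []
    candidates = set()
    for leaves in below:
        # Each branch splits the leaves into two sides; test both of them.
        for side in (leaves, all_leaves - leaves):
            if side and all(is_target(leaf) for leaf in side):
                candidates.add(side)
    # Keep only clans that are not nested inside a larger candidate.
    maximal = [c for c in candidates if not any(c < d for d in candidates)]
    return sorted(sorted(c) for c in maximal if len(c) >= min_size)


def is_alga(name):
    # Toy filter: leaf names starting with "alga" stand for plastid-bearing taxa.
    return name.startswith("alga")


tree = ((("algaA", "algaB"), "animal1"), ("algaC", ("algaD", "algaE")), "fungus1")
print(find_clans(tree, is_alga))
# [['algaA', 'algaB'], ['algaC', 'algaD', 'algaE']]
```

Because one tree can contain several such subtrees, one per paralog for instance, the number of clans can exceed the number of trees, which is in line with the 31,784 clans reported for 11,499 trees.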
Table 1. Overview of data files/data sets

Label | Name of data file/data set | File types (file extension) | Data repository and identifier (DOI or accession number)
Additional file 1 | Methods | PDF file (.pdf) | Figshare https://doi.org/10.6084/m9.figshare.13604102.v3 [18]
Data file 1 | Taxonomic sampling | Image file (.png) | Figshare https://doi.org/10.6084/m9.figshare.13603511.v1 [19]
Data set 1 | Proteome set description | Text files (.csv, .html) | Figshare https://doi.org/10.6084/m9.figshare.13113893.v1 [20]
Data set 2 | Proteome files | FASTA files (.tar.gz) | Figshare https://doi.org/10.6084/m9.figshare.13573424.v2 [21]
Data file 2 | BUSCO report | Text file (.csv) | Figshare https://doi.org/10.6084/m9.figshare.13235045.v1 [22]
Data set 3 | Forty-Two reports and configuration files | Text files (.tsv, .csv, .yaml) | Figshare https://doi.org/10.6084/m9.figshare.13235063.v3 [23]
Data file 3 | Orthogroup properties | Image file (.pdf) | Figshare https://doi.org/10.6084/m9.figshare.13312622.v1 [24]
Data set 4 | Orthogroups | FASTA files, YAML configuration file (.tar.gz) | Figshare https://doi.org/10.6084/m9.figshare.13573658.v3 [25]
Data set 5 | Clans | FASTA files (.tar.gz) | Figshare https://doi.org/10.6084/m9.figshare.13573415.v1 [26]
Data file 4 | Organelle database | Text file (.tsv) | Figshare https://doi.org/10.6084/m9.figshare.13246841.v1 [27]
Data file 5 | Plastid-targeted proteins | Spreadsheet (.xlsx) | Figshare https://doi.org/10.6084/m9.figshare.13246784.v1 [28]
Data file 6 | eggNOG OG annotations | Text file (.tsv) | Figshare https://doi.org/10.6084/m9.figshare.13415048.v1 [29]
Data file 7 | eggNOG clan annotations | Text file (.tsv) | Figshare https://doi.org/10.6084/m9.figshare.13415060.v1 [30]

Limitations

Occasionally, organellar genome sequences come from a different strain than the nuclear data; this could be an issue if we were trying to resolve relationships between close relatives within the same lineage. This is not the case here, however, since the major endosymbiotic-like events we are tracking most certainly occurred between distinct lineages. The way we handle the tagging overwrites the information about potential NUMTs, NUNMs and NUPTs (nuclear copies of mitochondrial, nucleomorph and plastid DNA, respectively): if a gene existed in both genomic compartments (nucleus and organelle), we always retained the organellar counterpart. Only a few of the nucleus-encoded-and-plastid-targeted proteins (nucpt#) were identified by proteomics (e.g., in P. falciparum) [17]; the remainder result from in silico predictions [15, 16], which are less reliable than proteomic experiments.
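The consequence of the tagging rule can be shown with a small Python sketch. It is not tag-loc-ids.pl: the tag names come from the text, but the placement of the tag at the start of the identifier, the function names and the accession-like identifiers are all assumptions made for illustration.

```python
# Illustrative sketch only; not the authors' tag-loc-ids.pl.
# Assumption: a tag is written as "<tag>#" at the start of the identifier.
ORGANELLAR_TAGS = ("cpcpt", "nm", "mt")   # organelle-encoded
NUCLEAR_TAGS = ("nucpt", "nuppct")        # nucleus-encoded, organelle-targeted


def split_tag(seq_id):
    """Return (tag, bare identifier); tag is None for untagged sequences."""
    tag, sep, rest = seq_id.partition("#")
    if sep and tag in ORGANELLAR_TAGS + NUCLEAR_TAGS:
        return tag, rest
    return None, seq_id


def retained_copy(seq_ids):
    """Among copies of one gene, prefer an organelle-encoded copy if present."""
    for seq_id in seq_ids:
        if split_tag(seq_id)[0] in ORGANELLAR_TAGS:
            return seq_id
    return seq_ids[0]


# Made-up identifiers: one nuclear copy and one plastid-encoded copy of a gene.
print(split_tag("nucpt#XP_0001"))                   # ('nucpt', 'XP_0001')
print(retained_copy(["XP_0002", "cpcpt#YP_0003"]))  # cpcpt#YP_0003
```

Under such a rule the nuclear copy disappears from the data set, which is why NUMTs, NUNMs and NUPTs cannot be studied with these files.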

References

1.  Origin and distribution of Calvin cycle fructose and sedoheptulose bisphosphatases in plantae and complex algae: a single secondary origin of complex red plastids and subsequent propagation via tertiary endosymbioses.

Authors:  René Teich; Stefan Zauner; Denis Baurain; Henner Brinkmann; Jörn Petersen
Journal:  Protist       Date:  2007-03-21

2.  A "green" phosphoribulokinase in complex algae with red plastids: evidence for a single secondary endosymbiosis leading to haptophytes, cryptophytes, heterokonts, and dinoflagellates.

Authors:  Jörn Petersen; René Teich; Henner Brinkmann; Rüdiger Cerff
Journal:  J Mol Evol       Date:  2006-02-10

3.  BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs.

Authors:  Felipe A Simão; Robert M Waterhouse; Panagiotis Ioannidis; Evgenia V Kriventseva; Evgeny M Zdobnov
Journal:  Bioinformatics       Date:  2015-06-09

4.  Comparative and Functional Algal Genomics.

Authors:  Crysten E Blaby-Haas; Sabeeha S Merchant
Journal:  Annu Rev Plant Biol       Date:  2019-03-01

5.  Chimeric origins of ochrophytes and haptophytes revealed through an ancient plastid proteome.

Authors:  Richard G Dorrell; Gillian Gile; Giselle McCallum; Raphaël Méheust; Eric P Bapteste; Christen M Klinger; Loraine Brillet-Guéguen; Katalina D Freeman; Daniel J Richter; Chris Bowler
Journal:  Elife       Date:  2017-05-12

6.  NCBI Taxonomy: a comprehensive update on curation, resources and tools.

Authors:  Conrad L Schoch; Stacy Ciufo; Mikhail Domrachev; Carol L Hotton; Sivakumar Kannan; Rogneda Khovanskaya; Detlef Leipe; Richard Mcveigh; Kathleen O'Neill; Barbara Robbertse; Shobha Sharma; Vladimir Soussov; John P Sullivan; Lu Sun; Seán Turner; Ilene Karsch-Mizrachi
Journal:  Database (Oxford)       Date:  2020-01-01

7.  Metabolic quirks and the colourful history of the Euglena gracilis secondary plastid.

Authors:  Anna M G Novák Vanclová; Martin Zoltner; Steven Kelly; Petr Soukal; Kristína Záhonová; Zoltán Füssy; ThankGod E Ebenezer; Eva Lacová Dobáková; Marek Eliáš; Julius Lukeš; Mark C Field; Vladimír Hampl
Journal:  New Phytol       Date:  2019-11-04

8.  IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies.

Authors:  Lam-Tung Nguyen; Heiko A Schmidt; Arndt von Haeseler; Bui Quang Minh
Journal:  Mol Biol Evol       Date:  2014-11-03

9.  Genomic Insights into Plastid Evolution.

Authors:  Shannon J Sibbald; John M Archibald
Journal:  Genome Biol Evol       Date:  2020-07-01

10.  Integrative proteomics and bioinformatic prediction enable a high-confidence apicoplast proteome in malaria parasites.

Authors:  Michael J Boucher; Sreejoyee Ghosh; Lichao Zhang; Avantika Lal; Se Won Jang; An Ju; Shuying Zhang; Xinzi Wang; Stuart A Ralph; James Zou; Joshua E Elias; Ellen Yeh
Journal:  PLoS Biol       Date:  2018-09-13