Mick Van Vlierberghe, Hervé Philippe, Denis Baurain.
Abstract
OBJECTIVES: Identifying orthology relationships among sequences is essential to understand evolution, diversity of life and ancestry among organisms. To build alignments of orthologous sequences, phylogenomic pipelines often start with all-vs-all similarity searches, followed by a clustering step. For the protein clusters (orthogroups) to be as accurate as possible, proteomes of good quality are needed. Here, our objective is to assemble a data set especially suited for the phylogenomic study of algae and formerly photosynthetic eukaryotes, which implies the proper integration of organellar data, to enable distinguishing between several copies of one gene (paralogs), taking into account their cellular compartment, if necessary.

DATA DESCRIPTION: We submitted 73 top-quality and taxonomically diverse proteomes to OrthoFinder. We obtained 47,266 orthogroups and identified 11,775 orthogroups with at least two algae. Whenever possible, sequences were functionally annotated with eggNOG and tagged after their genomic and target compartment(s). Then we aligned and computed phylogenetic trees for the orthogroups with IQ-TREE. Finally, these trees were further processed by identifying and pruning the subtrees exclusively composed of plastid-bearing organisms to yield a set of 31,784 clans suitable for studying photosynthetic organism genome evolution.
Keywords: Algae; CASH; Contamination; Endosymbiotic gene transfer (EGT); Eukaryotic evolution; Horizontal or lateral gene transfer (HGT/LGT); Kleptoplasty; Organelles; Orthology; Phylogenomics; Proteomes
Year: 2021 PMID: 33865444 PMCID: PMC8052839 DOI: 10.1186/s13104-021-05553-4
Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500
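The abstract's selection of orthogroups "with at least two algae" can be illustrated with a minimal Python sketch. It is not the authors' implementation. The organism labels, the `organism|protein` identifier convention, and the function names are hypothetical, and "at least two algae" is assumed here to mean at least two distinct algal organisms rather than two algal sequences.

```python
# Hypothetical set of algal organism labels (the real data set has 73 proteomes).
ALGAE = {"algaA", "algaB", "algaC"}


def organism_of(seq_id):
    """Return the organism part of an ID, assuming an 'organism|protein' layout."""
    return seq_id.split("|", 1)[0]


def orthogroups_with_min_algae(orthogroups, algae, min_algae=2):
    """Keep orthogroups that contain at least `min_algae` distinct algal organisms."""
    kept = {}
    for og_name, seq_ids in orthogroups.items():
        organisms = {organism_of(seq_id) for seq_id in seq_ids}
        if len(organisms & algae) >= min_algae:
            kept[og_name] = seq_ids
    return kept


# Toy orthogroups: OG2 holds two paralogs from a single alga, so it is dropped.
toy = {
    "OG1": ["algaA|p1", "algaB|p2", "host|p3"],
    "OG2": ["algaA|p4", "algaA|p5", "host|p6"],
    "OG3": ["host|p7", "other|p8"],
}

selected = orthogroups_with_min_algae(toy, ALGAE)
print(sorted(selected))  # → ['OG1']
```

Counting distinct organisms rather than sequences keeps paralog-rich orthogroups from a single alga out of the selection. If the intended criterion is a sequence count instead, the set comprehension would be replaced by a simple tally.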
Overview of data files/data sets
| Label | Name of data file/data set | File types (file extension) | Data repository and identifier (DOI or accession number) |
|---|---|---|---|
| Additional file 1 | Methods | PDF file (.pdf) | Figshare |
| Data file 1 | Taxonomic sampling | Image file (.png) | Figshare |
| Data set 1 | Proteome set description | Text files (.csv, .html) | Figshare |
| Data set 2 | Proteome files | FASTA files (.tar.gz) | Figshare |
| Data file 2 | BUSCO report | Text file (.csv) | Figshare |
| Data set 3 | Forty-Two reports and configuration files | Text files (.tsv, .csv, .yaml) | Figshare |
| Data file 3 | Orthogroup properties | Image file (.pdf) | Figshare |
| Data set 4 | Orthogroups | FASTA files, YAML configuration file (.tar.gz) | Figshare |
| Data set 5 | Clans | FASTA files (.tar.gz) | Figshare |
| Data file 4 | Organelle database | Text file (.tsv) | Figshare |
| Data file 5 | Plastid-targeted proteins | Spreadsheet (.xlsx) | Figshare |
| Data file 6 | eggNOG OG annotations | Text file (.tsv) | Figshare |
| Data file 7 | eggNOG clan annotations | Text file (.tsv) | Figshare |