Martin Romei1,2, Guillaume Sapriel1,3, Pierre Imbert1, Théo Jamay1, Jacques Chomilier2, Guillaume Lecointre1, Mathilde Carpentier1.
Abstract
Several studies have shown that the distribution of folds (the topology of protein secondary structures) across proteomes may serve as a global proxy for building phylogenies. If so, some folds should be synapomorphies (derived characters exclusively shared among taxa). However, previous studies used methods that did not allow synapomorphy identification, which requires congruence analysis of folds as individual characters. Here, we map SCOP folds onto a sample of 210 species across the tree of life (TOL). Congruence is assessed using the retention index of each fold for the TOL, and principal component analysis for deeper branches. Using a bicluster mapping approach, we define synapomorphic blocks of folds (SBFs) sharing similar presence/absence patterns. Among the 1232 folds, 20% are universally present in our TOL, whereas 54% are reliable synapomorphies. These results are similar with the CATH and ECOD databases. Eukaryotes are characterized by a large number of synapomorphies, and several SBFs clearly support nested eukaryotic clades (divergence times from 1100 to 380 mya). Although clearly separated, the three superkingdoms reveal a strong mosaic pattern. This pattern is consistent with the dual origin of eukaryotes and bears witness to secondary endosymbiosis in their photosynthetic clades. Our study unveils direct analysis of fold synapomorphies as key characters to unravel the evolutionary history of species.
Keywords: Phylogeny; protein folds; synapomorphy; tree of life
Year: 2022 PMID: 35765784 PMCID: PMC9541633 DOI: 10.1111/evo.14550
Source DB: PubMed Journal: Evolution ISSN: 0014-3820 Impact factor: 4.171
Figure 1. Heatmap showing the distribution of protein folds across the diversity of life. Columns are species ordered according to the reference phylogenetic tree. The 210 species, from left to right, are 70 bacteria, 70 archaea, and 70 eukaryotes. For convenience, column colors indicate taxonomic groups according to NCBI nomenclature. For bacteria, only two phyla of interest are highlighted. Rows are the 1073 protein folds extracted from SCOP. Dots mark fold presence in the corresponding species and are colored according to the retention index calculated for the fold distribution on the reference phylogenetic tree. Darker red dots refer to folds that can be interpreted as reliable taxonomic markers (i.e., group synapomorphies). An interactive version of this heatmap is given in the Supporting Information.
Average retention index calculated over all characters, for either all organisms or only Bacteria, Eukarya, or Archaea (rows). The characters are the predicted presence or absence in the proteomes of SCOP folds, T-level architecture of CATH, or X-level architecture of ECOD (columns).
| | SCOP | CATH | ECOD |
|---|---|---|---|
| All | 0.56 | 0.53 | 0.54 |
| Bacteria | 0.29 | 0.26 | 0.27 |
| Eukaryotes | 0.44 | 0.43 | 0.47 |
| Archaea | 0.27 | 0.27 | 0.27 |
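The retention index behind these averages can be illustrated with a minimal sketch. It assumes a fully bifurcating rooted tree written as nested tuples and a binary presence/absence character. It counts state changes with Fitch parsimony and then applies the usual definition RI = (g - s) / (g - m), where s is the number of changes on the tree, m the minimum possible, and g the maximum possible. The function names and the eight-taxon toy tree are hypothetical and are not taken from the study, whose actual pipeline may differ.

```python
def fitch_steps(tree, states):
    """Count the minimum number of state changes on a rooted binary tree."""
    steps = 0

    def visit(node):
        nonlocal steps
        if isinstance(node, str):            # leaf: its observed state
            return {states[node]}
        left, right = (visit(child) for child in node)
        shared = left & right
        if shared:                           # children agree: no change needed
            return shared
        steps += 1                           # children disagree: one change
        return left | right

    visit(tree)
    return steps


def leaves(tree):
    """List the leaf names of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


def retention_index(tree, states):
    """Retention index of one binary character (1 = fold present, 0 = absent)."""
    values = [states[name] for name in leaves(tree)]
    present = sum(values)
    g = min(present, len(values) - present)  # maximum steps on any tree
    m = 1                                    # minimum steps for a binary character
    if g <= m:
        return None                          # uninformative character: RI undefined
    s = fitch_steps(tree, states)            # steps observed on this tree
    return (g - s) / (g - m)


# Hypothetical toy tree with two clades of four species each.
toy_tree = ((("A", "B"), ("C", "D")), (("E", "F"), ("G", "H")))

# A fold restricted to one clade behaves as a perfect synapomorphy.
clade_fold = dict(A=1, B=1, C=1, D=1, E=0, F=0, G=0, H=0)
# A fold scattered across the tree is fully homoplastic.
scattered_fold = dict(A=1, B=0, C=1, D=0, E=1, F=0, G=1, H=0)

print(retention_index(toy_tree, clade_fold))      # 1.0
print(retention_index(toy_tree, scattered_fold))  # 0.0
```

In this sketch, a fold present in every species (or in only one) returns `None`, because such a character cannot support any grouping.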
Figure 2. (a) Projection of species onto the first two dimensions of a principal component analysis of fold distribution. Colors refer to the three superkingdoms of life: yellow refers to bacteria, green to eukaryotes, and blue to archaea. (b) Protein fold contributions to the species distribution in the same two dimensions of the principal component analysis. Arrows show the folds that contribute most to this species distribution; arrow length shows the strength of the contribution. Four clusters are distinguished with the following color code: blue marks folds discriminating eukaryotes; pink and purple mark folds discriminating archaea and bacteria; and orange marks folds that also discriminate bacteria from archaea, as well as photosynthetic eukaryotes from other eukaryotes. (c) The same clusters of folds shown on the heatmap. Blue folds are markedly distributed among eukaryotes, pink folds are markedly shared by eukaryotes and bacteria, purple folds by eukaryotes and archaea, and orange folds by bacteria and photosynthetic eukaryotes.
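The two quantities plotted in panels (a) and (b), species coordinates and fold contributions, can be sketched as follows. This is a minimal illustration that assumes a species-by-folds presence/absence matrix and uses a plain singular value decomposition from NumPy. The function name `pca_scores` and the toy matrix are hypothetical, and the study's own PCA settings (scaling, software) are not stated here.

```python
import numpy as np


def pca_scores(presence, n_components=2):
    """PCA of a species-by-folds 0/1 matrix via SVD of the centered data."""
    X = np.asarray(presence, dtype=float)
    X = X - X.mean(axis=0)                          # center each fold (column)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]  # species coordinates
    loadings = Vt[:n_components].T                   # fold contributions
    explained = S**2 / np.sum(S**2)                  # variance share per axis
    return scores, loadings, explained[:n_components]


# Hypothetical toy data: 6 species (rows) by 5 folds (columns).
# Folds 0-2 separate two groups of species, fold 3 is noisy,
# and fold 4 is universal, so it carries no variance.
toy_matrix = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 0, 1],
]

scores, loadings, explained = pca_scores(toy_matrix)
print(scores[:, 0])    # first-axis coordinates: the two groups take opposite signs
print(loadings[:, 0])  # folds 0-2 dominate; the universal fold contributes ~0
```

The sign of each axis is arbitrary in any PCA, so only the relative positions of species and the relative lengths of fold arrows are meaningful.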
Figure 3. Heatmap of folds and species as in Figure 1, with a color code showing groups of folds shared between two superkingdoms or two distant clades. Three types of groups are extracted with the Dynamic Tree Cut algorithm. The black groups are folds shared by all species. The blue and red groups are folds shared between eukaryotes and one of the two other superkingdoms. The green and orange groups are folds shared between bacteria and photosynthetic eukaryote groups.
Figure 4. Heatmap of groups of eukaryotic folds. Each group extracted with the Dynamic Tree Cut algorithm matches a eukaryotic clade. The dark blue groups are folds specific to all eukaryotes. The light green and red groups are folds specific to photosynthetic clades. The other colors mark nested clades from Opisthokonta to Nematoda.
List of putative fold synapomorphies found within eukaryotes, for the 11 blocks that are specifically associated with clades (monophyletic taxonomic groups) and reliably supported by at least three folds with high RI
| Clade | Folds (from SCOP) |
|---|---|
| Nematoda | e.76 (Viral glycoprotein ectodomain‐like), d.62 (pepsin inhibitor‐3), a.226 (Her‐1), b.169 (MFPT repeat‐like) |
| Ecdysozoa | a.260 (Rhabdovirus nucleoprotein‐like), b.102 (Methuselah ectodomain), a.85 (hemocyanin, N‐terminal domain), a.163 (crustacean CHH/MIH/GIH neurohormone) |
| Tetrapoda | a.206 (P40 nucleoprotein), h.3 (Stalk segment of viral fusion proteins), a.61 (retroviral matrix proteins), b.20 (ENV polyprotein, receptor‐binding domain), h.6 (apolipoprotein A‐II), g.77 (resistin), g.9 (defensin‐like), b.63 (oncogene products), d.234 (proguanylin), a.101 (uteroglobin‐like), a.212 (KRAB domain [Kruppel‐associated box]), d.5 (RNase A‐like) |
| Gnathostomata | a.109 (Class II MHC‐associated invariant chain ectoplasmic trimerization domain), d.6 (prion‐like), d.9 (IL8‐like), d.19 (MHC antigen‐recognition domain), d.288 (GTF2I‐like repeat) |
| Vertebrata | h.7 (Synuclein), g.25 (heparin‐binding domain from vascular endothelial growth factor), f.50 (Connexin43), a.126 (serum albumin‐like), a.26 (4‐helical cytokines) |
| Metazoa | b.54 (Core binding factor beta, CBF), d.200 (integrin beta tail domain), g.1 (insulin‐like), a.77 (DEATH domain), g.28 (thyroglobulin type‐1 domain), g.27 (FnI‐like domain), d.164 (SMAD MH1 domain), g.62 (cysteine‐rich DNA binding domain [DM domain]), g.17 (cystine‐knot cytokines), a.277 (TAFH domain‐like), g.76 (hormone receptor domain), g.22 (serine protease inhibitors), a.123 (nuclear receptor ligand‐binding domain), a.271 (SOCS box‐like), f.7 (lipovitellin‐phosvitin complex, beta‐sheet shell regions), d.217 (SAND domain‐like) |
| Choanozoa | b.22 (TNF‐like), g.73 (CCHHC domain), g.8 (BPTI‐like), a.194 (L27 domain), a.37 (A DNA‐binding domain in eukaryotic transcription factors) |
| Holozoa | g.64 (Somatomedin B domain), d.171 (fibrinogen C‐terminal domain‐like), d.170 (SRCR‐like), a.12 (Kix domain of CBP (creb binding protein)), a.135 (tetraspanin), a.215 (a middle domain of Talin 1), g.16 (Trefoil/Plexin domain‐like), g.12 (LDL receptor‐like module), a.256 (RUN domain‐like), g.65 (Notch domain), g.18 (complement control module/SCR domain), g.14 (Kringle‐like) |
| Opisthokonta | a.83 (Guanido kinase N‐terminal domain), d.246 (mRNA decapping enzyme DcpS N‐terminal domain), a.68 (Wiscott–Aldrich syndrome protein, WASP, C‐terminal domain), d.370 (BTG domain‐like), d.332 (RGC domain‐like), f.52 (ATP synthase B chain‐like), a.216 (I/LWEQ domain), g.20 (blood coagulation inhibitor (disintegrin)), g.52 (inhibitor of apoptosis [IAP] repeat), e.55 (Rap/Ran‐GAP), a.117 (Ras GEF), a.87 (DBL homology domain [DH‐domain]), a.205 (Hsp90 co‐chaperone CDC37), a.141 (Frizzled cysteine‐rich domain), a.221 (Lissencephaly‐1 protein [Lis‐1, PAF‐AH alpha] N‐terminal domain) |
| Angiospermae | a.220 (Hypothetical protein At3g22680), g.13 (crambin‐like), g.88 (intrinsically disordered proteins) |
| Embryophyta | g.69 (Plant proteinase inhibitors), a.52 (bifunctional inhibitor/lipid‐transfer protein/seed storage 2S albumin), b.162 (At5g01610‐like), b.143 (NAC domain) |