Svetlana Karamycheva, Yuri I Wolf, Erez Persi, Eugene V Koonin, Kira S Makarova.
Abstract
BACKGROUND: Evolutionary rate is a key characteristic of gene families that is linked to the functional importance of the respective genes as well as to the specific biological functions of the proteins they encode. Accurate estimation of evolutionary rates is a challenging task that requires precise phylogenetic analysis. Here we present an easy-to-estimate, protein family-level measure of sequence variability based on alignment column homogeneity in multiple alignments of protein sequences from Clade-Specific Clusters of Orthologous Genes (csCOGs).
Keywords: Clusters of orthologous genes; Evolutionary reconstructions; Paralogs; Variability
Year: 2022 PMID: 36042479 PMCID: PMC9425974 DOI: 10.1186/s13062-022-00337-7
Source DB: PubMed Journal: Biol Direct ISSN: 1745-6150 Impact factor: 7.173
Fig. 1 Pipeline for protein variability analysis. Homogeneity values are calculated for each position of multiple alignments of clade-specific COG (csCOG) sequences (top left). Homogeneity profiles along the sequences are smoothed and converted to distributions of the homogeneity values (top middle). Distances between the homogeneity value distributions are used to embed csCOGs into a metric space (top right). Homogeneity values, scaled by the average homogeneity across the clade, are transformed into variabilities (bottom middle). csCOG-specific values form clade-level distributions (bottom left). Position-specific variability values allow alignment sites to be categorized as conserved, intermediate, or variable; the relative frequencies of these classes, plotted on a simplex diagram, identify csCOGs with unusual conservation patterns (bottom right)
Fig. 2 Distribution of variability values across clade-specific COGs. Gaussian kernel-smoothed probability density functions for variability values in clade-specific pangenomes (plots for eight clades are shown). Threshold values for conserved (variability v < 0.5), intermediate (0.5 < v < 2), and variable (v > 2) csCOGs are indicated
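The thresholds named in the Fig. 2 caption can be expressed as a small classifier. The following is a minimal sketch, not the authors' code: the function names `classify_variability` and `class_fractions` and the example values are hypothetical. Because the caption uses strict inequalities, placing values exactly equal to 0.5 or 2 in the intermediate class is an assumption made here.

```python
from collections import Counter

CONSERVED_MAX = 0.5  # caption: conserved if v < 0.5
VARIABLE_MIN = 2.0   # caption: variable if v > 2


def classify_variability(v: float) -> str:
    """Assign a variability value to one of the three classes."""
    if v < CONSERVED_MAX:
        return "conserved"
    if v > VARIABLE_MIN:
        return "variable"
    # Assumption: boundary values (exactly 0.5 or 2) count as intermediate.
    return "intermediate"


def class_fractions(values):
    """Return the relative frequency of each class; the three fractions sum to 1."""
    if not values:
        raise ValueError("at least one variability value is required")
    counts = Counter(classify_variability(v) for v in values)
    return {
        name: counts.get(name, 0) / len(values)
        for name in ("conserved", "intermediate", "variable")
    }


# Hypothetical variability values, for illustration only.
example = [0.25, 0.57, 0.61, 2.4, 0.07]
print(class_fractions(example))
```

Three fractions that sum to 1 are also the kind of coordinates a simplex (ternary) diagram needs, although the thresholds used for individual alignment sites are not stated in the captions.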
Fig. 3 Association of protein variability with other genomic and biological features. A Fraction (in percent) of the variance of protein variability explained by other properties. The “total explained” fraction is estimated using multivariable regression. The fraction explained by individual properties is estimated using ANOVA. Cells corresponding to properties excluded by Akaike criterion-based stepwise reduction of the multivariable regression model are shaded in gray. B Average variability of subsets of genes categorized by other properties. C Average variability of subsets of genes categorized by COG functional categories. The functional categories are the following: J: Translation, ribosomal structure and biogenesis; K: Transcription; L: Replication, recombination and repair; D: Cell cycle control, cell division, chromosome partitioning; V: Defense mechanisms; T: Signal transduction mechanisms; M: Cell wall/membrane/envelope biogenesis; N: Cell motility; W: Extracellular structures; O: Posttranslational modification, protein turnover, chaperones; X: Mobilome: prophages, transposons; C: Energy production and conversion; G: Carbohydrate transport and metabolism; E: Amino acid transport and metabolism; F: Nucleotide transport and metabolism; H: Coenzyme transport and metabolism; I: Lipid transport and metabolism; P: Inorganic ion transport and metabolism; Q: Secondary metabolites biosynthesis, transport and catabolism; R: General function prediction only; S: Function unknown. The color scale from blue to red is proportional to the value
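For a single categorical property, the fraction of variance that a one-way ANOVA attributes to that property is the ratio of the between-group sum of squares to the total sum of squares (often called eta squared). The sketch below illustrates that ratio under the assumption of a one-way design. It is not the authors' analysis, and the function name `explained_fraction` and the toy data are hypothetical.

```python
import numpy as np


def explained_fraction(values, groups):
    """Eta squared: between-group sum of squares divided by total sum of squares."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    grand_mean = values.mean()
    ss_total = np.sum((values - grand_mean) ** 2)
    if ss_total == 0:
        return 0.0  # no variance at all, so nothing to explain
    ss_between = 0.0
    for label in np.unique(groups):
        member = values[groups == label]
        ss_between += member.size * (member.mean() - grand_mean) ** 2
    return float(ss_between / ss_total)


# Hypothetical variability values split by a hypothetical two-level property.
variability = [0.3, 0.4, 2.1, 2.5]
is_membrane = ["no", "no", "yes", "yes"]
print(f"{100 * explained_fraction(variability, is_membrane):.1f}% of variance explained")
```

Multiplying the result by 100 gives a percentage comparable in kind to the values described for panel A.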
Fig. 4 Multidimensional scaling analysis of variability values and selected features. Homogeneity distribution density was calculated for each csCOG as described in Materials and Methods. Classical multidimensional scaling (cmdscale function in R) was applied to visualize the relationships between csCOGs. Hellinger distance (one of the conceptually simplest distance measures, which is also symmetrical and metric) was used to quantify the similarity between each pair of probability distributions. Results for the first two dimensions were used to construct the plots. Variability of the data points is shown as follows: conserved (0–0.5): light blue; medium (0.5–2.0): light gray; variable (> 2.0): dark blue. The following features are overlaid onto the points: presence in the set of core genes as red dots; high gain rate (> 2.5) as magenta dots; membrane (csCOGs with the average fraction of proteins with predicted transmembrane segments > 0.333) as dark green dots; secreted (csCOGs with the average fraction of proteins with signal peptide > 0.333), microsatellite-like regions (the average fraction of protein sequences in the csCOG identified >= 0.15) as orange dots; high paralogy (> 2.0) as dark gray dots
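The two steps named in the Fig. 4 caption, Hellinger distances between distributions followed by classical multidimensional scaling, can be sketched with NumPy. This is a hedged illustration rather than the authors' pipeline, which used the `cmdscale` function in R. The function names `hellinger` and `classical_mds` and the toy histograms are hypothetical, and the distributions are assumed to be discrete histograms over a shared set of bins.

```python
import numpy as np


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (range 0 to 1)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()  # normalize so each histogram sums to 1
    q = q / q.sum()
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))


def classical_mds(dist, k=2):
    """Classical (Torgerson) MDS: double-center squared distances, then eigendecompose."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]        # indices of the k largest
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Hypothetical homogeneity histograms for four csCOGs over three shared bins.
histograms = np.array([[8, 1, 1], [1, 8, 1], [1, 1, 8], [4, 4, 2]], dtype=float)
n_items = len(histograms)
dist = np.array([[hellinger(histograms[a], histograms[b]) for b in range(n_items)]
                 for a in range(n_items)])
coords = classical_mds(dist, k=2)  # one row of 2D coordinates per csCOG
print(coords.shape)
```

The rows of `coords` are what one would plot and then color by variability class or overlay with features such as core-gene membership.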
Fig. 5 Breakdown of high variability protein families by presence in 1–8 other analyzed lineages. Numbers on the plot indicate the actual number of csCOGs with high variability (> 2.0) that are present in the given number of genomes; the plots for each family are scaled to 100%
COGs that are among hypervariable families in both bacteria and archaea (clade columns give the number of csCOGs*)
| COG number | Function | Gene name | Description | Deinococcus-V1 | Deinococcus-V2 | Flavobacteriales-V1 | Flavobacteriales-V2 | Haloferacales-V1 | Haloferacales-V2 | Methanosarcinales-V1 | Methanosarcinales-V2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| COG0438 | M | RfaB | Glycosyltransferase involved in cell wall biosynthesis | 6 | 5 | 14 | 10 | 8 | 4 | 4 | 10 |
| COG0456 | J | RimI | Ribosomal protein S18 acetylase RimI and related acetyltransferases | 10 | 3 | 7 | 7 | 15 | 2 | 5 | 3 |
| COG0463 | M | WcaA | Glycosyltransferase involved in cell wall biosynthesis | 2 | 1 | 12 | 5 | 8 | 2 | 5 | 3 |
| COG0531 | E | PotE | Serine transporter YbeC, amino acid:H+ symporter family | 0 | 0 | 4 | 1 | 1 | 1 | 5 | 1 |
| COG0671 | I | PgpB | Membrane-associated phospholipid phosphatase | 2 | 1 | 3 | 1 | 1 | 1 | 0 | 1 |
| COG0747 | E | DdpA | ABC-type transport system, periplasmic component | 9 | 1 | 1 | 0 | 3 | 3 | 1 | 1 |
| COG0842 | V | YadH | ABC-type multidrug transport system, permease component | 2 | 0 | 4 | 2 | 1 | 1 | 2 | 2 |
| COG1131 | V | CcmA | ABC-type multidrug transport system, ATPase component | 5 | 1 | 8 | 1 | 8 | 1 | 3 | 2 |
| COG1216 | G | WcaE | Glycosyltransferase, GT2 family | 0 | 1 | 6 | 1 | 0 | 2 | 0 | 1 |
| COG1846 | K | MarR | DNA-binding transcriptional regulator, MarR family | 9 | 3 | 3 | 1 | 6 | 2 | 7 | 1 |
| COG2226 | H | UbiE | Ubiquinone/menaquinone biosynthesis C-methylase UbiE/MenG | 8 | 3 | 3 | 2 | 16 | 2 | 14 | 7 |
| COG2244 | M | RfbX | Membrane protein involved in the export of O-antigen and teichoic acid | 1 | 2 | 4 | 2 | 0 | 2 | 6 | 4 |
| COG2814 | G | AraJ | Predicted arabinose efflux permease AraJ, MFS family | 23 | 4 | 3 | 3 | 18 | 2 | 6 | 2 |
*V1: low and medium variability csCOGs; V2: high variability csCOGs; Haloferacales family, Sulfolobales family, Thermococcales family, Methanosarcinales family and bacteria: Flavobacteriales family, Deinococcus genus, Paenibacillus genus, Rhodococcus genus
Fig. 6 Evolutionary history of the sulfo9.00007 family of WcaE-like glycosyltransferases. The neighborhoods of all genes from sulfo9.00007 are mapped to the 16S rRNA tree of the Sulfolobales genomes analyzed in this work. For each gene neighborhood, the GenBank accession and coordinates of the locus are indicated on the right. Genes are shown by block arrows, roughly to scale. The csCOG number is indicated for all genes and follows the gene name (if available). For genes that belong to the respective arCOGs, the cluster number corresponds to the respective arCOG number. Members of sulfo9.00007 are colored in shades of blue according to the phylogenetic analysis of WcaE-like glycosyltransferases (clades A-E, Additional file 3: Fig. S2). Other glycosyltransferases assigned to COG1216 but not to sulfo9.00007 are shown by a blue outline. The closest, most frequent gene neighbors are shown in yellow (FabG) and pink (WsaA)
Protein families with a high fraction of conserved and variable positions
| csCOG identifier | COG | Func | Gene | Description | Comment |
|---|---|---|---|---|---|
| flavo9.00376 | COG1158 | K | Rho | Transcription termination factor Rho | Mostly Bacteroidetes |
| flavo9.00582 | COG1314 | U | SecG | Protein translocase subunit SecG | All Bacteroidetes, but also in some other bacteria such as Chlorobia, some Proteobacteria, Spirochaetes; others do not possess the variable tail |
| flavo9.00756 | – | – | – | – | xre family HTH (N-terminal), the loop is present mostly in Bacteroidetes, but seen in some Bacilli too |
| flavo9.00944 | COG4807 | S | YehS | Uncharacterized conserved protein YehS, DUF1456 family | Specific for Flavobacterium |
| deino9.00350 | – | – | – | – | An artefact: wrong ORF start in some of these genes |
| deino9.00475 | COG1722 | L | XseB | Exonuclease VII small subunit | Variable tail in other bacteria too |
| deino9.00842 | COG0511 | I | AccB | Biotin carboxyl carrier protein | PA-rich, present in most bacteria |
| deino9.01337 | – | – | – | – | Uncharacterized, small, Deinococcus specific |
| deino9.01490 | COG0568 | K | RpoD | DNA-directed RNA polymerase, sigma subunit (sigma70/sigma32) | Specific N-terminal extension in Deinococci and Truepera, although partially low complexity region is present in Thermus |
| deino9.03407 | COG0199 | J | RpsN | Ribosomal protein S14 | Xenologous gene displacement by zinc finger variant in some Deinococci |
| paen9.00611 | COG1937 | K | FrmR | DNA-binding transcriptional regulator, FrmR family | Copper-sensitive operon repressor, variable N-terminal region is present in many other Firmicutes |
| paen9.00802 | – | – | – | YycC-like protein, PF14174.7 | Paenibacillus specific variable tail |
| paen9.00805 | COG3874 | S | YtfJ | Uncharacterized spore protein YtfJ | Sporulation protein YtfJ; variable region is present in many sporulating Bacilli, but variable tail is rather specific for Paenibacillus |
| paen9.00958 | COG1674 | D | FtsK | DNA segregation ATPase FtsK/SpoIIIE or related protein | Variable insertion is present in all Bacilli and other bacteria, in Paenibacillus these regions are longer |
| paen9.01226 | COG0323 | L | MutL | DNA mismatch repair ATPase MutL | Common feature among some archaea and some bacteria |
| paen9.01699 | COG4467 | L | YabA | Regulator of replication initiation timing YabA | Variable insertion is present in all Firmicutes and other bacteria, in Paenibacillus these regions are longer [ |
| paen9.02368 | COG0532 | J | InfB | Translation initiation factor IF-2, a GTPase | Variable insertion is present in all Firmicutes (very different lengths), in Paenibacillus these regions are longer, but not the longest among Firmicutes. In many other bacteria the insertion is much smaller [ |
| rhodo7.000637 | COG1826 | U | TatA | Twin-arginine protein secretion pathway components TatA and TatB | Variable tail is specific for at least actinobacteria |
| rhodo7.001015 | COG5416 | S | YrvD | Uncharacterized integral membrane protein YrvD | Variable N-terminal region specific for actinobacteria, but not others |
| rhodo7.001149 | COG2409 | S | YdfJ | Predicted lipid transporter YdfJ, MMPL/SSD domain, RND superfamily | Variable tail region specific for actinobacteria, but not others; sometimes the tail is missing in actinobacteria too |
| rhodo7.001169 | – | – | – | lipid droplet-associated protein | Found in lipid droplets in |
| rhodo7.001269 | COG1158 | K | Rho | Transcription termination factor Rho | N-terminal variable region specific for actinobacteria |
| rhodo7.001344 | COG0328 | L | RnhA | Ribonuclease HI | Variable region is present in many bacteria |
| rhodo7.001562 | COG1862 | U | YajC | Protein translocase subunit YajC | Variable region is present in many bacteria |
| rhodo7.001949 | COG0305 | L | DnaB | Replicative DNA helicase | Some contain intein |
| thermo9.00277 | (arCOG04026) | – | – | Pilin/Flagellin, contains class III signal peptide | Thermococcus specific, not present elsewhere |
| halo9.00332 | COG0323 | L | MutL | DNA mismatch repair enzyme (predicted ATPase) | Common feature among some archaea and some bacteria |
| halo9.00351 | COG1885 | S | – | Uncharacterized protein, DUF555 family | Uncharacterized, variable tail present in Methanosarcina, but not in a few other euryarchaea |
| halo9.00421 | COG4530 | S | – | Uncharacterized protein | Uncharacterized DUF5806; variable N-terminal region specific for Halobacteria; some have a CxxCxHxxH motif |
| halo9.00587 | COG0805 | U | – | Sec-independent protein translocase protein TatC | Variable N-terminal region specific for Halobacteria |
| halo9.00602 | COG0552 | U | – | Signal recognition particle-docking protein FtsY | N-terminal variable region present in many euryarchaea |
| halo9.00879 | COG1474 | L | – | orc1/cdc6 family replication initiation protein | N-terminal region specific for Haloferacales |
| halo9.00317 | COG0358 | L | DnaG | DNA primase (bacterial type) | Common feature among euryarchaea |
| methano7.000496 | COG1311 | L | HYS2 | Archaeal DNA polymerase II, small subunit/DNA polymerase delta, subunit B | Specific for Methanosarcina |
Selected functionally uncharacterized protein families with low variability that are present in 85% or more of the genomes in the respective lineage
| csCOG | Genome number | Proteins number | Variability | COG (arCOG)* | Pfam (DUF) | Comment |
|---|---|---|---|---|---|---|
| sulfo9.02117 | 52 | 52 | 0.25 | COG1698 (arCOG04308) | | Essential [ |
| sulfo9.02278 | 52 | 52 | 0.26 | (arCOG08212) | | |
| sulfo9.01977 | 52 | 52 | 0.29 | (arCOG05886) | | Essential [ |
| sulfo9.00722 | 52 | 52 | 0.57 | COG1888 (arCOG04140) | | PDB: 3BPD; ferredoxin fold |
| sulfo9.01763 | 52 | 52 | 0.61 | COG4755 (arCOG04123) | DUF2153 | Linked to Trm112 RNA methyltransferase activating protein |
| halo9.02555 | 37 | 40 | 0.36 | COG1885 (arCOG02119) | DUF555 | Single CxxC, weak similarity to CREN7 |
| halo9.01859 | 37 | 37 | 0.38 | (arCOG04616) | DUF5800 | |
| halo9.01783 | 37 | 37 | 0.39 | (arCOG04777) | | |
| halo9.02264 | 36 | 36 | 0.28 | (arCOG04587) | | Linked to glutaredoxin family protein |
| halo9.02689 | 32 | 32 | 0.23 | (arCOG03655) | | Linked to Anion-transporting ATPase ArsA |
| halo9.02039 | 37 | 37 | 0.49 | COG2412 (arCOG04051) | DUF424 | PDB: 2QYA; linked to TPR repeats containing protein |
| thermo9.00526 | 40 | 40 | 0.46 | (arCOG04849) | | Linked to Ribosome biogenesis GTPase A |
| thermo9.01167 | 41 | 41 | 0.3 | COG2412 (arCOG04051) | DUF424 | Linked to NMD protein affecting ribosome stability and mRNA decay |
| thermo9.01884 | 41 | 41 | 0.32 | (arCOG05846) | | Linked to Transcription initiation factor IIE, alpha subunit |
| thermo9.01623 | 41 | 41 | 0.36 | COG1885 (arCOG02119) | DUF555 | Linked to Uncharacterized protein, DUF357 family |
| thermo9.02768 | 42 | 43 | 0.2 | COG1888 (arCOG04140) | | Linked to ArsR transcriptional regulators; PDB: 2X3D [ |
| thermo9.01533 | 42 | 42 | 0.31 | COG1531 (arCOG01302) | | Linked to MBL-fold metallohydrolase superfamily; predicted RNA cyclic group end recognition domain [ |
| thermo9.01369 | 42 | 42 | 0.42 | (arCOG05869) | | PDB: 2K4N; linked to 23S rRNA G2069 N7-methylase RlmK or C1962 C5-methylase RlmI |
| methano7.000565 | 41 | 48 | 0.48 | COG4744 (arCOG03208) | DUF2149 | Membrane protein; linked to biopolymer transport protein TolQ |
| methano7.001417 | 41 | 41 | 0.48 | COG3377 (arCOG04424) | DUF1805 | PDB: 1QW2; linked to tRNA G10 N-methylase Trm11 |
| methano7.001273 | 41 | 41 | 0.45 | COG4050 (arCOG04903) | DUF2112 | In a conserved context with uncharacterized protein, DUF2102 family and others; single CxxC motif; methanogenesis marker 5 |
| methano7.001697 | 41 | 41 | 0.4 | (arCOG04388) | | Linked to Uncharacterized protein, DUF2551 family |
| methano7.001273 | 41 | 41 | 0.45 | COG4050 (arCOG04903) | DUF2102 | Methanogenesis marker 6; linked to DUF2112 |
| flavo9.00782 | 50 | 50 | 0.47 | – | DUF4286 | Linked to outer membrane protein assembly factor BamD |
| flavo9.01459 | 50 | 50 | 0.45 | – | | Linked to RuvX, Holliday junction resolvase; SRPBCC domain, Hsp90 cochaperone in yeast [ |
| flavo9.00789 | 50 | 50 | 0.45 | – | DUF2797 | Linked to GH3 auxin-responsive promoter; contains Zn ribbon |
| flavo9.01638 | 50 | 50 | 0.30 | – | | SRPBCC domain, also see flavo9.01459 |
| flavo9.02618 | 50 | 50 | 0.30 | – | DUF4254 | Linked to ADP-heptose:LPS heptosyltransferase, RfaF |
| deino9.00587 | 33 | 33 | 0.34 | – | | Annotated as quinate 5-dehydrogenase; present in Thermus and other bacteria |
| deino9.01277 | 33 | 33 | 0.35 | – | DUF4385 | Linked to DNA-binding ferritin-like protein Dps; present in Thermus |
| deino9.00288 | 33 | 33 | 0.45 | – | | Linked to uncharacterized membrane protein, Outer membrane protein assembly factor BamB, contains PQQ-like beta-propeller repeat; secreted; present in Thermus |
| deino9.01656 | 33 | 33 | 0.49 | – | | |
| deino9.02309 | 32 | 32 | 0.33 | – | DUF1844 | Linked to D-Tyr-tRNA(Tyr) deacylase |
| paen9.03935 | 66 | 66 | 0.22 | COG4472 | DUF965 | Linked to Alanyl-tRNA synthetase, AlaS; homolog of IreB, acting as a negative regulator of cephalosporin resistance [ |
| paen9.05835 | 66 | 66 | 0.34 | – | | Next to uncharacterized protein YrrD, contains PRC-barrel domain and Cysteine sulfinate desulfinase/cysteine desulfurase or related enzyme; Zn ribbon domain |
| paen9.02641 | 66 | 66 | 0.37 | – | | YokU-like protein, putative antitoxin RelE fold family |
| paen9.02767 | 66 | 66 | 0.39 | – | | Linked to uncharacterized membrane protein SpoIIM, required for sporulation |
| paen9.02361 | 66 | 66 | 0.4 | – | DUF1499 | |
| rhodo7.006964 | 53 | 53 | 0.07 | – | DUF2469 | Often found in Actinomycetes clustered with signal peptidase and/or RNAse HII |
| rhodo7.004823 | 53 | 53 | 0.14 | – | DUF3039 | Possibly metal-binding; Hx(20)C…CxxC motif |
| rhodo7.005227 | 53 | 54 | 0.159 | – | DUF3151 | Linked to Uncharacterized membrane protein YgaE, UPF0421/DUF939 family |
| rhodo7.003034 | 53 | 53 | 0.253 | – | DUF4191 | 2TM domain, in operon with Lipoate synthase LipA |
| rhodo7.002008 | 53 | 53 | 0.615 | – | DUF3090 | Contains CxxC..HxC motif, putative metal-binding protein |