Literature DB >> 31856202

The protein architecture in Bacteria and Archaea identifies a set of promiscuous and ancient domains.

Rafael Hernandez-Guerrero1, Edgardo Galán-Vásquez2, Ernesto Pérez-Rueda1,3.   

Abstract

In this work, we describe a systematic comparative genomic analysis of promiscuous domains in genomes of Bacteria and Archaea. A quantitative measure of domain promiscuity, the weighted domain architecture score (WDAS), was used and applied to 1317 domains in 1320 genomes of Bacteria and Archaea. A functional analysis associated with the WDAS per genome showed that 18 of 50 functional categories were identified as significantly enriched in the promiscuous domains; in particular, small-molecule binding domains, transferases domains, DNA binding domains (transcription factors), and signal transduction domains were identified as promiscuous. In contrast, non-promiscuous domains were identified as associated with 6 of 50 functional categories, and the category Function unknown was enriched. In addition, the WDASs of 52 domains correlated with genome size, i.e., WDAS values decreased as the genome size increased, suggesting that the number of combinations at larger domains increases, including domains in the superfamilies Winged helix-turn-helix and P-loop-containing nucleoside triphosphate hydrolases. Finally, based on classification of the domains according to their ancestry, we determined that the set of 52 promiscuous domains are also ancient and abundant among all the genomes, in contrast to the non-promiscuous domains. In summary, we consider that the association between these two classes of protein domains (promiscuous and non-promiscuous) provides bacterial and archaeal cells with the ability to respond to diverse environmental challenges.

Entities:  

Mesh:

Substances:

Year:  2019        PMID: 31856202      PMCID: PMC6922389          DOI: 10.1371/journal.pone.0226604

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Since Wetlaufer [1] described that consecutive residues in polypeptide chains tend to fold into more or less compact modules called domains, it has been generally accepted that domains are the protein evolutionary modules, and modular reuse has been demonstrated in all domains of life [2]. This modularity might be advantageous to the organisms, allowing signaling proteins to expand their regulatory linkages and may elicit a broader range of control mechanisms either via modular combinations or through modulation of inter-modular linkages [3]. Therefore, proteins are modular, and it has been suggested that large proteins contain multiple domains, either with the same or with different structural folds [4]. In this regard, comparative analyses of proteins have revealed a substantial fraction of multidomain proteins in eukaryotic organisms [5, 6]. Similar conclusions have been reached through the analysis of domain combinations, namely, that more complex organisms display greater proportions of domain combinations and duplications [7, 8]. Recently, some of these domain combinations have been identified as stable during evolution, whereas others have been highly mobile [9]. Accordingly, domains are defined as promiscuous if they are combined with many other domains, suggesting that such combination of domains with different structures and functions is a major source of diversity and modulation of molecular functionality. A second group of proteins whose domain architecture is conserved has been identified, and this group contains domains with few associations, or monolithic domains. A recent analysis showed that the increase in domain combinations of 19 families of DNA-binding transcription factors (TFs) correlated with the complexity of the organisms studied; however, it is uncertain if this correlation can be extended to additional domains [10] or is a particular observation for proteins devoted to gene regulation. Therefore, in this work we evaluated the tendency of protein domains to combine with other domains, i.e., domain promiscuity, in proteins of bacterial and archaeal genomes. We assigned domains based on a Superfamily hidden Markov models (HMM), and a quantitative measure of domain promiscuity, the weighted domain architecture score (WDAS), was used to analyze 1319 non-redundant bacterial and archaeal genomes. We further attempted to correlate this measure with function, genome size and protein fold ancestrality.

Materials and methods

Prokaryotic genomes

A total of 5321 genomes corresponding to Bacteria and Archaea domains were downloaded from the NCBI server and filtered to exclude redundancy, using the criteria described in [11]. We performed comparative analyses using the open reading frames (ORFs) that encode predicted proteins in all organisms. Hence, a total of 1214 bacteria; and 105 archaeal non-redundant genomes were considered for this analysis.

Superfamily domain assignments

The predicted protein sequences (the proteome) in all organisms were scrutinized to assess their domain organization by using the Superfamily database assignments [12]. A library of 1659 Superfamily HMMs was compared to the protein sequences by using the script hmmscan.pl (HMMer 3.1b2 [13]) provided by Superfamily (v. 2015) that considers the following parameters: -E 10 -Z 15438 -hmmscan -threads 4. To generate consistent assignments from the output files, we ran the script ass3.pl -t n -f 4 -e 0.0001 (this e-value is considered as conservative default).

Evaluation of domain architecture diversity

The WDAS, a measure of domain architecture diversity, was computed as described elsewhere [14]. The WDAS considers the proteins containing a domain and the total number of proteins per genome via the inverse abundance frequency (IAF) statistic, as follows: where P is the total proteins per genome and P is the number of proteins with domain d. To determine the diversity of architectures associated with a specific domain, the inverse variability (IV) is obtained from the inverse of the number of distinct partner domain families at the N- and C-terminal sides adjacent to a domain. The definition of the IV of a domain, d, is: where f is the number of different domain families adjacent to domain d. Finally, the WDAS of a domain is the product of the IAF and the IV: The associations between the WDAS and genome size (measured in ORFs) were binned in 13 intervals without overlaps, with a width of 836 ORFs, as previously described [10]. In brief, the number of windows was calculated by using the Sturges’s formula: k = 1+ log2N where k is the number of equal classes and N the number of data, rounding to the nearest integer, the k value. Thus, the width of classes was determined with the equation: c = R/k where R = high value–low value (Genome size in ORFs). Values resulting from the application of the above formulas were k = 13 and c = 836, thus 13 windows without overlaps were used with a width of 836 ORFs [10].

Promiscuity identification

In order to define the threshold to separate a domain as promiscuous and non-promiscuous, we computed an accumulative plot for the number of enriched functions associated to promiscuous and non-promiscuous domains at different scores of WDASs. In a posterior step, both datasets were adjusted with polynomial regression (S1 Fig). From this plot we found an increasing between the first, second and third intervals, that corresponds to values between 0 and 3.0 of WDAS. The numbers of enriched functions decays when a value of 0 to 4.0 was consider. Therefore, a WDAS of 3.0 could be considered as a threshold to identify promiscuous and non-promiscuous domains. This found is consistent, when the number of total domains per interval was consider, i.e. the number of domains exhibited a slight increase when values between 0 and 3.0 of WDAS were considered and that represents the 13.69% of the total of domains (S1 Fig). In summary, domains with a WDAS between 0 and 3.0 were defined as promiscuous.

Functional enrichment

For each genome, two lists of protein domains were considered, using a threshold of ≤2.7 for promiscuous and ≥2.8 for non-promiscuous domains. An enrichment analysis using a hypergeometric test was conducted on these lists, considering 50 different functional categories obtained from the Superfamily (SUPFAM) and Structural Classification of Protein (SCOP) databases. Such a distribution describes the probability of finding x domains associated with a particular category in a list of interest k, from a set of N domains, with m domains associated with the same category. The P-values were corrected for multiple testing with Benjamini-Hochberger method to decreases the false discovery rate, and a statistical significance at P-value of <0.05 was set.

Ancestrality of protein domains

To evaluate the protein domains ancestrality, superfamily assignments were associated to the phylogenomic tree of Protein Fold Families values previously described [15]. In brief, Caetano-Anolles et al. [15] considered a PSI-BLAST comparisons between PDB and genome sequences [16]; and the usage and sharing of protein folds was characterized with the: fold occurrence (G), average genome occurrence (Ḡ), and fraction of genomes harboring a fold (f). G defines how often a protein fold (i) occurs in a given proteome (j). Ḡ represents averages of G values. Ḡ and f measure the extent of fold sharing within each domain or combination of domains. Values were converted into linearly ordered multistate characters and normalized using an arbitrary scale (generally 0–20). Cumulative frequency plots were used to illustrate the accumulation of folds belonging to a protein class along a phylogenetic tree, and it is a function of distance in nodes from the root. These plots can be considered time plots of lineages [17] with a time axis defined in relative units.

Results and discussion

Distribution of domains according to superfamily assignments

It has been previously described that the increase in complexity of the domain organization of proteins would substantially contribute to the evolution of organismal complexity [2, 18]. In order to gain insights into protein architecture among prokaryotes, 1214 bacterial and 105 archaeal non-redundant genomes were analyzed in terms of their repertoire of protein domains. A coverage of assignment of 75.8% per genome, with some genomes with coverage of more than 90%, such as the gammaproteobacterium “Candidatus Evansia muelleri” (GCF_000953435.1), with coverage of more than 95% and representing the organism with the highest proportion of proteins with domain assignments. This result is consistent with the assignment percentage of proteins that were assigned with at least one Superfamily domain for Eukaryotes, Archaea, Bacteria and Viruses [12]. The genomes with the minor coverage of domain assignments corresponded to four mycoplasma genomes: Mycoplasma haemofelis Ohio2 (GCF_000186985.1), a hemoplasma species with a coverage of less than 30% and M. suis str. Illinois (GCF_000179035.2), M. ovis str. Michigan (GCF_000508245.1), and Spiroplasma kunkelii CR2-3x (GCF_001274875.1), with coverages of 33.4%, 33.6%, and 38.2%, respectively, suggesting that a large diversity of novel protein domains remains to be identified in those genomes and must be further explored (S1 Table). We evaluated all bacterial and archaeal proteins in terms of their domain organization, identifying that 46.7% of proteins contain one domain, 31.3% contain two domains, 11.2% have three domains, and 10.6% contain more than four domains. It is evident from this distribution that more than 53% are multidomain proteins. The proportion of multidomain proteins follows a linear distribution, where large genomes contain a large proportion of two-, three-, or four-domains proteins, suggesting a positive correlation (R2 = 0.955) between the abundance of multidomain proteins and the number of proteins per genome (S2 Fig and S1 Table), when a non-linear least squared to fit the dataset was used. In this regard, domains are present in various combinations in multidomain proteins. Thus, in this work, promiscuous domains were defined as domains that reside in many different combinations, i.e. combined with many other domains, a property that suggests a high degree of flexibility to combining with other different domains. Hence, to determine the diversity of structural domains associated with the bacterial and archaeal genomes, we evaluated the repertoire of all non-redundant domains per genome. As shown in Fig 1, organisms with reduced genomes (less than 2000 ORFs) contained a low proportion of non-redundant domains (less than 500 different protein domains) with an increase in organisms with more than 2000 ORFs (around 700 protein domains); however, the maximum number of different domains reached a plateau of around 800 different protein domains in organisms with more than 5000 ORFs. Therefore, small organisms exhibit a low diversity of different structural domains, suggesting that superfamilies at large genomes have increased their repertoire mainly via diverse duplication events followed by evolutionary divergence, as it has been previously suggested in eukarya [19]. In this manner, organisms could recycle protein domains to increase their repertoire and respond to diverse environmental challenges.
Fig 1

The total number of non-redundant domains follows a logarithmic distribution (A), whereas promiscuous domains follow a polynomial distribution (B). On the X-axis is the genome size (in ORFs), and on the Y-axis is the total number non-redundant (NR) domains identified. A non-linear least squared to fit the dataset was used.

The total number of non-redundant domains follows a logarithmic distribution (A), whereas promiscuous domains follow a polynomial distribution (B). On the X-axis is the genome size (in ORFs), and on the Y-axis is the total number non-redundant (NR) domains identified. A non-linear least squared to fit the dataset was used.

Protein architecture provides functional diversity to bacterial and archaeal genomes

Based on the previous result and considering the hypothesis of domain recycling, we evaluated the protein architectures of the 1317 superfamilies. To this end, the complete set of proteins in 1320 bacterial and archaeal genomes was analyzed in terms of structural domain composition, according to the formula described in methods, with the WDAS, a measure of domain promiscuity, determined and plotted as a function of the genome size (see S1 File). Based on the WDAS, we defined that values closer to 0 represent promiscuity (diversity of their protein domain architectures), and higher values suggest no diversity in the protein architecture and, consequently these protein domains must be considered non-promiscuous, or monolithic. To evaluate the set contribution of promiscuous and non-promiscuous domains to the total repertoire of domains associated with all the genomes, those domains with a WDAS between 0 and 3.0 (See methods) were considered promiscuous. Based on this definition, we found that an average less than 100 different domains per genome that corresponds to around 11.34% of the total protein domains per genome can be considered promiscuous (Fig 1), and species containing the highest proportions of this class of domains are also associated with large bacterial genomes, such as Archangium gephyr (GCF_001027285.1_ASM102728v1) and Sorangium cellulosum So0157-2 (GCF_000418325.1_ASM41832v1), among others. This finding is interesting because it shows that a low proportion of domains are associated with the highest number of architectural combinations, compared to non-promiscuous domains.

Functional association of promiscuous domains

Based on the definition of promiscuity, it was found that the proportion of promiscuous domains represents less than 10% of all non-redundant domains per genome, suggesting that a small fraction of the total repertoire of domains is determinant for the diversity of combinations of architectural identified so far. Therefore, in order to determine the preferential functions of promiscuous and non-promiscuos domains per genome, both datasets were evaluated, considering the functional annotations deposited in the SUPFAM and SCOP databases. In these databases, 50 different functional groups have been determined, and they are classified in eight broad categories (General; Information; Metabolism; Processes_EC, Processes_IC, and Regulation; Other; and N_A). From this analysis, 18 functional categories were identified as significantly enriched in the promiscuous domains in at least one genome, and in particular the Small-molecule binding domains were identified as enriched in 1219 genomes, whereas Signal transduction was enriched in 377 genomes and DNA-binding (transcription) and transferases were enriched in 153 and 118 genomes, respectively. Finally, the other 14 functional categories identified as enriched in minor proportions, such as Polysaccharide metabolism and Inorganic ion transport and metabolism and, were associated with 34 and 25 genomes, respectively (Fig 2).
Fig 2

Functional enrichment analysis on promiscuous and non-promiscuous domains per genome.

On the X-axis is the total number of genomes with an enriched function, and on the Y-axis are the functional categories.

Functional enrichment analysis on promiscuous and non-promiscuous domains per genome.

On the X-axis is the total number of genomes with an enriched function, and on the Y-axis are the functional categories. In contrast, non-promiscuous domains were associated with 6 functional categories, with Unknown function, Translation, ribosomes, ribosome biogenesis; tRNA metabolism and DNA replication, recombination, repair, the most enriched. These functions, associated to the non-promiscuous domains were identified in 207, 112 and 8 genomes, respectively. Therefore, functional associations were different between promiscuous and non-promiscuous domains, suggesting that bias to DNA binding, Transferases, Small-molecule binding and Signal transduction functions are specifically associated with promiscuous domains. Fig 2.

Protein architecture as a function of genome size

In our second approach, we considered the domains as a function of their correlation to genome size, where the correlation between a WDAS for a given domain and the genome size (number of ORFs) could be positive or negative and a set of uncorrelated domains could be identified. To this end, R-values were calculated and two large groups were defined: those negatively correlated and those with no correlation. Therefore, we were interested in the set of negatively correlated domains (R-values between -1 and -0.4), because they showed WDAS values decreasing as a function of genome size, i.e., protein domains are less promiscuous in small genomes and they are more promiscuous in large genomes. In addition, only protein domains identified in more than 60% of the total genomes or 770 different genomes were considered in this analysis, because we were interested in those domains with a wide genomic distribution for our analysis (See S1 File). From the complete set of 1317 protein domains analyzed, 52 protein domains had negative correlation values (R-values between -1 to -0.4), i.e., the number of architectures associated with them increased as a function of genome size (Table 1). In this category were, the protein domains primarily associated with small-molecule binding, such as (SSF:56176) FAD-binding domain, (SSF:52518) Thiamine diphosphate-binding fold (THDP-binding), (SSF:51905) FAD/NAD(P)-binding domain, (SSF:51735) NAD(P)-binding Rossmann fold domains, and (SSF:52540) P-loop–containing nucleoside triphosphate hydrolases, represented 23% of the total domains defined as promiscuous. The second group of promiscuous domains was associated with DNA-binding proteins, such as (SSF:46894) C-terminal effector domain of the bipartite response regulators, (SSF:47413) Lambda repressor-like DNA-binding domains, (SSF:46689) Homeodomain-like, and (SSF:46785) Winged helix DNA-binding domain, representing 11% of the domains associated with this functional category; and transferases (metabolism category), including (SSF:52151) FabD/lysophospholipase-like, (SSF:52317) Class I glutamine amidotransferase-like, (SSF:55729) Acyl-CoA N-acyltransferases (Nat), (SSF:53383) Pyridoxal phosphate (PLP)-dependent transferases, and (SSF: 53335) S-adenosyl-L-methionine-dependent methyltransferases, representing 10%; and finally other enzymes (metabolism) (9%), including (SSF:56235) N-terminal nucleophile aminohydrolases (Ntn hydrolases), (SSF:51338) Composite domain of metallo-dependent hydrolases, (SSF:51556) Metallo-dependent hydrolases, (SSF:56801) Acetyl-CoA synthetase-like, (SSF:53901) Thiolase-like, (SSF:56784) HAD-like, and (SSF:53474) alpha/beta-Hydrolases. Table 1.
Table 1

Promiscuous domains in Bacteria and Archaea.

SUPFAM IDFunctionDescriptionR-valueP-valueGenome distributionNumber of domains
46689LAHomeodomain-like-0.4475989692.11E-70120663188
46785LAWinged helix DNA-binding domain-0.4241539052.22E-521306102676
46894LAC-terminal effector domain of the bipartite response regulators-0.5507756153.17E-94111422398
47336SACP-like-0.5021842173.21E-10287211982
47384THomodimeric domain of signal transducing histidine kinase-0.4650544331.88E-65118030953
47413LAlambda repressor-like DNA-binding domains-0.437662738.42E-77123127826
48179C6-phosphogluconate dehydrogenase C-terminal domain-like-0.5119778644.88E-95128016142
48452RDTPR-like-0.502773091.28E-70121842972
48498KTetracyclin repressor-like, C-terminal domain-0.4475325791.91E-7393114919
50129OGroES-like-0.5388240719.75E-97111613619
51182EARmlC-like cupins-0.4960064784.30E-100120116674
51338RCComposite domain of metallo-dependent hydrolases-0.4403194291.94E-76116710765
51395RDFMN-linked oxidoreductases-0.4772435656.47E-9312429334
51556RCMetallo-dependent hydrolases-0.5219240962.10E-101128516908
51735HANAD(P)-binding Rossmann-fold domains-0.4606917072.84E-701319122830
51905HAFAD/NAD(P)-binding domain-0.5575290393.06E-93131446246
52096OAClpP/crotonase-0.4849023874.77E-76126020743
52151RBFabD/lysophospholipase-like-0.5268225578.97E-889676126
52172TCheY-like-0.5166865472.99E-78120856901
52317RBClass I glutamine amidotransferase-like-0.5439883021.52E-80130919566
52343RAFerredoxin reductase-like, C-terminal NADP-linked domain-0.4083377671.52E-679254940
52402FAdenine nucleotide alpha hydrolases-like-0.4638332362.48E-71131924519
52518HAThiamin diphosphate-binding fold (THDP-binding)-0.4051252585.96E-53131125285
52540HAP-loop containing nucleoside triphosphate hydrolases-0.5244976664.13E-941319293049
52833RAThioredoxin-like-0.4663975342.88E-77130735294
53098FRibonuclease H-like-0.4367133122.92E-63131722384
53187OAZn-dependent exopeptidases-0.4589983351.23E-72129516486
53335RBS-adenosyl-L-methionine-dependent methyltransferases-0.5516208352.47E-100131760989
53383RBPLP-dependent transferases-0.5789303364.92E-122131338176
53474RCalpha/beta-Hydrolases-0.5478613716.58E-114126044812
53850PPeriplasmic binding protein-like II-0.4761266111.73E-75130165801
53901RCThiolase-like-0.5502955711.30E-101126821450
54292RA2Fe-2S ferredoxin-like-0.4569317228.47E-5610968687
54427RFNTF2-like-0.4821124459.35E-878809935
54593RAGlyoxalase/Bleomycin resistance protein/Dihydroxybiphenyl dioxygenase-0.4509653417.55E-79101614244
54631RFCBS-domain-0.4113474441.65E-63128312925
55729RBAcyl-CoA N-acyltransferases (Nat)-0.4563537031.03E-86125034844
55781TGAF domain-like-0.4903405918.87E-102108319207
55811LNudix-0.4045053277.98E-71120811668
55874OATPase domain of HSP90 chaperone/DNA topoisomerase II/histidine kinase-0.4961490893.81E-69127350342
55961RBet v1-like-0.4192767461.15E-668108377
56059HGlutathione synthetase ATP-binding domain-like-0.462147174.47E-50127518325
56112OBProtein kinase-like (PK-like)-0.4383631191.36E-76120617033
56176HAFAD-binding domain-0.4723065466.13E-7012139295
56235RCN-terminal nucleophile aminohydrolases (Ntn hydrolases)-0.4593051391.50E-57124310121
56784RCHAD-like-0.4160965099.25E-60130223797
56801RCAcetyl-CoA synthetase-like-0.5477450041.80E-109110718746
63380HRiboflavin synthase domain-like-0.4102582387.10E-7011447771
82866RFMultidrug efflux transporter AcrB transmembrane domain-0.4324697559.57E-65121117967
88659TASigma3 and sigma4 domains of RNA polymerase sigma factors-0.5237163912.02E-78124820632
88946SSigma2 domain of RNA polymerase sigma factors-0.5851163141.62E-105115516024
103473PMFS general substrate transporter-0.4835681695.55E-84128539310

Columns are as follows: SUPFAM ID, Function, Description, R-value (correlation of WDAS score vs. genome size); P-value; number of genomes where the domain was identified; number of genomes for which the SUPFAM domain was identified and total number of domains in the indicated superfamily.

Functional categories. General: HA (Small molecule binding), R (General or several functions), RD (Dimerization domains). Information: K (Transcription), L (DNA replication, recombination, repair). Metabolism: C (Energy production and conversion), EA (Nitrogen metabolism), F (Nucleotide transport and metabolism), RA (Oxidation/Reduction), RB (Transferases), RC (Other enzymes). Processes_IC: O (Posttranslational modification, protein turnover, chaperones); OA (Proteases, peptidases and their inhibitors), P (Inorganic ion transport and metabolism), RF (Transport). Regulation: LA (DNA-binding transcription factors), OB (Kinases and phosphatases and inhibitors), T (Signal transduction), TA (Other regulatory function). Other: S (Function unknown).

Columns are as follows: SUPFAM ID, Function, Description, R-value (correlation of WDAS score vs. genome size); P-value; number of genomes where the domain was identified; number of genomes for which the SUPFAM domain was identified and total number of domains in the indicated superfamily. Functional categories. General: HA (Small molecule binding), R (General or several functions), RD (Dimerization domains). Information: K (Transcription), L (DNA replication, recombination, repair). Metabolism: C (Energy production and conversion), EA (Nitrogen metabolism), F (Nucleotide transport and metabolism), RA (Oxidation/Reduction), RB (Transferases), RC (Other enzymes). Processes_IC: O (Posttranslational modification, protein turnover, chaperones); OA (Proteases, peptidases and their inhibitors), P (Inorganic ion transport and metabolism), RF (Transport). Regulation: LA (DNA-binding transcription factors), OB (Kinases and phosphatases and inhibitors), T (Signal transduction), TA (Other regulatory function). Other: S (Function unknown). In summary, the four functional categories more abundant associated to this set of promiscuous domains are related to metabolism (other enzymes, and transferases), DNA-binding and small-molecule binding. These functional categories suggest that these structural domains could be associated with ancient and global functions that define the basic functions of cellular maintenance, signaling and regulation of gene expression, and they are influenced by the dynamics of the genome architecture in bacteria [20]; i.e. an increase in genome complexity also increases the probable combinations of structural domains. In contrast, three main functions not associated with promiscuous domains were identified: Translation (SSF:100704), metabolism (redox reactions) (SSF:119536), and other enzymes (SSF:170266). These functions were identified in a small proportion of domains compared to promiscuous domains, suggesting that they are not as highly abundant as promiscuous domains. Although there are diverse and interesting protein superfamily domains, there are two cases to illustrate the association of promiscuity with genome size: the P-loop–containing nucleoside triphosphate hydrolases domain (P-loop; SSF:52540) and the Winged helix-turn-helix domain (SSF:46785). The P-loop domain is the most prevalent domain of the nucleotide-binding protein folds, has been classified in the small-molecule binding functional category (HA) and represents 13.8% of the total domains identified as promiscuous. Proteins with this fold catalyze the hydrolysis of the beta-gamma phosphate bond of nucleoside triphosphate (NTP), and the energy from NTP hydrolysis is used to induce conformational changes in other molecules [21]. P-loop NTPases are characterized by two conserved sequence signatures, the Walker A motif (the P-loop proper) and Walker B motifs which bind the beta and gamma phosphate moieties of the bound NTP and a Mg2+ cation, respectively [22]. Thus, when we analyzed the number of different domains associated with the P-loop, we found a large number of related domains in larger genomes versus smaller genomes, suggesting that these superfamilies not only increase their association with other domains but also with different architectures (Fig 3).
Fig 3

The architecture of the P-loop superfamily as a function of genome size.

A) WDAS of the P- loop (R-value = -0. 52; p-value = 4.13E-94); B) number of different domains associated with the P-loop (R-value = 0.77, p-value = 4.84e-264). On the X-axis of each graph, genome size ranges are displayed in 13 windows, with a range of 836 ORFs each. On the Y-axis are the WDASs. The lines shown in the boxes are the median values. The whisker caps represent the minimum and maximum values.

The architecture of the P-loop superfamily as a function of genome size.

A) WDAS of the P- loop (R-value = -0. 52; p-value = 4.13E-94); B) number of different domains associated with the P-loop (R-value = 0.77, p-value = 4.84e-264). On the X-axis of each graph, genome size ranges are displayed in 13 windows, with a range of 836 ORFs each. On the Y-axis are the WDASs. The lines shown in the boxes are the median values. The whisker caps represent the minimum and maximum values. A second representative example is associated with the Winged helix-turn-helix DNA-binding domain. These structures are characterized by the presence of a third alpha-helix and an adjacent beta-sheet and are central components for DNA binding; the “wings” are small beta-sheets [23]. The recognition helix binds as described for regular helix-turn-helix motifs (i.e., with contacts in the major groove of DNA), and the extra secondary structural elements make different DNA contacts, often with the minor groove or the backbone of DNA [24, 25]. The wHTH has been identified in almost all microorganisms, from Bacteria to Archaea, and includes diverse families, such as the Catabolite gene activator (CAP) family, in which global regulators (CRP and FNR) have been described in the bacterium Escherichia coli [26], the heat shock and E2F/DP TFs, and the Ets domain family, among others [25]. The WDAS associated with the wHTH decreases as a function of the genome size, where larger genomes contain a higher proportion of these proteins, with diverse architectures and a high diversity of structural domains (Fig 4).
Fig 4

Architecture of the wHTH superfamily as a function of genome size.

A) WDAS of the wHTH superfamily (R-value = -0.42; p-value = 2.22E-52); B) number of different domains associated with the wHTH (R-value = 0.545 and p-value = 3.07e-102). On the X-axes of each graph are genome size ranges, displayed in 13 windows, each with a length of 836 ORFs. On the Y-axes are the WDASs. The horizontal lines in the boxes are the median values. The whisker caps represent the minimum and maximum values.

Architecture of the wHTH superfamily as a function of genome size.

A) WDAS of the wHTH superfamily (R-value = -0.42; p-value = 2.22E-52); B) number of different domains associated with the wHTH (R-value = 0.545 and p-value = 3.07e-102). On the X-axes of each graph are genome size ranges, displayed in 13 windows, each with a length of 836 ORFs. On the Y-axes are the WDASs. The horizontal lines in the boxes are the median values. The whisker caps represent the minimum and maximum values. In summary, we suggest that the set of 52 superfamilies follows trends similar to those of the P-loop and wHTH superfamilies, reinforcing the notion that these superfamilies not only increase their associations with other domains but also their architecture diversity.

Promiscuous domains are ancient

To evaluate how promiscuity developed during evolution, we mapped the WDASs associated with all domains against the ancient approach described by [6]. This analysis is based on the hypothesis that the most promiscuous domains could be also ancient domains. Therefore, the 52 protein domains identified as promiscuous associated to the genome size and non-promiscuous were traced along with their ancestry according to the approach described by Caetano-Anolles et al. [15]. S3 Table. In brief, Caetano-Anolles et al. consider the timeline of protein domain evolution spanning ~3.8 billion years of evolution, where a score of 0 represents the origin of domains and a score of 1 represents the present day. In this way, ancestrality is defined by ancestries of protein domain constituents derived from a structural phylogenomic census [6, 15]. In Fig 5, we show the frequencies of the 52 promiscuous domains that correlated with the genome size and identified in this analysis as a function of the ancestry score. From this analysis, two main results can be described: the first is that promiscuous domains are mainly associated with ancient evolutionary events (i.e., antiquity scores close to 0), with few recent emergence events (scores closer to 1) (Fig 5A). In contrast, no promiscuous domains are distributed along the whole antiquity timeline, suggesting that their emergence has been at diverse times in evolution (Fig 5B). In addition, we determined that promiscuous domains are more abundant than non-promiscuous domains. In addition, a Mann-Whitney U test was used to compare the two datasets, finding an U-value of 1064 and because the distribution is approximately normal the Z-Score is -10.62215 with a p-value of < 0.00001, being significant at p < .01. In other words, the difference between the promiscuous and non-promiscuous datasets is statistically significant. In summary, the promiscuous domains are abundant and ancient, such as the P-loop–containing nucleoside triphosphate hydrolases (SF:52540) and the DNA-binding domain (wHTH), because they are recurrent and most ancestral among the universal domains, in comparison to non-promiscuous domains. Therefore, the interplay of promiscuous and no-promiscuous domains determines the architecture of proteins in all the genomes.
Fig 5

Antiquity and abundance of structural domains.

A) Promiscuous and B) non-promiscuous domains. On the X-axes are antiquity assignments, i.e., how ancient each structural domain present in the universal enzymatic reactions is, as suggested by Caetano-Anolles et al [15]; a score of 0 represents an ancient event, whereas 1.0 represents recent domain emergence. On the Y-axis is the abundance (frequencies) of the superfamilies. The size of a circle represents the proportion of each domain in relation to the total protein domains.

Antiquity and abundance of structural domains.

A) Promiscuous and B) non-promiscuous domains. On the X-axes are antiquity assignments, i.e., how ancient each structural domain present in the universal enzymatic reactions is, as suggested by Caetano-Anolles et al [15]; a score of 0 represents an ancient event, whereas 1.0 represents recent domain emergence. On the Y-axis is the abundance (frequencies) of the superfamilies. The size of a circle represents the proportion of each domain in relation to the total protein domains.

Conclusions

In this work, we evaluated 1317 superfamilies for their architectures, their abundance in bacterial and archaeal genomes, their functional roles and their antiquities. Our data reinforce that increased gene complexity also requires the development of mechanisms for gene regulation at the transcription level and small-molecule binding, where the P-loop and the wHTH are the most significant examples, suggesting that the interplay of these structural domains could increase the ability of an organism to recognize and respond to diverse environmental stimuli. These findings are relevant in the context of functional categories, because promiscuous domains are mainly associated with basic metabolic, signal transduction, and gene regulation functions. We also found 52 domains whose promiscuity value or WDAS depends on genome size, and this was reinforced by the number of domains associated with each of them. In summary, we consider the association between these two classes of protein domains (promiscuous and non-promiscuous) has provided, since ancestral times, bacterial and archaeal cells with the ability to respond to diverse environmental challenges.

Distribution of enriched functions for Promiscuous and non-promiscuous domains at different thresholds.

The distributions were adjusted using the function polyfit of the numpy package in python, and different degree of polynomial regression were tested to maximize R2. A polynomial regression of grade 4 was found as the best for both datasets. (TIFF) Click here for additional data file.

The total of multidomain proteins per organisms follows a lineal distribution.

On the X-axis is the number of ORFs per genome, and the Y-axis shows the total of proteins with two or more domains. Each open circle denotes a genome. (R-value = 0.955, p-value < 0.0). A non-linear least squared to fit the dataset was used. (TIFF) Click here for additional data file.

Domain assignments by SUPFAM of 1507 organisms based on the NCBI classification system.

Columns correspond to: Genome ID; Total of Proteins per genome (ORFs); NRProteins_Assigned (total of proteins with one non redundant domain); Coefficient (coverage = total of NR proteins / ORFs); NRDomains; 1Domain_count; 2Domain_Count; 3Domain_Count; > = 4DomainCounter. (TXT) Click here for additional data file.

Domain distribution per taxonomical classification.

Columns correspond to: Domain; C.C.; p-value, Gen_Count; Domain_Count Function; Group_Count; Firmicutes; Proteobacteria; Euryarchaeota; Actinobacteria; Tenericutes; Chlamydiae: Chlorobi; Crenarchaeota; Fusobacteria; Bacteroidetes; Spirochaetes; Cyanobacteria; Deinococcus; Thermus; Thermotogae: Aquificae; Chloroflexi; Gemmatimonadetes: Deferribacteres; Acidobacteria; Candidatus Korarchaeota; Verrucomicrobia: Elusimicrobia; Dictyoglomi; Nitrospirae; Fibrobacteres; Synergistetes; Thermobaculum; Planctomycetes; Cloacimonetes; Chrysiogenetes: Thermodesulfobacteria; Ignavibacteriae; Caldiserica; Thaumarchaeota: Armatimonadetes; Taxonomy total distribution. (TXT) Click here for additional data file. Columns correspond to: Domain; Function; SCOP; Description; ancient_order; Ancient_score; R-value; Total_Domains; Archaea_Domains; and Bacteria_Domains. (TXT) Click here for additional data file.

Boxplot of the architecture of the 52-superfamily domains as a function of genome size.

On the X-axis of each graph, genome size ranges are displayed in 13 windows, with a range of 836 ORFs each. On the Y-axis are the WDASs. The lines shown in the boxes are the median values. The whisker caps represent the minimum and maximum values. Superfamily IDs correspond to the names in Table 1. (ZIP) Click here for additional data file.
  26 in total

1.  Structural genomics analysis: characteristics of atypical, common, and horizontally transferred folds.

Authors:  Hedi Hegyi; Jimmy Lin; Dov Greenbaum; Mark Gerstein
Journal:  Proteins       Date:  2002-05-01

Review 2.  The structure of the protein universe and genome evolution.

Authors:  Eugene V Koonin; Yuri I Wolf; Georgy P Karev
Journal:  Nature       Date:  2002-11-14       Impact factor: 49.962

3.  Modules, multidomain proteins and organismic complexity.

Authors:  Hedvig Tordai; Alinda Nagy; Krisztina Farkas; László Bányai; László Patthy
Journal:  FEBS J       Date:  2005-10       Impact factor: 5.542

4.  Global phylogeny determined by the combination of protein domains in proteomes.

Authors:  Minglei Wang; Gustavo Caetano-Anollés
Journal:  Mol Biol Evol       Date:  2006-09-13       Impact factor: 16.240

5.  Phylogenomic clustering for selecting non-redundant genomes for comparative genomics.

Authors:  Gabriel Moreno-Hagelsieb; Zilin Wang; Stephanie Walsh; Aisha ElSherbiny
Journal:  Bioinformatics       Date:  2013-02-08       Impact factor: 6.937

6.  Nucleation, rapid folding, and globular intrachain regions in proteins.

Authors:  D B Wetlaufer
Journal:  Proc Natl Acad Sci U S A       Date:  1973-03       Impact factor: 11.205

Review 7.  The winged-helix DNA-binding motif: another helix-turn-helix takeoff.

Authors:  R G Brennan
Journal:  Cell       Date:  1993-09-10       Impact factor: 41.582

Review 8.  STAND, a class of P-loop NTPases including animal and plant regulators of programmed cell death: multiple, complex domain architectures, unusual phyletic patterns, and evolution by horizontal gene transfer.

Authors:  Detlef D Leipe; Eugene V Koonin; L Aravind
Journal:  J Mol Biol       Date:  2004-10-08       Impact factor: 5.469

9.  Distantly related sequences in the alpha- and beta-subunits of ATP synthase, myosin, kinases and other ATP-requiring enzymes and a common nucleotide binding fold.

Authors:  J E Walker; M Saraste; M J Runswick; N J Gay
Journal:  EMBO J       Date:  1982       Impact factor: 11.598

Review 10.  The Evolution of Bacterial Genome Architecture.

Authors:  Louis-Marie Bobay; Howard Ochman
Journal:  Front Genet       Date:  2017-05-30       Impact factor: 4.599

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.