
Evolutionary conservation analysis between the essential and nonessential genes in bacterial genomes.

Hao Luo, Feng Gao, Yan Lin.

Abstract

Essential genes are thought to be critical for the survival of organisms under certain circumstances, and natural selection acting on essential genes is expected to be stricter than on nonessential ones. To date, essential genes have been identified experimentally in approximately thirty bacterial organisms. In this paper, we performed a comprehensive comparison between the essential and nonessential genes in the genomes of 23 bacterial species based on the Ka/Ks ratio, and found that essential genes are more evolutionarily conserved than nonessential genes in most of the bacteria examined. Furthermore, we analyzed conservation by functional cluster using the clusters of orthologous groups (COGs), and found that the essential genes in the functional categories G (Carbohydrate transport and metabolism), H (Coenzyme transport and metabolism), I (Lipid transport and metabolism), J (Translation, ribosomal structure and biogenesis), K (Transcription) and L (Replication, recombination and repair) tend to be more evolutionarily conserved than the corresponding nonessential genes in bacteria. The results suggest that the essential genes in these subcategories are subject to stronger selective pressure than the nonessential genes, and therefore provide more insight into the evolutionary conservation of essential and nonessential genes in complex biological processes.

Year:  2015        PMID: 26272053      PMCID: PMC4536490          DOI: 10.1038/srep13210

Source DB:  PubMed          Journal:  Sci Rep        ISSN: 2045-2322            Impact factor:   4.379


Essential genes are genes that are indispensable for the maintenance of an organism. They play significant roles in many critical cellular processes, and hence are also considered the foundation of cellular life [1]. A wide variety of in vivo and in vitro approaches, including single-gene knockout, transposon mutagenesis, antisense RNA and RNA interference, have been employed to identify essential genes [2]. During the past decade, the application of next-generation sequencing to transposon mutagenesis has also given rise to various methods for identifying essential genes, such as TraDIS, INSeq, HITS and Tn-seq [3]. As a consequence, the growing number of available essential genes has prompted a broad spectrum of subsequent studies aimed at characterizing essential and nonessential genes. For instance, essential genes are preferentially located on the leading strand and in the cytoplasm, and are enriched in protein complexes and enzymes [4,5,6,7]. These findings have led to the development of predictive models for identifying essential genes [8,9,10,11]. Additionally, knowledge about essential genes helps us to determine the universal minimal set of genes needed to sustain life and to develop novel antibiotics against pathogenic bacterial infections, which will support the advancement of the pharmaceutical industry as well as synthetic biology [12]. It is well known that rates of evolution vary significantly among protein-coding genes. If a protein plays a significant role in cellular life, it should be under rigorous functional or structural constraints in response to strong purifying (negative) selective pressure [13], and the direct manifestation of this is a restriction on amino acid changes. The key principle behind the identification of essential genes is that loss of function of the gene results in lethality or infertility under certain conditions [2].
Given these results, it is likely that essential genes experience a greater level of purifying selection pressure during natural evolution. Some previous studies have reported that the essentiality of proteins plays an important role in their rates of evolution. Koonin et al. performed a genome-wide evolutionary analysis in three bacterial species, Escherichia coli, Helicobacter pylori and Neisseria meningitidis, and found that the essential genes are more evolutionarily conserved than the nonessential genes [14]. However, because of the limited data available at the time, a subset of the essential genes in their work were putatively assigned based on functional characteristics rather than confirmed by experiments. Guo et al. also analyzed the effect of 16 different biological features on the evolutionary rate, and found that functional essentiality is one of the main contributors to variation in protein evolutionary rate [15]. We constructed a database of essential genes named DEG, which collects and organizes records of both essential genes and essential non-coding elements from genome-wide gene essentiality screens in bacteria, archaea and eukaryotes [3,16,17]. In this study, using the experimentally identified essential genes available in the DEG database, the previous finding that essential genes are more evolutionarily conserved than nonessential genes was confirmed with the genomes of 23 bacterial organisms. Furthermore, we examined evolutionary conservation based on the clusters of orthologous groups of proteins (COGs), and found that the essential genes in the COG functional categories G, H, I, J, K and L tend to have a lower rate of evolution than the corresponding nonessential genes. The results reveal differences in evolutionary rate between essential and nonessential genes among functional categories, and provide further insight into the evolutionary pressures acting on essential and nonessential genes.

Methods

Ka/Ks estimation

The Ka/Ks ratio is the ratio of the number of non-synonymous substitutions per non-synonymous site (Ka) to the number of synonymous substitutions per synonymous site (Ks), and can be used as an indicator of the selective pressure acting on a protein-coding gene [18]. We developed a workflow to estimate the Ka/Ks ratio of all the genes in 23 bacterial organisms whose essential and nonessential genes are available in the DEG database. Figure 1 presents the estimation procedure. For each organism, we randomly picked at least one homologous strain and searched for pairs of orthologous proteins with an E-value less than 10^-5 using BLASTP; the protein with the highest BLASTP score was selected as the ortholog for further analysis [19]. All the complete genome sequences were downloaded via NCBI FTP from ftp://ftp.ncbi.nih.gov/genomes/Bacteria. The pairs of protein sequences were aligned by ClustalW2 with default options, and the nucleotide sequences were aligned to their corresponding amino acid alignments using Pal2Nal [20,21]. Ka/Ks values were calculated by KaKs_Calculator 1.2 using the Nei–Gojobori method [22,23]. All the essential and nonessential genes were obtained from the latest release of the DEG database, which is available at http://tubic.tju.edu.cn/deg/. During the Ka/Ks estimation, we found that in some homologous strains the majority of protein sequences are exactly identical to those in the original organism, so that Ka/Ks could not be determined for these sequences with the Nei–Gojobori method. As a result, we only selected strains that retain enough diversity in most protein-coding sequences.
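The ortholog selection step described above (E-value cutoff, then highest BLASTP score) could be sketched as follows. This is only an illustration, not the authors' script: it assumes the BLASTP results were written in the standard 12-column tabular layout (`-outfmt 6`), and the function name `best_hits` and the example identifiers are hypothetical.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-5  # threshold stated in the text


def best_hits(blast_tabular, cutoff=E_VALUE_CUTOFF):
    """Return {query_id: (subject_id, bit_score)} with the top-scoring hit per query.

    Assumes the default column order of BLAST tabular output (-outfmt 6):
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    best = {}
    for row in csv.reader(io.StringIO(blast_tabular), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue >= cutoff:
            continue  # hit is not significant enough
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


# Made-up hits for two hypothetical genes; geneB has no hit below the cutoff.
example = "\n".join([
    "geneA\thomolog1\t98.0\t300\t6\t0\t1\t300\t1\t300\t1e-150\t550",
    "geneA\thomolog2\t45.0\t280\t150\t4\t5\t284\t3\t280\t1e-40\t160",
    "geneB\thomolog3\t25.0\t90\t60\t5\t1\t90\t1\t88\t0.002\t35",
])
hits = best_hits(example)
```

Here `hits` keeps only `geneA`, paired with its highest-scoring subject; `geneB` is dropped because its only hit fails the E-value cutoff.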
Figure 1

The workflow of the Ka/Ks estimation.

The flow chart schematically shows the procedure used to estimate the Ka/Ks ratios of all the protein-coding genes.
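To make the final step of the workflow concrete, the sketch below computes Ka, Ks and Ka/Ks from already-counted differences and sites, applying the Jukes-Cantor correction used in the Nei–Gojobori method. It is a simplified illustration rather than the KaKs_Calculator implementation: the counting of synonymous and non-synonymous sites from codon alignments is assumed to have been done elsewhere, and the function names and example numbers are hypothetical. It also shows why identical sequences yield no usable ratio, as noted in the Methods.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for a proportion of differences p.

    Returns None when p is outside the range where the correction is defined.
    """
    if p < 0 or p >= 0.75:
        return None
    if p == 0:
        return 0.0
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def ka_ks(syn_diffs, nonsyn_diffs, syn_sites, nonsyn_sites):
    """Return (Ka, Ks, Ka/Ks); the ratio is None when it cannot be determined."""
    ks = jukes_cantor(syn_diffs / syn_sites)
    ka = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    if ka is None or ks is None or ks == 0:
        return ka, ks, None  # e.g. identical sequences give Ks = 0
    return ka, ks, ka / ks


# Hypothetical counts for one aligned gene pair.
ka, ks, ratio = ka_ks(syn_diffs=10, nonsyn_diffs=5, syn_sites=100, nonsyn_sites=300)

# Identical sequences: both distances are zero and the ratio is undefined.
identical = ka_ks(0, 0, 100, 300)
```

A ratio below 1, as in the first example, is the pattern expected under purifying selection.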

Bootstrap analysis

To estimate the difference in evolutionary conservation between the essential and nonessential genes in each organism, we performed a bootstrap analysis by half-sampling with replacement from the original gene set over 1000 replicates. The average Ka, Ks and Ka/Ks values of these resampled sets were then calculated. Genes without a valid Ka or Ks value were excluded from the analysis.
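A minimal sketch of this half-sample bootstrap is given below. It assumes that "half-sampling" means drawing half as many genes as the original set in each replicate, with replacement, and that invalid values are represented as `None`; the function name, the seed and the example values are illustrative only.

```python
import random
import statistics


def half_sample_bootstrap(values, replicates=1000, seed=0):
    """Return the mean of each half-size resample drawn with replacement."""
    valid = [v for v in values if v is not None]  # drop genes without a valid value
    if not valid:
        raise ValueError("no valid values to resample")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    size = max(1, len(valid) // 2)
    return [statistics.mean(rng.choices(valid, k=size)) for _ in range(replicates)]


# Made-up Ka/Ks values for one organism; None marks a gene without a valid ratio.
essential_kaks = [0.05, 0.08, None, 0.04, 0.10, 0.06, 0.07]
essential_means = half_sample_bootstrap(essential_kaks)
```

The resulting list of replicate means is what a box plot per organism and gene class would summarize.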

COG analysis

The Clusters of Orthologous Groups of proteins (COGs) database is a useful tool for functional annotation, and provides a consistent classification of bacterial and eukaryotic species based on orthologous groups [24]. In this study, all the essential and nonessential genes were divided into functional subcategories based on the COG annotation (http://www.ncbi.nlm.nih.gov/COG/). Owing to the absence of COG annotations, the genes of Bacteroides fragilis 638R were excluded from the COG analysis. The significance of the difference in Ka/Ks values between the essential and nonessential genes in each COG category was assessed for each organism with the Mann–Whitney U test. A P-value less than 0.01 was considered statistically significant.

Analysis tools

All the pipeline steps, including BLASTP, ClustalW2, Pal2Nal and KaKs_Calculator, were executed by a custom Python script. The Biopython module was used to parse the GenBank and aln format files [25]. Statistical analyses were carried out using the SciPy and Pandas packages. The figures were generated with the Python module Matplotlib [26].

Results and Discussion

Essential genes are more evolutionarily conserved than nonessential genes

To estimate the difference in evolutionary conservation between the essential and nonessential genes, we constructed a data set of essential and nonessential genes that were determined in genome-wide screens. The gene set consists of more than 70,000 genes from the genomes of 23 bacterial organisms. Table 1 lists the organisms used in this study. We then identified and aligned more than 220,000 pairs of orthologous protein sequences based on BLASTP searches, of which more than 180,000 pairs have valid synonymous (Ks) or nonsynonymous (Ka) substitution rates (see Methods). We also evaluated the properties of these 180,000 pairs, and found that about 90% of them have more than 30% amino acid identity and at least 50% aligned residues in the BLASTP searches. In addition, most of the remaining pairs, which have relatively lower amino acid identity or alignment coverage, still share the same biological functions and COG assignments. This indicates that, with an E-value threshold of 10^-5, most of the protein pairs we found are orthologous. The average Ka, Ks and Ka/Ks values of the essential and nonessential genes were then calculated for each organism, and the significance of the difference between the essential and nonessential genes was determined using the Mann–Whitney U test (Figure 2). Ka, Ks and Ka/Ks are all significantly lower for the essential genes than for the nonessential genes in most organisms, which indicates that not only non-synonymous sites but also synonymous sites are subject to some degree of selection pressure. In addition, Student's t-test was performed on the three averages between the essential and nonessential genes across all the organisms, and the differences are statistically significant (P = 0.0004, 0.001 and 0.0004, respectively).
The lower values of all three measures for the essential genes suggest that the essential genes are more conserved during evolution, consistent with the expectation that negative selection against amino acid replacements is stricter for the essential genes than for the nonessential genes.
Table 1

Summary of the dataset (a).

Organism | RefSeq | Essential genes | Essential COGs | Nonessential genes | Nonessential COGs | PNE (b) (%) | RefSeq of the homologous strains (c)
A. baylyi ADP1 | NC_005966 | 499 | 474 | 2594 | 1699 | 0.00 | NC_009085 NC_011595 NC_017847
B. subtilis 168 | NC_000964 | 271 | 267 | 3904 | 2483 | 0.15 | NC_014479 NC_016047 NC_017195
B. fragilis 638R | NC_016776 | 547 | - | 3743 | - | 3.37 | NC_003228 NC_006347 NC_009614
B. thetaiotaomicron 5482 | NC_004663 | 325 | 256 | 4453 | 2578 | 0.00 | NC_009614 NC_010831 NC_015164
B. pseudomallei K96243 | NC_006351 NC_006350 | 505 | 453 | 5222 | 3724 | 21.52 | NC_009078 NC_009076 NC_017831 NC_017831 NC_018527 NC_018529
B. thailandensis E264 | NC_007651 NC_007650 | 406 | 372 | 5226 | 3706 | 0.00 | NC_021173 NC_017832
C. jejuni NCTC 11168 | NC_002163 | 228 | 178 | 1395 | 1000 | 0.86 | NC_003912 NC_009707 NC_017280
C. crescentus NA1000 | NC_011916 | 480 | 436 | 3224 | 1988 | 0.00 | NC_002696 NC_010338 NC_014100
E. coli MG1655 | NC_000913 | 296 | 284 | 4077 | 3144 | 1.84 | NC_008253 NC_009801 NC_013008
F. novicida U112 | NC_008601 | 392 | 340 | 1329 | 942 | 0.08 | NC_017449 NC_017450 NC_017909
H. influenzae KW20 | NC_000907 | 642 | 549 | 512 | 422 | 0.20 | NC_007146 NC_017451 NC_022356
H. pylori 26695 | NC_000915 | 323 | 230 | 1135 | 759 | 0.00 | NC_017359 NC_019563 NC_022886
M. tuberculosis H37Rv | NC_000962 | 687 | 589 | 3070 | 1747 | 45.24 | NC_017528 NC_021192 NC_021193
M. genitalium G37 | NC_000908 | 381 | 287 | 94 | 64 | 39.36 | NC_018495 NC_018496 NC_018497 NC_018498
P. gingivalis 33277 | NC_010729 | 463 | 374 | 1627 | 840 | 1.23 | NC_002950 NC_015571
P. aeruginosa PAO1 | NC_002516 | 117 | 98 | 5454 | 4075 | 13.31 | NC_017548 NC_018080 NC_021577
P. aeruginosa PA14 | NC_008463 | 335 | 271 | 960 | 627 | 7.08 | NC_002516 NC_017548 NC_018080
S. Typhi Ty2 | NC_004631 | 358 | 338 | 3906 | 2707 | 3.84 | NC_003197 NC_021151 NC_022544
S. wittichii RW1 | NC_009511 | 535 | 452 | 4315 | 3184 | - | NC_020561
S. aureus N315 | NC_002745 | 302 | 280 | 2281 | 1471 | 13.42 | NC_002951 NC_007795 NC_018608
S. aureus NCTC 8325 | NC_007795 | 351 | 308 | 2541 | 1346 | 7.20 | NC_002745 NC_018608 NC_020533
S. sanguinis SK36 | NC_009009 | 218 | 211 | 2052 | 1341 | 0.00 | NC_017595 NC_017618 NC_018526
V. cholerae N16961 | NC_002506 NC_002505 | 779 | 433 | 2943 | 2173 | 47.54 | NC_012578 NC_012580 NC_012582 NC_012583 NC_017269 NC_017270

(a) The organism name, RefSeq accession, proportion of PNE genes and the RefSeq accessions used to evaluate conservation are provided. The numbers of essential and nonessential genes and of those with COG assignments are also given in this table.

(b) The percentage is the proportion of PNE genes among the nonessential genes.

(c) Only a single complete homologous genome is available in NCBI for S. wittichii RW1, so the percentage of PNE genes could not be calculated. Because no homologous strains are available for A. baylyi, the three genomes were selected from the Acinetobacter genus.

Figure 2

Average Ka, Ks and Ka/Ks values for essential and nonessential genes in each organism.

The histograms show the average (A) Ka, (B) Ks and (C) Ka/Ks values of the essential and nonessential genes. The P-values calculated by the Mann–Whitney U test are displayed at the top of each panel.

To rule out the effect of extreme values on our results, the average Ka/Ks values were evaluated by bootstrap analysis with 1000 replicates of half-sample resampling. Figure 3 compares the distributions of the averages for the essential and nonessential genes as box plots. The result reduces the influence of outliers and supports the conclusion that essential genes are, in most cases, more evolutionarily conserved than nonessential genes across bacterial organisms.
Figure 3

Box-plot diagrams for the differences between the essential and nonessential genes.

The pairs of box plots, presented in red and blue respectively, compare the distributions of the average Ka/Ks ratios of the essential and nonessential genes in each organism.

However, in Fig. 2 we also found significantly lower Ka, Ks and Ka/Ks values for the nonessential genes than for the essential genes in Mycobacterium tuberculosis H37Rv and Vibrio cholerae N16961. Highly conserved nonessential genes, shared among distantly related bacteria, have been found in other organisms and are termed persistent nonessential (PNE) genes [27]. Owing to the limitations of current experimental techniques for defining gene essentiality, PNE genes appear dispensable only for short-term survival and growth under laboratory conditions. From an evolutionary point of view, however, PNE genes are also essential for the successful survival of the population in varying external environments [28]. To test whether the abnormal results are related to PNE genes, we made a rough estimate of the proportion of PNE genes in these organisms. In this study, nonessential genes that showed no variation in nucleotide sequence relative to their orthologs (Ka = Ks = 0) and had homologous proteins in two or more organisms were defined as PNE genes. The percentages of PNE genes in these two organisms are considerably higher than in the other organisms, which means that PNE genes are enriched in the genomes of M. tuberculosis H37Rv and V. cholerae N16961 (see Table 1). This enrichment indicates that a large number of conserved nonessential genes were not recognized as essential in the experiments, so the abnormal result does not conflict with the finding that the essential genes are more evolutionarily conserved than the nonessential ones. In addition, in pathogenic microorganisms, the most common antibiotics hit only targets or pathways that are crucial for the organism, and the proteins in these targets or pathways are often encoded by essential genes.
As a result, the essential genes may be subject to positive selection pressure and evolve more rapidly than the nonessential genes in order to survive in harsh and extreme environments. Finally, owing to the limitations of the method, the choice of homologous strains, which must retain enough diversity relative to the original genome, may also affect the results. Nevertheless, the conclusions drawn from the analysis of evolutionary conservation in the 23 bacterial genomes remain valid.
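The rough PNE estimate described above can be sketched as follows. The reading of the definition is an assumption: a nonessential gene is counted as PNE when it has orthologs in at least two homologous strains and every one of those comparisons gives Ka = 0 and Ks = 0. The data layout, the function name and the example gene identifiers are hypothetical.

```python
def pne_percentage(nonessential):
    """Percentage of nonessential genes classified as persistent nonessential (PNE).

    nonessential maps gene_id -> list of (Ka, Ks) pairs, one pair per
    homologous strain in which an ortholog was found.
    """
    if not nonessential:
        return 0.0
    pne = [
        gene
        for gene, pairs in nonessential.items()
        if len(pairs) >= 2 and all(ka == 0 and ks == 0 for ka, ks in pairs)
    ]
    return 100.0 * len(pne) / len(nonessential)


# Made-up values for four hypothetical nonessential genes.
example_genes = {
    "gene1": [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)],  # identical in all strains: PNE
    "gene2": [(0.0, 0.0)],                          # ortholog in only one strain
    "gene3": [(0.01, 0.05), (0.0, 0.0)],            # varies in one strain
    "gene4": [(0.02, 0.10), (0.03, 0.12)],          # varies in both strains
}
percentage = pne_percentage(example_genes)
```

Only `gene1` meets both conditions here, so one gene in four is counted as PNE.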

Evolutionary analysis of the essential genes based on the COG terms

To further analyze the evolutionary conservation of the essential and nonessential genes at the functional level, we classified all the essential and nonessential genes into 25 functional classes based on the COG subcategories. In total, 54,175 genes in 22 organisms were classified into the 25 COG categories, while 11,359 genes had no COG assignment. Table 1 lists the numbers of essential and nonessential genes with COG assignments in each organism. The Mann–Whitney U test was then used to test the significance of the differences between the essential and nonessential genes in each COG subcategory, and P-values less than 0.01 were considered statistically significant. A hierarchical clustering heat map displaying the statistically significant COG categories in the 22 prokaryotic species is presented in Figure 4. Note that genes annotated with COG codes R and S were excluded from this study, because these two categories denote general function prediction only (R) and function unknown (S). The COG categories B, Y and Z were not considered because no essential genes were available in them. To apply a uniform standard, the COG subcategories in which essential genes are more conserved than nonessential genes in more than half of all organisms are defined as conserved function subcategories. We did not use the clustering result to define the conserved subcategories because different hierarchical clustering approaches would lead to different clusters.
Figure 4 shows that the essential genes from the functional subcategories with the COG codes G (Carbohydrate transport and metabolism), H (Coenzyme transport and metabolism), I (Lipid transport and metabolism), J (Translation, ribosomal structure and biogenesis), K (Transcription) and L (Replication, recombination and repair) are significantly more conserved than the nonessential genes in more than half of all the organisms, and these COG subcategories are defined as conserved function subcategories in this research. Furthermore, the subcategories with COG codes C (Energy production and conversion) and U (Intracellular trafficking, secretion, and vesicular transport) also show statistical significance in no fewer than ten organisms. The COG subcategory M (Cell wall/membrane/envelope biogenesis) can be regarded as conserved as well, because it clusters with the other conserved subcategories in the dendrogram and is significant in nine organisms. Notably, the differences between the essential and nonessential genes are not statistically significant in COG subcategories A (RNA processing and modification) and N (Cell motility). Because only a few genes are annotated with COG code W (Extracellular structures), there is also no significant difference in this subcategory. In addition, the exceptional case in which the nonessential genes are more conserved than the essential genes is not widely observed in this figure. It is well known that the COG subcategories can be grouped into four broad functional categories: (1) information processing and storage, (2) cellular processes, (3) metabolism, and (4) poorly characterized. The COG subcategories in which the essential genes are significantly more conserved than the nonessential ones are mainly found in information processing and storage, and in metabolism.
Overall, the results obtained so far suggest that the higher evolutionary conservation of the essential genes relative to the nonessential genes is found mostly in central cellular mechanisms, which indicates that these biological processes exert stronger evolutionary pressure on the essential genes than on the nonessential genes.
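The majority rule used to define conserved function subcategories can be expressed in a few lines. The sketch below assumes the per-organism test results are already available as a nested dictionary holding, for each COG letter, a P-value and a flag saying whether the essential genes had the lower Ka/Ks values; this layout, the function name and the example values are hypothetical.

```python
def conserved_subcategories(results, alpha=0.01):
    """Return COG letters that are significant, with essential genes more
    conserved, in more than half of the organisms.

    results maps organism -> {cog_letter: (p_value, essential_lower)}.
    """
    n_organisms = len(results)
    counts = {}
    for per_cog in results.values():
        for cog, (p_value, essential_lower) in per_cog.items():
            if p_value < alpha and essential_lower:
                counts[cog] = counts.get(cog, 0) + 1
    return sorted(cog for cog, n in counts.items() if n > n_organisms / 2)


# Made-up results for three hypothetical organisms.
example_results = {
    "org1": {"J": (0.001, True), "N": (0.2, False), "C": (0.005, True)},
    "org2": {"J": (0.0001, True), "N": (0.005, True), "C": (0.3, True)},
    "org3": {"J": (0.002, True), "N": (0.5, True), "C": (0.004, False)},
}
conserved = conserved_subcategories(example_results)
```

Counting organisms in this way does not depend on any clustering method, which matches the reason given in the text for not using the dendrogram to define the conserved subcategories.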
Figure 4

The heat map analysis of the significantly conserved genes based on the COG categories.

The hierarchical cluster diagram was constructed by Ward's linkage clustering. The P-value of each COG category was calculated with the Mann–Whitney U test for each organism, and reflects the significance of the difference in Ka/Ks values between the essential and nonessential genes. Blue boxes mark COG subcategories in which the essential genes are more evolutionarily conserved than the nonessential ones, while red boxes mark the opposite case.

Conclusion

In the present work, by comprehensively analyzing the Ka/Ks ratios of the essential and nonessential genes in 23 prokaryotic species, we have shown that bacterial essential genes are more evolutionarily conserved than nonessential genes. Furthermore, the essential genes in COG subcategories G, H, I, J, K and L show greater evolutionary conservation than the nonessential genes, which indicates that the essential genes are under stronger selective constraint in these biological processes. The results provide further insight into the evolutionary conservation of essential genes, and may help in developing novel gene essentiality prediction algorithms.

Additional Information

How to cite this article: Luo, H. et al. Evolutionary conservation analysis between the essential and nonessential genes in bacterial genomes. Sci. Rep. 5, 13210; doi: 10.1038/srep13210 (2015).
  28 in total

1.  Functionality of essential genes drives gene strand-bias in bacterial genomes.

Authors:  Yan Lin; Feng Gao; Chun-Ting Zhang
Journal:  Biochem Biophys Res Commun       Date:  2010-04-24       Impact factor: 3.575

2.  Essence of life: essential genes of minimal genomes.

Authors:  Mario Juhas; Leo Eberl; John I Glass
Journal:  Trends Cell Biol       Date:  2011-09-01       Impact factor: 20.808

Review 3.  Essential genes as antimicrobial targets and cornerstones of synthetic biology.

Authors:  Mario Juhas; Leo Eberl; George M Church
Journal:  Trends Biotechnol       Date:  2012-08-30       Impact factor: 19.536

4.  BLAST+: architecture and applications.

Authors:  Christiam Camacho; George Coulouris; Vahram Avagyan; Ning Ma; Jason Papadopoulos; Kevin Bealer; Thomas L Madden
Journal:  BMC Bioinformatics       Date:  2009-12-15       Impact factor: 3.169

5.  Biopython: freely available Python tools for computational molecular biology and bioinformatics.

Authors:  Peter J A Cock; Tiago Antao; Jeffrey T Chang; Brad A Chapman; Cymon J Cox; Andrew Dalke; Iddo Friedberg; Thomas Hamelryck; Frank Kauff; Bartek Wilczynski; Michiel J L de Hoon
Journal:  Bioinformatics       Date:  2009-03-20       Impact factor: 6.937

6.  Evolutionary conservation of essential and highly expressed genes in Pseudomonas aeruginosa.

Authors:  Andreas Dötsch; Frank Klawonn; Michael Jarek; Maren Scharfe; Helmut Blöcker; Susanne Häussler
Journal:  BMC Genomics       Date:  2010-04-09       Impact factor: 3.969

7.  Investigating the predictability of essential genes across distantly related organisms using an integrative approach.

Authors:  Jingyuan Deng; Lei Deng; Shengchang Su; Minlu Zhang; Xiaodong Lin; Lan Wei; Ali A Minai; Daniel J Hassett; Long J Lu
Journal:  Nucleic Acids Res       Date:  2010-09-24       Impact factor: 16.971

8.  Enzymes are enriched in bacterial essential genes.

Authors:  Feng Gao; Randy Ren Zhang
Journal:  PLoS One       Date:  2011-06-28       Impact factor: 3.240

9.  Putative essential and core-essential genes in Mycoplasma genomes.

Authors:  Yan Lin; Randy Ren Zhang
Journal:  Sci Rep       Date:  2011-08-03       Impact factor: 4.379

10.  DoriC 5.0: an updated database of oriC regions in both bacterial and archaeal genomes.

Authors:  Feng Gao; Hao Luo; Chun-Ting Zhang
Journal:  Nucleic Acids Res       Date:  2012-10-23       Impact factor: 16.971
