Literature DB >> 26067107

Statistical Analysis of Hurst Exponents of Essential/Nonessential Genes in 33 Bacterial Genomes.

Xiao Liu1, Baojin Wang1, Luo Xu1.   

Abstract

Methods for identifying essential genes currently depend predominantly on biochemical experiments. However, there is demand for improved computational methods for determining gene essentiality. In this study, we used the Hurst exponent, a characteristic parameter to describe long-range correlation in DNA, and analyzed its distribution in 33 bacterial genomes. In most genomes (31 out of 33) the significance levels of the Hurst exponents of the essential genes were significantly higher than for the corresponding full-gene-set, whereas the significance levels of the Hurst exponents of the nonessential genes remained unchanged or increased only slightly. All of the Hurst exponents of essential genes followed a normal distribution, with one exception. We therefore propose that the distribution feature of Hurst exponents of essential genes can be used as a classification index for essential gene prediction in bacteria. For computer-aided design in the field of synthetic biology, this feature can build a restraint for pre- or post-design checking of bacterial essential genes. Moreover, considering the relationship between gene essentiality and evolution, the Hurst exponents could be used as a descriptive parameter related to evolutionary level, or be added to the annotation of each gene.

Entities:  

Mesh:

Year:  2015        PMID: 26067107      PMCID: PMC4466317          DOI: 10.1371/journal.pone.0129716

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Essential genes are those genes of an organism that are critical for its survival, and their content are highly determined by the circumstances in which the organism lives [1]. Essential genes are characterized by lethal mutant phenotypes that block survival or reproduction. Identifying essential genes is important for understanding the minimal requirements for cellular survival, and this information may have practical applications in the fields of medicine and bioengineering [2]. Furthermore, this information may provide insight into the complex mechanisms of cell function and evolution. The identification of essential genes still relies predominantly on biochemical experiments. Single-gene-specific mutagenesis, saturation transposon mutagenesis, and antisense RNA inhibition are employed as traditional approaches [3]. However, these methods are costly, time-consuming, and labor-intensive. In addition, experimental approaches are not always possible as the vast majority of organisms are unculturable. Computational prediction methods therefore offer a good alternative. Many methods for essential gene prediction have previously been reported; these include comparative genomic approaches using homology mapping, constraint-based methods, machine learning methods based on a partial essential gene list, and statistical modeling based on transposon mutagenesis data [4]. To increase prediction accuracy, researchers employed a large range of features, such as amino acid composition, codon bias, and protein evolution rate [2,4]. However, the accuracy of these methods is not always high enough for large scale practical application. Moreover, the increase in the number of features (the dimension of feature space) results in a sharp increase in computational complexity and computational cost [5], which highlights the necessity for more efficient methods and feature(s). Long-range correlation (or self-similarity) has been observed in DNA sequences, in both coding and non-coding regions [6,7]. The Hurst exponent is used as a characteristic parameter to describe long-range correlation. Yu et al. proposed a time-series model and a visual representation for DNA sequence analysis [8]. The correlation dimension, Hurst exponent, fractal dimension, and dimension spectrum (multifractal analysis) were considered in their work. Liu et al. calculated the rescaled range functions; the Hurst exponents of human chromosome 22 and enterobacteria phage lambda DNA sequences; and the transmission coefficients, Landauer resistances, and Lyapunov coefficients of segments of the corresponding DNA sequences [9]. Stan et al. investigated the characteristics of a series of lengths of coding and non-coding DNA sequences from certain bacteria and archaea, using the generalized Hurst exponent on the size of fluctuations, the shape of the singularity spectra, the shape and relative disposition of the curves of the singular measures scaling exponent, and the values of the associated parameters [10]. With the rapid development of synthetic biology, especially research on minimal genomes, the question has arisen as to whether these long-range correlation features change in reduced DNA sequences. We performed an initial study of the Hurst exponent of nine bacteria from the DEG 6.5 database (the original database of essential genes) and obtained an initial view of the distribution feature of the index, that is, a normal distribution exists for the Hurst exponent [11]. However, the sample size was limited in that study. The updated DEG 10.7 database, released more recently, contains 33 bacteria, allowing for a more comprehensive study [12]. Here, we investigated the distribution of the Hurst exponent of 33 bacteria from the DEG 10.7 database to analyze the long-range correlation of this feature.

Materials and Methods

The essential gene lists of 33 bacterial objects were downloaded from the DEG 10.7 database, and their genome files and sequence data were obtained from the NCBI ftp site (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/). Sequence information is listed in Table 1.
Table 1

Information of the analyzed objects.

Analysis organismsNCBI RefSeq access numberGene number (Full-gene-set)Gene number (Essential)Gene number (Nonessential)
Acinetobacter baylyi ADP1NC_00596633074992594
Bacillus subtilis 168NC_00096441752713904
Bacteroides fragilis 638RNC_01677642905473743
Bacteroides thetaiotaomicron VPI-5482NC_00466347783254453
Burkholderia pseudomallei K96243NC_006350/0063513398+23295055222
Burkholderia thailandensis E264NC_007650/0076513276+23564065226
Campylobacter jejuni subsp. jejuni NCTC 11168 = ATCC 700819NC_00216315762281395
Caulobacter crescentus NC_01191638854803224
Escherichia coli MG1655 INC_00091341406092923
Escherichia coli MG1655 IINC_00091341402964077
Francisella novicida U112NC_00860117193921329
Haemophilus influenzae Rd KW20NC_0009071610642512
Helicobacter pylori 26695NC_00091514693231135
Mycobacterium tuberculosis H37RvNC_00096239066142552
Mycobacterium tuberculosis H37Rv IINC_00096239067713171
Mycobacterium tuberculosis H37Rv IIINC_00096239066873070
Mycoplasma genitalium G37NC_00090847538194
Mycoplasma pulmonis UAB CTIPNC_002771782310322
Porphyromonas gingivalis ATCC 33277NC_01072920894631627
Pseudomonas aeruginosa PAO1NC_00251655721175454
Pseudomonas aeruginosa UCBPP-PA14NC_0084635892335960
Salmonella enterica serovar Typhi NC_00463143523534005
Salmonella enterica serovar Typhi Ty2NC_00463143523583906
Salmonella enterica serovar Typhimurium SL1344NC_01681044463534035
Salmonella enterica subsp. enterica serovar Typhimurium str. 14028SNC_01685653151055210
Salmonella typhimurium LT2NC_00319744512304228
Shewanella oneidensis MR-1NC_00434740654031103
Sphingomonas wittichii RW1NC_00951148505354315
Staphylococcus aureus N315NC_00274525823022281
Staphylococcus aureus NCTC 8325NC_00779527673512541
Streptococcus pneumoniae NC_0030981813244NULL a
Streptococcus sanguinis NC_00900922702182052
Vibrio cholerae N16961NC_002505/0025062534+9707792943

a NO nonessential genes information provided in DEG.

a NO nonessential genes information provided in DEG. It should be noted that there are actually 33 bacterial object sequences provided in the DEG database, with six object sequences originating from three bacteria (E. coli MG1655, Mycobacterium tuberculosis H37Rv, and S. enterica serovar Typhi) according to different experimental environments and methods (See S1 and S2 Datasets or http://tubic.tju.edu.cn/deg/organism.php?db=p). The organism number is therefore actually 29. Moreover, three objects have two chromosomal sequences (Burkholderia pseudomallei K96243, Burkholderia thailandensis E264, and V. cholerae N16961).

Analytical procedures

The data were analyzed according to the following steps: Sequences were digitized. Nucleotide sequences were transformed into digital sequences by expressing each nucleotide as a digital number. The four nucleotides, A, G, C, and T, were assigned the digital numbers 0, 1, 2, and 3, respectively (e.g., DNA sequence AGCTTTT would be digitized as 0123333). Hurst exponents were analyzed using the R software. The Hurst exponent describes the degree of self-similarity of a data set. The Hurst exponent of a data series with long-range dependence is between 0.5 and 1. An increased Hurst exponent indicates an increase in the degree of self-similarity and long-range dependence [13]. The Hurst exponents of each gene (including the full-gene-set, essential genes, and nonessential genes) were calculated using the R software [14]. To compare the feature in various models, six approaches (nine modes) for Hurst exponents were employed, which are [15]: RoverS, which estimates the Hurst exponent using the rescaled range (R/S) method. hurstSpec in standard, smoothed, and robinson modes, which estimates the Hurst exponent via spectral regression. hurstACVF, which estimates the Hurst exponent by regression of scaled asinh plot of autocovariance function vs. log(lag). Detrended fluctuation analysis, which estimates the scaling exponent from the results. FDWhittle, which analyzes the input time series with the spectral density function. Block in aggvar and higuchi mode, which estimates the Hurst exponent in the time domain. Statistical analysis with SPSS. The distribution properties of the Hurst exponents of the full-gene-set, the essential genes, and the nonessential genes of the 33 object sequences were analyzed using SPSS (IBM, Armonk, NY). Two methods in SPSS were used. Q–Q plots present an intuitionistic and graphic result. However, Q–Q plots lack a quantitative description of the data. Therefore, significance levels based on the Kolmogorov–Smirnov (K–S) test, were also calculated, which evaluated whether the datasets were significantly different from an assumed theoretical distribution.

Results and Discussion

A distribution hypothesis is accepted when it has a significance level of greater than or equal to 0.05, and is rejected when the significance level is less than 0.05. Of the four hypothetical distributions, that is, normal, uniform, Poisson, and exponential distribution, our results showed that only a normal distribution was satisfied at various levels. The hurstSpec method in smoothed mode provided the highest significance level among the nine methods, and was chosen as our optimal analysis method (See S1 and S2 Datasets in supplementary materials for detail). The results are listed in Table 2.
Table 2

Significance levels of 33 objects in a normal distribution based on the hurstSpec method in smoothed mode.

Analysis organismsNCBI RefSeq access numberFull-gene-setEssential GenesNonessential Genes
Acinetobacter baylyi ADP1NC_0059660.0520.6040.093
Bacillus subtilis 168NC_0009640.0040.4390.004
Bacteroides fragilis 638RNC_0167760.0020.1750.015
Bacteroides thetaiotaomicron VPI-5482NC_0046630.0000.6880.000
Burkholderia pseudomallei K96243NC_006350/0063510.0000.6450.001
Burkholderia thailandensis E264NC_007650/0076510.0000.4080.000
Campylobacter jejuni subsp. jejuni NCTC 11168 = ATCC 700819NC_0021630.0180.1920.074
Caulobacter crescentus NC_0119160.0000.7570.000
Escherichia coli MG1655 INC_0009130.0000.8070.075
Escherichia coli MG1655 IINC_0009130.0000.6390.000
Francisella novicida U112NC_0086010.0450.2580.089
Haemophilus influenzae Rd KW20NC_0009070.0370.7110.291
Helicobacter pylori 26695NC_0009150.2890.3240.394
Mycobacterium tuberculosis H37RvNC_0009620.0000.7170.009
Mycobacterium tuberculosis H37Rv IINC_0009620.0000.4310.001
Mycobacterium tuberculosis H37Rv IIINC_0009620.0000.8450.004
Mycoplasma genitalium G37NC_0009080.9960.9930.662
Mycoplasma pulmonis UAB CTIPNC_0027710.1310.8940.133
Porphyromonas gingivalis ATCC 33277NC_0107290.0000.3430.000
Pseudomonas aeruginosa PAO1NC_0025160.0010.9780.001
Pseudomonas aeruginosa UCBPP-PA14NC_0084630.0010.2890.181
Salmonella enterica serovar Typhi NC_0046310.0010.1830.002
Salmonella enterica serovar Typhi Ty2NC_0046310.0010.5030.024
Salmonella enterica serovar Typhimurium SL1344NC_0168100.0060.4210.015
Salmonella enterica subsp. enterica serovar Typhimurium str. 14028SNC_0168560.0000.9040.000
Salmonella typhimurium LT2NC_0031970.0030.5160.004
Shewanella oneidensis MR-1NC_0043470.0140.7840.212
Sphingomonas wittichii RW1NC_0095110.0020.1670.005
Staphylococcus aureus N315NC_0027450.0040.4370.013
Staphylococcus aureus NCTC 8325NC_0077950.0000.1240.002
Streptococcus pneumoniae NC_0030980.2200.177NULL
Streptococcus sanguinis NC_0090090.0090.1270.020
Vibrio cholerae N16961NC_002505/0025060.0020.0000.053
The significance levels of the Hurst exponents for essential genes were significantly greater than those of the corresponding full-gene-set, except for Mycoplasma genitalium G37, Streptococcus pneumoniae, and Vibrio cholerae N16961. For example, the significance levels of Escherichia coli MG1655 I were <0.001 and 0.807, respectively. Those of Salmonella enterica subsp. enterica serovar Typhimurium str. 14028S were <0.001 and 0.904, respectively. These results indicate that a normal distribution exists for the Hurst exponent of the essential genes in these organisms. Even for the two exceptions, the significance levels of M. genitalium G37 (0.996, 0.993) and S. pneumoniae (0.220, 0.177) were significantly greater than 0.005, which means that a normal distribution also exists for the Hurst exponent of these essential genes. V. cholerae N16961 was different, showing a significance level of <0.001 for the Hurst exponents of essential genes. In fact, V. cholerae strain C6706 was used for analysis and the results were compared with V. cholerae strain N16961 [16]. Although only 50–250 single nucleotide polymorphisms were detected across the two entire genomes, the cause of these differences remains to be determined [17]. From our results, the significance levels of most of the nonessential genes remained unchanged or increased slightly compared with those of the corresponding full gene set. For example, the significance levels of E. coli MG1655 I increased slightly (0.075), whereas those of S. enterica subsp. enterica serovar Typhimurium str. 14028S remained unchanged (<0.001). There was one exception, Helicobacter pylori 26695, for which the significance levels of nonessential genes were greater than those of both the full gene set and the essential genes. However, the three values were all significantly greater than 0.005, indicating that the Hurst exponents of the essential genes still followed a normal distribution. It should be addressed that the genomes of Mycoplasma genitalium G37 and Mycoplasma pulmonis UAB CTIP are already near minimal. Although the other 31 genomes encoded much more genomic redundancy than Mycoplasma genitalium G37 and Mycoplasma pulmonis UAB CTIP, the Hurst exponents followed the same distribution (normal distribution), which showed the generality of the Hurst exponents. Q–Q (quantile–quantile) plots were also generated from our data set. In a Q–Q plot, if the data investigated follow a normal distribution, a straight line should be produced when they are plotted against the expected data. Because of the lack of quantitative descriptions, we only provide the results of two objects in Fig 1, E. coli MG1655 I and S. enterica subsp. enterica serovar Typhimurium str. 14028S. The essential gene data clearly departed less from linearity than the corresponding full gene set, indicating that the essential gene data followed a normal distribution more closely. However, the nonessential genes of most objects were relatively conserved with the full-gene-set.
Fig 1

Q–Q plots of the Hurst exponents.

A, B, and C: Escherichia coli MG1655 I; D, E, and F: Salmonella enterica subsp. enterica serovar Typhimurium str. 14028S. A and D show Q–Q plots of the Hurst exponents of the full-gene-sets of the two objects from respective organisms. B and E show Q–Q plots of the Hurst exponents of the essential genes of the two objects from respective organisms. C and F show Q–Q plots of the Hurst exponents of the nonessential genes of the two objects from respective organisms.

Q–Q plots of the Hurst exponents.

A, B, and C: Escherichia coli MG1655 I; D, E, and F: Salmonella enterica subsp. enterica serovar Typhimurium str. 14028S. A and D show Q–Q plots of the Hurst exponents of the full-gene-sets of the two objects from respective organisms. B and E show Q–Q plots of the Hurst exponents of the essential genes of the two objects from respective organisms. C and F show Q–Q plots of the Hurst exponents of the nonessential genes of the two objects from respective organisms.

Conclusion

There are more than 5000 bacterial genomes currently available in the NCBI sequence database, far more than the number of objects provided in the DEG database. There is a demand for improved computational prediction methods for determining gene essentiality. The aim of our study was to identify the shared feature(s) related to gene essentiality and to use these to aid in silico predictions. We investigated the statistical feature of Hurst exponents in the 33 available genomes. The results showed that in most organisms (31 out of 33) the significance levels of the Hurst exponents of the essential genes were significantly higher than for the corresponding full-gene-set. In most organisms, the significance levels of the Hurst exponents of the nonessential genes remained unchanged or increased only slightly. With only one exception, the Hurst exponents of essential genes followed a normal distribution. Machine learning methods have previously been employed for predicting gene essentiality. Various sequence-based and biochemistry-based characteristics have been chosen as descriptive features, including GC content, codon effective number, topology features of protein-protein interactions, gene expression, cellular localization, and biological process annotation [4]. To increase prediction accuracy, more efficient features should be explored. The final goal of computational prediction methods is to reduce or even eliminate the dependency on biochemical experiments. Thus, some of the features used currently should be avoided, such as protein-protein interactions and gene expression, which cannot be derived from the sequence directly. Based on our findings, the distribution feature of Hurst exponents show its generality in the essential genes of bacteria, therefore we propose that the distribution feature of Hurst exponents of essential genes could be used as an effective index for data classifying or clustering in the prediction of essential genes of bacteria. The Hurst exponent has been screened out and adopted as key feature in our related support vector machine-based prediction method and the results verified the efficiency of the Hurst exponent (unpublished results). As essential genes code for fundamental cellular functions required for the viability of an organism, they tend to be more conserved than nonessential genes across organisms, which means lower evolutionary rates [18,19]. Furthermore, it has been reported that long-range correlations increase with evolution [20]. This means that the Hurst exponents of essential/nonessential genes are a descriptive parameter related to evolutionary level, which could be added to the annotation of each gene. Quantitative analysis of the relationship between the Hurst exponents of essential genes and the evolution of essential genes will be the subject of our future work. Under different environmental conditions, a gene may gain or lose essentiality. This has prompted research into conditionally essential genes, which will expand our knowledge of which genes in an organism are essential and under what conditions they are essential [21,22]. Readers can find the detail of the laboratory condition in DEG (http://tubic.tju.edu.cn/deg/organism.php?db=p) or S1 and S2 Datasets. It’s a pity that, at present, there are not enough contrasting data of the objects in variant laboratory conditions for classifying or clustering, highlighting the need for a database of conditionally essential genes. Experimental confirmation of data derived from computational prediction methods is still indispensable for the accurate prediction of essential genes. However, the development and improvement of specific databases, such as DEG, will help to increase the prediction accuracy of computational approaches.

Hurst exponents of the essential genes of the 33 bacteria in DEG10.7.

(XLSX) Click here for additional data file.

Hurst exponents of the nonessential genes of the 33 bacteria in DEG10.7.

(XLSX) Click here for additional data file.
  16 in total

1.  Statistical mechanics in biology: how ubiquitous are long-range correlations?

Authors:  H E Stanley; S V Buldyrev; A L Goldberger; Z D Goldberger; S Havlin; R N Mantegna; S M Ossadnik; C K Peng; M Simons
Journal:  Physica A       Date:  1994-04-01       Impact factor: 3.263

2.  A statistical feature of Hurst exponents of essential genes in bacterial genomes.

Authors:  Xiao Liu; Shi-Yuan Wang; Jia Wang
Journal:  Integr Biol (Camb)       Date:  2011-11-23       Impact factor: 2.192

3.  Essence of life: essential genes of minimal genomes.

Authors:  Mario Juhas; Leo Eberl; John I Glass
Journal:  Trends Cell Biol       Date:  2011-09-01       Impact factor: 20.808

4.  Investigation on series of length of coding and non-coding DNA sequences of bacteria using multifractal detrended cross-correlation analysis.

Authors:  Cristina Stan; Monica Teodora Cristescu; Buimaga Iarinca Luiza; C P Cristescu
Journal:  J Theor Biol       Date:  2013-01-10       Impact factor: 2.691

5.  Experimental and computational assessment of conditionally essential genes in Escherichia coli.

Authors:  Andrew R Joyce; Jennifer L Reed; Aprilfawn White; Robert Edwards; Andrei Osterman; Tomoya Baba; Hirotada Mori; Scott A Lesely; Bernhard Ø Palsson; Sanjay Agarwalla
Journal:  J Bacteriol       Date:  2006-09-29       Impact factor: 3.490

Review 6.  Essential genes in Bacillus subtilis: a re-evaluation after ten years.

Authors:  Fabian M Commichau; Nico Pietack; Jörg Stülke
Journal:  Mol Biosyst       Date:  2013-02-18

7.  Human T cell epitopes of Mycobacterium tuberculosis are evolutionarily hyperconserved.

Authors:  Iñaki Comas; Jaidip Chakravartti; Peter M Small; James Galagan; Stefan Niemann; Kristin Kremer; Joel D Ernst; Sebastien Gagneux
Journal:  Nat Genet       Date:  2010-05-23       Impact factor: 38.330

8.  A defined transposon mutant library and its use in identifying motility genes in Vibrio cholerae.

Authors:  D Ewen Cameron; Jonathan M Urbach; John J Mekalanos
Journal:  Proc Natl Acad Sci U S A       Date:  2008-06-23       Impact factor: 11.205

9.  High-resolution definition of the Vibrio cholerae essential gene set with hidden Markov model-based analyses of transposon-insertion sequencing data.

Authors:  Michael C Chao; Justin R Pritchard; Yanjia J Zhang; Eric J Rubin; Jonathan Livny; Brigid M Davis; Matthew K Waldor
Journal:  Nucleic Acids Res       Date:  2013-07-30       Impact factor: 16.971

10.  Patterns of evolutionary conservation of essential genes correlate with their compensability.

Authors:  Tobias Bergmiller; Martin Ackermann; Olin K Silander
Journal:  PLoS Genet       Date:  2012-06-28       Impact factor: 5.917

View more
  1 in total

1.  Selection of key sequence-based features for prediction of essential genes in 31 diverse bacterial species.

Authors:  Xiao Liu; Bao-Jin Wang; Luo Xu; Hong-Ling Tang; Guo-Qing Xu
Journal:  PLoS One       Date:  2017-03-30       Impact factor: 3.240

  1 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.