José P Faria, James J Davis, Janaka N Edirisinghe, Ronald C Taylor, Pamela Weisenhorn, Robert D Olson, Rick L Stevens, Miguel Rocha, Isabel Rocha, Aaron A Best, Matthew DeJongh, Nathan L Tintle, Bruce Parrello, Ross Overbeek, Christopher S Henry.
Abstract
Understanding gene function and regulation is essential for the interpretation, prediction, and ultimate design of cell responses to changes in the environment. An important step toward this goal is the identification of sets of genes that are always co-expressed. These gene sets, Atomic Regulons (ARs), represent fundamental units of function within a cell and could be used to associate genes of unknown function with cellular processes and to enable rational genetic engineering of cellular systems. Here, we describe an approach for inferring ARs that leverages large-scale expression data sets, gene context, and functional relationships among genes. We computed ARs for Escherichia coli based on 907 gene expression experiments and compared our results with gene clusters produced by two prevalent data-driven methods: hierarchical clustering and k-means clustering. We compared ARs and purely data-driven gene clusters to the curated set of regulatory interactions for E. coli found in RegulonDB, showing that ARs are more consistent with gold standard regulons than are data-driven gene clusters. We further examined the consistency of ARs and data-driven gene clusters in the context of gene interactions predicted by Context Likelihood of Relatedness (CLR) analysis, finding that the ARs show better agreement with CLR predicted interactions. We determined the impact of increasing amounts of expression data on AR construction and found that while more data improve ARs, it is not necessary to use the full set of gene expression experiments available for E. coli to produce high quality ARs. In order to explore the conservation of co-regulated gene sets across different organisms, we computed ARs for Shewanella oneidensis, Pseudomonas aeruginosa, Thermus thermophilus, and Staphylococcus aureus, which represent increasing degrees of phylogenetic distance from E. coli.
Comparison of the organism-specific ARs showed that the consistency of AR gene membership correlates with phylogenetic distance, but there is clear variability in the regulatory networks of closely related organisms. As large scale expression data sets become increasingly common for model and non-model organisms, comparative analyses of atomic regulons will provide valuable insights into fundamental regulatory modules used across the bacterial domain.
Keywords: CLR; Escherichia coli; atomic regulon; clustering; gene expression analysis; hierarchical clustering; k-means clustering; transcriptomic data
Year: 2016 PMID: 27933038 PMCID: PMC5121216 DOI: 10.3389/fmicb.2016.01819
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 5.640
Figure 1. Atomic Regulon Inference. Six steps of the Atomic Regulon (AR) inference algorithm. Step 1. Generate Initial Atomic Regulon Gene Sets. Initial clusters are proposed using gene clustering within putative operons and membership of genes within SEED subsystems. Step 2. Process Gene Expression Data and Calculate Pairwise Expression Profile Similarities. Integrate gene expression data, load the normalized data, and compute Pearson correlation coefficients. Step 3. Expression Informed Splitting of Initial Atomic Regulon Gene Sets. Split operon and subsystem-based clusters using the criterion that every pair of genes in a set must have expression profiles whose Pearson correlation coefficient (PCC) is greater than a set threshold. Step 4. Restrict Gene Membership to One Atomic Regulon Gene Set. Merge the clusters built from operons and subsystems, then use the binary connections to form a single set of large clusters using transitive closure. This also ensures that no gene is a member of more than one cluster. Step 5. Filter Atomic Regulon Gene Sets to Remove Low Correlation Genes. Split the merged clusters based on a distance computed between every pair of genes. This corrects for genes with a low PCC value that may have been placed in a common cluster. Step 6. Generate Final Set of Atomic Regulons. Estimate the ON/OFF status of each cluster in any specific experimental sample by a simple voting algorithm using the ON/OFF estimates for the genes that make up the AR. This merged set becomes the final set of ARs.
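Steps 3 and 4 of the caption can be illustrated with a minimal Python sketch. It assumes a hypothetical `profiles` dictionary mapping gene IDs to expression vectors and an illustrative PCC threshold of 0.7 (the caption does not state the value); the function names are ours and this is not the authors' implementation. Within each initial set, gene pairs whose PCC exceeds the threshold are kept as binary connections, and connected genes are then merged by transitive closure (here via union-find), so each gene ends up in at most one cluster.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def connected_pairs(initial_sets, profiles, threshold=0.7):
    """Step 3 (sketch): keep gene pairs inside an initial set whose PCC
    exceeds the threshold. The threshold value is an assumption."""
    pairs = set()
    for genes in initial_sets:
        for g1, g2 in combinations(sorted(genes), 2):
            if pearson(profiles[g1], profiles[g2]) > threshold:
                pairs.add((g1, g2))
    return pairs


def transitive_closure(pairs):
    """Step 4 (sketch): merge connected genes with union-find so that
    no gene belongs to more than one cluster."""
    parent = {}

    def find(g):
        parent.setdefault(g, g)
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path compression
            g = parent[g]
        return g

    for g1, g2 in pairs:
        parent[find(g1)] = find(g2)

    clusters = {}
    for g in list(parent):
        clusters.setdefault(find(g), set()).add(g)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical toy data: four genes measured in four experiments.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 6.0, 8.2],
    "geneC": [1.1, 2.1, 2.9, 4.2],
    "geneD": [4.0, 1.0, 3.5, 0.5],
}
operon_sets = [{"geneA", "geneB", "geneD"}]   # operon-based initial set
subsystem_sets = [{"geneB", "geneC"}]         # subsystem-based initial set

pairs = connected_pairs(operon_sets + subsystem_sets, profiles)
clusters = transitive_closure(pairs)
print(clusters)
```

In this toy input, geneD is anticorrelated with its operon neighbors and is dropped, while geneB links the operon-based and subsystem-based sets into a single cluster.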
Figure 2. Comparison of RegulonDB regulons with hierarchical clustering, k-means clustering, and atomic regulons. (A) The Jaccard coefficient comparing E. coli RegulonDB regulons vs. the clustering methods is shown as a percentage of similarity. (B) Comparison of AR similarity to RegulonDB regulons with and without inclusion of SEED subsystems, using 100, 50, and 10% of the experiment data.
Figure 3. Degree of CLR support. (A) CLR support compared between our ARs and ARs produced via k-means clusters and hierarchical clusters. (B) CLR support for our AR construction method, broken down by different AR sizes.
Figure 4. Sensitivity analysis of Atomic Regulon inference. (A) Average number of genes in atomic regulons. (B) Average number of atomic regulons. (C) Average number of genes always ON. (D) Average number of genes always OFF. Standard deviation error bars represent the variation across 100 data set randomizations from random sampling of experiments.
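The resampling behind such error bars can be sketched as follows. Only the 100 randomizations come from the caption; the `on_off` table (gene to per-experiment ON/OFF calls), the sampled fraction, and the function names are hypothetical. For each randomization a random subset of experiments is drawn, the number of genes that are ON in every sampled experiment is counted, and the mean and standard deviation are taken over the randomizations.

```python
import random
from statistics import mean, stdev


def count_always_on(on_off, experiment_idx):
    """Number of genes called ON in every one of the selected experiments."""
    return sum(all(calls[i] for i in experiment_idx) for calls in on_off.values())


def subsample_statistic(on_off, n_experiments, fraction,
                        n_randomizations=100, seed=0):
    """Mean and sample standard deviation of the always-ON gene count over
    random subsets of experiments (drawn without replacement)."""
    rng = random.Random(seed)
    k = max(1, round(fraction * n_experiments))
    values = []
    for _ in range(n_randomizations):
        idx = rng.sample(range(n_experiments), k)
        values.append(count_always_on(on_off, idx))
    return mean(values), stdev(values)


# Hypothetical ON/OFF calls for three genes across six experiments.
on_off = {
    "geneA": [True, True, True, True, True, True],
    "geneB": [True, False, True, True, True, True],
    "geneC": [False, False, False, False, False, False],
}

avg, sd = subsample_statistic(on_off, n_experiments=6, fraction=0.5)
print(f"genes always ON: {avg:.2f} ± {sd:.2f}")
```

The same loop could wrap any other statistic in the figure (for example, the number of ARs) by swapping out `count_always_on`.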
Average Jaccard similarity coefficient between each set of atomic regulons from the 2-fold cross validation.
| | Set1 | Set2 | All ARs |
| Set1 | 1 | 0.80 ± 0.35 | 0.89 ± 0.26 |
| Set2 | 0.83 ± 0.31 | 1 | 0.92 ± 0.19 |
| All ARs | 0.81 ± 0.35 | 0.80 ± 0.37 | 1 |
Mean ± Standard Deviation.
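One way a table of this kind could be computed is sketched below: each AR in one set is matched to the AR in the other set with the highest Jaccard coefficient, and the best-match scores are summarized as mean ± standard deviation. The best-match pairing is an assumption (the text here does not say how ARs were paired), and the function names and toy gene sets are hypothetical. Because the matching is directional, such a table need not be symmetric.

```python
from statistics import mean, pstdev


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match_similarity(query_ars, reference_ars):
    """For each query AR, take its best Jaccard score against the reference
    ARs; return the mean and (population) standard deviation of those scores."""
    scores = [max(jaccard(q, r) for r in reference_ars) for q in query_ars]
    return mean(scores), pstdev(scores)


# Hypothetical AR sets from two halves of the expression data.
set1 = [{"g1", "g2", "g3"}, {"g4", "g5"}]
set2 = [{"g1", "g2"}, {"g4", "g5"}, {"g6"}]

m12, sd12 = best_match_similarity(set1, set2)
m21, sd21 = best_match_similarity(set2, set1)
print(f"Set1 vs Set2: {m12:.2f} ± {sd12:.2f}")
print(f"Set2 vs Set1: {m21:.2f} ± {sd21:.2f}")
```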
Atomic regulon statistics.
| 907 | 646 | 2604 (60%) | 292 (6.8%) | 69 (1.6%) | 4309 |
| 245 | 335 | 1559 (37%) | 265 (6.4%) | 32 (0.8%) | 4167 |
| 236 | 423 | 2427 (43%) | 557 (9.8%) | 78 (1.4%) | 5682 |
| 543 | 196 | 1422 (63%) | 692 (30.9%) | 27 (1.2%) | 2239 |
| 852 | 397 | 1749 (63%) | 447 (16.1%) | 28 (1%) | 2770 |
Figure 5. Comparison of E. coli ARs across genomes. (A, B) The % of similarity is given by the Jaccard coefficient, which is defined as the size of the intersection divided by the size of the union of the sample sets. (B) Jaccard coefficients computed for each E. coli AR across all combinations of four, three, and two genomes.
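The Jaccard coefficient defined in the caption extends directly from two sets to several. The sketch below computes it for one AR across combinations of genomes; the genome labels, the gene identifiers, and the assumption that genes have already been mapped to shared ortholog IDs are all hypothetical.

```python
from itertools import combinations


def jaccard_multi(gene_sets):
    """Jaccard coefficient of several sets: |intersection| / |union|."""
    sets = [set(s) for s in gene_sets]
    union = set().union(*sets)
    if not union:
        return 0.0
    return len(set.intersection(*sets)) / len(union)


# Hypothetical membership of one AR in three genomes (shared ortholog IDs).
ar_by_genome = {
    "genome1": {"o1", "o2", "o3", "o4"},
    "genome2": {"o1", "o2", "o3"},
    "genome3": {"o1", "o2", "o5"},
}

# Score every combination of three and of two genomes.
scores = {}
for size in (3, 2):
    for combo in combinations(sorted(ar_by_genome), size):
        scores[combo] = jaccard_multi(ar_by_genome[g] for g in combo)

for combo, score in scores.items():
    print(", ".join(combo), f"{100 * score:.0f}%")
```

Adding genomes can only shrink the intersection and grow the union, so the coefficient for a combination never exceeds that of any of its sub-combinations.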
Atomic regulons with similarity >75% across genomes.
| 15 | 13 | NADH-ubiquinone oxidoreductase chain | PAO1 |
| 54 | 7 | Biogenesis of c-type cytochromes | PAO1 |
| 57 | 6 | Tryptophan synthesis | |
| 112 | 5 | Phosphate transport system | |
| 316 | 3 | Molybdenum transport system | PAO1 |
| 362 | 2 | Heat shock proteins | |
| 398 | 2 | Paraquat-inducible proteins | PAO1 |
| 500 | 2 | Ribonucleotide reductase | |