Masayuki Ishitsuka1, Tatsuya Akutsu2, Jose C Nacher1.
Abstract
Recently, the number of essential gene entries has considerably increased. However, little is known about the relationships between essential genes and their functional roles in critical network control at both the structural (protein interaction network) and dynamic (transcriptional) levels, in part because the large size of the network prevents extensive computational analysis. Here, we present an algorithm that identifies the critical control set of nodes while reducing the computational time 180-fold and expanding the computable network size up to 25-fold, from 1,000 to 25,000 nodes. For the first time, the developed algorithm allows a critical controllability analysis of large integrated systems composed of a transcriptome- and proteome-wide protein interaction network. The data-driven analysis captures a direct triad association of the structural controllability of genes, lethality and dynamic synchronization of co-expression. We believe that the identified optimized critical network control subsets may be of interest as drug targets; thus, they may be useful for drug design and development.
Year: 2016 PMID: 27040162 PMCID: PMC4819195 DOI: 10.1038/srep23541
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. A random network (a) and a scale-free network (b) with their corresponding degree distributions. The example shows that the network structure strongly affects the controllability roles of the nodes. Whereas critical nodes are almost absent in random networks, in scale-free networks they are noticeably represented and affect the controllability of the network. Random networks, by contrast, are more flexible because they depend on intermittent control nodes. The proposed algorithmic procedure relies on the structure of scale-free networks and on the existence of highly connected nodes to pre-determine a large number of critical and redundant nodes without using integer linear programming (ILP). (c) The results of the algorithm computed on the human protein interaction network. (d) The fraction of proteins in each control category for all analysed PPI networks. The set of proteins involved in critical control is very small (less than 10% in almost all cases). The smallest critical control set is found in H. sapiens and represents 6% of all proteins.
Figure 2. Illustration of the computation of critical, redundant and intermittent sets using the proposed algorithmic procedure.
Details of the calculations, including the exact equations to be solved by ILP, are shown in SI. (a) The initial network. Next, we apply the novel pre-processing step for determining critical nodes (b) and redundant nodes (c). (d) The minimum dominating set (MDS) is computed for the entire network (hexagonal nodes). (e) Among the MDS nodes, only node 12 is not pre-determined as a critical node. The critical set procedure is applied and node 12 (see red arrow) is determined to be intermittent. (f–h) The remaining nodes do not belong to the MDS; therefore, the redundant set procedure is applied sequentially to nodes 11 and 13 (see red arrow). (i) The experimental results for computational time (milliseconds (ms)) versus network size on scale-free networks with γ = 2 and average degree
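The three control categories can be illustrated on a toy graph. The sketch below is not the pre-processing plus ILP procedure of the paper; it is a brute-force enumeration that only works for very small networks. It assumes the usual MDS-based definitions: a node is critical if it belongs to every minimum dominating set, redundant if it belongs to none, and intermittent otherwise. The graph and all function names are hypothetical.

```python
from itertools import combinations

def is_dominating(adj, subset):
    """True if every node is in `subset` or adjacent to a node in it."""
    covered = set()
    for v in subset:
        covered |= {v} | adj[v]
    return len(covered) == len(adj)

def all_minimum_dominating_sets(adj):
    """Enumerate every dominating set of minimum size (exponential time)."""
    nodes = sorted(adj)
    for size in range(1, len(nodes) + 1):
        found = [set(c) for c in combinations(nodes, size) if is_dominating(adj, c)]
        if found:
            return found
    return []

def classify_nodes(adj):
    """Split the nodes of a non-empty graph into (critical, intermittent, redundant)."""
    mds_list = all_minimum_dominating_sets(adj)
    critical = set.intersection(*mds_list)   # in every MDS
    in_some = set.union(*mds_list)           # in at least one MDS
    intermittent = in_some - critical
    redundant = set(adj) - in_some           # in no MDS
    return critical, intermittent, redundant

# Hypothetical toy graph: hub 0 with leaves 1 and 2, plus the chain 0-3-4-5.
toy = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0, 4}, 4: {3, 5}, 5: {4}}
critical, intermittent, redundant = classify_nodes(toy)
```

On this toy graph the minimum dominating sets are {0, 4} and {0, 5}, so the hub 0 is critical, nodes 4 and 5 are intermittent, and nodes 1, 2 and 3 are redundant.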
Figure 3. The enrichment of critical, intermittent and redundant nodes versus node degree for each organism.
To measure the statistical proportion of proteins engaged in a given control set S (critical, intermittent and redundant) according to their degrees, we performed the following enrichment computation, which was also used by Wuchty9. First, proteins were classified according to their degree in logarithmic bins of increasing size. For each bin class i, we computed the frequency of proteins with degree k as . Next, we calculated the fraction of proteins with degree k that also appeared in a given set S as . Then, the enrichment of proteins with degree k that appear in the control set S (critical, intermittent and redundant) in bin i was computed as . Positive values of this function indicate the enrichment of degree k for the critical, intermittent and redundant set, respectively. Negative values indicate depletion of degree k.
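The enrichment computation described in this caption can be sketched as follows. The exact formulas are not reproduced here, so the code rests on assumptions: bins are powers of two, the frequency in bin i is N_i / N, the fraction within set S is N_i^S / N^S, and the enrichment is the base-2 logarithm of their ratio. This choice is consistent with positive values meaning enrichment and negative values meaning depletion, but the original may use a different normalisation or logarithm base. All names and data are hypothetical.

```python
import math
from collections import Counter

def log2_bin(degree):
    """Bin index for a degree >= 1: bin 0 = {1}, bin 1 = {2, 3}, bin 2 = {4..7}, ..."""
    return degree.bit_length() - 1

def degree_enrichment(degrees, control_set):
    """Assumed enrichment per bin: log2( (N_i^S / N^S) / (N_i / N) )."""
    n_total = len(degrees)
    n_set = len(control_set)
    all_counts = Counter(log2_bin(k) for k in degrees.values())
    set_counts = Counter(log2_bin(degrees[p]) for p in control_set)
    enrichment = {}
    for i, n_i in sorted(all_counts.items()):
        f_i = n_i / n_total                     # frequency of bin i among all proteins
        f_i_s = set_counts.get(i, 0) / n_set    # frequency of bin i within set S
        enrichment[i] = math.log2(f_i_s / f_i) if f_i_s > 0 else float("-inf")
    return enrichment

# Hypothetical degrees for eight proteins and a hypothetical control set S.
degrees = {"a": 1, "b": 1, "c": 1, "h": 1, "d": 2, "e": 3, "f": 8, "g": 9}
control_set = {"a", "d", "f", "g"}
result = degree_enrichment(degrees, control_set)
```

In this toy example the low-degree bin is depleted in S (value -1), the middle bin is neutral (0) and the high-degree bin is enriched (+1).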
The two-tailed p-values of Fisher's exact test show that optimized proteins engaged in critical control are significantly enriched with essential genes in all analysed networks (P < 0.05) except that of S. pombe, which is not surprising given the small amount of data collected for this organism, as shown in Fig. S9 and Table S1.
| Organism | P-value (Swiss) | P-value (Swiss & TrEMBL) |
|---|---|---|
| | 2.51E-01 | 1.55E-02 |
| | 5.67E-03 | 3.12E-03 |
| | 1.87E-02 | 1.86E-02 |
| | 1.93E-02 | 1.92E-02 |
| | 1.96E-03 | 1.94E-03 |
| | 7.83E-03 | 7.10E-03 |
| | 3.01E-07 | 3.01E-07 |
| | 8.63E-01 | 8.63E-01 |
| | 1.41E-04 | 1.41E-04 |
| | 9.20E-03 | 9.20E-03 |
| | 4.25E-04 | 4.25E-04 |
The C. elegans results show statistical significance (P < 0.05) for the Swiss & TrEMBL dataset. For E. coli MG1655, we also compiled several datasets. E. coli 1 and 2 refer to data from Gerdes et al.30 and Baba et al.31, respectively. Both datasets were available in the DEG. E. coli 3 refers to the combined data from E. coli 1 and E. coli 2. For the H. sapiens datasets, see the Methods section for details.
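A two-tailed Fisher's exact test of the kind reported in the table above can be sketched with the standard library alone. The 2×2 layout assumed here is (critical and essential, critical and non-essential, non-critical and essential, non-critical and non-essential), and the two-tailed p-value is taken as the sum of the probabilities of all tables that are no more likely than the observed one. The counts in the example are hypothetical and are not taken from the paper.

```python
from math import comb

def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with fixed margins.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    lowest = max(0, row1 - (n - col1))
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum all tables at most as probable as the observed one (small tolerance for rounding).
    total = sum(prob(x) for x in range(lowest, highest + 1)
                if prob(x) <= p_observed * (1 + 1e-9))
    return min(1.0, total)

# Hypothetical counts: 3 critical essential, 1 critical non-essential,
# 1 non-critical essential, 3 non-critical non-essential proteins.
p_value = fisher_exact_two_tailed(3, 1, 1, 3)
```

For real datasets, an established implementation such as `scipy.stats.fisher_exact` would be the more usual choice; the hand-written version is shown only to make the computation explicit.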
Figure 4. The box-and-whisker plot of the proteins enriched for each Gene Ontology annotation category: biological process, molecular function, and cellular component.
Each protein was classified into five categories based on whether it is engaged in critical (red), intermittent (blue) or redundant (purple) network control and on whether it plays an essential role (green) or an essential and critical control role simultaneously (orange). The enrichment of each category was then calculated as shown in the Methods section. The results show that essential proteins engaged in critical control (orange) also show the highest enrichment in biological process, molecular function and cellular component annotations. The strongest signal for their associations with GO categories corresponds to H. sapiens, S. cerevisiae, C. elegans, D. melanogaster and E. coli. Each organism is indicated in the figure. For E. coli, results are shown for three different essential gene datasets. For H. sapiens, the results for dataset (1) are shown in the figure; very similar results were obtained for the remaining H. sapiens datasets. The statistics and data sources for the essential genes are shown in Table S2. Additional statistical data are shown in Table 2 in the main text and in Tables S6 and S7 in SI. See also the Supplementary Excel files for additional data.
The top-five Gene Ontology (GO) terms for the three GO categories ordered by the number of critical and essential critical genes.
| GO category | GO term | Number of critical genes | Number of essential critical genes | Enrichment (critical) | Enrichment (essential critical) | p-value (critical) | p-value (essential critical) |
|---|---|---|---|---|---|---|---|
| Biological Process | transcription, DNA-templated | 67 | 15 | 0.29 | 0.53 | 1.27E-02 | 4.40E-02 |
| Biological Process | gene expression | 52 | 13 | 0.42 | 0.78 | 1.59E-03 | 7.52E-03 |
| Biological Process | positive regulation of transcription from RNA polymerase II promoter | 51 | 21 | 0.59 | 1.45 | 2.71E-05 | 6.78E-09 |
| Biological Process | innate immune response | 43 | 14 | 0.72 | 1.34 | 3.81E-06 | 1.17E-05 |
| Biological Process | viral process | 41 | 9 | 0.58 | 0.81 | 2.06E-04 | 1.76E-02 |
| Cellular Component | cytoplasm | 208 | 45 | 0.35 | 0.56 | 2.62E-10 | 6.19E-06 |
| Cellular Component | nucleus | 198 | 47 | 0.24 | 0.54 | 1.91E-05 | 4.69E-06 |
| Cellular Component | cytosol | 155 | 37 | 0.41 | 0.72 | 7.75E-09 | 1.90E-06 |
| Cellular Component | nucleoplasm | 128 | 36 | 0.26 | 0.74 | 6.77E-04 | 2.28E-06 |
| Cellular Component | plasma membrane | 87 | 20 | 0.24 | 0.52 | 1.14E-02 | 1.60E-02 |
| Molecular Function | ATP binding | 63 | 16 | 0.38 | 0.76 | 1.46E-03 | 2.89E-03 |
| Molecular Function | identical protein binding | 39 | 10 | 0.89 | 1.27 | 1.83E-07 | 4.38E-04 |
| Molecular Function | sequence-specific DNA binding transcription factor activity | 39 | 13 | 0.36 | 1.01 | 1.78E-02 | 7.24E-04 |
| Molecular Function | ubiquitin protein ligase binding | 33 | 9 | 1.26 | 1.70 | 8.77E-11 | 3.27E-05 |
| Molecular Function | ligase activity | 33 | 5 | 0.83 | 0.69 | 5.06E-06 | 1.06E-01 |
The associated enrichment factors and the two-tailed p-values of Fisher's exact test are also displayed. The results correspond to the H. sapiens 1 dataset. For the H. sapiens 2 and H. sapiens OGEE datasets, see Supplementary Information.
Figure 5. Enrichment of co-expressed genes for each category: critical (red), intermittent (blue), redundant (purple), essential (green) and simultaneously essential and critical control roles (orange) across 36 different healthy human tissues.
The subset of co-expressed critical genes that are also annotated as essential genes exhibits the highest enrichment in almost all tissues. The analysis is performed for three different datasets of essential genes, as shown in Table S2. The results for the H. sapiens (1) dataset are shown in the figure. The results for the remaining H. sapiens datasets are shown in Fig. S20 in SI.
Figure 6. The fraction of genes in a particular set of critical (a), essential and critical (b), essential (c), intermittent (d) or redundant (e) genes among genes whose transcripts are expressed in t tissues, from 1 to 36. In each panel (a–d), the scale for the grey bars is marked on the right axis and indicates the number of proteins found in each tissue. The fraction of critical and essential critical proteins increases with the number of tissues in which they are expressed. The essential gene dataset corresponds to H. sapiens (1). The results from two other human essential gene datasets are shown in SI.
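The quantity plotted in this figure, the fraction of genes from a given set among all genes expressed in exactly t tissues, can be sketched as below. The input format (a mapping from gene to the set of tissues in which it is expressed), the gene labels and the tissue names are assumptions made for illustration only.

```python
from collections import defaultdict

def fraction_by_tissue_count(expressed_in, gene_set):
    """For each t, the fraction of genes expressed in exactly t tissues that lie in gene_set."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for gene, tissues in expressed_in.items():
        t = len(tissues)
        if t == 0:
            continue  # genes not expressed in any tissue are ignored
        totals[t] += 1
        if gene in gene_set:
            hits[t] += 1
    return {t: hits[t] / totals[t] for t in sorted(totals)}

# Hypothetical expression data for four genes and a hypothetical critical set.
expressed_in = {
    "g1": {"liver"},
    "g2": {"liver", "brain"},
    "g3": {"liver", "brain"},
    "g4": {"heart"},
}
critical_genes = {"g2", "g3", "g4"}
fractions = fraction_by_tissue_count(expressed_in, critical_genes)
```

Here half of the genes expressed in one tissue are critical, and all genes expressed in two tissues are critical.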
The list of critical genes that are co-expressed in 35 or more human tissues classified according to the KOG functional class for the H. sapiens 1 dataset.
| KOG Functional class | Function | Number of critical genes | Gene name |
|---|---|---|---|
| A | RNA processing and modification | 5 | DDX24, DDX39B, U2AF2, SF3B3, UPF3A |
| B | Chromatin structure and dynamics | 1 | TLE1 |
| C | Energy production and conversion | 1 | ATP6AP1 |
| D | Cell cycle control, cell division, chromosome partitioning | 2 | MOB4, PEA15 |
| E | Amino acid transport and metabolism | 1 | GOT2 |
| F | Nucleotide transport and metabolism | 0 | |
| G | Carbohydrate transport and metabolism | 1 | GAPDH |
| H | Coenzyme transport and metabolism | 1 | ALAS1 |
| I | Lipid transport and metabolism | 1 | FDFT1 |
| J | Translation, ribosomal structure and biogenesis | 4 | EEF1A1, RPLP1, RPL8, RPS3A |
| K | Transcription | 9 | XBP1, CTBP1, MAX, TCF4, ZNHIT3, TCF12, SKP1, MED23, TCEB1 |
| L | Replication, recombination and repair | 1 | XRCC6 |
| M | Cell wall/membrane/envelope biogenesis | 0 | |
| N | Cell motility | 0 | |
| O | Posttranslational modification, protein turnover, chaperones | 14 | UBE2I, UBE2D3, UBE2D2, UBE2D4, RNF11, UBE2E1, UBE2E3, UBE2N, UBE2K, YWHAE, UBE3A, UBE2L6, DNAJA1, CDC34 |
| P | Inorganic ion transport and metabolism | 1 | SAT1 |
| Q | Secondary metabolites biosynthesis, transport and catabolism | 0 | |
| R | General function prediction only | 8 | EWSR1, CDC42, PLEKHF2, RAC1, LMO4, RAP2A, KLF10, PPFIA1 |
| S | Function unknown | 1 | WBP11 |
| T | Signal transduction mechanisms | 13 | NCK1, FYN, CRK, PIK3R1, ACVR1, MAPK14, NUDT3, ARHGDIA, PSEN1, MPP3, PRKAR1A, MAPK10, PPP2CA |
| U | Intracellular trafficking, secretion, and vesicular transport | 1 | SEC23B |
| V | Defense mechanisms | 0 | |
| W | Extracellular structures | 0 | |
| Y | Nuclear structure | 1 | NSFL1C |
| Z | Cytoskeleton | 7 | MAP1LC3B, NDEL1, TUBGCP4, PFN2, GABARAPL2, ARPC3, DYNLL1 |
See SI for the remaining datasets.
Figure 7. The fraction of genes for each category (critical (a), critical and essential (b), essential (c), intermittent (d) and redundant (e)) among those whose average Pearson Correlation Coefficient (PCC) with all other genes in the cell is <ρ>. See the Methods section for details. (f) The PCC averaged for all genes included in each category. The essential gene dataset corresponds to H. sapiens (1). The results for two other human essential gene datasets are shown in SI. The linear fit to the data (black line) is shown with its 95% confidence band in the same colour as the data points; where the band is very narrow, it may not be visible. The exact p-values for the statistical significance analysis are shown in SI, Table S5.
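The per-gene average PCC used on the horizontal axis can be sketched as follows, assuming each gene has an expression profile measured over the same ordered list of samples. The details of the original computation are in the Methods section, which is not reproduced here, so the data layout, gene labels and values are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (undefined for constant profiles)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def average_pcc(expression):
    """Average PCC of each gene with all other genes in `expression`."""
    averages = {}
    for gene, profile in expression.items():
        others = [pearson(profile, other)
                  for name, other in expression.items() if name != gene]
        averages[gene] = sum(others) / len(others)
    return averages

# Hypothetical expression profiles over four samples.
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
}
avg = average_pcc(expression)
```

In this example g1 and g2 are perfectly correlated with each other and anti-correlated with g3, so their averages are 0, while the average for g3 is -1. Averaging these per-gene values over the genes of one category would give a category-level mean comparable to panel (f).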