Furqan Aziz, Animesh Acharjee, John A Williams, Dominic Russ, Laura Bravo-Merodio, Georgios V Gkoutos.
Abstract
Inferring the topology of a gene regulatory network (GRN) from gene expression data is a challenging but important step towards a better understanding of gene regulation. Key challenges include working with noisy data and dealing with a higher number of genes than samples. Although a number of methods have been proposed to infer the structure of a GRN, there are large discrepancies among the inference algorithms they adopt, which makes a meaningful comparison difficult. In this study, we used two methods, MIDER (Mutual Information Distance and Entropy Reduction) and PLSNET (partial least squares based feature selection), to infer the structure of a GRN directly from data, and we computationally validated our results. Both methods were applied to gene expression datasets from inflammatory bowel disease (IBD), pancreatic ductal adenocarcinoma (PDAC), and acute myeloid leukaemia (AML) studies. In each case, gene regulators were successfully identified. For example, the UGT1A family genes were identified as key regulators in the IBD dataset, while the SULF1 and THBS2 genes were identified in the PDAC dataset. We further demonstrate that an ensemble-based approach that combines the output of the MIDER and PLSNET algorithms can infer the structure of a GRN from data with higher accuracy. We also estimated the number of samples required for potential future validation studies. The proposed analysis framework therefore provides both a prediction of candidate regulator genes for potential validation experiments and an estimate of the number of samples required for these experiments.
Keywords: causal modelling; experimental design; gene regulatory network; omics integration
Year: 2020 PMID: 33114263 PMCID: PMC7660606 DOI: 10.3390/ijms21217886
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
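The ensemble idea mentioned in the abstract, combining the output of MIDER and PLSNET, can be illustrated with a minimal sketch. It assumes that each method yields a gene-by-gene matrix of edge scores in which entry (i, j) scores the edge from gene i to gene j; that layout is an assumption made here for illustration. The combination rule shown, averaging normalised ranks and keeping the strongest fraction of edges, is a generic choice and is not necessarily the rule used in the study. The function names and the `top_fraction` parameter are hypothetical.

```python
import numpy as np

def rank_normalise(scores):
    """Map off-diagonal edge scores to ranks in (0, 1]; a larger score gets a larger rank."""
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    mask = ~np.eye(n, dtype=bool)            # ignore self-edges
    vals = scores[mask]
    order = vals.argsort().argsort()         # 0 = weakest edge
    ranks = np.zeros_like(scores)
    ranks[mask] = (order + 1) / vals.size
    return ranks

def ensemble_edges(scores_a, scores_b, top_fraction=0.1):
    """Average the rank-normalised scores of two methods and keep the top fraction of edges.

    Assumption: both inputs are square matrices over the same ordered gene list.
    Tied scores at the cutoff are all kept, so slightly more edges may be returned.
    """
    combined = (rank_normalise(scores_a) + rank_normalise(scores_b)) / 2.0
    n = combined.shape[0]
    mask = ~np.eye(n, dtype=bool)
    k = max(1, int(round(top_fraction * mask.sum())))
    cutoff = np.sort(combined[mask])[-k]
    return (combined >= cutoff) & mask        # boolean adjacency, True = edge kept

# Toy scores for three genes (illustrative values, not taken from the study)
mider_like = np.array([[0.0, 0.9, 0.1], [0.2, 0.0, 0.3], [0.05, 0.4, 0.0]])
plsnet_like = np.array([[0.0, 0.8, 0.2], [0.1, 0.0, 0.5], [0.3, 0.6, 0.0]])
adjacency = ensemble_edges(mider_like, plsnet_like, top_fraction=0.34)
```

Rank normalisation is used here because the raw scores of two different methods are generally not on a comparable scale.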
Figure 1. GRN generated from the application of partial least squares based feature selection (PLSNET) on the inflammatory bowel disease (IBD) dataset.
Frequencies of different genes appearing as Regulatory (R), Target (T), or Intermediate (I) gene for different threshold values for the IBD data. For each threshold value, the experiment was executed 100 times with the same set of parameter values.
| Genes | Top 2% R | Top 2% T | Top 2% I | Top 5% R | Top 5% T | Top 5% I | Top 10% R | Top 10% T | Top 10% I | Top 15% R | Top 15% T | Top 15% I | Top 20% R | Top 20% T | Top 20% I |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | 0 | 66 | 0 | 0 | 98 | 0 | 0 | 100 | 0 | 0 | 100 | 0 | 0 | 100 | 0 |
|  | 0 | 36 | 0 | 0 | 97 | 0 | 0 | 100 | 0 | 0 | 100 | 0 | 0 | 100 | 0 |
|  | 0 | 1 | 0 | 0 | 23 | 0 | 0 | 98 | 0 | 0 | 100 | 0 | 0 | 100 | 0 |
|  | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 14 | 0 | 0 | 84 | 2 | 0 | 90 | 0 |
|  | 0 | 49 | 0 | 0 | 94 | 0 | 0 | 100 | 0 | 0 | 100 | 0 | 0 | 100 | 0 |
|  | 7 | 0 | 0 | 26 | 1 | 0 | 86 | 1 | 2 | 87 | 0 | 12 | 23 | 0 | 77 |
|  | 81 | 1 | 0 | 99 | 1 | 0 | 99 | 0 | 1 | 92 | 0 | 8 | 33 | 0 | 67 |
|  | 82 | 0 | 0 | 95 | 0 | 0 | 97 | 0 | 2 | 95 | 0 | 5 | 59 | 0 | 41 |
|  | 5 | 0 | 0 | 12 | 0 | 0 | 40 | 3 | 0 | 46 | 12 | 37 | 0 | 1 | 99 |
|  | 0 | 31 | 0 | 0 | 93 | 0 | 0 | 100 | 0 | 0 | 100 | 0 | 0 | 100 | 0 |
|  | 0 | 99 | 0 | 0 | 100 | 0 | 0 | 100 | 0 | 0 | 100 | 0 | 0 | 100 | 0 |
|  | 0 | 3 | 0 | 0 | 45 | 0 | 0 | 96 | 0 | 0 | 100 | 0 | 0 | 100 | 0 |
|  | 0 | 20 | 1 | 0 | 75 | 1 | 0 | 99 | 1 | 0 | 99 | 1 | 0 | 99 | 1 |
|  | 0 | 2 | 0 | 1 | 28 | 0 | 0 | 88 | 3 | 0 | 96 | 4 | 0 | 94 | 6 |
|  | 0 | 27 | 0 | 0 | 95 | 0 | 0 | 100 | 0 | 0 | 100 | 0 | 0 | 100 | 0 |
|  | 1 | 2 | 0 | 2 | 26 | 0 | 1 | 85 | 1 | 0 | 98 | 2 | 0 | 97 | 3 |
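The frequency counts tabulated above can be thought of as the result of a role classification repeated over many runs. The sketch below is one plausible reading, stated as an assumption and not as the authors' procedure: after keeping the top fraction of scored edges, a gene with only outgoing edges is labelled Regulatory (R), a gene with only incoming edges is labelled Target (T), a gene with both is labelled Intermediate (I), and a gene with no retained edge receives no role for that run. The function names, the `top_fraction` parameter, and the score-matrix layout (entry (i, j) scores the edge from gene i to gene j) are hypothetical.

```python
import numpy as np

def classify_genes(scores, top_fraction):
    """Label each gene R, T, I, or '-' from the strongest fraction of directed edges."""
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    mask = ~np.eye(n, dtype=bool)                      # ignore self-edges
    k = max(1, int(round(top_fraction * mask.sum())))  # number of edges to keep
    cutoff = np.sort(scores[mask])[-k]
    adj = (scores >= cutoff) & mask                    # adj[i, j]: edge from gene i to gene j
    out_deg = adj.sum(axis=1)
    in_deg = adj.sum(axis=0)
    roles = []
    for i in range(n):
        if out_deg[i] > 0 and in_deg[i] == 0:
            roles.append("R")                          # only outgoing edges
        elif in_deg[i] > 0 and out_deg[i] == 0:
            roles.append("T")                          # only incoming edges
        elif in_deg[i] > 0 and out_deg[i] > 0:
            roles.append("I")                          # both directions
        else:
            roles.append("-")                          # not in the thresholded network
    return roles

def role_frequencies(score_matrices, top_fraction):
    """Count, per gene, how often each role occurs across repeated runs."""
    n = np.asarray(score_matrices[0]).shape[0]
    counts = [{"R": 0, "T": 0, "I": 0} for _ in range(n)]
    for scores in score_matrices:
        for gene, role in enumerate(classify_genes(scores, top_fraction)):
            if role in counts[gene]:
                counts[gene][role] += 1
    return counts

# Toy example with three genes (illustrative values, not taken from the study)
toy = np.array([[0.0, 0.9, 0.1], [0.05, 0.0, 0.8], [0.02, 0.03, 0.0]])
roles_two_edges = classify_genes(toy, top_fraction=0.34)
roles_one_edge = classify_genes(toy, top_fraction=0.17)
freqs = role_frequencies([toy, toy, toy], top_fraction=0.34)
```

Under this reading, the three counts for a gene at a given threshold need not sum to the number of runs, because a gene may fall outside the thresholded network in some runs.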
Figure 2. The four identified regulators for the IBD data are represented by the largest observed effect size. The effect size of each assessed variable is shown along the y axis and a series of sample sizes along the x axis.
Figure 3. Gene regulatory networks (GRNs) generated from the application of MIDER on the IBD dataset. (a) GRN with all edges selected (no threshold). (b) GRN with selected edges (threshold corresponding to 65% of edges).
Distribution of the posteriors versus observed experimental states for the IBD dataset.
| Genes | Predicted marginal (state 0) | Predicted marginal (state 1) | Observed (state 0) | Observed (state 1) |
|---|---|---|---|---|
|  | 0.55 | 0.45 | 0.55 | 0.45 |
|  | 0.9 | 0.1 | 0.9 | 0.1 |
|  | 0.2 | 0.8 | 0.2 | 0.8 |
|  | 0.5498 | 0.4502 | 0.55 | 0.45 |
|  | 0.9 | 0.1 | 0.9 | 0.1 |
|  | 0.5999 | 0.4001 | 0.6 | 0.4 |
|  | 0.5999 | 0.4001 | 0.6 | 0.4 |
|  | 0.5998 | 0.4002 | 0.6 | 0.4 |
|  | 0.3501 | 0.6499 | 0.35 | 0.65 |
|  | 0.45 | 0.55 | 0.45 | 0.55 |
|  | 0.5 | 0.5 | 0.5 | 0.5 |
|  | 0.75 | 0.25 | 0.75 | 0.25 |
|  | 0.85 | 0.15 | 0.85 | 0.15 |
|  | 0.35 | 0.65 | 0.35 | 0.65 |
|  | 0.25 | 0.75 | 0.25 | 0.75 |
|  | 0.7 | 0.3 | 0.7 | 0.3 |
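The agreement between predicted marginals and observed states listed above can be summarised with a single number. The sketch below computes the largest absolute difference between the two sets of probabilities for three of the tabulated rows. The helper name `max_marginal_deviation` is hypothetical, and this summary statistic is an illustrative choice that the source does not report.

```python
import numpy as np

def max_marginal_deviation(predicted, observed):
    """Largest absolute difference between predicted and observed state probabilities."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    if predicted.shape != observed.shape:
        raise ValueError("predicted and observed must have the same shape")
    return float(np.abs(predicted - observed).max())

# Three rows from the table: columns are state 0 and state 1
predicted = [[0.5498, 0.4502], [0.5999, 0.4001], [0.3501, 0.6499]]
observed = [[0.55, 0.45], [0.60, 0.40], [0.35, 0.65]]
deviation = max_marginal_deviation(predicted, observed)
```

For these rows the predicted and observed values differ only in the fourth decimal place.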
Figure 4. Pearson correlation plots for the IBD dataset.
Figure 5. GRN generated from the application of PLSNET on the pancreatic ductal adenocarcinoma (PDAC) dataset.
Frequencies of different genes appearing as Regulatory (R), Target (T), or Intermediate (I) gene for different threshold values for the PDAC dataset.
| Genes | Top 2% R | Top 2% T | Top 2% I | Top 5% R | Top 5% T | Top 5% I | Top 10% R | Top 10% T | Top 10% I | Top 15% R | Top 15% T | Top 15% I | Top 20% R | Top 20% T | Top 20% I |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | 100 | 0 | 0 | 100 | 0 | 0 | 65 | 0 | 35 | 3 | 0 | 97 | 0 | 0 | 100 |
|  | 63 | 1 | 0 | 6 | 5 | 89 | 0 | 1 | 99 | 0 | 0 | 100 | 0 | 0 | 100 |
|  | 0 | 0 | 0 | 2 | 5 | 0 | 3 | 69 | 24 | 0 | 29 | 71 | 0 | 4 | 96 |
|  | 7 | 0 | 0 | 29 | 27 | 30 | 0 | 4 | 96 | 0 | 0 | 100 | 0 | 0 | 100 |
|  | 0 | 0 | 0 | 0 | 98 | 0 | 0 | 97 | 3 | 0 | 69 | 31 | 0 | 18 | 82 |
|  | 100 | 0 | 0 | 100 | 0 | 0 | 100 | 0 | 0 | 76 | 0 | 24 | 14 | 0 | 86 |
|  | 0 | 0 | 0 | 0 | 8 | 0 | 0 | 100 | 0 | 0 | 94 | 6 | 0 | 80 | 20 |
|  | 0 | 0 | 0 | 21 | 31 | 15 | 0 | 13 | 87 | 0 | 1 | 99 | 0 | 0 | 100 |
|  | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 75 | 0 | 0 | 100 | 0 | 0 | 100 | 0 |
|  | 0 | 0 | 0 | 0 | 12 | 0 | 0 | 98 | 2 | 0 | 49 | 51 | 0 | 17 | 83 |
|  | 0 | 96 | 0 | 0 | 100 | 0 | 0 | 100 | 0 | 0 | 100 | 0 | 0 | 95 | 5 |
|  | 0 | 9 | 0 | 0 | 98 | 0 | 0 | 100 | 0 | 0 | 100 | 0 | 0 | 100 | 0 |
|  | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 78 | 0 | 0 | 100 | 0 | 0 | 100 | 0 |
|  | 0 | 13 | 0 | 0 | 78 | 21 | 0 | 24 | 76 | 0 | 2 | 98 | 0 | 0 | 100 |
|  | 26 | 0 | 0 | 61 | 8 | 27 | 0 | 0 | 100 | 0 | 0 | 100 | 0 | 0 | 100 |
|  | 2 | 0 | 0 | 15 | 0 | 0 | 36 | 11 | 41 | 0 | 4 | 96 | 0 | 0 | 100 |
|  | 0 | 0 | 0 | 0 | 66 | 0 | 0 | 100 | 0 | 0 | 88 | 12 | 0 | 39 | 61 |
|  | 0 | 100 | 0 | 0 | 100 | 0 | 0 | 100 | 0 | 0 | 100 | 0 | 0 | 100 | 0 |
|  | 0 | 100 | 0 | 0 | 100 | 0 | 0 | 100 | 0 | 0 | 100 | 0 | 0 | 100 | 0 |
|  | 0 | 100 | 0 | 0 | 100 | 0 | 0 | 100 | 0 | 0 | 100 | 0 | 0 | 100 | 0 |
Figure 6. GRNs generated from the application of MIDER on the PDAC dataset. (a) GRN with all edges selected (no threshold); (b) GRN with selected edges (using a 95% threshold); (c) GRN with selected edges (combined with the output of PLSNET).
Figure 7. GRNs generated from the application of mutual information distance and entropy reduction (MIDER) on the PDAC dataset. (a) GRN with selected edges (95% threshold); (b) GRN with selected edges (combining the output of MIDER and PLSNET).
Figure 8. GRN generated from the application of MIDER on the acute myeloid leukaemia (AML) dataset.
Figure 9. Predicted protein–protein interactions using the STRING database for the IBD GRN genes. Edges represent interactions between proteins, and multiple edges represent additional sources of evidence. Analysis was performed with STRING v. 10.
Figure 10. Predicted protein–protein interactions using the STRING database for the PDAC GRN genes. Edges represent interactions between proteins, and multiple edges represent additional sources of evidence. Analysis was performed with STRING v. 10.
Figure 11. Reactome Functional Interaction visualization of the IBD dataset. Dashed edges are predicted associations. Directional edges indicate regulation, and T-junction edges represent inhibition.
Figure 12. Reactome Functional Interaction visualization of the PDAC dataset. Dashed edges are predicted associations. Directional edges indicate regulation, and T-junction edges represent inhibition.
Information about all the real-world datasets used in this study. Here, N represents the number of genes used for network inference.
| Author Name | Disease Type | N | Reference |
|---|---|---|---|
| Quraishi et al. | Inflammatory bowel disease | 16 | [ |
| Rajamani et al. | Pancreatic ductal adenocarcinoma | 20 | [ |
| Mills et al. | Acute myeloid leukaemia | 60 | [ |