Faridah Hani Mohamed Salleh, Suhaila Zainudin, Shereena M Arif.
Abstract
Gene regulatory network (GRN) reconstruction is the process of identifying regulatory gene interactions from experimental data through computational analysis. One of the main reasons for the reduced performance of previous GRN methods has been the inaccurate prediction of cascade motifs. A cascade error is the wrong prediction of a cascade motif, in which an indirect interaction is misinterpreted as a direct interaction. Despite active research on GRN prediction methods, specific methods for solving problems related to cascade errors are still rarely discussed. Moreover, the experiments conducted in past studies were not specifically designed to show whether GRN prediction methods can avoid cascade errors. Hence, this research proposes Multiple Linear Regression (MLR) to infer GRNs from gene expression data while avoiding the wrong inference of an indirect interaction (A → B → C) as a direct interaction (A → C). Since the number of observations in the real experimental datasets was far smaller than the number of predictors, some predictors were eliminated by extracting random subnetworks from global interaction networks via an established extraction method. In addition, the experiment was extended to assess the effectiveness of MLR in dealing with cascade errors, using a novel experimental procedure proposed in this work. The experiment revealed that the number of cascade errors was minimal. Apart from that, the Belsley collinearity test showed that multicollinearity greatly affected the datasets used in this experiment. All the tested subnetworks obtained satisfactory results, with AUROC values above 0.5.
Year: 2017 PMID: 28250767 PMCID: PMC5303608 DOI: 10.1155/2017/4827171
Source DB: PubMed Journal: Adv Bioinformatics ISSN: 1687-8027
Figure 1: MLR in the context of GRN. MLR predicts the variations in the response variables from the variations in the predictors.
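The idea that MLR can separate a direct regulator from an indirect one can be illustrated with a small simulation. The sketch below is not the authors' pipeline: the cascade A → B → C, the effect sizes, the noise level, and the helper `fit_mlr` are all assumptions made for illustration. It fits C on A alone and then on A and B jointly using ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000  # number of simulated observations (hypothetical)

# Simulated cascade A -> B -> C; there is no direct A -> C effect.
a = rng.normal(size=n)
b = 2.0 * a + rng.normal(size=n)
c = 3.0 * b + rng.normal(size=n)


def fit_mlr(predictors, response):
    """Ordinary least squares with an intercept; returns the slope coefficients."""
    design = np.column_stack([np.ones(len(response)), *predictors])
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef[1:]  # drop the intercept


# Regressing C on A alone picks up the indirect effect through B.
coef_a_alone = fit_mlr([a], c)[0]

# Regressing C on A and B together attributes the effect to B.
coef_a_joint, coef_b_joint = fit_mlr([a, b], c)

print(f"A alone: {coef_a_alone:.2f}")
print(f"A with B: {coef_a_joint:.2f}, B: {coef_b_joint:.2f}")
```

In this toy setting the coefficient of A is large when B is omitted and close to zero once B is included, which is the behaviour that would keep A → C from being reported as a direct edge. Real expression data are noisier and more collinear, so the separation is unlikely to be this clean.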
Figure 2: Regression analysis methods. The methods listed on the right are the recommended models extracted from Table 1. The methods that have been applied by other researchers are marked with double asterisks (∗∗).
Decision tree.
| No. | Predictors: continuous | Predictors: categorical | Response: continuous | Response: restricted | Response: multivariable | Model: linear | Model: nonlinear | Chosen model type | Recommended model |
|---|---|---|---|---|---|---|---|---|---|
| (1) | | | | | | | | Fitted model coefficients | |
| (2) | | | | | | | | Fitted model and fitted coefficients | |
| (3) | | | | | | | | Fitted generalized linear model coefficients | Generalized linear models |
| (4) | | | | | | | | Fitted nonlinear model coefficients | |
| (5) | | | | | | | | Ridge/LASSO/elastic net regression | |
| (6) | | | | | | | | Fitted model and fitted coefficients | |
| (7) | | | | | | | | Nonparametric model | |
| (8) | | | | | | | | ANOVA | ANOVA |
| (9) | | | | | | | | Fitted multivariate regression model coefficients | |
| (10) | | | | | | | | Fitted mixed-effects model coefficients | Mixed-effects model |
Note: cells with “∗” indicate the type of variables that suit the nature of GRN. The recommended models are marked with asterisks (∗∗).
Figure 3: Cascade motifs in Table 2(b). The dashed lines show the cascade errors.
Figure 4: List of directed edges and cascade motifs. The numbers represent gene names. The dashed arrows represent cascade errors, while the black text represents the cascade motifs.
Results of the experiments that used the datasets in which the cascade motifs have been removed.
| Subnetwork | Number of genes | Number of observations | | AUROC |
|---|---|---|---|---|
| Subnetwork A size 415 | 415 | 466 | <0.05 | 0.6860 |
| Subnetwork A size 415 | 415 | 466 | <0.04 | 0.6795 |
| Subnetwork A size 415 | 415 | 907 | <0.05 | 0.6622 |
| Subnetwork B size 893 | 893 | 907 | <0.05 | 0.5022 |
| Subnetwork C size 871 | 871 | 907 | <0.05 | 0.5081 |
Characteristics of the datasets tested in the experiment and the AUROC results.
| | Set 1 | Set 2 | Set 3 |
|---|---|---|---|
| Number of | 363 | 360 | 160 |
| Number of | 397 | 533 | 255 |
| Total number of tested genes | 760 | 893 | 415 |
| Total number of possible edges | 576,840 | 796,556 | 171,810 |
| Total number of correct predictions (CORRECT_PRED) | 119 | 10 | 825 |
| Total number of incorrect predictions | 27,526 | 253 | 170,985 |
| AUROC | 0.511 | 0.502 | 0.662 |
Note:
(1) "Cascade motif genes" (italic text) refers to the genes themselves, not to the cascade motifs.
(2) The number of cascade motif genes in the tested datasets is obtained by comparing the cascade motif genes with the genes in the datasets.
(3) Total number of possible edges = Total number of tested genes × (Total number of tested genes – 1).
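The formula in note (3) counts every ordered pair of distinct genes, so self-loops are excluded. A minimal sketch that applies it to the gene counts in the table above (the function name `possible_edges` is chosen here for illustration):

```python
def possible_edges(n_genes: int) -> int:
    """Number of directed edges between distinct genes: n * (n - 1)."""
    return n_genes * (n_genes - 1)


# Gene counts taken from the table above.
tested_genes = {"Set 1": 760, "Set 2": 893, "Set 3": 415}

for name, n in tested_genes.items():
    print(name, possible_edges(n))
```

The results agree with the "Total number of possible edges" row: 576,840, 796,556, and 171,810.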
AUROC of selected methods on the M3D datasets of E. coli.
| Methods | References | M3D | Experimental data |
|---|---|---|---|
| ANOVA | [ | 0.798 | One whole network of |
| Genie3 | [ | 0.673 | |
| Pearson | [ | 0.646 | |
| MRNet | [ | 0.645 | |
| CLR | [ | 0.642 | |
| ARACNe | [ | 0.635 | |
| MLR | This article | 0.558 | Predetermined subnetworks that consist of expression data with added cascade motifs |
Note: the results marked with ∗ are reported by [10].
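For readers unfamiliar with the metric, AUROC can be read as the probability that a randomly chosen true edge receives a higher score than a randomly chosen non-edge, so 0.5 corresponds to random ranking. The sketch below computes it by direct pairwise comparison. It is an illustrative implementation, not the evaluation code used to produce the table, and the function name `auroc` and the toy scores are assumptions.

```python
import numpy as np


def auroc(scores, labels):
    """AUROC via pairwise comparison of positive and negative scores.

    scores: predicted edge scores (higher means more likely an edge)
    labels: 1/True for edges in the gold standard, 0/False otherwise
    Ties between a positive and a negative count as half a win.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos = scores[labels]
    neg = scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one positive and one negative label")
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))


# Toy example: both true edges are ranked above both non-edges.
print(auroc([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0]))
```

The pairwise matrix needs memory proportional to the number of positives times the number of negatives, so for networks with hundreds of thousands of candidate edges a rank-based formulation would be the more practical choice.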
CI and the level of collinearity [61].
| Condition index (CI) | Collinearity |
|---|---|
| 5 < CI < 10 | Weak |
| 30 < CI < 100 | Moderate to strong |
| CI > 100 | Severe |
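As a rough illustration of how condition indices are obtained and read against the table above, the sketch below scales each column of a data matrix to unit length, takes the singular values, and divides the largest singular value by each of them. This follows the usual description of Belsley's diagnostic, but it is a sketch under assumptions: the function names and the simulated matrix are hypothetical, and values that fall outside the ranges listed in the table are reported as not covered rather than guessed.

```python
import numpy as np


def condition_indices(X):
    """Condition indices of X after scaling each column to unit length."""
    X = np.asarray(X, dtype=float)
    scaled = X / np.linalg.norm(X, axis=0)
    s = np.linalg.svd(scaled, compute_uv=False)
    return s.max() / s  # exactly singular matrices would give inf here


def collinearity_level(ci):
    """Map a condition index to the levels listed in the table above."""
    if 5 < ci < 10:
        return "weak"
    if 30 < ci < 100:
        return "moderate to strong"
    if ci > 100:
        return "severe"
    return "not covered by the table"


# Hypothetical data: the second column is almost a copy of the first.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + 1e-4 * rng.normal(size=200)
ci = condition_indices(np.column_stack([x1, x2]))

print(ci.max(), collinearity_level(ci.max()))
```

Two nearly identical columns give a very large maximum condition index, which the table classifies as severe collinearity.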
The condition indices (CIs) and variance decomposition proportions (VDPs) of four genes from Set 3, as an example of the data generated by the diagnostic test.
| condIdx | aaeA_b3241_14 | aceA_b4015_15 | aceE_b0114_15 | aceF_b0115_15 |
|---|---|---|---|---|
| … | ||||
| 168.1754 | 0 | 0.0003 | 0 | 0 |
| 172.9103 | 0 | 0.0001 | 0.0001 | 0.0001 |
| 176.3094 | 0 | 0.0001 | 0.0001 | 0.0001 |
| 176.8486 | 0 | 0.0002 | 0 | 0 |
| 182.4254 | 0 | 0.0002 | 0 | 0 |
| … |
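The layout of the table above (one row per condition index, one column per gene) can be reproduced with a short computation. The sketch below derives variance decomposition proportions from the singular value decomposition of the column-scaled data matrix. It is an illustration on simulated data, not the diagnostic run that generated the table, and the function name `variance_decomposition` and the three simulated variables are assumptions.

```python
import numpy as np


def variance_decomposition(X):
    """Return (condition indices, VDP matrix).

    The VDP matrix has one row per condition index and one column per
    variable; each column sums to 1.
    """
    X = np.asarray(X, dtype=float)
    scaled = X / np.linalg.norm(X, axis=0)
    _, s, vt = np.linalg.svd(scaled, full_matrices=False)
    # phi[k, j]: contribution of component j to the variance of coefficient k
    phi = (vt.T ** 2) / s ** 2
    vdp = phi / phi.sum(axis=1, keepdims=True)
    return s.max() / s, vdp.T


# Hypothetical data: x2 nearly duplicates x1, x3 is unrelated.
rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + 1e-3 * rng.normal(size=200)
x3 = rng.normal(size=200)
cond_idx, vdp = variance_decomposition(np.column_stack([x1, x2, x3]))

worst = np.argmax(cond_idx)  # row with the largest condition index
print(cond_idx[worst], vdp[worst])
```

On the row with the largest condition index, the proportions for `x1` and `x2` are close to 1 while the proportion for `x3` stays small, which is how such a table points to the variables involved in a near dependency.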

| | |
|---|---|
| hupB | tyrP |
| crp | hupA |
| narL | dmsB |
| narL | dmsC |
| ihfA | ompR |
| ihfB | ompR |
| ompR | fadL |
| ompR | bolA |
| ompR | ompC |
| ompR | ompF |

| | |
|---|---|
| ihfA | ompR |
| ihfB | ompR |
| ompR | fadL |
| ompR | bolA |
| ompR | ompC |
| ompR | ompF |
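The gene pairs listed above can be used to show how cascade motifs, and the cascade errors they can cause, are enumerated. The column headers of the two tables did not survive, so the sketch below assumes that each row reads as regulator → target; that reading, the function names, and the example predictions are assumptions made for illustration. For every chain A → B → C in the gold standard, the pair (A, C) is a potential cascade error if A → C is not itself a true edge.

```python
from collections import defaultdict

# Gene pairs from the first table above, read as (regulator, target).
# The direction is an assumption; the original column headers are missing.
true_edges = [
    ("hupB", "tyrP"), ("crp", "hupA"), ("narL", "dmsB"), ("narL", "dmsC"),
    ("ihfA", "ompR"), ("ihfB", "ompR"), ("ompR", "fadL"), ("ompR", "bolA"),
    ("ompR", "ompC"), ("ompR", "ompF"),
]


def cascade_shortcuts(edges):
    """All (A, C) with A -> B -> C in edges and no direct A -> C edge."""
    edge_set = set(edges)
    targets = defaultdict(set)
    for src, dst in edge_set:
        targets[src].add(dst)
    shortcuts = set()
    for a, b in edge_set:
        for c in targets.get(b, ()):
            if c != a and (a, c) not in edge_set:
                shortcuts.add((a, c))
    return shortcuts


def cascade_errors(predicted_edges, edges):
    """Predicted edges that are really indirect A -> B -> C interactions."""
    return set(predicted_edges) & cascade_shortcuts(edges)


shortcuts = cascade_shortcuts(true_edges)
print(len(shortcuts), sorted(shortcuts))

# Hypothetical predictions: one cascade error, one true edge, one unrelated edge.
predicted = [("ihfA", "ompC"), ("ompR", "ompC"), ("crp", "tyrP")]
print(cascade_errors(predicted, true_edges))
```

Under the assumed direction, only the six pairs involving ompR form chains, which matches the rows repeated in the second table; the pairs involving hupB, crp, and narL do not take part in any cascade.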
| Case | Total | Total number of “cascade motifs” that match with GS | Multiple Linear Regression | Percentage of cascade motifs in datasets |
|---|---|---|---|---|
| Set 1 | | | | |
| gadE | 105 | 29 | 3 | 0.16% |
| csgD | 41 | 12 | 0 | |
| arcA | 157 | 77 | 0 | |
| gadX | 216 | 53 | 3 | |
| dcuR | 21 | 15 | 0 | |
| marA | 150 | 40 | 1 | |
| fis | 658 | 173 | 3 | |
| Set 2 | | | | |
| gadE | 105 | 29 | 0 | 0.12% |
| csgD | 41 | 12 | 0 | |
| arcA | 157 | 77 | 0 | |
| gadX | 216 | 53 | 0 | |
| dcuR | 21 | 15 | 0 | |
| marA | 150 | 40 | 0 | |
| fis | 658 | 173 | 0 | |
| Set 3 | | | | |
| gadE | 105 | 29 | 3 | 0.54% |
| csgD | 41 | 12 | 0 | |
| arcA | 157 | 77 | 14 | |
| gadX | 216 | 53 | 8 | |
| dcuR | 21 | 15 | 5 | |
| marA | 150 | 40 | 7 | |
| fis | 658 | 173 | 57 | |
Note:
(1) Percentage of cascade motifs in datasets = ((Total cascade motifs – Total TRUE_CASCADE) / Total number of possible edges) × 100.
(2) Refer to Table 4 for the total number of possible edges.
(3) Cascade motif is defined as A → C for the case of A → B → C.
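The percentage in note (1) can be checked with a few lines of arithmetic. The sketch below assumes that "Total cascade motifs" is the sum of the "Total" column and that "Total TRUE_CASCADE" is the sum of the column of cascade motifs that match the gold standard (GS). The source does not state this mapping explicitly, so treat it as an assumption. The function name is also chosen here for illustration.

```python
def cascade_motif_percentage(total_cascade_motifs, total_true_cascade,
                             total_possible_edges):
    """((total cascade motifs - TRUE_CASCADE) / possible edges) * 100."""
    return (total_cascade_motifs - total_true_cascade) / total_possible_edges * 100


# Per-gene values from the table above (gadE, csgD, arcA, gadX, dcuR, marA, fis).
totals = [105, 41, 157, 216, 21, 150, 658]
# Assumed to correspond to TRUE_CASCADE.
matches_gs = [29, 12, 77, 53, 15, 40, 173]

set1 = cascade_motif_percentage(sum(totals), sum(matches_gs), 576_840)
set2 = cascade_motif_percentage(sum(totals), sum(matches_gs), 796_556)
print(f"Set 1: {set1:.2f}%  Set 2: {set2:.2f}%")
```

Under this assumption, the computed values for Set 1 and Set 2 round to the 0.16% and 0.12% reported in the table.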