Samarendra Das1, Prabina Kumar Meher1, Anil Rai2, Lal Mohan Bhar1, Baidya Nath Mandal3.
Abstract
Selection of informative genes is an important problem in gene expression studies. The small sample size and the large number of genes in gene expression data make the selection process complex. Further, the selected informative genes may act as a vital input for gene co-expression network analysis. Moreover, the identification of hub genes and module interactions in gene co-expression networks is yet to be fully explored. This paper presents a statistically sound gene selection technique based on the support vector machine algorithm for selecting informative genes from high-dimensional gene expression data. A statistical approach has also been developed for identifying hub genes in the gene co-expression network. In addition, a differential hub gene analysis approach has been developed to group the identified hub genes based on their gene connectivity in a case vs. control study. Based on this proposed approach, an R package, dhga (https://cran.r-project.org/web/packages/dhga), has been developed. The comparative performance of the proposed gene selection technique and the hub gene identification approach was evaluated on three different crop microarray datasets. The proposed gene selection technique outperformed most of the existing techniques in selecting a robust set of informative genes. The proposed hub gene identification approach identified fewer hub genes than the existing approach, which is in accordance with the scale-free property of real networks. In this study, some key genes along with their Arabidopsis orthologs have been reported, which can be used for engineering the aluminum toxic stress response in soybean. The functional analysis of various selected key genes revealed the underlying molecular mechanisms of the aluminum toxic stress response in soybean.
Year: 2017 PMID: 28056073 PMCID: PMC5215982 DOI: 10.1371/journal.pone.0169605
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Microarray studies used in comparative analysis.
| Data description | # GEO series | # Genes | # Samples | # Classes |
|---|---|---|---|---|
| Salt stress in Rice | 6 | 6637 | 70 | 2 (stress: 1 and control: -1) |
| Cold stress in Rice | 5 | 8839 | 100 | 2 (stress: 1 and control: -1) |
| Al stress in Soybean | 4 | 15510 | 68 | 2 (stress: 1 and control: -1) |
# GEO series: Number of GEO series; # Genes: Number of genes; # Samples: Total number of microarray samples; # Classes: Number of classes
Decision matrix for differential hub gene analysis.
| Sl. No. | Stress Condition | Control Condition | Descriptions |
|---|---|---|---|
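| 1 | p-value ≤ α | p-value ≤ α | Housekeeping hub gene |
| 2 | p-value ≤ α | p-value > α | Unique hub gene for stress condition |
| 3 | p-value > α | p-value ≤ α | Unique hub gene for control condition |
| 4 | p-value > α | p-value > α | Not a hub gene |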
p-value: Obtained statistical hub gene significance value; α: Desired level of statistical significance
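The decision matrix above can be sketched as a small function. This is only an illustration of the rule, not the implementation in the dhga package: the function name `classify_hub`, the default α of 0.05, and the choice to treat a p-value exactly equal to α as significant are all assumptions made for the example.

```python
def classify_hub(p_stress: float, p_control: float, alpha: float = 0.05) -> str:
    """Assign a gene to one of the four groups of the decision matrix.

    p_stress / p_control: hub gene significance values in each condition.
    alpha: desired level of statistical significance (default is an assumption).
    """
    # Assumption: a p-value equal to alpha counts as significant.
    hub_in_stress = p_stress <= alpha
    hub_in_control = p_control <= alpha

    if hub_in_stress and hub_in_control:
        return "Housekeeping hub gene"
    if hub_in_stress:
        return "Unique hub gene for stress condition"
    if hub_in_control:
        return "Unique hub gene for control condition"
    return "Not a hub gene"
```

For example, `classify_hub(1e-12, 0.3, alpha=1e-10)` places a gene that is significant only under stress in the "Unique hub gene for stress condition" group.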
Comparison of Boot-SVM-RFE with other competitive algorithms for different sliding window sizes.
| WS | Boot-SVM-RFE CA | Boot-SVM-RFE CV | SVM-RFE CA | SVM-RFE CV | t-Score CA | t-Score CV | F-Score CA | F-Score CV | IG CA | IG CV | RF CA | RF CV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Aluminum stress gene expression data in Soybean | ||||||||||||
| 50 | 95.629 | 2.622 | 93.421 | 2.778 | 89.127 | 5.342 | 89.820 | 4.081 | 90.859 | 3.146 | 92.105 | 3.719 |
| 100 | 96.199 | 2.926 | 92.249 | 4.297 | 90.789 | 4.008 | 91.667 | 3.303 | 92.251 | 3.910 | 92.471 | 3.929 |
| 150 | 96.279 | 3.020 | 94.362 | 3.215 | 90.480 | 2.386 | 91.950 | 3.501 | 92.337 | 4.341 | 93.040 | 2.889 |
| 200 | 97.724 | 2.182 | 96.135 | 2.619 | 90.378 | 3.748 | 91.776 | 3.608 | 94.408 | 4.594 | 93.572 | 2.584 |
| 250 | 96.737 | 2.356 | 93.544 | 2.905 | 91.404 | 2.767 | 91.667 | 4.260 | 93.070 | 2.417 | 93.860 | 3.461 |
| 300 | 97.086 | 2.203 | 95.335 | 2.770 | 91.635 | 3.845 | 91.447 | 3.775 | 94.549 | 3.489 | 95.771 | 2.861 |
| 350 | 97.862 | 2.606 | 97.470 | 2.431 | 91.397 | 4.904 | 92.915 | 4.150 | 94.737 | 4.049 | 94.737 | 3.586 |
| 400 | 97.930 | 1.842 | 97.368 | 1.911 | 92.982 | 2.031 | 93.311 | 2.974 | 94.627 | 3.998 | 95.724 | 2.563 |
| 450 | 97.249 | 2.599 | 97.129 | 2.332 | 93.062 | 2.009 | 92.943 | 3.541 | 95.096 | 2.239 | 95.813 | 2.934 |
| 500 | 97.763 | 2.011 | 97.632 | 2.273 | 93.289 | 3.669 | 93.421 | 3.814 | 94.342 | 4.314 | 96.316 | 3.075 |
| Mean | 97.046 | | 95.464 | | 91.454 | | 92.092 | | 93.627 | | 94.340 | |
| Salinity stress gene expression data in Rice | ||||||||||||
| 50 | 97.218 | 1.927 | 94.015 | 3.382 | 90.000 | 3.346 | 93.684 | 4.498 | 90.150 | 5.200 | 93.684 | 2.401 |
| 100 | 98.175 | 1.203 | 96.984 | 1.742 | 92.778 | 2.613 | 94.444 | 2.690 | 92.222 | 3.242 | 94.841 | 2.375 |
| 150 | 98.319 | 0.924 | 95.731 | 1.402 | 92.773 | 3.054 | 95.378 | 1.874 | 93.697 | 2.474 | 95.462 | 2.065 |
| 200 | 98.482 | 0.832 | 96.786 | 2.052 | 93.571 | 2.493 | 95.804 | 2.071 | 93.304 | 1.651 | 95.446 | 2.363 |
| 250 | 98.190 | 1.162 | 97.810 | 1.218 | 93.333 | 2.432 | 96.286 | 2.157 | 93.333 | 2.432 | 95.905 | 1.856 |
| 300 | 98.265 | 0.842 | 97.449 | 1.742 | 94.490 | 3.015 | 96.653 | 1.244 | 93.265 | 2.118 | 96.327 | 1.813 |
| 350 | 98.352 | 0.545 | 96.923 | 1.455 | 95.055 | 1.693 | 96.692 | 1.419 | 93.187 | 1.421 | 96.154 | 1.407 |
| 400 | 98.571 | 0.000 | 96.619 | 1.151 | 94.167 | 2.543 | 97.143 | 1.659 | 94.286 | 2.238 | 95.952 | 1.533 |
| 450 | 98.571 | 0.000 | 97.273 | 1.386 | 93.636 | 2.399 | 97.922 | 1.197 | 94.416 | 1.258 | 95.714 | 2.111 |
| 500 | 97.000 | 1.465 | 96.857 | 1.942 | 95.000 | 2.270 | 97.000 | 2.018 | 94.286 | 1.428 | 95.286 | 1.742 |
| Mean | 98.114 | | 96.645 | | 93.480 | | 96.101 | | 93.215 | | 95.477 | |
| Cold stress gene expression data in Rice | ||||||||||||
| 50 | 96.328 | 1.830 | 94.947 | 2.031 | 94.000 | 1.701 | 94.579 | 2.153 | 94.526 | 2.322 | 94.526 | 2.221 |
| 100 | 97.175 | 1.387 | 95.778 | 2.043 | 94.333 | 2.356 | 95.889 | 1.820 | 95.722 | 2.209 | 95.611 | 2.224 |
| 150 | 97.507 | 0.932 | 96.471 | 1.762 | 94.235 | 2.236 | 95.235 | 1.983 | 95.824 | 1.760 | 96.294 | 2.080 |
| 200 | 98.482 | 0.832 | 97.000 | 1.304 | 95.500 | 1.622 | 95.875 | 1.861 | 96.250 | 1.615 | 97.375 | 2.368 |
| 250 | 98.190 | 1.162 | 96.067 | 1.906 | 95.333 | 1.969 | 95.933 | 1.446 | 96.333 | 1.472 | 96.267 | 2.051 |
| 300 | 98.265 | 0.842 | 96.000 | 1.634 | 95.786 | 1.487 | 96.014 | 1.935 | 96.143 | 1.255 | 96.643 | 1.855 |
| 350 | 96.785 | 0.554 | 96.923 | 1.296 | 95.923 | 1.163 | 96.062 | 1.247 | 96.154 | 1.432 | 97.923 | 1.742 |
| 400 | 98.881 | 0.687 | 95.567 | 2.027 | 95.667 | 1.433 | 96.667 | 1.273 | 96.000 | 1.740 | 97.333 | 1.752 |
| 450 | 98.777 | 0.383 | 95.545 | 1.432 | 95.818 | 1.671 | 95.909 | 1.185 | 97.727 | 1.033 | 97.545 | 1.855 |
| 500 | 97.679 | 1.454 | 96.700 | 1.545 | 94.500 | 1.433 | 95.100 | 1.353 | 97.300 | 1.078 | 97.300 | 1.594 |
| Mean | 97.807 | | 96.100 | | 95.110 | | 95.726 | | 96.197 | | 96.681 | |
Boot-SVM-RFE: Bootstrap SVM-RFE; RF: Random forest; IG: Information gain measure; WS: Sliding window size; CA: Classification accuracy; CV: Coefficient of variation in CA; Mean: Mean CA over all window sizes
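As a quick check of how the Mean row is formed, the mean CA can be recomputed from the ten window sizes. The sketch below uses the Boot-SVM-RFE CA values for the soybean data as listed in the table. The `coefficient_of_variation` helper is a hypothetical illustration that assumes CV is the sample standard deviation expressed as a percentage of the mean, applied to repeated accuracy estimates at a single window size; the table does not state the exact formula.

```python
from statistics import mean, stdev

# Boot-SVM-RFE classification accuracy (CA) for soybean, window sizes 50..500,
# copied from the table above.
boot_svm_rfe_ca = [95.629, 96.199, 96.279, 97.724, 96.737,
                   97.086, 97.862, 97.930, 97.249, 97.763]


def coefficient_of_variation(values):
    """Assumed definition: 100 * sample standard deviation / mean."""
    return 100.0 * stdev(values) / mean(values)


# Mean CA across window sizes, rounded to the table's three decimals.
mean_ca = round(mean(boot_svm_rfe_ca), 3)
print(mean_ca)
```

The recomputed mean agrees with the 97.046 reported in the Mean row for Boot-SVM-RFE on the soybean data.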
Fig 1. Gene selection plot for selection of informative genes for Al stress in soybean.
The horizontal axis represents the negative logarithm of the statistical significance values (p-values) obtained from Boot-SVM-RFE. The vertical axis shows the negative logarithm of the p-values from the t-test. Green dots indicate selected probes, i.e., those with -log(p-value) ≥ 2.5 from Boot-SVM-RFE and -log(p-value) ≥ 4 from the t-test. Red stars indicate the selected probes that have Arabidopsis orthologs. Blue dots indicate unselected probes.
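The selection rule described in this caption amounts to a joint threshold on two transformed p-values. A minimal sketch follows, assuming a base-10 logarithm (the caption does not state the base); the function name `is_selected` and its parameter names are hypothetical.

```python
import math


def is_selected(p_boot: float, p_ttest: float,
                boot_threshold: float = 2.5,
                ttest_threshold: float = 4.0) -> bool:
    """Return True if a probe passes both -log(p-value) thresholds.

    p_boot:  p-value from Boot-SVM-RFE (must be > 0).
    p_ttest: p-value from the t-test (must be > 0).
    Assumption: the logarithm is base 10.
    """
    return (-math.log10(p_boot) >= boot_threshold
            and -math.log10(p_ttest) >= ttest_threshold)
```

Under this assumption, a probe with p-values of 1e-3 (Boot-SVM-RFE) and 1e-5 (t-test) would be selected, while one with 1e-2 and 1e-5 would not, because its Boot-SVM-RFE score of 2 falls below 2.5.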
Fig 2. Functional enrichment analysis of selected genes and hub genes under Al stress.
The GO term enrichment analysis of the 981 selected informative genes (A) and the hub genes (B) for the Al stress condition, performed with agriGO, is shown for different gene ontology categories (CC, MF and BP). For (A), GO terms with p-values < 0.008 and false discovery rate (FDR) values < 0.6 were chosen. For (B), GO terms with p-values < 0.1 and FDR values < 0.8 were chosen.
Fig 3. Clustering dendrogram of selected genes and gene modules under Al stress and control condition.
The correspondence between Consensus Modules (CM) with modules under Stress (SM) (A) and control (NM) (B) conditions is represented.
List of gene modules along with their gene and hub gene memberships under Al stress condition.
| SN | Module | G | AO | HG | UHG | Molecular Functions |
|---|---|---|---|---|---|---|
| 1 | Black | 40 | 25 | 11 | 4 | Monooxygenase activity, iron ion binding, heme binding, tetrapyrrole binding, oxidoreductase activity, cation binding, ion binding, transition metal ion binding |
| 2 | Blue | 137 | 68 | 0 | 0 | Protein kinase activity, kinase activity, phosphotransferase activity, alcohol group as acceptor |
| 3 | Brown | 100 | 68 | 38 | 32 | Iron ion binding, hydrolase activity, acting on ester bonds, metal ion binding, cation binding, ion binding, transcription factor activity, DNA binding, protein kinase activity, phosphotransferase activity, transition metal ion binding, oxidoreductase activity, kinase activity |
| 4 | Cyan | 29 | 23 | 0 | 0 | Metal ion binding, cation binding, ion binding, transition metal ion binding, nucleic acid binding |
| 5 | Green | 58 | 32 | 0 | 0 | Protein kinase activity, phosphotransferase activity, protein serine/threonine kinase activity, protein tyrosine kinase activity, kinase activity |
| 6 | Green-yellow | 33 | 17 | 3 | 3 | Unknown |
| 7 | Grey | 9 | 4 | 0 | 0 | Unknown |
| 8 | Grey60 | 21 | 11 | 0 | 0 | Binding |
| 9 | Light cyan | 23 | 13 | 1 | 1 | Binding |
| 10 | Light-green | 16 | 11 | 0 | 0 | Catalytic activity |
| 11 | Magenta | 35 | 16 | 7 | 6 | Hydrolase activity |
| 12 | Midnight-blue | 24 | 11 | 5 | 3 | Catalytic activity, binding |
| 13 | Pink | 37 | 18 | 0 | 0 | Nucleotide binding, ATP binding, adenyl ribonucleotide binding, purine nucleoside binding, nucleoside binding, adenyl nucleotide binding |
| 14 | Purple | 34 | 20 | 0 | 0 | Adenyl ribonucleotide binding, adenyl nucleotide binding, purine nucleoside binding, nucleoside binding, purine ribonucleotide binding, ribonucleotide binding, nucleotide binding |
| 15 | Red | 54 | 28 | 20 | 2 | Oxidoreductase activity |
| 16 | Salmon | 31 | 15 | 0 | 0 | Hydrolase activity, nucleotide binding |
| 17 | Tan | 31 | 19 | 12 | 8 | Hydrolase activity |
| 18 | Turquoise | 185 | 106 | 86 | 45 | Primary active transmembrane transporter activity, zinc ion binding, protein kinase activity, ATPase activity, cation transmembrane transporter activity, transition metal ion binding, metal ion binding, active transmembrane transporter activity, phosphotransferase activity, ATPase activity, cation binding, ion binding, ion transmembrane transporter activity, transferase activity, kinase activity |
| 19 | Yellow | 84 | 49 | 45 | 26 | Oxidoreductase activity |
| | Total | 981 | 554 | 228 | 130 | |
SN: Serial number of module; Module: Module represented by colour (the grey module contains genes that do not belong to any module); G: Number of genes belonging to each module; AO: Number of genes in each module with Arabidopsis orthologs; HG: Number of hub genes belonging to each module; UHG: Number of hub genes unique to stress
Fig 4. Module interaction network for gene modules under Al stress.
The network consists of 19 nodes and 70 edges (regulatory relations). To remove weak interactions among the modules, the threshold for the posterior probability was fixed at 0.2.
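Thresholding on posterior probability can be sketched as a simple edge filter. The function name `filter_module_edges`, the tuple layout, and the decision to keep an edge exactly at the threshold are assumptions for illustration; the probabilities in the example are hypothetical and are not values from the study.

```python
def filter_module_edges(edges, threshold=0.2):
    """Keep edges whose posterior probability reaches the threshold.

    edges: iterable of (source_module, target_module, posterior_probability).
    Assumption: an edge exactly at the threshold is kept.
    """
    return [(src, dst, prob) for src, dst, prob in edges if prob >= threshold]


# Hypothetical probabilities for illustration only; not values from the study.
candidate_edges = [
    ("Turquoise", "Yellow", 0.85),
    ("Brown", "Red", 0.20),
    ("Tan", "Magenta", 0.05),
]
kept_edges = filter_module_edges(candidate_edges)
```

With these made-up inputs, the Tan to Magenta edge is dropped as a weak interaction and the other two are retained.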
Comparison of proposed and existing approach in terms of predicted hub genes.
| Data sets | Existing approach # HG | Existing approach % HG | Proposed approach (p = 1E-5) # HG | Proposed approach (p = 1E-5) % HG | Proposed approach (p = 1E-10) # HG | Proposed approach (p = 1E-10) % HG |
|---|---|---|---|---|---|---|
| Salinity stress in rice | ||||||
| Rice (Salinity stress) | 214 | 38.49 | 187 | 33.63 | 165 | 29.66 |
| Rice (Control) | 229 | 41.19 | 208 | 37.41 | 180 | 32.36 |
| Al stress in soybean | ||||||
| Soybean (Al stress) | 383 | 39.05 | 331 | 33.74 | 228 | 23.24 |
| Soybean (Control) | 362 | 36.91 | 285 | 29.05 | 187 | 19.14 |
| Cold stress in rice | ||||||
| Rice (Cold stress) | 301 | 46.3 | 265 | 40.7 | 234 | 36 |
| Rice (Control) | 242 | 37.23 | 208 | 32 | 162 | 24.09 |
# HG: Number of hub genes;
% HG: Percentage of hub genes in the gene co-expression network;
Two p-value thresholds were used: 1E-5 and 1E-10
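The % HG column expresses the hub gene count as a share of all genes in the network. A minimal sketch of that arithmetic follows, assuming the rice salinity network contains 556 genes (the total reported for that data set in the table of hub gene groups); the function name `percent_hub_genes` is hypothetical.

```python
def percent_hub_genes(n_hub: int, n_total: int) -> float:
    """Percentage of hub genes in a gene co-expression network, to 2 decimals."""
    return round(100.0 * n_hub / n_total, 2)


# Rice (Salinity stress), existing approach: 214 hub genes.
# Assumption: the network holds 556 genes in total.
pct = percent_hub_genes(214, 556)
print(pct)
```

Under that assumption the result matches the 38.49 listed for the existing approach on the rice salinity stress network.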
Fig 5. Distribution of WGS in complete networks under stress and control conditions.
The distributions of WGS of genes in GCNs for Al stress (A) and control (B) conditions in soybean are shown. The distributions of WGS of genes in GCNs for salinity stress (C) and control (D) conditions in rice are shown. For all these cases, the distributions are heavy tailed.
Fig 6. Distribution of p-values under stress and control conditions.
The distributions of p-values of genes in GCNs for Al stress (A) and control (B) conditions in soybean are shown. The distributions of p-values of genes in GCNs for salinity stress (C) and control (D) conditions in rice are shown. Genes with low p-values represent highly interacting genes in the GCN.
Groups of hub genes predicted using DHGA approach.
| Data | # Housekeeping hub | # UHG stress | # UHG control | # Non hub | # Total genes |
|---|---|---|---|---|---|
| Soybean (Al stress) | 98 | 130 | 89 | 566 | 981 |
| Rice (Salt stress) | 141 | 46 | 67 | 161 | 556 |
| Rice (Cold stress) | 124 | 141 | 84 | 177 | 650 |
# Housekeeping hub: Number of hub genes common to stress and control; # UHG stress: Number of hub genes unique to stress; # UHG control: Number of hub genes unique to control; # Non hub: Number of genes that are not hub genes in the GCN; # Total genes: Total number of genes in the GCN
Fig 7. Gene co-expression networks (GCNs) for two differential conditions in soybean.
The GCNs are constructed for the Al stress (A) and control (B) conditions, respectively. Red nodes represent the housekeeping hub genes, green nodes the unique hub genes (UHG), and blue nodes the non-hub genes. (C) Venn diagram of the hub genes in the GCNs constructed under the Al stress (A) and control (B) conditions in soybean. The number of orthologous genes found in Arabidopsis corresponding to the unique and common hub genes in soybean is also shown.