Chihyun Park, JungRim Kim, Jeongwoo Kim, Sanghyun Park.
Abstract
The identification of disease-related genes and disease mechanisms is an important research goal; many studies have approached this problem by analysing genetic networks based on gene expression profiles and interaction datasets. To construct a gene network, correlations or associations among pairs of genes must be obtained. However, when gene expression data are heterogeneous with high levels of noise for samples assigned to the same condition, it is difficult to accurately determine whether a gene pair represents a significant gene-gene interaction (GGI). In order to solve this problem, we proposed a random forest-based method to classify significant GGIs from gene expression data. To train the model, we defined novel feature sets and utilised various high-confidence interactome datasets to deduce the correct answer set from known disease-specific genes. Using Alzheimer's disease data, the proposed method showed remarkable accuracy, and the GGIs established in the analysis can be used to build a meaningful genetic network that can explain the mechanisms underlying Alzheimer's disease.
Year: 2018 PMID: 30048494 PMCID: PMC6062065 DOI: 10.1371/journal.pone.0201056
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Visualisation of expression levels for four genes according to their class label (Normal and AD).
Four genes were divided into two groups, i.e. AD-unrelated and -related groups.
Basic statistics and PCC values for four cases shown in Fig 1.
The correlation values for AD-related genes were larger in magnitude than those for AD-unrelated genes. However, even for AD-related genes, the values were not large enough to determine reliably whether the gene pair is correlated in AD.
| Gene group | Case (see Fig 1) | Class label | Gene | Mean of expression values | Standard deviation of expression values | PCC of two expression lists |
|---|---|---|---|---|---|---|
| AD-unrelated genes | (A) | Normal | | 0.099 | 0.284 | 0.021 |
| | | | | -0.882 | 1.207 | |
| | (B) | AD | | -0.070 | 0.304 | -0.080 |
| | | | | -0.832 | 1.094 | |
| AD-related genes | (C) | Normal | | 0.359 | 1.475 | -0.590 |
| | | | | -0.135 | 0.952 | |
| | (D) | AD | | 0.997 | 1.369 | -0.280 |
| | | | | -0.817 | 0.906 | |
Fig 2. Overview of the proposed approach.
Gene expression data with two class labels are normalized by the z-scoring approach. For class label 1, which indicates disease, possible gene pairs are selected by incorporating disease-related genes and interactome data. For class label 0, which indicates normal, the same number of gene pairs as that for class label 1 is randomly selected. From all gene pairs, 22 features are extracted and used to inform the machine learning-based model. In order to evaluate performance, 10-fold cross validation is performed.
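The preprocessing and sampling steps in this overview can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes z-scoring is applied per gene across samples (the caption does not state the axis), it assumes that randomly drawn class-0 pairs exclude the class-1 pairs, and the helper names `zscore`, `sample_negative_pairs`, and `k_fold_indices` as well as the gene IDs are hypothetical. The classifier itself is omitted.

```python
import random
import statistics
from itertools import combinations


def zscore(values):
    """Z-score one gene's expression values (assumed: per gene, across samples)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [(v - mean) / sd for v in values]


def sample_negative_pairs(genes, positive_pairs, rng):
    """Randomly draw as many gene pairs as there are positive pairs.

    Assumption: positives are excluded from the candidate pool, so the two
    classes are disjoint and of equal size.
    """
    positives = {frozenset(p) for p in positive_pairs}
    candidates = [p for p in combinations(sorted(genes), 2)
                  if frozenset(p) not in positives]
    return rng.sample(candidates, len(positives))


def k_fold_indices(n_items, k, rng):
    """Shuffle item indices and split them into k near-equal folds."""
    indices = list(range(n_items))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


rng = random.Random(0)                            # fixed seed for repeatability
genes = [f"G{i}" for i in range(1, 9)]            # hypothetical gene IDs
positive_pairs = [("G1", "G2"), ("G2", "G3"), ("G4", "G7"),
                  ("G5", "G6"), ("G1", "G8")]     # hypothetical class-1 pairs
negative_pairs = sample_negative_pairs(genes, positive_pairs, rng)

# 10 folds, matching the 10-fold cross validation named in the caption.
folds = k_fold_indices(len(positive_pairs) + len(negative_pairs), 10, rng)
```

Each fold in turn would serve as the held-out test set while the remaining nine folds train the model.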
Notation of gene expression values for each class and gene in one gene pair.
| Gene pair | Class label 0 (Normal) | Class label 1 (AD) |
|---|---|---|
| Gene A | EA_L0 | EA_L1 |
| Gene B | EB_L0 | EB_L1 |
List of the features.
| Feature name | Definition |
|---|---|
| MeanA_L0 | mean of EA_L0 |
| MeanA_L1 | mean of EA_L1 |
| MeanB_L0 | mean of EB_L0 |
| MeanB_L1 | mean of EB_L1 |
| SDA_L0 | standard deviation of EA_L0 |
| SDA_L1 | standard deviation of EA_L1 |
| SDB_L0 | standard deviation of EB_L0 |
| SDB_L1 | standard deviation of EB_L1 |
| dMmA_L0 | maximum element of EA_L0 − minimum element of EA_L0 |
| dMmA_L1 | maximum element of EA_L1 − minimum element of EA_L1 |
| dMmB_L0 | maximum element of EB_L0 − minimum element of EB_L0 |
| dMmB_L1 | maximum element of EB_L1 − minimum element of EB_L1 |
| WTA_L0_B_L0 | Welch’s t-test (EA_L0, EB_L0) |
| WTA_L1_B_L1 | Welch’s t-test (EA_L1, EB_L1) |
| WTA_L0_A_L1 | Welch’s t-test (EA_L0, EA_L1) |
| WTB_L0_B_L1 | Welch’s t-test (EB_L0, EB_L1) |
| PCCA_L0_B_L0 | Pearson's correlation coefficient (EA_L0, EB_L0) |
| PCCA_L1_B_L1 | Pearson's correlation coefficient (EA_L1, EB_L1) |
| MIA_L0_B_L0 | Mutual Information (EA_L0, EB_L0) |
| MIA_L1_B_L1 | Mutual Information (EA_L1, EB_L1) |
| MIA_L0_A_L1 | Mutual Information (E’A_L0, E’A_L1), where E’ denotes the lists after being made equal-sized |
| MIB_L0_B_L1 | Mutual Information (E’B_L0, E’B_L1), where E’ denotes the lists after being made equal-sized |
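The 22 features listed above could be computed for one gene pair roughly as in the sketch below. Several details are assumptions because the table does not specify them: the Welch features are taken to be the t-statistic (not the p-value), mutual information is estimated from equal-width histograms with 4 bins, and lists are made equal-sized by truncating to the shorter length. All function names and the example values are hypothetical.

```python
import math
import statistics
from collections import Counter


def welch_t(x, y):
    """Welch's t-statistic for two samples with unequal variances."""
    vx, vy = statistics.variance(x), statistics.variance(y)
    diff = statistics.fmean(x) - statistics.fmean(y)
    return diff / math.sqrt(vx / len(x) + vy / len(y))


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def discretise(values, bins):
    """Map each value to an equal-width bin index (assumed binning scheme)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0   # avoid zero width for constant lists
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(x, y, bins=4):
    """Histogram-based mutual information (in nats) of two equal-length lists."""
    bx, by = discretise(x, bins), discretise(y, bins)
    n = len(bx)
    px, py, pxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    return sum(c / n * math.log(c * n / (px[i] * py[j]))
               for (i, j), c in pxy.items())


def equalise(x, y):
    """Truncate both lists to the shorter length (assumed equalisation rule)."""
    n = min(len(x), len(y))
    return x[:n], y[:n]


def extract_features(ea_l0, ea_l1, eb_l0, eb_l1):
    """Return the 22 features for one gene pair as a name -> value dict."""
    lists = {"A_L0": ea_l0, "A_L1": ea_l1, "B_L0": eb_l0, "B_L1": eb_l1}
    f = {}
    for key, e in lists.items():
        f[f"Mean{key}"] = statistics.fmean(e)
        f[f"SD{key}"] = statistics.stdev(e)
        f[f"dMm{key}"] = max(e) - min(e)
    f["WTA_L0_B_L0"] = welch_t(ea_l0, eb_l0)
    f["WTA_L1_B_L1"] = welch_t(ea_l1, eb_l1)
    f["WTA_L0_A_L1"] = welch_t(ea_l0, ea_l1)
    f["WTB_L0_B_L1"] = welch_t(eb_l0, eb_l1)
    f["PCCA_L0_B_L0"] = pearson(ea_l0, eb_l0)
    f["PCCA_L1_B_L1"] = pearson(ea_l1, eb_l1)
    f["MIA_L0_B_L0"] = mutual_information(ea_l0, eb_l0)
    f["MIA_L1_B_L1"] = mutual_information(ea_l1, eb_l1)
    f["MIA_L0_A_L1"] = mutual_information(*equalise(ea_l0, ea_l1))
    f["MIB_L0_B_L1"] = mutual_information(*equalise(eb_l0, eb_l1))
    return f


# Hypothetical z-scored expression values: 5 Normal samples, 6 AD samples.
features = extract_features(
    ea_l0=[0.1, -0.3, 0.4, 0.0, 0.2],
    ea_l1=[1.2, 0.8, 1.5, 0.3, 1.1, 0.9],
    eb_l0=[-0.9, -0.7, -1.1, -0.8, -1.0],
    eb_l1=[-0.2, -1.4, 0.1, -1.0, -0.6, -1.3],
)
```

The two within-gene mutual information features compare lists from different sample groups, which is why only they need the equalisation step; the within-class features pair values sample by sample.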
Detailed description of the dataset used for performance evaluation.
For all datasets, we used the AD-gene network published by the IntAct Molecular Interaction Database, which is curated by broad literature searches. However, since the size of the IntAct(AD) was small, interactome data were integrated to increase the size of the training dataset.
| Dataset ID | Interactome dataset | Use of AD-related genes | Sample size (Normal) | Sample size (AD) |
|---|---|---|---|---|
| 1 | IntAct(AD) + bPPI | Y | 3,241 | 3,241 |
| 2 | IntAct(AD) + bPPI + HumanNet (5%) | Y | 4,916 | 4,916 |
| 3 | IntAct(AD) + bPPI + HumanNet (10%) | Y | 7,013 | 7,013 |
| 4 | IntAct(AD) + bPPI | N | 23,546 | 23,546 |
| 5 | IntAct(AD) + bPPI + HumanNet (5%) | N | 46,206 | 46,206 |
| 6 | IntAct(AD) + bPPI + HumanNet (10%) | N | 69,296 | 69,296 |
List of the comparative algorithms and their primary parameters.
| Algorithms | Mainly used options |
|---|---|
| Naïve Bayes | No parameters |
| SVM | polynomial kernel |
| ANN | hidden layer = 3 |
| PART | minimum number of instances per rule = 2 |
Comparison of the performance of various algorithms for datasets 1 to 6.
The proposed method showed the best performance for all six datasets.
All metrics are weighted averages.
| Dataset | Algorithm | Accuracy | Precision | Recall | F-Measure | ROC area |
|---|---|---|---|---|---|---|
| 1 | Naïve Bayes | 0.537 | 0.551 | 0.537 | 0.504 | 0.581 |
| | SVM | 0.580 | 0.580 | 0.580 | 0.579 | 0.580 |
| | ANN | 0.570 | 0.570 | 0.570 | 0.570 | 0.603 |
| | PART | 0.742 | 0.742 | 0.742 | 0.742 | 0.842 |
| | Proposed method | 0.902 | 0.905 | 0.902 | 0.902 | 0.954 |
| 2 | Naïve Bayes | 0.547 | 0.567 | 0.547 | 0.512 | 0.585 |
| | SVM | 0.562 | 0.564 | 0.562 | 0.559 | 0.562 |
| | ANN | 0.567 | 0.567 | 0.567 | 0.567 | 0.597 |
| | PART | 0.713 | 0.723 | 0.713 | 0.710 | 0.812 |
| | Proposed method | 0.898 | 0.899 | 0.898 | 0.898 | 0.953 |
| 3 | Naïve Bayes | 0.549 | 0.567 | 0.549 | 0.518 | 0.597 |
| | SVM | 0.563 | 0.571 | 0.563 | 0.549 | 0.563 |
| | ANN | 0.570 | 0.570 | 0.570 | 0.570 | 0.601 |
| | PART | 0.744 | 0.746 | 0.744 | 0.743 | 0.850 |
| | Proposed method | 0.916 | 0.916 | 0.916 | 0.916 | 0.965 |
| 4 | Naïve Bayes | 0.529 | 0.533 | 0.529 | 0.515 | 0.555 |
| | SVM | 0.552 | 0.552 | 0.552 | 0.551 | 0.552 |
| | ANN | 0.535 | 0.537 | 0.535 | 0.528 | 0.565 |
| | PART | 0.628 | 0.628 | 0.628 | 0.628 | 0.704 |
| | Proposed method | 0.783 | 0.783 | 0.783 | 0.782 | 0.861 |
| 5 | Naïve Bayes | 0.540 | 0.560 | 0.540 | 0.499 | 0.577 |
| | SVM | 0.556 | 0.580 | 0.556 | 0.522 | 0.556 |
| | ANN | 0.559 | 0.559 | 0.559 | 0.559 | 0.587 |
| | PART | 0.642 | 0.644 | 0.642 | 0.640 | 0.718 |
| | Proposed method | 0.772 | 0.773 | 0.772 | 0.772 | 0.851 |
| 6 | Naïve Bayes | 0.535 | 0.552 | 0.535 | 0.494 | 0.571 |
| | SVM | 0.555 | 0.583 | 0.555 | 0.515 | 0.555 |
| | ANN | 0.565 | 0.566 | 0.565 | 0.565 | 0.591 |
| | PART | 0.662 | 0.662 | 0.662 | 0.662 | 0.752 |
| | Proposed method | 0.786 | 0.786 | 0.786 | 0.786 | 0.865 |
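To make the "weighted average" columns concrete, the sketch below computes support-weighted precision, recall, and F-measure from a binary confusion matrix. It assumes the weights are the class sizes, which is consistent with the Accuracy and Recall columns being identical in every row of the table, since support-weighted recall is algebraically equal to accuracy. The function name and the counts are hypothetical, and the sketch does not guard against empty classes or empty predictions.

```python
def weighted_metrics(tp, fn, fp, tn):
    """Support-weighted precision, recall and F-measure for a binary task.

    tp and fn count class-1 instances; fp and tn count class-0 instances.
    """
    n1, n0 = tp + fn, fp + tn          # class supports
    total = n1 + n0
    # Per-class precision and recall, listed as [class 1, class 0].
    prec = [tp / (tp + fp), tn / (tn + fn)]
    rec = [tp / n1, tn / n0]
    f1 = [2 * p * r / (p + r) for p, r in zip(prec, rec)]
    weights = [n1 / total, n0 / total]

    def avg(per_class):
        return sum(w * x for w, x in zip(weights, per_class))

    return {
        "accuracy": (tp + tn) / total,
        "precision": avg(prec),
        "recall": avg(rec),
        "f_measure": avg(f1),
    }


# Hypothetical counts for a balanced test set of 100 pairs per class.
m = weighted_metrics(tp=90, fn=10, fp=20, tn=80)
```

With balanced classes, as in all six datasets, each class receives a weight of 0.5, so the weighted average reduces to the plain mean of the two per-class values.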
Fig 3. ROC curve for various algorithms using dataset 3.
Fig 4. Visualisation of the subnetwork for features extracted by a degree-based topological analysis.
The subnetwork contains 130 nodes and 247 edges. The top 20 genes by degree are coloured on a scale from red to yellow. Blue nodes indicate seed genes.
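A degree-based ranking of this kind can be sketched as below: count how many distinct neighbours each gene has in an undirected edge list and keep the k genes with the highest degree. The function name, the edge list, and the gene IDs are hypothetical, and the caption does not describe how ties or duplicate edges were handled, so the choices here are assumptions.

```python
from collections import Counter


def top_degree_nodes(edges, k):
    """Return the k (node, degree) pairs with the highest degree.

    Edges are undirected; duplicate edges and self-loops are ignored
    (an assumption of this sketch).
    """
    seen = set()
    degree = Counter()
    for u, v in edges:
        edge = frozenset((u, v))
        if u == v or edge in seen:
            continue
        seen.add(edge)
        degree[u] += 1
        degree[v] += 1
    return degree.most_common(k)


# Hypothetical predicted gene-gene interactions.
edges = [("G1", "G2"), ("G1", "G3"), ("G1", "G4"),
         ("G2", "G3"), ("G5", "G1"), ("G3", "G1")]
top = top_degree_nodes(edges, 2)
```

For the network in the figure, k would be 20.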
Fig 5. Functional enrichment results for the GSE15222 dataset.
An asterisk next to a pathway or GO term indicates that it has been reported in previous studies. (A) We used GSEA with an FDR q-value threshold of 0.001 and selected 15 pathways that satisfied the threshold. Notably, in addition to the Alzheimer’s disease pathway, several AD-related pathways, such as Regulation of actin cytoskeleton and Neurotrophin signalling pathway, were enriched. (B) We used FuncAssociate 3.0 with the default evidence code. The p-value threshold was 0.001, and we selected 20 GO terms that are potentially related to AD. We found that many GO terms related to AD were significantly enriched. (C) We used GSEA with an FDR q-value threshold of 0.001 and selected 15 GO terms in the cellular component category that satisfied the threshold and are potentially related to neuronal functions.