Xuemeng Fan, Yaolai Wang, Xu-Qing Tang.
Abstract
BACKGROUND: Lung adenocarcinoma is the most common type of lung cancer and carries high mortality worldwide. Its occurrence and development have been studied extensively with high-throughput expression microarrays, which have produced abundant data on gene expression, DNA methylation, and miRNA quantification. However, the hub genes that can serve as biomarkers for discriminating cancer patients from healthy individuals have not been well screened.

RESULT: Here we present a new method for extracting gene predictors that aims to obtain the fewest predictors without losing classification efficiency. We first analyzed three different expression microarrays and constructed a multi-interaction network, since an individual expression dataset is not enough to describe biological behavior dynamically and systematically. We then transformed the undirected interaction network into a directed network by employing the Granger causality test, and screened the predictors with the stepwise character selection algorithm. Six predictors, TOP2A, GRK5, SIRT7, MCM7, EGFR, and COL1A2, were ultimately identified. All six are cancer-related, and their small number facilitates diagnosis. Finally, the approach was validated by robustness analyses on six independent datasets, where the precision ranged from 95.3% to 100%.
Keywords: Granger causality test; Lung adenocarcinoma; Predictor extraction; Stepwise character selection
Year: 2019 PMID: 31074380 PMCID: PMC6509866 DOI: 10.1186/s12859-019-2739-z
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Basic characteristics of 7 datasets
| Characteristics | | TCGA (analysis) | GSE10072 (validation) | GSE83213 (validation) | GSE2088 (validation) | GSE32863 (validation) | GSE43458 (validation) | GSE27262 (validation) |
|---|---|---|---|---|---|---|---|---|
| Platform | | Illumina | Affymetrix | Illumina | Illumina | Illumina | Affymetrix | Affymetrix |
| Cancer/Normal (Total) | | 539/59 (598) | 50/57 (107) | 11/46 (57) | 57/30 (87) | 58/58 (116) | 80/30 (110) | 25/25 (50) |
| Male (%) | | 107 (18) | 69 (64) | 28 (50) | Unknown | 26 (22) | Unknown | Unknown |
| Race | Asian | 8 | 0 | 0 | 0 | 44 | 0 | 0 |
| | Black | 59 | 0 | 0 | 0 | 0 | 0 | 0 |
| | White | 446 | 0 | 0 | 0 | 72 | 0 | 0 |
| | Unreported | 66 | 107 | 57 | 87 | 0 | 110 | 50 |
| Mean age | | 65 | 56 | Unknown | Unknown | 68 | Unknown | 58 |
| Never-smoker (%) | | 34 (6) | 36 (34) | Unknown | Unknown | Unknown | 70 (64) | Unknown |
Fig. 1 Flow chart of the Granger causality test. The Pearson correlation test uses p-value < 0.01 as its threshold; the other three tests (unit root test, co-integration test, and Granger causality test) use p-value < 0.05
Fig. 2 Illustration of how the globally independent genes are selected. A is the independent gene of B and D; B is the independent gene of C and also the dependent gene of E; F is a single-node gene; H, I and G form a feedback sub-network. In category 1, A, E and F, which have indegree = 0, are identified as globally independent genes (marked in red). In category 2, H, I and G are all identified as globally independent genes (marked in red). Green nodes are genes that are both dependent genes and independent genes of some other genes. Blank nodes are genes that are only dependent genes of some other genes
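The selection rule in the Fig. 2 caption can be sketched on a small directed graph. The sketch below is a hedged illustration, not the authors' code: it assumes that an edge `(u, v)` means "u is the independent gene of v", and it interprets category 2 as a feedback sub-network (a strongly connected group of nodes) that receives no edge from outside the group. The function name `globally_independent` and the direction chosen for the H, I, G loop are hypothetical.

```python
from collections import defaultdict

def globally_independent(nodes, edges):
    """Return nodes with no regulator outside their own feedback loop.

    Category 1: single nodes with indegree 0.
    Category 2 (assumed reading): every node of a feedback sub-network
    that has no incoming edge from outside the sub-network.
    """
    nodes = set(nodes) | {n for edge in edges for n in edge}
    succ = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)

    def reachable(start):
        # Iterative depth-first search; the start node counts as reachable.
        seen, stack = {start}, [start]
        while stack:
            for nxt in succ[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    reach = {n: reachable(n) for n in nodes}
    result = set()
    for n in nodes:
        # Nodes that reach n and are reached by n form n's feedback group.
        group = {m for m in reach[n] if n in reach[m]}
        regulated_from_outside = any(
            u not in group and v in group for u, v in edges
        )
        if not regulated_from_outside:
            result |= group
    return result

# Graph described in the Fig. 2 caption (loop direction is assumed).
fig2_edges = [("A", "B"), ("A", "D"), ("B", "C"), ("E", "B"),
              ("H", "I"), ("I", "G"), ("G", "H")]
fig2_nodes = ["A", "B", "C", "D", "E", "F", "G", "H", "I"]
print(sorted(globally_independent(fig2_nodes, fig2_edges)))
```

The pairwise reachability approach is quadratic in the number of nodes, which is adequate for a network of a few hundred genes; a dedicated strongly-connected-components routine would be the natural replacement for larger graphs.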
Fig. 3 Flow chart of stepwise character selection based on RF
Process of stepwise character selection based on RF
| Step |
|---|
| … which acts as a single predictor in RF with 5-fold cross validation, respectively. |
| While … |
| 1. … |
| 2. … |
| If … |
| 1. Define … |
| 2. … |
| 3. … |
| End if. |
| Calculate … |
| End while. |
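The surviving steps above do not state the selection criterion, so the following is only a minimal sketch of a generic greedy forward stepwise selection, not the authors' exact procedure. It assumes a hypothetical callable `score(subset)` that returns a higher-is-better number for a tuple of gene names; in the setting described here that would be the 5-fold cross-validated performance of an RF trained on those genes. The names `stepwise_select`, `min_gain`, and `toy_score`, and the toy weights, are all made up for illustration.

```python
def stepwise_select(features, score, min_gain=0.0):
    """Greedy forward selection driven by a user-supplied score.

    features: iterable of candidate names.
    score:    callable taking a tuple of names, returning a number
              (higher is better). Hypothetical stand-in for a
              cross-validated classifier score.
    min_gain: stop when the best candidate improves the score by no
              more than this amount.
    """
    remaining = list(features)
    selected = []
    best = float("-inf")
    while remaining:
        # Score each candidate when added to the current selection.
        trial = {f: score(tuple(selected + [f])) for f in remaining}
        candidate = max(remaining, key=trial.get)
        # The first feature is always accepted; later ones must help.
        if selected and trial[candidate] - best <= min_gain:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = trial[candidate]
    return selected, best

def toy_score(subset):
    # Made-up additive usefulness with a small per-feature penalty.
    usefulness = {"g1": 0.6, "g2": 0.3}
    return sum(usefulness.get(g, 0.0) for g in subset) - 0.01 * len(subset)

chosen, final_score = stepwise_select(["g1", "g2", "g3", "g4"], toy_score)
print(chosen)
```

With the toy scorer, the two informative features are kept and the uninformative ones are rejected because the penalty outweighs their zero contribution. Swapping in a real cross-validated scorer changes only the `score` argument.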
Top 5 target genes and their top 4 regulator miRNA
| Target genes | Degree | DEmiRNA |
|---|---|---|
| VEGFA | 12 | hsa-mir-378a (120), hsa-mir-373 (62), hsa-mir-34a (21), hsa-mir-17 (20) |
| CCND1 | 10 | hsa-mir-34a (21), hsa-mir-17 (70), hsa-mir-449a (14), hsa-mir-19a (12) |
| CDK6 | 10 | hsa-mir-615 (121), hsa-mir-21 (70), hsa-mir-203a (21), hsa-mir-34a (21) |
| BCL2 | 9 | hsa-mir-375 (419), hsa-mir-429 (26), hsa-mir-34a (21), hsa-mir-17 (20) |
| PTEN | 7 | hsa-mir-21 (48), hsa-mir-19a (12), hsa-mir-217 (11), hsa-mir-144 (8) |
Fig. 4 Distribution of the 6661 diff-genes. The lengths of the horizontal bars represent the sizes of the Dmet-located-gene, DEmiRNA-target-gene, and DEG groups. The heights of the vertical bars represent the sizes of the intersections among the Dmet-located-gene, DEmiRNA-target-gene, and DEG groups
Fig. 5 Multi-interaction network of 148 feature genes. The red lines indicate co-location interactions, the blue lines indicate physical interactions, the yellow lines indicate shared protein domain interactions, and the size of a node indicates its degree
Fig. 6 Sources of 148 feature genes. Blue blocks denote that a feature gene is a DEG, a DEmiRNA-target-gene, or a Dmet-gene. The gene highlighted by the red square is EGFR, a well-known gene related to lung adenocarcinoma and one of the predictors we identified. EGFR comes only from the DEmiRNA-target-gene group. This indicates that it is not sufficient to analyze gene activities based on a single dataset
Fig. 7 Directed causality network of 63 independent genes derived via the Granger causality test. The directed arrows represent regulation directions. The red lines indicate co-location interactions. The blue lines indicate physical interactions. The yellow lines indicate shared protein domain interactions. The nodes in green represent the 63 globally independent genes screened
Fig. 8 Performance of RF as a classifier based on 148 feature genes, 63 globally independent genes, and 6 predictors. a: 148 feature genes, b: 63 globally independent genes, c: 6 predictors. “TP” and “NT” denote lung adenocarcinoma (marked in green) and normal samples (marked in brown), respectively
Numbers of causality edges for each interaction type in three networks
| Interaction type | Independent genes network | Causality network | Feature genes network |
|---|---|---|---|
| Co-location | 37 | 142 | 1219 (11.6%) |
| Physical | 2 | 52 | 581 (8.9%) |
| Shared protein domain | 13 | 23 | 32 (40.6%) |
Classification performances of the 5 gene sets including G, D, I, G∪D and G∪I
| Gene set | Number | ACC (%) | SN (%) | SP (%) | MCC (%) | AUC |
|---|---|---|---|---|---|---|
| G | 63 | 98.7 | 88.7 | 96.6 | 93.5 | 0.89 |
| D | 42 | 97.4 | 96.5 | 91.8 | 88.1 | 0.88 |
| I | 26 | 96.6 | 95.7 | 90.1 | 85.5 | 0.91 |
| G∪D | 105 | 97.2 | 95.9 | 92.3 | 87.7 | 0.86 |
| G∪I | 89 | 97.6 | 96.3 | 94.3 | 90.0 | 0.89 |
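The ACC, SN, SP and MCC columns can all be derived from a 2×2 confusion matrix. The sketch below assumes the standard textbook definitions, since the formulas actually used are not given here; the function name `classification_metrics` and the example counts are made up. Values are returned as fractions, whereas the tables report them as percentages.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute ACC, SN, SP and MCC from confusion-matrix counts.

    tp/fn: tumor samples called tumor / called normal.
    tn/fp: normal samples called normal / called tumor.
    Assumes both classes are present (tp + fn > 0 and tn + fp > 0).
    """
    total = tp + tn + fp + fn
    acc = (tp + tn) / total      # accuracy
    sn = tp / (tp + fn)          # sensitivity (recall on tumors)
    sp = tn / (tn + fp)          # specificity (recall on normals)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SN": sn, "SP": sp, "MCC": mcc}

# Made-up counts: 10 tumors (8 detected), 10 normals (9 correct).
metrics = classification_metrics(tp=8, tn=9, fp=1, fn=2)
print({name: round(100 * value, 1) for name, value in metrics.items()})
```

MCC uses all four cells of the confusion matrix, so it is less easily inflated than ACC when the classes are unbalanced, as they are in several of the datasets listed here.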
Classification performances of the 3 gene sets including feature genes, globally independent genes and predictors
| Gene set | Number | ACC (%) | SN (%) | SP (%) | MCC (%) | AUC |
|---|---|---|---|---|---|---|
| Feature gene | 148 | 97.4 | 95.1 | 95.0 | 87.5 | 0.90 |
| Independent gene | 63 | 97.7 | 88.7 | 96.6 | 93.5 | 0.89 |
| Predictor | 6 | 97.6 | 98.2 | 94.7 | 90.8 | 0.83 |
Fig. 9 ROC curves of three gene sets
Classification accuracies of the resulting 6 predictors in 6 datasets
| Dataset | Tumor (%) | ACC (%) | SN (%) | SP (%) | MCC (%) |
|---|---|---|---|---|---|
| GSE10072 | 50 (47) | 98.3 | 98.2 | 94.7 | 90.8 |
| GSE83213 | 11 (19) | 95.7 | 100 | 92.5 | 86.0 |
| GSE2088 | 57 (66) | 97.2 | 96.9 | 96.5 | 94.3 |
| GSE32863 | 58 (50) | 95.3 | 95.7 | 90.0 | 89.0 |
| GSE43458 | 80 (73) | 98.3 | 98.0 | 95.0 | 95.0 |
| GSE27262 | 25 (50) | 100 | 100 | 100 | 100 |
Fig. 10 Validation in 6 datasets from different sources. a: GSE10072, b: GSE83213, c: GSE2088, d: GSE32863, e: GSE43458, f: GSE27262. “TP” and “NT” denote lung adenocarcinoma (marked in yellow) and normal samples (marked in dark red), respectively