Chun-Chi Liu, Wen-Shyen E Chen, Chin-Chung Lin, Hsiang-Chuan Liu, Hsuan-Yu Chen, Pan-Chyr Yang, Pei-Chun Chang, Jeremy J W Chen.
Abstract
Cancer classification is the critical basis for patient-tailored therapy, while pathway analysis is a promising method to discover the underlying molecular mechanisms related to cancer development by using microarray data. However, linking molecular classification and pathway analysis through a gene network approach has not been discussed yet. In this study, we developed a novel framework based on cancer class-specific gene networks for classification and pathway analysis. This framework involves a novel gene network construction, named ordering network, which exhibits the power-law node-degree distribution as seen in correlation networks. The results obtained from five public cancer datasets showed that the gene networks with ordering relationship are better than those with correlation relationship in terms of accuracy and stability of the classification performance. Furthermore, we integrated the ordering networks, classification information and pathway database to develop the topology-based pathway analysis for identifying cancer class-specific pathways, which might be essential to the biological significance of cancer. Our results suggest that the topology-based classification technology can precisely distinguish cancer subclasses and the topology-based pathway analysis can characterize the corresponding biochemical pathways even if there are subtle, but consistent, changes in gene expression, which may provide new insights into the underlying molecular mechanisms of tumorigenesis.
Year: 2006 PMID: 16914437 PMCID: PMC1557825 DOI: 10.1093/nar/gkl583
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Flowchart of the topology-based cancer classification framework. (a) Gene selection by S2N and permutation test is performed on the training dataset. (b) To reduce noise and to investigate how the number of genes affects classification performance, the microarray data are filtered by the gene list selected in (a). The topological classification framework includes three major steps: (i) construction of the primitive gene network; (ii) construction of the extension gene network; and (iii) computation of the similarity between the primitive and extension networks for cancer classification. The primitive networks are also used for hub gene analysis and for calculating the network stability coefficient. A leave-one-out cross-validation (LOOCV) of the training dataset is performed to obtain a training accuracy before the test dataset is used; the training accuracy indicates the quality of the training dataset.
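The gene selection step in (a) can be sketched as follows. This is a minimal illustration, assuming the common signal-to-noise definition (difference of class means divided by the sum of class standard deviations) and a one-vs-rest comparison with a label-permutation test; the function names `s2n_scores` and `permutation_pvalues`, the number of permutations and the one-sided test are illustrative assumptions, not details taken from the source.

```python
import numpy as np

def s2n_scores(X, y, target):
    """Signal-to-noise score of every gene for `target` versus all other samples.

    X is a (samples, genes) expression matrix and y holds one label per sample.
    Assumed definition: (mean_in - mean_rest) / (sd_in + sd_rest).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    in_cls, rest = X[y == target], X[y != target]
    mean_diff = in_cls.mean(axis=0) - rest.mean(axis=0)
    sd_sum = in_cls.std(axis=0, ddof=1) + rest.std(axis=0, ddof=1)
    return mean_diff / sd_sum

def permutation_pvalues(X, y, target, n_perm=1000, seed=0):
    """One-sided permutation p-value per gene, obtained by shuffling the labels."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    observed = s2n_scores(X, y, target)
    exceed = np.zeros(observed.shape[0])
    for _ in range(n_perm):
        shuffled = rng.permutation(y)
        exceed += s2n_scores(X, shuffled, target) >= observed
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_perm + 1)

# Toy data: gene 0 separates the classes, gene 1 does not.
X = np.array([[10, 3], [11, 5], [10, 4], [11, 6],
              [1, 4], [2, 6], [1, 3], [2, 5]], dtype=float)
y = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
scores = s2n_scores(X, y, "A")
pvals = permutation_pvalues(X, y, "A")
```

Genes would then be ranked by score within each subclass and kept only if their p-value falls below the chosen threshold (P < 0.05 in the tables of this record).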
Figure 2. Diagram of the topology-based classification algorithm. If the number of samples in subclass i is $S_i$, then in the matrix of the training dataset each column represents a gene (M genes in total) and each row represents a sample. $g_{ij}$ is the expression level of gene i in sample j. Given the gene expression data of a test sample x from the test dataset, x is assigned to the subclass with the largest correlation coefficient among $R_1, R_2, \ldots, R_n$. The diagram shows the classification procedure for a single test sample; all other test samples are processed in the same way.
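The decision rule in this caption (score each subclass, then pick the largest coefficient) can be illustrated with a small sketch. The caption does not spell out how the networks are built or which topological quantity is compared, so everything below is an assumption made for illustration: edges are drawn where the absolute Pearson correlation between two genes exceeds a hypothetical fixed `threshold` (instead of a permutation-based significance test), the topological quantity is the node-degree vector, and the similarity between the primitive network (subclass training samples) and the extension network (the same samples plus x) is the Pearson correlation of the two degree vectors.

```python
import numpy as np

def degree_vector(X, threshold=0.8):
    """Node degrees of a correlation network over the columns (genes) of X.

    Two genes are linked when |Pearson r| >= threshold (an assumed rule).
    """
    corr = np.corrcoef(X, rowvar=False)      # gene-by-gene correlation matrix
    adj = np.abs(corr) >= threshold
    np.fill_diagonal(adj, False)             # no self-loops
    return adj.sum(axis=0).astype(float)

def pearson(u, v):
    """Pearson correlation, defined as 0.0 when either vector is constant."""
    if u.std() == 0 or v.std() == 0:
        return 0.0
    return float(np.corrcoef(u, v)[0, 1])

def classify(train, x, threshold=0.8):
    """Assign x to the subclass whose network changes least when x is added.

    train maps each subclass label to an (S_i, M) matrix of training samples.
    """
    scores = {}
    for label, X in train.items():
        primitive = degree_vector(X, threshold)                   # training only
        extension = degree_vector(np.vstack([X, x]), threshold)   # training + x
        scores[label] = pearson(primitive, extension)
    best = max(scores, key=scores.get)
    return best, scores

# Random toy data: two subclasses, 8 samples each, 6 genes.
rng = np.random.default_rng(0)
train = {"A": rng.normal(size=(8, 6)), "B": rng.normal(size=(8, 6))}
x = rng.normal(size=6)
label, scores = classify(train, x, threshold=0.5)
```

The idea behind this design is that a sample belonging to a subclass should perturb that subclass's network less than it perturbs the others, so the coefficient for the true subclass tends to be the largest.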
The comparison of three topological quantities between the correlation and ordering networks
| Dataset | Statistic | Correlation DV | Correlation CCV | Correlation WAD | Ordering DV | Ordering CCV | Ordering WAD |
|---|---|---|---|---|---|---|---|
| ALL-subtype | Average | 0.9247 | 0.9012 | 0.8788 | 0.9659 | 0.8706 | 0.7647 |
| | Maximum | 0.9294 | 0.9294 | 0.9647 | 0.9882 | 0.9765 | 0.8235 |
| | SD | 0.0149 | 0.0381 | 0.0392 | 0.0151 | 0.0620 | 0.0283 |
| GCM | Average | 0.3957 | 0.3978 | 0.5500 | 0.6783 | 0.3978 | 0.6326 |
| | Maximum | 0.4348 | 0.5000 | 0.6304 | 0.7174 | 0.4565 | 0.6522 |
| | SD | 0.0367 | 0.0492 | 0.0562 | 0.0321 | 0.0411 | 0.0216 |
| Lung-cancer | Average | 0.9906 | 0.9805 | 0.9946 | 0.9960 | 0.9423 | 0.9859 |
| | Maximum | 0.9933 | 0.9933 | 1.0000 | 1.0000 | 0.9933 | 0.9933 |
| | SD | 0.0085 | 0.0124 | 0.0053 | 0.0035 | 0.0278 | 0.0080 |
| Lung-subtype-1 | Average | 0.8985 | 0.8361 | 0.8238 | 0.9178 | 0.8718 | 0.3104 |
| | Maximum | 0.9257 | 0.8812 | 0.9109 | 0.9307 | 0.9059 | 0.3168 |
| | SD | 0.0150 | 0.0264 | 0.0893 | 0.0097 | 0.0208 | 0.0078 |
| Lung-subtype-2 | Average | 0.9721 | 0.9519 | 0.9473 | 0.9667 | 0.9279 | 0.8612 |
| | Maximum | 0.9845 | 0.9767 | 0.9535 | 0.9690 | 0.9457 | 0.9767 |
| | SD | 0.0138 | 0.0219 | 0.0080 | 0.0037 | 0.0110 | 0.2638 |
| MLL-leukemia | Average | 1.0000 | 0.9467 | 0.9533 | 1.0000 | 0.9467 | 1.0000 |
| | Maximum | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| | SD | 0.0000 | 0.1080 | 0.0549 | 0.0000 | 0.0878 | 0.0000 |
The topology-based classification analyses were performed to compare the three topological quantities (DV; CCV; WAD) in all the tested datasets. The accuracies of the classification experiments are listed in this table.
a. The classification is based on the correlation network construction with the permutation significance threshold (P < 0.05).
b. The classification is based on the ordering network construction with the permutation significance threshold (P < 0.05).
c. The classification experiments are performed using different numbers of genes per subclass (10, 20, … , 100 genes). The statistics reported are the average accuracy, the best accuracy (Maximum) and the SD of the accuracies.
Classification accuracies of all the tested datasets
| Dataset | Original | SVM | Correlation (40) | Correlation (80) | Ordering (NSC) | Ordering (40) | Ordering (80) | Ordering (Best) |
|---|---|---|---|---|---|---|---|---|
| ALL-subtype | 0.96 | 1.00 | 0.93 | 0.93 | 0.96 | 0.99 | 0.96 | 0.99 |
| GCM | 0.78 | 0.70 | 0.37 | 0.41 | 0.90 | 0.72 | 0.67 | 0.72 |
| Lung-cancer | 0.96 | 0.99 | 0.99 | 0.99 | 0.95 | 0.99 | 0.99 | 1.00 |
| Lung-subtype-1 | 0.87 | 0.94 | 0.91 | 0.88 | 0.95 | 0.93 | 0.91 | 0.93 |
| Lung-subtype-2 | — | 0.93 | 0.98 | 0.98 | 0.94 | 0.97 | 0.97 | 0.97 |
| MLL-leukemia | 0.95 | 1.00 | 1.00 | 1.00 | 0.96 | 1.00 | 1.00 | 1.00 |
Top 40 and 80 genes per subclass are selected based on S2N score and permutation test (P < 0.05).
a. The classification is based on the correlation network construction with the permutation significance threshold (P < 0.05).
b. The classification is based on the ordering network construction with the permutation significance threshold (P < 0.05).
c. The classification methods of the original papers: ALL-subtype, SVM; GCM, SVM; Lung-cancer, gene expression ratios; Lung-subtype-1, KNN; MLL-leukemia, KNN.
d. For the purpose of comparison, multiclass SVM is used for the classification.
e. The average of the NSC (network stability coefficient) over all subclasses, with the ordering network constructed from the top 40 genes per subclass.
f. The classification experiments are performed using different numbers of genes per subclass (10, 20, … , 100 genes). The best accuracy among these cases is reported.
Figure 3. Large-scale gene networks and the power-law node-degree distribution. (a) Classification accuracy profiles of large-scale ordering and correlation networks for all of the tested datasets. Solid circles represent the ordering networks and solid triangles represent the correlation networks. The accuracy of the correlation networks decreased and became highly variable once the network size grew and the number of genes exceeded a certain threshold. (b) The ALL-subtype dataset is used to verify the power-law node-degree distribution in the ordering and correlation networks. These are log-log plots of degree K versus P(K), the number of nodes with degree ≥ K; both the ordering and the correlation networks contain 6602 genes. Linear regression measures the linearity between log[P(K)] and log(K), which is a condition of the power-law node-degree distribution, where P(K) follows a power law with degree exponent r: $P(K) \approx K^{-r}$. The determination coefficient $R^2$ ranges from 0 to 1, with 1 representing perfect linearity (i.e. a perfect power-law distribution). The permuted network is constructed by randomly permuting all of the edges of the T-ALL subclass network. The degree exponent r and determination coefficient $R^2$ of each subclass-specific network are shown in Supplementary Table S9.
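The power-law check described in (b) can be sketched as a linear regression on log-transformed values. The snippet below is a minimal illustration rather than the authors' code: the function name `power_law_fit` and the use of base-10 logarithms are assumptions, and nodes with degree 0 are dropped because their logarithm is undefined.

```python
import numpy as np

def power_law_fit(degrees):
    """Fit log P(K) against log K, where P(K) = number of nodes with degree >= K.

    Returns (r, R2): the degree exponent r in P(K) ~ K^(-r) and the
    determination coefficient R2 of the linear fit.
    """
    degrees = np.asarray(degrees)
    degrees = degrees[degrees > 0]            # log(0) is undefined
    ks = np.unique(degrees)                   # distinct degrees, ascending
    p_k = np.array([(degrees >= k).sum() for k in ks])
    log_k, log_p = np.log10(ks), np.log10(p_k)

    slope, intercept = np.polyfit(log_k, log_p, 1)
    predicted = slope * log_k + intercept
    ss_res = ((log_p - predicted) ** 2).sum()
    ss_tot = ((log_p - log_p.mean()) ** 2).sum()
    return -slope, 1 - ss_res / ss_tot

# Toy degree list built so that P(K) = 64 / K^2 exactly at K = 1, 2, 4, 8.
toy_degrees = [8] * 1 + [4] * 3 + [2] * 12 + [1] * 48
r, r_squared = power_law_fit(toy_degrees)
```

For the toy degree list the fit recovers r close to 2 with a determination coefficient close to 1; a randomly permuted network would be expected to give a noticeably poorer linear fit.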
Figure 4. The ordering and correlation network diagrams for the significant pathways. The Lung-cancer dataset was used for topology-based pathway analysis, and two significant pathways, derived from the BioCarta and KEGG pathway databases, respectively, are shown. (a) Ordering networks of eicosanoid metabolism; (b) correlation networks of eicosanoid metabolism; (c) ordering networks of ascorbate and aldarate metabolism; (d) correlation networks of ascorbate and aldarate metabolism. Two class-specific networks (MPM and AD) are shown for each pathway. The network topology reveals a clear difference between the two class-specific networks, as well as a change of hub gene. In the ordering networks, the hub gene is ALOX5 in AD but PTGS1 in MPM (a), and CYP26A1 in AD but ALDH1A2 in MPM (c).