Abstract
Microarray techniques have been used successfully to extract information for cancer diagnosis at the gene expression level, owing to their ability to measure thousands of gene expression levels in a massively parallel way. One important issue is improving classification performance on microarray data. Ideally, however, influential genes and even interpretable rules would be identified at the same time to offer biological insight. Borrowing concepts of system design from software engineering, this paper presents an integrated and effective method (named X-AI) for accurate cancer classification and knowledge acquisition from DNA microarray data. The method includes a feature selector that systematically extracts the relatively important genes, reducing the dimensionality while retaining as much of the class-discriminatory information as possible. Next, diagonal quadratic discriminant analysis (DQDA) is used to classify tumors, and generalized rule induction (GRI) is integrated to establish association rules that give an understanding of the relationships between cancer classes and related genes. Two non-redundant datasets of acute leukemia were used to validate the proposed X-AI, showing high accuracy in discriminating the different classes. In addition, I demonstrate the ability of X-AI to extract relevant genes and to develop interpretable rules. Further, a web server has been established for cancer classification and is freely available at http://bioinformatics.myweb.hinet.net/xai.htm.
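The DQDA step named in the abstract can be illustrated with a minimal sketch. It assumes the standard formulation of diagonal quadratic discriminant analysis (class-specific means and per-gene variances, with off-diagonal covariances ignored). The function names `fit_dqda` and `predict_dqda` and all expression values are hypothetical and are not taken from X-AI.

```python
import math
from collections import defaultdict


def fit_dqda(samples, labels):
    """Estimate per-class means, per-gene variances and priors.

    Assumes every class has at least two samples and that no gene has
    zero variance within a class (otherwise the score is undefined).
    """
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    n = len(samples)
    model = {}
    for y, rows in by_class.items():
        m = len(rows)
        columns = list(zip(*rows))  # one tuple of values per gene
        means = [sum(col) / m for col in columns]
        variances = [
            sum((v - mu) ** 2 for v in col) / (m - 1)
            for col, mu in zip(columns, means)
        ]
        model[y] = (means, variances, m / n)
    return model


def predict_dqda(model, x):
    """Return the class with the smallest diagonal quadratic score."""
    def score(params):
        means, variances, prior = params
        quadratic = sum(
            (xj - mu) ** 2 / var + math.log(var)
            for xj, mu, var in zip(x, means, variances)
        )
        return quadratic - 2.0 * math.log(prior)

    return min(model, key=lambda y: score(model[y]))


# Hypothetical two-gene training set (values invented for illustration).
train_x = [
    [1000.0, 300.0], [1100.0, 350.0], [1200.0, 400.0],
    [200.0, 900.0], [250.0, 1000.0], [300.0, 1100.0],
]
train_y = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
model = fit_dqda(train_x, train_y)
print(predict_dqda(model, [1050.0, 320.0]))
```

Because each class keeps its own variance per gene, the decision boundary is quadratic, while ignoring covariances keeps the number of parameters small relative to the number of genes.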
Year: 2009 PMID: 19272192 PMCID: PMC2653531 DOI: 10.1186/1423-0127-16-25
Source DB: PubMed Journal: J Biomed Sci ISSN: 1021-7770 Impact factor: 8.410
Figure 1. A three-tiered architecture applied to microarray gene expression data to integrate the tasks of data analysis from the pre-processing to the data mining.
Figure 2. The X-AI framework with dataflow for cancer classification and knowledge acquisition from DNA microarray data.
Top ten genes selected by the feature selection function of X-AI for the two datasets
| Dataset | Probe ID | Gene annotation | χ² score |
| L1 | X95735 | Zyxin | 38.00 |
| L1 | M55150 | FAH Fumarylacetoacetate | 33.54 |
| L1 | M27891 | CST3 Cystatin C (amyloid angiopathy and cerebral hemorrhage) | 33.31 |
| L1 | M31166 | PTX3 Pentaxin-related gene, rapidly induced by IL-1 beta | 33.31 |
| L1 | X70297 | CHRNA7 Cholinergic receptor, nicotinic, alpha polypeptide 7 | 29.77 |
| L1 | U46499 | GLUTATHIONE S-TRANSFERASE, MICROSOMAL | 29.77 |
| L1 | L09209_s | APLP2 Amyloid beta (A4) precursor-like protein 2 | 29.77 |
| L1 | M77142 | NUCLEOLYSIN TIA-1 | 29.77 |
| L1 | J03930 | ALKALINE PHOSPHATASE, INTESTINAL PRECURSOR | 29.02 |
| L1 | M23197 | CD33 CD33 antigen (differentiation antigen) | 28.95 |
| L2 | 36239_at | H. sapiens mRNA for oct-binding factor | 91.08 |
| L2 | 37539_at | Homo sapiens mRNA for KIAA0959 protein, partial cds | 84.51 |
| L2 | 35260_at | Homo sapiens mRNA for KIAA0867 protein, complete cds | 83.72 |
| L2 | 32847_at | Homo sapiens myosin light chain kinase (MLCK) mRNA, complete cds | 79.82 |
| L2 | 35164_at | Homo sapiens transmembrane protein (WFS1) mRNA, complete cds | 79.46 |
| L2 | 1325_at | Homo sapiens TWIK-related acid-sensitive K+ channel (TASK) mRNA, complete cds | 78.57 |
| L2 | 40191_s_at | wg66h09.x1 Homo sapiens cDNA, 3' end | 77.22 |
| L2 | 39318_at | H. sapiens mRNA for Tcell leukemia | 76.22 |
| L2 | 32579_at | Human transcriptional activator (BRG1) mRNA, complete cds | 74.97 |
| L2 | 41715_at | H. sapiens mRNA for phosphoinositide 3-kinase | 73.53 |
L1: the dataset of Golub et al. [1]
L2: the dataset of Armstrong et al. [20]
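The χ² scores in the table rank genes by how strongly they are associated with the class label. As a rough illustration, the sketch below computes a Pearson χ² statistic between a binarized gene and the class. Binarizing at a single threshold is an assumption made here for simplicity, since the record does not state how X-AI discretizes expression values. The function name `chi_square_score` and the toy values are hypothetical.

```python
from collections import Counter


def chi_square_score(values, labels, threshold):
    """Pearson chi-square between a thresholded gene and the class label.

    Each expression value is binned as "high" or "low" relative to
    `threshold` (an assumed discretization, not necessarily X-AI's).
    """
    bins = ["high" if v > threshold else "low" for v in values]
    n = len(values)
    observed = Counter(zip(bins, labels))
    bin_totals = Counter(bins)
    class_totals = Counter(labels)

    score = 0.0
    for b, b_count in bin_totals.items():
        for c, c_count in class_totals.items():
            expected = b_count * c_count / n
            score += (observed[(b, c)] - expected) ** 2 / expected
    return score


# Toy data: one gene separates the classes, the other does not.
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
genes = {
    "gene_separating": [10.0, 12.0, 11.0, 200.0, 210.0, 190.0],
    "gene_uninformative": [10.0, 200.0, 11.0, 12.0, 210.0, 13.0],
}
scores = {
    name: chi_square_score(vals, labels, threshold=100.0)
    for name, vals in genes.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

For a two-bin, two-class table, a split that separates the classes perfectly gives a score equal to the number of samples, which is the largest value the statistic can take in that setting.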
Figure 3. Prediction performance of X-AI on the test sets of the two datasets for different numbers of genes. The y-axis shows classification accuracy and the x-axis the number of genes used for classification. L1: the dataset of Golub et al. [1]; L2: the dataset of Armstrong et al. [20].
Figure 4. Comparison of prediction performance between different methods. The y-axis denotes the number of samples misclassified by each method on the test set of L1, and the x-axis the number of genes used. Methods compared: voting machine [1]; SVM [31]; emerging patterns [32]; MAMA [33]; J48, NB, SMO-CFS, SMO-Wrapper [30]; RIRLS, RPLS, RPCR, FPLS, MAVE, k-NN [34].
Figure 5. Comparison of prediction performance between different methods. The y-axis denotes the number of samples misclassified by each method on the test set of L2, and the x-axis the number of genes used. Methods compared: classification based on correlation/ordering network [35]; HC-TSP, HC-k-TSP, DT, NB, k-NN, SVM, PAM [19].
Two different classes of rules generated from dataset L1
| Consequent | Antecedent | Support (%) | Confidence (%) |
| ALL | L09209_s > 1056.5 & M23197 > 326.0 | 30.56 | 100 |
| ALL | M23197 > 401.5 | 29.17 | 100 |
| ALL | M27891 > 2096.5 | 27.78 | 100 |
| ALL | X95735 > 994.0 & M55150 > 1250.5 | 27.78 | 100 |
| ALL | X95735 > 994.0 | 36.11 | 92 |
| AML | U46499 < 154.5 | 59.72 | 100 |
| AML | L09209_s < 992.5 | 58.33 | 100 |
| AML | X95735 < 994.0 | 63.89 | 98 |
| Mean | | 41.67 | 99 |
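The support and confidence columns can be illustrated with a small sketch. It assumes the usual rule-induction reading, in which support is the share of samples matching the antecedent and confidence is the share of those matching samples that belong to the consequent class. This reading is consistent with the table, where the two complementary X95735 rules have supports of 36.11% and 63.89%, summing to 100%. The function name `rule_metrics` and the expression values are hypothetical.

```python
def rule_metrics(samples, labels, antecedent, consequent):
    """Return (support %, confidence %) for one rule.

    support    = share of all samples for which the antecedent holds
    confidence = share of those samples whose label is the consequent
    """
    matched = [y for x, y in zip(samples, labels) if antecedent(x)]
    support = 100.0 * len(matched) / len(samples)
    if not matched:
        return support, 0.0
    hits = sum(1 for y in matched if y == consequent)
    confidence = 100.0 * hits / len(matched)
    return support, confidence


# Invented expression values for one probe; only the rule's form
# (probe, threshold, consequent) is copied from the table.
samples = [
    {"X95735": 1200.0}, {"X95735": 1500.0}, {"X95735": 1000.0},
    {"X95735": 300.0}, {"X95735": 500.0}, {"X95735": 800.0},
]
labels = ["ALL", "ALL", "AML", "AML", "AML", "AML"]

support, confidence = rule_metrics(
    samples, labels,
    antecedent=lambda s: s["X95735"] > 994.0,
    consequent="ALL",
)
print(f"support={support:.2f}%, confidence={confidence:.2f}%")
```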
Three different classes of rules generated from dataset L2
| Consequent | Antecedent | Support (%) | Confidence (%) |
| ALL | 32847_at > 147.0 | 30.56 | 100 |
| ALL | 36239_at > 2201.0 | 27.78 | 100 |
| AML | 39318_at < 1063.0 & 32579_at < 2285.0 | 34.72 | 100 |
| AML | 1325_at < 1501.5, 39318_at < 1063.0 & 32579_at < 2285.0 | 34.72 | 100 |
| AML | 1325_at < 1501.5, 36239_at < 214.0 & 40191_s_at < 508.5 | 33.33 | 100 |
| AML | 36239_at < 214.0 & 40191_s_at < 508.5 | 33.33 | 100 |
| AML | 39318_at < 1063.0 & 35164_at < -794.5 | 31.94 | 100 |
| AML | 40191_s_at < 519.0 & 36239_at < 167.0 | 31.94 | 100 |
| AML | 1325_at < 1501.5, 39318_at < 1063.0 & 35164_at < -794.5 | 31.94 | 100 |
| AML | 1325_at < 1501.5, 40191_s_at < 519.0 & 36239_at < 167.0 | 31.94 | 100 |
| AML | 1325_at < 1501.5, 36239_at < 214.0 & 37539_at < -362.0 | 31.94 | 100 |
| AML | 36239_at < 214.0 & 37539_at < -362.0 | 31.94 | 100 |
| AML | 37539_at < -725.5 | 29.17 | 100 |
| AML | 32579_at < 2285.0 | 36.11 | 96 |
| AML | 1325_at < 1501.5 & 32579_at < 2285.0 | 36.11 | 96 |
| AML | 36239_at < 214.0 | 40.28 | 93 |
| MLL | 1325_at < 201.0, 35260_at > 794.5 & 40191_s_at > 1107.5 | 19.44 | 100 |
| MLL | 1325_at < 201.0 & 36239_at > 214.0 | 23.61 | 94 |
| MLL | 1325_at < 201.0 | 37.50 | 67 |
| Mean | | 32.02 | 97 |
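To check a rule from the table against a new expression profile, the antecedent text can be parsed into conditions and evaluated. The sketch below assumes that both "," and "&" denote a logical AND between conditions, which the record does not state explicitly. The helper names `parse_antecedent` and `antecedent_holds` and the sample values are hypothetical.

```python
import re

# One condition looks like "<probe> <op> <number>", e.g. "35164_at < -794.5".
_CONDITION = re.compile(r"^\s*(\S+)\s*([<>])\s*(-?\d+(?:\.\d+)?)\s*$")


def parse_antecedent(text):
    """Split an antecedent string into (probe, operator, threshold) tuples.

    Both "," and "&" are treated as AND separators (an assumption).
    """
    conditions = []
    for part in re.split(r"[,&]", text):
        match = _CONDITION.match(part)
        if match is None:
            raise ValueError(f"cannot parse condition: {part!r}")
        probe, op, threshold = match.groups()
        conditions.append((probe, op, float(threshold)))
    return conditions


def antecedent_holds(conditions, sample):
    """Return True only if every condition is satisfied by the sample."""
    for probe, op, threshold in conditions:
        value = sample[probe]
        if op == "<" and not value < threshold:
            return False
        if op == ">" and not value > threshold:
            return False
    return True


# Rule text copied from the table; the expression values are invented.
rule = parse_antecedent(
    "1325_at < 201.0, 35260_at > 794.5 & 40191_s_at > 1107.5"
)
sample = {"1325_at": 150.0, "35260_at": 900.0, "40191_s_at": 1200.0}
print(antecedent_holds(rule, sample))
```

Keeping the parsed rule as plain tuples makes it easy to reuse the same conditions for computing support and confidence over a whole dataset.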
Figure 6. Snapshot of the prediction page of the web service for cancer classification.