Yuichi Tokuda, Tomohito Yagi, Kengo Yoshii, Yoko Ikeda, Masahiro Fuwa, Morio Ueno, Masakazu Nakano, Natsue Omi, Masami Tanaka, Kazuhiko Mori, Masaaki Kageyama, Ikumitsu Nagasaki, Katsumi Yagi, Shigeru Kinoshita, Kei Tashiro.
Abstract
Primary open-angle glaucoma (POAG) is one of the major causes of blindness worldwide and is considered to be influenced by both inherited and environmental factors. Recently, we conducted a genome-wide association study of susceptibility to POAG by comparing patients and controls. In addition, serum cytokine levels, which are affected by environmental and postnatal factors, could also be obtained simultaneously from both patients and controls. Here, to achieve effective diagnostic prediction of POAG, we developed an "integration approach" in which data of different attributes are combined in a simple manner using several machine learning methods and random sampling. Two data sets were prepared for this study: a "training data set" consisting of 42 POAG patients and 42 controls, and a "test data set" consisting of 73 POAG patients and 52 controls. We first examined the genotype and cytokine data using the training data set with general machine learning methods. After the integration approach was applied, we obtained stable accuracy using the support vector machine method with the radial basis function. Although our approach was based on well-known machine learning methods and a simple process, we demonstrated that integrating two kinds of attributes, genotype and cytokines, was effective and helpful in the diagnostic prediction of POAG.
Keywords: GWAS; Glaucoma; Integration approach; Machine learning
Year: 2012 PMID: 23961367 PMCID: PMC3725912 DOI: 10.1186/2193-1801-1-41
Source DB: PubMed Journal: Springerplus ISSN: 2193-1801
Clinical characteristics of samples
| Characteristic | Training data set: POAG | Training data set: Control | Test data set: POAG | Test data set: Control |
|---|---|---|---|---|
| Number of samples | 42 | 42 | 73 | 52 |
| Female / male ratio | 1.00 | 0.83 | 0.62 | 1.74 |
| Age at blood sampling | 56.4 ± 5.5 | 55.3 ± 3.4 | 70.9 ± 10.7 | 61.8 ± 11.3 |
| Storage period of blood (days) | 880.1 ± 112.0 | 865.7 ± 106.0 | 1044.0 ± 114.4 | 892.2 ± 129.9 |
Summary of 29 SNPs used in this study
| dbSNP ID | Chr. | SNP type | Nearest gene | Genotype frequency |
|---|---|---|---|---|
| rs547984 | 1 | intergenic | ZP4 | AA(0.263) AC(0.488) CC(0.249) |
| rs1892116 | 1 | intronic | AHCTF1 | AA(0.507) AG(0.445) GG(0.048) |
| rs4666488 | 2 | intergenic | OSR1 | AA(0.100) AG(0.397) GG(0.503) |
| rs2268794 | 2 | intronic | SRD5A2 | AA(0.005) AT(0.319) TT(0.676) |
| rs7574012 | 2 | intergenic | QPCT | AA(0.373) AG(0.459) GG(0.168) |
| rs1990702 | 2 | intergenic | LRP2 | GG(0.120) GA(0.433) AA(0.447) |
| rs10930437 | 2 | intergenic | SP5 | AA(0.429) AG(0.454) GG(0.117) |
| rs779701 | 3 | intronic | GRM7 | AA(0.490) AG(0.413) GG(0.097) |
| rs6550783 | 3 | intergenic | UBE2E1 | AA(0.412) AG(0.442) GG(0.146) |
| rs6550308 | 3 | intergenic | ARPP21 | GG(0.215) GA(0.488) AA(0.297) |
| rs3922704 | 3 | intronic | PLCXD2 | CC(0.034) CG(0.254) GG(0.712) |
| rs17279573 | 4 | intergenic | KIAA0922 | GG(0.120) GA(0.483) AA(0.397) |
| rs818725 | 5 | intronic | ADAMTS12 | CC(0.019) CG(0.226) GG(0.755) |
| rs11750584 | 5 | intergenic | HEATR7B2 | CC(0.029) CG(0.292) GG(0.679) |
| rs9640055 | 7 | intronic | GLCCI1 | GG(0.038) GA(0.344) AA(0.618) |
| rs2966712 | 7 | intergenic | LOC285965 | AA(0.005) AG(0.211) GG(0.784) |
| rs411102 | 9 | intergenic | KRT8P11 | GG(0.749) GA(0.242) AA(0.009) |
| rs7850541 | 9 | intergenic | GBGT1 | GG(0.514) GA(0.361) AA(0.125) |
| rs7081455 | 10 | intergenic | PLXDC2 | AA(0.644) AC(0.293) CC(0.063) |
| rs493622 | 11 | intergenic | CHORDC1 | AA(0.565) AC(0.383) CC(0.052) |
| rs610160 | 11 | intronic | GRIA4 | AA(0.693) AG(0.262) GG(0.045) |
| rs7961953 | 12 | intronic | TMTC2 | GG(0.522) GA(0.397) AA(0.081) |
| rs10492680 | 13 | intergenic | FLJ42392 | GG(0.005) GA(0.187) AA(0.808) |
| rs1571379 | 14 | intergenic | SEL1L | AA(0.440) AG(0.454) GG(0.106) |
| rs9788983 | 17 | intronic | RPH3AL | AA(0.770) AG(0.215) GG(0.015) |
| rs16940484 | 18 | intronic | TTC39C | GG(0.469) GA(0.450) AA(0.081) |
| rs2864107 | 19 | intergenic | ZNF175 | GG(0.684) GA(0.301) AA(0.015) |
| rs6115865 | 20 | intergenic | C20orf194 | AA(0.125) AG(0.428) GG(0.447) |
| rs5765558 | 22 | intergenic | ATXN10 | AA(0.287) AG(0.478) GG(0.235) |
dbSNP IDs are from build 130. Chr. denotes the chromosome number. The nearest gene is the gene located closest to each SNP, with reference to NCBI Build 36. Genotype frequencies were calculated from all samples used in this study (115 POAG patients and 94 healthy control volunteers).
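As a minimal sketch of how genotype frequencies like those in the table can be derived from per-sample genotype calls, the snippet below counts each genotype and divides by the number of samples. The function name `genotype_frequencies` and the toy calls are hypothetical and are not taken from the study data.

```python
from collections import Counter


def genotype_frequencies(calls):
    """Return the relative frequency of each genotype, rounded to 3 decimals.

    `calls` is a list of genotype strings such as "AA", "AC", "CC",
    one entry per sample. (Hypothetical helper, for illustration only.)
    """
    if not calls:
        raise ValueError("at least one genotype call is required")
    counts = Counter(calls)
    total = sum(counts.values())
    return {genotype: round(n / total, 3) for genotype, n in counts.items()}


# Toy input (NOT study data): 20 samples for a single A/C SNP.
toy_calls = ["AA"] * 5 + ["AC"] * 10 + ["CC"] * 5
toy_freqs = genotype_frequencies(toy_calls)
print(toy_freqs)
```

With the toy input, the three frequencies sum to 1, as do the three values listed for each SNP in the table up to rounding.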
Summary of the three cytokines used in the integration approach
| Cytokine | Group | Training data set: Concentration | Training data set: P-value* | Test data set: Concentration | Test data set: P-value* |
|---|---|---|---|---|---|
| Fas Ligand | POAG | 63.5 (52.2-87.3) | 0.002 | 37.5 (31.8-46.6) | 0.877 |
| Fas Ligand | Control | 53.3 (34.9-63.4) | | 36.2 (28.0-45.4) | |
| Eotaxin | POAG | 309.1 (273.6-342.9) | 0.038 | 70.6 (54.9-90.8) | 0.013 |
| Eotaxin | Control | 268.5 (236.7-311.6) | | 63.5 (54.4-73.9) | |
| MIG | POAG | 410.9 (306.8-524.9) | 0.021 | 318.1 (182.9-511.7) | 0.109 |
| MIG | Control | 340.4 (198.9-470.1) | | 148.4 (117.7-241.9) | |
“Concentration” represents the median concentration and interquartile range. * P-value of the comparison between POAG and control calculated by Student’s t-test.
Prediction performance of each base classifier with genotype, cytokine, and integrated data
| Base classifier | Kernel | Data | Single analysis: Accuracy | Single analysis: Sensitivity | Single analysis: Specificity | Analysis with sampling*: Accuracy | Analysis with sampling*: Sensitivity | Analysis with sampling*: Specificity |
|---|---|---|---|---|---|---|---|---|
| LDA | | Genotype | 0.688 | 0.712 | 0.654 | 0.671 ± 0.011 | 0.693 ± 0.015 | 0.639 ± 0.014 |
| LDA | | Cytokine | 0.592 | 0.466 | 0.769 | 0.584 ± 0.010 | 0.457 ± 0.012 | 0.763 ± 0.010 |
| LDA | | Integrated | 0.632 | 0.616 | 0.654 | 0.655 ± 0.022 | 0.611 ± 0.034 | 0.717 ± 0.015 |
| SVM | linear | Genotype | 0.664 | 0.699 | 0.615 | 0.683 ± 0.013 | 0.754 ± 0.023 | 0.584 ± 0.016 |
| SVM | linear | Cytokine | 0.568 | 0.452 | 0.731 | 0.577 ± 0.008 | 0.458 ± 0.012 | 0.745 ± 0.013 |
| SVM | linear | Integrated | 0.659 | 0.648 | 0.673 | 0.668 ± 0.014 | 0.640 ± 0.024 | 0.706 ± 0.012 |
| SVM | polynomial | Genotype | 0.648 | 0.589 | 0.731 | 0.633 ± 0.010 | 0.539 ± 0.026 | 0.764 ± 0.018 |
| SVM | polynomial | Cytokine | 0.512 | 0.658 | 0.308 | 0.457 ± 0.012 | 0.275 ± 0.077 | 0.713 ± 0.086 |
| SVM | polynomial | Integrated | 0.656 | 0.521 | 0.846 | 0.624 ± 0.010 | 0.480 ± 0.065 | 0.827 ± 0.078 |
| SVM | RBF | Genotype | 0.688 | 0.712 | 0.654 | 0.676 ± 0.010 | 0.685 ± 0.016 | 0.664 ± 0.013 |
| SVM | RBF | Cytokine | 0.648 | 0.712 | 0.558 | 0.662 ± 0.006 | 0.701 ± 0.011 | 0.607 ± 0.020 |
| SVM | RBF | Integrated | 0.744 | 0.767 | 0.712 | 0.740 ± 0.013 | 0.805 ± 0.020 | 0.650 ± 0.014 |
| NBC | | Genotype | 0.640 | 0.671 | 0.596 | 0.630 ± 0.006 | 0.651 ± 0.013 | 0.601 ± 0.014 |
| NBC | | Cytokine | 0.624 | 0.479 | 0.827 | 0.621 ± 0.006 | 0.489 ± 0.013 | 0.807 ± 0.019 |
| NBC | | Integrated | 0.744 | 0.767 | 0.712 | 0.698 ± 0.013 | 0.644 ± 0.027 | 0.775 ± 0.051 |
| DT | | Genotype | 0.536 | 0.342 | 0.808 | 0.562 ± 0.025 | 0.411 ± 0.070 | 0.774 ± 0.043 |
| DT | | Cytokine | 0.624 | 0.904 | 0.231 | 0.605 ± 0.018 | 0.874 ± 0.099 | 0.226 ± 0.126 |
| DT | | Integrated | 0.600 | 0.959 | 0.096 | 0.617 ± 0.013 | 0.668 ± 0.032 | 0.545 ± 0.040 |
*Values are the mean and SD of each statistic. The means include extremely good or bad results, especially those obtained with a small sampling size and few sampling repeats.
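For readers less familiar with the three statistics in the table, the following sketch shows how accuracy, sensitivity, and specificity follow from confusion-matrix counts. The counts are illustrative assumptions chosen to match the group sizes of the test data set (73 POAG, 52 controls); the source does not report confusion matrices, and the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute (accuracy, sensitivity, specificity) from confusion counts.

    tp/fn: POAG samples predicted as POAG / as control.
    tn/fp: control samples predicted as control / as POAG.
    """
    sensitivity = tp / (tp + fn)   # fraction of POAG samples detected
    specificity = tn / (tn + fp)   # fraction of controls correctly rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return accuracy, sensitivity, specificity


# ASSUMED counts (not from the source): 52 of 73 POAG samples and
# 34 of 52 control samples classified correctly.
acc, sens, spec = classification_metrics(tp=52, fn=21, tn=34, fp=18)
print(round(acc, 3), round(sens, 3), round(spec, 3))  # 0.688 0.712 0.654
```

These assumed counts happen to reproduce, after rounding, the values listed for LDA with genotype data in the single analysis, which suggests the statistics are of this standard form, but this is an inference rather than something the source states.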
Figure 1. Scatter plot showing the ratio of POAG predictions for each sample. (a) Example scatter plot. The horizontal axis represents the ratio of positive predictions using the genotype data. A positive prediction indicates a sample with POAG features, and a negative prediction indicates a sample with control features. The ratio was obtained by dividing the number of positive predictions by the total number of tests; thus, “1” and “0” indicate that 100% of the predictions were positive and negative, respectively. The vertical axis similarly represents the ratio using the cytokine data. Dots and triangles represent POAG and control samples, respectively. For example, if a POAG sample was predicted as positive 60 times using the genotype data and 80 times using the cytokine data, each with 100 sampling repeats, the sample is plotted as a dot at (0.6, 0.8). If the approach performs well (that is, gives strongly negative or strongly positive predictions) for samples with interaction between the two attributes, more samples will be plotted in corner I or corner IV. If either the genotype or the cytokine data indicate a risk for POAG, such samples will be plotted in corner II or corner III, respectively. The diagonal line shows the prediction threshold of the integration approach: a sample plotted above or below the threshold receives a positive or negative final prediction, respectively. (b) An example of a comparatively small and unstable result, obtained with a sampling size of 40 and 201 sampling repeats using the RBF SVM method. (c) An example of the most stable result, obtained with a sampling size of 70 and 2,001 sampling repeats using the RBF SVM method.
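The plotting rule in the caption can be sketched in a few lines of Python. The positive-prediction ratio follows the caption directly. The form of the diagonal threshold is an assumption: the caption does not give its equation, so the sketch takes it to be the line x + y = 1 and treats points exactly on the line as negative. The function names are hypothetical.

```python
def positive_ratio(n_positive, n_repeats):
    """Ratio of positive (POAG) predictions over all sampling repeats."""
    if n_repeats <= 0:
        raise ValueError("n_repeats must be positive")
    return n_positive / n_repeats


def integrated_prediction(genotype_ratio, cytokine_ratio):
    """Final call from the two ratios.

    ASSUMPTION: the diagonal threshold is x + y = 1; the source only says
    that samples above the line are positive and samples below are negative.
    Points exactly on the line are treated as negative here.
    """
    return "POAG" if genotype_ratio + cytokine_ratio > 1.0 else "control"


# Example from the caption: 60 and 80 positive predictions out of 100 repeats.
x = positive_ratio(60, 100)   # genotype axis
y = positive_ratio(80, 100)   # cytokine axis
label = integrated_prediction(x, y)
print((x, y), label)
```

Under this assumed threshold, the sample at (0.6, 0.8) lies above the line and receives a positive final prediction.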