Anunchai Assawamakin, Supakit Prueksaaroon, Supasak Kulawonganunchai, Philip James Shaw, Vara Varavithya, Taneth Ruangrajitpakorn, Sissades Tongsima.
Abstract
Identification of suitable biomarkers for accurate prediction of phenotypic outcomes is a goal for personalized medicine. However, current machine learning approaches are either too complex or perform poorly. Here, a novel two-step machine-learning framework is presented to address this need. First, a Naïve Bayes estimator is used to rank features; the top-ranked features are the most likely to be informative for predicting the underlying biological classes. These top-ranked features are then used in a Hidden Naïve Bayes classifier to construct a classification prediction model. To obtain the minimum set of the most informative biomarkers, the bottom-ranked features are removed from the Naïve Bayes-filtered feature list one at a time, and the classification accuracy of the Hidden Naïve Bayes classifier is checked for each pruned feature set. The performance of the proposed two-step Bayes classification framework was tested on different types of -omics datasets, including gene expression microarray, single nucleotide polymorphism microarray (SNP array), and surface-enhanced laser desorption/ionization time-of-flight (SELDI-TOF) proteomic data. The proposed framework equaled and, in some cases, outperformed other classification methods in terms of prediction accuracy, minimum number of classification markers, and computational time.
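The two steps described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not from the paper: a plain Gaussian Naïve Bayes written with the Python standard library stands in for the Hidden Naïve Bayes classifier, features are ranked by the accuracy of a one-feature Naïve Bayes model on the training data, and pruning is scored on a held-out validation split. The function names (`fit_gaussian_nb`, `two_step_select`, and so on) and the synthetic data are hypothetical.

```python
import math
import random

def fit_gaussian_nb(X, y):
    """Estimate class priors and per-feature Gaussian (mean, variance) per class."""
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        stats = []
        for col in zip(*rows):
            mu = sum(col) / len(col)
            # Small floor keeps the variance strictly positive.
            var = sum((v - mu) ** 2 for v in col) / len(col) + 1e-9
            stats.append((mu, var))
        model[label] = (len(rows) / len(X), stats)
    return model

def predict(model, x):
    """Return the class with the highest log posterior for sample x."""
    def log_posterior(label):
        prior, stats = model[label]
        lp = math.log(prior)
        for v, (mu, var) in zip(x, stats):
            lp -= 0.5 * math.log(2 * math.pi * var) + (v - mu) ** 2 / (2 * var)
        return lp
    return max(model, key=log_posterior)

def accuracy(X_fit, y_fit, X_eval, y_eval, features):
    """Fit on the chosen feature columns, then score on the evaluation samples."""
    def pick(X):
        return [[row[j] for j in features] for row in X]
    model = fit_gaussian_nb(pick(X_fit), y_fit)
    hits = sum(predict(model, x) == t for x, t in zip(pick(X_eval), y_eval))
    return hits / len(y_eval)

def two_step_select(X_tr, y_tr, X_va, y_va, top_k):
    # Step 1 (assumed criterion): rank features by single-feature NB accuracy.
    ranked = sorted(range(len(X_tr[0])),
                    key=lambda j: accuracy(X_tr, y_tr, X_tr, y_tr, [j]),
                    reverse=True)
    kept = ranked[:top_k]
    best_feats, best_acc = list(kept), accuracy(X_tr, y_tr, X_va, y_va, kept)
    # Step 2: drop the bottom-ranked feature one at a time and re-check accuracy.
    # A Gaussian NB stands in here for the Hidden Naive Bayes classifier.
    while len(kept) > 1:
        kept = kept[:-1]
        acc = accuracy(X_tr, y_tr, X_va, y_va, kept)
        if acc >= best_acc:  # on ties, prefer the smaller marker set
            best_feats, best_acc = list(kept), acc
    return best_feats, best_acc

# Hypothetical synthetic data: feature 0 separates the classes, features 1-9 are noise.
rng = random.Random(0)

def make_samples(n):
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([4 * label + rng.uniform(-1, 1)] + [rng.gauss(0, 1) for _ in range(9)])
        y.append(label)
    return X, y

X_tr, y_tr = make_samples(80)
X_va, y_va = make_samples(40)
best_feats, best_acc = two_step_select(X_tr, y_tr, X_va, y_va, top_k=5)
print("selected features:", best_feats, "validation accuracy:", best_acc)
```

Keeping the smaller set on ties mirrors the stated goal of finding a minimum marker set; a stricter or looser tolerance on the accuracy drop would be an equally reasonable design choice.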
Year: 2013; PMID: 24106694; PMCID: PMC3784073; DOI: 10.1155/2013/148014
Source DB: PubMed; Journal: Biomed Res Int; Impact factor: 3.411
Figure 1. Empirical testing of NB selection using the breast cancer dataset. The training breast cancer dataset was sampled 1 million times for the lower-ranked marker set and 100,000 times for the top 40-ranked marker set.
Figure 2. Empirical testing of NB selection using the leukemia dataset. The training leukemia dataset was sampled 1 million times for the lower-ranked marker set and 100,000 times for the top 40-ranked marker set.
Figure 3. Empirical testing of NB selection using the colon cancer dataset. The training colon cancer dataset was sampled 1 million times for the lower-ranked marker set and 100,000 times for the top 40-ranked marker set.
Actual performance results on breast cancer (KRBDSR). The compared methods span the filter, wrapper, and hybrid categories.
| Criterion | Fisher's ratio | RFE-LNW-GD | RFE-SVM | RFE-LSSVM | RFE-RR | RFE-FLDA | RFE-LNW1 | RFE-LNW2 | RFE-FSVs-7DK | NB-HNB |
|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.88 | 0.78 | 0.76 | 0.75 | 0.74 | 0.75 | 0.82 | 0.88 | 0.85 | 0.91 |
| Sensitivity, specificity | 0.83, 0.90 | 0.77, 0.81 | 0.68, 0.80 | 0.68, 0.80 | 0.68, 0.77 | 0.69, 0.80 | 0.74, 0.88 | 0.82, 0.90 | 0.84, 0.86 | 0.91, 0.91 |
| Number of genes selected | 35 | 26 | 33 | 36 | 39 | 28 | 35 | 33 | 21 | 25 |
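For reference, the three criteria reported in these tables follow the standard binary-classification definitions. The sketch below uses hypothetical confusion-matrix counts, not values from the paper, and the function name `classification_metrics` is illustrative only.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute accuracy, sensitivity, and specificity from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # fraction of positive samples correctly identified
    specificity = tn / (tn + fp)   # fraction of negative samples correctly identified
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return accuracy, sensitivity, specificity

# Hypothetical counts chosen only for illustration.
acc, sens, spec = classification_metrics(tp=40, fn=10, tn=45, fp=5)
print(f"accuracy={acc:.2f}, sensitivity={sens:.2f}, specificity={spec:.2f}")
```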
Actual performance results on leukemia (KRBDSR). The compared methods span the filter, wrapper, and hybrid categories.
| Criterion | Fisher's ratio | RFE-LNW-GD | RFE-SVM | RFE-LSSVM | RFE-RR | RFE-FLDA | RFE-LNW1 | RFE-LNW2 | RFE-FSVs-7DK | NB-HNB |
|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.99 | 0.99 | 0.99 | 0.99 | 0.48 | 0.997 | 0.96 | 0.99 | 0.98 | 1.00 |
| Sensitivity, specificity | 0.95, 1.00 | 1.00, 0.99 | 0.95, 1.00 | 0.98, 0.99 | 1.00, 0.31 | 0.99, 1.00 | 0.90, 0.98 | 0.95, 1.00 | 0.91, 1.00 | 1.00, 1.00 |
| Number of genes selected | 4 | 5 | 4 | 30 | 6 | 5 | 4 | 4 | 3 | 14 |
Actual performance results on colon cancer (KRBDSR). The compared methods span the filter, wrapper, and hybrid categories.
| Criterion | Fisher's ratio | RFE-LNW-GD | RFE-SVM | RFE-LSSVM | RFE-RR | RFE-FLDA | RFE-LNW1 | RFE-LNW2 | RFE-FSVs-7DK | NB-HNB |
|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.90 | 0.87 | 0.87 | 0.91 | 0.83 | 0.89 | 0.91 | 0.89 | 0.91 | 0.93 |
| Sensitivity, specificity | 0.92, 0.88 | 0.89, 0.85 | 0.92, 0.79 | 0.97, 0.81 | 0.77, 0.91 | 0.93, 0.84 | 0.93, 0.88 | 0.93, 0.84 | 0.93, 0.89 | 0.93, 0.90 |
| Number of genes selected | 16 | 17 | 16 | 22 | 19 | 14 | 10 | 15 | 12 | 23 |
Performance comparison between NB-HNB and SVM-RFE on GEMLeR datasets.
| Data | NB-HNB accuracy | NB-HNB number of genes selected | SVM-RFE accuracy | SVM-RFE number of genes selected |
|---|---|---|---|---|
| AP_Breast_Colon | 0.96 | 22 | 0.96 | 8 |
| AP_Breast_Kidney | 0.96 | 17 | 0.96 | 8 |
| AP_Breast_Lung | 0.94 | 27 | 0.94 | 16 |
| AP_Breast_Omentum | 0.95 | 25 | 0.96 | 32 |
| AP_Breast_Ovary | 0.96 | 17 | 0.96 | 16 |
| AP_Breast_Prostate | 0.99 | 28 | 0.99 | 8 |
| AP_Breast_Uterus | 0.96 | 27 | 0.95 | 8 |
| AP_Colon_Kidney | 0.97 | 10 | 0.98 | 32 |
| AP_Colon_Lung | 0.95 | 17 | 0.94 | 32 |
| AP_Colon_Omentum | 0.95 | 18 | 0.94 | 32 |
| AP_Colon_Ovary | 0.95 | 11 | 0.94 | 16 |
| AP_Colon_Prostate | 0.98 | 20 | 0.98 | 8 |
| AP_Colon_Uterus | 0.96 | 10 | 0.95 | 16 |
| AP_Endometrium_Breast | 0.97 | 20 | 0.97 | 32 |
| AP_Endometrium_Colon | 0.95 | 21 | 0.97 | 32 |
| AP_Endometrium_Kidney | 0.98 | 17 | 0.98 | 32 |
| AP_Endometrium_Lung | 0.94 | 27 | 0.95 | 32 |
| AP_Endometrium_Omentum | 0.92 | 14 | 0.9 | 32 |
| AP_Endometrium_Ovary | 0.91 | 12 | 0.92 | 32 |
| AP_Endometrium_Prostate | 0.98 | 20 | 0.99 | 4 |
| AP_Endometrium_Uterus | 0.9 | 14 | 0.76 | 256 |
| AP_Lung_Kidney | 0.96 | 7 | 0.96 | 32 |
| AP_Lung_Uterus | 0.93 | 22 | 0.93 | 32 |
| AP_Omentum_Kidney | 0.97 | 18 | 0.98 | 16 |
| AP_Omentum_Lung | 0.94 | 24 | 0.9 | 128 |
| AP_Omentum_Ovary | 0.98 | 27 | 0.76 | 4 |
| AP_Omentum_Prostate | 0.98 | 30 | 0.98 | 16 |
| AP_Omentum_Uterus | 0.91 | 15 | 0.88 | 16 |
| AP_Ovary_Kidney | 0.97 | 14 | 0.97 | 32 |
| AP_Ovary_Lung | 0.94 | 15 | 0.93 | 32 |
| AP_Ovary_Uterus | 0.88 | 21 | 0.89 | 64 |
| AP_Prostate_Kidney | 0.98 | 20 | 0.98 | 2 |
| AP_Prostate_Lung | 0.98 | 14 | 0.98 | 4 |
| AP_Prostate_Ovary | 0.98 | 19 | 0.98 | 2 |
| AP_Prostate_Uterus | 0.97 | 28 | 0.99 | 2 |
| AP_Uterus_Kidney | 0.96 | 12 | 0.97 | 32 |
| Average (AP) | 0.954 | 18.89 | 0.94 | 30.5 |
| Standard deviation (AP) | 0.02568 | | 0.05357 | |
| OVA_Breast | 0.94 | 15 | 0.96 | 32 |
| OVA_Colon | 0.96 | 19 | 0.97 | 16 |
| OVA_Endometrium | 0.97 | 6 | 0.96 | 2 |
| OVA_Kidney | 0.98 | 20 | 0.98 | 8 |
| OVA_Lung | 0.97 | 24 | 0.97 | 4 |
| OVA_Omentum | 0.95 | 3 | 0.95 | 2 |
| OVA_Ovary | 0.92 | 10 | 0.93 | 32 |
| OVA_Prostate | 0.99 | 13 | 0.997 | 2 |
| OVA_Uterus | 0.97 | 21 | 0.93 | 32 |
| Average (OVA) | 0.96 | 14.55 | 0.96 | 14.44 |
| Standard deviation (OVA) | 0.02147 | | 0.02198 | |
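The summary rows for the OVA datasets can be recomputed from the nine OVA rows of the table. The short check below assumes the reported standard deviations are sample standard deviations (n - 1 denominator) of the accuracy columns; that reading is an inference from the numbers, not something the source states. Variable names are illustrative.

```python
from statistics import mean, stdev

# Values copied from the nine OVA_* rows of the table.
nb_hnb_acc = [0.94, 0.96, 0.97, 0.98, 0.97, 0.95, 0.92, 0.99, 0.97]
svm_rfe_acc = [0.96, 0.97, 0.96, 0.98, 0.97, 0.95, 0.93, 0.997, 0.93]
nb_hnb_genes = [15, 19, 6, 20, 24, 3, 10, 13, 21]
svm_rfe_genes = [32, 16, 2, 8, 4, 2, 32, 2, 32]

summary = {
    "NB-HNB": (mean(nb_hnb_acc), stdev(nb_hnb_acc), mean(nb_hnb_genes)),
    "SVM-RFE": (mean(svm_rfe_acc), stdev(svm_rfe_acc), mean(svm_rfe_genes)),
}

for name, (avg_acc, sd_acc, avg_genes) in summary.items():
    print(f"{name}: mean accuracy {avg_acc:.3f}, SD {sd_acc:.5f}, mean genes {avg_genes:.2f}")
```

Under that assumption the recomputed standard deviations round to 0.02147 and 0.02198, in agreement with the table. The NB-HNB mean gene count is 131/9, which the table appears to report truncated as 14.55.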
Figure 4. Comparison of average accuracy results over all datasets (Avg), 35 All-Paired datasets (AP) and 9 One-Versus-All (OVA) datasets.
Figure 5. AUC metrics comparing different approaches.
Actual performance result on SNPs data (Bovine) from IBHM.
| Method | Accuracy | Sensitivity | Specificity | Number of selected SNPs |
|---|---|---|---|---|
| NB-HNB | 0.92 | 0.92 | 0.99 | 33 |
Actual performance result of NB-HNB from SELDI-TOF.
| Dataset | Accuracy | Sensitivity | Specificity | Number of selected genes |
|---|---|---|---|---|
| Prostate | 0.86 | 0.86 | 0.89 | 8 |
| Ovarian | 0.98 | 0.98 | 0.97 | 8 |
Summary of the information about each dataset, for example, sample sizes, number of attributes.
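| Platform | Source | Dataset | Number of attributes | Attribute type (as labeled in source) |
|---|---|---|---|---|
| SNP array | IBHM | Bovine | 9239 | SNPs |
| cDNA microarray | KRBDSR | Leukemia | 7129 | genes |
| cDNA microarray | KRBDSR | Colon cancer | 2000 | genes |
| cDNA microarray | KRBDSR | Breast cancer | 24481 | genes |
| cDNA microarray | KRBDSR | Lymphoma | 4026 | genes |
| cDNA microarray | KRBDSR | Prostate | 12600 | genes |
| cDNA microarray | KRBDSR | Lung cancer | 7129 | genes |
| cDNA microarray | KRBDSR | Nervous | 7129 | genes |
| cDNA microarray | GEMLeR | Colon | 10935 | genes |
| cDNA microarray | GEMLeR | Breast | 10935 | genes |
| cDNA microarray | GEMLeR | Endometrium | 10935 | genes |
| cDNA microarray | GEMLeR | Kidney | 10935 | genes |
| cDNA microarray | GEMLeR | Lung | 10935 | genes |
| cDNA microarray | GEMLeR | Omentum | 10935 | genes |
| cDNA microarray | GEMLeR | Ovary | 10935 | genes |
| cDNA microarray | GEMLeR | Prostate | 10935 | genes |
| cDNA microarray | GEMLeR | Uterus | 10935 | genes |
| SELDI-TOF | NCICPD | Ovarian | 15154 | genes |
| SELDI-TOF | NCICPD | Prostate | 15154 | genes |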
Figure 6. The overall two-step Bayes classification framework.