Xiaofei Nan, Gang Fu, Zhengdong Zhao, Sheng Liu, Ronak Y Patel, Haining Liu, Pankaj R Daga, Robert J Doerksen, Xin Dang, Yixin Chen, Dawn Wilkins.
Abstract
BACKGROUND: It is commonly believed that including domain knowledge in a prediction model is desirable. However, representing and incorporating domain information in the learning process is, in general, a challenging problem. In this research, we consider domain information encoded by discrete or categorical attributes. A discrete or categorical attribute provides a natural partition of the problem domain, and hence divides the original problem into several non-overlapping sub-problems. In this sense, the domain information is useful if the partition simplifies the learning task. The goal of this research is to develop an algorithm to identify discrete or categorical attributes that maximally simplify the learning task.
Year: 2011 PMID: 22166097 PMCID: PMC3236845 DOI: 10.1186/1471-2105-12-S10-S22
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Examples of piece-wise separable classification problems. Three binary classification examples are illustrated here, where red/blue indicates positive/negative class. The figure shows that with the help of a categorical attribute X3, the three problems can be solved by simple hypothesis classes such as linear or polynomial models.
Figure 2. Restructuring a problem by one or more categorical attributes. Using one or more discrete or categorical attributes, the original problem is split into multiple sub-problems. If the proper attribute is selected in the restructuring process, each sub-problem will have a comparatively simpler target function.
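The restructuring idea in the Figure 2 caption can be illustrated with a minimal sketch. The perceptron learner, the toy data, and all names below (`fit_perceptron`, `fit_partitioned`, groups `"a"` and `"b"`) are assumptions made for illustration; they are not the models or data used in the paper. The toy labels flip sign between the two groups, so no single linear model fits the pooled data, while each sub-problem is linearly separable.

```python
from collections import defaultdict


def fit_perceptron(points, labels, epochs=100):
    """Fit a linear classifier sign(w.x + b) with the perceptron rule.

    Labels must be -1 or +1. Returns the weights with the bias stored last.
    """
    dim = len(points[0])
    w = [0.0] * (dim + 1)
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
            if y * score <= 0:  # misclassified, or exactly on the boundary
                for i in range(dim):
                    w[i] += y * x[i]
                w[-1] += y
                mistakes += 1
        if mistakes == 0:  # converged on the training data
            break
    return w


def predict(w, x):
    score = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
    return 1 if score > 0 else -1


def fit_partitioned(points, labels, groups):
    """Split the samples by categorical attribute value, one model per value."""
    buckets = defaultdict(lambda: ([], []))
    for x, y, g in zip(points, labels, groups):
        buckets[g][0].append(x)
        buckets[g][1].append(y)
    return {g: fit_perceptron(xs, ys) for g, (xs, ys) in buckets.items()}


# Hypothetical toy data: the label is sign(x) in group "a" and -sign(x) in group "b".
xs = [(-2.0,), (-1.0,), (1.0,), (2.0,)]
points = xs + xs
groups = ["a"] * 4 + ["b"] * 4
labels = [-1, -1, 1, 1] + [1, 1, -1, -1]

# Baseline: one linear model on the pooled data.
pooled = fit_perceptron(points, labels)
pooled_acc = sum(predict(pooled, x) == y for x, y in zip(points, labels)) / len(labels)

# Restructured: one linear model per value of the categorical attribute.
models = fit_partitioned(points, labels, groups)
part_acc = sum(
    predict(models[g], x) == y for x, y, g in zip(points, labels, groups)
) / len(labels)

print(pooled_acc, part_acc)
```

On this toy set every feature value occurs once with each label, so any single classifier is right on exactly half of the pooled samples, whereas the two per-group models separate their own sub-problems.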
Experimental Results of Artificial Data 1 (Fig. 1(a)) with Linear Model.
| Attribute | Conditional Entropy | Training CV Accuracy (%) | Test Accuracy (%) |
|---|---|---|---|
| Baseline | – | 59.6000 ± 3.2042 | 64.7750 ± 4.0285 |
| | 0.7860 ± 0.0044 | 99.5750 ± 0.2058 | 96.8607 ± 0.8680 |
| | 0.9001 ± 0.0035 | 61.1250 ± 1.7490 | 60.4881 ± 2.8090 |
Experimental Results of Artificial Data 2 (Fig. 1(b)) Using Two-degree Polynomial Kernel.
| Attribute | Conditional Entropy | Training CV Accuracy (%) | Test Accuracy (%) |
|---|---|---|---|
| Baseline | – | 71.9750 ± 6.4737 | 71.0500 ± 7.9292 |
| | 0.8980 ± 0.0061 | 94.1000 ± 0.8350 | 94.3071 ± 0.9204 |
| | 0.9514 ± 0.0043 | 73.4000 ± 1.4443 | 73.8682 ± 2.8535 |
Experimental Results of Artificial Data 3 (Fig. 1(c)) Using Two-degree Polynomial Kernel.
| Attribute | Conditional Entropy | Training CV Accuracy (%) | Test Accuracy (%) |
|---|---|---|---|
| Baseline | – | 73.1750 ± 5.7772 | 71.6025 ± 8.3302 |
| | 0.8455 ± 0.0059 | 96.5500 ± 0.8644 | 95.3658 ± 1.0224 |
| | 0.9328 ± 0.0032 | 72.8750 ± 1.5601 | 71.7689 ± 3.5528 |
Figure 3. Experimental results for biological activity prediction of glycogen synthase kinase-3β inhibitors. The categorical attributes were ranked by their estimated conditional entropies. We chose the 31 attributes with the smallest entropy values for problem partition. We restructured the learning problem according to each of these candidate attributes separately, and built linear models for each partition. Among the 31 attributes, there are 17 categorical attributes whose performance beat the baseline approach in terms of both cross-validation accuracy and test accuracy.
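The ranking step described in the Figure 3 caption can be sketched as follows. The estimator is not given in this record, and the kernel-specific ranks in the tables suggest the paper's estimate depends on the hypothesis class, so the plug-in estimate of H(Y | A) from label counts below is only a stand-in to show the rank-then-select workflow. The attribute names and the toy columns are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Plug-in Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def conditional_entropy(attr_values, labels):
    """Plug-in estimate of H(Y | A): label entropy averaged over attribute values."""
    by_value = defaultdict(list)
    for a, y in zip(attr_values, labels):
        by_value[a].append(y)
    n = len(labels)
    return sum(len(ys) / n * entropy(ys) for ys in by_value.values())


def rank_attributes(columns, labels, top_k):
    """Return the top_k (name, score) pairs with the smallest conditional entropy."""
    scores = {name: conditional_entropy(col, labels) for name, col in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1])[:top_k]


# Hypothetical toy data: three categorical attributes and a binary label.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "attr_a": ["x", "x", "x", "x", "y", "y", "y", "y"],  # determines the label
    "attr_b": ["p", "q", "p", "q", "p", "q", "p", "q"],  # independent of the label
    "attr_c": ["u", "u", "u", "u", "u", "u", "v", "v"],  # partly informative
}

ranked = rank_attributes(columns, labels, top_k=2)
for name, score in ranked:
    print(f"{name}: {score:.4f}")
```

Each selected attribute would then be used on its own to partition the data and train per-partition models, as the caption describes.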
Learning Performance for the Selected Categorical Attributes in Biological Activity Data of Glycogen Synthase Kinase-3β Inhibitors Using Linear Kernel.
| Attribute | Entropy list order | Training CV Accuracy (%) | Test Accuracy (%) |
|---|---|---|---|
| Baseline | – | 75.60 | 74.64 |
| nCIR | 1 | 79.21 | 75.01 |
| F06[N-O] | 2 | 76.35 | 74.86 |
| H-049 | 3 | 76.95 | 76.14 |
| nN | 7 | 77.38 | 74.78 |
| F04[N-N] | 8 | 78.55 | 75.10 |
| Bioassay Protocol | 9 | 79.78 | 76.76 |
| nHDon | 12 | 77.26 | 74.88 |
| H-050 | 13 | 77.26 | 74.88 |
| nDB | 15 | 77.74 | 74.78 |
| F07[C-Br] | 16 | 76.62 | 75.76 |
| F02[N-O] | 22 | 77.07 | 75.62 |
| N-075 | 23 | 78.65 | 76.83 |
| F06[C-Br] | 25 | 76.94 | 74.66 |
| F02[N-N] | 26 | 77.93 | 74.92 |
| N-074 | 30 | 76.78 | 76.39 |
| F03[N-N] | 31 | 77.44 | 74.81 |
Performance Comparison for the Selected Categorical Attributes in Biological Activity Data of Glycogen Synthase Kinase-3β Inhibitors Using Two-degree Polynomial Kernel and Gaussian Kernels.
| Attribute | Entropy list order | | | | Training CV Accuracy (%) | | | | Test Accuracy (%) | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Poly | Gaussian (0.01) | Gaussian (1) | Gaussian (10) | Poly | Gaussian (0.01) | Gaussian (1) | Gaussian (10) | Poly | Gaussian (0.01) | Gaussian (1) | Gaussian (10) |
| Baseline | – | – | – | – | 76.23 | 73.10 | 62.74 | 59.42 | 74.26 | 70.69 | 60.58 | 57.44 |
| nCIR | 3 | 2 | 1 | 1 | 78.84 | 75.41 | 64.48 | 60.15 | 74.55 | 71.23 | 61.26 | 58.02 |
| F06[N-O] | 2 | 1 | 2 | 2 | 77.62 | 73.23 | 63.34 | 60.23 | 73.28 | 70.49 | 60.87 | 56.95 |
| H-049 | 4 | 5 | 4 | 4 | 79.75 | 74.69 | 65.18 | 61.03 | 75.14 | 71.87 | 62.76 | 57.26 |
| nN | 1 | 6 | 6 | 7 | 79.24 | 74.87 | 64.77 | 60.49 | 75.23 | 71.04 | 62.38 | 57.15 |
| F04[N-N] | 7 | 3 | 5 | 6 | 78.32 | 74.14 | 63.14 | 60.63 | 74.16 | 70.02 | 61.79 | 57.69 |
| Bioassay Protocol | 8 | 7 | 3 | 5 | 79.15 | 75.54 | 65.15 | 62.25 | 76.03 | 72.87 | 63.76 | 59.34 |
| nHDon | 11 | 19 | 18 | 19 | 77.63 | 74.18 | 63.05 | 60.02 | 75.12 | 71.17 | 60.34 | 57.28 |
| H-050 | 21 | 7 | 7 | 9 | 76.95 | 73.57 | 63.72 | 60.35 | 74.34 | 71.09 | 59.28 | 56.94 |
| nDB | 13 | 24 | 21 | 25 | 75.37 | 73.89 | 62.83 | 59.25 | 73.22 | 70.18 | 60.47 | 56.74 |
| F07[C-Br] | 17 | 12 | 15 | 16 | 77.25 | 74.58 | 63.04 | 60.42 | 73.96 | 71.65 | 61.07 | 58.15 |
| F02[N-O] | 25 | 16 | 13 | 15 | 76.14 | 73.87 | 62.95 | 58.72 | 72.87 | 70.66 | 60.84 | 57.35 |
| N-075 | 20 | 17 | 17 | 21 | 78.06 | 74.92 | 63.74 | 60.87 | 75.64 | 71.29 | 62.88 | 59.04 |
| F06[C-Br] | 27 | 26 | 25 | 23 | 75.44 | 72.05 | 61.43 | 58.28 | 72.76 | 69.96 | 60.03 | 55.74 |
| F02[N-N] | 33 | 30 | 26 | 32 | 77.83 | 74.15 | 63.82 | 60.96 | 74.56 | 70.75 | 61.44 | 59.45 |
| N-074 | 29 | 35 | 33 | 34 | 76.54 | 73.47 | 63.95 | 60.42 | 74.75 | 71.03 | 60.58 | 57.96 |
| F03[N-N] | 36 | 31 | 34 | 37 | 75.69 | 74.26 | 62.65 | 59.35 | 73.48 | 70.33 | 59.87 | 57.28 |
Figure 4. Experimental results for cannabinoid receptor subtypes CB1 and CB2 activity prediction. The categorical attributes were ranked by their estimated conditional entropies, and the top 20 attributes were each used separately to partition the problem. Linear models were built for each partition. Among the 20 attributes, 8 perform better than the baseline approach in terms of both cross-validation accuracy and test accuracy.
Figure 5. Experimental results for cannabinoid receptor subtypes CB1 and CB2 selectivity prediction. The categorical attributes were ranked by their estimated conditional entropies, and the top 20 attributes were each used separately to partition the problem. Linear models were built for each partition. Among the 20 attributes, 5 perform better than the baseline approach in terms of both cross-validation accuracy and test accuracy.
Learning Performance for the Selected Categorical Attributes in Cannabinoid Receptor Subtypes CB1 and CB2 Activity Data Using Linear Model.
| Attribute | Entropy list order | Training CV Accuracy (%) | Test Accuracy (%) |
|---|---|---|---|
| Baseline | – | 85.43 | 84.36 |
| F01[N-O] | 1 | 86.20 | 84.37 |
| N-076 | 4 | 86.51 | 85.12 |
| nArNO2 | 5 | 86.36 | 85.07 |
| nCconj | 15 | 87.13 | 86.37 |
| C-034 | 16 | 86.82 | 86.04 |
| B01[N-O] | 17 | 86.82 | 84.46 |
| N-073 | 18 | 85.89 | 85.81 |
| nN(CO)2 | 19 | 86.05 | 84.49 |
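The selection criterion stated in the Figure 4 caption (better than the baseline on both cross-validation and test accuracy) can be checked against the linear-model rows of the CB1/CB2 activity table. The snippet below is a small sketch of that check; the numbers are copied from the table, while the function name `beats_baseline` and the dictionary layout are illustrative choices.

```python
# Baseline and per-attribute (CV accuracy, test accuracy) pairs, in percent,
# copied from the CB1/CB2 activity table for the linear model.
baseline = {"cv": 85.43, "test": 84.36}
rows = {
    "F01[N-O]": (86.20, 84.37),
    "N-076": (86.51, 85.12),
    "nArNO2": (86.36, 85.07),
    "nCconj": (87.13, 86.37),
    "C-034": (86.82, 86.04),
    "B01[N-O]": (86.82, 84.46),
    "N-073": (85.89, 85.81),
    "nN(CO)2": (86.05, 84.49),
}


def beats_baseline(cv, test, base):
    """True only if both accuracies are strictly above the baseline values."""
    return cv > base["cv"] and test > base["test"]


winners = [name for name, (cv, test) in rows.items() if beats_baseline(cv, test, baseline)]
print(len(winners), "of", len(rows), "listed attributes beat the baseline on both metrics")
```

All eight listed rows satisfy the criterion, which matches the count given in the caption; the margin is narrowest for F01[N-O], whose test accuracy exceeds the baseline by 0.01 percentage points.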
Learning Performance for the Selected Categorical Attributes in Cannabinoid Receptor Subtypes CB1 and CB2 Selectivity Data Using Linear Model.
| Attribute | Entropy list order | Training CV Accuracy (%) | Test Accuracy (%) |
|---|---|---|---|
| Baseline | – | 78.02 | 75.56 |
| O-058 | 1 | 80.99 | 77.35 |
| nDB | 2 | 81.73 | 75.94 |
| F06[C-Cl] | 5 | 78.27 | 75.63 |
| nCconj | 7 | 82.72 | 77.92 |
| C-026 | 8 | 78.55 | 77.73 |
Performance Comparison for the Selected Categorical Attributes in Cannabinoid Receptor Subtypes CB1 and CB2 Activity Data Using Two-degree Polynomial Model and Gaussian Models.
| Attribute | Entropy list order | | | | Training CV Accuracy (%) | | | | Test Accuracy (%) | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Poly | Gaussian | Gaussian | Gaussian | Poly | Gaussian | Gaussian | Gaussian | Poly | Gaussian | Gaussian | Gaussian |
| Baseline | – | – | – | – | 86.51 | 75.34 | 65.21 | 66.76 | 85.58 | 74.35 | 65.79 | 65.61 |
| F01[N-O] | 1 | 2 | 1 | 1 | 85.15 | 76.12 | 66.16 | 66.44 | 85.44 | 76.14 | 65.65 | 66.15 |
| N-076 | 4 | 5 | 4 | 4 | 87.50 | 77.05 | 66.98 | 67.33 | 86.12 | 76.89 | 66.34 | 66.79 |
| nArNO2 | 6 | 7 | 5 | 5 | 86.82 | 75.14 | 66.78 | 66.58 | 85.27 | 84.35 | 76.34 | 64.96 |
| nCconj | 16 | 14 | 10 | 12 | 86.61 | 77.12 | 67.03 | 66.79 | 83.31 | 76.72 | 63.77 | 65.74 |
| C-034 | 17 | 16 | 11 | 17 | 85.98 | 76.38 | 66.44 | 65.89 | 85.69 | 75.28 | 64.59 | 65.88 |
| B01[N-O] | 20 | 19 | 19 | 18 | 87.21 | 76.38 | 66.38 | 66.66 | 86.72 | 76.37 | 66.29 | 65.62 |
| N-073 | 21 | 20 | 21 | 21 | 84.96 | 74.79 | 65.02 | 65.26 | 84.15 | 75.34 | 64.45 | 63.71 |
| nN(CO)2 | 23 | 24 | 27 | 25 | 86.77 | 73.72 | 66.05 | 64.37 | 85.78 | 73.22 | 63.76 | 62.96 |
Performance Comparison for the Selected Categorical Attributes in Cannabinoid Receptor Subtypes CB1 and CB2 Selectivity Data Using Two-degree Polynomial Model and Gaussian Models.
| Attribute | Entropy list order | | | | Training CV Accuracy (%) | | | | Test Accuracy (%) | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Poly | Gaussian | Gaussian | Gaussian | Poly | Gaussian | Gaussian | Gaussian | Poly | Gaussian | Gaussian | Gaussian |
| Baseline | – | – | – | – | 76.04 | 67.15 | 57.28 | 57.67 | 74.89 | 65.12 | 54.84 | 53.33 |
| O-058 | 2 | 1 | 2 | 2 | 79.92 | 70.12 | 60.34 | 79.12 | 65.96 | 56.34 | 56.02 | 53.21 |
| nDB | 3 | 3 | 4 | 4 | 80.05 | 71.34 | 61.22 | 80.36 | 76.32 | 67.78 | 57.67 | 55.32 |
| F06[C-Cl] | 7 | 8 | 7 | 8 | 79.73 | 69.96 | 58.27 | 79.12 | 75.12 | 63.29 | 54.79 | 53.29 |
| nCconj | 6 | 7 | 8 | 7 | 78.75 | 67.54 | 57.65 | 77.64 | 76.07 | 65.96 | 55.36 | 54.34 |
| C-026 | 9 | 10 | 9 | 11 | 77.96 | 68.32 | 57.34 | 58.12 | 75.48 | 65.32 | 54.96 | 53.69 |
Descriptions for the Selected Categorical Attributes in Cannabinoid Receptor Subtypes CB1 and CB2 Activity Data.
| Attribute | Attribute Class | Description |
|---|---|---|
| F01[N-O] | 2D frequency fingerprints | frequency of N-O at topological distance 1 |
| N-076 | Atom-centered fragments | Ar-NO2 / R–N(–R)–O / RO-NO |
| nArNO2 | Functional group counts | number of nitro groups (aromatic) |
| nCconj | Functional group counts | number of non-aromatic conjugated C( |
| C-034 | Atom-centered fragments | R–CR..X |
| B01[N-O] | 2D binary fingerprints | presence/absence of N-O at topological distance 1 |
| N-073 | Atom-centered fragments | Ar2NH / Ar3N / Ar2N-Al / R..N..R |
| nN(CO)2 | Functional group counts | number of imides (thio-)-C(=Y1)-N(Y)-C(=Y1)- Y=H or C, Y1= O or S |
R represents any group linked through carbon; X represents any electronegative atom (O, N, S, P, Se, halogens); Al and Ar represent aliphatic and aromatic groups, respectively; = represents a double bond; – represents an aromatic bond as in benzene or delocalized bonds such as the N-O bond in a nitro group; .. represents aromatic single bonds as in the C-N bond in pyrrole.
Descriptions for the Selected Categorical Attributes in Cannabinoid Receptor Subtypes CB1 and CB2 Selectivity Data.
| Attribute | Attribute Class | Description |
|---|---|---|
| O-058 | Atom-centered fragments | =O |
| nDB | Constitutional descriptors | number of double bonds |
| F06[C-Cl] | 2D frequency fingerprints | frequency of C-Cl at topological distance 6 |
| nCconj | Functional group counts | number of non-aromatic conjugated C( |
| C-026 | Atom-centered fragments | R–CX..R |
R represents any group linked through carbon; X represents any electronegative atom (O, N, S, P, Se, halogens); Al and Ar represent aliphatic and aromatic groups, respectively; = represents a double bond; – represents an aromatic bond as in benzene or delocalized bonds such as the N-O bond in a nitro group; .. represents aromatic single bonds as in the C-N bond in pyrrole.
Experimental Results of ALL Prognosis Prediction Using Preselected Attribute Sets and Linear Model.
| Attribute | Conditional Entropy | Training CV Accuracy (%) | Test Accuracy (%) |
|---|---|---|---|
| Baseline | – | 85.06 | 89.83 |
| Subtype | 0.3659 | 89.08 | 92.20 |
| Protocol | 0.5616 | 85.06 | 89.96 |
Experimental Results of ALL Prognosis Prediction Using Preselected Attribute Sets and Two-degree Polynomial Kernel.
| Attribute | Conditional Entropy | Training CV Accuracy (%) | Test Accuracy (%) |
|---|---|---|---|
| Baseline | – | 85.06 | 89.83 |
| Subtype | 0.3638 | 89.08 | 92.20 |
| Protocol | 0.5630 | 86.78 | 87.46 |
Experimental Results of ALL Prognosis Prediction Using Preselected Attribute Sets and Gaussian Kernel.
| Attribute | Conditional Entropy | | | Training CV Accuracy (%) | | | Test Accuracy (%) | | |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | – | – | – | 85.06 | 85.06 | 85.06 | 89.83 | 89.83 | 89.83 |
| Subtype | 0.5656 | 0.5662 | 0.5662 | 88.51 | 88.51 | 88.51 | 92.20 | 92.20 | 92.20 |
| Protocol | 0.3829 | 0.3835 | 0.3840 | 85.06 | 85.06 | 85.06 | 89.96 | 89.96 | 89.96 |
Experimental Results of ALL/AML Prediction Using Attributes Selected by CFS and Linear Model.
| Attribute | Conditional Entropy | Training CV Accuracy (%) | Test Accuracy (%) |
|---|---|---|---|
| Baseline | – | 100.00 | 99.50 |
| T/B-cell | 7.1491e-16 | 100.00 | 100.00 |
| FAB | 1.1666e-15 | 100.00 | 99.70 |
Experimental Results of ALL/AML Prediction Using Attributes Selected by CFS and Two-degree Polynomial Kernel.
| Attribute | Conditional Entropy | Training CV Accuracy (%) | Test Accuracy (%) |
|---|---|---|---|
| Baseline | – | 100.00 | 94.44 |
| T/B-cell | 7.1491e-16 | 100.00 | 100.00 |
| FAB | 1.1666e-15 | 100.00 | 100.00 |
Experimental Results of ALL/AML Prediction Using Attributes Selected by CFS and Gaussian Kernel.
| Attribute | Conditional Entropy | | | Training CV Accuracy (%) | | | Test Accuracy (%) | | |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | – | – | – | 68.52 | 64.81 | 64.81 | 66.67 | 66.67 | 66.67 |
| Subtype | 7.1491e-16 | 7.1491e-16 | 7.1491e-16 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| Protocol | 1.1666e-15 | 1.1666e-15 | 1.1666e-15 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |