Xuegong Zhang, Xin Lu, Qian Shi, Xiu-Qin Xu, Hon-Chiu E Leung, Lyndsay N Harris, James D Iglehart, Alexander Miron, Jun S Liu, Wing H Wong.
Abstract
BACKGROUND: Like microarray-based investigations, high-throughput proteomics techniques require machine learning algorithms to identify biomarkers that are informative for biological classification problems. Feature selection and classification algorithms need to be robust to noise and outliers in the data.
Year: 2006 PMID: 16606446 PMCID: PMC1456993 DOI: 10.1186/1471-2105-7-197
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
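Table 1. Comparison of R-SVM and SVM-RFE on Data-G (with gene outliers)
| Level a | ReduceSV b | P(sv-diff) c | ReduceTest d | P(test-diff) e | ImproveRec f | P(rec-diff) g |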
| 800 | 4.01% | 1.81E-42 | -7.70% | 4.72E-03 | -3.90% | 1.71E-39 |
| 600 | 5.77% | 1.74E-49 | -2.50% | 4.64E-01 | -1.70% | 5.21E-15 |
| 500 | 6.83% | 2.75E-51 | -4.00% | 1.62E-01 | -0.30% | 0.079189 |
| 400 | 8.35% | 3.26E-60 | 2.80% | 3.48E-01 | 1.10% | 4.48E-06 |
| 300 | 9.33% | 3.83E-58 | 7.40% | 3.65E-02 | 3.70% | 1.77E-31 |
| 200 | 8.22% | 1.28E-48 | 19.20% | 6.36E-09 | 6.30% | 5.79E-44 |
| 150 | 8.55% | 1.51E-53 | 19.50% | 1.16E-08 | 7.10% | 9.76E-46 |
| 100 | 4.97% | 6.20E-22 | 11.90% | 1.83E-04 | 6.00% | 6.43E-40 |
| 90 | 5.84% | 1.66E-27 | 13.70% | 4.20E-06 | 4.60% | 1.07E-30 |
| 80 | 5.17% | 8.20E-29 | 12.40% | 4.14E-06 | 4.50% | 7.12E-29 |
| 70 | 4.14% | 1.46E-27 | 8.50% | 4.77E-04 | 3.80% | 1.05E-24 |
| 60 | 3.10% | 1.23E-20 | 10.20% | 3.14E-05 | 3.40% | 4.99E-24 |
| 50 | 2.27% | 2.01E-15 | 10.20% | 4.11E-06 | 2.90% | 2.37E-21 |
a Level: The number of features selected in each recursive step. With all of the 1000 features, there is no difference between R-SVM and SVM-RFE because no feature selection happened.
b ReduceSV: Relative reduction in the mean number of support vectors (SVs) used by R-SVM compared with SVM-RFE, calculated as (average #SV_{SVM-RFE} - average #SV_{R-SVM}) / (average #SV_{SVM-RFE}).
c P(sv-diff): The p-value of the observed difference in numbers of SVs, by paired t-test.
d ReduceTest: Relative reduction in the mean test error rate of SVM models built on R-SVM-selected features compared with models built on SVM-RFE-selected features, calculated as (average TestError_{SVM-RFE} - average TestError_{R-SVM}) / (average TestError_{SVM-RFE}).
e P(test-diff): The p-value of the observed difference in test error rates, by paired t-test.
f ImproveRec: Relative improvement in the proportion of true informative genes recovered by R-SVM over SVM-RFE, calculated as (average #REC_{R-SVM} - average #REC_{SVM-RFE}) / (average #REC_{SVM-RFE}), where #REC is the number of recovered true informative genes for the method named in the subscript.
g P(rec-diff): The p-value of the observed difference in proportion of recovered true informative genes, by paired t-test.
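The relative measures in footnotes b, d and f share one form: a difference of two means divided by the mean for the reference method. The sketch below illustrates that arithmetic together with the paired t-statistic that underlies the reported p-values. The function names and the per-run counts are hypothetical and are not taken from the paper; converting the t-statistic to a p-value needs a t-distribution and is left out.

```python
from math import sqrt
from statistics import mean, stdev


def relative_reduction(reference, candidate):
    """(mean(reference) - mean(candidate)) / mean(reference), as in ReduceSV and ReduceTest."""
    return (mean(reference) - mean(candidate)) / mean(reference)


def relative_improvement(reference, candidate):
    """(mean(candidate) - mean(reference)) / mean(reference), as in ImproveRec."""
    return (mean(candidate) - mean(reference)) / mean(reference)


def paired_t_statistic(x, y):
    """t-statistic of the paired differences x[i] - y[i]."""
    diffs = [a - b for a, b in zip(x, y)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Hypothetical support-vector counts from five paired runs (illustration only).
sv_svm_rfe = [50, 52, 48, 51, 49]
sv_r_svm = [46, 47, 45, 48, 44]

reduce_sv = relative_reduction(sv_svm_rfe, sv_r_svm)   # (50 - 46) / 50
t_stat = paired_t_statistic(sv_svm_rfe, sv_r_svm)      # degrees of freedom: 4
```

A positive ReduceSV or ReduceTest means R-SVM used fewer support vectors or made fewer test errors than SVM-RFE; a negative value means the opposite.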
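Table 2. Comparison of R-SVM and SVM-RFE on Data-S (with sample outliers)
| Level a | ReduceSV b | P(sv-diff) c | ReduceTest d | P(test-diff) e | ImproveRec f | P(rec-diff) g | ReduceOSV h | P(osv-diff) i |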
| 800 | 3.25% | 4.49E-41 | -65.19% | 5.65E-36 | -10.14% | 3.36E-75 | 50.37% | 5.97E-35 |
| 600 | 5.80% | 1.90E-57 | -70.27% | 3.04E-35 | -7.14% | 5.18E-56 | 72.28% | 1.10E-49 |
| 500 | 7.02% | 8.20E-63 | -59.63% | 1.81E-37 | -5.13% | 3.37E-39 | 80.54% | 1.17E-56 |
| 400 | 8.26% | 1.68E-67 | -41.43% | 8.31E-25 | -2.57% | 4.53E-12 | 89.04% | 2.51E-64 |
| 300 | 7.72% | 1.20E-58 | -19.14% | 2.18E-13 | 0.75% | 4.92E-02 | 93.44% | 7.46E-65 |
| 200 | 7.21% | 4.54E-51 | -6.53% | 2.56E-04 | 4.00% | 7.15E-16 | 93.91% | 1.47E-61 |
| 150 | 9.13% | 1.29E-71 | 2.63% | 1.20E-01 | 6.47% | 8.41E-23 | 93.59% | 6.27E-61 |
| 100 | 8.30% | 1.42E-64 | 5.56% | 8.04E-04 | 7.69% | 3.50E-22 | 92.44% | 1.33E-61 |
| 90 | 8.36% | 2.01E-72 | 4.31% | 1.15E-02 | 6.99% | 8.74E-19 | 91.37% | 2.60E-61 |
| 80 | 8.01% | 6.63E-71 | 4.45% | 1.99E-02 | 6.99% | 9.33E-18 | 90.26% | 2.65E-60 |
| 70 | 7.17% | 1.29E-67 | 6.59% | 3.78E-04 | 7.52% | 2.80E-16 | 88.56% | 7.55E-62 |
| 60 | 6.67% | 2.65E-65 | 6.16% | 2.32E-03 | 7.27% | 5.72E-13 | 86.38% | 2.60E-62 |
| 50 | 5.82% | 1.08E-58 | 7.70% | 1.34E-04 | 7.42% | 3.71E-12 | 83.82% | 1.23E-61 |
a,b,c,d,e,f,g Same as in Table 1.
h ReduceOSV: Relative reduction in the number of outlier support vectors (outlier samples taken as support vectors) in R-SVM compared with SVM-RFE, calculated as (average #OSV_{SVM-RFE} - average #OSV_{R-SVM}) / (average #OSV_{SVM-RFE}), where #OSV denotes the number of outlier samples taken as support vectors by the method named in the subscript.
i P(osv-diff): The p-value of the observed difference in #OSV, by paired t-test.
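As a rough illustration of footnote h, the sketch below counts how many known outlier samples end up among the support vectors in each run and then forms the relative reduction. The sample indices, the run lists and the helper name `count_outlier_svs` are hypothetical and only demonstrate the arithmetic.

```python
from statistics import mean


def count_outlier_svs(support_indices, outlier_indices):
    """Number of outlier samples that appear among the support vectors."""
    return len(set(support_indices) & set(outlier_indices))


# Hypothetical indices of the artificially introduced outlier samples.
outliers = {3, 7, 11}

# Hypothetical support-vector indices from three paired runs of each method.
runs_svm_rfe = [[1, 3, 7, 11], [3, 7, 9], [2, 3, 11]]
runs_r_svm = [[1, 3], [7, 9], [2, 5]]

osv_svm_rfe = [count_outlier_svs(run, outliers) for run in runs_svm_rfe]  # [3, 2, 2]
osv_r_svm = [count_outlier_svs(run, outliers) for run in runs_r_svm]      # [1, 1, 0]

# ReduceOSV = (average #OSV_{SVM-RFE} - average #OSV_{R-SVM}) / (average #OSV_{SVM-RFE})
reduce_osv = (mean(osv_svm_rfe) - mean(osv_r_svm)) / mean(osv_svm_rfe)
```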
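Table 3. Comparison of R-SVM and SVM-RFE on Data-R
| Level a | ReduceSV b | P(sv-diff) c | ReduceTest d | P(test-diff) e | ImproveRec f | P(rec-diff) g |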
| 800 | 15.35% | 1.24E-53 | -3.59% | 1.26E-05 | -3.60% | 1.50E-23 |
| 600 | 18.65% | 3.14E-56 | -7.06% | 4.09E-04 | 2.69% | 2.20E-09 |
| 500 | 19.58% | 7.71E-58 | -6.46% | 1.79E-03 | 9.18% | 1.24E-37 |
| 400 | 21.07% | 1.80E-63 | -2.74% | 3.22E-05 | 17.32% | 4.25E-59 |
| 300 | 22.51% | 5.12E-67 | -4.64% | 1.26E-05 | 24.14% | 5.43E-65 |
| 200 | 22.16% | 9.38E-68 | -0.93% | 1.83E-04 | 30.64% | 2.25E-71 |
| 150 | 21.78% | 4.57E-64 | -3.44% | 8.74E-04 | 29.14% | 5.86E-71 |
| 100 | 21.01% | 3.21E-57 | 0.31% | 3.22E-05 | 29.95% | 7.74E-69 |
| 90 | 22.57% | 1.88E-60 | -2.52% | 3.52E-03 | 27.51% | 9.74E-66 |
| 80 | 22.88% | 1.67E-65 | 1.84% | 7.85E-05 | 27.92% | 4.03E-62 |
| 70 | 21.42% | 2.96E-59 | 0.59% | 4.09E-04 | 27.16% | 1.15E-58 |
| 60 | 20.20% | 1.64E-55 | 6.16% | 1.83E-04 | 26.83% | 2.55E-60 |
| 50 | 18.67% | 4.40E-52 | 4.23% | 8.74E-04 | 25.89% | 9.63E-53 |
| 40 | 15.37% | 5.66E-46 | 8.99% | 4.69E-06 | 25.39% | 1.09E-55 |
| 30 | 11.85% | 6.90E-33 | 9.61% | 1.67E-06 | 24.19% | 2.07E-45 |
| 20 | 7.87% | 2.19E-18 | 11.43% | 3.22E-05 | 20.86% | 1.09E-34 |
a,b,c,d,e,f,g Same as in Table 1.
Table 4. The CV results on the rat cirrhosis data
| Level a | CV2 b | AveSV c | CV2 b | AveSV c |
| 93 | 4.2% | 14.75 | 4.2% | 14.75 |
| 80 | 4.2% | 11.91 | 4.2% | 14.74 |
| 70 | 4.2% | 9.95 | 4.2% | 14.73 |
| 60 | 3.2% | 9.22 | 4.2% | 13.91 |
| 50 | 3.2% | 9.03 | 4.2% | 13.82 |
| 40 | 3.2% | 9.02 | 4.2% | 14.65 |
| 30 | 3.2% | 8.95 | 4.2% | 13.65 |
| 20 | 3.2% | 8.93 | 4.2% | 9.98 |
| 18 | 4.2% | 8.14 | 4.2% | 9.97 |
| 16 | 4.2% | 8.08 | 3.2% | 7.26 |
| 15 | 4.2% | 7.60 | 3.2% | 7.15 |
| 14 | 4.2% | 7.54 | 3.2% | 7.94 |
| 13 | 6.3% | 7.58 | 4.2% | 7.98 |
| 12 | 6.3% | 7.41 | 4.2% | 8.05 |
| 11 | 6.3% | 7.65 | 4.2% | 8.02 |
| 10 | 6.3% | 7.64 | 3.2% | 9.83 |
| 9 | 5.3% | 6.50 | 3.2% | 8.83 |
| 8 | 4.2% | 5.97 | 4.2% | 7.01 |
| 7 | 4.2% | 6.73 | 4.2% | 6.05 |
| 6 | 4.2% | 5.98 | 3.2% | 5.97 |
| 5 | 5.3% | 5.94 | 4.2% | 5.05 |
a Level: The number of features selected in each recursive step.
b CV2: Total cross-validation error rate (CV2 error rate).
c AveSV: Average number of support vectors used in the cross-validations at each level.
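The two per-level summaries can be sketched as simple aggregates over cross-validation folds. The footnotes only describe CV2 as the total cross-validation error rate, so the pooling below (misclassified test samples summed over folds, divided by the total number of test samples) is an assumption, and the fold records are made-up numbers.

```python
from statistics import mean

# Hypothetical per-fold results at a single feature level (illustration only).
folds = [
    {"n_test": 10, "n_errors": 1, "n_sv": 15},
    {"n_test": 10, "n_errors": 0, "n_sv": 14},
    {"n_test": 10, "n_errors": 2, "n_sv": 16},
]


def total_cv_error(folds):
    """Pooled error rate: all misclassified test samples over all test samples."""
    return sum(f["n_errors"] for f in folds) / sum(f["n_test"] for f in folds)


def average_sv(folds):
    """Mean number of support vectors across the folds."""
    return mean(f["n_sv"] for f in folds)


cv_error = total_cv_error(folds)  # 3 errors out of 30 test samples
ave_sv = average_sv(folds)
```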
Table 5. The CV results on the human breast cancer dataset
| Level a | CV2 b | AveSV c | CV2 b | AveSV c |
| 98 | 28.7% | 54.65 | 28.70% | 54.65 |
| 88 | 27.9% | 50.10 | 29.40% | 55.25 |
| 79 | 29.4% | 49.28 | 30.10% | 52.21 |
| 71 | 29.4% | 47.48 | 30.90% | 50.88 |
| 63 | 27.9% | 44.65 | 27.90% | 48.42 |
| 56 | 27.2% | 42.50 | 27.90% | 46.02 |
| 50 | 27.9% | 40.04 | 26.50% | 40.13 |
| 45 | 25.7% | 38.65 | 26.50% | 40.25 |
| 40 | 24.3% | 37.04 | 27.90% | 34.88 |
| 36 | 23.5% | 35.16 | 27.90% | 34.51 |
| 32 | 22.1% | 33.26 | 27.90% | 30.75 |
| 28 | 22.8% | 32.04 | 27.20% | 27.77 |
| 25 | 22.1% | 31.24 | 30.90% | 24.61 |
| 22 | 22.1% | 31.15 | 34.60% | 23.93 |
| 19 | 22.8% | 32.10 | 30.10% | 26.79 |
| 17 | 25.7% | 33.26 | 29.40% | 31.28 |
| 15 | 23.5% | 35.68 | 25.70% | 35.10 |
| 13 | 19.9% | 37.40 | 26.50% | 42.15 |
| 11 | 22.1% | 37.83 | 25.00% | 46.03 |
| 9 | 21.3% | 42.01 | 24.30% | 50.18 |
| 8 | 17.6% | 44.07 | 22.10% | 49.93 |
| 7 | 23.5% | 50.29 | 20.60% | 51.43 |
| 6 | 22.1% | 54.73 | 20.60% | 52.39 |
| 5 | 22.1% | 57.98 | 20.60% | 52.18 |
| 4 | 22.8% | 59.75 | 25.00% | 58.92 |
| 3 | 27.2% | 78.90 | 32.40% | 77.46 |
a,b,c Same as in Table 4.
Table 6. The top 6 R-SVM-selected biomarkers with their t-test and ROC statistics
| Marker (m/z) | Rank a | t-statistic | p-value | AUC | SE b |
| 3526.68 | 1 | 11.916 | 2.05E-19 | 0.969 | 0.024 |
| 3548.26 | 3 | 11.234 | 4.02E-18 | 0.955 | 0.029 |
| 1754.12 | 7 | 9.784 | 2.55E-15 | 0.936 | 0.034 |
| 4195.07 | 15 | 5.341 | 8.46E-07 | 0.821 | 0.043 |
| 8211.04 | 30 | 3.660 | 4.51E-04 | 0.712 | 0.063 |
| 4912.63 | 34 | 3.339 | 1.28E-03 | 0.696 | 0.057 |
a Rank by t-statistics
b Standard error of the AUC (area under curve).
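For readers who want to reproduce this kind of ROC summary for a single marker, the sketch below computes the AUC as the fraction of correctly ordered (positive, negative) pairs and a standard error from the Hanley-McNeil approximation. The table does not say how its standard errors were obtained, so the choice of Hanley-McNeil is an assumption, and the intensity values are invented.

```python
from math import sqrt


def auc(pos, neg):
    """AUC as the fraction of (pos, neg) pairs ranked correctly; ties count one half."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def hanley_mcneil_se(a, n_pos, n_neg):
    """Standard error of an AUC value under the Hanley-McNeil approximation."""
    q1 = a / (2 - a)
    q2 = 2 * a * a / (1 + a)
    var = (a * (1 - a)
           + (n_pos - 1) * (q1 - a * a)
           + (n_neg - 1) * (q2 - a * a)) / (n_pos * n_neg)
    return sqrt(var)


# Hypothetical marker intensities in two classes (illustration only).
pos = [0.9, 0.8, 0.7, 0.4]
neg = [0.6, 0.5, 0.3, 0.2]

marker_auc = auc(pos, neg)  # 14 of 16 pairs are ordered correctly -> 0.875
marker_se = hanley_mcneil_se(marker_auc, len(pos), len(neg))
```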
Table 7. T-statistics and ROC statistics of the 8 R-SVM-selected markers on the breast cancer data
| Marker a | Rank b | t-statistic | p-value | AUC | SE |
| +Marker-5 | 1 | -5.867 | 3.31E-08 | 0.775 | 0.041 |
| +Marker-28 | 2 | -5.229 | 7.68E-07 | 0.745 | 0.043 |
| Marker-29 | 3 | 5.169 | 9.29E-07 | 0.708 | 0.044 |
| +Marker-58 | 4 | -4.911 | 2.79E-06 | 0.754 | 0.043 |
| Marker-74 | 6 | 4.103 | 7.07E-05 | 0.700 | 0.044 |
| Marker-81 | 10 | -2.963 | 3.61E-03 | 0.626 | 0.048 |
| Marker-92 | 52 | 1.639 | 0.104 | 0.638 | 0.047 |
| Marker-97 | 94 | 0.162 | 0.872 | 0.570 | 0.049 |
a The biological study based on these and other data will be published elsewhere. Here we use the relative sequential position of the markers on the m/z axis to represent them.
b Rank ordered by t-statistics among the 98 markers.
+ Biomarkers that correspond to "peptide A".
Table 8. The comparison of R-SVM vs. WV on Data-G
| Level a | ReduceTest d | P(test-diff) e | ImproveRec f | P(rec-diff) g |
| 800 | 36.36% | 1.16E-17 | 1.02% | 2.13E-06 |
| 600 | 38.95% | 6.74E-17 | 9.49% | 2.14E-62 |
| 500 | 39.51% | 8.72E-21 | 14.82% | 3.77E-71 |
| 400 | 44.84% | 3.86E-23 | 20.83% | 1.68E-79 |
| 300 | 49.75% | 6.86E-25 | 28.72% | 3.48E-86 |
| 200 | 54.22% | 2.02E-27 | 36.75% | 3.70E-91 |
| 150 | 54.83% | 9.37E-30 | 36.14% | 2.65E-86 |
| 100 | 43.56% | 6.63E-25 | 33.61% | 4.42E-75 |
| 90 | 42.35% | 1.85E-26 | 31.09% | 4.23E-73 |
| 80 | 37.37% | 7.35E-25 | 29.08% | 3.79E-67 |
| 70 | 32.23% | 1.20E-20 | 26.54% | 9.22E-63 |
| 60 | 27.79% | 1.16E-20 | 24.39% | 1.24E-61 |
| 50 | 23.47% | 8.64E-15 | 21.80% | 1.83E-53 |
a Level: The number of features selected in each recursive step.
d ReduceTest: Relative reduction in the mean test error rate of R-SVM compared with WV, calculated as (average TestError_{WV} - average TestError_{R-SVM}) / (average TestError_{WV}).
e P(test-diff): The p-value of the observed differences in test error rates, by paired t-test.
f ImproveRec: Relative improvement in the proportion of true informative genes recovered by R-SVM over WV, calculated as (average #REC_{R-SVM} - average #REC_{WV}) / (average #REC_{WV}), where #REC is the number of recovered true informative genes for the method named in the subscript.
g P(rec-diff): The p-value of the observed difference in proportion of recovered informative genes, by paired t-test.
Table 9. The comparison of R-SVM vs. WV on Data-S
| Level a | ReduceTest d | P(test-diff) e | ImproveRec f | P(rec-diff) g |
| 800 | -12.32% | 1.58E-04 | -4.01% | 8.77E-33 |
| 600 | -30.90% | 1.38E-19 | -0.16% | 0.482 |
| 500 | -40.09% | 2.98E-32 | -0.03% | 0.940 |
| 400 | -48.92% | 2.95E-37 | -2.21% | 1.29E-11 |
| 300 | -58.87% | 1.54E-44 | -6.81% | 2.56E-35 |
| 200 | -64.05% | 1.72E-48 | -13.73% | 2.35E-53 |
| 150 | -60.96% | 1.83E-47 | -15.52% | 2.15E-52 |
| 100 | -56.41% | 2.62E-49 | -19.58% | 1.29E-57 |
| 90 | -52.91% | 1.58E-42 | -19.14% | 1.18E-51 |
| 80 | -50.73% | 2.35E-41 | -19.08% | 1.31E-51 |
| 70 | -47.11% | 4.02E-40 | -18.27% | 6.03E-47 |
| 60 | -43.29% | 6.80E-38 | -17.58% | 1.18E-42 |
| 50 | -36.01% | 1.06E-34 | -16.35% | 1.34E-37 |
a,d,e,f,g Same as in Table 8.
Table 10. Comparison of R-SVM vs. WV on Data-R
| Level a | ReduceTest d | P(test-diff) e | ImproveRec f | P(rec-diff) g |
| 800 | 26.23% | 6.15E-11 | -14.40% | 1.76E-72 |
| 600 | 21.40% | 5.51E-08 | -20.69% | 4.32E-74 |
| 500 | 20.28% | 1.12E-09 | -22.89% | 1.23E-75 |
| 400 | 18.40% | 2.70E-10 | -25.16% | 2.38E-75 |
| 300 | 14.86% | 5.51E-08 | -28.52% | 1.01E-76 |
| 200 | 18.18% | 5.64E-07 | -26.47% | 3.99E-83 |
| 150 | 13.35% | 1.26E-05 | -23.87% | 4.98E-77 |
| 100 | 13.07% | 5.64E-07 | -18.37% | 1.21E-63 |
| 90 | 13.53% | 1.26E-05 | -17.91% | 1.03E-60 |
| 80 | 15.34% | 4.69E-06 | -16.69% | 1.16E-57 |
| 70 | 12.04% | 4.09E-04 | -15.49% | 4.30E-52 |
| 60 | 12.76% | 5.64E-07 | -13.95% | 4.61E-47 |
| 50 | 9.09% | 4.09E-04 | -12.68% | 3.89E-42 |
| 40 | 8.75% | 1.79E-03 | -10.50% | 1.00E-35 |
| 30 | 8.90% | 1.83E-04 | -7.77% | 2.22E-32 |
a,d,e,f,g Same as in Table 8.
Figure 1. Workflow of the R-SVM algorithm.