Hui Wang, Mark J van der Laan.
Abstract
BACKGROUND: When a large number of candidate variables is present, a dimension reduction procedure is usually applied to shrink the variable space before the subsequent analysis is carried out. The goal of dimension reduction is to find a list of candidate genes of a more manageable length that ideally includes all the relevant genes. Leaving many uninformative genes in the analysis can lead to biased estimates and reduced power. Dimension reduction is therefore often considered a necessary precursor to the analysis: it not only reduces the cost of handling numerous variables but can also improve the performance of the downstream analysis algorithms.
Year: 2011 PMID: 21849016 PMCID: PMC3166941 DOI: 10.1186/1471-2105-12-312
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
The simulation I results
| | MVR | DSA | MVR | DSA |
|---|---|---|---|---|
| 0.1 | 1.0870 | - | 1.2136 | - |
| | 0.6522 | - | 0.8846 | - |
| | na | 1.0130 | na | 1.0680 |
| 0.3 | 1.0776 | - | 1.0684 | - |
| | 0.1528 | - | 0.0958 | - |
| | na | 1.0345 | na | 1.0299 |
| 0.5 | 1.0373 | - | 1.0331 | - |
| | 0.0355 | - | 0.0149 | - |
| | na | 1.0251 | na | 1.0335 |
| 0.7 | 1.0081 | - | 1.0000 | - |
| | 0.0275 | - | 0.0162 | - |
| | na | 1.0693 | na | 1.1055 |
| 0.9 | 0.8415 | - | 0.5502 | - |
| | 0.0364 | - | 0.0204 | - |
| | na | 1.2630 | na | 1.6103 |
Bold fonts: testing set (a). Italic fonts: testing set (b).
na: not available. -: the same value as the previous entry.
Figure 1. A typical example in simulation I. This graph presents the average L2 risk of the final prediction model on the candidate list from the UR-VIM and the TMLE-VIM, for simulation I data with setup (σ = 5, m = 10, ρ = 0.7). In the left panel, the MVR risk is plotted against a series of p-value thresholds used to truncate the candidate list; the right panel plots the D/S/A risk. The testing set (a) predictions are grouped in solid blue lines, and the testing set (b) predictions are grouped in broken orange lines. Dots represent the UR-VIM, and triangles represent the TMLE-VIM.
The simulation II results (p-value)
| UR-VIM | 0.2887 | 13.8 | 605.3 | 0.1851 | 13.4 | 555.9 |
| | 0.4849 | 16.6 | 280.5 | 0.3245 | 14.7 | 255.5 |
| | 0.6289 | 19.7 | 29.1 | 0.4203 | 17.9 | 24 |
| TMLE-VIM( | 0.6479 | 20 | 41.6 | 0.4498 | 19.2 | 105.9 |
The candidate variable list contains all variables with p-values less than 0.05.
The simulation II results (top 100)
| UR-VIM | 0.1444 | 9.0 | 0.2956 | 0.0862 | 8.2 | 0.3642 |
| | 0.1907 | 8.8 | 0.2534 | 0.1605 | 7.2 | 0.2590 |
| | 0.6059 | 19.9 | 0.2289 | 0.4132 | 19.2 | 0.2234 |
| TMLE-VIM( | 0.5916 | 20 | 0.1242 | 0.3859 | 17.7 | 0.0867 |
The candidate variable list contains the top 100 variables ranked by their p-values.
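The two truncation rules used for the simulation II tables (all variables with p-values below 0.05, or the top 100 variables ranked by p-value) can be illustrated with a minimal sketch. The function names, the gene names, and the p-values below are hypothetical and serve only to show the two rules; they are not the authors' implementation or data.

```python
def select_by_threshold(p_values, alpha=0.05):
    """Keep every variable whose p-value is strictly below alpha, ordered by p-value."""
    kept = [(p, name) for name, p in p_values.items() if p < alpha]
    return [name for p, name in sorted(kept)]


def select_top_k(p_values, k=100):
    """Keep the k variables with the smallest p-values (ties broken by name)."""
    ranked = sorted(p_values.items(), key=lambda item: (item[1], item[0]))
    return [name for name, _ in ranked[:k]]


# Hypothetical p-values, one per candidate variable (illustration only).
p_values = {
    "gene_a": 0.001,
    "gene_b": 0.20,
    "gene_c": 0.03,
    "gene_d": 0.049,
    "gene_e": 0.75,
}

by_threshold = select_by_threshold(p_values, alpha=0.05)
top_two = select_top_k(p_values, k=2)

print(by_threshold)
print(top_two)
```

The threshold rule yields a list whose length depends on the data, whereas the top-k rule fixes the list length in advance, which is why the two tables report the candidate lists separately.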
The analysis result of the breast cancer dataset
| | Num. of genes in the candidate list | C.V. classification accuracy | Corr. level among the top 100 genes |
|---|---|---|---|
| UR-VIM | 327 | 0.7669 | 0.43 |
| | 660 | 0.7744 | 0.18 |
| | 818 | 0.7744 | 0.21 |
Figure 2. The Venn diagram of the breast cancer data. This Venn diagram shows the overlaps of identified candidate genes from the breast cancer dataset using the UR-VIM, the , and the .