Shuhei Kaneko, Akihiro Hirakawa, Chikuma Hamada.
Abstract
In the past decade, researchers in oncology have sought to develop survival prediction models using gene expression data. The least absolute shrinkage and selection operator (lasso) has been widely used to select genes that are truly correlated with a patient's survival. The lasso selects genes for prediction by shrinking a large number of coefficients of the candidate genes towards zero based on a tuning parameter that is often determined by cross-validation (CV). However, this method can fail to identify true positive genes (i.e., it produces false negatives) in certain instances, because the lasso tends to favor a simple prediction model. Here, we attempt to monitor the occurrence of false negatives by developing a method that assumes a mixture distribution for the lasso estimates and estimates the number of true positive (TP) genes for a series of values of the tuning parameter. We performed a simulation study to examine the precision of the method in estimating the number of TP genes. Additionally, we applied the method to a real gene expression dataset and found that it was able to identify genes correlated with survival that a CV method was unable to detect.
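The shrinkage step described in the abstract can be illustrated with a minimal sketch. It assumes the simplest textbook setting (a linear model with an orthonormal design, where each lasso coefficient is the soft-thresholded unpenalized estimate), not the survival model used in the study. The function names and the coefficient values are hypothetical and chosen only for illustration.

```python
def soft_threshold(z, lam):
    """Lasso estimate of one coefficient under an orthonormal design (assumed setting)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def selected_count(estimates, lam):
    """Number of genes whose shrunken coefficient remains nonzero at this lambda."""
    return sum(1 for z in estimates if soft_threshold(z, lam) != 0.0)


# Hypothetical unpenalized estimates for six candidate genes (illustrative values only).
unpenalized = [2.5, -1.8, 0.9, 0.4, -0.2, 0.05]

# A decreasing series of tuning-parameter values: a smaller lambda penalizes less,
# so more genes enter the model.
lambdas = [2.0, 1.0, 0.5, 0.1]
path = [(lam, selected_count(unpenalized, lam)) for lam in lambdas]

for lam, count in path:
    print(f"lambda = {lam}: {count} gene(s) selected")
```

In this toy setting a large λ keeps only the strongest gene, which mirrors the abstract's point that a tuning parameter favoring a simple model can leave truly correlated genes unselected.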
Year: 2015 PMID: 26146513 PMCID: PMC4469838 DOI: 10.1155/2015/259474
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238
Typical results of gene selection by the lasso.
| True condition | Selected by the lasso | Not selected by the lasso |
|---|---|---|
| Genes that are not correlated with survival | False positive | True negative |
| Genes that are truly correlated with survival | True positive | False negative |
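The four outcomes in the table can be written as set operations. The sketch below is only an illustration: the function name and the gene labels `g1` to `g8` are hypothetical, and in real data the set of truly correlated genes is unknown, which is why the number of TP has to be estimated.

```python
def classify_selection(all_genes, truly_correlated, selected):
    """Split candidate genes into the four outcomes of the table above."""
    universe = set(all_genes)
    truth = set(truly_correlated)
    chosen = set(selected)
    return {
        "true_positive": sorted(chosen & truth),       # selected and truly correlated
        "false_positive": sorted(chosen - truth),      # selected but not correlated
        "false_negative": sorted(truth - chosen),      # correlated but passed over
        "true_negative": sorted(universe - truth - chosen),
    }


# Hypothetical example: 8 candidate genes, 3 of them truly correlated with survival.
genes = [f"g{i}" for i in range(1, 9)]
outcome = classify_selection(genes, ["g1", "g2", "g3"], ["g1", "g2", "g5"])

for label, members in outcome.items():
    print(label, members)
```

Here `g3` is a false negative: it is truly correlated with survival but was not selected.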
Figure 1. Illustration of how the numbers of FP and TP are estimated. The areas marked by vertical and diagonal lines are the proportions of FP and TP, respectively.
Accuracy of the estimated number of true positives (TP) obtained using the proposed algorithm in the simulation study. The average tuning parameter (λ), the number of genes selected by the lasso, the true number of true positives (True TP), the estimated number of TP, and the estimated number of false positives (FP) are reported at the kth value of λ (k = 5, 10, 50, 100, 150) on the solution path.
| Truly correlated genes | ρ | k | λ | Selected genes | True TP | Estimated TP | Estimated FP |
|---|---|---|---|---|---|---|---|
| 30 | 0 | 5 | 47.0 | 5.0 | 4.4 | 2.9 | 2.2 |
| 30 | 0 | 10 | 40.8 | 10.1 | 8.0 | 5.8 | 4.3 |
| 30 | 0 | 50 | 22.9 | 48.6 | 25.6 | 28.5 | 20.1 |
| 30 | 0 | 100 | 12.6 | 86.7 | 29.9 | 32.1 | 54.7 |
| 30 | 0 | 150 | 8.6 | 124.5 | 30.0 | 30.7 | 93.9 |
| 30 | 0.5 | 5 | 48.6 | 5.0 | 4.1 | 2.8 | 2.2 |
| 30 | 0.5 | 10 | 42.1 | 10.0 | 7.5 | 5.8 | 4.2 |
| 30 | 0.5 | 50 | 23.5 | 48.1 | 25.2 | 31.9 | 16.3 |
| 30 | 0.5 | 100 | 12.4 | 84.9 | 29.9 | 35.3 | 49.6 |
| 30 | 0.5 | 150 | 8.4 | 121.2 | 30.0 | 31.6 | 89.6 |
| 5 | 0 | 5 | 66.9 | 5.0 | 5.0 | 3.0 | 2.0 |
| 5 | 0 | 10 | 26.3 | 10.4 | 5.0 | 5.2 | 5.2 |
| 5 | 0 | 50 | 17.2 | 50.1 | 5.0 | 5.2 | 44.9 |
| 5 | 0 | 100 | 12.7 | 93.9 | 5.0 | 5.0 | 88.9 |
| 5 | 0 | 150 | 9.8 | 128.4 | 5.0 | 5.0 | 123.4 |
| 5 | 0.5 | 5 | 66.8 | 5.0 | 5.0 | 3.0 | 2.0 |
| 5 | 0.5 | 10 | 26.5 | 10.3 | 5.0 | 5.2 | 5.1 |
| 5 | 0.5 | 50 | 16.9 | 49.5 | 5.0 | 5.1 | 44.4 |
| 5 | 0.5 | 100 | 12.4 | 92.1 | 5.0 | 5.0 | 87.1 |
| 5 | 0.5 | 150 | 9.6 | 125.2 | 5.0 | 5.0 | 120.2 |
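One way to read the accuracy results is to compute, for each row, the difference between the estimated and the true number of TP. The sketch below does this for the first five rows of the table, taking the last four values of each row in the order the caption lists them (selected genes, True TP, estimated TP, estimated FP); that column assignment is an assumption made here. It also checks how closely estimated TP plus estimated FP matches the number of selected genes, a relation the table values appear to follow but which the text does not state.

```python
# (k, selected genes, True TP, estimated TP, estimated FP), copied from the
# first five rows of the table; the column assignment is assumed from the caption.
rows = [
    (5, 5.0, 4.4, 2.9, 2.2),
    (10, 10.1, 8.0, 5.8, 4.3),
    (50, 48.6, 25.6, 28.5, 20.1),
    (100, 86.7, 29.9, 32.1, 54.7),
    (150, 124.5, 30.0, 30.7, 93.9),
]

# Bias of the TP estimate: negative values mean the number of TP is underestimated.
tp_bias = {k: round(est_tp - true_tp, 1) for k, _, true_tp, est_tp, _ in rows}

# Gap between (estimated TP + estimated FP) and the number of selected genes.
sum_gap = {k: round(est_tp + est_fp - sel, 1) for k, sel, _, est_tp, est_fp in rows}

for k in tp_bias:
    print(f"k = {k}: TP bias = {tp_bias[k]}, TP + FP - selected = {sum_gap[k]}")
```

For these rows the estimate falls below the true TP count at k = 5 and k = 10 and slightly above it from k = 50 onward.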
Figure 2. Trace plot of the number of selected genes and the estimated number of true positives (TP) produced by applying the proposed algorithm to the training data from the diffuse large B-cell lymphoma (DLBCL) dataset. We determined λ = 7.19 (log10 λ = 0.86) as the optimum λ based on the estimated number of TP. Using cross-validation (CV), we determined λ = 27 (log10 λ = 1.43) as the optimum λ.
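The two values in parentheses are the base-10 logarithms of the corresponding λ values, which can be checked with a short calculation. The variable names are hypothetical; only the two λ values come from the caption.

```python
import math

lambda_tp = 7.19   # optimum lambda based on the estimated number of TP (from the caption)
lambda_cv = 27.0   # optimum lambda chosen by cross-validation (from the caption)

log_tp = round(math.log10(lambda_tp), 2)
log_cv = round(math.log10(lambda_cv), 2)

print(log_tp, log_cv)
```

The TP-based choice is the smaller λ, that is, a weaker penalty than the one chosen by CV.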
GenBank accession numbers and descriptions for the 4 genes selected both by CV and by the model including the 42 genes identified by our developed algorithm.
| GenBank accession number | Description |
|---|---|
| X82240 (AA729003) | T-cell leukemia/lymphoma 1A |
| AA805575 | Thyroxine-binding globulin precursor |
| LC_29222 | — |
| X59812 (H98765) | Cytochrome P450, subfamily XXVIIA polypeptide |
Values of the comparison criteria for the model including 12 genes determined by CV (CV-model) and the model including the 42 genes identified by our developed algorithm (42 TP-model).
| Criteria | CV-model | 42 TP-model |
|---|---|---|
| | 0.007 | <0.001 |
| | 0.002 | <0.001 |
| Deviance | −9.079 | −11.297 |
Figure 3. Kaplan-Meier curves of overall survival for “better” and “worse” prognostic groups: (a) the model including 12 genes determined by CV (CV-model) and (b) the model including 42 genes identified by the developed method (42 TP-model).
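For readers unfamiliar with the curves in Figure 3, the following is a minimal sketch of the standard Kaplan-Meier product-limit estimator in plain Python. The function name and the five follow-up times are hypothetical; this is not the authors' code and uses none of the study data.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times  : follow-up time for each patient
    events : 1 if death was observed at that time, 0 if the patient was censored
    Returns a list of (time, survival probability) at each observed death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Gather every patient whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical group of five patients (times in years; 0 marks a censored patient).
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])

for t, s in curve:
    print(f"t = {t}: S(t) = {s:.2f}")
```

Computing one such curve per prognostic group and plotting them together gives a comparison of the kind shown in panels (a) and (b).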