Dennis E Te Beest, Steven W Mes, Saskia M Wilting, Ruud H Brakenhoff, Mark A van de Wiel.
Abstract
BACKGROUND: Prediction in high-dimensional settings is difficult due to the large number of variables relative to the sample size. We demonstrate how auxiliary 'co-data' can be used to improve the performance of a Random Forest in such a setting.
Keywords: Classification; DNA copy number; Gene expression; Methylation; Prior information; Random forest
Year: 2017 PMID: 29281963 PMCID: PMC5745983 DOI: 10.1186/s12859-017-1993-1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1. Illustration of the sources of data used in CoRF for the LNM example. First, a base RF is fitted on the training data. Its output, v, together with the co-data, is used to train the co-data model. From the co-data model, we obtain a probability per gene that is used for refitting on the training data. In an extra step, we validate the results on GSE84846.
Fig. 2. Fit of the co-data model for the LNM example. Each square represents 100 genes grouped by either (a) DNA copy number-expression correlation or (b) p-value. The red lines represent the marginal fit across the correlations or p-values. The top red lines represent the fit for genes present in the gene profile. The cloud of red dots represents the fitted values for 1000 randomly selected genes.
Fig. 3. The ROC curve based on out-of-bag (oob) predictions for the base RF and CoRF: (a) the TCGA training data, (b) the validation data set (GSE84846), and (c) the cervical cancer example.
Fig. 4. The performance of RF/CoRF for given numbers of variables selected with vh-vimp for the LNM example. For the (TCGA) training data, the performance was assessed by 10-fold cross-validation. For the validation data set (GSE84846), the prediction models were directly applied.
Fig. 5. Fit of the co-data model for the cervical cancer example. Shown are the estimated sampling probabilities for 10000 randomly selected methylation sites, plotted against (a) the location of the methylation site, (b) the number of CpGs, and (c) p-values. Panel (c) only displays the methylation sites that are up-regulated.
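
The workflow described in the caption of Fig. 1 (base RF, co-data model, refit) can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: v is taken to be the number of trees in which each variable is used for a split, the co-data model is taken to be a logistic regression on the co-data, and the refit draws a weighted feature subset per tree, because scikit-learn's random forest offers no per-variable sampling weights. All function names (`split_usage`, `codata_probabilities`, `refit_weighted`, `predict_votes`) are hypothetical, and class labels are assumed to be coded 0/1.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier


def split_usage(forest, n_features):
    """Assumed definition of v: number of trees using each variable in a split."""
    v = np.zeros(n_features)
    for est in forest.estimators_:
        used = np.unique(est.tree_.feature)
        v[used[used >= 0]] += 1  # negative entries mark leaf nodes
    return v


def codata_probabilities(v, n_trees, codata):
    """Logistic co-data model (assumption): regress usage on co-data.

    Each variable contributes v 'used' and n_trees - v 'not used'
    observations, encoded through sample weights.
    """
    X = np.vstack([codata, codata])
    y = np.concatenate([np.ones(len(v)), np.zeros(len(v))])
    w = np.concatenate([v, n_trees - v])
    keep = w > 0
    model = LogisticRegression().fit(X[keep], y[keep], sample_weight=w[keep])
    p = model.predict_proba(codata)[:, 1]
    return p / p.sum()  # normalise to sampling probabilities


def refit_weighted(X, y, probs, n_trees=200, n_sub=None, seed=0):
    """Stand-in for the refit: each tree sees a bootstrap sample and a
    feature subset drawn with the co-data probabilities."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    n_sub = n_sub or max(1, int(np.sqrt(p)))
    trees = []
    for _ in range(n_trees):
        rows = rng.integers(0, n, n)  # bootstrap rows
        cols = rng.choice(p, size=n_sub, replace=False, p=probs)
        tree = DecisionTreeClassifier(random_state=int(rng.integers(1 << 31)))
        trees.append((cols, tree.fit(X[rows][:, cols], y[rows])))
    return trees


def predict_votes(trees, X):
    """Fraction of trees voting for class 1 (labels assumed to be 0/1)."""
    return np.mean([t.predict(X[:, cols]) for cols, t in trees], axis=0)
```

A typical call sequence would fit `RandomForestClassifier` on the training data, pass it to `split_usage`, feed the result with a co-data matrix (one row per variable) into `codata_probabilities`, and hand the resulting probabilities to `refit_weighted`.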