Wouter G Touw, Jumamurat R Bayjanov, Lex Overmars, Lennart Backus, Jos Boekhorst, Michiel Wels, Sacha A F T van Hijum.
Abstract
In the Life Sciences, 'omics' data are increasingly generated by different high-throughput technologies. Often only the integration of these data allows uncovering biological insights that can be experimentally validated or mechanistically modelled, i.e. sophisticated computational approaches are required to extract the complex non-linear trends present in omics data. Classification techniques allow training a model based on variables (e.g. SNPs in genetic association studies) to separate different classes (e.g. healthy subjects versus patients). Random Forest (RF) is a versatile classification algorithm suited for the analysis of these large data sets. In the Life Sciences, RF is popular because RF classification models have a high prediction accuracy and provide information on the importance of variables for the classification. For omics data, variables or conditional relations between variables are typically important for a subset of samples of the same class. For example, within a class of cancer patients, certain SNP combinations may be important for the subset of patients with a specific subtype of cancer, but not for a different subset of patients. Such conditional relationships can in principle be uncovered from the data with RF, as they are implicitly taken into account by the algorithm during the creation of the classification model. This review details RF properties that, to the best of our knowledge, are rarely or never used, yet allow maximizing the biological insights that can be extracted from complex omics data sets using RF.
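As a concrete illustration of the workflow the abstract describes (training an RF on SNP-like variables, then ranking variables by importance), a minimal sketch with scikit-learn follows. The data, class labels and parameter choices are invented for illustration and are not from the review:

```python
# Illustrative sketch (not the authors' code): an RF classifier on a synthetic
# SNP-like matrix, with variable importances read out after training.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples, n_snps = 200, 50
X = rng.integers(0, 3, size=(n_samples, n_snps))   # genotypes coded 0/1/2
# Class depends non-linearly on an interaction between SNP 0 and SNP 1,
# mimicking the conditional relations the abstract mentions.
y = ((X[:, 0] > 0) & (X[:, 1] > 0)).astype(int)    # 1 = "patient", 0 = "healthy"

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)

print("OOB accuracy:", round(rf.oob_score_, 3))
# Gini importance ranks the two interacting SNPs highest.
top = np.argsort(rf.feature_importances_)[::-1][:3]
print("top-ranked variables:", top)
```

The out-of-bag (OOB) accuracy comes for free from the bootstrap step, and `feature_importances_` corresponds to the (global) variable importance discussed in the review.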
Keywords: Random Forest; conditional relationships; local importance; proximity; variable importance; variable interaction
Year: 2012 PMID: 22786785 PMCID: PMC3659301 DOI: 10.1093/bib/bbs034
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Figure 1: Training of an individual tree of an RF model. The tree is built based on a data matrix (shown within the ellipses). This matrix consists of samples (S1–S10; e.g. individuals) belonging to two classes (encircled crosses or encircled plus signs; e.g. healthy and ill) and measurements for each sample for different variables (V1–V5; e.g. SNPs). Dice: random selection. Dashed lines: randomly selected samples and variables. For each tree, a bootstrap set is created by drawing samples from the data set at random and with replacement until it contains as many samples as the original data set. This random selection will contain about 63% of the samples in the original data set. In this example, the bootstrap set contains seven unique samples (S3–S9; non-selected samples S1, S2 and S10 are faded). For every node (indicated as ellipses), a few variables are randomly selected (here three; the two non-selected variables are shown faded; by default RF selects the square root of the total number of variables) and evaluated for their ability to split the data. The variable resulting in the largest decrease in impurity is chosen to define the splitting rule. At the top node this is V4, and at the second node on the left-hand side this is V2 (indicated with the black arrows). This process is repeated until the nodes are pure (so-called leaves; indicated with round-edged boxes): they contain samples of a single class (encircled crosses or plus signs).
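The ~63% figure in the caption follows directly from sampling n items with replacement: the expected fraction of unique samples is 1 − (1 − 1/n)^n, which tends to 1 − 1/e ≈ 0.632 for large n. A quick numerical check (purely illustrative; any RF implementation performs this bootstrap internally):

```python
# Sketch of the bootstrap step from the caption: sampling n indices with
# replacement leaves about 63% unique samples; the rest are "out-of-bag".
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
bootstrap = rng.integers(0, n, size=n)      # sample indices with replacement
unique_fraction = np.unique(bootstrap).size / n
print(round(unique_fraction, 3))            # ≈ 0.632, i.e. 1 - 1/e
```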
Random Forest use in Life Sciences publications ordered by data type or origin
| Area | Publications | Number of trees varied | mtry varied^a | Tree size^b | Variable importance | Local importance | Proximity | Conditional importance | Variable interactions^c | Alternative voting scheme | RF algorithm modified | RF in pipeline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Genomics | […] | 4 | 5 | 0 | 13 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| Metabolomics | […] | 1 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Proteomics | […] | 1 | 1 | 0 | 4 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| Transcriptomics | […] | 2 | 4 | 1 | 8 | 1 | 3 | 0 | 0 | 0 | 1 | 4 |
| Other | […] | 5 | 6 | 0 | 13 | 0 | 3 | 0 | 0 | 0 | 2 | 2 |
| Total | 58 papers | 13 | 16 | 1 | 41 | 1 | 7 | 0 | 0 | 0 | 3 | 11 |
The number of studies in different application areas that use RF properties or that report adaptations to RF is indicated. The columns 'Number of trees varied' through 'Variable interactions' are RF features; the last three columns are other possible uses of RF. ^a Number of variables to select for the best split at each node. ^b By changing the number of splits or node size. ^c Inferred from tree structure.
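Proximity, one of the RF features tallied in the table, can be computed from any trained forest: two samples are considered close if they often end up in the same leaf. A minimal sketch using scikit-learn's `apply`; the data and parameters are illustrative assumptions, not from the review (Breiman's original definition restricts the count to out-of-bag trees, which is omitted here for brevity):

```python
# Hedged sketch: RF proximity matrix, defined as the fraction of trees in
# which two samples land in the same leaf.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=60, n_features=10, random_state=1)
rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)

leaves = rf.apply(X)                          # shape (n_samples, n_trees)
# proximity[i, j] = share of trees where samples i and j share a leaf
proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

print(proximity.shape)                        # one entry per sample pair
```

The resulting matrix can feed clustering or multidimensional scaling to reveal the sample subsets the review discusses.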
Figure 2: Concept visualization of how relations between variables and samples could be represented following the dissection of the trees in a random forest. In this hypothetical case, a supervised classification was performed on samples from two classes (encircled crosses or encircled plus signs; e.g. healthy individuals or patients). Dissection of the random forest trees might result in the further (unsupervised) distinction of subsets of samples. Top panel: variables (V1–Vn; e.g. SNPs in a GWAS), their values (1 or 0) and interactions. Bottom panel: subsets (separated by the dashed lines) of samples from the pure classes that are predicted by a given interaction between variables. An interpretation example: provided that SNP4 (V4) is present, SNP2 (V2) allows the distinction between two subsets (consisting of healthy individuals 6, 7, 8 and 9 and patients 2, 5 and s). If SNP4 is absent, then the patient samples 1, 3, 4 and t can be classified. In case SNP1 (V1) is absent and SNP5 (V5) is present, a subset of healthy individuals consisting of samples a, b, c and d can be classified. Note that in this example, apparently no subset can be distinguished if SNP1 (V1) is present or SNP5 (V5) is absent.
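The "dissection of the trees" the caption describes can be approximated by walking each tree's node structure, listing the splitting rules (variable and threshold per internal node) from which such variable interactions are read off. A hedged sketch with scikit-learn, on hypothetical data; the traversal is an illustration, not the authors' procedure:

```python
# Illustrative sketch: walking one tree of a forest to list its splitting
# rules, the raw material for reconstructing conditional relations.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
rf = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0).fit(X, y)

tree = rf.estimators_[0].tree_
for node in range(tree.node_count):
    # Internal nodes have distinct children; leaves have none.
    if tree.children_left[node] != tree.children_right[node]:
        print(f"node {node}: split on V{tree.feature[node]} "
              f"<= {tree.threshold[node]:.2f}")
    else:
        print(f"node {node}: leaf, class distribution {tree.value[node][0]}")
```

Paths from the root to pure leaves correspond to conjunctions of variable conditions, i.e. the variable interactions sketched in the top panel of the figure.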