Putri W Novianti, Kit C B Roes, Marinus J C Eijkemans.
Abstract
Classification methods used in microarray studies of gene expression differ both in how they deal with the underlying complexity of the data and in the technique used to build the classification model. The MAQC II study on cancer classification problems found that performance was affected by factors such as the classification algorithm, cross validation method, number of genes, and gene selection method. In this paper, we study the hypothesis that the disease under study significantly determines which method is optimal, and that sample size, class imbalance, type of medical question (diagnostic, prognostic or treatment response), and microarray platform are also potentially influential. A systematic literature review was used to extract the information from 48 published articles on non-cancer microarray classification studies. The impact of the various factors on the reported classification accuracy was analyzed through random-intercept logistic regression. The type of medical question and the method of cross validation dominated the explained variation in accuracy among studies, followed by disease category and microarray platform. In total, 42% of the between-study variation was explained by all the study-specific and problem-specific factors that we studied.
Year: 2014 PMID: 24770439 PMCID: PMC4000205 DOI: 10.1371/journal.pone.0096063
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Overview of the studied data.
| Study | Classification model(s) | Study factor 1 | … | Study factor 8 | Classification model accuracy |
| 1 | | | | | |
| 2 | | | | | |
| 3 | | | | | |
| 4 | | | | | |
| 5 | | | | | |
| 6 | | | | | |
| 7 | | | | | |
| 7 | | | | | |
| 7 | | | | | |
| 7 | | | | | |
| … | … | … | … | … | … |
| 45 | | | | | |
| 46 | | | | | |
| 47 | | | | | |
| 48 | | | | | |
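Each row corresponds to classification model j in study i. A study that reported several models (such as study 7) contributes one row per model.
For each row, the accuracy is recorded as two counts: the number of samples correctly classified by classification model j in study i,
and the number of samples misclassified by classification model j in study i.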
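The layout described above, one record per classification model per study with a count of correct and incorrect classifications, can be sketched as a small data structure. This is only an illustrative sketch: the class name `ModelResult`, its field names, the model labels, and the counts are all hypothetical and are not taken from the reviewed studies.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    """One row of the overview table (hypothetical field names)."""
    study: int            # study index i
    model: str            # label for classification model j
    n_correct: int        # correctly classified samples
    n_misclassified: int  # misclassified samples

    @property
    def accuracy(self) -> float:
        """Proportion of samples classified correctly."""
        total = self.n_correct + self.n_misclassified
        if total == 0:
            raise ValueError("a model must be evaluated on at least one sample")
        return self.n_correct / total


# Made-up counts for illustration only.
results = [
    ModelResult(study=7, model="A", n_correct=45, n_misclassified=5),
    ModelResult(study=7, model="B", n_correct=40, n_misclassified=10),
]
accuracies = [r.accuracy for r in results]
```

Keeping the two counts rather than only the accuracy preserves the denominator, which a binomial-type model such as logistic regression needs.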
Figure 1. The PRISMA workflow diagram of the literature search.
The diagram represents the literature search process. Details of each step can be found in Material S1.
Characteristics of 48 fully reviewed studies.
| Study characteristics | Number of studies |
| Publication year | |
| 2005–2007 | 15 |
| 2008–2010 | 19 |
| 2011–2013 | 14 |
| Disease type | |
| Inflammatory disorder | 17 |
| Immune disease | 5 |
| Degenerative disease | 6 |
| Infection | 12 |
| Mental disorder | 5 |
| Other | 3 |
| Microarray platform (color system) | |
| One-color | 35 |
| Two-color | 13 |
| Medical question | |
| Diagnostic | 29 |
| Prognostic | 6 |
| Response-to-treatment | 13 |
| Cross validation technique | |
| Single | 19 |
| Nested | 29 |
| Gene selection method | |
| Filter | 13 |
| Wrapper | 16 |
| Embedded | 19 |
| Classification method* | |
| Interaction | 37 |
| No-interaction | 24 |
*Some studies used more than one classifier.
Individual random intercept logistic regression.
| Study factor | Df | AIC | P value |
| Class imbalance level | 1 | 142.3 | 0.18 |
| Sample size | 1 | 142.9 | 0.23 |
| Microarray platform (color system) | 1 | 142.8 | 0.22 |
| Medical question | 2 | 144.1 | 0.33 |
| Disease type | 5 | 148.3 | 0.66 |
| Cross validation technique | 1 | 144.0 | 0.66 |
| Gene selection method | 2 | 144.7 | 0.45 |
| Classification method | 1 | 144.3 | 0.90 |
| The number of genes in final model | 1 | 144.3 | 0.78 |
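As a reading aid for the table above, a random-intercept logistic regression of the kind named in the caption can be written as follows. The notation is an assumption chosen for illustration (it is not quoted from the paper): $y_{ij}$ and $n_{ij}$ are the number of correctly classified samples and the total number of samples for model $j$ in study $i$, $\mathbf{x}_{ij}$ holds the study factor(s) entered in the model, and $u_i$ is the study-level random intercept.

```latex
\begin{aligned}
y_{ij} \mid u_i &\sim \mathrm{Binomial}(n_{ij},\, p_{ij}), \\
\operatorname{logit}(p_{ij}) &= \beta_0 + \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + u_i, \\
u_i &\sim \mathcal{N}(0,\, \sigma_u^2).
\end{aligned}
```

In this formulation, $\sigma_u^2$ captures the between-study variation in accuracy that the study factors do not account for.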
Backward elimination in multiple random intercept logistic regression.
| Step | Study factors in the model | Df | AIC | P value |
| 1 | Class imbalance level | 1 | 155.95 | 0.128 |
| | Sample size | 1 | 153.65 | 0.928 |
| | Microarray platform (color system) | 1 | 154.85 | 0.271 |
| | Medical question | 2 | 160.58 | 0.011 |
| | Disease type* | | | |
| | Cross validation technique | 1 | 157.56 | 0.048 |
| | Gene selection method | 2 | 152.72 | 0.582 |
| | Classification method | 1 | 153.64 | 0.950 |
| | The number of genes in final model | 1 | 153.91 | 0.602 |
| 2 | Class imbalance level | 1 | 148.52 | 0.233 |
| | Sample size* | | | |
| | Microarray platform (color system) | 1 | 150.62 | 0.061 |
| | Medical question | 2 | 151.90 | 0.033 |
| | Cross validation technique | 1 | 150.80 | 0.054 |
| | Gene selection method | 2 | 148.90 | 0.149 |
| | Classification method | 1 | 147.18 | 0.773 |
| | The number of genes in final model | 1 | 147.61 | 0.476 |
| 3 | Class imbalance level | 1 | 146.53 | 0.232 |
| | Microarray platform (color system) | 1 | 149.07 | 0.046 |
| | Medical question | 2 | 149.90 | 0.033 |
| | Cross validation technique | 1 | 149.41 | 0.038 |
| | Gene selection method | 2 | 147.63 | 0.104 |
| | Classification method* | 1 | | |
| | The number of genes in final model | 1 | 145.62 | 0.473 |
| 4 | Class imbalance level | 1 | 144.58 | 0.239 |
| | Microarray platform (color system) | 1 | 147.32 | 0.042 |
| | Medical question | 2 | 147.99 | 0.033 |
| | Cross validation technique | 1 | 147.85 | 0.031 |
| | Gene selection method | 2 | 145.77 | 0.101 |
| | The number of genes in final model* | 1 | | |
| 5 | Class imbalance level | 1 | 142.76 | 0.315 |
| | Microarray platform (color system) | 1 | 145.84 | 0.043 |
| | Medical question | 2 | 145.99 | 0.044 |
| | Cross validation technique | 1 | 146.02 | 0.039 |
| | Gene selection method | 2 | 144.08 | 0.115 |
AIC: the AIC of the multivariable random-effects logistic regression model if the corresponding study factor is deleted. The AIC of the full model is 155.7.
*The study factor that gave the lowest AIC and was excluded from the model in the next step.
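The elimination procedure summarized in the table (at each step, drop the factor whose deletion gives the lowest AIC) can be sketched generically. This is a minimal sketch, not the authors' code: the function `backward_eliminate`, the callback `aic_of`, and the toy AIC are hypothetical, and the stopping rule (stop once no deletion lowers the AIC) is an assumption, since the source does not state one. In practice `aic_of` would refit the random-intercept model on the given factor set.

```python
from typing import Callable, FrozenSet, List, Tuple


def backward_eliminate(
    factors: List[str],
    aic_of: Callable[[FrozenSet[str]], float],
) -> Tuple[List[str], List[str]]:
    """Return (kept factors, removed factors in order of removal)."""
    current = frozenset(factors)
    current_aic = aic_of(current)
    removed: List[str] = []
    while current:
        # AIC of each candidate model with exactly one factor deleted.
        # Iterating in the original order makes tie-breaking deterministic.
        candidates = {f: aic_of(current - {f}) for f in factors if f in current}
        best = min(candidates, key=candidates.get)
        if candidates[best] >= current_aic:
            break  # assumed stopping rule: no deletion improves the AIC
        current = current - {best}
        current_aic = candidates[best]
        removed.append(best)
    kept = [f for f in factors if f in current]
    return kept, removed


def toy_aic(model: FrozenSet[str]) -> float:
    """Toy stand-in for a model fit: each factor costs 2, useful ones gain 10."""
    useful = {"a", "b"}
    return 100.0 + 2 * len(model) - 10 * len(model & useful)


kept, removed = backward_eliminate(["a", "b", "c", "d"], toy_aic)
```

With the toy criterion, the two uninformative factors are dropped one at a time and the two useful ones are retained.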
Figure 2. The relative explained variation of the study factors.
The x-axis represents the relative explained variation for each study factor, while the y-axis shows the study factors. Table S2 provides more details on the relative explained variation of each study factor.
Figure 3. Boxplots of sample size in the training data, proportional sample size (class imbalance level), and the number of genes included in the final classification models, grouped by the “Medical question” study factor.
The number of data points (average accuracy) clustered by “medical question” and “cross validation technique”.
| | Medical question | | |
| Cross validation technique | Diagnostic | Prognostic | Response to treatment |
| Single | 10 (0.93) | 5 (0.90) | 13 (0.83) |
| Nested | 27 (0.87) | 2 (0.90) | 4 (0.76) |
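To compare the medical questions irrespective of cross validation technique, the cell averages above can be pooled by weighting each by its number of data points. The sketch below does only that arithmetic; the variable and function names are hypothetical, and because the reported averages are rounded to two decimals, the pooled values are approximate.

```python
# (number of data points, average accuracy) for Single and Nested
# cross validation, copied from the table above.
cells = {
    "Diagnostic": [(10, 0.93), (27, 0.87)],
    "Prognostic": [(5, 0.90), (2, 0.90)],
    "Response to treatment": [(13, 0.83), (4, 0.76)],
}


def pooled_mean(groups):
    """Mean accuracy over all data points, weighting each cell by its count."""
    n_total = sum(n for n, _ in groups)
    return sum(n * acc for n, acc in groups) / n_total


pooled = {question: round(pooled_mean(g), 3) for question, g in cells.items()}
n_points = {question: sum(n for n, _ in g) for question, g in cells.items()}
```

The counts sum to 61 data points, which matches the 61 classifier entries in the study characteristics table, where some studies used more than one classifier.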