Erdogan Taskesen, Sepideh Babaei, Marcel M J Reinders, Jeroen de Ridder.
Abstract
BACKGROUND: Acute Myeloid Leukemia (AML) is characterized by various cytogenetic and molecular abnormalities. Detecting these abnormalities is important for the risk classification of patients but requires laborious experimentation. Various studies have shown that gene expression profiles (GEP), and the gene signatures derived from them, can be used to predict AML subtypes. Successful prediction has also been achieved using DNA-methylation profiles (DMP). However, no studies have compared classification accuracy and performance between GEP and DMP, nor have any integrated both data types to determine whether predictive power can be improved.
APPROACH: Here, we used 344 well-characterized AML samples for which both gene expression and DNA-methylation profiles are available. We devised three classification strategies (early, late, and no integration of these datasets) and used them to predict AML subtypes with a logistic regression model with Lasso regularization.
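To make the no-integration and early-integration strategies concrete, the following is a minimal sketch of a Lasso-regularized (L1) logistic regression fitted by proximal gradient descent. It is not the authors' implementation: the solver, the function names (`fit_lasso_logistic`, `predict_proba`), the hyperparameters (`lam`, `step`, `n_iter`) and the toy data are all assumptions made for illustration only.

```python
import math

def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink value toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_lasso_logistic(X, y, lam=0.05, step=0.1, n_iter=500):
    """Fit weights and bias by proximal gradient descent.

    lam, step and n_iter are illustrative values, not taken from the paper.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        # Gradient step on the log-loss, then soft-thresholding for the L1 term.
        w = [soft_threshold(wj - step * gj, step * lam)
             for wj, gj in zip(w, grad_w)]
        b -= step * grad_b  # the bias is left unpenalized
    return w, b

def predict_proba(X, w, b):
    """Return the predicted probability of the positive class per sample."""
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]

# Made-up toy data: 4 samples, 2 expression and 2 methylation features each.
gep = [[2.0, 0.1], [1.8, -0.2], [-1.9, 0.0], [-2.1, 0.3]]
dmp = [[0.5, 1.5], [-0.3, 1.7], [0.2, -1.6], [-0.4, -1.4]]
labels = [1, 1, 0, 0]  # 1 = subtype present, 0 = absent

# No integration: one classifier per data type.
w_gep, b_gep = fit_lasso_logistic(gep, labels)
w_dmp, b_dmp = fit_lasso_logistic(dmp, labels)

# Early integration: concatenate both feature sets per sample, then fit once.
early = [g + m for g, m in zip(gep, dmp)]
w_early, b_early = fit_lasso_logistic(early, labels)
```

The L1 penalty drives the weights of uninformative features to exactly zero, which is why such a model can double as a feature selector; with a sufficiently large `lam`, every weight is shrunk to zero.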
Year: 2015 PMID: 25734246 PMCID: PMC4347619 DOI: 10.1186/1471-2105-16-S4-S5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Classification performance for different integration strategies. (A) Classification accuracy based on the DLCV scheme (F-score). (B) Classification performance based on the DLCV scheme (AUC). (C) Resulting -log10(P-values) of the global test based on the training and test sets. (D) Number of features extracted by the logistic regression model from the training subsets. Note that late integration is based on the features extracted separately from GEP and from DMP.
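As a reminder of what the two performance measures in panels A and B quantify, here is a minimal sketch of both for a binary subtype label. It assumes the balanced F1 variant of the F-score and a 0.5 decision threshold, neither of which is stated in the record; the function names and the toy scores are hypothetical.

```python
def f_score(y_true, y_pred):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive sample scores higher than a random negative one (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up classifier outputs for six samples.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.1, 0.8, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(f_score(y_true, y_pred))  # 2 TP, 1 FP, 1 FN
print(auc(y_true, scores))      # 8 of 9 positive/negative pairs ranked correctly
```

The F-score depends on the chosen threshold, whereas the AUC summarizes ranking quality over all thresholds, which is one reason to report both.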
Figure 2. Classification results on the validation sets and their corresponding ROC curves. Top: scatter plots of the logistic regression classifier (first layer) outcomes on a validation subset for subtypes (A) 7q (subset 5), (B) NPM1 (subset 2), and (C) FLT3/NPM1 (subset 3). Red and blue solid lines indicate the classification boundaries of the first layer (GEP and DMP, respectively). The purple solid line indicates the classification boundary of the second layer. Black-circled points indicate samples that are misclassified when the first layer is trained on GEP or DMP only; these are, however, correctly classified when the second layer is incorporated. Bottom: ROC curves corresponding to the scatter plots.
Figure 3. Schematic overview of the classification approach along with the integration strategies. The left part shows the no-integration and early-integration strategies, in which the logistic regression classifier is trained on GEP or DMP only, or on GEP+DMP, respectively. In the DLCV scheme we split the input data into five subsets. The classifier was trained by means of a 5-fold cross-validation approach using four subsets for training and testing and one for validation. The right part shows the late-integration procedure, in which the nearest mean classifier (NMC) (i.e. the second layer) was trained on the new two-dimensional data that represent the first-layer outcomes for the GEP and DMP sets. The second layer was evaluated on the first-layer outcomes of the validation subset. The reported performance is the average of the classification performance on the 5 validation subsets.
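The late-integration step described in this caption can be sketched as follows: each sample becomes a two-dimensional point (GEP first-layer output, DMP first-layer output), and a nearest mean classifier assigns the class whose mean is closest. This is a minimal illustration under stated assumptions, not the authors' code: the helper names (`outer_splits`, `fit_nmc`, `predict_nmc`), the unshuffled and unstratified fold assignment, the Euclidean distance and the toy first-layer outputs are all hypothetical.

```python
def outer_splits(n_samples, k=5):
    """Yield (train_idx, validation_idx), holding out each of k folds once.

    Folds are assigned round-robin without shuffling or stratification,
    which is a simplification made for this sketch.
    """
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for held_out in range(k):
        train = [i for f in range(k) if f != held_out for i in folds[f]]
        yield sorted(train), folds[held_out]

def fit_nmc(points, labels):
    """Nearest mean classifier: store the mean vector of each class."""
    sums, counts = {}, {}
    for point, label in zip(points, labels):
        acc = sums.setdefault(label, [0.0] * len(point))
        for j, value in enumerate(point):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}

def predict_nmc(means, points):
    """Assign each point to the class with the closest mean (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(means, key=lambda label: sq_dist(means[label], p))
            for p in points]

# Made-up first-layer outputs: (GEP probability, DMP probability) per sample.
first_layer_train = [(0.9, 0.8), (0.7, 0.4), (0.6, 0.9),
                     (0.2, 0.3), (0.4, 0.1), (0.1, 0.2)]
train_labels = [1, 1, 1, 0, 0, 0]
means = fit_nmc(first_layer_train, train_labels)

# Validation samples on which a single data type at a 0.5 threshold would err.
first_layer_val = [(0.45, 0.8), (0.55, 0.1)]
print(predict_nmc(means, first_layer_val))  # second-layer decisions
```

In this toy case the first validation sample falls below 0.5 on GEP alone and the second falls above it, yet the second layer assigns each to the class supported by the combined evidence, mirroring the kind of correction that late integration is meant to provide.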