Putri W Novianti, Victor L Jong, Kit C B Roes, Marinus J C Eijkemans.
Abstract
BACKGROUND: Aggregating gene expression data across experiments via meta-analysis is expected to increase the precision of the effect estimates and to increase the statistical power to detect a certain fold change. This study evaluates the potential benefit of using a meta-analysis approach as a gene selection method prior to predictive modeling in gene expression data.
Keywords: Acute myeloid leukemia; Gene expression; Meta-analysis; Predictive modeling
Year: 2017 PMID: 28399794 PMCID: PMC5387259 DOI: 10.1186/s12859-017-1619-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Data division used to build cross-platform classification models, and the characteristics of the data. (#: the number)
An approach to building and validating classification models using meta-analysis as a gene selection technique
| 1. Data collection |
| Collect raw gene expression datasets, which may come from previous experiments and/or a systematic search of online repositories. |
| 2. Data preparation |
| (i) Individually preprocess raw gene expression datasets (i.e. normalization, background correction, log2 transformation). |
| (ii) Divide |
| 3. Meta-analysis for gene selection |
| (i) For each probeset, aggregate expression values from SET1 to get a signature list via random-effects meta-analysis. |
| (ii) Record significant probesets (also referred to as informative probesets). |
| 4. Predictive modeling |
| (i) In SET2, include the informative probesets resulting from Step 3. |
| (ii) Divide samples in SET2 into a learning set and a testing set. |
| (iii) Perform cross-validation when building the classification models. |
| (iv) Evaluate the optimal predictive models on the testing set. |
| 5. External validation |
| (i) In SET3, include the probesets found informative in Step 3. |
| (ii) Scale gene expression values in SET3 with SET2 as a reference. |
| (iii) Validate the classification models from Step 4 on the scaled gene expression data in SET3. |
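Step 3 of the approach above can be sketched for a single probeset as follows. This is a minimal illustration, not the authors' implementation: it assumes each study in SET1 contributes one effect size (for example a standardized mean difference between classes) with its sampling variance, and it uses the DerSimonian-Laird estimator of between-study variance, which the source does not name. The function names, the normal-approximation p-value, and the `alpha` cutoff are hypothetical choices made for the example.

```python
import math
from statistics import NormalDist


def random_effects_meta(effects, variances):
    """Pool per-study effect sizes for one probeset (DerSimonian-Laird).

    effects   : per-study effect sizes (assumed: standardized mean differences)
    variances : within-study sampling variances of those effects
    Returns (pooled_effect, standard_error, two_sided_p_value, tau2).
    """
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")

    # Fixed-effect (inverse-variance) weights and pooled estimate.
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw

    # Cochran's Q and the moment estimate of between-study variance tau^2.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0

    # Random-effects weights add tau^2 to every within-study variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    sw_star = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sw_star
    se = math.sqrt(1.0 / sw_star)

    # Two-sided p-value from a normal approximation (an assumption here).
    z = pooled / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return pooled, se, p, tau2


def select_informative(stats_by_probeset, alpha=0.05):
    """Return probeset IDs whose pooled effect has p < alpha.

    stats_by_probeset maps a probeset ID to (effects, variances).
    alpha is a hypothetical cutoff; no multiple-testing correction is applied.
    """
    return [pid for pid, (effects, variances) in stats_by_probeset.items()
            if random_effects_meta(effects, variances)[2] < alpha]
```

In practice a multiple-testing adjustment would likely be applied across probesets before declaring any of them informative; the sketch leaves that out to keep the pooling step visible.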
Parameters to generate simulated gene expression datasets
| Simulation ID | n | Δ | ρ | DEGMA a | DEGIND b |
|---|---|---|---|---|---|
| 1 | 50 | 0.1 | 0.75 | 12 | 72 |
| 2 | 50 | 0.5 | 0.5 | 57 | 34 |
| 3 | 50 | 0.75 | 0.25 | 70 | 62 |
| 4 | 100 | 0.1 | 0.75 | 12 | 14 |
| 5 | 100 | 0.5 | 0.5 | 53 | 56 |
| 6 | 100 | 0.75 | 0.25 | 67 | 50 |
| 7 | 150 | 0.1 | 0.75 | 15 | 23 |
| 8 | 150 | 0.5 | 0.5 | 52 | 26 |
| 9 | 150 | 0.75 | 0.25 | 58 | 57 |
Symbols. n: the number of samples in each generated dataset; Δ: the log2 fold changes of differentially expressed (DE) genes. ρ: pairwise correlation of DE genes
a The number of genes declared differentially expressed (DE) by the MA approach from 50 cumulative studies. All the selected genes are true positives
b The number of true DE genes among the top-100 DE genes selected by the limma procedure
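To make the roles of n, Δ and ρ concrete, here is a minimal sketch of one way such a dataset could be generated. It is not the authors' simulation procedure: the one-factor construction for the correlation, the unit variances, the balanced two-class design, and the gene counts `n_de` and `n_null` are all assumptions introduced for illustration.

```python
import math
import random


def simulate_dataset(n, delta, rho, n_de=100, n_null=900, seed=None):
    """Simulate log2 expression values for n samples in two balanced classes.

    DE genes are shifted by `delta` (log2 fold change) in class 1 and share a
    pairwise correlation `rho` through one common latent factor. Null genes
    are independent standard normals. n_de and n_null are hypothetical.
    Returns (X, y) where X[i][j] is sample i, gene j (DE genes come first).
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be in [0, 1] for the one-factor construction")
    rng = random.Random(seed)
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)

    X, y = [], []
    for i in range(n):
        label = i % 2                    # alternate labels for a balanced design
        shared = rng.gauss(0.0, 1.0)     # latent factor shared by all DE genes
        # a*shared + b*noise has unit variance and pairwise correlation rho.
        de = [a * shared + b * rng.gauss(0.0, 1.0) + delta * label
              for _ in range(n_de)]
        null = [rng.gauss(0.0, 1.0) for _ in range(n_null)]
        X.append(de + null)
        y.append(label)
    return X, y
```

For example, `simulate_dataset(50, 0.1, 0.75, seed=1)` would correspond to the n, Δ and ρ values of Simulation 1 in the table, under the assumptions stated above.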
Fig. 2 The distribution of expression values after the pre-processing step for the first three samples in six experiments. The expression values are on the log2 scale
Fig. 3 Plot of the difference in classification model accuracy between the MA- and individual-classification approaches, when Data1 was used as the training data
Fig. 4 Plot of the difference in classification model accuracy between the MA- and individual-classification approaches in the simulated datasets, when Δ = 0.1, γ = 0.75 and (a) n = 50 (Simulation 1) (b) n = 100 (Simulation 4) (c) n = 150 (Simulation 7). These simulation parameters produced the less informative datasets
Results of the random effects models
| Factors | Coefficient | CI LL | CI UL | σ0 (classification model) | CI LL | CI UL | σ0 (simulation scenario) | CI LL | CI UL | σ0 (number of studies) | CI LL | CI UL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| n | 0.0005 | -0.0005 | 0.0009 | 0.0244 | 0.0165 | 0.0404 | 0.0489 | 0.0289 | 0.0759 | 0.000 | 0.000 | 0.0039 |
| Δ | -0.1169 | -0.2041 | -0.0285 | 0.0245 | 0.0163 | 0.0402 | 0.0359 | 0.0159 | 0.0405 | 0.000 | 0.000 | 0.0039 |
| ρ | 0.1489 | 0.0295 | 0.2636 | 0.0245 | 0.0165 | 0.0405 | 0.0369 | 0.0022 | 0.0579 | 0.000 | 0.000 | 0.0039 |
Each factor was evaluated individually in the random-effects linear regression model. The coefficients were back-transformed to the original scale of the difference in classification model accuracy between the MA- and individual-classification approaches
Abbreviations: LL lower limit, UL upper limit
Symbols: n: the number of samples in each generated dataset; Δ: the log2 fold change of differentially expressed (DE) genes; ρ: pairwise correlation of DE genes. The three σ0 terms are the standard deviations of the random intercepts with respect to, in order, the classification model, the scenario in the simulation study, and the number of studies used for selecting relevant features via the meta-analysis approach. See the Methods section for more details regarding the random-effects models