Aneta Polewko-Klim, Wojciech Lesiński, Krzysztof Mnich, Radosław Piliszek, Witold R Rudnicki.
Abstract
BACKGROUND: Modern experimental techniques deliver data sets containing profiles of tens of thousands of potential molecular and genetic markers that can be used to improve medical diagnostics. Previous studies performed with three different experimental methods on the same set of neuroblastoma patients create an opportunity to examine whether augmenting gene expression profiles with information on copy number variation can improve predictions of patient survival. We propose a methodology based on a comprehensive cross-validation protocol that includes feature selection within the cross-validation loop and classification using machine learning. We also test the dependence of the results on the feature selection process using four different feature selection methods.
Keywords: Copy number variation; Feature selection; Gene expression; Machine learning; Neuroblastoma; Random forest; Synergy
Year: 2018 PMID: 30236139 PMCID: PMC6148774 DOI: 10.1186/s13062-018-0222-9
Source DB: PubMed Journal: Biol Direct ISSN: 1745-6150 Impact factor: 4.540
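The protocol outlined in the abstract, with feature selection repeated inside every cross-validation fold, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the Welch t-statistic filter and the nearest-centroid classifier are simple stand-ins (the keywords name random forest as the classifier), and all function names are hypothetical.

```python
import random
from statistics import mean, variance


def welch_t(a, b):
    """Absolute Welch t-statistic of two samples (0.0 if both are constant).

    Assumes each sample holds at least two values.
    """
    se2 = variance(a) / len(a) + variance(b) / len(b)
    return abs(mean(a) - mean(b)) / se2 ** 0.5 if se2 > 0 else 0.0


def top_k_features(X, y, k):
    """Rank features by |t| on the given rows only; return the best k column indices."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append((welch_t(pos, neg), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]


def centroid_predict(X_train, y_train, X_test, features):
    """Stand-in classifier (NOT a random forest): pick the nearer class centroid."""
    centroids = {}
    for c in (0, 1):
        rows = [r for r, label in zip(X_train, y_train) if label == c]
        centroids[c] = [mean(r[j] for r in rows) for j in features]

    def dist(row, c):
        return sum((row[j] - m) ** 2 for j, m in zip(features, centroids[c]))

    return [0 if dist(r, 0) <= dist(r, 1) else 1 for r in X_test]


def cross_validate(X, y, k_features=2, n_folds=5, seed=0):
    """Return out-of-fold predictions; features are re-selected inside every fold."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    preds = [None] * len(X)
    for f in range(n_folds):
        test = idx[f::n_folds]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        # The validation rows never influence which features are kept.
        feats = top_k_features(X_tr, y_tr, k_features)
        fold_pred = centroid_predict(X_tr, y_tr, [X[i] for i in test], feats)
        for i, p in zip(test, fold_pred):
            preds[i] = p
    return preds
```

Selecting features once on the full data and cross-validating only the classifier would let the validation rows leak into the model, which is the bias that selection inside the loop is meant to avoid.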
Informative variables discovered by three filtering methods in all data sets
| Variables | CNV-145 | MA-145 | G-145 | J-145 | T-145 | MA-498 | G-498 | J-498 | T-498 |
|---|---|---|---|---|---|---|---|---|---|
| All | 39115 | 43349 | 60778 | 340414 | 263538 | 43291 | 60778 | 340414 | 263538 |
| Used* | 39114 | 43349 | 40660 | 340414 | 208856 | 43291 | 41104 | 340414 | 208058 |
| T-test | 5 | 1152 | 1096 | 2738 | 3726 | 6420 | 8180 | 37011 | 38324 |
| IG-1D | 25 | 900 | 1008 | 1825 | 2844 | 6364 | 9690 | 36915 | 46169 |
| IG-2D | 37 | 807 | 878 | 1457 | 2445 | 11307 | 11243 | 44927 | 54987 |
*Many markers in the gene and transcript series of the RNA-seq data are incomplete, with data missing for most patients. Only markers for which at least 50% of records are non-zero in both decision classes were included in the study
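The inclusion rule in the footnote can be illustrated with a short sketch. The data layout (a dictionary mapping marker names to per-patient values) and the function names are assumptions made for this example only.

```python
def keep_marker(values, labels, min_nonzero=0.5):
    """True if at least `min_nonzero` of the records are non-zero in every class."""
    for c in set(labels):
        in_class = [v for v, label in zip(values, labels) if label == c]
        nonzero = sum(1 for v in in_class if v != 0)
        if nonzero < min_nonzero * len(in_class):
            return False
    return True


def filter_markers(table, labels, min_nonzero=0.5):
    """`table` maps marker name -> per-patient values (hypothetical layout)."""
    return {name: vals for name, vals in table.items()
            if keep_marker(vals, labels, min_nonzero)}
```

A marker is dropped as soon as one decision class falls below the threshold, even if the other class is fully populated.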
Fig. 2 Venn plot for sets of informative features identified in the CNV-145 (left panel) and MA-145 (right panel) data sets. There is little overlap between the informative features identified by the three methods for the CNV data. In particular, only one variable is recognised as relevant by all three filtering methods. The agreement for gene expression is much higher: for each method, the variables shared with at least one other method make up more than 68% of all variables identified as relevant by that method
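The two overlap statistics quoted in the caption (variables common to all methods, and the share of each method's variables found by at least one other method) can be computed from plain sets. The sketch below uses hypothetical names and assumes each method's selection is available as a set of variable identifiers.

```python
def shared_fraction(selected):
    """For each method: fraction of its variables also picked by any other method."""
    result = {}
    for name, own in selected.items():
        others = set().union(*(s for n, s in selected.items() if n != name))
        result[name] = len(own & others) / len(own) if own else 0.0
    return result


def common_to_all(selected):
    """Variables selected by every method."""
    return set.intersection(*selected.values())


# Toy selections, not data from the study.
toy = {
    "T-test": {"g1", "g2", "g3", "g4"},
    "IG-1D": {"g1", "g2", "g5"},
    "IG-2D": {"g1", "g6"},
}
print(shared_fraction(toy))
print(common_to_all(toy))
```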
Informative genes that were identified as most relevant in MA-145 and G-145 data sets
| Gene | MA-145 T-test rank | MA-145 IG-1D rank | MA-145 IG-2D rank | G-145 T-test rank | G-145 IG-1D rank | G-145 IG-2D rank |
|---|---|---|---|---|---|---|
| PGM2L1 | 1 | 4 | 1 | 1 | 3 | 1 |
| SLC22A4 | 2 | 20 | 23 | 2 | 13 | 12 |
| PRKACB | 29 | 1 | 19 | 34 | 4 | 2 |
| DOC2B | 28 | 7 | 13 | - | 37 | 20 |
| PIK3R1 | - | 21 | 9 | - | 10 | 6 |
| NTRK1 | - | 17 | 8 | - | 32 | 32 |
| NRCAM | 4 | 12 | - | - | 19 | - |
| ALDH3A2 | 6 | - | 47 | 5 | - | - |
| DST | - | - | 3 | - | 43 | - |
| A_32_P30874 | 32 | 2 | 2 | - | - | - |
| Hs23691.1 | 11 | 29 | 4 | - | - | - |
| PLXNA4A | 45 | 3 | 15 | - | - | - |
| HSD17B3 | 3 | 34 | - | - | - | - |
| ACN9 | - | - | - | 7 | 5 | 4 |
| Slartoybo | - | - | - | 19 | 1 | 5 |
| LOC100289222 | - | - | - | 18 | 20 | 3 |
| Sneyga | - | - | - | - | 2 | 10 |
| SPRED3 | - | - | - | 3 | 42 | - |
| Jardarby | - | - | - | 4 | - | - |
All genes that were ranked in the top 10 most relevant by any filtering method in either data set are shown. The numbers in each column correspond to the ranks achieved by genes in a data set processed by one of the three filtering methods. Genes present in the top 50 variables in both data sets are shown first, followed by those present in the top 50 only in the MA-145 data set, and then by those present in the top 50 only in the G-145 data set
Fig. 1 Venn plot for top 50 informative features identified in MA-145 (left panel) and G-145 (right panel) data sets
Informative genes that were identified as most relevant in the CNV data set
| Gene | Rank by T-test | Rank by IG-1D | Rank by IG-2D |
|---|---|---|---|
| ZNF644 | 2 | 4 | 19 |
| ZZZ3 | - | 1 | 2 |
| TMED5 | - | 10 | 1 |
| PLEK2 | 1 | 15 | - |
| QKI | 4 | 16 | - |
| A_14_P117576 | 5 | 21 | - |
| KIAA0090 | - | - | 3 |
| ANKRD13C | - | 3 | - |
| FNDC1 | 3 | - | - |
| GUCA2B | - | - | 4 |
| C1orf160 | - | - | 5 |
| LPHN2 | - | 5 | - |
The numbers in each column correspond to the ranks achieved by genes processed by one of the three filtering methods: T-test, IG-1D or IG-2D. All genes that were ranked in the top 5 most relevant by any of the methods are displayed
Aggregate results for all models based on gene expression
| Cohort size | MA | G | J | T |
|---|---|---|---|---|
| Max | ||||
| 145 | 0.674 | 0.672 | 0.606 | 0.625 |
| 498 | 0.545 | 0.556 | 0.543 | 0.543 |
| Average | ||||
| 145 | 0.634 | 0.629 | 0.556 | 0.569 |
| 498 | 0.535 | 0.538 | 0.525 | 0.524 |
Maximum and average MCC obtained for all fully cross-validated models built for each data series are displayed for both cohort sizes
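For reference, the MCC (Matthews correlation coefficient) reported in these tables can be computed from the four cells of a binary confusion matrix. The function below is a generic sketch of the standard formula with a hypothetical name; returning 0.0 when the denominator vanishes is a common convention assumed here, not something stated in the source.

```python
from math import sqrt


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: undefined MCC (empty row or column) maps to 0.0.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

MCC ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (perfect prediction), and unlike plain accuracy it stays informative when the two classes are unbalanced.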
Model quality measured with MCC coefficient for the MA-145 data set
| FS method | OOB Top 10 | OOB Top 20 | OOB Top 50 | OOB Top 100 | Cross-validation Top 10 | Cross-validation Top 20 | Cross-validation Top 50 | Cross-validation Top 100 |
|---|---|---|---|---|---|---|---|---|
| Stage 1 | ||||||||
| T-test | 0.642 | 0.707 | 0.700 | 0.720 | ||||
| IG-1D | 0.753 | 0.728 | 0.738 | 0.736 | ||||
| IG-2D | 0.713 | 0.747 | 0.730 | 0.726 | ||||
| T-test + lasso | 0.744 | 0.869 | 0.820 | 0.822 | ||||
| Stage 2 | ||||||||
| T-test | 0.622 | 0.683 | 0.693 | 0.713 | 0.636 | 0.698 | 0.709 | 0.730 |
| IG-1D | 0.732 | 0.721 | 0.729 | 0.730 | 0.741 | 0.727 | 0.737 | 0.737 |
| IG-2D | 0.695 | 0.734 | 0.721 | 0.721 | 0.699 | 0.743 | 0.731 | 0.729 |
| T-test + lasso | 0.732 | 0.854 | 0.808 | 0.809 | 0.750 | 0.868 | 0.831 | 0.832 |
| Stage 3 | ||||||||
| T-test | 0.655 | 0.691 | 0.714 | 0.724 | 0.576 | 0.606 | 0.647 | 0.665 |
| IG-1D | 0.735 | 0.742 | 0.747 | 0.748 | 0.605 | 0.638 | 0.661 | 0.674 |
| IG-2D | 0.705 | 0.721 | 0.730 | 0.734 | 0.609 | 0.636 | 0.655 | 0.670 |
| T-test + lasso | 0.780 | 0.831 | 0.808 | 0.820 | 0.651 | 0.663 | 0.648 | 0.643 |
Model quality measured with MCC coefficient for the G-145 data set
| FS method | OOB Top 10 | OOB Top 20 | OOB Top 50 | OOB Top 100 | Cross-validation Top 10 | Cross-validation Top 20 | Cross-validation Top 50 | Cross-validation Top 100 |
|---|---|---|---|---|---|---|---|---|
| Stage 1 | ||||||||
| T-test | 0.703 | 0.720 | 0.719 | 0.713 | ||||
| IG-1D | 0.788 | 0.784 | 0.801 | 0.796 | ||||
| IG-2D | 0.793 | 0.768 | 0.738 | 0.763 | ||||
| T-test + lasso | 0.790 | 0.834 | 0.862 | 0.861 | ||||
| Stage 2 | ||||||||
| T-test | 0.683 | 0.709 | 0.714 | 0.704 | 0.692 | 0.719 | 0.716 | 0.712 |
| IG-1D | 0.732 | 0.767 | 0.766 | 0.762 | 0.741 | 0.782 | 0.778 | 0.774 |
| IG-2D | 0.729 | 0.733 | 0.743 | 0.759 | 0.743 | 0.753 | 0.756 | 0.771 |
| T-test + lasso | 0.792 | 0.827 | 0.848 | 0.848 | 0.802 | 0.840 | 0.865 | 0.867 |
| Stage 3 | ||||||||
| T-test | 0.689 | 0.713 | 0.723 | 0.724 | 0.590 | 0.621 | 0.650 | 0.653 |
| IG-1D | 0.750 | 0.771 | 0.774 | 0.770 | 0.589 | 0.626 | 0.661 | 0.672 |
| IG-2D | 0.738 | 0.755 | 0.760 | 0.755 | 0.585 | 0.621 | 0.650 | 0.661 |
| T-test + lasso | 0.829 | 0.832 | 0.853 | 0.854 | 0.599 | 0.661 | 0.655 | 0.638 |
Fig. 3 Distribution of the fraction of correctly classified objects. For each object, the position on the y axis corresponds to the fraction of times this object was correctly predicted in cross-validation
Fig. 4 Distribution of MCC obtained in 400 cross-validation runs at Stage 3 of the modelling pipeline. Each point represents the MCC value obtained for an RF classifier's prediction on the validation set in the cross-validation loop. Each RF classifier was built on a different training set constructed in the cross-validation loop, using the variables selected as most relevant for that training set. Values for the G-145, CNV, MA-145, and MA+CNV data sets are presented from left to right. Each box plot represents the distribution of the points to its left
Model quality measured with MCC coefficient for the CNV-145 data set
| FS method | OOB Top 10 | OOB Top 20 | OOB Top 50 | OOB Top 100 | Cross-validation Top 10 | Cross-validation Top 20 | Cross-validation Top 50 | Cross-validation Top 100 |
|---|---|---|---|---|---|---|---|---|
| Stage 1 | ||||||||
| T-test | 0.566 | - | - | - | ||||
| IG-1D | 0.569 | 0.646 | 0.642 | 0.646 | ||||
| IG-2D | 0.460 | 0.484 | 0.491 | 0.492 | ||||
| T-test + lasso | 0.568 | - | - | - | ||||
| Stage 2 | ||||||||
| T-test | 0.534 | - | - | - | 0.545 | - | - | - |
| IG-1D | 0.544 | 0.624 | 0.617 | 0.615 | 0.551 | 0.635 | 0.632 | 0.627 |
| IG-2D | 0.443 | 0.470 | 0.484 | 0.484 | 0.452 | 0.481 | 0.493 | 0.491 |
| T-test + lasso | 0.537 | - | - | - | 0.554 | - | - | - |
| Stage 3 | ||||||||
| T-test | 0.476 | - | - | - | 0.189 | - | - | - |
| IG-1D | 0.575 | 0.582 | 0.583 | 0.582 | 0.248 | 0.257 | 0.258 | 0.260 |
| IG-2D | 0.502 | 0.511 | 0.514 | 0.514 | 0.279 | 0.301 | 0.308 | 0.306 |
| T-test + lasso | 0.510 | - | - | - | 0.193 | - | - | - |
Synergies between data sets
| MA-145: FS method | OOB MA50 | OOB CNV | OOB MA+CNV | OOB Syn. | Cross-validation MA50 | Cross-validation CNV | Cross-validation MA+CNV | Cross-validation Syn. |
|---|---|---|---|---|---|---|---|---|
| Stage 2 | ||||||||
| T-test | 0.693 | 0.537 | 0.693 | -0.001 | 0.709 | 0.546 | 0.698 | -0.011 |
| IG-1D | 0.729 | 0.617 | 0.755 | | 0.737 | 0.632 | 0.765 | |
| IG-2D | 0.721 | 0.484 | 0.740 | | 0.731 | 0.493 | 0.750 | |
| T-test + lasso | 0.808 | 0.536 | 0.804 | -0.004 | 0.831 | 0.553 | 0.827 | -0.004 |
| Stage 3 | ||||||||
| T-test | 0.714 | 0.479 | 0.717 | | 0.647 | 0.192 | 0.632 | -0.015 |
| IG-1D | 0.747 | 0.583 | 0.764 | | 0.661 | 0.258 | 0.662 | |
| IG-2D | 0.730 | 0.514 | 0.740 | | 0.655 | 0.308 | 0.656 | 0.000 |
| T-test + lasso | 0.808 | 0.506 | 0.825 | | 0.648 | 0.194 | 0.652 | |

| G-145: FS method | OOB G50 | OOB CNV | OOB G+CNV | OOB Syn. | Cross-validation G50 | Cross-validation CNV | Cross-validation G+CNV | Cross-validation Syn. |
|---|---|---|---|---|---|---|---|---|
| Stage 2 | ||||||||
| T-test | 0.714 | 0.537 | 0.720 | | 0.716 | 0.546 | 0.725 | |
| IG-1D | 0.766 | 0.617 | 0.780 | | 0.778 | 0.632 | 0.786 | |
| IG-2D | 0.743 | 0.484 | 0.747 | | 0.756 | 0.493 | 0.757 | |
| T-test + lasso | 0.848 | 0.536 | 0.853 | | 0.865 | 0.553 | 0.868 | |
| Stage 3 | ||||||||
| T-test | 0.714 | 0.478 | 0.730 | | 0.650 | 0.192 | 0.640 | -0.011 |
| IG-1D | 0.747 | 0.582 | 0.786 | | 0.661 | 0.258 | 0.662 | |
| IG-2D | 0.730 | 0.511 | 0.767 | | 0.650 | 0.308 | 0.650 | 0.000 |
| T-test + lasso | 0.808 | 0.506 | 0.858 | | 0.655 | 0.194 | 0.655 | 0.000 |
Synergies between data sets displayed for two stages of the analysis for the MA+CNV and G+CNV data sets. MA50 and G50 are the MA-145 and G-145 data sets limited to their top 50 variables, respectively. Cases in which the MCC of the mixed model is higher than that of either component, suggesting possible synergy, are highlighted in boldface
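The Syn. values that survive in the table are consistent with the MCC of the mixed model minus the higher of the two single-source MCCs. That reading is inferred from the tabulated numbers rather than stated in the text, so the sketch below should be treated as an assumption; the function names are hypothetical.

```python
def synergy(mcc_expression, mcc_cnv, mcc_mixed):
    """Mixed-model MCC minus the better single-source MCC, rounded to 3 decimals."""
    return round(mcc_mixed - max(mcc_expression, mcc_cnv), 3)


def suggests_synergy(mcc_expression, mcc_cnv, mcc_mixed):
    """True when the mixed model beats both of its components."""
    return mcc_mixed > max(mcc_expression, mcc_cnv)


# Rows copied from the cross-validation columns for MA-145.
rows = {
    "Stage 2, T-test": (0.709, 0.546, 0.698),
    "Stage 2, T-test + lasso": (0.831, 0.553, 0.827),
    "Stage 3, T-test": (0.647, 0.192, 0.632),
}
for name, values in rows.items():
    print(name, synergy(*values), suggests_synergy(*values))
```

Because the inputs here are already rounded to three decimals, a recomputed value can differ from the published one in the last digit: the OOB T-test row for MA-145 at Stage 2 gives 0.000 this way, whereas the table lists -0.001.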
Three estimates of MCC
| Estimate type | MA-145 T-test | MA-145 IG-1D | MA-145 IG-2D | MA-145 T-test+lasso | G-145 T-test | G-145 IG-1D | G-145 IG-2D | G-145 T-test+lasso |
|---|---|---|---|---|---|---|---|---|
| OOB | 0.720 | 0.736 | 0.726 | 0.822 | 0.713 | 0.796 | 0.763 | 0.861 |
| Cross-validation | 0.665 | 0.674 | 0.670 | 0.643 | 0.653 | 0.672 | 0.661 | 0.638 |
| Ensemble | 0.702 | 0.727 | 0.711 | 0.699 | 0.689 | 0.705 | 0.712 | 0.698 |