Joshua D Mannheimer1,2, Dawn L Duval2,3, Ashok Prasad1,4, Daniel L Gustafson5,6,7,8.
Abstract
BACKGROUND: The availability and generation of large amounts of genomic data have led to a new paradigm in cancer treatment that emphasizes a precision approach at the molecular and genomic level. Statistical modeling techniques that leverage broad-scale in vitro, in vivo, and clinical data for precision drug treatment have become an active area of research. In this rapidly developing discipline at the crossroads of medicine, computer science, and mathematics, the techniques in use range from well-established methods to those on the cutting edge of artificial intelligence. Given the diversity and complexity of these techniques, a systematic understanding of fundamental modeling principles is essential to contextualize influential factors, better understand results, and develop new approaches.
Keywords: Cancer; Cytotoxic chemotherapies; Drug response; Genomic modeling; Machine learning
Year: 2019 PMID: 31208429 PMCID: PMC6580596 DOI: 10.1186/s12920-019-0519-2
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
Cytotoxic Drugs and number of cell lines
| Drug | Abbreviation | Number of Cell Lines |
|---|---|---|
| Bleomycin | BLM | 632 |
| Bortezomib | BTZ | 331 |
| Cisplatin | CIS | 146 |
| Cytarabine | CYT | 515 |
| Docetaxel | DTX | 555 |
| Doxorubicin | DOX | 738 |
| Etoposide | ETP | 643 |
| Gemcitabine | GEM | 583 |
| Methotrexate | MTX | 216 |
| Mitomycin C | MMC | 759 |
| Paclitaxel | PTX | 227 |
| Vinblastine | VBL | 719 |
| Vorinostat | VOR | 728 |
| SN-38 | SN-38 | 698 |
| 5-Fluorouracil | 5-FU | 409 |
Fifteen cytotoxic agents and the number of cell lines with experimentally determined IC50 values for each drug. The training set comprises 75% of the total data, and the testing set accounts for the remaining 25%
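The 75%/25% split described in the caption can be illustrated with a short sketch. This is not the authors' code: the function name `split_cell_lines`, the seed, and the cell-line IDs are hypothetical, and a simple random shuffle is assumed because the caption does not say how the split was drawn.

```python
import random

def split_cell_lines(cell_lines, train_fraction=0.75, seed=0):
    """Randomly split cell-line IDs into training and testing subsets."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(cell_lines)        # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical IDs; 632 matches the bleomycin row of the table above.
ids = [f"cell_{i}" for i in range(632)]
train, test = split_cell_lines(ids)
```

Calling the function with different seeds gives different random splits of the same drug's data.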
Fig. 1 General workflow: The general workflow used to build models
Feature selection methods
| Selection Method | Description |
|---|---|
| No feature selection (NO FS) | All probes used with a total of 49,386 probes. |
| Differentially Expressed genes (DEGs) | Array probes that have a statistically significant Spearman correlation with drug response |
| LIMMA | Linear Empirical Bayes with a modified t-statistic as implemented in the LIMMA Bioconductor package in R. Genes were selected by running LIMMA on the most sensitive 25% and the most resistant 25% of cell lines. A false discovery rate of 5% was chosen as a cutoff. |
| Bonferroni Correction (BC) | Bonferroni Correction |
| DEG Bootstrap (BS) | Array probes which have a statistically significant Spearman correlation P < 0.05 in fifty random subsets containing 75% of the training data |
| Histotype specific Bootstrap (BS-Hist) | 50 subsets of the training data were generated such that each subset contained only one cell from a specific histotype. Probes that have a significant Spearman correlation P < 0.05 in 50% of the splits were selected. ** Data not shown, reported Additional file |
| Maximum Relevance Minimum Redundancy (MRMR) | 1000 probes are chosen to have maximum correlation with drug response and minimal cross-correlation with the other chosen probes. |
| Control 1 (CTR1) | Probes are randomly selected from all 49,836 probes equal to the number of DEGs for each model/trial. For example, bleomycin dataset 1 yielded 5377 DEGs in DEG feature selection thus 5377 probes are selected randomly in control 1 experiments. |
| Control 2 (CTR2) | The complement of DEGs. For example, for bleomycin dataset 1 control 2 genes would include 38,009 probes excluded from the 5377 probes selected as DEGs. |
| Random Control (RCTR) | A number, N, of probes equal to the number of DEGs is randomly selected. This gives N vectors, with each entry corresponding to a cell line in the training set. Each vector is then shuffled randomly so that a value is no longer associated with its original cell line, yielding an arbitrary feature matrix. |
| Histotype Only (HIST) | Each cell line is associated with a 55-dimensional vector in which the nth entry is 1 if the cell line comes from the corresponding histotype and 0 otherwise (one-hot encoded). |
A summary and definition of the different feature selection methods discussed in the results section. The abbreviations used in the text to refer to these methods are given in parentheses
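The control feature sets in the table can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the dictionary layout of `expression` (probe name mapped to a list of values, one per training cell line), and the use of Python's `random` module are all hypothetical choices.

```python
import random

def ctr1(all_probes, n_deg, rng):
    """CTR1: n_deg probes drawn at random from all probes."""
    return rng.sample(all_probes, n_deg)

def ctr2(all_probes, deg_probes):
    """CTR2: the complement of the DEG set."""
    deg = set(deg_probes)
    return [p for p in all_probes if p not in deg]

def rctr(expression, n_deg, rng):
    """RCTR: pick n_deg random probes, then shuffle each probe's values
    across cell lines so a value no longer sits with its original cell line."""
    chosen = rng.sample(sorted(expression), n_deg)
    shuffled = {}
    for probe in chosen:
        values = list(expression[probe])   # copy before shuffling in place
        rng.shuffle(values)
        shuffled[probe] = values
    return shuffled

def one_hot_histotype(cell_histotypes, histotypes):
    """HIST: one 0/1 vector per cell line, with a 1 at its histotype."""
    index = {h: i for i, h in enumerate(histotypes)}
    vectors = []
    for h in cell_histotypes:
        v = [0] * len(histotypes)
        v[index[h]] = 1
        vectors.append(v)
    return vectors
```

With 55 histotypes in `histotypes`, `one_hot_histotype` returns the 55-dimensional vectors described in the HIST row.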
Fig. 2 Cluster entropy: Illustration of how cluster entropy, Sc, is calculated. It is a measure of cluster homogeneity, in this case how many cells of the same histotype are placed in the same cluster
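The exact formula for Sc is not given in this section, so the following is only one plausible reading: a size-weighted mean of the Shannon entropy of histotype labels within each cluster, optionally divided by the same quantity for a random-control clustering so that the control equals 1 and perfectly homogeneous clusters give 0. The function names and the choice of natural logarithm are assumptions.

```python
import math
from collections import Counter

def cluster_entropy(cluster_ids, histotypes):
    """Size-weighted mean Shannon entropy of histotype labels per cluster."""
    members = {}
    for c, h in zip(cluster_ids, histotypes):
        members.setdefault(c, []).append(h)
    total = len(cluster_ids)
    entropy = 0.0
    for labels in members.values():
        n = len(labels)
        # Entropy of the histotype distribution inside this cluster.
        h_c = -sum((k / n) * math.log(k / n) for k in Counter(labels).values())
        entropy += (n / total) * h_c
    return entropy

def relative_cluster_entropy(cluster_ids, histotypes, random_cluster_ids):
    """Entropy scaled by that of a random-control clustering (control = 1).
    Assumes the control clustering is not itself perfectly homogeneous."""
    baseline = cluster_entropy(random_cluster_ids, histotypes)
    return cluster_entropy(cluster_ids, histotypes) / baseline
```

Under this reading, four cell lines split into two pure clusters score 0, while two clusters that each mix two histotypes evenly score ln 2 before scaling.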
Fig. 3 Model performance by method and drug: a Average Spearman correlation coefficients for four regression methods over three feature selection methods. b-e Predicted versus measured IC50 values for each of the fifteen drugs using DEGs. b NLSVR, c PCR, d LNSVR, e ANN
Model Performance
| Drug | NLSVR NFS | NLSVR DEG | NLSVR LIM | PCR NFS | PCR DEG | PCR LIM | LNSVR NFS | LNSVR DEG | LNSVR LIM | ANN NFS | ANN DEG | ANN LIM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BLM | .207 | .202 | .202 | .239 | .208 | .208 | .151 | .1 | .209 | .147 | .17 | .21 |
| BTZ | .38 | .404 | .365 | .422 | .399 | .354 | .332 | .326 | .232 | −.009 | .299 | .24 |
| CIS | .05 | .08 | N/A | −.009 | .047 | N/A | .03 | .079 | N/A | −.066 | .034 | N/A |
| CYT | .313 | .32 | .279 | .32 | .281 | .256 | .337 | .291 | .269 | .226 | .266 | .291 |
| DTX | .422 | .44 | .408 | .367 | .409 | .382 | .357 | .319 | .359 | .185 | .318 | .207 |
| DOX | .273 | .27 | .117 | .243 | .285 | .106 | .27 | .226 | .103 | .115 | .173 | .096 |
| ETP | .289 | .302 | .294 | .248 | .291 | .263 | .238 | .219 | .273 | .209 | .195 | .246 |
| GEM | .143 | .139 | .166 | .153 | .117 | .143 | .07 | .063 | .165 | .131 | .119 | .134 |
| MTX | .461 | .455 | .462 | .431 | .435 | .433 | .417 | .388 | .338 | .411 | .391 | .322 |
| MMC | .237 | .302 | .244 | .264 | .269 | .25 | .27 | .224 | .239 | .203 | .153 | .248 |
| PTX | .32 | .27 | .198 | .287 | .282 | .159 | .233 | .170 | .191 | −.106 | .211 | .177 |
| VBL | .44 | .403 | .399 | .408 | .398 | .37 | .398 | .339 | .371 | .112 | .302 | .363 |
| VOR | .509 | .495 | .486 | .5 | .487 | .439 | .484 | .471 | .404 | .445 | .42 | .42 |
| SN-38 | .383 | .417 | .409 | .379 | .391 | .443 | .397 | .404 | .429 | .01 | .327 | .402 |
| 5-FU | .463 | .464 | .40 | .455 | .484 | .354 | .451 | .438 | .337 | .309 | .409 | .365 |
| AVG | .326 | .331 | .316 | .314 | .319 | .297 | .285 | .27 | .28 | .144 | .252 | .266 |
Average Spearman correlations across six different testing sets for all regression and feature selection methods. These data are displayed graphically in Fig. 3
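Each entry in the table is a Spearman correlation between predicted and measured IC50 values. As a reminder of what that statistic computes, here is a dependency-free sketch: it ranks both vectors (averaging ranks for ties) and takes the Pearson correlation of the ranks. The function names are illustrative; in practice a library routine such as `scipy.stats.spearmanr` would normally be used.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors.
    Raises ZeroDivisionError if either input is constant (zero variance)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

For example, `spearman([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])` evaluates to 0.8, since two adjacent pairs are swapped but the overall ordering is largely preserved.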
Feature Selection and number of features
| Drug | DEG | Limma | BC | BS | BS Hist |
|---|---|---|---|---|---|
| BLM | 5216.5; 4414; 6314 | 18.2; 11; 32 | 2.5; 0; 6 | 178.3; 117; 298 | 8.7; 6; 15 |
| BTZ | 11522.2; 10087; 12191 | 233.3; 124; 414 | 93.3; 52; 132 | 1626.5; 1208; 1888 | 442.5; 144; 820 |
| CIS | 2354.8; 1835; 2890 | NA | NA | 20.5; 11; 32 | 188.8; 103; 329 |
| CYT | 12954; 8675; 16844 | 371.8; 123; 826 | 174.5; 12; 414 | 2063.1; 714; 3704 | 571.8; 104; 1374 |
| DTX | 15292; 13054; 16938 | 330.5; 127; 549 | 398.8; 147; 625 | 2979.8; 1958; 3718 | 794.7; 585; 955 |
| DOX | 4983.3; 4469; 5368 | 12.8; 8; 18 | 2.5; 0; 6 | 167.8; 145; 201 | 27.5; 13; 50 |
| ETP | 11271.5; 10707; 12744 | 178.7; 91; 284 | 45.8; 32; 63 | 1254; 1024; 1720 | 146.8; 22; 269 |
| GEM | 5728.7; 4099; 7423 | 7; 2; 17 | 3.2; 1; 8 | 185.2; 82; 342 | 34.2; 5; 79 |
| MTX | 15727.3; 13692; 18906 | 130; 15; 203 | 398.8; 11; 687 | 3153.5; 2010; 4378 | 2329.3; 1372; 2748 |
| MMC | 6515.8; 4936; 8485 | 30.8; 8; 95 | 6.7; 3; 14 | 332.2; 127; 629 | 26.5; 5; 116 |
| PTX | 6175.2; 4997; 7344 | 12; 4; 20 | 4.7; 1; 8 | 295; 162; 465 | 293; 136; 392 |
| VBL | 15207.7; 12488; 17196 | 201.8; 114; 284 | 447.33; 124; 716 | 3094.8; 1830; 4142 | 510; 231; 809 |
| VOR | 29935.2; 29461; 30337 | 5796.7; 4971; 6284 | 8165.2; 7274; 8690 | 15931.3; 14934; 16448 | 4107; 3213; 5604 |
| SN-38 | 13501.5; 11919; 16510 | 190.8; 108; 424 | 315.3; 153; 738 | 2577.2; 1761; 4286 | 193.5; 28; 149 |
| 5-FU | 16218.8; 14888; 17706 | 236.5; 142; 320 | 333; 208; 480 | 3240.8; 2595; 3928 | 773.7; 550; 1084 |
The number of features selected by each feature selection method. Each cell contains the mean, minimum, and maximum number of features (Mean; Minimum; Maximum)
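The much smaller BS and BS Hist counts reflect a stability filter: a probe is kept only if it is significant across repeated random subsets of the training data. A minimal sketch of such a filter follows. It is not the authors' code: `bootstrap_select` and the `is_significant` callback are hypothetical names, and requiring significance in all 50 subsets (`min_fraction=1.0`) is one reading of the BS definition; the BS-Hist definition states a 50% threshold (`min_fraction=0.5`), though its subsets are built per histotype rather than as plain 75% samples.

```python
import random

def bootstrap_select(probes, cell_lines, is_significant, n_subsets=50,
                     subset_fraction=0.75, min_fraction=1.0, seed=0):
    """Keep probes that pass is_significant(probe, subset) in at least
    min_fraction of n_subsets random subsets of the training cell lines."""
    rng = random.Random(seed)
    k = round(subset_fraction * len(cell_lines))
    hits = {p: 0 for p in probes}
    for _ in range(n_subsets):
        subset = rng.sample(cell_lines, k)   # sampled without replacement
        for p in probes:
            if is_significant(p, subset):
                hits[p] += 1
    return [p for p in probes if hits[p] >= min_fraction * n_subsets]
```

Here `is_significant` stands in for the per-subset test, for example a Spearman correlation with P < 0.05 between a probe's expression and IC50 over the cell lines in `subset`.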
Fig. 4 Feature selection methods and controls: a-b Spearman correlation coefficients for different feature selection methods for NLSVR (a) and PCR (b). c-d Spearman correlation coefficients for control models for NLSVR (c) and PCR (d). The placement of the symbol indicates the mean, with the ends representing the range. e Cluster entropy (Sc), indicative of how well cell lines of the same histotype cluster using k-means. Comparable Sc as well as little difference in r indicate that histotype recognition drives model performance. Sc is relative to the random control (RCTR), where Sc = 1; perfect clustering would have Sc = 0. The asterisks indicate significance (p < 0.05) between the method and alternative models, indicated by color (black indicates that it was significant compared to all other methods), using a nonparametric Wilcoxon matched-pairs signed-rank test. The calculated p values can be found in Additional file 3
Fig. 5 Histotype influence on drug response. a-b P values for pairwise F-tests between histotype IC50 for Bleomycin (a) and Vorinostat (b). c-d Measured vs Predicted IC50 using DEGs for Bleomycin (c) and Vorinostat (d). e-f Measured vs. Predicted IC50 for Hist models in Bleomycin (e) and Vorinostat (f). Each symbol color combination indicates a different histotype
Fig. 6 Model performance and number of features. Average Spearman correlations for all 15 drugs as a function of the number of features used for NLSVR DEG (a), NLSVR CTR1 (b), NLSVR CTR2 (c), PCR DEG (d), PCR CTR1 (e), PCR CTR2 (f). Average Sc versus average Spearman correlation, with each symbol representing the number of features used, for NLSVR (g) and PCR (h)
WPC Index
| FS Method | NLSVR | PCR | NLSVR NSCLC-AD | p NSCLC-AD |
|---|---|---|---|---|
| NO FS | 0.581 | 0.578 | 0.528 | 0.009 |
| DEG | 0.582 | 0.579 | 0.535 | 0.001 |
| CTR1 | 0.581 | 0.576 | 0.519 | 0.056 |
| CTR2 | 0.576 | 0.559 | 0.508 | 0.248 |
| MRMR | 0.579 | 0.575 | NA | NA |
| BS | 0.58 | 0.576 | NA | NA |
| BC | 0.575 | 0.571 | NA | NA |
| Hist | 0.552 | 0.552 | NA | NA |
| Random Control | NA | 0.498 | NA | NA |
WPC index for NLSVR and PCR models, as well as values for non-small cell lung cancer adenocarcinoma (NSCLC-AD) in select drugs. The p-value for NSCLC-AD was calculated from 3000 random permutations of the test data, which were used to construct a null distribution. For example, there is a 0.1% chance of obtaining a higher WPC score randomly for NLSVR DEG models on NSCLC-AD. Note that a WPC score cannot be calculated for the random control for NLSVR because a variance of 0 results in division by 0 in the calculated WPC scores
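The permutation procedure behind the NSCLC-AD p-values can be sketched generically. The WPC index itself is not defined in this section, so the example below substitutes a simple pairwise concordance score as a stand-in; `concordance`, `permutation_p_value`, and the seed are hypothetical, and shuffling the predictions against the measured values is an assumed way of permuting the test data.

```python
import random

def concordance(measured, predicted):
    """Stand-in score (not the WPC index): fraction of cell-line pairs
    that the prediction ranks in the same order as the measurement."""
    agree = total = 0
    n = len(measured)
    for i in range(n):
        for j in range(i + 1, n):
            if measured[i] == measured[j] or predicted[i] == predicted[j]:
                continue                      # skip tied pairs
            total += 1
            if (measured[i] < measured[j]) == (predicted[i] < predicted[j]):
                agree += 1
    return agree / total if total else 0.5

def permutation_p_value(measured, predicted, score=concordance,
                        n_permutations=3000, seed=0):
    """Fraction of shuffled predictions whose score reaches the observed one."""
    rng = random.Random(seed)
    observed = score(measured, predicted)
    shuffled = list(predicted)
    at_least = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)                 # break the pairing with measured
        if score(measured, shuffled) >= observed:
            at_least += 1
    return at_least / n_permutations
```

With 3000 permutations the smallest nonzero p-value is 1/3000, and a reported value of 0.001 corresponds to 3 permutations scoring at least as high as the real model.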
Fig. 7 NCI60 models vs the GDSC models: a-d Spearman correlations for a given feature selection method (FS correlation) versus Spearman correlations for one-hot encoded histotype models for NCI60 NLSVR (a), NCI60 PCR (b), GDSC NLSVR (c), GDSC PCR (d). e NCI60 models for the 14 drugs with data in both the GDSC and NCI60