Yitan Zhu, Thomas Brettin, Fangfang Xia, Alexander Partin, Maulik Shukla, Hyunseung Yoo, Yvonne A Evrard, James H Doroshow, Rick L Stevens.
Abstract
Convolutional neural networks (CNNs) have been successfully used in many applications where important information about data is embedded in the order of features, such as speech and imaging. However, most tabular data do not assume a spatial relationship between features, and thus are unsuitable for modeling using CNNs. To meet this challenge, we develop a novel algorithm, image generator for tabular data (IGTD), to transform tabular data into images by assigning features to pixel positions so that similar features are close to each other in the image. The algorithm searches for an optimized assignment by minimizing the difference between the ranking of distances between features and the ranking of distances between their assigned pixels in the image. We apply IGTD to transform gene expression profiles of cancer cell lines (CCLs) and molecular descriptors of drugs into their respective image representations. Compared with existing transformation methods, IGTD generates compact image representations with better preservation of feature neighborhood structure. Evaluated on benchmark drug screening datasets, CNNs trained on IGTD image representations of CCLs and drugs exhibit a better performance of predicting anti-cancer drug response than both CNNs trained on alternative image representations and prediction models trained on the original tabular data.
Year: 2021 PMID: 34059739 PMCID: PMC8166880 DOI: 10.1038/s41598-021-90923-y
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Figure 1. An illustration of the IGTD strategy based on CCL gene expression data. (a) Rank matrix of Euclidean distances between all pairs of genes. The grey level indicates the rank value. The 2500 genes with the largest variations across CCLs are included for calculating the matrix. (b) Rank matrix of Euclidean distances between all pairs of pixels, calculated from their coordinates in the image. The pixels are concatenated row by row from the image to form the order of pixels in the matrix. (c) Feature distance rank matrix after optimization, with the features rearranged accordingly. (d) The change in error during the optimization process. The horizontal axis shows the number of iterations and the vertical axis shows the error value.
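The rank-matrix comparison in panels (a) to (c) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the error is assumed to be the sum of absolute rank differences, and tied distances are ranked by pair order.

```python
import math
from itertools import combinations


def pairwise_ranks(points):
    """Rank the Euclidean distances between all pairs of points.

    Returns a dict mapping each index pair (i, j) with i < j to the rank
    of its distance (1 = smallest). Ties are broken by pair order, which
    is a simplification assumed for this sketch.
    """
    pairs = list(combinations(range(len(points)), 2))
    dists = {p: math.dist(points[p[0]], points[p[1]]) for p in pairs}
    ordered = sorted(pairs, key=lambda p: dists[p])  # stable sort
    return {p: rank for rank, p in enumerate(ordered, start=1)}


def pixel_coordinates(n_rows, n_cols):
    """Pixel coordinates concatenated row by row, as described for panel (b)."""
    return [(r, c) for r in range(n_rows) for c in range(n_cols)]


def assignment_error(feature_ranks, pixel_ranks, assignment):
    """Assumed error: sum of absolute rank differences.

    assignment[k] is the index of the pixel that feature k is placed on.
    """
    total = 0
    for (i, j), f_rank in feature_ranks.items():
        a, b = sorted((assignment[i], assignment[j]))
        total += abs(f_rank - pixel_ranks[(a, b)])
    return total
```

Here each feature would be represented by its vector of values across samples, so `pairwise_ranks(features)` plays the role of panel (a) and `pairwise_ranks(pixel_coordinates(n_rows, n_cols))` the role of panel (b). A search procedure would then try reassigning features to pixels and keep the changes that lower `assignment_error`, which is the quantity one would expect to see decreasing in a plot like panel (d).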
Figure 2. Example image representations of CCL gene expression profiles and drug molecular descriptors. (a–c) are image representations of the gene expression profile of the SNU-61 cell line generated by IGTD, REFINED, and DeepInsight, respectively. (d–f) are image representations of molecular descriptors of Nintedanib, generated by IGTD, REFINED, and DeepInsight, respectively.
IGTD reduces the local heterogeneity (LH) of image representations compared with REFINED.
| Data | Neighborhood size | LH (IGTD) | LH (REFINED) | Reduction percentage by IGTD | P-value |
|---|---|---|---|---|---|
| CCL | 3 | 0.174 (0.023) | 0.187 (0.026) | 6.38% (7.08%) | 2.05E−107 |
| CCL | 5 | 0.177 (0.024) | 0.187 (0.027) | 5.32% (6.90%) | 3.37E−87 |
| CCL | 7 | 0.179 (0.024) | 0.188 (0.027) | 4.68% (6.79%) | 1.96E−74 |
| CCL | 9 | 0.180 (0.024) | 0.189 (0.027) | 4.33% (6.57%) | 4.65E−69 |
| Drug | 3 | 0.051 (0.013) | 0.064 (0.017) | 19.99% (4.98%) | 1.50E−252 |
| Drug | 5 | 0.056 (0.014) | 0.066 (0.017) | 14.64% (4.25%) | 2.82E−229 |
| Drug | 7 | 0.061 (0.014) | 0.069 (0.017) | 11.37% (3.92%) | 9.87E−199 |
| Drug | 9 | 0.067 (0.015) | 0.074 (0.018) | 8.63% (3.78%) | 4.56E−156 |
In the LH and reduction percentage columns, the number before the parentheses is the average value obtained across CCLs or drugs, and the number in parentheses is the standard deviation. The P-value is obtained via a two-tailed pairwise t-test comparing the LH between IGTD images and REFINED images across CCLs or drugs.
Figure 3. Architecture of the convolutional neural network (CNN) used for predicting drug response based on image representations.
Comparison of the drug response prediction performance of different data representations and prediction models.
| Dataset | Prediction model | Data representation | R2 | P-value |
|---|---|---|---|---|
| CTRP | LightGBM | Tabular data | 0.825 (0.003) | 8.19E−20 |
| CTRP | Random forest | Tabular data | 0.786 (0.003) | 5.97E−26 |
| CTRP | tDNN | Tabular data | 0.834 (0.004) | 7.90E−18 |
| CTRP | sDNN | Tabular data | 0.832 (0.005) | 1.09E−16 |
| CTRP | CNN | IGTD images | | |
| CTRP | CNN | REFINED images | 0.855 (0.003) | 8.77E−01 |
| CTRP | CNN | DeepInsight images | 0.846 (0.004) | 7.02E−10 |
| GDSC | LightGBM | Tabular data | 0.718 (0.006) | 2.06E−13 |
| GDSC | Random forest | Tabular data | 0.682 (0.006) | 4.53E−19 |
| GDSC | tDNN | Tabular data | 0.734 (0.009) | 1.79E−03 |
| GDSC | sDNN | Tabular data | 0.723 (0.008) | 6.04E−10 |
| GDSC | CNN | IGTD images | | |
| GDSC | CNN | REFINED images | 0.739 (0.007) | 5.93E−01 |
| GDSC | CNN | DeepInsight images | 0.731 (0.008) | 2.96E−06 |
In the R2 column, the number before the parentheses is the average R2 across 20 cross-validation trials, and the number in parentheses is the standard deviation. Bold indicates the highest average R2 obtained on each dataset. The P-value is obtained via a two-tailed pairwise t-test comparing the performance of CNNs trained on IGTD images with that of each other combination of prediction model and data representation.
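The two quantities in this table note, R2 per trial and a t-test over matched trials, can be illustrated with a short standard-library sketch. It assumes the usual coefficient of determination and a test on per-trial differences between two models; the function names and any inputs are hypothetical and no values from the study are reproduced.

```python
import math
import statistics


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = statistics.mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def paired_t_statistic(scores_a, scores_b):
    """t statistic for the mean per-trial difference between two models.

    scores_a[k] and scores_b[k] are assumed to come from the same
    cross-validation trial k, so the differences are matched.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / std_err
```

The two-tailed P-value would follow from a Student t distribution with `len(diffs) - 1` degrees of freedom. The standard library has no t distribution, so in practice one might call `scipy.stats.ttest_rel` on the two score lists instead.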
Comparison of model training time of CNNs trained with different image representations.
| Data | Comparison | Log2 ratio of model training time | P-value |
|---|---|---|---|
| CTRP | REFINED vs. IGTD | −0.034 (0.232) | 5.35E−01 |
| CTRP | DeepInsight vs. IGTD | 4.136 (0.240) | 5.55E−25 |
| GDSC | REFINED vs. IGTD | 0.172 (0.199) | 1.30E−03 |
| GDSC | DeepInsight vs. IGTD | 4.622 (0.417) | 2.34E−21 |
The number before the parentheses is the average across 20 cross-validation trials, and the number in parentheses is the standard deviation. The P-value is obtained via a two-tailed one-sample t-test across the cross-validation trials.
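As a rough illustration of how a log2 ratio of training times and its one-sample t statistic could be computed, the sketch below uses only the standard library. It assumes that the ratio is the first method's time divided by the second's within each trial and that the null hypothesis is a mean log2 ratio of zero (equal training times); the function name and inputs are hypothetical.

```python
import math
import statistics


def log2_ratio_summary(times_a, times_b):
    """Summarize per-trial log2(time_a / time_b).

    Returns (mean, sample standard deviation, one-sample t statistic
    against a mean of zero). Both direction and null value are
    assumptions of this sketch.
    """
    ratios = [math.log2(a / b) for a, b in zip(times_a, times_b)]
    mean = statistics.mean(ratios)
    sd = statistics.stdev(ratios)
    t_stat = mean / (sd / math.sqrt(len(ratios)))
    return mean, sd, t_stat
```

On this scale a value of 1 corresponds to a doubling of training time and a value of 0 to no difference, which makes slower and faster methods symmetric around zero. The P-value itself would again require a t distribution, for example via `scipy.stats.ttest_1samp` applied to the list of log2 ratios.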