Francisco Carrillo-Perez, Juan Carlos Morales, Daniel Castillo-Secilla, Olivier Gevaert, Ignacio Rojas, Luis Javier Herrera.
Abstract
Differentiating between the non-small-cell lung cancer (NSCLC) subtypes is crucial for providing an effective treatment to the patient. For this purpose, machine learning techniques have been applied in recent years to the available biological data from patients. In most cases, however, the problem has been treated with a single-modality approach, leaving the multi-scale and multi-omic nature of cancer data unexplored for classification. In this work, we study the fusion of five multi-scale and multi-omic modalities (RNA-Seq, miRNA-Seq, whole-slide imaging, copy number variation, and DNA methylation) using a late fusion strategy and machine learning techniques. We train an independent machine learning model for each modality and explore the interactions and gains obtained by fusing their outputs incrementally, using a novel optimization approach to compute the parameters of the late fusion. The final classification model, using all modalities, obtains an F1 score of 96.81 ± 1.07, an AUC of 0.993 ± 0.004, and an AUPRC of 0.980 ± 0.016, improving on the results of each independent model and on those reported in the literature for this problem. These results show that leveraging the multi-scale and multi-omic nature of cancer data can enhance the performance of single-modality clinical decision support systems in personalized medicine, consequently improving the diagnosis of the patient.
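The abstract states only that the late-fusion parameters are computed by an optimization approach; the objective is not given here. As a minimal sketch, the block below assumes one fusion weight per modality and fits the weights by plain gradient descent on the cross-entropy of the normalized fused probabilities. The function name `fit_fusion_weights`, the loss, the learning rate, and the toy data are all illustrative assumptions, not the authors' method.

```python
def fit_fusion_weights(samples, modalities, lr=0.1, epochs=200, floor=1e-6):
    """Fit one non-negative weight per modality by gradient descent.

    samples: list of (probs_by_modality, true_class_index), where
    probs_by_modality maps a modality name to its class probabilities.
    A modality absent from the mapping is treated as all-zero probabilities.
    Assumed loss: -log(fused[y] / sum(fused)), fused = sum_m w_m * p_m.
    """
    weights = {m: 1.0 for m in modalities}
    n = max(1, len(samples))
    for _ in range(epochs):
        grads = {m: 0.0 for m in modalities}
        for probs_by_modality, y in samples:
            present = {m: probs_by_modality[m]
                       for m in modalities if m in probs_by_modality}
            fused_y = sum(weights[m] * p[y] for m, p in present.items())
            total = sum(weights[m] * sum(p) for m, p in present.items())
            if fused_y <= 0.0 or total <= 0.0:
                continue  # no usable signal for this sample
            for m, p in present.items():
                # d/dw_m of -log(fused_y) + log(total)
                grads[m] += -p[y] / fused_y + sum(p) / total
        for m in modalities:
            # Gradient step, projected so weights stay strictly positive.
            weights[m] = max(floor, weights[m] - lr * grads[m] / n)
    norm = sum(weights.values())
    return {m: w / norm for m, w in weights.items()}


# Toy data (hypothetical): modality "A" is informative, "B" is uniform noise.
toy_samples = [
    ({"A": [0.8, 0.1, 0.1], "B": [1 / 3, 1 / 3, 1 / 3]}, 0),
    ({"A": [0.1, 0.8, 0.1], "B": [1 / 3, 1 / 3, 1 / 3]}, 1),
    ({"A": [0.1, 0.1, 0.8]}, 2),  # "B" missing for this sample
]
toy_weights = fit_fusion_weights(toy_samples, ["A", "B"])
print(toy_weights)
```

On this toy data the uninformative modality is driven to the weight floor, which is expected for such a simple objective; a practical setup would evaluate the fitted weights on held-out folds.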
Keywords: NSCLC; artificial neural networks; deep learning; information fusion; machine learning; personalized medicine
Year: 2022 PMID: 35455716 PMCID: PMC9025878 DOI: 10.3390/jpm12040601
Source DB: PubMed Journal: J Pers Med ISSN: 2075-4426
Summary of works in the literature for different NSCLC classification problems. SVM: support vector machine; DNN: deep neural network; RF: random forest; CNN: convolutional neural network; k-NN: k-nearest neighbor; Acc.: accuracy; AUC: area under the curve.
| Reference | Modality | Problem | Model | Metric | Result |
|---|---|---|---|---|---|
| Smolander et al. | RNA-Seq | LUAD vs. control | DNN | Acc. | 95.97% |
| Fan et al. | RNA-Seq | LUAD vs. control | SVM | Acc. | 91% |
| Gonzales et al. | Microarray | SCLC vs. LUAD vs. LUSC vs. LCLC | k-NN | Acc. | 91% |
| Castillo-Secilla et al. | RNA-Seq | LUAD vs. control vs. LUSC | RF | Acc. | 95.7% |
| Ye et al. | miRNA-Seq | LUSC vs. control | SVM | F1 score | 99.4% |
| Qiu et al. | CNV | LUAD vs. control vs. LUSC | EN-PLS-NB | Acc. | 84% |
| Shen et al. | metDNA | LUAD vs. control | RF | Acc. | 95.57% |
| Cai et al. | metDNA | LUAD vs. LUSC vs. SCLC | Ensemble | Acc. | 86.54% |
| Coudray et al. | WSI | LUAD vs. control vs. LUSC | CNN | AUC | 0.978 |
| Kanavati et al. | WSI | Lung carcinoma vs. control | CNN | AUC | 0.988 |
| Graham et al. | WSI | LUAD vs. control vs. LUSC | CNN | Acc. | 81% |
Number of samples per class for each data modality.
| Class | WSI | RNA-Seq | miRNA | CNV | metDNA |
|---|---|---|---|---|---|
| LUAD | 495 | 457 | 413 | 465 | 431 |
| Control | 419 | 44 | 71 | 919 | 71 |
| LUSC | 506 | 479 | 420 | 472 | 381 |
| Total | 1420 | 980 | 904 | 1856 | 883 |
Number of tiles obtained from the WSI per class.
| Class | # Tiles |
|---|---|
| LUAD | 100,841 |
| Control | 62,715 |
| LUSC | 92,584 |
| Total | 256,140 |
Figure 1. Prediction pipeline for a given sample with multiple modalities. If a modality is missing for the sample, the probabilities for that modality are set to zero. (i) The multi-scale and multi-omic data available for each sample are obtained. (ii) For the imaging modality, non-overlapping tissue tiles of 512 × 512 are obtained. For the molecular modalities, the features are obtained with the aforementioned preprocessing methodology (see Section 3). (iii) Probabilities are computed for each modality and class. For the molecular modalities, the probabilities are returned by the machine learning model. For the imaging modality, the probability of each class is the number of tiles predicted as that class divided by the total number of tiles. (iv) The late fusion model is applied using the weights previously obtained via the gradient optimization, and the final prediction is obtained. (v) The probabilities are fused with the weights obtained via gradient descent optimization to obtain the final prediction.
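Steps (iii) to (v) of the pipeline can be sketched as follows. This is a minimal illustration under stated assumptions: the fusion is taken to be a weighted sum of per-modality class probabilities followed by an argmax, and the function names, the class order, and the example weights are hypothetical (the fitted weights are not reported in this summary).

```python
from collections import Counter

CLASSES = ("LUAD", "Control", "LUSC")  # assumed class order


def tile_vote_probabilities(tile_predictions, classes=CLASSES):
    """Slide-level probabilities: fraction of tiles predicted as each class."""
    if not tile_predictions:
        return [0.0] * len(classes)
    counts = Counter(tile_predictions)
    total = len(tile_predictions)
    return [counts[c] / total for c in classes]


def late_fusion(probabilities, weights, classes=CLASSES):
    """Weighted sum of per-modality class probabilities, then argmax.

    probabilities: modality name -> list of class probabilities.
    A modality listed in `weights` but absent from `probabilities`
    contributes zeros, mirroring the missing-modality rule in Figure 1.
    """
    fused = [0.0] * len(classes)
    for modality, weight in weights.items():
        probs = probabilities.get(modality)
        if probs is None:
            probs = [0.0] * len(classes)
        for k, p in enumerate(probs):
            fused[k] += weight * p
    best = max(range(len(classes)), key=lambda k: fused[k])
    return classes[best], fused


# Hypothetical sample: tile predictions for the slide, RNA-Seq model output,
# and no miRNA data available.
wsi_probs = tile_vote_probabilities(["LUAD", "LUAD", "LUSC", "LUAD"])
sample = {"WSI": wsi_probs, "RNA-Seq": [0.1, 0.1, 0.8]}
example_weights = {"WSI": 0.4, "RNA-Seq": 0.6, "miRNA": 0.3}  # illustrative
label, fused = late_fusion(sample, example_weights)
print(label, fused)
```

Treating a missing modality as a zero vector means it neither helps nor hurts any class, so the prediction is driven only by the modalities that are present.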
Results obtained in the 10-fold CV by each single modality and by the multimodal fusion of the modalities on their common samples (see Supplementary Material Tables S1–S3). For four- and five-modality fusion, AUC is omitted given the low number of control samples. The first column gives the number of modalities used (marked X) in each case.
| No. of modalities | Acc. (Std) | F1 score (Std) | AUC (Std) | AUPRC (Std) |
|---|---|---|---|---|
| 1 | 88.56 (2.34) | 88.57 (2.36) | 0.965 (0.003) | 0.940 (0.014) |
| 1 | 93.16 (1.87) | 93.17 (1.82) | 0.987 (0.007) | 0.973 (0.028) |
| 1 | 92.31 (2.69) | 92.34 (2.65) | 0.976 (0.013) | 0.961 (0.023) |
| 1 | 88.36 (1.34) | 88.36 (1.34) | 0.954 (0.009) | 0.879 (0.025) |
| 1 | 93.21 (1.84) | 93.19 (1.87) | 0.972 (0.016) | 0.957 (0.030) |
| 2 | 94.65 (1.80) | 94.69 (1.80) | 0.991 (0.004) | 0.979 (0.032) |
| 2 | 92.59 (2.57) | 92.60 (2.56) | 0.987 (0.006) | 0.982 (0.009) |
| 2 | 90.26 (1.98) | 90.20 (1.92) | 0.974 (0.010) | 0.962 (0.016) |
| 2 | 92.79 (1.77) | 92.80 (1.78) | 0.983 (0.009) | 0.979 (0.012) |
| 2 | 94.55 (1.83) | 94.74 (1.70) | 0.988 (0.007) | 0.980 (0.017) |
| 2 | 91.81 (2.34) | 92.12 (2.36) | 0.978 (0.006) | 0.953 (0.050) |
| 2 | 94.33 (1.81) | 94.33 (1.79) | 0.991 (0.007) | 0.989 (0.009) |
| 2 | 91.00 (1.97) | 91.36 (1.82) | 0.973 (0.009) | 0.944 (0.048) |
| 2 | 93.84 (2.88) | 93.85 (2.88) | 0.979 (0.015) | 0.980 (0.015) |
| 2 | 90.15 (3.09) | 90.28 (3.04) | 0.968 (0.010) | 0.947 (0.033) |
| 3 | 95.55 (1.78) | 95.69 (1.76) | 0.985 (0.008) | 0.990 (0.005) |
| 3 | 93.99 (1.47) | 94.00 (1.41) | 0.982 (0.022) | 0.974 (0.041) |
| 3 | 94.70 (2.11) | 94.73 (2.10) | 0.987 (0.010) | 0.990 (0.007) |
| 3 | 93.84 (2.05) | 93.97 (2.03) | 0.974 (0.030) | 0.977 (0.016) |
| 3 | 94.23 (2.55) | 94.23 (2.54) | 0.975 (0.022) | 0.986 (0.008) |
| 3 | 93.50 (2.98) | 93.52 (2.97) | 0.981 (0.009) | 0.978 (0.012) |
| 3 | 94.79 (1.76) | 95.10 (1.72) | 0.938 (0.059) | 0.963 (0.050) |
| 3 | 95.05 (2.05) | 95.10 (2.01) | 0.967 (0.027) | 0.989 (0.009) |
| 3 | 94.11 (1.76) | 94.20 (1.74) | 0.977 (0.012) | 0.981 (0.010) |
| 3 | 94.11 (2.92) | 94.36 (2.70) | 0.975 (0.005) | 0.966 (0.023) |
| 4 | 95.22 (2.13) | 95.47 (2.01) | - | 0.987 (0.007) |
| 4 | 95.53 (2.09) | 95.62 (2.04) | - | 0.989 (0.007) |
| 4 | 95.22 (2.10) | 95.30 (2.05) | - | 0.986 (0.009) |
| 4 | 94.71 (2.29) | 94.90 (2.20) | - | 0.978 (0.013) |
| 4 | 94.86 (2.19) | 95.14 (2.06) | - | 0.981 (0.010) |
| 5 | 95.53 (2.20) | 95.82 (2.05) | - | 0.983 (0.012) |
Figure 2. ROC curves for the fusion and individual models over all available samples for each modality. (a) ROC curve for LUAD class. (b) ROC curve for control class. (c) ROC curve for LUSC class.
Figure 3. F1 score obtained by each fusion model on the available samples for each modality, without restricting to the samples common to the different modalities (see Table 2 for the number of samples per class). The left Y-axis shows the sources used in the integration, and the right Y-axis shows the F1 score obtained by each integration. The integrations are ordered from the highest F1 score to the lowest. metDNA stands for DNA methylation and CNV for copy number variation.
Correct and misclassified samples over the whole dataset for each data type and the fusion model using all modalities. RNA, CNV, and metDNA stand for RNA-Seq, copy number variation, and DNA methylation, respectively.
| | WSI | RNA | miRNA | CNV | metDNA |
|---|---|---|---|---|---|
| Single modality | | | | | |
| Correct | 1232 | 913 | 834 | 1636 | 821 |
| Misclassified | 159 | 67 | 70 | 220 | 62 |
| Fusion | | | | | |
| Correct | 1328 | 929 | 857 | 1796 | 838 |
| Misclassified | 63 | 51 | 47 | 60 | 45 |
| Absolute difference in | 96 (6.5%) | 16 (1.6%) | 23 (2.6%) | 160 (8.6%) | 17 (2%) |
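The counts in the table above can be turned into per-modality accuracies and sample gains with a few lines of arithmetic. The sketch below copies the correct and misclassified counts from the table; the function names are illustrative, and the accuracies it derives are simple ratios over these counts, which need not match the cross-validated figures or the percentages reported elsewhere.

```python
# (correct, misclassified) counts copied from the table above.
SINGLE = {"WSI": (1232, 159), "RNA": (913, 67), "miRNA": (834, 70),
          "CNV": (1636, 220), "metDNA": (821, 62)}
FUSION = {"WSI": (1328, 63), "RNA": (929, 51), "miRNA": (857, 47),
          "CNV": (1796, 60), "metDNA": (838, 45)}


def accuracy(correct, misclassified):
    """Accuracy in percent from correct and misclassified counts."""
    return 100.0 * correct / (correct + misclassified)


def fusion_gain(single, fusion):
    """Per modality: extra correctly classified samples and both accuracies."""
    rows = {}
    for modality, (c_single, m_single) in single.items():
        c_fused, m_fused = fusion[modality]
        rows[modality] = {
            "gained_samples": c_fused - c_single,
            "single_acc": accuracy(c_single, m_single),
            "fusion_acc": accuracy(c_fused, m_fused),
        }
    return rows


for modality, row in fusion_gain(SINGLE, FUSION).items():
    print(f"{modality}: +{row['gained_samples']} samples, "
          f"{row['single_acc']:.2f}% -> {row['fusion_acc']:.2f}%")
```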
Comparison of our fusion results with the available literature for LUAD vs. control vs. LUSC. The results of the fusion model are computed on the samples available for the studied modality. Unfortunately, a direct comparison of fusion methods cannot be performed given the lack of literature for this specific problem. The best results for each case are highlighted in bold.
| Reference | Modality | Metric | Score |
|---|---|---|---|
| Qiu et al. | CNV | Acc. | 84% |
| Ours | CNV | Acc. | |
| Cai et al. | metDNA | Acc. | 86.54% |
| Ours | metDNA | Acc. | |
| Cai et al. | metDNA | F1 score | 74.55% |
| Ours | metDNA | F1 score | |
| Castillo-Secilla et al. | RNA-Seq | Acc. | |
| Ours | RNA-Seq | Acc. | 95% |
| Castillo-Secilla et al. | RNA-Seq | F1 score | |
| Ours | RNA-Seq | F1 score | 95.02% |
| Coudray et al. | WSI | AUC | 0.978 |
| Ours | WSI | AUC | |
| Graham et al. | WSI | Acc. | 81% |
| Ours | WSI | Acc. | |