Hua-Ping Liu1, Dongwen Wang1, Hung-Ming Lai2.
Abstract
There is a growing need for a model that uses single-cell RNA-seq (scRNA-seq) to separate malignant cells from nonmalignant cells and to identify the tumor of origin of single cells and/or circulating tumor cells (CTCs). Currently, it is infeasible to build a tumor-of-origin model learned from scRNA-seq by machine learning (ML). We therefore asked whether an ML model learned from bulk transcriptomes can be applied to scRNA-seq to infer tumor presence in single cells and further indicate their tumor of origin. We used k-nearest neighbors, one-versus-all support vector machine, one-versus-one support vector machine, and random forest, and introduced scTumorTrace, to conduct a pioneering experiment covering leukocytes and seven major cancer types for which bulk RNA-seq and scRNA-seq data were available. All 13 ML models learned from bulk RNA-seq were reliable (F-score > 96%) on a validation set of bulk transcriptomes, but none of them was applicable to scRNA-seq except scTumorTrace. Inference from bulk RNA-seq to scRNA-seq was impaired by feature selection and improved by log2-transformed TPM units. scTumorTrace with transcriptome-wide 2-tuples achieved F-scores of 98.74% and 94.29% in inferring tumor presence and tumor of origin at single-cell resolution, and correctly identified as leukocytes 45 single cells that were candidate prostate CTCs but lineage-confirmed non-CTCs. We concluded that modern ML techniques are quantitative and could hardly address the questions raised. scTumorTrace with transcriptome-wide 2-tuples is qualitative, standardization-free and not subject to log2-transformed quantities, enabling us to infer from bulk transcriptomes the tumor presence of single-cell transcriptomes and their tumor of origin.
Keywords: Circulating tumor cells; Digit medicine; RNA-seq; Single cell transcriptomes; Translational bioinformatics
Year: 2022 PMID: 35685355 PMCID: PMC9162953 DOI: 10.1016/j.csbj.2022.05.035
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155
Fig. 1. Graphic outlines. (A) The purpose of the present study. (B) A new learning technique of scTumorTrace.
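The abstract describes scTumorTrace features as qualitative 2-tuples that need no standardization and are unaffected by a log2 scale, without spelling out how a 2-tuple is encoded. A minimal sketch, assuming a 2-tuple is a gene pair whose feature value is the within-sample ordering of the two expression values (an assumption, not the authors' stated definition), shows why such a feature would be unchanged by monotone transforms. The gene names and TPM values are hypothetical.

```python
import math
from itertools import combinations

def pair_features(expr):
    """Map every gene pair (a, b) to 1 if expr[a] > expr[b], else 0.

    Assumption: this ordering-based encoding stands in for the paper's
    "2-tuple" feature, which the abstract does not define in detail.
    """
    genes = sorted(expr)
    return {(a, b): int(expr[a] > expr[b]) for a, b in combinations(genes, 2)}

# Hypothetical TPM values for one sample (illustrative gene names only).
tpm = {"GENE_A": 120.0, "GENE_B": 3.5, "GENE_C": 48.0, "GENE_D": 0.0}

raw = pair_features(tpm)
# log2(x + 1) is strictly increasing, so it preserves every pairwise ordering.
logged = pair_features({g: math.log2(v + 1.0) for g, v in tpm.items()})
# Rescaling a whole sample by a positive constant also preserves orderings.
scaled = pair_features({g: 0.01 * v for g, v in tpm.items()})

print(raw == logged == scaled)
```

Under this assumption the feature vector depends only on which gene of a pair is more highly expressed within a sample, which is consistent with the "N/Y" entries in the Log2 column for scTumorTrace.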
Summary statistics of predictive performance: bulk tissue of origin (validation).
| ML Methods | Feature Type | FS | Z | Log2 | #Feat | R | P | F1 | ACC | Effective |
|---|---|---|---|---|---|---|---|---|---|---|
| kNN | Quantitative | N | Y | N | 13,126 | 96.02 | 96.98 | 96.39 | 96.94 | ✓ |
| kNN | Quantitative | N | Y | Y | 13,126 | 99.39 | 99.23 | 99.30 | 99.47 | ✓ |
| OvA SVM | Quantitative | N | Y | N | 13,126 | 99.47 | 99.31 | 99.38 | 99.58 | ✓ |
| OvA SVM | Quantitative | N | Y | Y | 13,126 | 99.76 | 99.81 | 99.78 | 99.79 | ✓ |
| OvO SVM | Quantitative | N | Y | N | 13,126 | 99.12 | 99.54 | 99.32 | 99.37 | ✓ |
| OvO SVM | Quantitative | N | Y | Y | 13,126 | 99.67 | 99.77 | 99.72 | 99.68 | ✓ |
| RF | Quantitative | N | Y | N | 13,126 | 99.37 | 99.15 | 99.25 | 99.26 | ✓ |
| RF | Quantitative | N | Y | Y | 13,126 | 99.68 | 99.55 | 99.62 | 99.68 | ✓ |
| RF | Quantitative | Y | Y | N | 4,608 | 99.52 | 99.40 | 99.46 | 99.47 | ✓ |
| RF | Quantitative | Y | Y | Y | 1,025 | 99.61 | 99.31 | 99.45 | 99.58 | ✓ |
| scTumorTrace | Qualitative | Y | N | N/Y | 500+ | 97.51 | 97.65 | 97.55 | 97.36 | ✓ |
| scTumorTrace | Qualitative | Y | N | N/Y | 7 K+ | 98.48 | 98.33 | 98.39 | 98.31 | ✓ |
| scTumorTrace | Qualitative | N | N | N/Y | 200 K+ | 99.04 | 98.62 | 98.80 | 99.05 | ✓ |
kNN = k-nearest neighbors, OvA SVM = one-versus-all support vector machine, OvO SVM = one-versus-one support vector machine, RF = random forest, FS = feature selection is used (Y) or not (N), Z = standardization (z-score) is used (Y) or not (N), Log2 = a log2 scale is used (Y) or not (N), #Feat = number of features, 500+ = 500 features (2-tuples) on average, 7 K+ = 7000 features (2-tuples) on average, 200 K+ = 0.2 million features (2-tuples) on average, R = recall (%), P = precision (%), F1 = F-score (%), ACC = accuracy (%), ✓ = highly effective.
Summary statistics of predictive performance: tumor presence of single cell transcriptomes.
| ML Methods | Feature Type | FS | Z | Log2 | #Feat | TPR | TNR | F1 | ACC | Effective |
|---|---|---|---|---|---|---|---|---|---|---|
| kNN | Quantitative | N | Y | N | 13,126 | 99.95 | 0 | 61.20 | 44.09 | × |
| kNN | Quantitative | N | Y | Y | 13,126 | 100 | 0 | 61.22 | 44.12 | × |
| OvA SVM | Quantitative | N | Y | N | 13,126 | 100 | 0 | 61.22 | 44.12 | × |
| OvA SVM | Quantitative | N | Y | Y | 13,126 | 99.52 | 0.13 | 61.05 | 43.98 | × |
| OvO SVM | Quantitative | N | Y | N | 13,126 | 100 | 0 | 61.22 | 44.12 | × |
| OvO SVM | Quantitative | N | Y | Y | 13,126 | 100 | 0 | 61.22 | 44.12 | × |
| RF | Quantitative | N | Y | N | 13,126 | 100 | 0 | 61.22 | 44.12 | × |
| RF | Quantitative | N | Y | Y | 13,126 | 100 | 0 | 61.22 | 44.12 | × |
| RF | Quantitative | Y | Y | N | 4,608 | 100 | 0 | 61.22 | 44.12 | × |
| RF | Quantitative | Y | Y | Y | 1,025 | 100 | 0 | 61.22 | 44.12 | × |
| scTumorTrace | Qualitative | Y | N | N/Y | 500+ | 97.24 | 65.10 | 80.55 | 79.28 | Δ |
| scTumorTrace | Qualitative | Y | N | N/Y | 7 K+ | 98.35 | 84.82 | 90.40 | 90.79 | ▲ |
| scTumorTrace | Qualitative | N | N | N/Y | 200 K+ | 97.77 | 99.79 | 98.74 | 98.90 | ✓ |
kNN = k-nearest neighbors, OvA SVM = one-versus-all support vector machine, OvO SVM = one-versus-one support vector machine, RF = random forest, FS = feature selection is used (Y) or not (N), Z = standardization (z-score) is used (Y) or not (N), Log2 = a log2 scale is used (Y) or not (N), #Feat = number of features, 500+ = 500 features (2-tuples) on average, 7 K+ = 7000 features (2-tuples) on average, 200 K+ = 0.2 million features (2-tuples) on average, TPR = sensitivity (%), TNR = specificity (%), F1 = F-score (%), ACC = accuracy (%), × = not effective, Δ = less effective, ▲ = fairly effective, ✓ = highly effective.
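A TPR of 100 with a TNR of 0 means the model labeled every cell as tumor. In that case accuracy equals the tumor fraction p of the test set and F1 reduces to 2p / (1 + p), which for p = 44.12% gives roughly 61.2%, matching the repeated rows above. The sketch below checks this arithmetic, assuming tumor is the positive class; the cell counts are hypothetical and chosen only to reproduce a 44.12% tumor fraction.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return (TPR, TNR, F1, accuracy) from confusion-matrix counts."""
    tpr = tp / (tp + fn)                      # sensitivity / recall
    tnr = tn / (tn + fp)                      # specificity
    precision = tp / (tp + fp)
    f1 = 2 * precision * tpr / (precision + tpr)
    acc = (tp + tn) / (tp + fn + tn + fp)
    return tpr, tnr, f1, acc

# Hypothetical test set: 10,000 cells, 4,412 of them tumor,
# and a degenerate classifier that predicts "tumor" for every cell.
tpr, tnr, f1, acc = binary_metrics(tp=4412, fn=0, tn=0, fp=5588)

p = 0.4412
closed_form_f1 = 2 * p / (1 + p)  # F1 when every cell is predicted positive

print(f"TPR={tpr:.2%} TNR={tnr:.2%} F1={f1:.2%} ACC={acc:.2%}")
```

Read this way, an F1 near 61% in this table reflects class balance rather than any discriminative ability, which is why those rows are marked as not effective.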
Summary statistics of predictive performance: tumor of origin of single and/or circulating tumor cells.
| ML Methods | Feature Type | FS | Z | Log2 | #Feat | R | P | F1 | ACC | Effective |
|---|---|---|---|---|---|---|---|---|---|---|
| kNN | Quantitative | N | Y | N | 13,126 | 38.31 | NaN | NaN | 9.21 | × |
| kNN | Quantitative | N | Y | Y | 13,126 | 61.60 | NaN | NaN | 29.58 | × |
| OvA SVM | Quantitative | N | Y | N | 13,126 | 63.77 | NaN | NaN | 14.63 | × |
| OvA SVM | Quantitative | N | Y | Y | 13,126 | 80.96 | 57.50 | 57.99 | 42.80 | Δ |
| OvO SVM | Quantitative | N | Y | N | 13,126 | 51.96 | NaN | NaN | 12.52 | × |
| OvO SVM | Quantitative | N | Y | Y | 13,126 | 71.83 | NaN | NaN | 39.55 | × |
| RF | Quantitative | N | Y | N | 13,126 | 47.96 | NaN | NaN | 12.40 | × |
| RF | Quantitative | N | Y | Y | 13,126 | 67.20 | NaN | NaN | 41.21 | × |
| RF | Quantitative | Y | Y | N | 4,608 | 38.39 | NaN | NaN | 8.98 | × |
| RF | Quantitative | Y | Y | Y | 1,025 | 49.14 | NaN | NaN | 16.67 | × |
| scTumorTrace | Qualitative | Y | N | N/Y | 500+ | 63.60 | 54.46 | 55.57 | 74.89 | Δ |
| scTumorTrace | Qualitative | Y | N | N/Y | 7 K+ | 75.60 | NaN | NaN | 87.79 | Δ |
| scTumorTrace | Qualitative | N | N | N/Y | 200 K+ | 91.96 | 97.38 | 94.29 | 98.57 | ✓ |
kNN = k-nearest neighbors, OvA SVM = one-versus-all support vector machine, OvO SVM = one-versus-one support vector machine, RF = random forest, FS = feature selection is used (Y) or not (N), Z = standardization (z-score) is used (Y) or not (N), Log2 = a log2 scale is used (Y) or not (N), #Feat = number of features, 500+ = 500 features (2-tuples) on average, 7 K+ = 7000 features (2-tuples) on average, 200 K+ = 0.2 million features (2-tuples) on average, R = recall (%), P = precision (%), F1 = F-score (%), ACC = accuracy (%), × = not effective, Δ = less effective, ✓ = highly effective.
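The table does not explain its NaN entries. One plausible reading, offered here as an assumption rather than the authors' stated procedure, is that precision is averaged over classes and at least one class was never predicted, so its precision is 0/0 and the average (and the F-score derived from it) is undefined, while recall remains computable. A minimal sketch with hypothetical class labels and predictions:

```python
import math

def macro_precision_recall(y_true, y_pred, classes):
    """Unweighted mean of per-class precision and recall.

    A class that is never predicted has precision 0/0, recorded as NaN,
    which propagates into the macro average.
    """
    precisions, recalls = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        predicted = sum(p == c for p in y_pred)
        actual = sum(t == c for t in y_true)
        precisions.append(tp / predicted if predicted else math.nan)
        recalls.append(tp / actual if actual else math.nan)
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)

# Hypothetical labels: three placeholder classes, one of which ("type_C")
# is present in the truth but never predicted by the classifier.
classes = ["type_A", "type_B", "type_C"]
y_true = ["type_A", "type_A", "type_B", "type_B", "type_C", "type_C"]
y_pred = ["type_A", "type_B", "type_B", "type_B", "type_B", "type_A"]

precision, recall = macro_precision_recall(y_true, y_pred, classes)
print(precision, recall)
```

Under this reading, a NaN in the P and F1 columns would signal that the model collapsed onto a subset of the cancer types rather than that the metric failed to compute for some other reason.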
Fig. 2. Confusion matrix of scTumorTrace with transcriptome-wide 2-tuples. (A) Bulk tissue of origin: training set. (B) Bulk tissue of origin: validation set. (C) Tumor of origin of single cells and circulating tumor cells. (D) Inference of non-positive CTCs derived from prostate cancer patients.
Fig. 3. Mirrored histograms of discriminant scores for non-positive CTCs captured from prostate cancer patients. Each mirrored histogram shows the likelihood of a single cell being prostate cancer-derived (top histogram, in red) or leukocyte-like (bottom histogram, in blue), as supported by a third-party cancer type. Pale red indicates that the inference for the cell was against its being a single prostate CTC. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)