
Can we infer tumor presence of single cell transcriptomes and their tumor of origin from bulk transcriptomes by machine learning?

Hua-Ping Liu¹, Dongwen Wang¹, Hung-Ming Lai².

Abstract

There is a growing need for models that use single cell RNA-seq (scRNA-seq) to separate malignant cells from nonmalignant cells and to identify the tumor of origin of single cells and/or circulating tumor cells (CTCs). Currently, it is infeasible to train a tumor of origin model on scRNA-seq data by machine learning (ML). We therefore asked whether an ML model learnt from bulk transcriptomes is applicable to scRNA-seq to infer single cells' tumor presence and further indicate their tumor of origin. We used k-nearest neighbors, one-versus-all support vector machine, one-versus-one support vector machine and random forest, and introduced scTumorTrace, to conduct a pioneering experiment containing leukocytes and seven major cancer types for which bulk RNA-seq and scRNA-seq data were available. All thirteen ML models learnt from bulk RNA-seq were reliable (F-score > 96%) on a validation set of bulk transcriptomes, but none of them was applicable to scRNA-seq except scTumorTrace. Inference from bulk RNA-seq to scRNA-seq was impaired by feature selection and improved by log2-transformed TPM units. scTumorTrace with transcriptome-wide 2-tuples achieved F-scores of 98.74% and 94.29% in inferring tumor presence and tumor of origin at single-cell resolution, and correctly identified as leukocytes 45 single cells that were candidate prostate CTCs but lineage-confirmed non-CTCs. We concluded that modern ML techniques are quantitative and could hardly address the raised questions. scTumorTrace with transcriptome-wide 2-tuples is qualitative, standardization-free and not subject to log2-transformed quantities, enabling us to infer tumor presence of single cell transcriptomes and their tumor of origin from bulk transcriptomes.


Keywords: Circulating tumor cells; Digital medicine; RNA-seq; Single cell transcriptomes; Translational bioinformatics

Year: 2022 PMID: 35685355 PMCID: PMC9162953 DOI: 10.1016/j.csbj.2022.05.035

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155

Introduction

Single cell RNA-seq (scRNA-seq) is already a powerful biomedical technique for studying the tumor microenvironment (TME) [1], tumor diagnosis [2] and therapeutic resistance [3], and it can be used to characterize circulating tumor cells (CTCs) [4]. To explore the complexity of the TME, it is essential to separate malignant cells from nonmalignant cells effectively. This separation is equivalent to determining whether tumor is present in single cells, and a few tools have been developed to perform it with scRNA-seq [1], [5], [6]. Carcinoma of unknown primary origin (CUP) is a rare disease in which malignant cells have spread elsewhere in the body but routine testing cannot locate their origin [7]. CTCs have recently been used in attempts to find the starting point of an occult primary cancer [8], [9]. However, CTCs are cancer cells that shed from a primary or metastatic tumor lesion and circulate in the bloodstream, and the primary site where the cancer began is usually known in current CTC clinical settings [10]; in fact, determining the tumor of origin of CTCs remains problematic when it is not given or is uncertain [11]. Using scRNA-seq to infer single cells’ tumor of origin (scTOO) would help address the tumor of origin of both CUP and CTCs. Unfortunately, scRNA-seq suffers from low capture efficiency and high dropout rates [12], [13], which can interfere with an accurate portrayal of single cell expression programs. scRNA-seq is also currently costly, and it is not economically feasible to capture single cell transcriptomes from thousands of patients over a wide range of cancer types for scTOO model training by machine learning (ML). Alternatively, bulk transcriptomes may offer a practical way to learn a multiclass model for scTOO inference. Bulk RNA-seq measures the average gene expression of thousands of cells and captures more transcripts with a lower level of technical noise than scRNA-seq [12], [14]. It has been shown that 60% tumor purity is sufficient for a bulk tumor sample to represent a mass of malignant cells in the TME [15]. More importantly, public repositories have an abundant supply of bulk RNA-seq data covering many tumor types with many hundreds of patients per type. Bulk RNA-seq and scRNA-seq are quite different on biological and technical levels. It is, therefore, worth knowing whether an ML multiclass model learnt from bulk transcriptomes is applicable to scRNA-seq data to infer single cells’ tumor presence and their tumor of origin.

Materials and methods

Experimental design

We carried out a pioneering experiment with seven major cancer types (ovary, lung, liver, colorectum, breast, prostate, and melanoma) and white blood cells (WBCs) sourced from two public repositories (TCGA and GEO). The tumors we used were all in situ, and the metastatic ones were discarded. WBCs were from healthy people and from patients who suffered from non-cancer diseases. We created a cohort of bulk RNA-seq samples and randomly split it into training and test (validation) sets in the ratio of 7:3 (Table S1). We used single cell RNA-seq data (ovary, lung, liver, colorectum, melanoma and WBCs), CTC RNA-seq data (breast and prostate) and lineage-confirmed non-positive CTC RNA-seq data (prostate) to form an application set (Table S2). scRNA-seq imputation was performed by the SCRABBLE algorithm [16]. We only considered single cell transcriptomic profiles whose genes were neither preselected nor truncated. All of the single cells (CTCs included) were patient-derived, and single cells from cell lines were excluded. We used four modern machine learning (ML) methods (kNN, one-versus-all SVM, one-versus-one SVM and random forest) and introduced a new learning technique (scTumorTrace) to build an ML multiclass model learnt from bulk training data. Each built ML model was validated on the bulk test data to show whether it was reliable to use; if so, we applied it to the scRNA-seq data (Table S2). Both bulk RNA-seq and scRNA-seq data were normalized to TPM (Transcripts Per Million) units as expression quantities. We examined standardization on two quantities: TPM and log2(TPM + 1). TPM (respectively, log2(TPM + 1)) values were mean-centered and scaled to unit standard deviation and were termed z-score (respectively, standardized log2 transformation). Standardization was applied to each of the training, test and application sets. We used the TPM units for scTumorTrace and the two sorts of standardized quantities for the four modern ML methods. We also examined whether feature selection affects the applicability to scRNA-seq. The examination was conducted in the experiments with random forest and scTumorTrace.
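The two standardized quantities can be sketched as follows. This is only an illustration under stated assumptions: standardization is taken to act per gene across the samples of one data set and to use the population standard deviation, neither of which is specified in the text, and the function names and TPM values are hypothetical.

```python
import math
from statistics import mean, pstdev

def log2_tpm(tpm):
    """Return log2(TPM + 1) for a list of TPM values."""
    return [math.log2(v + 1.0) for v in tpm]

def standardize(values):
    """Mean-center and scale to unit standard deviation.

    Assumption: population standard deviation; a constant vector maps to zeros.
    """
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]

# Hypothetical TPM values of one gene across four samples of one data set.
gene_tpm = [0.0, 3.0, 15.0, 255.0]

z_score = standardize(gene_tpm)             # "z-score" in the text
std_log2 = standardize(log2_tpm(gene_tpm))  # "standardized log2 transformation"
```

Because each set is standardized on its own, the same TPM value can map to different standardized values in the bulk and single-cell sets.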

Modern machine learning

We used k = 3 with Euclidean distance for k-nearest neighbors (kNN) and grew 100 trees to build random forest (RF). A linear kernel function with the penalty of C = 1 was utilized for support vector machine in one-versus-all (OvA SVM) and one-versus-one (OvO SVM) schemes.
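As an illustration of the kNN setting only (k = 3, Euclidean distance), a minimal pure-Python sketch is given below. It is not the implementation used in the study; the tie-breaking rule, the function name and the toy vectors are assumptions, and random forest and the two SVM schemes are not shown.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Predict a label by majority vote among the k nearest training samples.

    Distances are Euclidean. Ties are broken in favor of the label seen first
    among the nearest neighbors (an assumption; the text does not say).
    """
    ranked = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy standardized expression vectors over two genes (hypothetical values).
train_x = [(-1.0, -1.2), (-0.8, -0.9), (1.1, 0.9), (0.9, 1.3), (1.0, 1.0)]
train_y = ["WBC", "WBC", "lung", "lung", "lung"]

prediction = knn_predict(train_x, train_y, query=(0.8, 1.1))
```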

A new learning technique: scTumorTrace

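We suppose that there are $K$ classes $c_1, \dots, c_K$ to be studied in a training set $T = \{(x_i, y_i)\}_{i=1}^{n}$, where $y_i \in \{c_1, \dots, c_K\}$ and $n$ is the number of samples in $T$. For every $(x_i, y_i) \in T$, $y_i$ is a class label and $x_i$ is an expression vector profiled by a set of genes $G$. We define a transcriptomic 2-tuple $(g_a, g_b)$ between two classes $c_p$ and $c_q$ as follows: given two genes $g_a, g_b \in G$, $x(g_a) > x(g_b)$ (respectively, $x(g_a) < x(g_b)$) holds in at least 90 percent of samples with $y = c_p$ (respectively, $y = c_q$). Over all $g_a, g_b \in G$, we construct the entire set of 2-tuples between $c_p$ and $c_q$ and refer to it as $\Theta_{pq}$. Upon the training set $T$, we connect all the $\Theta_{pq}$ to build an all-in-one panel of transcriptomic 2-tuples for all $p, q \in \{1, \dots, K\}$, $p \neq q$. Given an unlabeled sample $x$ to be inferred, we define its discriminant score $S_{pq}(x)$ in Eq. (1) to indicate that a class $c_q$ predicts the likelihood of $x$ being $c_p$. In Eq. (1), $G_x$ is the gene set that profiles $x$, $u_{pq}$ is the number of invalid 2-tuples whose gene ($g_a$ or $g_b$) is undefined in $G_x$, and $v_{pq}$ is the number of meaningless 2-tuples having $x(g_a) = x(g_b)$. Following Eq. (1), we define $S_p(x)$ in Eq. (2) as an overall score of $x$ being $c_p$ supported by the other classes. Consequently, the class of $x$ is determined by $\arg\max_{p} S_p(x)$. Note that $I[\cdot]$ is an indicator function (alias the Iverson bracket) for a statement in Eq. (1) and Eq. (2).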
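The following is a hedged sketch of the idea described above, not the authors' implementation. The 90 percent rule and the exclusion of invalid and meaningless 2-tuples follow the text. The exact forms of Eq. (1) and Eq. (2) are not available here, so two choices are assumptions: the pairwise score is taken as the fraction of usable 2-tuples whose ordering holds, and the overall score as the mean of the pairwise scores over the other classes. All function names, gene names and expression values are hypothetical.

```python
from itertools import permutations

def find_two_tuples(samples_p, samples_q, genes, min_frac=0.9):
    """Ordered gene pairs (a, b) with a > b in >= min_frac of class-p samples
    and a < b in >= min_frac of class-q samples."""
    def frac(samples, cond):
        return sum(1 for s in samples if cond(s)) / len(samples)
    return [
        (a, b)
        for a, b in permutations(genes, 2)
        if frac(samples_p, lambda s: s[a] > s[b]) >= min_frac
        and frac(samples_q, lambda s: s[a] < s[b]) >= min_frac
    ]

def build_panel(train, genes):
    """All-in-one panel: one 2-tuple set per ordered pair of classes."""
    return {
        (p, q): find_two_tuples(train[p], train[q], genes)
        for p, q in permutations(train, 2)
    }

def discriminant_score(x, tuples):
    """Assumed form of the pairwise score: fraction of usable 2-tuples whose
    class-p ordering (a > b) holds in x. Invalid tuples (a gene missing from x)
    and meaningless tuples (ties) are excluded; 0.0 if none remain."""
    usable = [(a, b) for a, b in tuples if a in x and b in x and x[a] != x[b]]
    if not usable:
        return 0.0
    return sum(1 for a, b in usable if x[a] > x[b]) / len(usable)

def predict(x, panel, classes):
    """Assumed overall score: mean pairwise score against the other classes."""
    overall = {
        p: sum(discriminant_score(x, panel[(p, q)]) for q in classes if q != p)
        / (len(classes) - 1)
        for p in classes
    }
    return max(overall, key=overall.get), overall

# Toy bulk training profiles in TPM (hypothetical genes and values).
genes = ["g1", "g2", "g3"]
train = {
    "WBC": [{"g1": 90, "g2": 2, "g3": 0}, {"g1": 70, "g2": 3, "g3": 1}],
    "prostate": [{"g1": 2, "g2": 40, "g3": 300}, {"g1": 1, "g2": 55, "g3": 200}],
    "lung": [{"g1": 3, "g2": 80, "g3": 1}, {"g1": 5, "g2": 60, "g3": 0}],
}
panel = build_panel(train, genes)

# A single cell in which g1 was not profiled: 2-tuples using g1 become invalid.
cell = {"g2": 12, "g3": 95}
label, overall = predict(cell, panel, list(train))
```

Only within-sample orderings are compared, so no standardization across data sets is needed, which is the sense in which the text calls the method qualitative.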

Tumor presence inference

It is straightforward to infer single cells’ tumor presence in terms of multiclass classification. A single cell is called malignant provided that a multiclass model identifies it as a cell from one of the seven cancer types; a cell identified as a white blood cell is called nonmalignant.
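This rule amounts to collapsing the multiclass label into a binary call. A minimal sketch, in which the label strings and the function name are hypothetical:

```python
# The seven cancer types of the study; "WBC" is the only nonmalignant class.
CANCER_TYPES = {
    "ovary", "lung", "liver", "colorectum", "breast", "prostate", "melanoma",
}

def tumor_present(predicted_class):
    """A cell is called malignant when its multiclass label is a cancer type."""
    return predicted_class in CANCER_TYPES

calls = [tumor_present(c) for c in ["lung", "WBC", "melanoma"]]
```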

Feature selection

We applied random forest recursive feature elimination (RFRFE) [17] to a training set, using 10-fold cross-validation as the evaluation procedure and the error rate as the evaluation measure, to select a subset of 4608 genes for z-score (Fig. S1) and 1025 genes for standardized log2 transformation (Fig. S2). For scTumorTrace, we used a simple filter to reduce the number of transcriptomic 2-tuples from a transcriptome-wide scale to a modest or a small scale (Table S3). Upon a training set, we set a threshold $\tau_{pq}$ for any two classes $c_p$ and $c_q$ and retained genes whose expression intensities were above the threshold in >50% of samples in both $c_p$ and $c_q$. The retained genes between $c_p$ and $c_q$ were used to discover the set of 2-tuples between the two classes, $\Theta_{pq}$. All the thresholds and the numbers of transcriptomic 2-tuples used are given in Tables S4 and S5.
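The simple filter used for scTumorTrace can be sketched as below. The threshold value, function name and expression values are hypothetical; the actual thresholds are those listed in the supplementary tables.

```python
def retained_genes(samples_p, samples_q, genes, threshold, min_frac=0.5):
    """Genes expressed above `threshold` in more than `min_frac` of the samples
    in BOTH classes."""
    def frac_above(samples, gene):
        return sum(1 for s in samples if s[gene] > threshold) / len(samples)
    return [
        g for g in genes
        if frac_above(samples_p, g) > min_frac
        and frac_above(samples_q, g) > min_frac
    ]

# Toy TPM profiles for two classes (hypothetical values).
class_p = [
    {"g1": 12, "g2": 0.5, "g3": 30},
    {"g1": 8, "g2": 0.2, "g3": 2},
    {"g1": 20, "g2": 9, "g3": 40},
]
class_q = [
    {"g1": 6, "g2": 0.1, "g3": 50},
    {"g1": 7, "g2": 0.0, "g3": 1},
    {"g1": 1, "g2": 0.3, "g3": 60},
]

kept = retained_genes(class_p, class_q, ["g1", "g2", "g3"], threshold=5)
```

Only the retained genes would then be searched for 2-tuples, so a higher threshold yields a smaller panel.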

Performance metrics

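Let $TP$, $TN$, $FP$, $FN$, $N_{+}$ and $N_{-}$ be the number of true positives, true negatives, false positives, false negatives, positive (malignancy) and negative cases, respectively. We used sensitivity ($TPR = TP/N_{+}$), specificity ($TNR = TN/N_{-}$), F-score ($F_1 = 2TP/(2TP + FP + FN)$) and accuracy ($ACC = (TP + TN)/(N_{+} + N_{-})$) to evaluate the inference of tumor presence. For an evaluation of scTOO inferences, we let $m_k$, $\hat{n}_k$ and $n_k$ be the number of correct predictions, predicted instances and actual instances for a class $c_k$, respectively, and calculated its recall ($R_k = m_k/n_k$), precision ($P_k = m_k/\hat{n}_k$) and F-score ($F_k = 2R_kP_k/(R_k + P_k)$). With $K$ classes, we then used macro recall ($R = \frac{1}{K}\sum_{k} R_k$), macro precision ($P = \frac{1}{K}\sum_{k} P_k$), macro F-score ($F_1 = \frac{1}{K}\sum_{k} F_k$) and micro accuracy ($ACC = \sum_{k} m_k / \sum_{k} n_k$) as summary statistics.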
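A minimal sketch of these metrics using their standard definitions is given below. Treating 0/0 as NaN is an assumption, chosen so that a class that is never predicted leaves macro precision and macro F-score undefined; the function names and toy labels are hypothetical.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, F-score and accuracy for tumor presence."""
    pos, neg = tp + fn, tn + fp
    return {
        "TPR": tp / pos,
        "TNR": tn / neg,
        "F1": 2 * tp / (2 * tp + fp + fn),
        "ACC": (tp + tn) / (pos + neg),
    }

def safe_div(num, den):
    """Return NaN when the denominator is zero (e.g. a never-predicted class)."""
    return num / den if den else math.nan

def macro_metrics(y_true, y_pred, classes):
    """Macro recall, precision and F-score plus micro accuracy."""
    recalls, precisions, f_scores, correct_total = [], [], [], 0
    for k in classes:
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p == k)
        predicted = sum(1 for p in y_pred if p == k)
        actual = sum(1 for t in y_true if t == k)
        r = safe_div(correct, actual)
        pr = safe_div(correct, predicted)
        recalls.append(r)
        precisions.append(pr)
        f_scores.append(safe_div(2 * r * pr, r + pr))
        correct_total += correct
    n_classes = len(classes)
    return {
        "R": sum(recalls) / n_classes,
        "P": sum(precisions) / n_classes,
        "F1": sum(f_scores) / n_classes,  # mean of per-class F-scores
        "ACC": correct_total / len(y_true),
    }

classes = ["lung", "liver", "WBC"]
y_true = ["lung", "lung", "liver", "WBC"]
y_pred = ["lung", "WBC", "liver", "WBC"]
summary = macro_metrics(y_true, y_pred, classes)

# "liver" is never predicted, so its precision (and the macro mean) is NaN.
nan_case = macro_metrics(["lung", "liver"], ["lung", "lung"], ["lung", "liver"])
```

Note that the macro F-score is the mean of the per-class F-scores, not the harmonic mean of macro recall and macro precision.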

Results

The purpose of the present study is to ask whether a multiclass model learnt from bulk transcriptomes by machine learning is applicable to single cell transcriptomes to infer single cells’ tumor presence and their tumor of origin, as illustrated in Fig. 1A. Details of the experimental design are given in section 2.1. We also developed a new learning technique (scTumorTrace) as a companion to our study. scTumorTrace is outlined in Fig. 1B and described in mathematical detail in section 2.3.

Fig. 1

Graphic outlines. (A) The purpose of the present study. (B) A new learning technique of scTumorTrace.

A validation set of bulk transcriptomic data showed that thirteen ML models learnt from bulk RNA-seq were all robust (F-score > 96%, Table 1). It indicated that the thirteen classifiers were all eligible to be examined on scRNA-seq data for their applicability to single cells’ inferences. All the modern ML techniques were unable to discriminate leukocytes from neoplastic cells (single tumor cells and circulating tumor cells) while three scTumorTrace classifiers were able to (Table 2). Their discriminating power increased with a growing number of employed transcriptomic 2-tuples (sensitivity, 97.24, 98.35 and 97.77%, and specificity, 65.10, 84.82 and 99.79%; Table 2 and Table S3-S5).

Table 1

Summary statistics of predictive performance: bulk tissue of origin (validation).

ML Methods	Feature Type	FS	Z	Log2	#Feat	R	P	F1	ACC	Effective
kNN	Quantitative	N	Y	N	13,126	96.02	96.98	96.39	96.94	✓
kNN	Quantitative	N	Y	Y	13,126	99.39	99.23	99.30	99.47	✓
OvA SVM	Quantitative	N	Y	N	13,126	99.47	99.31	99.38	99.58	✓
OvA SVM	Quantitative	N	Y	Y	13,126	99.76	99.81	99.78	99.79	✓
OvO SVM	Quantitative	N	Y	N	13,126	99.12	99.54	99.32	99.37	✓
OvO SVM	Quantitative	N	Y	Y	13,126	99.67	99.77	99.72	99.68	✓
RF	Quantitative	N	Y	N	13,126	99.37	99.15	99.25	99.26	✓
RF	Quantitative	N	Y	Y	13,126	99.68	99.55	99.62	99.68	✓
RF	Quantitative	Y	Y	N	4608	99.52	99.40	99.46	99.47	✓
RF	Quantitative	Y	Y	Y	1025	99.61	99.31	99.45	99.58	✓
scTumorTrace	Qualitative	Y	N	N/Y	500+	97.51	97.65	97.55	97.36	✓
scTumorTrace	Qualitative	Y	N	N/Y	7 K+	98.48	98.33	98.39	98.31	✓
scTumorTrace	Qualitative	N	N	N/Y	200 K+	99.04	98.62	98.80	99.05	✓

kNN = k-nearest neighbors, OvA SVM = one-versus-all support vector machine, OvO SVM = one-versus-one support vector machine, RF = random forest, FS = feature selection is used (Y) or not (N), Z = standardization (z-score) is used (Y) or not (N), Log2 = a log2 scale is used (Y) or not (N), #Feat = amount of features, 500+=500 features (2-tuples) on average, 7 K+=7000 features (2-tuples) on average, 200 K+=0.2 million features (2-tuples) on average, R = recall%, P = precision%, F1 = f-score%, ACC = accuracy%, ✓= highly effective.

Table 2

Summary statistics of predictive performance: tumor presence of single cell transcriptomes.

ML Methods	Feature Type	FS	Z	Log2	#Feat	TPR	TNR	F1	ACC	Effective
kNN	Quantitative	N	Y	N	13,126	99.95	0	61.20	44.09	×
kNN	Quantitative	N	Y	Y	13,126	100	0	61.22	44.12	×
OvA SVM	Quantitative	N	Y	N	13,126	100	0	61.22	44.12	×
OvA SVM	Quantitative	N	Y	Y	13,126	99.52	0.13	61.05	43.98	×
OvO SVM	Quantitative	N	Y	N	13,126	100	0	61.22	44.12	×
OvO SVM	Quantitative	N	Y	Y	13,126	100	0	61.22	44.12	×
RF	Quantitative	N	Y	N	13,126	100	0	61.22	44.12	×
RF	Quantitative	N	Y	Y	13,126	100	0	61.22	44.12	×
RF	Quantitative	Y	Y	N	4608	100	0	61.22	44.12	×
RF	Quantitative	Y	Y	Y	1025	100	0	61.22	44.12	×
scTumorTrace	Qualitative	Y	N	N/Y	500+	97.24	65.10	80.55	79.28	Δ
scTumorTrace	Qualitative	Y	N	N/Y	7 K+	98.35	84.82	90.40	90.79	▲
scTumorTrace	Qualitative	N	N	N/Y	200 K+	97.77	99.79	98.74	98.90	✓

kNN = k-nearest neighbors, OvA SVM = one-versus-all support vector machine, OvO SVM = one-versus-one support vector machine, RF = random forest, FS = feature selection is used (Y) or not (N), Z = standardization (z-score) is used (Y) or not (N), Log2 = a log2 scale is used (Y) or not (N), #Feat = amount of features, 500+=500 features (2-tuples) on average, 7 K+=7000 features (2-tuples) on average, 200 K+=0.2 million features (2-tuples) on average, TPR = sensitivity%, TNR = specificity%, F1 = f-score%, ACC = accuracy%, ×= not effective, Δ = less effective, ▲= fairly effective, ✓= highly effective.

OvA SVM built on standardized log2-transformed quantities demonstrated limited effectiveness (F-score = 57.99%) in inferring single cells’ tumor of origin (scTOO), and the other nine modern ML models were not applicable to the inference (Table 3). Compared to z-score, standardized log2 transformation helped a quantitative ML classifier learnt from bulk transcriptomes address scTOO (see accuracy, Table 3). The five experiments with random forest and scTumorTrace showed that feature selection caused severe damage to the scTOO inference from bulk transcriptomes no matter what expression quantities were used (Table 3). scTumorTrace with transcriptome-wide 2-tuples (i.e. without performing feature selection) was the only classifier that was well able to infer single cells’ tumor presence (F-score = 98.74%, Table 2) and their tumor of origin (F-score = 94.29%, Table 3 and Fig. 2C).
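As a worked check of the tumor presence figures, the F-score of scTumorTrace with transcriptome-wide 2-tuples can be recomputed from its sensitivity (Table 2) and the positive predictive value of 99.73% reported in the Discussion. The second line is a derivation, not a reported result: for a classifier that calls every cell malignant (TPR = 100%, TNR = 0%), precision equals accuracy and equals the fraction of malignant cells, which reproduces the F-score of about 61.2% seen for such rows in Table 2.

```latex
% scTumorTrace with transcriptome-wide 2-tuples
F_1 = \frac{2 \cdot \mathrm{PPV} \cdot \mathrm{TPR}}{\mathrm{PPV} + \mathrm{TPR}}
    = \frac{2 \times 0.9973 \times 0.9777}{0.9973 + 0.9777} \approx 0.9874

% A classifier that calls every cell malignant (TPR = 1, TNR = 0)
\mathrm{PPV} = \mathrm{ACC} = \frac{N_{+}}{N_{+} + N_{-}} \approx 0.4412,
\qquad
F_1 = \frac{2 \times 0.4412 \times 1}{0.4412 + 1} \approx 0.612
```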

Table 3

Summary statistics of predictive performance: tumor of origin of single and/or circulating tumor cells.

ML Methods	Feature Type	FS	Z	Log2	#Feat	R	P	F1	ACC	Effective
kNN	Quantitative	N	Y	N	13,126	38.31	NaN	NaN	9.21	×
kNN	Quantitative	N	Y	Y	13,126	61.60	NaN	NaN	29.58	×
OvA SVM	Quantitative	N	Y	N	13,126	63.77	NaN	NaN	14.63	×
OvA SVM	Quantitative	N	Y	Y	13,126	80.96	57.50	57.99	42.80	Δ
OvO SVM	Quantitative	N	Y	N	13,126	51.96	NaN	NaN	12.52	×
OvO SVM	Quantitative	N	Y	Y	13,126	71.83	NaN	NaN	39.55	×
RF	Quantitative	N	Y	N	13,126	47.96	NaN	NaN	12.40	×
RF	Quantitative	N	Y	Y	13,126	67.20	NaN	NaN	41.21	×
RF	Quantitative	Y	Y	N	4608	38.39	NaN	NaN	8.98	×
RF	Quantitative	Y	Y	Y	1025	49.14	NaN	NaN	16.67	×
scTumorTrace	Qualitative	Y	N	N/Y	500+	63.60	54.46	55.57	74.89	Δ
scTumorTrace	Qualitative	Y	N	N/Y	7 K+	75.60	NaN	NaN	87.79	Δ
scTumorTrace	Qualitative	N	N	N/Y	200 K+	91.96	97.38	94.29	98.57	✓

kNN = k-nearest neighbors, OvA SVM = one-versus-all support vector machine, OvO SVM = one-versus-one support vector machine, RF = random forest, FS = feature selection is used (Y) or not (N), Z = standardization (z-score) is used (Y) or not (N), Log2 = a log2 scale is used (Y) or not (N), #Feat = amount of features, 500+=500 features (2-tuples) on average, 7 K+=7000 features (2-tuples) on average, 200 K+=0.2 million features (2-tuples) on average, R = recall%, P = precision%, F1 = f-score%, ACC = accuracy%, ×= not effective, Δ = less effective, ✓= highly effective.

Fig. 2

Confusion matrix of scTumorTrace with transcriptome-wide 2-tuples. (A) Bulk tissue of origin: training set. (B) Bulk tissue of origin: validation set. (C) Tumor of origin of single cells and circulating tumor cells. (D) Inference of non-positive CTCs derived from prostate cancer patients.

Fig. 2D showed that all 45 single cells that were candidate prostate CTCs but lineage-confirmed non-CTCs (false CTCs) were accurately identified as leukocytes by scTumorTrace. Mirrored histograms further showed that almost all the false CTCs were identified unequivocally (Fig. 3). Although false-CTC 45 was not wrongly identified, its inference was not very strongly supported by the other six cancer types, i.e. no pale-red bars against blue bars among them (Fig. 3). Its scRNA-seq profile might be badly distorted and might not be very dissimilar to that of prostate cancer; however, the slight difference could still be detected by scTumorTrace. Three of the 77 lineage-confirmed single prostate CTCs were also identified as leukocytes. The three circulating cells (cells 3, 21 and 24) had mirrored histograms similar to those of false-CTCs 19, 20 and 37 (Fig. 3), so their tumor presence might be in doubt. Overall, scTumorTrace had both AUROC and AUPRC far beyond 90% in scTOO inferences except in the breast cancer experiment (Fig. S3). The 74 single breast CTCs were identified as either breast-derived malignant cells (n = 42) or leukocytes (n = 32), which resulted in a 98.04% AUROC and a 69.65% AUPRC (Fig. 2C and Fig. S3).

Fig. 3

Mirrored histograms of discriminant scores for non-positive CTCs captured from prostate cancer patients. A mirrored histogram showed the likelihood of a single cell being prostate cancer-derived (the top histogram in red) or being leukocytes-like (the bottom histogram in blue) supported by a third-party cancer type. Pale-red indicated a cell to be inferred was against a single prostate CTC. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Discussion

CTC detection platforms normally require enrichment strategies and might not avoid false-positive or false-negative events [18], [19]. Of the 77 lineage-confirmed single prostate CTCs, 74 were correctly inferred (Fig. 2C), and their discriminant scores showed overwhelming odds in favor of malignant cells derived from prostate cancer (red bars vs. blue & pale-blue bars; Fig. S4). Although the other three single prostate CTCs (cells 3, 21 and 24) seemed misclassified (as leukocytes), their discriminant scores were more similar to those of the 45 candidate prostate CTCs that were lineage-confirmed non-CTCs (blue bars vs. red & pale-red bars; Fig. 3). We thus suspected that the three were false-positive events (i.e. tumor absence). For the 74 single breast CTCs, Fig. S5 (inferred leukocytes) and Fig. S6 (inferred breast-derived CTCs) showed that there were no overwhelming discriminant scores for breast CTCs (red bars) against leukocytes (blue bars). Here, red or pale-red bars indicated whether scTumorTrace tended towards or against breast cancer cells as shaped by the other cancer types. Similarly, blue or pale-blue bars indicated the delineation of white blood cells. Thirty-two single breast CTCs were falsely identified as leukocytes, mainly because their discriminant scores between breast cancer and WBCs tended towards WBCs (pale-red against blue bars, Fig. S5). Forty of the 42 correctly inferred breast-derived CTCs also showed pale-red vs. blue bars in breast cancer cells vs. white blood cells (Fig. S6). Fig. S5 and Fig. S6 imply that the 2-tuples discovered between breast cancer and WBCs would not be applicable to single breast cancer cell transcriptomes. Even so, Eq. (2) shows that scTumorTrace can still draw inferences for these cells to a certain extent through transcriptomic 2-tuples between breast cancer and the other six cancer types.
Cell 44 had overwhelming odds against being a single breast CTC, with a majority of pale-red bars (including the overall score), so it would be a false positive rather than a misclassified instance (Fig. S5). Although accuracy did not reach a high level, scTumorTrace identified the 74 single breast CTCs as either breast-derived malignant cells (n = 42) or leukocytes (n = 32) and as no other class (Fig. 2C). Provided that tumor presence can be (or has been) experimentally confirmed, scTumorTrace can simply be used to determine the tumor of origin of CTCs from scRNA-seq profiling. Otherwise, scTumorTrace can serve as an alternative means of establishing the true identity of single CTCs together with their tumor of origin. Our empirical study showed that scTumorTrace was capable of inferring single cells’ tumor presence (sensitivity, 97.77%, positive predictive value, 99.73%, and F-score, 98.74%). scTumorTrace can, therefore, help separate neoplastic cancer cells from the TME, where tumor-infiltrating lymphocytes are present and interact closely with surrounding tumor cells. A recent method, CopyKAT, uses scRNA-seq data to infer aneuploid copy number events for the separation. However, it is not suitable for cancers with few copy number alterations (CNA), and its detection of CNA events is biased when the data to be inferred contain no tumor cells at all [6]. scTumorTrace employs transcriptome-wide 2-tuples that can be discovered among all sorts of cancers, so it is also applicable to pediatric cancers and hematopoietic cancers, for which CopyKAT is not suitable. More importantly, CopyKAT needs scRNA-seq data from a collection of cells for CNA inference, while scTumorTrace is a single-instance classifier that infers cells one by one regardless of whether tumor cells are absent from or present in a scRNA-seq dataset. Briefly, scTumorTrace is applicable to even one single cell but CopyKAT is not.
Bulk RNA-seq estimates the global expression of thousands of cells and can capture more transcripts, while scRNA-seq detects an individual cell’s expression with low capture efficiency and exhibits technical and biological cell-to-cell variation [20]. The present study observed different scales of expression quantification between bulk RNA-seq and scRNA-seq (IQR: bulk training, 10.3495, bulk test, 10.3943, and single cells, 32.5449). This might explain why log2-transformed quantities (IQR: bulk training, 2.8230, bulk test, 2.8157, and single cells, 5.0493) helped quantitative ML techniques to infer scRNA-seq from bulk RNA-seq (Table 3). Since gene programs learnt from bulk transcriptomic identities might be distorted in single cell transcriptomes profiled by current scRNA-seq technologies, inferring single cells’ tumor presence and their tumor of origin from bulk RNA-seq is a big challenge. Performing feature selection makes the challenge harder for both the quantitative and the qualitative approaches: the fewer features were selected, the more likely the tumor identities were damaged (Table 2, Table 3). scTumorTrace can automatically adjust transcriptomic 2-tuples for each individual cell in accordance with the completeness of its scRNA-seq profiling (see Eq. (1)). Therefore, scTumorTrace maximizes its effectiveness only when transcriptome-wide gene programs are available (Table 2, Table 3 and Fig. 2C). scTumorTrace is a qualitative and standardization-free learning technique and is not subject to log2-transformed quantities, so it can address the raised questions through the “qualitative identities” inherent in both bulk transcriptomes and single cell transcriptomes; modern quantitative ML techniques cannot. scTumorTrace has a fundamental weakness in computation time. This is due to the large number of transcriptome-wide 2-tuples, ranging from one thousand to hundreds of thousands. When we extend the present study to more cancer types, the all-in-one panel of transcriptomic 2-tuples can grow rapidly. A faster version should be developed, especially for high-throughput single cell platforms like 10X Genomics. Meanwhile, we will need to improve the distinction between single breast cancer cells and white blood cells. We will also have to apply scTumorTrace to a broad range of cancer types so that we may suggest the primary site of the CUP disease from either bulk tissue or single cell transcriptomes.
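The statement that a qualitative comparison is not subject to log2-transformed quantities follows from log2(TPM + 1) being strictly increasing: the ordering of any two genes within one sample is the same before and after the transformation. A minimal sketch with hypothetical gene names and values:

```python
import math

def orderings(profile, pairs):
    """Sign of the within-sample comparison for each gene pair: 1, -1 or 0."""
    return [
        (profile[a] > profile[b]) - (profile[a] < profile[b])
        for a, b in pairs
    ]

# Hypothetical TPM profile of one cell, and its log2(TPM + 1) counterpart.
tpm = {"g1": 0.0, "g2": 12.5, "g3": 950.0, "g4": 12.5}
log2_tpm = {g: math.log2(v + 1.0) for g, v in tpm.items()}

pairs = [("g1", "g2"), ("g3", "g2"), ("g2", "g4"), ("g3", "g1")]
raw_signs = orderings(tpm, pairs)
log_signs = orderings(log2_tpm, pairs)
```

Both lists are identical, including the tie between g2 and g4, so a classifier built on such orderings gives the same answer on either scale.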

Conclusions

Bulk RNA-seq and scRNA-seq are quite different on biological and technical levels. We asked whether a multiclass model learnt from bulk RNA-seq is applicable to addressing single cells’ tumor presence and their tumor of origin (scTOO). Our pioneering experiment produced three pieces of empirical evidence. Firstly, standardized log2 transformation helps a quantitative ML method improve its applicability. Secondly, performing feature selection damages the applicability of both quantitative and qualitative ML approaches. Thirdly, it is unlikely that tumor presence of single cell transcriptomes and scTOO can be inferred from bulk transcriptomes by modern quantitative machine learning. We might need to seek a qualitative learning technique with transcriptome-wide gene programs for such inferences. scTumorTrace could then be tailored to the particular needs.

Conflict of interest statement

Hung-Ming Lai has ownership interests (including stock, patents, etc) as an inventor of pending unpublished provisional patent application(s) for scTumorTrace and its clinical applications. No potential conflicts of interest were disclosed by the other authors.

19 in total

Review 1. Current approaches for avoiding the limitations of circulating tumor cells detection methods-implications for diagnosis and treatment of patients with solid tumors.

Authors: Artur Kowalik; Magdalena Kowalewska; Stanisław Góźdź
Journal: Transl Res Date: 2017-04-26 Impact factor: 7.012

2. A platform for primary tumor origin identification of circulating tumor cells via antibody cocktail-based in vivo capture and specific aptamer-based multicolor fluorescence imaging strategy.

Authors: Min Jia; Yifei Mao; Chuanchen Wu; Shuo Wang; Hongyan Zhang
Journal: Anal Chim Acta Date: 2019-07-26 Impact factor: 6.558

3. Delineating copy number and clonal substructure in human tumors from single-cell transcriptomes.

Authors: Ruli Gao; Shanshan Bai; Ying C Henderson; Yiyun Lin; Aislyn Schalck; Yun Yan; Tapsi Kumar; Min Hu; Emi Sei; Alexander Davis; Fang Wang; Simona F Shaitelman; Jennifer Rui Wang; Ken Chen; Stacy Moulder; Stephen Y Lai; Nicholas E Navin
Journal: Nat Biotechnol Date: 2021-01-18 Impact factor: 54.908

Review 4. A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications.

Authors: Ashraful Haque; Jessica Engel; Sarah A Teichmann; Tapio Lönnberg
Journal: Genome Med Date: 2017-08-18 Impact factor: 11.117

5. SCRABBLE: single-cell RNA-seq imputation constrained by bulk RNA-seq data.

Authors: Tao Peng; Qin Zhu; Penghang Yin; Kai Tan
Journal: Genome Biol Date: 2019-05-06 Impact factor: 13.583

Review 6. Eleven grand challenges in single-cell data science.

Authors: David Lähnemann; Johannes Köster; Ewa Szczurek; Davis J McCarthy; Stephanie C Hicks; Mark D Robinson; Catalina A Vallejos; Kieran R Campbell; Niko Beerenwinkel; Ahmed Mahfouz; Luca Pinello; Pavel Skums; Alexandros Stamatakis; Camille Stephan-Otto Attolini; Samuel Aparicio; Jasmijn Baaijens; Marleen Balvert; Buys de Barbanson; Antonio Cappuccio; Giacomo Corleone; Bas E Dutilh; Maria Florescu; Victor Guryev; Rens Holmer; Katharina Jahn; Thamar Jessurun Lobo; Emma M Keizer; Indu Khatri; Szymon M Kielbasa; Jan O Korbel; Alexey M Kozlov; Tzu-Hao Kuo; Boudewijn P F Lelieveldt; Ion I Mandoiu; John C Marioni; Tobias Marschall; Felix Mölder; Amir Niknejad; Lukasz Raczkowski; Marcel Reinders; Jeroen de Ridder; Antoine-Emmanuel Saliba; Antonios Somarakis; Oliver Stegle; Fabian J Theis; Huan Yang; Alex Zelikovsky; Alice C McHardy; Benjamin J Raphael; Sohrab P Shah; Alexander Schönhuth
Journal: Genome Biol Date: 2020-02-07 Impact factor: 13.583

7. RNA-Seq of single prostate CTCs implicates noncanonical Wnt signaling in antiandrogen resistance.

Authors: David T Miyamoto; Yu Zheng; Ben S Wittner; Richard J Lee; Huili Zhu; Katherine T Broderick; Rushil Desai; Douglas B Fox; Brian W Brannigan; Julie Trautwein; Kshitij S Arora; Niyati Desai; Douglas M Dahl; Lecia V Sequist; Matthew R Smith; Ravi Kapur; Chin-Lee Wu; Toshi Shioda; Sridhar Ramaswamy; David T Ting; Mehmet Toner; Shyamala Maheswaran; Daniel A Haber
Journal: Science Date: 2015-09-18 Impact factor: 47.728