
MFmap: A semi-supervised generative model matching cell lines to tumours and cancer subtypes.

Abstract

Translating in vitro results from experiments with cancer cell lines to clinical applications requires the selection of appropriate cell line models. Here we present MFmap (model fidelity map), a machine learning model to simultaneously predict the cancer subtype of a cell line and its similarity to an individual tumour sample. The MFmap is a semi-supervised generative model, which compresses high dimensional gene expression, copy number variation and mutation data into cancer subtype informed low dimensional latent representations. The accuracy (test set F1 score >90%) of the MFmap subtype prediction is validated in ten different cancer datasets. We use breast cancer and glioblastoma cohorts as examples to show how subtype specific drug sensitivity can be translated to individual tumour samples. The low dimensional latent representations extracted by MFmap explain known and novel subtype specific features and enable the analysis of cell-state transformations between different subtypes. From a methodological perspective, we report that MFmap is a semi-supervised method which simultaneously achieves good generative and predictive performance and thus opens opportunities in other areas of computational biology.

Year: 2021 PMID: 34914736 PMCID: PMC8675718 DOI: 10.1371/journal.pone.0261183

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

Tumour-derived cell lines are important model systems for developing new anti-cancer treatments and for understanding cancer biology [1-3]. They are comparatively cost efficient, easy to handle under laboratory conditions and do not raise the ethical issues arising in research involving human or animal subjects. Yet, promising cell line experiments are rarely translated to clinical applications. In some cases, there are remarkable differences between cell lines and the primary tumours they were derived from [2-4]. This is also the reason why the assignment of clinically informative tumour subtypes to cell line models [3-5] is not a straightforward task. To narrow the gap between preclinical findings and tumour treatment, it is necessary to select appropriate cell line models for a given tumour sample or a given cancer subtype. Several attempts to evaluate similarities and differences between cell lines and bulk tumours have focused on associations between corresponding data modalities including mutation, copy number, gene expression and methylation [6-12]. An important data resource comes from collaborative projects like NCI-60 [13] and the Cancer Cell Line Encyclopaedia (CCLE) [5, 14], which have generated large-scale pharmacogenomics data from patient-derived cell lines across organs. Other efforts like Sanger Genomics of Drug Sensitivity in Cancer (GDSC) [15], Connectivity Map (CMAP) [16] and the Cancer Therapeutics Response Portal (CTRP v1 and CTRP v2) [17, 18] further expanded the datasets. On the other hand, The Cancer Genome Atlas (TCGA) [19] and the International Cancer Genome Consortium (ICGC) [20] systematically characterised molecular profiles of thousands of tumours. These complementary data resources are valuable for understanding the complexity of cancer biology and connecting in vitro pharmacogenomic profiles to patient molecular characteristics, potentially informing anti-cancer treatment strategies.
Integrative analyses considering multiple data types of both cell lines and bulk tumours are still challenging and new analysis concepts tailored towards specific questions are an ongoing research topic. For instance, Cellector [21] preselects the most frequent genomic alterations and defines cancer subtypes based on a sequence of these alterations. Although such a preselection of genomic alterations integrates prior knowledge about cancer mutational patterns, it neglects complementary information contained in other data types. Furthermore, Cellector relies on a binary matrix of genomic alterations. This matrix is very sparse, since samples harbouring the same alterations are very rare. Therefore, the statistical power to detect appropriate cell lines for tumours might be limited. A recent study [22] highlighted that independent classifiers based on different data types to predict cell line identity often yield inconsistent results. For example, predictions based on the mutation spectrum and oncogenic mutations can be contradictory, although both features are derived from mutation data. Complementary information from different data sources is integrated by the MAGNETIC-framework [23] into gene modules. Gene set enrichment analysis (GSEA) is then used to interpret these modules as pathways. MAGNETIC is indeed a powerful technique for integrating multiple molecular datasets and prior knowledge, but it does not conclude to what extent a cell line is suitable as a tumour model. The maui framework assigns cancer subtype labels to cell lines by extracting relevant features from multiple data types using a variational autoencoder (VAE) [24]. However, most of the maui embedded features are weakly associated with subtype labels and are therefore difficult to interpret. Here, we propose MFmap, a new semi-supervised VAE architecture and objective function which combines good classification accuracy with good generative performance. 
We exploit these properties to derive subtype informed low dimensional representations for both cell lines and bulk tumours from high dimensional multi-omics data including gene expression, mutation and copy number variation. The latent representations can then be used to assess the similarity between a cell line and a tumour. We provide cell line by tumour dissimilarity matrices for CCLE and TCGA for the ten different cancer types listed in Table 1. In addition, MFmap predicts cancer subtype labels for cell lines. We demonstrate how these predicted cancer subtypes can be used to transfer information from cell-line-based drug sensitivity screens to patient cohorts. We also show that the latent representations learnt by MFmap are biologically interpretable. Finally, we illustrate how the generative nature of the MFmap model can be exploited for studying subtype transformations during cancer progression. At http://h2926513.stratoserver.net:3838/MFmap_shiny/ we provide a resource enabling researchers to select the most relevant cell line for a cancer patient.

Table 1

The sample size of TCGA and CCLE data used for training and testing MFmap.

TCGA code	study name	number of subtypes	TCGA sample size	CCLE sample size
BRCA	Breast invasive carcinoma	4	484	51
COADREAD	Colon adenocarcinoma	4	414	54
ESCA	Esophageal carcinoma	2	169	27
HNSC	Head and neck squamous cell carcinoma	4	278	29
LUAD	Lung adenocarcinoma	3	227	70
LUSC	Lung squamous cell carcinoma	4	178	22
PAAD	Pancreatic adenocarcinoma	2	149	40
SKCM	Skin cutaneous melanoma	3	260	49
UCEC	Uterine corpus endometrial carcinoma	3	234	28
GBMLGG	Glioblastoma multiforme and lower grade glioma	7	621	55

Materials and methods

Matching cell lines and tumours as a semi-supervised learning problem

MFmap is a semi-supervised deep neural network which integrates gene expression, copy number variation (CNV) and somatic mutation data with subtype classification. Each tumour sample t consists of a pair (x_t, y), where x_t denotes the high dimensional molecular features and y ∈ {1, …, h} is the cancer subtype label. For a cell line c, the cancer subtype is unknown and only the molecular features x_c are available. The index c or t will be suppressed whenever we refer to a single observation. The MFmap neural network is trained in a semi-supervised manner using both the unlabelled cell line data {x_c} and the labelled tumour data {(x_t, y)}. Here, we used cell line data from CCLE and tumour data from TCGA. One aim of MFmap is to use semi-supervised classification to infer the cancer subtype y of a cell line c. A second aim is to assess the similarity between a cell line and a tumour. Instead of comparing the high dimensional molecular features x_c and x_t directly, we first encode them into low dimensional latent representations z_c and z_t (see next section for details). Then, the similarity of a tumour sample t and a cell line c is measured as the cosine coefficient between the corresponding latent representation vectors z_c and z_t. We will also show that these latent representations carry interpretable biological information. The molecular data x = (x_RNA, x_DNA) consist of gene expression profiles x_RNA and network smoothed mutation and CNV profiles x_DNA. We will refer to these two parts as RNA and DNA view, respectively. The DNA view is obtained from the original binary mutation and CNV matrices (Fig 1(A)), which indicate the occurrence of a mutation or CNV event targeting a gene in a given tumour sample or cell line. These very sparse matrices are first projected onto an annotated cancer network [25]. By using a network diffusion algorithm [26], a mutation or CNV signal hitting a single gene is propagated to neighbouring nodes in the network, thereby enriching the mutation or CNV data by cancer network information.
All molecular features were translated and scaled to the interval between zero and one.
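The preprocessing described above can be illustrated with a minimal sketch. It assumes a random walk with restart as the diffusion step; the text only cites a network diffusion algorithm [26] without naming it, so the update rule, the restart probability `alpha`, the iteration count and both function names are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def network_smooth(binary_profiles, adjacency, alpha=0.5, n_iter=50):
    """Propagate binary alteration signals to network neighbours.

    Assumption: random walk with restart (not confirmed by the source).
    binary_profiles: (n_samples, n_genes) 0/1 matrix of mutation or CNV events.
    adjacency:       (n_genes, n_genes) symmetric network adjacency matrix.
    """
    degree = adjacency.sum(axis=1, keepdims=True).astype(float)
    degree[degree == 0] = 1.0                 # isolated genes: avoid division by zero
    transition = adjacency / degree           # row-normalised transition matrix
    smoothed = binary_profiles.astype(float)
    for _ in range(n_iter):
        # restart at the observed alterations, otherwise step to a neighbour
        smoothed = alpha * binary_profiles + (1 - alpha) * smoothed @ transition
    return smoothed

def min_max_scale(features):
    """Shift and scale every feature (column) to the interval [0, 1]."""
    lo = features.min(axis=0, keepdims=True)
    span = features.max(axis=0, keepdims=True) - lo
    span[span == 0] = 1.0                     # constant features map to 0
    return (features - lo) / span
```

With a three-gene path network and a single alteration in the first gene, the signal spreads to the second and, more weakly, to the third gene, which is the densification effect the text describes.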

Fig 1

Overview of MFmap.

(A) In a preprocessing step, mutation and CNV profiles are transformed to network smoothed DNA profiles. The original mutation and CNV data are represented as a binary matrix indicating the presence/absence of a DNA alteration in a given tumour sample or cell line. This sparse matrix is projected onto a cancer reference network (CRN) [25] and a network diffusion algorithm propagates this information to network neighbours, resulting in a dense DNA mutation or CNV matrix (DNA features). (B) The smoothed DNA features (DNA view) combined with gene expression data (RNA view) form the input of MFmap. The neural network architecture of MFmap has three components: encoder, decoder and classifier, encoded by different colours. The encoder maps sample features x to a distribution q(z|x) for the latent representation z with mean value μ(x) and covariance σ²(x). The classifier outputs a molecular subtype probability p(y|z) and the decoder models a density p(x|z) for the reconstruction of the DNA and RNA views. During semi-supervised training, the molecular subtypes of tumour samples are used. (C) For visualisation, the latent representations of bulk tumour samples are used to generate a reference map. Cell lines are then projected to the reference map. The colour coding of individual samples or cell lines (dots) indicates the tumour subtype or the predicted subtype, respectively. The density of the tumour samples is indicated by background contour lines coloured according to the subtypes.

Specification of MFmap as a semi-supervised generative model

The MFmap neural network (Fig 1(B)) is a new variant of a semi-supervised VAE [27]. The observable data are considered to be drawn from the probability distributions p(x, y) for tumour samples and p(x) for cell lines. These distributions are modelled as marginals over the latent variable z, such that

$$p(x, y) = \int p(x, y, z)\,dz, \qquad p(x) = \int p(x, z)\,dz. \tag{1}$$

To facilitate biological interpretation of the latent representations, we set the dimension d of the latent space equal to the number of cancer subtypes h. In other applications of the MFmap model, one could also consider d as a tuneable hyper-parameter. For the generative model, we assume x and y to be conditionally independent given the latent variable z. Accordingly, the joint distribution can be factorised as

$$p(x, y, z) = p(x|z)\,p(y|z)\,p(z). \tag{2}$$

These distributions are specified as

$$p(z) = \mathcal{N}(z \mid 0, I), \tag{3a}$$

$$p(y|z) = \mathrm{Cat}(y \mid \pi(z)), \tag{3b}$$

$$p(x|z) = f(x \mid z). \tag{3c}$$

Here, p(z) is the prior distribution for the latent representation vector. We denote the Gaussian distribution with mean vector μ and covariance matrix Σ by N(μ, Σ). The parameter π(z) of the categorical distribution p(y|z) depends on the latent representation z. For the decoder p(x|z) one can choose a suitable distribution f with parameters depending on the latent representations [27]. The functions z ↦ π(z) and z ↦ f(·|z) are represented as neural networks. The parameters of these decoder networks are jointly denoted as θ. For the MFmap model we initially used a Gaussian distribution f(x|z) to model the outputs. However, we found that rescaling the molecular features to the interval [0, 1] and using a Bernoulli distribution for f improved the semi-supervised classification accuracy (see Results section). Then, each single output of the decoder neural network z ↦ f(·|z) can be interpreted as the probability that the corresponding molecular feature is active. For instance, for the i-th component x_i of the RNA view, the corresponding output can be regarded as the probability that the i-th gene is expressed.

Posterior inference, i.e. the evaluation of p(y, z|x) using Bayes' theorem, is often intractable, because the marginal likelihood p(x) in Eq (1) requires integrating over z. Therefore, a variational distribution q(y, z|x) is introduced to approximate the true posterior [24, 27]. We assume that the variational distribution reflects the conditional independence x ⊥ y | z of the generative model in Eq (2). This implies

$$q(y, z|x) = q(y|z)\,q(z|x). \tag{4}$$

For consistency we assume that q(y|z) in Eq (4) is identical to p(y|z) in Eq (3b) and is represented by the same neural network mapping z to the categorical parameter π(z). For the variational distribution q(z|x) we choose a Gaussian

$$q(z|x) = \mathcal{N}\big(z \mid \mu(x), \sigma^2(x)\big) \tag{5}$$

with parameters μ(x) and σ²(x). The parameters are represented by the encoder neural network x ↦ (μ(x), σ²(x)), which is itself parametrised by ϕ. The overall architecture of MFmap (Fig 1(B)) is thus formed by three neural networks, the encoder Eq (5), the classifier Eq (3b) and the decoder Eq (3c).

Training of MFmap using a semi-supervised loss function

Variational inference involves maximising an evidence lower bound (ELBO) to the log-likelihood of the observational data [24, 27]. For a single cell line sample x one can derive a lower bound to the log-likelihood

$$\log p(x) \geq \mathcal{L}_u(x), \tag{6}$$

$$\mathcal{L}_u(x) = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}\big(q(z|x)\,\|\,p(z)\big), \tag{7}$$

which is identical to the ELBO of the basic VAE [24] for unsupervised learning, consisting of a reconstruction loss term and a Kullback-Leibler (KL) divergence term. For a single labelled tumour sample (x, y) we have for the log-likelihood

$$\log p(x, y) \geq \mathcal{L}_l(x, y), \tag{8}$$

where the ELBO for labelled examples reads

$$\mathcal{L}_l(x, y) = \mathbb{E}_{q(z|x)}[\log p(x|z)] + \mathbb{E}_{q(z|x)}[\log p(y|z)] - D_{KL}\big(q(z|x)\,\|\,p(z)\big). \tag{9}$$

To derive this ELBO (see S1 File), we exploited the conditional independence assumption x ⊥ y | z for both the generative model (Eq (2)) and the inference model (Eq (4)). The additional term in Eq (9) in comparison to Eq (7) can be interpreted as a classification loss. Given a tumour sample (x_t, y), the probability for the cancer subtype label p(y|z) is a function of z, which is inferred from q(z|x). This distribution is in turn determined by the molecular feature vector x. We found empirically that the semi-supervised classification accuracy during training was relatively poor when using these exact negative ELBOs as loss functions. This is in line with previous findings that achieving both good semi-supervised classification accuracy and good generative performance is often difficult in VAEs [28] or other generative models [29]. Motivated by the work from [30], we added the negative entropy of the distribution p(y|z) to the unsupervised ELBO in Eq (7) and to the supervised ELBO in Eq (9). In summary, the MFmap loss functions for the unlabelled cell line and the labelled tumour data are respectively given by

$$\mathcal{J}_u(x) = -\mathcal{L}_u(x) + \mathbb{E}_{q(z|x)}\big[H(p(y|z))\big], \qquad \mathcal{J}_l(x, y) = -\mathcal{L}_l(x, y) + \mathbb{E}_{q(z|x)}\big[H(p(y|z))\big], \tag{10}$$

where H denotes the entropy. This entropy regularisation encourages the classification boundaries to be located in low sample density regions [30] in the latent space, which improves the generalisation performance of the model. As shown below (see Results section), the semi-supervised classification accuracy was high when using this entropy regularisation.

During training, mini-batches b = 1, …, B from the cell line and tumour data are used to minimise the total loss over different epochs. To check whether all terms in the MFmap loss function in Eq (10) can be jointly optimised, we recorded the values of each term in each training epoch and calculated their pair-wise correlations. The reconstruction loss −E[log p(x|z)], the KL-divergence D_KL(q(z|x)||p(z)), the entropy and the classification loss −E[log p(y|z)] are highly correlated (Fig 2), which suggests that they are optimised simultaneously.
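The four loss terms named above (reconstruction loss, KL divergence, entropy and classification loss) could be evaluated for one mini-batch as in the following sketch. It assumes a Bernoulli decoder, a standard normal prior, a diagonal Gaussian encoder, a single Monte Carlo sample of the latent variable and an unweighted sum of the terms; any relative weighting used in MFmap is not stated in this text, and the function name `mfmap_style_loss` is hypothetical. The inputs are the outputs of already evaluated networks, so no training is performed here.

```python
import numpy as np

def mfmap_style_loss(x, x_prob, mu, log_var, class_prob, y=None, eps=1e-7):
    """Loss terms of a semi-supervised VAE with entropy regularisation.

    x          : (n, p) molecular features scaled to [0, 1]
    x_prob     : (n, p) Bernoulli means returned by the decoder
    mu, log_var: (n, d) mean and log-variance returned by the encoder
    class_prob : (n, h) subtype probabilities returned by the classifier
    y          : (n,) integer subtype labels, or None for unlabelled cell lines
    """
    x_prob = np.clip(x_prob, eps, 1 - eps)
    class_prob = np.clip(class_prob, eps, 1.0)
    # negative Bernoulli log-likelihood, summed over features
    recon = -np.sum(x * np.log(x_prob) + (1 - x) * np.log(1 - x_prob), axis=1)
    # closed-form KL divergence between N(mu, diag(exp(log_var))) and N(0, I)
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1)
    # entropy of the predicted subtype distribution (the regulariser)
    entropy = -np.sum(class_prob * np.log(class_prob), axis=1)
    terms = {"reconstruction": recon.mean(), "kl": kl.mean(),
             "entropy": entropy.mean()}
    if y is not None:
        # cross-entropy of the true label, only available for tumours
        terms["classification"] = -np.log(class_prob[np.arange(len(y)), y]).mean()
    terms["total"] = sum(terms.values())
    return terms
```

Returning the individual terms alongside the total makes it straightforward to record them per epoch and correlate them, as done for Fig 2.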

Fig 2

Joint optimisation of the reconstruction loss, the KL divergence, entropy and the classification loss with the MFmap loss function.

The plot shows the pairwise correlation of different terms in the MFmap loss function Eq (10) during different training epochs.

Visualisation of individual samples

The MFmap latent representation can be used to visualise and organise the associations of individual tumour samples and cell lines (Fig 1(C)). Inspired by the visualisation concept of Onco-GPS (OncoGenic Positioning System) [31], we used the tumour samples with known subtypes to generate a reference map for the cancer subtypes. In this reference map, the components z1, …, zh of the latent representation z are presented as a graph with h corner points in a plane. The location of these corner points is determined by multidimensional scaling and is chosen so as to reflect the distances in the h-dimensional latent space as well as possible (see S1 File for details). An individual tumour sample can now be visualised as a point located in the area between the corner points. The location of such a point is given by a superposition of the corner positions weighted by the latent representation magnitudes of individual samples. In addition, the subtypes of the tumour samples are colour coded. The contour lines and the background colour shading represent the sample density in the region. Once the reference map is established, individual cell lines can be projected to this map, where the colour of each dot encodes the subtype predicted by the MFmap classifier. This projection is based on the latent representation values of the cell line samples. Since our aim is to analyse the fidelity of a cell line as an oncological model for a given tumour or a cancer subtype, we name our framework the model fidelity map (MFmap).
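The construction of the reference map can be sketched as follows. The exact procedure is given in the S1 File, which is not reproduced here, so this is only one plausible reading: corner points are placed by classical multidimensional scaling of the Euclidean distances between latent components, and each sample is positioned at the average of the corners weighted by the absolute values of its latent components. The function names and the normalisation of the weights are assumptions.

```python
import numpy as np

def corner_positions(latent, n_dims=2):
    """Place one corner per latent component using classical MDS.

    Assumption: the distance between two components is the Euclidean
    distance between the corresponding columns of the tumour latent matrix.
    latent: (n_samples, h) latent representations of bulk tumours.
    """
    cols = latent.T                                    # (h, n_samples)
    diff = cols[:, None, :] - cols[None, :, :]
    dist_sq = np.sum(diff ** 2, axis=-1)               # squared pairwise distances
    h = dist_sq.shape[0]
    centring = np.eye(h) - np.ones((h, h)) / h
    gram = -0.5 * centring @ dist_sq @ centring        # double-centred Gram matrix
    eigval, eigvec = np.linalg.eigh(gram)
    top = np.argsort(eigval)[::-1][:n_dims]            # largest eigenvalues first
    return eigvec[:, top] * np.sqrt(np.clip(eigval[top], 0.0, None))

def project_samples(latent, corners):
    """Position samples as weighted averages of the corner points."""
    weights = np.abs(latent)                           # latent magnitudes as weights
    weights = weights / np.maximum(weights.sum(axis=1, keepdims=True), 1e-12)
    return weights @ corners
```

Under these assumptions, the corners would be computed once from the tumour latent representations, and `project_samples` would then be applied to tumours and cell lines alike, so that both are drawn in the same coordinate system.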

Results

Evaluating the MFmap classification and generative performance

A direct evaluation of the MFmap subtype prediction for cell lines is impossible because there are no ground truth labels available. However, the classification accuracy on an unseen test dataset of bulk tumours provides an indirect evaluation of the subtype prediction performance. In Table 2 we used 20% of the tumour samples as an independent test set and evaluated the classification performance using four multi-class classification metrics: overall accuracy, weighted precision, weighted recall, and weighted F1 score. Similar results are obtained when 10% of the tumour samples are used for testing (see Table 1 in the S2 File). We also tested the effect of increasing the latent space dimension d and found that the classification accuracy was typically not higher, indicating that our choice of setting d equal to the number of cancer subtypes did not impair the classification accuracy (see Table 2 in the S2 File).

Table 2

MFmap subtype classification performance estimated for unseen tumour samples.

Here, 20% of the bulk tumour data were randomly selected as an independent test set.

accuracy	precision	recall	F₁ score	organ
0.97	0.97	0.97	0.97	BRCA
0.96	0.96	0.96	0.96	COADREAD
1.00	1.00	1.00	1.00	ESCA
0.99	0.99	0.99	0.99	GBMLGG
0.91	0.92	0.91	0.91	HNSC
0.96	0.96	0.96	0.96	LUAD
0.94	0.95	0.94	0.94	LUSC
0.97	0.97	0.97	0.97	PAAD
1.00	1.00	1.00	1.00	SKCM
0.96	0.96	0.96	0.96	UCEC
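For readers unfamiliar with the weighted multi-class metrics reported in Table 2, the sketch below shows the usual definition: per-class precision, recall and F1 are averaged with weights proportional to the number of true samples in each class. This is the standard convention and is assumed, not confirmed, to be the one used for the table; the function name is hypothetical.

```python
import numpy as np

def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, support = np.unique(y_true, return_counts=True)
    precision, recall, f1 = [], [], []
    for c in classes:
        tp = np.sum((y_true == c) & (y_pred == c))     # true positives for class c
        n_pred = np.sum(y_pred == c)
        p = tp / n_pred if n_pred > 0 else 0.0
        r = tp / np.sum(y_true == c)
        f = 2 * p * r / (p + r) if p + r > 0 else 0.0
        precision.append(p)
        recall.append(r)
        f1.append(f)
    w = support / support.sum()                        # class weights by support
    return {
        "accuracy": float(np.mean(y_true == y_pred)),
        "precision": float(w @ np.array(precision)),
        "recall": float(w @ np.array(recall)),
        "f1": float(w @ np.array(f1)),
    }
```

With this definition the weighted recall is always equal to the overall accuracy, which is consistent with the identical accuracy and recall columns in Table 2.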

The good classification results for GBMLGG are intriguing, because the G-CIMP-High, G-CIMP-Low and LGm6-GBM subtypes were derived from methylation data [32], which were not used to train MFmap. This indicates that MFmap is able to extract DNA and RNA patterns reflecting features originally derived from different methylation status. In addition, we tested how well the MFmap autoencoder part reconstructs the molecular features x. To this end, we first sampled a latent representation z from the encoder q(z|x) for a given input x from the real data. Then, we correlated these original molecular features with the output sampled from the decoder distribution p(x|z). The histogram of Pearson correlation coefficients in Fig 3 shows a high input-output correlation for most molecular features for three exemplary cancer types: breast invasive carcinoma (BRCA), colorectal adenocarcinoma (COADREAD) and glioblastoma multiforme and lower grade glioma (GBMLGG). Taken together, MFmap can combine very good classification accuracy with good generative performance.

Fig 3

The generative performance of MFmap.

The histogram shows sample-wise correlation coefficients between input features (DNA and RNA views) and reconstructed features output by the MFmap decoder.

Future applications of MFmap will include the analysis of query samples input to a reference model trained on a large data set. To check how well MFmap can perform in such a setting, we checked various measures for the quality of integrating these data from different sources [33-35]. Since this is not the focus of this paper, we have relegated the very promising results to the Supporting Information (see S2 File).

Selecting the optimal cell line for a given tumour

The heatmaps in Fig 4 represent pairwise cell line by tumour dissimilarity matrices for three exemplary cancer types BRCA, COADREAD and GBMLGG. In addition, the subtypes of bulk tumours annotated from [32, 36, 37] and the subtypes of cell lines predicted by the MFmap classifier are displayed. For a better visualisation, cell lines and tumours are clustered based on their pairwise cosine dissimilarity scores. The similarity of a cell line c to a tumour t is defined as the cosine of the angle between their latent representations z_c and z_t. Accordingly, the dissimilarity between c and t is defined as

$$d(c, t) = 1 - \frac{z_c \cdot z_t}{\|z_c\|\,\|z_t\|}.$$

A dissimilarity of d(c, t) = 0 indicates perfect alignment between the latent representations of the cell line and the tumour, whereas a dissimilarity d(c, t) = 1 indicates orthogonal latent representations. The highest dissimilarity of d(c, t) = 2 would be achieved for antipodal latent vectors. Based on this dissimilarity matrix, researchers can select the best cell lines for a given tumour or a given tumour subtype. Conversely, the relevance of promising experimental results observed in vitro can be checked by selecting a subset of tumours most likely resembling the cell line characteristics. The pairwise dissimilarity matrices between TCGA bulk tumours and CCLE cell lines and cell line subtype predictions for all tumour types listed in Table 1 are provided on our website (http://h2926513.stratoserver.net:3838/MFmap_shiny/).
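The cosine dissimilarity between cell lines and tumours can be computed for all pairs at once. The following is a minimal sketch; the function name and the array layout (one latent vector per row) are illustrative assumptions.

```python
import numpy as np

def cosine_dissimilarity(z_cell, z_tumour):
    """Pairwise dissimilarity d(c, t) = 1 - cos(z_c, z_t).

    z_cell:   (n_cell_lines, d) latent representations of cell lines
    z_tumour: (n_tumours, d) latent representations of tumours
    Returns an (n_cell_lines, n_tumours) matrix with values in [0, 2].
    """
    zc = z_cell / np.linalg.norm(z_cell, axis=1, keepdims=True)
    zt = z_tumour / np.linalg.norm(z_tumour, axis=1, keepdims=True)
    return 1.0 - zc @ zt.T

# Toy example: one cell line compared with an aligned, an orthogonal
# and an antipodal tumour vector.
z_cell = np.array([[1.0, 0.0]])
z_tumour = np.array([[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])
d = cosine_dissimilarity(z_cell, z_tumour)     # [[0., 1., 2.]]
best_cell_line_per_tumour = d.argmin(axis=0)   # row index of the closest cell line
```

Taking the minimum along the cell line axis selects the most similar cell line for each tumour; taking it along the tumour axis answers the converse question of which tumours a given cell line resembles most.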

Fig 4

Pairwise dissimilarity between CCLE cell lines and TCGA bulk tumours.

The colour coding in the heatmaps indicates the pairwise dissimilarity which was obtained from the latent representations of cell lines and tumours for the three exemplary cancer types (A) breast invasive carcinoma (BRCA), (B) colorectal adenocarcinoma (COADREAD) and (C) glioblastoma multiforme and lower grade glioma (GBMLGG). Tumours (columns) and cell lines (rows) were clustered according to the dissimilarity score, which ranges from 0 (very similar) to 2 (very dissimilar). The subtype classification of each cell line was predicted from the classification layer of the MFmap neural network. The tables display the sample size for the different subtypes or predicted subtypes.

These results also indicate for which subtypes suitable cell line models exist and for which subtypes cell lines should be prioritised for future in vitro model development [21]. Each BRCA subtype is represented by at least three cell lines (Fig 4(A)) and the heatmap shows that these cell lines are very similar to the corresponding tumours of the same subtype. However, only three cell lines represent the HER2-enriched subtype. The four subtypes of COADREAD tumours are also well represented by at least six highly similar cell lines in CCLE (Fig 4(B)). For GBMLGG, the Mesenchymal-like tumour subtype is represented by 31 cell lines with high similarity scores. Many TCGA tumour samples have the molecular subtype Codel and G-CIMP-high, but they are only represented by seven and nine cell lines, respectively. Only two cell lines were classified as Classic-like and a single cell line has the predicted subtype LGm6-GBM. The PA-like tumour subtype is not represented by any cell line.

Predicting drug sensitivity in cancer patient sub-cohorts using MFmap and in vitro drug screens

Predicting patient therapeutic response is one important goal of subtype stratification. To explore the translational potential of the subtypes predicted by MFmap we estimated the association between predicted subtypes and drug sensitivity of all compounds available in the CTRP dataset [18]. For each cancer type listed in Table 1 and each compound, we compared the drug sensitivity among different cell line subtypes predicted by the MFmap classifier. Drug sensitivity is quantified in CTRP by the area under the dose response curve (AUC). We used an ANOVA to test for differences in the mean AUC among the predicted subtypes. At a false discovery rate (FDR) cutoff of 25%, we found 18, six and 16 compounds in BRCA, GBMLGG and UCEC to show significant subtype specificity, respectively. For the other seven cancer types in Table 1, there are no significant AUC differences across the different subtypes. Note that the sample size per subtype is very small, which might explain why statistically significant results can only be obtained for three cancer types. For BRCA, the compound with the strongest association between subtype and drug sensitivity is Lapatinib (ANOVA p-value = 2.95e-05). Lapatinib is a tyrosine kinase inhibitor used in combination therapy for HER2-positive breast cancer [38]. Our results suggest that cell lines of molecular subtype HER2-enriched are more sensitive to Lapatinib treatment (Fig 5(A)) in comparison to the other three subtypes. Although there are only three cell lines representing the HER2-enriched subtype, this finding is in line with the known inhibitory mechanism of Lapatinib on the HER2/neu and epidermal growth factor receptor (EGFR) pathways. This result highlights the potential of MFmap as a tool for translating in vitro drug screening results to patient sub-cohorts.
Our analysis also suggests that larger sample sizes and a better coverage of underrepresented subtypes are essential to increase the statistical power for detecting subtype specificity from cell line drug screens.
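The multiple testing step of this screen can be illustrated with a short sketch. It assumes that the per-compound ANOVA p-values (for example from `scipy.stats.f_oneway` applied to the AUC values grouped by predicted subtype) are already available, and that the FDR is controlled with the Benjamini-Hochberg procedure; the text states the 25% cutoff but not the correction method, so this choice is an assumption.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)                               # ascending p-values
    scaled = p[order] * n / np.arange(1, n + 1)         # p_(i) * n / i
    # enforce monotonicity from the largest p-value downwards
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted

# Hypothetical ANOVA p-values for four compounds (illustrative numbers only).
p_anova = [0.01, 0.04, 0.03, 0.5]
q = benjamini_hochberg(p_anova)
subtype_specific = q <= 0.25                            # FDR cutoff from the text
```

In this toy input, the first three compounds pass the 25% cutoff and the last one does not.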

Fig 5

Cancer subtype specific drug sensitivity of CCLE cell lines.

The subtypes of breast invasive carcinoma (BRCA) cell lines respond differentially to the compounds Lapatinib and Oligomycin A. Treatment response to the compounds KHS101 and Bortezomib in glioblastoma multiforme and lower grade glioma (GBMLGG) cell lines is subtype specific. The drug sensitivity is summarised by the area under the dose response curve (AUC) and p-values refer to an ANOVA of the AUC differences among different subtypes.

Another drug with significant variations of the AUC values across the different BRCA subtypes is Oligomycin A (ANOVA p-value = 1.39e-4), a compound targeting oxidative phosphorylation via an inhibition of the ATP synthase. The potential of Oligomycin A as a therapeutic compound to prevent metastatic spread in breast cancer has recently been highlighted [39]. The results in Fig 5(B) suggest that treatment with Oligomycin A might be most efficient for the HER2-enriched and Luminal A or Luminal B subtypes. The drug sensitivities of KHS101 and Bortezomib are significantly associated with GBMLGG subtypes (KHS101: ANOVA p-value = 2.3e-04; Bortezomib: ANOVA p-value = 2.3e-04). The synthetic small molecule KHS101 was shown to promote tumour cell death in diverse glioblastoma multiforme cell line models [40]. Our analysis suggests that the G-CIMP-low subtype is more sensitive to KHS101 treatment (Fig 5(C)) compared to the other six GBMLGG subtypes. G-CIMP-low is an IDH mutant glioma subtype with poor clinical outcome in recurrent glioma [32]. Bortezomib targets the ubiquitin-proteasome pathway and is used for the treatment of multiple myeloma, but has also been discussed as treatment for glioma [41]. Our results in Fig 5(D) show that the Codel and G-CIMP-high subtypes have larger AUCs. The results for LGm6-GBM and Classic-like are not conclusive because there are not enough cell lines representing these subtypes.

Biological characterisation of latent representations learnt by MFmap

The pattern of latent representations learnt by MFmap can be used as a signature for cancer subtypes. For example, in BRCA, the basal-like subtype is characterised by a pattern of low values of components z1 and z4 and high values of z2 and z3 (Fig 6(A)). HER2-enriched tumours are characterised by high values of z1, z3 and z4. Luminal A and B subtypes can be distinguished by z4. Similarly, cancer subtypes in COADREAD and GBMLGG are highly associated with their latent representations learnt by MFmap (Fig 6(B) and 6(C)).

Fig 6

Association of MFmap latent representations and cancer subtypes.

The dimension of the latent representation h is set to the number of cancer subtypes. The boxplots display latent representations of different subtypes of TCGA samples in the three exemplary cancer types (A) breast invasive carcinoma (BRCA), (B) colorectal adenocarcinoma (COADREAD) and (C) glioblastoma multiforme and lower grade glioma (GBMLGG). Cancer subtypes are colour encoded and sorted by their median latent representations.

To further investigate the biological meaning of the latent representations we analysed the association between z and pathway activities in TCGA reference datasets. We used single sample gene set enrichment analysis (ssGSEA) [42] to assess sample-wise pathway activities. The pathway signatures were compiled from several sources including 10 curated oncogenic signalling pathways [43], 19 curated specific DNA damage repair (DDR) pathways [44], 14 expert-curated specific DDR processes and DDR associated processes [45]. This collection was combined with MsigDB (v7.0) [46] chemical and genetic perturbations (CGP) and canonical pathways (CP) collections (MsigDB C2 collection) and MsigDB (v7.0) hallmark gene sets (MsigDB H collection). The degree of associations was quantified by the information coefficient and the Pearson correlation coefficient and the statistical significance was assessed by permutation tests. To tackle class imbalance in the different subtypes, we applied SMOTE upsampling [47]. We used COADREAD as a proof of concept, because it has four well characterised molecular subtypes CMS1-CMS4 [37]. The CMS1 subtype is characterised by micro-satellite instability (MSI), whereas CMS4 tumours are micro-satellite stable. The CMS4 subtype is also distinguished from CMS1 by epithelial mesenchymal transformation (EMT) characteristics, accompanied by prominent stromal invasion and angiogenesis. These mutually exclusive characteristics are clearly reflected in the magnitudes of the latent representation components.
The top gene sets associated with component z2 are “WATANABE COLON CANCER MSI VS MSS UP” and “KOINUMA COLON CANCER MSI UP”, whereas z4 is associated with the activity of gene sets annotated as “HALLMARK ANGIOGENESIS” and “HALLMARK EPITHELIAL MESENCHYMAL TRANSITION”. Clearly, high values of z2 are a characteristic of the CMS1 subtype, whereas high values of z4 are a distinctive feature of CMS4 tumours. This example illustrates that a meaningful way to guide biological interpretation of the latent representations is to associate them with single sample pathway activity. The same method was applied to annotate latent representations of GBMLGG (Fig 7(A)), which has seven subtypes [32]. The Mesenchymal-like and PA-like subtypes are stratified by gene expression profiles and the G-CIMP-high, G-CIMP-low and LGm6-GBM subtypes are methylation based. The Codel subtype describes IDH-mutant samples harbouring a co-deletion of chromosome arms 1p and 19q. Many pathways associated with latent representation z1 are related to the neurotransmitter release cycle, which is also a characteristic of the Verhaak proneuronal subtype [48]. Pathways correlated to latent representation z2 are related to the mesenchymal cell type, hypoxia and angiogenesis, which characterise the Verhaak mesenchymal subtype. The activity of the Fanconi Anemia (FA) DNA repair pathway is highly correlated with latent representation z3. DNA damage response deficiency and amplified oncogenic MYC signalling characterise tumours with large values of latent representation z4. Latent representation z5 is related to the neurotransmitter release cycle and dysfunctional metabolism; latent representation z6 to mitotic checkpoint deficiency. Many pathways associated with latent representation z7 are involved in mismatch repair deficiency, replication stress and cell cycle dysregulation and are also related to the classical subtype in the earlier classification of Verhaak [48].
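The association analysis between a latent component and a pathway activity score can be sketched as a permutation test on the Pearson correlation coefficient. This covers only one of the two association measures mentioned in the text (the information coefficient is not reproduced), and the number of permutations, the two-sided test and the function name are assumptions for illustration.

```python
import numpy as np

def permutation_pearson(latent_component, pathway_score, n_perm=1000, seed=0):
    """Pearson correlation with a two-sided permutation p-value.

    latent_component: (n_samples,) values of one latent component, e.g. z2
    pathway_score:    (n_samples,) ssGSEA score of one gene set
    """
    rng = np.random.default_rng(seed)
    observed = np.corrcoef(latent_component, pathway_score)[0, 1]
    # null distribution: shuffle the sample order of the latent component
    null = np.array([
        np.corrcoef(rng.permutation(latent_component), pathway_score)[0, 1]
        for _ in range(n_perm)
    ])
    # add-one correction keeps the p-value strictly positive
    p_value = (1 + np.sum(np.abs(null) >= abs(observed))) / (n_perm + 1)
    return observed, p_value
```

Applied to every pair of latent component and gene set, the resulting p-values would still need a multiple testing correction before the top associated pathways are reported.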

Fig 7

Characterising the MFmap learnt latent representations in glioblastoma multiforme and lower grade glioma (GBMLGG).

(A) The top heatmap shows the latent representation of TCGA tumour samples (columns). The tumour samples are ordered based on a hierarchical clustering of their latent representations, and their subtypes are colour encoded. The heatmap at the bottom displays sample-wise pathway activities that are significantly associated with the latent representations z1, …, z7. Pathway activities were computed using the ssGSEA algorithm [42]. For better visualisation, we upsampled the input data of MFmap and ssGSEA to get a balanced sample size in each subtype. (B) The MFmap reference map is formed by projecting the latent representations of bulk tumours into two dimensions using multidimensional scaling. It consists of seven dominant components represented by black nodes. The length of their connections is given by the Euclidean distance of the dominant components in the latent space. The annotation of the seven dominant nodes is based on the correlation between the latent representations and pathway activity scores (see A). The background colour encodes sample subtypes, and the background contour encodes sample density. Individual bulk tumours are displayed as dots on the MFmap reference map. (C) Cell line samples are projected to the MFmap reference map. In both (B) and (C), the subtype of bulk tumours and the predicted subtype of cell lines are colour coded. Subtype specific sample sizes for bulk tumours and cell lines are reported in the legend table.

Individual samples and their relationships can be displayed in the MFmap reference map (Fig 7(B)), a visualisation tool adapted from OncoGPS [31]. Here, the seven corners of the map correspond to the respective latent representations z1, …, z7 in GBMLGG. The corner locations are determined by multidimensional scaling on the latent representations of bulk tumours. Individual bulk tumour samples are displayed as dots in the regions between the corner points, with locations determined by a weighted vector sum of the seven corner locations (see S1 File for details). 
The subtype of each tumour sample is indicated by colours. The density of the tumour samples of a given subtype is depicted by the contour lines and the corresponding colour shading. Fig 7(B) shows that samples of the same subtype cluster together and that the inter-cluster distances are large. Projecting cell lines to the MFmap reference map (Fig 7(C)) helps to visualise the relationship between their predicted subtypes and their latent representations.
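
The placement of a sample as a weighted vector sum of corner locations can be sketched as follows. The exact weighting used by MFmap is given in S1 File and is not reproduced here; this sketch simply assumes non-negative latent components that are normalised to sum to one, and the function name `project_to_map` and the three-corner toy map are hypothetical.

```python
def project_to_map(latent, corners):
    """Place one sample on a 2D map as a weighted sum of corner locations.

    latent  : list of K non-negative latent components (assumption)
    corners : list of K (x, y) corner coordinates, e.g. obtained by
              multidimensional scaling
    """
    if len(latent) != len(corners):
        raise ValueError("need one corner per latent component")
    total = sum(latent)
    if total <= 0:
        raise ValueError("latent components must have a positive sum")
    # Normalise so the weights sum to one; the sample then lies inside the
    # polygon spanned by the corners.
    weights = [v / total for v in latent]
    x = sum(w * cx for w, (cx, _) in zip(weights, corners))
    y = sum(w * cy for w, (_, cy) in zip(weights, corners))
    return x, y


# Toy map with three corners instead of the seven used for GBMLGG.
corners = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
print(project_to_map([1.0, 0.0, 0.0], corners))  # sits on the first corner
print(project_to_map([1.0, 1.0, 2.0], corners))  # pulled towards the third corner
```

With this weighting, a sample dominated by one latent component lands close to the corresponding corner, which is why samples of the same subtype tend to gather in the same region of the map.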

Modelling cellular state transformations using latent space arithmetics

Cancerous neoplasms undergo various biochemical changes during cancer evolution and in response to selective pressure. One example is the transition from a proneural to a mesenchymal phenotype in glioblastoma, which is characterised by acquired therapeutic resistance and more aggressive potential [49]. In the DNA methylation based subtype classification of [32], the G-CIMP-high methylation phenotype tends to have the proneural molecular subtype [48] (see Fig 7(B)). Given that the latent representations learnt by MFmap clearly distinguish these different subtypes, we asked whether the generative nature of the semi-supervised VAE can also be exploited to study such cancer subtype transformations. To this end, we used the latent representations of the G-CIMP-high tumours and the Mesenchymal-like tumours (see Fig 7(B)) and computed the centroid vector of each of the two groups of tumour samples. The difference between the Mesenchymal-like centroid and the G-CIMP-high centroid was used as a latent perturbation vector. By adding this perturbation vector to the latent representation of each G-CIMP-high tumour (Fig 8(A)), we obtained the latent representations of in silico samples (Fig 8(B)), which are located in the “Mesenchymal-like region” of the reference map. We used these latent representation vectors of the in silico samples as input to the decoder of the MFmap network. We then checked whether key molecular features of real Mesenchymal-like samples are reflected by these generated samples. Based on the available biological knowledge, we focussed on the most prominent onco-markers of the G-CIMP-high subtype: the mutation status of the alpha thalassemia/mental retardation syndrome X-linked (ATRX), isocitrate dehydrogenase (IDH) and TP53 genes. The original G-CIMP-high tumours show a high propensity towards mutations in these genes, indicated by relatively higher network smoothed mutation scores (Fig 8(C)), although not all samples necessarily harbour these mutations. 
In contrast, the predicted mutation scores for the perturbed in silico samples in Fig 8(B) are much lower, indicating a lower propensity to IDH1, ATRX or TP53 mutations. This is in agreement with the low frequency of these mutations observed in Mesenchymal-like tumours [49]. This example not only highlights the good generative performance of MFmap but also hints at potential applications in the integrative analysis of cancer evolution dynamics.
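
The latent space arithmetic described in this section can be sketched in a few lines. The sketch only covers the centroid difference and the shift of the source samples; feeding the shifted vectors to the trained MFmap decoder is not shown, and the function names and the two-dimensional toy vectors are hypothetical.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length latent vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def shift_between_groups(source, target):
    """Move every source sample by the difference of the group centroids.

    Returns the perturbation vector (target centroid minus source centroid)
    and the shifted, in silico latent vectors.
    """
    c_source = centroid(source)
    c_target = centroid(target)
    delta = [t - s for s, t in zip(c_source, c_target)]
    shifted = [[v + d for v, d in zip(vec, delta)] for vec in source]
    return delta, shifted


# Toy latent vectors standing in for G-CIMP-high and Mesenchymal-like tumours.
g_cimp_high = [[2.0, 0.0], [4.0, 0.0]]
mesenchymal_like = [[0.0, 3.0], [0.0, 5.0]]

delta, in_silico = shift_between_groups(g_cimp_high, mesenchymal_like)
print(delta)      # perturbation vector
print(in_silico)  # shifted samples, now centred on the target centroid
```

Because the same vector is added to every sample, the spread of the source group is preserved while its centroid is moved onto the centroid of the target group.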

Fig 8

In-silico perturbation analysis of cellular state changes during disease transformation from the G-CIMP-high to the Mesenchymal-like subtype in glioblastoma multiforme and lower grade glioma (GBMLGG).

(A) The G-CIMP-high tumours from TCGA are projected to the MFmap reference map. (B) By perturbing the latent representation vectors of these G-CIMP-high tumours we generate artificial tumour samples located in the Mesenchymal-like region of the MFmap reference map (compare Fig 7(B)). (C) Boxplots of the sample mutation status (network smoothed mutation scores) of the marker genes IDH1, ATRX and TP53 before and after perturbation.

Discussion

Limited success in translating in vitro therapeutic markers to clinical applications highlights that not all cell lines are good models for a given cancer subtype. Selecting the most appropriate cell line for a given tumour or a set of tumours is crucial for understanding cancer biology and developing new anti-cancer treatments. Here, we provide a computational framework and a resource for cancer researchers to select the best cell lines for a TCGA tumour or a cancer subtype from ten different cancer types (http://h2926513.stratoserver.net:3838/MFmap_shiny/). The quantitative similarity score enables researchers to judge whether a given tumour or a subtype of tumours is well represented by a cell line. The assignment of cancer subtype labels to cell lines enables cell biologists to optimise experimental planning and to focus their research on clinically relevant model systems. We found that our semi-supervised MFmap model can classify tumours with very high accuracy. Further analysis of drug sensitivity profiles supports that the subtype prediction for cell lines is biologically meaningful. Our analysis shows that HER2-enriched cell lines are most sensitive to Lapatinib, in agreement with prior knowledge about the efficacy of this compound. As an example for the translation of in vitro pharmacogenomic data, we predict that the G-CIMP-low subtype is more sensitive to the new synthetic compound KHS101 than other GBMLGG subtypes. Our finding that only BRCA, GBMLGG and UCEC show significant subtype specific drug sensitivity variation merits further investigation. One important reason is the small number of cell lines representing some cancer subtypes, which prevents us from finding statistically significant variations of drug sensitivity across the different subtypes. This highlights the need to prioritise cell line development for underrepresented disease variants [21]. 
However, it cannot be ruled out that for some cancers the known subtype classifications are not predictive of drug sensitivity. This suggests that clinically relevant subtype stratification should take drug sensitivity into account. By embedding the original gene expression space, somatic mutation space and copy number space of bulk tumours and cell lines into a lower dimensional latent space, MFmap extracts latent features that are strongly associated with cancer subtypes. For COADREAD and GBMLGG, we have illustrated that the abstract latent representations can be annotated biologically using their associations with pathway activities. This makes the latent representations interpretable and allows the study of the molecular and clinical heterogeneity of these diseases. In principle, MFmap can be complemented by other modalities such as methylation or proteomics data. However, for our purpose we found that gene expression and DNA features in combination with the prior knowledge about tumour subtypes contain sufficient information. Our proof of principle analysis of the transformation between two different tumour subtypes presents a new approach for studying tumour evolutionary processes in a more integrative way [50]. The small sample size of some multi-region sequencing or single-cell sequencing studies limits the ability to infer robust evolutionary patterns. By projecting these data to the MFmap reference map obtained from training on large sets of bulk tumour data, one could deduce useful phenotypic information for individual patients. We believe that this can leverage information gathered in large cancer genomic studies like TCGA to guide personalised clinical decision making. The MFmap is based on a new semi-supervised neural network architecture combining a basic VAE with an additional classifier. 
Such semi-supervised learning tasks are very common in the biomedical research field, because it is often easier to acquire a large number of measurements than to obtain the corresponding labels. Based on the good predictive and generative performance of MFmap, together with the evidence provided here that MFmap can learn biologically and clinically meaningful information, we are convinced that the MFmap model can be adapted to other semi-supervised tasks in oncology and beyond.

Extended method details.

(PDF)

Further evaluation of the MFmap performance.

(PDF) (TXT)

Transfer Alert

This paper was transferred from another journal. As a result, its full editorial history (including decision letters, peer reviews and author responses) may not be present. 13 Sep 2021 PONE-D-21-23744 MFmap: A semi-supervised generative model matching cell lines to tumours and cancer subtypes PLOS ONE Dear Dr. Kschischo, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Please submit your revised manuscript by Oct 27 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript: A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. 
Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols. We look forward to receiving your revised manuscript. Kind regards, Tao Huang Academic Editor PLOS ONE Journal Requirements: When submitting your revision, we need you to address these additional requirements. 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf 2. Thank you for stating the following financial disclosure: “This work was supported by the FOR2800 research unit funded by the Deutsche Forschungsgemeinschaft (DFG project number 395736209).” Please state what role the funders took in the study. If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript." If this statement is not correct you must amend it as needed. Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf. 3. In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. 
PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available. For more information about our data policy, please see http://journals.plos.org/plosone/s/data-availability. Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. Any potentially identifying patient information must be fully anonymized. Important: If there are ethical or legal restrictions to sharing your data publicly, please explain these restrictions in detail. Please see our guidelines for more information on what we consider unacceptable restrictions to publicly sharing data: http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access. We will update your Data Availability statement to reflect the information you provide in your cover letter. 4. We note that you have included the phrase “data not shown” in your manuscript. Unfortunately, this does not meet our data sharing requirements. PLOS does not permit references to inaccessible data. We require that authors provide all relevant data within the paper, Supporting Information files, or in an acceptable, public repository. 
Please add a citation to support this phrase or upload the data that corresponds with these findings to a stable repository (such as Figshare or Dryad) and provide and URLs, DOIs, or accession numbers that may be used to access these data. Or, if the data are not a core part of the research being presented in your study, we ask that you remove the phrase that refers to these data. 5. Please upload a new copy of Figure 4, 6 and 7 as the detail is not clear. Please follow the link for more information: https://blogs.plos.org/plos/2019/06/looking-good-tips-for-creating-your-plos-figures-graphics/ 6. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice. [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes Reviewer #2: Yes ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes Reviewer #2: Yes ********** 3. Have the authors made all data underlying the findings in their manuscript fully available? 
The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: The paper presents a variational autoencoder based model for simultaneous classification of cancer (subtypes) and representation learning. The model is applied to two publicly available cohorts (TCGA and CCLE). The methodology is presented convincingly and results appear sound. The authors have included links to 1) a Shiny server so that other can interactively query the results of the model and 2) to a Github repo with their source code so that others can apply the model to their data. This is very valuable. 
One could perhaps get even more impact if a server running the code was set up and if datasets were provided so that others could benchmark their method against the proposed method. But this is nice to have, not need to have. So all in all this is good work. The paper can be recommended for publication more or less as is. Reviewer #2: Zhang and Kschischo propose MFmap (model fidelity map) to simultaneously predict the cancer subtype of a cell line and its similarity to an individual tumour sample. A semi-supervised generative model, MFmap is a neural network architecture combining a basic VAE with an additional classifier. The use of the classifier to serve as a regulariser controlling the capability to learn a latent representation that is cancer subtype relevant is clever and does add to interpretability. However, the restriction of the dimension of the latent representation to the number of subtypes of cancer limits the algorithm's application for biological discovery and potential denoising from unknown sources of technical variation. It would be interesting to know the effect of expanding to larger dimensionalizations; however, it is reasonable to argue that this is beyond the scope of the current work. Additionally, performance evaluations via “bake off” are not particularly informative; this work could benefit from the inclusion of a discussion of potential performance advantages over other methods. In particular, https://doi.org/10.1038/s41587-021-01001-7, https://doi.org/10.1016/j.cels.2019.04.004, and https://doi.org/10.1016/j.cels.2019.05.031 would be useful to place in the context of more widely used tools for integrated data analysis. ********** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. 
Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. 25 Oct 2021 Response to the Editor 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming... We have used the PLOS Latex template and changed the naming of the files. 2. Thank you for stating the following financial disclosure... We have added "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript." to the financial disclosure section. 3. In your Data Availability statement... We have prepared and uploaded a minimal data set. The relevant URLS of the TCGA data and the CCLE data are provided in the Supporting Information at the end of the main text. 4. We note that you have included the phrase “data not shown” in your manuscript... We provide the additional table as table S1 in the Supplemental file S1_file and refer to it in the main text. 5. 
Please upload a new copy of Figure 4, 6 and 7 as the detail is not clear... We have uploaded new copies of these figures and hope they are clear now. 6. Please review your reference list to ensure that it is complete and correct... We have checked the references. Response to the Reviewer #1: Thank you for the very positive review. Following your suggestion, we have added data to train MFmap to the Github repo to enable other researchers to compare their work to ours. Response to the Reviewer #2: Thank you for the very positive review. We have added a file S1_file.pdf as Supplemental information and refer to it in the main text. There we provide the following additional evaluations of the MFmap: • In Table S1 we show the classification accuracy for a latent dimension of 100. You can see that the classification results are not really better when we use a higher dimensional latent space. • We provide additional performance evaluations for integrative analysis. We have focused on the scenario that a reference model is trained with a large data set and a query data set (with a potential distribution shift) is fed into the reference model. We use the entropy of subtype mixing (ESM), the adjusted Rand index (ARI), the normalised mutual information (NMI) and the average silhouette width (ASW) mentioned in Lotfollahi et al. 2021 (https://www.nature.com/articles/s41587-021-01001-7) as performance measures for the consistency between reference and query. • See also the sub-section “Evaluating the MFmap classification and generative performance” in the main text, where we refer to these additional performance evaluations in S1_file.pdf. • We have cited the papers suggested by you (https://doi.org/10.1038/s41587-021-01001-7, https://doi.org/10.1016/j.cels.2019.04.004 and https://doi.org/10.1016/j.cels.2019.05.031) in the main text. Submitted filename: Response_to_Reviewers.pdf 
25 Nov 2021 MFmap: A semi-supervised generative model matching cell lines to tumours and cancer subtypes PONE-D-21-23744R1 Dear Dr. Kschischo, We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org. If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org. Kind regards, Tao Huang Academic Editor PLOS ONE Additional Editor Comments (optional): Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation. 
Reviewer #3: All comments have been addressed ********** 2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #3: Yes ********** 3. Has the statistical analysis been performed appropriately and rigorously? Reviewer #3: Yes ********** 4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #3: Yes ********** 5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #3: Yes ********** 6. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. 
(Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #3: The authors have addressed all questions raised by the reviewers. I recommend publication of the manuscript. ********** 7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #3: No 6 Dec 2021 PONE-D-21-23744R1 MFmap: A semi-supervised generative model matching cell lines to tumours and cancer subtypes Dear Dr. Kschischo: I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department. If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org. If we can help with anything else, please email us at plosone@plos.org. Thank you for submitting your work to PLOS ONE and supporting open access. Kind regards, PLOS ONE Editorial Office Staff on behalf of Dr. Tao Huang Academic Editor PLOS ONE
