Rileen Sinha, Augustin Luna, Nikolaus Schultz, Chris Sander.
Abstract
Patient-derived cell lines are often used in pre-clinical cancer research, but some cell lines are too different from tumors to be good models. Comparison of genomic and expression profiles can guide the choice of pre-clinical models, but typically not all features are equally relevant. We present TumorComparer, a computational method for comparing cellular profiles with higher weights on functional features of interest. In this pan-cancer application, we compare ∼600 cell lines and ∼8,000 tumor samples of 24 cancer types, using weights to emphasize known oncogenic alterations. We characterize the similarity of cell lines and tumors within and across cancers by using multiple data types and rank cell lines by their inferred quality as representative models. Beyond the assessment of cell lines, the weighted similarity approach is adaptable to patient stratification in clinical trials and personalized medicine.
Keywords: CCLP; TCGA; cancer genomics; cancer therapy; cell lines; decision support; oncogenic alterations; patient stratification; web application; weighted similarity
Year: 2021 PMID: 35475239 PMCID: PMC9017219 DOI: 10.1016/j.crmeth.2021.100039
Source DB: PubMed Journal: Cell Rep Methods ISSN: 2667-2375
Figure 1. TumorComparer workflow and available tumor samples and cell lines
(A) Weighted similarity between pairs of cancer samples is computed using data type-specific data matrices and weights for each molecular data point (e.g., a mutation in a specific gene). Weights are either derived from data or provided by the user, reflecting an emphasis on particular genomic alterations. The weighted similarities for each data type are then normalized and combined into a final weighted similarity score. To compare cell lines and tumors, we used mutations, CNAs, and gene expression (mRNA) values, and chose weights for these features based on the recurrence of cancer type-specific (or pan-cancer) events in sets of tumor samples, as a proxy for the likelihood of the feature to be functional (e.g., to be “drivers”); for expression, we chose the log-fold change in expression in relation to pooled normals. Top: mutated, green; wild type, white. Middle: gains, light and dark red; losses, light and dark blue; diploid, white. Bottom: overexpressed, red; underexpressed, blue.
(B) The number of TCGA tumors and CCLP cell lines for each cancer type included in this study. CCLP, Cancer Cell Line Project; BLCA, bladder urothelial carcinoma; BRCA, breast invasive carcinoma; CESC, cervical squamous cell carcinoma and endocervical adenocarcinoma; COAD, colon adenocarcinoma; DLBC, lymphoid neoplasm diffuse large B cell lymphoma; ESCA, esophageal adenocarcinoma; GBM, glioblastoma multiforme; HNSC, head and neck squamous cell carcinoma; KIRC, kidney renal clear cell carcinoma; LAML, acute myeloid leukemia; LGG, brain lower grade glioma; LIHC, liver hepatocellular carcinoma; LUAD, lung adenocarcinoma; LUSC, lung squamous cell carcinoma; OV, ovarian serous cystadenocarcinoma; PAAD, pancreatic adenocarcinoma; PRAD, prostate adenocarcinoma; READ, rectum adenocarcinoma; SKCM, skin cutaneous melanoma; STAD, stomach adenocarcinoma; TCGA, The Cancer Genome Atlas; THCA, thyroid carcinoma; UCEC, uterine corpus endometrial carcinoma.
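The workflow in Figure 1A (per-data-type weighted similarity, normalization, then combination) can be illustrated with a minimal Python sketch. This is not the TumorComparer implementation: the choice of a weighted Pearson correlation, min-max normalization, and a plain average across data types are assumptions made for illustration, and all function names are hypothetical.

```python
import math

def weighted_correlation(x, y, w):
    """Weighted Pearson correlation between two feature vectors.

    x, y: feature values for two samples (e.g., a cell line and a tumor).
    w:    non-negative feature weights (e.g., higher for recurrent alterations).
    """
    total = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    if vx == 0 or vy == 0:
        return 0.0  # assumption: treat a constant profile as uninformative
    return cov / math.sqrt(vx * vy)

def min_max(scores):
    """Rescale a list of scores to [0, 1] (assumed normalization step)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def combined_similarity(per_type_scores):
    """Average the normalized scores across data types.

    per_type_scores: dict mapping a data type name to a list of scores,
    one per cell line, in the same cell-line order for every data type.
    """
    normalized = [min_max(scores) for scores in per_type_scores.values()]
    n_lines = len(normalized[0])
    return [sum(col[i] for col in normalized) / len(normalized)
            for i in range(n_lines)]
```

For example, `combined_similarity({"mut": [0.0, 1.0], "cna": [2.0, 4.0]})` rescales each data type separately before averaging, so a data type with a wider raw score range does not dominate the combined score.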
Figure 2. Average similarities between cell lines and tumor samples by cancer subtype and data type
(A) Top left: similarity of CCLP cell lines (rows) and TCGA tumors (columns) using the mean gene expression levels of the 5,000 most variable genes across the tumors (correlation coefficient). The similarity is highest for matching cancer (sub)type in the vast majority (18 out of 24 = 75%; dark red or orange on the diagonal) of cases.
(B) Top right: similarity of CCLP cell lines (rows) and TCGA tumors (columns) using the mean copy-number changes of the 5,000 most variable genes across the tumors (correlation coefficient). Unlike for expression, only a minority (7 out of 24 = 29%) of the closest matches are for the same cancer (sub)type.
(C) Bottom left: similarity of CCLP cell lines (rows) and TCGA tumors (columns) using the mutation frequencies of the 299 most frequently mutated genes across the tumors (correlation coefficient of average mutation frequencies). Similar to CNA, only a minority of closest matches are for the same cancer type (9 out of 24 = 37.5%). The low percentages in (B) and (C) indicate that CNAs and somatic mutations contain less tissue-specific information than gene expression.
(D) Bottom right: ranked similarity between CCLP cell lines (rows) and TCGA tumors of the same cancer type, based on the similarities in the rows of (A–C) above. A top rank indicates that the cell lines of a given cancer (sub)type are on average well matched to the tumors of the same (sub)type. In several cases where the same tissue type does not provide the closest match, it is still among the top few matches, as might be expected when other related tissues are present. Although several cancer types have high similarity between cell lines and tumors when using various data types, some only have high similarity when using one or two of the three data types.
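The comparison summarized in Figure 2 (correlating the mean cell-line profile of each cancer type with the mean tumor profile of every cancer type, then ranking the matching type) can be sketched as follows. This is an illustrative sketch only; the function names and the toy data layout (dicts of per-type mean profiles) are assumptions and do not come from the paper.

```python
import math

def mean_profile(samples):
    """Column-wise mean of a list of equally long feature vectors."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]

def pearson(x, y):
    """Unweighted Pearson correlation coefficient."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: constant profiles carry no signal
    return cov / math.sqrt(vx * vy)

def matching_rank(cell_means, tumor_means, cancer_type):
    """Rank (1 = best) of the same-type tumors among all tumor types,
    by correlation with the mean cell-line profile of `cancer_type`."""
    sims = {t: pearson(cell_means[cancer_type], profile)
            for t, profile in tumor_means.items()}
    ordered = sorted(sims, key=sims.get, reverse=True)
    return ordered.index(cancer_type) + 1

def fraction_best_match(cell_means, tumor_means):
    """Fraction of cancer types whose closest tumor type is the same type."""
    hits = sum(matching_rank(cell_means, tumor_means, t) == 1
               for t in cell_means)
    return hits / len(cell_means)
```

Under these assumptions, the percentages quoted in panels (A–C) correspond to `fraction_best_match` computed per data type, and panel (D) corresponds to `matching_rank` for each cancer type.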
Figure 3. The distribution of feature-weighted similarities between ovarian cancer cell lines and ovarian (HGSOC) tumors
(A–D) (A) The overall similarity represents the average of the feature-weighted similarity over the three data types: (B) mRNA expression (with higher weights on genes with the largest expression changes); (C) mutations; and (D) copy-number alterations (with higher weights on the most recurrently altered genes in C and D). The distributions of feature-weighted similarities using mutations reveal striking differences between cell lines, with low, high, and intermediate similarities to tumors for three groups of cell lines. Cell lines such as OAW-28 and Kuramochi have a high overall similarity to tumors, as they rank highly across all three data types, whereas cell lines such as PA-1 and TYK-nu are poor matches to tumors across multiple data types in terms of genomic similarity over the genes used in the feature-weighted similarity measure.
Figure 4. The similarity of cell lines and tumors varies by gene set; the best matches might be quite different for different gene sets/pathways
The top two panels show the similarity scores of SKCM tumors and melanoma cell lines when using uniform weights on all features and genes from (A) the RTK-RAS pathway and (B) the WNT pathway. Similarly, the bottom two panels (C and D) show the corresponding scores for liver cancer cell lines compared with TCGA LIHC tumors. SKCM cell lines show similar or better similarity scores with the RTK-RAS pathway than with the WNT pathway, whereas LIHC cell lines show lower scores with the RTK-RAS pathway than with the WNT pathway. This is consistent with the frequency of alterations in the member genes of the RTK-RAS and WNT pathways in these cancer types.
Figure 5. Mean weighted similarity of CCLP cell lines to parental tumor types across 25 TCGA cancer (sub)types
Each dot is a cell line (depicting its mean similarity to matching tumors), each boxplot summarizes the mean similarity ranks of cell lines and tumors of a given cancer type, and cancer types are ordered by increasing median average weighted similarity rank. The overall similarity (A) is the average of the weighted similarity for each data type (B–D): mRNA expression, copy-number alterations, and mutations. Feature weights were chosen to emphasize the most significant recurrent mutations, copy-number alterations, and overexpression in relation to normal samples. Most tumor (sub)types have a mix of good, moderate, and poor matches to tumors among cell lines, as reflected by high, moderate, and low similarity scores, except for DLBC, THCA, PRAD, and UCEC, which have a high proportion of poor matches (with PRAD, UCEC, and THCA also having relatively few cancer cell lines in CCLP).
Figure 6. Cell lines that consistently score high/low across all data types
(A) Cell lines that rank in the top 10% for mutations, CNAs and gene expression. These might be considered good representatives of their respective tumor types.
(B) Cell lines that rank in the bottom 50% for mutations, CNAs, and gene expression. These are poor representatives of their respective tumor types. Ranks were based on weighted similarity computations when using the most variable (in tumors) genes for each data type, and feature weights emphasizing recurrent alterations (mutations, CNAs) or expression change in relation to a pool of normal samples. Circle sizes reflect the rank values.
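The selection shown in Figure 6 (cell lines that fall in the same rank band for every data type) can be sketched as a simple filter over per-data-type rankings. This is an illustrative sketch, not the authors' code; the function names, the percentile definition, and the tie handling are assumptions.

```python
def percentile_ranks(scores):
    """Map each cell line to rank position / number of cell lines.

    scores: dict of cell line name -> similarity score (higher is better).
    The best line gets 1/n and the worst gets 1.0; ties keep input order.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    n = len(ordered)
    return {name: (i + 1) / n for i, name in enumerate(ordered)}

def consistent_lines(scores_by_type, cutoff=0.10, top=True):
    """Cell lines inside the top (or bottom) `cutoff` fraction for ALL data types.

    scores_by_type: dict of data type -> {cell line -> similarity score}.
    """
    ranks = {dt: percentile_ranks(s) for dt, s in scores_by_type.items()}
    # Only consider cell lines that were scored for every data type.
    names = set.intersection(*(set(r) for r in ranks.values()))
    if top:
        keep = lambda p: p <= cutoff
    else:
        keep = lambda p: p > 1 - cutoff
    return sorted(n for n in names if all(keep(r[n]) for r in ranks.values()))
```

Calling `consistent_lines(scores, cutoff=0.10, top=True)` mirrors the criterion stated for panel (A), and `consistent_lines(scores, cutoff=0.50, top=False)` mirrors the bottom-50% criterion stated for panel (B).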
Thirty-one highly cited outlier cell lines from 11 cancer types
| Cancer type or subtype | Cell line | Citation count | Alterations in the cell line atypical of the parent cancer type | TumorComparer score notes |
|---|---|---|---|---|
| LAML | HL-60 | ∼109,000 | lacks cancer type-specific mutations and CNAs, has mutations in CDKN2A and NRAS | poor scores for mutations and CNAs |
| BRCA | MDA-MB-231 | ∼102,000 | mutations in BRAF and NF2; deep deletions in CDKN2B, CDKN2B-AS1, PTPRD | bottom 50% overall, as well as for all 3 data types |
| CESC | HELA | ∼86,000 | lacks cancer type-specific mutations and amplifications | bottom 10% by mutations |
| PRAD | DU145 | ∼42,000 | many mutations in pan-cancer genes, lacks cancer-specific amplifications, has a KRAS deep deletion | bottom 10% overall |
| LAML | KG1 | ∼16,000 | lacks cancer type-specific mutations and CNAs | bottom 10% by mutations |
| LUAD | NCI-H23 | ∼11,000 | high number of mutations in cancer-specific genes, pan-cancer mutations in DNMT3A and EEF1A1 | bottom 50% overall, as well as for all 3 data types |
| PRAD | 22Rv1 | ∼7,700 | many mutations in pan-cancer genes, lacks cancer-specific CNAs | bottom 50% overall, as well as for mutations and CNAs |
| OV | SK-OV-3 | ∼7,600 | many mutations in pan-cancer genes | bottom 10% by CNAs |
| LUAD | NCI-H522 | ∼7,000 | lacks cancer-specific CNAs | bottom 50% overall as well as for expression and CNAs |
| STAD | MKN28 | ∼6,600 | lacks cancer-specific amplifications | poor scores for mutations and CNAs |
| KIRC | CAKI-1 | ∼5,800 | lacks cancer-specific mutations and amplifications, has a CDKN2A deep deletion | outlier by mutation |
| LIHC | SK-HEP-1 | ∼5,600 | lacks cancer-specific amplifications, has many pan-cancer deep deletions | bottom 50% overall |
| OV | IGROV-1 | ∼5,000 | excessive mutation count, many mutations in pan-cancer genes | poor scores for mutations and CNAs |
| OV | ES-2 | ∼4,100 | mutations in BRAF, KMT2D, and MAP2K1 | poor score overall, and for all data types |
| CESC | C-33A | ∼4,100 | excessive mutation count, many mutations in pan-cancer genes | poor score overall, and for all data types |
| DLBC | CRO-AP2 | ∼3,500 | lacks cancer-specific mutations and amplifications | poor scores overall, as well as for mutations and CNAs |
| LUAD | NCI-H82 | ∼3,500 | | poor score overall, and for all data types |
| OV | PA-1 | ∼3,200 | lacks a TP53 mutation and cancer-specific amplifications | poor score overall, as well as for mutations and CNAs |
| OV | OVCAR-8 | ∼3,000 | lacks a TP53 mutation, has several pan-cancer mutations and amplifications | poor score by mutation |
| OV | OVCAR-5 | ∼2,900 | lacks a TP53 mutation, has a KRAS mutation as well as deep deletion | poor score overall, as well as for expression and CNAs |
| LAML | CESS | ∼2,900 | lacks cancer-specific mutations and amplifications | poor score overall, as well as for mutations |
| BLCA | UM-UC-3 | ∼2,800 | has several pan-cancer deep deletions | poor score overall, as well as for expression and mutations |
| KIRC | TK-10 | ∼2,000 | lacks cancer-specific mutations and CNAs | poor score overall, as well as for mutations and CNAs |
| KIRC | U-031 | ∼1,900 | lacks cancer-specific mutations | poor score for mutations |
| GBM | LN-229 | ∼1,900 | lacks cancer-specific mutations | poor score for mutations |
| STAD | HGC-27 | ∼1,720 | mutations in CDK12 and SMARCA4 | poor score overall, as well as for expression and CNAs |
| LUAD | SW1573 | ∼1,500 | | poor score overall, as well as for expression and mutations |
| LUAD | NCI-H1793 | ∼1,400 | | poor score overall, as well as for mutations and CNAs |
| MESO | NCI-H28 | ∼1,400 | lacks cancer-specific mutations and amplifications | poor score overall, as well as for expression and mutations |
| LUAD | A-427 | ∼1,200 | | poor score for each data type |
| LUAD | DU4475 | ∼1,100 | lacks cancer-specific mutations and amplifications | poor score overall, as well as for expression and mutations |
The genomic profiles of these cell lines are not well matched to tumors from the annotated parent cancer type. These cell lines are probably not good models for tumors. Details of alterations for each cell line are in Table S1. "Poor score overall" refers to all three data types. Citation numbers (>1,000 required) were estimated using Google Scholar as of June 2020 and have been binned to the nearest multiple of 100 for those less than 10,000 and to the nearest multiple of 1,000 for those greater than 10,000.
Abbreviations are as follows: BLCA, bladder urothelial carcinoma; BRCA, breast invasive carcinoma; CESC, cervical squamous cell carcinoma and endocervical adenocarcinoma; DLBC, lymphoid neoplasm diffuse large B cell lymphoma; GBM, glioblastoma multiforme; KIRC, kidney renal clear cell carcinoma; LAML, acute myeloid leukemia; LIHC, liver hepatocellular carcinoma; LUAD, lung adenocarcinoma; MESO, mesothelioma; OV, ovarian serous cystadenocarcinoma; PRAD, prostate adenocarcinoma; STAD, stomach adenocarcinoma.
| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---|---|---|
| The Cancer Genome Atlas (TCGA) | (synapse.org/#!Synapse:syn3241074, Data Freeze 1.3.1) | |
| COSMIC Cell Line Project (CCLP) | cancer.sanger.ac.uk/cell_lines | |
| Cell Model Passports | ||
| Cellosaurus | ||
| Data Matrices used in this paper | This Paper | Zenodo ( |
| ANNOVAR | ||
| GISTIC2 | ||
| TumorComparer | This Paper | |