Michael Skaro, Jonathan Arnold, Marcus Hill, Yi Zhou, Shannon Quinn, Melissa B Davis, Andrea Sboner, Mandi Murph.
Abstract
BACKGROUND & AIMS: Cancer metastasis into distant organs is an evolutionarily selective process. A better understanding of the forces that endow tumor seeds with proliferative plasticity in distant soils is required to develop and adapt better treatments for this lethal stage of the disease. To this end, we aimed first to use transcript expression profiling features to predict the site-specific metastases of primary tumors, and second to identify the determinants of tissue-specific progression.
Keywords: Cancer; Machine learning; Metastatic organotropism; Transcriptomic profiling
Year: 2021 PMID: 34819069 PMCID: PMC8611885 DOI: 10.1186/s12920-021-01122-7
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
Fig. 1 Classification of cancer type. The confusion matrix details sample-type-specific performance for the GBT classification of tumor transcriptomes. The model considered 33 cancer types, annotated by their four-letter TCGA codes. The scale bar on the right-hand vertical axis denotes the density for each tile: dark tiles indicate low numbers of predicted values and red/white tiles indicate high numbers. The major diagonal denotes matches between predicted and true cancer types; true labels are annotated along the left-side vertical axis and predicted labels along the horizontal axis.
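To make the layout of such a confusion matrix concrete, here is a minimal sketch in plain Python. The function name and the toy labels are hypothetical and are not taken from the study; rows are true labels and columns are predicted labels, matching the caption.

```python
from collections import Counter

def confusion_matrix(true_labels, predicted_labels):
    """Count (true, predicted) pairs; rows are true labels, columns are predicted labels."""
    labels = sorted(set(true_labels) | set(predicted_labels))
    counts = Counter(zip(true_labels, predicted_labels))
    matrix = [[counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix

# Hypothetical toy labels, not data from the study.
y_true = ["BRCA", "BRCA", "LUAD", "LUSC", "LUSC"]
y_pred = ["BRCA", "BRCA", "LUAD", "LUAD", "LUSC"]

labels, matrix = confusion_matrix(y_true, y_pred)

# Entries on the major diagonal are samples whose predicted type matches the true type.
diagonal_matches = sum(matrix[i][i] for i in range(len(labels)))
```

In this toy case one LUSC sample is predicted as LUAD, so it lands off the diagonal in the LUSC row.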
Fig. 2 Observed sites of metastatic progression in the TCGA database. Thirty-three cancers in the TCGA database have recorded RNA sequencing data. Within twenty-three projects, 125 anatomic locations have clinically annotated metastatic progression. Unique sites of metastatic progression found within the population are annotated on the vertical axis, and the four-letter cancer type codes are annotated on the horizontal axis. The heatmaps are stratified by log frequency of occurrence in the data set; the right heatmap shows the locations with the greatest frequency among all sites. COAD and READ have been combined in this section of the analysis.
Fig. 3 Prediction of site-specific metastases. Displayed are the model performance metrics for predicting site-specific metastasis. The data were classified following a train-test split in which 30% of the annotated transcriptome population was held out. The performances reported are on out-of-bag instances that were not used as synthetic templates for training. Model performances are reported on a scale of 0 to 1. Cancer type labels are the four-letter codes from the TCGA database. Total support (instances in the test set where a positive class was observed) is reported in Additional file 1: data tables.
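A 30% hold-out of this kind can be sketched as follows. This is only an illustration: the record does not say whether the split was stratified by class, so stratification is an assumption here, and the function name, seed, and toy labels are hypothetical. Under this reading, any synthetic oversampling would be applied to the training indices only, so the held-out instances never serve as templates.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx), holding out about test_fraction of each class.

    Stratifying by class is an assumption of this sketch, not a detail from the source.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 1 = metastasis observed at a given site, 0 = not observed.
labels = [1] * 10 + [0] * 20
train_idx, test_idx = stratified_holdout(labels)
```

With these toy labels, 9 of the 30 samples are held out, 3 of them from the positive class.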
Average model metrics by cancer
| TCGA-Project | Avg. Precision | Avg. Recall | Avg. F-Measure | Avg. Model accuracy |
|---|---|---|---|---|
| BLCA | 0.93 | 0.87 | 0.89 | 0.90 |
| BRCA | 0.82 | 0.80 | 0.81 | 0.81 |
| COADREAD | 0.76 | 0.76 | 0.76 | 0.75 |
| ESCA | 0.77 | 0.81 | 0.79 | 0.81 |
| HNSC | 0.86 | 0.85 | 0.85 | 0.86 |
| KIRC | 0.93 | 0.95 | 0.94 | 0.95 |
| KIRP | 0.87 | 0.89 | 0.88 | 0.89 |
| LIHC | 0.95 | 0.91 | 0.93 | 0.93 |
| LUAD | 0.76 | 0.75 | 0.75 | 0.75 |
| LUSC | 0.65 | 0.67 | 0.66 | 0.67 |
| PAAD | 0.75 | 0.77 | 0.76 | 0.77 |
| PRAD | 0.88 | 0.87 | 0.86 | 0.87 |
| SARC | 0.70 | 0.75 | 0.72 | 0.75 |
| SKCM | 0.73 | 0.79 | 0.76 | 0.79 |
| STAD | 0.73 | 0.74 | 0.74 | 0.74 |
| THCA | 0.61 | 0.61 | 0.61 | 0.61 |
Displayed are the cumulative model performance metrics aggregating all locations for each cancer type. The cancers are labeled with their four-letter TCGA codes. Model metrics reported left to right are classification precision, recall, F-measure and accuracy. Model performance variance and standard deviation are reported in Additional file 1. Positive and negative class-specific performance is reported in Additional file 1: data tables.
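For reference, the four metrics in the table can be computed from binary classification counts as sketched below. The record does not state how the per-site values were aggregated, so the unweighted mean used here is an assumption, and the function names and counts are hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F-measure and accuracy from binary confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }

def average_metrics(per_site):
    """Unweighted mean of each metric across sites (an assumed aggregation)."""
    return {key: sum(m[key] for m in per_site) / len(per_site) for key in per_site[0]}

# Hypothetical counts for two metastatic sites within one cancer type.
site_a = binary_metrics(tp=8, fp=2, fn=2, tn=8)
site_b = binary_metrics(tp=6, fp=4, fn=4, tn=6)
avg = average_metrics([site_a, site_b])
```

Note that the mean of per-site F-measures is generally not the F-measure of the mean precision and recall, which may explain small mismatches between columns in a table of averages.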
Fig. 4 Simulated and observed overrepresented GO biological processes. Gene set enrichment analysis was conducted using the clusterProfiler package in R. The Gene Ontology (GO) database was used to investigate feature enrichment in Biological Processes for each metastatic location in each cancer type that was classified by the model. The UpSet plots were generated using the UpSetR package. The bars represent the GO IDs with an adjusted p value < 0.05 after Bonferroni correction. A Simulated enrichment of randomly selected transcript features overrepresented in GO. B Enriched processes in bone metastases. C Enriched processes in liver metastases. D Enriched processes in lung metastases. E Enriched processes in lymph node metastases. Statistical significance and GO:ID enrichment results are included in Additional file 1: data tables.
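The Bonferroni step mentioned in the caption multiplies each raw p value by the number of tests and caps the result at 1. A minimal sketch follows; the study ran this in R via clusterProfiler, so this Python version is only an illustration, and the term names and p values are hypothetical placeholders.

```python
def bonferroni_adjust(p_values):
    """Bonferroni correction: multiply each p value by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical raw p values for three placeholder terms (not real GO IDs or results).
raw = {"term_a": 0.001, "term_b": 0.02, "term_c": 0.4}

adjusted = dict(zip(raw, bonferroni_adjust(list(raw.values()))))

# Keep terms whose adjusted p value is below the 0.05 threshold used in the caption.
significant = [term for term, p in adjusted.items() if p < 0.05]
```

Here `term_b` has a raw p value below 0.05 but does not survive correction across three tests.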
Fig. 5 Shared significantly overrepresented biological processes. Gene set enrichment analysis was conducted using the clusterProfiler package in R. The Gene Ontology (GO) database was used to investigate feature enrichment in Biological Processes for each metastatic location in each cancer type that was classified by the model. The simplifyEnrichment package was used to cluster shared overrepresented biological processes by semantic similarity in tumors metastasizing to concordant locations. A Enriched processes in bone metastases. B Enriched processes in liver metastases. C Enriched processes in lung metastases. D Enriched processes in lymph node metastases. Statistical significance and GO:ID enrichment results are included in Additional file 1: data tables. Similarity scores are on a scale of 0 to 1.