Yanping Xie, Brittny C Davis Lynn, Nicholas Moir, David A Cameron, Jonine D Figueroa, Andrew H Sims.
Abstract
Publicly available tumor gene expression datasets are widely reanalyzed, but it is unclear how representative they are of clinical populations. Estimations of molecular subtype classification and prognostic gene signatures were calculated for 16,130 patients from 70 breast cancer datasets. Collated patient demographics and clinical characteristics were sparse for many studies. Considerable variations were observed in dataset size, patient/tumor characteristics, and molecular composition. Results were compared with Surveillance, Epidemiology, and End Results Program (SEER) figures. The proportion of basal subtype tumors ranged from 4 to 59%. Date of diagnosis ranged from 1977 to 2013, originating from 20 countries across five continents although European ancestry dominated. Publicly available breast cancer gene expression datasets are a great resource, but caution is required as they tend to be enriched for high grade, ER-negative tumors from European-ancestry patients. These results emphasize the need to derive more representative and annotated molecular datasets from diverse populations.
Keywords: Cancer epidemiology; Cancer genomics
Year: 2020 PMID: 32885043 PMCID: PMC7447772 DOI: 10.1038/s41523-020-00180-x
Source DB: PubMed Journal: NPJ Breast Cancer ISSN: 2374-4677
Fig. 1. Molecular subtypes of breast cancer are highly variable across publicly available datasets.
Distributions of PAM50 intrinsic subtypes assigned by the genefu package across 70 datasets. The median proportion of luminal A tumors was 25% and that of luminal B tumors was 31%. Some datasets were dominated by specific subtypes, and some completely lacked normal-like breast tumors.
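As an illustration of how a per-dataset subtype distribution and a cross-dataset median might be tabulated, here is a minimal Python sketch. It assumes subtype calls are already available as plain labels. The label names, the `subtype_proportions` helper, and the toy datasets are hypothetical and are not taken from the study or from the genefu package.

```python
from collections import Counter
from statistics import median

# Hypothetical label set standing in for PAM50 subtype calls.
SUBTYPES = ["LumA", "LumB", "Her2", "Basal", "Normal"]


def subtype_proportions(calls):
    """Return the fraction of samples assigned to each subtype in one dataset."""
    if not calls:
        raise ValueError("dataset has no subtype calls")
    counts = Counter(calls)
    total = len(calls)
    return {s: counts.get(s, 0) / total for s in SUBTYPES}


# Toy data for illustration only; these are not figures from the study.
toy_datasets = {
    "dataset_A": ["LumA"] * 5 + ["LumB"] * 3 + ["Basal"] * 2,
    "dataset_B": ["LumA"] * 2 + ["LumB"] * 4 + ["Her2"] * 2 + ["Basal"] * 2,
    "dataset_C": ["LumA"] * 3 + ["LumB"] * 3 + ["Basal"] * 3 + ["Normal"] * 1,
}

per_dataset = {name: subtype_proportions(calls)
               for name, calls in toy_datasets.items()}

# Median proportion of one subtype across datasets.
median_lumA = median(p["LumA"] for p in per_dataset.values())

# Datasets with no samples called as a given subtype.
lacking_normal = [name for name, p in per_dataset.items() if p["Normal"] == 0]
```

Taking the median across datasets, rather than pooling all samples, keeps a single large dataset from dominating the summary.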
Fig. 2. Publicly available breast cancer gene expression datasets recapitulate some epidemiological trends, but have reduced proportions of ER+ and grade 1 tumors compared with Western populations.
The distribution of molecular subtypes by age (a) and the association between BMI and molecular predictions of poor outcomes (b) are as would be expected. However, Asians older than 50 appear to have worse predicted prognosis than other races (c), but this is likely confounded by other factors. The boxes represent the upper and lower quartile ranges, the horizontal line the median, and the whiskers 1.5× the interquartile range. Incidence rates for ER-positive tumors progressively increase over time, but the proportion remains significantly lower than that reported by SEER (d); vertical bars represent 95% confidence intervals. Grade 3 tumors were most abundant in publicly available datasets for the 1990s, unlike SEER figures, which show increasing proportions of grade 1 tumors over the last three decades (e).
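The plot elements described in this caption can be computed as in the following minimal Python sketch. Two assumptions are made that the caption does not state: whiskers are drawn to the most extreme observation within 1.5× the interquartile range of the box, and the 95% confidence interval for a proportion uses the normal (Wald) approximation. The function names `box_summary` and `wald_ci` are hypothetical.

```python
import math
from statistics import quantiles


def box_summary(values):
    """Quartiles and whisker ends for a box plot using the 1.5 x IQR rule."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Assumption: whiskers stop at the most extreme points inside the fences.
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
    }


def wald_ci(successes, n, z=1.96):
    """Approximate 95% CI for a proportion (normal approximation, an assumption)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)
```

For example, `box_summary([1, 2, 3, 4, 5, 6, 7, 8, 100])` treats 100 as lying beyond the upper whisker, and `wald_ci(50, 100)` gives an interval centered on 0.5. For small samples or proportions near 0 or 1, an interval such as Wilson's is usually preferable to the Wald approximation.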