Miriam Lohr, Birte Hellwig, Karolina Edlund, Johanna S M Mattsson, Johan Botling, Marcus Schmidt, Jan G Hengstler, Patrick Micke, Jörg Rahnenführer.
Abstract
The comprehensive transcriptomic analysis of clinically annotated human tissue has found widespread use in oncology, cell biology, immunology, and toxicology. In cancer research, microarray-based gene expression profiling has successfully been applied to subclassify disease entities, predict therapy response, and identify cellular mechanisms. Public accessibility of raw data, together with corresponding information on clinicopathological parameters, offers the opportunity to reuse previously analyzed data and to gain statistical power by combining multiple datasets. However, results and conclusions depend on the reliability of the available information. Here, we propose gene expression-based methods for identifying sample misannotations in public transcriptomic datasets. Sample mix-up can be detected by a classifier that differentiates between samples from male and female patients. Correlation analysis identifies multiple measurements of material from the same sample. The analysis of 45 datasets (including 4913 patients) revealed that erroneous sample annotation, affecting 40 % of the analyzed datasets, may be a more widespread phenomenon than previously thought. Removal of erroneously labelled samples may influence the results of the statistical evaluation in some datasets. Our methods may help to identify individual datasets that contain numerous discrepancies and could be routinely included in the statistical analysis of clinical gene expression data.
Keywords: Gene expression; Male–female classifier; Microarray; Misannotation; Quality control
Year: 2015 PMID: 26608184 PMCID: PMC4673097 DOI: 10.1007/s00204-015-1632-4
Source DB: PubMed Journal: Arch Toxicol ISSN: 0340-5761 Impact factor: 5.153
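The abstract states that correlation analysis identifies multiple measurements of material from the same sample. A minimal sketch of that idea follows, assuming each sample is a list of expression values over the same probe sets in the same order. The function names, the use of Pearson correlation, and the threshold of 0.99 are illustrative assumptions and are not taken from the article.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = sqrt(var_x * var_y)
    # A constant profile has no defined correlation; treat it as 0.
    return cov / denom if denom > 0 else 0.0


def find_candidate_duplicates(profiles, threshold=0.99):
    """Return (id_a, id_b, r) for every sample pair with r >= threshold.

    `profiles` maps a sample ID to its expression profile.
    The threshold is a hypothetical value chosen for illustration only.
    """
    flagged = []
    for (id_a, a), (id_b, b) in combinations(sorted(profiles.items()), 2):
        r = pearson(a, b)
        if r >= threshold:
            flagged.append((id_a, id_b, r))
    return flagged


# Toy data: s1 and s2 are nearly identical, s3 is unrelated.
profiles = {
    "s1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "s2": [1.01, 2.0, 3.02, 3.99, 5.0],
    "s3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
duplicates = find_candidate_duplicates(profiles)
```

Flagged pairs are only candidates: in practice a suitable threshold would have to be chosen per dataset, for example by inspecting the distribution of all pairwise correlations.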
Overview of analyzed datasets
| Type | Cohorts | Sample size (total) |
|---|---|---|
| Non-small cell lung cancer | GSE37745, Shedden, GSE31547, GSE29013, GSE14814, GSE4573, GSE31210, GSE19188, GSE31546, GSE10445 | 1338 |
| Colon cancer | GSE33113, GSE12945, GSE31595, GSE4271, GSE14333, GSE17536, GSE17537 | 769 |
| Other cancer | GSE5720, GSE4107, GSE42952, GSE34111, GSE31684 | 200 |
| Non-cancer | GSE19027, GSE17913, GSE23343, GSE25462, GSE7821, GSE20950, GSE24427 | 408 |
| Breast cancer | GSE11121, GSE2034, TRANSBIG (GSE7390/GSE6532), GSE16446, GSE20194, GSE20271, GSE22093, GSE23988 | 1373 |
| Ovarian cancer | Bild, GSE14764, GSE19829, GSE26712 | 426 |
| Prostate cancer | GSE17951, GSE25136, GSE3325, GSE8218 | 399 |
Tissue collections and gene array datasets analyzed by the male–female classifier, identified by their Gene Expression Omnibus (GEO) Series (GSE) number where available
Detailed description of analyzed datasets
| Cohort | # Female | # Male | # Total | Type (disease or subject of study) |
|---|---|---|---|---|
| GSE37745 | 89 | 107 | 196 | NSCLC |
| Shedden | 220 | 223 | 443 | NSCLC |
| GSE31547 | 36 | 14 | 50 | NSCLC + controls |
| GSE29013 | 17 | 38 | 55 | NSCLC |
| GSE14814 | 23 | 67 | 90 | NSCLC |
| GSE4573 | 47 | 82 | 129 | NSCLC |
| GSE31210 | 109 | 95 | 204 | NSCLC |
| GSE19188 | 23 | 59 | 82 | NSCLC |
| GSE31546 | 14 | 3 | 17 | NSCLC |
| GSE10445 | 16 | 56 | 72 | NSCLC |
| GSE4107 | 12 | 10 | 22 | Colorectal cancer |
| GSE33113 | 48 | 42 | 90 | Colorectal cancer |
| GSE31595 | 22 | 15 | 37 | Colorectal cancer |
| GSE12945 | 28 | 34 | 62 | Colorectal cancer |
| GSE14333 | 106 | 120 | 226 | Colorectal cancer |
| GSE17536 | 81 | 96 | 177 | Colorectal cancer |
| GSE17537 | 29 | 26 | 55 | Colorectal cancer |
| GSE4271 | 32 | 68 | 100 | Other cancer: glioma |
| GSE31684 | 25 | 68 | 93 | Other cancer: bladder |
| GSE34111 | 6 | 24 | 30 | Other cancer: gastrointestinal |
| GSE5720 | 24 | 30 | 54 | Other cancer: 9 different tissues |
| GSE42952 | 9 | 14 | 23 | Other cancer: pancreatic |
| GSE19027 | 11 | 48 | 59 | Bronchial epithelium of (non-) smokers with and without lung cancer |
| GSE17913 | 38 | 40 | 78 | Smoking |
| GSE23343 | 7 | 10 | 17 | Insulin resistance/type 2 diabetes |
| GSE25462 | 28 | 22 | 50 | Insulin resistance/type 2 diabetes |
| GSE7821 | 28 | 12 | 40 | Healthy twins |
| GSE20950 | 27 | 12 | 39 | Insulin resistance/obesity |
| GSE24427 | 80 | 45 | 125 | Multiple sclerosis |
| GSE11121 | 200 | 0 | 200 | Breast cancer |
| GSE2034 | 286 | 0 | 286 | Breast cancer |
| TRANSBIG (GSE7390/GSE6532) | 280 | 0 | 280 | Breast cancer |
| GSE16446 | 114 | 0 | 114 | Breast cancer; chemo response |
| GSE20194 | 247 | 0 | 247 | Breast cancer; chemo response |
| GSE20271 | 139 | 0 | 139 | Breast cancer; chemo response |
| GSE22093 | 47 | 0 | 47 | Breast cancer; chemo response |
| GSE23988 | 60 | 0 | 60 | Breast cancer; chemo response |
| Bild | 133 | 0 | 133 | Ovarian cancer |
| GSE14764 | 80 | 0 | 80 | Ovarian cancer |
| GSE19829 | 28 | 0 | 28 | Ovarian cancer |
| GSE26712 | 185 | 0 | 185 | Ovarian cancer |
| GSE17951 | 0 | 153 | 153 | Prostate cancer |
| GSE25136 | 0 | 79 | 79 | Prostate cancer |
| GSE3325 | 0 | 19 | 19 | Prostate cancer |
| GSE8218 | 0 | 148 | 148 | Prostate cancer |
Overview of the studied tissue collections and gene array data
Probe sets included in the male–female classifier
| Affymetrix ID | Gene | Chromosome | Cut point (99 % quantile) | Evidence (male/female) |
|---|---|---|---|---|
| 221728_x_at | XIST | X | >0.389 | Female |
| 214218_s_at | XIST | X | >0.385 | Female |
| 201909_at | RPS4Y1 | Y | >0.431 | Male |
| 205000_at | DDX3Y | Y | >0.276 | Male |
Probe sets included in the male–female classifier, with the corresponding cut points above which a probe set provides evidence that a sample originates from a male or a female
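The cut points in the table above can be turned into a simple rule-based check. The following is a minimal sketch, not the authors' implementation: the decision rule (a probe set counts as evidence for its sex when its value exceeds the cut point, and a sample is called only when all evidence points to one sex) is an assumption, as are the function names. Expression values are assumed to be on the normalized scale for which the cut points were derived.

```python
# Cut points copied from the probe set table; gene names follow the
# figure captions (XIST on X, RPS4Y1 and DDX3Y on Y).
CUT_POINTS = {
    "221728_x_at": (0.389, "female"),  # XIST
    "214218_s_at": (0.385, "female"),  # XIST
    "201909_at": (0.431, "male"),      # RPS4Y1
    "205000_at": (0.276, "male"),      # DDX3Y
}


def predict_sex(expression):
    """Return 'female', 'male', or 'unconfident' for one sample.

    Assumed rule (not from the source): collect the sex indicated by every
    probe set whose value exceeds its cut point; call the sample only if
    exactly one sex is indicated.
    """
    evidence = {
        label
        for probe, (cut, label) in CUT_POINTS.items()
        if expression[probe] > cut
    }
    if len(evidence) == 1:
        return next(iter(evidence))
    return "unconfident"  # no evidence at all, or conflicting evidence


def compare_with_annotation(expression, annotated_sex):
    """Map a sample to the three categories used in the figures."""
    predicted = predict_sex(expression)
    if predicted == "unconfident":
        return "unconfident"
    if predicted == annotated_sex:
        return "correctly classified"
    return "misclassified"


# Hypothetical normalized values for one sample annotated as male.
sample = {
    "221728_x_at": 0.90,
    "214218_s_at": 0.80,
    "201909_at": -0.50,
    "205000_at": -0.40,
}
status = compare_with_annotation(sample, "male")
```

In this toy example both XIST probe sets exceed their cut points and neither Y-linked probe set does, so the sample is predicted female and its male annotation is flagged.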
Fig. 1 Differentiation between male and female samples by XIST expression. Bean plots of the expression values of probe set 221728_x_at (XIST) in the NSCLC cohort GSE31210. A clear separation between low expression values in males (blue) and high expression values in females (red) can be observed. One sample is mislabelled
Fig. 2 Improvement in comparability of cohorts by normalization. a Raw expression values of probe set 201909_at (RPS4Y1) for samples labelled female (red) and male (blue) across all datasets. b The same cohorts after normalization. Specifically, two outliers in datasets TRANSBIG and GSE22093 indicate two breast cancer patients with high RPS4Y1 expression, a feature clearly inconsistent with female sex
Fig. 3 Application of the male–female classifier to all cohorts, grouped by cancer type. Green “correctly classified,” red “misclassified,” and orange “unconfident” samples
Fig. 4 Visualization of the male–female classifier with mean expression values of the two probe sets for XIST on the x-axis and of DDX3Y and RPS4Y1 on the y-axis. The points represent individual patients. The point clouds on the left and right are characteristic for males and females, respectively. Colors indicate the classification result for each sample: green “correctly classified,” red “misclassified,” and orange “unconfident.” a Results for the Uppsala cohort (GSE37745): one female patient is clearly mislabeled as male, and two samples are labeled “unconfident.” b Results for GSE33113 with clear discrimination between males and females and no sex misannotations. c Results for GSE5720 with two misclassified samples and a large number of samples classified as “unconfident.” d Results for a breast cancer dataset (TRANSBIG) with one male patient assigned to the category “misclassified”
Results of univariate Cox models
| Dataset | No. of patients | No. of misannotations and duplications | No. of significant genes (original scenario) | Percentage of genes no longer significant after removal of the misannotated samples | Percentage of genes newly significant after removal of the misannotated samples |
|---|---|---|---|---|---|
| GSE37745 | 196 | 3 | 450 | 12.22 | 14.00 |
| Shedden | 443 | 14 | 1354 | 15.66 | 8.79 |
| GSE29013 | 55 | 1 | 419 | 15.51 | 14.32 |
| GSE4573 | 129 | 5 | 189 | 26.63 | 38.62 |
| GSE31547 | 50 | 1 | 318 | 50.51 | 23.27 |
| GSE19188 | 82 | 8 | 190 | 53.16 | 34,374 |
Results of univariate Cox models for six NSCLC datasets. Comparison between significant genes (p < 0.01) identified in the original cohort and significant genes identified in the reduced cohort after removal of misannotated and duplicated samples
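The two percentage columns of the table above compare the set of significant genes before and after removing flagged samples. A minimal sketch of that comparison follows, assuming both percentages are expressed relative to the number of genes significant in the original scenario (an assumption; the table does not state the denominator). The function name and gene labels are hypothetical, and fitting the Cox models themselves is outside the scope of the sketch.

```python
def significance_turnover(original_genes, reduced_genes):
    """Compare significant gene sets before and after sample removal.

    Both percentages use the size of the original set as denominator,
    which is an assumption made for this illustration.
    """
    original, reduced = set(original_genes), set(reduced_genes)
    if not original:
        raise ValueError("original gene set must not be empty")
    lost = original - reduced    # significant before, not after
    gained = reduced - original  # significant only after removal
    return {
        "n_original": len(original),
        "pct_no_longer_significant": 100 * len(lost) / len(original),
        "pct_newly_significant": 100 * len(gained) / len(original),
    }


# Hypothetical gene labels: C and D drop out, E is newly significant.
turnover = significance_turnover({"A", "B", "C", "D"}, {"A", "B", "E"})
```

With four originally significant genes, losing two and gaining one gives 50 % no longer significant and 25 % newly significant.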