Lilah Toker, Min Feng, Paul Pavlidis.
Abstract
Concern about the reproducibility and reliability of biomedical research has been rising. An understudied issue is the prevalence of sample mislabeling, one impact of which would be invalid comparisons. We studied this issue in a corpus of human transcriptomics studies by comparing the provided annotations of sex to the expression levels of sex-specific genes. We identified apparent mislabeled samples in 46% of the datasets studied, yielding a 99% confidence lower-bound estimate for all studies of 33%. In a separate analysis of a set of datasets concerning a single cohort of subjects, 2/4 had mislabeled samples, indicating laboratory mix-ups rather than data recording errors. While the number of mixed-up samples per study was generally small, because our method can only identify a subset of potential mix-ups, our estimate is conservative for the breadth of the problem. Our findings emphasize the need for more stringent sample tracking, and that re-users of published data must be alert to the possibility of annotation and labelling errors.
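To make the step from an observed proportion to a lower-bound estimate concrete, here is a minimal sketch of a one-sided exact (Clopper-Pearson) binomial lower bound, applied to the 32 of 70 datasets with mismatched samples reported in the summary table. The abstract does not say which interval method the authors used, so the choice of method and the function names `upper_tail` and `exact_lower_bound` are assumptions made for illustration; the result need not reproduce the reported 33% exactly.

```python
from math import comb


def upper_tail(x, n, p):
    """P(X >= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(x, n + 1))


def exact_lower_bound(x, n, confidence=0.99, tol=1e-10):
    """One-sided Clopper-Pearson lower bound on a binomial proportion.

    Finds, by bisection, the proportion p at which observing x or more
    successes out of n would have probability 1 - confidence.
    Assumption: this is an illustrative method, not necessarily the one
    used in the source.
    """
    if x == 0:
        return 0.0
    alpha = 1 - confidence
    lo, hi = 0.0, x / n  # the bound always lies below the observed proportion
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if upper_tail(x, n, mid) < alpha:
            lo = mid  # mid is too small to explain the data; move up
        else:
            hi = mid
    return lo


# 32 of 70 datasets contained mismatched samples (from the summary table).
lower = exact_lower_bound(32, 70)
print(f"observed {32 / 70:.1%}, 99% one-sided lower bound {lower:.1%}")
```

Other interval constructions (Wilson, normal approximation) give slightly different bounds, which is one possible reason a recomputed value could differ from the published figure by a percentage point or so.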
Keywords: Transcriptomics; data quality; gene expression; misannotation; mislabeling; reproducibility
Year: 2016 PMID: 27746907 PMCID: PMC5034794 DOI: 10.12688/f1000research.9471.2
Source DB: PubMed Journal: F1000Res ISSN: 2046-1402
Figure S1. Disagreement between gene-based and annotated sex in three datasets participating in the meta-analysis of Parkinson’s disease [1, 2].
Santiago and Potashkin included four datasets in their meta-analysis. When available (three out of the four datasets), we used sample characteristics provided in the associated manuscripts to identify mislabelled samples. Gene-based males are defined by high RPS4Y1 and low XIST expression. Cont – control subjects (green), PD – Parkinson’s disease (orange). The numbers of females (F) and males (M) reported in the original manuscript are given in brackets. Both XIST and RPS4Y1 were present in dataset GSE22491, but only RPS4Y1 was present in GSE18838. (a) Based on the expression of the sex genes, dataset GSE22491 contains at least two mislabelled samples. Of note, in the pooled sample (indicated by an arrow), which contains equal amounts of male and female material, the two genes are expressed at similar levels. (b) This is the only dataset for which the sex of individual samples was available on GEO. Red – GEO annotated females, blue – GEO annotated males. Based on the manuscript’s sample characteristics there should be 8F, 3M controls and 2F, 15M PD. However, the metadata provided on GEO describe 5F, 6M controls and 4F, 13M PD. Both of these annotations disagree with the gene-based sex of the samples (Cont – 8F, 3M; PD – 5F, 12M).
Summary of discrepancies between the gene expression-based and annotated sex in human microarray datasets.
Unclassified samples are samples whose classification by k-means clustering disagrees with their classification by the median expression of the sex-specific probesets. Datasets were considered “correctly annotated” only if they contained neither mismatched nor unclassified samples. Eight of the datasets contained both mismatched and unclassified samples.
| | All datasets | Non-cancer datasets | Cancer datasets | All samples | Non-cancer samples | Cancer samples |
|---|---|---|---|---|---|---|
| Correctly annotated | 31 (44%) | 29 (53%) | 2 (13%) | 4043 (97%) | 2868 (98%) | 1175 (96%) |
| Mismatched | 32 (46%) | 24 (44%) | 8 (53%) | 83 (2%) | 58 (1.97%) | 25 (2.04%) |
| Unclassified | 15 (21%) | 7 (13%) | 8 (53%) | 34 (0.8%) | 11 (0.4%) | 23 (1.9%) |
| Total | 70 | 55 | 15 | 4160 | 2937 | 1223 |
Figure 1. Representative plots showing expression levels of sex-specific probesets.
Expression levels of probesets representing the XIST (red), KDM5D (black) and RPS4Y1 (blue) genes. “MetaFemale” and “MetaMale” indicate the metadata-annotated sex of the samples, with their total number in brackets. The “M” and “F” along the X axis indicate the gene-based sex of the samples, as determined by k-means clustering. Log2-transformed expression levels are plotted. (a) Representative dataset with no mismatched samples. (b) Representative dataset with two mismatched samples (highlighted with grey bars). Gene-based sex that contradicts the annotated sex of the sample is highlighted in bold at the bottom.
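The gene-based sex call described in these captions can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' pipeline. The per-sample score (mean male-probe expression minus mean XIST expression), the one-dimensional two-cluster k-means, the form of the median rule, and all function names and expression values are assumptions. Only the gene names, the use of k-means, and the definitions of "mismatched" and "unclassified" come from the text.

```python
from statistics import mean, median


def two_means_1d(values, max_iter=100):
    """Two-cluster k-means on 1-D data; True marks the higher cluster."""
    lo, hi = min(values), max(values)  # start the centres at the extremes
    labels = [False] * len(values)
    for _ in range(max_iter):
        labels = [abs(v - hi) < abs(v - lo) for v in values]
        high = [v for v, is_high in zip(values, labels) if is_high]
        low = [v for v, is_high in zip(values, labels) if not is_high]
        if not high or not low:
            break  # degenerate case: all values in one cluster
        new_lo, new_hi = mean(low), mean(high)
        if (new_lo, new_hi) == (lo, hi):
            break  # centres stopped moving
        lo, hi = new_lo, new_hi
    return labels


def gene_based_sex(expr, female_probes, male_probes):
    """Return 'M', 'F' or 'unclassified' for each sample.

    expr maps sample id -> {probe name: log2 expression}.
    """
    ids = list(expr)
    # Assumed score: male-specific minus female-specific expression.
    scores = [
        mean(expr[s][p] for p in male_probes)
        - mean(expr[s][p] for p in female_probes)
        for s in ids
    ]
    calls = {}
    for s, kmeans_male in zip(ids, two_means_1d(scores)):
        # Assumed median rule: male if male probes exceed female probes.
        median_male = median(expr[s][p] for p in male_probes) > median(
            expr[s][p] for p in female_probes
        )
        if kmeans_male != median_male:
            calls[s] = "unclassified"  # the two methods disagree
        else:
            calls[s] = "M" if kmeans_male else "F"
    return calls


def compare_to_annotation(calls, annotated):
    """Label each sample 'match', 'mismatched' or 'unclassified'."""
    out = {}
    for s, call in calls.items():
        if call == "unclassified":
            out[s] = "unclassified"
        else:
            out[s] = "match" if call == annotated[s] else "mismatched"
    return out


# Made-up log2 expression values for four hypothetical samples.
expr = {
    "s1": {"XIST": 10.0, "RPS4Y1": 4.0, "KDM5D": 4.0},
    "s2": {"XIST": 4.0, "RPS4Y1": 11.0, "KDM5D": 10.0},
    "s3": {"XIST": 4.5, "RPS4Y1": 10.5, "KDM5D": 10.0},
    "s4": {"XIST": 9.5, "RPS4Y1": 4.2, "KDM5D": 4.1},
}
annotated = {"s1": "F", "s2": "M", "s3": "F", "s4": "F"}

calls = gene_based_sex(expr, ["XIST"], ["RPS4Y1", "KDM5D"])
result = compare_to_annotation(calls, annotated)
print(result)
```

In this toy input, sample `s3` is annotated female but expresses the Y-linked genes at a high level and XIST at a low level, so it is reported as mismatched. Clustering within a dataset, rather than applying a fixed expression threshold, avoids assuming that expression scales are comparable across platforms.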
Figure 2. Gene-based and metadata-based sex in four datasets of similar subjects from the Stanley Array Collection.
The heatmap represents z-transformed expression values of the KDM5D, RPS4Y1 and XIST probesets in four microarray datasets from the Stanley Array Collection cohort of subjects. The datasets are designated Study1 AltarA, Study3 Bahn, Study5 Dobrin and Study7 Kato, corresponding to their names on the Stanley collection site. Each column represents a subject and each row represents a probeset. The four studies are indicated by the color bar on the left side of the heatmap. The gene names corresponding to each probeset are indicated by the color bar on the right side of the heatmap. Three of the studies (AltarA, Bahn and Kato) were performed on the GPL96 platform, on which XIST is represented by two probesets. The Dobrin dataset is on the GPL570 platform, which contains 5 additional XIST probesets, one of which was removed from the analysis. The annotated sex of each subject (metadata gender) is represented by the top color bar (females – pink, males – purple). Missing samples (samples that were excluded from the original studies) are shown in grey. Arrows point to the mismatched samples.