
Highlight report: Erroneous sample annotation in a high fraction of publicly available genome-wide expression datasets.

Year: 2015 PMID: 26862323 PMCID: PMC4743481 DOI: 10.17179/excli2015-760

Source DB: PubMed Journal: EXCLI J ISSN: 1611-2156 Impact factor: 4.068

Recently, Lohr et al. published a method that identifies sample annotation errors in gene expression data (Lohr et al., 2015[18]). Surprisingly, 40 % of 45 analyzed publicly available datasets, comprising 4913 patients, were affected by erroneous sample annotation. The authors conclude that sample annotation errors may be more widespread than previously expected (Lohr et al., 2015[18]).

The authors used two strategies to identify sample mix-ups. First, a classifier was established that differentiates between samples from female and male patients. This classifier is based on the X-chromosomal gene XIST and the Y-chromosomal genes RPS4Y1 and DDX3Y (Lohr et al., 2015[18]). In datasets with similar numbers of males and females, approximately half of all sample mix-ups will result in sex mislabeling. A further possible error is sample duplication, where the same sample is analyzed twice and the duplicate is erroneously labeled as belonging to another patient (Lohr et al., 2015[18]). To identify such duplications, a correlation-based strategy was used. A strength of the techniques presented by Lohr et al. is that they include normalization steps which make it possible to apply the same algorithm to samples from all datasets. The algorithm then differentiates between 'correctly classified' and 'misclassified' samples. Of the 45 publicly available cohorts analyzed, 18 contained at least one misclassification. The authors also show that deleting the erroneous samples can strongly influence the number of statistically significant prognostic genes.

Currently, genome-wide data are frequently used in cancer research (Stock et al., 2015[33]; Sicking et al., 2014[28]; Cadenas et al., 2014[4]; Mattson et al., 2015[21]). Intensively studied fields are breast and ovarian cancer (Siggelkow et al., 2012[29]; Godoy et al., 2014[11]; Stewart et al., 2012[31]; Schmidt et al., 2012[24]).
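The sex-based check can be illustrated with a minimal sketch. It is not the published classifier of Lohr et al.: the scoring rule (XIST minus the mean of RPS4Y1 and DDX3Y), the threshold of 0, the helper names and all expression values are assumptions chosen for illustration, and the normalization steps of the original method are omitted.

```python
from statistics import mean

# Hypothetical log2 expression values for three samples (illustration only).
samples = {
    "P1": {"annotated_sex": "female", "XIST": 10.2, "RPS4Y1": 3.1, "DDX3Y": 2.8},
    "P2": {"annotated_sex": "male", "XIST": 3.0, "RPS4Y1": 11.4, "DDX3Y": 9.7},
    "P3": {"annotated_sex": "female", "XIST": 2.9, "RPS4Y1": 11.0, "DDX3Y": 9.9},
}


def predict_sex(expr, threshold=0.0):
    """Assumed rule: high XIST relative to the Y-linked genes suggests female."""
    score = expr["XIST"] - mean([expr["RPS4Y1"], expr["DDX3Y"]])
    return "female" if score > threshold else "male"


def find_sex_mismatches(samples):
    """Return IDs of samples whose predicted sex contradicts the annotation."""
    return [
        sample_id
        for sample_id, expr in samples.items()
        if predict_sex(expr) != expr["annotated_sex"]
    ]


mismatches = find_sex_mismatches(samples)
print(mismatches)  # ['P3']
```

In this toy example, sample P3 is annotated as female but shows low XIST and high Y-linked expression, so it is flagged for manual review rather than automatically relabeled.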
It can be expected that cohorts containing samples from only females or only males have a lower risk of sex mislabeling. Therefore, it was surprising that a mislabeled patient was also identified in a breast cancer cohort. The well-known TRANSBIG cohort contains one patient annotated as a female with node-negative breast cancer who in reality is a man (Lohr et al., 2015[18]). Besides their intensive use in cancer research (Micke et al., 2014[22]; Schmidt et al., 2008[24]; Botling et al., 2013[3]), genome-wide expression data are also frequently used in toxicology (Campos et al., 2014[5]; Stöber et al., 2014[32]; Marchan, 2014[20][19]; Bolt, 2013[1]; Song et al., 2013[30]; Godoy et al., 2013[12]; Bolt et al., 2010[2]). The goal of these studies is to obtain first evidence of the mechanism of action of chemicals (Glahn et al., 2008[10]; Shimada et al., 2010[26]; Dika Nguea et al., 2008[6]; Hendrickx et al., 2014[15]; Shinde et al., 2015[27]; Yao and Costa, 2014[34]; Gagné et al., 2013[9]; Fang et al., 2014[8]; Kim et al., 2012[16]) or to establish classifiers of co-regulated gene clusters (Grinberg et al., 2014[14]; Shao et al., 2014[25]; Krug et al., 2013[17]; Godoy et al., 2015[13]; Rempel et al., 2015[23]; Doktorova et al., 2012[7]). In these datasets the correlation-based classifier for sample duplication may be helpful. In conclusion, the easy-to-use classifiers published by Lohr and colleagues (2015[18]) should be routinely included in the analysis of gene array as well as RNA-seq data to reduce the number of erroneous sample annotations.
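The correlation-based search for duplicated samples can be sketched in a similar way. This is only an illustration under stated assumptions, not the algorithm of Lohr et al.: the use of the Pearson correlation, the cutoff of 0.99, the function names and the expression profiles are all hypothetical, and the normalization steps of the published method are not reproduced.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical expression profiles (one value per gene) for three patient labels.
profiles = {
    "A": [5.1, 7.3, 2.2, 9.8, 4.4, 6.0],
    "B": [6.5, 3.1, 8.0, 4.2, 7.7, 2.9],
    "C": [5.0, 7.4, 2.3, 9.7, 4.5, 6.1],  # almost identical to A
}


def find_duplicate_candidates(profiles, cutoff=0.99):
    """Return label pairs whose profiles correlate above the assumed cutoff."""
    return [
        (a, b)
        for a, b in combinations(sorted(profiles), 2)
        if pearson(profiles[a], profiles[b]) > cutoff
    ]


candidates = find_duplicate_candidates(profiles)
print(candidates)  # [('A', 'C')]
```

With real genome-wide data, profiles from different patients also correlate strongly, so a suitable cutoff would have to be derived from the distribution of pairwise correlations in the dataset rather than fixed in advance.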

33 in total

1. Gene array screening for identification of drugs with low levels of adverse side effects.

Authors: Hermann M Bolt; Rosemarie Marchan; Jan G Hengstler
Journal: Arch Toxicol Date: 2010-04 Impact factor: 5.153

2. Gene expression profiles in the brain of the neonate mouse perinatally exposed to methylmercury and/or polychlorinated biphenyls.

Authors: Miyuki Shimada; Satomi Kameo; Norio Sugawara; Kozue Yaginuma-Sakurai; Naoyuki Kurokawa; Satomi Mizukami-Murata; Kunihiko Nakai; Hitoshi Iwahashi; Hiroshi Satoh
Journal: Arch Toxicol Date: 2009-12-18 Impact factor: 5.153

3. Gelsolin Is Associated with Longer Metastasis-free Survival and Reduced Cell Migration in Estrogen Receptor-positive Breast Cancer.

Authors: Anna-Maria Stock; Franziska Klee; Karolina Edlund; Marianna Grinberg; Seddik Hammad; Rosemarie Marchan; Cristina Cadenas; Bernd Niggemann; Kurt S Zänker; Jörg Rahnenführer; Marcus Schmidt; Jan G Hengstler; Frank Entschladen
Journal: Anticancer Res Date: 2015-10 Impact factor: 2.480

4. Aberrantly activated claudin 6 and 18.2 as potential therapy targets in non-small-cell lung cancer.

Authors: Patrick Micke; Johanna Sofia Margareta Mattsson; Karolina Edlund; Miriam Lohr; Karin Jirström; Anders Berglund; Johan Botling; Jörg Rahnenfuehrer; Millaray Marincevic; Fredrik Pontén; Simon Ekman; Jan Hengstler; Stefan Wöll; Ugur Sahin; Ozlem Türeci
Journal: Int J Cancer Date: 2014-04-08 Impact factor: 7.396

5. Differential gene expression in human hepatocyte cell lines exposed to the antiretroviral agent zidovudine.

Authors: Jia-Long Fang; Tao Han; Qiangen Wu; Frederick A Beland; Ching-Wei Chang; Lei Guo; James C Fuscoe
Journal: Arch Toxicol Date: 2013-11-30 Impact factor: 5.153

6. Cadmium, cobalt and lead cause stress response, cell cycle deregulation and increased steroid as well as xenobiotic metabolism in primary normal human bronchial epithelial cells which is coordinated by at least nine transcription factors.

Authors: Felix Glahn; Wolfgang Schmidt-Heck; Sebastian Zellmer; Reinhard Guthke; Jan Wiese; Klaus Golka; Roland Hergenröder; Gisela H Degen; Thomas Lehmann; Matthias Hermes; Wiebke Schormann; Marc Brulport; Alexander Bauer; Essam Bedawy; Rolf Gebhardt; Jan G Hengstler; Heidi Foth
Journal: Arch Toxicol Date: 2008-07-25 Impact factor: 5.153

7. Human embryonic stem cell-derived test systems for developmental neurotoxicity: a transcriptomics approach.

Authors: Anne K Krug; Raivo Kolde; John A Gaspar; Eugen Rempel; Nina V Balmer; Kesavan Meganathan; Kinga Vojnits; Mathurin Baquié; Tanja Waldmann; Roberto Ensenat-Waser; Smita Jagtap; Richard M Evans; Stephanie Julien; Hedi Peterson; Dimitra Zagoura; Suzanne Kadereit; Daniel Gerhard; Isaia Sotiriadou; Michael Heke; Karthick Natarajan; Margit Henry; Johannes Winkler; Rosemarie Marchan; Luc Stoppini; Sieto Bosgra; Joost Westerhout; Miriam Verwei; Jaak Vilo; Andreas Kortenkamp; Jürgen Hescheler; Ludwig Hothorn; Susanne Bremer; Christoph van Thriel; Karl-Heinz Krause; Jan G Hengstler; Jörg Rahnenführer; Marcel Leist; Agapios Sachinidis
Journal: Arch Toxicol Date: 2012-11-21 Impact factor: 5.153

8. Gene networks and transcription factor motifs defining the differentiation of stem cells into hepatocyte-like cells.

Authors: Patricio Godoy; Wolfgang Schmidt-Heck; Karthick Natarajan; Baltasar Lucendo-Villarin; Dagmara Szkolnicka; Annika Asplund; Petter Björquist; Agata Widera; Regina Stöber; Gisela Campos; Seddik Hammad; Agapios Sachinidis; Umesh Chaudhari; Georg Damm; Thomas S Weiss; Andreas Nüssler; Jane Synnergren; Karolina Edlund; Barbara Küppers-Munther; David C Hay; Jan G Hengstler
Journal: J Hepatol Date: 2015-05-25 Impact factor: 25.083

9. Transcriptome based differentiation of harmless, teratogenetic and cytotoxic concentration ranges of valproic acid.

10. Loss of circadian clock gene expression is associated with tumor progression in breast cancer.
