Marta Rosikiewicz, Aurélie Comte, Anne Niknejad, Marc Robinson-Rechavi, Frederic B Bastian.
Abstract
As part of the development of the database Bgee (a dataBase for Gene Expression Evolution), we annotate and analyse expression data of different types and from different sources, notably Affymetrix data from GEO and ArrayExpress, and RNA-Seq data from SRA. During our quality control procedure, we have identified duplicated content in GEO and ArrayExpress, affecting ∼14% of our data: fully or partially duplicated experiments from independent data submissions, Affymetrix chips reused in several experiments, or reused within an experiment. We present here the procedure that we have established to filter such duplicates from Affymetrix data, and our procedure to identify future potential duplicates in RNA-Seq data.
Year: 2013 PMID: 23487185 PMCID: PMC3595988 DOI: 10.1093/database/bat010
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Distribution of groups of identical Affymetrix chips
| Number of chip groups | Number of chips per group | Number of experiments per group |
|---|---|---|
| 4 | 2 | 1 |
| 13 | 3 | 3 |
| 15 | 4 | 4 |
| 1033 | 2 | 2 |
Distribution of pairs of experiments sharing identical chips
| Number of shared identical chips | 1 | 2 | 3 | 4 | 5 | 6–10 | 11–20 | 21–50 | 51–340 |
|---|---|---|---|---|---|---|---|---|---|
| Number of experiment pairs | 2 | 5 | 14 | 8 | 3 | 14 | 13 | 3 | 4 |
Note that an experiment can be part of several pairs, depending on the number of experiments it shares chips with, and that the four experiments using duplicated chips within themselves (GSE591, GSE9750, GSE6196 and GSE6490) are not considered.
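The counting rule in the note above can be illustrated with a minimal Python sketch. It is not the authors' procedure: it assumes each chip already carries a content fingerprint (for example a checksum of its signal values), so that identical chips share the same fingerprint. The function names, experiment IDs, chip IDs and fingerprints below are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def group_identical_chips(chips):
    """Group chips by fingerprint and keep groups with at least two chips.

    `chips` is an iterable of (experiment_id, chip_id, fingerprint) tuples.
    The fingerprint is an assumed stand-in for whatever identity test is used.
    """
    by_fingerprint = defaultdict(list)
    for experiment_id, chip_id, fingerprint in chips:
        by_fingerprint[fingerprint].append((experiment_id, chip_id))
    return [members for members in by_fingerprint.values() if len(members) > 1]


def shared_chips_per_experiment_pair(groups):
    """Count, for each pair of distinct experiments, the groups they share.

    A group whose chips all belong to one experiment yields no pair, so
    duplicates within a single experiment are not counted here.
    """
    shared = defaultdict(int)
    for members in groups:
        experiments = sorted({experiment_id for experiment_id, _ in members})
        for pair in combinations(experiments, 2):
            shared[pair] += 1
    return dict(shared)


# Hypothetical toy data: (experiment, chip, fingerprint).
chips = [
    ("EXP_A", "a1", "h1"), ("EXP_B", "b1", "h1"),
    ("EXP_A", "a2", "h2"), ("EXP_B", "b2", "h2"), ("EXP_C", "c1", "h2"),
    ("EXP_D", "d1", "h3"), ("EXP_D", "d2", "h3"),  # reused within one experiment
    ("EXP_C", "c2", "h4"),                         # unique chip, no group
]

groups = group_identical_chips(chips)
pairs = shared_chips_per_experiment_pair(groups)
```

In this toy data, EXP_B appears in two pairs (with EXP_A and with EXP_C), which is the sense in which an experiment can be part of several pairs. EXP_D contributes a group of identical chips but no pair.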