Nathan P Golightly, Avery Bell, Anna I Bischoff, Parker D Hollingsworth, Stephen R Piccolo.
Abstract
One important use of genome-wide transcriptional profiles is to identify relationships between transcription levels and patient outcomes. These translational insights can guide the development of biomarkers for clinical application. Data from thousands of translational-biomarker studies have been deposited in public repositories, enabling reuse. However, data-reuse efforts require considerable time and expertise because transcriptional data are generated using heterogeneous profiling technologies, preprocessed using diverse normalization procedures, and annotated in non-standard ways. To address this problem, we curated 45 publicly available, translational-biomarker datasets from a variety of human diseases. To increase the data's utility, we reprocessed the raw expression data using a uniform computational pipeline, addressed quality-control problems, mapped the clinical annotations to a controlled vocabulary, and prepared consistently structured, analysis-ready data files. These data, along with the scripts we used to prepare them, are available in a public repository. We believe these data will be particularly useful to researchers seeking to perform benchmarking studies, for example, to compare and optimize machine-learning algorithms' ability to predict biomedical outcomes.
Year: 2018 PMID: 29664470 PMCID: PMC5903354 DOI: 10.1038/sdata.2018.66
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Figure 1. Flow diagram that illustrates the process we used to collect and curate the data.
We wrote computer scripts that downloaded the data, checked for quality, normalized and standardized data values, and stored the data in analysis-ready file formats. The specific steps differed for clinical and expression data (see Methods).
Overview of data sources used in this study.
| GEO series (a) | Tissue or condition | Samples | Clinical variables | Genes | Microarray platform |
|---|---|---|---|---|---|
| GSE1456 | Breast cancer | 157 | Elston grade; overall survival status; overall survival time; relapse status; relapse time | 11832 | Genome U133A |
| GSE2109 | Breast | 263 | Age; alcohol consumption; days from diagnosis to excision; ER status; ethnic background; ethnic background; family history of cancer; fibrocystic disease; Her2 status; histology; hormonal therapy duration; mammogram status and findings; metastasis; metastatic sites; multiple tumors; node involvement; oophorectomy status; oral contraceptive use; PR status; prior therapy status; quality metric; relapse time; retreatment states; sex; stage; tobacco use; tumor grade; tumor size | 20024 | Genome U133 Plus 2.0 |
| GSE2109 | Colon | 255 | Age; alcohol consumption; days from diagnosis to excision; diagnosis method; Dukes’ stage; ethnic background; family history of cancer; histology; metastasis; metastatic sites; multiple tumors; node involvement; primary site; prior screening status; prior therapy status; quality metric; relapse time; retreatment states; sex; stage; symptoms; tobacco use; tumor grade; tumor size | 20024 | Genome U133 Plus 2.0 |
| GSE2109 | Endometrium | 51 | Age; alcohol consumption; days from diagnosis to excision; ethnic background; family history of cancer; histology; metastasis; metastatic sites; multiple tumors; node involvement; primary site; quality metric; stage; symptoms; tobacco use; tumor grade; tumor size | 20024 | Genome U133 Plus 2.0 |
| GSE2109 | Kidney | 209 | Age; alcohol consumption; days from diagnosis to excision; ethnic background; family history of cancer; histology; metastasis; metastatic sites; multiple tumors; primary site; prior therapy status; quality metric; relapse time; retreatment states; sex; stage; tobacco use; tumor grade; tumor size | 20024 | Genome U133 Plus 2.0 |
| GSE2109 | Lung | 103 | Age; alcohol consumption; days from diagnosis to excision; ethnic background; family history of cancer; histology; metastasis; metastatic sites; multiple tumors; node involvement; primary site; prior therapy status; quality metric; relapse time; retreatment states; sex; stage; symptoms; tobacco use; tumor grade; tumor size | 20024 | Genome U133 Plus 2.0 |
| GSE2109 | Ovary | 158 | Age; alcohol consumption; days from diagnosis to excision; esophagitis reflux history; ethnic background; family history of cancer; fibrocystic disease; histology; mammogram history; metastasis; metastatic sites; multiple tumors; node involvement; node involvement; oophorectomy status; primary site; prior therapy status; quality metric; relapse time; retreatment states; screening history; stage; symptoms; tobacco use; tumor grade; tumor size | 20024 | Genome U133 Plus 2.0 |
| GSE2109 | Prostate | 79 | Age; alcohol consumption; days from diagnosis to excision; diagnosis method; ethnic background; family history of cancer; Gleason score; histology; metastasis; multiple tumors; node involvement; prior therapy status; prostate-specific antigen (PSA) testing history; PSA finding; quality metric; stage; symptoms; tobacco use; tobacco use; tumor grade; tumor size | 20024 | Genome U133 Plus 2.0 |
| GSE2109 | Uterine | 112 | Age; alcohol consumption; days from diagnosis to excision; ethnic background; family history of cancer; histology; human papilloma virus diagnosis history; metastasis; metastatic sites; multiple tumors; node involvement; primary site; prior therapy status; quality metric; relapse; stage; symptoms; tobacco use; tumor grade; tumor size | 20024 | Genome U133 Plus 2.0 |
| GSE4271 | Glial | 100 | Age; recurrence status; sex; survival status; survival time; WHO grade | 11832 | Genome U133A |
| GSE5460 | Breast cancer | 127 | ER status; Her2 status; histological type; lymphovascular invasion; node status; tumor grade; tumor size | 20024 | Genome U133 Plus 2.0 |
| GSE5462 | Breast cancer | 52 | Treatment history; treatment response | 11832 | Genome U133A |
| GSE6532 | Breast carcinoma | 317 | Age; distant metastasis-free survival time/status; ER status; genomic grade index; node involvement; PR status; recurrence-free survival time/status; tumor grade; tumor size | 11832 | Genome U133A |
| GSE6532 | Breast carcinoma | 87 | Age; distant metastasis-free survival time/status; ER status; genomic grade index; node involvement; PR status; recurrence-free survival time/status; tumor grade; tumor size | 20024 | Genome U133 Plus 2.0 |
| GSE10320 | Wilms Tumor | 144 | Relapse | 11832 | Genome U133A |
| GSE15296 | Peripheral Blood | 75 | Kidney transplant rejection; subtype | 20024 | Genome U133 Plus 2.0 |
| GSE19804 | Paired tumor and normal tissues | 60 | Age; tissue type; tumor stage | 20024 | Genome U133 Plus 2.0 |
| GSE20181 | Breast cancer | 50 | Treatment history; treatment response | 11832 | Genome U133A |
| GSE20189 | Lung adenocarcinoma | 162 | Case/control status; morphology; smoking status; stage | 11832 | Genome U133A 2.0 |
| GSE21510 | Laser capture microdissection and homogenized tissues (surgically resected material) | 104 | Metastasis; stage; tissue type | 20024 | Genome U133 Plus 2.0 |
| GSE25507 | Peripheral blood lymphocyte | 146 | Case/control status (autism); paternal age; maternal age; subject age | 20024 | Genome U133 Plus 2.0 |
| GSE26682 | Colorectal tumor | 140 | Age; microsatellite instability status; sex | 11832 | Genome U133A |
| GSE26682 | Colorectal tumor | 160 | Age; microsatellite instability status; sex | 20024 | Genome U133 Plus 2.0 |
| GSE27279 | Posterior Fossa Ependymoma | 100 | Age; sex; tumor location | 16632 | Exon 1.0 ST |
| GSE27342 | Paired gastric tumor and normal tissue | 72 | Age; sex; stage; tissue type; tumor grade | 16632 | Exon 1.0 ST |
| GSE27854 | Colorectal tumor | 115 | Metastasis; stage | 20024 | Genome U133 Plus 2.0 |
| GSE30219 | Lung | 293 | Age; follow-up time; histology; metastasis; node involvement; relapse status; sex; survival; survival time; tumor size | 20024 | Genome U133 Plus 2.0 |
| GSE30784 | Oral squamous cell carcinoma | 229 | Age; case/control status; sex | 20024 | Genome U133 Plus 2.0 |
| GSE32646 | Breast | 115 | Age; ER status (IHC); Her2 status (FISH); histological grade; lymph node involvement; pathologic complete response; PR status (IHC); stage; tumor size | 20024 | Genome U133 Plus 2.0 |
| GSE37147 | Bronchial sample | 238 | Age; case/control status (COPD); FEV1/FVC score/percentage; inhaled medication status; sex; smoking status; tobacco use | 21614 | Gene 1.0 ST |
| GSE37199 | Blood sample | 94 | Disease stage (advanced castration resistant prostate cancer) | 20024 | Genome U133 Plus 2.0 |
| GSE37745 | Non-small cell lung cancer | 196 | Adjuvant treatment status; age; histology; recurrence time/status; sex; stage; survival time/status; WHO performance status | 20024 | Genome U133 Plus 2.0 |
| GSE37892 | Stage-II colon carcinoma | 130 | Age; diagnosis history; localisation; stage; time until metastasis | 20024 | Genome U133 Plus 2.0 |
| GSE38958 | Peripheral blood mononuclear cell | 115 | Age; diagnosis (Idiopathic pulmonary fibrosis); ethnicity; predicted FVC percent; sex | 16632 | Exon 1.0 ST |
| GSE39491 | Esophageal and gastric samples | 120 | Tumor cell type | 11832 | Genome U133 Plus 2.0 |
| GSE39582 | Colon cancer | 566 | Adjuvant chemotherapy; age; BRAF mutation status; chromosome instability status; CIMP status; KRAS mutation status; mismatch repair status; overall survival time/status; recurrence-free survival time/status; sex; stage; TP53 mutation status; tumor location | 20024 | Genome U133 Plus 2.0 |
| GSE40292 | Afferent limb tissue and whole-blood sample | 195 | Diagnosis; sex | 21614 | Gene 1.0 ST |
| GSE43176 | Leukemic blast sample | 104 | Cytogenetics; disease state; FAB stage; KRAS mutation status; NRAS mutation status; subtype | 11832 | Genome U133A |
| GSE46449 | Peripheral blood leukocyte | 53 | Age; diagnosis (bipolar disorder) | 20024 | Genome U133 Plus 2.0 |
| GSE46691 | Prostate | 545 | Gleason score; metastasis | 16632 | Exon 1.0 ST |
| GSE46995 | Leukocyte | 85 | Age; disease status (biliary atresia) | 21614 | Gene 1.0 ST |
| GSE48391 | Breast | 81 | ER status; gene-expression subtype; Her2 status; recurrence status; survival time/status | 20024 | Genome U133 Plus 2.0 |
| GSE58697 | Desmoid tumor | 72 | Age; follow-up time; recurrence time; sex; tumor location; tumor size | 20024 | Genome U133 Plus 2.0 |
| GSE63885 | Ovarian cancer surgical sample | 101 | Adjuvant chemotherapy; BRCA mutation status; clinical status at last follow-up; clinical status after 1st line chemotherapy; disease-free survival; FIGO stage; histopathological type; overall survival; residual tumor size; TP53 accumulation in cancer cells (IHC); TP53 mutation status; TP53 mutation status; tumor grade | 20024 | Genome U133 Plus 2.0 |
| GSE67784 | Peripheral blood sample | 309 | Sex; V30M mutation status; whether exhibiting symptoms | 21614 | Gene 1.1 ST |

(a) These identifiers represent data series in Gene Expression Omnibus (GEO). Some identifiers are listed multiple times; in these cases, we used a subset of the series data (for a specific tissue type or microarray platform).
Figure 2. Histogram showing the proportion of missing clinical-annotation values per dataset.
Some datasets contained no missing values, while others were missing as many as 72.3% of data values.
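The per-dataset proportion summarized in this figure can be computed along the following lines. This is a minimal sketch, not the authors' script: the tab-separated layout, the `SampleID` column name, and the set of tokens treated as missing are all assumptions made for illustration.

```python
import csv
import io

# Tokens treated as missing. This set is an assumption, not taken from the paper.
MISSING_TOKENS = {"", "NA", "NaN", "nan", "N/A"}


def missing_proportion(tsv_text, id_column="SampleID"):
    """Return the fraction of clinical-annotation cells that are missing."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    total = 0
    missing = 0
    for row in reader:
        for column, value in row.items():
            # Skip the sample identifier and any overflow fields (key None).
            if column is None or column == id_column:
                continue
            total += 1
            # Short rows yield None for absent fields; count those as missing.
            if value is None or value.strip() in MISSING_TOKENS:
                missing += 1
    return missing / total if total else 0.0


# Hypothetical three-sample annotation table with three missing cells out of nine.
example = (
    "SampleID\tAge\tSex\tStage\n"
    "S1\t61\tF\tII\n"
    "S2\tNA\tM\t\n"
    "S3\t48\tNA\tIII\n"
)
print(f"{missing_proportion(example):.1%}")
```

Applying the same function to each dataset's clinical file would give one proportion per dataset, which is the quantity the histogram bins.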
Figure 3. Distribution of IQRay quality scores for each dataset.
Sample qualities are plotted for each dataset. Low-quality samples were identified using Grubbs' test. Samples that fall on or below the red threshold were excluded from the data repository.
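For readers unfamiliar with Grubbs' test, the sketch below shows the one-sided form that checks whether the lowest score in a sample is an outlier. It illustrates the general technique only; the significance level, the one-sided form, and the example scores are assumptions, and the text does not state exactly how the authors applied the test. To keep the sketch within the standard library, the Student's t quantile is passed in rather than computed.

```python
import math
import statistics


def grubbs_min_statistic(scores):
    """Grubbs' statistic for the smallest value: (mean - min) / sample SD."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation (n - 1 denominator)
    return (mean - min(scores)) / sd


def grubbs_critical_value(n, t_crit):
    """One-sided critical value.

    t_crit is the Student's t quantile at probability 1 - alpha / n
    with n - 2 degrees of freedom, supplied by the caller.
    """
    return ((n - 1) / math.sqrt(n)) * math.sqrt(t_crit**2 / (n - 2 + t_crit**2))


# Hypothetical quality scores for ten samples; the last one is suspiciously low.
scores = [0.91, 0.93, 0.90, 0.92, 0.94, 0.89, 0.92, 0.91, 0.93, 0.40]

# t quantile at 0.995 with 8 degrees of freedom (alpha = 0.05, n = 10),
# taken from a standard t table.
T_CRIT = 3.355

g = grubbs_min_statistic(scores)
g_crit = grubbs_critical_value(len(scores), T_CRIT)
is_outlier = g > g_crit
print(f"G = {g:.3f}, critical value = {g_crit:.3f}, outlier = {is_outlier}")
```

In practice the test is usually repeated after removing a flagged sample, because it evaluates only the single most extreme value on each pass.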
Summary of excluded samples.
We excluded samples that did not pass our quality-control criteria or that appeared to be duplicated. The Gene Expression Omnibus series and sample identifiers are listed, along with the reason we excluded each sample.

| GEO series | GEO sample | Reason for exclusion |
|---|---|---|
| GSE15296 | GSM382283 | Poor Quality |
| GSE19804 | GSM494596 | Poor Quality |
| GSE19804 | GSM494654 | Poor Quality |
| GSE19804 | GSM494657 | Poor Quality |
| GSE20181 | GSM506289 | Likely Duplicate |
| GSE20181 | GSM506294 | Likely Duplicate |
| GSE20181 | GSM506304 | Likely Duplicate |
| GSE20181 | GSM125198 | Likely Duplicate |
| GSE20181 | GSM125210 | Likely Duplicate |
| GSE20181 | GSM125230 | Likely Duplicate |
| GSE2109_Breast | GSM53059 | Likely Duplicate |
| GSE2109_Breast | GSM53027 | Likely Duplicate |
| GSE2109_Colon | GSM89040 | Likely Duplicate |
| GSE2109_Colon | GSM152664 | Likely Duplicate |
| GSE2109_Colon | GSM152632 | Likely Duplicate |
| GSE2109_Colon | GSM179922 | Likely Duplicate |
| GSE2109_Colon | GSM89044 | Likely Duplicate |
| GSE2109_Colon | GSM152666 | Likely Duplicate |
| GSE2109_Colon | GSM179820 | Likely Duplicate |
| GSE2109_Colon | GSM179924 | Likely Duplicate |
| GSE2109_Lung | GSM203652 | Poor Quality |
| GSE2109_Ovary | GSM76554 | Likely Duplicate |
| GSE2109_Ovary | GSM203725 | Likely Duplicate |
| GSE2109_Ovary | GSM76567 | Likely Duplicate |
| GSE2109_Ovary | GSM231913 | Likely Duplicate |
| GSE2109_Ovary | GSM46839 | Poor Quality |
| GSE2109_Prostate | GSM179790 | Likely Duplicate |
| GSE2109_Prostate | GSM179843 | Likely Duplicate |
| GSE2109_Prostate | GSM179903 | Likely Duplicate |
| GSE25507 | GSM627091 | Likely Duplicate |
| GSE25507 | GSM627087 | Likely Duplicate |
| GSE25507 | GSM627096 | Likely Duplicate |
| GSE25507 | GSM627078 | Likely Duplicate |
| GSE25507 | GSM627085 | Likely Duplicate |
| GSE25507 | GSM627196 | Likely Duplicate |
| GSE25507 | GSM627153 | Likely Duplicate |
| GSE25507 | GSM627180 | Likely Duplicate |
| GSE25507 | GSM627099 | Likely Duplicate |
| GSE25507 | GSM627115 | Likely Duplicate |
| GSE25507 | GSM627118 | Likely Duplicate |
| GSE25507 | GSM627124 | Likely Duplicate |
| GSE25507 | GSM627154 | Likely Duplicate |
| GSE25507 | GSM627204 | Likely Duplicate |
| GSE25507 | GSM627209 | Likely Duplicate |
| GSE25507 | GSM627215 | Likely Duplicate |
| GSE26682 | GSM656833 | Likely Duplicate |
| GSE26682 | GSM656770 | Likely Duplicate |
| GSE26682_U133PLUS2 | GSM656860 | Poor Quality |
| GSE26682_U133PLUS2 | GSM656613 | Poor Quality |
| GSE26682_U133PLUS2 | GSM656839 | Poor Quality |
| GSE26682_U133PLUS2 | GSM656721 | Poor Quality |
| GSE27342 | GSM675945 | Likely Duplicate |
| GSE27342 | GSM675947 | Likely Duplicate |
| GSE27342 | GSM675933 | Likely Duplicate |
| GSE27342 | GSM675935 | Likely Duplicate |
| GSE27342 | GSM676040 | Poor Quality |
| GSE27342 | GSM687519 | Poor Quality |
| GSE27854 | GSM687525 | Poor Quality |
| GSE30219 | GSM748210 | Likely Duplicate |
| GSE30219 | GSM748212 | Likely Duplicate |
| GSE30219 | GSM748218 | Likely Duplicate |
| GSE30219 | GSM748219 | Likely Duplicate |
| GSE30219 | GSM748255 | Poor Quality |
| GSE30219 | GSM748247 | Poor Quality |
| GSE30219 | GSM748057 | Poor Quality |
| GSE30219 | GSM748266 | Poor Quality |
| GSE30784 | GSM764928 | Likely Duplicate |
| GSE30784 | GSM764930 | Likely Duplicate |
| GSE30784 | GSM764904 | Poor Quality |
| GSE30784 | GSM764970 | Poor Quality |
| GSE32646 | GSM809214 | Likely Duplicate |
| GSE32646 | GSM809248 | Likely Duplicate |
| GSE32646 | GSM809251 | Likely Duplicate |
| GSE32646 | GSM809254 | Likely Duplicate |
| GSE37147 | GSM912230 | Likely Duplicate |
| GSE37147 | GSM912296 | Likely Duplicate |
| GSE37147 | GSM912296 | Likely Duplicate |
| GSE37147 | GSM912305 | Likely Duplicate |
| GSE37147 | GSM912291 | Likely Duplicate |
| GSE37147 | GSM912296 | Likely Duplicate |
| GSE37147 | GSM912305 | Likely Duplicate |
| GSE37147 | GSM912342 | Likely Duplicate |
| GSE37147 | GSM912273 | Likely Duplicate |
| GSE37147 | GSM912305 | Likely Duplicate |
| GSE37147 | GSM912342 | Likely Duplicate |
| GSE37147 | GSM912342 | Likely Duplicate |
| GSE37147 | GSM912348 | Likely Duplicate |
| GSE37147 | GSM912376 | Likely Duplicate |
| GSE37147 | GSM912376 | Likely Duplicate |
| GSE37147 | GSM912376 | Likely Duplicate |
| GSE37147 | GSM912463 | Poor Quality |
| GSE37147 | GSM912197 | Poor Quality |
| GSE37147 | GSM912300 | Poor Quality |
| GSE37199 | GSM913439 | Poor Quality |
| GSE37745 | GSM1019319 | Likely Duplicate |
| GSE37745 | GSM1019246 | Likely Duplicate |
| GSE37745 | GSM1019325 | Likely Duplicate |
| GSE37745 | GSM1019247 | Likely Duplicate |
| GSE37745 | GSM1019194 | Poor Quality |
| GSE37745 | GSM1019195 | Poor Quality |
| GSE37745 | GSM1019176 | Poor Quality |
| GSE37745 | GSM1019192 | Poor Quality |
| GSE37745 | GSM1019232 | Poor Quality |
| GSE37892 | GSM929512 | Poor Quality |
| GSE39491 | GSM970152 | Poor Quality |
| GSE39582 | GSM972249 | Likely Duplicate |
| GSE39582 | GSM972472 | Likely Duplicate |
| GSE39582 | GSM972243 | Likely Duplicate |
| GSE39582 | GSM972044 | Likely Duplicate |
| GSE39582 | GSM972091 | Likely Duplicate |
| GSE39582 | GSM972090 | Likely Duplicate |
| GSE39582 | GSM972245 | Likely Duplicate |
| GSE39582 | GSM972473 | Likely Duplicate |
| GSE39582 | GSM972515 | Likely Duplicate |
| GSE39582 | GSM972248 | Likely Duplicate |
| GSE43176 | GSM1057835 | Poor Quality |
| GSE46449 | GSM1130404 | Likely Duplicate |
| GSE46449 | GSM1130406 | Likely Duplicate |
| GSE46449 | GSM1130413 | Likely Duplicate |
| GSE46449 | GSM1130417 | Likely Duplicate |
| GSE46449 | GSM1130426 | Likely Duplicate |
| GSE46449 | GSM1130428 | Likely Duplicate |
| GSE46449 | GSM1130430 | Likely Duplicate |
| GSE46449 | GSM1130434 | Likely Duplicate |
| GSE46449 | GSM1130436 | Likely Duplicate |
| GSE46449 | GSM1130468 | Likely Duplicate |
| GSE46449 | GSM1130471 | Likely Duplicate |
| GSE46449 | GSM1130483 | Likely Duplicate |
| GSE48390 | GSM1176924 | Poor Quality |
| GSE48390 | GSM125120 | Poor Quality |
| GSE5462 | GSM125123 | Likely Duplicate |
| GSE5462 | GSM125125 | Likely Duplicate |
| GSE58697 | GSM1417097 | Poor Quality |
| GSE63885 | GSM1559328 | Likely Duplicate |
| GSE63885 | GSM1559360 | Likely Duplicate |
| GSE63885 | GSM1559385 | Likely Duplicate |
| GSE63885 | GSM1559370 | Likely Duplicate |
| GSE63885 | GSM1559375 | Likely Duplicate |
| GSE63885 | GSM1559386 | Likely Duplicate |
| GSE63885 | GSM1559361 | Poor Quality |
| GSE6532_U133PLUS2 | GSM151294 | Poor Quality |
| GSE6532_U133PLUS2 | GSM151280 | Poor Quality |
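Anyone re-deriving a dataset from the original GEO series could apply the exclusion list as a simple filter. The sketch below assumes the table has been loaded as (series, sample, reason) tuples; the function name and the two unlisted sample identifiers are hypothetical and used only for illustration.

```python
# A few rows copied from the exclusion table above.
EXCLUDED = [
    ("GSE15296", "GSM382283", "Poor Quality"),
    ("GSE19804", "GSM494596", "Poor Quality"),
    ("GSE5462", "GSM125123", "Likely Duplicate"),
]


def drop_excluded(series_id, sample_ids, excluded=EXCLUDED):
    """Return sample_ids without the samples excluded for the given series."""
    blocked = {sample for series, sample, _reason in excluded if series == series_id}
    return [sample for sample in sample_ids if sample not in blocked]


# GSM494595 and GSM494597 are placeholder identifiers, not claims about GSE19804.
samples = ["GSM494595", "GSM494596", "GSM494597"]
kept = drop_excluded("GSE19804", samples)
print(kept)
```

Matching on the series label as well as the sample identifier matters here because some samples appear in more than one series, and the table lists exclusions per dataset.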