Benjamin Frederick Ganzfried, Markus Riester, Benjamin Haibe-Kains, Thomas Risch, Svitlana Tyekucheva, Ina Jazic, Xin Victoria Wang, Mahnaz Ahmadifar, Michael J Birrer, Giovanni Parmigiani, Curtis Huttenhower, Levi Waldron.
Abstract
This article introduces a manually curated data collection for gene expression meta-analysis of patients with ovarian cancer and software for reproducible preparation of similar databases. This resource provides uniformly prepared microarray data for 2970 patients from 23 studies with curated and documented clinical metadata. It allows users to efficiently identify studies and patient subgroups of interest for analysis and to perform meta-analysis immediately without the challenges posed by harmonizing heterogeneous microarray technologies, study designs, expression data processing methods and clinical data formats. We confirm that the recently proposed biomarker CXCL12 is associated with patient survival, independently of stage and optimal surgical debulking, which was possible only through meta-analysis owing to insufficient sample sizes of the individual studies. The database is implemented as the curatedOvarianData Bioconductor package for the R statistical computing language, providing a comprehensive and flexible resource for clinically oriented investigation of the ovarian cancer transcriptome. The package and pipeline for producing it are available from http://bcb.dfci.harvard.edu/ovariancancer.
Year: 2013 PMID: 23550061 PMCID: PMC3625954 DOI: 10.1093/database/bat013
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Table 1. Data sets in the curatedOvarianData database
| Data set | Reference | Platform | Samples | Late Stage (%) | Serous Subtype (%) | Median Survival (Months) | Median Follow-up (Months) | Censoring (%) |
|---|---|---|---|---|---|---|---|---|
| E.MTAB.386 | ( | Ill. HumanRef-8 v2 | 129 | 99 | 100 | 42 | 55 | 43 |
| GSE12418 | ( | SWEGENE v2.1.1_27k | 54 | 100 | 100 | N/A | N/A | N/A |
| GSE12470 | ( | Agilent G4110b | 53 | 66 | 81 | N/A | N/A | N/A |
| GSE13876 | ( | Operon Human v3 | 157 | 100 | 100 | 25 | 72 | 28 |
| GSE14764 | ( | Affy U133a | 80 | 89 | 85 | 54 | 37 | 74 |
| GSE17260 | ( | Agilent G4112a | 110 | 100 | 100 | 53 | 47 | 58 |
| GSE18520 | ( | Affy U133 Plus 2.0 | 63 | 84 | 84 | 25 | 140 | 23 |
| GSE19829.GPL570 | ( | Affy U133 Plus 2.0 | 28 | N/A | N/A | 47 | 62 | 39 |
| GSE19829.GPL8300 | ( | Affy U95 v2 | 42 | N/A | N/A | 45 | 50 | 45 |
| GSE20565 | ( | Affy U133 Plus 2.0 | 140 | 48 | 51 | N/A | N/A | N/A |
| GSE2109 | N/A | Affy U133 Plus 2.0 | 204 | 42 | 42 | N/A | N/A | N/A |
| GSE26712 | ( | Affy U133a | 195 | 96 | 95 | 46 | 90 | 30 |
| GSE30009 | ( | TaqMan qRT-PCR 380 | 103 | 100 | 99 | 41 | 53 | 45 |
| GSE30161 | ( | Affy U133 Plus 2.0 | 58 | 100 | 81 | 50 | 83 | 38 |
| GSE32062.GPL6480 | ( | Agilent G4112a | 260 | 100 | 100 | 59 | 56 | 53 |
| GSE32063 | ( | Agilent G4112a | 40 | 100 | 100 | 53 | 81 | 45 |
| GSE6008 | ( | Affy U133a | 99 | 54 | 41 | N/A | N/A | N/A |
| GSE6822 | ( | Affy Hu6800 | 66 | N/A | 62 | N/A | N/A | N/A |
| GSE9891 | ( | Affy U133 Plus 2.0 | 285 | 85 | 93 | 47 | 36 | 59 |
| PMID15897565 | ( | Affy U133a | 63 | 83 | 100 | N/A | N/A | N/A |
| PMID17290060 | ( | Affy U133a | 117 | 98 | 100 | 63 | 82 | 43 |
| PMID19318476 | ( | Affy U133a | 42 | 93 | 100 | 34 | 89 | 48 |
| TCGA | ( | Affy HT U133a | 578 | 90 | 98 | 45 | 52 | 48 |
These data sets provide curated gene expression and clinical data for a total of 2970 samples, including all publicly available ovarian cancer gene expression experiments with individual patient survival information at the time of press.
a: Only FIGO Stages III and IV.
b: Data set is a subset of the samples from the retracted paper PMID17290060, Dressman et al. (27).
c: Paper was retracted because of a misalignment of genomic and survival data (30); the corrected data are provided here.
N/A, not available.
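As a rough illustration of how a summary table like the one above can be used to pick studies for a survival meta-analysis, the following Python sketch filters a few transcribed rows. It is not part of the curatedOvarianData R package: the `DataSet` type and `with_survival` helper are hypothetical names, only a handful of rows are copied, and treating "N/A" in the median survival column as "no survival data" is an assumption of this sketch.

```python
from typing import List, NamedTuple, Optional


class DataSet(NamedTuple):
    """One row of the summary table (hypothetical container type)."""
    name: str
    platform: str
    samples: int
    median_survival_months: Optional[int]  # None where the table says N/A


# A few rows transcribed from the table above; this is not the full list.
DATA_SETS = [
    DataSet("GSE12418", "SWEGENE v2.1.1_27k", 54, None),
    DataSet("GSE14764", "Affy U133a", 80, 54),
    DataSet("GSE20565", "Affy U133 Plus 2.0", 140, None),
    DataSet("GSE26712", "Affy U133a", 195, 46),
    DataSet("GSE9891", "Affy U133 Plus 2.0", 285, 47),
    DataSet("TCGA", "Affy HT U133a", 578, 45),
]


def with_survival(data_sets: List[DataSet], min_samples: int = 0) -> List[DataSet]:
    """Keep data sets that report a median survival and have enough samples."""
    return [
        d for d in data_sets
        if d.median_survival_months is not None and d.samples >= min_samples
    ]


selected = with_survival(DATA_SETS, min_samples=100)
print([d.name for d in selected])  # ['GSE26712', 'GSE9891', 'TCGA']
```

The same filter could be extended with the late-stage or serous percentages when a more homogeneous patient subgroup is wanted.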
Figure 1. Flowchart of the data collection and curation pipeline. The software implementing this pipeline reproduces all steps from downloading of data to final packaging, requiring manual intervention only for identifying studies, curation of clinical metadata and documentation of the package.
Table 2. Curated clinical annotations
| Characteristic | Allowed values | Description |
|---|---|---|
| sample_type | tumour, metastatic, cellline, healthy, adjacentnormal | healthy, only from individuals without cancer; adjacentnormal, from individuals with cancer |
| histological_type | ser, endo, clearcell, mucinous, other, mix, undifferentiated | ser, serous; endo, endometrioid; clearcell, clear cell; mix, mixture of ser + endo. Other includes sarcomatoid, endometrioid, papillary serous, adenocarcinoma, dysgerminoma |
| primarysite | ov, ft, other | ov, ovary; ft, fallopian tube |
| arrayedsite | ov, ft, other | ov, ovary; ft, fallopian tube |
| summarygrade | low, high | low, 1, 2, LMP (low malignant potential); high, 3, 2/3 |
| summarystage | early, late | early, FIGO I, II, I/II; late, FIGO III, IV, II/III, III/IV |
| tumourstage | 1, 2, 3, 4 | FIGO Stage (I–IV, translated to 1–4 for R usage) |
| substage | a, b, c, d | Substage (abcd) |
| grade | 1, 2, 3 | Grade ( |
| age_at_initial_pathologic_diagnosis | 1-99 | Age at initial pathologic diagnosis in years |
| pltx | y/n | Patient treated with Platin |
| tax | y/n | Patient treated with Taxol |
| neo | y/n | Neoadjuvant treatment |
| days_to_tumour_recurrence | decimal | Time to recurrence or last follow-up in days |
| recurrence_status | recurrence, no recurrence | Recurrence censoring variable |
| days_to_death | decimal | Time to death or last follow-up in days |
| vital_status | living, deceased | Overall survival censoring variable |
| os_binary | short, long | Dichotomized overall survival time; as defined by study |
| relapse_binary | short, long | Dichotomized relapse variable; as defined by the study |
| site_of_tumour_first_recurrence | metastasis, locoregional, etc. | Site of the first recurrence |
| primary_therapy_outcome_success | completeresponse, etc. | Response to any kind of therapy |
| debulking | optimal, suboptimal | Amount of residual disease (optimal ≤ 1 cm) |
| percent_normal_cells | 0–100+/− | Estimated percentage of normal cells; "20−" denotes ≤ 20% |
| percent_stromal_cells | 0–100+/− | Estimated percentage of stromal cells |
| percent_tumour_cells | 0–100+/− | Estimated percentage of tumour cells; "80+" denotes ≥ 80% |
| batch | character | Hybridization date or other available batch variable |
| uncurated_author_metadata | character | All original, uncurated metadata |
Additional study-specific details are provided in the package manual.
a: Most ovarian cancer pathologists follow the FIGO grading system, although some exceptions (15, 22, 25) are noted in the package manual.
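The controlled vocabulary in the annotation table above lends itself to mechanical checks. The Python sketch below shows one way such a check could look; it is not the curation pipeline described in the article. The helper names `summarize` and `invalid_fields` are hypothetical, only a subset of characteristics is covered, and the stage and grade mappings are read off the table's description column (combined values such as "II/III" or "2/3" are left out for brevity).

```python
from typing import Dict, List

# Mappings read from the "Description" column of the annotation table.
SUMMARY_STAGE = {1: "early", 2: "early", 3: "late", 4: "late"}
SUMMARY_GRADE = {1: "low", 2: "low", 3: "high"}

# Allowed values for a subset of the curated characteristics.
ALLOWED: Dict[str, set] = {
    "sample_type": {"tumour", "metastatic", "cellline", "healthy", "adjacentnormal"},
    "histological_type": {"ser", "endo", "clearcell", "mucinous", "other",
                          "mix", "undifferentiated"},
    "summarygrade": {"low", "high"},
    "summarystage": {"early", "late"},
    "vital_status": {"living", "deceased"},
}


def summarize(record: dict) -> dict:
    """Return a copy with summarystage/summarygrade derived when absent."""
    out = dict(record)
    stage = out.get("tumourstage")
    if out.get("summarystage") is None and stage is not None:
        out["summarystage"] = SUMMARY_STAGE[stage]
    grade = out.get("grade")
    if out.get("summarygrade") is None and grade is not None:
        out["summarygrade"] = SUMMARY_GRADE[grade]
    return out


def invalid_fields(record: dict) -> List[str]:
    """Names of checked fields whose value is outside the allowed set.

    Missing values (None or absent) are treated as unavailable, not invalid.
    """
    return sorted(
        key for key, allowed in ALLOWED.items()
        if record.get(key) is not None and record[key] not in allowed
    )


# Hypothetical patient record; "alive" is deliberately not an allowed value.
patient = summarize({
    "sample_type": "tumour",
    "histological_type": "ser",
    "tumourstage": 3,
    "grade": 2,
    "vital_status": "alive",
})
print(patient["summarystage"], patient["summarygrade"], invalid_fields(patient))
```

Treating missing values as acceptable mirrors the table's intent that characteristics are only recorded where a study provides them.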
Figure 2. Available clinical annotation. This heatmap visualizes for each curated clinical characteristic (rows) the availability in each data set (columns). Red indicates that the corresponding characteristic is available for at least one sample in the data set. See Table 2 for descriptions of these characteristics.
Figure 3. The database confirms CXCL12 as prognostic of overall survival in patients with ovarian cancer. Forest plot of the expression of the chemokine CXCL12 as a univariate predictor of overall survival, using all 14 data sets with applicable expression and survival information. HR indicates the factor by which overall risk of death increases with a one standard deviation increase in CXCL12 expression. A summary HR significantly larger than 1 indicates that patients with high CXCL12 levels had poor outcome and confirms in several lines of code the previously reported association between CXCL12 abundance and patient survival (9). Consideration of important clinicopathological features such as stage, grade, histology and residual disease (optimal surgical debulking) is also straightforward; examples are provided in the package vignette.
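To make the idea of a summary HR concrete, the Python sketch below pools per-study log hazard ratios by inverse-variance weighting under a fixed-effect model. This is an assumption for illustration: the caption does not state which pooling model was used, the function names are hypothetical, and the input numbers are invented, not values from the forest plot.

```python
import math
from typing import Sequence, Tuple


def pool_fixed_effect(log_hrs: Sequence[float],
                      std_errs: Sequence[float]) -> Tuple[float, float]:
    """Inverse-variance weighted mean of log hazard ratios and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errs]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, log_hrs)) / total
    return pooled, math.sqrt(1.0 / total)


def summary_hr(log_hrs: Sequence[float], std_errs: Sequence[float],
               z: float = 1.96) -> Tuple[float, float, float]:
    """Summary HR with an approximate 95% confidence interval (normal approximation)."""
    pooled, se = pool_fixed_effect(log_hrs, std_errs)
    return (math.exp(pooled),
            math.exp(pooled - z * se),
            math.exp(pooled + z * se))


# Invented per-study estimates: log HR per one standard deviation of expression.
study_log_hrs = [0.10, 0.20, 0.15]
study_std_errs = [0.10, 0.10, 0.10]

hr, lower, upper = summary_hr(study_log_hrs, study_std_errs)
print(f"summary HR = {hr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

A lower confidence bound above 1 corresponds to the caption's "summary HR significantly larger than 1"; pooling narrows the interval relative to any single study, which is why small individual studies can be inconclusive while the meta-analysis is not.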