
Curated compendium of human transcriptional biomarker data.

Nathan P Golightly1, Avery Bell1, Anna I Bischoff1, Parker D Hollingsworth1,2, Stephen R Piccolo1,3.   

Abstract

One important use of genome-wide transcriptional profiles is to identify relationships between transcription levels and patient outcomes. These translational insights can guide the development of biomarkers for clinical application. Data from thousands of translational-biomarker studies have been deposited in public repositories, enabling reuse. However, data-reuse efforts require considerable time and expertise because transcriptional data are generated using heterogeneous profiling technologies, preprocessed using diverse normalization procedures, and annotated in non-standard ways. To address this problem, we curated 45 publicly available, translational-biomarker datasets from a variety of human diseases. To increase the data's utility, we reprocessed the raw expression data using a uniform computational pipeline, addressed quality-control problems, mapped the clinical annotations to a controlled vocabulary, and prepared consistently structured, analysis-ready data files. These data, along with scripts we used to prepare the data, are available in a public repository. We believe these data will be particularly useful to researchers seeking to perform benchmarking studies, for example, to compare and optimize machine-learning algorithms' ability to predict biomedical outcomes.

Year:  2018        PMID: 29664470      PMCID: PMC5903354          DOI: 10.1038/sdata.2018.66

Source DB:  PubMed          Journal:  Sci Data        ISSN: 2052-4463            Impact factor:   6.444


Background & Summary

DNA encodes a cell’s instruction manual in the form of genes and regulatory sequences[1]. Cells behave differently, in part, because genes are transcribed into RNA in different quantities within those cells[2]. Researchers examine gene-expression levels to understand cellular dynamics and the mechanisms behind cellular aberrations, including those that lead to disease development. Modern technologies now make it possible to profile expression levels for thousands of genes at a time for a modest expense[3]. Using these high-throughput technologies, scientists have performed thousands of studies to characterize biological processes and to evaluate the potential for precision-medicine applications. One such application is to derive transcriptional biomarkers—patterns of expression that indicate disease states or that predict medical outcomes, such as relapse, survival, or treatment response[4-10]. Indeed, already to date, more than 100 transcriptional biomarkers have been proposed for predicting breast-cancer survival alone[11]. Many funding agencies and academic journals have imposed policies that require scientists to deposit transcriptional data in publicly accessible databases. These policies seek to ensure that other scientists can verify the original study's findings and can reuse the data in secondary analyses. For example, Gene Expression Omnibus (GEO) currently contains data for more than 2 million biological samples[12]. Upon considering infrastructure and personnel costs, we estimate that these data represent hundreds of millions—if not billions—of dollars (USD) of collective research investment. Reusing these vast resources offers an opportunity to reap a greater return on investment—perhaps most importantly via informing and validating new studies. 
Unfortunately, although anyone can access GEO data, researchers vastly underutilize this treasure trove because preparing data for new analyses requires considerable background knowledge and informatics expertise. In GEO, data are typically available in two forms: 1) raw data, as produced originally by the data-generating technology, and 2) processed data, which were used in the data generators' analyses. In most cases, researchers process raw data in a series of steps that might include quality-control filtering, noise reduction, standardization, and summarization (e.g., summarizing to gene-level values and excluding outliers). Data from different profiling technologies must be handled in ways that are specific to each technology. However, even for datasets generated using the same profiling technology, the methods employed for data preprocessing vary widely across studies. This heterogeneity makes it difficult for researchers to perform secondary analyses and to trust that analytical findings are driven primarily by biological mechanisms rather than differences in data preprocessing. In addition, when data have not been mapped to biologically meaningful identifiers, it may be difficult for researchers to draw biological conclusions from the data. Sample-level annotations accompany each GEO dataset. For biomarker studies, such metadata might include medical diagnoses or treatment outcomes, as well as covariates such as age, sex, or ethnicity. Although GEO publishes metadata in a semi-standardized format and bioinformatics tools exist for downloading and parsing GEO data[13,14], it is difficult for many researchers to extract these data into a form that is suitable for secondary analyses. Within annotation files, values are often stored in key/value pairs with nondescript column names. Many columns are not useful for analytical purposes (e.g., when all samples have the same value). 
When values are missing, the columns often become shifted; accordingly, data for a given variable may be spread across multiple columns. Moreover, a variety of descriptors (e.g., “?”, “N/A”, or “Unknown”) are used to indicate missing values, thus requiring the analyst to account for these differences. In addition, seemingly minor errors, such as spelling mistakes or inconsistent capitalization, can hamper secondary-analysis efforts. In response to these challenges, we compiled the Biomarker Benchmark, a curated compendium of 45 transcriptional-biomarker datasets from GEO. These datasets represent a variety of human-disease states and outcomes, many related to cancer. We obtained raw gene-expression files, renormalized them using a common algorithm, and summarized the data using gene-level annotations (Figure 1). We used two techniques to check for quality-control issues in the gene-expression data. For datasets where gene-expression data were processed in multiple batches—and where batch information was available—we corrected for batch effects. Finally, we prepared a version of the data that is suitable for direct application in machine-learning analyses. For this version of the data, we one-hot encoded any discrete values and imputed any missing values.
Figure 1

Flow diagram that illustrates the process we used to collect and curate the data.

We wrote computer scripts that downloaded the data, checked for quality, normalized and standardized data values, and stored the data in analysis-ready file formats. The specific steps differed for clinical and expression data (see Methods).

Methods

Selecting data

To select datasets to be included in our compendium, we performed a custom search in Gene Expression Omnibus (GEO). First, we limited our search to data series that were associated with the Medical Subject Heading (MeSH) term "biomarker" and that came from Homo sapiens subjects. Next we limited the search to data generated using Affymetrix gene-expression microarrays and for which raw expression data were available (so we could renormalize the data). For each dataset, we examined the metadata to ensure that each series had at least one biomarker-relevant clinical variable. These included variables such as prognosis, disease stage, histology, and treatment success or relapse. Lastly, we selected series that included data for at least 70 samples (before additional filtering, see below). Based on these criteria, we identified 36 GEO series. Two series (GSE6532 and GSE26682, Data Citation 1) contained data for two types of Affymetrix microarray. To avoid platform-related biases, we separated each of these series into two datasets; we used a suffix for each that indicates the microarray platform (e.g., GSE6532_U133A and GSE6532_U133Plus2). For both of these series, the biological samples profiled using either microarray platform were distinct. The GSE2109 series—known as the Expression Project for Oncology (expO)—had been produced by the International Genomics Consortium and contains data for 129 different cancer types[15]. To avoid confounding effects due to tissue-specific expression and because the metadata differed considerably across the cancer types, we split this dataset into multiple datasets based on cancer type (Table 1 (available online only)). We excluded tissue types for which fewer than 70 samples were available; we also excluded the "omentum" cancer type because it was relatively heterogeneous and had relatively few samples.
Table 1

Overview of data sources used in this study.

Series ID (a) | Tissue type(s) | # Samples | Clinical variable(s) | # Genes | Affymetrix platform(s)
GSE1456[28] | Breast cancer | 157 | Elston grade; overall survival status; overall survival time; relapse status; relapse time | 11832 | Genome U133A
GSE2109[15] | Breast | 263 | Age; alcohol consumption; days from diagnosis to excision; ER status; ethnic background; ethnic background; family history of cancer; fibrocystic disease; Her2 status; histology; hormonal therapy duration; mammogram status and findings; metastasis; metastatic sites; multiple tumors; node involvement; oophorectomy status; oral contraceptive use; PR status; prior therapy status; quality metric; relapse time; retreatment states; sex; stage; tobacco use; tumor grade; tumor size | 20024 | Genome U133 Plus 2.0
GSE2109[15] | Colon | 255 | Age; alcohol consumption; days from diagnosis to excision; diagnosis method; Dukes’ stage; ethnic background; family history of cancer; histology; metastasis; metastatic sites; multiple tumors; node involvement; primary site; prior screening status; prior therapy status; quality metric; relapse time; retreatment states; sex; stage; symptoms; tobacco use; tumor grade; tumor size | 20024 | Genome U133 Plus 2.0
GSE2109[15] | Endometrium | 51 | Age; alcohol consumption; days from diagnosis to excision; ethnic background; family history of cancer; histology; metastasis; metastatic sites; multiple tumors; node involvement; primary site; quality metric; stage; symptoms; tobacco use; tumor grade; tumor size | 20024 | Genome U133 Plus 2.0
GSE2109[15] | Kidney | 209 | Age; alcohol consumption; days from diagnosis to excision; ethnic background; family history of cancer; histology; metastasis; metastatic sites; multiple tumors; primary site; prior therapy status; quality metric; relapse time; retreatment states; sex; stage; tobacco use; tumor grade; tumor size | 20024 | Genome U133 Plus 2.0
GSE2109[15] | Lung | 103 | Age; alcohol consumption; days from diagnosis to excision; ethnic background; family history of cancer; histology; metastasis; metastatic sites; multiple tumors; node involvement; primary site; prior therapy status; quality metric; relapse time; retreatment states; sex; stage; symptoms; tobacco use; tumor grade; tumor size | 20024 | Genome U133 Plus 2.0
GSE2109[15] | Ovary | 158 | Age; alcohol consumption; days from diagnosis to excision; esophagitis reflux history; ethnic background; family history of cancer; fibrocystic disease; histology; mammogram history; metastasis; metastatic sites; multiple tumors; node involvement; node involvement; oophorectomy status; primary site; prior therapy status; quality metric; relapse time; retreatment states; screening history; stage; symptoms; tobacco use; tumor grade; tumor size | 20024 | Genome U133 Plus 2.0
GSE2109[15] | Prostate | 79 | Age; alcohol consumption; days from diagnosis to excision; diagnosis method; ethnic background; family history of cancer; Gleason score; histology; metastasis; multiple tumors; node involvement; prior therapy status; prostate-specific antigen (PSA) testing history; PSA finding; quality metric; stage; symptoms; tobacco use; tobacco use; tumor grade; tumor size | 20024 | Genome U133 Plus 2.0
GSE2109[15] | Uterine | 112 | Age; alcohol consumption; days from diagnosis to excision; ethnic background; family history of cancer; histology; human papilloma virus diagnosis history; metastasis; metastatic sites; multiple tumors; node involvement; primary site; prior therapy status; quality metric; relapse; stage; symptoms; tobacco use; tumor grade; tumor size | 20024 | Genome U133 Plus 2.0
GSE4271[29,30] | Glial | 100 | Age; recurrence status; sex; survival status; survival time; WHO grade | 11832 | Genome U133A
GSE5460[31] | Breast cancer | 127 | ER status; Her2 status; histological type; lymphovascular invasion; node status; tumor grade; tumor size | 20024 | Genome U133 Plus 2.0
GSE5462[32,33] | Breast cancer | 52 | Treatment history; treatment response | 11832 | Genome U133A
GSE6532[34] | Breast carcinoma | 317 | Age; distant metastasis-free survival time/status; ER status; genomic grade index; node involvement; PR status; recurrence-free survival time/status; tumor grade; tumor size | 11832 | Genome U133A
GSE6532[34] | Breast carcinoma | 87 | Age; distant metastasis-free survival time/status; ER status; genomic grade index; node involvement; PR status; recurrence-free survival time/status; tumor grade; tumor size | 20024 | Genome U133 Plus 2.0
GSE10320[35] | Wilms Tumor | 144 | Relapse | 11832 | Genome U133A
GSE15296[36] | Peripheral Blood | 75 | Kidney transplant rejection; subtype | 20024 | Genome U133 Plus 2.0
GSE19804[37] | Paired tumor and normal tissues | 60 | Age; tissue type; tumor stage | 20024 | Genome U133 Plus 2.0
GSE20181[33,38] | Breast cancer | 50 | Treatment history; treatment response | 11832 | Genome U133A
GSE20189[39] | Lung adenocarcinoma | 162 | Case/control status; morphology; smoking status; stage | 11832 | Genome U133A 2.0
GSE21510[40] | Laser capture microdissection and homogenized tissues (surgically resected material) | 104 | Metastasis; stage; tissue type | 20024 | Genome U133 Plus 2.0
GSE25507[41] | Peripheral blood lymphocyte | 146 | Case/control status (autism); paternal age, maternal age, subject age | 20024 | Genome U133 Plus 2.0
GSE26682[42–44] | Colorectal tumor | 140 | Age; microsatellite instability status; sex | 11832 | Genome U133A
GSE26682[42–44] | Colorectal tumor | 160 | Age; microsatellite instability status; sex | 20024 | Genome U133 Plus 2.0
GSE27279[45] | Posterior Fossa Ependymoma | 100 | Age; sex; tumor location | 16632 | Exon 1.0 ST
GSE27342[46,47] | Paired gastric tumor and normal tissue | 72 | Age; sex; stage; tissue type; tumor grade | 16632 | Exon 1.0 ST
GSE27854[48] | Colorectal tumor | 115 | Metastasis; stage | 20024 | Genome U133 Plus 2.0
GSE30219[49] | Lung | 293 | Age; follow-up time; histology; metastasis; node involvement; relapse status; sex; survival; survival time; tumor size | 20024 | Genome U133 Plus 2.0
GSE30784[50] | Oral squamous cell carcinoma | 229 | Age; case/control status; sex | 20024 | Genome U133 Plus 2.0
GSE32646[51] | Breast | 115 | Age; ER status (IHC); Her2 status (FISH); histological grade; lymph node involvement; pathologic complete response; PR status (IHC); stage; tumor size | 20024 | Genome U133 Plus 2.0
GSE37147[52] | Bronchial sample | 238 | Age; case/control status (COPD); FEV1/FVC score/percentage; inhaled medication status; sex; smoking status; tobacco use | 21614 | Gene 1.0 ST
GSE37199[53] | Blood sample | 94 | Disease stage (advanced castration resistant prostate cancer) | 20024 | Genome U133 Plus 2.0
GSE37745[54] | Non-small cell lung cancer | 196 | Adjuvant treatment status; age; histology; recurrence time/status; sex; stage; survival time/status; WHO performance status | 20024 | Genome U133 Plus 2.0
GSE37892[55] | Stage-II colon carcinoma | 130 | Age; diagnosis history; localisation; stage; time until metastasis | 20024 | Genome U133 Plus 2.0
GSE38958[56] | Peripheral blood mononuclear cell | 115 | Age; diagnosis (Idiopathic pulmonary fibrosis); ethnicity; predicted FVC percent; sex | 16632 | Exon 1.0 ST
GSE39491[57] | Esophageal and gastric samples | 120 | Tumor cell type | 11832 | Genome U133 Plus 2.0
GSE39582[58] | Colon cancer | 566 | Adjuvant chemotherapy; age; BRAF mutation status; chromosome instability status; CIMP status; KRAS mutation status; mismatch repair status; overall survival time/status; recurrence-free survival time/status; sex; stage; TP53 mutation status; tumor location | 20024 | Genome U133 Plus 2.0
GSE40292[59] | Afferent limb tissue and whole-blood sample | 195 | Diagnosis; sex | 21614 | Gene 1.0 ST
GSE43176[60] | Leukemic blast sample | 104 | Cytogenetics; disease state; FAB stage; KRAS mutation status; NRAS mutation status; subtype | 11832 | Genome U133A
GSE46449[61] | Peripheral blood leukocyte | 53 | Age; diagnosis (bipolar disorder) | 20024 | Genome U133 Plus 2.0
GSE46691[62] | Prostate | 545 | Gleason score; metastasis | 16632 | Exon 1.0 ST
GSE46995[63] | Leukocyte | 85 | Age; disease status (biliary atresia) | 21614 | Gene 1.0 ST
GSE48391[64] | Breast | 81 | ER status; gene-expression subtype; Her2 status; recurrence status; survival time/status | 20024 | Genome U133 Plus 2.0
GSE58697[65] | Desmoid tumor | 72 | Age; follow-up time; recurrence time; sex; tumor location; tumor size | 20024 | Genome U133 Plus 2.0
GSE63885[66] | Ovarian cancer surgical sample | 101 | Adjuvant chemotherapy; BRCA mutation status; clinical status at last follow-up, clinical status after 1st line chemotherapy; disease-free survival; FIGO stage; histopathological type; overall survival; residual tumor size; TP53 accumulation in cancer cells (IHC); TP53 mutation status; TP53 mutation status; tumor grade | 20024 | Genome U133 Plus 2.0
GSE67784[67] | Peripheral blood sample | 309 | Sex; V30M mutation status; whether exhibiting symptoms | 21614 | Gene 1.1 ST

(a) These identifiers represent data series in Gene Expression Omnibus. Some identifiers are listed multiple times; in these cases, we used a subset of the series data (for a specific tissue type or microarray platform).

We used publicly available data for this study and played no role in contacting the research subjects. We received approval to work with these data from Brigham Young University's Institutional Review Board (E 14522).

Preparing clinical annotations

For each dataset, we wrote custom R scripts[16] that download, parse, and reformat the clinical annotations. Initially, these scripts retrieve data from GEO using the GEOquery package[13]. Next they generate a tab-delimited text file for each dataset that contains all available clinical annotations, except those with identical values for all samples (for example, platform name, species name, submission date) or that were unique to each biological sample (for example, sample title). In addition, these scripts generate Markdown files that summarize each dataset and indicate sources. In some cases, multiple data values are included in the same cell in GEO annotation files. For example, in GSE5462 (Data Citation 1), one patient's clinical demographics and treatment responses are listed as "female; breast tumor; Letrozole, 2.5 mg/day,oral, 10–14 days; responder." We parsed these values and split them into separate columns for each sample. After these cleaning steps, the datasets contained an average of 7.8 variables of metadata (Table 1 (available online only)). Next we searched each dataset for missing values. Across the datasets, 11 distinct expressions had been used by the original data generators to represent missingness; these included "N/A", "NA", "MISSING", "NOT AVAILABLE", "?", and others. To support consistency, we standardized these values across the datasets, using a value of "NA". On average, 17.0% of the metadata values were missing per dataset; this proportion differed considerably across the datasets (Figure 2).
Figure 2

Histogram showing the proportion of missing clinical-annotation values per dataset.

Some datasets contained no missing values, while others were missing as many as 72.3% of data values.
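The annotation-cleaning steps described in this section (standardizing missing-value markers, splitting multi-value cells, and dropping uninformative columns) can be illustrated with a minimal Python sketch. This is not the authors' R code. The function names, the `Sample` identifier column, and the rule that numeric columns are never treated as "unique per sample" are assumptions made for illustration; only five of the 11 missing-value expressions are named in the text, so the token set below is partly assumed as well.

```python
# Tokens treated as missing. "N/A", "NA", "MISSING", "NOT AVAILABLE", and "?"
# are named in the text; "UNKNOWN" and the empty string are assumptions.
MISSING_TOKENS = {"N/A", "NA", "MISSING", "NOT AVAILABLE", "?", "UNKNOWN", ""}


def standardize_missing(value):
    """Map any recognized missing-value token to the single marker "NA"."""
    cleaned = value.strip()
    return "NA" if cleaned.upper() in MISSING_TOKENS else cleaned


def split_multi_value_cell(cell):
    """Split a cell that packs several annotations into one string.

    Only semicolons are used as separators, because commas can occur inside
    a single value (for example, a dosage description).
    """
    parts = [part.strip() for part in cell.rstrip(". ").split(";")]
    return [part for part in parts if part]


def _is_numeric(values):
    """Return True if every value parses as a number."""
    try:
        for value in values:
            float(value)
    except ValueError:
        return False
    return True


def drop_uninformative_columns(rows, id_column):
    """Drop columns that are constant, or unique to every sample."""
    if not rows:
        return []
    keep = []
    for column in rows[0]:
        if column == id_column:
            keep.append(column)
            continue
        values = [row[column] for row in rows]
        distinct = set(values)
        observed = [v for v in values if v != "NA"]
        if len(distinct) == 1:
            continue  # identical for all samples (e.g., species name)
        if len(distinct) == len(rows) and not _is_numeric(observed):
            continue  # free text unique to each sample (e.g., sample title)
        keep.append(column)
    return [{column: row[column] for column in keep} for row in rows]


def clean_annotations(rows, id_column="Sample"):
    """Standardize missing values, then remove uninformative columns."""
    standardized = [
        {key: standardize_missing(value) for key, value in row.items()}
        for row in rows
    ]
    return drop_uninformative_columns(standardized, id_column)
```

The numeric-column exception exists because a continuous variable such as age can legitimately differ for every sample; whether the original scripts handle this case the same way is not stated in the text.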

We anticipate that many researchers will use these data to develop and benchmark machine-learning algorithms (although they can be used in many other types of analysis). Accordingly, we prepared a secondary version of the clinical annotations that is ready to use in machine-learning analyses. First, we identified class variables that have potential relevance for biomarker applications. In many cases, these variables were identical to those used in the original studies; but we also included class variables that had not been used in the original studies. On average, the datasets contain 2.8 class variables. Second, we identified clinical variables that could be used as predictor variables (covariates). Using these data, we generated one "Analysis" file per class variable that contains the class values for each sample as well as covariates that we suggest are relevant to the class variable. (A given variable may be used as a class variable in one context and a predictor variable in a different context.) We named these analysis files using descriptive prefixes (e.g., "Prognosis", "Diagnosis", or "Stage"). In addition, we identified concepts in the National Cancer Institute Thesaurus[17] that map to each class and covariate variable. The name of each analysis file indicates the thesaurus term (preferred name) that corresponds to the class variable for that file. Within these files, the column names indicate the thesaurus terms that correspond to each covariate. We hope the use of this controlled vocabulary will make it easier for others to understand the semantic meaning of these variables and to identify commonalities across datasets. A tab-separated file that indicates mappings between the original annotation terms and the thesaurus terms can be found in our data repository (see https://osf.io/szwx6/). When a given sample was missing data for a given class variable, we excluded that sample from the respective analysis file for that class variable.
After this filtering step, we identified class variables with fewer than 40 samples and excluded these class variables. When covariates were missing more than 20% of their values (Figure 2), we excluded these variables from the analysis files. When covariates were missing less than 20% of their values, we imputed missing values using median-based imputation for continuous variables and mode-based imputation for discrete variables[18]. We transformed discrete predictor variables using one-hot encoding; each unique value, except the first, was treated as a binary variable. In cases where discrete values were rare, we merged values. For example, in GSE2109_Breast (Data Citation 1), we merged Pathological_Stage values 3A, 3B, 3C, and 4 into a category called "3-4" because relatively few patients fell into the individual categories (38, 8, 22, and 5 samples, respectively). In addition, some class variables were ordinal in nature (e.g., cancer stage or tumor grade); we transformed these into binary variables. Finally, some clinical outcomes were survival or relapse times; we transformed these data to (discrete) class variables, using dataset-specific thresholds to distinguish between "long-term" and "short-term" survivors and excluding patients who were censored before the survival threshold had been reached. Our computer scripts (see Code availability) encode these decisions for each dataset.
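The covariate handling described above (exclusion by missingness, median- or mode-based imputation, and one-hot encoding that omits the first level) can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation. The function names are hypothetical, "first" is assumed to mean the first level in sorted order, and a covariate with exactly 20% missing values is assumed to be kept, since the text does not say how that boundary case is treated.

```python
from collections import Counter
from statistics import median


def missing_fraction(values):
    """Fraction of entries marked as missing ("NA")."""
    return sum(value == "NA" for value in values) / len(values)


def impute(values, continuous):
    """Fill "NA" entries with the median (continuous) or the mode (discrete)."""
    observed = [value for value in values if value != "NA"]
    if continuous:
        fill = median(float(value) for value in observed)
        return [fill if value == "NA" else float(value) for value in values]
    # On ties, the level that appears first in the data is used.
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if value == "NA" else value for value in values]


def one_hot(values, name):
    """One binary column per level, skipping the first level (assumed: sorted order)."""
    levels = sorted(set(values))
    return {
        f"{name}_{level}": [1 if value == level else 0 for value in values]
        for level in levels[1:]
    }


def prepare_covariate(name, values, continuous, max_missing=0.20):
    """Return analysis-ready columns, or None if too many values are missing."""
    if missing_fraction(values) > max_missing:
        return None  # covariate is excluded from the analysis file
    imputed = impute(values, continuous)
    if continuous:
        return {name: imputed}
    return one_hot(imputed, name)
```

For example, `prepare_covariate("sex", ["M", "F", "NA", "F", "F", "M"], continuous=False)` fills the missing entry with the most common level and returns a single binary column, `sex_M`, because `F` sorts first and is used as the reference level.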

Preprocessing gene-expression data

We created a computational pipeline (using R and shell scripts) that downloads, normalizes, and standardizes the raw-expression data. We used the GEOquery package[13] to download the CEL files and then normalized them using the SCAN.UPC package[19]. Some heterogeneity exists, even among platforms from the same manufacturer (Affymetrix). The number of probes and the probe sequences used in designing the microarray architectures vary. To help mitigate this heterogeneity and to aid in biological interpretation, we summarized the data using Ensembl-based gene-level annotations from Brainarray[20,21]. The SCAN algorithm log2-transforms the data and scales the data to center around zero. Relatively high values indicate relatively high gene-expression levels, and vice versa.

Code availability

Our computer scripts are stored in the open-access Biomarker Benchmark repository (Data Citation 1). Using these scripts, other researchers can reproduce our curation process and/or produce alternative versions of the data.

Data Records

After we filtered the original data (see Methods), our compendium includes data for 7,037 biological samples across 45 datasets (Table 1 (available online only)). On average, the datasets contain values for 18,043 genes (Table 1 (available online only)). In total, our repository contains 128 class variables (2.8 per dataset) and 2.1 unique values per class variable. All output data are stored in tab-delimited text files and are structured using the "tidy data" methodology[22]. Accordingly, data users can import the files directly into analytical tools such as Microsoft Excel, R, or Python. All data files are publicly and freely available in the open-access Biomarker Benchmark repository (Data Citation 1). The original data files are available via Gene Expression Omnibus using the accession numbers listed in Table 1 (available online only).
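Because the output files are tab-delimited text, they can be read with standard tools. The sketch below uses only the Python standard library. The column names (`Sample`, `GENE_A`, `GENE_B`), the sample identifiers, and the in-memory example content are placeholders; the actual file names and column layout should be checked against the repository.

```python
import csv
import io


def read_expression_table(handle, id_column="Sample"):
    """Read a tab-delimited table with one row per sample.

    Every column other than the identifier is converted to float;
    "NA" entries become None.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    table = []
    for row in reader:
        record = {id_column: row[id_column]}
        for column, value in row.items():
            if column != id_column:
                record[column] = None if value == "NA" else float(value)
        table.append(record)
    return table


# Placeholder content standing in for a downloaded file.
EXAMPLE_TSV = (
    "Sample\tGENE_A\tGENE_B\n"
    "sample_1\t0.42\t-0.17\n"
    "sample_2\tNA\t0.31\n"
)
table = read_expression_table(io.StringIO(EXAMPLE_TSV))
```

With a file on disk, the same function can be called as `read_expression_table(open(path, newline=""))`, where `path` points to a downloaded data file.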

Technical Validation

We evaluated each sample using the IQRray[23] software, which produces a quality score for individual samples. Using these metrics, we applied Grubbs' test (outliers package[24]) to each dataset, identified poor-quality outliers (Figure 3), and excluded these samples (Table 2 (available online only)). Next we used the DoppelgangR package[25] to identify samples that may have been duplicated inadvertently. We manually reviewed sample pairs that DoppelgangR flagged as potential duplicates. We excluded most sample pairs that were flagged (Table 2 (available online only)), even if the clinical annotations for both samples were distinct, under the assumption that these samples had somehow been mislabeled. In GSE46449 (Data Citation 1), many samples were biological replicates; we retained one of each replicate set. GSE5462, GSE19804, and GSE20181 (Data Citation 1) contained samples that had been profiled in a paired manner (e.g., pre- and post-treatment); we retained these pairs of samples.
Figure 3

Distribution of IQRray quality scores for each dataset.

Sample qualities are plotted for each dataset. Low-quality samples were identified using Grubbs' test. Samples that fall on or below the red threshold were excluded from the data repository.
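To make the outlier criterion concrete, the following Python sketch computes the one-sided Grubbs statistic for the lowest quality score in a dataset and compares it with the critical value. It is a sketch under stated assumptions, not the procedure implemented in the R outliers package: the function names are hypothetical, the test is assumed to be one-sided (low scores only), and the Student's t quantile `t_crit` must be supplied by the caller from a table or a statistics library.

```python
import math
from statistics import mean, stdev


def grubbs_statistic_low(scores):
    """One-sided Grubbs statistic for the minimum: (mean - min) / sample SD."""
    return (mean(scores) - min(scores)) / stdev(scores)


def grubbs_critical_value(n, t_crit):
    """Critical value for the one-sided Grubbs test with n observations.

    t_crit is the upper alpha/n critical value of Student's t distribution
    with n - 2 degrees of freedom (for example, 4.541 for n = 5, alpha = 0.05).
    """
    return (n - 1) / math.sqrt(n) * math.sqrt(t_crit**2 / (n - 2 + t_crit**2))


def lowest_is_outlier(scores, t_crit):
    """Return True if the lowest score is flagged as an outlier."""
    statistic = grubbs_statistic_low(scores)
    return statistic > grubbs_critical_value(len(scores), t_crit)
```

Grubbs' test evaluates one extreme value at a time and assumes the remaining scores are approximately normally distributed, so flagging several samples in one dataset would require repeating the test after removing each flagged sample.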

Table 2

Summary of excluded samples.

We excluded samples that did not pass our quality-control criteria or that appeared to be duplicated. The Gene Expression Omnibus series and sample identifiers are listed, along with the reason we excluded each sample.

Series | Sample | Reason
GSE15296 | GSM382283 | Poor Quality
GSE19804 | GSM494596 | Poor Quality
GSE19804 | GSM494654 | Poor Quality
GSE19804 | GSM494657 | Poor Quality
GSE20181 | GSM506289 | Likely Duplicate
GSE20181 | GSM506294 | Likely Duplicate
GSE20181 | GSM506304 | Likely Duplicate
GSE20181 | GSM125198 | Likely Duplicate
GSE20181 | GSM125210 | Likely Duplicate
GSE20181 | GSM125230 | Likely Duplicate
GSE2109_Breast | GSM53059 | Likely Duplicate
GSE2109_Breast | GSM53027 | Likely Duplicate
GSE2109_Colon | GSM89040 | Likely Duplicate
GSE2109_Colon | GSM152664 | Likely Duplicate
GSE2109_Colon | GSM152632 | Likely Duplicate
GSE2109_Colon | GSM179922 | Likely Duplicate
GSE2109_Colon | GSM89044 | Likely Duplicate
GSE2109_Colon | GSM152666 | Likely Duplicate
GSE2109_Colon | GSM179820 | Likely Duplicate
GSE2109_Colon | GSM179924 | Likely Duplicate
GSE2109_Lung | GSM203652 | Poor Quality
GSE2109_Ovary | GSM76554 | Likely Duplicate
GSE2109_Ovary | GSM203725 | Likely Duplicate
GSE2109_Ovary | GSM76567 | Likely Duplicate
GSE2109_Ovary | GSM231913 | Likely Duplicate
GSE2109_Ovary | GSM46839 | Poor Quality
GSE2109_Prostate | GSM179790 | Likely Duplicate
GSE2109_Prostate | GSM179843 | Likely Duplicate
GSE2109_Prostate | GSM179903 | Likely Duplicate
GSE25507 | GSM627091 | Likely Duplicate
GSE25507 | GSM627087 | Likely Duplicate
GSE25507 | GSM627096 | Likely Duplicate
GSE25507 | GSM627078 | Likely Duplicate
GSE25507 | GSM627085 | Likely Duplicate
GSE25507 | GSM627196 | Likely Duplicate
GSE25507 | GSM627153 | Likely Duplicate
GSE25507 | GSM627180 | Likely Duplicate
GSE25507 | GSM627099 | Likely Duplicate
GSE25507 | GSM627115 | Likely Duplicate
GSE25507 | GSM627118 | Likely Duplicate
GSE25507 | GSM627124 | Likely Duplicate
GSE25507 | GSM627154 | Likely Duplicate
GSE25507 | GSM627204 | Likely Duplicate
GSE25507 | GSM627209 | Likely Duplicate
GSE25507 | GSM627215 | Likely Duplicate
GSE26682 | GSM656833 | Likely Duplicate
GSE26682 | GSM656770 | Likely Duplicate
GSE26682_U133PLUS2 | GSM656860 | Poor Quality
GSE26682_U133PLUS2 | GSM656613 | Poor Quality
GSE26682_U133PLUS2 | GSM656839 | Poor Quality
GSE26682_U133PLUS2 | GSM656721 | Poor Quality
GSE27342 | GSM675945 | Likely Duplicate
GSE27342 | GSM675947 | Likely Duplicate
GSE27342 | GSM675933 | Likely Duplicate
GSE27342 | GSM675935 | Likely Duplicate
GSE27342 | GSM676040 | Poor Quality
GSE27342 | GSM687519 | Poor Quality
GSE27854 | GSM687525 | Poor Quality
GSE30219 | GSM748210 | Likely Duplicate
GSE30219 | GSM748212 | Likely Duplicate
GSE30219 | GSM748218 | Likely Duplicate
GSE30219 | GSM748219 | Likely Duplicate
GSE30219 | GSM748255 | Poor Quality
GSE30219 | GSM748247 | Poor Quality
GSE30219 | GSM748057 | Poor Quality
GSE30219 | GSM748266 | Poor Quality
GSE30784 | GSM764928 | Likely Duplicate
GSE30784 | GSM764930 | Likely Duplicate
GSE30784 | GSM764904 | Poor Quality
GSE30784 | GSM764970 | Poor Quality
GSE32646 | GSM809214 | Likely Duplicate
GSE32646 | GSM809248 | Likely Duplicate
GSE32646 | GSM809251 | Likely Duplicate
GSE32646 | GSM809254 | Likely Duplicate
GSE37147 | GSM912230 | Likely Duplicate
GSE37147 | GSM912296 | Likely Duplicate
GSE37147 | GSM912296 | Likely Duplicate
GSE37147 | GSM912305 | Likely Duplicate
GSE37147 | GSM912291 | Likely Duplicate
GSE37147 | GSM912296 | Likely Duplicate
GSE37147 | GSM912305 | Likely Duplicate
GSE37147 | GSM912342 | Likely Duplicate
GSE37147 | GSM912273 | Likely Duplicate
GSE37147 | GSM912305 | Likely Duplicate
GSE37147 | GSM912342 | Likely Duplicate
GSE37147 | GSM912342 | Likely Duplicate
GSE37147 | GSM912348 | Likely Duplicate
GSE37147 | GSM912376 | Likely Duplicate
GSE37147 | GSM912376 | Likely Duplicate
GSE37147 | GSM912376 | Likely Duplicate
GSE37147 | GSM912463 | Poor Quality
GSE37147 | GSM912197 | Poor Quality
GSE37147 | GSM912300 | Poor Quality
GSE37199 | GSM913439 | Poor Quality
GSE37745 | GSM1019319 | Likely Duplicate
GSE37745 | GSM1019246 | Likely Duplicate
GSE37745 | GSM1019325 | Likely Duplicate
GSE37745 | GSM1019247 | Likely Duplicate
GSE37745 | GSM1019194 | Poor Quality
GSE37745 | GSM1019195 | Poor Quality
GSE37745 | GSM1019176 | Poor Quality
GSE37745 | GSM1019192 | Poor Quality
GSE37745 | GSM1019232 | Poor Quality
GSE37892 | GSM929512 | Poor Quality
GSE39491 | GSM970152 | Poor Quality
GSE39582 | GSM972249 | Likely Duplicate
GSE39582 | GSM972472 | Likely Duplicate
GSE39582 | GSM972243 | Likely Duplicate
GSE39582 | GSM972044 | Likely Duplicate
GSE39582 | GSM972091 | Likely Duplicate
GSE39582 | GSM972090 | Likely Duplicate
GSE39582 | GSM972245 | Likely Duplicate
GSE39582 | GSM972473 | Likely Duplicate
GSE39582 | GSM972515 | Likely Duplicate
GSE39582 | GSM972248 | Likely Duplicate
GSE43176 | GSM1057835 | Poor Quality
GSE46449 | GSM1130404 | Likely Duplicate
GSE46449 | GSM1130406 | Likely Duplicate
GSE46449 | GSM1130413 | Likely Duplicate
GSE46449 | GSM1130417 | Likely Duplicate
GSE46449 | GSM1130426 | Likely Duplicate
GSE46449 | GSM1130428 | Likely Duplicate
GSE46449 | GSM1130430 | Likely Duplicate
GSE46449 | GSM1130434 | Likely Duplicate
GSE46449 | GSM1130436 | Likely Duplicate
GSE46449 | GSM1130468 | Likely Duplicate
GSE46449 | GSM1130471 | Likely Duplicate
GSE46449 | GSM1130483 | Likely Duplicate
GSE48390 | GSM1176924 | Poor Quality
GSE48390 | GSM125120 | Poor Quality
GSE5462 | GSM125123 | Likely Duplicate
GSE5462 | GSM125125 | Likely Duplicate
GSE58697 | GSM1417097 | Poor Quality
GSE63885 | GSM1559328 | Likely Duplicate
GSE63885 | GSM1559360 | Likely Duplicate
GSE63885 | GSM1559385 | Likely Duplicate
GSE63885 | GSM1559370 | Likely Duplicate
GSE63885 | GSM1559375 | Likely Duplicate
GSE63885 | GSM1559386 | Likely Duplicate
GSE63885 | GSM1559361 | Poor Quality
GSE6532_U133PLUS2 | GSM151294 | Poor Quality
GSE6532_U133PLUS2 | GSM151280 | Poor Quality

When transcriptomic data are processed in multiple batches, batch assignments can lead to confounding effects[26]. In the clinical annotations, we identified batch-processing information for datasets GSE25507, GSE37199, GSE39582, and GSE40292 (Data Citation 1). We corrected for batch effects using the ComBat software[27]. The Biomarker Benchmark repository contains pre- and post-batch-corrected data. For dataset GSE37199, we identified two variables that could have been used for batch correction ("Centre" and "Plate"). Our repository contains batch-corrected data for both of these batch variables (the default is "Plate").

Machine-learning analysis

We created a document that illustrates how to programmatically download the data files and perform a simple classification analysis using our data (see https://osf.io/4n62k/). This document is coded for the R statistical package, but similar analyses could be performed using other programming languages.

Additional information

How to cite this article: Golightly N. P. et al. Curated compendium of human transcriptional biomarker data. Sci. Data 5:180066 doi: 10.1038/sdata.2018.66 (2018). Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.