Quan Wan, Hayley Dingerdissen, Yu Fan, Naila Gulzar, Yang Pan, Tsung-Jung Wu, Cheng Yan, Haichen Zhang, Raja Mazumder.
Abstract
BioXpress is a gene expression and cancer association database in which the expression levels are mapped to genes using RNA-seq data obtained from The Cancer Genome Atlas, International Cancer Genome Consortium, Expression Atlas and publications. The BioXpress database includes expression data from 64 cancer types, 6361 patients and 17 469 genes with 9513 of the genes displaying differential expression between tumor and normal samples. In addition to data directly retrieved from RNA-seq data repositories, manual biocuration of publications supplements the available cancer association annotations in the database. All cancer types are mapped to Disease Ontology terms to facilitate a uniform pan-cancer analysis. The BioXpress database is easily searched using HUGO Gene Nomenclature Committee gene symbol, UniProtKB/RefSeq accession or, alternatively, can be queried by cancer type with specified significance filters. This interface along with availability of pre-computed downloadable files containing differentially expressed genes in multiple cancers enables straightforward retrieval and display of a broad set of cancer-related genes.
Year: 2015 PMID: 25819073 PMCID: PMC4377087 DOI: 10.1093/database/bav019
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Flow chart of the workflow used to create BioXpress. BioXpress processes short reads and read count data through distinct pipelines. Data are further divided into two groups: paired data that have both normal and tumor samples from the same patient, and non-paired, tumor-only data. Output in BioXpress is split into three different types: differential expression (stacked bar chart), tumor-only expression (box plot) and baseline expression data (heatmap). In addition to the data integration approaches shown in the figure, gene expression information is also extracted from publications.
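The paired versus tumor-only split described in the caption can be illustrated with a minimal Python sketch. This is not the BioXpress pipeline itself: the function name `split_paired`, the `(patient_id, tissue, sample_id)` tuple layout and the sample records are all hypothetical, and the tissue label is assumed to be either `"tumor"` or `"normal"`.

```python
from collections import defaultdict


def split_paired(samples):
    """Group samples by patient and separate paired from tumor-only patients.

    `samples` is an iterable of (patient_id, tissue, sample_id) tuples,
    where tissue is assumed to be "tumor" or "normal" (hypothetical layout).
    """
    by_patient = defaultdict(lambda: {"tumor": [], "normal": []})
    for patient_id, tissue, sample_id in samples:
        by_patient[patient_id][tissue].append(sample_id)

    paired, tumor_only = {}, {}
    for patient_id, groups in by_patient.items():
        if groups["tumor"] and groups["normal"]:
            # Both tissue types present: usable for differential expression.
            paired[patient_id] = dict(groups)
        elif groups["tumor"]:
            # Tumor samples only: usable for tumor-only expression.
            tumor_only[patient_id] = groups["tumor"]
    return paired, tumor_only


# Made-up records for illustration only.
samples = [
    ("P1", "tumor", "S1"), ("P1", "normal", "S2"),
    ("P2", "tumor", "S3"),
    ("P3", "tumor", "S4"), ("P3", "tumor", "S5"),
]
paired, tumor_only = split_paired(samples)
```

In this toy input, only patient `P1` ends up in the paired group, while `P2` and `P3` are tumor-only. Patient `P3` contributes two samples but counts as a single individual.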
Statistics of data collected in BioXpress
| Source | Data type | No. of samples/individuals | Tumor/normal |
|---|---|---|---|
| TCGA | Raw read count | 1320/660 | Tumor and normal |
| ICGC and TCGA | Raw read count | 6397/6324 | Tumor |
| Expression Atlas baseline | Normalized count | 1/1 | Normal |
| Literature | Published literature | Not applicable (135 publications) | Tumor and normal comparison |
a) Typically, each patient contains more than one sequencing sample. Therefore, we provide the number of both samples and individuals.
b) The number of patients is collected from TCGA, ICGC and Expression Atlas baseline projects. Some TCGA patient IDs overlap with the ICGC patient IDs.
Figure 2. Snapshot of BioXpress interface. The stacked bar chart displays the percent of individuals with over- or under-expression of the ASPM gene.
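The percentages behind such a stacked bar chart can be sketched as follows. This is a hedged illustration, not the BioXpress implementation: the function `expression_percentages`, the per-patient call labels (`"over"`, `"under"`, `"none"`) and the input data are assumptions made up for the example.

```python
def expression_percentages(calls):
    """Return the percent of individuals called over- or under-expressed.

    `calls` holds one label per individual: "over", "under" or "none"
    (a hypothetical encoding of the per-patient significance call).
    """
    total = len(calls)
    if total == 0:
        return {"over": 0.0, "under": 0.0}
    return {
        "over": 100.0 * calls.count("over") / total,
        "under": 100.0 * calls.count("under") / total,
    }


# Made-up calls for one gene in two cancer types, for illustration only.
demo_calls = {
    "BRCA": ["over", "over", "over", "none"],
    "LUAD": ["over", "under", "none", "none"],
}
percentages = {
    cancer: expression_percentages(calls)
    for cancer, calls in demo_calls.items()
}
```

Each cancer type then maps to one bar, with the `over` and `under` values forming the two stacked segments.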
Figure 3. Clustering and heatmap view of the top 50 differentially expressed genes as reported by BioXpress. Although these graphics were generated using external tools, the emphasis here is the ability of BioXpress to sort through large amounts of data and return candidate subsets for subsequent analysis. (A) Clustering of these genes in different cancer types based on the frequency of patients who have significant differential expression. Darker colors indicate a higher percentage of patients with such differential expression. (B) For genes which do not have normal samples, the heatmap shows clustering based on normalized count. Darker colors indicate a higher expression level. (C) Clustering based on baseline expression for the 50 genes in different tissues. Darker colors indicate higher expression level.
Genes significantly differentially expressed in tumor and normal samples in all cancer types in one or more patients
| Gene | UniProtKB AC | Protein name | Over-expressed cancer types | Under-expressed cancer types |
|---|---|---|---|---|
| CCL21 | O00585 | C-C motif chemokine 21 | KIRC, LIHC, BRCA, THCA, KICH | KICH, BRCA, THCA, PAAD, ESCA, KIRC, COAD, KIRP, STAD, CESC, LIHC, HNSC, READ, PRAD, BLCA, LUAD, LUSC, UCEC |
| GGT6 | Q6P531 | γ-glutamyltransferase 6 | BRCA, THCA, PAAD, BLCA, STAD, CESC, LIHC, KIRC, LUAD, UCEC | BLCA, BRCA, STAD, ESCA, KIRC, COAD, KIRP, HNSC, READ, PRAD, KICH, LUAD, LUSC |
| UBD | O15205 | Ubiquitin D | KICH, BRCA, THCA, ESCA, KIRC, COAD, STAD, CESC, LIHC, HNSC, READ, PRAD, BLCA, LUAD, LUSC, UCEC | BRCA, THCA, PAAD, KICH, KIRP, LIHC, HNSC, PRAD, BLCA |
| MMP7 | P09237 | Matrilysin | BRCA, STAD, THCA, ESCA, BLCA, COAD, PAAD, LIHC, HNSC, READ, PRAD, KIRC, LUAD, LUSC, UCEC | KICH, BRCA, BLCA, KIRP, CESC, LIHC, HNSC, PRAD, KIRC, LUAD |
| NCAM1 | P13591 | Neural cell adhesion molecule 1 | BRCA, THCA, KIRC, KIRP, HNSC, KICH, LUAD, LUSC | KICH, BRCA, STAD, KIRP, THCA, ESCA, KIRC, COAD, PAAD, CESC, LIHC, HNSC, READ, PRAD, BLCA, UCEC |
| CHRDL1 | Q9BU40 | Chordin-like protein 1 | PRAD, KICH, LIHC, THCA, KIRC | PAAD, BRCA, STAD, THCA, ESCA, BLCA, COAD, KIRP, KIRC, CESC, LIHC, HNSC, READ, PRAD, KICH, LUAD, LUSC, UCEC |
| WFDC2 | Q14508 | WAP four-disulfide core domain protein 2 | BRCA, STAD, PAAD, ESCA, KIRC, CESC, LIHC, HNSC, BLCA, LUAD, UCEC | KICH, BRCA, THCA, BLCA, COAD, KIRP, STAD, LIHC, HNSC, READ, PRAD, KIRC, LUAD, LUSC |
| LCN2 | P80188 | Neutrophil gelatinase-associated lipocalin | BLCA, BRCA, THCA, PAAD, ESCA, KIRC, COAD, KIRP, STAD, CESC, LIHC, READ, PRAD, KICH, LUAD, LUSC, UCEC | BRCA, THCA, KIRC, KIRP, LIHC, HNSC, PRAD, BLCA, LUAD, LUSC |
| KRT80 | Q6KB66 | Keratin, type II cytoskeletal 80 | BRCA, THCA, PAAD, ESCA, BLCA, COAD, KIRP, STAD, CESC, LIHC, READ, PRAD, LUAD, LUSC, UCEC | BLCA, BRCA, THCA, KIRC, LIHC, HNSC, PRAD, KICH |
LIHC = liver hepatocellular carcinoma; BLCA = bladder urothelial carcinoma; KICH = kidney chromophobe; UCEC = uterine corpus endometrial carcinoma; ESCA = esophageal carcinoma; CESC = cervical squamous cell carcinoma and endocervical adenocarcinoma.
Top five genes significantly differentially expressed in tumor and normal samples in >50% of the patients
| Gene | UniProtKB AC | Protein name | Over-expressed cancer types | Under-expressed cancer types |
|---|---|---|---|---|
| COL10A1 | Q03692 | Collagen alpha-1(X) chain | BRCA, STAD, BLCA, COAD, HNSC, LUAD | |
| COL11A1 | P12107 | Collagen alpha-1(XI) chain | BRCA, COAD, HNSC, LUAD, LUSC | |
| MMP11 | P24347 | Stromelysin-3 | BRCA, BLCA, COAD, HNSC, LUAD | |
| TMPRSS4 | Q9NRS4 | Transmembrane protease serine 4 | KIRC, LUAD, LUSC, THCA, UCEC | |
| MMP1 | P03956 | Interstitial collagenase | COAD, LUAD, LUSC, HNSC | |
| ADH1B | P00325 | Alcohol dehydrogenase 1B | | BLCA, THCA, KIRC, COAD, KIRP, HNSC, KICH, LUSC, UCEC |
| MT1H | P80294 | Metallothionein-1H | | KICH, KIRC, KIRP, LIHC, THCA |
| MT1G | P13640 | Metallothionein-1G | | KICH, KIRC, KIRP, LIHC, THCA |
| CHRDL1 | Q9BU40 | Chordin-like protein 1 | | BLCA, KICH, KIRC, THCA, UCEC |
| CA4 | P22748 | Carbonic anhydrase 4 | | BRCA, COAD, KIRP, LUAD, LUSC |
The genes were sorted by the number of cancer types in which they were differentially expressed.
LIHC = liver hepatocellular carcinoma; BLCA = bladder urothelial carcinoma; KICH = kidney chromophobe; CESC = cervical squamous cell carcinoma and endocervical adenocarcinoma.
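The sorting rule stated in the table note (rank genes by the number of cancer types in which they are differentially expressed) can be sketched in Python. The helper `parse_cancer_types` and the dictionary layout are hypothetical; the three cancer-type lists are copied from rows of the table above.

```python
def parse_cancer_types(cell):
    """Split a comma-separated table cell into a list of cancer type codes."""
    return [code.strip() for code in cell.split(",") if code.strip()]


# Three rows taken from the table above, deliberately listed out of order.
rows = {
    "MMP1": "COAD, LUAD, LUSC, HNSC",
    "COL11A1": "BRCA, COAD, HNSC, LUAD, LUSC",
    "COL10A1": "BRCA, STAD, BLCA, COAD, HNSC, LUAD",
}

# Count the cancer types per gene, then sort genes by that count, descending.
counts = {gene: len(parse_cancer_types(cell)) for gene, cell in rows.items()}
ranked = sorted(counts, key=counts.get, reverse=True)
# ranked == ['COL10A1', 'COL11A1', 'MMP1']
```

Because Python's sort is stable, genes tied on the number of cancer types keep their input order. A secondary key would be needed for a deterministic tie-break.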