
GENT: gene expression database of normal and tumor tissues.

Gwangsik Shin, Tae-Wook Kang, Sungjin Yang, Su-Jin Baek, Yong-Su Jeong, Seon-Young Kim.

Abstract

BACKGROUND: Some oncogenes such as ERBB2 and EGFR are over-expressed in only a subset of patients. Cancer outlier profile analysis is one of the computational approaches used to identify outliers in gene expression data. A database with a large sample size is a great advantage when searching for genes over-expressed in only a subset of patients.
DESCRIPTION: GENT (Gene Expression database of Normal and Tumor tissues) is a web-accessible database that provides gene expression patterns across diverse human cancer and normal tissues. More than 40000 samples, profiled on the Affymetrix U133A or U133plus2 platforms in many different laboratories across the world, were collected from public resources and combined into two large data sets, which helps identify cancer outliers that are over-expressed in only a subset of patients. Gene expression patterns in nearly 1000 human cancer cell lines are also provided. In each tissue, users can retrieve gene expression patterns classified by more detailed clinical information.
CONCLUSIONS: The large sample size (>24300 for U133plus2 and >16400 for U133A) of GENT provides an advantage in identifying cancer outliers. A cancer cell line gene expression database is useful for target validation by in vitro experiments. We hope GENT will be a useful resource for cancer researchers at many stages, from target discovery to target validation. GENT is available at http://medicalgenome.kribb.re.kr/GENT/ or http://genome.kobic.re.kr/GENT/.

Keywords:  Affymetrix; cancer; gene expression; human tissues

Year:  2011        PMID: 21695066      PMCID: PMC3118449          DOI: 10.4137/CIN.S7226

Source DB:  PubMed          Journal:  Cancer Inform        ISSN: 1176-9351


Background

Recent examples of successful cancer therapeutics such as Gleevec, Herceptin, and Iressa suggest that the concept of ‘molecular targeted therapy’ is applicable to human cancers of diverse tissue and genetic origin.1 ‘Oncogene addiction’ is a term that describes a phenomenon in which the growth and survival of tumors are impaired by the inactivation of a single oncogene.2 There are several established relationships between genetic alterations and corresponding targeted therapies, and efforts to identify further genetic alterations are underway. Mechanisms of genetic alteration include mutations (EGFR in lung cancer), translocations (BCR-ABL in chronic myeloid leukemia), and gene amplifications (ERBB2 in breast cancer).1 Interestingly, some oncogenes are altered in only a subset of cancer patients. For example, ERBB2 is amplified and over-expressed in about 25%–30% of breast cancer patients, whereas EGFR is mutated in about 20% of lung cancer patients. Cancer Outlier Profile Analysis (COPA) is a computational method that identifies genes that are pathogenically over-expressed in only a subset of patients.3,4 AGTR1 is an example of a potential target gene identified by applying the COPA method to the Oncomine database.5 A database with a large sample size is a great advantage when searching for genes over-expressed in only a subset of patients. For example, identifying genes over-expressed in 50 out of 1000 patients is easier and more reliable than identifying genes over-expressed in 2 out of 40 patients.
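The sample-size argument can be made concrete with a confidence interval for the outlier fraction. The sketch below is an illustration only: the Wilson score interval is one common choice and is not a method taken from the cited studies, and the function name `wilson_interval` is hypothetical. Both scenarios have an observed fraction of 5%, but the interval for 2 of 40 patients is considerably wider than the one for 50 of 1000.

```python
import math

def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n (z=1.96 is about 95%)."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need n > 0 and 0 <= k <= n")
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Same observed outlier fraction (5%), very different sample sizes.
small = wilson_interval(2, 40)
large = wilson_interval(50, 1000)
print("2/40:    %.3f to %.3f" % small)
print("50/1000: %.3f to %.3f" % large)
```

The narrower interval for the larger cohort is what makes a small outlier subset distinguishable from sampling noise.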
Although the sample size of most individual gene expression studies rarely exceeds one thousand, a data set of nearly ten thousand samples (ie, GeneSapiens database) can be created by a combined analysis of multiple data sets.6 Recent work has shown that analysis of a large microarray data set compiled from many data sets can reveal novel findings that are difficult to observe in the individual studies.7 For a combined analysis, data sets created by the Affymetrix platforms (ie, U133A and U133plus2) offer several advantages. First, most gene expression data sets have been created using the Affymetrix platforms. Second, many data sets are accompanied by raw CEL files so that users can preprocess them as they wish. We have collected human tissue gene expression data sets produced using the Affymetrix U133A and U133Plus2 platforms from public resources, and built a large-scale gene expression database of more than 40,000 samples.

Construction and Content

More than 24300 (U133plus2; 306 data sets) and 16400 (U133A; 241 data sets) samples were collected from public resources, including Gene Expression Omnibus,8 Array Express,9 and Expression Project for Oncology.10 Whenever CEL files were available (288/306 for U133plus2 and 192/241 for U133A), we pre-processed them with the MAS5 algorithm as implemented in the affy package.11 We chose the MAS5 algorithm because it is a single-array algorithm in which the expression values of one sample are independent of other data. We then normalized each sample to a target density of 500. For data sets without CEL files but pre-processed by the MAS5 algorithm (18/306 for U133plus2 and 49/241 for U133A), we used expression measures downloaded from the web source and normalized them to a target density of 500. We identified samples described in more than one data set and removed the duplicates. We then classified each sample according to tissue and disease type (Table 1) based on the information given in each data set. Most samples (∼75%) were classified as either cancer or normal, but about 20% of samples were classified into other diseases, including neurodegenerative diseases, immune-related diseases, and organ-specific diseases. In each tissue type, we also classified each sample into more detailed clinical subtypes, such as estrogen receptor positive breast cancer or high grade serous carcinoma of the ovary, as described in the original data source. We also collected expression data for more than 3000 samples comprising nearly 1000 different cancer cell lines across tissues, and processed them using the same method (Table 2). The Broad/Sanger Cell Line Project12 and GSK’s cell line project13 provided the most abundant expression data sets. For genes with multiple Affymetrix ids, we calculated the average of the multiple probes.
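The per-sample scaling and per-gene averaging described above can be sketched as follows. This is a minimal illustration and not the affy/MAS5 code used to build GENT: it assumes that a "target density of 500" means rescaling each array so that a trimmed mean of its signals equals 500 (the 2% trim on each tail is an assumption), and the function names, probe ids, and signal values are hypothetical.

```python
import numpy as np

def scale_to_target(signal, target=500.0, trim=0.02):
    """Rescale one array so that its trimmed mean equals `target`.

    `trim` is the fraction dropped from each tail before averaging
    (assumed value; the text only states the target of 500).
    """
    values = np.asarray(signal, dtype=float)
    ordered = np.sort(values)
    cut = int(np.floor(trim * ordered.size))
    core = ordered[cut:ordered.size - cut] if cut > 0 else ordered
    return values * (target / core.mean())

def average_probes(values, probe_to_gene):
    """Average the scaled values of all probe sets mapped to the same gene."""
    by_gene = {}
    for probe, value in values.items():
        by_gene.setdefault(probe_to_gene[probe], []).append(value)
    return {gene: float(np.mean(v)) for gene, v in by_gene.items()}

# Toy example with made-up probe ids and signals.
probes = ["p1", "p2", "p3", "p4"]
raw = [120.0, 80.0, 400.0, 200.0]
scaled = scale_to_target(raw)
gene_map = {"p1": "GENE_A", "p2": "GENE_A", "p3": "GENE_B", "p4": "GENE_C"}
per_gene = average_probes(dict(zip(probes, scaled)), gene_map)
print(per_gene)
```

Because the scaling factor depends only on the array itself, a sample can be added to the collection without reprocessing the others, which is the property the single-array choice is meant to preserve.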
The system implementation is based on an Apache web server, JavaScript and PHP scripts for data processing, Open flash charts and R scripts for image production, and MySQL as a backend database.
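Table 1. The number of tissue samples according to tissue types (U133plus2 and U133A).

| Tissue | U133plus2 Cancer | U133plus2 Normal | U133A Cancer | U133A Normal | Total |
|---|---|---|---|---|---|
| Abdomen | 13 | 0 | 0 | 0 | 13 |
| Adipose | 1 | 59 | 0 | 12 | 72 |
| Adrenal gland | 14 | 5 | 0 | 0 | 19 |
| Bladder | 39 | 14 | 87 | 15 | 155 |
| Blood | 4693 | 639 | 3130 | 1099 | 8974 |
| Brain | 785 | 568 | 592 | 1627 | 3572 |
| Breast | 1954 | 251 | 2635 | 91 | 4931 |
| Cervix | 74 | 12 | 64 | 34 | 184 |
| Colon | 1294 | 206 | 256 | 27 | 1783 |
| Endometrium | 72 | 61 | 0 | 9 | 142 |
| Esophagus | 48 | 9 | 24 | 28 | 109 |
| GIST | 64 | 0 | 0 | 0 | 64 |
| Head and neck | 202 | 14 | 21 | 2 | 239 |
| Heart | 0 | 0 | 0 | 41 | 41 |
| Kidney | 573 | 105 | 366 | 66 | 1110 |
| Liver | 182 | 25 | 156 | 52 | 415 |
| Lung | 441 | 225 | 582 | 364 | 1612 |
| Muscle | 0 | 177 | 0 | 331 | 508 |
| Myometrium | 0 | 0 | 0 | 24 | 24 |
| Ovary | 859 | 21 | 341 | 9 | 1230 |
| Pancreas | 132 | 55 | 13 | 8 | 208 |
| Prostate | 308 | 45 | 244 | 83 | 680 |
| Sarcoma | 493 | 0 | 0 | 0 | 493 |
| Skin | 290 | 28 | 499 | 59 | 876 |
| Small intestine | 13 | 6 | 0 | 22 | 41 |
| Stomach | 268 | 57 | 46 | 18 | 389 |
| Testis | 4 | 6 | 184 | 13 | 207 |
| Thyroid | 62 | 25 | 44 | 25 | 156 |
| Tongue | 0 | 11 | 0 | 4 | 15 |
| Uterus | 155 | 12 | 0 | 24 | 191 |
| Vagina | 3 | 5 | 0 | 0 | 8 |
| Vulva | 21 | 14 | 0 | 0 | 35 |
| Total | 13057 | 2655 | 9284 | 4087 | 29083 |
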
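Table 2. The number of cancer cell lines according to tissue types (U133plus2 and U133A).

| Tissue | U133plus2 | U133A | Total |
|---|---|---|---|
| Adrenal gland | 0 | 2 | 2 |
| Biliary tract | 0 | 6 | 6 |
| Bladder | 30 | 40 | 70 |
| Blood | 229 | 142 | 371 |
| Bone | 12 | 32 | 44 |
| Brain | 149 | 117 | 266 |
| Breast | 239 | 199 | 438 |
| Cervix | 23 | 23 | 46 |
| Colon | 143 | 56 | 199 |
| Connective tissue | 9 | 0 | 9 |
| Endometrium | 0 | 11 | 11 |
| Esophagus | 12 | 25 | 37 |
| EWT | 7 | 0 | 7 |
| Eye | 5 | 2 | 7 |
| Kidney | 29 | 40 | 69 |
| Leukemia | 0 | 4 | 4 |
| Liver | 33 | 16 | 49 |
| Lung | 358 | 325 | 683 |
| Lymphoma | 73 | 1 | 74 |
| Muscle | 10 | 0 | 10 |
| Myeloma | 7 | 24 | 31 |
| Ovary | 24 | 43 | 67 |
| Pancreas | 50 | 18 | 68 |
| Pharynx | 6 | 0 | 6 |
| Placenta | 9 | 2 | 11 |
| Prostate | 12 | 18 | 30 |
| Rectum | 7 | 0 | 7 |
| Sarcoma | 8 | 0 | 8 |
| Skin | 129 | 73 | 202 |
| Soft tissue | 0 | 19 | 19 |
| Stomach | 56 | 24 | 80 |
| Testis | 0 | 4 | 4 |
| Thyroid | 12 | 13 | 25 |
| Upper aerodigestive | 0 | 24 | 24 |
| Urinary tract | 0 | 20 | 20 |
| Vulva | 9 | 3 | 12 |
| Total | 1714 | 1336 | 3050 |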

Utility and Discussion

GENT can be searched using either a gene symbol or an Affymetrix id. Search results are presented in one of three ways: 1) cancer-normal samples across tissues, 2) cancer cell lines across tissues, and 3) detailed phenotypes in a tissue of choice. Raw data for the searched gene are available for download, so users can analyze them as they wish. For cancer-normal samples across tissues, users can search multiple gene symbols or Affymetrix ids at once. As an example, we show the pattern of ERBB2 expression in diverse cancer and normal tissues (Fig. 1) and in diverse cancer cell lines (Fig. 2). As expected, the over-expression of ERBB2 in a subset of breast cancer patients is obvious. In addition, over-expression of ERBB2 is observed in a subset of lung, ovarian, and stomach cancer patients. Indeed, Herceptin is being tested in the treatment of lung, ovarian, and stomach cancer; last year, an international multi-center study showed the effectiveness of Herceptin in patients with ERBB2-positive stomach cancer.14
Figure 1.

Pattern of ERBB2 expression across diverse normal and tumor tissues, A) U133plus2 data set, B) U133A data set.

Figure 2.

Pattern of ERBB2 expression across diverse cancer cell lines in U133plus2 data. A) Data is shown as multiple boxplots. B) Data is shown as a flash chart so users can identify a cell line name by pointing a mouse on each dot.

The pattern of ERBB2 expression in diverse cell lines provides interesting information as well. First, similar patterns are observed between cancer cell lines and tissues (Figs. 1 and 2). ERBB2 is over-expressed not only in a subset of breast, lung, ovary, and gastric tumor tissues, but also in a subset of cell lines from those four tissues (Fig. 2). Second, information on its expression in specific cell lines is provided as raw data for convenience during in vitro experiments. Here, HCC1954 or BT-474 (breast), SKOV3 (ovary), and NCI-N87 or MKN7 (stomach) are good cell lines for observing the effects of siRNA knock-down of ERBB2, a first step in showing whether it is a critical oncogene (Fig. 2). The information provided by the cell line database allows users to skip the laborious RT-PCR steps otherwise necessary for selecting cell lines for in vitro experiments.

As a second example, we show the pattern of MET expression in diverse normal and tumor tissues. MET is a tyrosine kinase receptor for hepatocyte growth factor, and its mutations and amplifications are associated with papillary renal carcinoma.1 Again, the over-expression of MET in a subset of renal cancers is clearly shown (Supplementary Fig. 1). In addition, MET is over-expressed in a subset of liver cancer, melanoma, and gastric cancer patients, suggesting that MET can be a target for a subset of these patients.

Finally, users can search detailed phenotypes in a specific tissue of interest. For example, if a user selects brain tissue for detailed information, patterns of expression in diverse brain diseases, including Alzheimer’s disease, Parkinson’s disease, and subtypes of brain tumors, are presented. We provide an example from an ovarian data set (GSE12172) in Figure 3. Here, ovarian cancer is further classified into subtypes based on tumor stage. We parsed the detailed clinical information given in each data set and provide it in a user-friendly manner.
Figure 3.

Pattern of LAMB3 expression among ovarian cancer subtypes. A) A screenshot of sub-type specific search option. B) Pattern of LAMB3 expression in different stages of ovarian cancer patients from GSE12172 data set.

One may be concerned that the GENT database may present false findings due to noise in the public data, because laboratory effects are known to be present in publicly available data sets.11 We assessed the impact of these effects following the analyses of Lukk et al.7 We selected biological groups (with ten replicates or more) that contain samples from at least two different laboratories. For the U133A data sets, we selected 5,089 samples of 92 biological groups produced in 93 laboratories. For each biological group, we computed the average correlation coefficient between assays from different laboratories within the same group. We also calculated the average correlation coefficient between assays from the same laboratory but belonging to different biological groups. The comparison of the two similarity distributions showed that the biological effects were stronger than the laboratory effects7 (Fig. 4). We obtained similar results with the U133plus2 data sets.
Figure 4.

Analysis of laboratory effects by comparing distribution of correlation coefficients among three different groups: Distribution of all pairwise correlations between the samples in the dataset (black), distribution of average similarities between the sample subgroups from different laboratories within the same biological group (green), distribution of average similarities between the sample subgroups from different biological groups within the same laboratory (red).
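The comparison behind Figure 4 can be sketched as below. It is a simplified illustration, assuming Pearson correlation between expression profiles and one biological-group label and one laboratory label per sample. The function name, the synthetic data, and the pooling of all qualifying pairs into two plain averages are assumptions; the sketch does not reproduce the per-group averaging of the published analysis.

```python
import numpy as np

def lab_vs_biology(expr, groups, labs):
    """Mean pairwise correlation for two kinds of sample pairs.

    expr   : samples x genes array
    groups : biological group label per sample
    labs   : laboratory label per sample
    Returns (same group / different lab, same lab / different group).
    """
    corr = np.corrcoef(expr)  # sample-by-sample Pearson correlation
    same_group_diff_lab, same_lab_diff_group = [], []
    n = len(groups)
    for i in range(n):
        for j in range(i + 1, n):
            if groups[i] == groups[j] and labs[i] != labs[j]:
                same_group_diff_lab.append(corr[i, j])
            elif groups[i] != groups[j] and labs[i] == labs[j]:
                same_lab_diff_group.append(corr[i, j])
    return float(np.mean(same_group_diff_lab)), float(np.mean(same_lab_diff_group))

# Synthetic data: by construction, a strong group signal and a weaker lab offset.
rng = np.random.default_rng(0)
n_genes = 200
group_profile = {"g1": rng.normal(0, 1, n_genes), "g2": rng.normal(0, 1, n_genes)}
lab_offset = {"labA": rng.normal(0, 0.3, n_genes), "labB": rng.normal(0, 0.3, n_genes)}
groups = ["g1", "g1", "g2", "g2"] * 2
labs = ["labA"] * 4 + ["labB"] * 4
expr = np.array([group_profile[g] + lab_offset[l] + rng.normal(0, 0.2, n_genes)
                 for g, l in zip(groups, labs)])

bio, lab = lab_vs_biology(expr, groups, labs)
print("same group, different lab: %.2f" % bio)
print("same lab, different group: %.2f" % lab)
```

When the first average clearly exceeds the second, samples resemble their biological group more than their laboratory of origin, which is the pattern reported for the GENT data.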

We also provide three means of identifying and minimizing false positive findings. First, raw data for the searched gene are provided together with their data sources (gene expression series id) so that users can check laboratory effects themselves. Second, a data set filtering option is provided so that users can include or exclude particular data sets. Finally, in our opinion, providing results separately for the two platforms (U133A and U133plus2) is one way to distinguish true signal from noise, as congruent results between the two platforms are a sign of good data quality (Figs. 1 and 2).

The COPA method was originally developed to identify genomic aberrations (ie, chromosomal translocations such as TMPRSS2-ETV1) by searching for pairs of samples with mutually exclusive outliers.4 Currently, the Oncomine database implements the COPA method and provides information on possible outliers. Basically, samples in a data set are grouped by sample properties, centered by the median value and rescaled by the median absolute deviation, and the COPA score is calculated at multiple percentiles (ie, 1%, 5%, and 10%). Although we adopted the COPA concept in the GENT database, we did not implement it as suggested in the original papers.3,4 Instead, we tried to increase the sample size by collating similar samples across different data sets, so that outliers could be detected more reliably.

We plan to add the COPA score for each gene in the future. We also plan to add further genomic data, such as copy number alterations and genome-wide DNA methylation data, to the GENT database so that additional information can be obtained by the integrated analysis of multiple genomic data types. We also plan to add gene expression data generated using other platforms (ie, Illumina or Agilent) as more data accumulate. Finally, we plan to add functions that provide more extensive clinical information, which is currently provided only in a limited way.
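The COPA-style transformation described above can be sketched as follows. This is an illustration, not the Oncomine implementation and not a feature of GENT: the function name is hypothetical, the upper-tail percentiles (90th, 95th, and 99th, ie, the top 10%, 5%, and 1%) are one reading of the percentiles mentioned in the text, and the expression values are invented.

```python
import numpy as np

def copa_scores(values, percentiles=(90, 95, 99)):
    """Median-centre, MAD-scale, then read off upper-tail percentiles."""
    x = np.asarray(values, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    if mad == 0:
        raise ValueError("MAD is zero; scores are undefined")
    z = (x - med) / mad
    return {p: float(np.percentile(z, p)) for p in percentiles}

# Invented data: 95 samples near baseline, 5 samples strongly over-expressed.
rng = np.random.default_rng(1)
outlier_gene = np.concatenate([rng.normal(100, 10, 95), rng.normal(1000, 50, 5)])
flat_gene = rng.normal(100, 10, 100)

outlier_scores = copa_scores(outlier_gene)
flat_scores = copa_scores(flat_gene)
print(outlier_scores)
print(flat_scores)
```

Because the median and the median absolute deviation are barely moved by a small over-expressing subset, that subset produces a large score at the high percentiles, whereas a gene with uniform expression does not.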

Conclusions

Oncomine5 and Gene Expression Atlas15 are two examples of excellent web databases for gene expression information with many useful functions. The two databases provide the extensive clinical information given in each collected data set and many useful search options. A major difference between the two is that Oncomine focuses on human cancer data sets, whereas Gene Expression Atlas covers more than 20 organisms, including human, mouse, and rat. As those two databases provide rich information, we focused on providing new features for users instead of implementing features already available in them. In our opinion, two aspects of GENT are unique. The first is the large sample size made possible by the collation of hundreds of data sets into two large data sets, and the second is the cancer cell line gene expression database, which is a convenient tool for selecting cell lines of interest. The large sample size (>24300 for U133plus2 and >16400 for U133A) of GENT provides an enormous advantage in identifying cancer outliers. A cancer cell line gene expression database is a useful resource for target validation by in vitro experiments. We hope GENT will be a useful resource for cancer researchers at many stages, from target discovery to target validation.

Supplementary Figure 1.

Pattern of MET expression across diverse normal and tumor tissues.
  12 in total

1.  Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer.

Authors:  Scott A Tomlins; Daniel R Rhodes; Sven Perner; Saravana M Dhanasekaran; Rohit Mehra; Xiao-Wei Sun; Sooryanarayana Varambally; Xuhong Cao; Joelle Tchinda; Rainer Kuefer; Charles Lee; James E Montie; Rajal B Shah; Kenneth J Pienta; Mark A Rubin; Arul M Chinnaiyan
Journal:  Science       Date:  2005-10-28       Impact factor: 47.728

2.  A gene expression bar code for microarray data.

Authors:  Michael J Zilliox; Rafael A Irizarry
Journal:  Nat Methods       Date:  2007-09-30       Impact factor: 28.547

Review 3.  Oncogene addiction.

Authors:  I Bernard Weinstein; Andrew Joe
Journal:  Cancer Res       Date:  2008-05-01       Impact factor: 12.701

Review 4.  Clinical Cancer Advances 2009: major research advances in cancer treatment, prevention, and screening--a report from the American Society of Clinical Oncology.

Authors:  Nicholas J Petrelli; Eric P Winer; Julie Brahmer; Sarita Dubey; Sonali Smith; Charles Thomas; Linda T Vahdat; Jennifer Obel; Nicholas Vogelzang; Maurie Markman; John W Sweetenham; David Pfister; Mark G Kris; Lynn M Schuchter; Raymond Sawaya; Derek Raghavan; Patricia A Ganz; Barnett Kramer
Journal:  J Clin Oncol       Date:  2009-11-09       Impact factor: 44.544

5.  COPA--cancer outlier profile analysis.

Authors:  James W MacDonald; Debashis Ghosh
Journal:  Bioinformatics       Date:  2006-08-07       Impact factor: 6.937

6.  A global map of human gene expression.

Authors:  Margus Lukk; Misha Kapushesky; Janne Nikkilä; Helen Parkinson; Angela Goncalves; Wolfgang Huber; Esko Ukkonen; Alvis Brazma
Journal:  Nat Biotechnol       Date:  2010-04       Impact factor: 54.908

Review 7.  Linking somatic genetic alterations in cancer to therapeutics.

Authors:  Darrin Stuart; William R Sellers
Journal:  Curr Opin Cell Biol       Date:  2009-03-26       Impact factor: 8.382

8.  AGTR1 overexpression defines a subset of breast cancer and confers sensitivity to losartan, an AGTR1 antagonist.

Authors:  Daniel R Rhodes; Bushra Ateeq; Qi Cao; Scott A Tomlins; Rohit Mehra; Bharathi Laxman; Shanker Kalyana-Sundaram; Robert J Lonigro; Beth E Helgeson; Mahaveer S Bhojani; Alnawaz Rehemtulla; Celina G Kleer; Daniel F Hayes; Peter C Lucas; Sooryanarayana Varambally; Arul M Chinnaiyan
Journal:  Proc Natl Acad Sci U S A       Date:  2009-06-01       Impact factor: 11.205

9.  Systematic bioinformatic analysis of expression levels of 17,330 human genes across 9,783 samples from 175 types of healthy and pathological tissues.

Authors:  Sami Kilpinen; Reija Autio; Kalle Ojala; Kristiina Iljin; Elmar Bucher; Henri Sara; Tommi Pisto; Matti Saarela; Rolf I Skotheim; Mari Björkman; John-Patrick Mpindi; Saija Haapa-Paananen; Paula Vainio; Henrik Edgren; Maija Wolf; Jaakko Astola; Matthias Nees; Sampsa Hautaniemi; Olli Kallioniemi
Journal:  Genome Biol       Date:  2008-09-19       Impact factor: 13.583

10.  ArrayExpress update--from an archive of functional genomics experiments to the atlas of gene expression.

Authors:  Helen Parkinson; Misha Kapushesky; Nikolay Kolesnikov; Gabriella Rustici; Mohammad Shojatalab; Niran Abeygunawardena; Hugo Berube; Miroslaw Dylag; Ibrahim Emam; Anna Farne; Ele Holloway; Margus Lukk; James Malone; Roby Mani; Ekaterina Pilicheva; Tim F Rayner; Faisal Rezwan; Anjan Sharma; Eleanor Williams; Xiangqun Zheng Bradley; Tomasz Adamusiak; Marco Brandizi; Tony Burdett; Richard Coulson; Maria Krestyaninova; Pavel Kurnosov; Eamonn Maguire; Sudeshna Guha Neogi; Philippe Rocca-Serra; Susanna-Assunta Sansone; Nataliya Sklyar; Mengyao Zhao; Ugis Sarkans; Alvis Brazma
Journal:  Nucleic Acids Res       Date:  2008-11-10       Impact factor: 16.971
