Aurora Torrente1,2, Margus Lukk3, Vincent Xue4, Helen Parkinson2, Johan Rung5, Alvis Brazma2.
Abstract
Rapid accumulation and availability of gene expression datasets in public repositories have enabled large-scale meta-analyses of combined data. The richness of cross-experiment data has provided new biological insights, including identification of new cancer genes. In this study, we compiled a human gene expression dataset from ∼40,000 publicly available Affymetrix HG-U133Plus2 arrays. After strict quality control and data normalisation the data was quantified in an expression matrix of ∼20,000 genes and ∼28,000 samples. To enable different ways of sample grouping, existing annotations were subjected to systematic ontology-assisted categorisation and manual curation. Groups like normal tissues, neoplasmic tissues, cell lines, homoeotic cells and incompletely differentiated cells were created. Unsupervised analysis of the data confirmed a global structure of expression consistent with earlier analysis, but with more details revealed due to increased resolution. A suitable mixed-effects linear model was used to further investigate gene expression in solid tissue tumours, and to compare these with the respective healthy solid tissues. The analysis identified 1,285 genes with systematic expression change in cancer. The list is significantly enriched with known cancer genes from large, public, peer-reviewed databases, whereas the remaining ones are proposed as new cancer gene candidates. The compiled dataset is publicly available in the ArrayExpress Archive. It contains the most diverse collection of biological samples, making it the largest systematically annotated gene expression dataset of its kind in the public domain.
Year: 2016 PMID: 27322383 PMCID: PMC4913919 DOI: 10.1371/journal.pone.0157484
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Principal components.
On top, the first two principal components. The left panel illustrates the clear separation along the X-axis of hematopoietic and non-hematopoietic material, with samples from both tissues and cell lines. The non-solid region includes blood, bone marrow, lymph nodes, tonsil, osteoclasts, spleen, sputum, thymus gland, bronchoalveolar lavage cells and derived cell lines. An elongated cluster including incompletely differentiated cells is also found. In the right panel, the Y-axis separates cell lines (top), neoplasias (middle) and non-neoplastic diseases (bottom), whereas normal tissues overlap with all three. The bottom panel shows the data for the first (X-axis) and third (Y-axis) components. The hematopoietic axis (X-axis) separates leukaemias from other blood neoplasias. Cell lines derived from brain tumours can be distinguished from their tissues of origin along this axis as well. The Y-axis separates non-neoplastic central nervous system samples from tumoral ones, and myelomas from lymphomas.
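As an illustration of how samples can separate along a leading principal component, the following minimal sketch computes the first component of a tiny toy matrix by power iteration. It is not the procedure used in the study: the function name `first_principal_component`, the toy values and the choice of power iteration are all assumptions made for this example, and only the Python standard library is used.

```python
from math import sqrt


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component.

    `rows` is a list of samples, each a list of feature values.
    Power iteration on the sample covariance matrix is used here
    purely for illustration (an assumption, not the study's method).
    """
    n, d = len(rows), len(rows[0])
    # Centre every feature on its mean across samples.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Start from the first unit vector; this fails only if that vector
    # is exactly orthogonal to the leading eigenvector.
    v = [1.0] + [0.0] * (d - 1)
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centred sample onto the component direction.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centred]
    return v, scores


# Hypothetical toy data: three "blood-like" and three "solid-like" samples.
toy_samples = [
    [9, 1, 5], [10, 2, 5], [9, 2, 6],   # blood-like
    [1, 9, 5], [2, 10, 6], [1, 10, 5],  # solid-like
]
direction, scores = first_principal_component(toy_samples)
```

In this toy case the two groups of samples receive scores of opposite sign on the first component, which is the kind of separation the X-axis of the figure describes for hematopoietic and non-hematopoietic material.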
Fig 2. Clustering of biological groups with at least 20 observations.
a) Clustering of biological groups using average correlations. Only the 20,000 most variable probesets are taken into account. The groups are recoded by colours to display the largest clusters in the dendrogram. The colour labels on the right-hand side specify groups of healthy, cancerous or diseased tissues or cell lines. Illustrative groups of tissues are also highlighted. b) Clustering of biological groups of solid tissues against the 1,000 most variable probesets. Many clusters, identified by visual inspection, include genes overexpressed in one or more tissues of origin. From top to bottom, clusters C.8, C.14, C.18 and C.29 display probesets with high activity in adipose tissue, liver, brain, and skeletal muscle and heart, respectively.
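The two ingredients named in this caption, selecting the most variable probesets and comparing groups by average correlation, can be sketched as follows. This is a hedged illustration only: the caption does not define "average correlation", so the sketch assumes it means the mean Pearson correlation over all pairs of samples drawn from the two groups. The function names, probeset labels and toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance


def top_variable(expr, k):
    """Return the k probeset IDs with the highest variance across samples.

    `expr` maps a probeset ID to its list of expression values,
    one value per sample.
    """
    return sorted(expr, key=lambda p: pvariance(expr[p]), reverse=True)[:k]


def pearson(x, y):
    """Pearson correlation of two equal-length profiles.

    Raises ZeroDivisionError if either profile is constant.
    """
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) *
                      sum((b - my) ** 2 for b in y))


def average_correlation(expr, probes, group_a, group_b):
    """Mean correlation over all sample pairs, one sample from each group.

    `group_a` and `group_b` are lists of sample indices (assumed definition).
    """
    def profile(sample):
        return [expr[p][sample] for p in probes]

    return mean(pearson(profile(i), profile(j))
                for i in group_a for j in group_b)


# Hypothetical toy matrix: 4 probesets measured in 4 samples.
toy_expr = {
    "p1": [10, 11, 2, 1],
    "p2": [1, 2, 9, 10],
    "p3": [5, 5, 5, 5],   # constant, so never selected as "most variable"
    "p4": [3, 8, 4, 7],
}
probes = top_variable(toy_expr, 3)
between = average_correlation(toy_expr, probes, [0, 1], [2, 3])
within = average_correlation(toy_expr, probes, [0], [1])
```

A matrix of such between-group values could then be passed to any hierarchical clustering routine to obtain a dendrogram like the one described; that step is omitted here.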
Group sizes in paired tissues.
| Tissue type | Cancer | Normal |
|---|---|---|
| airways and lung | 309 | 447 |
| bone | 37 | 23 |
| brain | 330 | 474 |
| breast | 926 | 222 |
| colon | 575 | 168 |
| fat | 72 | 448 |
| gastro-intestinal | 293 | 79 |
| head and neck | 96 | 84 |
| kidney | 44 | 269 |
| mesenchymal | 129 | 54 |
| pancreas | 73 | 48 |
| prostate | 88 | 21 |
| skin | 61 | 148 |
| smooth muscle | 73 | 123 |
| uterus | 79 | 145 |
Sizes of the biological groups with at least 20 replicates for which both normal and cancer samples are available.
Fig 3. Extreme expression level profiles.
Expression levels of the most variable probeset (202286_s_at), on top, and a low-variability probeset corresponding to housekeeping gene GAPDH (217398_x_at), at the bottom, across all tissues for which there are at least 20 replicates of untreated, normal and cancerous samples. Samples from the same tissue of origin are displayed together, grouped by disease status. The green dashed line represents the overall mean; blue and red solid lines show the mean of cancerous and normal groups, respectively, whereas cyan and pink solid lines describe their respective dispersion, given by the within group standard deviation.
Fig 4. Expression level changes across tissue types and disease status.
Distribution of the number of groups for which there is a clear change in the expression level of the probeset. A group is counted as changed when either its mean minus its standard deviation lies above the overall mean, or its mean plus its standard deviation lies below the overall mean.
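The counting rule in this caption can be written down directly. The sketch below is a minimal illustration under two assumptions that the caption does not settle: the overall mean is taken over all samples pooled together, and the standard deviation is the sample standard deviation. The function name, group labels and values are hypothetical.

```python
from statistics import mean, stdev


def count_changed_groups(groups):
    """Count groups whose expression clearly departs from the overall mean.

    `groups` maps a group label to the expression values of one probeset
    in that group. A group counts as changed when
    mean - sd > overall mean, or mean + sd < overall mean.
    """
    # Assumption: the overall mean pools every sample from every group.
    overall = mean(v for values in groups.values() for v in values)
    changed = 0
    for values in groups.values():
        m, s = mean(values), stdev(values)  # sample SD (assumption)
        if m - s > overall or m + s < overall:
            changed += 1
    return changed


# Hypothetical toy values for a single probeset.
toy_groups = {
    "tissue A, normal": [5.0, 5.2, 4.8],
    "tissue A, cancer": [9.0, 9.4, 8.6],
    "tissue B, normal": [5.1, 4.9, 5.0],
    "tissue B, cancer": [6.0, 7.5, 4.5],
}
n_changed = count_changed_groups(toy_groups)
```

Here the overall mean is 6.25; three groups lie entirely on one side of it by more than one standard deviation, while the last group (mean 6.0, standard deviation 1.5) straddles it and is not counted. Repeating this count for every probeset would give the distribution shown in the figure.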
Genes mapping to the top 100 probesets.
| AATF (*) | AK056098 | ANAPC7 (**) | |
| ARNTL (**) | C1orf21 (**) | | |
| CBFA2T2 (**) | CBX2 (**) | CD80 (**) | |
| CDCA4 (**) | CHP1 (**) | CHTOP (**) | CPSF1 [1] |
| CSE1L (*) | CYB5D2 | DCAF13 (**) | |
| DLEU2 [2] | DNMT1 (*) | DUSP5P1 (**) | |
| FAM122B (**) | FANCA (*) | FCGR1A [3] | |
| FCGR1B (**) | FDX1 (**) | FLJ41455 | GABBR1 [4] |
| GINS4 (**) | GMPS (*) | HAUS3 (**) | HAUS8 (**) |
| HCG18 | HELLS (*) | | |
| LPCAT1 (**) | MATR3 [6] | MIR1204 [7] | |
| MMP24-AS1 | MTBP (**) | NCOA6 (**) | |
| NCOR1 (**) | NR2C1 (**) | NR2C2AP (**) | NR3C2 (*) |
| NSMCE2 (**) | NSUN2 (**) | PARP9 (**) | |
| PDS5A (**) | POLQ (**) | PPP2CB (**) | PRDM13 (**) |
| RBL1 (**) | RFX5 | | |
| RGS16 (**) | RNPS1 (**) | RP11-353N14.2 | RP11-932O9.10 |
| RPS15A (**) | S100PBP (*) | SALL4 (**) | |
| SMOC2 (**) | SNORA72 (**) | STX12 (**) | |
| THOC2 (**) | TIAL1 (**) | TMEM194A (**) | TMEM246 (**) |
| U2SURP (**) | USP32 (*) | VPS13D | |
| VWA1 (**) | ZDHHC2 (**) | ZHX1-C8orf76 | ZNF174 (**) |
| ZNF680 | ZNF692 (**) | ZNF7 (**) | |
[1] CPSF1 (**) /// MIR1234 (**) /// MIR6849 /// MIR939 (**)
[2] DLEU2 (**) /// MIR15A (**)
[3] FCGR1A (**) /// FCGR1B (**) /// FCGR1C
[4] GABBR1 (**) /// UBD (*)
[5] LOC100506639 /// ZNF131 (**)
[6] MATR3 (**) /// SNHG4 (**)
[7] MIR1204 /// PVT1 (*)
List of genes mapped to by the top 100 probesets, proposed as candidates to be connected to cancer processes, irrespective of tissue type. Probesets 216677_at, 229948_at, 235229_at, 235363_at, 241569_at and 243379_at are not mapped to any gene. Entries mapped to by more than one probeset are displayed in italic; multiple matchings are marked with a superscript index. One, two and no asterisks correspond to genes which have been found in the Atlas of Genetics and Cytogenetics in Oncology and Haematology database [33] to be related, possibly related or not related to cancer processes, respectively. Genes in bold-face have been identified in [34] as overexpressed in cancer. Gene CASK and the multiple matching LOC100506639 /// ZNF131, in italic, are mapped to by two and three probesets, respectively. Additionally, among the genes in multiple matchings, PVT1 and UBD are identified as related to cancer, whereas CPSF1, MIR1234, MIR939, DLEU2, MIR15A, FCGR1A, FCGR1B, GABBR1, ZNF131, MATR3 and SNHG4 are identified as possibly related to cancer.