Qingguo Wang1,2,3, Joshua Armenia1,2, Chao Zhang4, Alexander V Penson1,2, Ed Reznik1,2, Liguo Zhang5, Thais Minet3, Angelica Ochoa1,2, Benjamin E Gross1,2, Christine A Iacobuzio-Donahue5, Doron Betel4, Barry S Taylor1,2,6, Jianjiong Gao1,2, Nikolaus Schultz1,2,6.
Abstract
Driven by recent advances in next-generation sequencing (NGS) technologies and an urgent need to decode complex human diseases, a multitude of large-scale studies, such as the Genotype-Tissue Expression project (GTEx) and The Cancer Genome Atlas (TCGA), have recently produced an unprecedented volume of whole-transcriptome sequencing (RNA-seq) data. While these data offer new opportunities to identify the mechanisms underlying disease, comparing data from different sources remains challenging because of differences in sample and data processing. Here, we developed a pipeline that processes and unifies RNA-seq data from different studies; it includes uniform realignment, gene expression quantification, and batch effect removal. We find that uniform alignment and quantification are not sufficient when combining RNA-seq data from different sources and that the removal of other batch effects is essential to facilitate data comparison. We have processed data from GTEx and TCGA and successfully corrected for study-specific biases, enabling comparative analysis between TCGA and GTEx. The normalized datasets are available for download on figshare.
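To make the idea of batch effect removal concrete, the following is a minimal sketch of a location-only adjustment: for each gene, every study's values are shifted so that the study mean matches the overall mean. This is a deliberately simplified stand-in and not the method used in the pipeline, which the abstract does not specify. The function name `center_batches`, the gene label, and the toy values are hypothetical, and the input is assumed to be log-scale expression.

```python
from statistics import mean

def center_batches(expr, batches):
    """Shift each batch so its per-gene mean equals the overall per-gene mean.

    expr:    dict mapping gene name -> list of log-expression values (one per sample)
    batches: list of batch labels (e.g. study names), one per sample
    """
    corrected = {}
    for gene, values in expr.items():
        grand_mean = mean(values)
        # Mean of this gene within each batch
        batch_means = {
            b: mean(v for v, label in zip(values, batches) if label == b)
            for b in set(batches)
        }
        # Remove the batch-specific offset, keep the overall level
        corrected[gene] = [
            v - batch_means[label] + grand_mean
            for v, label in zip(values, batches)
        ]
    return corrected

# Toy example (hypothetical values): one gene, two samples per study
expr = {"GENE_A": [1.0, 2.0, 5.0, 6.0]}
batches = ["gtex", "gtex", "tcga", "tcga"]
print(center_batches(expr, batches))  # {'GENE_A': [3.0, 4.0, 3.0, 4.0]}
```

A shift like this also removes real biological differences whenever the studies differ in composition (for example, tumor versus normal samples), which is one reason dedicated batch-correction methods model known biological groups explicitly.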
Year: 2018 PMID: 29664468 PMCID: PMC5903355 DOI: 10.1038/sdata.2018.61
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Figure 1. Uniform processing of RNA-seq data from GTEx and TCGA.
GTEx and TCGA RNA-seq samples processed by our pipeline.
| Tissue / TCGA cancer type | GTEx | TCGA normal | TCGA tumor | Total |
|---|---|---|---|---|
| bladder / blca | 11 | 19 | 411 | 441 |
| breast / brca | 218 | 114 | 1112 | 1444 |
| cervix / cesc | 11 | 3 | 304 | 318 |
| uterus / ucec | 90 | 24 | 180 | 294 |
| uterus / ucs | | 0 | 57 | 57 |
| colon-sigmoid / read | 173 | 10 | 94 | 277 |
| colon-transverse / coad | 203 | 41 | 295 | 539 |
| liver / lihc | 136 | 50 | 371 | 557 |
| salivary gland / hnsc | 70 | 44 | 520 | 634 |
| esophageal / esca | 790 | 11 | 185 | 986 |
| prostate / prad | 119 | 52 | 497 | 668 |
| stomach / stad | 204 | 35 | 415 | 654 |
| thyroid / thca | 355 | 59 | 505 | 919 |
| lung / luad | 374 | 59 | 528 | 961 |
| lung / lusc | | 51 | 504 | 555 |
| kidney cortex / kirc | 36 | 72 | 541 | 649 |
| kidney cortex / kirp | | 32 | 290 | 322 |
| kidney cortex / kich | | 25 | 66 | 91 |
| Total | 2790 | 701 | 6875 | 10366 |

Only paired-end RNA-seq samples were included. Blank GTEx cells mark cancer types whose tissue shares the GTEx samples listed in the row above (the column total of 2790 counts each tissue once).
Figure 2. Effect of uniform processing and batch effect removal on gene expression levels in GTEx and TCGA.
Two-dimensional plots of the principal components from a PCA of the gene expression values of bladder, prostate, and thyroid samples from GTEx and TCGA. (a) PCA of the level 3 data, i.e., the expression data from GTEx and TCGA. GTEx expression data were quantile normalized (see Supplementary Fig. S1B). (b) PCA of the expression data after uniform processing through our pipeline, before batch bias correction. (c) PCA of the expression data after uniform processing through our pipeline, after batch bias correction.
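The kind of check this figure performs can be sketched in a few lines: project samples onto the first principal component and see whether they separate by study. The example below is an illustrative sketch only. It uses power iteration in plain Python rather than any particular PCA library, the function name `first_pc_scores` is hypothetical, and the toy values are invented for demonstration.

```python
import math

def first_pc_scores(samples, iters=100):
    """Return each sample's score on the first principal component.

    samples: list of equal-length expression vectors, one per sample.
    Uses power iteration on X^T X, where X is the gene-centered data.
    """
    n, p = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in samples]

    # Deterministic, non-uniform starting direction
    v = [float(j + 1) for j in range(p)]
    for _ in range(iters):
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        v_new = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(w * w for w in v_new))
        if norm == 0.0:  # no variance along the current direction
            break
        v = [w / norm for w in v_new]
    return [sum(x * w for x, w in zip(row, v)) for row in centered]

# Toy data (hypothetical): two "studies" that differ mainly in the first gene
samples = [[1.0, 5.0], [1.2, 5.1], [4.0, 5.0], [4.2, 5.1]]
labels = ["gtex", "gtex", "tcga", "tcga"]
for label, score in zip(labels, first_pc_scores(samples)):
    print(label, round(score, 3))
```

If the scores split cleanly by study label, as they do for this toy input, the dominant source of variance is the study rather than the biology. The sign of a principal component is arbitrary, so only the grouping of the scores is meaningful.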
Figure 3. Hierarchical clustering of GTEx and TCGA bladder, prostate, and thyroid data shows the effect of uniform processing and batch effect correction.
(a) level 3 expression data from GTEx and TCGA; (b) gene expression calculated using our pipeline prior to batch bias correction; (c) our expression data after batch bias correction.
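As a rough illustration of how hierarchical clustering groups samples, here is a naive agglomerative sketch in plain Python. The distance metric (Euclidean) and linkage (average) are assumptions chosen for simplicity, since the caption does not state which were used, and the function names and toy values are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(samples, k):
    """Average-linkage agglomerative clustering, merged down to k clusters.

    Returns a sorted list of clusters, each a sorted list of sample indices.
    Recomputes all pairwise linkages at each step, so suitable for small inputs only.
    """
    clusters = [[i] for i in range(len(samples))]

    def linkage(c1, c2):
        # Mean pairwise distance between members of the two clusters
        total = sum(euclidean(samples[i], samples[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)

# Toy data (hypothetical): samples 0-1 and 2-3 form two obvious groups
samples = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
print(agglomerate(samples, 2))  # [[0, 1], [2, 3]]
```

Comparing the resulting clusters with the study and tissue labels shows whether samples group by tissue, which is the desired outcome after correction, or by study of origin.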
Figure 4. Normalized expression across tissue and cancer types for three known cancer genes: ERBB2, IGF2 and TP53.