Joseph N Paulson, Cho-Yi Chen, Camila M Lopes-Ramos, Marieke L Kuijjer, John Platig, Abhijeet R Sonawane, Maud Fagny, Kimberly Glass, John Quackenbush.
Abstract
BACKGROUND: Although ultrahigh-throughput RNA-Sequencing has become the dominant technology for genome-wide transcriptional profiling, the vast majority of RNA-Seq studies profile only tens of samples, and most analytical pipelines are optimized for these smaller studies. However, projects are generating ever-larger data sets comprising RNA-Seq data from hundreds or thousands of samples, often collected at multiple centers and from diverse tissues. These complex data sets present significant analytical challenges due to batch and tissue effects, but provide the opportunity to revisit the assumptions and methods that we use to preprocess, normalize, and filter RNA-Seq data - critical first steps for any subsequent analysis.
Keywords: Filtering; GTEx; Normalization; Preprocessing; Quality control; RNA-Seq
Year: 2017 PMID: 28974199 PMCID: PMC5627434 DOI: 10.1186/s12859-017-1847-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Preprocessing workflow for large, heterogeneous RNA-Seq data sets, as applied to the GTEx data. The boxes on the right show the number of samples, genes, and tissue types at each step. First, samples were filtered using PCoA with Y-chromosome genes to test for correct annotation of the sex of each sample. PCoA was used to group or separate samples derived from related tissue regions. Genes were filtered to select a normalization gene set to preserve robust, tissue-dependent expression. Finally, the data were normalized using a global count distribution method to support cross-tissue comparison while minimizing within-group variability.
Fig. 2 PCoA allows subregions to be grouped for greater power. Scatterplots of the first and second principal coordinates from principal coordinate analysis on major tissue regions. (a) Aorta, coronary artery, and tibial artery form distinct clusters. (b) Skin samples from two regions group together but are distinct from fibroblast cell lines, a result that holds (c) when the fibroblasts are removed.
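The principal coordinate analysis behind these scatterplots can be sketched as classical multidimensional scaling on a sample-by-sample distance matrix. The sketch below is illustrative only: the function name `pcoa`, the Euclidean distance, and the toy data are assumptions, and the record does not state which distance or software the authors used.

```python
import numpy as np

def pcoa(dist, n_components=2):
    """Classical PCoA on a symmetric sample-by-sample distance matrix."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distances (Gower's transformation).
    centring = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centring @ (d ** 2) @ centring
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    vals, vecs = np.linalg.eigh(b)
    order = np.argsort(vals)[::-1][:n_components]
    vals, vecs = vals[order], vecs[:, order]
    # Coordinates are eigenvectors scaled by the square root of the
    # eigenvalues (tiny negative values are clipped to zero).
    coords = vecs * np.sqrt(np.clip(vals, 0.0, None))
    return coords, vals

# Toy data (assumption): 10 samples x 20 genes, two groups of 5 samples
# whose log-expression differs by a constant shift.
rng = np.random.default_rng(0)
log_expr = np.vstack([rng.normal(0, 1, (5, 20)), rng.normal(4, 1, (5, 20))])
diff = log_expr[:, None, :] - log_expr[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))  # Euclidean distance between samples

coords, eigvals = pcoa(dist)
```

Plotting `coords[:, 0]` against `coords[:, 1]` and colouring by tissue would give a scatterplot of the kind described in the caption; samples that overlap are candidates for merging, and samples that separate are kept as distinct groups.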
Breakdown of tissues, assigned groups, abbreviations used, and sample sizes
| Tissue | Abbreviation | Subtissue | Sample size |
|---|---|---|---|
| Adipose subcutaneous | ADS | Adipose - Subcutaneous | 380 |
| Adipose visceral | ADV | Adipose - Visceral (Omentum) | 234 |
| Adrenal gland | ARG | Adrenal Gland | 159 |
| Artery aorta | ATA | Artery - Aorta | 247 |
| Artery coronary | ATC | Artery - Coronary | 140 |
| Artery tibial | ATT | Artery - Tibial | 357 |
| Brain other | BRO | Brain - Amygdala | 779 |
| | | Brain - Anterior cingulate cortex (BA24) | |
| | | Brain - Cortex | |
| | | Brain - Frontal Cortex (BA9) | |
| | | Brain - Hippocampus | |
| | | Brain - Hypothalamus | |
| | | Brain - Spinal cord (cervical c-1) | |
| | | Brain - Substantia nigra | |
| Brain cerebellum | BRC | Brain - Cerebellar Hemisphere | 254 |
| | | Brain - Cerebellum | |
| Brain basal ganglia | BRB | Brain - Caudate (basal ganglia) | 360 |
| | | Brain - Nucleus accumbens (basal ganglia) | |
| | | Brain - Putamen (basal ganglia) | |
| Breast | BST | Breast - Mammary Tissue | 217 |
| Lymphoblastoid cell line | LCL | Cells - EBV-transformed lymphocytes | 132 |
| Fibroblast cell line | FIB | Cells - Transformed fibroblasts | 305 |
| Colon sigmoid | CLS | Colon - Sigmoid | 173 |
| Colon transverse | CLT | Colon - Transverse | 203 |
| Gastroesophageal junction | GEJ | Esophagus - Gastroesophageal Junction | 176 |
| Esophagus mucosa | EMC | Esophagus - Mucosa | 330 |
| Esophagus muscularis | EMS | Esophagus - Muscularis | 283 |
| Heart atrial appendage | HRA | Heart - Atrial Appendage | 217 |
| Heart left ventricle | HRV | Heart - Left Ventricle | 267 |
| Kidney cortex | KDN | Kidney Cortex | 36 |
| Liver | LVR | Liver | 137 |
| Lung | LNG | Lung | 360 |
| Minor salivary gland | MSG | Minor Salivary Gland | 70 |
| Skeletal muscle | SMU | Muscle - Skeletal | 469 |
| Tibial nerve | TNV | Nerve - Tibial | 334 |
| Ovary | OVR | Ovary | 108 |
| Pancreas | PNC | Pancreas | 193 |
| Pituitary | PIT | Pituitary | 124 |
| Prostate | PRS | Prostate | 119 |
| Skin | SKN | Skin - Not Sun Exposed (Suprapubic) | 661 |
| | | Skin - Sun Exposed (Lower leg) | |
| Intestine terminal ileum | ITI | Small Intestine - Terminal Ileum | 104 |
| Spleen | SPL | Spleen | 118 |
| Stomach | STM | Stomach | 204 |
| Testis | TST | Testis | 199 |
| Thyroid | THY | Thyroid | 355 |
| Uterus | UTR | Uterus | 90 |
| Vagina | VGN | Vagina | 97 |
| Whole blood | WBL | Whole Blood | 444 |
Fig. 3 Six highly expressed tissue-specific genes that are removed by tissue-agnostic filtering. Boxplots of continuity-corrected log2 counts for six tissue-specific genes (a-f). These genes are retained when filtering takes tissue specificity into account, but not when filtering is unsupervised. Colors represent different tissues. The genes shown are (a) MUC7, (b) REG3A, (c) AHSG, (d) GKN1, (e) SMCP, and (f) NPPB.
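The difference between the two filtering strategies can be illustrated with a small sketch. The thresholds (`min_cpm`, `min_frac`), the function names, and the toy matrix are assumptions chosen for illustration and are not taken from this record; the point is only that a gene expressed strongly in a single tissue can fail a filter computed over all samples while passing one computed within tissue groups.

```python
import numpy as np

def keep_agnostic(cpm, min_cpm=1.0, min_frac=0.5):
    """Keep genes above min_cpm in at least min_frac of ALL samples."""
    return (cpm > min_cpm).mean(axis=1) >= min_frac

def keep_tissue_aware(cpm, tissues, min_cpm=1.0, min_frac=0.5):
    """Keep genes above min_cpm in at least min_frac of the samples
    of ANY single tissue group."""
    tissues = np.asarray(tissues)
    keep = np.zeros(cpm.shape[0], dtype=bool)
    for t in np.unique(tissues):
        in_tissue = tissues == t
        keep |= (cpm[:, in_tissue] > min_cpm).mean(axis=1) >= min_frac
    return keep

# Toy counts-per-million matrix (assumption): 3 genes x 8 samples.
tissues = np.array(["STM"] * 2 + ["LVR"] * 2 + ["LNG"] * 4)
cpm = np.array([
    [50.0, 60.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # expressed in one tissue only
    [5.0, 6.0, 7.0, 5.0, 6.0, 8.0, 9.0, 4.0],    # broadly expressed
    [0.0, 0.5, 0.0, 0.0, 0.2, 0.0, 0.0, 0.0],    # essentially unexpressed
])

agnostic = keep_agnostic(cpm)
aware = keep_tissue_aware(cpm, tissues)
```

In this toy case the first gene is dropped by the agnostic filter because only a quarter of all samples express it, but it is kept by the tissue-aware filter because every sample of one tissue does.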
Fig. 4 Using a tissue-defined reference lowers root mean squared error (RMSE). Boxplots of the RMSE between the log-transformed quantiles of each sample and a reference defined using (left) all tissues and samples or (right) only samples of the same tissue.
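A comparison of this kind can be sketched as follows. This is a minimal illustration under stated assumptions: the reference is taken to be the mean of the sorted log2(count + 1) values across the chosen samples, each sample contributes to its own reference, and the function names and simulated counts are hypothetical. The record does not specify how the authors constructed the reference.

```python
import numpy as np

def log_quantiles(counts):
    """Per-sample quantiles: sort each column of log2(count + 1)."""
    return np.sort(np.log2(counts + 1.0), axis=0)

def rmse_to_reference(counts, tissues, by_tissue):
    """RMSE between each sample's log quantiles and a reference that is
    the mean quantile vector over all samples (by_tissue=False) or over
    samples of the same tissue (by_tissue=True)."""
    q = log_quantiles(counts)
    tissues = np.asarray(tissues)
    n_samples = q.shape[1]
    rmse = np.empty(n_samples)
    for j in range(n_samples):
        if by_tissue:
            cols = tissues == tissues[j]
        else:
            cols = np.ones(n_samples, dtype=bool)
        reference = q[:, cols].mean(axis=1)
        rmse[j] = np.sqrt(np.mean((q[:, j] - reference) ** 2))
    return rmse

# Simulated counts (assumption): 100 genes x 8 samples, two tissues
# with very different count distributions.
rng = np.random.default_rng(1)
counts = np.hstack([rng.poisson(5, (100, 4)), rng.poisson(200, (100, 4))])
tissues = np.array(["A"] * 4 + ["B"] * 4)

rmse_global = rmse_to_reference(counts, tissues, by_tissue=False)
rmse_tissue = rmse_to_reference(counts, tissues, by_tissue=True)
```

When the tissues have distinct count distributions, as in the simulated data, the global reference sits between them and every sample is far from it, whereas the tissue-specific reference stays close to each sample's own quantiles.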