Eliah G Overbey1, Amanda M Saravia-Butler2,3, Zhe Zhang4, Komal S Rathi4, Homer Fogle5,3, Willian A da Silveira6, Richard J Barker7, Joseph J Bass8, Afshin Beheshti9,10, Daniel C Berrios3, Elizabeth A Blaber11, Egle Cekanaviciute3, Helio A Costa12, Laurence B Davin13, Kathleen M Fisch14, Samrawit G Gebre3,9, Matthew Geniza15, Rachel Gilbert16, Simon Gilroy7, Gary Hardiman6,17, Raúl Herranz18, Yared H Kidane19, Colin P S Kruse20, Michael D Lee21,22, Ted Liefeld23, Norman G Lewis13, J Tyson McDonald24, Robert Meller25, Tejaswini Mishra26, Imara Y Perera27, Shayoni Ray28, Sigrid S Reinsch3, Sara Brin Rosenthal14, Michael Strong29, Nathaniel J Szewczyk30, Candice G T Tahimic31, Deanne M Taylor32, Joshua P Vandenbrink33, Alicia Villacampa18, Silvio Weging34, Chris Wolverton35, Sarah E Wyatt36,37, Luis Zea38, Sylvain V Costes3, Jonathan M Galazka3.
Abstract
With the development of transcriptomic technologies, we are able to quantify precise changes in gene expression profiles from astronauts and other organisms exposed to spaceflight. Members of NASA GeneLab and GeneLab-associated analysis working groups (AWGs) have developed a consensus pipeline for analyzing short-read RNA-sequencing data from spaceflight-associated experiments. The pipeline includes quality control, read trimming, mapping, and gene quantification steps, culminating in the detection of differentially expressed genes. This data analysis pipeline and the results of its execution using data submitted to GeneLab are now all publicly available through the GeneLab database. We present here the full details and rationale for the construction of this pipeline in order to promote transparency, reproducibility, and reusability of pipeline data; to provide a template for data processing of future spaceflight-relevant datasets; and to encourage cross-analysis of data from other databases with the data available in GeneLab.
Keywords: Omics; Space Sciences
Year: 2021 PMID: 33870146 PMCID: PMC8044432 DOI: 10.1016/j.isci.2021.102361
Source DB: PubMed Journal: iScience ISSN: 2589-0042
Figure 1. GeneLab RNA-seq Consensus Pipeline (RCP)
(A) The three broad steps of the RCP. The RCP handles (1) data preprocessing to trim sequencing adapters and to provide quality control metrics; (2) data processing to map reads to the reference genome and quantify the number of read counts per gene; and (3) differential gene expression calculation, which will provide a list of differentially expressed genes that can be sorted by adjusted p value and log fold-change.
(B) The full RCP annotated with tools, input files, and output files.
Figure 2. Data preprocessing (pipeline step 1): quality control and trimming
(A) Data preprocessing pipeline. FastQ files from Illumina base-calling software are quality checked using FastQC and MultiQC. Data are then trimmed using TrimGalore and re-checked for quality.
(B) Flags used for the FastQC program.
(C) Flags used for the MultiQC program.
(D) Flags used for the TrimGalore program. Trimmed reads (*fastq.gz) are then used as input data for FastQC (B) followed by MultiQC (C) to generate trimmed read quality metrics. Tool versions used to process each dataset are included in the RNA-seq processing protocol in the GLDS Repository.
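The order of operations in this step (quality check, trim, quality check again) can be sketched as a small command builder. This is only an illustration: the flags actually used by the RCP are those shown in panels B to D and in the GLDS processing protocol, which are not reproduced here. The sketch uses only the generic output-directory, `--gzip`, and `--paired` options of the tools, and the function names and file paths are hypothetical.

```python
import shlex


def qc_commands(fastq_files, out_dir):
    """FastQC on each file, then MultiQC to aggregate the reports in out_dir."""
    return [
        ["fastqc", "-o", out_dir, *fastq_files],
        ["multiqc", "-o", out_dir, out_dir],
    ]


def trim_command(fastq_files, out_dir, paired=True):
    """TrimGalore call; with --paired the two mates of one sample are expected."""
    cmd = ["trim_galore", "--gzip", "-o", out_dir]
    if paired:
        cmd.append("--paired")
    return cmd + list(fastq_files)


if __name__ == "__main__":
    # Hypothetical file names for a single paired-end sample.
    raw = ["sample_R1.fastq.gz", "sample_R2.fastq.gz"]
    steps = qc_commands(raw, "qc/raw") + [trim_command(raw, "trimmed")]
    # The same qc_commands() call would then be repeated on the trimmed files.
    for cmd in steps:
        print(shlex.join(cmd))
```

Building argument lists rather than shell strings keeps the commands easy to inspect and to pass to `subprocess.run` once the tools are installed.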
Figure 3. Data processing (pipeline step 2A): read mapping
(A) Data processing pipeline. Trimmed reads are mapped to their reference genome and transcriptome with STAR. Gene counts are then quantified with RSEM.
(B) Flags used for generating the indexed STAR reference files.
(C) Flags used for mapping reads with STAR. Tool versions used to process each dataset are included in the RNA-seq processing protocol in the GLDS Repository.
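As a rough sketch of the two STAR invocations described in panels B and C, the functions below assemble an index-generation command and a mapping command. The options shown are standard STAR options, but they are not claimed to be the RCP's flag set, which is given in the figure and the GLDS protocol. `--quantMode TranscriptomeSAM` is included on the assumption that transcriptome-coordinate alignments are wanted for RSEM, and all paths, the thread count, and the function names are hypothetical.

```python
def star_index_command(genome_dir, fasta, gtf, threads=4):
    """Build the STAR call that generates indexed reference files."""
    return [
        "STAR", "--runMode", "genomeGenerate",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,
        "--sjdbGTFfile", gtf,
    ]


def star_align_command(genome_dir, reads, out_prefix, threads=4):
    """Build the STAR call that maps trimmed, gzipped reads for one sample."""
    return [
        "STAR", "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", *reads,
        "--readFilesCommand", "zcat",        # reads are gzip-compressed
        "--outSAMtype", "BAM", "Unsorted",
        "--quantMode", "TranscriptomeSAM",   # assumption: feeds RSEM downstream
        "--outFileNamePrefix", out_prefix,
    ]
```

For a paired-end sample, `reads` would hold the two trimmed mate files in order; for single-end data it would hold one file.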
Figure 4. Data processing (pipeline step 2B): gene quantification
(A) Data processing pipeline. Mapping results from STAR are quantified by RSEM.
(B) Parameters for generating the RSEM indexed reference files.
(C) Parameters for quantifying gene and isoform counts with RSEM. Tool versions used to process each dataset are included in the RNA-seq processing protocol in the GLDS Repository.
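The two RSEM calls in panels B and C can be outlined in the same way. This is a hedged sketch using only basic RSEM options (`--gtf`, `--alignments`, `--paired-end`, `--num-threads`); the parameters the RCP actually uses are those in the figure and the GLDS protocol. The function names, reference name, and file paths are hypothetical.

```python
def rsem_reference_command(fasta, gtf, reference_name):
    """Build the call that prepares RSEM reference files from a genome and GTF."""
    return ["rsem-prepare-reference", "--gtf", gtf, fasta, reference_name]


def rsem_quantify_command(bam, reference_name, sample_name,
                          paired=True, threads=4):
    """Build the call that quantifies gene and isoform counts from alignments."""
    cmd = ["rsem-calculate-expression", "--alignments",
           "--num-threads", str(threads)]
    if paired:
        cmd.append("--paired-end")
    # Positional arguments: input alignments, reference name, output prefix.
    return cmd + [bam, reference_name, sample_name]


def expected_outputs(sample_name):
    """RSEM writes gene-level and isoform-level tables under the sample prefix."""
    return [f"{sample_name}.genes.results", f"{sample_name}.isoforms.results"]
```

The gene-level table named by `expected_outputs` is the kind of gene count file that the differential expression step takes as input.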
Figure 5. Differential gene expression calculation (pipeline step 3)
(A) Data processing pipeline. The R package DESeq2 is run to determine which genes are differentially expressed between experimental conditions, using gene count files from RSEM.
(B) Output files generated. The table rows distinguish which script produces each output. The columns distinguish how those output files are used.
Table 1. Differential gene expression output table: annotations
| TAIR | SYMBOL | GENENAME | REFSEQ | ENTREZID | STRING_id | GOSLIM_IDS |
|---|---|---|---|---|---|---|
| | ANAC001 | NA | | 839580 | 3702.AT1G01010.1 | NA |
| | ARV1 | NA | | 839569 | 3702.AT1G01020.1 | GO:0005622, GO:0005737, … |
| | NGA3 | NA | | 839321 | 3702.AT1G01030.1 | NA |
| | ASU1 | Encodes a Dicer homolog … | | 839574 | 3702.AT1G01040.2 | NA |
Truncated version of the differential_expression.csv file provided as GeneLab processed data for GLDS-251. The first 7 columns of the differential gene expression output table contain gene IDs and annotations (for remainder of columns, refer to Table 2).
Table 2. Differential gene expression output table: statistics
| Norm. expr. (sample A) | Log2fc (comparison A) | P value (comparison A) | Adj p value (comparison A) | Mean (all samples) | Stdev (all samples) | LRT p value | Mean (group A) | Stdev (group A) |
|---|---|---|---|---|---|---|---|---|
| 263.864 | −0.078 | 0.648 | 0.848 | 198.735 | 31.756 | 0.484 | 225.550 | 36.759 |
| 200.493 | 0.341 | 0.033 | 0.198 | 147.061 | 19.197 | 0.740 | 174.839 | 24.073 |
| 19.040 | 0.691 | 0.137 | NA | 11.035 | 3.121 | NA | 15.706 | 2.889 |
| 644.811 | 0.126 | 0.366 | 0.655 | 669.586 | 68.327 | 1.000 | 688.123 | 76.969 |
Truncated version of the differential_expression.csv file provided as GeneLab processed data for GLDS-251. Following the seven columns of gene IDs and annotations (Table 1) are the normalized gene expression data for each sample (Norm. expr. (sample A)), then the results of all possible pairwise comparisons, including log2 fold change (Log2fc (comparison A)), p values (P value (comparison A)), and adjusted p values (Adj p value (comparison A)) calculated from the Wald tests. Next are the average gene expression (Mean (all samples)) and standard deviation (Stdev (all samples)) across all samples, followed by the F-statistic p value generated from the likelihood ratio test (LRT p value). The last set of columns gives the average gene expression (Mean (group A)) and standard deviation (Stdev (group A)) of the samples within each group.
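A table of this shape can be read and ranked with a few lines of Python. The sketch below is illustrative only: the column names `symbol`, `log2fc`, and `adj_p` are hypothetical stand-ins for the real headers in differential_expression.csv, and pairing the four rows of Table 1 with the four rows of Table 2 in order is an assumption. It treats `NA` cells as missing and sorts the remaining genes by adjusted p value.

```python
import csv
import io


def parse_number(text):
    """Convert a table cell to float; 'NA' or an empty cell becomes None."""
    text = text.strip().replace("\u2212", "-")  # normalize a typographic minus
    return None if text in ("", "NA") else float(text)


def rank_by_adjusted_p(handle, adj_p_col, log2fc_col):
    """Return rows with a usable adjusted p value, most significant first.

    Ties on adjusted p value are broken by larger absolute log2 fold change.
    """
    rows = []
    for row in csv.DictReader(handle):
        adj_p = parse_number(row[adj_p_col])
        log2fc = parse_number(row[log2fc_col])
        if adj_p is None or log2fc is None:
            continue  # no adjusted p value was reported for this gene
        rows.append({**row, adj_p_col: adj_p, log2fc_col: log2fc})
    return sorted(rows, key=lambda r: (r[adj_p_col], -abs(r[log2fc_col])))


# Values copied from Tables 1 and 2; the row pairing is assumed.
EXAMPLE = """symbol,log2fc,adj_p
ANAC001,\u22120.078,0.848
ARV1,0.341,0.198
NGA3,0.691,NA
ASU1,0.126,0.655
"""

if __name__ == "__main__":
    for gene in rank_by_adjusted_p(io.StringIO(EXAMPLE), "adj_p", "log2fc"):
        print(gene["symbol"], gene["log2fc"], gene["adj_p"])
```

Dropping rows with a missing adjusted p value is one possible policy; an analysis might instead keep them and report them separately.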
Figure 6. Global and differential gene expression in spaceflight versus ground control liver samples from GeneLab datasets
(A and B) Principal component analysis of global gene expression in spaceflight (FLT) and respective ground control (GC) liver samples from the (A) Rodent Research 1 (RR-1) NASA Validation mission (GLDS-168) and (B) RR-6 ISS-terminal mission (GLDS-245). Plots were generated using data in the normalized counts tables for each respective dataset on the NASA GeneLab Data Repository.
(C and D) Heatmaps showing the top 30 differentially expressed genes in spaceflight (FLT) versus ground control (GC) liver samples from the (C) Rodent Research 1 (RR-1) NASA Validation mission (GLDS-168) and (D) RR-6 ISS-terminal mission (GLDS-245). Heatmaps were generated using data in the differential expression tables for each respective dataset on the NASA GeneLab Data Repository and are colored by relative expression. Genes shown meet adj. p value < 0.05 and |log2FC| > 1. All samples included were derived from frozen carcasses post-mission and utilized the ribo-depletion library preparation method.
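The thresholds quoted in this caption translate directly into a filter. The sketch below applies adj. p value < 0.05 and |log2FC| > 1 and keeps up to 30 genes. How the "top 30" were ranked is not stated in the caption, so ordering by adjusted p value is an assumption, as are the dictionary keys; the gene records in the usage example are toy values, not GeneLab data.

```python
def top_degs(genes, adj_p_max=0.05, abs_log2fc_min=1.0, n=30):
    """Select up to n genes passing both thresholds, smallest adj. p first.

    Each gene is a dict with hypothetical keys 'symbol', 'log2fc', 'adj_p';
    an 'adj_p' of None (NA in the table) never passes.
    """
    passing = [
        g for g in genes
        if g["adj_p"] is not None
        and g["adj_p"] < adj_p_max
        and abs(g["log2fc"]) > abs_log2fc_min
    ]
    # Assumption: rank by adjusted p value, then by effect size.
    passing.sort(key=lambda g: (g["adj_p"], -abs(g["log2fc"])))
    return passing[:n]


if __name__ == "__main__":
    # Toy records for illustration only.
    toy = [
        {"symbol": "geneA", "log2fc": 2.1, "adj_p": 0.001},
        {"symbol": "geneB", "log2fc": -1.4, "adj_p": 0.030},
        {"symbol": "geneC", "log2fc": 0.6, "adj_p": 0.0001},  # effect too small
        {"symbol": "geneD", "log2fc": 3.0, "adj_p": None},    # NA adj. p value
    ]
    print([g["symbol"] for g in top_degs(toy)])
```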
Table 3. Comparison of gene ontology in spaceflight versus ground control liver samples from GeneLab datasets
| GeneLab dataset | # Enriched GO terms (NOM p < 0.01) | # Enriched GO terms (NOM p < 0.01 & FDR < 0.5) | # Enriched GO terms (NOM p < 0.01 & FDR < 0.25) |
|---|---|---|---|
| GLDS-168 | 71, 135 | 0, 132 | 0, 0 |
| GLDS-245 | 21, 24 | 2, 6 | 1, 0 |
The number of enriched gene ontology (GO) terms identified by Gene Set Enrichment Analysis (GSEA, phenotype permutation) was evaluated in spaceflight (FLT) versus ground control (GC) liver samples from the Rodent Research 1 (RR-1) NASA Validation mission (GLDS-168) and the RR-6 ISS-terminal mission (GLDS-245). In each cell, the number on the left is the count of GO terms enriched in FLT samples and the number on the right is the count enriched in GC samples. These data were generated using the normalized counts for each respective dataset on the NASA GeneLab Data Repository. All samples included were derived from frozen carcasses post-mission and utilized the ribo-depletion library preparation method. GLDS-168, FLT n = 5 and GC n = 5; GLDS-245, FLT n = 10 and GC n = 10. The nominal (NOM) p value and false discovery rate (FDR) thresholds are indicated in the column headers.
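The three count columns apply progressively stricter filters to the same list of GO terms, which can be sketched as follows. The record keys `nom_p` and `fdr` are hypothetical names for the nominal p value and FDR of each term, and the terms in the usage example are toy values rather than GSEA output.

```python
def count_enriched(terms, nom_p_max=0.01, fdr_max=None):
    """Count GO terms with NOM p below nom_p_max and, optionally, FDR below fdr_max."""
    return sum(
        1 for t in terms
        if t["nom_p"] < nom_p_max and (fdr_max is None or t["fdr"] < fdr_max)
    )


def table_row(terms):
    """Counts for the three columns: NOM p only, plus FDR < 0.5, plus FDR < 0.25."""
    return (
        count_enriched(terms),
        count_enriched(terms, fdr_max=0.5),
        count_enriched(terms, fdr_max=0.25),
    )


if __name__ == "__main__":
    # Toy GO terms for one phenotype (for example, terms enriched in FLT).
    toy_terms = [
        {"term": "GO:A", "nom_p": 0.001, "fdr": 0.10},
        {"term": "GO:B", "nom_p": 0.004, "fdr": 0.40},
        {"term": "GO:C", "nom_p": 0.008, "fdr": 0.90},
        {"term": "GO:D", "nom_p": 0.200, "fdr": 0.20},  # fails the NOM p cut
    ]
    print(table_row(toy_terms))
```

Because each column adds a constraint to the previous one, the counts can only stay equal or decrease from left to right, which is the pattern seen in the table.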