Differential gene expression analysis is widely used to study changes in gene expression profiles between two or more groups of samples (e.g., physiological versus pathological conditions, pre-treatment versus post-treatment, and infected versus non-infected tissues). This protocol aims to identify gene expression changes in a pre-selected set of genes associated with severe acute respiratory syndrome coronavirus 2 viral infection and host cell antiviral response, as well as subsequent gene expression association with phenotypic features using samples deposited in public repositories. For complete details on the use and outcome of this informatics analysis, please refer to Bizzotto et al. (2020).
Differential gene expression analysis is widely used to study changes in gene expression profiles between two or more groups of samples (e.g., physiological versus pathological conditions, pre-treatment versus post-treatment, and infected versus non-infected tissues). This protocol aims to identify gene expression changes in a pre-selected set of genes associated with severe acute respiratory syndrome coronavirus 2 viral infection and host cell antiviral response, as well as subsequent gene expression association with phenotypic features using samples deposited in public repositories. For complete details on the use and outcome of this informatics analysis, please refer to Bizzotto et al. (2020).
Timing: 1 hR is a free software environment for statistical computing and graphics. It runs on UNIX, Windows and MacOS.To download and install R go to https://www.r-project.org/ (R Core Team, 2013). The current pipeline was performed using R version 3.6.2.RStudio is an integrated development environment (IDE) for R. It allows to easily execute the R codes, plot graphics, and manage the workspace in a multipanel interphase.To download and install RStudio go to https://rstudio.com/products/rstudio/ (RStudio Team, 2020).Pause point: Since this protocol follows a bioinformatic pipeline exclusively, by saving the executed steps the protocol can be paused at any time.
Download required packages in RStudio
Timing: 1 hUsers must first download the required packages (listed in the key resources table). They can be downloaded through Bioconductor, which provides tools for the analysis and comprehension of high-throughput genomic data. BiocManager::install() is the recommended command to install packages (for detailed information on why BiocManager::install() is preferred to the standard R packages installation please read https://www.bioconductor.org/install/#why-biocmanagerinstall):Install the packages needed for the analysis. See troubleshooting 1:Once all packages are installed, they need to be loaded:
Dataset selection
Timing: 2 daysWhen using datasets from public repositories, the key step is to identify a dataset (or datasets) that comply with the eligibility criteria and that contains the sample information required for the analysis.We suggest browsing Gene Expression Omnibus (GEO: https://www.ncbi.nlm.nih.gov/gds, (Barrett et al., 2012)) and ArrayExpress (https://www.ebi.ac.uk/arrayexpress/, (Athar et al., 2019)) repositories because they gather multiple high-throughput genomics datasets. However, there are several public repositories that might be more suitable for other types of studies.These repositories allow to download raw sequencing data (.fastq sequencing files) and/or pre-processed files (tab-delimited.txt files containing matrices with sequence read counts after trimming and alignment to the reference genome). The pre-processed files may contain a raw-counts matrix (non-normalized) or a normalized counts matrix (see below for more details). Sample information is also available to download. Finally, the platform used, and pre-processing algorithm (when data are pre-processed) are specified.We strongly recommend researchers to thoroughly evaluate the type of data submitted, study design, number of samples and any other relevant information that might help to analyze the samples and draw statistically valid conclusions. Troubleshooting 2.Our eligibility criteria for (Bizzotto et al., 2020) was: (i) publicly available transcriptome data; ii) detailed sample/patient information; (iii) detailed protocol information; (iv) ≥ 60 samples. We selected the GSE152075 dataset from GEO which contained RNA-seq data from 430 SARS-CoV-2 positive and 54 negative patients (Lieberman et al., 2020). We downloaded the gene expression aligned data matrix (tab-delimited .txt file with reads pseudo-aligned to the human reference transcriptome). Clinico-pathological information included age, gender, and viral load (expressed as cycle threshold (Ct) by RT-qPCR for the N1 viral gene at time of diagnosis). The interpretation for viral load was as follows: the lower the Ct, the higher the viral load. This phenotypic data can be downloaded directly in RStudio as explained in the “RNA-seq data organization and counts normalization” section.
Key resources table
detailed information on the usage of the different packages may be found in the links provided in the identifier column.
Materials and equipment
For this bioinformatics analysis we used a laptop with an Intel Core i5 8th generation processor, 8 GB RAM memory and Windows 10. No high-performance computing clusters were needed for the analysis of the data. Internet connection is required for downloading R packages and data matrixes.
Step-by-step method details
The flow chart for data processing is included in Figure 1.
Figure 1
Flow chart for data processing
Flow chart for data processing
Download and prepare the data matrix for analysis
Timing: 2 hYou can download the experiment information and clinical data directly from GEO using the GEOquery package:The series matrix file is a text file that includes a tab-delimited value-matrix for each sample containing the phenotypic/clinical and experimental data of a given dataset. In the GEO webtool, there is a hyperlink to the series matrix, called “Series Matrix File(s)”. To download the series matrix file directly to the R environment use the getGEO command:You may now extract the phenotypic/clinical data matrix from the series matrix:Download and save on your computer the raw-counts matrix from GEO website. This matrix is a tab-delimited txt. file containing the counts for every gene aligned from a RNA-seq experiment. After downloading it, load the matrix into RStudio:The sequencing data for GSE152075 was submitted as raw-counts in a separate file from the experimental and clinical data. (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE152075; GSE152075_raw_counts_GEO.txt.gz); therefore, it was downloaded separately. For datasets with pre-processed/normalized data, the counts matrix might be included in the series matrix file downloaded in step 1. Troubleshooting 3.
RNA-seq data organization and count normalization
Timing: 1 hBefore performing differential gene expression analysis, it is required to normalize the read counts if the raw-counts matrix was downloaded. This normalization step allows to compare gene expression (read counts) among samples (Evans et al., 2018). It is also recommended to correct for batch effect if multiple batches of experiments were performed. If the user’s dataset is already normalized, then go directly to step 6.Gene expression normalization:Before sample normalization, data should be converted and organized to the format required for further analysis (data format and organization might vary for different packages). Troubleshooting 4.Make sure that the grouping variables are factors. We also changed the original names of the columns containing the relevant variables to make them shorter and easier to work with.Merge the read counts and clinical data matrixes into a DESeqDataSet object using the DESeq2 package:Normalization by estimation of size factor:Create a new table with the normalized read counts (gene expression) for all genes:There are alternative methods to normalize and extract the counts such as rlog and vst that would fit this analysis (Love et al., 2014). However, this STAR Protocol replicates the bioinformatics analysis described in Bizzotto et al. (Bizzotto et al., 2020) where the command “normalized=TRUE” was used.For our study, we converted the continuous variables (e.g., age, viral load) into factor/strata variables (e.g., 10-year age ranges, low/medium/high viral load). The code below shows an example on how to stratify the viral load and age (after duplicating the original variable in order to not overwrite the original data).This is an optional step depending on your variables and analysis.
Differential gene expression analysis across different strata
Timing: 1 dayAs mentioned on Bizzotto et al. (Bizzotto et al., 2020), some samples were removed from the analysis since they were considered to have low quality read sequencing (>70% genes with 0 counts). Because this might not apply to all protocols, we did not include the code for this filtering in the main manuscript, but it was included in the troubleshooting 5. Therefore, the outcomes for this protocol might slightly differ from the outcome published on Bizzotto et al. (Bizzotto et al., 2020), but this omission does not change the results and interpretation of the study.This step aims to compare gene expression across different strata. Below we show, as an example, the differential MX1 expression analysis for SARS-CoV-2 positive vs. negative patients.As described in our publication, we selected specific genes that could potentially be linked to SARS-CoV-2 infection. For this section, we selected MX1 as an example to show. Plot MX1 expression for SARS-CoV-2 positive and negative patients (Figure 2A.i) using the following code:
Figure 2
Gene expression analysis based on SARS-CoV-2 positivity and age
Dot plot for MX1 expression (i) and statistical output from RStudio (ii) for: (A) SARS-CoV-2 positive vs. negative patients (P-values correspond to Wilcoxon rank-sum test); (B) SARS-CoV-2 positive vs. negative patients categorized by age groups (P-values correspond to decreasing Jonckheere-Terpstra trend test. Alternative hypothesis: Mediani > Medianj > … > Mediann, indicating that gene expression is significantly lower in SARS-CoV-2 patients when age is increased. Left panel: SARS-CoV-2 negative; right panel: SARS-CoV-2 positive). Figures 2 A.i and B.i were adapted with permission from Bizzotto et al., 2020.
Gene expression analysis based on SARS-CoV-2 positivity and ageDot plot for MX1 expression (i) and statistical output from RStudio (ii) for: (A) SARS-CoV-2 positive vs. negative patients (P-values correspond to Wilcoxon rank-sum test); (B) SARS-CoV-2 positive vs. negative patients categorized by age groups (P-values correspond to decreasing Jonckheere-Terpstra trend test. Alternative hypothesis: Mediani > Medianj > … > Mediann, indicating that gene expression is significantly lower in SARS-CoV-2 patients when age is increased. Left panel: SARS-CoV-2 negative; right panel: SARS-CoV-2 positive). Figures 2 A.i and B.i were adapted with permission from Bizzotto et al., 2020.CRITICAL: For the y argument you must specify a data frame containing the samples in the rows and the variables (genes) in the columns; therefore, we used the t() argument to transpose the data frame “norm_counts”. Gene expression is expressed as the log2(counts); therefore, and because some samples have 0 counts, it is necessary to add 1 count to all genes for all samples to avoid errors due to log2(0).Export the plot in high quality (complying with most publication standards):Perform a Wilcoxon test to assess the statistical significance of MX1 expression differences between SARS-CoV-2 positive and negative patients (Figure 2A.ii):Please note that Limma and DESeq2 would be more appropriate statistical packages to run when analyzing whole transcriptomes. In (Bizzotto et al., 2020), we analyzed a pre-selected set of genes; therefore, the Wilcoxon test can be used to analyze mean differences between groups.In addition, we performed the Jonckheere-Terpstra (Arif et al., 2015) trend test to evaluate gene expression trends across ordered strata (e.g., age stratified by 10-year ranges). As an example, here we show the trend test for MX1 expression across age categories in SARS-CoV-2 negative and positive patients (Figure 2B.ii, left and right panels, respectively). The continuous age variable was stratified in 10-year ranges as described in“RNA-seq data organization and counts normalization - step 5” section.In this code the column “clindata$age_cat” contains the age stratified by 10-years ranges previously created.to perform this test in SARS-CoV-2 negative patients, then replace “COVID19” with “HEALTHY”.The argument alternative could be “two.sided”, “increasing” or “decreasing”. Select the best argument for your hypothesis testing.
Correlation analysis
The aim of this step is to analyze gene expression correlation in the different categorical variables. We also provide the code to perform a pairwise gene expression correlation including the viral load as a third variable plotted in a color scale.Timing: 1 daySpearman correlation analysis between gene expression levels:Calculate the Spearman coefficients and plot all pairwise correlations. The example below shows the correlation analysis for four genes (MX1, MX2, ACE2 and TMPRSS2) in SARS-CoV-2 positive and negative patients (Figure 3A):
Figure 3
Correlation analysis
(A) Pairwise Spearman correlation matrix analysis between all genes of interest (MX1, MX2, ACE2 and TMPRSS2), considering different infection status (all patients together (Corr.), SARS-CoV-2 negative (HEALTHY), or SARS-CoV-2 positive patients (COVID19)). Alternative hypothesis: rho (ρ) ≠ 0, which indicates that there is a lineal correlation between the variables under study. Statistical significance: ∗p < 0.05; ∗∗p < 0.01; ∗∗∗p < 0.001.
(B) Scatter plot of MX1 and MX2 gene expression levels, color-coded by viral load (i) and Spearman correlation output in RStudio (ii). Figures 3A and 3B.i were adapted with permission from Bizzotto et al., 2020.
Correlation analysis(A) Pairwise Spearman correlation matrix analysis between all genes of interest (MX1, MX2, ACE2 and TMPRSS2), considering different infection status (all patients together (Corr.), SARS-CoV-2 negative (HEALTHY), or SARS-CoV-2 positive patients (COVID19)). Alternative hypothesis: rho (ρ) ≠ 0, which indicates that there is a lineal correlation between the variables under study. Statistical significance: ∗p < 0.05; ∗∗p < 0.01; ∗∗∗p < 0.001.(B) Scatter plot of MX1 and MX2 gene expression levels, color-coded by viral load (i) and Spearman correlation output in RStudio (ii). Figures 3A and 3B.i were adapted with permission from Bizzotto et al., 2020.Any continuous variable (e.g. viral load expressed as Ct) could be included in the analysis of correlation with gene expression (Bizzotto et al., 2020).Plot gene expression correlation between two genes and include viral load (expressed as Ct) in a color scale (Figure 3B.i) and calculate the Spearman correlation coefficient (Figure 3B.ii)
Patient segregation based on gene expression
Principal-component analysis (PCA) is a dimensionality reduction technique used to increase the interpretability of a given dataset, minimizing information loss and maximizing variance.Timing: 1 dayPrincipal component analysis based on disease status:Calculate the principal component on the log2 transformed gene expression data. The output is a table as shown in Figure 4A, containing the weight of each gene in the variance of the samples for Principal Component 1 (PC1; the largest component of variance in the data set) and Principal Component 2 (PC2; the second most important component influencing the variance):
Figure 4
Patient segregation based on gene expression
(A) Principal component analysis (PCA) output in RStudio, showing the first two principal components explaining most part of the variance among samples.
(B) PCA biplot of gene expression data showing a rough segregation of SARS-CoV-2 positive and negative samples.
(C) 3D scatter plot for the expression of three genes of interest (BSG, MX1 and ACE2). Samples are colored by SARS-CoV-2 status (positive or negative). Ellipsoids represent the 95% confidence interval. Figures 4B and 4C were adapted with permission from Bizzotto et al., 2020.
Patient segregation based on gene expression(A) Principal component analysis (PCA) output in RStudio, showing the first two principal components explaining most part of the variance among samples.(B) PCA biplot of gene expression data showing a rough segregation of SARS-CoV-2 positive and negative samples.(C) 3D scatter plot for the expression of three genes of interest (BSG, MX1 and ACE2). Samples are colored by SARS-CoV-2 status (positive or negative). Ellipsoids represent the 95% confidence interval. Figures 4B and 4C were adapted with permission from Bizzotto et al., 2020.For PCA we only considered expression of the candidate genes, but you might also include any independent variable you consider to affect the variability among samples.Plot PC1 vs. PC2 (Figure 4B):3D graphs showing expression for the selected genes can be plotted as follows. The output is shown in Figure 4C:
Expected outcomes
Plots in Figure 2 depict dot plots of gene expression separating patients according to different phenotypic data, such as disease status (Figure 2A) and age (Figure 2B), and the output of the statistical tests. The color code for these analyses is: red = SARS-CoV-2 negative patients, and blue = SARS-CoV-2 positive patients.In order to evaluate the association between different parameters, such as gene expression and viral load, we performed pairwise correlation analysis explained in the “Correlation Analysis” section. The outcome is a scatter plot and the Spearman coefficient (rho) with the associated P-value for each comparison. Figure 3A shows the pairwise correlation plots and statistics. Figure 3B shows the scatter plot for the correlation between two genes and viral load as color-coded dots.Finally, it is possible to visualize patient segregation by PCA using the selected genes (Bizzotto et al., 2020) as independent variables (Figures 4A and 4B). 3D scatter plots can be done. These plots depict gene expression and 95% confidence ellipsoids (Figure 4C).
Quantification and statistical analysis
Eligibility criteria, statistical tests and software used for this protocol are properly described in the “before you begin” and “step-by-step methods details” sections.
Limitations
This protocol relies on the accuracy of the clinical records submitted to the public repository by the original authors. For some datasets, RNA-sequencing raw counts data may not be available, and data could have been pre-processed by the original authors, thus limiting the decision making of which sequence quality will comply with the own standards, usage of different alignment software and algorithms, reference genome, etc. However, there are still some quality controls that might be done. As we mentioned in (Bizzotto et al., 2020), we detected that some samples had 0 counts for most genes (>70% genes), suggesting a very low quality RNA isolation and/or sequencing. Therefore, we were able to remove them from the analysis. We strongly suggest performing all possible quality controls before analyzing the data.
Troubleshooting
It is important that when working in R, and following this protocol, take into consideration the critical statements and notes for each step provided in the method details.
Problem 1
When running a piece of code, the output is not the expected (i.e., error message in the console or an unexpected feature in the plot).
Potential solution
For the proper execution of the piece of code, the syntax must be accurate, so it is important to check whether there are no syntax mistakes, such as:Forgotten comma.Unclosed bracket, parenthesis, or quotes.Misspelled function or filenames.
Problem 2
There is no count matrix available. The only accessible information are the FASTQ files thus requiring additional processing (e.g., alignment to reference transcriptome) which is not described in the current protocol.Since there are already publications explaining this matter further, we suggest visiting the manuals which describe how to process and align sequencing data before performing expression analyses using HISAT2 (Kim et al., 2019), TopHat2 (Kim et al., 2013), STAR (Dobin et al., 2013) and Salmon (Patro et al., 2017).
Problem 3
The raw counts for a specific dataset are available as a separate file, but it is not clear for the user whether they are raw counts or pre-processed data.It is important to verify that the data represent raw counts. This can be checked by reading the GEO submission description and the Methods section within the corresponding citation.
Problem 4
The phenotypic and count matrices do not contain the same list of samples.It is important to verify that the phenotypic and counts matrices contain the same list of samples in order to merge phenotypic and gene expression data. If not, discard incomplete samples. This can be achieved using the following command, which creates a table only with the samples included in both matrices:
Problem 5
There is a great number of genes with 0 counts.Samples considered to have low quality read sequencing should be removed from the analysis since they may introduce noise to the results. The following code is an example of how this can be achieved in RStudio when samples in the dataset have >70% genes with 0 counts:
Resource availability
Lead contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Geraldine Gueron, ggueron@iquibicen.fcen.uba.ar.
Materials availability
This study did not generate new unique reagents.
Data and code availability
The code generated during this study is available at GitHub repository accessible using the following link: https://github.com/lab-inflamacionycancer/STAR-protocol-Sanchis.et.al.
REAGENT or RESOURCE
SOURCE
IDENTIFIER
Deposited data
In vivo antiviral host response to SARS-CoV-2 by viral load, sex, and age [dataset I]
Authors: Alexander Dobin; Carrie A Davis; Felix Schlesinger; Jorg Drenkow; Chris Zaleski; Sonali Jha; Philippe Batut; Mark Chaisson; Thomas R Gingeras Journal: Bioinformatics Date: 2012-10-25 Impact factor: 6.937
Authors: Daehwan Kim; Joseph M Paggi; Chanhee Park; Christopher Bennett; Steven L Salzberg Journal: Nat Biotechnol Date: 2019-08-02 Impact factor: 54.908
Authors: Tanya Barrett; Stephen E Wilhite; Pierre Ledoux; Carlos Evangelista; Irene F Kim; Maxim Tomashevsky; Kimberly A Marshall; Katherine H Phillippy; Patti M Sherman; Michelle Holko; Andrey Yefanov; Hyeseung Lee; Naigong Zhang; Cynthia L Robertson; Nadezhda Serova; Sean Davis; Alexandra Soboleva Journal: Nucleic Acids Res Date: 2012-11-27 Impact factor: 16.971
Authors: Nicole A P Lieberman; Vikas Peddu; Hong Xie; Lasata Shrestha; Meei-Li Huang; Megan C Mears; Maria N Cajimat; Dennis A Bente; Pei-Yong Shi; Francesca Bovier; Pavitra Roychoudhury; Keith R Jerome; Anne Moscona; Matteo Porotto; Alexander L Greninger Journal: PLoS Biol Date: 2020-09-08 Impact factor: 8.029
Authors: Lena F Schimke; Alexandre H C Marques; Gabriela Crispim Baiocchi; Caroline Aliane de Souza Prado; Dennyson Leandro M Fonseca; Paula Paccielli Freire; Desirée Rodrigues Plaça; Igor Salerno Filgueiras; Ranieri Coelho Salgado; Gabriel Jansen-Marques; Antonio Edson Rocha Oliveira; Jean Pierre Schatzmann Peron; Gustavo Cabral-Miranda; José Alexandre Marzagão Barbuto; Niels Olsen Saraiva Camara; Vera Lúcia Garcia Calich; Hans D Ochs; Antonio Condino-Neto; Katherine A Overmyer; Joshua J Coon; Joseph Balnis; Ariel Jaitovich; Jonas Schulte-Schrepping; Thomas Ulas; Joachim L Schultze; Helder I Nakaya; Igor Jurisica; Otávio Cabral-Marques Journal: Cells Date: 2022-03-01 Impact factor: 7.666