Janaina Macedo-da-Silva1, João Victor Paccini Coutinho1, Livia Rosa-Fernandes1, Suely Kazue Nagahashi Marie2, Giuseppe Palmisano3. 1. GlycoProteomics Laboratory, Department of Parasitology, ICB, University of São Paulo, São Paulo, Brazil. 2. Cellular and Molecular Biology Laboratory (LIM 15), Neurology Department, Faculdade de Medicina FMUSP, Universidade de Sao Paulo, Sao Paulo, Brazil. 3. GlycoProteomics Laboratory, Department of Parasitology, ICB, University of São Paulo, São Paulo, Brazil; School of Natural Sciences, Macquarie University, Sydney, NSW, Australia. Electronic address: palmisano.gp@gmail.com.
Abstract
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was first identified in late 2019 in Wuhan, China, and has proven to be highly pathogenic, making it a global public health threat. The immediate need to understand the mechanisms and impact of the virus made omics techniques stand out, as they can offer a holistic and comprehensive view of thousands of molecules in a single experiment. Mastering bioinformatics tools to process, analyze, integrate, and interpret omics data is a powerful knowledge to enrich results. We present a robust and open access computational pipeline for extracting information from quantitative proteomics and transcriptomics public data. We present the entire pipeline from raw data to differentially expressed genes. We explore processes and pathways related to mapped transcripts and proteins. A pipeline is presented to integrate and compare proteomics and transcriptomics data using also packages available in the Bioconductor and providing the codes used. Cholesterol metabolism, immune system activity, ECM, and proteasomal degradation pathways increased in infected patients. Leukocyte activation profile was overrepresented in both proteomics and transcriptomics data. Finally, we found a panel of proteins and transcripts regulated in the same direction in the lung transcriptome and plasma proteome that distinguish healthy and infected individuals. This panel of markers was confirmed in another cohort of patients, thus validating the robustness and functionality of the tools presented.
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was first identified in late 2019 in Wuhan, China, and has proven to be highly pathogenic, making it a global public health threat. The immediate need to understand the mechanisms and impact of the virus made omics techniques stand out, as they can offer a holistic and comprehensive view of thousands of molecules in a single experiment. Mastering bioinformatics tools to process, analyze, integrate, and interpret omics data is a powerful knowledge to enrich results. We present a robust and open access computational pipeline for extracting information from quantitative proteomics and transcriptomics public data. We present the entire pipeline from raw data to differentially expressed genes. We explore processes and pathways related to mapped transcripts and proteins. A pipeline is presented to integrate and compare proteomics and transcriptomics data using also packages available in the Bioconductor and providing the codes used. Cholesterol metabolism, immune system activity, ECM, and proteasomal degradation pathways increased in infected patients. Leukocyte activation profile was overrepresented in both proteomics and transcriptomics data. Finally, we found a panel of proteins and transcripts regulated in the same direction in the lung transcriptome and plasma proteome that distinguish healthy and infected individuals. This panel of markers was confirmed in another cohort of patients, thus validating the robustness and functionality of the tools presented.
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was first identified in late 2019 in Wuhan, China, and has proven to be highly pathogenic, making it a global public health threat (Mahalmani et al., 2020). There are a variety of manifestations, from asymptomatic to severe cases, that include symptoms such as muscle damage, prolonged tiredness, shortness of breath, loss of taste/smell, and severe pneumonia (Dixon et al., 2021; Nalbandian et al., 2021). Cumulative reports also associate post-acute effects of the infection, including pulmonary, hematological, cardiovascular, neuronal, renal, endocrine, and gastrointestinal sequelae (Hayes, Ingram, & Sculthorpe, 2021; Nalbandian et al., 2021). The quest to understand the infectious mechanisms of SARS-CoV-2, diagnostic alternatives, and treatments have been approached by several techniques (Das, Ahmed, Akhtar, Begum, & Banu, 2021). Among them, omics sciences generate large amounts of information to enable holistic understanding of cell, tissue or organism function and reaction against a disease. These are high-performance techniques able to explore an organism at the level of genes (genomics), proteins (proteomics), metabolites (metabolomics), and lipids (lipidomics). Integration of this vast information provides a comprehensive and powerful tool for exploring the infection mechanism of SARS-CoV-2 (Overmyer et al., 2021; Wu et al., 2021).In the last 10 years, with the advancement of technologies and processing capacity, proteomics and transcriptomics were readily implemented in several laboratories (Martens & Vizcaíno, 2017), while metabolomics and lipidomics were integrated along the years (Wenk, 2005). The more sophisticated data becomes, the greater the demand for new computational methods to deal with them. Currently, numerous tools can assess protein-protein interaction, calculate correlation between expression of different genes, identify enriched pathways and biological processes, and thus, drive data to find the most relevant points (Mangul et al., 2019; Mishra et al., 2021).
Applications
Omics sciences combined with bioinformatics technologies, in particular, contributed to the identification of therapeutic targets (Bojkova et al., 2020; Li et al., 2021); to the elucidation of virus pathophysiological mechanisms (Desai et al., 2020; Rosa-Fernandes et al., 2019; Wu et al., 2020); pathogen–host interactions (Terracciano et al., 2021); detection of patients predisposed to manifestations of severe symptoms (Lazari et al., 2021), selection of prognostic markers (Rosa-Fernandes et al., 2020; Terracciano et al., 2021) and to determine differential gene/protein/metabolite expression profiles (Desai et al., 2020; Leng et al., 2020; Wu et al., 2020). In addition to these fundamental applications, omics tools are also helpful in stratifying clinical manifestations in viral infections (Macedo-da-Silva et al., 2020). Furthermore, integrating data from different biological matrices may offer a comprehensive overview of disease pathogenesis, helping in prioritizing biomarkers and therapeutic targets (Wu et al., 2021; Zhu et al., 2022). Storing, analyzing, and interpreting these data can be a big challenge for researchers. However, it is a worthwhile task to overcome as omics approach enable answering multiple biological questions.Notably the generated raw data in previous studies are cumulatively stored and accessible in public repositories, such as PRIDE (Vizcaíno et al., 2013) and SRA (Kodama et al., 2012), allowing reanalysis by different research groups. Therefore, mastering bioinformatics tools to process, analyze, integrate, and interpret these omic data is a powerful knowledge to build and validate hypotheses. The objective of this study is to present bioinformatics tools that can be used to reanalyze transcriptomic and proteomic data deposited in public databases and integrate them to increase the understanding of the pathophysiology of SARS-CoV-2.
Methodologies
Selected datasets
Four datasets, including proteomics and transcriptomics, were selected from public repositories. Shu et al. (PXD019106) evaluated plasma samples from patients infected with SARS-CoV-2, subdivided into fatal (FT, n
= 5), severe (SV, n
= 7), and moderate (MD, n
= 10) groups by proteomics. In addition, the study included samples from healthy patients (HE, n
= 8) who tested negative on throat swab and serological tests (Shu et al., 2020). Zhong et al. (S-BSST719) determined the plasma proteome profile of 50 individuals with mild and moderate disease. Collections were performed in the first 24 h (D0) after the positive PCR test and after 14 days (D14) with negative PCR result (Zhong et al., 2021). Kwan et al. (PRJNA692253) conducted RNA sequencing analyses of whole blood samples from 64 patients, of whom 45 tested positive for COVID-19 (INF) and 19 healthy participants not exposed to the virus (CTRL) (Kwan et al., 2021). Wu et al. (PRJNA646224) explored the lung and colon tissue transcriptome of patients who died due to COVID-19 (INF, n
= 8) and healthy counterparts were obtained from cancer patients who underwent surgical resection or biopsy (CTRL) (Wu et al., 2020).
Hardware features
The analyzes presented here were performed on an Ubuntu 20.04.3 LTS system (https://www.ubuntu.com/), with a total memory of 62GB, 40 CPUs, 2 threads per core, 10 cores per socket, and 2 sockets. The local storage of raw files required a large amount of space. However, the size of the files varied substantially as it depended on the quality of the data and equipment used.
Data download
In this protocol, we did not address the processing of raw files from proteomic approaches. The protein quantification tables provided in the selected manuscripts were downloaded and analyzed. The detailed pipelines for processing proteomic raw data are available in Kong, Leprevost, Avtonomov, Mellacheruvu, and Nesvizhskii (2017), Carvalho et al. (2016), Guangcan et al. (2021), and Tyanova, Temu, and Cox (2016).For transcriptomic analysis, the fastq raw files were downloaded through the SRA Explorer platform (https://sra-explorer.info/) with the access numbers PRJNA692253 and PRJNA646224. Information regarding each file, such as SRA Accession, instrument, and total bases (Mb) were accessed. The SRA repository stores raw high-throughput sequencing data from different search fields. SRA files consist of raw data and require quality scores per basis for the submitted data. Tools like the sratoolkit download sequencing data from this database. Fastq files are raw data that have a definition line (defline) that contains a read identifier and nucleotide base information, all in text form.After selection, all files from PRJNA692253 were downloaded by clicking on “Add to collection.” A total of 64 SRA files were added. The platform offers the option of downloading .fastq files directly, but also .sra files and metadata. In the “FastQ Downloads” tab, the corresponding URL for all files could be downloaded. Kwan et al. used paired-end sequencing, resulting in 128 fastq files. Accession number PRJNA646224 provides the possibility to download colon and lung transcriptome data. Here, we downloaded only data referring to the lung (CTRL and INF), resulting in a total of 19 fastq files since Wu et al. used single-end sequencing. After creating a directory named “fastq_files” in the terminal console, files were downloaded using the “wget” command.
FastQC
FastQC is a highly recommended software for checking the quality of raw data coming from high-throughput sequencing pipelines. To download the software via the terminal console, “wget” command must be used. The download is also possible by using directly the web browser from the Brabraham Institut website (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). After unzipping the file, turned the tool executable and added FastQC to the path. A directory to allocate FastQC outputs files was created and then the software was executed.At the end of the process, two output files will be created for each fastq file. By opening the generated .html file, the user can access the sample quality control report. The first module presents the results of basic statistics, including Total Sequences, Sequences flagged as poor quality, Sequence length, and %GC. Below is a BoxWhisker chart, which on the y-axis indicates the quality scores and on the x-axis the position in the read. The background indicates the quality of the sequences: green (very good), orange (fair), and red (poor). The “Per sequence quality scores” graph shows the average quality score on the x-axis and the number of sequences with that average on the y-axis. Generally, most samples' readings are expected to have a high-quality score. The parameter “Per base sequence content” reveals the proportion of each position of a base. If the difference between A and T or G and C bases is greater than 10%, the software will report a warning. If the difference is greater than 20%, this module will fail. In “Per sequence GC content,” the GC content measurement is reported. A normal distribution profile is expected. In “Overrepresented Sequences” frequent sequences are reported, indicating that it has some biological relevance or that the library is not diverse or contaminated.
Trim Galore!
Trim Galore! is a platform that performs sample quality control by efficiently removing poor-quality parts of the readings and trimming the adapter. This tool requires the previous installation of cutadapt and FastQC tools. The step of removing adapter sequences may or not be adopted in a transcriptome data analysis pipeline. In the selected datasets, Wu et al. chose to apply Trim Galore! while Kwan et al. started the sequence alignment right after running FastQC. In this pipeline, we applied the tool only to the dataset from Wu et al. study, based on the obtained FastQC report. The command used followed the structure for single-end sequencing data.In the above command, the “-j” option indicates the number of cores to be used for trimming; “-e” corresponds to the maximum allowed error rate; “-q” allows the trimming of low-quality ends from reads in addition to adapter removal; “--stringency” points out the overlap with adapter sequence required to trim a sequence; the “-O” option creates a directory to save the output files. The command “-a” is used to add the adapter sequence. The Trim Galore! can automatically detect whether the Illumina universal, Nextera transposase, or Illumina small RNA adapter sequence was used. However, it is possible to add the adapter sequence manually. When the command is missing, the software auto-detects the adapter.
Reference genome
Before starting the sequence alignment, it is necessary to download the reference genome. As indicated in the original manuscripts, Wu et al. used the h19 genome and Kwan et al. used the GRCh38 genome. The downloaded file will be allocated in the created directory. Then the HISAT2 sequence alignment software will be downloaded, unzipped, and added to the path. The genome will be indexed using the “hisat2-build” script, allowing HISAT2 to perform the read alignment faster in the next step. At the end of the indexing, eight .ht2 files will be created.
Alignment of raw readings
Sequence alignment was performed with HISAT2 software (Kim, Paggi, Park, Bennett, & Salzberg, 2019). Other programs are available for free, such as STAR (Dobin et al., 2013), Bowtie2 (Langmead & Salzberg, 2012), and TopHat2 (Kim et al., 2013). A comparison between the methods was recently reported and can be accessed at Schaarschmidt, Fischer, Zuther, and Hincha (2020) and Musich, Cadle-Davidson, and Osier (2021).The “-p” command indicates the number of threads to perform the alignment; “-x” is used to enter the path of previously generated reference files. It is important not to add the file extension (e.g., .ht2). The “-U” option is used to show the path of unpaired files. To perform paired-end sequencing alignment, the synthesis used is “-1” and “-2” to assign files _1 and _2. The “-S” command designates the .sam file that will be generated. SAM files store aligned sequences and take up much space, so these files can be converted to compressed version .BAM by the samtools software.
Samtools to convert SAM to BAM files
Samtools is a set of programs for interacting with high-throughput sequencing data. It is helpful for converting SAM, BAM and CRAM files. One of the most used commands is the “samtools view,” which takes .BAM/.SAM files as input and converts them to .SAM/.BAM, respectively. The “-S” and “-b” commands are used. The alignment of fastq files occurs in random order with the position in the reference genome. Therefore, in “samtools sort,” the BAM files sorting is performed. The “-o” command indicates the output file.
FeatureCounts
FeatureCounts is a platform that performs reading counting, also called reading summarization, by gene (Liao, Smyth, & Shi, 2014). The htseq-count tool (Anders, Pyl, & Huber, 2015), and featureCounts, are widely used for this purpose. Both share the same file input format (BAM or SAM) and need an annotation file that includes the chromosomal coordinates of features. Among the mandatory arguments required to run the software are “-a” and “-o,” which specifies the name/path of an annotation file and the name of the output file including read counts. Among the optional arguments, we used the “-T” command to indicate the number of threads to be used in the analysis; the “-t” option specifies the feature type in GTF annotation, and “-g” specifies the attribute type in the GTF annotation.
Exploring data with R packages
After obtaining the read counts for each sample and downloading the quantitative proteomics tables, a set of R packages were used to handle the data and identify the relevant biological findings. The packages used are: limma, Glimma, edgeR, Homo.sapiens, metboAnalystR, Enhancedvolcano, combiroc, pROC, clusterProfiler, flashClust, uwot, NbClust, richplot, ggplot2, WCGNA, gprofiler2, ggvenn, DOSE, wordcloud, FactoMineR, factoextra, ggstatsplot, among others. The codes used to generate the data and figures will be available in Supplementary File 1
.
Supplementary File 1
Codes used to perform the analyzes and build the figures presented.
Codes used to perform the analyzes and build the figures presented.The gene expression count files for each sample mapped by the transcriptomics pipeline were gathered in a single directory and, with the “readDGE” function available in the edgeR package, they were merged into a single object DGEList class. The raw counts were transformed into counts per million (CPM) so that the size of the libraries was taken into account and then scaled to log (log CPM). Most samples genes that did not have sufficiently large counts were removed with the “filterByExpr” function. The resulting matrix was normalized by a trimmed mean of M-values (TMM) (Law et al., 2016).The differentially expressed genes and proteins were determined by applying the standard analysis pipeline of the limma package, with p-value correction by the Benjamini-Hochberg method (q-value). The Homo.sapiens package performed the transformation of ENSMBL ID into Gene symbol. Interactive MAplots were plotted with the Glimma package. The table resulting from the analysis of differentially regulated genes/proteins was used as input to visualize of volcano plots by Enhancedvolcano package. The x and y values were the columns logFC and adjusted p-value, respectively. Principal component analysis (PCA) was conducted by applying the FactoMineR package using the “PCA” function. The “fviz_pca_ind()” function, available in the factoextra package, was applied to visualize the result. Samples were identified as outliers using the MetaboAnalystR package, with the “PlotRF.Outlier” function. The input data used were filtered and normalized matrices. Venn diagrams were built with the ggvenn package. The gprofiler2 package was used to perform gene ontology (GO). The “gost” function was applied to access functional enrichments of gene lists by applying a threshold of q-value < 0.05 and “gostplot” to visualize the results, with the capped and interactive options set to TRUE. Only ontologies with 5 or more proteins/genes were considered. The wordcloud package was used to build the word cloud based on the name frequency of ontologies identified by gprofiler2. Bubble plots were built with the ggplot package. The x-axis shows the gene count in each ontology shown on the y-axis. The size of the bubbles indicates the -log of the q-value and the color indicates the ontology category. To access pathways related to differentially expressed genes/proteins, the clusterProfiler and enrichplot packages were used. The “GSA” function with the options minGSSize set to 5 and maxGSSize to 100. The p-value correction method was indicated as “BH” (Benjamini-Hochberg). The regulated pathways network was built with the “cnetplot” function.Data integration was performed by applying the “merge” function to select Gene symbols common between lung transcriptomics and plasma proteomics normalized tables. The “boxplot” function was applied to verify data distribution. Then the “removeBatchEffect” function, available in limma, was applied. The data were then normalized by the “scale” function (z-score). The “umap” function, available in the uwot package, was used to produce a low-dimensional embedding that summarizes the overall structure of high-dimensional data. The parameters adopted were min_dist = 0.01, n_neighbors = 20, n_epochs = 10,000, verbose = 2 and spread = 2. The NbClust package was applied to determine the best number of clusters. Then the data were clustered by the “eclust” function. The cluster number was set to k = 8. The “silhouette” function, available in the cluster package, was applied to assess the clustering quality. Genes and proteins from the same cluster, which have logFC > | 1 | (same direction) and q-value < 0.05 were selected and applied to construct ROC curves based on data from other datasets. The WGCNA package was applied to the expression data from Wu et al. and Kwan et al. The intersection of the ENSEMBL IDs mapped in the two datasets was performed by the “intersect” function. Correlation and connectivity between two datasets were determined using the softPower parameter set to 5. The number of modules was set to 4 and minClusterSize to 30. The “combiroc” package was used to determine the best marker combinations, and the “pROC” package was applied to plot and build the ROC curves.
Results
Given the increasing availability of public omics data, we present a robust and open access pipeline for extracting insights from quantitative biomolecular data (Fig. 1
). Initially, raw files generated by RNA sequencing were downloaded and processed. The proteomics and transcriptomics data were then analyzed to find differently regulated features between the groups. In addition, outlier samples, biological processes, cellular components, molecular functions, and pathways of interest were identified. Transcriptomics data from different matrices were correlated, and proteomics and transcriptomics data from different biological samples were fitted and integrated. Finally, we evaluated through ROC curves the ability to separate samples from infected and healthy individuals based on the panel of relevant proteins and transcripts. Proteins and transcripts with better resolution between the clinical conditions were prioritized and discussed.
Fig. 1
Computational workflow and tools adopted to explore quantitative proteomics and transcriptomics data from plasma, whole blood, and lung of healthy individuals and patients infected with SARS-CoV-2.
Computational workflow and tools adopted to explore quantitative proteomics and transcriptomics data from plasma, whole blood, and lung of healthy individuals and patients infected with SARS-CoV-2.
Plasma proteome presents different patterns according to the patient's COVID-19 grade
After filtering the original data provided by the authors, the PCA plot demonstrated that the plasma proteome of infected patients presented a different pattern than that of healthy patients (HE). The groups SV (severe) and FT (fatal) showed a very similar pattern, which differed from the MD (moderate) group (Fig. 2A). The protein profile of plasma from the same patient collected on day 0 (D0) of infection (PCR positive) and day 14 (D14) (PCR negative) showed similarity in the majority of cases, however few presented distinct recovery pattern (Fig. 2B).
Fig. 2
Exploring the plasma proteome alterations resulting from SARS-CoV-2 infection. (A) Principal component analysis (PCA) indicating the separation between the groups that developed fatal (FT), severe (SV), moderate (MD) COVID-19 disease, and healthy donors (HE); (B) PCA indicating the separation between the same patients, with samples collected on day 0 of infection (D0) and day 14 (D14); (C) Venn diagram showing differently regulated proteins between FTvsHE, SVsvHE and MDvsHE comparisons; (D–G) Volcanos plot of differentially expressed proteins among FTvcHE, SVvsHE, MDvsHE, and D0vsD14 conditions; (H) Venn diagram showing upregulated proteins in the SASRS-CoV-2 infected group in both evaluated studies and (I) Venn diagram showing downregulated proteins in the SASRS-CoV-2 infected group in both evaluated studies.
Exploring the plasma proteome alterations resulting from SARS-CoV-2 infection. (A) Principal component analysis (PCA) indicating the separation between the groups that developed fatal (FT), severe (SV), moderate (MD) COVID-19 disease, and healthy donors (HE); (B) PCA indicating the separation between the same patients, with samples collected on day 0 of infection (D0) and day 14 (D14); (C) Venn diagram showing differently regulated proteins between FTvsHE, SVsvHE and MDvsHE comparisons; (D–G) Volcanos plot of differentially expressed proteins among FTvcHE, SVvsHE, MDvsHE, and D0vsD14 conditions; (H) Venn diagram showing upregulated proteins in the SASRS-CoV-2 infected group in both evaluated studies and (I) Venn diagram showing downregulated proteins in the SASRS-CoV-2 infected group in both evaluated studies.A total of 453 proteins were differentially regulated in the comparison between FT and HE (Fig. 2D); 424 between SV and HE (Fig. 2E), and 403 between MD and HE (Fig. 2F). Among patients in groups D0 and D14, 421 regulated proteins were identified (Fig. 2G). Comparing the proteins upregulated and downregulated in all comparisons (Fig. 2H and I), 6 proteins (C1QA, CD93, FLT4, INHBC, ENPP2, and NID1) are identified upregulated in the infected groups, indicating a possible role in the SARS-CoV-2 pathology (Fig. 3
).
Fig. 3
Boxplot of upregulated proteins in SARS-CoV-2 infection. (A) Upregulated proteins in the groups that developed fatal (FT), severe (SV), and moderate (MD) COVID-19; (B) Upregulated proteins at the onset of COVID-19 (D0).
Boxplot of upregulated proteins in SARS-CoV-2 infection. (A) Upregulated proteins in the groups that developed fatal (FT), severe (SV), and moderate (MD) COVID-19; (B) Upregulated proteins at the onset of COVID-19 (D0).The boxplot plots (Fig. 3) showed that in the three degrees of infectious disease severity (FT, SV, MD) the expression level of the six proteins were higher than the healthy group (HE). The intra-patient comparisons (Fig. 3B) showed that after 14 days (D14) of disease onset (negative PCR), the levels of these proteins were lower than at the time of diagnosis (D0).
Plasma proteome analysis indicates dysregulation of immune system and cholesterol metabolism
Enriched GO terms were associated to the dysregulation of immune system, proteasome, cytoskeleton, cell adhesion, and lipoprotein metabolism (Fig. 4A). Upregulated proteins in the infected groups (FT, SV, MD, and D0) are related to the activation of the immune system, especially with the response mediated by leukocytes. Cytoskeleton dynamics and proteasomal degradation were also upregulated (Fig. 4B). Processes related to lipid metabolism and cell adhesion and migration are downregulated (Fig. 4C). Pathway analysis was performed using differentially regulated proteins and confirmed findings from GO analysis (Fig. 4D and E). In fact, downregulated proteins in the infected groups (FT, SV, MD, and D0) are related to cholesterol and lipoprotein metabolism. The ECM-proteoglycans pathway and cell adhesion molecules were also associated with downregulated features. On the other hand, the pathways of infection, immune system, apoptosis, and extracellular matrix (ECM) organization are upregulated.
Fig. 4
Gene ontology and GSEA analysis of differentially regulated plasma proteins. (A) Word cloud generated from differentially regulated identified terms (gene count > 5 and q-value < 0.05); (B–C) Biological processes, cellular components, and molecular functions of upregulated and downregulated proteins in infected groups, respectively; (D) Network of proteins that participate in the main enriched pathways. Red and blue dots indicate upregulated and downregulated proteins, respectively; (E) Pathways associated with differentially regulated proteins.
Gene ontology and GSEA analysis of differentially regulated plasma proteins. (A) Word cloud generated from differentially regulated identified terms (gene count > 5 and q-value < 0.05); (B–C) Biological processes, cellular components, and molecular functions of upregulated and downregulated proteins in infected groups, respectively; (D) Network of proteins that participate in the main enriched pathways. Red and blue dots indicate upregulated and downregulated proteins, respectively; (E) Pathways associated with differentially regulated proteins.
Lung and whole blood transcriptome reinforce findings in the plasma proteome
PCA analysis based on blood transcriptome from CTRL and INF patients with different degrees of symptoms did not show a clear separation (Fig. 5A). On the other hand, analysis of lung tissue from patients with fatal COVID-19 (INF) and healthy counterparts (CTRL) showed separation between groups (Fig. 5B). A total of 1039 genes were identified differentially expressed in the blood transcriptome (Fig. 5C) and 1034 in the lung tissue analysis (Fig. 5D) (logFc | 1 | and q-value < 0.05). A higher number of enriched GO terms (q value < 0.05) were found in the lung tissue compared to the blood transcriptome giving more information about the COVID-19 pathophysiology. Immune response mediated by leukocytes, apoptotic processes, collagen degradation, and viral life cycle were among the upregulated enriched GO terms corroborating the findings of plasma proteomics. On the other hand, there was an enrichment associated with misfolded proteins that were not identified in plasma (Fig. 5E). Ontologies related to downregulated genes showed greater differences with respect to proteomic findings, although processes such as cell adhesion are common to both. Terms related to cell development and differentiation, especially of the brain components, are highlighted (Fig. 5F). Processes related to lipid and cholesterol metabolism were not identified.
Fig. 5
Alteration of blood and lung transcriptome due to SARS-CoV-2 infection. (A) Principal component analysis (PCA) based on blood transcriptome from healthy (CTRL) and infected patients (INF); (B) PCA showing lung transcriptome-based separation of groups that developed fatal COVID-19 (INF) and non-infected individuals (CTRL); (C) Volcano plot indicating differentially expressed genes identified in blood; (D) Volcano plot indicating differentially expressed genes identified in the lung; Gene ontology (GO) analysis of upregulated (E) and downregulated (F) genes in the INF (blood and lung) groups.
Alteration of blood and lung transcriptome due to SARS-CoV-2 infection. (A) Principal component analysis (PCA) based on blood transcriptome from healthy (CTRL) and infected patients (INF); (B) PCA showing lung transcriptome-based separation of groups that developed fatal COVID-19 (INF) and non-infected individuals (CTRL); (C) Volcano plot indicating differentially expressed genes identified in blood; (D) Volcano plot indicating differentially expressed genes identified in the lung; Gene ontology (GO) analysis of upregulated (E) and downregulated (F) genes in the INF (blood and lung) groups.
The integration of omics data identifies clusters of proteins and genes related to key processes in infection
Weighted Gene Coexpression Network Analysis (WGCNA) is a popular method applied to identify correlation patterns between genes in samples. Integration of the 5000 common genes between lung and blood transcriptome did not show high modulus preservation between studies (Fig. 6A). However, the green (Fig. 6B) and turquoise (Fig. 6C) modules have 5 < Z < 10, indicating moderate preservation. By selecting and working with the 400 key genes (labeled on the basis of kME) differentially expressed in the assessed groups of the turquoise and green modules, we identified enriched biological processes and pathways. The green module has genes related to leukocyte activation and proteasome activity, which was also identified in the turquoise module (Fig. 6D). However, it was also possible to verify processes linked to the virus, including proteins associated with spike viral glycoprotein maturation and deregulation of host-glycosylation (Fig. 6E), a key process during infection.
Fig. 6
Weighted gene co-expression network analysis (WGCNA). (A) Z-score for modules preserved between lung transcriptome and whole blood; (B–C) Scatter plot indicating the LogFC of the 400 genes comprising the green and turquoise modules (Z > 5), respectively, selected by the topGenesKME function; (D–E) Network indicating differentially expressed proteins from the turquoise and green modules. The donut plot indicates enriched processes and pathways (q-value < 0.05).
Weighted gene co-expression network analysis (WGCNA). (A) Z-score for modules preserved between lung transcriptome and whole blood; (B–C) Scatter plot indicating the LogFC of the 400 genes comprising the green and turquoise modules (Z > 5), respectively, selected by the topGenesKME function; (D–E) Network indicating differentially expressed proteins from the turquoise and green modules. The donut plot indicates enriched processes and pathways (q-value < 0.05).The clustering of 421 features common to lung transcriptome and plasma proteome data from healthy (CTRL and HE) and fatal (INF and FT) COVID-19 patients resulted in 8 clusters (Fig. 7A and B). The silhouette plot indicates genes possibly associated with the wrong cluster (n
= 27) (Fig. 7C). These genes were disregarded in the enrichment analyses. Clusters 3, 4, 6, 7, and 8 show the highest correlation between the LogFC of the proteome and transcriptome (Fig. 7D). Considering only genes regulated in the same direction in both matrices, we identified enriched biological pathways and processes. Cluster 1 highlights terms associated with ECM; cluster 2, terms linked to the immune system and enzymatic activity; cluster 3, cytoskeletal activities; cluster 4, lipid metabolism; cluster 5, immune system and leukocyte activation; cluster 6, proteasomal degradation, immune system, and enzyme activity; cluster 7, complement system and cluster 8, glucose metabolism (Fig. 7E).
Fig. 7
Integration of lung transcriptomics and plasma proteomics data. (A) UMAP plot indicating the clustering of 421 genes/proteins identified in both evaluated datasets; (B) Dendrogram indicating the separation of 8 clusters identified; (C) Silhouette graph showing the quality of clustering applied. Negative values indicate genes that were possibly wrongly clustered. These genes were disregarded; (D) Scatter plot representing the LogFC of the proteome (x) and transcriptome (y) analysis; (E) Gene ontology and enriched pathways associated with identified clusters. The size of the bubbles indicates the -log-q-value and the color represents the associated cluster.
Integration of lung transcriptomics and plasma proteomics data. (A) UMAP plot indicating the clustering of 421 genes/proteins identified in both evaluated datasets; (B) Dendrogram indicating the separation of 8 clusters identified; (C) Silhouette graph showing the quality of clustering applied. Negative values indicate genes that were possibly wrongly clustered. These genes were disregarded; (D) Scatter plot representing the LogFC of the proteome (x) and transcriptome (y) analysis; (E) Gene ontology and enriched pathways associated with identified clusters. The size of the bubbles indicates the -log-q-value and the color represents the associated cluster.The transcripts and proteins differentially regulated with a logFC > | 1 | in the same direction (upregulation or downregulation) in both the proteome and the transcriptome were selected for application of the ROC curve (Fig. 8
). The targets with the highest specificity and sensitivity were selected based in the integrated expression dataset (plasma proteome and lung transcriptome) (Fig. 8A). Additionally, the predictive power of these targets to stratify infected and healthy conditions was tested in another cohort (plasma proteome of D0 and D14 and whole blood transcriptome from INF and CTRL).
Fig. 8
ROC curves based on proteins involved in SARS-CoV-2 pathology. (A) Bubble chart indicating combinations between proteins/genes that present better results (golden combinations); (B) ROC curves of gold combinations indicating the potential for protein separation in distinguishing between groups D0 (PCR positive, day 01) and D14 (PCR negative, day 14); (C) ROC curves of gold combinations indicating the gene splitting potential in distinguishing between infected (INF) and healthy (CTRL) groups from whole blood transcriptomics analysis; (D–E) Boxplots indicating the separation of patients based on the selected protein/gene.
ROC curves based on proteins involved in SARS-CoV-2 pathology. (A) Bubble chart indicating combinations between proteins/genes that present better results (golden combinations); (B) ROC curves of gold combinations indicating the potential for protein separation in distinguishing between groups D0 (PCR positive, day 01) and D14 (PCR negative, day 14); (C) ROC curves of gold combinations indicating the gene splitting potential in distinguishing between infected (INF) and healthy (CTRL) groups from whole blood transcriptomics analysis; (D–E) Boxplots indicating the separation of patients based on the selected protein/gene.Three biomolecules gave the best AUC: TIMP1, a natural inhibitor of the matrix metalloproteinases involved in extracellular matrix remodeling, cell proliferation and regulated in response to cytokines; TNXB, tenascin XB, an extracellular matrix glycoprotein involved in wound healing, and PTPRS, protein tyrosine phosphatase receptor type S, the cell surface receptor that binds to glycosaminoglycans and involved in cell-cell interaction and cell growth and differentiation (Fig. 8B). The combination of the three showed an AUC greater than 0.8. At the gene level, the same combination resulted in AUC higher than 0.9 (Fig. 8C). Other combinations that showed AUC values close to 1 were PTPRS and TIMP1; and PTPRS and SPP1, a sialoprotein related to lymphocytic activation (Fig. 8D and E).
Discussion
To explore and understand the mechanisms of SARS-CoV-2 infection, the virus responsible for the death of more than 5 million people worldwide (https://covid19.who.int/), we present here an optimized pipeline for omics data analysis (Fig. 1), as well as extract meaningful information from high-throughput techniques. Dealing with codes in command line may not be a common expertise; however, being able to access and interpret public data is of great advantage for performing in silico validations and designing experiments in the wet lab to test hypotheses. We present the application of computational tools to analyze raw transcriptomics data and funnel the findings to access key disease processes. In addition, we integrated quantitative proteomics data to improve our understanding of COVID-19.The rapid need to understand the mechanisms and impacts of SARS-CoV-2 makes omics strategies stand out as they can offer a holistic and comprehensive view of thousands of molecules (Colinge & Bennett, 2007; Haas, Muralidharan, Krogan, Kaake, & Hüttenhain, 2021). The ability to study alterations of a large number of proteins and transcripts in a single experiment is attractive (Al-Amrani, Al-Jabri, Al-Zaabi, Alshekaili, & Al-Khabori, 2021). Furthermore, accessing the modulation of molecules based on a clinical status can answer many questions about a given disease (Amiri-Dashatan, Koushki, Abbaszadeh, Rostami-Nejad, & Rezaei-Tavirani, 2018). However, to access these countless possibilities offered by omics, it is necessary to conduct information processing to reduce noisy data and focus only on what is relevant. Computational and bioinformatic analyses complement the sample preparation and data acquisition steps allowing a comprehensive visualization of dysregulated pathways (Cannataro, 2007).The mechanisms altered by SARS-CoV-2 during infection can be accessed through the application of shotgun proteomics (Badua, Baldo, & Medina, 2021; Rais, Fu, & Drabovich, 2021). In addition, this approach can provide the quantification of viral proteins in complex clinical samples (Bojkova et al., 2020; Lachén-Montes, Corrales, Fernández-Irigoyen, & Santamaría, 2020; Lazari et al., 2021). Reports of sequelae resulting from viral infection are increasingly frequent. Chen et al. (2021) applied a proteomic approach monitoring the health status of patients who developed COVID-19. The study showed changes in proteins related to cholesterol metabolism and heart disease. Our in silico analyses resulting from the application of the deft pipeline detailed here showed an alteration in processes linked to lipoproteins and cholesterol. These results are in agreement with changes in the HDL proteome associated with COVID-19 complications (Souza Junior et al., 2021). The data obtained in this manuscript show that the plasma proteome reflects the immunological status of infected patients, since pathways and proteins were identified as increased in these groups. The intra-patient comparison at different time points showed there is an activation of the immune system, especially of pathways linked to leukocytes in the first 24 h. However, these responses are diminished after 14 days. Other studies also reported the ability of the plasma and serum proteome to reflect the action of the immune system in the infection (Arthur et al., 2021; Kumar, 2021; Villar et al., 2021; Zhong et al., 2021). Among the immune system proteins that stood out were C1QA, which is crucial in the activation of the classical pathway of the complement system (Ghebrehiwet, Hosszu, Valentino, & Peerschke, 2012; Kouser et al., 2015; Nayak, Pednekar, Reid, & Kishore, 2012), and CD93, involved in the regulation of phagocytosis of apoptotic cells in vivo. Transcriptomics approaches have also been extensively applied to study SARS-CoV-2 infection (Butler et al., 2021; Chakraborty, Sharma, Bhattacharya, Zayed, & Lee, 2021; Islam et al., 2021; O'Donnell et al., 2021; Sun et al., 2020; Wong et al., 2021). We identified regulated pathways linked to misfolded proteins in the lung of infected patients, as previously reported (Rosa-Fernandes et al., 2021). Moreover, we identified an increase in the enzymes involved ECM remodeling, extravasation of intracellular contents and activation of the immune system (Leeming et al., 2021; Leng et al., 2020).Leukocyte activation was recurrent in our analyses of the integrated proteomic and transcriptomic data (Alon et al., 2021; Coradi & Vieira, 2021). Regarding our data integration approach, we saw that there is little correlation between transcripts mapped in lung and whole blood transcriptome. However, looking at moderately conserved transcripts, we found processes linked to leukocyte activation and host glycosylation. Our integrated data also mapped transcripts and proteins linked to proteasome activity in the infected groups. In fact, other groups have identified the importance of the proteasome in COVID-19 from the application of other techniques (Chatterjee et al., 2020; Longhitano et al., 2020; Wang et al., 2021).
Conclusions
We conducted integrated omic data analysis with focus on open tools to analyze and interpret this data type. We presented the entire pipeline for handling from raw files to differentially expressed features using the FastQC, Trim Galore!, HISAT2, samtools, featureCounts and R softwares. We also looked at useful tools to explore processes and pathways related to mapped transcripts and proteins. An important point was establishing a pipeline to integrate proteomics and transcriptomics data, since adjusting these data to make them comparable can be a big challenge. We also consistently reported applying various packages available in the bioconductor, providing the codes used. Regarding the biological findings, we evidenced increase in the cholesterol metabolism, immune system activity, ECM and proteasome degradation increased in infected patients. We also noticed a leukocyte activation profile in both proteomics and transcriptomics data. Finally, we identified a panel of proteins and transcripts that are regulated in the same direction in the lung transcriptome and plasma proteome that have a great ability to distinguish between healthy and infected groups. This panel of markers was applied to another cohort of patients and showed good results, corroborating the robustness and usefulness of the computational tools presented in this manuscript.
Authors: Daehwan Kim; Joseph M Paggi; Chanhee Park; Christopher Bennett; Steven L Salzberg Journal: Nat Biotechnol Date: 2019-08-02 Impact factor: 54.908
Authors: Livia Rosa-Fernandes; Raquel Hora Barbosa; Maria Luiza B Dos Santos; Claudia B Angeli; Thiago P Silva; Rossana C N Melo; Gilberto Santos de Oliveira; Bernardo Lemos; Jennifer E Van Eyk; Martin R Larsen; Claudete Araújo Cardoso; Giuseppe Palmisano Journal: J Proteome Res Date: 2020-08-04 Impact factor: 4.466
Authors: Brian E Dixon; Kara K Wools-Kaloustian; William F Fadel; Thomas J Duszynski; Constantin Yiannoutsos; Paul K Halverson; Nir Menachemi Journal: PLoS One Date: 2021-03-24 Impact factor: 3.240
Authors: Abul Bashar Mir Md Khademul Islam; Md Abdullah-Al-Kamran Khan; Rasel Ahmed; Md Sabbir Hossain; Shah Md Tamim Kabir; Md Shahidul Islam; A M A M Zonaed Siddiki Journal: J Transl Med Date: 2021-01-07 Impact factor: 5.531