
alona: a web server for single-cell RNA-seq analysis.

Oscar Franzén¹, Johan L M Björkegren¹,².

Abstract

SUMMARY: Single-cell RNA sequencing (scRNA-seq) is a technology to measure gene expression in single cells. It has enabled discovery of new cell types and established cell type atlases of tissues and organs. The widespread adoption of scRNA-seq has created a need for user-friendly software for data analysis. We have developed a web server, alona, that incorporates several of the most popular single-cell analysis algorithms into a flexible pipeline. alona can perform quality filtering, normalization, batch correction, clustering, cell type annotation and differential gene expression analysis. Data are visualized in the web browser using an interface based on JavaScript, allowing the user to query genes of interest and visualize the cluster structure. alona accepts a compressed gene expression matrix and identifies cell clusters with a graph-based clustering strategy. Cell types are identified from a comprehensive collection of marker genes or by specifying a custom set of marker genes.
AVAILABILITY AND IMPLEMENTATION: The service runs at https://alona.panglaodb.se and the Python package can be downloaded from https://oscar-franzen.github.io/adobo/.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Year: 2020 PMID: 32324845 PMCID: PMC7320629 DOI: 10.1093/bioinformatics/btaa269

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Single-cell RNA sequencing (scRNA-seq) is a powerful technology for measuring gene expression in single cells, and it provides more detailed information than bulk RNA-seq (Sandberg, 2014). A typical scRNA-seq experiment generates hundreds to thousands of transcriptomes. The rapid rise of scRNA-seq has created a wealth of data and, in parallel, an increasing need for user-friendly data analysis software. A web server for analysis of scRNA-seq data makes such analyses accessible to researchers without requiring them to learn programming. Here, we describe alona, a public and fully automated web service whose core is written in the Python 3 programming language. It can be used to analyze, annotate and visualize scRNA-seq data. The tool takes advantage of a wide range of state-of-the-art scRNA-seq methods, normalization schemes and clustering algorithms as well as an intuitive web interface for data exploration. The web server accepts a compressed gene expression matrix in plain text format. The uploaded data are queued, processed and analyzed, often within an hour depending on the workload. Results are visualized in the web browser using a light-weight JavaScript library, which allows exploring cell clusters and gene expression using simple interactions. In addition, the analysis script is always provided so that the user can examine the code needed to reproduce the results.

2 Materials and methods

The analysis framework (named adobo; https://oscar-franzen.github.io/adobo/) is written in Python and runs on a virtual private server shared with the PanglaoDB web server (Franzén et al., 2019). The backend is based on the LEMP stack. Jobs are queued and executed serially. The web interface allows the user to upload data and select analysis parameters (the default parameters are sensible and fit most experiments). During the data upload, the web server checks for data consistency and reports problems to the user. A typical experiment of ∼3000 cells is processed within 10 min; the user can optionally specify an e-mail address to be notified when the analysis is completed. The web server does not require registration to be used; uploaded data are kept confidential and are automatically deleted after 7 days. Data are only visible within the scope of the present browser session, which is identified using a cookie containing a random string. Pre-processing of the raw sequencing data (barcode demultiplexing, alignment and deduplication of unique molecular identifiers) is performed beforehand using external bioinformatics tools. The input data must be raw read counts in a matrix with genes as rows and cells as columns; the input file must also be compressed with gzip, zip, bzip2 or xz. The matrix may or may not have a header. Fields are separated by tabs, spaces or commas. The Matrix Market format is also supported (https://math.nist.gov/MatrixMarket/formats.html), in which case the input file should be a tar.gz archive containing three files: matrix.mtx.gz, barcodes.tsv.gz and genes.tsv.gz. An overview of the analysis steps is shown in Figure 1 and the main steps are described here:

Fig. 1.

Flowchart showing the main analysis steps in alona. (A) Global overview. (B) A detailed overview of the cell clustering process

Quality filtering. Low-quality cells are initially removed using simple thresholds (minimum number of total reads). Subsequently, the quality filtering approach from Lun et al. (2016) is applied. Cells are removed based on two quality metrics: (i) the log of the library size and (ii) the log of the number of detected genes. The median and median absolute deviation (MAD) are computed for (i) and (ii). A cell is removed if (i) or (ii) falls more than a defined number of MADs (default is 3) below the median. Uninformative genes are removed by requiring each gene to be expressed in a certain percentage of cells (default is 1%). Doublet detection is performed in this step using the Scrublet package (Wolock et al., 2019).

Normalization. Four normalization procedures are supported: (i) standard normalization (simple scaling of counts by library size); (ii) full-quantile normalization; (iii) centered log-ratio normalization; and (iv) variance-stabilizing normalization (Hafemeister and Satija, 2019). The standalone Python package also supports adjustment by gene length (RPKM).

Batch correction (optional). The user can supply a list of batches (one per cell) to correct for known batch effects using the ComBat algorithm (Johnson et al.). An alternative to ComBat is to directly regress out batch effects using the function adobo.dr.regress.

Feature selection. Highly variable genes (HVG) are discovered using either: (i) a Seurat-like strategy, utilizing binning of genes according to average expression (Butler et al., 2018) or (ii) the method described by Brennecke et al. Three additional methods are supported in the standalone package (Andrews and Hemberg, 2019; Chen et al.; Lun et al.). The default is to find 1000 HVG.

Dimensionality reduction. Principal component (PC) analysis is performed on the HVG with the method described by Baglama and Reichel (2005). The default setting is to identify 40 PCs. The Python package also supports the jackstraw method for identifying the optimal number of PCs to use.
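To make the MAD-based cell filter from the quality filtering step concrete, the following is a minimal sketch in plain Python. It is not the adobo implementation: the function names (`mad`, `is_low_outlier`, `keep_cells`) are hypothetical, the MAD is used unscaled (whether a consistency constant is applied is an assumption not settled by the text), and all counts are assumed to be positive so that the logarithm is defined.

```python
from math import log
from statistics import median


def mad(values):
    """Median absolute deviation (unscaled) of a list of numbers."""
    center = median(values)
    return median(abs(v - center) for v in values)


def is_low_outlier(values, n_mads=3.0):
    """Flag values lying more than n_mads MADs below the median."""
    cutoff = median(values) - n_mads * mad(values)
    return [v < cutoff for v in values]


def keep_cells(library_sizes, genes_detected, n_mads=3.0):
    """Return indices of cells that pass both log-scale quality metrics."""
    low_size = is_low_outlier([log(x) for x in library_sizes], n_mads)
    low_genes = is_low_outlier([log(x) for x in genes_detected], n_mads)
    return [i for i, (a, b) in enumerate(zip(low_size, low_genes))
            if not (a or b)]


# Toy data: the last cell has a very small library and few detected genes.
sizes = [1000, 1100, 900, 1050, 950, 10]
genes = [500, 520, 480, 510, 490, 400]
kept = keep_cells(sizes, genes)
```

In this toy example only the last cell is flagged, because both of its metrics fall far below the median on the log scale. Working on the log scale keeps a few very large libraries from inflating the spread estimate.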
The 2D embedding is performed on the PCs with t-Distributed Stochastic Neighbor Embedding (t-SNE) (van der Maaten and Hinton, 2008) (perplexity is set to 30 as default) or Uniform Manifold Approximation and Projection (UMAP) (Becht et al.).

Clustering. The PCs are searched for k-nearest neighbors using the BallTree algorithm. A shared nearest neighbor graph, with weights as the number of shared neighbors, is generated and pruned. Cell clusters are identified from the graph with the Leiden (Traag et al., 2019), Louvain or Walktrap (Pons and Latapy, 2005) algorithms. For Leiden and Louvain, cluster resolution is set to 0.6 as default (decreasing this value gives larger clusters and vice versa).

Cell type annotation. The method for cell type annotation was described in Franzén et al. (2019). Annotation of cell types is performed at the cluster level. Cluster-level analysis is faster than cell-level analysis since not every cell needs to be considered; it also reduces the impact of molecular dropout events and cell doublet artifacts, which frequently contaminate scRNA-seq data. Gene expression in clusters is represented by taking the median across all cells. The procedure estimates gene expression activity of a set of marker genes and then ranks the resulting cell types. Significance is determined by computing a one-sided Fisher’s exact test for each cell type and adjusting P-values with the Benjamini–Hochberg procedure. An acceptable false-discovery rate was chosen to be 10%. Thus, if the adjusted P-value is higher than 0.1, the cell type receives an ‘Unknown’ annotation. Custom marker genes can be entered or the user can choose to simply use markers from PanglaoDB. The latter option only supports mouse and human data. This function is implemented in adobo.bio.cell_type_predict.

Differential gene expression. The first step involves all-versus-all cluster comparisons; i.e. every gene is compared between every pair of clusters.
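The shared nearest neighbor graph used in the clustering step can be illustrated with a small sketch in plain Python. Several details here are assumptions rather than a description of adobo: a brute-force neighbor search stands in for BallTree, a cell is not counted among its own neighbors, every pair of cells is considered, and the pruning rule (drop edges with fewer than `min_shared` shared neighbors) is hypothetical since the text does not state the exact criterion.

```python
from math import dist


def k_nearest(points, k):
    """Brute-force k nearest neighbours of each point (self excluded)."""
    neighbours = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: dist(p, points[j]))
        neighbours.append(set(others[:k]))
    return neighbours


def snn_graph(points, k, min_shared=1):
    """Edges {(i, j): weight}, weight = number of shared neighbours.

    Edges with fewer than min_shared shared neighbours are pruned.
    """
    nn = k_nearest(points, k)
    edges = {}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            shared = len(nn[i] & nn[j])
            if shared >= min_shared:
                edges[(i, j)] = shared
    return edges


# Toy coordinates standing in for PC scores: two well-separated groups.
cells = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
graph = snn_graph(cells, k=2)
```

In this toy example no edge connects the two groups, because cells from different groups share no neighbors. A community detection algorithm such as Leiden or Louvain would then be run on the weighted edge list.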
Two methods are available for generating the initial set of comparisons: (i) linear models followed by t-tests, similar to the limma R package (Ritchie et al.) or (ii) Wilcoxon tests, as a non-parametric option. The latter is computationally much slower, since the t-tests are implemented using vectorized operations. To generate a single P-value for every gene, the pairwise P-values for that gene are combined using Fisher’s method. Multiple testing correction is then applied with the Benjamini–Hochberg procedure. Tests are subsequently filtered based on two criteria: (i) adjusted P-value < 0.01 and (ii) the number of cells expressing the gene in the cluster must be above a specified threshold (default is 80%).

Results can be downloaded as a tar.gz archive as well as visualized in the web browser. Supplementary Figure S1 shows an overview of the interface and contains descriptions of analysis output files.
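The post-processing of the pairwise differential expression tests (combining P-values with Fisher’s method, Benjamini–Hochberg adjustment and the two filters) can be sketched as follows. This is an illustrative sketch in plain Python, not the adobo code: the function names, the input dictionaries and the gene labels are hypothetical, and the pairwise P-values are taken as given rather than computed. Fisher’s statistic, -2 Σ ln p, follows a chi-square distribution with 2k degrees of freedom under the null for k independent tests, and for even degrees of freedom its tail probability has the closed form used below.

```python
from math import exp, log


def fisher_combine(pvalues):
    """Combine k P-values with Fisher's method (chi-square, 2k df)."""
    k = len(pvalues)
    # half_stat is X/2, where X = -2 * sum(ln p); floor avoids log(0).
    half_stat = -sum(log(max(p, 1e-300)) for p in pvalues)
    # Closed-form chi-square tail for even df: exp(-x/2) * sum (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, exp(-half_stat) * total)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def significant_genes(pairwise_p, frac_expressing, alpha=0.01, min_frac=0.8):
    """Genes passing both filters: adjusted P < alpha, fraction > min_frac."""
    genes = list(pairwise_p)
    combined = [fisher_combine(pairwise_p[g]) for g in genes]
    adjusted = benjamini_hochberg(combined)
    return [g for g, q in zip(genes, adjusted)
            if q < alpha and frac_expressing[g] > min_frac]


# Hypothetical input: one cluster compared against three other clusters.
pairwise_p = {"GeneA": [1e-6, 1e-5, 1e-4],
              "GeneB": [0.5, 0.6, 0.7],
              "GeneC": [1e-6, 1e-6, 1e-6]}
frac_expressing = {"GeneA": 0.9, "GeneB": 0.95, "GeneC": 0.5}
hits = significant_genes(pairwise_p, frac_expressing)
```

Here GeneB fails the P-value criterion and GeneC is expressed in too few cells of the cluster, so only GeneA is retained.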

3 Results and discussion

3.1 Test case: PBMC

To demonstrate the utility of alona, we applied it to a dataset consisting of 8381 peripheral blood mononuclear cells (PBMC). The dataset came from a healthy human donor and was originally generated by 10X Genomics. Cells were clustered with default settings into 20 groups. Supplementary Figure S2 shows a UMAP plot of the data (colors correspond to clusters). Six cell types were identified (number of cells in parentheses): T memory cells (3404), monocytes (2224), NK cells (1331), B cells (1222), platelets (91) and plasmacytoid dendritic cells (66). The identified cell types are commonly found in blood, and their proportions were consistent with the typical proportions reported in PBMC samples (Bolen et al.).

3.2 Comparison with existing web servers

A number of important web servers for scRNA-seq analysis have been developed, such as ASAP (Gardeux et al., 2017), SCRAT (Ji et al.), iS-CellR (Patel, 2018), Granatum (Zhu et al., 2017) and Single Cell Explorer (Feng et al.). The functionality of alona is comparable to the aforementioned services, with some notable differences: alona offers more choices in terms of algorithms; the clustering strategy is graph-based; and cell type prediction, a key goal in most single-cell experiments, is always performed. In addition, the backends of previously published web servers cannot, in most cases, be executed standalone, which makes it difficult or impossible to reproduce results. Every analysis run by alona can be reproduced offline since the Python code for the analysis is always provided. Finally, alona automatically recognizes the Matrix Market format, which is common in NCBI’s Gene Expression Omnibus. Supplementary Figure S3 shows a comparison matrix where key features are compared with five other web servers.

4 Conclusions

We have presented alona, a user-friendly web server for scRNA-seq analysis. Development of alona will continue and we plan to expand the number of supported algorithms and analysis strategies.

Funding

This work was supported by the Karolinska Institutet & AstraZeneca Integrated Cardio Metabolic Centre (to J.L.M.B.); the Fondation Leducq – Transatlantic PlaqOmics Network (to J.L.M.B.); Hjärt- och Lungfonden [20170265 to J.L.M.B.]; and Vetenskapsrådet [2018-02529 to J.L.M.B.]. Conflict of Interest: none declared.

18 in total

1. Entering the era of single-cell transcriptomics in biology and medicine.

Authors: Rickard Sandberg
Journal: Nat Methods Date: 2014-01 Impact factor: 28.547

2. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor.

Authors: Aaron T L Lun; Davis J McCarthy; John C Marioni
Journal: F1000Res Date: 2016-08-31

3. Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data.

Authors: Samuel L Wolock; Romain Lopez; Allon M Klein
Journal: Cell Syst Date: 2019-04-03 Impact factor: 10.304

4. PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data.

Authors: Oscar Franzén; Li-Ming Gan; Johan L M Björkegren
Journal: Database (Oxford) Date: 2019-01-01 Impact factor: 3.451

5. Integrating single-cell transcriptomic data across different conditions, technologies, and species.

Authors: Andrew Butler; Paul Hoffman; Peter Smibert; Efthymia Papalexi; Rahul Satija
Journal: Nat Biotechnol Date: 2018-04-02 Impact factor: 54.908

6. Granatum: a graphical single-cell RNA-Seq analysis pipeline for genomics scientists.

Authors: Xun Zhu; Thomas K Wolfgruber; Austin Tasato; Cédric Arisdakessian; David G Garmire; Lana X Garmire
Journal: Genome Med Date: 2017-12-05 Impact factor: 11.117

7. ASAP: a web-based platform for the analysis and interactive visualization of single-cell RNA-seq data.

Authors: Vincent Gardeux; Fabrice P A David; Adrian Shajkofci; Petra C Schwalie; Bart Deplancke
Journal: Bioinformatics Date: 2017-10-01 Impact factor: 6.937

8. iS-CellR: a user-friendly tool for analyzing and visualizing single-cell RNA sequencing data.

Authors: Mitulkumar V Patel
Journal: Bioinformatics Date: 2018-12-15 Impact factor: 6.937

9. From Louvain to Leiden: guaranteeing well-connected communities.

Authors: V A Traag; L Waltman; N J van Eck
Journal: Sci Rep Date: 2019-03-26 Impact factor: 4.379

10. M3Drop: dropout-based feature selection for scRNASeq.

Authors: Tallulah S Andrews; Martin Hemberg
Journal: Bioinformatics Date: 2019-08-15 Impact factor: 6.937
