Luke Zappia1,2, Belinda Phipson1, Alicia Oshlack1,2.
Abstract
As single-cell RNA-sequencing (scRNA-seq) datasets have become more widespread, the number of tools designed to analyse these data has dramatically increased. Navigating the vast sea of tools now available is becoming increasingly challenging for researchers. To facilitate the selection of appropriate analysis tools, we have created the scRNA-tools database (www.scRNA-tools.org) to catalogue and curate analysis tools as they become available. Our database collects a range of information on each scRNA-seq analysis tool and categorises it according to the analysis tasks it performs. Exploration of this database gives insights into the areas of rapid development of analysis methods for scRNA-seq data. We see that many tools perform tasks specific to scRNA-seq analysis, particularly clustering and ordering of cells. We also find that the scRNA-seq community embraces an open-source and open-science approach, with most tools available under open-source licenses and preprints being extensively used as a means to describe methods. The scRNA-tools database provides a valuable resource for researchers embarking on scRNA-seq analysis and records the growth of the field over time.
Year: 2018 PMID: 29939984 PMCID: PMC6034903 DOI: 10.1371/journal.pcbi.1006245
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. (A) Number of tools in the scRNA-tools database over time. Since the scRNA-tools database was started in September 2016, more than 160 new tools have been released. (B) Publication status of tools in the scRNA-tools database. Over half of the tools in the full database have at least one published peer-reviewed paper, while another third are described in preprints. (C) When stratified by the date tools were added to the database, we see that the majority of tools added before October 2016 are published, while around half of newer tools are available only as preprints. Newer tools are also more likely to be unpublished in any form. (D) The majority of tools are available using either the R or Python programming languages. (E) Most tools are released under a standard open-source software license, with variants of the GNU General Public License (GPL) being the most common. However, licenses could not be found for a large proportion of tools. Up-to-date versions of these plots (with the exception of C) are available on the analysis page of the scRNA-tools website (https://www.scrna-tools.org/analysis).
Fig 2. Phases of a typical unsupervised scRNA-seq analysis process.
In Phase 1 (data acquisition) raw sequencing reads are converted into a gene-by-cell expression matrix. For many protocols this requires the alignment of reads to a reference genome and the assignment and de-duplication of Unique Molecular Identifiers (UMIs). The data are then cleaned (Phase 2) to remove low-quality cells and uninformative genes, resulting in a high-quality dataset for further analysis. The data can also be normalised and missing values imputed during this phase. Phase 3 assigns cells, either in a discrete manner to known (classification) or unknown (clustering) groups, or to a position on a continuous trajectory. Interesting genes (e.g. differentially expressed genes, markers, genes with specific patterns of expression) are then identified to explain these groups or trajectories (Phase 4).
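The data cleaning phase (Phase 2) can be illustrated with a minimal sketch. The code below is not taken from any tool in the database: the function name `clean_counts`, the thresholds, and the choice of library-size scaling followed by a log transform are all assumptions made for illustration. It takes a small gene-by-cell count matrix, removes cells that express too few genes, removes genes detected in too few of the remaining cells, and normalises what is left.

```python
from math import log1p


def clean_counts(counts, min_genes_per_cell=2, min_cells_per_gene=2, scale=10_000):
    """Hypothetical Phase 2 sketch.

    counts: list of rows, one row per gene, one count per cell in each row.
    Returns (kept gene indices, kept cell indices, normalised matrix).
    Thresholds and the normalisation scheme are illustrative assumptions.
    """
    n_genes = len(counts)
    n_cells = len(counts[0]) if counts else 0

    # Quality control: keep cells that express at least `min_genes_per_cell` genes.
    keep_cells = [
        c for c in range(n_cells)
        if sum(1 for g in range(n_genes) if counts[g][c] > 0) >= min_genes_per_cell
    ]

    # Gene filtering: keep genes detected in at least `min_cells_per_gene` kept cells.
    keep_genes = [
        g for g in range(n_genes)
        if sum(1 for c in keep_cells if counts[g][c] > 0) >= min_cells_per_gene
    ]

    # Normalisation (assumed scheme): scale each cell to a common library size,
    # then apply log(1 + x).
    totals = {c: sum(counts[g][c] for g in keep_genes) for c in keep_cells}
    normalised = [
        [
            log1p(counts[g][c] / totals[c] * scale) if totals[c] else 0.0
            for c in keep_cells
        ]
        for g in keep_genes
    ]
    return keep_genes, keep_cells, normalised


# Toy matrix: 3 genes (rows) by 4 cells (columns); values are made up.
toy_counts = [
    [5, 0, 3, 0],
    [2, 0, 4, 1],
    [0, 0, 0, 7],
]
kept_genes, kept_cells, matrix = clean_counts(toy_counts)
```

In this toy example the second cell has no detected genes and is dropped, and the third gene is then dropped because it is detected in only one of the remaining cells. Real tools in the Quality Control, Gene Filtering and Normalisation categories use more sophisticated criteria than these fixed cut-offs.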
Descriptions of categories for tools in the scRNA-tools database.
| Phase | Category | Description |
|---|---|---|
| Phase 1 | Alignment | Alignment of sequencing reads to a reference |
| Phase 1 | Assembly | Tools that perform assembly of scRNA-seq reads |
| Phase 1 | UMIs | Processing of Unique Molecular Identifiers |
| Phase 1 | Quantification | Quantification of expression from reads |
| Phase 2 | Quality Control | Removal of low-quality cells |
| Phase 2 | Gene Filtering | Removal of lowly expressed or otherwise uninformative genes |
| Phase 2 | Imputation | Estimation of expression where zeros have been observed |
| Phase 2 | Normalisation | Removal of unwanted variation that may affect results |
| Phase 2 | Cell Cycle | Assignment or correction of stages of the cell cycle, or other uses of cell cycle genes, or genes associated with similar processes |
| Phase 3 | Classification | Assignment of cell types based on a reference dataset |
| Phase 3 | Clustering | Unsupervised grouping of cells based on expression profiles |
| Phase 3 | Ordering | Ordering of cells along a trajectory |
| Phase 3 | Rare Cells | Identification of rare cell populations |
| Phase 3 | Stem Cells | Identification of cells with stem-like characteristics |
| Phase 4 | Differential Expression | Testing of differential expression across groups of cells |
| Phase 4 | Expression Patterns | Detection of genes that change expression across a trajectory |
| Phase 4 | Gene Networks | Identification or use of co-regulated gene networks |
| Phase 4 | Gene Sets | Testing for over-representation or other uses of annotated gene sets |
| Phase 4 | Marker Genes | Identification or use of genes that mark cell populations |
| Multiple | Dimensionality Reduction | Projection of cells into a lower dimensional space |
| Multiple | Interactive | Tools with an interactive component or a graphical user interface |
| Multiple | Variable Genes | Identification or use of highly (or lowly) variable genes |
| Multiple | Visualisation | Functions for visualising some aspect of scRNA-seq data or analysis |
| Other | Allele Specific | Detection of allele-specific expression |
| Other | Alternative Splicing | Detection of alternative splicing |
| Other | Haplotypes | Use or assignment of haplotypes |
| Other | Immune | Assignment of receptor sequences and immune cell clonality |
| Other | Integration | Combining of scRNA-seq datasets or integration with other single-cell data types |
| Other | Modality | Identification or use of modality in gene expression |
| Other | Simulation | Generation of synthetic scRNA-seq datasets |
| Other | Transformation | Transformation between expression levels and some other measure |
| Other | Variants | Detection or use of nucleotide variants |
Fig 3. (A) Categories of tools in the scRNA-tools database. Each tool can be assigned to multiple categories based on the tasks it can complete. Categories associated with multiple analysis phases (visualisation, dimensionality reduction) are among the most common, as are categories associated with the cell assignment phase (ordering, clustering). (B) Changes in analysis categories over time, comparing tools added before and after October 2016. There have been significant increases in the percentage of tools associated with visualisation, dimensionality reduction, gene networks and simulation. Categories including expression patterns, ordering and interactivity have seen relative decreases. (C) Changes in the percentage of tools associated with analysis phases over time. The percentage of tools involved in the data acquisition and data cleaning phases has increased, as has the percentage of tools designed for alternative analysis tasks. The gene identification phase has seen a relative decrease in the number of tools. (D) The number of categories associated with each tool in the scRNA-tools database. The majority of tools perform few tasks. (E) Most tools that complete many tasks are relatively recent.
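Because each tool can belong to several categories, the summaries in panels A and D amount to two different tallies over the same records. The sketch below shows one way such tallies could be computed. The record layout, the tool names and their category assignments are hypothetical and are not drawn from the actual database.

```python
from collections import Counter

# Hypothetical records: names and category assignments are invented for illustration.
tools = [
    {"name": "toolA", "categories": ["Clustering", "Visualisation", "Dimensionality Reduction"]},
    {"name": "toolB", "categories": ["Ordering", "Visualisation"]},
    {"name": "toolC", "categories": ["Simulation"]},
]


def category_percentages(records):
    """Percentage of tools assigned to each category (as in panel A).

    Percentages can sum to more than 100 because a tool may have several categories.
    """
    counts = Counter(cat for tool in records for cat in set(tool["categories"]))
    n_tools = len(records)
    return {cat: 100 * k / n_tools for cat, k in counts.items()}


def categories_per_tool(records):
    """Number of tools having each category count (as in panel D)."""
    return Counter(len(set(tool["categories"])) for tool in records)


percentages = category_percentages(tools)
per_tool = categories_per_tool(tools)
```

Comparing the output of `category_percentages` for tools added before and after a cut-off date would give the kind of change over time shown in panel B.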