
CellO: comprehensive and hierarchical cell type classification of human cells with the Cell Ontology.

Matthew N Bernstein¹, Zhongjie Ma², Michael Gleicher², Colin N Dewey²,³.

Abstract

Cell type annotation is a fundamental task in the analysis of single-cell RNA-sequencing data. In this work, we present CellO, a machine learning-based tool for annotating human RNA-seq data with the Cell Ontology. CellO enables accurate and standardized cell type classification of cell clusters by considering the rich hierarchical structure of known cell types. Furthermore, CellO comes pre-trained on a comprehensive data set of human, healthy, untreated primary samples in the Sequence Read Archive. CellO's comprehensive training set enables it to run out of the box on diverse cell types and achieves competitive or even superior performance when compared to existing state-of-the-art methods. Lastly, CellO's linear models are easily interpreted, thereby enabling exploration of cell-type-specific expression signatures across the ontology. To this end, we also present the CellO Viewer: a web application for exploring CellO's models across the ontology.


Keywords: Classification of Bioinformatical Subject; Genomic Analysis; Genomics

Year: 2020 PMID: 33364592 PMCID: PMC7753962 DOI: 10.1016/j.isci.2020.101913

Source DB: PubMed Journal: iScience ISSN: 2589-0042

Introduction

Cell type annotation is a fundamental task in the analysis of single-cell RNA-sequencing (scRNA-seq) data. Recently, a number of computational tools have been developed for automating the cell type annotation task. Nonetheless, many of these tools suffer from certain disadvantages that inhibit their use. First, many existing methods require the user to provide either a set of marker genes associated with each cell type (Zhang et al., 2019a; Pliner et al. 2019) or a suitable training data set with cells already annotated with cell type labels (Ma and Pellegrini 2020; Alquicira-Hernandez et al., 2019; Tan and Cahan 2019). Marker gene-based approaches are challenged by the fact that there is not a canonical set of marker genes for most cell types (Zhang et al., 2019b). Furthermore, finding an appropriate and labeled training set that contains all of the cell types in the target data set can be challenging, especially considering that existing approaches are sensitive to the chosen training set (Abdelaal et al., 2019). Second, many existing methods use flat classification. Flat classification suffers from the possibility that predictions are logically inconsistent with the hierarchy of cell types. Specifically, for a given query, a flat classifier may output a probability for a cell type that is larger than the classifier's output for its parent cell type in the hierarchy (Obozinski et al., 2008). Such incoherent outputs reduce the scientific usefulness of the classifier. We assert that framing the cell type classification task as that of hierarchical classification against the Cell Ontology (Bard et al. 2005) poses a number of advantages over flat classification. The Cell Ontology provides a comprehensive hierarchy of animal cell types encoded as a directed acyclic graph (DAG). This DAG provides a rich source of prior knowledge to the cell type classification task that remains un-utilized in flat classification. 
In addition, if the algorithm is uncertain about which specific cell type the cell may be, the use of a hierarchy allows the algorithm to place a cell internally within the graph rather than at a leaf node. Thus, for cells whose specific cell types are absent from the training set, a classifier that uses a hierarchy is capable of providing more informative output than simply claiming that the cell is “uncertain” as is implemented by some flat classifiers such as ACTINN (Ma and Pellegrini 2020). Finally, those methods that do perform hierarchical classification do not make use of the rich hierarchical relationships between known cell types encoded by the Cell Ontology. For example, CHETAH (de Kanter et al., 2019) classifies cells against a hierarchy; however, CHETAH infers this hierarchy from the data rather than utilizing the existing hierarchy encoded by the Cell Ontology. Garnett (Pliner et al. 2019) utilizes a hierarchy of cell types; however, this hierarchy must be pre-specified by the user. Furthermore, Garnett requires that each cell type within the hierarchy be associated with a set of marker genes. To the best of our knowledge, the only method that utilizes the graph structure of an ontology for the task at hand is URSA (Lee et al., 2013), which classifies gene expression profiles against the BRENDA Tissue Ontology (Gremse et al., 2011). In this work, we present Cell Ontology-based Classification (CellO), a tool for annotating cells against the graph-structured Cell Ontology (Figure 1A). CellO is a discriminative, supervised machine learning approach for classifying clusters of cells in scRNA-seq data. CellO comes pre-trained on a comprehensive data set comprising the majority of human primary samples in the Sequence Read Archive (SRA; Leinonen et al., 2011) and therefore arrives ready to run on diverse scRNA-seq data sets. 
CellO offers a complementary approach to marker gene-based methods for scenarios in which the test set contains cell types with poorly characterized marker genes.

Figure 1

Overview of CellO

(A) A schematic overview of CellO's hierarchical classification approach. CellO performs hierarchical classification with the Cell Ontology. Given a gene expression profile, CellO annotates the cell with a set of cell types (gray nodes) that are consistent with the hierarchical structure of the Cell Ontology.

(B) We compare CellO to eight recent cell type annotation methods regarding the criteria we surmise are desirable in a cell type classification approach: whether the method (1) arrives pre-trained and can run out of the box, (2) incorporates a hierarchy of cell types, (3) specifically uses the Cell Ontology as its hierarchy, (4) requires cell-type-specific marker genes, (5) uses a model that can be interrogated to better understand how it arrived at its decision, and (6) whether the method operates on clusters or single cells. We compare CellO to scMatch (Hou et al. 2019), SingleR (Aran et al., 2019), scCatch (Shao et al., 2020), CHETAH (de Kanter et al., 2019), Garnett (Pliner et al. 2019), CellAssign (Zhang et al., 2019a), ACTINN (Ma and Pellegrini 2020), scPred (Alquicira-Hernandez et al., 2019), CaSTLe (Lieberman et al. 2018), and SingleCellNet (Tan and Cahan 2019). CellO meets more desirable criteria than existing methods.

(C) Euler diagrams of the cell types within the bulk RNA-seq expression profiles used to train CellO. This training set comprises most primary cell bulk RNA-seq samples within the SRA and consists of diverse cell types spanning various tissues, developmental stages, and stages of differentiation. These diagrams were created with nVenn (Pérez-Silva et al. 2018).

Lastly, CellO makes extensive use of linear models, which are particularly amenable to interpretation. To enable their interpretation, we present a web-based tool, the CellO Viewer, for exploring the cell type expression signals uncovered by the model (https://uwgraphics.github.io/CellOViewer/). 
We benchmarked CellO on a collection of diverse single-cell data sets and found CellO capable of accurately annotating data sets that existing state-of-the-art and ready-to-run (i.e. come pre-trained) annotation methods were unable to accurately annotate, thus highlighting CellO's ability to annotate diverse data sets out of the box. Through its use of the Cell Ontology, its comprehensive training set, and the interpretability of its models, CellO is a practical tool for scRNA-seq cell type annotation (Figure 1B).

Results

A comprehensive curated RNA-seq data set of human primary cells

In order to capture robust cell type signals, we sought a data set of bulk RNA-seq samples comprising only healthy primary cells that originate from cells that have been isolated based on phenotypic characteristics downstream of gene expression itself (such as cell surface proteins). We thus avoid the circularity in using ground truth cell type labels determined by gene expression (via the expression of cell-type-specific marker genes) as are often provided in scRNA-seq data sets. We did not wish to include cells that underwent multiple passages, were diseased, or underwent other treatments, such as in vitro differentiation because these conditions alter gene expression. We therefore curated a data set from the SRA consisting of healthy, untreated, primary cells. To do so efficiently, we leveraged the annotations provided by the MetaSRA project (Bernstein et al. 2017), which includes sample-specific information including cell type, disease state, and sample type. We then manually curated the samples selected via the MetaSRA by both annotating technical variables and refining cell type annotations (Transparent Methods). This curation effort resulted in a data set comprising 4,293 bulk RNA-seq samples from 264 studies. These samples were labeled with 310 cell type terms, of which 113 were the most specific cell types in our data set (i.e., no sample in our data was labeled with a descendant cell type term). These cell types were diverse, spanning multiple stages of development and differentiation (Figure 1C). We uniformly quantified and normalized (via log transcripts per million) gene expression from the raw RNA-seq data for these samples (Figure 2A, Transparent Methods). To the best of our knowledge, this data set is the largest and most diverse set of bulk RNA-seq samples derived from only primary cells. Prior to this work, the most comprehensive bulk primary cell transcriptomic data set was compiled by (Aran et al. 
2017), which contains data for 64 cell types from 6 studies. Although our data set consists of only RNA-seq data, this prior data set included samples assayed with several other technologies, such as microarrays. Another comprehensive set of primary cell expression data was collected by Mabbott et al. (2013), which contains primary cell data from 745 samples from 105 studies; however, these data are exclusively from microarrays.

Figure 2

Overview of analyses and CellO's algorithm

(A) A schematic illustration of the data sets and analyses performed in this study. Initial candidate bulk RNA-seq samples were selected from the SRA via the MetaSRA, filtered for errors, and quantified using the kallisto algorithm (Bray et al., 2016), which resulted in a comprehensive bulk RNA-seq training set consisting of healthy, human primary cells. This training set was split into a pre-training and validation set for tuning the parameters of the binary classifiers, as well as for evaluating the graph correction methods (Transparent Methods). The full bulk RNA-seq data set was then used to train the final models that were then evaluated on three sets of scRNA-seq data. The first set consisted of an aggregation of diverse non-droplet-based data sets from the SRA. The second data set consisted of FAC-sorted PBMCs from Zheng et al. (2017). The third set consisted of primary lung tumor cells from Laughney et al. (2020).

(B) A schematic illustration of CellO's classification procedure. First, for a given sample, the raw classifier probabilities are corrected with the cell ontology using IR (if CLR is used, this step is not necessary). We illustrate one edge of the graph whose incident nodes have probabilities that are logically inconsistent with the hierarchy and thus require correction because the child node has a higher probability than the parent. Once corrected, cell types whose raw probabilities meet their respective decision threshold are selected. Among these, the most specific cell types (i.e., lowest in the ontology) are examined and the cell type with the highest output probability is selected. CellO outputs this final selected cell type along with all ancestor terms.
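The selection step described in panel (B) can be sketched in a few lines of Python. This is an illustrative sketch only, not CellO's actual implementation: the function names, the `parents` dictionary (mapping each cell type to its parent terms), and the toy probabilities and thresholds are all assumptions made for the example. It takes already-corrected probabilities, keeps the cell types that meet their thresholds, finds the most specific ones among them, picks the one with the highest probability, and returns it together with all of its ancestors.

```python
def ancestors(node, parents):
    """All ancestors of `node` in a DAG given as {child: [parent, ...]}."""
    seen = set()
    stack = list(parents.get(node, ()))
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(parents.get(cur, ()))
    return seen


def select_cell_types(probs, thresholds, parents):
    """Hypothetical sketch of the final decision step.

    probs / thresholds: {cell_type: float}; parents: {cell_type: [parent, ...]}.
    Returns the chosen most specific cell type plus all of its ancestors.
    """
    # 1. Keep cell types whose probability meets their own threshold.
    passing = {n for n, p in probs.items() if p >= thresholds[n]}
    if not passing:
        return set()
    # 2. A passing type is "most specific" if no other passing type descends from it.
    covered = set()
    for n in passing:
        covered |= ancestors(n, parents)
    most_specific = passing - covered
    # 3. Among the most specific types, take the one with the highest probability.
    best = max(most_specific, key=lambda n: probs[n])
    # 4. Output the selected type along with every ancestor term.
    return {best} | ancestors(best, parents)


# Toy ontology fragment (assumed for illustration).
parents = {
    "T cell": ["lymphocyte"],
    "B cell": ["lymphocyte"],
    "lymphocyte": ["cell"],
    "cell": [],
}
probs = {"cell": 0.99, "lymphocyte": 0.9, "T cell": 0.7, "B cell": 0.6}
lenient = {n: 0.5 for n in probs}
strict = {"cell": 0.5, "lymphocyte": 0.5, "T cell": 0.8, "B cell": 0.8}

print(select_cell_types(probs, lenient, parents))
print(select_cell_types(probs, strict, parents))
```

With the stricter thresholds neither leaf passes, so the sketch returns an internal node (lymphocyte) and its ancestors, which mirrors how a hierarchy lets a classifier place a cell inside the graph rather than at a leaf.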


Applications of hierarchical classification methods

We frame the cell type classification task as hierarchical classification against the Cell Ontology's DAG. The hierarchical classification task is inherently a multi-label classification task where each input sample (i.e., cell) is mapped to a “set” of output labels (i.e., cell types). Hierarchical classification extends multi-label classification by further requiring that the output labels are “consistent” with the labels' DAG. That is, for each label in a given output set of labels, the label's parent labels are also in the output set (Figure 1A). Moreover, when training a hierarchical classifier, samples that are annotated with specific labels (i.e., terms that are lower in the DAG) can be aggregated in order to train the classifier to recognize more general, ancestral labels in the DAG. We implemented two strategies for performing hierarchical classification against the Cell Ontology's DAG that both come packaged with CellO. First, we implemented cascaded logistic regression (CLR; Obozinski et al., 2008), which entails classifying a sample in a top-down fashion from the root of the ontology downward via a collection of binary classifiers. Specifically, each binary classifier is associated with a cell type and is trained to classify a sample conditioned on the sample belonging to all of the cell type's parents in the ontology. Next, we implemented a collection of one-vs-rest binary classifiers for each cell type in the DAG. We will refer to this as the “independent classifiers” approach. This approach suffers from the possibility that the classifiers' outputs will be inconsistent with the hierarchical structure of the ontology. An inconsistency occurs when the output probability for a given cell type exceeds that of one of its parent cell types in the ontology (Figure 2B). We tested the use of independent logistic regression classifiers and found inconsistencies to be a frequent source of errors. 
Specifically, we performed leave-study-out cross-validation on the full set of bulk RNA-seq data and examined the consistency of all edges that were adjacent to at least one cell type whose classifier produced a non-negligible probability (>0.01) of the sample originating from that cell type. Of these edges, 12.1% were inconsistent (Figure S1). Nearly all samples (>99%) contained at least one inconsistent edge and 34% of samples contained at least one severely inconsistent edge in which the child classifier's probability exceeded the parent classifier's probability by at least 0.25. We will use the term “correction” to refer to the task of reconciling the outputs of independent classifiers with a hierarchy (Figure 2B). To date, the one correction method that has been applied to the task at hand is Bayesian network correction (BNC), which is implemented in the URSA tool (Lee et al., 2013). Therefore, as a baseline, we implemented a BNC algorithm following the description in Lee et al. (2013) (Transparent Methods). We also tested two correction methods that have yet to be applied to the cell type classification task: isotonic regression correction (IR; Obozinski et al., 2008) and a heuristic procedure called the True Path Rule (TPR; Notaro et al., 2017). IR uses a projection-based approach for correction that entails finding a set of consistent output cell type probabilities that minimize the sum of squared differences to the raw, and possibly inconsistent, classifier output probabilities. In contrast, TPR uses a heuristic procedure that involves a bottom-up pass through the ontology such that the outputs of children classifiers are averaged with the output of the parent classifier to allow information flow across the ontology graph. To test these correction methods, we first partitioned the bulk RNA-seq data set into a pre-training and validation set (Figure 2A; Transparent Methods). 
Using this validation set, we performed a grid search to find the optimal parameters for training each binary logistic regression classifier, and given the optimal set of parameters, compared how well the aforementioned correction methods either enhanced or degraded accuracy over the samples in the validation set. Overall, we find that IR and TPR output probabilities similar to those output by the independent classifiers with regard to both average precision scores across the cell types in the validation set (Figure 3A) and precision-recall curves when considering each sample-cell type pair as an independent prediction (Figure 3B). This indicates that IR and TPR do not degrade performance in comparison to independent classifiers. In contrast, we found that the BNC approach significantly degraded performance (Figures 3A and 3B). We note that these results are in line with work by Obozinski et al. (2008), which demonstrates that IR outperforms BNC on the hierarchical protein function prediction task. Although both IR and TPR yielded similar results, we use IR as our correction method of choice due to its simplicity.
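To make the notions of an inconsistent edge and of isotonic regression correction concrete, the following Python sketch flags edges where a child's probability exceeds its parent's and then projects the raw probabilities onto the set of consistent ones in the least-squares sense. Several things here are assumptions for illustration: the function names, the integer node indexing, the toy three-node chain, and above all the solver. Dykstra's alternating projection algorithm is swapped in because it is short and self-contained; the text above does not state which solver CellO uses.

```python
import numpy as np


def inconsistent_edges(probs, edges, margin=0.0):
    """Edges (parent, child) whose child probability exceeds the parent's by more than `margin`."""
    return [(p, c) for p, c in edges if probs[c] > probs[p] + margin]


def isotonic_correction(raw, edges, n_sweeps=500):
    """Least-squares projection of `raw` onto {x : x[child] <= x[parent] for every edge}.

    Solver choice (an assumption, not taken from the paper): Dykstra's
    alternating projections over one half-space constraint per edge.
    """
    x = np.asarray(raw, dtype=float).copy()
    # One Dykstra correction term per edge; only the two endpoints are affected.
    residual = {e: np.zeros(2) for e in edges}
    for _ in range(n_sweeps):
        for parent, child in edges:
            idx = [parent, child]
            z = x[idx] + residual[(parent, child)]
            if z[1] > z[0]:
                # Constraint violated: the closest consistent pair is the common mean.
                proj = np.full(2, z.mean())
            else:
                proj = z
            residual[(parent, child)] = z - proj
            x[idx] = proj
    return x


# Toy chain 0 -> 1 -> 2 (root, child, grandchild) with probabilities rising down the chain.
raw = np.array([0.3, 0.5, 0.7])
edges = [(0, 1), (1, 2)]

print(inconsistent_edges(raw, edges))         # both edges are flagged
print(isotonic_correction(raw, edges))        # approximately [0.5, 0.5, 0.5]
```

Passing `margin=0.25` to `inconsistent_edges` gives a rough analogue of the "severely inconsistent" edges counted above. Probabilities that are already consistent are returned unchanged by the projection.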

Figure 3

Reconciling the outputs of independent classifiers with a hierarchy

(A) Average precision scores across all cell types for the independent classifiers (Ind.), as well as for IR, TPR, and BNC on the validation set.

(B) Each paired sample and cell type prediction was considered independently. The set of all such predictions was ordered according to their prediction probability and the corresponding precision-recall curve was constructed for the independent classifiers, IR, TPR, and BNC.

We also used this partition of the bulk RNA-seq samples to tune the parameters of the CLR algorithm (Transparent Methods). We found that after tuning, both IR and CLR achieved similar median F1-scores and median average precisions across cell types on the validation set (Figure S13), and therefore, both are included in the CellO software and evaluated throughout the remainder of this study.
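Average precision is used throughout these comparisons, so a small worked example may help. The Python sketch below uses one common definition (the mean of the precision values at each rank where a true positive occurs, with predictions sorted by decreasing probability); the function name and the toy scores are assumptions, and the paper's exact computation, including its tie handling, may differ.

```python
def average_precision(scores, labels):
    """Mean of precision@k over the ranks k at which a positive label occurs.

    scores: predicted probabilities; labels: truthy for positives.
    Ties in `scores` are broken by input order (an arbitrary choice for this sketch).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            total += hits / rank  # precision among the top `rank` predictions
    return total / hits if hits else 0.0


# Positives are ranked 1st and 3rd: precisions 1/1 and 2/3, so AP = (1 + 2/3) / 2.
print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

Because the score depends only on the ranking of predictions, it is unaffected by where a decision threshold is placed, unlike the F1-score.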

Comparison to existing methods

We trained both CLR and IR on the full set of bulk RNA-seq samples in order to test their performance on single-cell data (Figure 2A). We note that training a single-cell classifier with bulk RNA-seq data may lead to models being poorly calibrated to the sparse single-cell expression profiles. To address this challenge, we first cluster single-cell data using the Leiden community detection algorithm (Traag et al. 2019) with the default resolution parameter of 1.0, as implemented in the Scanpy Python package (Wolf et al. 2018), and then compute each cluster's mean expression profile. The mean expression profiles are less sparse than those of the individual cells and thus better resemble the bulk RNA-seq data on which the algorithms were trained. CellO first classifies each cluster based on its mean expression profile and then assigns each cell to its cluster's assigned cell types. We compiled a data set consisting of 7,366 healthy primary cells originating from non-droplet-based RNA-seq assays, such as SMART-Seq2 (Picelli et al., 2013) and MARS-seq (Jaitin et al., 2014), from the SRA in a manner similar to that used for compiling the bulk RNA-seq training data. This data set originated from 14 studies and was labeled with 125 cell type terms, of which 32 were most specific to the data. Of these cells, 4,936 were of cell types that were included in the bulk RNA-seq training set. This subset of cells originates from 12 studies and was labeled with 71 cell type terms, of which 16 were most specific to the data set. We note that for many of the cells used in this analysis, the ground-truth cell types provided by the authors of the data were determined via in silico and/or manual approaches (e.g. via heuristic marker gene-based approaches), and thus, this analysis can be understood as an analysis of the “consistency” between the cell types as annotated by the authors and those annotated by the automated methods explored in this work. 
We use the subset of 4,936 cells to evaluate IR, CLR, as well as a baseline one-nearest-neighbor (1NN) algorithm that simply returns the cell type labels of the most similar sample in the training set to the query expression profile using Pearson correlation as the similarity metric. We evaluated two aspects of these algorithms' classifications. First, we compute the average precision (a measure of area under the precision-recall curve) on each cell type's output probabilities. Second, for each cell, we evaluate a set of binary yes-no decisions for each cell type that result from thresholding the raw output probabilities and enforcing each cell to be annotated with only one most specific cell type (Figure 2B; Transparent Methods). We evaluate these binary decisions using precision, recall, and F1-score (harmonic mean of precision and recall). We modified the evaluation metrics to take into account samples that were labeled with a general cell type but not a specific cell type (e.g. T cell versus CD8+ T cell; Transparent Methods). We found that IR and CLR outperformed 1NN according to F1-score, precision, and average precision (Figures 4A, 4B, and 4C). Specifically, IR, CLR, and 1NN produced median F1-scores of 0.81, 0.85, and 0.63, respectively, across all cell types. The cell types on which classification performance was poor were generally more specific and clustered within the hierarchy (Figure 4D). We note that CellO's average precision scores across cell types tend to be higher than its F1-scores (Figure 4D). This discrepancy indicates that CellO is doing well at discriminating among these cell types; however, the decision thresholds used by CellO to output hard classifications for some cell types may be non-optimal. We hypothesize that this is due to some classifiers being poorly calibrated, especially for cell types lower in the ontology. 
This may be due to there being fewer “studies” in the training set generating these cell types, and thus, for a given cell type, CellO's binary classifiers may be more prone to fitting the batch effects present in these fewer studies (Figure S16).
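The cluster-level workflow described earlier in this section (average each cluster's expression, classify the mean profile, then propagate the result to every cell in the cluster) can be sketched as follows. For brevity the classifier used here is the 1NN baseline with Pearson correlation rather than CellO's models. The function names, array shapes, and toy data are assumptions for illustration, and the cluster labels are taken as given; in practice they would come from Leiden clustering, for example via Scanpy's `scanpy.tl.leiden`.

```python
import numpy as np


def cluster_means(expr, clusters):
    """Mean expression profile per cluster; `expr` is a cells x genes array."""
    expr = np.asarray(expr, dtype=float)
    clusters = np.asarray(clusters)
    ids = np.unique(clusters)
    means = np.vstack([expr[clusters == c].mean(axis=0) for c in ids])
    return ids, means


def one_nn_pearson(query, train_expr, train_labels):
    """Labels of the training sample most Pearson-correlated with `query`.

    Assumes no profile is constant across genes (the correlation would be undefined).
    """
    q = query - query.mean()
    t = train_expr - train_expr.mean(axis=1, keepdims=True)
    corr = (t @ q) / (np.linalg.norm(t, axis=1) * np.linalg.norm(q))
    return train_labels[int(np.argmax(corr))]


def annotate_cells(expr, clusters, train_expr, train_labels):
    """Classify each cluster's mean profile, then give every cell its cluster's label."""
    train_expr = np.asarray(train_expr, dtype=float)
    ids, means = cluster_means(expr, clusters)
    cluster_label = {
        c: one_nn_pearson(m, train_expr, train_labels)
        for c, m in zip(ids.tolist(), means)
    }
    return [cluster_label[c] for c in np.asarray(clusters).tolist()]


# Toy data: two "bulk" training profiles and four sparse cells in two clusters.
train_expr = [[10, 0, 0], [0, 10, 0]]
train_labels = ["T cell", "B cell"]
cells = [[5, 0, 1], [7, 1, 0], [0, 4, 1], [1, 6, 0]]
clusters = [0, 0, 1, 1]

print(annotate_cells(cells, clusters, train_expr, train_labels))
```

Averaging before classifying is the design choice that matters here: the mean profile of a cluster has fewer zero entries than any single cell, so it looks more like the bulk samples the classifier was trained on.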

Figure 4

Results on non-droplet-based single-cell data

CellO's performance on the 4,936 non-droplet-based cells considering only cells whose cell types are present in the bulk RNA-seq training set. We compare the distributions of (A) F1-score, (B) precision, and (C) average precision across all such cell types. (D) The subgraph spanning the non-droplet-based cells where each cell type is colored according to CellO's (IR) F1-score (top) as well as by average precision (bottom).

We also compared CellO to two existing methods, scMatch (Hou et al. 2019) and SingleR (Aran et al., 2019). scMatch and SingleR are most comparable to CellO because they come packaged with comprehensive reference data sets of human cells. Like CellO, these methods are designed to run out of the box on diverse single-cell data sets. scMatch comes packaged with a reference data set comprising data from the FANTOM5 project (Lizio et al., 2017). SingleR comes packaged with two comprehensive human reference data sets: a data set comprising data from the Blueprint (Fernández et al., 2016) and ENCODE (Sloan et al., 2016) projects and a reference set from the Human Primary Cell Atlas (Mabbott et al., 2013). We also built a reference set for SingleR from CellO's training set in order to isolate methodological differences between SingleR and CellO. Unfortunately, scMatch does not support the creation of a custom reference set, and therefore, we were unable to perform this experiment with scMatch. To enable a comparison between scMatch, SingleR, and CellO, we project the outputs of scMatch and SingleR onto the Cell Ontology in order to evaluate scMatch and SingleR within the hierarchical classification framework. Specifically, for a given cell annotated by one of these methods with some cell type C, we also annotate the cell with all ancestors of C according to the Cell Ontology. 
First, we note that CellO's training sets included over 50% more of the cell types in this test set compared to those of scMatch and SingleR. In fact, most cell types in this test set, such as pancreatic islet cell types, are absent from scMatch and SingleR's reference sets, thus indicating that a user would be required to supply their own reference set for annotating these cell types (Figure 5A). We thus evaluated each method on only cell types that exist in each respective method's prepackaged reference set and found that CellO outperformed existing approaches (Figure 5B). SingleR performed poorly with CellO's training set, which may be due to the high number of samples in CellO's training set, the high number of genes (i.e., including non-coding genes), and the fact that CellO's training set contains bulk RNA-seq samples at a higher level in the ontology (Figure S4A). We note that, in this analysis, for a given method, we may remove from the analysis a specific cell type that is absent from a given method's training set (e.g. stomach epithelial cell), but we would keep an ancestral cell type term (e.g. epithelial cell) if the method's training set contains a sample labeled with this ancestral term (e.g. a sample labeled as intestinal epithelial cell).
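The projection of flat predictions onto the ontology, followed by per-cell-type scoring, can be illustrated with a short Python sketch. The function names, the `parents` dictionary, and the toy labels are assumptions for the example. The metrics are the plain, unmodified precision, recall, and F1-score; the adjusted versions used in the paper for cells labeled only with a general cell type are described in its Transparent Methods and are not reproduced here.

```python
def with_ancestors(label, parents):
    """A flat label expanded to itself plus all of its ancestors in the ontology."""
    out, stack = {label}, [label]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in out:
                out.add(p)
                stack.append(p)
    return out


def per_type_scores(true_sets, pred_sets, cell_type):
    """Plain precision, recall, and F1 for one cell type over set-valued labels."""
    pairs = list(zip(true_sets, pred_sets))
    tp = sum(cell_type in t and cell_type in p for t, p in pairs)
    fp = sum(cell_type not in t and cell_type in p for t, p in pairs)
    fn = sum(cell_type in t and cell_type not in p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy ontology fragment and flat labels for three cells (assumed for illustration).
parents = {
    "CD8+ T cell": ["T cell"],
    "T cell": ["lymphocyte"],
    "B cell": ["lymphocyte"],
    "lymphocyte": [],
}
true_flat = ["CD8+ T cell", "B cell", "T cell"]
pred_flat = ["T cell", "B cell", "B cell"]

true_sets = [with_ancestors(c, parents) for c in true_flat]
pred_sets = [with_ancestors(c, parents) for c in pred_flat]

print(per_type_scores(true_sets, pred_sets, "T cell"))
print(per_type_scores(true_sets, pred_sets, "lymphocyte"))
```

In this toy case the third cell is mislabeled as a B cell, which costs recall on "T cell" but not on "lymphocyte", since both labels share that ancestor. This is how the projection lets a flat method earn credit at the more general levels of the hierarchy.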

Figure 5

Comparison of CellO to existing approaches on non-droplet-based single-cell data

Evaluating CellO, SingleR, and scMatch on the non-droplet-based cells.

(A) The fraction of cell types in the single-cell test data set that are also present in each method's training set. IR and CLR are not shown separately because they share the same training set. We evaluate SingleR's built-in reference sets from the Human Primary Cell Atlas (HPCA) and Blueprint + ENCODE (BE).

(B) The distribution of both F1-scores (left) and precisions (right) for only those cell types that are in each method's training set. We compare CellO to scMatch, SingleR with the Human Primary Cell Atlas (HPCA), and SingleR with the Blueprint + ENCODE reference (BE). Note each distribution evaluates different sets of cell types depending on the particular subset of cell types present in each method's training set.

Next, we evaluated CellO on fluorescence-activated cell sorted (FAC-sorted) peripheral blood mononuclear cells (PBMCs) from Zheng et al. (2017) that were sequenced with Chromium 10x. We selected this data set because it is one of the few droplet-based data sets for which the cell type labels are determined phenotypically (via sorting) rather than computationally (via expression analysis). To reflect the size of typical single-cell data sets, we subsampled 2,000 cells from each of the ten sorted cell types and aggregated these cells together, creating a data set consisting of 20,000 cells. We first compared IR and CLR to the 1NN baseline and again found that IR and CLR outperformed 1NN with respect to median F1-score, although the difference between IR/CLR and 1NN was smaller than on the aforementioned non-droplet-based scRNA-seq data set (Figure 6D). Specifically, IR, CLR, and 1NN produced median F1-scores of 0.97, 0.96, and 0.95, respectively, across all cell types.

Figure 6

Results on 10x PBMC data

(A and B) The subgraph of the Cell Ontology spanning the 10x PBMC data set from Zheng et al. (2017). Each cell type is colored according to CellO's (IR) (A) F1-score and (B) average precision.

(C) UMAP plots of the single-cell data set where cells are colored by their true cell type (top), as well as the most specific predicted cell type (i.e. lowest in the ontology) as output by CellO (bottom).

(D) Boxplots displaying the distribution of F1-scores across all cell types for IR, CLR, 1NN, scMatch, SingleR with the Human Primary Cell Atlas (HPCA), SingleR with the Blueprint + ENCODE reference (BE), and SingleR with the Monaco et al. reference (M).

Next, we compared CellO to scMatch and SingleR on this PBMC data set. In this analysis, we also ran SingleR using an immune-specific reference set of purified immune cells from Monaco et al. (2019). The cell types in this data set are better represented in scMatch and SingleR's respective reference sets, and thus, a comparison between CellO and these methods on this data better isolates performance differences between these methods. Like scMatch and SingleR, CellO struggled to accurately classify the T cell subtypes (Figures 6A–6C, S2C, S3, and S4B). Among the methods compared here, SingleR with the Monaco et al. reference set most accurately classified the T cell subtypes (Figure S3), though we note that this reference set is specialized for immune cell types, whereas the other reference sets, including CellO's, are more broad. We also note that CellO produced high average precision scores on most cell types including many of these T cell subtypes. Again, this indicates that CellO's classifiers have learned to discriminate between these cell types; however, the threshold for calling these cell types may be non-optimal.
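The gap between a high average precision and a low F1-score can be illustrated with a small sketch. The scores and labels below are hypothetical and are not taken from the paper; the two helper functions are plain-Python stand-ins written for this example.

```python
def average_precision(scores, labels):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)

def f1_at_threshold(scores, labels, threshold):
    """F1-score when every score >= threshold is called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

# Hypothetical classifier outputs: every positive outranks every negative,
# but all scores sit below a 0.5 decision threshold.
scores = [0.40, 0.35, 0.30, 0.10, 0.05, 0.02]
labels = [1, 1, 1, 0, 0, 0]

ap = average_precision(scores, labels)              # ranking quality
f1_default = f1_at_threshold(scores, labels, 0.5)   # nothing is called
f1_tuned = f1_at_threshold(scores, labels, 0.2)     # threshold moved
print(ap, f1_default, f1_tuned)
```

Average precision depends only on how the cells are ranked, whereas F1 depends on where the decision threshold falls. A classifier can therefore separate a cell type well and still score poorly on F1 if its threshold is poorly calibrated.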

Inspection of performance on challenging and diseased samples

We examined CellO's classifications on three challenging data sets. Two of these data sets comprised subsets of the 7,366 cells from non-droplet-based assays and are challenging because they contained cells for which their combination of cell types was absent from CellO's training set. First, we examined CellO's accuracy on 1,978 healthy pancreatic islet cells from Segerstolpe et al. (2016), which includes cell types that are absent from the bulk RNA-seq training set, specifically ductal cells, acinar cells, epsilon cells, and delta cells. We found that CellO was able to correctly annotate the acinar cells as glandular epithelial cells, which is an ancestral cell type to acinar cells in the Cell Ontology (Figures 7A and S2A). This highlights the advantage of classifying against a hierarchy in that it enables CellO to annotate cells with a term higher in the ontology DAG when it is unsure about a cell's more specific cell type. We also note that a number of cells were uncharacterized in the original study due to not meeting a stringent quality control filter. CellO annotated many of these cells as pancreatic A cells (a.k.a. pancreatic alpha cells), which is plausible owing to both their close position to annotated A cells according to UMAP, which is known to preserve some level of global structure in high dimensional data (Becht et al., 2018), as well as the fact that A cells were found to be the most abundant endocrine cell type in Segerstolpe et al. (2016) of those that passed their stringent quality control filtering.
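The behavior of falling back to a more general term can be sketched as below. This is an illustrative reading, not CellO's implementation: the ontology fragment is a simplified toy, the probabilities are invented, and the 0.5 threshold and function names are assumptions.

```python
# Toy child -> parents map; a simplified fragment, not copied from the
# Cell Ontology release used in the paper.
PARENTS = {
    "cell": [],
    "epithelial cell": ["cell"],
    "glandular epithelial cell": ["epithelial cell"],
    "acinar cell": ["glandular epithelial cell"],
    "hematopoietic cell": ["cell"],
}

def ancestors(term, parents):
    """All terms reachable by following parent edges from `term`."""
    seen, stack = set(), list(parents[term])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen

def most_specific_calls(probs, parents, threshold=0.5):
    """Called terms that are not an ancestor of another called term."""
    called = {t for t, p in probs.items() if p >= threshold}
    # Keep a term only if every one of its ancestors is called as well.
    consistent = {t for t in called if ancestors(t, parents) <= called}
    covered = set()
    for term in consistent:
        covered |= ancestors(term, parents)
    return sorted(consistent - covered)

# Hypothetical per-term probabilities for one cluster of acinar cells.
probs = {
    "cell": 0.99,
    "epithelial cell": 0.95,
    "glandular epithelial cell": 0.80,
    "acinar cell": 0.30,
    "hematopoietic cell": 0.02,
}
print(most_specific_calls(probs, PARENTS))
```

With these numbers the specific term falls below the threshold, so the cluster is reported at the nearest ancestor that is still confidently called. That ancestor is a correct, if less informative, label.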

Figure 7

Examination of CellO's performance on difficult data sets

(A) UMAP plots of all healthy cells in Segerstolpe et al. (2016) including cells for which their specific cell types are not present in CellO's bulk RNA-seq training set. Cells are colored according to their true cell type (left) and (IR) predicted cell type (right). Highlighted are CellO's predictions made on pancreatic acinar cells (top right ovals), as well as a subset of uncharacterized pancreatic epithelial cells predicted as A cells (center ovals).

(B) UMAP plots of human, embryonic neural cells from La Manno et al. (2016). Cells are colored according to their true cell type (left) and predicted cell type (right). Highlighted are CellO's predictions on the microglial and glial cells; note that CellO annotates these cells using terms that are higher in the ontology's graph than their true terms.

We also further examined CellO's classification on 1,977 fetal neural cells from La Manno et al. (2016). Although the bulk RNA-seq training data contain samples of both embryonic cells and cells of various neural cell types, they do not contain any sample labeled as both neural cell and embryonic cell. Despite this discrepancy, CellO was able to annotate these cells with reasonable cell type labels (Figures 7B and S2B). We note that the microglial cells were annotated as phagocytes, which are an ancestral term to microglial cells in the Cell Ontology. Similarly, CellO annotated the glial cells as neuron associated cells. These examples again highlight CellO's ability to annotate cells with a term higher in the ontology DAG when it is unsure about a cell's more specific cell type. Finally, we examined CellO's classification on eight lung adenocarcinoma tumor samples from Laughney et al. (2020) that were sequenced with 10x.
This data set provides a good, yet challenging test set for CellO due to (a) the heterogeneity of constituent cell types, (b) the fact that it contains cell types absent from CellO's training set, and (c) the fact that these cells originate from diseased tissue, thereby providing insight into how CellO will perform on non-healthy samples. We compared CellO's output to the annotations provided by the authors, which were the result of a custom, in silico cell type annotation pipeline. Overall, we found a high correspondence between the labels provided by CellO and the authors' annotation pipeline (Figure 8; Figures S5–S12). When the two methods differed, we attempted to determine which method was likely to be correct by manually inspecting known marker genes; in many of these discrepancies, CellO produced the correct cell type labels. For example, in tumors LX675 and LX682, CellO correctly annotates putative endothelial cells (Figure 8; Figure S5), based on expression of PECAM1 (Figure 8C) and CD34 (data not shown), whereas the authors of these data labeled these as epithelial cells. In tumor LX675, CellO labeled the myeloid dendritic cell population as CD1c+ myeloid dendritic cells, a more specific cell type than that provided by the authors (Figure 8). In tumor LX682, CellO correctly annotates the putative fibroblasts as evidenced by their expression of FAP (Puré and Blomberg 2018) (Figure S5) and S100A4 (Strutz et al., 1995; data not shown). In tumor LX679, CellO appears to correctly annotate the plasmacytoid dendritic cell population as evidenced by the expression of IL3RA (Figure S6), a known marker for this cell type (Collin et al. 2013). We also note that for cells whose likely true cell types are absent from CellO's training set, CellO was able to label these cells using a correct but more general cell type. For example, in tumor LX675, mast cells were absent from CellO's training set, yet CellO accurately labeled these cells as hematopoietic cells (Figure 8).
Similarly, when CellO was unsure of the labels for cells of cell types for which training data were sparse, such as plasma cells and pericytes, CellO labeled these cells with the more general cell type terms lymphocyte of B lineage and connective tissue cell, respectively (Figure 8). Lastly, we note some of the most clear-cut cases in which the authors' pipeline produced correct labels but CellO erred. For example, in tumor LX679, CellO labeled many of the epithelial cell types as prostate epithelial cells (Figure S6), which is clearly incorrect given that these are lung tumor samples. Other errors produced by CellO are due to rarer cell types within the sample clustering together with a more common, similar cell type. For example, in tumor LX682, the myeloid dendritic cells are labeled by CellO as macrophages due to these cells clustering together with the macrophage population (Figure S5).

Figure 8

Examination of CellO on diseased cells

(A) UMAP plots of lung adenocarcinoma tumor LX675 from Laughney et al. (2020) colored by CellO's output using IR and a Leiden resolution parameter of 1.0 (left) and by the original cell type labels provided by the authors (right). We highlight four subpopulations comprising putative CD1C+ myeloid dendritic cells (top left), endothelial cells (top right), plasma cells (bottom left), and mast cells (bottom right).

(B) The legend for coloring cells in (A).

(C) UMAP plots of cells colored by their expression, in units log(TPM+1), of CD1C, a marker for CD1C+ myeloid dendritic cells, PECAM1, a marker for endothelial cells, SDC1, a marker for plasma cells, and KIT, a marker for mast cells.


Evaluation of robustness to clustering

In order to test CellO's robustness to cluster granularity when clustering single-cell data, we tested CellO on the Zheng et al. PBMC data set and Laughney et al. lung cancer data set using differing values for Leiden's resolution parameter. On the Zheng et al. data set, we tested CellO with five values for the resolution parameter and found both CLR and IR to perform similarly across these settings (Figure S15). In contrast to the Zheng et al. data set, the lack of robust ground-truth labels for the Laughney et al. data set made it difficult to perform a similar assessment. Instead, on this data set, we ran CellO using the default resolution of 1.0, as well as a higher resolution of 8.0, and then manually inspected the differences in the outputs produced by these two settings (Figures S5–S12). Overall, we found that the higher resolution led CellO to classify clusters with more granularity but at the cost of some errors. For example, on tumor LX679 and LX682, CellO was able to correctly classify the putative myeloid dendritic cells at the resolution of 8.0, whereas at the default resolution of 1.0, these cells were labeled incorrectly as alveolar macrophages (Figures S5 and S6). This is likely due to the fact that the dendritic cells were subsumed into the macrophage cluster at the lower resolution. However, in other instances, the higher resolution led to errors. These errors are likely due to the fact that at a more granular clustering resolution, each cluster's mean expression value remained sparse due to the averaging of fewer cells. For example, in tumor LX679, cells correctly labeled as respiratory epithelial cells at a resolution of 1.0 were incorrectly labeled as prostate epithelial cells at a resolution of 8.0 (Figure S6).
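The suggestion that finer clusters yield sparser mean expression profiles can be illustrated with a toy simulation. The detection rate, matrix size, and function names below are assumptions chosen for the example and do not model any data set from the paper.

```python
import random

def simulate_cells(n_cells, n_genes, detection_rate, seed=0):
    """Toy data: each gene is detected in a cell with a fixed probability."""
    rng = random.Random(seed)
    return [
        [1 if rng.random() < detection_rate else 0 for _ in range(n_genes)]
        for _ in range(n_cells)
    ]

def zero_fraction_of_mean(cells):
    """Fraction of genes whose cluster-mean expression is exactly zero."""
    n_genes = len(cells[0])
    zero_genes = sum(
        1 for g in range(n_genes) if all(cell[g] == 0 for cell in cells)
    )
    return zero_genes / n_genes

cells = simulate_cells(n_cells=200, n_genes=500, detection_rate=0.05)
# Nested "clusters" of increasing size drawn from the same population.
for size in (5, 20, 200):
    print(size, zero_fraction_of_mean(cells[:size]))
```

Because a gene's mean is zero only when no cell in the cluster detects it, adding cells to a cluster can only reduce the number of zero entries in the averaged profile. Small clusters therefore present the classifier with a profile closer to a single sparse cell than to the bulk-like profiles it was trained on.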

User-friendly software

We provide a Python package for running CellO, using either IR or CLR, on a user-provided gene expression matrix (https://github.com/deweylab/CellO). CellO reduces the burden of reformatting and preprocessing an input expression matrix by accepting a variety of input file formats, including comma or tab-separated text files and HDF5, and by accepting expression data in a variety of units including counts or transcripts per million (TPM). To address the scenario in which the input data set's genes do not match those expected by the pre-trained classifiers, we provide functionality for a user to re-train the models on the bulk RNA-seq training set with a custom gene set. On a personal laptop, training a new classifier took 31 minutes to train IR and 11 minutes to train CLR on the full set of 58,243 GENCODE genes. Training time is reduced when trained on a smaller set of genes (e.g., only protein-coding genes). We also note that the time required for CellO to perform classification is low because of the fact that it uses pre-trained logistic regression classifiers operating on cluster-averaged expression profiles. On the Zheng et al. (2017) data set, CellO took six minutes to run on a personal laptop (including time for clustering), whereas scMatch required six hours and nine minutes (run with five cores), and SingleR required between nine and 22 minutes depending on the reference set used. Finally, we note that the relative performance of IR and CLR varies across cell types. To guide a user on their selection of either IR or CLR, we provide the average precision values achieved by both methods on each cell type in the bulk RNA-seq validation set (Table S1, Bulk RNA-seq Validation Set Metrics). We also provide average precision values and F1-scores achieved by both methods on each cell type on the test set of 4,936 non-droplet-based single cells whose cell types were present in the bulk RNA-seq training set (Table S2, Single-cell RNA-seq Test Set Metrics). 
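A rough sketch of why classification on cluster-averaged profiles is cheap is given below. It does not use the CellO package's API; the helper names, the two-gene expression vectors, and the weights are all hypothetical, and the preprocessing CellO applies before scoring is not reproduced here.

```python
import math
from collections import defaultdict

def cluster_means(expression, clusters):
    """Average per-cell expression vectors within each cluster."""
    sums, sizes = {}, defaultdict(int)
    for vector, cluster in zip(expression, clusters):
        if cluster not in sums:
            sums[cluster] = [0.0] * len(vector)
        sums[cluster] = [a + b for a, b in zip(sums[cluster], vector)]
        sizes[cluster] += 1
    return {c: [v / sizes[c] for v in total] for c, total in sums.items()}

def logistic_score(profile, weights, intercept):
    """Probability from a binary logistic regression model."""
    z = intercept + sum(w * x for w, x in zip(weights, profile))
    return 1.0 / (1.0 + math.exp(-z))

# Four hypothetical cells, two genes, two clusters.
expression = [[3.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
clusters = ["0", "0", "1", "1"]
weights, intercept = [2.0, -2.0], 0.0  # made-up model for one cell type

means = cluster_means(expression, clusters)
scores = {c: logistic_score(p, weights, intercept) for c, p in means.items()}
print(scores)
```

Each model is evaluated once per cluster rather than once per cell, so the cost of scoring grows with the number of clusters and cell types instead of the number of cells.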
Lastly, we note that due to CellO's comprehensive training set, which comprises cell types from many organs and tissues, some of CellO's errors are due to CellO annotating cells using a cell type that is unique to a tissue type that differs from the known tissue type of the target sample. For example, in some lung cancer tumors from Laughney et al., endothelial cells were classified as “endothelial cell of the umbilical vein” (Figures S5, S6, and S9), which is clearly incorrect given that these samples were taken from the lung. This error is likely due to the fact that the endothelial cells in CellO's training set largely originate from umbilical cord samples. Because these errors can be easily caught by the user, the CellO package enables a user to fix such errors by enabling the user to supply a blacklist of tissue types that do not pertain to the target sample. CellO then uses edges between the Cell Ontology and the Uberon ontology (which encodes anatomical entities; Mungall et al., 2012) to filter out cell types from CellO's output that are uniquely located in the blacklisted tissue types. For example, by blacklisting the Uberon term “umbilical vein”, CellO correctly classifies the endothelial cells from Laughney et al.
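The tissue blacklist can be pictured with the following sketch. It is not the CellO package's interface: the function name and the hand-written cell-type-to-tissue map are assumptions standing in for the Cell Ontology to Uberon edges, and "uniquely located" is interpreted here as "every linked tissue is blacklisted".

```python
def filter_blacklisted(predicted_types, type_to_tissues, blacklist):
    """Drop predicted cell types found only in blacklisted tissues."""
    blacklist = set(blacklist)
    kept = []
    for cell_type in predicted_types:
        tissues = type_to_tissues.get(cell_type, set())
        # Remove a type only if it is tied to at least one tissue and
        # all of its tissues are blacklisted.
        if tissues and tissues <= blacklist:
            continue
        kept.append(cell_type)
    return kept

# Hypothetical stand-in for Cell Ontology -> Uberon location edges.
TYPE_TO_TISSUES = {
    "endothelial cell": set(),  # not tied to a single tissue
    "endothelial cell of umbilical vein": {"umbilical vein"},
}

predicted = ["endothelial cell", "endothelial cell of umbilical vein"]
print(filter_blacklisted(predicted, TYPE_TO_TISSUES, ["umbilical vein"]))
```

Only the tissue-specific term is removed, so the cluster keeps its more general and still correct label.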

Interpretability of models

CellO makes extensive use of linear models, which are particularly amenable to interpretation especially when the coefficients are sparse (Gleicher 2013). Although CellO's models are not regularized to be sparse (as in Gleicher 2013), we sparsify them by selecting the top ten genes per cell type according to the magnitude of the coefficients associated with each gene within each cell type's one-vs-rest binary classification model, which is used for CellO's IR classifier. To enable their interpretation, we present a web-based tool, the CellO Viewer, for exploring these discriminative genes uncovered by the models (https://uwgraphics.github.io/CellOViewer/). The tool supports two modes of operation: a “cell-centric” mode (Figure 9A) and a “gene-centric” mode (Figure 9B). In the cell-centric mode, the user can select cell types via a graphical display of the Cell Ontology in order to view and compare the most important genes for distinguishing those cell types. In the gene-centric view, the user can select genes and explore which cell types these genes are most important for distinguishing from the remaining cell types. The CellO Viewer uses an interactive display of the Cell Ontology's graph to enable the user to navigate between cell types across the ontology.
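Selecting the top genes by coefficient magnitude amounts to a sort on absolute value, sketched below. The coefficient values and the placeholder gene names `GENE_X`, `GENE_Y`, and `GENE_Z` are invented for illustration and are not taken from CellO's trained models.

```python
def top_genes(coefficients, k=10):
    """Return the k (gene, coefficient) pairs with the largest magnitude."""
    ranked = sorted(
        coefficients.items(), key=lambda item: abs(item[1]), reverse=True
    )
    return ranked[:k]

# Hypothetical coefficients from one cell type's one-vs-rest model.
toy_coefficients = {
    "CD3D": 1.8,
    "CD3E": 1.5,
    "GENE_X": -1.1,
    "GENE_Y": 0.2,
    "GENE_Z": -0.05,
}
print(top_genes(toy_coefficients, k=3))
```

Ranking by magnitude while keeping the sign retains genes whose expression argues against a cell type as well as genes that argue for it.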

Figure 9

The CellO Viewer

Screenshots of the CellO Viewer web application for enabling the exploration of cell-type-specific expression signatures across the Cell Ontology.

(A) Comparing the top ten genes between CD4+ T cells and CD8+ T cells (red nodes in the Graph View) ranked by the magnitude of their coefficients in their corresponding models. Genes that are shared between the two lists are highlighted with the same color. The CellO Viewer displays genes whose expressions are both positively correlated (green) and negatively correlated (red) with the selected cell types.

(B) A screenshot of the gene-centric mode of the CellO Viewer with GFAP, an astrocyte marker, selected. For a given selected gene, the CellO Viewer will display the cell types within the DAG (top) and in list form (bottom) for which the selected gene appears within the top ten genes ranked by each model's coefficients.

We found that across diverse cell types, many known cell-type-specific marker genes were recovered by the CellO models and are presented by the CellO Viewer. For example, CD3D, CD3E, and CD3G, which are canonical markers for T cells, were all present within the top ten genes ranked according to the magnitude of their coefficients within the binary logistic regression model used for distinguishing T cells from all other cell types. Similarly, CD4 and CD8 were present in the top genes for the CD4+ T cell and CD8+ T cell models, respectively (Figure 9). In a more complex example, the genes GCG, LOXL4, DPP4, GC, and FAP, known markers for pancreatic alpha cells, and INS, IAPP, and ADCYAP1, known markers for pancreatic beta cells (Segerstolpe et al., 2016), all appear within the top ten genes for their respective cell types. Interestingly, certain genes appear in the top ten coefficients for broad cell types but not more specific cell types, indicating that CellO is able to find signals specific to broad cell type categories.
For example, DDX4 appeared in the top ten genes for distinguishing germ line cells but did not appear within the top ten genes for any of the more specific germ cell subtypes. DDX4 is known to be expressed in germ cells across both sexes (Hickford et al., 2011). Similarly, the gene NRG1 appeared in the top ten genes for distinguishing precursor cells but did not appear within the top ten genes for any of the more specific cell types that are descendants of precursor cells within the ontology. NRG1 is known to play a role in the development of a number of organ systems (Lemmens et al. 2007; Mei and Xiong 2008).

Discussion

In this work, we explore the application of hierarchical classification algorithms to cell type classification with the Cell Ontology using a well-curated set of human primary cell RNA-seq samples. This data set may prove useful for future investigations of cell type expression patterns or for use in cell type deconvolution methods (Aran et al., 2019; Newman et al., 2015). We demonstrate that the trained classifiers perform well across cell types in diverse single-cell data sets and outperformed existing cell type annotation methods when trained on their comprehensive reference sets. We packaged these classifiers into an easy-to-run Python package called CellO. In our exploration of methods for correcting the independent one-vs-rest classifiers, we found that discriminative methods outperformed the generative BNC approach implemented in URSA (Lee et al., 2013). We hypothesize that BNC suffers in comparison due to two causes. First, BNC's probabilistic model makes strong assumptions regarding the generative process of classifier scores and true cell type assignments. Second, BNC requires estimating the conditional probability distribution of each classifier's output scores (i.e., distance from the decision boundary) conditioned on the true cell type labels, which may be difficult to estimate accurately given the limited quantity of training data available for each cell type. By using linear models, CellO's trained parameters are easily interpreted as cell-type-specific signatures across the ontology. However, we note that since certain cell types undergo similar sorting and preparation procedures (e.g., fluorescence-activated cell sorting), it remains unclear to what extent these procedures affect gene expression and thus confound cell type. We sought to mitigate this effect by using data from a diversity of studies. 
We also note that the CLR algorithm may help to further mitigate this effect since the binary classifiers trained in this framework for each cell type condition on the sample belonging to the parent cell types. Thus, for a given cell type, if samples of its parent cell types were prepared through similar procedures, the learned model parameters for that cell type will better capture biological cell type signatures.
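One way to picture the conditioning on parent cell types is through how each binary classifier's training set might be assembled. The sketch below is an interpretation of the description above, not the authors' training code; the function name, sample identifiers, and label sets are hypothetical.

```python
def clr_training_set(samples, node, parents):
    """Split samples into positives and negatives for one cell type.

    `samples` maps a sample id to the set of all its cell type labels,
    ancestors included. Only samples carrying every parent of `node`
    are used, which is the assumed meaning of conditioning on parents.
    """
    required = set(parents[node])
    positives, negatives = [], []
    for sample_id, labels in sorted(samples.items()):
        if not required <= labels:
            continue  # outside the parent cell types, so not used
        if node in labels:
            positives.append(sample_id)
        else:
            negatives.append(sample_id)
    return positives, negatives

# Toy ontology fragment and hypothetical bulk samples.
parents = {"cell": [], "lymphocyte": ["cell"], "T cell": ["lymphocyte"]}
samples = {
    "s1": {"cell", "lymphocyte", "T cell"},
    "s2": {"cell", "lymphocyte", "B cell"},
    "s3": {"cell", "fibroblast"},
}
print(clr_training_set(samples, "T cell", parents))
```

Under this reading, the negatives for a cell type are drawn from its sibling populations, which are more likely to share preparation procedures with the positives than arbitrary samples are. A classifier trained this way has less opportunity to separate the classes on preparation artifacts alone.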

Limitations of the study

There are a number of avenues that require further investigation. First, future work will entail curating comprehensive training sets from the SRA for other species such as mouse. This work will rely partly on future inclusion of standardized mouse metadata in the MetaSRA. Second, CellO is a cluster-based annotation method, and thus, its accuracy relies, in part, on the robustness of the clustering algorithm. If the clustering is too coarse, rare cell types may be missed. If clustering is too fine, the algorithm may not be combining enough data to accurately annotate each cluster. Determining the optimal clustering in scRNA-seq data is a challenging, open problem that will require further investigation (Kiselev et al. 2019). Nonetheless, we demonstrated that CellO accurately classified a number of diverse data sets using Leiden's default parameter in the Scanpy package. Third, we note that calibrating discriminative models trained on bulk RNA-seq data and applying them to single-cell data is challenging. In this work, we developed techniques for closing the gap between the performance of CellO when evaluated with average precision versus when evaluated with F1-score. The very high average precision scores across many cell types indicate that CellO is learning an accurate representation of these cell types and that with better calibration, CellO's accuracy when making binary yes-no decisions for each cell type could be improved. Future work will investigate alternative approaches to calibrating CellO's models in order to improve CellO's binary cell type decisions. Fourth, the Cell Ontology encodes anatomical and functional relationships between cell types; however, there exist a number of other relationships between cell types that could be utilized to improve accuracy. 
Such examples include lineage-based relationships (i.e., one cell type derives from another cell type; Yuan et al., 2020) or evolutionary relationships between extant cell types and ancient cell types (Arendt et al., 2016; Liang et al., 2018). For example, the evolutionary relationships between cell types may be utilized to address inconsistencies in the independent classifiers approach that arise when certain cell types share a parent cell type via the currently encoded “is a” relationship but are purported to have divergent evolutionary origins. Fifth and finally, we expect the performance of hierarchical classifiers to improve as both more data are collected and as the Cell Ontology is expanded. Most importantly, we expect the calibration of the classifiers to improve as more training data become available for each cell type. More training data will be collected both as data are continually added to the SRA and as improvements are made to the SRA's metadata, thereby allowing retrieval of previously undiscovered primary cell samples.

Resource availability

Lead contact

Further information and requests for resources should be directed to and will be fulfilled by the Lead Contact, Colin Dewey (colin.dewey@wisc.edu).

Data and code availability

A Python package for running CellO can be found on GitHub: https://github.com/deweylab/CellO. The data used in this work can be found on https://doi.org/10.5281/zenodo.4289064. The CellO Viewer can be accessed at https://uwgraphics.github.io/CellOViewer/. The code implementing the CellO Viewer can be found on GitHub: https://github.com/uwgraphics/CellOViewer. All code for performing the experiments in this work can be found on GitHub: https://github.com/deweylab/cell-type-classification-paper.

Materials availability

This study did not generate new unique reagents.

Methods

All methods can be found in the accompanying Transparent methods supplemental file.

49 in total

1. ACTINN: automated identification of cell types in single cell RNA sequencing.

Authors: Feiyang Ma; Matteo Pellegrini
Journal: Bioinformatics Date: 2020-01-15 Impact factor: 6.937

2. Probabilistic cell-type assignment of single-cell RNA-seq for tumor microenvironment profiling.

Authors: Allen W Zhang; Ciara O'Flanagan; Elizabeth A Chavez; Jamie L P Lim; Nicholas Ceglia; Andrew McPherson; Matt Wiens; Pascale Walters; Tim Chan; Brittany Hewitson; Daniel Lai; Anja Mottok; Clementine Sarkozy; Lauren Chong; Tomohiro Aoki; Xuehai Wang; Andrew P Weng; Jessica N McAlpine; Samuel Aparicio; Christian Steidl; Kieran R Campbell; Sohrab P Shah
Journal: Nat Methods Date: 2019-09-09 Impact factor: 28.547

3. Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types.

Authors: Diego Adhemar Jaitin; Ephraim Kenigsberg; Hadas Keren-Shaul; Naama Elefant; Franziska Paul; Irina Zaretsky; Alexander Mildner; Nadav Cohen; Steffen Jung; Amos Tanay; Ido Amit
Journal: Science Date: 2014-02-14 Impact factor: 47.728

4. The BRENDA Tissue Ontology (BTO): the first all-integrating ontology of all organisms for enzyme sources.

Authors: Marion Gremse; Antje Chang; Ida Schomburg; Andreas Grote; Maurice Scheer; Christian Ebeling; Dietmar Schomburg
Journal: Nucleic Acids Res Date: 2010-10-28 Impact factor: 16.971

5. An expression atlas of human primary cells: inference of gene function from coexpression networks.

Authors: Neil A Mabbott; J Kenneth Baillie; Helen Brown; Tom C Freeman; David A Hume
Journal: BMC Genomics Date: 2013-09-20 Impact factor: 3.969

Review 6. Human dendritic cell subsets.

Authors: Matthew Collin; Naomi McGovern; Muzlifah Haniffa
Journal: Immunology Date: 2013-09 Impact factor: 7.397

7. From Louvain to Leiden: guaranteeing well-connected communities.

Authors: V A Traag; L Waltman; N J van Eck
Journal: Sci Rep Date: 2019-03-26 Impact factor: 4.379

8. Identification and characterization of a fibroblast marker: FSP1.

Authors: F Strutz; H Okada; C W Lo; T Danoff; R L Carone; J E Tomaszewski; E G Neilson
Journal: J Cell Biol Date: 1995-07 Impact factor: 10.539

9. Ontology-aware classification of tissue and cell-type signals in gene expression profiles across platforms and technologies.

Authors: Young-suk Lee; Arjun Krishnan; Qian Zhu; Olga G Troyanskaya
Journal: Bioinformatics Date: 2013-09-12 Impact factor: 6.937

Review 10. Pro-tumorigenic roles of fibroblast activation protein in cancer: back to the basics.

Authors: Ellen Puré; Rachel Blomberg
Journal: Oncogene Date: 2018-05-03 Impact factor: 9.867
