Jing Jiang1, Cankun Wang1, Ren Qi1, Hongjun Fu2, Qin Ma1. 1. Department of Biomedical Informatics, The Ohio State University, Columbus, OH 43210, USA. 2. Department of Neuroscience, The Ohio State University, Columbus, OH 43210, USA.
Abstract
Alzheimer's disease (AD) is a progressive neurodegenerative disorder of the brain and the most common form of dementia among the elderly. Single-cell RNA-sequencing (scRNA-Seq) and single-nucleus RNA-sequencing (snRNA-Seq) techniques are extremely useful for dissecting the function and dysfunction of highly heterogeneous cells in the brain at the single-cell level, and the corresponding data analyses can significantly improve our understanding of why particular cells are vulnerable in AD. We developed an integrated database named scREAD (single-cell RNA-Seq database for Alzheimer's disease), which is, to our knowledge, the first database dedicated to the management of all existing scRNA-Seq and snRNA-Seq data sets from human postmortem brain tissue with AD and from mouse models with AD pathology. scREAD provides comprehensive analysis results for 73 data sets from 10 brain regions, including control atlas construction, cell-type prediction, identification of differentially expressed genes, and identification of cell-type-specific regulons.
Alzheimer's disease (AD) is the most common cause of dementia. Currently, an estimated 5.8 million Americans aged 65 years or older are living with AD (Claxton et al., 2015). AD is a slowly progressive brain disease: only after years of brain changes do individuals experience noticeable symptoms, such as difficulty remembering recent conversations, names, or events, and language problems (Shinagawa, 2016). Symptoms occur because neurons in parts of the brain involved in thinking, learning, and memory have been damaged or destroyed, probably by the accumulation of amyloid beta (Aβ) and tau protein aggregates and by neuroinflammation (Dolgin, 2018; Mucke, 2009). Unfortunately, there are no effective therapeutics that can cure the disease or alter its course (Gao et al., 2016). Furthermore, the molecular mechanisms underlying AD, especially cellular vulnerability, are poorly understood.
Single-cell RNA sequencing (scRNA-Seq) examines the dynamic transcriptomic profile of individual cells with next-generation sequencing technologies and hence provides a higher resolution of cellular differences and a better understanding of the function of an individual cell in the context of its microenvironment (Seweryn et al., 2020; Wang et al., 2020; Wu et al., 2014). For frozen brain samples, single-nucleus RNA sequencing (snRNA-Seq) is also an important strategy: it handles samples that cannot be readily dissociated into a single-cell suspension and minimizes the alteration of gene expression caused by dissociation. Previous studies have demonstrated that AD pathology differs by age, gender, brain region, and cell type (Ewers et al., 2011; Mucke, 2009; Sala Frigerio et al., 2019). scRNA-Seq provides an alternative approach for studying the cellular heterogeneity of the brain and revealing the complex cellular changes in the AD brain by profiling tens of thousands of individual cells (Mathys et al., 2017).
scRNA-Seq can reveal complex and rare cell populations, uncover regulatory relationships between genes, and track the trajectories of distinct cell lineages in development (Grubman et al., 2019; Qi et al., 2020). The single-cell view of AD pathology provides a unique cellular-level picture of the transcriptional alterations associated with AD and significantly improves our understanding of the pathogenesis of AD (Del-Aguila et al., 2019; Mathys et al., 2019).
Here, we developed a database called scREAD (single-cell RNA-Seq database for Alzheimer's disease), which provides comprehensive analysis results for all existing scRNA-Seq and snRNA-Seq data sets collected from the Gene Expression Omnibus (GEO) (Barrett et al., 2013) and Synapse databases. scREAD has several key features: (i) it is the first-of-its-kind database with a collection of all 17 existing human and mouse AD scRNA-Seq and snRNA-Seq data sets from the public domain (Table S1); (ii) it redefines the 17 data sets into 73 data sets, each of which corresponds to a specific species (human or mouse), gender (male or female), brain region (entorhinal cortex, prefrontal cortex, superior frontal gyrus, cortex, cerebellum, subventricular zone, superior parietal lobe, or hippocampus) (Table S2), condition (disease or control), and age stage (7 months, 15 months, or 20 months for mice and 50–100+ years old for humans) (Table S3); (iii) it provides comprehensive analysis results for each of the 73 data sets, including but not limited to the construction of control atlases, cell clustering, prediction of cell types, identification of differentially expressed genes (DEGs), and identification of cell-type-specific regulons (CTSRs) in support of the in-depth analysis of heterogeneous regulatory mechanisms; (iv) all these analysis results are visualized through a one-stop, user-friendly interface to free AD biologists from programming burdens (Data S1); and (v) the backend workflow enabling all the above computational analyses is freely accessible as stand-alone one-line-command scripts in R (Data S2).
Results
Overview Functionalities of scREAD Database
There are four major functionalities in scREAD: (i) construction of control atlas for different human and mouse brain regions based on the 23 control data sets; (ii) identification of human and mouse disease cell types by projecting the AD data sets onto the control atlases; (iii) identification of DEGs for each cell type among different conditions and functional enrichment analysis of DEGs; and (iv) identification of CTSRs for each cell type among different conditions. These four functions and the schematic workflow of scREAD are shown in Figure 1.
Figure 1
The Workflow of scREAD
Construction of Control Atlas for Different Human and Mouse Brain Regions
We constructed 23 human and mouse control cell atlases based on 17 scRNA-Seq and snRNA-Seq data sets, which cover 10 brain regions, two genders, and different mouse and human ages, totaling 713,640 cells (Figure 2A and Transparent Methods). These 17 data sets were redefined into 73 data sets according to species, gender, brain region, disease or control, and age (Figure 2B). The number of cells and the statistical distribution of these 73 data sets are shown in Figures S1 and S2. Not all data sets in scREAD can be downloaded: those from the GEO database are available for download, whereas those from Synapse are not.
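The redefinition of source data sets by species, gender, brain region, condition, and age can be pictured as a simple grouping step. The following is a minimal sketch under stated assumptions: the function name `split_by_metadata`, the field names, and the sample records are all hypothetical and are not taken from the scREAD workflow.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (species, gender, region, condition, age) -- the five factors named in the text.
GroupKey = Tuple[str, str, str, str, str]

def split_by_metadata(samples: List[dict]) -> Dict[GroupKey, List[str]]:
    """Group sample IDs so that each group shares all five metadata factors."""
    groups: Dict[GroupKey, List[str]] = defaultdict(list)
    for s in samples:
        key = (s["species"], s["gender"], s["region"], s["condition"], s["age"])
        groups[key].append(s["sample_id"])
    return dict(groups)

# Hypothetical records for illustration only.
samples = [
    {"sample_id": "s1", "species": "human", "gender": "male",
     "region": "entorhinal cortex", "condition": "AD", "age": "70-80"},
    {"sample_id": "s2", "species": "human", "gender": "male",
     "region": "entorhinal cortex", "condition": "AD", "age": "70-80"},
    {"sample_id": "s3", "species": "human", "gender": "male",
     "region": "entorhinal cortex", "condition": "control", "age": "70-80"},
]

groups = split_by_metadata(samples)
for key, ids in groups.items():
    print(key, ids)
```

Each resulting group would then be treated as one data set in downstream analyses.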
Figure 2
General Information about scREAD Data sets
(A) General statistical distribution of all the 73 data sets. From left to right, the pie charts show the distribution of four factors: species, control/disease condition, brain region, and gender. Each color in a pie chart represents one element of that factor, and each number gives the count of data sets (out of 73) for that element.
(B) General information table on the homepage, which includes nine factors (species, gender, condition, region, Braak stage, age, mice model, GEO/synapse ID, and #cells).
Cell types of these 23 control atlases were assigned using Seurat (Stuart et al., 2019) and Semi-supervised Category Identification and Assignment (SCINA) (Zhang et al., 2019), and the known cell-type marker genes used in this process were collected from literature and PanglaoDB (Franzen et al., 2019) (Table S4 and Transparent Methods). scREAD contains eight major cell types of the human and mouse brain, i.e., astrocytes, endothelial cells, excitatory neurons, inhibitory neurons, microglia, oligodendrocytes, oligodendrocyte precursor cells, and pericytes. These 23 control atlases were then visualized using Uniform Manifold Approximation and Projection (UMAP) (Becht et al., 2018) and can be downloaded on the “Browse control atlas” page.
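To give a rough feel for marker-based annotation, the sketch below scores a cell by the mean expression of each cell type's marker genes and assigns the highest-scoring type. This is a deliberately simplified stand-in, not the SCINA or Seurat algorithm; the function name, the short marker lists, and the expression values are assumptions for illustration and do not reproduce Table S4.

```python
from typing import Dict, List

def assign_cell_type(expr: Dict[str, float], markers: Dict[str, List[str]]) -> str:
    """Return the cell type whose marker genes have the highest mean expression."""
    scores = {}
    for cell_type, genes in markers.items():
        # Genes absent from the cell's profile count as zero expression.
        scores[cell_type] = sum(expr.get(g, 0.0) for g in genes) / len(genes)
    return max(scores, key=scores.get)

# Toy marker lists (illustrative only).
MARKERS = {
    "oligodendrocytes": ["MBP", "MOG"],
    "astrocytes": ["GFAP", "AQP4"],
}

# Made-up normalized expression values for a single cell.
cell = {"MBP": 5.2, "MOG": 3.1, "GFAP": 0.4}
print(assign_cell_type(cell, MARKERS))
```

A real pipeline would work on normalized count matrices and model uncertainty in the assignment, which this sketch omits.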
Identification of Human and Mouse Disease Cell Types Based on the Control Atlas
Not all cells collected from AD patient samples are affected by the disease; cells within an individual patient are heterogeneous, and normal control cells are included. Granja et al. (2019) defined such cells as control-like cells, and we applied this concept to the AD data sets in scREAD. Control-like cells maintain regulatory mechanisms and gene expression patterns distinct from those of AD cells, and they interfere with the accurate identification of AD cell types. We therefore removed control-like cells from the AD data sets to identify genuinely AD-associated cells. Using the human and mouse control atlases, we projected AD-associated cells onto the control atlas at single-cell resolution to identify human and mouse disease cell types (Transparent Methods). General information at the top of the result page gives an overview of the data set description, its source, and other data sets from the same experiment in scREAD (Figure 3A). The cell types and subclusters can be visualized interactively on the UMAP plot (Figure 3B) and exported in Portable Network Graphics (PNG) format by clicking the “save” button at the right corner. The Adjusted Rand Index or silhouette score is listed next to the UMAP plots for evaluating clustering performance (see Transparent Methods) (Lovmar et al., 2005; Steinley et al., 2016). For each gene, expression values can be overlaid interactively on the same UMAP coordinates. For example, MBP is a marker gene of oligodendrocytes and, as expected, shows higher expression in oligodendrocytes than in other cell types (Figure 3C).
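For readers unfamiliar with the Adjusted Rand Index, the following sketch computes it from two label vectors using the standard pair-counting formula. It is a plain-Python illustration with toy labels, not the code used by scREAD; the function name is an assumption.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence

def adjusted_rand_index(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Adjusted Rand Index between two partitions of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: partitions trivially agree
    # Pairs of cells placed together in both partitions.
    together = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / total_pairs
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both partitions
    return (together - expected) / (maximum - expected)

# Identical partitions up to label renaming score 1.
print(adjusted_rand_index([0, 0, 1, 1], ["b", "b", "a", "a"]))
```

A value of 1 means the clustering matches the reference labels exactly, values near 0 indicate agreement no better than chance, and negative values indicate less agreement than expected by chance.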
Figure 3
Overall Information of an AD Disease Data set (AD00103) and UMAP Plot of the Cell Types and Expression Distribution of Gene MBP in This Data set
(A) Overall information on an AD disease data set (AD00103). It includes the information of species, brain region, condition, gender, age, number of control-like and AD-associated cells, data set source, and data sets from the same experiment.
(B) UMAP plot colored by cell type on this AD disease data set. We identified seven cell types, i.e. astrocytes, endothelial cells, excitatory neurons, inhibitory neurons, microglia, oligodendrocytes, and oligodendrocyte precursor cells.
(C) UMAP plot of the expression distribution of the oligodendrocyte marker gene MBP in the same data set. The darker the color, the higher the expression value of the gene. Adj. p-value: Wilcoxon rank sum test, Bonferroni corrected.
Transcriptional alterations seemed to stem from changes in cell state, with certain cell-type subpopulations more readily captured in AD pathology. To dissect disease-associated cellular subpopulations and cell-type heterogeneity, we identified subclusters within each cell type, enabling investigation of subcluster-specific changes and functional diversity occurring in AD. Across brain regions, forcing each cell into the closest major cell type can lead to misannotation, and finer subtypes could help resolve this problem. The subcluster function of scREAD therefore provides a more comprehensive cell-type annotation that accounts for cross-region heterogeneity. Because no standard or consistent annotations are available, we have not annotated these subclusters.
Differential Gene Expression and Functional Enrichment Analysis
Differential gene expression analyses include cell-type-specific genes, subcluster-specific genes, and cell-type-pairwise DEGs within one data set or between data sets based on diverse conditions (Table S5) (Monier et al., 2019). All compared conditions share the same species, gender, brain region, and age. The DEGs are presented according to the selected comparison group and cell type of interest, and users can drag the sliders or type values in the panel to apply different parameter cutoffs. The log2 fold-change cutoff can be adjusted from 0 to 5, and the adjusted p value cutoff can be adjusted from 10^-6 to 1. All DEGs are scaled by cell types or conditions and are presented in tables, allowing users to explore the differential expression of genes of interest across different conditions (Figure 4A).
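The cutoff-based filtering described above can be sketched as follows. This is an illustration only: the function names, the gene names, the p values, and the cutoff values are made up, and applying the log2 fold-change cutoff to the absolute value is an assumption rather than documented scREAD behavior.

```python
from typing import List, NamedTuple, Tuple

class DegRecord(NamedTuple):
    gene: str
    log2_fc: float
    p_value: float  # raw (unadjusted) p value

def bonferroni(p_values: List[float]) -> List[float]:
    """Bonferroni correction: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def filter_degs(records: List[DegRecord], min_abs_log2_fc: float,
                max_adj_p: float) -> List[Tuple[str, float, float]]:
    """Keep genes passing both the fold-change and adjusted p value cutoffs."""
    adjusted = bonferroni([r.p_value for r in records])
    return [
        (r.gene, r.log2_fc, q)
        for r, q in zip(records, adjusted)
        if abs(r.log2_fc) >= min_abs_log2_fc and q <= max_adj_p
    ]

# Toy records; in practice the correction spans all tested genes.
records = [
    DegRecord("GENE_A", 1.8, 1e-8),
    DegRecord("GENE_B", 0.1, 1e-9),   # fails the fold-change cutoff
    DegRecord("GENE_C", -2.3, 0.03),  # fails after Bonferroni (0.03 * 3 = 0.09)
]
hits = filter_degs(records, min_abs_log2_fc=0.25, max_adj_p=0.05)
print([gene for gene, _, _ in hits])
```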
Figure 4
The DEGs and Functional Enrichment Analysis of Selected DEGs on an AD Disease Data set (AD00103)
(A) The DEG panel for the astrocyte cell type using the default parameters (left) and the DEG result table (right). Adjusted p-value: Wilcoxon rank sum test, Bonferroni corrected.
(B) The functional enrichment analysis of DEGs. For each functional enrichment analysis, the top five enrichment results are shown. Adjusted p-value: hypergeometric test, Benjamini-Hochberg corrected.
Functional enrichment analysis is a computational method for inferring knowledge about an input gene set by comparing it to annotated gene sets representing prior biological knowledge. scREAD provides enrichment analysis of the DEGs against Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways and Gene Ontology (GO) databases (Figure 4B) (Gene Ontology, 2015; Kanehisa and Goto, 2000). The enrichment analysis is performed and displayed in real time from the DEG list defined by the current DEG cutoffs. All DEG tables and functional enrichment tables can be downloaded by users.
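The Figure 4 caption states that enrichment p values come from a hypergeometric test with Benjamini-Hochberg correction. The sketch below shows both steps in plain Python on toy numbers; the function names and inputs are assumptions for illustration, and the actual scREAD implementation may differ in details such as the choice of background gene set.

```python
from math import comb
from typing import List

def hypergeom_pvalue(overlap: int, n_degs: int, n_annotated: int, n_background: int) -> float:
    """P(X >= overlap) when n_degs genes are drawn from n_background genes,
    of which n_annotated belong to the pathway or GO term."""
    upper = min(n_degs, n_annotated)
    favorable = sum(
        comb(n_annotated, i) * comb(n_background - n_annotated, n_degs - i)
        for i in range(overlap, upper + 1)
    )
    return favorable / comb(n_background, n_degs)

def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Toy example: 2 of 2 DEGs fall in a 5-gene term, background of 10 genes.
print(hypergeom_pvalue(overlap=2, n_degs=2, n_annotated=5, n_background=10))
print(benjamini_hochberg([0.01, 0.04, 0.03]))
```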
Identification of CTSRs
CTSRs are defined as groups of genes that receive similar regulatory signals in a specific cell type and hence tend to have similar expression patterns and share conserved motifs in that cell type (Wan et al., 2019; Yang et al., 2019). Successful elucidation of CTSRs will substantially improve the identification of transcriptionally co-regulated gene modules, allowing more reliable prediction of the global transcription regulation networks encoded in a specific cell type (Ma et al., 2020b; Xie et al., 2020).
scREAD provides both a CTSR result table and a detailed visualization of each CTSR in each cell type for each data set (Figure 5). Taking an AD data set (AD00103) as an example, scREAD lists all identified CTSRs in a table indexed by cell type and allows users to download this table (Figure 5A). An interactive visualization of all the CTSRs is displayed below the result table (Figure 5B). For example, CT3-R1 (the first regulon in cell type 3) includes 64 genes co-regulated by the same transcription factor (TF), MXI1. CT3-R1 is marked as a CTSR based on a significant regulon specificity score (RSS) of 0.77. Of the 64 genes, 25 are differentially expressed in CT3 (marked with stars), according to the differential expression analysis using Seurat. Details of each gene and motif can be found by clicking the gene name and TF logo, respectively. More detailed motif-finding results, including positions, sequences, and position weight matrix information, can be found by clicking “Open”. For each gene in this regulon, a UMAP plot of its expression across all cell types can be displayed by clicking the “Display” button.
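The exact RSS definition is deferred to the Transparent Methods and is not reproduced here. As a hedged illustration, the sketch below implements one definition commonly used in the regulon literature, which converts the Jensen-Shannon divergence (JSD) between a regulon's normalized activity across cells and a cell type's normalized membership vector into a score, RSS = 1 - sqrt(JSD). Whether scREAD uses exactly this formula is an assumption, and all names and values below are hypothetical.

```python
from math import log2, sqrt
from typing import List, Sequence

def _normalize(values: Sequence[float]) -> List[float]:
    total = sum(values)
    if total <= 0:
        raise ValueError("vector must have a positive sum")
    return [v / total for v in values]

def _entropy(p: Sequence[float]) -> float:
    # Shannon entropy in bits; zero entries contribute nothing.
    return -sum(x * log2(x) for x in p if x > 0)

def jensen_shannon_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    mix = [(a + b) / 2 for a, b in zip(p, q)]
    return _entropy(mix) - (_entropy(p) + _entropy(q)) / 2

def regulon_specificity_score(activity: Sequence[float],
                              in_cell_type: Sequence[bool]) -> float:
    """Assumed definition: RSS = 1 - sqrt(JSD(activity, membership))."""
    p = _normalize(activity)
    q = _normalize([1.0 if flag else 0.0 for flag in in_cell_type])
    # Clamp tiny negative values caused by floating-point rounding.
    return 1.0 - sqrt(max(jensen_shannon_divergence(p, q), 0.0))

# Toy data: regulon activity per cell and whether each cell is in the cell type.
print(regulon_specificity_score([1.0, 1.0, 0.0, 0.0], [True, True, False, False]))
```

Under this definition, a regulon active only in cells of the given type scores 1, and a regulon active only in other cells scores 0.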
Figure 5
The CTSR Result Table and Details of the CT3-R1 on an AD Disease Data set (AD00103)
(A) The result table of CTSRs on AD00103.
(B) The details of the top CTSR of cell type three. (1) A regulon is named CTn-Rm, with n representing the index of the cell type and m representing the regulon rank. (2) Asterisks indicate marker genes, that is, the differentially expressed genes identified in each cluster using Seurat. (3) Gene symbols link to the GeneCards (human) or Mouse Genome Informatics (MGI) website. (4) Corresponding gene Ensembl ID columns link to the Ensembl website. (5) Gene expression UMAP and comparison to the cell types. (6) The corresponding TF, with a link to the HOCOMOCO database. (7) Detailed motif-finding results, including positions, sequences, position weight matrix, etc. (8) Motif details linking to the TOMTOM database. Regulon p-value: Wilcoxon rank sum test, Bonferroni corrected.
Discussion
In this paper, we described the first release of scREAD, which is, to our knowledge, the first database that collects all existing human and mouse scRNA-Seq and snRNA-Seq data sets with AD pathology and provides a one-stop interactive visualization of the control atlases and analysis results based on these data sets. These data sets had been published and were freely accessible in the public domain as of September 22, 2020. With the development and application of scRNA-Seq technology, scREAD will continue to be enriched and expanded into a large database such as SC2disease (Zhao et al., 2020). Furthermore, scREAD allows users to submit a new data set through the submit page and reproduce all the analyses showcased in scREAD in support of their AD research. We will ask for users' permission before storing any uploaded data in our database. We believe that AD researchers will benefit from studying the data and the corresponding analysis results in scREAD.
scREAD provides comprehensive analysis results for the 73 scRNA-Seq and snRNA-Seq data sets collected so far, including the construction of control cell atlases, cell clustering and subclustering, prediction of cell types, identification of DEGs, and identification of CTSRs. Based on the constructed control cell atlases, we can identify AD-associated cells in specific brain regions and at specific disease stages. Further analysis of the function and dysfunction of highly heterogeneous cells in the brain at the single-cell level via cell clustering and subclustering, as well as DEG and functional enrichment analysis, can help us understand subcluster-specific changes in the transcriptomic profile and the functional diversity occurring in AD. The identification of CTSRs will substantially improve the reliable prediction of the global transcription regulation networks encoded in a specific cell type. Thus, scREAD will help the AD community by supporting the in-depth analysis of heterogeneous regulatory mechanisms in AD and the identification of potential therapeutic targets for the prevention and/or treatment of AD.
Limitations of the Study
Currently, scREAD contains only the scRNA-Seq and snRNA-Seq data sets available as of September 22, 2020, and does not yet include other omics or spatial transcriptomics data. In the future, we will collect more AD scRNA-Seq and snRNA-Seq data from more brain regions, build healthy atlases for diverse human and mouse brain regions, and extend to other species. Meanwhile, we will collect AD single-cell omics data, such as scATAC-seq and proteomics data, and achieve more comprehensive analysis results based on single-cell multiple omics data (Li et al., 2020; Ma et al., 2020a). In addition, spatial transcriptomics and in situ sequencing have recently been used in studying AD (Chen et al., 2020). Transcriptome-scale spatial gene expression data sets can provide further insights into regional and cellular vulnerability in AD. Thus, we will specifically add spatial transcriptomics and in situ sequencing data sets from human AD and AD-like animals to the current scREAD to enable more functional interpretation. Currently, scREAD only contains nine cell types across 10 human and mouse brain regions; we will provide a more comprehensive cell-type annotation considering more brain regions. In this study, we have only removed the control-like cells in the AD data sets. However, individuals in the control group may be patients with potential AD, and thus, control samples might also contain AD-like cells. Therefore, we will apply the same strategy to the control data sets when constructing control atlases in the future.
Resource Availability
Lead Contact
Further information and requests for resources should be directed to and will be fulfilled by the Lead Contact, Qin Ma (qin.ma@osumc.edu) or Hongjun Fu (hongjun.fu@osumc.edu).
Materials Availability
This study did not generate new unique data.
Data and Code Availability
All data sets used in this work are available from public sources as cited in the manuscript. scREAD provides a one-stop, user-friendly interface and is freely available at https://bmbls.bmi.osumc.edu/scread/. The backend workflow can be downloaded from https://github.com/OSU-BMBL/scread/tree/master/script to enable more discovery-driven analyses.
Methods
All methods can be found in the accompanying Transparent Methods supplemental file.