
Use of scREAD to explore and analyze single-cell and single-nucleus RNA-seq data for Alzheimer's disease.

Cankun Wang¹, Yujia Xiang¹, Hongjun Fu², Qin Ma¹.

Abstract

Single-cell RNA-sequencing (scRNA-seq) and single-nucleus RNA-sequencing (snRNA-seq) studies have provided remarkable insights into understanding the molecular pathogenesis of Alzheimer's disease. We recently developed scREAD, a database to provide comprehensive analyses of all the existing AD scRNA-seq and snRNA-seq data from the public domain. Here, we report protocols for using the scREAD web interface and running the backend workflow locally. Our protocols enable custom analyses of AD single-cell and single-nucleus gene expression profiles. For complete details on the use and execution of this protocol, please refer to Jiang et al. (2020).

Keywords: Bioinformatics; Neuroscience; RNA-seq; Single Cell

Year: 2021 PMID： 34013209 PMCID： PMC8113978 DOI： 10.1016/j.xpro.2021.100513

Source DB: PubMed Journal: STAR Protoc ISSN： 2666-1667

Before you begin

Preparing to use the scREAD website

Timing: 5 min

We recommend using the Chrome, Safari, Microsoft Edge, or Firefox web browser to access scREAD (https://bmbls.bmi.osumc.edu/scread/). Microsoft Internet Explorer is not supported. If you would like to submit data to scREAD, prepare raw scRNA-seq or snRNA-seq gene expression data in text format, in which rows represent genes and columns represent cells.

Preparing local environment to run the scREAD workflow

Timing: 30 min

Install R, version 3.6 or greater. The required R dependencies are listed in the key resources table section below.

Raw scRNA-seq or snRNA-seq expression data mainly comes in three formats:

1. A single .txt, .tsv, or .csv formatted gene expression matrix, in which each row represents a feature (gene) and each column represents a cell.

2. A hierarchical data format (HDF5) feature-barcode matrix, generally named filtered_feature_bc_matrix.h5 in the 10× Genomics CellRanger output folder. The HDF5 format consists of a matrix and metadata. The matrix contains barcodes (barcode sequences), data (nonzero UMI counts in column-major order), indices (0-based row index of each entry in data), indptr (0-based index into data and indices marking the start of each column), and shape (a tuple containing the matrix dimensions, (# row, # column)); the metadata contain diverse attributes. The HDF5 file hierarchy example (https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/h5_matrices):

(root)
└── matrix [HDF5 group]
    ├── barcodes
    ├── data
    ├── indices
    ├── indptr
    ├── shape
    └── features [HDF5 group]
        ├─ _all_tag_keys
        ├─ target_sets [for Targeted Gene Expression]
        │   └─ [target set name]
        ├─ feature_type
        ├─ genome
        ├─ id
        ├─ name
        ├─ pattern [Feature Barcode only]
        ├─ read [Feature Barcode only]
        └─ sequence [Feature Barcode only]

3. The three gzip files recording barcodes (barcodes.tsv), features (genes.tsv), and gene expression (matrix.mtx) in the 10× Genomics CellRanger output folder.

Barcodes file (barcodes.tsv)

Features (genes.tsv)

The first column represents the Ensembl gene id, and the second column represents the gene symbol.

Gene expression matrix (matrix.mtx)

The first column “gene id index” represents the row number of genes in the genes.tsv. The second column “cell id index” represents the row number of cell-barcode in the barcodes.tsv. The third column represents the total Unique Molecular Identifier (UMI) count in each cell and gene combination.
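The triplet layout of matrix.mtx and the column-compressed layout of the HDF5 matrix can be illustrated with a minimal Python sketch. It is not part of the scREAD workflow: the function names and toy values are hypothetical, and it assumes 1-based indices in the matrix.mtx triplets (the Matrix Market convention) and 0-based indices in the HDF5 arrays. A real matrix.mtx file also starts with header lines, which are skipped here.

```python
def triplets_to_dense(triplets, n_genes, n_cells):
    """Build a genes x cells count table from (gene index, cell index, UMI count)
    triplets, assuming 1-based indices as in Matrix Market files."""
    dense = [[0] * n_cells for _ in range(n_genes)]
    for gene_idx, cell_idx, count in triplets:
        dense[gene_idx - 1][cell_idx - 1] = count
    return dense


def csc_to_dense(data, indices, indptr, shape):
    """Expand column-compressed arrays (data, indices, indptr, shape) to a dense table.

    indptr[j]:indptr[j + 1] is the slice of data/indices that belongs to column j.
    """
    n_rows, n_cols = shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for col in range(n_cols):
        for pos in range(indptr[col], indptr[col + 1]):
            dense[indices[pos]][col] = data[pos]
    return dense


# Toy example (made-up counts): 3 genes x 2 cells.
toy_triplets = [(1, 1, 5), (3, 1, 2), (2, 2, 7)]
from_mtx = triplets_to_dense(toy_triplets, n_genes=3, n_cells=2)

# The same toy matrix written in column-compressed form.
from_h5 = csc_to_dense(data=[5, 2, 7], indices=[0, 2, 1], indptr=[0, 2, 3], shape=(3, 2))

print(from_mtx == from_h5)
```

Both calls describe the same matrix, which is why either format can be used as input as long as genes end up in rows and cells in columns.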

scREAD maintenance and sustainability plan

scREAD currently contains only the scRNA-seq and snRNA-seq datasets available as of September 22nd, 2020. We will routinely collect more publicly available AD scRNA-seq and snRNA-seq data. You may also contact us by email to let us know about a dataset of interest that should be added to scREAD. We will continue to develop scREAD to support integrative analysis of single-cell multimodal omics data and spatial transcriptomics data. We appreciate any suggestions or feedback.

Key resources table

Step-by-step method details

Checking AD studies summary statistics

Timing: 10 min

This section allows one to browse the overall summary statistics, including species, disease condition, brain region, and gender. The researcher can then make an informed decision about which dataset of interest to navigate to. The current release of scREAD collects datasets from 15 studies in total. Based on the metadata provided in the original papers, we organized the original samples into 73 datasets, each of which corresponds to a specific species (human or mouse), gender (male or female), brain region (entorhinal cortex, prefrontal cortex, superior frontal gyrus, cortex, cerebellum, subventricular zone, superior parietal lobe, or hippocampus), condition (disease or control), and age stage (7 months, 15 months, or 20 months for mice, and 50–100+ years old for humans).

Navigate to https://bmbls.bmi.osumc.edu/scread/, and the dataset summary should be listed (Figures 1A and 1B).

Figure 1

Overview of the scREAD homepage

(A) The pie charts represent four factors of distribution: species, control/disease condition, brain region, and gender from the left side to the right side, respectively. Each color in each pie chart represents one element, and the number represents the distribution ratio for each element under each factor for 73 datasets.

(B) The table shows the general information of all 73 datasets. Users can select filters and the table will be updated accordingly. Clicking a row in the table will pop up the dataset overview panel, and users can navigate to the dataset details page through the link.

You can either click on one of the pie charts or select filters from the dataset summary table, including species, sample condition, brain region, and gender. The content of the table will be updated accordingly.

You can click any row in the table, and a dataset overview panel will pop up. You can then navigate to the dataset details page through the link.

Searching genes of interest from the differential gene expression (DGE) analysis results

Timing: 20 min

This section allows one to search for a gene of interest in DGE analysis results across multiple comparisons. For detailed gene information, the researcher can follow the link on the dataset ID of interest (Figure 2).

Figure 2

Example DGE analysis searching result of the GAD1 gene, a marker gene of inhibitory neurons

Users can select filters and the table will be updated accordingly.

To search for genes of interest in DGE analysis results across multiple datasets, type the gene symbol in the search box and click on the search button.

You can select filters to specify dataset sources or decide which comparison types should be displayed. The query results are returned from multiple DGE analyses, including cell-type-specific genes, subcluster-specific genes, AD vs control differentially expressed genes (DEGs), and AD vs AD DEGs.

Cell-type-specific genes were identified by performing DGE analysis between the cell type of interest and the average of the remaining cell types.

Subcluster-specific genes were identified by performing DGE analysis between the subcluster of interest and the average of the remaining subclusters from the same cell type.

AD vs control DGE analysis was performed within the same cell type, brain region, and gender. For example, Male-AD-Prefrontal cortex vs Male-Control-Prefrontal cortex.

AD vs AD DGE analysis was performed within the same cell type but across different sample conditions. For example, one AD vs AD comparison can be Male-AD-Prefrontal cortex vs Male-AD-Entorhinal Cortex or Male-AD-Prefrontal cortex vs Female-AD-Prefrontal cortex.

The 'multiple comparison types' in the comparison type selection box include disease vs control comparisons performed within the same dataset and disease vs disease comparisons performed between two different datasets.

A positive log fold change (FC) value indicates that gene expression is higher in the first group. The log FC is reported as a natural logarithm.
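Because the log FC is reported as a natural logarithm, it may help to convert it before comparing it with log2-based values from other tools. The following Python sketch shows the arithmetic only; the function names are hypothetical and are not part of scREAD.

```python
import math


def natural_log_fc_to_fold_change(log_fc):
    """Convert a natural-log fold change into a plain fold change (ratio)."""
    return math.exp(log_fc)


def natural_log_fc_to_log2_fc(log_fc):
    """Convert a natural-log fold change into a log2 fold change."""
    return log_fc / math.log(2)


# A natural-log FC of 0.5 corresponds to a fold change of about 1.65.
example_fold_change = natural_log_fc_to_fold_change(0.5)
example_log2_fc = natural_log_fc_to_log2_fc(0.5)
print(round(example_fold_change, 2), round(example_log2_fc, 2))
```

A natural-log FC of 0 still means no change, and the sign keeps the same interpretation after conversion.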

Checking cell clustering results

Timing: 20 min

In this section, we use a dataset from scREAD (ID: AD00103, https://bmbls.bmi.osumc.edu/scread/AD00103) as an example to show the analysis results. This dataset consists of 6,629 cells isolated from a human AD female prefrontal cortex sample (Mathys et al., 2019). Not all cells collected from AD patient samples are disease-associated; some healthy cells may be included in the cell populations, and such cells were defined as healthy-like cells in Granja et al.'s study (Granja et al., 2019). We applied this concept to all AD datasets in scREAD and defined these healthy-like cells as control-like cells. Control-like cells maintain distinct regulatory mechanisms and gene expression patterns compared to AD cells, and they interfere with the accurate identification of AD cell types. Thus, we removed these control-like cells from the disease datasets and identified AD-associated cells. Here, scREAD filtered out 950 control-like cells and kept 5,679 AD-associated cells for the downstream analysis. See the quantification and statistical analysis section for more details about how scREAD filters out control-like cells.

Checking cell clustering results (Figure 3).

Figure 3

Cell clustering and gene expression result from the scREAD homepage

Cell clustering results including UMAP plot colored by cell types or subclusters (left), and searching gene expression using the same UMAP coordinates (right). The darker the color is in this UMAP, the higher the expression value of the gene.

By default, all cell types are selected and the corresponding Adjusted Rand Index (ARI) is displayed. The ARI score evaluates the similarity between our predicted cell types and the cell labels in the original paper. When the original paper did not provide cell labels, the ARI score is not displayed and a silhouette score is displayed instead. The ARI score is also hidden if the user has not selected all cell types.

If you choose one of these cell types, the Uniform Manifold Approximation and Projection (UMAP) (Becht et al., 2019) plot below changes to the UMAP of predicted subclusters for this specific cell type.

A sliding bar controls the size of each point in the UMAP plot. It ranges from 1 to 10; the bigger the number, the larger the point size.

The function bar contains several quick buttons for graphic operations.

Hovering the cursor over a cell point displays the cell type, cell name, and UMAP coordinates.

The legend of the gene expression UMAP plot is displayed based on the genes selected in the drop-down bar. The darker the color is in this UMAP, the higher the expression value of the gene.

CRITICAL: Rendering the gene expression scatter plot can be slow because of network speed or the large number of cells to be processed. Please be patient while scREAD is fetching data from the backend server.
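To make the ARI concrete, the following Python sketch computes it from two label vectors using the standard pair-counting formula. It is a generic illustration with made-up labels, not the code scREAD uses; the function name and the convention of returning 1.0 for degenerate inputs are assumptions of this sketch.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # assumption: fewer than two cells counts as perfect agreement

    # Pairs of cells that share a label in both labelings.
    sum_pairs = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of cells that share a label within each labeling separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / total_pairs
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # assumption: degenerate case, e.g., a single cluster in both
    return (sum_pairs - expected) / (max_index - expected)


# Made-up labels: the predicted clusters match the original labels up to renaming.
original = ["ast", "ast", "mic", "mic"]
predicted = ["c1", "c1", "c2", "c2"]
print(adjusted_rand_index(original, predicted))
```

The score depends only on which cells are grouped together, so renaming clusters does not change it; values near 0 indicate agreement no better than chance.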

Checking differential expression (DE) results and performing functional enrichment analysis

Timing: 20 min

In this section, we use the same example data as in checking cell clustering results to illustrate DGE analysis results and to perform functional gene set enrichment analysis based on DEGs.

First, apply the necessary filtering criteria from the DGE analysis results panel (Figure 4A).

Figure 4

Differential gene expression (DGE), and functional gene set enrichment based on DEGs

(A) Differential gene expression analysis panel, the comparison groups include cell-type-specific genes, subcluster specific genes, and DEGs from the cross-dataset comparison.

(B) KEGG pathway, GO biological process, molecular function, and cellular component analysis using the DEGs from (A). An example of enriched genes is displayed for the 5th GO cellular component term.

DGE analysis groups for browsing cell-type-specific genes, subcluster-specific genes, and DE genes from the cross-dataset comparison:

1. Choose the cell type of interest in DGE analysis.
2. Choose the log fold-change range (default = 0.5; ranges from 0 to 5).
3. Choose the adjusted p value range (default = 0.05; ranges from 10^-6 to 1).
4. Filter the DE direction by all DE genes, only up-regulated genes, or only down-regulated genes (default = 'all').
5. Search for genes that you are interested in; the table below then returns the matching results.
6. Download the currently listed table.
7. The GeneCards database (https://www.genecards.org/) is linked to each gene in the table.

Adjusting any of the parameters above immediately updates the displayed DEGs table.

Performing functional gene set enrichment (Figure 4B)

KEGG pathway, GO biological process, GO molecular function, and GO cellular component analyses are performed using the DEGs from above. An example of enriched genes is displayed for the 4th GO cellular component term. You can also search for a specific item by entering it in the search box.

CRITICAL: The functional gene set analysis results are calculated in real time by sending the DEGs to the Enrichr (Kuleshov et al., 2016) web server (https://maayanlab.cloud/Enrichr/). Thus, changing the DEG log FC or p value cutoffs can significantly change the results of the enrichment analysis.

Because Enrichr does not provide an option to submit a custom background, its results could be misleading (Timmons et al., 2015). We therefore also provide a link to another enrichment analysis tool, g:Profiler (https://biit.cs.ut.ee/gprofiler/), which allows users to submit custom background gene sets.
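The effect of the three filters (log FC cutoff, adjusted p value cutoff, and DE direction) can be sketched in Python as follows. This is an illustration under stated assumptions, not the scREAD implementation: the field names log_fc and adj_p, the gene names, and the values are made up, and the sketch assumes the log FC cutoff is applied to the absolute value.

```python
def filter_degs(rows, min_log_fc=0.5, max_adj_p=0.05, direction="all"):
    """Keep DEG rows that pass the log FC, adjusted p value, and direction filters.

    rows: list of dicts with hypothetical keys "gene", "log_fc", and "adj_p".
    """
    if direction not in ("all", "up", "down"):
        raise ValueError("direction must be 'all', 'up', or 'down'")
    kept = []
    for row in rows:
        if row["adj_p"] > max_adj_p:
            continue
        if abs(row["log_fc"]) < min_log_fc:  # assumption: cutoff on |log FC|
            continue
        if direction == "up" and row["log_fc"] <= 0:
            continue
        if direction == "down" and row["log_fc"] >= 0:
            continue
        kept.append(row)
    return kept


# Toy table with made-up values (not scREAD results).
toy_degs = [
    {"gene": "GENE_A", "log_fc": 1.2, "adj_p": 0.001},
    {"gene": "GENE_B", "log_fc": -0.8, "adj_p": 0.01},
    {"gene": "GENE_C", "log_fc": 0.3, "adj_p": 0.0001},
    {"gene": "GENE_D", "log_fc": 0.9, "adj_p": 0.2},
]

up_genes = [row["gene"] for row in filter_degs(toy_degs, direction="up")]
all_genes = [row["gene"] for row in filter_degs(toy_degs)]
print(up_genes, all_genes)
```

Loosening either cutoff enlarges the gene list that is sent for enrichment analysis, which is why the enrichment results are sensitive to these settings.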

Identifying cell-type-specific regulons

Timing: 20 min

In this section, we describe the process of identifying cell-type-specific regulons (CTSRs) using IRIS3. In the DE section, when you select the “Cell-type-specific genes” item in the “Group” select box, cell-type-specific regulon analysis will be performed. CTSR results are displayed at the bottom of the screen.

Click on the “Cell-type-specific regulons” bar to show detailed CTSR information. The CTSRs are displayed in a reactive table in the DE section; each row shows a set of genes regulated by a specific transcription factor (TF) in a given cell type.

Click on the “Regulon overview” panel to see a table that summarizes the overall cell number and regulon number in each cell cluster (Figure 5A).

Figure 5

Cell-type-specific regulon (CTSR) analysis results from IRIS3

(A) Overview table of identified CTSRs.

(B) An example of a CTSR (p value < 0.05). Stars indicate differentially expressed genes within the cell type, and the corresponding matched TF is linked to the HOCOMOCO database.

(C) UMAP plot colored by regulon activities in each cell.

You can navigate to IRIS3 to see the detailed results of this job by clicking the 'Open cell-type-specific regulon result page in the new tab' button.

In the table, an index number is given to represent each CTSR (Figure 5B).

Both the gene composition of regulons and their expression values across different cell types can be displayed in a heatmap. Regulons are ranked in increasing order of the empirical p values of their regulon specificity scores (RSS; see the quantification and statistical analysis section), and a regulon is named CTn-Rm, with 'n' representing the index of the cell type and 'm' representing the regulon rank. Because of space limitations, only the top ten regulons and their corresponding genes are shown in the heatmap, and the component genes of each regulon are indicated as green rectangles. The heatmap records the log-transformed expression level of each gene covered by the top ten regulons across all cells.

Regulon results are shown separately for each cell type. Click on the "CT#" button to switch to the results for other cell types.

A scatter plot shows the distribution of the RSS of each regulon. CTSRs are ranked at the top and marked as blue dots with their representative TF names, and insignificant regulons are marked as grey dots.

For each regulon, the genes and the corresponding TF are presented (Figure 5B), with several actions that link to the heatmap, functional gene set enrichment analysis, and regulon activities in the UMAP plot (Figure 5C). A more detailed interpretation of each regulon can be found on the IRIS3 website, https://bmbl.bmi.osumc.edu/iris3/tutorial.php#3example&q=2.

Figure 6

Overlapping genes and annotated cell type (ct) from the example dataset

(A) The up-regulated overlapping genes in Human Entorhinal Cortex Astrocytes (ast).

(B) The log FC, dataset source, and rankings from the overlapping genes table in (A).

(C) A UMAP plot of the control dataset example with six cell types annotated by the scREAD workflow.

(D) A UMAP plot of the disease dataset example with six cell types transferred from the example reference control dataset.

CRITICAL: The CTSR results are only available when you select the cell-type-specific option in the DEG group box.

Optional step: calculating overlapping DEGs from multiple comparisons

Timing: 1 h

In this section, we provide a workflow for calculating overlapping DEGs from multiple comparisons. Suppose we have n AD vs control comparisons from a cell type of interest in a specific brain region. For each comparison, we select the top m DEGs based on the ranked log FC. We define an "overlapping gene" as a gene that appears at least k times in the n comparisons (k ≤ n). m and k are parameters set by the user.

Here, we provide two approaches:

1. Follow the code example in your local R environment.
2. For users wishing to avoid setup procedures, use the following Google Colab link. Colab is an interactive computational environment that combines live code, visualizations, and explanatory text: https://colab.research.google.com/drive/1lInXa6jD4yc7RGJc0EWDfy5NNoXT1qye?usp=sharing.

If you wish to perform the calculation on your local computer, first load the R packages, scREAD data, and predefined functions in your local R environment.

To calculate overlapping genes, these parameters are needed:

1. The number of genes to be selected in each AD vs control DEG result (default = 100)
2. Species (default = Human)
3. Brain region (e.g., Entorhinal Cortex)
4. DE direction (e.g., up)
5. Overlap threshold (for example, if a gene must appear in at least 3 of 4 total comparisons to count as an overlapping gene, the threshold is 3)

We can then process some of our metadata and make the settings needed to calculate the overlapping genes:

We use the top 100 DE genes in each AD vs control comparison.

Species should be either 'Human' or 'Mouse'.

Specify the brain region of interest. Here, we selected the 5th brain region in the REGION_LIST variable, i.e., 'Entorhinal Cortex'.

DE direction should be either 'up' or 'down'. 'up' means we select DE genes that are expressed higher in the disease dataset (the first group).

OVERLAP_THRES should be defined manually based on your interest and the total number of comparisons in scREAD. For example, scREAD has 4 AD vs control dataset comparisons in total; we set the threshold to 3, meaning that we want to find overlapping genes that appear in at least 3 comparisons.

Now we can calculate the overlapping genes based on the parameters above. The results are stored in a list variable, from which two tables can be generated:

1. The overlapping genes in the selected brain region (Figure 6A)
2. The detailed information for the overlapping genes, including rankings, log FC, and dataset source (Figure 6B)
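The overlapping-gene rule can be sketched in a few lines of Python. This is an illustrative re-implementation of the definition given in this section, not the R functions distributed with the protocol; the function name, the input layout (one gene-to-log-FC mapping per comparison), and the toy values are assumptions.

```python
from collections import Counter


def overlapping_genes(comparisons, top_n=100, overlap_thres=3, direction="up"):
    """Find genes ranked in the top_n DEGs of at least overlap_thres comparisons.

    comparisons: list of dicts mapping gene symbol -> log FC, one per comparison.
    """
    counts = Counter()
    ranks = {}
    for table in comparisons:
        if direction == "up":
            candidates = [(g, fc) for g, fc in table.items() if fc > 0]
            candidates.sort(key=lambda item: item[1], reverse=True)
        elif direction == "down":
            candidates = [(g, fc) for g, fc in table.items() if fc < 0]
            candidates.sort(key=lambda item: item[1])
        else:
            raise ValueError("direction must be 'up' or 'down'")
        for rank, (gene, _) in enumerate(candidates[:top_n], start=1):
            counts[gene] += 1
            ranks.setdefault(gene, []).append(rank)
    return {
        gene: {"n_comparisons": n, "mean_rank": sum(ranks[gene]) / n}
        for gene, n in counts.items()
        if n >= overlap_thres
    }


# Toy data with made-up log FC values: 4 comparisons, 3 genes.
toy_comparisons = [
    {"G1": 2.0, "G2": 1.5, "G3": 0.2},
    {"G1": 1.8, "G3": 1.1, "G2": 0.4},
    {"G2": 2.2, "G1": 1.0, "G3": 0.9},
    {"G1": 0.7, "G2": -0.3, "G3": 0.6},
]

toy_result = overlapping_genes(toy_comparisons, top_n=2, overlap_thres=3, direction="up")
print(toy_result)
```

In the toy data only G1 is in the top 2 of at least 3 comparisons, so it is the only overlapping gene; raising top_n or lowering overlap_thres makes the rule less strict.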

Optional section 7: Running the scREAD backend analysis workflow locally

Timing: 2 h

In this section, we present how to run the scREAD workflow to process a custom dataset. The workflow can be used in a Unix command-line environment with R installed.

1. Download the scREAD workflow and an example dataset from https://github.com/OSU-BMBL/scread-protocol/tree/master/workflow (5 min). The folder should contain the following files:
   custom_marker.csv: A manually created marker gene list file used to identify cell types.
   functions.R: Visualization functions used in R.
   build_control_atlas.R: Builds the control cell atlas Seurat object from a count matrix file.
   transfer_cell_type.R: Filters out control-like cells in the disease dataset.
   run_analysis.R: Runs the analysis workflow and exports tables in the scREAD database format.
   example_control.csv: The example control dataset.
   example_disease.csv: The example disease dataset.

2. Build the control atlas file from the raw gene expression matrix (5 min).
   a. Prepare your control gene expression data. In the data frame, the first column should contain gene symbols, and each remaining column should correspond to one cell, headed by its cell label.
   b. Put all code and data in a working directory (e.g., PATH_TO_WD). In this protocol, we will run example_control.csv.
   c. build_control_atlas.R takes three parameters: the working directory path, the control data path, and the output data ID.
   d. Next, run build_control_atlas.R with these three parameters, remembering to change PATH_TO_WD to your working directory path.
   The expected output for this step contains four files:
   control_example.rds: The Seurat R object storing the example control data.
   control_example_expr.txt: Filtered gene expression matrix.
   control_example_cell_label.txt: The first column is the cell name, and the second column is the cell type information.
   control_example_umap.png: UMAP plot of the example control data colored by cell types (Figure 6C).

3. Transfer cell types based on the control atlas (~5 min). The goal of this step is to annotate cell types in the disease gene expression matrix file, using the control atlas as the reference.
   a. Put all code and data in a working directory (e.g., PATH_TO_WD) after you have generated the control atlas file (control_example.rds).
   b. transfer_cell_type.R takes four parameters: the working directory path, the control atlas Seurat object file name, the disease gene expression matrix name, and the output disease data ID.
   c. Next, run transfer_cell_type.R with these four parameters.
   The expected output for this step contains four files:
   disease_example.rds: The Seurat R object storing the example disease data.
   disease_example_expr.txt: Filtered gene expression matrix.
   disease_example_cell_label.txt: The first column is the cell name, and the second column is the cell type information.
   disease_example_umap.png: UMAP plot of the disease data colored by cell types (Figure 6D).

4. Run the data analysis (60 min). The goal of this step is to identify cell-type-specific genes and DEGs from the two example datasets.
   a. Put all code and data in a working directory (e.g., PATH_TO_WD) after you have generated the control atlas file (control_example.rds) and the disease file (disease_example.rds).
   b. run_analysis.R takes three parameters: the working directory path, the control Seurat object file name, and the disease Seurat object file name.
   c. Next, run run_analysis.R with these three parameters.
   The expected output for this step contains three folders:
   /de: Differential gene expression analysis results (cell-type-specific genes, subcluster-specific genes, and DEGs between two conditions).
   /dimension: UMAP coordinates for the two datasets.
   /subcluster_dimension: UMAP coordinates for each subcluster in the two datasets.

5. Identify CTSRs using IRIS3 (2 h).
   a. Navigate to https://bmbl.bmi.osumc.edu/iris3/submit.php and submit two jobs for the two example datasets:
      Upload control_example_expr.txt and control_example_cell_label.txt.
      Upload disease_example_expr.txt and disease_example_cell_label.txt.
   The expected output from IRIS3 contains these files:
   Lists: CTSR gene list, marker gene list, gene module list, motif list, and transcription factor list.
   Tables: Predicted cell types, bulk ATAC peak enrichment, and TAD association.
   Figures (only displayed on the website): CTSR activity UMAP, CTSR gene heatmap, and trajectory UMAP.

CRITICAL: You need to open the advanced options tab on the IRIS3 submission page to upload a custom cell label.
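Before uploading files to IRIS3, it can be useful to confirm that the previous steps produced every expected file. The Python sketch below is a hypothetical helper, not part of the scREAD workflow; it assumes that the four output file names are formed from the output data ID followed by the suffixes listed above (for example, control_example or disease_example).

```python
from pathlib import Path


def expected_output_files(data_id):
    """File names expected for one dataset, assuming the ID is used as the prefix."""
    return [
        f"{data_id}.rds",             # Seurat R object
        f"{data_id}_expr.txt",        # filtered gene expression matrix
        f"{data_id}_cell_label.txt",  # cell name and cell type columns
        f"{data_id}_umap.png",        # UMAP plot colored by cell types
    ]


def missing_outputs(working_dir, data_id):
    """Return the expected files that are not present in the working directory."""
    wd = Path(working_dir)
    return [name for name in expected_output_files(data_id) if not (wd / name).is_file()]


print(expected_output_files("control_example"))
```

For example, calling missing_outputs with your working directory path and "disease_example" returns an empty list once the cell type transfer step has finished successfully.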

Expected outcomes

Web server results

scREAD provides comprehensive analysis results for the selected AD scRNA-seq or snRNA-seq datasets. These summaries and the integrated analysis of differentially expressed genes (DEGs) are presented in graphical and tabular formats (Figure 1). Figure 2A shows the dataset details page that outlines all the necessary information regarding dataset metadata, source, and related data IDs from the same study. Figure 3 shows cell clustering results, including a UMAP plot colored by cell types (left) and gene expression using the same UMAP coordinates (right). The darker the color is in this UMAP, the higher the expression value of the gene. Figure 4A shows DEGs after the user selects the comparison group and the cell type of interest. Users can sort the table by log FC or p values. The DEG results table can be downloaded as a tab-separated value file for further analysis. The DEGs are filtered based on the thresholds set by the user, and the functional gene set enrichment analysis for KEGG pathways and Gene Ontology terms is calculated using the current DEGs as input (Figure 4B). If the user selects 'cell-type-specific genes' in the group select box, the identified CTSRs along with their RSS will be displayed at the bottom of the screen. The user can also navigate to the IRIS3 website to browse the regulons in more detail. Figure 5 shows the CTSR analysis results from IRIS3, including the table of all identified regulons (Figure 5A) and detailed information for CT3-R1 (cell type 3, regulon 1); the stars indicate differentially expressed genes within the cell type, and the corresponding matched TF is linked to the HOCOMOCO (Kulakovskiy et al., 2018) database (Figure 5B). The regulon activities are visualized in a UMAP plot (Figure 5C).

Overlapping DEGs results

After calculating overlapping DEGs from the same cell type across datasets using the default settings, two tables for overlapping genes will be generated:

1. The overlapping genes in the selected region (Figure 6A). Using the default settings, the fourth row in the table can be interpreted as: 'For all AD vs control dataset comparisons in Human Entorhinal Cortex Astrocytes (ast), the GFAP gene ranked in the top 100 by log FC values in at least 3 comparisons'.

2. The ranking information for the overlapping genes (Figure 6B). Using the GFAP gene as an example, GFAP is an overlapping gene in Human Entorhinal Cortex Astrocytes: it ranked in the top 50 in 3 of 4 total comparisons, its mean rank is 13, and its log FC in each comparison is also listed.

Backend workflow results

The backend workflow generates a series of tables containing the intermediate and final analysis results, including cell-type-specific genes, subcluster-specific genes, and DEGs between two datasets. The following descriptions of output files use the example dataset as the filename; the names of your files will differ depending on your filename settings.

The control atlas output should contain four files:
File 1: control_example.rds: The Seurat R object storing the example control data.
File 2: control_example_expr.txt: Filtered gene expression matrix.
File 3: control_example_cell_label.txt: The first column is the cell name, and the second column is the cell type information.
File 4: control_example_umap.png: UMAP plot of the example control data. The UMAP plot visualizes scRNA-seq data in two dimensions. Each dot represents a cell, and cells are colored by annotated cell type. Cells of the same cell type should co-localize on the UMAP plot (Figure 6C).

The disease dataset output should contain four files:
File 1: disease_example.rds: The Seurat R object storing the example disease data.
File 2: disease_example_expr.txt: Filtered gene expression matrix.
File 3: disease_example_cell_label.txt: The first column is the cell name, and the second column is the cell type information.
File 4: disease_example_umap.png: UMAP plot of the example disease data. The UMAP plot visualizes scRNA-seq data in two dimensions. Each dot represents a cell, and cells are colored by annotated cell type. Cells of the same cell type should co-localize on the UMAP plot (Figure 6D).

The output tables should be stored in three folders:
Folder 1: /de: Differential gene expression analysis results (cell-type-specific genes, subcluster-specific genes, and DEGs between two conditions).
Folder 2: /dimension: UMAP coordinates for the two datasets.
Folder 3: /subcluster_dimension: UMAP coordinates for each subcluster in the two datasets.

Quantification and statistical analysis

To determine whether cells from disease datasets are control-like, the Harmony R package (v1.0) was first used to integrate the disease dataset with its corresponding control atlas. After the integration, cells were clustered using Seurat’s FindClusters function with a resolution of 4. A hypergeometric test was performed for each cluster using the number of cells from disease cells and the number of cells from the control atlas. Clusters were considered to be control-like if the hypergeometric test result was significant (p value < 0.0001, Benjamini-Hochberg adjusted), and the cells from the disease dataset in control-like clusters were removed from the downstream analyses. Differential gene expression analysis was performed using MAST (Finak et al., 2015). Seurat’s FindAllMarkers and FindMarkers functions that utilize the MAST package were used to run DGE analysis on normalized gene expression data. Cell-type-specific genes were identified by performing DGE analysis between the cell type of interest and the average of the remaining cell types. Subcluster-specific genes were identified by performing DGE analysis between the subcluster of interest and the average of the remaining subclusters from the same cell type. For each cell type, several DGE analysis was performed within two different datasets, categorized from AD versus control, and AD versus AD in the same species under the same gender, brain region, and age. To regress out technical biases from different datasets, the dataset latent variables were added in all cross-dataset DGE analyses. Functional enrichment analysis was performed using the Enrichr web server. The p value was computed using a standard statistical method used by most enrichment analysis tools: Fisher's exact test or the hypergeometric test. This is a binomial proportion test that assumes a binomial distribution and independence for the probability of any gene belonging to any set. CTSRs were identified using IRIS3. 
The RSS for a cell type was calculated according to the entropy of the regulon activity scores (RAS) of cells within the cell type compared to other cell types. An RSS ranges from 0 to 1, with a higher value representing greater specificity of a regulon in the cell type. An empirical p value for a regulon's RSS can be estimated by comparing it with the RSSs of randomly selected gene sets (with the same number of genes as the regulon, drawn by a bootstrap method) in the same cell type, repeated 10,000 times. Regulon p values are Bonferroni-adjusted by multiplying them by the number of regulons in that cell type. Regulons with adjusted p values < 0.05 (by default) are considered CTSRs.
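The empirical p value and Bonferroni adjustment described above can be sketched as follows. This is a minimal Python illustration and not the IRIS3 code. The scoring function is a hypothetical stand-in (the mean of made-up per-gene weights, not an entropy-based RSS), random gene sets are drawn without replacement, and the add-one form of the empirical p value is an assumption chosen so that the estimate is never exactly zero. The 10,000 iterations and the 0.05 cutoff follow the text; the number of regulons (50) is invented for the example.

```python
import random


def empirical_p_value(observed, null_scores):
    """Share of null scores at least as large as the observed score.

    The +1 terms are an assumption; they keep the estimate above zero.
    """
    exceed = sum(1 for s in null_scores if s >= observed)
    return (exceed + 1) / (len(null_scores) + 1)


def bonferroni(p, n_tests):
    """Multiply by the number of tests and cap at 1."""
    return min(1.0, p * n_tests)


def null_scores_for(size, all_genes, score_fn, n_iter=10_000, seed=0):
    """Score n_iter random gene sets that have the same size as the regulon."""
    rng = random.Random(seed)
    return [score_fn(rng.sample(all_genes, size)) for _ in range(n_iter)]


# Hypothetical data: 200 genes with made-up weights between 0 and 1.
genes = [f"g{i}" for i in range(200)]
weight = {g: i / 199 for i, g in enumerate(genes)}


def toy_score(gene_set):
    """Stand-in for a specificity score; NOT the entropy-based RSS."""
    return sum(weight[g] for g in gene_set) / len(gene_set)


regulon = genes[-10:]      # the ten highest-weight genes
n_regulons = 50            # hypothetical number of regulons in the cell type

observed = toy_score(regulon)
null = null_scores_for(len(regulon), genes, toy_score)
adjusted_p = bonferroni(empirical_p_value(observed, null), n_regulons)
is_ctsr = adjusted_p < 0.05
print(adjusted_p, is_ctsr)
```

With 10,000 random sets, the smallest attainable unadjusted value under this form is 1/10,001, so the number of regulons tested in a cell type limits how small an adjusted p value can be.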

Limitations

Limitation 1

Although scREAD continues to collect all publicly available AD scRNA-seq and snRNA-seq data, you may still find that your dataset of interest is not included in scREAD. This may be because data policies do not allow us to include some datasets in scREAD.

Limitation 2

scREAD uses a semi-supervised cell type annotation method; thus, only the eight major cell types in our marker gene list can be annotated, i.e., astrocytes, endothelial cells, excitatory neurons, inhibitory neurons, microglia, oligodendrocytes, oligodendrocyte precursor cells, and pericytes. Across different brain regions, forcing each cell to the closest of these cell types is likely to result in misannotation, although we are aware that subtypes could be resolved more finely to address this problem. The subcluster function of scREAD therefore provides a more comprehensive cell type annotation that takes cross-region heterogeneity into account. Because no standard or consistent annotations are available, we have not annotated the subclusters.

Troubleshooting

Problem 1

The content on the scREAD webpage is not displayed correctly when checking the summary statistics of AD studies.

Potential solution

Please make sure to use one of the modern browsers, including Chrome, Firefox, Microsoft Edge, or Safari. Internet Explorer is not supported in scREAD.

Problem 2

A file in an incorrect format is uploaded to the scREAD online analysis workflow (https://bmbls.bmi.osumc.edu/scread/submit).

Potential solution

The scREAD online analysis workflow requires a gene expression matrix in text format, in which each row represents a gene and each column represents a cell.
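Before uploading, the layout of a matrix file can be checked locally. The following Python sketch is a hypothetical helper, not part of scREAD. It assumes a comma-delimited file whose first column holds gene names and whose header row holds cell names, and it assumes that raw expression values are non-negative numbers.

```python
import csv
import io


def check_expression_matrix(handle, delimiter=","):
    """Return (n_genes, n_cells), or raise ValueError on the first problem."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader, None)
    if header is None or len(header) < 2:
        raise ValueError("header needs a gene column plus at least one cell")
    n_cells = len(header) - 1
    n_genes = 0
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # tolerate blank lines
        if len(row) != n_cells + 1:
            raise ValueError(
                f"line {line_no}: expected {n_cells + 1} fields, found {len(row)}"
            )
        try:
            values = [float(v) for v in row[1:]]
        except ValueError:
            raise ValueError(f"line {line_no}: non-numeric expression value") from None
        if any(v < 0 for v in values):
            raise ValueError(f"line {line_no}: negative expression value")
        n_genes += 1
    if n_genes == 0:
        raise ValueError("no gene rows found")
    return n_genes, n_cells


# Small example with rows as features (genes) and columns as cells.
example = io.StringIO(
    "Feature,Cell1,Cell2,Cell3,Cell4\n"
    "Feature 1,0,2,7,3\n"
    "Feature 2,0,4,6,0\n"
    "Feature 3,2,0,2,0\n"
)
shape = check_expression_matrix(example)
print(shape)  # (3, 4)
```

For a tab-delimited .txt or .tsv file, pass delimiter="\t" when calling the helper.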

Problem 3

Difficulty in selecting parameters for DEGs when checking differential expression (DE) results and performing functional enrichment analysis.

Potential solution

Adjusting any parameter in the DE section immediately updates the displayed DEGs table. We recommend first using a less stringent p value to select a large preliminary list of genes, then ranking all genes in the list by fold change (FC), and finally applying an FC cutoff to determine the final set of DEGs (Zhao et al., 2018). To apply more stringent criteria to the enrichment analysis result, increase the log fold-change cutoff or decrease the p value cutoff; to relax the criteria, do the opposite.
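The selection strategy described above (a lenient p value filter, ranking by fold change, then a fold-change cutoff) can be sketched in a few lines of Python. The field names, gene names, cutoff values, and numbers below are hypothetical and are not scREAD defaults or results.

```python
def select_degs(results, p_cutoff=0.05, logfc_cutoff=0.25):
    """Filter by p value, rank by absolute log fold change, then apply an FC cutoff."""
    # Step 1: a lenient p value keeps a large preliminary list.
    preliminary = [r for r in results if r["p_value"] <= p_cutoff]
    # Step 2: rank the preliminary list by absolute log fold change.
    ranked = sorted(preliminary, key=lambda r: abs(r["log_fc"]), reverse=True)
    # Step 3: the fold-change cutoff determines the final set.
    return [r for r in ranked if abs(r["log_fc"]) >= logfc_cutoff]


# Hypothetical DE results.
results = [
    {"gene": "GeneA", "log_fc": 1.2, "p_value": 0.001},
    {"gene": "GeneB", "log_fc": -0.8, "p_value": 0.03},
    {"gene": "GeneC", "log_fc": 0.1, "p_value": 0.0005},
    {"gene": "GeneD", "log_fc": 2.0, "p_value": 0.2},
]

lenient = [r["gene"] for r in select_degs(results)]
strict = [r["gene"] for r in select_degs(results, p_cutoff=0.01, logfc_cutoff=1.0)]
print(lenient)  # ['GeneA', 'GeneB']
print(strict)   # ['GeneA']
```

Raising the fold-change cutoff or lowering the p value cutoff shortens the list passed to enrichment analysis, which mirrors the effect of adjusting the corresponding parameters in the DE section.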

Problem 4

File system permission errors occur while installing R packages in section 7.

Potential solution

On some computers or high-performance computing (HPC) systems, your user account may not have the necessary permissions to install new packages. In that case, we suggest contacting your administrator to install these dependencies. If you specifically have trouble compiling Harmony, please refer to this solution: https://github.com/immunogenomics/harmony/issues/10.

Problem 5

Using input files in a different format when running the workflow in section 7.

Potential solution

The protocol in section 7 provides examples that use files with the CSV extension. For other formats, replace the file-reading function read.csv with the corresponding Seurat function:
HDF5 file: Read10X_h5 (https://rdrr.io/cran/Seurat/man/Read10X_h5.html)
Three gzip files recording information on barcodes (barcodes.tsv), features (genes.tsv), and gene expression (matrix.mtx): Read10X (https://rdrr.io/cran/Seurat/man/Read10X.html)

Resource availability

Lead contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Qin Ma (qin.ma@osumc.edu).

Materials availability

This study did not generate any new materials.

Data and code availability

scREAD is freely available at https://bmbls.bmi.osumc.edu/scread/. All code for the scREAD protocols is freely available on GitHub: https://github.com/OSU-BMBL/scread-protocol. The interactive tutorial for the optional step of calculating overlapping DEGs from multiple comparisons is freely available in Google Colab: https://colab.research.google.com/drive/1lInXa6jD4yc7RGJc0EWDfy5NNoXT1qye?usp=sharing.

Feature	Cell1	Cell2	Cell3	Cell4	Cell_n
Feature 1	0	2	7	3	…
Feature 2	0	4	6	0	…
Feature 3	2	0	2	0	…

Barcode
AAACATACAAAACG-1
AAACATACAAAAGC-1
AAACATACAAACAG-1
AAACATACAAACGA-1
…

Ensembl_id	Gene_symbol
ENSG00000243485	MIR1302-10
ENSG00000237613	FAM138A
ENSG00000186092	OR4F5
ENSG00000238009	RP11-34P13.7
…	…

Gene id index	Cell id index	UMI count
498	1	1
5423	1	6
6374	1	3
12932	1	1
…	…	…

REAGENT or RESOURCE	SOURCE	IDENTIFIER
Deposited data

AD scRNA-seq data (Grubman et al., 2019)	GEO	GSE138852
scREAD database	(Jiang et al., 2020)	https://bmbls.bmi.osumc.edu/scread/
All code for the scREAD protocols	GitHub	https://github.com/OSU-BMBL/scread-protocol
The interactive tutorial for optional step: calculating overlapping DEGs from multiple comparisons to find overlapping genes	Google Colab	https://colab.research.google.com/drive/1lInXa6jD4yc7RGJc0EWDfy5NNoXT1qye?usp=sharing

Software and algorithms

Chrome	Google	https://www.google.com/chrome/
Safari	Apple	https://www.apple.com/safari/
Microsoft Edge	Microsoft	https://www.microsoft.com/en-us/edge
Firefox	Mozilla	https://www.mozilla.org/en-US/
R (>=3.6)	R Project	https://www.r-project.org/
IRIS3 (v1.2.4)	(Ma et al., 2020)	https://bmbl.bmi.osumc.edu/iris3/
Seurat (v3.2)	(Stuart et al., 2019)	https://satijalab.org/seurat
Harmony (v0.1)	(Korsunsky et al., 2019)	https://github.com/immunogenomics/harmony
Polychrome (v1.2.5)	R CRAN	https://cran.r-project.org/web/packages/Polychrome
SCINA (v1.2.0)	(Zhang et al., 2019)	https://github.com/jcao89757/SCINA
RColorBrewer (v1.1-2)	R CRAN	https://cran.r-project.org/web/packages/RColorBrewer
ggplot2 (v3.3.2)	R CRAN	https://ggplot2.tidyverse.org/
tidyverse (v1.3.0)	R CRAN	https://www.tidyverse.org/
cowplot (v1.0.0)	(Wilke et al., 2021)	https://github.com/wilkelab/cowplot/tree/1.1.1
RMySQL (v0.10.21)	R CRAN	https://cran.r-project.org/web/packages/RMySQL
future (v1.21.0)	R CRAN	https://cran.r-project.org/web/packages/future
MAST (v1.16.0)	(Finak et al., 2015)	https://www.bioconductor.org/packages/release/bioc/html/MAST.html

References

1. Comprehensive Integration of Single-Cell Data.

Authors: Tim Stuart; Andrew Butler; Paul Hoffman; Christoph Hafemeister; Efthymia Papalexi; William M Mauck; Yuhan Hao; Marlon Stoeckius; Peter Smibert; Rahul Satija
Journal: Cell Date: 2019-06-06 Impact factor: 41.582

2. HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-Seq analysis.

Authors: Ivan V Kulakovskiy; Ilya E Vorontsov; Ivan S Yevshin; Ruslan N Sharipov; Alla D Fedorova; Eugene I Rumynskiy; Yulia A Medvedeva; Arturo Magana-Mora; Vladimir B Bajic; Dmitry A Papatsenko; Fedor A Kolpakov; Vsevolod J Makeev
Journal: Nucleic Acids Res Date: 2018-01-04 Impact factor: 16.971

3. IRIS3: integrated cell-type-specific regulon inference server from single-cell RNA-Seq.

Authors: Anjun Ma; Cankun Wang; Yuzhou Chang; Faith H Brennan; Adam McDermaid; Bingqiang Liu; Chi Zhang; Phillip G Popovich; Qin Ma
Journal: Nucleic Acids Res Date: 2020-07-02 Impact factor: 16.971

4. How many differentially expressed genes: A perspective from the comparison of genotypic and phenotypic distances.

Authors: Bi Zhao; Aqeela Erwin; Bin Xue
Journal: Genomics Date: 2017-08-24 Impact factor: 5.736

5. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update.

Authors: Maxim V Kuleshov; Matthew R Jones; Andrew D Rouillard; Nicolas F Fernandez; Qiaonan Duan; Zichen Wang; Simon Koplev; Sherry L Jenkins; Kathleen M Jagodnik; Alexander Lachmann; Michael G McDermott; Caroline D Monteiro; Gregory W Gundersen; Avi Ma'ayan
Journal: Nucleic Acids Res Date: 2016-05-03 Impact factor: 16.971

6. Single-cell transcriptomic analysis of Alzheimer's disease.

Authors: Hansruedi Mathys; Jose Davila-Velderrain; Zhuyu Peng; Fan Gao; Shahin Mohammadi; Jennie Z Young; Madhvi Menon; Liang He; Fatema Abdurrob; Xueqiao Jiang; Anthony J Martorell; Richard M Ransohoff; Brian P Hafler; David A Bennett; Manolis Kellis; Li-Huei Tsai
Journal: Nature Date: 2019-05-01 Impact factor: 49.962

7. A single-cell atlas of entorhinal cortex from individuals with Alzheimer's disease reveals cell-type-specific gene expression regulation.

Authors: Alexandra Grubman; Gabriel Chew; John F Ouyang; Guizhi Sun; Xin Yi Choo; Catriona McLean; Rebecca K Simmons; Sam Buckberry; Dulce B Vargas-Landin; Daniel Poppe; Jahnvi Pflueger; Ryan Lister; Owen J L Rackham; Enrico Petretto; Jose M Polo
Journal: Nat Neurosci Date: 2019-12 Impact factor: 24.884

8. Fast, sensitive and accurate integration of single-cell data with Harmony.

Authors: Ilya Korsunsky; Nghia Millard; Jean Fan; Kamil Slowikowski; Fan Zhang; Kevin Wei; Yuriy Baglaenko; Michael Brenner; Po-Ru Loh; Soumya Raychaudhuri
Journal: Nat Methods Date: 2019-11-18 Impact factor: 28.547

9. Single-cell multiomic analysis identifies regulatory programs in mixed-phenotype acute leukemia.

Authors: Jeffrey M Granja; Sandy Klemm; Lisa M McGinnis; Arwa S Kathiria; Anja Mezger; M Ryan Corces; Benjamin Parks; Eric Gars; Michaela Liedtke; Grace X Y Zheng; Howard Y Chang; Ravindra Majeti; William J Greenleaf
Journal: Nat Biotechnol Date: 2019-12-02 Impact factor: 54.908

10. Multiple sources of bias confound functional enrichment analysis of global -omics data.

Authors: James A Timmons; Krzysztof J Szkop; Iain J Gallagher
Journal: Genome Biol Date: 2015-09-07 Impact factor: 13.583
