Yanhui Hu1,2, Sudhir Gopal Tattikota1, Yifang Liu1,2, Aram Comjean1,2, Yue Gao1,2, Corey Forman1, Grace Kim1, Jonathan Rodiger1,2, Irene Papatheodorou3, Gilberto Dos Santos4, Stephanie E Mohr1,2, Norbert Perrimon1,2,5.
Abstract
With the advent of single-cell RNA sequencing (scRNA-seq) technologies, there has been a spike in studies involving scRNA-seq of several tissues across diverse species including Drosophila. Although a few databases exist for users to query genes of interest within the scRNA-seq studies, search tools that enable users to find orthologous genes and their cell type-specific expression patterns across species are limited. Here, we built a new search database, DRscDB (https://www.flyrnai.org/tools/single_cell/web/), to address this need. DRscDB serves as a comprehensive repository for published scRNA-seq datasets for Drosophila and relevant datasets from human and other model organisms. DRscDB is based on manual curation of Drosophila scRNA-seq studies of various tissue types and their corresponding analogous tissues in vertebrates including zebrafish, mouse, and human. Of note, our search database provides most of the literature-derived marker genes, thus preserving the original analysis of the published scRNA-seq datasets. Finally, DRscDB serves as a web-based user interface that allows users to mine gene expression data from scRNA-seq studies and perform cell cluster enrichment analyses pertaining to various scRNA-seq studies, both within and across species.
Keywords: Cross-species analysis; Data mining; Model organisms; single-cell RNA-seq
Year: 2021 PMID: 33995899 PMCID: PMC8085783 DOI: 10.1016/j.csbj.2021.04.021
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig. 1. Curation and processing of scRNA-seq datasets from the literature. DRscDB is built based on curation of published scRNA-seq literature. During the curation process, curators extract the information about experimental design, sample information, and marker genes from each publication, and organize the information in a standard template. Data wranglers retrieve the data files (cell expression matrix and metadata file) from GEO and calculate the expression statistics of each gene at the cluster level (Supplementary Fig. S1). Subsequently, data files and annotation files are processed by a software engineer for database upload.
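The cluster-level statistics mentioned in this caption can be illustrated with a minimal sketch. It assumes a sparse expression matrix stored as nested dictionaries and a cell-to-cluster mapping taken from the metadata file; the function name `cluster_stats`, the data layout, and the choice to average over all cells in a cluster (rather than only expressing cells) are assumptions for illustration, not the DRscDB pipeline itself.

```python
from collections import defaultdict


def cluster_stats(expression, cell_to_cluster):
    """Summarize each gene per cluster.

    expression: {gene: {cell_id: value}}; cells missing for a gene count as 0.
    cell_to_cluster: {cell_id: cluster_name} (hypothetical metadata layout).
    Returns {(gene, cluster): (percent_cells_expressing, mean_expression)}.
    """
    # Count how many cells belong to each cluster.
    cluster_sizes = defaultdict(int)
    for cluster in cell_to_cluster.values():
        cluster_sizes[cluster] += 1

    stats = {}
    for gene, per_cell in expression.items():
        n_expressing = defaultdict(int)
        total = defaultdict(float)
        for cell, value in per_cell.items():
            cluster = cell_to_cluster[cell]
            total[cluster] += value
            if value > 0:
                n_expressing[cluster] += 1
        for cluster, size in cluster_sizes.items():
            pct = 100.0 * n_expressing[cluster] / size
            mean = total[cluster] / size  # assumption: mean over all cells
            stats[(gene, cluster)] = (pct, mean)
    return stats


# Toy data with placeholder names: four cells in two clusters.
cell_to_cluster = {"c1": "A", "c2": "A", "c3": "B", "c4": "B"}
expression = {"geneX": {"c1": 2.0, "c3": 4.0, "c4": 2.0}}
stats = cluster_stats(expression, cell_to_cluster)
print(stats[("geneX", "A")])  # (50.0, 1.0)
print(stats[("geneX", "B")])  # (100.0, 3.0)
```

These two numbers per gene and cluster correspond to the percent of cells expressing the gene and the average expression level that the search results visualize.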
DRscDB coverage.
| Species | Tissue | # publications | # datasets | PMIDs |
|---|---|---|---|---|
| Drosophila | Immune | 4 | 12 | 32396065 |
| Drosophila | Ovary | 3 | 4 | 31919193 |
| Drosophila | Wing disc | 3 | 5 | 31363221 |
| Drosophila | Brain | 2 | 11 | 31746739 |
| Drosophila | Intestine | 2 | 2 | 31851941 |
| Drosophila | Embryo | 1 | 1 | 28860209 |
| Drosophila | Eye disc | 1 | 4 | 30479347 |
| Drosophila | Kidney | 1 | 1 | 32175841 |
| Drosophila | Testis | 1 | 1 | 31418408 |
| Human | Kidney | 5 | 14 | 29870722 |
| Human | Brain | 2 | 3 | 31303374 |
| Human | Immune | 2 | 2 | 28428369 |
| Human | Intestine | 2 | 5 | 31753849 |
| Human | Testis | 2 | 5 | 30726734 |
| Mosquito | Immune | 1 | 2 | 32855340 |
| Mouse | Kidney | 4 | 4 | 31689386 |
| Mouse | Testis | 2 | 2 | 31237565 |
| Mouse | Brain | 1 | 1 | 29545511 |
| Mouse | Embryo | 1 | 1 | 30840884 |
| Mouse | Immune | 1 | 1 | 27365425 |
| Mouse | Intestine | 1 | 1 | 29144463 |
| Zebrafish | Brain | 3 | 6 | 29608178 |
| Zebrafish | Immune/Kidney | 1 | 1 | 28878000 |
| Zebrafish | Intestine | 1 | 1 | 32092251 |
Fig. 2. Use of DRscDB for data mining. At the DRscDB search page, a user can enter a gene of interest with or without specifying the tissue of interest, and results are summarized in a table format listing the number of datasets expressing the gene of interest as well as the orthologous genes. Next, the user can find more detailed information such as the relevant clusters expressing the gene of interest. The statistics about the percent of cells expressing the gene, as well as the average expression level, can be visualized by dot plot, bar graph, or heatmap. If a gene is identified as one of the marker genes for any of the clusters, the statistics of fold enrichment as well as P value are also displayed by bar graph.
Fig. 3. Use of DRscDB for enrichment analysis. At the DRscDB enrichment analysis page, a user can input a list of genes and find the clusters for which the input genes are significantly enriched among the top 100 marker genes. In addition, at this page, a user can enter multiple gene lists and compare each input gene list (for example, 15 lists) with every cluster of a selected study (for example, 9 clusters). The enrichment results are visualized as a heatmap, in this example a 9x15 matrix, with columns representing the input gene lists and rows representing the clusters of the selected study. The darkness of the color represents similarity (-log10 P value or fold enrichment).
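The enrichment analysis described in this caption can be sketched as follows. The source does not state which statistical test DRscDB uses; the one-sided hypergeometric test below is an assumption, as are the function names, the placeholder gene names, and the background size.

```python
from math import comb, log10


def hypergeom_enrichment(input_genes, marker_genes, background_size):
    """One-sided hypergeometric test (assumed, not confirmed by the source).

    Returns (overlap, fold_enrichment, p_value) for an input gene list
    against one cluster's marker genes, given the number of background genes.
    """
    input_set, marker_set = set(input_genes), set(marker_genes)
    k = len(input_set & marker_set)  # observed overlap
    n, K, N = len(input_set), len(marker_set), background_size
    # P(X >= k): chance of drawing at least k markers in n draws from N genes.
    p = sum(comb(K, i) * comb(N - K, n - i)
            for i in range(k, min(n, K) + 1)) / comb(N, n)
    expected = n * K / N
    fold = k / expected if expected > 0 else 0.0
    return k, fold, p


def enrichment_matrix(gene_lists, cluster_markers, background_size):
    """Rows are clusters, columns are input lists; values are -log10(P)."""
    return {
        cluster: {
            name: -log10(hypergeom_enrichment(genes, markers,
                                              background_size)[2])
            for name, genes in gene_lists.items()
        }
        for cluster, markers in cluster_markers.items()
    }


# Toy example with placeholder gene names and a background of 20 genes.
markers = ["g1", "g2", "g3", "g4", "g5"]
query = ["g1", "g2", "g3", "g9"]
overlap, fold, p = hypergeom_enrichment(query, markers, background_size=20)
print(overlap, fold, round(p, 4))  # 3 3.0 0.032

matrix = enrichment_matrix({"list1": query}, {"cluster0": markers}, 20)
```

With 15 input lists and 9 clusters, `enrichment_matrix` would produce the 9x15 grid of scores that the heatmap displays.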
Fig. 4. Unsupervised hierarchical clustering of enrichment results comparing the top 100 marker genes per cluster from [6] with 2 other published immune datasets [10], [11]. Similar cell types from these three immune datasets tend to cluster together, which suggests that DRscDB can be used to assign cell types in newly generated datasets.
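How clusters with similar enrichment profiles end up grouped together can be shown with a small agglomerative clustering sketch. The source does not specify the linkage method or distance metric; average linkage with Euclidean distance is assumed here, and the cluster labels and scores are made-up illustration values, not results from the cited studies.

```python
from math import dist


def average_linkage(vectors, n_groups):
    """Merge the two closest groups until n_groups remain.

    vectors: {label: tuple of enrichment scores, e.g. -log10(P) values}.
    Closeness is the mean pairwise Euclidean distance between group members.
    """
    groups = [[label] for label in vectors]

    def group_distance(a, b):
        pairs = [(x, y) for x in a for y in b]
        return sum(dist(vectors[x], vectors[y]) for x, y in pairs) / len(pairs)

    while len(groups) > n_groups:
        # Find the pair of groups with the smallest average distance.
        i, j = min(
            ((i, j) for i in range(len(groups))
             for j in range(i + 1, len(groups))),
            key=lambda ij: group_distance(groups[ij[0]], groups[ij[1]]),
        )
        merged = groups[i] + groups[j]
        groups = [g for idx, g in enumerate(groups) if idx not in (i, j)]
        groups.append(merged)
    return sorted(sorted(g) for g in groups)


# Hypothetical enrichment profiles against three reference clusters.
profiles = {
    "studyA_plasmatocyte": (9.0, 0.5, 0.2),
    "studyB_plasmatocyte": (8.0, 0.7, 0.1),
    "studyA_crystal_cell": (0.3, 7.5, 0.4),
    "studyB_crystal_cell": (0.2, 9.0, 0.6),
}
groups = average_linkage(profiles, n_groups=2)
print(groups)
```

In this toy input, clusters annotated as the same cell type in different studies have similar profiles and are merged first, which mirrors the pattern the figure reports.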
Fig. 5. DRscDB facilitates comparison of cell clusters across datasets and species. A. Comparison of the top 10 marker genes per cluster derived from Drosophila [6] or mosquito [17] blood scRNA-seq datasets. B. Comparison of the top 20 marker genes per cluster from the Drosophila gut study by Hung et al., 2020 [16] with published human intestinal cell clusters [18].
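A cross-species comparison of this kind requires translating marker genes through an ortholog mapping before lists can be overlapped. The sketch below shows one way this could work; the function names, the ortholog table layout, and all gene names are hypothetical placeholders rather than DRscDB's actual data or method.

```python
def map_to_orthologs(genes, ortholog_table):
    """Translate genes to the target species; unmapped genes are dropped."""
    mapped = set()
    for gene in genes:
        mapped.update(ortholog_table.get(gene, ()))
    return mapped


def shared_markers(source_markers, target_markers, ortholog_table, top_n):
    """Overlap between the top_n source markers (after ortholog mapping)
    and the top_n target markers. Marker lists are assumed to be ranked."""
    mapped = map_to_orthologs(source_markers[:top_n], ortholog_table)
    return sorted(mapped & set(target_markers[:top_n]))


# Placeholder gene names; one fly gene may map to several human genes.
ortholog_table = {"flyA": ["HUMAN1", "HUMAN2"], "flyB": ["HUMAN3"]}
fly_markers = ["flyA", "flyB", "flyC"]
human_markers = ["HUMAN2", "HUMAN4", "HUMAN3"]

print(shared_markers(fly_markers, human_markers, ortholog_table, top_n=2))
# ['HUMAN2']
print(shared_markers(fly_markers, human_markers, ortholog_table, top_n=3))
# ['HUMAN2', 'HUMAN3']
```

The size of the shared set, computed for every pair of clusters, could then feed an enrichment score for each cell of the comparison grid.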