Gene Expression Nebulas (GEN): a comprehensive data portal integrating transcriptomic profiles across multiple species at both bulk and single-cell levels.

Yuansheng Zhang, Dong Zou, Tongtong Zhu, Tianyi Xu, Ming Chen, Guangyi Niu, Wenting Zong, Rong Pan, Wei Jing, Jian Sang, Chang Liu, Yujia Xiong, Yubin Sun, Shuang Zhai, Huanxin Chen, Wenming Zhao, Jingfa Xiao, Yiming Bao, Lili Hao, Zhang Zhang.

Abstract

Transcriptomic profiling is critical to uncovering functional elements from transcriptional and post-transcriptional aspects. Here, we present Gene Expression Nebulas (GEN, https://ngdc.cncb.ac.cn/gen/), an open-access data portal integrating transcriptomic profiles under various biological contexts. GEN features a curated collection of high-quality bulk and single-cell RNA sequencing datasets by using standardized data processing pipelines and a structured curation model. Currently, GEN houses a large number of gene expression profiles from 323 datasets (157 bulk and 166 single-cell), covering 50 500 samples and 15 540 169 cells across 30 species, which are further categorized into six biological contexts. Moreover, GEN integrates a full range of transcriptomic profiles on expression, RNA editing and alternative splicing for 10 bulk datasets, providing opportunities for users to conduct integrative analysis at both transcriptional and post-transcriptional levels. In addition, GEN provides abundant gene annotations based on value-added curation of transcriptomic profiles and delivers online services for data analysis and visualization. Collectively, GEN presents a comprehensive collection of transcriptomic profiles across multiple species, thus serving as a fundamental resource for better understanding genetic regulatory architecture and functional mechanisms from tissues to cells.
© The Author(s) 2021. Published by Oxford University Press on behalf of Nucleic Acids Research.

Year:  2022        PMID: 34591957      PMCID: PMC8728231          DOI: 10.1093/nar/gkab878

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Transcriptomic profiling, which captures both transcriptional and post-transcriptional modifications or events at the whole-genome level, is of great importance for uncovering functional elements across the three domains of life, namely ‘Bacteria’, ‘Archaea’ and ‘Eukarya’ (1–3). High-throughput RNA sequencing (RNA-seq) (4), which can qualitatively and quantitatively capture any type of RNA, helps researchers characterize the transcriptome comprehensively through whole-genome expression profiling (5–7), detection of novel RNA forms and variants (8–12) and genome reannotation (13,14). With continuous technological development, RNA-seq has become a routine and indispensable approach for systematically characterizing transcriptomes across diverse developmental stages and physiological conditions (1,10,15–17). Of note, in recent years transcriptomic studies have made the leap from bulk RNA-seq to single-cell RNA-seq (scRNA-seq), yielding new insights into cell type classification and cellular heterogeneity (18,19). As RNA-seq has been applied to a broad diversity of species worldwide, transcriptomic data have accumulated at an unprecedented, exponential rate, posing great challenges for large-scale data aggregation and standardized processing. To facilitate more effective reuse, integration and mining of these data, valuable efforts have been made to construct comprehensive or specialized database resources, such as Gene Expression Omnibus (GEO) (20), Expression Atlas (21), Human Cell Atlas (HCA) (22) and Genotype-Tissue Expression (GTEx) (23). Specifically, GEO, a widely used resource developed by NCBI (24), is devoted to archiving worldwide transcriptomic data (as well as other omics data), but it does not provide standardized data processing or structured metadata management.
Expression Atlas at EBI (25) contains both bulk and single-cell expression profiles with unified processing, but lacks co/post-transcriptional events (e.g. RNA editing and splicing). HCA is specialized in human single-cell expression profiling, whereas GTEx focuses on human gene expression and regulation across tissues. To sum up, existing resources have two major shortcomings. First, none of them covers the full range of transcriptomic profiles (e.g. expression, RNA editing and splicing). Second, they do not systematically curate and categorize experimental metadata under the framework of biological contexts. Given the large data volumes and the heterogeneous types of data and metadata, it is challenging to build a comprehensive database that integrates transcriptomic profiles at both bulk and single-cell levels and is accompanied by standardized data processing, metadata curation and online tools. To address these challenges, here we present Gene Expression Nebulas (GEN, https://ngdc.cncb.ac.cn/gen/), an open-access data portal integrating transcriptomic profiles under various conditions across multiple species. It was originally established in 2016, along with the foundation of the National Genomics Data Center (NGDC; previously named BIG Data Center) (26,27), China National Center for Bioinformation (CNCB). Since its inception, GEN, as one of the core resources in CNCB-NGDC, has been frequently updated by importing and processing datasets obtained from a variety of raw sequencing data archives. Unlike existing resources, GEN provides a curated collection of high-quality bulk and single-cell RNA-seq datasets with uniform data processing and adopts a structured curation model to categorize diverse experimental conditions into different biological contexts. Accordingly, GEN features large-scale integration of diverse transcriptomic profiles and provides online tools for analysis and visualization of both bulk and single-cell RNA-seq data.

MATERIALS AND METHODS

Data collection

A number of high-throughput RNA-seq projects and their associated datasets were collected from several public raw sequencing databases, including Genome Sequence Archive (GSA, https://ngdc.cncb.ac.cn/gsa/) (28,29), Sequence Read Archive (SRA, https://www.ncbi.nlm.nih.gov/sra/) (30), European Nucleotide Archive (ENA, https://www.ebi.ac.uk/ena) (31) and DDBJ Sequence Read Archive (DRA, https://ddbj.nig.ac.jp/DRASearch/) (32). Only the datasets with median mapping rates ≥70% for bulk RNA-seq and ≥40% for scRNA-seq were kept for further processing. As a result, a total of 296 RNA-seq projects and 323 high-quality datasets were obtained.
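The quality filter described above can be sketched as follows. This is a minimal illustration only: the thresholds (≥70% for bulk, ≥40% for scRNA-seq) come from the text, while the function name, the dataset names and the per-sample mapping rates are hypothetical and do not reflect GEN's actual implementation.

```python
from statistics import median

# Thresholds stated in the text: median mapping rate >= 70% (bulk)
# and >= 40% (scRNA-seq).
MIN_MEDIAN_MAPPING_RATE = {"bulk": 70.0, "single-cell": 40.0}


def passes_quality_filter(dataset_type, sample_mapping_rates):
    """Return True if the dataset's median mapping rate meets its threshold."""
    if not sample_mapping_rates:
        return False  # no samples, nothing to assess
    threshold = MIN_MEDIAN_MAPPING_RATE[dataset_type]
    return median(sample_mapping_rates) >= threshold


# Hypothetical datasets; names and per-sample rates (in %) are made up.
candidates = {
    "bulk_A": ("bulk", [92.1, 88.4, 65.0]),
    "bulk_B": ("bulk", [71.0, 60.2, 55.9]),
    "sc_C": ("single-cell", [45.3, 38.0, 52.7]),
}

kept = [
    name
    for name, (kind, rates) in candidates.items()
    if passes_quality_filter(kind, rates)
]
print(kept)
```

Because the criterion uses the median rather than the minimum, a dataset with a few poorly mapped samples can still be retained as long as most of its samples map well.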

Unified and standardized data processing

For bulk RNA-seq datasets, Fastp v0.20.0 (33) was used for trimming and filtering raw reads. HISAT2 v2.0.5 (34) was then used for quick alignment to evaluate data quality, and RSeQC v2.6.4 (35) was used to infer the strand specificity of the sequencing library. High-quality RNA-seq reads were aligned to the reference genome by STAR v2.7.1a (36), after which gene/isoform quantification was performed by RSEM v1.3.1 (37) with default parameters. ‘Raw counts’, ‘FPKM’ (Fragments Per Kilobase of transcript per Million mapped fragments) and ‘TPM’ (Transcripts Per Million) values of each gene/isoform were calculated.

For circular RNA (circRNA) expression analysis, the cleaned RNA-seq reads were mapped to the reference genome by BWA-MEM (38). Next, CIRCexplorer2 (39) and CIRI2 v2.0.6 (40) were used with default parameters to identify circRNA candidates supported by at least two back-splicing junction (BSJ) reads.

RNA editing sites were identified with the genome from GENCODE v33 (41) as reference. All known RNA editing sites were retrieved from REDIportal v2.0 (42) (http://srv00.recas.ba.infn.it/atlas/). Novel human RNA editing sites were detected by the parallel strategy of REDItools 2.0 (43). Because non-Alu regions usually harbor only sporadic editing sites, an additional filtration step with stricter criteria was applied to these regions to obtain more accurate novel editing sites. Meanwhile, pblat v1.0 (44) was used to discover mismatched and multi-mapping RNA-seq reads, which were then trimmed to remove duplicate reads using SAMtools v1.9 (45). Both A-to-I and C-to-U editing sites were retained for further analysis. RepeatMasker (http://www.repeatmasker.org) and SNP files used for annotating high-confidence novel RNA editing sites were both downloaded from UCSC (https://hgdownload.soe.ucsc.edu/downloads.html).

For alternative splicing analysis, high-quality RNA-seq reads were mapped to the reference genome by STAR, and differentially spliced events were then detected from the resulting BAM files with rMATS v3.1.0 (46). Each ‘case’ group was compared to the ‘control’ group, and the parameter ‘--cstat 0.0001’ (a 0.01% difference cutoff) was used to compute P-values and FDRs for splicing events whose absolute difference in exon inclusion level (|Δψ|) exceeds 0.01%.

For scRNA-seq datasets, the alignment approach was consistent with that for bulk RNA-seq datasets, while gene quantification tools varied with the platform/strategy that generated the data, in order to deal with cell barcodes and unique molecular identifiers (UMIs). Currently, the pipelines for the three most commonly adopted scRNA-seq technologies are as follows (47–49): (i) for data generated by plate- or Fluidigm-based protocols, such as Smart-seq2 (50) and SMARTer (Fluidigm C1) strategies, STAR v2.7.1a and RSEM v1.3.1 were used to align reads and calculate ‘raw counts’, ‘FPKM’ and ‘TPM’ values of each gene/isoform with the parameter ‘--single-cell-prior’; (ii) for data from droplet-based protocols including Drop-seq (51) and inDrop (52), dropEst v0.8.6 (53) was used to provide more accurate estimates of molecular counts in individual cells through barcode correction, classification of cell quality and diagnostic information about the droplet libraries; and (iii) specifically for data from the 10× Genomics platform (54), CellRanger v3.1.0 (https://support.10xgenomics.com/single-cell-gene-expression/software/overview/welcome) was implemented as a one-stop analysis pipeline for quality control, sample de-multiplexing, barcode processing and generation of feature-barcode matrices.
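To make the relationship between the three expression units mentioned in the bulk quantification step concrete, the following sketch computes FPKM and TPM from raw counts using their textbook definitions. It is an illustration under simplifying assumptions, not RSEM's estimator: RSEM works with effective transcript lengths and probabilistically assigned fragments, whereas this sketch uses plain lengths and integer counts, and the example values are invented.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no mapped fragments")
    # count / (length in kb) / (total fragments in millions)
    return [c * 1e9 / (length * total) for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalized rates rescaled to sum to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    denom = sum(rates)
    if denom == 0:
        raise ValueError("no mapped fragments")
    return [r / denom * 1e6 for r in rates]


# Made-up example: three genes with different lengths and counts.
counts = [100, 200, 700]
lengths_bp = [1000, 2000, 3500]

fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)

for name, f, t in zip(["gene1", "gene2", "gene3"], fpkm_values, tpm_values):
    print(f"{name}: FPKM={f:.1f} TPM={t:.1f}")
```

TPM values always sum to one million within a sample, which makes them easier to compare across samples than FPKM, whose per-sample sum varies.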

Collection of gene annotations

For all collected species, a wide range of gene functional annotations were extracted from Ensembl (55) and NCBI (24), falling roughly into basic information (genomic location and functional description) and associated terms or ontologies such as Gene Ontology (GO) (56). For Homo sapiens in particular, housekeeping and tissue-specific genes were derived from GTEx (57), genes were annotated based on Disease Ontology (DO) (58) along with GO, and a gene structure visualization based on Genome Browser (59) was provided. Furthermore, editome-disease associations from Editome-Disease Knowledgebase (EDK, https://ngdc.cncb.ac.cn/edk) (60) and RT-qPCR reference genes from Internal Control Genes (ICG, http://icg.big.ac.cn) (61) were included for corresponding genes, and external links to GTEx (https://www.gtexportal.org/home/) (23), REDIportal (http://srv00.recas.ba.infn.it/atlas/) (62) and GeneCards (https://www.genecards.org) (63) were added to each gene (if available).

Downstream analysis

A series of popular downstream analysis tools were implemented in GEN. For bulk RNA-seq data, four tools were included for different analysis purposes, namely, differential expression analysis with limma (64), weighted gene co-expression network analysis with WGCNA (65), functional enrichment analysis with clusterProfiler (66), and gene regulatory network inference with GENIE3 (67). For scRNA-seq data, Seurat (68) was integrated for the selection and filtration of cells based on quality-control metrics, data normalization and scaling, detection of high-variance genes, linear dimensional reduction (i.e. principal component analysis), graph-based clustering, visualization of cluster assignment and identification of cluster markers. Marker gene enrichment analysis was performed with Enrichr (69), and trajectory inference was performed with Monocle (70). Furthermore, SingleR (71) was employed to infer the cell type identity of each cell independently by leveraging reference transcriptomic datasets of pure cell types. Here, reference datasets from Human Primary Cell Atlas (72), BLUEPRINT (73), Human Immune Cell RNA-seq Data (74), Human Hematopoietic Cell RNA-seq Data (75) and the DICE (Database of Immune Cell Expression, Expression quantitative trait loci (eQTLs) and Epigenomics) Project (76) were used for human cell type annotation, while those from Mouse RNA-seq Data (77) and the Immunological Genome Project (ImmGen) (78) were used for mouse cell type annotation.
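As a rough illustration of two of the scRNA-seq steps listed above (data normalization and detection of high-variance genes), the sketch below log-normalizes per-cell counts with a scale factor of 10 000, which mirrors Seurat's default ‘LogNormalize’ behaviour, and then ranks genes by plain variance. The variance ranking is a deliberate simplification and is not Seurat's variable-feature method; function names and the toy count matrix are hypothetical.

```python
import math
from statistics import pvariance


def log_normalize(cell_counts, scale_factor=10_000):
    """Scale a cell's counts to a common total, then apply log(1 + x)."""
    total = sum(cell_counts)
    if total == 0:
        return [0.0] * len(cell_counts)
    return [math.log1p(c / total * scale_factor) for c in cell_counts]


def top_variable_genes(normalized, gene_names, n_top):
    """Rank genes by variance across cells (simplified stand-in for HVG detection)."""
    variances = {
        gene: pvariance([cell[i] for cell in normalized])
        for i, gene in enumerate(gene_names)
    }
    return sorted(gene_names, key=lambda g: variances[g], reverse=True)[:n_top]


# Toy matrix: rows are cells, columns are genes (values are made up).
gene_names = ["geneA", "geneB", "geneC"]
raw_counts = [
    [10, 0, 5],
    [0, 10, 5],
    [10, 0, 5],
    [1, 9, 5],
]

normalized = [log_normalize(cell) for cell in raw_counts]
selected = top_variable_genes(normalized, gene_names, n_top=2)
print(selected)
```

In this toy matrix geneC has the same share of counts in every cell, so its normalized value is constant and it is never selected; only genes that vary between cells carry information for clustering.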

Database implementation

GEN was implemented using Spring Boot (https://spring.io/projects/spring-boot; a framework that eases the creation of stand-alone Java applications) as the back-end framework. All data were stored and managed using MySQL (https://dev.mysql.com; a free and popular relational database management system). To provide user-friendly and highly interactive web applications, web pages were constructed using HTML5 and rendered using JSP (https://jakarta.ee/specifications/pages/3.0/; Jakarta Server Pages, a template engine for web applications). Front-end interfaces were built using Semantic UI (https://semantic-ui.com; a development framework that helps create beautiful, responsive HTML layouts) and jQuery (https://jquery.com; a fast, small and feature-rich JavaScript library). Furthermore, data visualization was built with Highcharts (https://www.highcharts.com; a JavaScript plug-in to create interactive charts), Plotly.js (https://plotly.com/javascript/; a high-level, declarative charting library) and DataTables (https://datatables.net; a plug-in for the jQuery JavaScript library to render HTML tables). Interactive visualization of scRNA-seq data was powered by Cerebro (79). Online tools were developed with Shiny (https://shiny.rstudio.com/; an R package to build interactive web applications).

DATABASE CONTENTS AND USAGE

GEN features comprehensive integration, manual curation and standardized analysis of high-quality transcriptomic datasets at bulk and single-cell levels based on a structured curation model and uniform data processing pipelines. More importantly, the diverse experimental conditions of all incorporated datasets are categorized into more informative biological contexts. In the current version, GEN houses a collection of transcriptomic profiles of 323 datasets covering 50 500 samples and 15 540 169 cells across 30 species. For each dataset, a full range of transcriptomic profiles including gene expression, alternative RNA splicing and RNA editing (if applicable) are provided in GEN. Moreover, GEN accommodates value-added gene annotations based on differential expression analysis across diverse experimental conditions and cell clusters. Accordingly, GEN provides user-friendly web functionalities and applications for large-scale data query, retrieval, analysis and visualization (Figure 1).
Figure 1.

Database contents and features of Gene Expression Nebulas. Abbreviations used: SC, single-cell; TS, tissue-specific; HS, house-keeping; FPKM, fragments per kilobase of transcript per million mapped fragments; TPM, transcripts per million. SE: skipped exon; A3SS: alternative 3′ splice site; A5SS: alternative 5′ splice site; MXE: mutually exclusive exons; RI: retained intron.

Metadata curation and datasets

GEN adopts a structured curation model, incorporating manually curated items at the levels of dataset, profile (expression/splicing/editing) and sample: (i) datasets are annotated and categorized into six biological contexts of general interest, namely baseline, genetic (e.g. mutation, natural variation), phenotypic (e.g. disease, aging), environmental (e.g. abiotic stress, biotic stress), spatial (e.g. organism, tissue, cell type) and temporal (e.g. development, circadian, time series); (ii) expression/splicing/editing profiles include the main steps and parameters of data processing together with reference genome and annotation details; and (iii) samples contain a wealth of descriptive information, including basic information, sample characteristic, biological condition, experimental variable, experimental protocol, sequencing strategy and platform, quality assessment and data analysis procedure (reference genome, annotation file, software and parameter setting). All descriptive terms with controlled vocabularies are extracted and abstracted by manual curation of 293 published articles. In particular, diseases, tissues and cell types are further linked to controlled terms from Disease Ontology (DO, https://disease-ontology.org) and BRENDA Tissue Ontology (BTO, http://www.ontobee.org/ontology/bto). More details about the curation model are publicly available at https://ngdc.cncb.ac.cn/gen/documentation. Specifically, for each dataset, GEN provides a curated summary of metadata, covering species, tissue, health condition, RNA type, sample number, sequencing strategy, sequencing quality and quantity, and experimental condition (https://ngdc.cncb.ac.cn/gen/browse/datasets, Figure 2A). To manage all collected datasets, GEN assigns an accession number prefixed with ‘GEND’ to each dataset.
Moreover, since each dataset is associated with specific sample(s) (prefixed with ‘GENS’), manual curation is conducted for all datasets by linking to controlled terms from DO and BTO via sample meta-information. As a result, the datasets incorporated in GEN cover 128 tissues and 46 cell types (originally curated from metadata provided by submitters). Based on these curated metadata, users can conveniently find the dataset(s) of interest. Structured metadata for all collected datasets are provided in tabular form and are freely downloadable (https://ngdc.cncb.ac.cn/gen/download). Overall, bulk RNA-seq and scRNA-seq datasets involve 30 and 22 species, and 89 and 64 tissues, respectively (Table 1). Regarding the specific biological contexts, GEN incorporates 153 baseline datasets, 323 spatial datasets, 83 temporal datasets, 58 environmental datasets, 55 genetic datasets and 148 phenotypic datasets, involving 84 diseases such as autism, cancer, diabetes and systemic lupus erythematosus (https://ngdc.cncb.ac.cn/gen/browse/datasets). Not surprisingly, Homo sapiens has the most abundant data, with 192 datasets, 29 942 samples, 70 tissues and 84 diseases corresponding to 11 body systems (cardiovascular, endocrine, gastrointestinal, hematopoietic, immune, integumentary, musculoskeletal, nervous, respiratory, reproductive and urinary). More statistics on the datasets and samples housed in GEN are summarized and publicly accessible on the statistics page (https://ngdc.cncb.ac.cn/gen/statistics).
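The curation model described above can be pictured as a small data structure. The sketch below is only an illustration: the six context names and the ‘GEND’/‘GENS’ prefixes come from the text, while the class and field names, the numeric part of the example accessions and the example records are hypothetical. Since the per-context dataset counts reported above add up to more than 323, a dataset is modelled as belonging to a set of contexts rather than exactly one.

```python
from dataclasses import dataclass, field
from enum import Enum


class BiologicalContext(Enum):
    BASELINE = "baseline"
    GENETIC = "genetic"
    PHENOTYPIC = "phenotypic"
    ENVIRONMENTAL = "environmental"
    SPATIAL = "spatial"
    TEMPORAL = "temporal"


@dataclass
class Dataset:
    accession: str   # dataset accession, prefixed with 'GEND'
    species: str
    contexts: set    # one dataset may fall into several contexts
    sample_accessions: list = field(default_factory=list)  # prefixed with 'GENS'

    def __post_init__(self):
        # Only the prefixes are stated in the text; the rest of the
        # accession format is not assumed here.
        if not self.accession.startswith("GEND"):
            raise ValueError(f"dataset accession must start with 'GEND': {self.accession}")
        for sample in self.sample_accessions:
            if not sample.startswith("GENS"):
                raise ValueError(f"sample accession must start with 'GENS': {sample}")


def find_by_context(datasets, context):
    """Return accessions of datasets annotated with the given context."""
    return [d.accession for d in datasets if context in d.contexts]


# Hypothetical records; accession numbers are invented for illustration.
datasets = [
    Dataset("GEND000001", "Homo sapiens",
            {BiologicalContext.SPATIAL, BiologicalContext.PHENOTYPIC},
            ["GENS000001", "GENS000002"]),
    Dataset("GEND000002", "Oryza sativa",
            {BiologicalContext.SPATIAL, BiologicalContext.ENVIRONMENTAL}),
]

print(find_by_context(datasets, BiologicalContext.PHENOTYPIC))
print(find_by_context(datasets, BiologicalContext.SPATIAL))
```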
Figure 2.

Screenshots of database web interfaces. (A) Curated meta-information of a dataset, including sequencing strategy, tissue, cell type, disease, biological context, quality and quantity, etc. (B) Boxplot of expression levels of multiple genes of interest across samples. (C) Heatmap of differentially expressed genes for bulk RNA-seq datasets. (D) Clustering results of a single-cell RNA-seq dataset on a 3D UMAP plot where cells are color-coded by cluster.

Table 1.

Data statistics in Gene Expression Nebulas (as of August 2021)

Kingdom | Species | #Datasets (bulk/single-cell) | #Samples / #Tissues / #Cells
Animalia | Homo sapiens | 192 (68/124) | 29 942 / 70 / 6 823 695
Animalia | Mus musculus | 11 (3/8) | 91471 176 003
Animalia | Drosophila melanogaster | 7 (1/6) | 14 80043 837 235
Animalia | Gallus gallus | 4 (1/3) | 329742 129
Animalia | Macaca mulatta | 4 (1/3) | 3261304
Animalia | Rattus norvegicus | 4 (2/2) | 1342122
Animalia | Capra hircus | 3 (1/2) | 86359
Animalia | Danio rerio | 3 (1/2) | 367928 773
Animalia | Bos taurus | 2 (1/1) | 1423100
Animalia | Caenorhabditis elegans | 2 (1/1) | 122130 713
Animalia | Canis lupus familiaris | 2 (1/1) | 307657 999
Animalia | Macaca fascicularis | 2 (1/1) | 20422 737
Animalia | Ovis aries | 2 (1/1) | 21811 380
Animalia | Oryctolagus cuniculus | 2 (1/1) | 32132
Animalia | Schistosoma mansoni | 2 (1/1) | 15255 930
Animalia | Sus scrofa | 2 (1/1) | 32132
Animalia | Xenopus tropicalis | 2 (1/1) | 11522 520 906
Plantae | Oryza sativa | 32 (31/1) | 10871427
Plantae | Glycine max | 16 (16/0) | 4998-
Plantae | Arabidopsis thaliana | 8 (5/3) | 2427220 188
Plantae | Sorghum bicolor | 5 (5/0) | 4627-
Plantae | Triticum aestivum | 3 (3/0) | 786-
Plantae | Glycine soja | 2 (2/0) | 346-
Plantae | Zea mays | 2 (2/0) | 4801-
Plantae | Brassica napus | 1 (1/0) | 446-
Plantae | Gossypium hirsutum | 1 (1/0) | 141-
Plantae | Solanum lycopersicum | 1 (1/0) | 61-
Protista | Plasmodium falciparum | 2 (1/1) | 2080180
Protista | Dictyostelium discoideum | 2 (1/1) | 1204988
Fungi | Saccharomyces cerevisiae | 2 (1/1) | 1706637
Total | 30 | 323 (157/166) | 50 500 / 128 / 15 540 169
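The dataset counts in Table 1 can be cross-checked against the totals quoted in the text. The sketch below transcribes only the bulk/single-cell dataset counts per species from the table and verifies that they add up to 323 datasets (157 bulk, 166 single-cell) across 30 species, with 22 species having single-cell data. The tuple layout and variable names are illustrative choices, not part of GEN.

```python
from collections import Counter

# (kingdom, species, bulk datasets, single-cell datasets), transcribed from Table 1.
ROWS = [
    ("Animalia", "Homo sapiens", 68, 124),
    ("Animalia", "Mus musculus", 3, 8),
    ("Animalia", "Drosophila melanogaster", 1, 6),
    ("Animalia", "Gallus gallus", 1, 3),
    ("Animalia", "Macaca mulatta", 1, 3),
    ("Animalia", "Rattus norvegicus", 2, 2),
    ("Animalia", "Capra hircus", 1, 2),
    ("Animalia", "Danio rerio", 1, 2),
    ("Animalia", "Bos taurus", 1, 1),
    ("Animalia", "Caenorhabditis elegans", 1, 1),
    ("Animalia", "Canis lupus familiaris", 1, 1),
    ("Animalia", "Macaca fascicularis", 1, 1),
    ("Animalia", "Ovis aries", 1, 1),
    ("Animalia", "Oryctolagus cuniculus", 1, 1),
    ("Animalia", "Schistosoma mansoni", 1, 1),
    ("Animalia", "Sus scrofa", 1, 1),
    ("Animalia", "Xenopus tropicalis", 1, 1),
    ("Plantae", "Oryza sativa", 31, 1),
    ("Plantae", "Glycine max", 16, 0),
    ("Plantae", "Arabidopsis thaliana", 5, 3),
    ("Plantae", "Sorghum bicolor", 5, 0),
    ("Plantae", "Triticum aestivum", 3, 0),
    ("Plantae", "Glycine soja", 2, 0),
    ("Plantae", "Zea mays", 2, 0),
    ("Plantae", "Brassica napus", 1, 0),
    ("Plantae", "Gossypium hirsutum", 1, 0),
    ("Plantae", "Solanum lycopersicum", 1, 0),
    ("Protista", "Plasmodium falciparum", 1, 1),
    ("Protista", "Dictyostelium discoideum", 1, 1),
    ("Fungi", "Saccharomyces cerevisiae", 1, 1),
]

bulk_total = sum(bulk for _, _, bulk, _ in ROWS)
sc_total = sum(sc for _, _, _, sc in ROWS)
species_per_kingdom = Counter(kingdom for kingdom, _, _, _ in ROWS)
sc_species = sum(1 for _, _, _, sc in ROWS if sc > 0)

print(f"species: {len(ROWS)}, datasets: {bulk_total + sc_total} "
      f"({bulk_total} bulk / {sc_total} single-cell)")
print(dict(species_per_kingdom))
print(f"species with single-cell data: {sc_species}")
```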

Transcriptomic profiles at bulk and single-cell levels

GEN provides a full range of transcriptomic profiles characterizing both transcriptional and post-transcriptional regulations. For all collected datasets in GEN, expression profiles are quantified on both gene and transcript levels by three types of quantification methods, namely, raw read count number, FPKM and TPM. At the bulk level, GEN currently integrates gene expression profiles of 7412 samples from 157 datasets, involving 17 animals, 10 plants, 2 protists and 1 fungus, including Homo sapiens and model organisms such as Arabidopsis thaliana, Danio rerio, Drosophila melanogaster and Mus musculus (Table 1). Gene expression profiles can be visualized in heatmap/boxplot charts (Figure 2B). Moreover, GEN incorporates circRNA expression profiles of 456 samples from 10 human datasets. Based on the expression profiles, differentially expressed genes (DEGs) are identified between biological condition groups, which can be accessed in tabular form and visualized in heatmap charts (Figure 2C). In addition, GEN integrates a valuable collection of RNA editing events and alternative RNA splicing isoforms in 10 datasets with 574 human samples (involving 18 tissues and 16 diseases) as value-added profiles on co/post-transcriptional levels. At the single-cell level, GEN provides high-quality expression profiles of 15 540 169 cells from 166 datasets covering 22 species (17 animals, 2 plants, 2 protists and 1 fungus), 64 tissues and 42 human diseases (Table 1). To reveal biological functions underlying expression profiles, further downstream analyses including cell clustering, identification of marker genes for each cluster and functional enrichment are performed. To facilitate easy access to cell clustering results for each dataset/sample, GEN is capable of visualizing the clustered cells using t-SNE and UMAP plots, which can be color-coded according to metadata information, cell clusters and inferred cell types (Figure 2D). 
Notably, in the current implementation, GEN presents cell type annotations for 121 datasets in H. sapiens and 7 datasets in M. musculus, since sufficient cell type annotation references exist only for these two species (see details in Materials and Methods). In addition, marker genes for each cluster and gene enrichment analysis results can be browsed and downloaded.

Gene annotations and expression profiles

GEN provides an abundance of gene annotations for a total of 1 191 846 genes across 30 species. In addition to basic annotation (such as genomic location, biotype, functional description), GEN integrates value-added annotations derived from transcriptomic profiles, including quantitative (expression levels across different conditions) and qualitative (differential expression patterns between condition groups). For any specific gene(s), expression levels in a given dataset can be visualized by interactive heatmap and boxplot charts, and expression patterns from differential expression analysis (also applicable to the identification of marker genes for specific cell types) are annotated and incorporated in GEN. Moreover, GEN incorporates additional annotations for each gene, including editome-disease associations, internal control genes, and ontology terms (from GO, DO; see details in Materials and Methods). Consequently, GEN allows users to retrieve single or multiple genes by gene name/ID/symbol (https://ngdc.cncb.ac.cn/gen/browse/genes). Based on all collected annotations in GEN, users can conveniently find the genes of interest with specific annotations/profiles and investigate expression patterns across diverse biological conditions.

Online tools for data analysis and visualization

GEN is equipped with a series of online tools in aid of further downstream data analysis and visualization (see details in Materials and Methods). For bulk RNA-seq data, GEN offers online services for differential expression analysis, weighted gene co-expression network analysis (WGCNA), functional enrichment analysis and gene regulatory network inference. For scRNA-seq data, users can perform multiple analyses including quality control, data normalization, scaling and regression, dimensional reduction, graph-based clustering, and identification of marker genes for cell clusters (68). Furthermore, GEN is able to help users conduct gene enrichment analysis for cell markers, cell trajectory inference, and cell type annotation. Meanwhile, single-cell analysis results can be visualized by Cerebro (79), which allows interactive investigation and inspection of single-cell transcriptomic profiles incorporated in GEN. All these results can be downloaded in CSV and Excel formats and visualized images can be exported to PNG or PDF.

DISCUSSION AND FUTURE DEVELOPMENTS

GEN features systematic integration, manual curation and standardized data processing of 323 high-quality bulk and single-cell RNA-seq datasets across 30 species. It enables easy access to a comprehensive range of transcriptomic profiles, which are critical for unravelling both transcriptional and post-transcriptional regulatory mechanisms. Moreover, GEN provides abundant gene annotations based on value-added curation of transcriptomic profiles and delivers online services for bulk and single-cell data analysis and visualization. Future directions of GEN include continuous integration and analysis of high-quality RNA-seq datasets with diverse sequencing strategies (e.g. miRNA-seq, single-cell spatial RNA-seq, nanopore long-read RNA-seq) across more species. Also, GEN will be frequently updated by enriching gene annotations based on manual curation of the ever-increasing transcriptomic profiles (13). Particularly, since the field of single-cell genomics is under rapid development, we will keep an eye on cutting-edge scRNA-seq analysis methods and make updates on GEN data processing pipelines accordingly. GEN will also provide online services to accept user-submitted expression profiles with quality control and manual curation. Furthermore, interconnections with external and internal database resources at multi-omics levels (e.g. variome (80), methylome (81) and interactome (82)) will be added and enhanced. Web tools for RNA editing profiling, alternative splicing detection and batch-effect correction across different technologies and conditions will be developed and/or implemented in GEN.

DATA AVAILABILITY

GEN is freely available online at https://ngdc.cncb.ac.cn/gen/ and does not require users to register.