Pingjian Yu, Wei Lin.
Abstract
The rapid growth of single-cell RNA-seq (scRNA-seq) studies demands efficient data storage, processing, and analysis. Big-data technology provides a framework that facilitates the comprehensive discovery of biological signals from inter-institutional scRNA-seq datasets. This article discusses strategies for resolving the stochastic and heterogeneous single-cell transcriptome signal. After extensively reviewing the available big-data applications for next-generation sequencing (NGS)-based studies, we propose a workflow that accounts for the unique characteristics of scRNA-seq data and the primary objectives of single-cell studies.
Keywords: Big data; RNA-seq; Signal normalization; Single cell; Transcriptional heterogeneity
Year: 2016 PMID: 26876720 PMCID: PMC4792842 DOI: 10.1016/j.gpb.2016.01.005
Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691
Figure 1. Number of papers/datasets addressing single-cell data and big data
Searches were performed on January 4, 2016 on http://www.ncbi.nlm.nih.gov/gds for datasets and http://www.ncbi.nlm.nih.gov/pubmed for papers. Counts were obtained per year using the following search criteria: (1) for scRNA-seq datasets on GEO: “single cell”[All Fields] AND “Expression profiling by high throughput sequencing”[Filter]; (2) for scRNA-seq papers on PubMed: “single cell”[All Fields] AND (“rna-seq”[All Fields] OR “rna sequencing”[All Fields] OR (“sequencing”[All Fields] AND “transcriptome”[All Fields])); and (3) for big-data papers on PubMed: “big data”[All Fields] OR “hadoop”[All Fields].
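The per-year counts described in this caption could be reproduced programmatically. The following is a minimal sketch that only builds request URLs for the NCBI E-utilities ESearch endpoint and does not contact the server. The helper name `esearch_count_url` and the `QUERIES` dictionary are illustrative, and filtering by publication date (`datetype=pdat`) is an assumption about how the yearly filter was applied.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The three search criteria from the caption, keyed by an illustrative label.
# Each value is (Entrez database, query term).
QUERIES = {
    "scrna_datasets": (
        "gds",
        '"single cell"[All Fields] AND '
        '"Expression profiling by high throughput sequencing"[Filter]',
    ),
    "scrna_papers": (
        "pubmed",
        '"single cell"[All Fields] AND ("rna-seq"[All Fields] OR '
        '"rna sequencing"[All Fields] OR ("sequencing"[All Fields] AND '
        '"transcriptome"[All Fields]))',
    ),
    "bigdata_papers": (
        "pubmed",
        '"big data"[All Fields] OR "hadoop"[All Fields]',
    ),
}


def esearch_count_url(db: str, term: str, year: int) -> str:
    """Build an ESearch URL asking for the number of hits in one year.

    Assumption: records are filtered by publication date (datetype=pdat).
    """
    params = {
        "db": db,
        "term": term,
        "rettype": "count",   # return only the hit count
        "datetype": "pdat",   # assumed: filter on publication date
        "mindate": str(year),
        "maxdate": str(year),
    }
    return ESEARCH + "?" + urlencode(params)


if __name__ == "__main__":
    for label, (db, term) in QUERIES.items():
        print(label, esearch_count_url(db, term, 2015))
```

Fetching each URL (for example with `urllib.request.urlopen`) returns a small XML document whose `Count` element holds the number of matching records; counts retrieved today will differ from those obtained on the original search date.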
Figure 2
The increased bulk expression of MYH2 is primarily driven by the growing proportion of “on”-component cells (upper cluster) over time (0, 24, 48, and 72 h after myoblast differentiation is induced). Figures were derived from the dataset in Trapnell et al. [41]. A. The growth of MYH2 expression in bulk cell replicate samples (n = 3) over time. B. Beeswarm plots of the growing bimodal proportion of MYH2 from scRNA-seq over time. C–F. RNA-FISH signals at 0, 24, 48, and 72 h, respectively. MYH2 and nucleus are shown in red and blue (DAPI staining), respectively. Scale bar: 25 nm. G. MYH2 RNA molecule counts per cell over time, based on RNA-FISH analyses. RNA-FISH, RNA-fluorescence in situ hybridization.
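The mechanism in this caption can be illustrated numerically: if the per-cell expression level of "on" cells stays fixed, the bulk (population-average) signal still rises as the fraction of "on" cells grows. This is a minimal sketch of a two-component mixture; the proportions and the expression level are hypothetical values chosen for illustration and are not taken from Trapnell et al.

```python
def bulk_mean(p_on: float, mean_on: float, mean_off: float = 0.0) -> float:
    """Population-average expression of a two-component ("on"/"off") mixture."""
    if not 0.0 <= p_on <= 1.0:
        raise ValueError("p_on must be a proportion between 0 and 1")
    return p_on * mean_on + (1.0 - p_on) * mean_off


# Hypothetical fractions of "on" cells at 0, 24, 48, and 72 h (illustrative only).
HYPOTHETICAL_P_ON = {0: 0.05, 24: 0.30, 48: 0.55, 72: 0.80}
MEAN_ON = 100.0  # hypothetical, constant per-cell level in "on" cells

if __name__ == "__main__":
    for hours, p_on in HYPOTHETICAL_P_ON.items():
        print(f"{hours:>2} h: bulk mean = {bulk_mean(p_on, MEAN_ON):.1f}")
```

Under these assumptions the bulk value changes only because `p_on` changes, which is why a bulk measurement alone cannot distinguish "more cells switched on" from "every cell expressing more".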
Summary of cell types in GEO datasets
| Cell type | No. of datasets |
| Neuron | 11 |
| Embryonic | 80 |
| Blood | 18 |
| Lung | 17 |
| Renal | 4 |
| Brain | 17 |
| Skin | 26 |
| Heart | 9 |
| Bone marrow | 17 |
| Stem cell | 43 |
| Tumor | 23 |
| Cell line | 71 |
| Total No. of unique datasets | 195 |
Hadoop-based bioinformatics software tools
| Category | Tool | Description |
| Sequence file management | LFQC | A lossless compression algorithm for FASTQ files |
| | Quake | Quality-guided error detection and correction of short reads |
| | SeqPig | Simple and scalable scripting for large sequencing datasets |
| | Hadoop-BAM | Library for scalable manipulation of aligned NGS data |
| | smallWig | Parallel compression of RNA-seq WIG files |
| Search engine | SeqWare | Pipeline and query engine for storing and searching sequence data |
| | Hydra | A protein sequence database search engine |
| | SparkSeq | Interactive data querying of genomic data analysis |
| | GMQL | Large-scale genomic data query and management |
| Genomic sequence mapping | CloudAligner | A MapReduce-based application for short read alignment |
| | CloudBurst | A parallel short read mapper |
| | BigBWA | Hadoop implementation of BWA |
| | SEAL | Alignment, manipulation, and analysis of short reads |
| | DistMap | A toolkit for distributed short read mapping |
| | SOAP3 | Short sequence read alignment with GPU acceleration |
| | GPU-BLAST | NCBI-BLAST with GPU acceleration |
| Expression analysis | Myrna | RNA sequencing differential expression analysis |
| | Eoulsan | Pipeline for calculating differential gene expression |
| | YunBe | A gene set analysis algorithm for biomarker identification |
| | FX | Gene expression estimation and genomic variant calling |
| Phylogenetic analysis | FVGWAS | Fast voxel-wise genome-wide association analysis |
| | GATK | Variant calling |
| | Crossbow | Alignment and SNP genotyping with Bowtie and SoapSNP |
| | MrsRF | Calculate Robinson–Foulds distance between trees |
| | BlueSNP | Genome-wide association studies using Hadoop clusters |
| | GeneCOST | Scoring-based prioritization to identify disease-causing genes |
| | Nephele | Genotyping via complete composition vector |
| Miscellaneous | PeakRanger | A cloud-enabled peak caller for ChIP-seq data |
| | SeqHBase | A big-data toolset for family-based sequencing data analysis |
| | ProKinO | A unified resource for mining the cancer kinome |
| | BioPig | An analytic toolkit for large-scale sequence data |
Read count normalization methods
| RPM | Rescale | N/A | N/A | No | No | Yes | No | No | No |
| RPKM | Rescale | N/A | N/A | No | No | Yes | Yes | No | No |
| Median | Rescale | N/A | N/A | No | No | Yes | No | No | No |
| Upper-quantile | Rescale | N/A | N/A | No | No | Yes | No | No | No |
| Full-quantile | Rank average | N/A | N/A | No | No | Yes | No | No | No |
| GC-content | Statistical model | Non-parametric | Local regression | No | No | Yes | Yes | Yes | Yes |
| DESeq | Statistical model | Negative binomial | GLM | Yes | No | Yes | No | No | Yes |
| TMM | Statistical model | Negative binomial | GLM | Yes | No | Yes | No | No | Yes |
| RUV | Statistical model | Lognormal | GLM | Yes | No | Yes | No | No | Yes |
| Poisson beta | Statistical model | Mixed Poisson | Bayesian | No | Yes | Yes | No | No | No |
| Sphinx | Statistical model | Mixed negative binomial | Bayesian | Yes | Yes | Yes | No | No | No |
Note: RPM, reads per million mapped reads; RPKM, reads per kilobase per million mapped reads; TMM, trimmed mean of M values; RUV, remove unwanted variation; GLM, generalized linear model.
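The two simplest rescaling methods in the table follow directly from their definitions in the note. This is a minimal sketch using the standard definitions of RPM and RPKM; the function names and the example numbers are illustrative and do not come from the article.

```python
def rpm(read_count: float, total_mapped_reads: float) -> float:
    """Reads per million mapped reads: rescale by sequencing depth only."""
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    return read_count * 1e6 / total_mapped_reads


def rpkm(read_count: float, gene_length_bp: float, total_mapped_reads: float) -> float:
    """Reads per kilobase per million mapped reads: rescale by depth and gene length."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene_length_bp and total_mapped_reads must be positive")
    # 1e9 = 1e6 (per million reads) * 1e3 (per kilobase)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # Illustrative numbers: 50 reads on a 2,000 bp gene in a library of 1,000,000 mapped reads.
    print(rpm(50, 1_000_000))         # depth-normalized value
    print(rpkm(50, 2_000, 1_000_000)) # depth- and length-normalized value
```

Both functions apply one global scaling factor per library, so neither models the excess zeros and bimodality of single-cell data; that is the gap the mixture-model methods at the bottom of the table address.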
Figure 3. Workflow of inter-institutional scRNA-seq data integration
Inter-institutional single-cell RNA-seq datasets are aligned against their genomes at the Hadoop layer. Read counts are resolved into gene “on” or “off” status at the normalization layer. Differential expression, co-expression, and other applications are developed based on gene “on” or “off” status instead of gene expression level. The biological relevance of the resulting gene list is verified by GSEA, GO-term enrichment analysis, DAVID functional analysis, or other tools. GSEA, gene set enrichment analysis; GO, gene ontology; DAVID, database for annotation, visualization and integrated discovery.
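The normalization and analysis layers of this workflow can be sketched as follows. A fixed count threshold is used here as a deliberately simple stand-in for the "on"/"off" call; the article's table of normalization methods points to mixture models (Poisson beta, Sphinx) for that step, so the threshold, the function names, and the toy data are all assumptions made for illustration.

```python
from typing import Dict, List


def call_on_off(counts: List[float], threshold: float = 1.0) -> List[bool]:
    """Resolve per-cell read counts into "on" (True) or "off" (False).

    Assumption: a fixed threshold replaces a fitted mixture model.
    """
    return [c >= threshold for c in counts]


def on_fraction(status: List[bool]) -> float:
    """Proportion of cells in which the gene is "on"."""
    return sum(status) / len(status) if status else 0.0


def on_fraction_shift(
    group_a: Dict[str, List[float]],
    group_b: Dict[str, List[float]],
    threshold: float = 1.0,
) -> Dict[str, float]:
    """Per-gene change in "on" proportion from group A to group B.

    Each group maps gene name -> list of per-cell counts. Only genes present
    in both groups are compared.
    """
    shared = sorted(group_a.keys() & group_b.keys())
    return {
        gene: on_fraction(call_on_off(group_b[gene], threshold))
        - on_fraction(call_on_off(group_a[gene], threshold))
        for gene in shared
    }


if __name__ == "__main__":
    # Toy per-cell counts (hypothetical), four cells per condition.
    early = {"MYH2": [0, 0, 0, 5], "GAPDH": [9, 7, 8, 6]}
    late = {"MYH2": [0, 7, 9, 5], "GAPDH": [8, 9, 7, 6]}
    print(on_fraction_shift(early, late))
```

Ranking genes by the size of this shift yields a candidate list that could then be passed to GSEA, GO-term enrichment, or DAVID; a real analysis would attach a statistical test to each shift rather than use the raw difference.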