| Literature DB >> 27881084 |
Hannah R Dueck1, Rizi Ai2, Adrian Camarena3, Bo Ding2, Reymundo Dominguez4, Oleg V Evgrafov3, Jian-Bing Fan5, Stephen A Fisher6, Jennifer S Herstein3, Tae Kyung Kim7,8, Jae Mun Hugo Kim3, Ming-Yi Lin4, Rui Liu9, William J Mack10, Sean McGroty6, Joseph D Nguyen3, Neeraj Salathia5, Jamie Shallcross6, Tade Souaiaia3, Jennifer M Spaethling7, Christopher P Walker3, Jinhui Wang7, Kai Wang3, Wei Wang2, Andre Wildberg2, Lina Zheng2, Robert H Chow4, James Eberwine7, James A Knowles3, Kun Zhang9, Junhyong Kim11.
Abstract
BACKGROUND: Recently, measurement of RNA at single cell resolution has yielded surprising insights. Methods for single-cell RNA sequencing (scRNA-seq) have received considerable attention, but the broad reliability of single cell methods and the factors governing their performance are still poorly known.Entities:
Keywords: Bioinformatics; Biotechnology; Genomics; Single-cell RNA-sequencing
Mesh:
Substances:
Year: 2016 PMID: 27881084 PMCID: PMC5122016 DOI: 10.1186/s12864-016-3300-3
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1Experimental design and RNA sequencing statistics by experimental group. a Dilution experiment summary. See for detailed information. b Single cell amplification methods used. Protocols involve two key steps: conversion of RNA (blue) to cDNA (green), and amplification of cDNA. aRNA targeted poly-adenylated mRNA by using an oligo-dT T7 primer for initial cDNA synthesis. After generating double-stranded cDNA, molecules were amplified using in vitro transcription with T7 polymerase. This amplification procedure was designed to minimize exponential expansion of errors. cDNA generation and amplification were repeated two additional times before library preparation. SmartSeq Plus targeted total RNA using a mixture of poly-T and randomized primers for initial cDNA synthesis. Full-length transcripts were captured through the template-switching capacity of reverse transcriptase. Double stranded cDNA molecules were amplified using 18 rounds of PCR. All cDNA and amplification reactions were performed on a 96-well Fluidigm C1 chip, intended to reduce experimental variation by performing reactions in small volume. NuGen targeted total RNA through use of proprietary primers for initial cDNA synthesis. Second strand cDNA synthesis was generated using an RNA primer, which was subsequently degraded from the second strand cDNA copy, resulting in linear amplification by DNA replication. This method was designed to minimize exponential amplification of error. c Sample sizes and RNA sequencing statistics by experimental group. Includes color key used in all figures. For analysis based on combined HBR and UHR dilution replicates, solid colors were used. Abbreviations: Human Brain Reference (HBR), Universal Human Reference (UHR), University of Pennsylvania (Penn), University of California San Diego (UCSD), University of Southern California (USC), picogram (pg.), base pair (bp.), contamination (contamin.), average (Ave.), standard deviation (Sd.), amplification (amp.)
Coverage selectivity by method
| Source | Protocol | Genome coverage | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| % exons (excluding rRNA & mitochondria) | % intronic | % rRNA (nuclear) | % rRNA (mitochondrial) | % mitochondrial (non-rRNA) | |||||||
| Ave. | Sd. | Ave. | Sd. | Ave. | Sd. | Ave. | Sd. | Ave. | Sd. | ||
| HBR | aRNA | 59.07 | 5.70 | 23.29 | 4.68 | 0.03 | 0.02 | 4.82 | 1.62 | 12.29 | 2.79 |
| SmartSeq Plus | 41.00 | 0.86 | 39.56 | 0.85 | 1.28 | 0.07 | 10.02 | 0.43 | 7.77 | 0.26 | |
| NuGen | 29.27 | 4.97 | 45.09 | 5.88 | 1.33 | 0.40 | 20.38 | 5.19 | 3.65 | 0.59 | |
| Bulk (Poly-A) | 80.13 | - | 8.52 | - | 0.08 | - | 1.96 | - | 9.01 | - | |
| Bulk (rRNA-depleted) | 61.44 | 0.54 | 37.89 | 0.45 | 0.03 | 0.01 | 0.10 | 0.04 | 0.25 | 0.04 | |
| UHR | aRNA | 71.57 | 2.34 | 23.02 | 2.60 | 0.06 | 0.05 | 1.61 | 0.19 | 3.38 | 0.59 |
| SmartSeq Plus | 35.89 | 0.89 | 52.84 | 0.94 | 0.31 | 0.02 | 6.20 | 0.26 | 3.85 | 0.12 | |
| NuGen | 33.00 | 4.02 | 39.33 | 8.99 | 1.25 | 0.57 | 23.47 | 7.31 | 2.59 | 0.62 | |
| Bulk (Poly-A) | 86.99 | - | 7.11 | - | 0.11 | - | 0.47 | - | 5.09 | - | |
| Bulk (rRNA-depleted) | 58.17 | 0.34 | 41.28 | 0.29 | 0.02 | 0.01 | 0.03 | 0.01 | 0.18 | 0.02 | |
Average percent of aligned reads assigned to genomic regions for each method. Nuclear rRNA includes rRNA genes, pseudogenes and repeats. See for definitions of genomic regions
Fig. 2Single-cell RNA sequencing sensitivity. a Number of detected genes. Each point represents a single sample. Horizontal black lines indicate group mean. Boxes indicate ± 2 Sd.. Gray horizontal lines indicate 95% CI for the expected number of genes in a diluted replicate, assuming total (dark gray) or poly-A (light gray) RNA. See . b Probability of gene detection as a function of the expected number of input molecules estimated using logistic model (see main text and ). Horizontal lines indicate 50% and 90% probability. Vertical lines indicate 1 and at 4.605 molecules (99% probability of ≥ 1 molecule present in diluted replicate). Bands indicate 95% CI. Black line indicates probability of ≥ 1 molecule present in a diluted replicate. c Odds ratio for gene detection as a function of sequencing depth (+500,000 reads). Horizontal line indicates an odds ratio of one (no gain in detection sensitivity). Band indicates 95% CI. d Odds ratios for differences in biophysical trait values. Error bars indicate a 95% CI. “*” indicates significant difference across pairs of methods (Bonferonni corrected p < 0.05). The odds ratio for length and secondary structure are shown for the increase from 25%ile length (structure) in the HBR and UHR transcriptomes to 75%ile length (structure). e Odds ratio for an increase of 0.01 in GC content. Bands indicate 95% CI. f Boxplot of gene mappability, or the fraction of the gene body that can be aligned to uniquely (see ) for computationally ambiguous gene detection outliers (wide boxes) and background genes (narrow boxes). Both undetected (Undet.) and detected (Det.) outliers are shown. “*” indicates significant difference (Wilcoxon rank-sum two-way test p < 0.05). g As in F, but for the fraction of the gene body that overlaps in genomic position with a separate gene annotation. h Nucleotide coverage. Observed over expected coverage normalized for expression level as a function of absolute 3' to 5' position. See . i Comparison of nucleotide coverage with uniform distribution. Empirical CDF is of normalized per nucleotide coverage. Black diagonal line indicates uniformity. Kolmogorov-Smirnov (KS) statistic for difference from the uniform distribution is in the bottom right, with larger values indicating greater difference between the distributions. j Coverage for genes with different expression levels. Relative 5' to 3' coverage, calculated over 100 equally spaced bins for four expression level categories (rows). See . Abbreviations: confidence interval (CI); Cumulative distribution function (CDF)
Fig. 3Single-cell RNA-sequencing precision. a Pairwise correlations for all samples. Upper triangle: Pearson correlation. Lower triangle: Kendall correlation. Zeros treated as missing values. Each row and column is an individual sample. Experimental group is indicated by color bars at edge of plot. b Relationship between standard deviation (st. dev.) and mean characterized by least squares regression (see ). All estimated coefficients were highly significant (coefficient t-test p < 10−16). c–f Enrichment of biophysical traits in experimentally precise (low) and variable (high) genes with respect to background genes (see ). Error bars indicate 95% CI. “*” indicates significant difference (p < 0.05). Numbers at bottom indicate sample size (number of genes). (C) Median difference in gene length estimated by Hodges-Lehman statistic. Significance: Wilcoxon rank sum two-way test. (D) Relative risk of containing an internal A-hexamer. Significance: Fisher’s exact test. (E) As C for % GC content. (F) As C for strength of local secondary structure. g–j PCA projection of dilution data on PC 1 and 2. Plots were centered so that bulk UHR or HBR was positioned at the origin. Points represent individual dilution replicates. Colored ovals represent bivariate normal 95% confidence ellipses. % Sd. explained by a PC is indicated in axis label. See . (G) HBR 10 pg. (H) UHR 10 pg. (I) HBR 100 pg. (J) UHR 100 pg. k–n As G–J, but using only abundantly expressed genes (see main text). Axis scales differ from G–J, with axes in equivalent to the purple-boxed region in G–J. (K) HBR 10 pg. (L) UHR 10 pg. (M) HBR 100 pg. (N) UHR 100 pg. Abbreviations: Standard error (SE); confidence interval (CI); kilobases (kb); kilocalories (kcal); mole (mol); principal components analysis (PCA); principal component (PC); standard deviation (sd)
Fig. 4Single-cell RNA sequencing accuracy. a Average pairwise correlation of diluted replicates with bulk HBR or UHR. Zeros treated as missing values. Error bars indicate ± 2 s.e.m. b–c Relationship between % unique alignment and similarity with bulk HBR or UHR. “r” indicates Pearson correlation of x and y. (B) 10 pg. (C) 100 pg. d–f Distribution of fold deviation across genes. Wide boxes represent measurements in individual replicates. Narrow boxes represent average measurements across replicates. Y-axis was truncated for visualization and 99%ile values for wide boxes are in panel descriptions below, ordered to match plot. “*” indicates significant difference between individual and average measurements (Wilcoxon rank sum test of greater fold deviation in average measurements, p < 0.05). (D) aRNA. 99%ile values: 2066; 1442; 514; 718. (E) SmartSeq Plus. 99%ile values: 877; 770; 553; 598. (F) NuGen. 99%ile values: 2332; 1766; 784; 937. g–j Enrichment of biophysical traits among underestimated (low) and overestimated (high) genes with respect to remaining genes. See . Plot notation and statistics are as in Fig. 3c–f. (G) Median difference in gene length. (H) Relative risk of containing an internal A-hexamer. (I) As G, for % GC content. (F) As G, for strength of local secondary structure. k–m Density scatter plots of normalized read counts in individual 10 pg. replicates vs. expected number of input molecules. See . Red indicates high density. Solid line indicates expected read count and hashed lines indicate ± 2 fold. (K) aRNA. (L) SmartSeq Plus. (M) NuGen. n–p Density scatter plots of average normalized read counts vs. number of input molecules. (N) aRNA. (O) SmartSeq Plus. (P) NuGen. Gene filtering: D–F and K–P considered computationally unambiguous genes and excluded gene detection outliers. D–J considered genes with greater than 95% probability of presence in a diluted replicate. Abbreviations: confidence interval (CI); kilobases (kb); kilocalorie (kcal); mole (mol)
Evaluation of protocol variations
| Variation | Category | Trait | Modified group | Control group | Median difference (Modified: Control) | Test | ||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sample size | Median | Sample size | Median | Statistic | 95% C.I. | Paired/Unpaired | Statistic |
| ||||
| No initial cDNA purification | Sensitivity | # genes detected | 6 | 11072 | 6 | 10249 | 782.5 | 293.0 | 1678.0 | Unpaired | 36 | 0.002 |
| Precision | Pairwise correlation across samples | 15 | 0.344 | 15 | 0.306 | 0.039 | 0.028 | 0.051 | Unpaired | 215 | 1.79E-06 | |
| Accuracy | Pairwise correlation with reference | 6 | 0.400 | 6 | 0.383 | 0.019 | 0.008 | 0.049 | Unpaired | 36 | 0.002 | |
| Reduce rounds of cDNA amp. | Sensitivity | # genes detected | 5 | 11936 | 3 | 11062 | 1051.0 | 33.0 | 4122.0 | Unpaired | 15 | 0.036 |
| Precision | Pairwise correlation across samples | 10 | 0.329 | 3 | 0.333 | −0.006 | −0.034 | 0.018 | Unpaired | 12 | 0.692 | |
| Accuracy | Pairwise correlation with reference | 5 | 0.430 | 3 | 0.401 | 0.029 | 0.009 | 0.066 | Unpaired | 15 | 0.036 | |
| Optimized aRNA | Sensitivity | # genes detected | 5 | 11936 | 18 | 10286 | 1810.0 | 942.0 | 3354.0 | Unpaired | 84 | 0.002 |
| Precision | Pairwise correlation across samples | 10 | 0.329 | 153 | 0.306 | 0.019 | 0.002 | 0.035 | Unpaired | 1084 | 0.028 | |
| Accuracy | Pairwise correlation with reference | 5 | 0.430 | 18 | 0.377 | 0.055 | 0.028 | 0.083 | Unpaired | 79 | 0.009 | |
| Add ERCCs | Sensitivity | # genes detected | 5 | 10154 | 4 | 10101 | 138.5 | −397.0 | 798.0 | Unpaired | 13 | 0.556 |
| Precision | Pairwise correlation across samples | 10 | 0.287 | 6 | 0.282 | 0.002 | −0.012 | 0.013 | Unpaired | 34 | 0.713 | |
| Accuracy | Pairwise correlation with reference | 5 | 0.364 | 4 | 0.369 | −0.003 | −0.016 | 0.019 | Unpaired | 9 | 0.905 | |
| Perform strand-specific sequencing | Sensitivity | # genes detected | 17 | 10006 | 17 | 10325 | −267.0 | −331.5 | −218.5 | Paired | 0 | 1.53E-05 |
| Sensitivity | Depth of unique genes | 17 | 6.397 | 17 | 0.848 | 5.46 | 3.68 | 6.75 | Paired | 153 | 1.53E-05 | |
| Sensitivity | # genes not in bulk | 17 | 534 | 17 | 610 | −62.0 | −76.0 | −49.5 | Paired | 0 | 1.53E-05 | |
| Accuracy | Pairwise correlation with reference | 17 | 0.362 | 17 | 0.376 | −0.013 | −0.017 | −0.009 | Paired | 2 | 4.58E-05 | |
Comparison of dilution replicates generated using modified protocols with control dilution replicates. Sample information can be found in Additional file 1 and protocol information in . # genes detected only considers genes observed in bulk HBR or UHR. Kendall correlation was calculated excluding zeros in either sample. Unpaired comparisons were made using Wilcoxon two-way rank sum test for difference in medians. Paired comparisons were made using Wilcoxon two-way rank sign test for difference in medians. The null hypothesis of no difference was rejected at p < 0.05. Median difference between groups, with 95% CI, was calculated using the Hodges-Lehman statistic