Simone Rizzetto1,2, Auda A Eltahla1,2, Peijie Lin3,4, Rowena Bull1,2, Andrew R Lloyd1,2, Joshua W K Ho3,4, Vanessa Venturi5, Fabio Luciani6,7.
Abstract
Single cell RNA sequencing (scRNA-seq) offers great potential for measuring the gene expression profiles of heterogeneous cell populations. In immunology, scRNA-seq has allowed the characterisation of transcript sequence diversity of functionally relevant T cell subsets, and the identification of the full-length T cell receptor (TCRαβ), which defines the specificity against cognate antigens. Several factors, e.g. RNA library capture, cell quality, and sequencing output, affect the quality of scRNA-seq data. We studied the effects of read length and sequencing depth on the quality of gene expression profiles, cell type identification, and TCRαβ reconstruction, utilising 1,305 single cells from 8 publicly available scRNA-seq datasets, and simulation-based analyses. Gene expression was characterised by an increased number of unique genes identified with short read lengths (<50 bp), but these featured higher technical variability compared to profiles from longer reads. Successful TCRαβ reconstruction was achieved for 6 datasets (81% - 100%) with at least 0.25 million paired-end (PE) reads of length >50 bp, while it failed for datasets with <30 bp reads. Sufficient read length and sequencing depth can control technical noise to enable accurate identification of TCRαβ and gene expression profiles from scRNA-seq data of T cells.
Year: 2017 PMID: 28986563 PMCID: PMC5630586 DOI: 10.1038/s41598-017-12989-x
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
scRNA-seq data sets analysed in this study.
| Dataset | Cell type | Reference | Accession number | Organism | Number of cells | Average read length (bp) | Average number of PE reads (×10^6) | scRNA-seq protocol | Number of genes expressed (FPKM > 1) per cell |
|---|---|---|---|---|---|---|---|---|---|
| 1 | HCV-specific CD8+ T cells | Eltahla | E-MTAB-4850 | human | 54 | 145 | 8.4 | Smart-Seq2 | 2,563 |
| 2 | HCV-specific CD8+ T cells | Eltahla | E-MTAB-4850 | human | 12 | 215 | 3.5 | Smart-Seq2 | 3,289 |
| 3 | Th17 cells (A) | Gaublomme | GSE74833 | mouse | 399 | 125 | 2.5 | Smart-Seq | 5,128 |
| 4 | Th17 cells (B) | Gaublomme | GSE74833 | mouse | 269 | 100 | 3.7 | Smart-Seq | 6,540 |
| 5 | Th17 cells (C) | Gaublomme | GSE74833 | mouse | 100 | 25 | 1.5 | Smart-Seq | 4,146 |
| 6 | CD4+ T cells | Stubbington | E-MTAB-3857 | mouse | 272 | 100 | 4.3 | Smart-Seq | 2,354 |
| 7 | CD8+ T cells | Kimmerling | GSE74923 | mouse | 106 | 32 | 1.2 | Smart-Seq2 | 6,796 |
| 8 | Th2 | Mahata | E-MTAB-2512 | mouse | 93 | 75 | 16.3 | Smarter-Seq | 6,401 |
List of datasets used for the analysis.
The success rate of reconstructing full-length T-cell receptors (TCR) using VDJPuzzle for the eight scRNA-seq data sets.
| Dataset | Number of cells | Average read length (bp) | Average number of PE reads (×10^6) | TCRα success rate (%), VDJPuzzle | TCRβ success rate (%), VDJPuzzle | TCRαβ success rate (%), VDJPuzzle |
|---|---|---|---|---|---|---|
| 1 | 54 | 145 | 8.4 | 81.48 | 85.19 | 81.48 |
| 2 | 12 | 215 | 3.5 | 100.00 | 100.00 | 100.00 |
| 3 | 399 | 125 | 2.5 | 99.25 | 98.75 | 98.50 |
| 4 | 269 | 100 | 3.7 | 98.51 | 98.88 | 97.77 |
| 5 | 100 | 25 | 1.5 | 0.00 | 0.00 | 0.00 |
| 6 | 272 | 100 | 4.3 | 89.71 | 93.38 | 85.66 |
| 7 | 106 | 32 | 1.2 | 1.89 | 7.55 | 1.89 |
| 8 | 93 | 75 | 16.3 | 89.25 | 93.55 | 86.02 |
Success rates for TCRαβ detection in each dataset.
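As a quick check of the table above, the following sketch groups the eight datasets by average read length and reports the range of paired TCRαβ success rates in each group. The values are copied from the table and the 50 bp cut-off follows the abstract; the variable names are illustrative only and are not part of the study's code.

```python
# Dataset number -> (average read length in bp, TCRαβ success rate in %),
# copied from the success-rate table above.
datasets = {
    1: (145, 81.48),
    2: (215, 100.00),
    3: (125, 98.50),
    4: (100, 97.77),
    5: (25, 0.00),
    6: (100, 85.66),
    7: (32, 1.89),
    8: (75, 86.02),
}

THRESHOLD_BP = 50  # read-length cut-off mentioned in the abstract

# Split the paired TCRαβ success rates by read length.
long_reads = [rate for length, rate in datasets.values() if length > THRESHOLD_BP]
short_reads = [rate for length, rate in datasets.values() if length <= THRESHOLD_BP]

print(f"> {THRESHOLD_BP} bp: {len(long_reads)} datasets, "
      f"{min(long_reads)}% to {max(long_reads)}%")
print(f"<= {THRESHOLD_BP} bp: {len(short_reads)} datasets, "
      f"{min(short_reads)}% to {max(short_reads)}%")
```

The six datasets with reads longer than 50 bp all fall in the 81% to 100% range quoted in the abstract, while the two short-read datasets stay below 2%.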
Figure 1. Success rates of TCRαβ reconstruction as a function of read length (A) and sequencing depth (B) using VDJPuzzle. Panels C and D show the distributions of the length of the reconstructed CDR3α and CDR3β regions, respectively.
Figure 2. (A) Generation of the simulated datasets from the real scRNA-seq dataset 1. (B) Success rate for TCRαβ reconstruction as a function of read length and sequencing depth from the simulated datasets.
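The simulation pipeline itself is not described in this excerpt. As a rough illustration only, the sketch below assumes that a simulated cell is obtained by randomly subsampling paired-end reads to a target depth and truncating each read to a target length. The function name `simulate_cell`, its parameters, and the toy reads are hypothetical and are not taken from the paper.

```python
import random

def simulate_cell(read_pairs, depth, read_length, seed=0):
    """Subsample `depth` read pairs and trim each mate to `read_length` bases.

    Hypothetical helper: an assumption about how such datasets could be
    generated, not the authors' implementation.
    """
    rng = random.Random(seed)           # fixed seed for reproducibility
    n = min(depth, len(read_pairs))     # cannot sample more pairs than exist
    sampled = rng.sample(read_pairs, n)
    # Keep only the first `read_length` bases of each mate.
    return [(r1[:read_length], r2[:read_length]) for r1, r2 in sampled]

# Toy paired-end reads (made-up sequences, 10 bases each).
pairs = [
    ("ACGTACGTAC", "TTGCAATGCA"),
    ("GGGCCCAAAT", "CATCATCATG"),
    ("TTTTACGCGA", "AGAGAGTCTC"),
]

simulated = simulate_cell(pairs, depth=2, read_length=5)
print(simulated)
```

In practice the same two steps would be applied to FASTQ files per cell, for each combination of depth and read length.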
Figure 3. Analysis of the alignment of the simulated datasets as a function of sequencing depth and read length. Shown is the number of paired-end reads aligned (in log10 scale), along with the proportion of concordant and discordant pairs, and of multiple alignment instances.
Number of genes expressed in at least 10% of the cells in the simulated data sets, comprised of subsamples of the scRNA-seq data set 1, with various sequencing depths (columns) and read lengths (rows).
| Read length (bp) | 0.05 million PE reads | 0.25 million PE reads | 0.625 million PE reads | 1.25 million PE reads |
|---|---|---|---|---|
| 25 | 6,081 | 8,497 | 9,184 | 10,849 |
| 50 | 5,665 | 7,801 | 8,255 | 8,240 |
| 100 | 5,141 | 6,879 | 7,333 | 7,440 |
| 150 | 4,836 | 6,458 | 6,824 | 6,936 |
Analysis of empirical drop out rate on simulated datasets.
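To make the trend in this table concrete, the sketch below computes, for each sequencing depth, the relative reduction in the number of detected genes when going from 25 bp to 150 bp reads. The numbers are copied from the table; the variable names are illustrative.

```python
# Sequencing depths (million PE reads), in table column order.
depths = [0.05, 0.25, 0.625, 1.25]

# Read length (bp) -> genes expressed in at least 10% of cells, per depth.
genes = {
    25: [6081, 8497, 9184, 10849],
    50: [5665, 7801, 8255, 8240],
    100: [5141, 6879, 7333, 7440],
    150: [4836, 6458, 6824, 6936],
}

# Fractional reduction in detected genes from 25 bp to 150 bp reads.
reduction = {
    depth: 1 - genes[150][i] / genes[25][i]
    for i, depth in enumerate(depths)
}

for depth, frac in reduction.items():
    print(f"{depth} million PE reads: {frac:.1%} fewer genes at 150 bp than at 25 bp")
```

The reduction is roughly 20% at the lowest depth and grows to about 36% at 1.25 million PE reads, in line with the abstract's observation that short reads yield more detected genes.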
Figure 4. The effect of read length and sequencing depth on technical variability using simulated scRNA-seq datasets. (A) Number of identified expressed genes (Fragments Per Kilobase per Million reads; FPKM > 1) as a function of read length and sequencing depth. Error bars (box plot, mean and 5–95% interval) represent variability across individual cells. (B) Mean pairwise cell-to-cell Pearson correlation of gene expression values as a function of sequencing depth and read length. (C) The distribution of pairwise cell-to-cell Pearson correlation of gene expression values using subsets of different read length drawn from the original dataset. The original dataset had a read length of 145 bp with a depth of >8 million PE reads; two samples were drawn from this dataset, with a read length of 25 bp and the same depth.
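The mean pairwise cell-to-cell Pearson correlation used in panels B and C can be sketched as follows. This is a minimal pure-Python illustration on made-up expression values, not the authors' code; the function names are hypothetical, and cells with zero variance are not handled.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # assumes neither vector is constant

def mean_pairwise_correlation(cells):
    """Average Pearson correlation over all unordered pairs of cells."""
    pairs = list(combinations(cells, 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)

# Toy data: three cells (rows) by four genes (columns), made-up values.
toy_cells = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]

mean_r = mean_pairwise_correlation(toy_cells)
print(round(mean_r, 3))
```

A higher mean correlation across cells of the same type indicates lower technical variability between their expression profiles.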
Figure 5. Clustering analysis for the three populations of HCV-specific CD8+ T cells. Panels A and B display Principal Coordinate Analysis of the three subsets of cells by varying read length (25 to 150 bp). Coverage for each dataset was set to 1.25 million PE reads per cell. The point colours correspond to the ‘ground truth’ cell type labels (see legend), while the three point styles correspond to the three identified clusters (circle, triangle and cross). Clustering analysis was performed using CIDR, forcing the number of clusters to be n = 3. Panels C and D display the misclassification and the variability within the same cell type (within-class sum of squares) as a function of read length and sequencing depth, respectively. Panel D displays only results from PBMC-derived T cells.
Figure 6. Gene expression profiles of selected genes identified from dataset 1, human HCV-specific CD8+ T cells (Table 1).