Sergi Sayols, Denise Scherzinger, Holger Klein.
Abstract
BACKGROUND: PCR clonal artefacts originating from NGS library preparation can affect both genomic and RNA-Seq applications when protocols are pushed to their limits. In RNA-Seq, however, artefactual reads are hard to tell apart from normal read duplication caused by natural over-sequencing of highly expressed genes. Especially when working with little input material or single cells, assessing the fraction of duplicate reads is an important quality control step for NGS data sets. Until now, the available tools have only calculated global duplication rates. These do not take gene expression levels into account, which leaves them of limited use for RNA-Seq data.
Keywords: Bioconductor; Duplication rate; PCR artefacts; Quality control tool; RNA-Seq; Single cell RNA-Seq
Year: 2016 PMID: 27769170 PMCID: PMC5073875 DOI: 10.1186/s12859-016-1276-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Example values for a sample of 10 genes from the library 13276
| ID | geneLength | allCounts | filteredCounts | dupRate | dupsPerId | RPK | RPKM |
|---|---|---|---|---|---|---|---|
| LOC100288069 | 1371 | 17 | 15 | 0.12 | 2 | 12.40 | 0.60 |
| LINC00115 | 1317 | 28 | 28 | 0.00 | 0 | 21.26 | 1.03 |
| LOC643837 | 9233 | 281 | 246 | 0.12 | 35 | 30.43 | 1.47 |
| FAM41C | 1706 | 1 | 1 | 0.00 | 0 | 0.59 | 0.03 |
| LOC100130417 | 496 | 0 | 0 | NA | 0 | 0.00 | 0.00 |
| SAMD11 | 2554 | 0 | 0 | NA | 0 | 0.00 | 0.00 |
| NOC2L | 2800 | 329 | 273 | 0.17 | 56 | 117.50 | 5.67 |
| KLHL17 | 2564 | 2 | 2 | 0.00 | 0 | 0.78 | 0.04 |
| ISG15 | 666 | 590 | 271 | 0.54 | 319 | 885.89 | 42.78 |
| AGRN | 7326 | 3 | 3 | 0.00 | 0 | 0.41 | 0.02 |
Some columns were omitted due to space constraints; refer to Additional file 7: Table S2 for the complete table
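The derived columns in this excerpt appear to follow directly from `geneLength`, `allCounts` and `filteredCounts`. The sketch below recomputes them under relationships inferred from the table values themselves, not from a definition stated here: `dupsPerId = allCounts - filteredCounts`, `dupRate = dupsPerId / allCounts` (NA when a gene has no reads), and `RPK = allCounts / (geneLength / 1000)`. The function name `derive` is hypothetical. RPKM is left out because it also depends on the library's total mapped read count, which is not given in this excerpt.

```python
# (ID, geneLength, allCounts, filteredCounts) copied from the table above.
ROWS = [
    ("LOC100288069", 1371, 17, 15),
    ("LINC00115", 1317, 28, 28),
    ("LOC643837", 9233, 281, 246),
    ("FAM41C", 1706, 1, 1),
    ("LOC100130417", 496, 0, 0),
    ("SAMD11", 2554, 0, 0),
    ("NOC2L", 2800, 329, 273),
    ("KLHL17", 2564, 2, 2),
    ("ISG15", 666, 590, 271),
    ("AGRN", 7326, 3, 3),
]


def derive(gene_length, all_counts, filtered_counts):
    """Return (dupsPerId, dupRate, RPK) under the assumed relationships."""
    # Assumption: duplicates are the reads removed by duplicate filtering.
    dups_per_id = all_counts - filtered_counts
    # Assumption: the rate is undefined (NA in the table) for genes without reads.
    dup_rate = dups_per_id / all_counts if all_counts > 0 else None
    # Reads per kilobase of gene length.
    rpk = all_counts / (gene_length / 1000)
    return dups_per_id, dup_rate, rpk


for gene_id, length, all_counts, filtered in ROWS:
    dups, rate, rpk = derive(length, all_counts, filtered)
    rate_text = "NA" if rate is None else f"{rate:.2f}"
    print(f"{gene_id:<14} dupsPerId={dups:<4} dupRate={rate_text:<5} RPK={rpk:.2f}")
```

Comparing the printed values with the `dupsPerId`, `dupRate` and `RPK` columns is a quick consistency check on the excerpt. ISG15 stands out: it has by far the highest RPK of the ten genes and also the highest duplication rate, the kind of expression-dependent duplication the abstract argues a global rate cannot capture.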
Fig. 1 Several RNA-seq datasets from Marinov et al. [26]. Legends show the intercept and slope of a fitted logit model. a Single-cell experiment with relatively low duplication rates and most of the genes detected. b Single-cell experiment with most of the genes undetected and a high duplication rate on the detected ones. c RNA-seq experiment pushing the protocol to only 100 pg of input material, with low duplication rates and relatively good identification of genes. d Same RNA-seq experiment, showing over-sequencing due to the higher sequencing depth of the library
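To illustrate what an intercept and slope of a fitted logit model mean here, the sketch below fits a two-parameter logistic curve to (RPK, dupRate) pairs by Newton's method. It rests on several assumptions that the legend does not confirm: the predictor is taken to be log10(RPK), the response is the per-gene duplication rate treated as a fraction, and genes without reads are dropped. The input is the eight genes with non-zero counts from the table excerpt above, not the data behind Fig. 1, and `fit_logit` is a hypothetical helper rather than the package's own routine.

```python
import math

# (RPK, dupRate) for the genes with non-zero counts in the table excerpt.
GENES = [
    (12.40, 0.12), (21.26, 0.00), (30.43, 0.12), (0.59, 0.00),
    (117.50, 0.17), (0.78, 0.00), (885.89, 0.54), (0.41, 0.00),
]


def fit_logit(xs, ys, max_iter=50, tol=1e-10):
    """Fit p = 1 / (1 + exp(-(b0 + b1 * x))) by Newton's method.

    ys may be fractions in [0, 1]. Returns (intercept, slope).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            # Score (gradient of the log-likelihood).
            g0 += y - p
            g1 += (y - p) * x
            # Information matrix (negative Hessian), symmetric 2x2.
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system with the closed-form inverse.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Assumption: expression enters the model on a log10 scale.
xs = [math.log10(rpk) for rpk, _ in GENES]
ys = [rate for _, rate in GENES]
intercept, slope = fit_logit(xs, ys)
print(f"intercept={intercept:.2f} slope={slope:.2f}")
```

Under these assumptions, the intercept sets the fitted duplication rate at the low-expression end of the scale (log10(RPK) = 0, that is 1 read per kilobase), and the slope describes how quickly duplication rises with expression. Eight genes are far too few for a meaningful estimate; the example only shows the mechanics.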