Nigel P Dyer, Vahid Shahrezaei, Daniel Hebenstreit.
Abstract
Protocols for preparing RNA sequencing (RNA-seq) libraries, most prominently "Smart-seq" variations, introduce global biases that can have a significant impact on the quantification of gene expression levels. This global bias can lead to drastic over- or under-representation of RNA in a non-linear, length-dependent fashion due to enzymatic reactions during cDNA production. It is currently not corrected by any RNA-seq software, which mostly focuses on local bias in coverage along RNAs. This paper describes LiBiNorm, a simple command line program that mimics the popular htseq-count software and allows diagnostics, quantification, and global bias removal. LiBiNorm outputs gene expression data that has been normalized to correct for global bias introduced by the Smart-seq2 protocol. In addition, it produces data and several plots that allow insights into the experimental history underlying library preparation. The LiBiNorm package includes an R script that allows visualization of the main results. LiBiNorm is the first software application to correct for the global bias that is introduced by the Smart-seq2 protocol. It is freely downloadable at http://www2.warwick.ac.uk/fac/sci/lifesci/research/libinorm.
Keywords: Gene expression; Global bias; Normalization; RNA-seq; Smart-seq2
Year: 2019 PMID: 30740268 PMCID: PMC6366399 DOI: 10.7717/peerj.6222
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Output files.
| Filename | Description |
|---|---|
| <countfilename> | Main output file with raw counts, gene length, global bias, and bias-corrected, normalized TPM expression levels for all genes. |
| <fileroot>_bias.txt | Consolidated data indicating the distribution of reads within the transcripts. The transcripts are ordered by length and then grouped into 500 roughly equal bins. The file gives the average gene length and a histogram of the read distribution for each of the bins. |
| <fileroot>_norm.txt | Parameter estimates and the bias predicted by the model as a function of selected transcript lengths, which form the basis of the normalization applied by the model. |
| <fileroot>_results.txt | Provides detailed information relating to the parameter estimation process including the results from each of the MCMC runs used to generate these results and an indication of the spread of the parameter estimates that were obtained from these runs. |
| <fileroot>_distribution.txt | Histogram of the read distribution within the transcripts for five different groups of transcripts each centered on a specific transcript length, together with the distribution predicted by the model for these lengths. |
Notes:
<countfilename> is created as a result of the -c command line option. Otherwise these data are sent to the program's standard output.
The <fileroot> files are only created if the -u command line option is used.
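The grouping described for the bias file (transcripts ordered by length, split into roughly equal bins, then summarized per bin) can be illustrated with a minimal Python sketch. This is not LiBiNorm's implementation or file format: the function name `bin_by_length`, the input layout (a length paired with a read-position histogram), and the use of a simple mean per bin are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def bin_by_length(
    transcripts: Sequence[Tuple[int, Sequence[float]]],
    n_bins: int,
) -> List[Tuple[float, List[float]]]:
    """Hypothetical helper: summarize read distributions per length bin.

    Each transcript is (length, histogram of read positions); all
    histograms are assumed to have the same number of cells.
    Returns one (mean length, mean histogram) pair per bin.
    """
    # Order transcripts from shortest to longest.
    ordered = sorted(transcripts, key=lambda t: t[0])
    n = len(ordered)
    # Never ask for more bins than there are transcripts.
    n_bins = min(n_bins, n)
    summary = []
    for b in range(n_bins):
        # Integer boundaries give bins whose sizes differ by at most one.
        start = b * n // n_bins
        end = (b + 1) * n // n_bins
        group = ordered[start:end]
        mean_len = sum(length for length, _ in group) / len(group)
        width = len(group[0][1])
        mean_hist = [
            sum(hist[i] for _, hist in group) / len(group)
            for i in range(width)
        ]
        summary.append((mean_len, mean_hist))
    return summary


# Toy input: four transcripts, two-cell histograms, two bins.
toy = [(300, [1, 0]), (100, [0, 1]), (200, [1, 1]), (400, [0, 0])]
bins = bin_by_length(toy, 2)
```

With 500 bins, as in the description above, the same integer-boundary scheme would be applied to the full set of detected transcripts.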
Figure 1. Example plots of read bias (SRA accession SRR1743160) produced with LiBiNorm.
(A) detected transcripts are aligned at 5′ and 3′ ends and ordered by length, shortest on top. Read density along RNAs is indicated by color intensity (the darker, the higher). (B) predicted bias for each model as a function of transcript length: bias relative to a linear length model. (C) comparison of negative log likelihood values (the lower the better the fit) for each of the six models with parameters determined for the SRR1743160 dataset. (D–G) estimated model parameter values d, h, t1 & t2, and a, respectively. See text for interpretation of parameters. (H) read coverages along transcripts aligned at 5′ and 3′ ends and separated into different length classes (colors). The experimental data and model fits are shown separately as solid and dashed lines (fit of model BD), respectively.
LiBiNorm output for SRR1743160 and model BD parameter and bias estimates.
| Model | Goodness-of-fit (log likelihood) | Parameter estimates | | | | |
|---|---|---|---|---|---|---|
| | | log10(d) | log10(h) | log10(t1) | log10(t2) | a |
| BD | 48722 | −0.103 | 1.89 | −4.27 | −3.52 | 0.59 |
Figure 2. Evaluation of bias correction.
(A) scatter plot of gene expression values derived from RNA-seq using TruSeq (SRR1743167) and Smart-seq2 (SRR1743160) based on conventional (linear; equivalent to FPKM) TPM. (B) same as (A), but using LiBiNorm (Model BD) to calculate TPM for the Smart-seq2 sample, which improves the R² compared to conventional TPM. Red dots mark genes with mRNA lengths between 10 and 10.1 kb, showing how the bias correction compensates for the underestimated expression levels of these genes. (C) change of R² (%; y-axis) when systematically comparing gene expression for Smart-seq2 and TruSeq protocols compared to a linear TPM reference (x-axis). An average across the four TruSeq samples is plotted for each of the 14 Smart-seq2 samples for each of the software packages as indicated.
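To make the comparison in this caption concrete, the following Python sketch contrasts conventional (linear) TPM with a TPM in which the expected read yield is a non-linear function of transcript length, and computes a squared Pearson correlation between two samples. The saturating curve `saturating_yield` and its 5,000 nt constant are purely hypothetical stand-ins; they are not the LiBiNorm model BD or any of its fitted parameters, and the function names are chosen for illustration only.

```python
from typing import Callable, Dict, Sequence


def tpm(
    counts: Dict[str, float],
    lengths: Dict[str, float],
    expected_yield: Callable[[float], float] = lambda length: length,
) -> Dict[str, float]:
    """TPM with a pluggable length model.

    With the default (yield proportional to length) this is the
    conventional linear TPM. Passing another function replaces the
    linear length term with a length-dependent bias model.
    """
    rates = {g: counts[g] / expected_yield(lengths[g]) for g in counts}
    total = sum(rates.values())
    return {g: 1e6 * r / total for g, r in rates.items()}


def saturating_yield(length: float) -> float:
    """Hypothetical bias curve: read yield grows sub-linearly with length."""
    return length / (1.0 + length / 5000.0)


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two equally long samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


# Toy data: equal read counts for a short and a long gene.
counts = {"short": 100.0, "long": 100.0}
lengths = {"short": 1000.0, "long": 3000.0}
linear = tpm(counts, lengths)
corrected = tpm(counts, lengths, saturating_yield)
```

In this toy case the sub-linear curve assigns the long gene a larger share of the total than linear TPM does, which is the direction of the correction described for the red dots in panel (B). Whether R² should be computed on raw or log-transformed values is left open here.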