Mikhail G Dozmorov, Indra Adrianto, Cory B Giles, Edmund Glass, Stuart B Glenn, Courtney Montgomery, Kathy L Sivils, Lorin E Olson, Tomoaki Iwayama, Willard M Freeman, Christopher J Lessard, Jonathan D Wren.
Abstract
BACKGROUND: Adapter trimming and removal of duplicate reads are common practices in next-generation sequencing pipelines. Sequencing reads ambiguously mapped to repetitive and low complexity regions can also be problematic for accurate assessment of the biological signal, yet their impact on sequencing data has not received much attention. We investigate how trimming the adapters, removing duplicates, and filtering out reads overlapping low complexity regions influence the significance of biological signal in RNA- and ChIP-seq experiments.
Year: 2015 PMID: 26423047 PMCID: PMC4597324 DOI: 10.1186/1471-2105-16-S13-S10
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. RepeatSoaker comparisons. Overview of the various permutations of the three data processing steps compared.
RNA-seq alignment statistics for different combinations of the sequencing data processing steps
| Trim | Dup | RS | Total reads | properly paired (%) | singletons (%) | with mate mapped to a different chr (%) | Number of DEGs | KEGG: ECM-receptor interaction | GO: multicellular organismal process | Reactome: Transmembrane transport of small molecules | R2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| - | - | - | 41,505,942 | 68.98+-3.71 | 16.03+-3.20 | 4.01+-3.48 | 2189 | 1.86E-07 | 8.86E-16 | 6.38E-13 | 0.6687 |
| + | - | - | 51,984,539 | 60.10+-3.68 | 10.98+-2.44 | 17.92+-3.89 | 2139 | 3.29E-08 | 2.86E-13 | 9.85E-11 | 0.6614 |
| - | + | - | 15,429,501 | 61.72+-5.45 | 12.49+-2.37 | 10.22+-4.68 | 2487 | 1.34E-07 | 2.50E-22 | 1.85E-11 | 0.6672 |
| + | + | - | 25,738,167 | 43.14+-5.77 | 7.51+-1.33 | 36.18+-5.70 | 2391 | 1.97E-07 | 4.93E-17 | 3.54E-09 | 0.6575 |
| - | - | 75 | 28,283,010 | 70.55+-3.17 | 16.74+-3.85 | 0.69+-0.09 | 2100 | 7.62E-08 | 8.85E-17 | 5.71E-12 | 0.6708 |
| - | - | 50 | 26,450,592 | 70.05+-3.24 | 17.22+-3.95 | 0.63+-0.08 | 2068 | 7.18E-08 | 1.31E-16 | 6.85E-14 | 0.6712 |
| - | - | 25 | 24,703,408 | 69.63+-3.31 | 17.66+-4.05 | 0.62+-0.08 | 2021 | 8.92E-09 | 6.33E-19 | 2.94E-14 | 0.6705 |
| - | - | 0 | 21,413,178 | 69.39+-3.47 | 18.20+-4.26 | 0.61+-0.09 | 2087 | 1.02E-08 | 3.13E-19 | 3.98E-15 | 0.6643 |
| + | - | 75 | 32,589,028 | 64.70+-3.88 | 12.29+-2.88 | 10.46+-1.82 | 2116 | 4.88E-07 | 1.12E-13 | 2.89E-12 | 0.6637 |
| + | - | 50 | 30,174,345 | 64.93+-3.92 | 12.70+-2.97 | 9.71+-1.82 | 2066 | 3.49E-07 | 2.96E-14 | 2.44E-12 | 0.6642 |
| + | - | 25 | 28,231,486 | 64.55+-4.00 | 13.04+-3.03 | 9.76+-1.87 | 2004 | 3.15E-07 | 5.05E-16 | 1.13E-13 | 0.6636 |
| + | - | 0 | 24,546,936 | 63.85+-4.25 | 13.42+-3.12 | 10.26+-2.06 | 2028 | 2.65E-07 | 3.42E-16 | 3.55E-14 | 0.6583 |
| - | + | 75 | 9,681,047 | 68.59+-4.15 | 12.22+-2.96 | 1.45+-0.25 | 2302 | 1.21E-07 | 4.18E-23 | 5.71E-12 | 0.6695 |
| - | + | 50 | 8,987,150 | 68.34+-4.18 | 12.53+-3.06 | 1.221+-0.20 | 2256 | 1.48E-07 | 4.80E-22 | 4.07E-14 | 0.6700 |
| - | + | 25 | 8,346,861 | 68.09+-4.22 | 12.83+-3.16 | 1.21+-0.19 | 2245 | 1.48E-07 | 2.70E-21 | 1.44E-14 | 0.6694 |
| - | + | 0 | 7,151,500 | 68.13+-4.26 | 12.99+-3.12 | 1.19+-0.20 | 2326 | 1.69E-07 | 2.81E-24 | 8.28E-16 | 0.6628 |
| + | + | 75 | 14,251,402 | 52.88+-6.50 | 7.54+-1.45 | 23.48+-5.56 | 2210 | 1.18E-06 | 4.34E-20 | 3.95E-09 | 0.6598 |
| + | + | 50 | 12,985,873 | 53.93+-6.45 | 7.65+-1.48 | 22.01+-5.48 | 2180 | 7.69E-07 | 5.94E-19 | 3.02E-11 | 0.6604 |
| + | + | 25 | 12,125,100 | 53.69+-6.50 | 7.81+-1.51 | 22.11+-5.58 | 2124 | 4.40E-06 | 2.98E-19 | 6.28E-13 | 0.6599 |
| + | + | 0 | 10,416,970 | 52.34+-6.98 | 7.74+-1.30 | 23.54+-6.17 | 2176 | 4.22E-06 | 3.54E-18 | 3.00E-13 | 0.6539 |
"Total reads" - average number of reads; "properly paired (%)" - average percent of properly paired reads; "singletons (%)" - average percent of single end reads; "with mate mapped to a different chr (%)" - average percent of inter-chromosome mapped reads; "Number of DEGs" - number of differentially expressed genes. To allow direct comparison of p-values among the processing steps, the "ECM-receptor interaction" KEGG pathway, the "multicellular organismal process" GO term, and the "Transmembrane transport of small molecules" Reactome pathway were selected as the most representative and most enriched functional categories in each processing step; the full enrichment analysis results are shown in Additional Files 4 and 5. "+/-" indicate whether the step (Trim - adapter trimming, Dup - duplicate removal, RS - filtering out low complexity regions with RepeatSoaker) was applied/not applied, respectively. The number in the RS column is the threshold for removing reads overlapping low complexity regions, i.e., 75 indicates that reads overlapping a low complexity region by 75% or more were removed.
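The overlap threshold described above can be illustrated with a minimal sketch. This is not RepeatSoaker's implementation; the function names (`covered_bases`, `keep_read`) and the half-open interval representation are hypothetical and chosen only for illustration. It also assumes that the low complexity regions passed in are non-overlapping and on the same chromosome as the read, and that a threshold of 0 means "remove any read with at least one overlapping base".

```python
from typing import List, Tuple

# Half-open genomic interval [start, end) on a single chromosome (assumed representation).
Interval = Tuple[int, int]


def covered_bases(read: Interval, lc_regions: List[Interval]) -> int:
    """Count the read's bases that fall inside low complexity (LC) regions.

    Assumes the LC regions do not overlap each other, so per-region
    overlaps can simply be summed.
    """
    start, end = read
    return sum(
        max(0, min(end, lc_end) - max(start, lc_start))
        for lc_start, lc_end in lc_regions
    )


def keep_read(read: Interval, lc_regions: List[Interval], threshold_pct: int) -> bool:
    """Return True if the read survives filtering at the given threshold.

    A read is removed when at least `threshold_pct` percent of its length
    overlaps LC regions. A threshold of 0 is treated (by assumption) as
    "remove on any overlap at all".
    """
    length = read[1] - read[0]
    covered = covered_bases(read, lc_regions)
    if threshold_pct == 0:
        return covered == 0
    # Integer comparison avoids floating-point rounding at the boundary.
    return covered * 100 < threshold_pct * length


lc = [(150, 300)]
half_covered = (100, 200)   # 50 of 100 bases lie inside the LC region
untouched = (0, 100)        # no overlap with the LC region
results = {t: keep_read(half_covered, lc, t) for t in (75, 50, 25, 0)}
```

In this example the read with 50% overlap is kept only at the 75 threshold and removed at 50, 25, and 0, while the read with no overlap is kept at every threshold.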
ChIP-seq alignment statistics for different combinations of sequencing data processing steps
| Trim | Dup | RS | Total reads | properly paired (%) | singletons (%) | with mate mapped to a different chr (%) | SPI1 E-value | Number of motifs |
|---|---|---|---|---|---|---|---|---|
| - | - | - | 70,929,429 | 96.81+-1.57 | 0.40+-0.17 | 0.80+-0.34 | 8.2e-9446 | 44 |
| + | - | - | 70,155,620 | 97.49+-1.44 | 0.15+-0.03 | 0.39+-0.13 | 7.6e-10075 | 65 |
| - | + | - | 40,472,954 | 95.60+-2.19 | 0.44+-0.24 | 1.53+-1.06 | 1.5e-10726 | 26 |
| + | + | - | 40,500,416 | 96.83+-1.82 | 0.16+-0.04 | 0.65+-0.30 | 4.0e-11010 | 25 |
| - | - | 75 | 68,856,152 | 96.81+-1.57 | 0.39+-0.16 | 0.79+-0.34 | 2.0e-9425 | 43 |
| - | - | 50 | 68,578,937 | 96.81+-1.57 | 0.39+-0.16 | 0.79+-0.34 | 2.0e-9425 | 42 |
| - | - | 25 | 68,405,768 | 96.81+-1.57 | 0.39+-0.16 | 0.79+-0.34 | 2.0e-9425 | 43 |
| - | - | 0 | 68,279,169 | 96.81+-1.57 | 0.39+-0.16 | 0.79+-0.34 | 2.0e-9425 | 42 |
| + | - | 75 | 68,004,984 | 97.50+-1.45 | 0.15+-0.03 | 0.38+-0.13 | 2.8e-9899 | 64 |
| + | - | 50 | 67,805,679 | 97.50+-1.45 | 0.15+-0.03 | 0.38+-0.13 | 2.8e-9899 | 64 |
| + | - | 25 | 67,679,663 | 97.50+-1.45 | 0.15+-0.03 | 0.38+-0.13 | 2.8e-9899 | 67 |
| + | - | 0 | 67,587,630 | 97.50+-1.45 | 0.15+-0.03 | 0.38+-0.13 | 2.8e-9899 | 62 |
| - | + | 75 | 39,242,893 | 95.61+-2.19 | 0.43+-0.24 | 1.51+-1.05 | 7.6e-10575 | 26 |
| - | + | 50 | 39,080,973 | 95.61+-2.19 | 0.43+-0.24 | 1.51+-1.05 | 7.6e-10575 | 26 |
| - | + | 25 | 38,981,300 | 95.61+-2.19 | 0.43+-0.24 | 1.51+-1.05 | 7.6e-10575 | 26 |
| - | + | 0 | 38,908,929 | 95.61+-2.19 | 0.43+-0.24 | 1.51+-1.05 | 7.6e-10575 | 26 |
| + | + | 75 | 39,242,893 | 95.61+-2.19 | 0.43+-0.24 | 1.51+-1.05 | 4.7e-10731 | 27 |
| + | + | 50 | 39,080,973 | 95.61+-2.19 | 0.43+-0.24 | 1.51+-1.05 | 4.7e-10731 | 27 |
| + | + | 25 | 38,981,300 | 95.61+-2.19 | 0.43+-0.24 | 1.51+-1.05 | 4.7e-10731 | 27 |
| + | + | 0 | 38,908,929 | 95.61+-2.19 | 0.43+-0.24 | 1.51+-1.05 | 4.7e-10731 | 30 |
"+/-" indicate whether the step (Trim - adapter trimming, Dup - duplicate removal, RS - filtering out low complexity regions with RepeatSoaker) was applied/not applied, respectively. The number in the RS column is the threshold for removing reads overlapping low complexity regions, i.e., 75 indicates that reads overlapping a low complexity region by 75% or more were removed. The SPI1 E-value is the equivalent of a p-value for the detection of the PU.1 motif.
Figure 2. Differential expression detection. Number of differentially expressed genes detected after removing reads overlapping low complexity (LC) regions. Conditions for differential expression analysis: "all LC overlaps kept/removed" - reads touching/overlapping LC regions are either kept or removed, respectively; "25%/50%/75% LC overlaps removed" - reads overlapping LC regions by at least 25%/50%/75%, respectively, are removed before differential expression analysis.
Figure 3. Gene comparison of distribution and expression levels. Comparison of the log2 fold change (A) and expression (B) distributions among genes at different thresholds for removing reads overlapping low complexity (LC) regions. "All kept"/"Reads without LC overlaps" - metrics of all differentially expressed genes detected using all/no reads overlapping LC regions; "75%/50%/25% LC overlaps removed" - metrics of genes that became non-differentially expressed after removing reads overlapping LC regions by at least 75%/50%/25%, respectively.
Figure 4. Gene comparison of distribution and expression levels. Effect of data processing on expression (A) and fold change (B) distribution of differentially expressed genes.