WeiBo Wang, Wei Sun, Wei Wang, Jin Szatkiewicz.
Abstract
BACKGROUND: The application of high-throughput sequencing in a broad range of quantitative genomic assays (e.g., DNA-seq, ChIP-seq) has created a high demand for the analysis of large-scale read-count data. Typically, the genome is divided into tiling windows, and windowed read-count data are generated for the entire genome, from which genomic signals are detected (e.g., copy number changes in DNA-seq, enrichment peaks in ChIP-seq). For accurate analysis of read-count data, many state-of-the-art statistical methods use generalized linear models (GLM) coupled with the negative-binomial (NB) distribution, leveraging its ability to perform bias correction and signal detection simultaneously. However, although statistically powerful, the GLM+NB method has quadratic computational complexity and therefore runs slowly when applied to large-scale windowed read-count data. In this study, we aimed to substantially speed up the GLM+NB method by using a randomized algorithm, and we demonstrate the utility of our approach by detecting copy number variants (CNVs) in a real example.
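To make the windowing step concrete, the following is a minimal sketch of how tiling-window read counts might be produced from read start positions. The function name, the input format, and the toy coordinates are assumptions for illustration and are not taken from the paper.

```python
from collections import Counter


def windowed_read_counts(read_starts, chrom_length, window_size):
    """Count reads in non-overlapping (tiling) windows.

    read_starts: 0-based start positions of aligned reads (hypothetical input).
    Returns a list whose i-th entry is the number of reads starting in
    [i * window_size, (i + 1) * window_size).
    """
    n_windows = -(-chrom_length // window_size)  # ceiling division
    counts = Counter(
        pos // window_size for pos in read_starts if 0 <= pos < chrom_length
    )
    return [counts.get(i, 0) for i in range(n_windows)]


# Toy data: eight reads on a 1000-base region, tiled with 100-base windows.
read_starts = [5, 17, 99, 100, 250, 251, 252, 999]
counts = windowed_read_counts(read_starts, chrom_length=1000, window_size=100)
print(counts)  # [3, 1, 3, 0, 0, 0, 0, 0, 0, 1]
```

Smaller windows produce more windows for the same region, which is why running time becomes a concern at fine resolution.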
Keywords: Bioinformatic; Computational biology; Next-generation sequencing
Year: 2018 PMID: 29490610 PMCID: PMC5831535 DOI: 10.1186/s12859-018-2077-6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Simulation results for evaluating the RGE coefficient estimates on CN. The x-axis shows the sampling proportion; the y-axis shows the CN coefficient estimates. The ground-truth value is 1 on the y-axis. Boxplots summarize the distributions of the coefficient estimates from 200 replication runs for each sampling strategy. The blue bars represent RGE (weighted sampling) at sampling proportions 0.1 and 0.5. The green bars represent RGE (uniform sampling) at sampling proportions 0.1 and 0.5. The segment at the x-axis value of 1 represents the coefficient estimates obtained using the entire dataset
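The idea behind Fig. 1 can be illustrated with a small simulation: estimate a copy-number coefficient whose true value is 1 from the full data, from a uniform subsample, and from a weighted subsample. This is only a sketch under loud assumptions. It uses a Poisson log-linear model instead of the negative-binomial model to stay short, a single covariate with a known baseline (`BASE`), and a hypothetical importance score for the weighted sampling. None of these choices is claimed to be the paper's RGE procedure.

```python
import math
import random

BASE = 10.0  # assumed expected reads per window per DNA copy (hypothetical)


def poisson_draw(rng, mean):
    """Knuth's multiplication method; adequate for the small means used here."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def fit_cn_coefficient(x, y, weights, n_iter=50):
    """Weighted Newton iterations for beta in log(mu) = log(BASE) + beta * x."""
    beta = 0.0
    for _ in range(n_iter):
        score = info = 0.0
        for xi, yi, wi in zip(x, y, weights):
            mu = BASE * math.exp(beta * xi)
            score += wi * xi * (yi - mu)   # weighted score function
            info += wi * xi * xi * mu      # weighted Fisher information
        step = score / info
        beta += step
        if abs(step) < 1e-10:
            break
    return beta


def subsample(rng, x, y, proportion, weighted):
    """Draw a subsample and the weights that keep the estimate consistent."""
    n = len(x)
    m = max(1, round(proportion * n))
    if not weighted:
        idx = rng.sample(range(n), m)            # uniform, without replacement
        w = [1.0] * m
    else:
        scores = [abs(xi) + 0.1 for xi in x]     # hypothetical importance score
        total = sum(scores)
        probs = [s / total for s in scores]
        idx = rng.choices(range(n), weights=probs, k=m)  # with replacement
        w = [1.0 / probs[i] for i in idx]        # inverse-probability weights
    return [x[i] for i in idx], [y[i] for i in idx], w


rng = random.Random(0)
copy_numbers = rng.choices([1, 2, 3], weights=[0.1, 0.8, 0.1], k=4000)
x = [math.log(cn) for cn in copy_numbers]
y = [poisson_draw(rng, BASE * cn) for cn in copy_numbers]  # true coefficient: 1

beta_full = fit_cn_coefficient(x, y, [1.0] * len(x))
xs, ys, ws = subsample(rng, x, y, 0.1, weighted=False)
beta_uniform = fit_cn_coefficient(xs, ys, ws)
xs, ys, ws = subsample(rng, x, y, 0.1, weighted=True)
beta_weighted = fit_cn_coefficient(xs, ys, ws)
print(beta_full, beta_uniform, beta_weighted)
```

Repeating the subsampling many times and plotting the estimates per sampling proportion would give boxplots of the kind the caption describes.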
Sensitivity comparison between GENSENG and R-GENSENG
| Window size | Total CNV (GENSENG) | Total CNV (R-GENSENG) | Deletion (GENSENG) | Deletion (R-GENSENG) | Duplication (GENSENG) | Duplication (R-GENSENG) |
|---|---|---|---|---|---|---|
| 100bps | 188/200 (94%) | 187/200 (94%) | 112/119 (94%) | 112/119 (94%) | 76/81 (94%) | 75/81 (93%) |
| 200bps | 187/200 (94%) | 183/200 (92%) | 111/119 (93%) | 111/119 (93%) | 76/81 (94%) | 72/81 (89%) |
| 500bps | 169/200 (85%) | 168/200 (84%) | 99/119 (83%) | 99/119 (83%) | 70/81 (86%) | 69/81 (85%) |
| 1000bps | 125/200 (63%) | 121/200 (61%) | 78/119 (66%) | 75/119 (63%) | 47/81 (58%) | 46/81 (57%) |
FDR comparison between GENSENG and R-GENSENG
| Window size | Total CNV (GENSENG) | Total CNV (R-GENSENG) | Deletion (GENSENG) | Deletion (R-GENSENG) | Duplication (GENSENG) | Duplication (R-GENSENG) |
|---|---|---|---|---|---|---|
| 100bps | 18/206 (8.7%) | 28/215 (13.0%) | 10/122 (8.2%) | 16/128 (12.5%) | 8/84 (9.5%) | 12/87 (13.8%) |
| 200bps | 10/197 (5.1%) | 14/197 (7.1%) | 3/114 (2.6%) | 5/116 (4.3%) | 7/83 (8.4%) | 9/81 (11.1%) |
| 500bps | 5/174 (2.9%) | 7/175 (4%) | 0/99 (0%) | 0/99 (0%) | 5/75 (6.7%) | 7/76 (9.2%) |
| 1000bps | 0/125 (0%) | 4/125 (3.2%) | 0/78 (0%) | 0/75 (0%) | 0/47 (0%) | 4/50 (8%) |
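The counts in the sensitivity and FDR tables are linked: the number of calls minus the number of false calls equals the number of simulated CNVs that were detected. A short sketch using the first Total CNV entry at 100bps from each table shows the arithmetic. The function names are illustrative, and the definitions (sensitivity as detected over simulated, FDR as false calls over all calls) are inferred from the table entries.

```python
def sensitivity(true_detected, simulated):
    """Fraction of simulated CNVs that were detected."""
    return true_detected / simulated


def false_discovery_rate(false_calls, total_calls):
    """Fraction of calls that do not match a simulated CNV."""
    return false_calls / total_calls if total_calls else 0.0


# First Total CNV entry at 100bps: 188/200 (sensitivity) and 18/206 (FDR).
true_detected, simulated = 188, 200
false_calls, total_calls = 18, 206

print(f"sensitivity = {sensitivity(true_detected, simulated):.1%}")    # 94.0%
print(f"FDR = {false_discovery_rate(false_calls, total_calls):.1%}")   # 8.7%

# Consistency check between the two tables: calls that are not false are true.
assert total_calls - false_calls == true_detected
```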
Fig. 2 Running time on the real data with different window sizes. The x-axis is the window size and the y-axis is the running time (in seconds). The red curve connects the points representing the average running time of GENSENG at varying window sizes, and the blue curve connects the points representing the average running time of R-GENSENG
The proportions of GENSENG calls overlapped by R-GENSENG calls
| Window Size | NA12878 | NA12891 | NA12892 |
|---|---|---|---|
| 100bps | 0.95 | 0.84 | 0.82 |
| 200bps | 0.92 | 0.95 | 0.93 |
| 500bps | 0.98 | 0.98 | 0.97 |
| 1000bps | 0.97 | 0.97 | 0.97 |
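A proportion of this kind can be computed by checking, for each call from one method, whether any call from the other method overlaps it. The sketch below assumes that any shared base counts as an overlap and that all calls lie on one chromosome. The overlap criterion used in the study is not stated here, and the coordinates are made up for illustration.

```python
def overlaps(a, b):
    """True if half-open intervals a = (start, end) and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def proportion_overlapped(reference_calls, other_calls):
    """Fraction of reference calls overlapped by at least one other call."""
    if not reference_calls:
        return 0.0
    hits = sum(
        any(overlaps(ref, other) for other in other_calls)
        for ref in reference_calls
    )
    return hits / len(reference_calls)


# Hypothetical call sets on a single chromosome.
genseng_calls = [(1000, 2000), (5000, 5600), (9000, 9400), (12000, 15000)]
r_genseng_calls = [(1100, 1900), (5500, 6000), (20000, 21000)]

proportion = proportion_overlapped(genseng_calls, r_genseng_calls)
print(proportion)  # 0.5
```

A stricter criterion, such as requiring a minimum reciprocal overlap, would only change the `overlaps` function.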