Hoang T Nguyen, James Boocock, Tony R Merriman, Michael A Black.
Abstract
Copy-number variation (CNV) has been associated with increased risk of complex diseases. High-throughput sequencing (HTS) technologies facilitate the detection of copy-number variable regions (CNVRs) and their breakpoints, which helps in understanding genome structure and its evolution. Various approaches have been proposed for detecting CNV breakpoints, but it remains challenging for tools based on a single analysis method to identify them. It has been shown, however, that pipelines which integrate multiple approaches report more reliable breakpoints. Here, based on HTS data, we have developed a pipeline to identify approximate breakpoints (±10 bp) relating to different ancestral events within a specific CNVR. The pipeline combines read-depth and split-read information to infer breakpoints, using information from multiple samples so that an imputation approach can be taken. The main steps are to cluster samples into groups with a normal mixture model, then to apply simple kernel-based approaches that maximize the information obtained from read depth and split reads, and finally to infer the common breakpoints of each group. The pipeline takes split-read information directly from the CIGAR strings of BAM files, without a re-alignment step. On simulated data sets, it was able to report breakpoints for very low-coverage samples, including those for which only single-end reads were available. When applied to three loci from existing human resequencing data sets (NEGR1, LCE3, IRGM), the pipeline obtained good concordance with results from the 1000 Genomes Project (92, 100, and 82%, respectively). The package is available at https://github.com/hoangtn/SRBreak, and also as a docker-based application at https://registry.hub.docker.com/u/hoangtn/srbreak/.
Keywords: breakpoint cluster region; copy number variant (CNV); read depth; split read; structural variation (SV)
Year: 2016 PMID: 27695476 PMCID: PMC5023681 DOI: 10.3389/fgene.2016.00160
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
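The abstract notes that split-read information is read directly from the CIGAR strings of BAM files. A minimal sketch of that idea follows, assuming each read is given as a pair of its 1-based leftmost mapped position and its CIGAR string. The function names `soft_clip_breakpoints` and `candidate_breakpoints` and the `min_support` parameter are hypothetical and are not taken from the SRBreak package.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operation code (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference.
REF_CONSUMING = set("MDN=X")


def soft_clip_breakpoints(pos, cigar):
    """Return reference positions (1-based) of aligned bases next to a soft clip.

    pos   -- 1-based leftmost mapped position of the read (SAM POS field)
    cigar -- CIGAR string, e.g. "60M40S"
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips sit outside soft clips, so drop them before checking the ends.
    core = [(n, op) for n, op in ops if op != "H"]
    if not core:
        return []
    ref_len = sum(n for n, op in core if op in REF_CONSUMING)
    breakpoints = []
    if core[0][1] == "S":
        # Clipped sequence lies to the left of the first aligned base.
        breakpoints.append(pos)
    if len(core) > 1 and core[-1][1] == "S":
        # Clipped sequence lies to the right of the last aligned base.
        breakpoints.append(pos + ref_len - 1)
    return breakpoints


def candidate_breakpoints(reads, min_support=2):
    """Keep positions supported by at least `min_support` soft-clipped reads."""
    counts = Counter(
        bp for pos, cigar in reads for bp in soft_clip_breakpoints(pos, cigar)
    )
    return sorted(bp for bp, n in counts.items() if n >= min_support)


# Illustrative reads (made-up values): two reads are clipped after position 159.
reads = [(100, "60M40S"), (120, "40M60S"), (300, "100M"), (400, "30S70M")]
print(candidate_breakpoints(reads))  # [159]
```

Counting exact positions is the simplest possible pile-up. The abstract mentions kernel-based approaches, which would smooth support over neighbouring positions rather than require exact agreement.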
Breakpoints of simulated samples for single- and paired-end reads.
| Type | Start | End | Length (bp) | Coverage | Sample number |
|---|---|---|---|---|---|
| Del | 101545220 | 101630000 | 84780 | 1–15x | 15 |
| Del | 101556000 | 101576000 | 20000 | 1–15x | 15 |
| Del | 101560000 | 101565000 | 5000 | 1–15x | 15 |
| Del | 101561000 | 101562000 | 1000 | 1–15x | 15 |
| Dup | 101555000 | 101605000 | 50000 | 1–15x | 15 |
| Dup | 101556000 | 101576000 | 20000 | 1–15x | 15 |
| Dup | 101558000 | 101568000 | 10000 | 1–15x | 15 |
| Normal | | | | 1–15x | 15 |
| Total | | | | | 120 |
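The abstract describes clustering samples into groups with a normal mixture model before breakpoints are inferred. The sketch below illustrates that step with a small expectation-maximization fit, assuming each sample is summarised by one read-depth ratio over the CNVR and that the components start near 0.5, 1.0 and 1.5 (deletion, normal, duplication). The function name `fit_mixture`, the starting values and the example ratios are illustrative assumptions, not the SRBreak implementation.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_mixture(values, means, sd=0.1, n_iter=50, min_sd=1e-3):
    """Fit a 1-D normal mixture by EM; `means` holds the starting component means.

    Returns (labels, means): the most likely component per value and the
    fitted component means.
    """
    k = len(means)
    means = list(means)
    sds = [sd] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(dens) or 1e-300  # guard against numerical underflow
            resp.append([d / total for d in dens])
        # M-step: update weight, mean and standard deviation per component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / len(values)
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            sds[j] = max(math.sqrt(var), min_sd)  # floor avoids a zero-width component
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return labels, means


# Made-up read-depth ratios for ten samples.
ratios = [0.48, 0.52, 0.55, 0.97, 1.02, 1.00, 1.05, 1.48, 1.53, 1.50]
labels, fitted_means = fit_mixture(ratios, means=[0.5, 1.0, 1.5])
print(labels)  # [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
```

Grouping samples first means that reads from every sample in a group can be pooled when looking for a shared breakpoint, which is what makes very low-coverage samples usable.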
Results of SRBreak on simulated data sets, using different window sizes and different thresholds to call CNVRs.
| Threshold | TPR (window = 50) | FDR (window = 50) | TPR (window = 100) | FDR (window = 100) | TPR (window = 250) | FDR (window = 250) | TPR (window = 500) | FDR (window = 500) | TPR (window = 1000) | FDR (window = 1000) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.50 | 0.71 | 0.10 | 0.94 | 0.02 | 0.96 | 0.01 | 0.98 | 0.00 | 0.82 | 0.04 |
| 0.35 | 0.80 | 0.11 | 0.91 | 0.04 | 0.96 | 0.01 | 0.98 | 0.00 | 0.82 | 0.06 |
| 0.25 | 0.80 | 0.11 | 0.94 | 0.01 | 0.96 | 0.01 | 0.98 | 0.00 | 0.82 | 0.06 |
| 0.20 | 0.80 | 0.12 | 0.94 | 0.01 | 0.96 | 0.01 | 0.97 | 0.01 | 0.81 | 0.07 |
| 0.50 | 0.58 | 0.10 | 0.94 | 0.02 | 0.97 | 0.00 | 0.97 | 0.00 | 0.86 | 0.00 |
| 0.35 | 0.58 | 0.10 | 0.95 | 0.01 | 0.97 | 0.00 | 0.97 | 0.00 | 0.86 | 0.00 |
| 0.25 | 0.58 | 0.10 | 0.87 | 0.01 | 0.97 | 0.00 | 0.97 | 0.00 | 0.86 | 0.00 |
| 0.20 | 0.58 | 0.10 | 0.85 | 0.00 | 0.97 | 0.00 | 0.97 | 0.00 | 0.86 | 0.00 |
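The TPR and FDR columns can be read with a simple scoring rule in mind. The sketch below assumes a called breakpoint counts as correct when it lies within 10 bp of a simulated breakpoint (the ±10 bp resolution stated in the abstract), that TPR is the fraction of simulated breakpoints recovered, and that FDR is the fraction of calls matching no simulated breakpoint. The exact definitions used to produce the table are not given in this record, so the function `evaluate_breakpoints` and its `tol` parameter are assumptions.

```python
def evaluate_breakpoints(true_bps, called_bps, tol=10):
    """Return (TPR, FDR) for called breakpoints against known ones.

    A true breakpoint is recovered if any call lies within `tol` bp of it.
    A call is a false discovery if no true breakpoint lies within `tol` bp.
    """
    recovered = sum(
        any(abs(t - c) <= tol for c in called_bps) for t in true_bps
    )
    false_calls = sum(
        not any(abs(t - c) <= tol for t in true_bps) for c in called_bps
    )
    tpr = recovered / len(true_bps) if true_bps else 0.0
    fdr = false_calls / len(called_bps) if called_bps else 0.0
    return tpr, fdr


# Simulated 20 kb deletion from the first table, with made-up calls:
# two calls fall within 10 bp of the true ends, one is spurious.
true_bps = [101556000, 101576000]
called_bps = [101556004, 101575991, 101560500]
tpr, fdr = evaluate_breakpoints(true_bps, called_bps)
print(tpr, round(fdr, 2))  # 1.0 0.33
```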
The performance of SRBreak on simulated data sets of different sample sizes.
| Sample size | TPR 25% | TPR 50% | TPR 75% | FDR 25% | FDR 50% | FDR 75% |
|---|---|---|---|---|---|---|
| 5 | 0.75 | 0.88 | 1.00 | 0.00 | 0.17 | 0.30 |
| 10 | 0.81 | 0.89 | 0.95 | 0.05 | 0.11 | 0.19 |
| 20 | 0.89 | 0.94 | 0.97 | 0.03 | 0.06 | 0.10 |
| 50 | 0.94 | 0.96 | 0.98 | 0.01 | 0.02 | 0.05 |
| 100 | 0.97 | 0.98 | 0.98 | 0.00 | 0.01 | 0.01 |
| Sample size | TPR 25% | TPR 50% | TPR 75% | FDR 25% | FDR 50% | FDR 75% |
| 5 | 0.80 | 0.90 | 1.00 | 0.00 | 0.00 | 0.20 |
| 10 | 0.87 | 0.94 | 1.00 | 0.00 | 0.05 | 0.06 |
| 20 | 0.94 | 0.96 | 1.00 | 0.00 | 0.00 | 0.03 |
| 50 | 0.95 | 0.98 | 0.98 | 0.00 | 0.00 | 0.00 |
| 100 | 0.97 | 0.97 | 0.98 | 0.00 | 0.00 | 0.00 |
Results of different pipelines on 120 simulated samples over a 1 Mb region.
| Pipeline | TPR (all 120 samples) | FDR (all 120 samples) | TPR (Class I: 1–5x) | FDR (Class I: 1–5x) | TPR (Class II: 6–10x) | FDR (Class II: 6–10x) | TPR (Class III: 11–15x) | FDR (Class III: 11–15x) |
|---|---|---|---|---|---|---|---|---|
| SRBreak | 0.98 | 0.00 | 0.77 | 0.16 | 1.00 | 0.00 | 1.00 | 0.00 |
| Pindel | 0.88 | 0.00 | 0.31 | 0.00 | 0.89 | 0.15 | 0.97 | 0.08 |
| DELLY | 0.78 | 0.00 | 0.63 | 0.00 | 0.86 | 0.00 | 0.86 | 0.00 |
| SoftSearch | 0.32 | 0.61 | 0.19 | 0.78 | 0.59 | 0.66 | 0.91 | 0.58 |
| MATCHCLIP | 0.69 | 0.00 | 0.29 | 0.00 | 0.77 | 0.00 | 1.00 | 0.00 |
| CNVnator | 0.12 | 0.47 | 0.11 | 0.50 | 0.13 | 0.50 | 0.13 | 0.50 |
Results of different pipelines on five low-coverage simulated samples (1–5x) for whole chromosome 21.
| Pipeline | TPR | FDR | Time |
|---|---|---|---|
| SRBreak | 0.92 | 0.10 | 1 m 35 s |
| Pindel | 0.84 | 0.00 | >3 h |
| DELLY | 0.86 | 0.00 | 3 m 32 s |
| SoftSearch | 0.20 | 0.88 | 1 m 27 s |
| MATCHCLIP | 0.20 | 0.58 | 0 m 53 s |
| CNVnator | 0.01 | 0.99 | 2 m 15 s |
The performance of the five analysis methods on three loci.
| Method | NEGR1 | LCE3 | IRGM |
|---|---|---|---|
| SRBreak | 0.92 (344/374) | 1.00 (271/271) | 0.82 (196/240) |
| Pindel | 0.93 (348/374) | 0.85 (231/271) | 0.78 (187/240) |
| DELLY | 0.90 (337/374) | 0.83 (224/271) | 0.68 (163/240) |
| SoftSearch | 0.10 (37/374) | 0.10 (27/271) | 0.19 (46/240) |
| MATCHCLIP | 0.47 (174/374) | 0.00 (0/271) | 0.04 (9/240) |