| Literature DB >> 28881988 |
Pinar Kavak1, Yen-Yi Lin2, Ibrahim Numanagic2, Hossein Asghari2, Tunga Güngör1, Can Alkan3, Faraz Hach2,4,5.
Abstract
MOTIVATION: Despite recent advances in algorithms design to characterize structural variation using high-throughput short read sequencing (HTS) data, characterization of novel sequence insertions longer than the average read length remains a challenging task. This is mainly due to both computational difficulties and the complexities imposed by genomic repeats in generating reliable assemblies to accurately detect both the sequence content and the exact location of such insertions. Additionally, de novo genome assembly algorithms typically require a very high depth of coverage, which may be a limiting factor for most genome studies. Therefore, characterization of novel sequence insertions is not a routine part of most sequencing projects. RESULT: Here, we present Pamir, a new algorithm to efficiently and accurately discover and genotype novel sequence insertions using either single or multiple genome sequencing datasets. Pamir is able to detect breakpoint locations of the insertions and calculate their zygosity (i.e. heterozygous versus homozygous) by analyzing multiple sequence signatures, matching one-end-anchored sequences to small-scale de novo assemblies of unmapped reads, and conducting strand-aware local assembly. We test the efficacy of Pamir on both simulated and real data, and demonstrate its potential use in accurate and routine identification of novel sequence insertions in genome projects.Entities:
Mesh:
Year: 2017 PMID: 28881988 PMCID: PMC5870608 DOI: 10.1093/bioinformatics/btx254
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1Classification of donor sequence regions in terms of read mappings. Concordant read: both ends map in correct orientation and within expected insert size. OEA read: one-end anchored, only one end maps to the reference. Split read is an OEA read whose unmapped end crosses the breakpoint and generates split mapping. Orphan read: none of the ends map to the reference
Fig. 2General overview of Pamir
Fig. 3Genotyping novel sequence insertions with Pamir. Here we show an example for calculating r, i and x based on the Figure: = 2 (the # of mappings passing through the breakpoint on R); = 9 (the # of mappings passing through the left breakpoint on I); = 7 (the # of mappings passing through the right breakpoint on I); = (i+i)/2 = 8; x = (i-r)/(i + r) = 0.6
Precision and recall of Pamir, PopIns, MindTheGap and BASIL & ANISE on simulated 30x datasets generated for different sequencing platforms with varying read lengths
| Error free | Noisy | |||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HiSeq2500-100 bp | HiSeq2500-150 bp | HiSeq2000-100 bp | HiSeq2500-100 bp | HiSeq2500-150 bp | HiSeq2000-100 bp | |||||||
| P | R | P | R | P | R | P | R | P | R | P | R | |
| Pamir | ||||||||||||
| PopIns | 0.973 | 0.814 | 0.958 | 0.726 | 0.972 | 0.823 | 0.969 | 0.800 | 0.968 | 0.789 | 0.938 | 0.709 |
| MindTheGap | 0.900 | 0.900 | 0.900 | 0.900 | 0.965 | 0.897 | 0.905 | 0.811 | ||||
| BASIL & ANISE | 0.989 | 0.757 | 0.989 | 0.763 | 0.989 | 0.763 | 0.989 | 0.757 | 0.989 | 0.754 | 0.974 | 0.743 |
Best results are marked with bold typeface.
Precision.
Recall.
Precision and recall rates of perfect Illumina HiSeq2000-100 bp simulation data with respect to different ranges of insertion sizes where each range contains 50 insertions
| Insertion | Pamir | PopIns | MindTheGap | BASIL & ANISE | ||||
|---|---|---|---|---|---|---|---|---|
| Length (bp) | R | P | R | P | R | P | R | P |
| 10–100 | 1.00 | 1.00 | 0.10 | 1.00 | 0.88 | 1.00 | 0.00 | 1.00 |
| 100–200 | 1.00 | 1.00 | 0.82 | 0.98 | 0.92 | 1.00 | 0.60 | 1.00 |
| 200–500 | 1.00 | 1.00 | 0.84 | 0.95 | 0.92 | 1.00 | 1.00 | 1.00 |
| 500–1K | 0.98 | 1.00 | 1.00 | 0.98 | 0.88 | 1.00 | 0.98 | 1.00 |
| 1K–2K | 0.96 | 1.00 | 1.00 | 0.94 | 0.92 | 1.00 | 1.00 | 0.98 |
| 2K–5K | 0.92 | 1.00 | 1.00 | 0.93 | 0.84 | 1.00 | 0.98 | 1.00 |
| 5K–10K | 0.80 | 1.00 | 1.00 | 0.93 | 0.94 | 1.00 | 0.98 | 1.00 |
| Total | 0.82 | 0.97 | 0.90 | 0.76 | 0.99 | |||
Best results for total are highlighted in boldface.
R: Recall.
P: Precision.
Precision and recall rates of 5 simulated samples (noisy HiSeq2500 2*100 bp 10x)
| Pooled | Individual | |||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Samples | All | |||||||||||
| # of Insertions | 350 | 350 | 280 | 280 | 280 | 280 | ||||||
| P | R | P | R | P | R | P | R | P | R | P | R | |
| Pamir | ||||||||||||
| PopIns | 0.977 | 0.811 | 0.575 | 0.657 | 0.591 | 0.675 | 0.575 | 0.657 | 0.574 | 0.646 | 0.603 | 0.668 |
Best results are marked with bold typeface.
Precision.
Recall.
Precision and recall rates of both individual and pooled calls of five low coverage samples. The paired-end reads (2*100 bp) are generated using Illumina HiSeq2500 error model. We have simulated 350 insertions in this dataset: S1 have all insertions, and genomes of the other four individuals contains 280 events. The column All shows performances of Pamir and PopIns based on pooling simulation reads, and each column S represents single sample detection results for i-th individual.
Evaluation of predicted genotypes using 5 simulated genomes
| Samples | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| # of Insertions | 350 | 280 | 280 | 280 | 280 | |||||
| # of Insertions not in the sample | 0 | 70 | 70 | 70 | 70 | |||||
| Pamir | PopIns | Pamir | PopIns | Pamir | PopIns | Pamir | PopIns | Pamir | PopIns | |
| Correct (INS) | 284 | 210 | 214 | 225 | 227 | |||||
| Correct (REF) | – | – | 66 | 54 | 66 | 56 | 64 | 59 | 60 | 57 |
| Incorrect zygosity | 2 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 0 | 0 |
| No call (INS) | 31 | 66 | 27 | 50 | 27 | 52 | 25 | 55 | 21 | 53 |
| No call (REF) | – | – | 4 | 16 | 4 | 14 | 6 | 11 | 10 | 13 |
Best results are marked with bold typeface.
Evaluation of genotyping results for the same five samples as in Table 3, based on pooling simulated reads. The paired-end reads (2*100 bp) are generated using Illumina HiSeq2500 error model. We have simulated 350 insertions in this dataset: S1 have all insertions, and genomes of the other four individuals contains 280 events. Correct (INS) lists the number of insertions that are correctly genotyped. Correct (REF) shows the number of detections discarded after genotyping, which are not actual insertions in an individual but falsely predicted based on pooling reads. Incorrect zygosity provides the number of insertions incorrectly genotyped as heterozygous; only 5 calls were identified as heterozygous in S1, S3 and S4 although they were homozygously inserted. All insertions map to common repeats. The No call (INS) row shows the number of insertions missed in the pooled run for each sample, i.e. false negatives. No call (REF) provides the number of insertions missed in the pooled run but the insertion was not inserted into this sample.
Summary of insertions predicted in CHM1
| All | > 50bp | ||
|---|---|---|---|
| Number of insertions | 22,676 | 20,232 | 2,444 |
| Minimum length | 5 | 5 | 51 |
| Maximum length | 4,135 | 50 | 4,135 |
| Average length | 26.20 | 12.12 | 142.51 |
Comparison of insertions in CHM1 by SMRT-SV using PacBio reads versus Pamir and PopIns using Illumina reads allowing 10bp breakpoint resolution
| PacBio | Illumina | ||||
|---|---|---|---|---|---|
| SMRT-SV | Pamir | PopIns | |||
| Insertion Length | Prediction | Prediction | Shared with SMRT-SV | Prediction | Shared with SMRT-SV |
| 1–50 bp | 187 | 20,232 (56%, 38%) | 27 (63%, 14%) | 21 (71%, 24%) | 0 |
| 50–100 bp | 4,384 (54%, 53%) | 1,273 (70%, 18%) | 205 (52%, 14%) | 246 (73%, 4%) | 17 (70%, 0%) |
| 100–200 bp | 2,959 (54%, 50%) | 815 (75%, 13%) | 125 (58%, 13%) | 793 (66%, 4%) | 120 (62%, 1%) |
| 200–500 bp | 3,123 (55%, 37%) | 291 (74%, 7%) | 97 (61%, 1%) | 1,074 (65%, 3%) | 141 (58%, 1%) |
| >500 bp | 2,345 (60%, 32%) | 65 (63%, 3%) | 34 (50%, 3%) | 1,286 (59%, 3%) | 207 (51%, 1%) |
| All | 12,998 (55%, 45%) | 22,676 (58%, 36%) | 488 (56%, 10%) | 3,420 (58%, 3%) | 485 (56%, 1%) |
For each category, we report (i) the percentile of the calls that fall into repeat regions compared to repeat masker file, and (ii) the percentile of the calls with biased GC ratios (20% or 60%) in the form (% of repeat regions, % of biased GC ratios) in the parentheses.
All events reported have a length of 50bp. Note that the comparisons are based only on breakpoint positions without consideration about contents of insertions. If we simultaneously consider insertion lengths and contents, most of PopIns predictions will be filtered out as shown in Supplementary Tables S5 and S6. It is worth mentioning that Pamir can call most of the predictions as PopIns. However, it filters most of them because of the stringent rules.
Hierarchical non-redundant analysis of predicted CHM1 insertions with Pamir and PopIns with respect to other datasets
| Pamir | PopIns | |||||||
|---|---|---|---|---|---|---|---|---|
| 50 - 200 bp | 200 - 500 bp | >500 bp | Total | 50 - 200 bp | 200 - 500 bp | >500 bp | Total | |
| # of insertions | 2,088 | 291 | 65 | 2,444 | 1,038 | 1,075 | 1,286 | 3,399 |
| In GRCh38 | 17 | 1 | 1 | 19 | 0 | 1 | 1 | 2 |
| In CHM1_1.1 ( | 251 | 54 | 2 | 307 | 15 | 8 | 1 | 24 |
| In CHM1 PacBio ( | 213 | 13 | 23 | 249 | 5 | 2 | 12 | 19 |
| In SMRT-SV ( | 73 | 47 | 11 | 131 | 118 | 132 | 193 | 443 |
| In long insert clones | 212 | 21 | 1 | 234 | 565 | 627 | 705 | 1,897 |
| In repeat regions | 1,065 | 126 | 21 | 1,212 | 221 | 191 | 214 | 626 |
| Remainder | 257 | 29 | 6 | 292 | 114 | 114 | 160 | 388 |
Here we provide a hierarchical non-redundant breakdown of comparison of insertions we predicted in the CHM1 genome with Pamir and PopIns. We compare the predictions in the following order: the GRCh38 assembly, then remaining to the reference-guided CHM1_1.1 assembly, the Pacific Biosciences (PacBio) assembly, SMRT-SV call set, long insert clones and those that are in repeat regions.
Long insert clones include both fosmid clones and bacterial artificial chromosomes (BAC). Since we apply more stringent rules to filter false positives in Pamir, many of our discarded calls are still kept by PopIns. This will affect recall rate of Pamir, especially for longer insertions whose orphan contigs are difficult to be assembled.
Summary of novel sequences found in 10 low coverage WGS datasets from the 1000 Genomes Project
| Total | > 50bp | |
|---|---|---|
| Number of insertions | 49,473 | 6,846 |
| Minimum length | 5 | 51 |
| Maximum length | 1,928 | 1,928 |
| Average length | 28.872 | 128.085 |
| In 1000 Genomes Project | 14,837 | 425 |
| In dbSNP version 147* | 14,409 | 2,027 |
*We intersected with dbSNP after removing those insertions that are found in the 1000 Genomes Project.
Genotyping results for the novel sequences found in the 1000 Genomes Project datasets
| Homozygous | Heterozygous | Total insertion length (bp) | |
|---|---|---|---|
| NA06985 | 22,971 | 10,246 | 941,868 |
| NA07357 | 22,582 | 10,158 | 921,225 |
| NA10851 | 23,274 | 9,465 | 930,766 |
| NA11840 | 20,973 | 12,745 | 959,017 |
| NA11918 | 22,610 | 9,994 | 953,968 |
| NA11933 | 21,049 | 11,092 | 936,615 |
| NA12004 | 19,024 | 12,650 | 928,371 |
| NA12044 | 18,753 | 13,002 | 919,212 |
| NA12234 | 20,841 | 10,804 | 916,251 |
| NA12286 | 19,027 | 12,622 | 922,799 |
(Pamir & PopIns) Analysis of insertions found in low-coverage samples with respect to other datasets
| Pamir | PopIns | |||||||
|---|---|---|---|---|---|---|---|---|
| 50–200 bp | 200–500 bp | >500 bp | Total | 50–200 bp | 200–500 bp | >500 bp | Total | |
| # of insertions | 6,050 | 667 | 129 | 6,846 | 5,963 | 4,068 | 2,838 | 12,869 |
| In GRCh38 | 31 | 2 | 1 | 34 | 0 | 0 | 4 | 4 |
| In long insert clones | 1,072 | 89 | 31 | 1,192 | 3,515 | 2,592 | 1,784 | 7,891 |
| In repeat regions | 3,837 | 488 | 71 | 4,396 | 1,542 | 947 | 613 | 3,102 |
| Remainder | 1,110 | 88 | 26 | 1,224 | 906 | 529 | 437 | 1,872 |
Here, we provide a hierarchical non-redundant breakdown of comparison of insertions we predicted in the 10 1000 genomes. We compare our predictions in the following order: the GRCh38 assembly, then remaining to the long insert clones and those that are in repeat regions. Before mapping to GRCh38 reference we extracted 200 bp left and right spanning regions of the insertion breakpoints on GRCh37 reference sequence, inserted the discovered sequence in between and searched the obtained sequence in GRCh38.
Fig. 4Performance comparison of PopIns and Pamir in Illumina HiSeq2500 100bp simulation dataset with 170 calls falling in repeat regions
Fig. 5Performance comparison of PopIns and MindTheGap in Illumina HiSeq2500 100bp simulation dataset with 170 calls falling in repeat regions
Running times of Pamir, PopIns, MindTheGap, and BASIL & ANISE on a 2*100bp simulation dataset based on HiSeq2500 model with 30X coverage
| Pamir | PopIns | MindTheGap | BASIL & ANISE |
|---|---|---|---|
| 3min 9sec | 1min 59sec | 13min 25sec | 11min 16sec |