Daichi Shigemizu, Fuyuki Miya, Shintaro Akiyama, Shujiro Okuda, Keith A Boroevich, Akihiro Fujimoto, Hidewaki Nakagawa, Kouichi Ozaki, Shumpei Niida, Yonehiro Kanemura, Nobuhiko Okamoto, Shinji Saitoh, Mitsuhiro Kato, Mami Yamasaki, Tatsuo Matsunaga, Hideki Mutai, Kenjiro Kosaki, Tatsuhiko Tsunoda.
Abstract
Insertions and deletions (indels) have been implicated in dozens of human diseases through the radical alteration of gene function by short frameshift indels as well as long indels. However, the accurate detection of these indels from next-generation sequencing data is still challenging. This is particularly true for intermediate-size indels (≥50 bp), because the sequencing reads are short. Here, we developed a new method that predicts intermediate-size indels using BWA soft-clipped fragments (unmatched fragments in partially mapped reads) and unmapped reads. We compared the performance of our method with that of GATK, PINDEL and ScanIndel, using whole exome sequencing data from the same samples. False positive and false negative counts were determined through Sanger sequencing of all indels predicted by these four methods. The harmonic mean of recall and precision, the F-measure, was used to measure the performance of each method. Our method achieved the highest F-measure of 0.84 in one sample, compared to 0.56 for GATK, 0.52 for PINDEL and 0.46 for ScanIndel. Similar results were obtained in additional samples, demonstrating that our method was superior to the other methods for detecting intermediate-size indels. We believe that this methodology will contribute to the discovery of intermediate-size indels associated with human disease.
Year: 2018 PMID: 29618752 PMCID: PMC5884821 DOI: 10.1038/s41598-018-23978-z
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. The workflow of intermediate-size indel prediction.
Summary for our intermediate-size indel prediction.
| Sample | High-quality soft-clipped fragments (forward) | High-quality soft-clipped fragments (backward) | Consensus fragments (forward) | Consensus fragments (backward) | Intermediate-size indels |
|---|---|---|---|---|---|
| NA18943 | 45,240 | 46,084 | 10,778 | 11,004 | 60 |
| NA18948 | 40,204 | 42,101 | 9,296 | 9,804 | 47 |
| NA12878 | 7,062 | 6,826 | 1,179 | 1,283 | 17 |
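Soft-clipped fragments are the unmatched ends of partially mapped reads, and an aligner such as BWA records them as `S` operations in the SAM CIGAR string. The following is a minimal sketch of how such fragments could be pulled out of one alignment record. It is not the IMSindel implementation: the function name `soft_clipped_fragments` is hypothetical, the quality filtering implied by "high-quality" is omitted, and how left and right clips correspond to the forward and backward columns above is not specified here.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM specification).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_fragments(seq, cigar):
    """Return (left_clip, right_clip) sequences for one aligned read.

    Hypothetical helper for illustration only. `seq` is the SAM SEQ field,
    which includes soft-clipped bases but not hard-clipped ones.
    """
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) are absent from SEQ, so they do not shift any offsets.
    ops = [(length, op) for length, op in ops if op != "H"]

    left = right = ""
    if ops and ops[0][1] == "S":
        left = seq[:ops[0][0]]
    if len(ops) > 1 and ops[-1][1] == "S":
        right = seq[-ops[-1][0]:]
    return left, right


# A 10-base read whose first 3 bases did not match the reference.
print(soft_clipped_fragments("ACGTACGTAC", "3S7M"))
```

Reads with a non-empty clip on either side are the candidates that a method of this kind would go on to group and assemble into consensus fragments.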
Accuracy estimation of four call methods.
| Sample | Method | Genotype calls | †Sanger examined | TP (a) | FP (b) | FN (c) | Precision (a)/(a + b) | Recall (a)/(a + c) | F-measure |
|---|---|---|---|---|---|---|---|---|---|
| NA18943 | IMSindel | 60 | 54 | 49 | 5 | 14 | 0.91 | 0.78 | 0.84 |
| NA18943 | GATK | 39 | 30 | 26 | 4 | 37 | 0.87 | 0.41 | 0.56 |
| NA18943 | PINDEL | 70 | 60 | 32 | 28 | 31 | 0.53 | 0.51 | 0.52 |
| NA18943 | ScanIndel | 32 | 24 | 20 | 4 | 43 | 0.83 | 0.32 | 0.46 |
| NA18948 | IMSindel | 47 | 36 | 32 | 4 | 22 | 0.89 | 0.59 | 0.71 |
| NA18948 | GATK | 17 | 15 | 15 | 0 | 39 | 1.00 | 0.28 | 0.43 |
| NA18948 | PINDEL | 65 | 49 | 30 | 19 | 24 | 0.61 | 0.56 | 0.58 |
| NA18948 | ScanIndel | 40 | 27 | 19 | 8 | 35 | 0.70 | 0.35 | 0.39 |
| NA12878 | IMSindel | 17 | - | 16 | 1 | 8 | 0.94 | 0.67 | 0.78 |
| NA12878 | GATK | 15 | - | 7 | 8 | 17 | 0.47 | 0.29 | 0.36 |
| NA12878 | PINDEL | 22 | - | 14 | 8 | 10 | 0.64 | 0.58 | 0.61 |
| NA12878 | ScanIndel | 19 | - | 10 | 9 | 14 | 0.53 | 0.42 | 0.47 |
†The number of genotypes that could be examined using Sanger sequencing.
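The precision, recall and F-measure columns follow directly from the TP, FP and FN counts, with the F-measure being the harmonic mean of precision and recall. As a minimal sketch (the function name `precision_recall_f` is illustrative and not part of any of the compared tools), the IMSindel row for NA18943 can be recomputed like this:

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) from TP, FP and FN counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# IMSindel on NA18943: TP = 49, FP = 5, FN = 14 (values from the table).
p, r, f = precision_recall_f(49, 5, 14)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
# prints: precision=0.91 recall=0.78 F=0.84
```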
Figure 2. Time and peak memory used by four indel detection methods for NA18943.
Figure 3. Intermediate-size indels detected by the four methods. Venn diagrams show the overlap of the indels detected by IMSindel, GATK HaplotypeCaller, PINDEL and ScanIndel in NA18943 (a), NA18948 (b) and NA12878 (c). The number of indels detected by each method, categorized by size, is shown for NA18943 (d), NA18948 (e) and NA12878 (f).
Figure 4. Distribution of intermediate-size indels predicted by IMSindel. (a) The total number of deletions and insertions predicted in 478 WES datasets. The percentage of 12 functional groups in predicted deletions (b) and insertions (c). The numbers in parentheses indicate the number of predicted indels per sample.
Figure 5. Performance comparison for indel detection using simulated data. Indel sizes ranged from 100 bp to 1,000 bp in steps of 100 bp. Sequence reads were generated under several parameter settings: point mutation rate (0.001 and 0.005), read length (75 bp and 150 bp), and sequencing coverage (100× and 200×).
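The parameter values listed in the Figure 5 caption can be enumerated as a grid. The sketch below assumes that every combination of values was simulated, which the caption does not state explicitly; the dictionary keys are illustrative names, not options of any particular read simulator.

```python
from itertools import product

# Parameter values taken from the Figure 5 caption.
indel_sizes_bp = range(100, 1001, 100)   # 100, 200, ..., 1000
mutation_rates = (0.001, 0.005)
read_lengths_bp = (75, 150)
coverages = (100, 200)

# Assumption: a full cross of all parameter values.
settings = [
    {
        "indel_size_bp": size,
        "mutation_rate": rate,
        "read_length_bp": length,
        "coverage": cov,
    }
    for size, rate, length, cov in product(
        indel_sizes_bp, mutation_rates, read_lengths_bp, coverages
    )
]

print(len(settings))  # 10 * 2 * 2 * 2 = 80 combinations
```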