Daichi Shigemizu, Akihiro Fujimoto, Shintaro Akiyama, Tetsuo Abe, Kaoru Nakano, Keith A Boroevich, Yujiro Yamamoto, Mayuko Furuta, Michiaki Kubo, Hidewaki Nakagawa, Tatsuhiko Tsunoda.
Abstract
The recent development of massively parallel sequencing technology has allowed the creation of comprehensive catalogs of genetic variation. However, due to the relatively high sequencing error rate for short read sequence data, sophisticated analysis methods are required to obtain high-quality variant calls. Here, we developed a probabilistic multinomial method for the detection of single nucleotide variants (SNVs) as well as short insertions and deletions (indels) in whole genome sequencing (WGS) and whole exome sequencing (WES) data for single sample calling. Evaluation with DNA genotyping arrays revealed a concordance rate of 99.98% for WGS calls and 99.99% for WES calls. Sanger sequencing of the discordant calls determined the false positive and false negative rates for the WGS (0.0068% and 0.17%) and WES (0.0036% and 0.0084%) datasets. Furthermore, short indels were identified with high accuracy (WGS: 94.7%, WES: 97.3%). We believe our method can contribute to the greater understanding of human diseases.
Year: 2013 PMID: 23831772 PMCID: PMC3703611 DOI: 10.1038/srep02161
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Read depth per nucleotide and GC content.
(a) Distribution of read depth in WGS and WES on-target regions. (b) Distribution of GC content of WES on-target regions.
Estimation of the accuracy of VCMM using SNP genotyping platforms
| Category | Genotyping array | WGS or WES | WGS: before Sanger sequencing validation | WGS: after Sanger sequencing validation | WES: before Sanger sequencing validation | WES: after Sanger sequencing validation |
|---|---|---|---|---|---|---|
| Not analyzed | - | - | 1,126 | - | 3,083 | - |
| Concordance | No-Ref-Ho | - | 137,786 | - | 2,893 | - |
| Concordance | Ref-Ho | - | 326,125 | - | 183,411 | - |
| Concordance | Ht | - | 177,949 | - | 3,843 | - |
| Concordance | Total | - | 641,860 | - | 190,147 | - |
| False positive | Ref-Ho | No-Ref-Ho | 5 | 2 (2) | 0 | 0 |
| False positive | Ref-Ho | Ht | 37 | 16 (15) | 11 | 6 (3) |
| False positive | No-Ref-Ho | Ht | 31 | 17 (16) | 1 | 1 (0) |
| False positive | Ht | Ht (Different genotype) | 25 | 9 (9) | 2 | 0 (0) |
| False positive | Total | | 98 | 44 (42) | 14 | 7 (3) |
| False negative | Ht | Ho | 850 | 850 | 23 | 13 (8) |
| False negative | No-Ref-Ho | Ref-Ho | 233 | 233 | 13 | 3 (1) |
| False negative | Total | | 1,083 | 1,083 | 36 | 16 (9) |
†: No-Ref-Ho, non-reference homozygous genotype; Ref-Ho, reference homozygous genotype; Ht, heterozygous genotype.
*: Numbers in parentheses are the numbers of SNPs that could not be amplified by PCR.
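The false positive and false negative rates quoted in the abstract can be approximately reproduced from the counts in this table. The sketch below assumes each rate is the error count after Sanger validation divided by all array SNPs compared before validation (concordant + false positive + false negative). That denominator is inferred from the numbers rather than stated in the table, and the function name `error_rates` is illustrative only.

```python
def error_rates(concordant, fp_before, fn_before, fp_after, fn_after):
    """Return (FP %, FN %) after Sanger validation.

    Assumption: the denominator is every array SNP that was compared,
    i.e. concordant + false positive + false negative before validation.
    """
    analyzed = concordant + fp_before + fn_before
    return 100 * fp_after / analyzed, 100 * fn_after / analyzed


# Counts taken from the "Total" rows of the accuracy table.
wgs_fp, wgs_fn = error_rates(641_860, 98, 1_083, 44, 1_083)
wes_fp, wes_fn = error_rates(190_147, 14, 36, 7, 16)

print(f"WGS: FP {wgs_fp:.4f}%  FN {wgs_fn:.4f}%")
print(f"WES: FP {wes_fp:.4f}%  FN {wes_fn:.4f}%")
```

Under this assumption the results agree with the abstract to within 0.0001 percentage points for the false positive rates and for the WES false negative rate, and with the WGS false negative rate (0.17%) at its two quoted decimals. The WES false positive rate comes out slightly above the quoted 0.0036%.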
Number of identified SNVs and indels
| Number | WGS | WES |
|---|---|---|
| Total SNVs | 3,406,875 | 79,060 |
| Total indels | 763,944 (106,732) | 10,999 |
| Total SNVs in splice sites | 105 | 56 |
| Total SNVs in coding region | 20,314 | 19,861 |
| Missense | 9,502 | 9,360 |
| Nonsense | 109 | 83 |
| Synonymous | 10,703 | 10,418 |
| Total indels in coding region | 461 | 509 |
*: For WGS, the number of indels in all regions is shown, followed in parentheses by the number in non-repeat regions.
Figure 2. Common indels identified by VCMM, GATK and SAMtools.
(a) SNV in WGS. SNVs in repeat regions and unknown contigs were not used for the comparison. (b) Indel in WGS. Indels in repeat regions and unknown contigs were not used for the comparison. (c) SNV in WES. (d) Coding indel in WES.
Comparison of VCMM with other methods using SNP genotyping platforms
| Dataset | Chip | Category | VCMM (number) | GATK (number) | SAMtools (number) | VCMM (%) | GATK (%) | SAMtools (%) |
|---|---|---|---|---|---|---|---|---|
| WGS | OmniExpress BeadChip | Concordant | 641,860 | 641,538 | 639,112 | 99.816 | 99.766 | 99.389 |
| WGS | OmniExpress BeadChip | FN | 1,083 | 1,366 | 3,832 | 0.168 | 0.212 | 0.595 |
| WGS | OmniExpress BeadChip | FP | 98 | 137 | 97 | 0.015 | 0.021 | 0.015 |
| WES | Exome BeadChip | Concordant | 190,147 | 190,137 | 189,825 | 99.974 | 99.968 | 99.804 |
| WES | Exome BeadChip | FN | 36 | 46 | 361 | 0.019 | 0.024 | 0.190 |
| WES | Exome BeadChip | FP | 14 | 14 | 11 | 0.007 | 0.007 | 0.006 |
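The proportion columns appear to be each count divided by the sum of concordant, FN and FP calls for that caller. The sketch below recomputes them from the counts in the table. The denominator is an assumption inferred from the figures, and the names `counts` and `proportions` are illustrative.

```python
# (concordant, FN, FP) counts copied from the comparison table.
counts = {
    ("WGS", "VCMM"): (641_860, 1_083, 98),
    ("WGS", "GATK"): (641_538, 1_366, 137),
    ("WGS", "SAMtools"): (639_112, 3_832, 97),
    ("WES", "VCMM"): (190_147, 36, 14),
    ("WES", "GATK"): (190_137, 46, 14),
    ("WES", "SAMtools"): (189_825, 361, 11),
}


def proportions(concordant, fn, fp):
    """Return (concordant %, FN %, FP %) of all compared SNPs.

    Assumption: the denominator is concordant + FN + FP.
    """
    total = concordant + fn + fp
    return tuple(100 * n / total for n in (concordant, fn, fp))


for (dataset, caller), row in counts.items():
    conc, fn, fp = proportions(*row)
    print(f"{dataset} {caller:9s} concordant {conc:.3f}%  FN {fn:.3f}%  FP {fp:.3f}%")
```

Within each dataset the three callers sum to the same total (643,041 SNPs for WGS and 190,197 for WES), so the percentages are directly comparable across callers. The recomputed values agree with the table to within 0.001 percentage points.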