Takuya Moriyama1, Seiya Imoto2, Shuto Hayashi1, Yuichi Shiraishi3, Satoru Miyano1,2, Rui Yamaguchi1.
Abstract
MOTIVATION: Detection of somatic mutations from tumor and matched normal sequencing data has become one of the most important analysis methods in cancer research. Some existing mutation callers exploit additional information, e.g. heterozygous single-nucleotide polymorphisms (SNPs) near mutation candidates or overlapping paired-end read information. However, existing methods cannot take multiple information sources into account simultaneously. Existing Bayesian hierarchical model-based methods construct two generative models, the tumor model and the error model, and only limited information sources have been modeled.
Year: 2019 PMID: 30924874 PMCID: PMC6821361 DOI: 10.1093/bioinformatics/btz233
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. (a) The typical pattern of reads when heterozygous SNPs appear near the mutation candidate. (b) The typical pattern of paired-end reads when overlapping paired-end reads cover the mutation candidate. (c) The typical pattern of reads when strand bias appears in the variant-supporting reads
Fig. 2. Typical cases of error, shown in IGV screenshots. (a) In this case, both heterozygous SNPs near the mutation candidate appear in the variant-supporting reads. See the erroneous case in Figure 1a. (b) One corresponding paired-end read is highlighted with a red line. In this case, inconsistent bases in a paired-end read occur at the mutation candidate position. See the erroneous case in Figure 1b. Our method successfully evaluates these errors with low Bayes factor scores, i.e. 0.000059 in (a) and 0.0000011 in (b)
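The Bayes factor scores quoted in the Fig. 2 caption compare how well a tumor model and an error model explain the observed reads; values far below 1 favor the error model. The following is only a toy sketch of that ratio, assuming a simple beta-binomial model over variant read counts. It is not the OHVarfinDer model, and the function names and prior parameters are hypothetical choices made for illustration.

```python
from math import lgamma, exp

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal(k, n, a, b):
    """Log marginal likelihood of k variant reads out of n under a
    Beta(a, b) prior on the variant allele frequency.
    The binomial coefficient is omitted because it cancels in the ratio."""
    return log_beta(k + a, n - k + b) - log_beta(a, b)

def bayes_factor(k, n, tumor_prior=(1.0, 1.0), error_prior=(1.0, 100.0)):
    """Ratio of marginal likelihoods: tumor model over error model.
    Both priors are illustrative assumptions, not values from the paper."""
    log_bf = log_marginal(k, n, *tumor_prior) - log_marginal(k, n, *error_prior)
    return exp(log_bf)

# Many variant reads favor the toy tumor model; none favor the toy error model.
print(bayes_factor(10, 50) > 1.0)
print(bayes_factor(0, 50) < 1.0)
```

Working in log space and exponentiating only at the end keeps the ratio numerically stable when the individual likelihoods are very small.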
Fig. 3. Graphical model for partitioning-based integration of generative models. Where states the hypothesis
Fig. 4. (a) Graphical model of OHVarfinDer. (b) Ideal paired-end reads set in and corresponding proportion and for the tumor model and error model. and are error rate for overlapping reads, strand bias rate and haplotype frequency used in the error model of O(+)H(+). Characteristic information of heterozygous SNPs, overlapping paired-end reads and strand bias can be considered by setting the proportions and . Occurrence probabilities shown in black (red) letters are for (). The black-colored formulations on the left-hand side are based on the tumor model of the O(+)H(−) category. The black-colored formulations on the right-hand side and all the red-colored formulations are based on the error model of O(+)H(−), cf. the Supplementary Material A.2 and A.4
Simulation results summary (AUC)
| | Heterozygous SNPs | Overlap | Distance to SNP | μ | σ | OHVarfinDer | OVarCall | HapMuC | Fisher | #SNV | #Error |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | − | − | 500–5000 | 300 | 30 | | 0.750 | | 0.810 | 341 | 822 |
| 10 | − | − | 500–5000 | 300 | 30 | | 0.867 | 0.880 | | 713 | 871 |
| 20 | − | − | 500–5000 | 300 | 30 | 0.967 | 0.978 | 0.950 | | 896 | 872 |
| 5 | − | + | 500–5000 | 180 | 30 | | 0.917 | 0.786 | 0.817 | 407 | 1394 |
| 10 | − | + | 500–5000 | 180 | 30 | | 0.954 | 0.843 | 0.899 | 763 | 1413 |
| 20 | − | + | 500–5000 | 180 | 30 | 0.989 | | 0.947 | 0.988 | 897 | 1411 |
| 5 | + | − | 1–100 | 300 | 30 | 0.880 | 0.765 | | 0.825 | 301 | 851 |
| 10 | + | − | 1–100 | 300 | 30 | | 0.877 | 0.907 | 0.886 | 733 | 871 |
| 20 | + | − | 1–100 | 300 | 30 | | 0.984 | 0.977 | 0.983 | 896 | 925 |
| 5 | + | + | 1–100 | 180 | 30 | | 0.923 | 0.838 | 0.803 | 388 | 1356 |
| 10 | + | + | 1–100 | 180 | 30 | | 0.952 | 0.918 | 0.914 | 757 | 1398 |
| 20 | + | + | 1–100 | 180 | 30 | | 0.991 | 0.977 | 0.990 | 896 | 1354 |
The highest AUC values are written in italic letters.
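As a reading aid for the AUC columns: the AUC can be interpreted as the probability that a randomly chosen true SNV receives a higher score than a randomly chosen error candidate, with ties counted as one half. The sketch below implements that pairwise definition directly. The function name and the toy scores are hypothetical and are not taken from the paper or from any of the callers compared here.

```python
def pairwise_auc(true_scores, error_scores):
    """AUC as the fraction of (true, error) pairs ranked correctly.
    Ties contribute 0.5. Quadratic in the input sizes, which is fine
    for a small illustration."""
    if not true_scores or not error_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for t in true_scores:
        for e in error_scores:
            if t > e:
                wins += 1.0
            elif t == e:
                wins += 0.5
    return wins / (len(true_scores) * len(error_scores))

# Toy scores (assumed for illustration): three of four pairs are ordered correctly.
print(pairwise_auc([0.9, 0.4], [0.5, 0.1]))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of true variants from errors.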
Exome datasets summary (AUC)
| SNV/InDel | VAF | OVarCall | OHVarfinDer | HapMuC | Strelka | MuTect | VarScan2 | #SNV | #Error |
|---|---|---|---|---|---|---|---|---|---|
| SNV | 2–7% | 0.982 | | 0.965 | 0.933 | 0.875 | 0.625 | 52 | 2422 |
| SNV | ≥7% | 0.991 | 0.988 | 0.955 | | 0.994 | 0.900 | 184 | 1982 |
The highest AUC values are written in italic letters. VAF, variant allele frequency; SNV, single-nucleotide variant.
Real datasets summary whole genome (AUC)
| Sample | SNV/InDel | OVarCall | OHVarfinDer | HapMuC | Strelka | MuTect | VarScan2 | #SNV/InDel | #Error |
|---|---|---|---|---|---|---|---|---|---|
| HCC1143_n20t80 | SNV | 0.869 | | 0.827 | 0.873 | 0.848 | 0.801 | 10 618 | 2327 |
| HCC1143_n40t60 | SNV | 0.870 | | 0.824 | 0.877 | 0.855 | 0.799 | 8517 | 2049 |
| HCC1143_n60t40 | SNV | 0.884 | | 0.843 | 0.901 | 0.876 | 0.814 | 5450 | 1684 |
| HCC1143_n80t20 | SNV | 0.901 | | 0.870 | 0.938 | 0.918 | 0.830 | 1874 | 1451 |
| HCC1954_n20t80 | SNV | 0.882 | | 0.852 | 0.903 | 0.869 | 0.862 | 10 653 | 2854 |
| HCC1954_n40t60 | SNV | 0.893 | | 0.852 | 0.917 | 0.880 | 0.858 | 7969 | 2327 |
| HCC1954_n60t40 | SNV | 0.917 | | 0.865 | 0.937 | 0.905 | 0.852 | 4638 | 1770 |
| HCC1954_n80t20 | SNV | 0.941 | 0.970 | 0.880 | | 0.942 | 0.848 | 1389 | 1404 |
| Total | SNV | 0.895 | | 0.860 | 0.913 | 0.886 | 0.852 | 51 108 | 15 866 |
| HCC1143_n20t80 | InDel | 0.707 | | 0.678 | 0.713 | — | 0.722 | 926 | 4951 |
| HCC1143_n40t60 | InDel | 0.733 | | 0.700 | 0.755 | — | 0.748 | 617 | 4761 |
| HCC1143_n60t40 | InDel | 0.760 | | 0.723 | 0.784 | — | 0.778 | 328 | 4563 |
| HCC1143_n80t20 | InDel | 0.809 | | 0.770 | 0.816 | — | 0.800 | 94 | 4899 |
| HCC1954_n20t80 | InDel | 0.800 | | 0.771 | 0.822 | — | 0.825 | 1771 | 5219 |
| HCC1954_n40t60 | InDel | 0.821 | | 0.778 | 0.843 | — | 0.835 | 1172 | 5215 |
| HCC1954_n60t40 | InDel | 0.819 | | 0.770 | 0.848 | — | 0.831 | 607 | 5200 |
| HCC1954_n80t20 | InDel | 0.815 | | 0.777 | 0.864 | — | 0.823 | 159 | 5053 |
| Total | InDel | 0.777 | | 0.774 | 0.794 | — | 0.792 | 5674 | 39 861 |
The highest AUC values are written in italic letters.