| Literature DB >> 31099741 |
Dai Yoshimura1, Rei Kajitani1, Yasuhiro Gotoh2, Katsuyuki Katahira2, Miki Okuno1, Yoshitoshi Ogura2, Tetsuya Hayashi2, Takehiko Itoh1.
Abstract
Bacteria are highly diverse, even within a species; thus, there have been many studies which classify a single species into multiple types and analyze the genetic differences between them. Recently, the use of whole-genome sequencing (WGS) has been popular for these analyses, and the identification of single-nucleotide polymorphisms (SNPs) between isolates is the most basic analysis performed following WGS. The performance of SNP-calling methods therefore has a significant effect on the accuracy of downstream analyses, such as phylogenetic tree inference. In particular, when closely related isolates are analyzed, e.g. in outbreak investigations, some SNP callers tend to detect a high number of false-positive SNPs compared with the limited number of true SNPs among isolates. However, the performances of various SNP callers in such a situation have not been validated sufficiently. Here, we show the results of realistic benchmarks of commonly used SNP callers, revealing that some of them exhibit markedly low accuracy when target isolates are closely related. As an alternative, we developed a novel pipeline BactSNP, which utilizes both assembly and mapping information and is capable of highly accurate and sensitive SNP calling in a single step. BactSNP is also able to call SNPs among isolates when the reference genome is a draft one or even when the user does not input the reference genome. BactSNP is available at https://github.com/IEkAdN/BactSNP.Entities:
Keywords: SNP calling; bacterial comparative genomics; benchmark; molecular epidemiology; whole genome sequencing
Mesh:
Year: 2019 PMID: 31099741 PMCID: PMC6562250 DOI: 10.1099/mgen.0.000261
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
Fig. 1.Conceptual diagram of analysis situation simulated in benchmark. Reference and root sequence pairs exhibiting approximately 97 %, 98 %, 99 % and 99.9 % identity were chosen from complete genomes in the NCBI database and the genomes of 10 virtual isolates and their reads were simulated.
Benchmarks using simulated sequence data
Public complete genomes of three species, including (a) , (b) and (c) , were used for the benchmark. Identity, sequence identity between the root and reference isolates; positive predictive value (PPV), ratio of true-positive detected SNP sites to all detected SNP sites; Sensitivity, ratio of true-positive detected SNP sites to all true SNP sites; Called-sites, ratio of sites where a nucleotide was determined unambiguously in all isolates to the reference-genome size. All stats are averaged values among ten duplications. Called-sites ratio was not calculated for Cortex and CFSAN and was represented as ‘–’, because they do not output information for non-variant regions.
| (a) | Identity (%) | Cortex | Freebayes | GATK | SAMtools | VarScan | Snippy | CFSAN | NASP | PHEnix | BactSNP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PPV (%) | 99.9 | 99.07 | 74.35 | 73.04 | 93.36 | 96.27 | 58.05 | 99.78 | 100.00 | 99.94 | 100.00 |
| 99 | 98.16 | 38.69 | 26.75 | 63.68 | 78.22 | 7.13 | 95.45 | 100.00 | 99.89 | 100.00 | |
| 98 | 98.59 | 31.71 | 25.18 | 53.33 | 72.04 | 3.31 | 96.22 | 100.00 | 99.67 | 100.00 | |
| 97 | 98.37 | 27.55 | 21.17 | 46.73 | 71.34 | 2.13 | 95.34 | 99.94 | 99.44 | 100.00 | |
| Common-region | 99.9 | 100.00 | 100.00 | 98.79 | 99.91 | 100.00 | 94.24 | 100.00 | 100.00 | 100.00 | 100.00 |
| 99 | 99.89 | 98.05 | 93.55 | 98.76 | 98.50 | 48.47 | 100.00 | 100.00 | 100.00 | 100.00 | |
| 98 | 100.00 | 98.46 | 95.34 | 97.63 | 99.48 | 33.57 | 100.00 | 100.00 | 100.00 | 100.00 | |
| 97 | 100.00 | 99.17 | 89.87 | 96.88 | 98.60 | 24.31 | 99.84 | 100.00 | 100.00 | 100.00 | |
| Sensitivity (%) | 99.9 | 95.37 | 99.15 | 99.71 | 99.83 | 99.39 | 99.66 | 99.04 | 97.81 | 99.83 | 99.55 |
| 99 | 85.63 | 94.83 | 99.56 | 99.56 | 99.45 | 98.62 | 97.29 | 97.79 | 99.45 | 99.06 | |
| 98 | 77.55 | 87.97 | 98.04 | 98.99 | 98.59 | 96.98 | 90.85 | 96.74 | 98.76 | 98.54 | |
| 97 | 73.24 | 81.09 | 97.60 | 98.82 | 97.60 | 95.42 | 81.25 | 94.97 | 98.49 | 97.71 | |
| Called-sites (%) | 99.9 | – | 88.92 | 95.36 | 64.40 | 95.96 | 94.19 | – | 91.91 | 94.78 | 93.69 |
| 99 | – | 84.31 | 90.11 | 61.24 | 89.88 | 87.80 | – | 87.16 | 89.37 | 88.25 | |
| 98 | – | 84.96 | 90.40 | 61.86 | 89.97 | 86.69 | – | 87.27 | 89.42 | 88.30 | |
| 97 | – | 84.38 | 89.65 | 61.71 | 88.90 | 84.66 | – | 86.39 | 88.39 | 87.12 |
Fig. 2.Benchmarks using simulated sequence data. PPV and Sensitivity in Table 1(a) were represented graphically (Those in Table 1(b) and (c) are represented in Fig. S1). The values 7, 8, 9 and 9.9 in the graph represent 97, 98, 99 and 99.9 % identity between the reference-root pair, respectively. (a) PPV and sensitivity of SNP callers that exhibited low PPV (<99) in at least one identity. (b) PPV and sensitivity of SNP callers that exhibited high PPV (≥99) in all identities.
Ratio of false-positive SNP sites in ‘soft-clip regions’ to all false-positive SNP sites
Public complete genomes of three species including (a) , (b) and (c) were used for the benchmarks. The definition of ‘soft-clip region’ is described in Fig. S2. In the middle nine columns, the numerator of the fraction denotes the number of false-positive SNP sites in ‘soft-clip regions’; the denominator, the number of all false-positive SNP sites; the number in parentheses, the value of the fraction. In the rightmost column, the numerator of the fraction denotes the total size of ‘soft-clip regions’; the denominator, the reference-genome size; the value in parentheses, the value of the fraction. This table is based on the results for the first reference-root pair among ten pairs in each species and identity (Table S1). Results for BactSNP were not shown, because they did not detect any false-positives.
| (a) | Identity (%) | Cortex | Freebayes | GATK | SAMtools | VarScan | Snippy | CFSAN | PHEnix | NASP | Soft-clip region |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 99.9 | 0/0 | 31/33 | 40/42 | 3/3 | 1/1 | 143/222 | 2/2 | 0/0 | 0/0 | 62,258/2,824,404 | |
| (–) | (93.00) | (95.00) | (100.00) | (100.00) | (64.00) | (100.00) | (–) | (–) | (2.20) | ||
| 99 | 0/6 | 264/280 | 497/514 | 70/73 | 35/35 | 1,739/2198 | 6/6 | 0/0 | 0/0 | 282,639/2,742,807 | |
| (0.00) | (94.00) | (96.00) | (95.00) | (100.00) | (79.00) | (100.00) | (–) | (–) | (10.30) | ||
| 98 | 0/0 | 380/402 | 602/624 | 127/139 | 46/52 | 4,543/5262 | 4/4 | 0/0 | 0/0 | 545,715/3,046,545 | |
| (–) | (94.00) | (96.00) | (91.00) | (88.00) | (86.00) | (100.00) | (–) | (–) | (17.91) | ||
| 97 | 2/8 | 226/233 | 466/475 | 125/129 | 49/49 | 7,211/8146 | 8/9 | 3/3 | 0/0 | 754,175/2,778,079 | |
| (25.00) | (96.00) | (98.00) | (96.00) | (100.00) | (88.00) | (88.00) | (100.00) | (–) | (27.14) |
Fig. 3.Number of detected SNP sites in real sequence data analysis. The relationship between the identity between the reference isolate and the target isolates and the number of detected SNP sites among the target isolates is shown. (a) SNP callers that exhibited relatively low PPV (<95) in at least one identity. (b) SNP callers that exhibited relatively high PPV (≥95) in all identities.