Martin Hunt, Brice Letcher, Kerri M Malone, Giang Nguyen, Michael B Hall, Rachel M Colquhoun, Leandro Lima, Michael C Schatz, Srividya Ramakrishnan, Zamin Iqbal.
Abstract
There are many short-read variant-calling tools, with different strengths and weaknesses. We present a tool, Minos, which combines outputs from arbitrary variant callers, increasing recall without loss of precision. We benchmark on 62 samples from three bacterial species and an outbreak of 385 Mycobacterium tuberculosis samples. Minos also enables joint genotyping; we demonstrate on a large (N=13k) M. tuberculosis cohort, building a map of non-synonymous SNPs and indels in a region where all such variants are assumed to cause rifampicin resistance. We quantify the correlation with phenotypic resistance and then replicate in a second cohort (N=10k).
Year: 2022 PMID: 35791022 PMCID: PMC9254434 DOI: 10.1186/s13059-022-02714-x
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 17.906
Fig. 1 Variant adjudication pipeline implemented by Minos. Input variants in one or more VCF file(s) are merged to make a deduplicated set of variants. When running on a single sample, the input VCF files could be from different tools. When joint genotyping across samples, there is one VCF file originating from each sample. Next, overlapping variants are clustered together (for example, the variants at positions 7 and 8), allowing the construction of a non-nested variation graph. Genotype calls are made using read mapping to the graph.
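The merge and clustering steps described in the caption can be illustrated with a minimal sketch. This is not the Minos implementation: the `Variant` type, the `merge_and_cluster` function, and the overlap rule (two variants belong to the same site when their reference spans share a position) are assumptions made for illustration.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    pos: int   # 1-based start position on the reference
    ref: str   # reference allele
    alt: str   # alternate allele


def merge_and_cluster(call_sets: list[list[Variant]]) -> list[list[Variant]]:
    """Deduplicate variants from several call sets, then group overlapping ones."""
    # A set removes exact duplicates reported by more than one caller or sample.
    unique = sorted({v for calls in call_sets for v in calls})

    clusters: list[list[Variant]] = []
    cluster_end = 0  # last reference position covered by the current cluster
    for v in unique:
        v_end = v.pos + len(v.ref) - 1
        if clusters and v.pos <= cluster_end:
            # Reference span overlaps the current cluster: same site.
            clusters[-1].append(v)
            cluster_end = max(cluster_end, v_end)
        else:
            clusters.append([v])
            cluster_end = v_end
    return clusters


# Hypothetical calls from two tools on one sample.
caller_a = [Variant(3, "A", "G"), Variant(7, "CT", "C")]
caller_b = [Variant(3, "A", "G"), Variant(8, "T", "A")]
for site in merge_and_cluster([caller_a, caller_b]):
    print(site)
```

In this toy input the shared call at position 3 is kept once, and the variants at positions 7 and 8 fall into one cluster because the deletion at position 7 spans position 8.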
Mean precision, recall, and F-score on each empirical data set with each reference genome. Numbers in bold show the best precision, recall, and F-score for each species and reference genome
| Species | Number of samples | Reference genome | Tool | Mean precision | Mean recall | Mean F-score |
|---|---|---|---|---|---|---|
| M. tuberculosis | 17 | H37Rv | BayesTyper | 0.9995 | | |
| M. tuberculosis | 17 | H37Rv | GraphTyper | | 0.8938 | 0.9422 |
| M. tuberculosis | 17 | H37Rv | Minos | | 0.9181 | 0.9559 |
| S. aureus | 28 | TW20 | BayesTyper | 0.9984 | 0.8669 | 0.9279 |
| S. aureus | 28 | TW20 | GraphTyper | | 0.7530 | 0.8545 |
| S. aureus | 28 | TW20 | Minos | 0.9988 | | |
| S. aureus | 28 | USA300 | BayesTyper | 0.9993 | 0.8671 | 0.9283 |
| S. aureus | 28 | USA300 | GraphTyper | | 0.7506 | 0.8534 |
| S. aureus | 28 | USA300 | Minos | 0.9994 | | |
| | 17 | GCF_000784945.1 | BayesTyper | 0.9990 | 0.9052 | 0.9495 |
| | 17 | GCF_000784945.1 | GraphTyper | | 0.9063 | 0.9505 |
| | 17 | GCF_000784945.1 | Minos | | | |
| | 17 | GCF_001952915.1 | BayesTyper | 0.9995 | 0.8800 | 0.9346 |
| | 17 | GCF_001952915.1 | GraphTyper | | 0.8788 | 0.9340 |
| | 17 | GCF_001952915.1 | Minos | 0.9994 | | |
| | 17 | GCF_003073315.1 | BayesTyper | 0.9996 | 0.9267 | 0.9617 |
| | 17 | GCF_003073315.1 | GraphTyper | | 0.9297 | 0.9634 |
| | 17 | GCF_003073315.1 | Minos | 0.9998 | | |
| | 17 | GCF_003076555.1 | BayesTyper | 0.9994 | 0.9397 | 0.9686 |
| | 17 | GCF_003076555.1 | GraphTyper | | 0.9387 | 0.9683 |
| | 17 | GCF_003076555.1 | Minos | | | |
| | 17 | GCF_011006575.1 | BayesTyper | 0.9995 | 0.9078 | 0.9511 |
| | 17 | GCF_011006575.1 | GraphTyper | | 0.9075 | 0.9513 |
| | 17 | GCF_011006575.1 | Minos | 0.9998 | | |
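As a rough guide to how the three columns relate, the sketch below computes an F-score from a precision and a recall. It assumes the F-score is the usual harmonic mean of the two (F1); the record itself does not state the formula, and `f_score` is a name chosen for illustration.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# BayesTyper with the TW20 reference: mean precision and mean recall from the table.
approx = f_score(0.9984, 0.8669)
print(f"{approx:.4f}")
```

Under that assumption the result is close to, but not necessarily identical with, the tabulated mean F-score of 0.9279, because an average of per-sample F-scores generally differs slightly from the F-score of the averaged precision and recall.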
Summary of M. tuberculosis data sets used for joint genotyping. “Genome inside sites” is the total length of all reference alleles across all sites after clustering. It is reported as the total number of base pairs and, in parentheses, as a percentage of the 4.4 Mbp H37Rv reference genome. “SNP sites” is the number of sites where all alleles have length 1.
| Data set | Number of samples | Unique variants | Excluded variants | Sites after clustering | Genome inside sites, bp (%) | Total alleles | SNP sites |
|---|---|---|---|---|---|---|---|
| Walker 2013 | 385 | 31,548 | 231 | 30,621 | 41,437 (1%) | 62,690 | 27,639 |
| Mykrobe | 13,411 | 699,484 | 6,259 | 593,584 | 756,003 (17%) | 1,414,723 | 552,543 |
| CRyPTIC | 15,215 | 718,863 | 6,576 | 611,269 | 778,949 (18%) | 1,469,100 | 568,224 |
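The column definitions in the caption can be made concrete with a short sketch. It assumes each site is represented as a list of allele strings with the reference allele first, that “Total alleles” counts the reference allele too, and that the genome length is exactly 4,400,000 bp (the caption gives 4.4 Mbp). The function name `summarise_sites` and the toy sites are hypothetical.

```python
H37RV_LENGTH = 4_400_000  # approximate genome length from the caption (4.4 Mbp)


def summarise_sites(sites: list[list[str]]) -> dict[str, float]:
    """Each site is a list of alleles, with the reference allele first."""
    genome_inside = sum(len(alleles[0]) for alleles in sites)
    return {
        "sites": len(sites),
        "genome_inside_bp": genome_inside,
        "genome_inside_pct": 100 * genome_inside / H37RV_LENGTH,
        "total_alleles": sum(len(alleles) for alleles in sites),
        # A SNP site is one where every allele has length 1.
        "snp_sites": sum(all(len(a) == 1 for a in alleles) for alleles in sites),
    }


# Hypothetical toy sites: two SNP sites and one site with a 2 bp reference allele.
toy_sites = [["A", "G"], ["CT", "C", "AT"], ["G", "T", "C"]]
print(summarise_sites(toy_sites))

# The percentages in the table follow from the base-pair counts.
for name, bp in [("Walker 2013", 41_437), ("Mykrobe", 756_003), ("CRyPTIC", 778_949)]:
    print(name, round(100 * bp / H37RV_LENGTH))
```

With the assumed genome length, the rounded percentages reproduce the 1%, 17%, and 18% given in the table.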
Fig. 2 Precision and recall when joint genotyping M. tuberculosis outbreak data. The left plot considers non-reference allele calls only, i.e., the variant sites that are genotyped to be different from the reference genome. The right plot shows the results when all allele calls are included. Individual samples are marked as dots, and the mean precision and recall for each tool is shown as a cross. The convex hull of the data points for each caller is shaded with an associated color.
Fig. 3 All amino acid variants identified in the RRDR of the rpoB gene by joint genotyping 8,955 samples from the CRyPTIC M. tuberculosis data set. Each plot shows the RRDR region from left to right. Single amino acid variants are shown in the upper grid, with the y axis corresponding to the variant amino acid. The lower area shows deletions and insertions, with the inserted sequence given in the colored boxes. For example, the leftmost deletion of amino acids TS at positions 427-428 is found in one sample, which is resistant. The leftmost insertion adds R after the S at position 431 (found in one resistant sample). The two plots show the same variants with different color schemes. In plot a, each variant is colored by the number of samples possessing that variant. In plot b, each variant is colored by the percentage of samples with that variant that are rifampicin resistant.
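The two quantities used for coloring in plots a and b can be computed per variant as in the sketch below. The input layout (one tuple per sample holding its variant labels and a resistance flag), the function name `summarise_variants`, and the labels `var_A` and `var_B` are hypothetical and do not come from the data set.

```python
from collections import defaultdict


def summarise_variants(
    samples: list[tuple[list[str], bool]],
) -> dict[str, tuple[int, float]]:
    """Map each variant to (number of samples, percent of those that are resistant)."""
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [samples, resistant]
    for variants, resistant in samples:
        for variant in set(variants):  # count each variant once per sample
            counts[variant][0] += 1
            counts[variant][1] += int(resistant)
    return {v: (n, 100 * r / n) for v, (n, r) in counts.items()}


# Hypothetical toy data: (variants called in the sample, resistant phenotype?)
toy = [
    (["var_A"], True),
    (["var_A"], True),
    (["var_B"], False),
    (["var_B"], True),
]
print(summarise_variants(toy))
```

The first element of each tuple corresponds to the coloring in plot a and the second to the coloring in plot b.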