| Literature DB >> 32398743 |
Viola Dreyer1, Christian Utpatel1, Thomas A Kohl1, Ivan Barilar1, Matthias I Gröschel1, Silke Feuerriegel1,2, Stefan Niemann3,4.
Abstract
Accurate drug resistance detection is key for guiding effective tuberculosis treatment. While genotypic resistance can be rapidly detected by molecular methods, their application is challenged by mixed mycobacterial populations comprising both susceptible and resistant cells (heteroresistance). For this, next-generation sequencing (NGS) based approaches promise the determination of variants even at low frequencies. However, accurate methods for a valid detection of low-frequency variants in NGS data are currently lacking. To tackle this problem, we developed the variant detection tool binoSNP which allows the determination of low-frequency single nucleotide polymorphisms (SNPs) in NGS datasets from Mycobacterium tuberculosis complex (MTBC) strains. By taking a reference-mapped file as input, binoSNP evaluates each genomic position of interest using a binomial test procedure. binoSNP was validated using in-silico, in-vitro, and serial patient isolates datasets comprising varying genomic coverage depths (100-500×) and SNP allele frequencies (1-30%). Overall, the detection limit for low-frequency SNPs depends on the combination of coverage depth and allele frequency of the resistance-associated mutation. binoSNP allows for valid detection of resistance associated SNPs at a 1% frequency with a coverage ≥400×. In conclusion, binoSNP provides a valid approach to detect low-frequency resistance-mediating SNPs in NGS data from clinical MTBC strains. It can be implemented in automated, end-user friendly analysis tools for NGS data and is a step forward towards individualized TB therapy.Entities:
Mesh:
Substances:
Year: 2020 PMID: 32398743 PMCID: PMC7217866 DOI: 10.1038/s41598-020-64708-8
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1Schematic overview of the binoSNP workflow. binoSNP accepts a preprocessed BAM-file where ideally duplicates (PCR artefacts) have been removed and base quality scores have been recalibrated. Additionally, the script requires an interval list where the positions to be examined are named as well as a RefAlt-table defining reference and the alternative allele for those positions. As a first step the bam-readcount algorithm from Larson[29] is executed to extract information about the number and quality of reference and alternative alleles at the positions named in the interval list and stores this information in a text file. In a second step the resulting txt-file is read into R and for each position a p-value is calculated by using the binomial test procedure. In the next step a table is produced containing all information including the calculated p-value for each position named in the interval list. The last step applies the user-defined p-value, e.g. report variants with a p-value below 5% (standard value for statistical significance)
Figure 2Minimal detectable allele frequency. The coverage is displayed on the x-axis and the minimum frequency of alternative alleles which is needed to get a significant result (here p-value < 0.05) is shown on the y-axis. For the simulation an error probability of 0.00326 (base quality score Q25, which can be transformed to an error probability of 0.00316 + Illumina sequencing error of 0.01%) has been assumed. Calculation was done with R. The red line illustrates that with a coverage of 400× the minimal detectable frequency of alternative alleles is 1%.
In-silico datasets and included resistance-mediating mutations.
| Dataset | Mutation |
|---|---|
| MDR1 | |
| MDR2 | |
| MDR3 | |
| XDR1 | |
| XDR2 | |
| XDR3 | |
| PZA1 | |
| PZA2 | |
| EMB1 | |
| EMB2 | |
| EMB3 | |
| SM1 | |
| SM2 | rpsL Lys88Arg |
| RMP1 | |
| RMP2 | |
| INH1 | |
| INH2 | |
| FQ1 | |
| FQ2 | |
| FQ3 |
Figure 3p-value distribution of in-silico dataset positions. Scatterplot of calculated p-values for all positions with at least one alternative allele (n = 6870) divided by the status no SNP (n = 5955)/SNP (n = 915). The red line marks the critical p-value of 5%. (a) SNPs colored by the prediction type. FN – false negative, FP – false positive, TN – true negative and TP – true positive. The majority of positions was correctly classified, only 44 positions were false positive and 69 were false negative. (b) Colored by the theoretical coverage observed at a position; 56% of the wrongly classified positions had 100x coverage. (c) Colored by the theoretical frequency of the alternative allele at a position. Only positions with less than 5% frequency were wrongly classified.
Figure 4Performance measurement of separating SNPs and artefacts using p-value by. a Receiver Operating Characteristics (ROC) curve. ROC curve based on p-values with false positive rate (FPR) on the x-axis and true positive rate (TPR) on the y-axis. The area under the curve (AUC) equals 0.9906 (99.06%), which means it has near optimal measure of separability. The black dashed line marks the cut-offs for . The theoretical optimal threshold for a p-value to distinguish between true SNPs and sequencing errors based on the ROC curve was calculated as and is shown as red dashed line
Analysis results for in-vitro datasets.
| Mixture | Position | REF | ALT | DP | Freq (ALT) | Q (ALT) | p-value | AA exchange |
|---|---|---|---|---|---|---|---|---|
| RpoB526_1 | 761140 | A | C | 394 | 0.0076 | 36.67 | 5.99 × 10−04 | RpoB His526Pro |
| RpoB526_5 | 761140 | A | C | 443 | 0.0542 | 35.42 | 5.67 × 10−35 | RpoB His526Pro |
| RpoB526_10 | 761140 | A | C | 586 | 0.0819 | 35.58 | 1.14 × 10−82 | RpoBHis526Pro |
| RpoB531_1 | 761155 | C | T | 492 | 0.0203 | 32.80 | 2.1 × 10−06 | RpoB Ser531Leu |
| RpoB531_5 | 761155 | C | T | 656 | 0.0503 | 31.79 | 4.52 × 10−53 | RpoB Ser531Leu |
| RpoB531_10 | 761155 | C | T | 582 | 0.1031 | 31.52 | 3.27 × 10−118 | RpoB Ser531Leu |
Abb.: REF - Reference allele, ALT – Alternative allele, DP – Coverage/read depth, Freq – Frequency, Q – Base quality value, AA – Amino acid.