Literature DB >> 31815935

Computational pan-genome mapping and pairwise SNP-distance improve detection of Mycobacterium tuberculosis transmission clusters.

Christine Jandrasits¹, Stefan Kröger², Walter Haas², Bernhard Y Renard¹.

Abstract

Next-generation sequencing based base-by-base distance measures have become an integral complement to epidemiological investigation of infectious disease outbreaks. This study introduces PANPASCO, a computational pan-genome mapping based, pairwise distance method that is highly sensitive to differences between cases, even when located in regions of lineage specific reference genomes. We show that our approach is superior to previously published methods in several datasets and across different Mycobacterium tuberculosis lineages, as its characteristics allow the comparison of a high number of diverse samples in one analysis-a scenario that becomes more and more likely with the increased usage of whole-genome sequencing in transmission surveillance.

Entities: Chemical Disease Gene Species

Year: 2019 PMID： 31815935 PMCID： PMC6922483 DOI： 10.1371/journal.pcbi.1007527

Source DB: PubMed Journal: PLoS Comput Biol ISSN： 1553-734X Impact factor: 4.475

Introduction

Genotyping and sequencing methods have revolutionized infectious disease surveillance. So called molecular surveillance—molecular data in combination with classical epidemiological data—allows the investigation of the transmission of disease within the population and the sensitive detection of outbreaks. The employed methods shifted from fingerprinting, e.g. variable number tandem repeat (VNTR) methods, and sequence-based genotyping assays, such as bacterial multilocus sequence typing (MLST) to next-generation sequencing (NGS) based whole genome sequencing (WGS) in recent years [1]. WGS gives access to all genetic information and enables studies on phylogeny, geographical spread of lineages, strain-specific differences, virulence and drug resistance [2, 3]. WGS allows for the comparison of pathogens on the level of single nucleotide polymorphisms (SNP) and thus is particularly useful for the transmission analysis of stable genomes with low mutation rates such as Mycobacterium tuberculosis. Tuberculosis (TB) is one of the oldest communicable diseases in humankind and can be dated back to 8000 BCE [4]. Although Robert Koch’s characterization of the bacteria is known for more than hundred years, no effective way to eliminate TB has been found and M. tuberculosis is still one of the deadliest pathogens worldwide. Between 2000 and 2016 TB caused 53 million deaths. The WHO estimates more than 1.7 million deaths (including HIV coinfection) and 10.4 million new TB infections in 2016 only [5]. Since available vaccination against TB is merely partly effective, recommended only for children and in high burden settings [6], breaking transmission chains and successful treatment is needed to prevent new infections and decrease the spread to finally eliminate TB, which is the global goal to reach before 2050 [7]. Here, multi-resistant (MDR-)TB is of special interest with about half a million new cases in 2017. MDR-TB alone is responsible for one third of all antimicrobial resistance deaths worldwide [8] and part of “WHO’s Ten threats to global health in 2019” list in the context of antimicrobial resistance [9]. In recent outbreak investigations WGS has been an indispensable tool of outbreak detection and transmission analysis and is replacing genotyping (mycobacterial interspersed repetitive unit—variable number tandem repeat, MIRU-VNTR) as the method of choice in low incident countries such as much of West-Europe [10-12]. WGS can validate participation of individuals to a common transmission event and thus associate TB cases that do not have a clear epidemiological link or vice versa [10, 13, 14]. With its high resolution WGS outperforms other molecular typing methods such as MIRU-VNTR, Spoligotyping or RFLP [15, 16]. Furthermore, NGS technology enabled the study of the geographical distribution of different M. tuberculosis strains and the dynamics of M. tuberculosis evolution [13]. Strain differentiation was of specific interest for the last decade and several studies showed how WGS outperforms other genotyping methods for detecting recent transmissions [17, 18] and clusters [19, 20]. WGS enables base-by-base comparison between two samples with distinction of identical and non-identical regions. Reference genomes like the laboratory strain H37Rv [21] allow for comparison of multiple samples by comparing their sequences to the reference. Over the years a variety of additional reference genomes based on clinical strains with specific drug resistance patterns or specific to certain geographic appearance were created. Thereby, individual reference genomes help to match samples or sample groups like outbreak clusters [2, 22, 23]. Various methods for measuring the distance between samples using SNPs obtained with NGS have been proposed. Different strategies for making distance analysis less complex like core-genome MLST [24] or whole genome MLST [25] have been described. However, these strategies use a gene-by-gene comparison approach. As the molecular clock of M. tuberculosis is considered as very low with 0.3-0.5 mutations per genome per year [26, 27], the MLST approaches can be insufficient to investigate transmission patterns in clusters or reconstruct direct transmission links. For this reason, we, as well as many outbreak analyses [10–12, 18, 19, 27–31], focus on SNP-counting methods as they retain higher resolution for patient-patient transmission. We assessed twelve studies on detection of M. tuberculosis transmission clusters [11, 12, 14, 18, 19, 27–30, 32–34]. In nine of them the M. tuberculosis strain H37Rv (NC_000962.3) [21] is used as the reference genome to identify sample-specific variations. In order to achieve this, samples of patients’ bacteria were sequenced and the resulting reads were mapped to the reference genome with commonly used short read aligners. Then, variant calling tools were used to detect single nucleotide polymorphisms in the sample data. For the majority of studies the main steps after variant calling were similar: SNPs were filtered using high quality standards such as a minimum number of reads mapped to the variant site (e.g. 5-10 reads), with a high percentage of these reads supporting the variant (75%). In most of these studies additional regions of low confidence, such as repetitive regions, known resistance mutations, regions with more than one SNP, insertion or deletion within 12 bp of the SNP are identified and variant calls within these regions were excluded. However, the described methods varied in how sites with insufficient coverage or low-quality variant calls were handled when comparing all samples in a dataset. There are different reasons for the occurrence of these regions: fluctuation in the coverage while sequencing, deletions within the sequence of the sample or large structural differences between the reference genome and the analyzed sample. Independent of the underlying cause, these regions of low coverage were either completely excluded from the analysis and all comparisons (e.g. [11, 12]) by only considering positions with base calls of high quality in all analyzed samples. Or the low-quality base calls and regions with low coverage are ignored and substituted with the reference sequence. This is usually done by selecting SNPs of all samples and concatenating them, using the reference sequence for samples without SNPs at these positions to achieve equal sequence length [14, 19, 34]. Some studies used pairwise comparisons for a small set of samples [32, 33]. Several of these strategies are facilitated by the BugMat software [35]. Considering these studies, the questions that are posed for standardizing WGS-based molecular surveillance analyses include: Which genome should the reads be aligned to? How can regions with missing or low-quality information be handled? In alignment-based whole genome sequencing analyses, the choice of the reference genome predetermines the results of subsequent analysis [36]. There is a need to avoid this bias and include multiple reference sequences into the analysis e.g. by using a pan-genome [37]. In this context the term pan-genome describes a set of associated (whole genome) sequences rather than the core and accessory genes of all strains of an organism. The method of comparison of samples should consider all high-quality variants while excluding regions of low quality for each sample. We present a new approach, PANPASCO (PAN-genome based PAirwise SNP COmparison), that combines an improved pairwise distance measure, that allows the comparison and clustering of a large number of diverse samples with the use of a computational pan-genome reference sequence. PANPASCO considers each detectable difference between pairs of samples, without sacrificing the ability to resolve intra-cluster patient-patient relationships. We apply this approach to several datasets of M. tuberculosis samples with a computational pan-genome representing lineages 1-4.

Results

We developed PANPASCO, a novel method to determine the distance between samples based on SNP differences. We compare samples in a pairwise manner, considering all variant sites of high-quality for each pair. These high-quality sites are identified using an NGS variant calling workflow and a five step variant quality filter. To enable the comparison of pairs of samples with differing amount of missing data, the number of low-quality sites and regions with missing information is also incorporated into the distance measure in a normalization step. To minimize the information loss and avoid the problem of identifying each best fitting reference genome per sample, PANPASCO uses a computational pan-genome built from 146 M. tuberculosis genomes with seq-seq-pan [38]. This computational pan-genome is about 18% (<1 Mb) longer than a single M. tuberculosis genome, e.g. the H37Rv strain, and contains all genomic regions shared by and specific to each included genome. We use the computational pan-genome sequence in place of a lineage specific reference genome in our mapping and variant calling workflow. When comparing samples using their SNP difference, we can therefore also include SNPs that occur in regions that are not part of commonly used reference strains (for details see Methods). Several studies have investigated the mutation rate for M. tuberculosis and evaluated the SNP-based difference cutoffs for detecting transmission between patients and within clusters ranging from 3 to 14 SNPs [11, 27–29, 39]. Here, we chose a conservative definition that assigns samples with a distance of fewer than 13 SNPs into transmission clusters and thereby distinguishes them from unrelated samples [11]. We compared PANPASCO to two other commonly used strategies for detecting SNP-based difference, where variant sites and regions with missing information in one of the samples are either completely excluded from the analysis [11, 12] or are ignored and therefore substituted with the reference sequence when samples are compared [19] (referred to as exclusion method and substitution method). For these two methods we use the M. tuberculosis H37Rv strain as reference genome as described in the respective publications [11, 12, 19]. Fig 1 shows how these methods work in detail and their individual characteristics: SNP differences in pairs of samples are missed if they are located in regions with missing information in unrelated samples with the exclusion method, while artificial differences are introduced by substituting missing data with the reference sequence when using the substitution method. Fig 1 also shows how all differences between pairs of samples are detected and incorporated in the distance measure with PANPASCO.

Fig 1

Description and comparison of three SNP distance methods.

(A) Schematic representation of the next-generation sequencing (NGS) reads of three samples aligned to a reference sequence. Reference sequence is depicted in gray, while samples are colored yellow, green and blue. Dashed lines represent borders of regions of the samples that are not covered by any NGS reads. Transparent reads represent the true genome of the samples in regions with no coverage. Reads reveal single nucleotide polymorphisms (SNPs) when comparing the samples to the reference sequence (depicted as black points on the reads). SNPs also occur in sequences of samples where no read data is available, these can not be detected using any method that is based on these aligned reads. In sum, the samples have 12 differences, while only 8 can be identified with the available reads. (B) Representation of compared sequence parts using three different distance measuring methods. Each method compares pairs of samples. Samples are depicted as whole genomes in yellow, blue, and green. Differences between samples are marked with X, true differences are colored black and incorrect ones are colored red. The three methods differ in how regions with missing information in one sample are treated when comparing other samples of the dataset. With the substitution method, missing parts in the samples are replaced with the corresponding parts of the reference sequence (in gray). This leads to incorrectly identified differences, where the true sequence of the samples differs from the reference sequence but has no read coverage. Using the exclusion method, all regions with no coverage of all samples are excluded in all comparisons, e.g. low coverage in Sample 1 influences the comparison of Samples 2 and 3. This leads to overlooking differences between samples, in this example only 4 of 12 detectable differences were identified. The pairwise SNP comparison method used in PANPASCO determines all comparable regions for each pair of samples. Regions with low coverage in each pair are excluded for pairwise comparison. This way all detectable differences are identified, without introducing additional, incorrect ones.

Description and comparison of three SNP distance methods.

Experimental setup

To evaluate the different SNP-counting strategies, we compare the number of transmission cluster links identified by each method in four different datasets. We created a simulation dataset with the properties (e.g. number of mutations and coverage distribution) of real datasets [11, 40, 41] (see S1 Appendix). It includes 20 transmission clusters with 3 to 55 samples per cluster. Samples were simulated from four different genomes including the commonly used M. tuberculosis reference strain H37Rv, with five clusters for each genome (for details see Supplementary Methods in S1 Appendix and S1 Table). With this simulation we compare the accuracy of the classification of sample links of the different methods. We also evaluated PANPASCO’s performance for the analysis of three published datasets, to demonstrate the relevance of our improved distance measure. We chose a study with a large national dataset of 217 samples, with one transmission cluster described in detail [11] and a dataset with a small number of patients [12] (referred to as UKTB (United-Kingdom-TB) and RAGTB (Romania-Austria-Germany-TB), respectively). We also included a study including isolates from a high-burden setting in China [40] (referred to as CTB). For the comparison and evaluation, we classify all links between samples into transmission cluster links (SNP difference of fewer than 13) and unrelated links following previously published results [11] and group samples into transmission clusters. To be assigned to a transmission cluster the distance of one sample to only one of the other samples has to be classified as transmission cluster link, therefore unrelated samples can belong to the same transmission cluster when they are connected by other samples.

Simulation dataset results

We applied the exclusion method to the simulation dataset and compared the detected transmission links to the simulated ones. Using the exclusion method results in a high number of predicted transmission cluster links. The sensitivity is very high (1.000), as there are no false negative classifications, however the specificity (0.782) and F-Score (0.326) are low. In detail, the strategy of excluding low-quality variant sites found in any sample from the transmission analysis resulted in a high number of false positive classifications—differences are overlooked and a transmission is assumed where there is no relation between the samples (Table 1).

Table 1

Comparison of SNP-counting methods in all clusters of the simulation dataset.

	TP	TN	FP	FN	Sensitivity	Specificity	Accuracy	F-Score
exclusion	2604	38654	10745	0	1.000	0.782	0.793	0.326
substitution	465	49394	5	2139	0.179	1.000	0.959	0.303
PANPASCO	2525	49174	225	79	0.970	0.995	0.994	0.943

TP = true positives, TN = true negatives, FP = false positives, FN = false negatives

Samples in this dataset were simulated from four different M. tuberculosis genomes including the H37Rv strain. This dataset includes 2604 tranmission cluster links and 49399 unrelated links. We compare three SNP-counting approaches: exclusion = low-quality variable sites are excluded for all samples in the analysis, substitution = SNPs at low-quality sites are substituted with the reference base, PANPASCO = high-quality sites are identified and differences counted for each pair separately and a computational pan-genome sequence built from 146 M. tuberculosis genomes was used as reference genome. Samples with a SNP-difference of fewer than 13 SNPs belong to the same transmission cluster. The data shows that using PANPASCO results in the highest accuracy and F-score. TP = true positives, TN = true negatives, FP = false positives, FN = false negatives The substitution method classifies a high number of links between samples as unrelated. Comparison of this classification with the simulated links showed that there were many true negative classifications (specificity = 1.000), but also many false negative ones (sensitivity = 0.179). Substituting regions of low quality with the sequence of the reference genome introduces artificial differences between the samples and therefore this method cannot be used to identify transmission links, as the relation between samples is obscured (Table 1). Transmission cluster links can be accurately identified using PANPASCO resulting in the highest accuracy and F-Score for the simulation dataset (>0.99 and >0.94, respectively, Table 1). To analyze the influence of the reference genome on the results of the different methods, we group all samples by the genome used for simulation and calculated the accuracy of transmission link classification for each of the four sets individually. The 74 samples of the first set of clusters (C1-C5) were simulated from the M. tuberculosis H37Rv strain, which is commonly used for SNP distance analyses [11, 12, 19]. The samples in the rest of the clusters are simulated from three other genomes that are increasingly different from the H37Rv strain (see S2 Table) and therefore increase the diversity between analyzed samples. For details on all genomes used for simulation see Supplementary Methods in S1 Appendix and S1 Table. Due to the characteristics of the exclusion method, results change with the number of samples and the sequence diversity among the samples that are included in the analysis. For this reason we used the exclusion method in two ways, once including only the clusters simulated from the respective genome and once including all samples but show the results for the respective samples only. The differing results for these strategies are clearly evident in Table 2. Using the exclusion method for analyzing groups of very similar samples works well. In contrast, when more samples, originating from different genomes are included, more genomic regions are excluded from the analysis and therefore fewer differences are detected. This results in a strong increase of false-positive classifications of sample links and therefore a large decrease in specificity and accuracy.

Table 2

Comparison of SNP-counting methods for simulated samples grouped by the genome used for simulation.

Genome	Method	TP	TN	FP	FN	Sensitivity	Specificity	Accuracy	F-Score
H37Rv	exclusion, cluster samples*	499	1915	287	0	1.000	0.870	0.894	0.777
	exclusion*	499	0	2202	0	1.000	0.000	0.185	0.312
	substitution	463	2197	5	36	0.928	0.998	0.985	0.958
	PANPASCO	489	2187	15	10	0.980	0.993	0.991	0.975
MDRMA208	exclusion, cluster samples*	464	2279	260	0	1.000	0.898	0.913	0.781
	exclusion*	464	0	2539	0	1.000	0.000	0.155	0.268
	substitution	2	2539	0	462	0.004	1.000	0.846	0.009
	PANPASCO	425	2520	19	39	0.916	0.993	0.981	0.936
HKBS1	exclusion, cluster samples*	440	1531	109	0	1.000	0.934	0.948	0.890
	exclusion*	440	0	1640	0	1.000	0.000	0.212	0.349
	substitution	0	1640	0	440	0.000	1.000	0.788	0.000
	PANPASCO	425	1616	24	15	0.966	0.985	0.981	0.956
TB282	exclusion, cluster samples*	1201	3679	685	0	1.000	0.843	0.877	0.778
	exclusion*	1201	0	4364	0	1.000	0.000	0.216	0.355
	substitution	0	4364	0	1201	0.000	1.000	0.784	0.000
	PANPASCO	1186	4197	167	15	0.988	0.962	0.967	0.929

* Results change with the number and diversity of samples when using the exclusion method. We show the results for each group, including only the respective cluster samples or all simulated samples in the analysis.

TP = true positives, TN = true negatives, FP = false positives, FN = false negatives

Comparison of sample link classification for all samples of the simulated dataset grouped by the genome used for simulation shows the impact of the reference genome for the different methods. For all samples the same reference genome was used in the analysis: the M. tuberculosis H37Rv strain for the exclusion and substitution method and the computational pan-genome reference sequence for PANPASCO. The results show a great reduction in accuracy and F-Score for the groups of samples that were simulated from genomes other than the H37Rv strain for the substitution method and the exclusion method when using all samples in the distance calculation. PANPASCO achieves very good classifications with no difference between the groups of samples. * Results change with the number and diversity of samples when using the exclusion method. We show the results for each group, including only the respective cluster samples or all simulated samples in the analysis. TP = true positives, TN = true negatives, FP = false positives, FN = false negatives The choice of reference genome strongly affects the results obtained with the substitution method. Substituting regions of low quality with the reference genome works well only if the analyzed samples are mapped to the best fitting reference genome: almost all transmission cluster links between samples simulated from genomes other than the H37Rv strain were misclassified as unrelated (Table 2). Using PANPASCO, the classification results do not differ between the four groups of samples. We achieved the highest accuracy and F-Score (>0.96 and >0.92, respectively) for all links between samples compared to the other two methods (Table 2). This detailed analysis shows the advantage of our approach: links between all samples in a large, diverse dataset are classified equally well, independent of the number of included samples or strains, as all SNPs, even those in strain specific genomic regions are taken into account for measuring the distance between pairs of samples. Detailed inspection of sample links for each simulated cluster (S4 Table) and comparing the number of SNPs detected for each link within each simulated cluster underlines the properties of the other methods (see Fig 2): Due to the diversity of the genomes used for the simulation and the resulting exclusion of genomic regions, most samples within clusters have a reported difference of 0 or 1 with the exclusion method. The opposite is true for the substitution method as most transmission cluster links were reported as being unrelated links (> 12 SNPs).

Fig 2

Counting difference between samples within clusters.

Counting difference between samples within clusters.

We select each cluster individually and count the differences between samples within the clusters. Each cluster can contain closely related samples (0 SNPs), transmission cluster links (fewer than 13 SNPs) and unrelated samples. Samples are assigned to a cluster if the difference is fewer than 13 SNPs compared to at least one sample within the cluster. We evaluate the frequency of each distance value for the three methods and compare them to the true distance. The exclusion method reports distances of 0-3 SNPs for all samples and with the substitution method all distances are greater than 12. The distribution of distances using PANPASCO closely resemble the true distances in the dataset. For a complete assessment of the different methods we also we also analyzed the simulation dataset with our computational pan-genome as the reference genome with the exclusion and substitution methods and PANPASCO with the M. tuberculosis H37Rv as the reference genome (see S5 Table). This comparison shows that the classification results do not improve for the exclusion method. The combination of the substitution method and the pan-genome achieves very poor results as not a single transmission cluster link was detected. PANPASCO works best with the computational pan-genome reference sequence, but also achieves better results than the other two methods using the standard reference genome.

Real datasets results

All three previously published datasets were analyzed using the computational pan-genome as reference sequence and the mapping and variant calling workflow described in the Supplementary Methods in S1 Appendix. For the UKTB dataset we focused our comparison on the cluster for which an epidemiological network and nucleotide variants were provided (cluster seven, UKTB7) [11]. This cluster was initially defined by the shared MIRU-VNTR profile of the samples and includes 17 sequenced isolates of ten patients with one central, treatment non-compliant individual. When calculating the SNP distance using PANPASCO, we identified one pair of samples with 4 differing SNPs in addition to the ones described in the original manuscript: P066 and P175. These patients were reported with a distance of 0, indicating a close transmission event or even direct transmission. The minimum spanning tree we calculated with Cytoscape App Spanning Tree [42] based on our SNP differences better matches the described epidemiological network (Fig 3). In contrast to the findings in [11] it is more likely that P076 infected the other two cases, rather than one of these the other one. We investigated the variant sites that we identified and compared them to the published ones. We identified 36 variant sites including all 20 sites listed in [11] (depicted in S6 Table). In the original manuscript the exclusion method was used and the additional 24 sites we identified were excluded because they are not covered in samples assigned to MIRU-VNTR cluster six (data not shown). This lack of coverage is likely explained by the differing phylotypes of cluster six and seven (see Table in [11]) and in combination with the exclusion method is the reason why the 4 differing SNPs between P066 and P175 were not reported in the original publication.

Fig 3

Minimum spanning tree computed with PANPASCO between samples of the UKTB7 dataset.

Minimum spanning tree computed with PANPASCO between samples of the UKTB7 dataset.

(A) Genetic distances estimated by Walker et al. SNP distances are represented by dots on edges, isolates within blue circles are separated by 0 SNPs. (B) Epidemiological network as published by Walker et al. (C) Minimum spanning tree from distances calculated with PANPASCO. Adjoining isolates are separated by 0 SNPs, edges are labelled with pairwise distance. Transmission links between P066 and P175 are marked in all three parts—more SNPs were detected using PANPASCO and the resulting tree better fits the epidemiological network. ((A) and (B) taken from [11] and edited). The second dataset, RAGTB [12] consists of 13 patients and two replicates for one of the patients. We again calculated a minimum spanning tree using distances calculated with PANPASCO (see Fig 4). We identified the same transmission clusters as described in the original manuscript with one exception: We identified more than 12 differing SNPs for patients VI and III. The reason for that is, that the patients were previously analyzed using the exclusion method, with additional filtering of SNPs associated with drug resistance or located in repetitive regions of the genome [12]. As this is a small dataset, there is not a large difference between the results of our approach and the exclusion method.

Fig 4

Minimum spanning tree computed from pairwise distances between samples of the RAGTB dataset.

Minimum spanning tree computed from pairwise distances between samples of the RAGTB dataset.

Edges are labeled with SNP distances (format: PANPASCO | published data). The exclusion method and the M. tuberculosis H37Rv strain were used in the original publication [12]. Blue lines mark distances of fewer than 13 SNPs detected with PANPASCO, clustering patients into three transmission clusters and four independent samples. Samples XII and XIII show no difference with both analyses. The comparison of these two distances shows that there is no large difference in the results as this is a small dataset with very similar samples. In the original study of the CTB dataset three clusters of patients (A, B, C) were analyzed [40]. The clusters contained 19, 9 and 4 patients from two regions in China. We used PANPASCO to calculate the distances between all 32 isolates and annotated all identified transmission links in the Median-Joining networks from the original publication (Fig 5). Using the cutoff of fewer than 13 SNPs we identified the same transmission clusters, which shows that PANPASCO can also be used in this high-burden setting.

Fig 5

Median-Joining networks for samples of the CTB dataset annotated with pairwise distances.

Edges between isolates with genetic distance of fewer than 10 SNPs identified in the original publication are labeled with SNP distances calculated with PANPASCO. Blue lines mark distances detected with PANPASCO, while gray ones mark original results. The comparison of these two analyses showed very similar results.

Median-Joining networks for samples of the CTB dataset annotated with pairwise distances.

Discussion

We present a new approach for measuring genomic distance of M. tuberculosis isolates, integrating commonly used mapping and variant calling methods. Reads of isolates are mapped to a computational pan-genome built from over a hundred M. tuberculosis genomes to include strain specific genomic regions in the analysis. SNP differences are evaluated pairwise so all high-quality regions and incorporating regions of low-quality or missing information are considered. Our method PANPASCO accurately determines transmission clusters within large sets of samples. Previously published and commonly used methods struggle with large, diverse datasets, showing either low specificity or sensitivity when identifying transmissions. Any sort of error is problematic in a disease outbreak investigation. Methods lacking sensitivity result in overlooking transmission and failing outbreak detection. However, also methods lacking specificity produce unsatisfying results. Any health care system can only follow-up on a limited number of possible transmission events and must prioritize its actions and concentrate capacities and efforts. Having large numbers of spurious and incorrect transmission events detected, will automatically impact the availability of resources to focus on the true underlying transmission events within short time. The major drawback of the exclusion method is its limited resolution, which is driven by the sample with the lowest coverage. Another disadvantage of this method for usage in disease surveillance is that each time a new sample is added to the dataset, the results for previously analyzed isolates change as more regions of the genome have to be excluded to be able to compare the new sample to the dataset. Taking this further, with a rising number of samples at some point, due to random sequencing errors and varying coverage distribution, there will be fewer and fewer sites of the genome with high-quality information available for comparison. This problem is of special interest and high importance regarding long time infectious disease surveillance rather than outbreak investigations and recent transmission analysis. This problem can be avoided when low-quality or missing regions are substituted with the reference sequence. With the substitution method all detectable differences between samples are considered in the distance calculation. However, many artificial differences are introduced as well. When an M. tuberculosis isolate is compared to a reference sequence, usually several hundred SNPs are detected. Within a transmission cluster all but a few of these SNPs are the same. When a subset of these sites cannot be detected due to missing or low-quality reads and are substituted with the reference base, they are counted as differences between the samples and transmissions cannot be detected. The problem increases the more the isolates diverge from the reference sequence. Strains of M. tuberculosis can be assigned to different sub-lineages that differ in many loci and blocks of deletions [41]. For analysis of small transmission clusters a lineage specific genome can be used as the reference genome for optimal results. In larger studies, samples of different strains are often analyzed together, choosing one reference genome that naturally represents only one of the lineages adequately [11, 18, 19, 28]. This means, that for a part of the set of samples a suboptimal reference is used for mapping and variant calling. Nevertheless, identifying the best fitting reference sequence for each sample is no optimal solution. Clusters within a dataset and datasets from different studies will not be comparable with each other. The computational pan-genome approach solves this problem as it allows the integration of genomic information of several M. tuberculosis strains. The computational pan-genome enables the detection of SNPs in the core genome and in strain specific genomic regions at the same time. We showed the specific weaknesses of the exclusion and substitution approach by applying them to a simulation dataset that was constructed to resemble a real dataset. We demonstrate the superior classification result of combining the usage of the computational pan-genome reference and the pairwise comparison. Using PANPASCO to analyze previously published datasets provided additional insight for the epidemiological investigation of these transmission clusters. Having differing numbers of comparable sites for each pair in the analysis, complicates distance calculation. To account for the fact that fewer differences can be detected with fewer comparable sites we project the number of SNPs found for each pair to the full length of the genome, e.g. detecting 2 differences within 2 Mb is different from detecting 2 differences in 4 Mb using a normalized score. For identifying comparable sites and incorporating missing regions into the distance measure we keep track of all, instead of only variable high-quality sites of all samples. In this normalization step we work with the assumption that SNPs in samples are equally distributed over the length of the computational pan-genome. Mutation hot spots and the frequency of variants in the genomes used for the computational pan-genome can be taken into account for a more accurate normalized score. Due to this normalization step, SNP distances to samples with high numbers of low-quality sites can be greater than the true distances. Therefore, sample quality has to be taken into account in cluster analysis. For further improvement of transmission cluster detection, several steps can be taken. The pan-genome used in this study was built from all available reference sequences for M. tuberculosis in NCBI Refseq [38]. However, some of the seven major lineages [43] are underrepresented in this set of genomes (see information about lineages in S7 Table). PANPASCO can be applied to more diverse datasets using a computational pan-genome additionally including reference genomes representing these lineages and animal strains such as Mycobacterium bovis. To increase differentiation between samples, information about additional genetic variation such as mixed base calls for SNPs, insertions, deletions or tandem repeats and microsatellites could be incorporated into the distance measure. There is a need for standard methods for evaluating quality and validity to include these additional measures of genetic variation. This can be achieved by using graph representation of reference genomes and sample sequences such as Cortex [44, 45], vg [46] or panVC [47]. With these approaches all sequence information of the samples can be used in the analysis to detect differences among samples or compared to one or more reference sequences. However, methods for annotation of identified differences are lacking and interpretation of results can be challenging. Small parts of missing information in the analyzed samples can also be imputed from known haplotypes or using a phylogenetic tree of the analyzed samples [48]. This method has the potential to improve the results of all methods described in this study. The success of these methods vary with the availability of information on haplotypes and the variation among related sequences for constructing informative phylogenetic trees. Several studies investigated appropriate SNP distance cutoffs for transmission cluster definition [31]. These cutoffs have to be defined specifically for each species and the environment of transmission. While the cutoffs described for M. tuberculosis seem to be stable between different settings [40], there are also alternative approaches that can be used for transmission cluster definition [49]. Today, WGS-based molecular surveillance of MTB is established in a number of low-incident countries, e.g. USA, UK, Netherlands. Such systems allow for event specific adaption of public health action, patient care, medication and treatment based on pathogen specifics like resistance, virulence and actual spread (outbreak size). Early detection of antibiotic resistance and prevention of further transmission are one of the main tasks on the path to TB elimination. This implies a large number of samples that has to be analyzed—e.g. in Germany there were more than 4000 cases (4099 of 5915 reported cases; corresponding to 83.4%) laboratory confirmed by culture of tuberculosis in 2016 [50]. Nowadays, higher mobility and worldwide migration cause a larger geospatial spread of specific pathogens and increase the diversity of samples. Several studies analyzed and assessed the integration of WGS into routine tuberculosis diagnosis and investigation [11, 51–53]. The authors show the added benefit of using SNP-based analysis in transmission cluster and drug resistance detection in large groups of patients. However, while they discuss the importance of common standards for sequencing techniques and quality, the implications of integrating a large number of samples across different lineages in the same analysis are not addressed. Balancing sensitivity and specificity is key for the analysis of large and diverse groups of samples during outbreak investigations and in TB surveillance, when it is of importance to find each case and expensive to investigate large numbers of false positives. PANPASCO can contribute to achieving these goals by usage of pan-genomic references and improved pairwise SNP-distances.

Materials and methods

PAN—Computational pan-genome mapping

In mapping based whole genome sequencing analyses, the choice of the reference genome can have significant impact on the results [36]. For this reason we built a computational pan-genome from 146 M. tuberculosis genomes available in NCBI RefSeq by February 17th, 2018 with seq-seq-pan [38]. seq-seq-pan aligns all genomes in an iterative way, adding new genomic content step by step. This resulted in a computational pan-genome sequence with 5,205,216 bp (an increase of about 18% compared to 4,411,532 bp of the commonly used M. tuberculosis H37Rv strain) and contains all genomic regions shared by and specific to each included genome (see list of genomes in S7 Table). We use this computational pan-genome sequence as reference sequence with a pipeline that includes various tools for quality control, mapping and variant calling and filtering, with bwa mem [54] for read alignment and GATK for variant detection [55] (Supplementary Methods in S1 Appendix and S1 Fig). Scripts for the whole analysis workflow are provided at https://gitlab.com/rki_bioinformatics/panpasco.

PASCO—Pairwise SNP comparison

The first step of distance calculation is the identification of high-quality SNPs. For this we use several filters to identify regions with low coverage and low-quality and ambiguous sites for all samples. Additionally, SNPs in repetitive regions of the reference genome [56] are excluded from the analysis of real datasets (Supplementary Methods in S1 Appendix, S1 Fig and S8 Table). Then, we compare all samples pairwise, taking into account the set of all variant sites of high-quality (S) in the genomes of a pair of samples (x1 and x2) compared to a reference genome. This way we do not lose information about differing bases that are located in low-quality regions of other, unrelated samples. Opposed to the previously published exclusion and substitution methods, we end up with different numbers of sites for each comparison, due to differing number and length of low-quality regions in each sample. To account for this difference we normalize the SNP count by this number of compared sites. This score reflects the SNP difference per base. We also determine common reference genome sites of the samples (G). For this we compare the low-quality regions of the samples with the whole genome alignment (WGA) that forms the computational pan-genome sequence. The common reference genome sites are composed of high-quality sites of the samples and the low-quality sites that do not overlap with gaps in the WGA of the computational pan-genome. Overlaps with gaps in the WGA indicate that the reason for lack of coverage are strain differences rather than low-quality sequencing (Fig 6).

Fig 6

Reads from two samples mapped to a computational pan-genome sequence with regions of zero coverage.

Reads from two samples mapped to a computational pan-genome sequence with regions of zero coverage.

Regions with no coverage such as A and C are considered to contain as many difference between the samples as found in regions with sufficient coverage. To account for these regions with insufficient coverage, the total expected difference between two samples is calculated using the SNP difference per base—derived from regions covered in both samples—and the set of common reference genome sites. This set is composed of all sites of the genome except regions such as B and D. These regions have low coverage in both samples and overlap with gaps in the whole genome alignment (blue, yellow and green) of the strains used to build the computational pan-genome (purple). This indicates that both samples are related to similar strains that both do not contain this specific genomic region, which should therefore not be considered when calculating the expected number of differences for the whole computational pan-genome. To calculate the expected number of differences for the whole reference genome we multiply the SNP difference per base with the number of common reference genome sites (see Eqs (1) and (2)). We define the distance between the genomes of a pair of samples x1 and x2 as where and |S| and |G| are the number of compared high-quality variant sites and the number of common reference genome sites, respectively.

Simulation dataset

Following published real datasets (UKTB and RAGTB) we simulated 20 transmission clusters with 3 to 55 samples per cluster, resulting in a total number of 323 samples. We chose four M. tuberculosis genomes and assigned five clusters to each of them (S1 Table). For the genomes we chose the M. tuberculosis H37Rv strain, commonly used as reference genome, and three increasingly divergent genomes from the set of genomes that built up the computational pan-genome (NC_000962.3, NZ_CP023628.1, NZ_CP002871.1, NZ_CP017920.1). For each cluster we generated intra- and inter-cluster SNPs and we assigned all inter-cluster and a selection of the intra-cluster SNPs to each sample (see S1 Appendix). We estimated the number and length of low coverage regions in the comprehensive UKTB dataset and introduced them into the simulated samples. Short reads were simulated with NEAT [57]. For this purpose we created a variant calling file (VCF) with all assigned SNPs for each sample, including deletions to simulate regions with low coverage (see S1 Appendix for all simulation details). To create a fair simulated dataset we generated a whole genome alignment of all applied reference genomes with seq-seq-pan [38] to calculate the true distance between the samples (S2 Table). We combined these differences and all simulated SNPs into a distance matrix for all samples, which represents the true distance between all samples. We assessed the number of simulated inter- and intra-cluster SNPs located in regions that are not part of the H37Rv strain by mapping their positions using the coordinate system of the computational pan-genome (S3 Table). Scripts for generation of intra- and intercluster SNPs and deletions are provided at https://gitlab.com/rki_bioinformatics/panpasco.

Supplementary methods.

Appendix S1 contains additional details on analyses used in this study. The mapping and variant calling workflow as well as the creation of the simulation dataset are described. Methods for defining regions to be filtered for the real datasets are also included. (PDF) Click here for additional data file.

Workflow used to analyze the simulation and real datasets.

We divide the task in three parts: read mapping, variant calling and filtering of variants and detection of low-quality regions. We prepare the reads by removing sequence adapters and merging overlapping paired end reads. After that low-quality reads are filtered and reads with ends of low base quality are trimmed. High-quality reads are mapped to a reference and necessary steps for variant calling are taken. We use GATK [55] for variant calling and bedtools [58] to calculate the coverage of the genome. The results are filtered to detect regions of low-quality and high-quality SNPs. (TIF) Click here for additional data file.

Description of clusters in the simulation dataset.

(PDF) Click here for additional data file.

Description of genomes used for the simulation dataset.

Base differences were counted in a whole-genome alignment of the four genomes. Upper triangular part of table shows hamming distance of sequences in WGA ignoring gaps, while the lower part of the table lists all differences, including gaps. (PDF) Click here for additional data file.

Comparison of genomes used in the simulation dataset to the M. tuberculosis strain.

We count the simulated SNPs in all simulated samples located on genome specific regions that are not part of the H37Rv genome. (PDF) Click here for additional data file.

Details of transmission detection in simulated clusters.

S4 Table provides the counts for transmission cluster links for the exclusion, substitution and pairwise method for mappings to the M. tuberculosis H37Rv strain and the computational pan-genome. (PDF) Click here for additional data file.

Comparison of SNP-counting methods using different reference genomes.

We used the all three methods (exclusion, substitution, PANPASCO) with the commonly used M. tuberculosis H37Rv and the computational pan-genome reference genomes for classification of links between samples in the simulated dataset. (PDF) Click here for additional data file.

Comparison of differential SNPs detected in the UKTB7 dataset.

S6 Table contains the full SNP matrix for all samples in the UKTB7 dataset. (PDF) Click here for additional data file.

Genomes in computational pan-genome.

S7 Table provides detailed description of the genomes used for building the computational M. tuberculosis pan-genome including accession numbers, sort order and lineage. (PDF) Click here for additional data file.

Regions filtered in real datasets.

S8 Table provides filtered genes with coordinates for the M. tuberculosis H37Rv genome and mapped to the computational pan-genome. (PDF) Click here for additional data file.

49 in total

1. From molecular to genomic epidemiology: transforming surveillance and control of infectious diseases.

Authors: M J Struelens; S Brisse
Journal: Euro Surveill Date: 2013-01-24

2. Whole-genome-based Mycobacterium tuberculosis surveillance: a standardized, portable, and expandable approach.

Authors: Thomas A Kohl; Roland Diel; Dag Harmsen; Jörg Rothgänger; Karen Meywald Walter; Matthias Merker; Thomas Weniger; Stefan Niemann
Journal: J Clin Microbiol Date: 2014-04-30 Impact factor: 5.948

3. Mycobacterium tuberculosis--heterogeneity revealed through whole genome sequencing.

Authors: Chris Ford; Karina Yusim; Tom Ioerger; Shihai Feng; Michael Chase; Mary Greene; Bette Korber; Sarah Fortune
Journal: Tuberculosis (Edinb) Date: 2012-01-03 Impact factor: 3.131

4. Clinical application of whole-genome sequencing to inform treatment for multidrug-resistant tuberculosis cases.

Authors: Adam A Witney; Katherine A Gould; Amber Arnold; David Coleman; Rachel Delgado; Jasvir Dhillon; Marcus J Pond; Cassie F Pope; Tim D Planche; Neil G Stoker; Catherine A Cosgrove; Philip D Butcher; Thomas S Harrison; Jason Hinds
Journal: J Clin Microbiol Date: 2015-02-11 Impact factor: 5.948

5. Human T cell epitopes of Mycobacterium tuberculosis are evolutionarily hyperconserved.

Authors: Iñaki Comas; Jaidip Chakravartti; Peter M Small; James Galagan; Stefan Niemann; Kristin Kremer; Joel D Ernst; Sebastien Gagneux
Journal: Nat Genet Date: 2010-05-23 Impact factor: 38.330

6. Whole genome sequencing analysis of intrapatient microevolution in Mycobacterium tuberculosis: potential impact on the inference of tuberculosis transmission.

Authors: Laura Pérez-Lago; Iñaki Comas; Yurena Navarro; Fernando González-Candelas; Marta Herranz; Emilio Bouza; Darío García-de-Viedma
Journal: J Infect Dis Date: 2013-08-14 Impact factor: 5.226

7. Does Choice Matter? Reference-Based Alignment for Molecular Epidemiology of Tuberculosis.

Authors: Robyn S Lee; Marcel A Behr
Journal: J Clin Microbiol Date: 2016-04-13 Impact factor: 5.948

8. Use of whole genome sequencing to estimate the mutation rate of Mycobacterium tuberculosis during latent infection.

Authors: Christopher B Ford; Philana Ling Lin; Michael R Chase; Rupal R Shah; Oleg Iartchouk; James Galagan; Nilofar Mohaideen; Thomas R Ioerger; James C Sacchettini; Marc Lipsitch; JoAnne L Flynn; Sarah M Fortune
Journal: Nat Genet Date: 2011-04-24 Impact factor: 38.330

9. Comparative whole-genome analysis of clinical isolates reveals characteristic architecture of Mycobacterium tuberculosis pangenome.

Authors: Vinita Periwal; Ashok Patowary; Shamsudheen Karuthedath Vellarikkal; Anju Gupta; Meghna Singh; Ashish Mittal; Shamini Jeyapaul; Rajendra Kumar Chauhan; Ajay Vir Singh; Pravin Kumar Singh; Parul Garg; Viswa Mohan Katoch; Kiran Katoch; Devendra Singh Chauhan; Sridhar Sivasubbu; Vinod Scaria
Journal: PLoS One Date: 2015-04-08 Impact factor: 3.240

10. IMPUTOR: Phylogenetically Aware Software for Imputation of Errors in Next-Generation Sequencing.

Authors: Matthew Jobin; Haiko Schurz; Brenna M Henn
Journal: Genome Biol Evol Date: 2018-04-01 Impact factor: 3.416

5 in total

1. One is not enough: On the effects of reference genome for the mapping and subsequent analyses of short-reads.

Authors: Carlos Valiente-Mullor; Beatriz Beamud; Iván Ansari; Carlos Francés-Cuesta; Neris García-González; Lorena Mejía; Paula Ruiz-Hueso; Fernando González-Candelas
Journal: PLoS Comput Biol Date: 2021-01-27 Impact factor: 4.475

2. Improving tuberculosis surveillance by detecting international transmission using publicly available whole genome sequencing data.

Authors: Andrea Sanchini; Christine Jandrasits; Julius Tembrockhaus; Thomas Andreas Kohl; Christian Utpatel; Florian P Maurer; Stefan Niemann; Walter Haas; Bernhard Y Renard; Stefan Kröger
Journal: Euro Surveill Date: 2021-01

3. Accurate and rapid prediction of tuberculosis drug resistance from genome sequence data using traditional machine learning algorithms and CNN.

Authors: Xingyan Kuang; Fan Wang; Kyle M Hernandez; Zhenyu Zhang; Robert L Grossman
Journal: Sci Rep Date: 2022-02-14 Impact factor: 4.379

Review 4. Mycobacterium bovis: From Genotyping to Genome Sequencing.

Authors: Ana M S Guimaraes; Cristina K Zimpel
Journal: Microorganisms Date: 2020-05-03

5. Mycobacterium tuberculosis complex lineage 5 exhibits high levels of within-lineage genomic diversity and differing gene content compared to the type strain H37Rv.

Authors: C N'Dira Sanoussi; Mireia Coscolla; Boatema Ofori-Anyinam; Isaac Darko Otchere; Martin Antonio; Stefan Niemann; Julian Parkhill; Simon Harris; Dorothy Yeboah-Manu; Sebastien Gagneux; Leen Rigouts; Dissou Affolabi; Bouke C de Jong; Conor J Meehan
Journal: Microb Genom Date: 2021-07

5 in total