James B Pettengill, Arthur W Pightling, Joseph D Baugher, Hugh Rand, Errol Strain.
Abstract
The adoption of whole-genome sequencing within the public health realm for molecular characterization of bacterial pathogens has been followed by an increased emphasis on real-time detection of emerging outbreaks (e.g., food-borne salmonellosis). In turn, large databases of whole-genome sequence data are being populated. These databases currently contain tens of thousands of samples and are expected to grow to hundreds of thousands within a few years. For these databases to be of optimal use, one must be able to interrogate them quickly and to determine accurately the genetic distances among a set of samples. Doing so is challenging because of both biological (evolutionarily diverse samples) and computational (petabytes of sequence data) issues. We evaluated seven measures of genetic distance, which were estimated from either k-mer profiles (Jaccard, Euclidean, Manhattan, Mash Jaccard, and Mash distances) or nucleotide sites (NUCmer and an extended multi-locus sequence typing (MLST) scheme). In empirical data (whole-genome sequence data from 18,997 Salmonella isolates), features of the samples (e.g., genomic, assembly, and contamination) cause distances inferred from k-mer profiles, which treat absent data as informative, to capture the distance between samples less accurately than distances inferred from differences in nucleotide sites. Thus, site-based distances, like NUCmer and extended MLST, are superior in performance, but accessing the computing resources necessary to perform them may be challenging when analyzing large databases.
Year: 2016 PMID: 27832109 PMCID: PMC5104361 DOI: 10.1371/journal.pone.0166162
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Summary of genetic distances used to infer the relationships among samples.
| Class | Distance | Description |
|---|---|---|
| Site-based | NUCmer | Suffix array method to efficiently perform pairwise whole-genome alignment |
| | Extended MLST | Employs the Basic Local Alignment Search Tool to perform pairwise comparisons of predicted open reading frames |
| k-mer based | Jaccard Distance | 1 − Jaccard index (i.e., the intersection divided by the union of all k-mers found between two samples) |
| | Manhattan Distance | Sum of the absolute differences between the abundance of each k-mer present between two samples |
| | Euclidean Distance | The square root of the sum of the squares of all pairwise differences in k-mer abundance |
| | Mash Distance | Employs the MinHash [ |
| | Mash Jaccard Distance | The Jaccard Distance (as described above) but based on the sketch size (e.g., the number of hashes) |
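The k-mer based definitions in the table can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: the function names, the toy sequences, and the use of SHA-1 as the hash are assumptions made for illustration (Mash itself uses a different hash function), and k-mers are not canonicalized. The conversion from a Jaccard index to a Mash distance follows the published Mash formula, D = −(1/k) ln(2j / (1 + j)).

```python
import hashlib
import math
from collections import Counter


def kmer_profile(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq (hypothetical helper, no canonicalization)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def jaccard_distance(a: Counter, b: Counter) -> float:
    """1 - |A ∩ B| / |A ∪ B| over the sets of k-mers present in each sample."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def manhattan_distance(a: Counter, b: Counter) -> int:
    """Sum of absolute differences in abundance; a missing k-mer counts as 0."""
    return sum(abs(a[kmer] - b[kmer]) for kmer in set(a) | set(b))


def euclidean_distance(a: Counter, b: Counter) -> float:
    """Square root of the sum of squared differences in abundance."""
    return math.sqrt(sum((a[kmer] - b[kmer]) ** 2 for kmer in set(a) | set(b)))


def bottom_sketch(kmers, size: int) -> list:
    """Keep the `size` smallest 64-bit hash values of the distinct k-mers.

    SHA-1 is a stand-in chosen for this sketch, not the hash Mash uses.
    """
    hashes = {
        int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")
        for kmer in kmers
    }
    return sorted(hashes)[:size]


def mash_jaccard_index(sketch_a: list, sketch_b: list, size: int) -> float:
    """Estimate the Jaccard index from the `size` smallest hashes of the merged sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 1.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j: float, k: int) -> float:
    """Convert a Jaccard index j into a Mash distance for k-mer size k."""
    if j <= 0.0:
        return 1.0  # no shared hashes: report the maximum distance
    if j >= 1.0:
        return 0.0
    return -math.log(2.0 * j / (1.0 + j)) / k


# Toy usage with two made-up sequences and k = 3.
profile_a = kmer_profile("ACGTACGT", 3)
profile_b = kmer_profile("ACGTTCGT", 3)
j_est = mash_jaccard_index(bottom_sketch(profile_a, 100), bottom_sketch(profile_b, 100), 100)
```

Because a k-mer missing from one profile contributes its full abundance to each of these distances, a sample that merely lacks data (for example, an incomplete assembly) looks more distant than it is, which is the sense in which k-mer profiles treat absent data as informative.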
Mean and variance of AUC values for each of the different distance methods for each serovar.
| Serovar (N) | Euclidean (k-mer based) | Jaccard (k-mer based) | Manhattan (k-mer based) | Mash (k-mer based) | Mash Jaccard (k-mer based) | Extended MLST (site-based) | NUCmer (site-based) | Average |
|---|---|---|---|---|---|---|---|---|
| Agona (282) | 0.767 (0.014) | 0.849 (0.011) | 0.822 (0.023) | 0.935 (0.006) | 0.944 (0.005) | 0.985 (0) | 0.959 (0.006) | 0.894 |
| Enteritidis (4455) | 0.868 (0.002) | 0.876 (0.003) | 0.88 (0.003) | 0.959 (0) | 0.959 (0) | 0.987 (0) | 0.983 (0) | 0.930 |
| Heidelberg (580) | 0.919 (0.001) | 0.889 (0.002) | 0.925 (0.001) | 0.939 (0) | 0.943 (0) | 0.994 (0) | 0.984 (0) | 0.942 |
| Infantis (341) | 0.897 (0.002) | 0.919 (0.001) | 0.906 (0.001) | 0.979 (0) | 0.982 (0) | 0.988 (0) | 0.986 (0) | 0.951 |
| Kentucky (627) | 0.709 (0.013) | 0.749 (0.009) | 0.756 (0.011) | 0.875 (0.001) | 0.872 (0.002) | 0.968 (0) | 0.936 (0) | 0.838 |
| Montevideo (287) | 0.823 (0.002) | 0.832 (0.003) | 0.837 (0.004) | 0.921 (0.005) | 0.916 (0.007) | 0.980 (0.001) | 0.968 (0.001) | 0.897 |
| Newport (827) | 0.680 (0.007) | 0.677 (0.006) | 0.680 (0.008) | 0.819 (0.006) | 0.813 (0.003) | 0.976 (0) | 0.948 (0) | 0.799 |
| Senftenberg (232) | 0.793 (0.006) | 0.821 (0.003) | 0.814 (0.007) | 0.925 (0.001) | 0.933 (0.001) | 0.974 (0) | 0.958 (0) | 0.888 |
| Typhimurium (3475) | 0.822 (0.003) | 0.846 (0.003) | 0.846 (0.003) | 0.949 (0) | 0.948 (0) | 0.966 (0) | 0.969 (0) | 0.907 |
| Weltevreden (268) | 0.914 (0.002) | 0.915 (0.001) | 0.934 (0.001) | 0.931 (0.001) | 0.907 (0.001) | 0.983 (0) | 0.982 (0) | 0.938 |
| Average | 0.819 | 0.837 | 0.84 | 0.923 | 0.922 | 0.980 | 0.967 | |
Fig 1. Receiver operating characteristic (ROC) curves for each combination of distance (columns) and serovar (rows).
Lines within each panel represent the 100 replicate analyses performed.
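How an AUC value summarizes one ROC curve can be sketched as follows. The text here does not state how sample pairs were labelled for the ROC analysis, so the binary labels below (1 for a pair treated as truly closely related, 0 otherwise) are a hypothetical stand-in, as are the function name and the toy numbers. Under that assumption, the AUC equals the probability that a related pair has a smaller distance than an unrelated pair, with ties counted as one half.

```python
import statistics


def auc_from_distances(distances, is_related):
    """AUC via the rank (Mann-Whitney) formulation: smaller distance should mean 'related'.

    `is_related` holds hypothetical binary labels, one per pair of samples.
    """
    related = [d for d, r in zip(distances, is_related) if r]
    unrelated = [d for d, r in zip(distances, is_related) if not r]
    if not related or not unrelated:
        raise ValueError("need at least one related and one unrelated pair")
    wins = 0.0
    for d_rel in related:
        for d_unrel in unrelated:
            if d_rel < d_unrel:
                wins += 1.0
            elif d_rel == d_unrel:
                wins += 0.5  # ties count as half a win
    return wins / (len(related) * len(unrelated))


# Toy replicates (made-up numbers): each replicate is (distances, labels).
replicates = [
    ([0.1, 0.2, 0.3, 0.4], [1, 1, 0, 0]),
    ([0.1, 0.3, 0.2, 0.4], [1, 1, 0, 0]),
]
aucs = [auc_from_distances(d, labels) for d, labels in replicates]
mean_auc = statistics.mean(aucs)
# Sample variance is used here; which variance estimator the study used is not stated.
var_auc = statistics.variance(aucs)
```

With 100 replicate analyses per panel, the same two summary calls would yield one mean and one variance per distance and serovar.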
Median runtime estimates in seconds.
| Class | Distance | Median (s) |
|---|---|---|
| k-mer based | Euclidean | 53.123 |
| | Jaccard | 2.781 |
| | JellyFish | 118.472 |
| | Manhattan | 53.454 |
| | Mash and Mash Jaccard | 1.226 |
| Site-based | Extended MLST | 46.950 |
| | NUCmer | 9.425 |
* For a single pairwise comparison
¶ To count and dump a text file of k-mers per sample
§ Includes the sketching of the focal sample and estimating the distance between it and all other samples
Fig 2. Boxplots illustrating the variation in area under the curve (AUC) values across the distance by serovar combinations.
Note that the minimum on the y-axis is 0.4.