Todd J Treangen, Brian D Ondov, Sergey Koren, Adam M Phillippy.
Abstract
Whole-genome sequences are now available for many microbial species and clades; however, existing whole-genome alignment methods are limited in their ability to compare many sequences simultaneously. Here we present the Harvest suite of core-genome alignment and visualization tools for the rapid and simultaneous analysis of thousands of intraspecific microbial strains. Harvest includes Parsnp, a fast core-genome multi-aligner, and Gingr, a dynamic visual platform. Together they provide interactive core-genome alignments, variant calls, recombination detection, and phylogenetic trees. Using simulated and real data, we demonstrate that our approach exhibits unrivaled speed while maintaining the accuracy of existing methods. The Harvest suite is open-source and freely available from: http://github.com/marbl/harvest.
Year: 2014 PMID: 25410596 PMCID: PMC4262987 DOI: 10.1186/s13059-014-0524-x
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Core-genome SNP accuracy for simulated datasets
| Method | Paradigmᵃ | FP (low) | FN (low) | FP (med) | FN (med) | FP (high) | FN (high) | TPR | FDR |
|---|---|---|---|---|---|---|---|---|---|
| Mauve | | 148 | 318 | 198 | 2,877 | 100 | 30,378 | 0.974 | 0.0004 |
| Mauve (c) | | 0 | 0 | 2 | 38 | 6 | 649 | 0.999 | 0 |
| Mugsy | | 1,261ᵇ | 395 | 1,928 | 3,371 | 1,335 | 34,923 | 0.970 | 0.0036 |
| Mugsy (c) | | 2 | 0 | 2 | 0 | 1 | 81 | 0.999 | 0 |
| Parsnp | | 23 | 423 | 45 | 3,494 | 7 | 35,466 | 0.970 | 0.0001 |
| Parsnp (c) | | 0 | 24 | 0 | 603 | 0 | 10,989 | 0.992 | 0 |
| kSNP | | 259 | 600 | 908 | 19,730 | 1,968 | 916,127 | 0.280 | 0.0086 |
| Smalt | | 33 | 110 | 0 | 1,307 | 55 | 22,957 | 0.981 | 0.0001 |
| BWA | | 0 | 168 | 16 | 1,947 | 27 | 27,091 | 0.9775 | 0.0000 |
Data shown indicates performance metrics of the evaluated methods on the three simulated E. coli datasets (low, medium, and high). Method: Tool used.
(c) indicates that the aligner was run on closed genomes rather than draft assemblies.
False positive (FP) and false negative (FN) counts for the three mutation rates (low, med, and high). True positive rate TPR: TP/(TP + FN). False discovery rate FDR: FP/(TP + FP). A total of 1,299,178 SNPs were introduced into the 32-genome dataset, across all three mutational rates.
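As a worked check of these definitions, the sketch below recomputes TPR and FDR for the Mauve row (draft assemblies). It assumes that TP equals the total number of introduced SNPs minus the method's FN count, which the table does not state explicitly; the function name `snp_accuracy` is hypothetical.

```python
TOTAL_TRUTH_SNPS = 1_299_178  # SNPs introduced across all three mutation rates


def snp_accuracy(fp_counts, fn_counts, total_truth=TOTAL_TRUTH_SNPS):
    """Return (TPR, FDR) from per-dataset FP and FN counts.

    Assumption: TP = total_truth - sum(FN), i.e. every truth SNP is
    either detected (TP) or missed (FN).
    """
    fp = sum(fp_counts)
    fn = sum(fn_counts)
    tp = total_truth - fn
    tpr = tp / (tp + fn)  # TPR = TP / (TP + FN)
    fdr = fp / (tp + fp)  # FDR = FP / (TP + FP)
    return tpr, fdr


# Mauve on draft assemblies: FP and FN for low, medium, and high mutation rates
mauve_tpr, mauve_fdr = snp_accuracy([148, 198, 100], [318, 2_877, 30_378])
print(f"TPR={mauve_tpr:.3f} FDR={mauve_fdr:.4f}")
```

Under that assumption, the Mauve row rounds to the reported TPR of 0.974 and FDR of 0.0004.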
ᵃ Paradigm employed by each method.
ᵇ Mugsy’s lower precision was traced to a paralog misalignment that resulted in many false-positive SNPs.
CGA: core-genome alignment; FN: number of truth SNP calls not detected; FP: number of SNP calls that are not in the truth set; KMER: k-mer-based SNP calls; MAP: read mapping; TP: number of SNP calls that agreed with the truth; WGA: whole-genome alignment.
Figure 1. Core-genome SNP accuracy for simulated datasets. Results are averaged across low, medium, and high mutation rates. Red squares denote alignment-based SNP calls on draft assemblies, green squares denote alignment-based SNP calls on closed genomes, and blue triangles denote read mapping. Full results for each dataset are given in Table 1.
Figure 2. Branch errors for simulated datasets. Simulated E. coli trees are shown for medium mutation rate (0.0001 per base per branch). (A) shows branch length errors as bars, with overestimates of branch length above each branch and underestimates below each branch. Maximum overestimate of branch length was 2.15% (bars above each branch) and maximum underestimate was 4.73% (bars below each branch). (B) shows branch SNP errors as bars, with false-positive errors above each branch and false-negative errors below each branch. The maximum FP SNP value is 6 (bars above each branch) and maximum FN SNP value is 23 (bars below each branch). Note that the bar heights have been normalized by the maximum value for each tree and are not comparable across trees. Outlier results from Mugsy were excluded from the branch length plot, and kSNP results are not shown.
All genome alignment methods performed similarly on closed genomes, with Mauve and Mugsy exhibiting the best sensitivity (Table 1).
Comparison of locally collinear alignment block (LCB) count for simulated datasets, on assembled and finished genomes
| Method | Low | Medium | High |
|---|---|---|---|
| Mauve | 325 | 363 | 519 |
| Mauve (c) | 150 | 174 | 333 |
| Mugsy | 10,977 | 11,194 | 16,632 |
| Mugsy (c) | 237 | 247 | 351 |
| Parsnp | 205 | 271 | 344 |
| Parsnp (c) | 139 | 190 | 506 |
Method: Tool used.
(c) indicates that the aligner was run on closed genomes rather than draft assemblies.
Low: >99.99% similarity, Medium: >99.9% similarity, High: >99% similarity.
Comparison to the 31-genome Mugsy benchmark
| Method | Time | Core | LCBs |
|---|---|---|---|
| Parsnp | 0.3 min | 1,428,407 | 1,171 |
| Mugsy | 100 min | 1,590,820 | 2,394 |
| Mauve | 377 min | 1,568,715 | 1,366 |
| NUCmer + TBA | 80 min | 1,457,575 | 27,075 |
Time: Method runtime from input to output. Core: Size of the aligned core genome, measured in base pairs. LCBs: Number of locally collinear blocks in the alignment.
Figure 3. Gingr visualization of 826 genomes aligned with Parsnp. The leaves of the reconstructed phylogenetic tree (left) are paired with their corresponding rows in the multi-alignment. A genome has been selected (rectangular aqua highlight), resulting in a fisheye zoom of several leaves and their rows. A SNP density plot (center) reveals the phylogenetic signature of several clades, in this case within the fully aligned hpd operon (hpdB, hpdC, hpdA). The light gray regions flanking the operon indicate unaligned sequence. When fully zoomed (right), individual bases and SNPs can be inspected.
Figure 4. Conserved presence of antibiotic resistance gene in outbreak. Gingr visualization of a conserved bacitracin resistance gene within the Parsnp alignment of 826 P. difficile genomes. Vertical lines indicate SNPs, providing visual support for subclades within this outbreak dataset.
Figure 5. Comparison of Parsnp and Comas et al. results on dataset. A Venn diagram displays SNPs unique to Comas et al. [98] (left, blue), unique to Parsnp (right, red), and shared between the two analyses (middle, brown). On top, an unrooted reference phylogeny is given based on the intersection of shared SNPs produced by both methods (90,295 SNPs). On bottom, the phylogenies of Comas et al. (left) and Parsnp (right) are given. Pairs of trees are annotated with their Robinson-Foulds distance (RFD) and percentage of shared splits. The Comas et al. and Parsnp trees are largely concordant with each other and the reference phylogeny. All major clades are shared and well supported by all three trees.
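To make the tree annotations concrete, here is a minimal sketch of how a Robinson-Foulds distance and a percentage of shared splits can be computed once each unrooted tree has been reduced to its set of non-trivial bipartitions. The toy five-taxon trees, the helper names, and the choice to express shared splits relative to the larger split set are illustrative assumptions; the caption does not say how its percentages were normalized.

```python
def canonical_split(side, all_taxa):
    """Represent a bipartition by the side that excludes a fixed reference taxon."""
    side = frozenset(side)
    reference = min(all_taxa)
    return side if reference not in side else frozenset(all_taxa) - side


def robinson_foulds(splits_a, splits_b):
    """RF distance: number of splits present in exactly one of the two trees."""
    return len(splits_a ^ splits_b)


def percent_shared(splits_a, splits_b):
    """Shared splits as a percentage of the larger split set (an assumption)."""
    largest = max(len(splits_a), len(splits_b))
    return 100.0 * len(splits_a & splits_b) / largest if largest else 100.0


taxa = {"A", "B", "C", "D", "E"}
# Toy unrooted trees on five taxa, each with two non-trivial splits.
tree_1 = {canonical_split(s, taxa) for s in [{"A", "B"}, {"D", "E"}]}
tree_2 = {canonical_split(s, taxa) for s in [{"A", "B"}, {"C", "D"}]}

print(robinson_foulds(tree_1, tree_2))  # 2: one split unique to each tree
print(percent_shared(tree_1, tree_2))   # 50.0: one of two splits is shared
```

In practice the split sets would be extracted from the trees by a phylogenetics library rather than written by hand.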
Figure 6. Gingr visualization of 171 genomes aligned with Parsnp. The visual layout is the same as Figure 3, but unlike Figure 3, a SNP density plot across the entire genome is displayed. Major clades are visible as correlated SNP densities across the length of the genome.
Performance profile of Parsnp runtime (MUM + alignment) on all evaluated datasets
| Dataset | Genomesᵃ | Mbpᵇ | MUMᶜ | MUSCLEᵈ | Totalᵉ | Memᶠ |
|---|---|---|---|---|---|---|
| | 32 | 142 | 2 | 2 | 4 | 2 |
| | 171 | 424 | 12 | 20 | 32 | 14 |
| | 826 | 1,392 | 46 | 39 | 85 | 71 |
| | 10,000 | 21,000 | 668 | 201 | 869 | 309 |
Results were generated on a 32-core, 2.2 GHz, 1 TB RAM Linux server. Dataset: the genome set.
ᵃ The number of genomes aligned.
ᵇ Total Mbp aligned.
ᶜ The time spent finding maximal unique matches.
ᵈ The time spent performing gapped multi-alignment with MUSCLE.
ᵉ Total Parsnp runtime (sum of MUM and MUSCLE).
ᶠ Maximum memory usage.
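The relationship between total runtime and its two stages can be checked directly against the table rows. The sketch below does so and also derives total runtime per genome; it makes no assumption about the time unit, which this table does not state, and the variable names are illustrative.

```python
# Rows from the table: (genomes, Mbp, MUM time, MUSCLE time, total time)
rows = [
    (32, 142, 2, 2, 4),
    (171, 424, 12, 20, 32),
    (826, 1_392, 46, 39, 85),
    (10_000, 21_000, 668, 201, 869),
]

for genomes, mbp, mum, muscle, total in rows:
    # Total runtime is defined as the sum of the MUM and MUSCLE stages.
    assert mum + muscle == total
    per_genome = total / genomes  # runtime per genome, in the table's time unit
    mum_share = mum / total       # fraction of runtime spent in the MUM search
    print(f"{genomes:>6} genomes: {per_genome:.3f} time units/genome, "
          f"MUM search = {mum_share:.0%} of runtime")
```

Every row satisfies the sum, and the per-genome runtime stays within the same order of magnitude as the dataset grows from 32 to 10,000 genomes.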