Ravi Vijaya Satya, Nela Zavaljevski, Jaques Reifman.
Abstract
With ever-increasing numbers of microbial genomes being sequenced, efficient tools are needed to perform strain-level identification of any newly sequenced genome. Here, we present the SNP identification for strain typing (SNIT) pipeline, a fast and accurate software system that compares a newly sequenced bacterial genome with other genomes of the same species to identify single nucleotide polymorphisms (SNPs) and small insertions/deletions (indels). Based on this information, the pipeline analyzes the polymorphic loci present in all input genomes to identify the genome that has the fewest differences from the newly sequenced genome. Similarly, for each of the other genomes, SNIT identifies the input genome with the fewest differences. Results from five bacterial species show that the SNIT pipeline identifies the correct closest neighbor with 75% to 100% accuracy. The SNIT pipeline is available for download at http://www.bhsai.org/snit.html.
Year: 2011 PMID: 21902825 PMCID: PMC3182885 DOI: 10.1186/1751-0473-6-14
Source DB: PubMed Journal: Source Code Biol Med ISSN: 1751-0473
Figure 1. Outline of the SNP identification pipeline. The tandem repeat regions in the input genomes can be masked using the Tandem Repeat Finder (TRF) program. Each input genome is aligned against a user-specified reference genome. The lists of differentiating SNPs and indels between each pair of input genomes are constructed from these pairwise alignments.
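The closest-neighbor step described in the abstract and in the pipeline outline can be illustrated with a minimal Python sketch. It is not the SNIT implementation. It assumes a hypothetical representation in which each genome is reduced to a map from reference position to the allele that differs from the reference, and it counts a difference wherever two genomes disagree at a position. The names `Variants`, `count_differences`, and `closest_neighbors`, as well as the strain data, are invented for illustration.

```python
from typing import Dict, Tuple

# Hypothetical representation (an assumption, not SNIT's data model):
# for each genome, a map from reference position to the allele observed
# there when it differs from the reference genome.
Variants = Dict[int, str]


def count_differences(a: Variants, b: Variants) -> int:
    """Count reference positions at which two genomes carry different alleles."""
    positions = set(a) | set(b)
    # A missing key means the genome matches the reference at that position.
    return sum(1 for p in positions if a.get(p) != b.get(p))


def closest_neighbors(genomes: Dict[str, Variants]) -> Dict[str, Tuple[str, int]]:
    """For every genome, return the other genome with the fewest differences.

    Requires at least two genomes. Ties are broken alphabetically by name.
    """
    result = {}
    for name, variants in genomes.items():
        candidates = [
            (count_differences(variants, other_variants), other)
            for other, other_variants in genomes.items()
            if other != name
        ]
        diff, best = min(candidates)
        result[name] = (best, diff)
    return result


# Made-up example data: three strains described relative to one reference.
genomes = {
    "strain_A": {101: "T", 250: "G"},
    "strain_B": {101: "T", 250: "G", 900: "C"},
    "strain_C": {400: "A"},
}
neighbors = closest_neighbors(genomes)
```

In this toy input, `strain_A` and `strain_B` differ at a single position, so each is reported as the other's closest neighbor, while `strain_C` is paired with whichever strain it disagrees with least.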
Input parameters used for testing SNIT
| Parameter | Value |
|---|---|
| Minimum MUMmer cluster length | 100 |
| Minimum MUMmer exact match | 50 |
| Maximum MUMmer gap | 49 |
| Minimum large indel size | 50 |
| Minimum conserved flank length | 50 |
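As a rough illustration of how the three MUMmer settings in this table might be passed to an aligner, the sketch below builds an argument list for MUMmer's `nucmer` program using its `-c` (minimum cluster length), `-l` (minimum exact match), `-g` (maximum gap), and `-p` (output prefix) options. Whether SNIT invokes `nucmer` in this way is an assumption; the function name and file names are hypothetical, and the command is only constructed, not executed.

```python
from typing import List

# Values taken from the parameter table above.
MIN_CLUSTER = 100  # minimum MUMmer cluster length
MIN_MATCH = 50     # minimum MUMmer exact match
MAX_GAP = 49       # maximum MUMmer gap


def nucmer_command(reference: str, query: str, prefix: str) -> List[str]:
    """Build (but do not run) a nucmer argument list for one alignment.

    Assumption: the table's MUMmer settings map onto nucmer's
    -c, -l, and -g options. The source does not confirm this mapping.
    """
    return [
        "nucmer",
        "-c", str(MIN_CLUSTER),
        "-l", str(MIN_MATCH),
        "-g", str(MAX_GAP),
        "-p", prefix,
        reference,
        query,
    ]


# Hypothetical file names, for illustration only.
cmd = nucmer_command("reference.fasta", "query.fasta", "query_vs_reference")
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` on a system where MUMmer is installed.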
Summary of the results for five different bacterial species
| Species | No. of genomes | Combined Size | Time | Accuracy (%) |
|---|---|---|---|---|
| | 7 | 36 | 4 | 100 |
| | 11 | 22 | 3 | 100 |
| | 4 | 18 | 2 | 100 |
| | 10 | 59 | 19 | 100 |
| | 20 | 144 | 45 | 75 |
Accuracy is defined as the percentage of genomes for which the correct closest neighbors were identified based on published phylogenies for the species. The runs were performed using a single processor on a 3.6 GHz dual processor system with 4 GB RAM.
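The accuracy definition above amounts to a simple ratio, sketched here in Python. The sketch assumes one accepted closest neighbor per genome, which is a simplification; the function name and the strain labels are hypothetical and do not come from the paper.

```python
from typing import Dict


def accuracy_percent(predicted: Dict[str, str], expected: Dict[str, str]) -> float:
    """Percentage of genomes whose predicted closest neighbor matches the expected one.

    `expected` stands in for neighbors taken from a published phylogeny.
    Assumption: each genome has exactly one accepted closest neighbor.
    """
    if not expected:
        raise ValueError("expected must contain at least one genome")
    correct = sum(1 for genome, neighbor in expected.items()
                  if predicted.get(genome) == neighbor)
    return 100.0 * correct / len(expected)


# Made-up example: three of four genomes are assigned the expected neighbor.
expected = {"g1": "g2", "g2": "g1", "g3": "g4", "g4": "g3"}
predicted = {"g1": "g2", "g2": "g1", "g3": "g4", "g4": "g1"}
score = accuracy_percent(predicted, expected)  # 75.0
```

By the same arithmetic, an accuracy of 75% on a set of 20 genomes corresponds to 15 genomes with correctly identified closest neighbors.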
Figure 2. Graphical user interface of the SNIT pipeline. Screenshot of the main interface showing the default values of the various input parameters.