Todd Z DeSantis, Keith Keller, Ulas Karaoz, Alexander V Alekseyenko, Navjeet N S Singh, Eoin L Brodie, Zhiheng Pei, Gary L Andersen, Niels Larsen.
Abstract
BACKGROUND: Terabyte-scale collections of string-encoded data are expected from consortia efforts such as the Human Microbiome Project (http://nihroadmap.nih.gov/hmp). Intra- and inter-project data similarity searches are enabled by rapid k-mer matching strategies. Software applications for sequence database partitioning, guide tree estimation, molecular classification and alignment acceleration have benefited from embedded k-mer searches as sub-routines. However, a rapid, general-purpose, open-source, flexible, stand-alone k-mer tool has not been available.
Year: 2011 PMID: 21524302 PMCID: PMC3097142 DOI: 10.1186/1472-6785-11-11
Source DB: PubMed Journal: BMC Ecol ISSN: 1472-6785 Impact factor: 2.964
Table 1. Simrank database binary file structure and storage requirements.
| File Segment | File Element | Storage Requirement (bytes) |
|---|---|---|
| 1 | F, string ID field size | 10 |
| 2 | K, k-mer length | 10 |
| 3 | N, string count | 10 |
| 4 | string ID array | |
| 5 | offset arrays (a) | |
| 6 | k-mer array (b) | |
| 7 | offsets index array (c) | 4 |
| 8 | offsets lengths array (d) | 4 |
| 9 | unique k-mers per string array (e) | 4 |
| 10 | k, unique k-mer count | 10 |
| 11 | file position of segment 6 | 10 |
a Each k-mer generates a vector of string indices, encoded as an integer array of the offsets required to "visit" each string index containing the k-mer. k is the count of unique k-mers, and s_i is the count of strings containing the i-th k-mer. Each offset is stored as a 4-byte integer.
b Lexically sorted ASCII text strings of each unique k-mer, stored as one byte per character.
c 4-byte integer list of file positions for the start of each k-mer's list of offsets.
d 4-byte integer list of the byte length of each k-mer's list of offsets.
e 4-byte integer list of the count of unique k-mers in each string.
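Footnote (a) describes storing, for each k-mer, the steps needed to walk from one matching string index to the next rather than the indices themselves. A minimal sketch of that idea follows. The function names are hypothetical, the starting position of 0 and the little-endian byte order are assumptions made for illustration, and nothing here is taken from the Simrank source code.

```python
import struct

def encode_offsets(string_indices):
    """Convert string indices into the successive steps between them."""
    offsets = []
    previous = 0  # assumed starting position; the source does not specify it
    for index in sorted(string_indices):
        offsets.append(index - previous)
        previous = index
    return offsets

def decode_offsets(offsets):
    """Rebuild the string indices by accumulating the steps."""
    indices = []
    position = 0
    for step in offsets:
        position += step
        indices.append(position)
    return indices

def pack_offsets(offsets):
    """Serialise offsets as 4-byte unsigned integers (byte order assumed)."""
    return struct.pack("<%dI" % len(offsets), *offsets)

def unpack_offsets(blob):
    """Read back a byte string written by pack_offsets."""
    count = len(blob) // 4
    return list(struct.unpack("<%dI" % count, blob))

# Toy example: one k-mer that occurs in strings 3, 7, 8 and 20.
indices = [3, 7, 8, 20]
offsets = encode_offsets(indices)
blob = pack_offsets(offsets)
restored = decode_offsets(unpack_offsets(blob))
```

With four matching strings the packed list occupies 16 bytes, which is the kind of per-k-mer byte length that segment 8 of the table records.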
Figure 1. Search duration and relative hit results. Comparison of search duration and hits between search tools with various data sets. Data sets are described in Table 2. Search duration is expressed in seconds and shown in log scale. Hit results are expressed as a percentage in relation to subject hit counts from BLAST's local alignments.
Table 2. Data sets used for performance evaluation.
| Data Set | String Type | Mean Length | Database Count | Query Count | Alphabet Size | k-mer Length | Total Database k-mers |
|---|---|---|---|---|---|---|---|
| 16S (a) | DNA | 1350 | 188,073 | 2000 | 4 | 7 | 16,384 |
| Pyro (b) | DNA | 150 | 501,532 | 500 | 4 | 6 | 4,096 |
| ITS (c) | DNA | 627 | 212,367 | 2000 | 4 | 6 | 4,096 |
| Shuffled (d) | DNA | 687 | 1,000,000 | 1000 | 4 | 7 | 16,384 |
| gpI (e) | RNA | 398 | 20,085 | 5000 | 4 | 7 | 16,360 |
| GP120 (f) | Protein | 175 | 68,119 | 2000 | 20 | 4 | 98,695 |
| Institutes (g) | Text | 121 | 23,768 | 1000 | 47/61 | 4 | 67,287 |
a Greengenes 16S rRNA gene collection (DeSantis, 2006)
b Roche-454 pyrosequences from gastrointestinal contents (Ochman, 2010)
c Internal Transcribed Spacer region from eukaryotic ribosomal genes.
d Derived from random repetitive shuffling of Ralstonia solanacearum strain UW486 endoglucanase precursor, DQ657652 (Castillo and Greenberg, 2007)
e Group I catalytic introns RFAM RF00028 (Griffiths-Jones, et al., 2003)
f HIV Envelope glycoprotein PFAM PF00516 (Finn, 2008)
g Institute names as displayed in GenBank records. For BLAST and SSAHA2, all non-alphanumeric characters were interpreted as a space, giving a total alphabet size of 47; for Simrank, none of the 61 unique characters was substituted.
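The last column of the table can be checked against the size of the k-mer space, which is the alphabet size raised to the k-mer length. For the 16S, Pyro, ITS and Shuffled sets the reported totals equal 4^7 = 16,384 or 4^6 = 4,096, so every possible k-mer occurs; gpI (16,360 of 16,384) and GP120 (98,695 of 20^4 = 160,000) fall short of the full space. The sketch below shows the arithmetic on a toy database. The helper names are hypothetical, and reading the column as a count of distinct k-mers observed is an interpretation of the table rather than something the source states.

```python
def unique_kmers(strings, k):
    """Collect every distinct substring of length k across all strings."""
    seen = set()
    for s in strings:
        for i in range(len(s) - k + 1):
            seen.add(s[i:i + k])
    return seen

def kmer_space(alphabet_size, k):
    """Number of possible k-mers over an alphabet of the given size."""
    return alphabet_size ** k

# Toy database (made up for illustration, not one of the data sets above).
toy_db = ["ACGTAC", "CGTACG"]
observed = unique_kmers(toy_db, 3)
coverage = len(observed) / kmer_space(4, 3)

# Upper bounds for the alphabet sizes and k-mer lengths listed in the table.
dna_k7 = kmer_space(4, 7)
dna_k6 = kmer_space(4, 6)
protein_k4 = kmer_space(20, 4)
```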
Figure 2. Similarity score comparison. Comparison of DNA sequence similarity scores observed when a single DNA sequence collection is compared to a reference database using either Simrank or an alignment-based scoring system.
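The caption does not define how a k-mer similarity score is computed. As an illustration only, the sketch below assumes one common formulation: the number of unique k-mers two strings share, divided by the smaller of their unique k-mer counts. The function names are hypothetical and the normalisation is an assumption, not a statement of how Simrank scores hits.

```python
def kmer_set(s, k):
    """Return the set of distinct k-mers in a string."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def shared_kmer_similarity(query, subject, k):
    """Shared unique k-mers over the smaller unique k-mer count (assumed form)."""
    q = kmer_set(query, k)
    s = kmer_set(subject, k)
    if not q or not s:
        return 0.0  # a string shorter than k has no k-mers to compare
    return len(q & s) / min(len(q), len(s))

# Toy sequences that differ at a single position.
identical = shared_kmer_similarity("ACGTACGT", "ACGTACGT", 4)
one_mismatch = shared_kmer_similarity("ACGTACGT", "ACGTTCGT", 4)
too_short = shared_kmer_similarity("ACG", "ACGTACGT", 4)
```

In this toy case a single substitution removes most of the shared 4-mers, which illustrates why a k-mer score and an alignment identity need not agree numerically for the same pair of sequences.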
Figure 3. Simrank sensitivity and specificity. Comparison of sensitivity and specificity of Simrank DNA searches with various k-mer lengths. True hits were defined as those with ≥97% alignment identity. The x-axis is the false positive rate (FPR: Simrank hits to subjects with <97% alignment identity); the y-axis is the true positive rate (TPR: Simrank hits to subjects with ≥97% alignment identity). Each curve represents the balance of TPR and FPR through the range of Simrank thresholds. The dashed line at y = 0.95 represents a 95% TPR. The inset table lists the FPR and Simrank cutoff required for each k-mer search to obtain a 95% TPR.
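The quantities in this caption can be sketched as follows: each query-subject pair has a k-mer score and an alignment identity, a pair counts as a true hit when identity is at least 97%, and the score threshold is swept to trade TPR against FPR. The function names and the toy data are hypothetical; this is one way to compute such a curve, not the evaluation code behind the figure.

```python
def rates_at_threshold(pairs, threshold, identity_cutoff=97.0):
    """pairs: (kmer_score, alignment_identity_percent) for each comparison."""
    tp = fp = fn = tn = 0
    for score, identity in pairs:
        is_true = identity >= identity_cutoff
        reported = score >= threshold
        if reported and is_true:
            tp += 1
        elif reported:
            fp += 1
        elif is_true:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr

def highest_cutoff_for_tpr(pairs, target_tpr=0.95):
    """Strictest score threshold that still reaches the target TPR."""
    for threshold in sorted({score for score, _ in pairs}, reverse=True):
        tpr, fpr = rates_at_threshold(pairs, threshold)
        if tpr >= target_tpr:
            return threshold, tpr, fpr
    return None

# Toy data, invented for illustration: (k-mer score, alignment identity %).
pairs = [(0.9, 99.0), (0.8, 98.0), (0.7, 96.0), (0.6, 97.5), (0.3, 90.0)]
cutoff = highest_cutoff_for_tpr(pairs)
```

Lowering the threshold can only add reported hits, so TPR and FPR both rise together; the inset table in the figure reports the point on each curve where TPR first reaches 95%.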