| Literature DB >> 17018144 |
William A Stokes1, Benjamin S Glick.
Abstract
BACKGROUND: Molecular biologists work with DNA databases that often include entire genomes. A common requirement is to search a DNA database to find exact matches for a nondegenerate or partially degenerate query. The software programs available for such purposes are normally designed to run on remote servers, but an appealing alternative is to work with DNA databases stored on local computers. We describe a desktop software program termed MICA (K-Mer Indexing with Compact Arrays) that allows large DNA databases to be searched efficiently using very little memory.Entities:
Mesh:
Year: 2006 PMID: 17018144 PMCID: PMC1618866 DOI: 10.1186/1471-2105-7-427
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
MICA file structure and data storage requirements
| File Element | Generic Storage Requirement (bytes) | Storage Requirement for Chromosome 1 (bytes) |
| 1 | 1 | |
| 4 | 4 | |
| 1 | 1 | |
| 245,522,847 (234 MB) | ||
| SEGMENT TOTAL | 6 + | 245,522,853 (234 MB) |
| 1 | 1 | |
| 4 | 4 | |
| 1 | 1 | |
| 4 | ||
| 4 | 4 | |
| 4 | 4 | |
| (4 | ||
| (number of partially degenerate | ||
| 8 | 296 | |
| SEGMENT TOTAL | Typically about 2 | |
A MICA file consists of a Sequence Segment (elements A – D) followed by an Index Segment (elements E – M). If a sequence occupies more than 16 chunks, then loading of a MICA index consists of reading elements A – C of the Sequence Segment, skipping over the DNA sequence, and reading elements E – J of the Index Segment. The parameters for human chromosome 1 were: L = 245,522,847 bases; C = 3,747 chunks; S = 37 N-stretches (total of 22,695,000 N's).
(A) A single byte is used to specify the Sequence Segment format.
(B) The size of the Sequence Segment is specified by a 4-byte integer.
(C) A single byte is used to record properties of the sequence, including its topology (linear or circular) and its strandedness (single- or double-stranded).
(D) The DNA sequence is stored as a string of uppercase ASCII characters.
(E) A single byte is used to specify the Index Segment Format.
(F) The size of the Index Segment is specified by a 4-byte integer.
(G) A single byte is used to record the byte order ("endianness") of the index. If the byte order is opposite to that of the machine being used to run the queries, MICA corrects the byte order when processing the index data.
(H) The Chunk Counts Summary is a list of 44-byte integers representing the total number of times each nondegenerate K-mer appears in the sequence. For the MICA index, the 4-base nondegenerate DNA alphabet is arranged in the order G, A, T, C. Thus, the first nondegenerate K-mer listed is GGGG (K = 4) or GGGGGG (K = 6), and the last one listed is CCCC (K = 4) or CCCCCC (K = 6). This lexicographical order yields contiguous index reads for K-mers that end in the most common partially degenerate bases: R (A or G), Y (C or T), and W (A or T).
(I) The Degenerate K-mer Count is a 4-byte integer representing the total number of partially degenerate K-mers in the sequence.
(J) The N-Stretch Count S is a 4-byte integer representing the number of separate stretches of K or more consecutive N's.
(K) The Chunk Data Array is divided into 4partitions corresponding to the 4nondegenerate K-mers. Each partition contains a list of 2-byte integers representing the number of times the K-mer is present in each of the C chunks, followed by a list of 2-byte integers representing the intra-chunk positions of the K-mer in each of the C chunks. The first partition contains the data for GGGG (K = 4) or GGGGGG (K= 6), and the second partition contains the data for GGGA (K = 4) or GGGGGA (K = 6), and so on.
(L) The Degenerate Data Array is a list of the partially degenerate K-mers. Each partially degenerate K-mer is represented as a 4-byte integer that marks the absolute position of the K-mer, followed by a K-byte string that encodes the sequence of the K-mer.
(M) The N-Stretch Data Array consists of S pairs of 4-byte integers that represent the starting positions and lengths of the N-stretches.
Figure 1Pseudocode summary of the basic MICA search routines. See the text for additional information about memory management, intersection algorithms, and merge operations for degenerate K-mers.
Representative sizes, creation times, and loading times for MICA indexes
| Index Size (GB) | Index Creation Time (sec) | Index Loading Time (sec) | ||
| Chromosome 1 | 4 | 0.42 | 19.3 | 0.023 |
| 6 | 0.44 | 27.1 | 0.024 | |
| Random Sequence | 4 | 0.46 | 23.5 | 0.020 |
| 6 | 0.49 | 32.0 | 0.025 | |
| Human Genome | 4 | 5.33 | 262 | 0.49 |
| 6 | 5.67 | 345 | 0.44 |
The sequences of chromosome 1 and the 25 chromosomes comprising the human genome were obtained from the Ensembl database. We also tested a computer-generated random sequence of 246,000,000 bp containing equal proportions of G, A, T, and C. Index size refers to index elements E – M of the MICA file (see Table 1), and does not include the Sequence Segment. The creation time for each index includes the time needed to write the index to disk, but does not include the additional time needed to embed a copy of the sequence within the file; this additional embedding time for chromosome 1 was 16.4 sec for K = 4 or 19.6 sec for K = 6. Index loading refers to the reading of elements A – C and E – J of an index from disk into memory. For the human genome, the values listed are the sums of the values for the individual chromosomes.
Representative search times for K = 4
| Query | Chromosome 1 | Human Genome | ||
| Time (sec) | Hits | Time (sec) | Hits | |
| Nondegenerate 3-mers | 0.82 [0.51] | 6.96 × 106 | 11.5 | 8.90 × 107 |
| Nondegenerate 4-mers | 0.13 [0.028] | 1.69 × 106 | 2.5 | 2.17 × 107 |
| Nondegenerate 6-mers | 0.35 [0.11] | 160,702 | 5.8 | 2.05 × 106 |
| Nondegenerate 8-mers | 0.38 [0.11] | 16,631 | 6.2 | 213,099 |
| Nondegenerate 15-mers | 0.56 [0.11] | 1.39 | 9.0 | 5.81 |
| Nondegenerate 30-mers | 0.54 [0.10] | 1.03 | 8.3 | 1.24 |
| Nondegenerate 100-mers | 0.41 [0.069] | 1.01 | 6.1 | 1.02 |
| Nondegenerate 1000-mers | 0.14 [0.019] | 1.00 | 2.3 | 1.00 |
| 0.77 [0.095] | 1,130 | 13.6 | 14,041 | |
| GDGCHC ( | 0.43 [0.11] | 398,999 | 6.9 | 4,776,086 |
| GCCNNNNNGGC ( | 0.37 [0.12] | 44,761 | 6.1 | 520,776 |
| ACNNNNGTAYC ( | 1.55 [0.34] | 20,243 | 23.2 | 259,837 |
Both DNA strands were searched using K = 4. Results for the 3- to 1000-mer searches are average values obtained by searching with multiple queries. For 3-mers, all 64 possible nondegenerate queries were tested by extending each 3-mer to a partially degenerate 4-mer. For 4-mers, all 256 possible nondegenerate queries were tested. For 6- and 8-mers, 100 randomly chosen nondegenerate queries were tested. In the case of 15- to 1000-mers, each test involved 100 nondegenerate queries that were extracted randomly from chromosome 1 and checked to confirm that a given query had no more than 10 matches in the genome. The Alu 30-mer fragment GGCCGGGCGCGGTGGCTCACGCCTGTAATC is a conserved sequence found at the 5' ends of Alu repeat elements [14]. The three partially degenerate queries are the recognition sequences for the restriction enzymes Bsp1286I, BglI, and BaeI. For chromosome 1, the search times without brackets were obtained after pre-loading only file elements A – C and E – J (see Table 1) into memory, and the faster search times with brackets were obtained after pre-loading the entire file. For the entire genome, the search times include the time needed to load elements A – C and E – J of each file into memory. Thus, the data for chromosome 1 reflect the time needed to search a file that is already open, whereas the data for the entire genome reflect the time needed to search a set of unopened files. To ensure that the search times without brackets reflect MICA performance for newly opened indexes, each search was preceded by a large number of extraneous reads, which flushed the main memory of any prior data from the relevant index.
Representative search times for K = 6
| Query | Chromosome 1 | Human Genome | ||
| Time (sec) | Hits | Time (sec) | Hits | |
| Nondegenerate 3-mers | 1.1 [0.80] | 6.96 × 106 | 14.8 | 8.90 × 107 |
| Nondegenerate 4-mers | 0.28 [0.16] | 1.69 × 106 | 4.2 | 2.17 × 107 |
| Nondegenerate 6-mers | 0.043 [0.0032] | 160,702 | 0.96 | 2.05 × 106 |
| Nondegenerate 8-mers | 0.079 [0.011] | 16,631 | 1.6 | 213,099 |
| Nondegenerate 15-mers | 0.088 [0.0074] | 1.39 | 1.8 | 5.81 |
| Nondegenerate 30-mers | 0.13 [0.0076] | 1.03 | 1.8 | 1.24 |
| Nondegenerate 100-mers | 0.12 [0.0050] | 1.01 | 1.7 | 1.02 |
| Nondegenerate 1000-mers | 0.094 [0.0023] | 1.00 | 1.3 | 1.00 |
| 0.12 [0.0055] | 1,130 | 2.5 | 14,041 | |
| GDGCHC ( | 0.13 [0.031] | 398,999 | 2.5 | 4,776,086 |
| GCCNNNNNGGC ( | 0.44 [0.15] | 44,761 | 6.2 | 520,776 |
| ACNNNNGTAYC ( | 1.40 [0.37] | 20,243 | 21.1 | 259,837 |
Both DNA strands were searched using K = 6. The queries and the procedure were as described in Table 3, except that each 3- or 4-mer was extended to a partially degenerate 6-mer.
Figure 2Example of search times as a function of available RAM. To simulate searching with various amounts of free memory, we instructed MICA (K = 6) to search chromosome 1 for BaeI sites (ACNNNNGTAYC) using the procedures that would be followed if the indicated amounts of RAM were available. For these measurements, each search was preceded by a large number of extraneous reads, which flushed the main memory of any prior data from the index.