Done Stojanov, Sašo Koceski, Aleksandra Mileva, Nataša Koceska, Cveta Martinovska Bande.
Abstract
To speed up searches of massive DNA databases, the database is first indexed with a mapping function. Searching the indexed data structure then identifies exact query hits. If the database is searched against an annotated DNA query, such as a known promoter consensus sequence, the starting locations and the number of potential genes can be determined. This is particularly relevant when unannotated DNA sequences have to be functionally annotated. However, indexing a massive DNA database and searching an indexed data structure with millions of entries is time-consuming. In this paper, we propose a fast DNA database indexing and searching approach that identifies all query hits in the database without examining every entry in the indexed data structure; the approach limits the maximum length of a query that can be searched against the database. By applying the proposed indexing equation, the whole human genome could be indexed in 10 hours on a personal computer, assuming there is enough RAM to store the indexed data structure. Analysing the methodology proposed by Reneker, we observed that hits at starting positions [Formula: see text] are not reported if the database is searched against a query shorter than [Formula: see text] nucleotides, where [Formula: see text] is the length of the DNA database words being mapped and [Formula: see text] is the length of the query. A solution to this drawback is also presented.
Keywords: DNA database; E. coli; all hits; fast indexing and search
Year: 2014 PMID: 26019584 PMCID: PMC4434100 DOI: 10.1080/13102818.2014.959711
Source DB: PubMed Journal: Biotechnol Biotechnol Equip ISSN: 1310-2818 Impact factor: 1.632
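The index-then-search idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' indexing equation, which this record does not reproduce. It assumes a plain base-4 encoding (A=0, C=1, G=2, T=3), overlapping words of a hypothetical length `k`, and sequences containing only A, C, G and T. All function names are illustrative. Because the keys are kept sorted, a query no longer than `k` maps to one contiguous key range, so only that range is examined rather than every entry.

```python
from bisect import bisect_left

# Assumed 2-bit encoding; the paper's own indexing equation is not shown here.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(word):
    """Read a DNA word as a base-4 number (first base most significant)."""
    value = 0
    for base in word:
        value = value * 4 + BASE_CODE[base]
    return value


def build_index(sequence, k):
    """Index every overlapping length-k word; return (sorted_keys, positions)."""
    positions = {}
    for start in range(len(sequence) - k + 1):
        key = encode(sequence[start:start + k])
        positions.setdefault(key, []).append(start)
    return sorted(positions), positions


def search(sequence, sorted_keys, positions, k, query):
    """Return all 0-based start positions of an exact query with len(query) <= k."""
    q = len(query)
    if not 0 < q <= k:
        raise ValueError("query length must be between 1 and k")
    # All length-k words that begin with the query share one contiguous key range.
    span = 4 ** (k - q)
    low = encode(query) * span
    first = bisect_left(sorted_keys, low)
    last = bisect_left(sorted_keys, low + span)
    hits = []
    for key in sorted_keys[first:last]:
        hits.extend(positions[key])
    # The final k - 1 start positions have no full length-k word, so check them directly.
    tail_start = max(0, len(sequence) - k + 1)
    for start in range(tail_start, len(sequence) - q + 1):
        if sequence.startswith(query, start):
            hits.append(start)
    return sorted(hits)


genome = "ACGTACGTAC"
keys, pos = build_index(genome, 4)
print(search(genome, keys, pos, 4, "AC"))  # [0, 4, 8]
```

The direct check of the last `k - 1` positions is a choice made for this sketch: a query shorter than `k` can match where no full-length word exists, and an index of full-length words alone would miss such hits.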
Figure 1. DNA data-indexing phase.
Figure 2. DNA pattern searching phase.
Figure 3. Short read from E. coli 55989 chromosome, base range: 191–300.
Table 1. Comparison of the number of records in the indexed data structures.
| Base range (Mb) | SSAHA | Dynamic construction of the indexed data structure | Number of redundant records |
|---|---|---|---|
| 1 | 65,536 | 64,422 | 1114 |
| 2 | 65,536 | 65,147 | 389 |
| 3 | 65,536 | 65,346 | 190 |
| 4 | 65,536 | 65,424 | 112 |
| 5 | 65,536 | 65,471 | 65 |
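The constant 65,536 in the SSAHA column equals 4^8, which is consistent with one record for every possible 8-nucleotide word. The word length is an inference from the table, not something stated in this record. The sketch below, with a hypothetical function name and assuming sequences of A, C, G and T only, shows how the three columns relate: a dynamically built structure stores only the words that actually occur, and the remainder of the full table is unused.

```python
def count_records(sequence, k):
    """Return (full_table_size, records_stored, unused_slots) for length-k words."""
    # A set keeps one entry per distinct overlapping word that occurs in the sequence.
    distinct_words = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
    full_table_size = 4 ** k  # one slot per possible word over {A, C, G, T}
    stored = len(distinct_words)
    return full_table_size, stored, full_table_size - stored


print(count_records("ACGTACGT", 2))  # (16, 4, 12)
```

As more of the sequence is indexed, more distinct words appear and the unused count shrinks, matching the downward trend in the last column of the table.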
Table 2. Comparison of the running times for searching different promoter consensus sequences.
| Sigma factor | Consensus query | Conversion value | Reneker and Shyu (ms) | Our algorithm (ms) |
|---|---|---|---|---|
| | CCGATAT | | 9 | 6 |
| | TATAAT | | 8 | 8 |
| | TTGACA | | 10 | 8 |
| | CTGGTA | | 13 | 13 |
| | CTAAA | f(CTAAA) = 348 | 17 | 15 |
Figure 4. Comparison of the results in Table 2.
Table 3. Comparison of the matching rates.
| Sigma factor | Consensus query | Conversion value | Number of hits – Reneker and Shyu | Number of hits – our algorithm | Number of hits unreported by Reneker and Shyu |
|---|---|---|---|---|---|
| | CCGATAT | | 2520 | 2526 | 6 |
| | TATAAT | | 3812 | 3819 | 7 |
| | TTGACA | | 4010 | 4019 | 9 |
| | CTGGTA | | 12,001 | 12,018 | 17 |
| | CTAAA | | 18,140 | 18,166 | 26 |
Table 4. Summary of the advantages of the presented methodology.
| Factor | SSAHA | Reneker and Shyu | Our algorithm |
|---|---|---|---|
| Time complexity of the indexing phase | O( | O( | O( |
| Structure of indexed data structure | Hash table with 4 | File | Sorted dictionary with less than 4 |
| Matching rate | Reports limited number of hits | Better matching rate than SSAHA, but unable to detect hits at the beginnings of the sequences | Identifies all DNA pattern hits |