
Towards computational improvement of DNA database indexing and short DNA query searching.

Done Stojanov, Sašo Koceski, Aleksandra Mileva, Nataša Koceska, Cveta Martinovska Bande

Abstract

In order to facilitate and speed up the search of massive DNA databases, the database is indexed at the beginning, employing a mapping function. By searching through the indexed data structure, exact query hits can be identified. If the database is searched against an annotated DNA query, such as a known promoter consensus sequence, then the starting locations and the number of potential genes can be determined. This is particularly relevant if unannotated DNA sequences have to be functionally annotated. However, indexing a massive DNA database and searching an indexed data structure with millions of entries is a time-demanding process. In this paper, we propose a fast DNA database indexing and searching approach that identifies all query hits in the database without having to examine all entries in the indexed data structure, limiting the maximum length of a query that can be searched against the database. By applying the proposed indexing equation, the whole human genome could be indexed in 10 hours on a personal computer, under the assumption that there is enough RAM to store the indexed data structure. Analysing the methodology proposed by Reneker, we observed that hits at starting positions 0 ≤ p < k − q are not reported if the database is searched against a query shorter than k nucleotides, such that k is the length of the DNA database words being mapped and q is the length of the query. A solution of this drawback is also presented.

Keywords:  DNA database; E. coli; all hits; fast indexing and search

Year:  2014        PMID: 26019584      PMCID: PMC4434100          DOI: 10.1080/13102818.2014.959711

Source DB:  PubMed          Journal:  Biotechnol Biotechnol Equip        ISSN: 1310-2818            Impact factor:   1.632


Introduction

Increased knowledge about the complex human genome is revealing its importance and impact on people's lives. Advances in computer science have contributed to the storage, dissemination, search and analysis of human genome data with increased efficiency and accuracy, which is why genetic databases are gaining more and more popularity in the research community today. Genetic databases often include entire genomes and can be used to search for particular sequences in genetic disease analysis, DNA fingerprinting and genetic genealogy, or to analyse short sequences, such as non-standard codon structure [1] and codon context frequency.[2] As soon as the first genetic database became available on the Internet, the need for fast DNA database-processing algorithms became, and remains, a challenge to researchers. Because they do not scale to database search, dynamic programming-based solutions for global/local sequence alignment, such as Needleman–Wunsch [3] and Smith–Waterman,[4] have been substituted with faster, heuristic seed-based algorithms such as FASTA [5] and BLAST (Basic Local Alignment Search Tool),[6] which are performed in two phases. In the first, so-called preprocessing phase, the matching positions of highly similar regions are identified as seeds; in the second phase, the seeds are extended to local alignments. Usually, not every initial seed is extended to a full alignment; many are discarded by filtering, which lowers the run-time. One common task – searching a genetic database to find exact matches for a non-degenerate or partially degenerate query – is usually done by using web applications hosted and run on remote web servers.[7] For smaller databases, desktop programs can also be used, with all the data kept in main memory. The main feature of all these algorithms and tools is the database-indexing phase, which precedes and speeds up the actual searching phase.
There are also algorithms for rapidly searching large genetic databases on desktop computers with limited RAM, like MICA (K-Mer Indexing with Compact Arrays),[8] which stores indexed data on disk and retrieves relevant data selectively during the searching phase. Indexing a DNA sequence with MICA is achieved by dividing the sequence into chunks of (2^16 − 1) bases, scanning each chunk with a window of width K and storing the positions of all overlapping K-mers in an array. One group of algorithms uses suffix trees (OASIS [9] and the three versions of MUMmer [10-12]) and enhanced suffix arrays (ESAs) (Vmatch [13]) for building database indexes. ESAs consist of four arrays (suffix, longest common prefix, child and suffix link arrays) that together reach the full expressiveness of suffix trees. Suffix trees and ESAs are time efficient, with construction time complexity of O(n) and space complexity of O(n log(n)). The Kurtz [14] implementation of suffix trees requires 17.25 bytes per base, which translates to 50 GB for the human genome, while ESAs require 4–8 bytes per base. There are also sparse suffix arrays (SSAs), which index every K-th (sparseness factor) suffix of the sequence.[15] The corresponding sparseMEM tool is able to find maximal exact matches faster than the previous methods, while using less memory. The tool essaMEM [16] optimizes this method by supplementing SSAs with a sparse child array for large sparseness factors, a structure known as an enhanced sparse suffix array (ESSA). Other, newer strategies deploy compressed indexing techniques that reduce the space complexity to O(n) bits. These techniques include the Ferragina–Manzini index (FM-index),[17] the compressed suffix array (CSA) index [18] and the Burrows–Wheeler transform (BWT) index.[19] For DNA sequences, BWT indexing was found to be the most efficient, with a memory requirement of less than 0.3 bytes per base. For the human genome, this amounts to only 1 GB, so the whole index can reside in the main memory of a personal computer (PC).
A CSA index implemented with the BWT is used in [20,21], while a BWT index is used in BWT-SW.[22] Another tool, backwardMEM,[23] uses an enhanced CSA by indexing each K-th suffix array value. Benson [24] has proposed an algorithm for the identification of all tandem repeats, without having to specify the pattern. Neglecting patterns that occur within only a few database sequences, TEIRESIAS [25] is able to generate all maximal patterns that appear within at least K sequences (K is user-defined). RAPID [26] is a probabilistic word-searching approach: a different significance is assigned to each match of length k, depending on the number of occurrences of the match. By partitioning the query and the subject sequence into fragments of fixed size, referred to as windows, the sequence search tree (SST) [27] can identify approximate matches in time proportional to the logarithm of the database size. Hash-based indexing strategies (SSAHA (Sequence Search and Alignment by Hashing Algorithm) [28,29] and BLAT (BLAST-like alignment tool) [30]), which are currently in more widespread use for DNA databases, require 1 byte or less per base, and they can be orders of magnitude faster than FASTA or BLAST, which index the query sequence rather than the database. SSAHA [28] partitions the subject database sequence into non-overlapping words of length k (k-mers), which are mapped into numbers according to the SSAHA conversion function. BLAT [30] builds an index of non-overlapping k-mers and their positions in the database, excluding from the index k-mers that occur too often, as well as k-mers containing ambiguity codes. During the search stage, three different strategies are used in BLAT to find homologous regions: searching for perfect hits, allowing at least one mismatch between two hits, and searching for multiple perfect matches in close proximity to each other. One problem with SSAHA and BLAT is a limitation that arises from sacrificing completeness for speed.
They cannot detect matches of fewer than k bases. Some conceptual and computational drawbacks of SSAHA have been solved by Reneker and Shyu.[29] According to Reneker and Shyu,[29] overlapping matches can be identified if, instead of indexing non-overlapping words of k bases, overlapping words of the same size are tracked in the indexed data structure. Reneker and Shyu [29] also pointed out that matches shorter than k bases can be identified as suffixes within some of the indexed words of k bases. However, this concept is incomplete if some of the matches are located at the beginnings of the genetic sequences. In order to detect all matches, even the ones which are not reported in [29], we propose an improved searching methodology that integrates suffix search and prefix search and thus provides more exact matching. From a computational viewpoint, we propose a computational upgrade of the indexing formula used by SSAHA and by Reneker and Shyu,[29] which results in a k-fold speed-up of the indexing phase. The storage aspects are also improved, due to the exclusion of redundant records from the indexed data structure. Instead of a hash table, a sorted dictionary is employed as the indexed data structure, which allows identification of all matches without having to scan all records and, hence, better search time performance.

Materials and methods

Database indexing

Hash-based solutions, such as SSAHA and Reneker and Shyu's improvements of SSAHA, are performed in two phases. In the first phase, DNA database words of k consecutive nucleotides b_0 b_1 … b_(k−1) are mapped into numbers by applying a base-mapping function m and the word-mapping function f(w) = Σ_(t=0..k−1) m(b_t)·4^t, where b_0 is the leftmost nucleotide of the word. SSAHA and Reneker and Shyu employ different base-mapping functions, because in the latter none of the nucleotides may be zero-mapped. According to SSAHA, adenine is zero-mapped: m(A) = 0. Since database words are mapped into integers as f(w) = Σ m(b_t)·4^t, words ending with a different number of A's cannot be distinguished if SSAHA's base-mapping function is used. For instance, the mapped values of the words CAA and CAAAA are equal, f(CAA) = f(CAAAA) = 1, so the two words are assumed to be equal, which is incorrect. In order to overcome SSAHA's mapping inconsistency, Reneker and Shyu [29] proposed a modified base-mapping function in which no nucleotide is zero-mapped – m(A) = 1, m(T) = 2, m(G) = 3, m(C) = 4 – for the distinction of words ending with a different number of A's. The modified base-mapping function guarantees that a different number is assigned to each different word, but neither SSAHA nor Reneker and Shyu's algorithm has been improved in terms of the time complexity of the indexing phase. Indexing complete genomes, or recomputing the indexed data structure if the DNA data has been modified, is a time-demanding process that can last even a few days when executed on a PC. In these cases, there is a computational necessity to reduce the time span of the indexing phase. Therefore, we propose a computational upgrade of the indexing formula used in SSAHA and Reneker and Shyu's algorithm that speeds up the indexing k-fold, such that k is the length of the DNA words being mapped, in comparison to SSAHA and Reneker and Shyu's algorithm. The proposed computational upgrade is based on the following. Once the first word from the ith DNA database sequence has been mapped, the mapped value of each successive word w_(i,j+1) can be calculated from the previous one, w_(i,j), according to Equation (1).
The equation is based on the fact that two successive words w_(i,j) and w_(i,j+1) have exactly k − 1 common nucleotides. Thus, each word can be derived from the previous one by excluding the leftmost nucleotide (the subtraction of m(b_j) in Equation (1)), shifting the k − 1 common nucleotides one position to the left (the division by 4) and adding a new base (the term m(b_(j+k))·4^(k−1) in Equation (1)):

f(w_(i,j+1)) = (f(w_(i,j)) − m(b_j))/4 + m(b_(j+k))·4^(k−1) (1)

For instance, if AACTT… is a DNA sequence and k = 4, once the first word of four nucleotides, AACT, has been mapped, f(AACT) = 197, the mapped value of the following word ACTT can be calculated from the previous one by applying Equation (1), f(ACTT) = (197 − 1)/4 + 2·4^3 = 177, and so on. Applying the mapping formula used in SSAHA and Reneker and Shyu's algorithm, 4k operations are performed in order to map a single word. To map a DNA sequence of n nucleotides, n − k + 1 words have to be mapped, i.e. (n − k + 1)·4k operations are performed. Since the length n of the DNA sequence is much greater than k (n ≫ k), approximately 4kn operations are performed in order to map all overlapping words of k nucleotides. Applying the proposed solution based on Equation (1), 4k operations are performed to map the first word only. Since the execution of Equation (1) requires four operations per mapped successive word – a subtraction, a division, a multiplication and an addition – and there are n − k successive overlapping words of k nucleotides, a total of approximately 4n operations has to be performed, which results in a k-fold increase in the speed of the indexing phase in comparison to SSAHA and Reneker and Shyu's approach. SSAHA and the technique of Reneker and Shyu store indexed data differently. SSAHA precomputes a hash table of 4^k keys, such that k is the length of the DNA words being mapped. By generating 4^k keys, SSAHA guarantees a different key for each different DNA word. Since the hash table is stored in the main memory, which has limited capacity, SSAHA running on a PC is suitable for indexing small databases.
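A minimal sketch of the described indexing upgrade can illustrate Equation (1). The base mapping m(A) = 1, m(T) = 2, m(G) = 3, m(C) = 4, with the leftmost base as the least-significant base-4 digit, is reconstructed from the numerical examples in the text and is an assumption of this sketch:

```python
# Assumed base mapping (no base maps to zero), leftmost base = least-significant digit.
M = {'A': 1, 'T': 2, 'G': 3, 'C': 4}

def f(word):
    # Direct mapping: roughly 4k operations per word of length k.
    return sum(M[b] * 4 ** t for t, b in enumerate(word))

def rolling_keys(seq, k):
    """Yield f(w) for every overlapping k-mer of seq using Equation (1):
    f(w_next) = (f(w) - m(leftmost)) / 4 + m(new base) * 4^(k-1)."""
    key = f(seq[:k])            # only the first word is mapped directly
    yield key
    top = 4 ** (k - 1)
    for j in range(len(seq) - k):
        key = (key - M[seq[j]]) // 4 + M[seq[j + k]] * top
        yield key

print(f("AACT"))                       # 197
print(list(rolling_keys("AACTT", 4)))  # [197, 177], i.e. [f("AACT"), f("ACTT")]
```

The division is always exact, because subtracting the leftmost base's value leaves a multiple of 4; each successive word therefore costs four operations instead of 4k.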
However, due to the fast processor–RAM communication, the time aspects of the indexing and search phases are relatively satisfactory. As opposed to SSAHA, the indexed data can be stored in a file, which is kept on a disk or server with much more storage capacity than the main memory. This idea, which is suitable for indexing long DNA sequences such as the human genome, has been employed by Reneker and Shyu.[29] However, the additional transfer of data between the RAM and the disk during the indexing and searching phases slows down the time performance. The idea of precomputing a hash table with 4^k keys before tracking tuples (i, p), such that i is the index of the DNA sequence where the word comes from and p is its starting position, can be improved in terms of memory complexity. If a greater value is taken for k, the number of precomputed keys in the hash table is increased. Since some of the precomputed keys may point to zero tuples, part of the main memory may be unnecessarily reserved. In order to avoid situations in which memory is wasted on keys pointing to zero tuples, we propose that the indexed data structure be constructed dynamically. By reading and mapping only the overlapping words of k nucleotides actually found in the database, one can be sure that each key in the indexed data structure points to at least one tuple (i, p). For instance, if k = 8 and the DNA word TTTTCATT is not contained in the database, then the key f(TTTTCATT) would be unnecessarily generated and kept in memory. On the contrary, if the indexed data structure is constructed dynamically, by reading and mapping only words which are found in the database, the corresponding record key would not exist. This contributes towards an optimization of the memory requirements. If a short DNA database is indexed with a large value of k, a significant part of the main memory would otherwise be wasted on keeping redundant keys corresponding to words not found in the database.
This conceptual drawback of SSAHA not only implies unnecessary storage cost, but also slows down the search for a DNA pattern, since redundant keys are also examined in the searching phase. By reducing the size of the indexed data structure, better search time performance is expected, which may be of particular importance if an indexed data structure that tracks genomic data has to be searched.
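Dynamic construction of the indexed data structure can be sketched as follows. A key is created only for k-mers that actually occur in the database, so no memory is reserved for the redundant keys that a precomputed 4^k-entry hash table would hold; the base mapping is an assumption consistent with the examples in the text:

```python
# Assumed base mapping (no base maps to zero), leftmost base = least-significant digit.
M = {'A': 1, 'T': 2, 'G': 3, 'C': 4}

def f(word):
    return sum(M[b] * 4 ** t for t, b in enumerate(word))

def build_index(sequences, k):
    """Map every overlapping k-mer of every sequence to tuples (i, p),
    creating a key only when the word is actually observed."""
    index = {}
    for i, seq in enumerate(sequences):
        for p in range(len(seq) - k + 1):
            index.setdefault(f(seq[p:p + k]), []).append((i, p))
    return index

index = build_index(["CTTAAC"], 4)
print(len(index))   # 3 keys (CTTA, TTAA, TAAC) instead of 4^4 = 256
```

Each key is guaranteed to point to at least one tuple, which is exactly the property argued for above.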

DNA pattern search

A database of DNA sequences is primarily searched as part of molecular biology studies, with the main purposes of finding genes, identifying regulatory sequences, predicting the intron–exon structures of genes, analysing short tandem repeats, recognizing pseudogenes, finding restriction sites, etc. In some cases, such as the prediction of a promoter sequence based on a known consensus element, biological information can be attached to the sequences and the database can be searched against relatively short DNA patterns. In order to speed up the search of the indexed DNA database stored in the main memory, instead of a hash table we use a sorted dictionary (SD) data structure. Due to the fact that the records are sorted in ascending order of the keys, all hits can be identified without having to scan the entire indexed data structure, which improves the search time complexity. Given the set Q of short DNA patterns searched against the database, database words of k nucleotides, such that k is the length of the longest query in Q, are mapped into keys pointing to tuples (i, p), such that i is the index of the database sequence from which the word has been read and p is the starting position of the word in the sequence. In such a constellation, all query hits can be found either as suffixes or as prefixes within key-mapped words of k nucleotides. Also, note that the sorted dictionary is constructed by applying the proposed computational upgrade of the indexing formula used in SSAHA and the method of Reneker and Shyu. A hit of a pattern of length q at starting position p + k − q in the ith DNA sequence is found as a suffix of a key-mapped word from the same sequence. For illustration, the query AA is found as a suffix in TTAA, the second key-mapped word of the DNA sequence S2: CTTAAC…, if k = 4. The exact starting position of the hit can be derived from the tuple (i, p) pointed to by the sorted dictionary key f(TTAA), by increasing p by k − q, i.e. the tuple (i, p + k − q) is reported.
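The suffix search over the sorted dictionary can be sketched as follows. Because the leftmost base is the least-significant digit under the reconstructed mapping (an assumption of this sketch), every k-mer that ends with the pattern has a key between the keys of the pattern's minimal (A-padded) and maximal (C-padded) left extensions, so only that key range has to be scanned:

```python
import bisect

# Assumed base mapping (no base maps to zero), leftmost base = least-significant digit.
M = {'A': 1, 'T': 2, 'G': 3, 'C': 4}

def f(word):
    return sum(M[b] * 4 ** t for t, b in enumerate(word))

def suffix_hits(index, sorted_keys, pattern, k):
    """Report (i, p + k - q) for every indexed k-mer ending with pattern."""
    q = len(pattern)
    f_min = f('A' * (k - q) + pattern)   # minimum-conversion left extension
    f_max = f('C' * (k - q) + pattern)   # maximum-conversion left extension
    hits = []
    for key in sorted_keys[bisect.bisect_left(sorted_keys, f_min):]:
        if key > f_max:
            break                        # keys are sorted: no later key can match
        for i, p in index[key]:
            hits.append((i, p + k - q))
    return hits

# The example from the text: query AA against S2 = CTTAAC, with k = 4.
index = {f("CTTAAC"[p:p + 4]): [(2, p)] for p in range(3)}
sorted_keys = sorted(index)
print(suffix_hits(index, sorted_keys, "AA", 4))   # [(2, 3)]
```

The scan touches only keys inside [f_min, f_max] and stops at the first larger key, which is the claimed advantage over examining every record.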
An extension of the DNA pattern up to length k with a minimum conversion value, f_min, is obtained by adding A's to the left side of the pattern. By adding C's to the left side of the pattern, an extension of the same length but with a maximum conversion value, f_max, is obtained. Each sorted dictionary key SD.key such that Equation (2) is satisfied points to the tuple (tuples) tracking words that contain the searched DNA pattern as a suffix:

f_min ≤ SD.key ≤ f_max (2)

Hits are reported as (i, p + k − q) tuples, such that (i, p) is a tuple pointed to by an SD.key that satisfies Equation (2). Each sorted dictionary key such that SD.key > f_max does not map a word that ends with the searched DNA pattern. Based on this, all hits at starting positions p + k − q can be found by reading the sorted records from f_min onwards, until the first SD.key such that SD.key > f_max is read. None of the sorted dictionary records such that SD.key < f_min or SD.key > f_max has to be considered. Thus, better search time performance is expected compared to SSAHA. The main computational drawback of the improvements of SSAHA proposed by Reneker and Shyu [29] is the inability to find DNA patterns that are located at the very beginnings of the indexed DNA sequences. Namely, DNA pattern hits at starting positions 0 ≤ p < k − q, such that k is the length of the words being mapped and q is the length of the searched DNA pattern, are not reported by the algorithm of Reneker and Shyu. This drawback might result in an incomplete identification of repetitive nucleotide sequences within telomeres if chromosomes are searched in the reverse direction, and in an inability to detect a key DNA pattern, such as a promoter consensus, within short DNA reads. Therefore, we propose a solution of this drawback, based on checking whether a sorted dictionary key satisfies Equations (3) and (4). If the value of the SD.key is in the range between f′_min and f′_max, such that f′_min and f′_max are obtained by the addition of A's and C's, respectively, to the right side of the searched DNA pattern P, and the remainder when dividing SD.key − f(P) by 4^q equals zero, then the tuple (tuples) pointed to by the SD.key track mapped words that contain the searched DNA pattern as a prefix:

f′_min ≤ SD.key ≤ f′_max (3)

(SD.key − f(P)) mod 4^q = 0 (4)
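The prefix check of Equations (3) and (4) can be sketched as follows, under the same reconstructed base mapping (an assumption of this sketch): a key maps a k-mer that starts with pattern P if and only if it lies between the keys of P's A-padded and C-padded right extensions and its low-order digits equal f(P), i.e. the difference from f(P) is divisible by 4^q:

```python
# Assumed base mapping (no base maps to zero), leftmost base = least-significant digit.
M = {'A': 1, 'T': 2, 'G': 3, 'C': 4}

def f(word):
    return sum(M[b] * 4 ** t for t, b in enumerate(word))

def is_prefix_key(key, pattern, k):
    """Equations (3) and (4): does this key map a k-mer starting with pattern?"""
    q = len(pattern)
    f_min = f(pattern + 'A' * (k - q))   # Equation (3) lower bound
    f_max = f(pattern + 'C' * (k - q))   # Equation (3) upper bound
    return f_min <= key <= f_max and (key - f(pattern)) % 4 ** q == 0

# TACACAAC starts with the consensus TACACA; ACACAACA does not, and it
# fails Equation (4) even though its key falls inside the range of (3).
print(f("TACACAAC"), is_prefix_key(f("TACACAAC"), "TACACA", 8))   # 71814 True
print(is_prefix_key(f("ACACAACA"), "TACACA", 8))                  # False
```

The divisibility test is what makes the range check sufficient: the pattern occupies the low-order base-4 digits of the key, so both conditions together single out exactly the words with P as a prefix.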
The reported tuples (i, p) identify DNA pattern hits unreported by the algorithm of Reneker and Shyu. The flowcharts of the indexing and searching phases are shown in Figures 1 and 2.
Figure 1.

DNA data-indexing phase.

Figure 2.

DNA pattern searching phase.


System characteristics

The proposed computational and conceptual upgrades were implemented in C# and tested on a Fujitsu Siemens computer with Core(TM) 2 Duo CPU at 2.67 GHz and 2 GB RAM.

Database used

To evaluate the computational, storage and matching performance of the proposed approach, different Escherichia coli DNA fragments with a total size of 0.1 Gb (gigabases), retrieved from the European Nucleotide Archive,[31] were indexed, and the results were compared to those obtained by SSAHA and Reneker and Shyu's algorithm, using the same computational resources.

Results and discussion

Let us consider the short read from the E. coli 55989 chromosome in the base range 191–300 (Figure 3). If k = 8, a search of the short read for the E. coli thrL gene promoter consensus TACACA, which is located at the −10 position relative to the TSS (transcription start site), by the algorithm of Reneker and Shyu would not report the hit at the beginning of the short read. This is because, when this algorithm is applied to search for a DNA pattern shorter than k nucleotides, only hits that are found as suffixes of mapped words are reported. The consensus sequence TACACA is located at the start of the short read (Figure 3), and none of the mapped overlapping words of k = 8 nucleotides – TACACAAC, ACACAACA, CACAACAT, … – contains the searched DNA pattern TACACA as a suffix. Therefore, the TACACA hit at position 0 in the short read is not reported by Reneker and Shyu's approach.
Figure 3.

Short read from the E. coli 55989 chromosome, base range: 191–300.

The improvement proposed by us incorporates not only a suffix search strategy, but also a prefix search strategy based on Equations (3) and (4); thus, the hit at the beginning of the short read is reported. Before performing any check, f(TACACA) = 2182, f′_min = f(TACACAAA) = 22662 and f′_max = f(TACACACC) = 84102 are computed. When the search comes to the sorted dictionary key SD.key = 71814, which corresponds to the mapped value of the word TACACAAC containing the searched DNA pattern as a prefix, Equations (3) and (4) are satisfied: 22662 ≤ 71814 ≤ 84102 and (71814 − 2182) mod 4^6 = 0. The consensus hit at the beginning of the short chromosome read is reported, represented by the tuple (i, 0), given that the short read is the ith DNA sequence being indexed. By applying the proposed computational upgrade of the indexing formula used in SSAHA and Reneker and Shyu's algorithm, the complete E. coli 55989 chromosome, which contains 5 Mb (megabases), retrieved from the European Nucleotide Archive, was indexed in one minute. Without the computational upgrade, eight minutes were spent for the same purpose, given that all E. coli 55989 overlapping words of k = 8 nucleotides were mapped. Since the time complexity of the proposed computational upgrade is linear and the entire E. coli DNA data-set of 0.1 Gb was indexed in 20 minutes, the whole human genome, containing approximately 3 Gb, could be indexed in 600 minutes (10 hours). Using the same computational resources, the straightforward application of the indexing formula proposed by Reneker and Shyu would require 3.3 days. Compared to Simpson and Durbin,[32] who estimated that 4.5 days and 700 GB of RAM would be required to index the human genome, the proposed computational upgrade results in an 11-fold speed-up. The storage improvement was also experimentally analysed. For instance, given that the length of the mapped words equals 8, SSAHA pre-computes a hash table with 4^8 = 65,536 keys.
If the indexed data structure is constructed dynamically as proposed, the number of records in the data structure increases with the size of the indexed DNA data (Table 1). Using the same computational resources, the number of records in the indexed data structure was determined for the indexing of 1, 2, 3, 4 and 5 Mb extracted from the E. coli 55989 chromosome (Table 1). According to the results obtained, 64,422 records were tracked in the indexed data structure for 1 Mb of DNA. The number of records increased to 65,471 when the entire E. coli 55989 chromosome was mapped. Even in that case, SSAHA unnecessarily keeps 65,536 − 65,471 = 65 keys, which map words of eight nucleotides that were not found in the E. coli 55989 chromosome. These 65 records are excluded from the indexed data structure if it is constructed dynamically, as we propose.
Table 1.

Comparison of the number of records in the indexed data structures.

Base range (Mb) | SSAHA | Dynamic construction of the indexed data structure | Number of redundant records
1 | 65,536 | 64,422 | 1,114
2 | 65,536 | 65,147 | 389
3 | 65,536 | 65,346 | 190
4 | 65,536 | 65,424 | 112
5 | 65,536 | 65,471 | 65
The use of a sorted dictionary instead of a hash table enables faster identification of all DNA pattern hits. In three out of the five cases, when searching for different promoter consensus sequences recognized by different sigma transcription factors (Table 2 and Figure 4), our algorithm ran faster than that of Reneker and Shyu. The improvement is due to the computational concept of the proposed searching approach, which identifies the same DNA pattern hits as the approach of Reneker and Shyu, but without having to examine all records in the indexed data structure. It is also noteworthy that the higher the conversion value of the searched DNA pattern, the fewer records are examined, resulting in better search time performance.
Table 2.

Comparison of the running times for searching different promoter consensus sequences

Sigma factor | Consensus query | Conversion value | Reneker and Shyu (ms) | Our algorithm (ms)
– | CCGATAT | f(CCGATAT) = 9860 | 9 | 6
– | TATAAT | f(TATAAT) = 2406 | 8 | 8
– | TTGACA | f(TTGACA) = 2170 | 10 | 8
– | CTGGTA | f(CTGGTA) = 1788 | 13 | 13
– | CTAAA | f(CTAAA) = 348 | 17 | 15
Figure 4.

Comparison of the results in Table 2.

In addition to the computational improvements, the main conceptual improvement of the proposed methodology, in comparison to the method of Reneker and Shyu, is the ability to detect all DNA pattern hits, regardless of their starting positions in the sequences. There have been no reports until now that SSAHA and the algorithm of Reneker and Shyu are unable to detect DNA pattern hits shorter than k nucleotides that are located at the beginnings of the sequences. This drawback can be solved by employing a prefix search, which provides more exact matching in comparison to Reneker and Shyu's algorithm (Table 3).
Table 3.

Comparison of the matching rates

Sigma factor | Consensus query | Conversion value | Number of hits – Reneker and Shyu | Number of hits – our algorithm | Number of hits unreported by Reneker and Shyu
– | CCGATAT | f(CCGATAT) = 9860 | 2,520 | 2,526 | 6
– | TATAAT | f(TATAAT) = 2406 | 3,812 | 3,819 | 7
– | TTGACA | f(TTGACA) = 2170 | 4,010 | 4,019 | 9
– | CTGGTA | f(CTGGTA) = 1788 | 12,001 | 12,018 | 17
– | CTAAA | f(CTAAA) = 348 | 18,140 | 18,166 | 26
According to the data in Table 3, the number of unidentified hits when searching the indexed data structure for the consensus elements CCGATAT, TATAAT, TTGACA, CTGGTA and CTAAA ranges between 6 and 26. By applying the proposed methodology, a total of 65 unreported consensus hits located at the beginnings of the indexed E. coli DNA fragments were additionally identified (Table 3). Taken together, the results demonstrate that the proposed improvements give better computational, storage and matching performance in comparison to SSAHA and Reneker and Shyu's algorithm, using the same computational resources. To summarize the obtained results, Table 4 lists the advantages of the proposed methodology in comparison to the other two algorithms.
Table 4.

Summary of the advantages of the presented methodology

Factor | SSAHA | Reneker and Shyu | Our algorithm
Time complexity of the indexing phase | O(nk) | O(nk) | O(n) – k times faster
Structure of the indexed data | Hash table with 4^k records | File | Sorted dictionary with fewer than 4^k records for a small DNA database
Matching rate | Reports a limited number of hits | Better matching rate than SSAHA, but unable to detect hits at the beginnings of the sequences | Identifies all DNA pattern hits
The biological impact of the proposed methodology lies in the ability to detect all DNA pattern hits, regardless of their starting positions in the sequences. Unlike SSAHA and the approach of Reneker and Shyu, our algorithm can identify DNA pattern hits located at the beginnings of the indexed DNA sequences, by integrating suffix search and prefix search. As a consequence, the proposed algorithm is expected to provide more precise matching in DNA pattern analyses, by being able to identify key DNA pattern hits that might indicate functional DNA data unreported by SSAHA and Reneker and Shyu's method.
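The combined suffix and prefix search discussed in this section can be sketched end-to-end on a hypothetical read that begins with the consensus TACACA. The base mapping is reconstructed from the paper's numbers and is an assumption of this sketch; the suffix scan alone misses the hit at position 0, and the prefix check recovers it:

```python
# Assumed base mapping (no base maps to zero), leftmost base = least-significant digit.
M = {'A': 1, 'T': 2, 'G': 3, 'C': 4}

def f(word):
    return sum(M[b] * 4 ** t for t, b in enumerate(word))

def search_all(seq, pattern, k):
    """Find every occurrence of a pattern shorter than k, combining the
    suffix-range test with the prefix test of Equations (3) and (4)."""
    q = len(pattern)
    index = {}
    for p in range(len(seq) - k + 1):
        index.setdefault(f(seq[p:p + k]), []).append(p)
    s_lo, s_hi = f('A' * (k - q) + pattern), f('C' * (k - q) + pattern)
    p_lo, p_hi = f(pattern + 'A' * (k - q)), f(pattern + 'C' * (k - q))
    hits = set()
    for key, positions in index.items():
        if s_lo <= key <= s_hi:                        # pattern as suffix
            hits.update(p + k - q for p in positions)
        if p_lo <= key <= p_hi and (key - f(pattern)) % 4 ** q == 0:
            hits.update(positions)                     # pattern as prefix
    return sorted(hits)

read = "TACACAACATTACACAGG"                            # hypothetical read
print(search_all(read, "TACACA", 8))                   # [0, 10]
```

Dropping the prefix test from this sketch would report only position 10, mirroring the behaviour attributed to Reneker and Shyu's algorithm above.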

Conclusions

This work presents computational and conceptual improvements of the most commonly used hash-based implementations for DNA database indexing and searching: SSAHA and Reneker and Shyu's improvements of the SSAHA software tool. We propose a computational upgrade of the indexing formula used in SSAHA and the algorithm of Reneker and Shyu, by which the database can be indexed k times faster. This is of particular importance if large eukaryotic DNA sequences have to be tracked. Another improvement of our algorithm over SSAHA is more efficient memory use when a relatively short DNA database is indexed: by dynamic construction of the indexed data structure, all redundant keys (those corresponding to hash values of words not found in the database) are excluded, which improves the storage aspects. In addition, we use a sorted dictionary instead of a hash table as the indexed data structure, in order to identify the same hits as Reneker and Shyu's algorithm without having to scan the entire data structure. As a result, the proposed methodology was demonstrated to run faster than Reneker and Shyu's algorithm. These better computational, storage and matching results, as compared to SSAHA and Reneker and Shyu's approach when using the same computational resources, indicate that our algorithm can be considered a promising improvement.