Inês Soares, Ana Goios, António Amorim.
Abstract
The vast majority of methods available for sequence comparison rely on an initial sequence alignment step, which requires a number of assumptions about evolutionary history and is sometimes very difficult or impossible to perform because of the abundance of gaps (insertions/deletions). In such cases, an alternative alignment-free method would prove valuable. Our method starts by computing a generalized suffix tree of all sequences, which is completed in linear time. Using this tree, the frequency of all possible words with a preset length L (L-words) in each sequence is rapidly calculated. Based on the L-word frequency profile of each sequence, a pairwise standard Euclidean distance is then computed, producing a symmetric genetic distance matrix, which can be used to generate a neighbor-joining dendrogram or a multidimensional scaling graph. We present an improvement to word-counting alignment-free approaches for sequence comparison, by determining a single optimal word length and combining suffix tree structures with the word-counting tasks. Our approach is thus a fast and simple application that proved to be efficient and powerful when applied to mitochondrial genomes. The algorithm was implemented in Python and is freely available on the web.
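The pipeline described in the abstract (L-word frequency profiles followed by pairwise Euclidean distances) can be illustrated with a minimal sketch. This is not the authors' implementation: it counts words with a plain sliding window instead of a generalized suffix tree, and the function names, the DNA alphabet `ACGT`, and the normalization of counts to relative frequencies are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from math import sqrt


def lword_profile(seq, L, alphabet="ACGT"):
    """Relative frequency of every possible L-word over `alphabet` in `seq`.

    Assumption: counts are normalized by the number of valid L-words found;
    windows containing symbols outside `alphabet` are ignored.
    """
    seq = seq.upper()
    # Sliding-window count (a stand-in for the suffix-tree lookup in the paper).
    counts = Counter(seq[i:i + L] for i in range(len(seq) - L + 1))
    words = ["".join(p) for p in product(alphabet, repeat=L)]
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


def euclidean(p, q):
    """Standard Euclidean distance between two equal-length profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def distance_matrix(seqs, L):
    """Symmetric matrix of pairwise distances between L-word profiles."""
    profiles = [lword_profile(s, L) for s in seqs]
    n = len(profiles)
    return [[euclidean(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    # Toy sequences, purely hypothetical.
    for row in distance_matrix(["ACGTACGT", "ACGTACGA", "TTTTTTTT"], L=2):
        print(row)
```

The resulting matrix is the kind of input a neighbor-joining or multidimensional scaling routine would take; the sliding window is quadratic in memory-free simplicity terms only, so for whole mitochondrial genomes the suffix-tree approach named in the abstract is what gives the linear-time construction.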
Year: 2012 PMID: 22997494 PMCID: PMC3444837 DOI: 10.1100/2012/450124
Source DB: PubMed Journal: ScientificWorldJournal ISSN: 1537-744X
Comparison of the running times of our approach (Linux and Windows x32 operating systems) and the Costa et al. 2011 methodology (Linux platform) [8]. The first column lists the number of sequences and the species used in each comparison; the remaining columns give the running time of each algorithm for each set of sequences, in seconds. The tabulated times are the sum of the running times of each step. The time spent by the user between steps, although highly time consuming, was not included.
| Sequences | Costa et al. 2011 [8], Linux (s) | Our approach, Linux (s) | Our approach, Windows (s) |
|---|---|---|---|
| 10 Pan troglodytes | 8 | 51 | 70 |
| 22 Pan paniscus | 27 | 72 | 105 |
| 29 primates | 46 | 66 | 90 |
| 104 Homo sapiens | 537 | 226 | 353 |
| 150 Homo sapiens | 1159 | 355 | 666 |
Figure 1: Differences between the running times of the Costa et al. 2011 [8] approach and our suggested methodology.