Brian R King, Maurice Aburdene, Alex Thompson, Zach Warres.
Abstract
Digital signal processing (DSP) techniques for biological sequence analysis continue to grow in popularity due to the inherent digital nature of these sequences. DSP methods have demonstrated early success in detecting coding regions in a gene. More recently, these methods have been used to establish DNA gene similarity. We present the inter-coefficient difference (ICD) transformation, a novel extension of the discrete Fourier transformation, which can be applied to any DNA sequence. The ICD method is a mathematical, alignment-free DNA comparison method that generates a genetic signature for any DNA sequence; these signatures are then used to generate relative measures of similarity among DNA sequences. We demonstrate our method on a set of insulin genes obtained from an evolutionarily wide range of species, and on a set of avian influenza viral sequences, which represents a set of highly similar sequences. We compare phylogenetic trees generated using our technique against trees generated using traditional alignment techniques for similarity and demonstrate that the ICD method produces a highly accurate tree without requiring an alignment prior to establishing sequence similarity.
Keywords: Discrete Fourier transform; Sequence analysis; Sequence similarity
Year: 2014 PMID: 24991213 PMCID: PMC4077688 DOI: 10.1186/1687-4153-2014-8
Source DB: PubMed Journal: EURASIP J Bioinform Syst Biol ISSN: 1687-4145
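The abstract notes that DSP methods have been used to detect coding regions. A minimal sketch of that general idea follows. It is not the ICD transformation, which this record does not define. It assumes the common binary indicator mapping (one 0/1 sequence per base) and a naive discrete Fourier transform; the function names and the toy input are hypothetical.

```python
import cmath

def indicator(seq, base):
    """Binary indicator sequence: 1.0 where seq has `base`, else 0.0."""
    return [1.0 if ch == base else 0.0 for ch in seq.upper()]

def dft(x):
    """Naive O(N^2) discrete Fourier transform of a real-valued sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def power_spectrum(seq):
    """Sum of squared DFT magnitudes over the four base indicator sequences."""
    total = [0.0] * len(seq)
    for base in "ACGT":
        for k, coeff in enumerate(dft(indicator(seq, base))):
            total[k] += abs(coeff) ** 2
    return total

# Toy input (an assumption for illustration): a strictly period-3 sequence.
toy = "ATG" * 20
spectrum = power_spectrum(toy)

# Strongest non-zero frequency in the first half of the spectrum.
n = len(spectrum)
peak = max(range(1, n // 2 + 1), key=spectrum.__getitem__)
print(peak, n // 3)
```

For this toy input the strongest non-zero frequency sits at k = N/3, the period-3 component that DSP-based coding-region detectors typically look for. Real sequences give a much noisier spectrum, and the naive transform would normally be replaced by an FFT.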
mRNA insulin sequences from 19 animal species in the INS19 dataset
| Species | Accession | Length (nt) |
| --- | --- | --- |
| Human | NM_000207 | 469 |
| Chimp | NM_001008996 | 416 |
| Olive baboon | XM_003909376 | 505 |
| Monkey | J00336 | 392 |
| Cow | NM_173926 | 434 |
| Pig | NM_001109772 | 435 |
| Chicken | NM_205222 | 453 |
| Dog | NM_001130093 | 463 |
| Cat | AB043535 | 420 |
| Guinea pig | NM_001172891 | 442 |
| Star-nosed mole | XM_004695041 | 291 |
| Hedgehog | XM_004717178 | 327 |
| Hamster | XM_005064148 | 450 |
| Rabbit | NM_001082335 | 433 |
| Zebrafish | AF036326 | 468 |
| Butterfly fish | AF199588 | 459 |
| Clown knifefish | AF199586 | 375 |
| Flycatcher | XM_005046804 | 324 |
| Clawed frog | NM_001085882 | 774 |
Avian influenza A subtype frequency in FLU60
| 3 | |
| 1 | |
| 1 | |
| 1 | |
| 13 | |
| 25 | |
| 2 | |
| 6 | |
| 1 | |
| 4 | |
| 2 | |
| 1 |
Figure 1. Histogram of observed sequence identity over all pairs of aligned sequences in the INS19 dataset. The percent identity is computed for all possible pairs of sequences in the INS19 dataset. Most pairs have between 55% and 75% sequence identity.
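The all-pairs percent identity behind such a histogram can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the sequences are assumed to be already aligned to equal length with `-` as the gap character, columns where both sequences have a gap are skipped, and the toy alignment and the 10-point bin width are invented for the example. Other conventions for the denominator exist and give different percentages.

```python
from collections import Counter
from itertools import combinations

def percent_identity(a, b):
    """Percent of compared columns with identical residues.

    Columns where both sequences have a gap are ignored; a residue
    opposite a gap counts as a mismatch (one convention among several).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    cols = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not cols:
        return 0.0
    matches = sum(1 for x, y in cols if x == y)
    return 100.0 * matches / len(cols)

# Hypothetical toy alignment, not taken from INS19.
aligned = {
    "seq1": "ATG-CA",
    "seq2": "ATGGCA",
    "seq3": "TTG-CC",
}

# Percent identity for every unordered pair of sequences.
identities = {
    (m, n): percent_identity(aligned[m], aligned[n])
    for m, n in combinations(sorted(aligned), 2)
}

# Histogram with 10-point bins, keyed by the lower edge of each bin.
histogram = Counter(int(v // 10) * 10 for v in identities.values())

for pair, value in identities.items():
    print(pair, round(value, 1))
```

With 19 sequences, as in INS19, the same loop would produce 171 pairwise values.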
Figure 2. ICD-based dendrogram for INS19. This figure shows the dendrogram generated by applying the ICD method to the INS19 dataset, which contains mRNA sequences taken from 19 different eukaryotic species for the insulin (INS) gene.
Figure 3. Alignment-based dendrogram for INS19. This figure shows the resulting dendrogram generated from phylogenetic relationships inferred from pairwise alignments computed over all pairs from the INS19 dataset, which contains mRNA sequences taken from 19 different eukaryotic species for the insulin (INS) gene.
Figure 4. Alignment-free dendrogram using the FFP method [2] for INS19. This figure shows the resulting dendrogram generated from phylogenetic relationships inferred using the FFP method on the INS19 dataset, which contains mRNA sequences taken from 19 different eukaryotic species for the insulin (INS) gene.
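As a rough illustration of how an alignment-free comparison of this kind can work, the sketch below builds k-mer frequency profiles and compares them with the Jensen-Shannon divergence. It assumes that FFP here refers to feature frequency profiles; the choice k = 3, the function names, and the toy sequences are assumptions for the example and are not taken from the cited method or its software.

```python
from collections import Counter
from math import log2

def kmer_profile(seq, k=3):
    """Relative frequency of every overlapping k-mer in seq."""
    seq = seq.upper()
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two frequency profiles.

    Returns 0 for identical profiles and 1 for profiles with no shared k-mers.
    """
    div = 0.0
    for kmer in set(p) | set(q):
        pi = p.get(kmer, 0.0)
        qi = q.get(kmer, 0.0)
        mi = 0.5 * (pi + qi)  # mixture distribution
        if pi > 0:
            div += 0.5 * pi * log2(pi / mi)
        if qi > 0:
            div += 0.5 * qi * log2(qi / mi)
    return div

# Hypothetical toy sequences of different lengths; no alignment is needed.
a = kmer_profile("ATGGCCCTGTGGATGCGC")
b = kmer_profile("ATGGCCCTGTGGATGCGCCTC")
c = kmer_profile("TTTTTTTTTTTT")

print(js_divergence(a, b), js_divergence(a, c))
```

The pairwise divergences form a distance matrix, which a standard clustering method could then turn into a dendrogram.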
Figure 5. ICD-based dendrogram for FLU60. This figure shows the dendrogram generated by applying the ICD method to the FLU60 dataset, which contains 60 sequences of the HA gene of different subtypes of avian influenza type A.
Observed execution time for FLU60
| 157.0 | |
| 27.0 | |
| 53.8 | |
| 7.1 | |
| 0.2 |