John Schwacke, Jonas S Almeida.
Abstract
BACKGROUND: Recently, Almeida and Vinga offered a new approach for the representation of arbitrary discrete sequences, referred to as Universal Sequence Maps (USM), and discussed its applicability to genomic sequence analysis. Their work generalizes the Chaos Game Representation (CGR) of DNA to arbitrary discrete sequences.
Year: 2002 PMID: 12387731 PMCID: PMC137598 DOI: 10.1186/1471-2105-3-28
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Comparison of encodings for the original sequence, an embedded representation and USM coordinates. Sample encodings for a nucleotide sequence illustrating the equivalence between an embedded encoding and a finite word length, block floating point representation of a standard USM encoding. The values indicated as i represent the initial values of the USM coordinates, and the subscripted A, G, T, and C indicate the 0th and 1st bits of the 2-bit USM representation of the associated coordinates.
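The equivalence described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes a CGR-style update in which each coordinate moves halfway toward the bit of the current symbol, and compares it with a fixed-point version that shifts the newest bit into the top of a machine word. The bit assignment `BITS`, the word length `WORD`, the initial value 0, and the function names are illustrative assumptions, not values taken from the paper.

```python
# Hypothetical 2-bit assignment for the four nucleotides (an assumption).
BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}
WORD = 16  # assumed word length in bits


def usm_float(seq, start=0.0):
    """Floating-point iteration: move halfway toward the symbol's bit in each dimension."""
    x = [start, start]
    coords = []
    for s in seq:
        x = [(x[d] + BITS[s][d]) / 2 for d in range(2)]
        coords.append(tuple(x))
    return coords


def usm_words(seq, word=WORD):
    """Fixed-point iteration: shift right, then place the newest symbol bit in the top position."""
    w = [0, 0]
    coords = []
    for s in seq:
        w = [(w[d] >> 1) | (BITS[s][d] << (word - 1)) for d in range(2)]
        coords.append(tuple(w))
    return coords


if __name__ == "__main__":
    for f, w in zip(usm_float("GATTACA"), usm_words("GATTACA")):
        # Each integer word holds the leading WORD bits of the float coordinate.
        print(f, [format(v, "016b") for v in w])
```

Under these assumptions, each integer word holds the leading `WORD` bits of the corresponding floating-point coordinate, so the most recent symbols occupy the most significant bits.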
Execution time performance for standard and Boolean USM implementations.
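| Length of sequence 1 (nt) | Length of sequence 2 (nt) | Boolean USM total time (s) | Boolean USM distance compute time (s) | Standard USM total time (s) | Standard USM distance compute time (s) | Rate, Boolean USM (distances/s) | Rate, standard USM (distances/s) | Speed ratio (Boolean/standard rate) | Memory (kB) |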
|---|---|---|---|---|---|---|---|---|---|
| 1,000 | 1,000 | 0.12 | 0.12 | 0.44 | 0.44 | 8,333,333 | 2,272,727 | 3.67 | 94 |
| 2,000 | 2,000 | 0.48 | 0.48 | 1.78 | 1.77 | 8,333,333 | 2,244,669 | 3.71 | 188 |
| 3,000 | 3,000 | 1.09 | 1.09 | 3.98 | 3.98 | 8,249,313 | 2,264,151 | 3.64 | 281 |
| 4,000 | 4,000 | 1.92 | 1.92 | 7.08 | 7.08 | 8,324,662 | 2,259,887 | 3.68 | 375 |
| 5,000 | 5,000 | 3.04 | 3.03 | 11.07 | 11.07 | 8,212,878 | 2,259,376 | 3.64 | 469 |
| 10,000 | 10,000 | 30.53 | 30.52 | 58.17 | 58.16 | 3,275,145 | 1,719,011 | 1.91 | 938 |
| 15,000 | 15,000 | 68.68 | 68.67 | 131.01 | 131.00 | 3,276,158 | 1,717,452 | 1.91 | 1,406 |
| 17,000 | 17,000 | 88.23 | 88.21 | 168.29 | 168.28 | 3,275,678 | 1,717,253 | 1.91 | 1,594 |
| 20,000 | 20,000 | 122.09 | 122.07 | 233.00 | 232.99 | 3,276,406 | 1,716,775 | 1.91 | 1,875 |
| 40,000 | 40,000 | 488.46 | 488.43 | 931.74 | 931.71 | 3,275,587 | 1,717,219 | 1.91 | 3,750 |
Results of performance comparisons of standard USM and Boolean USM implementations in C (gcc 2.95.3, cygwin, Windows 2000, PIII 1 GHz). Sequence lengths are given in nucleotides. Times measure elapsed execution time in seconds. Total times include both USM sequence preparation time and distance calculations for all symbol pairs. Memory is measured in kilobytes and represents the space required to store the USM coordinates for both sequences.
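The rate and speed-ratio columns appear to follow from the timing columns. The sketch below assumes that the rate is the number of symbol pairs divided by the distance compute time, and that the speed ratio is the standard time divided by the Boolean time. The function names are illustrative. With the first table row as input it reproduces the printed values. Other rows can differ in the trailing digits, presumably because the printed times are rounded.

```python
def distance_rate(len_a, len_b, distance_seconds):
    """Distance computations per second for an all-pairs comparison."""
    return len_a * len_b / distance_seconds


def speed_ratio(standard_seconds, boolean_seconds):
    """How many times faster the Boolean implementation ran."""
    return standard_seconds / boolean_seconds


if __name__ == "__main__":
    # First table row: 1,000 x 1,000 nucleotides.
    print(round(distance_rate(1000, 1000, 0.12)))  # Boolean USM
    print(round(distance_rate(1000, 1000, 0.44)))  # standard USM
    print(round(speed_ratio(0.44, 0.12), 2))
```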
Figure 2. Comparison of standard and Boolean USM similar segment length measurements. Pixel images of bi-directional distance determination for standard (A) and Boolean (B) USM implementations. Brighter pixels indicate longer similar segments.
Figure 3. Comparison of standard and Boolean USM length measurements for sample nucleotide sequences. Pixel images of bi-directional distance determination for standard (A) and Boolean (B) USM implementations. The sequences are 100 nucleotide segments from the human insulin receptor (INSR) and a chicken tyrosine kinase (CTK-1). Brighter pixels correspond to longer similar segments. The dominant segment is an exact match that is 17 nucleotides long.
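For intuition about what such pixel images contain, the sketch below builds a comparable matrix by direct symbol comparison. It is not either USM implementation. It assumes that the bi-directional measure at a pair of positions is the length of the exact-match run passing through that pair, obtained by combining the run that ends there with the run that starts there. The function name and the toy sequences are illustrative.

```python
def segment_lengths(a, b):
    """Return a matrix whose (i, j) entry is the length of the exact-match run through a[i], b[j]."""
    n, m = len(a), len(b)
    # back[i + 1][j + 1]: length of the matching run that ends at (i, j).
    back = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            if a[i] == b[j]:
                back[i + 1][j + 1] = back[i][j] + 1
    # fwd[i][j]: length of the matching run that starts at (i, j).
    fwd = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                fwd[i][j] = fwd[i + 1][j + 1] + 1
    # Position (i, j) is counted in both runs, hence the -1.
    return [
        [back[i + 1][j + 1] + fwd[i][j] - 1 if a[i] == b[j] else 0 for j in range(m)]
        for i in range(n)
    ]


if __name__ == "__main__":
    image = segment_lengths("TTACGTAA", "GGACGTCC")
    for row in image:
        print(row)
```

In this toy example the shared segment `ACGT` appears as a diagonal of equal values, analogous to the bright diagonal that an exact match would produce in the figures.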