Jonas S Almeida, Alexander Grüneberg, Wolfgang Maass, Susana Vinga.
Abstract
BACKGROUND: The dramatic fall in the cost of genomic sequencing, and the increasing convenience of distributed cloud computing resources, position the MapReduce coding pattern as a cornerstone of scalable bioinformatics algorithm development. In some cases an algorithm will find a natural distribution via the use of map functions to process vectorized components, followed by a reduce step that aggregates the intermediate results. However, for some data analysis procedures, such as sequence analysis, a more fundamental reformulation may be required.
Year: 2012 PMID: 22551205 PMCID: PMC3394223 DOI: 10.1186/1748-7188-7-12
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Figure 1. Graphic computation of USM encoding by generating forward (Equation 1) and backward (Equation 2) CGR successions. See Table 1 for the numeric representation. The graphic format makes it easy to verify that each position is obtained by moving the coordinates half the distance to the identity edge of the next sequence unit. Note also how the circular seeding (Figure 2) causes the first coordinate computed for each map to be at half the distance between the last coordinate and the identity edge of the first sequence unit.
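The half-distance rule and the circular seeding described in this caption can be sketched in JavaScript as follows. This is a minimal sketch and not the library's code: the function names (`halfStep`, `iterate`, `circularSeed`, `cgrForward`, `cgrBackward`), the corner assignment (a = [0,0], c = [0,1], g = [1,0], t = [1,1]), and the closed-form way of obtaining the circular seed are all assumptions made for illustration.

```javascript
// Assumed corner ("identity edge") of each nucleotide in the unit square.
const CORNERS = { a: [0, 0], c: [0, 1], g: [1, 0], t: [1, 1] };

// One CGR step: move halfway from the current point to the unit's corner.
function halfStep(point, unit) {
  const corner = CORNERS[unit];
  return point.map((x, j) => x + (corner[j] - x) / 2);
}

// Apply the half-step to each unit in turn, keeping one coordinate per unit.
function iterate(units, seed) {
  const coords = [];
  let point = seed;
  for (const unit of units) {
    point = halfStep(point, unit);
    coords.push(point);
  }
  return coords;
}

// Circular seed: the start point that one full pass returns to.
// A pass over n units maps x to x / 2^n + k, so the fixed point is
// k / (1 - 2^-n), where k is the end point reached when starting from 0.
function circularSeed(units) {
  const n = units.length;
  const k = iterate(units, [0, 0])[n - 1];
  return k.map((x) => x / (1 - Math.pow(2, -n)));
}

// Forward map: each coordinate summarises a unit and the units before it.
function cgrForward(seq) {
  const units = [...seq.toLowerCase()];
  return iterate(units, circularSeed(units));
}

// Backward map: the same iteration run from the end of the sequence to its
// start, so each coordinate summarises a unit and the units after it.
function cgrBackward(seq) {
  const units = [...seq.toLowerCase()].reverse();
  return iterate(units, circularSeed(units)).reverse();
}

const base = "acggctgctatctgcgtacggtcgac";
const forward = cgrForward(base);
const backward = cgrBackward(base);
console.log(forward[7], backward[7]);
```

Under these assumptions, `forward[7]` and `backward[7]` come out close to the two base-sequence coordinate pairs quoted for the 8th unit in the distance table further down (roughly [0.4226, 0.6977] and [0.3391, 0.8646]).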
Numerical computation of USM encoding by generating forward (Equation 1) and backward (Equation 2) CGR successions.
See Figure 1 for a graphical representation of the same succession. Note the location of the results in the usm structure, given in the table's header.
Figure 2. Encoding and decoding the base sequence. The period identifies the junction between the beginning and end of the sequence. Forward encoding (Equation 1, Figure 1, left) takes place clockwise and backward encoding (Equation 2, Figure 1, right) takes place counterclockwise. Both forward and backward CGR coordinates are displayed for the 8th unit of the sequence. The adjacent sequence units can be determined (decoded) from those coordinate values alone. As shown later, this observation can be used to assess an alignment by comparing the paired coordinates directly, demonstrating that sequence alignment can be performed through (independent) Map functions.
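The decoding step mentioned in this caption can be illustrated with a short sketch. It assumes the same hypothetical corner assignment (a = [0,0], c = [0,1], g = [1,0], t = [1,1]) and coordinates strictly inside the unit square; `decodeCGR` is an illustrative name, not a function of the published library. Doubling each coordinate and taking its integer part recovers one unit per step, most recent unit first.

```javascript
// Assumed corner assignment: the bit pair (x, y) identifies the nucleotide.
const UNITS = { "00": "a", "01": "c", "10": "g", "11": "t" };

// Recover `count` units from one CGR coordinate pair, nearest unit first.
function decodeCGR(coord, count) {
  let [x, y] = coord;
  let units = "";
  for (let i = 0; i < count; i++) {
    x *= 2;
    y *= 2;
    const bx = Math.floor(x); // 0 or 1: which half of the first axis
    const by = Math.floor(y); // 0 or 1: which half of the second axis
    units += UNITS[`${bx}${by}`];
    x -= bx; // keep the fractional part for the next unit
    y -= by;
  }
  return units;
}

// Forward and backward coordinates of the 8th unit of the base sequence,
// as quoted in the distance table of this document.
const forward8 = [0.4225834224013004, 0.6976523056276487];
const backward8 = [0.3390888473255761, 0.8645502159677478];

// The forward coordinate decodes towards the start of the sequence, so it is
// reversed to read left to right.
const before = [...decodeCGR(forward8, 8)].reverse().join("");
const after = decodeCGR(backward8, 8);
console.log(before, after);
```

With these two coordinate pairs, the forward coordinate decodes to `acggctgc` (units 1 to 8 of the base sequence) and the backward coordinate to `ctatctgc` (units 8 to 15).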
Figure 3. Annotated snapshot of the companion webApp in use. The code-hosting project site cgr.googlecode.com includes a tutorial and a video that also describe the command line use of the libraries implementing the map-reduce decomposition of sequence analysis.
Encoding of a second sequence to compare (probe) with the base sequence encoded in Table 1.
Detailed calculation of length of similar segment, d, from USM coordinates of individual homologous units.
| Encoding | |
|---|---|
| Reviewing coordinates of positions highlighted in Tables 1 and 2 | |
| `u.distCGR([0.4225834224013004, 0.6976523056276487],[0.2517343767806896, 0.502943755599859])` | 2 |
| `u.distCGR([0.3390888473255761, 0.8645502159677478],[0.33679262961989864, 0.8595585153381897])` | 7 |
| `u.dist(ubase.usm[` | |
In this illustrative example, the coordinates of nucleotide "c" at position 8 of the base sequence, acggctg[c]tatctgcgtacggtcgac, and at position 5 of the probe sequence, aaag[c]tatctgaaaggtcaaa, are compared using Equation 5. Note that array indexes in JavaScript start at 0 (zero), so this corresponds to comparing coordinate indexes 7 and 4. The resulting distance is also highlighted in Figure 3.
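The coordinate comparison in this example can be sketched as follows. This is a hedged illustration, not the library's `u.distCGR`: the function name `sharedUnits` is hypothetical, and the formula used, the integer part of the negative base-2 logarithm of the largest per-axis difference, is an assumed formulation that estimates how many consecutive units two CGR coordinates have in common.

```javascript
// Estimate the number of consecutive identical units shared by two CGR
// coordinates from the larger of their per-axis differences.
function sharedUnits(a, b) {
  const gap = Math.max(...a.map((x, j) => Math.abs(x - b[j])));
  if (gap === 0) return Infinity; // identical coordinates
  return Math.floor(-Math.log2(gap));
}

// Coordinate pairs from the table: base index 7 versus probe index 4.
const forwardMatch = sharedUnits(
  [0.4225834224013004, 0.6976523056276487],
  [0.2517343767806896, 0.502943755599859]
);
const backwardMatch = sharedUnits(
  [0.3390888473255761, 0.8645502159677478],
  [0.33679262961989864, 0.8595585153381897]
);
console.log(forwardMatch, backwardMatch);
```

Applied to the two coordinate pairs in the table, the sketch gives 2 for the first pair and 7 for the second, in agreement with the values listed there. Each position pair is evaluated independently of every other, which is what allows the comparison to be expressed as a Map function.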
Figure 4. Example of encoding and comparing a non-genomic sequence, borrowed from [Almeida 2002]. See Figure 3 for notes on the layout of the web tool.