Yoginder S Dandass, Shane C Burgess, Mark Lawrence, Susan M Bridges.
Abstract
BACKGROUND: This paper describes techniques for accelerating string set matching, with particular emphasis on applications in computational proteomics. Matching peptide sequences against a genome translated in six reading frames is one step of a proteogenomic mapping pipeline, and it is used here as a case study. The Aho-Corasick algorithm is adapted for execution in field programmable gate array (FPGA) devices in a manner that optimizes space and performance. In this approach, the traditional Aho-Corasick finite state machine (FSM) is split into smaller FSMs that operate in parallel, each of which matches up to 20 peptides in the input translated genome. Each of the smaller FSMs is further divided into five simpler FSMs such that each simple FSM operates on a single bit position in the input (five bits are sufficient for representing all amino acids and special symbols in protein sequences).
Year: 2008 PMID: 18412963 PMCID: PMC2374783 DOI: 10.1186/1471-2105-9-197
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
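The abstract mentions matching peptides against a genome translated in six reading frames. The following is a minimal sketch of that translation step, assuming the standard genetic code and a plain DNA string as input; the function names `translate` and `six_frame_translation` are illustrative and do not come from the paper, and the paper's own pipeline may handle stop codons and ambiguous bases differently.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate one reading frame; codons with unknown bases become 'X' (an assumption)."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return three forward frames followed by three reverse-complement frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:])
            for strand in (dna, reverse)
            for offset in range(3)]


print(six_frame_translation("ATGGCC"))  # ['MA', 'W', 'G', 'GH', 'A', 'P']
```

Each of the six resulting protein strings would then be streamed through the peptide matcher.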
Figure 1. An FSM for matching peptide set {ACACD, ACE, CAC}.
A table-oriented representation of the FSM for peptide set {ACACD, ACE, CAC}.
| State | A | | C | D | E | | ... | | Output |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 7 | 0 | 0 | 0 | 0,0,...,0 | 0 | Ø |
| 1 | 1 | 0 | 2 | 0 | 0 | 0 | 0,0,...,0 | 0 | Ø |
| 2 | 3 | 0 | 7 | 0 | 6 | 0 | 0,0,...,0 | 0 | Ø |
| 3 | 1 | 0 | 4 | 0 | 0 | 0 | 0,0,...,0 | 0 | Ø |
| 4 | 3 | 0 | 7 | 5 | 6 | 0 | 0,0,...,0 | 0 | Pep2 |
| 5 | 1 | 0 | 7 | 0 | 0 | 0 | 0,0,...,0 | 0 | Pep1 |
| 6 | 1 | 0 | 7 | 0 | 0 | 0 | 0,0,...,0 | 0 | Pep3 |
| 7 | 8 | 0 | 7 | 0 | 0 | 0 | 0,0,...,0 | 0 | Ø |
| 8 | 1 | 0 | 9 | 0 | 0 | 0 | 0,0,...,0 | 0 | Ø |
| 9 | 3 | 0 | 7 | 0 | 6 | 0 | 0,0,...,0 | 0 | Pep2 |
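The transition table above can be reproduced with a textbook Aho-Corasick construction. The sketch below is not the authors' implementation: it restricts the alphabet to the four symbols A, C, D, and E that occur in the example peptides, and the names `build_aho_corasick` and `find_matches` are hypothetical. It reports matched peptides by their sequence rather than by a peptide number.

```python
from collections import deque


def build_aho_corasick(peptides, alphabet):
    """Return (delta, output): a complete transition table and per-state match sets.

    Assumes every symbol used in a peptide is also present in `alphabet`.
    """
    goto = [{}]       # goto[state][symbol] -> child state in the keyword trie
    output = [set()]  # peptides recognized on entering each state
    for pep in peptides:
        state = 0
        for sym in pep:
            if sym not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][sym] = len(goto) - 1
            state = goto[state][sym]
        output[state].add(pep)

    # Breadth-first pass: compute failure links and fill in missing transitions.
    fail = [0] * len(goto)
    delta = [dict() for _ in goto]
    queue = deque()
    for sym in alphabet:
        nxt = goto[0].get(sym, 0)
        delta[0][sym] = nxt
        if nxt:
            queue.append(nxt)
    while queue:
        state = queue.popleft()
        output[state] |= output[fail[state]]  # inherit matches of the longest proper suffix
        for sym in alphabet:
            if sym in goto[state]:
                child = goto[state][sym]
                fail[child] = delta[fail[state]][sym]
                delta[state][sym] = child
                queue.append(child)
            else:
                delta[state][sym] = delta[fail[state]][sym]
    return delta, output


def find_matches(text, delta, output):
    """Scan `text` once and return (start_index, peptide) pairs in order of match end."""
    state, hits = 0, []
    for pos, sym in enumerate(text):
        state = delta[state].get(sym, 0)  # symbols outside the alphabet reset the FSM
        for pep in sorted(output[state]):
            hits.append((pos - len(pep) + 1, pep))
    return hits


delta, output = build_aho_corasick(["ACACD", "ACE", "CAC"], "ACDE")
print(delta[4])                                   # {'A': 3, 'C': 7, 'D': 5, 'E': 6}
print(find_matches("ACACDACE", delta, output))    # [(1, 'CAC'), (0, 'ACACD'), (5, 'ACE')]
```

When the peptides are inserted in the order ACACD, ACE, CAC, the state numbers produced by this sketch coincide with the row order of the table, so `delta[4]` corresponds to the row whose A, C, D, and E entries are 3, 7, 5, and 6.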
The Bit-Split FSM corresponding to bit position 0 (FSM0)
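| State | Next state (input bit 0) | Next state (input bit 1) | Partial match vector |
|---|---|---|---|
| 0 | 1 | 0 | Ø:000 |
| 1 | 2 | 0 | Ø:000 |
| 2 | 3 | 0 | Ø:000 |
| 3 | 4 | 0 | 3,2:110 |
| 4 | 4 | 5 | 3,2:110 |
| 5 | 1 | 0 | 1:001 |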
The Bit-Split FSM corresponding to bit position 1 (FSM1)
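| State | Next state (input bit 0) | Next state (input bit 1) | Partial match vector |
|---|---|---|---|
| 0 | 1 | 2 | Ø:000 |
| 1 | 1 | 3 | Ø:000 |
| 2 | 4 | 2 | Ø:000 |
| 3 | 5 | 2 | Ø:000 |
| 4 | 1 | 6 | Ø:000 |
| 5 | 1 | 7 | 2:010 |
| 6 | 5 | 2 | 3:100 |
| 7 | 5 | 8 | 3:100 |
| 8 | 4 | 2 | 1:001 |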
The Bit-Split FSM corresponding to bit position 2 (FSM2)
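| State | Next state (input bit 0) | Next state (input bit 1) | Partial match vector |
|---|---|---|---|
| 0 | 1 | 0 | Ø:000 |
| 1 | 2 | 0 | Ø:000 |
| 2 | 3 | 4 | Ø:000 |
| 3 | 5 | 4 | 3:100 |
| 4 | 1 | 0 | 2:010 |
| 5 | 6 | 4 | 3:100 |
| 6 | 6 | 4 | 3,1:101 |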
The Bit-Split FSM corresponding to bit position 3 (FSM3)
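| State | Next state (input bit 0) | Next state (input bit 1) | Partial match vector |
|---|---|---|---|
| 0 | 1 | 0 | Ø:0 |
| 1 | 2 | 0 | Ø:0 |
| 2 | 3 | 0 | Ø:0 |
| 3 | 4 | 0 | 3,2:110 |
| 4 | 5 | 0 | 3,2:110 |
| 5 | 5 | 0 | 3,2,1:111 |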
The Bit-Split FSM corresponding to bit position 4 (FSM4)
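| State | Next state (input bit 0) | Next state (input bit 1) | Partial match vector |
|---|---|---|---|
| 0 | 1 | 0 | Ø:0 |
| 1 | 2 | 0 | Ø:0 |
| 2 | 3 | 0 | Ø:0 |
| 3 | 4 | 0 | 3,2:110 |
| 4 | 5 | 0 | 3,2:110 |
| 5 | 5 | 0 | 3,2,1:111 |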
Bit encoding of selected peptides
| Symbol | Bit 4 | Bit 3 | Bit 2 | Bit 1 | Bit 0 |
|---|---|---|---|---|---|
| A | 0 | 0 | 0 | 0 | 0 |
| C | 0 | 0 | 0 | 1 | 0 |
| D | 0 | 0 | 0 | 1 | 1 |
| E | 0 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| M | 0 | 1 | 1 | 0 | 1 |
| ... | ... | ... | ... | ... | ... |
| Y | 1 | 1 | 0 | 0 | 1 |
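The bit-split tables above can be derived from the full FSM by a subset construction that looks at one bit of each 5-bit symbol, and a peptide is reported only when all five partial match vectors agree. The sketch below illustrates this under several assumptions: the encoding columns are read most significant bit first (so A = 00000, C = 00010, D = 00011, E = 00100), only those four codes are assigned, peptides are numbered in the order ACACD, ACE, CAC as in the bit-split tables, and the function names are hypothetical. It is a software model, not the FPGA design described in the paper.

```python
from collections import deque

CODES = {"A": 0b00000, "C": 0b00010, "D": 0b00011, "E": 0b00100}  # assumed MSB-first reading
NUM_BITS = 5
ALPHABET = range(1 << NUM_BITS)


def build_dfa(peptides):
    """Aho-Corasick DFA over 5-bit codes; bit k-1 of out[s] is set when peptide k ends in s."""
    goto, out = [{}], [0]
    for k, pep in enumerate(peptides):
        s = 0
        for sym in (CODES[ch] for ch in pep):
            if sym not in goto[s]:
                goto.append({})
                out.append(0)
                goto[s][sym] = len(goto) - 1
            s = goto[s][sym]
        out[s] |= 1 << k
    fail = [0] * len(goto)
    delta = [[0] * len(ALPHABET) for _ in goto]
    queue = deque()
    for sym, child in goto[0].items():
        delta[0][sym] = child
        queue.append(child)
    while queue:
        s = queue.popleft()
        out[s] |= out[fail[s]]
        for sym in ALPHABET:
            if sym in goto[s]:
                child = goto[s][sym]
                fail[child] = delta[fail[s]][sym]
                delta[s][sym] = child
                queue.append(child)
            else:
                delta[s][sym] = delta[fail[s]][sym]
    return delta, out


def bit_split(delta, out, bit):
    """Subset construction that observes only one bit position of each input symbol."""
    start = frozenset([0])
    index, order = {start: 0}, [start]
    table, vectors = [], []
    for subset in order:  # 'order' grows during iteration, giving breadth-first numbering
        row = []
        for value in (0, 1):
            nxt = frozenset(delta[s][sym] for s in subset
                            for sym in ALPHABET if ((sym >> bit) & 1) == value)
            if nxt not in index:
                index[nxt] = len(order)
                order.append(nxt)
            row.append(index[nxt])
        table.append(tuple(row))
        vec = 0
        for s in subset:
            vec |= out[s]  # partial match vector: union over the merged states
        vectors.append(vec)
    return table, vectors


def match(text, fsms):
    """Run the bit-split FSMs in lockstep; return one full match vector per input symbol."""
    states, hits = [0] * len(fsms), []
    for ch in text:
        code, full = CODES[ch], -1  # -1 is all ones, the identity for AND
        for b, (table, vectors) in enumerate(fsms):
            states[b] = table[states[b]][(code >> b) & 1]
            full &= vectors[states[b]]
        hits.append(full)
    return hits


delta, out = build_dfa(["ACACD", "ACE", "CAC"])
fsms = [bit_split(delta, out, b) for b in range(NUM_BITS)]
print(fsms[0][0])                # [(1, 0), (2, 0), (3, 0), (4, 0), (4, 5), (1, 0)]
print(match("ACACDACE", fsms))   # [0, 0, 0, 4, 1, 0, 0, 2]
```

With these assumptions, the transitions printed for bit position 0 agree with the FSM0 table, and the match vectors 4, 1, and 2 (binary 100, 001, 010) mark the positions where CAC, ACACD, and ACE end.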
Figure 2. Bit-Split Aho-Corasick FSM Architecture.
Figure 3. Aho-Corasick Tile Architecture.
Figure 4. Aho-Corasick Implementation Architecture using 140 tiles (clock and reset signals are not shown for clarity).
Tile packing efficiency result
| 5 | 30 | 11.28 | 140.02 | 19.99 | 52.70% |
| 10 | 30 | 15.40 | 141.41 | 19.80 | 81.12% |
| 15 | 30 | 19.79 | 178.40 | 15.70 | 81.53% |
| 20 | 30 | 23.70 | 277.23 | 12.32 | 72.96% |
Operating frequencies of Aho-Corasick designs with a variety of tiles on Virtex-4 FX-140
| 800 | 40 | 100 | 177.054 |
| 1,600 | 80 | 200 | 166.030 |
| 2,400 | 120 | 300 | 167.954 |
| 3,200 | 160 | 400 | 132.503 |
| 4,000 | 200 | 500 | 134.971 |