Wendi Wang, Peiheng Zhang, Xinchun Liu.
Abstract
BACKGROUND: Emerging next-generation sequencing methods based on PCR technology considerably increase genome sequencing speed and reduce its cost. They have been used to address a broad range of bioinformatics problems. Because the reliable output sequence length of next-generation sequencing technologies is limited, studies are generally confined to gene fragments of 30 - 50 bps, which are shorter than traditional gene fragments. Anchoring gene fragments in a long reference sequence is an essential prerequisite for further assembly and analysis. Given the sheer number of fragments produced by next-generation sequencing technologies and the huge size of reference sequences, anchoring rapidly becomes a computational bottleneck. RESULTS AND DISCUSSION: We compared the efficiency of BLAT, SOAP and EMBF. Efficiency is defined as the total number of output results divided by the time consumed to retrieve them. The data show that our algorithm, EMBF, is 3 - 4 times more efficient than SOAP and at least 150 times more efficient than BLAT. Moreover, when the reference sequence size increases, the efficiency of SOAP degrades by as much as 30%, while that of EMBF tends to increase.
Year: 2009 PMID: 19208116 PMCID: PMC2648759 DOI: 10.1186/1471-2105-10-S1-S17
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Coding results for variable sampling window length.
| Sampling window length (bps) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| Vector count | 4 | 10 | 20 | 35 | 56 | 84 | 120 | 165 | 220 |
| Binary coding length | 2 | 4 | 6 | 8 | 10 | 12 | 14 | 16 | 18 |
| Vector coding length | 2 | 4 | 5 | 6 | 6 | 7 | 7 | 8 | 8 |
| Compression rate | 0% | 0% | 16.7% | 25% | 40% | 41.7% | 50% | 50% | 55.6% |
The compression rate is calculated as the difference between the binary coding length and the vector coding length, divided by the binary coding length. The vector count is calculated as C(w+m-1, w), where w is the sampling window length and m is the size of the alphabet used to form the sequences. The vector coding length is the minimum value n for which 2^n ≥ vector count holds.
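The table rows can be recomputed from these definitions. The sketch below is an illustration only and assumes a four-letter DNA alphabet (m = 4), a binary coding of 2 bits per base, and the reading 2^n ≥ vector count, which is the one that reproduces the tabulated coding length of 2 for a vector count of 4. The function name `coding_row` is hypothetical.

```python
from math import comb

ALPHABET_SIZE = 4  # assumption: DNA alphabet {A, C, G, T}


def coding_row(w, m=ALPHABET_SIZE):
    """Return (vector count, binary bits, vector bits, compression rate) for window length w."""
    vector_count = comb(w + m - 1, w)          # number of distinct frequency vectors
    binary_bits = 2 * w                        # assumption: 2 bits per base
    # Smallest n such that 2**n >= vector_count.
    vector_bits = (vector_count - 1).bit_length()
    compression = (binary_bits - vector_bits) / binary_bits
    return vector_count, binary_bits, vector_bits, compression


for w in range(1, 10):
    count, b_bits, v_bits, rate = coding_row(w)
    print(f"w={w}: vectors={count}, binary={b_bits}, vector={v_bits}, rate={rate:.1%}")
```

Under these assumptions, window lengths 1 to 9 give the vector counts 4, 10, 20, ..., 220 listed in the table.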
Figure 1. Blocking strategy with initial offset. As shown in parts A and B, seq1 and seq2 are each divided into 5 blocks of 4 bps. The gap error caused by the missing character C at the 7th position in seq1 makes it fail to match seq2. However, as shown in part C, with an additional reading frame for seq1 shifted left by 1 bp, we can collect enough matching blocks (highlighted with dark background) to deduce the hit relationship.
Procedure of EMBF algorithm.
| Let B1 = ⌊ |
| 1.1 Divide S[offset, n] into blocks of L bps, as S_offset = {s_offset,1, ..., s_offset,B1}; |
| 1.2 Convert S_offset to frequency vectors, as ES_offset = {es_offset,1, ..., es_offset,B1}; |
| 2.1 Sequentially choose B2 blocks from ES_offset and set the start position as p; use all possible combinations to select B2-E blocks and combine them into the ADDR variable. Set the remaining E blocks as r; |
| 2.2 Map the pair (ADDR, (r, p)) into a hash map M, chaining possible conflicts; |
| 2.3 Iteratively scan ES_offset for the next B2 blocks; |
| 3.1 Divide T into blocks of L bps, as T = {t_1, t_2, ..., t_B2}, and convert them to frequency vectors, as ET = {et_1, et_2, ..., et_B2}; |
| 3.2 Choose B2-E blocks from ET and combine them into the ADDR variable; set the remaining E blocks as t; |
| 3.3 Query ADDR in M and pass all returned results, R = {(r_1, p_1), (r_2, p_2), ...}, to step 4; |
| if ECD(t, r_i) < E then record p_i; |
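The procedure above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: it assumes the reference is indexed at every initial offset from 0 to L-1, that reads contain exactly B2 blocks, and that the block comparison is a summed absolute difference of frequency vectors (the source names the comparison ECD without defining it here). All function names and the `max_distance` threshold are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

ALPHABET = "ACGT"


def frequency_vector(block):
    """Count each base in a block; the order of bases inside the block is discarded."""
    return tuple(block.count(base) for base in ALPHABET)


def block_vectors(seq, start, block_len):
    """Cut seq[start:] into whole blocks of block_len bases and encode each one."""
    n_blocks = (len(seq) - start) // block_len
    return [frequency_vector(seq[start + i * block_len:start + (i + 1) * block_len])
            for i in range(n_blocks)]


def split_window(window, chosen):
    """Chosen blocks form the address (ADDR); the remaining blocks are kept aside."""
    addr = (chosen, tuple(window[i] for i in chosen))
    rest = tuple(v for i, v in enumerate(window) if i not in chosen)
    return addr, rest


def build_index(reference, block_len, n_blocks, n_err):
    """Steps 1-2: map every ADDR to a chain of (remaining blocks, start position)."""
    index = defaultdict(list)
    for offset in range(block_len):  # assumption: one reading frame per offset
        vectors = block_vectors(reference, offset, block_len)
        for first in range(len(vectors) - n_blocks + 1):
            window = vectors[first:first + n_blocks]
            position = offset + first * block_len
            for chosen in combinations(range(n_blocks), n_blocks - n_err):
                addr, rest = split_window(window, chosen)
                index[addr].append((rest, position))
    return index


def block_distance(a, b):
    """Stand-in for ECD: summed absolute difference of paired frequency vectors."""
    return sum(abs(x - y) for va, vb in zip(a, b) for x, y in zip(va, vb))


def anchor(index, read, block_len, n_blocks, n_err, max_distance):
    """Steps 3-4: look up each ADDR of the read, then filter on the remaining blocks."""
    window = block_vectors(read, 0, block_len)[:n_blocks]
    hits = set()
    for chosen in combinations(range(n_blocks), n_blocks - n_err):
        addr, rest = split_window(window, chosen)
        for ref_rest, position in index.get(addr, ()):
            if block_distance(rest, ref_rest) <= max_distance:
                hits.add(position)
    return sorted(hits)


reference = "ACGTTGCAAGGCTTACCGATAGGCTTAACGTCAGT"  # toy data
read = reference[5:17]                              # 12 bp = 3 blocks of 4 bp
index = build_index(reference, block_len=4, n_blocks=3, n_err=1)
exact_hits = anchor(index, read, 4, 3, 1, max_distance=0)
print(exact_hits)
```

Because frequency vectors discard base order, the returned positions are candidates; a real pipeline would still verify each one against the reference.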
Steps 1 and 2 of EMBF are pre-processing steps in which a two-level index structure is constructed. Index entry addresses are generated from different combinations of blocks and require L · (n/L) · C(m/L, 2) operations in total, which is on the order of nm^2/L^2. The overhead of generating one ADDR can be treated as a constant c, so the total pre-processing cost of building the index is about cnm^2/L^2, i.e. O(n) for fixed m and L. Step 3 is the first-level filtering phase, with constant computing cost. Its output result count depends on the length of the index seed and the size of the reference sequence. The second-level filtering is performed in step 4 with time complexity O(rm/L), where r is the average output count of step 3 (see the results section for an accurate evaluation of r). Excluding the pre-processing steps, the time complexity of EMBF is therefore O(rm/L) << O(n). The space complexity can be interpreted as the memory used to implement the two-level index structure (see the results section for a detailed analysis). To fit the first-level index into a fast storage device for best performance, the size of the reference sequence and the length of the index seed can be adjusted to tune the index size and access overhead.
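The pre-processing operation count quoted above can be expanded as follows. This is a worked restatement, assuming n is the reference length, m the fragment length (so that m/L is the number of blocks per fragment) and L the block length; it shows that the count is bounded by a constant multiple of nm^2/L^2 and is therefore linear in n when m and L are fixed.

```latex
\[
L \cdot \frac{n}{L} \cdot \binom{m/L}{2}
  = n \cdot \frac{1}{2}\,\frac{m}{L}\left(\frac{m}{L} - 1\right)
  \le \frac{n m^{2}}{2 L^{2}}
  = O\!\left(\frac{n m^{2}}{L^{2}}\right)
\]
```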
Figure 2. Three different index structures. The numbers in the upper-right part of the figure give the offsets in the reference sequences where the given sequence fragment occurs identically. In part A, blocks with dark background indicate placeholders where no actual data exist. In part B, a hash function H hashes input sequences into buckets labelled 0~4, and conflicting sequences are chained together. Part C illustrates a binary search tree, in which the number at the beginning of each block is used as the search key.
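The three structures in this caption can be contrasted with a toy example. The sketch below is illustrative only: the seeds and offsets are made-up data, the modulo hash stands in for the unspecified function H, and a sorted key list with binary search stands in for the binary search tree of part C.

```python
from bisect import bisect_left

BASES = "ACGT"


def encode(seed):
    """Pack a seed into an integer using 2 bits per base."""
    code = 0
    for base in seed:
        code = code * 4 + BASES.index(base)
    return code


# Hypothetical toy data: 2-bp seeds and the reference offsets where they occur.
occurrences = {"AC": [0, 7], "GT": [2], "TT": [5]}

# Part A: direct addressing. One slot per possible seed; unused slots are placeholders.
direct_table = [None] * (4 ** 2)
for seed, offsets in occurrences.items():
    direct_table[encode(seed)] = offsets


def direct_lookup(seed):
    return direct_table[encode(seed)]


# Part B: hashing into 5 buckets, with conflicting seeds chained in a list.
N_BUCKETS = 5
buckets = [[] for _ in range(N_BUCKETS)]
for seed, offsets in occurrences.items():
    buckets[encode(seed) % N_BUCKETS].append((seed, offsets))


def hash_lookup(seed):
    for key, offsets in buckets[encode(seed) % N_BUCKETS]:
        if key == seed:
            return offsets
    return None


# Part C: ordered keys searched by bisection (stand-in for a binary search tree).
sorted_entries = sorted((encode(seed), offsets) for seed, offsets in occurrences.items())
sorted_keys = [key for key, _ in sorted_entries]


def tree_lookup(seed):
    code = encode(seed)
    i = bisect_left(sorted_keys, code)
    if i < len(sorted_keys) and sorted_keys[i] == code:
        return sorted_entries[i][1]
    return None
```

Direct addressing gives constant-time access at the cost of empty slots, while the hashed and ordered variants store only the seeds that actually occur.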
Executing time analysis of EMBF
| 38.93% | 42.13% | 3.57% | 15.37% | |
| 41.00% | 38.71% | 3.06% | 17.23% | |
| 63.78% | 22.41% | 1.6% | 12.12% | |
The filtering and matching columns give the percentage of time consumed by step 3 and step 4 of the EMBF algorithm, respectively. The overhead of generating the index access address is reported separately in the addressing column. The others column includes sequence reading, result writing and logging overheads. Each value in this table is the mean over 10 K anchor executions.
Memory consumption to implement index structure (MB).
| 28.24 | 247 | 275.24 | |
| 49 | 99 | 148 | |
| 176 | 397 | 573 | |
| 342 | 397 | 739 | |
| - | - | 60 | |
| - | - | 562 | |
We divide the memory consumption of EMBF and BTree into two parts: the first part is used to build a hash map for EMBF and a traversal query tree for BTree; the second part is used to store positional information for EMBF and the remaining sequences for BTree. The -xbps suffix in the index name column indicates that the algorithm uses a seed of length x bps.
Figure 3. Memory cost of EMBF. The data were collected from the 33.7, 69.3, 134 and 359 Mbps data sets, respectively. To evaluate the influence of seed length, 12 bps and 16 bps seeds were tested.
Figure 4. Filtering results of 10 K queries on the 359 Mbps dataset. We collected filtering results by anchoring 10 K synthesized sequences on the 359 Mbps dataset. The maximum percentage (3.528%) occurs at x = 1.251, and the corresponding filtering result count is 17.834. The residual percentage is well within ± 0.8%, which indicates that the output result count of step 3 of EMBF follows a Gumbel extreme value distribution.
Relative speedup comparison.
| 1 | 1/1.57 | 48838 | 42.66 | 1/3.1 | |
| 1 | 76734 | 67.02 | 1/1.97 | ||
| 1 | 1/1145 | 1/151385 | |||
| 1 | 1/132.3 | ||||
| 1 | |||||
Each value gives the speedup of the row algorithm relative to the column algorithm. A value n > 1 means that the row algorithm is n times faster than the column algorithm. All data were collected from the average performance of 10 K anchor requests.
Result accuracy comparison.
| 202676 | 300375 | NO DATA | |
| 202676 | 300375 | 1433261 | |
| 129930 | 198788 | 900084 | |
| 24544 | 47202 | 107297 | |
| 44298 | 77891 | 217973 | |
| 42907 | 76840 | 213766 | |
NO DATA indicates that the run took too long to produce a final result and is therefore omitted from this paper.
Figure 5. Scalability analysis. BLAT with the ooc tag enabled performs better, but the completeness of its output degrades. The average over 10 K anchor requests was used to smooth out the jitter of individual query requests.
Figure 6. Efficiency comparison. Efficiency is defined as the total output result count divided by the total time consumed. The data show that EMBF has a 3~4 times efficiency advantage over SOAP, and at least 150 times over BLAT.