Marcin Radom, Piotr Formanowicz.
Abstract
Sequencing by hybridization (SBH) allows the reconstruction of a DNA string of a given length from smaller fragments. These fragments are obtained in a hybridization experiment in which the DNA hybridizes to a DNA chip. In the classical approach, the chip consists of all oligonucleotides of a given length, with only one type of oligonucleotide for each probe of the chip. In this paper, we propose an algorithm solving a non-classical case of SBH, where each chip probe consists of a set of oligonucleotides described by some specific pattern. We present the definition of such a non-classical DNA chip and an algorithm solving the sequencing problem related to such a chip. Unlike recent metaheuristic approaches to the classical SBH problem, the proposed algorithm tries to find an exact sequence and, even in the presence of all types of hybridization errors in the spectrum, is very often able to do so in a short time. If only negative errors from repetitions are allowed, the algorithm is able to reconstruct sequences with lengths of thousands of nucleotides.
Keywords: Algorithm; Non-classical SBH; Sequencing by hybridization
Year: 2017 PMID: 28247172 PMCID: PMC6061515 DOI: 10.1007/s12539-017-0220-0
Source DB: PubMed Journal: Interdiscip Sci ISSN: 1867-1462 Impact factor: 2.233
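To make the classical setting described in the abstract concrete, the following is a minimal sketch of an ideal (error-free) classical spectrum and an exhaustive reconstruction from it. It is not the alternating-chip algorithm proposed in the paper. The function names `spectrum` and `reconstruct`, the assumption that the first oligonucleotide is known, and the requirement k ≥ 2 are all illustrative choices made here.

```python
def spectrum(dna: str, k: int) -> set[str]:
    """Ideal classical spectrum: the set of all k-mers occurring in dna."""
    return {dna[i:i + k] for i in range(len(dna) - k + 1)}


def reconstruct(spec: set[str], n: int, start: str) -> list[str]:
    """Enumerate all strings of length n that begin with `start` and use
    every element of the spectrum exactly once (error-free classical case).

    Assumes k = len(start) >= 2 and that `start` belongs to the spectrum.
    """
    k = len(start)
    solutions: list[str] = []

    def extend(seq: str, unused: set[str]) -> None:
        if len(seq) == n:
            if not unused:  # every oligonucleotide has been used
                solutions.append(seq)
            return
        suffix = seq[-(k - 1):]  # last k-1 letters must overlap the next k-mer
        for base in "ACGT":
            nxt = suffix + base
            if nxt in unused:
                extend(seq + base, unused - {nxt})

    extend(start, spec - {start})
    return sorted(solutions)


if __name__ == "__main__":
    target = "ACGTTGCA"
    spec = spectrum(target, 3)
    print(reconstruct(spec, len(target), target[:3]))
```

The search is exponential in the worst case and ignores positive and negative errors, so it only illustrates what "reconstructing a sequence from a spectrum" means in the simplest case.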
Fig. 1 Pseudo-code for the alternating algorithm
Fig. 2 Verification of possible next move based on spectrum set S2
Fig. 3 Verification with longer overlapping
No errors in the spectrum, average number of solutions for 100 instances for a given k
| Capacity | DNA length | | | | | | |
|---|---|---|---|---|---|---|---|
| | 100 | 200 | 300 | 400 | 500 | 600 | 700 |
| | 1.00 | 1.00 | 1.00 | 1.01 | 1.00 | 1.00 | 1.00 |
| | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
Only positive errors in the spectrum, k = 8, average number of solutions for 100 instances
| DNA length | Positive errors percent | | | | | |
|---|---|---|---|---|---|---|
| | 0% | 1% | 2% | 3% | 4% | 5% |
| 100 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 300 | 1.00 | 1.01 | 1.00 | 1.00 | 1.00 | 1.00 |
| 500 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 700 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
Algorithm performance for small (k = 8) and large (k = 10) chips, only with negative errors
| DNA length, | Negative errors percent | | | | |
|---|---|---|---|---|---|
| | 1% | 2% | 3% | 4% | 5% |
| 100, | 100(99)[100] | 97(97)[97] | 94(94)[94] | 93(93)[93] | 96(92)[96] |
| 300, | 93(93)[93] | 85(83)[84] | 80(73)[79] | 74(68)[72] | 61(56)[60] |
| 500, | 67(64)[67] | 54(50)[54] | 44(39)[43] | 24(19)[23] | 11(10)[11] |
| 700, | 49(46)[48] | 19(16)[18] | 8(6)[7] | 4(3)[3] | 2(2)[2] |
| 100, | 100(99)[100] | 99(99)[99] | 95(93)[94] | 95(95)[95] | 89(89)[89] |
| 300, | 96(95)[96] | 89(89)[89] | 88(86)[88] | 84(84)[83] | 83(83)[81] |
| 500, | 87(87)[87] | 83(83)[83] | 78(77)[78] | 64(62)[64] | 59(56)[59] |
| 700, | 87(85)[86] | 75(74)[73] | 56(55)[55] | 50(49)[50] | 47(45)[47] |
Each table cell contains three values: the number of instances for which at least one DNA sequence was reconstructed in time; in parentheses, the number of instances for which only one DNA sequence was obtained; and in square brackets, the number of instances for which the target DNA was reconstructed.
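For readers who want to process these tables programmatically, the three-value cell notation can be unpacked as sketched below. The names `Cell` and `parse_cell` are hypothetical helpers introduced only for this illustration; the field meanings follow the note above.

```python
import re
from typing import NamedTuple


class Cell(NamedTuple):
    any_found: int      # at least one sequence reconstructed in time
    unique_found: int   # exactly one sequence obtained (value in parentheses)
    target_found: int   # target DNA reconstructed (value in square brackets)


# Matches cells such as "96(92)[96]"; optional whitespace is tolerated.
_CELL = re.compile(r"^\s*(\d+)\s*\((\d+)\)\s*\[(\d+)\]\s*$")


def parse_cell(text: str) -> Cell:
    """Split a cell written as a(b)[c] into its three counts."""
    match = _CELL.match(text)
    if match is None:
        raise ValueError(f"not in a(b)[c] form: {text!r}")
    return Cell(*(int(g) for g in match.groups()))


if __name__ == "__main__":
    cell = parse_cell("96(92)[96]")
    # Instances with more than one reconstructed sequence.
    print(cell, cell.any_found - cell.unique_found)
```

The difference between the first and second value gives the number of instances in which the algorithm returned more than one candidate sequence.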
Only negative errors in the spectrum, average number of solutions for 100 instances for a given k
| DNA length, | Negative errors percent | | | | |
|---|---|---|---|---|---|
| | 0% | 1% | 2% | 3% | 4% |
| 100, | 1.01 | 1.00 | 1.00 | 1.00 | 1.04 |
| 300, | 1.00 | 1.02 | 1.12 | 1.35 | 1.13 |
| 500, | 1.04 | 1.07 | 1.11 | 3.83 | 1.09 |
| 700, | 1.06 | 1.21 | 4.00 | 1.50 | 1.00 |
| 100, | 1.01 | 1.00 | 1.02 | 1.00 | 1.00 |
| 300, | 1.01 | 1.00 | 1.02 | 1.00 | 1.00 |
| 500, | 1.00 | 1.00 | 1.01 | 1.03 | 1.05 |
| 700, | 1.03 | 1.01 | 1.01 | 1.02 | 1.04 |
Algorithm performance for small (k = 8) and large (k = 10) chips, both types of errors
| DNA length, | Negative and positive errors percent | | | | |
|---|---|---|---|---|---|
| | 1% | 2% | 3% | 4% | 5% |
| 100, | 99(99)[99] | 98(73)[96] | 98(83)[98] | 91(85)[91] | 95(77)[95] |
| 300, | 76(67)[75] | 66(45)[59] | 53(36)[48] | 54(36)[46] | 39(29)[37] |
| 500, | 47(41)[46] | 31(20)[28] | 22(12)[19] | 11(7)[7] | 3(1)[3] |
| 700, | 31(24)[28] | 18(5)[9] | 4(3)[3] | 3(1)[2] | 3(1)[3] |
| 100, | 100(99)[100] | 99(97)[99] | 94(93)[93] | 89(87)[88] | 88(84)[88] |
| 300, | 89(84)[89] | 85(81)[85] | 85(76)[84] | 77(70)[77] | 79(75)[77] |
| 500, | 81(78)[80] | 70(65)[69] | 70(61)[69] | 54(50)[53] | 64(58)[63] |
| 700, | 76(72)[73] | 67(63)[66] | 46(44)[46] | 45(40)[44] | 47(43)[45] |
Each table cell contains three values: the number of instances for which at least one DNA sequence was reconstructed in time; in parentheses, the number of instances for which only one DNA sequence was obtained; and in square brackets, the number of instances for which the target DNA was reconstructed.
Average number of solutions for 100 instances, for both types of errors
| DNA length | Negative errors percent | | | | |
|---|---|---|---|---|---|
| | 0% | 1% | 2% | 3% | 4% |
| 100, | 1.01 | 1.40 | 1.19 | 1.09 | 1.24 |
| 300, | 1.52 | 1.65 | 2.39 | 2.03 | 1.58 |
| 500, | 1.31 | 1.83 | 10.18 | 1.54 | 3.00 |
| 700, | 1.29 | 12.44 | 5.25 | 2.00 | 2.00 |
| 100, | 1.02 | 1.02 | 1.01 | 1.02 | 1.08 |
| 300, | 1.08 | 1.29 | 1.40 | 1.16 | 1.60 |
| 500, | 1.03 | 1.14 | 1.42 | 1.07 | 1.25 |
| 700, | 1.06 | 1.05 | 1.04 | 1.22 | 1.21 |
Only negative errors from repetitions, k = 10
| DNA length | 1000 | 2000 | 3000 | 4000 | 5000 |
|---|---|---|---|---|---|
| Sequence found | 100/100 | 98/100 | 85/100 | 71/100 | 65/100 |
| Target seq. found | 100/100 | 97/100 | 81/100 | 67/100 | 57/100 |
| Only 1 sequence | 89/100 | 74/100 | 58/100 | 47/100 | 41/100 |
| Many sequences | 11/100 | 24/100 | 27/100 | 24/100 | 24/100 |
| Average repetitions | 26 | 72 | 108 | 128 | 177 |
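Negative errors from repetitions arise because a spectrum is a set: an oligonucleotide that occurs several times in the DNA is recorded only once. The sketch below counts such lost occurrences for a classical k-mer spectrum. The function name `repetition_errors` is hypothetical, and treating this count as comparable to the "Average repetitions" row is an assumption, since the exact definition used for that row is not given here.

```python
from collections import Counter


def repetition_errors(dna: str, k: int) -> int:
    """Number of k-mer occurrences hidden by set semantics.

    Each k-mer occurring c times in dna contributes c - 1 negative
    errors, because only one copy appears in the spectrum.
    """
    counts = Counter(dna[i:i + k] for i in range(len(dna) - k + 1))
    return sum(c - 1 for c in counts.values())


if __name__ == "__main__":
    # "ACG" occurs twice, so one occurrence is missing from the spectrum.
    print(repetition_errors("ACGACGT", 3))
```

Longer sequences tend to contain more repeated oligonucleotides for a fixed k, which is consistent with the growth of the repetition counts with DNA length in the table.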
Results of the proposed algorithm for the benchmark instances by Blazewicz et al.
| DNA length | 200 | 300 | 400 | 500 | 600 |
|---|---|---|---|---|---|
| Hybridization errors | |||||
| 5% positive and 5% negative | |||||
| Sequence found | 34/40 | 33/40 | 32/40 | 33/40 | 26/40 |
| Only 1 sequence | 34/40 | 33/40 | 31/40 | 33/40 | 26/40 |
| Many sequences | 1/40 | 0/40 | 1/40 | 0/40 | 0/40 |
| 10% positive and 10% negative | |||||
| Sequence found | 25/40 | 21/40 | 24/40 | 19/40 | 12/40 |
| Only 1 sequence | 25/40 | 21/40 | 24/40 | 19/40 | 12/40 |
| Many sequences | 0/40 | 0/40 | 0/40 | 0/40 | 0/40 |