Weina Li1,2,3, Jiadong Ren1,2,3.
Abstract
Extracting meaningful patterns from biological sequence data is an important approach to discovering the regulatory rules of genes and proteins and their inheritance relationships. Existing sequence pattern discovery algorithms, such as MSPM and FBSB, suffer from low efficiency and accuracy. To address this issue, this paper presents a new algorithm for biological sequence pattern mining, abbreviated MpBsmi, based on a data index structure. MpBsmi employs a sequence position table (ST) and a sequence database index structure (DB-Index) for data storage, mining and pattern expansion. The ST and DB-Index of single items are first obtained by scanning the sequence database once. A new fast support counting algorithm is then applied to the ST to identify the frequent single items. Based on a connection strategy, the frequent patterns are expanded, and the ST of the expanded patterns is updated by scanning the DB-Index. The fast support counting algorithm is then used to obtain the frequent expanded patterns. Finally, a new pruning technique for extended patterns avoids generating an unnecessarily large number of candidate patterns. Experimental results on multiple classical protein sequences from the Pfam database validate the performance of the proposed algorithm, including its accuracy, stability and scalability. The results show that the proposed algorithm achieves better space efficiency, stability and scalability than MSPM and FBSB, the two main algorithms for biological sequence mining.
Year: 2018 PMID: 29684052 PMCID: PMC5912758 DOI: 10.1371/journal.pone.0195601
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
An example of the biological sequence.
| index | sequence |
|---|---|
| 1 | KITIITGGTRGIGFAAAKLFIENGAKVSIFGETQEEVDTALAQLKELYPE |
| 2 | EVALVTGATSGIGLEIARRLGKEGLRVFVCARGEEGLRTTLKELREAGVE |
| 3 | RVALVTGATSGIGLATARLLAAQGHLVFLGARTESDVIATVKALRNDGLEA |
| 4 | PVALVTGATSGIGLAIARRLAALGARTFLCARDEERLAQTVKELRGEGF |
Instances of sequence database.
| Serial number | sequence |
|---|---|
| 1 | abcbac |
| 2 | acbcab |
| 3 | bcbabc |
| 4 | acbabc |
Instances of position table.
| item | location information |
|---|---|
| a | [{1,5}, {1,5}, {4}, {1,4}] |
| b | [{2,4}, {3,6}, {1,3,5}, {3,5}] |
| c | [{3,6}, {2,4}, {2,6}, {2,6}] |
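The position table above can be reproduced from the four-sequence example database with a single scan. The following is a minimal Python sketch, not the authors' implementation: the function names `build_position_table` and `support` are hypothetical, positions are assumed to be 1-based, and support is assumed to mean the number of sequences in which an item occurs at least once.

```python
from collections import defaultdict


def build_position_table(sequences):
    """Scan the database once; map each item to one position set per sequence.

    Positions are 1-based, matching the example table (an assumption).
    """
    table = defaultdict(lambda: [set() for _ in sequences])
    for seq_idx, seq in enumerate(sequences):
        for pos, item in enumerate(seq, start=1):
            table[item][seq_idx].add(pos)
    return dict(table)


def support(position_sets):
    """Count the sequences in which the item occurs at least once."""
    return sum(1 for positions in position_sets if positions)


# The example sequence database from the table above.
database = ["abcbac", "acbcab", "bcbabc", "acbabc"]
st = build_position_table(database)

for item in sorted(st):
    print(item, [sorted(p) for p in st[item]], "support =", support(st[item]))
```

Each row of the table corresponds to one item (a, b, c in alphabetical order), and each set in a row holds that item's positions in one sequence. Because the support count only checks whether a set is non-empty, it needs no further pass over the raw sequences.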
Fig 1. Algorithm model.
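The overall flow described in the abstract (build position data, count support, expand frequent patterns, prune) can be illustrated with a level-wise sketch in Python. This is an illustration under stated assumptions rather than the MpBsmi algorithm itself: patterns are assumed to be contiguous substrings, the expansion step simply appends one frequent item to a frequent pattern, and the pruning shown is plain support-based pruning. The paper's actual connection strategy, DB-Index layout and pruning technique are not specified in this excerpt, and all function names here are hypothetical.

```python
def build_position_table(sequences):
    """One scan of the database: item -> list of 1-based position sets, one per sequence."""
    table = {}
    for seq_idx, seq in enumerate(sequences):
        for pos, item in enumerate(seq, start=1):
            table.setdefault(item, [set() for _ in sequences])[seq_idx].add(pos)
    return table


def support(position_sets):
    """Number of sequences with at least one occurrence."""
    return sum(1 for positions in position_sets if positions)


def extend(pattern_ends, item_positions):
    """End positions of pattern+item: the item must directly follow the pattern."""
    return [
        {p + 1 for p in ends if p + 1 in item_pos}
        for ends, item_pos in zip(pattern_ends, item_positions)
    ]


def mine(sequences, min_support):
    """Level-wise mining of frequent contiguous patterns (assumed pattern model)."""
    st = build_position_table(sequences)
    frequent_items = {i: pos for i, pos in st.items() if support(pos) >= min_support}
    result = {i: support(pos) for i, pos in frequent_items.items()}
    frontier = dict(frequent_items)
    while frontier:
        next_frontier = {}
        for pattern, ends in frontier.items():
            for item, item_pos in frequent_items.items():
                new_ends = extend(ends, item_pos)
                count = support(new_ends)
                # Pruning: an infrequent extension is dropped and never expanded again.
                if count >= min_support:
                    next_frontier[pattern + item] = new_ends
                    result[pattern + item] = count
        frontier = next_frontier
    return result


patterns = mine(["abcbac", "acbcab", "bcbabc", "acbabc"], min_support=3)
for pattern in sorted(patterns, key=lambda p: (len(p), p)):
    print(pattern, patterns[pattern])
```

Working on position sets instead of the raw sequences means each expansion only intersects shifted positions, so the database itself is scanned once. That is the general idea an index-based miner relies on, whatever the concrete data layout.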
Experimental data of algorithm execution efficiency.
| Protein family | Identification number | Total sequences | Average length | Test sequences |
|---|---|---|---|---|
|  | PF00106 | 86592 | 180.90 | 200 |
|  | PF00139 | 3578 | 219.20 | 278 |
|  | PF00182 | 2230 | 152.50 | 327 |
|  | PF00503 | 6158 | 307.00 | 380 |
|  | PF00182 | 63111 | 121.30 | 223 |
|  | PF00902 | 3594 | 212.50 | 210 |
|  | PF01297 | 5307 | 273.40 | 300 |
|  | PF01676 | 5899 | 425.50 | 280 |
|  | PF02016 | 2533 | 283.00 | 260 |
|  | PF02412 | 6644 | 31.70 | 300 |
| 2 | PF03171 | 14598 | 104.80 | 135 |
|  | PF09069 | 1080 | 90.30 | 107 |
Experimental data of algorithm scalability on data size.
| Protein family | Identification number | Total sequences | Average length | Test sequences |
|---|---|---|---|---|
|  | PF10312 | 782 | 180.20 | 600 |
|  | PF09069 | 1080 | 90.30 | 600 |
|  | PF00689 | 13024 | 180.10 | 600 |
Experimental data of algorithm scalability on data length.
| Protein family | Identification number | Total sequences | Average length | Test sequences | Groups |
|---|---|---|---|---|---|
|  | PF08386 | 2515 | 100.60 | 200 | 2 |
|  | PF14262 | 678 | 200.50 | 200 | 2 |
|  | PF04515 | 2596 | 300.00 | 200 | 2 |
|  | PF09818 | 552 | 403.80 | 200 | 2 |
Fig 2. Algorithm efficiency comparison.
Fig 3. Algorithm scalability comparison of data size.
Fig 4. Algorithm scalability comparison of data length.
Sequence patterns under different data set sizes.
| Support threshold | BSP | Data size | BSP | Data size | BSP | Data size |
|---|---|---|---|---|---|---|
| 40% | 170 | 100 | 171 | 300 | 171 | 500 |
| 40% | 170 | 100 | 171 | 300 | 171 | 500 |
| 40% | 135 | 100 | 138 | 300 | 139 | 500 |
| 40% | 171 | 200 | 172 | 400 | 173 | 600 |
| 40% | 144 | 200 | 146 | 400 | 124 | 600 |
| 40% | 133 | 200 | 137 | 400 | 136 | 600 |
Sequence patterns under different support thresholds.
| Support threshold | BSP | Support threshold | BSP |
|---|---|---|---|
| 5% | 3198 | 25% | 304 |
| 10% | 1037 | 30% | 233 |
| 15% | 625 | 35% | 192 |
| 20% | 413 | 40% | 164 |
The memory usage of algorithms.
| Algorithms | Peak value of memory | Peak value of CPU occupancy ratio | Support threshold |
|---|---|---|---|
|  | 549.6 MB | 54.6% | 40% |
|  | 553.3 MB | 53.6% | 40% |
|  | 602.8 MB | 49.6% | 40% |
P-values: analysis of the efficiency differences between the algorithms.
| Experiments | MpBsmi and FBSB | MpBsmi and MSPM |
|---|---|---|
| Experiment 1 | 0.01378 | 6.6974 × 10⁻¹¹ |
| Experiment 2 | 0.00006 | 0.002 |
| Experiment 3 | 0.035 | 0.044 |