Yun Xue, Zhengling Liao, Meihang Li, Jie Luo, Qiuhua Kuang, Xiaohui Hu, Tiechen Li.
Abstract
Order-preserving submatrices (OPSMs) are an important unsupervised learning model that has been applied in many fields, such as DNA microarray data analysis, automatic recommendation systems, and target marketing systems. Unfortunately, because OPSM discovery is an NP-complete problem, most existing methods are heuristic algorithms that cannot reveal all OPSMs. In particular, deep OPSMs, which correspond to long patterns with few supporting sequences, incur explosive computational costs and are pruned entirely by most popular methods. In this paper, we propose an exact method that discovers all OPSMs based on frequent sequential pattern mining. First, an existing algorithm is adapted to find all common subsequences (ACS) between every pair of row sequences, so that no deep OPSM is missed. Then, an improved prefix-tree data structure is used to store and traverse the ACS, and the Apriori principle is employed to mine frequent sequential patterns efficiently. Finally, experiments were conducted on gene and synthetic data sets. The results demonstrate the effectiveness and efficiency of the method.
Year: 2015 PMID: 26161131 PMCID: PMC4464847 DOI: 10.1155/2015/680434
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238
Raw data matrix.
| Rows | Column 1 | Column 2 | Column 3 | Column 4 | Column 5 | Column 6 |
|---|---|---|---|---|---|---|
| Row 1 | 40 | 27 | 35 | 8 | 27 | 57 |
| Row 2 | 7 | 51 | 42 | 13 | 24 | 42 |
| Row 3 | 15 | 43 | 37 | 21 | 31 | 27 |
| Row 4 | 27 | 55 | 49 | 33 | 42 | 59 |
| Row 5 | 20 | 11 | 31 | 37 | 31 | 39 |
Figure 1. (a) Original data matrix: 5 rows and 6 columns; (b) three rows exhibit a coherent pattern.
Figure 2. Three rows form a coherent ascending pattern under permuted columns.
Transformed sequence data sets.
| Rows | Column 1 | Column 2 | Column 3 | Column 4 | Column 5 | Column 6 |
|---|---|---|---|---|---|---|
| Row 1 | 4 | 2 | 5 | 3 | 1 | 6 |
| Row 2 | 1 | 4 | 5 | 3 | 6 | 2 |
| Row 3 | 1 | 4 | 6 | 5 | 3 | 2 |
| Row 4 | 1 | 4 | 5 | 3 | 2 | 6 |
| Row 5 | 2 | 1 | 3 | 5 | 4 | 6 |
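The transformed table can be reproduced from the raw data matrix by listing, for each row, the column indices in ascending order of their values. The following is a minimal sketch of that step, not the authors' code: the function name `to_column_sequence` is hypothetical, and ties (such as the two 27s in Row 1) are assumed to be broken by the lower column index, which agrees with the table above.

```python
def to_column_sequence(row):
    """Return 1-based column indices ordered by ascending value.

    Assumption: ties are broken by column index. Python's sort is
    stable, so equal values keep their left-to-right order.
    """
    order = sorted(range(len(row)), key=lambda col: row[col])
    return [col + 1 for col in order]


# Raw data matrix from the table above (5 rows, 6 columns).
raw = [
    [40, 27, 35, 8, 27, 57],
    [7, 51, 42, 13, 24, 42],
    [15, 43, 37, 21, 31, 27],
    [27, 55, 49, 33, 42, 59],
    [20, 11, 31, 37, 31, 39],
]

sequences = [to_column_sequence(row) for row in raw]
for i, seq in enumerate(sequences, start=1):
    print(f"Row {i}: {seq}")
```

Rows that share a common subsequence of column indices, for example 1, 4, 5, 3 in Rows 2 and 4, rise in the same order over those columns.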
Algorithm 1
Figure 4. Example of two-candidate prefix tree.
Figure 3. Flowchart of our algorithm.
A microarray data matrix D.
| Rows | Column 1 | Column 2 | Column 3 | Column 4 | Column 5 |
|---|---|---|---|---|---|
| Row 1 | 120 | 110 | 119 | | 100 |
| Row 2 | 999 | 128 | 80 | 115 | 810 |
| Row 3 | 676 | 300 | 77 | 287 | 264 |
| Row 4 | 197 | 107 | 99 | 587 | 101 |
| Row 5 | 154 | 78 | | 20 | 10 |
An example of the column-permuted matrix C.
| Rows | Column 1 | Column 2 | Column 3 | Column 4 | Column 5 |
|---|---|---|---|---|---|
| Row 1 | 5 | 2 | 3 | 1 | |
| Row 2 | 3 | 4 | 2 | 5 | 1 |
| Row 3 | 3 | 5 | 4 | 2 | 1 |
| Row 4 | 3 | 5 | 2 | 1 | 4 |
| Row 5 | 5 | 4 | 2 | 1 | |
Figure 5. Example of 2-frequent prefix tree with “δ = 3.”
Figure 6. Results of (ζ + 1)-frequent prefix tree mining.
Number of OPSMs of different row thresholds.
| Row threshold | 3 | 5 | 8 | 10 |
|---|---|---|---|---|
| Number of biclusters | 9248 | 2791 | 1350 | 771 |
Figure 7. Statistical chart of the overlap distribution.
Figure 8. Two examples of mined OPSMs on gene data set.
Figure 9. Percentage of significantly enriched biclusters/clusters by GO Biological Process category for the five selected biclustering methods and our algorithm at different significance levels P.
Figure 10. Effect of noise.
Figure 11. Effect of overlap.
(a) Candidate 2-subsequences matrix
| Sequences | Common 2-subsequences |
|---|---|
| 1, 2 | 5, 1; 2, 1; 3, 1 |
| 1, 3 | 5, 2; 5, 1; 2, 1; 3, 1 |
| 1, 4 | 5, 2; 5, 1; 2, 1; 3, 1 |
| 1, 5 | 5, 2; 5, 1; 2, 1 |
| 2, 3 | 3, 4; 3, 2; 3, 5; 3, 1; 4, 1; 4, 2; 2, 1; 5, 1 |
| 2, 4 | 3, 4; 3, 2; 3, 5; 3, 1; 2, 1; 5, 1 |
| 2, 5 | 4, 2; 4, 1; 2, 1 |
| 3, 4 | 3, 5; 3, 4; 3, 2; 3, 1; 5, 4; 5, 2; 5, 1; 2, 1 |
| 3, 5 | 5, 4; 5, 2; 5, 1; 4, 2; 4, 1; 2, 1 |
| 4, 5 | 5, 2; 5, 1; 5, 4; 2, 1 |
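A common 2-subsequence of two row sequences is an ordered pair of column indices that appears in the same relative order in both rows. The sketch below illustrates this definition by brute force on the rows of matrix C. It is an assumption-laden illustration, not the adapted ACS algorithm from the paper: the name `common_k_subsequences` is hypothetical, items are assumed not to repeat within a row, and the enumeration cost grows quickly with `k`.

```python
from itertools import combinations


def common_k_subsequences(seq_a, seq_b, k=2):
    """Return the k-subsequences of seq_a that also occur, in the same
    order, in seq_b. Assumes no item repeats within a sequence."""
    pos_b = {item: i for i, item in enumerate(seq_b)}
    common = []
    # combinations() keeps the left-to-right order of seq_a.
    for cand in combinations(seq_a, k):
        if not all(item in pos_b for item in cand):
            continue  # some column is missing from seq_b
        if all(pos_b[x] < pos_b[y] for x, y in zip(cand, cand[1:])):
            common.append(cand)
    return common


# Row sequences of the column-permuted matrix C (rows 1 and 5 are shorter
# because of missing values).
C = {
    1: [5, 2, 3, 1],
    2: [3, 4, 2, 5, 1],
    3: [3, 5, 4, 2, 1],
    4: [3, 5, 2, 1, 4],
    5: [5, 4, 2, 1],
}

print(common_k_subsequences(C[1], C[2], k=2))
print(common_k_subsequences(C[2], C[3], k=3))
```

For rows 1 and 2 this yields the three pairs listed in the first row of the table, and for rows 2 and 3 with `k=3` it yields the five triples listed for that pair in the 3-subsequence table.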
(b) Candidate 3-subsequences matrix
| Sequences | Common 3-subsequences |
|---|---|
| 1, 3 | 5, 2, 1 |
| 1, 4 | 5, 2, 1 |
| 1, 5 | 5, 2, 1 |
| 2, 3 | 3, 4, 2; 3, 4, 1; 3, 2, 1; 3, 5, 1; 4, 2, 1 |
| 2, 4 | 3, 5, 1; 3, 2, 1 |
| 2, 5 | 4, 2, 1 |
| 3, 4 | 3, 2, 1; 3, 5, 4; 3, 5, 1; 3, 5, 2; 5, 2, 1 |
| 3, 5 | 5, 4, 2; 5, 4, 1; 5, 2, 1 |
| 4, 5 | 5, 2, 1 |
(c) Candidate 4-subsequences matrix
| Sequences | Common 4-subsequences |
|---|---|
| 2, 3 | 3, 4, 2, 1 |
| 3, 4 | 3, 5, 2, 1 |
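The Apriori principle states that a pattern can be frequent only if all of its shorter subpatterns are frequent, so candidates of length k + 1 need only be built from frequent patterns of length k. The following is a rough level-wise sketch of that idea on the rows of matrix C. It uses plain Python sets rather than the paper's prefix tree, the names `support` and `mine_frequent_patterns` are hypothetical, and `min_rows` is assumed to play the role of the row threshold δ.

```python
from itertools import permutations


def is_subsequence(pattern, row):
    """True if pattern occurs in row in order (not necessarily contiguous)."""
    it = iter(row)
    return all(item in it for item in pattern)


def support(pattern, rows):
    """Number of rows that contain pattern as a subsequence."""
    return sum(is_subsequence(pattern, row) for row in rows)


def mine_frequent_patterns(rows, min_rows):
    """Level-wise (Apriori-style) mining; returns {length: set of patterns}."""
    items = sorted({x for row in rows for x in row})
    level = {p for p in permutations(items, 2) if support(p, rows) >= min_rows}
    frequent, k = {}, 2
    while level:
        frequent[k] = level
        # Join step: (a, ..., y) and (..., y, z) overlap on k - 1 items.
        candidates = {p + (q[-1],) for p in level for q in level
                      if p[1:] == q[:-1]}
        level = {c for c in candidates
                 if len(set(c)) == len(c) and support(c, rows) >= min_rows}
        k += 1
    return frequent


rows = [
    [5, 2, 3, 1],
    [3, 4, 2, 5, 1],
    [3, 5, 4, 2, 1],
    [3, 5, 2, 1, 4],
    [5, 4, 2, 1],
]

frequent = mine_frequent_patterns(rows, min_rows=3)
for length, patterns in sorted(frequent.items()):
    print(length, sorted(patterns))
```

With `min_rows=3`, this sketch reports four patterns of length 3 on these five rows and none of length 4, which is consistent with each listed 4-subsequence being shared by only one pair of rows.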