Youxi Wu, Changrui Zhu, Yan Li, Lei Guo, Xindong Wu.
Abstract
Sequential pattern mining (SPM) has been applied in many fields. However, traditional SPM neglects the repetition of a pattern within a sequence. To address this, gap-constraint SPM was proposed, which also avoids finding too many useless patterns. Nonoverlapping SPM, a branch of gap-constraint SPM, requires that no two occurrences use the same sequence letter at the same position of the pattern. Nonoverlapping SPM strikes a balance between efficiency and completeness. The frequent patterns discovered by existing methods normally contain redundant patterns. To reduce redundant patterns and improve mining performance, this paper adopts the closed pattern mining strategy and proposes a complete algorithm based on the Nettree structure, named Nettree for Nonoverlapping Closed Sequential Pattern (NetNCSP). NetNCSP has two key steps: support calculation and closeness determination. A backtracking strategy is employed to calculate the nonoverlapping support of a pattern on the corresponding Nettree, which reduces the time complexity. This paper also proposes three kinds of pruning strategies: inheriting, predicting, and determining. These strategies identify redundant patterns effectively because they can predict the frequency and closeness of patterns before the candidate patterns are generated. Experimental results show that NetNCSP is not only more efficient but also discovers more closed patterns with good compressibility. Furthermore, in biological experiments, NetNCSP mines the closed patterns in the SARS-CoV-2 and SARS viruses. The results show that the two viruses have a similar pattern composition but different combinations.
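The nonoverlapping condition described in the abstract can be illustrated with a small sketch. This is not the Nettree-based algorithm of the paper; it is a plain leftmost search with backtracking, written under the assumption that a gap constraint `[min_gap, max_gap]` bounds the number of wildcard letters between consecutive pattern letters. The function name `nonoverlapping_support` and its parameters are hypothetical, and no claim is made that this sketch always attains the maximum support that the paper's complete algorithm guarantees.

```python
def nonoverlapping_support(seq, pattern, min_gap, max_gap):
    """Collect occurrences of `pattern` in `seq` such that no two occurrences
    use the same sequence index at the same pattern position.

    Illustrative sketch only (assumed gap semantics, hypothetical names);
    the search can take exponential time in the worst case.
    """
    m = len(pattern)
    # used[j] holds the sequence indices already consumed by pattern position j.
    used = [set() for _ in range(m)]

    def extend(j, prev):
        # Try to match pattern[j:] given that pattern[j - 1] sits at index prev.
        if j == m:
            return []
        lo = prev + 1 + min_gap
        hi = min(prev + 1 + max_gap, len(seq) - 1)
        for pos in range(lo, hi + 1):
            if seq[pos] == pattern[j] and pos not in used[j]:
                rest = extend(j + 1, pos)
                if rest is not None:
                    return [pos] + rest
                # Otherwise backtrack and try the next candidate index.
        return None

    occurrences = []
    for start in range(len(seq)):
        if seq[start] != pattern[0]:
            continue
        rest = extend(1, start)
        if rest is not None:
            occ = [start] + rest
            for j, pos in enumerate(occ):
                used[j].add(pos)
            occurrences.append(occ)
    return occurrences


occs = nonoverlapping_support("ATATAT", "ATA", 0, 1)
print(len(occs), occs)  # 2 [[0, 1, 2], [2, 3, 4]]
```

In this toy run, index 2 appears in both occurrences, which is permitted: it matches the third pattern letter in the first occurrence and the first pattern letter in the second, so the two occurrences never share an index at the same pattern position.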
Keywords: COVID-19; Closed pattern mining; Nettree; Nonoverlapping sequence pattern; Periodic wildcard gaps; Sequential pattern mining
Year: 2020 PMID: 32292248 PMCID: PMC7118609 DOI: 10.1016/j.knosys.2020.105812
Source DB: PubMed Journal: Knowl Based Syst ISSN: 0950-7051 Impact factor: 8.038
Fig. 1. The occurrences of pattern P in the sequence.
Related work.
| Research | Pattern type | Type of condition | Pruning strategy | Mining type | Periodic gap constraint | Repetitions of pattern |
|---|---|---|---|---|---|---|
| Yan et al. | Closed | Ignore | Other | Exact | No | Ignore |
| Wang et al. | Closed | Ignore | Other | Exact | No | Ignore |
| Lam et al. | Compressed | On–off condition | Apriori | Approximate | No | Ignore |
| Wu et al. | Frequent | On–off condition | Apriori | Approximate | Yes | Capture |
| Li et al. | Closed | No-condition | Other | Exact | Yes | Capture |
| Zhang et al. | Frequent | No-condition | Apriori-like | Exact | Yes | Capture |
| Wu et al. | Frequent | Nonoverlapping condition | Apriori | Exact | Yes | Capture |
| This paper | Closed | Nonoverlapping condition | Apriori, other | Exact | Yes | Capture |
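The "Closed" pattern type in the table above can be made concrete with a short sketch. It assumes that a pattern is closed when no super-pattern has the same support, and it further assumes that a super-pattern is any longer pattern containing the pattern as a contiguous substring. The function name `closed_patterns` and the support values are hypothetical and serve only as an illustration; this brute-force filter is not the closeness determination step of NetNCSP.

```python
def closed_patterns(supports):
    """Keep only patterns that have no super-pattern with equal support.

    `supports` maps each pattern string to its support count.
    Assumption: q is a super-pattern of p if p is a proper contiguous
    substring of q.
    """
    closed = {}
    for p, sup_p in supports.items():
        absorbed = any(
            q != p and p in q and sup_q == sup_p
            for q, sup_q in supports.items()
        )
        if not absorbed:
            closed[p] = sup_p
    return closed


# Hypothetical supports, chosen only to illustrate the filter.
example = {"A": 5, "AT": 5, "ATA": 3, "T": 6}
print(closed_patterns(example))  # {'AT': 5, 'ATA': 3, 'T': 6}
```

Here "A" is dropped because "AT" has the same support, so reporting "AT" alone loses no information about "A".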
Fig. 2. A Nettree.
Benchmark datasets.
| Dataset | Type | From | Length |
|---|---|---|---|
| AX829174 | DNA | Homo sapiens (human) | 10,011 |
| DNA1 | DNA | Homo sapiens AL158070 | 6,000 |
| DNA2 | DNA | Homo sapiens AL158070 | 8,000 |
| DNA3 | DNA | Homo sapiens AL158070 | 10,000 |
| DNA4 | DNA | Homo sapiens AL158070 | 12,000 |
| DNA5 | DNA | Homo sapiens AL158070 | 14,000 |
| DNA6 | DNA | Homo sapiens AL158070 | 16,000 |
| Potato_virus | Virus | Potato virus Y Wilga MV99 | 9,699 |
| Ebola_virus | Virus | Reston Ebola virus | 18,891 |
| SARS-CoV-2 | Virus | Severe acute respiratory syndrome coronavirus 2 (COVID-19) | 29,903 |
| SARS | Virus | Severe acute respiratory syndrome-related coronavirus | 29,751 |
AX829174 is used in References [31] and [29] for mining frequent patterns and closed patterns and can be downloaded from https://www.ncbi.nlm.nih.gov/nuccore/AX829174.1/.
The DNA1-6 datasets are used in Reference [34] and can be downloaded from https://www.ncbi.nlm.nih.gov/nuccore/AL158070.11.
Potato_virus is used in Reference [52] to mine continuous closed patterns and closed patterns with gap constraints and can be downloaded from https://www.ebi.ac.uk/ena/data/view/Taxon:1107954.
Ebola_virus is commonly used in biological sequence analysis and, like Potato_virus, can be downloaded from the ENA site: https://www.ebi.ac.uk/ena/data/view/Taxon:129003.
The SARS-CoV-2 sequence was reported in Reference [66] and can be downloaded from https://www.ncbi.nlm.nih.gov/nuccore/MN908947.
The SARS sequence can be downloaded from https://www.ncbi.nlm.nih.gov/nuccore/30271926.
Fig. 3. Running time with different strategies.
Fig. 4. Support calculation times with different strategies.
Fig. 5. Closeness determination times with different strategies.
Fig. 6. Pattern number with different pattern lengths.
Fig. 7. Running time and ATC with different pattern lengths.
Fig. 8. Support calculation times with different pattern lengths.
Fig. 9. Pattern number with different databases.
Fig. 10. Running time and ATC with different databases.
Fig. 11. Support calculation times with different databases.
Fig. 12. Comparison of closed pattern number in different databases.