
NetNCSP: Nonoverlapping closed sequential pattern mining.

Youxi Wu^1,2,3, Changrui Zhu^1, Yan Li^4, Lei Guo^2, Xindong Wu^5,6.

Abstract

Sequential pattern mining (SPM) has been applied in many fields. However, traditional SPM neglects the repetition of patterns within a sequence. To solve this problem, gap constraint SPM was proposed, which avoids finding too many useless patterns. Nonoverlapping SPM, a branch of gap constraint SPM, requires that no two occurrences use the same sequence letter at the same pattern position. Nonoverlapping SPM strikes a balance between efficiency and completeness. The frequent patterns discovered by existing methods normally contain redundant patterns. To reduce redundant patterns and improve mining performance, this paper adopts the closed pattern mining strategy and proposes a complete algorithm, named Nettree for Nonoverlapping Closed Sequential Pattern (NetNCSP), based on the Nettree structure. NetNCSP has two key steps: support calculation and closeness determination. A backtracking strategy is employed to calculate the nonoverlapping support of a pattern on the corresponding Nettree, which reduces the time complexity. This paper also proposes three pruning strategies: inheriting, predicting, and determining. These strategies find redundant patterns effectively because they can predict the frequency and closeness of patterns before candidate patterns are generated. Experimental results show that NetNCSP is not only more efficient but can also discover more closed patterns with good compressibility. Furthermore, in biological experiments NetNCSP mines the closed patterns in SARS-CoV-2 and SARS viruses. The results show that the two viruses have similar pattern composition but different combinations.


Keywords: COVID-19; Closed pattern mining; Nettree; Nonoverlapping sequence pattern; Periodic wildcard gaps; Sequential pattern mining

Year: 2020 PMID: 32292248 PMCID: PMC7118609 DOI: 10.1016/j.knosys.2020.105812

Source DB: PubMed Journal: Knowl Based Syst ISSN: 0950-7051 Impact factor: 8.038

Introduction

Sequential pattern mining (SPM) refers to discovering the subsequences (also known as patterns) that satisfy a given threshold in given sequences [1], [2], [3]. It has been widely applied in various fields, such as big data mining [4], [5], big data intelligence [6], e-commerce shopping analysis [7], biological sequence analysis [8], and event analysis [9]. To handle specific issues, many methods have been proposed, such as negative SPM [10], [11], maximal frequent pattern mining [12], [13], three-way pattern mining [14], [15], closed SPM [16], [17], and gap constraint SPM [18], [19]. Gap constraint SPM is an important branch of traditional SPM. In traditional SPM, a sequence database is a set of sequences, each sequence is a list of elements, and each element is a set of items. For example, a(abc)(ac)c(cd) is a sequence in traditional SPM, since (abc) is an element in the sequence. In gap constraint SPM, a sequence database can be a single sequence, and each sequence is a list of items. For example, aabcacccd is a sequence in gap constraint SPM. More importantly, traditional SPM does not count the number of occurrences of a pattern in a sequence, while gap constraint SPM does. For example, pattern “ac” occurs in sequence a(abc)(ac)c(cd) more than once, but traditional SPM neglects the repetition, which leads to the loss of support information. In long sequences such as DNA, virus genomes, and consumption records, the repetitive occurrences indicate the frequency of patterns, which helps discriminate between patterns and is of high research value. Gap constraint SPM (or gap constraint sequence pattern mining) can avoid finding many useless patterns by setting gap constraints, while repetitive SPM cannot [20], [21], since it does not set gap constraints. Gap constraint SPM is also different from frequent substring mining [22] and n-gram text mining [23], since the latter two mine patterns without gaps.
One key issue of gap constraint SPM is to calculate the support (the number of occurrences) of a pattern in a sequence, which is a pattern matching problem [24], [25], [26]. Thus, a gap constraint is also called a wildcard gap or flexible wildcards [27], [28] in the pattern matching field. If all gaps of a pattern are the same, the pattern is called a pattern with periodic gaps [29], described as $P = p_1[a, b]p_2 \cdots [a, b]p_m$, where a and b ($0 \le a \le b$) are the minimum and maximum gap constraints, respectively, and m indicates the pattern length [30]. The size of the gap constraints can be flexibly set by users, which leads to various applications, such as correlation analysis between DNA and diseases based on gap constraints [29]. Li et al. [31] proposed an effective method to mine patterns with gap constraints that can be used for feature extraction in sequence classification [32]. However, introducing gap constraints not only makes the mining method more flexible but also makes the results more complex, because the number of occurrences can increase exponentially as the pattern length increases. To solve this problem, the nonoverlapping condition [33] was proposed, which allows the same sequence letter to match and rematch pattern letters at different positions. Introducing the nonoverlapping condition not only reduces the number of occurrences but also makes the unique patterns richer. Our previous studies have proved that nonoverlapping SPM is a complete mining method which satisfies the Apriori property [34], [35]. Example 1 is used to explain the periodic gap pattern, gap constraint mining, and nonoverlapping support. Given a sequence database SDB={$s_1$=AATCATCA, $s_2$=AATGACTACTCAA, $s_3$=ATCAGATCAG}. Pattern P=A[0, 2]T[0, 2]C[0, 2]A is a periodic gapped pattern. To make it easier to describe, an occurrence is written as the list of sequence positions it uses. For example, the first occurrence of P in $s_1$ uses the sequence letters at positions 1, 3, 4, and 5, which can be written as ⟨1,3,4,5⟩.
In this way, the other two occurrences of P in $s_1$ are ⟨2,3,4,5⟩ and ⟨5,6,7,8⟩. The results are shown in Fig. 1.

Fig. 1

The occurrences of pattern P in sequence $s_1$.

Similarly, there are four occurrences of P in both $s_2$ and $s_3$. If we ignore the repetition, the support of P in SDB is 3. If we take the repetition into consideration, the support is 3+4+4=11. The latter method considers occurrences in detail. However, this method also encounters some problems. For example, there are three occurrences of pattern $P_1$ = A[0, 2]T in sequence s = AAATCCC, but there are nine occurrences of its super-pattern $P_2$ = A[0, 2]T[0, 2]C. This example shows that the number of occurrences can increase exponentially with increasing pattern length, so this way of counting does not satisfy the Apriori property [19]. To solve the above problem, the nonoverlapping condition was proposed [33]. With the nonoverlapping condition, any two occurrences cannot use the same sequence letter in the same pattern position. For example, ⟨1,3,4,5⟩ and ⟨5,6,7,8⟩ are two nonoverlapping occurrences of pattern P in sequence $s_1$, even though sequence letter $s_5$ is matched twice, with $p_4$ and $p_1$, respectively. However, ⟨1,3,4,5⟩ and ⟨2,3,4,5⟩ are two overlapping occurrences, since sequence letters $s_3$, $s_4$, and $s_5$ are reused by $p_2$, $p_3$, and $p_4$, respectively.

Our previous work proposed an effective algorithm, NOSEP, and reported that nonoverlapping SPM performs better than other state-of-the-art gap constraint SPM methods in finding useful patterns in biological sequences and in avoiding under-expression and over-expression in time series [34]. However, the frequent patterns discovered by NOSEP can be further compressed. An illustrative example is as follows. When minsup=2 and gap=[0, 2], with the nonoverlapping condition, there are 16 frequent patterns in the sequence $s_1$ = AATCATCA, which are {“A”, “AA”, “AT”, “AC”, “TC”, “CA”, “AAA”, “AAC”, “AAT”, “ATA”, “ATC”, “ACA”, “AA”, “TC”, “AATA”, “AACA”, “ATCA”}. The supports of these patterns are sup(“A”)=4, sup(“AA”)=3, and 2 for the remaining patterns.
Patterns “AT”, “TC”, “AAT”, and “ATC” are unclosed patterns, since they are sub-patterns of pattern “AATC” and their supports are all 2. In this way, the 16 frequent patterns can be compressed into 6 closed patterns: “A”, “AA”, “AATC”, “AATA”, “AACA”, and “ATCA”. This example shows that closed patterns can effectively compress frequent patterns without losing support information. To reduce redundant patterns and improve the mining speed, this paper adopts the closed pattern mining strategy to obtain lossless compression of frequent patterns.

The contributions of this paper are as follows:
(1) The problem of nonoverlapping periodic gapped closed SPM is addressed, and a complete algorithm NetNCSP (Nettree for Nonoverlapping Closed Sequential Pattern) is proposed.
(2) To calculate the support, NetNCSP employs the backtracking strategy to match the pattern in the Nettree structure, which reduces the time complexity. More importantly, NetNCSP adopts three pruning strategies to find closed patterns.
(3) We show that NetNCSP is a complete algorithm that satisfies the Apriori property.
(4) A large number of comparative experiments show that NetNCSP is not only more efficient, but also possesses remarkable pattern compressibility.

The rest of this paper is organized as follows. Section 2 introduces the related work. Section 3 defines the problem. Section 4 proposes the NetNCSP algorithm and demonstrates its completeness, complexities, and Apriori property. Section 5 presents a comparative experimental analysis. Section 6 concludes the paper.

Related work

Agrawal et al. [36] proposed SPM. Based on this research, many achievements have been made, such as high utility mining [37], contrast SPM [38], [39], and closed SPM [40], [41]. Closed SPM can effectively compress the frequent patterns [42], [43]. For example, assuming that the supports of patterns “A”, “AT”, and “ATC” are equal, then patterns “A” and “AT” are called redundant patterns. Hence, patterns “A”, “AT”, “ATC” can be compressed into “ATC” without losing support information. Besides closed pattern mining, there are other methods to achieve pattern compression, such as generator mining [44] and maximal pattern mining [45]. Generator mining aims to find the set of patterns with minimal length, while closed pattern mining focuses on finding the set of patterns with maximal length. Maximal pattern mining finds the set of patterns whose super-patterns are infrequent. Closed SPM has become a research hotspot because of its impressive compression performance [46], [47] and has been widely used in many essential fields, such as recommendation systems [48], clustering analysis [49], [50], [51], genetic engineering [52], disease diagnosis [53], and software engineering [54], [55]. However, these studies ignored the repetitions that may contain more relevant information in long sequences. Noticing this disadvantage, Ding et al. [33] proposed the CloGSgrow algorithm, which studied the repetitive occurrences of patterns. CloGSgrow calculates the occurrences of the super-patterns based on those of the sub-patterns and the results show that the repetitive occurrence pattern is compressible, which is of high research value for long sequences. Li et al. [31] added periodic gap constraints and the experiments show that introducing gap constraints improves the mining results. Another important branch of SPM is gap constraint SPM. This method aims to discover subsequences from sequences that satisfy the gap constraints and support threshold. 
Existing studies are based on three types of conditions: no-condition [29], [56], one-off condition [8], [57], and nonoverlapping condition [33], [34], [58]. The no-condition allows sequence letters to be reused by patterns an unlimited number of times. Zhang et al. [29] first proposed SPM under no-condition in DNA sequences. However, pattern mining under no-condition does not satisfy the Apriori property, which means it is necessary to expand the search space to mine all frequent patterns [19]. The one-off condition allows each sequence letter to be matched no more than once [59]. In Example 1, there is only one occurrence ⟨1,3,4,5⟩ of P in $s_1$ with the one-off condition. Pattern mining under the one-off condition was applied to biological sequence mining [8]. It is an NP-hard problem, since its computational complexity is the same as that of an iterative shuffle problem [60]. Therefore, heuristic strategies are applied to calculate the pattern support, and pattern mining under the one-off condition is an approximate mining method. The nonoverlapping condition allows a sequence letter to be rematched by different pattern letters but to be matched no more than once by the same pattern letter. Although the nonoverlapping condition is more complex than the other two, our previous studies have demonstrated that pattern mining under the nonoverlapping condition is not only a complete mining method with the Apriori property but can also discover more valuable patterns than the other two conditions [34], [35]. Consequently, pattern mining under the nonoverlapping condition outperforms mining under the no-condition and the one-off condition. A comparison of the related research work is shown in Table 1.

Table 1

Related work.

Research	Pattern type	Type of condition	Pruning strategy	Mining type	Periodic gap constraint	Repetitions of pattern
Yan et al. [61]	Closed	Ignore	Other	Exact	No	Ignore
Wang et al. [62]	Closed	Ignore	Other	Exact	No	Ignore
Lam et al. [63]	Compressed	One-off condition	Apriori	Approximate	No	Ignore
Wu et al. [8]	Frequent	One-off condition	Apriori	Approximate	Yes	Capture
Li et al. [31]	Closed	No-condition	Other	Exact	Yes	Capture
Zhang et al. [29]	Frequent	No-condition	Apriori-like	Exact	Yes	Capture
Wu et al. [34]	Frequent	Nonoverlapping condition	Apriori	Exact	Yes	Capture
This paper	Closed	Nonoverlapping condition	Apriori, other	Exact	Yes	Capture

As can be seen from Table 1, Reference [34] is the closest to this paper. The differences between Reference [34] and this paper are as follows. To obtain the maximum pattern support, NETGAP prunes the invalid nodes after obtaining each nonoverlapping occurrence [34]. Hence, the time complexity of NETGAP is O(m × m × n × W), where m is the pattern length, n is the sequence length, and W is the maximum gap. In this paper, we propose the BackTr algorithm, which employs a backtracking strategy to calculate the pattern support without pruning the invalid nodes and thereby reduces the time complexity to O(m × n × W). In addition, Reference [34] focused on mining frequent patterns, while this paper mines closed patterns and adopts three pruning strategies (inheriting, predicting, and determining) to predict the frequency and closeness of the patterns.

Problem definition

A sequence with length n can be described as $S = s_1 s_2 \cdots s_n$, where $s_i \in \Sigma$ ($1 \le i \le n$), $\Sigma$ denotes the set of items, and $|\Sigma|$ indicates its size. A periodic gap constraint pattern P with length m can be written as $P = p_1[a, b]p_2 \cdots [a, b]p_m$, where a and b are integers ($0 \le a \le b$) that indicate the minimum and maximum gaps, respectively.

$L = \langle l_1, l_2, \ldots, l_m \rangle$ is an occurrence of pattern P in sequence S if and only if $1 \le l_1 < l_2 < \cdots < l_m \le n$ and $a \le l_j - l_{j-1} - 1 \le b$, where $s_{l_j} = p_j$ ($1 \le j \le m$ and $1 \le l_j \le n$). Suppose there is another occurrence $L' = \langle l'_1, l'_2, \ldots, l'_m \rangle$. L and L' are two nonoverlapping occurrences if and only if $l_j \ne l'_j$ for all $1 \le j \le m$.

The nonoverlapping support of pattern P in sequence S is represented by sup(P, S). If sup(P, S) is no less than the support threshold minsup, pattern P is called a frequent pattern.

In Example 1, $\Sigma$={A, T, C, G}, $|\Sigma|$=4, minsup=2, and len=[1,7]. The nonoverlapping occurrences of P in $s_1$ are ⟨1,3,4,5⟩ and ⟨5,6,7,8⟩, so sup(P, $s_1$)=2 ≥ minsup. Thus, if we mine the frequent patterns in $s_1$, P is a frequent pattern.

Suppose $L = \langle l_1, l_2, \ldots, l_m \rangle$ is an occurrence. If $minlen \le l_m - l_1 + 1 \le maxlen$, then L is an occurrence that satisfies the length constraints, where minlen and maxlen are the minimum and maximum length constraints, respectively.

Given pattern $P = p_1[a, b]p_2 \cdots [a, b]p_m$, and letters l and r. $Q = Pr = p_1[a, b]p_2 \cdots [a, b]p_m[a, b]r$ is defined as the right gap super-pattern of P, and P is the prefix sub-pattern of Q, i.e. prefix(Q)=P. Similarly, $R = lP = l[a, b]p_1[a, b]p_2 \cdots [a, b]p_m$ is defined as the left gap super-pattern of P, and P is the suffix sub-pattern of R, i.e. suffix(R)=P. If prefix(Q)=suffix(R)=P, R and Q can be connected into a super-pattern T with length m+2, where $T = R \oplus Q = lPr$. This process of generating the super-pattern from sub-patterns is called pattern growth [64].

Suppose pattern P=A[0, 2]T. Patterns R=G[0, 2]A[0, 2]T and Q=A[0, 2]T[0, 2]C are the left and right gap super-patterns of P, respectively. Therefore, patterns R and Q can be connected into super-pattern T with length 4, i.e. $T = R \oplus Q$ = G[0, 2]A[0, 2]T[0, 2]C.

Suppose P is a frequent pattern. P is a closed pattern if there is no super-pattern P' of P which satisfies sup(P')=sup(P); otherwise, P is an unclosed pattern (or a redundant pattern).

In Example 1, P=A[0, 2]T[0, 2]C[0, 2]A. One of its super-patterns is P'=A[0, 2]T[0, 2]C[0, 2]A[0, 2]G. The nonoverlapping occurrences of P and P' in $s_3$ are {⟨1,2,3,4⟩, ⟨4,7,8,9⟩} and {⟨1,2,3,4,5⟩, ⟨4,7,8,9,10⟩}, respectively, i.e. sup(P, $s_3$)=sup(P', $s_3$)=2. Hence, pattern P is a redundant pattern.
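The occurrence and nonoverlapping definitions above can be checked with a small brute-force enumeration. The following is only an illustrative sketch, not the paper's algorithm: the function names `all_occurrences` and `nonoverlapping` are hypothetical, and the pattern is assumed to be given as a plain string of letters that share one gap constraint [a, b]. Positions are reported 1-based, as in the examples.

```python
from typing import List, Tuple


def all_occurrences(pattern: str, seq: str, a: int, b: int) -> List[Tuple[int, ...]]:
    """Enumerate every occurrence of a periodic-gap pattern (1-based positions)."""
    results: List[Tuple[int, ...]] = []

    def extend(j: int, path: List[int]) -> None:
        # j is the index of the next pattern letter to match (0-based).
        if j == len(pattern):
            results.append(tuple(p + 1 for p in path))
            return
        if not path:
            candidates = range(len(seq))  # the first letter may start anywhere
        else:
            lo = path[-1] + 1 + a          # skip at least a letters
            hi = min(path[-1] + 1 + b, len(seq) - 1)  # skip at most b letters
            candidates = range(lo, hi + 1)
        for pos in candidates:
            if seq[pos] == pattern[j]:
                extend(j + 1, path + [pos])

    extend(0, [])
    return results


def nonoverlapping(x: Tuple[int, ...], y: Tuple[int, ...]) -> bool:
    """Two occurrences are nonoverlapping if they differ at every pattern position."""
    return all(p != q for p, q in zip(x, y))


if __name__ == "__main__":
    print(all_occurrences("ATCA", "AATCATCA", 0, 2))
    print(len(all_occurrences("AT", "AAATCCC", 0, 2)),
          len(all_occurrences("ATC", "AAATCCC", 0, 2)))
```

On AAATCCC the enumeration shows the super-pattern having more occurrences than its sub-pattern, which is the reason counting without any condition does not satisfy the Apriori property.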

Nonoverlapping closed SPM algorithm

Section 4.1 proposes the BackTr algorithm to calculate the pattern support. Section 4.2 introduces the principle of the pattern growth strategy to generate candidate patterns. Section 4.3 proposes three pruning strategies to determine closed patterns. We show the NetNCSP algorithm in Section 4.4.

Support calculating

Given a sequence and a pattern with gap constraints, all occurrences can be represented by a Nettree [65], which is an extended tree structure with multiple roots and multiple parents. Since nodes with the same label can appear on a Nettree multiple times, $n_j^i$ is used to represent node i in the j-th level. A path from a root to a leaf in the Nettree corresponds to an occurrence of the pattern in the sequence. Calculating the support of pattern P in sequence S with the nonoverlapping condition means that no Nettree node can be reused in the same level [34]. These properties make the Nettree well suited for representing the nonoverlapping occurrences of a pattern. In our previous work [34], NETGAP was proposed based on the Nettree; it is a complete method to calculate the nonoverlapping occurrences. However, the weakness of NETGAP is its lower efficiency, since NETGAP must prune the invalid nodes after obtaining each nonoverlapping occurrence. For this reason, we propose the BackTr algorithm, which is more efficient. Examples 6 and 7 illustrate the principles of NETGAP and BackTr, respectively.

Given pattern P=A[0,3]T[0,3]C[0,3]A and sequence s=ATACTTATTCGACA, a Nettree can be created as shown in Fig. 2. The first step of NETGAP is to prune the invalid nodes, which are $n_1^{12}$, $n_1^{14}$, and $n_2^{5}$. Then NETGAP selects the first root node $n_1^{1}$ and finds a root–leaf path employing the leftmost child strategy. In Fig. 2, it is easy to obtain the first root–leaf path $n_1^{1}$, $n_2^{2}$, $n_3^{4}$, $n_4^{7}$, marked in yellow. The corresponding occurrence is ⟨1,2,4,7⟩. Then, NETGAP deletes nodes $n_1^{1}$, $n_2^{2}$, $n_3^{4}$, and $n_4^{7}$. After that, NETGAP looks for invalid nodes in the new Nettree; there are none at that time. Then NETGAP obtains the second root–leaf path $n_1^{3}$, $n_2^{6}$, $n_3^{10}$, $n_4^{12}$, marked in red, and its corresponding occurrence is ⟨3,6,10,12⟩. After pruning nodes $n_1^{3}$, $n_2^{6}$, $n_3^{10}$, and $n_4^{12}$, NETGAP finds an invalid node $n_2^{8}$ and prunes it. Finally, NETGAP obtains the third occurrence ⟨7,9,13,14⟩, which is marked in blue.
Hence, NETGAP gets three nonoverlapping occurrences.

Fig. 2

A Nettree.

In this example, we use the same pattern and sequence as in Example 6. BackTr does not need to find and prune invalid nodes $n_1^{12}$, $n_1^{14}$, and $n_2^{5}$, and gets the first occurrence ⟨1,2,4,7⟩ directly. After that, BackTr selects the second root $n_1^{3}$ and finds its first child node $n_2^{5}$, which has no child node. In that case, the algorithm backtracks to node $n_1^{3}$ and finds its second child, which is node $n_2^{6}$. Thus, BackTr gets another occurrence ⟨3,6,10,12⟩. Similarly, BackTr finds the third occurrence ⟨7,9,13,14⟩. After that, there is no occurrence in the rest of the Nettree. Hence, BackTr gets the same three nonoverlapping occurrences as NETGAP.

From Examples 6 and 7, it can be concluded that the two algorithms employ different methods to find the same nonoverlapping occurrences. However, NETGAP needs to find and prune the invalid nodes three times, while BackTr does not need to prune these nodes, which reduces the time complexity. BackTr is given in Algorithm 1.

BackTr is complete. Our previous work showed that a complete algorithm should iteratively find the minimum occurrence [34]. BackTr iteratively selects the leftmost child starting from the minimum root to get the minimum occurrence. Hence, BackTr is complete.

In the worst case, the space and time complexities of BackTr are both O(m × n × w), and the average space and time complexities are O(m × n × w/r/r), where m, n, w, and r are the pattern length, the sequence length, the gap width b−a+1, and the number of items $|\Sigma|$, respectively. Each node in a Nettree is accessed at most once. Hence, the time complexity of BackTr is consistent with the space complexity of a Nettree. A Nettree has m levels, each level has a maximum of n nodes, and each node has a maximum of w children. Therefore, the space complexity of a Nettree is O(m × n × w) in the worst case, and the time complexity of BackTr is also O(m × n × w). On average, each level has n/r nodes and each node has w/r children. Hence, the average space and time complexities are O(m × n × w/r/r).
The average time complexity of the NETGAP algorithm is O(m × m × n × w/r/r) [34]. Hence, BackTr outperforms NETGAP.
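The backtracking idea described above can be sketched as follows. This is a minimal sketch under stated assumptions, not Algorithm 1 itself: it does not materialize a Nettree but scans the sequence on the fly, the name `backtr_occurrences` is hypothetical, and the pattern is assumed to be a non-empty string of letters sharing one gap [a, b]. A node that turns out to be a dead end is remembered in `blocked`, so it is never expanded again, which mirrors the claim that each node is accessed at most once.

```python
from typing import List, Set, Tuple


def backtr_occurrences(pattern: str, seq: str, a: int, b: int) -> List[Tuple[int, ...]]:
    """Greedy leftmost search with backtracking for nonoverlapping occurrences."""
    if not pattern:
        return []
    m, n = len(pattern), len(seq)
    # blocked[j] holds positions at level j that are already used by an
    # occurrence or are known to have no usable path down to a leaf.
    blocked: List[Set[int]] = [set() for _ in range(m)]
    found: List[Tuple[int, ...]] = []

    def descend(j: int, pos: int, path: List[int]) -> bool:
        path.append(pos)
        if j == m - 1:
            return True
        lo, hi = pos + 1 + a, min(pos + 1 + b, n - 1)
        for child in range(lo, hi + 1):  # leftmost child first
            if seq[child] == pattern[j + 1] and child not in blocked[j + 1]:
                if descend(j + 1, child, path):
                    return True
        blocked[j].add(pos)  # dead end: mark it and backtrack to the parent
        path.pop()
        return False

    for root in range(n):  # smallest root first
        if seq[root] != pattern[0]:
            continue
        path: List[int] = []
        if descend(0, root, path):
            for j, pos in enumerate(path):
                blocked[j].add(pos)  # nodes of an occurrence cannot be reused
            found.append(tuple(p + 1 for p in path))  # report 1-based positions
    return found


if __name__ == "__main__":
    print(backtr_occurrences("ATCA", "ATACTTATTCGACA", 0, 3))
```

Run on the pattern and sequence of Examples 6 and 7, the sketch returns the same three occurrences as the walkthrough, with no separate pruning pass.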

Candidate pattern generating

Traditional candidate pattern generation methods include the breadth-first and depth-first strategies. In this paper, the pattern growth strategy is used because it effectively reduces the generation of redundant candidate patterns. An example is as follows. In Example 1, with minsup=2 and gap=[0,3], there are nine frequent patterns in $s_3$ with length 2: {“AA”, “AT”, “AC”, “AG”, “TC”, “TA”, “TG”, “CA”, “CG”}. With the breadth-first or depth-first strategy, 9 × 4=36 candidate patterns with length 3 will be generated. On the other hand, since “TT” is not frequent, super-patterns such as “ATT” and “TTG” are also not frequent according to the Apriori property and can be pruned. According to the pattern growth strategy, there are 14 candidate patterns: {“AAA”, “AAT”, “AAG”, “AAC”, “CAA”, “CAT”, “CAC”, “CAG”, “TAA”, “TAT”, “TAC”, “TAG”, “TCA”, “TCG”}. This example illustrates that the pattern growth strategy generates fewer candidates than the breadth-first and depth-first strategies.

Closed pattern determining

In this subsection, we propose three pruning strategies to find closed patterns. Although BackTr reduces the time complexity of support calculation, the calculation is still expensive. Therefore, we propose an inheriting strategy to predict the closeness of a pattern. The unclosed patterns are pruned before their support is calculated using BackTr.

Inheriting

Given pattern $P = p_1[a, b]p_2 \cdots [a, b]p_m$, and letters l and r. If there is a right gap super-pattern Q=Pr which satisfies sup(Q)=sup(P), then P is called a right unclosed pattern. In the same way, if there is a left gap super-pattern R=lP and sup(R)=sup(P), P is called a left unclosed pattern. Otherwise, P is called a right or left closed pattern.

In Example 1, pattern T[0, 2]A has two nonoverlapping occurrences, ⟨2,4⟩ and ⟨7,9⟩, in $s_3$=ATCAGATCAG. By traversing the left gap of all occurrences, it can be found that there is a common letter “A”. Therefore, there are two nonoverlapping occurrences for super-pattern A[0, 2]T[0, 2]A in $s_3$, i.e. sup(A[0, 2]T[0, 2]A, $s_3$)=sup(T[0, 2]A, $s_3$)=2. Hence, T[0, 2]A is a left unclosed pattern. Similarly, there exists a letter “G” in the right gaps of all occurrences of T[0, 2]A, i.e. sup(T[0, 2]A[0, 2]G, $s_3$)=sup(T[0, 2]A, $s_3$)=2. Hence, T[0, 2]A is a right unclosed pattern. However, there are two nonoverlapping occurrences for pattern A[0,3]G in $s_3$, yet no common letter is found in the left and right gaps of these occurrences. Hence, A[0,3]G is a closed pattern.

If sub-pattern P is a left unclosed pattern in sequence S, then all its right super-patterns are also left unclosed patterns. Therefore, such a super-pattern can be safely pruned. The same strategy can be applied to the right side.

Suppose pattern P is a left unclosed pattern, which means there is a super-pattern R=lP whose support is the same as that of pattern P, i.e. sup(R, S)=sup(P, S). We will show that pattern Q=Pr has a super-pattern T=lQ=lPr whose support is the same as that of pattern Q, i.e. sup(T, S)=sup(Q, S). Suppose $\langle l_1, \ldots, l_m, l_{m+1} \rangle$ is an occurrence of pattern Q. Then $\langle l_1, \ldots, l_m \rangle$ is an occurrence of pattern P. Since P is a left unclosed pattern, i.e. sup(lP, S)=sup(P, S), we know that $\langle l_0, l_1, \ldots, l_m \rangle$ is an occurrence of pattern lP. Therefore, $\langle l_0, l_1, \ldots, l_m, l_{m+1} \rangle$ is an occurrence of pattern T. Thus, sup(T, S)=sup(Q, S). Hence, pattern Q is also a left unclosed pattern.

It should be noted that only the unclosed property can be inherited, not the closed property.
In $s_2$=AATGACTACTCAA, there are three nonoverlapping occurrences, ⟨3⟩, ⟨7⟩, and ⟨10⟩, for pattern “T”, i.e. sup(“T”, $s_2$)=3. Since there are also three nonoverlapping occurrences, ⟨3,5⟩, ⟨7,8⟩, and ⟨10,12⟩, for the right gap super-pattern “T[0, 2]A” of pattern “T”, i.e. sup(“T[0, 2]A”, $s_2$)=sup(“T”, $s_2$)=3, pattern “T” is a right unclosed pattern. In addition, since “T” is a right unclosed pattern, “A[0, 2]T” is also a right unclosed pattern according to Theorem 3. The verification is as follows. There are three nonoverlapping occurrences, ⟨1,3⟩, ⟨5,7⟩, and ⟨8,10⟩, for pattern “A[0, 2]T”, and three nonoverlapping occurrences, ⟨1,3,5⟩, ⟨5,7,8⟩, and ⟨8,10,12⟩, for pattern “A[0, 2]T[0, 2]A” in $s_2$, i.e. sup(“A[0, 2]T[0, 2]A”, $s_2$)=sup(“A[0, 2]T”, $s_2$)=3. Hence, pattern “A[0, 2]T” is a right unclosed pattern, which is consistent with Theorem 3.

If pattern P is either a left or a right unclosed pattern, then P is an unclosed pattern.

If pattern P is a left unclosed pattern, then there is a super-pattern R=lP which satisfies sup(P, S)=sup(R, S). Thus, according to Definition 6, pattern P is an unclosed pattern. The same strategy can be applied to the right side.

In Example 9, pattern “T” is a right unclosed pattern, since there exists a right gap super-pattern “T[0, 2]A” such that sup(“T[0, 2]A”, $s_2$)=sup(“T”, $s_2$)=3. In the following subsections, we propose two strategies to detect the closeness of the frequent patterns.
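The left and right unclosed definitions can be checked directly by comparing the support of a pattern with the supports of its one-letter extensions. The sketch below is an assumption-laden illustration, not the inheriting strategy itself (which avoids these support calculations): `support` is a compact greedy-backtracking counter written for this example, `unclosed_letters` is a hypothetical helper, and patterns are plain letter strings sharing one gap [a, b].

```python
from typing import List, Set, Tuple


def support(pattern: str, seq: str, a: int, b: int) -> int:
    """Nonoverlapping support via leftmost search with backtracking."""
    m, n = len(pattern), len(seq)
    if m == 0:
        return 0
    blocked: List[Set[int]] = [set() for _ in range(m)]

    def descend(j: int, pos: int) -> bool:
        # A visited node ends up either used or dead, so it is never revisited.
        blocked[j].add(pos)
        if j == m - 1:
            return True
        for child in range(pos + 1 + a, min(pos + 1 + b, n - 1) + 1):
            if (seq[child] == pattern[j + 1]
                    and child not in blocked[j + 1]
                    and descend(j + 1, child)):
                return True
        return False

    return sum(descend(0, r) for r in range(n) if seq[r] == pattern[0])


def unclosed_letters(pattern: str, seq: str, a: int, b: int,
                     alphabet: str) -> Tuple[int, List[str], List[str]]:
    """Return sup(P) and the letters l, r with sup(lP) == sup(P) or sup(Pr) == sup(P)."""
    base = support(pattern, seq, a, b)
    left = [x for x in alphabet if support(x + pattern, seq, a, b) == base]
    right = [x for x in alphabet if support(pattern + x, seq, a, b) == base]
    return base, left, right


if __name__ == "__main__":
    print(unclosed_letters("TA", "ATCAGATCAG", 0, 2, "ACGT"))
    print(unclosed_letters("T", "AATGACTACTCAA", 0, 2, "ACGT"))
    print(unclosed_letters("AT", "AATGACTACTCAA", 0, 2, "ACGT"))
```

A non-empty `left` or `right` list means the pattern is left or right unclosed, respectively. For “T” and “A[0, 2]T” with gap [0, 2], the letter “A” appears in `right` in both cases, in line with the inheritance of the right unclosed property.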

Predicting

Let pattern Q be the pattern with the highest support of all the patterns that can be connected with pattern P. If sup(P) > sup(Q), then P is a closed pattern.

Knowing that sup(P) > sup(Q), min(sup(P), sup(Q)) is sup(Q). Patterns P and Q can generate super-pattern R by pattern growth. According to the Apriori property, sup(R) ≤ min(sup(P), sup(Q)). Thus, sup(R) ≤ sup(Q) < sup(P). Therefore, there is no super-pattern R of P that satisfies sup(R)=sup(P). Hence, P is a closed pattern.

In Example 1, patterns “AA”, “AT”, “AG”, and “AC” are the frequent patterns with length 2 in sequence $s_3$. It is known that sup(“AA”)=3 and sup(“AC”)=sup(“AT”)=sup(“AG”)=2. The support of pattern “AA” is greater than that of the other patterns. According to Theorem 5, pattern “AA” is a closed pattern.

Determining

If P is both a left and a right closed pattern, then P is a closed pattern.

Suppose P is a left closed pattern, which means that there is no super-pattern lP that satisfies sup(lP)=sup(P). Similarly, suppose P is a right closed pattern, which means there is no super-pattern Pr that satisfies sup(Pr)=sup(P). Hence, P is a closed pattern according to Definition 6.

In Example 1, sup(“A”, $s_2$)=6, and there is no left or right gap super-pattern of pattern “A” that has the same support as pattern “A”. Thus, pattern “A” is a left closed pattern and a right closed pattern. Hence, pattern “A” is a closed pattern.

NetNCSP

In this subsection, we propose NetNCSP. At the beginning, NetNCSP traverses the sequence to find the frequent letters and stores them in the candidate set. In the following procedure, three pruning strategies are applied to check the closeness of each pattern P in the candidate set.

In the first step, if P is an unclosed pattern according to Theorem 3, then NetNCSP adds P to the temporary candidate set and restarts from the first step with another P in the candidate set. Otherwise, NetNCSP goes to the second step.

In the second step, BackTr calculates the nonoverlapping support of P and stores the result in sup. If sup is less than minsup, NetNCSP goes back to the first step with another P in the candidate set. Otherwise, NetNCSP adds P to the temporary candidate set and goes to the third step.

In the third step, NetNCSP checks the closeness of pattern P according to Theorems 5 and 6. If the pattern is closed, NetNCSP adds P to the nonoverlapping closed pattern set. After that, NetNCSP restarts from the first step with another P in the candidate set.

After all patterns in the candidate set have been traversed, NetNCSP employs the pattern growth strategy to generate the next candidate set from the temporary candidate set. NetNCSP stops when the candidate set is empty. All closed patterns are stored in the nonoverlapping closed pattern set.

The time complexity of NetNCSP is O(m × n × w × t) in the worst case and O(m × n × w × t/r/r) on average, where t is the number of runs of the BackTr algorithm. According to Theorem 1, the time complexity of BackTr is O(m × n × w) in the worst case and O(m × n × w/r/r) on average. Since BackTr runs t times, the time complexity of NetNCSP is O(m × n × w × t) and the average time complexity of NetNCSP is O(m × n × w × t/r/r).

The space complexity of NetNCSP, the same as that of BackTr, is O(m × n × w) in the worst case and O(m × n × w/r/r) in the average case. The space needed for the candidate patterns and the closed patterns can be neglected. Therefore, the space complexity of NetNCSP is the same as that of BackTr. According to Theorem 1, the space complexity of NetNCSP is O(m × n × w) in the worst case and O(m × n × w/r/r) in the average case.
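To make the overall flow concrete, the following sketch mines nonoverlapping closed patterns in a single sequence by brute force. It is a simplified stand-in rather than NetNCSP: it grows candidates level by level by joining patterns whose suffix and prefix agree, and it decides closeness straight from the definition by comparing supports of one-letter extensions, so none of the inheriting, predicting, or determining shortcuts are used and length constraints are ignored. The names `support` and `mine_closed` are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def support(pattern: str, seq: str, a: int, b: int) -> int:
    """Nonoverlapping support via leftmost search with backtracking."""
    m, n = len(pattern), len(seq)
    if m == 0:
        return 0
    blocked: List[Set[int]] = [set() for _ in range(m)]

    def descend(j: int, pos: int) -> bool:
        blocked[j].add(pos)  # visited nodes are either used or dead
        if j == m - 1:
            return True
        for child in range(pos + 1 + a, min(pos + 1 + b, n - 1) + 1):
            if (seq[child] == pattern[j + 1]
                    and child not in blocked[j + 1]
                    and descend(j + 1, child)):
                return True
        return False

    return sum(descend(0, r) for r in range(n) if seq[r] == pattern[0])


def mine_closed(seq: str, a: int, b: int, minsup: int) -> Dict[str, int]:
    """Return {closed pattern: support} for one sequence (brute-force sketch)."""
    alphabet = sorted(set(seq))

    def keep_frequent(patterns: Iterable[str]) -> Dict[str, int]:
        out: Dict[str, int] = {}
        for p in patterns:
            s = support(p, seq, a, b)
            if s >= minsup:
                out[p] = s
        return out

    frequent: Dict[str, int] = {}
    level = keep_frequent(alphabet)
    while level:
        frequent.update(level)
        # Pattern growth: join P and Q when P without its first letter
        # equals Q without its last letter.
        candidates = {p + q[-1] for p in level for q in level if p[1:] == q[:-1]}
        level = keep_frequent(sorted(candidates))

    closed: Dict[str, int] = {}
    for p, s in frequent.items():
        if all(support(x + p, seq, a, b) != s and support(p + x, seq, a, b) != s
               for x in alphabet):
            closed[p] = s
    return closed


if __name__ == "__main__":
    print(mine_closed("ABCABC", 0, 0, 2))
```

On the toy sequence ABCABC with gap [0, 0] and minsup=2, the six frequent patterns collapse into the single closed pattern “ABC” with support 2. The sketch recomputes many supports that the three pruning strategies are designed to avoid.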

Experimental analysis

Section 5.1 explains the benchmark datasets. Section 5.2 introduces the competitive algorithms. Section 5.3 shows the mining performance of different strategies, such as candidate generation strategies, pruning strategies, and support calculation strategies. Section 5.4 verifies the mining capability of the proposed algorithm and the competitive algorithms. Section 5.5 further reports a biological application to COVID-19. All experiments were conducted on a computer with an Intel Core i5 1.6 GHz CPU, 8 GB of 1600 MHz DDR3 memory, and the Mac OS (10.14.5) operating system. All the algorithms were developed in Visual Studio Code 1.36.1, which also served as the experimental environment.

Benchmark datasets

Table 2 explains the benchmark datasets used in the following experiments.

Table 2

Benchmark datasets.

Dataset	Type	From	Length
AX829174.1^a	DNA	Homo sapiens (human)	10,011
DNA1^b	DNA	Homo sapiens AL158070	6,000
DNA2	DNA	Homo sapiens AL158070	8,000
DNA3	DNA	Homo sapiens AL158070	10,000
DNA4	DNA	Homo sapiens AL158070	12,000
DNA5	DNA	Homo sapiens AL158070	14,000
DNA6	DNA	Homo sapiens AL158070	16,000
Potato_virus^c	Virus	Potato virus Y Wilga MV99	9,699
Ebola_virus^d	Virus	Reston Ebola virus	18,891
SARS-CoV-2^e	Virus	Severe acute respiratory syndrome coronavirus 2 (COVID-19)	29,903
SARS^f	Virus	Severe acute respiratory syndrome-related coronavirus	29,751

^a AX829174 is used in References [31] and [29] for mining frequent patterns and closed patterns, and can be downloaded from https://www.ncbi.nlm.nih.gov/nuccore/AX829174.1/.

^b The DNA1-6 databases are used in Reference [34], and can be downloaded from https://www.ncbi.nlm.nih.gov/nuccore/AL158070.11.

^c Potato_virus Y is used in Reference [52] to mine continuous closed patterns and closed patterns with gap constraints, and can be downloaded from https://www.ebi.ac.uk/ena/data/view/Taxon:1107954.

^d Ebola_virus is commonly used in biological sequence analysis and can be downloaded with Potato_virus from https://www.ebi.ac.uk/ena/data/view/Taxon:129003.

^e This sequence was reported in Reference [66], and can be downloaded from https://www.ncbi.nlm.nih.gov/nuccore/MN908947.

^f This sequence can be downloaded from https://www.ncbi.nlm.nih.gov/nuccore/30271926.


Baseline methods

NetNCSP-noinh and NetNCSP-nocheck: To verify the efficiency of the pruning strategies, NetNCSP-noinh and NetNCSP-nocheck are proposed. NetNCSP-noinh removes the inheriting strategy, while NetNCSP-nocheck removes the predicting and determining strategies and instead determines closed patterns directly from the definition.

NetNCSP-netgap: To analyze the effect of BackTr, NetNCSP-netgap is proposed, which applies the NETGAP strategy to mine the closed patterns.

NetNCSP-bf and NetNCSP-df: To analyze the effect of the pattern growth strategy, NetNCSP-bf and NetNCSP-df are proposed, which generate candidate patterns according to breadth-first and depth-first strategies, respectively.

NOSEP and CloGSgrow: To analyze the differences between NetNCSP and classical SPM algorithms, we employ NOSEP and CloGSgrow as competitive algorithms, which were proposed in References [33] and [34], respectively.

NetNCSP-nogap: To analyze the effect of the gap constraint, NetNCSP-nogap is proposed to mine continuous patterns without gaps.

The datasets and all algorithms can be downloaded from http://wuc.scse.hebut.edu.cn.
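To make the idea of determining closed patterns "according to the definition" concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the supports of all frequent patterns are already known, and that a super-pattern is formed by adding one character at either end of a pattern; both the function name `closed_by_definition` and the toy supports are hypothetical.

```python
def closed_by_definition(supports, alphabet):
    """Return the patterns that have no one-character extension with equal support.

    supports: dict mapping each frequent pattern (str) to its support (int).
    alphabet: iterable of single characters, e.g. "ACGT".
    Assumption: super-patterns are prefix or suffix extensions by one character.
    """
    closed = []
    for pattern, sup in supports.items():
        # Enumerate every one-character prefix and suffix extension.
        supers = [a + pattern for a in alphabet] + [pattern + a for a in alphabet]
        # A pattern is closed if no extension has exactly the same support.
        if all(supports.get(q) != sup for q in supers):
            closed.append(pattern)
    return closed


# Hypothetical toy supports, not taken from the paper's experiments.
toy = {"A": 5, "AC": 5, "C": 7, "ACG": 3}
print(closed_by_definition(toy, "ACGT"))
```

In this toy input, "A" is not closed because "AC" has the same support. A check of this kind needs the support of every extension, which is the cost that pruning before candidate generation is meant to avoid.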

Mining performance

In this subsection, we verify the mining performance of NetNCSP with different strategies. Five competitive algorithms are selected: NetNCSP-bf, NetNCSP-df, NetNCSP-noinh, NetNCSP-nocheck, and NetNCSP-netgap. We use five databases to carry out the experiments: DNA2, DNA4, DNA5, Potato_virus, and Ebola_virus. The parameters are len=[1,2000], gap=[0,200], and minsup=2000. The running time, the number of support calculations, and the number of closeness determinations are shown in Figs. 3, 4, and 5, respectively.

Fig. 3

Running time with different strategies.

Fig. 4

Support calculation times with different strategies.

Fig. 5

Closeness determination times with different strategies.

The results indicate the following observations.

The inheriting, predicting, and determining strategies are significantly effective. From Fig. 3, it is clear that NetNCSP is faster than NetNCSP-noinh and NetNCSP-nocheck: on Potato_virus, NetNCSP, NetNCSP-noinh, and NetNCSP-nocheck take 308.9, 403.3, and 1523.4 s, respectively. From Fig. 4, NetNCSP-noinh and NetNCSP-nocheck calculate support 1150 and 5460 times, respectively, while NetNCSP calculates it only 856 times. From Fig. 5, NetNCSP-noinh and NetNCSP-nocheck determine closeness 1364 and 5460 times, respectively, while NetNCSP needs only 916 determinations. The experiments verify that the inheriting, predicting, and determining strategies improve the mining efficiency significantly. Hence, NetNCSP outperforms NetNCSP-noinh and NetNCSP-nocheck.

The BackTr strategy is significantly effective. According to Figs. 3, 4, and 5, NetNCSP and NetNCSP-netgap perform the same number of closeness determinations and support calculations, but NetNCSP runs faster than NetNCSP-netgap. For example, in Fig. 3, the running times of NetNCSP and NetNCSP-netgap on Ebola_virus are 679.0 and 1140.3 s, respectively. The reason is that NetNCSP employs the BackTr strategy, which avoids a separate pass for pruning invalid nodes and thus reduces the time complexity from O(m × m × n × W) to O(m × n × W). Hence, NetNCSP outperforms NetNCSP-netgap.

The pattern growth strategy is significantly effective. From Fig. 3, we can see that NetNCSP is faster than NetNCSP-bf and NetNCSP-df. For example, NetNCSP takes 679.0 s on Ebola_virus, while NetNCSP-bf and NetNCSP-df take 1290.7 and 1293.4 s, respectively. The reason is that NetNCSP evaluates 940 candidate patterns, while NetNCSP-bf and NetNCSP-df each evaluate 1364. NetNCSP generates its candidates with the pattern growth strategy, whereas NetNCSP-bf and NetNCSP-df use the breadth-first and depth-first strategies, respectively. The pattern growth strategy is more effective than the other two, which is consistent with the analysis in Example 8. Hence, NetNCSP outperforms NetNCSP-bf and NetNCSP-df.

In conclusion, NetNCSP has better performance than all the competitive algorithms.
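As an illustration of what a backtracking support calculation does, the sketch below counts nonoverlapping occurrences of a pattern under a gap constraint. It is a simplified stand-in under stated assumptions, not the BackTr algorithm itself: it searches the sequence directly rather than a Nettree, ignores the len constraint, and makes no claim about reaching the complexity reported above. The function name `nonoverlapping_support` is hypothetical.

```python
def nonoverlapping_support(seq, pattern, min_gap, max_gap):
    """Greedy leftmost search with backtracking (illustrative sketch).

    Two occurrences are nonoverlapping if they never use the same sequence
    position for the same pattern position. The gap is the number of
    characters skipped between consecutive matched positions.
    """
    m = len(pattern)
    # used[j] holds the sequence positions already matched to pattern[j].
    used = [set() for _ in range(m)]

    def extend(j, prev):
        # Try to match pattern[j:] after pattern[j-1] was matched at prev.
        if j == m:
            return []
        lo = prev + 1 + min_gap
        hi = min(prev + 1 + max_gap, len(seq) - 1)
        for pos in range(lo, hi + 1):
            if seq[pos] == pattern[j] and pos not in used[j]:
                rest = extend(j + 1, pos)
                if rest is not None:
                    return [pos] + rest
                # Otherwise backtrack and try the next candidate position.
        return None

    count = 0
    for start in range(len(seq)):
        if seq[start] != pattern[0] or start in used[0]:
            continue
        rest = extend(1, start)
        if rest is not None:
            # Record the occurrence so later ones cannot reuse these slots.
            for j, pos in enumerate([start] + rest):
                used[j].add(pos)
            count += 1
    return count


print(nonoverlapping_support("AACC", "AC", 0, 2))
```

For the sequence "AACC" and pattern "AC" with gap=[0,2], the sketch finds the occurrences at positions (0,2) and (1,3); the second "A" cannot reuse position 2 for "C" because that slot is already taken.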

Mining capability

In this subsection, we carry out two experiments to verify the nonoverlapping closed pattern mining ability and the performance of NetNCSP. The first experiment uses the DNA1 database to mine patterns of lengths 2 to 7. The parameters are len=[1,1000], gap=[0,100], and minsup=1200. The results are shown in Figs. 6, 7, and 8.

Fig. 6

Pattern number with different pattern length.

Fig. 7

Running time and ATC with different pattern length.

Fig. 8

Support calculation times with different pattern length.

To compare the mining capability of the algorithms on different databases, the second experiment mines closed patterns of length 5 from six sequences: DNA1, DNA3, DNA6, AX829174, Potato_virus, and Ebola_virus. The parameters are len=[1,1000], gap=[0,100], and minsup=1200. The results are shown in Figs. 9, 10, and 11.

Fig. 9

Pattern number with different databases.

Fig. 10

Running time and ATC with different databases.

Fig. 11

Support calculation times with different databases.

The results indicate the following observations.

NetNCSP outperforms NOSEP since NetNCSP not only compresses the patterns effectively but also achieves higher efficiency. For example, in Fig. 6, when the pattern length is 7, NOSEP gets 1864 patterns, while NetNCSP gets 1188. The running times of NOSEP and NetNCSP are 431.1 and 351.8 s, respectively. Similar phenomena can be observed on different databases. For example, in Fig. 10, the running times of NOSEP and NetNCSP on AX829174 are 263.8 and 189.6 s, respectively. The reasons are as follows. First, NetNCSP adopts a more efficient backtracking strategy in calculating pattern support, which reduces the time complexity from O(m × m × n × W) to O(m × n × W). More importantly, NetNCSP adopts three effective pruning strategies. The predicting and inheriting strategies prune the redundant unclosed patterns before the support of candidate patterns is calculated, and the determining strategy determines the closeness of frequent patterns without generating longer candidate patterns. Hence, NetNCSP outperforms NOSEP.

NetNCSP outperforms CloGSgrow since NetNCSP mines more closed patterns and is more efficient in terms of ATC (Average mining Time per Closed pattern). For example, in Fig. 6, when the pattern length is 5, NetNCSP mines 282 closed patterns and CloGSgrow gets only 15. Although Fig. 7 shows that, at pattern length 5, the running times of NetNCSP and CloGSgrow are 64.1 and 5.6 s, respectively, the ATCs of NetNCSP and CloGSgrow are 64.1/282=0.2 s and 5.6/15=0.4 s, respectively. Similar phenomena can be observed on different databases. For example, in Figs. 9 and 10, CloGSgrow takes 83.3 s to get 33 closed patterns in Ebola_virus, while NetNCSP takes 453.8 s to get 912. The ATCs of CloGSgrow and NetNCSP are 2.5 and 0.5 s, respectively. The reasons are as follows. First, NetNCSP, as a complete algorithm, mines the complete set of closed patterns, so more closed patterns can be discovered. Second, NetNCSP adopts three pruning strategies to reduce the running time effectively. As a result, NetNCSP outperforms CloGSgrow.

NetNCSP can compress the patterns effectively. For example, in Fig. 6, NOSEP and NetNCSP find 399 frequent patterns and 282 closed patterns with length 5, respectively. Thus, the compression rate is (399-282)/399=29.3%. In Fig. 9, NOSEP and NetNCSP mine 1364 frequent patterns and 912 closed patterns in Ebola_virus, respectively. Thus, the compression rate is (1364-912)/1364=33.1%. The reason is that NOSEP mines the complete set of frequent patterns, which contains redundant patterns, while NetNCSP mines only the closed patterns, a subset of the frequent patterns, using the three pruning strategies for closeness determination.

NetNCSP reduces the number of support calculations effectively. For example, Fig. 8 shows that when mining patterns with length 7, NetNCSP calculates support 1443 times, while NOSEP needs 2134 times. Similar phenomena can be found on different databases. For example, Fig. 11 shows that NetNCSP calculates support 932 times in Ebola_virus, while NOSEP needs 1364 times. The reason is that NetNCSP adopts the predicting and inheriting strategies to prune the redundant patterns, which reduces the number of support calculations.

In conclusion, NetNCSP compresses the patterns effectively and outperforms the competitive algorithms.
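The ATC and compression-rate figures quoted above follow from two simple ratios. The short sketch below recomputes them from the numbers reported in the text; the function names `atc` and `compression_rate` are hypothetical and introduced only for illustration.

```python
def atc(running_time_s, closed_patterns):
    """Average mining Time per Closed pattern, in seconds."""
    return running_time_s / closed_patterns


def compression_rate(frequent_patterns, closed_patterns):
    """Fraction of frequent patterns removed by keeping only closed ones."""
    return (frequent_patterns - closed_patterns) / frequent_patterns


# Values as reported in the text for DNA1 (pattern length 5) and Ebola_virus.
print(round(atc(64.1, 282), 1))    # NetNCSP on DNA1
print(round(atc(5.6, 15), 1))      # CloGSgrow on DNA1
print(round(atc(453.8, 912), 1))   # NetNCSP on Ebola_virus
print(round(atc(83.3, 33), 1))     # CloGSgrow on Ebola_virus
print(round(100 * compression_rate(399, 282), 1))   # DNA1, percent
print(round(100 * compression_rate(1364, 912), 1))  # Ebola_virus, percent
```

ATC normalizes running time by output size, which is why CloGSgrow can finish sooner in absolute terms yet still spend more time per closed pattern found.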

Biological application

Recently, a severe respiratory syndrome named COVID-19 has spread around the world, and over 300,000 people worldwide have been identified with the disease. COVID-19 is caused by the SARS-CoV-2 virus [66], which is reported to be closely related to a group of SARS-like coronaviruses (with 89.1% nucleotide similarity). In this subsection, we use the proposed algorithm to study the similarities between the SARS-CoV-2 and SARS viruses. NetNCSP is employed to mine closed frequent patterns from both. We set the parameters to len=[1,2000], gap=[0,200], and minsup=2000 to mine patterns of lengths 2 to 10. The comparison of the numbers of closed patterns is reported in Fig. 12.

Fig. 12

Comparison of closed pattern number in different databases.

The results show the following. When the pattern length is less than 8, the numbers of closed patterns discovered from SARS-CoV-2 and SARS are almost the same, while for pattern lengths of 8 or more, the numbers differ significantly. For example, in Fig. 12, when the pattern length is 4, 18 closed patterns are discovered from each of SARS-CoV-2 and SARS. When the pattern length is 9, the numbers of closed patterns are 270 and 370, respectively. A possible explanation is that the basic pattern composition of SARS-CoV-2 and SARS is the same, so the shorter patterns are nearly the same, while the longer ones diverge because they combine those patterns differently.

Conclusion

In this paper, we tackle the problem of nonoverlapping periodic gapped closed SPM and propose an effective closed pattern mining algorithm, named NetNCSP, which has two major steps: support calculation and closeness determination. In support calculation, NetNCSP adopts the BackTr algorithm, which uses a backtracking strategy to calculate the nonoverlapping support of patterns in a Nettree. In closeness determination, NetNCSP adopts three pruning strategies: inheriting, predicting, and determining. The inheriting strategy avoids invalid calculations on redundant patterns. The predicting strategy reduces the number of candidate patterns effectively. The determining strategy determines the closeness of the frequent patterns and prunes the redundant patterns. NetNCSP has lower time complexity than the state-of-the-art algorithms. Experiments are carried out on long sequence databases, such as DNA and virus sequences. The experimental results show that NetNCSP has better performance than the competitive algorithms and compresses the frequent patterns effectively. In the biological application, we employ NetNCSP to mine the SARS-CoV-2 and SARS virus sequences. The results show that the two viruses are of similar pattern composition with different combinations.

In this paper, we focus on nonoverlapping closed sequential pattern mining in a long sequence. The experimental results show that NetNCSP can compress the frequent patterns effectively in long sequences with a small item set, such as DNA or virus sequences. However, we notice that NetNCSP cannot compress the frequent patterns in a short sequence with a large item set, such as a protein sequence, and that NetNCSP is more suitable when the gap constraints are large. Hence, generator mining and maximal pattern mining can be explored in the future.

CRediT authorship contribution statement

Youxi Wu: Conceptualization, Methodology, Formal analysis, Supervision, Funding acquisition. Changrui Zhu: Software, Writing - original draft, Validation, Investigation, Data curation. Yan Li: Investigation, Writing - review & editing. Lei Guo: Validation, Resources. Xindong Wu: Supervision, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

7 in total

1. e-RNSP: An Efficient Method for Mining Repetition Negative Sequential Patterns.

Authors: Xiangjun Dong; Yongshun Gong; Longbing Cao
Journal: IEEE Trans Cybern Date: 2018-10-05 Impact factor: 11.448

2. Mining Top- k Useful Negative Sequential Patterns via Learning.

Authors: Xiangjun Dong; Ping Qiu; Jinhu Lu; Longbing Cao; Tiantian Xu
Journal: IEEE Trans Neural Netw Learn Syst Date: 2019-01-10 Impact factor: 10.451

3. PMBC: pattern mining from biological sequences with wildcard constraints.

Authors: Xindong Wu; Xingquan Zhu; Yu He; Abdullah N Arslan
Journal: Comput Biol Med Date: 2013-03-16 Impact factor: 4.589

4. NOSEP: Nonoverlapping Sequence Pattern Mining With Gap Constraints.

Authors:
Journal: IEEE Trans Cybern Date: 2017-09-28 Impact factor: 11.448

5. Septic shock prediction for ICU patients via coupled HMM walking on sequential contrast patterns.

Authors: Shameek Ghosh; Jinyan Li; Longbing Cao; Kotagiri Ramamohanarao
Journal: J Biomed Inform Date: 2016-12-21 Impact factor: 6.317

6. Diagnosis of coronary artery disease using an efficient hash table based closed frequent itemsets mining.

Authors: Ramesh Dhanaseelan; M Jeya Sutha
Journal: Med Biol Eng Comput Date: 2017-09-14 Impact factor: 2.602

7. A new coronavirus associated with human respiratory disease in China.

Authors: Fan Wu; Su Zhao; Bin Yu; Yan-Mei Chen; Wen Wang; Zhi-Gang Song; Yi Hu; Zhao-Wu Tao; Jun-Hua Tian; Yuan-Yuan Pei; Ming-Li Yuan; Yu-Ling Zhang; Fa-Hui Dai; Yi Liu; Qi-Min Wang; Jiao-Jiao Zheng; Lin Xu; Edward C Holmes; Yong-Zhen Zhang
Journal: Nature Date: 2020-02-03 Impact factor: 49.962
