
NetNMSP: Nonoverlapping maximal sequential pattern mining.

Yan Li¹, Shuai Zhang², Lei Guo³, Jing Liu², Youxi Wu^2,4, Xindong Wu^5,6.

Abstract

Nonoverlapping sequential pattern mining, a kind of repetitive sequential pattern mining with gap constraints, can find more valuable patterns. Traditional algorithms focus on finding all frequent patterns and therefore find many redundant short patterns, which not only reduces the mining efficiency but also makes it harder to obtain the required information. To reduce the number of frequent patterns while retaining their expressive ability, this paper focuses on Nonoverlapping Maximal Sequential Pattern (NMSP) mining, which refers to finding frequent patterns whose super-patterns are infrequent. We propose an effective mining algorithm, Nettree for NMSP mining (NetNMSP), which has three key steps: calculating the support, generating the candidate patterns, and determining NMSPs. To calculate the support efficiently, NetNMSP employs the backtracking strategy to obtain a nonoverlapping occurrence from the leftmost leaf to its root with the leftmost parent node method in a Nettree. To reduce the candidate patterns, NetNMSP generates candidate patterns by the pattern join strategy. Furthermore, to determine NMSPs, NetNMSP adopts the screening method. Experiments on biological sequence datasets verify that NetNMSP outperforms the state-of-the-art algorithms and that NMSP mining has better compression performance than closed pattern mining. On sales datasets, we validate that our algorithm has the best scalability on large-scale datasets. Moreover, we mine NMSPs and frequent patterns in SARS-CoV-1, SARS-CoV-2 and MERS-CoV. The results show that the three viruses are similar in the short patterns but different in the long patterns. More importantly, NMSP mining makes it easier to find the differences between the virus sequences.


Keywords: Backtracking strategy; COVID-19; Gap constraint; MERS-CoV; Maximal pattern mining; Nonoverlapping pattern mining; Sequential pattern mining

Year: 2022 PMID: 35035093 PMCID: PMC8743106 DOI: 10.1007/s10489-021-02912-3

Source DB: PubMed Journal: Appl Intell (Dordr) ISSN: 0924-669X Impact factor: 5.019

Introduction

Sequential pattern mining [1, 2], as an important part of data mining [3, 4] and knowledge discovery [5], aims to mine frequent subsequences whose support is no lower than a given threshold. Many kinds of sequential pattern mining methods have been proposed, such as outlying pattern mining [6], maximal sequential pattern mining [7-9], three-way sequential pattern mining [10-13], negative sequential pattern mining [14, 15], periodic pattern mining [16, 17], co-location pattern mining [18], contrast subspace mining [19-22], closed sequential pattern mining [23], utility pattern mining [24-29], and repetitive sequential pattern mining [30, 31]. Traditional sequential pattern mining neglects the number of occurrences of a pattern in a sequence, while repetitive sequential pattern mining does not [32]. Thus, repetitive sequential pattern mining can find more patterns. However, some of them are meaningless. To solve this issue, sequential pattern mining with gap constraints, which builds on repetitive sequential pattern mining, was proposed to find frequent subsequences (known as patterns) that do not have to be continuous. Such a pattern can be represented as $P = p_1[min_1, max_1]p_2 \cdots p_{m-1}[min_{m-1}, max_{m-1}]p_m$, where $min_j$ and $max_j$, called a group of gap constraints, are the minimal and maximal numbers of wildcards between $p_j$ and $p_{j+1}$, respectively [33, 34]. In sequential pattern mining, the gap constraints are often the same, such as $P = p_1[a,b]p_2 \cdots [a,b]p_m$ or $P = p_1p_2 \cdots p_m$ with gap = [a,b]. This mining method is also called periodic sequential pattern mining [35] and has been applied in many applications, such as time series analysis [36, 37] and feature selection [38, 39]. Nonoverlapping sequential pattern mining (or sequential pattern mining under the nonoverlapping condition) [40, 41] is a kind of sequential pattern mining with gap constraints, which means that each item $s_i$ in a sequence can be matched by an item $p_j$ at most once [42].
Recent studies showed that nonoverlapping sequential pattern mining is a complete mining method with the Apriori property [30]. Importantly, compared with other state-of-the-art methods, it finds valuable frequent patterns more easily and solves the problems of under-expression and over-expression of patterns [30]. Example 1 illustrates the nonoverlapping support, which is a key issue in nonoverlapping sequential pattern mining.

Example 1

Suppose we have a sequence S1 = s1s2s3s4s5 = ACACA and a pattern P = p1[0,2]p2[0,2]p3 = A[0,2]C[0,2]A (or P = p1p2p3 = ACA with gap = [0,2]). All occurrences are shown in Fig. 1.

Fig. 1

All occurrences of pattern P in sequence S1

In this example, pattern P is a pattern with gap constraints, which has four occurrences in sequence S1. Subsequence s1s2s3 is an occurrence of pattern P, which can be written as <1, 2, 3>. Similarly, the other three occurrences are <1, 2, 5>, <1, 4, 5>, and <3, 4, 5>. Among them, <1, 2, 3> and <1, 2, 5> are two overlapping occurrences, since s1 is matched by p1 in both occurrences, while <1, 2, 3> and <3, 4, 5> are two nonoverlapping occurrences, since s3 is matched by p3 and p1 in these two occurrences, respectively. Therefore, the nonoverlapping support of pattern P in sequence S1 is 2.

Traditional nonoverlapping sequential pattern mining algorithms mainly focused on discovering all frequent patterns [30]. However, one disadvantage is that the mined pattern set is large and contains too many redundant short patterns. To reduce redundant patterns, nonoverlapping closed sequential pattern mining was proposed [23], in which a closed pattern is a frequent pattern that has no super-pattern with the same support. This method can effectively compress frequent patterns in a sequence with a large gap, but it is less effective in a multiple-sequence dataset or in a sequence with a small gap. Nonoverlapping Maximal Sequential Pattern (NMSP) mining is another method to reduce redundant patterns, in which all super-patterns of a maximal pattern are infrequent. Example 2 shows that maximal pattern mining has better compression performance than closed pattern mining.

Example 2

Given sequence database SDB = {S1 = ACACA, S2 = CACAC}, gap constraint gap = [0,2], and support threshold minsup = 2. Let us consider S1 first. The supports of patterns “AC”, “CA”, and “ACA” in sequence S1 are all 2. Therefore, all three patterns are frequent patterns and pattern “ACA” is a closed pattern. Thus, patterns “AC” and “CA” are compressed. Pattern “A” is a closed pattern, since the supports of its super-patterns “AA”, “AC”, and “CA” are all 2, which is less than the support of pattern “A”. Therefore, pattern “A” cannot be compressed in closed pattern mining. However, pattern “A” can be compressed in NMSP mining, since its super-patterns are frequent patterns. Hence, from Table 1, NMSP mining has better compression performance in S1.

Table 1

Comparison of mining results

Pattern type	Pattern set	Count
Frequent patterns in S₁	{A,C,AA,AC,CA,ACA}	6
Closed patterns in S₁	{A,AA,ACA}	3
NMSPs in S₁	{AA,ACA }	2
Frequent patterns in SDB	{A,C,AA,AC,CA,CC,AAC,ACA,ACC,CAA,CAC,CCA,ACAC,CACA}	14
Closed patterns in SDB	{A,C,AA,AC,CA,CC,AAC,ACA,ACC,CAA,CAC,CCA,ACAC,CACA}	14
NMSPs in SDB	{AAC,ACC,CAA,CCA,ACAC,CACA }	6

Moreover, if we mine nonoverlapping closed sequential patterns in SDB, we notice that no frequent pattern has the same support as any of its super-patterns. Thus, all the frequent patterns are closed patterns, which means that closed pattern mining cannot compress the frequent patterns in SDB. However, Table 1 shows that there are only 6 maximal patterns in SDB. Hence, NMSP mining also has better compression performance in SDB. The comparison between frequent patterns and NMSPs is shown in Fig. 2.

Fig. 2

Comparison between frequent patterns and NMSPs in Example 2. All nodes are frequent patterns, while green nodes are NMSPs

In Fig. 2, we can see that NMSP mining effectively compresses the frequent patterns, since all sub-patterns of NMSPs are frequent patterns. Hence, NMSPs can be seen as the boundary between frequent and infrequent patterns, since all their sub-patterns are frequent and all their super-patterns are infrequent. More importantly, in bioinformatics, homologous viruses have high similarity. If we mine frequent patterns in two homologous viruses, most frequent patterns are the same. Since researchers pay more attention to functional and pathogenic divergence, the common patterns are redundant. NMSPs not only represent all frequent patterns but also effectively prune the common redundant patterns, which makes the differences between two viruses easier to find. To verify this claim, we mine the NMSPs and frequent patterns in SARS-CoV-1, SARS-CoV-2 and MERS-CoV. Experimental results show that NMSP mining makes it easier to find the differences between the virus sequences.

The main contributions are as follows.

1. To compress frequent nonoverlapping patterns in SDB and provide the boundary information between frequent and infrequent patterns, we address NMSP mining and propose an effective mining algorithm, called Nettree for NMSP mining (NetNMSP).
2. To calculate the nonoverlapping support effectively, NetNMSP adopts the backtracking strategy to iteratively search the leftmost parent nodes to get the nonoverlapping occurrences by the Netback algorithm. Meanwhile, NetNMSP employs the pattern join strategy to generate candidate patterns and the screening method to find NMSPs.
3. Experiments on biological sequences verify that NetNMSP has better performance than other competitive algorithms and that maximal pattern mining has better compression performance than closed pattern mining. We mine NMSPs and frequent patterns in SARS-CoV-1, SARS-CoV-2 and MERS-CoV. The results show that the three viruses are similar in short patterns but different in long patterns. More importantly, NMSP mining makes it easier to find the differences between the virus sequences.

The rest of this paper is organized as follows. Section 2 introduces the related work. Section 3 gives the definitions of the problem. Section 4 proposes NetNMSP, which employs the backtracking strategy to calculate the nonoverlapping support, the pattern join strategy to generate candidate patterns, and the screening method to find NMSPs. Section 5 reports experimental results and analyzes the performance of NetNMSP on biological and sales sequences. Section 6 concludes this paper.

Related work

Sequential pattern mining has been widely applied in many fields [43, 44], such as ICU patient risk prediction [45, 46] and public reaction analysis on Twitter [47]. A variety of mining methods have been investigated [48]. Some methods are derived from different data types, such as the mining of event logs [49] and transaction databases [50]. Meanwhile, some methods are derived from different tasks, such as negative sequential pattern mining [14, 15], frequent pattern discovery with tri-partition alphabets [51], utility pattern mining [52-57], and contrast pattern mining [58, 59]. Moreover, some methods are derived from different characteristics of patterns, such as frequent pattern mining [60], rare pattern mining [61], top-k pattern mining [16], closed pattern mining [62], and maximal pattern mining [63, 64]. To consider the number of occurrences of a pattern in an SDB, sequential pattern mining with gap constraints was proposed, which can be divided into three types: no-condition [16, 65, 66], the one-off condition [67-69], and the nonoverlapping condition [30, 40]. The comparison of the occurrences of pattern P = ACA with gap constraint gap = [0,2] in sequence S = ACACA under different conditions is shown in Table 2.

Table 2

Comparison of occurrences under different conditions

Condition	Support	Occurrences
No-condition	4	< 1, 2, 3 >, < 1, 2, 5 >, < 1, 4, 5 >, and < 3, 4, 5 >
One-off condition	1	< 1, 2, 3 >
Nonoverlapping condition	2	< 1, 2, 3 > and < 3, 4, 5 >

As shown in Table 2, no-condition means that, when the gap constraint is satisfied, each item of the sequence can be matched multiple times by positions of the pattern. In brief, all occurrences are acceptable. Thus, there are 4 occurrences under no-condition, i.e. {<1, 2, 3>, <1, 2, 5>, <1, 4, 5>, <3, 4, 5>}. It is easy to calculate the support under no-condition. However, neither its support nor its support ratio is monotonic. Thus, an Apriori-like property has to be employed to mine all frequent patterns, which enlarges the search space. The one-off condition means that each item of the sequence can be matched at most once. Thus, there is only one occurrence under the one-off condition, i.e. <1, 2, 3>. Although this method satisfies the Apriori property, calculating the support under the one-off condition is an NP-hard problem. Therefore, it is a kind of approximate mining method. As mentioned in the Introduction, there are 2 occurrences under the nonoverlapping condition, i.e. {<1, 2, 3>, <3, 4, 5>}. Wu et al. [42] proved that calculating the support under the nonoverlapping condition can be done in polynomial time. Wu et al. [30] showed that nonoverlapping sequential pattern mining satisfies the Apriori property and proposed an effective algorithm, NOSEP. Hence, nonoverlapping sequential pattern mining achieves a balance between mining completeness and mining efficiency. Table 3 shows a comparison of the related studies.

Table 3

Comparison of related studies

Literature	Type of condition	Gap constraint	Support	Mining Type	Pruning strategy
Yun et al.[63]	Weighted	No	Exact	Maximal	Weight
Lee et al.[64]	Weighted	No	Approximate	Maximal	Anti-monotone
Min et al.[51]	No-Condition	Yes	Exact	All	Apriori
Li et al.[65]	No-Condition	Yes	Exact	Long	Apriori-like
Lam et al.[67]	One-off Condition	Yes	Approximate	Closed	Apriori
Wu et al.[30]	Nonoverlapping Condition	Yes	Exact	All	Apriori
Wu et al. [23]	Nonoverlapping Condition	Yes	Exact	Closed	Apriori
This paper	Nonoverlapping Condition	Yes	Exact	Maximal	Apriori

From Table 3, the works of [30] and [23] are the most closely related. The differences between these studies and this paper are as follows. First, the problems to be solved are different. Wu et al. [30] focused on mining all frequent patterns, while the work of [23] mined all closed patterns to compress the frequent patterns; this paper mines maximal patterns. Second, the methods of calculating the pattern support are different. NETGAP [30] needs to find and prune the invalid nodes after obtaining each nonoverlapping occurrence. Therefore, the time complexity of NETGAP is O(m × m × n × w/r/r), where m, n, w and r are pattern length, sequence length, gap width and item number, respectively. BackTr [23] employed a backtracking strategy to find the minimal occurrence in the Nettree and did not need to find and prune the invalid nodes. Thus, the time complexity of BackTr is O(m × n × w/r/r). In this paper, we propose Netback to calculate supports, which adopts the backtracking strategy to find the maximal occurrence and does not need to find and prune the invalid nodes. Hence, the time complexity of Netback is also O(m × n × w/r/r), which is the same as that of BackTr and lower than that of NETGAP. The difference between BackTr and Netback is that BackTr iteratively finds the minimal occurrences, while Netback iteratively finds the maximal occurrences.

Problem definition

Definition 1

(Sequence and Sequence Database) A sequence can be expressed as $S = s_1 s_2 \cdots s_n$ (n > 0), which consists of n items arranged in order, where $s_i \in \Sigma$ (1 ≤ i ≤ n), Σ denotes the set of items, and its size is |Σ|. A sequence database that contains T sequences can be expressed as SDB = $\{S_1, S_2, \cdots, S_T\}$.

Definition 2

(Pattern with Periodic Gap Constraint) The pattern with periodic gap constraint can be described as $P = p_1[a,b]p_2[a,b] \cdots p_{m-1}[a,b]p_m$ (m > 0) (or $P = p_1p_2 \cdots p_{m-1}p_m$ with gap = [a,b]), where a and b are the minimal and maximal numbers of wildcards between $p_i$ and $p_{i+1}$ (0 < i < m). A wildcard can represent any item in Σ.

Definition 3

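(Occurrence) Suppose we have a sequence $S = s_1 s_2 \cdots s_n$ and a pattern $P = p_1p_2 \cdots p_{m-1}p_m$ with gap = [a,b]. If $s_{j_1} = p_1, s_{j_2} = p_2, \cdots, s_{j_m} = p_m$ ($0 < j_1 < j_2 < \cdots < j_m \le n$) and $a \le j_{i+1} - j_i - 1 \le b$ (0 < i < m), then occ = $\langle j_1, j_2, \cdots, j_m \rangle$ is an occurrence of pattern P in sequence S.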

Example 3

Suppose we have a sequence S1 = s1s2s3s4s5 = ACACA and a pattern P = p1[0,2]p2[0,2]p3 = A[0,2]C[0,2]A. There are four occurrences {<1, 2, 3>, <1, 2, 5>, <1, 4, 5>, <3, 4, 5>} of pattern P in sequence S1. If gap = [0,1], then <1, 2, 5> and <1, 4, 5> do not satisfy the gap constraint, since 5-2-1 = 2 > 1 and 4-1-1 = 2 > 1.
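Definition 3 and Example 3 can be checked with a short enumeration. The following Python sketch is illustrative only and is not part of the algorithms in this paper; the function name `all_occurrences` and the 1-based position tuples are assumptions made for the example. It lists every occurrence that satisfies the gap constraint, without applying the nonoverlapping condition.

```python
def all_occurrences(seq, pattern, gap):
    """Enumerate all occurrences (1-based position tuples) of `pattern` in `seq`.

    Hypothetical helper: it applies only the gap constraint of Definition 3,
    i.e. a <= j_{i+1} - j_i - 1 <= b between consecutive matched positions.
    """
    a, b = gap
    n, m = len(seq), len(pattern)
    found = []

    def extend(prefix):
        if len(prefix) == m:
            found.append(tuple(prefix))
            return
        i = len(prefix)  # index of the next pattern item to match
        if i == 0:
            candidates = range(1, n + 1)
        else:
            last = prefix[-1]
            candidates = range(last + a + 1, min(n, last + b + 1) + 1)
        for j in candidates:
            if seq[j - 1] == pattern[i]:
                extend(prefix + [j])

    extend([])
    return found


print(all_occurrences("ACACA", "ACA", (0, 2)))
print(all_occurrences("ACACA", "ACA", (0, 1)))
```

With gap = [0,2] the sketch returns the four occurrences of Example 3, and with gap = [0,1] only <1, 2, 3> and <3, 4, 5> remain.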

Definition 4

(Nonoverlapping Occurrence and Support) Suppose occ = $\langle j_1, j_2, \cdots, j_m \rangle$ and occ' = $\langle k_1, k_2, \cdots, k_m \rangle$ are two occurrences. If and only if $j_i \ne k_i$ for all i (1 ≤ i ≤ m), then occ and occ' are two nonoverlapping occurrences. If any two occurrences in a set are nonoverlapping, then the set is called a nonoverlapping occurrence set. The size of the maximum nonoverlapping occurrence set is the support of pattern P in sequence S under the nonoverlapping condition, which is denoted by sup(P,S). The support of pattern P in SDB = $\{S_1, S_2, \cdots, S_T\}$ is $sup(P, SDB) = \sum_{i=1}^{T} sup(P, S_i)$.

Definition 5

(Frequent Pattern, FP) If sup(P,SDB) is no less than the minimum support threshold, then pattern P is called a frequent pattern, abbreviated as FP.

Example 4

Suppose there is a sequence database SDB = {S1 = ACACA, S2 = CACAC}. Given a pattern P = ACA with gap = [0,2], we know that sup(P,S1) = 2, since the maximum nonoverlapping occurrence set of pattern P in sequence S1 is {<1, 2, 3>, <3, 4, 5>}. Similarly, sup(P,S2) = 1. Hence, sup(P,SDB) = sup(P,S1) + sup(P,S2) = 2 + 1 = 3. Suppose the minimum support threshold is minsup = 2. Pattern P is a frequent pattern in sequence database SDB, since sup(P,SDB) = 3 > 2.

Definition 6

(NMSP) If pattern P is a frequent pattern and all its super-patterns are infrequent, then pattern P is an NMSP.

When the sequence database, gap constraints, and support threshold are given, our problem is to find all NMSPs. For example, when SDB = {S1 = ACACA, S2 = CACAC}, gap = [0,2], and minsup = 2 are given, the NMSPs are {AAC, ACC, CAA, CCA, ACAC, CACA}.

NetNMSP

In NMSP mining, three key factors affect the mining performance: calculating the pattern support, pruning candidate patterns, and determining NMSPs. Therefore, Section 4.1 proposes the Netback algorithm, which employs the backtracking strategy to calculate the support of a pattern in a Nettree. Section 4.2 employs the pattern join strategy to generate candidate patterns. Section 4.3 adopts the screening method to find NMSPs. The NetNMSP algorithm is proposed in Section 4.4.

Netback

In this section, we review the related concepts of Nettree [66] at first. Then, we introduce the principles of NETGAP [30] and BackTr [23]. Finally, we propose algorithm Netback to calculate the support. Obviously, the nonoverlapping occurrences are a subset of all occurrences which can be expressed by a Nettree.

Definition 7

(Nettree) A Nettree is an extended tree structure, which has one or more root nodes. In a Nettree, a node can have more than one parent node. Some nodes, located in different levels, may have the same node ID. $n_j^i$ represents node j in the i-th level [70].

Definition 8

(Root-Leaf Path, RLP) Suppose a Nettree has m levels. A node in the first level is a root node, and a node in the m-th level is a leaf node. A path from a root node to a leaf node in a Nettree is a root-leaf path (RLP) [30]. Example 5 shows that all occurrences can be expressed by RLPs in a Nettree.

Example 5

Suppose we have a sequence S = s1s2s3s4s5s6s7s8s9s10s11s12s13 = ATTCATCACATCA and a pattern P = p1[0,3]p2[0,3]p3[0,3]p4 = A[0,3]T[0,3]C[0,3]A. We create root $n_1^1$, since s1 = ‘A’ = p1. Using this method, the other roots $n_5^1$, $n_8^1$, $n_{10}^1$, and $n_{13}^1$ can be created. We create node $n_3^2$, since s3 = ‘T’ = p2, and there is a parent-child relationship between nodes $n_1^1$ and $n_3^2$, since s1 and s3 satisfy the gap constraints [0,3], i.e. 0 ≤ 3-1-1 = 1 ≤ 3. We create node $n_6^2$, since s6 = ‘T’ = p2, and there is a parent-child relationship between nodes $n_5^1$ and $n_6^2$, since s5 and s6 satisfy the gap constraints [0,3]. But we cannot create a parent-child relationship between nodes $n_1^1$ and $n_6^2$, since s1 and s6 do not satisfy the gap constraints [0,3], i.e. 6-1-1 = 4 > 3. Similarly, we create all nodes and parent-child relationships in the Nettree shown in Fig. 3. The following characteristics should be noticed. The Nettree has 5 roots, $n_1^1$, $n_5^1$, $n_8^1$, $n_{10}^1$, and $n_{13}^1$. Some nodes have the same node ID. For example, nodes $n_5^1$ and $n_5^4$ are two nodes with the same ID 5. Some nodes have more than one parent. For example, node $n_4^3$ has two parents, nodes $n_2^2$ and $n_3^2$. There are 12 occurrences: <1, 2, 4, 5>, <1, 2, 4, 8>, <1, 3, 4, 5>, <1, 3, 4, 8>, <1, 3, 7, 8>, <1, 3, 7, 10>, <5, 6, 7, 8>, <5, 6, 7, 10>, <5, 6, 9, 10>, <5, 6, 9, 13>, <8, 11, 12, 13>, and <10, 11, 12, 13>. All of them can be expressed by RLPs in the Nettree. For example, <1, 3, 7, 8> is an occurrence and its corresponding path <$n_1^1$, $n_3^2$, $n_7^3$, $n_8^4$> is an RLP.

Fig. 3

The Nettree of pattern P = A[0,3]T[0,3]C[0,3]A in sequence S = ATTCATCACATCA. The Nettree has four levels, since the length of pattern P is four. Nodes $n_1^1$, $n_5^1$, $n_8^1$, $n_{10}^1$, and $n_{13}^1$ are the roots. Nodes $n_5^4$, $n_8^4$, $n_{10}^4$, and $n_{13}^4$ are the leaves. Nodes $n_5^1$ and $n_5^4$ are two nodes with the same ID 5 in the first and fourth levels, respectively. Node $n_4^3$ has two parents, nodes $n_2^2$ and $n_3^2$. Path <$n_1^1$, $n_3^2$, $n_7^3$, $n_8^4$> is an RLP and its corresponding occurrence is <1, 3, 7, 8>

According to the definition of nonoverlapping occurrences, no two occurrences can use the same node in the Nettree. For example, <1, 2, 4, 5> and <1, 3, 7, 8> are two overlapping occurrences, since they use a common node $n_1^1$, while <1, 2, 4, 5> and <5, 6, 7, 8> are two nonoverlapping occurrences, since they share no node in the Nettree (position 5 is used by $n_5^4$ in the former and by $n_5^1$ in the latter). To calculate the support in a Nettree, we employ an iterative strategy to find the maximal occurrence.
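The construction in Example 5 can be reproduced with a small sketch. The Python code below is an illustration under stated assumptions, not the authors' implementation: the names `build_nettree` and `count_root_leaf_paths`, and the representation of each level as a list of node IDs plus a dictionary of parent lists, are hypothetical choices made for this example.

```python
def build_nettree(seq, pattern, gap):
    """Build the levels and parent links of a Nettree (hypothetical layout).

    levels[i]     : sorted node IDs (1-based positions) in level i + 1
    parents[i][j] : IDs of the parents (in level i) of node j in level i + 1
    """
    a, b = gap
    n, m = len(seq), len(pattern)
    levels, parents = [], []
    for i in range(m):
        nodes, links = [], {}
        for j in range(1, n + 1):
            if seq[j - 1] != pattern[i]:
                continue
            if i == 0:
                nodes.append(j)      # every match of p1 is a root
                links[j] = []
            else:
                # a node is created only if it has at least one parent
                ps = [p for p in levels[i - 1] if a <= j - p - 1 <= b]
                if ps:
                    nodes.append(j)
                    links[j] = ps
        levels.append(nodes)
        parents.append(links)
    return levels, parents


def count_root_leaf_paths(levels, parents):
    """Count RLPs, i.e. all occurrences, by summing path counts level by level."""
    counts = {j: 1 for j in levels[0]}
    for i in range(1, len(levels)):
        counts = {j: sum(counts[p] for p in parents[i][j]) for j in levels[i]}
    return sum(counts.values())


levels, parents = build_nettree("ATTCATCACATCA", "ATCA", (0, 3))
print(levels)
print(parents[2][4])
print(count_root_leaf_paths(levels, parents))
```

For the sequence and pattern of Example 5, the sketch yields the five roots, the two parents of the node with ID 4 in the third level, and 12 root-leaf paths, in agreement with the 12 occurrences listed in the example.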

Definition 9

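(Maximal Occurrence) Suppose occ = $\langle j_1, j_2, \cdots, j_m \rangle$ is an occurrence of pattern P in sequence S. If, for any occurrence occ' = $\langle k_1, k_2, \cdots, k_m \rangle$ of pattern P in sequence S, $j_i \ge k_i$ holds for all i (1 ≤ i ≤ m), then occurrence occ is the maximal occurrence of pattern P in sequence S.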

Definition 10

(Maximal Root-Leaf Path, MRLP) An RLP that iterates the rightmost child node (i.e. the largest child node) from the rightmost root node to its leaf node in a Nettree is the maximal root-leaf path (MRLP).

Although the NETGAP algorithm [30] employs the Nettree structure to calculate the pattern support accurately, it has to find and prune the invalid nodes after obtaining each nonoverlapping occurrence. Therefore, the time complexity of NETGAP is O(m × m × n × w/r/r), where m, n, w, and r are the pattern length, the sequence length, the gap width, and the alphabet size (i.e. |Σ|), respectively. To calculate the support effectively, we propose Netback, which employs the backtracking strategy and iteratively finds the maximal occurrences. The time complexity of Netback is reduced to O(m × n × w/r/r), which is the same as that of BackTr [23], which iteratively finds the minimal occurrence. The following two examples illustrate the principles of NETGAP, BackTr, and Netback.

Example 6

As shown in Example 5 and Fig. 3, NETGAP [30] iteratively finds the nonoverlapping occurrences from the leftmost root to its leaf. Firstly, starting from the leftmost root $n_1^1$, NETGAP finds its leftmost child (i.e. node $n_2^2$). Iterating this step, NETGAP finds RLP <$n_1^1$, $n_2^2$, $n_4^3$, $n_5^4$>, whose corresponding occurrence is <1, 2, 4, 5>. Then nodes $n_1^1$, $n_2^2$, $n_4^3$, and $n_5^4$ are pruned. After that, to find all the nonoverlapping occurrences, NETGAP has to find and prune the invalid nodes. Therefore, node $n_3^2$, which no longer has a parent, is pruned. Then, NETGAP gets the second occurrence <5, 6, 7, 8>. Similarly, after pruning nodes $n_5^1$, $n_6^2$, $n_7^3$, and $n_8^4$, nodes $n_9^3$ and $n_{10}^4$ are found and pruned. Finally, NETGAP finds the third occurrence <8, 11, 12, 13>. Hence, NETGAP finds three nonoverlapping occurrences.

BackTr [23] also iteratively finds the nonoverlapping occurrences from the leftmost root to its leaf, but it does not need to find and prune the invalid nodes, since it employs the backtracking strategy. Therefore, BackTr finds the same three nonoverlapping occurrences as NETGAP: <1, 2, 4, 5>, <5, 6, 7, 8> and <8, 11, 12, 13>. But the time complexity of BackTr is lower than that of NETGAP. We show that Netback can find the same number of occurrences. The time complexity of Netback is the same as that of BackTr and lower than that of NETGAP.

Example 7

As shown in Example 5, Netback adopts the backtracking strategy to iteratively find the nonoverlapping occurrences from the rightmost root to its leaf. Firstly, starting from the rightmost root $n_{13}^1$, Netback finds its rightmost child node. Since $n_{13}^1$ has no child, Netback moves to the next root $n_{10}^1$. There is only one RLP <$n_{10}^1$, $n_{11}^2$, $n_{12}^3$, $n_{13}^4$> with root $n_{10}^1$. Thus, Netback finds the first occurrence <10, 11, 12, 13>. Similarly, Netback finds the second occurrence <5, 6, 9, 10>. Now, Netback selects root $n_1^1$, which has two children: $n_2^2$ and $n_3^2$. To find the maximal occurrence, Netback iteratively selects the rightmost child node. Therefore, $n_3^2$ is selected, and then $n_7^3$ is selected. $n_7^3$ has two children: $n_8^4$ and $n_{10}^4$. Since $n_{10}^4$ is used in occurrence <5, 6, 9, 10>, Netback selects $n_8^4$. Thus, Netback finds the third occurrence <1, 3, 7, 8>. Hence, Netback also finds three different nonoverlapping occurrences, shown in Fig. 4, and the number of occurrences is the same as that of NETGAP and BackTr.

Fig. 4

Three maximal occurrences found by Netback

From the above two examples, we know that NETGAP and Netback find the same number of nonoverlapping occurrences, and that Netback is more efficient than NETGAP: NETGAP has to find and prune the invalid nodes, while Netback employs the backtracking strategy and does not need to, which reduces the time complexity. The main process of the Netback algorithm is as follows. First, Netback creates the Nettree according to P and S. Then, Netback iteratively finds the rightmost nonoverlapping occurrences from the rightmost root with the backtracking strategy until all roots are visited. The Netback algorithm is shown in Algorithm 1.
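The process described above can be sketched in Python. This is a minimal illustration of rightmost-first backtracking, not a transcription of Algorithm 1. As assumptions of the sketch, the Nettree is kept implicit (children are found by scanning the gap window of the sequence), the function name `netback` is hypothetical, and a node that cannot reach a leaf is simply marked as used so that it is never visited again.

```python
def netback(seq, pattern, gap):
    """Return nonoverlapping occurrences found by rightmost-first backtracking.

    Sketch only: used[i][j] is set once node j of level i + 1 has been
    visited, either because an occurrence consumed it or because it was
    found to lead to no leaf. Each node is therefore visited at most once.
    """
    a, b = gap
    n, m = len(seq), len(pattern)
    used = [[False] * (n + 1) for _ in range(m)]

    def extend(level, pos, path):
        used[level][pos] = True
        path.append(pos)
        if level == m - 1:
            return True
        lo, hi = pos + a + 1, min(n, pos + b + 1)
        for child in range(hi, lo - 1, -1):      # rightmost child first
            if seq[child - 1] == pattern[level + 1] and not used[level + 1][child]:
                if extend(level + 1, child, path):
                    return True
        path.pop()                               # backtrack
        return False

    occurrences = []
    for root in range(n, 0, -1):                 # rightmost root first
        if seq[root - 1] == pattern[0]:
            path = []
            if extend(0, root, path):
                occurrences.append(tuple(path))
    return occurrences


print(netback("ATTCATCACATCA", "ATCA", (0, 3)))
print(len(netback("ACACA", "ACA", (0, 2))))
```

On the data of Example 7, the sketch returns <10, 11, 12, 13>, <5, 6, 9, 10>, and <1, 3, 7, 8> in that order, and on the data of Example 1 it returns two occurrences, so the support is 2.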

Theorem 1

The space and time complexities of Netback are both O(m × n × w) in the worst case and O(m × n × w/r/r) in the average case, where m, n, w, and r are pattern length, sequence length, gap width (b − a + 1, gap = [a,b]), and item number |Σ|, respectively.

Proof

A Nettree has m levels. Each level has no more than n nodes. Each node has no more than w children. Therefore, the space complexity of Netback is O(m × n × w) in the worst case. Netback employs the backtracking strategy to find the maximal occurrences. Thus, each node is visited no more than once. As a result, the time complexity of Netback is the same as the space complexity and is also O(m × n × w) in the worst case. In the average case, each level has no more than n/r nodes, and each node has no more than w/r children. Therefore, the space and time complexities of Netback are O(m × n × w/r/r) in the average case. □

Pattern join strategy

Since the nonoverlapping sequential pattern mining satisfies the Apriori property, we employ the pattern join strategy [31] to generate candidate patterns.

Definition 11

(Prefix and Suffix Pattern and Pattern join) Given event items α, β (α, β ∈ Σ), if there is a pattern Q = Pα, then P is called the prefix pattern of Q, denoted by Prefix(Q). If pattern R = βP, then P is called the suffix pattern of R, denoted by Suffix(R). Patterns Q and R are the super-patterns of pattern P, and pattern P is the sub-pattern of patterns Q and R. Suppose there are two patterns Q = Pα and R = βP. We get a new super-pattern βPα by joining R and Q, since Prefix(Q) = Suffix(R) = P.

Although the set enumeration tree strategy can be used to generate the candidate patterns, the following example shows that the pattern join strategy outperforms the set enumeration tree strategy.

Example 8

Suppose we have a sequence S = s1s2s3s4s5s6s7s8s9s10s11s12s13 = ATTCATCACATCA, Σ = {A, T, C}, gap = [0,3], and the minimum support threshold minsup = 3. Set F3 including all 11 frequent patterns with length 3 is {AAA , AAC, ACA, ACC, ATA, ATC, CAA, CAC, CCA, TCA, TCC}. The principle of the set enumeration tree strategy is adding a character in Σ at the end of each pattern to generate a new pattern. Thus, each pattern can generate |Σ| candidate patterns. Taking frequent pattern AAA as an example, the set enumeration tree strategy generates three candidate patterns, AAAA, AAAC, and AAAT since Σ = {A, T, C}. Therefore, this strategy generates 11*3 = 33 candidate patterns. However, the pattern join strategy generates 18 candidate patterns shown in set C4 = { AAAA, AAAC, AACA, AACC, ACAA, ACAC, ACCA, ATCA, ATCC, CAAA, CAAC, CACA, CACC, CCAA, CCAC, TCAA, TCAC, TCCA }. The number of candidate patterns generated by the set enumeration tree strategy and pattern join strategy is compared in Table 4. We can see that with increasing pattern length, more candidate patterns can be pruned by the pattern join strategy. Hence, the pattern join strategy is more effective than the set enumeration tree strategy.

Table 4

Number of different length patterns

Length of pattern	1	2	3	4	5	6	Total
Number of frequent patterns	3	7	11	9	2	0	32
Number of candidate patterns by set enumeration tree	3	9	21	33	27	6	99
Number of candidate patterns by pattern join	3	9	17	18	8	0	55
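The join of Definition 11 and the count in Example 8 can be illustrated with a short Python sketch. It is a sketch under assumptions: the function name `pattern_join` is hypothetical, and a dictionary keyed by prefix is used for the lookup as a simple stand-in, which is not necessarily how the authors' PatternJoin is implemented.

```python
def pattern_join(frequent):
    """Join equal-length frequent patterns R and Q with Suffix(R) == Prefix(Q).

    Each such pair yields the candidate R + (last item of Q), which is one
    item longer than the input patterns.
    """
    by_prefix = {}
    for q in frequent:
        by_prefix.setdefault(q[:-1], []).append(q)   # index Q by Prefix(Q)
    candidates = []
    for r in sorted(frequent):
        for q in sorted(by_prefix.get(r[1:], [])):   # r[1:] is Suffix(R)
            candidates.append(r + q[-1])
    return candidates


F3 = ["AAA", "AAC", "ACA", "ACC", "ATA", "ATC", "CAA", "CAC", "CCA", "TCA", "TCC"]
C4 = pattern_join(F3)
print(len(C4), C4)
print(len(F3) * len("ATC"))  # candidates under the set enumeration tree strategy
```

For the set F3 of Example 8 the join produces the 18 candidates of C4, whereas appending every item of Σ to every pattern gives 11 × 3 = 33 candidates.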


Screening method

Suppose pattern P is a frequent pattern. According to Definition 6, to decide whether pattern P is an NMSP, we would have to generate all its super-patterns and determine that all of them are infrequent. Obviously, this method is very costly. In this section, we propose the screening method to find NMSPs. The principle is as follows. Suppose all frequent patterns with length m are stored in frequent pattern set $F_m$. We generate all candidate patterns with length m + 1 using $F_m$ and store them in candidate pattern set $C_{m+1}$. If pattern P in set $C_{m+1}$ is a frequent pattern, then its prefix and suffix patterns are not NMSPs. Thus, we delete Prefix(P) and Suffix(P) from set $F_m$. After checking all patterns in set $C_{m+1}$ in this way, the remaining patterns in set $F_m$ are NMSPs. It is worth emphasizing that if pattern P is infrequent, we cannot conclude that its prefix and suffix patterns are NMSPs. The following examples illustrate the principle of the screening method.

Example 9

In Example 8, pattern “AACA” is a frequent pattern since sup (“AACA”, S) = 3. Thus, its prefix pattern “AAC” and suffix pattern “ACA” are not NMSPs. Pattern “AACC” is an infrequent pattern since sup(“AACC”, S) = 2. Its prefix pattern “AAC” and suffix pattern “ACC” are not NMSPs since their super-patterns “AACA” and “ACCA” are frequent patterns whose supports are both 3.

Example 10

In Example 8, we use 11 frequent patterns F3 = {AAA, AAC, ACA, ACC, ATA, ATC, CAA, CAC, CCA, TCA, TCC} to generate 18 candidate patterns, C4 = {AAAA, AAAC, AACA, AACC, ACAA, ACAC, ACCA, ATCA, ATCC, CAAA, CAAC, CACA, CACC, CCAA, CCAC, TCAA, TCAC, TCCA}. Pattern “AACA” is a frequent pattern. Thus, patterns “AAC” and “ACA” in F3 are not NMSPs and are deleted. Similarly, patterns “ACAA”, “ACAC”, “ACCA”, “ATCA”, “CACA”, “TCAA”, “TCAC”, and “TCCA” are frequent patterns stored in F4. Therefore, patterns “ACC”, “ATC”, “CAA”, “CAC”, “CCA”, “TCA”, and “TCC” are not NMSPs and are deleted. Hence, the remaining patterns in F3 are “AAA” and “ATA”, which are NMSPs. We iterate the above process to find all NMSPs.
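The screening step of Example 10 amounts to two set deletions per frequent candidate. The following Python sketch illustrates this; the function name `screen` and its two arguments are hypothetical, and the supports are not recomputed here: the set F4 is taken directly from the example.

```python
def screen(frequent_m, frequent_m_plus_1):
    """Return the patterns of length m that survive screening (the NMSPs).

    For every frequent pattern P of length m + 1, Prefix(P) and Suffix(P)
    have a frequent super-pattern and are therefore removed.
    """
    remaining = set(frequent_m)
    for p in frequent_m_plus_1:
        remaining.discard(p[:-1])   # Prefix(P)
        remaining.discard(p[1:])    # Suffix(P)
    return remaining


F3 = ["AAA", "AAC", "ACA", "ACC", "ATA", "ATC", "CAA", "CAC", "CCA", "TCA", "TCC"]
F4 = ["AACA", "ACAA", "ACAC", "ACCA", "ATCA", "CACA", "TCAA", "TCAC", "TCCA"]
print(sorted(screen(F3, F4)))
```

With the sets of Example 10, only “AAA” and “ATA” remain, which matches the NMSPs of length 3 identified in the example.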

NetNMSP

In this section, we propose the NetNMSP algorithm and analyze its space and time complexities. The main process of the NetNMSP algorithm is as follows. First, NetNMSP generates candidate pattern set $C_{m+1}$ with pattern length m + 1 using frequent pattern set $F_m$. Then, NetNMSP calculates the support of each pattern P in set $C_{m+1}$. If pattern P is frequent, it is stored in frequent pattern set $F_{m+1}$, and its prefix and suffix patterns in $F_m$ are not NMSPs and are deleted. NetNMSP repeats this step until all patterns in set $C_{m+1}$ have been checked. All patterns remaining in $F_m$ are NMSPs and are stored in the NMSP set. Finally, NetNMSP iterates the above process with increasing m until the candidate pattern set is empty. The NetNMSP algorithm is shown in Algorithm 2.
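The overall loop can be sketched compactly in Python. This is an illustrative sketch under stated assumptions, not a transcription of Algorithm 2: the names `support` and `net_nmsp` are hypothetical, the support routine is a condensed rightmost-first backtracking in the spirit of Netback with an implicit Nettree, and candidate lookup uses a dictionary instead of the binary search mentioned in Theorem 3.

```python
def support(seq, pattern, gap):
    """Nonoverlapping support of `pattern` in `seq` (rightmost-first backtracking)."""
    a, b = gap
    n, m = len(seq), len(pattern)
    used = [[False] * (n + 1) for _ in range(m)]

    def extend(level, pos):
        used[level][pos] = True          # consumed or dead, never revisited
        if level == m - 1:
            return True
        for child in range(min(n, pos + b + 1), pos + a, -1):
            if seq[child - 1] == pattern[level + 1] and not used[level + 1][child]:
                if extend(level + 1, child):
                    return True
        return False

    return sum(1 for root in range(n, 0, -1)
               if seq[root - 1] == pattern[0] and extend(0, root))


def net_nmsp(sdb, gap, minsup):
    """Mine NMSPs level by level: pattern join, support check, screening."""
    def sup(pattern):
        return sum(support(seq, pattern, gap) for seq in sdb)

    alphabet = sorted(set("".join(sdb)))
    frequent = [item for item in alphabet if sup(item) >= minsup]
    nmsps = []
    while frequent:
        by_prefix = {}
        for q in frequent:
            by_prefix.setdefault(q[:-1], []).append(q)
        candidates = [r + q[-1] for r in frequent for q in by_prefix.get(r[1:], [])]
        next_frequent = [c for c in candidates if sup(c) >= minsup]
        maximal = set(frequent)
        for c in next_frequent:          # screening
            maximal.discard(c[:-1])
            maximal.discard(c[1:])
        nmsps.extend(sorted(maximal))
        frequent = next_frequent
    return nmsps


print(net_nmsp(["ACACA", "CACAC"], (0, 2), 2))
```

On the database of Example 2 with gap = [0,2] and minsup = 2, the sketch returns the six NMSPs listed in Table 1 and Definition 6.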

Theorem 2

The space complexity of the NetNMSP algorithm is , where m, n, L, w, and r are the length of the longest pattern, the length of the longest sequence in database, the number of the candidate patterns, gap width (b − a + 1, gap = [a,b]), and item number (i.e.|Σ|), respectively. The space of the NetNMSP algorithm consists of two parts, the space of frequent patterns and candidate patterns and the space of Netback. It is easy to know that the space complexity of the first part is O(m × L). Meanwhile, the space complexity of the Netback algorithm is . Hence, the space complexity of the NetNMSP algorithm is . □

Theorem 3

The time complexity of the NetNMSP algorithm is , where N is the total length of all sequences. According to Theorem 1, the time complexity of Netback is for a sequence. The time complexity of Netback for all sequences is therefore . Thus, for all candidate patterns, the time complexity of NetNMSP is . Since the binary search is used in PatternJoin to generate candidate patterns, the time complexity of generating all candidate patterns is O(L × log(L)). Therefore, the time complexity of the NetNMSP algorithm is . □

Experimental results and analysis

Section 5.1 explains the benchmark datasets and the baseline methods. Section 5.2 verifies the correctness of the NetNMSP algorithm. Section 5.3 reports the time efficiency of the NetNMSP algorithm. Section 5.4 compares the compression performance of maximal sequential pattern mining and closed sequential pattern mining. Section 5.5 reports the NMSPs in a biological application of COVID-19.

Benchmark datasets and baseline methods

The experiments were run on a computer with an Intel(R) Core(TM) i5-3120M 2.50 GHz CPU, 8 GB RAM, and the 64-bit Windows 7 operating system. The program development environment is VC++ 6.0. To verify the performance of the NetNMSP algorithm, this paper adopts DNA, protein, and virus sequences as experimental data. All algorithms and datasets can be downloaded from https://github.com/wuc567/Pattern-Mining/tree/master/NetNMSP. The datasets are shown in Table 5.

Table 5

Description of datasets

Dataset	Type	Source	Number of sequences	Description	Total length
DNA1¹	DNA	Homo Sapiens AL158070	1	Single sequence	6,000
DNA2	DNA	Homo Sapiens AL158070	1	Single sequence	8,000
DNA3	DNA	Homo Sapiens AL158070	1	Single sequence	10,000
DNA4	DNA	Homo Sapiens AL158070	1	Single sequence	12,000
DNA5	DNA	Homo Sapiens AL158070	1	Single sequence	14,000
DNA6	DNA	Homo Sapiens AL158070	1	Single sequence	16,000
SDB1²	Protein	ASTRAL95_1_161	507	Multiple/unequal length	91,875
SDB2	Protein	ASTRAL95_1_161	338	Multiple/unequal length	62,985
SDB3	Protein	ASTRAL95_1_161	169	Multiple/unequal length	32,503
SDB4	Protein	ASTRAL95_1_171	590	Multiple/unequal length	109,424
SDB5	Protein	ASTRAL95_1_171	400	Multiple/unequal length	73,425
SDB6	Protein	ASTRAL95_1_171	200	Multiple/unequal length	37,327
Baby1³	Babysale	Sales of baby products	1,636	Multiple/unequal length	73,272
Baby2	Babysale	Sales of baby products	2,077	Multiple/unequal length	94,152
Baby3	Babysale	Sales of baby products	2,544	Multiple/unequal length	115,088
Baby4	Babysale	Sales of baby products	3,057	Multiple/unequal length	137,941
Super1⁴	Superstore	Superstore time series	1	Single sequence	100,001
Super2	Superstore	Superstore time series	1	Single sequence	120,001
Super3	Superstore	Superstore time series	1	Single sequence	140,001
Super4	Superstore	Superstore time series	1	Single sequence	161,048
TSS⁵	Human genes	Transcriptional Start Sites	200	Multiple/equal length	20,000
SARS-CoV-1⁶	DNA of the virus	Severe Acute Respiratory Syndrome Coronavirus	1	Single sequence	29,751
SARS-CoV-2⁷	DNA of the virus	Severe Acute Respiratory Syndrome Coronavirus 2	1	Single sequence	29,903
MERS-CoV⁸	DNA of the virus	Middle East Respiratory Syndrome Coronavirus	1	Single sequence	30,119

1 Homo Sapiens AL158070 is a DNA sequence, which can be downloaded from https://www.ncbi.nlm.nih.gov/nuccore/AL158070.11

2 ASTRAL is a database of protein sequences based on the SCOP database, which can be downloaded from https://scop.berkeley.edu/astral/subsets/ver=1.61 and https://scop.berkeley.edu/astral/ver=1.71

3 Babysale is a sales dataset for infant products, which can be downloaded from https://tianchi.aliyun.com/dataset/dataDetail?dataId=45

4 Superstore is a sales dataset from SuperStore, which can be downloaded from https://tianchi.aliyun.com/dataset/dataDetail?dataId=93285

5 TSS (Transcriptional Start Sites) contains 200 human genes of positive and negative classes, which comes from http://dbtss.hgc.jp/

6 SARS-CoV-1 (Severe Acute Respiratory Syndrome Coronavirus 1) is the gene sequence of the virus that caused the SARS outbreak in 2003, which was reported by Ref [71] and can be downloaded from https://www.ncbi.nlm.nih.gov/nuccore/30271926?report=fasta

7 SARS-CoV-2 (Severe Acute Respiratory Syndrome Coronavirus 2) is the gene sequence of the virus causing COVID-19, which was reported by Ref [72] and can be downloaded from https://www.ncbi.nlm.nih.gov/nuccore/MN908947.3?report=fasta

8 MERS-CoV (Middle East Respiratory Syndrome Coronavirus) was reported by Ref [73] and can be downloaded from https://www.ncbi.nlm.nih.gov/nuccore/NC_019843.3?report=fasta

It should be noted that SDB1-6 are protein sequences composed of 20 amino acids, and DNA1-6, SARS-CoV-1, SARS-CoV-2, MERS-CoV, and TSS are DNA sequences composed of four deoxynucleotides, A, T, C, and G. In this paper, GSgrow-MAX, NOSEP-MAX, MAXB, MAXD, NetNMSP-S, NOSEP, and NetNCSP are employed as competitive algorithms, whose principles are as follows.

GSgrow-MAX: The GSgrow [40] algorithm is an efficient mining algorithm that employs the INSgrow algorithm to approximately calculate the support. Based on GSgrow, GSgrow-MAX further finds the maximal patterns among the frequent patterns.

NOSEP-MAX: To verify the effectiveness of Netback, NOSEP-MAX calculates the support with NETGAP, which was employed in NOSEP [30].

MAXB and MAXD: To verify the pruning efficiency of pattern join, we propose the MAXB and MAXD algorithms, which generate the candidate patterns by breadth-first and depth-first search of the set enumeration tree, respectively.

NetNMSP-S: To verify the efficiency of the screening method, NetNMSP-S is proposed, which mines NMSPs directly according to Definition 6.

NOSEP [30] and NetNCSP [23]: To verify the compression effect of NMSPs, we employ the NOSEP and NetNCSP algorithms to mine all frequent patterns and closed patterns, respectively.

Mining efficiency

To further illustrate the efficiency of the NetNMSP algorithm, we perform experiments on single-sequence DNA datasets (DNA1 to DNA6) and multi-sequence protein datasets (SDB1 to SDB6). We compare NetNMSP with GSgrow-MAX, NOSEP-MAX, MAXB, MAXD, and NetNMSP-S. Considering the difference between DNA and protein datasets, we set minsup = 700, gap = [0,3] on the DNA datasets, and minsup = 1800, gap = [0,10] on the protein datasets. The comparisons of the number of NMSPs, running time, and the number of candidate patterns are shown in Figs. 5, 6, and 7 for the DNA datasets and in Figs. 8, 9, and 10 for the protein datasets, respectively.

Fig. 5

Comparison of number of NMSPs on DNA1 to DNA6

Fig. 6

Comparison of running time on DNA1 to DNA6

Fig. 7

Comparison of number of candidate patterns on DNA1 to DNA6

Fig. 8

Comparison of number of NMSPs on SDB1 to SDB6

Fig. 9

Comparison of running time on SDB1 to SDB6

Fig. 10

Comparison of number of candidate patterns on SDB1 to SDB6

The results indicate the following observations. In summary, NetNMSP outperforms all competitive algorithms.

Although GSgrow-MAX is faster than NetNMSP, NetNMSP has better performance than GSgrow-MAX. From Figs. 6 and 9, we know that GSgrow-MAX is faster than all other algorithms. However, from Figs. 5 and 8, we know that GSgrow-MAX discovers fewer NMSPs. For example, on DNA6, GSgrow-MAX runs for 1.41 s, while NetNMSP runs for 444 s. However, GSgrow-MAX only finds 262 NMSPs, while NetNMSP finds 465. The reasons are as follows. GSgrow-MAX employs INSgrow to calculate the support. INSgrow is an approximate algorithm, and its time complexity is lower than that of Netback. Therefore, GSgrow-MAX runs faster than NetNMSP. However, because INSgrow is approximate, GSgrow-MAX is also an approximate algorithm, and some frequent patterns cannot be discovered by it. Moreover, as the sequence length increases, more NMSPs are lost. For instance, on DNA1, GSgrow-MAX discovers 16 out of 17 NMSPs, while on DNA6, it discovers 262 out of 465 NMSPs, since DNA1 is shorter than DNA6. The protein datasets are composed of multiple sequences, and each sequence is short. Thus, GSgrow-MAX can find most NMSPs on the protein datasets. Hence, NetNMSP outperforms GSgrow-MAX.

NetNMSP has better performance than NOSEP-MAX. NetNMSP checks the same number of candidate patterns as NOSEP-MAX and finds the same NMSPs, but runs faster. For example, on SDB1, both NetNMSP and NOSEP-MAX check 2,045 candidate patterns and discover 179 NMSPs, and NetNMSP runs for 239 s, while NOSEP-MAX runs for 371 s. The reasons are as follows. NetNMSP and NOSEP-MAX employ the same candidate pattern generation strategy, and Netback and NETGAP are both exact algorithms. Therefore, NetNMSP and NOSEP-MAX check the same number of candidate patterns and discover the same number of NMSPs. However, NETGAP has to find and prune the invalid nodes, while Netback does not. Thus, the time complexity of Netback is lower than that of NETGAP. Hence, NetNMSP runs faster than NOSEP-MAX.

NetNMSP has better performance than MAXB and MAXD. NetNMSP finds the same NMSPs as MAXB and MAXD, checks fewer candidate patterns, and runs faster. For example, on DNA2, NetNMSP, MAXB, and MAXD all find 38 NMSPs, check 179, 412, and 412 candidate patterns, respectively, and take 11, 40, and 40 s, respectively. The reasons are as follows. The three algorithms all adopt Netback to calculate the support, but employ different candidate pattern generation strategies: the pattern join strategy, and breadth-first and depth-first search of the set enumeration tree, respectively. In Section 4.2, we show that the pattern join strategy outperforms the set enumeration tree strategy. Hence, NetNMSP checks fewer candidate patterns than MAXB and MAXD, and runs faster.

NetNMSP has better performance than NetNMSP-S. NetNMSP checks the same number of candidate patterns as NetNMSP-S and finds the same NMSPs, but runs faster. For example, on SDB4, both NetNMSP and NetNMSP-S check 3,882 candidate patterns and find 325 NMSPs, and run for 574 and 806 s, respectively. The reason is that NetNMSP employs the screening method to find NMSPs, while NetNMSP-S uses the definition directly. Hence, NetNMSP runs faster than NetNMSP-S.
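The gap in candidate counts between pattern join and set-enumeration-tree extension, as seen in the comparison with MAXB and MAXD, can be illustrated on the 11 frequent length-3 patterns of Example 10. The sketch below rests on two assumptions: that enumeration-tree extension appends every alphabet symbol to every frequent pattern (a simplified reading of MAXB and MAXD), and that the alphabet is the four DNA symbols. The function names are hypothetical, and the counts it prints are for this toy input, not the measured 179 versus 412 on DNA2.

```python
def enumeration_extend(frequent, alphabet):
    """Append every alphabet symbol to every frequent pattern."""
    return [p + a for p in frequent for a in alphabet]


def pattern_join(frequent):
    """Join P and Q only when Suffix(P) equals Prefix(Q)."""
    return [p + q[-1] for p in frequent for q in frequent if p[1:] == q[:-1]]


F3 = ["AAA", "AAC", "ACA", "ACC", "ATA", "ATC",
      "CAA", "CAC", "CCA", "TCA", "TCC"]
ALPHABET = "ACGT"  # assumption: the four DNA symbols

by_enumeration = enumeration_extend(F3, ALPHABET)
by_join = pattern_join(F3)
print(len(by_enumeration), len(by_join))  # 44 18
```

Every candidate produced by the join also appears among the enumeration candidates. The join simply skips extensions whose suffix is already known to be infrequent, so fewer supports need to be computed.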

Scalability analysis

To analyze the scalability of the NetNMSP algorithm, we select two datasets in Table 5: the multi-sequence dataset Babysale and the single-sequence dataset Superstore. We set gap = [0,3] and minsup = 6,000 for Babysale, and gap = [0,7] and minsup = 6,000 for Superstore. The number of NMSPs, running time, and number of candidate patterns are shown in Figs. 11, 12, and 13, respectively.

Fig. 11

Comparison of number of NMSPs

Fig. 12

Comparison of running time

Fig. 13

Comparison of number of candidate patterns

The results show the following observations. As the sequence length increases, the number of NMSPs, the running time, and the number of candidate patterns all increase. For example, the lengths of Baby1 and Baby4 are 73,272 and 137,941, respectively. There are 18 and 118 NMSPs, the running time is 46 and 892 s, and the number of candidate patterns is 88 and 637 on Baby1 and Baby4, respectively. The same phenomenon can be found on the Superstore dataset. The reasons are as follows. As the sequence length increases, more frequent patterns can be found. Therefore, there are more candidate patterns, and the running time increases. More importantly, more patterns are NMSPs. Meanwhile, from Fig. 12, we can see that NOSEP-MAX, MAXB, MAXD, NetNMSP-S, and NetNMSP show the same tendency in running time. However, the running time of NetNMSP grows more gently than that of the other algorithms, which indicates that our algorithm has the best scalability on large-scale datasets.

Compression performance

To show the compression performance of NMSP mining, we report the number of nonoverlapping frequent sequential patterns mined by NOSEP [30], the nonoverlapping closed sequential patterns mined by NetNCSP [23], and NMSPs mined by NetNMSP. In the experiments, we also report the proportion of closed patterns and maximal patterns in frequent patterns, expressed by rate_close (rate_close = cp / fp) and rate_max (rate_max = mp / fp), where cp, mp, and fp are the number of nonoverlapping closed patterns, maximal patterns, and frequent patterns, respectively. To evaluate the compression performance of NMSPs, we report the number of nonoverlapping frequent patterns, closed patterns, and NMSPs on the single-sequence dataset DNA2 and the multiple-sequence dataset SDB5. On DNA2, we increase gap from [0,1] to [0,6] with minsup = 800. On SDB5, we increase gap from [0,5] to [0,15] with minsup = 1800. To further show the consistency of the compression performance of NMSPs, we also use different support thresholds. Single-sequence DNA dataset DNA4, multiple-sequence DNA dataset TSS, and multiple-sequence protein dataset SDB2 are selected. On DNA4 and TSS, we decrease minsup from 950 to 450 with gap = [0,3]. On SDB2, we decrease minsup from 2000 to 1250 with gap = [0,15]. The experimental results are shown in Figs. 14, 15, 16, 17 and 18.

Fig. 14

Comparison on DNA2 with minsup = 800

Fig. 15

Comparison on SDB5 with minsup = 1800

Fig. 16

Comparison on DNA4 with gap = [0,3]

Fig. 17

Comparison on TSS with gap = [0,3]

Fig. 18

Comparison on SDB2 with gap = [0,15]

The results indicate the following observations.

From Figs. 14–18, all experimental results show that NMSP mining has better compression performance than nonoverlapping closed pattern mining in all cases, such as single sequence, multiple sequences, different gap constraints, and different support thresholds. In all experiments, the number of closed patterns is the same as that of frequent patterns, while the number of NMSPs is less than that of frequent patterns. For example, in Fig. 14, the numbers of frequent patterns, closed patterns, and NMSPs are 317, 317, and 150, respectively. The same phenomenon can be found in Figs. 15–18. The reasons are as follows. For nonoverlapping closed pattern mining, if pattern P and its super-pattern Q are frequent patterns and have the same support, then pattern P can be compressed. However, for NMSP mining, if pattern P and its super-pattern Q are frequent patterns, then pattern P can be compressed no matter whether its support is the same as that of pattern Q. Therefore, NMSP mining more easily achieves better compression performance than nonoverlapping closed pattern mining.

The smaller |Σ| is, the better the compression ability of NMSP mining is. From Table 5, we know that |Σ| of DNA2, DNA4, and TSS are all 4, and |Σ| of SDB5 and SDB2 are both 20. From Figs. 14–18, NMSP mining retains about 50% to 65% of the frequent patterns with |Σ| = 4, while it retains about 75% to 85% with |Σ| = 20. For example, on DNA2, NMSP mining retains 58.3% of the frequent patterns with gap = [0,1] and minsup = 800; on DNA4, it retains 57.5% with gap = [0,3] and minsup = 950; and on TSS, it retains 56.7% with gap = [0,3] and minsup = 950. However, on SDB5, NMSP mining retains 82.8% of the frequent patterns with gap = [0,5] and minsup = 1,800, and on SDB2, it retains 77.3% with gap = [0,15] and minsup = 2,000. The reasons are as follows. When the number of frequent patterns is the same for different sizes of Σ, the smaller |Σ| is, the longer the maximal length of the frequent patterns is. The subpatterns of the frequent patterns are also frequent patterns, which can be compressed by NMSPs. Therefore, many short frequent patterns can be compressed when |Σ| is smaller.

In summary, NMSP mining has better compression performance than nonoverlapping closed pattern mining in many cases.
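The two proportions used in this comparison can be checked with a few lines of Python. The sketch assumes that rate_close and rate_max are the plain ratios cp / fp and mp / fp, which is the reading suggested by their description as proportions of closed and maximal patterns in frequent patterns. The function name `compression_rates` is hypothetical, and the counts are those quoted for Fig. 14.

```python
def compression_rates(fp, cp, mp):
    """Return (rate_close, rate_max) = (cp / fp, mp / fp)."""
    if fp <= 0:
        raise ValueError("fp must be positive")
    return cp / fp, mp / fp


# Counts quoted for Fig. 14: 317 frequent, 317 closed, 150 maximal patterns.
rate_close, rate_max = compression_rates(fp=317, cp=317, mp=150)
print(f"rate_close = {rate_close:.1%}, rate_max = {rate_max:.1%}")
# rate_close = 100.0%, rate_max = 47.3%
```

A rate_close of 100% means that closed pattern mining removes nothing in this setting, whereas the maximal patterns keep fewer than half of the frequent patterns.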

Case study

On February 11, 2020, the novel coronavirus pneumonia was named COVID-19 by the World Health Organization (WHO). Meanwhile, the International Committee on Taxonomy of Viruses announced the official name of the new coronavirus: Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2). The disease caused a global disaster, infecting more than 200,000,000 people and killing more than 4,500,000 by August 29, 2021. Many researchers have studied SARS-CoV-2 from different aspects. For example, Nawaz et al. [74] adopted a sequential pattern mining method to find hidden patterns which can be used to examine the evolution and variations in COVID-19 strains. Each of the three viruses, SARS-CoV-1, SARS-CoV-2, and MERS-CoV, is represented by a single DNA sequence. To study their complete genomes, in this section, we set gap = [0,4] and minsup = 2500. We employ intersection, union, and rate to evaluate the similarity of each pair of sequences, where the intersection of two sets of mined patterns A and B is the set of patterns which are in both A and B, the union is the set of patterns which are in A or B, and rate = t / u, where t and u are the number of patterns in the intersection and the union, respectively. We report the number of frequent patterns and NMSPs on two sequences with different pattern lengths. The comparison results between MERS-CoV and SARS-CoV-1, SARS-CoV-1 and SARS-CoV-2, and MERS-CoV and SARS-CoV-2 are shown in Figs. 19–24.

Fig. 19

Comparison of number of frequent patterns between MERS-CoV and SARS-CoV-2

Fig. 20

Comparison of number of NMSPs between MERS-CoV and SARS-CoV-2

Fig. 21

Comparison of number of frequent patterns between SARS-CoV-1 and SARS-CoV-2

Fig. 22

Comparison of number of NMSPs between SARS-CoV-1 and SARS-CoV-2

Fig. 23

Comparison of number of frequent patterns between MERS-CoV and SARS-CoV-1

Fig. 24

Comparison of number of NMSPs between MERS-CoV and SARS-CoV-1

The results indicate the following observations.

Both frequent pattern mining and NMSP mining show that most of the shared patterns are short patterns. In Figs. 19 and 20, there are 316 and 313 frequent patterns and 171 and 161 NMSPs in MERS-CoV and SARS-CoV-2, respectively. Among these patterns, 248 frequent patterns and 68 NMSPs are the same, and most of them are short patterns. For example, when the pattern lengths are less than 6, 209 frequent patterns and 52 NMSPs are the same. The same phenomenon can also be found in Figs. 21 to 24. The possible reasons are as follows. The basic structure of MERS-CoV, SARS-CoV-1, and SARS-CoV-2 is the same. Therefore, the short NMSPs are the same. However, the three viruses have many different characteristics, which lead to different frequent patterns and NMSPs among the long patterns.

NMSP mining not only effectively compresses the frequent patterns, but also makes it easier to find the difference between each two viruses. For example, there are 268 and 313 frequent patterns in SARS-CoV-1 and SARS-CoV-2, respectively, while there are 150 and 161 NMSPs in SARS-CoV-1 and SARS-CoV-2, respectively. Therefore, NMSPs retain no more than 60% of the frequent patterns. More importantly, NMSP mining makes it easier to find different patterns between each two viruses. For example, there are 97 frequent patterns with length 4 in each of MERS-CoV and SARS-CoV-1. Among them, 87 patterns are the same. Therefore, the intersection and union are 87 and 117, respectively. Thus, the rate is 74% in frequent patterns. However, there are 43 and 66 NMSPs with length 4 in MERS-CoV and SARS-CoV-1, respectively. The intersection and union are 33 and 76, respectively, resulting in a rate of 43% in NMSPs. Hence, NMSP mining makes it easier to find the difference between each two viruses.

SARS-CoV-2 is more similar to SARS-CoV-1 than to MERS-CoV. From Fig. 20, there are 171 and 161 NMSPs in MERS-CoV and SARS-CoV-2, respectively. Among them, 68 patterns are the same. Therefore, the intersection and union are 68 and 263, respectively. Thus, the total similarity between MERS-CoV and SARS-CoV-2 is about 26%. However, there are 150 and 161 NMSPs in SARS-CoV-1 and SARS-CoV-2, respectively. The intersection and union are 93 and 218, respectively, so the rate between SARS-CoV-1 and SARS-CoV-2 is 43%. Hence, compared with MERS-CoV, SARS-CoV-1 is more similar to SARS-CoV-2.

MERS-CoV is more similar to SARS-CoV-1 than to SARS-CoV-2. From the above analysis, we know that the total similarity between MERS-CoV and SARS-CoV-2 is about 26%. However, from Fig. 24, there are 171 and 150 NMSPs in MERS-CoV and SARS-CoV-1, respectively. Among them, 95 patterns are the same. Therefore, the intersection and union are 95 and 226, respectively. Thus, the total similarity between MERS-CoV and SARS-CoV-1 is about 42%. Hence, the similarity between MERS-CoV and SARS-CoV-1 is higher than that between MERS-CoV and SARS-CoV-2.
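The similarity rate used in this case study can be sketched in Python. The sketch assumes that rate is the size of the intersection divided by the size of the union, which agrees with the reported figures such as 33 / 76 ≈ 43%. The function names are hypothetical, and the two small pattern sets are made up for illustration; only the counts passed to `rate_from_counts` come from the text.

```python
def rate(patterns_a, patterns_b):
    """Intersection size over union size of two pattern sets."""
    a, b = set(patterns_a), set(patterns_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rate_from_counts(size_a, size_b, shared):
    """Same ratio when only the set sizes and the overlap are known."""
    union = size_a + size_b - shared
    return shared / union


# Hypothetical pattern sets, for illustration only.
print(rate({"AAA", "ATA", "CCA"}, {"AAA", "CCA", "TTT"}))  # 0.5

# NMSP counts quoted in the text.
print(round(rate_from_counts(43, 66, 33), 2))    # 0.43  length-4 NMSPs, MERS-CoV vs SARS-CoV-1
print(round(rate_from_counts(150, 161, 93), 2))  # 0.43  all NMSPs, SARS-CoV-1 vs SARS-CoV-2
print(round(rate_from_counts(171, 150, 95), 2))  # 0.42  all NMSPs, MERS-CoV vs SARS-CoV-1
```

A lower rate means the two pattern sets share less, which is why the drop from 74% for frequent patterns to 43% for NMSPs is read as NMSPs exposing more of the differences.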

Conclusion

This paper studies NMSP mining. As a pattern compression technology, NMSP mining can compress the pattern set and reduce redundant patterns without changing the mining parameters. To solve the problem, this paper proposes the NetNMSP algorithm, which employs the backtracking strategy to calculate the support, the pattern join strategy to generate the candidate patterns, and the screening method to find NMSPs. To calculate the support, we propose the Netback algorithm, which adopts the backtracking method to find an occurrence with no need to find and prune invalid nodes in the Nettree. Since NMSP mining satisfies the Apriori property, NetNMSP adopts the pattern join strategy to generate the candidate patterns. Meanwhile, NetNMSP employs the screening method to find NMSPs to improve the mining efficiency. Experiments on DNA and protein sequence datasets verify that NetNMSP can exactly mine all NMSPs in sequence datasets. Moreover, they show that not only does NetNMSP outperform the other competitive algorithms, but NMSP mining also has better compression performance than closed sequential pattern mining. Experiments on sales datasets validate that our algorithm has the best scalability on large-scale datasets. More importantly, we mine NMSPs and frequent patterns in SARS-CoV-1, SARS-CoV-2, and MERS-CoV. The results show that NMSP mining makes it easier to find the differences between the virus sequences.
