Zhao-Hui Qi1, Xiao-Qin Qi, Chen-Chen Liu. 1. School of Computer and Information Engineering, Shijiazhuang Railway Institute, Shijiazhuang, Hebei 050043, People's Republic of China. zhqi_yh2004@yahoo.com.cn
Abstract
We introduce a new approach to investigate problem of DNA sequence alignment. The method consists of three parts: (i) simple alignment algorithm, (ii) extension algorithm for largest common substring, (iii) graphical simple alignment tree (GSA tree). The approach firstly obtains a graphical representation of scores of DNA sequences by the scoring equation R(0)*R-S(0)*S-T(0)*(a+bk). Then a GSA tree is constructed to facilitate solving the problem for global alignment of 2 DNA sequences. Finally we give several practical examples to illustrate the utility and practicality of the approach. Crown Copyright 2009. Published by Elsevier Ltd. All rights reserved.
We introduce a new approach to investigate problem of DNA sequence alignment. The method consists of three parts: (i) simple alignment algorithm, (ii) extension algorithm for largest common substring, (iii) graphical simple alignment tree (GSA tree). The approach firstly obtains a graphical representation of scores of DNA sequences by the scoring equation R(0)*R-S(0)*S-T(0)*(a+bk). Then a GSA tree is constructed to facilitate solving the problem for global alignment of 2 DNA sequences. Finally we give several practical examples to illustrate the utility and practicality of the approach. Crown Copyright 2009. Published by Elsevier Ltd. All rights reserved.
The decoding of different genomes, in particular the human genomes has triggered a great deal of bioinformatics research. The research of sequence similarity to a known protein sequence or DNA sequence has been an important method to provide the first clues about the function of a newly sequenced gene. The research becomes increasingly useful in the analysis of newly sequenced genes as the sequence databases, such as DNA and protein databases, continue to grow in size. There are a number of standard schemes widely used to search for homologous sequences in nucleotide and protein databases to distinguish biologically significant relationships from chance similarities (Smith and Waterman, 1981; Waterman, 1984; Pearson and Lipman, 1988; Altschul et al., 1990, Altschul et al., 1997; Tatiana and Thomas, 1999). The dynamic programming algorithms (Smith and Waterman, 1981; Waterman, 1984) assign scores to insertions, deletions and replacement, and compute an alignment of two sequences. These algorithms are impractical for searching large database because of their computational requirements. Rapid heuristic algorithms (Pearson and Lipman, 1988; Altschul et al., 1990, Altschul et al., 1997) compare protein and DNA sequences much faster than the above methods. They are widely used for large database searches. However, a number of important scientific contexts involve the comparison of only two sequences and do not require a time-consuming database search. In order to meet these needs, many important tools for large database searches (Tatiana and Thomas, 1999; http://www.ebi.ac.uk/Tools/emboss/align/index.html; http://blast.ncbi.nlm.nih.gov/bl2seq/wblast2.cgi) are further developed to employ global alignment of 2 protein and nucleotide sequences. These methods can provide the global profile of similarity degree. Recently, many graphical methods have also been used to examine the global similarities/dissimilarities among the coding sequences of different species (Qi et al., 2007; Qi and Qi, 2007, Qi and Qi, 2009; Qi and Fan, 2007; Yao et al., 2006, Yao et al., 2008a, Yao et al., 2008b; Randić et al., 2003a, Randić et al., 2003b). Because of the advantages in visualization, graphical methods have become a powerful bioinformatics tool for the analysis of complicated biological systems, such as enzyme-catalyzed system (Chou et al., 1979; Chou, 1980; Chou and Forsen, 1980, Chou and Forsen, 1981; Chou and Liu, 1981; Myers and Palmer, 1985; Zhou and Deng, 1984; Chou, 1989, Chou, 1990; Kuzmic et al., 1992; Andraos, 2008), protein folding kinetics (Chou, 1990, Chou, 1993), condon usage (Chou and Zhang, 1992; Zhang and Chou, 1994), HIV reverse transcriptase inhibition mechanisms (see Althaus et al., 1993a, Althaus et al., 1993b, Althaus et al., 1993c, as well as a review article (Chou et al., 1994)), base frequency distribution in the anti-sense strands (Chou et al., 1996) and classifying organisms (Sorimachi and Okayasu, 2008; Okayasu and Sorimachi, 2009; Qi et al., 2009). Recently, the images of cellular automata were used to represent biological sequences (Xiao et al., 2005a), predict protein subcellular location (Xiao et al., 2006a), investigate HBV virus gene missense mutation (Xiao et al., 2005b) and HBV viral infections (Xiao et al., 2006b), predicting protein structural classes (Xiao et al., 2008) and G-protein-coupled receptor functional classes (Xiao et al., 2009), as well as analyze the fingerprint of SARS coronavirus (Wang et al., 2005; Gao et al., 2006).In this paper, we suggest a heuristic approach to align a pair of DNA sequences with the tree data structure. Some graphical descriptions are also used to intuitively explain the scheme. The method consists of three parts: (i) simple alignment algorithm, (ii) extension algorithm for largest common substring, (iii) graphical simple alignment tree (GSA tree). Here, we firstly obtain a 2-dimension (2D) graphical curve by graphical representation of scores of DNA sequences. A good simple alignment of the DNA sequences is generated when the score of the scoring curve reaches its peak value. The 2D graphical curve can intuitively show the global change of simple alignment scores based on scoring matrix. Then the largest common substrings of the good simple alignment are found out. Because of the limit of the initial alignment reaching peak score, the largest common substrings of original two sequences may be split by the largest common substrings of the good simple alignment. In order to protect these common substrings from being split, an extension algorithm for the largest common substring is suggested. Then all largest common substrings are found out when the initial alignment reaches peak score. The process is repeatedly done until GSA tree comes into being. The tree can facilitate solving the problem for global alignment of 2 DNA sequences. At last, we give several practical examples to illustrate the utility and practicality of the approach.
Simple alignment algorithm and graphical representation of scores of DNA sequences based on scoring matrix
The scoring equation based on scoring matrix
In bioinformatics, scoring matrix is also called as substitution matrix. As for protein sequences, the matrix is often based on observed substitution rates, derived from the substitution frequencies seen in multiple alignments of sequences. Every possible identity and substitution is assigned a score based on the observed frequencies in alignments of related proteins. Similarly, every possible identity and substitution in alignments of related DNA sequences is also assigned a score. However, the scoring matrix in alignment of DNA sequences is relatively simple and intuitional. Table 1
is a scoring matrix. An identity in scoring matrix is assigned a positive score R
. A substitution is assigned a positive score S
. But the score S must be subtracted from the total score of an alignment. For example, an alignment of two short DNA sequences ATGGTGCAACTGACT and ATGGTGCACTTGACT is the following:The score of alignment should be .
Table 1
A usual scoring matrix.
A
C
G
T
A
R
−S
−S
−S
C
−S
R
−S
−S
G
−S
−S
R
−S
T
−S
−S
−S
R
A usual scoring matrix.Another important problem for alignment is the treatment of gaps, i.e., spaces inserted to optimize the alignment score. A ‘gap open’ penalty is one that is the cost for the first space of each gap spaces. A ‘gap extension’ penalty is one that is the cost for one of each gap spaces except for the first space. Typically, the cost of extending a gap is set to be 5–10 times lower than the cost for opening a gap (http://www.ebi.ac.uk/Tools/emboss/align/index.html). There is one way to compute a penalty for a gap of n positions: gap opening penalty + gap extension penalty. Now, let parameter a be gap opening penalty and parameter b be gap extension penalty. And let parameter k be . Then we have the scoring equation: , where R is the score of each match, and S is the score of each mismatch and is the score of each gap. The parameters R
0, S
0 and T
0 denote the total amount of matches, the total amount of mismatches and the total amount of gaps, respectively.
Simple alignment algorithm and graphical representation of scores of DNA sequences based on scoring matrix
There are two primary DNA sequences: G
1 () and G
2 (), where M and N denote the length of G
1 and G
2, respectively. Fig. 1
shows the building steps of all possible simple alignments without spaces within G
1 and G
2. The gap formed by spaces lies in the hanging ends of the overlap. Step 1 of Fig. 1 gives the initial alignment that the last base of G
1 overlaps the first base of G
2. Then every time G
1 moves one base position along the direction of G
2. Step of Fig. 1 gives the last alignment that the first base of G
1 overlaps the last base of G
2. Every alignment has a score according to the scoring equation . Then we can obtain a serial of dots , where x denotes index of steps and y denotes the score value corresponding to x. When one connects adjacent dots with lines, then one obtains a zigzag like curve of definite geometrical shape. In Fig. 2
we illustrate the graphical score representation of simple alignments of sequences G
1 and G
2 (G
1, GGCCTCTGCCTAATCACACAGATCTAACAGGATTATTTC; G
2, GGCCTCT GCCTTATTACACAAATCTTAACAGGACTATTTC). The scoring equation is , where identity score R is 9, substitution score S is 1, gap opening penalty a is 15, and gap extension penalty b is 1.
Fig. 1
The building steps of all possible alignments without spaces within G1 and G2.
Fig. 2
The figure illustrates the graphical score representation of simple alignments of sequences G1 and G2. The scoring equation is , where identity score R is 9, substitution score S is 1, gap opening penalty a is 15, and gap extension penalty b is 1.
The building steps of all possible alignments without spaces within G1 and G2.The figure illustrates the graphical score representation of simple alignments of sequences G1 and G2. The scoring equation is , where identity score R is 9, substitution score S is 1, gap opening penalty a is 15, and gap extension penalty b is 1.The values about the parameters in the above scoring equation are chosen according to a choice of ‘EMBOSS Pairwise Alignment Algorithms-needle’ (http://www.ebi.ac.uk/Tools/emboss/align/index.html). It is well known that changing the values of the parameters may change the number and length of gaps in an alignment. However, there is no analytical formula that determines the ‘best’ gap values to use, so that one may wish to experiment with values in order to explore more of the alignment ‘space’ (Tatiana and Thomas, 1999). As for needle, one can experiment with different combinations of parameters. Here, we choose some typical values of parameters, such as R=9, S=1, a=15 and b=1. Of course, one can choose different values by his experiment. In this paper, we choose these values anywhere in order to maintain consistency in the context.Fig. 2 clearly shows the graphical ‘signatures’ of simple alignments of sequences G
1 and G
2. Obviously, graphical ‘signatures’ enable much easier visual inspection of simple alignments of DNA sequences than their representation by strings over the DNA alphabet {A, T, G, C}. A close look at Fig. 2, one can easily find out that the score of the simple alignment reaches its peak score when the index is 39. In this paper, we call the simple alignment with peak score as a good simple alignment of sequences G
1 and G
2.The 2D scoring curve can intuitively show the global change of simple alignment scores based on scoring matrix. One can easily obtain the peak point by observing the curve. Of course, it is not necessary to draw the graphical scoring curve when one dose not want to observe the global change of simple alignment scores or need deal with thousands of pairs of DNA sequences. He can also determine the peak point by doing data comparison.
Improved simple alignment algorithm with less consuming time
The above simple alignment algorithm shown in Fig. 1 is a time-consuming algorithm. Its time complexity is if the two aligned sequences have equal length N. Here, we provide an improved alignment process to obtain a lower time complexity. A close look at Fig. 1 shows that the beginning steps and the ending steps of the sliding process are not necessary. These unnecessary steps consume some time. Now, we need to find out the unnecessary steps to save time.Fig. 3
shows the improved process. Step (1) of Fig. 3 gives the initial alignment that the first base of G
1 overlaps the first base of G
2. Then the sliding process is done along the left and the right, respectively. Firstly, we consider the sliding process is done along the right, as shown in step (2). Every alignment has a score S according to the scoring equation . Let score S be the score of the ith step. Let be the top score within all the scores (i.e., , ). As for every step, we still define another score . The score is obtained by another scoring equation . In the new scoring equation we assume all overlapped bases are matched each other. The sliding process is stopped when . Obviously, the remaining steps along the right are unnecessary because all scores of the remaining steps have no chance to obtain a higher score than .
Fig. 3
Improved simple alignment process with less consuming time (The length of G1 is M. The length of G2 is N. And let ).
Improved simple alignment process with less consuming time (The length of G1 is M. The length of G2 is N. And let ).Then we consider the sliding process is done along the left. The top score within all the scores is (). The score is obtained by the scoring equation . The sliding process is stopped when . The remaining steps along the left are unnecessary.By the above sliding process the improved simple alignment algorithm can save some time. Now we discuss the time complexity. Let N and M be the length of sequences G
1 and G
2, respectively. Let k be the total amount of the sliding steps when the sliding process is stopped. Then we can draw a conclusion that the time complexity is if . Obviously, the more two sequences are similar, the smaller the parameter k is. Especially, the time complexity is if .
Methods
Extension algorithm for the largest common substring
By the simple alignment algorithm a good simple alignment R of G
1 and G
2 is determined when the score of the simple alignment reaches its peak. Let C be the largest common substrings of R, where . If , there is no largest common substring. When , there are m largest common substrings, where and k denotes the number of matches within a largest common substring. Here, we can see that the largest common substring C () of R devotes its maximal score to the initial alignment reaching peak score. The alignment shows the approximate overall alignment feature of G
1 and G
2. However, some C of the largest common substrings of R may be a part of a larger common substring than C because of the limit of the initial alignment reaching peak score. In order to protect a larger common substring than C from being split by C and find out the substring, an extension algorithm for the largest common substring is suggested as the following.Let K be the length of the largest common substring of G
1 and G
2. Its value is generated when the simple alignment algorithm applies to the sequences G
1 and G
2.When , none of the largest common substrings C of R can be extended into larger common substring. The largest common substrings of R are, .When , there exists at least a larger common substring than C. There are several sub-steps to find out the larger common substrings as the following:Let be the number of mismatches from the right end of to the left end of C. When , denotes the number of mismatches from the left end of G
1 and G
2 to the left end of C
1. Similarly, let be the number of mismatches from the right end of C to the left end of . When , denotes the number of mismatches from the right end of C
1 to the right end of G
1 and G
2.When , the K mismatches are extracted from the left of C. Otherwise, the mismatches are extracted from the left of C. Similarly, when , the K mismatches are extracted from the right of C. Otherwise, the mismatches are extracted from the right of C. Then the sequences extracted from the left of C, C and from the right of C are connected into two new sub-sequences and .Now, we apply the simple alignment algorithm to and . If there exist a new larger common substring within and than C, we will face a choice: the new larger common substring or C. If there is an increment of sore when the new larger common substring comes into being, we will replace the original C with the new substring also called as C.As for every C of R, the original C is replaced by the new largest common substring if the new substring exists.Now, we give a practical example to illustrate the above steps for a better understanding. There are two random sequences: G
1 (GCCTAGTTCCCCCA) and G
2 (GCCTCGCATCCCCCA). By the simple alignment algorithm a good simple alignment R of G
1 and G
2 is determined, where the alignment R is The largest common substring C of R is “GCCT” (C
1) and “CCCC” (C
2). However, the largest common substring of G
1 and G
2 is “TCCCCCA”. Its length K is 7. Obviously, there exists at least a larger common substring than “GCCT” or “CCCC”. As for “GCCT”, the two new sub-sequences are “GCCTAGTTC” () and “GCCTCGCAT” (), respectively. Because there is no increment of sore, the original common substring “GCCT” (C
1) is unchanged. As for “CCCC”, the two new sub-sequences are “AGTTCCCCCA” () and “CGCATCCCCCA” (), respectively. Because there is an increment of sore, the original common substring “CCCC” (C
2) is replaced with the new largest common substring “TCCCCCA”.In sum, the extension algorithm for largest common substring protects a larger common substring than C from being split by C.Now, we let U denote the substrings spaced by C, where . Then the strings G
1 and G
2 are aligned into two types of substrings: (i) The largest common substrings C with continuous matches: (of course, a single match is also permitted) and (ii) The substrings U with continuous mismatches: and both without spaces. Obviously, a good simple alignment R of G
1 and G
2 is alternately organized by C and U (e.g. , where . Here, the superscript of C or U denotes the level of sub-alignment. The superscript “1” denotes the first level. And the subscript denotes the index of substrings in the current sub-alignment. The alignment R is the result of the first level sub-alignment).In order to expressly explain the above parameters, we give a practical example. There are two random sequences: G
1 (GCCTAGTTCCCCCA) and G
2 (GCCTCGCATCCCCCA). The and denote the base of G
1 and G
2, respectively. They become a match when . By the simple alignment algorithm a good simple alignment R of G
1 and G
2 is determined, where the alignment R is And by the extension algorithm the largest common substring C of R is “GCCT” and “TCCCCCA”. Then G
1 and G
2 are organized by C into three substrings: is “GCCT”, is “AGT” and “CGCA”, is “TCCCCCA”.
Graphical simple alignment tree (GSA tree)
Given sequences G
1 and G
2, a graphical simple alignment (GSA) tree is built up by the aforesaid simple alignment algorithm and extension algorithm for largest common substring. In the following, we will show how to use the algorithms to construct a GSA tree of G
1 and G
2.Let R be a good simple alignment of G
1 and G
2. Let and be the largest common substring of R and be the substring spaced by , respectively. We can see that the largest common substring of R devotes its maximal score to the global alignment. However, the of R may provide a increment of the score to the global alignment if given appropriate gaps within , though the scores of these are low in the first level sub-alignment.In order to explore the appropriate gaps in , we give the following several operation steps.Compute the scores of all simple alignments of by the simple alignment algorithm. A good simple alignment of is generated when its score reaches its peak.If there is a increment of the score due to appropriate gaps within , can be further divided into the second level sub-alignment. When and how to add the gaps in the sequences? Now, let be the largest common substrings of , where . Then there are two sub-steps as the following:If , there are m largest common substrings. Then can be further divided into the second level sub-alignment by . Let be the substrings spaced by , where . Then every substring () of becomes a leaf node in GSA tree. There are no gaps within them. As for every substring () of , the operation flow goes back to the step (1). And the level of sub-alignment enters the next.If , there is no the largest common substrings in . Then can not be further broken down. The good simple alignment of becomes a leaf node in GSA tree. The two sequences of might be entirely overlapping, or partially overlapping, or one sequence might be aligned entirely internally to the other. When the two sequences of are entirely overlapping, there are no gaps within . Otherwise, the hanging ends of the overlap come into being the gaps of . The relative position of these gaps is fixed, and becomes the gaps within the final global alignment.We repeatedly do the above steps until all U in the last level sub-alignment cannot be further decomposed by the simple alignment algorithm and the extension algorithm. Then we can obtain a graphical simple alignment tree for strings G
1 and G
2, consisting of a series of substrings. The Fig. 4
illustrates an example of GSA tree.
Fig. 4
An example of graphical simple alignment tree (Strings G1 and G2 includes 6 substrings in the first level sub-alignment: . The substring includes 3 substrings in the second level sub-alignment: . The substring includes 2 substrings in the second level sub-alignment: . The substring includes 3 substrings in the second level sub-alignment: . The substring includes 2 substrings in the third level sub-alignment: . The substring includes 2 substrings in the third level sub-alignment: ).
An example of graphical simple alignment tree (Strings G1 and G2 includes 6 substrings in the first level sub-alignment: . The substring includes 3 substrings in the second level sub-alignment: . The substring includes 2 substrings in the second level sub-alignment: . The substring includes 3 substrings in the second level sub-alignment: . The substring includes 2 substrings in the third level sub-alignment: . The substring includes 2 substrings in the third level sub-alignment: ).
The global alignment problem based on GSA tree
We can obtain a global alignment of strings G
1 and G
2 when their GSA tree is generated. In the following, we will show how to use GSA tree to generate a global alignment. Observing the GSA tree (e.g. Fig. 4), we can see that there are two types of nodes: inner nodes and leaf nodes. The inner nodes consist of substrings U that can be aligned to more substrings. The leaf nodes include substrings C and U, where U cannot be further aligned by the GSA tree method. The global alignment of strings G
1 and G
2 should be composed of all leaf nodes. In order to obtain the global alignment, the GSA tree is traversed by post-order traversal of tree. Then all inner nodes are deleted from the result of post-order traversal. We will achieve the global alignment of strings G
1 and G
2. For example, the result of post-order traversal of Fig. 4 is The global alignment is by removing all of inner nodes.
Application and discussion
The application of GSA tree
In this section, we give three examples to see the validity of GSA tree for the global alignment of 2 DNA sequences. As for every example, we compare the results by GSA tree with the results by the important heuristic approach “EMBOSS Pairwise Alignment Algorithms—needle” (http://www.ebi.ac.uk/Tools/emboss/align/index.html).We firstly apply the proposed method to the discussed 2 sequences: G
1 (GGCCTCTGCCTAATCACACAGATCTAACAGGATTATTTC) and G
2 (GGCC TCTGCCTTATTACACAAATCTTAACAGGACTATTTC). The scoring equation is , where identity score R is 9, substitution score S is 1, gap opening penalty a is 15, and gap extension penalty b is 1. Of course, one may also choose other values as the parameters of the scoring equation according to practical requirement. Here, we choose the same values in order to keep the context consistency. In Fig. 2, we have described the graphical score representation of simple alignments of sequences G
1 and G
2. One can easily find out that the score of the simple alignment reaches its peak when index of step in Fig. 1 is 39. Then we can obtain the first level sub-alignment results according to the simple alignment algorithm and the extension algorithm. By similar rules, we can get all possible sub-alignment results. Fig. 5
shows the GSA tree to be used to align the sequences G
1 and G
2. Table 2
lists all substrings of Fig. 5 and the max score of each pair of substrings. Then we obtain the global alignment of the sequences: . According to the results of Table 2, we illustrate the global alignment results as the following:The notation “‐” denotes the gap within sequence. Then we apply “needle” program to G
1 and G
2. The values about gap opening penalty a and gap extension penalty b are the same as the values proposed the GSA tree algorithm. The default values about identity score R and substitution score S are chosen because of no other choice. The alignment results are shown as the following:Obviously, we get consistent results by two different methods. Moreover, similar results can be found out in Table 1 of Randić et al. (2006).
Fig. 5
The graphical simple alignment tree to be used to align the sequences G1 and G2 (G1, GGCCTCTGCCTAATCACACAGATCTAACAGGATTATTTC; G2, GGCCTCTGCCTTATTACACAAATCTTAACAGGACTATTTC).
Table 2
All substrings in Fig. 5 and the max score of each substring.
Substrings
The max score
Level 1
G1, G2
C11(G1)=C11(G2): GGCCTCTGCCT_
99
U21(G1): AATCACACAGATCTAACAGGATTATTTC
127
U21(G2): TATTACACAAATCTTAACAGGACTATTTC
Level 2
U21(G1)
U12(G1): AATCACACAGATC
72
U21(G2)
U12(G2): TATTACACAAATCT
C22(G1)=C22(G2): TAACAGGA_
72
U32(G1): TTATTTC
U32(G2): CTATTTC
53
Level 3
U12(G1)
U13(G1): AATC
U13(G2): TATT
16
U12(G2)
C23(G1)=C23(G2): ACACA_
45
U33(G1): GATC
U33(G2): AATCT
11
U32(G1)U32(G2)
U43(G1): T
U43(G2): C
−1
C53(G1)=C53(G2): TATTTC_
54
Level4
U13(G1)U13(G2)
U14(G1): A
U14(G2): T
−1
C24(G1)=C24(G2): AT_
18
U34(G1): C
U34(G2): T
−1
U33(G1)U33(G2)
U44(G1): G
U44(G2): A
−1
C54(G1)=C54(G2): ATC_
27
U64(G1): –
U64(G2): T
−15
The graphical simple alignment tree to be used to align the sequences G1 and G2 (G1, GGCCTCTGCCTAATCACACAGATCTAACAGGATTATTTC; G2, GGCCTCTGCCTTATTACACAAATCTTAACAGGACTATTTC).All substrings in Fig. 5 and the max score of each substring.The aforesaid example about the validity examination gives rise to a question: Is it possible to use the GSA tree in order to facilitate solving less similar or longer DNA sequence alignment problem? The answer is positive. Now, we randomly give two less similar sequences: G
1 and G
2 (G
1, GCCCTCGCGGGCAACATTTAATTCACAGCCAGTTCTCTCAACAGTGATTATC; G
2, CTGGGTCTTCAGGTCCTTTATGCTTAACACAAATCTATCGTTA ACAGGACTATTCT). Like the above example, the scoring equation is , where identity score R is 9, substitution score S is 1, gap existence penalty a is 15, and gap extension penalty b is 1. By GSA tree algorithm, we can get the GSA tree and all possible sub-alignment results. Fig. 6
shows the GSA tree to be used to align the sequences G
1 and G
2. Table 3
lists all substrings in Fig. 6 and the max score of each pair of substrings. Then we obtain the global alignment of the sequences:According to Table 3, we illustrate the global alignment results as the following:According to the results, we can obtain some conclusions of the global alignment between the sequences G
1 and G
2: Length: 61; Identities: 31/61 (50.8%); Gaps: 14/61 (22.9%); Score: 169. Then we apply “needle” program to G
1 and G
2. The alignment results are shown as the following:Then we can obtain some conclusions of the global alignment between the sequences G
1 and G
2: Length: 64; Identities: 32/64 (50%); Gaps: 20/64 (31.2%); Score: 29.
Fig. 6
The graphical simple alignment tree to be used to align the sequences G1 and G2 (G1, GCCCTCGCGGGCAACATTTAATTCACAGCCAGTTCTCTCAACAG TGATTATC; G2, CTGGGTCTTCAGGTCCTTTATGCTTAACACAAATCTATC GTTAACAGGACTATTCT).
Table 3
All substrings in Fig. 6 and the max score of each substring.
Substrings
The max score
Level 1
G1, G2
U11(G1): GCCCTCGCGGGCAACATTTAATTC
40
U11(G2): CTGGGTCTTCAGGTCCTTTATGCTTA
C21(G1)=C21(G2): ACA_
27
U31(G1): GCCAGTTCTCTCAACAGTGAT
49
U31(G2): CAAATCTATCGTTAACAGGAC
C41(G1)=C41(G2): TAT_
27
U51(G1): C–;
U51(G2): TCT
−17
Level 2
U11(G1)
U12(G1): GCCCTCGCG
U12(G2): CTGGGTCTTCA
−1
U11(G2)
C22(G1)=C22(G2): GG_
18
U32(G1): CAACATTTAA
U32(G2): TCCTTTATGC
10
C42(G1)=C42(G2): TT_
18
U52(G1): C
U52(G2): A
−1
U31(G1)
U62(G1): GCCAGTTC
U62(G2):CAAATCTA
12
U31(G2)
C72(G1)=C72(G2): TC_
18
U82(G1): TCAACAGT
U82(G2): GTTAACAG
23
C92(G1)=C92(G2): GA_
18
U102(G1): T
U102(G2): C
−1
Level 3
U12(G1)
U13(G1): GCCC
U13(G2): CTGGGTCT
−2
U12(G2)
C23(G1)=C23(G2): TC_
18
U33(G1): GCG
U33(G2): A–
−17
U32(G1)
U43(G1): CAACA
U43(G2): TCC
−9
U32(G2)
C53(G1)=C53(G2): TTTA_
36
U63(G1): A–
U63(G2):TGC
−17
U62(G1)
U73(G1):GCC
U73(G1): CAA
−3
U62(G2)
C83(G1)=C83(G2):A_
9
U93(G1):GT
U93(G1):TC
−2
C103(G1)=C103(G2): T_
9
U113(G1): C
U113(G2): A
−1
U82(G1)
U123(G1): TC
U123(G2): GTT
−7
U82(G2)
C133(G1)=C133(G2): AACAG_
45
U143(G1): T
U143(G2): –
−15
Level 4
U13(G1)
U14(G1): –
U14(G2): CTGG
−18
U13(G2)
C24(G1)=C24(G2): G_
9
U34(G1): C
U34(G2): T
−1
C44(G1)=C44(G2): C_
9
U54(G1): C
U54(G2): T
−1
U43(G1)
U64(G1): CAA
U64(G2): T–
−17
U43(G2)
C74(G1)=C74(G2): C_
9
U84(G1): A
U84(G2): C
−1
U123(G1)U123(G2)
U94(G1): –
U94(G2): G
−15
C104(G1)=C104(G2): T_
9
U114(G1): C
U114(G2): T
−1
The graphical simple alignment tree to be used to align the sequences G1 and G2 (G1, GCCCTCGCGGGCAACATTTAATTCACAGCCAGTTCTCTCAACAG TGATTATC; G2, CTGGGTCTTCAGGTCCTTTATGCTTAACACAAATCTATC GTTAACAGGACTATTCT).All substrings in Fig. 6 and the max score of each substring.Obviously, in this example we get different results by two different methods. A close look at the alignments reveals that there are two different areas (the bold areas illustrate the almost identical ones): the bases from 1 to 20 by GSA tree vs. the bases from 1 to 21 by needle, and the bases from 34 to 46 by GSA tree vs. the bases from 35 to 49 by needle. As for the first area, GSA tree obtains 7 matches by 3 gap openings and 5 gap extensions while needle gets 6 matches by 2 gap openings and 8 gap extensions. In the second area GSA tree gets 5 matches by 1 gap openings and 0 gap extensions while needle obtains 7 matches by 2 gap openings and 3 gap extensions. From the global view, GSA tree receives 31 matches by 7 gap openings and 7 gap extensions while needle gets 32 matches by 7 gap openings and 13 gap extensions. However, it is difficult for us to determine which one is the better when it comes to the best global alignment of the sequences considered. In fact, a close look at the alignments discovers that both of the methods can find out those very similar local areas, such as the bold areas of the alignments.Finally, we give the third example with two long and very similar sequences to see the validity of GSA tree for the global alignment of 2 DNA sequences. Here, we apply the proposed method to the complete coding sequence part of beta globin gene of Human (ACCESSION U01317) and Opossum (ACCESSION J03643) as shown in Table 4
. The scoring equation is , where identity score R is 9, substitution score S is 1, gap existence penalty a is 15, and gap extension penalty b is 1. For simplification, we do not show the detail of GSA tree and the results of all substrings. In Fig. 7
, we illustrate the final result. The notation “‐” denotes the gap within sequence. According to the figure, we can obtain some conclusions of the global alignment between string G
1 (the complete coding sequence of ACCESSION U01317) and string G
2 (the complete coding sequence of ACCESSION J03643): Length: 448; Identities: 328/448 (73.2%); Gaps: 8/448 (1.8%); Score: 2772. Then we apply the web tool for “needle” program to G
1 and G
2. The conclusions about the global alignment is the following: Length: 448; Identities: 328/448 (73.2%); Gaps: 8/448 (1.8%); Score: 1128.0. From the global view, the two methods almost get the same statistics results except for their scores. There is only one different area: the bases from 33 to 34 by GSA tree vs. the bases from 33 to 34 by needle. In fact, the two different areas are equivalent to each other. This example shows that it is possible to use the GSA tree in order to facilitate solving long and very similar DNA sequence alignment problem.
Table 4
The complete coding sequence part of beta globin gene of Human (ACCESSION U01317) and Opossum (ACCESSION J03643).
The global alignment between string a and string b. Identities: 328/448 (73.2%); Gaps: 8/448 (1.8%); Score: 2772.
The complete coding sequence part of beta globin gene of Human (ACCESSION U01317) and Opossum (ACCESSION J03643).The global alignment between string a and string b. Identities: 328/448 (73.2%); Gaps: 8/448 (1.8%); Score: 2772.
Discussion
The GSA tree algorithm is a gradual algorithm that step by step explores suitable gaps to improve the score of the alignment between 2 DNA sequences. The scheme firstly uses the simple alignment algorithm and the extension algorithm to construct a GSA tree. The substrings denoted by leaf node of GSA tree are looked on as a part of the final global alignment. Obviously, GSA tree method is a heuristic algorithm. There is no analytical formula to determine the ‘best’ gap values. Though GSA tree algorithm is a heuristic method, and could not always get the optimal alignment, the validity and practicality of the results can be ensured.Firstly, we discuss the influence about the initial simple alignment of GSA tree algorithm. In Section 3.2, we describe in detail the construction steps of GSA tree by the initial simple alignment. As for the overall result, the initial simple alignment is very important. It determines the extent to which the consolidated results close to optimal results. Next, we discuss the initial simple alignment and its optimized features.By the simple alignment algorithm a good simple alignment R of sequences G
1 and G
2 is determined when the score of the simple alignment reaches its peak. Then the largest common substrings C of R is determined, where . When , there are m the largest common substrings. Because of the limit of the initial alignment reaching peak score, some C of the largest common substrings of R may be a part of a larger common substring than C. An extension algorithm for the largest common substring is suggested to protect a larger common substring than C from being split by C if the larger common substring exists. Then we replace the original C with the new larger common substring also called as C. Once the largest common substrings C is determined, it is looked on as a part of the overall result. Then in an optimal alignment of G
1 and G
2 (This hypothetical alignment may be obtained by an absolutely optimized algorithm), the two sequences of C by GSA tree might be entirely overlapping in the optimal alignment, or partially overlapping, or one sequence might be entirely isolated from the other. Obviously, for the latter two cases, a great deal of gaps and mismatches may be introduced. So the substring C is most likely to appear in the optimal alignment.The above explanation gives some approximate analysis instead of strict mathematical reasoning, but the conclusion is reasonable from the biological point of view. As the sequences under comparison are protein coding, gaps with lengths other than multiples of three are highly unlikely, whereas the GSA tree algorithm can avoid many single or two-base gaps by the approximate approach. Unlike “EMBOSS Pairwise Alignment Algorithms—needle (global)” or “—water (local)”, the proposed GSA tree algorithm is to find a balance between the global and local. The algorithm is for aligning two sequences over their entire length. The match areas with local superiority are produced in the global context. In bioinformatics, it may be reasonable to assume that in the global context the local areas with closely related sequences should be reflected because of the stability of these areas. As for two possibly related sequences, giving priority to these local areas may be a better choice. The accurate optimization result could not show these local features because of the global optimization. The proposed GSA tree algorithm is for aligning two sequences over their entire length. This works best with closely related sequences. If one uses GSA tree to align very distantly related sequences, it will produce a result but much of the alignment may have little or no biological significance. The three examples of 4.1 give practical proofs for the validity and practicality of the GSA tree algorithm. The Example 1 gives two very similar but very short sequences. The GSA tree achieves the same results as needle. The Example 3 gives two similar and long sequences. There is only one different area:This example shows that giving priority to these local areas may be a better choice. The Example 2 gives two random sequences. There are two different areas in their results. A close look at the alignments discovers that both of the methods can find out those very similar local areas. The main differences between them lie in the very distantly related areas.Finally, the following analysis shows why we consider the substring C in the only initial alignment reaching peak score. Because of the limit of the initial alignment reaching peak score, some C of the largest common substrings of R may be a part of a larger common substring than C. Then an extension algorithm for the largest common substring is suggested to protect a larger common substring than C from being split by C if the larger common substring exists. In the extension algorithm we introduce three parameters: K, and , where K be the length of the largest common substring of and . When , none of the largest common substrings C of R can be extended into larger common substring. The largest common substrings of R are , respectively. There is no larger common substring than C split by C. Otherwise, when , there exists at least a larger common substring than C. The larger common substring and C might be partially overlapping, or one might be entirely isolated from the other. When they are entirely isolated from each other, the larger common substring can be completely preserved and found out in the next sub-alignment. As for being partially overlapping, we face a choice: the larger common substring or C. In order to give a better choice, we use parameters K, and to construct two new sub-sequences and . We will replace the original C with the new substring also called as C if there is an increment of sore when the new larger common substring comes into being. So we can protect a larger common substring than from being split by and eliminate the limitations caused by the only initial alignment reaching peak score.
Authors: I W Althaus; A J Gonzales; J J Chou; D L Romero; M R Deibel; K C Chou; F J Kezdy; L Resnick; M E Busso; A G So Journal: J Biol Chem Date: 1993-07-15 Impact factor: 5.157