Literature DB >> 25983537

Empirical Transition Probability Indexing Sparse-Coding Belief Propagation (ETPI-SCoBeP) Genome Sequence Alignment.

Aminmohammad Roozgard¹, Nafise Barzigar¹, Shuang Wang², Xiaoqian Jiang², Samuel Cheng¹.

Abstract

The advance in human genome sequencing technology has significantly reduced the cost of data generation and overwhelms the computing capability of sequence analysis. Efficiency, efficacy, and scalability remain challenging in sequence alignment, which is an important and foundational operation for genome data analysis. In this paper, we propose a two-stage approach to tackle this problem. In the preprocessing step, we match blocks of reference and target sequences based on the similarities between their empirical transition probability distributions using belief propagation. We then conduct a refined match using our recently published sparse-coding belief propagation (SCoBeP) technique. Our experimental results demonstrated robustness in nucleotide sequence alignment, and our results are competitive to those of the SOAP aligner and the BWA algorithm. Moreover, compared to SCoBeP alignment, the proposed technique can handle sequences of much longer lengths.

Entities: CellLine Chemical Disease Gene Species

Keywords: belief propagation; empirical transition probability; genome sequence alignment; indexing; sparse-coding

Year: 2015 PMID： 25983537 PMCID： PMC4426956 DOI： 10.4137/CIN.S13887

Source DB: PubMed Journal: Cancer Inform ISSN： 1176-9351

I. Introduction

In bioinformatics, sequence alignment is an important way to identify similar regions that might be associated with similar functional and structural relationship between sequences. With the quick growth of genomic data, it is important to develop effective sequence alignment techniques that are scalable. The past decade has witnessed the development of many sequence alignment technologies. Cancers are caused by the collection of genomic sequence changes.1 Therefore, alignment and analyses of cancer genome sequences provide basics to understand cancer biology, diagnosis, and therapy. In general, pairwise sequence alignment methods can be classified into local and global approaches. The global alignment attempts to find the best match between two strings with similar lengths through global optimization. In contrast, the local alignment is usually used to identify regions of similarity between a short query and a longer sequence. Global alignments2–5 are less prone to demonstrating false homology as each letter of one sequence is constrained to being aligned to only one letter of the other. Local alignments,6–9 on the other hand, can cope with rearrangements between non-syntenic, orthologous sequences by identifying similar regions in sequences; this, however, comes at the expense of a higher false positive rate because of the inability of local aligners to take into account overall conservation maps.10 A lot of efforts have been made to improve the efficiency and efficacy of sequence alignments. The ClustalW program proposed by Thompson and Larkin11,12 uses a multi-stage mechanism to weigh and to align subsequences based on sequence divergences. In addition, sequence annealing technique incrementally builds sequence alignment one at a time by checking whether a single match is consistent with a partial multiple alignment.13 Darling et al proposed a hidden Markov model that uses a sum-of-pairs breakpoint score to facilitate the detection of rearrangement breakpoints, when genomes have unequal gene content.14 Mummer is a highly efficient suffix tree-based matching tool for whole genome alignment as well as incomplete genomes.15 Researchers also proposed heuristics to accelerate sequence alignment. For example, the bounded sparse dynamic programming (BSDP) is used to support rapid approximation of exhaustive alignment in Slater and Birney.16 Another heuristic-driven approach, namely FastTree, is a tree-based method that stores profiles of internal nodes in a tree, such that candidate joins can be quickly identified. FastTree is also scalable for handling alignments over 10,000 sequences.17,18 Maximum-likelihood-based approaches like PhyML and RAxML-VI-HPC have been developed as well. PhyML19 used a hill-climbing algorithm that adjusts tree topology and branch length at each tree modification iteration. RAxML-VI-HPC,20 which stands for randomized accelerated maximum likelihood for high-performance computing, takes advantages of a parallel program to support large-scale genome alignment. In this paper, we propose a novel alignment method that uses sparse coding21 and empirical transition probability to tackle the scalability challenge. Thanks to the sparse representation, our mechanism can handle long sequences with reduced memory footprint. We also leverage belief propagation (BP) to combine local and neighboring information of candidate nucleotides into consideration, and generate matching scores to determine the best match. The rest of this paper is structured as follows. Section II introduces our proposed method. Section III presents our results, including the comparison against SOAP aligner22 and BWA.23 Finally, we draw our conclusions in Section IV.

II. Proposed Method

In this section, we present our genome indexing and alignment framework in detail, where the proposed method includes three steps: indexing, index matching, and sequence matching. In this paper, we refer to “reference sequence” as the baseline sequence and try to align a “read sequence” against the baseline sequence.

Indexing

The current genome indexing methods generate huge indices before performing the actual alignment to decrease the alignment time.24,25 The indexing process can be very time-consuming. In contrast, our proposed indexing technique provides a faster and light-weight alternative for index generation, which is similar to the big data retrieval systems that were proposed.26–28 These indices can reduce the search space and provide an estimation of the read sequence locations in the reference sequence. The proposed genome indexing technique models a nucleotide sequence as a graph by counting the transitions between each pair of nucleotides. To be more specific, as shown in Figure 1, we consider a graph with four states according to the different types of nucleotides and 16 vertices according to all possible transitions between nucleotides. We read the first nucleotide of the sequence and treat it as the initial state. Then, we move from one state to the other state by scanning the next nucleotide repeatedly till the end of the sequence. Afterward, we calculate the number of nucleotide transitions where we count how many times we pass one vertex in the graph and store them in a 4 × 4 matrix. Finally, we normalize the resulting matrix as follows: where k is the number that has the S-type nucleotide immediately before the W-type nucleotide.

Figure 1

The transition diagram between nucleotides. k is the number of appearance of the W-type nucleotide immediately after the S-type nucleotide, where s, w ∊ {a, c, g, t}.

If the length of a sequence is larger than a given threshold, ie, h, we divide it into subsequences with maximum length of h, where each subsequence will have o nucleotides that overlap with their neighbors. We set so that each pair of nucleotides can be counted at least twice. For each subsequence i, we count the transition of the nucleotides from the start of the subsequence till its end to reveal the number of different nucleotides that reside beside each other. In Figure 2, an input sequence with h = 250 is used to demonstrate the proposed indexing process, where I is the calculated index for the input sequence based on the transition graph shown on the left hand side. Finally, we normalize the transition matrix, which will be used to find the approximate location of each subsequence in the next step.

Figure 2

An example of the indexing procedure for a small sample subsequence.

Index matching

The index matching step is designed to find similar indices based on global information of the sequence. We define a symmetric distance function between two index matrices, I and J, as follows: Dmse(I, J) = ||I − J||, where ||·|| is the Frobenius norm of the matrix. After generating the indices of the reference sequence and the read sequence, the D distances to all reference sequence indices are calculated, where the top t most similar indices in terms of D are chosen as candidate indices. To find the best matched index, we resort to BP on a factor graph. In this paper, we provide a concise review about the BP algorithm on factor graph for the proposed algorithm. Interested readers can check our earlier publications in Refs. 29–31 for more details about the factor graph design and the BP algorithm. We apply BP to the factor graph of the test sequence with n candidate nucleotides as the prior knowledge. BP updates the probability of candidate nucleotides based on the probabilities of their neighbors. Then, the candidate index numbers are fed to a factor graph, and the corresponding D of each of candidates is employed to calculate the initial probability (prior probability) of each candidate. Then, message passing (ie, forward and backward) algorithm is applied to calculate the best match indices. The corresponding subsequences of these indices are used in the next step.

Sequence matching

The sequence matching step is based on sparse coding and BP algorithm. In this step, we use the subsequences that were selected in the previous step to generate an over-complete dictionary. Then, for each nucleotide in the read sequence, we pick n candidate nucleotides using sparse coding. By applying BP to a factor graph, we can obtain the best match for each nucleotide in the read sequence. A detailed description about the sequence matching can be found in our recent publications.31,32 A summary of the main procedure for our proposed alignment method is shown in Algorithm 1.

Algorithm 1 Proposed nucleotide sequence alignment algorithm for estimating the location of the input sequence

Inputs: a reference sequence X ∊ R^M, a test sequence Y ∊ R^N, number of the candidate state matrix k, and number of the candidate points n.

Initialize: a 4 × 4 state matrix I storing the numbers of nucleotide states (2) and nucleotide overlap v.

Fill the reference state matrix I: For each subsequence x_i, G ∊ X with v nucleotide overlap in each direction perform:

• I_i = MakeIndex(x_i)

Fill the test state matrix J: For each subsequence, y_j ∊ Y with v nucleotide overlap in each direction perform:

• J_j = MakeIndex(y_j)

• [c_j, ρ_j]= FindCandidates(J_j, I, k)

Refine the candidate state matrix:

• ρ^=BP(c,ρ)

Find the corresponding nucleotide in the reference sequence X: For each subsequence, y_j ∊ Y with v nucleotide overlap in each direction perform:

• z_j = FindBestSubsequence(X, y_j, θ, n) (see Refs. 31, 32 for more details).

Output: the estimated version of aligned sequence Z

I = MakeIndex(x) fills the state matrix I using the relationship of nucleotides in the subsequence x. The subsequence x is scanned through all its nucleotides, and the corresponding counts will be stored into the state matrix I. For example, k in I in (2) shows how many times the nucleotide C will be identified, which is next to the nucleotide G in the subsequence xi. Note that each subsequence x has a separate state matrix I, where i is the subsequence index. [c, ρ] = FindCandidates(J, I, k) identifies k candidate state matrices that are highly similar to the test state matrix J in I and stores their indices in vector c and their probabilities in vector ρ. Note that the approach will compute the mean square error (MSE) of the test state matrix J with each possible I of the reference state matrices and select I that has the smallest MSEs. models the problem by a factor graph and applies BP33 to update probability ρ. The updated probability can be used to align the reference state matrix index onto the test state matrix index. In our case, we assign a variable node for each test state matrix index and connect each pair of neighboring state matrix indices with a factor node. Also, we introduce one extra factor node to take care of the prior knowledge obtained in the MSE step for each test state matrix index (for more details, see Ref. 32). z = FindBestSubsequence(X, y, θ, n) finds the corresponding location for a nucleotide y ∊ Y. In this step, the reference nucleotide sequence X and the test nucleotide sequence Y are converted into two integer sequences. Then, an over-complete dictionary is built with all subsequences in the X. We then apply sparse coding followed by using BP to identify the best matches (see Refs. 31 and 32 for more details). Note that we used non-overlapped subsequences to build the dictionary. This change decreases the memory usage and the accuracy of the proposed algorithm in comparison with 1D-SCoBeP31 but it increases the speed of our alignment algorithm. Algorithm 1 Proposed nucleotide sequence alignment algorithm for estimating the location of the input sequence

III. Experimental Results

We designed our experiments based on the work in Darling et al.14 to evaluate the proposed method for aligning the nucleotide sequences and to compare it with SOAP aligner,22 BWA,23 and 1D-SCoBeP.31 We considered the problem of aligning a sequence of human nucleotides from the National Center for Biotechnology Information34 and Cancer Genomics Hub.35 To evaluate the performance of our approach, we conducted two sets of tests on the nucleotide sequences. In the first set, we selected 50 short subsequences of human genomes and then used SOAP aligner, BWA, 1D-SCoBeP, and the proposed method to find the location of selected subsequence nucleotide in the human chromosome. All the four algorithms successfully passed this test. We created 20 shuffled subsequences of the reference sequence as follows: for each read sequence R, we cut it into five pieces p1, p2, p3, p4, and p5. Then, we switched p2 with p4. Therefore, we converted a read sequence R = [p1, p2, p3, p4, p5] into a new read sequence . Figure 3 shows that the result of the 1D-SCoBeP and the proposed method show a better performance with a gap of 100–120 nucleotides away from the ground truth. Since we were using non-overlapped subsequences for the dictionary generation, the gap between the proposed method and the ground truth was larger than those reported in 1D-SCoBeP.31 In our experiments, the following parameters were used: the number of candidate points n = 3, the sparsity factor k = 3, and the dictionary column size a = 200.

Figure 3

The results of proposed method for non-collinear nucleotide sequence alignment are shown. (A) Comparison among alignment results of the ground truth, 1D-SCoBeP,31 and the proposed method. (B) Magnified black square in (A) shows the gap between the proposed method, ground truth, and 1D-SCoBeP31 on the jump point. The x-axis and y-axis are the index numbers of the original genome sequences and the shuffled genome sequences, respectively.

To evaluate the robustness of the proposed method, we generate indices for long human genome sequences (ie, 5 × 108 nucleotides), where h = 10000 and o = 5000. Moreover, we synthesized insertion, deletion, and mutation (ie, indel) in these sequences. For indel rate, we picked 105 number of subsequences with size of 104 nucleotides. Then, we randomly modified a certain number of nucleotides (based on the indel rate) and aligned them with the references. We counted the number of times the alignment location and real subsequence location (ie, ground truth) are matched, where the accuracy is defined as the count of the successfully aligned sequences over total number of the subsequences. Figure 4 shows the accuracy of alignment of the proposed method, BWA, and SOAP aligner in the presence of the different indel rates. The proposed method showed similar accuracies even when we increased the indel rate to 3%. Moreover, the proposed algorithm still showed more than 75% accuracy even after we modified 5% of the nucleotides in our selected subsequences. In contrast, the accuracy of the BWA and SOAP aligner decreased sharply as the indel rates increase.

Figure 4

Accuracy of BWA,23 SOAP aligner,22 and the proposed method in the presence of different Indel rates, where the testing genome sequences were obtained from Ref. 34.

We investigate the impact of small indel rate in the range from 0.5 to 1.5% in Figure 5. In this figure, we showed the accuracy of 1% indels in red for the dataset used in Figure 4 as reference. To verify our result, we repeat the experiments with different indel steps and different read locations, and present the results in green and blue, respectively. Note that each point in this figure was obtained from the evaluation over 105 read sequences. There are slight variations among the curves because of statistical deviation. The summary of the indel rate accuracy is shown in Table 1.

Figure 5

The percentage of successful alignments in the presence of 0.5–1.5% indels. The green line is the percentage of successful alignments, where the rate of the indels is changing between 0.7 and 1.3% at equal step of 0.05%. The blue line is the percentage of successful alignments, where the rate of the indels is changing between 0.5 and 1.5% at equal step of 0.1%, and the red line is the same as in Figure 4. Each point represents 105 random site selection with same indel rate. Note that the genome sequences used in this study were obtained from the National Center for Biotechnology Information and the Cancer Genomics Hub.34, 35

Table 1

Percentage of successful alignments.

% OF THE INDELS	ACCURACY OF RED LINE	ACCURACY OF BLUE LINE	ACCURACY OF GREEN LINE
0.00	80.33	–	–
0.50	–	81.19	–
0.60	–	79.38	–
0.70	–	80.90	80.64
0.75	–	–	80.63
0.80	–	79.82	80.29
0.85	–	–	78.90
0.90	–	78.09	80.63
0.95	–	–	78.74
1.00	79.85	79.71	78.04
1.05	–	–	80.43
1.10	–	80.49	79.70
1.15	–	–	78.22
1.20	–	79.81	79.78
1.25	–	–	79.16
1.30	–	80.54	78.94
1.40	–	80.70	–
1.50	–	79.52	–
2.00	79.09	–	–
3.00	78.90	–	–
4.00	76.33	–	–
5.00	75.90	–	–
6.00	72.09	–	–
7.00	72.86	–	–
8.00	69.79	–	–
9.00	67.87	–	–
10.00	66.42	–	–

The computational complexity of proposed is mainly determined by the following five steps: (1) indexing, (2) index matching, (3) extracting subsequence nucleotides as features and constructing the dictionary, (4) finding candidate nucleotides via sparse coding, and (5) applying BP. Assume that the sizes of the read and reference sequences are N and M nucleotides, respectively. The time required to create indexes is O(M + N), in order to scan whole read and reference sequences. The number of reference sequence indexes is and similarly, . Therefore, the time for the index matching is . After index matching step, the size of search space reduces from M to , where I is the number of selected indexes I, and h is the size of each index. The time required for feature extraction will be where a is the size of the vector of extracted features for each nucleotide. The dictionary construction step involves the normalization of each column, which requires amount of time. Thus, the total time complexity of the first step is . In the next step, the time complexity of subspace pursuit (SP) is ,,36 where f is the number of iterations for searching the sparse vector. As we have to repeat the process to find candidate points for all N feature vectors, the time complexity of finding candidate points by SP is . Then, the time complexity of BP in our factor graph is , where v is the number of iterations before converging, and n is the number of candidates in each variable node. Finally, the time complexity of proposed method will be .

IV. Conclusion

In this paper, we proposed a sparse coding and BP-based method for indexing and alignment genome sequences. The proposed method builds a transition matrix based on the neighboring nucleotides of an input sequence and then reduces the search space by selecting the top K most similar subsequences based on their distances. The proposed algorithm selects candidate nucleotides by using sparse coding with an overcomplete dictionary, which was constructed from the nucleotides of reference sequence in the indexing step. BP algorithm is then applied to select the best matches. Through experimental results, we showed that the proposed algorithms are comparable to SOAP aligner,22 BWA,23 and 1D-SCoBeP31 in terms of the alignment accuracy. In addition, the proposed method is robust to insertions, deletions, and mutations in the genome sequences when comparing with SOAP aligner and BWA. Finally, the proposed method is able to process much longer sequences than our previous 1D-SCoBeP approach.

27 in total

Empirical Transition Probability Indexing Sparse-Coding Belief Propagation (ETPI-SCoBeP) Genome Sequence Alignment.

I. Introduction

II. Proposed Method

Indexing

Index matching

Sequence matching

III. Experimental Results

IV. Conclusion

1. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA.

2. AVID: A global alignment program.

3. Fast and sensitive alignment of large genomic sequences.

4. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models.

5. Robust coding schemes for indexing and retrieval from large face databases.

6. SOAP2: an improved ultrafast tool for short read alignment.

7. FastTree 2--approximately maximum-likelihood trees for large alignments.

8. Medical image registration using sparse coding and belief propagation.

9. Nucleotide sequence alignment using sparse coding and belief propagation.

10. Human-mouse alignments with BLASTZ.