Evandro A Marucci1, Geraldo F D Zafalon1, Julio C Momente1, Leandro A Neves1, Carlo R Valêncio1, Alex R Pinto2, Adriano M Cansian1, Rogeria C G de Souza1, Yang Shiyou3, José M Machado1.
Abstract
With the advance of genomic research, the number of sequences involved in comparative methods has grown immensely. Among these are methods for similarity calculation, which are used by many bioinformatics applications. Due to the huge amount of data, combining low-complexity methods with parallel computing is becoming desirable. The k-mers counting method is very efficient and gives good biological results. In this work, a parallel algorithm for multiple sequence similarity calculation using the k-mers counting method is proposed. Tests show that the algorithm has very good scalability and a nearly linear speedup: with 14 nodes, a 12x speedup was obtained. This algorithm can be used in the parallelization of some multiple sequence alignment tools, such as MAFFT and MUSCLE.
Year: 2014 PMID: 25140318 PMCID: PMC4130029 DOI: 10.1155/2014/563016
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
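As a rough illustration of the k-mers counting method named in the abstract, the sketch below scores two sequences by the fraction of k-mers they share. The normalization by the number of k-mers in the shorter sequence, the default `k=3`, and the function names `kmer_counts` and `kmer_similarity` are assumptions made for illustration; the abstract does not state the exact measure the authors use.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_similarity(x, y, k=3):
    """Fraction of k-mers shared by x and y (assumed normalization).

    Shared k-mers are counted with multiplicity: a k-mer that occurs
    twice in x and once in y contributes one shared occurrence.
    """
    n_kmers = min(len(x), len(y)) - k + 1
    if n_kmers <= 0:
        # At least one sequence is shorter than k: no k-mers to compare.
        return 0.0
    # Counter intersection keeps the minimum count of each k-mer.
    shared = kmer_counts(x, k) & kmer_counts(y, k)
    return sum(shared.values()) / n_kmers
```

Because each sequence is scanned once and the counts are compared directly, the cost per pair grows linearly with sequence length, which is what makes this kind of measure attractive before a full alignment.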
Examples of compressed alphabets.
| Alphabet (number of classes) | Classes |
|---|---|
| SE-B (14) | A, C, D, EQ, FY, G, H, IV, KR, LM, N, P, ST, W |
| SE-B (10) | AST, C, DN, EQ, FY, G, HW, ILMV, KR, P |
| SE-V (10) | AST, C, DEN, FY, G, H, ILMV, KQR, P, W |
| Li-A (10) | AC, DE, FWY, G, HN, IV, KQR, LM, P, ST |
| Li-B (10) | AST, C, DEQ, FWY, G, HN, IV, KR, LM, P |
| Solis-D (10) | AM, C, DNS, EKQR, F, GP, HT, IV, LY, W |
| Solis-G (10) | AEFIKLMQRVW, C, D, G, H, N, P, S, T, Y |
| Murphy (10) | A, C, DENQ, FWY, G, H, ILMV, KR, P, ST |
| SE-B (8) | AST, C, DHN, EKQR, FWY, G, ILMV, P |
| SE-B (6) | AST, CP, DEHKNQR, FWY, G, ILMV |
| Dayhoff (6) | AGPST, C, DENQ, FWY, HKR, ILMV |
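A compressed alphabet from the table can be applied by mapping every residue to one representative of its class. The sketch below uses the SE-V (10) classes and assumes each class is represented by its alphabetically first member; `build_mapping` and `compress` are hypothetical helper names, and residues outside the table are passed through unchanged as a further assumption.

```python
# SE-V (10) classes, copied from the table of compressed alphabets.
SE_V_10 = ["AST", "C", "DEN", "FY", "G", "H", "ILMV", "KQR", "P", "W"]


def build_mapping(classes):
    """Map each residue to the alphabetically first member of its class."""
    mapping = {}
    for cls in classes:
        representative = min(cls)
        for residue in cls:
            mapping[residue] = representative
    return mapping


def compress(sequence, classes):
    """Rewrite a sequence in the compressed alphabet.

    The sequence is upper-cased first; symbols that belong to no class
    (for example gap characters) are kept as they are.
    """
    mapping = build_mapping(classes)
    return "".join(mapping.get(r, r) for r in sequence.upper())
```

For example, `compress("STKQ", SE_V_10)` yields `"AAKK"`, since S and T fall in the class AST and Q falls in the class KQR.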
Comparison between the use of the alphabet A and the compressed alphabet SE-V (10).
| Alphabet | Sequence | Residues |
|---|---|---|
| Default amino acid alphabet | Seq1 | sAaNiLvGEnlvcKvaDFGLARl |
| | Seq2 | aArNiLvGEnyicKvaDFGLARl |
| | Seq3 | |
| Compressed alphabet SE-V (10) | Seq1 | AAaNILIGENlIcKIaDFGLARI |
| | Seq2 | AArNILIGENyIcKIaDFGLARI |
| | Seq3 | |

In the compressed alphabet, each class member is represented by the first letter in alphabetical order.
Figure 1. Example of how the similarities matrix calculation is distributed between the slaves.
Figure 2. Similarities matrix that is built by the master processor.
Figure 3. Flowchart of the parallel algorithm for multiple sequence similarities calculation.
Figure 4. Processing time, in seconds, of the parallel algorithm for 2, 4, 8, and 15 nodes executing with four different datasets.
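The master/worker arrangement named in the figure captions can be sketched roughly as follows: a master splits the sequence pairs among workers, each worker scores its share, and the master assembles the symmetric matrix. This is a single-machine illustration only. The round-robin split, the use of threads in place of cluster nodes, the diagonal value of 1.0, and all function names are assumptions; the actual partitioning scheme and communication layer are not given in this excerpt.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def kmer_similarity(x, y, k):
    """Fraction of shared k-mers (assumed measure, see lead-in)."""
    n_kmers = min(len(x), len(y)) - k + 1
    if n_kmers <= 0:
        return 0.0
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    return sum((cx & cy).values()) / n_kmers


def split_pairs(n_seqs, n_workers):
    """Deal the upper-triangle index pairs out round-robin (assumed scheme)."""
    pairs = list(combinations(range(n_seqs), 2))
    return [pairs[w::n_workers] for w in range(n_workers)]


def worker(seqs, pairs, k):
    """Score the pairs assigned to one worker."""
    return [(i, j, kmer_similarity(seqs[i], seqs[j], k)) for i, j in pairs]


def similarity_matrix(seqs, n_workers=4, k=3):
    """Master: distribute pairs, then merge the partial results."""
    n = len(seqs)
    # Diagonal set to 1.0 by convention (assumption); the rest is filled below.
    matrix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    chunks = split_pairs(n, n_workers)
    # Threads stand in for the worker nodes of a real cluster.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(lambda c: worker(seqs, c, k), chunks):
            for i, j, score in partial:
                matrix[i][j] = matrix[j][i] = score
    return matrix
```

Only the upper triangle is computed because the measure is symmetric, so each worker handles roughly n(n-1)/(2p) pairs for n sequences and p workers.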