Marie-Ève Lambert, Julie Arsenault, Benjamin Delisle, Pascal Audet, Zvonimir Poljak, Sylvie D'Allaire.
Abstract
BACKGROUND: Porcine reproductive and respiratory syndrome (PRRS) is a major threat to the swine industry. It is caused by the PRRS virus (PRRSV). Determining and comparing the nucleotide sequences of PRRSV strains provide useful information in support of control initiatives or epidemiological studies on transmission patterns. Sequence alignment is the first step in analyzing sequence data, and multiple algorithms are available, but little is known about the impact of this methodological choice. Here, a study was conducted to evaluate the impact of different alignment algorithms on the resulting aligned sequence dataset and on practical issues when applied to a large field database of PRRSV open reading frame (ORF) 5 sequences collected in Quebec, Canada, from 2010 to 2014. Five multiple sequence alignment programs were compared: Clustal W, Clustal Omega, Muscle, T-Coffee and MAFFT.
Keywords: Alignment algorithm; Genetic similarity; PRRS; Porcine reproductive and respiratory syndrome virus; Sequence
Year: 2019 PMID: 31068211 PMCID: PMC6505299 DOI: 10.1186/s12917-019-1890-0
Source DB: PubMed Journal: BMC Vet Res ISSN: 1746-6148 Impact factor: 2.741
Fig. 1 Operational workflow used for computations of analytical criteria. Analytical criteria results are pictured in blue whereas pairwise matrices represent intermediary steps involved in computations. Illustrated with a fictive dataset of 4 sequences aligned with 2 algorithms (A and B)
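The first intermediary step of this workflow, a pairwise similarity matrix computed on an aligned dataset, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the four fictive sequences are hypothetical, and the treatment of gaps (columns where both sequences have a gap are skipped, a gap facing a nucleotide counts as a mismatch) is an assumption, since the excerpt does not state how gaps were scored. The 97.5% threshold is the one used in the study.

```python
from itertools import combinations


def pairwise_similarity(a: str, b: str) -> float:
    """Percent identity between two aligned sequences of equal length.

    Assumption: columns where both sequences carry a gap are ignored,
    and a gap facing a nucleotide is counted as a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def similarity_matrix(alignment: dict[str, str]) -> dict[tuple[str, str], float]:
    """Similarity for every unordered pair of sequences in the alignment."""
    return {
        (p, q): pairwise_similarity(alignment[p], alignment[q])
        for p, q in combinations(alignment, 2)
    }


def proportion_above(matrix: dict[tuple[str, str], float], threshold: float = 97.5) -> float:
    """Percentage of pairwise comparisons with similarity >= threshold."""
    values = list(matrix.values())
    return 100.0 * sum(v >= threshold for v in values) / len(values)


# Fictive aligned dataset of 4 sequences, 40 sites each (hypothetical data).
alignment = {
    "s1": "ACGT" * 10,
    "s2": "ACGT" * 9 + "ACGA",
    "s3": "ACGT" * 9 + "AC--",
    "s4": "TTTT" + "ACGT" * 9,
}
matrix = similarity_matrix(alignment)
average = sum(matrix.values()) / len(matrix)
share = proportion_above(matrix)
```

Running the same functions on the output of each alignment algorithm would give one matrix per algorithm, from which the average similarity and the proportion of pairs at or above 97.5% can be compared.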
Fig. 2 Impact of open gap penalty value on average pairwise similarity, proportion of pairwise comparisons having ≥97.5% genetic similarity and maximal number of gaps introduced per sequence for Clustal W, MAFFT, T-Coffee and Muscle. The different statistics were computed on each aligned dataset. Results obtained for dataset sizes of 238, 476 and 1191 were averaged over 10, 5 and 2 replicates, respectively. Results for the 2383 dataset are shown only for algorithms that generated results in less than two weeks. Recombinants were included in the datasets. Arrows indicate open gap penalty value selected for further analyses. *Default value of open gap penalty as defined by the algorithm user manual
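The gap statistic in this figure, and its averaging over replicates, could be computed along these lines. This is a sketch under assumptions: "number of gaps" is taken here to mean the count of gap characters in a sequence (not the number of gap runs), and the function names and toy replicates are hypothetical.

```python
from statistics import mean


def max_gaps_per_sequence(alignment: list[str], gap: str = "-") -> int:
    """Largest number of gap characters found in any single aligned sequence."""
    return max(seq.count(gap) for seq in alignment)


def average_over_replicates(replicates: list[list[str]]) -> float:
    """Average the per-replicate maximum, one aligned dataset per replicate."""
    return mean(max_gaps_per_sequence(rep) for rep in replicates)


# Two toy replicates (hypothetical data): maxima are 2 and 1.
replicate_1 = ["AC-GT", "ACGGT", "A--GT"]
replicate_2 = ["ACGT-", "ACGTA"]
averaged = average_over_replicates([replicate_1, replicate_2])
```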
Results on analytical criteria investigated in a comparative study on PRRSV sequence alignment algorithms [a]
| Criterion | Clustal W | MAFFT | T-Coffee | Muscle | Clustal Omega |
|---|---|---|---|---|---|
| 1. Similarity: average pairwise genetic similarity (%) of aligned sequences within the dataset (mean ± standard deviation) | | | | | |
| Replicate 1 (1191 sequences) | 88.77 ± 4.19 | 88.84 ± 4.17 | 88.71 ± 4.23 | 88.78 ± 4.19 | 88.78 ± 4.19 |
| Replicate 2 (1191 sequences) | 88.68 ± 4.11 | 88.69 ± 4.11 | 88.28 ± 4.31 | 88.69 ± 4.11 | 88.69 ± 4.11 |
| 2. Proportion of pairwise comparisons of sequences having ≥ 97.5% genetic similarity (%) | | | | | |
| Replicate 1 (1191 sequences) | 5.17 | 5.17 | 5.19 | 5.17 | 5.17 |
| Replicate 2 (1191 sequences) | 4.91 | 4.91 | 4.66 | 4.91 | 4.91 |
| 3. Length of aligned dataset: number of sites per sequence in the aligned dataset | | | | | |
| Replicate 1 (1191 sequences) | 603 | 606 | 607 | 603 | 603 |
| Replicate 2 (1191 sequences) | 603 | 603 | 609 | 603 | 603 |
| 4. Average sum of pairs (SP) score: proportion of shared homologies with reference alignment (%) [b] | | | | | |
| Clustal W as reference | – | 99.93 | 99.74 | 99.91 | 99.94 |
| MAFFT as reference | 99.93 | – | 99.78 | 99.97 | 99.97 |
| T-Coffee as reference | 99.92 | 99.96 | – | 99.94 | 99.97 |
| Muscle as reference | 99.91 | 99.97 | 99.76 | – | 99.95 |
| Clustal Omega as reference | 99.94 | 99.97 | 99.78 | 99.95 | – |
| 5. Congruent cells ≥ 97.5% similarity: proportion of cells between two pairwise similarity matrices having the same binary value (0: < 97.5%; 1: ≥ 97.5%) for genetic similarity (%) [b] | | | | | |
| Clustal W as reference | – | 100.00 | 99.86 | 99.99 | 99.99 |
| MAFFT as reference | 100.00 | – | 99.86 | 99.99 | 99.99 |
| T-Coffee as reference | 99.86 | 99.86 | – | 99.86 | 99.86 |
| Muscle as reference | 99.99 | 99.99 | 99.86 | – | 99.99 |
| Clustal Omega as reference | 99.99 | 99.99 | 99.86 | 99.99 | – |
[a] The open gap penalties used were 30 for Clustal W, 7 for MAFFT, −200 for T-Coffee, −1000 for Muscle and the default for Clustal Omega. The dataset included 2383 sequences collected in 2010–2014, divided into two replicates
[b] Average of 2 replicates of 1191 sequences
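Criteria 4 and 5 of the table above both compare the output of one algorithm against a reference. The sketch below shows one common way to compute them; it is an illustration under assumptions, not the authors' implementation. The SP score is taken here as the share of residue pairs aligned together in the reference that are also aligned together in the test alignment, which makes it depend on the choice of reference. Function names and toy data are hypothetical.

```python
from itertools import combinations


def residue_pairs(row_a: str, row_b: str, gap: str = "-") -> set[tuple[int, int]]:
    """Pairs (i, j) such that residue i of sequence a and residue j of
    sequence b share an alignment column (indices ignore gaps)."""
    pairs = set()
    i = j = 0
    for x, y in zip(row_a, row_b):
        if x != gap and y != gap:
            pairs.add((i, j))
        if x != gap:
            i += 1
        if y != gap:
            j += 1
    return pairs


def sp_score(test: dict[str, str], reference: dict[str, str]) -> float:
    """Percentage of reference residue pairs that the test alignment reproduces."""
    shared = total = 0
    for p, q in combinations(reference, 2):
        ref_pairs = residue_pairs(reference[p], reference[q])
        test_pairs = residue_pairs(test[p], test[q])
        shared += len(ref_pairs & test_pairs)
        total += len(ref_pairs)
    return 100.0 * shared / total if total else 0.0


def congruent_cells(sim_a: dict, sim_b: dict, threshold: float = 97.5) -> float:
    """Percentage of pairwise cells with the same binary value
    (0: < threshold; 1: >= threshold) in two similarity matrices."""
    same = sum((sim_a[k] >= threshold) == (sim_b[k] >= threshold) for k in sim_a)
    return 100.0 * same / len(sim_a)


# Toy example (hypothetical data): same two sequences, two alignments.
reference = {"s1": "ACGT-A", "s2": "AC-TGA"}
test = {"s1": "ACGT-A", "s2": "A-CTGA"}
score = sp_score(test, reference)

# Toy similarity matrices from two algorithms A and B (hypothetical values).
sim_a = {("s1", "s2"): 98.0, ("s1", "s3"): 97.4, ("s2", "s3"): 90.0}
sim_b = {("s1", "s2"): 97.6, ("s1", "s3"): 97.5, ("s2", "s3"): 91.0}
congruence = congruent_cells(sim_a, sim_b)
```

In the toy alignments, three of the four residue pairs in the reference are reproduced by the test alignment, and two of the three matrix cells fall on the same side of the 97.5% threshold.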
Results on technical criteria investigated in a comparative study on PRRSV sequence alignment algorithms [a]
| Criterion | Clustal W | MAFFT | T-Coffee | Muscle | Clustal Omega |
|---|---|---|---|---|---|
| 1. Handling capability of large dataset: capacity to generate results in less than 2 weeks (yes/no) | | | | | |
| 10 replicates of 238 sequences | yes | yes | yes | yes | yes |
| 5 replicates of 476 sequences | yes | yes | yes | yes | yes |
| 2 replicates of 1191 sequences | yes | yes | yes | yes | yes |
| Full dataset (2383 sequences) | no | yes | no | yes | yes |
| 2. Rapidity: average time (minutes) necessary to align (Linux platform, 10 physical cores) | | | | | |
| 10 replicates of 238 sequences | 12.8 | 0.2 | 13.1 | 0.2 | 0.2 |
| 5 replicates of 476 sequences | 57.1 | 1.0 | 56.1 | 0.7 | 0.4 |
| 2 replicates of 1191 sequences | 1040.5 | 7.0 | 540.0 | 3.9 | 1.2 |
| Full dataset (2383 sequences) | n/a | 28.5 | n/a | 17.0 | 2.9 |
| 3. Multiplatform availability (yes/no) | | | | | |
| Web, Windows and Linux | yes | yes | yes | yes | yes |
| 4. Management of IUB ambiguity symbol characters: ability to manage symbols other than A, T, C and G | | | | | |
| List of managed symbols | N | N, R, Y, W, | N, R, Y, W, | N, R, Y | N |
[a] The open gap penalties used were 30 for Clustal W, 7 for MAFFT, −200 for T-Coffee, −1000 for Muscle and the default for Clustal Omega. The dataset included 2383 sequences collected in 2010–2014
Differences in results for analytical criteria when recombinants were excluded versus included, for the different algorithms [a]
| Criterion | Clustal W | MAFFT | T-Coffee | Muscle | Clustal Omega |
|---|---|---|---|---|---|
| 1. Difference in similarity: average pairwise genetic similarity (%) of aligned sequences within the dataset (mean) | | | | | |
| Replicate 1 | 0.01 | −0.05 | 0.01 | 0.01 | 0.01 |
| Replicate 2 | 0.02 | 0.01 | 0.15 | 0.01 | 0.01 |
| 2. Difference in proportion of pairwise comparisons of sequences having ≥ 97.5% genetic similarity (%) | | | | | |
| Replicate 1 | 0.07 | 0.07 | 0.07 | 0.07 | 0.07 |
| Replicate 2 | 0.05 | 0.05 | 0.27 | 0.05 | 0.04 |
| 3. Difference in length of aligned dataset: number of sites per sequence in the aligned dataset | | | | | |
| Replicate 1 | 0 | −3 | 0 | 0 | 0 |
| Replicate 2 | 0 | 0 | −1 | 0 | 0 |
| 4. Difference in average sum of pairs (SP) score: proportion of shared homologies with reference alignment (%) [b] | | | | | |
| Clustal W as reference | – | 0.01 | 0.04 | 0.01 | 0.01 |
| MAFFT as reference | 0.01 | – | 0.04 | 0.01 | 0.00 |
| T-Coffee as reference | 0.01 | 0.01 | – | 0.01 | 0.00 |
| Muscle as reference | 0.01 | 0.01 | 0.04 | – | 0.00 |
| Clustal Omega as reference | 0.01 | 0.00 | 0.04 | 0.00 | – |
| 5. Difference in congruent cells ≥ 97.5% similarity: proportion of cells between two pairwise similarity matrices having the same binary value (0: < 97.5%; 1: ≥ 97.5%) for genetic similarity (%) [b] | | | | | |
| Clustal W as reference | – | −0.01 | 0.11 | 0.00 | 0.00 |
| MAFFT as reference | −0.01 | – | 0.11 | 0.01 | 0.00 |
| T-Coffee as reference | 0.11 | 0.11 | – | 0.11 | 0.11 |
| Muscle as reference | 0.00 | 0.01 | 0.11 | – | 0.00 |
| Clustal Omega as reference | 0.00 | 0.00 | 0.11 | 0.00 | – |
[a] The open gap penalties used were 30 for Clustal W, 7 for MAFFT, −200 for T-Coffee, −1000 for Muscle and the default for Clustal Omega. The five criteria presented in Table 1 for the two replicates including recombinants (Replicates 1 and 2, n = 1191) were re-evaluated for each replicate without recombinants (Replicates 1 and 2, n = 1183). Then, differences in results were computed (i.e. the result obtained with recombinants was subtracted from the result obtained without recombinants)
[b] Average of 2 replicates of 1183 sequences