Safa Jammali1, Esaie Kuitche1, Ayoub Rachati1, François Bélanger1, Michelle Scott2, Aïda Ouangraoua1.
Abstract
BACKGROUND: Frameshift translation is an important phenomenon that contributes to the appearance of novel coding DNA sequences (CDS) and functions in gene evolution, by allowing alternative amino acid translations of gene coding regions. Frameshift translations can be identified by aligning two CDS, from the same gene or from homologous genes, while accounting for their codon structure. Two main classes of algorithms have been proposed to solve the problem of aligning CDS: back-translation of an amino acid sequence alignment, or simultaneous accounting for the nucleotide and amino acid levels. The former cannot account for frameshift translations and, up to now, the latter has accounted only for frameshift translation initiation, without considering the length of the translation disruption caused by a frameshift.
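To make the notion of a frameshift translation concrete, the following minimal sketch translates a toy CDS before and after a single-nucleotide deletion. The sequences and the helper name `translate` are hypothetical illustrations, not taken from the paper; the codon table is the standard genetic code.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds: str, frame: int = 0) -> str:
    """Translate a nucleotide string codon by codon, starting at offset `frame`.

    Trailing nucleotides that do not form a complete codon are ignored.
    """
    seq = cds.upper()[frame:]
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[p:p + 3]] for p in range(0, usable, 3))


# Hypothetical toy CDS (assumption: chosen only for illustration).
original = "ATGGCTAAACTCACGGGA"
# Deleting the nucleotide right after the first codon shifts the reading
# frame of everything downstream.
shifted = original[:3] + original[4:]

print(translate(original))  # MAKLTG
print(translate(shifted))   # MLNSR
```

Only the first codon is translated identically; every downstream amino acid changes, which is why an alignment of the two CDS has to consider the nucleotide and amino acid levels together.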
Keywords: Coding DNA sequences pairwise alignment; Dynamic programming; Frameshifts
Year: 2017 PMID: 28373895 PMCID: PMC5374649 DOI: 10.1186/s13015-017-0101-4
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Fig. 1 Top: an example of three CDS, Seq1, Seq2 and Seq3. Middle: an optimal alignment between Seq1 and Seq2 with a FS translation region of length 15. Bottom: an optimal alignment between Seq1 and Seq3 with a FS translation region of length 30
Fig. 2 An alignment of length 48 between two CDS, A (13 codons) and B (14 codons). The number arrays indicate the positions of the consecutive alignment columns. Codons of A and B are colored according to the set to which they belong: IM codons in blue, FSext codons in red, InDel codons in green and FSinit codons in black. MFS nucleotides contained in FSinit codons are underlined
Fig. 3 Illustration of the alignment configurations considered in Lemma 1 for computing D(i, j) in cases 1 and 2. The right-most nucleotides of the sequences A[1 .. i] and B[1 .. j] are represented using the character x. The nucleotides are colored according to the type of the codon to which they belong: IM codons in blue, FSext codons in red, InDel codons in green and FSinit codons in black. The nucleotides that appear in gray belong to codons whose type has not yet been decided. In such a case, the table is used to decide the type of these codons subsequently and to adjust the score accordingly
Fig. 4 Top: rough representation of the real alignment of CDS FAM86C1-002, FAM86B1-001 and FAM86B2-202. Rectangular colored portions represent concatenations of nucleotides in the alignment, while blank portions represent concatenations of gap symbols. The lengths of the alignment portions are given at the bottom. The colors of the nucleotide regions indicate the coding frame in which they are translated, taking the frame of CDS FAM86C1-002 as reference. For example, there is a nucleotide region of length 89 shared by the three CDS and translated in 3 different coding frames. Bottom: real alignment of the three CDS (figure obtained using the visualization software seaview [29]). Nucleotides are colored according to the codon structure of the first CDS, FAM86C1-002
Detailed description of the ten gene families of the mammalian dataset
| Gene family | Human gene | # of genes | # of CDS | Length | # of pairs of CDS |
|---|---|---|---|---|---|
| I (FAM86) | ENSG00000118894 | 6 | 14 | 10335 | 91 |
| II (HBG017385) | ENSG00000143867 | 6 | 10 | 8988 | 45 |
| III (HBG020791) | ENSG00000179526 | 6 | 10 | 11070 | 45 |
| IV (HBG004532) | ENSG00000173020 | 17 | 33 | 52356 | 528 |
| V (HBG016641) | ENSG00000147041 | 13 | 33 | 64950 | 528 |
| VI (HBG014779) | ENSG00000233803 | 28 | 44 | 45813 | 946 |
| VII (HBG012748) | ENSG00000134545 | 24 | 44 | 28050 | 946 |
| VIII (HBG015928) | ENSG00000178287 | 5 | 19 | 5496 | 171 |
| IX (HBG004374) | ENSG00000140519 | 13 | 30 | 36405 | 435 |
| X (HBG000122) | ENSG00000105717 | 11 | 24 | 27081 | 276 |
| Total number of pairs of CDS | | | | | 4011 |
For each gene family, the family identifier used in [6] or [12], the Ensembl identifier of a human gene member of the family, the number of human, mouse and cow genes in the family, the total number of CDS of these genes, the total sum of lengths of these CDS and the number of distinct pairs of CDS are given
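The last column of the table can be checked from the "# of CDS" column, since n CDS give n(n − 1)/2 distinct pairs. A short sketch of that arithmetic, with the CDS counts copied from the table (the variable names are illustrative):

```python
from math import comb

# Number of CDS per gene family, copied from the "# of CDS" column.
cds_per_family = {
    "I": 14, "II": 10, "III": 10, "IV": 33, "V": 33,
    "VI": 44, "VII": 44, "VIII": 19, "IX": 30, "X": 24,
}

# Distinct unordered pairs of CDS within each family: n choose 2.
pairs_per_family = {fam: comb(n, 2) for fam, n in cds_per_family.items()}
total_pairs = sum(pairs_per_family.values())

print(pairs_per_family)
print(total_pairs)  # 4011
```

The per-family values reproduce the final column (91, 45, 45, 528, ...) and sum to the reported total of 4011 pairs.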
Description of the five methods considered in the experiment
| Method | Alignment approach and specific parameters | FS initiation cost | Other parameters |
|---|---|---|---|
| fse | Present approach | | |
| fse0 | Present approach | | |
| macse_p | Ranwez et al. [ | | |
| needleprot | NW [ | Not applicable | |
| needlenuc | NW [ | Not applicable | |
| | | | match/mismatch |
For each method, the alignment approach and the values of specific and common parameters are given
Comparison with MACSE multiple alignments benchmark
| fs_open_cost | fse0 | fse (−1) | fse (−0.5) | fse (−0.2) | macse_p | needlenuc | needleprot |
|---|---|---|---|---|---|---|---|
| -10 |
| 79.40 (1364) |
|
| 77.17 (1076) | 50.95 (255) | 78.82 (972) |
| -20 |
|
|
|
| 78.29 (1389) | ||
| -30 |
|
|
|
| 47.35 (742) |
Percentage of nucleotides aligned with the same partner as in the benchmark alignments induced by the MACSE multiple alignments, for each method and for varying fs_open_cost (−10, −20 and −30) and fs_extend_cost (−1, −0.5 and −0.2). In each case, the number of CDS pairs for which the method's alignment is more similar to the corresponding benchmark alignment than those of the other methods is given in parentheses. The best results are indicated in italics
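One way to compute such a percentage is sketched below. It assumes that a pairwise alignment is given as two gapped strings and that a nucleotide facing a gap counts as having "no partner"; the exact convention used in the paper is not stated here, and the function names are hypothetical.

```python
def partners(row_a: str, row_b: str, gap: str = "-") -> dict:
    """Map every nucleotide, keyed by (sequence label, index), to the index it
    is aligned with in the other sequence, or None when it faces a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    mapping, i, j = {}, 0, 0
    for x, y in zip(row_a, row_b):
        if x != gap and y != gap:
            mapping[("A", i)] = j
            mapping[("B", j)] = i
        elif x != gap:
            mapping[("A", i)] = None
        elif y != gap:
            mapping[("B", j)] = None
        # Advance each sequence index only when its row holds a nucleotide.
        i += x != gap
        j += y != gap
    return mapping


def percent_same_partner(test: tuple, benchmark: tuple) -> float:
    """Percentage of nucleotides with the same partner in both alignments."""
    test_map = partners(*test)
    bench_map = partners(*benchmark)
    if test_map.keys() != bench_map.keys():
        raise ValueError("alignments must cover the same two sequences")
    same = sum(test_map[key] == bench_map[key] for key in bench_map)
    return 100.0 * same / len(bench_map)


# Toy alignments of the sequences ACGTA and ACTGA (illustrative only).
benchmark = ("ACGT-A", "AC-TGA")
candidate = ("ACGTA", "ACTGA")
print(percent_same_partner(candidate, benchmark))  # 60.0
```

In the toy example, six of the ten nucleotides keep their benchmark partner, hence 60%.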
Values of the six criteria for the noFS dataset (variations as compared to needleprot)
| fs_open_cost (# CDS pairs) | Method | Identity_NT | Identity_AA | Gap_init | Gap_length | FS_init | FS_length |
|---|---|---|---|---|---|---|---|
| −10 (1672) | fse0 | 3281 (1158) | 5376 (1222) | −495 (1606) | −2718 (1521) | 0 (1672) | 0 (1672) |
| | fse | | | | | | |
| | macse_p | 8120 (955) | 27942 (676) | 3701 (711) | 9618 (1102) | 0 (1672) | 0 (1672) |
| | needlenuc | 170239 (156) | −82002 (442) | 104811 (218) | 21422 (427) | 44488 (256) | 263365 (256) |
| | needleprot | | | | | | |
| −20 (3441) | fse0 | 1409 (2612) | −8622 (2672) | −3564 (3169) | −9984 (3057) | 0 (3441) | 0 (3441) |
| | fse | | | | | | |
| | macse_p | 24909 (1437) | 95844 (1011) | 13778 (1076) | 30884 (1791) | 0 (3441) | 0 (3441) |
| | needlenuc | 547203 (176) | −177285 (680) | 317256 (219) | 52510 (552) | 138204 (257) | 844401 (257) |
| | needleprot | | | | | | |
| −30 (3740) | fse0 | 1368 (2834) | −10788 (2912) | −4047 (3448) | −11316 (3321) | 0 (3740) | 0 (3740) |
| | fse | | | | | | |
| | macse_p | 27840 (1547) | 106512 (1078) | 15561 (1117) | 34726 (1846) | 0 (3740) | 0 (3740) |
| | needlenuc | 610305 (177) | −192231 (709) | 351748 (219) | 47356 (573) | 154255 (257) | 948418 (257) |
| | needleprot | | | | | | |
For varying values of the parameter fs_open_cost, the number of CDS pairs in the dataset is given.
The values of the criteria for the reference method “needleprot” are indicated in italics. For each of the other methods (fse, fse0, macse_p, needlenuc), the variations of the criteria values as compared to the reference values are given. For each criterion and each method, the number of CDS pairs that have the closest value to the reference needleprot value is given in parentheses
Values of the six criteria for the FS dataset
| fs_open_cost | fs_extend_cost (# CDS pairs) | Method | Identity_NT | Identity_AA | Gap_init | Gap_length | FS_init (avg) | FS_length |
|---|---|---|---|---|---|---|---|---|
| −10 | −1 (212) | | 166002 | 325212 | 895 | 60662 | 226 ( | 20219 |
| | | | 165720 | 325026 | 901 | 60624 | 216 ( | 18705 |
| | | macse_p | 166167 | 324999 | 1445 | 61562 | 432 (2.03 ± 3.06) | 22742 |
| | | needlenuc | 172959 | 321348 | 5053 | 60038 | 2103 (9.91 ± 26.73) | 29616 |
| | −0.5 (386) | | 252590 | 464712 | 2400 | 114859 | 482 ( | 31777 |
| | | | 251647 | 463407 | 2387 | 115269 | 401 ( | 26982 |
| | | macse_p | 253715 | 465594 | 4161 | 117165 | 1306 (3.38 ± 4.53) | 41742 |
| | | needlenuc | 279682 | 452673 | 19408 | 113195 | 8032 (20.80 ± 31.02) | 68226 |
| | −0.2 (619) | | 371062 | 641748 | 5334 | 204370 | 805 ( | 43381 |
| | | | 370260 | 640377 | 5270 | 204806 | 688 ( | 37376 |
| | | macse_p | 374729 | 646893 | 9308 | 208344 | 2893 (4.67 ± 5.34) | 72030 |
| | | needlenuc | 442564 | 618270 | 48799 | 209420 | 19751 (31.90 ± 34.48) | 141217 |
| −20 | −1 (161) | | 123814 | 244350 | 461 | 40315 | 168 ( | 17770 |
| | | | 123610 | 244149 | 468 | 40195 | 164 ( | 16924 |
| | | macse_p | 123541 | 243591 | 709 | 40585 | 223 (1.38 ± 1.03) | 18119 |
| | | needlenuc | 125452 | 242742 | 1493 | 39031 | 650 (4.03 ± 5.85) | 19405 |
| | −0.5 (189) | | 147476 | 291147 | 549 | 49485 | 197 ( | 19599 |
| | | | 147401 | 291048 | 557 | 49363 | 194 ( | 19279 |
| | | macse_p | 147143 | 290271 | 838 | 49841 | 260 (1.37 ± 0.98) | 19976 |
| | | needlenuc | 149551 | 289086 | 1872 | 47515 | 808 (4.27 ± 6.17) | 21440 |
| | −0.2 (216) | | 161906 | 318117 | 723 | 55383 | 225 ( | 21300 |
| | | | 161865 | 318099 | 732 | 55393 | 223 ( | 21115 |
| | | macse_p | 161622 | 317205 | 1061 | 55715 | 306 (1.41 ± 0.99) | 21997 |
| | | needlenuc | 165260 | 315531 | 2851 | 53613 | 1186 (5.49 ± 6.82) | 24403 |
| −30 | −1 (71) | | 47071 | 91266 | 230 | 26303 | 76 ( | 12845 |
| | | | 46872 | 91032 | 233 | 26183 | 72 ( | 12302 |
| | | macse_p | 46936 | 90876 | 372 | 26325 | 118 (1.66 ± 1.25) | 13142 |
| | | needlenuc | 48290 | 91017 | 866 | 26135 | 391 (5.50 ± 5.67) | 13829 |
| | −0.5 (154) | | 120558 | 237768 | 445 | 37975 | 159 ( | 17554 |
| | | | 120504 | 237678 | 452 | 37851 | 157 ( | 17319 |
| | | macse_p | 120338 | 237084 | 691 | 38047 | 212 (1.37 ± 1.00) | 17926 |
| | | needlenuc | 122084 | 236904 | 1321 | 37531 | 575 (3.73 ± 5.14) | 18877 |
| | −0.2 (178) | | 137451 | 271041 | 525 | 46049 | 184 ( | 18995 |
| | | | 137440 | 271008 | 531 | 45917 | 183 ( | 18872 |
| | | macse_p | 137175 | 270258 | 803 | 46187 | 244 (1.37 ± 0.97) | 19395 |
| | | needlenuc | 139489 | 269139 | 1803 | 44303 | 779 (4.38 ± 6.27) | 20859 |
For varying values of the parameters fs_open_cost and fs_extend_cost, the number of CDS pairs in the dataset is given. The values of the criteria for the fse, fse0, macse_p and needlenuc methods are indicated. For each method, the average number of FS_init per alignment and the corresponding standard error are also indicated
The best results are indicated in italics
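The averages in the FS_init column can be related to the totals with simple arithmetic. The sketch below assumes that the average is the total FS_init divided by the number of CDS pairs of the dataset, an interpretation that the caption does not state explicitly; the figures are copied from the macse_p and needlenuc rows for fs_open_cost = −10.

```python
# (total FS_init, number of CDS pairs, reported average), keyed by
# (method, fs_extend_cost); values copied from the table for fs_open_cost = -10.
rows = {
    ("macse_p", -1.0): (432, 212, 2.03),
    ("needlenuc", -1.0): (2103, 212, 9.91),
    ("macse_p", -0.5): (1306, 386, 3.38),
    ("needlenuc", -0.5): (8032, 386, 20.80),
}

# Assumed definition: average FS_init per alignment = total / number of pairs.
averages = {key: total / pairs for key, (total, pairs, _) in rows.items()}

for key, (_, _, reported) in rows.items():
    print(key, round(averages[key], 3), "reported:", reported)
```

Under this assumption, each recomputed average agrees with the reported one to within 0.01.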
Values of the six criteria for the ambiguFS dataset
| fs_open_cost | fs_extend_cost (# CDS pairs) | Method | Identity_NT | Identity_AA | Gap_init | Gap_length | FS_init (avg) | FS_length |
|---|---|---|---|---|---|---|---|---|
| −10 | −1 (2127) | fse0 (862) | 1095102 | 1737105 | 24489 | 908218 | 1111 (1.28 | 42730 |
| | | fse | 1086546 | 1719774 | 23483 | 906540 | 0 | 0 |
| | | macse_p (2076) | 1124316 | 1790199 | 45335 | 936002 | 12436 (5.99 | 216772 |
| | | needleprot | 1085007 | 1723950 | 25288 | 916518 | 0 | 0 |
| | −0.5 (1953) | fse0 (688) | 1008514 | 1597605 | 22984 | 854021 | 855 (1.24 | 31172 |
| | | fse (2) | 1003293 | 1587258 | 22102 | 853793 | 2 (1.0 | 80 |
| | | macse_p (1902) | 1036768 | 1649604 | 42619 | 880399 | 11562 (6.07 | 197772 |
| | | needleprot | 1001957 | 1591134 | 23790 | 863199 | 0 | 0 |
| | −0.2 (1720) | fse0 (455) | 890042 | 1420569 | 20050 | 764510 | 532 (1.16 | 19568 |
| | | fse (3) | 887372 | 1415403 | 19465 | 764162 | 3 (1.0 | 92 |
| | | macse_p (1669) | 915754 | 1468305 | 37472 | 789220 | 9975 (5.97 | 167484 |
| | | needleprot | 886178 | 1418748 | 20955 | 772272 | 0 | 0 |
| −20 | −1 (409) | fse0 (100) | 219277 | 358554 | 3633 | 153487 | 120 (1.2 | 6937 |
| | | fse | 216936 | 353586 | 3619 | 152391 | 0 | 0 |
| | | macse_p (403) | 225976 | 374391 | 6509 | 158165 | 1348 (3.34 | 36179 |
| | | needleprot | 216842 | 355656 | 4172 | 153957 | 0 | 0 |
| | −0.5 (381) | fse0 (72) | 195615 | 311757 | 3545 | 144317 | 91 (1.26 | 5108 |
| | | fse | 194048 | 308448 | 3505 | 144045 | 0 | 0 |
| | | macse_p (375) | 202374 | 327711 | 6380 | 148909 | 1311 (3.49 | 34322 |
| | | needleprot | 193980 | 310632 | 4051 | 145563 | 0 | 0 |
| | −0.2 (354) | fse0 (45) | 181185 | 284787 | 3371 | 138419 | 63 (1.4 | 3407 |
| | | fse (1) | 180151 | 282693 | 3344 | 138217 | 1 (1.0 | 40 |
| | | macse_p (348) | 187895 | 300777 | 6157 | 143035 | 1265 (3.63 | 32301 |
| | | needleprot | 180116 | 284946 | 3883 | 139731 | 0 | 0 |
| −30 | −1 (200) | fse0 (119) | 151090 | 289617 | 805 | 42437 | 120 (1.01 | 6818 |
| | | fse | 147590 | 282018 | 852 | 40221 | 0 | 0 |
| | | macse_p (200) | 152626 | 292254 | 1309 | 43043 | 378 (1.89 | 14515 |
| | | needleprot | 147228 | 281472 | 933 | 40455 | 0 | 0 |
| | −0.5 (117) | fse0 (36) | 77603 | 143115 | 590 | 30765 | 37 (1.02 | 2109 |
| | | fse | 76678 | 141108 | 626 | 29913 | 0 | 0 |
| | | macse_p (117) | 79224 | 146046 | 990 | 31321 | 284 (2.42 | 9731 |
| | | needleprot | 76561 | 141036 | 703 | 30099 | 0 | 0 |
| | −0.2 (93) | fse0 (12) | 60710 | 109842 | 510 | 22691 | 12 (1.0 | 668 |
| | | fse | 60407 | 109170 | 518 | 22491 | 0 | 0 |
| | | macse_p (93) | 62387 | 112872 | 878 | 23181 | 252 (2.70 | 8262 |
| | | needleprot | 60270 | 109122 | 581 | 22677 | 0 | 0 |
For varying values of the parameters fs_open_cost and fs_extend_cost, and for each method, the number of CDS pairs displaying a FS translation is given. The values of the criteria for each method are indicated. For each method, the average number of FS_init per alignment and the corresponding standard error are also indicated
Pairwise similarity scores and number of FS translation regions computed by the methods
| fs_open_cost | Method | | | |
|---|---|---|---|---|
| −10 | fse0 | 0.42 (1) | 0.58 (2) | 0.45 (1) |
| | fse (−1) | 0.33 (1) | 0.27 (1) | 0.18 (1) |
| | fse (−0.5) | 0.37 (1) | 0.43 (1) | 0.31 (1) |
| | fse (−0.2) | 0.40 (1) | 0.52 (1) | 0.39 (1) |
| | macse_p | 0.40 (4) | 0.54 (6) | 0.44 (1) |
| −20 | fse0 | 0.39 (1) | 0.54 (1) | 0.41 (1) |
| | fse (−1) | 0.36 (0) | 0.24 (1) | 0.14 (1) |
| | fse (−0.5) | 0.34 (1) | 0.39 (1) | 0.28 (1) |
| | fse (−0.2) | 0.37 (1) | 0.48 (1) | 0.36 (1) |
| | macse_p | 0.33 (4) | 0.47 (6) | 0.35 (1) |
| −30 | fse0 | 0.35 (1) | 0.50 (1) | 0.38 (1) |
| | fse (−1) | 0.36 (0) | 0.20 (1) | 0.11 (1) |
| | fse (−0.5) | 0.36 (0) | 0.35 (1) | 0.25 (1) |
| | fse (−0.2) | 0.33 (1) | 0.44 (1) | 0.33 (1) |
| | macse_p | 0.27 (4) | 0.39 (6) | 0.29 (1) |
| | needlenuc | 0.16 (23) | 0.35 (15) | −0.36 (1) |
| | needleprot | 0.38 (0) | −0.12 (0) | −0.13 (0) |
Normalized pairwise similarity scores and number of FS translation regions computed by the five methods for the 3-CDS manually-built benchmark composed of CDS FAM86C1-002, FAM86B1-001 and FAM86B2-202 (Similarity scores are normalized by dividing them by the lengths of alignments)
Similarity relationships between the groups G1, G2 and G3 for the five methods
| fs_open_cost | Method | ((G1,G3),G2) | ((G1,G2),G3) |
|---|---|---|---|
| −10 | fse (-1) | X | |
| fse (-0.5) | X | ||
| fse (-0.2) | X | ||
| fse0 | X | ||
| macse_p | X | ||
| −20 | fse (-1) | X | |
| fse (-0.5) | X | ||
| fse (-0.2) | X | ||
| fse0 | X | ||
| macse_p | X | ||
| −30 | fse (-1) | X | |
| fse (-0.5) | X | ||
| fse (-0.2) | X | ||
| fse0 | X | ||
| macse_p | X | ||
| needlenuc | X | ||
| needleprot | X |
Similarity relationships between the splicing orthology groups G1, G2 and G3 computed using the similarity matrices of the five methods for the 21-CDS dataset
Running time in seconds for each method
| Gene family | fse0 | fse | macse_p | needlenuc | needleprot |
|---|---|---|---|---|---|
| I | 299 | 291 | 53 | 97 | 22 |
| II | 270 | 260 | 45 | 93 | 20 |
| III | 377 | 389 | 54 | 62 | 20 |
For each method and for gene families I, II and III, the running time was measured on the same computer (24 processors of 2.1 GHz each and 10 GB of RAM) with the parameters fs_open_cost = and fs_extend_cost =