Kieran Boyce, Fabian Sievers, Desmond G Higgins.
Abstract
BACKGROUND: Progressive alignment is the standard approach used to align large numbers of sequences. As with all heuristics, this involves a tradeoff between alignment accuracy and computation time.
Keywords: Clustal; Kalign; Large scale alignment; Mafft; Multiple sequence alignment; Muscle; Pfam; Sequence order
Year: 2015 PMID: 26457114 PMCID: PMC4599319 DOI: 10.1186/s13015-015-0057-1
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Duplicate sequence percentages in HomFam protein families
| Protein family | Total sequences | Unique sequences | Duplicates (%) |
|---|---|---|---|
| aadh | 3119 | 2348 | 24.72 |
| aat | 25,090 | 19,879 | 20.77 |
| Acetyltransf | 46,279 | 31,943 | 30.98 |
| ace | 3983 | 3787 | 4.92 |
| adh | 21,326 | 15,452 | 27.54 |
| aldosered | 13,270 | 10,787 | 18.71 |
| Ald_Xan_dh_2 | 2583 | 2037 | 21.14 |
| annexin | 3133 | 2288 | 26.97 |
| asp | 3249 | 2979 | 8.31 |
| az | 1057 | 892 | 15.61 |
| biotin_lipoyl | 11,826 | 7332 | 38.00 |
| blmb | 17,194 | 13,102 | 23.80 |
| blm | 9097 | 7145 | 21.46 |
| bowman | 494 | 218 | 55.87 |
| cah | 1374 | 1197 | 12.88 |
| ChtBD | 769 | 447 | 41.87 |
| cryst | 1153 | 909 | 21.16 |
| cyclo | 6282 | 4967 | 20.93 |
| cys | 4303 | 3910 | 9.13 |
| cyt3 | 379 | 347 | 8.44 |
| cytb | 3200 | 2622 | 18.06 |
| DEATH | 1176 | 874 | 25.68 |
| DMRL_synthase | 2094 | 1423 | 32.04 |
| egf | 7762 | 5405 | 30.36 |
| flav | 4606 | 3103 | 32.63 |
| GEL | 2190 | 1583 | 27.72 |
| ghf10 | 1497 | 1393 | 6.95 |
| ghf11 | 516 | 461 | 10.66 |
| ghf13 | 12,597 | 9870 | 21.65 |
| ghf1 | 4350 | 3471 | 20.21 |
| ghf22 | 748 | 608 | 18.72 |
| ghf5 | 2711 | 2355 | 13.13 |
| glob | 3942 | 2828 | 28.26 |
| gluts | 10,085 | 7841 | 22.25 |
| gpdh | 7683 | 4993 | 35.01 |
| hip | 162 | 115 | 29.01 |
| hla | 13,460 | 9148 | 32.03 |
| HLH | 6776 | 3417 | 49.57 |
| HMG_box | 4774 | 2988 | 37.41 |
| hom | 12,029 | 6044 | 49.75 |
| hormone_rec | 3504 | 2896 | 17.35 |
| hpr | 3344 | 1878 | 43.84 |
| hr | 3702 | 1985 | 46.38 |
| icd | 5673 | 4505 | 20.59 |
| il8 | 1062 | 799 | 24.76 |
| ins | 787 | 524 | 33.42 |
| int | 7567 | 6185 | 18.26 |
| KAS | 2064 | 1490 | 27.81 |
| kringle | 1082 | 821 | 24.12 |
| kunitz | 2256 | 1753 | 22.30 |
| ldh | 7353 | 3094 | 57.92 |
| LIM | 6423 | 3729 | 41.94 |
| ltn | 1056 | 909 | 13.92 |
| lyase_1 | 7627 | 5611 | 26.43 |
| mmp | 1421 | 1136 | 20.06 |
| mofe | 2561 | 2326 | 9.18 |
| msb | 4876 | 4094 | 16.04 |
| myb_DNA-binding | 10,393 | 7124 | 31.45 |
| OTCace | 4790 | 3234 | 32.48 |
| oxidored_q6 | 3343 | 1974 | 40.95 |
| p450 | 21,001 | 19,700 | 6.19 |
| PDZ | 14,944 | 9552 | 36.08 |
| peroxidase | 4509 | 3589 | 20.40 |
| phc | 2945 | 1961 | 33.41 |
| phoslip | 928 | 803 | 13.47 |
| profilin | 682 | 579 | 15.10 |
| proteasome | 5715 | 4549 | 20.40 |
| Rhodanese | 14,043 | 10,011 | 28.71 |
| rhv | 17,970 | 9151 | 49.08 |
| ricin | 740 | 548 | 25.94 |
| rnasemam | 492 | 438 | 10.98 |
| rrm | 27,590 | 18,692 | 32.25 |
| rub | 1430 | 975 | 31.82 |
| rvp | 93,675 | 64,987 | 30.62 |
| scorptoxin | 355 | 311 | 12.39 |
| sdr | 50,144 | 40,212 | 19.81 |
| seatoxin | 88 | 63 | 28.41 |
| serpin | 3136 | 2957 | 5.71 |
| slectin | 927 | 749 | 19.20 |
| sodcu | 2031 | 1586 | 21.91 |
| sodfe | 4447 | 2728 | 38.65 |
| Stap_Strp_toxin | 634 | 174 | 72.56 |
| sti | 608 | 536 | 11.84 |
| subt | 7506 | 6469 | 13.81 |
| Sulfotransfer | 2484 | 2269 | 8.65 |
| tgfb | 1598 | 1022 | 36.04 |
| tim | 3894 | 2909 | 25.30 |
| tms | 2113 | 1518 | 28.16 |
| TNF | 551 | 417 | 24.32 |
| toxin | 488 | 450 | 7.79 |
| trfl | 830 | 742 | 10.60 |
| tRNA-synt_2b | 11,288 | 7670 | 32.05 |
| uce | 4545 | 3744 | 17.62 |
| zf-CCHH | 88,330 | 45,901 | 48.03 |
The list of HomFam protein families, the total number of sequences in each family, the number of unique sequences, and the percentage of the total number of sequences that are duplicates
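The duplicate percentage in the table is consistent with (total − unique) / total × 100. The following is a minimal sketch of that arithmetic, assuming that "duplicate" means an exact string match between sequences; the function names are illustrative and not taken from the paper.

```python
def percent_duplicates(total, unique):
    """Percentage of sequences that are duplicates, rounded to 2 decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * (total - unique) / total, 2)


def duplicate_stats(sequences):
    """Return (total, unique, percent duplicates) for a collection of sequences.

    Assumption: two sequences are duplicates only if their strings are identical.
    """
    seqs = list(sequences)
    total = len(seqs)
    unique = len(set(seqs))
    return total, unique, percent_duplicates(total, unique)


# Reproduce two rows of the table from their counts.
print(percent_duplicates(3119, 2348))  # aadh
print(percent_duplicates(494, 218))    # bowman
```

Applied to the counts for aadh and bowman, this reproduces the tabulated values of 24.72 and 55.87.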
Fig. 1. Difference in TC core scores for random samples and in reverse order. The difference in the TC core scores for 1000 randomly-selected sequences and in reverse order. 68 HomFam protein families. samples per family
Fig. 2. Unique distances by number of sequences for each alignment program. The number of unique distances with increasing number of sequences. Each line is the mean of 100 samples for each HomFam protein family
Fig. 3. Theoretical maximum and actual number of unique distances, and maximum theoretical number of sequences that can be aligned without duplicate distance measures. The theoretical maximum number of unique distances for each HomFam family, the actual number of unique distances found in the datasets used to generate Fig. 1, and the maximum theoretical number of sequences that can be aligned without generating duplicate distance measures, based on the calculation that N sequences will produce N(N-1)/2 distance measures
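The last quantity in this caption can be illustrated with a short calculation. The sketch below assumes an all-against-all comparison, in which N sequences yield N(N-1)/2 pairwise distances, and takes the maximum number of unique distance values as a given input (how that maximum is derived for each family is not shown here). The function names are hypothetical.

```python
import math


def pairwise_distance_count(n):
    """Number of pairwise distances among n sequences (all-against-all)."""
    return n * (n - 1) // 2


def max_sequences_without_duplicates(max_unique_distances):
    """Largest N with N(N-1)/2 <= max_unique_distances.

    Beyond this N, at least two sequence pairs must share a distance value
    (pigeonhole argument). Solves the quadratic with integer arithmetic.
    """
    d = max_unique_distances
    if d < 0:
        raise ValueError("max_unique_distances must be non-negative")
    return (1 + math.isqrt(1 + 8 * d)) // 2


n = max_sequences_without_duplicates(10)
print(n, pairwise_distance_count(n))
```

The integer square root avoids floating-point rounding for large inputs; `math.isqrt` requires Python 3.8 or later.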
Fig. 4. Count of the differences in forward and reverse TC core scores. The number of samples within each HomFam family where the forward and reverse TC core scores are different. samples for each family and dataset size