Brandon D Pickett, Justin B Miller, Perry G Ridge.
Abstract
MOTIVATION: One of the main challenges with bioinformatics software is that the size and complexity of datasets necessitate trading speed for accuracy or completeness. To combat this problem of computational complexity, a plethora of heuristic algorithms have arisen that report a 'good enough' solution to biological questions. However, in instances such as Simple Sequence Repeats (SSRs), a 'good enough' solution may not accurately portray results in population genetics, phylogenetics and forensics, which require accurate SSRs to calculate intra- and inter-species interactions.
Year: 2017 PMID: 28968741 PMCID: PMC5860095 DOI: 10.1093/bioinformatics/btx538
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Conceptual representation of Kmer-SSR. Although we implement some filters and tricks to speed up Kmer-SSR runtime, each SSR is identified through kmer decomposition, which allows the identification of instances when the same SSR period occurs k bases from the previously identified SSR period.
Fig. 2. Pseudocode for the Kmer-SSR algorithm. The function passesBooleanFilter ensures SSRs are not duplicates of previously reported SSRs. The function passesUserFilters (function not shown) completes other user-specified options, which may include: minimum SSR length, minimum and maximum number of periods, finding specific SSRs and sequence length bounds.
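The idea described in the two captions above can be illustrated with a minimal Python sketch. It is not the authors' C++ implementation: the function names `find_ssrs` and `is_atomic`, the 0-based coordinates, and the defaults (period sizes 1 to 7, minimum length 16) are assumptions made for illustration. Instead of the boolean filter named in the caption, this sketch avoids duplicate reports by skipping past each reported SSR and by rejecting repeat units that are themselves repeats of a shorter unit.

```python
def is_atomic(unit: str) -> bool:
    """True if `unit` is not itself a repetition of a shorter unit."""
    n = len(unit)
    return not any(
        n % p == 0 and unit == unit[:p] * (n // p) for p in range(1, n)
    )


def find_ssrs(seq: str, min_period: int = 1, max_period: int = 7,
              min_length: int = 16) -> list[tuple[int, str, int]]:
    """Return (start, unit, repeats) for exact SSRs; `start` is 0-based.

    Hypothetical helper: for every period size k, the k-mer at position i
    is compared with the k-mer that starts k bases downstream.
    """
    found = []
    for k in range(min_period, max_period + 1):
        i = 0
        while i + 2 * k <= len(seq):
            unit = seq[i:i + k]
            repeats = 1
            # Extend while the next k-mer (k bases downstream) is identical.
            while seq[i + repeats * k:i + (repeats + 1) * k] == unit:
                repeats += 1
            if repeats >= 2 and is_atomic(unit) and repeats * k >= min_length:
                found.append((i, unit, repeats))
                i += repeats * k  # skip past the SSR just reported
            else:
                i += 1
    return found


if __name__ == "__main__":
    example = "GGC" + "AT" * 8 + "CCG" + "A" * 20 + "C"
    for start, unit, repeats in sorted(find_ssrs(example)):
        print(start, unit, repeats)
```

The inner loop is the k-mer comparison; the atomicity check plays the role of a user filter that rejects, for instance, `ATAT` as a unit when `AT` describes the same bases.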
Comparisons of all nine SSR-identification algorithms across six genomes with period sizes of 1–7 and a minimum SSR length of 16 bases
| Software | CPU Time (mm:ss) | Real Time (mm:ss) | SSRs Reported | SSRs After Adjustments | SSRs In Range | Number Correct | Number Correct & Fixed | Percent Correct & Fixed | SSRs Unique to Software | SSRs Unique to Kmer-SSR | SSRs Shared with Kmer-SSR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GMATo | 2:38 | 2:38 | 20 623 008 | 16 369 297 | 16 871 | 16 871 | 16 870 | 100 | 0 | 8194 | 10 090 | |
| Kmer-SSR | 2:24 | 0:24 | 18 284 | 18 284 | 18 284 | 18 284 | 18 284 | 100 | NA | NA | NA | |
| MREPS | 0:09 | 0:09 | 25 639 | 25 639 | 18 284 | 18 284 | 18 284 | 100 | 0 | 0 | 18 284 | |
| PRoGeRF | 18:07 | 18:07 | 16 841 656 | 16 840 821 | 17 763 | 17 762 | 17 763 | 100 | 0 | 610 | 17 674 | |
| QDD | 19:11 | 19:11 | 60 994 | 60 994 | 18 009 | 18 009 | 18 009 | 100 | 0 | 732 | 17 552 | |
| SA-SSR | 338:47 | 33:55 | 18 166 | 18 166 | 18 166 | 18 166 | 18 166 | 100 | 0 | 442 | 17 842 | |
| SSR-Pipeline | 611:55 | 611:55 | 19 173 282 | 17 301 120 | 18 044 | 18 044 | 18 044 | 100 | 0 | 913 | 17 371 | |
| SSRIT | 1:29 | 1:29 | 87 073 | 74 121 | 18 284 | 18 284 | 18 284 | 100 | 0 | 0 | 18 284 | |
| TRF | 2:09 | 2:09 | 422 851 | 411 644 | 42 157 | 13 872 | 17 307 | 41.05 | 0 | 1560 | 16 724 | |
| GMATo | 3:30 | 3:30 | 26 512 280 | 21 624 294 | 50 401 | 50 401 | 50 139 | 99 | 0 | 23 086 | 34 416 | |
| Kmer-SSR | 3:26 | 0:19 | 57 502 | 57 502 | 57 502 | 57 502 | 57 502 | 100 | NA | NA | NA | |
| MREPS | 0:14 | 0:14 | 94 875 | 94 875 | 57 502 | 57 502 | 57 502 | 100 | 0 | 0 | 57 502 | |
| PRoGeRF | 37:55 | 37:55 | 8 071 102 | 8 020 213 | 32 043 | 31 989 | 32 004 | 100 | 0 | 25 588 | 31 914 | |
| QDD | 8:51 | 8:51 | 216 943 | 216 943 | 55 470 | 55 470 | 55 470 | 100 | 0 | 3002 | 54 500 | |
| SA-SSR | 1324:33 | 167:48 | 56 833 | 56 833 | 56 833 | 56 833 | 56 833 | 100 | 0 | 1214 | 56 288 | |
| SSR-Pipeline | 632:10 | 632:10 | 26 973 434 | 23 032 838 | 56 729 | 56 729 | 56 729 | 100 | 0 | 1793 | 55 709 | |
| SSRIT | 2:00 | 2:00 | 310 109 | 252 223 | 57 502 | 57 502 | 57 502 | 100 | 0 | 0 | 57 502 | |
| TRF | 8:52 | 8:52 | 1 022 145 | 990 316 | 181 973 | 25 451 | 45 773 | 25.15 | 0 | 14 546 | 42 956 | |
| GMATo | 1:12 | 1:12 | 9 501 860 | 7 535 749 | 22 546 | 22 546 | 22 362 | 99 | 0 | 8463 | 13 636 | |
| Kmer-SSR | 1:10 | 0:13 | 22 099 | 22 099 | 22 099 | 22 099 | 22 099 | 100 | NA | NA | NA | |
| MREPS | 0:05 | 0:05 | 26 862 | 26 862 | 22 099 | 22 099 | 22 099 | 100 | 0 | 0 | 22 099 | |
| PRoGeRF | 8:14 | 8:14 | 7 696 269 | 7 695 012 | 21 729 | 21 668 | 21 684 | 100 | 0 | 494 | 21 605 | |
| QDD | 7:43 | 7:43 | 49 016 | 49 016 | 21 805 | 21 805 | 21 805 | 100 | 0 | 908 | 21 191 | |
| SA-SSR | 2075:03 | 648:00 | 21 862 | 21 862 | 21 862 | 21 862 | 21 862 | 100 | 0 | 690 | 21 409 | |
| SSR-Pipeline | 1958:54 | 1958:54 | 8 948 450 | 7 954 899 | 21 857 | 21 857 | 21 857 | 100 | 0 | 987 | 21 112 | |
| SSRIT | 0:43 | 0:43 | 69 645 | 58 065 | 22 099 | 22 099 | 22 099 | 100 | 0 | 0 | 22 099 | |
| TRF | 5:03 | 5:03 | 293 378 | 283 764 | 40 343 | 11 255 | 16 911 | 41.92 | 0 | 6144 | 15 955 | |
| GMATo | 1:02 | 1:02 | 8 810 607 | 7 126 425 | 82 643 | 82 643 | 82 526 | 100 | 0 | 28 714 | 62 967 | |
| Kmer-SSR | 1:12 | 0:08 | 91 681 | 91 681 | 91 681 | 91 681 | 91 681 | 100 | NA | NA | NA | |
| MREPS | 0:05 | 0:05 | 121 835 | 121 835 | 91 681 | 91 681 | 91 681 | 100 | 0 | 0 | 91 681 | |
| PRoGeRF | 11:42 | 11:42 | 4 629 786 | 4 604 499 | 60 176 | 60 174 | 60 174 | 100 | 0 | 31 707 | 59 974 | |
| QDD | 3:44 | 3:44 | 171 686 | 171 686 | 88 017 | 88 017 | 88 017 | 100 | 0 | 5295 | 86 386 | |
| SA-SSR | 723:31 | 236:01 | 90 700 | 90 700 | 90 700 | 90 700 | 90 700 | 100 | 0 | 1635 | 90 046 | |
| SSR-Pipeline | 246:35 | 246:35 | 9 292 900 | 7 397 561 | 90 810 | 90 810 | 90 810 | 100 | 0 | 1759 | 89 922 | |
| SSRIT | 0:42 | 0:42 | 265 894 | 202 531 | 91 681 | 91 681 | 91 681 | 100 | 0 | 0 | 91 681 | |
| TRF | 17:30 | 17:30 | 642 904 | 602 301 | 178 902 | 40 772 | 75 742 | 42.34 | 0 | 18 962 | 72 719 | |
| GMATo | 0:59 | 0:59 | 7 981 869 | 6 500 395 | 7739 | 7739 | 7736 | 100 | 0 | 3259 | 5528 | |
| Kmer-SSR | 0:58 | 0:10 | 8 787 | 8 787 | 8787 | 8787 | 8787 | 100 | NA | NA | NA | |
| MREPS | 0:04 | 0:04 | 12 885 | 12 885 | 8787 | 8787 | 8787 | 100 | 0 | 0 | 8787 | |
| PRoGeRF | 7:32 | 7:32 | 6 639 989 | 6 639 933 | 8669 | 8668 | 8668 | 100 | 0 | 131 | 8656 | |
| QDD | 4:29 | 4:29 | 27 774 | 27 774 | 8319 | 8319 | 8319 | 100 | 0 | 621 | 8166 | |
| SA-SSR | 642:36 | 91:59 | 8719 | 8719 | 8719 | 8719 | 8719 | 100 | 0 | 152 | 8635 | |
| SSR-Pipeline | 1498:06 | 1498:06 | 7 763 141 | 6 874 175 | 8720 | 8720 | 8720 | 100 | 0 | 253 | 8534 | |
| SSRIT | 0:35 | 0:35 | 39 472 | 35 941 | 8787 | 8787 | 8787 | 100 | 0 | 0 | 8787 | |
| TRF | 1:53 | 1:53 | 223 938 | 215 818 | 22 730 | 6132 | 8192 | 36.04 | 0 | 891 | 7896 | |
| GMATo | 0:23 | 0:23 | 3 281 592 | 2 674 303 | 1101 | 1101 | 1101 | 100 | 0 | 588 | 887 | |
| Kmer-SSR | 0:23 | 0:04 | 1475 | 1475 | 1475 | 1475 | 1475 | 100 | NA | NA | NA | |
| MREPS | 0:02 | 0:02 | 2293 | 2293 | 1475 | 1475 | 1475 | 100 | 0 | 0 | 1475 | |
| PRoGeRF | 3:43 | 3:43 | 1 065 515 | 1 065 510 | 492 | 492 | 492 | 100 | 0 | 988 | 487 | |
| QDD | 0:47 | 0:47 | 8672 | 8672 | 1368 | 1368 | 1368 | 100 | 0 | 139 | 1336 | |
| SA-SSR | 338:50 | 60:55 | 1430 | 1430 | 1430 | 1430 | 1430 | 100 | 0 | 57 | 1418 | |
| SSR-Pipeline | 9:32 | 9:32 | 3 124 288 | 2 820 560 | 1427 | 1427 | 1427 | 100 | 0 | 73 | 1402 | |
| SSRIT | 0:14 | 0:14 | 12 276 | 10 386 | 1475 | 1475 | 1475 | 100 | 0 | 0 | 1475 | |
| TRF | 0:26 | 0:26 | 62 616 | 61 038 | 4634 | 755 | 1242 | 26.80 | 0 | 290 | 1185 | |
| GMATo | 9:44 | 9:44 | 76 711 216 | 61 830 463 | 181 301 | 181 301 | 180 734 | 100 | 0 | 72 304 | 127 524 | |
| Kmer-SSR | 9:33 | 1:18 | 199 828 | 199 828 | 199 828 | 199 828 | 199 828 | 100 | NA | NA | NA | |
| MREPS | 0:39 | 0:39 | 284 389 | 284 389 | 199 828 | 199 828 | 199 828 | 100 | 0 | 0 | 199 828 | |
| PRoGeRF | 87:13 | 87:13 | 44 944 317 | 44 865 988 | 140 872 | 140 753 | 140 785 | 100 | 0 | 59 518 | 140 310 | |
| QDD | 44:45 | 44:45 | 535 085 | 535 085 | 192 988 | 192 988 | 192 988 | 100 | 0 | 10 697 | 189 131 | |
| SA-SSR | 5443:20 | 1238:38 | 197 710 | 197 710 | 197 710 | 197 710 | 197 710 | 100 | 0 | 4190 | 195 638 | |
| SSR-Pipeline | 4957:12 | 4957:12 | 75 275 495 | 65 381 153 | 197 587 | 197 587 | 197 587 | 100 | 0 | 5778 | 194 050 | |
| SSRIT | 5:43 | 5:43 | 784 469 | 633 267 | 199 828 | 199 828 | 199 828 | 100 | 0 | 0 | 199 828 | |
| TRF | 35:53 | 35:53 | 2 667 832 | 2 564 881 | 470 739 | 98 237 | 165 167 | 35.09 | 0 | 42 393 | 157 435 | |
Note: This table shows that Kmer-SSR reports all possible SSRs in reasonable runtime with more refined user control and filtering options relative to the other software tools. We ran all comparisons on a 2.3 GHz Intel Haswell processor. Although each algorithm was given the same amount of memory and CPUs, due to hardware variability of the CPU, runtimes could vary by up to 20%. Also, MREPS required pre-processing of the FASTA files, which typically added anywhere from a few seconds to several minutes to the runtime (not depicted in the table), depending on the pre-processing approach used. Similarly, we did not include the time required to edit SSRIT and QDD’s source code in order for their programs to function over the period sizes in these tests. SSR-Pipeline could not finish searching for 1-mers in chromosome 6 of Anolis carolinensis in 21 days of runtime. Accordingly, the chromosome was split into 24 approximately equal-sized chunks (i.e. approximately 3.3 Mb each) and each chunk was searched for 1-mers separately by SSR-Pipeline. The required time for each chunk was summed (approximately 5 hours) and used in place of 504 hours (21 days).
The SSRs After Adjustments column is the number of SSRs that remained after we removed or altered reported SSRs to make the comparison simpler. We removed SSRs that were exact duplicates, duplicates with only the repeat number varying, duplicates that varied only by cycle (e.g. ACG versus CGA with the same number of repeats right next to each other), entirely surrounded by another SSR, or not atomic (e.g. ATAT repeated 2 times instead of AT repeated 4 times). SSRs that shared the same base and overlapped were combined into one SSR (e.g. AT repeated 8 times at position 1 and AT repeated 6 times at position 11 would be combined to AT repeated 11 times at position 1).
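Two of these adjustments, the atomicity check and the merging of overlapping SSRs with the same base, can be sketched in Python as follows. The `SSR` record, the function names, 1-based positions, and the requirement that the two SSRs be in phase (start positions differing by a multiple of the period) are assumptions for illustration, not the scripts used for the comparison.

```python
from typing import NamedTuple, Optional


class SSR(NamedTuple):
    start: int    # 1-based position of the first base (assumed convention)
    unit: str     # repeated base unit, e.g. "AT"
    repeats: int  # number of times the unit is repeated

    @property
    def end(self) -> int:
        """1-based position of the last base covered by the SSR."""
        return self.start + len(self.unit) * self.repeats - 1


def is_atomic(unit: str) -> bool:
    """True if `unit` cannot be written as repeats of a shorter unit."""
    n = len(unit)
    return not any(
        n % p == 0 and unit == unit[:p] * (n // p) for p in range(1, n)
    )


def merge(a: SSR, b: SSR) -> Optional[SSR]:
    """Combine two overlapping SSRs with the same unit, else return None."""
    first, second = sorted((a, b))  # tuples sort by start position first
    k = len(first.unit)
    same_unit = first.unit == second.unit
    overlaps = second.start <= first.end
    in_phase = (second.start - first.start) % k == 0  # assumed requirement
    if not (same_unit and overlaps and in_phase):
        return None
    end = max(first.end, second.end)
    return SSR(first.start, first.unit, (end - first.start + 1) // k)


if __name__ == "__main__":
    # The overlap example from the text: AT x8 at 1 and AT x6 at 11.
    print(merge(SSR(1, "AT", 8), SSR(11, "AT", 6)))
    print(is_atomic("ATAT"), is_atomic("AT"))
```

Applied to the overlap example in the text, `merge` spans positions 1 to 22, which is AT repeated 11 times at position 1.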
The SSRs In Range column is the number of SSRs from the previous column that were 16 nt or longer and had a period size of 1–7 (inclusive).
The Number Correct column is the number of SSRs In Range that were actually present in the sequence.
The Number Correct & Fixed column is the Number Correct plus a few incorrect SSRs that we were able to fix (e.g. a program might report an AT repeated 30 times, but it only repeated 20 times in the sequence).
The Percent Correct & Fixed column is the percentage of SSRs In Range that were correct or fixed.
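As a worked check of these column definitions, the short Python sketch below applies the range criterion (16 nt or longer, period size 1 to 7) and recomputes the percentage for the first TRF row of the table. The function names are hypothetical, and rounding to two decimals is an assumption based on how the TRF values are printed.

```python
def in_range(unit: str, repeats: int, min_length: int = 16,
             min_period: int = 1, max_period: int = 7) -> bool:
    """True if an SSR meets the length and period bounds used in the table."""
    period = len(unit)
    return min_period <= period <= max_period and period * repeats >= min_length


def percent_correct_and_fixed(correct_and_fixed: int, ssrs_in_range: int) -> float:
    """Percentage of in-range SSRs that were correct or could be fixed."""
    return round(100 * correct_and_fixed / ssrs_in_range, 2)


if __name__ == "__main__":
    # First TRF row: 42 157 SSRs In Range, 17 307 Number Correct & Fixed.
    print(percent_correct_and_fixed(17307, 42157))
    # AT x8 is exactly 16 nt, so it is in range; AT x7 (14 nt) is not.
    print(in_range("AT", 8), in_range("AT", 7))
```

For that row the function returns 41.05, matching the value reported in the table.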
We documented each SSR algorithm’s basic usages and options based on the documentation from each algorithm
| Software | GUI | Output | Language | Algorithm | Type | Period | Repeats | Multi-threaded | Search for Specific SSRs |
|---|---|---|---|---|---|---|---|---|---|
| Kmer-SSR | | TSV | C++ | K-mer Decomposition | Exact | 1+ | 2+ | X | X |
| SA-SSR | | TSV | C++ | Combinatorial | Exact | 1+ | 2+ | X | X |
| GMATo | X | TSV | Perl & Java | Regular Expressions | Exact | 1–10 | 2+ | | |
| MREPS | | Text | C | Combinatorial | Inexact | 1+ | 2+ | | |
| PRoGeRF | Web | TSV | Perl | ? | Inexact | 1–12 | 2+ | | |
| QDD | | SCSV | Perl | ? | Exact | 1–6 | 5+ | | |
| SSR-Pipeline | | FASTA | Python | ? | Exact | 1–25 | 2+ | | |
| SSRIT | | TSV | Perl | Regular Expressions | Exact | 2–4 | 2+ | | |
| TRF | X | Text | ? | Heuristic | Inexact | 1+ | 2+ | | |
Note: All algorithms can run in a Linux environment, accept command line options and take a FASTA file as input.
Columns in the table are as follows: GUI = graphical user interface available. Output = the file type the algorithm creates: a text file, tab-separated values (TSV), semicolon-separated values (SCSV), or a FASTA file. The language in which the program is written is followed by the method that the algorithm uses and the type of SSRs it can find (exact or inexact). The supported SSR period sizes and minimum SSR repeat numbers are also listed. Finally, we list whether the algorithm is multithreaded or configurable to search for specific SSRs. Only Kmer-SSR and SA-SSR are multithreaded and configurable to search for specific SSRs.