B D Pickett, S M Karlinsey, C E Penrod, M J Cormier, M T W Ebbert, D K Shiozawa, C J Whipple, P G Ridge.
Abstract
Simple Sequence Repeats (SSRs) are used to address a variety of research questions in fields such as population genetics, phylogenetics, and forensics, owing to their high mutability within and between species. Here, we present an innovative algorithm, SA-SSR, based on suffix and longest common prefix arrays for efficiently detecting SSRs in large sets of sequences. Existing SSR detection applications are hampered by one or more limitations (e.g. speed, accuracy, ease of use). Our algorithm addresses these challenges while being the most comprehensive and correct SSR detection software available. SA-SSR is 100% accurate and detected >1000 more SSRs than the second best algorithm, while offering greater control to the user than any existing software.
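To illustrate how suffix and longest common prefix (LCP) arrays can expose perfect tandem repeats, the following is a minimal Python sketch. It is not the SA-SSR implementation. It uses the standard observation that a period-k repeat starts at position p when the suffixes at p and p + k share a common prefix of at least k characters, and it reads that shared length from the LCP array. All function names, the naive suffix-array construction, and the defaults (periods up to 7, at least 16 nt counted in whole copies) are assumptions chosen for readability rather than speed.

```python
def suffix_array(s):
    # Naive construction (fine for a demo): sort start positions by suffix text.
    return sorted(range(len(s)), key=lambda i: s[i:])


def rank_and_lcp(s, sa):
    # Kasai's algorithm: lcp[r] is the common-prefix length of the suffixes
    # at suffix-array ranks r-1 and r (lcp[0] stays 0 and is never read).
    n = len(s)
    rank = [0] * n
    for r, p in enumerate(sa):
        rank[p] = r
    lcp = [0] * n
    h = 0
    for p in range(n):
        r = rank[p]
        if r == 0:
            h = 0
            continue
        q = sa[r - 1]
        while p + h < n and q + h < n and s[p + h] == s[q + h]:
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1
    return rank, lcp


def find_perfect_ssrs(s, max_period=7, min_length=16):
    """Return sorted (start, motif, copies) for perfect tandem repeats in s."""
    sa = suffix_array(s)
    rank, lcp = rank_and_lcp(s, sa)

    def common_extension(i, j):
        # Shared prefix of suffixes i and j = minimum LCP between their ranks.
        lo, hi = sorted((rank[i], rank[j]))
        return min(lcp[lo + 1:hi + 1])

    hits = []
    n = len(s)
    for k in range(1, max_period + 1):
        for p in range(n - k):
            if p > 0 and s[p - 1] == s[p - 1 + k]:
                continue  # the repeat extends further left; report it only once
            motif = s[p:p + k]
            if any(k % d == 0 and motif == motif[:d] * (k // d) for d in range(1, k)):
                continue  # motif is itself a repeat of a shorter unit
            ext = common_extension(p, p + k)
            copies = (k + ext) // k  # whole copies only (assumption)
            if ext >= k and copies * k >= min_length:
                hits.append((p, motif, copies))
    return sorted(hits)


# Toy input (hypothetical): 9 copies of AC, then a run of 20 A's.
print(find_perfect_ssrs("GG" + "AC" * 9 + "T" + "A" * 20 + "C"))
# [(2, 'AC', 9), (21, 'A', 20)]
```

The minimum over a slice of the LCP array costs linear time per query, so this sketch only suits short inputs; a production tool would need a faster suffix-array construction and constant-time range-minimum queries.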
Year: 2016 PMID: 27170037 PMCID: PMC5013907 DOI: 10.1093/bioinformatics/btw298
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Summary of results from comparisons of SA-SSR with other SSR detection algorithms
| Software | CPU time | Real time | SSRs reported | SSRs in range | Number correct | Percent correct | Comparison with SA-SSR: SSRs unique to software | Comparison with SA-SSR: SSRs unique to SA-SSR | Comparison with SA-SSR: shared SSRs |
|---|---|---|---|---|---|---|---|---|---|
| GMATo | 329:18 | 329:18 | 72 713 858 | 15 284 | 6617 | 43.29 | 20 | 34 237 | 3851 |
| MREPS | 393:02 | 393:02 | 75 552 | 37 076 | 37 076 | 100 | 71 | 1340 | 36 748 |
| PRoGeRF | 3194:18 | 3194:18 | 5 457 129 | 2278 | 2268 | 99.56 | 2 | 35 864 | 2224 |
| QDD | 24:17 | 24:17 | 53 248 | 17 418 | 17 418 | 100 | 10 | 20 759 | 17 329 |
| SA-SSR | 28 820:12 | 2416:32 | 38 088 | 38 088 | 38 088 | 100 | NA | NA | NA |
| SSR-Pipeline | 1411:21 | 1411:21 | 60 344 067 | 36 398 | 36 398 | 100 | 68 | 3047 | 35 041 |
| SSRIT | 2:12 | 2:12 | 13 217 | 13 217 | 13 217 | 100 | 5 | 24 951 | 13 137 |
| TRF | 12:14 | 12:14 | 2 035 715 | 147 284 | 33 876 | 23.00 | 12 | 7423 | 30 665 |
This is a combination of results across each of the genomes included in the comparison. For more detailed results see Supplementary Tables S2, S4–S31.
a. MREPS timing includes the pre- and post-processing time for each genome necessary to adjust positions to account for removing ‘incorrect symbols’ and Ns. The additional times are an average of multiple approaches.
b. We only considered SSRs with period sizes 1–7 (inclusive) and lengths of at least 16 nucleotides (nt). The difference between the number of SSRs in range and reported is due exclusively to SSR length (less than 16 nt) and period size (greater than 7).
c. Whenever possible, we salvaged correct SSRs that were inside incorrect SSRs reported by other software packages. For example, in Drosophila melanogaster, we recovered three for PRoGeRF and 8408 for TRF. To illustrate, in sequence JXOZ01000043.1, TRF reports a CT repeated 36 times at position 2171. While TRF does correctly identify a low-complexity region with many CT repeats, there are not 36 perfect repeats in a row. In this case, we salvaged two perfect CT regions, each repeating 8 times.
d. Detailed pairwise comparisons can be found in Supplementary Tables S4–S31.
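The in-range filter and the salvage step described in the footnotes can be sketched in Python as follows. This is an illustrative reading of those notes, not the authors' evaluation code: the function names are hypothetical, the toy region is invented (it is not the real JXOZ01000043.1 sequence), and length is assumed to be period times the number of whole copies. The percent-correct helper assumes the column is "Number correct" divided by "SSRs in range", which agrees with the GMATo and PRoGeRF rows.

```python
def perfect_runs(region, motif):
    """Return (offset, copies) for each run of back-to-back exact motif copies."""
    k = len(motif)
    runs = []
    i = 0
    while i + k <= len(region):
        if region[i:i + k] == motif:
            start = i
            while region[i:i + k] == motif:
                i += k  # consume one whole copy at a time
            runs.append((start, (i - start) // k))
        else:
            i += 1
    return runs


def in_range(period, copies, max_period=7, min_length=16):
    # Period sizes 1-7 (inclusive) and at least 16 nt, as stated in the footnote.
    return 1 <= period <= max_period and period * copies >= min_length


def salvage(region, motif):
    """Keep only the perfect runs inside a reported region that are in range."""
    return [(off, n) for off, n in perfect_runs(region, motif)
            if in_range(len(motif), n)]


def percent_correct(number_correct, ssrs_in_range):
    # Assumed definition of the "Percent correct" column.
    return round(100 * number_correct / ssrs_in_range, 2)


# Hypothetical imperfect CT region: two perfect 8-copy runs and a short tail.
toy_region = "CT" * 8 + "CA" + "CT" * 8 + "TT" + "CT" * 3
print(salvage(toy_region, "CT"))      # [(0, 8), (18, 8)]
print(percent_correct(6617, 15284))   # 43.29 (GMATo row)
print(percent_correct(2268, 2278))    # 99.56 (PRoGeRF row)
```

In the toy region the trailing three-copy run is dropped because 6 nt falls below the 16 nt threshold, while each 8-copy CT run (16 nt) is kept.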