
Kmer-SSR: a fast and exhaustive SSR search algorithm.

Brandon D Pickett¹, Justin B Miller¹, Perry G Ridge¹.

Abstract

MOTIVATION: One of the main challenges with bioinformatics software is that the size and complexity of datasets necessitate trading speed for accuracy or completeness. To combat this problem of computational complexity, a plethora of heuristic algorithms have arisen that report a 'good enough' solution to biological questions. However, in instances such as Simple Sequence Repeats (SSRs), a 'good enough' solution may not accurately portray results in population genetics, phylogenetics and forensics, which require accurate SSRs to calculate intra- and inter-species interactions.
RESULTS: We present Kmer-SSR, which finds all SSRs faster than most heuristic SSR identification algorithms in a parallelized, easy-to-use manner. The exhaustive Kmer-SSR option has 100% precision and 100% recall and accurately identifies every SSR of any specified length. To identify more biologically pertinent SSRs, we also developed several filters that allow users to easily view a subset of SSRs based on user input. Kmer-SSR, coupled with the filter options, accurately and intuitively identifies SSRs quickly and in a more user-friendly manner than any other SSR identification algorithm.
AVAILABILITY AND IMPLEMENTATION: The source code is freely available on GitHub at https://github.com/ridgelab/Kmer-SSR. CONTACT: perry.ridge@byu.edu.

Year: 2017 PMID: 28968741 PMCID: PMC5860095 DOI: 10.1093/bioinformatics/btx538

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Simple sequence repeats (SSRs) are short repetitive regions of DNA where at least one base is tandemly repeated many times due to slipped-strand mispairing and errors occurring in DNA replication, repair, or recombination (Levinson and Gutman, 1987). For decades, SSRs have been studied to determine phenotypic differences caused by increased copy numbers of short repetitive sequences (Kashi and King, 2006). Moreover, SSRs account for quantitative genetic variation and phenotypic differences without lowering species fitness (Kashi et al., 1997). SSR concentration varies not only between species but also between chromosomes within the same species, and this variation cannot be explained by the nucleotide composition of the sequences (Katti et al., 2001). Because SSRs reveal characteristic functions of DNA replication, recombination and repair, they are important in studying biological systems interactions, as well as in studying repeat expansion-based diseases with next-generation sequencing data (Kashi and King, 2006).

Many different approaches have been used to identify SSRs. Here, we propose the use of k-mers. The term k-mer refers to a subsequence of length ‘k’ derived from a given sequence, while k-mer decomposition refers to all possible substrings of length ‘k’ that can be made from a sequence. Uses for k-mer decomposition have previously been outlined in areas such as genome assembly and machine learning (Chikhi and Medvedev, 2014; Ghandi et al.). Although k-mers have been used to identify similar subsequences, as in Han et al., to our knowledge SSR identification has never been attempted through k-mer decomposition.

2 Materials and methods

2.1 Overview

Kmer-SSR utilizes k-mer decomposition to provide an exhaustive or filtered approach to finding all SSRs in a given sequence (Figs 1 and 2). Our version of k-mer decomposition identifies all subsequences of length ‘k’ while tracking the start position of each k-mer. The user defines the k-mer lengths as the SSR period lengths to search. Kmer-SSR minimizes the usage of random access memory (RAM) by storing a k-mer only when it is identical to the k-mer one SSR period (k bases) before it. If a k-mer is not identical to the k-mer found k bases previously, the previously stored k-mers are discarded and k-mer decomposition continues for the rest of the sequence.
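The idea in this overview can be illustrated with a minimal Python sketch. It is not the Kmer-SSR implementation (which is written in C++); the function names `kmer_decomposition` and `tandem_runs` are hypothetical, and the sketch deliberately omits the atomicity check, backtracking and filters described in later sections.

```python
def kmer_decomposition(seq, k):
    """Return every length-k substring of seq together with its start position."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def tandem_runs(seq, k, min_repeats=2):
    """Report (start, unit, repeats) for runs in which each k-mer equals the
    k-mer found k bases earlier. Only the current candidate is kept."""
    runs = []
    i = 0
    while i + k * min_repeats <= len(seq):
        unit = seq[i:i + k]
        repeats = 1
        j = i + k
        # Extend the run while the next k-mer matches the current unit.
        while seq[j:j + k] == unit:
            repeats += 1
            j += k
        if repeats >= min_repeats:
            runs.append((i, unit, repeats))
            i = j      # simplification: resume just past the run
        else:
            i += 1     # candidate discarded; move one base to the right
    return runs


print(kmer_decomposition("ACGT", 2))
print(tandem_runs("GGACACACTT", 2))
```

Because a candidate is dropped as soon as the next k-mer differs, the sketch never holds more than one potential repeat at a time, which mirrors the memory argument made above.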

Fig. 1.

Conceptual representation of Kmer-SSR. Although we implement some filters and tricks to speed up Kmer-SSR runtime, each SSR is identified through kmer decomposition, which allows the identification of instances when the same SSR period occurs k bases from the previously identified SSR period.

2.2 Memory requirements

We used the following techniques to limit memory requirements:

Identify SSRs from left to right: Kmer-SSR checks each position, starting at the leftmost position of the sequence, for each SSR period size (i.e. k-mer length) given by the user. This allows us to store only a single potential SSR at a time and immediately either discard it if it is not repeated or write it to a file if it is a valid SSR.

Identify SSRs with the largest period size first: Since Kmer-SSR does not store previously identified SSRs in memory, it must search for SSRs in a specific order, or else risk reporting SSRs that are fully enclosed within larger SSRs. To avoid this issue, we take the period sizes given by the user and search from the longest period size to the shortest (e.g. if the user wants to search for 2-mers and 7-mers, we search for all 7-mer SSRs in the sequences before we search for 2-mer SSRs). When an SSR is discovered, an atomicity check determines whether the k-mer can be broken down into a smaller repeating subsequence. An SSR is considered atomic if no smaller SSRs exist inside the first period. For example, ATATATAT would be identified as a 4-mer (ATAT) repeated twice, but ATAT is not atomic because AT (repeated twice) occurs within the first period. The 4-mer is therefore invalid and is not reported; if the user also requested 2-mers, the same region would be discovered again as a 2-mer (AT) repeated four times.

When an atomic (i.e. valid) SSR is discovered, the iterator moves to the position just past the SSR, minus the current period size, to ensure that overlapping SSRs are identified. For example, ACAACAACACACACAC has ACA repeated three times starting at position 0 (positions 0–8). Additionally, AC repeats five times starting at position 6. After finding the ACA repeat, we would miss the full AC repeat if we skipped to the end of the ACA repeat and resumed searching from there. Only by backtracking as described above (9 − 3 = 6) do we find the full AC repeat. Note that the nucleotides between positions 0 and 5 need not be searched for SSRs because Kmer-SSR has already found SSRs with larger period sizes than the current period size. In other words, the maximum possible overlap between the current SSR (ACA) and an adjacent following SSR is k (three in this example), which removes the need to search from the start of a valid SSR to k bases from its end.

Create a Boolean filter array: To ensure that SSRs are unique and do not end in the same positions, we created a Boolean filter array of the same length as the sequence being analyzed, with every value initialized to false. In C++, the implementation of this array requires only one bit per position, so the memory requirement is nominal. When an SSR is discovered, we first ensure that at least one position in the first or last SSR period on either end of the SSR is false in the Boolean array. If at least one such position is false, we set the values for all positions in the SSR to true. The filter allows us to ignore completely overlapping SSRs because the positions at the ends of such SSRs will already be set to true.

By utilizing the above-mentioned methods, we were able to limit the amount of RAM needed to O(n), where n is the sequence length, and the constant is slightly more than one byte per base (one byte to store each sequence base and one bit allocated in the Boolean filter for each base).
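The three techniques can be combined in a short Python sketch. This is an illustration under stated assumptions rather than the published C++ code: the names `is_atomic` and `find_ssrs` are hypothetical, the outer loop runs over period sizes (largest first) with an inner left-to-right scan, a Python list of booleans stands in for the one-bit-per-base array, and a period is treated as non-atomic when it is a whole-number repetition of a shorter unit.

```python
def is_atomic(unit):
    """True if unit is not a whole-number repetition of a shorter substring."""
    n = len(unit)
    return not any(n % p == 0 and unit[:p] * (n // p) == unit
                   for p in range(1, n))


def find_ssrs(seq, periods, min_repeats=2):
    """Return (start, unit, repeats) tuples using 0-based start positions."""
    if min_repeats < 2:
        raise ValueError("min_repeats must be at least 2")
    covered = [False] * len(seq)                   # Boolean filter array
    found = []
    for k in sorted(set(periods), reverse=True):   # largest period size first
        i = 0
        while i + k * min_repeats <= len(seq):     # left-to-right scan
            unit = seq[i:i + k]
            repeats, j = 1, i + k
            while seq[j:j + k] == unit:
                repeats += 1
                j += k
            if repeats < min_repeats or not is_atomic(unit):
                i += 1
                continue
            # Keep the SSR only if some base in its first or last period
            # has not already been claimed by a previously reported SSR.
            edges = list(range(i, i + k)) + list(range(j - k, j))
            if not all(covered[p] for p in edges):
                for p in range(i, j):
                    covered[p] = True
                found.append((i, unit, repeats))
            i = j - k                              # just past the SSR, minus k
    return found


print(find_ssrs("ACAACAACACACACAC", [2, 3]))
```

On the example sequence from the text, the sketch reports ACA repeated three times at position 0 and AC repeated five times at position 6. With this loop order the AC repeat is found in the separate 2-mer pass, so the backtracking step here mainly guards against missing an overlapping SSR of the same period size.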

2.3 SSR filters

Next, we implemented a comprehensive filter that allows users to control the output of Kmer-SSR based on atomicity, cyclic duplicates, enclosed SSRs, minimum SSR length and specific SSR period sizes. Pseudocode for Kmer-SSR is in Figure 2. The following are different filters that are optionally applied to the output of Kmer-SSR:

Fig. 2.

Pseudocode for the Kmer-SSR algorithm. The function passesBooleanFilter ensures SSRs are not duplicates of previously reported SSRs. The function passesUserFilters (function not shown) completes other user-specified options, which may include: minimum SSR length, minimum and maximum number of periods, finding specific SSRs and sequence length bounds

Atomicity check: The atomicity check ensures that the smallest period size for each SSR is reported. For instance, if an ATAT repeats four times, it would be reported as an AT repeated eight times because AT is the smallest period size within ATAT.

Cyclic duplicates: Many SSRs create equally viable SSRs with slightly different positions reported. For instance, in the sequence ATATATATATATATATA, it is arguably equally valid to report the AT repeated eight times starting at position zero as it would be to report TA repeating eight times starting at position one. To avoid duplicate reporting of cyclic duplicates and ensure the longest SSR is always reported, we choose and report only the leftmost SSR. So, in this instance, only the AT repeated eight times would be reported.

Enclosed SSRs: Occasionally, SSRs might be completely enclosed within other SSRs. For example, in the sequence TAAAATTAAAATTAAAAT, the SSR TAAAAT is repeated three times, but within each TAAAAT there is an A that repeats four times. In this case, we only report the longest SSR, TAAAAT, repeated three times.

SSR length: We allow the user to input minimum and maximum SSR lengths via command line options. By default, SSRs are only reported if they are at least 16 nucleotides long.

Set specific period sizes: We allow the user to input specific period sizes to be checked (e.g. 1, 3, 5 would look for SSRs with period sizes of one, three and five), or ranges of period sizes (e.g. 1–7 would look for SSRs with period sizes one through seven). By default, Kmer-SSR reports SSRs of period sizes one through seven. SSRs outside of the user specified range are not reported.

Number of repeats: We allow the user to input minimum and maximum numbers of repeats via command line options. By default, SSRs must repeat at least twice to be reported.

Enumerated SSRs: If the user is interested in a very limited set of SSRs, they may specify those via a command line option and no other SSRs will be reported.

Sequence length: The user may specify minimum and maximum bounds on the length of an input sequence, outside of which the program will not search or report SSRs. By default, if a sequence is less than 100 bases or more than 500 megabases, it will be ignored.
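As a rough illustration of how the atomicity, period size, SSR length and repeat number filters might interact, consider the Python sketch below. The function names, the comma/range specification syntax accepted by `parse_periods` and the keyword arguments are assumptions made for this example; they are not Kmer-SSR's actual command line options. Only the default values (period sizes 1–7, minimum length 16, minimum of two repeats) come from the text.

```python
def normalize(unit, repeats):
    """Rewrite an SSR using its smallest period, e.g. ATAT x4 becomes AT x8."""
    n = len(unit)
    for p in range(1, n + 1):
        if n % p == 0 and unit[:p] * (n // p) == unit:
            return unit[:p], repeats * (n // p)


def parse_periods(spec):
    """Parse a specification such as '1,3,5' or '1-7' into sorted sizes."""
    sizes = set()
    for part in spec.replace("\u2013", "-").split(","):   # accept en dashes
        part = part.strip()
        if "-" in part:
            low, high = part.split("-")
            sizes.update(range(int(low), int(high) + 1))
        else:
            sizes.add(int(part))
    return sorted(sizes)


def passes_user_filters(unit, repeats, periods=range(1, 8), min_length=16,
                        max_length=None, min_repeats=2, max_repeats=None,
                        wanted=None):
    """Apply the period size, length, repeat number and enumeration filters."""
    length = len(unit) * repeats
    if len(unit) not in periods:
        return False
    if length < min_length or (max_length is not None and length > max_length):
        return False
    if repeats < min_repeats or (max_repeats is not None and repeats > max_repeats):
        return False
    return wanted is None or unit in wanted       # enumerated SSRs, if given


print(normalize("ATAT", 4))
print(passes_user_filters("AT", 8), passes_user_filters("AT", 7))
```

With the defaults, an AT repeated eight times (16 nucleotides) passes, while an AT repeated seven times (14 nucleotides) is filtered out by the minimum SSR length.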

3 Results

We conducted pairwise comparisons of Kmer-SSR against the following SSR identification algorithms: GMATo (Wang et al., 2013), MREPS (Kolpakov et al., 2003), PRoGeRF (Lopes et al., 2015), QDD (Meglécz et al.), SA-SSR (Pickett et al.), SSR-Pipeline (Miller et al.), SSRIT (Temnykh et al., 2001) and TRF (Benson, 1999). These comparisons were performed on DNA sequences from six different species (whole genome assembly unless otherwise noted): Anolis carolinensis chromosome 6 (CM000942.1), Chlamydomonas reinhardtii (assembly v5.5) (Merchant et al., 2007), Danio rerio chromosome 25 (CM002909.1), Dictyostelium discoideum (GCA_0000044695.1), Physcomitrella patens chromosome 1 (assembly v3.3) and Saccharomyces cerevisiae (GCA_001634645.1). Table 1 displays the computational time of each algorithm (CPU Time and Real Time columns) and the number of SSRs correctly identified for each dataset.

Table 1.

Comparisons of all nine SSR-identification algorithms across six genomes with period sizes of 1–7 and a minimum SSR length of 16 bases

										Comparison with Kmer-SSR
		CPU Time (mm:ss)	Real Time (mm:ss)	SSRs Reported	^aSSRs After Adjustments	^bSSRs In Range	^cNumber Correct	^dNumber Correct & Fixed	^ePercent Correct & Fixed	SSRs Unique to Software	SSRs Unique to Kmer-SSR	SSRs Shared
Anolis carolinensis (chr 6)	GMATo	2:38	2:38	20 623 008	16 369 297	16 871	16 871	16 870	100	0	8194	10 090
	Kmer-SSR	2:24	0:24	18 284	18 284	18 284	18 284	18 284	100	NA	NA	NA
	MREPS	0:09	0:09	25 639	25 639	18 284	18 284	18 284	100	0	0	18 284
	PRoGeRF	18:07	18:07	16 841 656	16 840 821	17 763	17 762	17 763	100	0	610	17 674
	QDD	19:11	19:11	60 994	60 994	18 009	18 009	18 009	100	0	732	17 552
	SA-SSR	338:47	33:55	18 166	18 166	18 166	18 166	18 166	100	0	442	17 842
	SSR-Pipeline	611:55	611:55	19 173 282	17 301 120	18 044	18 044	18 044	100	0	913	17 371
	SSRIT	1:29	1:29	87 073	74 121	18 284	18 284	18 284	100	0	0	18 284
	TRF	2:09	2:09	422 851	411 644	42 157	13 872	17 307	41.05	0	1560	16 724
Chlamydomonas reinhardtii	GMATo	3:30	3:30	26 512 280	21 624 294	50 401	50 401	50 139	99	0	23 086	34 416
	Kmer-SSR	3:26	0:19	57 502	57 502	57 502	57 502	57 502	100	NA	NA	NA
	MREPS	0:14	0:14	94 875	94 875	57 502	57 502	57 502	100	0	0	57 502
	PRoGeRF	37:55	37:55	8 071 102	8 020 213	32 043	31 989	32 004	100	0	25 588	31 914
	QDD	8:51	8:51	216 943	216 943	55 470	55 470	55 470	100	0	3002	54 500
	SA-SSR	1324:33	167:48	56 833	56 833	56 833	56 833	56 833	100	0	1214	56 288
	SSR-Pipeline	632:10	632:10	26 973 434	23 032 838	56 729	56 729	56 729	100	0	1793	55 709
	SSRIT	2:00	2:00	310 109	252 223	57 502	57 502	57 502	100	0	0	57 502
	TRF	8:52	8:52	1 022 145	990 316	181 973	25 451	45 773	25.15	0	14 546	42 956
Danio rerio (chr 25)	GMATo	1:12	1:12	9 501 860	7 535 749	22 546	22 546	22 362	99	0	8463	13 636
	Kmer-SSR	1:10	0:13	22 099	22 099	22 099	22 099	22 099	100	NA	NA	NA
	MREPS	0:05	0:05	26 862	26 862	22 099	22 099	22 099	100	0	0	22 099
	PRoGeRF	8:14	8:14	7 696 269	7 695 012	21 729	21 668	21 684	100	0	494	21 605
	QDD	7:43	7:43	49 016	49 016	21 805	21 805	21 805	100	0	908	21 191
	SA-SSR	2075:03	648:00	21 862	21 862	21 862	21 862	21 862	100	0	690	21 409
	SSR-Pipeline	1958:54	1958:54	8 948 450	7 954 899	21 857	21 857	21 857	100	0	987	21 112
	SSRIT	0:43	0:43	69 645	58 065	22 099	22 099	22 099	100	0	0	22 099
	TRF	5:03	5:03	293 378	283 764	40 343	11 255	16 911	41.92	0	6144	15 955
Dictyostelium discoideum	GMATo	1:02	1:02	8 810 607	7 126 425	82 643	82 643	82 526	100	0	28 714	62 967
	Kmer-SSR	1:12	0:08	91 681	91 681	91 681	91 681	91 681	100	NA	NA	NA
	MREPS	0:05	0:05	121 835	121 835	91 681	91 681	91 681	100	0	0	91 681
	PRoGeRF	11:42	11:42	4 629 786	4 604 499	60 176	60 174	60 174	100	0	31 707	59 974
	QDD	3:44	3:44	171 686	171 686	88 017	88 017	88 017	100	0	5295	86 386
	SA-SSR	723:31	236:01	90 700	90 700	90 700	90 700	90 700	100	0	1635	90 046
	SSR-Pipeline	246:35	246:35	9 292 900	7 397 561	90 810	90 810	90 810	100	0	1759	89 922
	SSRIT	0:42	0:42	265 894	202 531	91 681	91 681	91 681	100	0	0	91 681
	TRF	17:30	17:30	642 904	602 301	178 902	40 772	75 742	42.34	0	18 962	72 719
Physcomitrella patens (chr 1)	GMATo	0:59	0:59	7 981 869	6 500 395	7739	7739	7736	100	0	3259	5528
	Kmer-SSR	0:58	0:10	8 787	8 787	8787	8787	8787	100	NA	NA	NA
	MREPS	0:04	0:04	12 885	12 885	8787	8787	8787	100	0	0	8787
	PRoGeRF	7:32	7:32	6 639 989	6 639 933	8669	8668	8668	100	0	131	8656
	QDD	4:29	4:29	27 774	27 774	8319	8319	8319	100	0	621	8166
	SA-SSR	642:36	91:59	8719	8719	8719	8719	8719	100	0	152	8635
	SSR-Pipeline	1498:06	1498:06	7 763 141	6 874 175	8720	8720	8720	100	0	253	8534
	SSRIT	0:35	0:35	39 472	35 941	8787	8787	8787	100	0	0	8787
	TRF	1:53	1:53	223 938	215 818	22 730	6132	8192	36.04	0	891	7896
Saccharomyces cerevisiae	GMATo	0:23	0:23	3 281 592	2 674 303	1101	1101	1101	100	0	588	887
	Kmer-SSR	0:23	0:04	1475	1475	1475	1475	1475	100	NA	NA	NA
	MREPS	0:02	0:02	2293	2293	1475	1475	1475	100	0	0	1475
	PRoGeRF	3:43	3:43	1 065 515	1 065 510	492	492	492	100	0	988	487
	QDD	0:47	0:47	8672	8672	1368	1368	1368	100	0	139	1336
	SA-SSR	338:50	60:55	1430	1430	1430	1430	1430	100	0	57	1418
	SSR-Pipeline	9:32	9:32	3 124 288	2 820 560	1427	1427	1427	100	0	73	1402
	SSRIT	0:14	0:14	12 276	10 386	1475	1475	1475	100	0	0	1475
	TRF	0:26	0:26	62 616	61 038	4634	755	1242	26.80	0	290	1185
Combined	GMATo	9:44	9:44	76 711 216	61 830 463	181 301	181 301	180 734	100	0	72 304	127 524
	Kmer-SSR	9:33	1:18	199 828	199 828	199 828	199 828	199 828	100	NA	NA	NA
	MREPS	0:39	0:39	284 389	284 389	199 828	199 828	199 828	100	0	0	199 828
	PRoGeRF	87:13	87:13	44 944 317	44 865 988	140 872	140 753	140 785	100	0	59 518	140 310
	QDD	44:45	44:45	535 085	535 085	192 988	192 988	192 988	100	0	10 697	189 131
	SA-SSR	5443:20	1238:38	197 710	197 710	197 710	197 710	197 710	100	0	4190	195 638
	SSR-Pipeline	4957:12	4957:12	75 275 495	65 381 153	197 587	197 587	197 587	100	0	5778	194 050
	SSRIT	5:43	5:43	784 469	633 267	199 828	199 828	199 828	100	0	0	199 828
	TRF	35:53	35:53	2 667 832	2 564 881	470 739	98 237	165 167	35.09	0	42 393	157 435

Note: This table shows that Kmer-SSR reports all possible SSRs in reasonable runtime with more refined user control and filtering options relative to the other software. We ran all comparisons on a 2.3 GHz Intel Haswell processor. Although each algorithm was given the same amount of memory and the same number of CPUs, runtimes could vary by up to 20% due to hardware variability of the CPU. Also, MREPS required pre-processing of the fasta files, which typically added anywhere from a few seconds to several minutes to the runtime (not depicted in the table), depending on the pre-processing approach used. Similarly, we did not include the time required to edit the source code of SSRIT and QDD so that these programs could function over the period sizes in these tests. SSR-Pipeline could not finish searching for 1-mers in chromosome 6 of Anolis carolinensis in 21 days of runtime. Accordingly, the chromosome was split into 24 approximately equal sized chunks (i.e. approximately 3.3 Mb each) and each chunk was searched for 1-mers separately by SSR-Pipeline. The required time for each chunk was summed (approximately 5 hours) and used in place of 504 hours (21 days).

^aThe SSRs After Adjustments column reflects the number of SSRs that we did not remove or alter for purposes of making the comparison simpler. SSRs that were exact duplicates, duplicates with only the repeat number varying, duplicates that varied only by cycle (e.g. ACG versus CGA with the same number of repeats right next to each other), entirely surrounded by another SSR, or not atomic (e.g. ATAT repeated 4 times instead of AT repeated 8 times) were removed. SSRs that shared the same base and overlapped were combined into one SSR (e.g. AT repeated 8 times at position 1 and AT repeated 6 times at position 11 would be combined to AT repeated 11 times at position 1).

^bThe SSRs In Range column is the number of SSRs from the previous column that were 16 nt or longer and had a period size of 1–7 (inclusive).

^cThe Number Correct column is the number of SSRs In Range that were actually present in the sequence.

^dThe Number Correct & Fixed column is the Number Correct plus a few incorrect SSRs that we were able to fix (e.g. a program might report an AT repeated 30 times, but it only repeated 20 times in the sequence).

^eThe Percent Correct & Fixed column is the percent of SSRs In Range that were correct or fixed.
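Two of the adjustments described in these footnotes are simple enough to sketch in Python. The helpers below are illustrative only: `merge_overlapping` and `percent_correct_and_fixed` are hypothetical names, the requirement that merged SSRs be in phase is an added assumption, and the scripts actually used for the comparison are not shown in the text. Positions are 1-based, as in the footnote example.

```python
def merge_overlapping(ssrs):
    """Combine overlapping SSRs that share a repeat unit.

    ssrs holds (unit, start, repeats) tuples with 1-based start positions.
    Two SSRs are merged only when the later one starts inside the earlier
    one and its start is offset by a whole number of periods.
    """
    merged = []
    for unit, start, repeats in sorted(ssrs, key=lambda r: (r[0], r[1])):
        end = start + len(unit) * repeats - 1          # inclusive end
        if merged:
            m_unit, m_start, m_end = merged[-1]
            in_phase = (start - m_start) % len(unit) == 0
            if unit == m_unit and start <= m_end and in_phase:
                merged[-1] = (m_unit, m_start, max(m_end, end))
                continue
        merged.append((unit, start, end))
    return [(u, s, (e - s + 1) // len(u)) for u, s, e in merged]


def percent_correct_and_fixed(correct_and_fixed, in_range):
    """Percent of SSRs In Range that were correct or could be fixed."""
    return round(100.0 * correct_and_fixed / in_range, 2)


# Footnote example: AT x8 at position 1 and AT x6 at position 11.
print(merge_overlapping([("AT", 1, 8), ("AT", 11, 6)]))
# TRF on Anolis carolinensis (chr 6), values taken from Table 1.
print(percent_correct_and_fixed(17307, 42157))
```

The first call merges the two AT repeats into a single AT repeated 11 times at position 1, and the second reproduces the 41.05 shown for TRF in the first dataset of Table 1.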

Because Kmer-SSR is multithreaded and robust to fasta files with unknown nucleotides, the real time for SSR identification using Kmer-SSR is faster than any other algorithm. Although MREPS reports a faster real time identification of SSRs, the program does not usually run with sequences containing unknown characters. With the addition of the time necessary to make the input fasta files usable for MREPS, it underperformed Kmer-SSR in all six datasets (Table 1, Real Time column). We found that with the exception of TRF, all algorithms tested were 100% accurate in identifying SSRs; however, only Kmer-SSR, MREPS and SSRIT reported all possible filtered SSRs within the range specified for each dataset (Table 1, SSRs In Range column). Although SSRIT has a faster CPU time than Kmer-SSR, it does not have the multithreading capabilities of Kmer-SSR, nor does it allow for querying of SSRs other than period sizes 2–4 without directly editing the algorithm’s source code.

4 Discussion

SSR identification is important in many biological comparisons. It is important to have 100% accuracy in SSR identification because primers often depend on the exact SSR sequence with conserved flanking sequences (Robinson et al.), and phenotypic variations associated with SSRs require an accurate portrayal of a genome. Furthermore, determining the exact SSR copy number is important in species identification and aids in the identification of discrete families and individuals.

Kmer-SSR fills a usability gap in SSR identification. While many SSR identification algorithms exist, it is often difficult to install and use the available algorithms and to read their output. Two of the main strengths of Kmer-SSR are its usability and the easily accessible SSR filters that help answer biological questions.

Kmer-SSR is at least as easy to install as other algorithms. It was implemented in C++, does not require any editing of the source code to find SSRs of different lengths or to filter overlapping SSRs, and provides robust documentation for its command line options. Step-by-step instructions for installing and running Kmer-SSR are available with the algorithm’s source code at http://github.com/ridgelab/Kmer-SSR.

The filters available in Kmer-SSR help answer primary biological questions. Instead of inundating a researcher with duplicate SSRs, Kmer-SSR eliminates overlapping SSRs by reporting only the left-most SSR in each sequence when multiple SSRs are equally valid. Furthermore, longer SSRs are typically more biologically interesting, so completely enclosed SSRs are not included in the output. Importantly, these filters still allow for overlapping SSRs where at least one period is completely outside of the previously reported SSR. These filters set Kmer-SSR apart from all other SSR identification algorithms in both ease of use and utility.
As we compared other algorithms, a few difficulties arose that made it challenging to directly compare the output from each program. We learned that QDD does not allow the sequence header line to contain the vertical bar [|] (and possibly other characters that have special meaning in a regular expression). Also, analysis of 1-mers in longer sequences, such as the lizard genome, exceeded 21 days in SSR-Pipeline. MREPS required pre-splitting of the input sequence files because the algorithm does not accept any characters besides A, T, C and G in the sequence lines (it will accept a very limited number of well-distributed Ns). SSRIT requires directly editing the source code to query period sizes other than two through four. Similarly, QDD requires directly editing its source code to retrieve different period lengths and different SSR lengths. QDD defaults to 1-mers that must be 1 million bases long and 2-mers through 6-mers that must repeat at least 5 times.

Furthermore, unlike some other algorithms, the output format for Kmer-SSR is easily parsable and can be loaded directly into an Excel spreadsheet or another tab-delimited parser. GMATo, ProGeRF, SSRIT and SA-SSR have similar output formats (although ProGeRF and SSRIT do not provide column headers). MREPS and TRF produce text-based reports with embedded tables. QDD provides a semicolon-separated value report with a few fixed columns followed by a variable number of columns, depending on the number of SSRs found in a given sequence. SSR-Pipeline provides FASTA formatted output where the SSRs are encoded in the header (see Table 2).

MREPS, PRoGeRF and TRF attempt to identify SSRs through heuristics. Heuristics are a common approach for reaching an adequate solution when checking all possible solutions is too computationally intensive or when no good method exists for calculating the exact solution (Clancey, 1985).
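The pre-splitting that MREPS required can be pictured with a small Python sketch. The text does not describe the pre-processing approach that was actually used, so this is only one plausible way to do it: the hypothetical helper `split_on_unknown` cuts a sequence at every run of characters other than A, C, G and T and keeps each fragment's offset so that SSR positions can be mapped back to the original sequence.

```python
import re


def split_on_unknown(seq):
    """Return (offset, fragment) pairs for each maximal run of A, C, G or T.

    Offsets are 0-based positions in the original sequence.
    """
    return [(m.start(), m.group())
            for m in re.finditer(r"[ACGT]+", seq.upper())]


print(split_on_unknown("ACGTNNNNACACNNT"))
```

Each fragment could then be written out as its own fasta record, with the offset stored in the header, before running a tool that only accepts canonical nucleotides.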
Table 2 displays features of each software package per each software package’s documentation (Benson, 1999; Kolpakov ; Lopes ; Meglécz ; Miller ; Pickett ; Temnykh ; Wang ).

Table 2.

We documented each SSR algorithm’s basic usages and options based on the documentation from each algorithm

	GUI	Output	Language	Algorithm	Type	Period	Repeats	Multi-threaded	Search for Specific SSRs
Kmer-SSR		TSV	C ++	K-mer Decomposition	Exact	1+	2+	X	X
SA-SSR		TSV	C ++	Combinatorial	Exact	1+	2+	X	X
GMATo	X	TSV	Perl & Java	Regular Expressions	Exact	1–10	2+
MREPS		Text	C	Combinatorial	Inexact	1+	2+
PRoGeRF	Web	TSV	Perl	?	Inexact	1–12	2+
QDD		SCSV	Perl	?	Exact	1–6	5+
SSR-Pipeline		FASTA	Python	?	Exact	1–25	2+
SSRIT		TSV	Perl	Regular Expressions	Exact	2–4	2+
TRF	X	Text	?	Heuristic	Inexact	1+	2+

Note: All algorithms can run in a Linux environment, accept command line options and take a fasta file as input.

Columns in the table are as follows: GUI = graphical user interface available. The algorithms create either a text file, tab separated values (TSV), semicolon separated values (SCSV), or fasta file. The language in which the program is written is followed by the method that the algorithm uses and the type of SSRs it can find (exact or inexact). Minimum SSR period sizes and SSR repeat numbers are also listed. Finally, we list if the algorithm is multithreaded or configurable to search for specific SSRs. Only Kmer-SSR and SA-SSR are multithreaded and configurable to search for specific SSRs.

While Kmer-SSR provides a substantially better user experience with more filters and options than all other algorithms, Kmer-SSR has several weaknesses. First, since Kmer-SSR is an exact algorithm, it is not as fast as the heuristic approach of MREPS when there are only canonical nucleotides in a sequence. Second, due to the k-mer decomposition approach used in Kmer-SSR, it is unable to identify fuzzy repeat regions where only one or two nucleotides differ from an exact repeat. Although not necessary for many applications, support for fuzzy repeats would give Kmer-SSR functionality that is not possible with the algorithm’s current implementation. Third, Kmer-SSR has no web interface.

Unlike all other algorithms, Kmer-SSR offers the convenience of a completely exhaustive search in linear time (though with a larger constant factor than normal). This truly exhaustive search is entirely filter-free. As an example, that means it would report an ACG repeated seven times at position 1, six times at position 4, five times at position 7, etc. This is likely not necessary for most applications. However, with the exhaustive option, we set an upper limit for all SSR identifications. Furthermore, since genome complexity is important in primer design and predicting recombination events (Murray et al., 1999), the exhaustive option could be used as an easy approach to determine the proportion of a sequence that repeats.
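The behaviour of a filter-free search, and its use for estimating how much of a sequence is repetitive, can be sketched in Python as follows. The function names `exhaustive_ssrs` and `repeat_fraction` are hypothetical, and this naive version is not linear time as the Kmer-SSR implementation is stated to be; it only illustrates what would be reported. Positions are 1-based to match the example in the text.

```python
def exhaustive_ssrs(seq, k, min_repeats=2):
    """Report every (position, unit, repeats) for period k with no filtering."""
    hits = []
    for i in range(len(seq) - k + 1):
        unit = seq[i:i + k]
        repeats, j = 1, i + k
        while seq[j:j + k] == unit:
            repeats += 1
            j += k
        if repeats >= min_repeats:
            hits.append((i + 1, unit, repeats))    # 1-based position
    return hits


def repeat_fraction(seq, periods, min_repeats=2):
    """Fraction of bases covered by at least one exhaustively reported SSR."""
    covered = [False] * len(seq)
    for k in periods:
        for pos, unit, repeats in exhaustive_ssrs(seq, k, min_repeats):
            for p in range(pos - 1, pos - 1 + k * repeats):
                covered[p] = True
    return sum(covered) / len(seq) if seq else 0.0


acg_hits = [(pos, n) for pos, unit, n in exhaustive_ssrs("ACG" * 7, 3)
            if unit == "ACG"]
print(acg_hits)
print(repeat_fraction("TT" + "ACG" * 7 + "TT", [3]))
```

For ACG repeated seven times, the ACG hits start at positions 1, 4, 7 and so on with seven, six, five and fewer repeats, as in the example above; the unfiltered list also contains the cyclic variants CGA and GAC. In the second call, 21 of the 25 bases fall inside a repeat.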

References

1. Comparative sequence analysis of human minisatellites showing meiotic repeat instability.

Authors: J Murray; J Buard; D L Neil; E Yeramian; K Tamaki; C Hollies; A J Jeffreys
Journal: Genome Res Date: 1999-02 Impact factor: 9.043

2. mreps: Efficient and flexible detection of tandem repeats in DNA.

Authors: Roman Kolpakov; Ghizlane Bana; Gregory Kucherov
Journal: Nucleic Acids Res Date: 2003-07-01 Impact factor: 16.971

3. Tandem repeats finder: a program to analyze DNA sequences.

Authors: G Benson
Journal: Nucleic Acids Res Date: 1999-01-15 Impact factor: 16.971

4. Simple sequence repeats as a source of quantitative genetic variation.

Authors: Y Kashi; D King; M Soller
Journal: Trends Genet Date: 1997-02 Impact factor: 11.639

5. Slipped-strand mispairing: a major mechanism for DNA sequence evolution.

Authors: G Levinson; G A Gutman
Journal: Mol Biol Evol Date: 1987-05 Impact factor: 16.240

6. Computational and experimental analysis of microsatellites in rice (Oryza sativa L.): frequency, length variation, transposon associations, and genetic marker potential.

Authors: S Temnykh; G DeClerck; A Lukashova; L Lipovich; S Cartinhour; S McCouch
Journal: Genome Res Date: 2001-08 Impact factor: 9.043

7. Differential distribution of simple sequence repeats in eukaryotic genome sequences.

Authors: M V Katti; P K Ranjekar; V S Gupta
Journal: Mol Biol Evol Date: 2001-07 Impact factor: 16.240

8. The Chlamydomonas genome reveals the evolution of key animal and plant functions.

Authors: Sabeeha S Merchant; Simon E Prochnik; Olivier Vallon; Elizabeth H Harris; Steven J Karpowicz; George B Witman; Astrid Terry; Asaf Salamov; Lillian K Fritz-Laylin; Laurence Maréchal-Drouard; Wallace F Marshall; Liang-Hu Qu; David R Nelson; Anton A Sanderfoot; Martin H Spalding; Vladimir V Kapitonov; Qinghu Ren; Patrick Ferris; Erika Lindquist; Harris Shapiro; Susan M Lucas; Jane Grimwood; Jeremy Schmutz; Pierre Cardol; Heriberto Cerutti; Guillaume Chanfreau; Chun-Long Chen; Valérie Cognat; Martin T Croft; Rachel Dent; Susan Dutcher; Emilio Fernández; Hideya Fukuzawa; David González-Ballester; Diego González-Halphen; Armin Hallmann; Marc Hanikenne; Michael Hippler; William Inwood; Kamel Jabbari; Ming Kalanon; Richard Kuras; Paul A Lefebvre; Stéphane D Lemaire; Alexey V Lobanov; Martin Lohr; Andrea Manuell; Iris Meier; Laurens Mets; Maria Mittag; Telsa Mittelmeier; James V Moroney; Jeffrey Moseley; Carolyn Napoli; Aurora M Nedelcu; Krishna Niyogi; Sergey V Novoselov; Ian T Paulsen; Greg Pazour; Saul Purton; Jean-Philippe Ral; Diego Mauricio Riaño-Pachón; Wayne Riekhof; Linda Rymarquis; Michael Schroda; David Stern; James Umen; Robert Willows; Nedra Wilson; Sara Lana Zimmer; Jens Allmer; Janneke Balk; Katerina Bisova; Chong-Jian Chen; Marek Elias; Karla Gendler; Charles Hauser; Mary Rose Lamb; Heidi Ledford; Joanne C Long; Jun Minagawa; M Dudley Page; Junmin Pan; Wirulda Pootakham; Sanja Roje; Annkatrin Rose; Eric Stahlberg; Aimee M Terauchi; Pinfen Yang; Steven Ball; Chris Bowler; Carol L Dieckmann; Vadim N Gladyshev; Pamela Green; Richard Jorgensen; Stephen Mayfield; Bernd Mueller-Roeber; Sathish Rajamani; Richard T Sayre; Peter Brokstein; Inna Dubchak; David Goodstein; Leila Hornick; Y Wayne Huang; Jinal Jhaveri; Yigong Luo; Diego Martínez; Wing Chi Abby Ngau; Bobby Otillar; Alexander Poliakov; Aaron Porter; Lukasz Szajkowski; Gregory Werner; Kemin Zhou; Igor V Grigoriev; Daniel S Rokhsar; Arthur R Grossman
Journal: Science Date: 2007-10-12 Impact factor: 47.728

9. ProGeRF: proteome and genome repeat finder utilizing a fast parallel hash function.

Authors: Robson da Silva Lopes; Walas Jhony Lopes Moraes; Thiago de Souza Rodrigues; Daniella Castanheira Bartholomeu
Journal: Biomed Res Int Date: 2015-02-25 Impact factor: 3.411

10. GMATo: A novel tool for the identification and analysis of microsatellites in large genomes.

Authors: Xuewen Wang; Peng Lu; Zhaopeng Luo
Journal: Bioinformation Date: 2013-06-08
