
Kmer-SSR: a fast and exhaustive SSR search algorithm.

Brandon D Pickett, Justin B Miller, Perry G Ridge.

Abstract

MOTIVATION: One of the main challenges with bioinformatics software is that the size and complexity of datasets necessitate trading speed for accuracy, or completeness. To combat this problem of computational complexity, a plethora of heuristic algorithms have arisen that report a 'good enough' solution to biological questions. However, in instances such as Simple Sequence Repeats (SSRs), a 'good enough' solution may not accurately portray results in population genetics, phylogenetics and forensics, which require accurate SSRs to calculate intra- and inter-species interactions.
RESULTS: We present Kmer-SSR, which finds all SSRs faster than most heuristic SSR identification algorithms in a parallelized, easy-to-use manner. The exhaustive Kmer-SSR option has 100% precision and 100% recall and accurately identifies every SSR of any specified length. To identify more biologically pertinent SSRs, we also developed several filters that allow users to easily view a subset of SSRs based on user input. Kmer-SSR, coupled with the filter options, accurately and intuitively identifies SSRs quickly and in a more user-friendly manner than any other SSR identification algorithm.
AVAILABILITY AND IMPLEMENTATION: The source code is freely available on GitHub at https://github.com/ridgelab/Kmer-SSR. CONTACT: perry.ridge@byu.edu.
© The Author(s) 2017. Published by Oxford University Press.

Year:  2017        PMID: 28968741      PMCID: PMC5860095          DOI: 10.1093/bioinformatics/btx538

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

Simple sequence repeats (SSRs) are short repetitive regions of DNA where at least one base is tandemly repeated many times due to slipped-strand mispairing and errors occurring in DNA replication, repair, or recombination (Levinson and Gutman, 1987). For decades, SSRs have been studied to determine phenotypic differences caused by increased copy numbers of short repetitive sequences (Kashi and King, 2006). Moreover, SSRs account for quantitative genetic variation and phenotypic differences without lowering species fitness (Kashi et al., 1997). SSR concentration varies not only between different species, but also between different chromosomes within the same species, and cannot be explained by assessing the nucleotide composition of sequences (Katti et al., 2001). Because SSRs reveal characteristic functions of DNA replication, recombination and repair, they are important in studying biological systems interactions, as well as studying repeat expansion-based diseases with next-generation sequencing data (Kashi and King, 2006). Many different approaches have been used to identify SSRs. Here, we propose the use of k-mers. The term k-mer refers to a subsequence of length ‘k’ derived from a given sequence, while k-mer decomposition refers to all possible substrings of length ‘k’ that can be made from a sequence. Uses for k-mer decomposition have previously been outlined in instances such as genome assembly and machine learning (Chikhi and Medvedev, 2014; Ghandi ). Although k-mers have been used to identify similar subsequences as in (Han ), to our knowledge SSR identification has never been attempted through k-mer decomposition.

2 Materials and methods

2.1 Overview

Kmer-SSR utilizes k-mer decomposition to provide an exhaustive or filtered approach to finding all SSRs in a given sequence (Figs 1 and 2). Our version of k-mer decomposition works by identifying all subsequences of length ‘k’ while tracking the start position of each k-mer. K-mer lengths are defined by the user as the SSR period length. Kmer-SSR minimizes the usage of random access memory (RAM) by performing k-mer decomposition and only storing k-mers that are the same as the preceding k-mer (SSR period length). If a k-mer is not identical to a k-mer found k bases previously, the previously identified k-mers will be discarded and k-mer decomposition will occur for the rest of the sequence.
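The idea in the paragraph above can be illustrated with a minimal Python sketch. It is not the authors' C++ implementation: the function names `kmer_decomposition` and `starts_repeating_previous` are hypothetical, and the sketch only shows how tracking start positions lets each k-mer be compared with the k-mer found k bases earlier while keeping no more than k k-mers in memory.

```python
def kmer_decomposition(sequence, k):
    """Yield (start, kmer) for every length-k substring, left to right."""
    for start in range(len(sequence) - k + 1):
        yield start, sequence[start:start + k]


def starts_repeating_previous(sequence, k):
    """Return start positions whose k-mer equals the k-mer k bases earlier."""
    recent = {}  # start -> k-mer, holding only the last k k-mers seen
    hits = []
    for start, kmer in kmer_decomposition(sequence, k):
        if recent.get(start - k) == kmer:
            hits.append(start)
        recent[start] = kmer
        recent.pop(start - k, None)  # keep at most k entries in memory
    return hits


if __name__ == "__main__":
    # AC occupies positions 2-7 (0-based), so the 2-mers starting at
    # positions 4, 5 and 6 each match the 2-mer two bases earlier.
    print(starts_repeating_previous("GGACACACTT", 2))
```

A run of consecutive hits corresponds to a candidate SSR; a position without a hit ends the run, at which point the stored k-mers can be discarded.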
Fig. 1.

Conceptual representation of Kmer-SSR. Although we implement some filters and tricks to speed up Kmer-SSR runtime, each SSR is identified through k-mer decomposition, which allows the identification of instances when the same SSR period occurs k bases from the previously identified SSR period


2.2 Memory requirements

We used the following techniques to limit memory requirements:

Identify SSRs from left to right: Kmer-SSR checks each position starting at the leftmost position of the sequence for each SSR period size (i.e. k-mer length) given by the user. This method allowed us to store only a single potential SSR and immediately either discard it if it was not repeated or write it to a file if it was a valid SSR.

Identify SSRs with the largest period size first: Since Kmer-SSR does not store previously identified SSRs in memory, it is necessary to search for SSRs in a specific order, or else risk reporting SSRs fully enclosed within larger SSRs. To avoid this issue, we take the period sizes given by the user and search for SSRs from the longest period size to the smallest (e.g. if the user wants to search for 2-mers and 7-mers, we search for all 7-mer SSRs in the sequences before we search for 2-mer SSRs). When an SSR is discovered, an atomicity check is conducted to determine if the k-mer can be broken down to a smaller subsequence. An SSR is considered atomic if no smaller SSRs exist inside the first period. For example, ATATATAT would be identified as a 4-mer (ATAT) repeated twice, but ATAT is not atomic because AT (repeated twice) occurs within the first period. Thus, it is ignored because it is an invalid 4-mer and, if the user requested searching for 2-mers, it would be discovered again as a 2-mer (AT) repeated four times. If the atomicity check fails, the SSR is not reported. When an atomic (i.e. valid) SSR is discovered, the iterator moves just past the SSR, minus the current period size being searched, to ensure that overlapping SSRs are identified. For example, ACAACAACACACACAC has ACA repeated three times starting at position 0. Additionally, AC repeats five times starting at position 6. After finding the ACA repeat, we would miss the full AC repeat if we skipped to the end of the ACA repeat and resumed searching from there. Only by backtracking as described above (9 - 3 = 6) do we find the full AC repeat. Note that the nucleotides between positions 0 and 5 need not be searched for SSRs because Kmer-SSR has already found SSRs with larger period sizes than the current period size. In other words, since Kmer-SSR has already found SSRs with larger period sizes, the maximum possible overlap with the current SSR (ACA) and an adjacent following SSR is k (which is three in this example), removing the need to search for SSRs from the start of a valid SSR to k bases from the end of that SSR.

Create a Boolean filter array: To ensure that SSRs are unique and do not end in the same positions, we created a Boolean filter array of the same length as the sequence being analyzed, which is initialized to false. In C++, the implementation of this array only requires one bit per position, so the memory requirement is nominal. When an SSR is discovered, we first ensure that at least one position in the first or last SSR period size on either end of the SSR is false in the Boolean array. If one position is false, we assign all values within the array that correspond to all positions in the SSR to true. The filter allows us to ignore completely overlapping SSRs because overlapping SSRs will be set to ‘true’ at the positions at the ends of the SSR.

By utilizing the above-mentioned methods, we were able to limit the amount of RAM needed to O(n), where n is the sequence length, and the constant value is slightly more than one byte (one byte to store each sequence base and one bit allocated in the Boolean filter for each base).
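The three techniques above can be combined in a short Python sketch. This is an illustration under stated assumptions rather than the published C++ code: `is_atomic` and `find_ssrs` are hypothetical names, atomicity is interpreted as "the period is not a whole-number repeat of a shorter unit", a `bytearray` stands in for the one-bit-per-position Boolean array, and the optimization of skipping regions already covered by larger periods is omitted.

```python
def is_atomic(period):
    """Return True if `period` is not a whole-number repeat of a shorter unit.

    Assumption: this is one reading of the atomicity check (ATAT fails
    because it is AT repeated twice).
    """
    n = len(period)
    return not any(
        n % size == 0 and period == period[:size] * (n // size)
        for size in range(1, n)
    )


def find_ssrs(sequence, period_sizes, min_copies=2):
    """Return (start, period, copies) tuples with 0-based start positions."""
    if min_copies < 2:
        raise ValueError("min_copies must be at least 2")
    covered = bytearray(len(sequence))  # Boolean filter: 0 = false, 1 = true
    found = []
    for k in sorted(period_sizes, reverse=True):  # largest period size first
        pos = 0
        while pos + min_copies * k <= len(sequence):  # scan left to right
            period = sequence[pos:pos + k]
            copies = 1
            while sequence.startswith(period, pos + copies * k):
                copies += 1
            if copies < min_copies or not is_atomic(period):
                pos += 1
                continue
            end = pos + copies * k
            # Keep the SSR only if some position in its first or last period
            # has not already been claimed by a previously reported SSR.
            edges = covered[pos:pos + k] + covered[end - k:end]
            if not all(edges):
                covered[pos:end] = b"\x01" * (end - pos)
                found.append((pos, period, copies))
            pos = end - k  # backtrack one period so overlapping SSRs are seen
    return found


if __name__ == "__main__":
    print(find_ssrs("ACAACAACACACACAC", [2, 3]))
    print(find_ssrs("ATATATAT", [2, 4]))
```

On the example from the text, the sketch reports ACA three times at position 0 and then, thanks to the backtrack to position 6, AC five times at position 6. For ATATATAT the 4-mer ATAT is rejected as non-atomic and the sequence is reported as AT repeated four times.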

2.3 SSR filters

Next, we implemented a comprehensive filter that allows users to control the output of Kmer-SSR based on atomicity, cyclic duplicates, enclosed SSRs, minimum SSR length and specific SSR period sizes. Pseudocode for Kmer-SSR is in Figure 2. The following are different filters that are optionally applied to the output of Kmer-SSR:
Fig. 2.

Pseudocode for the Kmer-SSR algorithm. The function passesBooleanFilter ensures SSRs are not duplicates of previously reported SSRs. The function passesUserFilters (function not shown) completes other user-specified options, which may include: minimum SSR length, minimum and maximum number of periods, finding specific SSRs and sequence length bounds

Atomicity check: The atomicity check ensures that the smallest period size for each SSR is reported. For instance, if an ATAT repeats four times, it would be reported as an AT repeated eight times because AT is the smallest period size within ATAT.

Cyclic duplicates: Many SSRs create equally viable SSRs with slightly different positions reported. For instance, in the sequence ATATATATATATATATA, it is arguably equally valid to report the AT repeated eight times starting at position zero as it would be to report TA repeating eight times starting at position one. To avoid duplicate reporting of cyclic duplicates and ensure the longest SSR is always reported, we choose and report only the leftmost SSR. So, in this instance, only the AT repeated eight times would be reported.

Enclosed SSRs: Occasionally, SSRs might be completely enclosed within other SSRs. For example, in the sequence TAAAATTAAAATTAAAAT, the SSR TAAAAT is repeated three times, but within each TAAAAT there is an A that repeats four times. In this case, we only report the longest SSR, TAAAAT, repeated three times.

SSR length: We allow the user to input minimum and maximum SSR lengths via command line options. By default, SSRs are only reported if they are at least 16 nucleotides long.

Set specific period sizes: We allow the user to input specific period sizes to be checked (e.g. 1, 3, 5 would look for SSRs with period sizes of one, three and five), or ranges of period sizes (e.g. 1–7 would look for SSRs with period sizes one through seven). By default, Kmer-SSR reports SSRs of period sizes one through seven. SSRs outside of the user specified range are not reported.

Number of repeats: We allow the user to input minimum and maximum numbers of repeats via command line options. By default, SSRs must repeat at least twice to be reported.

Enumerated SSRs: If the user is interested in a very limited set of SSRs, they may specify those via a command line option and no other SSRs will be reported.

Sequence length: The user may specify minimum and maximum bounds on the length of an input sequence, outside of which the program will not search or report SSRs. By default, if a sequence is less than 100 bases or more than 500 megabases, it will be ignored.
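As a rough illustration of how the length, period-size, repeat-count, enumeration and sequence-length filters could be combined, the following Python sketch encodes the defaults stated above. The class `FilterOptions` and the functions `sequence_in_bounds` and `passes_user_filters` are hypothetical (the latter borrows its name from the Fig. 2 caption, whose body is not shown in the paper), and the real command line option names are not reproduced here.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class FilterOptions:
    """Hypothetical container for the user filters; defaults follow the text."""
    min_ssr_length: int = 16                      # nucleotides
    max_ssr_length: Optional[int] = None          # None = no upper bound
    period_sizes: FrozenSet[int] = frozenset(range(1, 8))  # 1 through 7
    min_repeats: int = 2
    max_repeats: Optional[int] = None
    enumerated: Optional[FrozenSet[str]] = None   # None = report any period
    min_sequence_length: int = 100                # bases
    max_sequence_length: int = 500_000_000        # 500 megabases


def sequence_in_bounds(sequence, opts):
    """Return True if the input sequence is long/short enough to be searched."""
    return opts.min_sequence_length <= len(sequence) <= opts.max_sequence_length


def passes_user_filters(period, repeats, opts):
    """Return True if an SSR (period repeated `repeats` times) is reportable."""
    length = len(period) * repeats
    if len(period) not in opts.period_sizes:
        return False
    if length < opts.min_ssr_length:
        return False
    if opts.max_ssr_length is not None and length > opts.max_ssr_length:
        return False
    if repeats < opts.min_repeats:
        return False
    if opts.max_repeats is not None and repeats > opts.max_repeats:
        return False
    if opts.enumerated is not None and period not in opts.enumerated:
        return False
    return True


if __name__ == "__main__":
    defaults = FilterOptions()
    print(passes_user_filters("AT", 8, defaults))  # 16 nt, period size 2
    print(passes_user_filters("AT", 7, defaults))  # only 14 nt
```

With the defaults, AT repeated eight times (16 nucleotides) is reportable, whereas AT repeated seven times falls below the minimum SSR length.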

3 Results

We conducted pairwise comparisons of Kmer-SSR against the following SSR identification algorithms: GMATo (Wang et al., 2013), MREPS (Kolpakov et al., 2003), PRoGeRF (Lopes et al., 2015), QDD (Meglécz ), SA-SSR (Pickett ), SSR-Pipeline (Miller ), SSRIT (Temnykh et al., 2001) and TRF (Benson, 1999). These comparisons were performed on DNA sequences from six different species (whole genome assembly unless otherwise noted): Anolis carolinensis chromosome 6 (CM000942.1), Chlamydomonas reinhardtii (assembly v5.5) (Merchant et al., 2007), Danio rerio chromosome 25 (CM002909.1), Dictyostelium discoideum (GCA_0000044695.1), Physcomitrella patens chromosome 1 (assembly v3.3) and Saccharomyces cerevisiae (GCA_001634645.1). Table 1 displays the computational time of each algorithm and the number of SSRs correctly identified for each dataset (CPU Time and Real Time columns).
Table 1.

Comparisons of all nine SSR-identification algorithms across six genomes with period sizes of 1–7 and a minimum SSR length of 16 bases

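The last three columns (SSRs Unique to Software, SSRs Unique to Kmer-SSR and SSRs Shared) are the comparison with Kmer-SSR.

| Dataset | Software | CPU Time (mm:ss) | Real Time (mm:ss) | SSRs Reported | SSRs After Adjustments | SSRs In Range | Number Correct | Number Correct & Fixed | Percent Correct & Fixed | SSRs Unique to Software | SSRs Unique to Kmer-SSR | SSRs Shared |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Anolis carolinensis (chr 6) | GMATo | 2:38 | 2:38 | 20 623 008 | 16 369 297 | 16 871 | 16 871 | 16 870 | 100 | 0 | 8194 | 10 090 |
| | Kmer-SSR | 2:24 | 0:24 | 18 284 | 18 284 | 18 284 | 18 284 | 18 284 | 100 | NA | NA | NA |
| | MREPS | 0:09 | 0:09 | 25 639 | 25 639 | 18 284 | 18 284 | 18 284 | 100 | 0 | 0 | 18 284 |
| | PRoGeRF | 18:07 | 18:07 | 16 841 656 | 16 840 821 | 17 763 | 17 762 | 17 763 | 100 | 0 | 610 | 17 674 |
| | QDD | 19:11 | 19:11 | 60 994 | 60 994 | 18 009 | 18 009 | 18 009 | 100 | 0 | 732 | 17 552 |
| | SA-SSR | 338:47 | 33:55 | 18 166 | 18 166 | 18 166 | 18 166 | 18 166 | 100 | 0 | 442 | 17 842 |
| | SSR-Pipeline | 611:55 | 611:55 | 19 173 282 | 17 301 120 | 18 044 | 18 044 | 18 044 | 100 | 0 | 913 | 17 371 |
| | SSRIT | 1:29 | 1:29 | 87 073 | 74 121 | 18 284 | 18 284 | 18 284 | 100 | 0 | 0 | 18 284 |
| | TRF | 2:09 | 2:09 | 422 851 | 411 644 | 42 157 | 13 872 | 17 307 | 41.05 | 0 | 1560 | 16 724 |
| Chlamydomonas reinhardtii | GMATo | 3:30 | 3:30 | 26 512 280 | 21 624 294 | 50 401 | 50 401 | 50 139 | 99 | 0 | 23 086 | 34 416 |
| | Kmer-SSR | 3:26 | 0:19 | 57 502 | 57 502 | 57 502 | 57 502 | 57 502 | 100 | NA | NA | NA |
| | MREPS | 0:14 | 0:14 | 94 875 | 94 875 | 57 502 | 57 502 | 57 502 | 100 | 0 | 0 | 57 502 |
| | PRoGeRF | 37:55 | 37:55 | 8 071 102 | 8 020 213 | 32 043 | 31 989 | 32 004 | 100 | 0 | 25 588 | 31 914 |
| | QDD | 8:51 | 8:51 | 216 943 | 216 943 | 55 470 | 55 470 | 55 470 | 100 | 0 | 3002 | 54 500 |
| | SA-SSR | 1324:33 | 167:48 | 56 833 | 56 833 | 56 833 | 56 833 | 56 833 | 100 | 0 | 1214 | 56 288 |
| | SSR-Pipeline | 632:10 | 632:10 | 26 973 434 | 23 032 838 | 56 729 | 56 729 | 56 729 | 100 | 0 | 1793 | 55 709 |
| | SSRIT | 2:00 | 2:00 | 310 109 | 252 223 | 57 502 | 57 502 | 57 502 | 100 | 0 | 0 | 57 502 |
| | TRF | 8:52 | 8:52 | 1 022 145 | 990 316 | 181 973 | 25 451 | 45 773 | 25.15 | 0 | 14 546 | 42 956 |
| Danio rerio (chr 25) | GMATo | 1:12 | 1:12 | 9 501 860 | 7 535 749 | 22 546 | 22 546 | 22 362 | 99 | 0 | 8463 | 13 636 |
| | Kmer-SSR | 1:10 | 0:13 | 22 099 | 22 099 | 22 099 | 22 099 | 22 099 | 100 | NA | NA | NA |
| | MREPS | 0:05 | 0:05 | 26 862 | 26 862 | 22 099 | 22 099 | 22 099 | 100 | 0 | 0 | 22 099 |
| | PRoGeRF | 8:14 | 8:14 | 7 696 269 | 7 695 012 | 21 729 | 21 668 | 21 684 | 100 | 0 | 494 | 21 605 |
| | QDD | 7:43 | 7:43 | 49 016 | 49 016 | 21 805 | 21 805 | 21 805 | 100 | 0 | 908 | 21 191 |
| | SA-SSR | 2075:03 | 648:00 | 21 862 | 21 862 | 21 862 | 21 862 | 21 862 | 100 | 0 | 690 | 21 409 |
| | SSR-Pipeline | 1958:54 | 1958:54 | 8 948 450 | 7 954 899 | 21 857 | 21 857 | 21 857 | 100 | 0 | 987 | 21 112 |
| | SSRIT | 0:43 | 0:43 | 69 645 | 58 065 | 22 099 | 22 099 | 22 099 | 100 | 0 | 0 | 22 099 |
| | TRF | 5:03 | 5:03 | 293 378 | 283 764 | 40 343 | 11 255 | 16 911 | 41.92 | 0 | 6144 | 15 955 |
| Dictyostelium discoideum | GMATo | 1:02 | 1:02 | 8 810 607 | 7 126 425 | 82 643 | 82 643 | 82 526 | 100 | 0 | 28 714 | 62 967 |
| | Kmer-SSR | 1:12 | 0:08 | 91 681 | 91 681 | 91 681 | 91 681 | 91 681 | 100 | NA | NA | NA |
| | MREPS | 0:05 | 0:05 | 121 835 | 121 835 | 91 681 | 91 681 | 91 681 | 100 | 0 | 0 | 91 681 |
| | PRoGeRF | 11:42 | 11:42 | 4 629 786 | 4 604 499 | 60 176 | 60 174 | 60 174 | 100 | 0 | 31 707 | 59 974 |
| | QDD | 3:44 | 3:44 | 171 686 | 171 686 | 88 017 | 88 017 | 88 017 | 100 | 0 | 5295 | 86 386 |
| | SA-SSR | 723:31 | 236:01 | 90 700 | 90 700 | 90 700 | 90 700 | 90 700 | 100 | 0 | 1635 | 90 046 |
| | SSR-Pipeline | 246:35 | 246:35 | 9 292 900 | 7 397 561 | 90 810 | 90 810 | 90 810 | 100 | 0 | 1759 | 89 922 |
| | SSRIT | 0:42 | 0:42 | 265 894 | 202 531 | 91 681 | 91 681 | 91 681 | 100 | 0 | 0 | 91 681 |
| | TRF | 17:30 | 17:30 | 642 904 | 602 301 | 178 902 | 40 772 | 75 742 | 42.34 | 0 | 18 962 | 72 719 |
| Physcomitrella patens (chr 1) | GMATo | 0:59 | 0:59 | 7 981 869 | 6 500 395 | 7739 | 7739 | 7736 | 100 | 0 | 3259 | 5528 |
| | Kmer-SSR | 0:58 | 0:10 | 8787 | 8787 | 8787 | 8787 | 8787 | 100 | NA | NA | NA |
| | MREPS | 0:04 | 0:04 | 12 885 | 12 885 | 8787 | 8787 | 8787 | 100 | 0 | 0 | 8787 |
| | PRoGeRF | 7:32 | 7:32 | 6 639 989 | 6 639 933 | 8669 | 8668 | 8668 | 100 | 0 | 131 | 8656 |
| | QDD | 4:29 | 4:29 | 27 774 | 27 774 | 8319 | 8319 | 8319 | 100 | 0 | 621 | 8166 |
| | SA-SSR | 642:36 | 91:59 | 8719 | 8719 | 8719 | 8719 | 8719 | 100 | 0 | 152 | 8635 |
| | SSR-Pipeline | 1498:06 | 1498:06 | 7 763 141 | 6 874 175 | 8720 | 8720 | 8720 | 100 | 0 | 253 | 8534 |
| | SSRIT | 0:35 | 0:35 | 39 472 | 35 941 | 8787 | 8787 | 8787 | 100 | 0 | 0 | 8787 |
| | TRF | 1:53 | 1:53 | 223 938 | 215 818 | 22 730 | 6132 | 8192 | 36.04 | 0 | 891 | 7896 |
| Saccharomyces cerevisiae | GMATo | 0:23 | 0:23 | 3 281 592 | 2 674 303 | 1101 | 1101 | 1101 | 100 | 0 | 588 | 887 |
| | Kmer-SSR | 0:23 | 0:04 | 1475 | 1475 | 1475 | 1475 | 1475 | 100 | NA | NA | NA |
| | MREPS | 0:02 | 0:02 | 2293 | 2293 | 1475 | 1475 | 1475 | 100 | 0 | 0 | 1475 |
| | PRoGeRF | 3:43 | 3:43 | 1 065 515 | 1 065 510 | 492 | 492 | 492 | 100 | 0 | 988 | 487 |
| | QDD | 0:47 | 0:47 | 8672 | 8672 | 1368 | 1368 | 1368 | 100 | 0 | 139 | 1336 |
| | SA-SSR | 338:50 | 60:55 | 1430 | 1430 | 1430 | 1430 | 1430 | 100 | 0 | 57 | 1418 |
| | SSR-Pipeline | 9:32 | 9:32 | 3 124 288 | 2 820 560 | 1427 | 1427 | 1427 | 100 | 0 | 73 | 1402 |
| | SSRIT | 0:14 | 0:14 | 12 276 | 10 386 | 1475 | 1475 | 1475 | 100 | 0 | 0 | 1475 |
| | TRF | 0:26 | 0:26 | 62 616 | 61 038 | 4634 | 755 | 1242 | 26.80 | 0 | 290 | 1185 |
| Combined | GMATo | 9:44 | 9:44 | 76 711 216 | 61 830 463 | 181 301 | 181 301 | 180 734 | 100 | 0 | 72 304 | 127 524 |
| | Kmer-SSR | 9:33 | 1:18 | 199 828 | 199 828 | 199 828 | 199 828 | 199 828 | 100 | NA | NA | NA |
| | MREPS | 0:39 | 0:39 | 284 389 | 284 389 | 199 828 | 199 828 | 199 828 | 100 | 0 | 0 | 199 828 |
| | PRoGeRF | 87:13 | 87:13 | 44 944 317 | 44 865 988 | 140 872 | 140 753 | 140 785 | 100 | 0 | 59 518 | 140 310 |
| | QDD | 44:45 | 44:45 | 535 085 | 535 085 | 192 988 | 192 988 | 192 988 | 100 | 0 | 10 697 | 189 131 |
| | SA-SSR | 5443:20 | 1238:38 | 197 710 | 197 710 | 197 710 | 197 710 | 197 710 | 100 | 0 | 4190 | 195 638 |
| | SSR-Pipeline | 4957:12 | 4957:12 | 75 275 495 | 65 381 153 | 197 587 | 197 587 | 197 587 | 100 | 0 | 5778 | 194 050 |
| | SSRIT | 5:43 | 5:43 | 784 469 | 633 267 | 199 828 | 199 828 | 199 828 | 100 | 0 | 0 | 199 828 |
| | TRF | 35:53 | 35:53 | 2 667 832 | 2 564 881 | 470 739 | 98 237 | 165 167 | 35.09 | 0 | 42 393 | 157 435 |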

Note: This table shows that Kmer-SSR reports all possible SSRs in reasonable runtime with more refined user control and filtering options relative to the other software packages. We ran all comparisons on a 2.3 GHz Intel Haswell processor. Although each algorithm was given the same amount of memory and CPUs, due to hardware variability of the CPU, runtimes could vary by up to 20%. Also, MREPS required pre-processing of the fasta files, which typically added anywhere from a few seconds to several minutes to the runtime (not depicted in the table), depending on the pre-processing approach used. Similarly, we did not include the time required to edit SSRIT and QDD’s source code in order for their programs to function over the period sizes in these tests. SSR-Pipeline could not finish searching for 1-mers in chromosome 6 of Anolis carolinensis in 21 days of runtime. Accordingly, the chromosome was split into 24 approximately equal sized chunks (i.e. approximately 3.3 Mb each) and each chunk was searched for 1-mers separately by SSR-Pipeline. The required time for each chunk was summed (approximately 5 hours) and used in place of 504 hours (21 days).

The SSRs After Adjustments column reflects the number of SSRs that remained after we removed or altered SSRs to make the comparison simpler. SSRs that were exact duplicates, duplicates with only the repeat number varying, duplicates that varied only by cycle (e.g. ACG versus CGA with the same number of repeats right next to each other), entirely surrounded by another SSR, or not atomic (e.g. ATAT repeated 4 times instead of AT repeated 8 times) were removed. SSRs that shared the same base and overlapped were combined into one SSR (e.g. AT repeated 8 times at position 1 and AT repeated 6 times at position 11 would be combined to AT repeated 11 times at position 1).

The SSRs In Range column is the number of SSRs from the previous column that were 16 nt or longer and had a period size of 1–7 (inclusive).

The Number Correct column is the number of SSRs In Range that were actually present in the sequence.

The Number Correct and Fixed is the Number Correct plus a few incorrect SSRs that we are able to fix (e.g. a program might report an AT repeated 30 times, but it only repeated 20 times in the sequence).

The Percent Correct and Fixed is the percent of SSRs in Range that were correct or fixed.
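Two of the adjustments described in these notes are simple arithmetic, and a small Python sketch may make them concrete. The function names `merge_overlapping` and `percent_correct_and_fixed` are hypothetical, positions are 1-based as in the example above, and the sketch assumes that "overlapped" additionally requires the two runs to be in phase with each other.

```python
def merge_overlapping(first, second):
    """Merge two SSRs given as (position, period, repeats), 1-based positions.

    Returns the combined SSR, or None when the pair does not qualify.
    Assumption: the two runs must share the period, overlap, and be in phase.
    """
    (pos_a, unit_a, reps_a), (pos_b, unit_b, reps_b) = sorted([first, second])
    k = len(unit_a)
    end_a = pos_a + k * reps_a  # one past the last base of the leftmost SSR
    end_b = pos_b + k * reps_b
    if unit_a != unit_b or pos_b >= end_a or (pos_b - pos_a) % k != 0:
        return None
    return (pos_a, unit_a, (max(end_a, end_b) - pos_a) // k)


def percent_correct_and_fixed(correct_and_fixed, in_range):
    """Percent of SSRs In Range that were correct or fixed, to two decimals."""
    return round(100 * correct_and_fixed / in_range, 2)


if __name__ == "__main__":
    # AT x8 at position 1 covers bases 1-16; AT x6 at position 11 covers
    # bases 11-22; together they span bases 1-22, i.e. AT x11 at position 1.
    print(merge_overlapping((1, "AT", 8), (11, "AT", 6)))
    # TRF on Anolis carolinensis (chr 6): 17 307 of 42 157 SSRs In Range.
    print(percent_correct_and_fixed(17307, 42157))
```

The second call reproduces the 41.05 reported for TRF on the Anolis carolinensis dataset in Table 1.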

Because Kmer-SSR is multithreaded and robust to fasta files with unknown nucleotides, the real time for SSR identification using Kmer-SSR is faster than any other algorithm. Although MREPS reports a faster real time identification of SSRs, the program does not usually run with sequences containing unknown characters. With the addition of the time necessary to make the input fasta files usable for MREPS, it underperformed Kmer-SSR in all six datasets (Table 1, Real Time column). We found that with the exception of TRF, all algorithms tested were 100% accurate in identifying SSRs; however, only Kmer-SSR, MREPS and SSRIT reported all possible filtered SSRs within the range specified for each dataset (Table 1, SSRs In Range column). Although SSRIT has a faster CPU time than Kmer-SSR, it does not have the multithreading capabilities of Kmer-SSR, nor does it allow for querying of SSRs other than period sizes 2–4 without directly editing the algorithm’s source code.

4 Discussion

SSR identification is important in many biological comparisons. It is important to have 100% accuracy in SSR identification because primers often depend on the exact SSR sequence with conserved flanking sequences (Robinson ), and phenotypic variations associated with SSRs require an accurate portrayal of a genome. Furthermore, determining the exact SSR copy number is important in species identification and aids in the identification of discrete families and individuals. Kmer-SSR fills a usability gap in SSR identification. While many SSR identification algorithms exist, it is often difficult to install, use and read the output from the algorithms available. Two of the main strengths of Kmer-SSR are its usability and the SSR filters that are easily accessible to help answer biological questions. Kmer-SSR is at least as easy to install as other algorithms. Kmer-SSR was implemented in C++. It does not require any editing of the source code to find SSRs of different lengths or filter overlapping SSRs, and provides robust documentation for its command line options. Step-by-step instructions for installation and implementation of Kmer-SSR are available with the algorithm’s source code at http://github.com/ridgelab/Kmer-SSR. The filters available in Kmer-SSR help answer primary biological questions. Instead of inundating a researcher with duplicate SSRs, Kmer-SSR eliminates overlapping SSRs by only reporting the left-most SSR in each sequence when multiple SSRs are equally valid. Furthermore, longer SSRs are typically more biologically interesting, so completely enclosed SSRs are not included in the output. Importantly, these filters still allow for overlapping SSRs where at least one period size is completely outside of the previously reported SSR. These filters set Kmer-SSR apart from all other SSR identification algorithms because of its ease of use as well as its utility.
As we compared other algorithms, a few difficulties arose that made it challenging to directly compare the output from each program. We learned that QDD does not allow the sequence header line to contain the vertical bar [|] (and possibly other characters that have special meaning in a regular expression). Also, analysis of 1-mers in longer sequences, such as the lizard genome, exceeded 21 days in SSR-Pipeline. MREPS also required pre-splitting of the input sequence files because the algorithm does not accept any characters besides A, T, C and G in the sequence lines (it will accept a very limited number of well-distributed Ns). SSRIT requires directly editing the source code to query period sizes other than lengths two through four. Similarly, QDD requires directly editing its source code to retrieve different period lengths and different SSR lengths. QDD defaults to 1-mers that must be 1 million bases long and 2-mers through 6-mers that must repeat at least 5 times. Furthermore, unlike some other algorithms, the output format for Kmer-SSR is easily parsable, and can be exported directly to an Excel spreadsheet or another tab delimited parser. GMATo, PRoGeRF, SSRIT and SA-SSR have similar output formats (although PRoGeRF and SSRIT do not provide column headers). MREPS and TRF are text-based reports with embedded tables. QDD provides a semicolon-separated value report with a few fixed columns followed by a variable number of columns thereafter depending on the number of SSRs found in a given sequence. SSR-Pipeline provides FASTA formatted output where the SSRs are encoded in the header (see Table 2). MREPS, PRoGeRF and TRF attempt to identify SSRs through heuristics. Heuristics are a common way to reach an adequate solution to a problem for which checking all possible solutions is too computationally intensive, or for which no good approach exists to calculate the exact solution (Clancey, 1985).
Table 2 displays features of each software package per each software package’s documentation (Benson, 1999; Kolpakov et al., 2003; Lopes et al., 2015; Meglécz ; Miller ; Pickett ; Temnykh et al., 2001; Wang et al., 2013).
Table 2.

We documented each SSR algorithm’s basic usages and options based on the documentation from each algorithm

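| Software | GUI | Output | Language | Algorithm | Type | Period | Repeats | Multi-threaded | Search for Specific SSRs |
|---|---|---|---|---|---|---|---|---|---|
| Kmer-SSR | | TSV | C++ | K-mer Decomposition | Exact | 1+ | 2+ | X | X |
| SA-SSR | | TSV | C++ | Combinatorial | Exact | 1+ | 2+ | X | X |
| GMATo | X | TSV | Perl & Java | Regular Expressions | Exact | 1–10 | 2+ | | |
| MREPS | | Text | C | Combinatorial | Inexact | 1+ | 2+ | | |
| PRoGeRF | Web | TSV | Perl | ? | Inexact | 1–12 | 2+ | | |
| QDD | | SCSV | Perl | ? | Exact | 1–6 | 5+ | | |
| SSR-Pipeline | | FASTA | Python | ? | Exact | 1–25 | 2+ | | |
| SSRIT | | TSV | Perl | Regular Expressions | Exact | 2–4 | 2+ | | |
| TRF | X | Text | ? | Heuristic | Inexact | 1+ | 2+ | | |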

Note: All algorithms can run in a Linux environment, accept command line options and take a fasta file as input.

Columns in the table are as follows: GUI = graphical user interface available. The algorithms create either a text file, tab separated values (TSV), semicolon separated values (SCSV), or fasta file. The language in which the program is written is followed by the method that the algorithm uses and the type of SSRs it can find (exact or inexact). Minimum SSR period sizes and SSR repeat numbers are also listed. Finally, we list if the algorithm is multithreaded or configurable to search for specific SSRs. Only Kmer-SSR and SA-SSR are multithreaded and configurable to search for specific SSRs.

While Kmer-SSR provides a substantially better user experience with more filters and options than all other algorithms, Kmer-SSR has several weaknesses. First, since Kmer-SSR is an exact algorithm, it is not as fast as the heuristic approach of MREPS when there are only canonical nucleotides in a sequence. Second, due to the k-mer decomposition approach used in Kmer-SSR, it is unable to identify fuzzy repeat regions where only one or two nucleotides differ from an exact repeat. Although not necessary for many applications, fuzzy repeats would provide Kmer-SSR with increased functionality that is not currently possible with the algorithm’s implementation. Third, Kmer-SSR has no web interface.

Unlike all other algorithms, Kmer-SSR offers the convenience of a completely exhaustive search in linear time (though with a larger constant factor than normal). This truly exhaustive search is entirely filter-free. As an example, that means it would report an ACG repeated seven times at position 1, six times at position 4, five times at position 7, etc. This is likely not necessary for most applications. However, with the exhaustive option, we set an upper limit for all SSR identifications. Furthermore, since genome complexity is important in primer design and predicting recombination events (Murray et al., 1999), the exhaustive option could be used as an easy approach to determine the proportion of a sequence that repeats.
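To make the filter-free behaviour concrete, the following Python sketch enumerates every tandem repeat at every start position and then estimates the proportion of a sequence that repeats. It is a naive illustration only: the functions `exhaustive_ssrs` and `repeat_fraction` are hypothetical, positions are 0-based (so the paper's positions 1, 4 and 7 appear as 0, 3 and 6), and this simple loop does not reproduce the linear-time behaviour described above.

```python
def exhaustive_ssrs(sequence, period_sizes, min_repeats=2):
    """Report every (start, period, repeats) with no filtering, 0-based start."""
    hits = []
    for k in period_sizes:
        for start in range(len(sequence) - min_repeats * k + 1):
            period = sequence[start:start + k]
            copies = 1
            while sequence.startswith(period, start + copies * k):
                copies += 1
            if copies >= min_repeats:
                hits.append((start, period, copies))
    return hits


def repeat_fraction(sequence, hits):
    """Fraction of bases covered by at least one reported repeat."""
    if not sequence:
        return 0.0
    covered = bytearray(len(sequence))
    for start, period, copies in hits:
        span = len(period) * copies
        covered[start:start + span] = b"\x01" * span
    return sum(covered) / len(sequence)


if __name__ == "__main__":
    seq = "ACG" * 7
    hits = exhaustive_ssrs(seq, [3])
    # Keep only the ACG hits to mirror the example in the text.
    print([(s, c) for s, p, c in hits if p == "ACG"])
    print(repeat_fraction("GATTACA", exhaustive_ssrs("GATTACA", [1, 2, 3])))
```

For ACG repeated seven times, the ACG hits start at 0, 3, 6, 9, 12 and 15 with 7, 6, 5, 4, 3 and 2 repeats respectively; the cyclic variants CGA and GAC are reported as well, which is why the exhaustive output is usually filtered.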
References

1.  Comparative sequence analysis of human minisatellites showing meiotic repeat instability.

Authors:  J Murray; J Buard; D L Neil; E Yeramian; K Tamaki; C Hollies; A J Jeffreys
Journal:  Genome Res       Date:  1999-02       Impact factor: 9.043

2.  mreps: Efficient and flexible detection of tandem repeats in DNA.

Authors:  Roman Kolpakov; Ghizlane Bana; Gregory Kucherov
Journal:  Nucleic Acids Res       Date:  2003-07-01       Impact factor: 16.971

3.  Tandem repeats finder: a program to analyze DNA sequences.

Authors:  G Benson
Journal:  Nucleic Acids Res       Date:  1999-01-15       Impact factor: 16.971

4.  Simple sequence repeats as a source of quantitative genetic variation.

Authors:  Y Kashi; D King; M Soller
Journal:  Trends Genet       Date:  1997-02       Impact factor: 11.639

5.  Slipped-strand mispairing: a major mechanism for DNA sequence evolution.

Authors:  G Levinson; G A Gutman
Journal:  Mol Biol Evol       Date:  1987-05       Impact factor: 16.240

6.  Computational and experimental analysis of microsatellites in rice (Oryza sativa L.): frequency, length variation, transposon associations, and genetic marker potential.

Authors:  S Temnykh; G DeClerck; A Lukashova; L Lipovich; S Cartinhour; S McCouch
Journal:  Genome Res       Date:  2001-08       Impact factor: 9.043

7.  Differential distribution of simple sequence repeats in eukaryotic genome sequences.

Authors:  M V Katti; P K Ranjekar; V S Gupta
Journal:  Mol Biol Evol       Date:  2001-07       Impact factor: 16.240

8.  The Chlamydomonas genome reveals the evolution of key animal and plant functions.

Authors:  Sabeeha S Merchant; Simon E Prochnik; Olivier Vallon; Elizabeth H Harris; Steven J Karpowicz; George B Witman; Astrid Terry; Asaf Salamov; Lillian K Fritz-Laylin; Laurence Maréchal-Drouard; Wallace F Marshall; Liang-Hu Qu; David R Nelson; Anton A Sanderfoot; Martin H Spalding; Vladimir V Kapitonov; Qinghu Ren; Patrick Ferris; Erika Lindquist; Harris Shapiro; Susan M Lucas; Jane Grimwood; Jeremy Schmutz; Pierre Cardol; Heriberto Cerutti; Guillaume Chanfreau; Chun-Long Chen; Valérie Cognat; Martin T Croft; Rachel Dent; Susan Dutcher; Emilio Fernández; Hideya Fukuzawa; David González-Ballester; Diego González-Halphen; Armin Hallmann; Marc Hanikenne; Michael Hippler; William Inwood; Kamel Jabbari; Ming Kalanon; Richard Kuras; Paul A Lefebvre; Stéphane D Lemaire; Alexey V Lobanov; Martin Lohr; Andrea Manuell; Iris Meier; Laurens Mets; Maria Mittag; Telsa Mittelmeier; James V Moroney; Jeffrey Moseley; Carolyn Napoli; Aurora M Nedelcu; Krishna Niyogi; Sergey V Novoselov; Ian T Paulsen; Greg Pazour; Saul Purton; Jean-Philippe Ral; Diego Mauricio Riaño-Pachón; Wayne Riekhof; Linda Rymarquis; Michael Schroda; David Stern; James Umen; Robert Willows; Nedra Wilson; Sara Lana Zimmer; Jens Allmer; Janneke Balk; Katerina Bisova; Chong-Jian Chen; Marek Elias; Karla Gendler; Charles Hauser; Mary Rose Lamb; Heidi Ledford; Joanne C Long; Jun Minagawa; M Dudley Page; Junmin Pan; Wirulda Pootakham; Sanja Roje; Annkatrin Rose; Eric Stahlberg; Aimee M Terauchi; Pinfen Yang; Steven Ball; Chris Bowler; Carol L Dieckmann; Vadim N Gladyshev; Pamela Green; Richard Jorgensen; Stephen Mayfield; Bernd Mueller-Roeber; Sathish Rajamani; Richard T Sayre; Peter Brokstein; Inna Dubchak; David Goodstein; Leila Hornick; Y Wayne Huang; Jinal Jhaveri; Yigong Luo; Diego Martínez; Wing Chi Abby Ngau; Bobby Otillar; Alexander Poliakov; Aaron Porter; Lukasz Szajkowski; Gregory Werner; Kemin Zhou; Igor V Grigoriev; Daniel S Rokhsar; Arthur R Grossman
Journal:  Science       Date:  2007-10-12       Impact factor: 47.728

9.  ProGeRF: proteome and genome repeat finder utilizing a fast parallel hash function.

Authors:  Robson da Silva Lopes; Walas Jhony Lopes Moraes; Thiago de Souza Rodrigues; Daniella Castanheira Bartholomeu
Journal:  Biomed Res Int       Date:  2015-02-25       Impact factor: 3.411

10.  GMATo: A novel tool for the identification and analysis of microsatellites in large genomes.

Authors:  Xuewen Wang; Peng Lu; Zhaopeng Luo
Journal:  Bioinformation       Date:  2013-06-08