Hani Z Girgis, Sergey L Sheetlin.
Abstract
Microsatellites (MSs) are DNA regions consisting of repeated short motif(s). MSs are linked to several diseases and have important biomedical applications. Thus, researchers have developed several computational tools to detect MSs. However, the currently available tools require adjusting many parameters, or depend on a list of motifs or on a library of known MSs. Therefore, two laboratories analyzing the same sequence with the same computational tool may obtain different results due to the user-adjustable parameters. Recent studies have indicated the need for a standard computational tool for detecting MSs. To this end, we applied machine-learning algorithms to develop a tool called MsDetector. The system is based on a hidden Markov model and a general linear model. The user is not obligated to optimize the parameters of MsDetector. Neither a list of motifs nor a library of known MSs is required. MsDetector is memory- and time-efficient. We applied MsDetector to several species. MsDetector located the majority of MSs found by other widely used tools. In addition, MsDetector identified novel MSs. Furthermore, the system has a very low false-positive rate resulting in a precision of up to 99%. MsDetector is expected to produce consistent results across studies analyzing the same sequence.
Year: 2012 PMID: 23034809 PMCID: PMC3592430 DOI: 10.1093/nar/gks881
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1.Converting a series of nucleotides to a series of scores. To score the nucleotide ‘A’ (surrounded by a gray box), we search for an exact copy or the best inexact copy of the word starting at ‘A’ within the flanking sequences (red with dashed underlines). Thus, we calculate the identity scores (Equation 4) of this word and every word in the left and the right flanking sequences. The score of the nucleotide ‘A’ is the best identity score. For example, the identity score of this word and the first word of the left flanking sequence is 3. The identity score of this word and the second word of the right flanking sequence is 6 which is the best possible score. Therefore, the score of the nucleotide ‘A’ is 6 (surrounded by a gray box). The score series, which is the output of the scoring component, is shown at the lower part of the figure. Notice the correspondence between the repeated ‘AT’ motif and the part of the output consisting of consecutive 6s.
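The scoring step described in this caption can be sketched in Python. This is a minimal illustration, not the authors' implementation: Equation 4 is not reproduced here, so `identity_score` is assumed to count matching positions between two equal-length words, and the names `score_series`, `word_len` and `flank_len` are hypothetical. The sketch also assumes that candidate words may start anywhere within `flank_len` positions on either side of the scored nucleotide.

```python
def identity_score(a: str, b: str) -> int:
    """Count the positions at which two equal-length words agree.

    Assumed form of the identity score; the paper's Equation 4 may differ.
    """
    return sum(x == y for x, y in zip(a, b))


def score_series(seq: str, word_len: int = 6, flank_len: int = 24) -> list[int]:
    """Score each nucleotide by the best identity score between the word
    starting there and any word starting in the left or right flank."""
    scores = []
    last_start = len(seq) - word_len  # last index where a full word fits
    for i in range(last_start + 1):
        word = seq[i:i + word_len]
        best = 0
        # Candidate word starts: up to flank_len positions left and right of i.
        for j in range(max(0, i - flank_len), min(last_start, i + flank_len) + 1):
            if j == i:
                continue  # do not compare the word with itself
            best = max(best, identity_score(word, seq[j:j + word_len]))
            if best == word_len:
                break  # an exact copy cannot be beaten
        scores.append(best)
    return scores
```

Under these assumptions, a perfect `AT` repeat yields a run of 6s, which matches the correspondence the caption points out between the repeated motif and the consecutive 6s in the score series.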
Figure 2.(A) The HMM structure. (B) The prior probabilities. (C) The transition probabilities. (D) The emission probabilities. (E) A series of states that likely generated a series of scores. and represent the non-MS and the MS states.
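Panel E refers to a series of states that likely generated a series of scores. One standard way to recover such a series from a two-state HMM is Viterbi decoding; whether MsDetector uses exactly this procedure is not stated here, so the sketch below is only an illustration. The state labels `"N"` (non-MS) and `"M"` (MS) and every probability used with it are hypothetical placeholders, not the trained values shown in panels B to D.

```python
import math


def viterbi(scores, prior, trans, emit):
    """Return the most likely state path for a non-empty list of discrete scores.

    prior[s], trans[s][t] and emit[s][score] are probabilities; the
    computation is done in log space to avoid underflow on long inputs.
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    states = list(prior)
    # best[s]: log-probability of the best path that ends in state s
    best = {s: log(prior[s]) + log(emit[s][scores[0]]) for s in states}
    back = []  # back[k][t]: best predecessor of state t at step k + 1
    for obs in scores[1:]:
        prev, best, pointers = best, {}, {}
        for t in states:
            src = max(states, key=lambda s: prev[s] + log(trans[s][t]))
            best[t] = prev[src] + log(trans[src][t]) + log(emit[t][obs])
            pointers[t] = src
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(states, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical parameters, for illustration only.
PRIOR = {"N": 0.95, "M": 0.05}
TRANS = {"N": {"N": 0.95, "M": 0.05}, "M": {"N": 0.05, "M": 0.95}}
EMIT = {
    "N": {3: 0.60, 4: 0.30, 5: 0.08, 6: 0.02},
    "M": {3: 0.02, 4: 0.08, 5: 0.30, 6: 0.60},
}
```

With these placeholder values, a run of high scores flanked by low scores is labeled as a stretch of `"M"` states surrounded by `"N"` states.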
The HMM performance on the three sets
| Window | Training sensitivity (%) | Validation sensitivity (%) | Testing sensitivity (%) | Mean FPR (bp/Mbp) | Mean precision (%) |
|---|---|---|---|---|---|
| | 88.9 | 88.4 | 85.7 | 3827 | 70.8 |
| | 84.7 | 84.6 | 84.7 | 1481 | 85.7 |
| | 84.3 | 84.6 | 85.4 | 1324 | 87.0 |
| | 84.2 | 84.9 | 85.9 | 922 | 90.6 |
| | 83.7 | 84.5 | 85.7 | 733 | 92.3 |
Sensitivity (Equation 1) is the percentage of the nucleotides of MSs detected by RepeatMasker that were also found by MsDetector. The mean of the FPRs (Equation 2) and the mean of the precisions (Equation 3) of MsDetector on the three sets are also shown.
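The nucleotide-level sensitivity described above can be illustrated with a short Python sketch. It assumes detections are given as half-open `(start, end)` intervals on one sequence; the function names are hypothetical, and Equations 2 and 3 (FPR and precision) are not reproduced here, so only sensitivity is shown.

```python
def covered_positions(intervals):
    """Expand half-open (start, end) intervals into a set of positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def sensitivity(reference, predicted):
    """Percentage of reference MS nucleotides that are also predicted.

    `reference` plays the role of the RepeatMasker detections and
    `predicted` the role of the MsDetector detections.
    """
    ref = covered_positions(reference)
    if not ref:
        raise ValueError("reference contains no nucleotides")
    return 100.0 * len(ref & covered_positions(predicted)) / len(ref)
```

For example, `sensitivity([(0, 10), (20, 30)], [(5, 25)])` gives 50.0, because 10 of the 20 reference nucleotides are covered. The set-based approach keeps the sketch short; for whole chromosomes a sorted-interval sweep would use far less memory.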
Figure 3.The effect of the length of the flanking sequences on the emission probabilities. We report the length of one of the two flanking sequences.
The performance of the HMM combined with a GLM-based filter
| Window | Training sensitivity (%) | Validation sensitivity (%) | Testing sensitivity (%) | Mean FPR (bp/Mbp) | Mean precision (%) |
|---|---|---|---|---|---|
| | 87.3 | 86.5 | 83.2 | 43 | 99.5 |
| | 83.4 | 83.4 | 83.4 | 40 | 99.5 |
| | 83.0 | 83.6 | 84.1 | 36 | 99.6 |
| | 83.0 | 83.8 | 84.7 | 39 | 99.6 |
| | 82.5 | 83.4 | 84.6 | 41 | 99.5 |
The size of the window is shown under column ‘Window.’ The sensitivity, FPR and precision are defined in Equations (1–3). The sensitivity is calculated with respect to the detections by RepeatMasker. The average FPR and the average precision of MsDetector on the three datasets are reported in the last two columns.
Figure 4.The linear function representing the GLM-based filter. HMM detections that have lengths and average scores below the line are considered negatives. On the other hand, detections that have lengths and average scores on or above the line are considered positives.
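The filtering rule in this caption can be sketched as a linear decision in the plane of detection length and average score. The sketch assumes a logistic link, so that a probability threshold (such as the 0.5 and 0.99 values mentioned for the GLM in the species table notes) corresponds to a straight line; the link function, the function name and the weights below are all assumptions made for illustration, not values from the paper.

```python
import math


def keep_detection(length, avg_score, weights, threshold=0.5):
    """Return True if an HMM detection lies on or above the decision line.

    With a logistic link, P >= threshold is equivalent to
    w0 + w_len * length + w_score * avg_score >= logit(threshold),
    which is a straight line in the (length, average score) plane.
    """
    w0, w_len, w_score = weights
    z = w0 + w_len * length + w_score * avg_score
    return z >= math.log(threshold / (1.0 - threshold))


# Hypothetical weights (intercept, length, average score), illustration only.
WEIGHTS = (-20.0, 0.125, 3.0)
```

With these placeholder weights, a 48-bp detection with an average score of 5.5 passes at a threshold of 0.5 but is rejected at 0.99, which illustrates how a stricter threshold shifts the line and trades sensitivity for precision.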
Tools performance on different species
| Tool | Sensitivity (%) | FPR (bp/Mbp) | Precision (%) | PP (%) | Time (s) |
|---|---|---|---|---|---|
| Human chromosome 19 (59.1-Mbp long) | | | | | |
| | 83.3 | 34 | 99.7 | 3.0 | 29 |
| STAR | 94.7 | 10 | 99.9 | 3.1 | 49 588 |
| Mreps | 70.8 | 346 | 97.0 | 2.1 | 15 |
| Tantan | 92.8 | 2842 | 83.6 | 8.4 | 43 |
| | 89.0 | 161 | 98.2 | 2.6 | 1 |
| | 93.6 | 520 | 94.7 | 3.7 | 1 |
| STAR | 92.6 | 7 | 99.9 | 1.6 | 1284 |
| Mreps | 71.7 | 860 | 89.2 | 2.1 | 1 |
| Tantan | 94.0 | 9538 | 49.4 | 7.6 | 2 |
| | 74.9 | 214 | 90.8 | 1.4 | 12 |
| | 73.8 | 67 | 96.8 | 1.1 | 13 |
| STAR | 87.6 | 48 | 98.1 | 0.7 | 21 793 |
| Mreps | 67.7 | 817 | 70.0 | 1.30 | 7 |
| Tantan | 90.7 | 8479 | 23.1 | 7.30 | 17 |
| | 74.1 | 174 | 92.5 | 0.8 | 1 |
| | 81.7 | 130 | 94.8 | 1.0 | 1 |
| STAR | 88.4 | 6 | 99.8 | 0.7 | 1019 |
| Mreps | 67.3 | 741 | 72.5 | 0.9 | 1 |
| Tantan | 90.0 | 6922 | 27.4 | 3.6 | 1 |
| | 81.7 | 2945 | 97.2 | 21.9 | 2 |
| | 78.0 | 965 | 99.0 | 19.6 | 2 |
| | 75.2 | 896 | 99.0 | 17.0 | 2 |
| STAR | 96.5 | 64 | 99.9 | 28.3 | 2025 |
| Mreps | 63.3 | 3518 | 96.4 | 13.6 | 1 |
| | 67.4 | 1434 | 98.6 | 15.9 | 3 |
| | 54.7 | 250 | 70.0 | 0.8 | 3 |
| | 76.6 | 39 | 95.4 | 2.0 | 3 |
| STAR | 88.8 | 4 | 99.6 | 1.0 | 2959 |
| Mreps | 21.9 | 961 | 19.7 | 0.7 | 1 |
| Tantan | 88.1 | 10 270 | 8.4 | 5.7 | 4 |
Column ‘’ displays the percentage of the nucleotides that were detected by RepeatMasker as MSs and were also detected by one of the four tools (Equation 1). FPR is the false-positive rate (Equation 2). Precision is defined by Equation (3). PP is the percentage of the chromosome predicted as MSs (Equation 10). The time that a tool took to process the chromosome is reported under ‘Time.’ was trained on the human chromosome 20; the threshold of the GLM was 0.5. was trained on the same chromosome; however, the threshold of the GLM was 0.99. , , , and were trained on one-third of the D. melanogaster chromosome 3R, P. falciparum chromosome 14, A. thaliana chromosome 5, S. cerevisiae chromosome 4 and M. tuberculosis circular chromosome, respectively. We used a half window of size 24 bp for all models except the model of , for which we used a half window of size 48 bp. The parameters of were the ones recommended by the author for AT-rich genomes. Specifically, we used the ‘atMask’ scoring matrix and the value of the parameter ‘r’ was assigned 0.01. All other parameters were the defaults.
Examples of MSs located by MsDetector but missed by RepeatMasker or STAR or both
Motif logos were generated by WebLogo (36).
Figure 5.The distributions of the length (A) and average score (B) of two groups of MSs detected in the human chromosome 19. The first group consisted of MSs overlapping with MSs located by RepeatMasker or by STAR. MSs located by MsDetector only comprised the second group.
Figure 6.Analysis of the average scores of the MSs identified by RepeatMasker but missed by MsDetector in the human chromosome 19. The first group consisted of the MSs that were detected by RepeatMasker and MsDetector (98%). The second group consisted of the MSs that MsDetector missed (2%).