Tong Liang, Xiaodan Fan, Qiwei Li, Shuo-Yen R Li.
Abstract
Tandem repeats occur frequently in biological sequences. They are important for studying genome evolution and human disease. A number of methods have been designed to detect a single tandem repeat in a sliding window. In this article, we focus on the case in which an unknown number of tandem repeat segments of the same pattern are dispersed throughout a sequence. We construct a probabilistic generative model for the tandem repeats, where the sequence pattern is represented by a motif matrix. A Bayesian approach is adopted for inference under this model. Markov chain Monte Carlo (MCMC) algorithms are used to explore the posterior distribution in order to infer both the motif matrix of the tandem repeats and the locations of the repeat segments. Reversible jump Markov chain Monte Carlo (RJMCMC) algorithms are used to address the transdimensional model selection problem raised by the variable number of repeat segments. Experiments on both synthetic and real data show that this new approach is powerful in detecting dispersed short tandem repeats. To our knowledge, this is the first work to adopt RJMCMC algorithms for the detection of tandem repeats.
Year: 2012 PMID: 22753023 PMCID: PMC3479165 DOI: 10.1093/nar/gks644
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. A schematic view of a sequence with DSATR. A background sequence and multiple repeat segments are generated independently. All repeat units in these repeat segments are random instances of a common motif ‘b’ of width w. The input sequence is generated by randomly inserting the repeat segments into separated locations in the background sequence.
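A minimal Python sketch of the generative process described in the Figure 1 caption may make it more concrete. Everything beyond the caption is an assumption made for illustration: the uniform i.i.d. background, the example motif matrix, the hypothetical function names `sample_unit` and `generate_sequence`, and fixing the number of units per segment in advance. The paper's actual model and priors may differ.

```python
import random

ALPHABET = "ACGT"

def sample_unit(motif, rng):
    """Draw one repeat unit; column j of the motif matrix holds the
    base probabilities (in A, C, G, T order) for position j."""
    return "".join(rng.choices(ALPHABET, weights=col)[0] for col in motif)

def generate_sequence(motif, background_len, segment_units, rng):
    """Insert tandem-repeat segments at separated locations in a background.

    segment_units lists the number of repeat units in each segment.
    Returns the sequence and a 0/1 label per position (1 = repeat segment).
    """
    # Assumption: background bases are i.i.d. uniform over A, C, G, T.
    background = rng.choices(ALPHABET, k=background_len)
    # Distinct interior cut points keep at least one background base
    # between any two segments and at both ends of the sequence.
    cuts = sorted(rng.sample(range(1, background_len), len(segment_units)))
    seq, labels, prev = [], [], 0
    for cut, n_units in zip(cuts, segment_units):
        seq.extend(background[prev:cut])
        labels.extend([0] * (cut - prev))
        segment = "".join(sample_unit(motif, rng) for _ in range(n_units))
        seq.extend(segment)
        labels.extend([1] * len(segment))
        prev = cut
    seq.extend(background[prev:])
    labels.extend([0] * (background_len - prev))
    return "".join(seq), labels

# Hypothetical motif of width w = 3 with one dominant base per position.
rng = random.Random(0)
motif = [
    [0.90, 0.05, 0.03, 0.02],
    [0.05, 0.90, 0.03, 0.02],
    [0.02, 0.03, 0.05, 0.90],
]
seq, labels = generate_sequence(motif, 200, [5, 8], rng)
```

The returned labels mark which positions belong to repeat segments, which is the kind of ground truth needed to score a detector on synthetic data.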
Performance comparison on synthetic data sets from the generative model
| Measure | Motif width w | Conservation | GMS (Std.) | RepeatMasker (Std.) | TRF (Std.) | Vanilla (Std.) | Piloted (Std.) |
|---|---|---|---|---|---|---|---|
| Sensitivity | 3 | High | 0.258(0.379) | 0.966(0.041) | 0.967(0.061) | 0.976(0.028) | 0.980(0.028) |
| Specificity | 3 | High | 0.597(0.048) | 0.931(0.046) | 0.896(0.080) | 0.967(0.025) | 0.969(0.022) |
| Sensitivity | 3 | Median | 0.000(0.000) | 0.724(0.145) | 0.662(0.175) | 0.918(0.059) | 0.920(0.062) |
| Specificity | 3 | Median | – | 0.936(0.047) | 0.932(0.058) | 0.937(0.040) | 0.944(0.033) |
| Sensitivity | 3 | Low | 0.000(0.000) | 0.232(0.179) | 0.241(0.173) | 0.771(0.122) | 0.782(0.125) |
| Specificity | 3 | Low | – | 0.952(0.064) | 0.908(0.141) | 0.907(0.061) | 0.905(0.061) |
| Sensitivity | 6 | High | 0.922(0.026) | 0.993(0.010) | 0.993(0.009) | 0.959(0.073) | 0.991(0.010) |
| Specificity | 6 | High | 0.848(0.037) | 0.960(0.024) | 0.837(0.104) | 0.982(0.014) | 0.984(0.014) |
| Sensitivity | 6 | Median | 0.634(0.049) | 0.942(0.051) | 0.871(0.100) | 0.959(0.046) | 0.976(0.020) |
| Specificity | 6 | Median | 0.790(0.050) | 0.965(0.025) | 0.906(0.074) | 0.976(0.019) | 0.977(0.019) |
| Sensitivity | 6 | Low | 0.192(0.106) | 0.310(0.176) | 0.216(0.125) | 0.901(0.063) | 0.909(0.054) |
| Specificity | 6 | Low | 0.603(0.114) | 0.977(0.034) | 0.900(0.121) | 0.948(0.033) | 0.952(0.033) |
| Sensitivity | 9 | High | 0.989(0.010) | 0.976(0.015) | 0.998(0.002) | 0.953(0.069) | 0.988(0.011) |
| Specificity | 9 | High | 0.959(0.024) | 0.983(0.012) | 0.763(0.126) | 0.988(0.010) | 0.989(0.010) |
| Sensitivity | 9 | Median | 0.804(0.039) | 0.954(0.026) | 0.937(0.080) | 0.944(0.079) | 0.981(0.023) |
| Specificity | 9 | Median | 0.883(0.047) | 0.981(0.017) | 0.882(0.078) | 0.984(0.016) | 0.984(0.017) |
| Sensitivity | 9 | Low | 0.399(0.049) | 0.456(0.151) | 0.302(0.164) | 0.944(0.045) | 0.944(0.037) |
| Specificity | 9 | Low | 0.738(0.077) | 0.984(0.021) | 0.961(0.057) | 0.963(0.022) | 0.965(0.024) |
Note: Each cell of the table shows the mean and standard deviation (in parentheses) of the sensitivity or specificity calculated from 100 different synthetic sequences.
a. The true consensus is given as the sole repeat pattern in the RepeatMasker library, so this comparison favors RepeatMasker.
b. 68 trials did not report any repeat element.
c. No trial reported any repeat element.
d. 18 trials did not report any repeat element.
e. 4 trials did not report any repeat element.
f. 17 trials did not report any repeat element.
g. 5 trials did not report any repeat element.
h. 1 trial did not report any repeat element.
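As an illustration of how such per-sequence scores could be computed, the sketch below evaluates predicted repeat positions against known ones at the nucleotide level. The definitions are assumptions, since this excerpt does not state them: sensitivity is taken as the fraction of true repeat positions that are predicted, and specificity as the fraction of predicted repeat positions that are true. The second choice is only suggested by the table showing a dash for specificity where no repeat element was reported. The function name `sensitivity_specificity` is hypothetical.

```python
def sensitivity_specificity(truth, predicted):
    """Position-level scores from two equal-length 0/1 label lists.

    Assumed definitions (not confirmed by the source):
      sensitivity = TP / (TP + FN)
      specificity = TP / (TP + FP), undefined (None) with no predictions
    """
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tp / (tp + fp) if (tp + fp) else None
    return sensitivity, specificity

truth     = [0, 1, 1, 1, 1, 0, 0, 0]
predicted = [0, 0, 1, 1, 1, 1, 0, 0]
scores = sensitivity_specificity(truth, predicted)        # (0.75, 0.75)

# A detector that reports nothing has zero sensitivity and no defined
# specificity under these assumed definitions.
silent = sensitivity_specificity(truth, [0] * len(truth))  # (0.0, None)
```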
Pairwise comparison of RepeatMasker, TRF and our algorithm on real data
| Data and consensus pattern | Algorithm A | Algorithm B (reference): Piloted V1 (%) | Algorithm B (reference): Piloted V2 (%) | Algorithm B (reference): TRF (%) | Algorithm B (reference): RepeatMasker (%) | Percentage of tested sequence classified as repeat region (%) |
|---|---|---|---|---|---|---|
| Chimp (panTro2), ChrY 1120001-1140000, (CA) | V1 | – | 90.41 | 89.43 | 90.09 | 1.32 |
| | V2 | 100 | – | 94.27 | 99.55 | 1.46 |
| | TRF | 76.89 | 73.29 | – | 79.73 | 1.14 |
| | RepeatMasker | 75.76 | 75.68 | 77.97 | – | 1.11 |
| Dog (Broad/canFan2), Chr1 3160001-3170000, (CGAAT) | V1 | – | 93.22 | 91.93 | 93.69 | 57.60 |
| | V2 | 100 | – | 98.70 | 99.33 | 61.79 |
| | TRF | 98.49 | 98.58 | – | 98.54 | 61.71 |
| | RepeatMasker | 94.57 | 93.46 | 92.84 | – | 58.14 |
| Human (hg19), Chr2 201650001-201670000, (CA) | V1 | – | 93.33 | 92.68 | 89.47 | 1.26 |
| | V2 | 100 | – | 97.56 | 92.11 | 1.35 |
| | TRF | 30.16 | 29.63 | – | 19.30 | 0.41 |
| | RepeatMasker | 80.95 | 77.78 | 53.66 | – | 1.14 |
| Rat (rn3), Chr17 24340001-24350000, (TCCTA) | V1 | – | 100 | 99.20 | 95.51 | 2.35 |
| | V2 | 100 | – | 99.20 | 95.51 | 2.35 |
| | TRF | 52.77 | 52.77 | – | 51.02 | 1.25 |
| | RepeatMasker | 99.57 | 99.57 | 100 | – | 2.45 |
| Zebrafish (danRer6), Chr13 24220001-24250000, (TA) | V1 | – | 93.88 | 92.64 | 89.89 | 2.05 |
| | V2 | 100 | – | 95.84 | 96.02 | 2.18 |
| | TRF | 94.30 | 91.59 | – | 92.50 | 2.08 |
| | RepeatMasker | 95.60 | 95.87 | 96.64 | – | 2.18 |
a. V1: Piloted version.
b. V2: Piloted version with post-processing.
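One plausible reading of the pairwise entries, sketched below in Python, is that each value gives the percentage of positions called as repeat by the reference Algorithm B that are also called by Algorithm A. This is an assumption inferred from the numbers rather than stated in this excerpt. It is at least consistent with the V1/V2 rows: V2 covers 100% of V1 everywhere, and for the chimp data 1.32/1.46 is about 90.4%, close to the reported 90.41. The function names `covered_percentage` and `repeat_percentage` and the toy call vectors are hypothetical.

```python
def covered_percentage(a_calls, b_calls):
    """Percentage of reference (B) repeat positions also called by A.

    Both arguments are equal-length 0/1 lists; returns None when the
    reference calls no repeat positions.
    """
    b_total = sum(b_calls)
    if b_total == 0:
        return None
    both = sum(1 for a, b in zip(a_calls, b_calls) if a and b)
    return 100.0 * both / b_total

def repeat_percentage(calls):
    """Percentage of the tested sequence classified as repeat region."""
    return 100.0 * sum(calls) / len(calls)

# Toy calls in which V2 extends V1 by one position, mimicking post-processing.
v1 = [0, 1, 1, 1, 0, 0, 0, 0, 0, 0]
v2 = [0, 1, 1, 1, 1, 0, 0, 0, 0, 0]

v2_vs_v1 = covered_percentage(v2, v1)   # 100.0: V2 covers all of V1
v1_vs_v2 = covered_percentage(v1, v2)   # 75.0: V1 covers 3 of V2's 4 positions
v1_share = repeat_percentage(v1)        # 30.0
v2_share = repeat_percentage(v2)        # 40.0
```

Under this reading the measure is asymmetric, which would explain why the table reports both the A-versus-B and B-versus-A entries for every pair.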