Korbinian Schneeberger, Ketil Malde, Eivind Coward, Inge Jonassen.
Abstract
A problem in EST clustering is the presence of repeat sequences. To avoid false matches, repeats have to be masked. This can be a time-consuming process, and it depends on available repeat libraries. We present a fast and effective method that aims to eliminate the problems repeats cause in the process of clustering. Unlike traditional methods, repeats are inferred directly from the EST data; we do not rely on any external library of known repeats. This makes the method especially suitable for analysing the ESTs from organisms without good repeat libraries. We demonstrate that the result is very similar to performing standard repeat masking before clustering.
Year: 2005 PMID: 15831790 PMCID: PMC1079970 DOI: 10.1093/nar/gki511
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Number of identical clusters as a function of the minimum length threshold.
Figure 2. Number of identical clusters as a function of the minimum gap size parameter.
Figure 3. Number of identical clusters as a function of minimum match number and multiplier.
Figure 4. For varying minimum match number, the number of clusters identical to clusters produced when masking is performed by RepeatMasker. The dataset is the ∼100 000 sequences from which the test dataset was constructed.
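The quantity plotted in the figures, the number of identical clusters, can be read as the number of clusters that contain exactly the same set of sequences in two clusterings. The following is a minimal sketch of that count, assuming each clustering is available as a list of clusters of sequence identifiers. The function name and the toy identifiers are illustrative and are not taken from the paper.

```python
def count_identical_clusters(clustering_a, clustering_b):
    """Count clusters that have exactly the same members in both clusterings."""
    # frozenset makes each cluster hashable and ignores the order of members.
    clusters_a = {frozenset(cluster) for cluster in clustering_a}
    clusters_b = {frozenset(cluster) for cluster in clustering_b}
    return len(clusters_a & clusters_b)


# Toy example with hypothetical EST identifiers.
reference = [["est1", "est2"], ["est3"], ["est4", "est5", "est6"]]
candidate = [["est2", "est1"], ["est3", "est4"], ["est5", "est6"]]
n_identical = count_identical_clusters(reference, candidate)
print(n_identical)
```

Only the cluster containing `est1` and `est2` is shared here, so the count is 1. Comparing a clustering with itself returns its number of distinct clusters.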
Comparison of different clustering results for our test dataset (number of clusters of each size after masking with RepeatBeater or RepeatMasker, or without masking)
| Cluster size | RepeatBeater | RepeatMasker | Unmasked |
|---|---|---|---|
| 1 | 168 | 186 | |
| 2 | 65 | 67 | |
| 3–4 | 59 | 62 | |
| 5–8 | 67 | 64 | |
| 9–16 | 40 | 41 | |
| 17–32 | 20 | 19 | |
| 33–64 | |||
| 65–128 | |||
| 129–256 | |||
| 257–512 | |||
| 513–1024 | |||
| 1025–2048 | 1 | ||
Cluster size distribution of clusterings of 100 000 A. thaliana ESTs (number of clusters of each size in UniGene and after masking with RepeatBeater, with RepeatMasker, or without masking)
| Cluster size | UniGene | RepeatBeater | RepeatMasker | Unmasked |
|---|---|---|---|---|
| 1 | 204 | 3017 | 3010 | 2988 |
| 2 | 363 | 1792 | 1790 | 1786 |
| 3–4 | 918 | 2243 | 2240 | 2232 |
| 5–8 | 1798 | 2161 | 2152 | 2147 |
| 9–16 | 1905 | 1433 | 1408 | 1403 |
| 17–32 | 1012 | 701 | 667 | 654 |
| 33–64 | 402 | 321 | 298 | 293 |
| 65–128 | 107 | 110 | 103 | 100 |
| 129–256 | 39 | 41 | 33 | 27 |
| 257–512 | 9 | 10 | 9 | 10 |
| 513–1024 | 4 | 6 | 5 | 3 |
| 1025–2048 | 1 | 1 | 1 | |
| 2049–4096 | 0 | 0 | ||
| 4097–8192 | 1 | 0 | ||
| 8193–16 384 | 1 | |||
| 16 385–32 768 | ||||
Cluster size distribution of clusterings of 50 000 Oryza sativa ESTs (number of clusters of each size in UniGene and after masking with RepeatBeater, with RepeatMasker, or without masking)
| Cluster size | UniGene | RepeatBeater | RepeatMasker | Unmasked |
|---|---|---|---|---|
| 1 | 505 | 597 | 700 | 576 |
| 2 | 192 | 229 | 232 | 227 |
| 3–4 | 445 | 417 | 417 | 412 |
| 5–8 | 718 | 666 | 678 | 647 |
| 9–16 | 814 | 719 | 709 | 656 |
| 17–32 | 502 | 404 | 409 | 353 |
| 33–64 | 221 | 179 | 176 | 138 |
| 65–128 | 92 | 76 | 73 | 46 |
| 129–256 | 18 | 13 | 14 | 7 |
| 257–512 | 2 | 0 | 0 | 0 |
| 513–1024 | 1 | 0 | 1 | 0 |
| 1025–2048 | 0 | 0 | 0 | |
| 2049–4096 | 0 | 0 | 0 | |
| 4097–8192 | 0 | 0 | 0 | |
| 8193–16 384 | 1 | 1 | 0 | |
| 16 385–32 768 | 1 | |||
Cluster size distribution of clusterings of 100 000 C. elegans ESTs (number of clusters of each size in UniGene and after masking with RepeatBeater, with RepeatMasker, or without masking)
| Cluster size | UniGene | RepeatBeater | RepeatMasker | Unmasked |
|---|---|---|---|---|
| 1 | 785 | 4688 | 4666 | 4642 |
| 2 | 2296 | 3187 | 3175 | 3173 |
| 3–4 | 1926 | 2842 | 2788 | 2784 |
| 5–8 | 1520 | 2140 | 2073 | 2070 |
| 9–16 | 1248 | 1353 | 1305 | 1305 |
| 17–32 | 961 | 643 | 575 | 572 |
| 33–64 | 397 | 269 | 235 | 233 |
| 65–128 | 133 | 96 | 74 | 71 |
| 129–256 | 55 | 52 | 31 | 32 |
| 257–512 | 9 | 9 | 5 | 5 |
| 513–1024 | 2 | 0 | 0 | |
| 1025–2048 | 1 | 0 | 0 | |
| 2049–4096 | 0 | 0 | ||
| 4097–8192 | 0 | 0 | ||
| 8193–16 384 | 1 | 1 | ||
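The size classes used in the cluster size tables double at each step (1, 2, 3–4, 5–8, 9–16, and so on). As a hedged illustration, a list of cluster sizes could be tallied into those classes as sketched below. The helper names `size_class` and `size_distribution` and the example sizes are assumptions made for this sketch, not part of the published tools, and the code writes ranges with a plain hyphen.

```python
from collections import Counter


def size_class(size):
    """Return the power-of-two size class label for a cluster size."""
    if size < 1:
        raise ValueError("cluster size must be at least 1")
    # Smallest k with size <= 2**k; the class then covers 2**(k-1)+1 .. 2**k.
    k = (size - 1).bit_length()
    upper = 2 ** k
    lower = upper // 2 + 1
    return str(upper) if lower >= upper else f"{lower}-{upper}"


def size_distribution(cluster_sizes):
    """Count how many clusters fall into each size class."""
    return Counter(size_class(size) for size in cluster_sizes)


# Hypothetical cluster sizes, for illustration only.
example_sizes = [1, 1, 2, 3, 4, 7, 9, 16, 17]
distribution = size_distribution(example_sizes)
for label, count in distribution.items():
    print(label, count)
```

For the example sizes, two clusters fall into each of the classes 1, 3-4 and 9-16, and one cluster into each of 2, 5-8 and 17-32.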
Comparison of masked regions
| Repeat family | Total | Found |
|---|---|---|
| SINE/Alu | 919 | 893 |
| Low complexity | 77 | 17 |
| Simple repeat | 59 | 12 |
| SINE/MIR | 53 | 0 |
| LINE/L2 | 34 | 3 |
| LINE/L1 | 34 | 1 |
| DNA/MER1 type | 23 | 2 |
| LTR/MaLR | 17 | 1 |
| DNA/MER2 type | 14 | 1 |
| LTR/ERV1 | 10 | 0 |
| Other | 24 | 0 |
| Sum | 1264 | 930 |
The second column shows how many regions of the given type were identified by RepeatMasker, and the third column shows how many of these were found by RepeatBeater. A region counts as found if there is at least a partial overlap with a RepeatBeater-masked region.
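The "found" criterion, at least a partial overlap between a RepeatMasker region and a RepeatBeater region, can be sketched as a simple interval test. The sketch below assumes that all regions lie on the same sequence and are given as half-open `(start, end)` coordinate pairs. That representation, the function names and the toy coordinates are assumptions made for illustration. The last lines recompute the fraction of regions found from the Sum and SINE/Alu rows of the table.

```python
def overlaps(region_a, region_b):
    """True if two half-open (start, end) regions share at least one position."""
    return region_a[0] < region_b[1] and region_b[0] < region_a[1]


def count_found(reference_regions, predicted_regions):
    """Count reference regions that overlap at least one predicted region."""
    return sum(
        any(overlaps(ref, pred) for pred in predicted_regions)
        for ref in reference_regions
    )


# Toy coordinates (hypothetical): only the first reference region is hit,
# because (100, 150) and (150, 200) merely touch under half-open coordinates.
reference_regions = [(10, 50), (100, 150), (300, 320)]
predicted_regions = [(40, 60), (150, 200)]
n_found = count_found(reference_regions, predicted_regions)
print(n_found)

# Fraction of RepeatMasker regions found, using the totals from the table.
fraction_all = 930 / 1264
fraction_alu = 893 / 919
print(f"{fraction_all:.1%} {fraction_alu:.1%}")
```

With the table's totals this gives roughly 74% of all regions and roughly 97% of the SINE/Alu regions.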