Abstract
MOTIVATION: The clustering of expressed sequence tags (ESTs) is a crucial step in many sequence analysis studies that require a high level of redundancy. Chimeric sequences, while uncommon, can make achieving the optimal EST clustering a challenge. Single-linkage algorithms are particularly vulnerable to the effects of chimeras. To avoid chimera-facilitated erroneous merges, researchers using single-linkage algorithms are forced to use stringent sequence-similarity thresholds. Such thresholds reduce the sensitivity of the clustering algorithm.
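The vulnerability of single-linkage clustering to chimeras can be illustrated with a minimal sketch. It treats single-linkage clusters as the connected components of a similarity graph, computed with a union-find structure. The EST names, the links between them, and the function name `single_link_clusters` are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
from collections import defaultdict


def single_link_clusters(items, links):
    """Group items into the connected components of the similarity graph."""
    parent = {item: item for item in items}

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Under single linkage, one similarity link is enough to merge clusters.
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for item in items:
        groups[find(item)].add(item)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical data: a1-a3 come from gene A, b1-b3 from gene B.
ests = ["a1", "a2", "a3", "b1", "b2", "b3", "chimera"]
gene_links = [("a1", "a2"), ("a2", "a3"), ("b1", "b2"), ("b2", "b3")]
# The chimera is similar to one EST from each gene.
chimera_links = [("chimera", "a3"), ("chimera", "b1")]

without_chimera_links = single_link_clusters(ests, gene_links)
with_chimera_links = single_link_clusters(ests, gene_links + chimera_links)
```

In this toy example the two genes stay in separate clusters until the chimera's two links are added, at which point all seven sequences collapse into a single cluster.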
Year: 2009 PMID: 19570806 PMCID: PMC2735666 DOI: 10.1093/bioinformatics/btp410
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Results of clustering ESTs using a genomic reference
| Species | #Clusters | #ESTs | Percentage alignable (a) |
|---|---|---|---|
| | 23 002 | 151 923 | 48.26 |
| | 24 525 | 71 113 | 44.30 |
| | 20 361 | 96 506 | 51.36 |
(a) With the alignment threshold set to 90% identity over a window of 40 bases.
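One way to read the threshold in the footnote is sketched below: a pair of already-aligned sequences passes if some window of 40 alignment columns contains at least 90% identical bases. The record does not say how the alignments were produced or how gaps were scored, so the function name `has_alignable_window`, the gap handling (a gap never counts as a match), and the sliding-window interpretation are all assumptions.

```python
def has_alignable_window(aligned_a, aligned_b, window=40, min_identity_pct=90):
    """Return True if any run of `window` aligned columns reaches the identity cutoff.

    Both inputs are rows of a pairwise alignment (equal length, '-' for gaps).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if len(aligned_a) < window:
        return False

    # 1 where the column is an identical, non-gap base; 0 otherwise.
    matches = [int(x == y and x != "-") for x, y in zip(aligned_a, aligned_b)]

    # Integer comparison avoids floating-point rounding at the cutoff.
    needed = min_identity_pct * window
    current = sum(matches[:window])
    if current * 100 >= needed:
        return True
    # Slide the window one column at a time, updating the match count.
    for i in range(window, len(matches)):
        current += matches[i] - matches[i - window]
        if current * 100 >= needed:
            return True
    return False
```

With the default parameters a window needs at least 36 matching columns out of 40.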
Probability of seeing at least k occurrences of the same chimeric type
| Species | k = 2 occurrences | k = 3 occurrences |
|---|---|---|
| | (0.3696, 0.4611) | (0.0049, 0.0049) |
| | (0.0183, 0.0185) | (<0.0001, <0.0001) |
| | (0.1888, 0.2092) | (0.0012, 0.0012) |
Probabilities were calculated using a rough 6-link clustering.
Evaluation of error introduced by linkage degree
| Clustering | TP | FP | FN | FN/(TP + FN) | FP/(TP + FP) |
|---|---|---|---|---|---|
| Single-link | 6 293 463 | 1 873 985 | 7256 | 0.0012 | 0.2294 |
| 2-link | 6 293 463 | 1 873 985 | 7256 | 0.0012 | 0.2294 |
| 3-link | 6 220 829.56 | 1 608 038.61 | 79 889.44 | 0.0127 | 0.2054 |
| 4-link | 6 171 122.04 | 1 340 286.73 | 129 596.96 | 0.0206 | 0.1784 |
| 5-link | 6 129 294.1 | 1 282 964.12 | 171 424.9 | 0.0272 | 0.1731 |
| Single-link | 544 450 | 777 126 | 1723 | 0.0032 | 0.5880 |
| 2-link | 544 409.5 | 771 706.75 | 1763.5 | 0.0032 | 0.5864 |
| 3-link | 531 223.34 | 581 126.43 | 14 949.66 | 0.0274 | 0.5224 |
| 4-link | 514 636.5 | 457 709.57 | 31 536.5 | 0.05772 | 0.4707 |
| 5-link | 496 927.81 | 419 862.93 | 49 245.19 | 0.0902 | 0.4580 |
| Single-link | 2 798 705 | 1 246 093 | 5950 | 0.0021 | 0.3081 |
| 2-link | 2 798 705 | 1 246 093 | 5950 | 0.0021 | 0.3081 |
| 3-link | 2 774 210 | 737 405.33 | 30 445 | 0.0109 | 0.2100 |
| 4-link | 2 754 055.07 | 680 983.11 | 50 599.92 | 0.01804 | 0.1982 |
| 5-link | 2 728 152.31 | 628 988.71 | 76 502.69 | 0.0273 | 0.1874 |
TP, FP and FN values for the k-link clusterings (for various k) compared with the reference clustering. FN/(TP + FN) is the proportion of pairs in the reference clustering that were incorrectly separated in the evaluated clustering (similar to a Type I error). FP/(TP + FP) is the proportion of pairs placed together in the evaluated clustering that were not together in the reference (similar to a Type II error).
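The pair-based counts behind these two ratios can be sketched as follows, assuming each clustering is given as a mapping from EST identifier to cluster label. The function names `pair_counts` and `error_rates` and the toy labels are hypothetical. The sketch counts each unordered pair of ESTs once and does not attempt to explain the fractional counts in some rows of the table.

```python
from collections import Counter
from math import comb


def pair_counts(reference, clustering):
    """Return (TP, FP, FN) over unordered pairs of items.

    Both arguments map item -> cluster label and must share the same items.
    """
    if reference.keys() != clustering.keys():
        raise ValueError("both clusterings must cover the same items")

    # Items sharing a (reference label, clustering label) pair are
    # together in both clusterings.
    joint = Counter((reference[i], clustering[i]) for i in reference)
    ref_sizes = Counter(reference.values())
    clu_sizes = Counter(clustering.values())

    tp = sum(comb(n, 2) for n in joint.values())
    fn = sum(comb(n, 2) for n in ref_sizes.values()) - tp  # together only in reference
    fp = sum(comb(n, 2) for n in clu_sizes.values()) - tp  # together only in clustering
    return tp, fp, fn


def error_rates(tp, fp, fn):
    """Return (FN/(TP + FN), FP/(TP + FP)) as defined in the caption."""
    return fn / (tp + fn), fp / (tp + fp)


# Toy example with five hypothetical ESTs.
reference = {"e1": "g1", "e2": "g1", "e3": "g1", "e4": "g2", "e5": "g2"}
clustering = {"e1": "c1", "e2": "c1", "e3": "c2", "e4": "c2", "e5": "c2"}
tp, fp, fn = pair_counts(reference, clustering)
```

Applying `error_rates` to the first Single-link row (TP = 6 293 463, FP = 1 873 985, FN = 7256) gives values that round to 0.0012 and 0.2294, in agreement with the table.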
Summary statistics for cluster sizes in the reference and k-link clusterings
| Clustering | Min | Q1 | Median | Mean | Q3 | Max | No. of Clusters |
|---|---|---|---|---|---|---|---|
| Reference | 1 | 1 | 2 | 6.605 | 5 | 1321 | 23 002 |
| Single-link | 1 | 1 | 2 | 7.455 | 6 | 1407 | 20 378 |
| 2-link | 1 | 1 | 2 | 7.455 | 6 | 1407 | 20 378 |
| 3-link | 1 | 1 | 2 | 6.830 | 5 | 1407 | 23 019 |
| 4-link | 1 | 1 | 2 | 6.471 | 5 | 1407 | 25 402 |
| 5-link | 1 | 1 | 3 | 6.289 | 5 | 1407 | 27 420 |
| Reference | 1 | 1 | 1 | 2.900 | 3 | 337 | 24 525 |
| Single-link | 1 | 1 | 1 | 3.668 | 3 | 713 | 19 387 |
| 2-link | 1 | 1 | 1 | 3.668 | 3 | 713 | 19 390 |
| 3-link | 1 | 1 | 2 | 3.413 | 3 | 712 | 22 887 |
| 4-link | 1 | 1 | 2 | 3.367 | 3 | 653 | 25 786 |
| 5-link | 1 | 1 | 2 | 3.420 | 4 | 653 | 28 136 |
| Reference | 1 | 1 | 2 | 4.740 | 4 | 1060 | 20 361 |
| Single-link | 1 | 1 | 2 | 5.656 | 4 | 1354 | 17 062 |
| 2-link | 1 | 1 | 2 | 5.656 | 4 | 1354 | 17 062 |
| 3-link | 1 | 1 | 2 | 5.266 | 4 | 1086 | 19 102 |
| 4-link | 1 | 1 | 2 | 5.075 | 4 | 1086 | 20 862 |
| 5-link | 1 | 1 | 2 | 4.994 | 4 | 1080 | 22 572 |
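The columns of this table can be computed from a list of cluster sizes with the Python standard library, as in the sketch below. The record does not state which quartile convention was used, so the `inclusive` method of `statistics.quantiles` is an assumption, and the function name `cluster_size_summary` and the example sizes are hypothetical.

```python
from statistics import mean, median, quantiles


def cluster_size_summary(cluster_sizes):
    """Summarise a list of cluster sizes (needs at least two clusters)."""
    # n=4 yields the three quartile cut points; the middle one is the median.
    q1, _, q3 = quantiles(cluster_sizes, n=4, method="inclusive")
    return {
        "min": min(cluster_sizes),
        "q1": q1,
        "median": median(cluster_sizes),
        "mean": mean(cluster_sizes),
        "q3": q3,
        "max": max(cluster_sizes),
        "n_clusters": len(cluster_sizes),
    }


# Hypothetical cluster sizes, for illustration only.
summary = cluster_size_summary([1, 1, 2, 5, 11])
```

When every EST is assigned to exactly one cluster, the mean is simply the number of ESTs divided by the number of clusters; for the first reference row, 151 923 / 23 002 is approximately 6.605.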
Fig. 1. On manual inspection, the putative chimeras identified appear to be a mixture of true chimeric sequences and low-abundance transcripts. (A) The EST shown is a C. elegans chimera of two genes: one fragment from dod-6 on chromosome III (two exons, whose location is marked with two red asterisks) and the other fragment from col-103 on chromosome IV (one exon, marked with one red asterisk). This chimera was the only sequence suggesting a link between the transcripts from these two separate genes. (B) ESTs derived from the T09F3.2 gene (http://genome.ucsc.edu, C. elegans, May 2008). A rare splice variant (BJ751277) spans the two otherwise non-overlapping EST clusters and is therefore marked as a putative chimera.