Scott Hazelhurst, Winston Hide, Zsuzsanna Lipták, Ramon Nogueira, Richard Starfield.
Abstract
The wcd system is an open source tool for clustering expressed sequence tags (ESTs) and other DNA and RNA sequences. wcd allows efficient all-versus-all comparison of ESTs using either the d(2) distance function or edit distance, improving on existing implementations of d(2). It supports merging, refinement and reclustering of clusters. It is 'drop in' compatible with the StackPack clustering package. wcd supports parallelization under both shared memory and cluster architectures. It is distributed with an EMBOSS wrapper allowing wcd to be installed as part of an EMBOSS installation (and so provided by a web server).
AVAILABILITY: wcd is distributed under a GPL licence and is available from http://code.google.com/p/wcdest.
SUPPLEMENTARY INFORMATION: Additional experimental results. The wcd manual, a companion paper describing underlying algorithms, and all datasets used for experimentation can also be found at www.bioinf.wits.ac.za/~scott/wcdsupp.html.
Year: 2008 PMID: 18480101 PMCID: PMC2718666 DOI: 10.1093/bioinformatics/btn203
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
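The d(2) distance named in the abstract is commonly defined on word counts: for two windows, it sums the squared differences in how often each k-letter word occurs, and the score for two sequences is the minimum over all pairs of windows. The sketch below is a naive illustration of that common definition, not wcd's optimized implementation. The function names, and the default word length and window size, are assumptions chosen for illustration and are not taken from the abstract.

```python
from collections import Counter


def word_counts(window: str, k: int) -> Counter:
    """Count every overlapping k-letter word in a window."""
    return Counter(window[i:i + k] for i in range(len(window) - k + 1))


def d2_window(u: str, v: str, k: int) -> int:
    """Sum of squared differences in k-word counts between two windows."""
    cu, cv = word_counts(u, k), word_counts(v, k)
    # Counter returns 0 for words that are absent from one of the windows.
    return sum((cu[w] - cv[w]) ** 2 for w in cu.keys() | cv.keys())


def d2(x: str, y: str, k: int = 6, window: int = 100) -> int:
    """Minimum window-to-window score over all window pairs (naive scan)."""
    wx = min(window, len(x))  # shrink the window for short sequences
    wy = min(window, len(y))
    return min(
        d2_window(x[i:i + wx], y[j:j + wy], k)
        for i in range(len(x) - wx + 1)
        for j in range(len(y) - wy + 1)
    )
```

Two sequences that share an identical window score 0, for example `d2("AAAACGTA", "TTCGTATT", k=2, window=4)`, because both contain the window `CGTA`. In a clustering setting, a pair would be linked when the score falls below a chosen threshold.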
Quality assessment on the A076941 set (SE: sensitivity; JI: Jaccard index)
| Method | SE | JI |
|---|---|---|
| wcd | 0.933 | 0.476 |
| PaCE | 0.835 | 0.636 |
| CAP3 | 0.739 | 0.719 |
| wcd | 0.994 | 0.401 |
| PaCE | 0.926 | 0.589 |
| wcd+CAP3 | 0.993 | 0.980 |
| PaCE+CAP3 | 0.853 | 0.828 |
Sensitivity and Jaccard index on subsets of A076941 set
| Subset | CAP3 | ESTate | PaCE | wcd | xsact |
|---|---|---|---|---|---|
| Sensitivity | | | | | |
| 1 | 0.659 | 0.709 | 0.632 | 0.809 | 0.834 |
| 2 | 0.659 | 0.747 | 0.662 | 0.853 | 0.864 |
| 3 | 0.666 | 0.905 | 0.847 | 0.950 | 0.959 |
| 4 | 0.705 | 0.831 | 0.759 | 0.900 | 0.941 |
| 5 | 0.800 | 0.907 | 0.859 | 0.960 | 0.969 |
| 6 | 0.813 | 0.897 | 0.886 | 0.978 | 0.982 |
| 7 | 0.821 | 0.878 | 0.841 | 0.917 | 0.929 |
| Jaccard index | | | | | |
| 1 | 0.649 | 0.439 | 0.672 | 0.478 | 0.230 |
| 2 | 0.650 | 0.541 | 0.673 | 0.702 | 0.250 |
| 3 | 0.653 | 0.657 | 0.615 | 0.768 | 0.585 |
| 4 | 0.656 | 0.456 | 0.656 | 0.828 | 0.413 |
| 5 | 0.782 | 0.539 | 0.542 | 0.577 | 0.235 |
| 6 | 0.796 | 0.713 | 0.700 | 0.504 | 0.247 |
| 7 | 0.819 | 0.759 | 0.837 | 0.903 | 0.756 |
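For readers unfamiliar with the two measures, the sketch below shows how sensitivity and the Jaccard index are usually computed for a clustering, assuming the common pair-counting definitions: a pair of sequences is a true positive if it shares a cluster in both the reference and the predicted clustering, a false negative if it shares a cluster only in the reference, and a false positive if it shares a cluster only in the prediction. The function names are hypothetical, and the source does not state that these are the exact definitions used for the tables.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """Return the set of unordered pairs of items that share a cluster."""
    pairs = set()
    for cluster in clusters:
        pairs.update(frozenset(p) for p in combinations(sorted(cluster), 2))
    return pairs


def sensitivity_and_jaccard(reference, predicted):
    """Pair-counting sensitivity (SE) and Jaccard index (JI)."""
    ref = co_clustered_pairs(reference)
    pred = co_clustered_pairs(predicted)
    tp = len(ref & pred)   # pairs together in both clusterings
    fn = len(ref - pred)   # pairs together only in the reference
    fp = len(pred - ref)   # pairs together only in the prediction
    se = tp / (tp + fn) if tp + fn else 1.0
    ji = tp / (tp + fp + fn) if tp + fp + fn else 1.0
    return se, ji


reference = [{"a", "b", "c"}, {"d", "e"}]
merged = [{"a", "b", "c", "d", "e"}]
print(sensitivity_and_jaccard(reference, merged))
```

In this toy example, merging everything into one cluster gives a sensitivity of 1.0 but a Jaccard index of only 0.4, because the Jaccard index also penalizes pairs that were joined incorrectly.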
Fig. 1. Robustness: Jaccard index of clusters from error sets with respect to original cluster.
Fig. 2. Relative performance of different size datasets on the C4 cluster.
| Size | 4M | 8M | 16M | 32M | 64M | 128M | 256M | 295M |
|---|---|---|---|---|---|---|---|---|
| Time (s) | 9.7 | 38 | 159 | 631 | 2562 | 10413 | 42936 | 53286 |
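The timings above grow by roughly a factor of four each time the input size doubles, which is what one would expect from an all-versus-all comparison. As a rough check, the sketch below estimates the exponent b in time ∝ size^b from each consecutive pair of columns. The function name is hypothetical, and the power-law model is an assumption made here for illustration.

```python
import math

# Size and time columns copied from the table above. The 295M column is left
# out because it is not a doubling of the previous size.
sizes = [4, 8, 16, 32, 64, 128, 256]
times = [9.7, 38, 159, 631, 2562, 10413, 42936]


def growth_exponents(sizes, times):
    """Estimate b in time ~ size**b from each consecutive pair of points."""
    points = list(zip(sizes, times))
    return [
        math.log(t2 / t1) / math.log(s2 / s1)
        for (s1, t1), (s2, t2) in zip(points, points[1:])
    ]


exponents = growth_exponents(sizes, times)
print([round(b, 2) for b in exponents])
```

Every estimated exponent is close to 2, consistent with running time that is quadratic in the input size.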
| Dataset | wcd | CAP3 | ESTate | xsact |
|---|---|---|---|---|
| Arabidopsis subsets (ave.) | 8.62s/4M | 309s/160M | 72s/64M | 42s/1.9G |
| SANBI10000 | 7.4s/4M | 75s/59M | 49s/47M | 40s/1.6G |
| Public cotton | 195s/10M | 1470s/530M | – | – |
| # slaves | 1 | 7 | 15 | 31 | 63 | 96 | 127 |
|---|---|---|---|---|---|---|---|
| Time (s) | 76243 | 10652 | 4938 | 2583 | 1411 | 1132 | 878 |
| Efficiency | 1.00 | 1.02 | 1.03 | 0.95 | 0.86 | 0.71 | 0.68 |
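The Efficiency row is consistent with the usual definition of parallel efficiency, the single-slave time divided by the product of the number of slaves and the parallel time. The sketch below assumes that definition (the source does not state it) and checks it against a few columns of the table; the function name is hypothetical.

```python
def parallel_efficiency(t1: float, slaves: int, t_parallel: float) -> float:
    """Efficiency = T(1) / (p * T(p)), assuming the usual definition."""
    return t1 / (slaves * t_parallel)


# (number of slaves, time in seconds) for selected columns of the table above.
columns = [(1, 76243), (15, 4938), (31, 2583), (63, 1411), (127, 878)]
t1 = columns[0][1]

for slaves, seconds in columns:
    print(slaves, round(parallel_efficiency(t1, slaves, seconds), 2))
```

For these columns the computed values round to 1.0, 1.03, 0.95, 0.86 and 0.68, matching the Efficiency row.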