John Parkinson, David B Guiliano, Mark Blaxter.
Abstract
BACKGROUND: Expressed sequence tags (ESTs) are single-pass reads from randomly selected cDNA clones. They provide a highly cost-effective method to access and identify expressed genes. However, they are often prone to sequencing errors and typically define incomplete transcripts. To increase the amount of information obtainable from ESTs and reduce the impact of sequencing errors, it is necessary to cluster ESTs into groups sharing significant sequence similarity.
Year: 2002 PMID: 12398795 PMCID: PMC137596 DOI: 10.1186/1471-2105-3-31
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Cluster size distribution for the three compared D. rerio cluster datasets
| 1 | 4169 | 9914 | 6848 |
| 2 | 1824 | 2231 | 2655 |
| 3–4 | 1953 | 1956 | 2321 |
| 5–8 | 1288 | 1155 | 1407 |
| 9–16 | 638 | 506 | 574 |
| 17–32 | 270 | 214 | 230 |
| 33–64 | 123 | 103 | 112 |
| 65–128 | 37 | 41 | 33 |
| 129–256 | 16 | 24 | 24 |
| 257–512 | 12 | 8 | 9 |
| 513–1024 | 5 | 3 | 2 |
| 1025–2048 | 1 | 1 | 0 |
| Total Clusters (from 58,888 sequences) |
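The per-bin counts above can be summed to compare the three datasets at a glance. The following is a minimal sketch, assuming the rows as extracted are complete; the column labels are placeholders because the table's header row is missing here, and the computed sums are not confirmed against the totals reported in the paper.

```python
# Rows copied from the cluster size table above:
# (cluster size bin, count in column 1, column 2, column 3).
# The dataset name for each column is not given in this extract.
rows = [
    ("1", 4169, 9914, 6848),
    ("2", 1824, 2231, 2655),
    ("3–4", 1953, 1956, 2321),
    ("5–8", 1288, 1155, 1407),
    ("9–16", 638, 506, 574),
    ("17–32", 270, 214, 230),
    ("33–64", 123, 103, 112),
    ("65–128", 37, 41, 33),
    ("129–256", 16, 24, 24),
    ("257–512", 12, 8, 9),
    ("513–1024", 5, 3, 2),
    ("1025–2048", 1, 1, 0),
]


def column_totals(table):
    """Sum the cluster counts in each of the three data columns."""
    return tuple(sum(row[i] for row in table) for i in (1, 2, 3))


def singleton_fractions(table):
    """Fraction of clusters in each column that hold a single sequence."""
    totals = column_totals(table)
    singletons = table[0][1:]  # first row is the size-1 bin
    return tuple(s / t for s, t in zip(singletons, totals))


totals = column_totals(rows)
fractions = singleton_fractions(rows)
for label, total, frac in zip(("column 1", "column 2", "column 3"), totals, fractions):
    print(f"{label}: {total} clusters, {frac:.1%} singletons")
```

A higher singleton fraction would be expected from a more stringent method, since fewer sequences meet the threshold for joining an existing cluster.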
Distribution of cluster events
Detailed analysis of UniGene cluster ug.2984
| Total sequences in clusters containing at least one sequence derived from ug.2984 | 1873 | 1880 | 1900 | 1900 |
| Total clusters | 1 | 38 | 66 | 74 |
| Clusters with > 100 seqs (sizes) | 1 | 4 (1075, 146, 145, 136) | 3 (711, 482, 108) | 5 (425, 219, 214, 203, 143) |
| Clusters with only one sequence (singletons) | 0 | 23 | 33 | 31 |
Figure 1. Schematic showing how the history of a cluster can affect its construction. For a given cluster (1), two sequences (A) and (B) show significant identity. Depending upon which sequence is processed first, cluster 1A or cluster 1B can be constructed. Addition of further sequences showing identity to (A) or (B) then leads to the formation of different clusters (1A, 2A) or (1B, 2B) depending on whether cluster 1A or 1B was originally built.
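The order dependence described in the Figure 1 caption can be reproduced with a toy example. This is a minimal sketch, not the clustering method from the paper: it assumes a hypothetical greedy rule in which each sequence joins the first cluster whose founding (seed) sequence it matches, and the similarity pairs are invented for illustration.

```python
def greedy_cluster(order, similar):
    """Assign sequences to clusters in the given processing order.

    Hypothetical rule for illustration only: a sequence joins the first
    cluster whose seed (first member) it is similar to; otherwise it
    founds a new cluster.
    """
    clusters = []  # each cluster is a list; element 0 is its seed
    for seq in order:
        for cluster in clusters:
            if frozenset((seq, cluster[0])) in similar:
                cluster.append(seq)
                break
        else:
            # No existing seed matched, so start a new cluster.
            clusters.append([seq])
    return clusters


# Invented similarity relation: A~B, C matches only A, D matches only B.
similar = {frozenset(pair) for pair in [("A", "B"), ("A", "C"), ("B", "D")]}

a_first = greedy_cluster(["A", "B", "C", "D"], similar)
b_first = greedy_cluster(["B", "A", "C", "D"], similar)

print("A processed first:", a_first)
print("B processed first:", b_first)
```

With A processed first, C is absorbed and D is left on its own; with B processed first, the roles of C and D swap. The same four sequences therefore yield different clusters depending only on processing order.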
Post cluster consensus assembly using CAP3 of CLOBB clusters derived from ug.2984
| CLOBB pooled clusters | 1 (1900) | 20 | 23 | 43 |
| CLOBB-α individual clusters | 33 (1867) | 46 | 11 | 90 |
| CLOBB-β individual clusters | 43 (1869) | 56 | 14 | 101 |
| TIGR predictions for the same 1900 sequences | N/A | 30 | 19 | 49 |
Summary of features of the three cluster methods examined
| Feature | UniGene | TIGR | CLOBB |
| Underlying Clustering Method | megaBLAST | WU-BLAST & CAP3 | NCBI BLAST |
| Stringency | Dependent on stage of clustering | Very High | High |
| Overlap allowed | N/A | < 20 bp | < 10% of sequence length |
| Clusters are always contiguous? | No | Yes | Yes |
| Dealing with potential chimeric clusters | Initial clustering performed with gene sequences – merging of these initial distinct clusters rejected | CAP3 does not include identified chimeric sequences | Definition of type III matches and 'superclusters' prevents chimeric sequences from merging unsuitable clusters. |
| Continuity (addition of new sequences) | New builds are compared with previous builds | Post processing | Incremental within algorithm |
| Historical information | Availability of previous builds | Notes showing retirement of clusters | 'superclusters' and merge events can be tagged |
| Portability and adaptability | Low | Low | High |
| Ease of retention of manual curation | Medium | Medium | High |
Figure 2. Schematic representation of the cluster process. For a further explanation of the clustering process, see text.