Ernesto Picardi, Flavio Mignone, Graziano Pesole.
Abstract
BACKGROUND: ESTs and full-length cDNAs represent an invaluable source of evidence for inferring reliable gene structures and discovering potential alternative splicing events. In newly sequenced genomes, these tasks may not be practicable owing to the lack of appropriate training sets. However, when expression data are available, they can be used to build EST clusters related to specific genomic transcribed loci. Common strategies recently employed to this end are based on sequence similarity between transcripts and can lead, in specific conditions, to inconsistent and erroneous clustering. In order to improve the cluster building and facilitate all downstream annotation analyses, we developed a simple genome-based methodology to generate gene-oriented clusters of ESTs when a genomic sequence and a pool of related expressed sequences are provided. Our procedure has been implemented in the software EasyCluster and takes into account the spliced nature of ESTs after an ad hoc genomic mapping.
Year: 2009 PMID: 19534735 PMCID: PMC2697633 DOI: 10.1186/1471-2105-10-S6-S10
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Graphical overview of the EasyCluster algorithm and workflow. In EasyCluster, genomic and EST sequences are initially used to build local databases. Next, GMAP is used to produce EST-to-genome alignments, and the results are parsed to build a first round of pseudo-clusters according to overlapping coordinates. Each pseudo-cluster then undergoes a refinement procedure that generates the final clusters by taking exon/intron boundaries into account. Before results are reported, prediction of alternative splicing events per cluster can optionally be requested.
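The two clustering rounds described in the caption can be illustrated with a minimal Python sketch. It is not the EasyCluster implementation: the `Alignment` record, the function names, and above all the refinement criterion (two ESTs stay together only if they share at least one identical intron) are assumptions made for illustration, since the caption only says that exon/intron boundaries are taken into account.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    """Hypothetical record for one EST-to-genome alignment."""
    est_id: str
    contig: str
    strand: str
    start: int            # leftmost genomic coordinate of the alignment
    end: int              # rightmost genomic coordinate of the alignment
    introns: frozenset    # set of (intron_start, intron_end) pairs


def pseudo_clusters(alignments):
    """First round: group alignments with overlapping genomic spans
    on the same contig and strand."""
    ordered = sorted(alignments, key=lambda a: (a.contig, a.strand, a.start, a.end))
    clusters, current, key, current_end = [], [], None, -1
    for aln in ordered:
        aln_key = (aln.contig, aln.strand)
        if current and aln_key == key and aln.start <= current_end:
            current.append(aln)
            current_end = max(current_end, aln.end)
        else:
            if current:
                clusters.append(current)
            current, key, current_end = [aln], aln_key, aln.end
    if current:
        clusters.append(current)
    return clusters


def refine(cluster):
    """Second round (assumed criterion): split a pseudo-cluster into groups
    of ESTs connected by at least one shared intron, using union-find."""
    parent = list(range(len(cluster)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}  # intron -> index of the first EST carrying it
    for idx, aln in enumerate(cluster):
        for intron in aln.introns:
            if intron in first_seen:
                parent[find(idx)] = find(first_seen[intron])
            else:
                first_seen[intron] = idx

    groups = {}
    for idx, aln in enumerate(cluster):
        groups.setdefault(find(idx), []).append(aln)
    return list(groups.values())
```

With this criterion, an EST that merely overlaps a neighbouring gene's span without sharing any of its introns is separated in the second round, which is the kind of case that coordinate overlap alone cannot resolve.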
Evaluation of different EST clustering tools on our benchmark dataset.
| Tool | Sn | JI | #Clusters (singletons) | #ESTs per cluster | α | β |
| EasyCluster | 0.995 | 0.995 | 112 (0) | 158.3 | 0.009 | 0 |
| wcd | 0.926 | 0.797 | 112 (15) | 158.3 | 0.018 | 0.045 |
| TGICL | 0.906 | 0.875 | 125 (0) | 141.3 | 0.144 | 0.036 |
| ClustDB | 0.562 | 0.424 | 201 (0) | 87.8 | 0.765 | 0.009 |
| BLASTClust | 0.037 | 0.037 | 8255 (7304) | 2.1 | 0.792 | 0 |
Sensitivity (Sn), Jaccard Index (JI), number of generated clusters, and average number of ESTs per cluster, based on our human benchmark dataset. The number of singletons is shown in brackets. α and β represent Type I and Type II error rates, respectively.
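As a rough illustration of how Sn and JI can be obtained when comparing a predicted clustering with a reference clustering, the sketch below uses the common pair-counting formulation: a pair of ESTs is a true positive if it is co-clustered in both, a false negative if only in the reference, and a false positive if only in the prediction. This formulation and the function names are assumptions, as the legend does not give the formulas. α and β are left out because their definitions are not stated here.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """Return the set of unordered EST pairs that share a cluster."""
    pairs = set()
    for members in clusters:
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


def sensitivity_and_jaccard(reference, predicted):
    """Pair-counting Sn = TP / (TP + FN) and JI = TP / (TP + FP + FN)."""
    ref = co_clustered_pairs(reference)
    pred = co_clustered_pairs(predicted)
    tp = len(ref & pred)   # pairs together in both clusterings
    fn = len(ref - pred)   # pairs together only in the reference
    fp = len(pred - ref)   # pairs together only in the prediction
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    ji = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return sn, ji
```

Under these definitions JI can never exceed Sn, which matches every row of the table.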
EasyCluster statistics and results for Ricinus communis
| #ESTs | #Ex_ESTs | #Unique ESTs | #Unspliced ESTs | #Used ESTs | #Clusters | #Singletons |
| 57690 | 482 | 33907 | 19921 | 35272 | 5879 | 2944 |
For EST clustering in Ricinus communis, EasyCluster was given 57,690 ESTs. A small number of ESTs (Ex_ESTs) were excluded because of poor genomic mapping, and 19,921 ESTs were filtered out because they were unspliced and had no defined orientation. Finally, 5,879 clusters were generated from 35,272 high-quality ESTs.
Figure 2. Example of the EasyCluster graphical report per cluster. This figure shows a graphical overview of an EST cluster generated by EasyCluster on the Ricinus communis genome. The cluster comprises 7 ESTs mapping to Ricinus contig number 29848 (3440 bp long). For each EST, the exon/intron structure is shown as green squares joined by black lines. The red circle highlights an exon skipping event occurring in the first two ESTs.
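An exon skipping event like the one circled in the figure can be spotted by comparing the exon/intron structures of two ESTs in the same cluster. The sketch below is a simplified, assumed check and not the procedure EasyCluster uses: it reports internal exons of one EST that fall entirely inside an intron of another. The function name and the coordinate convention (inclusive start and end) are hypothetical.

```python
def skipped_exons(exons_a, introns_b):
    """Return internal exons of EST A that lie entirely within an intron of EST B.

    exons_a   -- list of (start, end) exon coordinates of EST A, sorted by start
    introns_b -- list of (start, end) intron coordinates of EST B
    Coordinates are inclusive. Terminal exons of A are ignored because an
    exon at the edge of an alignment may simply reflect a truncated EST.
    """
    internal = exons_a[1:-1]
    return [
        (exon_start, exon_end)
        for (exon_start, exon_end) in internal
        for (intron_start, intron_end) in introns_b
        if intron_start <= exon_start and exon_end <= intron_end
    ]
```

For example, if EST A has exons at 100-200, 300-400, and 500-600 while EST B joins 100-200 directly to 500-600 (a single intron at 201-499), the middle exon of A is reported as skipped in B.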
Alternative splicing events in Ricinus communis.
| All events | Alt_acc | Alt_don | Skip | IR | Others |
| 918 | 254 | 187 | 92 | 277 | 108 |
Distribution of the 918 alternative splicing events deduced in Ricinus communis. Alt_acc: alternative acceptor; Alt_don: alternative donor; Skip: exon skipping; IR: intron retention; Others: combinations of simple events (e.g., cassette exons).
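The counts in the table can be turned into relative shares with a short calculation. The snippet uses only the numbers reported above; the variable names are illustrative.

```python
# Event counts copied from the table above.
events = {"Alt_acc": 254, "Alt_don": 187, "Skip": 92, "IR": 277, "Others": 108}

total = sum(events.values())

# Percentage of all events contributed by each class, rounded to one decimal.
shares = {name: round(100 * count / total, 1) for name, count in events.items()}

# Class with the largest number of events.
most_frequent = max(events, key=events.get)
```

The per-class counts sum to the reported total of 918, and intron retention is the most frequent class at roughly 30% of all events.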