Abstract
The identification of the genes that participate at the biological interface of two species remains critical to our understanding of the mechanisms of disease resistance, disease susceptibility and symbiosis. The sequencing of complementary DNA (cDNA) libraries prepared from the biological interface between two organisms provides an inexpensive way to identify the novel genes that may be expressed as a cause or consequence of compatible or incompatible interactions. Sequence classification and annotation of species origin typically use an orthology-based approach and require access to large portions of either genome, or a close relative. Novel species- or clade-specific sequences may have no counterpart within existing databases and remain ambiguous features. Here we present a web-service, Eclair, which utilizes support vector machines for the classification of the origin of expressed sequence tags stemming from mixed host cDNA libraries. In addition to providing an interface for the classification of sequences, users are presented with the opportunity to train a model to suit their preferred species pair. Eclair is freely available at http://eclair.btk.fi.
Year: 2005 PMID: 15980572 PMCID: PMC1160195 DOI: 10.1093/nar/gki434
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. A schematic showing the workflow applied by the Éclair web-server to classify EST sequences by species origin. There are two routes by which Éclair may be used. In Route 1, a user applies an existing model to classify sequences. In Route 2, the application is trained: a user uploads homogeneous sequence collections, and these are used to prepare the required models for ESTScan and the Eclat SVM. Both routes produce extensive WWW reporting to indicate sequence origins and the sensitivity and selectivity of the underlying models.
A list of the host:pathogen pairs that are available through the Éclair web-server, with basic statistics that illustrate the effectiveness of the underlying Eclat SVM models.
| Pathogen genome | Host genome | Test data: pathogen | Test data: host | PP | PH | HP | HH | Pathogen sensitivity | Pathogen selectivity | Host sensitivity | Host selectivity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Blumeria graminis | Hordeum vulgare | 292.9 | 1254.4 | 206.4 | 79.2 | 72.7 | 1061.2 | 0.72 (0.07) | 0.74 (0.03) | 0.94 (0.01) | 0.93 (0.02) |
| Globodera pallida | Solanum tuberosum | 727.7 | 1247.0 | 605.7 | 73.4 | 64.7 | 1096.2 | 0.89 (0.01) | 0.90 (0.01) | 0.94 (0.01) | 0.94 (0.01) |
| Haemonchus contortus | Ovis aries | 838.6 | 861.4 | 771.2 | 67.0 | 55.5 | 796.5 | 0.92 (0.02) | 0.93 (0.03) | 0.93 (0.03) | 0.92 (0.02) |
| Heterodera glycines | Glycine max | 1246.4 | 1239.5 | 1166.3 | 64.3 | 182.9 | 890.4 | 0.95 (0.03) | 0.86 (0.08) | 0.83 (0.16) | 0.93 (0.03) |
| M. grisea | O. sativa | 742 | 1241.3 | 640.5 | 116.4 | 98.9 | 974.8 | 0.85 (0.02) | 0.87 (0.01) | 0.91 (0.01) | 0.89 (0.01) |
| Manduca sexta | Nicotiana tabacum | 164.2 | 1237.3 | 133.8 | 35.0 | 11.5 | 898.7 | 0.79 (0.03) | 0.92 (0.02) | 0.99 (0.00) | 0.96 (0.01) |
| Meloidogyne incognita | Gossypium arboreum | 1247 | 1245.6 | 1073 | 100.8 | 138.2 | 969.4 | 0.91 (0.02) | 0.89 (0.01) | 0.88 (0.01) | 0.91 (0.02) |
| Neurospora crassa | Arabidopsis thaliana | 542.4 | 1251.8 | 503.9 | 34.7 | 44.8 | 1075.1 | 0.94 (0.03) | 0.92 (0.01) | 0.96 (0.01) | 0.97 (0.01) |
| Phytophthora infestans | L. esculentum | 1241.1 | 1240.1 | 1117.6 | 123.5 | 52.8 | 1074.1 | 0.90 (0.01) | 0.95 (0.01) | 0.95 (0.01) | 0.90 (0.01) |
| P. sojae | Glycine max | 1248.7 | 1242.1 | 1227 | 32.7 | 40.6 | 1025.0 | 0.97 (0.01) | 0.97 (0.01) | 0.96 (0.01) | 0.97 (0.01) |
Following the unigene assembly and CDS prediction steps in Éclair, the dataset was randomly sampled 10 times to produce representative datasets for Eclat training. Each sample was split so that 75% of the sequences were used for model training and the remaining 25% for testing. The average numbers of sequences used for testing are shown for the pathogen and host datasets. Following creation of the Eclat SVM model, the retained test sequences were classified; the average numbers of sequences classified are shown in columns PP, PH, HP and HH, where PP represents a pathogen sequence classified as a pathogen sequence, PH represents a pathogen sequence classified as a host sequence, and so on. The results of the simulations are summarized as sensitivity and selectivity for both the pathogen and host sequence models; the standard deviations from the 10 replicates are shown in brackets.
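The evaluation described above can be sketched in a few lines of Python. This is an illustrative sketch, not the Éclair implementation: the function names are hypothetical, and the record does not spell out the metric definitions. The sketch assumes sensitivity = correct / (correct + missed) and selectivity = correct / (correct + wrongly attributed), which reproduce the reported values to two decimals when applied to the averaged counts for the Blumeria graminis / Hordeum vulgare row.

```python
import random


def split_train_test(sequences, train_fraction=0.75, rng=None):
    """Shuffle a copy of `sequences` and split it into (train, test).

    Hypothetical helper mirroring the 75% / 25% split described in the
    table footnote; it is not taken from the Eclair code base.
    """
    rng = rng or random.Random()
    shuffled = list(sequences)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def sensitivity_selectivity(pp, ph, hp, hh):
    """Summarize a 2x2 confusion table for the pathogen and host classes.

    pp: pathogen sequences classified as pathogen
    ph: pathogen sequences classified as host
    hp: host sequences classified as pathogen
    hh: host sequences classified as host

    Assumed definitions (not stated explicitly in the source):
      sensitivity = correct / all true members of the class
      selectivity = correct / all sequences assigned to the class
    """
    return {
        "pathogen_sensitivity": pp / (pp + ph),
        "pathogen_selectivity": pp / (pp + hp),
        "host_sensitivity": hh / (hh + hp),
        "host_selectivity": hh / (hh + ph),
    }


# Averaged counts from the Blumeria graminis / Hordeum vulgare row.
blumeria = sensitivity_selectivity(pp=206.4, ph=79.2, hp=72.7, hh=1061.2)
for name, value in blumeria.items():
    print(f"{name}: {value:.2f}")
```

Note that the table reports means over 10 replicates, whereas this sketch takes ratios of the averaged counts, so small differences in the last decimal place are possible for other rows.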