Ingo Ebersberger, Sascha Strauss, Arndt von Haeseler.
Abstract
BACKGROUND: EST sequencing is a versatile approach for rapidly gathering protein-coding sequences. ESTs provide direct access to an organism's gene repertoire, bypassing the still error-prone procedure of gene prediction from genomic data. Therefore, ESTs are often the only source of biological sequence data from taxa outside mainstream interest. The widespread use of ESTs in evolutionary studies, and particularly in molecular systematics, is still hindered by the lack of efficient and reliable approaches for automated ortholog prediction in ESTs. Existing methods either depend on a known species tree or cannot cope with redundancy in EST data.
Year: 2009 PMID: 19586527 PMCID: PMC2723089 DOI: 10.1186/1471-2148-9-157
Source DB: PubMed Journal: BMC Evol Biol ISSN: 1471-2148 Impact factor: 3.260
Figure 1. Workflow of the HaMStR approach. Standard orthology prediction tools are used to identify orthologous groups, the so-called core-orthologs, for a set of completely sequenced primer taxa (Proteome A - F). The sequences in a core-ortholog are aligned and converted into a profile HMM (pHMM). A compilation of protein sequences or translated ESTs from a taxon not included in the primer taxa (Protein set G) is searched for hits with the pHMM. The resulting candidates display features that are characteristic of the protein modelled by the pHMM. To determine the orthology status of the candidates, we introduce a reciprocity criterion. Each candidate is compared by BLASTP with the proteome of one of the primer taxa, the so-called reference taxon (Proteome F). If the best BLASTP hit from the reference taxon corresponds to the protein that contributed to the pHMM, the candidate is called a candidate-ortholog; otherwise it is discarded.
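The reciprocity criterion described in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the HaMStR implementation: the function name `filter_candidate_orthologs`, the `best_reference_hit` callback (standing in for a BLASTP search of a candidate against the reference proteome), and all sequence identifiers are hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable, List, Optional, Set


def filter_candidate_orthologs(
    candidates: Iterable[str],
    best_reference_hit: Callable[[str], Optional[str]],
    reference_core_members: Set[str],
) -> List[str]:
    """Keep pHMM hits that pass the reciprocity check.

    candidates: identifiers of sequences that were hit by the pHMM.
    best_reference_hit: hypothetical stand-in for a BLASTP search; returns
        the identifier of the best hit in the reference proteome, or None.
    reference_core_members: reference-taxon protein(s) that contributed
        to the pHMM of this core-ortholog group.
    """
    kept = []
    for candidate in candidates:
        hit = best_reference_hit(candidate)
        # Accept only if the best reference hit is the protein that
        # contributed to the pHMM; otherwise the candidate is discarded.
        if hit is not None and hit in reference_core_members:
            kept.append(candidate)
    return kept


# Toy data (invented identifiers): precomputed best hits per candidate.
toy_best_hits = {"est_1": "refF_0042", "est_2": "refF_0999", "est_3": None}

result = filter_candidate_orthologs(
    candidates=["est_1", "est_2", "est_3"],
    best_reference_hit=toy_best_hits.get,
    reference_core_members={"refF_0042"},
)
print(result)
```

In this toy run only `est_1` is retained, because it is the only candidate whose best reference hit is the protein assumed to have contributed to the pHMM.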
Overview of the data and data sources used in this study
a
b
c
d
e. The numbers of ESTs per organism are given in parentheses.
f. The numbers of tentative consensus sequences are given in parentheses.
Table 2. Ortholog search for 994 evolutionarily conserved genes in the human proteome
| 976 | 979 | |
| 972 | 972 | |
| - | 4 | |
| 1 | - | |
| 3 | 3 | |
| 14 | 14 | |
| 4 | - | |
| - | 1 | |
| 994 | 994 | |
a "Different" denotes those instances where the two programs predict different human proteins as ortholog, or where an ortholog is predicted by only one program.
HaMStR ortholog search in human chromosome 2 ESTs
| 1106 | 32647 | |
| 81 | 6288 | |
| 1032 | 29293 | |
| 74 | 3354 | |
| 72 | 3243 | |
| 2 | 111 | |
| 9 | 389 | |
| 3% | 3% | |
| 89% | 55% |
a Total denotes the number of genes/ESTs in the chr2-EST data.
b Intersection of the genes represented in the chr2-ESTs and the human orthologs for the genes in the PoP set obtained with the human proteome data (cf. Table 2).
c Relative to the results using the human proteome data.
Figure 2. Sensitivity of HaMStR as a function of CDS coverage. Fraction of the coding sequence (CDS) covered by the ESTs that have been correctly annotated and missed by HaMStR, respectively.
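The quantity on which Figure 2 is based, the fraction of a CDS covered by ESTs, can be sketched as follows. This is only an illustration of how such a fraction might be computed, under the assumption that each EST is represented as a half-open interval `[start, end)` in CDS coordinates; the function name `cds_coverage` and the example coordinates are hypothetical and not taken from the study.

```python
from typing import Iterable, Tuple


def cds_coverage(cds_length: int, est_intervals: Iterable[Tuple[int, int]]) -> float:
    """Return the fraction of a CDS covered by the union of EST intervals.

    Intervals are half-open [start, end) in CDS coordinates (an assumption
    made for this sketch). Overlapping ESTs are counted only once.
    """
    if cds_length <= 0:
        raise ValueError("cds_length must be positive")
    covered = 0
    covered_up_to = 0  # right edge of the region already counted
    for start, end in sorted(est_intervals):
        # Clip to the CDS and skip the part already counted.
        start = max(start, covered_up_to, 0)
        end = min(end, cds_length)
        if end > start:
            covered += end - start
            covered_up_to = end
    return covered / cds_length


# Toy example: three ESTs on a 100-position CDS, two of them overlapping.
fraction = cds_coverage(100, [(0, 30), (20, 50), (80, 100)])
print(fraction)
```

Sorting the intervals first means each position is counted once even when several ESTs overlap, which matters for redundant EST data.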
Figure 3. A maximum likelihood phylogeny of 35 fungi based on 178 genes. Unless otherwise stated, all splits in the tree have bootstrap support values of 100. For taxa shown in upper-case letters, the annotated proteome was used. For the remaining taxa, orthologs were predicted from ESTs.