Iterated sequence databank search methods.

W R Taylor, N P Brown.

Abstract

Iterated sequence databank search methods were assessed from the viewpoint of someone who has the sequence of a novel gene product and wishes to find distant relatives of their protein and, in the specific searches against the PDB, also hopes to find a relative of known structure. We examined three methods in detail, spanning a range from simple pattern-matching to sophisticated weighted profiles. Rather than apply these methods 'blindly' (with default parameters) to a large number of test queries, we concentrated on the globins, so allowing a more detailed investigation of each method on different data subsets with different parameter settings. Despite its widespread use, regular-expression matching proved to be very limited, seldom extending beyond the sub-family from which the pattern was derived. To attain any generality, the patterns had to be 'stripped down' to include only the most highly conserved parts. The QUEST program avoided these problems by introducing a more flexible (weighted) matching. On the PDB sequences this was highly effective, missing only a few globins with probes based on each sub-family or even on a single representative from each sub-family. In addition, very few false positives were encountered, and those that did match often did so for only a few cycles before being lost again. On the larger sequence collection, however, QUEST encountered problems with maintaining (or achieving) the alignment of the full globin family. psi-BLAST also recognised almost all the globins when matching against the PDB sequences, typically missing three or four of the most distantly related sequences while picking up a few false positives. In contrast to QUEST, psi-BLAST performed very well on the larger databank, retrieving almost a full collection of globins while still retaining the same proportion of false positives.
SAM applied to the PDB sequences performed reasonably well with the myoglobin and hemoglobin families as probes, typically missing several of the more difficult proteins, but performed poorly with the leghemoglobin probe. Only with the full family range as a probe did it produce results comparable to psi-BLAST and QUEST. With the larger databank, SAM produced a good result but, again, this was achieved only by using the full range of sequence variation with the default regulariser; the use of Dirichlet mixtures failed completely in this situation.
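
The limitation of regular-expression matching noted in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the sequences are short made-up strings (not real globins), and the names `SEQUENCES`, `STRICT`, `STRIPPED` and `hits` are hypothetical and do not come from the paper or from any of the programs it assesses. The sketch shows only the general idea that a pattern derived from one sub-family tends to match that sub-family alone, while a pattern reduced to its conserved positions can reach further.

```python
import re

# Toy, made-up sequences for illustration only (not real globin data).
SEQUENCES = {
    "subfamA_1": "MKVLSHGDKWAAL",
    "subfamA_2": "MKVLSHGEKWAAL",
    "subfamB_1": "TRILAHGQRFGGV",
    "unrelated": "PPGSTTNNQQDDE",
}

# Hypothetical pattern derived from sub-family A alone: most positions fixed.
STRICT = re.compile(r"KVLSHG[DE]KW")

# Hypothetical 'stripped-down' pattern: only the positions assumed to be
# conserved across sub-families are kept; the rest become wildcards or
# small residue classes.
STRIPPED = re.compile(r"[VI]L.HG.[KR][WF]")


def hits(pattern):
    """Return the sorted names of all sequences containing a match."""
    return sorted(name for name, seq in SEQUENCES.items() if pattern.search(seq))


if __name__ == "__main__":
    print("strict:  ", hits(STRICT))
    print("stripped:", hits(STRIPPED))
```

In this toy setting the strict pattern finds only the two sub-family A sequences, whereas the stripped-down pattern also finds the sub-family B sequence. A looser pattern will in general also be more prone to chance matches, which is one motivation for weighted matching of the kind the abstract attributes to QUEST and to profile methods.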

Year:  1999        PMID: 10404625     DOI: 10.1016/s0097-8485(99)00017-0

Source DB:  PubMed          Journal:  Comput Chem        ISSN: 0097-8485


  2 in total

1.  Homology-extended sequence alignment.

Authors:  V A Simossis; J Kleinjung; J Heringa
Journal:  Nucleic Acids Res       Date:  2005-02-07       Impact factor: 16.971

2.  Reduction, alignment and visualisation of large diverse sequence families.

Authors:  William R Taylor
Journal:  BMC Bioinformatics       Date:  2016-08-02       Impact factor: 3.169

