Roee Gutman, Carine Berezin, Roy Wollman, Yossi Rosenberg, Nir Ben-Tal.
Abstract
Sequence signature databases such as PROSITE, which include amino acid segments that are indicative of a protein's function, are useful for protein annotation. Lamentably, the annotation is not always accurate. A signature may be falsely detected in a protein that does not carry out the associated function (false positive prediction, FP) or may be overlooked in a protein that does carry out the function (false negative prediction, FN). A new approach has emerged in which a signature is replaced with a sequence profile, calculated based on multiple sequence alignment (MSA) of homologous proteins that share the same function. This approach, which is superior to the simple pattern search, essentially searches with the sequence of the query protein against an MSA library. We suggest here an alternative approach, implemented in the QuasiMotiFinder web server (http://quasimotifinder.tau.ac.il/), which is based on a search with an MSA of homologous query proteins against the original PROSITE signatures. The explicit use of the average evolutionary conservation of the signature in the query proteins significantly reduces the rate of FP prediction compared with the simple pattern search. QuasiMotiFinder also has a reduced rate of FN prediction compared with simple pattern searches, since the traditional search for precise signatures has been replaced by a permissive search for signature-like patterns that are physicochemically similar to known signatures. Overall, QuasiMotiFinder and the profile search are comparable to each other in terms of performance. They are also complementary to each other in that signatures that are falsely detected in (or overlooked by) one may be correctly detected by the other.
Year: 2005 PMID: 15980465 PMCID: PMC1160256 DOI: 10.1093/nar/gki496
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. PROSITE motif PS00138. The first, second, third and seventh amino acid positions can accommodate only glycine, threonine, serine and proline, respectively. The fifth position can accommodate serine or alanine. The eleventh position can accommodate alanine or glycine. The tenth position can accommodate serine, threonine, alanine, valine or cysteine, and the fourth, sixth, eighth and ninth positions are wildcard residues (x). The serine residue in the third position is part of the catalytic triad.
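The position-by-position description of PS00138 can be written down as a pattern and scanned against a sequence. The following is a minimal sketch, not the QuasiMotiFinder algorithm: the server's permissive search relies on physicochemical similarity and evolutionary conservation, whereas this sketch substitutes a plain count of mismatching positions. The names `PS00138`, `to_regex` and `scan`, and the toy sequences in the usage note, are hypothetical and chosen only for illustration.

```python
import re

# Allowed residues at each of the 11 positions, taken from the caption.
# None marks a wildcard (x) position.
PS00138 = ["G", "T", "S", None, "SA", None, "P", None, None, "STAVC", "AG"]


def to_regex(motif):
    """Build a strict regular expression from the per-position residue sets."""
    parts = []
    for allowed in motif:
        if allowed is None:
            parts.append(".")            # wildcard position
        elif len(allowed) == 1:
            parts.append(allowed)        # single permitted residue
        else:
            parts.append("[" + allowed + "]")
    return "".join(parts)


def scan(sequence, motif, max_mismatches=1):
    """Yield (start, segment, mismatches) for every window that deviates
    from the motif in at most max_mismatches non-wildcard positions.
    The start position is 1-based."""
    width = len(motif)
    for start in range(len(sequence) - width + 1):
        segment = sequence[start:start + width]
        mismatches = sum(
            1
            for residue, allowed in zip(segment, motif)
            if allowed is not None and residue not in allowed
        )
        if mismatches <= max_mismatches:
            yield start + 1, segment, mismatches


if __name__ == "__main__":
    print(to_regex(PS00138))
    # Toy sequences (made up): a strict match, then one with serine 3 replaced.
    for toy in ("MKGTSMASPHVAGLL", "MKGTAMASPHVAGLL"):
        print(toy, list(scan(toy, PS00138)))
```

Setting `max_mismatches=0` reduces `scan` to the strict pattern search; raising it admits signature-like segments at the cost of more spurious hits, which is the trade-off that a conservation-based filter is meant to address.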
Figure 2. A QuasiMotiFinder analysis of the bovine furin protein. A part of the output is presented here, and the full output is available as supplementary material at . The query sequence is color-coded by evolutionary conservation (see the color bar), with burgundy-through-turquoise indicating conserved-through-variable residues. Amino acid positions that were occupied with 9 or fewer residues (the rest of the homologous proteins included gaps) are marked in yellow. The inferred evolutionary conservation grade of these positions is unreliable. Residues in the query sequence that are identical to the ones of the PROSITE signature are marked in green above the sequence according to the single-letter code; residues that deviate from the PROSITE signature are marked as ‘Ψ’, and wildcard residues are marked with dots. Three strict PROSITE motifs were detected. Two of them, the PS00136 and PS00137 signatures, are related to the aspartic and histidine residues of the catalytic triad, which is consistent with the biological function of the protein. In addition, the server detected three pseudo-motifs: PS00013, PS00501 and PS00138. The first two differ from their original motifs in two and three positions, respectively, thus suggesting that they are FP hits. The third (residues 366–376) involves a single change compared with the PS00138 motif, which is typical for the furin protein family.
The set of 22 sequence signatures used in the statistical analysis
| PROSITE identifier | Signature description | γi |
|---|---|---|
| PS00485 | Adenosine and AMP deaminase signature | −1.4940 |
| PS00197 | 2Fe-2S ferredoxins, iron–sulfur-binding region signature | 2.7733 |
| PS00636 | dnaJ domains signatures and profile | −2.4318 |
| PS00693 | Riboflavin synthase alpha chain family Lum-binding site signature | −1.3764 |
| PS00147 | Arginase family signatures | 1.8978 |
| PS00152 | ATP synthase alpha- and beta-subunits signature | −1.5161 |
| PS00043 | Bacterial regulatory proteins, gntR family signature | −0.9389 |
| PS00104 | EPSP synthase signatures | −0.6271 |
| PS00453 | FKBP-type peptidyl–prolyl | −1.7875 |
| PS00178 | Aminoacyl-transfer RNA synthetases class-I signature | −1.0039 |
| PS00227 | Tubulin subunits alpha, beta and gamma signature | 4.5902 |
| PS00296 | Chaperonins cpn60 signature | −1.3363 |
| PS00061 | Short-chain dehydrogenases/reductases family signature | 0.7331 |
| PS00036 | Basic-leucine zipper (bZIP) domain signature and profile | 0.2328 |
| PS00559 | Eukaryotic molybdopterin oxidoreductases signature | −1.5359 |
| PS00287 | Cysteine protease inhibitors signature | −2.3378 |
| PS00107 | Protein kinases ATP-binding region signature | −0.0815 |
| PS00118 | Phospholipase A2 histidine active site | 2.3733 |
| PS00283 | Soybean trypsin inhibitor (Kunitz) protease inhibitors family signature | −0.7184 |
| PS00606 | Beta-ketoacyl synthases active site | −0.0164 |
| PS00697 | ATP-dependent DNA ligase AMP-binding site | 2.0708 |
| PS01228 | Hypothetical cof family signatures | 0 |
γi is the value of the coefficient associated with the signature in the logistic model of Equation 1; PS01228 was selected as a reference and its coefficient was set to zero.
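Because PS01228 is the reference with a coefficient of zero, each γi can be read as a log odds ratio relative to that signature, which is a general property of logistic models. Equation 1 is not reproduced in this record, so the sketch below assumes nothing about the modeled outcome or the other terms of the model; it only exponentiates a few coefficients copied from the table. The function name `odds_ratio` is hypothetical.

```python
import math

# A subset of the coefficients from the table above.
gamma = {
    "PS00227": 4.5902,
    "PS00636": -2.4318,
    "PS01228": 0.0,  # reference signature, coefficient fixed at zero
}


def odds_ratio(coefficient, reference=0.0):
    """Odds ratio implied by a logistic coefficient relative to a reference,
    with all other terms of the model held fixed."""
    return math.exp(coefficient - reference)


if __name__ == "__main__":
    for identifier, value in gamma.items():
        print(f"{identifier}: odds ratio vs PS01228 = {odds_ratio(value):.3f}")
```

A positive γi therefore multiplies the odds of the modeled outcome relative to PS01228, and a negative γi divides them.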
Figure 3. The distribution of the conservation scores within the three subgroups: true positive (TP, solid line), false negative (FN, dashed line) and false positive (FP, dot–dashed line).
Comparison of the performance of the QuasiMotiFinder (QMF) and the eMOTIF web servers
| Server | TP: Present | TP: Absent | TP: Total | FP: Present | FP: Absent | FP: Total | FN: Present | FN: Absent | FN: Total |
|---|---|---|---|---|---|---|---|---|---|
| QMF | 29 | 1 | 30 | 6 | 24 | 30 | 23 | 7 | 30 |
| eMOTIF | 25 | 5 | 30 | 1 | 29 | 30 | 24 | 6 | 30 |
The three subgroups are marked as TP, FP and FN. The number of proteins with correctly detected motifs is listed in the ‘Present’ column in each subgroup. The number of proteins whose motifs were overlooked is listed in the ‘Absent’ column. The total number of proteins in each subgroup is listed in the ‘Total’ column.
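The counts in the comparison table can be turned into per-subgroup fractions for a side-by-side reading. This is a small arithmetic sketch using only the numbers listed above; the names `counts` and `present_fraction` are hypothetical, and no interpretation beyond "fraction of proteins in the 'Present' column" is assumed.

```python
# (Present, Absent) counts per server and subgroup, copied from the table.
counts = {
    "QMF":    {"TP": (29, 1), "FP": (6, 24), "FN": (23, 7)},
    "eMOTIF": {"TP": (25, 5), "FP": (1, 29), "FN": (24, 6)},
}


def present_fraction(present, absent):
    """Fraction of proteins in a subgroup that fall in the 'Present' column."""
    return present / (present + absent)


fractions = {
    server: {
        group: present_fraction(present, absent)
        for group, (present, absent) in groups.items()
    }
    for server, groups in counts.items()
}

if __name__ == "__main__":
    for server, groups in fractions.items():
        summary = ", ".join(f"{g}: {f:.1%}" for g, f in groups.items())
        print(f"{server}: {summary}")
```

With 30 proteins per subgroup, a difference of a single protein shifts a fraction by about 3.3 percentage points, so small gaps between the two servers should be read with that granularity in mind.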