
New and continuing developments at PROSITE.

Christian J A Sigrist, Edouard de Castro, Lorenzo Cerutti, Béatrice A Cuche, Nicolas Hulo, Alan Bridge, Lydie Bougueleret, Ioannis Xenarios.

Abstract

PROSITE (http://prosite.expasy.org/) consists of documentation entries describing protein domains, families and functional sites, as well as associated patterns and profiles to identify them. It is complemented by ProRule, a collection of rules that increases the discriminatory power of these profiles and patterns by providing additional information about functionally and/or structurally critical amino acids. PROSITE signatures, together with ProRule, are used for the annotation of domains and features of UniProtKB/Swiss-Prot entries. Here, we describe recent developments that allow users to perform whole-proteome annotation, as well as a number of filtering options that can be combined to perform powerful targeted searches for biological discovery. The latest version of PROSITE (release 20.85, of 30 August 2012) contains 1308 patterns, 1039 profiles and 1041 ProRules.

Substances：
Proteins
Proteome

Year: 2012 PMID: 23161676 PMCID: PMC3531220 DOI: 10.1093/nar/gks1067

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

PROSITE is a resource for the identification and annotation of conserved regions in protein sequences. These regions are identified using two types of signatures: generalized profiles (weight matrices), which describe protein families and modular protein domains, and patterns (regular expressions), which describe short sequence motifs often corresponding to functionally or structurally important residues (1). PROSITE signatures are linked to annotation rules, or ProRules, which define protein sequence annotations (such as active site and ligand-binding residues) and the conditions under which they apply (for example, requiring specific amino acid residues) (2). ProRule is used for the annotation of protein families, domains and sequence features in UniProtKB/Swiss-Prot, the manually curated section of the UniProt Knowledgebase (3), and currently provides annotation for >75% of the 1054 domains to be found there (release 2012_08, 5 September 2012). Part of the information stored in ProRule (e.g. active and binding sites, disulfide bonds) is also accessible to the ScanProsite user. PROSITE provides extensive documentation for each signature, including information on nomenclature, function, sequence features, pointers to 3D structure(s), protein architectures in which the signature is found, its taxonomic distribution and important literature references (1). PROSITE signatures, ProRules and PROSITE documentation can be accessed from our website at http://prosite.expasy.org/ (4). PROSITE signatures are also made available through InterPro (http://www.ebi.ac.uk/interpro/index.html), an integrated database of protein signatures used for the classification and annotation of proteins and genomes (5). Through InterPro, users can combine PROSITE classifications with those provided by other InterPro consortium members.
Since our last report in the NAR database issue (6), PROSITE has increased the number of available signatures to 1308 patterns and 1039 profiles, which are associated with 1041 ProRules and 1650 documentation entries.
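
Because PROSITE patterns are essentially regular expressions, a pattern can be translated into the regex dialect of a general-purpose language. The following Python sketch illustrates this for a subset of the pattern syntax: elements separated by '-', x for any residue, [..] for allowed residues, {..} for excluded residues, (n) or (n,m) for repetition, and < or > as terminal anchors. The function name prosite_to_regex is ours, chosen for illustration, and is not part of any PROSITE tool; rarer constructs, such as an anchor placed inside square brackets, are not handled.

```python
import re

def prosite_to_regex(pattern):
    """Translate a (simplified) PROSITE-style pattern into a Python regex string."""
    pattern = pattern.strip().rstrip(".")       # patterns may end with a period
    parts = []
    for element in pattern.split("-"):
        anchor_start = element.startswith("<")  # '<' anchors at the N-terminus
        anchor_end = element.endswith(">")      # '>' anchors at the C-terminus
        element = element.strip("<>")
        # Separate the residue specification from an optional (n) or (n,m) suffix.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core in ("x", "X"):
            core = "."                          # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # excluded residues
        # Single letters and [ABC] classes are already valid regex syntax.
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if anchor_start else "") + core + ("$" if anchor_end else ""))
    return "".join(parts)

print(prosite_to_regex("[AC]-x-V-x(4)-{ED}"))
print(prosite_to_regex("H-x(3,5)-H"))
```

The resulting string can be passed to re.search or re.finditer to locate candidate matches in a sequence. Such a regex reproduces only the pattern match itself, not the miniprofile-based validation or ProRule annotation that ScanProsite applies.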

NEW DEVELOPMENTS: SCANPROSITE

The ScanProsite tool (http://prosite.expasy.org/scanprosite/) allows users to search protein sequences against all PROSITE signatures, and to search for matches to defined PROSITE signatures in the UniProtKB and PDB databases (4). To enhance the utility and flexibility of PROSITE searches, we have modified the ScanProsite tool in a number of ways.

Proteome annotation by ScanProsite

The original implementation of ScanProsite allowed users to search only a limited number of sequences against the entire library of PROSITE signatures. We have since relaxed this restriction and now offer users the possibility to upload complete proteome sets in FASTA format to the PROSITE server (subject to a size limitation of 16 Mb, which is sufficient for the majority of proteomes). A unique identifier is assigned to each uploaded set of protein sequences and is returned to the user as a reference for use in subsequent searches. The identifier remains valid for 1 month, allowing users to perform multiple analyses on the same set of sequences, if desired. These analyses are performed on the high-performance cluster of the Vital-IT facility (http://www.vital-it.ch/). It is possible to perform combinatorial scans (see below), and users can perform searches against their own defined sequence patterns. To demonstrate this application, we annotated the complete proteome sequence of the fire ant Solenopsis invicta at the ScanProsite server (7). The Official Gene Set of S. invicta is predicted to encode 16 569 canonical protein sequences. These were uploaded to the ScanProsite server in FASTA format and run against all PROSITE motifs, including both patterns and profiles. The entire process took <30 min. A total of 14 562 hits to 1248 distinct PROSITE signatures were found in 5496 protein sequences, giving total coverage at the protein level of ∼33% for this organism (Table 1). Users wishing to obtain higher coverage may of course combine the classification and annotation from PROSITE with that provided by other annotation tools and pipelines.

Table 1.

Results of the ScanProsite search of the 16 569 predicted Solenopsis invicta proteins against the complete set of PROSITE patterns and profiles

	Patterns^a	Profiles
Total number of PROSITE signature matches in all proteins	4903	9664
Number of distinct proteins matching PROSITE signatures	2696	4349
Number of distinct PROSITE signatures matched	626	622
Number of proteins annotated with one or more functional sites	520	1693
Total number of functional sites annotated	744	7022
Number of distinct PROSITE signatures providing annotation for functional sites	74	148
Total number of detected domains annotated with functional sites	606	3397

^a Pattern hits are validated by automatically generated ‘miniprofiles’ that assign a status to pattern matches (8).
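
The coverage figure quoted for S. invicta follows directly from the counts reported in the text and in Table 1. The short Python calculation below reproduces it; the variable names are ours, and the numbers are copied from the article.

```python
# Counts reported for the Solenopsis invicta ScanProsite run.
total_proteins = 16569          # predicted canonical protein sequences
matched_proteins = 5496         # proteins with at least one PROSITE hit
pattern_signatures = 626        # distinct patterns matched (Table 1)
profile_signatures = 622        # distinct profiles matched (Table 1)

# Protein-level coverage: fraction of proteins with at least one signature match.
coverage = matched_proteins / total_proteins
distinct_signatures = pattern_signatures + profile_signatures

print(f"Protein-level coverage: {coverage:.1%}")
print(f"Distinct signatures matched: {distinct_signatures}")
```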

Combinatorial search

In parallel to this work, we have developed and implemented a number of search options that enhance the power and flexibility of ScanProsite. The first of these allows users to search for specific combinations of signatures. This feature may be useful in fine-grained functional inference, allowing users to search a given set of sequences for instances of domains (profiles) that are associated with particular functional residues (patterns) or to search for specific combinations of domains that may confer particular functions (9,10). PROSITE descriptors are combined using the logical operators ‘and’, ‘or’ and ‘not’, with parentheses used to define the priority in which the operators are applied (Figure 1). Users can also define their own sequence patterns and combine them with existing PROSITE signatures. This may allow the further discrimination of particular domain variants or subfamilies that are not yet covered by existing PROSITE signatures (2).
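
The logic of a combinatorial search can be sketched in a few lines of Python. This is only an illustration of how 'and', 'or' and 'not' combine signature matches, not the ScanProsite implementation. The function name matches_query and the three toy proteins with their hit sets are invented for the example; only the accession numbers are taken from Figure 1. As in ScanProsite, parentheses must be surrounded by spaces.

```python
import re

OPERATORS = {"and", "or", "not", "(", ")"}

def matches_query(query, hits):
    """Return True if the set of signature accessions `hits` satisfies `query`."""
    expression = []
    for token in query.split():
        if token in OPERATORS:
            expression.append(token)
        elif re.fullmatch(r"PS\d{5}", token):
            # Replace each accession by whether the protein matches that signature.
            expression.append(str(token in hits))
        else:
            raise ValueError(f"unexpected token: {token!r}")
    # Only the literals True/False, the three operators and parentheses remain,
    # so the expression can be evaluated without access to any names.
    return bool(eval(" ".join(expression), {"__builtins__": {}}, {}))

# Hypothetical proteins and their (made-up) signature matches.
toy_hits = {
    "protein_A": {"PS50110", "PS50122"},
    "protein_B": {"PS50123"},
    "protein_C": {"PS50110", "PS50122", "PS50123"},
}

query = "PS50110 and ( PS50122 or PS50123 ) and not PS50123"
selected = sorted(name for name, hits in toy_hits.items() if matches_query(query, hits))
print(selected)
```

Validating every token before evaluation keeps the sketch short while preventing arbitrary code from reaching eval; a production tool would more likely use a dedicated expression parser.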

Figure 1.

The use of logical operators in ScanProsite. The PROSITE profiles used are PS50122 (CHEB), PS50123 (CHER) and PS50110 (RESPONSE_REGULATORY). The matched architectures correspond to the following UniProtKB/Swiss-Prot entries: Q02998 (YH19_RHOCA), A1SMR4 (CHEB_NOCSJ), P31758 (FRZG_MYXXA), P31759 (FRZF_MYXXA) and A1VZQ6 (CHER_CAMJJ). Single-asterisk symbol denotes that ‘not’ has to be used with another operator (‘and’ or ‘or’). Double-asterisk symbol denotes that parentheses have to be preceded and followed by a space.

Targeted search with filters

The results of PROSITE searches on UniProtKB can be further restricted using a variety of filtering options. Users can limit the results to only those proteins that are derived from one or more taxa, according to the taxonomic classification of UniProtKB (http://www.uniprot.org/taxonomy/), at any desired level in the taxonomy. Taxonomic information is found in the ‘OC’ and ‘OS’ line(s) of the UniProtKB flat file. Users can also limit their results to only those proteins having a particular name (be it the recommended name or an alternative name), which can be a general class of protein such as ‘protease’. Such nomenclature information is found in the ‘DE’ line(s) of the UniProtKB flat file. Users can also limit their results to only those proteins that are expressed in one of 56 adult tissues, using data from the Bgee resource (http://bgee.unil.ch/bgee/bgee), a database of gene expression and evolution (11). This particular filter is applicable to proteins of Homo sapiens, Mus musculus, Xenopus laevis and Danio rerio. Finally, users can also limit their results to only those proteins having a certain size or falling within a certain size range. Together, these filters allow users to combine prior biological knowledge with specific sequence features (or combinations of them) in order to perform very powerful targeted searches.
We illustrate a typical application of these search options using the alkylglycerol mono-oxygenase of M. musculus as an example (12). Prior to the identification of the sequence encoding this enzyme, a limited amount of information was available regarding its biological and biochemical characteristics. We used this information to identify a number of possible candidate sequences for experimental validation. It was known that this enzyme, along with nitric oxide synthase and aromatic amino acid hydroxylase, required tetrahydrobiopterin and iron to be active. The enzyme was also known to have similar iron-binding characteristics to aromatic amino acid hydroxylase, suggesting a role for histidine residues in this process (13). The protein was known to be present in brain and liver, and its size was estimated at between 400 and 650 amino acids (14). We used this information to perform a restricted search of the murine proteome using a degenerate pattern corresponding to the two iron-coordinating histidines of aromatic amino acid hydroxylase (H–X(3,5)–H). Applying the filters reduced the list of UniProtKB protein entries matching this motif from over 1000 to only 31, corresponding to 22 genes. Following manual inspection of these sequences, we excluded a number of previously characterized proteins that were unlikely to be responsible for the specified activity, including transcription factors, transporters and enzymes. The remaining set of 16 proteins constituted a reasonable number of candidate sequences for experimental investigation. One of these was found to possess alkylglycerol mono-oxygenase activity, and this is described in UniProtKB/Swiss-Prot entry Q8BS35.
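
The way these filters narrow a candidate list can be sketched as follows. The Python example combines the three criteria used above: the degenerate H-X(3,5)-H motif written as a regular expression, a length between 400 and 650 residues, and expression in both brain and liver. The Protein record, the is_candidate function and the four toy entries are hypothetical and serve only to show the logic; the actual search was run through ScanProsite against UniProtKB with Bgee expression data.

```python
import re
from dataclasses import dataclass

@dataclass
class Protein:
    accession: str        # made-up identifiers, not UniProtKB accessions
    sequence: str
    tissues: frozenset    # tissues in which expression was observed

# H, then three to five residues of any kind, then H.
MOTIF = re.compile(r"H.{3,5}H")

def is_candidate(protein, min_len=400, max_len=650,
                 required_tissues=frozenset({"brain", "liver"})):
    """Apply the size, tissue and motif filters to one protein record."""
    has_size = min_len <= len(protein.sequence) <= max_len
    has_tissues = required_tissues <= protein.tissues
    has_motif = MOTIF.search(protein.sequence) is not None
    return has_size and has_tissues and has_motif

# Synthetic sequences built only to exercise each filter.
with_motif = "M" + "A" * 200 + "HAAAH" + "A" * 250      # 456 residues, motif present
toy_proteome = [
    Protein("toy1", with_motif, frozenset({"brain", "liver", "kidney"})),
    Protein("toy2", "MHAAAH", frozenset({"brain", "liver"})),        # too short
    Protein("toy3", "M" + "A" * 500, frozenset({"brain", "liver"})), # no motif
    Protein("toy4", with_motif, frozenset({"brain"})),               # not in liver
]

candidates = [p.accession for p in toy_proteome if is_candidate(p)]
print(candidates)
```

Each filter is cheap on its own, and it is their conjunction that makes the search selective; as in the example above, the remaining candidates still call for manual inspection.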

CONCLUSION

PROSITE provides a resource for the identification and annotation of conserved regions in protein sequences, covering protein families, domains and motifs. We will continue to develop new PROSITE profiles and ProRules as new proteins, domains and functions are characterized. We describe here improvements to ScanProsite that allow users to apply PROSITE to whole-proteome annotation, as well as a number of options that allow very fine-grained searches incorporating prior biological knowledge. Our current software developments are aimed at further enhancing the speed of ScanProsite for improved proteome annotation. To achieve this, the original code of pfsearch is being rewritten and optimized to use modern multi-core processors efficiently, and a heuristic is being implemented for further speed enhancements. This work will be described in a forthcoming publication (L. Cerutti and T. Schuepbach, personal communication).

FUNDING

An FNS project [315230-116864]. PROSITE activities are also supported by the Swiss Federal Government through the Federal Office of Education and Science. Funding for open access charge: Swiss Federal Office of Education and Science. Conflict of interest statement. None declared.

13 in total

1. PROSITE: a documented database using patterns and profiles as motif descriptors.

Authors: Christian J A Sigrist; Lorenzo Cerutti; Nicolas Hulo; Alexandre Gattiker; Laurent Falquet; Marco Pagni; Amos Bairoch; Philipp Bucher
Journal: Brief Bioinform Date: 2002-09 Impact factor: 11.622

2. Identification of the gene encoding alkylglycerol monooxygenase defines a third class of tetrahydrobiopterin-dependent enzymes.

Authors: Katrin Watschinger; Markus A Keller; Georg Golderer; Martin Hermann; Manuel Maglione; Bettina Sarg; Herbert H Lindner; Albin Hermetter; Gabriele Werner-Felmayer; Robert Konrat; Nicolas Hulo; Ernst R Werner
Journal: Proc Natl Acad Sci U S A Date: 2010-07-19 Impact factor: 11.205

3. ProRule: a new database containing functional and structural information on PROSITE profiles.

Authors: Christian J A Sigrist; Edouard De Castro; Petra S Langendijk-Genevaux; Virginie Le Saux; Amos Bairoch; Nicolas Hulo
Journal: Bioinformatics Date: 2005-08-09 Impact factor: 6.937

4. Nature of the protein universe.

Authors: Michael Levitt
Journal: Proc Natl Acad Sci U S A Date: 2009-06-18 Impact factor: 11.205

Review 5. The evolution of protein domain families.

Authors: Marija Buljan; Alex Bateman
Journal: Biochem Soc Trans Date: 2009-08 Impact factor: 5.407

6. Partial characterization of the alkylglycerol cleavage enzyme system of rat liver.

Authors: J F Soodsma; C Piantadosi; F Snyder
Journal: J Biol Chem Date: 1972-06-25 Impact factor: 5.157

7. PROSITE, a protein domain database for functional characterization and annotation.

Authors: Christian J A Sigrist; Lorenzo Cerutti; Edouard de Castro; Petra S Langendijk-Genevaux; Virginie Bulliard; Amos Bairoch; Nicolas Hulo
Journal: Nucleic Acids Res Date: 2009-10-25 Impact factor: 16.971

8. Glyceryl ether monooxygenase resembles aromatic amino acid hydroxylases in metal ion and tetrahydrobiopterin dependence.

Authors: Katrin Watschinger; Markus A Keller; Albin Hermetter; Georg Golderer; Gabriele Werner-Felmayer; Ernst R Werner
Journal: Biol Chem Date: 2009-01 Impact factor: 3.915

9. ScanProsite: detection of PROSITE signature matches and ProRule-associated functional and structural residues in proteins.

Authors: Edouard de Castro; Christian J A Sigrist; Alexandre Gattiker; Virginie Bulliard; Petra S Langendijk-Genevaux; Elisabeth Gasteiger; Amos Bairoch; Nicolas Hulo
Journal: Nucleic Acids Res Date: 2006-07-01 Impact factor: 16.971

10. The 20 years of PROSITE.

Authors: Nicolas Hulo; Amos Bairoch; Virginie Bulliard; Lorenzo Cerutti; Béatrice A Cuche; Edouard de Castro; Corinne Lachaize; Petra S Langendijk-Genevaux; Christian J A Sigrist
Journal: Nucleic Acids Res Date: 2007-11-14 Impact factor: 16.971
