
PASS2: an automated database of protein alignments organised as structural superfamilies.

Anirban Bhaduri, Ganesan Pugalenthi, Ramanathan Sowdhamini.

Abstract

BACKGROUND: The functional selection and three-dimensional structural constraints of proteins in nature often relate to the retention of significant sequence similarity between proteins of similar fold and function despite poor sequence identity. Organization of structure-based sequence alignments for distantly related proteins provides a map of the conserved and critical regions of the protein universe that is useful for the analysis of folding principles, for the evolutionary unification of protein families and for maximizing the information return from experimental structure determination. The Protein Alignment organised as Structural Superfamily (PASS2) database represents continuously updated structural alignments for evolutionarily related, sequentially distant proteins.

DESCRIPTION: An automated and updated version of PASS2 is in direct correspondence with SCOP 1.63 and consists of sequences having identity below 40% among themselves. Protein domains have been grouped into 628 multi-member superfamilies and 566 single-member superfamilies. Structure-based sequence alignments for the superfamilies have been obtained using COMPARER, while initial equivalences have been derived from a preliminary superposition using LSQMAN or STAMP 4.0. The final sequence alignments have been annotated for structural features using JOY4.0. The database is supplemented with sequence relatives belonging to different genomes, conserved spatially interacting and structural motifs, probabilistic hidden Markov models of superfamilies based on the alignments, and useful links to other databases. Probabilistic models and sensitive position-specific profiles obtained from reliable superfamily alignments aid annotation of remote homologues and are useful tools in structural and functional genomics. PASS2 presents the phylogeny of its members based on both sequence and structural dissimilarities. Clustering of members allows us to understand diversification of the family members. The search engine has been improved for simpler browsing of the database.

CONCLUSIONS: The database resolves alignments among the structural domains consisting of an evolutionarily diverged set of sequences. The availability of reliable sequence alignments of distantly related proteins despite poor sequence identity, together with single-member superfamilies, permits better sampling of structures in libraries for fold recognition of new sequences and for the understanding of protein structure-function relationships of individual superfamilies. PASS2 is accessible at http://www.ncbs.res.in/~faculty/mini/campass/pass2.html

Substances:
Cytochromes
Proteins

Year: 2004 PMID: 15059245 PMCID: PMC407847 DOI: 10.1186/1471-2105-5-35

Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169

Background

Classification of proteins into families is performed on the basis of the similarity of sequences to the family members [1,2]. Importantly, however, detectable global sequence similarity in a protein family is not required for retention of the three-dimensional fold, and only a very small number of conserved functional residues are required for biochemical activity amongst proteins belonging to a superfamily [3]. Establishing evolutionary relationships between superfamily members that have similar structure and function but have diverged in sequence is challenging. Over 49,000 domains deposited in the Protein Data Bank (PDB) [4] are organized in different databases by hierarchical classification schemes or in terms of structural neighbourhood distances [5-7]. SCOP (1.63 release) records 49,497 protein domains, grouped into merely 765 folds, suggesting a strong structural convergence of proteins. Homologous families can be easily grouped by simple sequence searches, whereas superfamily members, which adopt the same fold and perform similar biological roles [8-13], can often be identified only by sensitive fold prediction algorithms followed by a careful alignment of sequences. Availability of reliable sequence alignments for distantly related proteins despite poor sequence identity permits better sampling of structures in libraries for fold recognition of new sequences and for the understanding of protein structure-function relationships of individual superfamilies. In addition, the construction of three-dimensional models using homology modelling techniques is usually reliable where the sequence identity between the query and the structural homologues (templates) is 30% or above. Analyses of structural and sequence differences amongst known superfamily members can therefore provide useful guidelines for modelling distantly related proteins. The PASS2 database [14,15] presents alignments of sequentially distant proteins related at the superfamily level. We report an automated, updated version of the superfamily alignment database that is in direct correspondence with the SCOP (1.63) database.

Construction and content

The present version of PASS2 considers domains as assigned in SCOP 1.63 [6]. Domains within a superfamily that are no more than 40% identical with each other have been considered for curating the database. The 40% cut-off in percentage sequence identity, as compared to the 25% identity level used in the previous version of PASS2, was chosen to reduce the number of single-member superfamilies. The resulting 4,001 protein domains were assigned to 1,194 superfamilies spanning the seven classes of proteins and were chosen for structure-based sequence alignments.
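The 40% identity filter can be illustrated with a minimal sketch. It is not the procedure used to build PASS2, which the text does not describe. It assumes the domain sequences of one superfamily are already aligned (equal length, gaps as "-"), defines identity over columns where both sequences have a residue, and uses a simple greedy pass. The function names `percent_identity` and `select_nonredundant` are hypothetical.

```python
from typing import Dict, List


def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def select_nonredundant(domains: Dict[str, str], cutoff: float = 40.0) -> List[str]:
    """Greedy filter (an assumption): keep a domain only if it is no more than
    `cutoff` percent identical to every domain already kept."""
    kept: List[str] = []
    for name in sorted(domains):  # fixed order so the result is reproducible
        if all(percent_identity(domains[name], domains[k]) <= cutoff for k in kept):
            kept.append(name)
    return kept


# Toy aligned sequences, invented for illustration only.
toy = {"d1": "ACDEFGHIKL", "d2": "ACDEFGHIKV", "d3": "MNPQRSTVWY"}
print(select_nonredundant(toy))
```

In this toy input, d2 is 90% identical to d1 and is dropped, while d3 shares no residues with d1 and is kept. A greedy pass depends on the order in which domains are visited, so a real curation step may choose representatives differently.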

Curation of alignments

Structural domains, obtained by consulting SCOP [6] definitions, have been grouped at the superfamily level and superposed by rigid-body superposition (Figure 1). An initial superposition of all the structural domains belonging to each non-redundant superfamily was performed using LSQMAN [16] or STAMP 4.0 [17]: LSQMAN was used for superposing two-member superfamilies, while STAMP 4.0 was used for multi-member superfamilies. From the coarse alignment, equivalent regions were identified using JOY [18]. COMPARER [19] was employed to derive a refined alignment and superposition for the structures. Superposition was achieved by the choice of 'initial equivalences' that served as seeds for pairwise rigid-body superposition using PMNFC, a modified form of MNYFIT [20] (Figure 2). The final alignment was presented using the three-dimensional structural features of JOY [18] (Figure 3).
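Two small pieces of the workflow above can be sketched in code: the choice of the initial superposition tool by superfamily size, and the root-mean-square deviation (RMSD) commonly used to judge a rigid-body superposition. This is an illustrative sketch, not the PASS2 implementation. It does not call LSQMAN or STAMP, it computes RMSD on coordinates that are assumed to be already superposed and paired residue by residue, and the function names are hypothetical. Returning `None` for single-member superfamilies is also an assumption, since they have nothing to superpose.

```python
import math
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def choose_initial_tool(n_members: int) -> Optional[str]:
    """Tool for the initial superposition, following the rule stated in the text."""
    if n_members < 2:
        return None  # assumption: single-member superfamilies need no superposition
    if n_members == 2:
        return "LSQMAN"
    return "STAMP 4.0"


def rmsd(coords_a: Sequence[Point], coords_b: Sequence[Point]) -> float:
    """RMSD between paired, already-superposed coordinates (e.g. C-alpha atoms)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy coordinates, invented for illustration: every pair is 1 unit apart.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(choose_initial_tool(6), rmsd(a, b))
```

Finding the rotation and translation that minimise this RMSD is the job of the superposition programs themselves and is deliberately left out here.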

Figure 1

Flowchart representation of the steps involved in the curation of PASS2 database. Listed are useful tools and additional derived information that may be obtained from PASS2.

Figure 2

Superposed structures of the cytochrome superfamily representatives: The cytochrome superfamily has six representative members in PASS2 (1a7va-, 1bbha-, 1cpq--, 1e85a-, 256ba-, 2ccya-) which have been superposed as explained (see Curation of Alignments section). The figure has been created using MOLSCRIPT [32].

Figure 3

Representative structure-based sequence alignment for the cytochrome superfamily. The six members have been aligned and represented incorporating the three-dimensional features of JOY [18].

Utility and discussion

Assigning new structural entries to pre-existing superfamilies

Improved methods of protein engineering, crystallography and NMR spectroscopy have led to a surge of new three-dimensional protein structures deposited in the Protein Data Bank. PASS2 allows classification of three-dimensional domains into their respective superfamilies based on sequence and structural properties. The sequence of the uploaded structure is compared to the hidden Markov models of PASS2 and assigned to candidate superfamilies on the basis of a liberal expectation value (E = 1.0). Representative structures of the putative superfamilies are then superposed with the query using LSQMAN [16], thus associating the query with a particular superfamily. Alternatively, the user can superpose an uploaded structure onto specific superfamilies.
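The two-stage assignment described above can be sketched as follows. This is a hedged illustration rather than the server's code. It assumes the HMM hits and the LSQMAN RMSD values have already been parsed into Python objects, it assumes the candidate with the lowest RMSD is chosen, and the names `HmmHit`, `assign_superfamily` and the superfamily labels are hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional


class HmmHit(NamedTuple):
    superfamily: str
    e_value: float


def assign_superfamily(
    hits: List[HmmHit],
    rmsd_by_superfamily: Dict[str, float],
    e_cutoff: float = 1.0,
) -> Optional[str]:
    """Stage 1: keep superfamilies whose HMM hit passes the liberal E-value cutoff.
    Stage 2 (assumption): among those, return the one with the lowest RMSD."""
    candidates = [
        h.superfamily
        for h in hits
        if h.e_value <= e_cutoff and h.superfamily in rmsd_by_superfamily
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda s: rmsd_by_superfamily[s])


# Invented example values, not taken from PASS2.
hits = [HmmHit("sf_A", 0.5), HmmHit("sf_B", 3.0), HmmHit("sf_C", 0.9)]
rmsds = {"sf_A": 2.4, "sf_B": 0.8, "sf_C": 1.6}
print(assign_superfamily(hits, rmsds))
```

Here sf_B has the best RMSD but fails the E-value filter, so the sequence stage acts as a gate and the structural stage only ranks the candidates that remain.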

Predicting superfamilies and alignment for sequences

Links have been provided to popular sequence search methods such as PSI-BLAST [21] and PHI-BLAST [22], which may be employed to associate unannotated sequences with PASS2 superfamilies. Hmmpfam [23], a method that matches a sequence against probabilistic profiles, can also be used for similar assignments. Sequence alignments of a query sequence with superfamily members can be obtained using MALIGN [24]. Three-dimensional features can also be attributed to the sequence alignment using JOY [18].

Hidden Markov models for PASS2

During searches for sequence homologues and sequence assignment, profile-based methods perform better than those that use pairwise comparisons [25]. Family profiles based on hidden Markov models (HMMs) are popular probabilistic models applied to sequence annotation and searches [26,27]. Structure-based sequence alignments of the respective superfamilies in PASS2 provide a reliable basis for building HMMs. We provide HMMs, built using the HMM suite [23], for superfamily alignments corresponding to the latest version of PASS2. The performance of these HMMs has been compared with that of models built using their structural homologues present in the PDB [28]. Searches for homologues have been performed on the non-redundant sequence database using both sets of models. Higher coverage has been obtained (Table 1) for superfamilies using PASS2 HMMs, suggesting their value in sensitive sequence searches. HMMs for both the structure-based sequence alignments and the sequence-enriched superfamily alignments can be downloaded from the World Wide Web.

Table 1

Comparison of the number of hits obtained in HMMSearch using models derived from regular multiple sequence alignments and structure-based sequence alignments.

Superfamily name	SCOP code	Hits obtained from PASS2 HMMs	Hits obtained from superfamily HMMs
Superoxide dismutase	46609	152	137
Anticodon-binding domain of class I aminoacyl-tRNA synthetases	47323	220	182
Cyclophilin (peptidylprolyl isomerase)	50891	112	98
Hemopexin-like domain	50923	103	73
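
The gain reported in Table 1 can be made explicit with a short calculation. The sketch below only restates the four rows of the table and computes the absolute and relative increase in hits for the PASS2 HMMs. The function name `relative_gain` is hypothetical, and the percentages are derived here rather than reported in the original text.

```python
from typing import List, Tuple

# (superfamily name, SCOP code, hits from PASS2 HMMs, hits from superfamily HMMs)
TABLE_1: List[Tuple[str, int, int, int]] = [
    ("Superoxide dismutase", 46609, 152, 137),
    ("Anticodon-binding domain of class I aminoacyl-tRNA synthetases", 47323, 220, 182),
    ("Cyclophilin (peptidylprolyl isomerase)", 50891, 112, 98),
    ("Hemopexin-like domain", 50923, 103, 73),
]


def relative_gain(pass2_hits: int, other_hits: int) -> float:
    """Percentage increase in hits of the PASS2 HMM over the comparison HMM."""
    return 100.0 * (pass2_hits - other_hits) / other_hits


for name, code, pass2_hits, other_hits in TABLE_1:
    extra = pass2_hits - other_hits
    gain = relative_gain(pass2_hits, other_hits)
    print(f"{code}  +{extra} hits  ({gain:.1f}% more)  {name}")
```

For these four superfamilies the PASS2 HMMs return between 14 and 38 additional hits, a relative increase of roughly 11% to 41%. Hit counts alone do not distinguish true from false positives, so these figures describe coverage only.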

Superfamily members in the genome database

PASS2 has several new features to associate the structure-based sequences with their homologues in various genome databases. Sequence homologues of the superfamilies have been searched for in the non-redundant sequence database using PSI-BLAST [21] and Hmmsearch [23]. For the PSI-BLAST searches, individual members of each superfamily were queried against the non-redundant sequence database. The expectation value was set to 0.001 with 20 iterations. Hidden Markov models for every superfamily were built using structural alignments (as explained above). These models were searched against the non-redundant database to enrich the sequence members using the Hmmsearch program of the HMM suite, applying an E-value threshold of 0.1. A third approach has been to employ interacting motifs, identified for superfamilies, as constraints in PHI-BLAST searches against the non-redundant database using an E-value of 1.0, as explained elsewhere [29]. Hits obtained by the three approaches that belong to the genomes were aligned using CLUSTALW [30] and presented along with the structural representatives of the superfamilies. The top 10 hits displayed on the web are aligned with PASS2 members. The entire set of hits corresponding to genomes can also be downloaded.
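The pooling of hits from the three search approaches can be sketched as below. This is an illustration under stated assumptions, not the PASS2 pipeline. It assumes the hits have already been parsed into (sequence identifier, E-value) pairs, applies the per-method E-value thresholds quoted above, and keeps the lowest E-value per sequence. Ranking the pooled hits by E-value to pick the top 10 is an assumption, since the text does not state the ranking criterion. The names `THRESHOLDS`, `merge_hits` and `top_hits` are hypothetical.

```python
from typing import Dict, List, Tuple

# Per-method E-value thresholds as quoted in the text.
THRESHOLDS: Dict[str, float] = {"psiblast": 0.001, "hmmsearch": 0.1, "phiblast": 1.0}


def merge_hits(hits_by_method: Dict[str, List[Tuple[str, float]]]) -> Dict[str, float]:
    """Pool hits that pass their method's threshold; keep the best E-value per sequence."""
    best: Dict[str, float] = {}
    for method, hits in hits_by_method.items():
        cutoff = THRESHOLDS[method]
        for seq_id, e_value in hits:
            if e_value <= cutoff and (seq_id not in best or e_value < best[seq_id]):
                best[seq_id] = e_value
    return best


def top_hits(best: Dict[str, float], n: int = 10) -> List[str]:
    """Rank by E-value (assumption), breaking ties by identifier."""
    return sorted(best, key=lambda s: (best[s], s))[:n]


# Invented identifiers and E-values, for illustration only.
example = {
    "psiblast": [("seqA", 1e-5), ("seqB", 0.01)],
    "hmmsearch": [("seqB", 0.05), ("seqC", 0.5)],
    "phiblast": [("seqC", 0.8), ("seqA", 0.9)],
}
print(top_hits(merge_hits(example)))
```

E-values from different programs are not strictly comparable, so using them as a single ranking key is a simplification. The ranked list would then be passed to a multiple alignment program such as CLUSTALW together with the structural representatives.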

Information about superfamily members

A structure-based sequence alignment of the query with the appropriate superfamily can be obtained. Superposed coordinates for the query with the best-ranking superfamily (based on the RMSD value) are also provided. Motifs represent invariant regions of the superfamily and are helpful in protein design, engineering and folding studies. Spatially conserved interacting motifs are identified as described elsewhere [29] for each superfamily and are listed in the current version of the database along with pseudoenergies for their spatial interactions (Bhaduri et al., in press). Corresponding links to the structural motifs of superfamily (SMoS) database [31] can also be accessed. Phylogenetic analysis aids in the understanding of the diversity among the members. Diversification of structural members may be studied in terms of the dissimilarity of structure or the divergence of the sequences. The database has been linked to other useful protein databases, as in the previous version of PASS2 [15].

PASS2 and its applications

PASS2 is a compendium of structure-based sequence alignments of distantly related proteins grouped at the superfamily level in direct correspondence with SCOP definitions. Furthermore, PASS2 acts as a 'junction' point to obtain links from representative superfamily members to genome, sequence and structural databases. Phylogenies of superfamily members provide a crude but quantitative estimate of evolutionary relationships among the members. Motifs describe the invariant regions of proteins and act as descriptors for the superfamily. HMMs can be useful in identifying more members. Availability of such alignment databases over the World Wide Web facilitates the study and design of experiments on specific superfamilies. They also enable systematic survey and analysis of various structural properties for performing fold predictions. The database may be accessed and downloaded across the World Wide Web.

Conclusions

Associating different proteins with structurally similar and evolutionarily related proteins enhances our functional understanding of protein superfamilies. The multiple alignments of distantly related representatives are particularly informative and often reveal a signature of invariantly conserved residues. Access to sequence alignments of distantly related proteins over the World Wide Web offers the possibility to study and design experiments on specific superfamilies. Such alignments also permit systematic survey and analysis of various structural properties and support fold prediction.

Availability of PASS2 database

PASS2 is accessible at

Authors' contributions

AB and GP have contributed equally to the curation of the database. RS has supervised the study and provided input both in the design of the study and drafting of the final manuscript.

29 in total

1. Conserved key amino acid positions (CKAAPs) derived from the analysis of common substructures in proteins.

Authors: B V Reddy; W W Li; I N Shindyalov; P E Bourne
Journal: Proteins Date: 2001-02-01

2. SMoS: a database of structural motifs of protein superfamilies.

Authors: Saikat Chakrabarti; K Venkatramanan; R Sowdhamini
Journal: Protein Eng Date: 2003-11

3. Multiple sequence alignment with the Clustal series of programs.

Authors: Ramu Chenna; Hideaki Sugawara; Tadashi Koike; Rodrigo Lopez; Toby J Gibson; Desmond G Higgins; Julie D Thompson
Journal: Nucleic Acids Res Date: 2003-07-01 Impact factor: 16.971

4. Conserved spatially interacting motifs of protein superfamilies: application to fold recognition and function annotation of genome data.

Authors: Anirban Bhaduri; R Ravishankar; R Sowdhamini
Journal: Proteins Date: 2004-03-01

5. PASS2: a semi-automated database of protein alignments organised as structural superfamilies.

Authors: V Mallika; Anirban Bhaduri; R Sowdhamini
Journal: Nucleic Acids Res Date: 2002-01-01 Impact factor: 16.971

Review 6. Profile hidden Markov models.

Authors: S R Eddy
Journal: Bioinformatics Date: 1998 Impact factor: 6.937

7. The Protein Data Bank: a computer-based archival file for macromolecular structures.

Authors: F C Bernstein; T F Koetzle; G J Williams; E F Meyer; M D Brice; J R Rodgers; O Kennard; T Shimanouchi; M Tasumi
Journal: J Mol Biol Date: 1977-05-25 Impact factor: 5.469

8. Chemical and biological evolution of nucleotide-binding protein.

Authors: M G Rossmann; D Moras; K W Olsen
Journal: Nature Date: 1974-07-19 Impact factor: 49.962

9. How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins.

Authors: A M Lesk; C Chothia
Journal: J Mol Biol Date: 1980-01-25 Impact factor: 5.469

10. Insulin-like growth factor: a model for tertiary structure accounting for immunoreactivity and receptor binding.

Authors: T L Blundell; S Bedarkar; E Rinderknecht; R E Humbel
Journal: Proc Natl Acad Sci U S A Date: 1978-01 Impact factor: 11.205
