MulPSSM: a database of multiple position-specific scoring matrices of protein domain families.

V S Gowri¹, O Krishnadev, C S Swamy, N Srinivasan.

Abstract

Representation of multiple sequence alignments of protein families in terms of position-specific scoring matrices (PSSMs) is commonly used in the detection of remote homologues. A PSSM is generated with respect to one of the sequences in the multiple sequence alignment, which serves as the reference. We have shown recently that using multiple PSSMs for an alignment, with several sequences in the family used as references, dramatically improves the sensitivity of remote homology detection. MulPSSM contains PSSMs for a large number of sequence and structural families of protein domains, with multiple PSSMs for every family. The approach uses a clustering algorithm to identify the most distinct sequences in a family. Multiple PSSMs are then generated with each of these distinct sequences as the reference. The current release of MulPSSM contains approximately 33,000 and approximately 38,000 PSSMs corresponding to 7868 sequence and 2625 structural families, respectively. An RPS_BLAST interface allows sequence searches against PSSMs of sequence families, structural families or both. An analysis interface allows display and convenient navigation of alignments and domain hits. MulPSSM can be accessed at http://crick.mbu.iisc.ernet.in/~mulpssm.

Year: 2006 PMID: 16381855 PMCID: PMC1347406 DOI: 10.1093/nar/gkj043

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Multiple sequence alignments of protein families are commonly represented as hidden Markov models (HMMs) (1–4) or position-specific scoring matrices (PSSMs) (5). The power of such protein profiles in detecting remote homologues is well known. For example, PSI_BLAST (5) uses the PSSM generated at the end of every round of search as the input to the next round. Such use of dynamic PSSMs is known to be extremely effective in detecting distant relatives of the query in a sequence database. RPS_BLAST and IMPALA use a database of static PSSMs corresponding to homologous proteins and enable rapid matching of the query sequence against the PSSMs in the database (6,7). Programs employing PSSM matching algorithms are generally faster than HMM matching programs and are commonly used in large-scale analyses (8,9).
Recently we have focused on an important and sensitive feature of PSSMs (10): the use of a reference sequence in constructing a PSSM from a multiple sequence alignment. The reference sequence can be any one of the sequences in the alignment. A PSSM integrates two kinds of information at every alignment position: (i) the extent of occurrence of each of the 20 residue types and (ii) the potential for replacing the residue in the reference sequence with each of the 20 residue types. In PSI_BLAST, the reference sequence is the query. For RPS_BLAST searches, the reference sequence for generating a PSSM is chosen arbitrarily from the multiple sequence alignment. Hence, PSSMs generated for the same multiple sequence alignment with different homologues as reference sequences will differ. It has been shown convincingly that the sensitivity of a PSSM depends strongly on the choice of the reference sequence.
During RPS_BLAST searches, the use of multiple PSSMs corresponding to different reference sequences for a given alignment results in markedly better specificity, sensitivity and error rate than the use of a single PSSM per alignment, and even than the use of an HMM (10). In this paper, we report the ready availability of multiple PSSMs for every multiple sequence alignment in large datasets of sequence and structural families. A web interface allows convenient use of RPS_BLAST to search these datasets of multiple PSSMs and to analyse the results.
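To make the dependence on the reference sequence concrete, the following is a minimal Python sketch of a toy PSSM. It is not the PSI_BLAST procedure: it assumes a uniform background, a flat pseudocount and no sequence weighting, and it models only component (i), the residue occurrences. The function name `toy_pssm` and the three-sequence alignment are hypothetical. What it does show is that the matrix keeps one row per residue of the reference, so a different reference gives a different matrix for the same alignment.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_pssm(alignment, ref_index, pseudocount=1.0):
    """Return one dict of log-odds scores per non-gap position of the reference.

    Toy model only: uniform background, flat pseudocount, no sequence weighting.
    """
    reference = alignment[ref_index]
    background = 1.0 / len(AMINO_ACIDS)
    rows = []
    for col, ref_residue in enumerate(reference):
        if ref_residue == "-":
            continue  # columns where the reference has a gap get no row
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in alignment:
            if seq[col] in counts:  # gaps are simply not counted
                counts[seq[col]] += 1.0
        total = sum(counts.values())
        rows.append(
            {aa: math.log2((counts[aa] / total) / background) for aa in AMINO_ACIDS}
        )
    return rows


# Hypothetical aligned sequences of equal length ('-' marks a gap).
alignment = [
    "MKV-LA",
    "MRVGLA",
    "M-VGIA",
]
pssm_a = toy_pssm(alignment, 0)  # reference = first sequence
pssm_b = toy_pssm(alignment, 1)  # reference = second sequence
print(len(pssm_a), len(pssm_b))
```

In a real PSSM the reference also affects the scores through component (ii), which this sketch leaves out.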

DATASETS, CLUSTERING OF SEQUENCES AND GENERATION OF PROFILES

The MulPSSM database consists of PSSMs of protein domain sequence families and structural families. The set of sequence families corresponds to the Protein Families Database of alignments and HMMs (Pfam version 17.0), which consists of 7868 families (11,12). The seed alignments available in PfamA have been used in the present work. The dataset of structural domain families has been obtained from the latest update of the database on phylogeny and alignment of homologous protein structures (PALI release 2.4), corresponding to 2625 families (13,14). The list of protein domain families of known structure in PALI release 2.4 has been obtained from release 1.67 of the structural classification of proteins (SCOP) database (15,16). The multiple sequence alignments in PALI families have been derived from three-dimensional structural superposition of homologues using the structural alignment of multiple proteins (STAMP, version 4.2) program (17). Homologues of as yet unknown structure corresponding to the PALI families have been identified by mining the Universal Protein Resource (UNIPROT) database (18,19) using PSI_BLAST (E and h value cut-offs of 10⁻⁵ and 10⁻⁶, respectively; cut-offs for both query length coverage in the alignment and sequence identity fixed at 30%). Generating multiple PSSMs with two very closely related sequences as references is unlikely to be worthwhile and would unnecessarily add to the computer time needed to search the PSSM database. The choice of reference sequences from a multiple sequence alignment should therefore maximize the sensitivity of searches in the PSSM database while minimizing the computer time. Hence we clustered the sequences in every family of protein domains and identified a set of the most dispersed sequences within each family. Blastclust was used to generate clusters of non-redundant sequences from the set of homologues of the family sequences (5).
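The idea of keeping one reference per cluster of similar sequences can be sketched in Python as follows. This is a stand-in for illustration, not Blastclust itself. It assumes single-linkage clustering on percent identity computed from an existing alignment, and it keeps the first-listed member of each cluster as the reference. The function names and the four toy sequences are hypothetical.

```python
def percent_identity(a, b):
    """Percent identity over columns where both aligned sequences have a residue."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def pick_references(aligned, cutoff):
    """Single-linkage clustering at `cutoff` percent identity.

    `aligned` maps sequence names to aligned sequences of equal length.
    Returns one reference name per cluster (the first-listed member).
    """
    names = list(aligned)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if percent_identity(aligned[a], aligned[b]) >= cutoff:
                parent[find(b)] = find(a)  # merge the two clusters

    references, seen = [], set()
    for name in names:
        root = find(name)
        if root not in seen:
            seen.add(root)
            references.append(name)
    return references


# Hypothetical aligned family members.
family = {
    "seqA": "MKVLAAGT",
    "seqB": "MKVLAAGS",
    "seqC": "MRILSAGT",
    "seqD": "WPDFHCNY",
}
for cutoff in (50, 70, 100):
    print(cutoff, pick_references(family, cutoff))
```

Lowering the cut-off merges more sequences into each cluster, so fewer references (and hence fewer PSSMs) are kept per family.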
To identify the optimal level of clustering, the following exercise was performed. Clustering was carried out at sequence identity cut-offs of 30–100% for 250 randomly picked SCOP families. The sequences in the PALI database were then queried against the multiple PSSM database generated at each cut-off. The numbers of true positives (superfamily- or fold-related sequences) and false positives (class-related or across-class connections) were counted at each cut-off to assess the sensitivity and specificity of the profiles. At a sequence identity cut-off of 70%, the sensitivity of the method was equal to that at the 100% cut-off, whereas the number of profiles decreased from ∼25 000 at the 100% cut-off to 14 000 at the 70% cut-off. At a sequence identity cut-off of 50%, the sensitivity is 75% of the sensitivity at the 100% cut-off. Considering, in addition, the search time, which depends on the number of PSSMs in the database, a cut-off of 50% was chosen as optimal. In our previous work, we have shown that in the multiple PSSM approach the false positive rate is 2% at an E-value cut-off of 10⁻⁵ (10). The false positive rate remains the same after clustering the sequences at a cut-off of 50% sequence identity. For the structural families, the multiple sequence alignments were biased towards the structure-based alignment of the family. In principle, profiles can be generated from sequence-based alignments alone, and this would make no difference to the quality of the profiles if the sequences are not very divergent. For very divergent sequences, structural alignments are more accurate than sequence alignments, so using the structural information leads to more robust profiles in such cases.
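The true/false positive definition used in this assessment can be illustrated with a short Python sketch. It assumes each query and hit carries a SCOP sccs string of the form class.fold.superfamily.family. A hit is counted as a true positive when class and fold both match (this covers superfamily-related pairs too) and as a false positive otherwise. The function names and the example identifiers are illustrative and are not data from the study.

```python
def classify_hit(query_sccs, hit_sccs):
    """Return 'TP' for fold- or superfamily-related pairs, 'FP' otherwise.

    sccs strings are assumed to look like 'c.37.1.8'
    (class.fold.superfamily.family).
    """
    query = query_sccs.split(".")
    hit = hit_sccs.split(".")
    # Same class and same fold; a shared superfamily implies a shared fold.
    return "TP" if query[:2] == hit[:2] else "FP"


def tally(hits):
    """Count true and false positives in a list of (query_sccs, hit_sccs) pairs."""
    counts = {"TP": 0, "FP": 0}
    for query_sccs, hit_sccs in hits:
        counts[classify_hit(query_sccs, hit_sccs)] += 1
    return counts


# Illustrative pairs only.
example_hits = [
    ("c.37.1.8", "c.37.1.1"),  # same superfamily
    ("c.37.1.8", "c.37.2.1"),  # same fold, different superfamily
    ("c.37.1.8", "c.2.1.1"),   # same class only
    ("c.37.1.8", "a.4.5.1"),   # different class
]
print(tally(example_hits))
```

Running such a tally once per clustering cut-off gives the counts needed to compare sensitivity and specificity across cut-offs.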

GENERATION OF PSI_BLAST PROFILES

After the reference sequences in the multiple sequence alignment have been identified using the clustering algorithm, the following procedure is used to generate the PSSMs. The multiple sequence alignment and the reference sequence are given as inputs to a PSI_BLAST run that iterates against a database of the sequences present in the input multiple sequence alignment. In such iterative searches, any hit identified is already present in the multiple sequence alignment fed to the search program. Hence, the profile generated corresponds to the multiple sequence alignment fed to the PSI_BLAST program. The PSSMs output from such PSI_BLAST runs form the MulPSSM database. The current version of the MulPSSM database consists of ∼33 000 and ∼38 000 PSSMs corresponding to 7868 sequence and 2625 structural families, respectively. We believe, based on our earlier analysis (10), that the use of multiple profiles for each family can lead to better annotation of genome sequences than is provided by HMM-based or single-profile-based searches. For example, the protein gi|23508377 from Plasmodium falciparum is annotated as a hypothetical protein in the NCBI genome database. Neither PSI_BLAST searches nor Pfam HMM-based searches detected any known domain in the sequence with a significant E-value. Using the multiple PSSM approach, we detected a relationship between this protein and the Peptidase_M10 family. The E-value is 10⁻⁴, with a sequence identity of 22% over an alignment length of about 80 residues. Such observations generate reasonable hypotheses for exploration in experimental studies.
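The source does not give the exact command lines used. As a hedged illustration only, the Python sketch below builds one command per reference sequence for the `psiblast` program of the current NCBI BLAST+ suite, which may differ from the PSI_BLAST version used for MulPSSM. It uses the BLAST+ options `-in_msa`, `-msa_master_idx` (1-based), `-db`, `-num_iterations` and `-out_pssm`. The file names, the reference indices and the helper name `psiblast_args` are hypothetical. The commands are only constructed, not executed.

```python
def psiblast_args(msa_path, master_index, db_path, out_pssm_path, iterations=3):
    """Build a BLAST+ psiblast argument list for one reference sequence.

    master_index is the 1-based position of the reference in the alignment.
    """
    return [
        "psiblast",
        "-in_msa", msa_path,                   # family alignment used to seed the profile
        "-msa_master_idx", str(master_index),  # which aligned sequence acts as reference
        "-db", db_path,                        # database of the family's own sequences
        "-num_iterations", str(iterations),
        "-out_pssm", out_pssm_path,            # PSSM checkpoint written by psiblast
    ]


# Hypothetical reference positions chosen by clustering.
reference_indices = [1, 4, 9]
commands = [
    psiblast_args("family.aln", idx, "family_db", f"family_ref{idx}.pssm")
    for idx in reference_indices
]
for command in commands:
    print(" ".join(command))
```

Each command could then be run with `subprocess.run(command, check=True)` on a machine where BLAST+ is installed and the database has been built from the family's sequences.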

WEB INTERFACE

On the MulPSSM website, users can select datasets of PSSMs corresponding to sequence families, structural families or both. Although we recommend an E-value cut-off of 10⁻⁴, the choice of E-value is left to the user. The results are shown in pictorial form, using ASCII characters, to indicate the domains that may be present in the query sequence. Links are provided to the alignment corresponding to the PSSM with the best E-value, to the list of all profiles within and across families that hit the query, and to the details of the families (Figure 1).

Figure 1

Typical outputs of a search made on the MulPSSM site. The main window gives the alignment of the families found as hits in a semi-graphical layout. The main window has a link to the details of the multiple hits for a family (circled in blue), which opens in a new window (indicated by a blue arrow). Similarly, the details of a family can be obtained by clicking on the name of the family (circled in red). The detailed view of the alignments and other features of a family hit (e.g. the fraction of PSSMs of a family found as hits) can help in assessing the accuracy of a hit.

Lists of all the sequence and structural families are provided on the MulPSSM site, with alignments and an indication of the reference sequences for each family. The reference sequences used in every family are highlighted, and the user can download all the PSSMs corresponding to a given family for use in an RPS_BLAST search on a local machine. The complete set of PSSMs may be obtained from the authors upon request. This set is expected to be invaluable for rapid and sensitive genome-wide domain assignments.
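For searches run locally against downloaded PSSMs, a per-family summary similar in spirit to the one described for the web interface (best E-value and the fraction of a family's PSSMs found as hits) could be computed as in the Python sketch below. The input layout, a list of (family, PSSM identifier, E-value) tuples, is an assumption and not the RPS_BLAST output format. The family names and numbers are made up for illustration.

```python
from collections import defaultdict


def summarise_hits(hits, pssms_per_family, evalue_cutoff=1e-4):
    """Summarise hits per family.

    hits: iterable of (family, pssm_id, evalue) tuples.
    pssms_per_family: dict mapping family name to its total number of PSSMs.
    Hits with an E-value above the cut-off are ignored.
    """
    hit_pssms = defaultdict(set)
    best = {}
    for family, pssm_id, evalue in hits:
        if evalue > evalue_cutoff:
            continue
        hit_pssms[family].add(pssm_id)
        best[family] = min(evalue, best.get(family, evalue))
    return {
        family: {
            "best_evalue": best[family],
            "fraction_hit": len(ids) / pssms_per_family[family],
        }
        for family, ids in hit_pssms.items()
    }


# Made-up hits for two hypothetical families.
example = [
    ("FamA", "ref1", 1e-4),
    ("FamA", "ref3", 3e-6),
    ("FamA", "ref2", 0.5),  # above the cut-off, ignored
    ("FamB", "ref1", 2.0),  # above the cut-off, ignored
]
summary = summarise_hits(example, {"FamA": 4, "FamB": 5})
print(summary)
```

A family hit by a large fraction of its PSSMs is, as the Figure 1 legend suggests, easier to trust than one supported by a single profile.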

19 in total

1. Multiple protein sequence alignment from tertiary structure comparison: assignment of global and residue confidence levels.

Authors: R B Russell; G J Barton
Journal: Proteins Date: 1992-10

2. Use of multiple profiles corresponding to a sequence alignment enables effective detection of remote homologues.

Authors: B Anand; V S Gowri; N Srinivasan
Journal: Bioinformatics Date: 2005-04-07 Impact factor: 6.937

3. Hidden Markov models for detecting remote protein homologies.

Authors: K Karplus; C Barrett; R Hughey
Journal: Bioinformatics Date: 1998 Impact factor: 6.937

Review 4. Profile hidden Markov models.

Authors: S R Eddy
Journal: Bioinformatics Date: 1998 Impact factor: 6.937

5. Pfam: a comprehensive database of protein domain families based on seed alignments.

Authors: E L Sonnhammer; S R Eddy; R Durbin
Journal: Proteins Date: 1997-07

Review 6. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors: S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal: Nucleic Acids Res Date: 1997-09-01 Impact factor: 16.971

7. SCOP: a structural classification of proteins database for the investigation of sequences and structures.

Authors: A G Murzin; S E Brenner; T Hubbard; C Chothia
Journal: J Mol Biol Date: 1995-04-07 Impact factor: 5.469

8. Hidden Markov models in computational biology. Applications to protein modeling.

Authors: A Krogh; M Brown; I S Mian; K Sjölander; D Haussler
Journal: J Mol Biol Date: 1994-02-04 Impact factor: 5.469

9. The Universal Protein Resource (UniProt).

Authors: Amos Bairoch; Rolf Apweiler; Cathy H Wu; Winona C Barker; Brigitte Boeckmann; Serenella Ferro; Elisabeth Gasteiger; Hongzhan Huang; Rodrigo Lopez; Michele Magrane; Maria J Martin; Darren A Natale; Claire O'Donovan; Nicole Redaschi; Lai-Su L Yeh
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

10. SCOP database in 2004: refinements integrate structure and sequence family data.

Authors: Antonina Andreeva; Dave Howorth; Steven E Brenner; Tim J P Hubbard; Cyrus Chothia; Alexey G Murzin
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971
