Literature DB >> 19420063

webPRC: the Profile Comparer for alignment-based searching of public domain databases.

Abstract

Profile-profile methods are well suited to detect remote evolutionary relationships between protein families. Profile Comparer (PRC) is an existing stand-alone program for scoring and aligning hidden Markov models (HMMs), which are based on multiple sequence alignments. Since PRC compares profile HMMs instead of sequences, it can be used to find distant homologues. For this purpose, PRC is used by, for example, the CATH and Pfam-domain databases. As PRC is a profile comparer, it only reports profile HMM alignments and does not produce multiple sequence alignments. We have developed webPRC server, which makes it straightforward to search for distant homologues or similar alignments in a number of domain databases. In addition, it provides the results both as multiple sequence alignments and aligned HMMs. Furthermore, the user can view the domain annotation, evaluate the PRC hits with the Jalview multiple alignment editor and generate logos from the aligned HMMs or the aligned multiple alignments. Thus, this server assists in detecting distant homologues with PRC as well as in evaluating and using the results. The webPRC interface is available at http://www.ibi.vu.nl/programs/prcwww/.

Entities: Disease Gene

Mesh：

Year: 2009 PMID： 19420063 PMCID： PMC2703954 DOI： 10.1093/nar/gkp279

Source DB: PubMed Journal: Nucleic Acids Res ISSN： 0305-1048 Impact factor: 16.971

INTRODUCTION

Sequence-alignment techniques are essential in providing predictions of protein function and evolution. The introduction of sequence–profile methods, such as hmmpfam, hmmsearch (1) and PSI-BLAST (2,3), increased the detection of homologous sequences considerably compared to sequence-sequence methods [e.g. (4)], such as BLAST (3). A profile numerically encodes a multiple sequence alignment and its amino acid diversity by counting the amino acids in each column. Profile hidden Markov models (HMMs), or (profile) HMMs, are statistically more advanced than numerical profiles and allow for variable gap penalties (1). Clearly, profiles, based on an alignment, contain more information than a single sequence. Indeed, including distant but true homologues in the alignment, further increases the chance of detecting of similar families (5). We here use the word ‘profiles’ to refer to both numerical profiles and profile HMMs. The last decade the sequence–profile methods have been advanced to profile–profile methods. Profile–profile methods provide a more sensitive (6–9) way to find distant homologies between proteins. Using profiles for both query and subject (domain database), has been shown to lead to more sensitive detection of evolutionary remote relationships [e.g. (9,10)]. Different profile–profile methods have been developed, including prof_sim (9) and FFAS (11). We here focus on three widely used state-of-the-art profile–profile programs: Profile Comparer [first released in 2002 (12)], COMPASS [COmparison of Multiple Protein sequence Alignments with assessment of Statistical Significance (6,13)] and HHsearch (7,14). Profile Comparer [PRC, (12)] is a stand-alone program for scoring and aligning HMM and is routinely used by, for example, the CATH (15) and Pfam (16,17) domain databases. The CATH pipeline uses PRC to detect extremely remote homologues and group them in superfamilies [http://www.cathdb.info/wiki/doku.php?id=about:intro, (15)]. Initially, Pfam used only PRC to detect similar domains (16), but now also uses HHsearch (14) [and SCOOP (18)] to establish Pfam clans (17). In addition, internal links from one Pfam family to another are generated with PRC and SCOOP. In contrast to HHsearch and COMPASS (7,13), PRC did not have a web interface available yet. We therefore have implemented webPRC, a server for searching several public domain databases with additional functionality, including HMM-to-alignment translation, as compared to stand-alone PRC.

METHODS

Database construction

Several major domain databases are provided: Pfam (17), NCBI's Conserved Domain Database (19), KOG (20), TIGRFAMs (21), CATH (22) and SUPERFAMILY (23). We briefly indicate how the profile HMMs and their seed alignments were obtained. Pfam-A: The Pfam-A (16,17) profile HMMs have been rebuilt locally using the seed alignments downloaded from the Pfam FTP site (http://pfam.sanger.ac.uk) and the hmmbuild options provided therein. When building the HMMs the starting alignment, also for CDD/KOG and TIGRFAMs, was re-saved by hmmbuild (HMMER v2.3.2; http://hmmer.janelia.org/). This re-saved alignment includes an ‘RF’ line that indicates which alignment columns are absent from the HMM. This line is used to translate the HMM coordinates of the PRC results back to the alignment coordinates. CDD/KOG: NCBI's Conserved Domain Database [CDD (19)] and KOG (20) HMMs have been built from the seed alignments downloaded via the CDD site (http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml). As there can be multiple identical sequence identifiers in CDD alignments, the sequence identifiers in the re-saved alignments were made unique by prepending a number to the entire identifier for reoccurring identifiers only. TIGRFAMs: The TIGRFAMs (21) HMMs have been rebuilt locally from the seed alignments and the hmmbuild options provided in the TIGRFAMs HMM files. CATH: The CATH (15,22) HMMs have been obtained from the CATH web site (http://www.cathdb.info). These models are not based on Pfam-like seed alignments, but are produced iteratively starting from a single sequence (24). This can result in huge alignments with high gap content (up to about 80 000 sequences, >50 000 columns, or 680 Mb for a single alignment). For this reason, the CATH models are used directly. Their underlying alignments have been processed to include an ‘RF’ line and a maximum of the first 200 sequences are included in the alignment output. SUPERFAMILY: The SUPERFAMILY (23) models were retrieved from http://supfam.org.

User input

The user can provide a single protein sequence or multiple sequence alignment via the paste or upload field. A variety of alignment formats is accepted (ClustalW, FASTA, GCG MSF, Stockholm and SELEX). The user may configure the following search parameters: the domain database, PSI-BLAST options, several PRC options, the number of unique hits to be visualized in the hit graphic, and the use of the hmmbuild ‘–hand’ option. This option can be used to mark regions of the alignment that should be absent from the HMM produced by hmmbuild, which is useful for searching with discontinuous domains. The ‘RF’ annotation line, required for the optional ‘–hand’ option, is supported for the SELEX (#= RF) and Stockholm (#=GC RF) formats. Finally, the user may choose to generate logos from the HMM alignments or from the aligned multiple sequence alignments [with LogoMat-P (25) and Two Sample Logo (26), respectively] to visualize the alignments. Example input and output are provided, including the possibility to regenerate the example output (‘rerun the example’).

Alignment calculation

The webPRC searches run on a 64-CPU computer cluster. The processing scripts are coded in Perl, Bioperl [Bio::Graphics and Bio::SimpleAlign; (27)], PHP and Javascript. PRC is run with the selected domain library and domain descriptions of the hits are parsed from the chosen domain database. Since PRC results are reported in profile HMM space, both the PRC alignment output and the re-saved alignment files, produced by hmmbuild, are processed to provide a mapping of PRC results to the query and hit multiple sequence alignments. Then, these alignments are sliced according to the calculated alignment coordinates and joined in one alignment. The IDs of the hit alignment in this combined alignment file are prepended with ‘Hit:’. In addition, an ‘aligned alignments’ view is constructed which contains the first sequence and the consensus sequence from each alignment. For viewing the alignment interactively, an extended version of Jalview (28) is used that supports regular expressions to parse sequence identifiers for its linkUrl parameters.

Logo generation

The logos are generated with local installations of LogoMat-P (25) and Two Sample Logo (26). LogoMat-P was adapted such that the generated logos correspond exactly to the HMM alignments reported by PRC. Thus, LogoMat-P is not executing a new pair-wise PRC search to find an HMM alignment between the query and the single subject HMM, but now directly uses the alignment produced by the PRC library run against the domain databases.

RESULTS AND DISCUSSION

The webPRC server facilitates the use of PRC for finding domains related to a query alignment. Besides the possibility to run PRC against different domain databases, webPRC offers additional functionality not available with a PRC stand-alone run. After completion of a PRC search, the raw PRC output is reformatted into a BLAST-like report, which includes a domain hit distribution graphic and a hit table (Figure 1). This makes interpreting PRC output as straightforward as reading a BLAST report. The reformatted PRC alignments now include the match, insert, and delete percentages (Figure 2). In addition, several other features aiding the evaluation of the hits are included in the report: hits in the table are linked to the source domain database and include a description from the selected domain database. The alignments section contains links to the optionally produced logos. These logos are graphical representations of the aligned HMMs or the aligned alignments and can help in the evaluation of the found domains. LogoMat-P (25) produces pair-wise HMM logos based on the reported PRC alignment. These HMM logos are related to the HMM logos (29) used to visualize the HMMs of protein families in Pfam (17). In addition, Two Sample Logos are produced. These logos are based on two multiple sequence alignments and show the positions that are significantly different between the alignments (26). Furthermore, the alignments section contains an ‘aligned alignments’ presentation. Specifically, this translation of ‘raw’ PRC results to query and hit alignments facilitates the identification of conserved residues. The combined multiple sequence alignments can be viewed in Jalview (28). The sequence labels in the Jalview applet are linked to several sequence databases, including UniProt and Entrez Protein, to facilitate the retrieval of sequence annotations. Finally, the alignments can be downloaded for additional analyses. For example, Sequence Harmony can be used to predict specificity-determining residues from these alignments (30).

Figure 1.

Figure 2.

An example alignment showing hit number (#1), links, PRC alignment and aligned alignments (truncated). The original PRC HMM alignment is formatted in a BLAST-like style and now includes the counts and percentages of the Match, Insert and Delete states (M–M, M–I, D–∼ pairs, respectively). The aligned alignments view shows the PRC result in multiple sequence alignment space and includes the first sequence of the query and hit alignment as well as their consensus sequences. The alignments are separated by a mid-line that indicates the PRC match states (M) with a ‘+’. Gaps present in the seed alignments are indicated by ‘–’, gaps introduced by PRC by ‘∼’ and positions corresponding to columns missing from the HMM by ‘:’. The entire (aligned) alignments can be viewed with Jalview or downloaded by clicking on ‘View alignment’ or ‘Download’, respectively.

An example of the webPRC domain graphic and hit table section for GGA1_HUMAN run against Pfam (after running PSI-BLAST). The graph can be viewed in HMM or alignment space and the hits are hyperlinked to the alignments. The PRC hit table provides links to the original PRC and PSI-BLAST output and shows a table with annotated hits, including the name and, after clicking on ‘>>’, the description from the domain database. The hits are hyperlinked to the source database and E-values are hyperlinked to the alignments. Co-emission, simple and reverse scores are calculated by PRC [cf. (12)]. The E-value is calculated from the reverse score. An example alignment showing hit number (#1), links, PRC alignment and aligned alignments (truncated). The original PRC HMM alignment is formatted in a BLAST-like style and now includes the counts and percentages of the Match, Insert and Delete states (M–M, M–I, D–∼ pairs, respectively). The aligned alignments view shows the PRC result in multiple sequence alignment space and includes the first sequence of the query and hit alignment as well as their consensus sequences. The alignments are separated by a mid-line that indicates the PRC match states (M) with a ‘+’. Gaps present in the seed alignments are indicated by ‘–’, gaps introduced by PRC by ‘∼’ and positions corresponding to columns missing from the HMM by ‘:’. The entire (aligned) alignments can be viewed with Jalview or downloaded by clicking on ‘View alignment’ or ‘Download’, respectively. The translation from HMM alignments to sequence alignments is provided for most databases. However, the sequence alignments resulting from searches against CATH generally include a large number of gaps (indicated with ‘:’ in the web output). Many alignment columns are indeed absent from their corresponding HMMs due to the high gap content of the seed alignments: for the entire CATH database only 15% of all alignment columns are represented in the HMMs as opposed to 91% for Pfam-A. Figures 1 and 2 illustrate the webPRC output of a search with ADP-ribosylation factor-binding protein GGA1 (UniProt: GGA1_HUMAN) against Pfam and explain the aligned alignments view. A search with the single sequence indeed finds the known domains: VHS, GAT, and GAE (cf. UniProt). PSI-BLAST was run on this sequence to build an alignment (three iterations, E-value 0.0005, NCBI's NR database). Now, not only the VHS, but also the ENTH and ANTH domains are detected, while the GAE domain is not detected anymore. Indeed, the VHS, ENTH and ANTH domains are related, though in general, especially an E-value like that for the ANTH match (0.007) would require further data to state a homologous relationship. In addition to further profile–profile based searching, it is worthwhile to check the Pfam and CDD databases for information on the retrieved hits: CDD contains superfamilies and Pfam groups related families into clans and also provides ‘internal database links’. Pfam and CDD provide information on this VHS/ENTH/ANTH cluster. Hence, webPRC can be used to easily find such clusters and links for any query alignment. E-values can be used to judge the significance of the hits returned by PRC. However, they are accurate only if the library contains more than 1000 profile HMMs (12). The author of PRC indicated that ‘for libraries of sufficient size, E < 0.003 can be taken as indicative of homology and E < 10−5 as a strong match’ (12). For profile–profile comparisons, Pfam uses an E < 0.001 as an indication of a significant match and E-values between 0.1 and 0.001 as an indication of a true relationship (16). We here describe our PRC web interface and refrain from including another PRC validation. We would like to refer the reader to several benchmarking studies that report on the performance of PRC [(8,12,14,18), http://toolkit.tuebingen.mpg.de/hhpred/help_ov]. Reid et al. (24) benchmarked profile–profile and profile-sequence methods, including PRC, COMPASS, HHsearch, and concluded that PRC is the best method for distinguishing homologous from non-homologous domains. Depending on the specific benchmarking study, PRC performs better or worse than HHsearch, but generally better than COMPASS. We encourage prospective webPRC users to have a look at these benchmarking studies as well as the COMPASS (13) and HHsearch web servers (7).

CONCLUSION

The webPRC server provides a web-based front end to PRC, one of the state-of the-art methods for detecting remote homology, to carry out similarity searches against well-established domain databases. Since the input is a single sequence or an alignment, users need not build an HMM themselves. In addition to the domain hit distribution graphic and logo visualizations, webPRC features the translation of the PRC HMM alignments to multiple sequence alignments. This supports evaluation of a hit based on multiple sequence alignments. To this end, the Jalview applet is implemented. Furthermore, the hit, query and combined alignments can be downloaded for additional analyses.

FUNDING

ENFIN, a Network of Excellence funded by the European Commission within its FP6 Programme, under the thematic area ‘Life sciences, genomics and biotechnology for health’, contract number LSHG-CT-2005-518254. Funding for open access charge: ENFIN. Conflict of interest statement. None declared.

30 in total

1. Protein homology detection by HMM-HMM comparison.

Authors: Johannes Söding
Journal: Bioinformatics Date: 2004-11-05 Impact factor: 6.937

2. Visualizing profile-profile alignment: pairwise HMM logos.

Authors: Benjamin Schuster-Böckler; Alex Bateman
Journal: Bioinformatics Date: 2005-04-12 Impact factor: 6.937

3. Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods.

Authors: J Park; K Karplus; C Barrett; R Hughey; D Haussler; T Hubbard; C Chothia
Journal: J Mol Biol Date: 1998-12-11 Impact factor: 5.469

Review 4. Profile hidden Markov models.

Authors: S R Eddy
Journal: Bioinformatics Date: 1998 Impact factor: 6.937

Review 5. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors: S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal: Nucleic Acids Res Date: 1997-09-01 Impact factor: 16.971

6. The Jalview Java alignment editor.

Authors: Michele Clamp; James Cuff; Stephen M Searle; Geoffrey J Barton
Journal: Bioinformatics Date: 2004-01-22 Impact factor: 6.937

7. The HHpred interactive server for protein homology detection and structure prediction.

Authors: Johannes Söding; Andreas Biegert; Andrei N Lupas
Journal: Nucleic Acids Res Date: 2005-07-01 Impact factor: 16.971

8. Pfam: clans, web tools and services.

Authors: Robert D Finn; Jaina Mistry; Benjamin Schuster-Böckler; Sam Griffiths-Jones; Volker Hollich; Timo Lassmann; Simon Moxon; Mhairi Marshall; Ajay Khanna; Richard Durbin; Sean R Eddy; Erik L L Sonnhammer; Alex Bateman
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

9. A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes.

Authors: Eugene V Koonin; Natalie D Fedorova; John D Jackson; Aviva R Jacobs; Dmitri M Krylov; Kira S Makarova; Raja Mazumder; Sergei L Mekhedov; Anastasia N Nikolskaya; B Sridhar Rao; Igor B Rogozin; Sergei Smirnov; Alexander V Sorokin; Alexander V Sverdlov; Sona Vasudevan; Yuri I Wolf; Jodie J Yin; Darren A Natale
Journal: Genome Biol Date: 2004-01-15 Impact factor: 13.583

10. HMM Logos for visualization of protein families.

Authors: Benjamin Schuster-Böckler; Jörg Schultz; Sven Rahmann
Journal: BMC Bioinformatics Date: 2004-01-21 Impact factor: 3.169

8 in total

1. Protein remote homology detection by combining Chou's distance-pair pseudo amino acid composition and principal component analysis.

Authors: Bin Liu; Junjie Chen; Xiaolong Wang
Journal: Mol Genet Genomics Date: 2015-04-21 Impact factor: 3.291

2. Using amino acid physicochemical distance transformation for fast protein remote homology detection.

Authors: Bin Liu; Xiaolong Wang; Qingcai Chen; Qiwen Dong; Xun Lan
Journal: PLoS One Date: 2012-09-28 Impact factor: 3.240

3. FFAS server: novel features and applications.

Authors: Lukasz Jaroszewski; Zhanwen Li; Xiao-hui Cai; Christoph Weber; Adam Godzik
Journal: Nucleic Acids Res Date: 2011-07 Impact factor: 16.971

Review 4. Recovering full-length viral genomes from metagenomes.

Authors: Saskia L Smits; Rogier Bodewes; Aritz Ruiz-González; Wolfgang Baumgärtner; Marion P Koopmans; Albert D M E Osterhaus; Anita C Schürch
Journal: Front Microbiol Date: 2015-10-01 Impact factor: 5.640

5. Powerful sequence similarity search methods and in-depth manual analyses can identify remote homologs in many apparently "orphan" viral proteins.

Authors: Durga B Kuchibhatla; Westley A Sherman; Betty Y W Chung; Shelley Cook; Georg Schneider; Birgit Eisenhaber; David G Karlin
Journal: J Virol Date: 2013-10-23 Impact factor: 5.103

6. Using distances between Top-n-gram and residue pairs for protein remote homology detection.

Authors: Bin Liu; Jinghao Xu; Quan Zou; Ruifeng Xu; Xiaolong Wang; Qingcai Chen
Journal: BMC Bioinformatics Date: 2014-01-24 Impact factor: 3.169

7. Improvement in Protein Domain Identification Is Reached by Breaking Consensus, with the Agreement of Many Profiles and Domain Co-occurrence.

Authors: Juliana Bernardes; Gerson Zaverucha; Catherine Vaquero; Alessandra Carbone
Journal: PLoS Comput Biol Date: 2016-07-29 Impact factor: 4.475

8. CATH: expanding the horizons of structure-based functional annotations for genome sequences.

Authors: Ian Sillitoe; Natalie Dawson; Tony E Lewis; Sayoni Das; Jonathan G Lees; Paul Ashford; Adeyelu Tolulope; Harry M Scholes; Ilya Senatorov; Andra Bujan; Fatima Ceballos Rodriguez-Conde; Benjamin Dowling; Janet Thornton; Christine A Orengo
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971

8 in total