
CCTOP: a Consensus Constrained TOPology prediction web server.

László Dobson¹, István Reményi¹, Gábor E Tusnády².

Abstract

The Consensus Constrained TOPology prediction (CCTOP; http://cctop.enzim.ttk.mta.hu) server is a web-based application providing transmembrane topology prediction. In addition to utilizing 10 different state-of-the-art topology prediction methods, the CCTOP server incorporates topology information from existing experimental and computational sources available in the PDBTM, TOPDB and TOPDOM databases using the probabilistic framework of a hidden Markov model. The server provides the option to precede the topology prediction with signal peptide prediction and transmembrane-globular protein discrimination. The initial result can be recalculated by (de)selecting any of the prediction methods or mapped experiments or by adding user-specified constraints. CCTOP showed superior performance to existing approaches. The reliability of each prediction is also calculated, and it correlates with the accuracy of the per-protein topology prediction. The prediction results and the collected experimental information are visualized on the CCTOP home page and can be downloaded in XML format. Programmable access to the CCTOP server is also available, and an example client-side script is provided.

Substances: Membrane Proteins

Year: 2015 PMID: 25943549 PMCID: PMC4489262 DOI: 10.1093/nar/gkv451

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Transmembrane proteins (TMPs) play an important role in living cells, in both unicellular and multicellular organisms, since they take part in energy production, in the transport of material and information, and in cell-cell adhesion. Previous studies estimate that 20–30% of the proteins encoded in the various genomes are TMPs (1–3). Despite their abundance and importance, the number of solved TMP structures is relatively low (4–6) compared to that of globular proteins, because of the challenging experimental conditions that these proteins require for structure determination. However, several indirect experimental techniques provide information about the localization of TM segments, e.g. experiments based on fusion with reporter enzymes or proteins (7–10), post-translational modification (11–14), protease digestion (15), immunolocalization (16–20), chemical modification (21–23), etc. Topology prediction provides fast, low-resolution structural information about TMPs, which can be used as a starting point for laboratory experiments (24) or for modeling their 3D structures (25). Most of the early prediction methods were based on the physicochemical properties of amino acids and were able to determine the topography fairly accurately. Later, these approaches were extended with the ‘positive inside’ rule (26), which made it possible to localize the water-soluble regions of TMPs. Machine learning algorithms raised prediction accuracy further; however, these methods depend on the training set used, and their performance is therefore usually lower on new sequence families. As a consequence, whenever a new TM topology is discovered, these methods require re-training (27). Utilizing experimental evidence (28,29), or combining different algorithms into a consensus predictor, can also improve accuracy by eliminating the errors of individual methods.
There are three consensus approaches: two of them, TOPCONS (30) and MetaTM (31), are currently available, while ConPredII (32) was not available during the preparation of this work. According to its own benchmarking, TOPCONS shows only a 1% improvement over single predictors. The per-protein accuracy of MetaTM was reported to be 86.3%, which is 4% higher than that of the best included individual method. The Consensus Constrained TOPology prediction (CCTOP) method was used to predict the human transmembrane proteome, which is collected in the HTP database (3). To test the accuracies of the various transmembrane topology predictors on human proteins only, a special benchmark set was compiled, and on this set CCTOP proved to be superior in topology prediction accuracy. Here we describe the web server of the CCTOP algorithm, a novel consensus topology predictor for α-helical TMPs based on 10 state-of-the-art topology prediction methods. Moreover, the CCTOP server automatically incorporates information from previously determined structural, experimental and bioinformatics studies collected in the PDBTM, TOPDB and TOPDOM databases, respectively. If there are segments with known topology information for the query protein or for any of its homologs in any of the databases mentioned above, this information is applied as a constraint in the hidden Markov model. Signal peptide prediction and TMP filtering are also available on the server.

MATERIALS AND METHODS

Preparing benchmark set

A benchmark set was created from the TOPDB database by collecting those entries that have known 3D structures in the PDB covering all TMHs of the TOPDB entry. Redundancy was removed at 40% sequence identity using CD-HIT (33), which resulted in 320 sequences. Next, the sequences that were used to train any of the 10 selected methods or CCTOP itself were collected and used as filters: entries in the testing set with 40% or higher sequence identity to any of the training proteins were removed. This procedure reduced the testing set to 170 sequences, which was used to measure the accuracy of the different prediction methods.
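The second filtering step can be sketched as follows. This is a minimal illustration, assuming that pairwise percent identities between candidate and training sequences have already been computed by some alignment tool; the function name `filter_benchmark`, the identifiers and the identity values are hypothetical and are not taken from the CCTOP pipeline.

```python
def filter_benchmark(candidates, training_ids, identity, threshold=40.0):
    """Keep candidates whose identity to every training protein is below threshold.

    identity maps (candidate_id, training_id) -> percent identity; pairs that are
    missing from the mapping are treated as 0 (no detectable similarity).
    """
    kept = []
    for cand in candidates:
        best = max((identity.get((cand, t), 0.0) for t in training_ids), default=0.0)
        # "40% or higher" means the comparison must be strict on the keep side.
        if best < threshold:
            kept.append(cand)
    return kept


# Toy data: identifiers and identities are invented for illustration only.
candidates = ["P1", "P2", "P3"]
training_ids = ["T1", "T2"]
identity = {
    ("P1", "T1"): 35.0,
    ("P2", "T2"): 40.0,
    ("P3", "T1"): 12.5,
    ("P3", "T2"): 62.0,
}
kept = filter_benchmark(candidates, training_ids, identity)
print(kept)
```

In this toy case only `P1` survives, because `P2` sits exactly on the 40% cutoff and `P3` is similar to one training protein.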

The CCTOP method

The CCTOP method has three main steps: removing cleavable parts of the target sequence, TMP filtering and topology prediction. Signal peptide segments are often mistaken for TMHs by transmembrane topology prediction algorithms; therefore, these segments are analyzed first. CCTOP uses SignalP 4.0 (34) to cleave signal peptides; however, this step can be skipped if a homologous protein in the TOPDB database (35,36) provides contradictory evidence. After removing cleavable segments, the next step is to distinguish transmembrane from globular proteins. For this task a simple voting system is applied to the results of the Phobius (37), Scampi-single (38) and TMHMM (2,39) algorithms. If any two of these methods predict transmembrane segment(s), the protein is classified as a TMP. A variety of methods, differing in both training set and underlying algorithm, was taken into account for the consensus topology prediction. Ten methods were selected based on their availability and performance on different benchmark sets: HMMTOP (28,40), MemBrain (41), MEMSAT-SVM (42), Octopus (43), Philius (44), Phobius (37), Pro- and Prodiv-TMHMM (45), Scampi-MSA (38) and TMHMM (2,39). The prediction results of these methods are used as constraints in the same hidden Markov model that is used by HMMTOP, but with different weights. The weights depend on the accuracy of each method, measured on a benchmark set collected for the Human Transmembrane Proteome database (3). To further improve the prediction accuracy for each query, its homologous structures from PDBTM (4–6), experiments on homologous sequences from TOPDB (35,36) and conservatively localized domains and motifs from TOPDOM (46) recognized in the query sequence are collected automatically, and all this information is incorporated into a probabilistic framework provided by a hidden Markov model as described in Bagos et al. (47).
A formalized and more detailed description of the algorithm is available in our earlier paper (3) and on the home page of the CCTOP server.
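The TMP filtering vote described above is simple enough to sketch directly. The following is an illustrative sketch only: it assumes each filtering method has already been run and reduced to a count of predicted transmembrane segments, and the function name `is_transmembrane` and the input dictionary layout are hypothetical rather than part of CCTOP.

```python
def is_transmembrane(tm_segment_counts, min_votes=2):
    """Classify a protein as a TMP if at least min_votes methods predict TM segments.

    tm_segment_counts maps a method name to the number of transmembrane
    segments that method predicted for the query sequence.
    """
    votes = sum(1 for count in tm_segment_counts.values() if count > 0)
    return votes >= min_votes


# Toy inputs: the segment counts are invented for illustration only.
likely_tmp = {"Phobius": 1, "Scampi-single": 0, "TMHMM": 2}
likely_globular = {"Phobius": 0, "Scampi-single": 0, "TMHMM": 1}

print(is_transmembrane(likely_tmp))
print(is_transmembrane(likely_globular))
```

The first protein collects two votes and passes the filter; the second collects only one and would be treated as globular.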

Calculating the reliability of the prediction

To calculate the reliability of the prediction, posterior probabilities from the HMM are summed for each main hidden state type (inside, membrane, loop and outside) at each position of the TMP sequence. The reliability is the average of these sums along the most probable state path determined by the Viterbi algorithm (48). The mathematical details can be found on the manual pages of the CCTOP home page.
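One reading of this definition is sketched below. It assumes that the per-position posteriors have already been summed over the sub-states of each main state type, and that the Viterbi path is given as one main-state label per residue; the single-letter labels (I, M, L, O), the function name `reliability` and the numbers are illustrative assumptions, and the authoritative formula is the one on the CCTOP manual pages.

```python
def reliability(posteriors, path):
    """Average posterior of the main state chosen by the Viterbi path.

    posteriors: one dict per residue, mapping main state -> summed posterior.
    path: one main-state label per residue (the most probable state path).
    """
    if not path or len(posteriors) != len(path):
        raise ValueError("posteriors and path must be non-empty and equally long")
    total = sum(pos[state] for pos, state in zip(posteriors, path))
    return total / len(path)


# Toy three-residue example; probabilities are invented for illustration only.
posteriors = [
    {"I": 0.9, "M": 0.1, "L": 0.0, "O": 0.0},
    {"I": 0.2, "M": 0.8, "L": 0.0, "O": 0.0},
    {"I": 0.0, "M": 0.7, "L": 0.0, "O": 0.3},
]
path = ["I", "M", "M"]
score = reliability(posteriors, path)
print(round(score, 3))
```

Here the path picks posteriors 0.9, 0.8 and 0.7, so the reliability is their mean.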

Evaluating the methods

Prediction accuracies of the 10 selected methods, MetaTM, TOPCONS and CCTOP were tested on the newly compiled benchmark set. For testing, constraints from the PDBTM database were not used, because predefined topology parts would have led to artificially high accuracy values.

The CCTOP server

To handle the high resource consumption of the heavy-duty calculations of the different methods, we created a multilayered application architecture. Without going into much detail, the computational part is forwarded to a dedicated load-balanced queue of our HPC cluster, in which we isolated some nodes solely to serve these online requests. There are two interfaces for using CCTOP: a web server with a browser-friendly GUI written in C++ using the Wt web toolkit library along with our in-house XBuilder library, developed for our previous works (6,49), and a simple PHP-based front end for scripts. The results can be reviewed visually and are also available in XML format; the XSD schema definition is located at http://cctop.enzim.ttk.mta.hu/CCTOP.xsd.

RESULTS AND DISCUSSION

Evaluating the prediction accuracy of CCTOP

Previously, the prediction accuracy of the CCTOP algorithm was tested on a benchmark set containing only human proteins (3). To analyze the effect of using proteins from different source organisms on the prediction accuracy, a new test set was prepared. The 3D structures are known for all proteins in this newly prepared benchmark set, and it does not contain any protein homologous to the training set of any of the 10 methods used or to the training set of CCTOP. We tested the prediction accuracy of the various methods on this new benchmark set (Table 1). Some of the algorithms were reported to have 80–90% per-protein accuracies; however, their performance on this benchmark set is much lower. This can be explained by the stringent filtering of sequences homologous to those in the training sets. The accuracy of CCTOP was calculated in two ways. In the first test, only the results of the 10 prediction methods were used to calculate the consensus prediction, and we did not use any experimental constraints from the PDBTM, TOPDB or TOPDOM databases. TOPCONS was reported to have accuracy similar to that of the best algorithm it utilizes, which was explained by the best input representing a theoretical limit that cannot be improved upon (30). However, here we show that with a more sophisticated approach the prediction accuracy can be significantly improved. In our tests MetaTM performs worse than some of its incorporated methods. This can be explained by the less strict construction of its benchmark set. Since topology information was not incorporated into the prediction, this test shows that, by taking advantage of the probabilistic description of the hidden Markov model, an HMM-based consensus method can outperform all state-of-the-art individual and consensus methods.

Table 1.

Prediction accuracy (in percent) of CCTOP compared with the accuracies of the incorporated methods and of two consensus methods (TOPCONS and MetaTM) on the newly compiled benchmark set containing 170 proteins

	CCTOP*	CCTOP	HMMTOP*	MemBrain	MEMSAT-SVM	MetaTM	Octopus	Philius	Phobius	Pro	Prodiv	Scampi-MSA	TMHMM	TOPCONS
Sens/res	98	98	95	92	94	94	93	95	93	96	96	97	93	97
Spec/res	98	98	95	97	98	97	98	97	97	97	95	97	97	98
MCC/res	98	98	95	95	96	96	96	96	95	97	96	97	95	97
Acc_Tpg/prot	84	84	69	62	66	67	71	71	62	75	75	76	66	79
Acc_Top/prot	81	79	64	0	53	58	66	64	56	70	69	72	59	75

Predictions marked with * are enhanced with topological constraints from the TOPDB and TOPDOM databases. Structural information from PDBTM was not used in any of these tests. Sens/res is the per-residue sensitivity, Spec/res is the per-residue specificity, MCC/res is the per-residue Matthews correlation coefficient, Acc_Tpg/prot is the per-protein topography accuracy and Acc_Top/prot is the per-protein topology accuracy.

In the next test, constraints provided by solved structures were again not used, but experiments determining segment localizations relative to the membrane and bioinformatics evidence (conservatively localized domains and motifs) were incorporated. The accuracy of CCTOP increased (Table 1). However, it is important to note that by adding structural constraints the accuracy of CCTOP is expected to be 100% on the benchmark set, as it contains proteins with solved structures.
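The per-residue measures reported in Table 1 can be illustrated with a short sketch. The text does not spell out the formulas, so this sketch assumes the common binary definitions with "membrane" as the positive class: sensitivity TP/(TP+FN), specificity TN/(TN+FP) and the Matthews correlation coefficient. The function name `per_residue_metrics` and the label strings are hypothetical.

```python
import math


def per_residue_metrics(observed, predicted, positive="M"):
    """Return (sensitivity, specificity, MCC) for one positive label per residue."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    tp = fp = tn = fn = 0
    for obs, pred in zip(observed, predicted):
        if obs == positive and pred == positive:
            tp += 1
        elif obs == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sens, spec, mcc


# Toy eight-residue example: I = inside, M = membrane, O = outside.
observed = "IIMMMMOO"
predicted = "IIIMMMMO"
sens, spec, mcc = per_residue_metrics(observed, predicted)
print(sens, spec, mcc)
```

In the toy example the predicted helix is shifted by one residue, giving three true positives, one false negative, one false positive and three true negatives.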

The reliability of the CCTOP predictions

We define the reliability of the prediction as the average of the posterior probabilities of the main states (inside, outside, membrane, loop) over the most probable state path (which is the final prediction), as determined by the Viterbi algorithm. This value can be used to measure the reliability of a single prediction, as it correlates highly with the prediction accuracy. We plotted the per-protein prediction accuracy for each subset of the benchmark set that contains the predictions above a certain reliability. The independent variable is the coverage, i.e. the number of predictions with reliability larger than a certain threshold divided by the number of proteins in the benchmark set, while the dependent variables are the threshold and the prediction accuracy measured on this subset (Figure 1). As can be seen, the accuracy values decrease monotonically with increasing coverage, showing a high correlation with the reliability. In this benchmark set, the accuracy of the predictions with a reliability value above 86% is expected to be above 95%.
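The construction of such a curve can be sketched as follows. The sketch assumes that each benchmark prediction has been reduced to a reliability value and a flag saying whether the predicted topology was correct; the function name `coverage_curve` and the toy values are hypothetical and do not reproduce the data behind Figure 1.

```python
def coverage_curve(predictions):
    """Return (coverage, threshold, accuracy) for each reliability-ranked subset.

    predictions: list of (reliability, is_correct) pairs, one per protein.
    The k-th point describes the k most reliable predictions: coverage is k/N,
    threshold is the lowest reliability in the subset, accuracy is the fraction
    of correct predictions in the subset.
    """
    ranked = sorted(predictions, key=lambda item: item[0], reverse=True)
    total = len(ranked)
    curve = []
    correct = 0
    for rank, (rel, is_correct) in enumerate(ranked, start=1):
        correct += int(is_correct)
        curve.append((rank / total, rel, correct / rank))
    return curve


# Toy data: four proteins with invented reliabilities and outcomes.
predictions = [(0.95, True), (0.90, True), (0.70, False), (0.80, True)]
curve = coverage_curve(predictions)
for point in curve:
    print(point)
```

With these toy values the accuracy stays at 1.0 until the least reliable prediction is included, at which point it drops to 0.75 at full coverage.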

Figure 1.

Correlation between the topology accuracy (in percent) of CCTOP and reliability. Predictions are sorted according to their reliability values, and then the topology accuracy and the lowest reliability measured on each subset of the benchmark set (red and magenta, respectively) are plotted against the rank in the sorted list divided by the number of proteins in the benchmark set (coverage). For reliability values above 86%, the prediction accuracy is expected to be 95%.

The home page of the CCTOP server

Protein sequences can be submitted at http://cctop.enzim.ttk.mta.hu. As the prediction time may vary from a few minutes up to 30 minutes, depending on sequence length and the load of the various servers incorporated into the CCTOP algorithm, the user may request an email alert containing a link to the results. When the submitted job is finished, the CCTOP web server produces a six-panel window. The first panel is a summary presenting the most relevant information: protein name (if available), number of predicted TMHs and cross-references to various databases. The generated XML file can be downloaded from the bottom of this panel, or its content can be investigated further in the next panels. The next panel shows the raw XML file, containing all gathered information together with the results of the underlying methods and the final consensus prediction. The 1D panel shows the amino acid sequence colored by the consensus topology. Colors are based on the localization: gray, black, blue, red, yellow and orange for transit sequence, signal peptide, extra-cytosolic, cytosolic, membrane and re-entrant loop regions, respectively. The 2D panel is probably the most useful one. It is a graphical representation of the determined and predicted topology of the given entry. The graph consists of three parts: the final CCTOP prediction, the results of the various topology and topography prediction methods, and the collected constraints aligned to the amino acid sequence of the given entry. The x-axis of the graph is the residue number in the query protein. To inspect the details, the graphs can be enlarged and scrolled. The color code is the same as described above (Figure 2, panel A).

Figure 2.

Layout of the result tabs of the CCTOP web server. (A) 2D tab (B) 3D tab. For details, see the text.

If homologous proteins can be found in the PDBTM database, the 3D panel is active and contains their structures. If there are several cross-references to PDBTM, each one can be selected and its 3D structure can be inspected using JSMol. A link is provided for each homologous protein to download its 3D coordinates from the PDBTM database (Figure 2, panel B). Finally, using the Customize panel, the initial prediction can be recalculated by (de)selecting any of the prediction methods or mapped experiments, or by adding user-specified topology constraints. After submission, the reliability of the possible localizations for each sequence position is shown at the top of this panel, as well as the new reliability value. The content of the other tabs is updated with the results of the recalculated prediction. The new XML file can be downloaded at the bottom of the page, while the original XML file is still available from the summary panel.

To predict the topology of multiple sequences, a direct interface is available, which allows programmable access to the CCTOP server. A template Python script can be downloaded from the CCTOP server.

The main purpose of the web server is to provide an easily accessible user interface for the CCTOP method. Some of the utilized methods and CCTOP itself have high computational requirements. Setting up and running these servers locally is rather time-consuming. Gathering and mapping already solved structures and experiments together with the prediction results of the state-of-the-art prediction methods are additional non-trivial problems. The CCTOP server performs all these tasks automatically.

ENDNOTES

TM: transmembrane; TMP: transmembrane protein; TMH: transmembrane helix.

48 in total

1. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes.

Authors: A Krogh; B Larsson; G von Heijne; E L Sonnhammer
Journal: J Mol Biol Date: 2001-01-19 Impact factor: 5.469

2. Transmembrane proteins in the Protein Data Bank: identification and classification.

Authors: Gábor E Tusnády; Zsuzsanna Dosztányi; István Simon
Journal: Bioinformatics Date: 2004-06-04 Impact factor: 6.937

3. A hidden Markov model for predicting transmembrane helices in protein sequences.

Authors: E L Sonnhammer; G von Heijne; A Krogh
Journal: Proc Int Conf Intell Syst Mol Biol Date: 1998

4. Topology of NGEP, a prostate-specific cell:cell junction protein widely expressed in many cancers of different grade level.

Authors: Sudipto Das; Yoonsoo Hahn; Dawn A Walker; Satoshi Nagata; Mark C Willingham; Donna M Peehl; Tapan K Bera; Byungkook Lee; Ira Pastan
Journal: Cancer Res Date: 2008-08-01 Impact factor: 12.701

5. N- and O-glycosylation in the murine synaptosome.

Authors: Jonathan C Trinidad; Ralf Schoepfer; Alma L Burlingame; Katalin F Medzihradszky
Journal: Mol Cell Proteomics Date: 2013-07-01 Impact factor: 5.911

6. Mass spectrometry investigation of glycosylation on the NXS/T sites in recombinant glycoproteins.

Authors: Izabela Sokolowska; Armand G Ngounou Wetie; Urmi Roy; Alisa G Woods; Costel C Darie
Journal: Biochim Biophys Acta Date: 2013-04-28

7. The distribution of positively charged residues in bacterial inner membrane proteins correlates with the trans-membrane topology.

Authors: G Heijne
Journal: EMBO J Date: 1986-11 Impact factor: 11.598

8. The human transmembrane proteome.

Authors: László Dobson; István Reményi; Gábor E Tusnády
Journal: Biol Direct Date: 2015-05-28 Impact factor: 4.540

9. MemBrain: improving the accuracy of predicting transmembrane helices.

Authors: Hongbin Shen; James J Chou
Journal: PLoS One Date: 2008-06-11 Impact factor: 3.240

10. Transmembrane protein alignment and fold recognition based on predicted topology.

Authors: Han Wang; Zhiquan He; Chao Zhang; Li Zhang; Dong Xu
Journal: PLoS One Date: 2013-07-19 Impact factor: 3.240
