Dániel Kozma, István Simon, Gábor E Tusnády. Institute of Enzymology, Research Centre for Natural Sciences, Hungarian Academy of Sciences, PO Box 7, H-1518 Budapest, Hungary.
Abstract
A contact map is a 2D derivative of the 3D structure of a protein, containing the various residue–residue (RR) contacts within the structure. Contact maps can be used to reconstruct the structure with high accuracy and can be predicted from the amino acid sequence. Therefore, understanding the properties of contact maps is an important step in protein structure prediction. To investigate the basic properties of contact formation and contact clusters, we set up an integrated system called Contact Map Web Viewer, or CMWeb for short. The server can be used to visualize contact maps, to link contacts and show them both in 3D structures and in multiple sequence alignments, and to calculate various statistics on contacts. Moreover, we have implemented five contact prediction methods in the CMWeb server to visualize the predicted and real RR contacts in one contact map. The results of other RR contact prediction methods can also be uploaded to the server as a benchmark test. All of this functionality is provided by a web server; thus, only a Java-capable web browser is needed to use the application, and no further program installation is required. CMWeb is freely accessible at http://cmweb.enzim.hu.
Structures of globular proteins are determined and maintained by non-covalent residue–residue (RR) interactions (1). Mapping RR contacts into a 2D binary map results in the so-called contact map. Contact maps can be predicted from the amino acid sequence information of proteins with acceptable accuracy. Several methods have been developed for predicting these contacts, based either on machine learning algorithms (2–4) or on simpler statistical algorithms such as mutual information (MI) (5), correlated mutations (6,7) and the statistical coupling algorithm (SCA) (8). The accuracy of the state-of-the-art RR predictors is ∼20–30%, suggesting the need for improvement, although the most recent methods [e.g. (9)] show significantly better, but still unsatisfactory, performance.

To understand the properties of contact maps and the relations between 3D structure and residue contacts, visual inspection of contact maps can be useful in addition to statistical approaches. During the last decades, several useful contact map viewers have been developed (10–12). The most recent contact map viewer is the CMView (13) program, which uses PyMol (14) for visualizing the 3D structures. CMView is a desktop application designed mainly for studying 3D structure reconstruction from a contact map. The Contact Map Web Viewer (CMWeb) server presented in this article has a different purpose. CMWeb is designed for analysing and understanding contact formation and protein contacts, and for helping to develop methods for predicting protein contacts. Our aim is not to predict the 3D structure of proteins, but only to support the visual investigation of RR contacts and of the results of RR contact prediction methods.

The server is a standalone, user-friendly platform that does not require any additional component for operation.
MATERIALS AND METHODS
The web server is written in C/C++ using the Wt web toolkit (15) and the in-house PDBLIB program library, which was used earlier for the TMDET algorithm (16). Despite the numerous calculations, the web server is fast owing to its C/C++ program core. The web server utilizes the OpenAstex (17) protein structure viewer for the combined contact map and 3D structure view. We chose the OpenAstex structure viewer because it renders molecules better and faster than other Java-based viewers. On the client side, we recommend using the most commonly used Java runtime, Oracle Java JRE (http://java.com).

As the web server is designed for analysing protein chains and structures, we apply a basic filter on PDB entries to exclude nucleic acid structures.
Multiple sequence alignment
The multiple sequence alignment (MSA) is generated from the sequence stored in the PDB file. The sequence is searched against a user-selectable sequence database (SwissProt or nr) using the BLAST algorithm. The MSA is built from the resulting local pairwise alignments, and columns containing a gap in the query sequence are discarded. The prediction methods use this MSA to estimate contacts.
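The construction of a query-anchored MSA from local pairwise alignments can be sketched as follows. This is a minimal illustration and not the CMWeb code: the function names, the 0-based `query_start` convention and the padding of uncovered query positions with `-` are assumptions made for the example.

```python
def project_onto_query(aligned_query, aligned_hit, query_start, query_length):
    """Map one local pairwise alignment onto query coordinates.

    aligned_query, aligned_hit: equal-length gapped strings of one alignment.
    query_start: 0-based index of the first aligned query residue (assumed convention).
    Columns with a gap in the query are dropped; query positions not covered
    by the local alignment are filled with '-'.
    """
    row = ["-"] * query_length
    pos = query_start
    for q_char, h_char in zip(aligned_query, aligned_hit):
        if q_char == "-":
            continue  # insertion relative to the query: column is discarded
        row[pos] = h_char
        pos += 1
    return "".join(row)


def build_msa(query, hits):
    """Stack the query and all projected hits into a fixed-width MSA."""
    rows = [query]
    for aligned_query, aligned_hit, query_start in hits:
        rows.append(
            project_onto_query(aligned_query, aligned_hit, query_start, len(query))
        )
    return rows
```

Every row of the result has the length of the query, so column i of the MSA always corresponds to residue i of the PDB sequence, which is what allows predicted pairs to be drawn directly on the contact map.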
Implemented contact prediction methods
Five protein contact prediction methods have been implemented: MI (Mutual Information) (5), SCA (Statistical Coupling Analysis) (8), ELSC (Explicit Likelihood of Subset Co-variation) (7), OMES (Observed Minus Expected Squared) (18) and one of the first methods, by Göbel (6). We have re-implemented these contact predictors in the C programming language to make on-the-fly prediction feasible. Because our aim is to benchmark these or any other methods, predictions can be made only for PDB entries; user-provided sequences are not allowed. These implementations were checked against the original as well as other tested implementations of these prediction methods.
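To illustrate the simplest of these predictors, the mutual information between two MSA columns i and j is MI(i,j) = Σ p(a,b) log[p(a,b) / (p(a) p(b))], summed over the residue pairs (a,b) observed in the two columns. The sketch below is not the CMWeb C implementation; skipping sequences that have a gap in either column and using the natural logarithm are assumptions of this example.

```python
import math
from collections import Counter


def column_mutual_information(msa, i, j):
    """Mutual information (in nats) between columns i and j of an MSA.

    msa: list of equal-length aligned sequences.
    Sequences with a gap ('-') in either column are skipped (assumption).
    """
    pairs = [(row[i], row[j]) for row in msa if row[i] != "-" and row[j] != "-"]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((left[a] / n) * (right[b] / n)))
    return mi
```

Column pairs with a high score vary together across the alignment and are reported as predicted contacts; independent columns score zero.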
RESULTS AND DISCUSSION
Contact map viewer
The CMWeb server integrates a contact map, structure and MSA viewer combined with a statistical evaluation system (Figure 1) in a fully interactive way. The web server provides a graphical user interface (GUI)-like web application in which the various objects on the screen are connected via signals, so any user interaction is traced and handled by these objects. When a contact pair is selected in the contact map panel, the server:
(i) shows the corresponding residues in the structure viewer panel (Figure 1D), coloured according to the secondary structure scheme shown in the MSA panel (Figure 1F), where the conservation profile can be found as well;
(ii) highlights the positions in the sequence alignment (Figure 1F);
(iii) displays the distance between the selected residues in the structure viewer according to the contact definition (Figure 1D);
(iv) displays the distance value, the residue numbers and the connecting atom types in the information panel (Figure 1C).
Figure 1.
Layout of the CMWeb web server. (A) menu bar and navigation bar; (B) overall contact map; (C) information panel; (D) structure viewer; (E) zoomable detailed contact map (blue: contacts, green: false prediction, red: correct prediction); (F) MSA viewer with conservation profile and secondary structure; (G) statistical panel with marginal and double marginal distribution of contacts, amino acid distributions, amino acid contact propensities, ROC curve, score histogram and a table of statistical measures.
Furthermore, the web server can show all the neighbours of a selected residue (by double-clicking on the contact map panel) in the structure viewer. The central residue and its neighbours are coloured by a given colour scheme. The centre of the given cluster and the number of surrounding residues are displayed in the information panel. The positions of the selected residues are highlighted in the sequence alignment panel as well. These functions can also be activated from the MSA panel. Additionally, any region selected in the MSA is displayed in the structure viewer too.
All data presented on the webpage are calculated on-the-fly based on the selected or uploaded PDB protein chain structure. The contact definition can be specified in terms of contact type (all-atom, side-chain atoms, Cα and Cβ) and contact threshold (distance cutoff in Å). In addition, the user can filter indirect contacts within a given distance limit.
Furthermore, the contact map can display indirect connections between residues that are mediated by heteromolecules, such as structural waters. The server can show either the contact map according to the contact definition (Figure 2A) or the distance matrix of a given protein chain (Figure 2B).
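The relation between the distance matrix and the binary contact map can be sketched for the simplest contact type, one representative atom (e.g. Cα) per residue. The 8 Å default cutoff and the function names are assumptions for illustration only; the all-atom definition based on van der Waals radii used by the server is not reproduced here.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between per-residue representative atoms."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def contact_map(coords, cutoff=8.0, min_separation=1):
    """Binary contact map derived from a distance cutoff.

    coords: list of (x, y, z) tuples, one per residue, in Å.
    cutoff: distance threshold in Å (8.0 is an assumed example value).
    min_separation: minimum sequence separation |i - j| for a contact.
    """
    dist = distance_matrix(coords)
    n = len(coords)
    contacts = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if dist[i][j] <= cutoff:
                contacts[i][j] = contacts[j][i] = True
    return contacts
```

The continuous map corresponds to `distance_matrix` and the binary map to `contact_map`; changing the cutoff or the sequence separation only changes which entries of the same distance matrix are marked.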
Figure 2.
Contact maps of the 2bl2A protein chain with the given contact definition. (A) Binary contact map (any heavy atoms closer than the sum of their van der Waals radii plus 1.5Å, the sequence separation is 1); (B) Continuous distance map (distance scale is from the closest to the farthest as red–yellow–green–blue–purple).
The user can investigate any PDB entry by entering its PDB code, or can upload any protein structure in PDB format to visualize their own unpublished or modelled structures. The server incorporates all PDB entries and is updated weekly.

The server also provides the MSA with a schematic secondary structure and conservation profile to help collect the necessary sequence information, since similar sequences share roughly the same structure.

The main advantage of these features is that a single click gives broad information about the inspected residues and their physico-chemical and spatial environment, with the corresponding positions highlighted and displayed simultaneously in the MSA panel and in the structure viewer.

The web server calculates statistics on the inspected protein chain. A marginal and a double marginal distribution of amino acid contact numbers are presented. The latter shows the population of RR contacts between amino acids with n and m contacts. Various amino acid frequencies and the RR contact distribution are presented as well. The predicted contacts are displayed on the contact map panel, and the performance of the given method on the specified protein chain is shown by the ROC curve. In addition, the user can follow the separation of the TP and FP scores and informative statistical measures such as accuracy, precision, TPR/sensitivity/coverage/recall, FPR, Matthews correlation coefficient, improvement over random, F1 and Xd scores. The score histogram is a useful check of the prediction methods: it shows how well the score values calculated for residue pairs that are in contact separate from those calculated for pairs that are not.
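Several of the listed measures follow directly from the confusion counts of predicted versus real contacts. The sketch below uses the standard textbook definitions, which are assumed here and may differ in detail from those used by the server; improvement over random and Xd are omitted because their exact definitions are not given in the text.

```python
import math


def confusion_counts(predicted, actual, all_pairs):
    """Count TP, FP, FN, TN for sets of residue pairs (i, j).

    all_pairs is the set of every residue pair considered and is assumed
    to contain both the predicted and the actual contacts.
    """
    tp = len(predicted & actual)
    fp = len(predicted - actual)
    fn = len(actual - predicted)
    tn = len(all_pairs) - tp - fp - fn
    return tp, fp, fn, tn


def measures(tp, fp, fn, tn):
    """Standard binary-classification measures; undefined ratios are returned as 0.0."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # TPR / sensitivity / coverage
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "fpr": ratio(fp, fp + tn),
        "f1": ratio(2 * precision * recall, precision + recall),
        "mcc": ratio(
            tp * tn - fp * fn,
            math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        ),
    }
```

Sweeping a score threshold, recomputing the counts at each step and plotting recall against FPR yields the ROC curve.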
Benchmark test
Using the benchmark test menu, users can check the performance of any contact prediction method. After the prediction files in CASP RR format are uploaded and the contact map definition is set, our statistical evaluation system returns a list giving, line by line, the name, a small contact map including the predictions and various statistical measures. Each prediction can be analysed further (Figure 3) by inspecting the contacts between residues and their orientation in 3D space using the OpenAstex (17) molecular viewer described above. The summary at the end of the list briefly reports the average performance of the tested contact prediction technique. It is important to note that our evaluation system ignores the distance ranges in the file; it evaluates predictions based on the contact map definition previously set by the user.
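Reading such a file can be sketched as follows, assuming the usual CASP RR layout in which each contact line holds five whitespace-separated fields (i, j, lower distance bound, upper distance bound, probability) and all other lines are header, sequence or END records. The parser, its function names and the 0.5 probability threshold are illustrative assumptions, not the server's code. In line with the text, the distance bounds are read but not used.

```python
def parse_casp_rr(text):
    """Extract (i, j, probability) triples from CASP RR formatted text.

    Lines that do not consist of two integers followed by three numbers
    are treated as header, sequence or END records and skipped.
    """
    predictions = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue
        try:
            i, j = int(fields[0]), int(fields[1])
            float(fields[2])  # lower distance bound: validated, then ignored
            float(fields[3])  # upper distance bound: validated, then ignored
            probability = float(fields[4])
        except ValueError:
            continue
        predictions.append((min(i, j), max(i, j), probability))
    return predictions


def predicted_pairs(predictions, min_probability=0.5):
    """Residue pairs whose score reaches the (assumed) probability threshold."""
    return {(i, j) for i, j, p in predictions if p >= min_probability}
```

The resulting set of pairs can then be compared with the contacts derived from the structure under whatever contact definition the user has selected.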
Figure 3.
Layout of the inspector window, which is useful for further analysis of the elements of the benchmark list. The environment of the correct and incorrect predictions can be inspected using the structure viewer on the right. The upper left corner contains an overall contact map and an information box displaying the user activity; the bottom left corner contains a detailed contact map, in which blue points are the real contacts corresponding to the contact definition, red points are correct predictions and green points are false predictions.
CONCLUSION
CMWeb is an interactive online web application for examining contact maps together with linked 3D structures, MSAs, secondary structures, sequence conservation and five commonly used prediction methods. Furthermore, CMWeb can be used for benchmark testing of custom prediction methods and measuring their performance. The server utilizes state-of-the-art technologies to provide a desktop-application-like GUI and functionality on the web. This web server may serve as a good example of the potential of the Wt programming library. We hope that CMWeb will be a powerful web tool for analysing protein contacts and contact prediction methods and that it may become a widely used scientific tool.
FUNDING
Hungarian Scientific Research Fund (OTKA) [NK100482 and K75460]. Funding for open access charge: Research grant of Hungarian Scientific Research Fund.

Conflict of interest statement. None declared.