cpxDeepMSA: A Deep Cascade Algorithm for Constructing Multiple Sequence Alignments of Protein-Protein Interactions.

Abstract

Protein-protein interactions (PPIs) are fundamental to many biological processes. The coevolution-based prediction of interacting residues has made great strides for protein complexes that are known to interact. A multiple sequence alignment (MSA) is the basis of coevolution analysis, and MSA construction has recently made significant progress for protein monomer sequence analysis. However, no standard or efficient pipelines are available for collecting sensitive protein complex MSAs (cpxMSAs), and generating a cpxMSA remains one of the most challenging problems in sequence coevolution analysis. Although several methods have been developed to address this problem, no standalone program exists. Furthermore, the number of built-in properties is limited; hence, it is often difficult for users to analyze sequence coevolution according to their desired cpxMSA. In this article, we developed a novel cpxMSA approach (cpxDeepMSA). We used different protein monomer databases and incorporated three strategies (genomic distance, phylogeny information, and the STRING interaction network) to join the monomer MSA results of protein complexes, so that the failure of a single strategy to join the two monomer MSAs does not cause the whole cpxMSA construction to fail. We anticipate that the cpxDeepMSA algorithm will become a useful high-throughput tool for protein complex structure prediction, inter-protein residue-residue contact prediction, and biological sequence coevolution analysis.

Entities: Chemical

Keywords: STRING interaction network; genomic distance; multiple sequence alignment; phylogeny information; protein complex; protein–protein interactions; sequence coevolution analysis

Substances：
Proteins

Year: 2022 PMID： 35955594 PMCID： PMC9369210 DOI： 10.3390/ijms23158459

Source DB: PubMed Journal: Int J Mol Sci ISSN： 1422-0067 Impact factor: 6.208

1. Introduction

Proteins play crucial roles in almost all biological processes in cells. They accomplish these roles through intermolecular interactions, which are characterized by their identity, dynamics, and specificity [1,2]. Experimental screens have identified tens of thousands of protein–protein interactions (PPIs) or protein complexes, and structural biology has provided detailed functional insight into select 3D protein complexes. However, the structures of many protein complexes are unknown, and there is still little or no 3D information for a significant percentage of currently known PPIs or protein complexes in bacteria, yeast, and humans [3,4]. The structures of many essential PPI complexes, including those bound to the cell membrane, are difficult, if not impossible, to solve using current techniques. Computational approaches have therefore become an increasingly important means of obtaining protein complex structures, especially for large-scale protein complex structure modeling [5]. With the rapid growth in our knowledge of genetic variation at the sequence level, there is increased interest in linking sequence variation to changes in molecular interactions. However, current experimental approaches cannot meet the demand for residue-level information on these interactions. Recent work has demonstrated the accuracy of coevolution-based contact prediction for monomeric proteins using global statistical models [6,7,8]. The multiple sequence alignment (MSA) of a chain is the foundation for quantifying coevolution. An MSA also reveals conserved regions and motifs of structural and functional importance within the protein family.
Furthermore, the MSA is an essential part of protein structure prediction [9], protein contact map [10], secondary structure feature [11], ligand-binding site prediction [12], homologous templates [13], gene ontology [14], phylogenetic analysis [15], and many other valuable procedures in sequence research [16]. Therefore, many MSA construction methods have been developed, such as BLAST [17], HHblits [18] from the HH-suite [19], the Jackhmmer and HMMsearch tools from the HMMER suite [20], MetaPSICOV2 [21], and DeepMSA [22]. In contrast to the extensive work on monomeric proteins, little is known about the utility of such statistical models for predicting protein–protein interactions or protein complexes. Coevolution underlies many modern computational techniques for characterizing protein–protein interactions. Therefore, as shown in Figure 1, how to build the multiple sequence alignment (MSA) of a protein–protein interaction or protein complex is an important issue that needs to be addressed. EVcomplex [3], Gremlin-Complex [23], and ComplexContact [4] build protein complex multiple sequence alignments based on genomic distance. ComplexContact [4] also creates protein complex multiple sequence alignments using a phylogeny-based method.

Figure 1

Flowchart of cpxMSA construction. The two circles (yellow and green) connected by double arrow lines indicate sites of coevolution (left), which are used to identify evolutionary couplings between co-evolving inter-chain residue pairs (right).

Although the above methods introduced genomic-based and phylogeny-based approaches for generating protein complex multiple sequence alignments, few standalone pipelines or programs exist that efficiently generate sensitive protein complex MSAs from the input protein complex sequences; hence, there was an urgent need to address this issue. Inspired by the protein monomer MSA algorithm DeepMSA, we developed and released cpxDeepMSA, a new open-source program that constructs deep and sensitive protein complex MSAs by merging sequences from three different strategies through a hybrid homology-detection approach.

2. Results and Discussion

2.1. Evaluation

We evaluated our cpxMSA method for contact prediction using the state-of-the-art programs CCMpred [24] and trRosettaX [25,26]. We calculated the precision of the top 50, 20, 10, and 5 predicted contacts and of the top L/k (k = 5, 10, 20, 50) predicted contacts, where L is the total length of the two protein chains. The precision is defined as the percentage of correctly predicted contacts among the top predictions.
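The top-k precision described above can be sketched as follows. This is a minimal illustration, not the evaluation script used in the study: the function names, the representation of predictions as an (La, Lb) inter-chain score matrix, and the boolean contact map are all assumptions made for the example.

```python
import numpy as np

def top_k_precision(scores, contacts, k):
    """Fraction of the k highest-scoring inter-chain pairs that are true contacts.

    scores   : (La, Lb) array of predicted inter-chain contact scores (assumed layout)
    contacts : (La, Lb) boolean array marking true inter-chain contacts
    """
    scores = np.asarray(scores, dtype=float)
    contacts = np.asarray(contacts, dtype=bool)
    k = max(1, min(int(k), scores.size))          # keep k within a valid range
    # Flattened indices of the k largest scores, highest first.
    top = np.argsort(scores, axis=None)[::-1][:k]
    return float(contacts.ravel()[top].mean())

def precision_report(scores, contacts):
    """Precision at the fixed cutoffs (5, 10, 20, 50) and at L/k cutoffs."""
    total_len = sum(np.asarray(scores).shape)      # L = length of chain A + chain B
    report = {f"top{k}": top_k_precision(scores, contacts, k) for k in (5, 10, 20, 50)}
    for d in (5, 10, 20, 50):
        report[f"L/{d}"] = top_k_precision(scores, contacts, total_len // d)
    return report
```

For two chains of lengths 120 and 80, for example, L = 200, so the L/10 cutoff evaluates the 20 highest-scoring inter-chain pairs.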

2.2. cpxDeepMSA Increases Protein Complex Contact Prediction Accuracy

The genomic-, phylogeny-, and STRING-based methods for cpxMSA construction complement each other. Generally, the genome-based method works better for prokaryotic species, and our phylogeny-based method works better for eukaryotes. Table 1 shows the results on the PDB100 benchmark (100 heterodimers) using the predictor trRosettaX (with defaults: “predict.py -i input.a3m -o output.npz -mdir ./model_res2net_202012”). The PDB100 benchmark was extracted from the Protein Data Bank (PDB) [27] with a sequence identity cutoff of 40%. The results indicate that both the genomic-based (stage 1) and phylogeny-based (stage 2) cpxMSA construction methods outperform the STRING-based method (stage 3), with the phylogeny-based method scoring highest of the three on this benchmark. The MSA from cpxDeepMSA outperforms the other three MSAs for contact prediction. For instance, when using the MSA from cpxDeepMSA, the precision for the top five contacts was 0.673; this was 58.7%, 16.6%, and 99.1% higher than that of the genomic-, phylogeny-, and STRING-based MSAs, respectively.

Table 1

Inter-protein contact prediction precision on the PDB100 database by trRosettaX. Bold font indicates the highest value in each category.

MSA	L/5	L/10	L/20	L/50	50	20	10	5
Genomic-based	0.273	0.325	0.372	0.414	0.302	0.365	0.397	0.424
Phylogeny-based	0.353	0.430	0.499	0.564	0.394	0.485	0.538	0.577
STRING-based	0.210	0.253	0.295	0.333	0.228	0.282	0.316	0.338
cpxDeepMSA	0.398	0.491	0.572	0.645	0.449	0.560	0.629	0.673
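The relative improvements quoted for the top five contacts follow directly from the last column of Table 1. The short check below reproduces that arithmetic; the dictionary and function names are chosen only for illustration.

```python
# Top-5 precision values copied from the last column of Table 1.
top5 = {
    "Genomic-based": 0.424,
    "Phylogeny-based": 0.577,
    "STRING-based": 0.338,
    "cpxDeepMSA": 0.673,
}

def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

gains = {
    name: round(relative_gain(top5["cpxDeepMSA"], value), 1)
    for name, value in top5.items()
    if name != "cpxDeepMSA"
}
print(gains)
```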

To further investigate the effectiveness of cpxDeepMSA, Table 2 compares the contact map prediction results obtained with the cpxDeepMSA MSA and the RoseTTAFold (RF) MSA [28] on Baker’s dataset [23]. We used CCMpred with the parameters “CCMpred input.aln output.mat -n 100 -e 0 -A” to detect the coevolution in each alignment. cpxDeepMSA showed a significant improvement in contact map prediction over the RF MSA with both predictors, CCMpred and trRosettaX. The precision for the top 10 contacts with the RF MSA was 0.4% with CCMpred and 51.9% with trRosettaX. With cpxDeepMSA, the corresponding precision was 38.5% and 55.2%, which is 9225.0% and 6.4% higher than the RF MSA with CCMpred and trRosettaX, respectively.

Table 2

Inter-protein contact prediction precision on Baker’s dataset.

Predictor	MSA	L/5	L/10	L/20	L/50	50	20	10	5
CCMpred	RF MSA	0.012	0.007	0.006	0.009	0.010	0.006	0.004	0.007
CCMpred	cpxDeepMSA	0.137	0.214	0.306	0.377	0.242	0.333	0.385	0.400
trRosettaX	RF MSA	0.340	0.416	0.462	0.535	0.406	0.461	0.519	0.556
trRosettaX	cpxDeepMSA	0.334	0.418	0.487	0.565	0.410	0.494	0.552	0.578

2.3. Web-Server and User Guide

To make cpxDeepMSA easier to apply in practice, we established a web server. Below, we give a step-by-step guide on how to use the web server to obtain the desired results.

2.3.1. Server Input

On opening the web server at https://zhanggroup.org/cpxDeepMSA/, you will see the top page of cpxDeepMSA, as shown in Figure 2. The input to the cpxDeepMSA server consists of two single-chain amino acid sequence files in FASTA format. After a job is submitted, a URL link with a random job ID is generated, allowing the user to check the results while keeping the data private. The user must provide an email address when submitting a job, and the server will automatically send a notification email with a link to the results page upon job completion.

Figure 2

A partial screenshot showing the top page of the cpxDeepMSA web server at https://zhanggroup.org/cpxDeepMSA/.

2.3.2. Server Output

The cpxDeepMSA results page consists of seven sections: (i) a summary of the multiple sequence alignments and the compressed package files of the sequence analysis (Figure 3A); (ii) the submission, including the query sequence (Figure 3B); (iii)–(vi) the protein complex multiple sequence alignments generated by the STRING-based, genomic-based, and phylogeny-based methods and by cpxDeepMSA, respectively (Figure 3C–F); and (vii) the multiple sequence alignment file (Figure 3G). As an illustration, Figure 3 presents an example from the conformationally-strained, circular permutant of barnase (PDB ID: 3da7) to explain section (vi) of the results page.

Figure 3

Illustration of the cpxDeepMSA server output, including (A) a summary of the multiple sequence alignment results; (B) a summary of the user input; (C–F) the protein complex multiple sequence alignments generated by the STRING-based, genomic-based, and phylogeny-based methods and by cpxDeepMSA, respectively; (G) the multiple sequence alignment file.

Section (vi) (Figure 3F) shows the cpxMSA generated by cpxDeepMSA in three parts, (1), (2), and (3). Part (1) displays the cpxMSA on the page; users can drag or zoom in on the table to inspect it. Part (2) presents the sequence analysis of the cpxMSA produced with the software WebLogo 3.6 [29]. In part (3), an aln-formatted file can be downloaded by clicking on the link at the bottom of the table.

3. Materials and Methods

3.1. cpxDeepMSA Pipeline for MSA Construction

As shown in Figure 4, the cpxDeepMSA pipeline can be divided into three stages. The stages use the HH-suite [19] program to search two protein sequence databases, Uniclust30 [30] and STRING [31], and then pair the resulting monomer hits through three matching databases: ENA [32], Taxonomy [33], and STRING linker [31].

Figure 4

The flowchart of cpxDeepMSA. Three stages of MSA generations were performed consecutively using sequences from the HHblits search through Uniclust30 and pairing with genomic distance (first column), phylogeny information (second column), and the STRING interaction network (third column).

Stage 1. First, download the Uniclust30 (version: 2018_08) [30] database of protein monomer sequences derived from whole-genome data. Second, use the multiple sequence alignment software HHblits (with the parameters “-diff inf -id 99 -cov 50 -n 3”) from the HH-suite 2.0.16 program to search Uniclust30 with query sequence A and query sequence B separately, obtaining the monomer multiple sequence alignments MSA_A and MSA_B. Third, look up the sequences of MSA_A and MSA_B in the genome database (ENA) to obtain the gene information of the alignment results, MSA_A_gene and MSA_B_gene. Fourth, for a sequence in MSA_A_gene and a sequence in MSA_B_gene that belong to the same genome, connect the two protein sequences if their gene distance satisfies the distance criterion. Finally, following the above steps, construct a multiple sequence alignment (MSA) of the protein complex based on gene distance, as shown in Figure 5.

Figure 5

An example of the genomic-based MSA concatenation.
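The genomic-distance pairing of Stage 1 can be sketched as below. This is only an illustration under stated assumptions, not the cpxDeepMSA implementation: each hit is represented as a hypothetical (genome_id, gene_position, aligned_sequence) tuple, the distance is taken as the absolute difference of gene positions, the closest partner is chosen when several candidates exist, and `max_distance` is a free parameter whose value is not given here.

```python
from collections import defaultdict

def pair_by_genomic_distance(hits_a, hits_b, max_distance):
    """Concatenate hits of chain A and chain B that lie close together on one genome.

    hits_a, hits_b : lists of (genome_id, gene_position, aligned_sequence) tuples
                     (hypothetical record layout for this sketch)
    max_distance   : largest allowed gene distance (assumed parameter)
    """
    # Index the chain B hits by genome so that only same-genome pairs are compared.
    by_genome = defaultdict(list)
    for genome, pos, seq in hits_b:
        by_genome[genome].append((pos, seq))

    paired = []
    for genome, pos_a, seq_a in hits_a:
        candidates = [(abs(pos_a - pos_b), seq_b) for pos_b, seq_b in by_genome.get(genome, [])]
        if not candidates:
            continue                      # no chain B homolog in this genome
        distance, seq_b = min(candidates) # closest chain B gene (assumed tie-break rule)
        if distance <= max_distance:
            paired.append(seq_a + seq_b)  # one row of the concatenated complex MSA
    return paired
```

Choosing only the nearest partner keeps at most one concatenated row per chain A hit; other policies, such as keeping every pair within the cutoff, would be equally easy to express.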

Stage 2. The cpxMSA construction based on the protein monomer sequences and species similarity search has five steps. First, download the taxonomy database from the National Center for Biotechnology Information (NCBI) public database. Second, compare the multiple sequence alignments MSA_A and MSA_B of sequence A and sequence B from Stage 1 with the taxonomy database to obtain the species information of their proteins, giving MSA_A_phy and MSA_B_phy, respectively. Third, within each species, rank the proteins in MSA_A_phy and MSA_B_phy by their similarity to the query sequence, from high to low. Fourth, for a given species, let A_1, A_2, ..., A_m be the species-specific proteins in MSA_A_phy sorted by sequence similarity, and B_1, B_2, ..., B_n be the species-specific proteins in MSA_B_phy ranked by sequence similarity. Then, connect A_i with B_i, where i ≤ min(m, n). Finally, according to the species comparison result, concatenate the two monomer multiple sequence alignments to obtain the species-based multiple sequence alignment of the protein complex (see Figure 6).

Figure 6

The flowchart of the phylogeny information-based method.
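The rank-based pairing of Stage 2 can be sketched as follows. The record layout (species_id, identity_to_query, aligned_sequence) and the function name are assumptions for this example, and the sketch assumes that within a species the i-th ranked chain A protein is joined with the i-th ranked chain B protein until the shorter list is exhausted.

```python
from collections import defaultdict

def pair_by_species_rank(hits_a, hits_b):
    """Join chain A and chain B hits of the same species by similarity rank.

    hits_a, hits_b : lists of (species_id, identity_to_query, aligned_sequence)
                     tuples (hypothetical record layout for this sketch)
    """
    def group_and_rank(hits):
        grouped = defaultdict(list)
        for species, identity, seq in hits:
            grouped[species].append((identity, seq))
        for members in grouped.values():
            # Highest similarity to the query first.
            members.sort(key=lambda member: member[0], reverse=True)
        return grouped

    group_a, group_b = group_and_rank(hits_a), group_and_rank(hits_b)
    paired = []
    for species, ranked_a in group_a.items():
        if species not in group_b:
            continue                      # species has no chain B homolog
        # zip stops at the shorter list, so unmatched lower-ranked hits are dropped.
        for (_, seq_a), (_, seq_b) in zip(ranked_a, group_b[species]):
            paired.append(seq_a + seq_b)
    return paired
```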

Stage 3. The main steps for building the cpxMSA from the protein interaction network are as follows: (i) Download the protein interaction information (STRING linker) and the protein interaction sequence information (STRING database) from the public protein interaction network database (STRING version 10.5: https://cn.string-db.org/). (ii) Use the HHblits multiple sequence alignment program to search the protein interaction sequence information (STRING) with sequence A and sequence B separately, obtaining the multiple sequence alignments MSA_stringA and MSA_stringB. (iii) Using the protein interaction information (STRING linker), determine whether any two proteins, protein i in MSA_stringA and protein j in MSA_stringB, interact. If they interact, connect the two. In summary, following steps (i)–(iii), construct an interaction-based multiple sequence alignment (MSA) of the protein complex.
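Step (iii) amounts to a lookup of each candidate pair in the interaction table. The sketch below illustrates this under assumptions: hits are hypothetical (protein_id, aligned_sequence) tuples, and `links` is an iterable of protein-ID pairs assumed to have been parsed beforehand from the STRING link file; no STRING file format or API is implied.

```python
def pair_by_interaction(hits_a, hits_b, links):
    """Concatenate chain A and chain B hits whose proteins are linked in the network.

    hits_a, hits_b : lists of (protein_id, aligned_sequence) tuples (hypothetical layout)
    links          : iterable of (protein_id, protein_id) interaction pairs
    """
    # Store each link as an unordered pair so that (i, j) and (j, i) both match.
    linked = {frozenset(pair) for pair in links}

    paired = []
    for id_a, seq_a in hits_a:
        for id_b, seq_b in hits_b:
            if id_a != id_b and frozenset((id_a, id_b)) in linked:
                paired.append(seq_a + seq_b)  # one row of the interaction-based MSA
    return paired
```

The nested loop is quadratic in the number of hits; for large alignments, indexing the links by protein ID first would avoid testing every pair.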

3.2. Selection of the Protein Complex Multiple Sequence Alignment Method

The number of effective sequences of the protein complex multiple sequence alignment (Necs) is computed from the following quantities: L, the length of the query protein complex; N, the number of sequences in the protein complex multiple sequence alignment (MSA); and, for each pair of sequences in the MSA, the sequence identity between their chain A segments and the sequence identity between their chain B segments.

Selection of the protein complex multiple sequence alignment method: First, calculate the number of effective sequences in the genomic-distance-based multiple sequence alignment of the protein complex from stage 1. Second, if the number of effective sequences in the stage 1 alignment meets the requirement, use the stage 1 alignment as the input to the redundant sequence removal step. Otherwise, combine the stage 1 alignment with the species-based multiple sequence alignment from stage 2, and calculate the number of effective sequences of the merged alignment. Third, if the number of effective sequences after merging stage 1 and stage 2 meets the requirement, use the merged result as the input to the redundant sequence removal step. Otherwise, combine the alignments from stage 1, stage 2, and the protein-interaction-network-based stage 3 as the input to the redundant sequence removal step.
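The cascade logic of the selection procedure can be sketched as follows. Both functions are illustrative assumptions rather than the published definitions: `effective_count` uses one common form of an effective-sequence count (each row is down-weighted by the number of rows whose chain A and chain B segments are both at least `cutoff` identical to it, and the sum is divided by the square root of L), and the 0.8 cutoff and the `min_necs` threshold are placeholder parameters.

```python
import math

def identity(s, t):
    """Fraction of identical columns between two equal-length aligned strings."""
    return sum(a == b for a, b in zip(s, t)) / len(s)

def effective_count(msa, len_a, cutoff=0.8):
    """Assumed effective-sequence count for a concatenated complex MSA.

    msa   : list of equal-length rows, chain A in the first len_a columns
    cutoff: identity level above which two rows count as redundant (assumption)
    """
    total = 0.0
    for row in msa:
        # Every row is its own neighbour, so `neighbours` is at least 1.
        neighbours = sum(
            identity(row[:len_a], other[:len_a]) >= cutoff
            and identity(row[len_a:], other[len_a:]) >= cutoff
            for other in msa
        )
        total += 1.0 / neighbours
    return total / math.sqrt(len(msa[0]))

def select_cpx_msa(stage1, stage2, stage3, len_a, min_necs):
    """Add stages one at a time and stop once the merged MSA is deep enough."""
    merged = []
    for stage in (stage1, stage2, stage3):
        merged = merged + list(stage)
        if merged and effective_count(merged, len_a) >= min_necs:
            break   # deep enough; later stages are not needed
    # After stage 3 the merged MSA is returned regardless of its depth.
    return merged   # input to the separate redundancy-removal step
```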

4. Conclusions

We developed an open-source pipeline, cpxDeepMSA, that constructs high-quality, deep cpxMSAs from a wide range of sequence sources and with strong generalization ability. cpxDeepMSA was proposed to overcome the low quality of cpxMSA results caused by relying on a single database and a low search depth. The advantages of the cpxMSA produced by cpxDeepMSA are as follows: (i) It increases the depth of the MSA. Rather than relying on a single search algorithm or database, the pipeline decides whether to add a further stage according to the number of effective sequences in the MSA from the previous stage. (ii) It enhances the generalization ability by using different protein monomer databases and three different strategies (genomic distance, phylogeny information, and the STRING interaction network) to join the monomer MSA results into protein complex MSAs. The online server and the standalone program of cpxDeepMSA are freely available at https://zhanggroup.org/cpxDeepMSA/ (accessed on 1 July 2022).

33 in total

Review 1. Computational methods for protein-protein interaction and their application.

Authors: Tie-Liu Shi; Yi-Xue Li; Yu-Dong Cai; Kuo-Chen Chou
Journal: Curr Protein Pept Sci Date: 2005-10 Impact factor: 3.272

2. Large-scale multiple sequence alignment and tree estimation using SATé.

Authors: Kevin Liu; Tandy Warnow
Journal: Methods Mol Biol Date: 2014

3. Uniclust databases of clustered and deeply annotated protein sequences and alignments.

Authors: Milot Mirdita; Lars von den Driesch; Clovis Galiez; Maria J Martin; Johannes Söding; Martin Steinegger
Journal: Nucleic Acids Res Date: 2016-11-28 Impact factor: 16.971

4. The choice of sequence homologs included in multiple sequence alignments has a dramatic impact on evolutionary conservation analysis.

Authors: Nelson Gil; Andras Fiser
Journal: Bioinformatics Date: 2019-01-01 Impact factor: 6.937

5. CopulaNet: Learning residue co-evolution directly from multiple sequence alignment for protein structure prediction.

Authors: Fusong Ju; Jianwei Zhu; Bin Shao; Lupeng Kong; Tie-Yan Liu; Wei-Mou Zheng; Dongbo Bu
Journal: Nat Commun Date: 2021-05-05 Impact factor: 14.919

6. LOMETS3: integrating deep learning and profile alignment for advanced protein template recognition and function annotation.

Authors: Wei Zheng; Qiqige Wuyun; Xiaogen Zhou; Yang Li; Peter L Freddolino; Yang Zhang
Journal: Nucleic Acids Res Date: 2022-04-14 Impact factor: 19.160

7. Improved protein contact predictions with the MetaPSICOV2 server in CASP12.

Authors: Daniel W A Buchan; David T Jones
Journal: Proteins Date: 2017-09-29

8. Accurate prediction of protein structures and interactions using a three-track neural network.

Authors: Minkyung Baek; Frank DiMaio; Ivan Anishchenko; Justas Dauparas; Sergey Ovchinnikov; Gyu Rie Lee; Jue Wang; Qian Cong; Lisa N Kinch; R Dustin Schaeffer; Claudia Millán; Hahnbeom Park; Carson Adams; Caleb R Glassman; Andy DeGiovanni; Jose H Pereira; Andria V Rodrigues; Alberdina A van Dijk; Ana C Ebrecht; Diederik J Opperman; Theo Sagmeister; Christoph Buhlheller; Tea Pavkov-Keller; Manoj K Rathinaswamy; Udit Dalwadi; Calvin K Yip; John E Burke; K Christopher Garcia; Nick V Grishin; Paul D Adams; Randy J Read; David Baker
Journal: Science Date: 2021-07-15 Impact factor: 47.728

Review 9. Protein-protein interaction networks: probing disease mechanisms using model systems.

Authors: Uros Kuzmanov; Andrew Emili
Journal: Genome Med Date: 2013-04-30 Impact factor: 11.117

10. HH-suite3 for fast remote homology detection and deep protein annotation.

Authors: Martin Steinegger; Markus Meier; Milot Mirdita; Harald Vöhringer; Stephan J Haunsberger; Johannes Söding
Journal: BMC Bioinformatics Date: 2019-09-14 Impact factor: 3.169