Carlo Baldassi, Marco Zamparo, Christoph Feinauer, Andrea Procaccini, Riccardo Zecchina, Martin Weigt, Andrea Pagnani.
Abstract
In the course of evolution, proteins show a remarkable conservation of their three-dimensional structure and their biological function, leading to strong evolutionary constraints on the sequence variability between homologous proteins. Our method aims at extracting such constraints from rapidly accumulating sequence data, and thereby at inferring protein structure and function from sequence information alone. Recently, global statistical inference methods (e.g. direct-coupling analysis, sparse inverse covariance estimation) have achieved a breakthrough towards this aim, and their predictions have been successfully integrated into tertiary and quaternary protein structure prediction methods. However, due to the discrete nature of the underlying variable (amino-acids), exact inference requires time exponential in the protein length, and efficient approximations are needed for practical applicability. Here we propose a very efficient multivariate Gaussian modeling approach as a variant of direct-coupling analysis: the discrete amino-acid variables are replaced by continuous Gaussian random variables. The resulting statistical inference problem is efficiently and exactly solvable. We show that the quality of inference is comparable or superior to that achieved by mean-field approximations to inference with discrete variables, as done by direct-coupling analysis. This is true for (i) the prediction of residue-residue contacts in proteins, and (ii) the identification of protein-protein interaction partners in bacterial signal transduction. An implementation of our multivariate Gaussian approach is available at the website http://areeweb.polito.it/ricerca/cmp/code.
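To make the idea of exactly solvable Gaussian inference concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the toy data, and the shrinkage of the covariance towards the identity (the `shrink` parameter, used here in place of the paper's pseudo-count) are all assumptions for illustration. It relies only on the general fact that, for a multivariate Gaussian, the inverse covariance matrix carries the direct couplings between variables.

```python
def covariance(rows):
    """Empirical covariance matrix of a list of equal-length numeric rows."""
    m, d = len(rows), len(rows[0])
    mean = [sum(r[k] for r in rows) / m for k in range(d)]
    return [[sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in rows) / m
             for b in range(d)] for a in range(d)]


def invert(mat):
    """Gauss-Jordan inversion with partial pivoting (fine for small matrices)."""
    n = len(mat)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(mat)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def gaussian_couplings(rows, shrink=0.5):
    """Couplings as the negated inverse of a regularized covariance.

    ASSUMPTION: shrinkage towards the identity is an illustrative
    regularizer, not the pseudo-count scheme used in the paper.
    """
    cov = covariance(rows)
    d = len(cov)
    reg = [[(1.0 - shrink) * cov[a][b] + (shrink if a == b else 0.0)
            for b in range(d)] for a in range(d)]
    prec = invert(reg)
    return [[-prec[a][b] for b in range(d)] for a in range(d)]


# Toy data: columns 0 and 1 always agree, column 2 is independent of both.
toy = [[0, 0, 0], [1, 1, 0], [0, 0, 1], [1, 1, 1]]
J = gaussian_couplings(toy)
print(J[0][1], J[0][2])
```

In this toy case the coupling between the two co-varying columns is positive, while the coupling to the independent column vanishes. The whole computation is a single matrix inversion, which is what makes the approach fast.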
Year: 2014 PMID: 24663061 PMCID: PMC3963956 DOI: 10.1371/journal.pone.0092721
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. True positive rate plotted against number of predicted pairs.
Results are shown for four different scoring techniques: Frobenius norm (as described in [15], pseudo-count set to , blue); Gaussian direct information (as described in the text, APC-corrected, pseudo-count set to , red); mean-field direct information (as described in [10], pseudo-count set to , orange) and APC-corrected mutual information (as described in [41], green). The true positive rate is an arithmetic mean over 50 Pfam families (see Table 2 for the list); thin lines represent standard deviations.
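The APC correction mentioned in the caption can be sketched as follows. This assumes the usual average-product form, in which each pair score is reduced by the product of the two positions' mean scores divided by the overall mean score; the function name and the toy matrix are hypothetical, and the paper's implementation may differ in detail.

```python
def apc_correct(score):
    """Average-product correction of a symmetric pair-score matrix.

    score[i][j] is the raw score of positions i and j; the diagonal is
    ignored. Returns a new matrix with a zero diagonal.
    """
    n = len(score)
    # Mean score of each position over all other positions.
    row_mean = [sum(score[i][j] for j in range(n) if j != i) / (n - 1)
                for i in range(n)]
    # Mean over all off-diagonal pairs (every row has n - 1 entries).
    overall = sum(row_mean) / n
    corrected = [[0.0] * n for _ in range(n)]
    if overall == 0:
        return corrected  # all scores are zero; nothing to correct
    for i in range(n):
        for j in range(n):
            if i != j:
                corrected[i][j] = score[i][j] - row_mean[i] * row_mean[j] / overall
    return corrected


raw = [[0.0, 1.0, 2.0],
       [1.0, 0.0, 3.0],
       [2.0, 3.0, 0.0]]
print(apc_correct(raw))
```

A matrix with identical off-diagonal scores is corrected to zero everywhere, which reflects the purpose of the correction: it removes background signal shared by all pairs involving a given position.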
Table 2. 50 Pfam families used in the benchmarks, together with their associated PDB entries.
| Pfam ID | Description | PDB |
| --- | --- | --- |
| PF00001 | 7 transmembrane receptor (rhodopsin family) | 1f88, 2rh1 |
| PF00004 | ATPase family associated with various cellular activities (AAA) | 2p65, 1d2n |
| PF00006 | ATP synthase alpha/beta family, nucleotide-binding domain | 2r9v |
| PF00009 | Elongation factor Tu GTP binding domain | 1skq, 1xb2 |
| PF00011 | Hsp20/alpha crystallin family | 2bol |
| PF00012 | Hsp70 protein | 2qxl |
| PF00013 | KH domain | 1wvn |
| PF00014 | Kunitz/Bovine pancreatic trypsin inhibitor domain | 5pti |
| PF00016 | Ribulose bisphosphate carboxylase large chain, catalytic domain | 1svd |
| PF00017 | SH2 domain | 1o47 |
| PF00018 | SH3 domain | 2hda, 1shg |
| PF00025 | ADP-ribosylation factor family | 1fzq |
| PF00026 | Eukaryotic aspartyl protease | 3er5 |
| PF00027 | Cyclic nucleotide-binding domain | 3fhi |
| PF00028 | Cadherin domain | 2o72 |
| PF00032 | Cytochrome b(C-terminal)/b6/petD | 1zrt |
| PF00035 | Double-stranded RNA binding motif | 1o0w |
| PF00041 | Fibronectin type III domain | 1bqu |
| PF00042 | Globin | 1cp0 |
| PF00043 | Glutathione S-transferase, C-terminal domain | 6gsu |
| PF00044 | Glyceraldehyde 3-phosphate dehydrogenase, NAD binding domain | 1crw |
| PF00046 | Homeobox domain | 2vi6 |
| PF00056 | Lactate/malate dehydrogenase, NAD binding domain | 1a5z |
| PF00059 | Lectin C-type domain | 1lit |
| PF00064 | Neuraminidase | 1a4g |
| PF00069 | Protein kinase domain | 3fz1 |
| PF00071 | Ras family | 5p21 |
| PF00072 | Response regulator receiver domain | 1nxw |
| PF00073 | Picornavirus capsid protein | 2r06 |
| PF00075 | RNase H | 1f21 |
| PF00077 | Retroviral aspartyl protease | 1a94 |
| PF00078 | Reverse transcriptase (RNA-dependent DNA polymerase) | 1dlo |
| PF00079 | Serpin (serine protease inhibitor) | 1lj5 |
| PF00081 | Iron/manganese superoxide dismutases, alpha-hairpin domain | 3bfr |
| PF00082 | Subtilase family | 1p7v |
| PF00084 | Sushi domain (SCR repeat) | 1elv |
| PF00085 | Thioredoxin | 3gnj |
| PF00089 | Trypsin | 3tgi |
| PF00091 | Tubulin/FtsZ family, GTPase domain | 2r75 |
| PF00092 | Von Willebrand factor type A domain | 1atz |
| PF00102 | Protein-tyrosine phosphatase | 1pty |
| PF00104 | Ligand-binding domain of nuclear hormone receptor | 1a28 |
| PF00105 | Zinc finger, C4 type (two domains) | 1gdc |
| PF00106 | Short chain dehydrogenase | 1a27 |
| PF00107 | Zinc-binding dehydrogenase | 1a71 |
| PF00108 | Thiolase, N-terminal domain | 3goa |
| PF00109 | Beta-ketoacyl synthase, N-terminal domain | 1ox0 |
| PF00111 | 2Fe-2S iron-sulfur cluster binding domain | 1a70 |
| PF00112 | Papain family cysteine protease | 1o0e |
| PF00113 | Enolase, C-terminal TIM barrel domain | 2al2 |
Figure 2. True positive rate plotted against number of predicted pairs.
Data for plmDCA [15] (green) and PSICOV version 1.11 [12] (red) were obtained using the code provided by the authors with standard parameters as found in the distributed code, except that PSICOV was run with the -o flag to override the check against insufficient effective number of sequences. The true positive rate is an arithmetic mean over 50 Pfam families (see Table 2 for the list); thin lines represent standard deviations.
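The curves in this figure can be reproduced in outline with the sketch below. It assumes that the true positive rate at rank k is the fraction of the top k predicted pairs that are true contacts, and that the per-family curves are then averaged point by point. The function names and the toy data are hypothetical.

```python
from statistics import mean, pstdev


def tpr_curve(ranked_pairs, true_contacts):
    """Fraction of true contacts among the first k predictions, for each k.

    ranked_pairs: list of (i, j) tuples, best-scoring pair first.
    true_contacts: set of frozensets {i, j}, so pair order does not matter.
    """
    hits, curve = 0, []
    for k, pair in enumerate(ranked_pairs, start=1):
        if frozenset(pair) in true_contacts:
            hits += 1
        curve.append(hits / k)
    return curve


def average_curves(curves):
    """Point-wise mean and standard deviation over several families."""
    length = min(len(c) for c in curves)  # truncate to the shortest curve
    means = [mean([c[k] for c in curves]) for k in range(length)]
    stds = [pstdev([c[k] for c in curves]) for k in range(length)]
    return means, stds


contacts = {frozenset((1, 5)), frozenset((2, 9))}
family_a = tpr_curve([(1, 5), (3, 7), (9, 2)], contacts)
family_b = tpr_curve([(4, 8), (2, 9), (3, 7)], contacts)
print(average_curves([family_a, family_b]))
```

The population standard deviation (`pstdev`) is used here as an assumption; the source does not say which estimator underlies the thin lines.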
Running times in seconds for a representative sample of proteins with varying length ($N$) and number of sequences in the alignment ($M$), using different algorithms.
| | PF00014 | PF00025 | PF00026 | PF00078 |
| --- | --- | --- | --- | --- |
| N | 53 | 175 | 317 | 214 |
| M | 4915 | 5460 | 4762 | 172360 |
| Gaussian DCA (parallel) | 0.7 | 5.3 | 16.3 | 534.8 |
| Gaussian DCA (non-parallel) | 1.7 | 12.7 | 52.1 | 3583.4 |
| PSICOV | 11.7 | 1141.9 | 5442.7 | 10965.1 |
| plmDCA | 433.2 | 6980.7 | 37364.8 | 303331.0 |
Since the Gaussian DCA code is parallelized, we show two series of results, one in which we used 8 cores and one in which we forced the code to run on a single core, for the sake of comparing with the non-parallel code of PSICOV and plmDCA. These benchmarks were taken on a -core cluster of MHz AMD Opteron 6172 processors running Linux 3.5.0; PSICOV version 1.11 was used, compiled with gcc 4.7.2 at -O3 optimization level; plmDCA was run with MATLAB version r2011b. Gaussian DCA timings shown are taken using the Julia version of the code, using Julia version 0.2.
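The relative speed of the methods follows directly from the table. The short sketch below computes the ratios; the numbers are copied from the table, while the helper name `speedup` is introduced here for illustration only.

```python
# Running times in seconds, copied from the table above.
FAMILIES = ["PF00014", "PF00025", "PF00026", "PF00078"]
TIMES = {
    "Gaussian DCA (parallel)": [0.7, 5.3, 16.3, 534.8],
    "Gaussian DCA (non-parallel)": [1.7, 12.7, 52.1, 3583.4],
    "PSICOV": [11.7, 1141.9, 5442.7, 10965.1],
    "plmDCA": [433.2, 6980.7, 37364.8, 303331.0],
}


def speedup(slow, fast):
    """Ratio of running times (slow / fast) for each family."""
    return {fam: s / f for fam, s, f in zip(FAMILIES, TIMES[slow], TIMES[fast])}


# Single-core comparison against the two non-parallel reference codes.
for method in ("PSICOV", "plmDCA"):
    ratios = speedup(method, "Gaussian DCA (non-parallel)")
    print(method, {fam: round(r, 1) for fam, r in ratios.items()})
```

Comparing against the single-core Gaussian DCA timings keeps the comparison fair, since PSICOV and plmDCA were run without parallelism.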
Figure 3. First predicted contacts for the PF00069 family (Protein Kinase domain) with Gaussian DCA, using the same settings as for Fig. 2.
The left panel shows the predicted contacts overlaid on the PDB structure 3fz1 (figure produced using the PyMOL software [51]); the right panel shows the predicted pairs overlaid on the contact map (true contacts as obtained by setting the threshold at 8 Å are shown in black). In both panels, the color code is the following: the first predicted contacts are depicted in green, the next contacts in yellow, the last contacts in grey; the only false positive contact (occurring as the 24th predicted pair) is shown in red.
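A contact map of the kind shown in the right panel can be derived from residue coordinates with a distance threshold. The sketch below assumes one representative 3D coordinate per residue and a strict inequality at 8 Å; the source does not state which atoms were used to measure distances, so both choices are assumptions, as are the function name and the toy coordinates.

```python
from math import dist


def contact_map(coords, threshold=8.0):
    """Boolean matrix: True where two distinct residues are closer than threshold.

    coords: one (x, y, z) tuple per residue, in angstroms.
    ASSUMPTION: a single representative point per residue.
    """
    n = len(coords)
    return [[i != j and dist(coords[i], coords[j]) < threshold
             for j in range(n)] for i in range(n)]


# Three toy residues: the first two are 5 angstroms apart, the third is far away.
toy_coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (20.0, 0.0, 0.0)]
for row in contact_map(toy_coords):
    print(row)
```

A predicted pair then counts as a true positive when the corresponding entry of this matrix is `True`.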
Figure 4. DI-ranking-induced mean true positive rate for predicting inter-protein contacts in the SK/RR complex, for both mean-field DCA (blue curve) and multivariate Gaussian DCA (red curve).
Figure 5. Partner prediction for Caulobacter crescentus orphan two-component proteins by the conditional probability method.
Experimentally known interaction partners [44], [45] are shown in red. Green dots correspond to partner predictions suggested in [18]. As for [18], the overall performance of the algorithm is good, except for the prediction of the CenK-CenR interaction.
Figure 6. Partner prediction for Bacillus subtilis orphan two-component proteins.
All 5 orphan kinases, KinA-E, are known to phosphorylate Spo0F, which is displayed in red and is always the maximally scoring protein in the RR set.
Figure 7. Illustration of the encoding of a sequence from FASTA format to its intermediate numeric representation (matrix ) to its final binarized representation (matrix ).
For clarity, we restrict the alphabet to amino-acids, , plus the gap. The alternation of white and gray cell backgrounds helps to track the transformation (e.g. ). Typically, MSAs of protein families are such that in every column (i.e. residue position) there appears a number of distinct residues smaller than or equal to . Here, we did not consider a restriction of the alphabet to the residues actually occurring, and we used instead the same encoding for all residues.
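The two encoding steps described in this caption can be sketched as follows. The three-letter alphabet `ACD`, the function names, and the convention that the gap receives the last integer code and an all-zero binary block are assumptions for illustration; the actual alphabet size and symbols used in the figure are not recoverable from the text here.

```python
ALPHABET = "ACD"  # hypothetical restricted alphabet, for illustration only


def to_numeric(seq, alphabet=ALPHABET):
    """Map each residue to an integer 1..q; any other symbol (e.g. '-') to q + 1."""
    index = {aa: k + 1 for k, aa in enumerate(alphabet)}
    gap_code = len(alphabet) + 1
    return [index.get(ch, gap_code) for ch in seq.upper()]


def binarize(numeric, q):
    """Expand each integer code into a block of q binary entries.

    Codes 1..q set a single 1 in their block; the gap code (q + 1)
    yields an all-zero block (ASSUMPTION about the gap convention).
    """
    out = []
    for code in numeric:
        block = [0] * q
        if code <= q:
            block[code - 1] = 1
        out.extend(block)
    return out


# One aligned sequence of four positions, one of which is a gap.
numeric = to_numeric("AC-D")
print(numeric)
print(binarize(numeric, q=len(ALPHABET)))
```

Applying the same two steps to every row of an alignment gives the numeric matrix and then the binarized matrix, with one block of `q` columns per residue position.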