AlphaFold predicts the most complex protein knot and composite protein knots.

Maarten A Brems, Robert Runkel, Todd O Yeates, Peter Virnau.

Abstract

The computer artificial intelligence system AlphaFold has recently predicted previously unknown three-dimensional structures of thousands of proteins. Focusing on the subset with high-confidence scores, we algorithmically analyze these predictions for cases where the protein backbone exhibits rare topological complexity, that is, knotting. Amongst others, we discovered a 7_1-knot, the most topologically complex knot ever found in a protein, as well as several six-crossing composite knots comprised of two methyltransferase or carbonic anhydrase domains, each containing a simple trefoil knot. These deeply embedded composite knots evidently arise through gene duplication and interconnection of knotted dimers. Finally, we report two new five-crossing knots including the first 5_1-knot. Our list of analyzed structures forms the basis for future experimental studies to confirm these novel knotted topologies and to explore their complex folding mechanisms.
© 2022 The Authors. Protein Science published by Wiley Periodicals LLC on behalf of The Protein Society.

Keywords:  AlphaFold; composite knots; protein knots; protein topology

Year:  2022        PMID: 35900026      PMCID: PMC9278004          DOI: 10.1002/pro.4380

Source DB:  PubMed          Journal:  Protein Sci        ISSN: 0961-8368            Impact factor:   6.993


INTRODUCTION

Recently, the artificial intelligence (AI) system AlphaFold developed by Google's DeepMind dominated the Critical Assessment of Techniques for Protein Structure Prediction (CASP) twice. AlphaFold 2, the version under consideration here, is a deep learning system that incorporates training procedures based on the evolutionary, physical, and geometric constraints of protein structures. It features iterative refinement of predictions and allows for learning from unlabeled protein sequences using self‐distillation and self‐estimates of accuracy to directly predict the 3D coordinates of all heavy atoms for a given protein using the primary structure and aligned sequences of homologues. AlphaFold 2 has currently predicted several hundred thousand protein structures, most of which are not contained in the Protein Data Bank (PDB), which mainly archives experimentally determined structures. AlphaFold's prediction databank may therefore be of tremendous value, especially for research on protein phenomena that are infrequent but still highly relevant for understanding the underlying mechanisms of protein folding. A particularly fascinating phenomenon arises for proteins that contain a topological knot in their polypeptide backbone, that is, proteins which would not fully disentangle after being pulled from both ends. In the past two decades, only about 20 different protein families containing knots have been identified. Nevertheless, knotted proteins pose a challenge to protein folding and evolution. Simulation algorithms often overestimate the knotting probability of proteins as the latter is lower than the knotting probability of random chains. Moreover, protein topology is usually similar among homologues, meaning that knotted folds tend to be preserved across proteins closely related in evolution.
For these reasons, and owing to the established rarity of knotting among natural proteins, the potential presence of knotted topologies in the vast new database of predicted protein structures is of keen interest. Currently, the most complex knot found in a protein is a single knot with six essential crossings in any projection onto a plane; a composite knot has not been observed yet. We searched the entire AlphaFold 2 databank, including the “Model organism proteomes”, “Swiss‐Prot” and “Global health proteomes” data sets, for topologically complex proteins containing previously unknown deep knots (which still persist when at least five amino acids [aa] are cut from both ends). We excluded from the analysis those with lower confidence scores (<80) or exceedingly long protein chains (>600 aa), where predicted accuracy and ability to experimentally validate the structures could be limiting. The applied criteria for the survey as well as our knot detection algorithm are discussed in detail in the Methods section. Based on this search and visual inspection, we have identified the first 7_1‐knot (with at least seven crossings in any projection onto a plane) as well as a likely evolutionary mechanism for generation of 3_1#3_1 composite knots, accompanied by several examples. Moreover, we report two new five‐crossing knots including the first 5_1‐knot in a protein and provide an overview of additional knotted proteins present in the AlphaFold databank (Supporting Information S1).
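The survey criteria described above can be sketched as a simple filter. This is a minimal illustration, not the authors' code: the names `KnotRecord`, `mean_plddt_from_pdb`, `knot_depth`, and `passes_survey` are hypothetical, the minimum of five essential crossings is taken from the Methods section, and knot depth is approximated by the number of residues outside the knotted core on the shorter side. The only external facts assumed are that AlphaFold PDB files store the per-residue confidence (pLDDT) in the B-factor columns and that the PDB format uses fixed columns.

```python
from dataclasses import dataclass


@dataclass
class KnotRecord:
    accession: str      # UniProt/AlphaFold identifier
    length: int         # chain length in amino acids
    mean_plddt: float   # average per-residue confidence, 0 to 100
    crossings: int      # essential crossings of the detected knot
    core_start: int     # first residue of the knotted core (1-based)
    core_end: int       # last residue of the knotted core (1-based)


def mean_plddt_from_pdb(pdb_text):
    """Average the B-factor field (pLDDT in AlphaFold files) over C-alpha atoms."""
    scores = [
        float(line[60:66])                      # PDB columns 61-66: B-factor
        for line in pdb_text.splitlines()
        if line.startswith("ATOM") and line[12:16].strip() == "CA"
    ]
    return sum(scores) / len(scores)


def knot_depth(rec):
    """Residues that can be cut from the shorter tail before reaching the core."""
    return min(rec.core_start - 1, rec.length - rec.core_end)


def passes_survey(rec, min_plddt=80.0, min_crossings=5, min_depth=5, max_length=600):
    """Apply the confidence, complexity, depth, and length thresholds."""
    return (
        rec.mean_plddt >= min_plddt
        and rec.crossings >= min_crossings
        and knot_depth(rec) >= min_depth
        and rec.length <= max_length
    )


# Length and core boundaries of Q9PR55 as reported later in the text; the
# confidence value 90.0 is a placeholder, NOT a number from the paper.
example = KnotRecord("Q9PR55", 89, 90.0, 7, 27, 83)
print(knot_depth(example), passes_survey(example))  # prints: 6 True
```

Keeping the thresholds as keyword arguments makes it easy to see how the candidate list changes when, for example, the length limit is relaxed.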

GENERATION MECHANISM OF COMPOSITE KNOTS

Our survey identified nine previously unknown cases of composite knots. These are all instances where two essentially independent trefoil knots are present in one longer protein chain. We propose a novel mechanism for generation of such composite knots based on gene duplication and the interconnection of a knotted homodimer. Interestingly, this mechanism resembles a strategy employed for the creation of the first artificial protein knot in which an unknotted dimer was “connected” to form a trefoil. We have observed multiple instances including the methyltransferases and carbonic anhydrases, as discussed below, in which proteins containing a composite trefoil knot (3_1#3_1) are homologous to a known knotted homodimer with one trefoil knot in each chain. Figure 1 depicts protein Q313J9, which has been identified as tRNA (guanine‐N1‐)‐methyltransferase, with a length of 425 aa and a knotted core between residues 86 and 360. If not stated otherwise, the protein code refers to the UniProt/AlphaFold identifier and structures are visualized using the Visual Molecular Dynamics (VMD) software. To visualize knots in the protein structures, we employ reduced representations (bottom structures in Figures 1, 2, 3, 4), in which the protein is divided into segments such that topology is conserved when the segments are replaced by straight lines connecting their respective start and end points. Methyltransferases are known to usually contain a single trefoil knot per chain and sometimes appear as homodimers. We have observed two variations of this phenomenon: For protein Q313J9 in Figure 1 and a similar methyltransferase Q72DU3, the two main segments containing the trefoil knots appear flexibly connected, whereas predictions for some proteins of similar sequence preserve the presumed original dimer structure more strictly. (See inset of Figure 1.) Examples are the methylases A4I142, Q4DMW6, and Q4D5S2 as well as proteins Q4CYG6, Q4D7N4, and Q381U1.
The latter are labeled as uncharacterized but show about 15% sequence identity and 30% matching secondary structure with the methyltransferase pdb:2ha8:A for proteins Q4CYG6 and Q4D7N4 or with the methyltransferase pdb:1v2x:A for protein Q381U1 according to the PDBeFold webserver. Structural alignment and sequence identity discussions based on PDBeFold for each group of methyltransferases containing composite knots can be found in the Supporting Information (SI). A particularly interesting example of the second variant is carbonic anhydrase P54212 (Figure 2), with a length of 589 aa and a knotted core between residues 198 and 570. Carbonic anhydrases were the first proteins identified as being knotted. Both trefoil knots in methyltransferase Q313J9 as well as in carbonic anhydrase P54212 have positive chirality. Therefore, the composite trefoil knots can be identified as what is commonly known as a granny knot. The chirality of the composite knots is in agreement with previous results reporting positive chirality for the single trefoil knots in methyltransferases and carbonic anhydrases. We have thereby observed the same phenomenon, a potential mechanism for generation of composite knots, in two distinct protein families and with two structural variations.
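As a worked illustration of why the chirality of each trefoil has to be determined in addition to the knot type: the Alexander polynomial, on which the classification in the Methods section is based, is multiplicative under knot composition, so a composite of two trefoils carries the square of the trefoil polynomial. The normalization below is the common textbook one and is an assumption about convention, not taken from the original analysis.

```latex
\begin{align}
\Delta_{3_1}(t) &= t^2 - t + 1, \\
\Delta_{3_1 \# 3_1}(t) &= \Delta_{3_1}(t)^2 = t^4 - 2t^3 + 3t^2 - 2t + 1, \\
\left|\Delta_{3_1 \# 3_1}(-1)\right| &= 3^2 = 9 .
\end{align}
```

The polynomial does not depend on handedness, so the granny knot (two trefoils of equal chirality) and the square knot (opposite chirality) share it; the two can only be told apart by the separately determined chirality of each trefoil. The prime knot 6_1 also evaluates to 9 at t = -1, but its polynomial, 2t^2 - 5t + 2, is different.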
FIGURE 1

3D structure (top) and reduced representation (bottom) of a six‐crossing composite knot in protein Q313J9 (methyltransferase). A composite trefoil knot (3_1#3_1) can be identified. Topologically trivial segments are not displayed. Inset: A similar structure is predicted for methyltransferase A4I142, except the two knotted domains form a more compact arrangement.

FIGURE 2

3D structure (top) and reduced representation (bottom) of protein P54212 (carbonic anhydrase). A composite trefoil knot (3_1#3_1) can be identified. Topologically trivial segments are not displayed. The large green segments in the top structure are made transparent for a better view of the knotted region.

FIGURE 3

Structure and topology of proteins P73136 (left) and Q9PR55 (right). Top: 3D structures predicted by AlphaFold. Bottom: Reduced representations to visualize the 5_1‐ and 7_1‐knots in the left and right structure, respectively. On the right, the dark blue segment introduces an additional winding.

FIGURE 4

Structure and topology of proteins A0A0K0IQS9 (left) and C1GYM9 (right). Top: 3D structures predicted by AlphaFold. Bottom: Reduced representations to visualize the 5_1‐ and 5_2‐knots in the left and right structure, respectively.

FIRST 7_1‐KNOT IN A PROTEIN

Figure 3 depicts proteins P73136 and Q9PR55 with lengths of 112 and 89 amino acids, respectively. Both are uncharacterized and no probable homologues could be identified using PDBeFold. However, they have 48% sequence identity and 71% matching secondary structure with respect to each other, which indicates that they are probably homologues. Protein Q9PR55 contains the most complicated protein knot known to date, a 7_1‐knot, with a knotted core between residues 27 and 83. The similar structure of protein P73136 contains a 5_1‐knot with a knotted core between residues 45 and 94. Such a pair of homologues in which the two proteins possess different non‐trivial topologies has not been observed previously. A closer look reveals that the more complex topology of protein Q9PR55 arises from a protein segment that introduces an additional winding (dark blue in Figure 3, right); a 7_1‐torus knot is essentially a 5_1‐torus knot with one additional winding around the torus. Both knots have positive chirality.
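The relation between the two knots can be made explicit in standard knot-theory terms (a textbook description added for illustration, not part of the original analysis): 5_1 and 7_1 are the torus knots T(2,5) and T(2,7), that is, closures of the two-strand braids \sigma_1^5 and \sigma_1^7, and the crossing number of a torus knot T(p,q) with coprime p, q >= 2 is min{p(q-1), q(p-1)}.

```latex
\begin{align}
c\bigl(T(2,5)\bigr) &= \min\{2 \cdot 4,\; 5 \cdot 1\} = 5, \\
c\bigl(T(2,7)\bigr) &= \min\{2 \cdot 6,\; 7 \cdot 1\} = 7, \\
\sigma_1^{7} &= \sigma_1^{5}\,\sigma_1^{2} .
\end{align}
```

The extra factor \sigma_1^2 is one full twist of the two strands; it corresponds to the additional winding and accounts for the two additional crossings.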

NEW 5_1‐ AND 5_2‐KNOTS

We have found two previously unknown knots with five essential crossings, including the first 5_1‐knot. Figure 4 (left) depicts protein A0A0K0IQS9 (Bm1115), which contains a 5_1‐knot. Its length is 173 aa and its knotted core extends from residue 39 to residue 157. Protein C1GYM9 (Figure 4, right) is uncharacterized, and no probable homologue could be identified using PDBeFold. It contains a 5_2‐knot with a knotted core between residues 76 and 391 and its length is 420 aa. Both knots exhibit positive chirality.

TESTS OF ACCURACY

Owing to the novelty of the findings here, validation by independent methods will be important. Ahead of experimental studies, here we applied an orthogonal computational tool, ERRAT, to assess the predicted knotted structures. The ERRAT algorithm evaluates patterns of non‐bonded contacts between C, N, and O atoms, and makes a statistical comparison to high resolution structures. By being distinct from metrics employed in AlphaFold (and other prediction methods), it offers an independent assessment. We ran ERRAT on the set of knotted structures discussed above. Discounting occasional extended termini found in some models, all the models tested showed good scores; all cases have >90% of their protein chain falling within (below) the 95% threshold for rejecting unlikely conformations. Our overall assessment was therefore that the predicted structures are correct, at least to a large extent. However, in some cases, local regions of structure appeared potentially problematic. And it is critical to note that minor discrepancies in the path of a protein chain—for example, those that would change an over/under crossing—can change the topology, potentially leading to an incorrect assignment of a knot. With regard to the present study, we note that, for the composite knot Q4D5S2 (and its relatives), the ERRAT program flags a beta strand segment around residues 100–110 as likely to be structurally incorrect (SI, Figure S1). Notably, the passage of the chain in this region is important for the knotted topology. While the AlphaFold program assigns a high degree of confidence to the predicted structure in this region, our independent assessment emphasizes the need for confirmatory experimental studies.

DISCUSSION AND CONCLUSIONS

In conclusion, we have analyzed all protein 3D structure predictions by the AlphaFold AI system for new topologically complex proteins. Our complete analysis of the data provided by AlphaFold (see SI) reveals several high‐confidence proteins containing deep complex knots, which are suitable for experimental verification of their 3D structure. In this data set, we found, amongst others, a 7_1‐knot, the most complex ever discovered in a protein, as well as a new 5_1‐knot in a homologous structure and the first instances of composite protein knots. For the latter, we propose an evolutionary mechanism for their creation by gene duplication. As protein topology is an ongoing challenge for protein folding algorithms, it will be important to verify or refute the discussed structure predictions experimentally. This would not only provide a fine gauge of the capability of the AlphaFold AI system to correctly predict the topology of complex proteins, but also, importantly, confirm the multitude of novel protein knots identified here.

METHODS

Mathematically, knots are well‐defined in closed three‐dimensional curves, and can be categorized according to the minimal number of crossings the curve makes in a projection onto a plane, allowing for any non‐breaking manipulations (e.g., smoothing) of the curve. The simplest non‐trivial knot is the so‐called trefoil knot with three crossings. The figure‐eight knot has four crossings; there are two prime knots with five, three with six, and seven with seven crossings. In addition, simpler knots can be combined (i.e., formed on separable regions of the same curve) to form composite knots, which are distinct from prime knots; the latter cannot be decomposed into simpler knots. In the present study, topologically non‐trivial proteins (i.e., polypeptide backbones that are knotted) have been identified using a classification algorithm based on the Alexander polynomial invariant. Note that for a knot to be well‐defined, the two ends of the protein must be virtually closed, which sometimes leads to ambiguous results and requires additional visual inspection. Employing the algorithm above, we find that the knotting probability of the AlphaFold database (around 2%) is roughly in accordance with that of the PDB, as discussed in the SI. We limit our detailed, non‐algorithmic analysis to proteins which fulfill the following three criteria: First, the average computed confidence score for the predicted structure must be 80 or above. The AlphaFold AI system provides a per‐residue estimate of its confidence on a scale from 0 to 100, which is based on the lDDT‐Cα metric. Second, the topology of the protein must be more complex than a trefoil (3_1) or figure‐eight (4_1) knot, that is, it must contain a knot with at least five essential crossings, which includes any potential composite knots. We exclude combinations of knot types and protein families which are already known, such as 5_2‐knots in ubiquitin hydrolases and 6_1‐knots in haloacid dehalogenase.
Moreover, the knot must be deep in the sense that the topology of the system is invariant under removal of at least 5 aa from both termini. A related measure for the topological robustness of a structure, which we employ in our discussions, is the extent of the knotted core, that is, the smallest region of the protein which still contains the knot. The extent of the knotted core is one of the measures included in the knot matrix representation for proteins introduced by King et al. in Ref. 18 and popularized in further work. Third, the protein must not exceed a length of 600 aa. The final condition was set to mitigate the potential errors in topology assignment that can arise from relatively small structural discrepancies in large structures, in addition to challenges typically associated with experimental studies on very large and potentially flexible protein chains. As established above, correct prediction of protein topology is still an important challenge for modern computer algorithms. Thus, ultimate experimental verification or refutation will highlight the degree to which the AlphaFold AI system can grasp the intricacies of protein folding for highly complex cases. An extensive table of all knotted proteins in AlphaFold's databank, as determined by our algorithm, including all quantitative measures employed in our analysis and filtering, can be found in the SI. In the present work, the most interesting proteins that fulfill the conditions above are discussed in detail. In the SI, we also list proteins that fulfilled the computational criteria, but which were set aside as potentially unreliable after visual inspection. The per‐residue confidence scores of all proteins depicted in the figures are given in the SI; we observe that no segments that are essential to the knots possess particularly low confidence. Moreover, we want to acknowledge that we found the 6_3‐knot in von Willebrand factor A (identifiers O00534 and Q99KC8), which was also reported in Ref. 45, where AlphaFold predictions for the human proteome were studied, even though it does not satisfy the conditions stated above owing to its length. In the review stage of this manuscript, another paper was published by the same group, which describes a server to determine knots in predicted structures from AlphaFold.

AUTHOR CONTRIBUTIONS

Maarten Alexander Brems: Formal analysis (equal); investigation (supporting); software (equal); visualization (lead); writing – original draft (lead); writing – review and editing (equal). Robert Runkel: Formal analysis (equal); investigation (lead); software (equal); writing – review and editing (supporting). Todd Yeates: Conceptualization (supporting); methodology (supporting); project administration (supporting); software (equal); supervision (supporting); visualization (supporting); writing – original draft (supporting); writing – review and editing (supporting). Peter Virnau: Conceptualization (lead); funding acquisition (lead); methodology (lead); project administration (lead); resources (lead); supervision (lead); writing – review and editing (equal).

CONFLICT OF INTEREST

There is no conflict of interest to declare.

SUPPORTING INFORMATION

Table S1. We provide access to an extensive table of all knotted proteins in AlphaFold's databank, as determined by our algorithm, including all quantitative measures employed in our analysis and filtering.

Appendix S1. We discuss several knot‐related statistics of this list and compare them to the PDB. Moreover, we discuss further topologically interesting proteins, which match the criteria introduced above but for which visual inspection implies potential unreliability. Furthermore, we discuss the ERRAT accuracy test for the composite knot Q4D5S2, for which the program flags a topologically relevant beta strand segment as likely to be structurally incorrect, although the latter is assigned a high degree of confidence by AlphaFold. Finally, the supplementary material contains a depiction of the per‐residue confidence scores by AlphaFold for the proteins in Figures 1–4 as well as a discussion of the structure alignment and sequence identity of the 3_1#3_1‐methyltransferases.