From systems to structure - using genetic data to model protein structures.

Hannes Braberg^1,2, Ignacia Echeverria^1,2,3, Robyn M Kaake^1,2,4, Andrej Sali^2,3,5, Nevan J Krogan^6,7,8,9.

Abstract

Understanding the effects of genetic variation is a fundamental problem in biology that requires methods to analyse both physical and functional consequences of sequence changes at systems-wide and mechanistic scales. To achieve a systems view, protein interaction networks map which proteins physically interact, while genetic interaction networks inform on the phenotypic consequences of perturbing these protein interactions. Until recently, understanding the molecular mechanisms that underlie these interactions often required biophysical methods to determine the structures of the proteins involved. The past decade has seen the emergence of new approaches based on coevolution, deep mutational scanning and genome-scale genetic or chemical-genetic interaction mapping that enable modelling of the structures of individual proteins or protein complexes. Here, we review the emerging use of large-scale genetic datasets and deep learning approaches to model protein structures and their interactions, and discuss the integration of structural data from different sources.

Entities: Chemical

Substances: Proteins

Year: 2022 PMID: 35013567 PMCID: PMC8744059 DOI: 10.1038/s41576-021-00441-w

Source DB: PubMed Journal: Nat Rev Genet ISSN: 1471-0056 Impact factor: 59.581

Introduction

Deciphering the functional consequence of genetic variation within and across populations is a fundamental question of biology. To address this, a combination of techniques to interrogate changes on both systems-wide and mechanistic scales is required (Fig. 1). Systems-wide approaches provide a high-level view and generate networks that describe how different proteins or genes relate to each other or to environmental perturbations. Such networks have proved highly informative, enabling functional annotations of proteins and conveying information on the architectures of entire biological systems[1,2]. Protein–protein interaction (PPI) networks describe which proteins interact[3-5] (Fig. 1a). Experimental methods to determine PPIs include affinity purification–mass spectrometry (AP–MS)[6,7], yeast two-hybrid (Y2H) screening[8] and protein fractionation[9]. AP–MS and protein fractionation identify proteins that form complexes together in a cell type of interest, whereas Y2H uses a yeast reporter system to identify binary interactions. PPI networks describe proteins that are in physical contact but lack the resolution to discern mechanism, which often requires knowledge of the structures of the proteins and the complexes they form. Typically, high-resolution protein structures are determined using biophysical approaches, such as X-ray crystallography[10], cryogenic electron microscopy (cryo-EM)[11] and NMR spectroscopy[12] (Fig. 1b). These methods are key for elucidating protein mechanisms and designing drugs that bind to active sites or disrupt PPIs. However, traditional structural biology methods are often time-consuming and rely on purification of the relevant proteins, which is not always feasible. Furthermore, they take place in vitro, which can introduce artefacts and may not always reflect biologically relevant protein conformations.

Fig. 1

Readouts, scale and resolution.

A complete understanding of cellular processes requires measurements of physical and functional properties at a low-resolution, systems-wide scale and at high resolution of individual components. a | Protein–protein interaction networks describe which proteins bind to each other and are generated using methods such as affinity purification–mass spectrometry (AP–MS), protein fractionation and yeast two-hybrid screening. b | High-resolution structures of proteins and their complexes are determined using biophysical methods, such as X-ray crystallography, cryogenic electron microscopy (cryo-EM) and NMR spectroscopy, that typically take place in vitro. c | Functional interaction networks (left panel) describe how different genes or proteins or regions thereof affect the function of each other, or how they respond to drugs. Functional connections are determined using methods such as genetic or chemical–genetic interaction mapping. Improvements in these methods and the related field of coevolution have recently enabled the structures of proteins and their complexes to be determined (right panel).

PPI mapping and traditional structural biology are centred on proteins and their physical attributes. Genetic methods provide a functional context by means of measuring the phenotypic consequences of perturbing proteins or PPI networks. The characterization of genetic interactions[13], which describes how mutations in different genes affect one another, has proved a particularly useful complement to PPI networks. Systematic mapping of genetic interactions enables the generation of functional interaction networks, shedding light on the biological purpose of the PPIs[14,15] (Fig. 1c, left panel). Until recently, systematic genetic analyses were applied only at a whole-gene or protein level, relying on traditional structural biology for deciphering mechanistic actions. 
Over the past decade, developments in genetic interaction mapping and the related field of coevolution, which studies how protein residues evolve together, have allowed structural biology to be tackled on a genetic basis. By identifying pairs of residues that are related through genetic interactions or coevolution, these methods are providing high-resolution functional information sufficient to model the structures of proteins and their complexes (Fig. 1c, right panel). In this Review, we describe the fundamentals of coevolution and genetic interaction mapping, and outline how these methods have evolved over the past decades. We discuss how technical advances and the growth of protein sequence databases have enabled the application of these methods to inform structural modelling of proteins and protein complexes. We also describe chemical–genetic interaction mapping, which is closely related to genetic interaction mapping and has similarly been used for structural modelling. We list applications of these methods and discuss emerging approaches that will enable expansion into new systems. For brevity, we do not discuss traditional structural biology methods (reviewed in[16-19]).

Coevolution and deep learning approaches

The genetic material of all living organisms evolves over time. This evolution takes place in the form of alterations to the DNA sequence, often as single base substitutions. Coevolution analysis is based on the principle that amino acid residues in a protein, or in two interacting proteins, mutate and evolve together when they reside in the same functional region[20]. For example, in a single protein, spatially proximal amino acid residues that are essential to a specific function are likely to evolve together over time. Similarly, with two interacting proteins, if one protein evolves in the binding interface, the other protein can develop complementary changes in the interface to avoid disruption of the interaction site. This evolutionary phenomenon was observed more than three decades ago[20], and its application to predicting residue–residue contacts was made feasible a few years later with the growth of protein sequence databases and increases in computational power[21-25].

Modelling protein structures using coevolution

Accurate identification of residue–residue contacts is crucial for coevolution-based protein structure modelling. Residue–residue contacts are predicted by generating a multiple sequence alignment of a protein family and identifying correlations in amino acid changes for pairs of residue positions across the alignment. Early methods used local statistical models to determine covariation between residue pairs, relying on the assumption that each correlated residue pair is independent of all other pairs[21-23,26,27]. Thus, while computationally efficient, these approaches failed to accurately represent real proteins, in which each residue can interact with many others. As a result, the local approaches were not able to distinguish direct from indirect correlations between residue pairs. Direct correlations reflect true residue–residue contacts, whereas indirect correlations arise for pairs that coevolve without being in contact. Indirect correlations can arise, for example, between residues that are evolutionarily constrained through a network path of direct contacts[28]. Accurate structure prediction requires that only direct correlations be considered. Hence, the local statistical models were sufficient to predict contacts but lacked the resolving power necessary to model entire protein structures. During the past decade, local models have been replaced by global models, which recognize that correlated pairs are dependent on each other and furthermore incorporate the conservation of individual residues[29-33]. Global models enable the distinction of directly coupled residue pairs from those that should be excluded from the analysis because they are indirectly coupled. 
Crucially, these technical advancements have been accompanied by the rapid growth of protein sequence databases such as UniProt[34], increasing the coverage of sequence space across the members of protein families and making possible the systematic comparison of evolutionary changes at residue level in prokaryotes. Together, these developments paved the way for using coevolution to model the structures of monomeric proteins. The first successful determination of protein folds using coevolution was achieved by EVfold[35,36], followed by other methods, such as DCA-fold[37], FILM3 (ref.[38]) and GREMLIN[39] (Fig. 2a).
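
The local covariation statistics described above can be illustrated with a short sketch. The Python below scores every pair of alignment columns by mutual information, which is one common local measure and is an assumption here (the text does not name a specific statistic); the toy alignment and the function names are likewise hypothetical. A pairwise score of this kind cannot separate direct from indirect couplings, which is the limitation that global models address.

```python
import math
from collections import Counter
from itertools import combinations


def column(msa, i):
    """Return the residues found at alignment position i, one per sequence."""
    return [seq[i] for seq in msa]


def mutual_information(col_a, col_b):
    """Mutual information (in nats) between two alignment columns."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi


def rank_pairs(msa):
    """Score all column pairs and sort them from most to least covarying."""
    length = len(msa[0])
    scores = {
        (i, j): mutual_information(column(msa, i), column(msa, j))
        for i, j in combinations(range(length), 2)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy alignment: six sequences, four aligned positions.
toy_msa = ["AKLE", "AKLE", "GRLD", "GRLD", "AKVE", "GRVD"]

if __name__ == "__main__":
    for pair, score in rank_pairs(toy_msa):
        print(pair, round(score, 3))
```

In the toy alignment, positions 0, 1 and 3 change together and receive the highest scores, whereas position 2 varies independently of them. Real analyses additionally need sequence reweighting, gap handling and far deeper alignments, none of which is shown here.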

Fig. 2

Structural modelling of proteins and their complexes using coevolution.

Coevolution methods identify pairs of amino acid residues within or between proteins that have evolved together. Such pairs are often close in space and can be used to derive spatial restraints for structural modelling. a | To identify coevolving residue pairs in a protein, a multiple sequence alignment of its protein family is first generated. Pairs of sequence positions whose residue types change in a correlated fashion across the sequence alignment are coevolving and are likely to be close in space. Spatial restraints are generated based on predicted contacts and used for modelling the protein structures. b | Similar to part a, but coevolving residue pairs are here identified across the sequence alignments of an interacting pair of proteins. Here, the predicted residue contacts are thus between two different proteins, and the resulting restraints are used for modelling protein complexes instead of individual proteins. c | Random mutagenesis is carried out on an antibiotic resistance gene, and plasmids harbouring the gene variants are transformed into cells, followed by selection for functional copies of the gene. Surviving variants are again exposed to random mutagenesis and reintroduced into the assay. After a sufficient number of cycles, variants are deep sequenced to identify coevolving residue pairs and structural modelling is carried out as in part a. Filled circles represent sequence positions and the colours represent different residue types (grey denotes any residue type).

Modelling of protein complexes and prediction of PPIs using coevolution

The same coevolution principles used to determine residue–residue contacts within a protein can be used to determine residue–residue contacts between proteins. However, a key challenge lies in the identification of orthologues to generate the paired multiple sequence alignments required for quantifying coevolution among residues between two proteins. Only organisms that contain both interacting proteins can be used for the multiple sequence alignments, and the interacting pairs must be correctly paired in each species, which is particularly difficult if the proteins have paralogues that perform other cellular functions[32,40-43]. To enable prediction of PPIs and modelling of their interfaces (Fig. 2b), most studies have limited their scope to protein pairs that are likely to interact based on specific criteria. For example, several efforts have focused on protein pairs encoded close to each other in conserved genomic locations (for example, on the same operon)[40,41], or pairs of protein families with members known to interact[42,44]. Although these studies demonstrated that coevolution could in principle be used for the systematic identification of PPIs, the challenges of scaling to unbiased and proteome-wide predictions made this unfeasible in practice. Furthermore, coevolution methods are computationally costly, and applying them to identify PPIs requires the combinatorial pairing of all possible interaction partners. A recent effort tackled these challenges via a combination of techniques to systematically identify PPIs in Escherichia coli and Mycobacterium tuberculosis using coevolution[45]. Hundreds of previously uncharacterized PPIs were discovered by quantifying the coevolution of residue pairs across several millions of protein pairs in both organisms. 
The high computational requirements were managed via a multistep protocol incorporating a faster pre-screen using local models[26], followed by global models[32,39] and structural modelling to home in on the highest confidence interactors. This study showed that coevolution is highly effective for PPI prediction in binary complexes, but less so in higher-order complexes or those that contain nucleic acids[45].
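
The pairing requirement for inter-protein coevolution can be made concrete with a minimal sketch. The Python below builds a paired alignment by keeping only species that contribute exactly one sequence to each family and concatenating the two aligned sequences. Dropping every species with paralogues is a deliberately conservative simplification assumed for illustration, not the strategy of the cited studies; the function name and the toy data are hypothetical.

```python
from collections import defaultdict


def group_by_species(family):
    """Map each species to the list of aligned sequences it contributes."""
    grouped = defaultdict(list)
    for species, sequence in family:
        grouped[species].append(sequence)
    return grouped


def pair_alignments(family_a, family_b):
    """Concatenate aligned sequences of two families, species by species.

    Each family is a list of (species, aligned_sequence) tuples. A species is
    kept only if it has exactly one member in each family, so that the pairing
    is unambiguous (assumption: paralogue-containing species are discarded).
    """
    a = group_by_species(family_a)
    b = group_by_species(family_b)
    paired = []
    for species in sorted(a.keys() & b.keys()):
        if len(a[species]) == 1 and len(b[species]) == 1:
            paired.append((species, a[species][0] + b[species][0]))
    return paired


# Hypothetical toy input: sp3 has two paralogues in family A,
# sp2 lacks family B and sp4 lacks family A.
family_a = [("sp1", "MKV"), ("sp2", "MRV"), ("sp3", "MKI"), ("sp3", "MQI")]
family_b = [("sp1", "GDL"), ("sp3", "GEL"), ("sp4", "GDL")]

if __name__ == "__main__":
    print(pair_alignments(family_a, family_b))
```

Only sp1 survives in this example, which shows how quickly the requirements of co-occurrence and unambiguous pairing shrink the usable alignment depth.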

Experimental evolution

Coevolution has proved powerful for determining the structures of proteins and their complexes. However, the requirement of large protein families with sufficient diversity and the obfuscating effects of paralogues impose limitations on the applicability of the approach. An experimental method (3Dseq)[46] was recently developed that uses protein sequence variation generated in the laboratory to identify coevolving residues, and then applies computational coevolution methods for structure modelling. The approach relies on iterative generation of mutations in a given gene using error-prone PCR and exposure to a medium that selects functional variants of the gene (Fig. 2c). Selected populations are deep sequenced, and coevolving residue pairs are identified by comparison throughout the population, allowing inference of residue couplings and structural modelling using the same principles as for natural coevolution. The method was applied to two antibiotic resistance proteins from Pseudomonas (β-lactamase PSE1 and acetyltransferase AAC6) expressed in E. coli, with functional selection by ampicillin for PSE1 and kanamycin for AAC6, resulting in accurate high-resolution models of both structures[46]. As 3Dseq does not rely on natural variation, it is particularly well suited to proteins that lack the large number of family members required for natural coevolution modelling and should provide an avenue for tackling eukaryotic systems.

Deep learning-based approaches

In addition to experimental evolution, numerous computational developments have refined and extended the coevolution field. Improved statistical models[30,39,47] have increased accuracy and decreased the required number of aligned protein sequences. Incorporation of metagenome sequencing datasets has provided a means of increasing the sequence space accessed by multiple sequence alignments[48]. Finally, several new methods, such as RaptorX[49], ComplexContact[50] and DeepCov[51], use deep learning to extract and integrate additional protein sequence features with the coevolution data for contact prediction. Although these advances increased the accuracy of modelling and enabled systematic studies across prokaryotic proteomes, the technology has, in most cases, not been applied to eukaryotic proteins and complexes. Recent advances in deep learning have led to a revolutionary development in the form of the neural network-based AlphaFold[52], which enables regular prediction of protein structures at near experimental accuracy, in prokaryotes as well as eukaryotes. The AlphaFold (version 2) engine makes use of constraints on protein structure derived from evolution, physics and geometry. During training, AlphaFold parses experimental protein structures deposited in the protein databank (PDB)[53], as well as clustered protein sequence databases, such as BFD[52] and UniRef90 (ref.[54]), learning rules to govern the modelling of structure from sequence. The neural network takes as input a multiple sequence alignment of a given protein and its family members to extract evolutionary information for individual residues as well as on a pairwise basis. Incorporation with components learnt from the PDB enables the final structure prediction[52]. AlphaFold has proved remarkably effective for determining the structures of individual proteins and their complexes. 
The AlphaFold model, trained on single protein chains, was showcased on nearly the entire human proteome, resulting in confident structure predictions for 58% of all residues[55]. In comparison, experimental efforts over the past several decades have together resulted in structural coverage of 17% of human protein residues[55]. Similarly, a study across 11 different proteomes found that AlphaFold added structure determination for on average 25 percentage points of additional residues over existing experimental structures or those that could be derived by homology modelling[56]. Interestingly, despite being trained on single proteins, AlphaFold proved capable of modelling the structures of protein complexes[56-58]. Most recently, AlphaFold-Multimer has been released, featuring a model trained on multimeric protein structures, which clearly outperforms the standard AlphaFold for modelling protein complex structures[59]. Inspired by the performance of AlphaFold, the RoseTTAFold[60] software was developed using similar ideas. The accuracy of RoseTTAFold is generally somewhat lower than that of AlphaFold, but the predictions are faster and require less computational power[60]. RoseTTAFold provided early evidence that this technology can model protein complexes in addition to individual proteins[60]. Recently, the respective strengths of RoseTTAFold and AlphaFold were combined to not only model but also identify protein complexes[61]. The high speed of RoseTTAFold was leveraged to examine more than 4 million paired multiple sequence alignments to generate a set of approximately 5,500 potential PPIs in Saccharomyces cerevisiae (budding yeast). AlphaFold was then applied to this smaller set to identify higher-confidence candidate protein complexes and model their structures[61]. 
Importantly, like all technologies discussed in this Review, these methods rely on data generated from experimental approaches and should be viewed as powerful complements to these[62], rather than as replacements.

Genetic and chemical–genetic interactions

A complementary approach to coevolution and deep learning-based methods leverages the measurement of genetic interactions, providing a means for structural modelling using sets of intentionally designed mutations. For most organisms, such as Homo sapiens, budding yeast or E. coli, any given gene is typically directly functionally related to only a small number of other genes. Thus, when deleting or otherwise perturbing two different genes, the cellular response will most often reflect the combined effect of the two as independent contributions. Genetic interactions arise between genes for which the response deviates from this expectation, indicating that the genes are functionally related. Genetic interactions can be measured by multiple phenotypic readouts, but often centre around cell replication and survival as this can be informative for most systems, including unicellular organisms and human cancer cells. Positive genetic interactions arise when the cell is either no sicker (epistatic) or healthier (buffering) than the sickest single mutant. This may indicate factors that operate in the same pathway or are subunits of the same non-essential complex[63]. Conversely, negative genetic interactions (synthetic sick or lethal) occur when mutations in two genes lead to a more severe growth defect than expected. This may reflect factors that function in parallel pathways or are non-essential subunits of the same essential protein complex (Fig. 3a).
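
The notion of a deviation from the expected combined effect can be written down as a small worked example. The Python sketch below assumes a multiplicative null model, in which the expected double-mutant fitness is the product of the two single-mutant fitness values relative to wild type; this is one widely used convention and is an assumption here, as the text does not commit to a specific model. The threshold and the example fitness values are hypothetical.

```python
def interaction_score(w_a, w_b, w_ab):
    """Observed double-mutant fitness minus the multiplicative expectation.

    All fitness values are relative to wild type (wild type = 1.0).
    """
    return w_ab - w_a * w_b


def classify(score, threshold=0.08):
    """Label a score; the threshold is an arbitrary illustrative cut-off."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    w_a, w_b = 0.8, 0.7  # hypothetical single-mutant fitness values
    for w_ab in (0.56, 0.70, 0.10):
        score = interaction_score(w_a, w_b, w_ab)
        print(w_ab, round(score, 2), classify(score))
```

With single-mutant fitness values of 0.8 and 0.7, a double mutant at 0.56 matches the expectation, one at 0.70 is no sicker than the sickest single mutant and scores as positive, and one at 0.10 scores as negative (synthetic sick).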

Fig. 3

Mapping of genetic and chemical–genetic interactions.

Genetic and chemical–genetic interactions describe the functional relationships between pairs of mutations or between a mutation and a drug, respectively. a | A positive genetic interaction between two gene deletions may indicate that the gene products operate in the same pathway (G1–G2 or G3–G4), whereas a negative interaction can arise if the products of the deleted genes belong to parallel pathways (for example, G1–G3). b | Positive interactions between a drug (D) and a gene deletion can indicate an antagonistic relationship (for example, D–G1), whereas a negative interaction may indicate that the gene product belongs to a parallel pathway of the drug target (for example, D–G3). c | The epistatic miniarray profile (E-MAP) and synthetic genetic array (SGA) approaches allow for high-throughput measurements of genetic or chemical–genetic interactions between a set of test mutants (y-axis) and a genome-scale library (x-axis). Each row constitutes the genetic interaction profile for a test mutant (A–E), and clustering these by similarity (tree on right) provides a functional organization of the mutants. d | Deep mutational scanning (DMS) can be used to measure genetic interactions between all pairwise combinations of point mutations in a gene. For each pair of residue positions (left), all possible combinations of amino acids (aa) are measured (right), which can be used to generate a composite genetic interaction score for the position pair. Depictions in parts c,d are illustrative subsets of much larger interaction maps.

Chemical–genetic interactions, similar to genetic interactions, describe how the presence or absence of a drug or environmental perturbation affects the phenotype of a single genetic mutation. Here, a positive interaction reflects that drug treatment has a lesser effect on the mutant phenotype than expected, which could indicate that the drug inhibits pathways in which the mutated gene functions. 
By contrast, negative chemical–genetic interactions arise when the effect of a mutation in the presence of a drug is more severe than expected, potentially indicating that the drug inhibits a parallel pathway (Fig. 3b). Notably, the relationships that form the basis of genetic and chemical–genetic interactions are often more complex than the illustrative examples provided here.

Systematic analysis of genetic and chemical–genetic interactions

Early work on concepts that underlie genetic interactions focused on small numbers of genes that were already known to affect a given phenotype of interest[13]. In the early 2000s, the creation of gene deletion libraries in budding yeast and advances in high-throughput technologies paved the way for systematic mapping of genetic and chemical–genetic interactions[64]. A key development was introduced by synthetic genetic array (SGA), which enabled the rapid crossing of a set of test mutants across a deletion library in a plate-based format, providing an efficient means of identifying synthetic lethal interactions[15]. A different method, diploid-based synthetic lethal analysis with microarrays (dSLAM), relied on barcoded yeast mutants grown in a pooled competitive format, where microarrays were used to quantify the amounts of the different single and double mutants[65]. These methods were primarily developed to identify negative genetic interactions. The ability to capture positive genetic interactions was introduced by epistatic miniarray profile (E-MAP), which expanded on SGA to provide quantitative measurements of the entire spectrum of genetic interactions in a high-throughput format[66,67]. This approach enables the generation of a continuous genetic interaction profile for each test mutant, consisting of its scores across all deletion library mutants; these profiles can be used to group together proteins that are functionally related or belong to the same complex[14,67-70] (Fig. 3c). In parallel with these developments, related methods were designed for determining chemical–genetic interactions, following a similar format but using a library of chemical perturbations in place of the deletion library[71,72] (Fig. 3c). Chemical–genetic interaction mapping relies on methods similar to those of genetic interaction mapping but is considerably less complex, as it simply relies on the addition of drugs to the plates or pools of single mutants[65,71-74]. 
Systematic genetic and chemical–genetic interaction mapping (for example, chemical–genetic miniarray profile (CG-MAP)) have proved highly effective for organizing genes on the basis of function on both local and global levels[14,67-71,74-76]. The technologies have been adapted to different model systems, including Caenorhabditis elegans[77], E. coli[75,76], Schizosaccharomyces pombe[78] and Drosophila melanogaster cell lines[79]. More recently, advances in RNA interference (RNAi) and CRISPR–Cas9 (ref.[80]) genome editing have enabled expansion into mammalian cells[81-85].
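
The use of genetic interaction profiles to group functionally related mutants can be sketched in a few lines. The Python below compares profiles by Pearson correlation and ranks mutant pairs by similarity. The choice of Pearson correlation, the profile values and the function names are assumptions made for illustration; published pipelines apply their own normalization, scoring and clustering steps, which are not reproduced here.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def most_similar_pairs(profiles):
    """Rank all pairs of test mutants by the correlation of their profiles."""
    pairs = [
        ((a, b), pearson(profiles[a], profiles[b]))
        for a, b in combinations(sorted(profiles), 2)
    ]
    return sorted(pairs, key=lambda item: item[1], reverse=True)


# Hypothetical interaction scores of three test mutants against five
# library mutants (negative = synthetic sick, positive = epistatic or healthier).
profiles = {
    "A": [-2.1, 0.3, 1.8, -0.2, 0.1],
    "B": [-1.9, 0.1, 2.0, 0.0, -0.1],
    "C": [0.2, -1.7, 0.1, 1.5, 0.3],
}

if __name__ == "__main__":
    for pair, r in most_similar_pairs(profiles):
        print(pair, round(r, 2))
```

Mutants A and B have near-identical profiles and rank first, which is the kind of signal used to infer shared function or complex membership. A hierarchical clustering step could be built on the same similarity values.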

Genetic interactions of point mutants

Most genetic interaction maps have focused on whole-gene deletions or knockdowns. However, early studies in budding yeast investigated the genetic interaction profiles for limited numbers of point mutants. For example, alanine scan mutations of the actin gene ACT1 were screened for genetic interactions with more than 200 genes that had been shown to exhibit complex haploinsufficiencies in a strain hemizygous for ACT1 (ref.[86]). The screen revealed that alanine mutations in close proximity on the actin surface shared many interactions (that is, exhibited similar genetic interaction profiles), suggesting that they may be disrupting the same PPI binding interfaces[86]. Similarly, an early budding yeast E-MAP that focused on chromatin biology included three alleles of the POL30 gene[14], which encodes the multifunctional protein PCNA that functions in DNA replication and repair and in chromatin assembly. The pol30-79 point mutant allele gave rise to a genetic interaction profile similar to that of pol30-DAmP (a gene knockdown allele), suggesting a destabilizing effect on the protein. The genetic interaction profiles of these mutants were consistent with a defective DNA replication and repair system[14,63,87]. By contrast, the pol30-8 allele, which perturbs a different region of PCNA, exhibited genetic interactions relating to defects in chromatin assembly. Interestingly, this allele has been shown to diminish the PPI between PCNA and chromatin assembly factor 1 (CAF1)[88]. These results indicated that genetic interactions provide a high level of resolution and allow the dissection of multifunctional proteins into regions that are functionally and physically connected to other factors. Spurred by these findings, the E-MAP technology was extended to screen entire libraries of point mutations in a set of related proteins to generate point mutant E-MAPs (pE-MAPs)[89,90]. 
Quantitative SGA screens have also included large numbers of point mutations; however, these have generally been chosen on the basis of their phenotype as temperature-sensitive alleles of essential genes, rather than systematic mutations of a specific protein or complex[68,69]. Concurrently with pE-MAP, a complementary approach termed deep mutational scanning (DMS) was developed[91]. DMS set out to tackle the problem of identifying the most informative mutations to study in a protein, without the requirement of preselecting residues of interest. To this end, the method allows for a comprehensive screen of point mutations in a protein or protein domain. DMS relies on the rapid synthesis of large numbers of mutations in a gene, in conjunction with a genotype–phenotype coupled selection assay. In its most basic form, DMS quantifies the effects of individual point mutations on a specific function, via the chosen selection assay. However, it can also be applied to pairs of point mutations to quantify genetic interactions[91] (Fig. 3d). The development of pE-MAP and DMS enabled the systematic study of the relationship between genetic interactions and residue distances in a protein structure. The first pE-MAP covered 53 budding yeast point mutants in RNA polymerase II (RNAPII), crossed against a library of 1,200 deletion and knockdown mutants[89]. This study revealed that pairs of residues that exhibited similar genetic interaction profiles were typically close in space, whether they resided in the same or different RNAPII subunits[89,90]. Several early DMS studies revealed similar patterns for the pairwise genetic interactions between point mutants[92-94]. For example, a screen of double mutants of 75 residues in the RRM2 domain of the budding yeast PAB1 protein showed that both positive and negative genetic interactions were enriched at shorter distances between the mutated residues[92]. 
These findings were supported in a screen of genetic interactions for all pairs of mutations in 55 residues of the IgG binding domain of streptococcal protein G (GB1)[93]. In some proteins, such as those regulated by allostery, these trends can differ. For example, a recent pE-MAP screen of the molecular switch Gsp1/Ran revealed that the genetic interaction profiles of interface mutations reflected their biophysical effects on the switch cycle kinetics, instead of their interface locations[95]. These studies highlight how genetic interactions ultimately report on mechanism and showcase the complementarity of this technology to traditional structural biology approaches.

Modelling the structures of proteins and their complexes using genetic and chemical–genetic interactions

Similar to coevolution, genetic interaction data have been used for structural modelling of proteins and their complexes. The key challenge remains how to derive spatial restraints between pairs of residues that can be used for modelling. pE-MAP and DMS provide complementary strengths for this purpose. For example, DMS can provide comprehensive genetic interaction measurements of all possible residue–residue combinations in a protein. Indeed, these fine-grained data can be used to model the secondary structure and tertiary structure of small proteins or domains[96-98] (Fig. 4a,b). Two groups[96,97] examined genetic interaction data from DMS scans of GB1 (ref.[93]), the RRM2 domain of the budding yeast PAB1 protein[92], the human YAP65 WW domain[99] and the heterodimer FOS–JUN[100]. The authors set out to use the genetic interaction data from each of these studies to predict structural contacts between residue pairs in the respective protein domains and to test whether the contacts could be used for structure determination[96,97]. The GB1 dataset was the most comprehensive and covered nearly all possible mutation pairs across 55 residues, which allowed the determination of residue contacts and accurate modelling of both secondary and tertiary structure of the domain[96,97]. The RRM2 and WW domain datasets covered only a fraction of the possible double mutants and were sequenced less deeply. Although contact prediction was possible with these datasets, the secondary structure predictions were not accurate. The fold of a 22–24 residue section of the WW domain could be modelled; however, the RRM2 domain fold could not[96,97]. The data for the FOS–JUN dimer covered a stretch of 32 residues on each monomer and enabled contact predictions across the interface[96,97]. 
The predicted contacts were then incorporated into a protein docking of the two monomers as spatial restraints, greatly improving the accuracy of the models compared with docking without DMS-derived restraints[96]. Finally, one of the studies also predicted contacts in an RNA molecule[96,101], the twister ribozyme from Oryza sativa, suggesting that DMS could be used for RNA structure prediction. Interestingly, although the two studies[96,97] harnessed different ranges of the genetic interaction data and used different interaction metrics for computing contact predictions, they nonetheless arrived at similar results. This suggests that the approach is robust and highlights the massive information content of DMS data. Accordingly, both groups showed that sparser data subsets still allowed modelling of the GB1 structure at an accuracy similar to that achieved when using the complete dataset. These findings highlight the potential of DMS as a structural biology tool, and other studies have further applied it to successfully reveal structural features of intrinsically disordered proteins[102,103].
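
The step from DMS genetic interactions to predicted contacts can be illustrated with a minimal sketch. It assumes log-scale fitness values for single and double mutants, scores each double mutant by its deviation from the additive expectation, and ranks residue pairs by their mean absolute deviation. The function names, the toy data and the choice of metric are illustrative assumptions; the two studies discussed above each used their own interaction metrics and data ranges, which are not reproduced here.

```python
from collections import defaultdict


def epistasis_scores(single, double):
    """Mean absolute epistasis per residue pair.

    single: {(position, residue): log fitness of the single mutant}
    double: {((pos_i, res_i), (pos_j, res_j)): log fitness of the double mutant}
    """
    per_pair = defaultdict(list)
    for (mut_a, mut_b), observed in double.items():
        if mut_a not in single or mut_b not in single:
            continue  # no expectation without both single-mutant measurements
        expected = single[mut_a] + single[mut_b]  # additive in log space
        i, j = sorted((mut_a[0], mut_b[0]))
        per_pair[(i, j)].append(abs(observed - expected))
    return {pair: sum(vals) / len(vals) for pair, vals in per_pair.items()}


def predict_contacts(pair_scores, top_n, min_separation=3):
    """Return the top_n residue pairs, skipping pairs that are close in sequence."""
    ranked = sorted(
        ((score, pair) for pair, score in pair_scores.items()
         if pair[1] - pair[0] >= min_separation),
        reverse=True,
    )
    return [pair for _, pair in ranked[:top_n]]


# Toy data (hypothetical values, not taken from any published dataset).
single = {(5, "A"): -0.2, (30, "G"): -0.3, (12, "L"): -0.1, (6, "V"): -0.4}
double = {
    ((5, "A"), (30, "G")): 0.4,   # far better than expected: strong positive epistasis
    ((5, "A"), (12, "L")): -0.3,  # matches the additive expectation
    ((5, "A"), (6, "V")): 0.9,    # strong, but the residues are sequence neighbours
}

scores = epistasis_scores(single, double)
print(predict_contacts(scores, top_n=1))  # -> [(5, 30)]
```

The sequence-separation filter is a common convention in contact prediction, used here because neighbouring residues are trivially close in space. The ranked pairs would then serve as spatial restraints in a separate modelling or docking step.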

Fig. 4

Structural modelling of proteins and their complexes using genetic and chemical–genetic interactions.

a | Deep mutational scanning (DMS) relies on the rapid synthesis of mutated variants (blue, red or green) of a gene, which are cloned into vectors and introduced into an assay (here, cell-based) that competitively selects for variants with particular traits. The composition of variants is determined via deep sequencing before and after selection, allowing for identification of variants that are enriched or depleted by the selection. b | When using DMS to measure genetic interactions, each gene variant contains two point mutations (stars). The selection assay identifies mutant pairs that are enriched (positive genetic interaction) or depleted (negative genetic interaction) compared with an expectation from the quantities of each single mutant. Likely residue contacts are identified based on the genetic interactions and used for modelling the structure of the protein. c | The point mutant epistatic miniarray profile (pE-MAP) approach relies on in vivo screening of a set of point mutants in two or more interacting proteins against a large library of gene deletions and/or knockdowns (pE-MAP) or chemicals (chemical–genetic miniarray profile (CG-MAP)). The resulting genetic (or chemical–genetic) interaction profiles often consist of more than 1,000 genetic interactions for each point mutant. Pairwise comparison of the profiles provides measures of genetic similarity between all pairs of tested point mutants. High similarity between a pair of point mutants indicates a likely contact between the mutated residues. The structure of the protein complex is modelled using this relationship for pairs of residues that reside in different subunits of the complex.

Whereas DMS is well suited for modelling the structures of small proteins and domains, the pE-MAP approach is more appropriate for determining structures of protein assemblies. 
pE-MAP has lower coverage than DMS but enables comparison of genetic interactions across residues in any number of interacting proteins in a single screen, which facilitates the modelling of interactions. Additionally, pE-MAP provides systems-wide cellular information for every mutated residue via its genetic interaction profile with thousands of other mutants in different pathways and processes. A recent study harnessed these traits to use pE-MAP and chemical–genetic interaction data to determine the structures of protein complexes[104] (Fig. 4c). Using a technique termed integrative structure determination[105] (Box 1), the authors modelled the structures of three protein complexes: histones H3 and H4 in budding yeast; subunits Rpb1 and Rpb2 of RNAPII in budding yeast, and subunits RpoB and RpoC of bacterial RNA polymerase (RNAP) in E. coli. The histone pE-MAP included a comprehensive alanine scan as well as context-specific mutations, resulting in a map of 350 histone mutants crossed against 1,370 deletion or knockdown mutants[104]. Distance restraints between H3–H4 residue pairs were devised using the similarity of genetic interaction profiles between the corresponding mutations. These restraints were then applied to arrange the structures of the H3 and H4 subunits, capturing the interface of their interaction and obtaining an accurate structure of the H3–H4 complex. The RNAPII dataset provided an opportunity to test the performance of the approach on a system that differs vastly from that of the histones. Specifically, Rpb1 and Rpb2 are much larger than the histones (1,200–1,700 residues versus 100–140 residues) and the RNAPII pE-MAP is much sparser, with 53 point mutants crossed against 1,200 deletion or knockdown mutants[89]. In addition, the authors split Rpb1 into two domains for the structural modelling to test the applicability to a higher-order system. 
The model of this three-body complex proved accurate, suggesting that the approach is generalizable and can effectively harness the contents of sparse datasets. Extending the use of the approach to chemical–genetic interactions, the authors accurately modelled the RpoB–RpoC complex of bacterial RNAP using a CG-MAP of 44 point mutants subjected to 83 different environmental stresses[106]. This showed transferability of the approach to chemical–genetic interaction maps in spite of the reduced size of the interaction profiles in this dataset. Finally, in a comparison of integrative structure determination using cross-linking mass spectrometry (XL-MS) data and pE-MAP data, the authors found that the two performed similarly, but crucially led to higher accuracy models when combined[104]. Thus, a key value of the methods described in this Review is that their data types are typically orthogonal to those traditionally used in structural biology, allowing data integration that results in improved models[105] (Box 1).

Box 1

Integrative structure determination is a powerful tool to determine the structures of macromolecular assemblies[105,131] by providing a framework to combine information from varied experimental approaches, bioinformatics tools and prior knowledge. Integrative modelling aims to maximize the completeness, accuracy and precision of the resulting model by computing an ensemble of structural models that are consistent with all the input information. The integrative modelling approach has been successful in determining the architecture of large macromolecular assemblies[132,133], describing the structural heterogeneity of flexible protein complexes[134,135] and rationalizing the effect of pathogenic mutations[132,136]. The integrative modelling workflow iterates through the following four stages (see the figure).

Gathering information

A large variety of experimental and computational information can be used for integrative modelling including X-ray crystallography, NMR spectroscopy, electron microscopy, chemical cross-linking mass spectrometry, small-angle scattering and affinity purification–mass spectrometry. Evolutionary residue–residue couplings computed from natural variation[40,41,137] or from experimental evolution[46] can also be used for modelling and are often complementary to experimental methods. Recently it has also been demonstrated that genetic interactions measured using the point mutant epistatic miniarray (pE-MAP) platform[104] and deep mutational scanning[96,97,102,103,138] (DMS) can be used for integrative modelling of small proteins and protein complexes.

Representing the system and translating information into spatial restraints

A structural model of a macromolecular assembly is defined by the conformations and relative positions and orientations of its components (for example, atoms, residues, domains and subunits). Thus, the representation is defined by all the structural variables that need to be determined on the basis of input information. This includes, for example, the components of the system (including the copy number), the coordinates of the components and whether multiple states need to be modelled. The scoring function consists of a series of terms that encode the spatial restraints that quantify the degree of a match between the structural models and the input information. For example, pE-MAP data were converted into a Bayesian data likelihood that provides an upper bound on the distance spanned by the mutated residues and objectively interprets the noise in the experimental data[104]. Similarly, data from DMS experiments and coevolution analysis are converted into upper-bound or harmonic distance restraints between the residues[40,41,96,97,102,139]. The scoring function also accounts for the physicochemical properties of proteins via terms such as excluded volume and sequence connectivity[140].

Structural sampling

Structural models are computed by sampling the conformations and/or the configuration of the components; this is often achieved by using Monte Carlo-based methods for stochastic sampling. The result is an ensemble (that is, the model) of predicted structures that agree with the input information within acceptable tolerances.

Validating the model

Validation of the model is essential to quantify its uncertainty and to assess the degree of consistency between the model and the information used and not used to compute it[141,142]. To this end, the validation protocol includes five steps whose outputs are an estimate of the model precision (quantified by the variability between the models in the ensemble), one or more representative structures and their uncertainties, and mapping of the known information into the structures in the model.

This protocol (that is, stages 2, 3 and 4) can be scripted using the open-source Integrative Modelling Platform (IMP) package[143].

Figure adapted with permission from ref.[104], AAAS.
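
The restraint and sampling stages described above can be made concrete with a toy sketch. It assumes that pairs of point mutants with similar interaction profiles (Pearson correlation above a threshold) receive a simple upper-bound distance restraint, and that one rigid subunit is translated by Metropolis Monte Carlo moves until the restraints are satisfied. This is a deliberate simplification: the published pE-MAP work used a Bayesian data likelihood rather than a hard threshold, the IMP package is not used here, and all names, thresholds and coordinates below are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equally long interaction profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def restraints_from_profiles(profiles_a, profiles_b, min_similarity=0.5, upper_bound=10.0):
    """Turn similar profile pairs into (residue_a, residue_b, max_distance) restraints."""
    return [
        (res_a, res_b, upper_bound)
        for res_a, prof_a in profiles_a.items()
        for res_b, prof_b in profiles_b.items()
        if pearson(prof_a, prof_b) >= min_similarity
    ]


def score(coords_a, coords_b, shift, restraints):
    """Sum of squared violations of the upper-bound distance restraints."""
    total = 0.0
    for res_a, res_b, bound in restraints:
        moved = [c + s for c, s in zip(coords_b[res_b], shift)]
        total += max(0.0, math.dist(coords_a[res_a], moved) - bound) ** 2
    return total


def monte_carlo(coords_a, coords_b, restraints, steps=5000, step_size=1.0,
                temperature=1.0, seed=0):
    """Metropolis sampling of a rigid translation of subunit B (rotations omitted)."""
    rng = random.Random(seed)
    shift = [0.0, 0.0, 0.0]
    current = score(coords_a, coords_b, shift, restraints)
    best_shift, best = list(shift), current
    for _ in range(steps):
        trial = [s + rng.uniform(-step_size, step_size) for s in shift]
        trial_score = score(coords_a, coords_b, trial, restraints)
        delta = trial_score - current
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            shift, current = trial, trial_score
            if current < best:
                best_shift, best = list(shift), current
    return best_shift, best


# Toy inputs (hypothetical residues, profiles and coordinates in ångströms).
profiles_a = {10: [1.0, -2.0, 0.5, 3.0]}
profiles_b = {7: [0.9, -1.8, 0.7, 2.5], 8: [-1.0, 2.0, 0.0, -3.0]}
coords_a = {10: (0.0, 0.0, 0.0)}
coords_b = {7: (50.0, 0.0, 0.0), 8: (55.0, 0.0, 0.0)}

restraints = restraints_from_profiles(profiles_a, profiles_b)
best_shift, best_score = monte_carlo(coords_a, coords_b, restraints)
print(restraints, best_score)
```

For brevity the sketch returns only the best-scoring placement. A real integrative modelling run would also sample rotations, add terms such as excluded volume and sequence connectivity, and retain an ensemble of well-scoring models for validation.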

Emerging approaches

A key promise and challenge for the methods discussed in this Review is the expansion into new systems, scales and organisms. The continued success of this field will rely on the effective integration of complementary data types to best make use of available methods (Fig. 1). In particular, the integration of experimental data with those from computational coevolution and deep learning models should prove valuable. Such efforts will likely benefit from a fine-grained interpretation of the scale and resolution represented by each data type. For example, it has been shown that residue–residue contacts derived from coevolution are more accurate when compared with experimentally determined side chain contacts than with more commonly used backbone contacts[107]. This finding suggests that the dominant effect observed in coevolution reflects side chain interactions, and could be harnessed to generate more precise models when computationally feasible. To better complement computational methods, there is a need to increase the speed and coverage of experimental genetic approaches. Advances in CRISPR–Cas9 genome editing (Box 2) are setting the stage for such developments. For example, chemical–genetic interaction mapping is primed for modelling PPIs on a proteome-wide scale in yeast, using a recent method to efficiently generate point mutations while surveying their drug sensitivities in a multiplexed fashion[108] (Box 2). Guided by global PPI maps[109], and using individual protein structures from traditional structural biology methods or AlphaFold/RoseTTAFold, this system should in principle enable the modelling of interaction interface structures across the yeast proteome. In addition to facilitating increased scale, CRISPR–Cas9 genome editing can be used for the systematic generation of point mutations in mammalian cells[110-114]. 
At present, these approaches are not suitable for mammalian pE-MAP screening, owing to incomplete editing, off-target effects or other technical obstacles (Box 2). However, these limitations are steadily diminishing[110], setting the stage for genetics-based structural modelling of protein complexes in human cells and providing a means of characterizing the effects of disease-causing mutations. By integration with recent efforts to generate multi-scale models of entire cells[115-119], genetic interaction mapping could thus inform on global function as well as the structures of protein complexes. One of the most crucial, and currently tractable, applications to human systems relates to the rapidly growing field of host–pathogen interaction mapping[120-124]. This area of research is centred on the systematic identification of PPIs between pathogen and host proteins and the generation of interaction networks between the two organisms (Fig. 5a). These networks have proved highly effective for interrogating the mechanisms of infection, revealing important aspects of pathogen life cycles, host factor functions and host–pathogen interplay, as well as providing potential targets for drug discovery[120-124]. Host–pathogen PPI networks could be used as a blueprint for genetic interaction mapping between pathogen point mutants and human gene knockouts or knockdowns. To generate these maps, human cells would be infected by virus harbouring the relevant point mutations, and the human proteins from the PPI maps would be knocked down or knocked out (Fig. 5b), allowing for the construction of a host–pathogen genetic interaction map (Fig. 5c). The genetic interaction profiles of the viral point mutants would then be converted into spatial restraints for structural modelling of viral protein complexes (Fig. 5d), which would ultimately be re-integrated into the PPI map. The platforms required for such efforts have recently been developed. 
For example, a technology for generating viral E-MAPs (vE-MAPs), using infectivity as readout, was recently applied to HIV infection in human cells[125]. In an analogous fashion, DMS could be used for modelling individual viral proteins, by employing suitable selection assays[126]. For example, a DMS platform was developed to structurally map mutations in the SARS-CoV-2 Spike receptor-binding domain that alter ACE2 binding or escape antibody recognition[127,128]. Many pathogens adapt rapidly to circumvent immune and drug responses[128-130]. Genetic interaction-driven modelling of pathogen protein structures will provide an avenue to identify the mechanisms of these changes, laying the groundwork for therapeutic intervention.

Fig. 5

Structural characterization of host–pathogen interaction networks.

a | A host–pathogen protein–protein interaction (PPI) network generated using affinity purification–mass spectrometry. The edges denote PPIs between pairs of proteins. b | To generate a host–pathogen point mutant epistatic miniarray profile (pE-MAP), host cells are infected with point mutant virus strains, in combination with CRISPR–Cas9 knockout (KO) or knockdown (KD) of the host genes identified in the host–pathogen PPI network (part a). c | The resulting pE-MAP comprises genetic interaction profiles for the viral point mutants, containing their genetic interactions with the library of host gene KOs and KDs. d | Viral genetic interaction profiles are compared across the subunits of viral protein complexes and the similarities are used for modelling their structures, which can then be integrated into the original network.

Box 2

The CRISPR–Cas9 system sets up for genome editing by introducing a double-stranded break (DSB) in DNA (see the figure, panel a)[80]. The Cas9 enzyme is directed to the target DNA site by a single guide RNA (sgRNA), which contains the target sequence. Cas9 cuts the DNA at the target site, and the break is typically repaired via non-homologous end joining (NHEJ), resulting in insertions and deletions (indels) that lead to inactivation of the target gene. Alternatively, the DSB can be repaired via homology-directed repair (HDR), resulting in a specific edit based on the template of a stretch of donor DNA. However, HDR in mammalian cells is inefficient, and the natural preference of the cell for NHEJ would lead to the introduction of unwanted indels even in the presence of donor DNA.

Base editors offer a more fine-tuned alternative, by relying on catalytically impaired versions of Cas9 that do not introduce DSBs. Most base editors consist of a DNA deaminase enzyme fused to either nickase Cas9 (nCas9), which cuts a single strand of double-stranded DNA, or to catalytically dead Cas9 (dCas9). Base editors convert specific base pairs (as directed by the sgRNA) into different base pairs (see the figure, panel b). Base editing circumvents the need for donor DNA and avoids unintentional indels at target or off-target sites. However, the technique does not support all 12 possible DNA base-to-base conversions and suffers from other limitations, including unwanted bystander or off-target edits and sequence-specific requirements to allow for editing (for example, proximity of a protospacer adjacent motif (PAM) site)[112].

A recent development, termed prime editing, provides a flexible platform for DNA editing, allowing for all base-to-base conversions, insertions or deletions, without the need of a DSB or donor DNA, and with lower off-target activity than Cas9 (see the figure, panel c)[110]. The prime editor consists of nCas9 fused to a reverse transcriptase, which is guided to its target by a prime editing guide RNA (pegRNA). In addition to the target sequence, the pegRNA contains a reverse transcriptase template (RT template) for the desired edit, preceded by a primer-binding site. The primer-binding site hybridizes to the nicked target DNA, and the RT template dictates the sequence of the new edited DNA. Prime editing and base editing methods could both potentially be used for genetic interaction mapping in mammalian cells, but the editing efficiency is not yet high enough for robust application[112].

In budding yeast, which is more tractable for genome editing than mammalian cells, a CRISPR–Cas9-based method was recently developed for multiplexed genome editing in a pooled fashion, allowing for the rapid measurement of point mutant chemical–genetic interactions (see the figure, panel d)[108]. Here, guide–donor plasmids are first generated, which contain the desired sequence of donor DNA, combined with a barcode and guide sequences to direct the edit and barcode integration. The plasmids are transformed into Cas9-expressing yeast cells, resulting in genomically edited cells with the corresponding barcode integrated. Cells are grown in a pooled format and exposed to a large number of different conditions. Barcodes are counted via sequencing, and chemical–genetic interactions are quantified based on enrichment or depletion of each mutant in treated versus untreated conditions. This method would allow for proteome-wide measurement of chemical–genetic interactions for protein complex subunits, thereby providing the data required for global structural modelling of the budding yeast protein interactome.
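
The barcode-counting step of the pooled yeast screen can be sketched as follows. The example assumes sequencing counts per barcode in a treated and an untreated pool, and scores each mutant as the log2 change in its abundance relative to a reference strain. The reference barcode, the pseudocount and all counts are illustrative assumptions and do not reproduce the scoring scheme of the cited method.

```python
import math


def chemical_genetic_scores(untreated, treated, reference, pseudocount=0.5):
    """log2 enrichment of each barcode in treated vs untreated pools.

    Counts are normalized to a reference barcode (for example, a wild-type
    strain) so that differences in sequencing depth cancel out. The
    pseudocount avoids taking the logarithm of zero for barcodes that drop out.
    """
    ref_u = untreated.get(reference, 0) + pseudocount
    ref_t = treated.get(reference, 0) + pseudocount
    scores = {}
    for barcode in sorted(set(untreated) | set(treated)):
        u = untreated.get(barcode, 0) + pseudocount
        t = treated.get(barcode, 0) + pseudocount
        scores[barcode] = math.log2((t / ref_t) / (u / ref_u))
    return scores


# Toy barcode counts (hypothetical strains and numbers).
untreated = {"wt": 100, "mutA": 100, "mutC": 100}
treated = {"wt": 100, "mutA": 200, "mutC": 50}

for barcode, value in chemical_genetic_scores(untreated, treated, reference="wt").items():
    print(f"{barcode}\t{value:+.2f}")
```

Positive scores indicate mutants enriched under treatment and negative scores indicate sensitized mutants. Repeating the calculation across many conditions yields one chemical–genetic interaction profile per point mutant, which is the input that profile-similarity-based structural modelling requires.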

Conclusions

Structural modelling of proteins and protein complexes using genetically derived restraints lies at the intersection of network biology and structural biology. Until recently, these major areas of research were disparate and had little overlap. Network biology provided a large-scale systems view of interactions within and between cellular processes, whereas structural biology supplied structures of individual proteins and complexes, typically derived in vitro. Genetics-based structural modelling uses spatial restraints derived from functional data, such as coevolution or genetic interactions, to compute structural models. The methods are efficient and low cost, and enable structural characterization of protein interaction interfaces, with a potential to cover entire protein–protein interactomes, including those of host–pathogen systems. These techniques are not meant to replace traditional structural biology methods, which remain the gold standard in terms of resolution. Instead, the orthogonal datasets produced by genetics-based modelling are primed to complement traditional structural biology methods to provide a more accurate and complete description of the structures of proteins in vivo.

138 in total

1. Genetic interaction mapping in mammalian cells using CRISPR interference.

Authors: Dan Du; Assen Roguev; David E Gordon; Meng Chen; Si-Han Chen; Michael Shales; John Paul Shen; Trey Ideker; Prashant Mali; Lei S Qi; Nevan J Krogan
Journal: Nat Methods Date: 2017-05-08 Impact factor: 28.547

2. Structural dynamics of the human COP9 signalosome revealed by cross-linking mass spectrometry and integrative modeling.

Authors: Craig Gutierrez; Ilan E Chemmama; Haibin Mao; Clinton Yu; Ignacia Echeverria; Sarah A Block; Scott D Rychnovsky; Ning Zheng; Andrej Sali; Lan Huang
Journal: Proc Natl Acad Sci U S A Date: 2020-02-07 Impact factor: 11.205

3. Compensating changes in protein multiple sequence alignments.

Authors: W R Taylor; K Hatrick
Journal: Protein Eng Date: 1994-03

4. A census of human soluble protein complexes.

Authors: Pierre C Havugimana; G Traver Hart; Tamás Nepusz; Haixuan Yang; Andrei L Turinsky; Zhihua Li; Peggy I Wang; Daniel R Boutz; Vincent Fong; Sadhna Phanse; Mohan Babu; Stephanie A Craig; Pingzhao Hu; Cuihong Wan; James Vlasblom; Vaqaar-un-Nisa Dar; Alexandr Bezginov; Gregory W Clark; Gabriel C Wu; Shoshana J Wodak; Elisabeth R M Tillier; Alberto Paccanaro; Edward M Marcotte; Andrew Emili
Journal: Cell Date: 2012-08-31 Impact factor: 41.582

5. A protein network map of head and neck cancer reveals PIK3CA mutant drug sensitivity.

Authors: Danielle L Swaney; Dana J Ramms; Zhiyong Wang; Jisoo Park; Yusuke Goto; Margaret Soucheray; Neil Bhola; Kyumin Kim; Fan Zheng; Yan Zeng; Michael McGregor; Kari A Herrington; Rachel O'Keefe; Nan Jin; Nathan K VanLandingham; Helene Foussard; John Von Dollen; Mehdi Bouhaddou; David Jimenez-Morales; Kirsten Obernier; Jason F Kreisberg; Minkyu Kim; Daniel E Johnson; Natalia Jura; Jennifer R Grandis; J Silvio Gutkind; Trey Ideker; Nevan J Krogan
Journal: Science Date: 2021-10-01 Impact factor: 63.714

6. Diffusion, crowding & protein stability in a dynamic molecular model of the bacterial cytoplasm.

Authors: Sean R McGuffee; Adrian H Elcock
Journal: PLoS Comput Biol Date: 2010-03-05 Impact factor: 4.475

7. The genetic landscape of a cell.

Authors: Michael Costanzo; Anastasia Baryshnikova; Jeremy Bellay; Yungil Kim; Eric D Spear; Carolyn S Sevier; Huiming Ding; Judice L Y Koh; Kiana Toufighi; Sara Mostafavi; Jeany Prinz; Robert P St Onge; Benjamin VanderSluis; Taras Makhnevych; Franco J Vizeacoumar; Solmaz Alizadeh; Sondra Bahr; Renee L Brost; Yiqun Chen; Murat Cokol; Raamesh Deshpande; Zhijian Li; Zhen-Yuan Lin; Wendy Liang; Michaela Marback; Jadine Paw; Bryan-Joseph San Luis; Ermira Shuteriqi; Amy Hin Yan Tong; Nydia van Dyk; Iain M Wallace; Joseph A Whitney; Matthew T Weirauch; Guoqing Zhong; Hongwei Zhu; Walid A Houry; Michael Brudno; Sasan Ragibizadeh; Balázs Papp; Csaba Pál; Frederick P Roth; Guri Giaever; Corey Nislow; Olga G Troyanskaya; Howard Bussey; Gary D Bader; Anne-Claude Gingras; Quaid D Morris; Philip M Kim; Chris A Kaiser; Chad L Myers; Brenda J Andrews; Charles Boone
Journal: Science Date: 2010-01-22 Impact factor: 47.728

8. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches.

Authors: Baris E Suzek; Yuqi Wang; Hongzhan Huang; Peter B McGarvey; Cathy H Wu
Journal: Bioinformatics Date: 2014-11-13 Impact factor: 6.937

9. Disentangling direct from indirect co-evolution of residues in protein alignments.

Authors: Lukas Burger; Erik van Nimwegen
Journal: PLoS Comput Biol Date: 2010-01-01 Impact factor: 4.475

10. Complete Mapping of Mutations to the SARS-CoV-2 Spike Receptor-Binding Domain that Escape Antibody Recognition.

Authors: Allison J Greaney; Tyler N Starr; Pavlo Gilchuk; Seth J Zost; Elad Binshtein; Andrea N Loes; Sarah K Hilton; John Huddleston; Rachel Eguia; Katharine H D Crawford; Adam S Dingens; Rachel S Nargi; Rachel E Sutton; Naveenchandra Suryadevara; Paul W Rothlauf; Zhuoming Liu; Sean P J Whelan; Robert H Carnahan; James E Crowe; Jesse D Bloom
Journal: Cell Host Microbe Date: 2020-11-19 Impact factor: 31.316
