
Completeness and Consistency in Structural Domain Classifications.

R Dustin Schaeffer¹, Lisa N Kinch², Jimin Pei², Kirill E Medvedev¹, Nick V Grishin^1,2.

Abstract

Domain classifications are a useful resource for computational analysis of protein structure, but elements of their composition are often opaque to potential users. We perform a comparative analysis of our classification ECOD against the SCOPe, SCOP2, and CATH domain classifications with respect to their constituent domain boundaries and hierarchical organization. The coverage of these domain classifications with respect to ECOD and to the PDB was assessed by structure and by sequence. We also conducted domain pair analysis to determine broad differences in hierarchy between domains shared by ECOD and other classifications. Finally, we present domains from the major facilitator superfamily (MFS) of transporter proteins and provide evidence that supports both their split into domains and the presence of multiple conformations within these families. We find that ECOD and CATH provide the most extensive structural coverage of the PDB. ECOD and SCOPe have the most consistent domain boundary conditions, whereas CATH and SCOP2 both differ significantly.


Year: 2021 PMID: 34179613 PMCID: PMC8223206 DOI: 10.1021/acsomega.1c00950

Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343

Introduction

Proteins and protein complexes contain domains, evolutionarily distinct subunits which confer function either solely or in concert with other domains.[1−3] As such, domains represent the building blocks of proteins that guide their evolution. Both sequence and structure similarities between domains are useful in determining their evolutionary relationships. Structural similarity can be used to infer homology over a greater evolutionary distance than the sequence. However, the difficulty in obtaining protein structures for known sequences led to discrete types of classifications, those that focused principally on the sequence (and the utility of deep multiple sequence alignments for homology detection)[4−7] and those that classify structures (and the utility of structural similarity for the detection of distant homology).[8−12] Recent developments in protein structure prediction, exemplified by the recent CASP14 results and the performance of the DeepMind predictor: AlphaFold2, signal an incoming change for sequence- and structure-based domain classifications alike.[13] When near-native predictions of globular protein domains are readily (if not easily) available, both sequence and structural classifications may adapt to incorporate predicted structures of sequence families lacking experimental structures. If this were to come to pass, the distinction between structure and sequence classifications would become purely historical. 
Due to the nature of experimental methods used to solve protein structures, well-behaved proteins and fragments corresponding to their protein domains that easily form crystals (X-ray) or are relatively small (NMR) have dominated protein structure databases since their inception.[14] Such methodology tends to exclude large, multidomain proteins and disordered or flexible regions of proteins that contribute to both their function and their evolution.[15,16] With recent improvements in cryo-EM techniques to determine protein structure, the growth of the field is revolutionizing structural biology.[17,18] This technique is producing an ever-increasing number of larger and more complete protein structures that are not limited by their ability to crystallize. Categories of proteins such as those that span the membrane and exist as dynamic macromolecular complexes or fibrous assemblies are increasingly dominating newly released structures. Thus, a more complete picture of structure space is emerging that includes nondomain sequences not easily classified in the ECOD hierarchy. Proteins function as dynamic entities that can adopt multiple functional conformations.[19,20] Many of these conformations have been captured in static forms in the existing experimental structures, and they often involve flexible interactions between domains.[21,22] Theoretically, the increasing availability of large multidomain structure examples should also expand examples of alternate conformations between domains. With its classification of sequence-related protein domains that function similarly, the family level of ECOD classification is poised to provide large-scale examples of protein conformation change. The view that proteins exist as an ensemble of multiple substructures whose dynamic behavior contributes to their evolvability can begin to be addressed with the sequence to structure ensemble relationships in ECOD families. 
Domains are defined at an intersection of multiple competing concepts. Sequence continuity, structural compactness, and functional considerations can each be taken into account to different degrees by separate classifications. These alternate definitions lead to observable differences between similar types of classifications. Examining inconsistencies between proteins classified in multiple classifications can provide insights toward improving classification and provide a foundation for functional analysis. Additionally, examining classifications for consistency can also lead to biological insights. We recently published an account of our analysis of domains containing a minimal Rossmann-like motif (RLM). Using structural motif analysis, we reorganized the two largest X-groups containing Rossmann domains in ECOD: “Rossmann-related” (ECOD id: 2003) and the former X-group “other Rossmann-like domains” (ECOD id: 2111). Manual justification, based on structure and sequence similarity, did not reaffirm uniting more than 100 of the homology groups within the previous “other Rossmann-like domains” X-group. Its reclassification provided 102 new X-groups, which constitute about 4% of the total number of X-groups in ECOD v275. Moreover, 44 H-groups were moved to or merged with other H-groups (e.g., “Rossmann-related” and “Flavodoxin-like”), representing 22% of the initial number of RLM-containing H-groups in previous versions of ECOD.[23] Overall, this resulted in the unification of the two largest X-groups containing Rossmann motifs in ECOD and the generation of numerous novel X-groups where, despite containing a common RLM, the homology with other X-groups could not be justified. 
Our structural domain classification, evolutionary classification of protein domains (ECOD), is nearing a decade of active development and 7 years of public release.[12] The details of the ECOD classification have been published elsewhere; briefly, the pilot version of ECOD was developed from a previous version of SCOP (v1.75) by sorting existing SCOP domains into a novel hierarchy that de-emphasized fold similarity and emphasized distant homology. ECOD, like other extant structural domain classifications, relies on a combination of manual curation to define new domains and groups of domains and automated methods to assign newly observed proteins to the existing groups. A core challenge of maintaining ECOD is keeping pace with the ever-growing rate of structure release and the complexity of these structures. Consequently, all structural domain classifications are incomplete. They only partially cover the known set of experimentally determined protein structures. Fortunately, this set of protein structures is highly redundant, both because of the concentration of investigator interest and because of the biophysical properties of protein structures. Here, we assess the completeness of ECOD and other structural domain classifications (CATH,[24] SCOPe,[10] and SCOP2[9]) with respect to the deposited set of structures in the PDB and a curated set of reference proteins in the UniProt sequence database. We find generally that recently deposited structures, especially those of viral and other fast-evolving proteins, are not categorized in structural domain classifications, and we offer some potential strategies to accelerate their classification in the future. We also present an analysis of major facilitator transporter proteins that justifies the split of their duplication into separate domains, and we investigate the separation of structures of these proteins into open and closed conformations. 
Together, we think that this provides a view of where domain classifications stand now and a vision of what they may need to encompass in the future.

Results and Discussion

Completeness and Coverage of Major Structural Domain Classifications

Domain classifications of protein structures are necessarily incomplete. Insofar as structures are still being determined whose homology is indeterminate, we suspect that additional structures are required to achieve a covering set of domains. Even so, it is likely that we have achieved a largely covering set and that we will only asymptotically approach completion with the release of more structures.[25] We compared the coverage of current versions of ECOD, CATH, SCOPe, and SCOP2 with a recent release of the PDB in order to compare their coverage of the known protein structural space. ECOD, CATH, and SCOPe pursue similar classification strategies wherein a smaller number of structures are manually curated and used to seed further classification by automated alignment methods. Additionally, both SCOPe and CATH release stable versions along with more frequent periodic versions, which are subject to subsequent error checking before being incorporated into a stable release. We compared the total deposited domains from each classification against a version of the PDB current to November 2020 (see details in Methods). Depositions consisting solely of nucleic acids were not considered. Both ECOD and SCOPe used an earlier version of SCOP (v1.75) in their derivation and are expected to share some domains due to that shared ancestry. A structure deposition was counted as observed in a classification if at least one of its domains is classified there. Ordering depositions by release date, we derived for each classification a running cumulative total of observed structures and compared it to the cumulative total of all PDB releases. We observe that ECOD and CATH classify the most PDB structures and UniProt sequences. SCOP2, although classifying fewer structures, classifies roughly the same number of sequences as SCOPe over time (Figure 1).
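The running totals described above can be sketched in a few lines. This is a minimal illustration and not the authors' pipeline: the function name `cumulative_coverage`, the PDB-style identifiers, and the release dates are all hypothetical, and a structure is assumed to count as classified when at least one of its domains appears in the classification.

```python
from datetime import date

def cumulative_coverage(release_dates, classified_ids):
    """Return (release_date, total_released, total_classified) rows in date order.

    release_dates: dict mapping a structure id to its release date.
    classified_ids: set of structure ids with at least one classified domain.
    """
    ordered = sorted(release_dates.items(), key=lambda item: item[1])
    rows = []
    total = covered = 0
    for structure_id, released in ordered:
        total += 1
        if structure_id in classified_ids:
            covered += 1
        rows.append((released, total, covered))
    return rows

# Toy data: identifiers and dates are made up for illustration only.
releases = {
    "s1": date(2001, 5, 1),
    "s2": date(2010, 3, 9),
    "s3": date(2019, 11, 20),
    "s4": date(2020, 10, 7),
}
classified = {"s1", "s2", "s4"}

rows = cumulative_coverage(releases, classified)
for released, total, covered in rows:
    print(released.isoformat(), total, covered)
```

Plotting the second and third columns against the release date would give one curve for the PDB and one per classification, in the spirit of the left panel of Figure 1.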

Figure 1

Completeness of domain classifications with respect to the sequence and structure. (left) Cumulative sum of structures released prior to a release date with at least one domain in ECOD (green), CATH (blue), SCOPe (cyan), and SCOP2 (green). The overall sum of PDB release dates (red) is provided for comparison. (right) Comparison of PDB chains with known UniProt annotation by SIFTS with at least one domain defined in each domain classification. Where multiple chains with the same UniProt accession are available, the structure with the earliest release date is used.

Comparison of ECOD Domains to Other Major Structural Domain Classifications

We also compared the consistency of ECOD domain classifications with respect to other major domain classifications. Domain classifications differ both in how they divide proteins into domains and in how these domains are organized into hierarchies. We evaluated ECOD with respect to each of the major domain classifications discussed above in terms of domain partition. Where two classifications’ definitions of a domain overlap significantly, we call these domains “shared” between the classifications. Where two domains are not shared but are defined from the same peptide chain, we label these domains as from a shared chain. Domains that are from chains unique to a classification are called “unique domains.” This terminology allows us to distinguish domains that differ in partition from those that differ due to the protein content of individual classifications. We compared the domain partition of ECOD v278 to that of SCOPe v2.07-2020-07-16 by examining the bidirectional coverage of possible shared domains. This analysis reveals the extent to which classifications divide proteins into similar domains. The domain coverage of ECOD by SCOPe and SCOPe by ECOD is shown in Figure 2 (left and right, respectively) as the population of matched, unmatched, and unique (see the Methods Section) domains for each of the 15 most populated ECOD H-groups and SCOPe folds. As both SCOPe and ECOD were derived from SCOP v1.75, we find that their domain boundaries are largely similar: 83% of SCOPe domains map to an ECOD domain. Although only 29% of ECOD domains map to a SCOPe domain, the 63% of ECOD domains that could not be mapped result from the inclusion of structures not found in SCOPe. 
This is likely a consequence of the more aggressive automatic update procedure of ECOD compared to the more conservative protocol of SCOPe.[10,26] Among the most populated ECOD H-groups, we find that the majority of domains are either mapped directly to an SCOPe domain or are from PDB chains not yet incorporated into SCOPe. For example, in ECOD H-group 11.1, “IgG beta sandwich domains”, 38% of ECOD domains map directly to a SCOPe domain, 60% occur in chains not contained in SCOPe, and only 2% of ECOD domains come from a PDB chain contained in SCOPe where no domain mapping could be found. Similarly, for SCOPe fold b.1 “immunoglobulin-like beta-sandwich”, 98% of SCOPe domains are mapped to ECOD, 1% of domains in this fold cannot be mapped to an ECOD domain despite sharing a chain with ECOD, and 1% of domains are from chains unique to SCOPe. This suggests that the criteria for domain boundary selection are very similar for ECOD and SCOPe and that a major source of difference between the classifications is simply the structures considered. We note that these calculations were carried out over the full set of domains in both classifications, which are known to be highly sequence redundant. In order to compare the classifications in terms of hierarchal organization, in addition to domain boundary selection, we used domain pair analysis as described previously.[12,27,28]
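The shared, shared-chain, and unique labels used above can be illustrated with a short sketch. It rests on assumptions: the 90% bidirectional coverage threshold is taken from the Figure 3 caption, the exact matching rules live in the Methods (not reproduced here), and the helper names, chain identifiers, and residue ranges are all hypothetical.

```python
def residues(segments):
    """Expand inclusive (start, end) segments into a set of residue numbers."""
    return {i for start, end in segments for i in range(start, end + 1)}

def is_match(segments_a, segments_b, threshold=0.9):
    """True when the overlap covers at least `threshold` of BOTH domains."""
    a, b = residues(segments_a), residues(segments_b)
    overlap = len(a & b)
    return overlap >= threshold * len(a) and overlap >= threshold * len(b)

def label_domains(ours, theirs, threshold=0.9):
    """Label each of our domains as 'matched', 'unmatched', or 'unique'.

    ours / theirs: dict mapping a chain id to a list of domains, where each
    domain is a tuple of inclusive (start, end) residue segments.
    """
    labels = []
    for chain, domains in ours.items():
        for segments in domains:
            if chain not in theirs:
                label = "unique"      # chain absent from the other classification
            elif any(is_match(segments, other, threshold) for other in theirs[chain]):
                label = "matched"     # a "shared" domain
            else:
                label = "unmatched"   # shared chain, different partition
            labels.append((chain, segments, label))
    return labels

# Toy data: chain ids and residue ranges are made up for illustration.
classification_a = {
    "chain_1": [((1, 220),)],                  # kept as one domain
    "chain_2": [((1, 100),), ((101, 200),)],
    "chain_3": [((1, 80),)],
}
classification_b = {
    "chain_1": [((1, 110),), ((111, 220),)],   # split into two domains
    "chain_2": [((1, 98),), ((105, 200),)],
}

labels = label_domains(classification_a, classification_b)
for chain, segments, label in labels:
    print(chain, segments, label)
```

Running the comparison in both directions (A against B, then B against A) gives the two sides of the bidirectional coverage shown in the domain partition figures.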

Figure 2

ECOD and SCOPe domain partition comparison. (left) Domain population of the 15 most populated H-groups in ECOD v278 stratified by shared (blue), novel partition on shared chain (orange), and new domains from unshared chains (gray). (right) The 15 most populated SCOPe folds stratified by shared (blue), unmatched (orange), and unique (gray) domains. (inset) Total unique, unmatched, and shared domains from SCOPe and ECOD.

We compared ECOD v278 to CATH v4.3 using domain boundary correspondence. ECOD and CATH differ fundamentally in methods for domain boundary selection and conceptually in domain hierarchy. Although ECOD (and SCOPe) relies principally on sequence alignment methods for automatic extension of classification, earlier versions of CATH contained an automated structural search method involved in domain parsing.[24,29] The 4.3 version of CATH introduced a substantial rework to the methods used for automatic domain parsing.[30] The results of the domain boundary analysis are summarized in Figure 3. Many ECOD H-groups contain domains from chains contained in CATH but where there is no definitive mapping. This lack of mapping suggests that substantive differences in domain boundary choices exist between ECOD and CATH. For example, among domains belonging to the SH3 H-group in ECOD, less than 45% map to domains in CATH.

Figure 3

Well-populated ECOD H-group domain comparison with CATH topologies. (A) (left) Top 15 most populated ECOD H-groups composed of “matched” domains with 90% bidirectional sequence coverage (blue), “unmatched” domains from chains in both ECOD and CATH that have no corresponding domain (orange), and “unique” domains defined from chains that occur in only one classification (gray). (right) The top 15 most populated CATH topologies with matched (blue), unmatched (orange), and unique (gray) domains. (inset) Total matched, unmatched, and unique domains from ECOD and CATH. (B) Trypsin (PDB: 3otj) complexed with BPTI (partial peptide in the magenta stick) at the interface of duplicated RIFT-barrel subunits (ECOD 1.1, cyan and green). Interacting residues from both subunits are on the transparent surface. Active-site residues (spheres) contribute from both subunits. The domain schematic (below) highlights the sequence discontinuous CATH domain definition. (C) First RIFT barrel colored as in panel B except for the N-terminal RIFT β-strands (blue) and C-terminal RIFT β-strands (red). An active-site residue is in the RIFT crossover loop (black sphere marks the Cα position). The N-terminal sequence discontinuous loop (orange) is connected by dashes (green) where the duplicated RIFT barrel is inserted. (D) Second RIFT barrel is colored as in panel C with active-site residues (black spheres), intervening loops (green), and dashes (cyan, representing the first RIFT barrel position) connecting the C-terminal sequence discontinuous helix (orange).

Among the top populated ECOD H-groups, ECOD classification is less consistent with CATH than with SCOPe (Figure 3A, left, orange bars). 
The differences between several of these H-groups encompass folds with a Rossmann-like motif (e.g., ECOD H-groups: 2003.1, 2004.1, 2007.1, and 7523.1) and have been compared previously.[23,31] Inconsistencies among other large H-groups include examples from ECOD immunoglobulin-related (11.1; 92% overlap), TIM barrel (2002.1; 86% overlap), SH3 barrel (4.1; 71% overlap), RIFT-related barrels (1.1; 54% overlap), HTH (101.1; 57% overlap), and protein kinases (206.1; 2.3% overlap), among others. Among the most populated β-barrels, the RIFT-related barrel group tends to be less consistent than the SH3 group. A large portion of the inconsistent RIFT-related domains come from the CATH topology thrombin H (Figure 3A, **), which includes trypsin-like proteases. Trypsin-like proteases consist of duplicated RIFT-related barrel subdomains. The barrels pack compactly together to form the catalytic core, with the active site cleft running alongside the subdomain boundary. Given the shared active site and the reliance of ECOD on manual sequence-based classification, we do not split trypsin-like serine proteases into independent domains. However, CATH considers them as independent. For comparison, the increased consistency of the SH3 domain definition between the two classifications may be due to their function as protein scaffold domains that bind to proline-rich or other motifs.[32] One of the main discrepancies for SH3 barrel domains stems from their presence in the ribosome complex. ECOD includes a C-terminal extended tail in the SH3 definition, while CATH limits the domain definition to the compact unit.

Figure 4

Plasticity of distantly related protein kinase homologs. (A) Cki1 (PDB: 1csn) with bound Mg-ATP (gray sphere-black stick) in the catalytic cleft between the N-lobe (green, CATH 3.30.200) and the conserved core of the C-lobe (cyan, CATH 1.10.510) with additional elaborated helices (white). (B) Minimal kinase domain from OspG (PDB: 4q5hA) bound to Mg-AMPPNP is colored as above. (C) Core protein kinase C-lobe SSEs (colored in rainbow from the N-terminus to the C-terminus) are common to Cki1 (left) and OspG (right). (D) ATP-grasp GART (PDB: 1kjqB; CATH 3.30.1490 and 3.30.470) in complex with Mg-ADP is colored by the subdomain. An insertion in the N-lobe (white) replaces the N-terminal strands in protein kinases. The C-subdomain includes a pronounced β-sheet with similar topology to (E) SAICAR (PDB: 2gqrA, CATH 3.30.200 and 3.30.470) bound to Mg-ADP. (F) Common C-subdomain from SAICAR (rainbow) lacks the C-terminal helix (left), but the related SAICAR homolog Itpkb includes the C-helix (PDB: 2aqx).

The large superfamily of kinases and related homologs shows a notable inconsistency between ECOD domain definitions (ECOD 206.1) and CATH definitions (CATH 3.30.200 and 1.10.510). ECOD combines the protein kinase fold together with that of SAICAR and ATP-grasp based on previously noted homology.[33−35] Each of these groups includes two subdomains that combine to form an obligate active site, as exemplified by the active-site cleft formed between the N-lobe and C-lobe of the protein kinases (Figure 4A). ECOD does not separate these subdomains, keeping the protein kinase, SAICAR, and ATP-grasp folds as single units. Alternatively, CATH separates the subdomains and classifies them based on the topology of extant proteins: the protein kinase N-lobe (3.30.200) and C-lobe (1.10.510), the SAICAR N-lobe (3.30.200) and C-domain (3.30.470), and the ATP-grasp domains (3.30.1490 and 3.30.470). 
CATH distinguishes the mainly helical protein kinase C-lobe (1.10.510) from the distantly related SAICAR and ATP-grasp C-domains (3.30.470). Instead, ECOD uses core structure motifs that are common among homologs to define the relationship. For example, the Shigella effector kinase adopts a minimal kinase domain that excludes most of the C-terminal helices (Figure 4B). The limited structure includes three conserved helices and three short β-strands that surround the active site (Figure 4C). Similar subdomains from glycinamide ribonucleotide transformylase (GART) ATP-grasp (Figure 4D) and from SAICAR phosphoribosylaminoimidazole-succinocarboxamide synthase (Figure 4E) contribute to a central active site that binds ligand in a similar orientation as in protein kinases. However, the C-terminal domains from GART and SAICAR exhibit a pronounced β-sheet when compared to the protein kinase C-lobe. This difference leads to an alternate topology definition in CATH, which considers the protein kinase C-lobe as a helical topology (1.10.510) and the others as alpha-beta two-layered sandwiches (3.30.470). While the sheets differ in length, they all include a similar set of secondary structure elements (Figure 4F). The SAICAR C-subdomain lacks the protein kinase C-terminal helix. However, some SAICAR homologs such as inositol 1,4,5-trisphosphate 3-kinase (Itpkb) possess this element. CATH classifies the inositol kinase as a single domain together with the protein kinase C-lobes (1.10.510). The pronounced difference in protein kinase/SAICAR/ATP-grasp classification in ECOD and CATH stems from substantial plasticity of the folds. The evolutionary considerations in ECOD lead to a unified group of homologs that exhibit alternate topologies, while the topology-based classification scheme in CATH distinguishes among the diverse scaffolds. Finally, we conducted the domain boundary comparison to the most recent version of SCOP2. 
SCOP2 contains a redesigned protein classification schema that clusters families and superfamilies of domains into networks. SCOP2 domains can be defined both by a “family range” and a “superfamily range.” For the purposes of this analysis, we compared the SCOP2 family domain ranges to ECOD domain ranges. Of the classifications considered, SCOP2 contains the fewest domains, with 33,845 domains in 1502 folds and 2706 superfamilies. Nearly all SCOP2 domains are included in ECOD, while more than 95% of ECOD domains are not present in SCOP2 (Figure 5, right).

Figure 5

Well-populated ECOD H-groups compared to SCOP2 folds. (A) ECOD H-groups with matched (blue), unmatched (orange), and unique (gray) ECOD domains. (B) SCOP2 folds with matched (blue), unmatched (orange), and unique (gray) SCOP2 domains.

The domain coverage of ECOD by SCOP2 and SCOP2 by ECOD is shown in Figure 5 (left and right, respectively). SCOP2 and ECOD exhibited larger differences in domain partition compared to those between SCOPe and ECOD. A total of 11 out of the top 15 most populated SCOP2 folds have greater than 40% unmatched domains, and four of them have more than 50% unmatched domains (ferredoxin-like, immunoglobulin-like, OB-fold, and ribonuclease-like), suggesting significantly different domain boundary definitions. One possible reason could be the much smaller sample sizes of SCOP2 domains. For the PLP-dependent transferase-like fold, all of the SCOP2 domains are unmatched compared to ECOD, as ECOD separated the N-terminal catalytic domain and the C-terminal domain of these PLP-dependent enzymes into two X-groups, while SCOP2 kept them together as single-domain units.

Distribution of Equivalent Domain Pairs among Structural Domain Classifications

We analyzed the distribution of homologous domain pairs between ECOD and other domain classifications: SCOPe, SCOP2, and CATH. This is similar to a domain pair analysis we previously presented,[12] although in that case, the domain pair needed to be shared among all involved classifications, whereas in this case, we performed a series of pairwise analyses. Domain pair analysis reports the overall similarity between two hierarchical levels of differing domain classifications. We compared ECOD H-groups to SCOPe folds, CATH topologies, and SCOP2 folds (Figure 6). This analysis generally describes the relative breadth of classification levels. ECOD H-groups have a similar breadth of classification to CATH topology groups and SCOP2 folds but encompass a slightly broader definition of homology than SCOPe folds.

Figure 6

Fraction of ECOD H-group domain pairs found in other classifications. For every domain pair (i.e., pair of matched domains between classifications sharing an H-group in ECOD), we assess whether these domain pairs are found in the same SCOPe fold, CATH topology, or SCOP2 fold. This broadly measures the average evolutionary distance encompassed by other classifications and their hierarchical levels compared to ECOD homology groups (H-groups).
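A minimal sketch of the domain pair measure described in the caption follows. It assumes that each matched domain carries one ECOD H-group label and one group label from the other classification; the function name `pair_agreement` and the labels `H1`, `T1`, and so on are placeholders, not real group identifiers.

```python
from itertools import combinations

def pair_agreement(assignments):
    """Fraction of same-H-group domain pairs that also share the other group.

    assignments: list of (ecod_h_group, other_group) tuples, one per matched
    domain. Returns NaN when no two domains share an H-group.
    """
    same_h_group = 0
    same_in_both = 0
    for (h1, g1), (h2, g2) in combinations(assignments, 2):
        if h1 == h2:
            same_h_group += 1
            if g1 == g2:
                same_in_both += 1
    return same_in_both / same_h_group if same_h_group else float("nan")

# Toy data: three domains in H-group H1 (split across two groups of the
# other classification) and two domains in H-group H2 (kept together).
matched_domains = [
    ("H1", "T1"),
    ("H1", "T1"),
    ("H1", "T2"),
    ("H2", "T3"),
    ("H2", "T3"),
]

fraction = pair_agreement(matched_domains)
print(f"{fraction:.2f}")
```

A fraction near 1 would suggest that the other classification's level is at least as broad as an ECOD H-group for the shared domains, while a lower fraction would suggest that it divides domains ECOD considers homologous. Because the number of pairs grows quadratically with group size, a practical implementation would likely count pairs per group rather than enumerate them.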

Comparison of Domain Partition between ECOD and Other Classifications

Domain classifications vary not only based on how they organize domains into hierarchies by the sequence and structure but also in how they partition proteins into domains. We analyzed the domain partition of chains shared by ECOD with other domain classifications to quantitate the degree to which domain partition varies among single-domain proteins in ECOD. We chose to limit this analysis to proteins with a single domain because they are the most populated in all classifications and it is a concise way to investigate and illustrate differences between classifications. Classifications vary in how many domains they define because of a variety of factors: some prefer not to split duplications unless independent instances of a duplicated domain can be found. Classifications can vary on the degree to which the function should be considered in the partition of domains, especially when structural factors of that function are provided by sequence-distant residues. The protein kinase/ATP-grasp/SAICAR H-group is a canonical example of a protein that some classifications (ECOD and SCOPe) choose to define as a single domain (despite its clear dual-lobe structure) for functional considerations, whereas others (CATH) choose to divide the protein into two structurally compact domains.[33] The previously discussed RIFT-barrel H-group is another set of domains where partition strategies can vary significantly between domain classifications. SCOPe and ECOD are both descendants of SCOP v1.75 and so contain some of the same implicit assumptions regarding domain partition. ECOD differs from SCOPe in a number of ways that impact domain partition. We compared ECOD single-domain proteins (i.e., domains from PDB chains containing only that domain) from chains present in pairs of classifications. 
Our goal was to identify (1) whether there are single-domain proteins that do not achieve our bidirectional coverage and (2) where ECOD over- or undersplits single-domain proteins with respect to other classifications. ECOD classifies 533,057 PDB chains, 66% of which are single-domain proteins. Of these 354,861 ECOD single-domain proteins, 148,305 (41%) are from chains/proteins also classified by SCOPe. A total of 145,757 ECOD single-domain proteins (98%) are matched by coverage to a SCOPe domain, and 144,106 (97%) of these cases that match in SCOPe are also single-domain. This leaves 2181 (1.4%) cases where (1) an ECOD single-domain protein is from a SCOPe chain but matches no SCOPe domain and (2) SCOPe defines that protein as single-domain (i.e., both ECOD and SCOPe classify a PDB chain as single-domain but vary sufficiently by coverage to prevent a match). Conversely, of 167,118 SCOPe single-domain proteins from chains also classified by ECOD, 144,572 (86%) match an ECOD domain, where 144,106 (86%) of those are single-domain matches. Finally, 22,546 (13%) SCOPe single-domain proteins from chains shared with ECOD do not match any ECOD domain. The low percentage of unmapped ECOD single domains compared with the relatively higher level of unmapped SCOPe single domains indicates that ECOD tends to split proteins into smaller domains when there is disagreement between classifications. The high rate of overall single-domain matches indicates a correspondence between ECOD and SCOPe single-domain proteins, while the SCOPe results indicate a small fraction of SCOPe single-domain proteins (generally unsplit duplications) that do not match. In conclusion, SCOPe and ECOD still retain a high degree of similarity in their classification and boundary partition of single-domain proteins. We also compared the domain partition of ECOD to CATH using this method. 
CATH differs more fundamentally from ECOD in the weight it gives to structural compactness when determining domain boundaries, so we expected more varied results. We compared the same set of ECOD single-domain proteins to CATH. A total of 210,386 (59%) ECOD domains are from chains/proteins that are also classified by CATH. By the coverage threshold, 175,168 (83%) ECOD single-domain proteins match a CATH domain, and 172,950 (82%) of these matches are to a CATH single-domain protein. A total of 32,993 (5%) ECOD single-domain proteins are defined on chains present in CATH, yet those ECOD domains are unmatched. The top two most populated ECOD H-groups containing these unmatched ECOD domains are the PK-like and RIFT barrel domains. Conversely, 204,555 (62%) of PDB chains classified by CATH are single domains, and of these, 197,619 (96%) are derived from chains also classified by ECOD. Finally, 172,950 (87%) of CATH single-domain proteins also match an ECOD single-domain protein definition, whereas 21,433 (10%) of CATH single-domain proteins match no domain. In contrast to SCOPe, the higher degree of mismatch among single-domain proteins here indicates that both ECOD and CATH oversplit some domains with respect to the opposing classification. Finally, we compared the domain partition of SCOP2 to ECOD. A total of 21,382 ECOD single-domain proteins are from PDB chains defined by both ECOD and SCOP2. Of these ECOD single-domain proteins, 20,178 (94%) match by coverage to a SCOP2 domain, of which 20,033 (93%) are SCOP2 single-domain proteins. A total of 1114 (5%) ECOD single-domain proteins are from chains classified in SCOP2 but match no SCOP2 domain by coverage. Conversely, 24,297 (98%) SCOP2 single-domain proteins are from chains classified in both SCOP2 and ECOD. A total of 20,190 (83%) SCOP2 single-domain proteins match by coverage to some ECOD domain, and 20,019 (82%) of these matches are to ECOD single-domain proteins. 
A total of 4107 (2%) SCOP2 single-domain proteins do not match any ECOD domain, indicative of ECOD splitting these chains into multidomain architectures.
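The bookkeeping behind these single-domain counts can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the 0.9 bidirectional coverage threshold is borrowed from the Figure 3 caption, domains are represented as plain residue sets, and the function names and chain identifiers are hypothetical.

```python
def coverage(a, b):
    """Fraction of residue set `a` that also lies in residue set `b`."""
    return len(a & b) / len(a)

def single_domain_tally(ours, theirs, threshold=0.9):
    """Tally outcomes for our single-domain chains that the other side also classifies.

    ours / theirs: dict mapping a chain id to a list of residue sets,
    one set per domain defined on that chain.
    """
    tally = {"shared_chain": 0, "matched": 0, "matched_single": 0, "unmatched": 0}
    for chain, domains in ours.items():
        if len(domains) != 1 or chain not in theirs:
            continue  # only single-domain chains present in both classifications
        tally["shared_chain"] += 1
        domain = domains[0]
        hit = any(
            coverage(domain, other) >= threshold and coverage(other, domain) >= threshold
            for other in theirs[chain]
        )
        if hit:
            tally["matched"] += 1
            if len(theirs[chain]) == 1:
                tally["matched_single"] += 1  # both sides call the chain single-domain
        else:
            tally["unmatched"] += 1
    return tally

# Toy data: chain ids and residue ranges are made up for illustration.
classification_a = {
    "c1": [set(range(1, 201))],
    "c2": [set(range(1, 151))],
    "c3": [set(range(1, 101)), set(range(101, 201))],  # multidomain, skipped
    "c4": [set(range(1, 91))],                          # absent from B, skipped
}
classification_b = {
    "c1": [set(range(1, 196))],
    "c2": [set(range(1, 76)), set(range(76, 151))],     # B splits this chain
    "c3": [set(range(1, 201))],
}

tally = single_domain_tally(classification_a, classification_b)
print(tally)
```

Swapping the two arguments gives the converse comparison. An unmatched single-domain chain in one direction (such as `c2` here) is the pattern the text reads as the other classification splitting that chain.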

Major Facilitator Superfamily General Substrate Transporter Ensembles

Major facilitator superfamily (MFS) transporters contain 12 transmembrane helices (TMH) arranged around a single cavity formed between a domain duplication of 6 TMH.[36] Thus, the MFS transporters likely arose from an ancient 6TMH transporter that duplicated and fused to produce present day 12 TMH MFS topology. Because the peptide binding site is located at the intersection of these two domains, sequence-based classifications such as PFAM define the entire 12TMH entity as a single functional unit, with binding contributed by residues from both domains. ECOD splits the MFS structures into two domains based on their observed duplication and their structural compactness and independence. Sequence and structure evidence further supports splitting of the 6TMH domain into two primordial 3TMH units that form an interdigitated complex that would interact from alternate sides of the membrane.[37−39] In support of primitive MFS-like 6TMH-containing domains, a recent expansion of the MFS family to include novel transporters without a known structure includes several families with a single 6TMH domain.[40] Splitting of duplications is one question on which domain classifications, sequence and structure, differ. ECOD splits the duplicated domains from the MFS transporters prior to their classification into families. As such, the N-terminal and C-terminal halves of the largest and most diverse MFS superfamily (defined as MFS_1 or PF07690 by PFAM) group into several different families (MFS_1, MFS_1_1, MFS_1_2, 5050.1.1.19_develop277_1 and 5050.1.1.20_develop277_2). This separation reflects sequence divergence that can result from duplication events. Alternatively, the N-terminal and C-terminal domains from the remaining ECOD families reflect simple duplications that classify into the same group (PTR2, MFS_2, FPN1, and LacY_symp). 
These groupings roughly reflect previously defined relationships in the transporter classification database.[40,41] MFS transporters follow an alternating access model where the central cavity opens to the outside to allow substrate binding and then transitions to an inward-facing conformation to release the substrate on the other side of the membrane.[42,43] This conformation change occurs at the interface of the two duplicated 6TMH domains defined in ECOD. We compared structures from one of the smaller and less diverse MFS families, represented by E. coli lactose permease (LacY), with structures from both the inward (Figure 7A) and outward (Figure 7B) conformations.[44−46] LacY symporters utilize an ion gradient to transport β-galactosides across the membrane against a concentration gradient. The LacY structure ensemble can be depicted as a tree generated from scores of structure comparisons (Figure 7C), which separates the inward-facing (left) and outward-facing (right) conformations. Closer inspection of the inward-facing conformations highlights flexibility in the loop connecting the two domains and the C-terminal helix that line the cytoplasmic side of the structure (Figure 7D). In the outward-facing conformations, flexibility highlights a periplasmic gate that leads to partial occlusion of two substrate-bound structures (Figure 7C, indicated by * in the tree). Interestingly, the outward-facing conformations require either stabilization with a nanobody (6c9w and 5gxb) or mutation of two residues (4zyr and 4oaa), supporting the idea that conformational diversity is an evolvable trait. The presence of multiple observed conformations in structures reflects a known limitation of structural domain classifications, including ECOD: their taxonomies do not support recording these conformations.

Figure 7

MFS Transporter LacY family structure ensemble. The LacY structure is composed of a 6TMH domain duplication (colored slate and salmon) with a central cleft that binds the substrate (black spheres). LacY adopts (A) an inward facing and (B) an outward facing conformation. (C) Conformations are separated in a tree depicting distances between all LacY structures. (D) Inward-facing structure superpositions highlight flexibility (colored cyan and green) on the cytoplasmic side. (E) Outward facing structure superpositions highlight flexibility in the periplasmic gate (cyan), leading to partial occlusion of the substrate (green).

Comparison of Equivalent Hierarchal Levels by Domain Pair Analysis

Where matched domains were found between ECOD and other domain classifications, we analyzed differences in their classification by domain pair analysis. For each pair of shared domains occupying a taxonomic level in one classification (e.g., ECOD H-group), we examined whether the domain pair was also paired on an equivalent level in another classification (e.g., SCOPe fold). For the purposes of this analysis, we considered ECOD H-groups, SCOP2 and SCOPe folds, and CATH topologies to be equivalent. These equivalencies are based on our experience with these classifications: generally speaking, ECOD H-groups tend to be quite broad, so we chose the broadest level possible from other comparisons in order to generate the closest possible comparison.
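The pair counting described above can be sketched as follows. This is a minimal illustration, not the code used in this work: the function name `pair_agreement`, the dictionary inputs, and the toy domain and group labels are all hypothetical, and the sketch assumes that each shared domain has already been assigned one group label at the chosen level in each classification.

```python
from collections import defaultdict
from itertools import combinations


def pair_agreement(groups_a, groups_b):
    """Count domain pairs co-classified in A, and how many are also paired in B.

    groups_a, groups_b: dict mapping a shared domain identifier to its group
    label at the level being compared (e.g., ECOD H-group vs. SCOPe fold).
    Returns (pairs_in_a, pairs_also_in_b).
    """
    shared = set(groups_a) & set(groups_b)

    # Bucket the shared domains by their group in classification A.
    members_by_group = defaultdict(list)
    for domain in sorted(shared):
        members_by_group[groups_a[domain]].append(domain)

    pairs_in_a = 0
    pairs_also_in_b = 0
    for members in members_by_group.values():
        # Every unordered pair inside one A-group is a "paired" domain pair.
        for d1, d2 in combinations(members, 2):
            pairs_in_a += 1
            if groups_b[d1] == groups_b[d2]:
                pairs_also_in_b += 1
    return pairs_in_a, pairs_also_in_b


# Toy example with hypothetical labels.
ecod_h = {"d1": "H1", "d2": "H1", "d3": "H1", "d4": "H2"}
scope_fold = {"d1": "F1", "d2": "F1", "d3": "F2", "d4": "F2"}
paired, agreeing = pair_agreement(ecod_h, scope_fold)
```

Enumerating all pairs is quadratic in group size; for large groups the same counts could be obtained from group-size tallies, but the explicit loop keeps the definition visible.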

Conclusions

Domain classifications provide useful community resources for the study of protein structure. By curating and analyzing the ever-expanding exploration of the protein sequence and structure space, milestones for further analysis are created. Analogous to how the protein structural space is explored nonrandomly in congruence with investigator interest, domain classifications of this space also develop in potentially unexpected ways based on the choices of their curators. Since there is no “first principles” analysis of protein domains, periodic comparative analysis can be used to determine how domain classifications are changing with respect to each other and with respect to the overall set of known structures. Here, we present an update on the state of protein domain classification 7 years after the release of our protein domain classification, ECOD. By domain boundaries, our classification continues to match well with SCOPe, while classifying many more structures. Both CATH and SCOP2 differ more in boundaries and are likely interesting alternative sources of boundaries for users. Determining when and how to split domain duplications is a key conceptual point on which domain classifications may differ. We present an analysis and structural comparison of ECOD domains from the MFS superfamily and offer supporting evidence for splitting them into multiple domains. For potential users, we anticipate that ECOD provides the broadest coverage of the known structures with the most distant homologous relationships. SCOPe is also a descendant of SCOP v1.75 and is the most similar for those looking for a benchmark. CATH provides a distinct perspective on domain definition more focused on structural compactness and has a rigorous treatment of protein families and their function that was beyond the scope of this work. SCOP2 provides the most efficient coverage of protein space by focusing on classifying only nonredundant representatives. 
We suspect that the advent of cheap near-native protein structural predictions will soon require domain classifications to explicitly label and organize multiple native domain conformations.

Methods

Generation of Unrooted Structure Similarity Trees

We batch downloaded structures defined in the ECOD LacY_symp family from the PDB[47] and limited those structures to chain A. We compared all structures against all using DaliLite[48,49] and transformed the Dali Z-scores to distances using the following transformation: $d_{AB} = -\ln\left(Z_{AB} / \min(Z_{AA}, Z_{BB})\right)$, where $Z_{AB}$ is the Dali Z-score between two structures A and B, and $Z_{AA}$ and $Z_{BB}$ are the scores of the self-alignments. We generated an unrooted tree using the fitch program (with global rearrangements) with an input matrix of the calculated structure-based distances.[50]
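As an illustration of the transformation above, the following sketch converts a matrix of Dali Z-scores into distances and writes them in the square PHYLIP distance-matrix layout that fitch reads. It is an assumption-laden sketch rather than the pipeline used here: the function names, the symmetric input matrix, the clamping of negative distances to zero, and the toy scores are all hypothetical.

```python
import math


def dali_distance(z_ab, z_aa, z_bb):
    """Distance from Dali Z-scores: -ln(Z_AB / min(Z_AA, Z_BB))."""
    if z_ab <= 0 or min(z_aa, z_bb) <= 0:
        raise ValueError("Z-scores must be positive to take the logarithm")
    d = -math.log(z_ab / min(z_aa, z_bb))
    # Assumption: clamp tiny negative values so the tree program sees d >= 0.
    return max(0.0, d)


def distance_matrix(z):
    """z: square, symmetric list-of-lists of Z-scores; z[i][i] is the self score."""
    n = len(z)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = dali_distance(z[i][j], z[i][i], z[j][j])
    return d


def to_phylip(names, d):
    """Format a square distance matrix; PHYLIP names are padded to 10 characters."""
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, d):
        lines.append(f"{name[:10]:<10}" + " ".join(f"{x:.4f}" for x in row))
    return "\n".join(lines) + "\n"


# Toy Z-scores (hypothetical) for three structures.
names = ["strucA", "strucB", "strucC"]
z = [[40.0, 20.0, 10.0],
     [20.0, 20.0, 10.0],
     [10.0, 10.0, 40.0]]
d = distance_matrix(z)
phylip_text = to_phylip(names, d)
```

Normalizing by the smaller self-alignment score means that a structure fully contained in a larger one still receives a distance near zero.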

Calculating Domain Classification Completeness by the Sequence and Structure against the PDB

Completeness of domain classifications was calculated against a version of the PDB current to November 2020. This PDB reference contains 171,447 depositions with 566,056 peptide chains. A total of 878 obsolete PDB entries were retained where necessary to incorporate all domains from the classifications being studied. ECOD v278 was downloaded from http://prodata.swmed.edu/ecod, containing 789,634 domains from 2460 X-groups (possible homology) and 3716 H-groups. CATH v4.3 was downloaded from https://cathdb.info containing 500,238 domains from 1472 topology groups and 6631 homologous superfamilies. A periodic update of SCOPe (v2.07-2020-07-16) was downloaded from https://scop.berkeley.edu/ containing 317,172 domains from 1457 folds and 2323 superfamilies, and noncanonical classes were excluded from analysis. UniProt accessions were associated with PDB entries using the SIFTS database.[51] PDB release dates were retrieved from the mmCIF records for each deposition.

Comparison of Domain Partition by the Overall Coverage

We compared the domain partition of classifications by residue coverage of these domains from shared peptide chains. PDB ranges were translated into internal “seq_id” ranges (as in the pdbx_poly_seq_scheme records from the mmCIF representation) in order to make more accurate comparisons. These seq_id ranges uniquely identify individual residues (without the need for insertion codes) and always range from 1 to N, where N is the length of the protein. They also resolve ambiguities between residue numbering and unresolved residues in the structure, which are difficult to handle with standard author-provided PDB residue numbers. Where domain residue ranges overlapped by 90% or more, they were deemed shared domains. Where domains were defined on a shared peptide chain but did not satisfy our overlap criteria, they were described as “unmatched domains from shared chains” or simply “unmatched.” Finally, where domains were defined on peptide chains that could only be found in one classification, these domains were deemed “unique” to that classification. ECOD, SCOPe, and CATH each record the locations of noncanonical domains or regions that do not properly satisfy their own domain criteria (e.g., expression tags, peptides, and synthetic or de novo designed domains). For SCOPe, we only conducted domain comparisons for SCOPe domains from classes [a–g]. For CATH, only domains from classes 1–4 were considered. SCOP2 does not record noncanonical domain regions at this time.
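The shared/unmatched/unique labeling can be sketched as below. The text does not state which domain length the 90% overlap is measured against, so this sketch assumes the stricter reciprocal reading (overlap divided by the longer of the two domains); that choice, the function names, and the example ranges are hypothetical.

```python
def expand(ranges):
    """Turn inclusive (start, end) seq_id segments into a set of residue positions."""
    residues = set()
    for start, end in ranges:
        residues.update(range(start, end + 1))
    return residues


def overlap_fraction(domain_a, domain_b):
    """Shared residues divided by the size of the larger domain (assumed definition)."""
    res_a, res_b = expand(domain_a), expand(domain_b)
    if not res_a or not res_b:
        return 0.0
    return len(res_a & res_b) / max(len(res_a), len(res_b))


def classify_domain(query, other_domains_on_chain, threshold=0.9):
    """Label one domain relative to another classification.

    other_domains_on_chain: list of domains (each a list of seq_id segments)
    that the other classification defines on the same chain, or None if that
    classification does not contain the chain at all.
    """
    if other_domains_on_chain is None:
        return "unique"
    for other in other_domains_on_chain:
        if overlap_fraction(query, other) >= threshold:
            return "shared"
    return "unmatched"


# Hypothetical example: a discontinuous domain given as two seq_id segments.
query = [(1, 60), (101, 140)]
label = classify_domain(query, [[(1, 58), (101, 140)], [(61, 100)]])
```

Working on sets of seq_id positions handles discontinuous domains directly, at the cost of memory proportional to domain length.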

51 in total

Review 1. The modular logic of signaling proteins: building allosteric switches from simple binding domains.

Authors: Wendell A Lim
Journal: Curr Opin Struct Biol Date: 2002-02 Impact factor: 6.809

2. DaliLite workbench for protein structure comparison.

Authors: L Holm; J Park
Journal: Bioinformatics Date: 2000-06 Impact factor: 6.937

3. Evolution of the protein repertoire.

Authors: Cyrus Chothia; Julian Gough; Christine Vogel; Sarah A Teichmann
Journal: Science Date: 2003-06-13 Impact factor: 47.728

4. 'It will change everything': DeepMind's AI makes gigantic leap in solving protein structures.

Authors: Ewen Callaway
Journal: Nature Date: 2020-12 Impact factor: 49.962

Review 5. Protein Ensembles: How Does Nature Harness Thermodynamic Fluctuations for Life? The Diverse Functional Roles of Conformational Ensembles in the Cell.

Authors: Guanghong Wei; Wenhui Xi; Ruth Nussinov; Buyong Ma
Journal: Chem Rev Date: 2016-01-25 Impact factor: 60.622

Review 6. Single-particle cryo-EM-How did it get here and where will it go.

Authors: Yifan Cheng
Journal: Science Date: 2018-08-31 Impact factor: 47.728

Review 7. Divergence and convergence in enzyme evolution.

Authors: Michael Y Galperin; Eugene V Koonin
Journal: J Biol Chem Date: 2011-11-08 Impact factor: 5.157

8. InterPro in 2017-beyond protein family and domain annotations.

Authors: Robert D Finn; Teresa K Attwood; Patricia C Babbitt; Alex Bateman; Peer Bork; Alan J Bridge; Hsin-Yu Chang; Zsuzsanna Dosztányi; Sara El-Gebali; Matthew Fraser; Julian Gough; David Haft; Gemma L Holliday; Hongzhan Huang; Xiaosong Huang; Ivica Letunic; Rodrigo Lopez; Shennan Lu; Aron Marchler-Bauer; Huaiyu Mi; Jaina Mistry; Darren A Natale; Marco Necci; Gift Nuka; Christine A Orengo; Youngmi Park; Sebastien Pesseat; Damiano Piovesan; Simon C Potter; Neil D Rawlings; Nicole Redaschi; Lorna Richardson; Catherine Rivoire; Amaia Sangrador-Vegas; Christian Sigrist; Ian Sillitoe; Ben Smithers; Silvano Squizzato; Granger Sutton; Narmada Thanki; Paul D Thomas; Silvio C E Tosatto; Cathy H Wu; Ioannis Xenarios; Lai-Su Yeh; Siew-Yit Young; Alex L Mitchell
Journal: Nucleic Acids Res Date: 2016-11-29 Impact factor: 16.971

9. A database of domain definitions for proteins with complex interdomain geometry.

Authors: Indraneel Majumdar; Lisa N Kinch; Nick V Grishin
Journal: PLoS One Date: 2009-04-08 Impact factor: 3.240

Review 10. The history of the CATH structural classification of protein domains.

Authors: Ian Sillitoe; Natalie Dawson; Janet Thornton; Christine Orengo
Journal: Biochimie Date: 2015-08-04 Impact factor: 4.079