Literature DB >> 34755867

PDBe-KB: collaboratively defining the biological context of structural data.

Abstract

The Protein Data Bank in Europe - Knowledge Base (PDBe-KB, https://pdbe-kb.org) is an open collaboration between world-leading specialist data resources contributing functional and biophysical annotations derived from or relevant to the Protein Data Bank (PDB). The goal of PDBe-KB is to place macromolecular structure data in their biological context by developing standardised data exchange formats and integrating functional annotations from the contributing partner resources into a knowledge graph that can provide valuable biological insights. Since we described PDBe-KB in 2019, there have been significant improvements in the variety of available annotation data sets and user functionality. Here, we provide an overview of the consortium, highlighting the addition of annotations such as predicted covalent binders, phosphorylation sites, effects of mutations on the protein structure and energetic local frustration. In addition, we describe a library of reusable web-based visualisation components and introduce new features such as a bulk download data service and a novel superposition service that generates clusters of superposed protein chains weekly for the whole PDB archive.

Entities: Chemical

Mesh：

Substances：
Proteins

Year: 2022 PMID： 34755867 PMCID： PMC8728252 DOI： 10.1093/nar/gkab988

Source DB: PubMed Journal: Nucleic Acids Res ISSN： 0305-1048 Impact factor: 16.971

INTRODUCTION

The structure of biological macromolecules and their complexes is invaluable for understanding their functions (1,2). These structures allow researchers to infer atomic-level mechanisms of biological systems and enable them to modulate biological processes through, for example, structure-based drug design, synthetic biology, and protein engineering (3–5). For 50 years, the Protein Data Bank (PDB), managed by the worldwide Protein Data Bank consortium (wwPDB) (6), has served as the global archive for experimentally determined structures. To date, the PDB contains over 180 000 structures of 55 000 distinct proteins, with around 12 000 new PDB entries deposited annually (7). Advances in structure determination promise that the repertoire of known structures will continue to grow, mainly owing to the widespread application of single-particle cryo-electron microscopy yielding high-resolution structures. Nevertheless, the known sequence space is expanding even faster (8); only 0.27% of protein sequences in the Universal Protein Data Resource (UniProt) has structural representations in the PDB. This gap between the knowledge of sequences and structures will continue to grow (6,9–11). Although high-accuracy predicted models made public recently have the potential to expand the structural coverage of the sequence space massively, these methods still have limitations in modelling mutant structures and assemblies (12–14). While macromolecular structures are invaluable, they often need to be interpreted using additional structural and functional annotation layers to answer specific biological questions (15). For example, annotating structures with druggable surface pockets, molecular channels or identifying residues critical for stabilising an interaction interface can give more in-depth insights than 3D coordinates alone (16–18). Many specialist data resources, and scientific software provide such annotations, and their number keeps growing (15). However, while having access to such a rich ecosystem of annotations empowers the scientific community, it is becoming increasingly difficult to track and combine these data. While most annotations are openly accessible, they may not be easily findable, and the lack of standard data formats often hinders interoperability and reusability. We established PDBe-KB in 2018 to make these annotations FAIR (i.e. findable, accessible, interoperable, reusable) through a global collaboration between PDBe and leading specialist data providers and scientific software developers (15). This collaborative consortium aims to place macromolecular structures in their biological context by providing FAIR access to structural, functional and biophysical annotations of protein, nucleic acid and small-molecule structures in the PDB. PDBe-KB is an open consortium transparently governed by a collaboration guideline (https://pdbe-kb.org/guidelines). Contributing data resources are requested to provide their PDB residue or PDB chain annotations in a data format defined and maintained by the consortium (https://github.com/PDBe-KB/funpdbe-schema). This data exchange format evolves according to the partner resources' requirements, and the consortium reviews the specification during annual PDBe-KB workshops. In addition, PDBe-KB makes the integrated annotations openly accessible to the scientific community through file transfer protocol (ftp://ftp.ebi.ac.uk/pub/databases/pdbe-kb), programmatic access (https://pdbe-kb.org/graph-api) and web pages (https://pdbe-kb.org/proteins). As a result, the consortium grew from 18 to 30 collaborating data resources from 11 different countries in the past two years (Table 1).

Table 1.

Data resources and scientific software contributing annotations to PDBe-KB

Partner resource	Resource leader	Type of annotations	Country
14–3–3-Pred (19)	G. Barton	Binding site predictions	GBR
3D Complex (20)	E. D. Levy, S. Dey	Interaction interfaces	ISR
3DLigandSite (21)	M. Wass	Binding site predictions	GBR
AKID (22)	M. Helmer-Citterich	Kinase-target predictor	ITA
Arpeggio (23)	T. Blundell	Ligand interactions	GBP
CamKinet (in preparation)	M. Kumar	Curated post-translational modification sites	DEU
canSAR (16)	B. al-Lazikani	Druggable pocket predictions	GBR
CATH-FunSites (24)	C. Orengo	Functional site predictions	GBR
ChannelsDB (18)	R. Svobodova, K.Berka	Molecular channels	CZE
COSPI-Depth (25)	M. S. Madhusudhan	Residue depth	IND
Covalentizer (26) (new)	N. London	Predicted covalent binding molecules	ISR
DynaMine (27)	W. Vranken	Backbone flexibility predictions	BEL
ELM (28)	T. Gibson	Short linear motifs	DEU
EMV (29) (new)	J. R. Macias	EM validation annotations from 3DBionotes	ESP
EVcouplings (30) (new)	D. Marks	Covariations	USA
FireProt DB (31) (new)	J. Damborsky	Effects of mutations on protein stabilities	CZE
FoldX (32)	L. Serrano	Energetic consequences of mutations	ESP
FrustratometeR (33)(new)	R. Gonzalo Parra	Energetic local frustration	ESP
KinCore (34) (new)	R. Dunbrack	Conformational annotations	USA
KnotProt (35) (new)	J. Sulkowska	Topology annotations	POL
M-CSA (36)	J. Thornton	Curated catalytic sites	GBR
MetalPDB (37)	C. Andreini, A. Rosato	Curated metal-binding sites	ITA
Missense3D (38)	M. Sternberg	Mutations in human proteome	GBR
MobiDB (39) (new)	S. Tosatto	Consensus disorder predictions	ITA
P2rank (40)	D. Hoksza	Binding site predictions	CZE
POPS (41)	F. Fraternali	Solvent accessibility	GBR
ProKinO (42)	N. Kannan	Curated post-translational modification sites	USA
Scop3P (43) (new)	L. Martens, W. Vranken	Phosphorylation sites	BEL
SKEMPI (44) (new)	J. Fernandez-Recio	Thermodynamic effects of mutations	ESP
WEBnm@ (45) (new)	N. Reuter	Flexibility predictions	NOR

PDBe-KB integrates annotations from 30 partner resources who provide functional, biophysical and biochemical annotations.

Data resources and scientific software contributing annotations to PDBe-KB PDBe-KB integrates annotations from 30 partner resources who provide functional, biophysical and biochemical annotations.

IMPLEMENTATION

The infrastructure of PDBe-KB consists of four main components. These are (i) a deposition system for annotations; (ii) a graph database that integrates annotations with the core PDB data; (iii) a rich set of application programming interface (API) endpoints that provide access to the data; (iv) a set of reusable web components that are combined to create the PDBe-KB aggregated views (Figure 1).

Figure 1.

Schematic overview of the PDBe-KB infrastructure. PDBe-KB partner resources convert their annotations to a predefined JSON format and transfer these file sets via FTP. Weekly data validation and integration processes parse and load the annotations into the PDBe graph database. A rich set of API endpoints expose the data and power the PDBe-KB aggregated views. Researchers can access the data by setting up a local instance of the graph database, using the API endpoints, or visiting the aggregated view pages.

Data deposition

The data deposition system changed significantly to ensure scalability as more partner data resources joined the consortium. Data providers are required to convert their annotations to JavaScript Object Notation (JSON) files, according to the data exchange format specification, which is available at https://github.com/PDBe-KB/funpdbe-schema. Collaborators then copy their JSON files to private FTP areas provided by PDBe-KB, hosted at EMBL-EBI in Hinxton. A weekly running data processing pipeline parses, validates and integrates the data from these JSON files into the PDBe graph database. When displaying or providing access to annotations from any PDBe-KB partner resources, we provide direct links the users can follow to find the original data set from the corresponding database or scientific software.

Data access

The PDBe graph database is an up-to-date knowledge graph that contains the latest PDB data, linked to the corresponding UniProt accessions and integrated with structural, functional and biophysical annotations. It is implemented in Neo4j v3.5 and has over 1 billion nodes and 1.5 billion edges. The database is openly accessible at https://pdbe-kb.org/graph-download, and users can install it in-house to use it as a research tool for data mining. It requires ∼0.5TB of local storage space, preferably on an SSD drive with a recommended 6 GB RAM and eight cores. The PDBe aggregated API provides programmatic access to all the aspects of the data contained within the graph database. As we integrated new annotations into PDBe-KB, we have expanded the API and currently provide over 90 different API endpoints. We have described these endpoints and provided use case examples elsewhere (46). The API is available at https://pdbe-kb.org/graph-api.

Web components library

PDBe-KB web pages use modular web components which can be reused and customised easily. We have created an open-source library for these components so that data service developers can use them as plugins for visualising structural data. In addition, they provide built-in support for the PDBe aggregated API, allowing developers to display data from PDBe-KB conveniently. We implemented the web components using the AngularJS framework. They are available and freely reusable from GitHub at https://github.com/PDBe-KB?q=component.

New features on the aggregated views of proteins

We continuously develop the PDBe-KB pages, displaying all the available structural information for a protein, keyed on a UniProt accession. We call these pages aggregated views of proteins. In addition, we have added several features, in particular: (i) a superposition service to visualise protein chains clustered by structural similarity; (ii) a bulk download service that provides easy access to all the coordinate files, validation reports, sequences for a protein of interest; (iii) a section dedicated to processed proteins and (iv) annotations for small molecules and macromolecular interaction partners. We have designed a weekly process to generate superposed UniProt segments for the whole PDB archive. We described the details of the data process on the public Wiki pages of PDBe-KB at https://github.com/PDBe-KB/pdbe-kb-manual/wiki/Superposition. The superposed coordinates are made available on the aggregated views of proteins, where clustered, superposed PDB chains can be displayed using the interactive 3D molecular viewer, Mol* (47), by clicking on the ‘view structure clusters’ buttons. We also provide a unique superposition view that displays all the ligand molecules overlaid on representative chains from superposition clusters (Figure 2). In the example below, we display the 3C-like proteinase nsp5 of SARS-CoV-2. By overlaying all the bound small molecules, researchers can identify a frequently populated binding pocket.

Figure 2.

Superposition of protein chains and ligand molecules. The aggregated views of proteins provide access to superposed protein segments and offer a display mode that overlays all the observed ligands on representative chains from superposition clusters. The figure displays the ligand superposition of all the available small molecules in PDB structures of the 3C-like proteinase nsp5 of SARS-CoV-2. Previously, it was cumbersome to download all the structural and functional data available for a protein of interest from its aggregated view. We have recently designed a download service that has a graphical user interface to enable users to download coordinates (archive mmCIF, updated mmCIF and PDB format), sequences (FASTA format) and validation data (Figure 3). The updated mmCIF is based on the archive mmCIF file. Both files follow the same PDBx/mmCIF dictionary. The updated mmCIF has two major differences from the archive mmCIF file: (i) selected data values are cleaned up to standardise the enumerations; (ii) additional data categories and items are added as required to support PDBe data out activities and external users. An example of a standardised enumeration is the values in _exptl.method which is standardised and changed from uppercase to title case. Another example of additional categories is the _chem_comp_bond which defines the expected bond order for every bond in every component in the PDB entry. Users can download these data for all the PDB entries for a protein of interest or only those containing small molecules or macromolecular complexes. Users can also interact with the download service programmatically through a set of API endpoints. Documentation of this API is available at https://www.ebi.ac.uk/pdbe/download/api/docs.

Figure 3.

Bulk data download service. The aggregated views of proteins provide a graphical user interface to a new bulk data download service which enables researchers to download all the coordinates, sequences and validation data available for a protein of interest. In response to the COVID-19 pandemic, we developed a unique set of web pages focused on the proteins of SARS-CoV-2 in early 2020. However, it became apparent that we could improve the display of polyproteins. In particular, the aggregated views were not highlighting the mature, processed proteins, and users could not zoom in on these proteins. To address this, we have integrated information on processed proteins from UniProt and have added a new section that highlights the segments they occupy on the full-length polyprotein sequences (Figure 4). We now also enable users to view pages specifically for a particular processed/mature protein, using the PRO identifiers from UniProt. For example, https://www.ebi.ac.uk/pdbe/pdbe-kb/proteins/PRO_0000449633 is the dedicated page of the 2'-O-methyltransferase nsp16 of SARS-CoV-2, which is a processed protein from Replicase polyprotein 1ab (UniProt accession P0DTD1). These pages are available for all the processed/mature proteins with known structures, not only viral proteins.

Figure 4.

Processed proteins section. The aggregated views of proteins now include a section highlighting all the mature, processed proteins for a polyprotein. In addition, users can click on the green boxes to view the 3D structures using Mol*, and they can navigate to dedicated processed proteins pages by clicking on the ‘view page’ button. While experimentally determining structures remains a costly and labour-intensive endeavour, there have been significant advances in the field of structural predictions. Researchers increasingly deploy Artificial Intelligence (AI) techniques to predict a protein's structure computationally from its amino-acid sequence alone (12,13,48). While the aggregated views of proteins already provided an overview of all the protein structures available in the PDB, we have expanded the scope to include predicted models from data providers such as SWISS-MODEL and AlphaFold DB (14,49) (Figure 5). Displayed in ProtVista, users can compare the structural coverage of the protein sequences and directly download predicted models.

Figure 5.

Predicted models of a protein of interest. The aggregated views of proteins now provide an overview of available predicted models from data resources such as AlphaFold DB and SWISS-MODEL.

Predicted models of a protein of interest. The aggregated views of proteins now provide an overview of available predicted models from data resources such as AlphaFold DB and SWISS-MODEL. Most of the annotations provided by the PDBe-KB partner resources focus on amino acid residues and their functions or biophysical characteristics, yet PDBe-KB has information also on molecular entities such as small molecules or macromolecular interaction partners (Figure 6). For example, using a previously developed semi-automated annotation process, we can now flag small molecules as enzyme cofactors and cofactor-like molecules (50). We display this information on the sequence feature viewer, ProtVista and ligand gallery. Similarly, we have weekly processes for identifying and annotating peptides and antibody structures, which we display in the macromolecular interactions section.

Figure 6.

Ligand annotations. The aggregated views of proteins now display annotations for ligand molecules based on a cofactor data pipeline. Similarly, we annotate peptides and antibodies in the macromolecular interactions section.

Training and tutorials

Working together with the Training team of EMBL-EBI, we actively participated in training courses and continued to create training materials and tutorials that describe the new functionalities and changes to PDBe-KB web services and web pages. Recently, we created a set of tutorials that encompass programmatic access to PDBe data, data processing packages and web components that visualise the structures and annotations. These tutorials are available at https://pdbeurope.github.io/api-webinars/index.html.

DISCUSSION

PDBe-KB expands the structural, functional, and biophysical annotations of molecular structure data according to its long-term goals. By allowing integrated and FAIR access to these annotations, researchers in academia and industry can take advantage of the rich ecosystem of specialist data resources and scientific software and efficiently collate data to answer specific biological questions. Since we established PDBe-KB in 2018, the collaboration grew, integrating data from 30 partner resources across 11 countries, providing over 1.2 billion residue-level annotations. Furthermore, PDBe-KB continues to be one of the main activities of the ELIXIR 3D-BioInfo community, which brings together researchers, structural bioinformatics developers and data providers to discuss and strive for data FAIRness, benefitting the broader scientific community (51). While we plan to further improve the aggregated views of proteins, we are also developing novel aggregated web pages for ligands, providing comprehensive structural and functional information on all the observed small molecules in the PDB archive. Finally, we would like to extend an invitation to all the data providers and scientific software developers to join the consortium and increase user exposure through this community-driven data-sharing platform and knowledge base. In conclusion, PDBe-KB keeps evolving and makes structural data and their structural, functional, and biophysical annotations more accessible to the scientific community, reaching over 340 000 users annually either through their usage of the rich set of programmatic access endpoints or by their visits to the PDBe-KB aggregated views pages.

DATA AVAILABILITY

PDBe-KB is available at https://pdbe-kb.org. Individual, protein-focused pages per UniProt accessions are available at https://pdbe-kb.org/proteins/P0DTD1. Documentation of the consortium members is available at https://github.com/PDBe-KB/pdbe-kb-manual/wiki. Users can download the graph database from https://pdbe-kb.org/graph-download, and users can find the aggregated API at https://pdbe-kb.org/graph-api. The PDBe-KB web component library is public at https://github.com/PDBe-KB?q=component. Finally, we make all the annotations available in JSON format from ftp://ftp.ebi.ac.uk/pub/databases/pdbe-kb.

50 in total

Review 1. Artificial intelligence techniques for integrative structural biology of intrinsically disordered proteins.

Authors: Arvind Ramanathan; Heng Ma; Akash Parvatikar; S Chakra Chennubhotla
Journal: Curr Opin Struct Biol Date: 2021-01-06 Impact factor: 6.809

2. 3DLigandSite: predicting ligand-binding sites using similar structures.

Authors: Mark N Wass; Lawrence A Kelley; Michael J E Sternberg
Journal: Nucleic Acids Res Date: 2010-05-31 Impact factor: 16.971

3. WEBnm@ v2.0: Web server and services for comparing protein flexibility.

Authors: Sandhya P Tiwari; Edvin Fuglebakk; Siv M Hollup; Lars Skjærven; Tristan Cragnolini; Svenn H Grindhaug; Kidane M Tekle; Nathalie Reuter
Journal: BMC Bioinformatics Date: 2014-12-30 Impact factor: 3.169

4. FoldX 5.0: working with RNA, small molecules and a new graphical interface.

Authors: Javier Delgado; Leandro G Radusky; Damiano Cianferoni; Luis Serrano
Journal: Bioinformatics Date: 2019-10-15 Impact factor: 6.937

5. 3DBIONOTES v3.0: crossing molecular and structural biology data with genomic variations.

Authors: Joan Segura; Ruben Sanchez-Garcia; C O S Sorzano; J M Carazo
Journal: Bioinformatics Date: 2019-09-15 Impact factor: 6.937

6. Accurate prediction of protein structures and interactions using a three-track neural network.

Authors: Minkyung Baek; Frank DiMaio; Ivan Anishchenko; Justas Dauparas; Sergey Ovchinnikov; Gyu Rie Lee; Jue Wang; Qian Cong; Lisa N Kinch; R Dustin Schaeffer; Claudia Millán; Hahnbeom Park; Carson Adams; Caleb R Glassman; Andy DeGiovanni; Jose H Pereira; Andria V Rodrigues; Alberdina A van Dijk; Ana C Ebrecht; Diederik J Opperman; Theo Sagmeister; Christoph Buhlheller; Tea Pavkov-Keller; Manoj K Rathinaswamy; Udit Dalwadi; Calvin K Yip; John E Burke; K Christopher Garcia; Nick V Grishin; Paul D Adams; Randy J Read; David Baker
Journal: Science Date: 2021-07-15 Impact factor: 47.728

7. MobiDB: intrinsically disordered proteins in 2021.

Authors: Damiano Piovesan; Marco Necci; Nahuel Escobedo; Alexander Miguel Monzon; András Hatos; Ivan Mičetić; Federica Quaglia; Lisanna Paladin; Pathmanaban Ramasamy; Zsuzsanna Dosztányi; Wim F Vranken; Norman E Davey; Gustavo Parisi; Monika Fuxreiter; Silvio C E Tosatto
Journal: Nucleic Acids Res Date: 2021-01-08 Impact factor: 16.971

8. Missense3D-DB web catalogue: an atom-based analysis and repository of 4M human protein-coding genetic variants.

Authors: Tarun Khanna; Gordon Hanna; Michael J E Sternberg; Alessia David
Journal: Hum Genet Date: 2021-01-27 Impact factor: 5.881

9. PDBe Aggregated API: Programmatic access to an integrative knowledge graph of molecular structure data.

Authors: Sreenath Nair; Mihály Váradi; Nurul Nadzirin; Lukáš Pravda; Stephen Anyango; Saqib Mir; John Berrisford; David Armstrong; Aleksandras Gutmanas; Sameer Velankar
Journal: Bioinformatics Date: 2021-06-03 Impact factor: 6.937

10. Protein Data Bank: the single global archive for 3D macromolecular structure data.

Authors:
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971

8 in total

1. GWYRE: A Resource for Mapping Variants onto Experimental and Modeled Structures of Human Protein Complexes.

Authors: Sukhaswami Malladi; Harold R Powell; Alessia David; Suhail A Islam; Matthew M Copeland; Petras J Kundrotas; Michael J E Sternberg; Ilya A Vakser
Journal: J Mol Biol Date: 2022-04-27 Impact factor: 6.151

2. 3DLigandSite: structure-based prediction of protein-ligand binding sites.

Authors: Jake E McGreig; Hannah Uri; Magdalena Antczak; Michael J E Sternberg; Martin Michaelis; Mark N Wass
Journal: Nucleic Acids Res Date: 2022-04-12 Impact factor: 19.160

3. Experiences From Developing Software for Large X-Ray Crystallography-Driven Protein-Ligand Studies.

Authors: Nicholas M Pearce; Rachael Skyner; Tobias Krojer
Journal: Front Mol Biosci Date: 2022-04-11

4. The 2022 Nucleic Acids Research database issue and the online molecular biology database collection.

Authors: Daniel J Rigden; Xosé M Fernández
Journal: Nucleic Acids Res Date: 2022-01-07 Impact factor: 16.971

Review 5. Role of genomics in combating COVID-19 pandemic.

Authors: K A Saravanan; Manjit Panigrahi; Harshit Kumar; Divya Rajawat; Sonali Sonejita Nayak; Bharat Bhushan; Triveni Dutt
Journal: Gene Date: 2022-03-04 Impact factor: 3.688

6. FGDB: a comprehensive graph database of ligand fragments from the Protein Data Bank.

Authors: Daniele Toti; Gabriele Macari; Enrico Barbierato; Fabio Polticelli
Journal: Database (Oxford) Date: 2022-06-27 Impact factor: 4.462

7. A Versatile Class of 1,4,4-Trisubstituted Piperidines Block Coronavirus Replication In Vitro.

Authors: Sonia De Castro; Annelies Stevaert; Miguel Maldonado; Adrien Delpal; Julie Vandeput; Benjamin Van Loy; Cecilia Eydoux; Jean-Claude Guillemot; Etienne Decroly; Federico Gago; Bruno Canard; Maria-Jose Camarasa; Sonsoles Velázquez; Lieve Naesens
Journal: Pharmaceuticals (Basel) Date: 2022-08-18

8. PDBe and PDBe-KB: Providing high-quality, up-to-date and integrated resources of macromolecular structures to support basic and applied research and education.

Authors: Mihaly Varadi; Stephen Anyango; Sri Devan Appasamy; David Armstrong; Marcus Bage; John Berrisford; Preeti Choudhary; Damian Bertoni; Mandar Deshpande; Grisell Diaz Leines; Joseph Ellaway; Genevieve Evans; Romana Gaborova; Deepti Gupta; Aleksandras Gutmanas; Deborah Harrus; Gerard J Kleywegt; Weslley Morellato Bueno; Nurul Nadzirin; Sreenath Nair; Lukas Pravda; Marcelo Querino Lima Afonso; David Sehnal; Ahsan Tanweer; James Tolchard; Charlotte Abrams; Roisin Dunlop; Sameer Velankar
Journal: Protein Sci Date: 2022-10 Impact factor: 6.993

8 in total