Localpdb - a Python package to manage protein structures and their annotations.

Jan Ludwiczak¹, Aleksander Winski¹, Stanislaw Dunin-Horkawicz¹.

Abstract

MOTIVATION: The wealth of protein structures collected in the Protein Data Bank enabled large-scale studies of their function and evolution. Such studies, however, require the generation of customized data sets combining the structural data with miscellaneous accessory resources providing functional, taxonomic, and other annotations. Unfortunately, the functionality of currently available tools for the creation of such data sets is limited and their usage frequently requires laborious surveying of various data sources and resolving inconsistencies between their versions.
RESULTS: To address this problem, we developed localpdb, a versatile Python library for the management of protein structures and their annotations. The library features a flexible plugin system enabling seamless unification of the structural data with diverse auxiliary resources, full version control, and powerful functionality for creating highly customized data sets. localpdb can be used in a wide range of bioinformatic tasks, in particular those involving large-scale protein structural analyses and machine learning.
AVAILABILITY: localpdb is freely available at https://github.com/labstructbioinf/localpdb. Documentation along with the usage examples can be accessed at https://labstructbioinf.github.io/localpdb/.

Year: 2022 PMID: 35199148 PMCID: PMC9048648 DOI: 10.1093/bioinformatics/btac121

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931

1 Introduction

The size of the Protein Data Bank (PDB) has been growing steadily over the past decades (Burley ), encouraging the development of computational tools enabling classification (Andreeva ; Cheng ; Dawson ) and annotation (Dana ) of the protein structures it encompasses. The wealth of the collected data has opened up many research opportunities, ranging from studies focused on proteins of a particular function or evolutionary position to comprehensive surveys aiming at capturing the general aspects of the ‘protein universe’ (Alva ; Nepomnyachiy ). However, these innovations came at the cost of the dispersion of data sources, whose integration began to require expert knowledge and the tedious process of mapping structures to the corresponding sequence and functional information.
Another problem frequently faced during research involving multiple PDB structures is related to the management of their versions: it is not uncommon for PDB structures to be updated, entirely changed or even removed after deposition. Similarly, the associated tools integrating the classifications and annotations, such as the PDBe-KB (Varadi ), RCSB (Rose ), SIFTS (Dana ), Proteo3Dnet (Postic ) or ECOD (Cheng ), are updated regularly to capture new structures or fix errors. Consequently, the datasets obtained prior to a given study may differ substantially from those generated later. This, in turn, makes the recreation of data at a given time point very hard, negatively impacting research reproducibility. The difficulties outlined above force researchers to manually collect and manage their datasets, bringing a risk of omitting important information, for example, during the collection of protein structures originating from a given taxonomic, evolutionary or functional group. Also, even cosmetic changes to a few structures, their identifiers or associated data may cause hard-to-track internal inconsistencies that need to be amended by hand.
These research-hampering issues were partially alleviated by the development of computational tools such as the PDB module (Hamelryck and Manderick, 2003) of the biopython package (Cock ), the CCPDB webserver (Agrawal ; Singh ) or PDB application programming interfaces (API) (Gilpin, 2015; Rose ). Unfortunately, these solutions have limitations, such as the lack of version management, the impossibility of building highly customized datasets based on complex queries, and, last but not least, the unavailability of high-level integration with the programming tools typically used to carry out bioinformatics analyses.
Modern data analysis pipelines rely on general developments in the Python and R programming languages. In particular, the so-called data frame structures, originating from the R language, have become increasingly popular among Python developers, owing to the development of the pandas package (McKinney, 2010). The main advantage of using data frames in data analyses stems from their plasticity, that is, the possibility to query, sort, search, merge or partition them with a handful of simple commands. The usefulness of pandas data frames inspired the development of extensions devoted to the analysis of biological data. For example, the biopandas (Raschka, 2017) library enables representing individual PDB structures as data frames, thus greatly facilitating their investigation at the atomic level. Another tool with pandas under the hood is rstoolbox (Bonet ), a Python library for the analysis of large-scale structural data, mostly in the context of protein design tasks.
The limitations of the currently available tools motivated us to develop localpdb, a lightweight Python library that utilizes the versatility of pandas DataFrames to offer a simple programming framework for handling a local copy of the PDB and auxiliary resources, such as ECOD (Schaeffer ), SIFTS (Dana ) or DSSP (Touw ), thus enabling the creation of complex and reproducible bioinformatics workflows. The requirements to use localpdb are minimal and the package will run on any modern PC with enough disk space (around 100 GB if the user decides to store PDB structures in both the PDB and mmCIF formats). In the following section, we provide a concise overview of the package features and functionalities. The full documentation can be accessed at https://labstructbioinf.github.io/localpdb/.

2 Overview of the localpdb package

The overview of the localpdb package is shown in Figure 1. In its basic functionality, it allows creating a local mirror image of the PDB (in either PDB or mmCIF formats) accessible via the PDB object and its entries and chains attributes (both being pandas DataFrames) that provide direct access to the whole structures and their components, respectively.
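The relationship between the two tables can be pictured with a small, dependency-free sketch. Plain Python dictionaries stand in for the pandas DataFrames here, and the identifiers, column names, values and helper functions are all invented for illustration; none of this is the localpdb API. Because every chain belongs to exactly one entry, a selection made at the entry level can be carried over to the chain level.

```python
# Hypothetical stand-ins for the `entries` and `chains` tables.
# All identifiers and values are placeholders, not real PDB data.
entries = {
    "1abc": {"method": "diffraction", "resolution": 1.8},
    "2xyz": {"method": "diffraction", "resolution": 3.2},
    "3def": {"method": "NMR", "resolution": None},
}

# Each chain row points back to its parent entry through the "pdb" field.
chains = {
    "1abc_A": {"pdb": "1abc", "sequence": "MKTAYIAK"},
    "1abc_B": {"pdb": "1abc", "sequence": "MKTAYIAK"},
    "2xyz_A": {"pdb": "2xyz", "sequence": "GSHMLEDP"},
    "3def_A": {"pdb": "3def", "sequence": "MADEEKLP"},
}


def select_entries(entries, max_resolution):
    """Keep entries whose resolution is known and not worse than the cutoff."""
    return {
        pdb_id: row
        for pdb_id, row in entries.items()
        if row["resolution"] is not None and row["resolution"] <= max_resolution
    }


def chains_of(chains, selected_entries):
    """Keep only the chains whose parent entry survived the selection."""
    return {
        chain_id: row
        for chain_id, row in chains.items()
        if row["pdb"] in selected_entries
    }


high_res = select_entries(entries, max_resolution=2.5)
high_res_chains = chains_of(chains, high_res)
print(sorted(high_res_chains))
```

With real DataFrames the same selection would typically be a boolean filter on one table followed by a membership test on the other; the dictionaries only make the parent-child link explicit.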

Fig. 1.

General overview of the features and functionalities of the localpdb package. At its core, localpdb syncs the raw PDB data and entries (in PDB and mmCIF formats) and makes them available to the user through the DataFrame objects. With the weekly releases of new data, the local files can be updated; however, access to the previous versions is retained through the tracking mechanism. The functionalities of localpdb can be further extended with the configurable plugin system, which allows fetching and tracking updates from additional data sources. localpdb also provides access to the RCSB search API that can be used for complex queries based on multiple criteria. Finally, each version of localpdb can be independently recreated on a different machine or by other users by exporting a small configuration file.

With the weekly releases of the PDB, the local copy can be updated to account for added, modified or outdated entries. Importantly, the update procedure is optional and does not have to follow a weekly routine. Another important feature of localpdb is the ability to retain access to the previous versions of the data even after performing multiple updates. This enables work on multiple projects concurrently without the need to store large separate datasets for each of them. Moreover, data corresponding to any of the localpdb versions can be seamlessly shared and independently recreated using the relatively small, exportable configuration file.
The basic functionality of the localpdb package is extendable with a flexible plugin system. The plugins enable augmenting the structural data with various annotations and making them available to the user in a uniform, searchable DataFrame format. We implemented several plugins featuring access to various resources such as domain databases (SCOP, ECOD, CATH and Pfam) or taxonomic and EC number mappings.
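One way to picture what a plugin contributes is as an extra annotation table keyed by the same identifiers as the chains table, which is then joined onto it. The sketch below uses plain dictionaries in place of DataFrames; the `attach_annotations` helper, the column names and the domain labels are hypothetical and do not reflect the actual plugin interface.

```python
def attach_annotations(chains, annotations, column, missing=None):
    """Left-join an annotation mapping onto a chains table.

    Chains without an annotation receive the `missing` value, so no
    chain is dropped by the join.
    """
    joined = {}
    for chain_id, row in chains.items():
        new_row = dict(row)  # copy so the input table is left unchanged
        new_row[column] = annotations.get(chain_id, missing)
        joined[chain_id] = new_row
    return joined


# Placeholder chain table and a made-up domain annotation source.
chains = {
    "1abc_A": {"pdb": "1abc", "length": 120},
    "1abc_B": {"pdb": "1abc", "length": 118},
    "2xyz_A": {"pdb": "2xyz", "length": 240},
}
domain_labels = {"1abc_A": "domain_family_1", "2xyz_A": "domain_family_2"}

annotated = attach_annotations(chains, domain_labels, column="domain")

# The annotation column can now be searched like any other column.
with_domain = [cid for cid, row in annotated.items() if row["domain"] is not None]
print(with_domain)
```

Keeping unannotated chains (a left join) rather than dropping them is a design choice of this sketch: it lets the user decide later whether missing annotations should exclude a chain.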
While these plugins provide annotations related to the whole structure, chain or their segments, others also enable per-residue annotations. For example, the current release of the localpdb package includes plugins for DSSP (Touw ), a program for the annotation of secondary structure elements, and Socket (Walshaw and Woolfson, 2001), a tool for the annotation of coiled-coil domains. Like the core of the localpdb package, the plugins are also under the control of the versioning system, which is especially important in the context of periodically updated resources such as ECOD or SIFTS.
Finally, in parallel to the plugin system, the localpdb package features access to the recently released RCSB search API (Rose ), allowing for queries based on text descriptors, sequence similarity metrics and sequence or structural motifs.
In an effort to simplify the creation of datasets integrating information from multiple sources, we implemented a filtering system that adjusts all the active DataFrames once a selection is performed on any of them. For example, performing a selection on the entries DataFrame automatically adjusts all the associated DataFrames, such as chains and those related to the plugins.
To fully demonstrate the applicability of the package, we present two advanced use cases that accompany the manuscript and are available at the documentation webpage (https://labstructbioinf.github.io/localpdb/). The first example reproduces the ensemble analysis of the publicly available HIV protease structures to infer the conformational dynamics of this enzyme (Katebi ). We show that coupling localpdb with other common structural bioinformatics packages enables the recreation of this complex, multistep analysis purely in Python with a handful of lines of code.
The second example focuses on machine learning applications and describes the steps performed to derive the dataset used to train our coiled-coil domain prediction algorithm, DeepCoil (Ludwiczak ). The presented approach can be easily adapted to facilitate dataset creation for similar machine learning tasks in which sequence similarity control and minority class oversampling are essential.
Finally, the localpdb package has also been tested during various projects conducted in our group; for example, it was used to construct and maintain specialized datasets used to train machine learning models allowing the prediction of protein-ligand interactions (Kamiński ) and to develop a pipeline for the annotation of coiled-coil motifs in protein structures (Szczepaniak ).
In sum, the localpdb package unifies and extends a variety of features offered by other tools by providing robust versioning and an easily extendable plugin system. We therefore envision that the localpdb package can be applied to most structural bioinformatics tasks, in particular those related to building complex and reproducible workflows around the PDB data and deriving datasets for machine learning purposes.
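The version tracking behind the reproducible workflows mentioned above can be pictured as replaying per-release change logs up to a requested version. This is only an illustration of the idea under simplifying assumptions: the `snapshot` function, the log layout, the entry identifiers and the release numbers are hypothetical, and the text does not describe how localpdb stores its history internally.

```python
# Hypothetical per-release change logs, keyed by release date (YYYYMMDD).
releases = {
    20220101: {"added": {"1abc", "2xyz"}, "removed": set()},
    20220108: {"added": {"3def"}, "removed": set()},
    20220115: {"added": {"4ghi"}, "removed": {"2xyz"}},
}


def snapshot(releases, version):
    """Return the set of entry ids available at `version`.

    Change logs are replayed in chronological order, so earlier states
    stay recoverable no matter how many later updates were applied.
    """
    if version not in releases:
        raise KeyError(f"unknown version: {version}")
    available = set()
    for release in sorted(releases):
        if release > version:
            break
        available |= releases[release]["added"]
        available -= releases[release]["removed"]
    return available


print(sorted(snapshot(releases, 20220108)))
print(sorted(snapshot(releases, 20220115)))
```

Storing only the changes per release, rather than a full copy of each release, is what would keep such a history small enough to share; whether localpdb uses this exact scheme is not stated in the text.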

23 in total

1. PyPDB: a Python API for the Protein Data Bank.

Authors: William Gilpin
Journal: Bioinformatics Date: 2015-09-14 Impact factor: 6.937

2. A galaxy of folds.

Authors: Vikram Alva; Michael Remmert; Andreas Biegert; Andrei N Lupas; Johannes Söding
Journal: Protein Sci Date: 2010-01 Impact factor: 6.725

3. DeepCoil-a fast and accurate prediction of coiled-coil domains in protein sequences.

Authors: Jan Ludwiczak; Aleksander Winski; Krzysztof Szczepaniak; Vikram Alva; Stanislaw Dunin-Horkawicz
Journal: Bioinformatics Date: 2019-08-15 Impact factor: 6.937

4. Global view of the protein universe.

Authors: Sergey Nepomnyachiy; Nir Ben-Tal; Rachel Kolodny
Journal: Proc Natl Acad Sci U S A Date: 2014-07-28 Impact factor: 11.205

5. The use of experimental structures to model protein dynamics.

Authors: Ataur R Katebi; Kannan Sankar; Kejue Jia; Robert L Jernigan
Journal: Methods Mol Biol Date: 2015

6. Proteo3Dnet: a web server for the integration of structural information with interactomics data.

Authors: Guillaume Postic; Jessica Andreani; Julien Marcoux; Victor Reys; Raphaël Guerois; Julien Rey; Emmanuelle Mouton-Barbosa; Yves Vandenbrouck; Sarah Cianferani; Odile Burlet-Schiltz; Gilles Labesse; Pierre Tufféry
Journal: Nucleic Acids Res Date: 2021-07-02 Impact factor: 16.971

7. ccPDB: compilation and creation of data sets from Protein Data Bank.

8. ECOD: an evolutionary classification of protein domains.

9. A library of coiled-coil domains: from regular bundles to peculiar twists.

10. Protein Data Bank: the single global archive for 3D macromolecular structure data.