Leandro G Radusky, Luis Serrano.
Abstract
Recent years have seen an increase in the number of structures available, not only for new proteins but also for the same protein crystallized with different molecules and proteins. While protein design software has proven successful in designing and modifying proteins, it can also be overly sensitive to small conformational differences between structures of the same protein. To cope with this, we introduce pyFoldX, a Python library that allows the integrative analysis of structures of the same protein using FoldX, an established force field and modelling software. The library offers new functionality for handling different structures of the same protein, an improved molecular parametrization module, and easy integration with the data analysis ecosystem of the Python programming language.
AVAILABILITY: pyFoldX relies on the FoldX software for energy calculations and modelling. FoldX can be downloaded upon registration at http://foldxsuite.crg.eu/, and its licence is free of charge for academics. The pyFoldX library is open-source. Full details on installation, tutorials covering the library functionality, and the scripts used to generate the data and figures presented in this paper are available at https://github.com/leandroradusky/pyFoldX.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Year: 2022 PMID: 35176149 PMCID: PMC9004634 DOI: 10.1093/bioinformatics/btac072
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. (A) pyFoldX structure-handling capabilities. Single structures can be instantiated from different formats, while ensembles of structures of the same protein can be instantiated from the protein's UniProt accession. FoldX commands can be executed on structures and ensembles, returning pandas dataframes with energies and, if applicable, objects with the transformed structures. (B) Example of parametrization of a glucose molecule with the pyFoldX paramx package. (C) Description of the analysed mutation dataset. A random forest classifier was trained on 80% of the Missense3D-DB mutations to estimate the probability that a mutation belongs to the ‘pathogenic’ category. The remaining 20% were used for testing and were analysed using both the structure indicated in the database and the ensemble of good-resolution structures for these proteins. (D) Histograms of the probability of belonging to the ‘pathogenic’ category given by the classifier for mutations mapped onto their best structure by Missense3D-DB (left) and of the mean of the probabilities over all good-resolution crystal structures in the ensemble (right). (E) ROC curves of mutation class prediction by the classifier, taking into account the best crystal (orange lines) or the mean prediction over the crystals in the ensemble (blue lines). Thin lines: mutations classified as pathogenic (Ppathogenic > 0.5) or benign (Ppathogenic ≤ 0.5). Thick lines: mutations with no clear prediction (0.4 < Ppathogenic < 0.7) are discarded. Overall, predictions are better when ensembles are considered, and high accuracy is achieved (AUC = 0.9) when unclear predictions are discarded from the analysis.
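The decision rule described for panels (D) and (E) can be illustrated with a minimal sketch. It is not the authors' code: the function names, the example probabilities and the treatment of the band edges are assumptions made for illustration. The band of unclear predictions is read here as 0.4 < Ppathogenic < 0.7, with the edges treated as clear predictions. The sketch uses only the Python standard library and takes per-structure probabilities as given, without calling pyFoldX or a trained classifier.

```python
from statistics import mean


def ensemble_probability(per_structure_probs):
    """Mean P(pathogenic) over all structures of one protein's ensemble."""
    if not per_structure_probs:
        raise ValueError("at least one per-structure probability is required")
    return mean(per_structure_probs)


def classify(p_pathogenic, discard_unclear=False, low=0.4, high=0.7):
    """Label a mutation from its probability of being pathogenic.

    Without discarding: 'pathogenic' if p > 0.5, otherwise 'benign'.
    With discard_unclear=True, probabilities strictly inside (low, high)
    return None. Edge handling is an assumption; the caption does not
    specify whether the bounds are inclusive.
    """
    if discard_unclear and low < p_pathogenic < high:
        return None
    return "pathogenic" if p_pathogenic > 0.5 else "benign"


# Hypothetical per-structure probabilities for three mutations
# (illustrative values only, not data from the paper).
example = {
    "mutation_1": [0.82, 0.91, 0.76],
    "mutation_2": [0.15, 0.25, 0.20],
    "mutation_3": [0.45, 0.65, 0.55],
}

# Map each mutation to (label with all kept, label with unclear discarded).
labels = {
    name: (
        classify(ensemble_probability(probs)),
        classify(ensemble_probability(probs), discard_unclear=True),
    )
    for name, probs in example.items()
}
```

In this sketch, `mutation_3` averages to roughly 0.55, so it is labelled pathogenic under the plain 0.5 threshold but dropped when unclear predictions are discarded, which mirrors the difference between the thin and thick ROC curves.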