ProtDataTherm: A database for thermostability analysis and engineering of proteins.

Hassan Pezeshgi Modarres, Mohammad R Mofrad, Amir Sanati-Nezhad.

Abstract

Protein thermostability engineering is a powerful tool for improving the resistance of proteins to high temperatures and thereby broadening their applications. Efficient thermostability engineering requires thermostability-classified data sources, including sequences and 3D structures, for different protein families. However, no data source currently provides such data in an easily accessible form. This is the first release of the ProtDataTherm database for the analysis and engineering of protein thermostability, which contains more than 14 million protein sequences categorized by thermal stability and protein family. The database contains the data needed to better understand protein thermostability and to engineer stability. Because it categorizes protein sequences and structures as psychrophilic, mesophilic or thermophilic, the database is useful for developing new tools for protein stability prediction. The database is available at http://profiles.bs.ipm.ir/softwares/protdatatherm. As a proof of concept, thermostability-improving mutations were suggested for one sample protein that belongs to a protein family with more than 20 mesophilic and thermophilic sequences and for which experimentally measured ΔT values of mutations are available in the ProTherm database.

Year:  2018        PMID: 29377907      PMCID: PMC5788348          DOI: 10.1371/journal.pone.0191222

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Thermophilic and hyperthermophilic microorganisms have attracted scientific interest, particularly since the discovery of microorganisms living at temperatures above 75°C (1). Enzymes extracted from such high-temperature-tolerant microorganisms have been studied to understand the factors that modulate their improved thermostability and to use this knowledge as guidance for improving the thermostability of less stable proteins for biotechnological applications [1]. Knowledge of the preferred living temperature of a microorganism can help to approximate the thermostability of its expressed proteins, because a direct relationship has been reported between the growth temperature of microorganisms and the melting point of their proteins [2]. Currently available data on homologous proteins are valuable for engineering proteins towards higher stability, for example by introducing more salt bridges or strengthening the hydrophobic cores within the protein structure [3]. Although structure-based protein engineering, known as rational engineering or rational design, is the most popular methodology for thermostability engineering, the limited number of available protein structures still hinders its widespread use [4]. On the other hand, owing to modern advances in DNA sequencing technologies, the number of sequenced proteins belonging to different families is growing rapidly [3, 5]. Advances in the use of protein sequences for protein engineering could complement the established structure-based rational methods. The consensus concept (CC) is the most popular sequence-based protein engineering approach for extracting thermo-stabilizing mutations from homologous sequences [6-17]. In the CC approach, a multiple sequence alignment (MSA) is first built and non-consensus residues are then substituted by the most frequently occurring amino acids [5].
However, there is no guarantee that every mutation suggested by the CC approach increases thermostability [9, 14, 16, 18]. To detect thermo-stabilizing mutations with higher probability, one can compare the target sequence with homologues that tolerate higher temperatures [3]. Making this feasible for different protein families requires access to other proteins from the same family with higher thermal stability. The main challenge of this method, however, is the difficulty of finding homologues labeled with their thermostability category. To overcome this challenge, we developed a comprehensive database of protein sequences that belong to different microorganisms and are clustered by Pfam ID. The user can look up the Pfam ID of a protein of interest and find its homologues, categorized as psychrophilic, mesophilic or thermophilic. In addition to sequences, PDB IDs are provided if a 3D structure is available for the Pfam ID of interest.
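
The consensus concept described above can be illustrated with a minimal Python sketch. It is not taken from the paper: the function name `consensus_substitutions` is hypothetical, the homologues are assumed to be already aligned to the target (equal length, `-` for gaps), positions are 1-based alignment columns, and ties between equally frequent residues are resolved by whichever residue is encountered first.

```python
from collections import Counter


def consensus_substitutions(target, aligned_homologues, gap="-"):
    """List (wild_type, position, consensus) where the target differs
    from the most frequent residue in its alignment column.

    Assumes all sequences are pre-aligned and have the same length.
    Positions are 1-based alignment columns.
    """
    suggestions = []
    for pos, residue in enumerate(target):
        if residue == gap:
            continue  # nothing to substitute at a gap in the target
        # Residues observed in this column, ignoring gaps.
        column = [seq[pos] for seq in aligned_homologues if seq[pos] != gap]
        if not column:
            continue
        consensus, _count = Counter(column).most_common(1)[0]
        if consensus != residue:
            suggestions.append((residue, pos + 1, consensus))
    return suggestions


# Toy alignment (made-up sequences, for illustration only).
toy = consensus_substitutions("MKV", ["MRV", "MRV", "MKI"])
print(toy)
```

As the text notes, such a list only proposes candidates; it gives no guarantee that a given substitution is stabilizing.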

Materials and methods

First, a database of microorganisms was built in which each microorganism is categorized by its growth temperature (GT) using the BacDive [19] and NCBI [20] databases. For every microorganism, all available sequences with their corresponding sequence information, including Pfam ID [21] and PDB ID [22] where available, were obtained from the UniProt database [23]. The whole process was carried out in the Python programming language [24] using the Biopython module [25]. In our database, every protein sequence carries two labels: its Pfam ID and its thermostability category. To facilitate the use of the database for thermostability analysis and engineering, sequences are clustered by Pfam ID, so that each Pfam ID cluster contains proteins from the same family labeled with their thermostability category. Therefore, for a target protein sequence, the user can find the corresponding Pfam ID in the Pfam database [21] and use it as the primary input to search the database. For each Pfam ID family, we categorized sequences, identified by their UniProt IDs, as psychrophilic (GT < 20°C), mesophilic (20°C < GT < 40°C) or thermophilic (40°C < GT). The sequence IDs, Pfam IDs and PDB IDs are UniProt, Pfam and RCSB IDs, respectively. For the case study, Pfams containing more than 20 mesophilic and thermophilic sequences were first identified. Then, for pattern analysis, AXB patterns were considered in each sequence, where A and B can be any of the 20 standard amino acids and X is a separation number between 0 and 10. Thus A0B denotes all pairs of directly adjacent amino acids, such as VE, and A1B denotes all pairs of amino acids separated by one residue. For example, all occurrences of Ala followed by Val with exactly one amino acid (any of the 20 standard amino acids) between them are counted as A1V. Under the condition 0 ≤ X ≤ 10, AXB patterns were counted and saved for each sequence.
Finally, this yields a set of pattern counts for both the mesophilic and the thermophilic sequences. For a given AXB pattern (e.g. V4H), there is thus one group of counts for the mesophilic category and one for the thermophilic category, each with its corresponding average. The rank-sum test with a critical p-value of 0.05 was used to detect AXB patterns that distinguish mesophilic sequences from thermophilic sequences.
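
The pattern counting and the rank-sum comparison can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: all function names are hypothetical, the rank-sum p-value uses the normal approximation with tie correction and no continuity correction (the text does not say which variant was used), and sequences that lack a pattern contribute a count of zero, which the text does not specify either.

```python
from collections import Counter
from math import erfc, sqrt

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def count_axb_patterns(sequence, max_gap=10):
    """Count AXB patterns (residue A, X residues in between, residue B)."""
    counts = Counter()
    for i, a in enumerate(sequence):
        if a not in AMINO_ACIDS:
            continue
        for gap in range(max_gap + 1):
            j = i + gap + 1
            if j >= len(sequence):
                break
            if sequence[j] in AMINO_ACIDS:
                counts[f"{a}{gap}{sequence[j]}"] += 1
    return counts


def rank_sum_p_value(xs, ys):
    """Two-sided rank-sum p-value (normal approximation, tie-corrected)."""
    n1, n2 = len(xs), len(ys)
    if n1 == 0 or n2 == 0:
        return 1.0
    pooled = sorted((v, g) for g, vals in enumerate((xs, ys)) for v in vals)
    n = n1 + n2
    rank_sum_x, tie_term, i = 0.0, 0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values at sorted positions i..j-1 share the average rank.
        avg_rank = (i + 1 + j) / 2
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return 1.0
    z = (u - n1 * n2 / 2) / sqrt(var_u)
    return erfc(abs(z) / sqrt(2))


def distinguishing_patterns(meso_seqs, thermo_seqs, alpha=0.05, max_gap=10):
    """Map each significant pattern to (p_value, ave_the, ave_mes)."""
    meso = [count_axb_patterns(s, max_gap) for s in meso_seqs]
    thermo = [count_axb_patterns(s, max_gap) for s in thermo_seqs]
    selected = {}
    for pattern in sorted(set().union(*meso, *thermo)):
        m = [c[pattern] for c in meso]      # Counter gives 0 if absent
        t = [c[pattern] for c in thermo]
        p = rank_sum_p_value(t, m)
        if p < alpha:
            selected[pattern] = (p, sum(t) / len(t), sum(m) / len(m))
    return selected
```

The returned tuple mirrors the P_value, Ave_The and Ave_Mes columns of Table 2. For real analyses, an established statistics library would be a safer choice than this hand-written approximation.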

Results and discussion

A PHP webpage was designed as the user interface to the database. The user can find the Pfam ID of a protein of interest (e.g. using the Pfam database) and search for it on the first page of the website (panel A of the webpage view). The results are then presented on the next page, including all sequences and structures available in the database for the submitted Pfam ID (panel B). The database contains more than 14 million protein sequences and PDB structures for 9962 protein families, categorized by thermal stability as psychrophilic, mesophilic or thermophilic (Table 1). In total, 14155392 protein sequences and 30950 PDB structures are available in the database. For 957 protein families, at least one PDB structure is available for a thermophilic protein, which can be used for structural comparison between mesophilic and thermophilic proteins (Table 1). In addition, 3355 protein families have at least 20 sequences belonging to thermophilic proteins, and 3046 protein families have at least 20 sequences belonging to psychrophilic proteins. For such protein families, amino acid content can be compared between psychrophilic/mesophilic and mesophilic/thermophilic proteins to gain family-specific knowledge of the factors that modulate thermostability.
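
The organization of the data behind such a Pfam ID search can be pictured with a small Python sketch. It is only an illustration: the names `thermal_class` and `build_pfam_index` and the record layout are hypothetical, the sequence IDs in the usage lines are placeholders, and the handling of growth temperatures of exactly 20°C or 40°C is an assumption, since the thresholds in Materials and methods are stated as strict inequalities.

```python
from collections import defaultdict


def thermal_class(growth_temp_c):
    """Map a growth temperature in °C to a thermostability class.

    Assumption: exactly 20 °C counts as mesophilic and exactly 40 °C
    as thermophilic; the source leaves these boundary cases undefined.
    """
    if growth_temp_c < 20:
        return "psychrophilic"
    if growth_temp_c < 40:
        return "mesophilic"
    return "thermophilic"


def build_pfam_index(records):
    """Cluster sequence IDs by Pfam ID and thermostability class.

    records: iterable of (sequence_id, pfam_ids, growth_temp_c), where
    growth_temp_c is the growth temperature of the source organism.
    """
    index = defaultdict(lambda: defaultdict(list))
    for sequence_id, pfam_ids, growth_temp_c in records:
        label = thermal_class(growth_temp_c)
        for pfam_id in pfam_ids:
            index[pfam_id][label].append(sequence_id)
    return index


# Placeholder records, for illustration only.
records = [
    ("seq_1", ["PF00075"], 37.0),
    ("seq_2", ["PF00075"], 70.0),
    ("seq_3", ["PF00075"], 10.0),
]
index = build_pfam_index(records)
print(dict(index["PF00075"]))
```

A lookup by Pfam ID then returns the sequences of one family already split into the three classes, which is the view the results page presents.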

The view of the webpage.

A) Users can enter the Pfam ID as input on the first page. B) All available sequences and structures are presented for the different classes on the results page.

Other databases

Two databases, namely PGTdb [26] and ProTherm [27], have been developed to provide data concerning protein thermostability. To the knowledge of the authors, PGTdb is no longer accessible, although it was the only resource that provided experimental information about the thermostability classification of protein sequences based on the GT of their corresponding organisms (psychrophilic, mesophilic and thermophilic). The ProTherm database, on the other hand, provides thermodynamic data for mutagenesis, but only for a limited number of proteins. Our database contains a much larger number of microorganisms, protein sequences and PDB structures. It categorizes all sequences of the different Pfam families according to their thermostability and provides easier access to the data needed for the analysis and engineering of protein families.

Case study: Pattern recognition for protein engineering

An important goal of any thermostability analysis is to learn from the differences between the two categories how to engineer mesophilic proteins with a minimum number of mutations so that their thermostability approaches that of thermophilic sequences. Here, as a case study, we selected a protein belonging to one of the protein families with more than 20 mesophilic and thermophilic sequences for which the ΔT of mutations is experimentally available in the ProTherm database. From the ProTherm database, ribonuclease H from Escherichia coli (strain K12) (PDB ID 2RN2, solved by X-ray diffraction at 1.48 Å resolution) was selected. Ribonuclease H belongs to Pfam family PF00075, has reported ΔT values upon mutation from thermal experiments, and is among the proteins with the highest number of reported thermodynamic measurements of the effect of mutations on stability. An algorithm (Algorithm 1) was designed to suggest thermostability-improving mutations: from all AXB patterns with a meaningful population difference between mesophilic and thermophilic sequences in the family (Pfam ID PF00075) (see methods for the definition of meaningful population difference), we chose those AXB patterns whose average number of occurrences is higher in the thermophilic category than in the mesophilic category. We then found AXY patterns in the target sequence (ribonuclease H from Escherichia coli) in which Y is not equal to B. For these patterns, we suggest the Y→B mutation. The same approach was used for ZXB patterns to suggest Z→A mutations. If a suggested mutation was available in the ProTherm database, its ΔT value was checked. If ΔT > 0, the suggested mutation was considered a successful thermostability-improving suggestion, and if ΔT < 0, it was counted as a failed suggestion. The results are shown in Table 2, where 72% of the suggested mutations improve thermostability.
This result indicates that the proposed approach can serve as a sequence-based thermostability engineering method, provided that sequences categorized as thermophilic and mesophilic are available for the protein family of the target protein. The accuracy of the suggested mutations is expected to improve when more sophisticated methods, such as machine learning techniques, are applied to such a database. However, further studies incorporating more proteins from a diverse range of protein families should be conducted to better evaluate the accuracy of this method.

Algorithm 1. Thermostability improving mutation suggestion algorithm.
Input: protein sequence, Pfam ID, and the AXB patterns that distinguish thermophilic from mesophilic sequences for the Pfam ID
Output: mutation list
for all AXB patterns for the Pfam ID do:
    if the average count of AXB is higher in thermophilic than in mesophilic sequences then:
        find AXY or ZXB patterns in the target sequence where Y is not B or Z is not A
        add Y → B or Z → A to mutation list
    end
end
return mutation list
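
A minimal Python sketch of Algorithm 1 is given here for illustration. It is not the authors' implementation: the function names are hypothetical, the input patterns are assumed to be already filtered to those enriched in thermophilic sequences, patterns are written either as `A2Y` or, for directly adjacent residues, as a bare pair such as `ER` (as in Table 2), and positions are 1-based. The `success_rate` helper mirrors the ProTherm check described in the text, with the ΔT lookup supplied by the caller.

```python
def parse_pattern(pattern):
    """Split 'A2Y' into ('A', 2, 'Y'); a bare pair such as 'ER' means gap 0."""
    gap = int(pattern[1:-1]) if len(pattern) > 2 else 0
    return pattern[0], gap, pattern[-1]


def suggest_mutations(sequence, patterns):
    """Return sorted (wild_type, position, mutant) suggestions.

    For each pattern AXB, every residue pair separated by X residues is
    inspected: A..Y with Y != B yields Y -> B, and Z..B with Z != A
    yields Z -> A. Positions are 1-based.
    """
    suggestions = set()
    for pattern in patterns:
        a, gap, b = parse_pattern(pattern)
        for i in range(len(sequence) - gap - 1):
            j = i + gap + 1
            z, y = sequence[i], sequence[j]
            if z == a and y != b:
                suggestions.add((y, j + 1, b))
            if y == b and z != a:
                suggestions.add((z, i + 1, a))
    return sorted(suggestions, key=lambda m: (m[1], m[0], m[2]))


def success_rate(suggestions, delta_t):
    """Fraction of suggestions with a measured ΔT for which ΔT > 0.

    delta_t maps (wild_type, position, mutant) to a measured ΔT;
    suggestions without a measurement are ignored.
    """
    measured = [delta_t[m] for m in suggestions if m in delta_t]
    if not measured:
        return None
    return sum(1 for value in measured if value > 0) / len(measured)


# Toy sequence and made-up ΔT values, for illustration only.
toy_suggestions = suggest_mutations("EHAV", ["ER"])
print(toy_suggestions)
print(success_rate(toy_suggestions, {("H", 2, "R"): 1.0}))
```

Note that the procedure proposes a mutation at every occurrence of a partial pattern, so the suggestion list can be long; Table 2 lists only those suggestions for which a ΔT is available in ProTherm.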

Applications

The database developed in this work can be used to build protein thermostability mutation libraries using different approaches, such as CC and comparison of the target sequence with homologues of higher thermostability [17, 28, 29]. In addition, it can be used for systematic analysis of the factors that modulate thermostability [30-32] in different families, since these factors can vary from family to family [3]. Furthermore, because thermophilic sequences belong to microorganisms that tolerate harsh conditions in general and not only high temperature, these data can also be used to optimize a target sequence for applications under other harsh conditions, such as extreme pH and high salt concentrations. Altogether, this database provides the most important data needed for sequence-based protein engineering and analysis, enabling researchers to develop new analysis and engineering tools in the field of thermal stability. The database is useful not only for general industrial and research purposes but also for drug design [17, 33, 34].

Conclusions

Here we present the first release of the ProtDataTherm database, which contains more than 14 million protein sequences and structures belonging to microorganisms with different preferred living temperatures. All sequences and structures are labeled as psychrophilic, mesophilic or thermophilic. For ease of use, the sequences are classified by their Pfam IDs, so users can find homologous sequences for their protein of interest using its Pfam ID. The database can be applied not only to probe stability-modulating factors within protein families but also to knowledge-based protein stability engineering.

Availability

This database is available at http://profiles.bs.ipm.ir/softwares/protdatatherm. The database is accessible free of charge to academic users on request.

Table 1

The distribution of protein sequences and structures over the three classes of thermostability.

Item                                               Count
Mesophilic sequences                               13111756
Thermophilic sequences                             661072
Psychrophilic sequences                            382564
Mesophilic structures                              23069
Thermophilic structures                            7741
Psychrophilic structures                           140
Pfams with at least one mesophilic structure       2306
Pfams with at least one thermophilic structure     957
Pfams with at least one psychrophilic structure    82
Pfams with at least 20 thermophilic sequences      3355
Pfams with at least 20 psychrophilic sequences     3046

Table 2

Ave_The: Average of the number of patterns for thermophilic sequences, Ave_Mes: Average of the number of patterns for mesophilic sequences.

Pattern  Positions on Sequence  Mutation  ΔT     P_value   Ave_The  Ave_Mes
ER       61 E, H 62             H 62 R    1.3    0.0032    1.782    1.474
LR       74 V, R 75             V 74 L    3.7    0.0106    1.773    1.492
LE       134 D, E 135           D 134 L   5.5    0.0007    1.918    1.437
NK       95 K, K 96             K 95 N    3.2    0.00841   1.951    1.43
SI       52 A, I 53             A 52 S    -5.8   0.04146   1.579    1.21
SG       10 D, G 11             D 10 S    9.2    0.0083    1.836    1.63
KI       52 A, I 53             A 52 K    19.5   0.0398    2.385    1.69
EG       10 D, G 11             D 10 E    3.8    0.012     1.74     1.376
F1E      8 F, D 10              D 10 E    3.8    0.0199    1.463    1.242
F1S      8 F, D 10              D 10 S    9.2    0.0186    1.767    1.284
C2N      41 R, N 44             R 41 C    1.6    0.0002    1.282    1.052
A2Y      70 D, Y 73             D 70 A    3.8    0.004     1.647    1.304
E2Y      70 D, Y 73             D 70 E    1.8    0.0331    1.583    1.12
E2C      10 D, C 13             D 10 E    3.8    0.008     1.409    1
L2N      49 L, A 52             A 52 N    -5.9   0.0323    1.617    1.301
L2N      67 L, D 70             D 70 N    5.5    0.0323    1.617    1.301
V2K      119 E, K 122           E 119 V   2.7    0.0379    1.635    1.264
N3I      130 N, D 134           D 134 I   4.6    0.0281    1.667    1.246
N3N      130 N, D 134           D 134 N   6.4    0.0004    1.658    1.265
N3E      130 N, D 134           D 134 E   3.1    0.0353    1.757    1.557
N3V      130 N, D 134           D 134 V   4.1    0.0031    1.541    1.299
N3V      70 D, V 74             D 70 N    5.5    0.0031    1.541    1.299
R3V      91 K, K 95             K 91 R    0.5    0.0005    1.554    1.26
V3Y      24 A, Y 28             A 24 V    3.2    0.0419    1.638    1.136
E3V      48 E, A 52             A 52 V    7.8    0.0133    2.023    1.852
E3V      64 E, S 68             S 68 V    1.9    0.0133    2.023    1.852
E3V      70 D, V 74             D 70 E    1.8    0.0133    2.023    1.852
E3V      94 D, V 98             D 94 E    -1.2   0.0133    2.023    1.852
Y3L      52 A, L 56             A 52 Y    -7.6   0.0146    1.636    1.082
C4E      52 A, E 57             A 52 C    2.5    0.0175    1.4      1
V4Y      68 S, Y 73             S 68 V    1.9    0.0162    1.528    1.079
N4R      70 D, R 75             D 70 N    5.5    7.34E-09  1.587    1.155
N4K      130 N, E 135           E 135 K   -0.8   6.92E-05  2.329    1.678
N4E      52 A, E 57             A 52 N    -5.9   0.0127    1.664    1.317
Q5N      4 Q, D 10              D 10 N    6.8    0.0361    1.696    1.16
E5N      64 E, D 70             D 70 N    5.5    0.00257   1.615    1.36
E5V      10 D, N 16             D 10 E    3.8    0.0025    1.615    1.3
E5N      94 D, N 100            D 94 E    -1.2   0.0025    1.615    1.36
R5P      46 R, A 52             A 52 P    -5.4   0.0499    1.37     1.217
R5P      91 K, P 97             K 91 R    0.5    0.0499    1.37     1.217
R5I      46 R, A 52             A 52 I    6.2    0.0299    1.429    1.206
R5Y      46 R, A 52             A 52 Y    -7.6   0.0176    1.483    1.116
L5P      56 L, H 62             H 62 P    4.1    0.0009    1.59     1.316
L5P      107 L, Q 113           Q 113 P   -0.6   0.0009    1.59     1.316
L5L      80 Q, K 86             Q 80 L    1      0.0001    2.102    1.618
K6E      3 K, D 10              D 10 E    3.8    0.027     2.111    1.712
K6E      87 K, D 94             D 94 E    -1.2   0.027     2.111    1.712
N6I      45 N, A 52             A 52 I    6.2    0.007     1.554    1.2
L6L      67 L, V 74             V 74 L    3.7    0.0002    2.008    1.684
L6L      52 A, L 59             A 52 L    4.3    0.0002    2.008    1.684
L6K      80 Q, K 87             Q 80 L    1      0.0017    1.884    1.47
N6T      45 N, A 52             A 52 T    -2.7   0.01419   1.491    1.261
I7I      66 I, V 74             V 74 I    2.4    0.0043    1.618    1.241
I7I      74 V, I 82             V 74 I    2.4    0.0043    1.618    1.241
L7K      52 A, K 60             A 52 L    4.3    0.0395    1.67     1.355
L7I      74 V, I 82             V 74 L    3.7    0.0018    1.704    1.346
K7E      86 K, D 94             D 94 E    -1.2   0.0045    1.971    1.573
K7K      52 A, K 60             A 52 K    -19.5  0.0003    2.059    1.632
G7N      126 G, D 134           D 134 N   6.4    3.92E-07  1.934    1.473
Y7K      52 A, K 60             A 52 Y    -7.6   0.0202    1.705    1.262
N7T      44 N, A 52             A 52 T    -2.7   0.0001    1.767    1.215
N7V      16 N, A 24             A 24 V    3.2    0.0005    1.632    1.311
N7V      44 N, A 52             A 52 V    7.8    0.0005    1.632    1.311
F7K      52 A, K 60             A 52 F    -1.5   0.0211    1.636    1.237
R7K      91 K, K 99             K 91 R    0.5    0.0111    1.446    1.2

References (10 of 30 listed)

1.  Important amino acid properties for enhanced thermostability from mesophilic to thermophilic proteins.

Authors:  M M Gromiha; M Oobatake; A Sarai
Journal:  Biophys Chem       Date:  1999-11-15       Impact factor: 2.352

2.  ProTherm, version 4.0: thermodynamic database for proteins and mutants.

Authors:  K Abdulla Bava; M Michael Gromiha; Hatsuho Uedaira; Koji Kitajima; Akinori Sarai
Journal:  Nucleic Acids Res       Date:  2004-01-01       Impact factor: 16.971

3.  PGTdb: a database providing growth temperatures of prokaryotes.

Authors:  Shir-Ly Huang; Li-Cheng Wu; Han-Kuen Liang; Kuan-Ting Pan; Jorng-Tzong Horng; Ming-Tat Ko
Journal:  Bioinformatics       Date:  2004-01-22       Impact factor: 6.937

4.  Structure-guided consensus approach to create a more thermostable penicillin G acylase.

Authors:  Karen M Polizzi; Javier F Chaparro-Riggers; Eduardo Vazquez-Figueroa; Andreas S Bommarius
Journal:  Biotechnol J       Date:  2006-05       Impact factor: 4.677

Review 5.  Better library design: data-driven protein engineering.

Authors:  Javier F Chaparro-Riggers; Karen M Polizzi; Andreas S Bommarius
Journal:  Biotechnol J       Date:  2007-02       Impact factor: 4.677

6.  Beta-turn propensities as paradigms for the analysis of structural motifs to engineer protein stability.

Authors:  E C Ohage; W Graml; M M Walter; S Steinbacher; B Steipe
Journal:  Protein Sci       Date:  1997-01       Impact factor: 6.725

7.  Improved thermostability of AEH by combining B-FIT analysis and structure-guided consensus method.

Authors:  Janna K Blum; M Daniel Ricketts; Andreas S Bommarius
Journal:  J Biotechnol       Date:  2012-03-09       Impact factor: 3.307

8.  Improved thermostability of Clostridium thermocellum endoglucanase Cel8A by using consensus-guided mutagenesis.

Authors:  Michael Anbar; Ozgur Gul; Raphael Lamed; Ugur O Sezerman; Edward A Bayer
Journal:  Appl Environ Microbiol       Date:  2012-03-02       Impact factor: 4.792

9.  Computational design of protein therapeutics.

Authors:  Inseong Hwang; Sheldon Park
Journal:  Drug Discov Today Technol       Date:  2008

10.  BacDive--The Bacterial Diversity Metadatabase in 2016.

Authors:  Carola Söhngen; Adam Podstawka; Boyke Bunk; Dorothea Gleim; Anna Vetcininova; Lorenz Christian Reimer; Christian Ebeling; Cezar Pendarovski; Jörg Overmann
Journal:  Nucleic Acids Res       Date:  2015-09-30       Impact factor: 16.971
