
Start2Fold: a database of hydrogen/deuterium exchange data on protein folding and stability.

Rita Pancsa¹, Mihaly Varadi², Peter Tompa³, Wim F Vranken⁴.

Abstract

Proteins fulfil a wide range of tasks in cells; understanding how they fold into complex three-dimensional (3D) structures and how these structures remain stable while retaining sufficient dynamics for functionality is essential for the interpretation of overall protein behaviour. Since the 1950s, solvent exchange-based methods have been the most powerful experimental means to obtain information on the folding and stability of proteins. Considerable expertise and care were required to obtain the resulting datasets, which, despite their importance and intrinsic value, have never been collected, curated and classified. Start2Fold is an openly accessible database (http://start2fold.eu) of carefully curated hydrogen/deuterium exchange (HDX) data extracted from the literature that is open for new submissions from the community. The database entries contain (i) information on the proteins investigated and the underlying experimental procedures and (ii) the classification of the residues based on their exchange protection levels, also allowing for the instant visualization of the relevant residue groups on the 3D structures of the corresponding proteins. By providing a clear hierarchical framework for the easy sharing, comparison and (re-)interpretation of HDX data, Start2Fold intends to promote a better understanding of how the protein sequence encodes folding and structure as well as the development of new computational methods predicting protein folding and stability.

Year: 2015 PMID: 26582925 PMCID: PMC4702845 DOI: 10.1093/nar/gkv1185

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Proteins are dynamic molecules that display a wide range of behaviours and fulfil many functions in the cell; understanding how they fold into complex three-dimensional (3D) structures and how these structures remain stable is essential for the interpretation of their overall behaviour. However, studying protein folding and stability is a difficult task; the complexity of the folding process and the diversity of and interchange between possible protein conformations require highly time-sensitive and complex experiments. Only solvent exchange-based methods (1–6) and protein engineering approaches (7,8) are able to provide experimental information at the level of individual amino acid residues. Among these approaches, hydrogen/deuterium exchange (HDX) techniques are the most widely used (1,6); they are based on the notion that, for a given residue, the rate of proton exchange between its backbone amide proton and the solvent (water) provides information on the structural state of that residue (9). Amide protons of residues that form stable hydrogen bonds, and are often deeply buried in the protein, are protected from solvent and basically stop exchanging, while the amide protons of solvent exposed residues not involved in hydrogen bonds display little or no protection from exchange. These HDX techniques (i) can be applied under various conditions, (ii) have the potential to provide information on a large fraction of the protein residues while highlighting residue-specific details, (iii) are exquisitely sensitive to the dynamics of structure and rare conformational changes and (iv) in combination with rapid mixing techniques can be used to monitor folding events with good time resolution (10,11). 
Due to their indisputable advantages, over the years these techniques have developed into powerful tools for studying protein structure, stability and dynamics (12), through examining the native state (13,14), partially folded equilibrium intermediates (15–17), kinetic folding intermediates (18–20) and association reactions of proteins (21–23). Native exchange experiments investigate proteins in their folded or partially folded states and report on the stability of hydrogen bonds that exist in these states (13,14,16). By varying the environmental conditions, such as pH or denaturant concentration, such measurements can also provide quantitative information on the protection levels of the individual amide protons as a function of changes in the conformation of the protein. By comparing these values, one can distinguish more stable from less stable regions of the protein fold (13,14,16), and moreover obtain clues about folding kinetics under certain conditions (20,24). The protection levels or exchange rates of amide protons can also be followed from the completely unfolded state throughout the entire course of folding of the proteins by a range of methods, for example pulsed labelling (25–27), quenched flow (28–30) and competition-based HDX measurements (10,31) (coupled with either nuclear magnetic resonance (NMR) or mass spectrometry (MS) as detection techniques); they indicate which regions of the protein first form local structure, thus providing invaluable information on the folding mechanisms of proteins. Over the past 40 years numerous solvent-exchange experiments have been carried out to gain insights into protein structure, stability, folding and dynamics (32). Despite the invaluable insights these experiments have provided and their availability since the 1950s (33), the resulting datasets have never been assembled and made available as a database, which makes it difficult to draw general conclusions and/or to develop computational prediction tools.
In our opinion this lack of a public HDX database is due to the heterogeneity of both the relevant methods and the measures used to describe the protection rates of residues (protection factors, folding rate constants, burst phase amplitude, midpoint of folding etc.), which make the collection and comparison of different HDX data in the literature especially difficult and labour intensive. We overcame these problems and present here Start2Fold (http://start2fold.eu), a comprehensive collection of carefully curated and classified residue-level folding and stability data derived from solvent exchange-based experiments, including native-state, pulsed labelling, quenched flow and competition-based HDX experiments and oxidative labelling measurements.
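As an illustration of one of the measures named above, the protection factor is conventionally defined as the ratio of the intrinsic (unprotected) exchange rate of an amide proton to its observed exchange rate; in the EX2 limit it can be converted into an apparent free energy of opening, ΔG = RT ln P. The following minimal Python sketch shows this arithmetic only. The function names and the example rates are illustrative assumptions and are not taken from Start2Fold or from any cited dataset.

```python
import math

# Gas constant in kcal mol^-1 K^-1.
GAS_CONSTANT_KCAL = 1.987204e-3


def protection_factor(k_intrinsic: float, k_observed: float) -> float:
    """Ratio of intrinsic to observed amide exchange rate (same units)."""
    if k_intrinsic <= 0 or k_observed <= 0:
        raise ValueError("exchange rates must be positive")
    return k_intrinsic / k_observed


def exchange_free_energy(pf: float, temperature_k: float = 298.15) -> float:
    """Apparent opening free energy (kcal/mol), assuming the EX2 limit."""
    if pf <= 0:
        raise ValueError("protection factor must be positive")
    return GAS_CONSTANT_KCAL * temperature_k * math.log(pf)


# Hypothetical rates, chosen only to show the calculation.
pf = protection_factor(k_intrinsic=10.0, k_observed=0.01)
dg = exchange_free_energy(pf)
print(f"protection factor = {pf:.0f}, apparent dG = {dg:.2f} kcal/mol")
```

Because other studies report folding rate constants, burst phase amplitudes or folding midpoints instead, a conversion of this kind does not make the published values comparable across experiments.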

MATERIALS AND METHODS

Data collection and classification

The measurements were collected from the literature and subsequently classified based on (i) whether they investigate protein folding or stability and (ii) whether they provide information on the residue or segment level. Due to the heterogeneity of the data, the published protection levels and rates cannot be directly compared between proteins and it is not possible to apply generic quantitative thresholds of protection. To overcome this problem and enable comparisons on a qualitative level, the residues of each protein were classified into groups based on their detected exchange protection levels. We mostly adopted the residue classification that either was proposed in the original publication describing the measurement or was suggested by Li and Woodward in their 1999 comparative analysis (6). In the few cases where the authors provided good quality measures of residue folding rates without a classification, we determined the protection thresholds according to our best knowledge so that ∼5–15% of all the residues were classified as the earliest folding ones. If the folding of the same protein was investigated by multiple measurements (e.g. horse cytochrome c (10,34,35)), either the experiment with the best time resolution was selected or the results of multiple measurements were retained for presentation in the database.
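The idea of choosing a threshold so that roughly 5–15% of the residues fall into the earliest folding class can be sketched as a simple rank-based rule. This is an illustrative stand-in, not the curators' procedure: the thresholds in the database were set manually, and the function names, the 10% default and the assumption that a larger value means earlier protection are all hypothetical.

```python
from typing import Dict, List


def early_folding_threshold(rates: Dict[int, float],
                            target_fraction: float = 0.10) -> float:
    """Return the lowest rate that still counts as 'early'.

    `rates` maps sequence position to a per-residue folding measure where
    larger means earlier protection (an assumption of this sketch).
    """
    if not rates:
        raise ValueError("no residues supplied")
    if not 0 < target_fraction <= 1:
        raise ValueError("target_fraction must be in (0, 1]")
    ranked = sorted(rates.values(), reverse=True)
    # Always keep at least one residue in the earliest class.
    n_early = max(1, round(target_fraction * len(ranked)))
    return ranked[n_early - 1]


def classify_early(rates: Dict[int, float],
                   target_fraction: float = 0.10) -> List[int]:
    """Sequence positions whose rate is at or above the threshold."""
    threshold = early_folding_threshold(rates, target_fraction)
    return sorted(pos for pos, rate in rates.items() if rate >= threshold)


# Made-up rates for a 20-residue toy chain (rate equals position).
toy_rates = {position: float(position) for position in range(1, 21)}
print(classify_early(toy_rates))
```

Tied values at the threshold are all kept, so the selected fraction can exceed the target; a curator would normally inspect such cases rather than rely on the cut-off alone.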

Database structure and content

Start2Fold is a relational database implemented in MySQL (http://www.mysql.com/). Data in Start2Fold is structured hierarchically in the following manner (Figure 1): on the top level (level 1) is the entry, which might be associated with one or multiple molecular systems (i.e. a protein of multiple chains or even complexes of multiple proteins). Each entry has a distinct identifier, which consists of a tag (STF) followed by a four-digit number (e.g. STF0001). An entry might have several associated protein chains (level 2). This level (or class) stores information on the UniProt (36) and Protein Data Bank (PDB) (37,38) IDs of the protein chain, and serves to provide direct cross-links to these online repositories. Each protein chain might have several corresponding residue sets defined by their protection levels, which can be early, intermediate or late for folding measurements and strong, medium or weak for stability measurements (level 3). Additional information recorded on the residue sets includes the resolution (residue-level or segment-level), the number of probes, the protection threshold, the experimental conditions (pH, temperature), the actual protein sequence used in the experiment (with any modifications/mutations), the PubMed ID (http://www.ncbi.nlm.nih.gov/pubmed) of the original publication and a textual description of the experimental procedure taken from the original publication. Finally, for the residues belonging to each residue set, the sequence position and amino acid types are also provided (level 4).
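The four-level hierarchy described above can be mirrored in a few Python dataclasses. This is a minimal sketch for readers who want to hold entries in memory; the class and field names are assumptions made for illustration and do not reproduce the actual MySQL schema or XML element names, which are not given here. Only a subset of the recorded attributes is modelled.

```python
from dataclasses import dataclass, field
from typing import List

# Protection levels named in the text, keyed by experiment type.
PROTECTION_LEVELS = {
    "folding": ("early", "intermediate", "late"),
    "stability": ("strong", "medium", "weak"),
}


@dataclass
class Residue:  # level 4
    position: int
    amino_acid: str


@dataclass
class ResidueSet:  # level 3
    experiment_type: str   # "folding" or "stability"
    protection_level: str
    resolution: str        # "residue-level" or "segment-level"
    pubmed_id: str
    residues: List[Residue] = field(default_factory=list)

    def __post_init__(self) -> None:
        allowed = PROTECTION_LEVELS.get(self.experiment_type)
        if allowed is None:
            raise ValueError(f"unknown experiment type: {self.experiment_type}")
        if self.protection_level not in allowed:
            raise ValueError(
                f"{self.protection_level!r} is not valid for "
                f"{self.experiment_type} measurements"
            )


@dataclass
class ProteinChain:  # level 2
    uniprot_id: str
    pdb_id: str
    residue_sets: List[ResidueSet] = field(default_factory=list)


@dataclass
class Entry:  # level 1
    entry_id: str
    protein_chains: List[ProteinChain] = field(default_factory=list)

    def residue_count(self) -> int:
        """Total residues over all sets (a residue may occur in several sets)."""
        return sum(
            len(residue_set.residues)
            for chain in self.protein_chains
            for residue_set in chain.residue_sets
        )


# Placeholder values only; no real entry content is reproduced here.
demo_set = ResidueSet("folding", "early", "residue-level", "PUBMED_ID",
                      [Residue(1, "M"), Residue(2, "K")])
demo_entry = Entry("STF0001", [ProteinChain("UNIPROT_ID", "PDB_ID", [demo_set])])
print(demo_entry.residue_count())
```

Validating the protection level against the experiment type keeps folding and stability vocabularies from being mixed within one residue set.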

Figure 1.

Database structure: Start2Fold is organized hierarchically into four levels: the entries (level 1) are on the top; each entry might have multiple associated protein chains (level 2); which in turn might have several experimental residue sets (level 3), with information on their associated residues (level 4). Each level records relevant information that is displayed on the accession screens or in the XML files of the entries.

RESULTS

Currently, Start2Fold contains 57 entries with 219 residue sets defined based on the protection levels. These residue sets contain a total of 4172 residues, with the same residues appearing in multiple sets, since there are data available from both stability and folding measurements for most of the protein entries. Besides the measurement and residue classifications, all entries of the database contain structural features of the investigated protein, the detailed descriptions of the underlying experimental procedures, sample components, measurement conditions, cross references to other databases and references to the original publications. The accession pages of the entries also allow for the visualization of the relevant residue groups in the 3D structures of the corresponding proteins (where available). We also provide an XML template (39) that enables scientists to deposit new data into Start2Fold.

User interface and website features

The user interface of Start2Fold is divided into five main sections. The ‘home’ section briefly introduces the database and the types of data contained within. It also provides the user with a contact form to send inquiries and feedback to the developers. Three sections provide browsing options by different criteria, i.e. proteins, residue sets or entries. Each option provides an ordered list of the entries that can be rearranged by the most relevant information depending on the browsing option. When browsing by entries, the entry ID and molecular system name are displayed. When browsing by proteins, the UniProt name of the protein, the entry ID, the UniProt and PDB IDs, the length of the protein chain, the corresponding UniProt fragment and the secondary structure type of the chain constitute the browsing list. Lastly, when browsing by residue sets, the protection level, molecular system name, entry ID, experiment type and method, and PubMed reference are displayed in the list. The user is forwarded to the accession screen by clicking on the entry ID links on either the browsing lists or the search results list. This page provides all the relevant information associated with the entry, along with a static picture of the protein structure (if available) which links to an integrated JSmol applet page (http://sourceforge.net/projects/jsmol/). Finally, the ‘help’ section contains the detailed documentation of the database, with an in-depth user guide describing all the functionalities of Start2Fold. The ‘home’, ‘help’ and browsing pages can be accessed from all the pages using the menu on the top left section of the screen. Additionally, the database can be searched using the ‘search’ field located on the right side of the menu. Searching Start2Fold can be performed by typing in protein names, UniProt/PDB IDs, experiment types, experiment methods and protection levels in the search field, and pressing ‘search’.

Accession screens

The actual information stored within Start2Fold is displayed on the accession screens (Figure 2). The entry ID and the title of the entry appear on the top of the page. Below is the ‘download entry in xml’ link, which provides the complete entry in XML format for downloading. This XML follows the structure of the XML template found on the welcome page. Alternatively, this XML can be directly accessed by adding ‘.xml’ to the entry URL (e.g. http://start2fold.eu/STF0004.xml).

Figure 2.

The accession screen: every piece of information that is stored in Start2Fold is displayed on the accession screens. The top section of these screens hosts an integrated JSmol applet for interactive visualization (panel A). General information on the protein chain(s) and relevant information on the experimental sets are displayed below (panel B). The complete entry can be downloaded in XML format, while the sequences can be retrieved in FASTA format and the residue/segment-specific information in the form of a list file by clicking on the respective links.

The integrated JSmol applet is available by either clicking on the image of the protein structure or by clicking on the ‘click here’ link under the ‘visualize the data’ section. This applet can be used to visualize the different residue sets or segments by clicking on one of the buttons (Figure 2A). The ‘reset view’ button can be clicked to reset the JSmol applet. Please note that due to technical issues with the current version of JSmol the loading speed can be slow on some browsers and depends on the available internet connection. The protein information and experimental set sections are next to and below the visualization applet link (Figure 2B). The protein information tab provides the name of the protein, the species of origin, the number of residues in the protein chain and cross-links to UniProt and PDB. The experimental sets can be opened and closed by clicking on the ‘show’ and ‘hide’ buttons. These sections display the corresponding reference details, the experimental type (stability/folding) and method, the experimental conditions (pH, temperature, number of probes), a brief description of the experiment and the actual sequence that was used for the measurement. This sequence can be downloaded in FASTA format by clicking on the ‘click to download sequence in fasta’ link under the sequence. Alternatively, the sequences can be directly accessed by adding ‘.fasta’ to the URL (e.g. http://start2fold.eu/STF0008.fasta). Finally, the residues are listed by their indices and their one-letter amino acid codes. Each residue in a set is highlighted on the sequence to allow quick visualization at a glance. The residue lists can be downloaded either by clicking the ‘click to download list of residues’ link below the residues or by directly accessing them via adding ‘.residues’ to the URL (e.g. http://start2fold.eu/STF0008.residues).
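The URL conventions for direct access (appending ‘.xml’, ‘.fasta’ or ‘.residues’ to the entry URL) lend themselves to scripted retrieval. The sketch below only builds the URLs and parses FASTA text that has already been downloaded; it performs no network access. The helper names are hypothetical, the entry ID pattern is inferred from the examples in the text (STF followed by four digits), and no assumption is made about the layout of the ‘.residues’ list file, which is not specified here.

```python
import re
from typing import List, Tuple

BASE_URL = "http://start2fold.eu"
EXPORT_FORMATS = ("xml", "fasta", "residues")
# Inferred from examples such as STF0004 and STF0008 (an assumption).
ENTRY_ID_PATTERN = re.compile(r"STF\d{4}")


def export_url(entry_id: str, fmt: str) -> str:
    """Build the direct-access URL for one entry and export format."""
    if not ENTRY_ID_PATTERN.fullmatch(entry_id):
        raise ValueError(f"unexpected entry ID: {entry_id!r}")
    if fmt not in EXPORT_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return f"{BASE_URL}/{entry_id}.{fmt}"


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


for fmt in EXPORT_FORMATS:
    print(export_url("STF0008", fmt))
```

The resulting URLs can be passed to any HTTP client, and the text of a ‘.fasta’ response can then be handed to `parse_fasta`.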

DISCUSSION

We hope that the collection, curation and integration of HDX data into a single well-organized searchable database, Start2Fold, will stimulate future structure/folding-related work, both at the experimental and computational level. It is likely that the heterogeneous nature of HDX data stalled the development of such a database, an issue that we have addressed here by introducing a clear classification scheme of the residues based on their protection levels. Although proteins (un-)fold at varying overall rates, it is now possible to compare the order in which residues become (de-)protected, which will hopefully allow refinement of the relationship between the stability and folding cores of proteins and re-interpretation of accumulated HDX data, thereby providing valuable new insights. We encourage researchers to submit their new exchange data, and will consider including additional types of data in Start2Fold, such as high quality computational simulations of protein folding (40–43). In all, we hope that Start2Fold will help to answer some of the remaining open questions related to the folding mechanisms of proteins (44–48), and that it will contribute to the development of new methods that allow for the reliable calculation of protein folding, structure and stability from sequence.

AVAILABILITY

Start2Fold is openly available for the scientific community at http://start2fold.eu. The database is open to submissions; we encourage users to submit their folding and stability data using the template XML provided on the welcome page of the website.

47 in total

1. Early formation of a beta hairpin during folding of staphylococcal nuclease H124L as detected by pulsed hydrogen exchange.

Authors: William F Walkenhorst; Jason A Edwards; John L Markley; Heinrich Roder
Journal: Protein Sci Date: 2002-01 Impact factor: 6.725

Review 2. Is there a unifying mechanism for protein folding?

Authors: Valerie Daggett; Alan R Fersht
Journal: Trends Biochem Sci Date: 2003-01 Impact factor: 13.807

Review 3. The nature of protein folding pathways.

Authors: S Walter Englander; Leland Mayne
Journal: Proc Natl Acad Sci U S A Date: 2014-10-17 Impact factor: 11.205

4. A kinetic folding intermediate probed by native state hydrogen exchange.

Authors: M J Parker; S Marqusee
Journal: J Mol Biol Date: 2001-01-19 Impact factor: 5.469

5. The kinetic and equilibrium molten globule intermediates of apoleghemoglobin differ in structure.

Authors: Chiaki Nishimura; H Jane Dyson; Peter E Wright
Journal: J Mol Biol Date: 2008-03-19 Impact factor: 5.469

6. Mapping the conformational stability of maltose binding protein at the residue scale using nuclear magnetic resonance hydrogen exchange experiments.

Authors: Céline Merstorf; Olek Maciejak; Jérôme Mathé; Manuela Pastoriza-Gallego; Bénédicte Thiebot; Marie-Jeanne Clément; Juan Pelta; Loïc Auvray; Patrick A Curmi; Philippe Savarin
Journal: Biochemistry Date: 2012-10-24 Impact factor: 3.162

Review 7. Hydrogen exchange methods to study protein folding.

Authors: Mallela M G Krishna; Linh Hoang; Yan Lin; S Walter Englander
Journal: Methods Date: 2004-09 Impact factor: 3.608

8. UniProt: a hub for protein information.

Authors: The UniProt Consortium
Journal: Nucleic Acids Res Date: 2014-10-27 Impact factor: 16.971

9. The Protein Data Bank archive as an open data resource.

Authors: Helen M Berman; Gerard J Kleywegt; Haruki Nakamura; John L Markley
Journal: J Comput Aided Mol Des Date: 2014-07-26 Impact factor: 3.686

10. PDBe: Protein Data Bank in Europe.

Authors: Aleksandras Gutmanas; Younes Alhroub; Gary M Battle; John M Berrisford; Estelle Bochet; Matthew J Conroy; Jose M Dana; Manuel A Fernandez Montecelo; Glen van Ginkel; Swanand P Gore; Pauline Haslam; Rowan Hatherley; Pieter M S Hendrickx; Miriam Hirshberg; Ingvar Lagerstedt; Saqib Mir; Abhik Mukhopadhyay; Thomas J Oldfield; Ardan Patwardhan; Luana Rinaldi; Gaurav Sahni; Eduardo Sanz-García; Sanchayita Sen; Robert A Slowley; Sameer Velankar; Michael E Wainwright; Gerard J Kleywegt
Journal: Nucleic Acids Res Date: 2013-11-27 Impact factor: 16.971
