MobiDB 2.0: an improved database of intrinsically disordered and mobile proteins.

Emilio Potenza¹, Tomás Di Domenico¹, Ian Walsh¹, Silvio C E Tosatto².

Abstract

MobiDB (http://mobidb.bio.unipd.it/) is a database of intrinsically disordered and mobile proteins. Intrinsically disordered regions are key for the function of numerous proteins. Here we provide a new version of MobiDB, a centralized source aimed at providing the most complete picture of the different flavors of disorder in protein structures, covering all UniProt sequences (currently over 80 million). The database features three levels of annotation: manually curated, indirect and predicted. Manually curated data is extracted from the DisProt database. Indirect data is inferred from PDB structures that are considered an indication of intrinsic disorder. The 10 predictors currently included (three ESpritz flavors, two IUPred flavors, two DisEMBL flavors, GlobPlot, VSL2b and JRONN) enable MobiDB to provide disorder annotations for every protein in the absence of more reliable data. The new version also features a consensus annotation and classification for long disordered regions. To complement the disorder annotations, MobiDB features additional annotations from external sources. Annotations from the UniProt database include post-translational modifications and linear motifs. Pfam annotations are displayed in graphical form and are link-enabled, allowing the user to visit the corresponding Pfam page for further information. Experimental protein-protein interactions from STRING are also classified for disorder content.

Year: 2014 PMID: 25361972 PMCID: PMC4384034 DOI: 10.1093/nar/gku982

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 19.160

INTRODUCTION

Proteins have been known to exist in an equilibrium between an unfolded and folded state at least since Anfinsen's experiments on denaturation. The existence of an unfolded, or disordered, state has long been considered temporary, due to the protein still having to adopt its final conformation. In this view, mobility of the protein structure was seen as a localized phenomenon, where protein structure determines function and local flexibility is limited to helping the protein achieve its function. This paradigm has been challenged by the collection of hundreds of proteins where function is determined by non-folding regions which play vital biological roles (1,2). Flexible segments lacking a unique native structure, known as intrinsically disordered regions, are widespread in nature, especially in eukaryotic organisms (3,4). Disordered regions can be short, long or even encompass entire proteins, and their non-enzymatic functions include regulation, protein–DNA/RNA interactions and molecular recognition, to name a few; for a recent review see e.g. (5). One of the first repositories for experimentally determined disorder was the DisProt database (6), which currently contains manually curated information on 694 proteins. More recently, the IDEAL database (7) was developed, which annotates 446 proteins with disorder and other interesting properties by scanning the literature. Although DisProt and IDEAL are invaluable as an experimental gold standard, they both represent only a fraction of the sequences in nature, posing a bottleneck for large-scale understanding of the disorder phenomenon. Experimental in vitro techniques such as nuclear magnetic resonance (NMR) and x-ray crystallography detect disorder only with difficulty, in particular for long regions and entire proteins. With currently around 100 000 NMR and x-ray structures, the Protein Data Bank (PDB) (8) nevertheless provides a rich source of indirect experimental evidence of disorder.
Missing residues in x-ray crystallographic structures in particular have become the de facto standard proxy to infer disorder (1,6,9,10). Only more recently have mobile regions in NMR structures started to be used to infer disorder (11), although it is not entirely clear how this relates to either missing x-ray regions or flexible loops. Due to the difficulty in determining disorder experimentally, a plethora of predictors have been created over the last 15 years. Many are quite accurate, as shown at the recent Critical Assessment of techniques for protein Structure Prediction (CASP-10) (12) and in a large-scale assessment of disorder predictors (10). Biophysical methods (13,14) derive pseudo-energy functions from residue pairings in rigid structures (i.e. non-disorder). Machine learning, especially neural networks, has been widely used to predict protein disorder (3,15–18). Many predictors try to capture quite diverse disorder flavors, e.g. ESpritz (15) can predict mobile NMR regions and DisEMBL (17) loop regions with high B-factor (high flexibility). Predictions can increase the number of annotated sequences to millions, but they must be fast to process many gigabytes of data and keep pace with data expansion. Despite earlier interest in proteome-scale disorder predictions (3), DICHOT (19) is probably the first public database to provide predictions for the human proteome (ca. 20 000 proteins). MobiDB (20), initially limited to ca. 450 000 SwissProt sequences, was the first published database to contain a mixture of experimental data and a consensus prediction approach to annotate as many sequences as possible with intrinsic disorder. A similar large-scale database, D2P2 (21), was published somewhat later to provide consensus predictions for ca. 10 million sequences from fully-sequenced genomes. The new version, MobiDB 2.0, improves over its predecessor in terms of coverage and molecular annotations.
It is cross-linked from UniProt, covering all of its protein sequences, presently annotating over 80 million sequences from thousands of organisms.

DATABASE DESCRIPTION

Data sources

MobiDB is designed in three layers (in order of quality): manual curation, indirect experimental PDB information and predictions. It draws on essentially four data sources: DisProt, PDB-NMR, PDB-xray and predictors. The highest quality data is currently extracted from the DisProt database (6), a central repository manually curated for structure-function annotations associated with protein intrinsic disorder. PDB-NMR disorder, or rather mobility, is generated by processing NMR structures in the PDB with Mobi (11). Deposited files of NMR experiments for protein structure resolution often contain multiple models. By calculating the differences between the positions of each model's residues, the degree to which positions change can be measured, which is interpreted as a measure of how mobile or disordered a protein is. Indirect data is also inferred from PDB-xray structures, by considering as disordered those residues whose Cα atoms are missing from x-ray crystallographic structures deposited in the PDB (8). Furthermore, every sequence in MobiDB is linked to UniProt (22), PDB (8) and Pfam (23) through SIFTS (24). MobiDB also includes secondary structure derived from PDB files using DSSP (25). Pfam annotations are displayed in graphical form and are link-enabled, allowing the user to visit the corresponding Pfam page for further information. Low-complexity regions predicted with SEG (26) and Pfilt (27) are included, as it is thought that low sequence complexity correlates with intrinsic disorder (28,29). Protein–protein interactions are incorporated from STRING (30) by considering only interactions of high accuracy with database or experimental evidence. Functional information from UniProt, e.g. post-translational modifications and binding sites (among others), is also assigned to residues.
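The two indirect PDB sources described above can be illustrated with a minimal sketch. It is not the Mobi algorithm or the MobiDB pipeline: the function names, the input layout and the assumption that the NMR models are already superimposed are all hypothetical, chosen only to show the idea of missing Cα atoms and of per-residue positional spread across models.

```python
import math


def missing_ca_disorder(seq_length, observed_ca_positions):
    """Label each residue 'D' if its C-alpha atom is missing, else 'S'.

    seq_length: number of residues in the sequence (positions are 1-based).
    observed_ca_positions: positions whose C-alpha atom is present in the
    x-ray structure (hypothetical input, e.g. parsed from a PDB file).
    """
    observed = set(observed_ca_positions)
    return ["S" if pos in observed else "D" for pos in range(1, seq_length + 1)]


def nmr_mobility(models):
    """Mean distance of each residue's C-alpha from its average position.

    models: list of models, each a list of (x, y, z) tuples with one entry
    per residue. Assumption: the models are already superimposed.
    """
    n_models = len(models)
    n_residues = len(models[0])
    mobility = []
    for i in range(n_residues):
        coords = [model[i] for model in models]
        centroid = tuple(sum(c[k] for c in coords) / n_models for k in range(3))
        spread = sum(math.dist(c, centroid) for c in coords) / n_models
        mobility.append(spread)
    return mobility
```

A larger value returned by `nmr_mobility` indicates a residue whose position varies more between models; any cut-off used to call such a residue mobile would be a further assumption not given in the text.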

Disorder predictors

MobiDB uses three biophysical predictors (IUPred-short (14), IUPred-long (14), GlobPlot (13)) and seven machine learning predictors (DisEMBL-465 (17), DisEMBL-HL (17), ESpritz-DisProt (15), ESpritz-NMR (15), ESpritz-xray (15), JRONN (16) and VSL2b (18)). All predictors are chosen for their speed (<10 s per protein). A consensus prediction is formed by applying a majority vote on the 10 predictors when there is no high quality information from NMR, x-ray or DisProt.
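The majority vote can be sketched as follows. This is an illustrative reading of the description, not the MobiDB code: the per-residue string encoding ('D' for disorder, 'S' for structure) and the function name are assumptions, and the text does not say how a 5-to-5 tie among the 10 predictors is resolved, so the sketch arbitrarily assigns structure in that case.

```python
def majority_vote(predictions):
    """Per-residue majority vote over several disorder predictions.

    predictions: list of equal-length strings, one per predictor, where
    'D' marks a residue predicted as disordered and 'S' as structured.
    Assumption: a tie is resolved as structure ('S').
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must cover the same residues")

    n_predictors = len(predictions)
    consensus = []
    for i in range(length):
        disorder_votes = sum(1 for p in predictions if p[i] == "D")
        # Strict majority: more than half of the predictors say disorder.
        consensus.append("D" if 2 * disorder_votes > n_predictors else "S")
    return "".join(consensus)
```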

Combining experimental data

The core of MobiDB is shown in the section ‘Sequence annotations’, where all the data are collected to form a global consensus. The first line of information is dedicated to the ‘long disorder’ consensus and the related percentage of residues, while the last line is dedicated to the ‘predictor’ consensus described above. The second line, ‘Disorder Sources’, contains the overall representation of disorder derived from combining DisProt, PDB and the predictor consensus. For each source of information, a consensus is calculated with three possible states: structure, disorder and ambiguous. These are then merged into an overall consensus, using the logic described in Table 1. Simply put, the consensus assigns disorder or structure only when no contradictions are found, and ambiguous otherwise.

Table 1.

Disorder sources consensus definition matrix

DisProt	PDB	Predictors	Consensus
Disorder	Disorder	Any	Disorder
Disorder	Structure	Any	Ambiguous
Disorder	Ambiguous	Any	Ambiguous
Structure	Disorder	Any	Ambiguous
Structure	Structure	Any	Structure
Structure	Ambiguous	Any	Ambiguous
Ambiguous	Any	Any	Ambiguous
None	Disorder	Any	Disorder
None	Structure	Any	Structure
None	Ambiguous	Any	Ambiguous
None	None	Disorder	Disorder (LC)
None	None	Structure	Structure (LC)

Each possible annotation scenario is listed for the three data sources (DisProt, PDB, predictors) together with its consensus annotation. Ambiguous is used for residues with conflicting annotations warranting further investigation, which may be due to folding upon binding events. LC means low confidence. Combinations yielding structure as consensus are underlined and those for disorder are shown in bold. Sources which are not contributing to the consensus are shown in italics.
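The rows of Table 1 can be expressed as a small function. This is a sketch of the table's logic rather than the MobiDB implementation; the function name and the lowercase state strings are assumptions. Table 1 does not list the case where DisProt has an annotation and PDB has none, so the sketch assumes that the single available experimental source then decides the outcome.

```python
def disorder_sources_consensus(disprot, pdb, predictors):
    """Merge per-residue states from the three sources as in Table 1.

    disprot, pdb: 'disorder', 'structure', 'ambiguous' or None (no data).
    predictors: 'disorder' or 'structure' (predictor consensus).
    Assumption: if only one experimental source has data, it decides.
    """
    experimental = [state for state in (disprot, pdb) if state is not None]
    if experimental:
        # Any ambiguity or contradiction between DisProt and PDB is ambiguous.
        if "ambiguous" in experimental or len(set(experimental)) > 1:
            return "ambiguous"
        return experimental[0]
    # No experimental evidence: fall back to predictors, low confidence (LC).
    return f"{predictors} (LC)"
```

Predictors are consulted only in the last branch, which mirrors the ‘Any’ entries in the Predictors column of Table 1.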

Long disorder and classification

Proteins with long disordered regions are more frequent in higher eukaryotes and are known to have specific functions (3,5), as well as being associated with human diseases such as cancer (31). The prediction consensus is also tuned for the detection of long disordered regions, using an agreement factor (at least 75% of predictors agreeing) and a regular expression matching long regions of >20 consecutive amino acids. The parameters are optimized using a grid search, and small disordered regions (<10 consecutive residues) are removed. The percentage of disordered residues in long regions is calculated to allow an easier search for interested users. Three classes of long disorder percentage are defined: high (>30%), medium (15–30%) and low (0–15%). Thresholds have been optimized for three uniform sequence subsets over a reduced test set with 10 million proteins.
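A minimal sketch of the long disorder calculation follows, assuming per-residue predictions encoded as strings of 'D' and 'S'. The function names are hypothetical, the removal of short regions and the grid search are not reproduced, and the treatment of the exact class boundaries (15% and 30%) is an assumption, since the text gives overlapping ranges.

```python
import re

# More than 20 consecutive disordered residues, i.e. at least 21.
LONG_REGION = re.compile(r"D{21,}")


def agreement_consensus(predictions, threshold=0.75):
    """Mark a residue 'D' when at least `threshold` of predictors agree."""
    n_predictors = len(predictions)
    length = len(predictions[0])
    consensus = []
    for i in range(length):
        votes = sum(1 for p in predictions if p[i] == "D")
        consensus.append("D" if votes >= threshold * n_predictors else "S")
    return "".join(consensus)


def long_disorder_percentage(consensus):
    """Percentage of residues that fall inside long disordered regions."""
    if not consensus:
        return 0.0
    long_residues = sum(len(m.group()) for m in LONG_REGION.finditer(consensus))
    return 100.0 * long_residues / len(consensus)


def classify_long_disorder(percentage):
    """Assign the class; boundary handling at 15% and 30% is an assumption."""
    if percentage > 30:
        return "high"
    if percentage >= 15:
        return "medium"
    return "low"
```

For example, a 100-residue protein with a single 25-residue disordered stretch has 25% of its residues in long regions and would fall in the medium class, while a 20-residue stretch would not count as long at all.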

Implementation

MobiDB was designed with a multi-tier architecture, as previously used in RepeatsDB (32), using separate modules for data management, data processing and presentation functions. To simplify development and maintenance, all tiers handle the common JSON (JavaScript Object Notation) format, thereby eliminating the need for data conversion. The MongoDB database engine is used for data storage and Node.js as middleware between data and presentation. The Angular.js framework and Bootstrap library provide the overall look-and-feel. Additional information is added to entries by querying the UniProt, PDB and Pfam web services. MobiDB offers users both a graphical web interface and RESTful web services, implemented with the Restify library for Node.js, at http://mobidb.bio.unipd.it/. A detailed web service usage guide is available online. MobiDB was designed to be synchronized with UniProt releases, updating its own data accordingly, and has been included in the UniProt cross-references since the January 2014 release.

USING MobiDB

In the main usage scenario the user is able to analyze a particular protein in terms of its mobility and disorder information, either by directly accessing the entry page with a UniProt accession number or by following the cross-reference from UniProt to our website. MobiDB also offers the capability to search the database directly through an advanced query syntax with a complete list of supported query fields for searching specific data (a full explanation can be found in the online documentation). After selecting a query and performing a search, the user will be presented with the results page. Figure 1 shows the results page after searching for ‘P53’ in organism ‘human’. On this page, it is possible to either select a single entry and proceed to the protein visualization interface or sort the results. Results can be sorted either by protein length or by percentage of residues in long disordered regions. To make the disorder content easier to interpret, three classes of long disorder are defined. Low, medium and high disorder are colored green, yellow and red respectively, with the additional special cases of none (white) and full disorder (black) (see Figure 1). Additional information such as the basic UniProt descriptions and organism are also displayed to aid selection.

Figure 1.

Search results page. In this example the keyword ‘P53’ in organism ‘human’ is searched and the first 20 results (out of 262) are shown. Long disorder (% LD) coloring is as follows: none (white), low (green), medium (yellow), high (red) and full (black, not shown). Default sorting is by UniProt results, but can be changed by clicking on % LD or length.

The sequence visualization interface is shown in Figure 2 for alpha-synuclein, a protein involved in neurodegenerative disorders which is not yet well understood. The page is composed of a variety of boxes and sections that can be collapsed to optimize usage of the available workspace. Starting from the top right corner (Figure 2a), five download buttons are available for retrieving the raw disorder data and the other related annotations. In the ‘Protein overview’ box the user can find a basic description of the sequence, such as UniProt ID, protein name and organism. The main annotations, located inside ‘Sequence annotations’ (Figure 2a), are displayed as bars by combining the original data sources. By clicking on the green magnifying glass button next to each annotation, it is possible to open a more detailed sequence viewer. The bars titled Disorder Sources, DisProt, PDB-NMR and PDB-xray are defined in the section ‘Combining experimental data’, while the prediction bars Predictors and Long Disorder are defined in the ‘Disorder predictors’ and ‘Long disorder and classification’ sections, respectively. Other bars give a more comprehensive picture of the protein, displaying Pfam and secondary structure annotations. More detail is also shown on the visualization page. Figure 2b shows the detailed overview of the raw data, i.e. DisProt, PDB-NMR, PDB-xray and Predictors, in the section ‘Detailed disorder annotations’. Where a PDB structure is available, the user can visualize the protein structure in 3D, chain by chain or as the entire complex.
Scrolling down the page, known interacting proteins from the PDB and STRING are classified by disorder content (see Figure 2c). Last but not least, relevant functional features provided by UniProt, such as post-translational modifications, binding site residues and low complexity regions, can be found at the bottom of the page (see Figure 2d). For a complete summary of MobiDB 2.0 improvements over the previous version see Supplementary Table S1. All the different annotations contribute towards a comprehensive molecular story about each UniProt entry.

Figure 2.

Sequence annotations for alpha-synuclein (UniProt entry: P37840). (a) Overview disorder annotations combining DisProt, NMR, x-ray and predictors are shown. The highlighted red circle shows the experimentally determined and predicted long disordered region. Other information includes secondary structure and Pfam domains. Each of these annotations can be downloaded by clicking on the corresponding green button on the top right side of the page. (b) Detailed disorder annotation showing experimental (DisProt, NMR and x-ray) and predicted disorder (10 predictors). For each entry, it is possible to view the detailed sequence annotation by clicking on the green magnifying glass icon (see red circle and left inset). Where available, the 3D structure can be visualized to inspect interesting protein regions (see red circle and right inset). The red circle highlights the only known complete alpha-synuclein structure (PDB entry 2kkw). (c) Known protein–protein interactions deduced from PDB files and STRING are shown in analogy to the search results page, with color-coded long disorder percentage, length, protein name and organism. (d) Functional sequence features from UniProt, including binding sites, post-translational modifications and sequence regions.

CONCLUSIONS AND FUTURE WORK

Intrinsically disordered regions are key for the function of numerous proteins. High quality experimental disorder annotations can be extracted by manual curation and automatically from the PDB. Due to the difficulties in experimentally characterizing disorder, many computational predictors have been developed for various disorder flavors and are essential for large-scale annotation. Here we provide a new version of MobiDB, a centralized source for data on different flavors of disorder in protein structures, now covering over 80 million proteins. The database features three levels of annotation: manually curated, indirect and predicted. The new version also features a consensus annotation for long disordered regions. MobiDB aims at giving the best possible picture of the ‘disorder landscape’ of a given protein of interest. Since it currently covers the full set of UniProt sequences, the included predictors need to be extremely fast, enabling MobiDB to provide disorder annotations for every protein, especially when no curated or indirect data is available. To complement the disorder annotations, MobiDB features additional annotations from external sources such as the UniProt, Pfam and STRING databases, including domains, protein–protein interactions, post-translational modifications, binding sites and low complexity regions. Beyond its current release, MobiDB is a continuous effort to expand, revise and improve intrinsic disorder annotations. Maintaining such an amount of data is not simple, especially considering that the number of protein sequences in UniProt has doubled in less than a year, so the main effort will be to maintain a fully automated protocol allowing regular database updates. Inclusion of other prediction types such as amyloid aggregation tendency with PASTA 2.0 (33) or ubiquitinylation with RUBI (34) is also possible. Thematic collections, e.g. proteins for specific organisms and/or annotation types, will be provided in due course. Interested users are encouraged to submit requests through the online contact form. MobiDB provides the means to obtain disorder annotations for more than 80 million proteins, providing the highest sequence coverage of any available database, while annotating intrinsic disorder as well as possible through its combination of experimental sources and consensus predictions.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

34 in total

Review 1. Getting the most from PSI-BLAST.

Authors: David T Jones; Mark B Swindells
Journal: Trends Biochem Sci Date: 2002-03 Impact factor: 13.807

2. MOBI: a web server to define and visualize structural mobility in NMR protein ensembles.

Authors: Alberto J M Martin; Ian Walsh; Silvio C E Tosatto
Journal: Bioinformatics Date: 2010-09-21 Impact factor: 6.937

3. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features.

Authors: W Kabsch; C Sander
Journal: Biopolymers Date: 1983-12 Impact factor: 2.505

4. Intrinsic disorder in cell-signaling and cancer-associated proteins.

Authors: Lilia M Iakoucheva; Celeste J Brown; J David Lawson; Zoran Obradović; A Keith Dunker
Journal: J Mol Biol Date: 2002-10-25 Impact factor: 5.469

5. Prediction and functional analysis of native disorder in proteins from the three kingdoms of life.

Authors: J J Ward; J S Sodhi; L J McGuffin; B F Buxton; D T Jones
Journal: J Mol Biol Date: 2004-03-26 Impact factor: 5.469

6. Assessment of protein disorder region predictions in CASP10.

Authors: Bohdan Monastyrskyy; Andriy Kryshtafovych; John Moult; Anna Tramontano; Krzysztof Fidelis
Journal: Proteins Date: 2013-11-22

7. RepeatsDB: a database of tandem repeat protein structures.

Authors: Tomás Di Domenico; Emilio Potenza; Ian Walsh; R Gonzalo Parra; Manuel Giollo; Giovanni Minervini; Damiano Piovesan; Awais Ihsan; Carlo Ferrari; Andrey V Kajava; Silvio C E Tosatto
Journal: Nucleic Acids Res Date: 2013-12-05 Impact factor: 16.971

8. Activities at the Universal Protein Resource (UniProt).

Authors:
Journal: Nucleic Acids Res Date: 2013-11-18 Impact factor: 16.971

9. STRING v9.1: protein-protein interaction networks, with increased coverage and integration.

Authors: Andrea Franceschini; Damian Szklarczyk; Sune Frankild; Michael Kuhn; Milan Simonovic; Alexander Roth; Jianyi Lin; Pablo Minguez; Peer Bork; Christian von Mering; Lars J Jensen
Journal: Nucleic Acids Res Date: 2012-11-29 Impact factor: 16.971

10. The RCSB Protein Data Bank: new resources for research and education.

Authors: Peter W Rose; Chunxiao Bi; Wolfgang F Bluhm; Cole H Christie; Dimitris Dimitropoulos; Shuchismita Dutta; Rachel K Green; David S Goodsell; Andreas Prlic; Martha Quesada; Gregory B Quinn; Alexander G Ramos; John D Westbrook; Jasmine Young; Christine Zardecki; Helen M Berman; Philip E Bourne
Journal: Nucleic Acids Res Date: 2012-11-27 Impact factor: 16.971
