Matt E Oates, Pedro Romero, Takashi Ishida, Mohamed Ghalwash, Marcin J Mizianty, Bin Xue, Zsuzsanna Dosztányi, Vladimir N Uversky, Zoran Obradovic, Lukasz Kurgan, A Keith Dunker, Julian Gough.
Abstract
We present the Database of Disordered Protein Prediction (D2P2), available at http://d2p2.pro (including website source code). A battery of disorder predictors and their variants, VL-XT, VSL2b, PrDOS, PV2, Espritz and IUPred, were run on all protein sequences from 1765 complete proteomes (to be updated as more genomes are completed). Integrated with these results are all of the predicted (mostly structured) SCOP domains using the SUPERFAMILY predictor. These disorder/structure annotations together enable comparison of the disorder predictors with each other and examination of the overlap between disordered predictions and SCOP domains on a large scale. D2P2 will increase our understanding of the interplay between disorder and structure, the genomic distribution of disorder, and its evolutionary history. The parsed data are made available in a unified format for download as flat files or SQL tables either by genome, by predictor, or for the complete set. An interactive website provides a graphical view of each protein annotated with the SCOP domains and disordered regions from all predictors overlaid (or shown as a consensus). There are statistics and tools for browsing and comparing genomes and their disorder within the context of their position on the tree of life.
Year: 2012 PMID: 23203878 PMCID: PMC3531159 DOI: 10.1093/nar/gks1226
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. An example graphical report from the D2P2 website for two transcripts of the human gene BIN1. All disorder predictions (pastel-colored blocks) are stacked and aligned against the polypeptide chain in black. Their interplay with the predicted SCOP domains (bright-colored rounded blocks) is shown. The level of agreement between all of the disorder predictors is shown as color intensity in an aligned gradient bar below the stack of predictions. The green segments represent disorder that is not found within a predicted SCOP domain. The blue segments are where the disorder predictions intersect the SCOP domain prediction. Below the disorder agreement line, ANCHOR binding region predictions are displayed (yellow blocks with zigzag infill), along with PTM sites from PhosphoSitePlus when known (shown as lettered spheres hanging below other predictions).
The number of genomes and sequences included in the database at the time of writing
| Domain | Number of genomes | Reference species | Strains | Total sequences |
|---|---|---|---|---|
| Eukarya | 352 | 298 | 54 | 5 746 620 |
| Bacteria | 1305 | 862 | 443 | 4 216 314 |
| Archaea | 108 | 96 | 12 | 238 232 |
| Total | 1765 | 1256 | 509 | 10 429 761 |
The intention is to expand this over time as new genomes are described.
Figure 2. Toy example of the D2P2 predictor consensus calculation (see Figure 1 for a real example). The colored bars (top) represent real-valued and binary disorder prediction output for four imagined predictors. Any real-valued output is converted to binary form by thresholding at a cut-off of 0.5 (as per CASP requirements) or at each prediction method's advised cut-off minimizing false-positive rate. Next, a binary N × M matrix of per-residue (N) and per-predictor (M) results is created (blue arrow). The percentage of predictors agreeing on a disordered state is calculated for each column of the binary matrix. This is then re-encoded as a binary matrix (bottom) for each threshold of agreement (or consensus) and further run-length encoded for storage in the database as a set of agreed-upon regions of disorder. Taking a higher percentage cut-off of consensus yields a more conservative result, with 100% likely under-predicting. When searching online with D2P2, 75% consensus is used to highlight regions of sequence that are likely disordered.
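The consensus steps described in the Figure 2 caption can be sketched in a few lines of Python. This is an illustrative sketch only, not the D2P2 source code: the function names, the use of `>=` at both the score cut-off and the consensus threshold, the 1-based inclusive region coordinates, and the set of thresholds (25, 50, 75, 100) are all assumptions made for the example. Predictors are stored as rows and residues as columns, so that agreement is computed per column.

```python
from typing import Dict, List, Sequence, Tuple

Region = Tuple[int, int]  # 1-based inclusive (start, end); an assumed convention


def binarize(scores: Sequence[float], cutoff: float = 0.5) -> List[int]:
    """Convert real-valued per-residue scores to 0/1 disorder calls.

    Using >= at the cut-off is an assumption of this sketch.
    """
    return [1 if s >= cutoff else 0 for s in scores]


def agreement_percent(calls: Sequence[Sequence[int]]) -> List[float]:
    """Percentage of predictors calling each residue disordered.

    `calls` holds one row per predictor, one column per residue.
    """
    n_predictors = len(calls)
    return [100.0 * sum(column) / n_predictors for column in zip(*calls)]


def run_length_regions(flags: Sequence[int]) -> List[Region]:
    """Run-length encode a 0/1 vector as the regions covered by runs of 1s."""
    regions: List[Region] = []
    start = None
    for position, flag in enumerate(flags, start=1):
        if flag and start is None:
            start = position                       # a run of 1s begins
        elif not flag and start is not None:
            regions.append((start, position - 1))  # the run ended just before here
            start = None
    if start is not None:                          # run extends to the last residue
        regions.append((start, len(flags)))
    return regions


def consensus_regions(
    calls: Sequence[Sequence[int]],
    thresholds: Sequence[int] = (25, 50, 75, 100),  # hypothetical threshold set
) -> Dict[int, List[Region]]:
    """Agreed-upon disorder regions for each consensus threshold."""
    percent = agreement_percent(calls)
    return {
        t: run_length_regions([1 if p >= t else 0 for p in percent])
        for t in thresholds
    }


# Toy input: four imagined predictors over eight residues.
toy_calls = [
    binarize([0.9, 0.8, 0.6, 0.4, 0.2, 0.1, 0.7, 0.9]),  # real-valued output
    [1, 1, 0, 0, 0, 0, 1, 1],                            # already binary
    [1, 1, 1, 1, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 1, 1],
]
toy_consensus = consensus_regions(toy_calls)
print(toy_consensus[75])
```

For this toy input, the 75% threshold keeps residues 1 to 2 and 7 to 8, while the 100% threshold keeps only residues 1 and 8, which illustrates why higher thresholds give more conservative results.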
Figure 3. A graph showing the distribution of total disorder coverage per protein over the whole database of protein sequences for each predictor. The X axis shows the percentage of a protein sequence that was covered with disorder prediction from a given predictor, binned at 1% intervals. The Y axis shows the frequency of observed sequences with a given percentage coverage of disorder, log10 scaled for ease of comparison. The inset (left) shows the first 3%, zoomed for clarity of how each predictor treats more structured proteins; the inset (right) shows the final 3%, where proteins are predicted to be profoundly disordered with little to no stable tertiary structure.
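A distribution like the one in Figure 3 could be tabulated roughly as follows. This is a minimal sketch under stated assumptions, not the procedure used for the paper: disorder regions are taken to be non-overlapping 1-based inclusive `(start, end)` pairs, coverage is floored to a whole percent so that bins run from 0 to 100, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Region = Tuple[int, int]  # 1-based inclusive (start, end); an assumed convention


def coverage_bin(regions: Iterable[Region], length: int) -> int:
    """Whole-percent disorder coverage of one protein (floored, 0 to 100).

    Assumes the regions do not overlap one another.
    """
    covered = sum(end - start + 1 for start, end in regions)
    return 100 * covered // length


def coverage_histogram(proteins: Iterable[Tuple[List[Region], int]]) -> Counter:
    """Count proteins per 1% coverage bin for a single predictor."""
    return Counter(coverage_bin(regions, length) for regions, length in proteins)


def log10_frequencies(histogram: Counter) -> Dict[int, float]:
    """log10 of each non-empty bin count, as used for the Y axis scaling."""
    return {b: math.log10(count) for b, count in histogram.items() if count > 0}


# Toy input: (disorder regions, sequence length) for four imagined proteins.
toy_proteins = [
    ([(1, 10)], 100),
    ([(1, 5), (20, 24)], 100),
    ([], 50),
    ([(1, 50)], 50),
]
toy_histogram = coverage_histogram(toy_proteins)
toy_log = log10_frequencies(toy_histogram)
```

Flooring to integer percent puts fully disordered proteins in their own bin (100), which is one way to keep the right-hand tail of the distribution visible.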
Figure 4. A bar chart, grouped by prediction method, of global percentage disorder predicted per domain of cellular life. The X axis shows results per domain grouped by predictor; the Y axis shows the percentage of all amino acid residues for a given domain of life predicted disordered by a given method.
Figure 5. Amino acids that were predicted to be disordered (Figure 4) were then sub-classified as either inter- or intra-domain disorder. This figure shows a bar chart, with results grouped by predictor, of the percentage of disordered amino acids that reside within a predicted SCOP domain. The X axis shows results per domain grouped by predictor; the Y axis shows the percentage of all amino acid residues for a given domain of life predicted disordered by a given method.
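The inter- versus intra-domain split in Figure 5 amounts to intersecting disordered residues with predicted SCOP domain residues. The sketch below shows one way to compute the intra-domain percentage for a single protein. The region format (1-based inclusive `(start, end)` pairs) and the function names are assumptions for illustration, and returning 0.0 for a protein with no predicted disorder is an arbitrary choice of this sketch.

```python
from typing import Iterable, Set, Tuple

Region = Tuple[int, int]  # 1-based inclusive (start, end); an assumed convention


def residue_set(regions: Iterable[Region]) -> Set[int]:
    """Expand regions into the set of residue positions they cover."""
    return {pos for start, end in regions for pos in range(start, end + 1)}


def intra_domain_percent(
    disorder: Iterable[Region], domains: Iterable[Region]
) -> float:
    """Percentage of disordered residues that fall inside a predicted domain."""
    disordered = residue_set(disorder)
    if not disordered:
        return 0.0  # no disorder predicted; arbitrary convention for this sketch
    inside = disordered & residue_set(domains)
    return 100.0 * len(inside) / len(disordered)


# Toy input: 20 disordered residues, 6 of which (5 to 10) lie inside the domain.
toy_disorder = [(1, 10), (50, 59)]
toy_domains = [(5, 40)]
toy_percent = intra_domain_percent(toy_disorder, toy_domains)
print(toy_percent)
```

Using sets of positions keeps the example short and tolerates overlapping regions; for proteome-scale data an interval-based intersection would avoid materializing every residue.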