Robert D Finn, Teresa K Attwood, Patricia C Babbitt, Alex Bateman, Peer Bork, Alan J Bridge, Hsin-Yu Chang, Zsuzsanna Dosztányi, Sara El-Gebali, Matthew Fraser, Julian Gough, David Haft, Gemma L Holliday, Hongzhan Huang, Xiaosong Huang, Ivica Letunic, Rodrigo Lopez, Shennan Lu, Aron Marchler-Bauer, Huaiyu Mi, Jaina Mistry, Darren A Natale, Marco Necci, Gift Nuka, Christine A Orengo, Youngmi Park, Sebastien Pesseat, Damiano Piovesan, Simon C Potter, Neil D Rawlings, Nicole Redaschi, Lorna Richardson, Catherine Rivoire, Amaia Sangrador-Vegas, Christian Sigrist, Ian Sillitoe, Ben Smithers, Silvano Squizzato, Granger Sutton, Narmada Thanki, Paul D Thomas, Silvio C E Tosatto, Cathy H Wu, Ioannis Xenarios, Lai-Su Yeh, Siew-Yit Young, Alex L Mitchell.
Abstract
InterPro (http://www.ebi.ac.uk/interpro/) is a freely available database used to classify protein sequences into families and to predict the presence of important domains and sites. InterProScan is the underlying software that allows both protein and nucleic acid sequences to be searched against InterPro's predictive models, which are provided by its member databases. Here, we report recent developments with InterPro and its associated software, including the addition of two new databases (SFLD and CDD), and the functionality to include residue-level annotation and prediction of intrinsic disorder. These developments enrich the annotations provided by InterPro, increase the overall number of residues annotated and allow more specific functional inferences.
Year: 2016 PMID: 27899635 PMCID: PMC5210578 DOI: 10.1093/nar/gkw1107
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Example of an InterPro family hierarchical relationship. The FGGY carbohydrate kinases entry (IPR000577) is the parent of a series of child entries that match smaller, more functionally specific sets of proteins.
Figure 2. Timeline showing the member databases that have joined InterPro since version 1.0, released in 2000.
Figure 3. Examples of the CDD and SFLD hierarchies (A and B). (A) CDD models for related domains are organized hierarchically, reflecting major events in the domain family's molecular evolution and functional diversification. The hierarchy usually follows a tree structure obtained from (C) phylogenetic analysis of multiply aligned sequences. The relationship between the CDD entries in panel A and the sequences in panel B is indicated by colour. The top ‘parent’ entry (isoprenoid biosynthesis enzymes, Class 1 superfamily) is less specific than the ‘leaf’ node entry (trans-isoprenyl diphosphate synthase, head-to-head). (B) The corresponding superfamily, Isoprenoid Synthase Type I, from SFLD. The specificity relationships between the entries are arranged as in panel A. (D) SFLD network analysis graph showing the sequence identity relationships between the Isoprenoid Synthase Type I superfamily members. The E-value threshold for the network is 1e-10, and sequences within nodes share 50% or more sequence identity, calculated using CD-HIT. Note that panels C and D are visualizations from the respective source databases and are not available from the InterPro website. Together, these panels show the different approaches the two databases take to defining and visualizing relationships between families.
Member database release versions integrated into InterPro since release 48.0
| InterPro release | Member database update |
|---|---|
| 49.0 | PROSITE patterns (20.105), PROSITE profiles (20.105) |
| 50.0 | PIRSF (3.01) |
| 51.0 | TIGRFAMs (15.0), HAMAP (201502.04) |
| 52.0 | PROSITE patterns (20.113), PROSITE profiles (20.113) |
| 53.0 | Pfam (28.0) |
| 54.0 | PANTHER (10.0) |
| 55.0 | HAMAP (201511.02) |
| 56.0 | PROSITE patterns (20.119), PROSITE profiles (20.119) |
| 57.0 | Pfam (29.0), SMART (7.1) |
| 58.0 | CDD (1.0)*, HAMAP (201605.11) |
| 59.0 | Pfam (30.0), SFLD (1.0)* |
| 60.0 | MobiDB** |
*New member databases.
**MobiDB is a new non-signature-based database that has been integrated into InterPro to provide annotations of intrinsically disordered (ID) regions. See text for details.
Coverage of the major sequence databases UniProtKB and UniParc (the non-redundant protein sequence archive) by InterPro signatures
| Sequence database | Number of proteins in database | Number of proteins with one or more matches to InterPro |
|---|---|---|
| UniProtKB/Swiss-Prot | 552 884 | 533 303 (96.5%) |
| UniProtKB/TrEMBL | 70 656 157 | 56 310 112 (79.7%) |
| UniProtKB (total) | 71 209 041 | 56 843 415 (79.8%) |
| UniParc | 132 489 873 | 103 835 823 (78.4%) |
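The percentages in the coverage table follow directly from the two count columns, and the UniProtKB (total) row is the sum of the Swiss-Prot and TrEMBL rows. The short check below recomputes them from the counts given in the table. The variable and function names are illustrative only and are not part of InterPro or its software.

```python
# Counts copied from the coverage table: (proteins in database, proteins with >= 1 InterPro match).
# The dictionary and function names are hypothetical, chosen only for this illustration.
coverage_counts = {
    "UniProtKB/Swiss-Prot": (552_884, 533_303),
    "UniProtKB/TrEMBL": (70_656_157, 56_310_112),
    "UniParc": (132_489_873, 103_835_823),
}


def coverage_percent(total, matched):
    """Return the share of matched proteins as a percentage, rounded to one decimal place."""
    return round(100.0 * matched / total, 1)


# UniProtKB (total) is the sum of its Swiss-Prot and TrEMBL sections.
sp_total, sp_matched = coverage_counts["UniProtKB/Swiss-Prot"]
tr_total, tr_matched = coverage_counts["UniProtKB/TrEMBL"]
coverage_counts["UniProtKB (total)"] = (sp_total + tr_total, sp_matched + tr_matched)

for name, (total, matched) in coverage_counts.items():
    print(f"{name}: {matched} of {total} proteins matched ({coverage_percent(total, matched)}%)")
```

Run as written, this should reproduce the four percentages shown in the table.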
Figure 4. Integration of MobiDB Lite annotation within InterPro, enabling annotation of intrinsically disordered (ID) regions within proteins. Top: InterPro annotations for the human mediator of RNA polymerase II transcription subunit 1 protein (UniProtKB accession Q15648). Middle: zoomed-in view of the consensus long-range ID predictions provided by MobiDB Lite. InterPro captures only the consensus output for each sequence, but the graphical representations of the ID regions link to the source website, MobiDB (bottom), where the individual predictions can be viewed.