RepeatsDB 2.0: improved annotation, classification, search and visualization of repeat protein structures.

Lisanna Paladin¹, Layla Hirsh^1,2, Damiano Piovesan¹, Miguel A Andrade-Navarro³, Andrey V Kajava^4,5,6, Silvio C E Tosatto^7,8.

Abstract

RepeatsDB 2.0 (URL: http://repeatsdb.bio.unipd.it/) is an update of the database of annotated tandem repeat protein structures. Repeat proteins are a widespread class of non-globular proteins carrying heterogeneous functions involved in several diseases. Here we provide a new version of RepeatsDB with an improved classification schema including high-quality annotations for ∼5400 protein structures. RepeatsDB 2.0 features information on start and end positions for the repeat regions and units for all entries. The extensive growth of repeat unit characterization was possible by applying the novel ReUPred annotation method over the entire Protein Data Bank, with data quality guaranteed by an extensive manual validation for >60% of the entries. The updated web interface includes a new search engine for complex queries and a fully re-designed entry page for a better overview of structural data. It is now possible to compare unit positions, together with secondary structure, fold information and Pfam domains. Moreover, a new classification level has been introduced on top of the existing scheme as an independent layer for sequence similarity relationships at 40%, 60% and 90% identity.

Substances: Proteins

Year: 2016 PMID: 27899671 PMCID: PMC5210593 DOI: 10.1093/nar/gkw1136

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Tandem repeat regions in proteins are characterized by a repeated sequence coding for a modular architecture, where structural modules are called ‘units’. Proteins with tandem repeats play important functional roles (1), are abundant in nature and related to major health threats (2–6). Detecting and annotating them appropriately may increase our understanding of mechanisms of pathogenicity (e.g. virulence factors (7)), allow the design of scaffold proteins for engineered ligand binding with multiple applications (e.g. cancer therapy (8)) and generally expand our knowledge of the function and structure of many proteins (e.g. the mineralocorticoid receptor (9)). It is widely accepted that structural and functional complexity in domains evolved through fusion, recombination, accretion and repetition of a very small set of elementary functions (10,11). Therefore, units in tandem repeat proteins represent a fundamental source of information to explain contemporary structural diversity and the physico-chemical properties of highly designable folds (12). However, identification of sequence periodicity is an extremely hard task, since repetitive proteins evolve quickly, for two main reasons. The process of duplication originating new repeats is error-prone and identical flanking repeat units have an intrinsic tendency to diverge (13). A number of structure-based methods for the identification of repeats have been developed to fill this gap (14–17). RepeatsDB (18) was proposed in 2014 as a database of repeat protein structures and a resource for high-quality repeat structure annotation. The data were collected with RAPHAEL (19), a state-of-the-art method for the detection of Protein Data Bank (PDB) (20) structures containing repeat regions. The entries were classified into repeat structural classes (21) and further divided into subclasses.
The five repeat classes are mainly distinguished by repeat unit length and general structural arrangement, and subclasses by the secondary structure assignment of the repeat unit. The shortest repeats (class I), of one or two residues, form crystallites and are typically harmful or non-functional in natural organisms. No example of their structure is present in the PDB and consequently in RepeatsDB. Class II structures are fibrous proteins with very short units stabilized by interchain interactions, typically collagens and α-helical coiled coils, with various arrangements described in (22). Class III contains the most typical repeat examples, elongated structures where repetitive units require one another to maintain folding, e.g. β-, α/β- and α-solenoids. Class IV includes all closed repeat structures. Widespread across all types of organisms, this class includes the TIM-barrel and β-propeller subclasses. Both class III and IV contain units of length between 10 and 50 residues. Class V has unit lengths >40 residues and groups ‘beads on a string’ repeats, whose repeat units are large enough to fold independently. All subclasses are characterized by a strong structural conservation in repeat units that is frequently not clearly reflected in sequence. This is the main reason why domain sequence databases such as Pfam (23) and SMART (24) fail to detect a large number of repeats (25). Indeed, most of the largest clusters of human sequence regions not covered by Pfam were found to be repeated (25,26). RepeatsDB was developed to fill this gap and provides the community with a high-quality resource of reliable datasets of repeat structures for various purposes. The first and most obvious goal achieved was to compare the structural classification of repeats with the sequence-based one (26).
Other uses of RepeatsDB are the extraction of repeat datasets to discuss specific features (27,28), benchmarking both sequence- and structure-based repeat detection methods and discussing the role of proteins with repeats (17,29–33). The high-quality manually annotated set of RepeatsDB units (Structural Repeat Unit Library, SRUL) is exploited by ReUPred (34) to predict repeat unit position and classification. RepeatsDB 2.0 includes new annotations, an improved classification and a completely redesigned web server and interface, to guarantee availability of data and a better user experience in terms of database usability and look-and-feel.

DATABASE DESCRIPTION

RepeatsDB 2.0 data have been completely regenerated taking advantage of the new ReUPred predictor (34) for automatic detection of tandem repeat units. In the new database version all entries are annotated at the unit level, i.e. with start and end positions for each repeated segment, and classified at the subclass level. Compared to the old version, unit annotations have grown by more than an order of magnitude. A detailed description of the RepeatsDB annotation pipeline follows.
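
As an illustration of what unit-level annotation means in practice, the following minimal Python sketch models a repeat region as an ordered list of units with inclusive start and end positions. The class names, field names, the placeholder identifier "0abc", the subclass label and all positions are assumptions made for this example; they are not the RepeatsDB schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Unit:
    """One repeated segment, given by inclusive start and end positions."""
    start: int
    end: int

    @property
    def length(self) -> int:
        return self.end - self.start + 1


@dataclass
class RepeatRegion:
    """A tandem repeat region with its classification and ordered units."""
    pdb_id: str          # hypothetical placeholder identifier
    chain: str
    repeat_class: str    # structural class, e.g. "III"
    subclass: str        # illustrative subclass label
    units: List[Unit] = field(default_factory=list)

    @property
    def start(self) -> int:
        return min(u.start for u in self.units)

    @property
    def end(self) -> int:
        return max(u.end for u in self.units)

    @property
    def mean_unit_length(self) -> float:
        return sum(u.length for u in self.units) / len(self.units)


# Made-up positions, for illustration only.
region = RepeatRegion(
    pdb_id="0abc", chain="A", repeat_class="III", subclass="III.1",
    units=[Unit(10, 33), Unit(34, 57), Unit(58, 81)],
)
print(region.start, region.end, len(region.units), region.mean_unit_length)
```

In this sketch the region boundaries and the mean unit length are derived from the units rather than stored separately, so they cannot drift out of sync with the unit list.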

Data curation

The initial dataset for RepeatsDB is the entire PDB (20). Repeat candidates are extracted with RAPHAEL (19) and processed with ReUPred (34) to confirm the presence of repeat regions and provide detailed unit information. ReUPred is a predictor able to identify the position of repeated fragments by performing iterative structural alignments against a manually refined library of representative units. ReUPred is also able to assign the class and subclass by transferring this information from the unit library. The final dataset available in RepeatsDB 2.0 is the result of an iterative process where the ReUPred library has been refined manually multiple times to resolve conflicts, improve its ability to generalize and include newly discovered subclasses. At the end of the process, an extensive validation and refinement of the predictions has been carried out by expert visual inspection. More than 60% of the entries have been reviewed and five new subclasses created, three for class IV (closed structures) and two for class V (beads on a string).

Implementation

RepeatsDB was designed as a multi-tier architecture, with three modules managing data storage, processing and presentation, respectively. Data are stored in a MongoDB database, and processed with Node.js. The server is accessible through a web interface or programmatically exploiting a RESTful architecture. The web interface is designed using the Angular.js and Bootstrap frameworks. Dynamic and interactive elements of the entry page are developed using PV for structure visualization and Bio.js as sequence feature viewer, respectively. Both the database structure and Node.js server have been completely rewritten to improve efficiency and data reliability. Moreover, all data derived from third party resources have been processed and stored locally to prevent broken dependencies.

Innovations

Apart from the new annotation pipeline, several bugs have been fixed and many improvements have been introduced since the last RepeatsDB release. All positional annotations are now based on SIFTS (35), making them consistent with both PDB (20) and UniProt (36) references. The search engine has been completely redesigned. An intuitive search interface allows users to perform complex queries using logical operators and guides them through all possible search fields. A new classification level has been added to include evolutionary relationships among different repeat regions. An all-vs.-all alignment of the repeat regions made it possible to group them according to sequence similarity and to identify different repeat families. The new classification has been implemented as an independent layer on top of the existing structural features, and is available at three different identity thresholds (40%, 60% and 90%). The web interface allows users to navigate entry clusters, providing an overview of the representative sequences inside each structural subclass.
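
The grouping of repeat regions at several identity thresholds can be sketched as follows. This is a minimal Python illustration and not the procedure used to build RepeatsDB: it assumes that pairwise percent identities from an all-vs.-all alignment are already available, and it assumes single-linkage clustering, which the text does not specify. The function name, region identifiers and identity values are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Set


def cluster_by_identity(
    ids: Iterable[str],
    identity: Dict[FrozenSet[str], float],
    threshold: float,
) -> List[Set[str]]:
    """Single-linkage clustering (an assumption): two regions are joined
    when their pairwise identity in percent is at or above the threshold."""
    parent = {i: i for i in ids}

    def find(x: str) -> str:
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for pair, pct in identity.items():
        a, b = tuple(pair)
        if pct >= threshold:
            parent[find(a)] = find(b)

    clusters: Dict[str, Set[str]] = {}
    for i in parent:
        clusters.setdefault(find(i), set()).add(i)
    return sorted(clusters.values(), key=lambda c: sorted(c))


# Made-up pairwise identities between four hypothetical repeat regions.
regions = ["r1", "r2", "r3", "r4"]
pairwise = {
    frozenset(("r1", "r2")): 95.0,
    frozenset(("r1", "r3")): 65.0,
    frozenset(("r2", "r3")): 62.0,
    frozenset(("r1", "r4")): 45.0,
    frozenset(("r2", "r4")): 30.0,
    frozenset(("r3", "r4")): 35.0,
}
levels = {t: cluster_by_identity(regions, pairwise, t) for t in (40, 60, 90)}
for t, found in levels.items():
    print(t, [sorted(c) for c in found])
```

With these toy values the clusters become progressively finer as the threshold rises, which is the behaviour expected of a layered 40%, 60% and 90% classification.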

DATABASE USAGE

The user interface presents an intuitive summary table providing direct access to all entries by structural class directly from the home page. For a finer search, the user can visit either the ‘Browse’ page providing subclass access or the ‘Search’ page for generating complex queries (Figure 1A and B). All entry points redirect to the same result page listing the retrieved proteins in a table (Figure 1C). The table can be further filtered by providing additional matching strings in the column headers. The ‘Browse’ page also provides direct access to sequence clusters, where entries are grouped by sequence similarity. The redesigned entry page (Figure 2) is much more informative compared to the previous RepeatsDB version, including several cross-links to third party resources. It also integrates several structural features useful for comparing CATH, SCOP, Pfam and DSSP annotations with RepeatsDB data. Regions, units and insertions are provided for all entries and correctly mapped both to UniProt and PDB reference (SEQRES field in the PDB file) sequences thanks to the SIFTS service. The correct mapping can strongly improve RepeatsDB impact since it is now very easy to link repeat data with other sequence features like mutations or post-translational modifications. Thanks to a RESTful architecture, all RepeatsDB data are accessible from external APIs and third party resources through HTTP URLs. Please refer to the ‘Help’ page of the website for details on using the RepeatsDB web services. Customized datasets can be downloaded in JSON or text format using the browse function or RESTful web services.
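
To illustrate why a consistent mapping between SEQRES and UniProt numbering makes it easy to link repeat data with features such as mutations, the Python sketch below maps unit boundaries to UniProt positions and reports which unit covers a given mutation. The segment representation, function names and all numbers are assumptions made for this example; they do not reproduce the SIFTS data format or the RepeatsDB web services.

```python
from typing import List, NamedTuple, Optional, Tuple


class Segment(NamedTuple):
    """A contiguous stretch in which SEQRES and UniProt numbering
    differ by a constant offset (hypothetical representation)."""
    pdb_start: int
    pdb_end: int
    unp_start: int


def pdb_to_uniprot(pos: int, segments: List[Segment]) -> Optional[int]:
    """Map a SEQRES position to UniProt numbering, or None if unmapped."""
    for seg in segments:
        if seg.pdb_start <= pos <= seg.pdb_end:
            return seg.unp_start + (pos - seg.pdb_start)
    return None


def unit_containing(
    mutation_unp: int,
    units: List[Tuple[int, int]],
    segments: List[Segment],
) -> Optional[int]:
    """Return the index of the unit covering a UniProt position, if any."""
    for idx, (start, end) in enumerate(units):
        u_start = pdb_to_uniprot(start, segments)
        u_end = pdb_to_uniprot(end, segments)
        if u_start is None or u_end is None:
            continue  # unit boundary falls outside the mapped segments
        if u_start <= mutation_unp <= u_end:
            return idx
    return None


# Made-up mapping: SEQRES 1-50 -> UniProt 101-150, SEQRES 51-90 -> 161-200.
segments = [Segment(1, 50, 101), Segment(51, 90, 161)]
# Made-up units as (start, end) pairs in SEQRES numbering.
units = [(5, 28), (29, 52), (53, 76)]

print(pdb_to_uniprot(51, segments))
print(unit_containing(140, units, segments))
print(unit_containing(170, units, segments))
```

Only the unit boundaries are mapped here, so a unit that spans a gap between two segments is still treated as one continuous UniProt interval. Whether that is appropriate depends on the analysis.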

Figure 1.

Retrieving RepeatsDB data. RepeatsDB data can be retrieved in three different ways. (A) The ‘Browse’ page provides the entry point for both the structural hierarchy and sequence clusters. (B) The ‘Search’ page allows the user to perform advanced queries against a range of RepeatsDB-specific and third-party search fields. The input can be simple text or numeric (single value or range) according to the field type and multiple queries can be combined by boolean operators (AND, OR, NOT). Both the ‘Browse’ and ‘Search’ pages redirect to the results page (C). This page provides a table with the list of retrieved entries and can be further filtered (and sorted) through column header fields. Results can be displayed by PDB chain (default), region or UniProt.

Figure 2.

Screenshot of RepeatsDB sample entry page for PDB code 1ialA. The top part of the page (A) reports structure information from the PDB and cross-references to third-party databases including UniProt, MobiDB, SCOP, CATH and Pfam (when available). RepeatsDB annotations are available for download both in text and JSON formats on the top-right corner. (B) A table provides region details such as structural classification, start/end position, number of units, repeat period and cluster families. (C) The feature viewer summarizes available annotation for the PDB reference sequence, i.e. the SEQRES field in the PDB file. An overview of RepeatsDB information (regions, units and insertions) along with secondary structure (DSSP), Pfam, SCOP and CATH tracks (when available) are shown. (D) A detailed view of RepeatsDB annotations is highlighted in the sequence and PDB viewers.
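
The combination of text and numeric search fields with boolean operators described for the ‘Search’ page can be sketched as composable predicates. This Python example is only an illustration of the idea and not the RepeatsDB search engine; the field names, helper functions and entries are all hypothetical.

```python
from typing import Any, Callable, Dict, Optional

Entry = Dict[str, Any]
Predicate = Callable[[Entry], bool]


def text(field: str, value: str) -> Predicate:
    """Case-insensitive substring match on a text field."""
    return lambda e: value.lower() in str(e.get(field, "")).lower()


def number(field: str, low: float, high: Optional[float] = None) -> Predicate:
    """Match a single value (high omitted) or an inclusive range."""
    hi = low if high is None else high
    return lambda e: field in e and low <= e[field] <= hi


def AND(*ps: Predicate) -> Predicate:
    return lambda e: all(p(e) for p in ps)


def OR(*ps: Predicate) -> Predicate:
    return lambda e: any(p(e) for p in ps)


def NOT(p: Predicate) -> Predicate:
    return lambda e: not p(e)


# Made-up entries with hypothetical field names.
entries = [
    {"id": "e1", "class": "III", "units": 12, "title": "ankyrin repeat protein"},
    {"id": "e2", "class": "IV", "units": 7, "title": "beta-propeller hydrolase"},
    {"id": "e3", "class": "III", "units": 4, "title": "leucine-rich repeat receptor"},
]

query = AND(text("class", "III"), number("units", 5, 20),
            NOT(text("title", "receptor")))
hits = [e["id"] for e in entries if query(e)]

either = OR(text("title", "propeller"), number("units", 4))
either_hits = [e["id"] for e in entries if either(e)]
print(hits, either_hits)
```

Because every operator returns another predicate, queries nest to any depth in the same way that the search form lets several conditions be combined.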

Statistics

RepeatsDB provides high-quality annotation for ∼5400 entries. Figure 3 compares the current RepeatsDB content to the previous version. The chart shows the total number of entries belonging to each class. The new version provides unit definition and subclass classification for all entries, whereas the old version annotated only a tiny fraction at this level (327 entries, cyan bar). Moreover, in RepeatsDB 2.0 more than 60% of the entries have been manually reviewed by expert curators (blue segment). Further details such as the number of regions, units and genes are available from the ‘Stats’ page of the web site.

Figure 3.

RepeatsDB growth. RepeatsDB 2.0 is compared to the previous release. Entries have unit and subclass annotation, with more than 60% manually reviewed (blue). For the old version, only a tiny fraction of entries have unit definition (cyan) and the rest is mostly annotated only at the class level (yellow).

CONCLUSION AND FUTURE WORK

RepeatsDB was presented in 2014 with the goal of providing the community with a central resource for high-quality tandem repeat protein structure annotation. It has been cited in a number of different studies regarding repeat proteins, and has been used to extract datasets for repeat protein analysis and to test algorithms for repeat protein annotation. The detailed annotation of entries performed by RepeatsDB curators has allowed us to build a high-quality Structural Repeat Unit Library (SRUL). This library was exploited by the ReUPred algorithm (34) as a gold standard to define unit position in new entries in an iterative process. The new release of RepeatsDB includes a new annotation pipeline, combining the RAPHAEL algorithm for repeat detection (19) and ReUPred for annotation (34), producing extensive annotation for all entries. The pipeline is fully automated and allows the easy regular update of the database. The iterative execution of the pipeline has already demonstrated its efficacy, both because it identified a large number of new entries and because new subclasses were identified and added to the structural classification scheme. RepeatsDB will benefit from regular updates, which will steadily increase the number of available annotations. Future work will concentrate on exploiting the repeat unit definitions to create profiles for use in detecting repeats from sequence for genome-scale analysis (37).

37 in total

Review 1. Protein repeats: structures, functions, and evolution.

Authors: M A Andrade; C Perez-Iratxeta; C P Ponting
Journal: J Struct Biol Date: 2001 May-Jun Impact factor: 2.867

2. Tandem repeats in proteins: from sequence to structure.

Authors: Andrey V Kajava
Journal: J Struct Biol Date: 2011-08-24 Impact factor: 2.867

3. SMART: recent updates, new developments and status in 2015.

Authors: Ivica Letunic; Tobias Doerks; Peer Bork
Journal: Nucleic Acids Res Date: 2014-10-09 Impact factor: 16.971

4. PDBe: improved accessibility of macromolecular structure data from PDB and EMDB.

Authors: Sameer Velankar; Glen van Ginkel; Younes Alhroub; Gary M Battle; John M Berrisford; Matthew J Conroy; Jose M Dana; Swanand P Gore; Aleksandras Gutmanas; Pauline Haslam; Pieter M S Hendrickx; Ingvar Lagerstedt; Saqib Mir; Manuel A Fernandez Montecelo; Abhik Mukhopadhyay; Thomas J Oldfield; Ardan Patwardhan; Eduardo Sanz-García; Sanchayita Sen; Robert A Slowley; Michael E Wainwright; Mandar S Deshpande; Andrii Iudin; Gaurav Sahni; Jose Salavert Torres; Miriam Hirshberg; Lora Mak; Nurul Nadzirin; David R Armstrong; Alice R Clark; Oliver S Smart; Paul K Korir; Gerard J Kleywegt
Journal: Nucleic Acids Res Date: 2015-10-17 Impact factor: 16.971

5. Swelfe: a detector of internal repeats in sequences and structures.

Authors: Anne-Laure Abraham; Eduardo P C Rocha; Joël Pothier
Journal: Bioinformatics Date: 2008-05-16 Impact factor: 6.937

Review 6. Tandem Repeats in Proteins: Prediction Algorithms and Biological Role.

Authors: Marco Pellegrini
Journal: Front Bioeng Biotechnol Date: 2015-09-24

7. The InterPro protein families database: the classification resource after 15 years.

Authors: Alex Mitchell; Hsin-Yu Chang; Louise Daugherty; Matthew Fraser; Sarah Hunter; Rodrigo Lopez; Craig McAnulla; Conor McMenamin; Gift Nuka; Sebastien Pesseat; Amaia Sangrador-Vegas; Maxim Scheremetjew; Claudia Rato; Siew-Yit Yong; Alex Bateman; Marco Punta; Teresa K Attwood; Christian J A Sigrist; Nicole Redaschi; Catherine Rivoire; Ioannis Xenarios; Daniel Kahn; Dominique Guyot; Peer Bork; Ivica Letunic; Julian Gough; Matt Oates; Daniel Haft; Hongzhan Huang; Darren A Natale; Cathy H Wu; Christine Orengo; Ian Sillitoe; Huaiyu Mi; Paul D Thomas; Robert D Finn
Journal: Nucleic Acids Res Date: 2014-11-26 Impact factor: 16.971

8. Protein Repeats from First Principles.

Authors: Pablo Turjanski; R Gonzalo Parra; Rocío Espada; Verónica Becher; Diego U Ferreiro
Journal: Sci Rep Date: 2016-04-05 Impact factor: 4.379

9. Exploring the repeat protein universe through computational protein design.

Authors: T J Brunette; Fabio Parmeggiani; Po-Ssu Huang; Gira Bhabha; Damian C Ekiert; Susan E Tsutakawa; Greg L Hura; John A Tainer; David Baker
Journal: Nature Date: 2015-12-16 Impact factor: 49.962

10. The early history and emergence of molecular functions and modular scale-free network behavior.

Authors: M Fayez Aziz; Kelsey Caetano-Anollés; Gustavo Caetano-Anollés
Journal: Sci Rep Date: 2016-04-28 Impact factor: 4.379
