PolyQ: a database describing the sequence and domain context of polyglutamine repeats in proteins.

Amy L Robertson, Mark A Bate, Steve G Androulakis, Stephen P Bottomley, Ashley M Buckle.

Abstract

The polyglutamine diseases are caused in part by a gain-of-function mechanism of neuronal toxicity involving protein conformational changes that result in the formation and deposition of β-sheet rich aggregates. Recent evidence suggests that the misfolding mechanism is context-dependent, and that properties of the host protein, including the domain architecture and location of the repeat tract, can modulate aggregation. In order to allow the bioinformatic investigation of the context of polyglutamines, we have constructed a database, PolyQ (http://pxgrid.med.monash.edu.au/polyq). We have collected the sequences of all human proteins containing runs of seven or more glutamine residues and annotated their sequences with domain information. PolyQ can be interrogated such that the sequence context of polyglutamine repeats in disease and non-disease associated proteins can be investigated.

Year: 2010 PMID: 21059684 PMCID: PMC3013692 DOI: 10.1093/nar/gkq1100

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Polyglutamine (polyQ) repeats are implicated in several neurodegenerative diseases, including Huntington’s disease and several spinocerebellar ataxias. It is commonly thought that a toxic gain-of-function mechanism is triggered by the presence of a polyQ tract, involving a conformational change within the protein and the formation and deposition of β-sheet rich amyloid-like fibrils (1–3). The length of the polyQ repeat is critical to pathogenesis; however, there is evidence that other protein factors, including the location, type and number of flanking domains, can modulate pathogenesis (4–10). Although there are many human polyQ-containing proteins (11), only nine are implicated in pathogenesis, and the precise repeat threshold for pathogenesis varies within the disease subset. For example, a 37 glutamine repeat is sufficient to lead to Huntington’s disease, while SCA3 results only when the polyQ repeat expands to 45 or greater (12–14). Many other human, non-disease related proteins contain polyQ repeats, which are intrinsically prone to expansion at the genetic level (11,15,16). In fact, a 40 glutamine repeat is the normal allele present in the forkhead box P2 transcription factor, a protein that has not been found to be associated with a polyQ disease (17,18). This evidence has led to the hypothesis that protein characteristics modulate the propensity of polyQ-containing proteins to aggregate and cause disease. To investigate the variable characteristics of polyQ proteins, we have performed a bioinformatics investigation of the protein context of polyglutamine repeats and constructed a web-accessible database, termed ‘PolyQ’, of all human proteins containing a polyQ repeat of seven or more glutamines in length. The PolyQ database provides a tool to compare the polyQ repeat location, the occurrence/type of domains and the number of domain repeats present across disease and non-disease proteins.

PolyQ DESCRIPTION AND USE

PolyQ was created using the open-source MySQL relational database server software, version 5.0.82 (http://www.mysql.com), running on an Apple 8-core 3.0 GHz Xeon/OS X Server (version 10.5.8). The database consists of three tables. A web-based query interface to the database was developed in PHP5 and is hosted via Apache 2.2.14. The user interface was built with the jQuery JavaScript library and jQuery widgets. Charts and graphs are constructed on the fly using the Google Visualization API. The PolyQ database was populated by extracting all human sequences from the NCBI non-redundant (NR) database that contained at least seven consecutive glutamine residues. We then performed a Pfam (19) domain search to find protein domains within this subset of sequences. The NCBI NR contains many versions of the same protein, which created bias in the statistical analysis of polyQ location data. We simplified the analysis by identifying protein variants/isoforms and using only the longest isoform of each protein (which we termed ‘master sequences’), thereby eliminating splice variants and protein fragments. Variants/isoforms of each protein were first crudely identified by comparing the protein sequence following the polyQ chains: the 10 amino acids immediately after the polyQ chain of a sequence were used as a ‘search string’, and any other sequence that had the same 10 amino acids immediately following its own polyQ chain was presumed to be homologous to that sequence. The original sequences were then subjected to the BLASTClust (20), FORCE (21), MCL (22) and HomoClust (23) algorithms to confirm these homology groups, and the variants/isoforms were adjusted as necessary. This yielded a total of 128 master sequences from an original data set of >700 polyQ-containing human protein sequences. The database can be searched according to protein name, Pfam domain or sequence.
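
The crude isoform-grouping step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names and toy sequences are hypothetical, exact matching of the 10-residue string is assumed, and the first polyQ run is used as the reference point because the text does not say which run is taken when a protein contains several.

```python
import re
from collections import defaultdict

# Seven or more consecutive glutamines (one-letter code Q), as in the text.
POLYQ = re.compile(r"Q{7,}")

def polyq_runs(seq):
    """Return half-open (start, end) spans of every run of >= 7 glutamines."""
    return [m.span() for m in POLYQ.finditer(seq)]

def flank_key(seq, flank=10):
    """Residues immediately after the first polyQ run, or None if there is no run.

    Assumption: the first run is used; the source does not say which run
    is taken when a protein has several.
    """
    runs = polyq_runs(seq)
    if not runs:
        return None
    end = runs[0][1]
    return seq[end:end + flank]

def master_sequences(records):
    """Group (name, sequence) pairs by flank key and keep the longest per group."""
    groups = defaultdict(list)
    for name, seq in records:
        key = flank_key(seq)
        if key is not None:
            groups[key].append((name, seq))
    return {key: max(members, key=lambda rec: len(rec[1]))
            for key, members in groups.items()}

# Made-up toy sequences, for illustration only.
toy = [
    ("isoform_short", "MA" + "Q" * 8 + "PPLPGAWSTR" + "KKK"),
    ("isoform_long", "Q" * 7 + "PPLPGAWSTR" + "KKKEEEEEE"),
    ("unrelated", "MM" + "Q" * 9 + "A" * 10 + "GG"),
    ("too_short", "M" + "Q" * 6 + "A"),  # only six Q, so no polyQ run here
]
masters = master_sequences(toy)
```

In this toy input the two isoforms share the same 10 residues after their polyQ run, so they fall into one group and the longer one is kept as the master sequence. In the published workflow such crude groups were subsequently checked with dedicated clustering tools, which this sketch does not attempt to reproduce.
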
A typical search (Figure 1A) returns both a graphical summary (Figure 1A, top) and textual details (Figure 1A, bottom), organized according to sequence classification (see below). The graphical summary shows pie chart and bar chart representations of the results according to sequence classification (Figure 1A, top), Pfam domain occurrence (Figure 1B) and Pfam domain repetition (Figure 1C). Retrieved database entries are listed in table format with one row per protein and three columns containing the protein name (with links to the GenBank entry), Pfam domains and protein sequence (with the polyQ region annotated), respectively. Homologs in the database can be included in or excluded from the search. From this view, the domain and sequence context of the polyQ sequence can be identified and further interrogated. To aid analysis, specific entries can be selected from the results (using the ‘examine’ button) and grouped together.

Figure 1.

(A) Typical results of a simple search (blank in this instance), showing a graphical breakdown according to sequence classification. Results can also be shown graphically according to domain occurrence (B) and domain repeats (C) using the tabs at the top of the page [as seen in (A)].

Sequence classification

The data are sorted and annotated according to the following sequence classifications: N-Terminal PolyQs: sequences where the first polyQ chain appears before all Pfam domains; C-Terminal PolyQs: sequences where the last polyQ chain appears after all Pfam domains; Interdomain PolyQs: sequences where the polyQ chains appear between the first Pfam domain and the last Pfam domain; Mid Domain PolyQs: sequences in which the polyQ chain appears in the middle of a Pfam domain, or overlaps a Pfam domain; No Significant Domain PolyQs: sequences that do not contain any significant Pfam domains; Unclassified PolyQs: sequences that did not fit into any of the above classifications. Each group is readily accessed using the tabs in the web page (Figure 1A). We have further reduced the redundancy in the data by clustering sequence homologs, and have tagged known disease proteins.
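
The classification rules above can be expressed as interval comparisons between polyQ spans and Pfam domain spans. The following Python sketch is one possible reading of those rules, not the database's own code: the function names, the half-open (start, end) coordinates and the label strings are assumptions. The text does not state how a protein is labelled when its polyQ chains match several rules, so the sketch returns every label that applies.

```python
def overlaps(a, b):
    """True if half-open intervals a and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]

def classify(polyq, domains):
    """Return all classification labels that apply to one protein.

    polyq   -- list of (start, end) spans of polyQ chains
    domains -- list of (start, end) spans of significant Pfam domains
    """
    if not domains:
        return ["No Significant Domain"]
    if not polyq:
        return ["Unclassified"]

    first_start = min(start for start, _ in domains)
    last_end = max(end for _, end in domains)
    first_q = min(polyq)  # polyQ chain with the smallest start
    last_q = max(polyq)   # polyQ chain with the largest start

    labels = []
    if first_q[1] <= first_start:
        labels.append("N-Terminal")   # first chain ends before any domain starts
    if last_q[0] >= last_end:
        labels.append("C-Terminal")   # last chain starts after every domain ends
    outside = [q for q in polyq if not any(overlaps(q, d) for d in domains)]
    if any(q[0] >= first_start and q[1] <= last_end for q in outside):
        labels.append("Interdomain")  # lies between domains, touching none
    if any(overlaps(q, d) for q in polyq for d in domains):
        labels.append("Mid Domain")   # inside or overlapping a domain
    return labels or ["Unclassified"]
```

For example, `classify([(5, 15)], [(20, 80)])` yields the N-terminal label, while a chain at `(50, 60)` between domains at `(10, 40)` and `(70, 100)` is reported as interdomain. A single-label scheme would additionally need a precedence order, which the text does not specify.
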

Domain occurrence, repeats and disease statistics

The website features pre-constructed pages that show the database entries sorted according to non-disease and disease-causing proteins respectively. This distinction is applied to the sequence classifications above, the domain occurrence (e.g. listing all domains, Figure 1B), and domain repeats (Figure 1C). This allows database entries to be grouped and examined according to whether the polyQ tracts are found in non-disease or disease-causing proteins (Figure 2).
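
As a rough illustration of how such disease and non-disease tallies could be computed, the Python sketch below counts domain occurrence and domain repetition separately for each group. The record layout, the function name and the placeholder domain names are hypothetical. It also assumes that 'occurrence' means the number of proteins containing a domain and that a 'repeat' means a domain found more than once in the same protein, which the text does not define precisely.

```python
from collections import Counter

def domain_stats(entries):
    """Tally Pfam domains separately for disease and non-disease proteins.

    entries -- iterable of dicts with keys:
        'disease' (bool) and 'domains' (list of domain names, one per hit)
    Returns (occurrence, repeats), each a dict keyed by the disease flag.
    """
    occurrence = {True: Counter(), False: Counter()}
    repeats = {True: Counter(), False: Counter()}
    for entry in entries:
        hits = Counter(entry["domains"])
        for name, n_hits in hits.items():
            occurrence[entry["disease"]][name] += 1   # protein contains the domain
            if n_hits > 1:
                repeats[entry["disease"]][name] += 1  # domain repeated in this protein
    return occurrence, repeats

# Placeholder records with made-up domain names, for illustration only.
example = [
    {"disease": True, "domains": ["DomA", "DomA", "DomB"]},
    {"disease": False, "domains": ["DomA"]},
    {"disease": False, "domains": []},
]
occurrence, repeats = domain_stats(example)
```

Keeping the two groups in separate counters mirrors the split into non-disease and disease-causing proteins described above, and the counters can be passed directly to a charting layer.
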

Figure 2.

Selecting the ‘Stats’ menu text shows the entire database contents grouped into non-disease and disease-causing proteins. To aid analysis, specific entries can be selected (indicated by a tickbox) and grouped together using the ‘examine’ button.

CONCLUSIONS AND FUTURE DIRECTIONS

PolyQ is a valuable resource for theoreticians and experimentalists looking for insights into the context of polyQ repeats in proteins and their relationship with disease. Although the query tool allows searching across much of the database, we are developing a custom interface that will allow user-configurable queries against the whole data set, as well as user customization of how the results are displayed. We are also adding structural information [e.g. from the SCOP (24), CATH (25) and PDB (26) databases] to the resource so that the structural context of polyQ repeats can be investigated.

FUNDING

This work is supported by the National Health and Medical Research Council and the Australian Research Council. S.P.B. and A.M.B. are NHMRC Senior Research Fellows. Funding for open access charge: National Health and Medical Research Council (Australia). Conflict of interest statement. None declared.

25 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. An efficient algorithm for large-scale detection of protein families.

Authors: A J Enright; S Van Dongen; C A Ouzounis
Journal: Nucleic Acids Res Date: 2002-04-01 Impact factor: 16.971

3. Basic local alignment search tool.

Authors: S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal: J Mol Biol Date: 1990-10-05 Impact factor: 5.469

Review 4. Protein aggregation and aggregate toxicity: new insights into protein folding, misfolding diseases and biological evolution.

Authors: Massimo Stefani; Christopher M Dobson
Journal: J Mol Med (Berl) Date: 2003-08-27 Impact factor: 4.599

5. Caspase cleavage of gene products associated with triplet expansion disorders generates truncated fragments containing the polyglutamine tract.

Authors: C L Wellington; L M Ellerby; A S Hackam; R L Margolis; M A Trifiro; R Singaraja; K McCutcheon; G S Salvesen; S S Propp; M Bromm; K J Rowland; T Zhang; D Rasper; S Roy; N Thornberry; L Pinsky; A Kakizuka; C A Ross; D W Nicholson; D E Bredesen; M R Hayden
Journal: J Biol Chem Date: 1998-04-10 Impact factor: 5.157

6. Aggregation of huntingtin in neuronal intranuclear inclusions and dystrophic neurites in brain.

Authors: M DiFiglia; E Sapp; K O Chase; S W Davies; G P Bates; J P Vonsattel; N Aronin
Journal: Science Date: 1997-09-26 Impact factor: 47.728

7. Cleavage of atrophin-1 at caspase site aspartic acid 109 modulates cytotoxicity.

Authors: L M Ellerby; R L Andrusiak; C L Wellington; A S Hackam; S S Propp; J D Wood; A H Sharp; R L Margolis; C A Ross; G S Salvesen; M R Hayden; D E Bredesen
Journal: J Biol Chem Date: 1999-03-26 Impact factor: 5.157

8. Amyloid-like features of polyglutamine aggregates and their assembly kinetics.

Authors: Songming Chen; Valerie Berthelier; J Bradley Hamilton; Brian O'Nuallain; Ronald Wetzel
Journal: Biochemistry Date: 2002-06-11 Impact factor: 3.162

9. CTG repeats show bimodal amplification in E. coli.

Authors: P S Sarkar; H C Chang; F B Boudi; S Reddy
Journal: Cell Date: 1998-11-13 Impact factor: 41.582

10. Glutamine repeats as polar zippers: their possible role in inherited neurodegenerative diseases.

Authors: M F Perutz; T Johnson; M Suzuki; J T Finch
Journal: Proc Natl Acad Sci U S A Date: 1994-06-07 Impact factor: 11.205

14 in total

Review 1. Critical evaluation of in silico methods for prediction of coiled-coil domains in proteins.

Authors: Chen Li; Catherine Ching Han Chang; Jeremy Nagel; Benjamin T Porebski; Morihiro Hayashida; Tatsuya Akutsu; Jiangning Song; Ashley M Buckle
Journal: Brief Bioinform Date: 2015-07-15 Impact factor: 11.622

2. Flanking domain stability modulates the aggregation kinetics of a polyglutamine disease protein.

Authors: Helen M Saunders; Dimitri Gilis; Marianne Rooman; Yves Dehouck; Amy L Robertson; Stephen P Bottomley
Journal: Protein Sci Date: 2011-08-18 Impact factor: 6.725

Review 3. Fibrillogenesis of huntingtin and other glutamine containing proteins.

Authors: Yuri L Lyubchenko; Alexey V Krasnoslobodtsev; Sorin Luca
Journal: Subcell Biochem Date: 2012

4. DbStRiPs: Database of structural repeats in proteins.

Authors: Broto Chakrabarty; Nita Parekh
Journal: Protein Sci Date: 2021-03-06 Impact factor: 6.725

5. Trinucleotide repeats: a structural perspective.

Authors: Bruno Almeida; Sara Fernandes; Isabel A Abreu; Sandra Macedo-Ribeiro
Journal: Front Neurol Date: 2013-06-20 Impact factor: 4.003

6. Nucleotide polymorphisms in the canine Noggin gene and their distribution among dog (Canis lupus familiaris) breeds.

Authors: Yuji Ishii; Tatsuya Takizawa; Hiroshi Iwasaki; Yukihiro Fujita; Masaru Murakami; Jay C Groppe; Kazuaki Tanaka
Journal: Biochem Genet Date: 2011-09-01 Impact factor: 1.890

7. ProRepeat: an integrated repository for studying amino acid tandem repeats in proteins.

Authors: Hong Luo; Ke Lin; Audrey David; Harm Nijveen; Jack A M Leunissen
Journal: Nucleic Acids Res Date: 2011-11-18 Impact factor: 16.971

8. Induction of functional Brm protein from Brm knockout mice.

Authors: Kenneth W Thompson; Stefanie B Marquez; Li Lu; David Reisman
Journal: Oncoscience Date: 2015-04-18

9. RepeatsDB: a database of tandem repeat protein structures.

Authors: Tomás Di Domenico; Emilio Potenza; Ian Walsh; R Gonzalo Parra; Manuel Giollo; Giovanni Minervini; Damiano Piovesan; Awais Ihsan; Carlo Ferrari; Andrey V Kajava; Silvio C E Tosatto
Journal: Nucleic Acids Res Date: 2013-12-05 Impact factor: 16.971

10. Polyglutamine-rich suppressors of huntingtin toxicity act upstream of Hsp70 and Sti1 in spatial quality control of amyloid-like proteins.

Authors: Katie J Wolfe; Hong Yu Ren; Philipp Trepte; Douglas M Cyr
Journal: PLoS One Date: 2014-05-14 Impact factor: 3.240