Pablo Mier, Lisanna Paladin, Stella Tamana, Sophia Petrosian, Borbála Hajdu-Soltész, Annika Urbanek, Aleksandra Gruca, Dariusz Plewczynski, Marcin Grynberg, Pau Bernadó, Zoltán Gáspári, Christos A Ouzounis, Vasilis J Promponas, Andrey V Kajava, John M Hancock, Silvio C E Tosatto, Zsuzsanna Dosztanyi, Miguel A Andrade-Navarro.
Abstract
There are multiple definitions for low complexity regions (LCRs) in protein sequences, with all of them broadly considering LCRs as regions with fewer amino acid types compared to an average composition. Following this view, LCRs can also be defined as regions showing composition bias. In this critical review, we focus on the definition of sequence complexity of LCRs and their connection with structure. We present statistics and methodological approaches that measure low complexity (LC) and related sequence properties. Composition bias is often associated with LC and disorder, but repeats, while compositionally biased, might also induce ordered structures. We illustrate this dichotomy, and more generally the overlaps between different properties related to LCRs, using examples. We argue that statistical measures alone cannot capture all structural aspects of LCRs and recommend the combined usage of a variety of predictive tools and measurements. While the methodologies available to study LCRs are already very advanced, we foresee that a more comprehensive annotation of sequences in the databases will enable the improvement of predictions and a better understanding of the evolution and the connection between structure and function of LCRs. This will require the use of standards for the generation and exchange of data describing all aspects of LCRs.

SHORT ABSTRACT: There are multiple definitions for low complexity regions (LCRs) in protein sequences. In this critical review, we focus on the definition of sequence complexity of LCRs and their connection with structure. We present statistics and methodological approaches that measure low complexity (LC) and related sequence properties. Composition bias is often associated with LC and disorder, but repeats, while compositionally biased, might also induce ordered structures. We illustrate this dichotomy, plus overlaps between different properties related to LCRs, using examples.
Keywords: composition bias; disorder; low complexity regions; structure
Year: 2020 PMID: 30698641 PMCID: PMC7299295 DOI: 10.1093/bib/bbz007
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Overview of complexity terms and their definitions
| Term | Definition | Reference |
|---|---|---|
| LCR | Regions with a skewed amino acid composition | [ |
| Compositionally biased region | | [ |
| X-rich region | Region with a high proportion of a specific amino acid, where X is the abundant residue | — |
| Repeat motif | Reiteration of residues: (...)n | — |
| Homorepeat (polyX) | Consecutive runs of a single residue: (X)n | [ |
| Direpeat | Consecutive runs of two ordered different residues: (XY)n | — |
| Tandem repeat | Pattern of residues which are directly adjacent to each other: (XYZ…)n | [ |
| Cryptic repeat | Scrambled arrangements of repetitive motifs | [ |
| Imperfect repeat | Regions in which the repeat units are not the same | [ |
| Intrinsically disordered protein | Protein that lacks a fixed or ordered 3D-structure | [ |
| Coiled coil | Structural motif characterized by a seven-residue sequence repeat in which alpha-helices are coiled together to form an extended rope-like structure: ( | [ |
| (Charged) single alpha-helix | A segment forming stable monomeric alpha-helix in aqueous solution, typically rich in Arg/Lys/Glu forming an alternating pattern of short runs of oppositely charged residues | [ |
| Protein flexibility | Ability of a protein to fold into multiple stable 3D-structures | [ |
| Amyloid fibrils | Stable insoluble protein assemblies composed predominantly of β-sheet structures in a cross-β conformation | [ |
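The (X)n and (XY)n patterns in the repeat definitions above can be illustrated with regular expressions. The following is a minimal sketch, not one of the tools discussed in the review; the function names and the minimum lengths (`min_len`, `min_units`) are assumptions chosen for illustration only.

```python
import re

def find_homorepeats(seq, min_len=4):
    """Return (residue, start, end) for runs of a single residue, (X)n with n >= min_len.

    min_len is an illustrative threshold, not a value taken from the source.
    """
    pattern = re.compile(r"(.)\1{%d,}" % (min_len - 1))
    return [(m.group(1), m.start(), m.end()) for m in pattern.finditer(seq)]

def find_direpeats(seq, min_units=3):
    """Return (unit, start, end) for (XY)n with X != Y and n >= min_units.

    min_units is an illustrative threshold, not a value taken from the source.
    """
    # (?!\1) forces the second residue to differ from the first.
    pattern = re.compile(r"(.)(?!\1)(.)(?:\1\2){%d,}" % (min_units - 1))
    return [(m.group(1) + m.group(2), m.start(), m.end())
            for m in pattern.finditer(seq)]

if __name__ == "__main__":
    print(find_homorepeats("MAQQQQQQLK"))  # a polyQ run
    print(find_direpeats("AAGPGPGPKK"))    # a (GP)n direpeat
```

Coordinates are 0-based with an exclusive end, following Python slicing. Cryptic and imperfect repeats, as defined in the table, would not be captured by exact patterns like these.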
Figure 1. The LC diagram: sequence complexity composition versus periodicity. The diagram illustrates where several types of sequences would be placed in relation to two measures related to sequence complexity.
Illustrative set of proteins with LCRs, ordered by the length of the protein
| UniProt accession | Entry name | Protein name | Length (residues) |
|---|---|---|---|
| Q38PT6 | Q38PT6_9HEXA | 6.5 kDa glycine-rich antifreeze protein | 103 |
| P35226 | BMI1_HUMAN | Polycomb complex protein BMI-1 | 326 |
| P20226 | TBP_HUMAN | TATA-box-binding protein | 339 |
| P04637 | P53_HUMAN | Cellular tumor antigen p53 | 393 |
| P32583 | SRP40_YEAST | Suppressor protein SRP40 | 406 |
| P34945 | SYS_THET2 | Serine-tRNA ligase | 421 |
| P0C2W0 | YADA2_YEREN | Adhesin YadA | 422 |
| P02930 | TOLC_ECOLI | Outer membrane protein TolC | 493 |
| P35637 | FUS_HUMAN | RNA-binding protein | 526 |
| P49711 | CTCF_HUMAN | Transcriptional repressor CTCF | 727 |
| P15502 | ELN_HUMAN | Elastin | 786 |
| P42566 | EPS15_HUMAN | Epidermal growth factor receptor substrate 15 | 896 |
| Q9BVN2 | RUSC1_HUMAN | RUN and SH3 domain-containing protein 1 | 902 |
| P10275 | ANDR_HUMAN | Androgen receptor | 920 |
| Q8WVM7 | STAG1_HUMAN | Cohesin subunit SA-1 | 1258 |
| Q9NZW4 | DSPP_HUMAN | DSPP | 1301 |
| Q8ZL64 | SADA_SALTY | Autotransporter adhesin SadA | 1461 |
| P02452 | CO1A1_HUMAN | Collagen alpha-1(I) chain | 1464 |
| A3M3H0 | ATA_ACIBT | Adhesin Ata autotransporter | 1873 |
| P24928 | RPB1_HUMAN | DNA-directed RNA polymerase II subunit RPB1 | 1970 |
| P42858 | HD_HUMAN | Huntingtin | 3142 |
CBRs detected by CAST. A single protein sequence may contain one or more CBRs of the same or even different residue types. The last two columns refer to UniProt/Swiss-Prot entries (release 2014_05) as retrieved from LCR-eXXXplorer.
| Residue type | CBRs | CBRPs | CBRPs in dataset (%) | Swiss-Prot (count) | Swiss-Prot (%) |
|---|---|---|---|---|---|
| A | 4 | 4 | 19.0 | 19465 | 19.5 |
| D | 1 | 1 | 4.8 | 5293 | 5.3 |
| E | 8 | 7 | 33.3 | 25438 | 25.5 |
| G | 7 | 5 | 23.8 | 8771 | 8.8 |
| K | 2 | 1 | 4.8 | 14936 | 15.0 |
| N | 2 | 2 | 9.5 | 5428 | 5.4 |
| P | 9 | 8 | 38.1 | 12000 | 12.0 |
| Q | 5 | 5 | 23.8 | 9149 | 9.2 |
| S | 14 | 13 | 61.9 | 25081 | 25.1 |
| T | 2 | 2 | 9.5 | 4216 | 4.2 |
| R | 0 | 0 | 0 | 3768 | 3.8 |
| C | 0 | 0 | 0 | 1083 | 1.1 |
| H | 0 | 0 | 0 | 2584 | 2.6 |
| I | 0 | 0 | 0 | 2178 | 2.2 |
| L | 0 | 0 | 0 | 2422 | 2.4 |
| M | 0 | 0 | 0 | 766 | 0.8 |
| F | 0 | 0 | 0 | 756 | 0.8 |
| W | 0 | 0 | 0 | 274 | 0.3 |
| Y | 0 | 0 | 0 | 562 | 0.6 |
| V | 0 | 0 | 0 | 1487 | 1.5 |
CBRP, CBR protein.
Figure 2. Shannon entropy value for each detected CBR against the CAST score normalized by the sequence length.
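As a minimal sketch of the entropy axis in Figure 2, the Shannon entropy of a region can be computed from its residue frequencies. The function name `shannon_entropy` and the use of base-2 logarithms are assumptions for illustration, since the log base is not stated here, and the CAST score itself is not reproduced.

```python
import math
from collections import Counter

def shannon_entropy(seq):
    """Shannon entropy (in bits, an assumed unit) of the residue composition of seq."""
    if not seq:
        raise ValueError("sequence must not be empty")
    n = len(seq)
    # H = sum over residue types of p * log2(1 / p)
    return sum((c / n) * math.log2(n / c) for c in Counter(seq).values())

if __name__ == "__main__":
    for region in ("QQQQQQQQ", "GPGPGPGP", "ACDEFGHIKLMNPQRSTVWY"):
        print(region, round(shannon_entropy(region), 3))
```

A homorepeat has entropy 0, a perfect direpeat has entropy 1 bit, and a region using all 20 amino acids equally reaches the maximum of log2(20), about 4.32 bits. Entropy depends only on composition, so it does not distinguish a periodic repeat from a scrambled region with the same residue counts.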
Numbers and major classes of repeats identified by SIMPLE analysis
| Protein | Number of repeats | Major repeat classes (count) |
|---|---|---|
| Q38PT6_9HEXA | 23 | G (19) |
| TBP_HUMAN | 336 | Q (41) |
| P53_HUMAN | 11 | AP (6) |
| SRP40_YEAST | 794 | S (168) |
| FUS_HUMAN | 175 | G (60) |
| CTCF_HUMAN | 1 | EP (1) |
| ELN_HUMAN | 350 | A (30), GV (28) |
| EPS15_HUMAN | 11 | DPF (6) |
| RUSC1_HUMAN | 6 | PP (3) |
| ANDR_HUMAN | 351 | Q (25), G (23) |
| DSPP_HUMAN | 3082 | S (459) |
| SADA_SALTY | 3 | NTT (2) |
| CO1A1_HUMAN | 113 | GP (17) |
| ATA_ACIBT | 21 | NTK, TKTEL (3) |
| RPB1_HUMAN | 948 | SP (96) |
| HD_HUMAN | 211 | P (27) |
Figure 3. Motif graph based on SIMPLE analysis of CO1A1_HUMAN.
(A) Fraction of residues predicted by one method (columns) that are predicted by another method (rows). (B) Enrichment ratio of overlapping residues between two methods compared to random overlap.
| (A) % residues predicted by | | | | |
|---|---|---|---|---|
| | 44.89 | 15.04 | 50.16 | 18.51 |
| | 100.00 | 27.07 | 78.66 | 32.03 |
| | 80.78 | 100.00 | 98.41 | 90.32 |
| | 70.40 | 29.51 | 100.00 | 35.27 |
| | 77.69 | 73.42 | 95.89 | 100.00 |
| (B) Enrichment of overlap | | | | |
| | 1.00 | 1.80 | 1.57 | 1.73 |
| | 1.80 | 1.00 | 1.96 | 4.88 |
| | 1.57 | 1.96 | 1.00 | 1.91 |
| | 1.73 | 4.88 | 1.91 | 1.00 |
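The two quantities in this table can be sketched for a pair of methods as follows. This is an illustrative reading, assuming that "random overlap" means the overlap expected if the two sets of predicted positions were independent; the function name `overlap_stats` and its arguments are hypothetical. The tabulated values are consistent with this reading: for example, 27.07 / 15.04 is approximately 1.80, and 90.32 / 18.51 is approximately 4.88.

```python
def overlap_stats(pred_a, pred_b, n_residues):
    """Compare two sets of predicted residue positions over n_residues positions.

    Returns (fraction of A also in B, fraction of B also in A, enrichment),
    where enrichment = observed overlap / overlap expected under independence.
    """
    a, b = set(pred_a), set(pred_b)
    if not a or not b or n_residues <= 0:
        raise ValueError("both predictions must be non-empty and n_residues positive")
    shared = len(a & b)
    frac_a_in_b = shared / len(a)
    frac_b_in_a = shared / len(b)
    # (shared / N) / ((|A| / N) * (|B| / N)) simplifies to shared * N / (|A| * |B|)
    enrichment = shared * n_residues / (len(a) * len(b))
    return frac_a_in_b, frac_b_in_a, enrichment

if __name__ == "__main__":
    # Toy positions, not data from the review.
    print(overlap_stats(range(0, 40), range(30, 50), 100))
```

The enrichment is symmetric in the two methods, which matches the symmetric layout of panel (B), whereas the conditional fractions in panel (A) are not.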
Figure 4. Comparison of positions detected to be of LC in the 21 proteins of our dataset. Methods SEG (in orange), CAST (in red), SIMPLE (in brown) and IUPred (in purple) were used. ANCHOR (in light blue), which includes structural aspects, is also compared.
Figure 5. LC diagram for various sequence datasets. The percentage of the top amino acid is shown as a function of the percentage of mutations to perfect repeats, calculated for datasets of globular (GLOB) and disordered (IUP) sequences, as well as for fragments of our protein dataset with LC character according to the SEG, CAST and SIMPLE methods.
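The composition axis of this diagram, the percentage of the top amino acid, can be sketched in a few lines. The function name `top_residue_percentage` is hypothetical, and the second axis (mutations to perfect repeats) is not sketched because its exact definition is not given here.

```python
from collections import Counter

def top_residue_percentage(seq):
    """Return (most frequent residue, its percentage of the sequence length)."""
    if not seq:
        raise ValueError("sequence must not be empty")
    residue, count = Counter(seq).most_common(1)[0]
    return residue, 100.0 * count / len(seq)

if __name__ == "__main__":
    # Toy fragment, not taken from the dataset.
    print(top_residue_percentage("GGGSGGGA"))
```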
Figure 6. Structural features of LC proteins. Venn diagram representing the FELLS prediction of dataset proteins, in four categories: secondary structure (SS), LCRs, disorder and aggregation. Each protein is assigned to a category if more than 30% of the residues in its sequence are predicted in that state.
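The assignment rule in this caption can be sketched as a simple threshold on per-state residue counts. The function name, the dictionary layout and the example counts are hypothetical; only the 30% cutoff and the four category names come from the caption, and the sketch does not call FELLS itself.

```python
def assign_categories(state_counts, length, threshold=0.30):
    """Return the set of states covering more than `threshold` of the sequence.

    state_counts maps a state name to the number of residues predicted in it.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    return {state for state, n in state_counts.items() if n / length > threshold}

if __name__ == "__main__":
    # Made-up counts for a 100-residue protein, for illustration only.
    counts = {"SS": 50, "LCR": 30, "disorder": 31, "aggregation": 0}
    print(sorted(assign_categories(counts, 100)))
```

Because the comparison is strict, a state covering exactly 30% of the residues is not assigned, and a protein can fall into several categories at once, which is what produces the overlaps in the Venn diagram.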
Number of residues predicted to be in different structural states
| Protein | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Q38PT6_9HEXA | 0 | 0 | 0 | 0 | 0 | 0 | 48 | 48 |
| SYS_THET2 | 0 | 0 | 0 | 0 | 63 | 63 | 0 | 0 |
| EPS15_HUMAN | 287 | 228 | 59 | 0 | 161 | 102 | 0 | 0 |
| STAG1_HUMAN | 202 | 202 | 0 | 0 | 31 | 31 | 0 | 0 |
| CO1A1_HUMAN | 1168 | 390 | 0 | 778 | 0 | 0 | 778 | 0 |
| ATA_ACIBT | 546 | 450 | 96 | 0 | 96 | 0 | 0 | 0 |