Sean M Cascarina, Mikaela R Elder, Eric D Ross.
Abstract
A variety of studies have suggested that low-complexity domains (LCDs) tend to be intrinsically disordered and are relatively rare within structured proteins in the Protein Data Bank (PDB). Although LCDs are often treated as a single class, we previously found that LCDs enriched in different amino acids can exhibit substantial differences in protein metabolism and function. Therefore, we wondered whether the structural conformations of LCDs are likewise dependent on which specific amino acids are enriched within each LCD. Here, we directly examined relationships between enrichment of individual amino acids and secondary structure tendencies across the entire PDB proteome. Secondary structure tendencies varied as a function of the identity of the amino acid enriched and its degree of enrichment. Furthermore, divergence in secondary structure profiles often occurred for LCDs enriched in physicochemically similar amino acids (e.g. valine vs. leucine), indicating that LCDs composed of related amino acids can have distinct secondary structure tendencies. Comparison of LCD secondary structure tendencies with numerous pre-existing secondary structure propensity scales resulted in relatively poor correlations for certain types of LCDs, indicating that these scales may not capture secondary structure tendencies as sequence complexity decreases. Collectively, these observations provide a highly resolved view of structural tendencies among LCDs parsed by the nature and magnitude of single amino acid enrichment.
Year: 2020; PMID: 31986130; PMCID: PMC7004392; DOI: 10.1371/journal.pcbi.1007487
Source DB: PubMed; Journal: PLoS Comput Biol; ISSN: 1553-734X; Impact factor: 4.475
Fig 1. LCDs are abundant in structured proteins.
Each subplot indicates the frequency distributions for all peptide subsequences within PDB proteins as a function of scanning window size and amino acid composition for each amino acid indicated in the subplot titles.
Fig 2. Abundances of classically-defined LCDs within the PDB and eukaryotic proteomes.
The PDB proteome was exhaustively scanned using a 12aa window size, and the Shannon entropy was calculated for each segment. Segments with Shannon entropy ≤2.2 bits were classified as LCDs. For each amino acid, LCDs were assigned to that category if the frequency of the amino acid was ≥ the maximum frequency for all amino acids within the LCD sequence. The total number of PDB chain sequences with each type of LCD (A) and the total number of regions scoring as LCDs for each LCD category (B) are indicated. The numbers of proteins with each type of LCD were similarly plotted for the yeast and human proteomes and are indicated in (C) and (D), respectively. Additionally, the percentage of each amino acid found within LCDs of that amino acid type is indicated in (E) for all 20 amino acids. For all plots, the LCDs for which the indicated amino acid was clearly the predominant amino acid in the LCD sequence represent “unambiguous” LCDs, while LCDs for which another amino acid was equally abundant within the LCD sequence represent “ambiguous” LCDs. Proteins were preferentially assigned to the “unambiguous” category if they contained at least one region that could be unambiguously identified as an LCD of a given type. Insets in the upper-right corner of panels A-D indicate the frequencies of LCDs when grouped by physicochemical categories: hydrophobic (A, I, L, M, and V); charged (D, E, H, K, and R); polar (C, N, Q, S, and T); aromatic (F, W, and Y); and hydrophilic (combination of charged and polar classes). Total values corresponding to broad physicochemical categories in the insets do not represent the sum of the frequencies of individual types of LCDs, since some proteins contain multiple distinct types of LCDs that fall into the same physicochemical category.
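The window scan and entropy threshold described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the only values taken from the caption are the 12-residue window and the 2.2-bit cutoff. A window is reported as "ambiguous" when two or more amino acids tie for the maximum frequency.

```python
import math
from collections import Counter

WINDOW = 12          # window size stated in the caption
MAX_ENTROPY = 2.2    # bits; threshold stated in the caption


def shannon_entropy(segment: str) -> float:
    """Shannon entropy (bits) of the amino acid composition of a segment."""
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in Counter(segment).values())


def classify_lcd_windows(sequence: str, window: int = WINDOW,
                         max_entropy: float = MAX_ENTROPY):
    """Yield (start, predominant_residues, is_ambiguous) for each LCD window.

    Hypothetical helper: a window is an LCD if its entropy is <= max_entropy.
    """
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        if shannon_entropy(segment) > max_entropy:
            continue
        counts = Counter(segment)
        top = max(counts.values())
        # Every residue tied for the maximum frequency defines a category.
        predominant = sorted(aa for aa, c in counts.items() if c == top)
        yield start, predominant, len(predominant) > 1
```

For example, a 12-residue window of six alanines followed by six serines has an entropy of exactly 1 bit, so it qualifies as an LCD and is reported as ambiguous between the A and S categories.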
Fig 3. Depiction of computational strategy for relating local amino acid composition to secondary structure annotations across the PDB proteome.
For each amino acid, PDB sequences were scanned with a 12aa window size. For each peptide subsequence, the corresponding DSSP (i.e. secondary structure) annotations were sorted into bins based on the frequency of the amino acid of interest (e.g. serine, in the depicted example). After an exhaustive scan of the PDB proteome, the mean fraction of each secondary structure type was calculated within each residue count bin.
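The binning step in this caption can be illustrated with a short Python sketch. It assumes each chain is available as a pair of equal-length strings (the sequence and one DSSP label per residue); that input format and the function name are assumptions for illustration, not details given in the source. Because every window has the same length, the mean of the per-window fractions in a bin equals the pooled label count divided by the number of residues in that bin.

```python
from collections import Counter, defaultdict

WINDOW = 12  # window size stated in the caption


def bin_secondary_structure(chains, residue, window=WINDOW):
    """Mean fraction of each DSSP label per residue-count bin.

    `chains` is an iterable of (sequence, dssp) pairs of equal length, where
    `dssp` holds one secondary-structure label per residue (assumed format).
    Returns {residue_count: {label: mean_fraction}}.
    """
    totals = defaultdict(Counter)   # summed label counts per bin
    n_windows = Counter()           # number of windows per bin
    for sequence, dssp in chains:
        if len(sequence) != len(dssp):
            raise ValueError("sequence and DSSP string must align")
        for start in range(len(sequence) - window + 1):
            count = sequence[start:start + window].count(residue)
            totals[count].update(dssp[start:start + window])
            n_windows[count] += 1
    return {
        count: {label: n / (n_windows[count] * window)
                for label, n in labels.items()}
        for count, labels in totals.items()
    }
```

Calling `bin_secondary_structure(chains, "S")` returns one entry per observed serine count (0 to 12), each mapping DSSP labels to their mean fraction within that bin.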
Fig 4. Conformational tendencies are highly dependent on both amino acid type and degree of enrichment.
The PDB proteome was exhaustively scanned using a 12aa window size as depicted in Fig 3. Secondary structure tendencies across a range of enrichment for each amino acid are depicted in separate subplots. Within each subplot, scatter points represent the mean fraction of each secondary structure type across all peptide sequences within each indicated “residue count” bin. Sample sizes for all amino acids and residue count bins are indicated in S1 Fig.
Fig 5. LCD classes parsed by predominant amino acid exhibit unique structural tendency profiles.
LCDs with Shannon entropy ≤ 2.2 bits were parsed into amino acid categories based on the most frequent amino acid within each peptide subsequence. Bars indicate the mean fraction of each secondary structure type for all subsequences within each amino acid-specific LCD bin. For comparison, the mean fraction of each type of secondary structure for all subsequences across all proteins (“PDB” group) and for all peptides qualifying as LCDs combined into a single category (“LCDs” group) are also shown.
Fig 6. Computational strategy for assessing the efficacy of established secondary structure propensity scales in predicting secondary structure tendencies among LCDs and non-LCD regions.
(A) For each amino acid, the PDB proteome is scanned using a 12aa window, and all windows are parsed into either a “highly-enriched LCD” category (windows with ≥50% composition of that amino acid) or a “non-LCD” category (in this context, defined as windows with <50% composition of that amino acid). For both categories, the fraction of that amino acid annotated as α-helix and the fraction annotated as β-sheet are calculated. This procedure is repeated for all 20 canonical amino acids. (B) Then, pairwise regression analyses are performed between the fraction of residues in α-helices and each of the α-helix propensity scales. To determine how well secondary structure propensity scales apply to LCD and non-LCD regions, regression analyses are performed separately for “highly-enriched LCDs” and the “non-LCD” category. This process is repeated for the fraction of residues in β-sheets and each of the β-sheet propensity scales. In regression analyses, missing amino acids indicate that the amino acid was removed from analyses either because few of the established secondary structure propensity scales scored that amino acid or, in the case of highly-enriched LCDs, that few LCDs with ≥50% of that amino acid exist in the PDB (see Methods).
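Step (A) of this caption can be sketched as follows. The sketch makes several assumptions that the caption does not specify: DSSP labels `H` and `E` stand for α-helix and β-sheet, chains are given as aligned (sequence, DSSP) string pairs, and a residue that falls in several overlapping windows is counted once per window. The function name is hypothetical.

```python
WINDOW = 12
MIN_FRACTION = 0.5       # ">=50% composition" threshold from the caption
HELIX, SHEET = "H", "E"  # assumed DSSP labels for alpha-helix and beta-sheet


def helix_sheet_fractions(chains, residue, window=WINDOW,
                          min_fraction=MIN_FRACTION):
    """Return {category: (helix_fraction, sheet_fraction)} for one residue.

    A category with no occurrences of the residue maps to None.
    """
    # counts[category] = [residues seen, in helix, in sheet]
    counts = {"highly_enriched": [0, 0, 0], "non_lcd": [0, 0, 0]}
    for sequence, dssp in chains:
        for start in range(len(sequence) - window + 1):
            seg = sequence[start:start + window]
            labels = dssp[start:start + window]
            enriched = seg.count(residue) / window >= min_fraction
            tally = counts["highly_enriched" if enriched else "non_lcd"]
            for aa, label in zip(seg, labels):
                if aa == residue:
                    tally[0] += 1
                    tally[1] += label == HELIX
                    tally[2] += label == SHEET
    return {
        cat: (helix / total, sheet / total) if total else None
        for cat, (total, helix, sheet) in counts.items()
    }
```

Running this for each of the 20 canonical amino acids would give one helix fraction and one sheet fraction per amino acid and category, which are the quantities regressed against the propensity scales in step (B).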
Fig 7. Observed secondary structure content for some LCDs systematically deviates from secondary structure propensity scales.
For each set of highly-enriched LCDs (defined as those for which a single amino acid comprises at least 50% of the overall composition), the fractions of the defining amino acid in α-helices or in β-sheets were calculated separately and plotted against all secondary structure propensity scales in a pairwise fashion (see S4 and S5 Figs). For each amino acid, the residuals were calculated from the regression line and indicated in the box plots for α-helix propensity scales (A) and β-sheet propensity scales (B). For comparison, the same procedure was performed for each amino acid among non-LCD regions, and the resulting residuals are indicated in boxplots for α-helix propensity scales (C) and β-sheet propensity scales (D).
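The residual calculation described here amounts to an ordinary least-squares fit followed by a per-amino-acid subtraction. The sketch below assumes the propensity scale is the predictor (x) and the observed fraction is the response (y), which matches "plotted against" in the caption but is not stated explicitly; the function name is hypothetical, and amino acids missing from either input are skipped.

```python
def regression_residuals(propensity, observed):
    """Least-squares fit observed ~ propensity; return residual per amino acid.

    Both arguments map one-letter amino acid codes to values; only amino
    acids present in both are used.
    """
    shared = sorted(set(propensity) & set(observed))
    if len(shared) < 2:
        raise ValueError("need at least two shared amino acids")
    xs = [propensity[aa] for aa in shared]
    ys = [observed[aa] for aa in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("propensity values are all identical")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    # Residual = observed value minus the value predicted by the fitted line.
    return {aa: y - (slope * x + intercept)
            for aa, x, y in zip(shared, xs, ys)}
```

Repeating this for every propensity scale and collecting the residuals for a given amino acid would yield the distribution summarized by one box in the box plots.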
Top 10 significantly enriched Pfam annotations associated with each LCD category.
For each LCD class, up to 10 significantly enriched Pfam annotations (natural log of the odds ratio, lnOR, > 0.0) are listed in ascending order of Holm-Šidák corrected p-value.
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bacterial extracellular solute-binding protein (lnOR = 0.75; Adj. p = 0.0023) | | | | | | | | | |
| Metallothionein (lnOR = 6.45; Adj. p = 3.36e-13) | 2Fe-2S iron-sulfur cluster binding domain (lnOR = 4.23; Adj. p = 1.93e-09) | Peptidase family C1 propeptide (lnOR = 5.73; Adj. p = 1.06e-07) | Spider insecticidal peptide (lnOR = 6.65; Adj. p = 0.00031) | [2Fe-2S] binding domain (lnOR = 4.32; Adj. p = 0.0005) | Papain family cysteine protease (lnOR = 3.39; Adj. p = 0.0006) | Phlebovirus glycoprotein G2 (lnOR = 5.67; Adj. p = 0.0013) | Insulin-like growth factor binding protein (lnOR = 5.35; Adj. p = 0.0022) | CO dehydrogenase flavoprotein C-terminal domain (lnOR = 3.96; Adj. p = 0.027) | Aldehyde oxidase and xanthine dehydrogenase; a/b hammerhead domain (lnOR = 3.92; Adj. p = 0.028) |
| Type III restriction enzyme; res subunit (lnOR = 2.22; Adj. p = 0.0021) | | | | | | | | | |
| Elongation factor Tu domain 2 (lnOR = 1.18; Adj. p = 0.0016) | NAD-dependent DNA ligase adenylation domain (lnOR = 2.19; Adj. p = 0.047) | | | | | | | | |
| Orotidine 5'-phosphate decarboxylase / HUMPS family (lnOR = 3.03; Adj. p = 0.0013) | Influenza RNA-dependent RNA polymerase subunit PB2 (lnOR = 3.62; Adj. p = 0.019) | Eukaryotic translation initiation factor 3 subunit 8 N-terminus (lnOR = 4.7; Adj. p = 0.044) | | | | | | | |
| Cyclophilin type peptidyl-prolyl cis-trans isomerase/CLD (lnOR = 1.47; Adj. p = 2.8e-06) | Berberine and berberine like (lnOR = 1.89; Adj. p = 2.14e-05) | Pyridoxal-phosphate dependent enzyme (lnOR = 1.1; Adj. p = 3.05e-05) | Zinc-binding dehydrogenase (lnOR = 1.02; Adj. p = 0.00022) | FtsZ family; C-terminal domain (lnOR = 1.95; Adj. p = 0.0005) | Pyridine nucleotide-disulphide oxidoreductase; dimerisation domain (lnOR = 1.06; Adj. p = 0.00077) | Alcohol dehydrogenase GroES-like domain (lnOR = 0.87; Adj. p = 0.0017) | ROK family (lnOR = 1.61; Adj. p = 0.0027) | Pyridine nucleotide-disulphide oxidoreductase (lnOR = 0.77; Adj. p = 0.012) | FAD binding domain (lnOR = 1.24; Adj. p = 0.012) |
| Domain of unknown function (DUF3869) (lnOR = 6.02; Adj. p = 0.0054) | Anaphase-promoting complex subunit 4 WD40 domain (lnOR = 2.71; Adj. p = 0.029) | | | | | | | | |
| Clostridium neurotoxin; Translocation domain (lnOR = 3.28; Adj. p = 3.92e-07) | SAC3/GANP family (lnOR = 4.12; Adj. p = 0.0025) | Clostridial neurotoxin zinc protease (lnOR = 2.53; Adj. p = 0.0027) | | | | | | | |
| Fes/CIP4; and EFC/F-BAR homology domain (lnOR = 2.23; Adj. p = 0.0014) | | | | | | | | | |
| Leucine rich repeat N-terminal domain (lnOR = 1.13; Adj. p = 4.59e-06) | Leucine rich repeat N-terminal domain (lnOR = 1.52; Adj. p = 0.0059) | Leucine rich repeat C-terminal domain (lnOR = 1.29; Adj. p = 0.016) | | | | | | | |
| Signal peptide binding domain (lnOR = 4.64; Adj. p = 0.0004) | Domain of unknown function (DUF305) (lnOR = 5.4; Adj. p = 0.0042) | Septin (lnOR = 4.71; Adj. p = 0.014) | NOPS (NUC059) domain (lnOR = 4.6; Adj. p = 0.017) | Multicopper oxidase (lnOR = 3.17; Adj. p = 0.025) | Multicopper oxidase (lnOR = 2.95; Adj. p = 0.044) | | | | |
| Bacterial adhesion/invasion protein N terminal (lnOR = 3.25; Adj. p = 2.92e-08) | Clostridium neurotoxin; Translocation domain (lnOR = 3.3; Adj. p = 1.73e-05) | Clostridial neurotoxin zinc protease (lnOR = 2.93; Adj. p = 2.2e-05) | Duffy binding domain (lnOR = 3.04; Adj. p = 0.00065) | Pectate lyase superfamily protein (lnOR = 2.69; Adj. p = 0.00068) | Clostridium neurotoxin; N-terminal receptor binding (lnOR = 2.49; Adj. p = 0.011) | Alpha-2-macroglobulin MG1 domain (lnOR = 4.33; Adj. p = 0.018) | Nontoxic nonhaemagglutinin C-terminal (lnOR = 3.92; Adj. p = 0.041) | | |
| Collagen triple helix repeat (20 copies) (lnOR = 1.88; Adj. p = 0.0011) | | | | | | | | | |
| Retroviral envelope protein (lnOR = 2.18; Adj. p = 4.87e-07) | Cupin (lnOR = 2.02; Adj. p = 3.87e-05) | Fes/CIP4; and EFC/F-BAR homology domain (lnOR = 3.0; Adj. p = 0.00082) | STAT protein; all-alpha domain (lnOR = 4.05; Adj. p = 0.0017) | XPC-binding domain (lnOR = 3.7; Adj. p = 0.005) | Protease inhibitor/seed storage/LTP family (lnOR = 2.58; Adj. p = 0.037) | | | | |
| Helicase conserved C-terminal domain (lnOR = 1.47; Adj. p = 1.62e-08) | Hepatitis C virus NS3 protease (lnOR = 2.42; Adj. p = 0.0025) | CXXC zinc finger domain (lnOR = 2.36; Adj. p = 0.015) | Snurportin1 (lnOR = 3.52; Adj. p = 0.036) | | | | | | |
| Immunoglobulin C1-set domain (lnOR = 1.13; Adj. p = 0.0) | Immunoglobulin V-set domain (lnOR = 0.94; Adj. p = 0.0) | | | | | | | | |
| Immunoglobulin C1-set domain (lnOR = 0.55; Adj. p = 1.52e-07) | B domain (lnOR = 2.69; Adj. p = 5.25e-07) | Prion/Doppel alpha-helical domain (lnOR = 2.9; Adj. p = 2.86e-06) | Immunoglobulin V-set domain (lnOR = 0.4; Adj. p = 0.00061) | Gram-positive pilin backbone subunit 2; Cna-B-like domain (lnOR = 2.59; Adj. p = 0.00074) | Gram-positive pilin subunit D1; N-terminal (lnOR = 2.73; Adj. p = 0.0017) | Urease alpha-subunit; N-terminal domain (lnOR = 1.87; Adj. p = 0.026) | Mur ligase family; catalytic domain (lnOR = 2.02; Adj. p = 0.027) | Mur ligase family; glutamate ligase domain (lnOR = 2.02; Adj. p = 0.027) | Glycosyl hydrolase family 7 (lnOR = 2.21; Adj. p = 0.03) |
| Alcohol dehydrogenase GroES-like domain (lnOR = 1.41; Adj. p = 5.44e-08) | Zinc-binding dehydrogenase (lnOR = 1.31; Adj. p = 0.00016) | Subtilase family (lnOR = 1.44; Adj. p = 0.0034) | Zinc-binding dehydrogenase (lnOR = 1.52; Adj. p = 0.011) | | | | | | |
| Cellulase (glycosyl hydrolase family 5) (lnOR = 4.6; Adj. p = 3.38e-06) | Reverse transcriptase connection domain (lnOR = 6.66; Adj. p = 7.48e-05) | Reverse transcriptase thumb domain (lnOR = 6.48; Adj. p = 9.57e-05) | RNase H (lnOR = 5.6; Adj. p = 0.00045) | Reverse transcriptase (RNA-dependent DNA polymerase) (lnOR = 5.6; Adj. p = 0.00045) | Domain of unknown function (DUF4136) (lnOR = 7.17; Adj. p = 0.011) | Domain of unknown function (DUF1957) (lnOR = 6.66; Adj. p = 0.015) | Glycosyl hydrolase family 57 (lnOR = 5.87; Adj. p = 0.026) | Sortilin; neurotensin receptor 3; (lnOR = 5.71; Adj. p = 0.027) | BNR repeat-like domain (lnOR = 5.71; Adj. p = 0.027) |
| Scavenger mRNA decapping enzyme C-term binding (lnOR = 4.23; Adj. p = 0.0039) | Scavenger mRNA decapping enzyme (DcpS) N-terminal (lnOR = 4.23; Adj. p = 0.0039) | Peptidase family M3 (lnOR = 4.0; Adj. p = 0.0068) | Phospholipase A2 (lnOR = 3.34; Adj. p = 0.039) | | | | | | |
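The two statistics reported in each cell, lnOR and the Holm-Šidák adjusted p-value, can be illustrated with a small Python sketch. The 2x2 table layout and both function names are assumptions for illustration; the source does not describe how the contingency tables were built, which test produced the raw p-values, or how zero cells were handled (this sketch does not handle them).

```python
import math


def ln_odds_ratio(a, b, c, d):
    """Natural log of the odds ratio for a 2x2 table [[a, b], [c, d]].

    a: LCD-containing proteins with the annotation, b: LCD-containing without,
    c: other proteins with the annotation, d: other proteins without
    (an assumed layout).
    """
    return math.log((a * d) / (b * c))


def holm_sidak(p_values):
    """Holm-Sidak step-down adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Sidak correction for the (m - rank) hypotheses still in play.
        candidate = 1.0 - (1.0 - p_values[idx]) ** (m - rank)
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = min(running_max, 1.0)
    return adjusted
```

An lnOR above 0 means the annotation is more common among proteins containing that LCD class than among the remaining proteins, which is the enrichment criterion stated in the table caption.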