Literature DB >> 26047845

Polymorphism Analysis Reveals Reduced Negative Selection and Elevated Rate of Insertions and Deletions in Intrinsically Disordered Protein Regions.

Tahsin Khan¹, Gavin M Douglas², Priyenbhai Patel¹, Alex N Nguyen Ba¹, Alan M Moses³.

Abstract

Intrinsically disordered protein regions are abundant in eukaryotic proteins and lack stable tertiary structures and enzymatic functions. Previous studies of disordered region evolution based on interspecific alignments have revealed an increased propensity for indels and rapid rates of amino acid substitution. How disordered regions are maintained at high abundance in the proteome and across taxa, despite apparently weak evolutionary constraints, remains unclear. Here, we use single nucleotide and indel polymorphism data in yeast and human populations to survey the population variation within disordered regions. First, we show that single nucleotide polymorphisms in disordered regions are under weaker negative selection compared with more structured protein regions and have a higher proportion of neutral non-synonymous sites. We also confirm previous findings that nonframeshifting indels are much more abundant in disordered regions relative to structured regions. We find that the rate of nonframeshifting indel polymorphism in intrinsically disordered regions resembles that of noncoding DNA and pseudogenes, and that large indels segregate in disordered regions in the human population. Our survey of polymorphism confirms patterns of evolution in disordered regions inferred based on longer evolutionary comparisons.

Entities: Chemical Disease Gene Mutation Species

Keywords: SNP; indel

Mesh：

Substances：

Year: 2015 PMID： 26047845 PMCID： PMC4494057 DOI： 10.1093/gbe/evv105

Source DB: PubMed Journal: Genome Biol Evol ISSN： 1759-6653 Impact factor: 3.416

Introduction

It is widely accepted that the three-dimensional structure of a protein determines its molecular structure and affects its biological function (Alberts et al. 2002; Berg et al. 2002). However, growing evidence suggests that protein regions without rigid, tertiary, three-dimensional structures are prevalent in the eukaryotic proteome (Iakoucheva et al. 2002, 2004; Ward et al. 2004; Galea et al. 2006, 2009). These so-called “intrinsically disordered regions” have different amino acid compositions than globular proteins (Uversky et al. 2000; Singh et al. 2006; Theillet et al. 2013) and are crucial for signaling and protein–protein interactions (Ren et al. 2008; Garza et al. 2009). Disordered regions have been of recent research interest and approximately 30% of human proteins contain disordered regions (Ward et al. 2004), including many disease-associated proteins (Uversky et al. 2008). Despite the abundance of these regions, the amino acid sequences of most disordered regions are poorly conserved across taxa (Daughdrill et al. 2007). Therefore, it remains unclear how these disordered regions have been maintained through evolutionary time and why they are so predominant in eukaryotic proteins. The amino acid composition of disordered regions is significantly different from random amino acids expected based on the genetic code, which suggests that these regions are unlikely to be entirely randomly evolving “junk” (Szalkowski and Anisimova 2011). Preceding investigations have revealed increased rates of amino acid substitutions (Brown et al. 2002), differences in patterns of substitution (compared with typical ordered protein) (Brown et al. 2010), and increased rates of insertion and deletion (indels) in disordered regions (de la Chaux et al. 2007; Light, Sagit, Ekman, et al. 2013). Furthermore, indels in disordered regions are under weaker negative selection relative to structured protein regions, and a large degree of variation in protein length is attributable to disordered regions (Light, Sagit, Sachenkova, et al. 2013). Indeed, disordered regions seem to have particularly weak constraints on indels, as the ratio of substitutions to indels in disordered regions is dramatically lower (Toth-Petroczy and Tawfik 2013). All these patterns are consistent with weak constraints on disordered regions compared with globular proteins. On the other hand, a recent study has identified about 5% of residues in yeast disordered regions as short stretches of conserved amino acids, a subset of which are confirmed functional motifs, although most have still not been examined in detail (Nguyen Ba et al. 2012). It has also been suggested that some disordered regions are conserved over long evolutionary distances even when the specific amino acids are not (Chen et al. 2006; Daughdrill et al. 2007; Toth-Petroczy et al. 2008) and that there is a subset of disordered regions that are conserved at the amino acid level (Bellay et al. 2011; Colak et al. 2013). Furthermore, one study suggested that disordered regions are the primary targets of positive selection in yeast and that disordered regions in Pfam domains contain similar numbers of functional sites as other protein regions (Nilsson et al. 2011). These observations point to the possibility of strong functional constraints on disordered regions. Most studies that estimated evolutionary constraint on disordered regions have relied on interspecies comparisons using pairwise alignments of disordered regions across species. Because of their apparent rapid rate of evolution (Brown et al. 2002), protein sequences of disordered regions are difficult to align accurately, particularly at long evolutionary distances, which limits the power of classical comparative molecular evolutionary approaches to study these regions. Here, we take advantage of population genomics data to circumvent potential alignment issues to confirm the previous observations, and quantify rates of polymorphism and test for evidence of selection based on site frequency spectra (SFSs). Using single nucleotide polymorphisms (SNPs) from human and yeast populations, and polarized distribution of fitness effect (DFE) analyses, we validate previous reports that amino acid replacements are under negative selection in yeast and humans, although disordered regions appear to be under moderately weaker constraints of selection compared with folding protein regions. We also find up to an order of magnitude increase in the rate of nonframeshifting insertion–deletion (indel) polymorphism in disordered regions, which approaches the rate in noncoding DNA or pseudogenes. We identify examples of large indels in disordered regions segregating in the human population. Our findings also suggest that nonframeshifting indels in disordered regions are largely neutral.

Materials and Methods

Predictions of Disordered Regions and Pfam Domains

Protein coding regions for yeast were obtained from SGD (Cherry et al. 2012) and for human from Ensembl (v62; Flicek et al. 2013). For human proteins, the longest splice form was used. Disordered regions were predicted using DISOPRED v3 (Jones and Cozzetto 2015) with default settings using the UniRef90 database (http://www.uniprot.org/help/, last accessed May 14, 2010; UniProt Consortium 2014) that had been filtered for repetitive regions, although we also performed the analysis using DISOPRED v2 (Ward et al 2004) and found similar results. For yeast, 28,560 disordered regions were predicted, and for human, 409,044 disordered regions were predicted in all protein isoforms (multiple transcripts per gene). Pfam domains were predicted by running HMMer 3.0 (http://hmmer.janelia.org/, last accessed May 14, 2010; Finn et al. 2011) (using default settings) on PFAM v.24 (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam24.0/relnotes.txt, last accessed February 22, 2010; Finn et al. 2008) on the yeast and human proteomes with the E-value of 0.001 as threshold. For both yeast and human, Pfam domains were required to have at least ten occurrences in the proteome, to ensure that they were likely to be independently folding domains, and not highly conserved entire proteins. This left 3,258 (of 8,482) domains in yeast and 27,264 (of 36,561) domains in human. We also excluded from the analysis polymorphisms in Pfam domains that were predicted to be disordered, as Pfam domains have been reported to contain disordered regions (Williams et al. 2013).

Determination of Pa/Ps Ratios

A total of 99,656 protein coding yeast SNPs were obtained from SGRP (Bergstrom et al. 2014) and 427,282 human SNPs were obtained from the 1000 Genomes Project (1000 Genomes Project Consortium 2012) website from the phase 1 Release v3, which is based on the GENCODE v7 (Derrien et al. 2012) annotations of genes and pseudogenes. The total number of SNPs in each region in humans includes the following: 132,661 in disordered regions, 188,393 in other ordered regions, and 100,337 in Pfam domains. Rates of amino acid (nonsynonymous) polymorphism were computed by dividing the total number of amino acid changing (nonsynonymous) polymorphisms by the total number of amino acid changing (synonymous) sites (calculated using the method of Nei and Gojoburi 1986).

Analysis of Yeast Protein Expression Levels

Yeast protein expression levels were obtained from PaxDB (Wang et al. 2012). Proteins were binned by protein expression levels and Pa/Ps ratios were calculated as above. Polymorphisms in proteins that had no expression information were excluded from the analysis. P-value for the figure 2 was determined using t-approximation for the distribution of the correlation coefficient. A total of 96,373 yeast SNPs were found in the regions of interest and more than 95% of yeast genes with SNPs had protein expression data from PaxDB (http://pax-db.org/#!species/4932, last accessed April 14, 2010).

— Nonsynonymous to synonymous polymorphism ratios in disordered regions, Pfam domains, and other ordered regions in the Saccharomyces cerevisiae and human genomes. (A) The Pa/Ps ratio is highest in disordered regions (white) and lowest in Pfam domains (black) in S. cerevisiae and human proteins. (B) Pa/Ps ratios computed for each gene show a wide distribution, and there is significant overlap between the disordered regions (unfilled squares) and the more structured regions (unfilled triangles and filled circles).

Yeast Derived Allele Frequency Spectra

The ancestral state of each SNP for the determination of derived allele frequencies (DAFs) was inferred by using Saccharomyces paradoxus as the reference outgroup. The SNPs were assorted according to their predicted region in S. cerevisiae (disordered regions, other ordered regions, Pfam domains). The assorted data can be downloaded from http://www.moseslab.csb.utoronto.ca/alan/snps_diso_pfam.txt. Finally, the SNPs in each protein region were assorted into 10 bins (0–0.9) according to their DAF values.

Human Site Frequency Spectra

SNPs from the Yoruba (YRI) population of the 1000 genomes project were used to estimate the DFE, while the full data set was used to analyze Pa/Ps ratios and indel polymorphisms (1000 Genomes Project Consortium 2012). Relatives of the third order (1st cousins) and closer were removed from the DFE analysis, leaving 105 YRI individuals. Inferred ancestral alleles reported in this data set were used to determine the DAF for each SNP.

Distribution of Fitness Effects

DFE-alpha v2.03 (Keightley and Eyre-Walker 2007) was implemented to estimate the distribution of effects of new mutations in focal protein regions, with the default options (Keightley and Eyre-Walker 2012) and custom scripts courtesy of Dan Halligan. We used the input folded SFS from the YRI data set described above and sites divergent with macaque for each focal site type and region were counted according to the Enredo-Pecan-Ortheus 6 primate alignment. This method assumes independence between sites (i.e., no linkage disequilibrium). This assumption was tested by dividing the genome into 50-kb windows and generating 200 bootstrap replicates for each site category, which gives a sense of the estimate error due to nonindependence. Significance between site categories was determined through a randomization test as described in Keightley and Eyre-Walker (2007). The total number of sites used in the DFE analysis is 5,820,297 (Pfam domains), 7,202,040 (disordered regions), and 9,167,914 (other ordered regions).

Indel Polymorphisms

Indel polymorphisms were obtained from SGRP (Bergstrom et al. 2014) and the 1000 genomes project (1000 Genomes Project Consortium 2012). Dubious genes were excluded from the analysis. The total number of S. cerevisiae indels (≤40 bp) is 415, 48, 800, and 8202 for disordered regions, Pfam domains, other ordered regions, and noncoding regions, respectively. Moreover, the number of human indels (≤30 bp) is 534, 125, 320, and 15,601 for disordered regions, Pfam domains, other ordered regions, and pseudogenes, respectively. To obtain a sufficient sample size, indels in pseudogene introns were included in the “pseudogene” indel set. Although introns are not expected to have similar patterns of indels to bona fide protein coding genes, because the idea of the psuedogene set was to have a set of indel polymorphisms that reflects the mutation spectrum with as few constraints as possible, including introns was acceptable, as they are expected to have few constraints as well. To compute the frequency of indel polymorphisms in each region, indels of each size were reported per kilobase pair of nucleotides in the corresponding region. For analysis of indel positions in protein coding genes, intron-containing genes were not included.

DNA and Protein Sequence Alignments

Vertebrate coding sequence alignments were downloaded from the UCSC genome browser (dbSNP 138) as subsets of the Multiz 100 vertebrate multiple sequence alignment. The UCSC versions of the sequence alignments of each species include the following: hg19 (human), canfam3 (dog), nomLeu3 (gibbon), gorGor3 (gorilla), mm10 (mouse), rheMac3 (rhesus), ponAbe2 (orangutan), calJac3 (marmoset), saiBol1 (squirrel monkey), and panTro4 (chimp) (Kent et al. 2002). The dbSNP (Sherry et al. 2001) IDs for interferon regulator factor 5 (IRF5) and glutamate receptor 3B (GRIN3B) are rs199508964 and rs142516571, respectively. The allele frequencies were also obtained from the UCSC genome browser (Kent et al. 2002).

Simulations of Molecular Evolution

To test our ability to estimate protein substitution rates from sequence alignments as a function of indel to substitution ratio, we used indelible (Fletcher and Yang 2009) to simulate 100 amino acid proteins randomly generated from the amino acid frequencies found in yeast disordered regions under the WAG model with the default indel model. We let the ancestral protein evolve into two extant proteins at the evolutionary distance indicated and realigned them using MAFFT (Katoh et al. 2002). We estimated rates of evolution using AAML for pairwise comparison from the PAML package (Yang 2007) assuming WAG evolution and filtering gapped columns on both the true alignments (obtained from indelible) and the MAFFT alignments. For each evolutionary distance, we did 100 replicates.

Results

Single Nucleotide Polymorphisms in Disordered Regions Show Evidence for Negative Selection

To compare patterns of evolution in disordered regions with those exhibited by structured protein regions, we divided the protein coding regions of S. cerevisiae into three parts (see Materials and Methods): Disordered regions, conserved protein domains (hereafter referred to as “Pfam domains”), and all other proteins regions, which we expect to consist largely of globular, “ordered” proteins. Although computational predictions of disordered regions include errors, we expect them to be strongly enriched for disordered regions. Errors in prediction will lead our estimates of the differences between types of proteins to be conservative. To exclude regions of disorder within Pfam domains (Williams et al. 2013), we excluded from our analysis ∼5% of residues that were predicted to be disordered within the Pfam domains. To determine the relative importance of natural selection in driving amino acid alterations within eukaryotic proteins, we determined the ratio of nonsynonymous (amino acid changing) to synonymous polymorphisms (Pa/Ps) in the S. cerevisiae genome and observed Pa/Ps values of 0.24, 0.090, and 0.13 for disordered regions, Pfam domains, and other ordered regions, respectively (fig. 2A). We also determined Pa/Ps ratios and indel polymorphism rates in humans, using data from the 1000 genomes project (1000 Genomes Project Consortium 2012). The Pa/Ps ratios in disordered regions, Pfam domains, and other ordered regions were 0.48, 0.36, and 0.40, respectively (fig. 1A). As expected, disordered regions had the highest Pa/Ps ratio, while Pfam domains showed the lowest value, a trend that is consistent with our findings in the S. cerevisiae genome. Because all these ratios are much lower than 1, it can be inferred that all three regions are under strong negative selection, albeit weaker in disordered regions relative to structured protein regions. To get a sense of the variability, we also computed these ratios for each gene and plotted the distribution (fig. 1B). The results were consistent with the pooled result above.

— Polymorphism patterns and expression levels in yeast. (A) The Pa/Ps ratio decreases for all three regions with increased protein expression levels. The difference in polymorphism ratio between the three regions remains approximately constant, independent of protein expression level. Disordered regions (white) show higher Pa/Ps ratios across the whole range of expression levels compared with Pfam domains (black) or other protein regions (white). Protein abundance estimates were measured in ppm (parts per million). (B) Similar results are found for the per gene estimates of the Pa/Ps ratio. Error bars represent three times the standard error of the mean. One of the strongest predictors of protein evolutionary rate is the expression level of the protein, due in part to stronger natural selection to reduce the negative consequences of protein misfolding in highly expressed proteins (Drummond et al. 2005). Hence, we sought to rule out the possibility that the differences observed were due to differences in expression levels between the different types of proteins regions by determining Pa/Ps ratios as a function of protein expression levels (fig. 2A). This analysis revealed consistent differences in Pa/Ps values between disordered regions, Pfam domains, and other ordered regions at all expression levels. Interestingly, a clear negative correlation (R2 = 0.73, P = 0.002) was observed for disordered regions. Although disordered regions might not be expected to show a correlation between the rate of evolution and expression level under the model of selection against protein misfolding (Drummond et al. 2005), there are several reasons why this might be expected, such as spurious protein–protein interactions (Yang et al. 2012), amyloid formation (Knowles et al. 2014), and folding upon target binding (Love et al. 1995; Bowers et al. 1999; Young et al. 2000). As above, to get an idea of the variability in these values, we averaged the per gene Pa/Ps ratios and plotted the mean and three times the standard error (fig. 2B). Once again, we found that disordered regions have higher Pa/Ps ratios across the whole range of expression levels. Taken together, these results indicate that reduced constraints in disordered regions are not likely to be due to overall expression differences between ordered and disordered protein regions.

— Frequency spectra of single nucleotide polymorphisms in the S. cerevisiae and human genomes. Nonsynonymous and synonymous SNPs are represented by black and white bars, respectively. (A) Allele frequencies of SNPs in yeast regions are all skewed toward lower frequencies. (B) Site frequency of spectra of nonsynonymous and synonymous SNPs in human is also skewed to the right. Only the first 10 bins of the spectra are shown for illustrative purposes. Because inferences of selection pressures based on nonsynonymous to synonymous substitution ratios from polymorphism data may be less sensitive (Kryazhimskiy and Plotkin 2008), we analyzed DAF spectra in yeast (fig. 3A), which is an alternative method to infer selection (Nielsen 2005). The DAF spectra of nonsynonymous SNPs for all three regions were skewed toward low frequency SNPs relative to synonymous sites, which support the finding that all three regions are under negative selection, albeit weaker in disordered regions. The same trend is also observed for SNPs in humans (fig. 3B). It is important to note that population structure can also influence the DAF spectra, which we believe accounts for the minor peaks between allele frequencies of 0.5 and 0.7 (fig. 3A). Regardless, our conclusions based on comparisons between classes of SNPs in the genome remain unaffected because these factors are expected to equally influence all three regions. — Distribution of fitness effects of nonsynonymous sites in disordered regions (white), other ordered regions (light gray), and Pfam domains (dark gray) in humans. Synonymous sites within each region were used as the neutral reference. Error bars correspond to bootstrapped (n = 200) 95% confidence interval. *,** indicate that the proportion of sites in disordered regions is significantly different from other ordered regions at P < 0.05 and P < 0.005, respectively, based on a randomization test of the bootstrap replicates. Alleles at low frequencies are representative of new, random mutations, which are most likely to be deleterious, while alleles at higher frequencies reflect mutations that segregate and persist in the population either due to genetic drift or due to positive selection. As such, we analyzed the behavior of nonsynonymous to synonymous mutation ratios as a function of the DAFs of the respective SNPs (supplementary fig. S1, Supplementary Material online). For SNPs with DAF < 10%, there were 1.36, 0.78, and 0.57 amino acid changing polymorphisms for every synonymous one in disordered regions, other ordered regions, and Pfam domains, respectively. In contrast, this ratio respectively decreased to 0.77, 0.39, and 0.25 for SNPs with DAF > 10% (high frequency). Assuming that the differences between these ratios are due to the removal of deleterious mutations by natural selection, we determined that 43% (1−0.77/1.36) of the nsSNPs (non-synonymous single nucleotides) (n = 4,277) in disordered regions with DAFs < 10% were deleterious. This fraction increased to 51% (1 − 0.39/0.78) in other ordered regions (n = 7,242) and 56% (1 − 0.25/0.57) in Pfam domains (n = 959). As expected, this suggests that disordered regions are more tolerant of new mutations compared with Pfam domains and other ordered regions. Taken together, we can infer that disordered regions evolve under similar, albeit weaker constraints relative to structured protein regions.

Increased Proportion of Nearly Neutral Sites in Disordered Regions Indicates Weaker Negative Selection

To assess whether negative selection is relaxed in disordered regions relative to structured regions in humans, the DFE of new mutations in these regions was estimated using DFE-alpha (Keightley and Eyre-Walker 2007), which models demographic changes explicitly, unlike Pa/Ps ratios. This approach uses the SFS to infer potential fitness consequences of new point mutations in particular regions. The SFS of SNPs segregating in 105 unrelated individuals of the Yoruba (YRI) population of the 1000 genomes project (1000 Genomes Project Consortium 2012) was used for this analysis. By comparing the SFSs of nonsynonymous SNPs with synonymous SNPs, the strength of negative selection acting upon nonsynonymous changes can be estimated. The DFE of nonsynonymous mutations is shown in figure 4 in terms of Nes, a measure of the efficacy of negative selection, where Ne is the effective population size and s is the selective coefficient. If their product Nes is <1, mutations segregate like neutral mutations. The DFE estimated for nonsynonymous mutations within all regions (fig. 4) is consistent with previous reports for humans (Haerty and Ponting 2013) and other mammals (Halligan et al. 2010). If only about 5% of disordered regions, representing short linear motifs (Nguyen Ba et al. 2012), were under negative selection, then 95% of sites would be found in the nearly neutral bin (Nes < 1). This is clearly not the case, because only 24% of nonsynonymous sites were identified as nearly neutral. This finding is instead consistent with moderately weaker negative selection acting in these regions compared with structured regions with the majority of nonsynonymous changes still being selected against.

— Distribution of fitness effects of nonsynonymous sites in disordered regions (white), other ordered regions (light gray), and Pfam domains (dark gray) in humans. Synonymous sites within each region were used as the neutral reference. Error bars correspond to bootstrapped (n = 200) 95% confidence interval. *,** indicate that the proportion of sites in disordered regions is significantly different from other ordered regions at P < 0.05 and P < 0.005, respectively, based on a randomization test of the bootstrap replicates.

— Boxplots of the fraction of nearly neutral sites where biochemical changes can occur. Disordered regions show significantly greater fraction of nearly neutral sites for positions that change disordered to ordered amino acids (A), as well as ordered to disordered (B). Distribution is based on bootstrapped replicates as described in Materials and Methods. O and D refer to amino acids found at relatively higher proportions in ordered and disordered regions, respectively. To test whether these patterns were caused by the specific biochemical properties of disordered regions, or simply due to overall relaxed selection in disordered regions, the DFE of nonsynonymous substitutions to amino acids that are more or less predominant in particular regions was also assessed. Disordered regions contain proportionately more charged and Proline residues and have a smaller proportion of hydrophobic residues compared with globular protein regions (Xie et al. 1998; Uversky et al. 2000; Singh et al. 2006; Theillet et al. 2013). On the basis of these observations, we chose L, I, V, F, and Y to represent amino acids that are more abundant in ordered regions (“O” residues) and we chose D, E, P, S, and N to represent amino acids more common in disordered regions (“D” residues). Nonsynonymous substitutions involving amino acids in these two sets were polarized and classified based on whether they changed the residue from “O” to “D” or not. For example, a nonsynonymous change from L to P was classified as O→D, a change from an ordered residue to a disordered residue. Each invariable site was counted as a different polarized class based on what substitution would be caused by every possible point mutation. As expected, a higher proportion of substitutions causing O→O and D→D changes are effectively neutral, meaning they are under less negative selection on average, than either O→D and D→O changes, respectively, over all regions (supplementary fig. S2, Supplementary Material online). In other words, changes between these residue groupings are less likely to be detrimental to fitness than changes across the groupings, as proteins tend to preserve the biochemical type of their amino acids. We sought to test whether the efficacy of selection to retain the biochemical type of amino acid was similar in ordered regions and disordered regions. To do so, we computed the fraction of the total nearly neutral sites (Nes < 1) that change the biochemical type. For example, is the fraction of neutral biochemically changing sites for the “O” residues. In defining this fraction, F, we are controlling for the total number of neutral sites, which we found above to be higher in disordered regions (fig. 4). We find that the fraction of neutral biochemically changing sites is higher in disordered regions than in other more structured regions (both F, fig. 6A, and F, fig. 6B; Wilcoxon test, P < 10−6). This confirms that negative selection acts to preserve these biochemical types of residues more strongly within ordered regions, as opposed to simply being stronger proportionately over all residues.

— Frequency of insertions/deletions per kilobase pair (Kb) in disordered regions, Pfam domains, other ordered regions, and noncoding DNA in S. cerevisiae and humans. (A) The number of indel polymorphisms per Kb as a function of indel size for disordered regions (unfilled squares), other protein regions (gray squares), Pfam domains (filled squares) and Non-coding DNA (unfilled triangles). (B) Gray bars represent indels that were observed in multiples of three (nonframeshifting indels). White bars represent indels that were not found as multiples of three (frameshifting indels). The frequency of frameshifting indels (Diso) in disordered regions is similar to that of Pfam domains (Pfam). The frequency of nonframeshifting indels in disordered regions is similar to that of noncoding DNA (NonC), and much higher than the frequency of frameshifting indels. (C, D) Indels in the human genome display similar patterns as in yeast. (E) The fraction of out of frame indels is lower in proteins than expected based on noncoding regions (dashed line), but is higher at the termini of proteins, consistent with reduced selection on indels in the termini. This effect does not explain the difference between disordered regions (unfilled squares) and other protein regions (gray squares).

Disordered Regions Show Greater Indel Polymorphism than Ordered Regions

Given that disordered regions have been predicted to be locations of increased rates of insertions and deletions (Toth-Petroczy and Tawfik 2013), we sought to test for differences in the abundance of indel polymorphism in disordered regions in comparison with structured protein regions. We computed the rate of indel polymorphisms in disordered regions, Pfam domains, other ordered regions, and noncoding DNA in S. cerevisiae and in humans (see Materials and Methods). As expected, we find periodic variation (multiples of three) in the rate of indels in protein coding regions of all types. However, we observed a greater than 10-fold increase in the rate of nonframeshifting (in frame) indels in disordered regions compared with Pfam domains (fig. 6A). Interestingly, frameshifting indels in disordered regions are similar in rate to other protein coding regions (fig. 6B), but the rate of nonframeshifting indels is similar between disordered regions and indels sizes that are multiples of three in noncoding DNA. To illustrate the remarkable difference in rates between disordered regions and Pfam domains, we note that the rate of nonframeshifting (in frame) indels in yeast disordered regions is approximately 8-fold higher than frameshifting indels, whereas in Pfam domains, the rate of nonframeshifting (in frame) indels is less than half of that of frameshifting indels. The indel polymorphism patterns in human protein coding regions also show a periodic pattern as expected (fig. 6C). As in yeast, we observed that nonframeshifting indels in disordered regions are much more frequent relative to Pfam domains and other ordered regions, and show a similar rate to indels in pseudogenes (and pseudogene introns) that are multiples of three. As with the yeast indels, the differences between regions are dramatic: The rate of nonframeshifting (in frame) indels in disordered regions is 2.8 times higher than frameshifting indels, while the frequency of nonframeshifting indels in Pfam domains is less than one half of that of frameshifting indels. Moreover, the frequency of nonframeshifting (in frame) indels in disordered regions is approximately 9-fold higher than that of nonframeshifting indels in Pfam domains (fig. 6D). Collectively, our analyses on human indel polymorphisms show similar trends to those observed in S. cerevisiae, suggesting that the observed patterns may be universal to eukaryotic proteins. The polymorphism frequency for nonframeshifting (in frame) indels, which reaches the frequency expected based on noncoding sequences, suggests that they are under much weaker constraints in disordered regions than in other protein regions. — Frequency of insertions/deletions per kilobase pair (Kb) in disordered regions, Pfam domains, other ordered regions, and noncoding DNA in S. cerevisiae and humans. (A) The number of indel polymorphisms per Kb as a function of indel size for disordered regions (unfilled squares), other protein regions (gray squares), Pfam domains (filled squares) and Non-coding DNA (unfilled triangles). (B) Gray bars represent indels that were observed in multiples of three (nonframeshifting indels). White bars represent indels that were not found as multiples of three (frameshifting indels). The frequency of frameshifting indels (Diso) in disordered regions is similar to that of Pfam domains (Pfam). The frequency of nonframeshifting indels in disordered regions is similar to that of noncoding DNA (NonC), and much higher than the frequency of frameshifting indels. (C, D) Indels in the human genome display similar patterns as in yeast. (E) The fraction of out of frame indels is lower in proteins than expected based on noncoding regions (dashed line), but is higher at the termini of proteins, consistent with reduced selection on indels in the termini. This effect does not explain the difference between disordered regions (unfilled squares) and other protein regions (gray squares). Segregating frameshift causing indels in the yeast population are enriched near the C-termini of proteins (Liti et al. 2009), presumably because frameshifts at the C-terminus are less likely to disrupt protein function, and may simply result in the addition of additional residues and a new stop codon. We sought to rule out the possibility that the elevated rate of indels we observed in disordered regions was related to the enrichment of disordered proteins near the termini of proteins. We computed the fraction of indels that are out of frame as a function of position along the gene for indels that fall in disordered versus other protein regions (there were too few indels in Pfam domains to compute these fractions reliably across the protein length). We found that both disordered and other ordered protein regions show higher proportions of frameshifting indels at both N- and C-termini, probably due to the availability of alternative start and stop codons for many genes. Nevertheless, the proportion of frameshifting indels is dramatically lower for disordered regions across the whole length of the protein (fig. 6E). This is due to the elevated rate of in-frame indel polymorphism reported above, and consistent with the model that disordered regions are much more tolerant of in-frame indels, irrespective of the position in which these indels occur in protein. The dramatic increase in rate of nonframeshifting indel polymorphism suggests that most of the large protein coding indels segregating in the human population will be found in disordered protein regions. In figure 7, we show examples of large indels segregating at high frequency in two important human proteins, IRF5 (Fan et al. 2010) and GRIN3B (Niemann et al. 2008). In the case of IRF5, an insertion seems to have appeared in the human–chimp ancestor and reached a frequency of 54.6% in the overall 1000 Genomes population. This region is not of low complexity, but repeating codons could be increasing the region’s propensity for indels. Different length indels of similar sequence in orangutan, marmoset, and squirrel monkey support this idea. Interestingly, the orangutan genome appears to contain a similar, albeit independent, insertion in this region. In the case of GRIN3B, the deletion likely represents the derived state, and removes nine amino acids in around 16% of the 1000 genomes population. These examples also illustrate the difficulty in properly aligning rapidly evolving disordered regions over long evolutionary distances.

— Multiple sequence alignments of indel and subset of Multiz 100 vertebrate alignment. For each species, the first row represents DNA sequence alignments, while the second row of letters represents the amino acid sequence alignments corresponding to each codon. The black boxes highlight the region of insertion/deletion. The percentage value corresponding to each human protein alignment represents the frequency of the respective allele. (A) Alignment of the insertion TLQPPTLRPP (ten amino acids) in the IRF5 protein in humans. (B) Alignment of the deletion APAEAPPHS (nine amino acids) in the GRIN3B (Glutamate receptor) protein in humans. We considered whether the dramatic increase in indel rates we observed in disordered regions could cause the difficulty in aligning disordered regions, and speculated that errors in alignment could explain the rapid rates of substitutions observed in disordered regions over long evolutionary distances. We tested this hypothesis using simulations of molecular evolution under standard models of molecular evolution (see Materials and Methods), but we found that at the range of indel to substitution ratios consistent with the yeast polymorphism data (0.05–0.1), estimates of substitution rate are still accurate to long evolutionary distances (supplementary fig. S3, Supplementary Material online).

Discussion

We used two population-based methods to study the strength of selection on intrinsically disordered protein regions. Both Pa/Ps ratios and the DFE support weaker negative selection on substitutions in disordered regions. Prior analysis using a sequence alignment approach coincides with this interpretation (Brown et al. 2002). Using the DFE approach we also exploited polarized nonsynonymous changes to compare changes from amino acid residues that are more predominant in ordered regions (“O” residues) with those that are more predominant in disordered regions (“D” residues) and vice versa. While there are more nearly neutral potential sites for all these “biochemical” changes in disordered regions relative to other regions, the magnitude of increase in nearly neutral sites is not equal across the different categories of changes. This finding suggests not only overall weaker negative selection in disordered regions, but also subsets of nonsynonymous changes having fundamentally different fitness consequences in disordered regions compared with more structured regions. Switching states between “D” and “O” amino acids could be potentially more deleterious in structured regions, because they would more likely disrupt secondary structures of the peptide sequence. In contrast, the observation that switching residues in disordered regions is less deleterious compared with structured regions suggests that the composition of residues is more important for the functions of disordered regions, rather than their primary sequence. This is in accordance with previous evidence that the type of residue found in a disordered region is more strongly conserved than the amino acid sequence in that region (Moesa et al. 2012). Although disordered regions showed reduced constraints on amino acid changing polymorphisms relative to structured proteins, the rate of amino acid polymorphism was not dramatically different. In contrast, although rates of frameshifting mutations in disordered regions and Pfam domains are similar, nonframeshifting indels in disordered regions are much more abundant, approaching the frequencies observed in pseudogenes and noncoding DNA. This is consistent with other recent studies that associate indels with disordered regions (de la Chaux et al. 2007; Light, Sagit, Ekman, et al. 2013; Light, Sagit, Sachenkova, et al. 2013; Toth-Petroczy and Tawfik 2013). We suggest that segregating nonframeshifting indels in disordered regions are largely neutral, allowing these regions to change their size with minimal functional consequences. This inherent plasticity of disordered regions, which is not evident in ordered regions, is consistent with models suggesting that the lack of a stable protein conformation might be functionally advantageous by allowing the flexibility to interact with many different targets (Dyson and Wright 2002, 2005). Although the consistency of these observations with models of disordered region function is appealing, it is also possible that mutational biases may play a role in preferentially producing nonframeshifting indels in disordered regions (due, e.g., to repetitive or low-complexity amino acid sequences; fig. 6A). If natural selection is not strong enough to counteract this elevated mutation rate, increased rates of indel polymorphism and divergence are expected under this mutation bias model as well. Although preceding investigations have suggested the former, recent systematic analysis of selection on protein coding indels suggests complex interactions between selection and mutation (Chong et al. 2013). We speculated that poor quality alignments and large rates of substitution typically observed in disordered regions might be due simply to the increased rate of indels. However, standard simulations of molecular evolution indicate that elevated indel rates alone are insufficient to bias the estimation of evolutionary rates. In the context of more realistic mutation processes (e.g., indels that are the result of small duplications, rather than random amino acids, and substitution models that yield the equilibrium distribution of disordered regions), alignment errors could still account for some of the patterns of evolution observed in disordered regions. Nevertheless, it is reassuring that our study of polymorphism and simulations of molecular evolution generally corroborate previous work based on more distant evolutionary comparisons. While our SNP and indel polymorphism analyses of human and yeast proteins identified trends that are consistent across both species, the absolute values of the Pa/Ps ratios and indel frequencies were different. The Pa/Ps ratio in the human population is closer to a value of 1 for two possible reasons: There is either less negative selection in humans, or human samples have diverged more recently from their most recent common ancestor. Indel mutations on the other hand occur at much lower frequencies in humans relative to yeast (Lynch et al. 2008), which may partially account for the overall lower frequency of indels in humans as shown in figure 1.

Conclusion

Our investigation suggests that amino acid polymorphisms in disordered regions of both humans and yeast are under slightly weaker negative selection compared with structured protein regions. In contrast, the rate of in-frame indel polymorphisms in disordered regions is dramatically elevated, similar to that of noncoding DNA. Our findings using population genomics confirm recent observations regarding the evolution of disordered regions based on interspecific comparisons, and indicate that disordered regions are probably the major source of segregating protein length variation in the human population.

Supplementary Material

Supplementary figures S1–S3 are available at Genome Biology and Evolution online (http://www.gbe.oxfordjournals.org/).

62 in total

1. An accessory DNA binding motif in the zinc finger protein Adr1 assists stable binding to DNA and can be replaced by a third finger.

Authors: E T Young; N Kacherovsky; C Cheng
Journal: Biochemistry Date: 2000-01-25 Impact factor: 3.162

2. Proteomic studies of the intrinsically unstructured mammalian proteome.

Authors: Charles A Galea; Vishwajeeth R Pagala; John C Obenauer; Cheon-Gil Park; Clive A Slaughter; Richard W Kriwacki
Journal: J Proteome Res Date: 2006-10 Impact factor: 4.466

Review 3. Role of intrinsically disordered protein regions/domains in transcriptional regulation.

Authors: Anna S Garza; Nihal Ahmad; Raj Kumar
Journal: Life Sci Date: 2008-12-24 Impact factor: 5.037

4. A folding transition and novel zinc finger accessory domain in the transcription factor ADR1.

Authors: P M Bowers; L E Schaufler; R E Klevit
Journal: Nat Struct Biol Date: 1999-05

5. Joint inference of the distribution of fitness effects of deleterious mutations and population demography based on nucleotide polymorphism frequencies.

Authors: Peter D Keightley; Adam Eyre-Walker
Journal: Genetics Date: 2007-12 Impact factor: 4.562

Review 6. Intrinsically disordered proteins in human diseases: introducing the D2 concept.

Authors: Vladimir N Uversky; Christopher J Oldfield; A Keith Dunker
Journal: Annu Rev Biophys Date: 2008 Impact factor: 12.981

7. Prediction and functional analysis of native disorder in proteins from the three kingdoms of life.

Authors: J J Ward; J S Sodhi; L J McGuffin; B F Buxton; D T Jones
Journal: J Mol Biol Date: 2004-03-26 Impact factor: 5.469

8. Large-scale analysis of thermostable, mammalian proteins provides insights into the intrinsically disordered proteome.

Authors: Charles A Galea; Anthony A High; John C Obenauer; Ashutosh Mishra; Cheon-Gil Park; Marco Punta; Avner Schlessinger; Jing Ma; Burkhard Rost; Clive A Slaughter; Richard W Kriwacki
Journal: J Proteome Res Date: 2009-01 Impact factor: 4.466

9. The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression.

Authors: Thomas Derrien; Rory Johnson; Giovanni Bussotti; Andrea Tanzer; Sarah Djebali; Hagen Tilgner; Gregory Guernec; David Martin; Angelika Merkel; David G Knowles; Julien Lagarde; Lavanya Veeravalli; Xiaoan Ruan; Yijun Ruan; Timo Lassmann; Piero Carninci; James B Brown; Leonard Lipovich; Jose M Gonzalez; Mark Thomas; Carrie A Davis; Ramin Shiekhattar; Thomas R Gingeras; Tim J Hubbard; Cedric Notredame; Jennifer Harrow; Roderic Guigó
Journal: Genome Res Date: 2012-09 Impact factor: 9.043

10. An integrated map of genetic variation from 1,092 human genomes.

Authors: Goncalo R Abecasis; Adam Auton; Lisa D Brooks; Mark A DePristo; Richard M Durbin; Robert E Handsaker; Hyun Min Kang; Gabor T Marth; Gil A McVean
Journal: Nature Date: 2012-11-01 Impact factor: 49.962

13 in total

1. The Argonaute-binding platform of NRPE1 evolves through modulation of intrinsically disordered repeats.

Authors: Joshua T Trujillo; Mark A Beilstein; Rebecca A Mosher
Journal: New Phytol Date: 2016-07-19 Impact factor: 10.151

2. Pipeline to Identify Hydroxyproline-Rich Glycoproteins.

Authors: Kim L Johnson; Andrew M Cassin; Andrew Lonsdale; Antony Bacic; Monika S Doblin; Carolyn J Schultz
Journal: Plant Physiol Date: 2017-04-26 Impact factor: 8.340

3. Selection maintains signaling function of a highly diverged intrinsically disordered region.

Authors: Taraneh Zarin; Caressa N Tsai; Alex N Nguyen Ba; Alan M Moses
Journal: Proc Natl Acad Sci U S A Date: 2017-02-06 Impact factor: 11.205

Review 4. Role of the XPA protein in the NER pathway: A perspective on the function of structural disorder in macromolecular assembly.

Authors: Elisa Fadda
Journal: Comput Struct Biotechnol J Date: 2015-12-08 Impact factor: 7.271

5. Insights into the Immunological Properties of Intrinsically Disordered Malaria Proteins Using Proteome Scale Predictions.

Authors: Andrew J Guy; Vashti Irani; Christopher A MacRaild; Robin F Anders; Raymond S Norton; James G Beeson; Jack S Richards; Paul A Ramsland
Journal: PLoS One Date: 2015-10-29 Impact factor: 3.240

6. New Perspectives on Ebola Virus Evolution.

Authors: Celeste J Brown; Caleb J Quates; Christopher A Mirabzadeh; Craig R Miller; Holly A Wichman; Tanya A Miura; F Marty Ytreberg
Journal: PLoS One Date: 2016-08-01 Impact factor: 3.240

7. ASXL gain-of-function truncation mutants: defective and dysregulated forms of a natural ribosomal frameshifting product?

Authors: Adam M Dinan; John F Atkins; Andrew E Firth
Journal: Biol Direct Date: 2017-10-16 Impact factor: 4.540