Literature DB >> 24916662

Discovering functional DNA elements using population genomic information: a proof of concept using human mtDNA.

Daniel R Schrider1, Andrew D Kern2.   

Abstract

Identifying the complete set of functional elements within the human genome would be a windfall for multiple areas of biological research including medicine, molecular biology, and evolution. Complete knowledge of function would aid in the prioritization of loci when searching for the genetic bases of disease or adaptive phenotypes. Because mutations that disrupt function are disfavored by natural selection, purifying selection leaves a detectable signature within functional elements; accordingly, this signal has been exploited for over a decade through the use of genomic comparisons of distantly related species. While this is so, the functional complement of the genome changes extensively across time and between lineages; therefore, evidence of the current action of purifying selection in humans is essential. Because the removal of deleterious mutations by natural selection also reduces within-species genetic diversity within functional loci, dense population genetic data have the potential to reveal genomic elements that are currently functional. Here, we assess the potential of this approach by examining an ultradeep sample of human mitochondrial genomes (n = 16,411). We show that the high density of polymorphism in this data set precisely delineates regions experiencing purifying selection. Furthermore, we show that the number of segregating alleles at a site is strongly correlated with its divergence across species after accounting for known mutational biases in human mitochondrial DNA (ρ = 0.51; P < 2.2 × 10(-16)). These two measures track one another at a remarkably fine scale across many loci-a correlation that is purely the result of natural selection. Our results demonstrate that genetic variation has the potential to reveal with surprising precision which regions in the genome are currently performing important functions and likely to have deleterious fitness effects when mutated. As more complete human genomes are sequenced, similar power to reveal purifying selection may be achievable in the human nuclear genome.
© The Author(s) 2014. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.

Entities:  

Keywords:  mitochondria; natural selection; population genetics

Mesh:

Substances:

Year:  2014        PMID: 24916662      PMCID: PMC4122919          DOI: 10.1093/gbe/evu116

Source DB:  PubMed          Journal:  Genome Biol Evol        ISSN: 1759-6653            Impact factor:   3.416


Introduction

Only 1–2% of human genome lies within protein-coding sequence (Lander et al. 2001). Determining the extent to which the remainder of the genome is functional is crucial to our understanding of human biology. A variety of recently developed experimental techniques have aided in the search of noncoding DNA for functional elements (Dunham et al. 2012); however, on their own these techniques can produce a huge number of false positives (Graur et al. 2013). Searches for the evolutionary signature of purifying selection have therefore proved a more fruitful strategy for identifying functional elements; indeed phylogenetic searches comparing sequences of related species have revealed that approximately 5% of the human genome is constrained by natural selection (Chinwalla et al. 2002; Siepel et al. 2005; Lunter et al. 2006; Birney et al. 2007; Davydov et al. 2010), and similar strategies have been used to predict the phenotypic severity of mutations (Stone and Sidow 2005). Although whole-genome comparisons aimed at identifying the footprints of selection are highly effective, they have been used primarily to detect elements under constraint for hundreds of millions of years of evolutionary history (Siepel et al. 2005; Davydov et al. 2010). However, the set of functional elements in the genome experiences considerable turnover (Demuth et al. 2006). Comparative genomic techniques will fail in these instances, particularly for the supremely interesting cases of human-specific gain (Knowles and McLysaght 2009) and loss of function (Wang et al. 2006). Surveys of genetic diversity within species, on the other hand, have the potential to identify regions currently experiencing purifying selection and that are therefore functional, as purifying selection will remove genetic diversity from such loci. Unfortunately, genetic variation in the human genome is quite sparse, with a comparison of any two homologous chromosomes uncovering less than 1 single-nucleotide polymorphism (SNP) every kilobase (Lander et al. 2001). Sampling more individuals, however, yields additional polymorphisms, and an ultradeep sample of mitochondrial variation from 16,411 genomes is available in the MITOMAP database (Ruiz-Pesini et al. 2007). These data are extremely polymorphic, with more than one SNP on every other base pair on average. This data set thus serves as an ideal proving ground for the approach of identifying functional constraint using massive amounts of polymorphism data, which will soon be available for nuclear genomes. Here, we show that the density of polymorphism in these data closely tracks divergence at a fine scale, implying that these data can indeed be used to reveal the strength of purifying selection in the human mitochondrial genome at a very high resolution. Our results suggest an enormous potential for population genomic data to uncover functional DNA elements, including those not conserved across species.

Results and Discussion

We set out to determine the extent to which polymorphism data reveal the strength of purifying selection across the human mitochondrial genome and downloaded the coordinates of all 8,944 SNPs from MITOMAP (http://www.mitomap.org/MITOMAP, last accessed August 4, 2013). We reasoned that if the density of polymorphism was governed by the amount of purifying selection acting on each site, then SNP density would be correlated with divergence across species, in accordance with expectations under the Neutral model (Kimura 1982). This is indeed what we observe in the form of a strong correlation between the number of alleles per site and its average negated phyloP score (Siepel et al. 2006) measuring divergence across vertebrates (Spearman’s ρ = 0.52; P < 2.2 × 10−16). This correlation is also highly significant when averaging polymorphism and divergence within 10-bp adjacent windows (ρ = 0.50; P < 2.2 × 10−16; fig. 1).
F

The correlation between polymorphism and divergence in the human mitochondrial genome. The average number of alleles per base pair in 10-bp windows is shown in the x-axis and divergence as measured by the negated phyloP score is shown in the y-axis.

The correlation between polymorphism and divergence in the human mitochondrial genome. The average number of alleles per base pair in 10-bp windows is shown in the x-axis and divergence as measured by the negated phyloP score is shown in the y-axis. Although this observation is consistent with purifying selection both removing diversity and constraining divergence at functional elements, such a pattern could also be generated by variation in the spontaneous mutation rate. It has been shown that mutation rate in the mitochondria varies according to the duration for which a given site remains single stranded on the H strand (DssH) during DNA replication (Reyes et al. 1998). We also find evidence for this in the form of a significant correlation between divergence at each site and the duration the site is single stranded on the H strand during replication, although this correlation is far weaker than that shared between polymorphism and divergence (ρ = 0.11; P < 2.2 × 10−16). Moreover, after correcting for DssH, the correlation between polymorphism and divergence at individual sites is essentially unchanged and still highly significant (ρ = 0.49; P < 2.2 × 10−16). Rather than being driven by a subset of mitochondrial loci, this correlation is significant (at P < 0.05) in 36/37 genes and is significant in 35/37 genes after correcting for DssH (table 1). Similarly, polymorphism and divergence are more strongly correlated in protein-coding (ρ = −0.53; P < 2.2 × 10−16) and RNA-coding genes (ρ = −0.43; P < 2.2 × 10−16) than noncoding DNA within the control region (ρ = −0.23; P = 9.3 × 10−15) or outside of it (ρ = −0.30; P = 0.021). This correlation is also far stronger at nonsynonymous than synonymous sites (ρ = 0.25 for second codon position sites, P < 2.2 × 10−16; ρ = 0.079 for 4-fold degenerate sites, P = 3.6 × 10−4; fig. 2) as expected if purifying selection is a more predominant force at nonsynonymous sites. Finally, the minor allele frequencies of SNPs from the Human Mitochondrial Genome Database (mtDB; Ingman and Gyllensten 2006) are correlated with divergence (ρ = 0.076; P = 4.0 × 10−16), even though variation in mutation rate is not expect to affect allele frequencies. Thus, purifying selection uniquely drives patterns of polymorphism in the human mitochondrial genome. This finding supports previous reports that purifying selection is a prominent force in the mitochondrial genome (Rand and Kann 1996; Nielsen and Weinreich 1999; Elson et al. 2004; Stewart et al. 2008). The patterns we observed are not the result of positive selection, as the fixation of a beneficial mutation through a selective sweep removes all genetic diversity from a nonrecombining chromosome (Smith and Haigh 1974). We are thus limited to observing mutations occurring since the most recent sweep.
Table 1

Gene-Specific Correlations Between SNP Density and Negative phyloP Score

Gene NameGene DescriptionGene Start (hg19)Gene End (hg19)Spearman’s ρPSpearman's ρ (Correcting for DssH)P(Correcting for DssH)
MT-TFtRNA phenylalanine579649−0.4841.90 × 10−5−0.4689483.71 × 10−5
MT-RNR112S ribosomal RNA6501603−0.429<2.20 × 10−16−0.4169046<2.20 × 10−16
MT-TVtRNA valine16041672−0.3360.004783−0.3260220.006261
MT-RNR216S ribosomal RNA16733230−0.443<2.20 × 10−16−0.4435798<2.20 × 10−16
MT-TL1tRNA leucine 132313305−0.3010.008593−0.298370.00932
MT-ND1NADH dehydrogenase subunit 133084263−0.542<2.20 × 10−16−0.5149936<2.20 × 10−16
MT-TItRNA isoleucine42644332−0.2690.02551−0.25237450.03643
MT-TQtRNA glutamine43304401−0.4634.19 × 10−5−0.4763852.34 × 10−5
MT-TMtRNA methionine44034470−0.2910.01611−0.28807270.01721
MT-ND2NADH dehydrogenase subunit 244715512−0.516<2.20 × 10−16−0.461188<2.20 × 10−16
MT-TWtRNA tryptophan55135580−0.5099.42E × 10−6−0.46268127.11 × 10−5
MT-TAtRNA alanine55885656−0.4290.0002381−0.44126210.0001475
MT-TNtRNA asparagine56585730−0.3780.0009899−0.37656260.001025
MT-TCtRNA cysteine57625827−0.5961.26 × 10−7−0.57098375.55 × 10−7
MT-TYtRNA tyrosine58275892−0.3520.00371−0.34865350.004118
MT-CO1Cytochrome c oxidase subunit I59057446−0.633<2.20 × 10−16−0.6186776<2.20 × 10−16
MT-TS1tRNA serine 174477515−0.6278.31 × 10−9−0.61072292.51 × 10−8
MT-TDtRNA aspartic acid75197586−0.3650.002186−0.3202530.007758
MT-CO2Cytochrome c oxidase subunit II75878270−0.573<2.20 × 10−16−0.5291891<2.20 × 10−16
MT-TKtRNA lysine82968365−0.3310.005158−0.28099190.01846
MT-ATP8ATP synthase F0 subunit 883678573−0.2660.0001042−0.26372310.0001233
MT-ATP6ATP synthase F0 subunit 685289208−0.389<2.20 × 10−16−0.3823366<2.20 × 10−16
MT-CO3Cytochrome c oxidase subunit III92089991−0.538<2.20 × 10−16−0.5314281<2.20 × 10−16
MT-TGtRNA glycine999210059−0.3960.0008191−0.34048120.004497
MT-ND3NADH dehydrogenase subunit 31006010405−0.510<2.20 × 10−16−0.5000739<2.20 × 10−16
MT-TRtRNA arginine1040610470−0.4707.66 × 10−5−0.44152060.0002317
MT-ND4LNADH dehydrogenase subunit 4L1047110767−0.491<2.20 × 10−16−0.5133884<2.20 × 10−16
MT-ND4NADH dehydrogenase subunit 41076112138−0.561<2.20 × 10−16−0.5419698<2.20 × 10−16
MT-THtRNA histidine1213912207−0.2770.02112−0.17864130.1419
MT-TS2tRNA serine 21220812266−0.4700.0001712−0.4891468.45 × 10−5
MT-TL2tRNA leucine 21226712337−0.3740.001311−0.35430690.002434
MT-ND5NADH dehydrogenase subunit 51233814149−0.532<2.20 × 10−16−0.5141183<2.20 × 10−16
MT-ND6NADH dehydrogenase subunit 61415014674−0.500<2.20 × 10−16−0.4776221<2.20 × 10−16
MT-TEtRNA glutamic acid1467514743−0.3470.003468−0.34767310.003421
MT-CYBCytochrome b1474815888−0.551<2.20 × 10−16−0.5025513<2.20 × 10−16
MT-TTtRNA threonine1588915954−0.5601.00 × 10−6−0.43547230.0002578
MT-TPtRNA proline1595716024−0.1030.4044−0.18702940.1267
F

The correlation between polymorphism and divergence at nonsynonymous and synonymous sites. The number of alleles observed at each site is shown in the x-axis, and divergence (negative phyloP score) is shown in the y-axis at (A) second codon position (nonsynonymous) sites and (B) 4-fold degenerate (synonymous) sites. We added noise to the number of alleles to reveal the density of sites along the y-axis.

The correlation between polymorphism and divergence at nonsynonymous and synonymous sites. The number of alleles observed at each site is shown in the x-axis, and divergence (negative phyloP score) is shown in the y-axis at (A) second codon position (nonsynonymous) sites and (B) 4-fold degenerate (synonymous) sites. We added noise to the number of alleles to reveal the density of sites along the y-axis. Gene-Specific Correlations Between SNP Density and Negative phyloP Score Having established that patterns of polymorphism across the human mitochondria are largely determined by purifying selection, we sought to determine the resolution at which these data reveal the strength of selection acting on particular sites in the genome. We examined patterns of SNP diversity and divergence in 5-bp sliding windows across each gene in the mitochondrial genome, observing the extent to which the two measures mirror one another on a small scale. In particular, within each window, we calculated the average SNP density per base pair and the average probability that the site is not conserved across vertebrates according to phastCons (Siepel et al. 2005). We find that for many loci, tRNA genes in particular, these two measures track one another to a surprising extent (e.g., the phenylalanine and tryptophan tRNA genes shown in fig. 3; the remaining tRNA genes are shown in supplementary fig. S1, Supplementary Material online). This result demonstrates that SNP density has the ability to reveal the strength of selection at a surprisingly detailed resolution—on the scale of a few base pairs.
F

The probability of polymorphism versus the probability of unconstrained evolution across vertebrates. (A) The 5-bp sliding genomic windows showing SNP density (blue) and one minus the probability of conservation across vertebrates (red) according to phastCons (Siepel et al. 2005) across the phenylalanine tRNA gene. (B) The same plot for the tryptophan tRNA gene.

The probability of polymorphism versus the probability of unconstrained evolution across vertebrates. (A) The 5-bp sliding genomic windows showing SNP density (blue) and one minus the probability of conservation across vertebrates (red) according to phastCons (Siepel et al. 2005) across the phenylalanine tRNA gene. (B) The same plot for the tryptophan tRNA gene. As a simple proof of concept, we sought to use SNP density to predict function at a fine scale via a hidden Markov model (HMM; Rabiner 1989). Using a similar strategy as phastCons (Siepel et al. 2005), we learned a two-state HMM (constrained vs. unconstrained) where the observation for each site in the genome is the number of alleles at the site. We then used this HMM to predict constrained regions to which we refer as mitoPopCons elements. There is extremely strong overlap between mitoPopCons elements obtained from polymorphism data and phastCons elements predicted from divergence (P < 0.0001; fig. 4; see Materials and Methods). Although mitoPopCons recovers somewhat fewer genic base pairs than phastCons (35.4% of all genic base pairs are recovered by mitoPopCons versus 49.7% by phastCons), mitoPopCons elements contain fewer intergenic base pairs (0.75% of mitoPopCons base pairs are intergenic versus 3.4% of phastCons bases). Given the dramatically deeper evolutionary time period examined by phastCons data, that it seems to perform only marginally better than mitoPopCons underscores the potential of population genetic approaches. phastCons elements are smaller and more numerous (1,395 elements averaging 6.7 bp in length) than mitoPopCons elements (33 elements averaging 167 bp), perhaps implying that phylogenetic data allow for higher resolution prediction than even our dense polymorphism data. On the other hand, element length distributions may have been influenced by the difference in emission probability training methods used for mitoPopCons (trained via the Baum–Welch algorithm; see Materials and Methods) and phastCons elements. In any case, the success of this simple HMM shows that SNP diversity has the ability to accurately predict function at a fine scale in the human mitochondria.
F

Comparison of conserved elements called from phylogenetic data and those called from population genetic data. This image form the UCSC Genome Browser shows positions 1–5,000 of the human mitochondrial genome. Conserved elements called from the polymorphism-based HMM (mitoPopCons) appear in blue, whereas phastCons elements obtained from a comparison of mammalian genomes appear in red. phastCons conservation probabilities are shown at the bottom in green. Gene locations are shown at the top.

Comparison of conserved elements called from phylogenetic data and those called from population genetic data. This image form the UCSC Genome Browser shows positions 1–5,000 of the human mitochondrial genome. Conserved elements called from the polymorphism-based HMM (mitoPopCons) appear in blue, whereas phastCons elements obtained from a comparison of mammalian genomes appear in red. phastCons conservation probabilities are shown at the bottom in green. Gene locations are shown at the top. We have shown that ultradense polymorphism data can be used to accurately detect functional nucleotides in the human mitochondrial genome, potentially at the level of the individual base pair, while sidestepping limitations of phylogenetic approaches. This result suggests that as whole-genome sequencing becomes more ubiquitous, it may become possible to perform such high-resolution prediction in the nuclear genome as well. Applying a polymorphism-based approach to the nuclear genome will present several additional challenges. First, independent assortment and recombination in the nuclear genome cause different loci to have distinct genealogical histories and therefore varying levels of diversity under neutrality, thereby potentially impeding the detection of selection. As another consequence of recombination, both positive (Smith and Haigh 1974) and negative selection (Charlesworth et al. 1993) will have localized effects on flanking variation, rather than genome wide as in the mitochondria. These forces will further increase variance in polymorphism at unselected sites and may thus obscure the signal of negative selection at selected sites. Another difficulty of the nuclear genome is that its nucleotide diversity is far lower than that of the mitochondrial genome (Lander et al. 2001), meaning that an even larger number of sequences than examined here may be required to accurately detect selection. Moreover, the power to detect function increases logarithmically with sample size (supplementary fig. S2, Supplementary Material online). However, given the ever-increasing rate at which new human genome sequences are released, this problem may not be insurmountable. Finally, there is likely more variation in the strength of purifying selection acting in the nuclear genome than in the mitochondria. As a consequence, weakly constrained but still functionally important regions may evade detection, especially by a two-state method, allowing for only one level of constraint like the HMM used here. If these hurdles can be overcome, approaches such as ours will then have an enormous impact on biological research, allowing for the discovery of the complete set of functional elements in the human genome and the degree to which new mutations at each site are deleterious. Such efforts will vastly improve predictions of the phenotypic impact of mutations occurring in humans and will prioritize searches for disease-causing mutations. This information will also reveal species-specific changes in selective pressures at the resolution of individual nucleotides, thereby greatly improving our understanding of how the functional components of genomes evolve.

Materials and Methods

We converted all coordinates obtained from MITOMAP and mtDB from the Cambridge Reference Sequence (CRS) coordinate space to the human reference genome’s (GRCh37) coordinate space by aligning the two mitochondrial sequences using MUSCLE (Edgar 2004). We downloaded phyloP scores for each mitochondrial site from the UCSC Genome Browser Database (Meyer et al. 2013). Because there are no allele frequency data available for MITOMAP SNPs, we used the number of alleles at each site as our measure of diversity. For the correlation of minor allele frequency with divergence, we used frequencies of biallelic SNPs from mtDB. We measured the duration of single strandedness (DssH) following Chong and Mueller (2013), using the coordinates of the light-strand origin of replication from MITOMAP. For the DssH analyses, we omitted sites within the control region. To control for the DssH-divergence correlation, we fit a generalized linear model treating the number of segregating alleles at each site as a Poisson distributed response variable and with DssH value as the predictive variable using R. We then calculated the correlation between phyloP scores and the polymorphism residuals. The HMM analysis was performed in MATLAB with the same transition matrix as used to learn the phastCons HMM for the UCSC Genome Browser (Kent et al. 2002). For each base pair, the emitted observation was the number of alleles (ranging one and four) found in the MITOMAP data set. The rationale for this approach was that after training the HMM, its two states should have different probabilities of emitting 1, 2, 3, or 4 alleles at a given site, and this would result in one state being more likely in stretches of DNA under selective constraint, whereas the other state would have higher likelihood in less constrained sequences. After assigning an initially random emission probability matrix, we then used Baum–Welch training to learn the emission parameters only for the constrained and unconstrained states by using large number of pseudocounts for transition matrix to ensure that it remained invariant during training. To guard against convergence at a poor local optimum, we repeated this process ten times and kept the emission matrix under which the observed sequence data had the highest likelihood. After training, we considered the state with the higher probability of emitting monomorphic sites to be the “functional” state and the other state to be “nonfunctional.” Next, we predicted constrained elements by using the Viterbi algorithm. We tested for significant overlap between our HMM predictions and mammalian phastCons elements from the UCSU Genome Browser by counting the number of base pairs constrained according to both methods and compared this count with those obtained after permuting the coordinates of our predictions. This was repeated for 10,000 permutations to obtain a P value. To reveal the relationship between sample size and the HMM’s power to recover functional sites, we also repeated the HMM analysis as described above for different sample sizes using the number of alleles at each site in the mtDB set—the MITOMAP set could not be used for this as it lacks reliable allele frequency estimates. The sample sizes we examined ranged from 100 to 1,800, incrementing by 100. We also used a sample size of 1,864, which was the minimum sample size among all SNPs in the mtDB set. For each sample size, we randomly downsampled from the full mtDB data set by randomly drawing the desired number of alleles from the full set at each site without replacement. We then counted the number of distinct alleles observed at each site within the downsampled set, and this number was used as the observation for the HMM.

Supplementary Material

Supplementary figures S1 and S2 are available at Genome Biology and Evolution online (http://www.gbe.oxfordjournals.org/).
  25 in total

1.  The age of nonsynonymous and synonymous mutations in animal mtDNA and implications for the mildly deleterious theory.

Authors:  R Nielsen; D M Weinreich
Journal:  Genetics       Date:  1999-09       Impact factor: 4.562

2.  MUSCLE: multiple sequence alignment with high accuracy and high throughput.

Authors:  Robert C Edgar
Journal:  Nucleic Acids Res       Date:  2004-03-19       Impact factor: 16.971

3.  Physicochemical constraint violation by missense substitutions mediates impairment of protein function and disease severity.

Authors:  Eric A Stone; Arend Sidow
Journal:  Genome Res       Date:  2005-06-17       Impact factor: 9.043

4.  Asymmetrical directional mutation pressure in the mitochondrial genome of mammals.

Authors:  A Reyes; C Gissi; G Pesole; C Saccone
Journal:  Mol Biol Evol       Date:  1998-08       Impact factor: 16.240

5.  The hitch-hiking effect of a favourable gene.

Authors:  J M Smith; J Haigh
Journal:  Genet Res       Date:  1974-02       Impact factor: 1.588

6.  Excess amino acid polymorphism in mitochondrial DNA: contrasts among genes from Drosophila, mice, and humans.

Authors:  D M Rand; L M Kann
Journal:  Mol Biol Evol       Date:  1996-07       Impact factor: 16.240

7.  Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes.

Authors:  Adam Siepel; Gill Bejerano; Jakob S Pedersen; Angie S Hinrichs; Minmei Hou; Kate Rosenbloom; Hiram Clawson; John Spieth; Ladeana W Hillier; Stephen Richards; George M Weinstock; Richard K Wilson; Richard A Gibbs; W James Kent; Webb Miller; David Haussler
Journal:  Genome Res       Date:  2005-07-15       Impact factor: 9.043

8.  Genome-wide identification of human functional DNA using a neutral indel model.

Authors:  Gerton Lunter; Chris P Ponting; Jotun Hein
Journal:  PLoS Comput Biol       Date:  2006-01-13       Impact factor: 4.475

9.  mtDB: Human Mitochondrial Genome Database, a resource for population genetics and medical sciences.

Authors:  Max Ingman; Ulf Gyllensten
Journal:  Nucleic Acids Res       Date:  2006-01-01       Impact factor: 16.971

10.  Gene losses during human origins.

Authors:  Xiaoxia Wang; Wendy E Grus; Jianzhi Zhang
Journal:  PLoS Biol       Date:  2006-02-14       Impact factor: 8.029

View more
  1 in total

1.  Inferring Selective Constraint from Population Genomic Data Suggests Recent Regulatory Turnover in the Human Brain.

Authors:  Daniel R Schrider; Andrew D Kern
Journal:  Genome Biol Evol       Date:  2015-11-19       Impact factor: 3.416

  1 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.