Noah Dukler, Mehreen R Mughal, Ritika Ramani, Yi-Fei Huang, Adam Siepel.
Abstract
Large-scale genome sequencing has enabled the measurement of strong purifying selection in protein-coding genes. Here we describe a new method, called ExtRaINSIGHT, for measuring such selection in noncoding as well as coding regions of the human genome. ExtRaINSIGHT estimates the prevalence of "ultraselection" by the fractional depletion of rare single-nucleotide variants, after controlling for variation in mutation rates. Applying ExtRaINSIGHT to 71,702 whole genome sequences from gnomAD v3, we find abundant ultraselection in evolutionarily ancient miRNAs and neuronal protein-coding genes, as well as at splice sites. By contrast, we find much less ultraselection in other noncoding RNAs and transcription factor binding sites, and only modest levels in ultraconserved elements. We estimate that ~0.4-0.7% of the human genome is ultraselected, implying ~0.26-0.51 strongly deleterious mutations per generation. Overall, our study sheds new light on the genome-wide distribution of fitness effects by combining deep sequencing data and classical theory from population genetics.
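The "fractional depletion of rare variants" idea can be illustrated with a minimal sketch. It assumes that each site comes with a known probability of carrying a rare variant under neutrality (something a mutation-rate model would supply) and uses a simple moment estimator, one minus observed over expected. This is an illustration only and not the ExtRaINSIGHT likelihood model; the function name and the toy data are hypothetical.

```python
def fractional_depletion(observed, expected_probs):
    """Moment estimate of lambda = 1 - (observed count) / (expected count).

    observed: 0/1 flags, one per site (1 = a rare variant is present).
    expected_probs: per-site probability of a rare variant under neutrality.
    """
    expected = sum(expected_probs)
    if expected <= 0:
        raise ValueError("expected rare-variant count must be positive")
    return 1.0 - sum(observed) / expected


# Toy data (made up): 10 sites, each with neutral probability 0.5,
# so 5 rare variants are expected but only 3 are observed.
neutral_probs = [0.5] * 10
has_rare_variant = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

lam = fractional_depletion(has_rare_variant, neutral_probs)
print(round(lam, 3))  # 0.4
```

Note that this estimator is not constrained to be non-negative: a class of sites with more rare variants than expected yields a negative value.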
Year: 2022 PMID: 35879308 PMCID: PMC9314448 DOI: 10.1038/s41467-022-31872-6
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 17.694
Fig. 1 Measures of purifying selection at coding and coding-proximal genomic elements.
A Estimates for various annotation types are shown for both ExtRaINSIGHT (λ; teal) and INSIGHT (ρ; orange). B Similar estimates are shown for protein-coding genes by deciles of the loss-of-function observed/expected upper bound fraction (LOEUF) measure[13]. Results are shown for 80,950 isoforms of 19,677 genes. Notice that lower LOEUF scores are associated with stronger depletions of LoF variants, so λ and ρ tend to decrease as LOEUF increases. Error bars are centered at the MLE and indicate one standard error in each direction (see “Methods”).
Fig. 2 Measures of purifying selection in protein-coding genes by biological pathway.
Genes were assigned coarse-grained functional categories using the top-level annotation from the Reactome pathway database[31]. An estimate for each category is shown for both ExtRaINSIGHT (λ; teal) and INSIGHT (ρ; orange). Error bars are centered at the MLE and indicate one standard error in each direction (see “Methods”). Total number of genes: n = 19,256 (ranging from 125 to 2707 per category).
Fig. 3 Measures of purifying selection at annotated noncoding elements and in genomic intervals near protein-coding genes.
A Estimates for both ExtRaINSIGHT (λ; teal) and INSIGHT (ρ; orange) at noncoding elements (x-axis). B Estimates of λ in windows upstream of the transcription start site (TSS) and downstream of the polyadenylation site (PAS) (x-axis). The 5′ and 3′ UTRs are also shown, as are fourfold degenerate (4d) coding sites (CDS). C Estimates of λ for the extended promoter region (2 kb upstream of the TSS) within transcription factor binding sites (TFBS) annotated in the Ensembl Regulatory Build[44] and in the immediate flanking sequences (10 bp on each side). The difference in (C) is highly statistically significant by a two-sided likelihood ratio test based on the ExtRaINSIGHT likelihood model (p = 2.8 × 10^−13). Error bars are centered at the MLE and indicate one standard error in each direction (see “Methods”). Numbers of elements: A old miRNA: n = 7537; UCNE: n = 1,415,142; HAR: n = 674,492; young miRNA: n = 6285; other intron: n = 971,109,276; other intergenic: n = 1,255,478,347; lncRNA: n = 453,200,392; miRNA: n = 140,681; snoRNA: n = 49,837; snRNA: n = 155,304; B n = 58,496; C n = 1,120,839.
Ultraselection across the human genome (based on ExtRaINSIGHT).
| Feature | λ | ± (stderr)^a | no. sites (M) | prop. sites | exp. no. (M)^b | exp. prop.^c | fold enrich. | exp. s-del.^d | |
|---|---|---|---|---|---|---|---|---|---|
| CDS | 0.148 | 0.0004 | 33.8 | 1.18% | 4.7 | 43.5% | 36.9 | 0.11 | – |
| 5′ UTR | −0.161 | 0.0006 | 8.2 | 0.29% | 0.0 | 0.0% | 0.0 | 0.00 | – |
| 3′ UTR | 0.028 | 0.0002 | 36.1 | 1.26% | 0.7 | 6.2% | 5.0 | 0.02 | – |
| splice | 0.464 | 0.0012 | 0.8 | 0.03% | 0.4 | 3.3% | 121.3 | 0.01 | 2.0% |
| nonconserved lncRNA^e | 0.009 | 0.0001 | 453.6 | 15.78% | 0.0 | 0.0% | 0.0 | 0.00 | – |
| conserved lncRNA^f | 0.055 | 0.0003 | 23.3 | 0.81% | 1.1 | 9.8% | 12.1 | 0.03 | – |
| nonconserved intron^e | 0.009 | 0.0000 | 972.6 | 33.83% | 0.0 | 0.0% | 0.0 | 0.00 | – |
| conserved intron^f | 0.058 | 0.0002 | 44.3 | 1.54% | 2.2 | 20.1% | 13.1 | 0.05 | – |
| nonconserved intergenic^e | 0.003 | 0.0000 | 1255.5 | 43.67% | 0.0 | 0.0% | 0.0 | 0.00 | – |
| conserved intergenic^f | 0.048 | 0.0002 | 46.9 | 1.63% | 1.8 | 17.0% | 10.5 | 0.04 | – |
| Total | | | 2875.1 | 100.00% | 10.8 | 100.0% | | 0.26 | |
^a The similar values of the standard errors (equal after rounding) reflect the maximum of 1M sites used for estimation.
^b Expected number of ultraselected sites after adjusting for background. In this case, the estimate for nonconserved introns (0.009) was subtracted from each estimate of λ (see Supplementary Table 1 for a less conservative correction).
^c Expected proportion of ultraselected sites after adjusting for background.
^d Expected number of new strongly deleterious mutations per diploid individual, assuming a mutation rate of 1.2 × 10^−8 per generation per site.
^e Sites not classified as conserved by phastCons.
^f Sites classified as conserved by phastCons.
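The derived columns of the ExtRaINSIGHT table can be approximately reproduced from the λ estimates and site counts alone. The sketch below is a reconstruction inferred from the footnotes and the tabulated values, not the authors' code: it assumes that background-corrected estimates are floored at zero (the table shows 0.0 wherever λ is at or below 0.009) and that the per-individual count uses two genome copies (needed to match the reported 0.11 for CDS). All variable and function names are hypothetical.

```python
MUTATION_RATE = 1.2e-8  # per site per generation (table footnote)
BACKGROUND = 0.009      # lambda for nonconserved introns (table footnote)

# feature -> (lambda, number of sites in millions), copied from the table
rows = {
    "CDS": (0.148, 33.8),
    "5' UTR": (-0.161, 8.2),
    "3' UTR": (0.028, 36.1),
    "splice": (0.464, 0.8),
    "nonconserved lncRNA": (0.009, 453.6),
    "conserved lncRNA": (0.055, 23.3),
    "nonconserved intron": (0.009, 972.6),
    "conserved intron": (0.058, 44.3),
    "nonconserved intergenic": (0.003, 1255.5),
    "conserved intergenic": (0.048, 46.9),
}


def expected_sites_m(lam, sites_m, background=BACKGROUND):
    """Expected ultraselected sites (millions); flooring at zero is an assumption."""
    return max(lam - background, 0.0) * sites_m


exp_no = {name: expected_sites_m(lam, n) for name, (lam, n) in rows.items()}
total_exp = sum(exp_no.values())
total_sites = sum(n for _, n in rows.values())


def fold_enrichment(name):
    """Share of ultraselected sites divided by share of all sites."""
    return (exp_no[name] / total_exp) / (rows[name][1] / total_sites)


def new_deleterious_per_diploid(sites_m):
    """New mutations per generation at these sites, over two genome copies."""
    return 2 * MUTATION_RATE * sites_m * 1e6


print(round(exp_no["CDS"], 1))                         # 4.7
print(round(total_exp, 1))                             # 10.8
print(round(fold_enrichment("CDS"), 1))                # 36.9
print(round(new_deleterious_per_diploid(total_exp), 2))  # 0.26
```

Small discrepancies against the table for short classes, such as splice sites, are expected because the inputs here are already rounded to one decimal place.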
Weaker selection across the human genome (based on INSIGHT).
| Feature | ρ | ± (stderr) | no. sites (M) | prop. sites | exp. no. (M)^a | exp. prop.^b | fold enrich. | exp. del.^c |
|---|---|---|---|---|---|---|---|---|
| CDS | 0.624 | 0.020 | 33.8 | 1.18% | 19.7 | 21.5% | 18.2 | 0.47 |
| 5′ UTR | 0.222 | 0.035 | 8.2 | 0.29% | 1.5 | 1.6% | 5.6 | 0.04 |
| 3′ UTR | 0.237 | 0.033 | 36.1 | 1.26% | 7.0 | 7.7% | 6.1 | 0.17 |
| splice | 0.883 | 0.013 | 0.8 | 0.03% | 0.7 | 0.7% | 26.3 | 0.02 |
| nonconserved lncRNA^d | 0.025 | 0.020 | 453.6 | 15.78% | 0.0 | 0.0% | 0.0 | 0.00 |
| conserved lncRNA^e | 0.412 | 0.019 | 23.3 | 0.81% | 8.6 | 9.4% | 11.6 | 0.21 |
| nonconserved intron^d | 0.042 | 0.022 | 972.6 | 33.83% | 0.0 | 0.0% | 0.0 | 0.00 |
| conserved intron^e | 0.426 | 0.019 | 44.3 | 1.54% | 17.0 | 18.5% | 12.0 | 0.41 |
| nonconserved intergenic^d | 0.059 | 0.036 | 1255.5 | 43.67% | 21.7 | 23.6% | 0.5 | 0.52 |
| conserved intergenic^e | 0.376 | 0.020 | 46.9 | 1.63% | 15.7 | 17.0% | 10.4 | 0.38 |
| Total | | | 2875.1 | 100.00% | 91.9 | 100.0% | | 2.21 |
^a Expected number of deleterious sites after adjusting for background. In this case, the estimate for nonconserved introns (0.042) was subtracted from each estimate of ρ.
^b Expected proportion of deleterious sites after adjusting for background.
^c Expected number of new deleterious mutations per diploid individual, assuming a mutation rate of 1.2 × 10^−8 per generation per site.
^d Sites not classified as conserved by phastCons.
^e Sites classified as conserved by phastCons.