Lex E Flagel, John H Willis, Todd J Vision.
Abstract
Major unresolved questions in evolutionary genetics include determining the contributions of different mutational sources to the total pool of genetic variation in a species, and understanding how these different forms of genetic variation interact with natural selection. Recent work has shown that structural variants (SVs) (insertions, deletions, inversions, and transpositions) are a major source of genetic variation, often outnumbering single nucleotide variants in terms of total bases affected. Despite the near ubiquity of SVs, major questions about their interaction with natural selection remain. For example, how does the allele frequency spectrum of SVs differ when compared with single nucleotide variants? How often do SVs affect genes, and what are the consequences? To begin to address these questions, we have systematically identified and characterized a large set of submicroscopic insertion and deletion (indel) variants (between 1 and 200 kb in length) among ten inbred lines from a single natural population of the plant species Mimulus guttatus. After extensive computational filtering, we focused on a set of 4,142 high-confidence indels that showed an experimental validation rate of 73%. All but one of these indels were less than 200 kb. Although the largest were generally at lower frequencies in the population, a surprising number of large indels are at intermediate frequencies. Although indels overlapping with genes were much rarer than expected by chance, approximately 600 genes were affected by an indel. Nucleotide-binding site leucine-rich repeat (NBS-LRR) defense response genes were the most enriched among the gene families affected. Most indels associated with genes were rare and appeared to be under purifying selection, though we do find four high-frequency derived insertion alleles that show signatures of recent positive selection.
Keywords: Mimulus guttatus; indel; natural selection; population genomics; structural variation
Year: 2014 PMID: 24336482 PMCID: PMC3914686 DOI: 10.1093/gbe/evt199
Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416
Table 1. Resequenced Accessions, Including Outgroups
| Line | Species | Origin (a) | Generations of Inbreeding | Seq. Facility (b) | Total Paired End Reads | Read Type (bp) | Sites Available (c) | Median Per Site Coverage | NCBI SRA Accession No. |
|---|---|---|---|---|---|---|---|---|---|
| IM109 | M. guttatus | Iron Mountain | 11 | UNC | 24,671,221 | 2 × 75 | 127,863,968 | 8 | SRX021073 |
| IM1145 | M. guttatus | Iron Mountain | 11 | UNC | 22,839,207 | 2 × 75 | 138,331,815 | 8 | SRX021074 |
| IM155 | M. guttatus | Iron Mountain | 12 | Duke | 37,172,361 | 2 × 75 | 138,971,514 | 15 | SRX055301 |
| IM320 | M. guttatus | Iron Mountain | 15 | Duke | 23,226,015 | 2 × 75 | 160,689,132 | 8 | SRX055300 |
| IM479 | M. guttatus | Iron Mountain | 9 | UNC | 24,086,031 | 2 × 75 | 134,867,795 | 9 | SRX021077 |
| IM62 | M. guttatus | Iron Mountain | >10 | UNC | 24,911,877 | 2 × 75 | 206,733,050 | 7 | SRX021072 |
| IM624 | M. guttatus | Iron Mountain | 13 | UNC | 22,433,144 | 2 × 75 | 137,605,733 | 8 | SRX021075 |
| IM693 | M. guttatus | Iron Mountain | 9 | UNC | 21,969,210 | 2 × 75 | 133,128,765 | 8 | SRX021078 |
| IM767 | M. guttatus | Iron Mountain | 11 | UNC | 25,497,466 | 2 × 75 | 135,649,759 | 9 | SRX021079 |
| IM835 | M. guttatus | Iron Mountain | 13 | UNC | 17,966,309 | 2 × 75 | 129,433,596 | 6 | SRX021076 |
| DUN | | Florence | >6 | JGI | 262,093,335 | 2 × 35 | 94,024,553 | 23 | SRX030973, SRX030974 |
| SF5 | | Sherar’s Falls | Natural selfer | JGI | 24,199,117 | 2 × 76 | 65,812,268 | 10 | SRX116529 |
(a) All accessions originate from Oregon, USA. Approximate geographic coordinates as follows: Iron Mountain [44.4005, −122.1428], Florence [43.8891, −124.1360], and Sherar’s Falls [45.2587, −121.0201], with [Latitude, Longitude] given in decimal format.
(b) Duke University Sequence Facility (Duke), DOE Joint Genome Institute (JGI), University of North Carolina High-Throughput Sequencing Facility (UNC).
(c) Nucleotide sites belonging to reads with mapping quality scores ≥ 29.
Table 2. Indel and SNP Variants among the Ten Resequenced Lines
| Polymorphism Type | Count | Median Size (bp) | Average MAF (a) | Proportion of Derived Minor Alleles |
|---|---|---|---|---|
| Indels: all | 4,142 | 2,563 | 0.255 | 0.718 |
| Indels in genes | 414 | 3,804 | 0.218 | 0.739 |
| Indels in TEs | 1,855 | 2,839 | 0.277 | 0.743 |
| SNPs: all | 1,337,759 | 1 | 0.222 | 0.619 |
| SNPs: synonymous | 259,676 | 1 | 0.239 | 0.550 |
| SNPs: nonsynonymous | 106,638 | 1 | 0.227 | 0.626 |
(a) Average MAF from neutral coalescent simulations = 0.222.
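The two summary columns in the table above (average MAF and proportion of derived minor alleles) can be illustrated with a minimal sketch. It assumes each variant is coded as a list of 0/1 calls across the inbred lines (0 = ancestral, 1 = derived), treats each inbred line as one haploid sample, and leaves sites with a 50/50 split out of the derived-minor proportion. The function names and the encoding are hypothetical and are not taken from the paper.

```python
def minor_allele_frequency(calls):
    """Return the MAF for one site; calls is a list of 0/1 (0 = ancestral, 1 = derived)."""
    freq = sum(calls) / len(calls)
    return min(freq, 1 - freq)


def summarize(sites):
    """Return (average MAF, proportion of sites whose minor allele is derived).

    Monomorphic sites are ignored. Sites with an exact 50/50 split have no
    unambiguous minor allele, so they are left out of the proportion
    (an assumption of this sketch).
    """
    segregating = [s for s in sites if 0 < sum(s) < len(s)]
    mean_maf = sum(minor_allele_frequency(s) for s in segregating) / len(segregating)
    untied = [s for s in segregating if 2 * sum(s) != len(s)]
    derived_minor = sum(1 for s in untied if 2 * sum(s) < len(s))
    return mean_maf, derived_minor / len(untied)


# Toy data for ten lines: derived allele counts of 1, 7, and 2.
toy_sites = [
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
]
mean_maf, prop_derived_minor = summarize(toy_sites)
```

In the toy data the MAFs are 0.1, 0.3, and 0.2, and the derived allele is the minor allele at two of the three sites.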
Figure. Distribution of indel size as a function of allele frequency. The box plots indicate the distribution of indel sizes (bp) at different MAF for all indels identified in the focal Iron Mountain population. Indel size (y-axis) is plotted on a log scale.
Figure. Allele frequency distributions in the ten resequenced lines. Cumulative frequency as a function of MAF is shown for genic and TE indels, synonymous and nonsynonymous SNPs, and a neutral coalescent simulation (neutral). Inset: Estimates of mean Tajima’s D color coded as in the main panel. Bars indicate the 95% confidence intervals as obtained by delete-one jackknifing the ten lines. Sample sizes are given in table 2.
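To make the inset concrete, the following is a minimal sketch of Tajima's D with a delete-one jackknife over lines. It uses the standard Tajima (1989) constants and assumes variants are coded as 0/1 lists across lines. It computes a single pooled D for a set of sites rather than a mean over loci, and the normal-approximation interval (z = 1.96) is an assumption of this sketch, so it may differ from how the published bars were derived.

```python
import math


def tajimas_d(sites):
    """Tajima's D for a list of sites, each a 0/1 list with one call per line."""
    n = len(sites[0])
    if n < 4:
        raise ValueError("need at least four lines for a defined variance")
    segregating = [s for s in sites if 0 < sum(s) < n]
    S = len(segregating)
    if S == 0:
        raise ValueError("no segregating sites")
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Mean pairwise difference, summed over sites.
    pi = sum(2 * sum(s) * (n - sum(s)) / (n * (n - 1)) for s in segregating)
    return (pi - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))


def jackknife_ci(sites, z=1.96):
    """Delete-one jackknife over lines; returns (lower, upper) around the full-sample D.

    Assumes every delete-one subset still contains a segregating site.
    """
    n = len(sites[0])
    full = tajimas_d(sites)
    reps = [tajimas_d([s[:i] + s[i + 1:] for s in sites]) for i in range(n)]
    mean = sum(reps) / n
    se = math.sqrt((n - 1) / n * sum((r - mean) ** 2 for r in reps))
    return full - z * se, full + z * se
```

An excess of rare variants pushes D below zero, and an excess of intermediate-frequency variants pushes it above zero, which is why this statistic is a convenient summary of the shift between the cumulative curves.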
Figure. Distribution of indel polymorphisms in genes and TEs. The observed and expected number of nucleotide sites in segregating indels among (A) genes and TEs, and (B) different gene components. (C) Indel density along a normalized transcript, using data from all annotated genes overlapping with segregating indels. Each genic region (5′ UTR, gene body consisting of exons plus introns, and 3′ UTR) was divided into 100 equally sized bins, and for each bin the relative density among all polymorphic indels was recorded (y-axis). The distribution of bin densities is expected to approximately follow a binomial distribution. The red dashed lines indicate the upper and lower 95% confidence bounds for a binomial distribution with P = 0.01 and n given by the total number of inserted/deleted base pairs observed in that region.
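The confidence bounds in panel C can be sketched as follows. With 100 equally sized bins, each inserted or deleted base pair falls in a given bin with probability P = 0.01 under a uniform expectation. The sketch below uses the normal approximation to the binomial and reports the bounds as a relative density per bin. Both the approximation and the function name are assumptions here, since the caption does not state how the bounds were computed.

```python
import math


def binomial_density_bounds(n, p=0.01, z=1.96):
    """Approximate 95% bounds on the fraction of n base pairs landing in one bin.

    Uses the normal approximation to Binomial(n, p), which is reasonable
    when n * p is large. Returns (lower, upper) as proportions of n.
    """
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    lower = max(0.0, mean - z * sd) / n
    upper = min(float(n), mean + z * sd) / n
    return lower, upper


# Hypothetical region with one million inserted/deleted base pairs.
lower, upper = binomial_density_bounds(1_000_000)
```

Because the standard deviation of the proportion shrinks as n grows, regions with more inserted/deleted base pairs get narrower bounds around 0.01.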
Figure. Identification of indel loci putatively under positive selection. The plot includes Normalized Fay and Wu’s H and Tajima’s D estimates for each indel. The red box in the lower left corner indicates the area outside the lower 95% confidence interval of both metrics as assessed by coalescent simulation.
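The joint filter implied by the red box can be sketched as below. The sketch assumes per-locus Tajima's D and normalized H values are already computed, and that null values for each statistic are available from coalescent simulation. The lower cutoff is taken as a simple empirical quantile, with the 2.5% level as an assumed default because the caption does not give the exact tail used. All names are hypothetical.

```python
def lower_cutoff(simulated, q=0.025):
    """Empirical lower quantile of simulated null values (simple order statistic)."""
    ordered = sorted(simulated)
    return ordered[int(q * len(ordered))]


def flag_candidates(loci, sim_d, sim_h, q=0.025):
    """Return names of loci whose D and H both fall below the simulated lower cutoffs.

    loci maps a locus name to a (tajimas_d, normalized_h) pair.
    """
    d_cut = lower_cutoff(sim_d, q)
    h_cut = lower_cutoff(sim_h, q)
    return [name for name, (d, h) in loci.items() if d < d_cut and h < h_cut]


# Toy null distributions and three hypothetical loci.
null_values = [i / 100 for i in range(-200, 200)]
toy_loci = {"a": (-2.1, -2.5), "b": (-2.1, 0.0), "c": (0.5, -3.0)}
flagged = flag_candidates(toy_loci, null_values, null_values)
```

Requiring both statistics to be extreme is stricter than either test alone, which fits the small number of candidate insertions reported in the abstract.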