Anna Tigano, Arne Jacobs, Aryn P Wilder, Ankita Nand, Ye Zhan, Job Dekker, Nina Overgaard Therkildsen.
Abstract
The levels and distribution of standing genetic variation in a genome can provide a wealth of insights about the adaptive potential, demographic history, and genome structure of a population or species. As structural variants are increasingly associated with traits important for adaptation and speciation, investigating both sequence and structural variation is essential for fully tapping this potential. Using a combination of shotgun sequencing, 10x Genomics linked reads, and proximity-ligation data (Chicago and Hi-C), we produced and annotated a chromosome-level genome assembly for the Atlantic silverside (Menidia menidia), an established ecological model for studying the phenotypic effects of natural and artificial selection, and examined patterns of genomic variation across two individuals sampled from different populations with divergent local adaptations. Levels of diversity varied substantially across each chromosome, consistently being highly elevated near the ends (presumably near telomeric regions) and dipping to near zero around putative centromeres. Overall, our estimate of the genome-wide average heterozygosity in the Atlantic silverside is among the highest reported for a fish, or any vertebrate (1.32-1.76% depending on inference method and sample). We also found extreme levels of structural variation, affecting ∼23% of the total genome sequence, including multiple large inversions (> 1 Mb and up to 12.6 Mb) associated with previously identified haploblocks showing strong differentiation between locally adapted populations. These extreme levels of standing genetic variation are likely associated with large effective population sizes and may help explain the remarkable adaptive divergence among populations of the Atlantic silverside.
Keywords: Hi-C; fish; genome assembly; heterozygosity; inversions; nucleotide diversity
Year: 2021 PMID: 33964136 PMCID: PMC8214408 DOI: 10.1093/gbe/evab098
Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416
Summary statistics for each of the intermediate and final assemblies of the reference genome from Georgia
| | 10x | Dovetail Chicago | Dovetail Hi-C | Final Assembly | Chromosome Assembly |
|---|---|---|---|---|---|
| Total length | 645.45 Mb | 647.32 Mb | 647.39 Mb | 620.04 Mb | 465.69 Mb |
| Longest scaffold | 12,248,921 bp | 12,871,938 bp | 26,678,928 bp | 26,678,928 bp | 26,678,928 bp |
| Number of scaffolds | 99,541 | 80,990 | 80,312 | 42,220 | 27 |
| Number of scaffolds > 1 kb | 61,451 | 42,898 | 42,220 | 42,220 | 27 |
| Contig N50 | 39.55 kb | 39.51 kb | 39.51 kb | 105.76 kb | 202.88 kb |
| Scaffold L50/N50 | 83/1.328 Mb | 42/2.936 Mb | 16/18.159 Mb | 15/18.199 Mb | 11/19.68 Mb |
| % gaps | 2.69% | 2.97% | 2.98% | 3.08% | 3.00% |
| BUSCOs | C: 88.1%; F: 5.3%; M: 6.6% | C: 89.5%; F: 4.6%; M: 5.9% | C: 89.6%; F: 4.8%; M: 5.6% | C: 89.6%; F: 4.5%; M: 5.9% | C: 88.3%; F: 2.7%; M: 9.0% |
Note.—“10x” refers to the draft assembly based only on 10x linked reads including scaffolds > 500 bp, “Dovetail Chicago” refers to the 10x assembly improved with Dovetail Chicago library data, and “Dovetail Hi-C” refers to the 10x assembly improved with both Dovetail Chicago and Hi-C data. The “Final assembly” represents the Dovetail Hi-C assembly but including only scaffolds > 1 kb, and the “Chromosome assembly” is the subset of scaffolds > 1 Mb from the “Final assembly.”
C, complete; F, fragmented; M, missing.
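As a reminder of how the N50 and L50 values in the table are defined, the following minimal sketch computes both from a list of sequence lengths. The function name and the toy lengths are hypothetical and are not taken from the study; this illustrates the standard definition, not the authors' pipeline.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of scaffold or contig lengths in bp.

    N50 is the length of the shortest sequence in the smallest set of
    longest sequences that together cover at least half of the total
    assembly length; L50 is the number of sequences in that set.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count


# Toy lengths (hypothetical): total 300, so half of the total is 150.
toy_lengths = [100, 80, 60, 40, 20]
print(n50_l50(toy_lengths))  # (80, 2)
```

Read this way, the "15/18.199 Mb" entry for the final assembly means that the 15 longest scaffolds cover at least half of the assembly and the shortest of them is 18.199 Mb long.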
Circos plots showing synteny between the Atlantic silverside and medaka across all chromosomes (center) and, on the sides (left and right), in the four chromosomes with large haploblocks. Chromosomes are color-coded consistently among plots; in the smaller plots, the colored portion (dark gray for chromosome 24) refers to the medaka sequences on the right, whereas the light gray portion refers to the Atlantic silverside sequences on the left. Alignments shorter than 500 bp were excluded. Supplementary figure S1, Supplementary Material online, shows plots for the remaining chromosomes. Note that the consistently shorter length of the Atlantic silverside chromosomes agrees with a lower overall estimate of genome size (554 Mb based on k-mer analysis compared with the 700 Mb of the assembled medaka genome). The three and two scaffolds making up chromosomes 1 and 24, respectively, are represented separately here and denoted by decimal suffixes (e.g., 1.1 and 24.1).
Examples of heterozygosity levels in single fish genomes, estimated either with GenomeScope from raw sequencing data or through direct calling of heterozygous sites
| Common Name | Scientific Name | Heterozygosity (%) | Method | Reference |
|---|---|---|---|---|
| | | | GenomeScope | |
| | | | GenomeScope | |
| European sardine | | 1.60–1.75 | GenomeScope | |
| American eel | | 1.5–1.6 | GenomeScope | |
| European eel | | 1.48–1.59 | GenomeScope | |
| Pearlscale pygmy angelfish | | 1.36 | GenomeScope | |
| Marine medaka | | 1.19 | GenomeScope | |
| Large yellow croaker | | 1.06 | GenomeScope | |
| Javafish medaka | | 0.96 | GenomeScope | |
| Greater amberjack | | 0.65 | GenomeScope | |
| Clownfish | | 0.60 | GenomeScope | |
| Hilsa shad | | 0.58–0.66 | GenomeScope | |
| Whitefish | | 0.44 | GenomeScope | |
| Corkwing wrasse | | 0.40 | GenomeScope | |
| Herring | | 0.32 | Variant calling | |
| Golden pompano | | 0.31 | GenomeScope | |
| Coelacanth | | 0.28 | Variant calling | Amemiya et al. (2013) |
| NA | | 0.26 | GenomeScope | |
| Eurasian perch | | 0.24–0.28 | GenomeScope | |
| Atlantic cod | | 0.20 | Variant calling | |
| Big-eye mandarin fish | | 0.16 | GenomeScope | |
| Threespine stickleback | | 0.14 | Variant calling | |
| Pikeperch | | 0.14 | GenomeScope | |
| African arowana | | 0.13 | GenomeScope | |
| Orange clownfish | | 0.12 | GenomeScope | |
| Murray cod | | 0.10 | GenomeScope | |
| Toothed Cuban cusk-eel | | 0.10 | GenomeScope | |
Note.— The reported estimates of heterozygosity are expressed in percentages, i.e., the number of heterozygous sites per 100 bp, and can be converted to mutations/bp, in which π estimates are generally expressed, by dividing by 100. In bold are the estimates for the Atlantic silverside from this study. 'GA' stands for Georgia and 'CT' stands for Connecticut, the two locations of origin of the individuals analyzed.
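The unit conversion described in the note can be written out as a short sketch. The function name is hypothetical; the two example values are the bounds of the 1.32-1.76% range reported in the abstract for the Atlantic silverside.

```python
def percent_to_per_bp(het_percent):
    """Convert heterozygosity in % (heterozygous sites per 100 bp)
    to heterozygous sites per bp, the scale on which pi is usually given."""
    return het_percent / 100.0


# Bounds of the range reported in the abstract for the Atlantic silverside.
for pct in (1.32, 1.76):
    print(f"{pct:.2f}% -> {percent_to_per_bp(pct):.4f} per bp")
```

On this scale, a heterozygosity of 1.32% corresponds to roughly 13 heterozygous sites per 1,000 bp.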
The genomic landscape of structural and sequence variation in Connecticut and Georgia. (a) Large inversions (> 1 Mb) as identified from shotgun and Hi-C data from two different individuals from Connecticut mapped to the reference genome from Georgia. (b) Manhattan plots showing the genomic landscape of variation in heterozygosity in 50 kb moving windows across single genomes from Connecticut and Georgia, where the alternating colors are used to distinguish adjacent chromosomes. The three and two scaffolds making up chromosomes 1 and 24, respectively, are represented separately here and denoted by decimal suffixes. (c) Enlarged Manhattan plots for each of the four chromosomes with large haploblocks and inversions. Dashed vertical lines represent the breakpoints of the large inversions as identified by Delly2 with the shotgun data.
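To make the windowed heterozygosity in panel (b) concrete, here is a minimal sketch of one way to compute it. It assumes 0-based positions of heterozygous sites on a single chromosome, half-open windows, and that every site in a window is callable; the function name, the `step` parameter, and the toy positions are hypothetical, and the caption does not say whether the authors' windows overlapped.

```python
from bisect import bisect_left


def windowed_heterozygosity(het_positions, chrom_length,
                            window=50_000, step=50_000):
    """Return a list of (window_start, heterozygosity_percent) tuples.

    Assumes every site in a window is callable, which is a simplification.
    With step == window the windows do not overlap; a smaller step gives
    overlapping (sliding) windows.
    """
    positions = sorted(het_positions)
    results = []
    for start in range(0, chrom_length, step):
        end = min(start + window, chrom_length)
        # Count heterozygous sites falling in the half-open interval [start, end).
        n_het = bisect_left(positions, end) - bisect_left(positions, start)
        results.append((start, 100.0 * n_het / (end - start)))
    return results


# Toy data (hypothetical): four heterozygous sites on a 100 kb chromosome.
toy = windowed_heterozygosity([10, 20, 30, 60_000], chrom_length=100_000)
for start, het in toy:
    print(start, round(het, 4))
```

A real analysis would normally divide by the number of callable sites in each window rather than by the window length.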
Summary of intraspecific SVs identified in the Atlantic silverside by mapping sequence data from an individual from Connecticut to the reference genome from Georgia, and their features
| SV Type | Number of Variants | Size Range (bp) | Sequence Affected (kb) | % Genome Affected |
|---|---|---|---|---|
| Insertions | 299 | 42–83 | 18 | <0.01 |
| Deletions | 3,905 | 38–9,740,501 | 71,754 | 15 |
| Duplications | 34 | 110–150,263 | 479 | 0.1 |
| Inversions | 662 | 203–12,585,625 | 109,201 | 23 |
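The "% Genome Affected" column can be approximately reproduced from the "Sequence Affected" column. The sketch below assumes that the denominator is the 465.69 Mb chromosome assembly; that choice is an assumption made here for illustration and is not stated in the table, and the variable and function names are hypothetical.

```python
# Assumed denominator: the 465.69 Mb chromosome assembly, expressed in kb.
CHROMOSOME_ASSEMBLY_KB = 465_690

# "Sequence Affected (kb)" values copied from the table.
sequence_affected_kb = {
    "Insertions": 18,
    "Deletions": 71_754,
    "Duplications": 479,
    "Inversions": 109_201,
}


def percent_affected(affected_kb, genome_kb=CHROMOSOME_ASSEMBLY_KB):
    """Share of the genome (in %) covered by a given amount of sequence."""
    return 100.0 * affected_kb / genome_kb


for sv_type, kb in sequence_affected_kb.items():
    print(f"{sv_type}: {percent_affected(kb):.2f}%")
```

Under this assumption the values round to the tabulated figures: about 23% for inversions, 15% for deletions, 0.1% for duplications, and below 0.01% for insertions.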
Hi-C contact maps of data mapped to the chromosome assembly from Georgia. Maps on the left show Hi-C data obtained from the same Georgia individual used to generate the reference assembly (mapped to self); maps on the right show data obtained from a Connecticut individual. Maps in the top panel show data for all the chromosomes binned in 100 kb sections. The three lower panels show data binned in 50 kb sections from each of the three chromosomes showing both large haploblocks in Wilder et al. (2020) and evidence for the presence of inversions from Hi-C data. Dark shades on the diagonal are indicative of high structural similarity between the reference and the Hi-C library analyzed. Dashed lines represent putative inversion breakpoints. The “butterfly pattern” of contacts observed at the point where the dashed lines meet is diagnostic of inversions.
The genomic landscape of sequence divergence between Connecticut and Georgia. (a) Manhattan plot showing the genomic landscape of variation in divergence, where each point represents one segment of the Connecticut genome aligned to the Georgia reference genome: its start position is on the x axis and its estimated sequence divergence is on the y axis. The alternating colors are used to distinguish adjacent chromosomes. The three and two scaffolds making up chromosomes 1 and 24, respectively, are represented separately here and denoted by decimal suffixes. (b) Enlarged Manhattan plots for each of the four chromosomes with large haploblocks and inversions. Dashed vertical lines represent the breakpoints of the large inversions as identified by Delly2 with the shotgun data, and the solid horizontal line represents the genome-wide weighted average of sequence divergence. The small violin plots summarize and compare the distribution of sequence divergence estimates in the genomic areas not affected by large inversions (“noninv”) and areas affected by inversions in each of the four chromosomes with large haploblocks (in all comparisons sequence divergence was significantly higher in the inversion(s) in a given chromosome than in the “noninv” areas; P < 0.005).
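The weighted genome-wide average shown as the horizontal line can be illustrated with a short sketch. It assumes that each aligned segment is weighted by its aligned length, which is a common choice but is not stated in the caption; the function name and the toy segments are hypothetical.

```python
def weighted_mean_divergence(segments):
    """Length-weighted mean divergence.

    `segments` is an iterable of (aligned_length_bp, divergence) pairs.
    Weighting by aligned length is an assumption made for this sketch.
    """
    segments = list(segments)
    total_length = sum(length for length, _ in segments)
    if total_length == 0:
        raise ValueError("total aligned length must be positive")
    return sum(length * div for length, div in segments) / total_length


# Toy segments (hypothetical): the longer segment pulls the average
# toward its own divergence value.
toy_segments = [(1_000, 0.01), (3_000, 0.02)]
print(round(weighted_mean_divergence(toy_segments), 4))  # 0.0175
```

An unweighted mean of the same two values would be 0.015, which shows why the weighting matters when segment lengths vary widely.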