Owen A Thompson, L Basten Snoek, Harm Nijveen, Mark G Sterken, Rita J M Volkers, Rachel Brenchley, Arjen Van't Hof, Roel P J Bevers, Andrew R Cossins, Itai Yanai, Alex Hajnal, Tobias Schmid, Jaryn D Perkins, David Spencer, Leonid Kruglyak, Erik C Andersen, Donald G Moerman, LaDeana W Hillier, Jan E Kammenga, Robert H Waterston.
Abstract
The Hawaiian strain (CB4856) of Caenorhabditis elegans is one of the most divergent from the canonical laboratory strain N2 and has been widely used in developmental, population, and evolutionary studies. To enhance the utility of the strain, we have generated a draft sequence of the CB4856 genome, exploiting a variety of resources and strategies. When compared against the N2 reference, the CB4856 genome has 327,050 single nucleotide variants (SNVs) and 79,529 insertion-deletion events that result in a total of 3.3 Mb of N2 sequence missing from CB4856 and 1.4 Mb of sequence present in CB4856 but not present in N2. As previously reported, the density of SNVs varies along the chromosomes, with the arms of chromosomes showing greater average variation than the centers. In addition, we find 61 regions totaling 2.8 Mb, distributed across all six chromosomes, which have a greatly elevated SNV density, ranging from 2 to 16% SNVs. A survey of other wild isolates shows that the two alternative haplotypes for each region are widely distributed, suggesting they have been maintained by balancing selection over long evolutionary times. These divergent regions contain an abundance of genes from large rapidly evolving families encoding F-box, MATH, BATH, seven-transmembrane G-coupled receptors, and nuclear hormone receptors, suggesting that they provide selective advantages in natural environments. The draft sequence makes available a comprehensive catalog of sequence differences between the CB4856 and N2 strains that will facilitate the molecular dissection of their phenotypic differences. Our work also emphasizes the importance of going beyond simple alignment of reads to a reference genome when assessing differences between genomes.
Keywords: C. elegans; evolution; genome assembly; variation
Year: 2015 PMID: 25995208 PMCID: PMC4512556 DOI: 10.1534/genetics.115.175950
Source DB: PubMed Journal: Genetics ISSN: 0016-6731 Impact factor: 4.562
CB4856 sequence resources
| Data set | PI | Type | Platform | [S|P]E, length (bp) | Insert size | Clones/total bases in reads | Coverage expected (%) |
|---|---|---|---|---|---|---|---|
| Princeton University | Andersen | DNA | Illumina | PE 104, 104 | 321 bp | 34,711,778/7,220,049,824 | 69.52× (96.6) |
| University of Washington | Waterston | DNA | Illumina | PE 76, 76 | 179 bp | 21,252,827/3,230,429,704 | 31.10× (96.6) |
| Technion | Yanai | DNA | Illumina | PE 100, 100 | 221 bp | 79,406,930/15,881,386,000 | 80.08× (50.6) |
| University of Zurich | Hajnal | DNA | Illumina | PE 101, 101 | 484 bp | 825,754/166,799,884 | 1.41× (84.9) |
| University of Zurich | Hajnal | DNA | SOLiD | PE 50, 35 | 124 bp | 15,760,405/2,679,268,850 | 7.20× (26.9) |
| University of British Columbia | Moerman | DNA | Sanger | PE ∼770 bp | ∼33 kbp | 15,360/20,520,434 | 0.20× (97.2) |
| Washington University | Waterston | DNA | Sanger | SE ∼764 bp | NA | 11,541/8,843,526 | 0.07× (81.7) |
| Wageningen University/University of Liverpool | Kammenga/Cossins | DNA (ILs/RILs) | SOLiD | SE 50 | NA | 2,709,932,329/135,496,616,450 | 766.85× (56.8) |
| Total | | | | | | | 956.43× |
SE = single end
PE = paired end
Figure 1. Strategy for constructing a Hawaiian reference sequence. (A) Alignment of 100-bp paired-end reads from the CB4856 genome to the N2 genome. Sites that differed by base substitution, insertion, or deletion were recognized, and the N2 genome was altered at those sites. For insertions larger than a read and at the edge of divergent regions, the consensus sequences from the unmatched segments of the reads were added to the reference. Then the reads were aligned to the modified reference, and the cycle was repeated 20 times, by which time few changes were being made. (B) After the 20 cycles of alterations, areas with incomplete coverage still persisted. To correct these areas, individual reads were assembled de novo with the JR-Assembler and aligned against the modified reference. Typically, these JR contigs showed good agreement where read coverage was good, and thus corrections had been made, but poor alignment where the reference sequence did not have coverage and had not been altered from the N2 reference. The JR contigs were also aligned against sequence reads from RILs and ILs. Only RILs and ILs containing a segment of the Hawaiian genome that spanned the JR contig yielded good coverage across these divergent regions, thereby locating the JR contigs on the genome. Where the JR contigs had regions of good match against the reference and their location was confirmed by alignment of reads from RILs and ILs, they were spliced cleanly into the reference. Remaining large deletions were also removed.
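The iterative cycle in panel A can be illustrated with a toy sketch. It is not the authors' pipeline: it handles substitutions only, replaces a real read aligner with a brute-force ungapped matcher, and omits indels and the de novo (JR-Assembler) step. The names `best_hit`, `correct_once`, and `iterate_reference`, the read length, and the mismatch threshold are all hypothetical choices made for illustration. The point it shows is that reads from the interior of a dense cluster of variants fail to align at first, and only become alignable after the edges of the cluster have been corrected in earlier rounds.

```python
import random
from collections import Counter

READ_LEN = 20
MAX_MISMATCHES = 2  # toy alignment threshold (assumption, not from the paper)
MAX_ROUNDS = 20     # the caption reports 20 cycles


def best_hit(read, reference):
    """Ungapped placement with the fewest mismatches, or None if too divergent."""
    best_pos, best_mm = None, MAX_MISMATCHES + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mm = sum(a != b for a, b in zip(read, window))
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos


def correct_once(reference, reads):
    """Align reads, then set each covered base to the majority read base."""
    votes = [Counter() for _ in reference]
    for read in reads:
        pos = best_hit(read, reference)
        if pos is None:  # read is too divergent to place in this round
            continue
        for offset, base in enumerate(read):
            votes[pos + offset][base] += 1
    return "".join(
        v.most_common(1)[0][0] if v else ref_base
        for v, ref_base in zip(votes, reference)
    )


def iterate_reference(reference, reads):
    """Repeat correction until nothing changes; return sequence and rounds used."""
    rounds = 0
    for _ in range(MAX_ROUNDS):
        updated = correct_once(reference, reads)
        if updated == reference:
            break
        reference, rounds = updated, rounds + 1
    return reference, rounds


# Toy data: a random "N2" sequence and a sample with a dense cluster of
# substitutions (every second base between positions 90 and 108).
rng = random.Random(0)
n2 = "".join(rng.choice("ACGT") for _ in range(200))
shift = str.maketrans("ACGT", "CGTA")
sample = "".join(
    b.translate(shift) if 90 <= i <= 108 and i % 2 == 0 else b
    for i, b in enumerate(n2)
)
reads = [sample[i:i + READ_LEN] for i in range(len(sample) - READ_LEN + 1)]

corrected, rounds = iterate_reference(n2, reads)
print("matches sample:", corrected == sample, "| rounds with changes:", rounds)
```

In this toy setting a single pass cannot recover the cluster, because reads that span more than two variant sites are rejected; each round corrects the sites reachable from the cluster boundary and thereby lets the next round reach further in.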
Figure 2. Read coverage and SNV density in the N2 reference genome and the iteratively corrected CB4856 genome. (A) A typical region for most of the genome is shown, with good coverage (top track) and infrequent SNVs and indels (second track). Genes are shown below. (B) A region of the N2 reference showing poor coverage and a high SNV/indel density with the Hawaiian reads. (C) After 20 iterations of reference-guided corrections, the same region as in B now has improved coverage by the CB4856 reads. In addition to coverage, the tracks show the SNV calls (MMP SNVs) reported in (Thompson ), the SNV calls based on the new reference (SNVs), indels based on the new reference (Indels), and regions that failed to align with sequence present in the N2 reference (Unaligned). Gene models for each region are shown below. (D) The boundary of a divergent region (left) with a less divergent region of the genome is shown. The density of SNVs and indels changes abruptly. Tracks are as in C.
Comparison of reference sequence lengths
| Chromosome | N2 (bp) | CB4856 (HA) (bp) | Difference (bp) | % |
|---|---|---|---|---|
| I | 15,072,423 | 14,890,789 | 181,634 | 1.21 |
| II | 15,279,345 | 14,885,952 | 393,393 | 2.57 |
| III | 13,783,700 | 13,596,826 | 186,874 | 1.36 |
| IV | 17,493,793 | 17,183,857 | 309,936 | 1.77 |
| V | 20,924,149 | 20,182,852 | 741,297 | 3.54 |
| X | 17,718,866 | 17,537,347 | 181,519 | 1.02 |
| Total | 100,272,276 | 98,277,623 | 1,994,653 | 1.99 |
% difference in size expressed as a percentage of the length of the chromosome in the N2 genome (the listed values match difference/N2 length, e.g., 181,634/15,072,423 = 1.21% for chromosome I).
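As a quick arithmetic check, the difference and percentage columns of the table above can be recomputed from the two length columns. The sketch below is illustrative only; `LENGTHS` and `length_difference` are hypothetical names, and the choice of the N2 length as the denominator is the one that reproduces the listed percentages.

```python
# Chromosome lengths in bp, copied from the table above: (N2, CB4856).
LENGTHS = {
    "I": (15_072_423, 14_890_789),
    "II": (15_279_345, 14_885_952),
    "III": (13_783_700, 13_596_826),
    "IV": (17_493_793, 17_183_857),
    "V": (20_924_149, 20_182_852),
    "X": (17_718_866, 17_537_347),
}


def length_difference(n2_len, cb4856_len):
    """Return (difference in bp, difference as % of the N2 length)."""
    diff = n2_len - cb4856_len
    return diff, round(100 * diff / n2_len, 2)


for chrom, (n2_len, cb_len) in LENGTHS.items():
    diff, pct = length_difference(n2_len, cb_len)
    print(f"{chrom:>3}: {diff:>9,} bp shorter in CB4856 ({pct}%)")

total_n2 = sum(n2 for n2, _ in LENGTHS.values())
total_cb = sum(cb for _, cb in LENGTHS.values())
print("Total:", length_difference(total_n2, total_cb))
```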
Number of deletion events and base counts in deletions in N2 and CB4856
| Chromosome | Deletions in N2: events | Deletions in N2: bases | Deletions in CB4856: events | Deletions in CB4856: bases |
|---|---|---|---|---|
| I | 5,693 | 94,543 | 6,158 | 233,930 |
| II | 7,370 | 230,813 | 7,692 | 478,884 |
| III | 5,464 | 99,324 | 6,008 | 275,997 |
| IV | 5,700 | 116,249 | 5,841 | 265,504 |
| V | 10,902 | 343,640 | 11,740 | 854,733 |
| X | 3,453 | 37,442 | 3,507 | 156,819 |
| Total | 38,582 | 922,011 | 40,946 | 2,265,867 |
Figure 3. Overlap with previous SNV calls. A Venn diagram shows the overlap of the previous SNV calls with those obtained with the CB4856 reference.
SNV distribution
| Chromosome | Total | Centers | SNV/kb centers | Arms | SNV/kb arms | % on arms |
|---|---|---|---|---|---|---|
| I | 36,192 | 9,424 | 1.15 | 26,768 | 3.89 | 73.96 |
| II | 65,592 | 8,843 | 1.11 | 56,749 | 7.80 | 86.52 |
| III | 38,938 | 5,429 | 0.78 | 33,509 | 4.94 | 86.06 |
| IV | 36,198 | 9,016 | 1.00 | 27,182 | 3.20 | 75.09 |
| V | 129,096 | 18,179 | 2.02 | 110,917 | 9.30 | 85.92 |
| X | 21,034 | 8,076 | 1.08 | 12,958 | 1.27 | 61.61 |
| Total | 327,050 | 58,967 | 1.21 | 268,083 | 5.20 | 81.97 |
SNVs in divergent regions
| Chromosome | Divergent regions: SNVs | Divergent regions: bases | Divergent regions: SNVs/kb | Other regions: SNVs | Other regions: bases | Other regions: SNVs/kb |
|---|---|---|---|---|---|---|
| I | 3,940 | 87,170 | 45.20 | 32,252 | 14,985,253 | 2.15 |
| II | 27,649 | 709,991 | 38.94 | 37,943 | 14,569,354 | 2.60 |
| III | 13,962 | 344,847 | 40.49 | 24,976 | 13,438,853 | 1.86 |
| IV | 5,657 | 206,442 | 27.40 | 30,541 | 17,287,351 | 1.77 |
| V | 77,704 | 1,444,451 | 53.79 | 51,392 | 19,479,698 | 2.64 |
| X | 900 | 38,261 | 23.52 | 20,134 | 17,680,605 | 1.14 |
| Total | 129,812 | 2,831,162 | 45.85 | 197,238 | 97,441,114 | 2.02 |
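The SNVs/kb columns follow directly from the SNV and base counts, and the same numbers give a rough sense of how concentrated the variation is. The sketch below recomputes the densities and the genome-wide fold difference between divergent and other regions; `REGIONS` and `snvs_per_kb` are hypothetical names used only for this illustration.

```python
# Per chromosome, copied from the table above:
# (divergent SNVs, divergent bases, other SNVs, other bases).
REGIONS = {
    "I": (3_940, 87_170, 32_252, 14_985_253),
    "II": (27_649, 709_991, 37_943, 14_569_354),
    "III": (13_962, 344_847, 24_976, 13_438_853),
    "IV": (5_657, 206_442, 30_541, 17_287_351),
    "V": (77_704, 1_444_451, 51_392, 19_479_698),
    "X": (900, 38_261, 20_134, 17_680_605),
}


def snvs_per_kb(snvs, bases):
    """SNV density expressed per 1,000 bases."""
    return 1000 * snvs / bases


for chrom, (d_snvs, d_bases, o_snvs, o_bases) in REGIONS.items():
    print(f"{chrom:>3}: divergent {snvs_per_kb(d_snvs, d_bases):6.2f}/kb, "
          f"other {snvs_per_kb(o_snvs, o_bases):5.2f}/kb")

# Genome-wide totals, summed column by column.
div_snvs, div_bases, other_snvs, other_bases = (
    sum(col) for col in zip(*REGIONS.values())
)
fold = snvs_per_kb(div_snvs, div_bases) / snvs_per_kb(other_snvs, other_bases)
snv_share = div_snvs / (div_snvs + other_snvs)
base_share = div_bases / (div_bases + other_bases)
print(f"fold difference in density: {fold:.1f}")
print(f"divergent regions: {base_share:.1%} of bases, {snv_share:.1%} of SNVs")
```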
Figure 4. Density of variant sites in the first three megabases of (A) chromosome I and (B) chromosome II. Blue boxes indicate the regions identified as highly divergent.
Figure 5. Percent divergence by length of divergent region per chromosome. The mutational events (SNVs and indels, counting each indel as a single event) per aligned bases (percentage divergence) are plotted for each region against the length of the region in N2. The chromosomal assignment for each region is indicated in the inset.
Genes in diverged regions and with LOF mutations
| Gene class | Genome: total | Genome: disabling | Genome: expected | Divergent regions: total | Divergent regions: expected | Divergent regions: disabling | Divergent regions: expected |
|---|---|---|---|---|---|---|---|
| Serpentine receptor | 1346 | 204 | 123.7 2.20 | 118 | 49.7 7.38 | 72 | 80.0 8.70 |
| F-box | 353 | 129 | 32.5 1.56 | 71 | 15.3 3.75 | 60 | 46.3 1.51 |
| C-lectin | 254 | 53 | 23.4 1.05 | 31 | 11.0 1.81 | 19 | 20.2 7.49 |
| Math | 48 | 39 | 4.4 1.92 | 37 | 2.1 2.00 | 34 | 24.1 1.47 |
| Bath | 37 | 16 | 3.4 4.83 | 15 | 1.6 1.11 | 14 | 9.7 1.42 |
| Nuclear hormone receptor | 278 | 26 | 25.6 4.90 | 27 | 11.8 7.38 | 10 | 17.6 9.99 |
The expected number of disabled genes in the total genome based on 20,504 genes and 1885 disabled overall.
The expected number of genes in the divergent regions based on 883 genes of the 20,504 genes in the genome and 576 of the 883 genes disabled.
Hypergeometric test.
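The genome-wide expected counts in the table can be reproduced from the footnote values (20,504 genes, 1885 disabled), and an enrichment test of the kind named in the footnote can be sketched with the standard library. The source says only "Hypergeometric test", so the one-sided upper-tail form used here is an assumption, and `expected_hits` and `hypergeom_upper_tail` are hypothetical helper names.

```python
from fractions import Fraction
from math import comb

GENOME_GENES = 20_504    # total genes, from the table footnote
GENOME_DISABLED = 1_885  # disabled genes overall, from the table footnote


def expected_hits(population, successes, draws):
    """Mean of a hypergeometric distribution: draws * successes / population."""
    return draws * successes / population


def hypergeom_upper_tail(population, successes, draws, observed):
    """P(X >= observed) for X ~ Hypergeometric(population, successes, draws).

    Assumption: a one-sided enrichment test; exact integer arithmetic is used
    so that very small tail probabilities are not lost to rounding.
    """
    upper = min(draws, successes)
    numerator = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(observed, upper + 1)
    )
    return float(Fraction(numerator, comb(population, draws)))


# (genes in class, disabled genes in class), copied from the table above.
GENE_CLASSES = {
    "Serpentine receptor": (1346, 204),
    "F-box": (353, 129),
    "Nuclear hormone receptor": (278, 26),
}

for name, (class_size, disabled) in GENE_CLASSES.items():
    exp = expected_hits(GENOME_GENES, GENOME_DISABLED, class_size)
    p = hypergeom_upper_tail(GENOME_GENES, GENOME_DISABLED, class_size, disabled)
    print(f"{name}: observed {disabled}, expected {exp:.1f}, P(X >= obs) = {p:.3g}")
```

Under these assumptions, a class such as the serpentine receptors (204 disabled against about 124 expected) falls far in the upper tail, whereas the nuclear hormone receptors (26 against about 26 expected) do not.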
Figure 6. A heatmap representation of the allelic content of the 39 strains (rows) across 44 of the 61 divergent regions (columns). Regions matching N2 (yellow) and CB4856 (red) are indicated along with intermediate regions (orange) and regions different from either (green). For reference, an N2-derived strain, VC2010, and CB4856 are shown in the bottom two rows. Strains that may represent the same isotype are highlighted in blue and green.