Victoria L Sork, Sorel T Fitz-Gibbon, Daniela Puiu, Marc Crepeau, Paul F Gugger, Rachel Sherman, Kristian Stevens, Charles H Langley, Matteo Pellegrini, Steven L Salzberg.
Abstract
Oak represents a valuable natural resource across Northern Hemisphere ecosystems, attracting a large research community studying its genetics, ecology, conservation, and management. Here we introduce a draft genome assembly of valley oak (Quercus lobata) using Illumina sequencing of adult leaf tissue of a tree found in an accessible, well-studied, natural southern California population. Our assembly includes a nuclear genome and a complete chloroplast genome, along with annotation of encoded genes. The assembly contains 94,394 scaffolds, totaling 1.17 Gb, with 18,512 scaffolds of length 2 kb or longer, with a total length of 1.15 Gb, and an N50 scaffold size of 278,077 bp. The k-mer histograms indicate a diploid genome size of ∼720-730 Mb, which is smaller than the total length due to high heterozygosity, estimated at 1.25%. A comparison with a recently published European oak (Q. robur) nuclear sequence indicates 93% similarity. The Q. lobata chloroplast genome has 99% identity with another North American oak, Q. rubra. Preliminary annotation yielded an estimate of 61,773 predicted protein-coding genes, of which 71% had similarity to known protein domains. We searched 956 Benchmarking Universal Single-Copy Orthologs, and found 863 complete orthologs, of which 450 were present in > 1 copy. We also examined an earlier version (v0.5) where duplicate haplotypes were removed to discover variants. These additional sources indicate that the predicted gene count in Version 1.0 is overestimated by 37-52%. Nonetheless, this first draft valley oak genome assembly represents a high-quality, well-annotated genome that provides a tool for forest restoration and management practices.
Keywords: GenPred; Genomic Selection; Quercus; Shared Data Resources; adaptation; annotation; chloroplast; nuclear genome assembly
Year: 2016 PMID: 27621377 PMCID: PMC5100847 DOI: 10.1534/g3.116.030411
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
Figure 1. Information on sequenced Q. lobata adult #786. (A) Map of California with species distribution indicated in blue, and location of sequenced tree shown with shaded triangle. (B) Local map of sequenced tree #786 within the University of California Santa Barbara Sedgwick Reserve in the Santa Ynez Valley, Santa Barbara Co., CA. (C) Photo of the sequenced tree #786. (Photo by A. Lentz.)
Summary statistics for assembly of Quercus lobata
| | Number | Total size (bp) | N50 size (bp) | Mean size (bp) |
|---|---|---|---|---|
| All scaffolds | 94,394 | 1,182,727,890 | 278,077 | 12,529 |
| Scaffolds ≥ 2000 bp | 18,512 | 1,153,710,009 | 278,077 | 62,322 |
N50 size defined as the value N such that at least 50% of the genome is covered by scaffolds of size N or larger. We used 730 Mb as the genome size for the N50 calculation.
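The definition above can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: the function name `n50` and the toy scaffold lengths are hypothetical, and the optional `genome_size` argument mirrors the use of a fixed genome size (730 Mb in the table) as the denominator instead of the assembly total.

```python
def n50(lengths, genome_size=None):
    """Return the largest N such that scaffolds of length >= N cover
    at least half of genome_size (or of the assembly total if None)."""
    total = genome_size if genome_size is not None else sum(lengths)
    half = total / 2
    covered = 0
    # Walk from the longest scaffold down, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        covered += length
        if covered >= half:
            return length
    return None  # the assembly covers less than half of genome_size


# Hypothetical toy scaffold lengths, not data from the paper.
toy = [100, 80, 60, 40, 20]
print(n50(toy))                    # denominator = assembly total (300)
print(n50(toy, genome_size=200))   # denominator = assumed genome size
```

With a fixed genome size as the denominator, discarding short scaffolds cannot change the result as long as the threshold is reached among the longer ones, which is consistent with both rows of the table reporting the same N50.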
Figure 2. (A) Histograms of k-mer frequencies in the raw read data for k = 25 (blue) and k = 31 (green). The x-axis shows the number of times a k-mer occurred; e.g., the peaks near x = 50 indicate the number of k-mers that occurred 50 times in the data. (B) Histogram of contig coverage in the assembly, based on mapping all reads back to the assembled contigs. The left peak shows the number of bases in contigs with 55–60× coverage, which correspond to regions where the assembler created two distinct contigs for divergent putative haplotypes. The right peak, at ∼110–120× coverage, contains contigs from regions where the genome is less variable, allowing the assembler to construct a single contig for those regions.
Properties of the Q. lobata k-mer distributions for k = 25 and k = 31
| Word Size | k = 25 | k = 31 |
|---|---|---|
| Total | 77,397,680,210 | 75,887,842,801 |
| Error | 1,753,595,327 | 2,535,065,256 |
| Haploid coverage depth | 51 | 49 |
| Diploid coverage depth | 106 | 101 |
| Diploid genome size | 720 Mb | 730 Mb |
Diploid genome size was estimated by dividing the number of k-mers under the haploid peak by haploid coverage depth, dividing all other k-mers counted by the diploid coverage depth, and summing these counts.
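The estimation rule described above can be written as a short Python sketch. Several things here are assumptions rather than details from the source: the histogram format (multiplicity mapped to number of distinct k-mers), the error cutoff, the multiplicity range taken as "under the haploid peak", and the toy numbers. The function name `estimate_genome_size` is hypothetical.

```python
def estimate_genome_size(hist, error_cutoff, haploid_range,
                         haploid_depth, diploid_depth):
    """Estimate genome size from a k-mer histogram.

    hist maps multiplicity -> number of distinct k-mers seen that often.
    k-mers below error_cutoff are treated as sequencing errors (assumption).
    haploid_range is the inclusive multiplicity window assumed to hold
    the haploid peak.
    """
    low, high = haploid_range
    haploid_kmers = 0
    other_kmers = 0
    for multiplicity, n_distinct in hist.items():
        if multiplicity < error_cutoff:
            continue  # skip presumed error k-mers
        occurrences = multiplicity * n_distinct  # total k-mer instances
        if low <= multiplicity <= high:
            haploid_kmers += occurrences
        else:
            other_kmers += occurrences
    # Divide each class by its own coverage depth, then sum.
    return haploid_kmers / haploid_depth + other_kmers / diploid_depth


# Hypothetical toy histogram, not data from the paper.
toy_hist = {1: 1000, 50: 200, 100: 300}
size = estimate_genome_size(toy_hist, error_cutoff=5,
                            haploid_range=(25, 75),
                            haploid_depth=50, diploid_depth=100)
print(size)
```

In the toy case the haploid window contributes 10,000 k-mer occurrences at depth 50 and the remainder contributes 30,000 at depth 100, so the two terms are 200 and 300.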
Summary of BUSCO analysis indicating the number of BUSCO plant single copy orthologs detected in each of four genome assemblies: the Q. lobata genome reported here (v1.0), a collapsed version of the Q. lobata genome (v0.5), the European oak Q. robur (Plomion ), and the black cottonwood tree Populus trichocarpa (Tuskan )
| | Q. lobata v1.0 | Q. lobata v0.5 | Q. robur | P. trichocarpa |
|---|---|---|---|---|
| Complete | 863 (90%) | 751 (79%) | 885 (93%) | 931 (97%) |
| Duplicated (% of complete) | 450 (52%) | 279 (37%) | 437 (49%) | 341 (37%) |
| Fragmented | 35 (4%) | 96 (10%) | 29 (3%) | 9 (1%) |
| Missing | 58 (6%) | 109 (11%) | 42 (4%) | 16 (2%) |
| Total BUSCO groups | 956 | 956 | 956 | 956 |
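The percentages in this table follow from the raw counts, as the sketch below shows. It assumes, consistent with the row label, that the duplicated percentage is taken relative to complete orthologs and the others relative to all 956 groups; the function name `busco_summary` is hypothetical, and the counts are the two Q. lobata columns as reported in the abstract and table.

```python
def busco_summary(complete, duplicated, fragmented, missing):
    """Recompute rounded BUSCO percentages from raw ortholog counts."""
    total = complete + fragmented + missing
    return {
        "complete_pct": round(100 * complete / total),
        # Duplicated is expressed as a share of complete orthologs.
        "duplicated_pct_of_complete": round(100 * duplicated / complete),
        "fragmented_pct": round(100 * fragmented / total),
        "missing_pct": round(100 * missing / total),
        "total": total,
    }


v1_0 = busco_summary(complete=863, duplicated=450, fragmented=35, missing=58)
v0_5 = busco_summary(complete=751, duplicated=279, fragmented=96, missing=109)
print(v1_0)
print(v0_5)
```

The two duplicated shares computed here (52% for v1.0 and 37% for v0.5) are the same figures the abstract quotes as the 37-52% range for gene-count overestimation.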
Figure 3. Average read coverage of BUSCO genes for Q. lobata v0.5 (collapsed to reduce haplotype duplication). We mapped 133 million pairs of reads to the assembled contigs, yielding an expected mean coverage of 75×. (A) BUSCO genes that are represented only once in the genome show a unimodal distribution around the expected coverage. (B) BUSCO genes represented twice in the genome show a bimodal distribution due to some genes having only half the expected coverage. These 0.5× coverage genes are presumably in genome regions for which the collapsing of haplotypes failed, leaving both haplotypes represented as independent contigs. The genes falling in the 1× coverage peak are expected to be truly present in two copies.
Figure 4. Frequency distribution results of Maker annotation of the Q. lobata v1.0 genome. (A) Transcript lengths. (B) Protein lengths. (C) Intron lengths. (D) Number of exons. Light gray bars represent all 61,773 gene models. Dark gray bars represent the high-confidence subset of 13,898 models.
RepeatMasker results for genome version 1.0
| | Number of Elements | Length Occupied (bp) | Percentage of Sequence (%) |
|---|---|---|---|
| Retroelements | 119,106 | 74,981,848 | 6.34 |
| SINEs | 1214 | 141,334 | 0.01 |
| Penelope | 0 | 0 | 0.00 |
| LINEs | 33,149 | 15,516,316 | 1.31 |
| CRE/SLACS | 0 | 0 | 0.00 |
| L2/CR1/Rex | 0 | 0 | 0.00 |
| R1/LOA/Jockey | 0 | 0 | 0.00 |
| R2/R4/NeSL | 0 | 0 | 0.00 |
| RTE/Bov-B | 2013 | 670,790 | 0.06 |
| L1/CIN4 | 31,146 | 14,847,288 | 1.26 |
| LTR elements | 84,743 | 59,324,198 | 5.02 |
| BEL/Pao | 0 | 0 | 0.00 |
| Ty1/Copia | 35,709 | 23,614,490 | 2.00 |
| Gypsy/DIRS1 | 40,559 | 31,606,882 | 2.67 |
| Retroviral | 0 | 0 | 0.00 |
| DNA transposons | 37,056 | 8,974,135 | 0.76 |
| hobo-Activator | 16,517 | 4,618,534 | 0.39 |
| Tc1-IS630-Pogo | 51 | 7850 | 0.00 |
| En-Spm | 0 | 0 | 0.00 |
| MuDR-IS905 | 0 | 0 | 0.00 |
| PiggyBac | 0 | 0 | 0.00 |
| Tourist/Harbinger | 5412 | 1,226,849 | 0.10 |
| Other (Mirage P-element, P-element, Transib) | 0 | 0 | 0.00 |
| Rolling-circles | 0 | 0 | 0.00 |
| Unclassified | 4672 | 2,347,981 | 0.20 |
| Total interspersed repeats | | 86,303,964 | 7.30 |
| Small RNA | 1455 | 312,636 | 0.03 |
| Satellites | 650 | 73,309 | 0.01 |
| Simple repeats | 787,721 | 29,398,846 | 2.49 |
| Low complexity | 148,512 | 7,684,148 | 0.65 |
All 94,394 contigs (total length 1,182,727,890 bp; 1,069,186,757 bp excluding N-runs) were run with the query species eudicotyledons. The GC level is 35.36%, and 123,530,980 bp (10.44%) are masked.
Most repeats fragmented by insertions or deletions have been counted as one element.
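As a quick consistency check, the "Percentage of Sequence" column appears to be the length occupied divided by the total sequence length including N-runs (1,182,727,890 bp). That denominator is inferred from the reported numbers and is not stated explicitly in the source; the helper name `percent_of_sequence` is hypothetical.

```python
TOTAL_BP = 1_182_727_890  # total length including N-runs, from the table footnote


def percent_of_sequence(length_bp, total_bp=TOTAL_BP):
    """Length occupied as a percentage of the total sequence, to 2 decimals."""
    return round(100 * length_bp / total_bp, 2)


print(percent_of_sequence(74_981_848))    # retroelements
print(percent_of_sequence(86_303_964))    # total interspersed repeats
print(percent_of_sequence(123_530_980))   # all masked bases
```

These reproduce the reported 6.34%, 7.30%, and 10.44% respectively.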
Figure 5. View of a cysteine-rich receptor-like protein kinase 29 gene in Q. lobata. The gene model is shown at the top, and below are the tracks for the expression of this gene across different trees, shown as counts of RNA-seq reads. The gene has a stress/antifungal domain, and exhibits low expression in one of the trees (DIA3), demonstrating that the expression of this locus is variable across trees (see https://valleyoak.ucla.edu/genomicresources/).