Patrick P Edger, Robert VanBuren, Marivi Colle, Thomas J Poorten, Ching Man Wai, Chad E Niederhuth, Elizabeth I Alger, Shujun Ou, Charlotte B Acharya, Jie Wang, Pete Callow, Michael R McKain, Jinghua Shi, Chad Collier, Zhiyong Xiong, Jeffrey P Mower, Janet P Slovin, Timo Hytönen, Ning Jiang, Kevin L Childs, Steven J Knapp.
Abstract
Background: Although draft genomes are available for most agronomically important plant species, the majority are incomplete, highly fragmented, and often riddled with assembly and scaffolding errors. These assembly issues hinder advances in tool development for functional genomics and systems biology.
Findings: Here we utilized a robust, cost-effective approach to produce high-quality reference genomes. We report a near-complete genome of diploid woodland strawberry (Fragaria vesca) using single-molecule real-time sequencing from Pacific Biosciences (PacBio). This assembly has a contig N50 length of ∼7.9 million base pairs (Mb), representing a ∼300-fold improvement over the previous version. The vast majority (>99.8%) of the assembly was anchored to 7 pseudomolecules using 2 sets of optical maps from Bionano Genomics. We obtained ∼24.96 Mb of sequence not present in the previous version of the F. vesca genome and produced an improved annotation that includes 1496 new genes. Comparative syntenic analyses uncovered numerous, large-scale scaffolding errors present in each chromosome in the previously published version of the F. vesca genome.
Conclusions: Our results highlight the need to improve existing short-read based reference genomes. Furthermore, we demonstrate how genome quality impacts commonly used analyses for addressing both fundamental and applied biological questions.
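The contig N50 quoted in the abstract is a standard contiguity statistic: the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the total assembly length. The sketch below illustrates that definition only. The function name `n50` and the toy contig lengths are hypothetical and are not taken from the F. vesca assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp).

    Contigs are sorted from longest to shortest and accumulated until the
    running total reaches at least half of the summed length; the length of
    the contig that crosses that threshold is the N50.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly (not real data): total length 300, half is 150.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)  # 80 + 70 reaches 150, so the N50 is 70
```

Because N50 depends only on the length distribution, a higher value indicates fewer, longer contigs but says nothing about whether those contigs are correctly assembled.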
Keywords: Fragaria vesca; optical map; rosaceae; strawberry; third-generation sequencing
Year: 2018 PMID: 29253147 PMCID: PMC5801600 DOI: 10.1093/gigascience/gix124
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Figure 1: Chromosome landscapes of the F. vesca V4 genome. The distributions of genes and long terminal repeat retrotransposons (LTR-RTs) are plotted for each of the 7 chromosomes. Heatmaps reflect the distribution of elements, with blue indicating the lowest abundance and red signifying high abundance. Plots were generated with a sliding window of 50 kb, with a 10-kb shift across each chromosome. Terminal telomeric repeat arrays are denoted in purple.
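The windowing described for Figure 1 (50-kb windows shifted by 10 kb) can be sketched as follows. This is a minimal illustration, not the authors' plotting code. The function name `sliding_window_counts`, the choice to count features by start coordinate, and the choice to skip any trailing partial window are all assumptions, and the positions are made up.

```python
def sliding_window_counts(positions, chrom_length, window=50_000, step=10_000):
    """Count features per overlapping window along one chromosome.

    positions    : 0-based start coordinates of features (e.g. genes or LTR-RTs)
    chrom_length : chromosome length in bp
    Returns a list of (window_start, window_end, count) tuples for half-open
    windows [start, end). Assumption: only full-length windows are reported,
    except that a chromosome shorter than one window yields a single window.
    """
    counts = []
    last_start = max(chrom_length - window, 0)
    for start in range(0, last_start + 1, step):
        end = start + window
        n = sum(1 for p in positions if start <= p < end)
        counts.append((start, end, n))
    return counts


# Hypothetical feature starts on a 100-kb toy chromosome (not real data).
toy_positions = [5_000, 12_000, 45_000, 95_000]
toy_windows = sliding_window_counts(toy_positions, chrom_length=100_000)
toy_counts = [n for _, _, n in toy_windows]  # [3, 2, 1, 1, 1, 1]
```

Because consecutive windows overlap by 40 kb, each feature contributes to up to five windows, which smooths the resulting density track.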
Figure 2: Macrosyntenic comparison of the V2 and V4 F. vesca assemblies. Syntenic gene pairs between V4 (x-axis) and V2 (y-axis) of F. vesca were identified by DAGChainer [44], sorted by chromosome (Fvb1-7), and colored based on their synonymous substitution rate, as calculated by CodeML [45] using SynMap within CoGe [46]. Syntenic “orthologous” regions are colored in blue, and duplicated genes retained from a whole-genome triplication event (At-gamma [47]) in other colors. Regions that were misassembled and incorrectly scaffolded in F. vesca V2 are identified by negatively sloped and repositioned lines.
Figure 3: Distribution of gene body methylation in the V2 and V4 F. vesca assemblies. This plot shows the average DNA methylation patterns (CG = blue, CHG = green, CHH = red; H = A, T, or C) across all genes in the V2 (darker colors) and V4 (lighter colors) assemblies. The x-axis shows the transcription start sites (TSS; left dashed line) and the transcription termination sites (TTS; right dashed line), plus 2000 bp upstream and downstream of each gene.
Figure 4: Expression patterns of newly annotated genes across diverse tissue types. The heatmap shows a random subset of 100 genes from the 810 unique newly identified genes in the F. vesca V4 assembly, across 22 tissue types at different developmental stages. Two biological replicates were sequenced per tissue, with the exception of 6 tissues with only 1 biological replicate each (Table S2). Blue indicates the lowest expression, and red signifies the highest expression abundance. Gene expression level was calculated based on reads per kilobase of transcript per million mapped reads (RPKM) and visualized through heatmap analysis using variance-stabilized transformed values on a log2 scale.
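The RPKM unit named in the Figure 4 caption normalizes a read count by both transcript length (in kilobases) and sequencing depth (in millions of mapped reads). The sketch below shows only that arithmetic. The function name `rpkm` and the example numbers are hypothetical, and the variance-stabilizing transformation used for the heatmap is not reproduced here.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = read_count / (transcript_length_bp / 1e3) / (total_mapped_reads / 1e6)
         = read_count * 1e9 / (transcript_length_bp * total_mapped_reads)
    """
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Hypothetical example (not real data): 500 reads on a 2-kb transcript
# in a library of 10 million mapped reads gives 25 RPKM.
example_rpkm = rpkm(500, 2_000, 10_000_000)
```

Dividing by length makes long and short transcripts comparable within a sample, and dividing by library size makes samples of different depth comparable.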