Wen-Biao Jiao1, Gonzalo Garcia Accinelli2, Benjamin Hartwig1, Christiane Kiefer1, David Baker2, Edouard Severing1, Eva-Maria Willing1, Mathieu Piednoel1, Stefan Woetzel1, Eva Madrid-Herrero1, Bruno Huettel3, Ulrike Hümann1, Richard Reinhard3, Marcus A Koch4, Daniel Swan2, Bernardo Clavijo2, George Coupland1, Korbinian Schneeberger1.
Abstract
Long-read sequencing can overcome the weaknesses of short reads in the assembly of eukaryotic genomes; however, at present additional scaffolding is needed to achieve chromosome-level assemblies. We generated Pacific Biosciences (PacBio) long-read data of the genomes of three relatives of the model plant Arabidopsis thaliana and assembled all three genomes into only a few hundred contigs. To improve the contiguities of these assemblies, we generated BioNano Genomics optical mapping and Dovetail Genomics chromosome conformation capture data for genome scaffolding. Despite their technical differences, optical mapping and chromosome conformation capture performed similarly and doubled N50 values. After improving both integration methods, assembly contiguity reached chromosome-arm-levels. We rigorously assessed the quality of contigs and scaffolds using Illumina mate-pair libraries and genetic map information. This showed that PacBio assemblies have high sequence accuracy but can contain several misassemblies, which join unlinked regions of the genome. Most, but not all, of these misjoints were removed during the integration of the optical mapping and chromosome conformation capture data. Even though none of the centromeres were fully assembled, the scaffolds revealed large parts of some centromeric regions, even including some of the heterochromatic regions, which are not present in gold standard reference sequences.
Year: 2017 PMID: 28159771 PMCID: PMC5411772 DOI: 10.1101/gr.213652.116
Source DB: PubMed Journal: Genome Res ISSN: 1088-9051 Impact factor: 9.043
Assembly statistics
Figure 1. Assembly results and strategies. (A–C) Assembly contiguity of the assemblies of three species: A. alpina (A), E. syriacum (B), C. planisiliqua (C). The x-axis indicates the cumulative length of contigs sorted by length (expressed as percent of the entire assembly). The y-axis shows individual contig or scaffold length. The dashed line indicates the N50/L50 values. (D) Misassemblies identified with Illumina mate-pairs (yellow) and their overlap with breaks introduced during misassembly identification using optical maps (in two steps shown in green and blue). (E) Misassemblies identified with Illumina mate-pair alignments (yellow) and their overlap with breaks introduced during our integration of Dovetail Genomics chromosome conformation capture data (again, two steps shown in green and blue). (F) Inter-chromosome misassemblies identified by a genetic map in each of the assemblies (as shown in A).
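The N50/L50 values marked by the dashed line in Figure 1 can be derived from a list of contig or scaffold lengths. The following is a minimal sketch using the common definition: N50 is the length of the contig at which the cumulative length of the length-sorted contigs first reaches half of the assembly, and L50 is the number of contigs needed to get there. The function name and the example lengths are illustrative assumptions, not values from the study.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    # Sort from longest to shortest, as on the x-axis of a contiguity plot.
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # The first contig that pushes the cumulative length to >= 50 %
        # of the assembly defines N50 (its length) and L50 (its rank).
        if running >= half:
            return length, count


# Hypothetical toy assembly (arbitrary length units).
example_lengths = [100, 80, 60, 40, 20]
n50, l50 = n50_l50(example_lengths)
print(f"N50 = {n50}, L50 = {l50}")
```

Because N50 depends only on the length distribution, joining contigs into longer scaffolds raises it even if some joins are wrong, which is why the contiguity values are assessed alongside independent misassembly checks.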
Figure 2. Optical mapping-based assembly correction and scaffolding. (A) Example of misassembly breakage and new scaffolding using optical mapping data. Three misassemblies in contig-5097 were identified with the optical map alignments (and also validated by the genetic maps; markers shown with red ticks). The original contig was broken, and the subsequent scaffolding of the four contigs, which resulted from breaking the original contig at the misassemblies, introduced them into the context of larger scaffolds, which were supported by the genetic map. (LG) Linkage group. (B) Improved optical mapping scaffolding workflow. Integration of optical mapping information includes breakage of misassembled contigs and consensus maps (c-maps) followed by hybrid scaffolding. (C) FALCON contig 000108F is apparently misassembled, as two different consensus maps (CMAP-183 and CMAP-361) have conflicting alignments with the same region of this contig. (D) A conflict between FALCON contig 000090F and CMAP-625 is not sufficient to decide on the origin of the underlying misassembly. However, CMAP-625 can be fully aligned to contig scf7180000005182 of a different (PBcR) assembly, supporting the correctness of this consensus map and thereby suggesting a misassembly in the contig.
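The breakage step described for panel A (three misassemblies yielding four contigs) can be pictured as cutting a sequence at a set of breakpoint coordinates. This is only a schematic sketch: the function name, the 0-based coordinate convention, and the toy sequence are assumptions for illustration, and it does not reproduce how the optical mapping software locates or applies breaks.

```python
def break_contig(sequence, breakpoints):
    """Split a contig sequence at the given 0-based cut positions.

    A cut at position p separates sequence[:p] from sequence[p:].
    Positions outside the contig (or at its ends) are ignored.
    """
    cuts = sorted({bp for bp in breakpoints if 0 < bp < len(sequence)})
    edges = [0] + cuts + [len(sequence)]
    # Consecutive edge pairs delimit the resulting pieces.
    return [sequence[start:end] for start, end in zip(edges, edges[1:])]


# Hypothetical toy contig with three misassembly breakpoints.
toy_contig = "AAAACCCCGGGGTTTT"
pieces = break_contig(toy_contig, [4, 8, 12])
print(len(pieces), pieces)
```

The resulting pieces can then be scaffolded independently, so that each one is placed according to its own map alignments rather than inheriting the erroneous join.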
Figure 3. Assembly scaffolding using chromosome conformation capture data. (A) Improved chromosome conformation capture data scaffolding workflow. (B) Misassembly identification using chromosome conformation capture read pairs. The paired-end mapping positions in the region 300–500 kb of FALCON contig 000171F show a sudden absence of read pairs spanning across the region at around 410 kb. A misassembly at this region was indicated by HiRise. (MQ) Mapping quality.
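The signal described for panel B, a position that no read pair spans, can be illustrated by counting, at regularly spaced positions, how many pairs have one mate on each side. The sketch below is a simplified illustration and is not the HiRise algorithm; the function names, the `step` and `min_pairs` parameters, and the toy coordinates are all assumptions.

```python
def spanning_counts(pairs, contig_length, step):
    """Count read pairs spanning each probed position.

    pairs: iterable of (left, right) mapping positions of the two mates
           on the same contig, with left < right.
    Returns a dict mapping each probed position to its spanning-pair count.
    """
    positions = range(step, contig_length, step)
    return {
        pos: sum(1 for left, right in pairs if left < pos < right)
        for pos in positions
    }


def candidate_breaks(pairs, contig_length, step, min_pairs=1):
    """Positions spanned by fewer than min_pairs read pairs."""
    counts = spanning_counts(pairs, contig_length, step)
    return [pos for pos, n in counts.items() if n < min_pairs]


# Hypothetical toy data (arbitrary units): no pair bridges position 400.
toy_pairs = [(0, 250), (50, 300), (100, 380),
             (420, 600), (450, 700), (500, 790)]
print(candidate_breaks(toy_pairs, contig_length=800, step=100))
```

In real data, spanning coverage also drops in repeats and low-mappability regions, so a practical threshold would need to account for mapping quality and the expected insert-size distribution.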
Figure 4. Comparing the assemblies of E. syriacum and C. planisiliqua to the ancestral karyotype present in the genome of A. lyrata. The eight chromosomes of A. lyrata are shown in colored blocks. Centromeric regions are indicated by white breaks. Scaffolds of the assemblies generated here of more than 1 Mb are shown in light blue blocks. The two histograms outside of chromosome karyotypes show the gene (orange) and repeat (blue) densities assessed with window sizes of 1 Mb for A. lyrata and 200 kb for E. syriacum or C. planisiliqua. (A) Three scaffolds of E. syriacum include similarities to the two flanking regions of A. lyrata CEN2, CEN3, and CEN4. (B) Scaffolds 3, 5, 6, and 14 include up to 7 Mb of putative centromeric regions, which are absent in the core assembly of A. lyrata, as these regions do not show any homology to any region in the A. lyrata genome.