Michael Alonge1, Sebastian Soyk2, Srividya Ramakrishnan1, Xingang Wang2, Sara Goodwin2, Fritz J Sedlazeck3, Zachary B Lippman2,4, Michael C Schatz5,6,7.
Abstract
We present RaGOO, a reference-guided contig ordering and orienting tool that leverages the speed and sensitivity of Minimap2 to accurately achieve chromosome-scale assemblies in minutes. After the pseudomolecules are constructed, RaGOO identifies structural variants, including those spanning sequencing gaps. We show that RaGOO accurately orders and orients 3 de novo tomato genome assemblies, including the widely used M82 reference cultivar. We then demonstrate the scalability and utility of RaGOO with a pan-genome analysis of 103 Arabidopsis thaliana accessions by examining the structural variants detected in the newly assembled pseudomolecules. RaGOO is available open source at https://github.com/malonge/RaGOO.
Keywords: Genome alignment; Genome assembly; Long-read sequencing; Pseudomolecule; Reference-guided; Scaffolding; Tomato
Year: 2019 PMID: 31661016 PMCID: PMC6816165 DOI: 10.1186/s13059-019-1829-6
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 The RaGOO pipeline. a Contigs are aligned to the reference genome with Minimap2 and are ordered and oriented according to those alignments. b Normal alignments between a contig and a reference chromosome (top) and example alignments between a reference chromosome and an intrachromosomal chimera (bottom left) and an interchromosomal chimera (bottom right). Red arrows represent potential contig breakpoints
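The ordering and orienting step described in panel a can be illustrated with a minimal sketch. It is not RaGOO's implementation: the function name `order_and_orient`, the tuple layout of the alignments, and the placement rules (most aligned bases picks the chromosome, majority strand picks the orientation, smallest reference start picks the position) are all assumptions made for illustration.

```python
from collections import defaultdict


def order_and_orient(alignments):
    """Place contigs on reference chromosomes from pairwise alignments.

    alignments: iterable of (contig, ref_chrom, ref_start, ref_end, strand)
    tuples (a hypothetical, simplified view of aligner output).
    Returns {ref_chrom: [(contig, strand), ...]} sorted by reference position.
    """
    by_contig = defaultdict(list)
    for contig, chrom, start, end, strand in alignments:
        by_contig[contig].append((chrom, start, end, strand))

    placements = defaultdict(list)
    for contig, alns in by_contig.items():
        # Assumed rule: assign the contig to the chromosome it covers most.
        chrom_bases = defaultdict(int)
        for chrom, start, end, _ in alns:
            chrom_bases[chrom] += end - start
        best_chrom = max(chrom_bases, key=chrom_bases.get)

        # Assumed rule: orient by the strand with the most aligned bases.
        on_best = [a for a in alns if a[0] == best_chrom]
        strand_bases = defaultdict(int)
        for _, start, end, strand in on_best:
            strand_bases[strand] += end - start
        orientation = max(strand_bases, key=strand_bases.get)

        # Assumed rule: order by the smallest reference start coordinate.
        position = min(start for _, start, _, _ in on_best)
        placements[best_chrom].append((position, contig, orientation))

    return {
        chrom: [(contig, strand) for _, contig, strand in sorted(items)]
        for chrom, items in placements.items()
    }


example = [
    ("ctgA", "chr1", 5000, 9000, "+"),
    ("ctgB", "chr1", 100, 3000, "-"),
    ("ctgB", "chr2", 0, 500, "+"),
    ("ctgC", "chr2", 200, 4000, "+"),
    ("ctgC", "chr2", 4000, 4500, "-"),
]
layout = order_and_orient(example)
```

In this toy input, `ctgB` has a short secondary alignment to `chr2` but is placed on `chr1`, where most of its aligned bases fall. A contig whose aligned bases are split more evenly between chromosomes would resemble the interchromosomal chimera in panel b.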
Fig. 2 Scaffolding simulated assemblies. Ordering and localization results for “easy” and “hard” simulated tomato genome assemblies. Normalized edit distance and adjacent pair accuracy measure the success of contig ordering and are averaged across the 12 simulated chromosomes. The percentage of the genome localized measures how much of the simulated assemblies were clustered, ordered and oriented into pseudomolecules
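The caption names the two ordering metrics without defining them. The sketch below shows one plausible reading, and the paper's exact definitions may differ: here normalized edit distance is assumed to be the Levenshtein distance between the predicted and true contig orders divided by the longer list length, and adjacent pair accuracy is assumed to be the fraction of neighbouring contig pairs in the predicted order that are also neighbours, in the same order, in the truth. All function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of contig names."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(predicted, truth):
    """Edit distance scaled to [0, 1] by the longer ordering (assumed scaling)."""
    longest = max(len(predicted), len(truth))
    return edit_distance(predicted, truth) / longest if longest else 0.0


def adjacent_pair_accuracy(predicted, truth):
    """Fraction of predicted neighbour pairs that are neighbours in the truth."""
    true_pairs = set(zip(truth, truth[1:]))
    pred_pairs = list(zip(predicted, predicted[1:]))
    if not pred_pairs:
        return 0.0
    return sum(pair in true_pairs for pair in pred_pairs) / len(pred_pairs)


truth = ["c1", "c2", "c3", "c4"]
predicted = ["c1", "c2", "c4", "c3"]
ned = normalized_edit_distance(predicted, truth)
apa = adjacent_pair_accuracy(predicted, truth)
```

Under these assumed definitions, lower normalized edit distance and higher adjacent pair accuracy both indicate a better ordering, and averaging per-chromosome values would give figures comparable in spirit to those plotted.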
Fig. 3 M82 assembly contiguity. “Nchart” of the M82 and Heinz contigs and pseudomolecules. M82 pseudomolecules were established by ordering and orienting M82 contigs with RaGOO. Heinz contigs were derived from the SL3.0 pseudomolecules by splitting sequences at stretches of 20 or more contiguous “N” characters
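Deriving contigs by splitting at runs of 20 or more “N” characters can be sketched as follows. The function name `split_at_gaps` is hypothetical, and treating lowercase `n` as a gap character is an assumption not stated in the caption.

```python
import re


def split_at_gaps(sequence, min_gap=20):
    """Split a pseudomolecule into contigs at runs of >= min_gap N characters.

    Shorter N runs stay inside the contig they occur in.
    """
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    # Drop empty pieces that arise when the sequence starts or ends with a gap.
    return [piece for piece in gap.split(sequence) if piece]


scaffold = "ACGT" + "N" * 20 + "GGCC" + "N" * 5 + "TT" + "N" * 30 + "A"
contigs = split_at_gaps(scaffold)
```

Here the 5-base N run is below the threshold, so `GGCC`, the short gap, and `TT` remain a single contig.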
Fig. 4 Reference-free vs. reference-guided scaffolding of M82. Both the top and bottom panels depict a dotplot (left) and Hi-C heatmap (right). The dotplots are generated from alignments to the Heinz reference assembly. On the top panel is the reference-guided RaGOO assembly dotplot, with chromosomes 1 through 12 depicted from top left to bottom right, and the Hi-C heatmap for chromosome 12. On the bottom is the de novo SALSA scaffolds dotplot, with the 12 largest scaffolds depicted in descending order of length from top left to bottom right and the Hi-C heatmap for the 12th largest scaffold
Summary statistics of the reference tomato genome as well as the three novel accessions. Chromosome span is the total span of all of the chromosomes, including gaps. Chromosome N50 is the length such that half of the total span is covered by chromosome sequences of this length or longer. Chr0 bases is the number of bases assigned to the unresolved chromosome 0. Contig span is the total length of non-gap (non-N) characters. Contig N50 is the length such that half of the contig span is covered by contigs of this length or longer. Number SVs is the number of structural variants (SVs) reported by RaGOO using the integrated version of Assemblytics
| Accession | Chromosome span (bp) | Chromosome N50 (bp) | Chr0 bases (bp) | Number Contigs | Contig span (bp) | Contig N50 (bp) | Number SVs |
|---|---|---|---|---|---|---|---|
| Heinz | 828,076,956 | 66,723,567 | 20,852,292 | 22,705 | 746,357,581 | 133,084 | NA |
| M82 | 792,934,937 | 67,021,692 | 8,891,603 | 2,910 | 771,143,786 | 1,458,445 | 36,191 |
| BGV | 794,568,563 | 67,174,401 | 4,643,553 | 638 | 769,694,915 | 4,105,177 | 45,927 |
| FLA | 796,004,315 | 67,650,907 | 5,490,904 | 2,577 | 750,743,510 | 795,751 | 45,478 |
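The N50 and contig span definitions given in the table caption can be made concrete with a short sketch. The function names are hypothetical and the lengths used are toy values, not data from the table.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover half the total span."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def contig_span(sequence):
    """Number of non-gap characters, counting N (either case) as gap."""
    return len(sequence) - sequence.upper().count("N")


toy_lengths = [2, 3, 4, 5, 6]          # total span 20
toy_n50 = n50(toy_lengths)             # 6 + 5 = 11 >= 10, so N50 is 5
toy_span = contig_span("ACGTNNNNACGT")
```

The same `n50` function applies to both columns of the table: given chromosome lengths it yields the chromosome N50, and given contig lengths it yields the contig N50.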
Fig. 5 The tomato pan-genome. (left) Circos plot (http://omgenomics.com/circa/) depicting the size and type of structural variant. From the outer ring to the inner ring: M82, FLA, and BGV. Point height (y-axis) is scaled by the size of the variant, with red indicating insertions and blue indicating deletions. (right) Euler diagrams (https://github.com/jolars/eulerr) depicting the insertions and deletions shared among the three accessions
Fig. 6 The Arabidopsis pan-genome. a Map of the 103 Arabidopsis accessions that were assembled in this study. b Principal components analysis of the structural variant presence/absence matrix of the 103 Arabidopsis accessions
Summary of the ten most variable genes in the Arabidopsis pan-genome. “Number of variants” is the total number of variants intersecting a given gene, and “Normalized number of variants” is the number of intersecting variants divided by gene length
| Gene | Annotation | Number of variants | Normalized number of variants | Number of accessions with variants |
|---|---|---|---|---|
| AT4G16960 | Defense response, chloroplast | 62 | 0.00715605 | 80 |
| AT1G58602 | ADP binding, defense response, ATP binding | 57 | 0.00244101 | 90 |
| AT3G44400 | ADP binding, defense response, cytoplasm, signal transduction | 56 | 0.00621256 | 89 |
| AT3G44630 | Defense response | 55 | 0.00593312 | 84 |
| AT4G16920 | Defense response, chloroplast, cytoplasm | 55 | 0.00522913 | 79 |
| AT1G62620 | | 54 | 0.00850796 | 91 |
| AT4G16950 | Defense response to fungus, incompatible interaction, nucleotide binding, defense response, protein binding | 54 | 0.00558486 | 70 |
| AT1G62630 | Defense response, ATP binding, N-terminal protein myristoylation, ADP binding, nucleus | 50 | 0.00748391 | 93 |
| AT5G41740 | Nucleus, defense response, chloroplast | 48 | 0.00565171 | 91 |
| AT4G16890 | Defense response, cytosol, signal transduction, defense response to bacterium, protein binding, ATP binding, defense response to bacterium, incompatible interaction, ADP binding, systemic acquired resistance, salicylic acid-mediated signaling pathway, cytoplasm, intracellular membrane-bounded organelle, nucleus, nucleotide binding, endoplasmic reticulum, response to auxin | 48 | 0.00536373 | 75 |
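The normalization described in the table caption is a simple ratio, sketched below. The function names are hypothetical, and the gene length is not stated in the source: the value of 8,664 bp used here is only what the AT4G16960 row implies if gene length is measured in base pairs.

```python
def normalized_variant_count(num_variants, gene_length_bp):
    """Number of intersecting variants divided by gene length."""
    return num_variants / gene_length_bp


def implied_gene_length(num_variants, normalized_count):
    """Invert the ratio to recover the gene length a table row implies."""
    return num_variants / normalized_count


# AT4G16960 row: 62 variants, normalized count 0.00715605.
length_bp = round(implied_gene_length(62, 0.00715605))
check = normalized_variant_count(62, length_bp)
```

Dividing by gene length keeps long genes from ranking as highly variable merely because they offer more sequence for variants to intersect.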