
HaploMerger2: rebuilding both haploid sub-assemblies from high-heterozygosity diploid genome assembly.

Shengfeng Huang¹, Mingjing Kang¹, Anlong Xu¹.

Abstract

SUMMARY: De novo assembly is a difficult issue for heterozygous diploid genomes. The advent of high-throughput short-read and long-read sequencing technologies provides both new challenges and potential solutions to the issue. Here, we present HaploMerger2 (HM2), an automated pipeline for rebuilding both haploid sub-assemblies from the polymorphic diploid genome assembly. It is designed to work on pre-existing diploid assemblies, which are typically created by using de novo assemblers. HM2 can process any diploid assemblies, but it is especially suitable for diploid assemblies with high heterozygosity (≥3%), which can be difficult for other tools. This pipeline also implements flexible and sensitive assembly error detection, a hierarchical scaffolding procedure and a reliable gap-closing method for haploid sub-assemblies. Using HM2, we demonstrate that two haploid sub-assemblies reconstructed from a real, highly-polymorphic diploid assembly show greatly improved continuity.
AVAILABILITY AND IMPLEMENTATION: Source code, executables and the testing dataset are freely available at https://github.com/mapleforest/HaploMerger2/releases/.
CONTACT: hshengf2@mail.sysu.edu.cn.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2017 PMID: 28407147 PMCID: PMC5870766 DOI: 10.1093/bioinformatics/btx220

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

There is an increasing demand for sequencing of heterozygous diploid genomes. However, since the era of Sanger sequencing, de novo assembly of heterozygous diploid genomes has been difficult (Vinson et al.). It becomes more challenging with massive, short-read sequencing technologies (Zhang et al.). Although several de novo assembly methods and post-assembly methods have been designed to improve heterozygous short-read assemblies (Gnerre et al.; Huang et al.; Kajitani et al.; Pryszcz and Gabaldon, 2016; Safonova et al.), these assemblies hardly reach the same level of quality as non-heterozygous assemblies. The latest high-throughput long-read sequencing technologies provide a promising approach to polymorphic assembly (Berlin et al.; Chin et al.; Koren et al.; Xiao et al.), especially when combined with heterozygosity-aware assembly algorithms (Chin et al.). However, long-read diploid assemblies still contain assembly errors and, more importantly, allelic relations between scaffolds might not be fully resolved. This is because, as heterozygosity increases, alleles from the same locus are more likely to be mistaken for sequences from different loci.

Previously, we developed HaploMerger (HM), an automated pipeline to resolve allelic relations in a polymorphic diploid assembly and output the reference haploid assembly (Huang et al.). Since HM works on pre-existing diploid assemblies, it can be easily incorporated into any assembly pipeline. Thus far, HM has been used to create over ten published reference assemblies, including large draft genomes for amphioxus (∼450 Mb with ∼4% heterozygosity) and hookworms (∼330 Mb with <1% heterozygosity) (Huang et al.; Schwarz et al.). We perceive that, because alleles often functionally complement each other in a highly polymorphic genome, both haploid sub-assemblies are necessary to represent the complete genomic landscape. Moreover, the presence of both haploid sub-assemblies enables the study of widespread heterozygosity in a highly polymorphic genome, including single nucleotide polymorphisms, indels, copy-number variation, structural variation and recent transposition.

Here, we provide HaploMerger2 (HM2), a major upgrade over the old pipeline, which we redesigned to reconstruct both haploid sub-assemblies from short-read and long-read diploid assemblies. HM2 can work with both heterozygosity-aware and -unaware genome assemblers and can process both low- and high-heterozygosity assemblies. However, it is especially suitable for difficult tasks in which the diploid assemblies have high heterozygosity (≥3%). Compared with the old pipeline, HM2 also implements more flexible assembly error detection, a hierarchical scaffolding procedure and a reliable gap-closing method on haploid assemblies (Fig. 1). In this applications note, we describe the features and applications of HM2.

Fig. 1. A flowchart of the HaploMerger2 (HM2) pipeline. HM2 comprises five functionally independent modules. Each module can work on its own. The users can run any module separately or choose some of them to form a specific pipeline, as suggested in the flowchart. In this specific pipeline, all but the second module are optional and can be iterated to achieve better results.

2 Software description

2.1 Preparation and requirements

An initial diploid assembly should first be generated with a de novo assembler. To include as many alleles as possible in the diploid assembly, the de novo assembler should be run with stringent parameters (e.g. low error rates) or in heterozygosity-aware mode, which forces alleles from the same locus to be assembled and output separately. To avoid false alignments, repetitive sequences in the diploid assembly, including simple repeats, transposable elements and highly duplicated coding sequences, should be soft-masked using WindowMasker and/or RepeatMasker (Morgulis et al.; Tarailo-Graovac and Chen, 2009). To achieve optimal specificity and sensitivity, knowledge of the allelic polymorphism rate and mutational biases is very important. For example, if the heterozygosity is 1% and the alignment threshold is set to 10%, many sequences will be falsely removed as alleles. Conversely, if the heterozygosity is 10% and the alignment threshold is set to 1%, many true allele pairs will go undetected and remain in the haploid sub-assemblies. HM2 provides tools to infer proper parameters for these situations. Finally, owing to algorithmic limitations, HM2 is better suited to diploid assemblies with an initial scaffold N50 size >100 Kb.
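
The effect of a mismatched threshold can be illustrated with a toy calculation. The sketch below is not HM2's algorithm: it assumes an ungapped alignment, measures divergence as the fraction of mismatching columns, and uses hypothetical helper names (`divergence`, `is_allelic`) and invented sequences.

```python
def divergence(a: str, b: str) -> float:
    """Fraction of mismatching columns in an ungapped, equal-length alignment."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    mismatches = sum(1 for x, y in zip(a.upper(), b.upper()) if x != y)
    return mismatches / len(a)


def is_allelic(a: str, b: str, max_divergence: float) -> bool:
    """Hypothetical rule: call a pair allelic if divergence <= threshold."""
    return divergence(a, b) <= max_divergence


# Invented 100-bp allele pair that differs at 5 positions (5% divergence).
ref = "ACGT" * 25
alt_bases = list(ref)
for pos in (3, 22, 47, 68, 91):
    alt_bases[pos] = "A" if alt_bases[pos] != "A" else "C"
alt = "".join(alt_bases)

print(f"divergence: {divergence(ref, alt):.2%}")
print("allelic at 1% threshold: ", is_allelic(ref, alt, 0.01))
print("allelic at 10% threshold:", is_allelic(ref, alt, 0.10))
```

With a threshold far below the true allelic divergence, the pair is missed; a threshold far above it would also admit diverged non-allelic sequences, which is the opposite failure described in the text.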

2.2 Detection and break-up of potential mis-joins

Mis-joins of unrelated genomic portions can be detected by examining the alignments between allelic scaffolds. In the old HM pipeline, mis-join processing and haploid assembly rebuilding were inseparable. In HM2, mis-join processing is redesigned as the first independent module, which allows users to choose optimal parameters and run iterations of the module to maximize error detection. It is worth noting that false detection of mis-joins due to repetitive sequences is suppressed by the initial repeat-masking procedure. In a pair of allelic scaffolds involved in a mis-join, it is difficult to determine which scaffold has the error. Additionally, it is hard to discriminate mis-joins from natural inversions and translocations. Therefore, HM2 breaks up both scaffolds involved in a potential mis-join. The correct connection can be restored later by the scaffolding module.
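
The "break both scaffolds" policy can be sketched as follows. This is only an illustration under stated assumptions: the breakpoint coordinates are taken as given (HM2 derives them from alignments), and the function names, the tuple layout of a mis-join and the `_partN` naming scheme are all hypothetical.

```python
from typing import Dict, List, Tuple

# A potential mis-join: (scaffold A, 0-based breakpoint in A, scaffold B, breakpoint in B).
MisJoin = Tuple[str, int, str, int]


def break_scaffold(name: str, seq: str, breakpoints: List[int]) -> Dict[str, str]:
    """Split one scaffold at the given breakpoints; pieces get suffixed names."""
    cuts = sorted({p for p in breakpoints if 0 < p < len(seq)})
    if not cuts:
        return {name: seq}
    edges = [0] + cuts + [len(seq)]
    return {
        f"{name}_part{i + 1}": seq[start:end]
        for i, (start, end) in enumerate(zip(edges, edges[1:]))
    }


def break_misjoins(assembly: Dict[str, str], misjoins: List[MisJoin]) -> Dict[str, str]:
    """Break BOTH scaffolds of every potential mis-join, since either may be wrong."""
    points: Dict[str, List[int]] = {}
    for scaf_a, pos_a, scaf_b, pos_b in misjoins:
        points.setdefault(scaf_a, []).append(pos_a)
        points.setdefault(scaf_b, []).append(pos_b)
    broken: Dict[str, str] = {}
    for name, seq in assembly.items():
        broken.update(break_scaffold(name, seq, points.get(name, [])))
    return broken


toy = {"s1": "AAAACCCC", "s2": "GGGGTTTT", "s3": "ACGT"}
for name, seq in break_misjoins(toy, [("s1", 4, "s2", 4)]).items():
    print(name, seq)
```

No sequence is lost by breaking, so a later scaffolding step remains free to rejoin the pieces if the original connection was correct.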

2.3 Reconstruction of two separated haploid sub-assemblies

The old version of HM reconstructs allelic relations based on the best reciprocal, mirrored whole-genome alignments of the diploid assembly. Then, a heuristic method is employed to elect the best allele into the reference haploid assembly, whereas the other allele is used to fill the N-gaps or is simply discarded. This procedure might cause the loss of the alternative alleles and excessive switches between the two haplotypes. In HM2, we revised the algorithm to reconstruct both haploid sub-assemblies: the reference sub-assembly and the alternative sub-assembly. Specifically, if two alleles are available for a locus, HM2 separates them into two different sub-assemblies, with the better-quality allele placed in the reference sub-assembly. If only one allele is available for a locus (often because of haplotype collapsing or because the other allele was simply discarded by the de novo assembler), HM2 puts this allele into both sub-assemblies. In the sub-assemblies, the allelic scaffolds are given the same scaffold name. Finally, because there are switches between haplotypes in the rebuilt haploid sub-assemblies, the sub-assemblies are not haplotype phased.
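
The allocation rule described above can be sketched in a few lines. This is a simplified illustration, not HM2's implementation: alleles are assumed to be already grouped by locus, and the `quality` score (fewer N bases, then greater length) is a hypothetical stand-in for whatever criterion HM2 actually uses.

```python
from typing import Dict, List, Tuple


def quality(seq: str) -> Tuple[int, int]:
    """Hypothetical quality score: fewer N bases first, then greater length."""
    return (-seq.upper().count("N"), len(seq))


def split_haplotypes(loci: Dict[str, List[str]]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Place alleles into reference and alternative sub-assemblies.

    Both sub-assemblies use the same name for a given locus.
    """
    reference: Dict[str, str] = {}
    alternative: Dict[str, str] = {}
    for locus, alleles in loci.items():
        if len(alleles) == 1:
            # Only one allele available: it goes into both sub-assemblies.
            reference[locus] = alternative[locus] = alleles[0]
        elif len(alleles) == 2:
            better, other = sorted(alleles, key=quality, reverse=True)
            reference[locus], alternative[locus] = better, other
        else:
            raise ValueError(f"{locus}: expected 1 or 2 alleles, got {len(alleles)}")
    return reference, alternative


toy_loci = {
    "locus1": ["ACGTNNNN", "ACGTACGT"],  # two alleles, the second has no gaps
    "locus2": ["GGGG"],                  # collapsed or single allele
}
ref_asm, alt_asm = split_haplotypes(toy_loci)
print("reference:  ", ref_asm)
print("alternative:", alt_asm)
```

Because the better allele is chosen locus by locus, consecutive loci in the reference sub-assembly may come from different parental haplotypes, which is why the output is not haplotype phased.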

2.4 Hierarchical scaffolding of the haploid sub-assemblies

In a polymorphic diploid assembly, scaffolding with mate-pairs is ineffective because reads of the same pair are often aligned to different haplotypes. This is the major factor that causes heavy fragmentation and excessive assembly errors in polymorphic assemblies (Huang et al.; Kajitani et al.; Safonova et al.; Vinson et al.). In HM2, we implement hierarchical scaffolding in the haploid sub-assemblies. Without interference from the different alleles, this re-scaffolding procedure can dramatically improve sequence continuity. Currently, HM2 invokes a third-party program, SSPACE v3.0, to implement scaffolding (Boetzer et al.). In our experience, SSPACE v3.0 implements a fast, straightforward greedy scaffolding algorithm. However, HM2 also supports other scaffolders, as long as they do not remove sequences from or add sequences to the sub-assembly. Continuity can be further improved by invoking multiple rounds of the scaffolding module. Additionally, only the reference sub-assembly needs re-scaffolding because HM2 updates the alternative sub-assembly according to the new scaffolding layout of the reference sub-assembly.
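
One way to check the requirement that a scaffolder neither removes nor adds sequence is to compare the gap-free contigs before and after scaffolding. The sketch below is an assumption-laden illustration, not part of HM2: it treats any run of N as a gap, assumes sequences contain only A, C, G, T and N, and uses hypothetical function names.

```python
import re
from collections import Counter
from typing import Dict

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def canonical(contig: str) -> str:
    """Strand-independent form: the smaller of a contig and its reverse complement."""
    contig = contig.upper()
    reverse_complement = contig.translate(_COMPLEMENT)[::-1]
    return min(contig, reverse_complement)


def contig_multiset(assembly: Dict[str, str]) -> Counter:
    """Count gap-free contigs, ignoring scaffold names, order, strand and gap sizes."""
    contigs: Counter = Counter()
    for seq in assembly.values():
        for contig in re.split(r"[Nn]+", seq):
            if contig:
                contigs[canonical(contig)] += 1
    return contigs


def same_sequence_content(before: Dict[str, str], after: Dict[str, str]) -> bool:
    """True if scaffolding only re-ordered, re-oriented or re-gapped the contigs."""
    return contig_multiset(before) == contig_multiset(after)


before = {"a": "ACGGT", "b": "TTTGC"}
joined = {"scaf1": "ACGGT" + "N" * 10 + "GCAAA"}   # b joined in reverse complement
altered = {"scaf1": "ACGGT" + "N" * 3 + "TTTGCA"}  # one extra base was added

print(same_sequence_content(before, joined))
print(same_sequence_content(before, altered))
```

Such a check is stricter than comparing total base counts, because it also catches a scaffolder that trims one contig while extending another.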

2.5 Detection and removal of potential tandem assembly errors

HM2 uses an updated module with several bug fixes and new configurable options. For example, the module can now scan tandems as small as 100 bp and detect tandems of unequal length (option ‘XvY’). In addition, multiple rounds of tandem removal can be performed, usually with decreasing tandem sizes and increasing sensitivity. The tandem-assembled sequences that have been removed are now collected in an output file rather than discarded, as was done before. Finally, users should be careful with this module because it is the only module in HM2 that may lose genomic information.

2.6 N-gap closing

HM2 invokes a third-party program, GapCloser (Luo et al.), to implement N-gap closing. Since gap-filling sequences generated by GapCloser are not always reliable, HM2 re-examines all the gap-filling sequences and retains only the reliable ones. Because this examination is specific to the GapCloser output, HM2 currently does not support other gap-filling software. It is possible to run multiple rounds of gap-filling with different datasets. All gap-filling sequences are annotated in an AGP-formatted file (v.1.1).
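
For readers unfamiliar with AGP, the sketch below reads the standard tab-delimited AGP columns and separates gap rows (component type N or U) from sequence rows. It is a generic reader written for illustration; the text does not describe how HM2 labels gap-filling sequences within the file, and the sample rows are invented.

```python
from typing import Dict, List

GAP_COMPONENT_TYPES = {"N", "U"}


def parse_agp(text: str) -> List[Dict[str, object]]:
    """Parse AGP rows into dictionaries, skipping blank and comment lines."""
    rows: List[Dict[str, object]] = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split("\t")
        row: Dict[str, object] = {
            "object": f[0],
            "object_beg": int(f[1]),
            "object_end": int(f[2]),
            "part_number": int(f[3]),
            "component_type": f[4],
        }
        if f[4] in GAP_COMPONENT_TYPES:
            row.update(is_gap=True, gap_length=int(f[5]), gap_type=f[6], linkage=f[7])
        else:
            row.update(
                is_gap=False,
                component_id=f[5],
                component_beg=int(f[6]),
                component_end=int(f[7]),
                orientation=f[8],
            )
        rows.append(row)
    return rows


# Invented example: two contigs joined by one 100-bp gap.
sample_agp = "\n".join(
    "\t".join(fields)
    for fields in [
        ("scaffold1", "1", "1000", "1", "W", "contig1", "1", "1000", "+"),
        ("scaffold1", "1001", "1100", "2", "N", "100", "fragment", "yes", ""),
        ("scaffold1", "1101", "1600", "3", "W", "contig2", "1", "500", "-"),
    ]
)

agp_rows = parse_agp(sample_agp)
remaining_gap_bases = sum(r["gap_length"] for r in agp_rows if r["is_gap"])
print(len(agp_rows), "rows;", remaining_gap_bases, "gap bases")
```

Comparing such a summary before and after a gap-closing round gives a quick view of how many N bases were replaced by sequence.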

3 Sample applications

We provide three examples for testing HM2. The first two examples use an artificial diploid assembly (∼100 Kb) to test whether HM2 is installed successfully and functions properly. Both examples finish within a minute. The third example uses a real, highly polymorphic diploid assembly of a wild-type amphioxus. This assembly was created from a mixture of 454 and Illumina reads (∼60X) using the Celera assembler CABOG v6.1 (Miller et al.). A copy of this assembly can be downloaded from GenBank (accession: AYSR00000000.1) or from our HM2 release website (named ‘bbv18wm.fa.gz’). It has been soft-masked and is ready to use. This assembly contains ∼708 Mb and has a scaffold/contig N50 size of 264 Kb/30 Kb, with an average allelic polymorphism rate of ∼4%. After a single round of HM2, we obtain two separate haploid sub-assemblies of ∼406 Mb with a scaffold/contig N50 size of 2.2 Mb/40 Kb. This takes <3 hours on a machine with 12 cores and 64 GB of memory. The results and performance are highly reproducible. A full description of this application is provided in the Supplementary information.

4 Discussion

HM2 works in the post-assembly stage, so its performance is bound by the quality of the initial diploid assembly. For example, if one of the haplotypes is largely missing from the diploid assembly, HM2 cannot recover it. However, a reference haploid sub-assembly is always guaranteed. Because of algorithmic limitations, HM2 offers little help if the diploid assembly is too fragmented (i.e. scaffold N50 <100 Kb). In essence, HM2 is a tool kit comprising a set of executables with independent functions, as well as wrappers for winMasker, Lastz, chainNet, SSPACE and GapCloser. The intermediate information and running messages are tracked and documented for each step and function. The pipeline presented here is one particular arrangement of a selection of tools from this kit. Therefore, HM2 can be used for other applications in post-assembly analysis. For example, it can be used to create self-versus-self whole-genome alignments or pairwise alignments between two genome assemblies, to detect tandem duplications, to further scaffold an assembly and to close some N-gaps.
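
Whether an assembly clears the fragmentation limit can be checked before running the pipeline. The sketch below computes scaffold N50 from FASTA-formatted lines using the usual definition (the length at which the cumulative sum of lengths, sorted in descending order, reaches half of the total). The function names and the toy records are illustrative; only the 100 Kb figure comes from the text.

```python
from typing import Iterable, List

MIN_SCAFFOLD_N50 = 100_000  # the >100 Kb guideline stated in the text


def read_fasta_lengths(lines: Iterable[str]) -> List[int]:
    """Return the sequence length of each record in FASTA-formatted lines."""
    lengths: List[int] = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths: Iterable[int]) -> int:
    """Length L such that sequences of length >= L cover at least half of the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


toy_fasta = [">s1", "ACGT", "AC", ">s2", "GG"]
toy_lengths = read_fasta_lengths(toy_fasta)
print("lengths:", toy_lengths, "N50:", n50(toy_lengths))
print("meets guideline:", n50(toy_lengths) > MIN_SCAFFOLD_N50)
```

For a real assembly, the lines of an opened FASTA file can be passed to `read_fasta_lengths` in place of the toy list.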

Funding

This work was supported by the 973 Project [grant number 2013CB835305], the National Nature Science Fund [grant number 31171193], the National Supercomputer Center in Guangzhou and the Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund.

Conflict of Interest: none declared.

References

1. High-quality draft assemblies of mammalian genomes from massively parallel sequence data.

Authors: Sante Gnerre; Iain Maccallum; Dariusz Przybylski; Filipe J Ribeiro; Joshua N Burton; Bruce J Walker; Ted Sharpe; Giles Hall; Terrance P Shea; Sean Sykes; Aaron M Berlin; Daniel Aird; Maura Costello; Riza Daza; Louise Williams; Robert Nicol; Andreas Gnirke; Chad Nusbaum; Eric S Lander; David B Jaffe
Journal: Proc Natl Acad Sci U S A Date: 2010-12-27 Impact factor: 11.205

2. WindowMasker: window-based masker for sequenced genomes.

Authors: Aleksandr Morgulis; E Michael Gertz; Alejandro A Schäffer; Richa Agarwala
Journal: Bioinformatics Date: 2005-11-15 Impact factor: 6.937

3. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.

Authors: Konstantin Berlin; Sergey Koren; Chen-Shan Chin; James P Drake; Jane M Landolin; Adam M Phillippy
Journal: Nat Biotechnol Date: 2015-05-25 Impact factor: 54.908

4. dipSPAdes: Assembler for Highly Polymorphic Diploid Genomes.

Authors: Yana Safonova; Anton Bankevich; Pavel A Pevzner
Journal: J Comput Biol Date: 2015-03-03 Impact factor: 1.479

5. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.

Authors: Chen-Shan Chin; David H Alexander; Patrick Marks; Aaron A Klammer; James Drake; Cheryl Heiner; Alicia Clum; Alex Copeland; John Huddleston; Evan E Eichler; Stephen W Turner; Jonas Korlach
Journal: Nat Methods Date: 2013-05-05 Impact factor: 28.547

6. Phased diploid genome assembly with single-molecule real-time sequencing.

Authors: Chen-Shan Chin; Paul Peluso; Fritz J Sedlazeck; Maria Nattestad; Gregory T Concepcion; Alicia Clum; Christopher Dunn; Ronan O'Malley; Rosa Figueroa-Balderas; Abraham Morales-Cruz; Grant R Cramer; Massimo Delledonne; Chongyuan Luo; Joseph R Ecker; Dario Cantu; David R Rank; Michael C Schatz
Journal: Nat Methods Date: 2016-10-17 Impact factor: 28.547

7. MECAT: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads.

Authors: Chuan-Le Xiao; Ying Chen; Shang-Qian Xie; Kai-Ning Chen; Yan Wang; Yue Han; Feng Luo; Zhi Xie
Journal: Nat Methods Date: 2017-09-18 Impact factor: 28.547

8. Decelerated genome evolution in modern vertebrates revealed by analysis of multiple lancelet genomes.

Authors: Shengfeng Huang; Zelin Chen; Xinyu Yan; Ting Yu; Guangrui Huang; Qingyu Yan; Pierre Antoine Pontarotti; Hongchen Zhao; Jie Li; Ping Yang; Ruihua Wang; Rui Li; Xin Tao; Ting Deng; Yiquan Wang; Guang Li; Qiujin Zhang; Sisi Zhou; Leiming You; Shaochun Yuan; Yonggui Fu; Fenfang Wu; Meiling Dong; Shangwu Chen; Anlong Xu
Journal: Nat Commun Date: 2014-12-19 Impact factor: 14.919

9. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler.

Authors: Ruibang Luo; Binghang Liu; Yinlong Xie; Zhenyu Li; Weihua Huang; Jianying Yuan; Guangzhu He; Yanxiang Chen; Qi Pan; Yunjie Liu; Jingbo Tang; Gengxiong Wu; Hao Zhang; Yujian Shi; Yong Liu; Chang Yu; Bo Wang; Yao Lu; Changlei Han; David W Cheung; Siu-Ming Yiu; Shaoliang Peng; Zhu Xiaoqian; Guangming Liu; Xiangke Liao; Yingrui Li; Huanming Yang; Jian Wang; Tak-Wah Lam; Jun Wang
Journal: Gigascience Date: 2012-12-27 Impact factor: 6.524

10. Aggressive assembly of pyrosequencing reads with mates.

Authors: Jason R Miller; Arthur L Delcher; Sergey Koren; Eli Venter; Brian P Walenz; Anushka Brownley; Justin Johnson; Kelvin Li; Clark Mobarry; Granger Sutton
Journal: Bioinformatics Date: 2008-10-24 Impact factor: 6.937
