Literature DB >> 33194395

BiSCoT: improving large eukaryotic genome assemblies with optical maps.

Benjamin Istace¹, Caroline Belser¹, Jean-Marc Aury¹.

Abstract

MOTIVATION: Long read sequencing and Bionano Genomics optical maps are two techniques that, when used together, make it possible to reconstruct entire chromosome or chromosome arms structure. However, the existing tools are often too conservative and organization of contigs into scaffolds is not always optimal.
RESULTS: We developed BiSCoT (Bionano SCaffolding COrrection Tool), a tool that post-processes files generated during a Bionano scaffolding in order to produce an assembly of greater contiguity and quality. BiSCoT was tested on a human genome and four publicly available plant genomes sequenced with Nanopore long reads and improved significantly the contiguity and quality of the assemblies. BiSCoT generates a fasta file of the assembly as well as an AGP file which describes the new organization of the input assembly. AVAILABILITY: BiSCoT and improved assemblies are freely available on GitHub at http://www.genoscope.cns.fr/biscot and Pypi at https://pypi.org/project/biscot/. ©2020 Istace et al.

Entities: Chemical

Keywords: Bioinformatics; Bionano; Genome assembly; Long reads; Nanopore; Optical maps; PacBio; Scaffolding; Tool

Year: 2020 PMID： 33194395 PMCID： PMC7649008 DOI： 10.7717/peerj.10150

Source DB: PubMed Journal: PeerJ ISSN： 2167-8359 Impact factor: 2.984

Introduction

Assembling large and repetitive genomes, such as plant genomes, is a challenging field in bioinformatics. The appearance of short reads technologies several years ago improved considerably the number of genomes publicly available. However, a high proportion of them are still fragmented and few represent the chromosome organization of the genome. Recently, long reads sequencing techniques, like Oxford Nanopore Technologies and Pacific Biosciences, were introduced to improve the contiguity of assemblies, by sequencing DNA molecules that can range from a few kilobases to more than a megabase in size (Istace et al., 2017; Schmidt et al., 2017; Kim et al., 2019; Shafin et al., 2019). Nevertheless and even if the assemblies were greatly improved, the chromosome-level organization of the sequenced genome cannot be deciphered in a majority of cases. In 2017, Bionano Genomics launched its Saphyr system which was able to generate optical maps of a genome, by using the distribution of enzymatic labelling sites. These maps were used to orient and order contigs into scaffolds but the real improvement came in 2018, when Bionano Genomics introduced their Direct Label and Stain (DLS) technology that was able to produce genome maps at the chromosome-level with a N50 several times higher than previously (Belser et al., 2018; Formenti et al., 2018; Hu et al., 2019). However, scaffolds generated with the tool provided by Bionano Genomics do not reach optimal contiguity. Indeed, when two contigs C1 and C2 are found to share labels, one could expect that the tool would merge the two sequences at the shared site. Instead, the software chooses a conservative approach and outputs the sequence of C1 followed by a 13-Ns gap and then the C2 sequence, thus duplicating the region that is shared by the two contigs (Fig. 1A and 1B) and in numerous cases, these duplicated regions could reach several kilobases. As an example, on the human genome we used to evaluate BiSCoT (see ‘Results’), we could detect 515 of those regions, affecting 16 genes and corresponding to around 24.5 Mb of duplicated sequences, the longest being 237 kb in size. These duplicated regions affect the contiguity and have to be corrected as they can be problematic for downstream analyses, like copy number variation studies. They originate from overlaps that are not fused in the input assembly and usually correspond to allelic duplications. In addition, contigs can sometimes be inserted into other contigs, these cases are not handled by the Bionano scaffolding tool that discards the inserted contigs (Fig. 1C).

Figure 1

The Bionano scaffolding tool does not merge contigs even if they share labels.

The Bionano scaffolding tool does not merge contigs even if they share labels.

Instead, it inserts 13 N’s gap between contigs, thus artificially duplicating the shared region. (A) BiSCoT merges contigs that share enzymatic labelling sites. (B) If contigs do not share labels but share a genomic region, BiSCoT attempts to merge them by aligning the borders of the contigs. (C) The Bionano scaffolding tool does not handle cases where contigs can be inserted into others. BiSCoT attempts to merge the inserted map with the one containing it if they share labels. We developed BiSCoT, a python script that examinates data generated during a previous Bionano scaffolding and merges contigs separated by a 13-Ns gap if needed. BiSCoT also re-evaluates gap sizes and searches for an alignment between two contigs if the gap size is inferior to 1,000 nucleotides. BiSCoT is therefore not a traditional scaffolder since it can only be used to improve an existing scaffolding, based on an optical map.

Methods

Mandatory files loading

During the scaffolding, the Bionano scaffolder generates a visual representation of the hybrid scaffolds that is called an ‘anchor’. It also generates one ‘key’ file, which describes the mapping between map identifiers and contig names, several CMAP files, which contain the position of enzymatic labelling sites on contig maps and on the anchor, and a XMAP file, that describes the alignment between a contig map and an anchor. BiSCoT first loads the contigs into memory based on the key file. Then, the anchor CMAP file and contig CMAP files are loaded into memory. Finally, the XMAP file is parsed and loaded.

Scaffolding

Alignments of contigs onto anchors contained in the XMAP file are first sorted by their starting position on the anchor. Then, alignments on one anchor are parsed by pairs of adjacent contigs, i.e alignment of contig C is examined at the same time as contig C, with C aligned before C on the anchor. Aligned anchor labels are extracted from these alignments and a list of shared labels L is built. For the following cases, we suppose C and C to be aligned on the forward strand (Fig. 1).

Case 1: contig maps share at least one anchor label

The last label l from L is extracted and the position P of l on both contigs C and C is recovered from the CMAP files. In the resulting scaffold, the sequence of C will be included up to the P position and the sequence of C will be included from the P position. In this case, the gap is removed, both contigs C and C are fused and BiSCoT generates a single contig instead of two contigs initially separated by a gap in the input assembly.

Case 2: contig maps do not share anchor labels

Let Size be the size of the contig C, Sm and Em the start and end of an alignment on a contig map and Sa and Ea the corresponding coordinates on the anchor. The number n of bases between the last aligned label of C and the first aligned label of C is then: We then have to subtract the part d of C after the last aligned label of C and the part d of C before the first aligned label of C: Finally, we can compute the gap size g with: If g ≤ 1000, a BLAT (Kent, 2002) alignment of the last 30 kb of C is launched against the first 30kb of C. If an alignment is found and if its score is higher than 5,000, C and C are merged at the starting position of the alignment and, as in case1, BiSCoT generates a single contig instead of two contigs initially separated by a gap in the input assembly. Otherwise, a number g of Ns is inserted between C and C.

Case 3: insertion of small contigs

Let Sm and Em the start and end of an alignment on a contig map. If [Sm, Em] ⊂ [Sm, Em], then the left-most shared label identifier l and right-most shared label identifier l are extracted. If C has more of its labels mapped in this region than C, the sequence of C will be inserted between l and l in the scaffolds. Otherwise, the sequence of C remains unchanged and C will be included as a singleton sequence in the scaffolds file. Finally, if an Illumina polishing step was done before or after Bionano scaffolding, we recommend doing one additional round of polishing using Illumina reads after BiSCoT has been applied. Indeed, short reads tend to be aligned only against one copy of the duplicated regions, leaving the other copy unpolished.

Results and Discussions

Validation on simulated data

In order to simulate a genome assembly, we downloaded the chromosome 1 of the GRCh38.p12 human reference genome and fragmented it to create contigs. We generated 120 contigs with an N50 size of 2.4 Mb and a cumulative size of 231 Mb. Contigs were generated with either overlaps or gaps between them. We introduced 50 gaps with a mean length of 50 kb, the smallest being 3.4kbp long and the largest 99.6 kb long, and 50 overlaps with a mean size of 44kb, the smallest being 278b long and the largest 98.6 kb long. We also generated five contigs, with an N50 of 254 kb, that were subsequences of larger contigs, to simulate contained contigs. Then, we used these contigs and Bionano DLE and BspQI optical maps available on the Bionano Genomics website as input to the Bionano scaffolder. We gave the results of this scaffolding to BiSCoT and aligned all assemblies to the chromosome 1 reference using Quast (Gurevich et al., 2013, v5.0.2). BiSCoT was able to resolve 39 overlaps out of the 50 we introduced (Table S1), 31 using shared labels and 8 using a Blat alignment. The 11 remaining overlaps could not be resolved due to contigs not sharing enough labels or the overlap being too small to produce an alignment of sufficient confidence. BiSCoT was also able to integrate all contained contigs back to their original place in the assembly. Furthermore, BiSCoT did not close any of the real gaps introduced during the assembly generation. Regarding assembly metrics (Table S2), The N50 decreased by 1.4% in scaffolds and increased by 22% in contigs. The number of Ns in scaffolds decreased from 20.7Mb to 20.4Mb. Moreover, the number of misassemblies decreased by 68% after applying BiSCoT and the duplication ratio estimated by Quast decreased from 1.026 in Bionano scaffolds to 1.021 in BiSCoT scaffolds. In order to estimate the accuracy of gap sizes, we compared the gap sizes we introduced in the input assembly to the ones that were estimated using optical maps (Fig. S1). We found that estimated gap sizes were very close to the reality, with a mean scaled absolute error of 0.8%.

Validation on real data

We downloaded genome assemblies for which a DLE optical map was available: the NA12878 human genome (Jain et al., 2018), Brassica oleracea HDEM (PRJEB26621, Belser et al., 2018), Brassica rapa Z1 (PRJEB26620, Belser et al., 2018), Musa schizocarpa (PRJEB26661, Belser et al., 2018) and Sorghum bicolor Tx430 (PRJNA472170, Deschamps et al., 2018). The QUAST and BUSCO (Simão et al. (2015), v4.0.5) tools were used respectively to evaluate the number of misassemblies to the GRCh38.p12 human reference genome and the number of conserved genes among eukaryotes. In all cases, we first used the Bionano workflow to scaffold the draft assembly and launched BiSCoT using the files generated by the Bionano tools (Table 1, Tables S3–S6). The output of the Bionano workflow and BiSCoT are scaffolds, but we generated a contig file for each assembly by splitting each scaffold at every position with at least one N.

Table 1

Metrics of the NA12878 scaffolds and contigs before or after BiSCoT treatment.

Bold formatting indicates the best scoring assembly among contigs.

	Nanopore contigs	Bionano		BiSCoT
		Contigs	Scaffolds	Contigs	Scaffolds
Cumulative size	2,818,937,673	2,818,997,568	2,878,230,106	2,810,480,725	2,868,077,379
N50	11,821,944	10,566,783	86,858,024	12,894,141	86,833,728
L50	67	71	14	64	14
N90	2,143,851	1,863,173	26,054,782	2,321,940	26,037,000
L90	280	301	36	254	36
auN^a	15,164,719	14,547,428	82,760,251	15,977,835	82,474,548
# Ns	0	0	59,232,538	0	57,596,654
NGA50	5,794,944	5,729,014	10,816,842	6,360,576	11,713,900
NGA75	1,511,206	1,495,174	2,701,541	1,596,102	2,938,187
# misassemblies	1,356	1,299	1,602	1,278	1,515
Complete BUSCOs	235 (92.2%)	234 (91.8%)	231 (90.6%)	235 (92.2%)	231 (90.6%)
Duplicated BUSCOs	5 (2.0%)	4 (1.6%)	4 (1.6%)	4 (1.6%)	4 (1.6%)
Missing BUSCOs	11 (4.3%)	10 (3.9%)	13 (5.1%)	10 (3.9%)	13 (5.1%)

Notes.

auN is a new metric to measure assembly contiguity Li (2020).

Metrics of the NA12878 scaffolds and contigs before or after BiSCoT treatment.

Bold formatting indicates the best scoring assembly among contigs. Notes. auN is a new metric to measure assembly contiguity Li (2020). Concerning the NA12878 genome, we could detect 515 overlapping regions with a mean size of 47kb and representing in total 24.5 Mb of duplicated sequences. Among these 515 regions, 499 were corrected by BiSCoT using either shared labels (113 regions) or a BLAT alignment (386 regions) when no shared labels were found. Globally, the contig NX and NGAX metrics increased drastically: the contigs NGA50 of NA12878 increased by around 10%, going from 5.8 Mb to 6.3 Mb. The scaffolds NGAX metrics also increased: the scaffolds NGA50 increased from 10.8 Mb in Bionano scaffolds to 11.7Mb in BiSCoT scaffolds. Moreover, the number of Ns decreased marginally and the number of complete eukaryotic genes stayed the same in scaffolds. More importantly, when aligning the assemblies against the reference genome, we could detect a decrease in the number of mis-assemblies going from 1,602 in Bionano scaffolds to 1,515 in BiSCoT scaffolds. The same kind of results were observed in the four plant genomes with a slight decrease in scaffolds NX metrics and number of Ns but an increase in contigs NX metrics (Fig. 2 and Tables S2–S5).

Figure 2

(A) Distribution of the sizes of overlapping regions in the raw assemblies. Detection was done using either Bionano labels (Case 1) or a BLAT alignment (Case 2). (B) N50 contigs of raw assemblies and assemblies before or after BiSCoT treatment.

Summary

Thanks to the advent of long reads and optical maps technologies, it is now possible to obtain high-quality chromosome-scale assemblies. However, the official Bionano scaffolding tool does not always perform optimally when joining two contigs. Indeed, it does not merge two sequences when they share a genomic region, creating artificial gaps in the assembly. We developed BiSCoT, a tool that corrects these problematic regions in a prior Bionano scaffolding and showed that it increased significantly contiguity metrics of the resulting assembly, while preserving its quality. Click here for additional data file. Click here for additional data file.

12 in total

1. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs.

Authors: Felipe A Simão; Robert M Waterhouse; Panagiotis Ioannidis; Evgenia V Kriventseva; Evgeny M Zdobnov
Journal: Bioinformatics Date: 2015-06-09 Impact factor: 6.937

2. Chromosome-scale assemblies of plant genomes using nanopore long reads and optical maps.

Authors: Caroline Belser; Benjamin Istace; Erwan Denis; Marion Dubarry; Franc-Christophe Baurens; Cyril Falentin; Mathieu Genete; Wahiba Berrabah; Anne-Marie Chèvre; Régine Delourme; Gwenaëlle Deniot; France Denoeud; Philippe Duffé; Stefan Engelen; Arnaud Lemainque; Maria Manzanares-Dauleux; Guillaume Martin; Jérôme Morice; Benjamin Noel; Xavier Vekemans; Angélique D'Hont; Mathieu Rousseau-Gueutin; Valérie Barbe; Corinne Cruaud; Patrick Wincker; Jean-Marc Aury
Journal: Nat Plants Date: 2018-11-02 Impact factor: 15.793

3. De Novo Assembly of a New Solanum pennellii Accession Using Nanopore Sequencing.

Authors: Maximilian H-W Schmidt; Alexander Vogel; Alisandra K Denton; Benjamin Istace; Alexandra Wormit; Henri van de Geest; Marie E Bolger; Saleh Alseekh; Janina Maß; Christian Pfaff; Ulrich Schurr; Roger Chetelat; Florian Maumus; Jean-Marc Aury; Sergey Koren; Alisdair R Fernie; Dani Zamir; Anthony M Bolger; Björn Usadel
Journal: Plant Cell Date: 2017-10-12 Impact factor: 11.277

4. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes.

Authors: Kishwar Shafin; Trevor Pesout; Ryan Lorig-Roach; Marina Haukness; Hugh E Olsen; Colleen Bosworth; Joel Armstrong; Kristof Tigyi; Nicholas Maurer; Sergey Koren; Fritz J Sedlazeck; Tobias Marschall; Simon Mayes; Vania Costa; Justin M Zook; Kelvin J Liu; Duncan Kilburn; Melanie Sorensen; Katy M Munson; Mitchell R Vollger; Jean Monlong; Erik Garrison; Evan E Eichler; Sofie Salama; David Haussler; Richard E Green; Mark Akeson; Adam Phillippy; Karen H Miga; Paolo Carnevali; Miten Jain; Benedict Paten
Journal: Nat Biotechnol Date: 2020-05-04 Impact factor: 54.908

5. Nanopore sequencing and assembly of a human genome with ultra-long reads.

Authors: Miten Jain; Sergey Koren; Karen H Miga; Josh Quick; Arthur C Rand; Thomas A Sasani; John R Tyson; Andrew D Beggs; Alexander T Dilthey; Ian T Fiddes; Sunir Malla; Hannah Marriott; Tom Nieto; Justin O'Grady; Hugh E Olsen; Brent S Pedersen; Arang Rhie; Hollian Richardson; Aaron R Quinlan; Terrance P Snutch; Louise Tee; Benedict Paten; Adam M Phillippy; Jared T Simpson; Nicholas J Loman; Matthew Loose
Journal: Nat Biotechnol Date: 2018-01-29 Impact factor: 54.908

6. de novo assembly and population genomic survey of natural yeast isolates with the Oxford Nanopore MinION sequencer.

Authors: Benjamin Istace; Anne Friedrich; Léo d'Agata; Sébastien Faye; Emilie Payen; Odette Beluche; Claudia Caradec; Sabrina Davidas; Corinne Cruaud; Gianni Liti; Arnaud Lemainque; Stefan Engelen; Patrick Wincker; Joseph Schacherer; Jean-Marc Aury
Journal: Gigascience Date: 2017-02-01 Impact factor: 6.524

7. SMRT long reads and Direct Label and Stain optical maps allow the generation of a high-quality genome assembly for the European barn swallow (Hirundo rustica rustica).

Authors: Giulio Formenti; Matteo Chiara; Lucy Poveda; Kees-Jan Francoijs; Andrea Bonisoli-Alquati; Luca Canova; Luca Gianfranceschi; David Stephen Horner; Nicola Saino
Journal: Gigascience Date: 2019-01-01 Impact factor: 6.524

8. The chromosome-scale reference genome of black pepper provides insight into piperine biosynthesis.

Authors: Lisong Hu; Zhongping Xu; Maojun Wang; Rui Fan; Daojun Yuan; Baoduo Wu; Huasong Wu; Xiaowei Qin; Lin Yan; Lehe Tan; Soonliang Sim; Wen Li; Christopher A Saski; Henry Daniell; Jonathan F Wendel; Keith Lindsey; Xianlong Zhang; Chaoyun Hao; Shuangxia Jin
Journal: Nat Commun Date: 2019-10-16 Impact factor: 14.919

9. A chromosome-scale assembly of the sorghum genome using nanopore sequencing and optical mapping.

Authors: Stéphane Deschamps; Yun Zhang; Victor Llaca; Liang Ye; Abhijit Sanyal; Matthew King; Gregory May; Haining Lin
Journal: Nat Commun Date: 2018-11-19 Impact factor: 14.919

10. Chromosome-scale assembly comparison of the Korean Reference Genome KOREF from PromethION and PacBio with Hi-C mapping information.

Authors: Hui-Su Kim; Sungwon Jeon; Changjae Kim; Yeon Kyung Kim; Yun Sung Cho; Jungeun Kim; Asta Blazyte; Andrea Manica; Semin Lee; Jong Bhak
Journal: Gigascience Date: 2019-12-01 Impact factor: 6.524

4 in total

1. Long-read assembly of the Brassica napus reference genome Darmor-bzh.

Authors: Mathieu Rousseau-Gueutin; Caroline Belser; Corinne Da Silva; Gautier Richard; Benjamin Istace; Corinne Cruaud; Cyril Falentin; Franz Boideau; Julien Boutte; Regine Delourme; Gwenaëlle Deniot; Stefan Engelen; Julie Ferreira de Carvalho; Arnaud Lemainque; Loeiz Maillet; Jérôme Morice; Patrick Wincker; France Denoeud; Anne-Marie Chèvre; Jean-Marc Aury
Journal: Gigascience Date: 2020-12-15 Impact factor: 6.524

2. Telomere-to-telomere gapless chromosomes of banana using nanopore sequencing.

Authors: Caroline Belser; Franc-Christophe Baurens; Benjamin Noel; Guillaume Martin; Corinne Cruaud; Benjamin Istace; Nabila Yahiaoui; Karine Labadie; Eva Hřibová; Jaroslav Doležel; Arnaud Lemainque; Patrick Wincker; Angélique D'Hont; Jean-Marc Aury
Journal: Commun Biol Date: 2021-09-07

3. Long-read and chromosome-scale assembly of the hexaploid wheat genome achieves high resolution for research and breeding.

Authors: Jean-Marc Aury; Stefan Engelen; Benjamin Istace; Cécile Monat; Pauline Lasserre-Zuber; Caroline Belser; Corinne Cruaud; Hélène Rimbert; Philippe Leroy; Sandrine Arribat; Isabelle Dufau; Arnaud Bellec; David Grimbichler; Nathan Papon; Etienne Paux; Marion Ranoux; Adriana Alberti; Patrick Wincker; Frédéric Choulet
Journal: Gigascience Date: 2022-04-28 Impact factor: 7.658

4. Population genomics of apricots unravels domestication history and adaptive events.

Authors: Alexis Groppi; Shuo Liu; Amandine Cornille; Stéphane Decroocq; Quynh Trang Bui; David Tricon; Corinne Cruaud; Sandrine Arribat; Caroline Belser; William Marande; Jérôme Salse; Cécile Huneau; Nathalie Rodde; Wassim Rhalloussi; Stéphane Cauet; Benjamin Istace; Erwan Denis; Sébastien Carrère; Jean-Marc Audergon; Guillaume Roch; Patrick Lambert; Tetyana Zhebentyayeva; Wei-Sheng Liu; Olivier Bouchez; Céline Lopez-Roques; Rémy-Félix Serre; Robert Debuchy; Joseph Tran; Patrick Wincker; Xilong Chen; Pierre Pétriacq; Aurélien Barre; Macha Nikolski; Jean-Marc Aury; Albert Glenn Abbott; Tatiana Giraud; Véronique Decroocq
Journal: Nat Commun Date: 2021-06-25 Impact factor: 14.919

4 in total