Huilong Du, Chengzhi Liang.
Abstract
The abundant repetitive sequences in complex eukaryotic genomes cause fragmented assemblies, which lose value as reference genomes, often due to incomplete gene sequences and unanchored or mispositioned contigs on chromosomes. Here we report a genome assembly method HERA, which resolves repeats efficiently by constructing a connection graph from an overlap graph. We test HERA on the genomes of rice, maize, human, and Tartary buckwheat with single-molecule sequencing and mapping data. HERA correctly assembles most of the previously unassembled regions, resulting in dramatically improved, highly contiguous genome assemblies with newly assembled gene sequences. For example, the maize contig N50 size reaches 61.2 Mb and the Tartary buckwheat genome comprises only 20 contigs. HERA can also be used to fill gaps and fix errors in reference genomes. The application of HERA will greatly improve the quality of new or existing assemblies of complex genomes.
Year: 2019 PMID: 31767853 PMCID: PMC6877557 DOI: 10.1038/s41467-019-13355-3
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 Overview of HERA. a Two copies of repeats (R1 and R2) are similar to each other but they also contain sequence variations which can be found in the reads originating from them. The alignments of junction reads (across the boundary between repeat and unique sequences) to a different repeat copy form overhangs of unaligned, unique sequences. b A subgraph of an overlap graph corresponding to the genome segments and sequencing reads shown in (a). The sequencing reads can be classified into three types: unique reads (U), repeat reads (R) and junction reads (UR). c The path extension from contig end C1h can reach a set of other contig ends, which include C2h, the true target, and C4t, the false target, and possibly others (Cjh) from background noise. d A connection graph showing the number of paths (NP) between each pair of contig nodes. e A subgraph of a connection graph with examples of conflicting connections. The conflicting indices of two contig ends were: CI54t = 211/215 = 0.98; CI78h = 211/218 = 0.97. These conflicting connections can be resolved because the number of paths between C365t-C55h was very small, so that C78h-C365t can be connected first. f Sequence alignments showing a fragment of at least 36 kb in C78 being similar to the connecting sequence between C54t and C55h and 18 kb of highly similar sequence in C365t overlapping C78h. g The alignments to BioNano genome maps confirmed that the connections of C54t-C55h and C78h-C365t were correct.
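Panel d describes a connection graph that records the number of paths (NP) between each pair of contig ends. The bookkeeping behind such a graph can be sketched as below, assuming each path found by extension has already been reduced to its two endpoint contig ends. The function name and the toy path list are hypothetical and are not HERA's code or data; the last two lines only re-check the ratios quoted in panel e.

```python
from collections import Counter

def count_paths(path_endpoints):
    """Return NP, the number of paths, for each unordered pair of contig ends."""
    np_counts = Counter()
    for end_a, end_b in path_endpoints:
        if end_a != end_b:  # ignore paths that return to their own start
            np_counts[tuple(sorted((end_a, end_b)))] += 1
    return np_counts

# Hypothetical toy input: four paths link C1h and C2h, one links C1h and C4t.
toy_paths = [("C1h", "C2h")] * 3 + [("C2h", "C1h"), ("C1h", "C4t")]
np_counts = count_paths(toy_paths)

# Ratios quoted in panel e for the two conflicting contig ends.
ci_54t = round(211 / 215, 2)
ci_78h = round(211 / 218, 2)
```

Storing each pair in sorted order means a path found from either contig end contributes to the same count, which matches the caption's description of NP as a property of the pair.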
Fig. 2 An illustration of identifying tandemly repetitive sequences by HERA. a A tandemly repetitive sequence on chromosome 5 of R498 with a unit length of 65 kb. The upper green horizontal bar represents the assembled sequence lacking a unit and the lower blue bar represents the BioNano map. b A repetitive sequence on chromosome 8 of R498 with a unit length of 22 kb. c The length distribution of HERA generated tiling paths for the repeat shown in (a). The paths are divided into several clusters and the distances between adjacent peaks are 65 kb which matched the repeat unit length in (a). The second peak represents the whole region of two repeat units (130 kb). d The length distribution of HERA generated tiling paths for the repeat in (b). The paths are divided into two clusters and the distance between the two peaks is around 35 kb. e The schematic representation of the repeat region in (b). In this region, there are two highly similar repeat units of 22 kb (rectangle) being separated by one of the two dissimilar repeat units of 13 kb (triangle). Ref, the full repeat region; ctg, the flanking sequences to be connected; cns1 and cns2, excluding the flanking sequences shown in ctg, correspond to the second and the first peak in (d), respectively.
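Panels c and d read the repeat unit length off the spacing between peaks in the tiling-path length distribution. A minimal sketch of that idea follows, assuming a simple gap-based one-dimensional clustering stands in for whatever peak detection the authors used. The function names, the `max_gap` parameter and the toy lengths (in kb) are hypothetical.

```python
from statistics import mean

def cluster_lengths(lengths, max_gap):
    """Group sorted lengths, starting a new cluster when the gap exceeds max_gap."""
    clusters = []
    for value in sorted(lengths):
        if clusters and value - clusters[-1][-1] <= max_gap:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    return clusters

def peak_spacings(lengths, max_gap):
    """Distances between the mean lengths of adjacent clusters."""
    peaks = [mean(cluster) for cluster in cluster_lengths(lengths, max_gap)]
    return [right - left for left, right in zip(peaks, peaks[1:])]

# Hypothetical tiling-path lengths in kb, forming three clusters.
toy_lengths_kb = [64, 65, 66, 129, 130, 131, 194, 195, 196]
spacings_kb = peak_spacings(toy_lengths_kb, max_gap=10)
```

With these toy values the adjacent cluster means are 65 kb apart, mirroring the 65 kb spacing reported for the chromosome 5 repeat.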
The summary of genome assemblies.
| Genome | Method | Seq Num | N50 (Mb) | Max Len (Mb) | Total Len (Mb) |
|---|---|---|---|---|---|
| R498 | CANU | 811 | 1.31 | 5.43 | 402.5 |
| | CANU + HERA | 206 | 13.24 | 25.88 | 399.2 |
| | BioNano map | 453 | 1.22 | 5.78 | 406.1 |
| | BioNano + CANU (a) | 105 | 5.67 | 18.25 | 388.9 |
| | BioNano + HERA (a) | 32 | 17.51 | 32.2 | 390.2 |
| | BioNano + HERA + GF | 89 | 14.42 | 30.03 | 391.1 |
| | On Chromosome | 73 | 14.42 | 30.03 | 390.5 |
| | R498_HERA1 (b) | 40 | 15.38 | 30.03 | 391.6 |
| B73 | RefGen_v4 (PBcR) | 2,790 | 1.28 | 7.26 | 2106.3 |
| | PBcR + HERA | 416 | 31.53 | 121.28 | 2118.2 |
| | BioNano map | 1,271 | 2.51 | 12.45 | 2079.7 |
| | BioNano + PBcR (a) | 319 | 10.2 | 45.88 | 2060.3 |
| | BioNano + HERA (a) | 68 | 107.5 | 194.6 | 2110 |
| | BioNano + HERA + GF | 130 | 61.2 | 142.5 | 2105.8 |
| | B73_HERA1 (c) | 86 | 61.2 | 142.5 | 2103.9 |
| HX1 | HX1_FALCON | 2,710 | 8.33 | 38.18 | 2873.2 |
| | FALCON + HERA | 850 | 32.53 | 109.81 | 2840.3 |
| | BioNano map | 2,487 | 1.68 | 11.41 | 2890.7 |
| | BioNano + FALCON (a) | 325 | 24.05 | 83.66 | 2724.4 |
| | BioNano + HERA + GF | 1,518 | 54.41 | 109.81 | 2871.3 |
| | HX1_HERA1 (c) | 815 | 54.41 | 109.81 | 2841.7 |
| Pinku1 | CANU | 839 | 1.1 | 10.83 | 452.1 |
| | PBcR | 6,033 | 0.45 | 2.11 | 587.7 |
| | PBcR + HERA | 48 | 22.24 | 43.19 | 453.4 |
| | BioNano map | 374 | 1.71 | 6.85 | 461.3 |
| | BioNano + PBcR (a) | 550 | 5.43 | 15.04 | 451.9 |
| | BioNano + HERA (a) | 22 | 51.77 | 61.99 | 453.7 |
| | BioNano + HERA + GF | 30 | 27.85 | 49.83 | 453.2 |
| | Pinku1_HERA1 (c) | 20 | 51.77 | 62.08 | 453.5 |
Seq Num: the total number of contigs or scaffolds (BioNano maps do not have sequences). +GF: with gap filling. On Chromosome: the HERA contigs anchored on nuclear genome chromosomes.
(a) Hybrid scaffolds included unfilled gaps.
(b) With gap filling after anchoring on chromosomes.
(c) Only the contigs anchored on chromosomes were included here (no gap filling after anchoring on chromosomes). The unanchored sequences may include contamination from other species. The sequences on chromosomes were corrected using Illumina short reads, which changed the sequence length.
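The N50 column follows the usual definition: sort the sequences from longest to shortest, accumulate their lengths, and report the length at which the running sum first reaches half of the total. A minimal sketch of that standard definition is given here; the function name and the toy lengths are hypothetical and are not taken from any assembly in the table.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    if not lengths:
        raise ValueError("n50 needs at least one sequence length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer-safe test for "at least half"
            return length

# Hypothetical toy assembly (arbitrary units): the total is 32, so half is 16.
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
toy_n50 = n50(toy_contigs)
```

In the toy case the two longest contigs (8 + 8) already reach half of the total, so N50 is 8 even though most contigs are much shorter. This is why merging a few long contigs can raise N50 sharply while the total length barely changes.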
Fig. 3 Comparison of maize B73 HERA assembly and RefGen_v4. a The comparison of HERA assembled B73 genome with the published B73 RefGen_v4. The top green horizontal bar represents RefGen_v4 and the bottom blue horizontal bar represents the HERA assembly. Each black triangle represents a sequence gap. The red vertical bars represent the >10 kb InDels that were present in the contigs of RefGen_v4. The purple vertical bars represent the >10 kb InDels introduced by HERA with the orange vertical bars showing the positions of the corresponding gaps in RefGen_v4. b An example of gap filling and sequence correction by HERA. The green horizontal bars represent the maize genome sequences and the blue horizontal bars represent BioNano maps. The upper panel is the alignment between RefGen_v4 and BioNano maps, and the lower panel is the alignment between the HERA assembly and BioNano maps. The gaps (right red box in the upper panel) in RefGen_v4 were filled with ‘N’s, which were correctly assembled by HERA. The inserted sequence in the left red box was not present in the HERA assembly.
The running time of HERA.
| Genome | Overlap (BWA) time (h) | Total time (h) | BWA ratio |
|---|---|---|---|
| R498 | 2,515 | 2,641 | 95.23% |
| Pinku1 | 2,922 | 3,043 | 96.02% |
| B73 | 14,409 | 16,265 | 88.59% |
| HX1 | 17,657 | 20,141 | 87.67% |
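The BWA ratio column appears to be the overlap time divided by the total time. A short check under that assumption, using only the values in the table (the variable and function names are illustrative):

```python
# Hours copied from the running-time table: (overlap with BWA, total).
running_hours = {
    "R498": (2515, 2641),
    "Pinku1": (2922, 3043),
    "B73": (14409, 16265),
    "HX1": (17657, 20141),
}

def bwa_share(overlap_hours, total_hours):
    """Percentage of the total running time spent in the overlap step."""
    return round(100 * overlap_hours / total_hours, 2)

shares = {genome: bwa_share(*hours) for genome, hours in running_hours.items()}
for genome, share in shares.items():
    print(f"{genome}: {share:.2f}%")
```

The computed percentages agree with the table to two decimal places, so the overlap step accounts for roughly 88% to 96% of the running time in all four genomes.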