Leszek P Pryszcz, Toni Gabaldón.
Abstract
Many genomes display high levels of heterozygosity (i.e. the presence of different alleles at the same loci in homologous chromosomes), with those of hybrid organisms being an extreme case. The assembly of highly heterozygous genomes from short sequencing reads is a challenging task because it is difficult to accurately recover the different haplotypes. When confronted with highly heterozygous genomes, the standard assembly process tends to collapse homozygous regions and to report heterozygous regions as alternative contigs. The boundaries between homozygous and heterozygous regions result in multiple assembly paths that are hard to resolve, which leads to highly fragmented assemblies with a total size larger than expected. This, in turn, causes numerous problems in downstream analyses, such as fragmented gene models, wrong gene copy numbers, or broken synteny. To circumvent these problems we have developed a pipeline that specifically deals with the assembly of heterozygous genomes by introducing a step to recognise and selectively remove alternative heterozygous contigs. We tested our pipeline on simulated and naturally occurring heterozygous genomes and compared its accuracy to that of other existing tools. Our method is freely available at https://github.com/Gabaldonlab/redundans.
Year: 2016 PMID: 27131372 PMCID: PMC4937319 DOI: 10.1093/nar/gkw294
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Genome assembly from short reads. Standard (A) and heterozygous (B) genome assembly pipelines are compared. Diploid chromosomes are shown as horizontal bars, with heterozygous regions marked in red and blue. Paired-end reads produced by sequencing those chromosomes are shown as smaller bars linked by thin lines below the chromosomes. Assemblies are shown as horizontal bars, in the same way as chromosomes, but a single reference is produced for diploid chromosomes. The heterozygous genome assembly pipeline consists of five steps. (a) Standard de novo assembly is performed and (b) gaps are optionally closed. The resulting assembly is larger than expected and fragmented, because two alternative contigs are recovered from heterozygous regions (blue and red), while a single contig is recovered from homozygous regions (gray). Further scaffolding of such an assembly is impossible, as homozygous contigs can be joined to either of the heterozygous contigs (blue and red). (c) To overcome this, redundant contigs from heterozygous regions are removed (here the red contig) and (d) the homogenised assembly is further scaffolded. (e) Finally, gaps are closed. (C) Schematic representation of the Redundans mechanisms. The Redundans pipeline consists of three steps: reduction, scaffolding and gap closing. The program takes as input assembled contigs and paired-end and/or mate-pair sequencing libraries, and returns a scaffolded homozygous genome assembly that should be less fragmented and have a smaller total size than the input contigs. Note that scaffolding and gap closing may be executed in multiple iterations. In the first step, only heterozygous contigs are used. Paired-end and/or mate-pair libraries are used for scaffolding and gap closing. The latter steps can be repeated to achieve incremental assembly improvement. Redundans is very flexible, so any of the above-mentioned steps can be omitted.
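The reduction step (c) can be illustrated with a toy sketch. This is not the Redundans implementation: the function name `reduce_contigs`, the hit tuple layout, and both thresholds (`min_identity`, `min_overlap`) are assumptions made only for illustration. The sketch presumes that pairwise contig alignments are already available and drops the shorter contig of a pair when most of it aligns to a longer contig that is still retained.

```python
def reduce_contigs(lengths, hits, min_identity=0.5, min_overlap=0.8):
    """Toy reduction of redundant (alternative heterozygous) contigs.

    lengths: dict mapping contig name -> contig length in bases.
    hits: iterable of (query, target, identity, aligned_query_bases),
          a hypothetical summary of pairwise contig alignments.
    The thresholds are illustrative assumptions, not tool defaults.
    Returns the set of contig names that are kept.
    """
    redundant = set()
    # Visit hits from the shortest query upwards, so short contigs are
    # judged against longer ones that are still part of the assembly.
    for query, target, identity, aligned in sorted(
            hits, key=lambda h: lengths[h[0]]):
        if query == target or query in redundant:
            continue
        # Only the shorter contig of a pair may be dropped.
        if lengths[query] > lengths[target]:
            continue
        # Do not drop a contig in favour of one that was already removed.
        if target in redundant:
            continue
        overlap = aligned / lengths[query]
        if identity >= min_identity and overlap >= min_overlap:
            redundant.add(query)
    return set(lengths) - redundant


# Hypothetical example: hetA and hetB are alternative contigs of one region.
lengths = {"hom1": 1000, "hetA": 500, "hetB": 480, "uniq": 300}
hits = [
    ("hetB", "hetA", 0.95, 470),
    ("hetA", "hetB", 0.95, 470),
    ("uniq", "hom1", 0.99, 60),   # short local match only, so it is kept
]
kept = reduce_contigs(lengths, hits)
```

In this example only `hetB` is dropped, which mirrors the removal of the red contig in panel (c); the homozygous contig and the longer heterozygous contig remain available for scaffolding.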
Figure 2. Characteristics of heterozygous genome assemblies. The total SOAPdenovo2 assembly size (blue bars) and the number of scaffolds longer than 1 kb (red plot) are given for the reconstructed genome assemblies: one homozygous (LOH of 100%) and five heterozygous (with 5% divergence between haplomes and varying levels of loss of heterozygosity: 0%, 20%, 40%, 60%, 80%). The expected genome size is marked with a purple baseline. The assemblies recovered for heterozygous genomes are much more fragmented (2237–3743 scaffolds) than the one recovered for the homozygous genome (250 scaffolds). The assembly reconstructed for the homozygous genome (100% LOH) has the expected size, while the remaining assemblies are larger than expected (119.20–198.85% of the expected size). The size of the genome assembly is negatively correlated with the LOH level (Pearson coefficient of -0.9996).
Figure 3. The assembly of a simulated heterozygous genome. Pairwise genome alignment of the final assembly for the simulated heterozygous genome with 0% LOH and its reference, C. parapsilosis CDC317. Synteny blocks are coloured according to the identity level between each pair of query and target sequences. The assembled genome represents a mixture of two haplomes: 5% diverged from (blue or violet) and identical to (red) the reference genome. In addition, two short regions with divergence of 2–3% (green and cyan) are present in HE605206. The regions with intermediate divergence were likely assembled from very short contigs from both haplomes.
Genome and assembly statistics
| Assembly | Value | No. of scaffolds | N50 | N90 | Ns | Longest scaffold | Size [%] |
|---|---|---|---|---|---|---|---|
| ref | | 9 | 2 091 826 | 957 321 | 0 | 3 023 470 | - |
| SPAdes | min | 188 | 2607 | 111 | 0 | 35 992 | 102.61 |
| | max | 3861 | 183 185 | 38 399 | 469 611 | 975 662 | 199.59 |
| dipSPAdes | min | 47 | 140 841 | 10 876 | 0 | 369 644 | 44.06 |
| | max | 201 | 443 228 | 168 647 | 0 | 1 277 709 | 100.67 |
| Platanus | min | 97 | 8525 | 2240 | 7816 | 62 024 | 99.30 |
| | max | 3503 | 314 901 | 105 072 | 115 364 | 1 233 366 | 156.29 |
| Redundans (a) | min | 16 | 294 087 | 88 621 | 277 | 870 098 | 99.58 |
| | max | 109 | 1 880 160 | 557 645 | 73 286 | 3 104 871 | 106.17 |
| Redundans (b) | min | 13 | 432 772 | 179 116 | 3053 | 1 026 161 | 99.39 |
| | max | 49 | 1 599 252 | 651 286 | 38 113 | 3 144 038 | 100.35 |
| ref | | 5 | 23 459 830 | 18 585 056 | 185 738 | 30 427 671 | - |
| SPAdes | min | 7261 | 1242 | 111 | 0 | 14 802 | 84.00 |
| | max | 45 769 | 91 641 | 6562 | 0 | 786 328 | 190.87 |
| dipSPAdes | min | 978 | 38 818 | 4393 | 0 | 378 250 | 57.99 |
| | max | 4305 | 217 604 | 41 794 | 0 | 1 120 742 | 81.89 |
| Platanus | min | 1225 | 9078 | 2304 | 300 250 | 89 082 | 96.80 |
| | max | 29 856 | 330 677 | 63 136 | 1 907 392 | 2 741 383 | 152.32 |
| Redundans (a) | min | 127 | 810 991 | 51 721 | 191 751 | 2 162 015 | 92.30 |
| | max | 1471 | 2 909 961 | 612 329 | 1 830 833 | 9 950 963 | 101.68 |
| Redundans (b) | min | 97 | 1 796 239 | 163 909 | 151 424 | 7 206 143 | 97.67 |
| | max | 864 | 13 602 130 | 4 371 075 | 1 162 308 | 28 354 754 | 101.29 |
All genomes used in this study are listed. For each reference genome, we provide its accession together with the number of scaffolds/chromosomes, N50 and N90, the cumulative size of gaps and the length of the longest scaffold/chromosome. For assemblies produced in this study from simulated heterozygous genomes with various levels of divergence between heterozygous regions (1–40%) and of loss of heterozygosity (0–100%), we provide the minimum and maximum value obtained for each of these metrics. In addition, the minimum and maximum percentage of the expected assembly size is given.
(a) Redundans reconstruction started with SPAdes contigs.
(b) Redundans reconstruction started with Platanus scaffolds.
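The metrics reported in the table can be computed from a set of scaffold sequences roughly as follows. This is a minimal sketch using the common definition of Nx (the length of the scaffold at which the cumulative size, counted from the longest scaffold down, first reaches x% of the assembly). The function names `nx` and `assembly_stats` and the toy sequences are assumptions for illustration and are not taken from the study.

```python
def nx(lengths, fraction):
    """Return the Nx value for a list of scaffold lengths.

    Scaffolds are visited from longest to shortest; the result is the
    length of the scaffold at which the running total first reaches
    `fraction` of the total assembly size (0.5 for N50, 0.9 for N90).
    """
    if not lengths:
        raise ValueError("no scaffold lengths given")
    threshold = fraction * sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length


def assembly_stats(sequences, expected_size):
    """Summarise an assembly given its sequences and the expected size."""
    lengths = [len(seq) for seq in sequences]
    return {
        "scaffolds": len(lengths),
        "N50": nx(lengths, 0.5),
        "N90": nx(lengths, 0.9),
        # Cumulative gap size: number of N characters in all scaffolds.
        "Ns": sum(seq.upper().count("N") for seq in sequences),
        "longest": max(lengths),
        # Assembly size as a percentage of the expected genome size.
        "size_pct": 100.0 * sum(lengths) / expected_size,
    }


# Toy scaffolds of lengths 10, 8, 6, 4 and 2 (one contains a 4-base gap).
toy = ["A" * 10, "ACGT" + "N" * 4, "C" * 6, "G" * 4, "TT"]
stats = assembly_stats(toy, expected_size=20)
```

With an expected size of 20 bases, the 30-base toy assembly comes out at 150% of the expected size, which is the kind of inflation the Size [%] column reports for uncollapsed heterozygous assemblies.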