Jean P Elbers1, Mark F Rogers2, Polina L Perelman3, Anastasia A Proskuryakova3, Natalia A Serdyukova3, Warren E Johnson4, Petr Horin5, Jukka Corander6,7, David Murphy8, Pamela A Burger1.
Abstract
Researchers have assembled thousands of eukaryotic genomes using Illumina reads, but traditional mate-pair libraries cannot span all repetitive elements, resulting in highly fragmented assemblies. However, both chromosome conformation capture techniques, such as Hi-C and Dovetail Genomics Chicago libraries, and long-read sequencing, such as Pacific Biosciences and Oxford Nanopore, help span and resolve repetitive regions and therefore improve genome assemblies. One important livestock species of arid regions that does not have a high-quality contiguous reference genome is the dromedary (Camelus dromedarius). Draft genomes exist but are highly fragmented, and a high-quality reference genome is needed to understand adaptation to desert environments and artificial selection during domestication. Dromedaries are among the last livestock species to have been domesticated, and together with wild and domestic Bactrian camels, they are the only representatives of the Camelini tribe, which highlights their evolutionary significance. Here we describe our efforts to improve the North African dromedary genome. We used Chicago and Hi-C sequencing libraries from Dovetail Genomics to resolve the order of previously assembled contigs, producing almost chromosome-level scaffolds. Remaining gaps were filled with Pacific Biosciences long reads, and then scaffolds were comparatively mapped to chromosomes. Long reads added 99.32 Mbp to the total length of the new assembly. Dovetail Chicago and Hi-C libraries increased the longest scaffold over 12-fold, from 9.71 Mbp to 124.99 Mbp, and the scaffold N50 over 50-fold, from 1.48 Mbp to 75.02 Mbp. We demonstrate that Illumina de novo assemblies can be substantially upgraded by combining chromosome conformation capture and long-read sequencing.
Keywords: chromosome conformation capture; chromosome mapping; dromedary; genome annotation; genome assembly; scaffolding
Year: 2019 PMID: 30972949 PMCID: PMC6618069 DOI: 10.1111/1755-0998.13020
Source DB: PubMed Journal: Mol Ecol Resour ISSN: 1755-098X Impact factor: 7.090
Assembly statistics for the original North African dromedary assembly (CamDro1) (Fitak et al., 2016; GenBank accession: GCA_000803125.1); the North African dromedary assembly after improvement (CamDro2) by Chicago and Dovetail Hi‐C sequencing libraries, followed by filling in gaps with 11x coverage PacBio Sequel reads using pbjelly (English et al., 2012), next polishing with Illumina short‐insert libraries using pilon (Walker et al., 2014), and then filling in gaps with Illumina short‐insert libraries using abyss sealer (Jackman et al., 2017), and polishing again but also filling in gaps with Pilon; and for comparison the Arabian dromedary assembly (Wu et al., 2014; GCA_000767585.1)
| Assembly | Original North African Dromedary (CamDro1) | Improved North African Dromedary (CamDro2) | Arabian Dromedary |
|---|---|---|---|
| Total size | 2,055,063,633 | 2,154,386,959 | 2,004,047,047 |
| Gap length | 53,035,436 | 20,341,506 | 22,407,814 |
| Scaffolds | | | |
| Number | 35,752 | 23,439 | 32,572 |
| Longest | 9,719,801 | 124,992,380 | 23,736,781 |
| N90 | 260,185 | 24,922,612 | 689,795 |
| L90 | 1,592 | 31 | 594 |
| N50 | 1,482,444 | 75,021,453 | 4,188,677 |
| L50 | 393 | 11 | 132 |
| Contigs | | | |
| Number | 133,158 | 45,969 | 93,701 |
| Longest | 413,938 | 9,491,684 | 896,174 |
| N90 | 11,508 | 177,667 | 17,513 |
| L90 | 42,697 | 1,944 | 25,175 |
| N50 | 50,278 | 1,333,231 | 88,36 |
| L50 | 11,378 | 423 | 6,074 |
| Single‐copy BUSCOs | 3,820 | 3,851 | 3,811 |
| Duplicated BUSCOs | 22 | 24 | 19 |
| Fragmented BUSCOs | 164 | 133 | 178 |
| Missing BUSCOs | 98 | 96 | 96 |
| Proportion of complete BUSCOs | 0.936 | 0.944 | 0.933 |
N90/N50 are the scaffold or contig lengths such that the sum of the lengths of all scaffolds or contigs of this size or larger is equal to 90/50% of the total assembly length.
L90/L50 are the smallest number of scaffolds or contigs that make up at least 90/50% of the total assembly length.
Using minimum gap length of 25 bp.
BUSCOs: Benchmarking Universal Single‐Copy Orthologs (Simão et al., 2015) are mammalian BUSCOs from orthodb v. 9.1 genes (Zdobnov et al., 2017).
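The N50/L50 and N90/L90 definitions in the table notes can be illustrated with a minimal Python sketch. The function names `nx_lx` and `busco_complete_proportion` and the toy length list are hypothetical and are not taken from the authors' pipeline. The BUSCO helper assumes that "complete" means single-copy plus duplicated BUSCOs, an assumption that reproduces the rounded proportions reported in the table.

```python
def nx_lx(lengths, pct):
    """Return (Nx, Lx) for a list of scaffold or contig lengths.

    Nx is the length of the shortest sequence in the smallest set of
    longest sequences that covers at least pct% of the assembly;
    Lx is the number of sequences in that set.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= pct * total:
            return length, count
    raise ValueError("pct must be between 0 and 100")


def busco_complete_proportion(single, duplicated, fragmented, missing):
    """Proportion of complete BUSCOs, assuming complete = single + duplicated."""
    complete = single + duplicated
    return complete / (complete + fragmented + missing)


# Toy example (hypothetical lengths, not real scaffolds).
toy_lengths = [40, 30, 20, 10]
print(nx_lx(toy_lengths, 50))  # (30, 2)
print(nx_lx(toy_lengths, 90))  # (20, 3)

# BUSCO counts for CamDro2 as listed in the table.
print(round(busco_complete_proportion(3851, 24, 133, 96), 3))  # 0.944
```

Applied to the full list of scaffold lengths of an assembly, `nx_lx(lengths, 50)` would give the kind of N50/L50 pair reported in the table; the real values depend on the complete length list, which is not reproduced here.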
Figure 1. Dovetail Genomics’ Hi‐C linkage density plot for Hi‐C reads mapped to the Hi‐C assembly. X‐ and Y‐axes give the cumulative mapping positions of the first and second read in a read pair, respectively, grouped into bins. The colour of each square gives the number of read pairs within that bin. Grey vertical and white horizontal lines mark the borders between scaffolds. Only scaffolds >1 Mbp are shown [Colour figure can be viewed at wileyonlinelibrary.com]
Figure 2. Cumulative assembly length for scaffolds of the original North African dromedary assembly (CamDro1; Fitak et al., 2016; GenBank accession: GCA_000803125.1); the North African dromedary assembly after improvement (CamDro2); and for the Arabian dromedary assembly (Wu et al., 2014; GCA_000767585.1). Circles and triangles indicate L50 and L90 values, respectively. L50/L90 are the smallest number of scaffolds that make up at least 50/90% of the total assembly length [Colour figure can be viewed at wileyonlinelibrary.com]
Figure 3. D‐GENIES (Cabanettes & Klopp, 2018) dot plot made with Minimap2 (Li, 2018) whole‐genome alignment between CamDro1 and CamDro2 assemblies. Contigs are sorted and matches are filtered out by size using ≤0.001% dot plot width and identity ≤0.75 [Colour figure can be viewed at wileyonlinelibrary.com]
Figure 4. Frequency polygons of query sequence length (predicted proteins) divided by subject (UniProt/TrEMBL) sequence length for diamond (Buchfink et al., 2015) mapped maker (Holt & Yandell, 2011) predicted proteins against UniProt/TrEMBL release 2018_04 database for: (red line) the original North African dromedary genome (CamDro1; Fitak et al., 2016 predicted protein sequences; GenBank accession: GCA_000803125.1); (green line) the North African dromedary genome after adding ~11× PacBio sequencing reads (CamDro2) for MAKER run 1; and (blue line) MAKER run 2. Values near 1.0 are desired, indicating untruncated proteins due to lack of indels from PacBio reads [Colour figure can be viewed at wileyonlinelibrary.com]
Figure 5. Cumulative proportion of transcripts with specific or lower annotation edit distance (AED) for each MAKER run. MAKER run 1 (solid line) had AED ≤0.50 for 78.4% of transcripts, whilst MAKER run 2 (dashed line) had 39.2% of transcripts with AED ≤0.50. Grey vertical line indicates AED = 0.50. Note that having a larger proportion of lower AED values indicates a genome annotation that is more congruent with the evidence used during the annotation process
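The cumulative AED proportion plotted in Figure 5 can be sketched as follows. This is only an illustration under stated assumptions: the function names `proportion_at_or_below` and `cumulative_curve` and the toy AED values are hypothetical, and the sketch does not parse MAKER output or reproduce the reported percentages.

```python
def proportion_at_or_below(aed_values, threshold=0.50):
    """Fraction of transcripts whose AED is less than or equal to threshold."""
    if not aed_values:
        raise ValueError("aed_values must not be empty")
    return sum(1 for v in aed_values if v <= threshold) / len(aed_values)


def cumulative_curve(aed_values, step_percent=25):
    """Return (threshold, cumulative proportion) pairs from AED 0.0 to 1.0.

    Thresholds are generated from integer percentages to avoid
    accumulating floating-point error.
    """
    return [
        (t / 100, proportion_at_or_below(aed_values, t / 100))
        for t in range(0, 101, step_percent)
    ]


# Toy AED values (hypothetical, one value per transcript).
toy_aed = [0.1, 0.2, 0.5, 0.9]
print(proportion_at_or_below(toy_aed))  # 0.75
print(cumulative_curve(toy_aed))
```

A curve that rises earlier, with a larger proportion at low thresholds, corresponds to an annotation that agrees more closely with its supporting evidence.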