S A Lapp1, J A Geraldo2, J-T Chien1, F Ay3, S B Pakala4, G Batugedara5, J Humphrey4, J D DeBarry4, K G Le Roch5, M R Galinski1, J C Kissinger4.
Abstract
Plasmodium knowlesi has risen in importance as a zoonotic parasite that has been causing regular episodes of malaria throughout South East Asia. The P. knowlesi genome sequence generated in 2008 highlighted and confirmed many similarities and differences in Plasmodium species, including a global view of several multigene families, such as the large SICAvar multigene family encoding the variant antigens known as the schizont-infected cell agglutination proteins. However, repetitive DNA sequences are the bane of any genome project, and this and other Plasmodium genome projects have not been immune to the gaps, rearrangements and other pitfalls created by these genomic features. Today, long-read PacBio and chromatin conformation technologies are overcoming such obstacles. Here, based on the use of these technologies, we present a highly refined de novo P. knowlesi genome sequence of the Pk1(A+) clone. This sequence and annotation, referred to as the 'MaHPIC Pk genome sequence', includes manual annotation of the SICAvar gene family with 136 full-length members categorized as type I or II. This sequence provides a framework that will permit a better understanding of the SICAvar repertoire, selective pressures acting on this gene family and mechanisms of antigenic variation in this species and other pathogens.
Keywords: Plasmodium knowlesi; SICAvar; Hi-C; MaHPIC; PacBio; annotation; antigenic variation; genome; sequence
Year: 2017 PMID: 28720171 PMCID: PMC5798397 DOI: 10.1017/S0031182017001329
Source DB: PubMed Journal: Parasitology ISSN: 0031-1820 Impact factor: 3.234
Characteristics of nuclear genome sequences utilized in this study
| Name | Genome size (nt) | Scaffold number | Unplaced contig number | Gaps | N50 contig length | N50 scaffold length | Technology |
|---|---|---|---|---|---|---|---|
| PKNOH-PacBio | 24 588 173 | N/A | 50 | N/A | 1 207 278 | N/A | PacBio |
| PKNOH-PacBio-Hi-C | 24 771 595 | 14 | 14 | 25 | 16 231 | 1 832 627 | PacBio & Hi-C |
| PKNH V2 | 24 359 384 | 14 | 148 | 77 | N/A | 2 162 603 | Sanger & Illumina |
| PKNA1-C.2 | 24 359 887 | N/A | 45 | N/A | 1 061 780 | N/A | PacBio |
| PKNA1-H.1 | 23 958 038 | N/A | 37 | N/A | 1 017 166 | N/A | PacBio |
N/A, not applicable.
Genome size includes scaffolds and unplaced contigs. Contigs are only unplaced, i.e. non-scaffolded sequences. Gaps are only present in scaffolds. PKNOH data presented here have had organellar sequences removed. Data are from GenBank.
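The N50 columns above follow the usual assembly-statistics definition: the length L such that sequences of length L or longer together cover at least half of the total. A minimal sketch of that calculation is given below; the function name `n50` and the toy lengths are illustrative assumptions, not values or code from the study.

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths.

    N50 is the largest length L such that sequences of length >= L
    together account for at least half of the summed length.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy contig lengths (hypothetical, not from the assemblies in the table).
toy_contigs = [100, 80, 50, 30, 20, 10]
print(n50(toy_contigs))
```

Applied to contig lengths this gives the N50 contig length; applied to scaffold lengths it gives the N50 scaffold length.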
Fig. 1. Hi-C assisted scaffolding of PacBio contigs. (A) Alignment of Hi-C data to the initial set of 35 high-coverage contigs by PacBio assembly showed that one of the contigs includes DNA from three different chromosomes as evidenced by the tripartite structure of the intracontig contact map of this contig (right). Other contigs did not exhibit similar contact patterns (representative example – left) suggesting they are contiguous pieces from a single chromosome. (B) Intercontig Hi-C contact maps of the unordered set of contigs (left) that were named according to their similarity with chromosomes in the PKNH assembly show striking off-diagonal contact enrichment suggesting that pairs of contigs that belong to the same chromosome are not ordered consecutively. Similar intercontig maps when contigs are clustered into scaffolds according to their Hi-C contact counts (mid) show minimal off-diagonal enrichment. Interchromosomal/scaffold contact map generated by aligning Hi-C reads to the new, chromosome level assembly (right) exhibits contact patterns that are expected of and observed in Plasmodium and yeast species (Ay et al. 2014b; Duan et al. 2010). This assembly was generated by breaking down the problematic contig, clustering contigs into chromosomal groups, and ordering and reorienting contigs within each group to maximize Hi-C contacts between adjacent and correctly oriented contigs to create scaffolds representative of each chromosome. (C) Intrascaffold Hi-C contact maps (normalized counts, 10 kb resolution) from two representative scaffolds in the new assembly. Scaffold 6 (left) and scaffold 14 were constructed by joining two and four PacBio contigs, respectively. The rows/columns marked by white represent unmappable or poorly mappable regions with Hi-C reads (Illumina 76 × 2 bp, paired-end sequencing).
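The ordering idea in the caption, placing contigs so that neighbours share the most Hi-C contacts, can be illustrated with a toy brute-force search. This is only a sketch under stated assumptions: it is not the scaffolding software used in the study, it ignores contig orientation, and the names `best_order` and `toy_contacts` and all contact counts are hypothetical.

```python
from itertools import permutations


def best_order(contigs, contacts):
    """Return the contig order (as a tuple) with the highest summed
    contact count between neighbouring contigs.

    contacts maps frozenset({a, b}) -> contact count; missing pairs
    count as 0. Brute force, so only suitable for a handful of contigs.
    """
    def adjacency_score(order):
        return sum(
            contacts.get(frozenset(pair), 0)
            for pair in zip(order, order[1:])
        )

    # Sorting first makes the result deterministic: an order and its
    # reverse score the same, and max() keeps the first one it sees.
    return max(permutations(sorted(contigs)), key=adjacency_score)


# Hypothetical contact counts between three contigs of one chromosome group.
toy_contacts = {
    frozenset(("A", "B")): 10,
    frozenset(("B", "C")): 8,
    frozenset(("A", "C")): 1,
}
print(best_order(["C", "A", "B"], toy_contacts))
```

Exhaustive search grows factorially with the number of contigs, so it is useful only to make the objective concrete for very small groups.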
Fig. 2. Chromosomal synteny between PKNH and the MaHPIC PKNOH genome sequences. (A) SyMAP circular DNA comparison of the MaHPIC Pk genome sequence scaffolds to the PKNH 2015 consensus sequence. (B) SyMAP circular DNA comparison of the MaHPIC Pk genome sequence scaffolds to the Plasmodium coatneyi HACKERI genome sequence that was assembled using PacBio technologies (Chien et al. 2016). (C) SyMAP circular DNA comparison of the PKNH 2015 consensus sequence and P. coatneyi genome sequence.
Fig. 3. Hi-C contact maps for the join regions present on scaffolds 8 and 9. Hi-C contact maps of two scaffolds from the PKNOH-PacBio-Hi-C assembly that contain contigs previously assigned to two different chromosomes in the PKNH assembly. These contact maps are zoomed in to the join regions and are at the single MboI restriction fragment level (~1 kb in resolution). Each heatmap is rotated 45 degrees compared with previous intracontig/scaffold heatmaps for visualization purposes. (A) The 200 kb region of scaffold 8 (scf8:500 000–700 000) that surrounds the join (at scf8:593 400) between two contigs previously assigned to chr13 and chr4 (left) compared with a matched 200 kb region from scaffold 12, which consists of a single contiguous PacBio contig (right). (B) Similar case vs control figure for scaffold 9 compared with matched coordinates in scaffold 5. The dashed blue lines correspond to the location of the join (or matching coordinates on the right) and the sum and average number (excluding zeros) of interactions between the left and right (rectangular area) of a join are reported for each case.
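The summary statistic named in the caption, the sum and the zero-excluded average of contacts linking the two sides of a join, can be sketched as follows. The binning, window size and any normalization the authors applied are not stated here, so the function `cross_join_contacts`, its `window` parameter and the toy matrix are assumptions for illustration only.

```python
def cross_join_contacts(matrix, join, window):
    """Summarize contacts spanning a join in a square contact matrix.

    Looks at the rectangular block pairing the `window` bins left of
    index `join` with the `window` bins starting at `join`, and returns
    (sum of contacts, mean of the non-zero contacts).
    """
    lo = max(0, join - window)
    hi = min(len(matrix), join + window)
    values = [
        matrix[i][j]
        for i in range(lo, join)   # bins left of the join
        for j in range(join, hi)   # bins right of the join
    ]
    nonzero = [v for v in values if v != 0]
    total = sum(values)
    # Zeros are excluded from the average, as described in the caption.
    mean = total / len(nonzero) if nonzero else 0.0
    return total, mean


# Hypothetical symmetric 4 x 4 contact matrix with the join between bins 1 and 2.
toy_matrix = [
    [0, 5, 2, 0],
    [5, 0, 3, 1],
    [2, 3, 0, 4],
    [0, 1, 4, 0],
]
print(cross_join_contacts(toy_matrix, join=2, window=2))
```

Running the same function at matched coordinates on a scaffold made of a single contig gives the kind of case-versus-control comparison the figure describes.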
Nuclear genome annotation metrics
| Name | Genes | Proteins | tRNA | rRNA | Pseudogenes |
|---|---|---|---|---|---|
| PKNH | 5483 | 5282 | 45 | 11 | 8 |
| PKNA1-H.1 | 5373 | 5138 | 45 | 14 | 4 |
| PKNOH-Maker2 | 5356 | 5300 | 45 | 11 | N/A |
| PKNOH-Companion | 5315 | 5253 | 45 | 11 | 152 |
| PKNOH-manually curated | 5342 | 5217 | 45 | 12 | 22 |
N/A, not applicable.
Values obtained from archival records or generated in this study; see ‘Materials and Methods’ section.
Comparative SICAvar gene statistics in the PKNH (April 2017) and PKNOH (MaHPIC PacBio) assemblies
| | PKNOH | | PKNH | |
|---|---|---|---|---|
| | Type I | Type II | Type I | Type II |
| Full | 117 | 19 | 87 | 14 |
| | 22 | 0 | 128 | 5 |
| Exon # | 11 (5–16) | 3 (3–4) | 11 (3–16) | 3 (3–4) |
| Gene length (nt) | 14 159 (6312–32 900) | 3473 (2051–4269) | ND | ND |
| Coding length | 5652 (834–6777) | 1488 (963–3051) | 5940 (1401–8295) | 2630 (1302–3309) |
| Protein length (aa) | 1892 (278–2783) | 496 (321–1017) | 1979 (466–2764) | 876 (433–1102) |
| Alpha domains | 133 | | 138 | |
| Beta domains | 763 | | 765 | |
| | 133 | | 143 | |
Exon numbers and lengths are reported as median values, with ranges in parentheses. Protein domain counts are given per assembly and include full SICAvar proteins and fragments of both types.
Fig. 4. SICAvar distribution and gene models. (A) Shown are representative examples of types I and II SICAvar genes with exons noted in blue, and their directionality indicated with arrow heads placed at the end of the 3-prime exons. Type I SICAvar genes are characterized by multiple exons (5–16), often with extremely large introns, particularly between exons 2 and 3. Type II SICAvar genes have three or four exons and are more compact with smaller introns. In five of the six examples shown, the initial two exons shown are typical. (B) Distribution of full SICAvar genes (types I and II) along the PKNOH scaffolds. (C) Distribution of partial SICAvar gene segments (types I and II) along the PKNOH scaffolds.