Patrick Lypaczewski, Johanna Hoshizaki, Wen-Wei Zhang, Laura-Isobel McCall, John Torcivia-Rodriguez, Vahan Simonyan, Amanpreet Kaur, Ken Dewar, Greg Matlashewski.
Abstract
Leishmania donovani is responsible for visceral leishmaniasis, a neglected and lethal parasitic disease with limited treatment options and no vaccine. The study of L. donovani has been hindered by the lack of a high-quality reference genome, which can impact experimental outcomes including the identification of virulence genes, drug targets and vaccine development. We therefore generated a complete genome assembly by deep sequencing using a combination of second generation (Illumina) and third generation (PacBio) sequencing technologies. Compared to the current L. donovani assembly, the genome assembly reported here resulted in the closure of over 2,000 gaps, the extension of several chromosomes up to telomeric repeats, the re-annotation of close to 15% of protein coding genes and the annotation of hundreds of non-coding RNA genes. It was possible to correctly assemble the highly repetitive A2 and Amastin virulence gene clusters. A comparative sequence analysis using the improved reference genome confirmed 70 published and identified 15 novel genomic differences between closely related visceral and atypical cutaneous disease-causing L. donovani strains, providing a more complete map of genes associated with virulence and visceral organ tropism. Bioinformatic tools including protein variation effect analyzer (PROVEAN) and basic local alignment search tool (BLAST) were used to prioritize a list of potential virulence genes based on mutation severity, gene conservation and function. This complete genome assembly and novel information on virulence factors will support the identification of new drug targets and the development of a vaccine for L. donovani.
Year: 2018 PMID: 30409989 PMCID: PMC6224596 DOI: 10.1038/s41598-018-34812-x
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Location of the gaps along 36 chromosomes that have been closed in this new assembly. Chromosomal locations of gaps are indicated in red. No gaps remain in the current assembly.
Quality assessment metrics of the previous and current assemblies.
| | Contigs | N50 (bp) | Protein coding | tRNA | rRNA | snRNA | SLRNA | snoRNA | Genes mapped |
|---|---|---|---|---|---|---|---|---|---|
| Old Assembly | 2,154 | 45,436 | 7,969 | 64 | 11 | 4 | — | 31 | 8,081 |
| New Assembly | 36 | 1,067,468 | 8,633 | 90 | 51 | 6 | 68 | 910 | 9,758 |
Old assembly refers to ASM22713v2 from strain BPK282A; new assembly refers to the assembly presented in this work. Contigs denotes the number of genomic fragments uninterrupted by stretches of unknown bases (Ns) or chromosome ends. N50 is used as a measure of contiguity: 50% of the genome is contained in contigs of size N50 and above. Annotated genes were broken down into protein coding, transfer-RNA (tRNA), ribosomal RNA (rRNA), small nuclear RNA (snRNA), spliced leader RNA (SLRNA) and small nucleolar RNA (snoRNA) genes. The number of genes mapped indicates the number of annotated genes along the genome.
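The contig and N50 definitions in the note above can be illustrated with a minimal Python sketch. The function names `contig_lengths` and `n50` are hypothetical and do not come from the study, and the sketch assumes that sequences are plain strings in which unknown bases are written as `N`. It is not the pipeline used to produce the table.

```python
import re


def contig_lengths(sequences):
    """Split each sequence at runs of unknown bases (N) and return fragment lengths.

    Assumption: a contig is any stretch uninterrupted by N, and sequence
    ends also terminate a contig.
    """
    lengths = []
    for seq in sequences:
        for fragment in re.split(r"[Nn]+", seq):
            if fragment:  # skip empty pieces from leading or trailing Ns
                lengths.append(len(fragment))
    return lengths


def n50(lengths):
    """Return the length L such that contigs of size >= L hold at least half the bases."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
```

For example, `n50(contig_lengths(["ACGTNNNACG", "AAAA"]))` first splits the input into fragments of length 4, 3 and 4, then reports the contig length at which the cumulative sum reaches half of the 11 total bases.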
Figure 2. Organization of the 4 copies of the A2 gene on chromosome 22 in the attenuated cutaneous L. donovani strain. (a) Locations of the 4 A2 genes are shown in blue and numbered 1–4. Interspaced A2-rel genes are labeled in orange, 3′ A2-rel genes are labeled in green and 5′ A2-rel genes are labeled in yellow. A2-rel genes have no homology with A2 genes[15]. Transcription direction is shown according to strandedness: blue represents reverse strand direction of transcription, red represents forward strand transcription. The genes located in the 63 kb region between opposing A2 clusters are not depicted for clarity. (b) Alignment of the longest (~11 kb+) PacBio reads to the A2 clusters. Reads in the 5′ to 3′ direction are labeled in red; reads in the 3′ to 5′ direction are labeled in blue. (c) Western blot analysis of A2 proteins in the attenuated cutaneous L. donovani strain. The sizes of the A2 proteins are consistent with the ORFs and number of A2 genes identified in this assembly. (d) Coverage graph of chromosome 22 using Illumina (blue) and PacBio (orange) reads.
Figure 3. L. donovani maintains high levels of synteny with L. major, including chromosome 22 where the A2 genes are located. Left: Dot plot of the coding DNA sequences of L. major compared to those of L. donovani generated from our assembly across the entire genome. Right: Synteny comparison of chromosome 22. The outermost circle represents the chromosomal location. The second circle is labelled with genes on the forward strand (blue) and genes on the reverse strand (red). The third circle represents genes that are only present in one of the two compared species. The inner association lines join syntenic genes between the two species.
Figure 4. The new L. donovani genome assembly results in a significant change in gene annotations. (a) New or improved gene annotations are highlighted in blue along the 36 chromosomes. Compared to the previous L. donovani reference assembly (ASM22713v2 from strain BPK282A1), there were 1,087 protein coding genes unannotated or differently annotated in the current assembly. Unannotated or differently annotated genes were obtained by removing all annotations generated from our assembly that shared 95% or greater similarity to those previously available[8]. (b) Expansion of the amastin gene cluster on chromosome 8. The top track contains the two previously known coding sequences aligned to the previous L. donovani reference assembly (ASM22713v2 from strain BPK282A1). Gaps in the previous assembly are depicted as dotted lines. The bottom track contains the 10 amastin genes identified in the updated assembly. One previously identified amastin gene has been aligned, 1 has been expanded and 8 have been annotated de novo.
Figure 5. Verification of previously identified SNPs and location of new SNPs that differ between the virulent VL and attenuated CL strains of L. donovani. Chromosomal location of previously identified homozygous non-synonymous SNPs between the cutaneous and visceral disease derived L. donovani strains (red)[4] compared to the novel SNPs identified only in this study (blue); synonymous and heterozygous codon changes identified are not labeled. Note that all the previously identified SNPs were also identified, or confirmed, in this study. 70 SNPs were previously identified across 66 genes. The same 70 SNPs were identified in this study, with an additional 15 novel, not previously seen SNPs specific to the cutaneous strain. Genomic locations of SNPs identified in the previous study were translated to new genomic coordinates based on the new assembly for consistency. Arrows in yellow highlight the position of the previously identified RagC SNP on chromosome 36 and the A2 copy number difference on chromosome 22.
Summary of novel mutations identified in this study.
| Chr | Gene | Mutation | PROVEAN | Protein Name |
|---|---|---|---|---|
| 7 | | Ala282Val | −0.743 | vacuolar-type Ca2+ ATPase, putative |
| 12 | | Glu1157Asp | −0.258 | Myotubularin-related protein, putative |
| 14 | LdCL_140017600 | Ser2919fs | N/A | kinesin K39 |
| 14 | | Glu1034Asp | −1.06 | kinesin K39 |
| 22 | | Pro219F/S | N/A | hypothetical protein |
| 23 | LdCL_230017500 | INS:446Glu (a) | | sucrose hydrolase-like protein |
| 25 | | Ala969Glu | 0.736 | Raptor N-terminal CASPase like domain containing protein |
| 25 | | INS:110 Ala, Asn, Ser, Ala, Ala, Ala, Ala | N/A | hypothetical protein |
| 27 | | Ala1493Thr | −0.25 | ATP-binding cassette protein subfamily A |
| 29 | LdCL_290028400 | Thr208Ala | 0.4 | VIT family putative |
| 30 | | Gln334STOP (a) | N/A | hypothetical protein |
| 31 | | STOP1486Leu,Ser,His | 0 | hypothetical protein |
| 31 | | Thr498Ala (a) | −0.15 | hypothetical protein |
| 31 | | His497Arg (a) | 0.942 | hypothetical protein |
| 31 | | Gly380Asp (a) | −0.383 | hypothetical protein |
| | | | | |
| 23 | | Asp712Glu | −1.625 | hypothetical protein, unknown function |
| 31 | | Met189Thr | | hypothetical protein, unknown function |
| 31 | | Val187Phe | −0.634 | hypothetical protein |
| 34 | | Thr116DEL | −1.098 | hypothetical protein |
| | | | | |
| 14 | | Gln89Lys | −0.044 | cystathionine beta-lyase-like protein |
| 31 | | Cys173Phe | | regulator of chromosome condensation (RCC1) repeat, putative |
| 32 | | Gly667Ser | −1.292 | hypothetical protein |
| 32 | | Val250Ile | 0 | hypothetical protein, unknown function |
| 36 | | Gene deletion | N/A | Serine/Threonine Kinase, putative |
| 36 | | Gene deletion | N/A | Serine/Threonine Kinase, putative |
| 36 | | Gene deletion | N/A | Engulfment and cell motility domain 2, putative |
| 36 | | Gene deletion | N/A | Predicted tripartite motif protein |
All mutations are annotated using VL as the wild type amino acids and CL as the mutated amino acids. Genes with annotations in the previous assembly list the previous gene ID in italic; genes annotated only in this assembly list only one gene ID. The top segment lists fifteen attenuated cutaneous strain specific mutations identified in this study. Mutations marked with "a" appear at 50% but also co-occur with a gene duplication event and are therefore possibly homozygous on one copy. ‘INS’ denotes amino acid insertions, ‘F/S’ denotes frameshifts, ‘DEL’ denotes amino acid deletions. The middle segment lists four mutations where the gain-of-function IV strain’s genotype changed towards that of the visceral genotype. The bottom segment lists eight mutations present only in the gain-of-function IV strain, which likely represent adaptations specific to the murine host. Calculated PROVEAN scores are shown in the fourth column; scores below the −2.5 threshold for deleterious mutations are highlighted in bold[25].
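The −2.5 PROVEAN cutoff described above can be applied to the score column with a short Python sketch. The helper names `parse_score` and `classify` are hypothetical, and treating a score exactly equal to −2.5 as deleterious is an assumption, since the note only says "below". Cells that are empty or marked N/A are reported as not scored rather than guessed.

```python
DELETERIOUS_THRESHOLD = -2.5  # cutoff stated in the table note


def parse_score(cell):
    """Convert a PROVEAN table cell to a float, or None when no score is given."""
    cleaned = cell.strip().replace("\u2212", "-")  # normalize the Unicode minus sign
    if cleaned in ("", "N/A"):
        return None
    return float(cleaned)


def classify(cell):
    """Label a table cell as 'deleterious', 'neutral' or 'not scored'."""
    score = parse_score(cell)
    if score is None:
        return "not scored"
    # Assumption: the boundary value itself counts as deleterious.
    if score <= DELETERIOUS_THRESHOLD:
        return "deleterious"
    return "neutral"
```

With the values shown in the table, `classify("−0.743")` and `classify("0.736")` both fall on the neutral side of the cutoff, while `classify("N/A")` is left unscored.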
Figure 6. Summary of all genes with non-synonymous mutations between the cutaneous, visceral, and gain-of-function strains of L. donovani. All non-synonymous SNPs and Indels were classified as common to our previous study (2014 CL[4]) or identified in this study (Novel), as well as by their effect on amino acid changes from top to bottom, colored red to green in descending order of likelihood to affect the phenotype of the parasite. 66 genes were common to the previous data set. Of those genes, 7 were previously investigated[4] and 1 was rejected due to an open reading frame misannotation. 25 genes were only listed in this study (Novel). Diagram created using SankeyMATIC (http://sankeymatic.com).
Summary of all genes containing mutations in the cutaneous isolates and classification into clusters.
| Cluster | Mutation Type | New annotation | Equivalents (when available) |
|---|---|---|---|
| Cluster 1 (13) | Nonsense, Frameshift, Insertions, Deletions, IV to VL | LdCL_300021700 | LdBPK_301640 |
| | | LdCL_310020800 | LdBPK_311390 |
| | | LdCL_250013200 | LdBPK_250790 |
| | | LdCL_310020100 | LdBPK_311320 |
| | | LdCL_310022200 | LdBPK_311510 |
| | | LdCL_080011700 | LdBPK_080670 |
| | | LdCL_340029800 | LdBPK_342210 |
| | | LdCL_230014900 | LdBPK_230830 |
| | | LdCL_310037100 | LdBPK_312870 |
| | | LdCL_220015800 | — |
| | | LdCL_140017600 | — |
| | | LdCL_230017500 | — |
| | | LdCL_310041200 | LdBPK_313290 |
| Cluster 2 (9) | Multiple SNPs in the same gene, Non-conservative amino acid change in conserved region with good PROVEAN score | LdCL_270015000 | LdBPK_270840 |
| | | LdCL_290026900 | LdBPK_292100 |
| | | LdCL_310021600 | LdBPK_311470 |
| | | LdCL_310028800 | LdBPK_312080 |
| | | LdCL_340046300 | LdBPK_343690 |
| | | LdCL_290022800 | LdBPK_291720 |
| | | LdCL_310024300 | LdBPK_311710 |
| | | LdCL_360006000 | LdBPK_360120 |
| | | LdCL_360062000 | LdBPK_365480 |
| Cluster 3 (18) | Non-conservative amino acid change in conserved region with poor PROVEAN score | LdCL_070018300 | LdBPK_071330 |
| | | LdCL_320013800 | LdBPK_320820 |
| | | LdCL_250016900 | LdBPK_251150 |
| | | LdCL_220022000 | LdBPK_221470 |
| | | LdCL_250015300 | LdBPK_251000 |
| | | LdCL_040011100 | LdBPK_040560 |
| | | LdCL_360016300 | LdBPK_361120 |
| | | LdCL_200014300 | LdBPK_200960 |
| | | LdCL_090011700 | LdBPK_090660 |
| | | LdCL_130016200 | LdBPK_131090 |
| | | LdCL_340009000 | LdBPK_340390 |
| | | LdCL_130017800 | LdBPK_131230 |
| | | LdCL_230009900 | LdBPK_230440 |
| | | LdCL_220018100 | LdBPK_221070 |
| | | LdCL_340044900 | LdBPK_343550 |
| | | LdCL_290028400 | — |
| | | LdCL_250011400 | LdBPK_250620 |
| | | LdCL_270014900 | LdBPK_270830 |
| Cluster 4 (8) | Conservative amino acid change in conserved region with good PROVEAN score | LdCL_350013100 | LdBPK_350830 |
| | | LdCL_360052700 | LdBPK_364550 |
| | | LdCL_230026600 | LdBPK_231940 |
| | | LdCL_230009400 | LdBPK_230400 |
| | | LdCL_320031100 | LdBPK_322560 |
| | | LdCL_020008200 | LdBPK_020280 |
| | | LdCL_310027700 | LdBPK_311990 |
| | | LdCL_320031200 | LdBPK_322570 |
| Cluster 5 (13) | Conservative amino acid change in conserved region with poor PROVEAN score | LdCL_330011900 | LdBPK_330640 |
| | | LdCL_170010200 | LdBPK_170470 |
| | | LdCL_070011900 | LdBPK_070700 |
| | | LdCL_210025000 | LdBPK_211930 |
| | | LdCL_290022900 | LdBPK_291730 |
| | | LdCL_200006300 | LdBPK_200140 |
| | | LdCL_290029000 | LdBPK_292290 |
| | | LdCL_030007500 | LdBPK_030250 |
| | | LdCL_250006200 | LdBPK_250110 |
| | | LdCL_360015800 | LdBPK_361070 |
| | | LdCL_310023400 | LdBPK_311630 |
| | | LdCL_360062700 | LdBPK_365540 |
| | | LdCL_140017700 | LdBPK_141190 |
| Cluster 6 (14) | Non-conservative amino acid change in less conserved region, Conservative amino acid change in less conserved region | LdCL_340022100 | LdBPK_341580 |
| | | LdCL_060011600 | LdBPK_060650 |
| | | LdCL_210015400 | LdBPK_211040 |
| | | LdCL_050010900 | LdBPK_050580 |
| | | LdCL_230011600 | LdBPK_230610 |
| | | LdCL_250024100 | LdBPK_251840 |
| | | LdCL_070015100 | LdBPK_071060 |
| | | LdCL_250005300 | LdBPK_250040 |
| | | LdCL_310022100 | LdBPK_311500 |
| | | LdCL_200006800 | LdBPK_200200 |
| | | LdCL_250014400 | LdBPK_250910 |
| | | LdCL_230010400 | LdBPK_230500 |
| | | LdCL_120008300 | LdBPK_120275 |
| | | LdCL_290028100 | LdBPK_292210 |
| Cluster 7 (4) | IV-only mutations | LdCL_140010000 | LdBPK_140470 |
| | | LdCL_310035800 | LdBPK_312770 |
| | | LdCL_310036400 | LdBPK_312810 |
| | | LdCL_320046000 | LdBPK_324000 |
| | | LdCL_360021300 | LdBPK_361580 |
| | | LdCL_360021400 | LdBPK_361590 |
| | | LdCL_360021500 | LdBPK_361600 |
| | | LdCL_360021600 | LdBPK_361610 |
Entries were not repeated in multiple lists.
Identified mutations were further classified into priority clusters for effect on protein function and future analysis for genes associated with survival in visceral organs. Mutations were prioritized by likelihood of contributing to visceral tissue tropism, based on severity of the coding change, accumulation of secondary mutations and conservation. Gene loci are listed from the current assembly, along with previous ID numbers when available.
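The cluster definitions in the table can be written out as a small decision function. The following Python sketch is one possible reading, not the authors' procedure: the `Mutation` fields and the `assign_cluster` name are hypothetical, the order in which rules are checked is an assumption (chosen so that each entry lands in exactly one cluster), and whether a PROVEAN score counts as "good" is supplied by the caller because the table does not define it.

```python
from dataclasses import dataclass

# Mutation kinds that the table places directly in Cluster 1.
SEVERE_KINDS = {"nonsense", "frameshift", "insertion", "deletion"}


@dataclass
class Mutation:
    kind: str                        # e.g. "missense", "nonsense", "frameshift"
    conservative: bool = False       # conservative amino acid change
    conserved_region: bool = False   # located in a conserved region
    good_provean: bool = False       # caller-defined; not specified in the table
    multiple_snps_in_gene: bool = False
    iv_to_vl: bool = False           # IV genotype changed towards the VL genotype
    iv_only: bool = False            # present only in the gain-of-function IV strain


def assign_cluster(m):
    """Return the cluster number (1-7) for one mutation under the assumed rule order."""
    if m.iv_only:
        return 7
    if m.kind in SEVERE_KINDS or m.iv_to_vl:
        return 1
    if m.multiple_snps_in_gene:
        return 2
    if not m.conserved_region:
        return 6  # conservative or not, less conserved regions share one cluster
    if not m.conservative:
        return 2 if m.good_provean else 3
    return 4 if m.good_provean else 5
```

For instance, `assign_cluster(Mutation(kind="frameshift"))` returns 1, while a conservative missense change in a less conserved region returns 6.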