Oscar Franzén, Carlos Talavera-López, Stephen Ochaya, Claire E Butler, Louisa A Messenger, Michael D Lewis, Martin S Llewellyn, Cornelis J Marinkelle, Kevin M Tyler, Michael A Miles, Björn Andersson.
Abstract
BACKGROUND: Trypanosoma cruzi marinkellei is a bat-associated parasite of the subgenus Schizotrypanum and it is regarded as a T. cruzi subspecies. Here we report a draft genome sequence of T. c. marinkellei and a comparison with T. c. cruzi. Our aims were to identify unique sequences and genomic features that may relate to their distinct niches.
Year: 2012 PMID: 23035642 PMCID: PMC3507753 DOI: 10.1186/1471-2164-13-531
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Raw sequence data
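| | 454 a | Illumina b | 454 a | Illumina b |
|---|---|---|---|---|
| # Reads (million) | 1.3 | 23.0 | 1.3 | 28.7 |
| # Nucleotides c | 0.47 | 35.6 | 0.52 | 44.3 |
| Average read length d | 357 | 77 | 393 | 77 |
| ~ Coverage e | 12 | 91 | 9 | 103 |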
a Single end 454 reads.
b No. Read-pairs (true mate-paired reads after adapter trimming).
c Billion nucleotides.
d The average read length (after adapter trimming).
e The theoretical genome coverage based on known genome sizes and the number of sequenced nucleotides.
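Footnote e describes theoretical coverage as the number of sequenced nucleotides divided by the genome size. The arithmetic can be sketched in Python as below. The function name is illustrative, and the 38.6 Mb genome size is an assumption borrowed from the final assembly size in the assembly statistics table, not the "known genome size" the authors used.

```python
def theoretical_coverage(sequenced_nt: float, genome_size_nt: float) -> float:
    """Return fold coverage: total sequenced nucleotides / genome size."""
    if genome_size_nt <= 0:
        raise ValueError("genome size must be positive")
    return sequenced_nt / genome_size_nt


# 0.47 billion nucleotides of 454 data against an ASSUMED 38.6 Mb genome.
coverage_454 = theoretical_coverage(0.47e9, 38.6e6)
print(f"{coverage_454:.1f}x")
```

Under that assumed genome size the result is roughly 12x, in line with the coverage row of the table.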
Figure 1. Schematic overview of the genome assembly steps. Illumina reads were assembled into contigs with Velvet. Unused reads were extracted and used for a second Velvet assembly with a different kmer length. 454 reads were assembled with CELERA. The 454-assembly was then subjected to homopolymer error correction with Illumina reads. The Illumina and 454 assemblies were merged into a non-redundant assembly using the Zorro pipeline. The assembly was then subjected to scaffolding using SSPACE and physical distance information. The final step involved gap closure with the IMAGE pipeline.
Genome assembly statistics and summary
| Step | Software | Size (Mb) a | No. contigs/scaffolds b | | Average length (bp) c | N50 d | N90 e | |
|---|---|---|---|---|---|---|---|---|
| 454 assembly | CELERA | 37.3 | 30,737 | - | 1,216 | 1,670 | 539 | |
| Illumina assembly | Velvet (kmer 43) | 16.7 | 9,247 | - | 1,813 | 2,378 | 851 | |
| Assembly of non-assembled Illumina reads | Velvet (kmer 53) | 1.17 | 2,094 | - | 562 | 536 | 418 | |
| Assembly merging | Zorro | 33.5 | 24,799 f | - | 1,353 | 2,218 | 549 | |
| Scaffolding | SSPACE | 38.8 | 23,813 f | 1,835 | 2,296 | 25,044 | 576 | |
| Gap closure | IMAGE | 38.6 | 23,000 | 1,774 | 2,302 | 25,781 | 583 | |
| 454 assembly | CELERA | 41.8 | 33,686 | - | 1,243 | 1,516 | 549 | |
| Illumina assembly | Velvet (kmer 43) | 17.0 | 8,523 | - | 1,997 | 2,742 | 904 | |
| Assembly of non-assembled Illumina reads | Velvet (kmer 53) | 1.14 | 2,116 | - | 543 | 523 | 416 | |
| Assembly merging | Zorro | 38.0 | 28,389 f | - | 1,339 | 1,869 | 560 | |
| Scaffolding | SSPACE | 43.7 | 27,605 f | 2,476 | 2,162 | 14,067 | 589 | |
| Gap closure | IMAGE | 43.4 | 26,889 | 2,423 | 2,158 | 14,516 | 592 |
a The length when sequences are combined (Mb).
b The number of contigs/scaffolds.
c The average contig length (bp). For the SSPACE row, this refers to the average scaffold length.
d The length N for which half of all bases are in a sequence of this length or longer.
e The length N for which 90% of all bases are in a sequence of this length or longer.
f Contigs >500 bp.
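Footnotes d and e describe what are commonly called N50 and N90. A minimal Python sketch of that calculation follows; the function name `nx` and the toy contig lengths are illustrative and are not taken from the assemblies reported here.

```python
def nx(lengths, fraction):
    """Largest length L such that sequences of length >= L together hold
    at least `fraction` of all bases (0.5 gives N50, 0.9 gives N90)."""
    if not lengths or not 0 < fraction <= 1:
        raise ValueError("need at least one length and 0 < fraction <= 1")
    target = fraction * sum(lengths)
    running = 0
    # Walk from the longest sequence down until the target is reached.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length


contig_lengths = [10, 8, 6, 4, 2]  # toy data, not from the assembly
print(nx(contig_lengths, 0.5), nx(contig_lengths, 0.9))
```

Because N90 is reached only after including many short sequences, it is always less than or equal to N50, which matches the pattern in the table.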
Figure 2. Distribution of heterozygosity and sequence coverage along scaffold 143 of T. c. marinkellei B7. Heterozygosity was counted in non-overlapping windows of 1000 bp (red line). Coverage is shown as the log10-scaled average coverage of each 1000 bp window (turquoise line). The x-axis shows the start position of each window along the sequence; the y-axis shows either the number of heterozygous nucleotide positions per window or the log10-scaled average coverage of the window.
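The windowing behind this kind of plot can be sketched in Python as follows. This is not the authors' code: the input layout (0-based heterozygous positions and a per-base depth list) and the handling of zero-depth windows are assumptions made for illustration.

```python
import math


def window_stats(het_positions, depth, window=1000):
    """Summarise non-overlapping windows along one scaffold.

    het_positions: 0-based coordinates of heterozygous sites (assumed input).
    depth: per-base read depth, one value per scaffold position.
    Returns (window_start, het_count, log10_mean_depth) tuples.
    """
    het_counts = {}
    for pos in het_positions:
        start = (pos // window) * window
        het_counts[start] = het_counts.get(start, 0) + 1

    rows = []
    for start in range(0, len(depth), window):
        chunk = depth[start:start + window]
        mean_depth = sum(chunk) / len(chunk)
        # log10 is undefined at zero depth; report -inf for such windows.
        log_depth = math.log10(mean_depth) if mean_depth > 0 else float("-inf")
        rows.append((start, het_counts.get(start, 0), log_depth))
    return rows


# Toy scaffold of 8 bp with a window of 4 bp, purely for illustration.
for row in window_stats([0, 1, 5], [10] * 4 + [100] * 4, window=4):
    print(row)
```

The first element of each tuple corresponds to the x-axis of the figure, and the second and third to the two plotted lines.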
Comparison of gene family content
| Gene family a | Base pairs b | Reads (%) c | Base pairs b | Reads (%) c | SE d |
|---|---|---|---|---|---|
| | 2,129,983 (6.22 %) | 3.433 | 1,265,650 (3.28 %) | 1.324 | |
| | 2,109,163 (6.16 %) | 6.291 | 2,953,602 (7.65 %) | 6.298 | |
| | 540,360 (1.58 %) | 1.317 | 727,537 (1.88 %) | 1.434 | |
| | 521,665 (1.52 %) | 2.234 | 1,314,589 (3.41 %) | 2.915 | |
| | 452,732 (1.32 %) | 1.229 | 514,422 (1.33 %) | 0.898 | |
| | 273,890 (0.80 %) | 0.557 | 334,544 (0.87 %) | 0.515 | |
| | 37,490 (0.11 %) | 0.124 | 42,072 (0.11 %) | 0.162 | |
| | 25,946 (0.08 %) | 0.080 | 26,732 (0.07 %) | 0.074 | |
a Gene family abbreviations: DGF=Dispersed Gene Family, TS=trans-sialidase, MASP=Mucin-associated surface protein, GP63=Surface protease, RHS=Retrotransposon Hot Spot protein, ABC=ABC Transporter, RBP=RNA Binding Protein.
b The combined number of base pairs of this gene family identified in the assembly. Sequences were identified using RepeatMasker and a repeat library of coding sequences from the Tcc CLBR genome. These numbers include partial coding sequences. The number in parentheses is the percentage of total assembly size.
c The percentage of short reads that mapped to these features.
d SE=Significantly Enriched. Indicates whether one genome contained significantly more of this gene family. The significance was determined from an empirical distribution of read depth differences from homologous regions of Tcm and Tcc X10, corrected for genome size. The empirical distribution was used to calculate a p-value.
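Footnote d states that a p-value was derived from an empirical distribution of read depth differences, without giving the exact procedure. One common way to do this is sketched in Python below. The one-sided test, the +1 correction and the function name are all assumptions, not details from the source.

```python
def empirical_p_value(observed, null_values):
    """One-sided p-value: share of null values at least as large as observed.

    The +1 in numerator and denominator is a common convention that keeps
    the estimate above zero; it is an assumption, not taken from the source.
    """
    if not null_values:
        raise ValueError("empirical distribution is empty")
    as_extreme = sum(1 for value in null_values if value >= observed)
    return (as_extreme + 1) / (len(null_values) + 1)


# Toy numbers only: an observed depth difference against five null differences.
print(empirical_p_value(3.0, [0, 1, 2, 3, 4]))
```

In practice `null_values` would hold the genome-size-corrected depth differences from homologous regions, and `observed` the difference for the gene family under test.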
Figure 3. Histogram of pairwise nucleotide identities between orthologs of T. c. marinkellei B7 and T. c. cruzi CL Brener non-Esmeraldo-like haplotype. 5618 orthologs were included in the comparison, for which the average nucleotide identity was 92.6% ± 3.3 (Tcm vs Tcc CLBR non-Esm). The genes included in this analysis mainly comprised the non-repetitive component of these genomes. Orthologs were defined as the best reciprocal BLASTp hit between the genomes. Nucleotide sequences were aligned with ClustalW version 2.1. Mismatches (single nucleotide polymorphisms) within each alignment were identified and counted using a Perl script. Ortholog pairs with less than 80% identity were excluded from the analysis.
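The mismatch counting and the 80% cutoff can be illustrated with a short Python sketch. The authors used a Perl script that is not shown here, so this is a stand-in, and skipping gap columns is an assumption about how identity was computed.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for base_a, base_b in zip(aligned_a.upper(), aligned_b.upper()):
        if base_a == gap or base_b == gap:
            continue  # ASSUMPTION: gap columns are ignored, not counted as mismatches
        compared += 1
        matches += base_a == base_b
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def keep_pair(aligned_a, aligned_b, minimum=80.0):
    """Apply the 80% identity cutoff described in the legend."""
    return percent_identity(aligned_a, aligned_b) >= minimum


# Toy alignment: four matches in five ungapped columns.
print(percent_identity("ACGT-A", "ACGAAA"))
```

Collecting `percent_identity` over all retained ortholog pairs would give the values that such a histogram summarises.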
Figure 4. Genomic location of the T. c. marinkellei-specific acetyltransferase gene (MOQ_006101). Screenshot from Artemis Comparison Tool of a ~19 kb homologous region of T. c. marinkellei B7 (contig 2842) and T. c. cruzi CL Brener (non-Esmeraldo-like haplotype; chr 37). Vertical black lines in each frame represent stop codons. Genes with shared homology in both genomes are shown in blue and the T. c. marinkellei-specific gene (MOQ_006101) is shown in green. Red stripes represent regions with high sequence similarity between the two genomes. Abbreviations: ETIF (eukaryotic translation initiation factor 3 subunit 8, putative); HP (hypothetical protein); PDI (protein disulfide isomerase); CO (cytochrome c oxidase subunit IX); AT (acetyltransferase); and RP (U1A small nuclear ribonucleoprotein).
List of hits obtained from PSI-BLAST after 4 iterations querying MOQ_006101 against GenBank non-redundant database
| Predicted protein | XP_002298511.1 | Cas1_AcylT | 38% | 0 | |
| Putative O-acetyltransferase | NP_568662.1 | Cas1_AcylT | 39% | 0 | |
| AT5g46340/MPL12_14 | AAL11600.1 | Cas1_AcylT | 38% | 0 | |
| O-acetyltransferase-like protein | NP_180988.3 | Cas1_AcylT | 37% | 0 | |
| Predicted protein | XP_002317300.1 | Cas1_AcylT | 37% | 0 | |
| CAS1 domain-containing protein 1-like | XP_002272126.2 | Cas1_AcylT | 38% | 0 | |
| O-acetyltransferase family protein | XP_002879497.1 | Cas1_AcylT | 37% | 0 | |
| Hypothetical protein | XP_002863407.1 | Cas1_AcylT | 39% | 0 | |
| O-acetyltransferase, putative | XP_002519732.1 | Cas1_AcylT | 38% | 0 | |
| CAS1 domain-containing protein 1-like | XP_003532649.1 | Cas1_AcylT | 38% | 0 |
The best hit from the NCBI Conserved Domain Database.
Figure 5. Analyses of the T. c. marinkellei-specific acetyltransferase gene. (A) Maximum likelihood phylogenetic tree of the T. c. marinkellei B7 specific gene (MOQ_006101) based on protein sequences. Phylogenetic inference was done on a protein dataset extracted with Blast Explorer (E-value <1e-40). The multiple sequence alignment was done with ClustalW version 2.1 and filtered with Gblocks. The final alignment, on which the tree is based, contained 62 columns. The phylogeny was inferred using RAxML version 7.0.4 with the PROTGAMMAJTT model and 100 bootstrap replicates. Only bootstrap values >40 are shown. Accession numbers for protein sequences are shown in parentheses after species names. The Tcm gene is shown in red. (B) GC content analysis of MOQ_006101 in relation to all genes. Error bars represent one standard deviation. The white dot represents all genes and the blue dot represents MOQ_006101. GC1, GC2 and GC3 refer to the %GC content at the first, second and third codon positions. 10,342 coding sequences were included in the analysis. (C) Histogram of Codon Adaptation Index (CAI) for all genes in the Tcm genome. 43 ribosomal proteins were used as a reference for highly expressed genes. The vertical black line represents the median CAI (0.545), the two red lines represent +/− one median absolute deviation (0.0548) and the blue line represents the CAI of MOQ_006101 (0.518). CAI was calculated using the EMBOSS programs cai and cusp.
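GC1, GC2 and GC3 in panel B are straightforward to compute from a coding sequence. A minimal Python sketch is given below; it assumes the sequence starts in frame and ignores any trailing partial codon, and the function name is illustrative.

```python
def gc_by_codon_position(cds):
    """Return (GC1, GC2, GC3) as percentages for an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop a trailing partial codon, if any
    if usable == 0:
        raise ValueError("sequence is shorter than one codon")
    result = []
    for offset in range(3):
        bases = cds[offset:usable:3]  # every third base, starting at this codon position
        gc_count = sum(base in "GC" for base in bases)
        result.append(100.0 * gc_count / len(bases))
    return tuple(result)


# Toy sequence of two codons (ATG GCC), not a gene from the Tcm genome.
print(gc_by_codon_position("ATGGCC"))
```

Averaging these values over all coding sequences, with their standard deviations, would give the reference point and error bars against which a single gene is compared.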
Figure 6. Duplicated regions of T. c. marinkellei and T. c. cruzi Sylvio X10 inferred from short reads. Short read coverage on assembly scaffolds of T. c. marinkellei B7 and T. c. cruzi Sylvio X10. Coverage was counted in 100 bp sliding windows along the scaffolds, with a 50 bp step size. Each dot represents a 100 bp window. The horizontal axis shows the position of the window along the scaffold and the vertical axis shows the log10-scaled coverage of each window. Coverage was incremented by 1 to avoid infinite values for empty regions. Red lines show the global median coverage and blue lines show +/− 2 times the median absolute deviation. On scaffold 1093 (Tcm) the amplified region contained a dynein light chain protein, a nucleoside transporter and two genes of unknown function. On scaffold 1101 the first amplified region was 3 kb and contained a gene of unknown function, the second region was 3.7 kb and contained a pyruvate phosphate dikinase gene, and the third region was 1.2 kb and contained a gene of unknown function. Scaffold 1531 (Tcc X10) contained a 4.7 kb amplification that included a nucleoside transporter. Scaffold 1556 (Tcc X10) contained two genes of unknown function. Scaffold 3025 (Tcc X10) contained a 5 kb amplification with a copper-transporting ATPase gene. Scaffold 78 from Tcm showed evidence of aneuploidy, as the mean coverage was lower than for the other scaffolds.
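The sliding-window scheme in this legend (100 bp windows, 50 bp step, log10 of coverage plus one, median with a band of two median absolute deviations) can be sketched in Python as follows. Whether the increment was applied per window or per base, and whether the MAD was taken on the log scale, is not stated in the legend; both choices below are assumptions, as are the function names.

```python
import math
from statistics import median


def log_window_coverage(depth, size=100, step=50):
    """Return (start, log10(mean depth + 1)) for each overlapping window."""
    rows = []
    for start in range(0, len(depth) - size + 1, step):
        mean_depth = sum(depth[start:start + size]) / size
        # +1 avoids log10(0) for empty regions (applied per window here).
        rows.append((start, math.log10(mean_depth + 1)))
    return rows


def amplified_windows(rows, n_mads=2.0):
    """Starts of windows lying above median + n_mads * MAD (log scale)."""
    if not rows:
        raise ValueError("no windows to analyse")
    values = [value for _, value in rows]
    centre = median(values)
    mad = median([abs(value - centre) for value in values])
    return [start for start, value in rows if value > centre + n_mads * mad]


# Toy scaffold: uniform depth 9 with a short stretch at depth 99.
toy_depth = [9] * 12 + [99] * 4 + [9] * 12
print(amplified_windows(log_window_coverage(toy_depth, size=4, step=2)))
```

With real data the same call would use the default `size=100` and `step=50`; windows below the lower band could be flagged in the same way to look for under-represented regions.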
Comparison of repetitive elements
| Repetitive element | Masked base pairs a | Reads (%) b | Masked base pairs a | Reads (%) b | SE c |
|---|---|---|---|---|---|
| | 574,697 (1.679 %) | 1.535 | 1,116,378 (2.892 %) | 1.811 | |
| | 433,619 (1.267 %) | 1.156 | 655,064 (1.697 %) | 0.907 | |
| | 432,474 (1.263 %) | 1.168 | 805,885 (2.088 %) | 2.158 | |
| | 382,416 (1.117 %) | 1.024 | 481,685 (1.248 %) | 1.081 | |
| | 223,679 (0.653 %) | 0.630 | 281,491 (0.729 %) | 0.590 | |
| | 176,724 (0.516 %) | 0.497 | 238,914 (0.619 %) | 0.527 | |
| | 94,765 (0.277 %) | 0.224 | 151,879 (0.393 %) | 0.275 | |
| | 18,338 (0.054 %) | 0.104 | 102,810 (0.266 %) | 0.203 | |
| | 4,705 (0.014 %) | 0.010 | 10,936 (0.028 %) | 0.020 | |
| | 2,944 (0.009 %) | 0.006 | 167 (0.000 %) | 0.000 | |
| | 621 (0.002 %) | 0.149 | 7,573 (0.020 %) | 0.628 | |
a The sum of masked base pairs in the assembly. The number in parentheses is the percentage of assembled bases.
b The percentage of short reads that mapped to these features.
c SE=Significantly Enriched. Indicates whether one genome contained significantly more of this gene family. The significance was determined from an empirical distribution of read depth differences from homologous regions of Tcm and Tcc X10, corrected for genome size. The empirical distribution was used to calculate a p-value.
Figure 7. Cell invasion assay. (A) Intracellular T. c. marinkellei parasites stained with Giemsa. Scale bars correspond to: 10 μm (field) and 5 μm (enlarged). (B) Immunolabelled intracellular and extracellular T. c. marinkellei parasites. Intracellular parasites were labelled with anti-WCB antibody (green), while extracellular parasites were labelled with anti-WCB antibody (green) and anti-T. c. marinkellei serum (red), which superimposed give a yellow color. Nuclei and kinetoplasts were counterstained with DAPI (blue). (C) Number of intracellular T. c. marinkellei parasites in the Giemsa assay. (D) Number of intracellular T. c. marinkellei parasites in the immunolabelling assay. (E) Intracellular T. c. cruzi and T. c. marinkellei parasites in three different cell types. T. c. cruzi and T. c. marinkellei parasites were incubated for 5 days with Vero (monkey), OK (opossum) and Tb1-Lu cells before Giemsa staining. Two hundred cells were assayed in 3 independent experiments for the Giemsa and immunolabelling assays. The scale bars correspond to 5 μm.