Luisa Berná, Matias Rodriguez, María Laura Chiribao, Adriana Parodi-Talice, Sebastián Pita, Gastón Rijo, Fernando Alvarez-Valin, Carlos Robello.
Abstract
Although the genome of Trypanosoma cruzi, the causative agent of Chagas disease, was first made available in 2005, with additional strains reported later, the intrinsic genome complexity of this parasite (the abundance of repetitive sequences and genes organized in tandem) has traditionally hindered high-quality genome assembly and annotation. This also limits the types of analysis that require a high degree of precision. Long reads generated by third-generation sequencing technologies are particularly suitable for addressing the challenges associated with the T. cruzi genome, since they permit direct determination of the full sequence of large clusters of repetitive sequences without collapsing them. This, in turn, not only allows accurate estimation of gene copy numbers but also circumvents assembly fragmentation. Here, we present the analysis of the genome sequences of two T. cruzi clones: the hybrid TCC (TcVI) and the non-hybrid Dm28c (TcI), determined by PacBio Single Molecule Real-Time (SMRT) technology. The improved assemblies obtained here allowed us to accurately estimate gene copy numbers and the abundance and distribution of repetitive sequences (including satellites and retroelements). We found that the genome of T. cruzi is composed of a 'core compartment' and a 'disruptive compartment', which exhibit opposite GC content and gene composition. Novel tandem and dispersed repetitive sequences were identified, including some located inside coding sequences. Additionally, homologous chromosomes were separately assembled, allowing us to retrieve haplotypes as separate contigs instead of a unique mosaic sequence. Finally, manual annotation of the surface multigene families, mucins and trans-sialidases, now allows a better overview of these complex groups of genes.
Keywords: Chagas disease; PacBio; Trypanosoma cruzi; whole genome sequencing
Year: 2018 PMID: 29708484 PMCID: PMC5994713 DOI: 10.1099/mgen.0.000177
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
Fig. 2. Haplotype resolution and recombination. (a) Circos graph representation of homologous contigs (right). The Artemis view of the indicated fragments is shown on the left [for contig TCC_133 from 88 to 112 kb (top), and for contig TCC_64 from 50 to 77 kb (bottom)]. The six frames are shown and the annotated genes are represented in turquoise. (b) Alignment visualization (IGV) of the Esmeraldo Illumina reads (SRA833800) on the same homologous regions considered in (a) (TCC_133 on the top, TCC_64 on the bottom). (c) Alignment visualization (IGV) of PacBio TCC reads on the same region as in (b). The bottom panel shows an enlargement of the boxed region, where Esmeraldo Illumina reads go from mapping to TCC_133 to mapping to TCC_64. (d) Circos graph representation of haplotype resolution for contigs of different sizes.
Summary of the assembly and annotation of T. cruzi TCC and T. cruzi Dm28c genomes
| | TCC | Dm28c |
|---|---|---|
| Total reads | 751 460 | 601 161 |
| Filtered reads | 343 384 | 261 392 |
| Read N50 | 20 939 | 21 011 |
| Total base pairs | 5 200 759 160 | 4 037 444 749 |
| Coverage | ~60× | ~76× |
| Size* (bp) | 86 772 227 | 53 163 602 |
| N50 | 265 169 | 317 638 |
| Number of contigs | 1142 | 599 |
| DNA G+C content (%) | 51.7 | 51.6 |
| Percentage coding | 49.6 | 49.8 |
| Number of gene models | 27 522 | 17 371 |
| Mean CDS length | 1388 | 1484 |
| DNA G+C content (%) | 54.3 | 53.6 |
| Gene density (genes per Mb) | 320 | 326 |
| Mean length (bp) | 1690 | 1660 |
| DNA G+C content (%) | 46.0 | 46.8 |
| tRNA | 115 | 94 |
| rRNA locus** | 8 | 14 |
| rRNA 5S | 193 | 77 |
| SL-RNA | 206 | 622 |
| snRNA | 16 | 13 |
| snoRNA | 1561 | 1024 |
*Includes all contigs >5Kb.
**Includes SSU+5.8S+LSU.
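The coverage figures in the table can be reproduced with simple arithmetic. The sketch below is illustrative only: it assumes the two value columns are TCC and Dm28c in the order the table title names them, and it defines coverage as total sequenced base pairs divided by assembly size. The function and dictionary names are hypothetical. The naive gene density computed this way (about 317 and 327 genes per Mb) is close to, but not identical with, the tabulated 320 and 326, which suggests the table used a slightly different denominator or rounding.

```python
# Values copied from the assembly summary table.
# Assumption: first value column is TCC, second is Dm28c.
ASSEMBLIES = {
    "TCC": {"total_bp": 5_200_759_160, "assembly_bp": 86_772_227, "gene_models": 27_522},
    "Dm28c": {"total_bp": 4_037_444_749, "assembly_bp": 53_163_602, "gene_models": 17_371},
}


def coverage(total_bp: int, assembly_bp: int) -> float:
    """Mean sequencing depth: sequenced bases divided by assembly size."""
    return total_bp / assembly_bp


def genes_per_mb(gene_models: int, assembly_bp: int) -> float:
    """Naive gene density: gene models per million assembled base pairs."""
    return gene_models / (assembly_bp / 1_000_000)


for name, a in ASSEMBLIES.items():
    depth = coverage(a["total_bp"], a["assembly_bp"])
    density = genes_per_mb(a["gene_models"], a["assembly_bp"])
    print(f"{name}: ~{depth:.0f}x coverage, {density:.0f} genes per Mb")
```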
Fig. 1. Chromosomal assembly improvements. (a) ACT alignment of homologous chromosomes from three strains: TCC (contig TCC_10), Dm28c (contig Dm28c_6) and CL Brener (chromosome TcChr30-P). Previously undetermined sequences filled by Ns in CL Brener are marked in green. (b) Magnification of a fragment of (a) (boxed and shaded in grey). The six frames and the DNA G+C content of each chromosome are plotted. Previously collapsed repetitive sequences (boxed in orange) are disaggregated in the new assembly. (c) Visualization of the alignment of the same homologous chromosome showing additional details in TCC and Dm28c. The colour patterns in the annotation bars (bottom and top-most horizontal striped bars) correspond to the annotation as it appears in the web interface (DGF1 in red, GP63 in orange, RHS in brown, conserved genes in green). The six reading frames are also shown. (1) Terminal DGF-1 gene cluster present only in TCC. (2) Non-homologous region present only in Dm28c. (3) Repetitive region present in both strains. (4) Expansion of a GP63 cluster in TCC (four copies versus two copies in Dm28c). (5) Strain-specific amplifications of two different genes. There are seven GP63 copies (orange strips on the top annotation bar) in TCC but only one in Dm28c; moreover, Dm28c contains four RHS copies in the same region. (6) Repetitive element present in both genomes, with fewer copies in TCC (20 copies in TCC and 44 copies in Dm28c). The segment is followed by another strain-specific amplification consisting of a cluster of 14 GP63 genes in TCC and only one copy in Dm28c.
Fig. 3. The genome compartmentalization of T. cruzi. (a) Schematic representation of the two types of compartment in T. cruzi. Genes are visualized as in the web interface by strips (DGF1 red, GP63 orange, MASP blue, mucin light blue, TS light orange, conserved genes green). The core compartment is composed of conserved genes. The disruptive compartment is composed of the surface multigene families TS, MASP and mucins. GP63, DGF-1 and RHS are distributed (sometimes in tandem clusters) in both compartments. (b) GC distribution of the compartments. Only contigs entirely composed of one compartment (80 % or higher proportion of conserved genes or surface multigene families) and longer than 10 kb were considered. (c) Schematic representation of a contig of Dm28c; genes are depicted as in (a) and compartments are coloured as in (b). The GC distribution is calculated over a sliding window of 7000 bp. Strand-switch regions are indicated above the GC plot by black vertical stripes.
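A sliding-window GC profile of the kind plotted in panel (c) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the caption gives only the window size (7000 bp), so the step size, the handling of ambiguous bases (ignored here) and the function names are all assumptions.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G+C among unambiguous bases (A, C, G, T); 0.0 if there are none."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def sliding_gc(seq: str, window: int = 7000, step: int = 1000):
    """Return (start, gc_fraction) for every full window along the sequence.

    window=7000 follows the figure caption; step=1000 is an assumed value.
    Sequences shorter than one window yield an empty list.
    """
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy example: a GC-rich block followed by an AT-rich block.
toy = "GC" * 10 + "AT" * 10
print(sliding_gc(toy, window=20, step=20))
```

On a real contig, a shift between GC-rich and GC-poor windows along the profile would correspond to the compartment boundaries the caption describes.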
Gene family groups in T. cruzi. The total numbers of genes and clusters are listed
| Gene family | TCC genes | TCC clusters | Dm28c genes | Dm28c clusters |
|---|---|---|---|---|
| Trans-sialidase (TS) | 1734 | 2 | 1491 | 3 |
| MASP | 1332 | 44 | 1045 | 38 |
| RHS | 1264 | 4 | 774 | 2 |
| Mucins | 970 | 21 | 571 | 10 |
| GP63 | 718 | 6 | 378 | 6 |
| DGF-1 | 491 | 1 | 215 | 1 |
| UDP-Gal or UDP-GlcNAc-dependent glycosyltransferase | 128 | 1 | 110 | 1 |
| Protein kinase | 152 | 6 | 118 | 4 |
| Amino acid permease/transporter | 128 | 1 | 93 | 1 |
| Elongation factor (1-alpha, 1-gamma and 2) | 109 | 3 | 63 | 3 |
| Protein Associated with Differentiation | 98 | 1 | 69 | 1 |
| Glutamamyl carboxypeptidase | 96 | 1 | 80 | 1 |
| Syntaxin binding protein | 91 | 2 | 40 | 2 |
| TASV | 87 | 5 | 53 | 3 |
| Heat shock protein 70 | 87 | 2 | 40 | 1 |
| Kinesin | 85 | 3 | 55 | 3 |
| Glycine dehydrogenase | 73 | 2 | 51 | 1 |
| Beta galactofuranosyl glycosyltransferase | 73 | 1 | 37 | 1 |
| Receptor-type adenylate cyclase | 63 | 1 | 27 | 1 |
| Histone H2B | 56 | 1 | 5 | 1 |
| Histone H3 | 54 | 1 | 27 | 1 |
| Oligosaccharyl transferase | 52 | 1 | 3 | 1 |
| Tryptophanyl-tRNA synthetase | 52 | 1 | 36 | 1 |
| ATP-dependent DEAD/H RNA helicase | 59 | 3 | 39 | 3 |
| Casein kinase | 50 | 1 | 42 | 1 |
| Cysteine proteinase | 46 | 1 | 56 | 1 |
| Flagellar calcium-binding 24 kDa protein | 41 | 1 | 52 | 1 |
*Only clusters with at least 50 members are shown.
Fig. 4. Tandem gene organization. (a) Representation of three contigs of TCC as in the web interface, where only conserved genes are shown (green strips). Groups of tandemly arrayed genes are highlighted; parentheses indicate the number of copies. (b) Graph representation of the number of groups of tandemly arrayed genes (tandem lengths from four to ten genes are represented) in the different genome assemblies. TCC in green, Dm28c in violet, CL Brener in grey.
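One simple way to count groups of tandemly arrayed genes like those summarized in panel (b) is to scan the ordered annotations of a contig for runs of consecutive identical entries. The sketch below is hypothetical: the criterion used by the authors is not stated here, and treating "same product name in consecutive positions" as a tandem array is an assumption, as are the function name and the toy input.

```python
from itertools import groupby


def tandem_groups(products, min_copies=4):
    """Return (product, copies) for each run of at least min_copies
    consecutive identical annotations along a contig.

    Assumption: genes are given in genomic order and tandem copies share
    exactly the same product name.
    """
    groups = []
    for product, run in groupby(products):
        copies = sum(1 for _ in run)
        if copies >= min_copies:
            groups.append((product, copies))
    return groups


# Toy contig: five copies of gene A, one B, four C, then two more A.
contig = ["A"] * 5 + ["B"] + ["C"] * 4 + ["A"] * 2
print(tandem_groups(contig))
```

The default threshold of four copies mirrors the lower bound of the tandem lengths shown in the figure.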
Manually annotated surface multigene families
| Family | TCC total | TCC genes | TCC pseudogenes | Dm28c total | Dm28c genes | Dm28c pseudogenes |
|---|---|---|---|---|---|---|
| TS | 1734 | 689 | 1045 | 1491 | 709 | 782 |
| MASP | 1332 | 941 | 391 | 1045 | 736 | 309 |
| GP63 | 718 | 237 | 481 | 378 | 96 | 282 |
| DGF-1 | 491 | 191 | 300 | 215 | 75 | 140 |
| Mucin | 970 | 723 | 247 | 571 | 458 | 113 |
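The counts in this table are internally consistent (genes plus pseudogenes equals the total in every row), and the pseudogene share per family follows directly. The sketch below assumes the first three value columns refer to TCC and the last three to Dm28c; the dictionary and function names are illustrative.

```python
# family: ((TCC total, genes, pseudogenes), (Dm28c total, genes, pseudogenes))
# Assumption: TCC columns come first, matching the order of the other tables.
SURFACE_FAMILIES = {
    "TS": ((1734, 689, 1045), (1491, 709, 782)),
    "MASP": ((1332, 941, 391), (1045, 736, 309)),
    "GP63": ((718, 237, 481), (378, 96, 282)),
    "DGF-1": ((491, 191, 300), (215, 75, 140)),
    "Mucin": ((970, 723, 247), (571, 458, 113)),
}


def pseudogene_fraction(total: int, genes: int, pseudogenes: int) -> float:
    """Share of a family annotated as pseudogenes; rejects inconsistent rows."""
    if genes + pseudogenes != total:
        raise ValueError("genes + pseudogenes does not equal total")
    return pseudogenes / total


for family, (tcc, dm28c) in SURFACE_FAMILIES.items():
    print(
        f"{family}: TCC {pseudogene_fraction(*tcc):.1%}, "
        f"Dm28c {pseudogene_fraction(*dm28c):.1%} pseudogenes"
    )
```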
Complete retrotransposon copy numbers in T. cruzi
| Element | TCC complete copies | TCC length (kb) | TCC identity (%) | TCC GC content (%) | Dm28c complete copies | Dm28c length (kb) | Dm28c identity (%) | Dm28c GC content (%) |
|---|---|---|---|---|---|---|---|---|
| Non-LTR retrotransposons | ||||||||
| CZAR | 43 | 6.497 | 93.2 | 55.7 | 57 | 6.442 | 91.9 | 56.3 |
| L1Tc | 43(13) | 4.874 | 90.7 | 53.0 | 54(18) | 4.749 | 96.5 | 53.3 |
| NARTc* | 110 | 0.257 | 92.6 | 51.7 | 55 | 0.258 | 90.5 | 51.7 |
| TcTREZO | 978 | 1.459 | 92.1 | 50.9 | 297 | 1.423 | 95.2 | 50.3 |
| YR retrotransposons | ||||||||
| VIPER | 244 | 3.454 | 85.0 | 55.2 | 194 | 3.423 | 87.3 | 54.8 |
| SIRE* | 851 | 0.440 | 87.4 | 44.0 | 669 | 0.441 | 88.6 | 44.2 |
*Non autonomous.
Parentheses indicate the number of putative active copies.
Fig. 5. L1Tc phylogeny. Maximum-likelihood phylogeny of complete sequences of L1Tc. Elements from TCC in green, Dm28c in violet, SylvioX10/1 in light violet, CL Brener Esmeraldo-like in light violet grey, CL Brener non-Esmeraldo-like in b.
Tandem repeats in TCC and Dm28c
| TCC | 195 bp | 41 062 | 8 303 881 | 9.5 | 95.4 |
| Dm28c | 195 bp | 12 244 | 2 483 120 | 4.7 | 95.1 |
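The table has no header row, so the following check rests on an interpretation: the large value in each row is taken to be the total base pairs occupied by the 195 bp repeat, and the next value its percentage of the genome. Under that assumption, dividing by the assembly sizes reported in the assembly summary table gives roughly 9.6 % for TCC and 4.7 % for Dm28c, in approximate agreement with the tabulated 9.5 and 4.7. The names below are illustrative.

```python
# Assumed reading of the table: (total repeat bp, reported % of genome).
SATELLITE = {
    "TCC": (8_303_881, 9.5),
    "Dm28c": (2_483_120, 4.7),
}

# Assembly sizes (bp) taken from the assembly summary table.
ASSEMBLY_BP = {"TCC": 86_772_227, "Dm28c": 53_163_602}


def percent_of_genome(repeat_bp: int, assembly_bp: int) -> float:
    """Percentage of the assembly occupied by the repeat."""
    return 100.0 * repeat_bp / assembly_bp


for strain, (repeat_bp, reported) in SATELLITE.items():
    computed = percent_of_genome(repeat_bp, ASSEMBLY_BP[strain])
    print(f"{strain}: computed {computed:.2f} %, reported {reported} %")
```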