Rodrigo P Baptista, Joao Luis Reis-Cunha, Jeremy D DeBarry, Egler Chiari, Jessica C Kissinger, Daniella C Bartholomeu, Andrea M Macedo.
Abstract
Next-generation sequencing (NGS) methods are low-cost, high-throughput technologies that produce thousands to millions of sequence reads. Despite the high number of raw sequence reads, their short length, relative to Sanger, PacBio or Nanopore reads, complicates the assembly of genomic repeats. Many genome tools are available, but the assembly of highly repetitive genome sequences using only NGS short reads remains challenging. Genome assembly of organisms responsible for important neglected diseases, such as Trypanosoma cruzi, the aetiological agent of Chagas disease, is known to be challenging because of the repetitive nature of their genomes. Only three of the six recognized discrete typing units (DTUs) of the parasite have published draft genomes, and genome evolution analyses in the taxon are therefore limited. In this study, we developed a computational workflow, based on Illumina reads, that assembles highly repetitive genomes by combining de novo and reference-based assembly strategies to better overcome the intrinsic limitations of each. The highly repetitive genome of the human-infecting parasite T. cruzi 231 strain was used as a test subject. The combined-assembly approach shown in this study benefits from the ability of reference-based assembly to resolve highly repetitive sequences and from the capacity of de novo assembly to assemble genome-specific regions, improving the quality of the assembly. The acceptable confidence obtained by analyzing our results showed that our combined approach is an attractive option to assemble highly repetitive genomes with NGS short reads. Phylogenomic analysis including the 231 strain, the first representative of DTU III whose genome was sequenced, was also performed and provides new insights into T. cruzi genome evolution.
Keywords: DTUs; Trypanosoma cruzi; evolution; genome assembly
Year: 2018 PMID: 29442617 PMCID: PMC5989580 DOI: 10.1099/mgen.0.000156
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
Fig. 1. Outline of the combined-assembly pipeline.
Fig. 2. Competitive genome mapping results. Genome coverage obtained by competitive and individual mapping of Tc231 reads against the reference genome sequences of the Esmeraldo-like and non-Esmeraldo-like CL Brener haplotypes.
Comparison of T. cruzi assemblies available on public databases, and our new assembly obtained by using different assembly methodologies
All data are available at: http://www.ncbi.nlm.nih.gov/genome/genomes/25. The Sylvio X10 genome does not have data for scaffolds, only contigs. na, Not applicable.
| Strain / assembly | Size* (Mb) | G+C content (%) | No. of scaffolds | Scaffold N50 | No. of contigs | Contig N50 | Platform |
|---|---|---|---|---|---|---|---|
| CL Brener | 89.94 | 51.7 | 29 495 | 88 624 | 32 746 | 14 669 | Sanger |
| JR cl.4 | 41.48 | 51.3 | 15 312 | 83 591 | 18 103 | 7407 | Roche 454 |
| Tula cl.2 | 83.51 | 51.4 | 45 711 | 7772 | 53 083 | 2193 | Roche 454 |
| Esmeraldo | 38.08 | 50.9 | 15 803 | 66 229 | 20 187 | 5353 | Roche 454 |
| Sylvio X10 | 38.59 | 51.1 | – | – | 27 019 | 2307 | Roche 454+Illumina |
| 231 (de novo assembly) | 28.41 | 50.0 | 13 482 | 3745 | 16 684 | 2242 | Illumina |
| 231 (reference-based assembly) | 24.98 | 50.7 | na | na | 21 464 | 3239 | |
| 231 (combined assembly) | 35.35 | 48.7 | 8471 | 14 202 | 13 576 | 5300 | |
*Genome sizes were obtained by counting nucleotides in the genome, excluding Ns.
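The summary statistics reported in the table above (size excluding Ns, G+C content and N50) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names are hypothetical, the input is a plain list of sequence strings rather than a FASTA file, and the G+C percentage is assumed here to be computed over non-N bases only.

```python
from typing import List


def size_excluding_n(seqs: List[str]) -> int:
    """Total number of nucleotides across all sequences, not counting 'N'."""
    return sum(len(s) - s.upper().count("N") for s in seqs)


def gc_percent(seqs: List[str]) -> float:
    """G+C content as a percentage of non-N bases (assumed denominator)."""
    gc = sum(s.upper().count("G") + s.upper().count("C") for s in seqs)
    total = size_excluding_n(seqs)
    return 100.0 * gc / total if total else 0.0


def n50(lengths: List[int]) -> int:
    """Length L such that sequences of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


if __name__ == "__main__":
    # Toy contigs, purely illustrative; not data from the study.
    contigs = ["ACGTNN", "nnGG", "ATATATAT"]
    print(size_excluding_n(contigs))
    print(round(gc_percent(contigs), 1))
    print(n50([len(c) for c in contigs]))
```

A higher N50 at a comparable total size indicates a less fragmented assembly, which is how the scaffold and contig N50 columns are usually read.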
Fig. 3. Comparison of results produced by each assembly approach. MapView of the three strategies used in this study: de novo, reference-based and combined assembly. (a) Read alignment frequency along chromosome 1 in the reference-based assembly; (b) schematic drawing of chromosome 1, where black=hypothetical proteins, grey=housekeeping genes and yellow, dark blue and green=multigene families; (c) scaffold alignments along the T. cruzi Non-Esmeraldo chromosome 1 (TcChr1-P) reference sequence and their degree of identity. The lines connecting the sequences in the alignment represent different gapped alignments.
Overview of transferred annotation elements
| Count | Description |
|---|---|
| 34 803 | Elements were found on the reference |
| 31 182 | Elements were completely transferred |
| 0 | Elements were partially transferred |
| 1266 | Elements were split |
| 3621* | Elements were not transferred |
| 10 833 | Gene models were transferred from the reference |
| 10 592 | Gene models were transferred correctly |
| 0 | Gene models were partially transferred |
| 241* | Gene models were not transferred |
*Multi-copy gene families, pseudogenes and some hypothetical proteins.
Fig. 4. Venn diagram of Tc231 sequences indicating the number of single-copy orthologous gene clusters that are specific to or shared between TcI, TcII and Non-Esmeraldo (TcIII-like).
Fig. 5. ML nuclear tree obtained from 30 concatenated nuclear gene sequences. For the reconstruction, the best amino acid substitution model was JTT+G, as selected by ProtTest, with 1000 bootstrap replicates used for statistical support. The scale bar shows the branch length corresponding to an amount of genetic change of 0.05.
Fig. 6. Average divergence times based on Bayesian analysis for the major T. cruzi lineages. (a) Divergence times estimated from 30 nuclear loci with the relaxed lognormal clock model. (b) Data for alternative models. Scale bar in millions of years ago (mya).