
TcSNP: a database of genetic variation in Trypanosoma cruzi.

Alejandro A Ackermann, Santiago J Carmona, Fernán Agüero.

Abstract

The TcSNP database (http://snps.tcruzi.org) integrates information on genetic variation (polymorphisms and mutations) for different stocks, strains and isolates of Trypanosoma cruzi, the causative agent of Chagas disease. The database incorporates sequences (genes from the T. cruzi reference genome, mRNAs, ESTs and genomic sequences); multiple sequence alignments obtained from these sequences; and single-nucleotide polymorphisms and small indels identified by scanning these multiple sequence alignments. Information in TcSNP can be readily interrogated to arrive at gene sets or SNP sets of interest based on a number of attributes. Sequence similarity searches using BLAST are also supported. This first release of TcSNP contains nearly 170,000 high-confidence candidate SNPs, derived from the analysis of annotated coding sequences. As new sequence data become available, TcSNP will incorporate these data, mapping new candidate SNPs onto the reference genome sequences.


Year: 2008 PMID: 18974180 PMCID: PMC2686512 DOI: 10.1093/nar/gkn874

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Trypanosoma cruzi is a protozoan pathogen that infects humans and other mammals, producing a pathology called Chagas disease. The disease is endemic in most of Central and South America, affecting ∼18 million people (1), with an increasing number of cases in North America (2). Trypanosoma cruzi is diploid, with a predominantly clonal (asexual) mode of replication (3), and a high degree of sequence and karyotype variability between strains (4). Based on a number of genetic and isoenzyme markers, the population of T. cruzi has been divided into six discrete evolutionary lineages (5,6), with other studies suggesting the existence of further genetic divisions (7,8). Infection with the parasite results in a number of pathologies and clinical outcomes (megacolon, megaesophagus and cardiomyopathy, among others) that are thought to be the result of a complex interplay between the host genetic background and the genetic variability present in the parasite population (9). Available studies in model organisms support this hypothesis (10), stressing the need for expanding our knowledge of the genetic variation present in the parasite. The genome of T. cruzi was sequenced by a whole-genome sequencing approach, from a hybrid strain (CL Brener) composed of two divergent parental haplotypes (11,12). This choice of strain and sequencing strategy resulted in a high sequence coverage from the two parental haplotypes. Because of the high allelic variation, ∼30 Mb of sequence (out of the estimated 100 Mb of diploid genome size) was found to be present twice in the assembly (12). We have used the genome sequence information, together with sequences from various strains of T. cruzi available in public databases, to map polymorphic sites present in coding sequence loci in T. cruzi.
These candidate SNPs have been analyzed and characterized based on a number of attributes (allele frequency, effect on the encoded protein product, probability of being a true polymorphism, their overlaps with restriction enzyme sites, etc.). All this information has been integrated into a database, called TcSNP and available online at http://snps.tcruzi.org. The data integrated in this database represent the first genome-wide compilation of genetic variation data for T. cruzi. In this article, we describe the TcSNP database, the underlying data and website functionality and demonstrate its application in a number of case scenarios.

OVERVIEW OF THE TCSNP DATABASE

The TcSNP database contains T. cruzi sequences, multiple sequence alignments obtained from these sequences, and single-nucleotide polymorphisms and small indels identified by scanning these multiple sequence alignments (Table 1). Interrogation of the data available in TcSNP can be performed on attributes from each of these objects (Figure 1). For sequences, the database offers text-based searches on attributes derived from their annotation, such as: gene names, locus and database identifiers, gene ontology terms (molecular function, cellular process and components), biochemical pathways, strain from which the gene has been sequenced, etc. In any of these cases, the result is a list of genes matching the specified criteria, containing links to the corresponding multiple sequence alignments, where users can visualize polymorphic sites in different colors, typefaces and styles, as an indication of SNP probability and the effect on the encoded protein sequence (Figure 1). Any result set containing genes can be used to obtain the corresponding set of polymorphic sites present in those genes.

Table 1.

Summary of data available in the current release of TcSNP, showing the numbers of sequences, alignments and SNPs

Sequences	Number	Strains
Reference coding sequences^a	25 013	1
Expressed sequence tags	13 968	3
Other (mRNAs, genomic)	2038	295^b
Alignments
Total No. of alignments	7482
Alignments with two reference sequences^c	5280
SNPs^d
Total No. of SNPs	269 686
With P > 0.7	195 160
Within high quality neighborhoods^e	204 823
Synonymous	110 031
Non-synonymous	111 117

^a From the reference CL Brener genome (12).

^b This figure includes redundancy in strain names, see Methods section for more information.

^c Allelic variants of the two CL Brener haplotypes.

^d Number of SNPs in each row is independent from other rows.

^e Fewer than three SNPs in a window of 10 bp.

Figure 1.

Example search session showing the navigation flow in the TcSNP website. Users can do a gene-centric search (e.g. using the keywords ‘cell division protein kinase’), a SNP-centric search (e.g. SNP is polymorphic between strains Tul2 and CL Brener) or a sequence similarity-based search (using BLAST, not shown). From any list of results users can access the corresponding multiple sequence alignment of interest (path A), and view SNP-specific information (e.g. quality score, mutation type, detected alleles, etc.) (paths B and C).

The polymorphic sites available in TcSNP have been characterized based on a number of criteria. Using PolyBayes (13), we have calculated the probability of these sites being true polymorphic sites (as opposed to sequencing errors, see Methods section). Also, we have obtained a measure of the quality of the sequence around each site by noting the distance between neighboring polymorphic sites. Based on this analysis, we have marked each site as located inside or outside of sequence regions containing three or more polymorphic sites in a window of 10 bp. Finally, we have assessed the change introduced by each putative SNP on the encoded protein (synonymous substitution, non-synonymous substitution or premature stop), and provide the ratio of dN (number of nonsynonymous changes per nonsynonymous site) and dS (number of synonymous changes per synonymous site) values for a significant portion of the alignments, as an estimate of the selection pressure acting on these genes (14).
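The neighborhood criterion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the TcSNP implementation: the function name is hypothetical, positions are taken to be integer alignment coordinates, and "a window of 10 bp" is read as any run of 10 consecutive bases.

```python
from bisect import bisect_right

def flag_dense_neighborhoods(positions, window=10, min_sites=3):
    """Map each polymorphic position to True if it lies inside some
    `window`-bp stretch holding `min_sites` or more polymorphic sites.

    Assumption: a window is any run of `window` consecutive bases.
    """
    sites = sorted(set(positions))
    dense = {p: False for p in sites}
    for i, start in enumerate(sites):
        # Index just past the last site covered by [start, start + window - 1].
        j = bisect_right(sites, start + window - 1)
        if j - i >= min_sites:
            for p in sites[i:j]:
                dense[p] = True
    return dense

flags = flag_dense_neighborhoods([100, 104, 109, 150, 300, 305])
# Sites flagged False would count as lying in a good quality neighborhood.
good_quality = sorted(p for p, is_dense in flags.items() if not is_dense)
```

Only windows that begin at a polymorphic site need to be checked, because shifting any window rightwards until its first base is a polymorphic site never drops a site from it.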
All these attributes can be used to filter SNPs and arrive at SNP sets of interest. Users interested in a specific genomic region can also look for SNPs using genome contig identifiers and base coordinates. In TcSNP, all searches in a user's session are shown in the query history page, where they can be combined using standard operators (UNION, INTERSECTION and SUBTRACTION). As an example, users interested in high-confidence polymorphic sites in the T. cruzi strain Dm28 (which belongs to the evolutionary lineage Tcruzi I) can arrive at this set by asking for the INTERSECTION of SNP sets obtained for each individual criterion. This is illustrated in Figure 2, which also shows how users can overcome the existing redundancy in strain names by obtaining a UNION of SNP sets with similar strain names (see discussion about this issue in the Data sources section). As was the case for genes, SNP sets can be easily converted into the corresponding set of genes containing these SNPs.
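The query-history operators behave like ordinary set operations, which the following Python sketch mimics. The SNP identifiers and variable names are invented for illustration; they are not real TcSNP records, and the website itself performs these operations through its query history page rather than through code like this.

```python
# Hypothetical SNP identifiers returned by four individual searches.
high_score = {101, 102, 103, 104, 105}       # SNPs with a high probability score
good_neighborhood = {102, 103, 104, 106}     # SNPs in good quality neighborhoods
strain_dm28 = {103, 107}                     # SNPs observed in strain "Dm28"
strain_dm28c = {104, 108}                    # SNPs observed in strain "Dm28c"

# UNION: pool the two spellings of what is presumably the same strain.
strain_any = strain_dm28 | strain_dm28c

# INTERSECTION: keep only SNPs that satisfy every criterion.
markers = high_score & good_neighborhood & strain_any

# SUBTRACTION: high-scoring SNPs that fall outside good neighborhoods.
suspect = high_score - good_neighborhood
```

Here `markers` plays the role of the final combined SNP set, and `suspect` shows how SUBTRACTION can isolate candidates that deserve a closer look.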

Figure 2.

Using the query history in TcSNP to combine queries. In this example, in order to obtain high quality SNPs in a strain of interest (high score, located in good quality neighborhoods), users combine SNP sets that were obtained by filtering SNPs based on specific attributes. In the figure, the intersection of the SNP sets #1, #2 and #5 has been calculated, resulting in SNP set #6. In particular, note that #5 is the result of a union of sets #3 and #4, showing how to overcome the redundancy in strain names (Dm28c is presumably a cloned stock derived from strain Dm28). Selected queries can be combined using standard set theory operators (UNION, INTERSECTION and SUBTRACTION). On the right, a Venn diagram illustrates the operations performed on the SNP sets in this example.

In TcSNP both genes and SNPs are linked to a multiple sequence alignment, which is the source object from which all SNPs and indels were identified. Many properties that are specific for multiple sequence alignments are also searchable, such as the number of sequences contained in the alignment, the number of reference sequences from the CL Brener genome, the number of SNPs identified in the alignment and the quality of the alignment, as estimated by two different parameters. Users can also perform searches based on sequence similarity, by interrogating the TcSNP database with their own query sequences using the BLAST search tool integrated into the website. Currently these searches are performed against the consensus sequences of all the alignments in TcSNP. Finally, the website provides linkouts to other web resources where users can find additional information on their genes of interest. Genes are linked to the corresponding gene pages at TcruziDB (15), GeneDB (16), TDR Targets (17) and to the corresponding source records at the NCBI.

APPLICATIONS OF THE TCSNP DATABASE

Individual researchers working with T. cruzi and/or Chagas disease are interested in sequence polymorphisms for different reasons. Molecular biologists may need information about SNPs in their genes of interest to avoid these polymorphic sites when designing oligonucleotide primers or gene knockout vectors. In contrast, researchers interested in the evolution of different T. cruzi lineages, or in the development of new typing assays, might be interested in using selected SNPs as genetic markers. Genetic variation data are also important when assessing the potential for development of resistance in established drug targets, and when prioritizing candidate drug targets. The overall functionality of the TcSNP database (available search attributes, display of multiple sequence alignments and SNPs) has been designed with these possible uses in mind. The exercise illustrated in Figure 2 shows how a user can quickly arrive at genetic markers of interest. In this example, we show how to select high-quality SNPs (high SNP score, with good sequence quality neighborhoods) that are polymorphic in a strain from the evolutionary lineage Tcruzi I, and which are therefore good candidates for a diagnostic assay. Another exercise facilitated by the database is the selection of genes that are under purifying selection (dN/dS ≪ 1). This set of genes is a good starting point to look for potential drug and/or vaccine candidates. Conversely, genes under diversifying selection (dN/dS ≥ 1) might represent interesting candidates for discriminating diagnostics.
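As a rough illustration of the dN/dS-based selection of genes, the Python sketch below sorts genes into categories from their dN and dS values. The function name and the example gene labels are hypothetical, and the cutoff of 0.2 used to stand in for "dN/dS ≪ 1" is an arbitrary assumption; the text does not specify a numerical threshold.

```python
def classify_selection(dn, ds, purifying_cutoff=0.2):
    """Label a gene by its dN/dS ratio.

    `purifying_cutoff` is an assumed stand-in for "dN/dS << 1".
    """
    if ds == 0:
        # The ratio is undefined when no synonymous change was observed.
        return "undefined"
    ratio = dn / ds
    if ratio >= 1:
        return "diversifying"
    if ratio <= purifying_cutoff:
        return "purifying"
    return "intermediate"

# Hypothetical (dN, dS) pairs for three made-up genes.
genes = {"geneA": (0.01, 0.2), "geneB": (0.3, 0.2), "geneC": (0.1, 0.2)}
labels = {name: classify_selection(dn, ds) for name, (dn, ds) in genes.items()}
drug_target_candidates = [g for g, label in labels.items() if label == "purifying"]
```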

CONCLUSIONS

The TcSNP database currently represents a comprehensive resource on T. cruzi coding single nucleotide polymorphisms. In this first release of TcSNP, the dataset contains minimal information on SNPs located in intergenic (noncoding) regions of the genome (most of these SNPs are derived from the untranslated regions of ESTs, and from intergenic regions present in sequences obtained from GenBank). As expected due to the high sequence coverage for the two parental haplotypes of the CL Brener strain, and the currently limited sequence information available for other strains, the majority of these candidate SNPs correspond to heterozygous sites in CL Brener. When interpreting SNP data in T. cruzi, the draft status of the reference genome and its repetitive nature have to be taken into account. Sequence variation present in genes from large gene families might be underestimated in TcSNP when looking at a single multiple sequence alignment, because these families are usually represented by more than one multiple sequence alignment in the database. Although many problematic alignments have been manually curated (e.g. to join two alignments containing sequences from the same gene), the emphasis of this curation has been placed on alignments of single copy genes, where there is a low chance of dealing with paralogs. In this respect, the recent observation made by Arner et al. (18) about the collapsing of many gene copies in the genome assembly further reinforces the importance of interpreting SNP data with care. Their analysis shows that, especially in the case of repetitive genes, copy numbers might have been underestimated (18). For SNP discovery, this collapsing of sequences during genome assembly may result in an underestimation of polymorphic sites.

FUTURE WORK

Future releases of TcSNP will integrate new T. cruzi sequences as they become available. Many of these are expected to come from a resequencing effort that is underway in our lab. Planned updates to the website functionality include the development of a standardized API and web services through which other databases can consume the information provided by TcSNP and the development of functionality to design primers by interfacing with primer3 (19), while using the SNP information in this process. We encourage users to send feedback on desired features for improving the TcSNP database.

METHODS

Database and web application

TcSNP is composed of a web application written in Perl, running against a PostgreSQL database. The database schema is based on the Genomics Unified Schema (GUS, http://gusdb.org) with local customizations. The web application has been developed using a Model-View-Controller architectural pattern, where the work of each layer is performed by a specialized Perl component [all components are available from CPAN (20)]. Access to the database from Perl (i.e. the Model) is done through a hybrid combination of (i) an abstraction of the database schema using Perl's DBIx::Class package and (ii) custom SQL queries executed using Perl's database access interface (DBI). The controller component managing user requests and dispatching calls to other components is Perl's Catalyst (21). A number of custom controller modules were developed, which contain the business logic of the TcSNP application. Finally, the View component is Perl's Template Toolkit, which uses custom templates to render web pages using information provided by the controller. The database runs on dedicated FreeBSD servers, with the Catalyst web application running under the Apache web server.

Data sources

The reference genome sequence of the CL Brener strain of T. cruzi was obtained from GenBank using the umbrella accession number AAHK00000000 (July, 2005). Other T. cruzi sequences (mRNAs and ESTs) were also obtained from GenBank using custom Entrez queries (May, 2007). Before loading into the database, some curation has been done to standardize the names of T. cruzi strains. This was necessary because of the variations in how different authors write the names of strains in GenBank submissions and publications. For example, the ‘CL Brener’ strain appears in different GenBank sequences as CL Brener, CL-Brener, CLBrener, CL-Brenner, CL Brenner, Cl Brenner, or Cl Brener (note the different capitalization, use of middle dash, spaces and the writing of the strain name using a single or a double ‘n’). Because the database allows users to perform searches using strain names, it was important to reduce redundancy where possible. In some cases, however, it was important to keep the distinction between similar strain names, for example to discriminate between cloned stocks and their parental uncloned strains. As an example, we have kept CanIII and CanIII CL1 (CL1 stands for ‘clone 1’), and Sc43 and Sc43 CL1 as different ‘strains’ in TcSNP. Redundancy is still present, however, and further curation of strain names can be done. We also encourage users of the database to send feedback about this issue.
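The kind of strain-name standardization described above can be sketched as a lookup on a simplified key. This Python example is an assumption about how such curation might be automated, not the procedure actually used for TcSNP; the function name and the alias table are hypothetical and cover only the ‘CL Brener’ spellings listed in the text.

```python
import re

# Curated aliases, keyed by the name with case, spaces and hyphens removed.
# Only the spellings mentioned in the text are covered here.
CANONICAL = {
    "clbrener": "CL Brener",
    "clbrenner": "CL Brener",
}

def normalize_strain(name):
    """Return the curated strain name, or the input with tidy spacing."""
    key = re.sub(r"[\s\-]+", "", name).lower()
    # Names without a curated alias are kept, so that cloned stocks such as
    # "CanIII CL1" stay distinct from their parental strain "CanIII".
    return CANONICAL.get(key, " ".join(name.split()))

variants = ["CL Brener", "CL-Brener", "CLBrener", "CL-Brenner",
            "CL Brenner", "Cl Brenner", "Cl Brener"]
normalized = {normalize_strain(v) for v in variants}
```

Using an explicit alias table rather than fuzzy matching keeps the merging decisions under manual control, which matters when similar names must remain separate.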

Sequence clustering and alignment

Before clustering, all sequences were masked against a library of vector sequences and T. cruzi repetitive elements, as described previously (22). Annotated coding sequences from the reference genome, and other publicly available sequences, were mapped against the genome scaffolds using BLAT. Sequences mapping to the same genomic regions were clustered together and multiple sequence alignments were obtained using phrap. This initial clustering allowed us to group mRNAs, ESTs and reference coding sequences according to the reference genome assembly sequences they mapped to. However, because allelic variants in the CL Brener genome were separated during assembly (12), those initial clusters showed many instances of allelic variants separated into different alignments (i.e. each mapping to its own contig). To obtain alignments between allelic variants, we merged alignments with highly similar consensus sequences (by BLAST analysis). Afterwards, and based on user feedback, we have also merged, split and re-analyzed many alignments. This manual curation effort was mainly focused on single copy genes. Users of TcSNP should also be aware of the fact that many sequences from the CL Brener genome assembly are located in contigs with assembly problems or may represent assembly artifacts. For sequences containing assembly warnings in the original GenBank records, we have attached similar notes to the corresponding alignments in TcSNP to help users in the interpretation of the SNP data.
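The first clustering step, grouping sequences that map to the same genomic region, can be illustrated with a simple interval-merging sketch in Python. This does not reproduce the BLAT and phrap pipeline; it assumes the mapping results have already been reduced to hypothetical (sequence, contig, start, end) tuples with inclusive coordinates, and that any overlap on the same contig is enough to join a cluster.

```python
from itertools import groupby
from operator import itemgetter

def cluster_by_region(hits):
    """Group sequence ids whose mapped intervals overlap on the same contig.

    `hits` holds (seq_id, contig, start, end) tuples, coordinates inclusive.
    """
    clusters = []
    ordered = sorted(hits, key=itemgetter(1, 2, 3))
    for _contig, group in groupby(ordered, key=itemgetter(1)):
        current, current_end = None, None
        for seq_id, _, start, end in group:
            if current is not None and start <= current_end:
                # Overlaps the running region: extend the current cluster.
                current.append(seq_id)
                current_end = max(current_end, end)
            else:
                # Gap on the contig, or a new contig: start a new cluster.
                current = [seq_id]
                clusters.append(current)
                current_end = end
    return clusters

# Made-up mapping results for illustration only.
hits = [
    ("est1", "contig1", 100, 500),
    ("cds1", "contig1", 400, 900),
    ("mrna1", "contig1", 2000, 2500),
    ("cds2", "contig2", 100, 500),
]
clusters = cluster_by_region(hits)
```

Note that `cds2` ends up alone even if it were an allelic variant of `cds1`, which mirrors why a second, similarity-based merging step was needed for alleles assembled into separate contigs.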

Candidate SNP identification and analysis

Multiple sequence alignments were scanned to identify polymorphic columns. To calculate the probability of these sites being true polymorphisms as opposed to sequencing errors, we have used the software package PolyBayes, version 5 (13). PolyBayes uses a Bayesian statistical framework that relies on allele frequency, alignment depth and base quality values, amongst other attributes, to calculate a probability score. Because chromatogram trace data are not available for many of the sequences in this release, we have devised a scoring strategy that uses arbitrary base quality values, which differ depending on the sequence origin/type. Sequence bases obtained from the T. cruzi CL Brener genome (∼19X shotgun coverage) were arbitrarily assigned a base quality value of 40; those from GenBank records (individual submissions), a value of 30; and those from dbEST (single-pass, unedited), a value of 20. As an example, using this scoring scheme, a single base from an EST differing from two allelic variants of CL Brener reference sequence (depth = 3) would give a probability of 0.22 of being a true SNP (see for example: http://snps.tcruzi.org/snps/view/4028216). To analyze the effect of each SNP on the corresponding protein product, we noted the codon position of the SNP in each reference coding sequence and evaluated the change introduced by the polymorphic base. Also, for a subset of the alignments (those containing coding sequences of similar length, with indels being a multiple of 3), we calculated dN and dS values (14) using BioPerl's population genetics modules (23).
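The evaluation of a SNP's effect on the encoded protein can be sketched as a codon lookup. The Python example below is a minimal illustration, not the TcSNP code: the function name, the 0-based position convention and the toy coding sequence are assumptions, and the standard genetic code is assumed for nuclear coding sequences.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumed here).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def classify_snp(cds, position, alt_base):
    """Classify a substitution at a 0-based `position` of an in-frame CDS."""
    offset = position % 3                      # codon position of the SNP
    codon_start = position - offset
    ref_codon = cds[codon_start:codon_start + 3].upper()
    alt_codon = ref_codon[:offset] + alt_base.upper() + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "premature stop"
    return "non-synonymous"

toy_cds = "ATGGCTTGGTAA"                       # Met-Ala-Trp-stop, made up
effects = [
    classify_snp(toy_cds, 5, "C"),             # GCT -> GCC, Ala -> Ala
    classify_snp(toy_cds, 3, "A"),             # GCT -> ACT, Ala -> Thr
    classify_snp(toy_cds, 8, "A"),             # TGG -> TGA, Trp -> stop
]
```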

FUNDING

National Agency for the Promotion of Science and Technology (ANPCyT, Argentina) (grant PICT 38209); and the University of San Martín (grant S05/22). Funding for open access charge: ANPCyT. Conflict of interest statement. None declared.

20 in total

1. Genetic Variability of Trypanosoma cruzi: Implications for the Pathogenesis of Chagas Disease.

Authors: A M Macedo; S D Pena
Journal: Parasitol Today Date: 1998-03

2. A random sequencing approach for the analysis of the Trypanosoma cruzi genome: general structure, large gene and repetitive DNA families, and gene discovery.

Authors: F Agüero; R E Verdún; A C Frasch; D O Sánchez
Journal: Genome Res Date: 2000-12 Impact factor: 9.043

3. Characterisation of large and small subunit rRNA and mini-exon genes further supports the distinction of six Trypanosoma cruzi lineages.

Authors: S Brisse; J Verhoef; M Tibayrenc
Journal: Int J Parasitol Date: 2001-09 Impact factor: 3.981

4. The genome sequence of Trypanosoma cruzi, etiologic agent of Chagas disease.

Authors: Najib M El-Sayed; Peter J Myler; Daniella C Bartholomeu; Daniel Nilsson; Gautam Aggarwal; Anh-Nhi Tran; Elodie Ghedin; Elizabeth A Worthey; Arthur L Delcher; Gaëlle Blandin; Scott J Westenberger; Elisabet Caler; Gustavo C Cerqueira; Carole Branche; Brian Haas; Atashi Anupama; Erik Arner; Lena Aslund; Philip Attipoe; Esteban Bontempi; Frédéric Bringaud; Peter Burton; Eithon Cadag; David A Campbell; Mark Carrington; Jonathan Crabtree; Hamid Darban; Jose Franco da Silveira; Pieter de Jong; Kimberly Edwards; Paul T Englund; Gholam Fazelina; Tamara Feldblyum; Marcela Ferella; Alberto Carlos Frasch; Keith Gull; David Horn; Lihua Hou; Yiting Huang; Ellen Kindlund; Michele Klingbeil; Sindy Kluge; Hean Koo; Daniela Lacerda; Mariano J Levin; Hernan Lorenzi; Tin Louie; Carlos Renato Machado; Richard McCulloch; Alan McKenna; Yumi Mizuno; Jeremy C Mottram; Siri Nelson; Stephen Ochaya; Kazutoyo Osoegawa; Grace Pai; Marilyn Parsons; Martin Pentony; Ulf Pettersson; Mihai Pop; Jose Luis Ramirez; Joel Rinta; Laura Robertson; Steven L Salzberg; Daniel O Sanchez; Amber Seyler; Reuben Sharma; Jyoti Shetty; Anjana J Simpson; Ellen Sisk; Martti T Tammi; Rick Tarleton; Santuza Teixeira; Susan Van Aken; Christy Vogt; Pauline N Ward; Bill Wickstead; Jennifer Wortman; Owen White; Claire M Fraser; Kenneth D Stuart; Björn Andersson
Journal: Science Date: 2005-07-15 Impact factor: 47.728

Review 5. Evolutionary genetics of Trypanosoma and Leishmania.

Authors: M Tibayrenc; F J Ayala
Journal: Microbes Infect Date: 1999-05 Impact factor: 2.700

6. Evolutionary relationships in Trypanosoma cruzi: molecular phylogenetics supports the existence of a new major lineage of strains.

Authors: C Robello; F Gamarro; S Castanys; F Alvarez-Valin
Journal: Gene Date: 2000-04-04 Impact factor: 3.688

Review 7. The trypanosomiases.

Authors: Michael P Barrett; Richard J S Burchmore; August Stich; Julio O Lazzari; Alberto Carlos Frasch; Juan José Cazzulo; Sanjeev Krishna
Journal: Lancet Date: 2003-11-01 Impact factor: 79.321

Review 8. Genomic-scale prioritization of drug targets: the TDR Targets database.

Authors: Fernán Agüero; Bissan Al-Lazikani; Martin Aslett; Matthew Berriman; Frederick S Buckner; Robert K Campbell; Santiago Carmona; Ian M Carruthers; A W Edith Chan; Feng Chen; Gregory J Crowther; Maria A Doyle; Christiane Hertz-Fowler; Andrew L Hopkins; Gregg McAllister; Solomon Nwaka; John P Overington; Arnab Pain; Gaia V Paolini; Ursula Pieper; Stuart A Ralph; Aaron Riechers; David S Roos; Andrej Sali; Dhanasekaran Shanmugam; Takashi Suzuki; Wesley C Van Voorhis; Christophe L M J Verlinde
Journal: Nat Rev Drug Discov Date: 2008-10-17 Impact factor: 84.694

9. TcruziDB: an integrated, post-genomics community resource for Trypanosoma cruzi.

Authors: Fernán Agüero; Wenlong Zheng; D Brent Weatherly; Pablo Mendes; Jessica C Kissinger
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

10. Neglected infections of poverty in the United States of America.

Authors: Peter J Hotez
Journal: PLoS Negl Trop Dis Date: 2008-06-25
