Literature DB >> 29343494

Uncovering Genomic Regions Associated with Trypanosoma Infections in Wild Populations of the Tsetse Fly Glossina fuscipes.

Andrea Gloria-Soria1, W Augustine Dunn2, Xiaoqing Yu3, Aurélien Vigneron4, Kuang-Yao Lee3, Mo Li3, Brian L Weiss4, Hongyu Zhao3, Serap Aksoy4, Adalgisa Caccone2.   

Abstract

Vector-borne diseases are responsible for > 1 million deaths every year but genomic resources for most species responsible for their transmission are limited. This is true for neglected diseases such as sleeping sickness (Human African Trypanosomiasis), a disease caused by Trypanosoma parasites vectored by several species of tseste flies within the genus Glossina We describe an integrative approach that identifies statistical associations between trypanosome infection status of Glossina fuscipes fuscipes (Gff) flies from Uganda, for which functional studies are complicated because the species cannot be easily maintained in laboratory colonies, and ∼73,000 polymorphic sites distributed across the genome. Then, we identify candidate genes involved in Gff trypanosome susceptibility by taking advantage of genomic resources from a closely related species, G. morsitans morsitans (Gmm). We compiled a comprehensive transcript library from 72 published and unpublished RNAseq experiments of trypanosome-infected and uninfected Gmm flies, and improved the current Gmm transcriptome assembly. This new assembly was then used to enhance the functional annotations on the Gff genome. As a consequence, we identified 56 candidate genes in the vicinity of the 18 regions associated with Trypanosoma infection status in Gff Twenty-nine of these genes were differentially expressed (DE) among parasite-infected and uninfected Gmm, suggesting that their orthologs in Gff may correlate with disease transmission. These genes were involved in DNA regulation, neurophysiological functions, and immune responses. We highlight the power of integrating population and functional genomics from related species to enhance our understanding of the genetic basis of physiological traits, particularly in nonmodel organisms.
Copyright © 2018 Andrea et al.

Entities:  

Keywords:  Trypanosoma; gene–phenotype association in tsetse flies; population genomics; sleeping sickness

Mesh:

Year:  2018        PMID: 29343494      PMCID: PMC5844309          DOI: 10.1534/g3.117.300493

Source DB:  PubMed          Journal:  G3 (Bethesda)        ISSN: 2160-1836            Impact factor:   3.154


The tsetse fly, Gff, is an important vector of both human (HAT, sleeping sickness) and animal (AAT, Nagana) Trypanosomiasis in Africa. Trypanosome transmission is complex, as it requires mammalian and invertebrate hosts, and involves domestic and wild reservoirs. Gff is the main HAT vector in Uganda, the only country where the two types of HAT-causing parasites cooccur. More specifically, Trypanosoma brucei gambiense, which causes the chronic form of HAT, is found in northwestern Uganda, while T. brucei rhodesiense, which causes the acute form of the disease, occurs in the country’s southeast. Although the range of the two diseases is geographically separated, the gap between them is rapidly narrowing and likely to merge in the near future in northern Uganda (Picozzi ). Because the pathology, diagnosis, and treatment varies between the two disease forms, an eventual overlap of the two disease belts would complicate HAT control (Welburn ). Previous work from our laboratory, using microsatellite loci and mitochondrial DNA, has documented spatial and temporal patterns of genetic variation, and levels of genetic differentiation within and among populations of Gff in Uganda. Three genetically distinct Gff population clusters can be identified: the northern, western, and southern regions (Beadell ; Hyseni ). These clusters are both spatially and temporally stable (Beadell ; Echodu ; Hyseni ; Opiro , 2017), even though hybrid zones among these units exist (Beadell ; Opiro ), which facilitate genetic admixing within a 100 km radius (Beadell ). This information can be used to guide vector monitoring and control interventions. However, it does not provide insights on the genetic basis of traits shaping the interactions between vectors and parasites. Understanding the interaction between vectors and their parasites is fundamental to inform disease control methods, but data from natural populations is scarce. This is a major knowledge gap with important epidemiological implications. To facilitate the study of genotype–phenotype interactions, we recently expanded the genetic marker toolset of Gff using whole-genome sequencing (WGS) coupled with double digestion RAD methods [ddRADSeq (Baird ; Peterson )] to a set of ∼73,000 SNPs distributed across the Gff genome (Gloria-Soria ). The Gff SNP data set obtained from this pilot study was generated from flies belonging to each of the three Ugandan genetic clusters (N = 53), with the same number of infected and uninfected flies per population. This data set was used to calculate the average linkage disequilibrium (LD) across the Gff genome and to look for signatures of selection via genomic scans (Gloria-Soria ). Our results indicated that LD in Gff decays at a slower rate than in other insects, being an order of magnitude lower than in the fruit fly Drosophila melanogaster (Mackay ). The average genomic LD also varied among the Gff populations (r2max/2 ranged between 1359 and 2429 bp) (Gloria-Soria ). These high levels of LD likely increased our ability to subsequently identify loci under selection, despite the small sample size (Gloria-Soria ). The results of this pilot study support the feasibility of conducting GWAS analyses in this species, using much lower sample sizes that those required by other species, including Anopheles or Aedes mosquitoes, whose LD levels are quite low (Harris ; David ; Marsden ). Although the pilot study yielded a list of candidate SNPs, mostly related to environmental adaptation (Gloria-Soria ), only one SNP was identified as associated with the infection status of the flies using outlier detection methods (Fischer ; Duforet-Frebourg ). Furthermore, functional characterization of candidate genes was complicated by the poor annotation of the Gff genome and the lack of transcriptomics for this species, a common problem in nonmodel organisms. The present work extends the analyses on the initial ddRAD data set from Gloria-Soria to gain additional insights on the nature of the genomic regions associated with the susceptibility to Trypanosoma infection of Gff populations by using statistical methods that maximize efficiency and power (Berg and Coop 2014), while accounting for spatial stratification of the samples as well as for species- and population-specific LD (Purcell ). This work takes advantage of the functional data available from a closely related species, Gmm (Petersen ), to improve the Gff genome annotation of candidate genes, thus enabling their functional characterization. Comparison of the genomes from six tsetse fly species, including Gmm and Gff, is ongoing, but our preliminary analysis indicates that > 90% of the predicted coding sequences from Gff align to the Gmm sequences. The remarkable similarity across the Gff and Gmm genomes highlights the validity of the present approach. As a result of this method, we identified candidate parasite resistance genes in Gff that could potentially be used to develop targeted control strategies for this and other Glossina species, paving the way for association studies on other phenotypes of epidemiological interest in this and other tsetse species (e.g., developmental time, life span, and susceptibility to insecticides, etc.).

Materials and Methods

Gff association analyses

Gff genomic data:

The starting SNP data set for this study included all 73,297 loci identified by Gloria-Soria , using a combination of double digest RADseq (Peterson ) and WGS (40× coverage/individual). The raw data are available from NCBI bioproject 303153. The SNPs are distributed across the Gff genome at a density of ∼1 SNP for every 5 kb and were derived from four geographically distinct populations in Uganda—Otuboi, OT (North), Masindi, MS (West), and Namutumba, NB and Kalangala Island (South)—which belong to three genetic clusters. Sampling details can be found in Gloria-Soria . The use of genetically distinct populations maximized the natural genetic diversity represented in the data set, while also minimizing ascertainment bias, thus reducing the impact of false positives in our analysis (Schoville ).

Genetic association analysis for susceptibility to trypanosome infection:

The analysis was conducted on individuals from three of the four Gff populations from Uganda described in Gloria-Soria . These populations represent each of the three Gff genetic clusters in this country. The Kalangala Island population was excluded due to small sample size. Association analysis was performed using PLINK v1.9 (Chang ) under a logistic regression model to identify SNPs associated with infection status. Infection status is treated as response and the predictor is the additive allele dosage effect of each marker. To minimize the confounding effects due to population structure, we used Faststructure v1.0 (Raj ) to characterize the population structure by a vector (X1, X2, and X3), where the optimal dimensionality was determined by chooseK.py. Because (X1, X2, and X3) is compositional, log transformations were applied on X1/X3 and X2/X3 before including them in the model. We used this method because it can automatically select the complexity of the population structure (i.e., the dimensionality of covariates), and provides more efficient computation and better interpretations of global ancestry (Raj ). In addition, log transformation is a common statistical practice to deal with composition variables (Aitchison 1982) and has been used to identify covariates in association studies in order to account for population structure (Mariette ). We also carried out a similar association analysis accounting for population structure via principle components (PCs). Specifically, the top two PCs were extracted (–pca in PLINK v1.9) and included in the logistic regression model. Statistical evidence of association between disease status and each marker was assessed through the Benjamini–Hochberg-adjusted p-values (Benjamini and Hochberg 1995). Subsequently, since the distribution of association p-values did not fit the expected uniform distribution in the tailed portion of the histogram (Supplemental Material, Figure S1A), we implemented an alternative strategy to simplify the underlying complexity of the statistical analysis; that is, we extracted representative SNPs based on different LD thresholds (r2 = 0.01–0.001). This strategy aims to facilitate the interpretation of associations across regions.

Harnessing Gmm genomic data

Gmm genomic data:

To functionally characterize the gene regions in Gff likely to be in association with Trypanosoma infection, we took advantage of the genomic and functional studies in a closely related species, Gmm, which is amenable to laboratory work (Petersen ). We combined old and new transcriptome (step 1) and RNAseq (step 2) data to improve the published Gmm transcriptome assembly (step 3), then we generated a list of expressed Gmm transcripts expressed in various Gmm tissues (step 4), and ultimately identified those DE genes in relation to Gmm Trypanosoma infection (step 5); see Figure S2.

Step 1: Gmm transcriptome data:

Methods from previously published transcriptomes can be found in the corresponding references listed in Table S1. Unpublished transcriptomes followed the same overall protocol. Briefly, Gmm were maintained at the Yale School of Public Health insectary and infected with T. brucei rhodensiense (Ytat 1.1), as described previously (Aksoy ). Fourteen days postchallenge, tissues were microscopically scored for the presence or absence of trypanosome infections, and tissues of interest were flash-frozen in liquid nitrogen for RNA extraction (see Table S1 for tissues processed). Total RNA was extracted using TRIzol (Invitrogen) and subjected to DNase treatment (TURBO DNA-free kit AM1907; Ambion). RNA quantity and quality were determined using an Agilent Bioanalyzer 2100 RNA Nano chip. cDNA libraries were prepared using a NEBNext Ultra Directional RNA Library Prep Kit (E7420S; New England Biolabs), according to the manufacturer’s protocol. The libraries were constructed and barcoded for Illumina HiSeq sequencing (unpaired 75 bases) at the Yale Center for Genome Analysis (New Haven, CT).

Step 2: Gmm RNAseq data:

The RNAseq data analysis pipeline is shown in Figure S2. The raw Illumina RNAseq reads from 72 samples (Table S1) were first assessed for quality by FastQC (Andrews 2010). The FASTX-toolkit () was then used to trim off Illumina adapter sequences and low-quality bases. Due to the prevalence of tsetse symbiont sequences in the reads, a specific quality control step was included to reduce bacterial sequence reads, using reference genome data sets from the three symbionts that are most commonly found in tsetse flies (Aksoy ): Wigglesworthia (Rio ), Wolbachia (Brelsfoard ), and Sodalis (Toh ). Parasite sequences were removed from parasitized samples by mapping the Illumina reads to the T. brucei genome strain 927 and T. congolense () prior to analysis. Reads passing quality control, and symbionts and parasites separation, were aligned to a previously generated contig library (Telleria ) by TopHat2 (Kim ) and Bowtie (Langmead ) alignment engines. Reads that mapped to multiple locations were discarded from further analysis. The percentage of reads aligned to bacteria, parasites, and the host genome is shown in Figure S3. The raw Illumina RNAseq reads can be found under the accession numbers listed in Table S1.

Step 3: updating Gmm transcriptome assembly:

We constructed a more comprehensive Gmm transcriptome assembly than the one available through VectorBase [VectorBase (Giraldo-Calderón ); GmorY1.4] by integrating the previous assembly with the 72 RNAseq samples described in Table S1. The reference annotation from VectorBase (Giraldo-Calderón ) (GmorY1.4) was used to guide transcript assembly by Cufflinks v2.2.1 (Roberts ; Trapnell ) and obtain the relative expression of a transcript (FPKM; fragments per kilobase of exon per million fragments mapped) for each annotated gene. The resulting 72 individual Cufflinks assemblies were then merged into a single unified assembly (Gmm Tx) using Cuffmerge, which can maximize assembly quality by removing transcripts that are artifacts and merging novel isoforms with known isoforms across all Cufflinks assemblies. This unified assembly was then integrated with the previous assembly (GmorY1.4), using Cufflinks to compare and determine a unique set of isoforms for each transcript locus.

Step 4: Gmm DE analysis:

Two strategies, a count-based method and Cufflinks pipeline, were used to identify DE genes between Trypanosoma-infected and noninfected Gmm samples. In the count-based strategy, raw gene counts were first counted by HTSeq package (Anders ) using the unified assembly generated as above. Batch effects and other unwanted variations were removed using the RUVseq package (Risso ). Batch effect-corrected counts were then normalized using the TMM method (Robinson and Oshlack 2010) in the edgeR package (Robinson and Smyth 2007, 2008; Robinson , 2012; Anders ), which accounts for RNA composition bias.

Step 5: DE genes between Gmm Trypanosoma-infected and noninfected flies:

A negative binomial generalized linear model implemented in edgeR was used to determine DE genes between Trypanosoma-infected and noninfected samples. In the Cufflinks strategy, the abundance of transcripts was quantified by FPKM and compared between Trypanosoma-infected and noninfected samples using Cuffdiff with fragment bias correction (Roberts ). This method corrects for sequence-specific bias and conducts a multihit read correction by dividing the value of a multimapped read between each map location based on a probabilistic model. In both strategies, genes with a fold-change > 2 and an FDR-controlled p-value < 0.05 were considered DE. For comparisons without biological replicates, only fold-change was used as a criterion.

Combining genomic data from Gff and Gmm

To gain insights into the functional role of Gff gene regions associated with Trypanosoma infection, we used the Gmm database, which included DE genes between infected and noninfected flies to narrow down the list of genes identified by the association analyses in Gff (see above) to only those genes displaying differential expression between infected and noninfected Gmm. First, we mapped the new Gmm transcriptome assembly to the Gff genome, producing a hybrid Gff annotation (step A). We then identified regions near the representative SNP of a region found to be associated with Trypanosoma infection in Gff, and DE in infected and noninfected Gmm flies (step B):

Step A: cross-species transcript mapping:

We used Blat (v35) (Kent 2002) to map the newly generated Gmm transcriptome assembly (Gmm Tx) to the Gff genome (GfusI1), with options set to exclude overoccurring 11mers from initial match seeds (–ooc 11). Blat hits (the Gmm transcripts that matched to some sequences of the Gff genome) were filtered and retained for further analysis if the hit was ≥ 100 bp long and had match coverage of ≥ 90%. Note that the purpose of using Gmm Tx was to supplement the current Gff annotation with genomic regions that may be missing, not to “fix” existing Gff annotations. For this reason, only Gmm Tx that did not overlap existing annotated features in Gff were added to the original Gff annotations to generate the final set of annotations used in our downstream analysis.

Step B: determination of DE genes in Gff linked to the SNPs in association with trypanosome infection:

After identifying Gff regions that shared > 90% sequence identity with annotated Gmm genes and were longer than ∼120 bp, we added any novel Gff-annotated region to a hybrid gene annotation for Gff. We then looked in this hybrid annotation for genome features within 2500 bp in either direction of each representative SNP associated to trypanosome infection status. This distance was determined based on previous LD findings (Gloria-Soria ). We then cross-checked these gene regions in Gff with the Gmm data (above) to identify those DE in Gmm. The infection-associated DE genes in Gmm were used to filter/annotate the Gff hybrid annotation for probable trypanosome infection-related Gmm homologs. This step helped to focus our efforts on candidates with a confirmed biological function (“real” genes as opposed to possible annotation artifacts) based on their DE expression in Gmm.

Functional characterization of candidate Gff genes

Drosophila and tsetse flies are both higher Dipterans that share several biological pathways. Previous studies investigating tsetse gene functions at the molecular level have shown a high conservation between tsetse flies and Drosophila, especially with regard to their interaction with microbes (Attardo ; Benoit ). Thus, to gain insight into the functions of our candidate genes, we looked for their homologs in these species. Transcripts from the Gff genes associated with trypanosome infection status were ascertained using the annotated VectorBase () Gff transcript database Gfus1.4. These resulting transcripts were then compared to the VectorBase Gmm transcript database GmorY1.5 using the VectorBase BLASTx tool to determine their corresponding homologous transcripts in Gmm. Similarly, the same subset of Gff transcripts were blasted (BLASTx) to FlyBase () to identify their corresponding D. melanogaster homologs.

Data availability

Original ddRAD data are available from NCBI bioproject 303153. Table 1 provides a list of the Gff genes associated with trypanosome infection that have homologs differentially expressed among Gmm-infected and noninfected flies. Table S1 provides information on the 72 Gmm transcriptomes used in this study and their corresponding SRA accession numbers. Table S2 and Table S3 provide the SNPs associated with trypanosome infection status in Gff using two alternative methods to control for population structure. Table S4 provides a comparison between the new and previous Gmm transcriptome assembly. Table S5 contains the genes DE between infected and uninfected Gmm tissues. Table S6 describes the multiple genome feature IDs used throughout the paper and association across the ID types. Table S7 lists the 56 candidate genes identified in the vicinity of representative SNPs associated with trypanosome infection status. Table S8 compares the DE genes in Gmm across tissues. Table S9 provides the results from comparing the candidates SNPs from Gloria-Soria with those found in this study. Figure 1 shows the DE genes shared across tissues for Gmm and the number of genes expressed by tissue. Figure S1 shows the Manhattan plot before and after clumping the association results based on LD. Figure S2 shows the analysis pipeline of the Gmm transcriptome. Figure S3 provides the percentage of transcriptome reads from Gmm aligned to the host, bacteria, and parasite genomes. Figure S4 shows the Manhattan plots and histograms of p-values from association for each population of G. fuscipes clumped at an LD threshold of 1%.
Table 1

Gff genes associated with trypanosome infection that have differentially expressed homologs among Gmm infected and noninfected flies

Functional CategoryGene IdentifieraG. morsitans morsitans HomologbBLAST Result Against Drosophila DatabasecDrosophila Annotation SymboldPotential FunctioneReferencesf
NeurophysiologyGFUI053401GMOY003473Protein Phosphatase 99A (Ptp99A)CG11516Neural receptor, potential target of miR-8, in Drosophila presents a SNP associated with microbiota-dependent nutriments traitDesai et al. (1997), Enright et al. (2003), Dobson et al. (2015)
GFUI021833GMOY011124Protein tincarCG31247Expressed during Drosophila embryogenesis in central/peripheral nervous systems and midgutHirota et al. (2005)
GFUI017734GMOY004860MalvolioCG3671Required for normal taste behavior in Drosophila, expressed in the central nervous system, macrophages, and gut of Drosophila, Expression of Malvolio correlates with feeding habits preferences in CalliphoridaeRodrigues et al. (1995), Folwell et al. (2006), Cardoso et al. (2016)
GFUI041241GMOY006879Insulin-like peptideCG14059Feeding regulation, growth and development in DrosophilaPool and Scott (2014), Garelli et al. (2012)
NovelGMOY009338Drosophila IGF-II mRNA-binding protein (dimp)CG9850Involved in nervous system development, promotes neuronal remodelingAdolph et al. (2009), Medioni et al. (2014)
GFUI019498GMOY008359Innexin 7 (inx7)CG2977Gap junctions involved in nervous systemLiu et al. (2016)
NovelGMOY005501Synaptic vesicle glycoprotein 2B-likeCG31106Homologous gene in cat flea, expression in parallel to feeding process, modulation of synaptic exocytosis in vertebratesWalmsley and Gaines (2004), Morgans et al. (2009)
Cell proliferation and differentiationGFUI009421GMOY009422Winged eyeCG31151Regulation of histone gene expressionOzawa et al. (2016)
GFUI043027GMOY005878DNA polymerase ɛ subunit 2 (pole2)CG10489Involved during the progression of mitosis (S phase)Sahashi et al. (2013)
GFUI023751GMOY005801Cytoskeleton-associated protein 5 (msps-like)CG5000Essential for structural integrity of mitotic spindleCullen et al. (1999)
GFUI043720GMOY007600Cell division cycle 14CG7134Controls late cycle of mitosisStegmeier and Amon (2004)
GFUI002252GMOY004487MoiraCG18740Chromatin-remodeling factor, associated with cell cycle and cell proliferationCrosby et al. (1999), Nakamura et al. (2008)
GFUI030827GMOY002286Wee1-like kinaseCG4488Regulator of Drosophila embryo mitosisStumpff et al. (2005)
GFUI011175GMOY009522Adenomatous polyposis coli (APC)CG1451Intestinal stem cell proliferation, regulation of Wnt pathwayLee et al. (2009), Kunttas-Tatli et al. (2015)
Immune responseGFUI007441GMOY004477DorsalCG6667Main transcription factor of the insect immune response Toll signaling pathwayValanne et al. (2011)
NovelGMOY007569Toll-likeCG7250Main receptor of the innate immunity Toll signaling pathwayLemaitre et al. (1996)
GFUI020794GMOY007570Peroxiredoxin 6005CG3083Antioxidant, probably cytosolic in Glossina morsitansMunks et al. (2005)
MiscellaneousGFUI020790GMOY007571acyl-CoA-binding proteinCG8814
GFUI051070GMOY001698SET and MYND domain-containing, arthropod-specific, member 4 (Smyd4)CG11160Muscle development in DrosophilaThompson and Travers (2008)
GFUI043025GMOY005879FocadhesinCG3520
GFUI028910GMOY008384Serine hydrolaseCG5707
GFUI020792GMOY007568Congested-like trachea (colt)CG3057
NovelGMOY005234PFTAIRE-interacting factor 1ACG42599

G.fuscipes gene ID in Vector Base. "Novel" refers to transcripts that have been characterized in this study.

G. morsitans (Gmm) closest homolog gene ID in Vector Base

The closest Drosophila homologs determined using BLASTx and displayed by their name and

annotation symbol.

Potential functions are listed along with the fcorresponding references.

Figure 1

Gene expression across G. morsitans morsitans (Gmm) tissues. (A) Upregulated and (B) downregulated genes in infected Gmm tissues. (C) Genes expressed across tissues. The count of genes identified in any more than one tissue is plotted [differentially expressed genes (DE), as identified by both cuffdiff and edgeR]. Dark circles in the bottom matrix indicate tissues that share the same set of genes. The number of genes in each tissue are shown as blue bars at the bottom left. A total of 19,110 unique genes were expressed in at least one tissue, with 8786 (46%) expressed across all six tissues.

G.fuscipes gene ID in Vector Base. "Novel" refers to transcripts that have been characterized in this study. G. morsitans (Gmm) closest homolog gene ID in Vector Base The closest Drosophila homologs determined using BLASTx and displayed by their name and annotation symbol. Potential functions are listed along with the fcorresponding references. Gene expression across G. morsitans morsitans (Gmm) tissues. (A) Upregulated and (B) downregulated genes in infected Gmm tissues. (C) Genes expressed across tissues. The count of genes identified in any more than one tissue is plotted [differentially expressed genes (DE), as identified by both cuffdiff and edgeR]. Dark circles in the bottom matrix indicate tissues that share the same set of genes. The number of genes in each tissue are shown as blue bars at the bottom left. A total of 19,110 unique genes were expressed in at least one tissue, with 8786 (46%) expressed across all six tissues.

Results

Associations between genetic variation and trypanosome infection status in Gff

We used the 73,297 SNPs previously developed for Gff (Gloria-Soria ) to identify markers associated with trypanosome infection status (see Materials and Methods). Association results were grouped based on LD clumping to facilitate interpretation of the association signal. We explored several LD thresholds (r2 = 0.01–0.001) and selected the data set with a null distribution of p-values closest to a uniform distribution (Figure S1). An LD threshold of r2 < 0.01 was chosen and yielded a set of 63 SNP groups centered around a representative SNP or “index variant” (PLINK; Chang ) shown in Table S2. The resulting Manhattan plot and the corresponding histogram of p-values are displayed in Figure S1C. As shown across Figure S1, accounting for LD facilitated detection of the association signal. From the 63 representative SNPs, 18 displayed statistically significant associations with trypanosome infection status, after correction for multiple testing (FDR-controlled p-value < 0.05; Table S2A) and accounting for population structure using FastStructure (Raj ). As LD levels can vary among populations, similar analyses were carried out separately for each of the three Gff populations—OT, NB, and MS—with the goal of identifying relevant population-specific markers. For each population, LD clumping yielded ∼60 representative SNPs (threshold of r2 > 0.01; Table S2, B–D) and we identified 19, 19, and 16 SNPs in OT, NB, and MS, respectively, which were significantly associated with the phenotype (Table S2, B–D). The corresponding Manhattan plots and histograms of p-values for each population are shown in Figure S4. Table S3 shows the results of the association analysis using PCs as covariates, as an alternative approach to FastStructure (Raj ) to control for population structure. Results were similar to those using FastStructure (above), but yielded 64 representative SNPs during the clumping procedure (previously 63). Additionally, one of the previously identified representative SNPs showed significant association values using the PC method (GFvariants_VB2014e_tvcf:KK351818.1:1387926, pval = 0.0046 after FDR).

Enhancing the Gff transcriptome assembly

In our effort to identify Gff genes relevant to the trypanosome infection status, we took advantage of the already existing Gmm transcriptome assembly [GmorY1.4 (Giraldo-Calderón )] and improved it with data from 72 additional RNAseq samples (Table S1). The new assembly (Gmm Tx) covers 41,174 genes with 85,849 transcripts, among which 12,053 are completely matched with previous assemblies and 38,632 are potentially novel isoforms. A comparison between the existing assembly and the one generated in the current study can be found in Table S4.

DE genes in Gmm

DE genes were identified between Trypanosoma-infected and uninfected Gmm samples in four tissues [midgut, salivary glands, cardia (proventriculus), and proboscis], using both count-based methods and the Cufflinks pipeline (Table S5). Both differential expression analysis strategies identified a substantial portion of genes (12–15% in the midgut, 40–50% in the salivary glands, 10–40% in the cardia, and 50% in the proboscis) that were only annotated in the new assembly (Table S5A). One-third of the DE genes were identified with both edgeR and Cuffdiff methods (Table S5B). Figure 1A-B show the count of genes identified as being DE across tissues.

Cross-species transcript mapping

The newly assembled Gmm transcriptome was then used to enhance Gff existing genomic features. Mapping the new Gmm transcriptome assembly (Gmm Tx) to the Gff genome resulted in 20,448 Gmm transcripts that matched to sequences in the Gff genome. Among them, 15,954 were already annotated in Gff. Another 4494 were mapped to the Gff genome and represent novel annotations. The vast majority of the filtered Gmm Tx (99.11%) mapped to a unique location on the Gff genome. Of the novel annotations, the maximum number of Gmm Tx mappings after filtering was 21, while the mean number of mappings for all Gmm Tx was 1.017 (median = 1). Details about the genome features used through this cross-species mapping can be found in Table S6.

Candidate Gff loci involved in susceptibility to trypanosome infection

We identified 56 Trypanosoma infection-associated Gff loci (Table S7) by screening for genes 2500 bp upstream and downstream of the representative SNPs displaying a significant association signal with the Trypanosoma infection phenotype (SNPs across all populations [N = 18] and SNPs for each individual population [N = 19, 19, and 16 for OT, NB, and MS, respectively] (see association section above). We also looked for genes within a 2500 bp window around the additional SNP identified across populations, using the PCA method to control for population structure, but found none. The 56 identified loci can be divided into two categories: “official” and “novel” Gff loci. The first category includes 42 loci that fall within official Gff (GFusI1.3) gene annotations and bear official gene symbols, prefixed with “GFUI.” The second category represent 14 Gff loci where no official annotation exists but where a transcript was inferred using cufflinks-deduced Gmm Tx transcripts longer than 100 bp (mapped with 100% identity over ≥ 90% of its length). These genes, added by virtue of the projection of homologous transcripts of Gmm onto the Gff genome, are referred to by a cufflinks-assigned ID and a “TCONS” prefix (see Table S6 for genome feature description). All official and novel loci reported here have two common characteristics; a SNP with a significant genetic association located within ≤ 2500 bp of its coordinates, and an ortholog (for official loci) or corresponding TCONS Tx (for novel loci) in Gmm (Table S7). Thus, 42 official Gff genes and 14 novel Gff genes (a total of 56 Trypanosoma-associated Gff loci) were identified by this selection criterion and are listed in Table S7. One of these novel genes (XLOC_035548) was not annotated in either of the two genomes. Twenty-nine of all 56 genes identified were DE between Trypanosoma-infected and uninfected Gmm flies (Table 1). To evaluate whether DE of 29 out of 56 loci in Gmm was significant (not random), we determined how common DE was across tissues in Gmm. We identified the total number of unique DE genes across the Gmm tissues tested and evaluated whether the disagreement in DE was due to a difference in the genes expressed across tissues. Figure 1, A and B show the DE genes shared by tissue, identified by methods cuffdiff, edgeR (Table S5A), and both (identified by both cuffdiff and edgeR, Table S5B), respectively. Between 2000 and 7000 unique DE genes were identified by these methods, with no DE genes shared by all six tissues and only one-tenth shared by at least two tissues (Figure 1, A and B and Table S8). This disagreement in DE could arise from tissues associated with infection differentially or simply because different genes are expressed in different tissues. To address this issue, we compared all expressed genes (genes with at least two reads aligned in at least half of the samples) across tissues. The number of genes expressed by tissue is shown in Figure 1C. A total of 19,110 unique genes were expressed in at least one tissue, with 8786 (46%) expressed across all six tissues. These two findings together indicate that a large difference exists across tissues in terms of the genes associated with infection status, and suggests that finding 29 out of 56 DE loci in Gff is not likely due to DE being common across tissues.

Functional characterization of candidate Gff genes found in association with Trypanosoma infection status

We functionally characterized the 29 candidate genes linked to the phenotype of interest (Trypanosoma-infected vs. noninfected flies) to which a DE Gmm Tx had been mapped. We used this criterion to reduce the possibility of annotation artifacts among our candidates, thus ensuring that only real genes were included. Furthermore, this specific data set of genes would more likely underlie differences in vector competency across both species, given that they are closely related. We searched for D. melanogaster homologs of these 29 Gff genes by BLASTx analysis of the ascertained Gff transcript sequences against the fruit fly genome database [FlyBase (Gramates )]. These 29 genes were assigned to 23 functional homologs in Drosophila, with five of the novel Gff genes being paralogs of Gff official genes (Table 1), and no functional homolog found for the XLOC_035548 gene. The XLOC_035548 (TCONS_00071717) sequence maps to Gmm scaffold “scf7180000652014” from position 87,226 to 87,796; the nearest gene is ∼20 kb upstream: CSP2 (GMOY010026). Further searching the PFAM-A database of protein domain models (Finn ) yielded no known functional domains within the sequence. Table 1 summarizes these results, showing that the products of seven of the Gff genes identified are associated with insect neurophysiology, including putative roles in feeding behavior; seven gene products are associated with DNA regulation, involving mitosis and cell proliferation processes; three gene products are involved in fly immune responses; and six gene products do not fall within any discrete category.

Comparison with previously identified SNPs associated with trypanosome susceptibility

The significant SNPs identified in the MS, NB, and OT populations, and the SNPs identified from the combined population set, were compared to the 360 SNPs previously reported as potentially associated with infection status using Fst outlier methods implemented in BayeScan (Gloria-Soria ). A total of 55 SNP pairs were identified between the two SNP lists, based on the scaffolds they were annotated to. These SNP–SNP pairs and the distance between each pair are reported in Table S9. SNP 709851, located in scaffold 27 and identified in the MS population, was the only SNP identified in both studies as a candidate SNP. The other SNP pairs were located > 34 kb away from each other.

Discussion

Performing genomic association analysis of disease susceptibility in nonmodel organisms is challenging due to the inability to perform controlled laboratory experiments, small sample sizes, and the lack of well-annotated genomes, or the absence of a reference genome to guide the efforts. Since the annotated Gff genome (Gfus1.3) was incomplete, we took advantage of additional genomic resources available for Gmm, a tsetse species for which there is extensive laboratory data (Petersen ). Using transcriptome data from Gmm, we complemented missing annotations in the Gff genome, which allowed us to go beyond SNP associations to functionally identify candidate genes. Furthermore, we narrowed down our gene candidate list by identifying homologs that were DE among infected or uninfected Gmm flies that had been experimentally challenged with Trypanosoma (Table 1 and Table S5).

Regions associated with trypanosome infection in Glossina spp

Tsetse flies of the genus Glossina are major vectors of human and nonhuman animal diseases caused by infection with African trypanosomes. Laboratory studies have identified genetic factors for resistance to trypanosomes in Gmm (Hamidou Soumana ; Geiger ), but no information is available on HAT vector species such as Gff (e.g., host–parasite strain combinations) and in general for any species from natural populations. Gff colonies are difficult to maintain in the laboratory, making experimental manipulations in this species extremely challenging. Here, we tested the hypothesis that natural variation for resistance or sensitivity to Trypanosoma infection in wild populations of tsetse flies has a genetic basis and could be used to identify the genetic components underlying this epidemiologically important trait. Preliminary analysis of the Gff SNP data set used for this study failed to identify candidate genes involved in susceptibility to trypanosome infection, based on allele frequency differences between treatment groups (Gloria-Soria ). Such Fst outlier methods are powerful at detecting markers subjected to strong selection pressures, but have low power to detect markers under weak or balancing selection (Narum and Hess 2011; Jensen ). Furthermore, the Fst outlier approach used in the pilot study, as implemented by BayeScan (Fischer ), assumes that the contrasting populations are evolutionary independent from each other and that gene frequencies under neutrality approximate a multinomial Dirichlet distribution (Beaumont 2005). This method is generally appropriate to identify loci involved in environmental adaptation, and Gloria-Soria successfully identified several candidate loci under this framework. However, the assumption of evolutionary independence is violated in case-control studies, where samples are drawn from the same population (Lotterhos and Whitlock 2014), such as when trypanosome-infected and uninfected flies from the same population are compared. More appropriate for this type of analysis is the haplotype-based analysis also used by Gloria-Soria , hapFLK (Fariello , 2014), because it corrects for group coancestry. Unfortunately, this approach lacked the resolution to identify outliers in the Gff data (Gloria-Soria ). Here, we applied genetic association methods, commonly used in human genetics and model organisms, to search for genetic determinants of trypanosome susceptibility in Gff. Association methods assume that common variants have modest effects on disease frequency and can explain the variability of a common disease [CD/CV hypothesis (Reich and Lander 2001)]. Their success in identifying a disease susceptibility locus relies on detecting increases in disease allele frequency in the affected individuals (Reich and Lander 2001). We applied a logistic regression of the phenotype on allele dosage that allows for multiple covariates, thus can control for sample stratification and coancestry by incorporating population structure into the analysis (Hill ). LD-based clumping of the SNPs provided a parsimonious representative genome that greatly reduced the complexity of model fitting. Furthermore, this process allowed us to increase the strength of the association signal, which was otherwise hindered by background noise (Figure S1). We then performed a gene search that extended ≤ 2500 bp away from the selected representative SNPs. This was a conservative search space based on the average LD reported for these populations in Gloria-Soria . Longer LD regions are likely to occur around genes under strong selection and some genes may have been missed in this quest. This approach yielded 56 Gff candidate loci associated with trypanosome infection (Table S7), providing a good starting point to begin the functional validation. The search can be extended, as needed, for future analyses.

Functional characterization of candidate genes

This study suggests that two main functional categories, cell proliferation and neurophysiology, may be important determinants of trypanosome infection outcomes in tsetse flies. Fifty-six genes were identified near SNPs statistically associated with trypanosome-parasitized individuals in Gff, and 29 of these genes were DE between Trypanosoma-infected and uninfected Gmm flies. Out of the 23 Gff genes with functional homologs in Drosophila, seven are involved in cell proliferation processes and DNA regulation (particularly in mitosis), and seven were associated with neurophysiology. These cell proliferation genes include wge, pole2, msps, cdc14, moira, apc, and wee-1. Midgut cell regeneration is an important process in the host defense mechanism against pathogens, and involves the delamination of damaged cells and their replacement with new ones derived from proliferating stem cells (Baton and Ranford-Cartwright 2007; Buchon ,b; Jiang ). Differential expression of cell proliferation-associated genes has been observed between trypanosome-infected midgut and salivary gland epithelial tissues of tsetse flies (Aksoy ; Telleria ; Matetovici ). The presence of the parasite within epithelial tissues likely disrupts cellular homeostasis, triggering the insect cell regeneration response. Since tissue integrity is essential for fighting invasive microbes (Buchon ), any mutation impairing cellular regeneration would have major consequences on the ability of tsetse flies to resist infectious trypanosomes. Among the seven genes assigned to the neurophysiology category, four gene products, homologous to Drosophila ptp99A, malvolio, SV2B, and an insulin-like peptide (Table 1), are DE in tsetse flies in response to infection. Little is known about the function of these genes in tsetse flies, but Drosophila Ptp99A is associated with microbiota-dependent nutriment functions (Dobson ), and insulin-like protein belongs to a family of genes involved in feeding regulation (Pool and Scott 2014). Interestingly, malvolio, which is expressed in Drosophila’s central nervous system, macrophages, and midgut (Folwell ), is required for normal taste behavior (Rodrigues ). Any mechanism that may lead to an increase in blood-feeding frequency would concurrently increase the exposure of tsetse flies to parasites. Finally, among the three genes involved in the tsetse immune response are the Drosophila homologs of the toll and dorsal genes, which encode the receptor and transcription factors, respectively, of the Toll pathway. This immune pathway is a major component of the insect innate immune system (Valanne ). In tsetse, the Toll pathway is induced upon parasite infection (Lehane ) and activates the production of antimicrobial peptides that kill invasive microbes, including African trypanosomes (Boulanger ). The identification of SNPs associated with trypanosome infection in two major genes of this pathway supports the important role of the insect immune response in the control of the parasite upon its entrance into the tsetse fly’s gut. These findings are consistent with physiological outcomes associated with parasite infections in many insects, including tsetse flies, and encompass modifications in behavior and locomotion (Hurd 2003; van Houte ; Caljon ; Van Den Abbeele ).

Conclusions

The methodological approach used here allowed us to identify 56 trypanosome infection-associated loci in Gff, 29 of which were DE between Trypanosoma-infected and uninfected flies in the closely related species Gmm. The gene products identified by this screen are putatively involved in two main functional categories, cell proliferation and neurophysiology, suggesting that these processes may be determinants of the outcome of trypanosome infection in tsetse flies. Identification of the genetic determinants of trypanosome infection in Gff, and in tsetse flies in general, can provide targets for pharmaceutical and genetic interventions aimed at aiding the fight against the disease in endemic regions. Once disease-specific alleles have been identified, disease susceptibility could be actively monitored in the different tsetse populations by determining their frequency. This information can guide control efforts focused on disease elimination by prioritizing areas of higher risk. Future work will involve the screening of a large number of populations to both increase the sample size and cover different genetic backgrounds to validate this finding, followed by functional work on the top candidate genes identified by this work.

Supplementary Material

Supplemental material is available online at www.g3journal.org/lookup/suppl/doi:10.1534/g3.117.300493/-/DC1. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file.
  95 in total

1.  Massive genome erosion and functional adaptations provide insights into the symbiotic lifestyle of Sodalis glossinidius in the tsetse host.

Authors:  Hidehiro Toh; Brian L Weiss; Sarah A H Perkin; Atsushi Yamashita; Kenshiro Oshima; Masahira Hattori; Serap Aksoy
Journal:  Genome Res       Date:  2005-12-19       Impact factor: 9.043

Review 2.  The Drosophila Toll signaling pathway.

Authors:  Susanna Valanne; Jing-Huan Wang; Mika Rämet
Journal:  J Immunol       Date:  2011-01-15       Impact factor: 5.422

3.  Functional analysis of Drosophila DNA polymerase ε p58 subunit.

Authors:  Ritsuko Sahashi; Risa Matsuda; Osamu Suyari; Mieko Kawai; Hideki Yoshida; Sue Cotterill; Masamitsu Yamaguchi
Journal:  Am J Cancer Res       Date:  2013-11-01       Impact factor: 6.166

4.  The Drosophila melanogaster Genetic Reference Panel.

Authors:  Trudy F C Mackay; Stephen Richards; Eric A Stone; Antonio Barbadilla; Julien F Ayroles; Dianhui Zhu; Sònia Casillas; Yi Han; Michael M Magwire; Julie M Cridland; Mark F Richardson; Robert R H Anholt; Maite Barrón; Crystal Bess; Kerstin Petra Blankenburg; Mary Anna Carbone; David Castellano; Lesley Chaboub; Laura Duncan; Zeke Harris; Mehwish Javaid; Joy Christina Jayaseelan; Shalini N Jhangiani; Katherine W Jordan; Fremiet Lara; Faye Lawrence; Sandra L Lee; Pablo Librado; Raquel S Linheiro; Richard F Lyman; Aaron J Mackey; Mala Munidasa; Donna Marie Muzny; Lynne Nazareth; Irene Newsham; Lora Perales; Ling-Ling Pu; Carson Qu; Miquel Ràmia; Jeffrey G Reid; Stephanie M Rollmann; Julio Rozas; Nehad Saada; Lavanya Turlapati; Kim C Worley; Yuan-Qing Wu; Akihiko Yamamoto; Yiming Zhu; Casey M Bergman; Kevin R Thornton; David Mittelman; Richard A Gibbs
Journal:  Nature       Date:  2012-02-08       Impact factor: 49.962

5.  HTSeq--a Python framework to work with high-throughput sequencing data.

Authors:  Simon Anders; Paul Theodor Pyl; Wolfgang Huber
Journal:  Bioinformatics       Date:  2014-09-25       Impact factor: 6.937

6.  Differential expression of midgut proteins in Trypanosoma brucei gambiense-stimulated vs. non-stimulated Glossina palpalis gambiensis flies.

Authors:  Anne Geiger; Illiassou Hamidou Soumana; Bernadette Tchicaya; Valérie Rofidal; Mathilde Decourcelle; Véronique Santoni; Sonia Hem
Journal:  Front Microbiol       Date:  2015-05-12       Impact factor: 5.640

7.  Stepwise Distributed Open Innovation Contests for Software Development: Acceleration of Genome-Wide Association Analysis.

Authors:  Andrew Hill; Po-Ru Loh; Ragu B Bharadwaj; Pascal Pons; Jingbo Shang; Eva Guinan; Karim Lakhani; Iain Kilty; Scott A Jelinsky
Journal:  Gigascience       Date:  2017-05-01       Impact factor: 6.524

8.  Transcriptional regulation of the Drosophila moira and osa genes by the DREF pathway.

Authors:  Kumi Nakamura; Hiroyuki Ida; Masamitsu Yamaguchi
Journal:  Nucleic Acids Res       Date:  2008-05-29       Impact factor: 16.971

9.  Genome scans for detecting footprints of local adaptation using a Bayesian factor model.

Authors:  Nicolas Duforet-Frebourg; Eric Bazin; Michael G B Blum
Journal:  Mol Biol Evol       Date:  2014-06-03       Impact factor: 16.240

10.  A population genetic signal of polygenic adaptation.

Authors:  Jeremy J Berg; Graham Coop
Journal:  PLoS Genet       Date:  2014-08-07       Impact factor: 5.917

View more
  2 in total

1.  The molecular and cellular basis of olfactory response to tsetse fly attractants.

Authors:  J Sebastian Chahda; Neeraj Soni; Jennifer S Sun; Shimaa A M Ebrahim; Brian L Weiss; John R Carlson
Journal:  PLoS Genet       Date:  2019-03-15       Impact factor: 5.917

2.  Comparative genomic analysis of six Glossina genomes, vectors of African trypanosomes.

Authors:  Geoffrey M Attardo; Adly M M Abd-Alla; Alvaro Acosta-Serrano; James E Allen; Rosemary Bateta; Joshua B Benoit; Kostas Bourtzis; Jelle Caers; Guy Caljon; Mikkel B Christensen; David W Farrow; Markus Friedrich; Aurélie Hua-Van; Emily C Jennings; Denis M Larkin; Daniel Lawson; Michael J Lehane; Vasileios P Lenis; Ernesto Lowy-Gallego; Rosaline W Macharia; Anna R Malacrida; Heather G Marco; Daniel Masiga; Gareth L Maslen; Irina Matetovici; Richard P Meisel; Irene Meki; Veronika Michalkova; Wolfgang J Miller; Patrick Minx; Paul O Mireji; Lino Ometto; Andrew G Parker; Rita Rio; Clair Rose; Andrew J Rosendale; Omar Rota-Stabelli; Grazia Savini; Liliane Schoofs; Francesca Scolari; Martin T Swain; Peter Takáč; Chad Tomlinson; George Tsiamis; Jan Van Den Abbeele; Aurelien Vigneron; Jingwen Wang; Wesley C Warren; Robert M Waterhouse; Matthew T Weirauch; Brian L Weiss; Richard K Wilson; Xin Zhao; Serap Aksoy
Journal:  Genome Biol       Date:  2019-09-02       Impact factor: 13.583

  2 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.