Joel L N Barratt1,2, Subin Park1,2, Fernanda S Nascimento1, Jessica Hofstetter1,2, Mateusz Plucinski3, Shannon Casillas1, Richard S Bradbury1, Michael J Arrowood4, Yvonne Qvarnstrom1, Eldin Talundzic3.
Abstract
Sexually reproducing pathogens such as Cyclospora cayetanensis often produce genetically heterogeneous infections in which the number of unique sequence types detected at any given locus varies depending on which locus is sequenced. The genotypes assigned to these infections quickly become complex when additional loci are analysed. This genetic heterogeneity confounds the utility of traditional sequence-typing and phylogenetic approaches for aiding epidemiological trace-back, and requires new methods to address this complexity. Here, we describe an ensemble of two similarity-based classification algorithms, comprising a Bayesian and a heuristic component, that infers the relatedness of C. cayetanensis infections. The ensemble requires a set of haplotypes as input and assigns arbitrary distances to specimen pairs reflecting their most likely relationships. The approach was applied to data generated from a test cohort of 88 human fecal specimens containing C. cayetanensis, including 30 from patients whose infections were associated with epidemiologically defined outbreak clusters of cyclosporiasis. The ensemble assigned specimens to plausible clusters of genetically related infections despite their complex haplotype composition. These relationships were corroborated by a significant number of epidemiological linkages (P < 0.0001), suggesting the ensemble's utility for aiding epidemiological trace-back investigations of cyclosporiasis.
Keywords: Algorithm; Bayesian; Cyclospora cayetanensis; bioinformatics; ensemble; epidemiology; heuristic
Year: 2019 PMID: 31148531 PMCID: PMC6699905 DOI: 10.1017/S0031182019000581
Source DB: PubMed Journal: Parasitology ISSN: 0031-1820 Impact factor: 3.234
Primers designed for PCR enrichment of the selected typing markers
| Genome | Locus (alias) | SNPs^a | Primer name | Primer sequence (5′–3′) | Temperature | Amplicon size (sequence length^b) | Amplification success (%) |
|---|---|---|---|---|---|---|---|
| Mt | Mt rRNA (MSR) | 4 | 15F | GGACATGCAGTAACCTTTCCG | 55 °C | 686 (573) bp | 81/88 (92.0) |
| | | | 688R | AGGAAAGGTTAACCGCTGTCA | | | |
| Nu | Nu, undefined (360i2) | 20 | HC360i2F | CCCATTACGCCGCATAGAGT | 67 °C | 650 (541) bp | 87/88 (98.9) |
| | | | HC360i2R | GCATTGCAAAGCCAGTCAGC | | | |
| Nu | Nu, Sec14 family protein (378) | 15 | HC378F | CCCCTGCCTTGTTCTTGGTGAA | 71 °C | 469 (364) bp | 80/88 (90.9) |
| | | | HC378R | CCGGCGACACAGAGGTACC | | | |
^a The number of variable sites present after trimming of sequences to equal lengths.
^b After trimming sequences to equal lengths.
Fig. 1. Workflow for selection of Cyclospora cayetanensis typing markers. Raw genome sequence data generated on the Illumina MiSeq platform were assessed for quality using FastQC. AdapterRemoval v2.1.7 (Schubert et al., 2016) was used to remove adapter sequences from reads and to merge overlapping paired reads into consensus sequences. SPAdes v3.9.0 (Bankevich et al., 2012) was used to assemble the reads de novo. During the assembly cleaning process, contigs derived from contaminating (Contam.) prokaryotic human gut flora were removed using BBMap (http://sourceforge.net/projects/bbmap/). The assemblies were assessed for quality using QUAST v4.3 (Gurevich et al., 2013) before and after the cleaning phase. Contigs with 60× coverage, a length of at least 3000 base pairs (bp), and coding regions identified using GeneMark-ES v4.33 (Borodovsky and Lomsadze, 2011) were retained as part of the core genome. Single nucleotide polymorphisms (SNPs) were detected across the core genome assemblies using kSNP v3.021 (Gardner et al., 2015), and this information was used to identify high-entropy genomic loci. Genomic regions containing high-confidence SNPs (i.e. those SNPs within genomic regions of the highest coverage) occurring within SNP-dense regions (i.e. where several informative SNPs exist within a genomic region of less than 1 kilobase pair in size) were identified as candidate typing markers for validation by PCR amplification and Sanger sequencing. The markers with the highest amplification and sequencing success rates were considered ideal candidates for C. cayetanensis typing, and were PCR amplified and sequenced from stool specimens provided by a diverse range of patients. The resulting sequences were then subjected to typing.
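The SNP-dense-region criterion in the workflow (several SNPs within less than 1 kilobase pair) can be illustrated with a sliding window over sorted SNP positions. This is only a minimal sketch, not the authors' pipeline: the function name `snp_dense_windows`, the threshold `min_snps=3` (the caption says only "several") and the toy positions are all assumptions made for illustration.

```python
from typing import List, Tuple


def snp_dense_windows(
    positions: List[int],
    max_span: int = 1000,   # window must span less than 1 kb (from the caption)
    min_snps: int = 3,      # ASSUMPTION: "several" SNPs taken to mean at least 3
) -> List[Tuple[int, int, int]]:
    """Return (first_snp, last_snp, n_snps) for each maximal SNP-dense window."""
    pos = sorted(set(positions))
    windows = []
    right = 0
    last_right = -1  # right edge of the most recently reported window
    for left in range(len(pos)):
        if right < left:
            right = left
        # Extend the window while the next SNP stays within max_span of the left SNP.
        while right + 1 < len(pos) and pos[right + 1] - pos[left] < max_span:
            right += 1
        count = right - left + 1
        # Report only windows that reach beyond the previously reported one,
        # so sub-windows of an already reported region are skipped.
        if count >= min_snps and right > last_right:
            windows.append((pos[left], pos[right], count))
            last_right = right
    return windows


# Hypothetical SNP coordinates on a single contig.
toy_positions = [100, 350, 900, 5000, 5200, 9000]
dense = snp_dense_windows(toy_positions)
```

In this toy input only the first three SNPs fall within one sub-kilobase window, so a single candidate region is reported.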
Fig. 2. Cluster dendrogram generated from the Ensemble Distance Matrix. Our ensemble of two similarity-based classification algorithms resolved the C. cayetanensis infections from 88 fecal specimens into sixteen clusters (different branch colours). Clusters were delineated by cutting the tree at the node indicating the separation of the Chinese sample (CHN_HEN01) from its nearest neighbour. The specimen names are shaded in colours according to their epidemiological linkage. Unshaded specimen names represent sporadic or unlinked cases of cyclosporiasis. Specimen identity codes begin with a two-letter state abbreviation (except for Jakarta, Indonesia; JK), followed by two numbers indicating the year, and ending with a unique identifier assigned to that specimen (2–3 digits). The specimen from China (CHN_HEN01) follows a different naming convention, as sequence data from this specimen had been submitted to GenBank previously by different investigators (GenBank accession: NW_019211453).
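The idea of delineating clusters by cutting the tree where a reference specimen separates from its nearest neighbour can be sketched as follows. The linkage method is not stated here, so this sketch assumes single linkage, under which a singleton joins the tree at its smallest distance to any other specimen; cutting just below that height is then equivalent to joining every pair whose distance is strictly smaller. The function name, specimen labels and distances are hypothetical and are not taken from the study.

```python
from typing import Dict, List, Tuple


def cut_at_reference(
    dist: Dict[Tuple[str, str], float],
    names: List[str],
    ref: str,
) -> List[List[str]]:
    """Cluster specimens by cutting a single-linkage tree just below the
    height at which `ref` joins its nearest neighbour (ASSUMED linkage)."""

    def d(a: str, b: str) -> float:
        return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

    # Height of the node separating the reference from its nearest neighbour.
    cut = min(d(ref, n) for n in names if n != ref)

    # Union-find: merge every pair closer than the cut height.
    parent = {n: n for n in names}

    def find(n: str) -> str:
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if d(a, b) < cut:
                parent[find(a)] = find(b)

    clusters: Dict[str, List[str]] = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in clusters.values())


# Toy distance matrix (hypothetical values); "REF" plays the role of the outlying specimen.
toy_names = ["A", "B", "C", "D", "REF"]
toy_dist = {
    ("A", "B"): 0.10, ("A", "C"): 0.80, ("A", "D"): 0.90,
    ("B", "C"): 0.70, ("B", "D"): 0.85, ("C", "D"): 0.20,
    ("REF", "A"): 0.50, ("REF", "B"): 0.60,
    ("REF", "C"): 0.90, ("REF", "D"): 0.95,
}
toy_clusters = cut_at_reference(toy_dist, toy_names, "REF")
```

With other linkage rules (average, complete, Ward) the cut height would be read from the dendrogram itself instead of from the raw distances.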
Fig. 3. The haplotype composition of each specimen genotyped in this study, represented as a barcode. The 88 specimens in the study cohort were assigned to 16 distinct clusters by the ensemble, with cluster assignments shown on the right-hand side of each panel. These cluster assignments were made based on the haplotype composition of each sample, with the loci and their respective haplotype numbers shown along the two top rows. Boxes are shaded black if the corresponding haplotype was detected in a specimen. Specimen names are listed in the far-left column of each panel. Rows are shaded grey if sequencing was unsuccessful for a given marker. This figure was generated to graphically represent the groupings assigned by the ensemble when presented with a set of complex genotyping data.
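The barcode representation amounts to a presence/absence matrix with one row per specimen and one column per haplotype at each locus. The sketch below is only illustrative: the locus aliases (MSR, 360i2, 378) come from the primer table, but the haplotype numbers, specimen labels and the text symbols (`#` detected, `.` not detected, `?` sequencing failed) are assumptions.

```python
from typing import Dict, List, Optional, Set


def barcode_rows(
    genotypes: Dict[str, Dict[str, Optional[Set[int]]]],
    loci: Dict[str, List[int]],
) -> Dict[str, str]:
    """Encode each specimen as a string with one character per haplotype.

    genotypes: specimen -> locus -> set of detected haplotype numbers,
               or None when sequencing of that marker failed.
    loci:      locus -> ordered list of known haplotype numbers.
    """
    rows = {}
    for specimen, calls in genotypes.items():
        cells = []
        for locus, haplotypes in loci.items():
            detected = calls.get(locus)
            for h in haplotypes:
                if detected is None:
                    cells.append("?")   # marker not sequenced (grey in the figure)
                elif h in detected:
                    cells.append("#")   # haplotype detected (black box)
                else:
                    cells.append(".")   # haplotype not detected
        rows[specimen] = "".join(cells)
    return rows


# Hypothetical haplotype numbering and specimens.
toy_loci = {"MSR": [1, 2], "360i2": [1, 2, 3], "378": [1, 2]}
toy_genotypes = {
    "S1": {"MSR": {1}, "360i2": {1, 3}, "378": None},
    "S2": {"MSR": {2}, "360i2": {2}, "378": {1, 2}},
}
toy_barcodes = barcode_rows(toy_genotypes, toy_loci)
```

A specimen carrying more than one haplotype at a locus, such as S1 at 360i2, produces several filled cells for that locus, which is how heterogeneous infections show up in this encoding.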