
PhyTB: Phylogenetic tree visualisation and sample positioning for M. tuberculosis.

Ernest D Benavente^1,2, Francesc Coll³, Nick Furnham⁴, Ruth McNerney⁵, Judith R Glynn⁶, Susana Campino⁷, Arnab Pain⁸, Fady R Mohareb⁹, Taane G Clark^10,11.

Abstract

BACKGROUND: Phylogenetic-based classification of M. tuberculosis and other bacterial genomes is a core analysis for studying evolutionary hypotheses, disease outbreaks and transmission events. Whole genome sequencing is providing new insights into the genomic variation underlying intra- and inter-strain diversity, thereby assisting with the classification and molecular barcoding of the bacteria. One roadblock to strain investigation is the lack of user-interactive solutions to interrogate and visualise variation within a phylogenetic tree setting.
RESULTS: We have developed a web-based tool called PhyTB ( http://pathogenseq.lshtm.ac.uk/phytblive/index.php ) to assist phylogenetic tree visualisation and the identification of M. tuberculosis clade-informative polymorphisms. Variant Call Format files can be uploaded to determine a sample's position within the tree. A map view summarises the geographical distribution of alleles and strain-types. The utility of PhyTB is demonstrated on sequence data from 1,601 M. tuberculosis isolates.
CONCLUSION: PhyTB contextualises M. tuberculosis genomic variation within epidemiological, geographical and phylogenetic settings. Further tool utility is possible by incorporating large variants and phenotypic data (e.g. drug-resistance profiles), and an assessment of genotype-phenotype associations. Source code is available to develop similar websites for other organisms ( http://sourceforge.net/projects/phylotrack ).

Year: 2015 PMID: 25968323 PMCID: PMC4429496 DOI: 10.1186/s12859-015-0603-3

Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169

Background

Strain-specific genomic diversity in the Mycobacterium tuberculosis complex (MTBC) is an important factor in tuberculosis pathogenesis that may affect virulence, transmissibility, host response and the emergence of drug resistance [1,2]. Some modern strains (e.g. Beijing, Euro-American, Haarlem) are believed to exhibit more virulent phenotypes than ancient ones (e.g. East African, Indian, M. africanum) [2]. M. tuberculosis is relatively clonal, with little recombination and a low mutation rate [3]. As in other bacterial genomic settings, the construction of phylogenetic trees from sequence data facilitates taxonomic localisation and evolutionary analysis. The growing availability of M. tuberculosis whole genome sequences is leading to the full characterisation of single nucleotide polymorphisms (SNPs) and other nucleotide variation, such as insertions and deletions (indels). A SNP-based barcode has been developed to discriminate strain-types [2]. Trees constructed using genome-wide variation have greater discriminatory power than traditional genotyping approaches such as MIRU-VNTR and spoligotyping [4]. Clades reflecting strain-type variation may be used to investigate disease outbreaks or transmission events, where samples are identified through apparently identical genomic signatures [5,6]. The tree also provides a structure for identifying variants that can be used to investigate clinically important traits such as drug resistance [5]. The primary mechanism for acquiring resistance is the accumulation of point mutations in genes coding for drug targets or drug-converting enzymes (e.g. the katG, inhA, rpoB, pncA, embB, rrs, gyrA and gyrB genes) [7], and these mutations may exist in multiple lineages in the tree, reflecting homoplasy events. Some mutations thought to be related to drug resistance are in fact strain-informative rather than resistance-conferring [2].
With the increased application of sequencing technologies within clinical and microbiological research settings, it is important that informatic tools are available to identify informative strain-type and drug resistance related variants. Web-based browsers for visualising M. tuberculosis genomic variation exist [8-10], but there is limited connectivity with phylogenetic trees and downstream analysis, especially involving strain-types and drug resistance. In addition, there is little provision for uploading new data, such as standard Variant Call Format (VCF) files (www.htslib.org). Here we present the PhyTB tool, which facilitates the phylogenetic exploration of M. tuberculosis isolates, including the display of clade-specific informative and drug resistance markers and their genomic annotation. Using the browser, it is possible to upload multiple standard VCF files to identify the closest relative within the M. tuberculosis complex global phylogeny, thereby potentially assisting their interpretation in a clinical or epidemiological context. Source code is available to facilitate the development of sites for other organisms with genomes that can be represented in a phylogeny.

Implementation

PhyTB is a JavaScript-based web-browsing tool that uses the D3.js library for data visualisation [11] and the JBrowse tool for genome browser representation [12]. The source code has been integrated into a package called PhyloTrack, enabling websites for other organisms to be developed (http://sourceforge.net/projects/phylotrack). The software requires a phylogenetic tree in the common Newick format as input, together with tab-delimited metadata files for samples, clade-defining nodes and clade colour definitions. The phylogenetic tree was constructed using 91k SNPs mapped against the H37Rv reference genome [Genbank:NC_000962.3]. These SNPs were identified using a combination of the bwa-mem alignment software (bio-bwa.sourceforge.net) and the SAMtools/BCFtools suite (samtools.sourceforge.net), complemented by GATK (https://www.broadinstitute.org/gatk/). Variants with a Q-score of 30 or more were then selected from the intersection of the SAMtools and GATK call sets. SNPs in non-unique regions, including repeat regions in PE/PPE genes, were removed (see [2] for details). The best-scoring maximum likelihood phylogenetic tree was computed using RAxML v7.4.2 (http://sco.h-its.org/exelixis/web/software/raxml/index.html) based on 91,648 sites spanning the whole genome. Given the considerable size of the dataset (1,601 samples, 91,648 SNP sites), the rapid bootstrapping algorithm (N = 100, x = 12,345) combined with a maximum likelihood search was chosen to construct the phylogenetic tree, retaining only branches with bootstrap values greater than 95%. The resulting tree was rooted on M. canettii [Genbank: NC_019950.1] and nodes were annotated. Subsequently, the ancestral sequence at all internal nodes was computed using DnaPars from the Phylip package (http://evolution.genetics.washington.edu/phylip/). The main lineage- and sublineage-defining nodes were initially identified from the tree, based on the spoligotypes in each clade.
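As an illustration of the Newick input described above, the following minimal Python sketch parses a simple Newick string into nested dictionaries with "name" and "children" keys, which is the general shape that D3.js hierarchical layouts consume. It is not the PhyloTrack conversion code (which is written in Perl); the function name parse_newick, the "length" key and the toy tree are assumptions made for this example, and quoted labels and bracketed comments are not handled.

```python
import json


def parse_newick(text):
    """Parse a simple Newick string into nested dicts (hypothetical helper)."""
    text = text.strip()
    if text.endswith(";"):
        text = text[:-1]
    pos = 0

    def parse_node():
        nonlocal pos
        node = {"name": "", "length": None}
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while pos < len(text) and text[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if pos >= len(text) or text[pos] != ")":
                raise ValueError(f"expected ')' at offset {pos}")
            pos += 1  # consume ")"
            node["children"] = children
        # Optional node label, read up to the next structural character.
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        node["name"] = text[start:pos].strip()
        # Optional branch length introduced by ":".
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            node["length"] = float(text[start:pos])
        return node

    root = parse_node()
    if pos != len(text):
        raise ValueError(f"unexpected trailing text at offset {pos}")
    return root


if __name__ == "__main__":
    # Toy tree with made-up labels and branch lengths.
    toy = "((A:0.1,B:0.2)n1:0.3,C:0.4)root;"
    print(json.dumps(parse_newick(toy), indent=2))
```

Leaves carry no "children" key, so the JSON output can be passed directly to code that treats a missing children array as a leaf node.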
Informative markers at each node in the phylogenetic tree are stored in VCF files and displayed, highlighting clade-defining polymorphisms. This functionality has been implemented using the tabix tool [13] on the server side. The informative variants have been established by comparing allele frequencies between strain-types using ancestral node comparisons [2]. The Perl scripts used to generate these data are included within the PhyloTrack package. These include scripts to convert a tree into JSON format for use by the D3.js library, produce metadata for each node, and process VCF files containing information for each node and SNP. VCF files containing clade-informative and drug resistance markers [2] are compressed using bgzip and indexed using tabix to improve computational efficiency, as well as to act as a database. Variants in user-uploaded VCF files are compared to those in the database to establish a sample's position within the tree. Using node-specific SNPs, the possible paths inside the tree are reconstructed, and the one with the most SNP matches is reported. PhyTB's map view shows allele and strain-type frequencies by geographical location, and was developed from the PolyTB source code [9].
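The positioning step described above can be sketched as follows. This is a simplified Python illustration, not the PhyTB implementation: it assumes that candidate paths run from the root to a leaf, that each node carries a set of (position, alternative allele) pairs under a hypothetical "snps" key, and that ties are broken by taking the first best path found. The function names and the toy tree with made-up positions are likewise assumptions.

```python
def read_vcf_snps(lines):
    """Collect (position, alternative allele) pairs for SNPs in VCF text lines."""
    snps = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/meta lines
        fields = line.rstrip("\n").split("\t")
        # VCF columns: CHROM, POS, ID, REF, ALT, ...
        pos, ref, alt = int(fields[1]), fields[3], fields[4]
        if len(ref) == 1 and len(alt) == 1 and alt != ".":
            snps.add((pos, alt))  # keep single-base substitutions only
    return snps


def best_path(tree, sample_snps):
    """Return (match count, node names) for the best-matching root-to-leaf path."""

    def walk(node, score, path):
        # Count the node-specific SNPs that the sample also carries.
        score += len(node.get("snps", set()) & sample_snps)
        path = path + [node["name"]]
        children = node.get("children", [])
        if not children:
            return score, path
        return max((walk(c, score, path) for c in children), key=lambda r: r[0])

    return walk(tree, 0, [])


# Toy tree: node names and SNP positions are invented for illustration.
TOY_TREE = {
    "name": "root",
    "snps": set(),
    "children": [
        {
            "name": "lineage_A",
            "snps": {(100, "T"), (200, "G")},
            "children": [
                {"name": "A1", "snps": {(300, "C")}},
                {"name": "A2", "snps": {(400, "A"), (500, "T")}},
            ],
        },
        {
            "name": "lineage_B",
            "snps": {(600, "G")},
            "children": [{"name": "B1", "snps": {(700, "C")}}],
        },
    ],
}
```

A sample carrying both lineage_A markers and one of the two A2 markers would score three matches along root, lineage_A, A2, and that path would be reported. In a real deployment the node-specific SNPs would come from the bgzip-compressed, tabix-indexed VCF files rather than from an in-memory dictionary.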

Results and discussion

PhyTB uses 1,601 global MTBC whole-genome sequences from 11 studies with representation across all 7 major lineages (lineage 1 - 7.6%, 2 - 24.3%, 3 - 11.8%, 4 - 53.5%, 5-7 - 2.8%). The phylogenetic tree constructed using the 91k SNPs shows the expected clustering by lineage and strain-type (Figure 1). SNP information is displayed at internal nodes of the tree, thereby distinguishing unique strain-defining mutations from those arising in multiple branches (homoplastic mutations). Homoplastic mutations arise due to recombination or convergent evolution, potentially related to drug resistance. Figure 1 shows a deep phylogenetic SNP (R463L) in the katG gene that is present across all lineages except lineage 4. Historically, this SNP was mistakenly thought to cause isoniazid resistance. PhyTB displays, in tracks, whether polymorphisms have previously been related to drug resistance [14] or are strain-informative [2]; metadata (e.g. codon, amino acid) are shown by selecting the polymorphism of interest. It is possible to move from the tree view to a geographical map showing allele frequencies. A map view, accessed through the genome browser located below the tree, shows a SNP at position 762,434 in rpoB, a gene associated with rifampicin resistance. The alternative allele leads to a synonymous mutation (G876G) that is fixed in CAS (lineage 3) strains in Malawi (Figure 2) and all other study sites. To demonstrate the VCF positioning functionality, we used 100 M. tuberculosis samples [ENA:ERP000192] of known strain-type [9] that were not included in the phylogeny. It was possible to position all of them unambiguously in the tree. Figure 3 shows the result of uploading the VCF file for a Russian sample [ENA:ERR019571], which has 5,067 SNPs, allowing it to be positioned correctly in a Beijing clade.
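The distinction between unique strain-defining mutations and homoplastic ones can be illustrated with a short Python sketch. It assumes that each SNP has already been assigned to the node (branch) on which it arises, for example from the ancestral reconstruction, so that a SNP assigned to more than one node is treated as homoplastic. The function name, node names and positions are hypothetical and are not taken from PhyTB.

```python
from collections import defaultdict


def split_by_homoplasy(node_snps):
    """Separate SNPs arising on one node from those arising on several nodes."""
    nodes_by_snp = defaultdict(list)
    for node, snps in node_snps.items():
        for snp in snps:
            nodes_by_snp[snp].append(node)
    # A SNP seen on exactly one node is a candidate clade-defining marker.
    unique = {s: n[0] for s, n in nodes_by_snp.items() if len(n) == 1}
    # A SNP seen on several nodes has arisen independently more than once.
    homoplastic = {s: sorted(n) for s, n in nodes_by_snp.items() if len(n) > 1}
    return unique, homoplastic


if __name__ == "__main__":
    # Made-up node names and SNPs, for illustration only.
    toy = {
        "node_1": {(100, "T"), (900, "C")},
        "node_2": {(200, "G")},
        "node_3": {(900, "C"), (300, "A")},
    }
    unique, homoplastic = split_by_homoplasy(toy)
    print(sorted(unique.items()))
    print(sorted(homoplastic.items()))
```

In this toy input, the SNP at position 900 is assigned to two separate nodes and so would be flagged as homoplastic, while the remaining SNPs each map to a single node.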

Figure 1

PhyTB screenshot: A phylogenetic tree for the 1,601 M. tuberculosis isolates (A), with each lineage colour coded (B). A selected SNP, R463L in the katG gene (associated with isoniazid resistance) (C), is located at position 2,152,224 (D) and is present across all but one lineage (4) (E).

Figure 2

PhyTB screenshot: A map view showing the frequency of the G876G SNP in the rpoB gene (searchable from (A)) and its association with lineage 3 strain-types in Malawi. Pie chart (B) shows that the non-reference allele frequency (red segment, inner circle) is linked to the CAS spoligotype (blue segment, outer circle).

Figure 3

PhyTB screenshot: A Russian sample (ERR019571) is located in a Beijing clade in lineage 2.2.1 (C), established using variants uploaded in a Variant Call Format file.

Conclusion

The PhyTB web-browser attempts to contextualise TB genomic variation within epidemiological, geographical and phylogenetic settings. To assist with integrating such data for other organisms, we provide the source code, which has been packaged in the PhyloTrack library. In pathogenic bacteria like M. tuberculosis, data integration is crucial to distinguish drug-resistance mutations from phylogenetic markers, to study the transmission of outbreak strains, to detect the source of an infection, to inform patient management and to design appropriate infection control measures (e.g. rapid tests). Further tool utility is possible by extending it to incorporate large variants and phenotypic data (e.g. drug-resistance profiles).

Availability and requirements

Project name: PhyTB
Project home page: http://pathogenseq.lshtm.ac.uk/phytblive/index.php
Source code: PhyloTrack - http://sourceforge.net/projects/phylotrack
Operating system(s): Platform independent
Programming language: JavaScript and Perl
Other requirements: None
License: None
Any restrictions to use by non-academics: None

13 in total

1. Tabix: fast retrieval of sequence features from generic TAB-delimited files.

Authors: Heng Li
Journal: Bioinformatics Date: 2011-01-05 Impact factor: 6.937

2. JBrowse: a next-generation genome browser.

Authors: Mitchell E Skinner; Andrew V Uzilov; Lincoln D Stein; Christopher J Mungall; Ian H Holmes
Journal: Genome Res Date: 2009-07-01 Impact factor: 9.043

3. Clade-specific virulence patterns of Mycobacterium tuberculosis complex strains in human primary macrophages and aerogenically infected mice.

Authors: Norbert Reiling; Susanne Homolka; Kerstin Walter; Julius Brandenburg; Lisa Niwinski; Martin Ernst; Christian Herzmann; Christoph Lange; Roland Diel; Stefan Ehlers; Stefan Niemann
Journal: MBio Date: 2013-07-30 Impact factor: 7.867

4. Recurrence due to relapse or reinfection with Mycobacterium tuberculosis: a whole-genome sequencing approach in a large, population-based cohort with a high HIV infection prevalence and active follow-up.

Authors: José Afonso Guerra-Assunção; Rein M G J Houben; Amelia C Crampin; Themba Mzembe; Kim Mallard; Francesc Coll; Palwasha Khan; Louis Banda; Arthur Chiwaya; Rui P A Pereira; Ruth McNerney; David Harris; Julian Parkhill; Taane G Clark; Judith R Glynn
Journal: J Infect Dis Date: 2014-10-21 Impact factor: 5.226

5. Rapid determination of anti-tuberculosis drug resistance from whole-genome sequences.

Authors: Francesc Coll; Ruth McNerney; Mark D Preston; José Afonso Guerra-Assunção; Andrew Warry; Grant Hill-Cawthorne; Kim Mallard; Mridul Nair; Anabela Miranda; Adriana Alves; João Perdigão; Miguel Viveiros; Isabel Portugal; Zahra Hasan; Rumina Hasan; Judith R Glynn; Nigel Martin; Arnab Pain; Taane G Clark
Journal: Genome Med Date: 2015-05-27 Impact factor: 11.117

6. Mycobacterium tuberculosis mutation rate estimates from different lineages predict substantial differences in the emergence of drug-resistant tuberculosis.

Authors: Christopher B Ford; Rupal R Shah; Midori Kato Maeda; Sebastien Gagneux; Megan B Murray; Ted Cohen; James C Johnston; Jennifer Gardy; Marc Lipsitch; Sarah M Fortune
Journal: Nat Genet Date: 2013-06-09 Impact factor: 38.330

7. PolyTB: a genomic variation map for Mycobacterium tuberculosis.

Authors: Francesc Coll; Mark Preston; José Afonso Guerra-Assunção; Grant Hill-Cawthorn; David Harris; João Perdigão; Miguel Viveiros; Isabel Portugal; Francis Drobniewski; Sebastien Gagneux; Judith R Glynn; Arnab Pain; Julian Parkhill; Ruth McNerney; Nigel Martin; Taane G Clark
Journal: Tuberculosis (Edinb) Date: 2014-02-15 Impact factor: 3.131

8. Genome-wide Mycobacterium tuberculosis variation (GMTV) database: a new tool for integrating sequence variations and epidemiology.

Authors: Ekaterina N Chernyaeva; Marina V Shulgina; Mikhail S Rotkevich; Pavel V Dobrynin; Serguei A Simonov; Egor A Shitikov; Dmitry S Ischenko; Irina Y Karpova; Elena S Kostryukova; Elena N Ilina; Vadim M Govorun; Vyacheslav Y Zhuravlev; Olga A Manicheva; Peter K Yablonsky; Yulia D Isaeva; Elena Y Nosova; Igor V Mokrousov; Anna A Vyazovaya; Olga V Narvskaya; Alla L Lapidus; Stephen J O'Brien
Journal: BMC Genomics Date: 2014-04-25 Impact factor: 3.969

9. PATRIC, the bacterial bioinformatics database and analysis resource.

Authors: Alice R Wattam; David Abraham; Oral Dalay; Terry L Disz; Timothy Driscoll; Joseph L Gabbard; Joseph J Gillespie; Roger Gough; Deborah Hix; Ronald Kenyon; Dustin Machi; Chunhong Mao; Eric K Nordberg; Robert Olson; Ross Overbeek; Gordon D Pusch; Maulik Shukla; Julie Schulman; Rick L Stevens; Daniel E Sullivan; Veronika Vonstein; Andrew Warren; Rebecca Will; Meredith J C Wilson; Hyun Seung Yoo; Chengdong Zhang; Yan Zhang; Bruno W Sobral
Journal: Nucleic Acids Res Date: 2013-11-12 Impact factor: 16.971

10. Elucidating emergence and transmission of multidrug-resistant tuberculosis in treatment experienced patients by whole genome sequencing.

Authors: Taane G Clark; Kim Mallard; Francesc Coll; Mark Preston; Samuel Assefa; David Harris; Sam Ogwang; Francis Mumbowa; Bruce Kirenga; Denise M O'Sullivan; Alphonse Okwera; Kathleen D Eisenach; Moses Joloba; Stephen D Bentley; Jerrold J Ellner; Julian Parkhill; Edward C Jones-López; Ruth McNerney
Journal: PLoS One Date: 2013-12-11 Impact factor: 3.240
