SVAMP: sequence variation analysis, maps and phylogeny.

Raeece Naeem, Lailatul Hidayah, Mark D Preston, Taane G Clark, Arnab Pain.

Abstract

SUMMARY: SVAMP is a stand-alone desktop application to visualize genomic variants (in variant call format) in the context of geographical metadata. Users of SVAMP are able to generate phylogenetic trees and perform principal coordinate analysis in real time from variant call format (VCF) and associated metadata files. Allele frequency map, geographical map of isolates, Tajima's D metric, single nucleotide polymorphism density, GC and variation density are also available for visualization in real time. We demonstrate the utility of SVAMP in tracking a methicillin-resistant Staphylococcus aureus outbreak from published next-generation sequencing data across 15 countries. We also demonstrate the scalability and accuracy of our software on 245 Plasmodium falciparum malaria isolates from three continents.
AVAILABILITY AND IMPLEMENTATION: The Qt/C++ software code, binaries, user manual and example datasets are available at http://cbrc.kaust.edu.sa/svamp
CONTACT: arnab.pain@kaust.edu.sa or arnab.pain@cantab.net
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author 2014. Published by Oxford University Press.

Year:  2014        PMID: 24700318      PMCID: PMC4103593          DOI: 10.1093/bioinformatics/btu176

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

Associating sequence variants [single nucleotide polymorphisms (SNPs) and indels] with sample metadata such as geographical location and drug susceptibility has played a key role in studying population structure (Manske et al.), identifying mechanisms of drug resistance (Downing et al.) and tracking the transmission of infectious disease (Harris et al.). With the increasing application of deep sequencing, the number and volume of population studies with geo-biological information and associated genomic data will continue to grow. This increases the demand for tools to integrate, visualize and analyse complex genomic epidemiological data in real time, including browsing genome variation patterns and assessing population structure or geo-phylogeny. Although software such as PolyLens (Berry et al.) and GenGIS (Parks et al.) can integrate geographical and genetic sequence data, there is a need to scale up to whole genome variation in the standardized VCF format (Danecek et al.) with informative population genetic analysis. This motivated us to develop SVAMP, a stand-alone Qt/C++ application capable of analysing variants in the context of geography and aiding inference of population structure. SVAMP is built on the open-source software VarB (Preston et al.).

2 METHODS

Input to the SVAMP software is a bundle comprising a multisample VCF file, a reference FASTA file, an annotation file in general feature format (GFF) and a precalculated SQLite database file. The bundle preparation script included with SVAMP captures the geographical coordinates, date of isolation and genome coverage of the samples. Once loaded into SVAMP, these files allow the user to perform key population genomics analyses in real time and to visualize the results.

Two popular methods of analysing sample relatedness, principal coordinate analysis [PCoA; Torgerson–Gower scaling (Gower, 1966)] and the geo-phylogenetic tree, are integrated into SVAMP. The pairwise dissimilarity matrix D is first computed based on the Hamming distance (Hamming, 1950) (d) between pairs of samples (i, j) using the equation

$$D_{i,j} = \sum_{k=1}^{L} d\left(S_{i,k}, S_{j,k}\right)$$

where k is the index of the genomic position out of L considered positions and S_{i,k} is the genotype called for sample i at position k in the genome. Positions that have missing genotype information are ignored in the computation; therefore, the multisample VCF file should ideally consist of samples and variants with reasonably complete data. The matrix D forms the basis for subsequent PCoA and phylogenetic tree reconstruction; it is computed from genotype data comprising N (number of samples) rows and K (number of variant positions) columns.

PCoA, equivalently multidimensional scaling, is computed as per the R function cmdscale, and the phylogenetic tree is constructed using the Fitch–Margoliash algorithm (Fitch and Margoliash, 1967). The user is provided with an option to group colours based on a known phenotype (e.g. drug susceptibility) or a custom classification.
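The dissimilarity computation described above can be illustrated with a minimal Python sketch. It is not the SVAMP implementation (which is Qt/C++). The function name, the use of `None` for a missing call, and the choice to skip missing positions per pair of samples (rather than dropping the position for all samples) are assumptions made for illustration.

```python
from itertools import combinations

MISSING = None  # hypothetical marker for a missing genotype call


def hamming_dissimilarity(genotypes):
    """Return the symmetric N x N matrix of pairwise Hamming distances.

    genotypes: list of N equal-length lists, one per sample, holding the
    genotype called at each considered position (MISSING if no call).
    Assumption: a position is skipped for a pair when either sample
    lacks a call there.
    """
    n = len(genotypes)
    dist = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = sum(
            1
            for a, b in zip(genotypes[i], genotypes[j])
            if a is not MISSING and b is not MISSING and a != b
        )
        dist[i][j] = dist[j][i] = d
    return dist


# Toy input: three samples, four variant positions, one missing call.
samples = [
    ["A", "C", "G", "T"],
    ["A", "C", "A", "T"],
    ["G", None, "A", "T"],
]
D = hamming_dissimilarity(samples)
```

With many missing calls, per-pair skipping makes distances between different pairs rest on different numbers of positions, which is one reason reasonably complete input data are preferable.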
Tree computation using an external phylogeny package is also supported: alignments can be saved in a compatible format, and the resulting tree can be visualized in SVAMP. PCoA, phylogenetic tree construction and alignment export can be performed on multiple regions of interest within a subset of samples. As an added feature, SVAMP integrates popular BAM viewers such as LookSeq (Manske and Kwiatkowski, 2009) for viewing read alignment evidence for variants.
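The PCoA step described in this section (classical scaling, as in the R function cmdscale) can be sketched as follows. This is an illustrative stand-in, not SVAMP's code: it double-centres the squared distances and then extracts only the first coordinate axis by power iteration, a simplification chosen here in place of a full eigendecomposition. The function names are hypothetical.

```python
import math


def double_center(dist):
    """Gower centring of squared distances: B = -1/2 * J * D^2 * J."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # dist is symmetric, so column means equal row means.
    return [
        [-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
         for j in range(n)]
        for i in range(n)
    ]


def mat_vec(matrix, vector):
    """Multiply a square matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def first_principal_coordinate(dist, iterations=200):
    """Coordinates of each sample on the leading PCoA axis."""
    b = double_center(dist)
    vec = [float(i + 1) for i in range(len(dist))]  # arbitrary start vector
    for _ in range(iterations):
        nxt = mat_vec(b, vec)
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:  # all samples identical: nothing to separate
            break
        vec = [x / norm for x in nxt]
    # Rayleigh quotient gives the eigenvalue for the converged unit vector.
    eigenvalue = sum(v * w for v, w in zip(vec, mat_vec(b, vec)))
    scale = math.sqrt(max(eigenvalue, 0.0))
    return [scale * v for v in vec]


# Toy input: three samples lying on a line at positions 0, 1 and 3.
toy_dist = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
coords = first_principal_coordinate(toy_dist)
```

For this toy matrix the single axis reproduces the input distances, up to an arbitrary sign. Real genotype data generally need two or more axes, obtained by repeating the extraction after deflation or by using a full eigensolver.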

3 RESULTS

We have evaluated the application and scalability of SVAMP using two published datasets: (i) a bacterial population study (Harris et al.) on methicillin-resistant Staphylococcus aureus (commonly known as MRSA) and (ii) a worldwide population structure study (Manske et al.) on the Plasmodium falciparum malaria parasite. Both example datasets are available for download at http://cbrc.kaust.edu.sa/svamp as packaged SVAMP bundles.

3.1 MRSA outbreak analysis using SVAMP

The MRSA dataset visualised in SVAMP, as shown in Figure 1, contains 4310 SNP sites determined from 63 isolates obtained from various hospitals across 15 countries, spanning a period of >25 years. The linear phylogenetic tree constructed using SVAMP is shown in Supplementary Figure S1, and the circular tree in Supplementary Figure S2 is consistent with that described by Harris et al. Supplementary Figure S3 shows the Portuguese samples on the tree overlaid on the geographical map displaying the year of isolation and location. Supplementary Figure S4 shows the two European isolates DEN907 and TW20 clearly joining the Asian clade. From Supplementary Figure S1, it can also be observed that five isolates from Thailand (S21, S24, S39, S42 and S81), obtained from the same hospital, cluster together to form a single subclade. Colour coding the isolates by country of origin allows the geographical map and the tree to be viewed simultaneously, which assists genomic epidemiological inference.
Fig. 1.

Screenshot from SVAMP software shows (A) variation across 63 MRSA isolates from 15 countries, (B) allele frequency map of a variation site in the genome, (C) PCoA plot, (D) phylogenetic tree of all the isolates

3.2 Exploring the population structure of Malaria isolates using SVAMP

The raw sequencing data obtained from the P. falciparum diversity study (Manske et al.) were mapped using smalt, and SNPs were called using samtools. The resulting variants were merged using vcftools. Only coding region variants that do not fall in var, rifin and stevor gene sites (the hypervariable gene families in malaria) were included. After filtering for quality and missing data, 26 918 SNPs were retained. This dataset consists of 245 samples from six countries: three in Africa (AFR), two in Southeast Asia (SEA), and Papua New Guinea (PNG). The PCoA performed using SVAMP (Supplementary Figure S5) clearly separates the samples into three clusters corresponding to AFR, SEA and PNG, as seen in the paper by Manske et al. As expected, individual continental PCoA analyses demonstrate separation between East and West African samples (Supplementary Fig. S6) and between Thailand and Cambodia samples. The commands and parameters used to obtain the final dataset used in SVAMP are explained in the Supplementary Materials.
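The variant selection rules described above can be sketched in Python as follows. This is only an illustration of the logic, not the pipeline used for the paper (that relied on smalt, samtools and vcftools, with parameters given in the Supplementary Materials). The `Snp` record, its field names, and the quality and missing-data thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypervariable gene families excluded from the analysis.
HYPERVARIABLE_FAMILIES = {"var", "rifin", "stevor"}


@dataclass
class Snp:
    position: int
    in_coding_region: bool
    gene_family: Optional[str]      # e.g. "var"; None if not in a named family
    quality: float                  # site quality score
    genotypes: List[Optional[str]]  # one call per sample; None = missing


def keep_snp(snp, min_quality=30.0, max_missing_fraction=0.1):
    """Apply the filters in order; both thresholds are assumed values."""
    if not snp.in_coding_region:
        return False
    if snp.gene_family in HYPERVARIABLE_FAMILIES:
        return False
    if snp.quality < min_quality:
        return False
    missing = sum(1 for g in snp.genotypes if g is None)
    return missing / len(snp.genotypes) <= max_missing_fraction


# Toy records exercising each filter.
snps = [
    Snp(101, True, None, 60.0, ["A", "A", "G", "A"]),
    Snp(202, True, "var", 80.0, ["C", "T", "C", "C"]),    # hypervariable family
    Snp(303, False, None, 90.0, ["G", "G", "G", "A"]),    # non-coding
    Snp(404, True, None, 12.0, ["T", "T", "C", "T"]),     # low quality
    Snp(505, True, None, 70.0, ["A", None, None, "G"]),   # too much missing data
    Snp(606, True, None, 55.0, ["C", "C", "T", "C"]),
]
retained = [s.position for s in snps if keep_snp(s)]
```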

3.3 Memory and computational speed of SVAMP on MRSA and malaria datasets

Memory usage and computational speed of SVAMP were evaluated on a laptop computer with 2 CPU cores (4 GB RAM) and on a workstation with 12 CPU cores (96 GB RAM). The results were averaged for both the MRSA and malaria datasets and are shown in Table 1.
Table 1.

Memory and speed of SVAMP on malaria and MRSA datasets

Dataset (N, K) | Size on disk (MB) | Average RAM usage (GB) | Time to load data (s) | Time to compute PCoA (s) | Time to construct tree
MRSA (63, 4310) | 21.30.234120 s
Malaria (245, 26 918) | 6371.2350604.7 h

Note: N: number of samples; K: number of variants.

4 CONCLUSIONS

By using the sequence variant and associated geographical information, we believe the software SVAMP will aid greatly in analysing isolates from an outbreak, as well as predicting the population structure in epidemiological studies.

Funding: KAUST faculty baseline funding to A.P.; Medical Research Council (UK) grant MR/J005398/1 to T.G.C. and M.D.P.

Conflict of Interest: none declared.

1.  GenGIS: A geospatial information system for genomic data.

Authors:  Donovan H Parks; Michael Porter; Sylvia Churcher; Suwen Wang; Christian Blouin; Jacqueline Whalley; Stephen Brooks; Robert G Beiko
Journal:  Genome Res       Date:  2009-07-27       Impact factor: 9.043

2.  LookSeq: a browser-based viewer for deep sequencing data.

Authors:  Heinrich Magnus Manske; Dominic P Kwiatkowski
Journal:  Genome Res       Date:  2009-08-13       Impact factor: 9.043

3.  PolyLens: software for map-based visualisation and analysis of genome-scale polymorphism data.

Authors:  Michael W Berry; Tiantian Gao; Ryhan Pathan; Gary W Stuart
Journal:  Int J Comput Biol Drug Des       Date:  2013-02-21

4.  Construction of phylogenetic trees.

Authors:  W M Fitch; E Margoliash
Journal:  Science       Date:  1967-01-20       Impact factor: 47.728

5.  Whole genome sequencing of multiple Leishmania donovani clinical isolates provides insights into population structure and mechanisms of drug resistance.

Authors:  Tim Downing; Hideo Imamura; Saskia Decuypere; Taane G Clark; Graham H Coombs; James A Cotton; James D Hilley; Simonne de Doncker; Ilse Maes; Jeremy C Mottram; Mike A Quail; Suman Rijal; Mandy Sanders; Gabriele Schönian; Olivia Stark; Shyam Sundar; Manu Vanaerschot; Christiane Hertz-Fowler; Jean-Claude Dujardin; Matthew Berriman
Journal:  Genome Res       Date:  2011-10-28       Impact factor: 9.043

6.  Evolution of MRSA during hospital transmission and intercontinental spread.

Authors:  Simon R Harris; Edward J Feil; Matthew T G Holden; Michael A Quail; Emma K Nickerson; Narisara Chantratita; Susana Gardete; Ana Tavares; Nick Day; Jodi A Lindsay; Jonathan D Edgeworth; Hermínia de Lencastre; Julian Parkhill; Sharon J Peacock; Stephen D Bentley
Journal:  Science       Date:  2010-01-22       Impact factor: 47.728

7.  VarB: a variation browsing and analysis tool for variants derived from next-generation sequencing data.

Authors:  Mark D Preston; Magnus Manske; Neil Horner; Samuel Assefa; Susana Campino; Sarah Auburn; Issaka Zongo; Jean-Bosco Ouedraogo; Francois Nosten; Tim Anderson; Taane G Clark
Journal:  Bioinformatics       Date:  2012-09-13       Impact factor: 6.937

8.  The variant call format and VCFtools.

Authors:  Petr Danecek; Adam Auton; Goncalo Abecasis; Cornelis A Albers; Eric Banks; Mark A DePristo; Robert E Handsaker; Gerton Lunter; Gabor T Marth; Stephen T Sherry; Gilean McVean; Richard Durbin
Journal:  Bioinformatics       Date:  2011-06-07       Impact factor: 6.937

9.  Analysis of Plasmodium falciparum diversity in natural infections by deep sequencing.

Authors:  Magnus Manske; Olivo Miotto; Susana Campino; Sarah Auburn; Jacob Almagro-Garcia; Gareth Maslen; Jack O'Brien; Abdoulaye Djimde; Ogobara Doumbo; Issaka Zongo; Jean-Bosco Ouedraogo; Pascal Michon; Ivo Mueller; Peter Siba; Alexis Nzila; Steffen Borrmann; Steven M Kiara; Kevin Marsh; Hongying Jiang; Xin-Zhuan Su; Chanaki Amaratunga; Rick Fairhurst; Duong Socheat; Francois Nosten; Mallika Imwong; Nicholas J White; Mandy Sanders; Elisa Anastasi; Dan Alcock; Eleanor Drury; Samuel Oyola; Michael A Quail; Daniel J Turner; Valentin Ruano-Rubio; Dushyanth Jyothi; Lucas Amenga-Etego; Christina Hubbart; Anna Jeffreys; Kate Rowlands; Colin Sutherland; Cally Roper; Valentina Mangano; David Modiano; John C Tan; Michael T Ferdig; Alfred Amambua-Ngwa; David J Conway; Shannon Takala-Harrison; Christopher V Plowe; Julian C Rayner; Kirk A Rockett; Taane G Clark; Chris I Newbold; Matthew Berriman; Bronwyn MacInnis; Dominic P Kwiatkowski
Journal:  Nature       Date:  2012-07-19       Impact factor: 49.962
