SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments.

Andrew J Page¹, Ben Taylor¹, Aidan J Delaney², Jorge Soares¹, Torsten Seemann³, Jacqueline A Keane¹, Simon R Harris¹.

Abstract

Rapidly decreasing genome sequencing costs have led to a proportionate increase in the number of samples used in prokaryotic population studies. Extracting single nucleotide polymorphisms (SNPs) from a large whole genome alignment is now a routine task, but existing tools have failed to scale efficiently with the increased size of studies. These tools are slow and memory inefficient, and are installed through non-standard procedures. We present SNP-sites, which can rapidly extract SNPs from a multi-FASTA alignment using modest resources and can output results in multiple formats for downstream analysis. SNPs can be extracted from an 8.3 GB alignment file (1842 taxa, 22 618 sites) in 267 seconds using 59 MB of RAM and 1 CPU core, making it feasible to run on modest computers. It is easy to install through the Debian and Homebrew package managers, and has been successfully tested on more than 20 operating systems. SNP-sites is implemented in C and is available under the open source license GNU GPL version 3.

Keywords: SNP calling; high throughput; software

Year: 2016 PMID: 28348851 PMCID: PMC5320690 DOI: 10.1099/mgen.0.000056

Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858

Data Summary

1. The source code for SNP-sites is available from GitHub under GNU GPL v3 (https://github.com/sanger-pathogens/snp-sites).
2. The software is available from Homebrew using the recipe "brew install snp-sites" and from Debian using "apt-get install snp-sites".
3. Salmonella Typhi multi-FASTA alignment data has been deposited in Figshare: https://dx.doi.org/10.6084/m9.figshare.2067249.v1

Impact Statement

Rapidly extracting SNPs from increasingly large alignments, both in number of sites and number of taxa, is a problem that current tools struggle to deal with efficiently. SNP-sites was created with these challenges in mind and this paper demonstrates that it scales well, using modest desktop computers, to sample sizes far in excess of what is currently analysed in single population studies. The software has also been packaged to allow it to be easily installed on a wide variety of operating systems and hardware, something often neglected in bioinformatics.

Introduction

As the cost of sequencing has rapidly decreased, the number of samples sequenced within a study has proportionately increased and now stands in the thousands (Chewapreecha; Nasser; Wong). A common task in prokaryotic bioinformatics analysis is the extraction of all single nucleotide polymorphisms (SNPs) from a multiple FASTA alignment. Whilst it is a simple problem to describe, current tools cannot rapidly or efficiently extract SNPs from the increasingly large datasets found in prokaryotic population studies. Their inefficiencies, such as loading all the data into memory (Lindenbaum, 2015) or slow speed due to algorithm design (Capella-Gutiérrez), make it infeasible to analyse these sample sets on modest computers. Furthermore, existing tools employ challenging, non-standard installation procedures.

A number of applications exist which can extract SNPs from a multi-FASTA alignment, such as JVarKit (Lindenbaum, 2015), TrimAl (Capella-Gutiérrez), PGDSpider (Lischer & Excoffier, 2012) and PAUP* (Swofford, 2002). JVarKit is a Java toolkit which can output SNP positions in variant call format (VCF) (Danecek). The standardised VCF allows for post-processing with BCFtools (Danecek), which is used to analyse variation in very large datasets such as the Human 1000 Genomes project (Sudmant). JVarKit is reasonably fast; however, it uses nearly 8 bytes of RAM per base of sequencing, which results in substantial memory usage for even small datasets. For example, a 1 GB alignment (200 taxa, 50 000 sites, 5 Mbp genomes) required 7.2 GB of RAM. TrimAl (version 1.4) is a C++ tool which outputs variation, given a multiple FASTA alignment; however, it does not support VCF, only outputting the positions of SNPs in a bespoke format. It is very slow for small sample sets, although it uses less memory than JVarKit. PGDSpider is a Java-based application which can output a VCF file; however, the authors warn it is not suitable for large files, so it has been excluded from this analysis. PAUP* is a popular commercial application, but as it is no longer distributed it was not available for comparison.

None of these applications is easily installable on a wide variety of operating systems and environments. TrimAl is the only application available in Homebrew, and none are available through the Debian package management system.

Here we present SNP-sites, which overcomes these limitations by managing disk I/O and memory carefully, and by optimizing the implementation using C (ISO C99 compliant). Standard installation methods are used, with the software prepackaged and available through the Debian and Homebrew package managers. The software has been successfully run on more than 20 architectures using Debian Linux, Redhat Enterprise Linux and on multiple versions of OS X. A Cython version of the SNP-sites algorithm called PySnpSites (https://github.com/bewt85/PySnpSites) is also presented for comparison purposes.

Theory and Implementation

The input to the software is a single multiple FASTA alignment of nucleotides, where all sequences are the same length and have already been aligned. The file can optionally be gzipped. This alignment may have been generated by overlaying SNPs on a consensus reference genome, or by using a multiple alignment tool such as MUSCLE (Edgar, 2004), PRANK (Löytynoja, 2014), MAFFT (Katoh & Standley, 2013) or ClustalW (Thompson).

By default the output format is a multiple FASTA alignment. The output format can optionally be changed to PHYLIP format (Felsenstein, 1989) or VCF (version 4.1) (Danecek). When SNP-sites is used as a preprocessing step for FastTree (Price), it substantially decreases the memory usage of FastTree during phylogenetic tree reconstruction. The PHYLIP format can be used as input to RAxML (Stamatakis, 2014) for creating phylogenetic trees. For phylogenetic reconstructions, removing monomorphic sites from an alignment may require a different model to avoid parameters being incorrectly estimated. The VCF output retains the position of the SNPs in each sample and can be parsed using standard tools such as BCFtools (Danecek), or used for GWAS analysis with PLINK (Chang).

Each sequence is read in sequentially. A consensus sequence is generated in the first pass and is iteratively compared to each sequence. The position of any difference is noted. A second pass of the input file extracts the bases at each SNP site and outputs them in the chosen format. Where a base is unknown or is a gap (n/N/?/-), the base is regarded as a non-variant.

For example, consider the first three sites (columns) of an input alignment. The first site contains the base A in all samples; as there is no variation, this site is excluded from the output. The second site contains bases A and G, and since there is variation at this site, it is output. The third site contains a gap (-) and the base A; since gaps are regarded as non-variants, this site is not output.
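The two-pass procedure described above can be sketched in Python. This is a minimal illustration and not the C implementation of SNP-sites: the function names, the three-sample toy alignment and the choice to use the first known base at each column as the comparison sequence are all assumptions made for this example. The toy alignment is built to match the three sites described in the text (an invariant A, an A/G site, and a site holding A and a gap).

```python
import io

# Characters the text says are regarded as non-variant: n/N/?/-
MISSING = set("nN?-")


def read_fasta(handle):
    """Yield (name, sequence) pairs one record at a time."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_snp_sites(handle):
    """First pass: return 0-based columns holding more than one known base."""
    reference, variable = None, None
    for _, seq in read_fasta(handle):
        if reference is None:
            # Assumption: the first sequence seeds the comparison sequence.
            reference = list(seq.upper())
            variable = [False] * len(reference)
            continue
        if len(seq) != len(reference):
            raise ValueError("all sequences must have the same length")
        for i, base in enumerate(seq.upper()):
            if base in MISSING:
                continue  # gaps and unknown bases never create a SNP
            if reference[i] in MISSING:
                reference[i] = base  # first known base seen at this column
            elif base != reference[i]:
                variable[i] = True
    return [i for i, flag in enumerate(variable or []) if flag]


def extract_snps(handle, sites):
    """Second pass: keep only the bases at the SNP columns."""
    return [(name, "".join(seq[i] for i in sites))
            for name, seq in read_fasta(handle)]


# Hypothetical toy alignment: column 1 is all A, column 2 has A and G,
# column 3 has A and a gap.
alignment = io.StringIO(">sample1\nAGA\n>sample2\nAA-\n>sample3\nAGA\n")
sites = find_snp_sites(alignment)
alignment.seek(0)  # rewind so the input is read a second time
records = extract_snps(alignment, sites)
for name, snps in records:
    print(f">{name}\n{snps}")
```

Only the second column survives, so each sample is reduced to a single base. Reading the input twice keeps memory proportional to the SNP columns rather than to the whole alignment, at the cost of extra file I/O.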
The maximum resource requirements of the algorithm are known. Given that the number of SNP sites is p, the number of samples is s and the number of bases in a single alignment is g, the maximum memory usage can be defined as:

    max(p × s, g × 2)

Given that f is the size of the input file and o is the size of the output file, the file I/O is defined as:

    2 × f ≤ I/O ≤ 2 × f + o

The computational complexity is O(n). These properties make the algorithm theoretically scalable and feasible on large datasets far beyond what is currently analysed within a single study.

All changes to SNP-sites are validated automatically against a hand-generated set of example cases incorporated into unit tests. A continuous integration system (https://travis-ci.org/sanger-pathogens/snp-sites) ensures that modifications which change the output erroneously are publicly flagged.

To test the performance of SNP-sites, we have compared it with JVarKit, TrimAl and PySnpSites (https://github.com/bewt85/PySnpSites). PySnpSites is a Cython-based partial reimplementation of the SNP-sites algorithm. A number of simulated datasets were generated to exercise the different parameters and to see their effect on memory usage and running time. All of the software to generate these datasets is contained within the SNP-sites source code repository. All experiments were performed using a single processor (2.1 GHz AMD Opteron 6272) with a maximum of 16 GB of RAM available. The maximum run-time of an application was set as 12 h, after which time the experiment was halted.

Alignments were generated with varying numbers of SNPs to show the effect of SNP density on the performance of each application. Each alignment had 1000 samples and a genome alignment length of 5 Mbp, with a total file size of 4.8 GB. This is a scale encountered in recent studies (Wong). As the SNP density increases, so do the running time and memory usage, as seen in Fig. 1. The running time of both SNP-sites and PySnpSites is reasonable; however, the memory usage of PySnpSites rapidly exceeds the maximum allowed memory (16 GB). Where 20 % of bases in the input alignment are SNPs, SNP-sites uses only 1 GB of RAM, or approximately 20 % of the input file size, scaling with the volume of variation rather than the size of the input file. In all experiments JVarKit exceeded the maximum allowed memory and was halted. All experiments using TrimAl exceeded the maximum running time of 12 h. As neither of these applications completed successfully, they are not present in the results.
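The memory expression given earlier, max(p × s, g × 2), can be evaluated for the 20 % SNP density case as a rough consistency check. The sketch below assumes one byte per stored base, which the text does not state; the function name and the bytes_per_base parameter are hypothetical.

```python
def max_memory_bytes(snp_sites, samples, alignment_length, bytes_per_base=1):
    """Evaluate max(p*s, g*2), scaled by an assumed storage cost per base."""
    return max(snp_sites * samples, alignment_length * 2) * bytes_per_base


samples = 1000                              # s: samples per alignment
alignment_length = 5_000_000                # g: 5 Mbp genome alignment
snp_sites = alignment_length * 20 // 100    # p: 20 % of sites are SNPs

estimate = max_memory_bytes(snp_sites, samples, alignment_length)
print(f"{estimate / 1e9:.2f} GB")  # prints "1.00 GB"
```

Under that assumption the p × s term dominates and gives about 1 GB, in line with the figure reported for this experiment. With no SNPs the g × 2 term would dominate instead, at 10 MB for the same alignment length.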

Fig. 1.

Effect of the number of SNPs on wall time in seconds (a) and memory in MB (b). All JVarKit experiments exceeded the maximum memory and all TrimAl experiments exceeded the maximum run-time, so data are not shown.

The number of samples analysed within a single study now stands in the thousands (Chewapreecha). To cope with this scale and to demonstrate how applications will perform in the future, we generated alignments with 100 to 100 000 samples. Each genome contained 1 Mbp and 1000 SNP sites. The total file sizes ranged from 0.1 GB to 86 GB. As the number of taxa increases, the running time of PySnpSites and SNP-sites increases linearly, with SNP-sites taking 32 min to analyse an 86 GB alignment with 100 000 taxa, as can be seen in Fig. 2(a). The running time of JVarKit is ten times greater than that of PySnpSites and SNP-sites, as shown in Fig. 2(b); however, it exceeds the 16 GB maximum memory limit beyond 1000 taxa. The running time of TrimAl is another order of magnitude greater, making it rapidly infeasible to run. The memory usage of SNP-sites is substantially less than that of all other applications; the closest, PySnpSites, uses 9.2 GB of RAM compared to 0.274 GB of RAM for SNP-sites.

Fig. 2.

Effect of number of samples on wall time in seconds (a) and memory in MB (b).

Finally, the length of each genome in the alignment was varied from 100 000 bp to 100 Mbp, with 1000 taxa and 1000 SNP sites in each alignment. PySnpSites and SNP-sites performed consistently well, as shown in Fig. 3(a), with both taking ≈40 min to process the largest 95 GB alignment file. SNP-sites uses just 203 MB of RAM compared to 691 MB for PySnpSites, as shown in Fig. 3(b). The other two applications exceed the maximum running time and/or the maximum memory whilst trying to analyse 5 Mbp genomes, which is the size of a typical Gram-negative bacterial genome.

Fig. 3.

Effect of genome length on wall time in seconds (a) and memory in MB (b).

The performance of SNP-sites was evaluated on a real data set of Salmonella Typhi (Wong). A total of 1842 taxa were aligned to the 4.8 Mbp chromosome (GenBank accession number AL513382) of S. Typhi CT18. This gave a total alignment file size of 8.3 GB and incorporated SNPs at 22 618 sites. SNP-sites used 59 MB of RAM and took 267 seconds.
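The file I/O bound stated earlier, 2 × f ≤ I/O ≤ 2 × f + o, can be applied to this dataset as a back-of-envelope check. The sketch below makes several assumptions that are not in the text: 1 GB is taken as 10^9 bytes, the output is a SNP-only alignment costing one byte per base with headers ignored, and the input is uncompressed and read in full on both passes. The function name is hypothetical.

```python
def io_bounds(input_bytes, output_bytes):
    """Return (lower, upper) total file I/O for 2*f <= I/O <= 2*f + o."""
    lower = 2 * input_bytes
    return lower, lower + output_bytes


input_bytes = 8_300_000_000        # f: 8.3 GB alignment file
taxa, snp_sites = 1842, 22_618
output_bytes = taxa * snp_sites    # o: assumed one byte per output base
seconds = 267                      # reported running time

lower, upper = io_bounds(input_bytes, output_bytes)
print(f"total I/O between {lower / 1e9:.2f} and {upper / 1e9:.2f} GB")
print(f"implied mean read rate of roughly {lower / seconds / 1e6:.0f} MB/s")
```

On these assumptions the output is negligible next to the two reads of the input, so the total I/O is essentially twice the input size, and the reported running time corresponds to a sustained read rate of a few tens of MB/s.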

Conclusion

Extracting variation from a multiple FASTA alignment is a common task, and whilst it is simple to define, existing tools fail to perform well. We showed that SNP-sites performed consistently under a variety of conditions, used low amounts of RAM and had a low running time even for the largest datasets we simulated to represent the scale of studies expected in the near future. This makes it feasible to run on standard desktop machines. SNP-sites uses standard installation methods, with the software prepackaged and available through the Debian and Homebrew package managers. The software has been successfully tested and run on more than 20 architectures using Debian Linux and on multiple versions of OS X.

14 in total

1. MUSCLE: multiple sequence alignment with high accuracy and high throughput.

Authors: Robert C Edgar
Journal: Nucleic Acids Res Date: 2004-03-19 Impact factor: 16.971

2. PGDSpider: an automated data conversion tool for connecting population genetics and genomics programs.

Authors: H E L Lischer; L Excoffier
Journal: Bioinformatics Date: 2011-11-21 Impact factor: 6.937

3. Multiple sequence alignment using ClustalW and ClustalX.

Authors: Julie D Thompson; Toby J Gibson; Des G Higgins
Journal: Curr Protoc Bioinformatics Date: 2002-08

4. FastTree 2--approximately maximum-likelihood trees for large alignments.

Authors: Morgan N Price; Paramvir S Dehal; Adam P Arkin
Journal: PLoS One Date: 2010-03-10 Impact factor: 3.240

5. MAFFT multiple sequence alignment software version 7: improvements in performance and usability.

Authors: Kazutaka Katoh; Daron M Standley
Journal: Mol Biol Evol Date: 2013-01-16 Impact factor: 16.240

6. The variant call format and VCFtools.

Authors: Petr Danecek; Adam Auton; Goncalo Abecasis; Cornelis A Albers; Eric Banks; Mark A DePristo; Robert E Handsaker; Gerton Lunter; Gabor T Marth; Stephen T Sherry; Gilean McVean; Richard Durbin
Journal: Bioinformatics Date: 2011-06-07 Impact factor: 6.937

7. Second-generation PLINK: rising to the challenge of larger and richer datasets.

Authors: Christopher C Chang; Carson C Chow; Laurent Cam Tellier; Shashaank Vattikuti; Shaun M Purcell; James J Lee
Journal: Gigascience Date: 2015-02-25 Impact factor: 6.524

8. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies.

Authors: Alexandros Stamatakis
Journal: Bioinformatics Date: 2014-01-21 Impact factor: 6.937

9. Dense genomic sampling identifies highways of pneumococcal recombination.

Authors: Paul Turner; Stephen D Bentley; Claire Chewapreecha; Simon R Harris; Nicholas J Croucher; Claudia Turner; Pekka Marttinen; Lu Cheng; Alberto Pessia; David M Aanensen; Alison E Mather; Andrew J Page; Susannah J Salter; David Harris; Francois Nosten; David Goldblatt; Jukka Corander; Julian Parkhill
Journal: Nat Genet Date: 2014-02-09 Impact factor: 38.330

10. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses.

Authors: Salvador Capella-Gutiérrez; José M Silla-Martínez; Toni Gabaldón
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937
