Literature DB >> 30345380

RhierBAPS: An R implementation of the population clustering algorithm hierBAPS.

Gerry Tonkin-Hill¹, John A Lees², Stephen D Bentley¹, Simon D W Frost^3,4, Jukka Corander^1,5,6.

Abstract

Identifying structure in collections of sequence data sets remains a common problem in genomics. hierBAPS, a popular algorithm for identifying population structure in haploid genomes, has previously only been available as a MATLAB binary. We provide an R implementation which is both easier to install and use, automating the entire pipeline. Additionally, we allow for the use of multiple processors, improve on the default settings of the algorithm, and provide an interface with the ggtree library to enable informative illustration of the clustering results. Our aim is that this package aids in the understanding and dissemination of the method, as well as enhancing the reproducibility of population structure analyses.

Entities: Chemical Disease Species

Keywords: R; clustering; population structure

Year: 2018 PMID： 30345380 PMCID： PMC6178908 DOI： 10.12688/wellcomeopenres.14694.1

Source DB: PubMed Journal: Wellcome Open Res ISSN： 2398-502X

Introduction

Identifying sub-populations in collections of genetic sequences is a common problem in population genetics, molecular ecology, epidemiology and microbiology. In general, the aim of genetic clustering algorithms is to identify separate panmictic clusters within a broader, more heterogeneous population. In large sequence data sets, it is helpful to identify smaller subpopulations which can be further analysed for associations with particular phenotypes as well as recombination [1, 2], as long as potential biases introduced through taking clusters from a larger population are taken into account [3]. A frequently used model assumes that each individual sequence is drawn from one of K distinct subpopulations with each cluster having its own set of allele frequencies. The aim is then to identify which cluster each sequence originates from and the corresponding allele frequencies within that cluster. There are a number of methods for solving this problem including STRUCTURE [4, 5], snapclust [6] and BAPS (Bayesian Analysis of Population Structure [7– 10]). The BAPS algorithm [9, 11] is distinct in that it attempts to estimate the partition of individual sequences into clusters directly by analytically integrating over the allele frequencies parameters for each subpopulation. This allows for the latent number of underlying subpopulations, K, to be estimated as part of the model fitting procedure. The hierBAPS algorithm extends this approach by enabling the investigation of a population at multiple resolutions. This is achieved by initially clustering the entire dataset using the BAPS algorithm before iteratively applying the algorithm to each of the resulting clusters. Similar to other approaches [4], BAPS assumes that alleles are drawn independently from a multinomial distribution with a Dirichlet prior. However, unlike STRUCTURE, which uses Gibbs sampling to estimate the posterior distribution, BAPS attempts to find the partition of the data S that maximises the posterior probability of an allocation over all other possible allocations. A partition S is defined as the allocation of each sequence to one of K possible clusters. The maximum possible value of K is given in the hierBAPS algorithm. Here indicates the set of all possible partitions with up to K max clusters. The hierBAPS algorithm attempts to choose S to maximise where P(data |S) is the marginal likelihood of having the allele frequency parameters analytically integrated out leading to where n is the count for allele l at locus j in cluster i and α is the corresponding hyper-parameter for the Dirichlet prior. The BAPS algorithm attempts to find the partition S that maximizes the posterior probability using a greedy stochastic search approach. A discretised uniform distribution of the cluster size K ( K = 1,…, K max) is used in hierBAPS to provide the prior probability of each partition P( S). The Dirichlet hyperparameters are set at where N A( is the number of of distinct alleles at locus j. Currently, hierBAPS is only available as a MATLAB binary, which can be both difficult to install and use as separate runtime libraries are generally needed for different OS versions for MacOS X, Windows and Linux systems. The documentation is also lacking, making it difficult for less computationally experienced researchers to use. There is currently no clear guide on how to use the output of the MATLAB binary to produce informative plots for interpretation. Whilst there are other algorithms available to cluster genetic data in R, such as snapclust [6] and DAPC [12], neither make use of the partition approach used in BAPS. By providing an R implementation of hierBAPS, we aim to increase its usability and the reproducibility of analyses using the software.

Methods

Implementation

RhierBAPS is implemented in the R language [13]. The core program relies upon the R packages ape [14], dplyr, gmp, purrr and ggplot2. Additional plotting functionality makes use of ggtree [15] and phytools [16]. The structure of the code is very similar to the original MATLAB code and has similar runtimes. The development version of the package can be installed using devtools. Unlike the MATLAB version, rhierBAPS by default only considers SNP loci that have a minor allele in at least two sequences. This has been found to improve the results of the analysis as although singleton SNPs are important when constructing phylogenies they introduce noise into the model used in hierBAPS leading to poorer quality clusterings. It is currently recommended that singletons SNPs are removed before running the MATLAB version of the software.

Operation

RhierBAPS can be installed on any computer where R versions 3.5 and above can be installed. The package can be run using just a few lines of R code where the variable "fasta.file.name" should be replaced with the location of the FASTA formated multiple sequence alignment of the sequences of interest.

Use cases

RhierBAPS requires a multiple sequence alignment in FASTA format. In all examples we make use of sequences from the Bacillus cereus Multi Locus Sequence Typing website ( https://pubmlst.org/bcereus/) [17]. The sequences used are included as part of the R package. The algorithm requires an initial number of clusters to be set which should be higher than the maximum number of expected clusters in the dataset. If a dataset is likely to contain many distinct lineages, for example, if there are many samples from many locations, then a higher initial number of clusters should be set. Conversely, if the samples are from only a small number of sites and little variation is expected then a smaller initial cluster size can be set. To get an idea of a good initial cluster size, agglomerative clustering with complete linkage using pairwise SNP distances can be performed initially. The number of levels over which clustering should be performed is also required as input to the algorithm. In the preceding example, we ran rhierBAPS with 20 initial clusters at two clustering levels. Additional parameters that can be set include the number of cores to use and whether the program should generate progress information. The hierBAPS function generates a data frame indicating the assignment of sequences to clusters at each level. This, along with the marginal log likelihoods can be saved to file. Finally, as the program is written in R we are able to take advantage of the excellent plotting capabilities available. Given a phylogenetic tree generated using IQTREE [18] with model selection [19] using the command iqtree −s, we can annotate it with the BAPS clusters using ggtree [15] ( Figure 1).

Figure 1.

Phylogenetic tree built using Iqtree and annotated with the top level clusters identified using rhierBAPS.

Additionally, the plot_sub_cluster function allows for the user to focus on one higher level cluster and investigate the sub cluster present within it. Here we investigate cluster 9 (highlighted in red), at the top most level ( Figure 2).

Figure 2.

Phylogenetic tree focusing on the 9th cluster at the top level identified using rhierBAPS and plotted using the plot_sub_cluster function. The subsequent clustering at the 2nd level is indicated in the sub-tree to the right.

Summary

Clustering is an essential component of many genetic analysis pipelines. We have presented rhierBAPS, an R package that implements the hierBAPS algorithm for clustering genetic sequence data. It is both easy to install and use, whilst providing additional plotting capabilities and the ability to run using multiple cores. We believe it will aid in the reproducibility of population structure analysis.

Software availability

The package is available on CRAN: https://cran.r-project.org/web/packages/rhierbaps/index.html Source code available from: https://github.com/gtonkinhill/rhierbaps Archived source code as at time of publication: http://doi.org/10.5281/zenodo.1318958 [20] License: MIT The authors report an implementation of hierBAPS for R, RhierBAPS, for determining the optimal number of clusters in population DNA sequence data. The program does not extend the methods in hierBAPS, but I appreciate that R is the default language in many bioinformatics pipelines. I have a few suggestions: - The program is easy to use, but I think that it is worth including a few data sets that can are readily available when the package is loaded into R. This would make the examples easier to run and understand. - I think that it would be valuable to see some results of choosing different minimum values of K to illustrate the sensitivity of the clustering to this parameter. I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard. The identification of groups (or clusters) with genetic or genomic data is one the basic questions asked by population geneticists. Several methods exist with various software implementations. The software described in this note is a valuable addition to the set of R packages for population genetics and related fields. I greatly appreciate to have this method implemented in R. The analytical integration is a really great feature and that makes this method attractive. I tried the package and the examples run very smoothly as expected. I also tried the main function with a small DNA alignment (the 'woodmouse' data in ape) and the results made sense. The graphical tools provided with the package are helpful, and the results output as a list in R make easy to use a custom graphical display. For instance, I was able to make my own display with functions in ape in just four lines of code. I have a few comments or suggestions that the Authors may find useful for future versions of their article and/or package. At present, it seems that the package handles SNP data. Does this mean that only biallelic DNA loci can be analysed? Can other types of biallelic genetic data be handled? What if more than two alleles are observed at a DNA site? One suggestion for future developments would be to better integrate with other data classes, particularly from ape and adegenet which are more and more widely used. Also, ape has now efficient links with the data classes used in BioConductor, which makes possible to integrate a wide range of approaches (bioinformatics, genomics, phylogenetics, population genetics) within R. It seems that the present package does not calculate the individual relative probabilities of assignment to the different clusters as done in Corander et al. [1]. This might be a valuable addition for future versions, and it would help to compare the results from different methods, for instance using the nice compoplot function in adegenet. I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

16 in total

1. Inference of population structure using multilocus genotype data.

Authors: J K Pritchard; M Stephens; P Donnelly
Journal: Genetics Date: 2000-06 Impact factor: 4.562

2. A model-based method for identifying species hybrids using multilocus genetic data.

Authors: E C Anderson; E A Thompson
Journal: Genetics Date: 2002-03 Impact factor: 4.562

3. Bayesian analysis of genetic differentiation between populations.

Authors: Jukka Corander; Patrik Waldmann; Mikko J Sillanpää
Journal: Genetics Date: 2003-01 Impact factor: 4.562

4. Discriminant analysis of principal components: a new method for the analysis of genetically structured populations.

Authors: Thibaut Jombart; Sébastien Devillard; François Balloux
Journal: BMC Genet Date: 2010-10-15 Impact factor: 2.797

5. BIGSdb: Scalable analysis of bacterial genome variation at the population level.

Authors: Keith A Jolley; Martin C J Maiden
Journal: BMC Bioinformatics Date: 2010-12-10 Impact factor: 3.169

6. Detection of recombination events in bacterial genomes from large population samples.

Authors: Pekka Marttinen; William P Hanage; Nicholas J Croucher; Thomas R Connor; Simon R Harris; Stephen D Bentley; Jukka Corander
Journal: Nucleic Acids Res Date: 2011-11-07 Impact factor: 16.971

7. ModelFinder: fast model selection for accurate phylogenetic estimates.

Authors: Subha Kalyaanamoorthy; Bui Quang Minh; Thomas K F Wong; Arndt von Haeseler; Lars S Jermiin
Journal: Nat Methods Date: 2017-05-08 Impact factor: 28.547

8. Enhanced Bayesian modelling in BAPS software for learning genetic structures of populations.

Authors: Jukka Corander; Pekka Marttinen; Jukka Sirén; Jing Tang
Journal: BMC Bioinformatics Date: 2008-12-16 Impact factor: 3.169

9. Biased phylodynamic inferences from analysing clusters of viral sequences.

Authors: Bethany L Dearlove; Fei Xiang; Simon D W Frost
Journal: Virus Evol Date: 2017-08-03

10. A fast likelihood solution to the genetic clustering problem.

Authors: Marie-Pauline Beugin; Thibault Gayet; Dominique Pontier; Sébastien Devillard; Thibaut Jombart
Journal: Methods Ecol Evol Date: 2018-01-30 Impact factor: 7.781

57 in total

1. Genome-wide discovery of epistatic loci affecting antibiotic resistance in Neisseria gonorrhoeae using evolutionary couplings.

Authors: Benjamin Schubert; Rohan Maddamsetti; Jackson Nyman; Maha R Farhat; Debora S Marks
Journal: Nat Microbiol Date: 2018-12-03 Impact factor: 17.745

2. Population Genomics Reveals Distinct Temporal Association with the Emergence of ST1 Serotype V Group B Streptococcus and Macrolide Resistance in North America.

Authors: M Belén Cubria; Luis Alberto Vega; William C Shropshire; Misu A Sanson; Brittany J Shah; Shrijana Regmi; Marcia Rench; Carol J Baker; Anthony R Flores
Journal: Antimicrob Agents Chemother Date: 2021-10-11 Impact factor: 5.938

3. Fast hierarchical Bayesian analysis of population structure.

Authors: Gerry Tonkin-Hill; John A Lees; Stephen D Bentley; Simon D W Frost; Jukka Corander
Journal: Nucleic Acids Res Date: 2019-06-20 Impact factor: 16.971

4. Genome-wide insights into population structure and host specificity of Campylobacter jejuni.

Authors: Lennard Epping; Birgit Walther; Rosario M Piro; Marie-Theres Knüver; Charlotte Huber; Andrea Thürmer; Antje Flieger; Angelika Fruth; Nicol Janecko; Lothar H Wieler; Kerstin Stingl; Torsten Semmler
Journal: Sci Rep Date: 2021-05-14 Impact factor: 4.379

5. Emergence of a Neisseria gonorrhoeae clone with reduced cephalosporin susceptibility between 2014 and 2019 in Amsterdam, The Netherlands, revealed by genomic population analysis.

Authors: Jolinda de Korne-Elenbaas; Sylvia M Bruisten; Henry J C de Vries; Alje P Van Dam
Journal: J Antimicrob Chemother Date: 2021-06-18 Impact factor: 5.790

6. Whole-genome analysis uncovers loss of blaZ associated with carriage isolates belonging to methicillin-resistant Staphylococcus aureus (MRSA) clone ST5-VI in Cape Verde.

Authors: Magdalena Wysocka; Tamar Monteiro; Carine de Pina; Deisy Gonçalves; Sandrine de Pina; Antonio Ludgero-Correia; Joao Moreno; Roxana Zamudio; Nada Almebairik; Laura J Gray; Manish Pareek; David R Jenkins; Marta Aires-de-Sousa; Herminia De Lencastre; Sandra Beleza; Isabel I Araujo; Teresa Conceição; Marco R Oggioni
Journal: J Glob Antimicrob Resist Date: 2021-05-27 Impact factor: 4.035

7. Genomic characterization of the non-O1/non-O139 Vibrio cholerae strain that caused a gastroenteritis outbreak in Santiago, Chile, 2018.

Authors: Mónica Arteaga; Juliana Velasco; Shelly Rodriguez; Maricel Vidal; Carolina Arellano; Francisco Silva; Leandro J Carreño; Roberto Vidal; David A Montero
Journal: Microb Genom Date: 2020-03

8. Long-Term Intrahost Evolution of Methicillin Resistant Staphylococcus aureus Among Cystic Fibrosis Patients With Respiratory Carriage.

Authors: Taj Azarian; Jessica P Ridgway; Zachary Yin; Michael Z David
Journal: Front Genet Date: 2019-06-12 Impact factor: 4.772

9. The Emergence and Molecular Characteristics of New Delhi Metallo β-Lactamase-Producing Escherichia coli From Ducks in Guangdong, China.

Authors: Min-Ge Wang; Yang Yu; Dong Wang; Run-Shi Yang; Ling Jia; Da-Tong Cai; Si-Lin Zheng; Liang-Xing Fang; Jian Sun; Ya-Hong Liu; Xiao-Ping Liao
Journal: Front Microbiol Date: 2021-07-05 Impact factor: 5.640

10. Genomic epidemiology of methicillin-resistant and -susceptible Staphylococcus aureus from bloodstream infections.

Authors: Joshua T Smith; Elissa M Eckhardt; Nicole B Hansel; Tahmineh Rahmani Eliato; Isabella W Martin; Cheryl P Andam
Journal: BMC Infect Dis Date: 2021-06-21 Impact factor: 3.090