
SNPhylo: a pipeline to construct a phylogenetic tree from huge SNP data.

Tae-Ho Lee, Hui Guo, Xiyin Wang, Changsoo Kim, Andrew H Paterson.

Abstract

BACKGROUND: Phylogenetic trees are widely used for genetic and evolutionary studies in various organisms. Advanced sequencing technology has dramatically enriched data available for constructing phylogenetic trees based on single nucleotide polymorphisms (SNPs). However, massive SNP data makes it difficult to perform reliable analysis, and there has been no ready-to-use pipeline to generate phylogenetic trees from these data.
RESULTS: We developed a new pipeline, SNPhylo, to construct phylogenetic trees based on large SNP datasets. The pipeline may enable users to construct a phylogenetic tree from three representative SNP data file formats. In addition, in order to increase reliability of a tree, the pipeline has steps such as removing low quality data and considering linkage disequilibrium. A maximum likelihood method for the inference of phylogeny is also adopted in generation of a tree in our pipeline.
CONCLUSIONS: Using SNPhylo, users can easily produce a reliable phylogenetic tree from a large SNP data file. Thus, this pipeline can help a researcher focus more on interpretation of the results of analysis of voluminous data sets, rather than manipulations necessary to accomplish the analysis.

Year: 2014 PMID: 24571581 PMCID: PMC3945939 DOI: 10.1186/1471-2164-15-162

Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969

Background

Since the Arabidopsis genome was completed [1], advanced sequencing technology has facilitated whole-genome sequencing of many plants of commercial or experimental importance [2-4]. Reference genome sequences and high-throughput data analysis also provide the basis for resequencing whole genomes or transcriptomes to answer questions about variation between cultivars, populations, and taxa. In a variation study, the distribution of single nucleotide polymorphisms (SNPs) and/or short insertions and deletions (indels) is the prime concern. A variety of studies have begun to utilize and illustrate how to deal with extensive SNP data [5-10]. In particular, phylogenetic trees have been used in many evolutionary studies to depict evolutionary relationships between or within organisms, and to study the evolution and functional innovation of genes [6,7]. However, there has been no easy-to-use pipeline to determine phylogenetic trees from the huge number of variants obtained from sequencing projects. One typical method to determine trees has been: 1) calculating the p-distance between each pair of samples from all SNP data, 2) building the p-distance matrix for all samples, 3) constructing a neighbor-joining tree from the matrix with a program such as ‘neighbor’ in the PHYLIP package [11], and 4) drawing the phylogenetic tree image with a program such as MEGA4 [12]. However, at least three points could be methodologically improved: 1) LD (linkage disequilibrium) blocks, which can bias the set of variants, are not considered, 2) statistical tests need to be improved to evaluate the level of confidence, and 3) users are required to manipulate large data sets step by step to obtain a phylogenetic tree. The snpTree server [13] provided solutions for the second and third points. However, the target of this web server was bacterial genomes, which are much smaller than eukaryotic genomes and seldom if ever have LD blocks.
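Steps 1 and 2 of the conventional distance-based method described above can be illustrated with a minimal Python sketch. It is not code from any of the cited tools: the function names, the toy sequences, and the choice to skip sites where either sample lacks a call (coded here as N, - or ?) are assumptions made for illustration.

```python
from itertools import combinations

# Assumed codes for a missing base call; real data files may use others.
MISSING = {"N", "-", "?"}


def p_distance(seq_a, seq_b):
    """Proportion of compared SNP sites at which two samples differ."""
    compared = 0
    different = 0
    for a, b in zip(seq_a, seq_b):
        if a in MISSING or b in MISSING:
            continue  # skip sites where at least one sample has no call
        compared += 1
        if a != b:
            different += 1
    if compared == 0:
        raise ValueError("no sites with calls in both samples")
    return different / compared


def p_distance_matrix(samples):
    """Symmetric p-distance matrix as a dict of dicts keyed by sample name."""
    names = list(samples)
    matrix = {name: {name: 0.0} for name in names}
    for a, b in combinations(names, 2):
        d = p_distance(samples[a], samples[b])
        matrix[a][b] = d
        matrix[b][a] = d
    return names, matrix


# Hypothetical toy data: one concatenated SNP string per sample.
toy = {"S1": "ACGTAC", "S2": "ACGTTC", "S3": "TCGNTA"}
names, dist = p_distance_matrix(toy)
for name in names:
    print(name, [round(dist[name][other], 3) for other in names])
```

A matrix of this kind is what a neighbor-joining program would take as input in step 3. Note that every SNP contributes equally here, which is why SNPs clustered in LD blocks can bias the distances.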
We developed a pipeline, SNPhylo (Additional file 1), permitting users to construct a phylogenetic tree from a file containing SNP data in VCF (Variant Call Format), HapMap format or GDS (Genomic Data Structure) format [14]. Here we introduce the pipeline with three examples that show the applicability of the pipeline.

Implementation

The procedure to determine a phylogenetic tree in the pipeline is: 1) testing each SNP position and removing those positions which do not have sufficient numbers of qualified SNPs for all samples, 2) generating new GDS format files from the tested SNP data files, 3) reading the GDS file and extracting SNPs that meet the MAF (minor allele frequency) threshold (≥) and the missing rate threshold (≤), and that are in approximate linkage equilibrium with each other as determined by the SNPRelate package [14], 4) concatenating the extracted SNPs for each sample and generating a sequence file containing the sequences, 5) performing multiple alignment of the sequences with the MUSCLE alignment program [15], and 6) determining a phylogenetic tree by the maximum likelihood method by running the DNAML program in the PHYLIP package [11]. In addition, bootstrap analysis of the tree is performed with the ‘phangorn’ package [16] (Figure 1). Using a GDS file as the SNP data file skips the first and second steps.
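The MAF filter in step 3 and the concatenation in step 4 can be sketched in a few lines of Python. This is an illustration only, not the SNPhylo implementation (which relies on SNPRelate and its own scripts): it assumes one base call per sample per locus, N as the missing-call code, and an arbitrary example threshold, and all names are hypothetical.

```python
from collections import Counter

MISSING = "N"  # assumed code for a missing call


def minor_allele_frequency(calls):
    """MAF among the non-missing calls at one locus (one base per sample)."""
    counts = Counter(c for c in calls if c != MISSING)
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0  # entirely missing or monomorphic locus
    # Simplification: treat the second most common allele as the minor allele.
    return counts.most_common(2)[1][1] / total


def concatenate_snps(sample_names, loci, min_maf):
    """Keep loci with MAF >= min_maf and join them into one string per sample."""
    kept = [calls for calls in loci if minor_allele_frequency(calls) >= min_maf]
    return {
        name: "".join(calls[i] for calls in kept)
        for i, name in enumerate(sample_names)
    }


# Hypothetical toy data: each string holds the calls of samples A-D at one locus.
sample_names = ["A", "B", "C", "D"]
loci = ["AAAA", "ACAC", "GGGT", "TNTT"]
sequences = concatenate_snps(sample_names, loci, min_maf=0.1)
print(sequences)
```

The resulting per-sample strings correspond to the sequence file that is passed on to alignment and tree inference in steps 5 and 6.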

Figure 1

Flowchart of the SNPhylo pipeline. The blue boxes represent processes in SNPhylo, while the green boxes indicate input and output files. The orange arrows show the flow of data. With the SNPhylo pipeline, users can obtain a PNG format tree image file as well as a Newick format tree file determined from any of four SNP data file formats: VCF, HapMap, Simple and GDS.

All the steps are automated by one Bash shell script, snphylo.sh, though the pipeline includes additional components implemented in Python and R. Thus, with the script, users can obtain from an SNP data file a phylogenetic tree file and other informative files, such as a multiple alignment result file in PHYLIP format, which can be used for additional analysis such as a parallel bootstrap analysis by PhyML [17]. The pipeline also generates a tree image in PNG format with R packages [16,18] so that the user can easily interpret the results of the analysis. In addition, the tree file in Newick format is provided so that users can produce a more informative tree image with other programs such as MEGA4 [12] and the Newick utilities [19], depending on their needs.

Results & discussion

Phylogenetic tree with soybean SNP data

As a demonstration of the use of SNPhylo, we determined a tree (Figure 2A) from published SNP data comprising 6,289,747 SNP loci determined by resequencing of 31 soybean wild types and cultivars [7]. The tree was determined with default options within 4 minutes using a GDS format file on a Linux desktop computer with 4 GB of memory and a 2.66 GHz dual-core CPU. In comparison, determination of the tree took about 50 minutes with a ~880 MB HapMap format file because of the additional steps described in the procedure above: testing each SNP position and removing those positions which do not have sufficient numbers of qualified SNPs for all 31 samples.

Figure 2

Phylogenetic trees and Bayesian clustering result constructed with soybean SNP data. (A) The tree constructed by the SNPhylo pipeline with SNP data from 31 soybean wild types and cultivars. The cluster which is more consistent with the Bayesian clustering result of the original report is circled in red. The ‘W’ and ‘C’ prefixes in ID numbers represent wild types and cultivars, respectively. The bootstrap values determined with 1,000 samples are shown in red. (B) Part of the tree from the original soybean SNP analysis report [7]. The IDs which are not consistent with the Bayesian clustering result are circled in red. (C) The Bayesian clustering result of the original paper [7].

Most branches in the tree correspond to those inferred in the original report [7], even though our tree was determined easily by our pipeline in a relatively short time. Interestingly, in one case, our tree was more consistent with the Bayesian clustering result of the original report (Figure 2C) than with the tree of the original report (Figure 2B). Specifically, in the original report, the three wild soybeans (W03, W13, and W14) were clustered together in Bayesian clustering (red box in Figure 2C), while phylogenetic analysis separated W03 from the others (two red ellipses in Figure 2B). The tree determined by SNPhylo places the three wild soybeans in the same cluster (red ellipse in Figure 2A), consistent with the Bayesian clustering result (red box in Figure 2C). In addition, we constructed a phylogenetic tree by the neighbor-joining method used in the original report, using only the SNP data filtered by LD information, and obtained the same grouping of the three wild soybeans as in the tree constructed by SNPhylo (data not shown). Thus, the consistency with the Bayesian clustering result of both our tree and a phylogenetic tree based on LD-filtered data may indicate that using LD information improves interpretation of phylogenetic relationships from genomic data.

Rapid construction of a tree with rice SNP data

As another case study, we constructed a phylogenetic tree from rice SNP data comprising 162,479 SNP loci determined by resequencing microarrays with 20 samples [10] (Figure 3A). Because of the relatively low quality and small number of SNPs, the tree was constructed with a loose parameter (-p 25) such that SNP loci were allowed to remain in the analysis even if as many as 25% of samples lacked data, versus the default of 5%. On the Linux system used to construct the soybean tree, the construction of the rice tree took less than 1 minute.
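The effect of loosening this threshold from 5% to 25% can be seen in a small Python sketch. It assumes that a locus is kept when the percentage of samples lacking data is less than or equal to the threshold and that missing calls are coded as N. Neither detail is stated in the text, and the function names and the example locus are hypothetical.

```python
def percent_missing(calls, missing="N"):
    """Percentage of samples without a call at one locus."""
    return 100.0 * sum(1 for c in calls if c == missing) / len(calls)


def keep_locus(calls, max_missing_pct):
    """Assumed rule: keep the locus if at most max_missing_pct % of samples lack data."""
    return percent_missing(calls) <= max_missing_pct


# Hypothetical locus with 20 samples, 4 of which lack data (20%).
locus = "ACNNA" + "ACCAA" + "NNCAA" + "ACCAA"
print(percent_missing(locus))
print(keep_locus(locus, 25), keep_locus(locus, 5))
```

With 20 samples, a 25% threshold tolerates up to five samples without data at a locus, whereas a 5% threshold tolerates only one, so the example locus survives only under the looser setting.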

Figure 3

Rice phylogenetic trees showing three rice groups. (A) A rice SNP tree constructed by SNPhylo. The three clusters in the tree reflect the three rice groups. The ellipses in red, blue and green represent the japonica, indica and aus groups, respectively. The bootstrap values determined with 1,000 samples are shown in red. (B) The tree in the original report for the rice SNPs [10]. The clusters in red, blue and green likewise represent the japonica, indica and aus groups, respectively.

The tree constructed by SNPhylo had three evident clusters representing the three rice groups, japonica, indica and aus, and the result was consistent with the tree of the original report [10]. Interestingly, the previous tree (Figure 3B) and the SNPhylo tree showed different branch lengths between the three rice group clusters. Specifically, the branch between japonica and the other two clusters was much longer in the previous tree, whereas the branch lengths were more similar to one another in the SNPhylo tree. The relatively long edge in the previous tree may be caused by the higher LD level of the japonica group compared with the other rice groups [10]. SNP bias due to high levels of LD in japonica might lead to overestimation of distances between clusters. The inclusion of a step to decrease this bias may permit SNPhylo to construct a more accurate tree.

Construction of a phylogenetic tree with Arabidopsis SNP data

Arabidopsis has been used as a model plant since its whole genome was sequenced [1] because of its small genome size, small physical size amenable to laboratory experiments, and short life cycle. Since the first genome sequence was released, much Arabidopsis genome data has been released by various resequencing projects. Thus, as an additional case study, we constructed a phylogenetic tree with SNP data (Figure 4) determined by an Arabidopsis genome project (http://mus.well.ox.ac.uk/19genomes/). Because of the relatively high LD level [20], the phylogenetic tree was constructed with a higher LD threshold (-l 0.4) than the default value.
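How an LD threshold controls which SNPs remain can be illustrated with a greedy pruning sketch in Python. This is not the SNPRelate algorithm that SNPhylo calls. It assumes genotypes coded as allele counts (0, 1 or 2), uses the squared Pearson correlation between two loci as the LD statistic, and compares every candidate against all previously kept loci rather than within a sliding window. The LD measure actually applied by the pipeline may differ.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between allele counts at two loci."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic locus carries no LD information
    return (sxy * sxy) / (sxx * syy)


def ld_prune(genotypes, threshold):
    """Greedily keep loci whose r^2 with every kept locus is <= threshold."""
    kept = []
    for idx, locus in enumerate(genotypes):
        if all(r_squared(locus, genotypes[k]) <= threshold for k in kept):
            kept.append(idx)
    return kept


# Hypothetical allele counts for six samples at three loci.
genotypes = [
    [0, 0, 1, 2, 2, 1],  # locus 0
    [0, 0, 1, 2, 2, 1],  # locus 1: identical to locus 0 (r^2 = 1)
    [2, 0, 1, 0, 2, 1],  # locus 2: uncorrelated with locus 0 (r^2 = 0)
]
print(ld_prune(genotypes, threshold=0.4))
```

Raising the threshold lets more correlated SNPs through, which retains more markers when LD is high across the genome, as in the Arabidopsis data.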

Figure 4

Phylogenetic tree of accessions and the geographical relations of the accessions. (A) The Arabidopsis phylogenetic tree constructed by SNPhylo with SNP data from 19 genomes. Three major clusters in the tree are circled in three colors: red, blue, and orange. The bootstrap values determined with 1,000 samples are shown in red. (B) Map of Europe showing the geographical relations of the accessions in each cluster of the Arabidopsis tree. The origins of accessions circled in red, blue, and orange are located in the geographical west, middle, and east of the continent, respectively.

There are three major clusters in the phylogenetic tree (Figure 4A). The accessions in each cluster show high consistency regarding geographic origins (Figure 4B). The origins of accessions circled in red, blue, and orange are located in the geographical west, middle, and east of Europe, respectively. For example, the origins of Edi-0 and Bur-0, which are in the same cluster, are Scotland and Ireland, respectively. In addition, the relationship between geographical location and cluster in the phylogenetic tree is consistent with the East–West gradient in clustering results of 96 Arabidopsis genotypes, which is likely caused by post-glaciation colonization routes [21].

Dependence of SNPhylo run time on amount of SNP data

The run time of the pipeline to generate a tree from the Arabidopsis SNP data, comprising 2,595,179 SNP loci for 20 samples, was 1,850 seconds. This means that the pipeline can process about 1,402 SNP loci per second. However, it is not clear whether the number of SNP genotypes or the number of organism samples primarily determines the duration of the run. To address this question, we determined run times of the pipeline with various data sets generated from the Arabidopsis SNP data set in HapMap format (Figure 5). In the figure, each line shows the linear change of run time with the number of SNP genotypes for a specific number of samples. For example, the red line represents the nearly linear change of run time with the number of SNP loci for data sets of 5 samples. The average run times for data sets having different numbers of SNP loci for 5, 10, 15 and 20 samples are 857.4, 885.4, 943.1 and 1006.1 seconds, respectively. On the other hand, the average run times for data sets of 50,000, 100,000, 150,000, 200,000 and 250,000 SNP loci are 288.0, 604.7, 913.3, 1236.6 and 1571.8 seconds, respectively. The trends for GDS format data (data not shown) were similar to those for HapMap format data, although the run times were shorter. Therefore, the result shows that the run time of the pipeline is mostly affected by the number of SNP genotypes, rather than the number of organism samples.
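The arithmetic behind these figures can be reproduced with a short Python sketch that uses only the numbers reported above. The least-squares fit is an illustration added here, not an analysis from the paper, and the function name is hypothetical.

```python
def least_squares_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


# Throughput for the full data set: loci divided by seconds.
throughput = 2_595_179 / 1_850

# Reported average run times (seconds) by number of SNP loci (HapMap input).
loci = [50_000, 100_000, 150_000, 200_000, 250_000]
seconds_by_loci = [288.0, 604.7, 913.3, 1236.6, 1571.8]

# Reported average run times (seconds) by number of samples.
samples = [5, 10, 15, 20]
seconds_by_samples = [857.4, 885.4, 943.1, 1006.1]

slope, intercept = least_squares_line(loci, seconds_by_loci)
loci_ratio = seconds_by_loci[-1] / seconds_by_loci[0]
sample_ratio = seconds_by_samples[-1] / seconds_by_samples[0]

print(f"throughput: {throughput:.1f} loci/s")
print(f"fit: {slope:.6f} s per locus, intercept {intercept:.2f} s")
print(f"5x loci -> {loci_ratio:.2f}x time; 4x samples -> {sample_ratio:.2f}x time")
```

A fivefold increase in SNP loci multiplies the average run time by about 5.5, whereas a fourfold increase in samples multiplies it by less than 1.2, which is the contrast the paragraph above draws.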

Figure 5

Linear change of run time of SNPhylo depending on the number of SNP loci. The red, blue, green and orange lines represent changes of analysis time of HapMap files for 5, 10, 15 and 19 Arabidopsis samples, respectively, depending on the number of SNP loci. As the figure shows, the analysis time of SNP data is mostly affected by the number of SNP loci rather than the number of samples.

Conclusions

Using SNPhylo, users can easily produce a phylogenetic tree from large SNP data derived from various detection technologies such as genome wide resequencing [7] and resequencing microarrays [10]. Consequently, this pipeline can help a researcher focus more on interpretation of a reliable tree generated by maximum likelihood analysis of voluminous data sets, rather than manipulations necessary to accomplish the analysis.

Availability and requirements

Project name: SNPhylo
Project home page: http://chibba.pgml.uga.edu/snphylo/
Operating system(s): Linux, UNIX and OS X
Programming language: Python, R and BASH
Other requirements: MUSCLE [15] and DNAML [11]
License: GNU GPLv2
Any restrictions to use by non-academics: None

Abbreviations

LD: Linkage disequilibrium; VCF: Variant call format; GDS: Genomic data structure; MAF: Minor allele frequency.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

THL developed the pipeline and wrote the manuscript. HG, XW and CK provided advice and revised the manuscript. AHP provided substantial advice and guidance during all phases of the project. All authors read and approved the final manuscript.

Additional file 1

SNPhylo version 20140116. Description: This compressed file contains the SNPhylo source code and additional files such as a setup script and installation instructions. The latest version is available at the SNPhylo homepage (http://chibba.pgml.uga.edu/snphylo/).

20 in total

1. Resequencing of 31 wild and cultivated soybean genomes identifies patterns of genetic diversity and selection.

Authors: Hon-Ming Lam; Xun Xu; Xin Liu; Wenbin Chen; Guohua Yang; Fuk-Ling Wong; Man-Wah Li; Weiming He; Nan Qin; Bo Wang; Jun Li; Min Jian; Jian Wang; Guihua Shao; Jun Wang; Samuel Sai-Ming Sun; Gengyun Zhang
Journal: Nat Genet Date: 2010-11-14 Impact factor: 38.330

2. A high-performance computing toolset for relatedness and principal component analysis of SNP data.

Authors: Xiuwen Zheng; David Levine; Jess Shen; Stephanie M Gogarten; Cathy Laurie; Bruce S Weir
Journal: Bioinformatics Date: 2012-10-11 Impact factor: 6.937

3. Resequencing 50 accessions of cultivated and wild rice yields markers for identifying agronomically important genes.

Authors: Xun Xu; Xin Liu; Song Ge; Jeffrey D Jensen; Fengyi Hu; Xin Li; Yang Dong; Ryan N Gutenkunst; Lin Fang; Lei Huang; Jingxiang Li; Weiming He; Guojie Zhang; Xiaoming Zheng; Fumin Zhang; Yingrui Li; Chang Yu; Karsten Kristiansen; Xiuqing Zhang; Jian Wang; Mark Wright; Susan McCouch; Rasmus Nielsen; Jun Wang; Wen Wang
Journal: Nat Biotechnol Date: 2011-12-11 Impact factor: 54.908

4. Whole-genome sequencing of multiple Arabidopsis thaliana populations.

Authors: Jun Cao; Korbinian Schneeberger; Stephan Ossowski; Torsten Günther; Sebastian Bender; Joffrey Fitz; Daniel Koenig; Christa Lanz; Oliver Stegle; Christoph Lippert; Xi Wang; Felix Ott; Jonas Müller; Carlos Alonso-Blanco; Karsten Borgwardt; Karl J Schmid; Detlef Weigel
Journal: Nat Genet Date: 2011-08-28 Impact factor: 38.330

5. Repeated polyploidization of Gossypium genomes and the evolution of spinnable cotton fibres.

Authors: Andrew H Paterson; Jonathan F Wendel; Heidrun Gundlach; Hui Guo; Jerry Jenkins; Dianchuan Jin; Danny Llewellyn; Kurtis C Showmaker; Shengqiang Shu; Joshua Udall; Mi-jeong Yoo; Robert Byers; Wei Chen; Adi Doron-Faigenboim; Mary V Duke; Lei Gong; Jane Grimwood; Corrinne Grover; Kara Grupp; Guanjing Hu; Tae-ho Lee; Jingping Li; Lifeng Lin; Tao Liu; Barry S Marler; Justin T Page; Alison W Roberts; Elisson Romanel; William S Sanders; Emmanuel Szadkowski; Xu Tan; Haibao Tang; Chunming Xu; Jinpeng Wang; Zining Wang; Dong Zhang; Lan Zhang; Hamid Ashrafi; Frank Bedon; John E Bowers; Curt L Brubaker; Peng W Chee; Sayan Das; Alan R Gingle; Candace H Haigler; David Harker; Lucia V Hoffmann; Ran Hovav; Donald C Jones; Cornelia Lemke; Shahid Mansoor; Mehboob ur Rahman; Lisa N Rainville; Aditi Rambani; Umesh K Reddy; Jun-kang Rong; Yehoshua Saranga; Brian E Scheffler; Jodi A Scheffler; David M Stelly; Barbara A Triplett; Allen Van Deynze; Maite F S Vaslin; Vijay N Waghmare; Sally A Walford; Robert J Wright; Essam A Zaki; Tianzhen Zhang; Elizabeth S Dennis; Klaus F X Mayer; Daniel G Peterson; Daniel S Rokhsar; Xiyin Wang; Jeremy Schmutz
Journal: Nature Date: 2012-12-20 Impact factor: 49.962

6. The Newick utilities: high-throughput phylogenetic tree processing in the UNIX shell.

Authors: Thomas Junier; Evgeny M Zdobnov
Journal: Bioinformatics Date: 2010-05-13 Impact factor: 6.937

7. The draft genome of watermelon (Citrullus lanatus) and resequencing of 20 diverse accessions.

Authors: Shaogui Guo; Jianguo Zhang; Honghe Sun; Jerome Salse; William J Lucas; Haiying Zhang; Yi Zheng; Linyong Mao; Yi Ren; Zhiwen Wang; Jiumeng Min; Xiaosen Guo; Florent Murat; Byung-Kook Ham; Zhaoliang Zhang; Shan Gao; Mingyun Huang; Yimin Xu; Silin Zhong; Aureliano Bombarely; Lukas A Mueller; Hong Zhao; Hongju He; Yan Zhang; Zhonghua Zhang; Sanwen Huang; Tao Tan; Erli Pang; Kui Lin; Qun Hu; Hanhui Kuang; Peixiang Ni; Bo Wang; Jingan Liu; Qinghe Kou; Wenju Hou; Xiaohua Zou; Jiao Jiang; Guoyi Gong; Kathrin Klee; Heiko Schoof; Ying Huang; Xuesong Hu; Shanshan Dong; Dequan Liang; Juan Wang; Kui Wu; Yang Xia; Xiang Zhao; Zequn Zheng; Miao Xing; Xinming Liang; Bangqing Huang; Tian Lv; Junyi Wang; Ye Yin; Hongping Yi; Ruiqiang Li; Mingzhu Wu; Amnon Levi; Xingping Zhang; James J Giovannoni; Jun Wang; Yunfu Li; Zhangjun Fei; Yong Xu
Journal: Nat Genet Date: 2012-11-25 Impact factor: 38.330

8. Multiple reference genomes and transcriptomes for Arabidopsis thaliana.

Authors: Xiangchao Gan; Oliver Stegle; Jonas Behr; Joshua G Steffen; Philipp Drewe; Katie L Hildebrand; Rune Lyngsoe; Sebastian J Schultheiss; Edward J Osborne; Vipin T Sreedharan; André Kahles; Regina Bohnert; Géraldine Jean; Paul Derwent; Paul Kersey; Eric J Belfield; Nicholas P Harberd; Eric Kemen; Christopher Toomajian; Paula X Kover; Richard M Clark; Gunnar Rätsch; Richard Mott
Journal: Nature Date: 2011-08-28 Impact factor: 49.962

9. An integrated map of genetic variation from 1,092 human genomes.

Authors: Goncalo R Abecasis; Adam Auton; Lisa D Brooks; Mark A DePristo; Richard M Durbin; Robert E Handsaker; Hyun Min Kang; Gabor T Marth; Gil A McVean
Journal: Nature Date: 2012-11-01 Impact factor: 49.962

10. snpTree--a web-server to identify and construct SNP trees from whole genome sequence data.

Authors: Pimlapas Leekitcharoenphon; Rolf S Kaas; Martin Christen Frølund Thomsen; Carsten Friis; Simon Rasmussen; Frank M Aarestrup
Journal: BMC Genomics Date: 2012-12-13 Impact factor: 3.969
