
PopSc: Computing Toolkit for Basic Statistics of Molecular Population Genetics Simultaneously Implemented in Web-Based Calculator, Python and R.

Shi-Yi Chen1, Feilong Deng1, Ying Huang2, Cao Li1, Linhai Liu1, Xianbo Jia1, Song-Jia Lai1.   

Abstract

Although various computer tools have been developed to calculate statistics in molecular population genetics for both small- and large-scale DNA data, no efficient and easy-to-use toolkit yet focuses exclusively on the mathematical calculation steps. Here, we present PopSc, a bioinformatic toolkit for calculating 45 basic statistics in molecular population genetics, which fall into three classes: (i) genetic diversity of DNA sequences, (ii) statistical tests for neutral evolution, and (iii) measures of genetic differentiation among populations. In contrast to existing tools, PopSc was designed to directly accept intermediate metadata, such as allele frequencies, rather than raw DNA sequences or genotyping results. PopSc is first implemented as a web-based calculator with a user-friendly interface, which facilitates the teaching of population genetics in class and makes the calculation of statistics in research convenient and straightforward. We also provide a Python library and an R package of PopSc, which can be flexibly integrated into other bioinformatic packages for population genetics analysis.

Year: 2016; PMID: 27792763; PMCID: PMC5085088; DOI: 10.1371/journal.pone.0165434

Source DB: PubMed; Journal: PLoS One; ISSN: 1932-6203; Impact factor: 3.240


Introduction

The theoretical framework of population genetics was significantly advanced by accumulating evidence from molecular biology, which gave rise to the interdisciplinary field of molecular population genetics more than two decades ago [1]. A primary task of molecular population genetics is to investigate the distribution and dynamic change of allele frequencies at the population level in relation to evolutionary processes and demographic events. Accordingly, many mathematical models and statistical tests have been developed to address these issues. A widely known example is Tajima's D, a simple yet sophisticated statistic proposed by the Japanese scientist Fumio Tajima in 1989 for testing the hypothesis of neutral evolution on the basis of observed DNA polymorphisms [2], which is still actively cited in a large number of publications [3]. Prior to the advent of high-throughput sequencing techniques, the molecular data commonly available for population genetics studies involved only a short, directly sequenced DNA fragment of tens of thousands of base pairs in length, or several hundred individually genotyped polymorphic loci. Aimed at these small-scale, standard molecular data, a number of computer programs, such as the widely cited DnaSP [4] and Arlequin [5], were developed for calculating statistics in population genetics, each with different strengths and weaknesses [6]. More recently, to better handle large-scale molecular data, such as genome-wide polymorphisms from massive resequencing, many high-throughput bioinformatic packages have been proposed for calculating the basic statistics much more efficiently, notably ones written in the R programming language [7-9]. Although various tools are available for population genetics analysis of both small- and large-scale molecular data, we found two limitations in this field that deserve to be addressed.
First, the existing tools are inconvenient when one intends to compute certain statistics directly from ready-made metadata of variants (such as allele frequencies) instead of raw DNA sequences, because all of them require strictly formatted input data. We therefore believe that an easy-to-use online calculator would be very helpful for simple tasks in both teaching and research in molecular population genetics. Second, no toolkit is yet available that exclusively performs the calculation step of the statistics and that can therefore be directly and flexibly incorporated into a more advanced bioinformatic package. Such a toolkit could facilitate the development of analysis packages by bioinformaticians without a strong background in population genetics, and by biologists with only elementary skills in bioinformatics. In the present study, we specifically address these two issues.

Design of PopSc and Statistics

To address the issues mentioned above, PopSc was designed to exclusively perform the calculation step for the statistics of interest in molecular population genetics. The input data for PopSc are therefore descriptive metadata, such as allele or haplotype frequencies, the number of segregating sites, counts of different mutation types, and the mismatch distribution, which must be generated in advance from upstream raw DNA sequences or genotyping data. After completing the computation, PopSc returns the numerical value of the statistic directly to the user. This design philosophy makes calculation convenient without requiring tedious input of raw DNA sequences and also allows the PopSc toolkit to be integrated directly into other bioinformatic analysis packages. Guided generally by their popularity in the scientific literature, we collected a total of 45 basic statistics in molecular population genetics into PopSc (Table 1). They can be roughly categorized into three classes: (i) genetic diversity of DNA sequences, (ii) statistical tests for neutral evolution, and (iii) measures of genetic differentiation among populations. The input data required by PopSc differ slightly depending on which statistic is selected, and each is clearly defined in the reference manuals. To facilitate practical use, each statistic is calculated by one independent function. The mathematical formulas of these statistics were taken unchanged from the original publications, and all results calculated by PopSc were carefully checked, either against values output by prevalent tools such as DnaSP [4] or by step-by-step manual verification.
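As an illustration of this design, in which one independent function maps descriptive metadata to a single statistic, the following Python sketch computes Tajima's D [2] from three summary values: the sample size, the number of segregating sites, and the average number of pairwise nucleotide differences. The function name `tajima_d` and its parameter names are hypothetical and are not taken from the PopSc library; the formula follows the standard definition of the statistic, and the requirement of at least four sequences is an assumption of this sketch (the variance term vanishes for smaller samples).

```python
import math


def tajima_d(n, num_segregating_sites, mean_pairwise_diff):
    """Tajima's D from summary metadata (hypothetical helper, not the PopSc API).

    n                      -- number of sampled sequences
    num_segregating_sites  -- S, number of polymorphic sites
    mean_pairwise_diff     -- k, average number of pairwise differences
    """
    if n < 4:
        # Assumption of this sketch: the variance of D is zero for n <= 3.
        raise ValueError("at least 4 sequences are required")
    if num_segregating_sites < 1:
        raise ValueError("at least one segregating site is required")

    s = num_segregating_sites
    # Harmonic sums over 1 .. n-1
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))

    # Constants of the variance estimator
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)

    # Difference between the two estimators of theta, scaled by its
    # estimated standard deviation
    variance = e1 * s + e2 * s * (s - 1)
    return (mean_pairwise_diff - s / a1) / math.sqrt(variance)


if __name__ == "__main__":
    # Illustrative input values only, not data from the paper
    print(tajima_d(n=10, num_segregating_sites=16, mean_pairwise_diff=3.9))
```

Because the function takes only numbers that a user already has in hand, it can be called from a teaching notebook or from inside a larger analysis pipeline without any sequence parsing.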
Table 1

Summary of the statistics included in PopSc.

Classes | Statistics
Genetic diversity: the basic statistics about genetic diversity among a set of nucleotide sequences. | Heterozygosity H, Haplotype diversity Hd [10]; Nucleotide diversity π [11]; Average nucleotide differences k [12]; Polymorphism information content, PIC [13]
Neutrality test: the classical statistical tests for DNA neutral evolution based on the mutation frequencies, haplotype distribution, and mismatch distribution, respectively. | Tajima's D [2]; Fu and Li's D, F, D*, F* [14]; Strobeck's S, Fu's W, Fs, Watterson's W [15, 16]; Fay and Wu's H, Hn, Zeng's E [17, 18]; Ramos-Onsins's R2, R3, R4, R2E, R3E, R4E, Ch, Che, ku [19]; Raggedness index rg [20]; Kelly's Zns, ZA [21, 22]
Population structure: the statistics for measuring genetic differentiation among populations. | Wright's FST, F̄ST, FIS [23]; Nei's GST, DST, JT, JS, RST [24]; Hedrick GST [25]; Jost D [26]; Weir and Cockerham's θU, θRH, θw, fU, fRH, fw [27]
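To make the kind of input used by the diversity and differentiation classes concrete, the following Python sketch computes heterozygosity and a basic form of Nei's GST directly from allele frequencies. The function names are hypothetical and do not come from the PopSc library. The sketch also assumes equally weighted populations and applies no sample-size correction, so the exact definitions used in PopSc may differ; the reference manuals remain the authority.

```python
def expected_heterozygosity(allele_freqs):
    """H = 1 - sum(p_i^2) for one locus (hypothetical helper, not the PopSc API)."""
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_freqs)


def nei_gst(pop_allele_freqs):
    """Basic Nei's GST = (HT - HS) / HT for one locus.

    pop_allele_freqs -- list of per-population allele-frequency lists,
                        all given in the same allele order.
    Assumptions of this sketch: populations are weighted equally and no
    correction for sample size is applied.
    """
    n_pops = len(pop_allele_freqs)
    if n_pops < 2:
        raise ValueError("at least two populations are required")
    n_alleles = len(pop_allele_freqs[0])
    if any(len(freqs) != n_alleles for freqs in pop_allele_freqs):
        raise ValueError("all populations must list the same alleles")

    # HS: mean within-population heterozygosity
    hs = sum(expected_heterozygosity(f) for f in pop_allele_freqs) / n_pops

    # HT: heterozygosity of the pooled (mean) allele frequencies
    mean_freqs = [sum(column) / n_pops for column in zip(*pop_allele_freqs)]
    ht = expected_heterozygosity(mean_freqs)
    if ht == 0.0:
        raise ValueError("GST is undefined when the locus is monomorphic")
    return (ht - hs) / ht


if __name__ == "__main__":
    # Illustrative allele frequencies only, not data from the paper
    print(expected_heterozygosity([0.8, 0.2]))
    print(nei_gst([[0.8, 0.2], [0.2, 0.8]]))
```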

Implementation and Availability

Because a web browser is independent of the operating system and requires no installation, PopSc was first implemented as an online calculator (Fig 1). The web-based calculator of PopSc provides convenient and straightforward calculation of statistics without requiring input of DNA sequences or other types of raw molecular data. Mathematical formulas and the related documentation for each statistic are also shown on the respective web pages, which should be very helpful for teaching molecular population genetics in class. This teaching purpose has also been successfully served by the very popular tool GENALEX, an add-in program for Microsoft Excel [28].
Fig 1

Screenshot of PopSc online calculator.

In recent years, the programming languages Python and R have become increasingly popular in bioinformatic analyses. Therefore, PopSc was also independently implemented and provided as a Python library and an R package, which can be easily integrated into other bioinformatic packages in molecular population genetics. The Python library and R package of PopSc are deposited in the official repositories (http://pypi.python.org and http://cran.r-project.org, respectively) and can be installed by standard commands. The web-based calculator, source code and reference manuals of PopSc are also freely available at: http://chenshiyi.com/popsc.html. Of course, PopSc is not an upgrade of, or a replacement for, the existing computer tools, because it performs only the mathematical calculation step for each statistic. PopSc therefore accepts only descriptive metadata, which must be prepared independently from raw DNA sequences or genotyping data with the aid of custom scripts or other computer tools, such as MEGA [29]. PopSc remains open to the implementation of additional statistics in molecular population genetics.
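The upstream preparation step mentioned above can be as small as a short custom script. The following Python sketch, given under stated assumptions, derives three pieces of descriptive metadata (sample size, number of segregating sites, and average number of pairwise differences) from a set of aligned sequences. The function name `summarize_alignment` is hypothetical, and the sketch assumes equal-length, gap-free sequences with no ambiguous bases; real data would usually call for a dedicated tool.

```python
from itertools import combinations


def summarize_alignment(sequences):
    """Return (n, S, k) for aligned sequences (hypothetical helper).

    n -- number of sequences
    S -- number of segregating (polymorphic) sites
    k -- average number of pairwise nucleotide differences
    Assumption of this sketch: sequences are aligned, of equal length,
    and contain no gaps or ambiguity codes.
    """
    n = len(sequences)
    if n < 2:
        raise ValueError("at least two sequences are required")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")

    # A site is segregating if more than one base is observed in its column
    num_segregating = sum(1 for column in zip(*sequences) if len(set(column)) > 1)

    # Count mismatches over all unordered pairs of sequences
    total_diff = sum(
        sum(a != b for a, b in zip(seq1, seq2))
        for seq1, seq2 in combinations(sequences, 2)
    )
    num_pairs = n * (n - 1) // 2
    return n, num_segregating, total_diff / num_pairs


if __name__ == "__main__":
    # Toy alignment for illustration only
    print(summarize_alignment(["ACGT", "ACGA", "TCGA"]))
```

The returned values are exactly the kind of metadata that a calculation-only toolkit expects as input, which keeps sequence handling and statistical calculation in separate, replaceable steps.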
  25 in total

1.  Hitchhiking under positive Darwinian selection.

Authors:  J C Fay; C I Wu
Journal:  Genetics       Date:  2000-07       Impact factor: 4.562

2.  DNA variation at the rp49 gene region of Drosophila simulans: evolutionary inferences from an unusual haplotype structure.

Authors:  J Rozas; M Gullaud; G Blandin; M Aguadé
Journal:  Genetics       Date:  2001-07       Impact factor: 4.562

3.  Statistical properties of new neutrality tests against population growth.

Authors:  Sebastian E Ramos-Onsins; Julio Rozas
Journal:  Mol Biol Evol       Date:  2002-12       Impact factor: 16.240

4.  A standardized genetic differentiation measure.

Authors:  Philip W Hedrick
Journal:  Evolution       Date:  2005-08       Impact factor: 3.694

5.  Estimation of average heterozygosity and genetic distance from a small number of individuals.

Authors:  M Nei
Journal:  Genetics       Date:  1978-07       Impact factor: 4.562

6.  Statistical tests for detecting positive selection by utilizing high-frequency variants.

Authors:  Kai Zeng; Yun-Xin Fu; Suhua Shi; Chung-I Wu
Journal:  Genetics       Date:  2006-09-01       Impact factor: 4.562

7.  Computer programs for population genetics data analysis: a survival guide.

Authors:  Laurent Excoffier; Gerald Heckel
Journal:  Nat Rev Genet       Date:  2006-08-22       Impact factor: 53.242

8.  G(ST) and its relatives do not measure differentiation.

Authors:  Lou Jost
Journal:  Mol Ecol       Date:  2008-09       Impact factor: 6.185

9.  DnaSP v5: a software for comprehensive analysis of DNA polymorphism data.

Authors:  P Librado; J Rozas
Journal:  Bioinformatics       Date:  2009-04-03       Impact factor: 6.937

10.  Arlequin (version 3.0): an integrated software package for population genetics data analysis.

Authors:  Laurent Excoffier; Guillaume Laval; Stefan Schneider
Journal:  Evol Bioinform Online       Date:  2007-02-23       Impact factor: 1.625

