
Processing and population genetic analysis of multigenic datasets with ProSeq3 software.

Abstract

MOTIVATION: The current tendency in molecular population genetics is to use increasing numbers of genes in the analysis. Here I describe a program for the handling and population genetic analysis of DNA polymorphism data collected from multiple genes. The program includes a sequence/alignment editor and an internal relational database that simplify the preparation and manipulation of multigenic DNA polymorphism datasets. The most commonly used DNA polymorphism analyses are implemented in ProSeq3, facilitating population genetic analysis of large multigenic datasets. Extensive input/output options make ProSeq3 a convenient hub for sequence data processing and analysis.

AVAILABILITY: The program is available free of charge from http://dps.plants.ox.ac.uk/sequencing/proseq.htm.

Year: 2009 PMID: 19797407 PMCID: PMC2778335 DOI: 10.1093/bioinformatics/btp572

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

With ever-decreasing costs of DNA sequencing and increasingly sophisticated analyses, the number of loci used in population genetic, phylogeographic and phylogenetic studies increases steadily. Only a few years ago it was normal to base the conclusions of experimental population genetic studies on the analysis of a single gene (Filatov and Charlesworth, 1999), while these days it is not uncommon to use hundreds of loci (or more) in a single study (Begun et al., 2007; Foxe et al., 2008). With the advent of high-throughput sequencing, the use of hundreds of loci will become the norm even for non-model organisms within a few years. Many population genetic programs, such as IMa (Hey and Nielsen, 2007), Structure (Pritchard et al., 2000) or Compute (Thornton, 2003), use multiple genes for analysis; however, preparation of such datasets, even with sequences in hand, is far from straightforward. Although multigenic datasets can be manipulated with scripts, this requires programming skills, and in practice experimental population geneticists often do this manually. Here I report a program, ProSeq3, with a convenient graphical user interface that simplifies the preparation and basic population genetic analysis of multigenic datasets. It has been tested and fine-tuned for several years in our laboratory, and its use leads to significant time savings at the dataset preparation and analysis stages.

2 FEATURES

ProSeq was originally developed as a Windows-based sequence editor with some DNA polymorphism analysis capability for single-gene datasets (Filatov, 2002). The new version is available for both Windows and Linux and can handle large datasets with thousands of genes. The size of the datasets is limited by memory and by the maximum value of a 32-bit signed long integer (2 147 483 647), which is used for internal indexing. The program can be used for sequence editing, annotation of sequence features, handling of output from high-throughput sequencers or from BLAST searches, as well as for various population genetic analyses. ProSeq3 supports and facilitates all steps of the DNA sequencing workflow, from sequence chromatogram editing to DNA polymorphism analysis of multigenic data.

2.1 DNA sequence editing, alignment and annotation

To help with the processing of raw sequence data, ProSeq3 allows users to open and visualize sequence chromatograms, edit the sequence and assemble sequence contigs. Integration with the popular phred and phrap programs (de la Bastide and McCombie, 2007; Ewing and Green, 1998) makes it possible to assess chromatogram quality and assemble contigs automatically. Raw sequences, with or without associated chromatogram and base quality information, can be further edited and annotated in ProSeq3 to obtain finished sequences. ProSeq3 facilitates the functional annotation of individual sequences in the dataset with several handy functions, such as selection and assignment of a functional (e.g. coding) region in the editor window, and the ability to copy assigned regions from another sequence in the dataset. All annotations are preserved if the dataset is saved in the native ProSeq3 data file format (*.df). Multiple sequence alignment can be done within ProSeq3, which includes Clustal (Higgins et al., 1996). Alternatively, alignment can be done manually using the ProSeq3 editor or with an external program. In the latter case, alignment information (position and length of gaps) can be imported back into the annotated dataset in ProSeq3. Following automated alignment, it is usually necessary to check, correct and trim the alignment manually, and to check differences between individual sequences, which is easily done in the sequence editor included in ProSeq3. The editor is fairly flexible and includes three viewing/editing modes, allowing the user to see and edit the sequence, the polymorphisms in the alignment and the functional regions assigned to the sequence. Using these modes, the user can scroll along the sequence, zoom in to see a region of the sequence, or zoom out to visualize the entire sequence with annotation shown in graphical form.
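The idea of carrying alignment information as gap positions and lengths can be illustrated with a short Python sketch. This is not ProSeq3 code (ProSeq3 is written in Delphi) and the paper does not describe its internal representation. The function names and the (position, length) tuple layout are assumptions made for illustration only.

```python
from typing import List, Tuple

# Hypothetical representation: (number of ungapped residues before the gap, gap length)
Gap = Tuple[int, int]


def extract_gaps(aligned: str, gap_char: str = "-") -> List[Gap]:
    """Return (position, length) for each run of gaps in an aligned sequence."""
    gaps: List[Gap] = []
    ungapped_pos = 0  # residues seen so far, ignoring gaps
    run = 0           # length of the gap run currently being read
    for ch in aligned:
        if ch == gap_char:
            run += 1
        else:
            if run:
                gaps.append((ungapped_pos, run))
                run = 0
            ungapped_pos += 1
    if run:  # alignment ends with a gap run
        gaps.append((ungapped_pos, run))
    return gaps


def apply_gaps(ungapped: str, gaps: List[Gap], gap_char: str = "-") -> str:
    """Re-insert gaps into an ungapped (e.g. annotated) sequence."""
    parts = []
    prev = 0
    for pos, length in gaps:
        parts.append(ungapped[prev:pos])
        parts.append(gap_char * length)
        prev = pos
    parts.append(ungapped[prev:])
    return "".join(parts)


def to_aligned_pos(pos: int, gaps: List[Gap]) -> int:
    """Map a 0-based coordinate in the ungapped sequence to the aligned sequence."""
    return pos + sum(length for gap_pos, length in gaps if gap_pos <= pos)
```

With this layout, the gaps found by an external aligner can be applied to the stored sequence with `apply_gaps`, and annotation coordinates (such as the start of a coding region) can be shifted with `to_aligned_pos` so that they still point at the same residues.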

2.2 Handling data with a relational database

Tracking which sequence in a dataset comes from which individual becomes problematic when the number of sequenced genes is large. ProSeq3 resolves this problem by storing all the data in an internal relational database in which sequences are linked to individuals, and individuals can be combined into groups (populations). This data structure makes it trivial to manipulate multiple datasets in the project; e.g. an individual can be excluded from the analysis with a couple of mouse clicks, which automatically excludes all sequences linked to that individual. Similarly, individual sequences or parts of sequences can be excluded from the analysis. Grouping sequences into populations is also done at the level of individuals: if an individual is assigned to a particular population, all the sequences across multiple datasets in the project that are linked to that individual are automatically assigned to that population. The assignment of sequences to individuals and of individuals to groups can be done by simple drag and drop. The relational information in the database is preserved if the project is saved in the native (*.df) ProSeq3 file format.
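The linkage described above (sequences to individuals, individuals to populations) can be sketched with Python's built-in sqlite3 module. The schema, table names and helper functions below are hypothetical and serve only to show why excluding or regrouping one individual propagates to all of its sequences. The actual ProSeq3 database layout is not given in the paper.

```python
import sqlite3
from typing import List, Tuple

# Hypothetical schema: one row per population, individual and sequence.
SCHEMA = """
CREATE TABLE population (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE individual (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE,
    population_id INTEGER REFERENCES population(id),
    excluded INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE sequence (
    id INTEGER PRIMARY KEY,
    gene TEXT NOT NULL,
    individual_id INTEGER NOT NULL REFERENCES individual(id),
    residues TEXT NOT NULL
);
"""


def build_demo() -> sqlite3.Connection:
    """Create an in-memory project with two genes and three individuals."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO population (id, name) VALUES (?, ?)",
                     [(1, "north"), (2, "south")])
    conn.executemany(
        "INSERT INTO individual (id, name, population_id) VALUES (?, ?, ?)",
        [(1, "ind1", 1), (2, "ind2", 1), (3, "ind3", 2)])
    conn.executemany(
        "INSERT INTO sequence (gene, individual_id, residues) VALUES (?, ?, ?)",
        [("geneA", 1, "ACGT"), ("geneA", 2, "ACGA"), ("geneA", 3, "ACTT"),
         ("geneB", 1, "GGCC"), ("geneB", 2, "GGCA"), ("geneB", 3, "GGTC")])
    return conn


def set_excluded(conn: sqlite3.Connection, individual: str, excluded: bool = True) -> None:
    """Flag one individual; every sequence linked to it follows automatically."""
    conn.execute("UPDATE individual SET excluded = ? WHERE name = ?",
                 (int(excluded), individual))


def assign_population(conn: sqlite3.Connection, individual: str, population: str) -> None:
    """Move an individual (and hence all of its sequences) to a population."""
    conn.execute(
        "UPDATE individual SET population_id = "
        "(SELECT id FROM population WHERE name = ?) WHERE name = ?",
        (population, individual))


def active_sequences(conn: sqlite3.Connection, gene: str) -> List[Tuple[str, str, str]]:
    """Return (individual, population, residues) for non-excluded individuals."""
    return conn.execute(
        "SELECT i.name, p.name, s.residues FROM sequence s "
        "JOIN individual i ON i.id = s.individual_id "
        "LEFT JOIN population p ON p.id = i.population_id "
        "WHERE s.gene = ? AND i.excluded = 0 ORDER BY i.name",
        (gene,)).fetchall()
```

Because exclusion and population membership are stored once per individual rather than once per sequence, a single update changes what every per-gene query returns.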

2.3 DNA polymorphism analysis

Once the alignments for several genes are complete and ready for analysis, they are usually analysed one by one using programs such as MEGA (Tamura et al., 2007) or DnaSP (Librado and Rozas, 2009). This process is relatively quick when there are only a few genes, but it becomes prohibitively time-consuming with larger numbers of genes. ProSeq3 solves this problem by allowing the user to run all the datasets in the project through a particular analysis in one go. Several of the most commonly used population genetic analyses are implemented in ProSeq3: visualisation and analysis of single nucleotide polymorphisms, common statistics for DNA polymorphism (π, θ; Nei and Kumar, 2000), various neutrality tests such as Tajima's D (Tajima, 1989), and analysis of population subdivision/divergence. The distribution of DNA polymorphism or neutrality statistics along the length of a gene can be visualised with a sliding window option. Although ProSeq3 was developed for population genetic analyses, it also includes a tool for basic phylogenetic analysis that can construct and visualise neighbor-joining trees (Nei and Kumar, 2000). The combination of a sequence editor and a tree visualisation tool in one program is particularly handy for preliminary evaluation and checking of the datasets: oddities in the data, such as misalignment or sequencing errors, make a sequence appear more diverged, which is easily identified by inspecting a gene tree and can be quickly fixed within ProSeq3. Other analysis options include a tool for creating bootstrap replicates of a dataset and a tool for coalescent simulations (Hein et al., 2005), with or without recombination, in panmictic or subdivided populations.
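For readers unfamiliar with the statistics named above, the following Python sketch computes nucleotide diversity (π), Watterson's θ and Tajima's D from a small alignment, using the standard formulas from Tajima (1989), and applies them in a sliding window. It is a minimal illustration, not the ProSeq3 implementation. The function names are hypothetical, and the handling of gaps (columns containing anything other than A, C, G or T are dropped) is an assumption.

```python
from collections import Counter
from math import sqrt
from typing import Dict, List, Sequence, Tuple

NAN = float("nan")


def polymorphism_stats(alignment: Sequence[str]) -> Dict[str, float]:
    """Return sites, S, k, per-site pi, per-site Watterson's theta and Tajima's D."""
    n = len(alignment)
    if n < 2:
        raise ValueError("need at least two sequences")
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("sequences must be aligned to the same length")

    sites = 0  # columns kept after dropping gapped/ambiguous columns
    seg = 0    # S, the number of segregating sites
    k = 0.0    # mean number of pairwise differences
    for column in zip(*alignment):
        bases = [b.upper() for b in column]
        if any(b not in "ACGT" for b in bases):
            continue  # assumption: complete deletion of such columns
        sites += 1
        counts = Counter(bases)
        if len(counts) > 1:
            seg += 1
            # Fraction of sequence pairs that differ at this column.
            k += (n * n - sum(c * c for c in counts.values())) / (n * (n - 1))

    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    theta = seg / a1  # Watterson's estimator for the whole region

    # Constants for the variance of (k - S/a1), following Tajima (1989).
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg + e2 * seg * (seg - 1)
    tajima_d = (k - theta) / sqrt(variance) if variance > 0 else NAN

    return {
        "sites": sites,
        "S": seg,
        "k": k,
        "pi": k / sites if sites else NAN,
        "theta_w": theta / sites if sites else NAN,
        "tajima_d": tajima_d,
    }


def sliding_window(alignment: Sequence[str], window: int,
                   step: int) -> List[Tuple[int, Dict[str, float]]]:
    """Compute the same statistics in windows along the alignment."""
    length = len(alignment[0])
    return [(start, polymorphism_stats([s[start:start + window] for s in alignment]))
            for start in range(0, length - window + 1, step)]
```

Tajima's D is returned as NaN when it is undefined, for example when there are no segregating sites. Running all genes of a project through the same analysis then amounts to calling `polymorphism_stats` once per alignment in a loop.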

2.4 Input/output options

ProSeq3 supports 25 different file formats. It can create input files for such popular programs as DnaSP (Librado and Rozas, 2009), MEGA (Tamura et al., 2007), PAML (Yang, 2007), Arlequin (Excoffier et al., 2005), Structure (Pritchard et al., 2000) and IMa (Hey and Nielsen, 2007). The multitude of supported file formats and flexible data structure of ProSeq3 make it a convenient hub for sequence data processing and analysis.
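As an illustration of the kind of conversion such a hub performs, the sketch below reads FASTA text and writes a relaxed, sequential PHYLIP-style layout (a header line with the number of sequences and sites, then one name and sequence per line). It is not taken from ProSeq3. The function names are hypothetical, and the exact layout expected by a given program (name width, interleaving) should be checked against that program's documentation.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (sequence name, residues)


def parse_fasta(text: str) -> List[Record]:
    """Parse FASTA text into (name, sequence) records."""
    records: List[Record] = []
    name = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            fields = line[1:].split()
            name = fields[0] if fields else "unnamed"  # keep the first word only
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def to_phylip(records: List[Record]) -> str:
    """Write records as relaxed sequential PHYLIP-style text."""
    if not records:
        raise ValueError("no sequences to write")
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    width = max(len(name) for name, _ in records)
    lines = [f"{len(records)} {lengths.pop()}"]
    for name, seq in records:
        # Two spaces separate the padded name from the sequence.
        lines.append(f"{name.ljust(width)}  {seq}")
    return "\n".join(lines) + "\n"
```

A call such as `to_phylip(parse_fasta(fasta_text))` converts one alignment. Looping over the genes in a project would produce one output file per gene.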

3 IMPLEMENTATION

ProSeq3 has been developed in Delphi 7 with the CLX library, and it can be compiled for the Windows and Linux operating systems.

REFERENCES (14 in total)

1. Inference of population structure using multilocus genotype data.

Authors: J K Pritchard; M Stephens; P Donnelly
Journal: Genetics Date: 2000-06 Impact factor: 4.562

2. Integration within the Felsenstein equation for improved Markov chain Monte Carlo methods in population genetics.

Authors: Jody Hey; Rasmus Nielsen
Journal: Proc Natl Acad Sci U S A Date: 2007-02-14 Impact factor: 11.205

3. PAML 4: phylogenetic analysis by maximum likelihood.

Authors: Ziheng Yang
Journal: Mol Biol Evol Date: 2007-05-04 Impact factor: 16.240

4. Assembling genomic DNA sequences with PHRAP.

Authors: Melissa de la Bastide; W Richard McCombie
Journal: Curr Protoc Bioinformatics Date: 2007-03

5. Base-calling of automated sequencer traces using phred. II. Error probabilities.

Authors: B Ewing; P Green
Journal: Genome Res Date: 1998-03 Impact factor: 9.043

6. Using CLUSTAL for multiple sequence alignments.

Authors: D G Higgins; J D Thompson; T J Gibson
Journal: Methods Enzymol Date: 1996 Impact factor: 1.600

7. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism.

Authors: F Tajima
Journal: Genetics Date: 1989-11 Impact factor: 4.562

8. DNA polymorphism, haplotype structure and balancing selection in the Leavenworthia PgiC locus.

Authors: D A Filatov; D Charlesworth
Journal: Genetics Date: 1999-11 Impact factor: 4.562

9. Population genomics: whole-genome analysis of polymorphism and divergence in Drosophila simulans.

Authors: David J Begun; Alisha K Holloway; Kristian Stevens; Ladeana W Hillier; Yu-Ping Poh; Matthew W Hahn; Phillip M Nista; Corbin D Jones; Andrew D Kern; Colin N Dewey; Lior Pachter; Eugene Myers; Charles H Langley
Journal: PLoS Biol Date: 2007-11-06 Impact factor: 8.029

10. Selection on amino acid substitutions in Arabidopsis.

Authors: John Paul Foxe; Vaqaar-un-Nisa Dar; Honggang Zheng; Magnus Nordborg; Brandon S Gaut; Stephen I Wright
Journal: Mol Biol Evol Date: 2008-04-04 Impact factor: 16.240
