Guy Leonard, Jamie R Stevens, Thomas A Richards.
Abstract
The phylogenetic analysis of nucleotide sequences, and increasingly that of amino acid sequences, is used to address a number of biological questions. Access to extensive datasets, including numerous genome projects, means that standard phylogenetic analyses can include many hundreds of sequences. Unfortunately, most phylogenetic analysis programs do not tolerate the sequence naming conventions of genome databases. Managing large numbers of sequences and standardizing sequence labels for use in phylogenetic analysis programs can be a time-consuming and laborious task. Here we report the availability of an online resource for the management of gene sequences recovered from public access genome databases such as GenBank. These web utilities include the facility for renaming every sequence in a FASTA alignment file, with each sequence label derived from a user-defined combination of the species name and/or database accession number. This facility enables the user to keep track of the branching order of the sequences/taxa during multiple tree calculations and re-optimisations. After phylogenetic analysis, these webpages can then be used to rename every label in the subsequent tree files (with a user-defined combination of species name and/or database accession number). Together these programs drastically reduce the time required for managing sequence alignments and labelling phylogenetic figures. Additional features of our platform include the automatic removal of identical accession numbers (recorded in the report file) and generation of species and accession number lists for use in supplementary materials or figure legends.
Keywords: branch labels; phylogeny; sequence alignment; text management
Year: 2009 PMID: 19812722 PMCID: PMC2747128 DOI: 10.4137/ebo.s2331
Source DB: PubMed Journal: Evol Bioinform Online ISSN: 1176-9343 Impact factor: 1.625
Figure 1. REFGEN conversion of FASTA files for use in phylogenetic programs. A) Snapshot of CPN60 alignment. Sequences are derived from GenBank and the DOE JGI Phytophthora ramorum databases (please note that although the DOE JGI sequence does not conform to the long identification line format, it is accommodated by REFGEN). All CPN60 sequences are curtailed after the first 70 amino acid positions for the purpose of this figure. Note the long database identifier lines given to each sequence. B) Screenshot of REFGEN formatting options. C) Output from REFGEN, with sequence labels now compatible with all phylogenetic programs and ready for analysis.
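The kind of relabelling illustrated in Figure 1 can be sketched in a few lines of Python. This is only an illustration and not the REFGEN implementation: the function name `rename_fasta`, the assumed header layout (`>gi|number|db|accession| description [Genus species]`), and the short-label scheme (genus initial, four letters of the species epithet, and a running number) are all assumptions made for the example. Headers that do not match the assumed layout fall back to their first token, and repeated accession numbers are dropped and reported.

```python
import re

# Assumed long-identifier layout (hypothetical, not taken from REFGEN):
#   >gi|<number>|<db>|<accession>| description [Genus species]
HEADER = re.compile(r"^>gi\|\d+\|\w+\|([^|]+)\|.*?\[([^\]]+)\]\s*$")


def rename_fasta(fasta_text):
    """Return (renamed FASTA text, {label: (species, accession)}, duplicates)."""
    renamed, mapping, duplicates = [], {}, []
    seen = set()
    keep = False  # whether sequence lines of the current record are kept
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            match = HEADER.match(line)
            if match:
                accession, species = match.group(1), match.group(2)
            else:
                # Non-conforming header: use its first token as the accession.
                accession = (line[1:].split() or ["unnamed"])[0]
                species = "unknown"
            keep = accession not in seen
            if not keep:
                duplicates.append(accession)  # reported, not written out
                continue
            seen.add(accession)
            genus, _, epithet = species.partition(" ")
            # Illustrative label scheme: genus initial + epithet prefix + counter.
            label = f"{genus[:1]}{epithet[:4]}{len(mapping) + 1:04d}"
            mapping[label] = (species, accession)
            renamed.append(">" + label)
        elif keep and line.strip():
            renamed.append(line)
    return "\n".join(renamed) + "\n", mapping, duplicates


example = (
    ">gi|111|ref|XP_1.1| chaperonin [Phytophthora ramorum]\nMKV\n"
    ">gi|222|gb|AAB2.1| CPN60 [Homo sapiens]\nMLR\n"
    ">gi|111|ref|XP_1.1| chaperonin [Phytophthora ramorum]\nMKV\n"
)
text, mapping, duplicates = rename_fasta(example)
print(text)        # two records, labelled Pramo0001 and Hsapi0002
print(duplicates)  # ['XP_1.1']
```

The returned mapping doubles as a species and accession list, and it is what a later tree-relabelling step would need in order to restore full names.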
Figure 2. TREENAMER conversion of phylogenetic analysis with REFGEN IDs. A) Screenshot of TREENAMER tree formatting options. B) Example of tree output. The leftmost tree results from phylogenetic analysis; the rightmost tree is the same tree after editing with TREENAMER. Please note that although the DOE JGI sequence does not conform to the long identification line format, it is accommodated by TREENAMER. Such sequences will require manual alteration in the final figure but can be easily traced using the REFGEN output files.
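The reverse step shown in Figure 2, restoring full names in a tree file, can be sketched in the same spirit. Again this is a hedged illustration and not the TREENAMER code: it assumes the tree is a Newick string, that a dictionary from short label to (species, accession) is available, and that the hypothetical function `relabel_tree` and its two flags stand in for the user-defined combination of species name and/or accession number.

```python
import re


def relabel_tree(newick, mapping, use_species=True, use_accession=True):
    """Replace short labels in a Newick string with species and/or accession."""
    if not mapping:
        return newick

    def full_name(match):
        species, accession = mapping[match.group(0)]
        parts = []
        if use_species:
            parts.append(species)
        if use_accession:
            parts.append(accession)
        if not parts:
            return match.group(0)  # nothing selected: keep the short label
        # Unquoted Newick labels cannot contain spaces, so use underscores.
        return re.sub(r"[^A-Za-z0-9.]+", "_", " ".join(parts))

    # Longest labels first, and only whole labels (not parts of other tokens).
    labels = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w.])(?:" + "|".join(re.escape(label) for label in labels) + r")(?![\w.])"
    )
    return pattern.sub(full_name, newick)


mapping = {
    "Pramo0001": ("Phytophthora ramorum", "XP_1.1"),
    "Hsapi0002": ("Homo sapiens", "AAB2.1"),
}
tree = "((Pramo0001:0.1,Hsapi0002:0.2):0.05);"
print(relabel_tree(tree, mapping))
# ((Phytophthora_ramorum_XP_1.1:0.1,Homo_sapiens_AAB2.1:0.2):0.05);
print(relabel_tree(tree, mapping, use_accession=False))
# ((Phytophthora_ramorum:0.1,Homo_sapiens:0.2):0.05);
```

Writing spaces as underscores is a design choice made here to keep the labels valid without quoting; a label that is absent from the mapping is simply left unchanged, which mirrors the need to trace non-conforming sequences by hand.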