Ralph S Peters, Benjamin Meyer, Lars Krogmann, Janus Borner, Karen Meusemann, Kai Schütte, Oliver Niehuis, Bernhard Misof.
Abstract
BACKGROUND: An enormous amount of molecular sequence data has accumulated over the past several years and continues to grow exponentially as sequencing techniques become faster and cheaper. There is high and widespread interest in using these data for phylogenetic analyses. However, the amount of data that one can retrieve from public sequence repositories is virtually impossible to manage without dedicated software that automates the processing. Here we present a novel bioinformatics pipeline for downloading, formatting, filtering and analyzing public sequence data deposited in GenBank. It combines some well-established programs with numerous newly developed software tools (available at http://software.zfmk.de/).
Year: 2011; PMID: 21851592; PMCID: PMC3173391; DOI: 10.1186/1741-7007-9-55
Source DB: PubMed; Journal: BMC Biol; ISSN: 1741-7007; Impact factor: 7.431
Figure 1. Outline of our pipeline that processes GenBank sequence data for phylogenetic analysis. Steps that are executed by newly developed scripts are highlighted in blue, and external programs are given in parentheses after the step description. Steps that directly refer to the phylogenetic analysis are highlighted in red. The additional procedure to infer subset 2 is shaded in gray.
New scripts used in our pipeline^a
| Step | Number | Script |
|---|---|---|
| Download from GenBank | [I] | |
| Standardize headers | [a.I], [b.I] | |
| Split sequences to single genes | [b.II] | |
| Check strand polarity and sequence similarity | [b.III] | |
| Choose longest sequence per species and gene | [a.IV], [b.IV] | |
| Translate coding mitochondrial sequences from nucleotides to amino acids | [b.V] | |
| Delete groups of orthologs with three or fewer species | [II], [III], [XIII] | |
| Delete species with only one sequence | [III], [XIII] | |
| Backtranslate coding mitochondrial sequences from amino acids to nucleotides | [VI] | |
| Mask gappy regions in alignment | [VII] | |
| Select maximum clique of overlapping sequences | [IX], [X] | |
| Ban compositional heterogeneity | [XI], [XII] | |
| Prune genera to best represented species | [XIV] | |
| Select largest group of species that overlap in at least one group of orthologs | [XV] | |
| Concatenate alignments | [XVI] | |
^a Available at http://software.zfmk.de/ and in Additional file 1. All scripts were written in Ruby, except for checking_seq, which was written in Perl. Numerals (column "Number") correspond to those in Figure 1.
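Three of the filtering steps listed in the table (choosing the longest sequence per species and gene, deleting groups of orthologs with three or fewer species, and deleting species with only one sequence) can be illustrated with a minimal Python sketch. This is not the authors' Ruby code: the function names, the `(species, gene, sequence)` record layout, the toy data, the first-seen tie-breaking rule, and the choice to repeat both deletions until nothing changes are all assumptions made for illustration.

```python
from collections import defaultdict


def longest_per_species_and_gene(records):
    """Keep the longest sequence for every (species, gene) pair.

    `records` is an iterable of (species, gene, sequence) tuples
    (a hypothetical layout). Ties keep the first sequence seen.
    """
    best = {}
    for species, gene, seq in records:
        key = (species, gene)
        if key not in best or len(seq) > len(best[key]):
            best[key] = seq
    return best


def filter_sparse(best, min_species_per_gene=4, min_genes_per_species=2):
    """Drop poorly sampled genes and species.

    Genes (groups of orthologs) with three or fewer species and species
    with only one sequence are removed. Repeating both checks until the
    data set stops shrinking is an assumption of this sketch.
    """
    data = dict(best)
    while True:
        species_per_gene = defaultdict(set)
        genes_per_species = defaultdict(set)
        for species, gene in data:
            species_per_gene[gene].add(species)
            genes_per_species[species].add(gene)
        kept = {
            (species, gene): seq
            for (species, gene), seq in data.items()
            if len(species_per_gene[gene]) >= min_species_per_gene
            and len(genes_per_species[species]) >= min_genes_per_species
        }
        if len(kept) == len(data):
            return kept
        data = kept


# Toy input with made-up species labels and sequences.
records = [
    ("sp1", "COI", "ACGT"), ("sp1", "COI", "ACGTAC"), ("sp1", "28S", "GG"),
    ("sp2", "COI", "ACG"), ("sp2", "28S", "GGA"),
    ("sp3", "COI", "AC"), ("sp3", "28S", "G"),
    ("sp4", "COI", "ACGTA"), ("sp4", "28S", "GGAT"),
    ("sp5", "COI", "A"),                          # species with one sequence
    ("sp1", "18S", "TT"), ("sp2", "18S", "TTT"),  # gene with two species
]

best = longest_per_species_and_gene(records)
kept = filter_sparse(best)
for (species, gene), seq in sorted(kept.items()):
    print(species, gene, seq)
```

In this toy data set, the 18S group and species sp5 are removed, leaving four species that are each represented by COI and 28S.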
Figure 2. Simplified phylogenetic tree of Hymenoptera inferred from GenBank sequences (tree 1 obtained from subset 1). The tree includes 1,142 species. The applied color code indicates major lineages.
Figure 3. Phylogenetic tree of Hymenoptera inferred from GenBank sequences (tree 1), reduced to family level. Numbers that follow the family names indicate the number of analyzed species. Numbers above branches indicate bootstrap support values (%). Values < 50% are omitted. The applied color code corresponds to that of Figure 2. Single species whose position in the inferred phylogenetic tree we consider erroneous are shown in gray.
Figure 4. Phylogenetic tree of Hymenoptera inferred from GenBank sequences (tree 2 obtained from subset 2), reduced to family level. In this tree, species that were excluded by our pipeline in the course of generating subset 1 are reincluded. These taxa are marked with asterisks. The meaning of numbers and the applied color code correspond to those in Figure 3.