Rob Knight, Peter Maxwell, Amanda Birmingham, Jason Carnes, J Gregory Caporaso, Brett C Easton, Michael Eaton, Micah Hamady, Helen Lindsay, Zongzhi Liu, Catherine Lozupone, Daniel McDonald, Michael Robeson, Raymond Sammut, Sandra Smit, Matthew J Wakefield, Jeremy Widmann, Shandy Wikman, Stephanie Wilson, Hua Ying, Gavin A Huttley.
Abstract
We have implemented in Python the COmparative GENomic Toolkit, a fully integrated and thoroughly tested framework for novel probabilistic analyses of biological sequences, devising workflows, and generating publication quality graphics. PyCogent includes connectors to remote databases, built-in generalized probabilistic techniques for working with biological sequences, and controllers for third-party applications. The toolkit takes advantage of parallel architectures and runs on a range of hardware and operating systems, and is available under the general public license from http://sourceforge.net/projects/pycogent.
Year: 2007 PMID: 17708774 PMCID: PMC2375001 DOI: 10.1186/gb-2007-8-8-r171
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Summary of features of selected comparative genomics tools
| Feature | PyCogent | HyPHY | P4 | BioPython | Mesquite | ARB | CIPRES^a |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Query remote database | Yes | No | No | Yes | Yes | Yes | No |
| Control external | Yes | No | No | Yes | Yes | Yes | Yes |
| Create novel substitution | Yes | Yes | Yes | No | Yes | No | No |
| Novel sequence alignment | Yes | No | No | Yes | No | No | No |
| Partition models | Yes | Yes | No | No | No | No | No |
| Slice sequences | Yes | No | No | Yes | No | No | No |
| Draw alignments | Yes | No | No | No | No | No | No |
| Build phylogenetic trees | Yes | Yes | Yes | No | Yes | Yes | Yes |
| Draw phylogenetic trees | Yes | Yes | Yes | No | Yes | Yes | Yes |
| Visualization of model | Yes | Yes | No | No | Yes | No | No |
| Parallel computation | Yes | Yes | No | Yes | No | No | Yes |
| Customize parallelization | Yes | Yes | No | Yes | No | No | Yes |
| Reconstruct ancestral | Yes | Yes | No | No | Yes | Yes | Yes |
| Simulate sequences | Yes | Yes | Yes | No | Yes | No | Yes |
| Graphical user interface | No | Yes | No | No | Yes | Yes | Yes |
| Script based control | Yes | Yes | Yes | Yes | Yes | No | Yes |
| Handle 3D structures | Yes | No | No | Yes | Yes | Yes | No |
| Handling RNA secondary | Yes | No | No | No | No | Yes | No |
PyCogent provides a unique combination of evolutionary modeling capabilities, visualization, and workflow control. Although many packages, including but not limited to those shown, provide some overlap in capabilities, PyCogent provides a combination of features that is uniquely suited to genome-scale analyses. ^a CIPRES requires purchase of commercially licensed software for core functionality. 3D, three-dimensional.
Figure 1. Interactive Python session showing a codon analysis of mammal nucleotide BRCA1 sequences. Line numbers are shown at the beginnings of input (but not output) lines and are referenced in the text. The terms '>>>' and '...' represent primary input and continuation prompts, respectively, from a Python interactive session. For noninteractive use, these characters and the following space are removed. The trailing '...' indicates additional output has been truncated.
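The prompt convention described in this caption can be illustrated with a minimal sketch that turns an interactive transcript into a script by dropping the prompts and the output lines. The function name `session_to_script` and the sample session are hypothetical and are not part of PyCogent; the sketch also assumes the transcript carries no leading line numbers.

```python
def session_to_script(session: str) -> str:
    """Keep only the input lines of an interactive session, minus their prompts."""
    script_lines = []
    for line in session.splitlines():
        if line.startswith(">>> ") or line.startswith("... "):
            # Remove the three prompt characters and the following space.
            script_lines.append(line[4:])
        elif line in (">>>", "..."):
            # A bare prompt is an empty input line. A bare '...' that marks
            # truncated output looks identical, and keeping it as a blank
            # line is harmless in the resulting script.
            script_lines.append("")
        # Every other line is program output and is dropped.
    return "\n".join(script_lines)


# Hypothetical transcript used only to exercise the function.
example_session = """>>> total = 0
>>> for n in (1, 2, 3):
...     total += n
...
>>> print(total)
6
"""

if __name__ == "__main__":
    print(session_to_script(example_session))
```

The result can be saved to a `.py` file and run noninteractively.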
Figure 2. Estimating pair-wise distances. We use a general time reversible nucleotide substitution model (line 1). The pair-wise distances (line 4) are passed to the neighbor joining (nj) function (line 5), which returns a tree that is then written to file (line 6).
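The code shown in the figure is not reproduced in this record. As a rough, self-contained illustration of the same workflow (pair-wise distances, then neighbor joining, then a tree in Newick form), the sketch below may help. It deliberately swaps in the simpler Jukes-Cantor (JC69) correction instead of the general time reversible model, and none of the function names are PyCogent's API; they are assumptions made for this example.

```python
import math
from itertools import combinations


def jc69_distance(seq_a, seq_b):
    """Jukes-Cantor distance between two aligned sequences, skipping gap columns."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable columns")
    p = sum(a != b for a, b in pairs) / len(pairs)  # proportion of differing sites
    if p >= 0.75:
        raise ValueError("sequences too divergent for the JC69 correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def pairwise_distances(alignment):
    """Map each unordered pair of names to its JC69 distance."""
    return {frozenset((x, y)): jc69_distance(alignment[x], alignment[y])
            for x, y in combinations(alignment, 2)}


def neighbor_joining(names, dists):
    """Return a Newick string for the (unrooted) neighbor-joining tree."""
    d = dict(dists)
    nodes = list(names)
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node from all the others.
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
        # Join the pair that minimizes the Q criterion.
        i, j = min(combinations(nodes, 2),
                   key=lambda pair: (n - 2) * d[frozenset(pair)] - r[pair[0]] - r[pair[1]])
        dij = d[frozenset((i, j))]
        li = dij / 2 + (r[i] - r[j]) / (2 * (n - 2))
        lj = dij - li
        new = f"({i}:{li:.4f},{j}:{lj:.4f})"
        nodes = [k for k in nodes if k not in (i, j)]
        for k in nodes:
            d[frozenset((new, k))] = (d[frozenset((i, k))] + d[frozenset((j, k))] - dij) / 2
        nodes.append(new)
    a, b = nodes
    return f"({a}:{d[frozenset((a, b))]:.4f},{b});"
```

Given an alignment held as a dictionary of name to sequence, `neighbor_joining(list(alignment), pairwise_distances(alignment))` returns a Newick string that can be written to a file with an ordinary `open(..., "w")` call.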
Figure 3. Radial dendrogram displaying Proteobacteria rRNA G+C% on a phylogenetic tree. Low to high G+C% is displayed on a spectrum from yellow to blue. Included are 30 randomly sampled species from each of the five Proteobacteria divisions (α to ε).
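To make the coloring concrete, here is a minimal sketch that computes G+C% for a sequence and maps it onto a yellow-to-blue scale. The linear interpolation in RGB space and both function names are assumptions for illustration; the caption does not state how the published figure derived its spectrum.

```python
def gc_percent(seq):
    """Percentage of G and C among the unambiguous bases of a DNA or RNA sequence."""
    counted = [base for base in seq.upper() if base in "ACGTU"]
    if not counted:
        raise ValueError("sequence has no unambiguous bases")
    return 100.0 * sum(base in "GC" for base in counted) / len(counted)


def yellow_to_blue(value, low, high):
    """Linearly interpolate an RGB color from yellow (low) to blue (high)."""
    if high == low:
        t = 0.5
    else:
        # Clamp so values outside [low, high] take the end colors.
        t = min(1.0, max(0.0, (value - low) / (high - low)))
    yellow, blue = (255, 255, 0), (0, 0, 255)
    return tuple(round(y + t * (b - y)) for y, b in zip(yellow, blue))
```

In use, `low` and `high` would be the smallest and largest G+C% among the sampled species, so that each tip of the tree receives a color relative to the observed range.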
Figure 4. Specifying the phylo-HMM for analysis of VWF. The substitution model arguments (lines 1 to 3) have the following meanings: ordered_param, the rate will be split and ordered from small to large across bins; distribution, the statistical distribution by which parameter values are determined; and recode_gaps, whether gap characters are set to 'N'. The substitution model is then turned into a likelihood function (line 5) by providing a phylogenetic tree and specifying that the Γ distribution is split into two bins; the autocorrelated occurrence of rate class members is indicated by the sites_independent argument. We finish the definition of the Γ rate heterogeneity distribution by fixing the bin probabilities (bprobs) at their default values (line 6), which are equal. The remaining statements provide the alignment data to the likelihood function, optimize it, and extract the posterior probabilities of each site belonging to each rate class (lines 7 to 9). The slow rate class is automatically assigned the name bin0, and its probabilities are extracted by slicing the array (line 10). HMM, hidden Markov model; VWF, von Willebrand Factor.
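The posterior probabilities mentioned in this caption can be illustrated with a generic forward-backward calculation over rate bins. This is a sketch under stated assumptions, not PyCogent's implementation: it takes per-site likelihoods under each rate class as already computed, and the `stay` parameter (the probability that adjacent sites share a rate class) is a hypothetical stand-in for however the toolkit parameterizes autocorrelation.

```python
def posterior_bin_probs(site_likelihoods, bprobs=(0.5, 0.5), stay=0.9):
    """Posterior probability of each rate bin at each site (forward-backward).

    site_likelihoods: one tuple per aligned site, holding the likelihood of
        that site under each rate bin (assumed to be computed elsewhere).
    bprobs: prior bin probabilities, equal by default.
    stay: hypothetical probability that the next site stays in the same bin.
    """
    n_sites, n_bins = len(site_likelihoods), len(bprobs)
    switch = (1.0 - stay) / (n_bins - 1)
    trans = [[stay if i == j else switch for j in range(n_bins)]
             for i in range(n_bins)]

    def normalize(row):
        total = sum(row)
        return [x / total for x in row]

    # Forward pass, rescaled at every site to avoid numerical underflow.
    forward = [normalize([bprobs[k] * site_likelihoods[0][k] for k in range(n_bins)])]
    for t in range(1, n_sites):
        prev = forward[-1]
        forward.append(normalize([
            site_likelihoods[t][k] * sum(prev[j] * trans[j][k] for j in range(n_bins))
            for k in range(n_bins)]))

    # Backward pass, also rescaled; the scale cancels in the final normalization.
    backward = [[1.0] * n_bins for _ in range(n_sites)]
    for t in range(n_sites - 2, -1, -1):
        nxt = backward[t + 1]
        backward[t] = normalize([
            sum(trans[j][k] * site_likelihoods[t + 1][k] * nxt[k] for k in range(n_bins))
            for j in range(n_bins)])

    return [normalize([f * b for f, b in zip(forward[t], backward[t])])
            for t in range(n_sites)]
```

Taking the first column, `[row[0] for row in posteriors]`, mirrors the slicing step that extracts the slow (bin0) class. With two bins and `stay=0.5` the sites become independent and each posterior reduces to the normalized per-site likelihoods.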
Figure 5. Posterior probabilities of aligned positions being classified as slowly evolving for VWF. Horizontal lines next to each name represent the aligned sequence, with gaps indicated by disruptions to the line (indels disrupt the von Willebrand Factor [VWF] A3 domain). Annotations for a sequence are displayed above its line. Red diamonds are single nucleotide polymorphisms (SNPs) annotated as being associated with von Willebrand disease; blue diamonds are the remaining SNPs. The blue line is the posterior probability that a site belongs to the slow (bin0) bin.
Figure 6. Rates of evolution on the VWF A1 domain residues. Posterior probabilities of being slowly evolving are shown on a spectrum from red to blue, corresponding to low to high probabilities. Residues with a disease-causing single nucleotide polymorphism are colored yellow. A movie showing rotation of the structure is provided in Additional data file 3. VWF, von Willebrand Factor.