Biopython: freely available Python tools for computational molecular biology and bioinformatics.

Peter J A Cock, Tiago Antao, Jeffrey T Chang, Brad A Chapman, Cymon J Cox, Andrew Dalke, Iddo Friedberg, Thomas Hamelryck, Frank Kauff, Bartek Wilczynski, Michiel J L de Hoon.

Abstract

SUMMARY: The Biopython project is a mature open source international collaboration of volunteer developers, providing Python libraries for a wide range of bioinformatics problems. Biopython includes modules for reading and writing different sequence file formats and multiple sequence alignments, dealing with 3D macromolecular structures, interacting with common tools such as BLAST, ClustalW and EMBOSS, accessing key online databases, as well as providing numerical methods for statistical learning. AVAILABILITY: Biopython is freely available, with documentation and source code at www.biopython.org, under the Biopython license.

Year: 2009; PMID: 19304878; PMCID: PMC2682512; DOI: 10.1093/bioinformatics/btp163

Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937

1 INTRODUCTION

Python (www.python.org) and Biopython are freely available open source tools, available for all the major operating systems. Python is a very high-level programming language, in widespread commercial and academic use. It features an easy-to-learn syntax, object-oriented programming capabilities and a wide array of libraries. Python can interface to optimized code written in C, C++ or even FORTRAN, and together with the Numerical Python project numpy (Oliphant, 2006), makes a good choice for scientific programming (Oliphant, 2007). Python has even been used in the numerically demanding field of molecular dynamics (Hinsen, 2000). There are also high-quality plotting libraries such as matplotlib (matplotlib.sourceforge.net) available.

Since its founding in 1999 (Chapman and Chang, 2000), Biopython has grown into a large collection of modules, described briefly below, intended for computational biology or bioinformatics programmers to use in scripts or incorporate into their own software. Our web site lists over 100 publications using or citing Biopython.

The Open Bioinformatics Foundation (OBF, www.open-bio.org) hosts our web site, source code repository, bug tracking database and email mailing lists, and also supports the related BioPerl (Stajich et al., 2002), BioJava (Holland et al., 2008), BioRuby (www.bioruby.org) and BioSQL (www.biosql.org) projects.

2 BIOPYTHON FEATURES

The Seq object is Biopython's core sequence representation. It behaves very much like a Python string but with the addition of an alphabet (allowing explicit declaration of a protein sequence for example) and some key biologically relevant methods, for example reverse complementation and translation.

Sequence annotation is represented using SeqRecord objects, which augment a Seq object with properties such as the record name, identifier and description, and space for additional key/value terms. The SeqRecord can also hold a list of SeqFeature objects, which describe sub-features of the sequence with their location and their own annotation.

The Bio.SeqIO module provides a simple interface for reading and writing biological sequence files in various formats (Table 1), where regardless of the file format, the information is held as SeqRecord objects. Bio.SeqIO interprets multiple sequence alignment file formats as collections of equal length (gapped) sequences. Alternatively, Bio.AlignIO works directly with alignments, including files holding more than one alignment (e.g. re-sampled alignments for bootstrapping, or multiple pairwise alignments). The related module Bio.Nexus, developed for Kauff et al. (2007), supports phylogenetic tools using the NEXUS interface (Maddison et al., 1997) or the Newick standard tree format.
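As a minimal sketch of how Seq, SeqRecord and Bio.SeqIO fit together, assuming a recent Biopython release is installed: the sequence, identifiers and FASTA text below are made up for illustration, and the alphabet argument is left out because recent releases no longer use it.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# A Seq behaves much like a string, plus biological methods.
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
first_codon = coding_dna[:3]                  # slicing returns another Seq
rev_comp = coding_dna.reverse_complement()
protein = coding_dna.translate(to_stop=True)  # stop at the first stop codon

# A SeqRecord adds an identifier, name, description and annotations.
# All of these values are hypothetical.
record = SeqRecord(
    coding_dna,
    id="example_1",
    name="example",
    description="hypothetical coding sequence",
)
record.annotations["source"] = "synthetic"

# Bio.SeqIO yields SeqRecord objects whatever the input format.
fasta_text = (
    ">seq_a first made-up record\nACGTACGT\n"
    ">seq_b second made-up record\nTTGACA\n"
)
parsed = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

print(first_codon, rev_comp, protein)
for rec in parsed:
    print(rec.id, len(rec.seq), rec.description)
```

Reading from a StringIO handle keeps the sketch self-contained; in a script the same SeqIO.parse call would normally be given a file name or an open file.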

Table 1.

Selected Bio.SeqIO or Bio.AlignIO file formats

Format	R/W	Name and reference
fasta	R+W	FASTA (Pearson and Lipman, 1988)
genbank	R+W	GenBank (Benson et al., 2007)
embl	R	EMBL (Kulikova et al., 2006)
swiss	R	Swiss-Prot/TrEMBL or UniProtKB (The UniProt Consortium, 2007)
clustal	R+W	Clustal W (Thompson et al., 1994)
phylip	R+W	PHYLIP (Felsenstein, 1989)
stockholm	R+W	Stockholm or Pfam (Bateman et al., 2004)
nexus	R+W	NEXUS (Maddison et al., 1997)

Where possible, our format names (column ‘Format’) match BioPerl and EMBOSS (Rice et al., 2000). Column ‘R/W’ denotes support for reading (R) and writing (W).
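The format names in Table 1 are passed to Bio.SeqIO and Bio.AlignIO as plain strings. The following is a small sketch, assuming a recent Biopython release, that reads a made-up two-sequence alignment as 'fasta' and writes it back out as 'phylip' and 'clustal'; the sequences and identifiers are hypothetical.

```python
from io import StringIO

from Bio import AlignIO

# Two made-up, equal-length gapped sequences in FASTA format.
fasta_alignment = ">alpha\nACGT-ACGTA\n>beta\nACGTTACG-A\n"

# Read the alignment using the 'fasta' format name ...
alignment = AlignIO.read(StringIO(fasta_alignment), "fasta")

# ... and write it out under the 'phylip' and 'clustal' format names.
phylip_handle = StringIO()
AlignIO.write(alignment, phylip_handle, "phylip")
clustal_handle = StringIO()
AlignIO.write(alignment, clustal_handle, "clustal")

# Round trip: the PHYLIP text parses back to the same sequences.
round_trip = AlignIO.read(StringIO(phylip_handle.getvalue()), "phylip")

print(phylip_handle.getvalue())
print(len(alignment), alignment.get_alignment_length())
```

Identifiers are kept short here because the strict PHYLIP format truncates names to ten characters.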

Modules for a number of online databases are included, such as the NCBI Entrez Utilities, ExPASy, InterPro, KEGG and SCOP. Bio.Blast can call the NCBI's online BLAST server or a local standalone installation, and includes a parser for their XML output. Biopython has wrapper code for other command line tools too, such as ClustalW and EMBOSS.

The Bio.PDB module provides a PDB file parser, and functionality related to macromolecular structure (Hamelryck and Manderick, 2003). The Bio.Motif module provides support for sequence motif analysis (searching, comparing and de novo learning). Biopython's graphical output capabilities were recently significantly extended by the inclusion of GenomeDiagram (Pritchard et al., 2006).

Biopython contains modules for supervised statistical learning, such as Bayesian methods and Markov models, as well as unsupervised learning, such as clustering (De Hoon et al., 2004). The population genetics module provides wrappers for GENEPOP (Rousset, 2007), coalescent simulation via SIMCOAL2 (Laval and Excoffier, 2004) and selection detection based on a well-evaluated Fst-outlier detection method (Beaumont and Nichols, 1996).

BioSQL (www.biosql.org) is another OBF supported initiative, a joint collaboration between BioPerl, Biopython, BioJava and BioRuby to support loading and retrieving annotated sequences to and from an SQL database using a standard schema. Each project provides an object-relational mapping (ORM) between the shared schema and its own object model (a SeqRecord in Biopython). As an example, xBASE (Chaudhuri and Pallen, 2006) uses BioSQL with both BioPerl and Biopython.
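As one illustration of the unsupervised learning support mentioned above, the following is a minimal sketch of k-means clustering with Bio.Cluster, assuming Biopython and numpy are installed. The data matrix is made up, and the choice of two clusters and ten passes is an assumption for this toy example rather than a recommended setting.

```python
import numpy as np
from Bio.Cluster import kcluster

# Made-up measurements: rows are items, columns are conditions.
data = np.array([
    [1.0, 1.1, 0.9],
    [1.2, 0.9, 1.0],
    [0.9, 1.0, 1.1],
    [8.0, 8.2, 7.9],
    [8.1, 7.9, 8.0],
    [7.9, 8.0, 8.2],
])

# k-means with two clusters; npass repeats the random initialisation
# and keeps the solution with the smallest within-cluster error.
clusterid, error, nfound = kcluster(data, nclusters=2, npass=10)

# Cluster labels are arbitrary, so compare group membership, not numbers.
low_group = {int(c) for c in clusterid[:3]}
high_group = {int(c) for c in clusterid[3:]}

print(clusterid, error, nfound)
```

Because the starting assignment is random, the label given to each group can differ between runs; only the partition of the rows is expected to be stable for data this well separated.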

3 CONCLUSIONS

Biopython is a large open-source application programming interface (API) used in both bioinformatics software development and in everyday scripts for common bioinformatics tasks. The homepage www.biopython.org provides access to the source code, documentation and mailing lists. The features described herein are only a subset; potential users should refer to the tutorial and API documentation for further information.

REFERENCES (10 of 17 listed)

1. EMBOSS: the European Molecular Biology Open Software Suite.

Authors: P Rice; I Longden; A Bleasby
Journal: Trends Genet Date: 2000-06 Impact factor: 11.639

2. PDB file parser and structure class implemented in Python.

Authors: Thomas Hamelryck; Bernard Manderick
Journal: Bioinformatics Date: 2003-11-22 Impact factor: 6.937

3. SIMCOAL 2.0: a program to simulate genomic diversity over large recombining regions in a subdivided population with a complex history.

Authors: Guillaume Laval; Laurent Excoffier
Journal: Bioinformatics Date: 2004-04-29 Impact factor: 6.937

4. Open source clustering software.

Authors: M J L de Hoon; S Imoto; J Nolan; S Miyano
Journal: Bioinformatics Date: 2004-02-10 Impact factor: 6.937

5. GenomeDiagram: a python package for the visualization of large-scale genomic data.

Authors: Leighton Pritchard; Jennifer A White; Paul R J Birch; Ian K Toth
Journal: Bioinformatics Date: 2005-12-23 Impact factor: 6.937

6. genepop'007: a complete re-implementation of the genepop software for Windows and Linux.

Authors: François Rousset
Journal: Mol Ecol Resour Date: 2008-01 Impact factor: 7.090

7. Improved tools for biological sequence comparison.

Authors: W R Pearson; D J Lipman
Journal: Proc Natl Acad Sci U S A Date: 1988-04 Impact factor: 11.205

8. The Universal Protein Resource (UniProt).

Authors: The UniProt Consortium
Journal: Nucleic Acids Res Date: 2006-11-16 Impact factor: 16.971

9. EMBL Nucleotide Sequence Database in 2006.

Authors: Tamara Kulikova; Ruth Akhtar; Philippe Aldebert; Nicola Althorpe; Mikael Andersson; Alastair Baldwin; Kirsty Bates; Sumit Bhattacharyya; Lawrence Bower; Paul Browne; Matias Castro; Guy Cochrane; Karyn Duggan; Ruth Eberhardt; Nadeem Faruque; Gemma Hoad; Carola Kanz; Charles Lee; Rasko Leinonen; Quan Lin; Vincent Lombard; Rodrigo Lopez; Dariusz Lorenc; Hamish McWilliam; Gaurab Mukherjee; Francesco Nardone; Maria Pilar Garcia Pastor; Sheila Plaister; Siamak Sobhany; Peter Stoehr; Robert Vaughan; Dan Wu; Weimin Zhu; Rolf Apweiler
Journal: Nucleic Acids Res Date: 2006-12-05 Impact factor: 16.971

10. BioJava: an open-source framework for bioinformatics.

Authors: R C G Holland; T A Down; M Pocock; A Prlić; D Huen; K James; S Foisy; A Dräger; A Yates; M Heuer; M J Schreiber
Journal: Bioinformatics Date: 2008-08-08 Impact factor: 6.937
