On best practices in the development of bioinformatics software.

Felipe da Veiga Leprevost¹, Valmir C Barbosa², Eduardo L Francisco³, Yasset Perez-Riverol⁴, Paulo C Carvalho⁵.

Keywords: best practices; bioinformatics; repository; source control; test

Year: 2014 PMID: 25071829 PMCID: PMC4078907 DOI: 10.3389/fgene.2014.00199

Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599

1. Introduction

Bioinformatics is one of the major areas of study in modern biology. Medium- and large-scale quantitative biology studies have created a demand for professionals with proficiency in multiple disciplines, including computer science and statistical inference besides biology. Bioinformatics has now become a cornerstone in biology, and yet the formal training of new professionals (Perez-Riverol et al., 2013; Via et al., 2013), the availability of good services for data deposition, and the development of new standards and software coding rules (Sandve et al., 2013; Seemann, 2013) are still major concerns. Good programming practices range from documentation and code readability through design patterns and testing (Via et al., 2013; Wilson et al., 2014). Here, we highlight some points for best practices and raise important issues to be discussed by the community.

2. Source-code availability to reviewers

Whether source code should be made available to reviewers is debated among researchers; doing so could allow for a more complete review and evaluation of a manuscript’s results. It could also ultimately enable reviewers to demand the same quality and clarity that they demand of manuscripts reporting laboratory experiments, in which a bad PCR or a Western blot without controls may lead to wrong interpretations of the results (Ince et al., 2012). In the case of software, a clear indication that best practices were not followed can bespeak carelessness and therefore indirectly signal that something may be wrong. It is our opinion that reviewing the source code of submitted papers should be possible if desired, though publishers would obviously have to search for even more specialized reviewers for the task. The review does not necessarily need to be done at the code level; it can also be accomplished by evaluating the structure of the project and the availability of unit tests and functional tests. By organizing and providing tests that cover different scenarios, the authors can easily demonstrate how the software works and how it behaves under different conditions. The possibility of executing the code (without having to go deeply into it) and of looking into how particular issues are handled in the code is important at all stages of the work (both pre- and post-publication). Further inspection by the scientific community will eventually lead to the same advantages we see in open-source projects like the Linux kernel (Torvalds, 2014b) or the protocols used in the Internet. Bugs can be spotted and improvements suggested by the community. This is especially important because science is an ever-changing enterprise, always adapting and growing, and community inspection gives the software the opportunity to evolve along with the field.

3. Software indexing and availability

A topic that we should address as a community is the possibility of indexing software with a solution like the well-known DOI system. An example of such an initiative is the combined work of the Mozilla Science Lab (Mozilla Foundation, 2014), GitHub (GitHub, 2014), and Figshare (Figshare, 2014). This would enable researchers and practitioners to easily keep track of different software versions, thereby facilitating access and deployment (Summers, 2014). Currently, it is common for bioinformatics software to be hosted on university, laboratory, or even personal websites. Although they are convenient and provide users with quick access to the material in question, such solutions are also the source of a major problem in bioinformatics, namely the discontinuation of software availability. An ideal solution to this problem would be a central hosting repository where each version could be archived and made available. This would also help when old versions become necessary for old, third-party workflows. Another important aspect is the ability to prevent the deletion of previous versions of a project, which would also help prevent projects from ceasing to exist after a certain time or being abandoned.

4. Documenting the source code

Software documentation can be categorized into two groups, one targeted at software developers, the other at end users. The former is usually found in the source code, or is linked to it, and explains the particularities of the code itself, which is especially important for software updating and customization. The latter typically uses nontechnical language and is aimed at aiding the user in the process of software installation and execution. Without proper code documentation, resolving a bug or bringing new developers into the team becomes a very complicated task. Users likewise need access to documentation explaining how to use the software, which must include all directives for installation under different operating systems (where applicable) and for the handling of parameters and input data prior to a run. It is also important to note that we need proper documentation for biologists, as they will be the ones installing and using the programs. With easy-to-follow guidelines and instructions for non-programmers, it is possible to improve software usability.
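The two kinds of documentation can be illustrated with a minimal Python sketch. Everything in it is hypothetical and chosen only for illustration: the `gc_content` function, the `gc-content` program name, and the wording of the help text are assumptions, not part of any tool discussed here. The docstrings and comments address developers, while the command-line help text generated by `argparse` addresses end users.

```python
"""Hypothetical example: report the GC content of a DNA sequence."""
import argparse
from typing import Sequence


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in ``sequence``.

    Developer notes: the comparison is case-insensitive, and an empty
    sequence returns 0.0 instead of raising ZeroDivisionError.
    """
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def build_parser() -> argparse.ArgumentParser:
    """Build the command-line interface.

    The description and per-argument help strings are the user-facing
    documentation, shown when the program is run with ``--help``.
    """
    parser = argparse.ArgumentParser(
        prog="gc-content",  # hypothetical program name
        description="Report the fraction of G and C bases in a DNA sequence.",
    )
    parser.add_argument("sequence", help="DNA sequence, for example ACGTTGCA")
    return parser


def main(argv: Sequence[str]) -> str:
    """Parse ``argv`` and return the GC fraction formatted to three decimals."""
    args = build_parser().parse_args(argv)
    return f"{gc_content(args.sequence):.3f}"
```

Calling `main(["ACGT"])` returns the formatted GC fraction, and `build_parser().format_help()` returns the text a user would see with `--help`. Keeping the help strings next to the code that parses the arguments is one way to reduce the chance that user documentation drifts out of date.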

5. Source-code management

During a software project’s life cycle, a varying number of developers can be involved in its production, and different versions of it can be created. One of the main goals of source-code management is to have all these aspects taken care of automatically by building a historical record of development. Solutions such as Git allow simultaneous collaboration on several projects while greatly simplifying each maintainer’s tasks of tracking and resolving bugs, handling feature requests, and launching upgrades (Torvalds, 2014a). This also helps to promote the collaborative aspect of software development, since anyone can join an ongoing project and provide patches.

6. Test libraries, sample data, and dataset repositories

A test library is a series of scripts designed to test a given piece of software. It is meant to aid in quickly determining whether the software’s main modules are working as expected. Ideally, all functions of the code should be thus tested, but sometimes this is not possible because of the size or complexity of the project. What is fundamental to test, though, is whether the main logic and operations are working correctly whatever the running environment happens to be. Normally a test library is shipped together with the software and the tests are executed before installation to certify that the main features are working on the machine at hand. Another important aspect of any scientific software is that sample data be provided along with it, in a manner similar to that in which supplementary files are provided together with a manuscript. Through “real-world” examples, users can verify what to expect of the various analyses. Such examples also allow for comparisons with other datasets (Perez-Riverol et al., 2014).
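A small test library with bundled sample data might look like the following sketch, written with Python's standard `unittest` module. The function under test (`reverse_complement`), the sample cases, and the `run_tests` helper are all hypothetical and stand in for a real project's main modules and example datasets.

```python
import unittest


def reverse_complement(sequence: str) -> str:
    """Hypothetical function under test: reverse-complement a DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(sequence.upper()))


# Small sample data shipped with the software: (input, expected output).
SAMPLE_CASES = [
    ("ATGC", "GCAT"),
    ("aacg", "CGTT"),
    ("", ""),
    ("NNA", "TNN"),
]


class ReverseComplementTests(unittest.TestCase):
    def test_sample_cases(self):
        # Each bundled example is checked as its own sub-test.
        for sequence, expected in SAMPLE_CASES:
            with self.subTest(sequence=sequence):
                self.assertEqual(reverse_complement(sequence), expected)

    def test_round_trip(self):
        # Applying the operation twice must give back the original sequence.
        original = "ATGCCGTA"
        self.assertEqual(reverse_complement(reverse_complement(original)), original)

    def test_invalid_base_is_rejected(self):
        # Unknown symbols raise an error instead of being silently dropped.
        with self.assertRaises(KeyError):
            reverse_complement("ATXG")


def run_tests() -> bool:
    """Run the suite and report whether every test passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ReverseComplementTests)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

Running `run_tests()` on a new machine before installation gives a quick check that the main logic behaves the same in that environment, and the sample cases double as worked examples of what users should expect from the software.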

7. The advantages of open-source development

There are several advantages to making a software project open source (Perez-Riverol et al., 2014). In computer science, projects are usually classified into two major categories: open source and proprietary. Being open source means making the code freely available, a simple gesture that can have powerful implications for user projects, especially those that are science-related. One of the greatest advantages of an open-source program is that it is possible to see and understand all of its functionality and every calculation it performs, thus ensuring full transparency. The same cannot be said of proprietary software, whose users are essentially required to have faith in the product’s developer or seller and are unable to criticize or properly know how results are obtained. In general, open source means a greater tendency toward reliability, as anyone can peruse the source code and eventually spot a bug. As such, an open-source project is continually reviewed by the community. When someone spots an error and then corrects it, a patch can be generated and sent to the code maintainer. One of the key aspects of having an open-source project is to provide clarity about how results are generated and can be reproduced (Prlić and Procter, 2012).

8. Final considerations

During the development phase of a software project, adopting best practices in programming involves investing time and effort to better structure ideas as both the code and the documentation are written. Although such investment may at times seem cumbersome, in the long run it benefits both developers and users, and is therefore valuable. In a related vein, another crucial issue is trustworthiness: from the perspective of the scientists using it, a software tool abiding by good practices can provide more confidence as their own projects are developed, which in turn is a key aspect of any work based on data analysis. All of this points toward higher-quality software, since quality ultimately depends on programming practices. The higher the quality of a piece of software, the longer it will live and the more people will use it (Altschul et al., 2013). In this regard, a noteworthy initiative is GMOD Galaxy, an open and integrated workflow system that allows the sharing of customized analyses (Giardine, 2005). Other examples of software following the best practices listed above are TopHat (Trapnell et al., 2009), Bowtie (Langmead et al., 2009), and the BioPerl project (Stajich, 2002).

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References (13 in total)

1. The case for open computer programs.

Authors: Darrel C Ince; Leslie Hatton; John Graham-Cumming
Journal: Nature Date: 2012-02-22 Impact factor: 49.962

2. Computational proteomics pitfalls and challenges: HavanaBioinfo 2012 workshop report.

Authors: Yasset Perez-Riverol; Henning Hermjakob; Oliver Kohlbacher; Lennart Martens; David Creasy; Jürgen Cox; Felipe Leprevost; Baozhen Paul Shan; Violeta I Pérez-Nueno; Michal Blazejczyk; Marco Punta; Klemens Vierlinger; Pedro A Valiente; Kalet Leon; Glay Chinea; Osmany Guirola; Ricardo Bringas; Gleysin Cabrera; Gerardo Guillen; Gabriel Padron; Luis Javier Gonzalez; Vladimir Besada
Journal: J Proteomics Date: 2013-01-29 Impact factor: 4.044

3. The anatomy of successful computational biology software.

Authors: Stephen Altschul; Barry Demchak; Richard Durbin; Robert Gentleman; Martin Krzywinski; Heng Li; Anton Nekrutenko; James Robinson; Wayne Rasband; James Taylor; Cole Trapnell
Journal: Nat Biotechnol Date: 2013-10 Impact factor: 54.908

4. Ten simple rules for the open development of scientific software.

Authors: Andreas Prlić; James B Procter
Journal: PLoS Comput Biol Date: 2012-12-06 Impact factor: 4.475

5. Best practices in bioinformatics training for life scientists.

Authors: Allegra Via; Thomas Blicher; Erik Bongcam-Rudloff; Michelle D Brazas; Cath Brooksbank; Aidan Budd; Javier De Las Rivas; Jacqueline Dreyer; Pedro L Fernandes; Celia van Gelder; Joachim Jacob; Rafael C Jimenez; Jane Loveland; Federico Moran; Nicola Mulder; Tommi Nyrönen; Kristian Rother; Maria Victoria Schneider; Teresa K Attwood
Journal: Brief Bioinform Date: 2013-06-25 Impact factor: 11.622

6. Ten simple rules for reproducible computational research.

Authors: Geir Kjetil Sandve; Anton Nekrutenko; James Taylor; Eivind Hovig
Journal: PLoS Comput Biol Date: 2013-10-24 Impact factor: 4.475

7. Ten recommendations for creating usable bioinformatics command line software.

Authors: Torsten Seemann
Journal: Gigascience Date: 2013-11-13 Impact factor: 6.524

8. Best practices for scientific computing.

Authors: Greg Wilson; D A Aruliah; C Titus Brown; Neil P Chue Hong; Matt Davis; Richard T Guy; Steven H D Haddock; Kathryn D Huff; Ian M Mitchell; Mark D Plumbley; Ben Waugh; Ethan P White; Paul Wilson
Journal: PLoS Biol Date: 2014-01-07 Impact factor: 8.029

9. TopHat: discovering splice junctions with RNA-Seq.

Authors: Cole Trapnell; Lior Pachter; Steven L Salzberg
Journal: Bioinformatics Date: 2009-03-16 Impact factor: 6.937

10. Open source libraries and frameworks for mass spectrometry based proteomics: a developer's perspective.

Authors: Yasset Perez-Riverol; Rui Wang; Henning Hermjakob; Markus Müller; Vladimir Vesada; Juan Antonio Vizcaíno
Journal: Biochim Biophys Acta Date: 2013-03-01
