Diego Darriba, Tomáš Flouri, Alexandros Stamatakis.
Abstract
With Next Generation Sequencing data being routinely used, evolutionary biology is transforming into a computational science. Thus, researchers have to rely on a growing number of increasingly complex software tools. All widely used core tools in the field have grown considerably in terms of the number of features as well as lines of code and, consequently, also with respect to software complexity. A topic that has received little attention is the software engineering quality of these widely used core analysis tools. Software developers appear to rarely assess the quality of their code, and this can have negative consequences for end-users. To this end, we assessed the code quality of 16 highly cited and compute-intensive tools, mainly written in C/C++ (e.g., MrBayes, MAFFT, SweepFinder) and Java (BEAST), from the broader area of evolutionary biology that are routinely used in current data analysis pipelines. Because the software engineering quality of the tools we analyzed is rather unsatisfying, we provide a list of best practices for improving the quality of existing tools and list techniques that can be deployed for developing reliable, high-quality scientific software from scratch. Finally, we also discuss journal as well as science policy and, more importantly, funding issues that need to be addressed for improving software engineering quality as well as ensuring support for developing new and maintaining existing software. Our intention is to raise the awareness of the community regarding software engineering quality issues and to emphasize the substantial lack of funding for scientific software development.
Year: 2018 PMID: 29385525 PMCID: PMC5913673 DOI: 10.1093/molbev/msy014
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
Table 1. Evaluated Software Packages per Application Domain Including Version Numbers.
| Domain | Software | Version Number |
|---|---|---|
| Phylogenetics | PAML | 4.8 |
| Phylogenetics | PHYML | 20141009 |
| Phylogenetics | MrBayes | 3.2.4-svn(r926) |
| Phylogenetics | RAxML | 8.2.11 |
| Population genetics | MS | Sep 8, 2014 |
| Population genetics | SweepFinder | Feb 23, 2015 |
| Sequence alignment | MAFFT | 7.205 |
| Sequence alignment | T-Coffee | 20141026_23:18 |
| Sequence alignment | Prank | 140603 |
| Divergence times | BEAST | 1.8.0 |
| Divergence times | FDPPDIV | 1.3 |
| Multispecies coalescence | BP&P | 3.0 |
| Sequence simulation | Seq-Gen | 1.3.3 |
| Sequence simulation | INDELible | 1.0.3 |
| De novo assembly | SOAP | r240 |
| De novo assembly | Abyss | 1.5.2 |
| Astrophysics | Gadget-2 | 2.0.7 |
Note: As MS and SweepFinder do not have version numbers, we show the download dates.
Table 2. PAML Components.
| PAML component | LoC (own) | LoC (total) | Major W. | Minor W. | Clang W. | Malloc | Valgrind | Assert |
|---|---|---|---|---|---|---|---|---|
| baseml | 1,304 | 14,212 | 0.0 | 4.6 | 623.0 | NoCast | Clean | 0.0 |
| basemlg | 685 | 13,593 | 7.2 | 4.37 | 452.7 | NoCast | Leaks | 0.0 |
| chi2 | 185 | 185 | 0.0 | 27.0 | 37.85 | NoCast | Clean | 0.0 |
| codeml | 5,309 | 18,217 | 4.7 | 8.5 | 229.62 | NoCast | Clean | 0.0 |
| evolver | 1,123 | 14,031 | 4.5 | 58.8 | 297.6 | NoCast | Leaks | 0.0 |
| mcmctree | 2,970 | 8,079 | 2.4 | 11.1 | 184.1 | NoCast | Clean | 0.0 |
| pamp | 514 | 13,422 | 1.9 | 9.7 | 485.4 | NoCast | Leaks | 0.0 |
| yn00 | 712 | 927 | 4.2 | 50.8 | 315.5 | NoCast | Leaks | 0.0 |
Note: LoC (own) is the number of effective lines of code that belong only to the component. LoC (total) is the total number of effective lines of code for each component, including code shared with other components. Columns “Major W.” and “Minor W.” give the number of major and minor GNU compiler warnings, and “Clang W.” gives the number of clang warnings, all normalized to 1,000 lines of own code. Column “Malloc” reports the malloc() casting error, “Valgrind” the memory behavior, and “Assert” the number of assertions per 1,000 lines of code.
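The per-1,000-lines normalization used in the warning columns can be illustrated with a short sketch. It assumes the rate is simply the raw count divided by the relevant LoC figure and multiplied by 1,000. The function names and the raw count of 5 warnings are hypothetical; the count was chosen only because it reproduces the 27.0 minor warnings per 1,000 lines reported for chi2 (185 own lines).

```python
def per_kloc(count: int, loc: int) -> float:
    """Normalize a raw count to a rate per 1,000 lines of code."""
    if loc <= 0:
        raise ValueError("loc must be positive")
    return 1000.0 * count / loc


def approx_raw_count(rate_per_kloc: float, loc: int) -> int:
    """Invert the normalization; exact only up to the rounding of the published rate."""
    return round(rate_per_kloc * loc / 1000.0)


# Hypothetical raw count: 5 minor warnings in the 185 own lines of chi2.
chi2_minor_rate = round(per_kloc(5, 185), 1)

# Back-calculation from the table: 27.0 warnings per 1,000 lines over 185 lines.
chi2_minor_count = approx_raw_count(27.0, 185)
```

Because small components such as chi2 have few own lines, a handful of warnings already yields a large normalized value, which is worth keeping in mind when comparing rows.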
Table 3. PAML values refer to the parts of the source code that are shared among all individual components listed in table 2.
| Code | Language | LoC | Major W. | Minor W. | Clang W. | Malloc | Valgrind | Assert |
|---|---|---|---|---|---|---|---|---|
| PAML | C | 12,908 | 0.9 | 9.4 | 18.8 | NoCast | Clean | 0.0 |
| PHYML | C | 56,456 | 0.0 | 0.0 | 56.5 | NoCast | Clean | 0.16 |
| MrBayes | C | 94,432 | 0.02 | 0.0 | 9.6 | MisCast | Invalid/leaks | 2.37 |
| RAxML | C | 57,233 | 0.0 | 0.0 | 16.8 | No-Error | Leaks | 17.5 |
| SOAP | C/C++ | 37,020 | 3.9 | 17.0 | 155.5 | NoCast | Leaks | 0.0 |
| Abyss | C | 43,189 | 0.0 | 0.0 | 134.8 | No-Error | Clean | 23.11 |
| MS | C | 2,063 | 4.8 | 10.7 | 62.3 | WrongCast | Leaks | 0.0 |
| SweepFinder | C | 4,465 | 0.0 | 32.3 | 52.4 | NoCast | Clean | 1.56 |
| MAFFT | C | 57,688 | 1.1 | 1.3 | 27.3 | NoCast | Invalid/leaks | 0.0 |
| T-Coffee | C | 160,223 | 2.2 | 3.9 | 34.2 | NoCast | Leaks | 0.44 |
| Prank | C++ | 23,947 | 6.8 | 0.3 | 121.4 | NoCast | Invalid | 9.19 |
| BEAST | Java | 302,611 | 0.07 | 12.5 | N/A | No-Error | N/A | 0.0 |
| FDPPDIV | C++ | 11,474 | 3.0 | 3.5 | 61.7 | No-Error | Leaks | 0.26 |
| BP&P | C | 16,593 | 3.0 | 5.8 | 49.0 | NoCast | Leaks | 0.0 |
| Seq-Gen | C | 3,977 | 0.0 | 1.0 | 51.3 | No-Error | Leaks | 0.0 |
| INDELible | C++ | 11,402 | 0.0 | 22.8 | 182.5 | No-Error | Clean | 0.0 |
| Gadget-2 | C | 12,509 | 0.0 | 2.9 | 48.8 | NoCast | Probably clean | 0.0 |
Note: Column “Language” denotes the programming language and column “LoC” the total number of effective lines of code. Columns “Major W.” and “Minor W.” give the number of major and minor GNU compiler warnings, and “Clang W.” gives the number of clang warnings, all normalized to 1,000 lines of code. Column “Malloc” reports the malloc() casting error and “Valgrind” the memory behavior. We denote the Gadget-2 code as “Probably clean” because we interrupted the valgrind analysis after 30 min of run-time, at which point it had not reported any errors. Finally, column “Assert” gives the number of assertions per 1,000 lines of code.
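The note does not spell out how assertions or effective lines were counted, so the following is only a minimal sketch of how an assertion density of this kind could be estimated for C-like source text. It assumes that an "effective" line is any non-blank line that is not a pure `//` comment, and that an assertion is any `assert(...)` call; both definitions, the function names, and the sample snippet are illustrative assumptions rather than the method behind the table.

```python
import re

# Matches assert( but not static_assert( or the header name assert.h
ASSERT_CALL = re.compile(r"\bassert\s*\(")


def effective_lines(source: str) -> list:
    """Crude stand-in for effective LoC: drop blank lines and pure // comment lines."""
    kept = []
    for raw in source.splitlines():
        line = raw.strip()
        if line and not line.startswith("//"):
            kept.append(line)
    return kept


def assertions_per_kloc(source: str) -> float:
    """Number of assert(...) calls per 1,000 effective lines."""
    lines = effective_lines(source)
    if not lines:
        return 0.0
    n_asserts = sum(len(ASSERT_CALL.findall(line)) for line in lines)
    return 1000.0 * n_asserts / len(lines)


# Hypothetical C snippet: 7 effective lines, 1 assertion.
SAMPLE = """
#include <assert.h>
// compute the mean of n values
double mean(const double *x, int n) {
    assert(n > 0);
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += x[i];
    return s / n;
}
"""

sample_density = assertions_per_kloc(SAMPLE)
```

A real measurement would also need to handle block comments, macros that wrap `assert`, and language-specific assertion mechanisms, none of which this sketch attempts.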
Table 4. Results of a Code Duplication Analysis Using the Simian Tool.
| Code | Lines Checked | Files Checked | Duplicate Lines | Duplication % | Blocks | Files |
|---|---|---|---|---|---|---|
| PAML | 22,200 | 17 | 1,210 | 5.5% | 120 | 11 |
| PHYML | 42,786 | 73 | 5,878 | 13.7% | 549 | 32 |
| MrBayes | 70,680 | 19 | 21,862 | 30.9% | 1,680 | 10 |
| RAxML | 55,873 | 25 | 17,137 | 30.7% | 1,304 | 22 |
| SOAP | 27,514 | 116 | 10,107 | 36.7% | 527 | 72 |
| Abyss | 37,038 | 212 | 4,245 | 11.5% | 441 | 71 |
| MS | 1,718 | 24 | 186 | 10.8% | 21 | 9 |
| SweepFinder | 3,777 | 12 | 293 | 7.8% | 28 | 3 |
| MAFFT | 45,045 | 72 | 28,630 | 63.6% | 1,647 | 59 |
| T-Coffee | 82,758 | 196 | 19,345 | 23.4% | 1,325 | 58 |
| Prank | 16,124 | 67 | 5,318 | 33.0% | 462 | 43 |
| BEAST | 228,316 | 2,336 | 64,024 | 28.0% | 4,786 | 1,151 |
| BP&P | 14,332 | 5 | 502 | 3.5% | 56 | 3 |
| Seq-Gen | 3,244 | 44 | 206 | 6.4% | 25 | 6 |
| INDELible | 9,840 | 7 | 1,954 | 19.9% | 106 | 5 |
| Gadget-2 | 9,770 | 31 | 3,314 | 33.9% | 180 | 31 |
Note: Column “Lines Checked” gives the total number of source lines and “Files Checked” the total number of source files analyzed with Simian. The “Lines Checked” number is not identical to the LoC numbers reported in tables 2 and 3, because the Simian tool does not take header files into account. Column “Duplicate Lines” gives the number of duplicate lines detected, “Duplication %” the relative amount of code duplication, and “Blocks” the total number of contiguous duplicated blocks of code. Finally, column “Files” gives the number of files in which duplicated code was detected.
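As a quick consistency check, the duplication percentage can be recomputed from the other two columns. The sketch below assumes that “Duplication %” is the number of duplicate lines divided by the number of lines checked, rounded to one decimal; the function and variable names are illustrative, and only a subset of rows from the table is included.

```python
# (lines checked, duplicate lines, reported duplication %) copied from the table above
SIMIAN_ROWS = {
    "PAML": (22200, 1210, 5.5),
    "MAFFT": (45045, 28630, 63.6),
    "BEAST": (228316, 64024, 28.0),
    "BP&P": (14332, 502, 3.5),
    "Gadget-2": (9770, 3314, 33.9),
}


def duplication_percent(duplicate_lines: int, lines_checked: int) -> float:
    """Share of checked lines that were flagged as duplicated, in percent."""
    if lines_checked <= 0:
        raise ValueError("lines_checked must be positive")
    return 100.0 * duplicate_lines / lines_checked


# Recompute each percentage and collect rows that disagree with the reported value.
recomputed = {
    name: round(duplication_percent(dup, total), 1)
    for name, (total, dup, _reported) in SIMIAN_ROWS.items()
}
mismatches = [
    name for name, (_total, _dup, reported) in SIMIAN_ROWS.items()
    if recomputed[name] != reported
]
```

For the rows included here the recomputed values agree with the reported ones, so `mismatches` ends up empty.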