Improving the usability and archival stability of bioinformatics software.

Serghei Mangul, Lana S Martin, Eleazar Eskin, Ran Blekhman.

Abstract

Implementation of bioinformatics software involves numerous unique challenges; a rigorous standardized approach is needed to examine software tools prior to their publication.

Year:  2019        PMID: 30813962      PMCID: PMC6391762          DOI: 10.1186/s13059-019-1649-8

Source DB:  PubMed          Journal:  Genome Biol        ISSN: 1474-7596            Impact factor:   13.583


The rapid advancement of genomics and sequencing technologies has led to an overwhelming amount and diversity of new analytical algorithms packaged as software tools [1]. Such computational tools have helped life science and medical researchers analyze increasingly complex data, solve difficult biological problems, and lay essential groundwork for novel clinical translation. Indeed, all phases of sequencing data analysis rely on bioinformatics tools [2], from the initial sequencing of the human genome to modern analyses of high-throughput sequencing data. With the increasing importance and popularity of computational and data-enabled approaches among biomedical researchers, it becomes ever more critical to ensure that the developed software is usable [3] and that the uniform resource locator (URL) through which the software tool is accessible is archivally stable. Consistently usable and accessible software provides a foundation for the reproducibility of published biomedical research, defined as the ability to replicate published findings by running the same computational tool on the data generated by the study [4, 5]. In addition, the usability, or "user-friendliness," of software tools is important, and tools developed in academia are often less user-friendly than those developed in industry settings. Given the abundance of tools released each year, these issues can limit a software tool's scientific utility. While the scale of this issue in computational biology has yet to be estimated, the bioinformatics community warns that poorly maintained or implemented tools will ultimately hinder progress in "big data"-driven fields [6].

Successfully implementing and widely distributing scientific software involves numerous unique challenges. First, in academia, software tools are developed by small groups composed of graduate or postdoctoral scholars [7].
These groups are less comprehensively trained and collaborate for shorter durations than software development groups in industry, where holistic teams of specialists support the long-term maintenance of projects. In addition, even a stable online presence provides no guarantee of a software tool's usability, defined as the ability of the user to install and run the tool. Limited usability and archival stability can restrict the applicability of the developed software and ultimately impair our ability to reproduce the results reported in the original publication. In academia, developers often lack formal training in software engineering, particularly in specialized user experience and cross-platform design. Many computational biology software developers lack the resources to provide a user-friendly interface for installing and running their tools. Developing an easy-to-use tool is further complicated by many tools' reliance on third-party software that needs to be installed in advance, called "dependencies." The computational biology field currently lacks a standardized approach for allowing the end user to easily install tools. Lack of institutional support for software development exacerbates these challenges. Although funding agencies are interested in novel computational methods, there is inadequate funding for the continual development and maintenance of existing tools [8]. Even widely used tools can suddenly lose the funding necessary for long-term maintenance, halting or even permanently ending development and availability of the tool. In addition, software development is not incentivized in academic hiring and promotion, processes that focus primarily on publication and funding rather than on the development of software tools and resources.
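One practical way to soften the dependency problem described above is for a tool to check its own third-party requirements before attempting installation. The sketch below is a minimal illustration of such a pre-flight check; the tool names in the list are illustrative assumptions, not taken from the study:

```python
import shutil

# Illustrative dependency list for a hypothetical omics tool;
# these names are assumptions for the example, not from the study.
REQUIRED_TOOLS = ["samtools", "bwa", "python3"]

def check_dependencies(required):
    """Return the subset of required command-line tools that are
    not found on the user's PATH."""
    return [tool for tool in required if shutil.which(tool) is None]

missing = check_dependencies(REQUIRED_TOOLS)
if missing:
    print("Missing dependencies: " + ", ".join(missing))
else:
    print("All dependencies found.")
```

Running a check like this before installation turns an opaque mid-install failure into an actionable message, which is one way to lower the installation barrier that academic tools often present.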
The increasing synergy between computational and wet lab researchers motivated us to perform a survey of the usability and archival stability of omics computational tools and resources [9]. Our empirical analysis of 24,490 omics software resources published from 2000 to 2017 shows that 26% of all omics software resources are not currently accessible through the URLs published in the original paper. Despite a strong decline in the percentage of archivally unstable resources from 2000 to 2017, some 200 archivally unstable resources were still published each year during the study period. Improving the usability and archival stability of computational tools requires a concerted effort among software developers in academia, and our study points to several promising initiatives. We recommend hosting software tools on websites designed to host source code, such as GitHub and SourceForge. In our study, software tool URLs directing the reader to such online repositories had a high rate of accessibility: 99% of the links to GitHub and 96% of the links to SourceForge were accessible, while only 72% of links hosted elsewhere were accessible. The bioinformatics community has used these web-based services since 2001, and the proportion of software tools hosted on these repositories has grown substantially; among the papers included in our study, it rose from 5% in 2012 to 20% in 2017 [9]. Further, our systematic assessment of the installability of published omics software tools suggests several areas of improvement. Among the 99 randomly selected tools included in our analysis, 49% were deemed "difficult to install," meaning installation required more than 15 min, and 28% of the tools could not be installed within the 2-h limit, usually due to implementation problems. Moreover, we found that installability affects the popularity of software tools: successfully installed tools had significantly more citations than tools that we were not able to install within 2 h.
On average, eight commands were required to install the surveyed tools, while user manuals provided an average of only 3.9 commands. Several surveyed software tools were available via a package manager, which allows the user to automate installation, upgrade, and configuration in a consistent way. We found that tools available through well-maintained package managers (e.g., Bioconda [10]) were always installable, while tools shipped without package managers were prone to problems in 32% of the studied cases. As bioinformatics researchers address increasingly complex datasets and problems, our community needs to adopt rigorous and standardized approaches to developing, peer reviewing, and publishing software packages. Many solutions to archival instability are already available, pragmatic, and analogous to existing practices in digital data archiving. For example, hosting bioinformatics software packages on archivally stable services, such as GitHub or SourceForge, greatly improves the long-term accessibility of omics tools. Solutions to unusable packages are more varied. Developers could create and provide an easy-to-use installation interface capable of downloading and installing any required third-party packages (dependencies). Alternatively, developers can wrap their tools in package managers such as Bioconda [10]. Similar to unit and integration testing practices in software engineering, an example dataset with a description of the expected results allows the user to verify that the tool was successfully installed and works properly before running it on experimental data. Journals may need to encourage a rigorous, standardized approach to software usability and accessibility by formally taking these issues into account during the peer review process.
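The example-dataset recommendation above can be sketched as a small smoke test: run the installed tool on a bundled input and compare its output with the recorded expected result. The snippet below is only an illustration of the pattern; it uses the POSIX `sort` command as a stand-in for a bioinformatics tool, since a real package would ship its own example input and expected output:

```python
import os
import subprocess
import tempfile

def smoke_test(command, input_path, expected_output):
    """Run an installed tool on a bundled example dataset and check
    that it exits cleanly and its stdout matches the expected result."""
    result = subprocess.run(command + [input_path],
                            capture_output=True, text=True)
    return result.returncode == 0 and result.stdout == expected_output

# Bundled example dataset: a few unsorted chromosome names.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fh:
    fh.write("chr3\nchr1\nchr2\n")
    example_path = fh.name

# POSIX `sort` stands in here for the tool under test.
ok = smoke_test(["sort"], example_path, "chr1\nchr2\nchr3\n")
print("smoke test passed" if ok else "smoke test failed")
os.unlink(example_path)
```

Shipping such a check alongside the example dataset lets users confirm a working installation in seconds, before committing hours of compute to experimental data.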
Reviewers may require that papers describing software tools include relevant items such as installation scripts, test data, and functions that allow automatic checks of whether the tool can be installed and run. Journals can also provide a complementary top-down strategy to address another growing problem in our field: version control, or maintaining a stable software system that consists of many versions and configurations. For example, forking is a procedure designed to ensure that the version of code cited within an article persists beyond initial publication. Recently, the journal eLife took a major step toward improving archival stability by permanently forking published software to GitHub.

We are witnessing an exciting time in bioinformatics. Each year, rapid advances in omics technologies herald an astonishing number of software tools designed to accommodate increasingly bigger, more complex, and more specialized datasets. Dramatic changes in bioinformatics research and high-throughput computing capabilities can render some tools irrelevant, yet they also provide context for the development of new tools with superior accuracy. Our study highlights the challenges of producing usable and archivally stable bioinformatics software. The current model of computational biology software development encourages researchers to develop and publish novel tools but does not incentivize the maintenance of existing tools. Moreover, in academia, there is little motivation to develop computational tools that are easy to install and use. Nevertheless, results from our study broadly capture the importance of software stability and usability to the growth of computational biology. Our results provide compelling evidence for a concerted, standardized approach to verifying and archiving software, and they highlight the need for funding and resources dedicated to the development and maintenance of software tools in the biomedical research community.
Citing articles: 21 in total

Review 1.  Design and analysis of CRISPR-Cas experiments.

Authors:  Ruth E Hanna; John G Doench
Journal:  Nat Biotechnol       Date:  2020-04-13       Impact factor: 54.908

Review 2.  Software tools, databases and resources in metabolomics: updates from 2018 to 2019.

Authors:  Keiron O'Shea; Biswapriya B Misra
Journal:  Metabolomics       Date:  2020-03-07       Impact factor: 4.290

3.  A comprehensive benchmarking of WGS-based deletion structural variant callers.

Authors:  Varuni Sarwal; Sebastian Niehus; Ram Ayyala; Minyoung Kim; Aditya Sarkar; Sei Chang; Angela Lu; Neha Rajkumar; Nicholas Darfci-Maher; Russell Littman; Karishma Chhugani; Arda Soylev; Zoia Comarova; Emily Wesel; Jacqueline Castellanos; Rahul Chikka; Margaret G Distler; Eleazar Eskin; Jonathan Flint; Serghei Mangul
Journal:  Brief Bioinform       Date:  2022-07-18       Impact factor: 13.994

4.  BioKIT: a versatile toolkit for processing and analyzing diverse types of sequence data.

Authors:  Jacob L Steenwyk; Thomas J Buida; Carla Gonçalves; Dayna C Goltz; Grace Morales; Matthew E Mead; Abigail L LaBella; Christina M Chavez; Jonathan E Schmitz; Maria Hadjifrangiskou; Yuanning Li; Antonis Rokas
Journal:  Genetics       Date:  2022-07-04       Impact factor: 4.402

5.  xGAP: A python based efficient, modular, extensible and fault tolerant genomic analysis pipeline for variant discovery.

Authors:  Aditya Gorla; Brandon Jew; Luke Zhang; Jae Hoon Sul
Journal:  Bioinformatics       Date:  2021-01-08       Impact factor: 6.937

6.  Benchmarking Computational Doublet-Detection Methods for Single-Cell RNA Sequencing Data.

Authors:  Nan Miles Xi; Jingyi Jessica Li
Journal:  Cell Syst       Date:  2020-12-17       Impact factor: 10.304

7.  NASQAR: a web-based platform for high-throughput sequencing data analysis and visualization.

Authors:  Ayman Yousif; Nizar Drou; Jillian Rowe; Mohammed Khalfan; Kristin C Gunsalus
Journal:  BMC Bioinformatics       Date:  2020-06-29       Impact factor: 3.169

Review 8.  Essential guidelines for computational method benchmarking.

Authors:  Lukas M Weber; Wouter Saelens; Robrecht Cannoodt; Charlotte Soneson; Alexander Hapfelmeier; Paul P Gardner; Anne-Laure Boulesteix; Yvan Saeys; Mark D Robinson
Journal:  Genome Biol       Date:  2019-06-20       Impact factor: 13.583

Review 9.  Fostering reproducibility, reusability, and technology transfer in health informatics.

Authors:  Anne-Christin Hauschild; Lisa Eick; Joachim Wienbeck; Dominik Heider
Journal:  iScience       Date:  2021-07-01

Review 10.  The promise and challenge of cancer microbiome research.

Authors:  Sumeed Syed Manzoor; Annemiek Doedens; Michael B Burns
Journal:  Genome Biol       Date:  2020-06-02       Impact factor: 13.583
