Next-generation genome sequencing (NGS) technology is currently at a point where we can sequence genomes with unprecedented scale and accuracy, and it will be only a matter of time before we also have fully integrated, low-cost sequencing sample preparation, essentially being able to load DNA or RNA samples on the sequencer and generate data with the push of a button. Furthermore, with the sequencing cost significantly reduced in recent years, NGS technology has become a commodity technology in biomedical laboratories, and in the very near future it will be fully integrated into every aspect of medical and clinical practice. However, a significant bottleneck still exists, as the adoption of NGS technology for genomic research requires employing bioinformatics experts to deploy complex software and implement the computational infrastructure for bioinformatics data analysis. This is a key barrier to the democratization and broad adoption of the technology, especially for researchers at underrepresented institutions that lack access to bioinformatics core facilities or the financial resources to hire experts for data analysis.

The development by the bioinformatics community of standardized, scalable data analysis software that is easily accessible to non-experts has alleviated some of this bottleneck. These computing platforms have accelerated biomedical research based on NGS data by providing easy access to state-of-the-art data analysis algorithms applicable to a range of genomic applications. 
From the perspective of clinical practitioners and biomedical researchers without access to core facilities or software experts, these platforms require minimal effort during installation; provide an intuitive user interface for running bioinformatics algorithms and mining NGS data for answering basic research hypotheses or discovering clinical targets for genomic medicine; include a simple yet powerful software mechanism for algorithm version tracking, update and configuration; enable easy management of genomic data collections and supporting datasets for bioinformatics analysis, including genome indexes, assemblies and gene variants, among others; and provide a complete analysis and training infrastructure that is accessible and portable across computing platforms ranging from institutional computer clusters to desktop computers in biomedical research laboratories, by leveraging cloud-computing and software virtualization technology.

Published platforms in the field that have these characteristics and leverage cloud computing for bioinformatics include Dugong [1], an implementation of Galaxy in Docker [2], BioPortainers [3], AlgoRun [4], GUIdock [5], DockerBIO [6] and Docker containers released through the Galaxy project [7]. However, the aspects of a platform that help non-experts easily perform NGS data analysis, namely the simplicity of using a desktop program, seamless deployment on various host computational platforms and a rich set of visualizations for mining the NGS data and the output of the bioinformatics pipelines, are found only dispersed across those systems. Specifically, BioPortainers and AlgoRun require users to perform command line installations [8,9], which, besides requiring significant technical expertise, are not available on all operating systems. In the case of GUIdock and especially DockerBIO, a significant effort has been made to provide easy installation and a user-friendly software interface. 
However, the former system targets only a single application (cellular network analysis), while the latter depends on community developers of the Docker containers to provide the software and define the user interface, and neither implements visualization capabilities. The Galaxy project has been widely adopted by the community, but its learning curve varies by user, and it is an open-ended system intended for developing online bioinformatics portals or for use at sequencing centers where bioinformatics developers can customize it with pre-configured pipelines [10]. Finally, the Cloud BioLinux project [11] demonstrated early on that cloud-computing technology can increase the usability of bioinformatics software for non-experts, and Cloud BioLinux has been used for distributing pre-configured, accessible and ready-to-use bioinformatics software. While the funding for the project has expired and the virtual machines are not actively maintained, the full open source code for Cloud BioLinux is still available through the project’s website, enabling users to use and improve the functionality in accordance with their specific application. Improving upon this technology, the BioDocklets [12] project demonstrated that further abstraction of complex bioinformatics operations can be achieved, helping non-experts to execute complex bioinformatics data analysis pipelines with the simplicity of running a desktop software program.

In addition to the core bioinformatics data analysis pipelines, many of these platforms also provide intuitive visualizations of the data, something required for lowering the barrier to genomic discoveries from NGS data. The visualizations can run, for example, on desktop computers or smartphone/tablet computing platforms, which clinicians and researchers can easily use to make data-driven genomic discoveries. 
Some implementations also leverage the latest web development and smartphone application programming technologies, in order to provide a bioinformatics visualization system that is self-contained and runs without any required installation or external software dependencies. Supplementary Figure 1 shows the Visual Omics Explorer (VOE) [13], where users can simply load the application in a web browser or access it as a smartphone/tablet app without any remote server connection. VOE visualizations are fully integrated with the bioinformatics pipelines and are easily shared or exported as publication-ready graphics, supporting integrative analysis of NGS data generated from users' experiments with data available in public databases such as The Cancer Genome Atlas (TCGA) or the ENCODE project [14,15].

Users of such bioinformatics platforms will have variable knowledge of genomic technologies, so they will need to complement their backgrounds accordingly in order to perform bioinformatics data analysis efficiently. Therefore, a set of targeted training materials should be provided for all bioinformatics analysis pipelines, visualizations and related tasks to be performed using these platforms, covering a range of genomic technologies from metagenomics to cancer variant discovery. These could be structured as virtual short courses published using online teaching platforms such as Coursera [16-20] or MIT OpenCourseWare [21]. 
Following familiarization with a specific platform and corresponding training for their specific type of data, biomedical and clinical researchers should be able to easily perform integrative bioinformatics analysis without any prior bioinformatics expertise, achieving proficiency at performing analysis of large-scale genomic datasets with a small investment of time and resources.

In recent years, throughput from genomic sequencing projects has grown exponentially, following the constant drop in the cost of the technology in addition to broad access to sequencing services. As a result, while genomic data released from large, publicly funded sequencing projects have reached the petabyte scale in size, data interpretation and extraction of scientific value by the research community still face significant bottlenecks. Sequencing instruments are typically bundled with only minimal computational and storage capacity, sufficient for data capture during runs of the instrument, and complex bioinformatics analyses are required for post-processing and for generating scientific insights from the raw sequencing data. These analyses involve bioinformatics specialists and software engineers with specific technical skills and training; additionally, access to significant computational capacity is needed in order to process and store large-scale genomic datasets. Research laboratories in smaller, underrepresented institutions experience a significant bottleneck in finding the financial and time resources to put a bioinformatics infrastructure and teams of specialists in place, preventing them from participating fully in the genomics revolution and from having equal opportunities for scientific discoveries. 
A solution for alleviating this bottleneck has been the publication of genomic data analysis platforms by the bioinformatics community, providing easy access to pre-configured software that reduces the funds required to hire bioinformatics specialists and enables these institutions to budget the analysis as a fixed cost. A key factor in enabling this was the availability of cloud-computing services, which further helped democratize the bioinformatics field by providing access to the ample computational capacity required for genome data analysis to smaller, independent laboratories as well as large-scale bioinformatics core facilities.

In conclusion, current and future genomic data analysis platforms will enable biomedical researchers and clinical practitioners at underfunded, minority universities and research institutions to integrate NGS and bioinformatics as a standard component of biomedical research and clinical applications. With further development by the research community of software platforms that fully abstract the complexity of bioinformatics operations, we can remove the data analysis bottleneck and contribute to democratizing access to genomic-based, data-centric research for investigators at these underrepresented institutions.