BPG: Seamless, automated and interactive visualization of scientific data.

Christine P'ng, Jeffrey Green, Lauren C Chong, Daryl Waggott, Stephenie D Prokopec, Mehrdad Shamsi, Francis Nguyen, Denise Y F Mak, Felix Lam, Marco A Albuquerque, Ying Wu, Esther H Jung, Maud H W Starmans, Michelle A Chan-Seng-Yue, Cindy Q Yao, Bianca Liang, Emilie Lalonde, Syed Haider, Nicole A Simone, Dorota Sendorek, Kenneth C Chu, Nathalie C Moon, Natalie S Fox, Michal R Grzadkowski, Nicholas J Harding, Clement Fung, Amanda R Murdoch, Kathleen E Houlahan, Jianxin Wang, David R Garcia, Richard de Borja, Ren X Sun, Xihui Lin, Gregory M Chen, Aileen Lu, Yu-Jia Shiah, Amin Zia, Ryan Kearns, Paul C Boutros.

Abstract

BACKGROUND: We introduce BPG, a framework for generating publication-quality, highly-customizable plots in the R statistical environment.
RESULTS: This open-source package includes multiple methods of displaying high-dimensional datasets and facilitates generation of complex multi-panel figures, making it suitable for complex datasets. A web-based interactive tool allows online figure customization, from which R code can be downloaded for integration with computational pipelines.
CONCLUSION: BPG provides a new approach for linking interactive and scripted data visualization and is available at http://labs.oicr.on.ca/boutros-lab/software/bpg or via CRAN at https://cran.r-project.org/web/packages/BoutrosLab.plotting.general.

Keywords:  Data-visualization; Interactive plotting; Software; Web-resources

Year:  2019        PMID: 30665349      PMCID: PMC6341661          DOI: 10.1186/s12859-019-2610-2

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169


Background

Biological experiments are increasingly generating large, multifaceted datasets. Exploring such data and communicating observations is, in turn, growing more difficult, and the need for robust scientific data-visualization is accelerating [1-4]. Myriad data visualization tools exist, particularly as web-based interfaces and local software packages. Unfortunately, these often do not integrate easily into R-based statistical pipelines, such as the widely used Bioconductor [5]. Within R, many visualization packages exist, including base graphics [6], ggplot2 [7], lattice [8], Sushi [9], circlize [10], multiDimBio [11], NetBioV [12], GenomeGraphs [13] and ggbio [14]. There is also a broad range of visualization packages focused on specific tasks or analysis-types [15-24]. Some of these lack publication-quality defaults such as high resolution, appropriate label sizing and default colour palettes that suit grey-scale use and remain visible to those with red-green colour-blindness. Many require significant parameterization. Others contain limited plot types, provide limited scope for automatic generation of multi-panel figures or are constrained to specific data-types. Few allow interactive visualization, in which specific plot elements can be highlighted, the parameters available to customize them are identified automatically, and R code is generated interactively through a GUI that shows plot changes in real time. Thus, while each of these visualization packages has significant value and an established user-base, each lacks some features beneficial for computational biologists and data scientists.
Good visualization software must create a wide variety of chart-types in order to match the diversity of data-types available. It should provide flexible parameterization for highly customized figures and allow for multiple output formats while employing reasonable, publication-appropriate default settings, such as producing high-resolution output. In addition, it should integrate seamlessly with existing computational pipelines while also providing an intuitive, interactive mode. It should be possible to move back and forth between pipeline and interactive modes, allowing cyclical development. Finally, good design principles should be encouraged, such as suggesting appropriate color choices and layouts for specific use-cases. To help users quickly gain proficiency, detailed examples, tutorials, real-time interactive plot-tuning and an application programming interface (API) are required. To date, no existing visualization suite fully meets these needs.

Implementation

To address this gap, we have created the BPG (BoutrosLab.plotting.general) library, which is implemented in R using the grid graphics system and lattice framework. It generates a broad suite of chart-types, ranging from common plots such as bar charts and box plots to more specialized plots, such as Manhattan plots (Fig. 1; code is in Additional file 1). These include some novel plot-types, such as the dotmap: a grid of circles inset inside a matrix, allowing representation of four-dimensional data (Fig. 1n). Each plotting function is highly parameterized, allowing precise control over plot aesthetics. The default parameters for BPG produce high-resolution (1600 dpi) TIFF files, appropriate for publication. The output file type is chosen simply by giving the corresponding file extension. Other default values contribute to graphical consistency, including the inclusion of tick marks and the selection of fonts and default colors that work together to create a consistent plotting style across a project. Default values have been optimized to generate high-quality figures, reducing the need for manual tuning. However, even good defaults will not be appropriate for every use-case [15]. Additional file 2: Figure S1 shows a single scatter plot created in four separate graphics frameworks (BPG, base R graphics, ggplot2 and lattice), each with either default or optimized settings. BPG required roughly half as much code as the other frameworks for both default and optimized plots, while producing plots of at least comparable quality (Additional file 3).
Fig. 1

Available chart-types. The basic chart-types available in BPG: a density plot, b boxplot, c violin plot, d segplot, e strip plot, f barplot, g scatterplot, h histogram, i qqplot fit, j qqplot comparison, k Manhattan plot, l polygon plot, m heatmap, n dotmap and o hexbinplot. All plots are based upon the datasets included in the BPG package and code is given in Additional file 1.

To facilitate rapid graphical prototyping, an online interactive plotting interface was created (http://bpg.oicr.on.ca). This interface allows users to easily and rapidly see the results of adjusting parameter values, thereby encouraging precise improvement of plot aesthetics. The R code generated by this interface is also made available for download, as is a methods paragraph allowing careful reporting of plotting options. A public web-interface is available, and local interfaces can be easily created.
One critical feature of BPG is its ability to combine multiple plots into a single figure, a technique used widely in publications. This is accomplished by the create.multiplot function, which automatically aligns plots and standardizes parameters such as line widths and font sizes across all plot elements within the final figure. This replaces the slow and error-prone manual combination of figures using PowerPoint, LaTeX or other similar software, and the time-consuming manual alignment of plot locations directly in R with functions like layout(). The need to combine multiple plots arises from the complexity of datasets: with high-dimensional data, it is often difficult to convey all relevant information within a single chart-type. Combining multiple chart-types allows more in-depth visualization of the data. For example, one plot might convey the number of mutations present in different samples, a second plot could add the proportion of different mutation types, and a third could give sample-level information (Fig. 2). We have included a series of example datasets directly in BPG, including the one used to create the visualization in Fig. 2, and the source-code for creating this plot from these datasets is given in Additional file 4.
Fig. 2

Multiplot example. The create.multiplot function is able to join multiple chart-types together into a single figure. In this example, a central dotmap conveys the somatic mutations present in a selection of genes (y-axis) for a number of colorectal tumours (x-axis), while adjacent barplots and heatmaps provide additional information. Within this central dotmap, shaded cells reflect single nucleotide variants (SNVs), while dots in cells reflect copy number aberrations (CNAs); some patients have both types of aberration in a single gene (shaded cells harbouring a dot). The colour of the cell or dot indicates the specific type of mutation, using the legend on the left. The bottom heatmap shows key clinical information about each patient, including their Sex, the Stage of their disease and their microsatellite status (unstable, MSI; stable, MSS). The barplot to the right shows the percentage of patients with a SNV or a CNA in that gene. The barplot at the top, equivalently, shows the number of SNVs and CNAs for each patient. Finally, the second barplot from the top categorizes all SNVs based on the type of base-change that mutation reflects, showing their proportion as a fraction of the total mutation number. Code used to generate this figure is available in Additional file 4.

A number of utility functions in BPG assist in plot optimization, such as producing legends and covariate bars, or formatting text with scientific notation for p-values. One difficult step in creating figures is the selection of color schemes that are both pleasing and interpretable [25, 26]. BPG provides a suite of 45 color palettes including qualitative, sequential, and diverging color schemes [27], shown in Additional file 5: Figure S2. Optimized color schemes are also provided for specific use cases, including tissue types, chromosomes and mutation types. The default.colors function produces a warning when a requested color scheme is not grey-scale compatible, a common concern for figures reproduced in black and white. Compatibility is determined by converting each color to a grey value between 1 and 100 and flagging differences of < 10 as not grey-scale compatible, which approximates a color scheme's visibility when printed in grey-scale. To facilitate reproducibility, image metadata is automatically generated for all plots, creating descriptors such as software and operating system versions.
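The grey-scale check described above can be illustrated with a short, library-neutral sketch. This is not BPG's code: it is written in Python, the function names (hex_to_rgb, grey_value, is_greyscale_compatible) are hypothetical, the Rec. 601 luma weights used for the grey conversion are an assumption, the grey scale is assumed to run from 0 to 100, and comparing every pair of colors is one possible reading of "differences of < 10".

```python
from itertools import combinations


def hex_to_rgb(colour):
    """Convert a '#RRGGBB' string to an (r, g, b) tuple of 0-255 integers."""
    colour = colour.lstrip("#")
    return tuple(int(colour[i:i + 2], 16) for i in (0, 2, 4))


def grey_value(colour):
    """Map a colour to a grey value on a 0-100 scale.

    Assumption: Rec. 601 luma weights; the source does not state
    which colour-to-grey conversion BPG actually uses.
    """
    r, g, b = hex_to_rgb(colour)
    return (0.299 * r + 0.587 * g + 0.114 * b) / 255 * 100


def is_greyscale_compatible(colours, min_difference=10):
    """Return True if every pair of colours differs by at least
    min_difference grey units (pairwise comparison is an assumption)."""
    greys = [grey_value(c) for c in colours]
    return all(abs(a - b) >= min_difference
               for a, b in combinations(greys, 2))


if __name__ == "__main__":
    # Black, mid-grey and white are well separated in grey-scale.
    print(is_greyscale_compatible(["#000000", "#808080", "#FFFFFF"]))
    # Pure red and a dark grey of similar luma would be hard to tell apart.
    print(is_greyscale_compatible(["#FF0000", "#4C4C4C"]))
```

Under these assumptions, the red and dark-grey pair is flagged because both map to a grey value of about 30, even though the two colors are easy to distinguish on a color display.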

Results

Extensive documentation is provided to help new users learn how to use BPG. To assist researchers in determining which chart-type is appropriate for their dataset, the documentation provides plotting examples derived from a real dataset, and a plotting guide explains the intended use-case of each function. This guide also explains typography, basic color theory and layout design, which help to improve the design of figures [28, 29]. In addition, an online API is available with both simple and complex use-case examples for each plot-type to help users quickly learn the range of functionality available.

Conclusions

BPG has been used in over 60 publications to date (Additional file 6: Table S1) [30-35]. These plotting functions have been integrated into numerous R analysis pipelines for automated figure generation as part of the analysis of large -omic data. The plots created by this package are reproducible and maintain a consistent aesthetic. We believe that BPG will facilitate improved visualization and communication of complex datasets.

Additional files

Additional file 1: Code to generate Fig. 1. (TXT 14 kb)
Additional file 2: Figure S1. Comparison of graphical software options in R. (a-b) are created with base R graphics, (c-d) are created using ggplot2, (e-f) are made in lattice and (g-h) use BPG. The first plot in each pair uses default settings, while the second plot has been adjusted for font sizes, axes ranges, tick mark locations, grid lines, diagonal lines, background shading and highlighted datapoints. The number of lines of code used to create default plots are: 10 for base R, 10 for ggplot2, 14 for lattice, and 5 for BPG. The customized plots use 73 lines for base R, 83 for ggplot2, 86 for lattice, and 42 for BPG. Code for generating this figure is provided in Additional file 3. (TIFF 1590 kb)
Additional file 3: Code to generate Additional file 2: Figure S1. (TXT 6 kb)
Additional file 4: Code to generate Fig. 2. (TXT 11 kb)
Additional file 5: Figure S2. Color palettes. Color palettes are provided using the default.colors function for (a) generic use-cases and force.color.scheme for (b) specific use-cases. This display is generated using the show.available.palettes function. Interactive display of colors is also available using the display.colors function. (TIF 1036 kb)
Additional file 6: Table S1. Publications using BPG. (DOC 67 kb)