bigPint: A Bioconductor visualization package that makes big data pint-sized.

Abstract

Interactive data visualization is imperative in the biological sciences. The visualization community has pursued the development of independent layers of interactivity. We developed bigPint, a data visualization package available on Bioconductor under the GPL-3 license (https://bioconductor.org/packages/release/bioc/html/bigPint.html). Our software introduces new visualization technology that enables independent layers of interactivity using Plotly in R, which aids in the exploration of large biological datasets. The bigPint package presents modernized versions of scatterplot matrices, volcano plots, and litre plots through the implementation of layered interactivity. These graphics have detected normalization issues, differential expression designation problems, and common analysis errors in public RNA-sequencing datasets. Researchers can apply bigPint graphics to their data by following recommended pipelines written in reproducible code in the user manual. In this paper, we explain how we achieved the independent layers of interactivity that are behind bigPint graphics. Pseudocode and source code are provided. Computational scientists can leverage our open-source code to expand upon our layered interactive technology and/or apply it in new ways toward other computational biology tasks.

Year: 2020 PMID: 32542031 PMCID: PMC7347224 DOI: 10.1371/journal.pcbi.1007912

Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475

This is a PLOS Computational Biology Software paper.

Introduction

Interactive data visualization is increasingly imperative in the biological sciences [1]. When performing RNA-seq studies, researchers wish to determine which genes are differentially expressed between treatment groups. Interactive visualization can help them assess differentially expressed gene (DEG) calls before performing any subsequent functional enrichment analyses. New visualization tools for genomic data have incorporated interactive capabilities, and some believe this trend could enhance the exploration of genomic data in the future [2]. Despite the growing appreciation of the inherent value of interactive graphics, the availability of effective and easy-to-use interactive visualization tools for RNA-seq data remains limited. Interactive visualization tools for genomic data can have restricted access when only available on certain operating systems and/or when requiring payment [3-5]. These limitations can be removed when tools are published on open-source repositories. Indeed, the Bioconductor project aims to foster interdisciplinary scientific research by promoting transparency and reproducibility while allowing software content to be used on Windows, MacOS, and Linux [6]. Bioconductor software is written in the R programming language, which also provides statistical and visualization methods that can facilitate the development of robust graphical tools [7]. Several interactive visualization methods for genomic data have been developed using Shiny, which is also based on the R programming language [8-10]. We recently developed bigPint, an interactive RNA-sequencing data visualization software package available on Bioconductor. In the current paper, we will now explain the technical innovations and merits of the bigPint package, including new interactive visualization techniques that we believe can be helpful in the development and usage of future biological visualization software.

Design and implementation

Quick start

For users who would like to immediately try out the package hands-on and apply bigPint graphics to their data, we recommend consulting the example pipeline (https://lindsayrutter.github.io/bigPint/articles/pipeline). This pipeline uses reproducible code and sample data from the bigPint package, so users can follow along with each line of example code. For additional details, we recommend that users view the articles in the Get Started tab on the package website (https://lindsayrutter.github.io/bigPint).

Basic input

Users can choose whether to input their data in SummarizedExperiment format using a single parameter dataSE or input their data as a combination of a parameter data and dataMetrics. In general, the data format corresponds to the assay(SummarizedExperiment) format and contains the read counts for all genes of interest. The value in row i and column j should indicate how many reads have been assigned to gene i in sample j. This is the same input format required in popular RNA-seq count-based statistical packages, such as DESeq2, edgeR, limma, EBSeq, and BaySeq [11-15]. In general, the dataMetrics object corresponds to the rowData(SummarizedExperiment) format and should be a subset of the data (usually DEGs) where each case includes quantitative values of interest (such as fold change and FDR). This information can be easily derived from popular RNA-seq numerical analysis packages. Again, this framework allows users to work smoothly between visualizations in the bigPint package and models in other Bioconductor packages, complying with the belief that the most efficient way to analyze large datasets is to iterate between models and visualizations.
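As a rough illustration of the two inputs described above, the sketch below models the count table and the metrics table with plain Python structures. It is not bigPint code (the package itself is written in R). The gene IDs, sample names, metric names (`logFC`, `FDR`), and all values are made up for illustration.

```python
# Hypothetical count table: one row per gene, one column per sample.
# counts[i][j] = number of reads assigned to gene i in sample j.
samples = ["A.1", "A.2", "B.1", "B.2"]   # made-up sample names
genes = ["gene1", "gene2", "gene3"]      # made-up gene IDs
counts = [
    [10, 12, 95, 101],
    [50, 48, 52, 49],
    [0, 3, 1, 2],
]

# Hypothetical metrics table: a subset of the genes (usually DEGs),
# each with quantitative values of interest such as fold change and FDR.
data_metrics = {
    "gene1": {"logFC": 3.1, "FDR": 0.001},
}


def reads(gene, sample):
    """Look up the read count for one gene in one sample."""
    return counts[genes.index(gene)][samples.index(sample)]


def metrics_are_subset(metrics, gene_ids):
    """Check that every gene in the metrics table is also in the count table."""
    return set(metrics) <= set(gene_ids)
```

Keeping the metrics table keyed by the same gene IDs as the count table is what lets a plot look up the counts for any gene flagged in the metrics.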

Original features

Independent layers of interactivity

The Bioconductor community advanced the boundaries of biological visualization in the past and generally believes that modern interactive technology must be incorporated to continue these advancements [6]. We will define the term geom-drawing interactivity to indicate user queries that draw geoms (graphical representations of the data, such as lines, hexagons, and points). This could mean the user adjusts sliders or selects buttons to draw a subset of the data from the database as geoms (such as points). We will define the term geom-manipulating interactivity to indicate user queries that alter already-drawn geoms. This could mean the user hovers over a geom (such as a hexagon) and obtains its associated metadata (such as the names of its contained genes). It could also mean the user zooms and pans to further alter how already-drawn geoms are displayed. Our package introduces what we believe is a fairly new interactive visualization technology that is useful in the exploration of large biological datasets. Our technique allows for two independent layers of interactivity, for the foreground and background of the plot respectively. Each layer can include both geom-drawing and geom-manipulating interactivity. Our new technology can enhance the exploration of large datasets, especially in cases where one layer contains large amounts of data (such as the full dataset) and the other layer contains smaller amounts of data (such as a data subset). Because the layers are independent, users can save time and computation by keeping the layer with more data unaltered while only redrawing the layer with less data. The idea of multi-layered interactive graphics in R was raised in [16] and a pilot implementation was made using qt software to create different listeners for different layers, an approach that is no longer functional. 
While the concept of independent layers of interactivity in R is not new in itself, the technology we introduce is new and solves a difficult problem that has been raised before. We achieved our independent double-layered interactivity using htmlwidgets [17], ggplot2 [18], shiny [19], JavaScript, and plotly [20]. In general, we first converted a static ggplot2 object into an interactive plotly object using ggplotly(). To overlay an additional layer of interactivity to the plotly object, we used the onRender() method of the htmlwidgets package [17]. This method allowed us to overlay an interactive foreground layer via plotly traces while the original interactive background layer of the plotly object did not need to be redrawn, something that cannot foreseeably be achieved with the native onRender() method of the plotly package [20]. Specifically, the htmlwidgets onRender() method contains three input parameters: an HTML Widget object, a character vector containing JavaScript code, and a list of R objects that can be serialized to JSON format. To develop our technique, we specified our plotly object as the HTML Widget object, which allowed for an interactive background. Within the method, we wrote JavaScript code that enabled interactive foregrounds to be updated without redrawing the interactive background. We used the R object list to transfer count tables and DEG lists into the method. In some of our applications, users can link between the layers of different interactive plots. This functionality was achieved by sending custom messages between the Shiny software and the JavaScript code within the htmlwidgets method [19]. The current paper can benefit biology researchers and developers alike. Developers who would like to modify our code can access pseudocode and documented code separately in the supplementary material (see S1 Table).
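The core idea, that a foreground layer can be replaced while the background layer is left alone, can be sketched independently of any plotting library. The following is a conceptual model only, not the htmlwidgets and plotly implementation described above. The class and method names are hypothetical, and simple counters stand in for the cost of drawing.

```python
class LayeredPlot:
    """Toy model of a plot with independent background and foreground layers."""

    def __init__(self, background_geoms):
        self.background = list(background_geoms)
        self.foreground = []
        self.background_draws = 0
        self.foreground_draws = 0
        self._draw_background()

    def _draw_background(self):
        # Expensive step: stands in for drawing every background geom.
        self.background_draws += 1

    def set_foreground(self, geoms):
        # Cheap step: only the foreground layer is replaced.
        self.foreground = list(geoms)
        self.foreground_draws += 1

    def hover_background(self, index):
        # The background stays queryable without being redrawn.
        return self.background[index]


# Made-up background of 1000 hexagons, each holding some number of genes.
plot = LayeredPlot({"hexagon": i, "genes": i * 2} for i in range(1000))
plot.set_foreground(["gene1", "gene2"])  # first overlay
plot.set_foreground(["gene3"])           # second overlay replaces the first
info = plot.hover_background(5)
```

After two foreground updates the background has still been drawn only once, which is the saving the two-layer design is meant to provide when the background holds the full dataset.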
For biology researchers who would like to readily apply our visualization tools to their data, didactic materials (figures, applications, and videos) are available for each plotting type in Table 1. We will now briefly explain how our two-layered interactivity method can improve upon several of the RNA-seq visualization tools in our package.

Table 1

Resources for users about bigPint interactive graphics.

Plot	Figure	Video	Application
Scatterplot matrix	Figs 1–4	bit.ly/spmVid	bit.ly/spmApp
Litre plot	Figs 5–8	bit.ly/litreVid	bit.ly/litreApp
Volcano plot	Figs 9–12	bit.ly/volcVid	bit.ly/volcApp
Parallel coordinate	Figs 13–16	bit.ly/pcpVid	bit.ly/pcpApp

Scatterplot matrices

Scatterplot matrices have appeared in statistical graphics literature for almost four decades and have been used across various fields of multivariate research [21-24]. Previous user studies have shown that participants performed better when using animated rather than static versions of scatterplot matrices. Users also preferred animated scatterplot matrices and found them easier to understand as they can alleviate overplotting issues [25]. Rendering scatterplot matrices interactive is promising but challenging with large datasets [26]. The number of background geoms that need to be drawn grows quadratically with the number of dimensions: n-dimensional data corresponds to n² scatterplots. Our two-layered interactive visualization technology mitigates this problem by allowing details of interest to be superimposed in the foreground while the massive number of geoms in the background does not require redrawing. See Tables 1 and 2 and Figs 1–4 for details about our interactive scatterplot matrices.
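To make the panel count concrete, the short sketch below enumerates the panels of a full scatterplot matrix for a handful of samples. The sample names are made up, and the function name is hypothetical.

```python
from itertools import product


def scatterplot_panels(samples):
    """List every (row, column) panel of a full scatterplot matrix."""
    return list(product(samples, repeat=2))


samples = ["A.1", "A.2", "B.1", "B.2"]  # made-up sample names
panels = scatterplot_panels(samples)

# Panels that plot two different samples against each other.
off_diagonal = [(row, col) for row, col in panels if row != col]
```

With four samples there are already 16 panels, 12 of them off the diagonal, and each of those panels would contain one geom per gene (or per hexagon) in the background.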

Table 2

Examples of independent layers of interactivity.

Plot	Layer	Geom-drawing interactivity	Geom-manipulating interactivity
Scatterplot matrix	Background	None	User hovers over background hexagons to view gene counts
Scatterplot matrix	Foreground	User clicks on background hexagon to draw corresponding genes as foreground points. Background layer does not need to be redrawn	User hovers over foreground points to view gene names
Litre plot	Background	User uses Shiny buttons to specify treatment pairs and hexagon sizes for drawing background hexagons	User hovers over background hexagons to view gene counts
Litre plot	Foreground	User uses Shiny buttons to specify metric, metric order, and point size for drawing foreground points. Background layer does not need to be redrawn	User hovers over foreground points to view gene names
Volcano plot	Background	User uses Shiny buttons to specify treatment pairs and hexagon sizes for drawing background hexagons	User hovers over background hexagons to view gene counts
Volcano plot	Foreground	User uses Shiny buttons to specify point size, log fold changes, p-values to draw foreground points. Background layer does not need to be redrawn	User hovers over foreground points to view gene names

Fig 1

Step 1: Independent interactive layers of scatterplot matrix.

First step in a four-part example series of user actions. User hovers over background hexagon to determine it contains two genes.

Fig 2

Step 2: Independent interactive layers of scatterplot matrix.

Second step in a four-part example series of user actions. User clicks on background hexagon to overlay the two corresponding genes as orange points in the foreground layer of each scatterplot. The computationally-expensive background layer of hexagons does not need to be redrawn.

Fig 3

Step 3: Independent interactive layers of scatterplot matrix.

Third step in a four-part example series of user actions. The background layer of hexagons remains interactive and the user can still hover over another hexagon of interest to determine it contains 40 genes.

Fig 4

Step 4: Independent interactive layers of scatterplot matrix.

Fourth step in a four-part example series of user actions. User clicks on background hexagon to overlay the 40 corresponding genes as orange points in the foreground layer of each scatterplot. This step does not require the computationally-expensive background layer of hexagons to be redrawn.

Litre plots

Problems still remain when scatterplot matrices are applied to large datasets. Physical space requirements increase quadratically with the number of dimensions. Hence, when extended to large dimensions, it becomes difficult to mentally link many small plots within the matrix [27]. Several techniques have been proposed to ameliorate this problem. Three-dimensional scatterplots are useful but can cause occlusion and depth perception issues [27]. Other techniques like grand tours [28], projection pursuits [29, 30], and scagnostics [31] have been proposed. Even though these alternative techniques are useful, they may not simultaneously display distributions across all cases (genes) and variables (samples). We generally want to compare replicate and treatment variability in RNA-seq data, which can be visually accomplished by plotting all genes and samples. We also want to superimpose DEGs to determine how their read count variability compares to that of the whole dataset. In light of this, we developed a plot that collapses the scatterplot matrix onto one Cartesian coordinate system, allowing users to superimpose all read counts from one DEG of interest onto all read counts of all genes in the dataset. We call this new plot a repLIcate TREatment (“litre”) plot. An in-depth explanation of the litre plot can be found in our previous methods paper [32]. We believe our two-layered interactive visualization method is an indispensable component of the litre plot. Drawing the background (all genes in the dataset) is the time-limiting step, whereas drawing the foreground (one DEG of interest) is immediate. Most users would like to superimpose DEGs from a list one by one onto the background. This process would be unnecessarily time-prohibitive if the background needed to be redrawn each time the user progressed to the next DEG. Fortunately, our technology allows the user to immediately redraw the interactive foreground (the DEG of interest) while the background (all genes in the data) remains unchanged and retains its interactive capabilities. See Tables 1 and 2 and Figs 5–8 for details about our interactive litre plots.
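The one-by-one overlay of DEGs can be sketched as a small stepper over a list of genes ordered by a chosen metric, in the spirit of the "Plot gene" button described in the litre plot figure captions. This is a minimal sketch rather than the package's code: the class and function names are hypothetical and the FDR values are made up.

```python
def order_genes(metrics, metric="FDR", increasing=True):
    """Return gene IDs sorted by the chosen metric."""
    return sorted(metrics, key=lambda gene: metrics[gene][metric],
                  reverse=not increasing)


class GeneStepper:
    """Overlay one gene at a time; the background is never touched here."""

    def __init__(self, ordered_genes):
        self._genes = list(ordered_genes)
        self._position = 0
        self.foreground = None  # gene currently drawn in the foreground

    def plot_gene(self):
        # Advance to the next gene in the list, if any remain.
        if self._position < len(self._genes):
            self.foreground = self._genes[self._position]
            self._position += 1
        return self.foreground


# Made-up metrics for three hypothetical DEGs.
metrics = {
    "geneA": {"FDR": 0.02},
    "geneB": {"FDR": 0.0001},
    "geneC": {"FDR": 0.004},
}
stepper = GeneStepper(order_genes(metrics, metric="FDR", increasing=True))
first = stepper.plot_gene()   # gene with the lowest FDR
second = stepper.plot_gene()  # gene with the second-lowest FDR
```

Each call changes only the foreground gene, so the cost of a step does not depend on how many genes sit in the background.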

Fig 5

Step 1: Independent interactive layers of litre plot.

First step in a four-part example series of user actions. User uses Shiny buttons to specify treatment pairs (N and P) and hexagon size (10) for drawing background hexagon layer. User can hover over hexagon of interest to determine it contains 19 genes.

Fig 6

Step 2: Independent interactive layers of litre plot.

Second step in a four-part example series of user actions. User uses Shiny buttons to specify metric (FDR) and metric order (Increasing) to establish the order in which genes will be overlaid as pink points in the foreground layer. User clicks “Plot gene” button and the gene with the lowest FDR value (Glyma.19G168700.Wm82.a2.v1) is overlaid. The background layer of hexagons does not need to be redrawn.

Fig 7

Step 3: Independent interactive layers of litre plot.

Third step in a four-part example series of user actions. User clicks “Plot gene” button again and the gene with the second-lowest FDR value (Glyma.13G293500.Wm82.a2.v1) is overlaid. This step does not require the background layer of hexagons to be redrawn.

Fig 8

Step 4: Independent interactive layers of litre plot.

Fourth step in a four-part example series of user actions. User can zoom and pan on the layers using the Plotly Modebar.

Volcano plots

Volcano plots draw significance and fold change on the vertical and horizontal axes respectively. In RNA-seq studies, volcano plots allow users to check that genes were not falsely deemed significant due to outliers, low expression levels, and batch effects [33]. Researchers benefit from the ability to quickly identify individual gene names in the volcano plot. This was previously achieved with the identify() method in R, which identifies the closest point in a scatterplot to the position nearest the mouse click [33]. The interactive volcano plot in bigPint can identify individual gene names in a less ambiguous fashion by responding to users hovering directly over corresponding points. It also improves upon traditional volcano plots by allowing users to threshold on statistical values in order to immediately update the superimposed gene subset without having to redraw the more computationally-heavy background that contains all genes. See Tables 1 and 2 and Figs 9–12 for details about our interactive volcano plots.
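The thresholding step can be illustrated with a small sketch that places genes on volcano axes and selects the foreground subset. It assumes a common convention (absolute log fold change at or above a cutoff and p-value at or below a cutoff, with significance drawn as the negative base-10 logarithm of the p-value); the exact rule and scale used in the package may differ. Function names and values are hypothetical.

```python
import math


def volcano_point(log_fc, p_value):
    """Horizontal axis: log fold change. Vertical axis: -log10(p-value)."""
    return (log_fc, -math.log10(p_value))


def foreground_subset(genes, min_abs_log_fc, max_p_value):
    """Return IDs of genes passing both thresholds (assumed convention)."""
    return [
        gene
        for gene, (log_fc, p_value) in genes.items()
        if abs(log_fc) >= min_abs_log_fc and p_value <= max_p_value
    ]


# Made-up (log fold change, p-value) pairs for four hypothetical genes.
genes = {
    "geneA": (2.5, 0.0001),
    "geneB": (0.3, 0.2),
    "geneC": (-3.0, 0.01),
    "geneD": (1.2, 0.00001),
}

subset = foreground_subset(genes, min_abs_log_fc=2.0, max_p_value=0.05)
stricter = foreground_subset(genes, min_abs_log_fc=2.0, max_p_value=0.001)
```

Changing the thresholds only recomputes the (usually small) foreground subset; the background positions of all genes are unaffected.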

Fig 9

Step 1: Independent interactive layers of volcano plot.

First step in a four-part example series of user actions. User uses Shiny buttons to specify treatment pairs (N and P) and hexagon size (9) for drawing background hexagon layer. User can hover over hexagon of interest to determine it contains 1 gene.

Fig 10

Step 2: Independent interactive layers of volcano plot.

Second step in a four-part example series of user actions. User uses Shiny buttons to specify log fold change and p-value thresholds. User clicks “Plot gene subset” button and the subset of genes that pass the thresholds are overlaid in the foreground layer as pink points. The background layer of hexagons does not need to be redrawn. User hovers over foreground point to view gene name (Glyma.19G168700.Wm82.a2.v1).

Fig 11

Step 3: Independent interactive layers of volcano plot.

Third step in a four-part example series of user actions. User uses Shiny buttons to increase point size from 2 to 3. Foreground layer of pink points are increased in size and the background layer of hexagons does not need to be redrawn.

Fig 12

Step 4: Independent interactive layers of volcano plot.

Fourth step in a four-part example series of user actions. User uses Shiny buttons to update threshold values and again presses “Plot gene subset” button. The subset of genes that pass the new thresholds are overlaid in the foreground layer as pink points. The background layer of hexagons does not need to be redrawn.

Consecutive box selection

The bigPint package provides interactive tools for consecutive box selection. A box selection is a rectangular query drawn directly on a two-dimensional graph. Users can specify a box selection by clicking on the desired starting point of the rectangular query and dragging the mouse pointer to the desired opposite corner point of the rectangular query. This procedure for generating rectangles is widely used in interactive programs and should be familiar to most users [34]. After the user releases the mouse, the query is processed and only the data cases that were inside the specified rectangle remain. More precisely, a data case remains in a box selection queried between (x1, y1) and (x2, y2) if every point on its line within x1 ≤ x ≤ x2 also lies within y1 ≤ y ≤ y2 (where y2 ≥ y1 and x2 ≥ x1). The user can specify consecutive queries with multiple box selections. The consecutive box selection model is convenient in cases where identical thresholds are desired over adjacent features. In these cases, a single box selection of width w can be used to simultaneously query the same threshold across w features. This process is an improvement over single-feature box selection widgets, where w individual queries would be required [34]. Consecutive box selection may have originally been designed for time series data, but has since proven useful for detecting patterns in gene expression data. Combined with parallel coordinate plots, the consecutive box selection technique has been used to elicit candidate regulatory splice sequences showing high values at some positions and low values at other positions [34]. In RNA-seq, this technology can also be used to investigate differential expression showing high read counts for one treatment group and low read counts for another treatment group, requiring a consecutive query. Consecutive box selection tools have been published for gene expression analysis software that was restricted to certain operating systems [34]. We believe that publishing consecutive box selection tools in a platform like R can be useful for computational biologists using various operating systems. See Tables 1 and 2 and Figs 13–16 for details about our interactive parallel coordinate plots that feature consecutive box selection.
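The box selection rule stated above can be written as a short predicate, with consecutive queries applied one after another to whatever remains. This is a minimal sketch under the assumption that features (samples) sit at integer positions 0, 1, 2, ... along the horizontal axis of the parallel coordinate plot; the function names and data are hypothetical.

```python
def in_box(values, x1, x2, y1, y2):
    """True if the line stays within [y1, y2] at every feature in [x1, x2]."""
    return all(y1 <= values[x] <= y2 for x in range(x1, x2 + 1))


def consecutive_box_select(lines, boxes):
    """Apply each box in turn; only lines inside every box remain."""
    remaining = dict(lines)
    for x1, x2, y1, y2 in boxes:
        remaining = {
            gene: values
            for gene, values in remaining.items()
            if in_box(values, x1, x2, y1, y2)
        }
    return remaining


# Made-up values for six samples: positions 0-2 are one treatment group,
# positions 3-5 are the other.
lines = {
    "geneA": [1, 2, 1, 9, 8, 9],
    "geneB": [5, 5, 5, 5, 5, 5],
    "geneC": [2, 1, 2, 3, 2, 2],
    "geneD": [0, 1, 1, 7, 9, 8],
}

# First box: low values across the first three samples (width w = 3).
# Second box: high values across the last three samples.
boxes = [(0, 2, 0, 3), (3, 5, 6, 10)]
selected = consecutive_box_select(lines, boxes)
```

Here two boxes of width three replace what would otherwise be six single-feature queries, and the result keeps only the genes that are low in one group and high in the other.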

Fig 13

Step 1: Consecutive box selection in parallel coordinate plot.

First step in a four-part example series of user actions. User selects the Box Select tool from the Plotly Modebar.

Fig 14

Step 2: Consecutive box selection in parallel coordinate plot.

Second step in a four-part example series of user actions. User specifies box selection by drawing a rectangular query. Only the genes (pink lines) inside the specified rectangle remain.

Fig 15

Step 3: Consecutive box selection in parallel coordinate plot.

Third step in a four-part example series of user actions. User can hover over a gene of interest (pink line) to view its name (Glyma.11G216300.Wm82.a2.v1).

Fig 16

Step 4: Consecutive box selection in parallel coordinate plot.

Fourth step in a four-part example series of user actions. User can zoom and pan on the plot using the Plotly Modebar.

Useful features

Tailoring and saving static plots

Static plots can be saved as list objects in the R workspace and/or as JPG files to a directory chosen by the user. Saving plots into the R workspace allows users to integrate them into analysis workflows. It also allows them to tailor the plots (such as adding titles and changing label sizes) using the grammar of graphics via the conventional + syntax. Saving plots to a directory allows users to keep professional-looking files that can be inserted into proposals and talks. By default, the bigPint package saves static plots both in the R workspace and a directory (the default location is tempdir()).

Second feature layer

Both static and interactive plots allow for a subset of data to be plotted in a different manner than the full dataset. When analyzing RNA-seq data, this second feature layer could represent DEGs. There are three options for creating data subsets with static plots. First, users can threshold the previously-mentioned dataMetrics object by one of its quantitative variables. Second, users can simply declare a geneList object that contains the list of data subset IDs. Third, the user can simply leave the dataMetrics and geneList objects to their default value of NULL and not overlay any data subsets.
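The three options for choosing the overlaid subset can be summarized in a few lines. This is an illustrative sketch, not the package's logic: the function name, the `metric` and `threshold` parameters, the "less than" comparison, and the precedence of an explicit gene list over thresholding are all assumptions.

```python
def select_subset(data_metrics=None, gene_list=None, metric="FDR", threshold=0.05):
    """Return the gene IDs to overlay as the second feature layer."""
    if gene_list is not None:
        # Option 2: the user supplies the subset IDs directly.
        return list(gene_list)
    if data_metrics is not None:
        # Option 1: threshold the metrics table on one quantitative variable.
        return [gene for gene, row in data_metrics.items() if row[metric] < threshold]
    # Option 3: both left at their default of None, so nothing is overlaid.
    return []


# Made-up metrics for two hypothetical genes.
data_metrics = {"geneA": {"FDR": 0.01}, "geneB": {"FDR": 0.2}}

by_threshold = select_subset(data_metrics=data_metrics)
by_list = select_subset(gene_list=["geneB"])
no_overlay = select_subset()
```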

Group comparison filters

When users create static plots, the package automatically creates a separate plot for each pairwise combination of treatment groups from the inputted data. When users explore interactive plots, fields are dynamically generated from the inputted data so that any pairwise combination of treatment groups can be selected by buttons. Users can then quickly flip between contrasts in their data. The bigPint package comes with an example soybean cotyledon dataset that has three treatment groups, which is used across several easy-to-follow articles on the package website. These assets can assist users who have data containing more than two treatment groups.
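Enumerating the pairwise contrasts is straightforward once the treatment groups are known. The sketch below assumes, purely for illustration, that sample names follow a "group.replicate" pattern such as "A.1"; the group labels and function name are hypothetical.

```python
from itertools import combinations


def treatment_groups(sample_names):
    """Extract group labels, in order of first appearance."""
    groups = []
    for name in sample_names:
        group = name.split(".")[0]  # assumes "group.replicate" naming
        if group not in groups:
            groups.append(group)
    return groups


sample_names = ["A.1", "A.2", "B.1", "B.2", "C.1", "C.2"]  # made-up samples
groups = treatment_groups(sample_names)

# One plot (static) or one selectable contrast (interactive) per pair.
pairs = list(combinations(groups, 2))
```

With three treatment groups this yields three pairwise comparisons, and the same enumeration extends to any number of groups.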

Hexagonal binning

Most bigPint plots represent genes using point geoms (where each point represents one gene) or hexagonal binning geoms (where each hexagon color represents the number of genes in that area). Plotting each gene as a point allows for ideal levels of detail, but overplotting can occur as the data increases, which makes it difficult to determine how many genes are in a given area. Hexagonal binning has been used in prior software to successfully manage overplotting issues [26, 35] and has shown superior time performance because fewer geom objects need to be plotted. The bigPint package allows users to draw the background using either geom, as preferences can depend on the dataset.
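The reduction in geom count can be seen in a small hexagonal binning sketch. It uses one common construction (the hexagon centers form two offset rectangular lattices, and each point is assigned to the nearest center), which is not necessarily the routine used by the package or its dependencies. Function names, the radius, and the points are hypothetical.

```python
import math
from collections import Counter


def hex_bin(x, y, radius):
    """Return an integer key identifying the hexagon that contains (x, y)."""
    w = math.sqrt(3) * radius  # horizontal spacing of centers within a row
    h = 3 * radius             # vertical spacing of rows within one lattice
    # Lattice 0 has centers at (i*w, j*h); lattice 1 is shifted by (w/2, h/2).
    i0, j0 = round(x / w), round(y / h)
    i1, j1 = round(x / w - 0.5), round(y / h - 0.5)
    d0 = (x - i0 * w) ** 2 + (y - j0 * h) ** 2
    d1 = (x - (i1 + 0.5) * w) ** 2 + (y - (j1 + 0.5) * h) ** 2
    # The nearest center over both lattices is the containing hexagon.
    return (0, i0, j0) if d0 <= d1 else (1, i1, j1)


def hex_counts(points, radius):
    """Count how many points fall in each hexagon."""
    return Counter(hex_bin(x, y, radius) for x, y in points)


# Five made-up points (for example, one gene each) binned with radius 1.
points = [(0.1, 0.1), (-0.1, 0.2), (0.05, -0.1), (0.9, 1.4), (0.8, 1.6)]
counts = hex_counts(points, radius=1.0)
```

Five points collapse into two hexagons here; with tens of thousands of genes the number of hexagons to draw is bounded by the plot area and hexagon size rather than by the number of genes.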

Hierarchical clustering

Users can conduct hierarchical clustering analyses on their data using the function plotClusters(). By default, the resulting clusters will be plotted as parallel coordinate lines superimposed onto side-by-side boxplots that represent the five-number summary of the full dataset. There are three main approaches in the plotClusters() function: Approach 1: The clusters are formed by clustering only on a user-defined subset of data (such as significant genes). Only these user-defined genes are overlaid as parallel coordinate lines. Approach 2: The clusters are formed by clustering the full dataset. Then, only a user-defined subset of data (such as significant genes) are overlaid as parallel coordinate lines. Approach 3: The clusters are formed by clustering the full dataset. All genes are overlaid as parallel coordinate lines. The clustering algorithm is based on the hclust() and cutree() functions in the R stats package. It offers the same set of agglomeration methods (“ward.D”, “ward.D2”, “single”, “complete”, “average”, “mcquitty”, “median”, and “centroid”) with “ward.D” as the default. In many cases, users may want to save clusters derived from the plotClusters() function for later use, such as to overlay them onto scatterplot matrices, litre plots, and volcano plots. The gene IDs of each cluster can be saved as .RDS files for this purpose by setting the verbose option of the plotClusters() function to a value of TRUE.
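The difference between the three approaches lies in which genes are clustered and which are overlaid, and this can be sketched with any clustering routine. The example below uses a deliberately naive agglomerative clustering with complete linkage written in plain Python; it is a stand-in for illustration, not the hclust() and cutree() functions, and it does not use the package default of “ward.D”. All function names and data are hypothetical.

```python
import math


def hierarchical_clusters(rows, k):
    """Naive agglomerative clustering (complete linkage), cut into k clusters."""
    clusters = [[gene] for gene in rows]

    def linkage(c1, c2):
        # Complete linkage: largest pairwise distance between two clusters.
        return max(math.dist(rows[a], rows[b]) for a in c1 for b in c2)

    while len(clusters) > k:
        _, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters.pop(j)
    return clusters


def approach_1(rows, subset, k):
    """Cluster only the subset; overlay only the subset."""
    return hierarchical_clusters({g: rows[g] for g in subset}, k)


def approach_2(rows, subset, k):
    """Cluster the full dataset; overlay only the subset."""
    keep = set(subset)
    return [[g for g in c if g in keep] for c in hierarchical_clusters(rows, k)]


def approach_3(rows, k):
    """Cluster the full dataset; overlay all genes."""
    return hierarchical_clusters(rows, k)


# Made-up values for five genes across four samples.
rows = {
    "g1": (1, 1, 9, 9),
    "g2": (1, 2, 9, 8),
    "g3": (9, 9, 1, 1),
    "g4": (8, 9, 2, 1),
    "g5": (2, 3, 8, 7),
}
significant = ["g1", "g3", "g4"]  # hypothetical user-defined subset

full = approach_3(rows, k=2)
subset_only = approach_1(rows, significant, k=2)
full_then_subset = approach_2(rows, significant, k=2)
```

Approaches 1 and 2 overlay the same genes but can assign them to different clusters in general, because in Approach 2 the cluster boundaries are shaped by all genes rather than by the subset alone.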

Various plot aesthetics

Users can modify various aesthetics for both static and interactive plots, including geom size. Some plots also provide alpha blending, which can benefit users plotting large datasets as parallel coordinate lines [36]. Statistical coloring is inconsistent in numerous packages even though it can greatly enhance biological data visualization [37]. The bigPint package allows users to maintain consistent coloring across hierarchical clusters and when working between various plots.

Selection and aggregation

Some techniques that are effective in data exploration may lose their efficiency and eventually fail as data size increases. Two main approaches to solving this problem are data selection and data aggregation [38]. Data selection means that only a subset of the full data is displayed at a given time. The data subset can be selected through queries and interactive controls which allow the user to quickly examine different data subsets [38]. Data aggregation means that the full dataset is divided into data subsets (called aggregates) that reduce the amount of data being simultaneously visualized. Users with large datasets should ideally be able to perform both data selection and data aggregation [38]. The bigPint package allows users to easily perform data selection using queries (such as thresholds and sliders) and interactive controls (such as zooming, box and lasso selection, and panning) and to perform data aggregation using hierarchical clustering.

Shiny interactivity

Interactive plots in the bigPint package open as Shiny applications that consist of simple dashboards with “About” tabs that explain how to use the applications. They also include “Application” tabs that provide several input fields for the user to tailor their plots. Some of these input fields are generated dynamically from the inputted dataset so that users have more convenience in how they select data subsets. In these applications, users can also download lists of selected genes and static images of interactive graphics to their local computers. Shiny applications can be launched on a local personal computer, hosted on a local or cloud-based server, or hosted for free on the shinyapps.io website. As such, interactive bigPint applications can be deployed on a personal computer using only a local file containing the data, the bigPint package and its dependencies, R / RStudio, and a browser recommended by Shiny (Google Chrome or Mozilla Firefox). This method does not require internet connectivity, which can be useful for users who are protecting sensitive data, analyzing or presenting data in contexts without reliable connectivity, or testing and developing applications.

Results

The bigPint package contains scatterplot matrices, volcano plots, litre plots, and parallel coordinate plots as example graphics that implement our new layered interactivity technology. In a recent methods paper, we used public RNA-seq datasets to demonstrate how these particular bigPint graphics can help biologists detect crucial issues with normalization methods and DEG designation in ways not possible with numerical models [32]. We also applied these bigPint visualization tools in a recent research paper that sought to elicit how nutrition and viral infection affect the honey bee transcriptome [39].

Availability and future directions

The four public datasets used in our recent methods paper [32] are available online: Three are deposited on the NCBI Sequence Read Archive with accession numbers SRA000299 [40], PRJNA318409 [41], and SRA048710 [42]. One is deposited on the NCBI Gene Expression Omnibus with accession number GSE61857 [43]. The original dataset we used in our recent research paper [39] is also available on the NCBI Gene Expression Omnibus with accession number GSE121885. The bigPint package itself comes with two of the aforementioned RNA-sequencing datasets as examples [42, 43]. The package can be downloaded from the Bioconductor website (https://bioconductor.org/packages/devel/bioc/html/bigPint.html). As linked on the Bioconductor website, the bigPint package also has a vignette website (https://lindsayrutter.github.io/bigPint), where users can follow reproducible code to install the software, use the example datasets to create bigPint graphics, and follow an example analysis pipeline. Users can also report bugs and submit requests. In this paper, we introduced new visualization tools that enable independent layers of interactive capabilities for the foreground and background of plots using Plotly in R. We believe this methodology represents a fairly novel contribution to the field of interactive data visualization that can lead to sizable performance gains when working with large datasets. Advocating state-of-the-art visualization tools is crucial for biology researchers to analyze and present their data and for visualization researchers to develop novel methods. We anticipate that our documented source code and pseudocode may encourage computational scientists to expand upon our layered interactive technology and/or apply it in new ways toward other computational biology tasks.

Resources for developers about bigPint interactive graphics.

(PDF)

Pseudocode for interactive scatterplot matrix.

(PDF)

Pseudocode for interactive litre plot.

(PDF)

Pseudocode for interactive volcano plot.

(PDF)

Pseudocode for interactive parallel coordinate plot.

(PDF)

18 Nov 2019

Dear Dr Rutter,

Thank you very much for submitting your manuscript 'bigPint: A Bioconductor visualization package that makes big data pint-sized' for review by PLOS Computational Biology. Your manuscript has been fully evaluated by the PLOS Computational Biology editorial team and in this case also by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the manuscript as it currently stands. While your manuscript cannot be accepted in its present form, we are willing to consider a revised version in which the issues raised by the reviewers have been adequately addressed. We cannot, of course, promise publication at that time.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

Your revisions should address the specific points made by each reviewer. Please return the revised version within the next 60 days. If you anticipate any delay in its return, we ask that you let us know the expected resubmission date by email at ploscompbiol@plos.org. Revised manuscripts received beyond 60 days may require evaluation and peer review similar to that applied to newly submitted manuscripts.

In addition, when you are ready to resubmit, please be prepared to provide the following:

(1) A detailed list of your responses to the review comments and the changes you have made in the manuscript. We require a file of this nature before your manuscript is passed back to the editors.
(2) A copy of your manuscript with the changes highlighted (encouraged). We encourage authors, if possible to show clearly where changes have been made to their manuscript e.g. by highlighting text.
(3) A striking still image to accompany your article (optional). If the image is judged to be suitable by the editors, it may be featured on our website and might be chosen as the issue image for that month. These square, high-quality images should be accompanied by a short caption. Please note as well that there should be no copyright restrictions on the use of the image, so that it can be published under the Open-Access license and be subject only to appropriate attribution.

Before you resubmit your manuscript, please consult our Submission Checklist to ensure your manuscript is formatted correctly for PLOS Computational Biology: http://www.ploscompbiol.org/static/checklist.action. Some key points to remember are:

- Figures uploaded separately as TIFF or EPS files (if you wish, your figures may remain in your main manuscript file in addition).
- Supporting Information uploaded as separate files, titled Dataset, Figure, Table, Text, Protocol, Audio, or Video.
- Funding information in the 'Financial Disclosure' box in the online system.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see here.

We are sorry that we cannot be more positive about your manuscript at this stage, but if you have any concerns or questions, please do not hesitate to contact us.

Sincerely,

Aaron E. Darling
Software Editor
PLOS Computational Biology

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately: [LINK]

Editor's specific comments: Thank you for your patience while this manuscript was under review, as there was difficulty sourcing appropriately qualified reviewers. Please thoroughly address the comments of both reviewers in your revisions, in particular those focused on software architecture, reliability, and ease of use.

Reviewer's Responses to Questions

Comments to the Authors: Please note here if the review is uploaded as an attachment.

Reviewer #1: Rutter and Cook discuss their new R package offering data visualization techniques geared towards RNA-seq data. The data visualization techniques make clever use of interactivity to highlight aspects of the data, which enable researchers to identify common analysis errors. While I like the concept of the presented visualization techniques, I am not convinced that this will be a widely applied tool. The tool felt clunky and still produced several errors. I will outline these as well as other issues/suggestions below:

1. The authors claim that their software can be used for the exploration of any large biological datasets. However, their choice of plots, in particular the volcano plot, suggests to me a focus on RNA-seq data. I would suggest the inclusion of another example for a non-RNA-seq dataset or making it clear that this package targets RNA-seq data exclusively.
2. It is great that all plots are easily downloadable, but for the purpose of replicability it would be great if logs of all user interactions could be provided. This can, for example, be handled with shinylogs (https://github.com/dreamRs/shinylogs).
3. Part of why I think the package feels clunky is that each plot is provided in a separate application.
I think it would be more user friendly to provide one app that has user-specified outputs. A brilliant example of this is the iSEE package.
4. Finally, I am unsure why the authors chose the dataMetrics object as input, instead of the much more common SummarizedExperiment object. The latter has the advantage that it offers support for DelayedMatrices, which are extremely useful when datasets are very large.
5. How does bigPint scale? Have you tried analyzing single cell RNA-seq datasets?

Errors:

1. The example interactive application for the volcano plot is broken.
2. None of the parallel coordinate plots show the correct labels for genes when hovering over the lines associated with genes.

Reviewer #2: This is supplied as an R Markdown file to show what I have done to try to use bigPint. I have also supplied the file as an attachment.

---
title: "review_bigPint"
author: "Paul Brennan"
date: "12/10/2019"
output: html_document
---

## General comments and recommendation

This is an interesting and worthwhile manuscript that shares some very nice and novel technical and visualisation strategies.

### Novelty in visualisation

I believe that the hexagon concept is being used in a novel way. I believe it is worthwhile for high density data. I believe that the approach should fit well into the Bioconductor methodology. It would be useful to see the package integrated into a workflow.

### Good points

I like the supporting website that the authors have provided. I like the videos they have created. These make the concept understandable in a very positive way.

### Minor concerns

I have some minor concerns that I believe need to be addressed prior to publication. My first and most important concern is that I have found it quite difficult to reproduce the workflows and examples using the links in the manuscript. I have focussed my efforts on trying to make the bigPint package work using the links provided in the manuscript. Sadly, I have found that quite challenging.
For some reason, and it may be a limitation of my technical skills, when I accessed the code shown in the links for Table 1, I could not make the code work. The first link to https://github.com/lindsayrutter/bigPint/blob/master/inst/shiny-examples/plotSMApp/app.R points to a very complex example. I do not believe it constitutes a "Helpful resource". When I copy and paste this code into a new R-Studio Project, it is very difficult to get to work.

#### Trying to get Scatterplot Matrix App to work...

Here I am going to use the R Markdown file to catalogue my attempts. Script from [here](bit.ly/spmCode)

```{r cut_and_paste_from_bit_ly_spmCode, include=FALSE, eval=FALSE}
# I have removed this from the review because it uses up too many words...
# trust me please, I have cut and pasted it verbatim.
```

When I run this, a window opens and then closes again. Interestingly, my data is a Value - not Data in my Global Environment and it says "NULL (empty)". Why? Well let's run the data acquisition by itself. Annoyingly this is not a reproducible problem. Sometimes it seems to work but I don't know why.

```{r test_data_download, include=TRUE}
data <- bigPint:::PKGENVIR$DATA
str(data)
```

Still NULL, sadly. All the other code examples use the same code to get data in so I'm not going to try them here. I have tried them previously.

#### Where else can I go for help?

Let's check [Bioconductor](https://www.bioconductor.org/packages/release/bioc/html/bigPint.html). The page is a good starting point. It has a green build which is a good sign. Check out the [R-script](https://www.bioconductor.org/packages/release/bioc/vignettes/bigPint/inst/doc/bioconductor.R). Woops! It seems blank. A bit frustrating. [HTML Vignette](https://www.bioconductor.org/packages/release/bioc/vignettes/bigPint/inst/doc/bioconductor.html) points to website with a [rotten url](https://lindsayrutter.github.%20io/bigPint/)! Woops again! Let's go to the Github Repository and see what we can find...
The [link](https://lindsayrutter.github.io/bigPint/) is in the manuscript and works. Yeah!! OK, now we have some different code:

```{r test_data_download, include=TRUE}
library(bigPint)
data("soybean_cn_sub")
soybean_cn_sub <- soybean_cn_sub[,1:7]
app <- plotSMApp(data=soybean_cn_sub)
if (interactive()) {
  shiny::runApp(app)
}
```

Good news is that the data() function works and gives us some data!

```{r str_data}
str(soybean_cn_sub)
```

Nice data.frame produced. Bad news is that my interactive plot still won't work! Well actually it does work when I run this again. It is a bit slow and it gives lots of red text:

'scatter' objects don't have these attributes: 't2'

What happens if I try the

```{r try_PKGENVIR$DATA_again}
data <- bigPint:::PKGENVIR$DATA
```

It works! If I do it after I run your App. Wonder why? I'm not going to spend any time working through that at the moment. Now the script above works with lots of error messages.

#### Back to square one, can I make the static hex plots

```{r static_hex}
data(soybean_ir_sub)
soybean_ir_sub[,-1] <- log(soybean_ir_sub[,-1]+1)
data(soybean_ir_sub_metrics)
ret <- plotLitre(data = soybean_ir_sub,
                 dataMetrics = soybean_ir_sub_metrics,
                 threshVal = 1e-10,
                 saveFile = FALSE)
length(ret)
names(ret)[1]
ret[[1]]
```

Answer is Yes! Good and it looks interesting with points highlighted on it. I probably should have started here in the first place but hey...

#### I need to try to make the apps work...

I found some code here: [https://rdrr.io/bioc/bigPint/man/plotVolcanoApp.html]

```{r plot_Volcano_App}
library(bigPint)
# Example 1: Create interactive volcano plot of logged data using hexagon
# bins for the background.
data(soybean_cn_sub)
data(soybean_cn_sub_metrics)
app <- plotVolcanoApp(data = soybean_cn_sub,
                      dataMetrics = soybean_cn_sub_metrics)
if (interactive()) {
  shiny::runApp(app)
}
```

OK, so this works. Excellent! I am reassured and in fact more than reassured. I am pleased.
I can plot the gene subset and Download selected genes. I was under the impression that I would be able to click on a hexagon but I don't seem to be able to do that. Seems that functionality is not possible here.

```{r plot_Volcano_App}
# Example 2: Create interactive volcano plot of logged data using points for
# the background.
app <- plotVolcanoApp(data = soybean_cn_sub,
                      dataMetrics = soybean_cn_sub_metrics,
                      option = "allPoints",
                      pointColor = "magenta")
if (interactive()) {
  shiny::runApp(app)
}
```

This shows the points. The scale of the P-value seems a little unwise. Should it really go all the way up to 1? I think there should be the option to go down to lower P-values more easily and not up.

#### try PlotLitre App from online

```{r plot_Volcano_App}
data(soybean_ir_sub)
data(soybean_ir_sub_metrics)
soybean_ir_sub_log <- soybean_ir_sub
soybean_ir_sub_log[,-1] <- log(soybean_ir_sub[,-1]+1)
app <- plotLitreApp(data = soybean_ir_sub_log,
                    dataMetrics = soybean_ir_sub_metrics)
if (interactive()) {
  shiny::runApp(app, port = 1234, launch.browser = TRUE)
}
```

I can see the number of genes in each hexagon if I hover over it - nice. Plot gene works to give orange spots which can be hovered over to identify. However, I'm not sure how I am selecting those genes. I have worked out that it is going through the genes by rank. However, because that information is off the bottom of the screen, it took me a while to realise. I need to watch the video again!

Back to [website](https://lindsayrutter.github.io/bigPint/articles/interactive.html)

Watch video about Scatterplot matrix app

Cut and paste code:

```{r plotSMAapp_again}
library(bigPint)
data("soybean_cn_sub")
soybean_cn_sub <- soybean_cn_sub[,1:7]
app <- plotSMApp(data=soybean_cn_sub)
if (interactive()) {
  shiny::runApp(app)
}
```

This works! Excellent. Selecting hexagons works. Downloading IDs works. Downloading plots doesn't :-( Need to open in Browser as advised! Could add that as error message?
OK so it does work but it produces LOTS of warnings. I wonder why?

Warning: 'scatter' objects don't have these attributes: 't2'

Lots of repeats of this warning.

#### Try plotLitreApp

```{r plotSMAapp_again}
data("soybean_ir_sub")
data("soybean_ir_sub_metrics")
soybean_ir_sub_log <- soybean_ir_sub
soybean_ir_sub_log[,-1] <- log(soybean_ir_sub[,-1]+1)
app <- plotLitreApp(data=soybean_ir_sub_log,
                    dataMetrics = soybean_ir_sub_metrics)
if (interactive()) {
  shiny::runApp(app, port = 1234, launch.browser = TRUE)
}
```

#### Try plotPCPApp

```{r plotPCPApp}
soybean_ir_sub_st = as.data.frame(t(apply(as.matrix(soybean_ir_sub[,-1]), 1, scale)))
soybean_ir_sub_st$ID = as.character(soybean_ir_sub$ID)
soybean_ir_sub_st = soybean_ir_sub_st[,c(length(soybean_ir_sub_st), 1:length(soybean_ir_sub_st)-1)]
colnames(soybean_ir_sub_st) = colnames(soybean_ir_sub)
nID = which(is.nan(soybean_ir_sub_st[,2]))
soybean_ir_sub_st[nID,2:length(soybean_ir_sub_st)] = 0
plotGenes = filter(soybean_ir_sub_metrics[["N_P"]], FDR < 0.01, logFC < -4) %>% select(ID)
pcpDat = filter(soybean_ir_sub_st, ID %in% plotGenes[,1])
app <- plotPCPApp(data = pcpDat)
if (interactive()) {
  shiny::runApp(app, display.mode = "normal")
}
```

Works! In Browser, I can save images as advised. Nice job.

#### Try Volcano app from website

Sadly no code on [website](https://lindsayrutter.github.io/bigPint/articles/interactive.html#volcano-plot-app) when accessed on 12 Oct 2019. Example code from [here](https://rdrr.io/bioc/bigPint/man/plotVolcanoApp.html) instead as above. Made it work again! GREAT!

```{r session_info}
sessionInfo()
```

So, I made it work but not from code in the manuscript. Please make it easier for me.
Here is my session info:

R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

Random number generation:
RNG: Mersenne-Twister
Normal: Inversion
Sample: Rounding

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base

other attached packages:
[1] shinycssloaders_0.2.0 Hmisc_4.2-0 Formula_1.2-3
[4] survival_2.44-1.1 lattice_0.20-38 RColorBrewer_1.1-2
[7] GGally_1.4.0 data.table_1.12.2 dplyr_0.8.3
[10] stringr_1.4.0 hexbin_1.27.3 tidyr_1.0.0
[13] htmlwidgets_1.5.1 plotly_4.9.0 ggplot2_3.2.1
[16] shinydashboard_0.7.1 shiny_1.4.0 bigPint_1.0.0

loaded via a namespace (and not attached):
[1] Rcpp_1.0.2 assertthat_0.2.1 zeallot_0.1.0
[4] digest_0.6.21 mime_0.7 R6_2.4.0
[7] plyr_1.8.4 backports_1.1.5 acepack_1.4.1
[10] httr_1.4.1 pillar_1.4.2 rlang_0.4.0
[13] lazyeval_0.2.2 rstudioapi_0.10 rpart_4.1-15
[16] Matrix_1.2-17 checkmate_1.9.4 labeling_0.3
[19] splines_3.6.1 foreign_0.8-72 munsell_0.5.0
[22] compiler_3.6.1 httpuv_1.5.2 xfun_0.10
[25] pkgconfig_2.0.3 base64enc_0.1-3 htmltools_0.4.0
[28] nnet_7.3-12 tidyselect_0.2.5 tibble_2.1.3
[31] gridExtra_2.3 htmlTable_1.13.2 reshape_0.8.8
[34] viridisLite_0.3.0 withr_2.1.2 crayon_1.3.4
[37] later_1.0.0 grid_3.6.1 jsonlite_1.6
[40] xtable_1.8-4 gtable_0.3.0 lifecycle_0.1.0
[43] magrittr_1.5 scales_1.0.0 stringi_1.4.3
[46] promises_1.1.0 latticeExtra_0.6-28 ellipsis_0.3.0
[49] vctrs_0.2.0 tools_3.6.1 glue_1.3.1
[52] purrr_0.3.2 crosstalk_1.0.0 fastmap_1.0.1
[55] yaml_2.2.0 colorspace_1.4-1 cluster_2.1.0
[58] knitr_1.25

**********

Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes
Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No
Reviewer #2: Yes: Dr Paul Brennan, Centre for Medical Education, School of Medicine, Cardiff University, Cardiff, Wales, CF14 4XN, United Kingdom. BrennanP@cardiff.ac.uk @brennanpcardiff

Submitted filename: review_submitted.Rmd

21 Feb 2020

Submitted filename: ResponseToReviewers.doc

27 Apr 2020

Dear Dr. Rutter,

We are pleased to inform you that your manuscript 'bigPint: A Bioconductor visualization package that makes big data pint-sized' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards.
Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Aaron E. Darling
Software Editor
PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors: Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have addressed all my comments. I think dividing the package and manuscript by user (developer or analyst) is very helpful.

**********

Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Saskia Freytag

1 Jun 2020

PCOMPBIOL-D-19-01337R1

bigPint: A Bioconductor visualization package that makes big data pint-sized

Dear Dr Rutter,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Laura Mallard

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

18 in total

1. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays.

Authors: John C Marioni; Christopher E Mason; Shrikant M Mane; Matthew Stephens; Yoav Gilad
Journal: Genome Res Date: 2008-06-11 Impact factor: 9.043

2. Visualizing biological data-now and in the future.

Authors: Seán I O'Donoghue; Anne-Claude Gavin; Nils Gehlenborg; David S Goodsell; Jean-Karim Hériché; Cydney B Nielsen; Chris North; Arthur J Olson; James B Procter; David W Shattuck; Thomas Walter; Bang Wong
Journal: Nat Methods Date: 2010-03 Impact factor: 28.547