Literature DB >> 34788796

MatrixQCvis: shiny-based interactive data quality exploration for omics data.

Abstract

MOTIVATION: First-line data quality assessment and exploratory data analysis are integral parts of any data analysis workflow. In high-throughput quantitative omics experiments (e.g. transcriptomics, proteomics, metabolomics), after initial processing, the data are typically presented as a matrix of numbers (feature IDs × samples). Efficient and standardized data-quality metrics calculation and visualization are key to track the within-experiment quality of these rectangular data types and to guarantee for high-quality data sets and subsequent biological question-driven inference.
RESULTS: We present MatrixQCvis, which provides interactive visualization of data quality metrics at the per-sample and per-feature level using R's shiny framework. It provides efficient and standardized ways to analyze data quality of quantitative omics data types that come in a matrix-like format (features IDs × samples). MatrixQCvis builds upon the Bioconductor SummarizedExperiment S4 class and thus facilitates the integration into existing workflows. AVAILABILITY: MatrixQCVis is implemented in R. It is available via Bioconductor and released under the GPL v3.0 license. SUPPLEMENTARY INFORMATION: Supplementary Information is available at Bioinformatics online.

Entities: Chemical

Year: 2021 PMID： 34788796 PMCID： PMC8796383 DOI： 10.1093/bioinformatics/btab748

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

Initial first-line data quality assessment is an integral part of data analysis for subsequent biological question-driven inference. To ensure facile exploration of data quality, we developed MatrixQCvis, implemented in the R programming language. MatrixQCvis provides shiny-based interactive visualization and quantification of data quality metrics at the per-sample and per-feature level. It is broadly applicable to quantitative omics data types that come in a matrix-like format (features × samples). It enables the detection of low-quality samples, outliers, drifts and batch effects in datasets. Visualizations include boxplots and violin plots of the (count or intensity) values, mean versus standard deviation plots, MA plots (see Fig. 1a) and Hoeffding’s D statistic (non-parametric measure of independence between M and A), empirical cumulative distribution function plots, visualizations of the distances between samples and multiple types of dimension reduction plots. Furthermore, MatrixQCvis facilitates differential expression analysis based on the limma (moderated t-tests, Ritchie ) and proDA (Wald tests, Ahlmann-Eltze and Anders, 2019) packages.

Fig. 1.

Examples of MatrixQCvis functionality and user interface. (a) MA plot of human plasma proteomics samples identifying a dependence between M and A values for Sample 6 indicating problems with its data. More visualizations using the clinical datasets of Jiang and Brueffer are shown in the Supplementary Material. (b) MatrixQCvis builds upon the SummarizedExperiment S4 class, a container for assay data (e.g. proteomics intensity values) and associated metadata on the features and samples. The figure is adjusted from the vignette of the SummarizedExperiment package. (c) Sidebar panel. MatrixQCvis enables to interactively normalize, transform, perform batch correction and impute the dataset. Furthermore, samples can be excluded or selected. In the shown example, phosphate-buffered saline samples are excluded. (d) Main panel. Navigation within MatrixQCvis is realized by browsing through tabs. Each visualization is embedded within a dedicated tab Similarly to the iSEE package (Rue-Albrecht ), MatrixQCvis builds upon the widely used Bioconductor SummarizedExperiment S4 class (see Fig. 1b) and thus facilitates the integration into existing workflows. Compared to iSEE, which provides a general interface for exploring data in a SummarizedExperiment object, MatrixQCvis focuses on the upstream, initial first-line, data quality control steps and incorporates dedicated visualization capabilities for assessing data quality, although overlaps exist (e.g. dimension reduction plots). Several software packages exist that center around the assessment of data quality: among others, arrayQualityMetrics (Kauffmann ), initially developed more than 10 years ago for the quality assessment of microarrays, that creates automatic reports of the data quality. Contrary to arrayQualityMetrics, which is built upon outdated visualization libraries, MatrixQCvis uses the shiny (Chang ) framework and provides a high number of interactive visualizations to explore the data. MeTaQuaC (Kuhring ), a recently published R package, is dedicated to metabolomics data analysis, accepts either the Biocrates data output or a generic file format and creates a static quality control report. On the other hand, MatrixQCvis is not restricted to a specific technology and centers around interactive exploration of data quality. We highlight the usability and functionality of the MatrixQCvis package in applications of clinical proteomics and transcriptomics studies in the Supplementary Material, using the datasets of Jiang and Brueffer .

2 Usage scenario and user interface

Many tools and software packages exist that directly process raw data and translate these to biological discovery, but offer limited capabilities for data quality control. However, data quality control is a necessary step in omics data processing to identify samples with poor intensities and low signal-to-noise ratio, biases and outliers, batch and confounding effects or drifts in calibration. Quality control involves the monitoring of data processing steps from raw data to processed/completed datasets (including the steps of sample normalization, batch correction, transformation and handling of missing values) and the simultaneous assessment of quality metrics. Concomitantly, a data quality-centered workflow would facilitate consistent processing to minimize technical variance and interference. Additionally, such a workflow would ideally identify systematic trends and deviations (systematic errors) and remove these prior to data interpretation to enable sound biological information retrieval. To offer a tool to the community that addresses this gap, MatrixQCvis was developed. While static reports may, for example, be well-suited for documentation and more amenable to automated interpretation, an advantage of interactive interfaces is the ability to immediately explore the data using multiple plots and parameter choices and to generate and readily pursue hypotheses. Interactivity also may help users in making decisions whether to keep or drop samples while doing quality assessment. Since MatrixQCvis is based on shiny (see Fig. 1c and d), it requires little programmatic interaction to start the quality assessment and is, thus, suitable for anyone who would like to analyze data quality in a fast, efficient and standardized manner. MatrixQCvis enables users to explore data quality of rectangular datasets that are represented as a SummarizedExperiment object (see Fig. 1b) interactively and reproducibly. The interface to the shiny interface is initiated with a single call to the shinyQC function. A SummarizedExperiment object can be passed to the shinyQC call directly or loaded via an upload interface. Using domain-and study-specific knowledge, MatrixQCvis guides the user to downsample datasets on the per-sample and per-feature level. For data entities where missing values are present in the assay structure the interface will load specific interfaces to assess the data quality with respect to missing values, e.g. tab panels to visualize the number of missing values across samples or sets of missing values across sample types, and a user interface to control imputation. Besides the interactive quality exploration, MatrixQCvis enables users to download all plots and to generate markdown/HTML reports from within the shiny interface.

3 Conclusion

The shiny application MatrixQCvis generates interactive data quality workflows and facilitates to monitor data quality along the major data processing steps via several commonly applied data quality metrics and visualizations. It enables users to create a dynamic, easy-to-share and easy-to-store report using user-specified settings. MatrixQCvis can be integrated into existing workflows and provides a means to scrutinize the data quality of rectangular datasets in a fast, efficient and standardized manner. Click here for additional data file.

6 in total

1. Concepts and Software Package for Efficient Quality Control in Targeted Metabolomics Studies: MeTaQuaC.

Authors: Mathias Kuhring; Alina Eisenberger; Vanessa Schmidt; Nicolle Kränkel; David M Leistner; Jennifer Kirwan; Dieter Beule
Journal: Anal Chem Date: 2020-07-16 Impact factor: 6.986

2. Proteomics identifies new therapeutic targets of early-stage hepatocellular carcinoma.

Authors: Ying Jiang; Aihua Sun; Yang Zhao; Wantao Ying; Huichuan Sun; Xinrong Yang; Baocai Xing; Wei Sun; Liangliang Ren; Bo Hu; Chaoying Li; Li Zhang; Guangrong Qin; Menghuan Zhang; Ning Chen; Manli Zhang; Yin Huang; Jinan Zhou; Yan Zhao; Mingwei Liu; Xiaodong Zhu; Yang Qiu; Yanjun Sun; Cheng Huang; Meng Yan; Mingchao Wang; Wei Liu; Fang Tian; Huali Xu; Jian Zhou; Zhenyu Wu; Tieliu Shi; Weimin Zhu; Jun Qin; Lu Xie; Jia Fan; Xiaohong Qian; Fuchu He
Journal: Nature Date: 2019-02-27 Impact factor: 49.962

3. limma powers differential expression analyses for RNA-sequencing and microarray studies.

Authors: Matthew E Ritchie; Belinda Phipson; Di Wu; Yifang Hu; Charity W Law; Wei Shi; Gordon K Smyth
Journal: Nucleic Acids Res Date: 2015-01-20 Impact factor: 16.971

4. Clinical Value of RNA Sequencing-Based Classifiers for Prediction of the Five Conventional Breast Cancer Biomarkers: A Report From the Population-Based Multicenter Sweden Cancerome Analysis Network-Breast Initiative.

Authors: Christian Brueffer; Johan Vallon-Christersson; Dorthe Grabau; Anna Ehinger; Jari Häkkinen; Cecilia Hegardt; Janne Malina; Yilun Chen; Pär-Ola Bendahl; Jonas Manjer; Martin Malmberg; Christer Larsson; Niklas Loman; Lisa Rydén; Åke Borg; Lao H Saal
Journal: JCO Precis Oncol Date: 2018-03-09

5. arrayQualityMetrics--a bioconductor package for quality assessment of microarray data.

Authors: Audrey Kauffmann; Robert Gentleman; Wolfgang Huber
Journal: Bioinformatics Date: 2008-12-23 Impact factor: 6.937

6. iSEE: Interactive SummarizedExperiment Explorer.

Authors: Kevin Rue-Albrecht; Federico Marini; Charlotte Soneson; Aaron T L Lun
Journal: F1000Res Date: 2018-06-14

6 in total