ProteomeExpert: a docker image based web-server for exploring, modeling, visualizing, and mining quantitative proteomic data sets.

Tiansheng Zhu^1,2,3, Hao Chen^1,2, Xishan Yan^4, Zhicheng Wu^1,2,3, Xiaoxu Zhou^1,2, Qi Xiao^1,2, Weigang Ge^1,2, Qiushi Zhang^1,2, Chao Xu^4, Luang Xu^1,2, Guan Ruan^1,2, Zhangzhi Xue^1,2, Chunhui Yuan^1,2, Guo-Bo Chen^5, Tiannan Guo^1,2.

Abstract

Rapid progress in high-throughput sequencing-based omics and mass spectrometry (MS)-based proteomics, such as data-independent acquisition (DIA), and their penetration into clinical studies have generated an increasing number of proteomic data sets containing hundreds to thousands of samples. To analyze these quantitative proteomic data sets and other omics data sets more efficiently and conveniently, we present ProteomeExpert, a web server-based software tool implemented in Docker that offers analysis tools for experimental design, data mining, interpretation, and visualization of quantitative proteomic data sets. ProteomeExpert can be deployed on any operating system with Docker installed or with an R language environment.
Availability and implementation: The Docker image of ProteomeExpert is freely available from https://hub.docker.com/r/lifeinfo/proteomeexpert. The source code of ProteomeExpert is also openly accessible at http://www.github.com/lifeinfo/ProteomeExpert/. In addition, a demo server is provided at https://proteomic.shinyapps.io/peserver/.
Supplementary information: Supplementary data are available at Bioinformatics online.

Year: 2021 PMID: 33416829 PMCID: PMC8055226 DOI: 10.1093/bioinformatics/btaa1088

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Recent advances in liquid chromatography-mass spectrometry-based proteomics technology permit the acquisition of hundreds to thousands of proteomics datasets in a relatively short time, especially with the data-independent acquisition (DIA) mass spectrometry (MS) strategy (Gillet et al., 2012; Guo et al., 2015; Yue et al., 2020; Zhang et al., 2020). The substantial increase in proteomics data over the last few years necessitates effective algorithms and software tools for automatic interpretation of the resulting quantitative datasets, to obtain valuable biological insights or assist clinical diagnosis. Some existing tools, such as SWATH2stats (Blattmann et al., 2016) and MSstats (Choi et al., 2014), require programming skills. Another tool, mapDIA (Teo et al., 2015), is a statistical analysis package for differentially expressed proteins that uses DIA fragment-level intensities. PANDA-view (Chang et al., 2018) and Perseus (Tyanova et al., 2016) both depend on other packages and do not include functionality for feature selection, peptide–protein inference or experimental design. Proteomics analysis tools that support a broad range of functions across tasks are therefore still needed. Here, we provide an easy-to-use, web server-based, comprehensive data analysis platform for quantitative proteomics data and other omics data (e.g. transcriptomics and metabolomics) that covers experimental design, data preprocessing, data quality control, protein inference, statistics, feature selection, unsupervised learning, supervised learning and visualization. To facilitate installation and deployment, we release this platform as a Docker image in addition to the GitHub source code, which makes it easy to install and share on operating systems such as Linux, Mac, and Windows.

2 Materials and methods

2.1 Architecture and implementation

ProteomeExpert is built as a Docker image. Figure 1 gives an overview of the platform's modules. The platform includes an interactive web interface based on the Shiny package for the R language, which integrates a collection of modules. Each analysis can be customized by tuning the parameters in the corresponding interface; the resource-demanding computation runs on a remote server that can be upgraded as needed. This architecture is designed to alleviate the computational burden of handling the increasing volume of proteomic data. Even users with little programming experience can conduct most of the analyses they require. Detailed descriptions of power design, protein inference, feature selection and help information are provided in the Supplementary File.

Fig. 1. The ProteomeExpert data analysis platform

2.2 Data input and output

‘Data upload’ is the core data input interface, through which data files are uploaded via the web. The data console accepts a user-specific protein matrix and experiment meta-data (including experiment run, sample and individual/patient information) as the input for generating summary statistics, machine learning and data preprocessing. The modules for batch design, protein inference, annotation and supervised learning on a testing set have their own data upload functions. ProteomeExpert also provides interactive parameter settings and options to download processed data and plots.
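The input described above can be pictured with a minimal Python sketch. The file layout is an assumption made for illustration only (a tab-delimited matrix with protein identifiers in the first column and one column per sample, plus a meta-data table with a hypothetical `sample` column); the function names are likewise hypothetical and are not part of ProteomeExpert.

```python
import csv
import io

def read_protein_matrix(text, delimiter="\t"):
    """Parse a delimited protein matrix (assumed layout: protein ID, then one column per sample)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    samples = header[1:]
    matrix = {}
    for row in reader:
        if not row:
            continue
        values = []
        for cell in row[1:]:
            cell = cell.strip()
            # Empty cells and NA/NaN markers are treated as missing values.
            values.append(None if cell in ("", "NA", "NaN") else float(cell))
        matrix[row[0]] = values
    return samples, matrix

def samples_without_metadata(samples, metadata_text, delimiter="\t", key="sample"):
    """Return the matrix samples that have no row in the meta-data table."""
    reader = csv.DictReader(io.StringIO(metadata_text), delimiter=delimiter)
    annotated = {row[key] for row in reader}
    return [s for s in samples if s not in annotated]

# Hypothetical demo data, invented for this example.
demo_matrix = "protein\tS1\tS2\tS3\nP12345\t1200.5\tNA\t980.0\nQ67890\t310.0\t295.5\t\n"
demo_meta = "sample\tpatient\tgroup\nS1\tA\tcase\nS2\tB\tcontrol\n"

samples, matrix = read_protein_matrix(demo_matrix)
missing = samples_without_metadata(samples, demo_meta)
```

Checking that every sample column has matching meta-data before analysis is one simple way to catch mismatched uploads early.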

2.3 Experimental design

The experimental design module includes two sub-modules: power analysis and batch design. Power analysis estimates the sample size needed to detect differentially expressed proteins with statistical significance, for given type I and type II error rates. Batch design is intended for emerging large cohorts containing hundreds or thousands of samples, a scale that has to be processed in multiple batches. Because technical variability among batches often introduces unwanted variation, this module implements a balanced batch design that maximizes statistical power while controlling technical variation. Batch design and batch correction methods are necessary for downstream data analysis, especially for large cohorts.
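The two ideas in this section can be illustrated with a short Python sketch. It is not the method implemented in ProteomeExpert, which is described in the Supplementary File. The sample-size function uses the textbook normal approximation for a two-sided, two-group comparison, n = 2((z_{1-alpha/2} + z_{1-beta}) * sd / delta)^2, and the batch function uses a simple stratified round-robin assignment. All function and parameter names are hypothetical.

```python
import math
import random
from collections import defaultdict
from statistics import NormalDist

def samples_per_group(delta, sd, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group for a two-sided two-sample test.

    delta: smallest difference in means worth detecting (e.g. in log2 intensity)
    sd:    assumed common standard deviation
    alpha: type I error rate; 1 - power is the type II error rate
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

def balanced_batches(sample_groups, n_batches, seed=0):
    """Assign samples to batches so that each group is spread evenly across batches.

    sample_groups maps sample name -> group label. Samples are shuffled within
    each group and then dealt to the batches in turn, so the per-batch counts of
    any one group differ by at most one.
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for sample, group in sample_groups.items():
        by_group[group].append(sample)
    batches = [[] for _ in range(n_batches)]
    slot = 0
    for group in sorted(by_group):
        members = sorted(by_group[group])
        rng.shuffle(members)
        for sample in members:
            batches[slot % n_batches].append(sample)
            slot += 1
    return batches

# Hypothetical cohort: 6 cases and 6 controls processed in 3 batches.
cohort = {f"case{i}": "case" for i in range(6)}
cohort.update({f"ctrl{i}": "control" for i in range(6)})
n_needed = samples_per_group(delta=1.0, sd=1.0)
batches = balanced_batches(cohort, n_batches=3)
```

In this toy cohort every batch receives two cases and two controls, so the group label is not confounded with the batch.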

2.4 Data preprocessing

The data preprocessing module includes methods for log transformation, missing value substitution, normalization, batch effect correction, and treatment of replicates. Batch effect correction is a particularly important preprocessing step. Because batch effects may obscure biological signals, we recommend performing batch diagnostics, normalization, and adjustment right after peptide identification.
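As a rough illustration of three of these steps, the Python sketch below applies a log2 transform, per-sample median centering, and minimum-value substitution to a small protein matrix. The specific choices (median centering as the normalization, the global minimum as the substitute value) are assumptions made for the example and are not necessarily the options offered by ProteomeExpert; batch effect correction and replicate handling are not covered.

```python
import math
from statistics import median

def log2_transform(matrix):
    """Take log2 of positive intensities; missing or non-positive values become None."""
    return {
        protein: [math.log2(v) if v is not None and v > 0 else None for v in values]
        for protein, values in matrix.items()
    }

def median_center(matrix):
    """Subtract each sample's median so that every sample (column) has median 0.

    Assumes every sample has at least one observed value.
    """
    n_samples = len(next(iter(matrix.values())))
    col_medians = []
    for j in range(n_samples):
        observed = [values[j] for values in matrix.values() if values[j] is not None]
        col_medians.append(median(observed))
    return {
        protein: [None if v is None else v - col_medians[j] for j, v in enumerate(values)]
        for protein, values in matrix.items()
    }

def impute_minimum(matrix):
    """Replace missing values with the smallest observed value in the matrix."""
    floor = min(v for values in matrix.values() for v in values if v is not None)
    return {
        protein: [floor if v is None else v for v in values]
        for protein, values in matrix.items()
    }

# Hypothetical raw intensities: rows are proteins, columns are three samples.
raw = {
    "P1": [8.0, 16.0, None],
    "P2": [2.0, 4.0, 4.0],
    "P3": [32.0, 64.0, 16.0],
}
processed = impute_minimum(median_center(log2_transform(raw)))
```

Here the imputation runs last so that substituted values do not influence the sample medians used for normalization; other orderings are possible and depend on the study.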

2.5 Machine learning

In the feature selection module, users can apply filter methods, including removal of near-zero-variance and highly correlated features, as well as additional feature selection methods: LASSO (Tibshirani, 1996), genetic algorithm and random forest. Because classifying disease into subtypes is of great interest for diagnosis and prognosis in clinical applications, users can perform various machine-learning analyses: the unsupervised methods PCA, t-SNE, and UMAP for dimensionality reduction, and the supervised algorithms decision tree, random forest, and XGBoost for classification.
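The two filter methods mentioned above can be sketched in a few lines of Python. This is a simplified illustration, not ProteomeExpert's implementation: the near-zero-variance filter is reduced to a plain variance threshold, the correlation filter greedily keeps the first feature of each highly correlated pair, and both thresholds are arbitrary assumptions.

```python
import math
from statistics import mean, pvariance

def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def filter_features(features, min_variance=1e-8, max_abs_corr=0.9):
    """Drop near-constant features, then features highly correlated with one already kept.

    features maps feature name -> list of values across samples. Features are
    visited in insertion order, so the earlier member of a correlated pair is retained.
    """
    kept = {}
    for name, values in features.items():
        if pvariance(values) <= min_variance:
            continue  # near-zero variance: carries almost no information
        if any(abs(pearson(values, other)) > max_abs_corr for other in kept.values()):
            continue  # redundant with a feature that was already kept
        kept[name] = values
    return list(kept)

# Hypothetical protein features measured in four samples.
features = {
    "P1": [1.0, 2.0, 3.0, 4.0],
    "P2": [2.0, 4.0, 6.0, 8.1],   # almost perfectly correlated with P1
    "P3": [5.0, 5.0, 5.0, 5.0],   # constant
    "P4": [4.0, 1.0, 3.0, 2.0],
}
selected = filter_features(features)
```

Filters of this kind are typically applied before model-based selection such as LASSO, since they cheaply remove uninformative or redundant columns.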

3 Conclusions

In summary, we have developed ProteomeExpert to meet the need for processing large-scale quantitative proteomics datasets. Most, if not all, quantitative proteomic datasets can be fed into ProteomeExpert, including but not limited to DIA-MS with or without ion mobility, label-free or stable isotope labeling-based data-dependent acquisition MS, parallel reaction monitoring MS and multiple reaction monitoring MS. Transcriptomic and metabolomic datasets can also be processed by this tool. ProteomeExpert is compatible with other omics tools: their results can be uploaded as tab-delimited or comma-separated text files or as Excel files. Moreover, ProteomeExpert includes comprehensive methods for data preprocessing, visualization, statistics, and machine learning. It can be hosted within an R Shiny environment under Windows, Linux and Mac systems, or deployed in Docker as a web server.

Funding

This work was supported by grants from the National Key R&D Program of China [No. 2020YFE0202200]; Zhejiang Provincial Natural Science Foundation for Distinguished Young Scholars [LR19C050001]; Hangzhou Agriculture and Society Advancement Program [20190101A04]; National Natural Science Foundation of China [81972492]; and National Science Fund for Young Scholars [21904107].

Conflict of Interest: Tiannan Guo is a shareholder of Westlake Omics Inc. Hao Chen, Weigang Ge and Qiushi Zhang are employees of Westlake Omics Inc. The remaining authors declare no competing interests.

References

1. Targeted data extraction of the MS/MS spectra generated by data-independent acquisition: a new concept for consistent and accurate proteome analysis.

Authors: Ludovic C Gillet; Pedro Navarro; Stephen Tate; Hannes Röst; Nathalie Selevsek; Lukas Reiter; Ron Bonner; Ruedi Aebersold
Journal: Mol Cell Proteomics Date: 2012-01-18 Impact factor: 5.911

2. Generating Proteomic Big Data for Precision Medicine.

Authors: Liang Yue; Fangfei Zhang; Rui Sun; Yaoting Sun; Chunhui Yuan; Yi Zhu; Tiannan Guo
Journal: Proteomics Date: 2020-08-26 Impact factor: 3.984

3. MSstats: an R package for statistical analysis of quantitative mass spectrometry-based proteomic experiments.

Authors: Meena Choi; Ching-Yun Chang; Timothy Clough; Daniel Broudy; Trevor Killeen; Brendan MacLean; Olga Vitek
Journal: Bioinformatics Date: 2014-05-02 Impact factor: 6.937

4. The Perseus computational platform for comprehensive analysis of (prote)omics data.

Authors: Stefka Tyanova; Tikira Temu; Pavel Sinitcyn; Arthur Carlson; Marco Y Hein; Tamar Geiger; Matthias Mann; Jürgen Cox
Journal: Nat Methods Date: 2016-06-27 Impact factor: 28.547

5. Data-Independent Acquisition Mass Spectrometry-based Proteomics and Software Tools: A Glimpse in 2020.

Authors: Fangfei Zhang; Weigang Ge; Guan Ruan; Xue Cai; Tiannan Guo
Journal: Proteomics Date: 2020-04-10 Impact factor: 3.984

6. mapDIA: Preprocessing and statistical analysis of quantitative proteomics data from data independent acquisition mass spectrometry.

Authors: Guoshou Teo; Sinae Kim; Chih-Chiang Tsou; Ben Collins; Anne-Claude Gingras; Alexey I Nesvizhskii; Hyungwon Choi
Journal: J Proteomics Date: 2015-09-15 Impact factor: 4.044

7. Rapid mass spectrometric conversion of tissue biopsy samples into permanent quantitative digital proteome maps.

Authors: Tiannan Guo; Petri Kouvonen; Ching Chiek Koh; Ludovic C Gillet; Witold E Wolski; Hannes L Röst; George Rosenberger; Ben C Collins; Lorenz C Blum; Silke Gillessen; Markus Joerger; Wolfram Jochum; Ruedi Aebersold
Journal: Nat Med Date: 2015-03-02 Impact factor: 53.440

8. SWATH2stats: An R/Bioconductor Package to Process and Convert Quantitative SWATH-MS Proteomics Data for Downstream Analysis Tools.

Authors: Peter Blattmann; Moritz Heusel; Ruedi Aebersold
Journal: PLoS One Date: 2016-04-07 Impact factor: 3.240

9. PANDA-view: an easy-to-use tool for statistical analysis and visualization of quantitative proteomics data.

Authors: Cheng Chang; Kaikun Xu; Chaoping Guo; Jinxia Wang; Qi Yan; Jian Zhang; Fuchu He; Yunping Zhu
Journal: Bioinformatics Date: 2018-10-15 Impact factor: 6.937
