Tiansheng Zhu1,2,3, Hao Chen1,2, Xishan Yan4, Zhicheng Wu1,2,3, Xiaoxu Zhou1,2, Qi Xiao1,2, Weigang Ge1,2, Qiushi Zhang1,2, Chao Xu4, Luang Xu1,2, Guan Ruan1,2, Zhangzhi Xue1,2, Chunhui Yuan1,2, Guo-Bo Chen5, Tiannan Guo1,2.
1. Zhejiang Provincial Laboratory of Life Sciences and Biomedicine, Key Laboratory of Structural Biology of Zhejiang Province, School of Life Sciences, Westlake University, 18 Shilongshan Road, Hangzhou, Zhejiang Province, China.
2. Institute of Basic Medical Sciences, Westlake Institute for Advanced Study, 18 Shilongshan Road, Hangzhou, Zhejiang Province, China.
3. Shanghai Key Lab of Intelligent Information Processing, and School of Computer Science, Fudan University, China.
4. College of Mathematics and Informatics, Digital Fujian Institute of Big Data Security Technology, Fujian Normal University, China.
5. Clinical Research Institute, Zhejiang Provincial People's Hospital, People's Hospital of Hangzhou Medical College, Hangzhou, Zhejiang, China.
Abstract
The rapid progress of high-throughput sequencing-based omics and mass spectrometry (MS)-based proteomics, such as data-independent acquisition (DIA), and their penetration into clinical studies have generated an increasing number of proteomic data sets containing hundreds to thousands of samples. To analyze these quantitative proteomic data sets and other omics data sets more efficiently and conveniently, we present ProteomeExpert, a web server-based software tool implemented in Docker, which offers various analysis tools for experimental design, data mining, interpretation, and visualization of quantitative proteomic data sets. ProteomeExpert can be deployed on any operating system with Docker or an R language environment installed. Availability and implementation: The Docker image of ProteomeExpert is freely available from https://hub.docker.com/r/lifeinfo/proteomeexpert. The source code of ProteomeExpert is also openly accessible at http://www.github.com/lifeinfo/ProteomeExpert/. In addition, a demo server is provided at https://proteomic.shinyapps.io/peserver/. Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction
Recent advances in liquid chromatography-mass spectrometry-based proteomics technology permit the acquisition of hundreds to thousands of proteomics datasets in a relatively short time, especially with the data-independent acquisition (DIA) mass spectrometry (MS) strategy (Gillet ; Guo ; Yue ; Zhang ). The substantial increase in proteomics data over the last few years necessitates effective algorithms and software tools for automatic interpretation of the resulting quantitative datasets, in order to obtain valuable biological insights or assist clinical diagnosis. Several existing tools, such as SWATH2stats (Blattmann ) and MSstats (Choi ), require programming skills. Another tool, mapDIA (Teo ), is a statistical analysis package for differential protein expression analysis using DIA fragment-level intensities. PANDA-view (Chang ) and Perseus (Tyanova ) both depend on other packages and do not include functionality for feature selection, peptide–protein inference or experimental design. Therefore, proteomics analysis tools that support this full range of tasks are still lacking.
Here, we provide an easy-to-use, web server-based, comprehensive data analysis platform for quantitative proteomics data and other omics data (e.g. transcriptomics and metabolomics) that covers experimental design, data preprocessing, data quality control, protein inference, statistics, feature selection, unsupervised learning, supervised learning and visualization. To facilitate installation and deployment, we release this platform as a Docker image in addition to the source code on GitHub; the image is easy to install and share on operating systems such as Linux, Mac and Windows.
2 Materials and methods
2.1 Architecture and implementation
ProteomeExpert is built as a Docker image. Figure 1 shows an overview of the platform's modules. It includes an interactive web interface based on the Shiny package for the R language, which integrates a collection of modules. Each analysis can be customized by tuning the parameters in the corresponding interface; the resource-demanding computation is done on a remote server that can be upgraded as needed. This architecture is designed to alleviate the computational burden of handling the increasing volume of proteomic data. Even with little programming skill, users can conduct most analyses that meet their requirements. A detailed description, including power design, protein inference, feature selection and help information, is given in the Supplementary File.
Fig. 1. The ProteomeExpert data analysis platform
2.2 Data input and output
‘Data upload’ is the core data input interface to upload the data file through a web interface. The data console allows uploading of the user-specific protein matrix and experiment meta-data (including experiment run sample and individual/patient information) as the input data for generating summary statistics, machine learning and data preprocessing. The modules for batch design, protein inference, annotation and supervised learning for the testing set have their own data upload functions. ProteomeExpert also provides interactive parameter settings and options to download processed data and plots.
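As a rough illustration of the two inputs described above, the sketch below parses a tab-delimited protein matrix and a sample metadata table and reports matrix columns that have no metadata row. It is not ProteomeExpert code: the column names (`protein`, `sample`, `patient`, `group`), the `NA` missing-value marker and all function names are assumptions made for this example, and the toy data are invented.

```python
import csv
import io


def read_protein_matrix(text, delimiter="\t"):
    """Parse a protein-by-sample matrix; empty or 'NA' cells become None."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    samples = rows[0][1:]  # the first header cell labels the protein column
    matrix = {}
    for row in rows[1:]:
        if not row:
            continue  # skip blank lines
        protein, cells = row[0], row[1:]
        matrix[protein] = [
            None if cell.strip() in ("", "NA") else float(cell) for cell in cells
        ]
    return samples, matrix


def read_sample_metadata(text, delimiter="\t"):
    """Parse metadata with one row per sample, keyed by the assumed 'sample' column."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return {row["sample"]: row for row in reader}


def unmatched_samples(samples, metadata):
    """Return matrix columns that have no corresponding metadata row."""
    return [s for s in samples if s not in metadata]


# Toy data, invented for illustration only.
MATRIX_TEXT = "protein\tS1\tS2\tS3\nP1\t10.5\tNA\t12.0\nP2\t8.0\t7.5\t\n"
META_TEXT = "sample\tpatient\tgroup\nS1\tA\tcase\nS2\tB\tcontrol\n"

samples, matrix = read_protein_matrix(MATRIX_TEXT)
metadata = read_sample_metadata(META_TEXT)
print(unmatched_samples(samples, metadata))
```

Checking that every matrix column has a metadata row before any analysis starts is a cheap way to catch mislabelled runs early.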
2.3 Experimental design
The experimental design module includes two sub-modules: power analysis and batch design. Power analysis estimates the sample size needed to detect statistically significant differentially expressed proteins at given type I and type II error rates. Batch design is intended for emerging large cohorts containing hundreds or thousands of samples, a scale that must be processed in multiple batches. Because unwanted variation often creeps in through technical variability among batches, this module implements a balanced batch design that maximizes statistical power while controlling technical variation. Batch design and batch correction methods are necessary for downstream data analysis, especially for large cohorts.
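The two ideas in this section can be sketched in a few lines. The exact methods used by ProteomeExpert are described in its Supplementary File and are not reproduced here; the sketch below assumes the textbook normal-approximation sample size formula for a two-group comparison, n = 2((z_{1-alpha/2} + z_{power}) * sigma / delta)^2 per group, and a simple stratified randomization for batch assignment. All function and parameter names are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict
from statistics import NormalDist


def samples_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided, two-group comparison
    (normal approximation; an assumption, not necessarily the tool's method)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # type I error quantile
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - type II error
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


def balanced_batches(sample_groups, n_batches, seed=0):
    """Deal samples into batches so each group is spread as evenly as possible."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for sample, group in sample_groups.items():
        by_group[group].append(sample)

    batches = [[] for _ in range(n_batches)]
    slot = 0
    for group in sorted(by_group):
        members = by_group[group]
        rng.shuffle(members)  # random order within each group
        for sample in members:
            batches[slot % n_batches].append(sample)  # round-robin dealing
            slot += 1
    for batch in batches:
        rng.shuffle(batch)  # randomize run order inside each batch
    return batches


# Effect size of one standard deviation at alpha = 0.05 and 80% power.
print(samples_per_group(delta=1.0, sigma=1.0))

# Hypothetical cohort: 12 cases and 12 controls spread over 4 batches.
groups = {f"case{i}": "case" for i in range(12)}
groups.update({f"ctrl{i}": "control" for i in range(12)})
batches = balanced_batches(groups, n_batches=4)
for batch in batches:
    print(Counter(groups[s] for s in batch))
```

Because the round-robin counter carries over from one group to the next, batch sizes and per-group counts within each batch differ by at most one sample, so batch is not confounded with the biological grouping.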
2.4 Data preprocessing
The data preprocessing module includes methods for log transformation, missing value substitution, normalization, batch effect correction and treatment of replicates. Batch effect correction is particularly important in data preprocessing. Because batch effects may obscure biological signals, we recommend performing batch diagnostics, normalization and adjustment right after peptide identification.
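To make the first three preprocessing steps concrete, here is a minimal sketch that applies a log2 transform, median-centres each sample, and then substitutes missing values with the smallest observed value. The specific choices (median centring, minimum-value substitution, the order of steps) are assumptions for illustration and are not claimed to be the tool's defaults; batch effect correction and replicate handling are not covered.

```python
import math
from statistics import median


def log2_transform(matrix):
    """log2 of each intensity; missing (None) or non-positive values become None."""
    return [
        [math.log2(v) if v is not None and v > 0 else None for v in row]
        for row in matrix
    ]


def median_center(matrix):
    """Subtract each sample's (column's) median so every sample is centred at 0."""
    centred = [list(row) for row in matrix]
    for j in range(len(matrix[0])):
        observed = [row[j] for row in matrix if row[j] is not None]
        if not observed:
            continue  # nothing measured in this sample
        shift = median(observed)
        for row in centred:
            if row[j] is not None:
                row[j] -= shift
    return centred


def impute_minimum(matrix):
    """Replace missing values with the smallest observed value in the matrix."""
    floor = min(v for row in matrix for v in row if v is not None)
    return [[floor if v is None else v for v in row] for row in matrix]


# Rows are proteins, columns are samples; toy intensities invented for illustration.
raw = [
    [1024.0, 2048.0],
    [256.0, None],
    [64.0, 128.0],
]
processed = impute_minimum(median_center(log2_transform(raw)))
print(processed)
```

Normalizing before imputation keeps the substituted values from distorting the per-sample medians.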
2.5 Machine learning
In the feature selection module, users can apply filter methods, including near-zero variance and high correlation filters, as well as additional feature selection methods: LASSO (Tibshirani, 1996), genetic algorithm and random forest. Because classifying disease into subtypes is of great interest for diagnosis and prognosis in clinical applications, users can perform various machine-learning analyses: the unsupervised methods PCA, t-SNE and UMAP for dimensionality reduction, and the supervised algorithms decision tree, random forest and XGBoost for classification.
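The two filter methods mentioned above can be sketched as follows. This is a simplified stand-in, not the implementation used by ProteomeExpert: it uses a plain variance threshold for the near-zero variance filter and a greedy pass that drops any feature whose absolute Pearson correlation with an already kept feature exceeds a cutoff. The thresholds and function names are assumptions.

```python
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def filter_features(features, min_variance=1e-8, max_abs_corr=0.9):
    """Return names of features that pass both filters, in input order."""
    kept = {}
    for name, values in features.items():
        if pvariance(values) < min_variance:
            continue  # near-zero variance: carries no information
        if any(abs(pearson(values, other)) > max_abs_corr for other in kept.values()):
            continue  # redundant with a feature that was already kept
        kept[name] = values
    return list(kept)


# Toy protein intensities across four samples, invented for illustration.
features = {
    "P1": [1.0, 2.0, 3.0, 4.0],
    "P2": [2.0, 4.1, 6.0, 8.2],  # almost a multiple of P1
    "P3": [5.0, 5.0, 5.0, 5.0],  # constant
    "P4": [4.0, 1.0, 3.0, 2.0],
}
print(filter_features(features))
```

The greedy pass depends on feature order: of two highly correlated features, the one listed first is retained.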
3 Conclusions
In summary, we have developed ProteomeExpert to meet the requirement for processing large-scale quantitative proteomics datasets. Most, if not all, quantitative proteomic datasets can be fed into ProteomeExpert, including but not limited to DIA-MS with or without ion mobility, label-free or stable isotope labeling-based data-dependent acquisition MS, parallel reaction monitoring MS and multiple reaction monitoring MS. Transcriptomic and metabolomic datasets can also be processed by this tool. ProteomeExpert is compatible with other omics tools: their results can be uploaded as tab-delimited or comma-separated text files or as Excel files. Moreover, ProteomeExpert includes comprehensive methods for data preprocessing, visualization, statistics and machine learning. It can be hosted within an R Shiny environment under Windows, Linux and Mac systems, or deployed in Docker as a web server.
Funding
This work was supported by grants from the National Key R&D Program of China [No. 2020YFE0202200]; Zhejiang Provincial Natural Science Foundation for Distinguished Young Scholars [LR19C050001]; Hangzhou Agriculture and Society Advancement Program [20190101A04]; National Natural Science Foundation of China [81972492]; and National Science Fund for Young Scholars [21904107].
Conflict of Interest: Tiannan Guo is a shareholder of Westlake Omics Inc. Hao Chen, Weigang Ge and Qiushi Zhang are employees of Westlake Omics Inc. The remaining authors declare no competing interests.