Mathias Walzer1, David García-Seisdedos2, Ananth Prakash2, Paul Brack3, Peter Crowther4, Robert L Graham5, Nancy George2, Suhaib Mohammed2, Pablo Moreno2, Irene Papatheodorou2, Simon J Hubbard3, Juan Antonio Vizcaíno6.
Abstract
The number of mass spectrometry (MS)-based proteomics datasets in the public domain keeps increasing, particularly those generated by Data Independent Acquisition (DIA) approaches such as SWATH-MS. Unlike Data Dependent Acquisition datasets, the re-use of DIA datasets has been rather limited to date, despite its high potential, due to the technical challenges involved. We introduce a (re-)analysis pipeline for public SWATH-MS datasets which includes a combination of metadata annotation protocols, automated workflows for MS data analysis, statistical analysis, and the integration of the results into the Expression Atlas resource. Automation is orchestrated with Nextflow, using containerised open analysis software tools, rendering the pipeline readily available and reproducible. To demonstrate its utility, we reanalysed 10 public DIA datasets from the PRIDE database, comprising 1,278 SWATH-MS runs. The robustness of the analysis was evaluated, and the results compared to those obtained in the original publications. The final expression values were integrated into Expression Atlas, making SWATH-MS experiments more widely available and combining them with expression data originating from other proteomics and transcriptomics datasets.
Year: 2022 PMID: 35701420 PMCID: PMC9197839 DOI: 10.1038/s41597-022-01380-9
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 8.501
Fig. 1 Graphical representation of the DIA data reanalysis pipeline, consisting of 4 parts. (a) Data curation: Metadata annotation protocols and dataset acquisition. (b) SWATH-MS data analysis: Nextflow workflow with steps covering data conversion, SWATH-MS window management, data quality assessment and control (QA/QC), OpenSWATH analysis, FDR calculation and measurement alignment. (c) Statistical analysis: Nextflow workflow for MSstats analysis, normalisation and result filtering. (d) Data integration: Data preparation, accession mapping and integration into Expression Atlas.
Overview of the containers and software versions used in the open data analysis pipeline.
| Step | Name | URL or DockerHub handle | Version |
|---|---|---|---|
| Conversion from raw file to mzML | wiffConverter | sciex/wiffconverter:0.7 | 0.7.0 |
| QC/QA | yamato | | 1.0.4 |
| Window management | python scripts | | 1.0.0 |
| OpenSWATH | OpenSWATH | openswath/Openswath:0.1.2 | 2.4.0 (git 868546e) |
| | PyProphet | | 2.0.dev1 (git ddcedac) |
| TRIC | msproteomicstools | | 0.8.0 (git eeed765) |
| Post-processing | R | | 4.0.3 |
| | MSstats | | 3.22.0 |
| | MyGene.info | | 1.24.0, Ensembl 99/GRCh38 |
Main characteristics of the selected public DIA datasets for data reanalysis. Further details can be found in the ‘Data availability’ section.
| Dataset Identifier | Analysis Type | Short Name | Dataset Size (MS runs) | Technical Replicates | Expression Atlas Accession Number |
|---|---|---|---|---|---|
| PXD004873 | Baseline | HCCpct | 76 | Available | E-PROT-69 |
| PXD000672 | Differential | Digital Kidney | 48 | Not available | E-PROT-59 |
| PXD004691 | Differential | Bank PrCa | 224 | Available | E-PROT-68 |
| PXD014943 | Differential | Wylm | 113 | Not available | E-PROT-67 |
| PXD003497 | Baseline | PrCa Regions | 60 | Available | E-PROT-66 |
| PXD004589 | Baseline | PrCa Network | 210 | Not available | E-PROT-70 |
| PXD014194 | Baseline | BraCa OFLM4 | 145 | Available | E-PROT-72 |
| PXD003539 | Baseline | NCI60 | 120 | Not available | E-PROT-73 |
| PXD001064 | Baseline | Plasma | 240 | Pooled sample replicates | E-PROT-60 |
| PXD010912 | Baseline | DIATPA | 42 | Not available | E-PROT-71 |
Fig. 2 (a) Number of detected proteins per dataset at different FDR levels in the data reanalysis. (b) Protein detection results after 1% protein FDR threshold filtering. Original data refers to the protein numbers stated in the respective publication, reported at 1% protein FDR unless indicated otherwise. Reanalysis numbers are provided unfiltered and with the consistency filter applied (at least 50% of a protein's peptide fragment targets have to be detected within a study group). *Proteins from datasets PXD000672 and PXD004873 were reported in the original publications at 0.1% protein-level FDR only. For dataset PXD004589, results were reported at 0.1% peptide-level FDR.
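The consistency filter described in the caption can be illustrated with a minimal Python sketch. It is not the pipeline's own code: the function name, the input layout and the reading that a target counts as detected if it appears in at least one run of the study group are all assumptions made for illustration.

```python
from collections import defaultdict


def consistency_filter(detections, targets_per_protein, min_fraction=0.5):
    """Return the (protein, group) pairs that pass the consistency filter.

    detections: iterable of (protein, group, target) tuples, one per
        detected peptide fragment target (hypothetical input layout).
    targets_per_protein: dict mapping protein -> set of all its targets.
    min_fraction: minimum fraction of a protein's targets that must be
        detected within a study group (0.5 mirrors the 50% in the caption).
    """
    # Collect the distinct targets seen for each protein within each group.
    seen = defaultdict(set)
    for protein, group, target in detections:
        seen[(protein, group)].add(target)

    kept = set()
    for (protein, group), found in seen.items():
        total = targets_per_protein.get(protein, set())
        if total and len(found & total) / len(total) >= min_fraction:
            kept.add((protein, group))
    return kept


# Made-up example data, not taken from any of the reanalysed datasets.
targets = {"P1": {"t1", "t2", "t3", "t4"}, "P2": {"u1", "u2", "u3"}}
detections = [
    ("P1", "tumour", "t1"), ("P1", "tumour", "t2"),
    ("P1", "normal", "t1"),
    ("P2", "tumour", "u1"), ("P2", "tumour", "u2"), ("P2", "tumour", "u1"),
]
passing = consistency_filter(detections, targets)
```

In this toy input, P1 passes in the tumour group (2 of 4 targets) but not in the normal group (1 of 4), and P2 passes in the tumour group (2 of 3).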
Median CV values for technical replicates, for the reanalysed results and originally published data, respectively.
| Dataset identifier | Reanalysed data | Original data |
|---|---|---|
| PXD014194 | 16.8 | 19.0 |
| PXD004873 | 7.14 | 6.05 |
| PXD003497 | 20.8 | 20.3 |
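A median CV of this kind could be computed along the following lines. This is a sketch under stated assumptions, not the authors' procedure: it assumes the CV is taken per protein across technical replicates on the linear intensity scale, using the sample standard deviation, and is expressed as a percentage. The function names and the input layout are hypothetical.

```python
import statistics


def cv_percent(values):
    """Coefficient of variation in percent: 100 * sample sd / mean."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


def median_cv(replicate_intensities):
    """Median of per-protein CVs.

    replicate_intensities: dict mapping protein -> list of intensities
        measured in technical replicates (hypothetical layout).
    Proteins with fewer than two values or a non-positive mean are skipped.
    """
    cvs = [
        cv_percent(values)
        for values in replicate_intensities.values()
        if len(values) >= 2 and statistics.mean(values) > 0
    ]
    return statistics.median(cvs)


# Made-up intensities for three proteins measured in two replicates each.
example = {"P1": [100.0, 100.0], "P2": [90.0, 110.0], "P3": [80.0, 120.0]}
result = median_cv(example)
```

The choice of scale matters: a CV computed on log-transformed intensities is not comparable to one computed on linear intensities, so the scale would need to match between reanalysed and original data.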
Fig. 3 Violin plots showing the results of the group-wise CV comparisons: (a) PXD003497 reanalysis; (b) PXD003497 original data; (c) PXD004873 reanalysis; (d) PXD004873 original data; (e) PXD014194 reanalysis; (f) PXD014194 original data. As can be seen from the similar sizes and shapes of the violin plots, the CVs across the datasets are largely concordant.
Fig. 4 Correlation analysis of reported log2 protein intensities from technical replicate pairs: (a) PXD003497 reanalysis; (b) PXD003497 original data; (c) PXD004873 reanalysis; (d) PXD004873 original data; (e) PXD014194 reanalysis; (f) PXD014194 original data. The first item of each pair is on the x-axis and the second item is on the y-axis. Each point represents a protein. The point density is indicated by the colour gradient, with black showing the lowest density; the higher the density, the lighter the colour.
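As an illustration of how a replicate pair might be compared numerically, the sketch below computes a Pearson correlation on log2 intensities for proteins quantified in both replicates. The caption does not state which correlation coefficient was used, so Pearson is an assumption here, as are the function name and the input layout.

```python
import math


def log2_pearson(first, second):
    """Pearson correlation of log2 intensities for shared proteins.

    first, second: dicts mapping protein -> linear intensity for the two
        members of a technical replicate pair (hypothetical layout).
    Only proteins with a positive intensity in both replicates are used.
    """
    shared = [p for p in first if p in second and first[p] > 0 and second[p] > 0]
    xs = [math.log2(first[p]) for p in shared]
    ys = [math.log2(second[p]) for p in shared]
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two shared proteins")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up intensities; protein "D" is only present in the second replicate.
rep1 = {"A": 2.0, "B": 4.0, "C": 8.0}
rep2 = {"A": 4.0, "B": 8.0, "C": 16.0, "D": 1.0}
r = log2_pearson(rep1, rep2)
```

Restricting the comparison to shared proteins avoids imputing missing values, which would otherwise influence the coefficient.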
Fig. 5 Volcano plots corresponding to the differential expression analysis for dataset PXD014943: (a) extranodal diffuse large B-cell lymphoma (eDLBCL) versus primary central nervous system lymphoma (PCNSL); (b) intravascular lymphoma (IVL) versus eDLBCL. For dataset PXD004691: (c) normal tissue (fresh frozen) versus PrC (fresh frozen); (d) normal tissue (paraffin embedded) versus tumour tissue (paraffin embedded). For dataset PXD000672: (e) benign tissue samples versus clear cell RCC; (f) clear cell RCC versus papillary RCC. Each point represents the fold change (FC) of a protein in the given comparison. Proteins with a significant FC are indicated by colour; dashed lines mark the fold-change cutoff of 2 and the (adjusted) p-value cutoff of 0.05.
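The two cutoffs in the caption translate into a simple classification rule, sketched below in Python. A fold-change cutoff of 2 corresponds to an absolute log2 fold change of 1. Whether the boundaries are inclusive is not stated in the caption, so the comparison operators used here are an assumption, and the function name is hypothetical.

```python
import math


def classify(log2_fc, adj_p, fc_cutoff=2.0, p_cutoff=0.05):
    """Label a protein as 'up', 'down' or 'not significant'.

    log2_fc: log2 fold change of the protein in the comparison.
    adj_p: adjusted p-value for that comparison.
    fc_cutoff, p_cutoff: the cutoffs named in the caption (2 and 0.05);
        inclusive FC and strict p-value boundaries are assumptions.
    """
    if adj_p < p_cutoff and abs(log2_fc) >= math.log2(fc_cutoff):
        return "up" if log2_fc > 0 else "down"
    return "not significant"


# Made-up values for four proteins: (log2 fold change, adjusted p-value).
labels = [
    classify(1.5, 0.01),
    classify(-2.0, 0.04),
    classify(0.5, 0.001),
    classify(3.0, 0.2),
]
```

Here the third protein fails the fold-change cutoff despite its small p-value, and the fourth fails the p-value cutoff despite its large fold change.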
Types of input to the DIA reanalysis pipeline implemented in Nextflow.
| An iRT file in OpenSWATH conformant |
| A collection of |
| Target library file in OpenSWATH conformant |
| Study design annotation in MSstats conformant format and SDRF format ( |
| Parameter |
| FDR filtered alignment results of transition quantifications ( |
| QC records for each MS run ( |
| MSstats analysis object ( |
| Dataset analysis report ( |
| Prefiltered Expression Atlas upload input ( |
Detailed information on the datasets included in the reanalysis.
| Dataset identifier | Originally reported protein quantification | Condition contrasts available | Condition common Proteins | Expression Atlas URL |
|---|---|---|---|---|
| PXD004873 | Per MS run | — | — | |
| PXD000672 | Per MS run | 2 | 975; 1022 | |
| PXD004691 | Per sample | 2 | 1435; 991 | |
| PXD014943 | Per sample | 2 | 3019; 1080 | |
| PXD003497 | Per MS run | — | — | |
| PXD004589 | Per MS run | — | — | |
| PXD014194 | Per MS run | — | — | |
| PXD003539 | — | — | — | |
| PXD001064 | — | — | — | |
| PXD010912 | — | — | — | |