Matthias Fahrner, Melanie Christine Föll, Björn Andreas Grüning, Matthias Bernt, Hannes Röst, Oliver Schilling.
Abstract
BACKGROUND: Data-independent acquisition (DIA) has become an important approach in global, mass spectrometric proteomic studies because it provides in-depth insights into the molecular variety of biological systems. However, DIA data analysis remains challenging owing to the high complexity of the data and the large data volumes and sample sizes, which require specialized software and extensive computing infrastructure. Most available open-source DIA software requires basic programming skills and covers only part of a complete DIA data analysis. Consequently, DIA data analysis often requires several software tools that must be compatible with one another, which severely limits usability and reproducibility.
Keywords: bioinformatics; computational workflows; data-independent acquisition; galaxy; mass spectrometry; proteomics
Year: 2022 PMID: 35166338 PMCID: PMC8848309 DOI: 10.1093/gigascience/giac005
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Overview of newly integrated tools for the DIA data analysis in Galaxy
| Integrated tool reference | Function | Galaxy Toolshed |
|---|---|---|
| diapysef | Spectral library generation | |
| OpenSwathAssayGenerator | Spectral library refinement | |
| OpenSwathDecoyGenerator | Spectral library refinement | |
| TargetedFileConverter | Spectral library conversion | |
| OpenSwathWorkflow | Peptide identification and quantification in DIA data | |
| PyProphet merge | Combining individual analysis results to allow for global scoring | |
| PyProphet subsample | Subsampling of combined analysis results for faster scoring | |
| PyProphet score | Target-decoy scoring | |
| PyProphet peptide | Applying computed scores on peptide level | |
| PyProphet protein | Applying computed scores on protein level | |
| PyProphet export (includes swath2stats) | Export results (optional: apply swath2stats functionality) | |
We integrated 11 tools to enable a complete DIA analysis in Galaxy. Tool names (including the respective references) and their function within the analysis pipeline, as well as a link to the Galaxy toolshed, are provided.
Figure 1: Introducing an all-in-one DIA analysis solution by implementing all necessary tools for a DIA analysis in the Galaxy framework. (A) Schematic overview of a classic data-independent acquisition (DIA) analysis workflow compared with the all-in-one workflow introduced here in the Galaxy framework. The classic DIA workflow spans different software environments and operating system requirements, indicated by different colors (light green: local MaxQuant analysis; yellow: diapysef Python shell; dark green: MobaXterm enhanced terminal for Windows; blue: local msconvert; grey: MSstats in Rstudio), whereas all necessary tools are now implemented in the Galaxy framework. A complete DIA analysis can be divided into 3 steps: (i) spectral library generation, (ii) peptide and protein identification and quantification in DIA data, and (iii) statistical analysis to identify differentially expressed proteins. (B) Generation of a spectral library from data-dependent acquisition (DDA) data, shown as a Galaxy workflow. (C) DIA data analysis, shown as a Galaxy workflow.
Figure 2: Analysis of a DIA spike-in dataset in Galaxy. (A) Experimental design of a spike-in dataset based on equal amounts of HEK proteome and known spike-in amounts of E. coli proteome. For spectral library generation, 1 representative sample of each mixture was measured using data-dependent acquisition (DDA). Each E. coli:HEK ratio was measured in 4 replicates using data-independent acquisition (DIA). DIA analysis was performed on the basis of the spectral library and the individual DIA measurements, followed by statistical analysis to identify differentially expressed proteins. (B) Retention time (RT) alignment plot of the measured RT against the respective indexed retention time (iRT) of reference peptides (iRT peptides) during the generation of the spectral library (shown for 1 of the DDA measurements as an example). All measured RTs are converted to iRTs on the basis of the linear regression of the reference peptides (R² and adjusted R² for the linear regression are shown above the plot).
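The RT-to-iRT conversion described in panel B can be illustrated with a minimal sketch, assuming an ordinary least-squares fit of iRT against measured RT. The function names and the example values are hypothetical and are not taken from the dataset or from diapysef.

```python
from statistics import mean


def fit_irt_regression(measured_rt, reference_irt):
    """Least-squares fit of iRT = slope * RT + intercept on the reference peptides.

    Returns slope, intercept, R^2 and adjusted R^2 (one predictor).
    """
    n = len(measured_rt)
    mean_rt, mean_irt = mean(measured_rt), mean(reference_irt)
    sxx = sum((x - mean_rt) ** 2 for x in measured_rt)
    sxy = sum((x - mean_rt) * (y - mean_irt)
              for x, y in zip(measured_rt, reference_irt))
    slope = sxy / sxx
    intercept = mean_irt - slope * mean_rt

    ss_res = sum((y - (slope * x + intercept)) ** 2
                 for x, y in zip(measured_rt, reference_irt))
    ss_tot = sum((y - mean_irt) ** 2 for y in reference_irt)
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R^2 for a single predictor: 1 - (1 - R^2) * (n - 1) / (n - 2)
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - 2)
    return slope, intercept, r2, r2_adj


def rt_to_irt(rt, slope, intercept):
    """Convert a measured retention time to the iRT scale."""
    return slope * rt + intercept


# Hypothetical reference peptides: measured RT (min) and their known iRT values.
example_rt = [10.0, 20.0, 30.0, 40.0]
example_irt = [0.0, 25.0, 50.0, 75.0]
slope, intercept, r2, r2_adj = fit_irt_regression(example_rt, example_irt)
converted = rt_to_irt(50.0, slope, intercept)
```

Once the fit is available, every measured RT in the library can be mapped to the iRT scale with `rt_to_irt`; an R² close to 1 indicates that the reference peptides follow the assumed linear relationship.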
Figure 3: Overview of target-decoy scoring using PyProphet score during the DIA analysis in Galaxy. (A) Receiver operating characteristic (ROC) curve highlighting the sensitivity and specificity of the target-decoy scoring. (B) Plot showing the discriminatory score (d-score) performance between the target (green) and decoy (red) precursors. (C) Bar plot and (D) density plot showing d-score distribution among target and decoy precursors. (E) Histogram showing the distribution of P-values computed on the basis of the target-decoy scoring.
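The idea behind target-decoy error estimation can be illustrated with a simplified sketch. It assumes plain decoy counting: at each score threshold the false discovery rate is approximated as the number of decoys divided by the number of targets scoring at or above that threshold. PyProphet's own scoring and error-rate estimation are more elaborate, so this is not its algorithm; the function name and scores are hypothetical.

```python
def decoy_counting_qvalues(target_scores, decoy_scores):
    """Return one q-value per target score, in the order of the input.

    Simplified decoy counting (an assumption, not PyProphet's method):
    FDR(threshold) ~ (#decoys >= threshold) / (#targets >= threshold).
    """
    labelled = [(score, True, i) for i, score in enumerate(target_scores)]
    labelled += [(score, False, -1) for score in decoy_scores]
    # Sort by descending score; on ties decoys come first (conservative).
    labelled.sort(key=lambda item: (-item[0], item[1]))

    n_targets = n_decoys = 0
    fdr_at_target = []  # (index of target, estimated FDR at its score)
    for _score, is_target, idx in labelled:
        if is_target:
            n_targets += 1
            fdr_at_target.append((idx, min(1.0, n_decoys / n_targets)))
        else:
            n_decoys += 1

    # q-value: the smallest FDR at this threshold or any more permissive one.
    qvalues = [0.0] * len(target_scores)
    running_min = 1.0
    for idx, fdr in reversed(fdr_at_target):
        running_min = min(running_min, fdr)
        qvalues[idx] = running_min
    return qvalues


# Hypothetical d-scores: well-separated targets score higher than decoys.
example_targets = [5.0, 4.0, 3.0, 1.0]
example_decoys = [3.5, 0.5]
example_q = decoy_counting_qvalues(example_targets, example_decoys)
```

The better the d-score separates targets from decoys (panels B to D), the more targets receive a low q-value under an estimate of this kind.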
Overview of identified and quantified precursors, peptides, and proteins
| Sample (E. coli:HEK ratio) | Replicate | Precursors | Peptides | Proteins |
|---|---|---|---|---|
| 1:17 | 1 | 28,792 | 27,117 | 4,979 |
| 1:17 | 2 | 28,697 | 27,040 | 4,972 |
| 1:17 | 3 | 28,673 | 27,024 | 4,970 |
| 1:17 | 4 | 28,641 | 27,003 | 4,989 |
| 1:3 | 1 | 28,711 | 27,060 | 5,011 |
| 1:3 | 2 | 28,690 | 27,043 | 5,000 |
| 1:3 | 3 | 28,672 | 27,028 | 5,006 |
| 1:3 | 4 | 28,669 | 27,035 | 4,996 |
| 1:50 | 1 | 28,191 | 26,576 | 4,906 |
| 1:50 | 2 | 28,255 | 26,636 | 4,914 |
| 1:50 | 3 | 28,231 | 26,595 | 4,919 |
| 1:50 | 4 | 28,192 | 26,577 | 4,916 |
| 1:7 | 1 | 28,833 | 27,160 | 4,994 |
| 1:7 | 2 | 28,791 | 27,123 | 5,006 |
| 1:7 | 3 | 28,804 | 27,136 | 5,004 |
| 1:7 | 4 | 28,837 | 27,166 | 5,011 |
| HEK only | 1 | 27,166 | 25,669 | 4,843 |
| HEK only | 2 | 27,182 | 25,683 | 4,862 |
| HEK only | 3 | 27,176 | 25,663 | 4,869 |
| HEK only | 4 | 27,099 | 25,600 | 4,858 |
DIA analysis results were filtered at 1% FDR at the peptide and protein levels during export using the PyProphet export tool. In combination with a sample annotation file, the swath2stats functionality was applied, yielding an overview of identified and quantified precursors, peptides, and proteins in each sample.
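The filtering and counting step can be sketched as follows. This is an illustration only: the row keys (`sample`, `peptide`, `charge`, `protein`, `peptide_q`, `protein_q`) are hypothetical and do not correspond to the actual column names of the PyProphet export, and a precursor is assumed to be a peptide together with its charge state.

```python
from collections import defaultdict


def filter_and_count(rows, max_q=0.01):
    """Keep rows passing the peptide- and protein-level q-value cutoff,
    then count distinct precursors, peptides and proteins per sample."""
    kept = [r for r in rows
            if r["peptide_q"] <= max_q and r["protein_q"] <= max_q]

    seen = defaultdict(lambda: {"precursors": set(),
                                "peptides": set(),
                                "proteins": set()})
    for r in kept:
        entry = seen[r["sample"]]
        # Assumption: a precursor is a peptide plus its charge state.
        entry["precursors"].add((r["peptide"], r["charge"]))
        entry["peptides"].add(r["peptide"])
        entry["proteins"].add(r["protein"])

    return {sample: {level: len(ids) for level, ids in entry.items()}
            for sample, entry in seen.items()}


# Hypothetical export rows for two samples.
example_rows = [
    {"sample": "A", "peptide": "PEPTIDEK", "charge": 2, "protein": "P1",
     "peptide_q": 0.001, "protein_q": 0.002},
    {"sample": "A", "peptide": "PEPTIDEK", "charge": 3, "protein": "P1",
     "peptide_q": 0.005, "protein_q": 0.002},
    {"sample": "A", "peptide": "LOWCONFK", "charge": 2, "protein": "P2",
     "peptide_q": 0.05, "protein_q": 0.002},   # fails the peptide-level cutoff
    {"sample": "B", "peptide": "PEPTIDEK", "charge": 2, "protein": "P1",
     "peptide_q": 0.001, "protein_q": 0.02},   # fails the protein-level cutoff
    {"sample": "B", "peptide": "ANOTHERK", "charge": 2, "protein": "P3",
     "peptide_q": 0.009, "protein_q": 0.01},
]
example_counts = filter_and_count(example_rows)
```

Applying both cutoffs before counting means a row contributes to the per-sample totals only if its peptide and its protein each pass the 1% threshold.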
Figure 4: Protein quantification results obtained using the DIA analysis tools in Galaxy. (A) Box plot showing the distribution and median value of global median normalized log2 protein intensities. (B) Violin plot illustrating the distribution of coefficients of variation (CV) of the log2-transformed protein intensities for each condition (here E. coli:HEK ratios) across replicates (each n = 4). For E. coli–containing mixtures only E. coli proteins were used (red), and for the HEK replicates only human proteins were used (blue). (C) Volcano plot showing −log10 adjusted P-values against log2 fold changes, highlighting differentially expressed proteins comparing the 2 E. coli:HEK ratios 1:17 vs 1:7. Upregulated proteins are shown in red, whereas downregulated proteins are shown in blue. The significance threshold is indicated as a dashed line at an adjusted P-value of 0.05.
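The quantities plotted in panels A to C can be illustrated with a minimal sketch. It assumes that global median normalization shifts each sample so that its median log2 intensity equals the median of all sample medians, and that the CV is the standard deviation divided by the mean, in percent. The exact procedures used by the statistics tool may differ; function names and values are hypothetical.

```python
from statistics import mean, median, stdev


def median_normalize(log2_by_sample):
    """Shift each sample so its median equals the median of the sample medians.

    `log2_by_sample` maps sample name -> {protein: log2 intensity}.
    """
    sample_medians = {s: median(v.values()) for s, v in log2_by_sample.items()}
    global_median = median(sample_medians.values())
    return {s: {p: x - sample_medians[s] + global_median for p, x in v.items()}
            for s, v in log2_by_sample.items()}


def coefficient_of_variation(values):
    """CV in percent across replicates (sample standard deviation / mean)."""
    return 100.0 * stdev(values) / mean(values)


def log2_fold_change(condition_a, condition_b):
    """Difference of mean log2 intensities (condition A minus condition B)."""
    return mean(condition_a) - mean(condition_b)


# Hypothetical log2 protein intensities for two samples.
example_raw = {
    "s1": {"P1": 10.0, "P2": 12.0, "P3": 14.0},
    "s2": {"P1": 11.0, "P2": 13.0, "P3": 15.0},
}
example_norm = median_normalize(example_raw)
example_cv = coefficient_of_variation([9.0, 10.0, 11.0])
example_lfc = log2_fold_change([12.0, 12.0], [10.0, 10.0])
```

Because the intensities are already on the log2 scale, a fold change is a simple difference of means, and the normalization is a per-sample shift rather than a scaling factor.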