Klemens Fröhlich, Eva Brombacher, Matthias Fahrner, Daniel Vogele, Lucas Kook, Niko Pinter, Peter Bronsert, Sylvia Timme-Bronsert, Alexander Schmidt, Katja Bärenfaller, Clemens Kreutz, Oliver Schilling.
Abstract
Numerous software tools exist for data-independent acquisition (DIA) analysis of clinical samples, necessitating their comprehensive benchmarking. We present a benchmark dataset comprising real-world inter-patient heterogeneity, which we use for in-depth benchmarking of DIA data analysis workflows for clinical settings. Combining spectral libraries, DIA software, sparsity reduction, normalization, and statistical tests results in 1428 distinct data analysis workflows, which we evaluate based on their ability to correctly identify differentially abundant proteins. From our dataset, we derive bootstrap datasets of varying sample sizes and use the whole range of bootstrap datasets to robustly evaluate each workflow. We find that all DIA software suites benefit from using a gas-phase fractionated spectral library, irrespective of the library refinement used. Gas-phase fractionation-based libraries perform best against two out of three reference protein lists. Among all investigated statistical tests, non-parametric permutation-based tests consistently perform best.
Year: 2022 PMID: 35551187 PMCID: PMC9098472 DOI: 10.1038/s41467-022-30094-0
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 17.694
Fig. 1 Benchmarking workflow.
A data-independent acquisition (DIA) benchmark dataset was created by adding E. coli peptides in known ratios to peptide preparations of lymph nodes of 92 individuals. We analyzed the raw data with different spectral libraries and DIA software suites. From samples to which E. coli peptides were added in the two E. coli:human peptide ratios 1:25 and 1:12, bootstrap datasets with group sizes of 3 to 23 were generated. For each of those 21 different group sizes, 100 bootstrap datasets were generated. To each bootstrap dataset, different data analysis workflows were applied, each composed of a sparsity reduction, a normalization option, and a statistical test for detecting differentially abundant proteins. The results were returned in a table containing p-values and log2 fold-changes (log2FCs) for each protein. As the ground truth about the changed proteins (E. coli) is known, the prediction performance of each workflow can be assessed. This is done by calculating the receiver operating characteristic (ROC) curve from the p-values of the statistical tests and then the area under the curve (AUC). To assess quantification accuracy, the root-mean-square error (RMSE) is calculated from the detected log2FCs.
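The evaluation steps described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`bootstrap_groups`, `roc_auc`, `rmse`) and the toy p-values are hypothetical, the AUC is computed through its equivalence with the Mann-Whitney statistic rather than by tracing the ROC curve, and the expected log2FC is left as an input because the caption does not state its value.

```python
import math
import random


def bootstrap_groups(samples_a, samples_b, group_size, rng):
    """Draw one bootstrap dataset: `group_size` samples per group, with replacement."""
    return rng.choices(samples_a, k=group_size), rng.choices(samples_b, k=group_size)


def roc_auc(p_values, is_changed):
    """AUC as the probability that a truly changed protein (E. coli) receives
    a smaller p-value than an unchanged one (human); ties count as one half."""
    pos = [p for p, changed in zip(p_values, is_changed) if changed]
    neg = [p for p, changed in zip(p_values, is_changed) if not changed]
    if not pos or not neg:
        raise ValueError("need at least one changed and one unchanged protein")
    wins = 0.0
    for p_pos in pos:
        for p_neg in neg:
            if p_pos < p_neg:
                wins += 1.0
            elif p_pos == p_neg:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def rmse(observed_log2fc, expected_log2fc):
    """Root-mean-square error between detected and expected log2 fold-changes."""
    squared = [(o - e) ** 2 for o, e in zip(observed_log2fc, expected_log2fc)]
    return math.sqrt(sum(squared) / len(squared))


# Toy data (hypothetical): three E. coli proteins followed by three human proteins.
p_values = [0.001, 0.02, 0.30, 0.04, 0.60, 0.90]
is_ecoli = [True, True, True, False, False, False]
auc = roc_auc(p_values, is_ecoli)

# One bootstrap dataset with 5 samples per group, drawn from 23 sample IDs each.
rng = random.Random(0)
group_a, group_b = bootstrap_groups(list(range(23)), list(range(23, 46)), 5, rng)
```

Repeating the bootstrap draw 100 times for each of the 21 group sizes would give the 2100 bootstrap datasets mentioned above.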
Fig. 2 Choice of spectral library and DIA analysis software influences number of identified proteins.
Protein number, distribution, and variance for each DIA analysis workflow separated by species and color-coded by spike-in condition. Left: Number of all identified and quantified proteins in all 92 samples. Center: Log2 intensity distributions. Right: Log2 variance. Log2 variance values < −12 were excluded from this plot. For spike-in condition 1:6, data of n = 22 biologically independent samples were used, and for each of the other spike-in conditions, data of n = 23 biologically independent samples were used. The boxplots show median (center line), interquartile range (IQR, extending from the first to the third quartile) (box), and 1.5 * IQR (whiskers).
Fig. 3 Missing value characteristics and correlations at the protein level.
a The way small intensities are handled mainly depends on the employed DIA software. Means of log2 intensities of identified human (blue) and E. coli (red) proteins plotted against the percentage of missing values in the respective protein. E. coli proteins are not physically present in 25% of samples (indicated by the red arrow). b The correlation between the missingness within samples and the sample mean over all proteins of these samples varies with the employed DIA software. Sample means are plotted against the percentage of missing values in the respective sample.
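The two quantities plotted against each other in panel a, the percentage of missing values and the mean log2 intensity of a protein, can be computed as in the following sketch. It assumes that missing values are encoded as `None` and that intensities are on a linear scale; the function name `missingness_and_mean` and the example values are hypothetical. Applying the same function to the values of one sample across all proteins would give the per-sample quantities of panel b.

```python
import math


def missingness_and_mean(intensities):
    """Return (percent missing, mean log2 intensity of the observed values).

    `intensities` holds one protein's values across samples, with None for
    missing values (an assumed encoding). The mean is None if nothing was observed.
    """
    n_total = len(intensities)
    observed = [x for x in intensities if x is not None]
    pct_missing = 100.0 * (n_total - len(observed)) / n_total
    if not observed:
        return pct_missing, None
    mean_log2 = sum(math.log2(x) for x in observed) / len(observed)
    return pct_missing, mean_log2


# Hypothetical protein quantified in two of four samples.
pct, mean_log2 = missingness_and_mean([1024.0, None, 4096.0, None])
```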
Fig. 4 Statistical analysis of benchmark dataset.
a Workflow schematic: for the generation of bootstrap datasets, random samples were drawn with replacement from samples of spike-in conditions 1:25 and 1:12, mimicking two groups containing differentially abundant proteins, here represented by all E. coli proteins. The p-values acquired after data preprocessing and statistical analysis were used to build receiver operating characteristic (ROC) curves. The partial area under the curve (pAUC) was used as a measure of prediction performance. b pAUC distribution for the different sparsity reduction options (as measured against the ‘DIA Workflow’ protein list). c pAUC for the different DIA analysis workflows as measured against the three reference protein lists. d pAUC distributions for the statistical tests. All seven statistical tests were two-sided and not adjusted for multiple testing. ‘DIA Workflow’ describes the performance against the proteins present in the given DIA workflow only; ‘Combined’ describes the performance against proteins identified by at least one of the DIA analysis workflows. ‘Intersection’ describes the performance against proteins which were found in >80% (in at least 14 of 17) of the DIA analysis workflows. For each reference protein list, the respective median of all pAUC values is indicated by a red line, and the best performing option by a cross. b–d are based on n = 2100 bootstrap datasets which were generated by drawing with replacement from data of n = 23 biologically independent samples of spike-in conditions 1:12 and 1:25, respectively. The group size of these bootstrap datasets ranged from 3 to 23 samples; because of drawing with replacement, individual samples can appear multiple times. 
For b that comes to a total of n = 2100 * 17 DIA workflows * 4 normalizations * 7 statistical tests = 999600 data points per sparsity reduction setting, for c to a total of n = 2100 * 3 sparsity reductions * 4 normalizations * 7 statistical tests = 176400 per library-software combination, and for d to a total of n = 2100 * 17 DIA workflows * 3 sparsity reductions * 4 normalizations = 428400 per statistical test setting. The boxplots show median (center line), interquartile range (IQR, extending from the first to the third quartile) (box), and 1.5 * IQR (whiskers).
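The data point counts in this caption follow directly from the factor sizes and can be checked with a short calculation. The variable names below are illustrative only. Note that the stated total of 176400 per library-software combination is reproduced only when the four normalization options are counted as a factor alongside the sparsity reductions and statistical tests.

```python
from itertools import product

# Factor sizes taken from the captions and the abstract.
n_bootstrap = 21 * 100   # 21 group sizes (3 to 23) x 100 bootstrap datasets each
n_dia_workflows = 17     # library-software combinations
n_sparsity = 3           # sparsity reduction options
n_normalization = 4      # normalization options
n_tests = 7              # statistical tests

# Every distinct data analysis workflow is one combination of the four factors.
workflows = list(product(range(n_dia_workflows), range(n_sparsity),
                         range(n_normalization), range(n_tests)))
n_workflows = len(workflows)

# Data points per panel: all bootstrap datasets times the factors NOT shown on the axis.
per_sparsity_setting = n_bootstrap * n_dia_workflows * n_normalization * n_tests
per_library_software = n_bootstrap * n_sparsity * n_normalization * n_tests
per_statistical_test = n_bootstrap * n_dia_workflows * n_sparsity * n_normalization

print(n_workflows, per_sparsity_setting, per_library_software, per_statistical_test)
```

The number of combinations matches the 1428 distinct workflows stated in the abstract.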