Claire Ramus1, Agnès Hovasse2, Marlène Marcellin3, Anne-Marie Hesse1, Emmanuelle Mouton-Barbosa3, David Bouyssié3, Sebastian Vaca2, Christine Carapito2, Karima Chaoui3, Christophe Bruley1, Jérôme Garin1, Sarah Cianférani2, Myriam Ferro1, Alain Van Dorssaeler2, Odile Burlet-Schiltz3, Christine Schaeffer2, Yohann Couté1, Anne Gonzalez de Peredo3.
Abstract
This data article describes a controlled, spiked proteomic dataset for which the "ground truth" of variant proteins is known. It is based on the LC-MS analysis of samples composed of a fixed background of yeast lysate and different spiked amounts of the UPS1 mixture of 48 recombinant proteins. It can be used to objectively evaluate bioinformatic pipelines for label-free quantitative analysis, and their ability to detect variant proteins with good sensitivity and a low false discovery rate in large-scale proteomic studies. More specifically, it can be useful for tuning software tool parameters, for testing new algorithms for label-free quantitative analysis, or for evaluating downstream statistical methods. The raw MS files can be downloaded from ProteomeXchange with identifier PXD001819. Starting from some of the raw files of this dataset, we also provide processed data obtained through various bioinformatics tools (including MaxQuant, Skyline, MFPaQ, IRMa-hEIDI and Scaffold) in different workflows, to exemplify the use of such data in the context of software benchmarking, as discussed in detail in the accompanying manuscript [1]. The experimental design used here for data processing takes advantage of the different spike levels introduced in the samples composing the dataset. The processed data are merged in a single file to facilitate the evaluation and illustration of software tool results for the detection of variant proteins with different absolute expression levels and fold change values.
Year: 2015 PMID: 26862574 PMCID: PMC4706616 DOI: 10.1016/j.dib.2015.11.063
Source DB: PubMed Journal: Data Brief ISSN: 2352-3409
Fig. 1. Illustration of the absolute abundance of spiked proteins compared to the yeast background in the last 6 samples of the dataset. Absolute abundances were estimated using the iBAQ metric calculated by MaxQuant in workflows 6 and 7 (see below for the details of the workflows).
Fig. 2. Experimental design of the data processing workflow.
Fig. 3. ROC curves plotted from the dataset to compare filtering criteria (A) or bioinformatics workflows (B). (A) Sensitivity-FDP curves were plotted for the data obtained from workflow 6 (quantification based on MaxQuant intensity values) by varying either the |Welch t-test difference| threshold (red), the |z-score| threshold (green) or the Welch t-test p-value threshold (blue). The Welch t-test difference, z-score or p-value was either used as the sole criterion to classify the proteins (solid curves), or combined with the other filters to improve the classification (dotted curves). (B) Overlaid ROC curves for the different bioinformatics workflows: proteins were classified as variant by filtering on the p-value threshold, combined with a fixed |log2(fold change)| threshold of 1 for spectral count workflows (1–4) and with a fixed |z-score| threshold of 1 for MS intensity based workflows (5–8).
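The classification described for Fig. 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the toy values, and the use of inclusive comparisons (`<=` for the p-value, `>=` for the effect size) are assumptions, and FDP is taken here to be the fraction of proteins called variant that are not spiked (background yeast proteins). The effect size stands for either |log2(fold change)| or |z-score|, depending on the workflow.

```python
from typing import List, Tuple


def sensitivity_fdp_curve(
    p_values: List[float],
    effect_sizes: List[float],
    is_spiked: List[bool],
    p_thresholds: List[float],
    min_abs_effect: float = 1.0,
) -> List[Tuple[float, float, float]]:
    """Return (p_threshold, sensitivity, fdp) for each p-value threshold.

    A protein is called variant when its p-value is at or below the
    threshold AND its absolute effect size is at or above min_abs_effect
    (hypothetical inclusive cut-offs, chosen for illustration).
    """
    n_true = sum(is_spiked)  # number of truly variant (spiked) proteins
    curve = []
    for thr in p_thresholds:
        tp = fp = 0
        for p, eff, truth in zip(p_values, effect_sizes, is_spiked):
            if p <= thr and abs(eff) >= min_abs_effect:
                if truth:
                    tp += 1  # spiked protein correctly called variant
                else:
                    fp += 1  # background protein wrongly called variant
        sensitivity = tp / n_true if n_true else 0.0
        fdp = fp / (tp + fp) if (tp + fp) else 0.0
        curve.append((thr, sensitivity, fdp))
    return curve


# Toy values, invented for illustration only (not taken from the dataset).
p_values = [0.001, 0.004, 0.03, 0.2, 0.002, 0.5]
effect_sizes = [2.5, -1.8, 1.2, 0.4, 1.5, 0.1]
is_spiked = [True, True, True, True, False, False]

curve = sensitivity_fdp_curve(p_values, effect_sizes, is_spiked, [0.01, 0.05])
for thr, sens, fdp in curve:
    print(f"p <= {thr}: sensitivity = {sens:.2f}, FDP = {fdp:.2f}")
```

Sweeping `p_thresholds` over a fine grid and plotting sensitivity against FDP gives one curve per workflow, which is how panels of this kind are usually overlaid.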
| Subject area | |
| More specific subject area | |
| Type of data | |
| How data was acquired | |
| Data format | |
| Experimental factors | |
| Experimental features | |
| Data source location | |
| Data accessibility | |