Mathias Walzer, Lucia Espona Pernas, Sara Nasso, Wout Bittremieux, Sven Nahnsen, Pieter Kelchtermans, Peter Pichler, Henk W P van den Toorn, An Staes, Jonathan Vandenbussche, Michael Mazanek, Thomas Taus, Richard A Scheltema, Christian D Kelstrup, Laurent Gatto, Bas van Breukelen, Stephan Aiche, Dirk Valkenborg, Kris Laukens, Kathryn S Lilley, Jesper V Olsen, Albert J R Heck, Karl Mechtler, Ruedi Aebersold, Kris Gevaert, Juan Antonio Vizcaíno, Henning Hermjakob, Oliver Kohlbacher, Lennart Martens.
Abstract
Quality control is increasingly recognized as a crucial aspect of mass spectrometry-based proteomics. Several recent papers discuss relevant parameters for quality control and present applications to extract these from the instrumental raw data. What has been missing, however, is a standard data exchange format for reporting these performance metrics. We therefore developed the qcML format, an XML-based standard that follows the design principles of the related mzML, mzIdentML, mzQuantML, and TraML standards from the HUPO-PSI (Proteomics Standards Initiative). In addition to the XML format, we also provide tools for the calculation of a wide range of quality metrics as well as a database format and interconversion tools, so that existing LIMS systems can easily add relational storage of the quality control data to their existing schema. We here describe the qcML specification, along with possible use cases and an illustrative example of the subsequent analysis possibilities. All information about qcML is available at http://code.google.com/p/qcml.
Year: 2014 PMID: 24760958 PMCID: PMC4125725 DOI: 10.1074/mcp.M113.035907
Source DB: PubMed Journal: Mol Cell Proteomics ISSN: 1535-9476 Impact factor: 5.911
Fig. 1. An overview of the role of qcML. Experimental data are fed into performance analysis tools that calculate the values of quality metrics. Those tools output qcML files, which in turn can be converted to a database format for storage or processed further with quality control tools. The data in qcML can also be converted to an easily viewable quality report.
Fig. 2. The backbone of the XML schema. This schema specifies the encapsulation of data in a qcML file. The full XML schema and ER schema of qcDB are available in the Supplementary Information.
Example of a controlled vocabulary (cv) term and its implementation as a quality parameter in qcML XML. Each cv term has an id, a name, and a definition. Additionally, it may have relational references to other cv terms if it is hierarchically embedded, e.g. “total number of PSM” has the relation ‘a part of’ to the term “MS identification result details.” (a) An example term in the controlled vocabulary describing the number of assigned peptide-to-spectrum matches for a certain run. It is defined as both a “spectrum identification detail” and a “MS identification detail” through “is a” relationships. (b) An example use of the cv term from (a) in a quality parameter in a qcML file. Each quality parameter can be assigned a cv term that defines its associated data and puts them into context. These associated data can, for instance, consist of a value attribute, but they can also take the form of an attachment containing a plot or tabular data.
| Panel | Content |
|---|---|
| a) | [Term] |
| | id: QC:0000029 |
| | name: total number of PSM |
| | def: "This number indicates the number of spectra that were given peptide annotations." [PXS:QC] |
| | is_a: MS:1001405 ! spectrum identification result details |
| | is_a: QC:0000025 ! MS identification result details |
| b) | <qualityParameter name="total number of PSM" ID="20100219_SvNa_SA_Ecoli_PP_psms" cvRef="QC" accession="QC:0000029" value="12370"/> |
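To illustrate how a consumer might read the quality parameter from panel (b), the following is a minimal sketch using Python's standard `xml.etree.ElementTree`. It parses only the single element shown above, written with plain ASCII quotes. The function name `read_quality_parameter` is hypothetical, and a complete qcML file wraps such elements in further structure (and typically XML namespaces) that this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# The quality parameter from panel (b), with plain ASCII quotes.
SNIPPET = (
    '<qualityParameter name="total number of PSM" '
    'ID="20100219_SvNa_SA_Ecoli_PP_psms" cvRef="QC" '
    'accession="QC:0000029" value="12370"/>'
)


def read_quality_parameter(xml_text):
    """Return cv accession, name, and integer value of one qualityParameter.

    Hypothetical helper for illustration; it assumes the value attribute
    holds an integer, which is true for this count metric only.
    """
    element = ET.fromstring(xml_text)
    return {
        "accession": element.get("accession"),
        "cvRef": element.get("cvRef"),
        "name": element.get("name"),
        "value": int(element.get("value")),
    }


param = read_quality_parameter(SNIPPET)
print(param["accession"], param["name"], param["value"])
```

The accession attribute, rather than the free-text name, is what ties the value back to the controlled vocabulary term in panel (a).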
Fig. 3. Simple QC workflow as implemented in KNIME. An input mzML file is first preprocessed (feature finding/identification with standard parameters), allowing the QCCalculator to subsequently create a basic qcML file. On top of this, the ID ratio (recorded versus identified MS2 on m/z over RT), the mass accuracy (ppm error histogram), the fractional mass (experimentally recorded versus theoretically expected on fractional mass over nominal mass), and the TIC are all plotted. Finally, verbose or redundant attachments, such as the source data for generated plots, are removed to obtain a slim report file. More examples can be found in the supplementary information.
List of the basic quality parameters that the QCCalculator program uses to create a basic qcML file. Parameters that are included for completeness but are not actually metrics (like filename) are written in italics. For an overview of the existing qcML metrics, see the supplementary material.
| Quality parameter/metric | Description | CV accession |
|---|---|---|
| MS1 spectra count | Number of MS1 spectra | QC:0000006 |
| MS2 spectra count | Number of MS2 spectra | QC:0000007 |
| Chromatogram count | Number of chromatograms | QC:0000008 |
| Total number of missed cleavages | Number of missed cleavages | QC:0000037 |
| Total number of identified proteins | Number of identified proteins | QC:0000032 |
| Total number of uniquely identified proteins | Number of unique proteins | QC:0000033 |
| Total number of PSMs | Number of PSMs | QC:0000029 |
| Total number of identified peptides | Number of identified peptides | QC:0000030 |
| Total number of uniquely identified peptides | Number of unique peptides | QC:0000031 |
| Mean delta ppm | Mean of ppm error | QC:0000040 |
| Median delta ppm | Median of ppm error | QC:0000041 |
| Id ratio | Ratio of recorded vs. identified MS2 plotted on m/z over RT | QC:0000035 |
| Number of features | Number of detected features | QC:0000046 |
| MZ acquisition ranges | Value range limitations used for acquisition | QC:0000009 |
| RT acquisition ranges | Value range limitations used for acquisition | QC:0000012 |
| Id settings | The settings of the search engine used: engine name and further parameters. | QC:0000026 |
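As a worked illustration of the "mean delta ppm" and "median delta ppm" rows, the sketch below computes both from a handful of observed/theoretical m/z pairs. It assumes the conventional definition of the ppm error, (observed − theoretical) / theoretical × 10^6; the table itself does not spell out the formula, and the m/z pairs and the function name `ppm_error` are hypothetical.

```python
from statistics import mean, median


def ppm_error(observed_mz, theoretical_mz):
    """Relative mass error in parts per million (conventional definition)."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


# Hypothetical (observed, theoretical) precursor m/z pairs for identified spectra.
pairs = [
    (500.0005, 500.0),
    (1000.002, 1000.0),
    (750.0, 750.00075),
]

errors = [ppm_error(observed, theoretical) for observed, theoretical in pairs]
mean_delta_ppm = mean(errors)
median_delta_ppm = median(errors)

print(f"mean delta ppm:   {mean_delta_ppm:.3f}")
print(f"median delta ppm: {median_delta_ppm:.3f}")
```

The median is less sensitive than the mean to a few badly matched spectra, which is one reason a report might carry both values.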
Fig. 4. The median m/z value and the ratio of +2 charged features versus all detected features on the MS1 level for a set of 666 experiments performed on a large variety of samples over time (see main text for a summary) using the same Thermo Scientific LTQ Orbitrap Velos instrument. The data points are colored by the type of experimental protocol.
Fig. 5. Outliers could be identified for the median ppm error metric in a subset of the experimental data set from Fig. 4. The subset contains 57 tryptic digests of enolase, all analyzed using the same experimental conditions with the exception of LC column replacements. The spectra were identified using X!Tandem (33) searches against the whole SwissProt database and were filtered at a 1% false discovery rate.
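The caption does not state how outliers were determined. Purely as an illustration of how a per-run metric such as the median ppm error could be screened, the sketch below flags runs using the modified z-score based on the median absolute deviation (MAD). This is a swapped-in, generic technique and not necessarily the authors' method; the function name `mad_outliers`, the threshold of 3.5, and the per-run values are all assumptions.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold.

    Modified z-score: 0.6745 * |x - median| / MAD. The 3.5 cutoff is a
    commonly used default, assumed here for illustration.
    """
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # All deviations are (nearly) identical; no robust scale to compare against.
        return []
    return [
        index
        for index, v in enumerate(values)
        if 0.6745 * abs(v - center) / mad > threshold
    ]


# Hypothetical median ppm errors for seven runs; the sixth run drifts.
median_ppm_per_run = [1.1, 0.9, 1.0, 1.2, 0.8, 6.5, 1.0]
flagged = mad_outliers(median_ppm_per_run)
print("flagged run indices:", flagged)
```

A median-based score is used here because a single drifting run would inflate a standard deviation and could mask itself.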