Fawaz Ghali, Ritesh Krishna, Pieter Lukasse, Salvador Martínez-Bartolomé, Florian Reisinger, Henning Hermjakob, Juan Antonio Vizcaíno, Andrew R Jones.
Abstract
The Proteomics Standards Initiative has recently released the mzIdentML data standard for representing peptide and protein identification results, for example, created by a search engine. When a new standard format is produced, it is important that software tools are available that make it straightforward for laboratory scientists to use it routinely and for bioinformaticians to embed support in their own tools. Here we report the release of several open-source Java-based software packages based on mzIdentML: ProteoIDViewer, mzidLibrary, and mzidValidator. The ProteoIDViewer is a desktop application allowing users to visualize mzIdentML-formatted results originating from any appropriate identification software; it supports visualization of all the features of the mzIdentML format. The mzidLibrary is a software library containing routines for importing data from external search engines, post-processing identification data (such as false discovery rate calculations), combining results from multiple search engines, performing protein inference, setting identification thresholds, and exporting results from mzIdentML to plain text files. The mzidValidator is able to process files and report warnings or errors if files are not correctly formatted or contain some semantic error. We anticipate that these developments will simplify adoption of the new standard in proteomics laboratories and the integration of mzIdentML into other software tools. All three tools are freely available in the public domain.
Year: 2013 PMID: 23813117 PMCID: PMC3820921 DOI: 10.1074/mcp.O113.029777
Source DB: PubMed Journal: Mol Cell Proteomics ISSN: 1535-9476 Impact factor: 5.911
Fig. 1.The pipeline of mzidLibrary tools constructed for testing different routines and performing benchmarking using the iPRG 2008 data set.
Fig. 2. Screenshots of ProteoIDViewer. A, the “Protein View” panel with protein groups, individual accessions, and further details about each protein identified. B, the “Peptide View” panel, showing a listing of all PSMs, including spectrum visualization and annotation of fragment products identified (if present in the file). C, the “Global Statistics” panel, showing graphs plotting statistics estimated from decoy database searching.
The routines contained within the mzidLibrary release reported here. All routines must specify the input and output files and have an additional option for compressing the output using the gzip protocol. Most routines have an additional “verbose” mode for debugging output or providing more detail about the run.
| Tool | Description | Parameters |
|---|---|---|
| InsertMetaDataFromFasta | Extracts protein sequences and description lines from a FASTA file and inserts them into an mzIdentML file | Location of FASTA file |
| | | Regular expression (regex) to split the accession from the description line |
| FalseDiscoveryRate | Calculates FDR, q-value, and FDRScore | Optional regex for decoy hits (if the isDecoy attribute is not set in the file) |
| | | Ratio of targets to decoys |
| | | Accession of CV term for score in file to use for ordering |
| | | Scores are ordered low to high |
| CombineSearchEngines | Re-scores and combines PSMs from two or three search engines to produce a single output | Regex for decoy hits (as above) |
| | | Ratio of targets to decoys |
| | | Locations of input files and identifiers for the type of search engine |
| Threshold | Sets the passThreshold attribute on PSMs or proteins, according to the entered value; optionally, it can remove PSMs that fall below the threshold (see main text) | Threshold is for PSMs or proteins |
| | | Score type to be used for setting the threshold |
| | | Accession of CV term for score in the file to use for thresholding |
| | | Scores are ordered low to high |
| Omssa2mzid | Converts OMSSA OMX format to mzIdentML | Include fragmentation products |
| | | Regex for decoy hits |
| Csv2mzid | Converts results in OMSSA CSV format to mzIdentML | Regex for decoy hits |
| | | Accession of CV term for score in file to use for ordering (in case the CSV file is not from OMSSA) |
| | | Location of the file containing search metadata |
| Tandem2mzid | Converts Tandem XML format to mzIdentML | Include fragmentation products |
| | | Regex for decoy hits |
| | | Whether identifiers start from 0 (mzML file searched) or from 1 (other peak lists were searched) |
| | | Options for capturing additional metadata difficult to parse from file (database file format, peak list file format) |
| Mzid2Csv | Exports from mzIdentML format to various CSV formats | Type of export to perform: one row per PSM (no protein information), proteins with details of all PSMs, one row per protein or one row per protein group |
| AddEmpaiToMzid | Calculates pseudo-quantitative abundance values, based on spectral counting (see main text) | Location of FASTA file |
| | | Regular expression to split the accession from the description line |
| ProteoGrouper | Performs protein inference from the PSMs (see main text) | Use only PSMs with ‘passThreshold’ = true |
| | | Accession of CV term for score in file to use for ordering |
| | | Scores are ordered low to high |
| | | If the score should be log transformed |
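
The FalseDiscoveryRate and Threshold rows in the table describe a common target-decoy workflow: rank PSMs by score, estimate FDR from the decoy count, derive q-values, then flag PSMs that pass a cut-off. The Python sketch below illustrates that general idea only. It is not the mzidLibrary implementation (which is written in Java) and does not compute FDRScore. The `PSM` class, the function names, and the reading of "scores are ordered low to high" as "lower scores are better" are all assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass
class PSM:
    """Hypothetical, minimal stand-in for a peptide-spectrum match."""
    score: float          # search-engine score, e.g. an e-value
    is_decoy: bool        # True if the hit maps to a decoy sequence
    fdr: float = 0.0
    q_value: float = 0.0
    pass_threshold: bool = False


def compute_fdr(psms, low_to_high=True, target_decoy_ratio=1.0):
    """Rank PSMs best-first and fill in fdr and q_value.

    Assumption: low_to_high=True means a lower score is a better match.
    Assumption: target_decoy_ratio scales the decoy count to the size
    of the target database when the two differ.
    """
    ranked = sorted(psms, key=lambda p: p.score, reverse=not low_to_high)
    targets = decoys = 0
    for p in ranked:
        if p.is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated false targets divided by accepted targets, capped at 1.
        p.fdr = min(1.0, decoys * target_decoy_ratio / targets) if targets else 1.0
    # q-value: the lowest FDR reachable at this rank or any worse rank.
    running_min = 1.0
    for p in reversed(ranked):
        running_min = min(running_min, p.fdr)
        p.q_value = running_min
    return ranked


def apply_threshold(psms, max_q_value):
    """Set pass_threshold on every PSM and return those that pass."""
    for p in psms:
        p.pass_threshold = p.q_value <= max_q_value
    return [p for p in psms if p.pass_threshold]
```

Calling `apply_threshold(compute_fdr(psms), 0.01)` on a list of `PSM` objects would flag the PSMs with an estimated q-value of at most 1%. Taking the running minimum from the worst rank upward makes the q-value monotonic in rank, which the raw FDR estimate is not.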
Fig. 3.A pseudo–receiver operating characteristic (ROC) plot showing the results from ProteoGrouper for post-processing the iPRG 2008 data sets searched in Mascot, OMSSA, and X!Tandem combined (triangles) and in Mascot alone (X) compared with results from the other research group members and study participants. The results produced by Mascot with a new “protein family” algorithm (30) are represented by the square for comparison.