Shengjie Wang, John T Halloran, Jeff A Bilmes, William S Noble.
Abstract
Tandem mass spectrometry (MS/MS) is the dominant high-throughput technology for identifying and quantifying proteins in complex biological samples. Analysis of the tens of thousands of fragmentation spectra produced by an MS/MS experiment begins by assigning to each observed spectrum the peptide hypothesized to have generated it. This assignment is typically done by searching each spectrum against a database of peptides. To our knowledge, all existing MS/MS search engines compute scores individually between a given observed spectrum and each candidate peptide from the database. In this work, we use a trellis, a data structure capable of jointly representing a large set of candidate peptides, to avoid recomputing sub-computations that are common to different candidates. We show how trellises may be used to significantly speed up existing scoring algorithms, and we theoretically quantify the expected speedup afforded by trellises. Furthermore, we demonstrate that compact trellis representations of whole sets of peptides enable efficient discriminative learning of a dynamic Bayesian network for spectrum identification, leading to greatly improved spectrum identification accuracy.
CONTACT: bilmes@uw.edu or william-noble@uw.edu
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Year: 2016 PMID: 27307634 PMCID: PMC4908353 DOI: 10.1093/bioinformatics/btw269
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. (A) A trellis encoding ‘seattle’, ‘seafood’, ‘kungfu’ and ‘tofu’. The vertices labeled n are trellis nodes, and every arrow corresponds to a trellis link, e.g. ‘sea’. (B) An example of a simple trellis for MS/MS scoring functions, consisting of the theoretical peaks (discretized b/y-ions) for three peptides: ‘ELAK’, ‘EALK’ and ‘EAKK’. Every edge corresponds to the m/z value of a fragment ion rounded to the nearest integer. The three colored paths from the source node to the sink node correspond to the three peptides. The observed spectrum is charge +2 with low-resolution fragment ions. (C) Trellis construction algorithm that takes as input a set of strings and the corresponding alphabet Σ and returns a trellis representation of that set. (D) Sample trellis construction for the strings ‘ac’, ‘ad’, ‘bc’ and ‘bd’.
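The idea behind panels (C) and (D) can be illustrated with a small sketch. The code below is not the construction algorithm from the paper: it substitutes a standard technique (build a prefix tree, then merge nodes whose outgoing structure is identical) and uses single-character links rather than multi-character links such as ‘sea’. The names `Node`, `build_trie`, `merge_suffixes`, `build_trellis`, `count_nodes_and_links` and `enumerate_strings` are hypothetical and chosen only for this illustration.

```python
from typing import Dict, Iterable, Iterator, Tuple


class Node:
    """A graph node; `children` maps a link label to the next node."""

    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        self.terminal = False  # True if some input string ends here


def build_trie(strings: Iterable[str]) -> Node:
    """Plain prefix tree: shares common prefixes only."""
    root = Node()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, Node())
        node.terminal = True
    return root


def merge_suffixes(node: Node, registry: dict) -> Node:
    """Bottom-up pass that replaces structurally identical nodes by one
    canonical node, so common suffixes are shared as well."""
    for ch, child in list(node.children.items()):
        node.children[ch] = merge_suffixes(child, registry)
    # Two nodes are equivalent if they agree on `terminal` and on every
    # (label, canonical child) pair.
    key = (node.terminal,
           tuple(sorted((ch, id(c)) for ch, c in node.children.items())))
    return registry.setdefault(key, node)


def build_trellis(strings: Iterable[str]) -> Node:
    """Assumed stand-in for a trellis: a trie with merged suffixes."""
    return merge_suffixes(build_trie(strings), {})


def count_nodes_and_links(root: Node) -> Tuple[int, int]:
    """Count distinct nodes and links reachable from `root`."""
    seen, stack, links = set(), [root], 0
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        links += len(node.children)
        stack.extend(node.children.values())
    return len(seen), links


def enumerate_strings(node: Node, prefix: str = "") -> Iterator[str]:
    """Yield every string encoded by a source-to-terminal path."""
    if node.terminal:
        yield prefix
    for ch, child in sorted(node.children.items()):
        yield from enumerate_strings(child, prefix + ch)


strings = ["ac", "ad", "bc", "bd"]  # the strings from panel (D)
print(count_nodes_and_links(build_trie(strings)))     # (7, 6)
print(count_nodes_and_links(build_trellis(strings)))  # (3, 4)
```

For the four strings of panel (D), the prefix tree needs 7 nodes and 6 links, while the merged graph needs only 3 nodes and 4 links and still encodes exactly the same set of strings.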
Fig. 2. DBN for traversing a trellis. L corresponds to the set of links being traversed, which contains the data to access. The value of L is determined by the previous node, the current node V and the transition Δ.
Fig. 3. (A) The graphical model representation of linear MS/MS scoring functions. (B) DRIP template. Shaded vertices are observed variables, while unshaded vertices are hidden variables. Black edges correspond to deterministic functions of parent variables, red edges correspond to conditional Gaussian distributions and blue edges represent switching parents.
Fig. 4. (A) The DRIP trellis model. The trellis DBN (Trellis Part) is attached to the DRIP DBN (DRIP Part): it takes δ from DRIP as input (green cone), which controls the traversal of theoretical peaks, and outputs L for DRIP to score (pink cone), which holds the m/z values of the theoretical peaks. The DRIP DBN structure (green background) is unchanged from Fig. 3B. (B) The graphical model representation of linear MS/MS scoring functions combined with the trellis structure. The Trellis Part (pink background) is attached to the linear MS/MS function graphical model (green background), which remains unchanged.
Fig. 5. (A) Running time of XCorr trellis, DRIP trellis_base and DRIP trellis_speed as a percentage of the running time of the original model without a trellis. (B) Discriminative training improves performance for the worm dataset. (C) Similar to (B), but for the yeast dataset.
Notation used in this article
| Symbol | Function |
|---|---|
| | observed spectrum of length |
| | index along m/z axis for DGM expansion, |
| | DGM frame unrolling amount and number of peaks in observed spectrum |
| | precursor mass, precursor mass of spectrum |
| | precursor charge, precursor charge of spectrum |
| | number of bins of m/z axis quantization (i.e. typically 2000 for low-resolution data) |
| | binned and processed observed spectrum of length |
| | difference observed spectrum (used in XCorr) |
| | a peptide |
| | length of |
| | number of theoretical peaks in some peptide |
| | binned sparse theoretical vector of length |
| | length |
| | set of strings |
| | vector of peaks indices of |
| | subsequence of |
| | vector of peaks types of |
| | trellis nodes |
| | trellis links |
| Δ | trellis DGM transition variable |
| V | trellis DGM node variable |
| L | trellis DGM link variable |
| | observed child variable for use in a DGM (i.e. |
Note: x and its variations denote the observed spectrum for different methods/contexts. y and its variations denote the theoretical spectrum for different methods/contexts.
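To make the binned quantities in the notation table concrete, here is a minimal sketch of a linear score, assuming it is a dot product between a binned observed spectrum and a binary theoretical vector, which reduces to summing observed values at the theoretical peak indices. It also shows how partial sums over shared peak prefixes can be reused across candidates. Only prefixes are shared here, so this is a simplification of a trellis and not the method of the paper. All function names, the peak indices and the observed intensities are hypothetical values chosen by hand for illustration.

```python
from typing import Dict, List, Sequence, Tuple

NUM_BINS = 2000  # typical bin count for low-resolution data (see table)


def theoretical_vector(peak_indices: Sequence[int],
                       num_bins: int = NUM_BINS) -> List[float]:
    """Dense binary vector with a 1 at each theoretical peak bin."""
    vec = [0.0] * num_bins
    for i in peak_indices:
        vec[i] = 1.0
    return vec


def dense_score(observed: Sequence[float],
                theoretical: Sequence[float]) -> float:
    """Linear score as a full dot product over all bins."""
    return sum(o * t for o, t in zip(observed, theoretical))


def sparse_score(observed: Sequence[float],
                 peak_indices: Sequence[int]) -> float:
    """Same score using only the peak bins (assumes distinct indices)."""
    return sum(observed[i] for i in peak_indices)


def score_with_shared_prefixes(
    observed: Sequence[float],
    candidates: Dict[str, Sequence[int]],
) -> Tuple[Dict[str, float], int]:
    """Score all candidates, caching partial sums by peak prefix.

    Returns the scores and the number of additions actually performed.
    """
    cache: Dict[Tuple[int, ...], float] = {(): 0.0}
    additions = 0
    scores: Dict[str, float] = {}
    for name, peaks in candidates.items():
        prefix: Tuple[int, ...] = ()
        for p in peaks:
            extended = prefix + (p,)
            if extended not in cache:  # new work only for unseen prefixes
                cache[extended] = cache[prefix] + observed[p]
                additions += 1
            prefix = extended
        scores[name] = cache[prefix]
    return scores, additions


# Hypothetical observed spectrum: a few non-zero bins, made up by hand.
observed = [0.0] * NUM_BINS
observed[130], observed[201], observed[243], observed[314] = 1.0, 0.25, 0.5, 2.0

# Hypothetical integer peak indices for the three peptides of Fig. 1B.
candidates = {
    "ELAK": [130, 243, 314],
    "EALK": [130, 201, 314],
    "EAKK": [130, 201, 329],
}

scores, additions = score_with_shared_prefixes(observed, candidates)
print(scores)     # {'ELAK': 3.5, 'EALK': 3.25, 'EAKK': 1.25}
print(additions)  # 6, versus 9 when each candidate is scored on its own
```

Scoring the three candidates one at a time costs nine additions, while reusing shared prefixes costs six. A trellis that also merges common suffix structure can reduce the amount of repeated work further.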