
Bayesian Classification of Proteomics Biomarkers from Selected Reaction Monitoring Data using an Approximate Bayesian Computation-Markov Chain Monte Carlo Approach.

Kashyap Nagaraja, Ulisses Braga-Neto.

Abstract

Selected reaction monitoring (SRM) has become one of the main methods for low-mass-range-targeted proteomics by mass spectrometry (MS). However, in most SRM-MS biomarker validation studies, the sample size is very small, and in particular smaller than the number of proteins measured in the experiment. Moreover, the data can be noisy due to a low number of ions detected per peptide by the instrument. In this article, those issues are addressed by a model-based Bayesian method for classification of SRM-MS data. The methodology is likelihood-free, using approximate Bayesian computation implemented via a Markov chain Monte Carlo procedure and a kernel-based Optimal Bayesian Classifier. Extensive experimental results demonstrate that the proposed method outperforms classical methods such as linear discriminant analysis and 3NN, when sample size is small, dimensionality is large, the data are noisy, or a combination of these.


Keywords:  Markov chain Monte Carlo (MCMC); Optimal Bayesian Classifier (OBC); Proteomics; approximate Bayesian computation (ABC); biomarker; selected reaction monitoring (SRM)

Year:  2018        PMID: 30083051      PMCID: PMC6071182          DOI: 10.1177/1176935118786927

Source DB:  PubMed          Journal:  Cancer Inform        ISSN: 1176-9351


Introduction

Proteomics is the study of cellular behavior and human disease at the protein level. Recently, cancer treatment and prevention have made great strides, thanks to the development of high-throughput technologies in proteomics. Among these, mass spectrometry (MS) analysis has become the preferred choice because of advantages such as high molecular specificity and better detection sensitivity.[1] Hence, MS is widely used in identification and quantification of complex proteome mixtures with the goal of discovering biomarkers, ie, molecular markers for disease.[2-4] However, a major challenge in biomarker discovery is the identification of low-abundance proteins in peripheral blood. Selected reaction monitoring (SRM), conducted using a triple-quadrupole (QQQ) instrument, has an extended mass range and has become one of the main methods for low-mass-range-targeted proteomics by MS.[5] Nevertheless, in most SRM-MS biomarker validation studies, the sample size is very small due to the economic cost of the experiments and difficulty in recruiting cases. Typically, the number of features (measured proteins) is vastly larger than the sample size. Moreover, depending on the instrument sensitivity, the data can be noisy due to low peptide efficiency, ie, a low number of ions detected per peptide. All the aforementioned issues create a difficult challenge for classical data-driven classification methods. In this article, this is addressed by a model-based Bayesian method for classification of SRM-MS data.
We perform Bayesian inference of the parameters of the SRM model proposed in the work by Atashpaz-Gargari et al[5] and build a kernel classifier, similar to the classifier for liquid chromatography-mass spectrometry (LC-MS) data proposed in the work by Banerjee and Braga-Neto.[6] As in the latter reference, our method uses a likelihood-free approach, called approximate Bayesian computation (ABC),[7-9] which is necessary because the SRM model of Atashpaz-Gargari et al[5] is complex and does not have an analytical formulation of the likelihood. After calibration of the parameters, the ABC method is implemented via a Markov chain Monte Carlo (MCMC) procedure[10,11] to obtain a sample from the posterior distribution of the protein concentrations. Small MCMC sample sizes are sufficient to obtain a kernel-based implementation of the Optimal Bayesian Classifier (OBC).[12] Extensive experimental results examining the effect of various parameters demonstrate that the proposed method outperforms classical methods such as linear discriminant analysis (LDA) and 3NN,[13] when sample size is very small, dimensionality is large, the data are noisy, or a combination of these. The organization of the article is as follows. Section “SRM-based MS model” surveys the SRM-MS model. Section “ABC-MCMC classification algorithm” explains in detail the ABC rejection algorithm and the approximate Bayesian computation-Markov chain Monte Carlo (ABC-MCMC) classifier. Section “Numerical experiments and results” presents the numerical results. Section “Conclusions” presents concluding remarks.

SRM-Based MS Model

In this article, we employ the model for the SRM pipeline proposed in the work by Atashpaz-Gargari et al.[5] Next, we review briefly each of the main components of this model.

Protein mixture model

The protein mixture model concerns the true abundance of proteins in the SRM experiment. There are n samples in each class; for convenience, the 2 classes are labeled as 0 for control and 1 for treatment. There are p proteins, of which d are low-abundance candidates for biomarker validation. Protein identities are input as a FASTA file. As argued in previous works,[5,14] protein concentration can be modeled by a gamma distribution. Hence, the mean concentration of each protein is drawn from a Γ(k, θ) distribution, where k and θ are, respectively, shape and scale parameters. These are themselves uniform random variables: ka ∼ Unif(kalow, kahigh) and θa ∼ Unif(θalow, θahigh) for the abundant background proteins, and kc ∼ Unif(kclow, kchigh) and θc ∼ Unif(θclow, θchigh) for the low-abundance candidate proteins. The initial values of these variables, which are displayed in Table 1, reflect the dynamic range of protein abundance levels while taking into account that the candidate proteins are expressed at a much lower level than the background proteins. The initial values used here are consistent with values obtained experimentally in the work by Taniguchi et al[14] as well as the hyperparameter values used in the work by Atashpaz-Gargari et al.[5] Furthermore, these initial values are modified based on the data, as part of the prior calibration process described in Algorithm 1.
Table 1.

Parameters used in the experiment.

Parameter                   | Symbol  | Value/range
Instrument response factor  | κ       | 5
Noise severity              | α, β    | 0.03, 3.6
Peptide efficiency factor   | ei      | [0.1, 1]
Shape (gamma distribution)  | ka, kc  | Unif(1.6, 2.4), Unif(4, 6)
Scale (gamma distribution)  | θa, θc  | Unif(9e6, 11e6), Unif(90, 110)
Purification                | ηi      | 10^-6
Coefficient of variation    | ϕ       | Unif(0.3, 0.5)
Fold change                 | f       | Unif(1.5, 1.6)
Algorithm 1. Prior calibration of kc, ka, θc, θa, ϕ using ABC rejection sampling.
1. Generate Mcal quintuplets of parameters (ka(t), kc(t), θa(t), θc(t), ϕ(t)), with ka(t) ∼ Unif(kalow, kahigh), kc(t) ∼ Unif(kclow, kchigh), θa(t) ∼ Unif(θalow, θahigh), θc(t) ∼ Unif(θclow, θchigh), and ϕ(t) ∼ Unif(ϕlow, ϕhigh), for t = 1, 2, ..., Mcal.
2. Simulate a control sample set S0(t) of size n for each quintuplet of parameters, for t = 1, 2, ..., Mcal.
3. Accept the quintuplet (ka(t), kc(t), θa(t), θc(t), ϕ(t)) if ||T(S0(t)) − T(S0)|| < ε, for t = 1, 2, ..., Mcal. Here, ||·|| denotes the Euclidean norm and T denotes the vector sample mean.
4. Let B be the set of na accepted quintuplets.
5. The calibrated value of ka approximates the posterior mean, kacal = ∫ ka p(ka | Sn) dka ≈ (1/na) Σ ka(t), the sum being over the accepted quintuplets.
6. The other 4 parameters are calculated similarly.
Proteins are divided into biomarker (differentially expressed) and nonbiomarker (not differentially expressed) proteins. We use fold change to quantify the difference: the mean concentration of a biomarker protein in the treatment class equals its mean concentration in the control class multiplied by the fold change f, whereas f = 1 for nonbiomarker proteins. The fold change parameter is uniformly distributed in the interval [flow, fhigh]; the values used here are displayed in Table 1. While the gamma distribution is chosen for the mean protein concentrations, the variation of protein concentration across samples is modeled by a multivariate gaussian vector: the vector of protein concentrations in each class is gaussian with mean given by the vector of mean concentrations. Here, we consider a diagonal covariance matrix, so that the protein concentrations are mutually independent or very weakly correlated (correlation between proteins can be included at the cost of adding more parameters to the model), with the standard deviation of each protein equal to ϕ times its mean concentration, where ϕ is the coefficient of variation. The initial value of ϕ, displayed in Table 1, is the same as the one used in the work by Banerjee and Braga-Neto.[6] This value is modified based on the data, as part of the prior calibration process described in Algorithm 1. To model the purification process usually performed as part of the SRM-MS protocol, we select a set of high-abundance proteins to be removed (in fact, attenuated by the factor ηi) from the protein mixture. The value of ηi corresponds to the efficiency of the purification process and should be very small; the value assumed here is displayed in Table 1.
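The protein mixture model above can be sketched in a few lines of NumPy. This is an illustrative simulation under the parameter ranges of Table 1, not the authors' code; the function name, the placement of the d candidate proteins at the end of the concentration vector, and the per-sample gaussian draw are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_concentrations(n, p, d, rng):
    """Sketch of the protein mixture model: mean concentrations are
    gamma-distributed, per-sample variation is gaussian with coefficient
    of variation phi, and the d candidate (biomarker) proteins, placed
    last, are shifted by a fold change in the treatment class."""
    # Shape/scale hyperparameters (Table 1): background (a) vs candidate (c)
    k_a, theta_a = rng.uniform(1.6, 2.4), rng.uniform(9e6, 11e6)
    k_c, theta_c = rng.uniform(4.0, 6.0), rng.uniform(90.0, 110.0)
    phi = rng.uniform(0.3, 0.5)          # coefficient of variation
    f = rng.uniform(1.5, 1.6, size=d)    # fold change for the d biomarkers

    # Mean concentrations: p-d abundant background proteins, d candidates
    lam0 = np.concatenate([rng.gamma(k_a, theta_a, size=p - d),
                           rng.gamma(k_c, theta_c, size=d)])
    lam1 = lam0.copy()
    lam1[p - d:] *= f                    # treatment means shifted by fold change

    # Per-sample gaussian variation, diagonal covariance: sigma_i = phi*lam_i
    S0 = rng.normal(lam0, phi * lam0, size=(n, p))
    S1 = rng.normal(lam1, phi * lam1, size=(n, p))
    return S0, S1

S0, S1 = simulate_concentrations(n=8, p=50, d=5, rng=rng)
```

With the Table 1 ranges, the background means land near 2e7 while the candidates land near 500, reproducing the stated abundance gap.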

Peptide mixture model

In SRM-MS, tryptic digestion of proteins is performed to generate small-mass peptides. Let Pi be the set of all the proteins that contain the ith peptide. The readout abundance of the ith peptide is obtained from the abundances of the proteins in Pi, scaled by the peptide efficiency factor ei and the LC-MS response factor. However, the true peptide abundance differs from its readout value due to noise: the readout includes additive gaussian noise, whose variance has a quadratic dependence on the signal and is governed by the noise severity parameters α and β, as well as additive exponential noise, governed by a fixed constant, which is introduced by transition effects. The next step is called protein abundance roll-up. This is the process of obtaining the abundances of the parent proteins from the abundances and related characteristics of their child peptides, detected during the MS1 process. To obtain the identities of the parent proteins, a second round of MS, called MS/MS, is often used and available databases of identities are searched. Here, we assume that the rolled-up abundances can be obtained: the readout of each protein in a sample is an average over its child peptides, scaled by the instrument response factor κ. These rolled-up protein abundance data are then used for classification.
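The noise portion of the peptide model can be illustrated as follows. Because the exact formulas did not survive extraction from the record, this sketch assumes one plausible quadratic form for the gaussian noise variance, var(s) = α·s² + β, using the severity values of Table 1, and a hypothetical rate constant for the exponential transition noise; only the qualitative structure (efficiency-scaled signal plus gaussian plus exponential noise) is taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_readout(s, e=0.5, alpha=0.03, beta=3.6, rate=1.0, rng=rng):
    """Hypothetical sketch of the peptide readout noise. The article states
    that the additive gaussian noise variance depends quadratically on the
    signal and that the transition noise is exponential with a fixed constant;
    var = alpha*signal**2 + beta and the exponential rate are assumed forms."""
    signal = e * s                              # peptide efficiency factor e
    gauss = rng.normal(0.0, np.sqrt(alpha * signal**2 + beta))
    transition = rng.exponential(1.0 / rate)    # assumed transition-noise scale
    return signal + gauss + transition
```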

ABC-MCMC Classification Algorithm

As described in the introduction section, the algorithm has 3 main steps: prior calibration via ABC rejection sampling, posterior sampling using an ABC-MCMC algorithm, and classification using a kernel-based method. We describe each of these steps below.

Prior calibration via ABC rejection sampling

Once the rolled-up protein abundances described in the previous section are obtained, the total number of proteins is reduced via a feature selection algorithm. As per the equations in the previous section, the protein abundance profiles are a function of the baseline parameters, the prior hyperparameters (kc, ka, θc, θa, ϕ), and the instrument parameters (κ, α, β, ei, ηi). Prior calibration via ABC rejection sampling is as described in Algorithm 1: candidate parameter sets are drawn by Monte Carlo sampling, and each set is kept or rejected by comparing a summary statistic of the data it generates with that of the observed data against a threshold. All accepted quintuplets are then averaged to obtain the calibrated parameters. In this algorithm, ε is the error tolerance. It must be chosen carefully: it should not be so large that bad samples are accepted, nor so small that almost no samples are accepted. Once the calibrated parameters are obtained, the fold change vector is estimated by a sample mean estimate, with the lth component computed from the sample means of the lth selected protein in the two classes.
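Algorithm 1 amounts to standard ABC rejection sampling followed by averaging of the accepted draws. A minimal sketch, assuming a user-supplied simulate(θ) that generates a control sample set of the same shape as the observed data; the function and argument names are illustrative, not the authors' code:

```python
import numpy as np

def abc_rejection_calibration(S0_obs, simulate, bounds, M_cal, eps, rng):
    """ABC rejection sketch of Algorithm 1. `bounds` maps each parameter
    name to its (low, high) uniform prior interval; a draw is accepted when
    the simulated summary statistic (vector sample mean) falls within eps
    of the observed one, and accepted draws are averaged."""
    T_obs = S0_obs.mean(axis=0)              # summary statistic T: sample mean
    accepted = []
    for _ in range(M_cal):
        # Draw one parameter set from the uniform priors
        theta = {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
        T_sim = simulate(theta).mean(axis=0)
        if np.linalg.norm(T_sim - T_obs) < eps:
            accepted.append(theta)
    # Calibrated value = average of accepted draws (Monte Carlo posterior mean)
    return {name: np.mean([a[name] for a in accepted]) for name in bounds}

# Toy usage: the simulator's data mean equals the parameter mu,
# so the calibrated mu should recover the observed mean (1.0)
rng = np.random.default_rng(3)
S0_obs = np.ones((5, 3))
calib = abc_rejection_calibration(
    S0_obs,
    simulate=lambda th: np.full((5, 3), th["mu"]),
    bounds={"mu": (0.0, 2.0)}, M_cal=2000, eps=0.2, rng=rng)
```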

ABC-MCMC posterior sampling

ABC-MCMC sampling is as described in Algorithm 2. The mean concentration vector γ is sampled from its posterior distribution. After a burn-in period of ts iterations of the Markov chain, the next M samples, from iteration ts+1 to ts+M, are taken as the generated data. Proper selection of the thresholds ε0 and ε1 in step 4 of Algorithm 2 plays a very important role in the performance of the ABC-MCMC algorithm.
Algorithm 2. Obtain the posterior samples of γ using the ABC-MCMC algorithm.
1. Generate the mean vector γ(0) = (γ1, γ2, ..., γd) from the Γ distribution with the calibrated parameters obtained in Algorithm 1. For t = 0, 1, ..., ts, ts+1, ..., ts+M, where ts is the burn-in period, do:
2. Generate the proposal γ(t+1) = ColMeans(S0(t)), where ColMeans calculates the mean feature (protein) wise.
3. Simulate control and treatment sample sets S0(t+1) and S1(t+1), each of size n, using γ(t+1) and γ(t+1)·fcal, respectively.
4. Set q = 1 if ||T(S0(t+1)) − T(S0)|| < ε0 and ||T(S1(t+1)) − T(S1)|| < ε1, and q = 0 otherwise.
5. If q = 1, accept γ(t+1); else set γ(t+1) = γ(t).
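Algorithm 2 can be sketched as follows, assuming a simulate(γ, n) routine that draws a sample set of size n with per-protein means γ. The scalar fold change f_cal and all names are illustrative assumptions; the accept/reject rule follows the algorithm's two-threshold comparison of vector sample means.

```python
import numpy as np

def abc_mcmc(S0_obs, S1_obs, gamma0, simulate, f_cal, t_burn, M, eps0, eps1):
    """ABC-MCMC sketch of Algorithm 2: propose gamma(t+1) as the column
    means of a simulated control set, accept when both class summaries are
    within their thresholds, and discard the burn-in portion of the chain."""
    T0, T1 = S0_obs.mean(axis=0), S1_obs.mean(axis=0)
    n = S0_obs.shape[0]
    gamma = np.asarray(gamma0, dtype=float)
    chain = []
    for t in range(t_burn + M):
        proposal = simulate(gamma, n).mean(axis=0)     # gamma(t+1) = ColMeans(S0(t))
        S0_new = simulate(proposal, n)
        S1_new = simulate(proposal * f_cal, n)         # treatment means via fold change
        if (np.linalg.norm(S0_new.mean(axis=0) - T0) < eps0 and
                np.linalg.norm(S1_new.mean(axis=0) - T1) < eps1):
            gamma = proposal                           # accept; else keep gamma(t)
        chain.append(gamma)
    return np.array(chain[t_burn:])                    # posterior samples after burn-in

# Toy usage: gaussian simulator around gamma; truth is [1, 2] with f_cal = 1.5
rng = np.random.default_rng(4)
def toy_simulate(g, n):
    return g + rng.normal(0.0, 0.05, size=(n, len(g)))
S0_obs = toy_simulate(np.array([1.0, 2.0]), 20)
S1_obs = toy_simulate(np.array([1.5, 3.0]), 20)
chain = abc_mcmc(S0_obs, S1_obs, np.array([1.0, 2.0]), toy_simulate,
                 f_cal=1.5, t_burn=10, M=50, eps0=0.5, eps1=0.5)
```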

Kernel-based classification

We employ the kernel-based scheme proposed in the work by Banerjee and Braga-Neto,[6] which is itself based on the OBC of Dalton and Dougherty.[12] One of the issues with kernel-based classification is choosing the right value of the kernel bandwidth parameter. If the bandwidth is too large, the density estimate is oversmoothed, hiding many details of the data distribution; if it is too small, the estimate is undersmoothed, so that spurious noisy features of the data are not smoothed out. To address this, we employ an ensemble method, in which classifiers with different bandwidth parameters are obtained and combined by majority vote. The classification algorithm is described in detail in Algorithm 3.
Algorithm 3. Using the ABC-MCMC-based posterior samples for classification.
1. Choose a set of kernel bandwidth parameters h = (h1, h2, ..., hf), where f is the number of bandwidth values taken.
2. Choose the number q of γ samples from the Markov chain to be used in the kernel classifier. It is advisable to choose the samples from the end of the chain; here, iterations ts+M−q to ts+M.
3. Choose a suitable kernel K for the analysis. In this article, we have chosen a zero-mean, unit-variance gaussian kernel.
4. For a given test point x, initialize a result vector res_vec of f zeros. For i = 1, ..., f, set res_vec[i] = 1 if
   c · Σ_{t=ts+M−q}^{ts+M} Σ_{j=1}^{n} K((x − x(j))/hi) ≤ (1 − c) · Σ_{t=ts+M−q}^{ts+M} Σ_{j=n+1}^{2n} K((x − x(j))/hi),
   where c is the prior probability of class 0 and samples j = 1, ..., n (respectively, j = n+1, ..., 2n) are simulated from class 0 (respectively, class 1); otherwise, set res_vec[i] = 0.
5. The kernel-based classifier is then given by Ψ(x) = 1 if sum(res_vec) ≥ (f + 1)/2, and Ψ(x) = 0 otherwise. (15)
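Steps 4 and 5 of Algorithm 3 can be sketched as below. The comparison direction in step 4 is partly garbled in the record, so this sketch assumes a vote for class 1 whenever the prior-weighted class-1 kernel sum is at least the class-0 one; samples0 and samples1 stand for the posterior-simulated points of each class pooled across the q retained MCMC iterations, and all names are illustrative.

```python
import numpy as np

def gauss_kernel_sum(x, pts, h):
    """Sum of zero-mean, unit-variance gaussian kernel evaluations of the
    test point x against a pooled set of class sample points pts."""
    z = (np.asarray(x) - pts) / h
    return float(np.exp(-0.5 * np.sum(z**2, axis=1)).sum())

def kernel_obc_predict(x, samples0, samples1, bandwidths, c=0.5):
    """Ensemble kernel classifier sketch: one vote per bandwidth, then
    majority vote (c is the prior probability of class 0)."""
    votes = sum(
        1 for h in bandwidths
        if (1 - c) * gauss_kernel_sum(x, samples1, h)
           >= c * gauss_kernel_sum(x, samples0, h))
    return 1 if votes >= (len(bandwidths) + 1) / 2 else 0

# Toy usage: two well-separated clusters standing in for the posterior samples
rng = np.random.default_rng(5)
samples0 = rng.normal([0.0, 0.0], 0.2, size=(30, 2))
samples1 = rng.normal([5.0, 5.0], 0.2, size=(30, 2))
```

Varying the bandwidth set trades oversmoothing against undersmoothing exactly as discussed above; the majority vote damps the sensitivity to any single choice.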

Numerical Experiments and Results

In this section, we demonstrate the application of the proposed ABC-MCMC classification algorithm for SRM data, using a synthetic data set generated from a subset of the human proteome. We selected a list of proteins from DrugBank and applied tryptic digestion in silico using the OpenMS software.[15] Because our interest is in small sample sizes, we chose simple classification rules, which are known to perform well with small samples, for comparison: LDA and k-nearest neighbor (KNN) with k = 3. Synthetic SRM-MS data were generated by the model described in section "SRM-based MS model," using the parameters in Table 1. Synthetic sample data for prior calibration were generated using the midpoint of the intervals specified in Table 1; for example, as ϕ ∼ Unif(0.3, 0.5), we take 0.4 as the initial value. For the MCMC procedure, we consider 10 000 samples from the posterior distribution of γ, after a burn-in stage of around 3000 iterations. The value of the prior probability c was taken to be 0.5 (equally likely classes). Kernel density estimation is based on 15 MCMC samples of γ, ie, q = 15 in Algorithm 3 (increasing this number did not show any significant difference in the results). From the initial number of 350 proteins, a t test is applied to select the top 10 to 15 proteins. We consider a range of small sample sizes per class and several values for the number of selected features. The results displayed below are averages over 6 runs of the experiment for each combination of classification rule, sample size, and dimensionality. The classification error for each case is estimated on an independent synthetic test data set of 100 sample points.
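The feature-selection step just described (a t test selecting the top proteins) can be sketched with a Welch t statistic; the helper name and the NumPy implementation are our assumptions, not the authors' code.

```python
import numpy as np

def select_top_proteins(S0, S1, k=10):
    """Rank proteins by the absolute two-sample (Welch) t statistic between
    control (S0) and treatment (S1) sample sets and keep the k largest."""
    m0, m1 = S0.mean(axis=0), S1.mean(axis=0)
    v0, v1 = S0.var(axis=0, ddof=1), S1.var(axis=0, ddof=1)
    t = np.abs(m0 - m1) / np.sqrt(v0 / len(S0) + v1 / len(S1))
    return np.argsort(t)[::-1][:k]       # indices of the k most separated proteins

# Toy usage: only proteins 0 and 1 are differentially expressed
rng = np.random.default_rng(6)
S0 = rng.normal(0.0, 1.0, size=(20, 10))
S1 = rng.normal(0.0, 1.0, size=(20, 10))
S1[:, :2] += 10.0
top = select_top_proteins(S0, S1, k=2)
```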

Effect of sample size

Figure 1 displays the average error rates for the different classification rules as a function of sample size, with the number of selected proteins held fixed. As the sample size increases, the total error decreases for all classification rules. An important observation is that at small sample sizes, the performance of ABC-MCMC is best, confirming the general principle of good small-sample performance of Bayesian methods.
Figure 1.

Average classification error rates against sample size for a fixed number of selected proteins. ABC-MCMC indicates approximate Bayesian computation-Markov chain Monte Carlo; LDA, linear discriminant analysis.


Effect of dimensionality

The average error rates of the various classification rules against dimensionality, ie, the number of selected proteins, are displayed in Figure 2, for a fixed sample size per class. We can observe a very strong peaking phenomenon[16]: as the number of selected proteins increases, the average classification error rates tend to go down at first, but then increase sharply, due to the small ratio of sample size to dimensionality. One can observe that the ABC-MCMC classification rule is the most accurate one when the dimensionality is large, which is in agreement with the fact that Bayesian methods tend to outperform competing techniques under small ratios of sample size to dimensionality.
Figure 2.

Average classification error rates against the number of selected proteins for a fixed sample size. ABC-MCMC indicates approximate Bayesian computation-Markov chain Monte Carlo; LDA, linear discriminant analysis.


Effect of variability

Here, we keep the sample size and the number of features fixed to investigate the impact on the classification error rate of increasing variability of the true protein concentration values. In Figure 3, one can observe that the performance of all classification rules degrades with increasing values of the coefficient of variation ϕ; however, the performance of the ABC-MCMC algorithm is uniformly better than that of the others due to the small sample size.
Figure 3.

Average classification error rates against the coefficient of variation for a fixed sample size per class and a fixed number of selected proteins. ABC-MCMC indicates approximate Bayesian computation-Markov chain Monte Carlo; LDA, linear discriminant analysis.


Effect of peptide efficiency

Finally, we investigate the impact of varying the peptide efficiency on classification accuracy. The peptide efficiency factor ei controls how many ions are detected for a given peptide. Increasing this parameter uniformly increases efficiency for all peptides, which corresponds to a more accurate SRM-MS experiment. Indeed, one can observe in Figure 4 that classification accuracy tends to increase with increasing peptide efficiency. One can also observe that the ABC-MCMC classification rule displays the smallest error rates among the competing methods at low peptide efficiency, ie, in a noisier experiment.
Figure 4.

Average classification error rates against the lower bound of the peptide efficiency factor for a fixed sample size per class and a fixed number of selected proteins. ABC-MCMC indicates approximate Bayesian computation-Markov chain Monte Carlo; LDA, linear discriminant analysis.


Conclusions

In this article, we have proposed a Bayesian approach for classifying SRM data with the goal of facilitating biomarker development. The method combines ABC and MCMC. For small sample sizes, large dimensionality, or noisy data, the performance of the proposed Bayesian classifier is superior to that of the other approaches considered. Our results are based on a subset of the human proteome selected from DrugBank, which was submitted to tryptic digestion in silico. In addition, the prior hyperparameters are calibrated using the available data, which makes the approach realistic and broadly applicable. Because we are studying the effects of the various parameters of the SRM pipeline on the classification error, synthetic data from a generative model are needed. The results are, however, expected to be reproducible on clinical SRM data.

References

1. Aebersold R, Mann M. Mass spectrometry-based proteomics. Nature. 2003.
2. Csilléry K, Blum MGB, Gaggiotti OE, François O. Approximate Bayesian Computation (ABC) in practice. Trends Ecol Evol. 2010.
3. Rifai N, Gillette MA, Carr SA. Protein biomarker discovery and validation: the long and uncertain path to clinical utility. Nat Biotechnol. 2006.
4. Wegmann D, Leuenberger C, Excoffier L. Efficient approximate Bayesian computation coupled with Markov chain Monte Carlo without likelihood. Genetics. 2009.
5. Taniguchi Y, Choi PJ, Li GW, et al. Quantifying E. coli proteome and transcriptome with single-molecule sensitivity in single cells. Science. 2010.
6. Ye X, Blonder J, Veenstra TD. Targeted proteomics for validation of biomarkers in clinical samples. Brief Funct Genomic Proteomic. 2008.
7. Hüttenhain R, Malmström J, Picotti P, Aebersold R. Perspectives of targeted mass spectrometry for protein biomarker verification. Curr Opin Chem Biol. 2009.
8. Banerjee U, Braga-Neto UM. Bayesian ABC-MCMC classification of liquid chromatography-mass spectrometry data. Cancer Inform. 2017.
9. Atashpaz-Gargari E, Braga-Neto UM, Dougherty ER. Modeling and systematic analysis of biomarker validation using selected reaction monitoring. EURASIP J Bioinform Syst Biol. 2014.
10. Sturm M, Bertsch A, Gröpl C, et al. OpenMS - an open-source software framework for mass spectrometry. BMC Bioinformatics. 2008.
