Literature DB >> 19259411

Bayesian mass spectra peak alignment from mass charge ratios.

Junfeng Liu¹, Weichuan Yu, Baolin Wu, Hongyu Zhao.

Abstract

Proteomics studies based on mass spectrometry (MS) are gaining popular applications in biomedical research for protein identification/quantification and biomarker discovery, especially for potential early diagnosis and prognosis of severe disease before the occurrence of symptoms. However, MS data collected using current technologies are very noisy and appropriate data preprocessing is critical for successful applications of MS-based approaches. Among various data preprocessing steps, peak alignment from multiple spectra based on detected peak sample locations presents special statistical challenges when effective experimental calibration is not feasible due to relatively large peak location variation. To avoid intensive tuning parameter optimization, we propose a simple novel Bayesian algorithm "random grafting-pruning Markov chain Monte Carlo (RGPMCMC)" that can be applied to global MS peak alignment and to follow certain model-based sample classification criterion for using aligned peaks to classify spectrum samples. The usefulness of our approach is demonstrated through simulation study by making extensive comparison with other algorithms in the literature. Its application to an ovarian cancer MALDI-MS data set achieves a smaller 10-fold cross validation error rate than other current large scale methodologies.

Entities: Chemical Disease Gene Species

Keywords: Mass spectrometry; peak alignment; random grafting-pruning Markov chain Monte Carlo (RGP CMC); reversible jump Markov chain Monte Carlo (RJMCMC); sample classification; symmetric transition

Year: 2008 PMID： 19259411 PMCID： PMC2623297

Source DB: PubMed Journal: Cancer Inform ISSN： 1176-9351

Introduction

Genomics and proteomics technologies offer much promise in our understanding of fundamental biological processes by allowing us to simultaneously monitor the expression levels of tens of thousands of genes and proteins. Since proteins are the basic functioning units in the cells, there is a great interest to characterize individual molecular profiles based on proteomics for reliable biomarker discovery and effective disease diagnosis/prognosis/treatment. In proteomics research, mass spectrometry (MS) is the most widely used instrument to allow for the mass measurement of molecules, where a mass spectrometer determines chemical compounds’ molecular weight by ionizing, separating, and measuring molecular ions according to their mass-to-charge ratio (m/z: unit Da) and a mass spectrum is the standard data output for analysis and interpretation (Link et al. 1999), where the x-axis represents m/z value (Da) and the y-axis represents intensity (enrichment of particles with certain m/z). Recently, Yu et al. (2005) reviewed current approaches on the extraction of the most relevant information from the raw mass spectra to identify disease biomarkers. A standard MS data analysis usually involves background noise removal, smoothing, intensity normalization, peak identification and alignment, and biomarker identification. The false positive and false negative peaks may exist in taking local maxima as peaks (Coombes et al. 2003) and peak location variation may be due to differences in sample preparation, chemical noise, co-crystallization, deposition of the matrix-sample onto the target, and laser position on the target among others. Several statistical methods have been proposed to reduce the background noise (Coombes et al. 2003; Satten et al. 2004; Coombes et al. 2005a and Randolph and Yasui, 2006). Morris et al. (2005) applied translation-invariant wavelet transformations to the raw spectra and performed peak detection using the mean spectrum derived from a group of spectra, where they assumed calibration can be done in advance experimentally or by interpolation to make common peaks stay closely together, and only false negatives are possible. However, Coombes et al. (2005b) pointed out that, peak location variation, caused by a spread of initial particle velocities at the starting end of mass spectrometer tube, makes calibration more difficult. Compared to background noise removal, peak identification and alignment is more challenging and critical by providing the links of underlying peptides across all spectra. However, existing algorithms for peak alignment are mostly ad hoc and based on heuristic arguments (Nielsen et al. 1998; Johnson et al. 2003; Torgrip et al. 2003; Eilers, 2004; Tibshirani et al. 2004 and Randolph and Yasui 2006), where some parameters need to be optimized empirically and/or subjectively. In this paper, we work on the detected peak samples (possibly with associated intensities) coming from certain protocol. In view of hundreds of features (peaks) in each spectrum, due to complex chemical and physical mechanisms undergoing the mass spectrometry, throughout this paper, we assume that substantial individual peak location variation is existent, say up to one half of the interval between neighboring peaks, thus calibration may be in lack of power. Moreover, we consider statistically false positives and negatives to make our model more accountable. Overall, we will assume that each set of potential peak samples corresponding to the true peak follows an individual composite distribution regulated by true peak location, peak sample location variation, false positive and false negative rates. An effective simple Bayesian MCMC algorithm is proposed to do peak alignment (biomarker identification) and downstream sample classification, where all individual sets of parameters for each true potential peak are able to be estimated in a universal modeling framework without intensive tuning parameter optimization. Our algorithm is much computationally simpler than other available MCMC algorithms which could be applied to MS alignment while retaining competing performance. The rest of this article is organized as follows: Section 2 introduces Bayesian dimension matching problem and proposes our simplified approach; Section 3 summarizes simulation results to demonstrate the efficiency and reliability of our algorithm; Section 4 illustrates the applications of our approach to a real MS data set; Section 5 considers joint analysis of peak alignment and sample classification; Section 6 concludes with discussions; and some technical details are given in the appendix.

Methods

Dimension-matching statistical model

We now develop a simple novel MCMC algorithm, random grafting-pruning Markov chain Monte Carlo (RGPMCMC) which can be applied to MS peak alignment from mass charge ratio information. Within Bayesian framework, we are often interested in the posterior distribution of the dimension-varying parameter θ The prior distribution π(θ)is represented as ∑∈ π(θ|K)π(K), where N is positive integer set and π(θ |K) is the individual prior distribution within the K-dimensional space and π(K)is the mixture probability for dimension K. Since π(K, θ) = π(K)π(θ |K), the posterior distribution of (K, θ) where the denominator is the normalization constant not needed for posterior sampling. Change point model usually involves dimension-matching in two typical cases: one single ordered series where change points are taken as those separating successive discrete points, and multiple ordered series where change points correspond to physical locations in the continuous space. For the discrete case, the partitioning and wrapping up the segments leads to an exponentially increasing computational cost as the model space grows (Denison et al. 2002), and frequentist’s approaches only either work on very few change points or special algorithms (Guan, 2004; Olshen and Venkatraman, 2004). Recently, Fearnhead (2005) proposed an exact non-MCMC sampler by recursive partition, and Loschi, Cruz and Arellano-Valle (2005) introduced the product partition model based Bayesian algorithm which stems from Yao (1984) and Barry and Hartigan (1992 and Barry and Hartigan (1993). For the continuous case, the readers are referred to sampling-based algorithms: the reversible jump MCMC (RJMCMC) by Green (1995), the birth-and-death process MCMC (BDMCMC) and continuous time process MCMC (CTMCMC) by Stephens (2000a, 2000b), Bayesian cluster detection in maps (Knorr-Held and Raßer, 2000) and others. Cappé, Robert and Rydén (2003) showed that, the acceptance probability of the usual MCMC methods is replaced by differential holding times in BDMCMC, and RJMCMC converges to a limiting continuous time birth-and-death process on an appropriate rescaling of time. They also demonstrated that, RJMCMC and CTMCMC have similar computational performance while the latter demands expensive death rate computation. Obviously, discretization is neither always suitable nor efficient for sophisticated change point identification in the continuous space. The present work introduces a simple MCMC algorithm in the context of multiple MS peak alignment by considering all uncertainties including peak number, peak locations, peak sample location variations, false negative and false positive rates. Neither the error-prone Jacobian terms in RJMCMC, the intensive death rate calculation in CTMCMC, nor the computationally expensive recursive partition in other algorithms is needed by our method. Before starting with our statistical model, we assume local maxima (discrete locations) have been detected as peak samples for each raw mass spectrum. In MS peak alignment, peak sample location variation may linearly depend on m/z magnitude (Yasui et al. 2003), and log-transformation of m/z achieves peak sample location variation homogeneity, this observation will justify the identical prior specification for peak sample location variations across true peaks (details later). For notational simplicity, we use m/z instead of log (m/z) in the following discussions and assume that the m/z domain is [(m/z), (m/z)]. The underlying K-dimensional true peak location vector is S̃ = (s1, s2, …, s), where the peak locations are generally separated by at least a distance threshold, say no less than d. The data to be analyzed from multiple spectra are the detected peak sample locations y (j = 1, …, n, i = 1, …, I), where i is the index for spectra, j is the peak sample index within each spectrum (with increasing m/z), and n is the number of detected peak samples in the i-th spectrum. The peak samples are assumed to be normally distributed around their true peak s with standard deviation σ (k = 1, …, K) for locations. For the putative true peak set S̃ = (s1, s2, …, s), each detected peak sample j in i-th spectrum is assigned to its nearest putative true peak among S̃, say n (i, j) (with possible multiple assignments to the same putative peak from a given spectrum). For certain true peak k, (1) when there is no peak sample assignment to it from a given spectrum, we consider it as a false negative case for this spectrum with probability fn; (2) when there are multiple peak sample assignments to it from a given spectrum, we consider it as a false positive case for this spectrum with probability fp; (3) otherwise we consider it as a non-false positive or negative case for this spectrum with probability 1 − fn − fp. Our prior hypothesis is that, each true peak shows these three types of cases proportionally under the homogeneity assumption for the spectra and identical peak sample detection protocol for each local m/z region, say for each true peak, on average 90% spectra will contribute single peak sample, 5% will contribute multiple peak samples and 5% will contribute no peak sample to the putative true peak. For these three cases, we assume an independent trinomial distribution Tri(I; fn, fp, 1 − fn − fp) for each true peak k in the context of numbers of grabbed peak samples from I spectra. Now we restate the following notations for model set-up: stands for the peak sample locations for all I spectra, which need not be a I-row matrix since the numbers of peak samples are not necessarily equal because of false negatives and/or false positives; S̃ s a K-dimensional vector of putative true peak locations (m/z’s); σ̃ is a K-dimensional vector of peak sample location variations at putative true peaks; f̃n and f̃p are K-dimensional vectors of false negative and false positive rates at putative true peaks; n (i, j) is putative true peak assignment to peak sample j in the i-th spectrum; n is the number of spectra without peak samples assigned to putative true peak k; n is the number of spectra with multiple peak samples assigned to putative true peak k, and is the number of spectra with single peak sample assigned to putative true peak k (Figure 1). The likelihood for the observed peak samples across all spectra is

Figure 1

Mass Spectrum Biomarker Model. (The left column is spectrum index, the vertical lines are putative true peaks, the horizontal circle lines are peak samples of each spectrum. The lower left dash rectangle represents a false negative case and the upper right dash rectangle represents a false positive case.)

We now describe the motivation for such a likelihood function which is crucial for Bayesian inference. The matching by minimum distance (S) is to capture important clustering information. We believe that, this objective construction is reasonable compared to some alternatives, e.g. non-model based clustering algorithms, or piecewise binning method where each bin is assumed to hold exactly those peak samples for the individual putative true peak. The other part incorporating false positives and false negatives is mainly a technical statistical consideration, since it is very hard to accurately claim which peak sample is a true false positive or false negative. To sum up, in spite of substantial measurement errors underlying high-throughput mass spectra, we pursue a reasonable eclectic theme to tackle biomarker profile estimation. For notational convenience, “peak” represents putative true peak for the biomarker profile hereafter, other than detected peak sample from each spectrum. On the prior part, we assume that K follows a truncated Poisson or discrete uniform distribution on {K, …, K}. As Green (1995) suggested, the peak locations are taken as even-numbered order statistics from 2K + 1 points uniformly distributed on an L-length interval [(m/z), (m/z)] (for convenience, s0 = (m/z), s+1 = (m/z)) to avoid too many short steps, which has density Π=1+1 (s − s−1)/L2+1 as suggested by Green (1995). To make use of conjugate prior, π(σ2) is taken as Inverse-Gamma (ν, η) density (σ2)−(ν+1) e−1/(σ η−ν/Γ(ν); the joint prior distribution for (fn, fp,1 − fn − fp) is a 3-dimensional Dirichlet distribution with density D (fn, fp, 1 − fn − fp|α1 α2, α3) = (fn)α (fp)α (1 −fn − fp)α Γ(α1)Γ(α2) Γ(α3)/Γ(α1 + α2 + α3). (The posterior distribution is

Random Grafting-pruning Markov chain Monte Carlo (RGPMCMC)

Motivated by Green (1995), our algorithm is based on a redesigned universal “naively informative” parameter proposal involving peaks and the other parameters concurrently without the need for Jacobian terms. We also propose four move types (+, −, H, S), where “+” means peak birth proposal, “−” means peak death proposal, “H” means parameter (σ2, fn and fp) proposal excluding peaks and “S” means peak location mutation with peak number unchanged. We specify (+, −, H, S) probabilities as (π(+),π(−), π(H), π(S)). First we choose one of these four move types based on move type probabilities (π(+),π(−), π(H), π(S)), where π(+) = π(−). For the “+” move type, we describe the parameter proposal process: If K = K, we go to 1) since the upper threshold is reached; if K < K, we randomly sample one of the K + 1 intervals formed by current K peaks, say (s, s+1), with equal probability 1/(K + 1). We may assign j + 1 to this new peak index * and the following indexes increase by one accordingly. Within this sampled interval, we propose (s*, σ*2, fn* and fp*) for peak candidate * as follows: True peak location proposal for peak birth: where g (U1; s, s+1) is a one-to-one mapping from random variable U1 to peak location s* given s and s+1. We take g (U1; s, s+1) to be (s + s+1g1(U1))/(1 + g1(U1)), or (s* − s)/(s+1 − s*) = g1(U1), where g1(·) is any monotonic function with domain [0, 1] and range [0, ∞), and U1 ~ U [0, 1]. It can be seen that, s* is a monotonically increasing function of U1. We simply use g1(u) = u/(1 − u), thus s* = s + (s+1 − s) U1, a uniform random variable ∈ (s, s+1). Peak sample location variance proposal for peak birth: where gσ(U2;σ2, σ12) is a one-to-one mapping from random variable U2 to peak sample location variance σ*2 given σ2 and σ+12. We take gσ (U2;σ2, σ12) to be (σ2σ+12)1/2 g2(U2) or (σ*/σ)/(σ+1/σ*) = g2(U2), where g2(·) is any monotonic function with domain [0, 1] and range [0, ∞), and U2 ~ U [0, 1]. It can be seen that, σ*2 is a monotonically increasing function of U2. We simply use g2(u) = u/(1 − u), thus σ2* = (σ2 σ2+1)1/2 U2/(1 minus; U2). Peak sample false negative and false positive rate proposal for peak birth: where g̃ (U3, U4; fn, fp, fn+1, fp+1), is a one-to-one mapping from (U3, U4) to (fn, fp) given (fn, fp) and (fn+1, fp+1). Specifically, for peak *, we use O* to represent “false negative or positive” odds and R* to represent “false negative vs. positive” ratio, i.e. the false negative and false positive rate proposal is realized in two sequential steps: O* proposal: i.e. (O*/O)1/2/(O+1/O*)1/2 = g3(U3), where g3(·) is any monotonic function with domain [0, 1] and range [0, ∞), and U3 ~U [0, 1]. It can be seen that, O* is a monotonically increasing function of U3, we simply use g3(u) = u/(1 − u). R* proposal: i. e. (R*/R)1/2/(R+1/R*)1/2 = g4(U4), where g4(·) is any monotonic function with domain [0, 1] and range [0, ∞), and U4 ~ U [0, 1]. It can be seen that, R* is a monotonically increasing function of U4, we simply use g4(u) = u/(1 − u). Note that the constraint 0 ≤ fn* + fp* ≤ 1 holds under this proposal. The g̃ (U3, U4; fn, fp, fn+1, fp+1) function is a combination of O* proposal and R* proposal in this case. fn* and fp* are jointly proposed to meet the constraint. The Jacobian of transforming (fn, fp, fn+1, fp+1, u3, u4) into ((fn, fp, fn*, fp*, fn+1, fp+1) is calculated by chain rule (see appendix). When the insertion of the peak birth candidate is before the first peak or after the last peak, there are no real double neighbors. In this case, we take the duplicates of peak birth candidate’s unique succeeding or preceding neighbor as two virtual neighbors for proposal implementation. In this move type, we realize the sequential uniform lift for fn* + fp* and subsequent conditional uniform lift for fn* within fn* + fp*. We claim that, the peak birth proposal by “+” move type, along with the peak death proposal by the following “−” move type, constructs a symmetric transition, i.e. equally probable events (Proposition 3). So the acceptance probability in the Metropolis-Hastings algorithm within Gibbs sampler is simply For the “−” move type, the parameter proposal process is: If K = K, we go to 1) since the lower threshold is reached; if K > K, we randomly sample one from current K peaks with equal probability 1/K, say index *, to delete. Then we simply abandon the associated s*, σ*2,fn* and fp* for likelihood reconstruction. The acceptance probability in the Metropolis-Hastings algorithm within Gibbs sampler is simply The “+/−” move type is demonstrated in Figure 2 for only peak location model.

Figure 2

Symmetric Transition for Peak Birth/death Proposal. (Each rectangle represents a uniform peak birth proposal domain, each internal vertical boundary represents a possible peak death proposal location.)

For the “S” move type, the parameter proposal process is: We randomly sample one of current peaks, say *, with equal probability 1/K, and take the two neighboring peak intervals which peak *separates as a fused interval. A peak location is uniformly randomly drawn within this fused interval for a new candidate peak to replace s*. The other parameters associated with this peak location mutation is kept unchanged, or they could be changed as a set within the uniformity framework. This is also a symmetric transition (Proposition 2). So the acceptance probability in the Metropolis-Hastings algorithm within Gibbs sampler is simply For the “H” move type, the posterior distribution of each σ2 is Inverse-Gamma (ν′, η′), where ν′ = ν + (∑ ∈ 1)/2. η′ = 1/(1/η +∑ ∈ (y − s)2/2). The joint posterior distribution of each (fn, fn, 1 − fn − fp) is Dirichlet (α1 + n, α2 + n, ). We sample the whole set of these parameters once.

Remark

This algorithm is a random scan (by move type probabilities) version of Gibbs sampler introduced by Gelfand and Smith (1990) with generalized parameter components: peak (described by location, peak sample location variance, false negative and positive rates) birth or death changes the number of parameter sets by move type “+” or “−” in the aforementioned parameter sampling proposal; peak location mutation does not change the number of parameter sets by move type “S” in the aforementioned parameter sampling proposal; peak sample location variance or false negative/positive rate sampling does not change the number of parameter sets by move type “H” in the aforementioned parameter sampling proposal. The associated propositions and the detailed description of RGPMCMC are given in the appendix. The following proposition justifies the correct convergence.

Proposition

The induced Markov chain is irreducible, aperiodic, and ergodic.

Interpretation

In view of independent uniform proposals for non-peak parameter set and diverse move types, given any arbitrarily small neighborhood of a current state there is a positive probability that the chain lies in that neighborhood after one sampling iteration, thus the aperiodicity is verified; the irreducibility is established since the chain can move from any state to any other state one step at a time.

Simulation Study

In this section, we study the RGPMCMC performance under diverse circumstances and make comparison with other approaches in the literature.

Prior sensitivity analysis for RGPMCMC

We consider different priors and study the discrepancy between the specified true peak locations and the estimated peak locations. Compaq Fortran 90 is our development package and we use IMSL Fortran Numerical Library to generate random numbers. 200 spectra are simulated for each of the six simulations, the set-ups and priors are listed in Table 1, where four true peaks are considered, the peak location vectors specify the true peak locations, the σ vectors specify the peak sample location variations at true peaks, the fn and fp vectors jointly specify the probabilities for us to simulate no peak sample (with probability fn), multiple peak samples (with probability fp), or single peak sample (with probability 1 − fn − fp) for each true peak from individual spectrum. The Inverse-Gamma and Dirichlet priors are for peak sample location variances and false negative and positive rates at putative true peaks. The peak number prior is K ~ U [1, 20], the starting peak number is 11, and move type probabilities are: π(+) = 0.45, π(−) = 0.45, π(H) = 0.05, and π(S) = 0.05. With burn-in 10,000 and thinning 1,000, each collection of 1,000 posterior samples takes several minutes on a PC powered by Celeron CPU. The results are given in Figure 3. In simulation 1, the true peaks are clearly clustered and the peak estimation is good. In simulation 2 with larger peak sample location variations and false positive/negative rates, the same number of peaks are identified as the true peak number under highly informative priors. By modifying peak sample location variation priors, we observe that, under large true variations, the peak estimation under less informative variation priors is worse than that under more informative variation priors. More uncertainties are introduced in simulation 3, and fewer peaks are identified than the true peak number. By modifying peak sample location variation priors, we observe that, when the variation priors are consistent with the true variation in terms of mean value, the peak number estimation seems to be better than inconsistent variation priors. Simulation 4 shows that, the informative peak sample location variation priors may lead to good estimation even when the peaks’ associated parameters are different, so does simulation 5, where unevenly distributed peaks are considered. In simulation 6, even when the parameter priors become much less informative, the peak estimation performs well, since the true peaks are sharply surrounded by corresponding peak samples. These observations show that, informative a priori knowledge is desirable for reliable estimation.

Table 1

Simulation Configurations (IG: Inverse-Gamma prior for σ2, D: Dirichlet prior for (fn, fp, 1 − fn − fp)).

Peak location (s₁, s₂, s₃, s₄) = (1/5, 2/5, 3/5, 4/5
(1)	σ	0.05	0.05	0.05	0.05	(2)	σ	0.10	0.10	0.10	0.10
	fn	0.10	0.10	0.10	0.10		fn	0.20	0.20	0.20	0.20
	fp	0.10	0.10	0.10	0.10		fp	0.20	0.20	0.20	0.20
		IG(5, 100), D(1, 1, 8)				IG(26, 4), D(40, 40, 120)
(3)	σ	0.20	0.20	0.20	0.20	(4)	σ	0.02	0.08	0.08	0.02
	fn	0.30	0.30	0.30	0.30		fn	0.10	0.20	0.30	0.40
	fp	0.40	0.40	0.40	0.40		fp	0.10	0.20	0.30	0.40
		IG(6, 5), D(5, 5, 5)				IG(20, 6), D(5, 5, 10)

Peak location (s1,s2,s3,s4) = (1/16, 3/16, 7/16, 15/16)

(5)	σ	0.02	0.08	0.08	0.02	(6)	σ	0.02	0.02	0.02	0.02
	fn	0.10	0.20	0.30	0.40		fn	0.10	0.10	0.10	0.10
	fp	0.10	0.20	0.30	0.40		fp	0.10	0.10	0.10	0.10
		IG(20, 6), D(5, 5, 10)						IG(3, 2), D(2, 2, 2)

Figure 3

Simulations and Estimations (The simulation directly produces the peak samples for 200 patients which are plotted in each odd row of panels without the need for peak sample detection by data preprocessing, the y-axis is patient index and the x-axis is log(m/z). The alignments are compared by gridded walls given in each even row of panels. From top to bottom: true peaks, aligned peaks by RGPMCMC, aligned peaks by RJMCMC, aligned peaks by scale-space approach, aligned peaks by super-set approach and aligned peaks by PAM clustering algorithm.)

Comparison with some non-MCMC approaches

We first make comparisons with other non-MCMC algorithms represented by the recently developed scale-space approach (Yu et al. 2006a), super-set approach (Yu et al. 2006b), and Partitioning around Medoids (PAM) approach (Kaufman and Rousseeuw, 1990). For the six simulations in Section 3.1, the combined peak sample locations from 200 spectra along with the optimal cluster number minimizing the “median split silhouette” (Pollard and van der Laan, 2002) are taken as PAM inputs, S-plus function pam offers the medoid (cluster center) locations. These results are also included in Figure 3. Overall, RGPMCMC performs better than non-MCMC methods; the scale-space result is little better than the super-set result in terms of robustness; the PAM algorithm is not specifically designed for peak alignment, so it may perform poorly under certain circumstances, say simulations 2 and 3, where RGPMCMC can not recover all true peaks either.

Comparison with reversible jump Markov chain Monte Carlo

We apply the same simulated data in Section 3.1 and make comparison with reversible jump Markov chain Monte Carlo (RJMCMC) algorithm by Green (1995). RGPMCMC and RJMCMC differ in the method for proposing move type “+” and/or “−” (peak birth and/or death), where the former conditions on the active Markov chain by making use of equally probable peak birth and death proposals, while the latter makes use of additional variables to construct a one-to-one matching for dimension changing (details are given in the appendix). The same prior specifications for RGPMCMC in Section 3.1 are applied to RJMCMC. We consider two starting peak numbers, 11 and 1, both equally partitions the m/z range. Figure 3 also compares the alignments by RGPMCMC and RJMCMC, where no difference between RGPMCMC and RJMCMC exists. The peak number iteration comparison is given in Figure 4, where the burn-in = thinning = 1,000. The third row of dual panels pinpoint 3 or 4 peaks, the 4:3 ratios are 0.124 and 0.122 for RGPMCMC and RJMCMC respectively, which is close to each other. The acceptance rate for peak birth and/or death proposals should be almost identical for these two algorithms.

Figure 4

Peak Number Sampling Series by RGPMCMC and RJMCMC from Simulation Study (burn in = thinning = 1,000, the initial peak number = 11.)

The peak number iteration comparison before reaching reasonable peak number (1,000th iteration) can be seen from Figure 5. For these six simulations, peak birth and/or death, and peak mutation acceptance rates are compared in Table 2, which are close to each other.

Figure 5

Peak Number Sampling Series by RGPMCMC and RJMCMC from Simulation Study (first 1,000 iterations with initial peak number 11)

Table 2

Acceptance Rate Comparison between RGPMCMC and RJMCMC.

Simulation	1	2	3	4	5	6
	First 1,000 iterations (the initial peak number = 11)
peak birth (RGP)	0.00E-4	0.00E-4	0.00E-4	0.00E-4	0.00E-4	0.00E-4
peak birth (RJ)	0.00E-4	0.00E-4	0.00E-4	0.00E-4	0.00E-4	0.00E-4
peak death (RGP)	6.70E-3	1.55E-2	1.55E-2	1.33E-2	1.55E-2	1.55E-2
peak death (RJ)	6.70E-3	1.55E-2	1.55E-2	1.33E-2	1.55E-2	1.55E-2
peak mutation (RGP)	1.72E-2	3.53E-2	3.94E-2	1.30E-2	5.90E-3	2.73E-2
peak mutation (RJ)	1.93E-2	3.33E-2	4.42E-2	1.30E-2	5.90E-3	2.17E-2
	First 50,000 iterations (the initial peak number = 1)
peak birth (RGP)	1.70E-4	1.26E-4	1.26E-4	1.70E-4	1.27E-4	8.50E-5
peak birth (RJ)	1.27E-4	1.26E-4	8.44E-5	1.28E-4	1.27E-4	8.43E-5
peak death (RGP)	0.00E-4	4.26E-5	2.12E-4	4.20E-5	2.53E-4	7.59E-4
peak death (RJ)	1.26E-4	4.25E-5	2.54E-4	4.20E-5	4.21E-5	7.21E-4
peak mutation (RGP)	1.71E-2	4.27E-2	4.08E-2	1.21E-2	8.96E-3	1.02E-2
peak mutation (RJ)	1.56E-2	3.04E-2	4.90E-2	8.51E-3	6.38E-3	1.41E-2

The reasonable peak number is reached after almost the same number of iterations (~1,000) by GPMCMC and RJMCMC, which have similar efficiency for peak number identification. The peak number iteration comparison is given in Figure 6, where the burn-in = thinning = 1,000. The third row of dual panels pinpoint 3 or 4 peaks, the 4:3 ratios are 0.131 and 0.079 for RGPMCMC and RJMCMC respectively, where the ratio by RGPMCMC (0.131) is very close to peak number 11 case (0.124). Except for simulation 5, both RGPMCMC and RJM-CMC identify the same number of peaks. For simulation 6, they all identify 3 peaks other than the true 4 peaks.

Figure 6

Peak Number Sampling Series by RGPMCMC and RJMCMC from Simulation Study (burn in = thining = 1,000 the initial peak number = 1.)

The peak number iteration comparison before 50,000th iteration can be seen from Figure 7. For these six simulations, peak birth and/or death, and peak mutation acceptance rates are also compared in Table 2. The rates are still close to each other.

Figure 7

Peak Number Sampling Series by RGPMCMC and RJMCMC from Simulation Study (first 50,000 interations with initial peak number 1)

Initial peak number = 1

We find that, starting from a relatively large peak number is more capable of identifying true peaks by RGPMCMC and RJMCMC.

Mimic-MS simulation study

We mimic the mass spectra using the R package developed by Coombes et al. (2005b), where wide-ranging factors are considered to create the uncertainty, including the acquisition time resolution of the detector, the distribution of initial particle velocities, isotope distribution and others. Ideally, our six simulated mass spectrum groups have 5, 10, 20, 40, 80 and 160 peaks without incorporating any uncertainty and the R simulator produces six mean spectra, where the particles with large mass value (>20,000 Da) have broader hills due to more isotopes (Figure 8). Each mean spectrum leads to 100 uncertainty involved random replicative spectra subject to peak sample detection. For them we smooth spectrum with a predefined Gaussian function (window size of 15), search local maxima in the local neighborhood of 15 data points as peak samples, the minimal intensity value of peaks should be not smaller than 100. The RGPMCMC pinpoints 5, 9, 18, 35, 62 and 112 peaks (Inverse-Gamma (6,500) is peak sample location variance prior, Dirichlet (5,5,90) is false negative and positive rate prior, the starting peak numbers are 10, 20, 40, 80, 160 and 320); the clustering method in Tibshirani et al. (2004) pinpoints 5, 9, 19, 36, 69 and 174 peaks (the tuning parameters are selected as suggested in Tibshirani et al. (2004)). The alignment comparison is also given in Figure 8, where the clustering method tends to identify redundant peaks at large masses (the arrows in the bottom panels of Figure 8), while RGPMCMC performs better in this region; both methods identify less peaks in certain dense regions (bottom panels in Figure 8), while RGPMCMC sometimes combine too concentrated peaks into one peak (the arrows in the middle panels of Figure 8). Overall, these two approaches have similar performance in this simulation scenario.

Figure 8

Mimic-MS Simulations and Estimations (The R simulator produces the spectrum profiles given in each odd row of panels, 100 random spectra were simulated for each of them for peak sample detection. The y-axis is intensity and the x-axis is m/z. After peak sample detection by data preprocessing, the alignments are compared by gridded walls given in each even row of panels. From top to bottom: true peaks, aligned peaks by RGPMCMC and aligned peaks by clustering algorithm.)

Application to Real Data

We model the same ovarian cancer data source as used by Wu et al. (2003) (available on-line at http://bioinformatics.med.yale.edu/MSDATA), where the healthy group has 77 patients and the cancer group has 93 patients. The individual spectrum has tens of thousands of (m/z, intensity) pairs and looks like a more complicated and error-involving version of those simulated profiles in Section 3.4 (Figure 8). The data preprocessing on original spectra involves baseline subtraction, smoothing, intensity normalization and peak sample detection by local maxima as described in Section 3.4. The median of the original peak sample numbers after pre-processing is 249 for the healthy group and 241 for the cancer group. The values of log(m/z) range from 6.565 to 8.200. We use move type probabilities (π(+) = 0.45, π(−) = 0.45, π(H) = 0.05, π(S) = 0.05) and the interval constraint d = 10−5. If this constraint is not met at one iteration, we simply resample the parameters. Richardson and Green (1997) observed that, proper posterior distributions may be not possible under fully noninformative priors, so we apply (α1 = 5, α2 = 5, α3 = 90) to Dirichlet priors, and (ν = 3, η = 26) to Inverse-Gamma priors. Different peak number priors, either truncated Poisson [100, 600|λ] or Uniform [100, 600] lead to similar results. The starting peak number is K = (K + K)/2, the initial peak locations are equal K- partition of log(m/z) range, burn-in is 10,000, and thinning is 1,000. We recommend to start with a relatively large peak number. The sampler approaches the reasonable peak number very quickly and usually sticks around and mostly does effective single peak mutations once approaching the true peak number. Occasional peak number jump ups are highly efficient for joint peak number and location estimation. The alignment results are given in Figure 9. The sampling series of peak number are given in Figure 10, where we empirically identify the modes of the posterior peak number distribution as 274 (healthy group) and 260 (cancer group). The false negative rate and false positive rate estimations are given in Figure 11 with a significant negative correlation. The posterior peak sample location variation medians are given in Figure 12, where the inconsistency of the first point possibly arises from edge effect, so do false negative and positive rate median plots (Figure 11). The average peak distance (~(8.200–6.565)/290 = 0.0058) dominates the estimated peak sample location variation (~0.001). The sampler takes several thousand iterations to approach the reasonable peak number, thus an adaptive strategy with varying move type probabilities may be more efficient than the brute birth/death dominating proposal after that time point. The collection of the posterior samples in Figure 10 takes several days on a PC powered by Celeron CPU.

Figure 9

Aligned Peaks with Peak Sample Background (y-axis: patient index, x-axis: log(m/z). The vertical lines represent aligned peaks (biomarker profile) for each group, the dots in the background are the deteced peak samples for all patients by data preprocessing.)

Figure 10

Sampling Series of Peak Number (y-axis: peak number, x-axis: iteration index. The healthy group seems to have more peaks than cancer group.)

Figure 11

False Negative Rate and False Positive Rate Estimation (The left panel of first row is the histogram of false negatives at aligned peaks from healthy group, the right panel of first row is the histogram of false negatives at aligned peaks from cancer group; the left panel of second row shows the false negative medians at aligned peaks (represented by log (m/z)) from healthy group, the right panel of second row shows the false negative medians at aligned peaks (represented by log (m/z)) from cancer group; the next two rows are for false positives; the bottom row of panels show (false positive rate [x-axis], false negative rate [y-axis]) at aligned peaks for healthy group and cancer group.)

Figure 12

Posterior Medians of Peak Sample Location Variation (The upper left panel is the histogram of peak sample location variation medians from healthy group, the upper right panel is the histogram of peak sample location variation medians from cancer group; the lower left panel shows peak sample location variation medians (healthy group) at aligned peaks represented by log (m/z), the lower right panel shows peak sample location variation medians (cancer group) at aligned peaks represented by log(m/z).)

Sample Classification

Effective sample classification is as important as biomarker profile estimation. Wu et al. (2003) compared a number of sample classification methods without cross-validation and Tibshirani et al. (2004) reached error rates no less than 35%. Since the peak number is biologically variable between healthy and cancer groups, the current equal-peak-number based classifications may not be very suitable. Without considering cross-validation, we simply calculate the two log-likelihood functions for each preprocessed spectrum given fitted model (healthy group or cancer group) from Section 4, where the estimated parameters are taken from posterior medians. The histograms of log-likelihoods of all spectra from both healthy group and cancer group along with the log-likelihood difference histograms are given in Figure 13. From the log-likelihood difference empirical distributions in the bottom panels of Figure 13, we simply take the proportions of negative values as type I error rates: 28.6% for testing: Healthy vs. Cancer and 10.8% for testing: Cancer vs. Healthy, which are overly optimistic. Denote ỹ = (y1, y2, …, y) as the spectrum peak sample location vector to be classified, s̃ = (sh1, sh2, …, shK) is the estimated true peak location vector for the healthy population, and s̃ = (sc1, sc2, …, sc is the estimated true peak location vector for the cancer population (usually K ≠ K). We propose a minimum L̄1 distance sample classification rule. For each sh (1 ≤ k ≤ K), we identify its nearest neighbor yn from ỹ and form a L1 distance summand |sh − yn|, the same rule applies to each sc (1 ≤ k ≤ K). To be conservative, we only use those deviations less than 0.003 (~half of the average peak interval) followed by arithmetic average over these selected estimated true peaks. The L̄1 distances between the new spectrum ỹ from healthy population and cancer population are

Figure 13

Hypothesis Test by Log-likelihood (The upper left panel is the log-likelihood histogram for healthy individuals with healthy group model, the upper right panel is the log-likelihood histogram for healthy individuals with cancer group model; the middle left panel is the log-likelihood histogram for cancer individuals with cancer group model, the middle right panel is the log-likelihood histogram for cancer individuals with healthy group model; the lower left panel is the log-likelihood difference histogram for “healthy individuals with healthy group model—healthy individuals with cancer group model”, the lower right panel is the log-likelihood difference histogram for “cancer individuals with cancer group model—cancer individuals with healthy group model”.)

and where K′ is the number of selected true peaks from s̃ based on 0.003 threshold, and, K′ is the number of selected true peaks from s̃, 1 is the indicator function of selection. The class leading to smaller L̄1 would be predicted. Our 10 fold cross-validation is implemented as follows: we equally divide both the healthy and the cancer groups into ten disjoint pairs of testing set (H, C), (i = 1, …, 10). For each of these pairs, we combine the complementary nine sets as the corresponding healthy and cancer group training sets (H′, C′) (i = 1, …, 10), i.e. H′ = ∪1 ≤ ≤ 10, ≠ H and C′ = ∪1 ≤ ≤ 10, ≠ C. The exact 10 fold cross-validation would be done for each of these testing pairs (H, C) from fitting (H′, C′), (i = 1, …, 10). The results are given in Table 3. The type I error rate for hypothesis test: H0: Healthy vs. H1: Cancer is around 38.97%, and the type I error rate for hypothesis test: H0: Cancer vs. H1: Healthy is around 27.96%, and the overall sample misclassification rate from 10 fold cross-validation is around 32.94%. Our strict 10-fold cross validation error rate is less than that of Tibshirani et al. (2004), which is actually not coming from a complete cross validation by observing that all spectra took part in the initial clustering. Comparing Table 3 and Figure 13, we conclude that, overlapping training and testing sets may favor individual spectrum with a bonus around 10% in terms of classification error rate. The conclusion drawn by Yasui et al. (2003) also holds here: an appreciable proportion of healthy samples may be incorrectly classified as cancer.

Table 3

Sample Classification Error Rates from 10 fold Cross-validation (The error rates represent the proportions of the healthy or cancer patients in the testing set which are misclassified.)

	Health group		Cancer group
Cross-Validation	Identified peak number (Training set)	Error rate (Testing set)	Identified peak number (Training set)	Error rate (Testing set)
1	256	3/8	261	1/10
2	264	5/8	257	3/10
3	269	2/8	258	4/10
4	260	3/8	254	4/10
5	263	4/8	257	0/10
6	265	2/8	263	5/10
7	263	3/8	252	4/10
8	265	3/8	264	3/10
9	262	2/8	270	1/10
10	264	3/5	252	1/3

Overall		30/77		26/93

Discussion

In this article, we take a global viewpoint to avoid multiple edge effects under piecewise processing and incorporate flexible biomarker numbers to make our Bayesian model more accountable. The Jacobian term derivation, intensive death rate calculation, or lengthy recursive partition required by RJMCMC, CTMCMC and others in the literature may impede convenient application of Bayesian algorithm to change point identification (see the RJMCMC computational procedure in the appendix for example). For multiple change point identification problems where each segment has the same set of regulating parameters, we can see that, the superiority of RGPMCMC algorithm over other available algorithms is the most computational efficiency and simplicity by minor local adjustment of likelihood function and prior set armed with “naively informative” global treatment as introduced by the parameter sampling process in Section 2.2. Its competing computational performance has been demonstrated in this article by intensive comparison with others. Moreover, RGPMCMC can be easily modified to apply to multiple change point identification in circular domain (Liu et al. 2006) and others. Although our mass charge ratio (m/z) model already leads to promising sample classification, how to make use of relative intensity information beyond m/z value is a more challenging statistical problem, since peak samples in close proximity with disparate intensities are less likely to belong to the same putative true peak. Under the assumption of reproducibility and homogeneity of mass spectra, this algorithm is designed to be applied to each phenotype group separately (disease and control) at this point, leading to likely different peak location vectors for different phenotypes. Wu et al. (2006) observed that, a protein subset with considerable size, say 20~40 m/z features, may pose as signature between phenotypes, thus separate global statistical models are still desirable. Any peak sample detection protocol will cause inevitable peak sample location variation, false negative and false positive peak samples, which is obviously subject to statistical modeling. From algorithmic aspects, Green (1995) emphasized the importance of proposing parameter efficiently. Because the independent proposal from joint prior distribution of (K, θ) is not very efficient, our proposal works on the joint infinitesimal space to achieve more efficiency by a more fair birth/death move. Although the RGPMCMC does not need intensive tuning parameter optimization, running MCMC properly is never a simple automatic task, since from simulation study we find that, highly informative prior specification consistent with the truth is desirable, for which a solution could be a local small scale study out of the whole spectra picture. The mass spectra’s quality and characteristics vary greatly depending on the platform, e.g. MALDI-TOF or SELDI-TOF, and certain experimental settings used for the measurements. This is not our concern here since it is not difficult to apply this global profile estimation algorithm to those spectra coming from the same source and enjoying high reproducibility and homogeneity. We anticipate that, the RGPMCMC developed in this article will shed light on a broad class of Bayesian multiple change point identification problems, not only MS data analysis. Lastly we emphasize that, diverse alignment problems arise from complicated scenarios in modern bioinformatics research. Beyond this m/z based mass spectra peak alignment which greatly benefits from Green (1995)’s seminal paper, Green and Mardia (2006) recently developed a novel Bayesian approach for simultaneous inference about the matching and the transformation between two protein 2D-gel images, and aligning active sites of proteins in three dimensions.

18 in total

1. Bayesian detection of clusters and discontinuities in disease maps.

Authors: L Knorr-Held; G Rasser
Journal: Biometrics Date: 2000-03 Impact factor: 2.571

2. Parametric time warping.

Authors: Paul H C Eilers
Journal: Anal Chem Date: 2004-01-15 Impact factor: 6.986

3. Standardization and denoising algorithms for mass spectra to classify whole-organism bacterial specimens.

Authors: Glen A Satten; Somnath Datta; Hercules Moura; Adrian R Woolfitt; Maria da G Carvalho; George M Carlone; Barun K De; Antonis Pavlopoulos; John R Barr
Journal: Bioinformatics Date: 2004-06-24 Impact factor: 6.937

4. Detecting and aligning peaks in mass spectrometry data with applications to MALDI.

Authors: Weichuan Yu; Baolin Wu; Ning Lin; Kathy Stone; Kenneth Williams; Hongyu Zhao
Journal: Comput Biol Chem Date: 2005-11-17 Impact factor: 2.877

5. MALDI-MS data analysis for disease biomarker discovery.

Authors: Weichuan Yu; Baolin Wu; Junfeng Liu; Xiaoye Li; Kathy Stone; Kenneth R Williams; Hongyu Zhao
Journal: Methods Mol Biol Date: 2006

6. Improved peak detection and quantification of mass spectrometry data acquired from surface-enhanced laser desorption and ionization by denoising spectra with the undecimated discrete wavelet transform.

Authors: Kevin R Coombes; Spiridon Tsavachidis; Jeffrey S Morris; Keith A Baggerly; Mien-Chie Hung; Henry M Kuerer
Journal: Proteomics Date: 2005-11 Impact factor: 3.984

7. Direct analysis of protein complexes using mass spectrometry.

Authors: A J Link; J Eng; D M Schieltz; E Carmack; G J Mize; D R Morris; B M Garvik; J R Yates
Journal: Nat Biotechnol Date: 1999-07 Impact factor: 54.908

8. Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data.

Authors: Baolin Wu; Tom Abbott; David Fishman; Walter McMurray; Gil Mor; Kathryn Stone; David Ward; Kenneth Williams; Hongyu Zhao
Journal: Bioinformatics Date: 2003-09-01 Impact factor: 6.937

9. A data-analytic strategy for protein biomarker discovery: profiling of high-dimensional proteomic data for cancer detection.

Authors: Yutaka Yasui; Margaret Pepe; Mary Lou Thompson; Bao-Ling Adam; George L Wright; Yinsheng Qu; John D Potter; Marcy Winget; Mark Thornquist; Ziding Feng
Journal: Biostatistics Date: 2003-07 Impact factor: 5.899

10. Ovarian cancer classification based on mass spectrometry analysis of sera.

Authors: Baolin Wu; Tom Abbott; David Fishman; Walter McMurray; Gil Mor; Kathryn Stone; David Ward; Kenneth Williams; Hongyu Zhao
Journal: Cancer Inform Date: 2007-02-17