Literature DB >> 31119032

A Bioconductor workflow for the Bayesian analysis of spatial proteomics.

Oliver M Crook^1,2, Lisa M Breckels¹, Kathryn S Lilley¹, Paul D W Kirk², Laurent Gatto³.

Abstract

Knowledge of the subcellular location of a protein gives valuable insight into its function. The field of spatial proteomics has become increasingly popular due to improved multiplexing capabilities in high-throughput mass spectrometry, which have made it possible to systematically localise thousands of proteins per experiment. In parallel with these experimental advances, improved methods for analysing spatial proteomics data have also been developed. In this workflow, we demonstrate using `pRoloc` for the Bayesian analysis of spatial proteomics data. We detail the software infrastructure and then provide step-by-step guidance of the analysis, including setting up a pipeline, assessing convergence, and interpreting downstream results. In several places we provide additional details on Bayesian analysis to provide users with a holistic view of Bayesian analysis for spatial proteomics data.

Entities: Chemical

Keywords: Bayesian; Bioconductor; machine learning; pRoloc; pRolocdata; proteomics; software; spatial proteomics

Mesh：

Year: 2019 PMID： 31119032 PMCID： PMC6509962 DOI： 10.12688/f1000research.18636.1

Source DB: PubMed Journal: F1000Res ISSN： 2046-1402

Introduction

Determining the the spatial subcellular distribution of proteins enables novel insight into protein function [1]. Many proteins function within a single location within the cell; however, it is estimated that up to half of the proteome is thought to reside in multiple locations, with some of these undergoing dynamic relocalisation [2]. These phenomena lead to variability and uncertainty in robustly assigning proteins to a unique localisation. Functional compartmentalisation of proteins allows the cell to control biomolecular pathways and biochemical processes within the cell. Therefore, proteins with multiple localisations may have multiple functional roles [3]. Machine learning algorithms that fail to quantify uncertainty are unable to draw deeper insight into understanding cell biology from mass spectrometry (MS)-based spatial proteomics experiments. Hence, quantifying uncertainty allows us to make rigorous assessments of protein subcellular localisation and multi-localisation. For proteins to carry out their functional role they must be localised to the correct subcellular compartment, ensuring the biochemical conditions for desired molecular interactions are met [4]. Many pathologies, including cancer and obesity are characterised by protein mis-localisations [5– 14]. High-throughput spatial proteomics technologies have seen rapid improvement over the last decade and now a single experiment can provide spatial information on thousands of proteins at once [15– 18]. As a result of these spatial proteomics technologies many biological systems have been characterised [2, 15, 17, 19– 21]. The popularity of such methods is now evident with many new studies in recent years [17, 22– 29]. Bayesian approaches to machine learning and statistics can provide more insight, by providing uncertainty quantification [30]. In a parametric Bayesian setting, a parametric model is proposed, along with a statement about our prior beliefs of the model parameters. Bayes’ theorem tells us how to update the prior distribution of the parameters to obtain the posterior distribution of the parameters after observing the data. It is the posterior distribution which quantifies the uncertainty in the parameters. This contrasts from a maximum-likelihood approach where we obtain only a point estimate of the parameters. Adopting a Bayesian framework for data analysis, though of much interest to experimentalists, can be challenging. Once we have specified a probabilistic model, computational approaches are typically used to obtain the posterior distribution upon observation of the data. These algorithms can have parameters that require tuning and a variety of settings, hindering their practical use by those not familiar with Bayesian methodology. Even once the algorithms have been correctly set-up, assessments of convergence and guidance on how to interpret the results are often sparse. This workflow presents a Bayesian analysis of spatial proteomics to elucidate the process for practitioners. Our workflow also provides a template for others interested in designing tools for the biological community which rely on Bayesian inference. Our model for the data is the t-augmented Gaussian mixture (TAGM) model proposed in 1. Crook et al. [1] provide a detailed description of the model, rigorous comparisons and testing on many spatial proteomics datasets, including a case study in which a hyperLOPIT experiment is performed on mouse pluripotent stem cells [17, 31]. Revisiting these details is not the purpose of this computational protocol; rather we present how to correctly use the software and provide step-by-step guidance for interpreting the results. In brief, the TAGM model posits that each annotated sub-cellular niche can be modelled using a Gaussian distribution. Thus the full complement of proteins within the cell is captured as a mixture of Gaussians. The highly dynamic nature of the cell means that many proteins are not well captured by any of these multivariate Gaussian distributions, and thus the model also includes an outlier component, which is mathematically described as a multivariate student’s t distribution. The heavy tails of the t distribution allow it to better capture dispersed proteins. There are two approaches to perform inference in the TAGM model. The first, which we refer to as TAGM MAP, allows us to obtain maximum a posteriori estimates of posterior localisation probabilities; that is, the modal posterior probability that a protein localises to that class. This approach uses the expectation-maximisation (EM) algorithm to perform inference [32]. Whilst this is a interpretable summary of the TAGM model, it only provides point estimates. For a richer analysis, we also present a Markov-chain Monte-Carlo (MCMC) method to perform fully Bayesian inference in our model, allowing us to obtain full posterior localisation distributions. This method is referred to as TAGM MCMC throughout the text. This workflow begins with a brief review of some of the basic features of mass spectrometry-based spatial proteomics data, including our state-of-the-art computational infrastructure and bespoke software suite. We then present each method in turn, detailing how to obtain high quality results. We provide an extended discussion of the TAGM MCMC method to highlight some of the challenges that may arise when applying this method. This includes how to assess convergence of MCMC methods, as well as methods for manipulating the output. We then take the processed output and explain how to interpret the results, as well as providing some tools for visualisation. We conclude with some remarks and directions for the future. Source code for this workflow, including code used to generate tables and figures, is available on GitHub [33]

Getting started and infrastructure

In this workflow, we are using version 1.23.2 of pRoloc [34]. The package pRoloc contains algorithms and methods for analysing spatial proteomics data, building on the MSnSet structure provided in MSnbase. The pRolocdata package provides many annotated datasets from a variety of species and experimental procedures. The following code chunks install and load the suite of packages require for the analysis. We assume that we have a MS-based spatial proteomics dataset contained in a MSnSet structure. For information on how to import data, perform basic data processing, quality control, supervised machine learning and transfer learning we refer the reader to 35. Here, we start by loading a spatial proteomics dataset on mouse E14TG2a embryonic stem cells [36]. The LOPIT protocol [15, 37] was used and the normalised intensity of proteins from eight iTRAQ 8-plex labelled fraction are provided. The methods provided here are independent of labelling procedure, fractionation process or workflow. Examples of valid experimental protocols are LOPIT [37], hyperLOPIT [17, 31], label-free methods such as PCP [16], and when fractionation is perform by differential centrifugation [18, 38]. In the code chunk below, we load the aforementioned dataset. The printout demonstrates that this experiment quantified 2031 proteins over 8 fractions. In Figure 1, we can visualise the mouse stem cell dataset use the plot2D function. We observe that some of the organelle classes overlap and this is a typical feature of biological datasets. Thus, it is vital to perform uncertainty quantification when analysing biological data.

Figure 1.

First two principal components of mouse stem cell data.

Methods: TAGM MAP

Introduction to TAGM MAP

We can use maximum a posteriori (MAP) estimation to perform Bayesian parameter estimation for our model. The maximum a posteriori estimate is the mode of the posterior distribution and can be used to provide a point estimate summary of the posterior localisation probabilities. In contrast to TAGM MCMC (see later), it does not provide samples from the posterior distribution, however it allows for faster inference by using an extended version of the expectation-maximisation (EM) algorithm. The EM algorithm iterates between an expectation step and a maximisation step. This allows us to find parameters which maximise the logarithm of the posterior, in the presence of latent (unobserved) variables. The EM algorithm is guaranteed to converge to a local mode. The code chunk below executes the tagmMapTrain function for a default of 100 iterations. We use the default priors for simplicity and convenience, however they can be changed, which we explain in a later section. The output is an object of class MAPParams, that captures the details of the TAGM MAP model.

Aside: collinearity

The previous code chunk outputs a message concerning data collinearity. This is because the covariance matrix of the data has become ill-conditioned and as a result the inversion of this matrix becomes unstable with floating point arithmetic. This can lead to the failure of standard matrix algorithms upon which our method depends. In this case, it is standard practice to add a small multiple of the identity to stabilise this matrix. The printed message is a statement that this operation has been performed for these data.

Model visualisation

The results of the modelling can be visualised with the plotEllipse function on Figure 2. The outer ellipse contains 99% of the total probability whilst the middle and inner ellipses contain 95% and 90% of the probability respectively. The centres of the clusters are represented by black circumpunct (circled dot). We can also plot the model in other principal components. The code chunk below plots the probability ellipses along the first and second, as well as the fourth principal component. The user can change the components visualised by altering the dims argument.

Figure 2.

PCA plot with probability ellipses along PC 1 and 2 (left) and PC 1 and 4 (right).

The expectation-maximisation algorithm

The EM algorithm is iterative; that is, the algorithm iterates between an expectation step and a maximisation step until the value of the log-posterior does not change [32]. This fact can be used to assess the convergence of the EM algorithm. The value of the log-posterior at each iteration can be accessed with the logPosteriors function on the MAPParams object. The code chuck below plots the log posterior at each iteration and we see on Figure 3 the algorithm rapidly plateaus and so we have achieved convergence. If convergence has not been reached during this time, we suggest to increase the number of iterations by changing the parameter numIter in the tagmMapTrain method. In practice, it is not unexpected to observe small fluctuations due to numerical errors and this should not concern users.

Figure 3.

Log-posterior at each iteration of the EM algorithm demonstrating convergence.

The code chuck below uses the mappars object generated above, along with the E14RG2aR dataset, to classify the proteins of unknown localisation using tagmPredict function. The results of running tagmPredict are appended to the fData columns of the MSnSet. The new feature variables that are generated are: tagm.map.allocation: the TAGM MAP predictions for the most probable protein sub-cellular allocation. tagm.map.probability: the posterior probability for the protein sub-cellular allocations. tagm.map.outlier: the posterior probability for that protein to belong to the outlier component rather than any annotated component. We can visualise the results by scaling the pointer according the posterior localisation probabilities. To do this we extract the MAP localisation probabilities from the feature columns of the the MSnSet and pass these to the plot2D function ( Figure 4).

Figure 4.

TAGM MAP allocations, where the pointer is scaled according to the localisation probability and coloured according to the most probable subcellular niche.

The TAGM MAP method is easy to use and it is simple to check convergence, however it is limited in that it can only provide point estimates of the posterior localisation distributions. To obtain the full posterior distributions and therefore a rich analysis of the data, we use Markov-Chain Monte-Carlo methods. In our particular case, we use a collapsed Gibbs sampler [39].

Methods: TAGM MCMC a brief overview

The TAGM MCMC method allows a fully Bayesian analysis of spatial proteomics datasets. It employs a collapsed Gibbs sampler to sample from the posterior distribution of localisation probablities, providing a rich analysis of the data. This section demonstrates the advantage of taking a Bayesian approach and the biological information that can be extracted from this analysis. For those unfamiliar with Bayesian methodology, some of the key ideas for a more complete understanding are as follows. Firstly, MCMC based inference contrasts with MAP based inference in that it samples from the posterior distribution of localisation probabilities. Hence, we do not just have a single estimate for each quantity but a distribution of estimates. MCMC methods are a large class of algorithms used to sample from a probability distribution, in our case the posterior distribution of the parameters [40]. Once we have sampled from the posterior distribution, we can estimate the mean of the posterior distribution by simply taking the mean of the samples. In a similar fashion, we can obtain estimates of other summaries of the posterior distribution. A schematic of MCMC sampling is provided in Figure 5 to aid understanding. Proteins, coloured blue, are visualised along two variables of the data. Probability ellipses representing contours of a probability distribution matching the distribution of the proteins are overlaid. We now wish to obtain samples from this distribution. The MCMC algorithm is initialised with a starting location, then at each iteration a new value is proposed. These proposed values are either accepted or rejected (according to a carefully computed acceptance probability) and over many iterations the algorithm converges and produces samples from the desired distribution. Samples from the mean of this distribution are coloured in red in the schematic figure. A large portion of the earlier samples may not reflect the true distribution, because the MCMC sampler has yet to converge. These early samples are usually discarded and this is referred to as burn-in. The next state of the algorithm depends on its current state and this leads to auto-correlation in the samples. To suppress this auto-correlation, we only retain every r sample. This is known as thinning. The details of burn-in and thinning are further explained in later sections.

Figure 5.

A schematic figure of MCMC sampling.

A schematic figure of MCMC sampling.

Proteins are coloured in blue and probability ellipses are overlaid representing contours of a probability distribution matching the distribution of the proteins. MCMC samples from the mean of this distribution are then coloured in red. The TAGM MCMC method is computationally intensive and requires at least modest processing power. Leaving the MCMC algorithm to run overnight on a modern desktop is usually sufficient, however this, of course, depends on the particular dataset being analysed. For guidance: it should not be expected that the analysis will finish in just a couple of hours on a medium specification laptop, for example. To demonstrate the class structure and expected outputs of the TAGM MCMC method, we run a brief analysis on a subset (400 randomly chosen proteins) of the tan2009r1 dataset from the pRolocdata, purely for illustration. This is to provide a bare bones analysis of these data without being held back by computational requirements. We perform a complete demonstration and provide precise details of the analysis of the stem cell dataset considered above in the next section. The first step is to run a few MCMC chains (below we use only 2 chains) for a few iterations (we specify 3 iterations in the below code, but typically we would suggest in the order of tens of thousands; see for example the algorithms default settings by typing ?tagmMcmcTrain) using the tagmMcmcTrain function. This function will generate a object of class MCMCParams. Information for each MCMC chain is contained within the chains slot. If needed, this information can be accessed manually. The function tagmMcmcProcess processes the MCMCParams object and populates the summary slot. The summary slot has now been populated to include basic summaries of the MCMC chains, such as organelle allocations and localisation probabilities. Protein information can be appended to the feature columns of the MSnSet by using the tagmPredict function, which extracts the required information from the summary slot of the MCMCParams object. We can now access new variables: tagm.mcmc.allocation: the TAGM MCMC prediction for the most likely protein sub-cellular annotation. tagm.mcmc.probability: the mean posterior probability for the protein sub-cellular allocations. We can also access other useful summaries of the MCMC methods: tagm.mcmc.outlier the posterior probability for the protein to belong to the outlier component. tagm.mcmc.probability.lowerquantile and tagm.mcmc.probability.upperquantile are the lower and upper boundaries to the equi-tailed 95% credible interval of tagm.mcmc.probability. tagm.mcmc.mean.shannon a Monte-Carlo averaged Shannon entropy, which is a measure of uncertainty in the allocations.

Methods: TAGM MCMC the details

This section explains how to manually manipulate the MCMC output of the TAGM model. In the code chunk below, we load a pre-computed TAGM MCMC model. The data file e14tagm.rda is available online [1] and is not directly loaded into this package due to its size. The file itself if around 500mb, which is too large to directly load into a package. The following code, which is not evaluated dynamically, was used to produce the tagmE14 MCMCParams object. We run the MCMC algorithm for 20,000 iterations with 10,000 iterations discarded for burn-in. We then thin the chain by 20. We ran 6 chains in parallel and so we obtain 500 samples for each of the 6 chains, totalling 3,000 samples. The resulting file is assumed to be in our working directory. Manually inspecting the object, we see that it is a MCMCParams object with 6 chains.

Data exploration and convergence diagnostics

Assessing whether or not an MCMC algorithm has converged is challenging. Assessing and diagnosing convergence is an active area of research and throughout the 1990s many approaches were proposed [41– 44]. We provide a more detailed exploration of this issue, but readers should bare in mind that the methods provided below are diagnostics and cannot guarantee convergence. We direct readers to several important works in the literature discussing the assessment of convergence. Users that do not assess convergence and base their downstream analysis on unconverged chains are likely to obtain poor quality results. We first assess convergence using a parallel chains approach. We find producing multiple chains is benificial not only for computational advantages but also for analysis of convergence of our chains. The following code chunks set up a manual convergence diagnostic check. We make use of objects and methods in the package to perform this analysis [45]. Our function below automatically coerces our objects into for ease of analysis. We first calculate the total number of outliers at each iteration of each chain and, if the algorithm has converged, this number should be the same (or very similar) across all 6 chains. We can observe this from the trace plots and histograms for each MCMC chain ( Figure 6). Unconverged chains should be discarded from downstream analysis.

Figure 6.

Trace (left) and density (right) of the 6 MCMC chains.

Chains 3, 5 and 6 are centred around an average of 153, with rapid back and forth oscillations. Chain 2 should be immediately discarded, since it has a large jump in the chain with clearly skewed histogram. The other two chains oscillate differently with contrasting quantiles to the 3 chains (3, 5 and 6) that agree with one another, suggesting these chains have yet to converge. We can use the package to produce summaries of our chains. Here is the coda summary for the third chain.

Applying the Gelman diagnostic

So far, our analysis appears promising. Three of our chains are centred around an average of 153 outliers and there is no observed monotonicity in our output. However, for a more rigorous and unbiased analysis of convergence we can calculate the Gelman diagnostic using the package [42, 44]. This statistic is often referred to as or the potential scale reduction factor. The idea of the Gelman diagnostics is to compare the inter and intra chain variances. The ratio of these quantities should be close to one. A more detailed and in depth discussion can be found in the references. The package also reports the 95% upper confidence interval of the statistic. In this case, our samples are approximately normally distributed (see histograms on the right in Figure 6). The package allows for transformations to improve normality of the data, and in some cases we set the transform argument to apply log transformation. Gelman and Rubin [42] suggest that chains with value of less than 1.2 are likely to have converged. In all cases, we see that the Gelman diagnostic for convergence is < 1.2. However, the upper confidence interval is 1.32 when all chains are used; 1.31 when chain 2 is removed and when chains 1, 2 and 4 are removed the upper confidence interval is 1.01 indicating that the MCMC algorithm for chains 3,5 and 6 might have converged. We can also look at the Gelman diagnostics statistics for groups or pairs of chains. The first line below computes the Gelman diagnostic across the first three chains, whereas the second calculates the diagnostic between chain 3 and chain 5. To assess another summary statistic, we can look at the mean component allocation at each iteration of the MCMC algorithm and as before we produce trace plots of this quantity ( Figure 7).

Figure 7.

Trace (left) and density (right) of the mean component allocation of the 6 MCMC chains.

As before we can produce summaries of the data. We can already observe that there are some slight difference between these chains which raises suspicion that some of the chains may not have converged. For example each chain appears to be centred around 5.7, but chains 2 and 4 have clear jumps in the their trace plots. For a more quantitative analysis, we again apply the Gelman diagnostics to these summaries. The above values are close to 1 and so we there are no significant difference between the chains. As observed previously, chains 2 and 4 look quite different from the other chains and so we recalculate the diagnostic excluding these chains. The computed Gelman diagnostic below suggest that chains 3, 5 and 6 have converged and that we should discard chains 1, 2 and 4 from further analysis. For a further check, we can look at the mean outlier probability at each iteration of the MCMC algorithm and again computing the Gelman diagnostics between chains 4, 5 and 6. An statistics of 1 is indicative of convergence, since it is less than the recommend value of 1.2.

Applying the Geweke diagnostic

Along with the Gelman diagnostic, which uses parallel chains, we can also apply a single chain analysis using the Geweke diagnostic [41]. The Geweke diagnostic tests to see whether the mean calculated from the first 10% of iterations is significantly different from the mean calculated from the last 50% of iterations. If they are significantly different, at say a level 0.01, then this is evidence that particular chains have not converged. The following code chunk calculates the Geweke diagnostic for each chain on the summarising quantities we have previously computed. The first test suggests chain 2 has not converged, since the p-value is less than 10 −10 suggesting that the mean in the first 10% of iterations is significantly different from those in the final 50%. Moreover, the second test and third tests also suggest that chain 2 has not converged. Furthermore, for the second test chain 4 has a marginally small p-value, providing further evidence that this chain is of low quality. These convergence diagnostics are not limited to the quantities we have computed here and further diagnostics can be performed on any summary of the data. An important question to consider is whether removing an early portion of the chain might lead to an improvement of the convergence diagnostics. This might be particularly relevant if a chain converges some iterations after our orginally specified burn-in. For example, let us take the second Geweke test above, which suggested chains 2 and 4 had not converged and see if discarding the initial 10% of the chain improves the statistic. The function below removes 50 samples, known as burn-in, from the beginning of each chain and the output shows that we now have 450 samples in each chain. In practice, as 2 chains are sufficient for good posterior estimates and convergence we could simply discard chains 2 and 4 and proceed with downstream analysis with the remaining chains. The following function recomputes the number of outliers in each chain at each iteration of each Markov-chain. The code chuck below computes the Geweke diagnostic for this new truncated chain and demonstrates that chain 4 has an improved Geweke diagnostic, whilst chain 2 does not. Thus, in practice, it maybe useful to remove iterations from the beginning of the chain. However, as chain 4 did not pass the Gelman diagnostics we still discard it from downstream analysis.

Processing converged chains

Having made an assessment of convergence, we decide to discard chains 1,2 and 4 from any further analysis. The code chunk below removes these chains and creates a new object to store the converged chains. The MCMCParams object can be large and therefore if we have a large number of samples we may want to subsample our chain, known as thinning, to reduce the number of samples. Thinning also has another purpose. We may desire independent samples from our posterior distribution but the MCMC algorithm produces autocorrelated samples. Thinning can be applied to reduce the auto-correlation between samples. The code chuck below, which is not evaluated, demonstrates retaining every 5 iteration. Recall that we thinned by 20 when we first ran the MCMC algorithm. We initially ran 6 chains and, after having made an assessment of convergence, we decided to discard 3 of the chains. We desire to make inference using samples from all 3 chains, since this leads to better posterior estimates. In their current class structure all the chains are stored separately, so the following function pools all sample for all chains together to make a single longer chain with all samplers. Pooling a mixture of converged and unconverged chains is likely to lead to poor quality results so should be done with care. To populate the summary slot of the converged and pooled chain, we can use the tagmMcmcProcess function. As we can see from the object below a summary is now available. The information now available in the summary slot was detailed in the previous section. We note that if there is more than 1 chain in the MCMCParams object then the chains are automatically pooled to compute the summaries. To create new feature columns in the MSnSet and append the summary information, we apply the tagmPredict function. The probJoint argument indicates whether or not to add probabilistic information for all organelles for all proteins, rather than just the information for the most probable organelle. The outlier probabilities are also returned by default, but users can change this using the probOutlier argument.

Aside: Priors

Bayesian analysis requires users to specify prior information about the parameters. This may appear to be a challenging task; however, good default options are often possible. Should expert information be available for any of these priors then the users should provide this, otherwise we have found that the default choices work well in practice. The priors also provide regularisation and shrinkage to avoid overfitting. Given enough data the likelihood overwhelms the prior and the influence of the prior is weak. We place a normal inverse-Wishart prior on the parameters of the mutivariate normal mixture components. The normal inverse-Wishart prior has 4 hyperparameters that must be specified. These are: the prior mean mu0 expressing the prior location of each organelle; a prior shrinkage lambda0, which is a scalar expressing uncertainty in the prior mean; the prior degrees of freedom nu0; and a scale prior S0 on the covariance. Together, nu0 and S0 specify the prior variability on organelle covariances. The same prior distribution is assumed for the parameters of all mutivariate normal mixture components. The default options for these are based on the choice recommended by [46]. The prior mean mu0 is set to be the mean of the data. lambda0 is set to be 0.01 meaning some uncertainty in the covariance is propagated to the mean, increasing lambda0 increases shrinkage towards the prior. nu0 is set to the number of feature variables plus 2, which is the smallest integer value that ensures a finite covariance matrix. The prior scale matrix S0 is set to and represents a diffuse prior on the covariance. Another good choice which is often used is a constant multiple of the identity matrix. The prior for the Dirichlet distribution concentration parameters beta0 is set to 1 for each organelle. Another reasonable choice would be the non-informative Jeffery’s prior for the Dirichlet hyperparameter, which sets beta0 to 0.5 for each organelle. The prior weight for the outlier detection class is a ℬ ( u, v) distribution. The default for u = 2 and the default for v = 10. This represents the reasonable belief that proteins a priori might be an outlier and we believe is unlikely that more than 50% of proteins are outliers. Decreasing the value of v, represents more uncertainty about the number of protein that are outliers.

Analysis, visualisation and interpretation of results

Now that we have a single pooled chain of samples from a converged MCMC algorithm, we can begin to analyse the results. Preliminary analysis includes visualising the allocated organelle and localisation probability of each protein to its most probable organelle, as shown on Figure 8.

Figure 8.

TAGM MCMC allocations. On the left, point size have been scaled based on allocation probabilities. On the right, the point size have been scaled based on the global uncertainty using the mean Shannon entropy.

We can visualise other summaries of the data including a Monte-Carlo averaged Shannon entropy, as shown in Figure 8 on the right. This is a measure of uncertainty and proteins with greater Shannon entropy have more uncertainty in their localisation. We observe global patterns of uncertainty, particularly in areas where organelle boundaries overlap. There are also regions of low uncertainty indicating little doubt about the localisation of these proteins. We are also interested in the relationship between localisation probability to the most probable class and the Shannon entropy ( Figure 9). Even though the two quantities are evidently correlated there is still considerable spread. Thus it is important to base inference not only on localisation probability but also a measure of uncertainty, for example the Shannon entropy. Proteins with low Shannon entropy have low uncertainty in their localisation, whilst those with higher Shannon entropy have uncertain localisation. Since multi-localised protein have uncertain localisation to a single subcellular niche, exploring the Shannon can aid in identifying multi-localised proteins.

Figure 9.

Shannon entropy and localisation probability.

Aside from global visualisation of the data, we can also interrogate each individual protein. As illustrated on Figure 10, we can obtain the full posterior distribution of localisation probabilities for each protein from the e14Tagm_converged_pooled object. We can use the plot generic on the MCMCParams object to obtain a violin plot of the localisation distribution. Simply providing the name of the protein in the second argument produces the plot for that protein. The solute carrier transporter protein E9QMX3, also referred to as Slc15a1, is most probably localised to plasma membrane in line with its role as a transmembrane transporter but also shows some uncertainty, potentially also localising to other comparments. The first violin plot visualises this uncertainty. The protein Q3V1Z5 is a supposed constitute of the 40S ribosome and has poor UniProt annotation with evidence only at the transcript level. From the plot below is is clear that Q3V1Z5 is a ribosomal associated protein, but it previous localisation has only been computational inferred and here we provide experimental evidence of a ribosomal annotation. Thus, quantifying uncertainty recovers important additional annotations.

Figure 10.

Full posterior distribution of localisation probabilities for individual proteins.

Discussion

The Bayesian analysis of biological data is of clear interest to many because of its ability to provide richer information about the experimental results. A fully Bayesian analysis differs from other machine learning approaches, since it can quantify the uncertainty in our inferences. Furthermore, we use a generative model to explicitly describe the data, which makes inferences more interpretable compared to the less interpretable outputs of black-box classifiers such as, for example, support vector machines (SVM). Bayesian analysis is often characterised by its provision of a (posterior) probability distribution over the biological parameters of interest, as opposed to single point estimate of these parameters. In the case that is presented in this workflow, a Bayesian analysis “computes” a posterior probability distribution over the protein localisation probabilities. These probability distributions can then be rigorously interrogated for greater biological insight; in addition, it may allow us to ask additional questions about the data, such as whether a protein might be multi-localised. Despite the wealth of information a Bayesian analysis can provide, the uptake amongst cell biologists is still low. This is because a Bayesian analysis presents a new set of challenges and little practical guidance exists regarding how to address these challenges. Bayesian analyses often rely on computatinally intensive approaches such as Markov-chain Monte-Carlo (MCMC) and a practical understanding of these algorithms and the interpretation of their output is a key barrier to their use. A Bayesian analysis usually consists of three broad steps: (1) Data pre-processing and algorithmic implementation, (2) assessing algorithmic convergence and (3) summarising and visualising the results. This workflow provides a set of tools to simplify these steps and provides step-by-step guidance in the context of the analysis of spatial proteomics data. We have provided a workflow for the Bayesian analysis of spatial proteomics using the pRoloc and MSnbase software. We have demonstrated, in a step-by-step fashion, the challenges and advantages associated with taking a Bayesian approach to data analysis. We hope this workflow will help spatial proteomics practitioners to apply our methods and will motivate others to create detailed documentation for the Bayesian analysis of biological data.

Session information

Below, we provide a summary of all packages and versions used to generate this document. The source of this document, including the code necessary to reproduce the analyses and figures is available in a public manuscript repository on GitHub [47].

Data availability

The data used in this workflow was first published in Breckels (2016) [36] and is available in the pRolocdata package.

Software availability

Computational workflow for this study available from: https://github.com/ococrook/TAGMworkflow [47] Archived source code at time of publication: https://doi.org/10.5281/zenodo.2593712 [33] License: CC BY 4.0 The subcellular distribution of the proteome is a very important determinant of cell function and therefore the ability to analyse subcellular protein localisation is vital for studying cell biology and regulation. Consequently, technical developments that facilitate the analysis of how the proteome is distributed between cell compartments are of great value to both the proteomics and molecular cell biology research communities. Bioconductor is also a fantastic source of tools that can be used to analyse ‘omics’ datasets, hence further additions to this toolbox are always welcome and of value. Here the authors provide a detailed workflow in Bioconductor that enables the convenient Bayesian analysis of spatial proteomics data. The explanations of the models provided by the authors are user friendly and allow also non-bioinformaticians to understand clearly the mechanics of the different methods. The suggestions provided relating to default parameters for the models are also very welcome. Overall, I view this as a timely and very useful contribution for the community that will be well received and of genuine value. I note below a few specific points that the authors should address. Specific Points: I was initially unable to load either pRoloc or pRolocdata on R and therefore could not follow any of the examples provided. I had to upgrade to a new version of R to resolve this issue. The authors should specify the R version compatibility required. Line 1: Remove one ‘the’ 3rd page 5 paragraph, second sentence is disconnected. Example dataset uses iTRAQ, which means the data will have false positives all across The authors should explain/justify their assertion that their Method is independent of the isotope labelling method because TMT and iTRAQ are expected to provide different challenges? It would be helpful for the Gelman diagnostic to suggest minimum a number of chains for MCMC - Suggestion for an ideal number of chains for a medium range desktop computer would be useful too “Trace for Chain 2 & 4 have clear jumps” This is for figure 7 - This sounds imprecise; hard to spot and replicate. Also, I do not see the “clear jumps” the authors refer to – can this be clarified? Suggesting default priors for the Bayesian analysis is very useful. However, it would be good to show the visual consequences of changing S0 from beta0 of 1 to beta0 of 0.5 The suggestion by the authors that it is likely that 16.7% of proteins are outliers needs to be justified. The use of Shannon entropy to focus on proteins with multiple localisations is an interesting approach that could be discussed further. We have read this submission. We believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard. This paper provides an important advance in the study of spatial proteomics. Through the use of Monte Carlo sampling the authors are able to provide evaluations of uncertainty by using the T-augmented Gaussian mixture model proposed by Crook et al. The authors have been careful to provide all the code and the relevant data are included in the package 'pRolocdata'. Although the focus is on the spatial localisation of proteins, the authors only study the probabilities of correct localizations and do not show a spatial plot of the probabilities so one does not see clearly how the probabilities vary spatially. First Issue: PCA plots a) Ratios: There are a few corrections needed to the figures that show PCA plots. It is probably not worth showing the percentages of variance up to two decimal digits, however it is important to respect the axes relative scale and units. For instance, since the ratio of variances in Figure 1 is 40:25 the plot should be rectangular, this is more egregious in Figures 2, especially the right side panel. I suggest making the figures on top of each other so they are not narrow (For a reference see Chapter 7 of the book [1] that explains why usually PCA plots should be rectangular and not square). b) Construction of ellipses on the PCA plots in Figure 2 is very confusing. These cannot be the posterior distributions and their construction presupposes there is a relevant multivariate normal distribution which I do not think is justified here. Second Issue: Bayesian computations a) Priors In the "Aside" subsection on priors, I suggest exploring the possibility of using prior data to construct a prior as in practical situations this is often how pragmatic Bayesians think about building priors. I think it is worth supporting the statement that the posterior is overwhelmed by the data with a small example. This could be done following the type of process recommended by Betancourt who gives very good examples of the effect of priors through the use of posterior predictive distributions (see here https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html#24_model_adequacy ). b) Convergence diagnostics The use of trace plots and Gelman and Rubin's Rhat can be quite confusing and can only show non-convergence, the current version puts too much emphasis on a few short runs. The authors have shown some chains that have not converged but do not mention that the contemporary literature is much more careful of how small the Rhat should be, see the preprint by Vats,D and Knudson, C 2018 [2] and the discussions by Dan Simpson and Andrew Gelman for instance (https://statmodeling.stat.columbia.edu/2019/03/19/maybe-its-time-to-let-the-old-ways-die-or-we-broke-r-hat-so-now-we-have-to-fix-it/), the preprint is on the arXiv [3]. Third Issue: Visualization of results: Figure 8 does not show the uncertainty in an intuitive way. It is very hard to differentiate the size of the dots on the left hand plot and the dots give the impression of a circular posterior distribution which is incorrect. The reference Ren et al, 2017 [4] contains several examples of posterior confidence contours that communicate uncertainty in a more precise way which enables the comparisons of uncertainty across different locations in the PCA. It would also be illuminating if the authors could make a plot with uncertainty contours as a function of the underlying spatial coordinates. I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above. The spatial proteomics field has seen increased popularity over the past few years through development of experimental, statistical, and computational methodologies. In this Method Article, Crook OM and colleagues present a bioinformatics workflow for the analysis of spatial proteomics data using a set of Bayesian analysis tools. This work is a useful guide for biologists that wish to properly apply and diagnose a Markov-chain Monte-Carlo (MCMC) inference procedure using the pRoloc package. The article is timely and relevant given the increasing number of biologists acquiring programming skills yet lacking extensive expertise in this area of statistics and modeling. In particular, the authors did a good job explaining basic terminology and rationale for the diagnostics necessary during MCMC inference. The methodology is clearly explained, all the code is provided, and the packages and datasets are available through Bioconductor, making them easily accessible and reproducible. However, I believe this article lacks in background information and details for those readers that are not experts in the spatial proteomics field. In addition, the value of applying the more resource-intensive MCMC method compared to the expectation-maximisation (EM) algorithm is not clear from the data presented. I recommend this article for indexing given that the following issues are resolved. Major comments Minor comments Currently, there is no direct comparison of the results provided by the EM and MCMC algorithms. Given that MCMC is computationally intensive and requires additional diagnostics, the reader needs to be convinced that there is a tangible benefit from using the MCMC approach. A discussion of scenarios in which the EM algorithm would be preferable over MCMC would be useful for the reader. For example, can MCMC fail to converge? If so, what kind of interpretation can be obtained by looking at the point-estimates from the EM algorithm instead? Additional background information and references would be useful, as this article seems to assume that the reader is familiar with spatial proteomics data sets. For example, the author can explain the organelle MS-proteomics data sets, the use of organelle markers to train the model, and the prediction of unannotated proteins from the model fitted on the organelle markers. In addition, I would suggest pointing the reader to the review by Gatto L et al. 2014 [1] in MCP, or another similar review as a primer. Page 20 – The authors claim that lower Shannon entropy is an indication of low uncertainty in localization and therefore can aid in identifying multi-localized proteins. This claim should be backed up by some examples or evidence from the literature that agree with the data presented. Figure 1 – Additional details should be provided in the figure legend. Is this plot only showing organelle markers? A short description of the outlier component and how the analyst can interpret this component would be particularly helpful. Page 15 – It is not clear why the Gelman diagnostic performed on mean allocation of all chains helps discriminate chains 1, 2, and 4. The upper C.I. is already near to 1 (1.01) for this example. I observed several small writing mistakes; the article should be proof-read before publication. Some examples are indicated below. Page 15 – sentence ending with “… Gelman diagnostics between chains 4, 5, 6”. However, the code shows that the diagnostic was performed on chains 3, 5, 6. Page 21 – “From the plot below is is…” should read “it is”. I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

37 in total

Review 1. Nuclear transport and cancer: from mechanism to intervention.

Authors: Tweeny R Kau; Jeffrey C Way; Pamela A Silver
Journal: Nat Rev Cancer Date: 2004-02 Impact factor: 60.716

2. The organelle proteome of the DT40 lymphocyte cell line.

Authors: Stephanie L Hall; Svenja Hester; Julian L Griffin; Kathryn S Lilley; Antony P Jackson
Journal: Mol Cell Proteomics Date: 2009-01-30 Impact factor: 5.911

Review 3. The many functions of mRNA localization during normal development and disease: from pillar to post.

Authors: Neal A L Cody; Carole Iampietro; Eric Lécuyer
Journal: Wiley Interdiscip Rev Dev Biol Date: 2013-03-18 Impact factor: 5.814

4. Using hyperLOPIT to perform high-resolution mapping of the spatial proteome.

Authors: Claire M Mulvey; Lisa M Breckels; Aikaterini Geladaki; Nina Kočevar Britovšek; Daniel J H Nightingale; Andy Christoforou; Mohamed Elzek; Michael J Deery; Laurent Gatto; Kathryn S Lilley
Journal: Nat Protoc Date: 2017-05-04 Impact factor: 13.491

5. Bayesian Nonparametric Ordination for the Analysis of Microbial Communities.

Authors: Boyu Ren; Sergio Bacallado; Stefano Favaro; Susan Holmes; Lorenzo Trippa
Journal: J Am Stat Assoc Date: 2017-02-28 Impact factor: 5.033

Review 6. When intracellular logistics fails--genetic defects in membrane trafficking.

Authors: Vesa M Olkkonen; Elina Ikonen
Journal: J Cell Sci Date: 2006-12-15 Impact factor: 5.285

7. Learning from Heterogeneous Data Sources: An Application in Spatial Proteomics.

Authors: Lisa M Breckels; Sean B Holden; David Wojnar; Claire M Mulvey; Andy Christoforou; Arnoud Groen; Matthew W B Trotter; Oliver Kohlbacher; Kathryn S Lilley; Laurent Gatto
Journal: PLoS Comput Biol Date: 2016-05-13 Impact factor: 4.475

8. Subcellular localization of MC4R with ADCY3 at neuronal primary cilia underlies a common pathway for genetic predisposition to obesity.

Authors: Jacqueline E Siljee; Yi Wang; Adelaide A Bernard; Baran A Ersoy; Sumei Zhang; Aaron Marley; Mark Von Zastrow; Jeremy F Reiter; Christian Vaisse
Journal: Nat Genet Date: 2018-01-08 Impact factor: 38.330

9. Role of the AP-5 adaptor protein complex in late endosome-to-Golgi retrieval.

Authors: Jennifer Hirst; Daniel N Itzhak; Robin Antrobus; Georg H H Borner; Margaret S Robinson
Journal: PLoS Biol Date: 2018-01-30 Impact factor: 8.029

10. A Bayesian mixture modelling approach for spatial proteomics.

Authors: Oliver M Crook; Claire M Mulvey; Paul D W Kirk; Kathryn S Lilley; Laurent Gatto
Journal: PLoS Comput Biol Date: 2018-11-27 Impact factor: 4.475

7 in total

Review 1. Challenges and Opportunities for Bayesian Statistics in Proteomics.

Authors: Oliver M Crook; Chun-Wa Chung; Charlotte M Deane
Journal: J Proteome Res Date: 2022-03-08 Impact factor: 4.466

2. Genetic disruption of WASHC4 drives endo-lysosomal dysfunction and cognitive-movement impairments in mice and humans.

Authors: Jamie L Courtland; Tyler Wa Bradshaw; Greg Waitt; Erik J Soderblom; Tricia Ho; Anna Rajab; Ricardo Vancini; Il Hwan Kim; Scott H Soderling
Journal: Elife Date: 2021-03-22 Impact factor: 8.713

Review 3. Organellar Maps Through Proteomic Profiling - A Conceptual Guide.

Authors: Georg H H Borner
Journal: Mol Cell Proteomics Date: 2020-04-28 Impact factor: 5.911

4. Spatial proteomics defines the content of trafficking vesicles captured by golgin tethers.

Authors: John J H Shin; Oliver M Crook; Alicia C Borgeaud; Jérôme Cattin-Ortolá; Sew Y Peak-Chew; Lisa M Breckels; Alison K Gillingham; Jessica Chadwick; Kathryn S Lilley; Sean Munro
Journal: Nat Commun Date: 2020-11-25 Impact factor: 14.919

5. A semi-supervised Bayesian approach for simultaneous protein sub-cellular localisation assignment and novelty detection.

Authors: Oliver M Crook; Aikaterini Geladaki; Daniel J H Nightingale; Owen L Vennard; Kathryn S Lilley; Laurent Gatto; Paul D W Kirk
Journal: PLoS Comput Biol Date: 2020-11-09 Impact factor: 4.475

6. Inferring differential subcellular localisation in comparative spatial proteomics using BANDLE.

Authors: Oliver M Crook; Colin T R Davies; Lisa M Breckels; Josie A Christopher; Laurent Gatto; Paul D W Kirk; Kathryn S Lilley
Journal: Nat Commun Date: 2022-10-10 Impact factor: 17.694

7. A Comprehensive Subcellular Atlas of the Toxoplasma Proteome via hyperLOPIT Provides Spatial Context for Protein Functions.

Authors: Konstantin Barylyuk; Ludek Koreny; Huiling Ke; Simon Butterworth; Oliver M Crook; Imen Lassadi; Vipul Gupta; Eelco Tromer; Tobias Mourier; Tim J Stevens; Lisa M Breckels; Arnab Pain; Kathryn S Lilley; Ross F Waller
Journal: Cell Host Microbe Date: 2020-10-13 Impact factor: 21.023

7 in total