Literature DB >> 30079225

A Bioconductor workflow for processing and analysing spatial proteomics data.

Lisa M Breckels^1,2, Claire M Mulvey², Kathryn S Lilley², Laurent Gatto^1,2.

Abstract

Spatial proteomics is the systematic study of protein sub-cellular localisation. In this workflow, we describe the analysis of a typical quantitative mass spectrometry-based spatial proteomics experiment using the MSnbase and pRoloc Bioconductor package suite. To walk the user through the computational pipeline, we use a recently published experiment predicting protein sub-cellular localisation in pluripotent embryonic mouse stem cells. We describe the software infrastructure at hand, importing and processing data, quality control, sub-cellular marker definition, visualisation and interactive exploration. We then demonstrate the application and interpretation of statistical learning methods, including novelty detection using semi-supervised learning, classification, clustering and transfer learning and conclude the pipeline with data export. The workflow is aimed at beginners who are familiar with proteomics in general and spatial proteomics in particular.

Entities: Chemical

Keywords: Bioconductor; R Package; machine learning; mass spectromery; protein sub-cellular localisation; proteomics; spatial proteomics; transfer learning

Year: 2016 PMID： 30079225 PMCID： PMC6053703 DOI： 10.12688/f1000research.10411.2

Source DB: PubMed Journal: F1000Res ISSN： 2046-1402

Introduction

Quantitative mass spectrometry-based spatial proteomics involves elaborate, expensive and time consuming experimental protocols and considerable effort is invested in the generation of such data. Multiple research groups have described a variety of approaches to establish high quality proteome-wide datasets (see for example 1 for a review, and 2– 6 for recent examples). However, data analysis is as critical as data production for reliable and insightful biological interpretation. Here, we walk the reader through a typical pipeline for the analysis of such data using several Bioconductor [7] packages for the R statistical programming environment. The main package to analyse protein localisation data is , which offers a set of dedicated functions for the analysis of such data. itself relies on to manipulate and process quantitative proteomics data. Many other packages are used by for clustering, classification and visualisation. Support for interactive visualisation is offered by the package. In this workflow, we will describe how to prepare the spatial proteomics data starting from a spreadsheet containing quantitative mass spectrometry data, through to some essential data processing steps, and finish with different applications of machine learning ( Figure 1). We focus on a recent pluripotent mouse embryonic stem cells experiment [2]. These data, as well as additional annotated and pre-formatted datasets from various species are readily available in the package.

Figure 1.

Schematic overview of the pRoloc pipeline from data import, through to data processing, machine learning and data export.

Installation of Bioconductor packages is documented in detail on the Bioconductor installation help page. Below, we show how to install the four main packages used in this workflow: This procedure is also applicable to any packages, from CRAN as well as GitHub. Once a package has been installed, it needs to be loaded for its functionality to become available in the R session; this is done with the library function e.g. to load the package one would type library(" pRoloc") after installation. If you have questions about this workflow in particular, or about other Bioconductor packages in general, they are best asked on the Bioconductor support site following the posting guidelines. Questions can be tagged with specific package names or keywords. For more general information about mass spectrometry and proteomics, the readers are invited to read the package vignettes and associated papers [8, 9].

Reading and processing spatial proteomics data

The use-case: predicting sub-cellular localisation in pluripotent embryonic mouse stem cells

As a use-case, we analyse a recent high-throughput spatial proteomics dataset from pluripotent mouse embryonic stem cells (E14TG2a) [2]. The data was generated using hyperplexed LOPIT (hyperLOPIT), a state-of-the-art method relying on improved sub-cellular fractionation and more accurate quantitation, leading to more reliable classification of protein localisation across the whole sub-cellular space. The method uses an elaborate sub-cellular fractionation scheme, enabled by the use of Tandem Mass Tag (TMT) [10] 10-plex and application of the MS data acquisition technique named synchronous precursor selection MS 3 (SPS-MS 3) [11], for TMT quantification with high accuracy and precision. Three biological replicates were generated from the E14TG2a experiment, the first was to target low density fractions and the second and third were to emphasis separation of the denser organelles. The intersect of replicates 1 and 2 was treated as a 20-plex dataset for the analysis. As discussed in the publication [2], it has been shown that combining replicates from different gradients can increase spatial resolution [12]. The combination of replicates resulted in 5032 proteins common to both experiments. These, as well as many other data are directly available as properly structured and annotated datasets from the experiment package. In this workflow, we will start with a description of how to generate these ad hoc data objects starting from an arbitrary spreadsheet, as produced by many popular third-party applications. While we focus here on a LOPIT-type dataset, these analyses are relevent for any quantitative spatial proteomics data, irrespective of the fractionation (i.e. density gradient or differential centrifugation [3]) or quantitation (i.e. labelled or label-free) methods.

The infrastructure: and packages

To make use of the full functionality of the software, users need to import their data into R and prepare them as an MSnSet. The MSnSet is a dedicated data structure for the efficient manipulation and processing of mass spectrometry and proteomics data in R. Figure 2 illustrates a simplified view of the MSnSet structure; there exists 3 key sub-parts (termed slots) to such a data object: (1) the exprs (short for expression data) slot for storing the quantitation data, (2) the fData slot (short for feature-metadata) for storing the feature meta-data, and finally (3) the pData slot (short for pheno-metadata, i.e. sample phenotypic data) for storing the sample meta-data.

Figure 2.

Simplified representation of the MSnSet data structure (reproduced with permission from the vignette).

Feature metadata typically contains general annotation about the proteins (accession numbers, description, …), information related to the identification search (confidence scores, number of peptides, …) as well as annotation about known sub-cellular location (see in particular the Markers section) and results from data analysis. The sample metadata would, for example, record what stable isotope labels were used for the respective fraction (when labelled quantitation is used), replicate number, fraction number along the gradient and pooling information. Another slot of interest is processingData, that logs the processing MSnSet objects undergo. The processing log can be accessed with the processingData function and is displayed under Processing information in the textual object summary when an MSnSet’s name is typed in the R console (see example below).

Importing data

There are a number of ways to import quantitation data and create an MSnSet instance. All methods are described in the input/output capabilities vignette. The simplest method is to use the function readMSnSet2. The function takes a single spreadsheet file name as input and extracts the columns containing the quantitation data, as identified by the argument ecol, to create the expression data, while the other columns in the spreadsheet are appended to the feature meta-data slot. By example, in the code chunk below we read in the csv spreadsheet containing the quantitation data from the intersect of replicates 1 and 2 of the mouse map [2], using the readMSnSet2 function. The data is as available online with the manuscript (see tab 2 of the xlsx supplementary data set 1 in 2, which should be exported as a text-based spreadsheet). It is also available as a csv in the Bioconductor data package, which we make use of below. To use the readMSnSet2 function, as a minimum one must specify the file path to the data and which columns of the spreadsheet contain quantitation data. In the code chunk below, we start by identifying the file that we want to use. The system.file function is used to return the path to the extdata directory from the package, which is where our file of interest resides. We then use the dir function to list the content of that directory and store the path that matches the file name of interest in the csvfile variable. Note that these two lines are only needed here to locate a file in a package; in a more general use case, the user would define the csvfile variable containing the file name of interest directly. A common pitfall here is to provide only the file name, rather than full path to the file (which is what is shown below with basename; we don’t print the full path, as it will vary from computer to computer). Note that only specifying the name of the file is sufficient when it exists in the working directory (i.e. the directory in which R is running, which can be queried and changed with the getwd and setwd functions respectively). Note that the file is compressed (as indicated by the gz, for gzip, extension), and will be decompressed on-the-fly when read into R. Next, we need to identify which columns in the spreadsheet contain the quantitation data. This can be done using the getEcols function inside R. The spreadsheet deposited by the authors contains two headers, with the second header containing information about where the quantitation data is stored ( Figure 3).

Figure 3.

A screenshot of the data in the spreadsheet.

We can display the names of the second header by calling the getEcols function with the argument n = 2 (the default value is n = 1), to specify that we wish to display the column names of the second line. We also specify the name of the spreadsheet file (defined as csvfile above) and the separator that splits cells. It is now easy for one to identify that the quantitation data, corresponding to the 10 TMT isobaric tags, is located in columns 8 to 27. We now have the two mandatory arguments to readMSnSet2, namely the file name (stored in the csvfile variable) and the quantitation column indices. In addition to these, it is also possible to pass the optional argument fnames to indicate which column to use as the labels by which to identify each protein in the sample. Here, we use fnames = 1 to use the UniProt identifiers contained in the first (unnamed) column of the spreadsheet. We also need to specify to skip the first line of the file (for the same reason that we used n = 2 in getEcols above) to read the csv data and convert it to an MSnSet object, named hl (for hyperLOPIT). Below, we display a short summary of the data. The data contains 5032 proteins/features common across the 2 biological replicates for the respective 2 × 10-plex reporter tags (20 columns or samples), along with associated feature meta-data such as protein markers, protein description, number of quantified peptides etc (see below). Below, we examine the quantitative information along the whole gradient for first 5 proteins. It is also possible to access specific rows and columns by naming the proteins and TMT tag channels of interest. The feature meta-data is stored in the fData slot and can be accessed by fData(hl). When using readMSnSet2 everything that is not defined as quantitation data by ecol is deposited to the fData slot. We see the fData contains 25 columns describing information such as the number of peptides, associated markers, machine learning results etc. To identify the feature variable names we can use the function fvarLabels. We see that the first 6 feature variable names contain non-discriminatory label names, so we relabel them to help us identify what feature data information is stored in the associated columns. Note that when using the simple readMSnSet2 procedure, the pData slot which is used to store information about the samples/channels is kept empty. As illustrated below, one can use the $ operator to access (or create) individual columns in the metadata slot. It is advised to annotate the channels as well. Below, we annotate the replicate from which the profiles originate and the TMT tag (extracted from the sample/channel names). To do so, we use the sample names that were assigned automatically using the quantiation column names and remove leading X and trailing .1 using the sub function. Throughout this workflow we refer to the different columns that are found in the exprs (expression data) slot as channels (short for TMT channels). In the frame of LOPIT and hyperLOPIT these channels constitute the relative abundance of each protein (along the rows) in the channel of interest. Each TMT channel originates from fractions collected from the density gradient, or a set of pooled fractions or may be a sample originating from an alternative preparation e.g. such as from the chromatin enrichment performed in Christoforou et al. [2] Information about which gradient fractions were used for which tag should also be stored in the sample meta-data pData slot. The sample meta-data that is distributed with the package for Christoforou’s hyperLOPIT experiment and (as above) the quantitation data file, are located in the extdata in the package on the hard drive. In the code chunk below we again use the dir function to locate the filepath to the meta-data csv file and then read it into R using read.csv. We then append the meta-data to the pData slot. Information about the gradient fractions used and the associated subcellular fraction densities in each replicate are stored here.

Data processing

There are two aspects related to data normalisation that are relevent to spatial proteomics data processing. The first one focuses on reducing purely technical variation between channels without affecting biological variability (i.e. the shape of the quantitatives profiles). This normalisation will depend on the underlying quantitative technology and the experimental design, and will not be addressed in this workflow. The second aspect, and more specific to spatial proteomics data, is scaling all the organelle-specific profiles into the same intensity interval (typically 0 and 1) by, for example, dividing each intensity by the sum of the intensities for that quantitative feature. This is not necessary in this example as the intensities for each replicate have already been re-scaled to 1 in Proteome Discoverer v1.4 Thermo Fisher. However, if the data require normalisation, the user can execute the normalise function as demonstrated in the below code chunk. This transformation of the data assures cancellation of the effect of the absolute intensities of the quantitative features along the rows, and focus subsequent analyses on the relative profiles along the sub-cellular channels. The same normalise function (or normalize, both spellings are supported) can also be applied in the first case described above. Different normalisation methods, such as mean or median scaling, variance stabilisation or quantile normalisation, to cite a few, can be applied to accomodate different needs (see ?normalise for available options). As previously mentioned, before combination, the two replicates in the hl data that we read into R were separately normalised by sum (i.e. to 1) across the 10 channels for each replicate respectively. We can verify this by summing each rows for each replicate: We see that some features do not add up exactly to 1 due to rounding errors after exporting to intermediate files. These small deviations do not bear any consequences here.

Combining acquisitions

The spreadsheet that was used to create the hl MSnSet included the two replicates within one .csv file. We also provide individual replicates in the package. Below, we show how to combine MSnSet objects and, subsequently, how to filter and handle missing values. We start by loading the package and the equivalent replicates using the data function. At the R prompt, typing will list the 75 datasets that are available in . Combining data is performed with the combine function. This function will inspect the feature and sample names to identify how to combine the data. As we want our replicates to be combined along the columns (same proteins, different sets of channels), we need to assure that the respective sample names differ so they can be identified from one another. The function updateSampleNames can be used do this. In addition to matching names, the content of the feature metadata for identical feature annotations must match exactly across the data to be combined. In particular for these data, we expect the same proteins in each replicate to be annotated with the same UniProt entry names and descriptions, but not with the same coverage of number of peptides or peptide-spectrum matches (PSMs). Below, we update the replicate specific feature variable names and remove the shared annotation. In the first line, we update only the feature variable names 3 to 5 (by appending a 1) of the first replicate and in the second line, we apply the updateFvarLabels function to update all feature variable names (by appending a 2) of the second replicate. In lines 3 and 4, we retain the first 5 feature variables for the first replicate and the relevant third to fifth variables for the second replicate. We can now combine the two experiments into a single MSnSet: More details about combining data are given in the dedicated Combining MSnSet instances section of the tutorial vignette.

Missing data

Missing data are a recurrent issue in mass spectrometry applications, and should be addressed independently of this workflow [13, 14]. In 15, we have described how a high content in missing values in spatial proteomics data and their inappropriate handling leads to a reduction of sub-cellular resolution. We can impute missing data using ’s impute function. The method underlying the imputation method is then determined by a methods parameter (see ?impute for available options). To impute missing values using nearest neighbour imputation, one would In our particular case, missing values are indicative of protein groups that were not acquired in both replicates ( Figure 4, produced with the image2 function).

Figure 4.

Heatmap of missing values.

Note that the features are re-ordered to highlight clusters of proteins with similar numbers of missing values.

Heatmap of missing values.

Note that the features are re-ordered to highlight clusters of proteins with similar numbers of missing values. We prefer to remove proteins that were not assayed in both replicated experiments. This is done with the filterNA function that removes features (proteins) that contain more than a certain proportion (default is 0) missing values. The Processing information section summarises how combining and filtering missing values (subsetting) changed the dimensions of the data. When more than 2 datasets are to be combined and too many proteins have not been consistently assayed, leading to too many proteins being filtered out, we suggest to implement an ensemble of classifiers voting on protein-sub-cellular niche membership over the output of several experiments (see section Supervised machine learning for the description of sub-cellular assignments).

Quality control

Data quality is routinely examined through visualisation to verify that sub-cellular niches have been separated along the gradient. Based on De Duve's principle [16] proteins that co-localise in a cell, exhibit similar quantitation profiles across the gradient fractions employed. One approach that has been widely used to visualise and inspect high throughput mass spectrometry-based proteomics data is principal components analysis (PCA). PCA is one of many dimensionality reduction methods, that allows one to effectively summarise multi-dimensional data in to 2 or 3 dimensions to enable visualisation. Very generally, the original continuous multi-dimensional data is transformed into a set of orthogonal components ordered according to the amount of variability that they describe. The plot2D and plot3D functions in allows one to plot the principal components (PCs) of a dataset against one another. By default, the first two components are plotted on the x- and y-axis for the plot2D function, and first three components are loaded for the plot3D function, respectively (the dims argument can be used to plot other PCs). If distinct clusters are observed, we assume that there is organellar separation present in the data. Although, representing the multi-dimensional data along a limited set of PCs does not give us a hard quantitative measure of separation, it is extremely useful summarising complex experimental information in one figure, to get a simplified overview of the data. In the code chunk below we produce a 2-dimensional PCA plot of the mouse stem cell dataset ( Figure 5). Each point on the plot represents one protein. We can indeed see several distinct protein clusters. We specify fcol = NULL to ignore feature metadata columns and not annotate any feature (protein) with a colour. We will see later how to use this argument to annotate the PCA plot with prior information about sub-cellular localisation.

Figure 5.

PCA plot of the mouse stem cell data hl.

PCA plot of the mouse stem cell data hl.

Each dot represents a single protein, and cluster of proteins represent proteins residing in the same sub-cellular niche. The figure on the right bins proteins and represent the bins density to highlight the presence of protein clusters. In the first instance we advise one to visualise their data without any annotation (i.e. with fcol = NULL), before proceeding with data annotation. The identification of well resolved clusters in the data, constitutes an unbiased assessment of the data structure, demonstrating the successful separation of sub-cellular clusters. It is also useful to visualise the relative intensities along the gradient to identify channels displaying particularly low yield. This can be done using the plotDist and boxplot functions, that plot the protein profiles occupancy along the gradient (we also display the mean channel intensities below) and a boxplot of the column intensities. In the two plots displayed on Figure 6, we re-order the TMT channels to pair corresponding channels in the two replicates (rather than ordering the channels by replicate).

Figure 6.

Protein profiles and distribution of channel intensities.

The red dots represent the mean relative intensity for each channel.

Protein profiles and distribution of channel intensities.

The red dots represent the mean relative intensity for each channel.

Markers

In the context of spatial proteomics, a marker protein is defined as a well-known resident of a specific sub-cellular niche in a species and condition of interest. Applying this to machine learning (ML), and specifically supervised learning for the task of protein localisation prediction, these markers constitute the labelled training data to use as input to a classification analyses. Defining well-known residents, and obtaining labelled training data for ML analyses can be time consuming, but it is important to define markers that are representative of the multivariate data space and on which a classifier will be trained and generated. provides a convenience function, addMarkers, to directly add markers to an MSnSet object, as demonstrated in the code chunk below. These marker sets can be accessed using the pRolocmarkers() function. Marker sets are stored as a simple named vector in R, and originate from in-house user-defined sets of markers or from previous published studies [15], which are continuosly updated and integrated. These markers can then be mapped to an MSnSet’s featureNames. The mouse dataset used here has Uniprot IDs stored as the featureNames (see head(featureNames(hl))) and the names of the vector of the mouse markers stored in ( mmus markers) are also Uniprot IDs (see head(mrk) in the code chunk below, that displays the 6 first markers), so it is straightforward to match names between the markers and the MSnSet instance using the addMarkers function. We recommend at least 13 markers per sub-cellular class (see the Optimisation section for details about the algorithmic motivation of this number). Markers should be chosen to confidently represent the distribution of genuine residents of a sub-cellular niche. We generally recommend a conservative approach in defining markers to avoid false assignments when assigning sub-cellular localisation of proteins of unknown localisation. A more relaxed definition of markers, i.e. one that broadly or over-confidently defines markers, risks the erroneous assignment of proteins to a single location, when, in reality, they reside in multiple locations (including the assumed unique location). One can not expect to identify exact boundaries between sub-cellular classes through marker annotation alone; the definition of these boundaries is better handled algorithmically, i.e. after application of the supervised learning algorithm, using the prediction scores (as described in the Classification section, in particular Figure 16).

Figure 16.

Classification results.

Colours indicate class membership and point size are representative of the classification confidence.

If the protein naming between the marker sets and the MSnSet dataset are different e.g. the markers are labelled by Uniprot accession numbers and the dataset entries are labelled by Uniprot entry names, one will have to convert and match the proteins according to the appropriate identifier. Sometimes, we find the equivalent entry name, Uniprot ID or accession number is stored in the feature metadata, which makes conversion between identifers relatively straightforward. If this is not the case however, conversion can be performed using , the Bioconductor annotation resources or any conversion softwares available online.

Adding user-defined marker lists

It is also possible for users to use their own marker list with the addMarkers function. The user needs to create a named vector of marker localisation, or a create a csv file with two columns (one for the protein names, one for the corresponding sub-cellular marker annotation) and pass the vector or file name respectively to the function. As previously mentioned, the protein names of these markers must match some (but not necessarily all) of the MSnSet’s feature names. See ?addMarkers for more details. In general, the Gene Ontology (GO) [17], and in particular the cellular compartment (CC) namespace are a good starting point for protein annotation and marker definition. It is important to note however that automatic retrieval of sub-cellular localisation information, from or elsewhere, is only the beginning in defining a marker set for downstream analyses. Expert curation is vital to check that any annotation added is in the correct context for the biological question under investigation.

Visualising markers

Having added the mouse markers to our fData from the pRolocmarkers, we can now visualise these annotations on the PCA plot using the plot2D function and then use the addLegend function to map the marker classes to the pre-defined colours. As previously mentioned, PCA transforms the original high dimensional data into a set of linearly uncorrelated principal components (PCs) such that the first accounts for as much variability in the data as possible and each succeeding component in turn has the highest variance possible under the constraint that it be orthogonal to the preceding components. We saw in the previous section how visualisation of the PCs is useful for quality control and checking organelle seperation. Adding marker definiton allows one to quickly see if known residents appear in defined clusters. One must be careful though as different organelles may be resolved in different dimensions. For example, we can display the data along the first and seventh PCs using the dims argument. Note that in these calls to the plot2D function, we have omitted the fcol argument and so by default the feature variable named " markers" is used to annotate the plot. We choose to display PCs 1 and 7 to illustrate that while upper principal components explain much less variability in the data (2.23% for PC7, as opposed to 48.41% for PC1), we see that the mitochondrial (purple) and peroxisome (dark blue) clusters can be differentiated, despite the apparent overlap in the visualisation of the two first PCs ( Figure 7).

Figure 7.

Annotated PCA plots of the hl dataset, as produced with plot2D.

This is further highlighted if we plot the profiles of these two clusters using the plotDist function ( Figure 8). The plotDist function is another useful visualisation that relies on marker annotation. It allows one to represent the protein profiles occupancy along the gradient. While the PCA plot enables efficient visualisation of the complete dataset and assessment the relative separation of different sub-cellular niches, comparing profiles of a few marker clusters is useful to assess how exactly they differ (in terms of peak channels, for example). On Figure 8, we plot the profiles of the mitochondrial and peroxisome markers to highlight the differences in the channels labelled with tag 129C, also represented above along the 7th PC on the PCA plot on Figure 7.

Figure 8.

Mitochondrion and peroxisome protein profiles.

The data can also be visualised along three PCs using the plot3D function ( Figure 9). When produced interactively, the plot can be rotated and zoomed using the mouse.

Figure 9.

Using the plot3D function to visualise the hl dataset along PCs 1, 2 and 7.

The default colours for plotting have been defined so as to enable the differentiation of up to 30 classes. If more are provided, different character symbols (circles, squares, ... and empty and solid symbols) are used. The colours and the default plotting characters (solid dots for the markers and empty circles for the features of unknown localisation) can of course be changed, as described in the setStockcol manual page. As demonstrated in 2 and illustrated in the PCA plot ( Figure 7), the Golgi apparatus proteins (dark brown) display a dynamic pattern, noting sets of Golgi marker proteins that are distributed amongst other subcellular structures, an observation supported by microscopy. As such, we are going to reset the annotation of Golgi markers to unknown using the fDataToUnknown function. It is often used to replace empty strings ( "") or missing values in the markers definition to a common definition of unknown localisation.

Features of interest

In addition to adding annotation using the addMarkers function, one can store specific sets of proteins by using the Features of interest infrastructure from the package. If users have specific subsets of proteins they wish to highlight in their data (possibly across multiple experiments) they would first create a FeaturesOfInterest object and then use the highlightOnPlot function to visualise these. For example, if we wanted to highlight proteins with the accession numbers Q8CG48, Q8CG47, Q8K2Z4, and Q8C156, which are some of the proteins known to form part of the 13S condensin complex, we would call the code displayed on Figure 10. Users can also create several sets of FeaturesOfInterest object and store them in a FoICollection.

Figure 10.

Highlighting protein features of interest.

It is also worthy of note that it is possible to search for a specific protein of interest by featureNames or using any identifying information found in the fData columns by using the search box on the pRolocVis application part of the package (see section on interactive visualisation). This can be handy for quickly searching and highlighting proteins on the fly, the disavanatge here is that proteins can only be searched for a one-by-one basis.

Replication

With the aim of maximising the sub-cellular resolution and, consequently, the reliability in protein sub-cellular assignments, we follow the advice in 12 and combine replicated spatial proteomics experiments as described above. Indeed, Trotter et al. have shown a significant improvement in protein–organelle association upon direct combination of single experiments, in particular when these resolve different subcellular niches. Direct comparisons of individual channels in replicated experiments do not provide an adequate, goal-driven assessment of different experiments. Indeed, due to the nature of the experiment and gradient fraction collection, the quantitative channels do not correspond to identical selected fractions along the gradient. For example, in Table 1 below (taken from hl’s pData) TMT channels 127C (among others) in both replicates originate from different sets of gradient fractions (gradient fractions 7–9 and 8–9 for each replicate, respectively). Different sets of gradient fractions are often pooled to obtain enough material and optimise acurate quantitation.

Table 1.

Differences in gradient fraction pooling.

	Replicate	Tag	Gradient.Fraction	Iodixonal.Density
X127C	1	127C	8 to 9 (pooled)	11.00
X127C.1	2	127C	7 to 9 (pooled)	10.00

The more relevent comparison unit is not a single channel, but rather the complete protein occupancy profiles, which are best visualised experiment-wide on a PCA plot. As such, we prefer to focus on the direct, qualitative comparison of individual replicate PCA plots ( Figure 11), assuring that each displays acceptable sub-cellular resolution. Note that in the code chunk below, we mirror the x-axis to represent the two figures with the same orientation. The interactive "compare" application part of the package is also useful for examining replicate experiments (see the next section interactive visualisation for details).

Figure 11.

PCA plots of replicates 1 and 2.

In addition, the reproducibility can be assessed by performing independent classification analyses on each replicate (see the section on Supervised machine learning below) and comparing the the results. Even when the gradient conditions different (for unexpected technical or voluntary reasons, to maximise resolution when combining experiments [12]), one expects agreement in the most confident organelle assignments.

Interactive visualisation

Visualisation and data exploration is an important aspect of data analyses allowing one to shed light on data structure and patterns of interest. Using the package, we can interactively visualise, explore and interrogate quantitative spatial proteomics data. The package relies on the shiny framework for reactivity and interactivity. It distributes 3 different GUI’s ( main (default), compare or classify) which are wrapped and launched by the pRolocVis function.

The main application

In the below code chunk we lauch the main app ( Figure 12) (note, we do not need to specify the argument, app = "main" as it is the default).

Figure 12.

A screen shot of clickable interface and zoomable PCA plot of the main app in the package.

As diplayed in the screenshot in Figure 12, the main application is designed for exploratory data analysis and is divided into 3 tabs: (1) PCA, (2) Profiles and (3) Table selection. The default view upon loading is the PCA tab, which features a clickable interface and zoomable PCA plot with an interactive data table for displaying the quantitation information. Particular proteins of interest can be highlighted using the text search box. There is also a Profiles tab for visualisation of the protein profiles, which can be used to examine the patterns of proteins of interest. The Table selection tab provides an interface to control data table column selection. A short animation https://github.com/lmsimp/bioc-pRoloc-hyperLOPIT-workflow/blob/master/Figures/pRolocVis_pca.gif illustrating the interface is available in the manuscript repository [18].

The compare application

The compare application ( Figure 13) is useful for examining two replicate experiments, or two experiments from different conditions, treatments etc. The compare application is called by default if the input object to pRolocVis is an MSnSetList of 2 MSnSets, but it can also be specified by calling the argument app = "compare". For example, in the code chunk below we first create an MSnSetList of replicates 1 and 2 of the hyperLOPIT data, this is then passed to pRolocVis.

Figure 13.

The compare application, main panel.

The comparison app loads the two PCA plots side-by-side. Only common proteins between the two data sets are displayed. As per the main application, proteins can be searched, identified and highlighted on both PCA plots and in the dedicated profiles tab. One key feature of the compare application is the ability to re-map the second dataset onto the PCA data space of the first (reference) data set (see ?pRolocVis and the argument remap = TRUE). Using the first dataset as the reference set, PCA is carried out on the first dataset and the standard deviations of the principal components (i.e. the square roots of the eigenvalues of the covariance/correlation matrix) and the matrix of variable loadings (i.e. a matrix whose columns contain the eigenvectors) are stored and then used to calculate the principal components of the second dataset. Both datasets are scaled and centered in the usual way. The first dataset appears on the left, and the second re-mapped data appears on the right. The order of the first (the reference data for remapping) and second dataset can be changed through regeneration/re-ordering of the MSnSetList object.

The classify application

The final application classify, has been designed to view machine learning classification results according to user-specified thresholds for the assignment of proteins to its sub-cellular location, as discussed later in the subsection Thresholding in the Supervised machine learning section.

Novelty detection

The extraction of sub-cellular protein clusters can be difficult owing to the limited number of marker proteins that exist in databases and elsewhere. Furthermore, given the vast complexity of the cell, automatic annotation retrieval does not always give a full representation of the true sub-cellular diversity in the data. For downstream analyses, such as supervised machine learning, it is desirable to obtain reliable markers that cover as many sub-cellular niches as possible, as these markers are directly used in the training phase of the ML classification. We find that a lack of sub-cellular diversity in the labelled training data leads to prediction errors, as unlabelled instances can only be assigned to a class that exists in the training data [19]. In such scenarios, novelty detection can be useful to identify data-specific sub-cellular groupings such as organelles and protein complexes. The phenotype discovery (phenoDisco) algorithm [19] is one such method and is available in . It is an iterative semi-supervised learning method that combines the classification of proteins on existing labelled data with the detection of new clusters. In addition to extracting new phenotypes, novelty detection methods are also useful for confirming the presence of known or postulated clusters in an unbiased fashion. For example in 2 the phenoDisco algorithm was used to confirm the data-specific presence of the nucleus and nucleus sub-compartments. In the code chunk below, we demonstrate how to do this analysis, highlighting some of the optional arguments and parameters available for phenotype extraction and give some advice on how to interpret the output. As the phenoDisco algorithm is semi-supervised it uses both labelled (markers) and unlabelled data to explore the data structure and find new sub-cellular data clusters. Thus the first step is to define some input labelled data i.e. markers, that the algorithm will use as input for the supervised learning aspect of the algorithm. As described in 2 we define a set of markers to use as input for the analyses that cover well-known residents from three distinct organelle structures; the mitochondria, plasma membrane and ER, and from three well-known and abundant protein complexes; the proteasome and two ribosomal subunits, 40S and 60S. These input markers are stored in the phenoDisco.Input featureData column of hl and below set by fcol = "phenoDisco.Input". We can use the convenience accessor function getMarkers to print out a table of the markers contained in this marker set. These initial markers were manually curated using information from the UniProt database, the Gene Ontology and the literature. In the code chunk below we show how to run the phenoDisco function and return a novelty detection result, according to the specified parameters. The algorithm parameters times (number of iterations) and GS (minimum number of proteins required to form a new phenotype) are passed to the function, along with the fcol to tell the algorithm where the input training data is contained. The above analysis is computationally intensive and best parallelised over multiple workers. This phenoDisco analysis took 24 hours to complete when parallelised over 40 workers. As such, in the interest of time users can access the above results which are pre-computed and stored along with the pRolocdata package. Please see the Appendix to load these results. The argument times indicates the number of times we run unsupervied Gaussian Mixture Modelling before defining a new phenotype cluster. The recommended minimum and default value is 100. In the above code chunk we increase the value to times = 200 as we have found for larger datasets (e.g. 5000+ proteins) a higher times is requried for convergence. GS defines the minimum number of proteins allowed per new data cluster and thus heavily influences what type of new clusters are extracted. For example, if a user is interested in the detection of small complexes they may wish to use a small GS = 10, or GS = 20 etc. If they wish to detect larger, more abundant sub-cellular niches a much higher GS would be preferable. Specifying a small GS can be more time consuming than using a larger GS, and there is a trade off between finding interesting small complexes and those that may not be of interest as we find there is a tendancy to find more noise when using a small GS compared to using a higher one. One may also consider increasing the search space for new data clusters by increasing the value of the parameter G. This defines the number of GMM components to test and fit; the default is G = 1:9 (the default value in the package [20]). One should note that the decreasing the GS, and increasing the values of the arguments times, G (among other function arguments, see ?phenoDisco) will heavily influence (increase) the total time taken to run the algorithm. phenoDisco supports parallelisation and we strongly suggest you make use of a parallel processing to run these analyses. The ouput of running the phenoDisco algorithm is an MSnSet containing the new data clusters, appended to the featureData under the name pd. The results can be displayed by using the getMarkers function. We see that 5 new phenotype data clusters were found. We can plot the results using the plot2D function ( Figure 14).

Figure 14.

Results of the novelty detection algorithm.

The five new phenotype data clusters can be extracted and examined. In the code chunk below we write the results to a .csv file using the write.exprs function. We use the argument fDataCols to specify which columns of the featureData to write. We can also examine each phenotype interactively and visualise their protein profiles by using the pRolocVis function in the package. We found that phenotype 1 was enriched in nucleus associated proteins, phenotype 2 in chromatin associated proteins, phenotype 3 in cytosolic and phenotypes 4 and 5 in both lysosomal and endosomal proteins.

Supervised machine learning

Supervised machine learning, also known as classification, is an essential tool for the assignment of proteins to distinct sub-cellular niches. Using a set of labelled training examples i.e. markers, we can train a machine learning classifier to learn a mapping between the data i.e. the quantitative protein profiles, and a known localisation. The trained classifier can then be used to predict the localisation of a protein of unknown localisation, based on its observed protein profiles. To date, this method has been extensively used in spatial quantitative proteomics to assign thousands of proteins to distinct sub-cellular niches [2, 12, 21– 24]. There are several classification algorithms readily available in , which are documented in the dedicated machine learning techniques vignette. We find the general tendency to be that it is not the choice of classifier, but the improper optimisation of the algorithmic parameters, that limits classification accuracy. Before employing any classification algorithm and generating a model on the training data, one must first find the optimal parameters for the algorithm of choice.

Optimisation

In the code chunk below we use a Support Vector Machine (SVM) to learn a classifier on the labelled training data. As previously mentioned, one first needs to train the classifier’s parameters before an algorithm can be used to predict the class labels of the proteins with unknown location. One of the most common ways to optimise the parameters of a classifier is to partition the labelled data into training and testing subsets. In this framework parameters are tested via a grid search using cross-validation on the training partition. The best parameters chosen from the cross-validation stage are then used to build a classifier to predict the class labels of the protein profiles on the test partition. Observed and expected classication results can be compared, and then used to assess how well a given model works by getting an estimate of the classifiers ability to achieve a good generalisation i.e. that is given an unknown example predict its class label with high accuracy. In , algorithmic performance is estimated using stratified 80/20 partitioning for the training/testing subsets respectively, in conjuction with five-fold cross-validation in order to optimise the free parameters via a grid search. This procedure is usually repeated 100 times and then the best parameter(s) are selected upon investigation of classifier accuracy. We recommend a minimum of 13 markers per sub-cellular class for stratified 80/20 partitioning and 5-fold cross-validation; this allows a minimum of 10 examples for parameter optimisation on the training partition i.e. 2 per fold for 5-fold cross-validation, and then 3 for testing the best parameters on the validation set. Classifier accuracy is estimated using the macro F1 score, i.e. the harmonic mean of precision and recall. In the code chunk below we demonstrate how to optimise the free parameters, sigma (the inverse kernel width for the radial basis kernel) and cost (the cost of constraints violation), of a classical SVM classifier with a Gaussian kernel using the function svmOptimisation. As the number of labelled instances per class varies from organelle to organelle, we can account for class imbalance by setting specific class weights when generating the SVM model. Below the weights, w are set to be inversely proportional to the class frequencies. In the code chunk below we then pass these weights to the svmOptimisation function. Once again, we provide the optimisation results for users to load directly if they wish to save computational time (see the Appendix for details). As mentioned previously, we rely on the default feature variable "markers" to define the class labels and hence do not need to specify it in the above code chunk. To use another feature variables, one needs to explicitly specify its name using the fcol argument (for example fcol = "markers2"). The output params is an object of class GenRegRes; a dedicated container for the storage of the design and results from a machine learning optimisation. To assess classifier performance we can examine the macro F1 scores and the most frequently chosen parameters. A high macro F1 score indicates that the marker proteins in the test dataset are consistently and correctly assigned by the the algorithm. Often more than one parameter or set of parameters gives rise to the best generalisation accuracy. As such it is always important to investigate the model parameters and critically assess the best choice. The f1Count function counts the number of parameter occurences above a certain F1 value. The best choice may not be as simple as the parameter set that gives rise to the highest macro F1 score. One must be careful to avoid overfitting, and choose parameters that frequently provide high classification accuracy. Below, we see that only a sigma of 0.1 produces macro F1 scores above 0.6, but that a cost of 16 is much more frequently chosen than lower values. The parameter optimistion results can also be visualised as a boxplot or heatmap, as shown in Figure 15. The plot method for GenRegRes object shows the respective distributions of the 100 macro F1 scores for the best cost/sigma parameter pairs, and levelPlot shows the averaged macro F1 scores, for the full range of parameter values. These figures also indicate that values of 0.1 and 16 for sigma and cost consistently deliver better classification scores.

Figure 15.

Assessment of the classification model parameter optimisation.

By using the function getParams we can extract the best set of parameters. Currently, getParams retrieves the first best set automatically, but users are encouraged to critically assess whether this is the most wise choice (which it is, as demonstrated above). Once we have selected the best parameters we can then use them to build a classifier from the labelled marker proteins.

Classification

We can use the function svmClassification to return a classification result for all unlabelled instances in the dataset corresponding to their most likely sub-cellular location. The algorithm parameters are passed to the function, along with the class weights. As above, the fcol argument does not need to be specified as we use the labels defined in the default "markers" feature variable. In the code chunk above, we pass the whole params parameter results and, internally, the first pair that return the highest F1 score are returned (using the getParams function above). It is advised to always check that these are actually good parameters and, if necessary, set them explicitly, as shown below. Automatically, the output of the above classification, the organelle predictions and assignment scores, are stored in the featureData slot of the MSnSet. In this case, they are given the labels svm and svm.scores for the predictions and scores respectively. The resultant predictions can be visualised using plot2D. In the code chunk below plot2D is called to generate a PCA plot of the data and fcol is used to specify where the new assignments are located e.g. fcol = "svm". Additionally, when calling plot2D we can use the cex argument to change the size of each point on the plot to be proportional to the SVM score ( Figure 16). This gives an initial overview of the high scoring localisations from the SVM predictions.

Classification results.

Colours indicate class membership and point size are representative of the classification confidence. The adjustment of the point size intuitively confers important information that is more difficult to define formally (that we will address in the next section). The classifier (SVM in our case, but this is also valid of other classifiers) defines boundaries based on the labelled marker proteins. These class/organelle boundaries define how non-assigned proteins are classified and with what confidence.

Thresholding

It is common when applying a supervised classification algorithm to set a specific score cutoff on which to define new assignments, below which classifications are kept unknown/unassigned. This is important as in a supervised learning setup, proteins can only be predicted to be localised to one of the sub-cellular niches that appear in the labelled training data. We can not guarantee (and do not expect) that the whole sub-cellular diversity is represented in the labelled training data as (1) finding markers that represent the whole diversity of the cell is challenging (especially obtaining dual- and multiply-localised protein markers) and (2) many sub-cellular niches contain too few proteins to train on (see above for a motivation of a minimum of 13 markers). Deciding on a threshold is not trivial as classifier scores are heavily dependent upon the classifier used and different sub-cellular niches can exhibit different score distributions, as highlighted in the boxplot below. We recommend users to set class-specific thresholds. In the code chunk below we display a boxplot of the score distributions per organelle ( Figure 17).

Figure 17.

Visualistion of class-specific classification score distribution.

There are many ways to set thresholds and the choice of method will depend on the biological question and experimental design at hand. One viable approach in the frame of the above experimetal design would be to manually set a FDR, say 5%, per organelle. To do this the user would examine the top scoring predictions for each organelle, and then set a threshold at the score at which they achieve 5% of false assignments per organelle. The definintion of a false assignment would depend on the information available, for example, validity or lack of validity for the localisation from another experiment as reported in the literature or a reliable database. If such information is not available, one crude method is to set a threshold per organelle by extracting the median or 3rd quartile score per organelle. For example, in the code chunk below, we use the orgQuants function to extract the median organelle scores and then pass these scores to the getPredictions function to extract the new localisations that meet this scoring criteria. Any sub-cellular predictions that fall below the specified thresholds are labelled as unknown. The organelle threshold ( ts above) can also be set manually using an interactive app (see below) or by using a named vector of thresholds, as shown in the putative example below for 4 organelles. The output of getPredictons is the original MSnSet dataset with a new feature variable appended to the feature data called fcol.pred (i.e. in our case svm.pred) containing the prediction results. The results can also be visualised using plot2D function ( Figure 18) and extracted by retrieving that specific column from the feature metadata using, for example, fData(hl)$svm.pred.

Figure 18.

Results of the localisation preductions after thresholding.

There is also a dedicated interactive application to help users examine these distributions in the package ( Figure 19). This app can be launched via the pRolocVis function and specifying the argument app = "classify" along with the relevent fcol, scol and mcol which refer to the columns in the feature data that contain the new assignments, assignment scores and markers respectively (see also fvarLabels(svmres)).

Figure 19.

The classify application enable the interactive exploration of classification score thresholding.

The data is loaded and displayed on a PCA plot and a boxplot is used to display the classifier scores by data class. On the left, there is a sidebar panel with sliders to control the thresholds upon which classifications are made. There are two types of cut-off that the user can choose from: (1) Quantile and (2) User-defined. By default, when the application is launched quantile scoring is selected and set to 0.5, the median. The class-specific score thresholds that correspond to selecting the desired quantile are shown as red dots on the boxplot. The assignments on the PCA plot are also updated according to the selected threshold. The quantile threshold can be set by moving the corresponding quantile slider. If the users wishes to set their own cut-offs, the User-defined radio button must be selected and then the sliders for defining user-specified scores become active and the scores are highlighted on the boxplot by blue dots. For more information we refer users to the tutorial vignette.

Transfer learning

In addition to high quality MS-based quantitative proteomics data, there exist a number of other sources of information that are freely available in the public domain that may be useful to assign a protein to its sub-cellular niche. For example, imaging from immunofluorescence microscopy, protein annotations and sequences, and protein-protein interactions among others, represent a rich and vast source of complementary information. We can integrate this auxiliary information with our primary MS-based quantitative data using a paradigm known as transfer learning (TL). The integration of data between different technologies is one of the biggest challenges in computational biology to date and the package provides functionality to do such analyses. We recently developed two transfer learning algorithms using a k-NN and SVM framework and applied them to the task of protein localisation prediction [25]. In this section we will begin with explaining the concept of transfer learning and then show how to apply this in the frame of spatial proteomics and protein localisation prediction. In TL one typically has a primary task that one wishes to solve, and some complementary (often heterogeneous) auxiliary information that is related to the primary learning objective, that can be used to help solve the primary goal. For example, here our primary task is to assign proteins to their sub-cellular niche with high generalisation accuracy from data collected from quantitative MS-based experiments. We have already seen that straightforward supervised ML works well for these types of experiments, however, Transfer learning is particularly useful for classes that are not as well separated. In the example below we extract Gene Ontology (GO) information to use as an auxiliary data source to help solve our task of protein localisation prediction. Using the functions setAnnotationParams and makeGoSet we can contruct an auxiliary MSnSet of GO terms, from the primary data’s features i.e. the protein accession numbers. All the GO terms associated to each accession number are retrieved and used to create a binary matrix where a 1 (0) at position ( i, j) indicates that term j has (not) been used to annotate protein i. The GO terms are retrieved from an appropriate repository using the package. The specific Biomart repository and query will depend on the species under study and the type of identifiers. The first step is to construct the annotation parameters that will enable to perform the query, which is done using setAnnotationParams. Typing into the R console par <- setAnnotationParams() will present two menus, firstly asking you to identify the species of study, and then what type of identifier you have used to annotate the proteins in your MSnSet. It is also possible to pass arbitrary text to match the species e.g. in the code chunk below we pass "Mus musculus", and the identifier type for our data (see featureNames(hl)) which is "Uniprot/Swissprot", for the Biomart query. Now we have contructed the query parameters we can use the makeGoSet function to retrieve and build an auxiliary GO MSnSet as described above. By default, the cellular component terms are downloaded, without any filtering on evidence codes. It is also possible to download terms from the molecular function and biological process GO namespaces, and also apply filtering based on evidence codes as desired, see ?makeGoSet for more details. (We also provide the pre-computed gocc object for users if they wish to load directly, please see the Appendix). The function makeGoSet uses the package to query the relevent database (e.g. Ensembl, Uniprot) for GO terms. All GO terms that have been observed for the 5032 proteins in the hyperLOPIT dataset are retrieved. Users should note that the number of GO terms used is also dependent on the database version queried and thus is always subject to change. We find it is common to see GO terms with only one protein assigned to that term. Such terms do not bring any information for building the classifier and are thus removed using the filterBinMSnSet function. Now that we have generated our auxiliary data, we can use the k-NN implementation of transfer learning available in to integrate this with our primary MS-based quantitative proteomics data using the functions knntlOptimisation to estimate the free-parameters for the integration, and knntlClassification to do the predictions. We have shown that using transfer learning results in the assignment of proteins to sub-cellular niches with a higher generalisation accuracy than using standard supervised machine learning with a single source of information [25].

TL optimisation

The first step, as with any machine learning algorithm, is to optimise any free paramaters of the classifier. For the k-NN TL classifier there are two sets of parameters that need optimising: the first set are the k’s for the primary and auxiliary data sources required for the nearest neighbour calculations for each data source. The second set of parameters (noted by a vector of θ weights) that require optimising are the class weights, one per subcellular niche, that control the proportion of primary and auxiliary data to use for learning. A weight can take any real value number between 0 and 1. A weight of θ = 1 indicates that all weight is given to the primary data (and this implicitly implies that a weight of 1 − θ = 0 is given to the auxiliary data), and similarly a weight of θ = 0 implies that all weight is given to the auxiliary data (so 0 is given to the primary source). If we conduct a parameter search and test weights θ = 0, 1 /3, 2 /3, 1 for each class, and if we have, for example 10 subcellular niches, this will result in 4 10 different combinations of parameters to test. The parameter optimisation is therefore time consuming and as such we recommend making use of a computing cluster (code and submissing scripts are also available in the supporting information). The markers in the hl dataset contain 14 subcellular classes. If we examine these markers and classes on the PCA plot above we can see that in particular the two ribosomes and two nuclear compartments are highly separated along the first two components, this is also evident from the profiles plot which gives us a good indication that these subcellular niches are well-resolved in the hyperLOPIT dataset. Transfer learning is particularly useful for classes that are not as well separated. We find that subcellular niches that are well-separated under hyperLOPIT and LOPIT obtain a class score of 1 (i.e. use only primary data from transfer learning [25]). Therefore, for the optimisation stage of the analyses we can already infer a subcellular class weight of 1 for these niches and only optimise over the remaining organelles. This can significantly cut down optimisation time as by removing these 4 classes from the optimisation (and not the classification) we only have 4 10 class weight combinations to consider instead of 4 14 combinations. In the example below we first remove these 4 classes from the marker set and create a new marker set called tlmarkers. Then we re-run the knnOptimisation for each data source and then run the knntlOptimisation with the 10 remaining classes. (Note: this is not run live as the hl dataset with 10 classes, 707 markers and 4 10 combinations of parameters takes around 76 hours to run on the University of Cambridge HPC using 256 workers. To save time for users, the results of the following optimisation are pre-computed and provided with the pRolocdata package. Please see the Appendix for details on how to load these directly. TL optimisation stage 1 Run knnOptimisation to get the best k’s for each data source. From examining the parameter search plots as described in section Optimisation, we find the best k’s for both the primary and auxiliary are 3. TL optimisation stage 2 Run knntlOptimisation to get the best transfer learning weights for each sub-cellular class. The results of the optimisation can be visualised using the plot method for tlopt optimisation result (shown in Figure 20):

Figure 20.

Visualisation of the transfer learning parameter optimisation procedure.

Each row displays the frequency of observed weights (along the columns) for a specific sub-cellular class, with large dots representing higher observation frequencies.

Visualisation of the transfer learning parameter optimisation procedure.

Each row displays the frequency of observed weights (along the columns) for a specific sub-cellular class, with large dots representing higher observation frequencies.

TL classification

Looking at the bubble plot displaying the distribution of best weights over the 50 runs we find that for many of the subcellular niches a weight of 1 is most popular (i.e. use only primary hyperLOPIT data in classification), this is unsuprising as we already know the dataset is well resolved for these classes. We see that the most popular weights for the proteasome and lysosome tend to be towards 0, indicating that these niches are well-resolved in the Gene Ontology. This tells us that we would benefit from including auxiliary GO information in our classifier for these subcellular compartments. The plasma membrane weights are relatively equally spread between using hyperLOPIT and GO data. Using the getParams function we can return the best weights and then use this as input for the classification. One of the benefits of the algorithm is the ability to manually select weights for each class. In the optimisation above, for time constraints, we removed the two ribosomal subunits and the two nuclear compartments, and therefore in the code chunk below when we extract the best parameters, these subcellular niches are not included. To include these 4 subcellular niches in the next classification step we must include them in the parameters. We define a weight of 1 for each of these niches, as we know they are well resolved in hyperLOPIT. We then re-order the weights according to getMarkerClasses and perform the classification using the function knntlClassification. The results from the classification results and associated scores are appended to the fData slot and named knntl and knntl.scores respectively. Results can be visualised using plot2D, scores assessed and cutoffs calculated using the classify app in pRolocVis, predictions obtained using getPredictions in the same way as demonstrated above for the SVM classifier. In ’s transfer learning vignette, we demonstrate how to use imaging data from the Human Protein Atlas [26] via the package [27] as an auxiliary data source.

Unsupervised machine learning

In there is functionality for unsupervsied machine learning methods. In unsupervised learning, the training data consists of a set of input vectors e.g. protein profiles, ignoring the information about the class label e.g. localisation, other than for annotation purposes. The main goal in unsupervised learning is to uncover groups of similar features within the data, termed clustering. Ordination methods such as principal components analysis (PCA) also fall into the category of unsupervised learning methods, where the data can be projected from a high-dimensional space down to two or three dimensions. As described and demonstrated already above, PCA is a valuable and powerful method for data visualisation and quality control. Another application uses hierarchical clustering to summarise the relation between marker proteins using the mrkHClust function, where the euclidean distance between average class-specific profiles is used to produce a dendrogramme describing a simple relationship between the sub-cellular classes ( Figure 21). The mrkHClust uses the same defaults as all other function, using the markers feature variable to define marker proteins. In the code chunk, we adapt the figure margins to fully display the class names.

Figure 21.

Hierarchical clustering of the average marker profiles summarising the relation between organelles profiles.

We generally find supervised learning more suited to the task of protein localisation prediction in which we use high-quality curated marker proteins to build a classifier, instead of using an entirely unsupervised approach to look for clusters and then look for enrichment of organelles and complexes. In the latter we do not make good use of valuable prior knowledge, and in our experience unsupervised clustering can be extremely difficult due to (i) the loose definition of what constitutes a cluster (for example whether it is defined by the quantitative data or the localisation information), (ii) the influence of the algorithm assumption on the cluster identification (for example parametric or non-parametric) and (iii) poor estimates of the number of clusters that may appear in the data.

Writing and exporting data

An MSnSet can be exported from R using the write.exprs function. This function writes the expression values to a text-based spreadsheet. The fcol argument can be used to specify which featureData columns (as column names, column number or logical) to append to the right of the expression matrix. In the below code chunk we write the hl object to a csv file. The file argument is used to specify the file path, the sep argument specifies the field separator string, here we use a comma, finally as we want to write all the information in the featureData to the file, as well as the expression data, we specify fvarLabels(hl), that returns all feature variable names, and write the resulting data to the file "hl.csv". Exporting to a spreadsheet however loses a lot of important information, such as the processing data, and the sample metadata in the phenoData slot. Other objects, such as parameters from the machine learning optimisation, cannot be represented as tabular data. To directly serialise R objects to disk, on can use the standard save function, and later reload the object using load. For example, to save and then re-load the parameters from the SVM optimisation,

Session information and getting help

The function sessionInfo provides a summary of all packages and versions used to generate this document. This enables us to record the exact state of our session that lead to these results. Conversely, if the script stops working or if it returns different results, we are in a position to re-generate the original results using the adequate software versions and retrace changes in the software that lead to failure and/or different results. We also recommend that users regularly update the packages as well as the R itself. This can be done with the biocLite function. It is always important to include session information details along with a short reproducible example highlighting the problem or question at hand. The source of this document, including the code necessary to reproduce the analyses and figures is available in a public manuscript repository on GitHub [18].

Data availability

The data referenced by this article are under copyright with the following copyright statement: Copyright: © 2018 Breckels LM et al. Data associated with the article are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication). The software and data presented in this workflow are part of the Bioconductor project. Version numbers for all packages used are shown in the Session information section. The source of this document, including the code necessary to reproduce the analyses and figures is available at: https://github.com/lmsimp/bioc-pRoloc-hyperLOPIT-workflow/ [18]. An archived version as at the time of publication is available at: DOI https://zenodo.org/record/1256018 [28]. The revised manuscript of Breckels et al. addresses all my previously stated comments. The inclusion of code snippets in the appendix to fully reproduce the results from the authors adds to the overall very high credibility and quality of the article. It can be seen as a poster child article for reproducible research. For not finding the appropriate, computationally costly files in the first review online, I would like to excuse myself. Any other issues will be dealt with through the corresponding developer channels. I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard. Breckels et al. have written a very nice piece on analysing appropriate proteomics data for subcellular localisation. I particularly like the " workshop characteristics" of the text. Which allows a novice, but interested reader to work through the analysis stepwise and reproduce the results described therein. The authors took great care in keeping this ideal up during their text and this is also where I have put my greatest reservation to the manuscript in its present form - since a reader cannot work through the code presented in the manuscript, since there at at least two situations where a readily available HPC and quite some time is required. This kind of leaves a dent in my impression - however, given this can be resolved as well as some typos - the workflow report is superb. Major comments: Next to reducing the dimensions of data for visualisation, PCA also offers a way to understand how the variability is distributed across the multidimensional data by providing linear combinations of the variables which then constitute the actual PCs. On that note it would be nice to mention this in Visualising markers section on page 16, where PC7 explains not much variability but due to the correct weighing of the variables we do get a separation between mitochondrial and peroxisome. This then can be further motivated with Figure 9 - where we probably can see that the weights for the fractions where the two localisations differ are larger than otherwise. I was unable to reproduce Figure 13 comparing the two MSnSets. While I was able to look at each set separately using pRolocVis(hllst@x[[I]]), where i is 1 or 2, I only got an error using the code from the manuscript: > pRolocVis(hllst, app=“compare”) Subsetting MSnSetList to their common feature names 5032 features in common Remapping data to the same PC space Error in (function (od, vd) : object and replacement value dimnames differ Error in pRolocVis_compare(object, ...) : object 'idDT' not found When using ‘remap=FALSE’ it actually works, but since this makes barely sense it is of no use - but just as a hint at debugging it. You really need to make the results from the phenoDisco classification available too. It is super disappointing that one cannot continue reproducing the code from page 23 on, because it takes 24 hours to compute it using 40 cores… The above comment is of course also true for the KNN TL Optimisation on page 33 - this needs to be downloadable, since not everyone has access to Cambridge’s HPC and probably even less have 76 hours to spare. Your comment on the increase suitability of classification instead of clustering (when additional information on classes is available) at the bottom of page 35 could be more pronounced - for educational reasons. Minor comments: I was not able to naïvely reproduce the workflow from the R commands in the article due to an error installing pRolocdata on a Windows machine. On OS X it was smooth. On page 10 line 2 there is a ‘to’ missing. I never came across the verb imputate in the context of missing values, I guess the proper term is impute. On page 11 the image2 function is called after the filterNA function a couple of lines above. This however would result in an only black heat map (since there are no more missing). The image2 function should be called before the filterNA function. Since the reader does not see the chunk options, it could be puzzling. For completeness sake there should also be an install.packages(c(“hexbin”, “rgl”)) somewhere to generate the second PCA-plot and the 3D plot. Moreover, Mac users will need to install xquartz to use rgl properly. On page 14 the plotting code chunk is off track - in the middle of the marker sets output. On page 18: …wanted to highlight a proteins with the… -> lose the a and later in the sentence there is a ‘ create a’ too many. Direct comparisons of individual channels in replicated experiments do not provide… You may want to consider adding a layout(1) or similar, after changing the mfrow argument of the parameters to accommodate 2 panels, such that the uncanny reader does not get confused. I would prefer links to referred sections of the text, but that may be personal taste… Page 23: One should note that the decreasing the GS, and increasing the … at least one the too many, probably two. On page 25: We find the general tendancy to be that it is not the choice … tendency? On page 28 you refer to ‘…the code chunk below…’ for Figure 17, however, the following code chunk is generating Figure 16 (which is above and btw not referenced in the text). Maybe force your figures a little to float where you want them/refer to them. On page 28: …by extracting the median or 3rd quantile score per organelle… do you mean quartile? Otherwise I do not follow. On page 32: …package to query the relevent database … relevant? On page 32 - there is something wrong with this sentence: To remove the 4 classes and create a new column of markers in the feature data called tlmarkers to use for the analysis: On page 34: From examining the parameter seach plots as described in section Optimisation… search! On page 36: …and later reload the object using save. -> that would be ‘load’ then! On page 38 - I fully agree with the following sentence, but right after the updating comment it kind of seems ‘ misplaced’? Maybe add a title like ‘ Getting help’? It is always important to include session information details along with a short reproducible example highlighting the problem or question at hand. I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above. Thank you for your comments. Please find our responses to these inset below. Next to reducing the dimensions of data for visualisation, PCA also offers a way to understand how the variability is distributed across the multidimensional data by providing linear combinations of the variables which then constitute the actual PCs. On that note it would be nice to mention this in Visualising markers section on page 16, where PC7 explains not much variability but due to the correct weighing of the variables we do get a separation between mitochondrial and peroxisome. This then can be further motivated with Figure 9 - where we probably can see that the weights for the fractions where the two localisations differ are larger than otherwise. We have added a paragraph to the 'Visualising markers' section of the manuscript reiterating the purpose of PCA and motivating the choice of looking at PC's 1 and 7. Figure 9 now follows on from this (now Figure 8), along with the corresponding code and an explanation of the `plotDist` function. I was unable to reproduce Figure 13 comparing the two MSnSets. While I was able to look at each set separately using `pRolocVis(hllst@x[[I]])`, where i is 1 or 2, I only got an error using the code from the manuscript. When using `remap=FALSE` it actually works, but since this makes barely sense it is of no use - but just as a hint at debugging it. We can not reproduce this error. Have you updated to the latest version of R and the latest version of `pRolocGUI`? If you still get this error message could you please post this as an issue (https://github.com/ComputationalProteomicsUnit/pRolocGUI/issues) on the `pRolocGUI` Github page along with your `sessionInfo()` and we will certainly attempt to solve this. You really need to make the results from the phenoDisco classification available too. It is super disappointing that one cannot continue reproducing the code from page 23 on, because it takes 24 hours to compute it using 40 cores? The results are already available as a `RDS` file and stored in `pRolocdata` for users. This is what is called in the manuscript under the hood: ``` f0 <- dir(extdatadir, full.names = TRUE, pattern = "bpw-pdres.rds") pdres <- readRDS(f0) hl <- addMarkers(hl, pdres, mcol = "pd", verbose = FALSE) ``` We have made this code available in the manuscript in an appendix so users can continue to produce the exact plots as they see in this workflow. The above comment is of course also true for the KNN TL Optimisation on page 33 - this needs to be downloadable, since not everyone has access to Cambridge's HPC and probably even less have 76 hours to spare. The same as for the `phenoDisco` analysis and `svm`, the TL results are stored as a `RDS` in `pRolocdata` and are loaded in the background. We have added the code required to the appendix so that users can load the results directly. Your comment on the increase suitability of classification instead of clustering (when additional information on classes is available) at the bottom of page 35 could be more pronounced - for educational reasons. To address the above comment on suitability we have added a few additional points on the challenges of using clustering for this type of data. We generally find supervised learning more suited to the task of protein localisation prediction in which we use high-quality curated marker proteins to build a classifier, instead of using an entirely unsupervised approach to look for clusters and then look for enrichment of organelles and complexes. In the latter we do not make good use of valuable prior knowledge, and in our experience unsupervised clustering can be extremely difficult due to (i) the loose definition of what constitutes a cluster (for example whether it is defined by the quantitative data or the localisation information), (ii) the influence of the algorithm assumption on the cluster identification (for example parametric or non-parametric) and (iii) poor estimates of the number of clusters that may appear in the data. I was not able to naively reproduce the workflow from the R commands in the article due to an error installing pRolocdata on a Windows machine. On OS X it was smooth. We didn't experience any Windows-specific problems. If you re-try the installation and please let us know if you still have any issues by opening an issue (https://github.com/lgatto/pRolocdata/issues) or by posting this issue on the Bioconductor Support site (https://support.bioconductor.org). On page 10 line 2 there is a *to* missing. In the version have, we currently can't find the missing 'to'. I never came across the verb imputate in the context of missing values, I guess the proper term is impute. This has been changed to read "We can impute missing data..." On page 11 the image2 function is called after the filterNA function a couple of lines above. This however would result in an only black heat map (since there are no more missing). The image2 function should be called before the filterNA function. Since the reader does not see the chunk options, it could be puzzling. This was an editing mistake and has now been rectified. For completeness sake there should also be an install.packages(c("hexbin", "rgl")) somewhere to generate the second PCA-plot and the 3D plot. Moreover, Mac users will need to install xquartz to use rgl properly. A footnote has been added here to tell users that the package `rgl` may need to be installed with `install.packages("rgl")` and mac users may need to install xquartz if it's not already installed. LG: this one is still missing. On page 14 the plotting code chunk is off track - in the middle of the marker sets output. This has now been rectified. On page 18: ...wanted to highlight a proteins with the ... -> lose the a and later in the sentence there is a 'create a' too many. These typos have been rectified. Direct comparisons of individual channels in replicated experiments do not provide? You may want to consider adding a layout(1) or similar, after changing the mfrow argument of the parameters to accommodate 2 panels, such that the uncanny reader does not get confused. We would prefer to keep the code as it is and not introduce more noise with calls to other functions such as `layout`. The workflow is not aimed at teaching R. Users should have some basic knowledge of R before tackling this tutorial. I would prefer links to referred sections of the text, but that may be personal taste? This is a comment for F1000. We cannot control the linking of sections in the final version. Page 23: One should note that the decreasing the GS, and increasing the ...at least one the too many, probably two. We have reworded this sentence as requested. On page 25: We find the general tendancy to be that it is not the choice ...tendency? This typo has been rectified. On page 28 you refer to '...the code chunk below...' for Figure 17, however, the following code chunk is generating Figure 16 (which is above and btw not referenced in the text). Maybe force your figures a little to float where you want them/refer to them. We have now referenced Figure 16 in the text and made sure that the code chunks and figures follow inline where they are referenced in the text. On page 28: ...by extracting the median or 3rd quantile score per organelle? do you mean quartile? Otherwise I do not follow. Thank you, yes this is a typo and has now been changed to 'quartile'. On page 32: package to query the relevent database *relevant* This typo has been rectified. On page 32 - there is something wrong with this sentence: To remove the 4 classes and create a new column of markers in the feature data called tlmarkers to use for the analysis: This sentence is not needed here and so it has now been removed as it essentially reiterates what is said in the above paragraph. On page 34: From examining the parameter seach plots as described in section Optimisation... search! This typo has been rectified. On page 36: and later reload the object using save. -> that would be `load` then! This typo has been rectified. On page 38 - I fully agree with the following sentence, but right after the updating comment it kind of seems misplaced? Maybe add a title like Getting help? We have changed the title of this section to 'Session information and getting help' to clarify this section of the tutorial. A pdf version of our replies is also available here. This manuscript describes a Bioconductor workflow for analyzing subcellular proteomics data. It is very detailed and comprehensive and will be useful for others in the field. A few comments: Some clearer statement early on would help to clarify for readers what types of data this works with. I know that the authors indicate that the example they use is 10-plex TMT and that it can be used with label-free or other labels, but that is not what I am referring to. Rather, structure of the experiment. That is, that one needs systematic quantitative data on all the different relevant fractions from a cell, as opposed to someone who perhaps did a differential centrifugation experiment to isolate a couple fractions and then wants to apply this (my understanding is that this latter example would not be usable). How do the authors recommend collapsing replicates? This could be covered in the section dedicated to the Compare function. Two replicates will (almost) never agree 100% so how are discrepancies handled? I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard. This manuscript describes a Bioconductor workflow for analyzing subcellular proteomics data. It is very detailed and comprehensive and will be useful for others in the field. Many thanks for your comments, we have responded to them inset below. Some clearer statement early on would help to clarify for readers what types of data this works with. I know that the authors indicate that the example they use is 10-plex TMT and that it can be used with label-free or other labels, but that is not what I am referring to. Rather, structure of the experiment. That is, that one needs systematic quantitative data on all the different relevant fractions from a cell, as opposed to someone who perhaps did a differential centrifugation experiment to isolate a couple fractions and then wants to apply this (my understanding is that this latter example would not be usable). From a purely technical point of view, we need a data matrix with features (typically proteins or protein groups) along the rows and sub-cellular fractions along the columns. Given the requirement for complete (or near-complete) quantitative vectors along all fractions to assure best results, a data set that would only contain one quantitation value per fraction would not work. However, separation using differential centrifugation, or any separation that generates organelle-specific separation profiles is a good fit for pRoloc. For instance, the type of data generated by methods such as described in Itzhak et al. (2016) are well fitted for our software: > library(pRoloc) > library(pRolocdata) > data(itzhak2016stcSILAC) > ## 6 combined replicates of 5 fractions each > dim(itzhak2016stcSILAC) [1] 5265 30 > plot2D(itzhak2016stcSILAC) (see outout here) How do the authors recommend collapsing replicates? This could be covered in the section dedicated to the Compare function. Two replicates will (almost) never agree 100% so how are discrepancies handled? Currently, we recommend to visualise different replicates on their own, to confirm that they are of sufficient quality, and then combine them, retaining the proteins that have been quantified over all replicatedd experiments. This allows to obtain localisation information over all replicated data. We however do not explicitly assess the variability using this approach. This could be done by analysing replicated independently and then compare the coherence of the classification results. Proteins that are only observed in some replicates could be rescued by repeating the analysis using only the relevant (possibly unique) replicate(s).

24 in total

1. Towards a knowledge-based Human Protein Atlas.

Authors: Mathias Uhlen; Per Oksvold; Linn Fagerberg; Emma Lundberg; Kalle Jonasson; Mattias Forsberg; Martin Zwahlen; Caroline Kampf; Kenneth Wester; Sophia Hober; Henrik Wernerus; Lisa Björling; Fredrik Ponten
Journal: Nat Biotechnol Date: 2010-12 Impact factor: 54.908

Review 2. Using R and Bioconductor for proteomics data analysis.

Authors: Laurent Gatto; Andy Christoforou
Journal: Biochim Biophys Acta Date: 2013-05-18

3. Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies.

Authors: Cosmin Lazar; Laurent Gatto; Myriam Ferro; Christophe Bruley; Thomas Burger
Journal: J Proteome Res Date: 2016-03-01 Impact factor: 4.466

4. Global, quantitative and dynamic mapping of protein subcellular localization.

Authors: Daniel N Itzhak; Stefka Tyanova; Jürgen Cox; Georg Hh Borner
Journal: Elife Date: 2016-06-09 Impact factor: 8.140

5. A Portrait of the Human Organelle Proteome In Space and Time during Cytomegalovirus Infection.

Authors: Pierre M Jean Beltran; Rommel A Mathias; Ileana M Cristea
Journal: Cell Syst Date: 2016-09-15 Impact factor: 10.304

6. The effect of organelle discovery upon sub-cellular protein localisation.

Authors: L M Breckels; L Gatto; A Christoforou; A J Groen; K S Lilley; M W B Trotter
Journal: J Proteomics Date: 2013-03-21 Impact factor: 4.044

7. A draft map of the mouse pluripotent stem cell spatial proteome.

Authors: Andy Christoforou; Claire M Mulvey; Lisa M Breckels; Aikaterini Geladaki; Tracey Hurrell; Penelope C Hayward; Thomas Naake; Laurent Gatto; Rosa Viner; Alfonso Martinez Arias; Kathryn S Lilley
Journal: Nat Commun Date: 2016-01-12 Impact factor: 14.919

8. Identification of trans-golgi network proteins in Arabidopsis thaliana root tissue.

Authors: Arnoud J Groen; Gloria Sancho-Andrés; Lisa M Breckels; Laurent Gatto; Fernando Aniento; Kathryn S Lilley
Journal: J Proteome Res Date: 2014-01-17 Impact factor: 4.466

9. A Mass Spectrometry-Based Approach for Mapping Protein Subcellular Localization Reveals the Spatial Proteome of Mouse Primary Neurons.

Authors: Daniel N Itzhak; Colin Davies; Stefka Tyanova; Archana Mishra; James Williamson; Robin Antrobus; Jürgen Cox; Michael P Weekes; Georg H H Borner
Journal: Cell Rep Date: 2017-09-12 Impact factor: 9.423

10. Role of the AP-5 adaptor protein complex in late endosome-to-Golgi retrieval.

Authors: Jennifer Hirst; Daniel N Itzhak; Robin Antrobus; Georg H H Borner; Margaret S Robinson
Journal: PLoS Biol Date: 2018-01-30 Impact factor: 8.029

13 in total

1. Mapping the Saccharomyces cerevisiae Spatial Proteome with High Resolution Using hyperLOPIT.

Authors: Daniel J H Nightingale; Stephen G Oliver; Kathryn S Lilley
Journal: Methods Mol Biol Date: 2019

2. Comparative Analysis of Quantitative Mass Spectrometric Methods for Subcellular Proteomics.

Authors: Abla Tannous; Marielle Boonen; Haiyan Zheng; Caifeng Zhao; Colin J Germain; Dirk F Moore; David E Sleat; Michel Jadot; Peter Lobel
Journal: J Proteome Res Date: 2020-03-05 Impact factor: 4.466

3. CFTR Rescue by Lumacaftor (VX-809) Induces an Extensive Reorganization of Mitochondria in the Cystic Fibrosis Bronchial Epithelium.

Authors: Clarissa Braccia; Josie A Christopher; Oliver M Crook; Lisa M Breckels; Rayner M L Queiroz; Nara Liessi; Valeria Tomati; Valeria Capurro; Tiziano Bandiera; Simona Baldassari; Nicoletta Pedemonte; Kathryn S Lilley; Andrea Armirotti
Journal: Cells Date: 2022-06-16 Impact factor: 7.666

4. Separating Golgi Proteins from Cis to Trans Reveals Underlying Properties of Cisternal Localization.

Authors: Harriet T Parsons; Tim J Stevens; Heather E McFarlane; Silvia Vidal-Melgosa; Johannes Griss; Nicola Lawrence; Richard Butler; Mirta M L Sousa; Michelle Salemi; William G T Willats; Christopher J Petzold; Joshua L Heazlewood; Kathryn S Lilley
Journal: Plant Cell Date: 2019-07-02 Impact factor: 11.277

5. A Protocol to Map the Spatial Proteome Using HyperLOPIT in Saccharomyces cerevisiae.

Authors: Daniel J H Nightingale; Kathryn S Lilley; Stephen G Oliver
Journal: Bio Protoc Date: 2019-07-20

6. Genetic disruption of WASHC4 drives endo-lysosomal dysfunction and cognitive-movement impairments in mice and humans.

Authors: Jamie L Courtland; Tyler Wa Bradshaw; Greg Waitt; Erik J Soderblom; Tricia Ho; Anna Rajab; Ricardo Vancini; Il Hwan Kim; Scott H Soderling
Journal: Elife Date: 2021-03-22 Impact factor: 8.713

7. Combining LOPIT with differential ultracentrifugation for high-resolution spatial proteomics.

Authors: Aikaterini Geladaki; Nina Kočevar Britovšek; Lisa M Breckels; Tom S Smith; Owen L Vennard; Claire M Mulvey; Oliver M Crook; Laurent Gatto; Kathryn S Lilley
Journal: Nat Commun Date: 2019-01-18 Impact factor: 14.919

8. A Bayesian mixture modelling approach for spatial proteomics.

Authors: Oliver M Crook; Claire M Mulvey; Paul D W Kirk; Kathryn S Lilley; Laurent Gatto
Journal: PLoS Comput Biol Date: 2018-11-27 Impact factor: 4.475

9. A Bioconductor workflow for the Bayesian analysis of spatial proteomics.

Authors: Oliver M Crook; Lisa M Breckels; Kathryn S Lilley; Paul D W Kirk; Laurent Gatto
Journal: F1000Res Date: 2019-04-11

10. A Comprehensive Subcellular Atlas of the Toxoplasma Proteome via hyperLOPIT Provides Spatial Context for Protein Functions.

Authors: Konstantin Barylyuk; Ludek Koreny; Huiling Ke; Simon Butterworth; Oliver M Crook; Imen Lassadi; Vipul Gupta; Eelco Tromer; Tobias Mourier; Tim J Stevens; Lisa M Breckels; Arnab Pain; Kathryn S Lilley; Ross F Waller
Journal: Cell Host Microbe Date: 2020-10-13 Impact factor: 21.023