Joanne Watson1,2, Jean-Marc Schwartz1, Chiara Francavilla2. 1. Division of Evolution & Genomic Sciences, School of Biological Sciences, Faculty of Biology, Medicine & Health, University of Manchester, Manchester M13 9PT, U.K. 2. Division of Molecular and Cellular Function, School of Biological Sciences, Faculty of Biology, Medicine & Health, University of Manchester, Manchester M13 9PT, U.K.
Abstract
Mass spectrometry-based quantitative phosphoproteomics has become an essential approach in the study of cellular processes such as signaling. Commonly used methods to analyze phosphoproteomics datasets depend on generic, gene-centric annotations such as Gene Ontology terms, which do not account for the function of a protein in a particular phosphorylation state. Analysis of phosphoproteomics data is hampered by a lack of phosphorylated site-specific annotations. We propose a method that combines shotgun phosphoproteomics data, protein-protein interactions, and functional annotations into a heterogeneous multilayer network. Phosphorylation sites are associated to potential functions using a random walk on the heterogeneous network (RWHN) algorithm. We validated our approach against a model of the MAPK/ERK pathway and functional annotations from PhosphoSitePlus and were able to associate differentially regulated sites on the same proteins to their previously described specific functions. We further tested the algorithm on three previously published datasets and were able to reproduce their experimentally validated conclusions and to associate phosphorylation sites with known functions based on their regulatory patterns. Our approach provides a refinement of commonly used analysis methods and accurately predicts context-specific functions for sites with similar phosphorylation profiles.
Mass spectrometry-based quantitative phosphoproteomics has become an essential approach in the study of cellular processes such as signaling. Commonly used methods to analyze phosphoproteomics datasets depend on generic, gene-centric annotations such as Gene Ontology terms, which do not account for the function of a protein in a particular phosphorylation state. Analysis of phosphoproteomics data is hampered by a lack of phosphorylated site-specific annotations. We propose a method that combines shotgun phosphoproteomics data, protein-protein interactions, and functional annotations into a heterogeneous multilayer network. Phosphorylation sites are associated to potential functions using a random walk on the heterogeneous network (RWHN) algorithm. We validated our approach against a model of the MAPK/ERK pathway and functional annotations from PhosphoSitePlus and were able to associate differentially regulated sites on the same proteins to their previously described specific functions. We further tested the algorithm on three previously published datasets and were able to reproduce their experimentally validated conclusions and to associate phosphorylation sites with known functions based on their regulatory patterns. Our approach provides a refinement of commonly used analysis methods and accurately predicts context-specific functions for sites with similar phosphorylation profiles.
Entities:
Keywords:
Gene Ontology; bioinformatics; functional analysis; multilayer networks; pathways; phosphoproteomics; random walk
Phosphorylation
is the most studied post-translational modification
(PTM) due to its central role in cellular regulation. It is thought
to be the principal PTM in the human proteome and an essential mediator
of protein–protein interactions (PPIs) and protein functions.[1] Transient changes occur at specifically regulated
phosphorylation sites, of which there may be multiple on each protein.
Regulation of phosphorylation is often dependent on perturbations
such as the activity of extracellular ligands, drug treatment, or
physical stimuli in the extracellular environment.[2] By comparing changes in the phosphoproteome of cells under
different experimental conditions through mass spectrometry-based
phosphoproteomics, phosphorylated sites that are key players in cellular
processes and functions can be uncovered in an unbiased, high-throughput
manner.[3]Functional analysis of phosphoproteomics
datasets is typically
based on gene-centric enrichment of Gene Ontology (GO) terms or involvement
in known pathways.[4] However, this approach
disregards information captured by phosphoproteomics data on changes
at specific phosphorylated sites, by limiting the analysis to the
protein level. The modification state of a protein is inherently coupled
to its function; PTMs alter protein activity and the ability to interact
with different sets of proteins. Furthermore, if a protein is phosphorylated
on multiple sites, each with a different function and regulatory pattern,
this information is not revealed by gene-centric analysis.[5] For instance, the well-studied signaling protein
MAPK1 has 18 known phosphorylated sites recorded in the database PhosphoSitePlus
(PSP); however, only six have been annotated to a downstream function.[6] In the most recent release of PSP (v6.5.9.3),
only 4271 of the more than 230,000 human phosphorylated sites recorded
in the database are associated to 19 generic biological functions
(e.g., “cell cycle” and “transcription”),
which are qualified with one of the following: “induced”,
“inhibited”, “regulated”, or “altered”.
Enrichment analyses that rely on generalizations based on protein-level
or gene-centric descriptions exclude the details that are encoded
in the phosphorylation signature. Analyses are thus hampered by the
lack of phosphorylation site-specific functional annotations.Several network-based methods have been proposed to move toward
phosphorylation site-specific analyses of phosphoproteomics data.
A significant focus has been on the inference of kinase–substrate
networks, such as NetworKIN,[7] KSEA,[8] and IKAP,[9] among others.
These methods are useful for reconstructing the architecture of signaling
intracellular networks, which can be informative for identifying modules
of modified proteins involved in cellular processes, but again remain
hampered by the lack of site-specific functional annotation.[10] They may also be biased toward the most studied
kinases and the exclusion of non-kinase proteins. Rudolph et al.(11) proposed a method to
address this issue named PHOTON, which identified differentially regulated
proteins based on the level of phosphorylation of their binding partners
in a high-confidence PPI network and then used logistic regression
to identify the involvement of the phosphorylated proteins in known
signaling pathways. PHOTON is not truly site-centric, however, as
it relies on summarized quantitative values of phosphorylated sites.
An alternative approach described by Krug et al.(12) uses a modification to the gene-centric method
gene set enrichment analysis (GSEA), which relies on a curated resource
containing literature-derived phosphorylated site-specific signatures
to assign functions to sites. However, the use of such a resource
is limited by the lack of phosphorylated site annotation to these
signatures or specific cellular processes; although it is useful for
identifying well-studied sites, predictions of functions of under-studied
sites or sites in alternative contexts would not be captured. These
do not fulfill the same role as popular gene-centric methods such
as over-representation analysis (ORA)[13] in aiding prediction and hypothesis generation when performing exploratory
analysis of shotgun phosphoproteomics data.In recent years,
heterogeneous or multilayer networks have been
used to represent many types of omics datasets.[14] These specialized networks are used to describe multiple
types of associations with nodes representing different entities.
To identify relationships between the different biological layers,
random walk algorithms have been applied to these networks. The random
walk on the heterogeneous network (RWHN) with the restart (RWR) method[15] has been particularly popular. Jiang[16] used RWR to prioritize disease candidate genes
in a PPI–phenome network; similarly, Soul et al.(17) applied it to a PPI–phenome
network to identify disease mechanisms. Recent work has extended the
method to multiple layers of biological information, for example,
to infer disease-associated m6A RNA methylation via known gene–disease associations.[18] Similar methodology was used to associate phosphorylation sites
recorded in the PSP database to diseases via kinase–substrate
interactions.[19]Here, we propose
an algorithm that uses RWHN to associate phosphorylated
sites to context-specific function via a heterogeneous
multilayer network using shotgun phosphoproteomics data. The network
combines three layers of information: phosphorylated sites, protein
interactions, and GO terms. Clustering of phosphoproteomics data is
used to find common features within datasets and is generally followed
by enrichment analyses. This is based on the assumption that common
patterns of phosphorylation, based on temporal changes or those in
response to a particular stimulus, treatment, or environmental context,
are a likely indicator of involvement in common functions or processes.[20] We utilize this concept in our algorithm by
connecting phosphorylated sites that have been clustered together
and therefore share regulatory patterns, within the multilayer network.
The algorithm is intended for use on large phosphoproteomics datasets,
assessing perturbations or processes as opposed to an interpretation
of the full phosphoproteomics network in the cell.We first
apply our algorithm to a small-scale, manually curated
validation network. We also assess the ability of our method and the
most commonly used alternative, ORA, to capture the functional descriptions
recorded in PSP, which are based on experimental analysis. We then
demonstrate the utility of our algorithm using three previously published
shotgun phosphoproteomics datasets, describing early signaling events
in HeLa cells upon EGF and TGF-α stimulation,[21] phosphorylation-mediated changes in breast cancer cells
resistant to lapatinib treatment,[22] and
subcellular location-dependent signaling events downstream of HRAS.[23] We demonstrate that phosphorylated sites can
be differentially assigned to functional annotations and this is driven
by changes in the context-dependent modification of these sites. Our
method is suitable for use with any phosphoproteomics dataset and
could be generalized for data describing other PTMs.
Experimental Procedures
A multilayer heterogeneous network
was constructed to relate biological
functions to phosphorylation sites via a protein–protein
interaction network (Figure ). Three types of nodes are contained within the network:
phosphorylation sites (called “sites” here on in for
brevity), proteins, and functional annotations [either Gene Ontology
biological process (GOBP) terms or KEGG pathways]. The edges describe
five possible associations, which are either bipartite (i.e., site to protein and protein to function) or between the same type
of nodes.
Figure 1
Overview of the multilayer heterogeneous network construction from
phosphoproteomics data.
Overview of the multilayer heterogeneous network construction from
phosphoproteomics data.
Dataset
Accession and Processing
Sites annotated to have a role in
biological processes were extracted
from the “Regulatory sites” dataset, available for download
from PhosphoSitePlus (version 6.5.9.3). The dataset was filtered on
the “ON_PROCESS” column to only include sites annotated
to biological roles, rather than those with molecular or interaction
roles.The results of the phosphoproteomics experiment described
in Francavilla et al.(21) were taken from Supporting Information Table S1 of their manuscript. The authors had processed the raw mass
spectrometry data using MaxQuant,[24] filtered
identified phosphorylated sites based on a localization probability
of greater than 0.75, and normalized remaining data to the ratio of
EGFR at 1 min after stimulation with EGF or TGF-α. Prior to
multilayer heterogeneous network construction, the data were further
filtered to remove sites with two or more missing ratios in their
time series and containing at least one SILAC ratio higher than 2
or lower than 0.5 (as described in the methods of Francavilla et al.(21)). Missing data were
imputed with random draws from a truncated distribution, as previously
described,[25] using the impute.QRLIC function from the imputeLCMD R package.Data from the Ruprecht et al.(22) study were taken from Supporting Information file named “Filtered and normalized phosphoproteome
dataset”.
The raw mass spectrometry data had been processed using MaxQuant,
filtered to include only identified phosphorylated sites with a localization
probability of greater than 0.75, and normalized. Sites that did not
show a significant change (FDR <1%) either between the untreated
parental and lapatinib-treated resistant (SILAC ratios H/L) or between
the lapatinib-treated parental and lapatinib-treated resistant (SILAC
ratios H/M) experimental conditions were filtered from the data, as
described in the methods of Ruprecht et al.(22) Missing data were imputed with the same method
used for the Francavilla et al. dataset.Data
from the Santra et al. study[23] were taken from Supporting Information Table S2. The raw mass spectrometry data had been processed
using MaxQuant and filtered to include only identified phosphorylated
sites with a localization probability of greater than 0.75; significant
sites were identified using a two-sample t-test with
FDR. The downloaded dataset was normalized using the limma R package function normalizeBetweenArrays using
the quantile method.[26] Missing data were
imputed with the same method used for the Francavilla et al. dataset. The data were then filtered to exclude sites that had not
been identified as significant in the original analysis by Santra et al.Before applying the algorithm, all datasets
were clustered based
on what was reported in the original publication, where available.
The model dataset was clustered using the fuzzy C-means (FCM) method,[27] with the number of clusters selected based on
the temporal trends we could visually identify in the data; generally,
silhouette analysis is recommended for selecting the number of clusters
using FCM for real experimental datasets.[28] The sites extracted from PSP were clustered based on what process
they were annotated to (as indicated in the “ON_PROCESS”
column in the Regulatory_sites file available for download on the
PSP website). The regulated phosphorylated sites from Francavilla et al. were clustered using FCM (Figure ); since the original publication used a
cluster number of six, the same was done here for better comparison.
The regulated phosphorylated sites from Ruprecht et al. and Santra et al. were both clustered using the
k-means method, with the cluster number being selected using the elbow
method (Figures S4 and S5, respectively).[29]
Figure 3
Distribution
of annotations and classifications of sites extracted
from PhosphoSitePlus. (A) Distributions of functionally annotated
sites (to mapped GO terms). (B) Distribution of terms in ORA results
and RWHN results when analyzing functionally annotated sites from
PSP. (C) Number of terms from ORA or RWHN that are from different
levels of the GO hierarchy.
Construction of the Multilayer
Network
Phosphorylation Site—Phosphorylation
Site Subnetwork
Edges were drawn between sites in the same
cluster that had a Pearson correlation (R2) between all the data points greater than or equal to 0.99 (Figure ).
Protein—Protein Subnetwork
The protein–protein
interaction network was extracted from
STRING.[30] All interactors of the proteins
included in the phosphoproteomics datasets with an experimental confidence
score of greater than 0.4 were included (Figure ).
Function—Function
Subnetwork
We used either GOBP terms or KEGG pathways as
functional annotators
in this work. If GOBP terms were included in the multilayer network,
the GOSemSim package from Bioconductor[31] was used to calculate the semantic similarity
of enriched GOBP terms. An edge was drawn between terms with a semantic
similarity of greater than 0.7, as calculated using the Wang method
included in the goSim function of GOSemSim.[31,32] In the case of KEGG pathways, edges were
drawn between pathways that had greater than 70% pairwise similarity
in their functional annotation profiles, following the method described
in Stoney et al.(33) (Figure ).
Phosphorylation Site—Protein Bipartite
Edges
Sites and proteins had an edge between them if the
residue was found on those proteins. Therefore, sites will only have
one edge, but proteins will have edges to all the sites found on that
protein that were differentially regulated in the dataset (Figure ).
Protein—Function Bipartite Edge
We assumed that
closely connected nodes in the protein–protein
subnetwork would more likely be involved in similar biological processes.
Therefore, we computed modules of the protein–protein subnetwork
using the Louvain module detection method.[34] Proteins from each module were analyzed for enrichment (hypergeometric
test with Benjamini–Hochberg correction, FDR <0.05%) of
functional annotations (from either the “GO_Biological_Process_2018”
or “KEGG_2019_Human” libraries, included in enrichR
and listed on the EnrichR website) using the enrichR(35) R interface. This increased the specificity
of terms to be included in the network. When GO terms were included,
high-frequency (annotated to more than 5% of genes) and semantically
redundant terms (similarity >0.9) were filtered using the Bioconductor GOSemSim(31) and GO.db packages[36] (Figure ).
Random Walk on the Heterogeneous Network
The heterogeneous
network can be represented as an adjacency matrix
as followswhere ARR represents
site–site associations, APP represents
protein–protein associations, AFF represents function–function associations, ARP represents site–protein associations with APR as the transpose, and APF represents protein–function associations with AFP as the transpose.As described in previous
work,[15] a transition matrix (M) was calculated for use in the first stage of the algorithmThe bipartite inter-subgraph
transition matrices (MRP and MPF) were calculated
aswhere λ is the transition probability
(i.e., the likelihood of the walker moving between
two layers of the network).The intra-subgraph transition matrices
(MRR, MPP,
and MFF) were calculated asRWHN is a ranking algorithm;
nodes are ranked based on the probabilities
of finding the random walker at a given node in the steady state,
having started at a given seed node or a set of seed nodes. In this
work, we set the seed nodes to be those sites belonging to a particular
cluster. The probability of finding the random walker at each node
for each step is calculated based on the iterative equationwhere r is the restart probability
(set to 0.7 as described in Kohler et al.(37)), P0 is the initial
probability vector, and P is the probability vector at step s. P0 was calculated such that all seed nodes were given equal
probabilities with their sum equal to 1. All other nodes in the site–site
subnetwork were assigned an initial probability of 0. Nodes in other
subnetworks were assigned equal probabilities with their sum equal
to 1 and weighted with the tunable parameters ηP and
ηF that are used to weigh the influence of each layer.The output of the algorithm is a ranked list of all the nodes,
based on the probability of finding the random walker at each node
in the steady state. This list is filtered by retaining functional
annotations and removing proteins and sites. We remove annotations
that are ranked in the same position regardless of the seed node,
in order to focus on specific functions related to each cluster. The
top 5% of functions in the probability distribution are retained.We implemented the algorithm in R using packages available from
Bioconductor and CRAN. The source code is available at github.com/jowatson2011/RWHN_phosphoproteomics, and an R package for general use of the tool can be found at github.com/JoWatson2011/phosphoRWHN. RWHN on the validation network took less than 30 s to run; however,
the larger experimental datasets took several hours on a moderately
powerful computer (32GB RAM, Intel i7-4770 Processor).
Over-representation Analysis
The
names of modified proteins in each cluster were used as input for
the enrichR R interface.[35] The libraries “GO_Biological_Process_2018” or “KEGG_2019_Human”
(included in enrichR and listed on the EnrichR website) were used
depending on the one used for the RWHN analysis of the same data.
The list of over-represented terms was filtered to only include those
with an FDR <0.05.
Results
Overview
of the Algorithm
To associate
phosphorylated sites of unknown functions to potential cellular functions,
we developed an algorithm to apply to shotgun phosphoproteomics data.
First, in order to help experimental biologists to identify the context-
or perturbation-specific roles of modified proteins changing in each
given experiment, a multilayer heterogeneous network is constructed
based on the regulation of phosphorylated sites. Regulated phosphorylated
sites may be defined differently by the user depending on the quantitation
method used in their phosphoproteomics experiment (e.g., label-free quantification and SILAC) or their experimental question.
The network represents three layers of biological entities and information:
regulated phosphorylated sites, protein–protein interactions,
and biological functions (Figure ). We then apply a ranking algorithm, RWHN, which ranks
nodes of each layer based on (i) the distance from the phosphorylated
sites of interest, which are assigned as “seed” nodes,
and (ii) the topology of the multilayer heterogeneous network.[15] Functions that are highly ranked can be considered
more correlated with a set of seed nodes.In the multilayer
heterogeneous network, edges are drawn between sites based on similarity
in the regulatory pattern of their phosphorylation, which we determine
using k-means or fuzzy C-means clustering. For the protein layer,
a PPI network is constructed of the phosphorylated proteins and their
interactors using the STRING database.[30] Interactors are included to account for non-phosphorylated proteins
or those below the limit of detection of the experiment.[38] To get a comprehensive and specific list of
functional annotations, we identified closely connected groups of
proteins in the PPI network using module detection and performed functional
enrichment analysis on these modules. Edges between functional annotations
are drawn based on functional similarity and overlap.[31,33]The RWHN algorithm simulates a walker moving from a starting
node(s)
(called a seed) and then node to node through the multilayer network.
Each step is influenced by the probability of transition to another
layer (λ), the weighting of the protein and function layers
(ηP and ηF), and the probability
of restart [i.e., teleportation back to the seed
node(s), r]. The output of RWHN consists of a list
of ranks for all the nodes in the network based on the likelihood
of the walker reaching those nodes.To optimize the algorithm,
we ran RWHN with each of these parameters
(λ, r, ηP, and ηF) tested over a range of values (0.1–0.9), changing
one while setting all others to 0.5 (Figure S1A). Performance was decidedly stable over the range of ηP and ηF; however, altering λ and r resulted in differential ranking depending on the parameter
value. Previous work has suggested an ideal λ and r of 0.7,[15,37,39] while ηP and ηF were set to 0.7 and 0.3, respectively,
to prioritize movement in the protein layer and reduce the number
of terms having the same rank regardless of the seed.The parameters
used in the construction of the PPI network were
also assessed. Proteins interacting with the phosphorylated proteins
found in a given sample were extracted from STRING; those proteins
whose interaction had a STRING experimental confidence score above
a given threshold were included in the multilayer heterogeneous network.
To investigate the impact of choosing different STRING confidence
scores on the results, we tested a confidence score range of 0.1–0.9.
First, we looked at whether connectivity of the graph was altered
by the score. As expected, the greater the confidence score, the more
components the PPIs were split into; above a score of 0.7, the PPIs
had two or more components (Figure S1B).
We next investigated whether the RWHN results were affected by changing
the STRING confidence score increasing components within the protein
layer. Using weighted tau correlation[40] to compare the rankings from each set of results, we found that
the correlation was generally in the range of 0.8–1.0, with
the lowest value being 0.54 between networks constructed with 0.1
or 0.9 confidence threshold, which were the extremes of the range
tested.
Validation 1: Model of the MAPK/ERK Pathway
MAPK/ERK signaling has been well studied, and the temporal phosphorylation
status of component proteins in response to growth factor stimuli
is relatively established[41,42] (Figure A). Based on this general understanding,
we compiled a simple validation dataset of phosphorylation dynamics
at particular sites of the main signaling proteins within the pathway
(Figure B). Sites
were chosen based on whether their phosphorylation is known to activate
or inactivate protein activity, as recorded in the PSP database.[6] We clustered the data into five clusters, referred
to here as clusters 1–5, using the fuzzy c-means algorithm
(Figure B). The features
of the multilayer network that was constructed are summarized in Table . Enrichment of GOBP
terms was used to form the functional annotation layer of the network.
Figure 2
Site-
and gene-centric analysis of validation data. (A) Construction
of validation data was based on a traditional representation of the
MAPK/ERK pathway. (B) Validation dataset was constructed based on
the MAPK/ERK pathway, simulating phosphorylation dynamics over time
after pathway activation. The sites were clustered using fuzzy c-means
clusters. These data were used to construct a heterogeneous multilayer
network. (C) RWHN results with sites annotated to functional roles
in PSP; seed nodes are equivalent to the clusters. (D) ORA results
with modified proteins contained in toy data. (E) GO hierarchy, which
has been subset to include only the parents of terms that occurred
in either the RHWN or ORA analysis. Terms that occur in either set
of results are colored in purple or green, respectively, with terms
that occur in both colored in light blue.
Table 1
Multilayer Heterogeneous Network Constructed
from MAPK/ERK Validation Dataa
subnetwork
edges
nodes
site–site
28
19
protein–protein
932
313
function–function
148
88
site–protein
19
19 sites, 8 proteins
protein–function
1796
261 proteins, 298 functions
GOBP terms were included as functional
annotations.
Site-
and gene-centric analysis of validation data. (A) Construction
of validation data was based on a traditional representation of the
MAPK/ERK pathway. (B) Validation dataset was constructed based on
the MAPK/ERK pathway, simulating phosphorylation dynamics over time
after pathway activation. The sites were clustered using fuzzy c-means
clusters. These data were used to construct a heterogeneous multilayer
network. (C) RWHN results with sites annotated to functional roles
in PSP; seed nodes are equivalent to the clusters. (D) ORA results
with modified proteins contained in toy data. (E) GO hierarchy, which
has been subset to include only the parents of terms that occurred
in either the RHWN or ORA analysis. Terms that occur in either set
of results are colored in purple or green, respectively, with terms
that occur in both colored in light blue.GOBP terms were included as functional
annotations.We ran the
RWHN algorithm over the multilayer heterogeneous network
with seed nodes set to all the sites belonging to one of the clusters;
this was then repeated for each cluster (Figure C). The highest ranked GOBP term was the
same (“protein autophosphorylation”) regardless of the
seed nodes. However, it was possible to differentiate between the
five clusters based on the ranking of terms below the first one.Where available, we could use the biological process annotations
in PSP (v6.5.9.3) as a “benchmark” for the highly ranked
terms associated to each cluster. RPS6KA3_Y529, annotated to “apoptosis,
altered” in PSP was found in cluster 2, along with DUSP6 sites;
this cluster highly ranks “positive regulation of programmed
cell death”. Of the four RAF1 sites included in this model
dataset, two were annotated to “cell cycle regulation”
in PSP: RAF1_S289, in cluster 2, and RAF1_S338, in cluster 5. These
two clusters both rank “regulation of the mitotic cell cycle”
in the top 5% of terms.Largely due to the inherent step in
our method in which the redundant
terms are reduced in the multilayer heterogeneous network, fewer terms
were found in the results of RWHN than in ORA. We next wanted to assess
whether the terms in the RWHN result were drawn from a particular
level of the GO hierarchy; if they were, this would suggest a bias
toward more specific or broader terms compared to ORA. From the GO
hierarchy, we extracted the parent and ancestor terms of the terms
found in each set of the results (Figure E). We found that RWHN terms were drawn from
the same range of the hierarchy as the ORA terms, with slightly more
coming from lower in the hierarchy, indicating that the results have
more specificity than the ORA results. We note that there is also
less redundancy in the RWHN results, despite the ORA results also
being reduced for semantically similar/redundant terms. For example,
the terms “regulation of the ERK1 and ERK2 cascades”
and “ERK1 and ERK2 cascades” both come up in the ORA
results and are associated to different clusters.To assess
the robustness of the algorithm, we performed a random
permutation control. RWHN was run 100 times with the seeds set as
before and random permutations of the subnetworks and bipartite edges
maintained. Kernel density estimation (KDE) was calculated to assess
how often each GO term occurred at each rank in the 100 random permutations
for each set of seed nodes. In each case, there was a subset of terms
that were more likely to have a rank greater than 50; however, none
strongly correlated with the actual rankings for each set of seed
nodes (Figure S2A). This confirms that
the rankings are not random but primarily determined by the network
topology. We next investigated the rate of potential false positives
in the results. Previous implementations of RWHN, primarily using
genomics data, have utilized an arbitrary threshold of the top x ranked terms or the top x % of ranked
terms (based on the probability vector as calculated in eq ). A threshold that is too stringent
would not remove lower ranking terms from the results, while a too
low threshold would increase the number of false positives. Using
the results of RWHN run on the randomly permuted networks, we calculated
the probability that the “true” ranking (i.e., from RWHN run on the non-permuted network) was different from the
random rankings (p < 0.05, Mann–Whitney U test with Benjamini–Hochberg adjustment[43,44]). We found that most of the rankings differed from random (Figure S2B). The terms that had p > 0.05 were checked against the RWHN results with different thresholds
selected (top 1, 5, 10, and 15% of ranked terms); none were found
above the 1% threshold for any of the clusters, and only one term
(“peptidyl serine modification” in cluster 2) was found
above the 5 and 10% thresholds (Figure S2B). We chose to set the default threshold to the top 5% of terms to
be more conservative while retaining useful annotations.For
a given set of seed nodes, the algorithm is capable of giving
high ranks to functions with known associations to sites. It is also
capable of predicting non-random, reasonable functions for sites of
unknown functions in this context.
Validation
2: Classification of PhosphoSitePlus-Annotated
Sites
A comprehensive benchmark of dynamically regulated
phosphorylated sites is not currently available, due to the lack of
phosphorylated sites with a regulatory pattern universally defined
independent of the experimental conditions. Therefore, we used the
static classifications of phosphorylated site functions recorded in
PSP to benchmark the predictive power of our algorithm. The PSP database
has annotations to biological processes for ∼4000 human phosphorylated
sites. As these annotations are non-standard, each one was mapped
to the most closely related GOBP term, where available (Table S1). The majority of sites were annotated
to signal transduction, gene expression, or cell differentiation (Figure A). Of these, 1496 sites were annotated to more than one function;
as these annotations are manually assigned based on literature searches,
the multiple functions may be due to the differences in experimental
questions in the source datasets. The sites were clustered based on
which function they were annotated to and analyzed with RWHN and ORA.
The GOBP terms in the results of both methods were associated to the
child terms of each mapped PSP GO term, as per the GO hierarchy. The
terms in the results of both methods showed a similar distribution
to the terms in PSP as a whole, the main exception being in the cell
differentiation category (Figure B). The two methods showed a considerable difference
in the number of terms in the output, with ORA resulting in several
hundred more terms in the output than RWHN. To test whether the two
sets of results were drawn from the same parts of the GO hierarchy
(with those from the top of the hierarchy being broader and those
from the bottom being more specific), we constructed a tree based
on the GO hierarchy. The tree was then divided into levels, as in Figure E, to capture terms
that were of similar descriptiveness. Finally, we counted the number
of terms that came up in the RWHN or ORA results (Figure C). Both sets of results draw
from the same part of the hierarchy, with a skew toward the broader
terms. We therefore concluded that RWHN resulted in similarly descriptive
but more restricted and focused annotations than ORA. This was notable
given that RWHN did not associate any terms to the smaller categories
in PSP, such as exocytosis, endocytosis, cell growth, and autophagy
(Figure B).Distribution
of annotations and classifications of sites extracted
from PhosphoSitePlus. (A) Distributions of functionally annotated
sites (to mapped GO terms). (B) Distribution of terms in ORA results
and RWHN results when analyzing functionally annotated sites from
PSP. (C) Number of terms from ORA or RWHN that are from different
levels of the GO hierarchy.
Experimental Case Study 1: Dissecting EGF-
and TGF-α-Induced Dynamic Phosphorylation
Given that
PSP contains static annotations of sites from various sources, we
wanted to test our algorithm on experimental datasets with controlled
perturbations. We reasoned that this would allow us to assess whether
our algorithm could differentially associate sites to their context-dependent
function. The phosphoproteomics data retrieved from Francavilla et al.(21) describe the effect
in HeLa cells of stimulation with EGF or TGF-α over a period
of 90 min (with time points at 1, 8, 40, and 90 min). In this study,
one of the findings was that EGF or TGF-α stimulation of EGFR
induces receptor degradation or recycling, respectively. TGF-α
was associated with a more potent mitogenic and migratory response
than EGF over time.[21] Further details on
the experimental design and results from the publication can be found
in Supporting Information Table S2. Phosphorylated
sites from Supporting Information Table S1 of the original publication were filtered based on regulation by
EGF or by TGF-α and divided based on which stimuli the regulation
was dependent on. The separated data were clustered into six clusters
using the fuzzy C-means method, as per the original publication, with
each cluster representing a distinct dynamic profile of phosphorylation
(Figure S3). A multilayer heterogeneous
network was constructed for both sets of data, described in Tables and 3.
Table 2
Multilayer Heterogeneous Network Constructed
from Francavilla et al.(21) EGF-Regulated Phosphorylated Sitesa
subnetwork
edges
nodes
site–site
4347
733
protein–protein
16,654
4072
function–function
532
544
site–protein
733
733 sites, 433 proteins
protein–function
1223
115 proteins, 532 functions
GOBP terms were
included as functional
annotations.
Table 3
Multilayer Heterogeneous Network Constructed
from Francavilla et al.(21) TGFα-Regulated Phosphorylated Sitesa
subnetwork
edges
nodes
site–site
3085
675
protein–protein
16,556
3960
function–function
332
496
site–protein
675
675 sites, 403 proteins
protein–function
1015
109 proteins, 486 functions
GOBP terms were
included as functional
annotations.
GOBP terms were
included as functional
annotations.GOBP terms were
included as functional
annotations.Several biologically
relevant terms were differentially ranked
in the EGF and TGF-α networks (Figure A,C). Consistent with the original publication,
the term “regulation of ERK1 and ERK2 cascades” was
highly ranked using RWHN when seeds were set to sites in cluster 2
(representing EGF late responders) for the EGF network and in cluster
1 (representing TGF-α early responders) for the TGF-α
network. Despite their central role in TGF-α signaling, when
we performed standard ORA enrichment, terms related to MAPK/ERK were
not found in any of the clusters (Figure D). However, both ORA and our method highlighted
the role of the proteins belonging to TGF-α cluster 1 in regulating
EGFR/ERBB signaling, with terms referring to regulation of these pathways
ranked highly (Figure C,D). We used cluster 2 under the EGF conditions and cluster 1 under
the TGF-α conditions to verify the robustness and accuracy of
our approach. As the MAPK/ERK cascade is well studied in respect to
EGF/TGF-α signaling, many of its associated sites have documented
functions that we could use to benchmark our results. EGF cluster
2 contains sites of several known players of the MAPK/ERK cascades,
including RSP6KA3_S375, SOS1 (T1119 and S1137), RAF1 (S289, S96, and
S301), JUN_S243, and JUNB_259. All of these sites are predicted MAPK1/MAPK3
targets, with the exception of SOS1_T1119, which is a MAP2K1 target.[7] As this cluster represents those sites phosphorylated
late in the time course, we would expect ERK1/2 feedback regulation
to emerge. Besides confirming the term “regulation of ERK1
and ERK2 cascades” in EGF cluster 2, results from the RHWN
algorithm add detail to this picture, with “establishment of
protein localization to organelles” ranked highly in this cluster
(Figure A). This finding
confirms the conclusion of Francavilla et al. of
EGFR trafficking to the lysosome and subsequent degradation when stimulated
by EGF.[21] EGF cluster 2 is the only one
which did not highly rank “regulation of clathrin-dependent
endocytosis” compared to the five other clusters; as this cluster
represents sites that are highly phosphorylated late in the time course,
this coincides with the understanding of receptor endocytosis as a
process regulated in the early stage. This cluster is one of the five
that highly ranks “regulation of cell migration” (Figure A); PSP also annotates
several of the sites in this cluster to cell motility (PTPN12_S571,
SLC9A1_S703, and FLNA_S2152).
Figure 4
Site-centric and gene-centric analysis of Francavilla et
al.(21) datasets. (A) RWHN results
related to EGF-regulated sites. (B) ORA results related to proteins
modified upon EGF stimulation. (C) RWHN results related to TGF-α-regulated
sites. (D) ORA results related to proteins modified upon TGF-α
stimulation.
Site-centric and gene-centric analysis of Francavilla et
al.(21) datasets. (A) RWHN results
related to EGF-regulated sites. (B) ORA results related to proteins
modified upon EGF stimulation. (C) RWHN results related to TGF-α-regulated
sites. (D) ORA results related to proteins modified upon TGF-α
stimulation.TGF-α cluster 1, on the
other hand, contains sites that are
phosphorylated within the first 8 minutes upon TGF-α stimulation.
Accordingly, this cluster contains different members of the MAPK/ERK
cascade, such as the canonical activating sites of MAPK1, MAPK3, and
MAPK14 (also known as p38) and several sites on EGFR (EGFR_S991, EGFR_S995,
and EGFR_T693). A few of these sites, such as EGFR_T693, are annotated
to “receptor internalization” in PSP. This cluster also
contains several phosphorylated sites on known players in receptor
trafficking, including RAB7A_Y183 and RAB11B_Y8, which are highlighted
in the original publication,[21] along with
CBL_Y674 and SH3BP4_T118. Our algorithm ranked terms such as “endosomal
transport,” “establishment of protein localization to
organelles,” and “rab protein signal transduction”
high when seed nodes were set to the sites found in this cluster (Figure C). This was only
partially captured by the ORA (Figure D), which showed enrichment of the term “regulation
of vesicle mediated transport” but also “protein localization
to the lysosome”, the opposite of the experimentally derived
conclusion of Francavilla et al.EGF cluster
1 representing phosphorylated cycling sites ranks both
“regulation of cell migration” and “positive
regulation of cell migration” in the top 5% of terms but did
not have a similar association with trafficking terms as TGF-α
cluster 1 in the RWHN output. Consistent with Francavilla et al., we concluded that there was no association between
trafficking and regulation of migration upon EGF stimulation.[21] Upon further investigation of this cluster,
we found several sites that were shared with TGF-α cluster 1,
such as the MAPK1/3/14, EGFR_S991, and EGFR_T693 sites. We also found
the inhibitory RAF1 site S259 and SCRIB_1448. SCRIB has a putative
role in cell migration.[45,46]TGF-α clusters
4 and 5 also highly rank terms related to
intracellular trafficking. In the ORA analysis (Figure D), cluster 4 was also enriched for “regulation
of clathrin-dependent endocytosis”, corroborating this association.
Of particular interest in cluster 4 are the EGFR (Y1110, Y1125, and
Y1197), CBL (Y700), and CBLB (Y665, Y763, and Y889) sites. None of
the CBL/CBLB sites have roles described in PSP, but these proteins
are known regulators of receptors in trafficking via clathrin.[47,48] Moreover, EGFR_Y1110 and EGFR_Y1197
are both annotated to receptor internalization in PSP and inducing
cell motility. This is in line with “regulation of epithelial
cell migration” highly ranked in this cluster (Figure C). Taken together, these data
indicate that TGF cluster 4 sites may have an initial role in regulating
receptor internalization and migration, while the cluster 1 sites
are regulating later parts of the process.As a positive control
for assignment of functions to clustered
sites, sites annotated in PSP to the function “cell growth,
inhibited” were added as a “spike-in” cluster
to the EGF and TGF-α data, with sites already found in the data
excluded from this new cluster. RWHN and ORA were then applied as
done previously. We found that both RWHN and ORA could assign these
sites to the function “negative regulation of cell growth”
(Figure A). However,
ORA assigned 134 and 122 other terms to the sites added to the EGF
and TGF data, respectively, while RWHN assigned a refined list of
36 and 39, respectively, (Figure B). This result suggests that RWHN assigns more specific
functions to each phosphorylated site. Indeed, for each of the investigated
clusters, RWHN was able to assign sites belonging to those clusters
to functions that closely resemble those that have been experimentally
proven by Francavilla et al. or described in PSP.
The ORA results, though not inaccurate, assigned more generic biological
functions and did not assign specific terms. These results suggest
that the RWHN algorithm can distinguish between sites of known functions
and can assign functional terms to differentially regulated phosphorylation
sites that could be further investigated experimentally.
Figure 5
Cluster containing
sites annotated to “cell growth, inhibited”
in PhosphoSitePlus, spiked into Francavilla et al. networks. (A) Terms in common between ORA and RWHN results with
both the EGF and TGF-α networks. (B) Total number of terms in
the output of both types of analyses on both networks.
Cluster containing
sites annotated to “cell growth, inhibited”
in PhosphoSitePlus, spiked into Francavilla et al. networks. (A) Terms in common between ORA and RWHN results with
both the EGF and TGF-α networks. (B) Total number of terms in
the output of both types of analyses on both networks.
Experimental Case Study 2: Phosphorylation-Dependent
Response to Lapatinib Treatment and Resistance in Breast Cancer
To verify the applicability of the algorithm with non-temporal
shotgun phosphoproteomics data, we applied it to the experimental
dataset from Ruprecht et al.,[22] describing changes in the phosphoproteome upon treatment
with the breast cancer drug lapatinib in sensitive (“parental”
BT-474) or lapatinib-resistant (BT-474-J4) cells. The paper uncovered
and experimentally validated the role of several metabolic enzymes
and signaling pathways that were driving lapatinib resistance in breast
cancer. In particular, proteins that form the spliceosome, those involved
in glycolysis and glycogen catabolism, and PI3K/AKT/mTOR pathway members
were differentially phosphorylated in lapatinib-resistant cells compared
to those in parental lapatinib-treated cells. Further details on the
experimental design and results from the publication can be found
in Supporting Information Table S2.We first constructed a multilayer heterogeneous network, as described
in the Experimental Procedures, which included
1603 phosphorylated sites that showed significant changes in response
to lapatinib treatment in either the parental or resistant cell lines
(Table ). Data were
grouped into five clusters (referred to as clusters 1–5) using
the k-means method. The number of clusters to be used was determined
using the elbow plot method[29] (Figure S4A), confirming that five clusters were
sufficient to capture all features found in the data. Indeed, sites
sharing similar regulation profiles between the resistant and parental
cells were clustered together (Figure S4B). As the original publication used KEGG pathways rather than GO
terms, we incorporated these in the functional annotation layer of
our network to investigate the flexibility of our approach.
Table 4
Multilayer Heterogeneous Network Constructed
from Ruprecht et al.(22) Lapatinib-Regulated Phosphorylated Sites in Parental and Lapatinib-Resistant
Cell Linesa
subnetwork
edges
nodes
site–site
75,688
1599
protein–protein
12,752
3129
function–function
15
152
site–protein
1599
1599 sites, 930 proteins
protein–function
692
103 proteins, 152 pathways
KEGG pathways were
included as functional
annotations.
KEGG pathways were
included as functional
annotations.The results
of RWHN applied to the multilayer heterogeneous network
(Figure A) show that
when the seed nodes are set to sites belonging to cluster 4, representing
sites that are less phosphorylated in lapatinib-resistant cells, the
spliceosome pathway was highly ranked. This cluster interestingly
contained signaling sites such as ERBB2_Y1233, ERBB2_T1227, ERBB2_Y1233,
ERBB3_S627, and MAPK1_Y187, alongside sites on known spliceosome factors
such as SRRM2 (S454 and S2449), SRSF6 (S301), and CWC25 (S170). Figure B shows that there
was no enrichment for any pathways in this cluster using standard
ORA. However, ORA did show enrichment for metabolic terms in several
clusters, unlike RWHN, despite this being one of the clear conclusions
of the original paper. Metabolic pathways were enriched in clusters
3 (“glycolysis/gluconeogenesis”) and 5 (“central
carbon metabolism in cancer”) using ORA. Since much of the
investigation and experimental validation by Ruprecht et al.(22) was done on the sites found to be regulated
only in lapatinib-treated resistant cells, we investigated whether
our approach could uncover more nuanced roles of sites within these
datasets. We constructed a second network using only those sites that
were regulated in lapatinib-treated resistant cells (Table ). These were grouped into four
clusters (referred to as clusters 1–4) by k-means clustering
as mentioned before (Figure S4A,C), and
KEGG pathways were used again in the functional annotation layer.
Figure 6
Site-centric
and gene-centric analysis of Ruprecht et al.(22) datasets. (A) RWHN results related
to sites regulated by lapatinib in lapatinib-resistant and parental
cell lines. (B) ORA results related to proteins modified by lapatinib
in lapatinib-resistant and parental cell lines. (C) RWHN results related
to sites regulated by lapatinib in lapatinib-resistant cell lines
only. (D) ORA results related to proteins modified by lapatinib in
lapatinib-resistant cell lines only.
Table 5
Multilayer Heterogeneous Network Constructed
from Ruprecht et al.(22) Lapatinib-Regulated Phosphorylated Sites in the Lapatinib-Resistant
Cell Linea
subnetwork
edges
nodes
site–site
668,491
1910
protein–protein
14,770
3880
function–function
15
187
site–protein
1910
1910 sites, 1087 proteins
protein–function
874
96 proteins, 187 pathways
KEGG pathways were
included as functional
annotations.
Site-centric
and gene-centric analysis of Ruprecht et al.(22) datasets. (A) RWHN results related
to sites regulated by lapatinib in lapatinib-resistant and parental
cell lines. (B) ORA results related to proteins modified by lapatinib
in lapatinib-resistant and parental cell lines. (C) RWHN results related
to sites regulated by lapatinib in lapatinib-resistant cell lines
only. (D) ORA results related to proteins modified by lapatinib in
lapatinib-resistant cell lines only.KEGG pathways were
included as functional
annotations.As described
in Figure C, two sets
of seed nodes (clusters 2 and 3, representing
sites substantially upregulated and downregulated, respectively, Figure S4C) tended to rank metabolic pathways
more highly (“starch and sucrose metabolism”, “galactose
metabolism”, “pentose phosphate pathway”, “purine
metabolism”, and “amino sugar and nucleotide sugar metabolism”).
Cluster 3, along with cluster 1 (representing moderately downregulated
sites), ranked “spliceosome” highly. These initial interpretations
indicate that using a specific subset of the regulated sites in the
data yielded more specific and functionally relevant terms being ranked
highly by the RWHN algorithm. Moreover, only cluster 2 showed enrichment
for any metabolic pathways using ORA (Figure D). As this is a large cluster containing
all the sites shown to be significantly upregulated in the resistant
cell line, there is a considerable amount of “noise”
in the ORA results. The enrichment of two metabolic pathways (“glycolysis/gluconeogenesis”
and “central carbon metabolism in cancer”) potentially
corroborates the role of these sites as metabolism regulators, as
indicated by RWHN. This cluster contained several of the sites classified
as metabolic drivers of the resistance phenotype, such as ENO1_Y44,
PKM_Y175, and PDHA_S293/S300. Intriguingly, S17 on the diverse signaling
regulator SRC is also found in this cluster; this site is annotated
to induction of enzymatic activity in PSP.A closer look at
cluster 3 uncovered sites on spliceosome factors
(RBMX_S58, CDK13_S525, and CWC25_S17) and known signaling proteins
(ERBB2_Y1233 and FGFR4_S505). A site on the adaptor protein IRS2 (S1176)
was also found in this cluster, as a known player in glucose and lipid
metabolism.[49] This points toward the link
between rewired signaling and metabolism investigated by Ruprecht et al.(22) It is also annotated
to “metastatic potential” in PSP, indicating its therapeutic
potential. ORA also showed enrichment for the “spliceosome”
in cluster 1. Indeed, this cluster contained several of the sites
highlighted in the paper. For example, the known lapatinib-responsive
sites SF3B2_S309 and SF3B1_T261 could be found in this cluster, alongside
17 of the 30 SRRM2 sites that were significantly regulated by lapatinib.From the comparison of ORA and RWHN methods, we have extracted
meaningful associations between specific phosphorylated sites and
their functions in driving lapatinib resistance. Our algorithm also
highlighted potential further sites, such as IRS2_S1176, compared
to the ORA method, which may serve as hubs or intersections of crosstalk
between key signaling pathways and the rewired metabolic processes.
It can be noted that the RWHN output was substantially more refined
than the ORA output, with the conclusions of the original paper being
captured in 14 terms in our analysis of the sites regulated in resistance,
as opposed to 51 using ORA.
Experimental Case Study
3: Spatially Resolved
Activity of HRAS
The final dataset we analyzed described
immunoprecipitation (IP) samples, in order for us to see if our algorithm
would be useful for analyzing local and global phosphoproteomes. The
subcellular localization of signaling proteins has been established
as crucial in regulating cascades and cellular processes. There are
now numerous methods that allow experimental biologists to detect
spatial effects of proteins and phosphorylation, for example, through
proximity labeling and targeted proteomics.[50,51] Recent work by Santra et al.(23) took a similar approach to investigate the differential
roles of a mutant, constitutively active form of HRAS, HRASV12, at
different subcellular locations. HRASV12 constructs tagged with different
signal peptides (targeting them to different subcellular locations)
were stably transfected into HeLa cells; this allowed the authors
to collect multi-omics samples with endogenous HRAS, unlocalized HRASV12
and HRASV12 localized to either the plasma membrane in disordered
membrane regions (DM) or lipid rafts (LR) and the Golgi apparatus
(GA) and the endoplasmic reticulum (ER). Using this approach, it was
confirmed that the majority of HRASV12 activity is mediated from the
plasma membrane, with ER- and GA-localized HRASV12 involved in regulating
some of HRASV12’s role in cell survival. Further details on
the experimental design and results from the publication can be found
in Supporting Information Table S2.The phosphoproteomics data from this study were clustered using k-means
clustering, with an optimal cluster number of six determined using
an elbow plot (Figure S5A,B). Table describes the properties
of the multilayer network constructed with these clusters. The first
three clusters represented dynamics discussed in the original paper:
cluster 1 represented phosphorylated events of mutant HRAS regardless
of localization (location-independent); cluster 4 represented HRAS
in the secretory pathway, at the ER and GA (ER/GA); and cluster 6
represented HRAS at the plasma membrane, in the DM or LR (DM/LR).
As these experimental conditions were best characterized in the original
paper, downstream analysis focused on these three clusters.
Table 6
Multilayer Heterogeneous Network Constructed
Using Regulated Sites from the Santra et al.(23) Dataseta
subnetwork
edges
nodes
site–site
766
621
protein–protein
7262
2163
function–function
370
452
site–protein
621
621 sites, 474 proteins
protein–function
900
122 proteins, 442 functions
GOBP terms were
included as functional
annotations.
GOBP terms were
included as functional
annotations.Using RWHN
(Figure , in purple),
we found that the ER/GA and DM/LR clusters both highly
ranked key signaling pathways such as “ERK1 and ERK2 cascades”
and “TOR signaling”, alongside signaling regulation
such as “receptor-mediated endocytosis”. This is consistent
with the finding from the original paper that the majority of HRAS-mediated
effects derive from the localized population. Specifically, the DM/LR
cluster ranks mitogenic processes highly, such as “regulation
of epithelial cell proliferation” and “positive regulation
of metaphase/anaphase transition of the cell cycle”. The original
paper also concluded that the majority of HRAS’ mitogenic effects
were mediated from the plasma membrane. The ER/GA cluster independently
ranks the “positive regulation of growth” and “Notch
signaling pathway”. This potentially implicates that ER/GA
localized HRAS, in the known developmental roles of HRAS in development;
this was touched upon but not investigated in the original paper.
Figure 7
Site-centric
and gene-centric analysis of Santra et al.(23) datasets; results from RWHN and ORA
are in green and purple, respectively. Clusters shown in the figure
refer to phosphorylated sites that are location-independent or found
at the endoplasmic reticulum (ER)/Golgi apparatus (GA) or disordered
membrane (DM)/lipid rafts (LR) (marked in the figure as clusters 1,
2, and 3, respectively).
Site-centric
and gene-centric analysis of Santra et al.(23) datasets; results from RWHN and ORA
are in green and purple, respectively. Clusters shown in the figure
refer to phosphorylated sites that are location-independent or found
at the endoplasmic reticulum (ER)/Golgi apparatus (GA) or disordered
membrane (DM)/lipid rafts (LR) (marked in the figure as clusters 1,
2, and 3, respectively).Conversely, when a term
is ranked highly in all three clusters,
this could indicate that this is part of the localization-independent
regulation of HRAS. A total of 20 of the highly regulated terms fit
this definition; of these, several are related to immunological processes,
such as T-help cell lineage commitment” and “cellular
response to interleukin-25”. The location-independent cluster
also independently ranks the “interleukin-27-mediated signaling
pathway” and “STAT cascade”. This suggests that
HRAS’ role in regulating cytokine and immune response is not
location-dependent.An obvious outcome of the results using
RWHN and ORA on these data
is the difference in the number of terms associated to each cluster.
There were 132 terms describing these clusters with ORA (Figure , in green), compared
to 86 with RWHN, with many of the enriched terms related to the over-represented
categories depicted in Figure , such as “gene expression”. Several comparisons
can be drawn between the results of the two analyses to add credence
to the accuracy of results from RWHN. For example, cluster 1 shows
enrichment for the term “interleukin-6-mediated signaling pathway”,
and cluster 3 is enriched for the term “regulation of the mitotic
cell cycle”. These results from ORA are consistent with the
results from RWHN and the original paper, however, with a significant
amount of noise and redundancy in the ORA output relative to that
of RWHN.
Discussion
Phosphorylation
has an important impact on protein functions and
thus the cellular behavior. Within a network of kinase–substrate
interactions, phosphorylation modulates the flow of information and
regulation of disparate processes throughout the cell. Understanding
the impact of phosphorylation on protein functions traditionally required
the experimental investigation and manipulation of individual sites;
however, with the advent of high-throughput phosphoproteomics, thousands
of novel phosphorylation sites of unknown functions have been discovered.
As there is limited experimentally validated information available
on how regulation at individual phosphorylated sites impacts the downstream
cellular output, current methods for analyzing high-throughput phosphoproteomics
data rely on general descriptors of phosphorylated protein functions
such as GO terms or involvement in known pathways. Their use in proteomics
and transcriptomics analysis, although widespread, has come under
recent criticism. Maertens et al.(52) demonstrated that annotations are lacking for a substantial
number of genes known to be important in cancer, while Haynes et al.(53) demonstrate a “rich-getting-richer”
effect, with well-studied genes having higher annotation rates despite
having less omics or molecular evidence for the association with processes
or diseases. These problems are exacerbated when used in analyzing
phosphoproteomics data, as enrichment analyses also disregard site-specific
or multiple-site regulation, masking the roles of proteins in a particular
modification state.[10,12]Here, we developed and
tested a method that associates phosphorylated
sites to potential functions using RWHN. By incorporating the pattern
of phosphorylation upon perturbation, we consider more of the information
available in phosphoproteomics datasets and establish the phosphorylated
sites as the key drivers in the functional analysis. The algorithm
is capable of recapturing experimentally proven functions of phosphorylated
sites in a non-gene-centric manner, refining analysis and extracting
more information from phosphoproteomics data.To prove the utility
of our algorithm, we applied it to a validation
network, which represented a simple model of phosphorylation dynamics
in the MAPK/ERK pathway, and to two previously published phosphoproteomics
dataset describing temporal cellular signaling events and the impact
of resistance to drug treatment in breast cancer. Our algorithm successfully
distinguished between differentially regulated phosphorylated sites
from the same protein and associated them to both known and previously
uninvestigated functions. For example, when applying this approach
to the data from Francavilla et al.,[21] we can recapture the roles of EGFR sites with functions
described in PSP. Both EGFR_Y1110 and EGFR_Y1197 are annotated to
receptor internalization and cell motility in PSP; in our analysis,
we found that these sites are both found in clusters that associate
with “regulation of clathrin-dependent endocytosis”;
however, only the TGF-α cluster also ranked “regulation
of cell migration” highly, reproducing the experimental evidence
from Francavilla et al. While the authors of the
original manuscript associated these sites with their functions using
a combination of gene-centric approaches and experimental validation,
here, we use a single site-centric algorithm to extract these associations
directly from the phosphoproteomics dataset, demonstrating the power
of our approach in quickly and more specifically narrowing down candidates
for further functional studies. By associating ligand-dependent regulation
of phosphorylated sites, it is possible to disentangle their multifaceted
role in regulating cellular signaling networks and downstream cellular
behaviors.From the Ruprecht et al. data, we
can use the
example of SRRM2 to illustrate the validity of our approach. This
component of the spliceosome is known to be highly phosphorylated,
with 675 phosphorylation sites recorded in PSP. At the time of writing,
none of these sites were ascribed to a function in the database, despite
the importance of phosphorylation and dephosphorylation events in
orchestrating splicing events.[54] It is
also regularly mutated (>5% of cases in TGCA) in lung, stomach,
bladder,
endometrial, and colorectal cancers.[6] Ruprecht et al. uncovered a new role of the spliceosome in modulating
lapatinib resistance, highlighting several sites on spliceosome proteins
(including SRRM2_S1132, S1987, and S970) that were differentially
modified in resistant and parental cell populations. Previous work
has described how SRRM2 depletion in HER2-positive breast and ovarian
cancer cells reduced the rate of migration;[55] the spliceosome plays myriad roles in the breast cancer environment,
but site-specific analysis is lacking. Using our algorithm, we can
begin to uncover the nuanced modification-specific roles of proteins
such as SRRM2 in metabolic rewiring. We suggest that 11 of the SRRM2
sites found in cluster 2 of the lapatinib-resistant sites (Figure C) could be at the
intersection between the spliceosome and metabolic processes; the
conjunction between these processes was central to the work of Ruprecht et al. The sites in this cluster could be investigated as
points of crosstalk between these different cellular processes. This
example highlights the value of our approach in generating hypotheses
of modified protein functions and signaling pathway crosstalk.Although the associations predicted here between sites and functions
cannot be interpreted as causative, phosphorylation sites can act
as indirect and context-dependent regulators in cellular processes.
This is demonstrated by the clustering of SRC_S17, a protein that
is typically associated with signaling downstream of cell–surface
receptors, with a cluster that ranked metabolic pathways higher than
cellular processing or signaling pathways. Critical to this is a robust
method of defining clusters, to capture meaningful regulatory patterns
at different phosphorylated sites. Here, we define clusters using
established techniques (e.g., elbow method for k-means)
or visually identifying them with differing time-resolved behaviors.
Using clusters to define edges between phosphorylated sites in the
multilayer heterogeneous network, the pattern of experimentally perturbed
modules in the cellular system can be investigated. Although having
fixed edges between sites, as with the PPI layer of the multilayer
network, would likely allow for a closer representation of the whole
cellular system, this would not be computationally viable or in the
interest of experimental biologists. This algorithm is targeted toward
datasets that describe the effect of specific treatments or perturbations,
allowing for a focused view on the impacted processes to identify
candidates to investigate further. Quantitatively defining the edges
between phosphorylated sites based on the dataset collected also prevents
the issues plaguing standard enrichment analyses as discussed before;
fixed edges, information provided in databases or previously published
data, would almost certainly skew the analysis toward well-studied
phosphorylated sites and biological processes.Multilayer heterogeneous
networks are increasingly being used to
integrate omics data; here, their use allows phosphoproteomics data
to be incorporated with PPI networks and functional annotations, overcoming
the issue of considering phosphoproteomics data primarily on the protein
level. A potential drawback of our approach is the reliance on large
semi-curated resources such GO or STRING. For instance, different
clusters may have many terms or pathways in common given the involvement
of many proteins in the same biological functions and the high proportion
of frequently used GO terms.[56] We theorize
that this may be rectified if the data were clustered into more groups,
in order to capture more nuanced phosphorylation patterns. However,
enrichment of non-specific terms or false positives remains an issue
when analyzing high-throughput omics data by other commonly used methods
too. By only including edges in the protein layer of the network with
a STRING experimental confidence score of greater than 0.4 and only
including non-redundant terms in the function layer, we have reduced
the noise in the resulting list of associated terms. This could be
taken further by filtering for only high-confidence functional annotations
or incorporating annotations from multiple sources into the network.
Moreover, RWHN does not result in a different set of terms in the
output compared to the accepted standard method ORA but rather provides
a refinement that will allow for biological insights to be reached
more clearly.Our method is flexible enough to be used with
any discovery phosphoproteomics
data that describe a change between conditions. This is an improvement
in the previously published use of RWHN using multiple sources of
phosphoproteomics data to uncover disease-dependent regulation.[19] Moreover, it could easily be generalized to
any post-translational modification proteomics dataset, as it incorporates
readily available PPI and functional annotation data, as demonstrated
here. The fundamental aspect would be maintained with any of these
expansions: specific patterns of regulation at modified sites dictate
movement through the multilayer heterogeneous network.
Conclusions
We have proposed a site-centric approach to
analyze phosphoproteomics
data that provides a robust alternative to gene-centric methods of
analysis. We integrated clustered quantitative phosphoproteomics data,
a context-specific PPI network, and functional annotations into a
multilayer heterogeneous network and used the RWHN method to predict
the functions of phosphorylation sites with similar regulatory patterns.
Using our algorithm, we extracted experimentally validated associations
between phosphorylated sites and their role in cellular processes,
which could not be captured using the typical gene-centric methods.
Moreover, our algorithm has the potential to be used by researchers
in predicting novel site–function associations and generating
hypotheses to be experimentally validated.
Authors: Karsten Krug; Philipp Mertins; Bin Zhang; Peter Hornbeck; Rajesh Raju; Rushdy Ahmad; Matthew Szucs; Filip Mundt; Dominique Forestier; Judit Jane-Valbuena; Hasmik Keshishian; Michael A Gillette; Pablo Tamayo; Jill P Mesirov; Jacob D Jaffe; Steven A Carr; D R Mani Journal: Mol Cell Proteomics Date: 2018-12-18 Impact factor: 5.911
Authors: Rune Linding; Lars Juhl Jensen; Adrian Pasculescu; Marina Olhovsky; Karen Colwill; Peer Bork; Michael B Yaffe; Tony Pawson Journal: Nucleic Acids Res Date: 2007-11-02 Impact factor: 16.971
Authors: Sabine S Neukamm; Rachel Toth; Nick Morrice; David G Campbell; Carol Mackintosh; Rainer Lehmann; Hans-Ulrich Haering; Erwin D Schleicher; Cora Weigert Journal: PLoS One Date: 2012-08-17 Impact factor: 3.240