The plasma proteome has the potential to enable a holistic analysis of the health state of an individual. However, plasma biomarker discovery is difficult due to its high dynamic range and variability. Here, we present a novel automated analytical approach for deep plasma profiling and applied it to a 180-sample cohort of human plasma from lung, breast, colorectal, pancreatic, and prostate cancers. Using a controlled quantitative experiment, we demonstrate a 257% increase in protein identification and a 263% increase in significantly differentially abundant proteins over neat plasma. In the cohort, we identified 2732 proteins. Using machine learning, we discovered biomarker candidates such as STAT3 in colorectal cancer and developed models that classify the diseased state. For pancreatic cancer, a separation by stage was achieved. Importantly, biomarker candidates came predominantly from the low abundance region, demonstrating the necessity to deeply profile because they would have been missed by shallow profiling.
Keywords:
SWATH; cancer; clinical proteomics; data-independent acquisition; depletion; high throughput; label-free quantification; library; plasma proteomics; single shot; stable isotope-based quantification
Proteins control most
biological processes in life. Alterations
in their expression level, localization, and proteoforms are often
correlated with disease onset and progression.[1] In humans and animals, blood flows through virtually all tissues.
Therefore, it has the potential to indicate the health state of any
inner organ, even those not accessible from the outside. Blood is
readily obtainable with minimal invasive sampling, and large biobanks
exist for retrospective analyses.[2] Clinical
analysis of blood is the most widespread diagnostic procedure in medicine,
and blood biomarkers are used to diagnose diseases, categorize patients,
and support treatment decisions. While proteins (6–8%) are
by far the second major component of plasma after water (90–92%),
metabolic, lipidomic, transcriptomic, and genomic readouts are also
gaining traction as diagnostic tests in plasma.[3−6] Different omics readouts can be
used in conjunction to improve diagnostic power.[7] Despite more than 20,000 diseases reported to affect humans,[8] it is only for a small fraction of them that
accurate, sensitive, and specific diagnostic tests exist.

The
limited success of blood protein biomarkers is primarily due
to analytical challenges that come with the proteomic analysis of
blood plasma. On the one hand, the large biological variance between
individuals and within individuals over time makes the discovery of
reliable biomarker signatures difficult.[9−12] On the other hand, the wide dynamic range
of human plasma, estimated at 12–13 orders
of magnitude,[13] renders comprehensive proteome
profiling challenging to any analytical technique. In the lower concentration
range, thousands of proteins reside, mostly tissue leakage proteins
and signaling molecules that could serve as biomarkers but are very
challenging to measure, especially in an unbiased manner.[14,15]

Mass spectrometry (MS)-based plasma analysis provides an unbiased,
quantitative, and therefore ideal technology for the system-wide characterization
of the proteome.[16] Recently, technological
developments in sample preparation, chromatography, and acquisition
enabled automated, large-scale plasma projects of hundreds of specimens
that have resulted in reproducible findings.[15,17−20] These approaches share a shallow depth of proteome coverage, reaching
a maximum of about 600 proteins identified and quantified in a sample.
From qualitative analysis, disproportionately more proteins were found
to be present in the lower abundance region of plasma than in the
higher concentration range.[14] Novel MS-based
approaches have been developed to improve analytical depth while retaining
quantitative information. These include the depletion of high-abundance
proteins, the enrichment of low abundant proteins of interest, and
prefractionation.[21] Still, they have yet
to reach the throughput level needed to measure larger cohorts of
clinical samples. Automation and depletion, batch, and quality
control have been tackled[18,22,23] but require further improvement for large-scale studies. In summary,
while current plasma proteome biomarker research approaches mostly
cover the first few hundred proteins by concentration, rigorous experimental
design and comprehensive, large-scale quantitative studies will achieve
generalizable biomarker discovery.[16]

Screening for the most common cancer types cannot be done in a
routine and population-wide manner. To date, only a few nonideal,
validated biomarkers exist in clinical use.[24] A significant challenge is that generally, only a single analyte
or metric is measured despite the known heterogeneity of cancer. Biomarkers
that accurately enable early detection in asymptomatic subjects, reflect
cancer aggressiveness at diagnosis, and improve risk stratification
are urgently needed.[24] Despite the medical
need, plasma biomarker candidates for cancer are rarely validated
or transferred to the clinic. Recent examples are as follows: Zhang
et al. performed discovery proteomics in the plasma of 10 patients
with colorectal cancer, discovered 72 biomarker candidates, and then
performed a successful follow-up verification for prognostic markers
with 419 patients using an immunoassay.[25,26] Enroth et
al. found plasma protein biomarker signatures for ovarian cancer[27] but performed no validation. He et al. showed
that for hepatocellular carcinoma and cholangiocarcinoma, biomarker
candidates could be identified from plasma; the validation of these
candidates is still pending.[28] Zhou et
al. identified biomarkers for early gastric cancer from a small sample
set, but validation is still pending.[29] For prostate cancer, a blood diagnostic test was successfully developed
based on discovery proteomics and is now being used in the
clinic.[30] For the detection of early ovarian
cancer, the OVA1 test was developed and approved, where the measurement
of β-2 microglobulin, apolipoprotein A1, serum transferrin, and
prealbumin is combined with the previously established marker CA125
to deliver better care.[31,32] This case exemplifies
that multimeasurement techniques are expected to outperform single
biomarker panels. Furthermore, single protein biomarkers are rarely
specific for a single disease, e.g., α fetoprotein is diagnostic
in liver cancer, but the biomarker is not specific, as it is altered
in other liver diseases and ovarian and testis cancers.[33] Rarely, there are highly specific biomarkers
such as β subunit HCG (β-HCG), which is a serum marker
for testicular carcinoma as β-HCG is never detected in the circulation
of healthy males.[34] To make plasma biomarker
discovery more efficient and successful, the comprehensive profiling
and validation of large cohorts of plasma proteomes need to be significantly
improved with new approaches.[16] The expected
outcome is new biomarkers that will allow early cancer detection and
prediction of the probable response to therapy (in precision medicine).

We demonstrate a novel, automated analytical approach for plasma
profiling to a depth of 2732 proteins in the presented cancer study,
reaching deep into the tissue leakage and signaling molecule ranges.
We demonstrate the identification and quantification benefits over neat
plasma profiling through a controlled quantitative experiment. Further,
we profiled deep into the tissue leakage proteome of plasma samples from
both healthy individuals and patients with one of the five most deadly
solid tumors in the United States.[35] A
biomarker analysis with machine learning revealed candidates and models
able to classify healthy and diseased samples. The discovered biomarker
candidates predominantly came from low abundance protein regions,
clearly demonstrating the need to measure deeply because they would
have been missed by shallow plasma profiling.
Experimental Procedures
Ethics
The Cantonal Ethics Committee for Research on
Human Beings, Zürich, Switzerland, approved the study protocol
to be performed (proteomic analysis of plasma samples (2020-02892)).
Cohort Selection and Study Design
Cohort selection
and experimental design were driven by sample availability in commercial
repositories. For each cancer type, 30 matching samples were selected
and split into early (nonmetastatic stages IA–IIC) and late
(nonmetastatic stage IIIA–C) groups. Prior to the analysis,
normal individuals were matched for age, sex, and whenever possible
balanced across ethnicities to both early and late groups for each
cancer type. Healthy samples are self-declared healthy. This resulted
in three equal control groups (n = 15) with overlapping
individuals, namely, breast cancer control, prostate cancer control,
and remaining cancer control. Matching was done manually using the
χ² test or ANOVA with a p-value threshold of
0.05 (R-package “tableone”).
Sample Preparation of the Pan-Cancer Cohort
One hundred and eighty human plasma samples were obtained from Precision for Medicine
and its subsidiaries (Norton USA), Discovery Life Sciences (Huntsville),
and ProteoGenex (Los Angeles). Due to limited availability, samples
were not balanced across suppliers; collection procedures and handling
until storage at −80 °C are considered to be the same
in the case of all three providers (Supporting Information Table 1). All samples were handled equally and
thawed twice. During the aliquoting, a small amount of each sample
was pooled. This quality control sample was subsequently used for
the library generation and to assess the quality and batch effects
throughout the sample preparation and acquisition. The processing
batches were block-randomized for disease status, diseased state,
gender, and ethnicity (only relevant for breast cancer samples) and
kept for the entire sample preparation.

Depletion was performed
using the Agilent multi affinity removal column human-14, 4.6 ×
50 mm2 (Agilent Technologies) set up on a Dionex Ultimate
3000 RS pump (Thermo Fisher Scientific) and run according to the manufacturer’s
instructions. Briefly, the plasma was diluted 4:1 with buffer A for
multiple affinity removal LC columns (Agilent Technologies) and filtered
through a 0.22 μm hydrophilic PVDF membrane filter plate (Millipore)
before 70 μL was injected onto the column. The gradient was
27.5 min long, with the collection occurring between 3.6 and 9.2 min,
a flow rate of 1 mL/min between 11 and 26.5 min and 0.125 mL/min during
the rest of the gradient, and buffer B for multiple affinity removal
LC columns (Agilent Technologies) only in the time period of 13–17.5
min (100% buffer B). Well-spaced within each processing batch, we
depleted the quality control sample three times and treated it as
a separate sample thereon (depletion control samples).

Following depletion, we digested the samples with protein aggregation
capture using a KingFisher Flex (Thermo Fisher Scientific).[36] To assess digestion reproducibility, we mixed
two extra depletions of the quality control sample before splitting
the mixture into digestion triplicates (digestion control samples). The acidified
peptide mixtures were loaded for clean-up into MacroSpin C18 96-well
plates (The Nest Group), desalted, and eluted with 50% acetonitrile.
Samples were dried in a vacuum centrifuge and solubilized in 0.1%
formic acid and 1% acetonitrile with Biognosys’s iRT and PQ500
kits (Biognosys) spiked following the manufacturer’s instruction.
Prior to DIA mass spectrometric analyses, peptide
concentrations were determined using a UV/vis spectrometer at 280
nm/430 nm (SPECTROstar Nano, BMG Labtech), and the samples were centrifuged at 14,000g at 4 °C for 30 min.
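The block randomization of processing batches mentioned in this section can be illustrated with a short sketch. The text does not specify how the randomization was implemented, so everything below is an assumption for illustration: the function name `block_randomize`, the toy cohort, and the choice of disease status and sex as the only blocking factors are hypothetical.

```python
import random
from collections import Counter, defaultdict


def block_randomize(samples, stratum_of, n_batches, seed=0):
    """Deal samples into batches so every stratum is spread evenly.

    samples    : list of sample records (any objects)
    stratum_of : function mapping a sample to a hashable, sortable stratum key
    n_batches  : number of processing batches
    """
    rng = random.Random(seed)  # fixed seed keeps the layout reproducible

    # Group the samples by stratum (here: disease status x sex).
    by_stratum = defaultdict(list)
    for sample in samples:
        by_stratum[stratum_of(sample)].append(sample)

    batches = [[] for _ in range(n_batches)]
    position = 0  # running counter so overall batch sizes stay balanced
    for key in sorted(by_stratum):
        members = by_stratum[key]
        rng.shuffle(members)  # random order within the stratum
        for sample in members:
            batches[position % n_batches].append(sample)
            position += 1

    # Randomize the processing order inside each batch as well.
    for batch in batches:
        rng.shuffle(batch)
    return batches


# Hypothetical toy cohort: 3 disease statuses x 2 sexes x 6 samples each.
cohort = [
    {"id": f"{status}-{sex}-{i}", "status": status, "sex": sex}
    for status in ("healthy", "early", "late")
    for sex in ("F", "M")
    for i in range(6)
]
batches = block_randomize(cohort, lambda s: (s["status"], s["sex"]), n_batches=3)
per_batch = [Counter((s["status"], s["sex"]) for s in batch) for batch in batches]
```

Dealing each shuffled stratum round-robin keeps the per-batch counts of every stratum within one sample of each other, which is the property a blocked design is meant to provide.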
Sample Preparation of the Controlled Quantitative Experiment
The controlled quantitative experiment was generated from 20 healthy
human EDTA K3 plasma samples obtained from Sera Laboratories International
Ltd. (West Sussex, U.K.). Saccharomyces cerevisiae (S. cerevisiae) was lysed in 100
mM HEPES pH 7.4, 150 mM KCl, 1 mM MgCl2, by shear force
passing through a gauge 12 syringe 15 times on ice before filtering
(0.2 μm). Escherichia coli (E. coli) was lysed with a cell cracker before filtering
(0.2 μm). After protein concentration determination using a
UV/vis spectrometer at 280 nm (SPECTROstar Nano, BMG Labtech), each
sample was spiked with fixed ratios of E. coli and S. cerevisiae leading to a synthetic
1:2- and 4:3-fold change, respectively. To 20 μL of plasma (∼1200
μg proteins), 40 or 30 μg of S. cerevisiae and 12 or 24 μg of E. coli lysate
were added for conditions A and B, respectively. The resulting 40
samples were diluted 4:1 with buffer A for multiple affinity removal
LC columns (Agilent Technologies), filtered through a 0.22 μm
hydrophilic PVDF membrane filter plate (Millipore). Seventy microliters
was used for depletion as described above followed by filter-aided
sample preparation (FASP)[37] and 30 μL
for the neat plasma comparison. The diluted neat plasma sample was
precipitated by adding a fourfold excess of cold acetone (v/v) followed by overnight
incubation at −20 °C. The pellet was subsequently washed
twice with cold 80% acetone in water (v/v). After air-drying the pellet,
the proteins were resuspended in 50 μL denaturation buffer (8
M urea, 20 mM TCEP, 40 mM CAA, 0.1 M ABC), sonicated for 5 min (Bioruptor
Plus, Diagenode, 5 cycles high, 30 s on, 30 s off), and incubated
at 37 °C for 60 min. Upon dilution with 0.1 M ABC to a final
urea concentration of 1.4 M, the samples were digested overnight with
2 μg of sequencing-grade trypsin (Promega), and the trypsin was inactivated
by adding TFA to a final concentration of 1% v/v. Peptide clean-up
was carried out as described above.
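The theoretical fold changes of the spike-in proteomes follow directly from the lysate amounts stated above. The sketch below recomputes them on a log2 scale; the orientation (condition A relative to condition B) and the names `SPIKE_UG` and `expected_log2_ratio` are choices made here for illustration, not part of the described workflow.

```python
import math

# Spiked lysate amounts in micrograms per 20 uL of plasma, as stated in the text.
SPIKE_UG = {
    "S. cerevisiae": {"A": 40.0, "B": 30.0},
    "E. coli": {"A": 12.0, "B": 24.0},
}


def expected_log2_ratio(species, numerator="A", denominator="B"):
    """Theoretical log2 fold change between two conditions for one species."""
    amounts = SPIKE_UG[species]
    return math.log2(amounts[numerator] / amounts[denominator])


yeast_log2 = expected_log2_ratio("S. cerevisiae")  # log2(40/30), about 0.415
ecoli_log2 = expected_log2_ratio("E. coli")        # log2(12/24), exactly -1
```

Swapping `numerator` and `denominator` flips the sign of both values, so the orientation must be fixed before comparing measured and theoretical ratios.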
Library Generation
High pH reverse-phase (HPRP) fractionation
was performed using a Dionex UltiMate 3000 RS pump (Thermo Fisher
Scientific) on an Acquity UPLC CSH C18 1.7 μm, 2.1 × 150
mm2 column (Waters) at 60 °C with a 0.3 mL/min flow
rate. Prior to loading, the pH of 300 μg of pooled depleted
samples was adjusted to pH 10 by adding ammonium hydroxide. The
gradient used was 1–40% solvent B in 30 min; solvents were A: 20
mM ammonium formate in water, B: acetonitrile. Fractions were taken
every 30 s and sequentially pooled to 20 fraction pools. The fraction
pools were then dried down and resuspended in 0.1% formic acid and
1% acetonitrile with Biognosys’s iRT kits spiked according
to the manufacturer’s instruction. Before data-dependent acquisition
(DDA) mass spectrometric analyses, peptide concentrations were determined,
and the samples were centrifuged as described above.
Mass Spectrometric Acquisition
For data-independent acquisition (DIA) LC-MS measurements for the controlled quantitative
experiment, 1 μg of peptides per sample was injected onto an
in-house-packed reverse-phase column (PicoFrit emitter) with a 75
μm inner diameter, 60 cm length, and 10 μm tip from New
Objective, packed with the Reprosil Saphir C18 1.5 μm phase
(Dr. Maisch, Ammerbuch, Germany) on a Thermo Fisher Scientific EASY-nLC
1200 nanoliquid chromatography system connected to a Thermo Fisher
Scientific Orbitrap Exploris 480 mass spectrometer equipped with a
Nanospray Flex ion source. The DIA method was adopted from Bruderer
et al.[38] and consisted of one full-range
MS1 scan and 29 DIA segments.

For DDA and DIA LC-FAIMS-MS/MS
measurements, 4 μg of each sample was separated using a self-packed
analytical PicoFrit column (75 μm × 50 cm length) (New
Objective, Woburn, MA) packed with ReproSil Saphir C18 1.5 μm
(Dr. Maisch GmbH, Ammerbuch, Germany) with a 2 h segmented gradient
using an EASY-nLC 1200 (Thermo Fisher Scientific). LC solvents were
A: water with 0.1% FA; B: 20% water in acetonitrile with 0.1% FA.
The nonlinear 2 h LC gradient was 1–59% solvent
B in 120 min followed by 59–90% B in 10 s, 90% B for 8 min,
90 to 1% B in 10 s and 1% B for 5 min at 60 °C, and a flow rate
of 250 nL/min. The samples were acquired on an Orbitrap Exploris 480
mass spectrometer (Thermo Fisher Scientific) equipped with a FAIMS
Pro device (Thermo Fisher Scientific) using methods based on ref (39). If not specified differently,
the FAIMS-DIA method contained three FAIMS CV (−35, −55,
and −75 V) parts, each with a survey scan of 120,000 resolution
with 20 ms max IT and an AGC of 3 × 10⁶ and 35 DIA
segments of 15,000 resolution with IT set to auto and AGC set to custom
1000%. The mass range was set to 350–1650 m/z, the default charge state was set to 3, the loop
count was set to 1, and the normalized collision energy was set to
30. For the acquisition of the fractionated sample for the library,
a DDA method was applied. The DDA method consisted of three FAIMS
CVs (−35, −55, and −75 V): each contained a DDA
experiment with 60,000 resolution of MS1, 15,000 resolution of MS2,
with a fixed cycle time (1.3 s), IT set to AUTO, and AGC set to custom
500%.[40]
Mass Spectrometric Data Analysis
Database Search for Library Generation
DIA and DDA mass spectrometric data were analyzed using SpectroMine software (version
3.0.2101115.47784, Biognosys) using the default settings, including
a 1% false discovery rate control at PSM, peptide, and protein levels,
allowing for two missed cleavages and variable modifications (N-term
acetylation and methionine oxidation). The human UniProt.fasta database
(Homo sapiens, 2020-07-01, 20,368 entries) was used,
and for the library generation, the default settings were used except
for the use of a top 300 precursors per protein filter.
Quantitative Analysis of Data-Independent Acquisition
Raw mass spectrometric data were first converted using the HTRMS
Converter (version 14.3.200701.47784, Biognosys) and then analyzed
using software Spectronaut (version 15.0.210108, Biognosys) with the
default settings, but Q-value sparse filtering was enabled with a
global imputing strategy and a hybrid library comprising all DIA and
DDA runs conducted in this study.[41] The
imputing strategy defines how to estimate the missing values (identifications
not fulfilling the FDR threshold), and with the global imputing strategy,
the missing values are imputed based on random sampling from a distribution
of low abundant signals taken across the entire experiment (lowest
10th percentile ±1 standard deviation).[42] Default settings include peptide and protein level false discovery
rate control at 1% and cross-run normalization using global normalization
on the median. Including a high number of quality control samples
(depletion, digestion, and injection controls) enabled the investigation
of batch effects and quantification of the introduced variability
at each step. No batch effect was identified by either principal component
analysis (PCA, “stats” R-package) or hierarchical clustering.

Controlled quantitative experiment (CQE) DIA data were analyzed using the directDIA approach of Spectronaut
software (version 15.0.210108, Biognosys) using the default settings,
including a 1% false discovery rate control at PSM, peptide, and protein
levels, allowing for two missed cleavages and variable modifications
(N-term acetylation and methionine oxidation). The directDIA approach
within Spectronaut is an implementation with minor improvements of
the published DIA Umpire approach.[43] The
combined human, E. coli, and S. cerevisiae.fasta databases with the removal of
the overlapping tryptic sequences (Homo sapiens 2020-08-31,
96,996 entries; Saccharomyces cerevisiae (strain ATCC 204508/S288c), 6078 entries; Escherichia
coli (strain K12), 4857 entries; Combined, 96,637 entries) were used, and for the library generation, the
default settings were used except for the Q-value sparse filtering
enabled with a global imputing strategy and cross-run normalization
using global normalization on the median based solely on the human
identifications.

When we refer to proteins, we mean protein groups as determined
by the IDPicker algorithm[44] as implemented
in Spectronaut.
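The cross-run median normalization and the global imputing strategy described in this section can be sketched as follows. This is an illustrative simplification and not Spectronaut's implementation: the function names are hypothetical, normalization is modeled as a single scaling factor per run, and missing values are drawn from the pool of the lowest 10% of observed intensities across the whole experiment without modeling the ±1 standard deviation detail.

```python
import random
import statistics


def median_normalize(runs):
    """Scale every run so that its median equals the median of all run medians."""
    run_medians = {
        run: statistics.median(v for v in values.values() if v is not None)
        for run, values in runs.items()
    }
    target = statistics.median(run_medians.values())
    return {
        run: {
            protein: None if v is None else v * target / run_medians[run]
            for protein, v in values.items()
        }
        for run, values in runs.items()
    }


def impute_from_low_tail(runs, low_fraction=0.10, seed=0):
    """Replace missing values by random draws from the lowest observed intensities."""
    rng = random.Random(seed)
    observed = sorted(
        v for values in runs.values() for v in values.values() if v is not None
    )
    n_low = max(1, int(len(observed) * low_fraction))
    low_tail = observed[:n_low]  # low-abundance pool taken across the experiment
    return {
        run: {
            protein: rng.choice(low_tail) if v is None else v
            for protein, v in values.items()
        }
        for run, values in runs.items()
    }


# Hypothetical toy intensities; None marks an identification that failed the FDR filter.
raw = {
    "run1": {"P1": 100.0, "P2": 200.0, "P3": 400.0, "P4": None},
    "run2": {"P1": 50.0, "P2": 100.0, "P3": 200.0, "P4": 10.0},
    "run3": {"P1": 200.0, "P2": None, "P3": 800.0, "P4": 40.0},
}
normalized = median_normalize(raw)
filled = impute_from_low_tail(normalized)
```

Normalizing before imputing keeps the imputed values on the same scale as the measured ones, which is why the two steps are applied in this order in the sketch.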
Data Analysis and Biomarker Selection
Initial univariate
candidate filtering was performed using the pairwise Wilcoxon test
applied per protein across disease status (healthy, early, and late
stages) with the Holm–Bonferroni correction (within-group).
Proteins with a p-value below or equal to 0.05 from the randomly selected
80% of observations were used for further optimization using sparse
partial least-squares discriminant analysis (sPLSDA).[45] A leave-one-out algorithm was used for optimal component
and protein selection. sPLSDA training and testing were performed
using the R-package “mixOmics”.[46] The remaining 20% of observations were used for validation. The
accuracy of prediction, both for the three groups (healthy, early, and late
stages) and for healthy against early and late stages together, was calculated
as the ratio of the sum of true positives and true negatives to all observations
(R-package “caret”). Unsupervised hierarchical analysis
was performed with Manhattan distance and Ward’s clustering
on centered and normalized data ($(x_{ij} - \bar{x}_j)/s_j$ for the $i$-th observation
of the $j$-th protein) using R-package “ComplexHeatmap”.
PCA analysis was performed using R-package “stats”.
Correlation analysis was performed using the Pearson correlation with
R-packages “stats” and “corrplot”. Correlation
significance was tested using a two-sided t-test at α = 0.05.
All analyses were performed using log2-transformed data.
Gene ontology enrichment was performed using GOrilla,[47] and the identifications of this study were selected as
the background. All basic calculations and data transformations were
performed in R with R-packages: “dplyr” and “ggplot2”.
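Three of the calculations named in this section are simple enough to write out: the Holm–Bonferroni step-down adjustment, the per-protein centering and scaling, and the prediction accuracy. The analysis itself was done in R; the Python functions below are a hypothetical, standard-library sketch of the same textbook definitions and are not the code used in the study.

```python
import statistics


def holm_adjust(p_values):
    """Holm-Bonferroni step-down adjusted p-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


def standardize_columns(matrix):
    """Center and scale each column: (x_ij - mean_j) / sd_j, using the sample SD."""
    columns = list(zip(*matrix))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in matrix]


def accuracy(true_labels, predicted_labels):
    """Share of observations whose predicted class matches the true class."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


adjusted = holm_adjust([0.01, 0.04, 0.03])
scaled = standardize_columns([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
acc = accuracy(["healthy", "early", "late", "late"],
               ["healthy", "early", "early", "late"])
```

For a two-class comparison, the share of correct predictions equals the sum of true positives and true negatives divided by all observations; for three classes it generalizes to the diagonal of the confusion matrix divided by its total.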
Results
Optimization and Validation of the Analytical Approach
While methods to analyze the plasma proteome in depth exist, they
are usually either targeted and therefore biased, as for the case
of antibody- or aptamer-based technologies, or are based on the principle
of fractionation and are therefore difficult to scale. We aimed to
develop an analytical method that provided deep coverage and quantitative
accuracy while minimizing sample handling, bias, and batch effects.
To this end, we developed and optimized an automated plasma depletion
pipeline composed of three major steps: sequential depletion, parallel
digestion, and LC-MS acquisition (Figure 1A).
Figure 1
Deep plasma profiling: automated analytical
approach and benchmarking.
(a) Sketch of the major steps of the analytical approach developed
for deep human plasma profiling for biomarker discovery, including
the depletion of the 14 most abundant proteins and the approximate
time requirements. (b) Schema of the controlled quantitative experiment
based on human plasma spiked with known amounts of Saccharomyces cerevisiae (S. cerevisiae) (1:1.3) and Escherichia coli (E. coli) (1:0.5). The controlled mixtures were either
directly digested or processed using the process described in panel
(a). (c) Plot showing the measured distributions of the fold changes
of the controlled quantitative experiment divided by species. The
dashed lines represent the theoretical fold change. (d) Comparison
of the number of protein groups identified at different gradient lengths
for a depleted human plasma pool by either directDIA (blue) or with
a sample-specific library (red).
First, we automated the depletion of the 14 most abundant proteins
using a sequential approach supporting a 96-well format.[48] Briefly, after randomization and filtration
of the samples into a 96-well plate, an automated chromatographic
system sequentially and automatically processed the plate, thereby
depleting the 14 most abundant human proteins in plasma via the use
of specific antibodies.

To quantify the analytical gain of the
approach and to assess whether
depletion maintains quantitative precision and accuracy, we performed
a controlled quantitative experiment (CQE). The CQE sample set was
generated from 20 healthy human plasma samples spiked with either
1:400 E. coli and 1:90 S. cerevisiae for condition A or 1:200 E. coli and 1:120 S. cerevisiae for condition B (Figure 1B). After processing the 40 samples with or without the automated
depletion pipeline, they were analyzed on a mass spectrometer using
data-independent acquisition (DIA). Since the major challenge linked
to quantification in plasma is the large dynamic range, removing the
14 most abundant proteins should lead to an increase in the number
of proteins identified compared to the neat plasma. Indeed, while
the processing of the neat plasma samples led to an average identification
of 572 proteins (3920 peptides) across all samples, depletion significantly
increased the coverage by 257% to 1471 proteins (10,230 peptides)
(n = 40, p-value = 1e−98; Supporting Information Figure 1A). Importantly,
depletion retained quantitative accuracy close to the expected
log2 ratios of condition A relative to B of −1 for E. coli and 0.415 for S. cerevisiae: the median log2 ratios were −1.20 and −1.18 for E. coli
and 0.38 and 0.32 for S. cerevisiae in
the neat and depleted sets, respectively (Figure 1C). We observed a reduction in the intensity
of the depleted proteins along with the closely related proteins (e.g.,
other immunoglobulins or apolipoproteins) while observing an overall
increase in intensity in the rest of the plasma proteome (Supporting Information Figure 1B). Furthermore,
intensities of human proteins are correlated between the two data
sets (Pearson correlation 0.58, n = 247), and if
only nondepleted proteins are considered, this correlation becomes
much stronger (0.85, n = 198). Finally, we performed
an unpaired t-test between conditions B and A and
could identify 171 and 621 candidates (FDR, q-value
≤ 0.01) for the neat and depleted sets, respectively (Supporting Information Figure 1C). Given the
experiment’s controlled nature, we could identify the true
hits as those proteins mapping to either E. coli or S. cerevisiae and showing the
expected directionality. Overall, the depletion led to a 362% increase
in true hits, 170 and 615 for neat and depleted (actual FDR < 1%
for both), respectively. In summary, the automated depletion more
than tripled the number of proteins identified and the number of true
hits while maintaining quantitative accuracy and reducing the manual
workload to only the filtering of the samples (about half a day per
96 samples; Figure 1A).

In the second step following depletion, the sample plate
was prepared
for digestion on an automated platform using a protein aggregation
capture approach.[36] Subsequently, the samples
were cleaned using C18 plates, and peptide concentration was measured.
If a library is to be generated, a fraction of all samples can be
pooled and an ultra-high-pressure liquid chromatography-controlled
high pH reverse-phase (HPRP) fractionation performed.[38]

The third step comprises the LC-MS measurement
of the samples.
Even after depletion of the most abundant proteins, the major challenge
hindering quantification is the large dynamic range in plasma. Hence,
we developed and optimized the LC-MS acquisition for deep proteome
coverage using FAIMS-based ion mobility on the orbitrap platform combined
with high-performance chromatography. We developed FAIMS-DIA methods
that maximize the protein and peptide identification by comparing
values and counts of FAIMS compensation voltages with different scan
resolutions. This resulted in a set of optimized methods for gradients
from 1 to 4 h. Benchmarking with the depleted plasma resulted in 1300
protein identifications in 1 h gradients to 2103 protein identifications
in 4 h (Figure 1D).
For reference, in the human cell line HeLa, 10,026 proteins were identified
in 4 h (Supporting Information Figure 1D).

Altogether, we demonstrated that the presented automated plasma
depletion pipeline has the potential to enable the unbiased, reproducible,
and precise quantification of more than 2000 proteins on average per
sample across very large cohorts.
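The summary figures of the controlled quantitative experiment reported in this section can be recomputed from the stated counts. The sketch below is only a cross-check under one assumption made here: every candidate that is not a true hit is counted as a false positive when estimating the actual FDR. The dictionary layout and function names are hypothetical.

```python
# Counts as reported in the text for the neat and depleted sample sets.
neat = {"proteins": 572, "candidates": 171, "true_hits": 170}
depleted = {"proteins": 1471, "candidates": 621, "true_hits": 615}


def ratio(after, before):
    """Ratio of a count after depletion to the same count before depletion."""
    return after / before


def empirical_fdr(result):
    """Share of candidates that are not true hits (assumed false positives)."""
    return (result["candidates"] - result["true_hits"]) / result["candidates"]


protein_ratio = ratio(depleted["proteins"], neat["proteins"])     # about 2.57
true_hit_ratio = ratio(depleted["true_hits"], neat["true_hits"])  # about 3.62
fdr_neat = empirical_fdr(neat)          # 1 / 171
fdr_depleted = empirical_fdr(depleted)  # 6 / 621
```

Expressed as ratios, depletion yields roughly 2.6 times as many identified proteins and 3.6 times as many true hits, with the empirical FDR staying below 1% in both sets.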
Plasma Proteome Depth Achieved
To test our pipeline,
we set out to analyze a diverse cohort of human plasma samples coming
from the five most deadly solid cancer types in the United States:[35] pancreatic, colorectal, breast, prostate, and
non-small-cell lung cancers. For each cancer type, 15 early-stage
(I–IIC) and 15 late-stage (IIIA–IIIC) nonmetastatic
patients, as well as 15 matching normal control samples, were selected
based on the available baseline data (including gender, age, and where
applicable smoking status; Figure 2A and Supporting Information Table 2). Altogether, we processed 180 samples (and an additional
24 quality control samples) over the course of 1 week and approximately
a month of measurement time. With this scalable approach, we could
identify and quantify 2732 proteins (2463 proteins with two or more
peptide sequences and on average with 9.2 peptides per protein) across
226 measurements (180 samples and 46 quality control samples, about
900 proteins per hour of measurement; Figure 2B), of which 1804 are found in at least 50% of the
runs (Supporting Information Figure 2A).
On average, we identified 1806 proteins per run. Importantly, missing
values stemmed mostly from biological variation: in injection
triplicates, 88.7% of the 2209 protein groups detected were complete
observations. Additionally, 77% of the 2402 protein groups from 15
injection replicates were complete, representing only 6% (119) fewer
protein groups with full profiles than in injection triplicates. Across
cancers (and the healthy cohort), the identifications varied between
2524 in prostate cancer and 2682 in lung cancer, showing that only
a minimal part (<10%) of the identification is disease-specific
and around 1000 protein groups are consistently quantified despite
variable biology (Supporting Information Figure 2B). Furthermore, it can be assumed that peptides and proteins
that do not fulfill the FDR criteria are below the limit of detection
since DIA measures all ions and does not have the stochastic nature
of DDA. With the identified proteins, we could cover the 8-order-of-magnitude
dynamic range reported for plasma in the Human Protein Atlas (3222
proteins detected in human plasma by mass spectrometry, of which we
could quantify 70%; Supporting Information Figure 2C). Within this range, we extensively covered the tissue leakage
proteome, interleukins, and signaling proteins such as EGF, KLK3 (PSA),
AKT1, CD86, MET, ERBB2, and CD33 (Figure C). As expected, among the 500 highest-intensity
proteins, that is, the proteins that would likely be identified even without
depletion, 196 (39%) are classified as secreted proteins.
On the lower end, we identified tissue-specific proteins coming from
the diseased organs (n = 42, 81% of which are not
part of the 500 most abundant proteins), cytokines (n = 29, 85%), and nucleoplasm (n = 637, 90%) proteins
exemplifying different functional plasma concentration ranges (Figure C). We identified
190 targets for FDA-approved drugs, of which 125 (66%) fall in the
lower intensity range.[49] The different
biological roles of low- and high-abundance plasma proteins show that
we could recover the known biology of the plasma proteome.
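The completeness figures above (for example, 88.7% of 2209 protein groups being complete observations in injection triplicates) amount to counting protein groups without missing values. A minimal sketch of that count, assuming a hypothetical protein-by-run table in which `None` marks a missing quantity; the function name and the toy intensities are illustrative and not taken from the study:

```python
def complete_fraction(quant_table):
    """Return the share of protein groups quantified in every run.

    quant_table maps a protein group name to a list of per-run
    quantities, with None marking a missing value (an assumed layout).
    """
    if not quant_table:
        raise ValueError("quant_table is empty")
    complete = sum(
        1 for values in quant_table.values()
        if all(v is not None for v in values)
    )
    return complete / len(quant_table)


# Toy injection triplicate with hypothetical intensities.
toy = {
    "ALBU": [9.1e8, 9.3e8, 9.0e8],
    "STAT3": [2.1e4, None, 2.4e4],
    "CD9": [5.5e4, 5.9e4, 5.2e4],
    "KLK3": [None, None, 1.1e3],
}
print(f"{complete_fraction(toy):.0%} complete")  # 2 of 4 groups are complete
```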
Figure 2
Deep plasma
discovery proteomics of five solid cancer types. (a)
Description of cohort comprising five solid cancers: breast (infiltrating
ductal carcinoma), colon (adenocarcinoma), pancreas (adenocarcinoma),
prostate (adenocarcinoma), and lung (non-small-cell lung cancer, squamous
cell) cancers. Fifteen subjects for early and late stages were selected
for each cancer type, along with 15 matching healthy individuals (a
total of 30, given the need to balance ethnicity and sex for prostate
and breast cancers). (b) Z-score of all quantified
proteins (n = 2732) across all measured samples (n = 180). Stage calling is overlaid. Both the proteins and
the samples were hierarchically clustered. Selected, significantly
enriched gene ontology pathways are reported on the right with the p-value in parentheses. (c) The protein rank vs protein
average intensity (n = 180). Proteins were categorized
according to Human Protein Atlas, and the average rank was calculated
(dotted, vertical lines). The green box depicts the proteome region
that is typically below the sensitivity of the neat plasma profiling
by mass spectrometry. (d) The coefficient of variation (CV) of the
quality control measurements across the processing steps was plotted.
The LC-MS variance was controlled by reinjection of the same digested
sample (injection). Digestion and depletion were done repeatedly of
the same sample (digest, depletion) and the batch stemming from sample
preparation 96-well plates (batch). Thick lines indicate medians,
boxes indicate 25 and 75% quartiles, and whiskers extend between the
median and ±(1.58 × interquartile range).
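The Z-scores shown in the heatmap panels can be illustrated with a short sketch. It assumes that each protein is centered and scaled across samples using the sample standard deviation; whether the study used the sample or population form is not stated here, and the function name and values are hypothetical:

```python
from statistics import mean, stdev


def z_scores(values):
    """Center and scale one protein's quantities across samples."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1), an assumption
    if sigma == 0:
        # A constant protein carries no contrast; report zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical log-transformed quantities of one protein in three samples.
log_quant = [10.0, 12.0, 14.0]
print(z_scores(log_quant))
```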
Furthermore, based on quality control samples, we could characterize
variance introduced at each level: injection (median coefficient of
variation (CV) = 16%), digestion (CV = 19%), depletion (CV = 25%),
and column (CV = 26%), all of which are much lower than the healthy
interindividual variability (CV = 56%; Figure D and Supporting Information Figure 2D). As a further quality control, we focused on known
protein levels’ interpatient variability (measured by CV; Supporting Information Figure 2E). On one hand,
coagulation and complement cascade proteins (KEGG complement and coagulation
cascades) were significantly enriched among the proteins with the
least interpatient variability (median CV = 32% and p-value = 2.8e-12), such as complement factor I (CF1, CV
= 23%) and complement component C6 (CV = 27%), demonstrating tight
regulation.[18] On the other hand, keratins
(likely contaminants, GO biological process keratinization) were significantly
enriched among the proteins with the most interpatient variability
(CV = 339% and p-value = 4.46e-8), with
HLA molecules (CV = 90%) also showing high variability across patients.[50] Additionally, lipoprotein A (LPA) shows
a large interpatient variability (CV = 113%), likely due to the known
genetic variants affecting its secretion into plasma.[51,52] Overall, the quantitative data set generated recapitulates known
biological features of interpatient heterogeneity while providing
a deep unbiased view of the plasma proteome.
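The median CVs reported per processing level can be computed along the following lines. This is a minimal sketch, assuming CVs are calculated per protein on non-log-transformed quantities and then summarized by the median; the function names and toy replicate values are hypothetical:

```python
from statistics import mean, median, stdev


def cv_percent(values):
    """Coefficient of variation in percent: 100 * sd / mean."""
    mu = mean(values)
    if mu == 0:
        raise ValueError("mean is zero; CV is undefined")
    return 100.0 * stdev(values) / mu


def median_cv(replicates_by_protein):
    """Median of the per-protein CVs across a set of replicate runs."""
    return median(cv_percent(v) for v in replicates_by_protein.values())


# Hypothetical injection replicates for three proteins.
toy_injections = {
    "protein_a": [90.0, 100.0, 110.0],  # sd 10, mean 100
    "protein_b": [80.0, 100.0, 120.0],  # sd 20, mean 100
    "protein_c": [70.0, 100.0, 130.0],  # sd 30, mean 100
}
print(round(median_cv(toy_injections), 1))
```

The same function applied to replicate digests, depletions, or plates would give the other levels of the variance breakdown.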
Considerable Heterogeneity across Cancer Types
The cohort was designed to enable five independent within-cancer analyses,
each comprising a healthy-, early-, and late-stage group (each n = 15; Supporting Information Table 2 and Figure A). Overall, we included 30 control samples, but only a subset of
15 per cancer were matched (see methods). Hence, a combined analysis
of all samples together was not the primary goal of this study. Aware
of these limitations, we explored the entire data set for markers
that would agnostically predict the cancer stage. The analysis pipeline
applied to the whole data set was the same as that used for the
cancer-specific analyses and aimed at providing actionable insights about specific
disease development. Given the large amount of data (2732 proteins combined),
we used a two-step approach (Figure A). First, we filtered for differentially
abundant proteins between healthy-, early- and late-stage cancers
using univariate analysis. In the case of the pan-cancer model, we
found 468 proteins dysregulated (Figure B, Supporting Information Figure 3A, and Supporting Information Table 3). Second, using the selected proteins, we trained a model
based on sparse partial least-squares discriminant analysis (sPLSDA)
on 80% of the data set. This modeling step further reduced the number
of proteins to 94 (Figure B). The model partially differentiated healthy from diseased samples
but not late from early stage (Supporting Information Figure 3B and Supporting Information Table 4). Interestingly, the majority of the differentiating proteins
would have been below the detection level in a neat plasma preparation
(65%; Figure C). Furthermore,
the unsupervised clustering of the differentiating proteins generated
enriched patterns (Figure D). For example, proteins enriched for immunoglobulin production
and complement activation tend to be higher in healthy samples (Figure E). A subset of cancer
samples have a strong upregulation of proteins linked to metabolic
processes and cellular oxidant detoxification (Figure D,E). Immunoglobulin kappa variable 6–21
(KV621) was among the proteins higher in healthy samples, was the
third most important discriminant protein in the model (0.56 importance),
and showed a more pronounced bimodal distribution in healthy individuals
and a decrease in diseased individuals (Figure F and Supporting Information Figure 3C). In addition, the model identified the known inflammation
marker complement C5 (CO5, importance 1[53]) increased in the early and late stages and spondin-1 (SPON1, importance
0.58) increased in the late stage (Figure F and Supporting Information Figure 3C), as the first and second most important contributors,
respectively. Finally, the predictive power of the model was validated
using the remaining 20% of the samples. The predictive power was low
at 55.6% (Supporting Information Figure 3D), likely due to the cohort imbalance, the sample heterogeneity,
and the small sample set, as each cancer type is known to have a particular
protein signature.[54] Nonetheless, unsupervised
clustering using the final protein panel (enrichment p-value = 1.4e-9) allowed for a more efficient separation
of samples between healthy and diseased states compared to the entire
proteome (p-value = 0.09; Figures B and 3D). Altogether,
the global data analysis underlined the importance and necessity of
precision medicine, and a much larger sample set would be needed to
find a potential “one-fits-all” solution.
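The evaluation step described above (training on 80% of the samples and scoring the remaining 20%) can be sketched as follows. sPLSDA itself is not implemented here: a nearest-centroid classifier is swapped in purely as a stand-in, and the stratified split, the seed, all function names, and the toy data are assumptions rather than the study's procedure.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_idx, test_idx), sampling each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)
        cut = round(train_fraction * len(members))
        train_idx.extend(members[:cut])
        test_idx.extend(members[cut:])
    return sorted(train_idx), sorted(test_idx)


def fit_centroids(rows, labels):
    """Mean feature vector per class (stand-in classifier, not sPLSDA)."""
    sums, counts = {}, defaultdict(int)
    for row, label in zip(rows, labels):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        sums[label] = [s + x for s, x in zip(sums[label], row)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in vec] for lab, vec in sums.items()}


def predict(centroids, row):
    """Assign the class whose centroid is closest (squared Euclidean)."""
    def dist(label):
        return sum((x - c) ** 2 for x, c in zip(row, centroids[label]))
    return min(sorted(centroids), key=dist)


def holdout_accuracy(rows, labels, train_fraction=0.8, seed=0):
    """Fit on the training part and score the held-out part."""
    train_idx, test_idx = stratified_split(labels, train_fraction, seed)
    centroids = fit_centroids([rows[i] for i in train_idx],
                              [labels[i] for i in train_idx])
    hits = sum(predict(centroids, rows[i]) == labels[i] for i in test_idx)
    return hits / len(test_idx)


# Toy data: 15 samples per group and two hypothetical, well-separated features.
labels = ["healthy"] * 15 + ["early"] * 15 + ["late"] * 15
shift = {"healthy": 0.0, "early": 5.0, "late": 10.0}
rows = [[shift[lab] + 0.1 * (i % 5 - 2), 0.1 * (i % 3 - 1)]
        for i, lab in enumerate(labels)]
print(holdout_accuracy(rows, labels))
```

With 15 samples per group, the split leaves three samples per group (nine in total) for validation, so each misclassification shifts the accuracy by about 11 percentage points.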
Figure 3
Machine learning-based
candidate biomarker discovery. (a) Schematic
detailing the steps of the postprocessing, including univariate testing
for filtering, machine learning (sPLSDA) on 80% of the data, and classification
performance accuracy on the 20% hold-out validation data. (b) Overview
of the number of biomarker candidates selected by univariate analysis
(gray) and machine learning (blue) for healthy, early, and late stages
across all cancers and individual cancers. (c) Average protein intensity
plotted vs protein abundance rank. The machine learning-selected biomarker
candidates for the pan-cancer model are colored blue (the average
is plotted as a blue line), and important contributors are highlighted.
The green box depicts the proteome region that is typically below
the sensitivity of neat plasma profiling by mass spectrometry. (d) Z-score of all machine learning-selected candidate biomarkers
for the pan-cancer model (n = 94) across all measured
samples (n = 180). Stage calling is overlaid. Both
the proteins and the samples were hierarchically clustered. Selected,
significantly enriched gene ontology pathways are reported on the
right with the p-value in parentheses. Proteins highlighted in blue
and gray are reported in panels (e) and (f), respectively. (e) Boxplot
visualization of the average z-transformed protein
intensity for all proteins (n = 288) in the cluster
highlighted in blue in panel (d) divided by stage (n = 180). Thick lines indicate medians, boxes indicate 25 and 75%
quartiles, and whiskers extend between the median and ±(1.58
× interquartile range). (f) Boxplot visualization (as in panel
(e)) of the log-transformed protein quantities of the three most differentiating
proteins based on the machine learning model (SPON1, KV621, and CO5).
Each data point represents a sample (n = 180).
Overall Changes within and across Cancer Types
Next, we applied the same analysis strategy using the matched healthy controls
to each of the five solid tumor types. In the first step, we identified
on average 325 significantly altered proteins between healthy, late,
and early stages (Figures B and 4A and Supporting Information Table 3). With 436 significantly altered proteins
(83% reduction in features), prostate cancer had the highest number
of differentially abundant proteins, while breast cancer had the fewest
with 229 (92% reduction). Interestingly, only a few proteins were
shared among cancers (Supporting Information Figure 4A). Pancreatic and prostate cancers shared the most, with 190 overlapping
proteins, while breast and pancreatic cancers shared the fewest, with 37 (Supporting Information Figure 4A). Seven candidate
proteins were consistently selected as differentially abundant across
all cancers: the complement activation protein C4b-binding protein
β chain (C4BPB), the immunoglobulin component immunoglobulin
heavy variable 4-4 (HV404), the T-cell apoptosis inducer galectin-1
(LEG1), the degrader of the inflammation-promoting bradykinin peptide
Xaa-Pro aminopeptidase 1 (XPP1), the solute carrier family 2 facilitated
glucose transporter member 1 (GTR1), the glycan metabolism β-mannosidase
enzyme (MANBA), and the suggested growth inducer of epithelial tumors
tenascin-X (TENX; Figure B and Supporting Information Figure 4A,B). These candidates have rather decreasing (HV404, XPP1, MANBA, TENX)
or increasing (LEG1, C4BPB) trends in a cancer agnostic manner, with
the exception of GTR1, which strongly increases in the late-stage
breast cancer while decreasing in the other types (Figure C). Interestingly, this small
set of proteins separated healthy from cancer-stage samples quite
well (p-value = 1.9e-8; Figure B). Fitting an sPLSDA model
with 80% of the data overall decreased the number of candidates to
less than 5% of the total measured proteins. It led to an average
of 129 candidates, making biological interpretation and follow-up
more feasible (Figures B and 4A and Supporting Information Table 4). The relative decrease in the input data
was highly cancer-dependent, from an almost 76% reduction in pancreatic
cancer to only a 15% reduction in lung cancer. The number of overlapping
proteins across models was minimal, likely due to the reductionist
approach of sPLSDA and cancer-type-specific mechanisms, with no proteins
being selected for all models (Supporting Information Figure 4C). Still, TAGL and MANBA were selected in all but
breast cancer models, and GTR1 and LEG10 in all but the pan-cancer
and breast cancer models (Figure C and Supporting Information Figure 4B).
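The overlap counts discussed above (pairwise overlaps and proteins shared by all cancers) reduce to set intersections. A minimal sketch, assuming each cancer's list of differentially abundant proteins is held as a set; the hit lists below are hypothetical toy data, with `P1` to `P3` as placeholder names:

```python
from itertools import combinations


def pairwise_overlap(hits_by_cancer):
    """Count shared proteins for every pair of cancer types."""
    return {
        (a, b): len(hits_by_cancer[a] & hits_by_cancer[b])
        for a, b in combinations(sorted(hits_by_cancer), 2)
    }


def shared_by_all(hits_by_cancer):
    """Proteins present in every cancer-specific hit list."""
    return set.intersection(*hits_by_cancer.values())


# Hypothetical hit lists; real lists would hold hundreds of proteins.
toy_hits = {
    "breast": {"TENX", "GTR1", "P1"},
    "pancreas": {"TENX", "GTR1", "CD9", "P2"},
    "prostate": {"TENX", "GTR1", "CD9", "P3"},
}
print(pairwise_overlap(toy_hits))
print(sorted(shared_by_all(toy_hits)))
```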
Figure 4
Classification accuracy of the five cancer types. (a) Overview
of the data analysis per cancer and combined (pan-cancer) as a normalized
score. Percentage reduction upon univariate filtering and sPLSDA on
80% of the data set along with percentage accuracy as measured on
the 20% hold-out samples as a three-way (healthy-, early-, and late-stage)
and two-way (cancer and healthy) classification and p-value of enrichment
based on the heatmap clustering (Manhattan distance, Ward clustering).
(b) Z-score of the seven candidate proteins consistently
selected across all cancers (by univariate analysis, n = 180). Stage calling is overlaid. Both the proteins and the samples
were hierarchically clustered. (c) Boxplot visualization of log-transformed
GTR1 quantities across the stage and cancer type. The healthy samples
were matched to the respective cancer samples. Thick lines indicate
medians, boxes indicate 25 and 75% quartiles, whiskers extend between
the median and ±(1.58 × interquartile range), and each data
point represents a sample (n = 180). The dashed blue
line connects the median values across stages.
In summary, the model classification performance measured on the
20% validation set ranged between 33.3% in lung and prostate cancers
and 77.8% in colorectal cancer when all three groups were considered
and between 86.1% for the pan-cancer model and 100% for lung and colorectal
cancers when healthy and overall disease status were considered (Figure A and Supporting Information Table 2). While for the
early/late-stage differentiation two of the six models were close
to random performance, the disease status was easier to predict, especially
if the cancer type is known, as the pan-cancer model performed the
worst with an 86% accuracy. Interestingly, high model performance
was not always associated with high separation efficiency using PCA
or distance analysis and vice versa (Figure A). This is especially apparent in the case
of pancreatic and colorectal cancers. While colorectal performs the
best on the validation set, especially in the differentiation of healthy/disease,
pancreatic cancer leads to the best separation by hierarchical clustering
on all three groups (p-value = 3.1e-16).
In a nutshell, in contrast to the “one-fits-all” approach,
the cancer-specific models performed better. In some cases, the classification
accuracy of the derived models was good, demonstrating the benefit
of deep profiling of the plasma proteome.
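The reported accuracies are easier to interpret when expressed as counts on the small hold-out sets. The sketch below assumes nine validation samples per cancer-specific model (20% of 45, as stated in the figure captions) and 36 for the pan-cancer model (20% of 180, an assumption); the numbers of correct calls are inferred from the percentages and are not reported values.

```python
def accuracy_percent(correct, total):
    """Classification accuracy in percent, rounded to one decimal."""
    return round(100.0 * correct / total, 1)


# Per-cancer hold-out sets contain 9 samples (20% of 45).
print(accuracy_percent(3, 9))    # 33.3
print(accuracy_percent(6, 9))    # 66.7
print(accuracy_percent(7, 9))    # 77.8
# Pan-cancer hold-out assumed to contain 36 samples (20% of 180).
print(accuracy_percent(20, 36))  # 55.6
print(accuracy_percent(31, 36))  # 86.1
```

On nine samples, a single misclassification changes the accuracy by about 11 percentage points, which is worth keeping in mind when comparing models.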
Diseased State Separation in Colorectal Cancer
In colorectal cancer (CRC), we identified 307 proteins significantly altered between
healthy, early, and late stages (Supporting Information Figure 5A). The sPLSDA model further reduced these candidate
proteins to 90, and both hierarchical clustering and PCA analysis
led to the efficient separation of healthy subjects from patients
regardless of tumor staging (p-value = 2.1e-8; Figure A and Supporting Information Figure 5B). Multiple biological
GO enrichments in the candidates could be dissected, for example,
response to leptin and regulation of proteolysis increased in cancer
(including STAT3 and transgelin (TAGL)). In contrast, the negative
regulation of cell–cell adhesion, leukocyte homeostasis, and
response to hydrogen peroxide decreased (including CD47; Figure A,B). TAGL (importance
= 1.00), STAT3 (importance = 0.65), and CD47 (importance = 0.57) were
the three most predictive proteins from the sPLSDA model and showed
interesting patterns (Figure B and Supporting Information Figure 5C). While CD47 and STAT3 showed strong heterogeneity in late-stage
colorectal cancer, TAGL was highly expressed in the early- and late-stage
colorectal cancers (Figure B). The selected 90 proteins were distributed across the entire
intensity range of measured proteins, with more than 80% of the selected
proteins (including the three most important) being beyond the 500
protein mark representing the usual range of proteins detected in
neat plasma (Supporting Information Figure 5D). Furthermore, at 78%, the model had the best overall classification
accuracy among all tested malignancies on the validation set (Figure C). As no misclassification
for healthy subjects was observed, the panel of identified candidate
proteins could be helpful for early CRC diagnosis. In summary, despite
the small sample set, deep profiling of the human plasma enabled the
partial classification of diseased patients based on a panel of 90
proteins that span a large dynamic range while providing an unbiased
glimpse into the biological processes at the base of colorectal cancer.
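The statement that most selected proteins lie beyond the 500-protein mark can be checked by ranking proteins by average intensity and counting the candidates below that rank. A minimal sketch, assuming a hypothetical mapping from protein name to mean intensity; the cutoff of 500 follows the text, while the function names and toy proteome are illustrative:

```python
def abundance_ranks(mean_intensity):
    """Rank proteins from most (rank 1) to least abundant."""
    ordered = sorted(mean_intensity, key=mean_intensity.get, reverse=True)
    return {protein: rank for rank, protein in enumerate(ordered, start=1)}


def fraction_beyond(ranks, candidates, cutoff=500):
    """Share of candidate proteins ranked below the abundance cutoff."""
    selected = [p for p in candidates if p in ranks]
    if not selected:
        raise ValueError("no candidate protein has a rank")
    return sum(ranks[p] > cutoff for p in selected) / len(selected)


# Toy proteome of 1000 hypothetical proteins with falling intensity.
toy_intensity = {f"P{i:04d}": 1e9 / i for i in range(1, 1001)}
toy_candidates = ["P0010", "P0600", "P0750", "P0900", "P0999"]
ranks = abundance_ranks(toy_intensity)
print(fraction_beyond(ranks, toy_candidates))  # 4 of 5 candidates beyond rank 500
```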
Figure 5
Colorectal
cancer biomarker candidates predict diseased status.
(a) Z-score of all machine learning-selected candidate
biomarkers for the colorectal cancer model (n = 90)
across the matched colorectal sample set (n = 45).
Stage calling is overlaid. Both the proteins and the samples were
hierarchically clustered. Selected, significantly enriched gene ontology
pathways are reported on the right with the p-value
in parentheses. Proteins highlighted in gray are reported in panel
(b). (b) Boxplot visualization of log-transformed CD47, STAT3, and
TAGL quantities divided by the stage for the colorectal cancer set.
Thick lines indicate medians, boxes indicate 25 and 75% quartiles,
whiskers extend between the median and ±(1.58 × interquartile
range), and each data point represents a sample (n = 45). (c) Overview of the classification accuracy of the machine
learning models for the colorectal cancer validation set (n = 9). Correct classifications are represented in the highlighted
boxes.
Stage Separation in Pancreatic Cancer
In the pancreatic cancer set, 436 proteins were significantly altered between healthy,
early, and late stages (Supporting Information Figure 6A). The sPLSDA modeling selected 106 proteins, which
efficiently separated the three classes in both hierarchical clustering
and PCA analyses (p-value = 3.1e-16; Figure A,B). The separation
was driven primarily by CD9 (importance = 0.37), TENX (importance
= 0.32), and di-N-acetylchitobiase (DIAC, importance = 0.28), with
both TENX and DIAC showing a downregulation with disease progression
and CD9 showing a stronger upregulation in early- than late-stage
pancreatic cancer (Figure C and Supporting Information Figure 6B). CD9 levels correlated most strongly with endocytosis-related protein
dynamin-1 (DYN1), heat shock protein β-1 (HSPB1), platelet glycoprotein
4 (CD36), and a profibrotic matricellular protein CCN family member
2 (CCN2). The unsupervised clustering of the candidate proteins resulted
in interesting patterns (Figure A). In the early-stage pancreatic cancer, proteins
involved in the regulation of peptide secretion, cell communication,
and chemokine production are overall downregulated including LEG10,
which is essential for the suppressive function of CD25 positive regulatory
T-cells[55,56] (Supporting Information Figure 6C), while proteins involved in the negative regulation
of apoptotic process and receptor internalization (including proto-oncogene
tyrosine-protein kinase Src (SRC) and CD9; Figure C and Supporting Information Figure 6C) are upregulated. In late-stage pancreatic cancer,
cellular oxidant detoxification and oxygen transport, including hemoglobin
subunit γ-1 (HBG1), are upregulated (Supporting Information Figure 6C). Of the 125 biomarker candidates selected,
65% were in the low abundance range (Supporting Information Figure 6D). In the validation set, the model had
an accuracy of 66.7%, with two out of nine observations incorrectly
assigned to the healthy group instead of early-stage cancer (Figure D). On the whole,
deep profiling of human plasma enabled the clustering of diseased
patients based on the disease stage, and feature reduction let biological
patterns related to disease progression emerge.
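Ranking proteins by their correlation with CD9, as described above, can be sketched as follows. The correlation measure used in the study is not stated in this section, so Pearson correlation is an assumption here, and the protein names and quantities below are hypothetical:

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("constant series; correlation undefined")
    return cov / sqrt(var_x * var_y)


def top_correlates(target, others, k=2):
    """Names of the k series most positively correlated with target."""
    scored = {name: pearson(target, series) for name, series in others.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]


# Hypothetical log-quantities across five samples.
cd9 = [1.0, 2.0, 3.0, 4.0, 5.0]
others = {
    "prot_up": [2.1, 3.9, 6.2, 8.0, 9.8],
    "prot_flat": [3.0, 3.1, 2.9, 3.0, 3.1],
    "prot_down": [5.0, 4.0, 3.0, 2.0, 1.0],
}
print(top_correlates(cd9, others, k=1))
```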
Figure 6
Pancreatic cancer biomarker
candidates predict diseased stage.
(a) Z-score of all machine learning-selected candidate
biomarkers for the pancreatic cancer model (n = 106)
across the matched pancreatic cancer sample set (n = 45). Stage calling is overlaid. Both the proteins and the samples
were hierarchically clustered. Selected, significantly enriched gene
ontology pathways are reported on the right with the p-value in parentheses.
Proteins highlighted in gray are reported in panel (c) and the Supporting Information Figure 6. (b) Representation
of the first two dimensions from the PCA analysis based on candidates
identified in the sPLSDA model for pancreatic cancer. Small points
represent samples, and large points represent the average across the
stage. While the first dimension separates healthy from diseased samples
and explains 18% of the variance in the data, the second dimension
separates early- and late-stage samples and represents 13% of the
variability. The corresponding ellipses represent sample concentration
around the mean. (c) Boxplot visualization of log-transformed CD9,
DIAC, and TNXB quantities divided by the stage for the pancreatic
cancer set. Thick lines indicate medians, boxes indicate 25 and 75%
quartiles, whiskers extend between the median and ± (1.58 ×
interquartile range), and each data point represents a sample (n = 45). (d) Overview of the classification accuracy of
the machine learning models for the pancreatic cancer validation set
(n = 9). Correct classifications are represented
in the highlighted boxes.
Discussion
We have developed an automated, robust, and parallelizable workflow
for deep, large-scale plasma proteome profiling by depletion and sample
preparation and by generating deep coverage ion mobility DIA methods.
First, we demonstrated substantial improvements upon depletion for
identification and quantification using a controlled quantitative
plasma experiment. Furthermore, through multistage quality control,
we assessed the variance introduced at each step of processing. In
summary, the novel plasma discovery workflow enables the deep profiling
of 10 samples per day per analytical platform to a depth of approximately
2700 proteins per study for 2 h gradients, reaching deep into tissue
leakage and signaling molecules while maintaining quantitative accuracy.
To evaluate the potential of deeper proteome coverage of the analytical
pipeline, we measured a subset of the cancer plasma study with 3.5
h gradient FAIMS-DIA acquisitions. This resulted in a substantial
increase in protein identifications to 3372 cumulatively (Supporting Information Figure 7).
Next, we applied the novel plasma discovery workflow to a cohort
containing samples coming from five solid tumors. Data analysis, including
machine learning, revealed biomarker candidates and resulted in predictive
models. The biomarkers mainly contain proteins from low abundance
regions that would have likely been missed by neat plasma profiling,
as previously speculated by Geyer et al.[9] Given the limited sample size and sample selection limitations (e.g.,
the healthy samples are self-declared healthy), the presented biomarker
candidates require additional validation in an independent cohort.
While the separation of healthy from cancer plasma samples was
quite accurate for the cancer-specific models (average accuracy 93%),
early- to late-stage differentiation was much more challenging, showing
weaker separation (average accuracy 56%). The pan-cancer model performed
worse than the cancer-specific models, indicating that “one-fits-all”
biomarkers are generally harder to discover. This is likely because
of the considerable heterogeneity across cancer types and could be
addressed by a larger cohort and a more advanced stratification strategy,
which would likely lead to a larger biomarker panel.
Seven candidate proteins were consistently differentially abundant
across all cancers, of which one followed a cancer-type-specific behavior.
Notably, the previously reported pan-cancer biomarker candidate TENX
was reproduced, showing a reduction with disease progression irrespective
of the cancer type.[57] Overall, our approach
showed that deep exploration of the proteome of cancer plasma samples
can be realized for biomarker discovery. Larger cohorts and a longitudinal
study design, where the same subjects are monitored ideally before
disease onset, would likely lead to more robust biomarkers.
When focusing on colorectal cancer, 307 proteins were altered between
healthy, early, and late stages. These include three with a documented
role in colorectal cancer development: STAT3,[58] TAGL,[59] and CD47.[60] In addition, gene ontology enrichment analysis of the identified
candidates showed that response to leptin and regulation of proteolysis
were increased in cancer. At the same time, there was a negative regulation
of cell–cell adhesion, leukocyte homeostasis, and response
to hydrogen peroxide. Using the machine learning-assisted biomarker
discovery approach, a prediction model built on 90 proteins had the
highest classification power, with 78% accuracy on the
hold-out set.

In pancreatic cancer, 436 proteins were altered
between healthy,
early, and late stages. Of these, seven (GTR1, APOA4, IBP2, CD9, CAB45,
OLFM4, BGH3) have previously been suggested as possible pancreatic
cancer biomarkers.[61−65] Machine learning-based modeling selected 106 proteins, which led
to an efficient separation of healthy, early-stage, and late-stage
samples using distance measures. The selected proteins showed an average overall
prediction accuracy of 67%, with two observations incorrectly assigned
to the healthy group instead of early-stage cancer. This separation
was primarily driven by the three cancer-related proteins CD9,[66] TENX,[57] and DIAC.[62,67] Further supporting the quality of the candidates, the separation was
also driven by the recently proposed therapeutic target CNN2[68] and the prognostic marker GTR1.[69] A study by Jayaraman et al. demonstrated that the exposure
of pancreatic cancer cells to zinc leads to increased protein ubiquitination
and enhanced cell death, implicating zinc as a potential therapy in
treating pancreatic cancer.[70] We found
the sequestration of zinc ions as an enriched biological process in
pancreatic cancer, specifically downregulated in cancer samples (especially
early stage).

Clinical analysis of blood is the most widespread
diagnostic procedure
in medicine, and blood biomarkers are used to diagnose diseases, categorize
patients, and support treatment decisions. The presented approach
is well suited for deep, epidemiological biomarker studies in plasma
as it reaches deep into the tissue leakage area, where information
on the health state of distal tissues can be discovered. However,
biomarker sets derived from the machine learning biomarker discovery
analysis are not optimally suited for direct transition into “classical”
clinical biomarkers, as new multiplexed approaches for clinical assays
would be required. Such challenges could potentially be addressed
by DIA or multiple PRM-based assays, which are fully compatible with
the presented workflow and could ultimately result in streamlined
discovery-to-target-driven personalized medicine utilizing only one
technology platform.[71,72]

Hence, we envision that
the profiling of large cohorts at high
proteome depth will strongly support the development of novel biomarkers
previously not accessible to large-scale discovery approaches and
will lead to the development of biomarker panels that will finally
deliver on the promise of noninvasive, preventive cancer screening.