
Biomarker Candidates for Tumors Identified from Deep-Profiled Plasma Stem Predominantly from the Low Abundant Area.

Marco Tognetti1, Kamil Sklodowski1, Sebastian Müller1, Dominique Kamber1, Jan Muntel1, Roland Bruderer1, Lukas Reiter1.   

Abstract

The plasma proteome has the potential to enable a holistic analysis of the health state of an individual. However, plasma biomarker discovery is difficult due to its high dynamic range and variability. Here, we present a novel automated analytical approach for deep plasma profiling and applied it to a 180-sample cohort of human plasma from lung, breast, colorectal, pancreatic, and prostate cancers. Using a controlled quantitative experiment, we demonstrate a 257% increase in protein identification and a 263% increase in significantly differentially abundant proteins over neat plasma. In the cohort, we identified 2732 proteins. Using machine learning, we discovered biomarker candidates such as STAT3 in colorectal cancer and developed models that classify the diseased state. For pancreatic cancer, a separation by stage was achieved. Importantly, biomarker candidates came predominantly from the low abundance region, demonstrating the necessity to deeply profile because they would have been missed by shallow profiling.

Keywords:  SWATH; cancer; clinical proteomics; data-independent acquisition; depletion; high throughput; label-free quantification; library; plasma proteomics; single shot; stable isotope-based quantification

Year:  2022        PMID: 35605973      PMCID: PMC9251764          DOI: 10.1021/acs.jproteome.2c00122

Source DB:  PubMed          Journal:  J Proteome Res        ISSN: 1535-3893            Impact factor:   5.370


Introduction

Proteins control most biological processes in life. Alterations in their expression level, localization, and proteoforms are often correlated with disease onset and progression.[1] In humans and animals, blood flows through virtually all tissues. Therefore, it has the potential to indicate the health state of any inner organ, even those not accessible from the outside. Blood is readily obtainable with minimally invasive sampling, and large biobanks exist for retrospective analyses.[2] Clinical analysis of blood is the most widespread diagnostic procedure in medicine, and blood biomarkers are used to diagnose diseases, categorize patients, and support treatment decisions. While proteins (6–8%) are the second most abundant component of plasma after water (90–92%), metabolic, lipidomic, transcriptomic, and genomic readouts are also gaining traction as diagnostic tests in plasma.[3−6] Different omics readouts can be used in conjunction to improve diagnostic power.[7] Although more than 20,000 diseases are reported to affect humans,[8] accurate, sensitive, and specific diagnostic tests exist for only a small fraction of them. The limited success of blood protein biomarkers is primarily due to the analytical challenges of proteomic analysis of blood plasma. First, the large biological variance between individuals and within individuals over time makes the discovery of reliable biomarker signatures difficult.[9−12] Second, the steep dynamic range of human plasma, estimated at 12–13 orders of magnitude,[13] makes comprehensive proteome profiling challenging for any analytical technique. 
In the lower concentration range, thousands of proteins reside, mostly tissue leakage proteins and signaling molecules that could serve as biomarkers but are very challenging to measure, especially in an unbiased manner.[14,15] Mass spectrometry (MS)-based plasma analysis provides an unbiased, quantitative, and therefore ideal technology for the system-wide characterization of the proteome.[16] Recently, technological developments in sample preparation, chromatography, and acquisition enabled automated, large-scale plasma projects of hundreds of specimens that have resulted in reproducible findings.[15,17−20] These approaches share a shallow depth of proteome coverage, reaching a maximum of about 600 proteins identified and quantified in a sample. From qualitative analysis, disproportionately more proteins were found to be present in the lower abundance region of plasma than in the higher concentration range.[14] Novel MS-based approaches have been developed to improve analytical depth while retaining quantitative information. These include the depletion of high-abundance proteins, the enrichment of low-abundance proteins of interest, and prefractionation.[21] Still, they have yet to reach the throughput level needed to measure larger cohorts of clinical samples. Automation as well as depletion, batch, and quality control have been tackled[18,22,23] but require further improvement for large-scale studies. In summary, while current plasma proteome biomarker research approaches mostly cover the first few hundred proteins by concentration, rigorous experimental design and comprehensive, large-scale quantitative studies will achieve generalizable biomarker discovery.[16] Screening for the most common cancer types cannot be done in a routine and population-wide manner. 
To date, only a few nonideal, validated biomarkers exist in clinical use.[24] A significant challenge is that generally, only a single analyte or metric is measured despite the known heterogeneity of cancer. Biomarkers that accurately enable early detection in asymptomatic subjects, reflect cancer aggressiveness at diagnosis, and improve risk stratification are urgently needed.[24] Despite the medical need, plasma biomarker candidates for cancer are rarely validated or transferred to the clinic. Recent examples are as follows: Zhang et al. performed discovery proteomics in the plasma of 10 patients with colorectal cancer, discovered 72 biomarker candidates, and then performed a successful follow-up verification for prognostic markers with 419 patients using an immunoassay.[25,26] Enroth et al. found plasma protein biomarker signatures for ovarian cancer[27] but performed no validation. He et al. showed that for hepatocellular carcinoma and cholangiocarcinoma, biomarker candidates could be identified from plasma; the validation of these candidates is still pending.[28] Zhou et al. identified biomarkers for early gastric cancer from a small sample set, but validation is still pending.[29] For prostate cancer, a blood diagnostic test was successfully developed based on discovery proteomics and is now being used in the clinic.[30] For the detection of early ovarian cancer, the OVA1 test was developed and approved, where the measurement of β-2 microglobulin, apolipoprotein A1, serum transferrin, and prealbumin is combined with the previously established marker CA125 to deliver better care.[31,32] This case exemplifies that multimeasurement techniques are expected to outperform single-biomarker tests. 
Furthermore, single protein biomarkers are rarely specific for a single disease, e.g., α fetoprotein is diagnostic in liver cancer, but the biomarker is not specific, as it is altered in other liver diseases and ovarian and testis cancers.[33] Rarely, there are highly specific biomarkers such as β subunit HCG (β-HCG), which is a serum marker for testicular carcinoma as β-HCG is never detected in the circulation of healthy males.[34] To make plasma biomarker discovery more efficient and successful, the comprehensive profiling and validation of large cohorts of plasma proteomes need to be significantly improved with new approaches.[16] The expected outcome is new biomarkers that will allow early cancer detection and prediction of the probable response to therapy (in precision medicine). Here, we demonstrate a novel, automated analytical approach for plasma profiling that reaches a depth of 2732 proteins in the presented cancer study, extending deep into the tissue leakage and signaling molecule ranges. We demonstrate its identification and quantitative benefits over neat plasma profiling through a controlled quantitative experiment. Further, we profiled, deep into the tissue leakage range, plasma samples from both healthy individuals and patients with one of the five most deadly solid tumors in the United States.[35] A biomarker analysis with machine learning revealed candidates and models able to classify healthy and diseased samples. The discovered biomarker candidates predominantly came from low abundance protein regions, clearly demonstrating the need to measure deeply because they would have been missed by shallow plasma profiling.

Experimental Procedures

Ethics

The Cantonal Ethics Committee for Research on Human Beings, Zürich, Switzerland, approved the study protocol to be performed (proteomic analysis of plasma samples (2020-02892)).

Cohort Selection and Study Design

Cohort selection and experimental design were driven by sample availability in commercial repositories. For each cancer type, 30 matching samples were selected and split into early (nonmetastatic stages IA–IIC) and late (nonmetastatic stages IIIA–C) groups. Prior to the analysis, normal individuals were matched for age and sex and, whenever possible, balanced across ethnicities to both the early and late groups of each cancer type. Healthy status was self-declared. This resulted in three equal control groups (n = 15) with overlapping individuals, namely, breast cancer control, prostate cancer control, and remaining cancer control. Matching was done manually using the χ² test or ANOVA with a p-value threshold of 0.05 (R-package “tableone”).

Sample Preparation of the Pan-Cancer Cohort

One hundred and eighty human plasma samples were obtained from Precision for Medicine and its subsidiaries (Norton USA), Discovery Life Sciences (Huntsville), and ProteoGenex (Los Angeles). Due to limited availability, samples were not balanced across suppliers; collection procedures and handling until storage at −80 °C are considered to be the same for all three providers (Supporting Information Table 1). All samples were handled equally and thawed twice. During the aliquoting, a small amount of each sample was pooled. This quality control sample was subsequently used for the library generation and to assess the quality and batch effects throughout the sample preparation and acquisition. The processing batches were block-randomized for disease status, diseased state, gender, and ethnicity (only relevant for breast cancer samples) and kept for the entire sample preparation. Depletion was performed using the Agilent multi-affinity removal column human-14, 4.6 × 50 mm (Agilent Technologies) set up on a Dionex Ultimate 3000 RS pump (Thermo Fisher Scientific) and run according to the manufacturer’s instructions. Briefly, the plasma was diluted 4:1 with buffer A for multiple affinity removal LC columns (Agilent Technologies) and filtered through a 0.22 μm hydrophilic PVDF membrane filter plate (Millipore) before 70 μL was injected onto the column. The gradient was 27.5 min long, with the collection occurring between 3.6 and 9.2 min, a flow rate of 1 mL/min between 11 and 26.5 min and 0.125 mL/min during the rest of the gradient, and buffer B for multiple affinity removal LC columns (Agilent Technologies) applied only between 13 and 17.5 min (100% buffer B). Well-spaced within each processing batch, we depleted the quality control sample three times and treated each depletion as a separate sample from then on (depletion control samples). 
Following depletion, we digested the samples with protein aggregation capture using a KingFisher Flex (Thermo Fisher Scientific).[36] To assess digestion reproducibility, we mixed two extra depletions of the quality control sample before splitting the mixture into digestion triplicates (digestion control samples). The acidified peptide mixtures were loaded for clean-up into MacroSpin C18 96-well plates (The Nest Group), desalted, and eluted with 50% acetonitrile. Samples were dried in a vacuum centrifuge and solubilized in 0.1% formic acid and 1% acetonitrile with Biognosys’s iRT and PQ500 kits (Biognosys) spiked following the manufacturer’s instructions. Prior to DIA mass spectrometric analyses, the peptide concentrations of the samples were determined using a UV/vis spectrometer at 280 nm/430 nm (SPECTROstar Nano, BMG Labtech), and the samples were centrifuged at 14,000g at 4 °C for 30 min.

Sample Preparation of the Controlled Quantitative Experiment

The controlled quantitative experiment was generated from 20 healthy human EDTA K3 plasma samples obtained from Sera Laboratories International Ltd. (West Sussex, U.K.). Saccharomyces cerevisiae (S. cerevisiae) was lysed in 100 mM HEPES pH 7.4, 150 mM KCl, 1 mM MgCl2 by shear force, passing it through a gauge 12 syringe 15 times on ice before filtering (0.2 μm). Escherichia coli (E. coli) was lysed with a cell cracker before filtering (0.2 μm). After protein concentration determination using a UV/vis spectrometer at 280 nm (SPECTROstar Nano, BMG Labtech), each sample was spiked with fixed ratios of E. coli and S. cerevisiae, leading to a synthetic 1:2- and 4:3-fold change, respectively. To 20 μL of plasma (∼1200 μg proteins), 40 or 30 μg of S. cerevisiae and 12 or 24 μg of E. coli lysate were added for conditions A and B, respectively. The resulting 40 samples were diluted 4:1 with buffer A for multiple affinity removal LC columns (Agilent Technologies) and filtered through a 0.22 μm hydrophilic PVDF membrane filter plate (Millipore). Seventy microliters was used for depletion as described above, followed by filter-aided sample preparation (FASP),[37] and 30 μL was used for the neat plasma comparison. The diluted neat plasma sample was precipitated by adding four volumes of cold acetone (v/v) and incubating overnight at −20 °C. The pellet was subsequently washed twice with cold 80% acetone in water (v/v). After air-drying the pellet, the proteins were resuspended in 50 μL denaturation buffer (8 M urea, 20 mM TCEP, 40 mM CAA, 0.1 M ABC), sonicated for 5 min (Bioruptor Plus, Diagenode, 5 cycles high, 30 s on, 30 s off), and incubated at 37 °C for 60 min. Upon dilution with 0.1 M ABC to a final urea concentration of 1.4 M, the samples were digested overnight with 2 μg of sequencing-grade trypsin (Promega), and the trypsin was inactivated by adding TFA to a final concentration of 1% v/v. Peptide clean-up was carried out as described above.
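The spike amounts above fix the fold changes that a correct quantification should recover. The following minimal sketch computes the expected log2 ratios from the stated microgram amounts; the function name and the choice of reporting condition A relative to condition B are illustrative assumptions, not part of the published workflow.

```python
import math

# Spiked lysate amounts in micrograms per 20 uL of plasma, as stated in the text.
SPIKE_UG = {
    "S. cerevisiae": {"A": 40.0, "B": 30.0},
    "E. coli": {"A": 12.0, "B": 24.0},
}


def expected_log2_ratio(species, numerator="A", denominator="B"):
    """Return log2(amount in numerator condition / amount in denominator condition).

    The default direction (A over B) is an assumption made for this sketch.
    """
    amounts = SPIKE_UG[species]
    return math.log2(amounts[numerator] / amounts[denominator])


if __name__ == "__main__":
    for species in SPIKE_UG:
        print(species, round(expected_log2_ratio(species), 3))
```

With this direction, S. cerevisiae (40 vs 30 μg) gives about 0.415 and E. coli (12 vs 24 μg) gives −1; swapping numerator and denominator only flips the signs.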

Library Generation

High pH reverse-phase (HPRP) fractionation was performed using a Dionex UltiMate 3000 RS pump (Thermo Fisher Scientific) on an Acquity UPLC CSH C18 1.7 μm, 2.1 × 150 mm column (Waters) at 60 °C with a 0.3 mL/min flow rate. Prior to loading, the pH of 300 μg of pooled depleted samples was adjusted to pH 10 by adding ammonium hydroxide. The gradient used was 1–40% solvent B in 30 min; solvents were A: 20 mM ammonium formate in water, B: acetonitrile. Fractions were taken every 30 s and sequentially pooled to 20 fraction pools. The fraction pools were then dried down and resuspended in 0.1% formic acid and 1% acetonitrile with Biognosys’s iRT kits spiked according to the manufacturer’s instructions. Before data-dependent acquisition (DDA) mass spectrometric analyses, peptide concentrations were determined, and the samples were centrifuged as described above.

Mass Spectrometric Acquisition

For data-independent acquisition (DIA) LC-MS measurements for the controlled quantitative experiment, 1 μg of peptides per sample was injected onto an in-house-packed reverse-phase column (PicoFrit emitter) with a 75 μm inner diameter, 60 cm length, and 10 μm tip from New Objective, packed with the Reprosil Saphir C18 1.5 μm phase (Dr. Maisch, Ammerbuch, Germany) on a Thermo Fisher Scientific EASY-nLC 1200 nanoliquid chromatography system connected to a Thermo Fisher Scientific Orbitrap Exploris 480 mass spectrometer equipped with a Nanospray Flex ion source. The DIA method was adopted from Bruderer et al.[38] and consisted of one full-range MS1 scan and 29 DIA segments. For DDA and DIA LC-FAIMS-MS/MS measurements, 4 μg of each sample was separated using a self-packed analytical PicoFrit column (75 μm × 50 cm length) (New Objective, Woburn, MA) packed with ReproSil Saphir C18 1.5 μm (Dr. Maisch GmbH, Ammerbuch, Germany) with a 2 h segmented gradient using an EASY-nLC 1200 (Thermo Fisher Scientific). LC solvents were A: water with 0.1% FA; B: 20% water in acetonitrile with 0.1% FA. The 2 h nonlinear LC gradient was 1–59% solvent B in 120 min followed by 59–90% B in 10 s, 90% B for 8 min, 90 to 1% B in 10 s, and 1% B for 5 min at 60 °C and a flow rate of 250 nL/min. The samples were acquired on an Orbitrap Exploris 480 mass spectrometer (Thermo Fisher Scientific) equipped with a FAIMS Pro device (Thermo Fisher Scientific) using methods based on ref (39). If not specified differently, the FAIMS-DIA method contained three FAIMS CV (−35, −55, and −75 V) parts, each with a survey scan of 120,000 resolution with 20 ms max IT and an AGC of 3 × 10^6 and 35 DIA segments of 15,000 resolution with IT set to auto and AGC set to custom 1000%. The mass range was set to 350–1650 m/z, the default charge state was set to 3, the loop count was set to 1, and the normalized collision energy was set to 30. 
For the acquisition of the fractionated sample for the library, a DDA method was applied. The DDA method consisted of three FAIMS CVs (−35, −55, and −75 V): each contained a DDA experiment with 60,000 resolution of MS1, 15,000 resolution of MS2, with a fixed cycle time (1.3 s), IT set to AUTO, and AGC set to custom 500%.[40]

Mass Spectrometric Data Analysis

Database Search for Library Generation

DIA and DDA mass spectrometric data were analyzed using SpectroMine software (version 3.0.2101115.47784, Biognosys) with the default settings, including a 1% false discovery rate control at PSM, peptide, and protein levels, allowing for two missed cleavages and variable modifications (N-term acetylation and methionine oxidation). The human UniProt FASTA database (Homo sapiens, 2020-07-01, 20,368 entries) was used, and for the library generation, the default settings were used except for the use of a top 300 precursors per protein filter.

Quantitative Analysis of Data-Independent Acquisition

Raw mass spectrometric data were first converted using the HTRMS Converter (version 14.3.200701.47784, Biognosys) and then analyzed using software Spectronaut (version 15.0.210108, Biognosys) with the default settings, but Q-value sparse filtering was enabled with a global imputing strategy and a hybrid library comprising all DIA and DDA runs conducted in this study.[41] The imputing strategy defines how to estimate the missing values (identifications not fulfilling the FDR threshold), and with the global imputing strategy, the missing values are imputed based on random sampling from a distribution of low abundant signals taken across the entire experiment (lowest 10th percentile ±1 standard deviation).[42] Default settings include peptide and protein level false discovery rate control at 1% and cross-run normalization using global normalization on the median. Including a high number of quality control samples (depletion, digestion, and injection controls) enabled the investigation of batch effects and quantification of the introduced variability at each step. No batch effect was identified by either principal component analysis (PCA, “stats” R-package) or hierarchical clustering. CQE DIA data were analyzed using the directDIA approach of Spectronaut software (version 15.0.210108, Biognosys) using the default settings, including a 1% false discovery rate control at PSM, peptide, and protein levels, allowing for two missed cleavages and variable modifications (N-term acetylation and methionine oxidation). The directDIA approach within Spectronaut is an implementation with minor improvements of the published DIA Umpire approach.[43] The combined human, E. coli, and S. 
cerevisiae FASTA databases, with overlapping tryptic sequences removed (Homo sapiens 2020-08-31, 96,996 entries; Saccharomyces cerevisiae (strain ATCC 204508/S288c), 6078 entries; Escherichia coli (strain K12), 4857 entries; Combined, 96,637 entries), were used, and for the library generation, the default settings were used except that Q-value sparse filtering was enabled with a global imputing strategy and cross-run normalization used global normalization on the median based solely on the human identifications. When we refer to proteins, we mean protein groups as determined by the IDPicker algorithm[44] as implemented in Spectronaut.
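The two post-processing ideas described in this section, cross-run normalization on the median and global imputation from low-abundance signals, can be illustrated with a minimal sketch. This is one possible reading of the description and not Spectronaut's implementation: the function names, the use of log2 intensities, the additive median shift, and the normal distribution fitted to values at or below the 10th percentile are all assumptions made for illustration.

```python
import random
import statistics


def median_normalize(runs):
    """Shift each run so that all run medians coincide.

    `runs` is a list of runs; each run is a list of log2 intensities with
    None marking a missing value. The common target is the median of the
    run medians (an assumption of this sketch).
    """
    medians = [statistics.median(v for v in run if v is not None) for run in runs]
    target = statistics.median(medians)
    return [
        [None if v is None else v - m + target for v in run]
        for run, m in zip(runs, medians)
    ]


def global_impute(runs, percentile=10, seed=0):
    """Fill missing values with draws from a low-abundance distribution.

    The distribution is a normal fitted to all observed values at or below
    the given percentile across the whole experiment (hypothetical choice).
    """
    rng = random.Random(seed)
    observed = sorted(v for run in runs for v in run if v is not None)
    cutoff = statistics.quantiles(observed, n=100, method="inclusive")[percentile - 1]
    low = [v for v in observed if v <= cutoff]
    mu = statistics.mean(low)
    sd = statistics.stdev(low) if len(low) > 1 else 0.0
    return [[rng.gauss(mu, sd) if v is None else v for v in run] for run in runs]


if __name__ == "__main__":
    example = [[10.0, 12.0, None, 14.0], [11.0, 13.0, 15.0, None]]
    print(global_impute(median_normalize(example)))
```

Normalizing before imputing keeps the imputed values on the same scale as the aligned runs; the order is a design choice of this sketch rather than something the text specifies.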

Data Analysis and Biomarker Selection

Initial univariate candidate filtering was performed using the pairwise Wilcoxon test applied per protein across disease status (healthy, early, and late stages) with the Holm–Bonferroni correction (within-group). Proteins with a p-value below or equal to 0.05 in a randomly selected 80% of observations were used for further optimization using sparse partial least-squares discriminant analysis (sPLSDA).[45] A leave-one-out algorithm was used for optimal component and protein selection. sPLSDA training and testing were performed using the R-package “mixOmics”.[46] The remaining 20% of observations were used for validation. The accuracy of prediction for all three groups (healthy, early, and late stages) and for healthy against early and late stages together was calculated as the ratio of the sum of true positives and true negatives to all observations (R-package “caret”). Unsupervised hierarchical analysis was performed with Manhattan distance and Ward’s clustering on centered and normalized data ((x_ij − x̄_j)/s_j for the i-th observation and the j-th protein) using R-package “ComplexHeatmap”. PCA was performed using R-package “stats”. Correlation analysis was performed using the Pearson correlation with R-packages “stats” and “corrplot”. Correlation significance was tested using a two-sided t-test at α = 0.05. All analyses were performed using log2-transformed data. Gene ontology enrichment was performed using GOrilla,[47] and the identifications of this study were selected as the background. All basic calculations and data transformations were performed in R with R-packages “dplyr” and “ggplot2”.
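Two of the steps above are simple enough to spell out: the Holm–Bonferroni adjustment of a set of p-values and the accuracy of a classifier. The sketch below is written in Python for illustration only; the study itself used R packages, and the function names here are hypothetical.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order.

    The k-th smallest p-value (k starting at 0) is multiplied by (m - k),
    capped at 1, and adjusted values are made monotonically non-decreasing.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


def accuracy(true_labels, predicted_labels):
    """Fraction of observations whose predicted label matches the true label.

    For two classes this equals (true positives + true negatives) / all.
    """
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


if __name__ == "__main__":
    print(holm_adjust([0.01, 0.04, 0.03]))
    print(accuracy(["healthy", "early", "late", "healthy"],
                   ["healthy", "early", "healthy", "healthy"]))
```

In the example, the smallest p-value is multiplied by 3 and the second smallest by 2; the largest would be multiplied by 1 but is raised to the running maximum so that the adjusted values keep the original ordering.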

Results

Optimization and Validation of the Analytical Approach

While methods to analyze the plasma proteome in-depth exist, they are usually either targeted and therefore biased, as in the case of antibody- or aptamer-based technologies, or are based on the principle of fractionation and are therefore difficult to scale. We aimed to develop an analytical method that provided deep coverage and quantitative accuracy while minimizing sample handling, bias, and batch effects. To this end, we developed and optimized an automated plasma depletion pipeline composed of three major steps: sequential depletion, parallel digestion, and LC-MS acquisition (Figure 1A).
Figure 1

Deep plasma profiling: automated analytical approach and benchmarking. (a) Sketch of the major steps of the analytical approach developed for deep human plasma profiling for biomarker discovery, including the depletion of the 14 most abundant proteins and the approximate time requirements. (b) Schema of the controlled quantitative experiment based on human plasma spiked with known amounts of Saccharomyces cerevisiae (S. cerevisiae) (1:1.3) and Escherichia coli (E. coli) (1:0.5). The controlled mixtures were either directly digested or processed using the process described in panel (a). (c) Plot showing the measured distributions of the fold changes of the controlled quantitative experiment divided by species. The dashed lines represent the theoretical fold change. (d) Comparison of the number of protein groups identified at different gradient lengths for a depleted human plasma pool by either directDIA (blue) or with a sample-specific library (red).

First, we automated the depletion of the 14 most abundant proteins using a sequential approach supporting a 96-well format.[48] Briefly, after randomization and filtration of the samples into a 96-well plate, an automated chromatographic system sequentially processed the plate, thereby depleting the 14 most abundant human proteins in plasma via the use of specific antibodies. To quantify the analytical gain of the approach and to assess whether depletion maintains quantitative precision and accuracy, we performed a controlled quantitative experiment (CQE). The CQE sample set was generated from 20 healthy human plasma samples spiked with either 1:400 E. coli and 1:90 S. cerevisiae for condition A or 1:200 E. coli and 1:120 S. cerevisiae for condition B (Figure 1B). After processing the 40 samples with or without the automated depletion pipeline, they were analyzed on a mass spectrometer using data-independent acquisition (DIA). 
Since the major challenge linked to quantification in plasma is the large dynamic range, removing the 14 most abundant proteins should lead to an increase in the number of proteins identified compared to the neat plasma. Indeed, while the processing of the neat plasma samples led to an average identification of 572 proteins (3920 peptides) across all samples, depletion significantly increased the coverage by 257% to 1471 proteins (10,230 peptides) (n = 40, p-value = 1e−98; Supporting Information Figure 1A). Importantly, depletion retained the quantitative accuracy close to the expected log2 ratios between conditions A and B of −1 for E. coli and 0.415 for S. cerevisiae: the median ratios were −1.20 and −1.18 for E. coli and 0.38 and 0.32 for S. cerevisiae in the neat and depleted sets, respectively (Figure 1C). We observed a reduction in the intensity of the depleted proteins along with the closely related proteins (e.g., other immunoglobulins or apolipoproteins) while observing an overall increase in intensity in the rest of the plasma proteome (Supporting Information Figure 1B). Furthermore, intensities of human proteins are correlated between the two data sets (Pearson correlation 0.58, n = 247), and if only nondepleted proteins are considered, this correlation becomes much stronger (0.85, n = 198). Finally, we performed an unpaired t-test between conditions B and A and could identify 171 and 621 candidates (FDR, q-value ≤ 0.01) for the neat and depleted sets, respectively (Supporting Information Figure 1C). Given the experiment’s controlled nature, we could identify the true hits as those proteins mapping to either E. coli or S. cerevisiae and showing the expected directionality. Overall, the depletion led to a 362% increase in true hits, 170 and 615 for neat and depleted (actual FDR < 1% for both), respectively. 
In summary, the automated depletion increased the number of proteins identified more than 2.5-fold and more than tripled the number of true hits while maintaining quantitative accuracy and reducing the manual workload to only the filtering of the samples (about half a day per 96 samples; Figure 1A). In the second step following depletion, the sample plate was prepared for digestion on an automated platform using a protein aggregation capture approach.[36] Subsequently, the samples were cleaned using C18 plates, and peptide concentration was measured. When a library is generated, a fraction of all samples is pooled, and an ultra-high-pressure liquid chromatography-controlled high pH reverse-phase (HPRP) fractionation is performed.[38] The third step comprises the LC-MS measurement of the samples. Even after depletion of the most abundant proteins, the major challenge hindering quantification is the large dynamic range in plasma. Hence, we developed and optimized the LC-MS acquisition for deep proteome coverage using FAIMS-based ion mobility on the Orbitrap platform combined with high-performance chromatography. We developed FAIMS-DIA methods that maximize protein and peptide identification by comparing values and counts of FAIMS compensation voltages with different scan resolutions. This resulted in a set of optimized methods for gradients from 1 to 4 h. Benchmarking with the depleted plasma resulted in 1300 protein identifications in 1 h gradients and up to 2103 protein identifications in 4 h (Figure 1D). For reference, in the human cell line HeLa, 10,026 proteins were identified in 4 h (Supporting Information Figure 1D). Altogether, we demonstrated that the presented automated plasma depletion pipeline has the potential to enable the unbiased, reproducible, and precise quantification of more than 2000 proteins on average per sample across very large cohorts.
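The benchmarking numbers reported for the controlled quantitative experiment can be rechecked with a few lines of arithmetic. The sketch below uses only counts stated in the text; the helper names are hypothetical, and the empirical FDR assumes that every candidate that is not a true hit counts as a false positive. It reports both the fold change and the percent increase, because the two conventions differ by 100 percentage points.

```python
def fold_and_percent_increase(before, after):
    """Return (fold change, percent increase) going from `before` to `after`."""
    fold = after / before
    return fold, (fold - 1.0) * 100.0


def empirical_fdr(candidates, true_hits):
    """Fraction of candidates that are not true hits (assumed false positives)."""
    return (candidates - true_hits) / candidates


# Counts taken from the text: (neat, depleted).
COUNTS = {
    "proteins identified": (572, 1471),
    "candidates": (171, 621),
    "true hits": (170, 615),
}

if __name__ == "__main__":
    for name, (neat, depleted) in COUNTS.items():
        fold, pct = fold_and_percent_increase(neat, depleted)
        print(f"{name}: {fold:.2f}-fold, +{pct:.0f}%")
    print(f"empirical FDR neat: {empirical_fdr(171, 170):.2%}")
    print(f"empirical FDR depleted: {empirical_fdr(621, 615):.2%}")
```

Under the stated assumption, both empirical FDR values come out below 1%, in line with the q-value threshold used for candidate selection.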

Plasma Proteome Depth Achieved

To test our pipeline, we set out to analyze a diverse cohort of human plasma samples coming from the five most deadly solid cancer types in the United States:[35] pancreatic, colorectal, breast, prostate, and non-small-cell lung cancers. For each cancer type, 15 early-stage (I–IIC) and 15 late-stage (IIIA–IIIC) nonmetastatic patients, as well as 15 matching normal control samples, were selected based on the available baseline data (including gender, age, and where applicable smoking status; Figure 2A and Supporting Information Table 2). Altogether, we processed 180 samples (and an additional 24 quality control samples) over the course of 1 week and approximately a month of measurement time. With this scalable approach, we could identify and quantify 2732 proteins (2463 proteins with two or more peptide sequences and on average 9.2 peptides per protein) across 226 measurements (180 samples and 46 quality control samples, about 900 proteins/h measurement; Figure 2B), of which 1804 were found in at least 50% of the runs (Supporting Information Figure 2A). On average, we identified 1806 proteins per run. Importantly, missing values stemmed mostly from biological variation: in injection triplicates, 88.7% of the 2209 protein groups detected were complete observations. Additionally, 77% of the 2402 protein groups from 15 injection replicates were complete, representing only 6% (119) fewer protein groups with full profiles than in injection triplicates. Across cancers (and the healthy cohort), the identifications varied between 2524 in prostate cancer and 2682 in lung cancer, showing that only a minimal part (<10%) of the identifications is disease-specific and that around 1000 protein groups are consistently quantified despite variable biology (Supporting Information Figure 2B). 
Furthermore, it can be assumed that peptides and proteins that do not fulfill the FDR criteria are below the limit of detection since DIA measures all ions and does not have the stochastic nature of DDA. With the identified proteins, we could cover the 8-order-of-magnitude dynamic range reported for plasma in the Human Protein Atlas (3222 proteins detected in human plasma by mass spectrometry, of which we could quantify 70%; Supporting Information Figure 2C). Within this range, we extensively covered the tissue leakage proteome, interleukins, and signaling proteins such as EGF, KLK3 (PSA), AKT1, CD86, MET, ERBB2, and CD33 (Figure 2C). As expected, among the 500 highest intensity proteins, that is, proteins that would likely have been identified even without depletion, 196 (39%) are classified as secreted proteins. On the lower end, we identified tissue-specific proteins coming from the diseased organs (n = 42, 81% of which are not part of the 500 most abundant proteins), cytokines (n = 29, 85%), and nucleoplasm (n = 637, 90%) proteins, exemplifying different functional plasma concentration ranges (Figure 2C). We identified 190 targets for FDA-approved drugs, of which 125 (66%) fall in the lower intensity range.[49] The different biological roles of low and high abundant plasma proteins show that we could recover the known biology of the plasma proteome.
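The completeness figures quoted above (proteins found in at least 50% of runs, proteins with complete profiles across replicates) are counts over a protein-by-run matrix. A minimal sketch of that bookkeeping follows; the data layout, with None marking a protein not identified in a run, and the function names are assumptions for illustration.

```python
def detection_fraction(row):
    """Fraction of runs in which a protein was identified (None = missing)."""
    return sum(v is not None for v in row) / len(row)


def summarize_completeness(matrix, min_fraction=0.5):
    """Count proteins with complete profiles and proteins seen in >= min_fraction of runs.

    `matrix` maps a protein identifier to its list of per-run intensities.
    """
    fractions = [detection_fraction(row) for row in matrix.values()]
    return {
        "n_proteins": len(fractions),
        "complete": sum(f == 1.0 for f in fractions),
        "at_least_min_fraction": sum(f >= min_fraction for f in fractions),
    }


if __name__ == "__main__":
    toy = {
        "P1": [1.0, 2.0, 3.0, 4.0],      # complete profile
        "P2": [1.0, None, 3.0, None],    # seen in half of the runs
        "P3": [None, None, None, 1.0],   # seen in a single run
    }
    print(summarize_completeness(toy))
```

Applied to injection replicates, the ratio of "complete" to "n_proteins" corresponds to the share of complete observations discussed in the text.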
Figure 2

Deep plasma discovery proteomics of five solid cancer types. (a) Description of cohort comprising five solid cancers: breast (infiltrating ductal carcinoma), colon (adenocarcinoma), pancreas (adenocarcinoma), prostate (adenocarcinoma), and lung (non-small-cell lung cancer, squamous cell) cancers. Fifteen subjects for early and late stages were selected for each cancer type, along with 15 matching healthy individuals (a total of 30, given the need to balance ethnicity and sex for prostate and breast cancers). (b) Z-score of all quantified proteins (n = 2732) across all measured samples (n = 180). Stage calling is overlaid. Both the proteins and the samples were hierarchically clustered. Selected, significantly enriched gene ontology pathways are reported on the right with the p-value in parentheses. (c) The protein rank vs protein average intensity (n = 180). Proteins were categorized according to Human Protein Atlas, and the average rank was calculated (dotted, vertical lines). The green box depicts the proteome region that is typically below the sensitivity of the neat plasma profiling by mass spectrometry. (d) The coefficient of variation (CV) of the quality control measurements across the processing steps was plotted. The LC-MS variance was controlled by reinjection of the same digested sample (injection). Digestion and depletion were done repeatedly of the same sample (digest, depletion) and the batch stemming from sample preparation 96-well plates (batch). Thick lines indicate medians, boxes indicate 25 and 75% quartiles, and whiskers extend between the median and ±(1.58 × interquartile range).

Furthermore, based on quality control samples, we could characterize the variance introduced at each level: injection (median coefficient of variation, CV = 16%), digestion (CV = 19%), depletion (CV = 25%), and column (CV = 26%), all of which are much lower than the healthy interindividual variability (CV = 56%; Figure 2D and Supporting Information Figure 2D). 
As a further quality control, we examined the interpatient variability of known proteins (measured by CV; Supporting Information Figure 2E). On the one hand, coagulation and complement cascade proteins (KEGG complement and coagulation cascades) were significantly enriched among the proteins with the least interpatient variability (median CV = 32% and p-value = 2.8e-12), such as complement factor I (CF1, CV = 23%) and complement component C6 (CV = 27%), demonstrating tight regulation.[18] On the other hand, keratins (likely contaminants, GO biological process keratinization) were significantly enriched among the proteins with the most interpatient variability (CV = 339% and p-value = 4.46e-8), with HLA molecules (CV = 90%) also showing high variability across patients.[50] Additionally, lipoprotein A (LPA) shows a large interpatient variability (CV = 113%), likely due to the known genetic variants affecting its secretion into plasma.[51,52] Overall, the quantitative data set generated recapitulates known biological features of interpatient heterogeneity while providing a deep unbiased view of the plasma proteome.
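As a reminder of how such CV values are obtained, the sketch below computes a per-protein coefficient of variation (standard deviation divided by mean, on linear-scale intensities) and summarizes it by the median across proteins. The intensities are invented, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, median, stdev


def cv_percent(intensities):
    """Coefficient of variation (%) of linear-scale intensities (sample sd / mean)."""
    return 100.0 * stdev(intensities) / mean(intensities)


def median_cv(protein_to_intensities):
    """Median of the per-protein CVs (%), skipping proteins with fewer than 2 values."""
    cvs = [cv_percent(v) for v in protein_to_intensities.values() if len(v) >= 2]
    return median(cvs)


# Hypothetical replicate intensities for three proteins (illustrative only).
replicates = {
    "P1": [100.0, 110.0, 90.0],   # CV = 10%
    "P2": [50.0, 50.0, 50.0],     # CV = 0%
    "P3": [10.0, 20.0, 30.0],     # CV = 50%
}
qc_median_cv = median_cv(replicates)
```

Applying the same function to replicate groups taken at different processing levels (reinjection, repeated digestion, repeated depletion) gives one median CV per level, which is the kind of comparison reported above.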

Considerable Heterogeneity across Cancer Types

The cohort was designed to enable five independent within-cancer analyses, each comprising a healthy-, early-, and late-stage group (each n = 15; Supporting Information Table 2 and Figure 2A). Overall, we included 30 control samples, but only a subset of 15 per cancer were matched (see methods). Hence, a combined analysis of all samples together was not the primary goal of this study. Aware of these limitations, we explored the entire data set for markers that would agnostically predict the cancer stage. The analysis pipeline applied to the whole data set was the same as that used for the cancer-specific analyses and aimed at providing actionable insights about specific disease development. Given the large amount of data (2732 proteins combined), we used a two-step approach (Figure 3A). First, we filtered for differentially abundant proteins between healthy-, early-, and late-stage cancers using univariate analysis. In the case of the pan-cancer model, we found 468 proteins dysregulated (Figure 3B, Supporting Information Figure 3A, and Supporting Information Table 3). Second, using the selected proteins, we trained a model based on sparse partial least-squares discriminant analysis (sPLSDA) on 80% of the data set. This modeling step further reduced the number of proteins to 94 (Figure 3B). The model partially differentiated healthy from diseased samples but not late from early stage (Supporting Information Figure 3B and Supporting Information Table 4). Interestingly, the majority of the differentiating proteins would have been below the detection level in a neat plasma preparation (65%; Figure 3C). Furthermore, the unsupervised clustering of the differentiating proteins generated enriched patterns (Figure 3D). For example, proteins enriched for immunoglobulin production and complement activation tend to be higher in healthy samples (Figure 3E). A subset of cancer samples shows a strong upregulation of proteins linked to metabolic processes and cellular oxidant detoxification (Figure 3D,E). 
Immunoglobulin kappa variable 6–21 (KV621) was among the proteins higher in healthy samples, was the third most important discriminant protein in the model (importance = 0.56), and showed a more pronounced bimodal distribution in healthy individuals and a decrease in diseased individuals (Figure 3F and Supporting Information Figure 3C). In addition, the model identified the known inflammation marker complement C5 (CO5,[53] importance = 1.00), increased in the early and late stages, and spondin-1 (SPON1, importance = 0.58), increased in the late stage (Figure 3F and Supporting Information Figure 3C), as the first and second most important contributors, respectively. Finally, the predictive power of the model was validated using the remaining 20% of the samples. The predictive power was low at 55.6% (Supporting Information Figure 3D), likely due to the cohort imbalance, the sample heterogeneity, and the small sample set, as each cancer type is known to have a particular protein signature.[54] Nonetheless, unsupervised clustering using the final protein panel (enrichment p-value = 1.4e-9) allowed for a more efficient separation of samples between healthy and diseased states compared to the entire proteome (p-value = 0.09; Figures 2B and 3D). Altogether, the global data analysis underlined the importance and necessity of precision medicine; a much larger sample set would be needed to find a potential “one-fits-all” solution.
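The two-step procedure described in this section (univariate filter, then a model trained on 80% of the samples and scored on the remaining 20%) can be sketched as follows. This is a structural illustration only, with several stated assumptions: a nearest-centroid classifier stands in for sPLSDA, the filter ranks proteins by a one-way ANOVA F statistic instead of applying a significance threshold, the split is stratified by stage, and the filter is restricted to the training rows (the text does not say whether the hold-out samples were included in the filter). All data are synthetic.

```python
import random
from statistics import mean


def f_statistic(values, labels):
    """One-way ANOVA F statistic for a single protein across the stage groups."""
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    grand = mean(values)
    means = {g: mean(vs) for g, vs in groups.items()}
    ss_between = sum(len(vs) * (means[g] - grand) ** 2 for g, vs in groups.items())
    ss_within = sum((v - means[g]) ** 2 for g, vs in groups.items() for v in vs)
    df_between, df_within = len(groups) - 1, len(values) - len(groups)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / df_between) / (ss_within / df_within)


def stratified_split(labels, train_frac=0.8, seed=0):
    """Return (train_rows, test_rows), splitting each class separately."""
    rng = random.Random(seed)
    train, test = [], []
    for g in sorted(set(labels)):
        rows = [i for i, lab in enumerate(labels) if lab == g]
        rng.shuffle(rows)
        cut = round(train_frac * len(rows))
        train += rows[:cut]
        test += rows[cut:]
    return train, test


def top_features(X, labels, rows, n_keep):
    """Step 1: rank proteins by F statistic on the given rows, keep the best."""
    sub_labels = [labels[i] for i in rows]
    scores = [f_statistic([X[i][j] for i in rows], sub_labels)
              for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:n_keep]


def fit_centroids(X, labels, rows, cols):
    """Step 2 stand-in for sPLSDA: per-class mean profile over selected proteins."""
    cents = {}
    for g in sorted(set(labels[i] for i in rows)):
        members = [i for i in rows if labels[i] == g]
        cents[g] = [mean(X[i][j] for i in members) for j in cols]
    return cents


def predict(cents, x, cols):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def dist(c):
        return sum((x[j] - cj) ** 2 for j, cj in zip(cols, c))
    return min(cents, key=lambda g: dist(cents[g]))


# Synthetic data: 45 samples (15 per stage), 6 proteins; only proteins 0 and 1
# shift with stage. All values are invented for illustration.
order = ["healthy", "early", "late"]
stages = ["healthy"] * 15 + ["early"] * 15 + ["late"] * 15
X = []
for i, s in enumerate(stages):
    shift = 2.0 * order.index(s)
    noise = [((7 * i + 3 * j) % 5 - 2) * 0.1 for j in range(6)]
    X.append([noise[j] + (shift if j < 2 else 0.0) for j in range(6)])

train, test = stratified_split(stages)
selected = top_features(X, stages, train, n_keep=2)
centroids = fit_centroids(X, stages, train, selected)
hits = sum(predict(centroids, X[i], selected) == stages[i] for i in test)
holdout_accuracy = hits / len(test)
```

With 45 samples per cancer-specific set, an 80/20 stratified split leaves 36 training and 9 hold-out samples, which matches the validation set size (n = 9) given in the figure captions. Real plasma data will not separate as cleanly as this toy example.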
Figure 3

Machine learning-based candidate biomarker discovery. (a) Schematic detailing the steps of the postprocessing, including univariate testing for filtering, machine learning (sPLSDA) on 80% of the data, and classification performance accuracy on the 20% hold-out validation data. (b) Overview of the number of biomarker candidates selected by univariate analysis (gray) and machine learning (blue) for healthy, early, and late stages across all cancers and individual cancers. (c) Average protein intensity plotted vs protein abundance rank. The machine learning-selected biomarker candidates for the pan-cancer model are colored blue (the average is plotted as a blue line), and important contributors are highlighted. The green box depicts the proteome region that is typically below the sensitivity of neat plasma profiling by mass spectrometry. (d) Z-score of all machine learning-selected candidate biomarkers for the pan-cancer model (n = 94) across all measured samples (n = 180). Stage calling is overlaid. Both the proteins and the samples were hierarchically clustered. Selected, significantly enriched gene ontology pathways are reported on the right with the p-value in parentheses. Proteins highlighted in blue and gray are reported in panels (e) and (f), respectively. (e) Boxplot visualization of the average z-transformed protein intensity for all proteins (n = 288) in the cluster highlighted in blue in panel (d) divided by stage (n = 180). Thick lines indicate medians, boxes indicate 25 and 75% quartiles, and whiskers extend between the median and ±(1.58 × interquartile range). (f) Boxplot visualization (as in panel (e)) of the log-transformed protein quantities of the three most differentiating proteins based on the machine learning model (SPON1, KV621, and CO5). Each data point represents a sample (n = 180).


Overall Changes within and across Cancer Types

Next, we applied the same analysis strategy to each of the five solid tumor types using the matched healthy controls. In the first step, we identified on average 325 significantly altered proteins between healthy, early, and late stages (Figures 3B and 4A and Supporting Information Table 3). With 436 significantly altered proteins (83% reduction in features), prostate cancer had the highest number of differentially abundant proteins, while breast cancer had the fewest with 229 (92% reduction). Interestingly, only a few proteins were shared among cancers (Supporting Information Figure 4A). Pancreatic and prostate cancers shared the most with 190 overlapping proteins, while breast and pancreatic cancers shared the fewest at 37 (Supporting Information Figure 4A). Seven candidate proteins were consistently selected as differentially abundant across all cancers: the complement activation protein C4b-binding protein β chain (C4BPB), the immunoglobulin component immunoglobulin heavy variable 4-4 (HV404), the T-cell apoptosis inducer galectin-1 (LEG1), the degrader of the inflammation-promoting bradykinin peptide Xaa-Pro aminopeptidase 1 (XPP1), the solute carrier family 2 facilitated glucose transporter member 1 (GTR1), the glycan metabolism β-mannosidase enzyme (MANBA), and the suggested growth inducer of epithelial tumors tenascin-X (TENX; Figure 4B and Supporting Information Figure 4A,B). These candidates tend to decrease (HV404, XPP1, MANBA, TENX) or increase (LEG1, C4BPB) in a cancer-agnostic manner, with the exception of GTR1, which strongly increases in late-stage breast cancer while decreasing in the other types (Figure 4C). Interestingly, this small set of proteins separated healthy samples from cancer-stage samples quite well (p-value = 1.9e-8; Figure 4B). Fitting an sPLSDA model with 80% of the data decreased the number of candidates overall to less than 5% of the total measured proteins. 
The modeling step yielded an average of 129 candidates, making biological interpretation and follow-up more feasible (Figures 3B and 4A and Supporting Information Table 4). The relative decrease in the input data was highly cancer-dependent, from an almost 76% reduction in pancreatic cancer to only a 15% reduction in lung cancer. The number of overlapping proteins across models was minimal, likely due to the reductionist approach of sPLSDA and cancer-type-specific mechanisms, with no proteins being selected for all models (Supporting Information Figure 4C). Still, TAGL and MANBA were selected in all but the breast cancer model, and GTR1 and LEG10 in all but the pan-cancer and breast cancer models (Figure 4C and Supporting Information Figure 4B).
Figure 4

Classification accuracy of the five cancer types. (a) Overview of the data analysis per cancer and combined (pan-cancer) as a normalized score. Percentage reduction upon univariate filtering and sPLSDA on 80% of the data set along with percentage accuracy as measured on the 20% hold-out samples as a three-way (healthy-, early-, and late-stage) and two-way (cancer and healthy) classification and p-value of enrichment based on the heatmap clustering (Manhattan distance, Ward clustering). (b) Z-score of the seven candidate proteins consistently selected across all cancers (by univariate analysis, n = 180). Stage calling is overlaid. Both the proteins and the samples were hierarchically clustered. (c) Boxplot visualization of log-transformed GTR1 quantities across the stage and cancer type. The healthy samples were matched to the respective cancer samples. Thick lines indicate medians, boxes indicate 25 and 75% quartiles, whiskers extend between the median and ±(1.58 × interquartile range), and each data point represents a sample (n = 180). The dashed blue line connects the median values across stages.

In summary, the model classification performance measured on the 20% validation set ranged between 33.3% in lung and prostate cancers and 77.8% in colorectal cancer when all three groups were considered and between 86.1% for the pan-cancer model and 100% for lung and colorectal cancers when healthy and overall disease status were considered (Figure 4A and Supporting Information Table 2). While for the early/late-stage differentiation two of the six models were close to random performance, the disease status was easier to predict, especially if the cancer type is known, as the pan-cancer model performed the worst with an 86% accuracy. Interestingly, high model performance was not always associated with high separation efficiency using PCA or distance analysis and vice versa (Figure 4A). This is especially apparent in the case of pancreatic and colorectal cancers. 
While the colorectal cancer model performs the best on the validation set, especially in the differentiation of healthy from diseased samples, pancreatic cancer leads to the best separation by hierarchical clustering on all three groups (p-value = 3.1e-16). In short, the cancer-specific models performed better than the “one-fits-all” approach. In some cases, the classification accuracy of the derived models was good, demonstrating the benefit of deep profiling of the plasma proteome.
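The gap between three-way and two-way accuracy can be made concrete with a short sketch. The labels below are hypothetical and are not taken from the study; they only show how a model that confuses early with late stage can still be perfect on healthy versus cancer.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def collapse_stage(label):
    """Map the three-way labels onto healthy vs. cancer."""
    return "healthy" if label == "healthy" else "cancer"


# Hypothetical hold-out set of nine samples (three per group).
y_true = ["healthy"] * 3 + ["early"] * 3 + ["late"] * 3
y_pred = ["healthy", "healthy", "healthy",
          "late", "late", "early",
          "early", "late", "early"]

three_way = accuracy(y_true, y_pred)                      # 5 of 9 correct
two_way = accuracy([collapse_stage(t) for t in y_true],
                   [collapse_stage(p) for p in y_pred])   # 9 of 9 correct
```

With nine hold-out samples, as stated for the colorectal and pancreatic validation sets, accuracy can only change in steps of 1/9, so 66.7% and 77.8% correspond to six and seven correct calls out of nine.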

Diseased State Separation in Colorectal Cancer

In colorectal cancer (CRC), we identified 307 proteins significantly altered between healthy, early, and late stages (Supporting Information Figure 5A). The sPLSDA model further reduced these candidate proteins to 90, and both hierarchical clustering and PCA led to the efficient separation of healthy subjects from patients regardless of tumor staging (p-value = 2.1e-8; Figure 5A and Supporting Information Figure 5B). Multiple biological GO enrichments in the candidates could be dissected; for example, response to leptin and regulation of proteolysis increased in cancer (including STAT3 and transgelin (TAGL)). In contrast, the negative regulation of cell–cell adhesion, leukocyte homeostasis, and response to hydrogen peroxide decreased (including CD47; Figure 5A,B). TAGL (importance = 1.00), STAT3 (importance = 0.65), and CD47 (importance = 0.57) were the three most predictive proteins from the sPLSDA model and showed interesting patterns (Figure 5B and Supporting Information Figure 5C). While CD47 and STAT3 showed strong heterogeneity in late-stage colorectal cancer, TAGL was highly expressed in both early- and late-stage colorectal cancers (Figure 5B). The selected 90 proteins were distributed across the entire intensity range of measured proteins, with more than 80% of the selected proteins (including the three most important) lying beyond the 500-protein mark that represents the usual range of proteins detected in neat plasma (Supporting Information Figure 5D). Furthermore, at 78%, the model had the best overall classification accuracy among all tested malignancies on the validation set (Figure 5C). As no misclassification for healthy subjects was observed, the panel of identified candidate proteins could be helpful for early CRC diagnosis. 
In summary, despite the small sample set, deep profiling of the human plasma enabled the partial classification of diseased patients based on a panel of 90 proteins that span a large dynamic range while providing an unbiased glimpse into the biological processes at the base of colorectal cancer.
Figure 5

Colorectal cancer biomarker candidates predict diseased status. (a) Z-score of all machine learning-selected candidate biomarkers for the colorectal cancer model (n = 90) across the matched colorectal sample set (n = 45). Stage calling is overlaid. Both the proteins and the samples were hierarchically clustered. Selected, significantly enriched gene ontology pathways are reported on the right with the p-value in parentheses. Proteins highlighted in gray are reported in panel (b). (b) Boxplot visualization of log-transformed CD47, STAT3, and TAGL quantities divided by the stage for the colorectal cancer set. Thick lines indicate medians, boxes indicate 25 and 75% quartiles, whiskers extend between the median and ±(1.58 × interquartile range), and each data point represents a sample (n = 45). (c) Overview of the classification accuracy of the machine learning models for the colorectal cancer validation set (n = 9). Correct classifications are represented in the highlighted boxes.


Stage Separation in Pancreatic Cancer

In the pancreatic cancer set, 436 proteins were significantly altered between healthy, early, and late stages (Supporting Information Figure 6A). The sPLSDA modeling selected 106 proteins, which efficiently separated the three classes in both hierarchical clustering and PCA (p-value = 3.1e-16; Figure 6A,B). The separation was driven primarily by CD9 (importance = 0.37), TENX (importance = 0.32), and di-N-acetylchitobiase (DIAC, importance = 0.28), with both TENX and DIAC showing a downregulation with disease progression and CD9 showing a stronger upregulation in early- than late-stage pancreatic cancer (Figure 6C and Supporting Information Figure 6B). CD9 levels correlated most strongly with endocytosis-related protein dynamin-1 (DYN1), heat shock protein β-1 (HSPB1), platelet glycoprotein 4 (CD36), and the profibrotic matricellular protein CCN family member 2 (CCN2). The unsupervised clustering of the candidate proteins resulted in interesting patterns (Figure 6A). In early-stage pancreatic cancer, proteins involved in the regulation of peptide secretion, cell communication, and chemokine production are overall downregulated, including LEG10, which is essential for the suppressive function of CD25-positive regulatory T-cells[55,56] (Supporting Information Figure 6C), while proteins involved in the negative regulation of apoptotic process and receptor internalization (including proto-oncogene tyrosine-protein kinase Src (SRC) and CD9; Figure 6C and Supporting Information Figure 6C) are upregulated. In late-stage pancreatic cancer, cellular oxidant detoxification and oxygen transport, including hemoglobin subunit γ-1 (HBG1), are upregulated (Supporting Information Figure 6C). Of the 125 biomarker candidates selected, 65% were in the low abundance range (Supporting Information Figure 6D). 
In the validation set, the model had an accuracy of 66.7%, with two out of nine observations incorrectly assigned to the healthy group instead of early-stage cancer (Figure 6D). On the whole, deep profiling of human plasma enabled the clustering of diseased patients based on the disease stage, and feature reduction allowed biological patterns related to disease progression to emerge.
Figure 6

Pancreatic cancer biomarker candidates predict diseased stage. (a) Z-score of all machine learning-selected candidate biomarkers for the pancreatic cancer model (n = 106) across the matched pancreatic cancer sample set (n = 45). Stage calling is overlaid. Both the proteins and the samples were hierarchically clustered. Selected, significantly enriched gene ontology pathways are reported on the right with the p-value in parentheses. Proteins highlighted in gray are reported in panel (c) and the Supporting Information Figure 6. (b) Representation of the first two dimensions from the PCA analysis based on candidates identified in the sPLSDA model for pancreatic cancer. Small points represent samples, and large points represent the average across the stage. While the first dimension separates healthy from diseased samples and explains 18% of the variance in the data, the second dimension separates early- and late-stage samples and represents 13% of the variability. The corresponding ellipses represent sample concentration around the mean. (c) Boxplot visualization of log-transformed CD9, DIAC, and TNXB quantities divided by the stage for the pancreatic cancer set. Thick lines indicate medians, boxes indicate 25 and 75% quartiles, whiskers extend between the median and ± (1.58 × interquartile range), and each data point represents a sample (n = 45). (d) Overview of the classification accuracy of the machine learning models for the pancreatic cancer validation set (n = 9). Correct classifications are represented in the highlighted boxes.


Discussion

We have developed an automated, robust, and parallelizable workflow for deep, large-scale plasma proteome profiling by depletion and sample preparation and by generating deep coverage ion mobility DIA methods. First, we demonstrated substantial improvements upon depletion for identification and quantification using a controlled quantitative plasma experiment. Furthermore, through multistage quality control, we assessed the variance introduced at each step of processing. In summary, the novel plasma discovery workflow enables the deep profiling of 10 samples per day per analytical platform to a depth of approximately 2700 proteins per study for 2 h gradients, reaching deep into tissue leakage and signaling molecules while maintaining quantitative accuracy. To evaluate the potential of deeper proteome coverage of the analytical pipeline, we measured a subset of the cancer plasma study with 3.5 h gradient FAIMS-DIA acquisitions. This resulted in a substantial increase in protein identifications to 3372 cumulatively (Supporting Information Figure 7). Next, we applied the novel plasma discovery workflow to a cohort containing samples coming from five solid tumors. Data analysis, including machine learning, revealed biomarker candidates and resulted in predictive models. The biomarkers mainly contain proteins from low abundance regions that would have likely been missed by neat plasma profiling, as previously speculated by Geyer et al.[9] Given the limited sample size and sample selection limitations (e.g., the healthy samples are self-declared healthy), the presented biomarker candidates require additional validation in an independent cohort. While the separation of healthy from cancer plasma samples was quite accurate for the cancer-specific models (average accuracy 93%), early- to late-stage differentiation was much more challenging, showing weaker separation (average accuracy 56%). 
The pan-cancer model performed worse than the cancer-specific models, indicating that “one-fits-all” biomarkers are generally harder to discover. This is likely because of the considerable heterogeneity across cancer types. It could be addressed by a larger cohort and a more advanced stratification strategy, which would likely lead to a larger biomarker panel. Seven candidate proteins were consistently differentially abundant across all cancers, of which one followed a cancer-type-specific behavior. Notably, the previously reported pan-cancer biomarker candidate TENX was reproduced, showing a reduction with disease progression irrespective of the cancer type.[57] Overall, our approach showed that deep exploration of the proteome of cancer plasma samples can be realized for biomarker discovery. Larger cohorts and a longitudinal study design, where the same subjects are monitored ideally before disease onset, would likely lead to more robust biomarkers. When focusing on colorectal cancer, 307 proteins were altered between healthy, early, and late stages. These include three with a documented role in colorectal cancer development: STAT3,[58] TAGL,[59] and CD47.[60] In addition, gene ontology enrichments based on identified candidates showed that response to leptin and the regulation of proteolysis increased in cancer. At the same time, the negative regulation of cell–cell adhesion, leukocyte homeostasis, and response to hydrogen peroxide decreased. Based on the machine learning-assisted biomarker discovery approach, a prediction model based on 90 proteins had the highest predictive classification power with a 78% accuracy on the hold-out set. In pancreatic cancer, 436 proteins were altered between healthy, early, and late stages. 
Of these, seven (GTR1, APOA4, IBP2, CD9, CAB45, OLFM4, BGH3) have previously been suggested as possible pancreatic cancer biomarkers.[61−65] Machine learning-based modeling selected 106 proteins, which led to an efficient separation using distance measures of healthy-, early-, and late-stage samples. The selected proteins showed an average overall prediction accuracy of 67%, with two observations incorrectly assigned to the healthy group instead of early-stage cancer. This separation was primarily driven by the three cancer-related proteins CD9,[66] TENX,[57] and DIAC.[62,67] Further supporting the quality of the candidates, the separation was also driven by the recently proposed therapeutic target CCN2[68] and the prognostic marker GTR1.[69] A study by Jayaraman et al. demonstrated that the exposure of pancreatic cancer cells to zinc leads to increased protein ubiquitination and enhanced cell death, implicating zinc as a potential therapy in treating pancreatic cancer.[70] We found the sequestration of zinc ions as an enriched biological process in pancreatic cancer, specifically downregulated in cancer samples (especially early stage). Clinical analysis of blood is the most widespread diagnostic procedure in medicine, and blood biomarkers are used to diagnose diseases, categorize patients, and support treatment decisions. The presented approach is well suited for deep, epidemiological biomarker studies in plasma as it reaches deep into the tissue leakage area, where information on the health state of distal tissues can be discovered. However, biomarker sets derived from the machine learning biomarker discovery analysis are not optimally suited for a direct transition into a “classical” clinical biomarker, as new multiplexed approaches for clinical assays would be required. 
Such challenges could potentially be facilitated by DIA or multiple PRM-based assays, which are fully compatible with the presented workflow and could ultimately result in streamlined discovery-to-target-driven personalized medicine utilizing only one technology platform.[71,72] Hence, we envision that the profiling of large cohorts at high proteome depth will strongly support the development of novel biomarkers previously not accessible to large-scale discovery approaches and will lead to the development of biomarker panels that will finally deliver on the promise of noninvasive, preventive cancer screening.
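As an illustration of what selecting a protein panel and reporting accuracy on a hold-out set involves, the following is a minimal sketch and not the authors' pipeline. Everything in it is an assumption made for illustration: the synthetic cohort, the ranking of proteins by difference of class means, the nearest-centroid classifier, and all function names. The one point it is meant to show is that the panel is chosen on training samples only, so the hold-out accuracy is not inflated by the selection step.

```python
import random
import statistics


def make_synthetic_cohort(n_samples=60, n_proteins=200, n_informative=10, seed=0):
    """Hypothetical data: rows are samples, columns are log-scale protein intensities."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2  # 0 = healthy, 1 = cancer (balanced, illustrative only)
        row = [rng.gauss(0.0, 1.0) for _ in range(n_proteins)]
        if label == 1:
            for j in range(n_informative):
                row[j] += 1.5  # shift a few proteins in the diseased group
        X.append(row)
        y.append(label)
    return X, y


def split_holdout(X, y, holdout_fraction=0.25, seed=0):
    """Shuffle sample indices and set aside a fraction as the hold-out set."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    n_hold = int(len(idx) * holdout_fraction)
    hold, train = idx[:n_hold], idx[n_hold:]

    def pick(ids):
        return [X[i] for i in ids], [y[i] for i in ids]

    return pick(train), pick(hold)


def class_means(X, y, label):
    """Per-protein mean intensity over the samples of one class."""
    rows = [x for x, lab in zip(X, y) if lab == label]
    return [statistics.fmean(col) for col in zip(*rows)]


def select_panel(X, y, panel_size):
    """Rank proteins by absolute difference of class means (training data only)."""
    m0, m1 = class_means(X, y, 0), class_means(X, y, 1)
    scores = [abs(a - b) for a, b in zip(m0, m1)]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:panel_size]


def predict(x, centroids, panel):
    """Assign the class whose centroid is nearest (squared Euclidean, panel only)."""
    def dist(label):
        c = centroids[label]
        return sum((x[j] - c[j]) ** 2 for j in panel)

    return min(centroids, key=dist)


def holdout_accuracy(panel_size=10, seed=0):
    """Select a panel on the training split and score it on the hold-out split."""
    X, y = make_synthetic_cohort(seed=seed)
    (X_tr, y_tr), (X_ho, y_ho) = split_holdout(X, y, seed=seed)
    panel = select_panel(X_tr, y_tr, panel_size)
    centroids = {lab: class_means(X_tr, y_tr, lab) for lab in (0, 1)}
    hits = sum(predict(x, centroids, panel) == lab for x, lab in zip(X_ho, y_ho))
    return panel, hits / len(y_ho)


if __name__ == "__main__":
    panel, acc = holdout_accuracy()
    print(f"panel size: {len(panel)}, hold-out accuracy: {acc:.2f}")
```

With a real cohort, the synthetic generator would be replaced by measured protein quantities. The panel size and the classifier are free choices that would normally be tuned by cross-validation within the training split.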
  69 in total

Review 1.  The human plasma proteome: history, character, and diagnostic prospects.

Authors:  N Leigh Anderson; Norman G Anderson
Journal:  Mol Cell Proteomics       Date:  2002-11       Impact factor: 5.911

2.  Identification of novel accessible proteins bearing diagnostic and therapeutic potential in human pancreatic ductal adenocarcinoma.

Authors:  Andrei Turtoi; Davide Musmeci; Yinghong Wang; Bruno Dumont; Joan Somja; Generoso Bevilacqua; Edwin De Pauw; Philippe Delvenne; Vincent Castronovo
Journal:  J Proteome Res       Date:  2011-07-29       Impact factor: 4.466

3.  Three biomarkers identified from serum proteomic analysis for the detection of early stage ovarian cancer.

Authors:  Zhen Zhang; Robert C Bast; Yinhua Yu; Jinong Li; Lori J Sokoll; Alex J Rai; Jason M Rosenzweig; Bonnie Cameron; Young Y Wang; Xiao-Ying Meng; Andrew Berchuck; Carolien Van Haaften-Day; Neville F Hacker; Henk W A de Bruijn; Ate G J van der Zee; Ian J Jacobs; Eric T Fung; Daniel W Chan
Journal:  Cancer Res       Date:  2004-08-15       Impact factor: 12.701

4.  Novel prognostic protein markers of resectable pancreatic cancer identified by coupled shotgun and targeted proteomics using formalin-fixed paraffin-embedded tissues.

Authors:  Tatsuyuki Takadate; Tohru Onogawa; Tetsuya Fukuda; Fuyuhiko Motoi; Takashi Suzuki; Kiyonaga Fujii; Makoto Kihara; Sayaka Mikami; Yasuhiko Bando; Shimpei Maeda; Kazuyuki Ishida; Takashi Minowa; Nobutaka Hanagata; Hideo Ohtsuka; Yu Katayose; Shinichi Egawa; Toshihide Nishimura; Michiaki Unno
Journal:  Int J Cancer       Date:  2012-09-14       Impact factor: 7.396

5.  Genetics of the quantitative Lp(a) lipoprotein trait. III. Contribution of Lp(a) glycoprotein phenotypes to normal lipid variation.

Authors:  E Boerwinkle; H J Menzel; H G Kraft; G Utermann
Journal:  Hum Genet       Date:  1989-04       Impact factor: 4.132

Review 6.  Biology and significance of alpha-fetoprotein in hepatocellular carcinoma.

Authors:  Peter R Galle; Friedrich Foerster; Masatoshi Kudo; Stephen L Chan; Josep M Llovet; Shukui Qin; William R Schelman; Sudhakar Chintharlapalli; Paolo B Abada; Morris Sherman; Andrew X Zhu
Journal:  Liver Int       Date:  2019-09-11       Impact factor: 5.828

7.  Prognostic value of GLUT-1 expression in pancreatic cancer: results from 538 patients.

Authors:  Gaowa Sharen; Yaojun Peng; Haidong Cheng; Yang Liu; Yonghong Shi; Jian Zhao
Journal:  Oncotarget       Date:  2017-03-21

8.  MS-based lipidomics of human blood plasma: a community-initiated position paper to develop accepted guidelines.

Authors:  Bo Burla; Makoto Arita; Masanori Arita; Anne K Bendt; Amaury Cazenave-Gassiot; Edward A Dennis; Kim Ekroos; Xianlin Han; Kazutaka Ikeda; Gerhard Liebisch; Michelle K Lin; Tze Ping Loh; Peter J Meikle; Matej Orešič; Oswald Quehenberger; Andrej Shevchenko; Federico Torta; Michael J O Wakelam; Craig E Wheelock; Markus R Wenk
Journal:  J Lipid Res       Date:  2018-08-16       Impact factor: 5.922

9.  CCN-Based Therapeutic Peptides Modify Pancreatic Ductal Adenocarcinoma Microenvironment and Decrease Tumor Growth in Combination with Chemotherapy.

Authors:  Andrea Resovi; Patrizia Borsotti; Tommaso Ceruti; Alice Passoni; Massimo Zucchetti; Alexander Berndt; Bruce L Riser; Giulia Taraboletti; Dorina Belotti
Journal:  Cells       Date:  2020-04-13       Impact factor: 6.600

10.  Evaluation of Spin Columns for Human Plasma Depletion to Facilitate MS-Based Proteomics Analysis of Plasma.

Authors:  Xiaofang Cao; AnnSofi Sandberg; José Eduardo Araújo; Filip Cvetkovski; Erik Berglund; Lars E Eriksson; Maria Pernemalm
Journal:  J Proteome Res       Date:  2021-07-28       Impact factor: 4.466

