Stefani N Thomas1, Betty Friedrich2, Michael Schnaubelt1, Daniel W Chan1, Hui Zhang3, Ruedi Aebersold4. 1. Department of Pathology, Clinical Chemistry Division, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA. 2. Department of Biology, Institute of Molecular Systems Biology, ETH Zürich, Otto-Stern-Weg 3, 8093 Zürich, Switzerland. 3. Department of Pathology, Clinical Chemistry Division, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA. Electronic address: huizhang@jhu.edu. 4. Department of Biology, Institute of Molecular Systems Biology, ETH Zürich, Otto-Stern-Weg 3, 8093 Zürich, Switzerland; Faculty of Science, University of Zürich, Zürich, Switzerland. Electronic address: aebersold@imsb.biol.ethz.ch.
Abstract
The National Cancer Institute (NCI) Clinical Proteomic Tumor Analysis Consortium (CPTAC) established a harmonized method for large-scale clinical proteomic studies. SWATH-MS, an instance of data-independent acquisition (DIA) proteomic methods, is an alternate proteomic approach. In this study, we used SWATH-MS to analyze remnant peptides from the original retrospective TCGA samples generated for the CPTAC ovarian cancer proteogenomic study. The SWATH-MS results recapitulated the confident identification of differentially expressed proteins in enriched pathways associated with the robust Mesenchymal high-grade serous ovarian cancer subtype and the homologous recombination deficient tumors. Hence, SWATH/DIA-MS presents a promising complementary or orthogonal alternative to the CPTAC proteomic workflow, with the advantages of simpler and faster workflows and lower sample consumption, albeit with shallower proteome coverage. In summary, both analytical methods are suitable to characterize clinical samples, providing proteomic workflow alternatives for cancer researchers depending on the context-specific goals of the studies.
The National Cancer Institute (NCI) Clinical Proteomic Tumor Analysis Consortium (CPTAC) established a harmonized method for large-scale clinical proteomic studies. SWATH-MS, an instance of data-independent acquisition (DIA) proteomic methods, is an alternate proteomic approach. In this study, we used SWATH-MS to analyze remnant peptides from the original retrospective TCGA samples generated for the CPTAC ovarian cancer proteogenomic study. The SWATH-MS results recapitulated the confident identification of differentially expressed proteins in enriched pathways associated with the robust Mesenchymal high-grade serous ovarian cancer subtype and the homologous recombination deficient tumors. Hence, SWATH/DIA-MS presents a promising complementary or orthogonal alternative to the CPTAC proteomic workflow, with the advantages of simpler and faster workflows and lower sample consumption, albeit with shallower proteome coverage. In summary, both analytical methods are suitable to characterize clinical samples, providing proteomic workflow alternatives for cancer researchers depending on the context-specific goals of the studies.
Advances in sample preparation workflows, mass spectrometry instrumentation, and data processing software have positioned proteomics to provide comprehensive insights into complex biological processes at a level close to the underlying biochemical mechanisms. Indeed, it is currently possible to routinely quantify >10,000 proteins in human cell proteomes (Beck et al., 2011, Nagaraj et al., 2011) and human tissue proteomes (Kim et al., 2014, Mertins et al., 2018, Wilhelm et al., 2014) using mass spectrometry-based platforms. The workflows for many of these large-scale proteomic studies entail extensive offline fractionation of the peptides generated from enzymatically digested proteins, followed by liquid chromatography-tandem mass spectrometry (LC-MS/MS). Consequently, proteomic analysis of large cohorts of clinical specimens (>100) requires several months for data acquisition using the aforementioned workflows. Furthermore, because each fraction analyzed typically requires 1–5 μg of total peptides, the required quantity of the original tissue sample is in the milligram level. Although more rapid proteomic workflows have been developed (Anagnostopoulos et al., 2017, Hebert et al., 2014, Kulak et al., 2014, Richards et al., 2015), they have not yet been deployed for large-scale clinical proteomic studies.Examples of large-scale clinical proteomic studies using the 2DLC-MS/MS workflow described above include the National Cancer Institute (NCI) Clinical Proteomic Tumor Analysis Consortium (CPTAC) studies. CPTAC was formed to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteogenomic analyses. Several hundred tumor tissue specimens from breast, ovarian, and colorectal cancer tissues previously analyzed by NCI's The Cancer Genome Atlas (TCGA) have also been characterized using proteomics, informed by genomics, resulting in the identification and quantification of proteins and phosphoproteins in cancer-associated cell signaling pathways and networks (Clark et al., 2020, Dou et al., 2020, Mertins et al., 2016, Zhang et al., 2014a, Zhang et al., 2016a). These studies employed data-dependent acquisition (DDA) mass spectrometry, a mode of MS/MS data collection wherein a fixed number of precursor ions whose m/z values were recorded in a survey scan are selected for fragmentation using a pre-determined set of rules (Mann et al., 2001). DDA-based proteomic workflows have undergone considerable optimization to improve the reliability and reproducibility of the generated data in an effort to minimize the limitations due to the stochastic nature of precursor ion selection and low sampling efficiency resulting in missing values across datasets (Mertins et al., 2018, Revesz et al., 2018, Tabb et al., 2016, Zhou et al., 2017).A relatively newer method termed data-independent acquisition (DIA) mass spectrometry has been gaining traction in large-scale proteomic studies (Vidova and Spacil, 2017). DIA mass spectrometry is an alternative to DDA that allows all ions within a selected mass range to be concurrently fragmented and analyzed by tandem mass spectrometry. Sequential Window Acquisition of All Theoretical Mass Spectra (SWATH-MS) is an example of a DIA acquisition method whose use in proteomic studies has increased considerably within the past 5 years. The SWATH-MS method acquires a complete and permanent digital fragment ion record for all detectable precursor ions of a sample (Collins et al., 2013, Gillet et al., 2012, Liu et al., 2013) and provides an iterative, targeted search strategy that determines the presence and quantity of tens of thousands of query peptides using reference fragment ion spectra for the query peptides as prior information (Rost et al., 2014). To support SWATH/DIA data analysis, several software tools have been developed and benchmarked (Navarro et al., 2016). As with any analytical methodology that has potential widespread use, several studies have been conducted to optimize and evaluate the performance of SWATH-MS (Li et al., 2017b, Rardin et al., 2015). In a multi-laboratory study including 11 sites worldwide, SWATH-MS was shown to have a linear dynamic range exceeding four orders of magnitude with an inter-laboratory coefficient of variation (CV) of 22.0 ± 17.4% (Collins et al., 2017), demonstrating that SWATH-MS is a reproducible method for large-scale protein quantification.To assess the potential of SWATH-MS in addressing some of the common limitations of DDA proteomic workflows, comparative analyses of SWATH-MS and DDA using isobaric tags for relative and absolute quantitation (iTRAQ) have been conducted (Basak et al., 2015, Bourassa et al., 2015, Zhang et al., 2014b). Basak et al. concluded that SWATH-MS and iTRAQ DDA are complementary techniques with a 60% overlap of the high-confidence quantifiable proteins identified by both methods using Saccharomyces cerevisiae as a model system when incorporating offline peptide fractionation into the LC-MS workflow (Basak et al., 2015). In their study, Basak et al. incorporated first dimension separation using strong cation exchange chromatography wherein the peptides were fractionated into six fractions followed by second dimension separation using reversed-phase chromatography prior to LC-MS analysis.In the current study, we utilized SWATH-MS to analyze peptides from 103 HGSOC tumors that were previously analyzed by iTRAQ DDA as part of the NCI CPTAC study (Zhang et al., 2016a). The iTRAQ DDA proteomic workflow resulted in the identification of 8,597 proteins from these tumors using a 24-fraction peptide separation method, whereas 2,914 proteins were quantified by SWATH-MS without peptide fractionation. We compared the two proteomic workflows on the basis of cost, robustness, complexity, ability to detect differential protein expression, and the elucidated biological information. Our analysis demonstrated that despite the greater than 2-fold difference in the analytical depth of iTRAQ DDA compared with SWATH-MS common differentially expressed proteins in enriched pathways associated with the HGSOC Mesenchymal subtype were identified by both workflows with 96% of the proteins quantified by SWATH-MS also quantified by iTRAQ DDA. We also showed that tumor subtype classification stability is sensitive to the number of samples that are analyzed. Lastly, our results indicated a conservation of the homologous recombination deficiency (HRD)-associated enriched DNA repair and chromosome organization pathways in the iTRAQ DDA and SWATH-MS datasets, thus indicating that some biological information for HGSOC could be consistently extracted from either dataset.Taken together, compared with the DDA analysis of HGSOC using iTRAQ labeling and 2D fractionation, SWATH-MS is a robust proteomic method that can be used to re-capitulate common differentially expressed proteins in enriched pathways associated with the HGSOC Mesenchymal subtype (Zhang et al., 2016a). The SWATH-MS proteomic workflow is simpler, cheaper, and consumes less sample, but it results in shallower proteome coverage. The significantly lower number of proteins detected by SWATH-MS compared with the iTRAQ DDA workflow is mitigated by the streamlined and less complex workflow, the increased sample throughput (requiring ~80% less time), ~10-fold reduced sample requirements, and lower technical variability and attenuated signal compression. The SWATH-MS workflow therefore presents novel opportunities to enhance the efficiency of clinical proteomic studies with continuous method improvements.Ludwig et al. provide an in-depth overview of the advantages and limitations of SWATH-MS in comparison with DDA (Ludwig et al., 2018). There are analytical challenges common to both workflows that should be taken into account when deciding on an optimal approach for a given biological or clinical problem. These challenges include sample complexity, dynamic range of the measured proteins, accuracy of the protein measurements, use of incomplete databases, and the selection of peptides for protein quantification. The strengths and weaknesses of SWATH-MS and iTRAQ DDA must be considered in the context of the afore-mentioned analytical challenges.
Results
Study Design and Evaluation of Technical Performance of Analytical Platforms
Proteomic measurements of clinically annotated HGSOC previously characterized by TCGA (Cancer Genome Atlas Research, 2011) were conducted using an iTRAQ DDA workflow entailing stable isotope labeling, offline fractionation, and LC-MS/MS analysis of each fraction (Zhang et al., 2016a). A total of 103 of these tumors were used for the current SWATH-MS analysis. An overview of the experimental design of our study is shown in Figure 1.
Figure 1
Experimental Design of Proteomic Analysis of High-Grade Serous Ovarian Tumors Using DIA Mass Spectrometry Instrument Platforms
A total of 103 clinically annotated ovarian high-grade serous carcinomas previously characterized by The Cancer Genome Atlas (TCGA) were processed for proteomic analysis using an iTRAQ DDA method wherein the samples were subjected to fractionation prior to data acquisition and also using a SWATH-MS method without fractionation prior to analysis. The data processing and bioinformatics pipeline enabled differential expression analysis and tumor subtype classification comparison.
Experimental Design of Proteomic Analysis of High-Grade Serous Ovarian Tumors Using DIA Mass Spectrometry Instrument PlatformsA total of 103 clinically annotated ovarian high-grade serous carcinomas previously characterized by The Cancer Genome Atlas (TCGA) were processed for proteomic analysis using an iTRAQ DDA method wherein the samples were subjected to fractionation prior to data acquisition and also using a SWATH-MS method without fractionation prior to analysis. The data processing and bioinformatics pipeline enabled differential expression analysis and tumor subtype classification comparison.Protein was extracted from each tumor specimen followed by enzymatic digestion with trypsin. For the iTRAQ DDA workflow, the resulting peptides were labeled with 4-plex iTRAQ reagents followed by combination into analysis sets comprising the peptides from three tumors, each labeled with a distinct iTRAQ tag, and an iTRAQ-labeled reference pool comprising the peptides from most of the tumors. Each analysis set was subjected to offline fractionation into 24 concatenated fractions, and the fractions from each analytical set were sequentially analyzed using a DDA method on an LTQ-Orbitrap Velos mass spectrometer. An unfractionated aliquot of each analytical set was also analyzed by DDA-MS without fractionation. In comparison, the SWATH-MS workflow did not require stable isotope labeling. However, for the purpose of generating a spectral library to facilitate the targeted protein identification, each sample from 103 tumors was pooled, fractionated to 48 fractions and subjected to DDA analysis, which is an optional step for this workflow. We note that, although we chose in this study to create a sample-specific reference library, several publicly available reference libraries exist (Ludwig et al., 2018). The implications of using a generic human library on the false discovery rate (FDR) due to multiple hypothesis testing correction have been addressed (Rosenberger et al., 2017a). Another option is to avoid acquiring DDA runs to generate the spectral library by utilizing spectrum-centric scoring of DIA data based on the use of algorithms such as DIA-Umpire (Li et al., 2015, Tsou et al., 2015).The CV of SWATH-MS data was assessed using two QC approaches (Figure S1). For the first method, peptides from HEK293 cells were analyzed in triplicate on three days for a total of nine runs between the DDA data acquisition runs for spectral library generation and the SWATH acquisition of the ovarian tumor data. The median CV for 3,855 quantified proteins was 8%, and the mean total CV (reflecting the intra- and inter-day CV) was 15% for the nine technical repeat analyses (Collins et al., 2013, Collins et al., 2017) (Figure S1A). For the second QC method, peptides from a control ovarian tumor were analyzed using a DDA method on the same 5600+ TripleTOF mass spectrometer that was used to acquire the SWATH data. These QC samples were run in duplicate immediately prior to the SWATH acquisition of the ovarian tumor samples, and again in duplicate 10 days later when half of the ovarian tumor sample data acquisition runs were completed. These measurements resulted in a total CV of 7% for 781 quantified proteins (Figure S1B). For both QC strategies, the duration of the LC gradient was identical to the gradient used for the SWATH analysis of the HGSOC samples. Hence, the results from these QC strategies evaluated the analytical measurement/technical variability of the SWATH-MS platform that was used to acquire the data from the ovarian tumor proteins.We used the 1,599 proteins quantified by iTRAQ DDA and SWATH-MS to compare the analytical differences between the two workflows with respect to relative protein abundances (Figure 2A), correlation of the normalized relative abundance of the quantified proteins (Figure 2B), and variability of the constituent peptides from the quantified proteins (Figure 2C; y axis indicates the peptide variability calculated by the standard deviation of the quantified peptides for each protein divided by the mean peptide intensity per protein). Relative protein abundances were determined based on the normalized log2 intensity ratios compared with the reference iTRAQ channel of each iTRAQ set for the iTRAQ data, and relative protein abundances in the SWATH-MS dataset were determined based on the log2 intensity of each protein as represented by the mean intensity of the constituent peptide ratios (peptide intensity divided by the mean peptide intensity) (Figure 2A). The median relative log2 relative protein abundance of the iTRAQ data was 0.01 compared with −0.23 in the SWATH-MS data. The compressed distribution of the quantified relative protein abundances of the iTRAQ DDA data reflects the well-documented phenomenon of ratio compression in iTRAQ-based relative quantification (Ow et al., 2011, Savitski et al., 2013) (Figure 2A).
Figure 2
Comparison of iTRAQ DDA and SWATH-MS Proteomic Data Based on Completeness and Variability
The 1,599 proteins quantified by iTRAQ DDA and SWATH-MS were used for the following comparisons: (A) Distribution of relative protein abundance in the iTRAQ DDA and SWATH-MS datasets. (B) Spearman's rank correlation of proteins quantified by iTRAQ DDA and SWATH-MS. (C) Peptide variability calculated by the standard deviation of peptides per protein divided by mean peptide intensity per protein.
Comparison of iTRAQ DDA and SWATH-MS Proteomic Data Based on Completeness and VariabilityThe 1,599 proteins quantified by iTRAQ DDA and SWATH-MS were used for the following comparisons: (A) Distribution of relative protein abundance in the iTRAQ DDA and SWATH-MS datasets. (B) Spearman's rank correlation of proteins quantified by iTRAQ DDA and SWATH-MS. (C) Peptide variability calculated by the standard deviation of peptides per protein divided by mean peptide intensity per protein.Spearman's rank correlation was used to assess the strength of the association of the proteins quantified by iTRAQ DDA and SWATH-MS. The median ρ of 0.61 indicates a moderately positive correlation (Figure 2B). Among the factors that likely preclude the median ρ from being higher are the fundamental differences in protein quantification (log2 ratio of reporter ion intensities in the iTRAQ DDA data compared with the summed fragment ion intensities and normalized to the respective mean peptide intensities for the SWATH-MS data) and the differences in protein quantification wherein multiple peptides from the same protein were used to quantify each protein in the dataset. Additionally, for the majority of the quantified proteins, different constituent peptides and different numbers of unique peptides were identified in the iTRAQ DDA workflow compared with SWATH-MS. A direct comparison of the CV of each method was not possible because replicate analyses of the samples were not conducted; therefore, a CV-like score was calculated, which represents the biological variability of the sibling peptides from the quantified proteins. Because aliquots of identical samples were used for this analysis, the displayed variation reflects technical differences in the measurements and data handling of both methods. The median variability of the SWATH-MS data was higher than the iTRAQ DDA data (0.25 versus 0.18, respectively). As shown in the violin plots, the variability of the iTRAQ DDA data has a wider distribution than that of the SWATH-MS data (Figure 2C).Both assessed proteomic methods yielded different numbers of quantified proteins: a cumulative total of 8,597 quantified proteins resulting from the iTRAQ DDA analysis and a cumulative total of 2,914 quantified proteins resulting from the SWATH-MS analysis. A total of 1 μg of each of the 24 fractionated peptide samples was used for the iTRAQ DDA analysis, and 1 μg of peptide samples from each tumor was injected for the SWATH-MS analysis. In iTRAQ DDA, we used the original quantified proteomic data from Zhang et al. (2016a), and we only used values obtained from unshared peptides with other proteins (proteotypic peptides). The SWATH-MS data were filtered on a peptide-query level FDR of 1% and a protein-level global FDR cutoff of 1% based on the method described by Rosenberger et al. for the statistical control of peptide and protein error rates in large-scale DIA analyses (Rosenberger et al., 2017a). In addition, only proteotypic peptides were used for quantification for this dataset.For any proteomic dataset consisting of multiple analyses of biologically distinct samples, not every protein will be detected in every sample, and the resulting data matrix will have missing values. This can either be due to technical reasons, i.e., not every protein present in a sample will be identified owing to the stochastic nature of the sampling of ions by the mass spectrometer or reasons related to biological variability. We therefore assessed the occurrence and distribution of missing values in the two datasets. Figure S2 shows the distribution of the proteins with missing values in the iTRAQ and SWATH data. Here, we show the number of proteins with various percentages of missing values in the iTRAQ DDA (Figure S2A) and SWATH-MS datasets (Figure S2B), starting with proteins without any missing values across all 103 samples up to proteins with more than 91% missing values. In the iTRAQ dataset, approximately 50% of the proteins did not have any missing values, whereas in the SWATH-MS dataset, 11% of the proteins did not have any missing values and 50% of the proteins had ~30% missing values. The iTRAQ DDA dataset was directly filtered for the proteins without missing values (4,363). The same approach was not applicable in SWATH-MS owing to the sparsity of the matrix. Only working with complete measurements in SWATH-MS would also result in neglecting potential biological effects, where proteins might not be detected owing to downregulation or absence in a subset of the samples. Owing to the instrument duty cycle and the stochastic nature of data acquisition, more missing values are expected to occur in low-abundance proteins than in high-abundance ones (Wang et al., 2007). We assessed this as well in the current dataset and could indeed observe this inverse correlation (Figure S2C). Hence, to distinguish technical and biological reasons for “missingness” we applied a filtering strategy following this assumption by allowing more missing values in high-abundance proteins and less missing values among the low-abundance proteins, yielding 1,659 proteins, instead of taking the 4,363 proteins without missing value from our previous iTRAQ based DDA approach. To enable the use of a complete matrix, an imputation approach was adopted based on that used in the Perseus software program (Tyanova et al., 2016). Imputation has been demonstrated to be an effective approach to address the challenge of missing values in SWATH-MS and related DIA approaches (Collins et al., 2017, Karpievitch et al., 2012, Rost et al., 2016).The robust filtering approach used for the SWATH-MS data resulted in 1,659 proteins that were quantified with high confidence across all 103 tumors. Among these proteins, 1,599 were also quantified using iTRAQ-DDA, and these are the proteins that were used for all subsequent analyses, including the elucidation of the molecular subtype classification of the tumors based on their proteomic signatures.
High-Grade Serous Ovarian Carcinoma Subtype Classification Based on Proteomic Signatures
The rationale for the molecular subtyping of HGSOC is related to efforts to develop more specific and effective therapeutic strategies given the heterogeneity of this type of ovarian cancer. To assess the ability to classify the iTRAQ-DDA and SWATH-MS data into subtypes based on proteomic signatures, we used the same approach as reported by Zhang et al. wherein an unbiased molecular taxonomy of HGSOC was established using relative protein abundance data to identify subtypes that exhibit biological differences (Zhang et al., 2016b). The 1,599 proteins that were quantified in the iTRAQ-DDA and SWATH-MS datasets were used to classify the two datasets separately using mclust (Fraley and Raftery, 2002) based on the z-score-transformed relative protein abundances, and the emergent protein modules were characterized using weighted gene-correlation network analysis (WGCNA) (Langfelder and Horvath, 2008) and Reactome pathway enrichment. The heatmap resulting from the clustering analysis is shown in Figure S3A with the proteins listed in horizontal rows and the tumors listed in vertical columns. The colored vertical bars represent the transcriptome-based HGSOC subtypes (Verhaak et al., 2013) (“Original TCGA”), the proposed proteomic iTRAQ DDA subtypes (“Original CPTAC” (Zhang et al., 2016a); “iTRAQ DDA”), and the proposed proteomic SWATH-MS subtypes (“SWATH-MS”).The correlation of the WGCNA-derived protein modules with the proteomic subtypes derived from the iTRAQ DDA and SWATH-MS datasets is shown in Figure S4. The enriched Reactome pathways in the WGCNA-derived modules in the SWATH-MS data include ECM organization, immune response, metabolism, complement cascade and fibrin clot formation, and gene expression and translation. In the SWATH-MS dataset, the tumors with proteins exhibiting a significantly positive correlation (Pearson correlation coefficient ρ = 0.52; p value = 2 × 10−8) with the ECM organization Reactome pathway were assigned to the Mesenchymal subtype. A significantly positive correlation (Pearson correlation coefficient ρ = 0.54; p = 5 × 10−9) was also observed among the proteins assigned to the gene expression and translation ontology.Based on the observation highlighted by the dashed box in Figure 3 indicating a large degree of similarity among the genomic and proteomic profiles of the Mesenchymal subtype tumors, we conducted a down-sampling analysis in an effort to evaluate the stability of the Mesenchymal subtype. In this analysis, the number of tumors used for the cluster analysis was systematically decreased to determine the sample number cutoff below which the enrichment of the Reactome pathways was no longer significant. The Mesenchymal cluster was the only cluster for which the enrichment of specific Reactome pathways remained significant (p < 0.01, Fisher's exact test) when the sample number was reduced to 60% (Figure S3B). These results suggest that the SWATH-MS and iTRAQ DDA proteomic signatures of the Mesenchymal subtype tumors are consistent and robust.
Figure 3
Protein (iTRAQ DDA, SWATH-MS) and mRNA-Based High-Grade Serous Ovarian Cancer Subtype Classification
(A) Tumor subtype classification comparison based on gene expression (TCGA study) and relative protein abundance (original CPTAC study [Zhang et al., 2016a], and the iTRAQ DDA and SWATH-MS data from the current analysis). Subtype classification: blue, Mesenchymal; red, Differentiated; purple, Proliferative; green, Immunoreactive; yellow, Stromal.
(B) Subtype classification agreement among the genomic and proteomic data assessed by the Adjusted Rand Index (ARI).
Protein (iTRAQ DDA, SWATH-MS) and mRNA-Based High-Grade Serous Ovarian Cancer Subtype Classification(A) Tumor subtype classification comparison based on gene expression (TCGA study) and relative protein abundance (original CPTAC study [Zhang et al., 2016a], and the iTRAQ DDA and SWATH-MS data from the current analysis). Subtype classification: blue, Mesenchymal; red, Differentiated; purple, Proliferative; green, Immunoreactive; yellow, Stromal.(B) Subtype classification agreement among the genomic and proteomic data assessed by the Adjusted Rand Index (ARI).Because the protein and mRNA components of the same ovarian tumors were analyzed, we were able to compare the robustness of the Mesenchymal subtype across different analytical methods. Figure 3A compares the subtyping of the tumors based on the five protein-based subtypes (Differentiated “D,” Immunoreactive “I,” Proliferative “P,” Mesenchymal “M,” and Stromal “S”) that emerged from the original iTRAQ DDA-based analysis of these tumors (Zhang et al., 2016a), the four mRNA-based subtypes (Differentiated, Proliferative, Mesenchymal, and Immunoreactive) that emerged from the genomic analysis (Cancer Genome Atlas Research, 2011), and the three protein-based subtypes that resulted from the clustering analysis of the iTRAQ DDA and SWATH-MS data using the 1,599 proteins that were quantified using these two analytical approaches. Based on visual analysis, the Mesenchymal subtype tumors (represented by shades of blue) exhibit the highest degree of classification agreement among the four analytical approaches. Although the accurate characterization of the molecular subtypes of HGSOC is challenging, it is widely accepted that the Mesenchymal subtype is defined by the increased expression of extracellular matrix proteins and desmoplasia (Chen et al., 2018). Compared with the other data types, the SWATH-MS data facilitates the discrete partitioning of one of the five proteomic subtypes, the Mesenchymal subtype tumors, from our previous five proteomic subtypes (Figure 3A).
Influence of Sample Size and Analytical Depth on Tumor Molecular Subtype Classification Stability
We used the Adjusted Rand Index (ARI) to quantitatively assess the agreement among the classification of the tumors based on the proteomic analyses (iTRAQ DDA and SWATH-MS) (Figure 3B). ARI values range from 0 to 1, with 1 indicating clustering results that are identical and 0 indicating clusters that are devoid of similarity. Although we did not anticipate an ARI of 1 when comparing the iTRAQ DDA and SWATH-MS datasets using the commonly quantified 1,599 proteins owing to the inherent noise in biological data, we did not expect the relatively low ARI of 0.21. To further explore the similarity among the various proteomic datasets mentioned in Figure 3A, we calculated the ARI values that resulted from changing the numbers of proteins included in the analysis. The highest ARI was 0.58, which was obtained by comparing the complete iTRAQ DDA dataset without any missing values (4,363) with the 1,599 proteins in the iTRAQ DDA dataset that were also quantified by SWATH-MS. This ARI was unexpectedly low given the expected level of similarity when comparing a subset to the full set of exactly the same data. Conversely, the lowest ARI, 0.14, was obtained when comparing the full iTRAQ DDA dataset of 8,597 proteins with the 1,599 proteins that were quantified in common using SWATH-MS. This low ARI value of 0.14 is likely reflective of the fundamental differences in protein quantification between iTRAQ DDA and SWATH-MS.We further examined the unexpected lack of similarity when comparing the subset of the iTRAQ data to the full dataset by adopting a systematic bootstrapping approach. Bootstrapping was used to randomly select a fraction of either the 103 samples analyzed by iTRAQ DDA and SWATH-MS (Figures 4A and 4B) or the proteins that were quantified by iTRAQ DDA (4,363) and SWATH-MS (1,659) (Figures 4C and 4D). ARI values based on mclust classification were used to determine the robustness of the clusters. This bootstrapping was repeated 100 times for each subset of the proteins or samples. The resulting classifications were compared with the result from the complete initial dataset, and the distribution of ARI values is shown in Figure 4 for the iTRAQ DDA (left column, A, C) and SWATH-MS datasets (right column, B, D). A lack of statistical significance among the ARI values (n.s.) indicates cluster stability, whereas cluster instability is indicated by statistically significant changes in ARI. When we bootstrapped along the dimension of sample number, the average ARI decreased dramatically with the reduced number of samples, even when comparing 90% versus 80% of the samples in the iTRAQ DDA and SWATH-MS datasets (Figures 4A and 4B, respectively). However, the corresponding median ARI values of the incrementally reduced sample sizes in the SWATH-MS dataset systematically exceed the median ARI values of the iTRAQ DDA dataset (Figure 4B). These results indicate that the fidelity of molecular clusters is rather sensitive to the number of samples included in the analysis and that the clusters identified from the SWATH-MS data generally have slightly higher confidence. In comparison, when we bootstrapped along the dimension of the number of quantified proteins (Figures 4C and 4D) the ARI was refractory to the number of proteins included in the clustering. The difference among the ARI values did not reach statistical significance at a significance level of 0.05 (i.e., the clusters were stable) when the number of quantified proteins was down-sampled to 30% and 50% of the initial group of proteins in the iTRAQ DDA and SWATH-MS datasets, respectively. Hence, the resulting clusters are largely insensitive to the number of included proteins.
Figure 4
Proteomic Cluster Stability as a Function of Sample Number and Protein Number in the iTRAQ DDA and SWATH-MS Datasets
Cluster stability measured by the Adjusted Rand Index (ARI) as a function of the percentage of tumors analyzed in the iTRAQ DDA (A) and SWATH-MS (B) datasets. (C and D) Cluster stability assessed by ARI as a function of the percentage of quantified proteins in the iTRAQ DDA (C) and SWATH-MS (D) datasets. ∗p < 0.05, ∗∗p < 0.01, ∗∗∗∗p < 0.0001. n.s. non-significant
Proteomic Cluster Stability as a Function of Sample Number and Protein Number in the iTRAQ DDA and SWATH-MS DatasetsCluster stability measured by the Adjusted Rand Index (ARI) as a function of the percentage of tumors analyzed in the iTRAQ DDA (A) and SWATH-MS (B) datasets. (C and D) Cluster stability assessed by ARI as a function of the percentage of quantified proteins in the iTRAQ DDA (C) and SWATH-MS (D) datasets. ∗p < 0.05, ∗∗p < 0.01, ∗∗∗∗p < 0.0001. n.s. non-significantThe sensitivity of molecular subtype clustering to the number of samples included in the analysis has been shown previously (Levine and Domany, 2001, Monti et al., 2003); however, this phenomenon has not been demonstrated in a systematic manner for proteomics data. Increasing the number of samples included in subtype classification analyses renders the classifications more robust, whereas increasing the numbers of proteins by employing proteomics methods such as 2DLC-MS/MS iTRAQ DDA does not provide an added benefit in terms of cluster stability. Our bootstrapping approach clearly demonstrated how easily classification results can be perturbed by experimental variables such as sample size, with a negative impact on the ability to extract robust biological content.Nevertheless, in our SWATH-MS analysis, we were able to extract one robust HGSOC subtype, which was characterized by the increased relative abundance of proteins with extracellular matrix functions (Figure S3A). The robustness of this presumptive Mesenchymal subtype as a function of sample number was demonstrated using different subsets of samples in the SWATH-MS dataset using a Fisher's exact test (Figure S3B). A fundamental premise of this stability analysis is that only the most robust information can be reproducibly extracted from the data. The emergence of the Mesenchymal phenotype as a robust subtype in the iTRAQ DDA and SWATH-MS data indicates that biologically relevant content can be robustly extracted from these orthogonal/complementary analytical methods. The proteins characterizing the Mesenchymal cluster in the SWATH data (86 proteins) and the iTRAQ DDA data (283 proteins) were enriched in extracellular matrix organization function, show an overlap of 84 proteins (Figure 5A), and have a significantly higher correlation of the relative abundance of constituent proteins compared with the proteins in the entire dataset of 1,599 proteins (Figure 5B, median ρ = 0.79 compared with Figure 2B, median ρ = 0.61). The relatively high correlation of the relative abundance of proteins comprising the Mesenchymal cluster is also evidenced by the overlap among the Mesenchymal subtype tumors (Figure 3A). These results suggest that the robustness of the clusters increases with increasing quantification accuracy rather than by increasing the number of quantified proteins.
Figure 5
Comparison of Proteins Comprising the Mesenchymal Subtype Tumors in the iTRAQ DDA and SWATH-MS Data
(A) High overlap of proteins characterizing the Mesenchymal subtype in the SWATH MS data and the iTRAQ DDA data.
(B) Strong positive correlation of 84 proteins commonly comprising the Mesenchymal subtype in the iTRAQ DDA and SWATH-MS data (median ρ = 0.79).
Comparison of Proteins Comprising the Mesenchymal Subtype Tumors in the iTRAQ DDA and SWATH-MS Data(A) High overlap of proteins characterizing the Mesenchymal subtype in the SWATH MS data and the iTRAQ DDA data.(B) Strong positive correlation of 84 proteins commonly comprising the Mesenchymal subtype in the iTRAQ DDA and SWATH-MS data (median ρ = 0.79).To determine whether the same proteins from the iTRAQ DDA and SWATH-MS datasets would be identified in a group comparison of Mesenchymal subtype samples versus the other samples, we used the sample subtype annotations from the Zhang et al. iTRAQ DDA study (Zhang et al., 2016a). The log2-fold changes of these proteins in the iTRAQ DDA and SWATH-MS datasets were determined as well as the associated p value (Figure 6). Proteins with significantly increased or decreased relative abundance (fold-change cut-off: 1.3; p < 0.05) are indicated in red or blue, respectively. The significantly upregulated proteins (Figures 6A and 6B, red dots) in these comparisons showed a high overlap with the 84 common proteins from the ECM protein module (Figure 6C) as defined by the Zhang et al. iTRAQ DDA study (Zhang et al., 2016a). Well-established cancer markers such as Fibronectin 1 and Thrombospondins that are known for contributing to the metastatic progression of tumors (Hu et al., 2017, Incardona et al., 1993, Kenny et al., 2014, Mitra et al., 2011, Ricciardelli et al., 2016) were among those proteins, confirming the characterization of these samples as Mesenchymal subtype tumors. Although there is a high overlap of these proteins between the iTRAQ DDA and SWATH datasets, 33 of the up-regulated proteins were not identified as such in the iTRAQ DDA dataset. Nevertheless, most of those 33 proteins are part of the extracellular region and exhibit enrichment in extracellular matrix organization as well. Also, of note, the quantitative dimension of iTRAQ DDA is narrower than that of the SWATH-MS data, which influences the extent of similarity among the proteins quantified by each method.
Figure 6
Group Comparison of Proteins in Mesenchymal Subtype versus the Non-mesenchymal Subtypes
Up-regulated proteins among the Mesenchymal subtype tumors are indicated in red and the down-regulated proteins are indicated in blue. The fold-change cutoff was 1.3, and the significance cutoff was p < 0.05.
(A) iTRAQ DDA data.
(B) SWATH-MS data.
(C) Comparison of all proteins assigned to the ECM module based on the original proteomic classification (Zhang et al., 2016a) versus proteins assigned to the ECM module in the iTRAQ DDA data and the SWATH-MS data.
Group Comparison of Proteins in Mesenchymal Subtype versus the Non-mesenchymal SubtypesUp-regulated proteins among the Mesenchymal subtype tumors are indicated in red and the down-regulated proteins are indicated in blue. The fold-change cutoff was 1.3, and the significance cutoff was p < 0.05.(A) iTRAQ DDA data.(B) SWATH-MS data.(C) Comparison of all proteins assigned to the ECM module based on the original proteomic classification (Zhang et al., 2016a) versus proteins assigned to the ECM module in the iTRAQ DDA data and the SWATH-MS data.Since the proteomic subtypes other than the Mesenchymal subtype identified in the CPTAC and TCGA studies could not be clearly distinguished by the classification analyses of the datasets used in this study, we performed a similar differential expression analysis on those groups using the 1,599 proteins that were quantified in the iTRAQ DDA and SWATH-MS datasets (Figure S5). As expected, the p values of those comparisons were considerably less significant than those resulting from the analyses of the Mesenchymal subtype. The Differentiated and Stromal subtypes did not lead to any conclusive results in these analyses based on the functional enrichment of the proteins with increased or decreased relative abundances, whereas the Proliferative and Immunoreactive subtypes were characterized by several differentially expressed proteins with rather high p values (low significance). The WGCNA analyses of the respective differentially abundant proteins identified protein modules related to immune response and gene expression and translation (Figure S4). However, those modules did not show high associations with any of the patient groups.
Homologous Recombination Deficiency-Related Proteomic Signature of HGSOC Identified in SWATH-MS Data
Previous studies have used gene expression and mutation profiles to characterize molecular subtypes of high-grade serous ovarian cancer to identify patients who respond well to poly-ADP ribose polymerase (PARP) inhibitor treatment (Lheureux et al., 2017, Tothill et al., 2008). Homologous recombination deficiency (HRD) is associated with a higher sensitivity toward PARP inhibitor treatment and therefore with a better prognosis for the respective patients. The initial iTRAQ DDA analysis of the HGSOC tumors resulted in the identification of a well-defined network of proteins with roles in histone acetylation that differentiated HRD from non-HRD tumors (Zhang et al., 2016a). Thus, we conducted an analysis to determine whether similar HRD-related features could be identified from the SWATH-MS data in the current study.A group comparison of HRD versus non-HRDpatients using mapDIA (Teo et al., 2015) revealed several differentially expressed proteins (Figure 7A). Because none of these proteins could be directly linked to DNA repair mechanism-related functions, we used a network propagation approach (Hofree et al., 2013) to extend the comparatively limited analytical depth of the SWATH measurements. As an input we used a network obtained from STRING (Szklarczyk et al., 2015) filtered for highly confident experimental evidence (physical interactions with a score higher than 800) with 4,424 nodes. Log2-fold changes obtained from the group comparison approach were mapped on this network, and these signals were propagated over the network to identify signals accumulating within a subnetwork. The top 5% (221 proteins) of the resulting positive and negative scores were used for further investigation and functional enrichment.
Figure 7
Group Comparison of Proteins in HRD versus Non-HRD Tumors Analyzed by SWATH-MS
(A) Relative abundance of up- (red) and down- (blue) regulated proteins in the HRD compared with the non-HRD tumors. The log2-fold change cutoff was 1.3 and the significance cutoff was p < 0.05.
(B) Functional enrichment analysis using STRING. Blue nodes: Chromosome organization proteins. Red nodes: DNA repair proteins. Blue and red nodes: Proteins with roles in chromosome organization and DNA repair.
Group Comparison of Proteins in HRD versus Non-HRD Tumors Analyzed by SWATH-MS(A) Relative abundance of up- (red) and down- (blue) regulated proteins in the HRD compared with the non-HRD tumors. The log2-fold change cutoff was 1.3 and the significance cutoff was p < 0.05.(B) Functional enrichment analysis using STRING. Blue nodes: Chromosome organization proteins. Red nodes: DNA repair proteins. Blue and red nodes: Proteins with roles in chromosome organization and DNA repair.Further verifying that proteins with roles in histone acetylation differentiate HRD from non-HRD tumors (Zhang et al., 2016a), DNA repair and chromosome organization were among the GO terms that were identified using the previously described network propagation approach based on functional enrichment analysis. The respective sub-network contained proteins previously identified as belonging to the HRD-associated protein network, including histone deacetylase 1 (HDAC1; Figure 7B) and a histone-binding protein RBBP7 (Zhang et al., 2016a) (Figure 7B). HDAC1 and RBBP7 are among the blue nodes representing chromosome organization proteins. Another protein identified in this sub-network was a DNA mismatch repair protein that is a well-known marker for ovarian cancer, MSH2 (Maresca et al., 2015, Stewart et al., 2017, Xiao et al., 2014, Zhao et al., 2018). Additionally, subunits of the tumor suppressor complex SWI/SNF, including ARID1A, SMARCC1, and SMARCA2, were among the proteins in the enriched chromosome organization network. The interaction of SWI/SNF with PARP and BRCA1 has been demonstrated previously by in vitro studies using yeast two-hybrid screening, affinity purification followed by western blotting, and co-immunoprecipitation purification followed by LC-MS/MS (Bochar et al., 2000, Harte et al., 2010, Hill et al., 2004). The identification of SWI/SNF complex components in the enriched chromosome organization network in our study and the results from in vitro studies indicating the interaction of these proteins with PARP and BRCA1 support the well-known role of these proteins in the control of homologous recombination during DNA repair.In summary, the analysis of the SWATH data was able to provide not only confirmation of some previously identified signatures from the iTRAQ DDA data in a larger untargeted network approach, but also important mechanistic insights into HRD-related pathways as discovered by iTRAQ labeling and LC-MS/MS using DDA (Zhang et al., 2016a).
Discussion
HGSOC is among the cancer types that have been proteogenomically characterized by CPTAC. The large-scale proteome analytical workflow employed by CPTAC uses isobaric tagging, offline peptide fractionation, and LC-MS/MS. This workflow is considered a reference method for the comparison of tissue protein relative abundance across large sample cohorts. However, recently, DIA proteomic methods exemplified by SWATH-MS have been developed, which are simpler, cheaper, and consume less sample than the reference method but provide shallower proteome coverage.Although HGSOC is the most common histological subtype of ovarian cancer, there is a considerable amount of tumor heterogeneity (Arend et al., 2018), thus underscoring the need for comprehensive characterization of the molecular subtypes of this lethal disease. Previous studies have shown that somatic mutations (Wiegand et al., 2010), genetic (Gates et al., 2010) and environmental risk factors (Goode et al., 2010), and the clinical response rates to platinum- or taxane-based therapy (Sugiyama et al., 2000) vary considerably among the HGSOC molecular subtypes. As such, elucidating the molecular characteristics of HGSOC could facilitate the development of more targeted and effective therapies (Leong et al., 2015, Vaughan et al., 2011).DDA mass spectrometry workflows have been employed to comprehensively characterize HGSOC (Coscia et al., 2016, Li et al., 2017a, Xie et al., 2017, Zhang et al., 2016a) with varying numbers of molecular subtypes emerging from these analyses. Functional genomic studies have identified various numbers of distinct HGSOC molecular subtypes with clinical relevance and pathways that are responsible for growth control in epithelial ovarian cancer (Cancer Genome Atlas Research, 2011, Tan et al., 2013). The discordance among these subtypes results from the varied sample sizes and analytical criteria used to conduct these studies (Helland et al., 2011, Tothill et al., 2008, Verhaak et al., 2013). As we show in this study, the discordance also results from the low robustness of observed patterns as a function of sample size and the numbers of protein used for the analysis. Our current study provides orthogonal evidence of the proteomic signatures of the HGSOC Mesenchymal subtype and the fidelity of the HGSOC Mesenchymal subtype, which is more sensitive to sample number compared with the number of quantified proteins. This has implications for the design of future large-scale clinical proteomic studies of cancer types where molecular subtyping is a predominant goal.In this study, we compared the results obtained by the reference large-scale DDA proteomic method versus SWATH-MS using aliquots of peptide samples generated for the CPTAC HGSOC study (Zhang et al., 2016a). The results indicate that iTRAQ DDA and SWATH-MS confidently identified differentially expressed proteins in enriched pathways associated with the Mesenchymal subtype of HGSOC tumors as evidenced by the (1) high degree of overlap of up-regulated extracellular matrix-related proteins (Figure 5A), (2) strongly positive median correlation (ρ = 0.79) among the proteins comprising this subtype from both proteomic workflows (Figure 5B), and (3) statistically significant stability of this subtype in the context of the number of tumors included in the clustering analysis (Figure S3B). The robustness of the Mesenchymal subtype with respect to molecular subtype cluster stability could be a signature of the decreased survival of patients whose tumors express this molecular signature compared with an Immunoreactive signature. A study conducted using a cohort of 174 HGSOC patients from the Mayo Clinic with long-term clinical follow-up observed statistically significantly worse survival of patients whose tumor samples expressed a Mesenchymal-like signature upon the analysis of a set of 1,850 genes (Konecny et al., 2014). A similar trend of worse survival for patients with Mesenchymal subtype tumors compared with those with Immunoreactive subtype tumors was observed upon the analysis of a separate cohort of 185 HGSOC patients (Bonome et al., 2008).In addition to its association with worse survival and low rates of resection, the Mesenchymal subtype of HGSOC is associated with a significant contribution from the cancer-associated stroma (Tothill et al., 2008, Zhang et al., 2019). Cancer-associated stroma comprises extracellular matrix proteins, tissue remodeling proteases, and other proteins that comprise a scaffold for tissue organization and integrity (Davidson et al., 2014). These proteins are highly abundant, which could explain why they are readily detectable across distinct analytical workflows resulting in the consistent classification of the Mesenchymal subtype.The notion of discrete HGSOC subtypes that are mutually exclusive is not universally accepted. Verhaak et al. suggested that an individual tumor could be represented by multiple signatures based on different levels of pathway activation (Verhaak et al., 2013). This concept has been supported by Konecny et al. who proposed a multi-dimensional approach to subtyping where molecular subtypes lie on a spectrum with partly overlapping causes (Konecny et al., 2014). Additional large-scale clinical proteomic studies of HGSOC tumors that are designed to address issues related to tumor heterogeneity would be beneficial in enhancing the resolution of molecular subtyping.Our SWATH-MS analysis confirmed the proteomic signature of HRD established by iTRAQ DDA analysis wherein a sub-network of BRCA1- or BRCA2-related proteins displayed co-expression patterns differentiating HRD from non-HRD patients (Zhang et al., 2016a). Many of the proteins in these identified modules have roles in histone acetylation or deacetylation. Inhibitors of PARP and histone deacetylase inhibitors have emerged as novel classes of anti-cancer drugs to treat HR-related ovarian cancer associated with BRCA1/2 mutations (Bryant et al., 2005, Farmer et al., 2005, Yano et al., 2018, Yuan et al., 2017). Thus, the proteomic characterization of HGSOC tumor tissue biopsies from patients who receive PARP inhibitor treatment could be beneficial in further elucidating the molecular mechanisms that are implicated in HRD.One of the strengths of the iTRAQ DDA workflow is the ability to achieve deep proteome coverage that exceeds that of the SWATH workflow by almost 3-fold. A total of 8,597 proteins were quantified by iTRAQ DDA compared with 2,914 proteins by SWATH-MS. These numbers represent the aggregate numbers of quantified proteins. However, after employing filtering strategies to restrict the data to only the proteins that were quantified across all 103 tumors for each workflow, these numbers decreased to 4,363 and 1,659, respectively. The group of 1,599 proteins that were quantified by both proteomic workflows was used to compare the performance of iTRAQ DDA and SWATH-MS.In addition to their analytical performance, there are also considerable differences in the resource characteristics of iTRAQ DDA and SWATH-MS, including sample requirement, sample throughput, and cost. Hence, it is clear that iTRAQ DDA for comprehensive proteome profiling is a substantially more resource-intense workflow compared with SWATH-MS.Based on the concordance between the iTRAQ DDA and SWATH-MS results that we have shown in this study, SWATH-MS, which is considerably less resource intense than iTRAQ DDA, can be reliably deployed in the proteomic analysis of clinical specimens. The clinical utility of future large-scale translational proteomic studies, regardless of the employed analytical methodology, can be strengthened by the use of large sample cohorts that have undergone comprehensive pathology review.Although our study was focused on the performance of SWATH-MS, other DIA methods are currently available. Our SWATH-MS analysis was conducted using a 5600+ TripleTOF mass spectrometer; however, a newer generation of this mass spectrometer platform exists with an increased linear dynamic range and enhanced detection system, which could result in improved instrument performance with respect to the number of quantifiable proteins per analytical run. A recent DIA study conducted by Bruderer et al. using a quadrupole ultra-high field Orbitrap mass spectrometer resulted in the identification of more than 6,000 proteins in human cell lines and more than 7,000 proteins in mouse tissues (Bruderer et al., 2017). Of note, the use of isobaric chemical labeling strategies, including 10- and 11-plex tandem mass tags, has greatly facilitated the multiplexing capabilities of large-scale clinical proteomic studies resulting in the reproducible quantification of >10,000 proteins (Krieger et al., 2019, Mertins et al., 2018).In contrast to the original CPTAC HGSOC study, we did not assess protein phosphorylation in the current study. However, it should be noted that SWATH/DIA-MS has been shown to be compatible with the analysis of protein phosphorylation patterns and specific software tools supporting such analyses have been developed (Rosenberger et al., 2017b).
Limitations of the Study
As expected, the development of new analytical instruments and methods often enables an expanded breadth and/or depth of analytical measurement. After the performance of these instruments and methods has been optimized and validated, the short-comings of previously existing methods become evident. It is as yet unknown whether SWATH-MS and other DIA proteomic methods will have an increased prevalence in clinical proteomic analyses. However, our current study provides compelling orthogonal evidence that SWATH-MS elucidates some of the common biological signatures of the Mesenchymal subtype of HGSOC.Novel biological insights beyond those that were initially gleaned from the iTRAQ DDA data (Zhang et al., 2016a) and confirmed by the SWATH-MS data could be obtained via additional bioinformatics and statistical analyses. However, the goal of this study, as reflected in the experimental design, was not to reveal novel biological insights, but rather to demonstrate that the SWATH-MS workflow presents novel opportunities to enhance the efficiency of clinical proteomic studies and to disseminate large-scale cancer proteomic studies beyond the realm of large research consortia, thus potentially accelerating the use of this powerful proteomic approach to cancer biology.
Methods
All methods can be found in the accompanying Transparent Methods supplemental file.