Thilo Bracht, Daniel Kleefisch, Karin Schork, Kathrin E Witzke, Weiqiang Chen, Malte Bayer, Jan Hovanec, Georg Johnen, Swetlana Meier, Yon-Dschun Ko, Thomas Behrens, Thomas Brüning, Jana Fassunke, Reinhard Buettner, Julian Uszkoreit, Michael Adamzik, Martin Eisenacher, Barbara Sitek.
Abstract
Chronic obstructive pulmonary disease (COPD) is a major risk factor for the development of lung adenocarcinoma (AC). AC often develops on underlying COPD; thus, the differentiation of both entities by biomarkers is challenging. Although survival of AC patients strongly depends on early diagnosis, a biomarker panel for AC detection and differentiation from COPD is still missing. Plasma samples from 176 individuals, comprising AC patients with or without underlying COPD, COPD patients, and hospital controls, were analyzed using mass-spectrometry-based proteomics. We performed univariate statistics and additionally evaluated machine learning algorithms regarding the differentiation of AC vs. COPD and AC with COPD vs. COPD. Univariate statistics revealed proteins that were significantly regulated between the patient groups. Furthermore, random forest classification yielded the best performance for differentiation of AC vs. COPD (area under the curve (AUC) 0.935) and AC with COPD vs. COPD (AUC 0.916). The most influential proteins were identified by permutation feature importance and compared to those identified by univariate testing. We demonstrate the great potential of machine learning for differentiation of highly similar disease entities and present a panel of biomarker candidates that should be considered for the development of a future biomarker panel.
Keywords: Ig kappa light chain; SAA1; SERPINA3; artificial intelligence; lung cancer; machine learning; plasma proteomics; random forest
Year: 2022 PMID: 36232544 PMCID: PMC9569607 DOI: 10.3390/ijms231911242
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 6.208
Figure 1. Schematic representation of the project workflow. Plasma samples from n = 176 patients were analyzed in two batches using LC–MS/MS. Proteins were quantified label-free and the resulting intensities were normalized to account for batch effects. Normalized intensities were analyzed by univariate statistics and machine learning approaches. Five different machine learning algorithms were compared using the same modeling pipeline.
Figure 2. Normalization and statistical analysis of proteomics data. (A) Principal component analysis (PCA) plots of label-free LC–MS/MS data before and after batch normalization. Each data point corresponds to a sample measured in either batch 1, batch 2, or both batches (colors representing patient groups). (B) Volcano plots representing the results of statistical analysis using Welch-ANOVA. Significant proteins are highlighted in red (ANOVA pFDR-value ≤ 0.05 (corrected according to Benjamini–Hochberg); post hoc pFDR-value ≤ 0.05 (corrected according to Bonferroni–Holm); absolute ratio of means ≥ 1.5) and labeled with gene names (except Igκ Chain).
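The significance criteria named in the caption can be illustrated with a minimal sketch. It is not the authors' code: the function names are hypothetical, only the Benjamini–Hochberg adjustment is implemented (the Bonferroni–Holm post hoc correction is assumed to have been applied upstream), and "absolute ratio of means ≥ 1.5" is read here as a change of at least 1.5-fold in either direction.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def is_significant(anova_p_fdr, posthoc_p_fdr, ratio_of_means,
                   alpha=0.05, min_ratio=1.5):
    """Apply the three caption criteria to one protein group.

    Assumption: the ratio of means is positive, and a ratio of 1/1.5 or
    lower counts as a 1.5-fold change in the opposite direction.
    """
    fold = max(ratio_of_means, 1.0 / ratio_of_means)
    return (anova_p_fdr <= alpha
            and posthoc_p_fdr <= alpha
            and fold >= min_ratio)
```

A protein group with adjusted p-values of 0.01 and 0.02 and a ratio of means of 0.5 would pass under this reading, whereas the same p-values with a ratio of 1.2 would not.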
Significantly differentially abundant protein groups.
| Comparison | Protein Groups Considered for Statistical Testing 1 | Significantly Differentially Abundant Protein Groups 2 | Higher Abundance in Condition A | Higher Abundance in Condition B |
|---|---|---|---|---|
| AC with COPD vs. COPD | 325 | | 3 | 8 |
| AC w/o COPD vs. COPD | 349 | | 6 | 5 |
| AC w/o COPD vs. | 324 | | 3 | 0 |
| AC with COPD vs. Control | 271 | | 14 | 25 |
| AC w/o COPD vs. Control | 278 | | 9 | 17 |
| COPD vs. Control | 283 | | 14 | 18 |
1 Protein groups were filtered for a minimum of five valid quantifications per patient group. 2 Significance filter criteria: ANOVA pFDR and post hoc pFDR-values ≤ 0.05; absolute RoM ≥ 1.5.
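The valid-value filter in footnote 1 can be sketched as follows. This is an illustration under stated assumptions: missing quantifications are assumed to be encoded as `None` or NaN, and the function names and the dictionary layout (patient group name mapped to a list of intensities) are hypothetical.

```python
import math


def count_valid(values):
    """Count quantifications that are neither None nor NaN."""
    return sum(1 for v in values if v is not None and not math.isnan(v))


def passes_valid_value_filter(intensities_by_group, min_valid=5):
    """True if every patient group has at least `min_valid` valid values."""
    return all(count_valid(values) >= min_valid
               for values in intensities_by_group.values())
```

Because the filter is evaluated per comparison, the number of protein groups considered can differ between comparisons, as the table shows.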
Figure 3. Hierarchical cluster analysis and two-group comparison of proteomics data. Heatmap illustrating hierarchical cluster analysis (distance based on Pearson’s correlation, complete linkage) considering 42 proteins, which passed a pFDR-value threshold ≤ 0.05 calculated using Welch-ANOVA for comparisons between either AC with or without COPD vs. COPD.
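The clustering settings in the caption (correlation-based distance, complete linkage) can be illustrated with a small standard-library sketch. It assumes the distance is 1 − r with r being Pearson's correlation, requires non-constant profiles, and stops at a fixed number of clusters instead of building the full dendrogram shown in a heatmap; the function names are hypothetical.

```python
import math


def pearson_distance(x, y):
    """Distance 1 - r, where r is Pearson's correlation of two profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def complete_linkage(profiles, n_clusters):
    """Agglomerative clustering; merges until `n_clusters` clusters remain."""
    n = len(profiles)
    clusters = [[i] for i in range(n)]
    dist = {(i, j): pearson_distance(profiles[i], profiles[j])
            for i in range(n) for j in range(i + 1, n)}

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(dist[(min(i, j), max(i, j))] for i in a for j in b)

    while len(clusters) > n_clusters:
        pairs = [(linkage(clusters[p], clusters[q]), p, q)
                 for p in range(len(clusters))
                 for q in range(p + 1, len(clusters))]
        _, p, q = min(pairs)  # merge the closest pair of clusters
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]
```

With a correlation-based distance, two proteins cluster together when their abundance profiles rise and fall across the same samples, regardless of absolute intensity.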
Figure 4. Results of machine learning approaches. (A) Five machine learning approaches were compared for classification of AC vs. COPD (left panel) and AC with COPD vs. COPD (right panel). Different p-value thresholds were assessed for feature selection and plotted against the respective ten-times-repeated 10-fold-cross-validated AUCs. (B) Receiver operating characteristic (ROC) curves for the best-performing random forest classifiers (AC vs. COPD: p-value threshold = 0.2; AC with COPD vs. COPD: no p-value threshold). (C) Feature importance plots illustrating the relative influence of individual proteins on multivariate classification models. Proteins are represented by gene names (except Igκ Chain).
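The evaluation scheme behind panel (A), ten-times-repeated 10-fold cross-validation scored by AUC, can be sketched as follows. The sketch makes two assumptions the caption does not state: folds are stratified by class, and labels are coded as 1 (positive) and 0 (negative). It only generates the splits and computes an AUC; the classifier itself is left out, and the function names are hypothetical.

```python
import random


def repeated_stratified_folds(labels, n_folds=10, n_repeats=10, seed=0):
    """Yield (repeat, fold, test_indices) with classes spread across folds."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        folds = [[] for _ in range(n_folds)]
        position = 0
        for cls in sorted(set(labels)):
            members = [i for i, y in enumerate(labels) if y == cls]
            rng.shuffle(members)
            # Deal samples round-robin so each fold gets a similar class mix.
            for idx in members:
                folds[position % n_folds].append(idx)
                position += 1
        for fold, test_idx in enumerate(folds):
            yield repeat, fold, sorted(test_idx)


def auc(scores, labels):
    """AUC as the probability that a positive outranks a negative.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((1.0 if p > q else (0.5 if p == q else 0.0))
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Each repeat reshuffles the samples, so the 100 resulting test folds yield 100 AUC estimates whose average is less dependent on any single partition.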
Characteristics of random forest classifiers with the highest AUCs *.
| | AC vs. COPD 1 | AC with COPD vs. COPD 2 |
|---|---|---|
| AUC | 0.935 | 0.916 |
| | 0.928 | 0.882 |
| | 0.865 | 0.873 |
| | 0.848 | 0.570 |
| | 0.879 | 0.965 |
* Ten-times-repeated 10-fold cross-validated. 1 For feature selection, a p-value threshold of 0.2 was used. 2 No p-value threshold was applied for feature selection. 3 Sensitivity corresponds to true classification of AC. 4 Specificity corresponds to true classification of COPD. 5 Sensitivity corresponds to true classification of AC with COPD.
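Footnotes 3 to 5 fix which class counts as positive when sensitivity and specificity are reported. A minimal sketch of that convention follows, with AC (or AC with COPD) as the positive class and COPD as the negative class; the function name and the string labels are hypothetical.

```python
def sensitivity_specificity(y_true, y_pred, positive="AC"):
    """Return (sensitivity, specificity) with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # Sensitivity: share of true AC samples classified as AC.
    # Specificity: share of true COPD samples classified as COPD.
    return tp / (tp + fn), tn / (tn + fp)
```

With unequal group sizes, a classifier can reach a high specificity while missing many positive samples, so the two values are best read together.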
Metrics for train-test-split validation for the comparison AC vs. COPD.
| | | Minimum 1 | Mean 1 | Maximum 1 |
|---|---|---|---|---|
| | Train 2 | 0.85 | 0.901 | 0.965 |
| | Test 3 | 0.667 | 0.823 | 0.936 |
| | Train 2 | 0.763 | 0.864 | 0.968 |
| | Test 3 | 0.554 | 0.766 | 0.931 |
| | Train 2 | 0.76 | 0.831 | 0.91 |
| | Test 3 | 0.65 | 0.753 | 0.85 |
| | Train 2 | 0.726 | 0.815 | 0.905 |
| | Test 3 | 0.5 | 0.763 | 1 |
| | Train 2 | 0.759 | 0.844 | 0.941 |
| | Test 3 | 0.455 | 0.745 | 1 |
1 Model metrics represent 50 repetitions of random train-test-splits. 2 Model built on train set with ten-times-repeated 10-fold cross-validation. 3 Model validated on test set representing 1/3 of the whole dataset.
Figure 5. Results of intra-set validation for the comparison AC vs. COPD. The dataset was randomly split into train and test sets for 50 repetitions. The random forest model was developed with a ten-times-repeated 10-fold-cross-validation on the train set and validated on the test set. (A) Top 10 list of the most frequently selected features (i.e., proteins, represented by gene names (except Igκ Chain)). (B) Characteristics of the random forest classifier for cross-validation on the train sets (black) and validation on the test sets (red), respectively, were plotted against the number of repetitions.
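The intra-set validation scheme (50 random splits, one third of the data held out as the test set) can be sketched as follows. The sketch assumes plain random splitting without stratification, which the caption does not specify, and omits model fitting; the function names are hypothetical. The `summarize` helper mirrors the minimum, mean, and maximum reported across repetitions.

```python
import random


def random_train_test_splits(n_samples, test_fraction=1 / 3,
                             repetitions=50, seed=0):
    """Yield (train_indices, test_indices) for repeated random splits."""
    rng = random.Random(seed)
    n_test = round(n_samples * test_fraction)
    for _ in range(repetitions):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # The first n_test shuffled indices form the held-out test set.
        yield sorted(indices[n_test:]), sorted(indices[:n_test])


def summarize(values):
    """Return (minimum, mean, maximum) of a metric across repetitions."""
    return min(values), sum(values) / len(values), max(values)
```

In use, a model would be fitted on each train set and scored on the matching test set, and the per-repetition metrics passed to `summarize`.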
Composition of the analyzed patient cohorts.
| Group | Description | Mean Age (Years) | Sex | Smoking Behavior |
|---|---|---|---|---|
| AC w/o COPD | AC-patients without diagnosed COPD | 67.17 ± 9.43, | 25 female, | 20 smokers, |
| AC with COPD | AC-patients with diagnosed COPD | 64.48 ± 8.89, | 12 female, | 11 smokers, |
| COPD § | COPD-patients without AC | 68.61 ± 10.43, | 36 female, | 36 smokers, |
| HC | Hospital controls | 65.34 ± 12.40 | 16 female, | 14 smokers, |
* Lung adenocarcinoma. § Chronic obstructive pulmonary disease.