Lin Lu, Shawn H Sun, Hao Yang, Linning E, Pingzhen Guo, Lawrence H Schwartz, Binsheng Zhao.
Abstract
We investigated the performance of multiple radiomics feature extractors/software on predicting epidermal growth factor receptor mutation status in 228 patients with non-small cell lung cancer from publicly available data sets in The Cancer Imaging Archive. The imaging and clinical data were split into training (n = 105) and validation cohorts (n = 123). Two of the most cited open-source feature extractors, IBEX (1563 features) and Pyradiomics (1319 features), and our in-house software, Columbia Image Feature Extractor (CIFE) (1160 features), were used to extract radiomics features. Univariate and multivariate analyses were performed sequentially to predict EGFR mutation status using each individual feature extractor. Our univariate analysis integrated an unsupervised clustering method to identify nonredundant and informative candidate features for the creation of prediction models by multivariate analyses. In training, unsupervised clustering-based univariate analysis identified 5, 6, and 4 features from IBEX, Pyradiomics, and CIFE as candidate features, respectively. Multivariate prediction models using these features from IBEX, Pyradiomics, and CIFE yielded similar areas under the receiver operating characteristic curve of 0.68, 0.67, and 0.69. However, in validation, areas under the receiver operating characteristic curve of multivariate prediction models from IBEX, Pyradiomics, and CIFE decreased to 0.54, 0.56 and 0.64, respectively. Different feature extractors select different radiomics features, which leads to prediction models with varying performance. However, correlation between those selected features from different extractors may indicate these features measure similar imaging phenotypes associated with similar biological characteristics. Overall, attention should be paid to the generalizability of individual radiomics features and radiomics prediction models.
Keywords: EGFR; IBEX; NSCLC; Pyradiomics; Radiomics; TCIA
Year: 2020 PMID: 32548300 PMCID: PMC7289249 DOI: 10.18383/j.tom.2020.00017
Source DB: PubMed Journal: Tomography ISSN: 2379-1381
Figure 1. Study design diagram. The design consists of 4 modules. First, the NSCLC Radiogenomics and The Cancer Genome Atlas-Lung Adenocarcinoma (TCGA-LUAD)/TCGA-Lung Squamous Cell Carcinoma (TCGA-LUSC) projects were obtained from The Cancer Imaging Archive (TCIA) and split into a homogeneous training cohort and a heterogeneous validation cohort. Second, features were extracted from all imaging cases using 3 different feature extractors: IBEX, Pyradiomics, and CIFE. Third, univariate and multivariate analyses were conducted sequentially on the features from each extractor to create prediction models for epidermal growth factor receptor (EGFR) mutation status. “x3” means that the univariate and multivariate analyses were performed identically 3 times, once each with the features from IBEX, Pyradiomics, and CIFE. Finally, the best classifier models and optimal features were compared among the 3 extractors.
¹ NSCLC Radiogenomics was produced by Bakr et al. with 211 patients with NSCLC from Stanford University School of Medicine and the Palo Alto Veterans Affairs Healthcare System.
² TCGA-LUSC and TCGA-LUAD are projects of TCGA, consisting of lung squamous cell carcinoma and lung adenocarcinoma cases, respectively. Imaging is available from 5 centers in the United States (Washington University, University of Pittsburgh, UNC, Roswell Park, and Lahey Health Home).
Patient Characteristics in Training and Validation Cohorts
| Characteristic | Training Cohort | Validation Cohort | P-value |
|---|---|---|---|
| Subjects (N) | 105 | 123 | |
| Age (Years) | 67.96 ± 8.9 | 67.92 ± 10.77 | .98 |
| Sex | | | .74 |
| Female | 40 (38%) | 45 (37%) | |
| Male | 65 (62%) | 55 (45%) | |
| Unknown | 0 (0%) | 23 (19%) | |
| Histology | | | <.001 |
| Adenocarcinoma | 92 (88%) | 84 (68%) | |
| Squamous Cell Carcinoma | 11 (10%) | 38 (31%) | |
| NOS (Not Otherwise Specified) | 2 (2%) | 1 (1%) | |
| Stage | | | .39 |
| Unknown | 22 (21%) | 34 (28%) | |
| 0 | 1 (1%) | 2 (2%) | |
| I | 49 (46%) | 40 (32%) | |
| II | 18 (17%) | 24 (20%) | |
| III | 13 (12%) | 18 (14%) | |
| IV | 2 (2%) | 4 (3%) | |
| EGFR Mutation | | | .054 |
| EGFR-Mutant | 27 (26%) | 18 (15%) | |
| EGFR-Wildtype | 78 (74%) | 105 (85%) | |
P-value: chi-square test for categorical data and t test for continuous data.
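As a rough check on the reported P-values, the EGFR mutation comparison can be recomputed from the counts in the table. The sketch below assumes a Pearson chi-square test on the 2 × 2 table with a Yates continuity correction; the source does not say whether a correction was applied, so this is an assumption, and the function name `chi_square_2x2` is purely illustrative.

```python
import math

def chi_square_2x2(table, yates=True):
    """Pearson chi-square test for a 2x2 contingency table (1 degree of freedom)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed - expected)
            if yates:
                # Continuity correction (assumption: not stated in the source).
                diff = max(diff - 0.5, 0.0)
            chi2 += diff ** 2 / expected
    # With 1 degree of freedom, the chi-square tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Rows: training, validation. Columns: EGFR-mutant, EGFR-wildtype (counts from the table).
egfr_counts = [[27, 78], [18, 105]]
chi2, p = chi_square_2x2(egfr_counts)
print(f"chi2 = {chi2:.2f}, P = {p:.3f}")
```

Under the continuity-correction assumption the result lands close to the reported P = .054; passing `yates=False` gives a smaller P-value, which suggests (but does not prove) that a corrected test was used.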
Nonredundant and Informative Features from Each Feature Extractor
| Feature Name | Univariate Analysis (AUC) |
|---|---|
| IBEX | |
| 135-1Correlation | 0.74 |
| LocalRangeStd | 0.72 |
| 1GaussAmplitude | 0.66 |
| VoxelSize | 0.62 |
| -333-4ClusterShade | 0.62 |
| Pyradiomics | |
| log-σ-2-0-mm-3D_firstorder_Minimum | 0.72 |
| log-σ-2-0-mm-3D_glszm_SizeZoneNonUniformityNormalized | 0.70 |
| log-σ-2-0-mm-3D_glcm_InverseVariance | 0.68 |
| wavelet-LHL_firstorder_Skewness | 0.67 |
| wavelet-LHH_firstorder_Skewness | 0.65 |
| wavelet-HHH_glszm_SmallAreaEmphasis | 0.65 |
| CIFE | |
| DWF_Z_H | 0.72 |
| Intensity_Minimum | 0.71 |
| Gabor_Max_Z | 0.68 |
| Intensity_Skewness | 0.65 |
Optimal features are listed for each individual extractor. These features are then used to build prediction models in the multivariate analysis. Each listed feature has a correlation coefficient < 0.2 with the other selected features (nonredundant) and a univariate AUC > 0.6 (informative).
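The two thresholds in the note above (AUC > 0.6, correlation coefficient < 0.2) can be illustrated with a small filter. This is a minimal sketch, not the authors' method: it swaps the unsupervised clustering step for a simpler greedy pass, uses Pearson correlation as an assumed correlation measure, and runs on made-up toy data. All names (`select_features`, `texture_a`, and so on) are hypothetical.

```python
import math

def auc(scores, labels):
    """AUC via the Mann-Whitney statistic: P(positive score > negative score), ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def pearson(x, y):
    """Pearson correlation coefficient (assumes neither input is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def select_features(features, labels, auc_min=0.6, corr_max=0.2):
    """Greedy filter: rank by univariate AUC, keep a feature only if it is
    weakly correlated with every feature already kept."""
    ranked = []
    for name, values in features.items():
        a = auc(values, labels)
        a = max(a, 1.0 - a)  # direction-agnostic: inverse associations also count
        if a > auc_min:
            ranked.append((a, name))
    ranked.sort(reverse=True)
    selected = []
    for _, name in ranked:
        if all(abs(pearson(features[name], features[s])) < corr_max for s in selected):
            selected.append(name)
    return selected

# Toy data (invented for illustration): 4 mutant (1) and 4 wildtype (0) cases.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_features = {
    "texture_a": [5, 6, 7, 8, 1, 2, 3, 4],        # informative
    "texture_a_dup": [5, 6, 7, 2.5, 1, 2, 3, 4],  # informative but redundant with texture_a
    "shape_b": [13, 10, 7, 4, 9, 6, 3, 0],        # informative, weakly correlated with texture_a
    "noise_c": [1, 4, 2, 5, 0, 3, 3, 6],          # uninformative (AUC 0.5)
}
print(select_features(toy_features, toy_labels))
```

On this toy set the filter keeps `texture_a` and `shape_b`: `texture_a_dup` is dropped as redundant and `noise_c` as uninformative. A clustering-based approach, as described in the abstract, would group correlated features first and then pick a representative from each group.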
Performance of Multivariate Models from Each Feature Set in the Training and Validation Cohorts
| Classification Algorithm | IBEX (T) | IBEX (V) | Pyradiomics (T) | Pyradiomics (V) | CIFE (T) | CIFE (V) |
|---|---|---|---|---|---|---|
| KNN | 0.62 | 0.54 | 0.66 | 0.54 | 0.67 | 0.60 |
| SVM | 0.59 | 0.48 | 0.57 | 0.52 | 0.60 | 0.51 |
| Random Forests | 0.68 | 0.54ᵃ | 0.67 | 0.56ᵃ | 0.68 | 0.64ᵃ |
| Bagging | 0.66 | 0.53 | 0.67 | 0.53 | 0.69 | 0.63 |
ᵃ Optimal model based on validation performance from each feature set, which was random forest for all extractors. T and V denote the training and validation cohorts, respectively. All performance values are AUCs.
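The model choice described in the footnote, and the drop from training to validation, can be recomputed from the table. The sketch below transcribes the AUC values and picks the algorithm with the highest validation AUC per extractor. For IBEX, KNN and random forests tie at 0.54 in validation; breaking the tie by training AUC is an assumption made here, since the source does not state a tie-breaking rule.

```python
# AUC values transcribed from the table: algorithm -> extractor -> (training, validation).
results = {
    "KNN": {"IBEX": (0.62, 0.54), "Pyradiomics": (0.66, 0.54), "CIFE": (0.67, 0.60)},
    "SVM": {"IBEX": (0.59, 0.48), "Pyradiomics": (0.57, 0.52), "CIFE": (0.60, 0.51)},
    "Random Forests": {"IBEX": (0.68, 0.54), "Pyradiomics": (0.67, 0.56), "CIFE": (0.68, 0.64)},
    "Bagging": {"IBEX": (0.66, 0.53), "Pyradiomics": (0.67, 0.53), "CIFE": (0.69, 0.63)},
}

def best_model(extractor):
    """Highest validation AUC wins; ties are broken by training AUC (assumption)."""
    return max(
        results,
        key=lambda alg: (results[alg][extractor][1], results[alg][extractor][0]),
    )

def auc_drop(algorithm, extractor):
    """Training AUC minus validation AUC, rounded to the table's precision."""
    train_auc, val_auc = results[algorithm][extractor]
    return round(train_auc - val_auc, 2)

for extractor in ("IBEX", "Pyradiomics", "CIFE"):
    alg = best_model(extractor)
    print(f"{extractor}: {alg}, AUC drop from training to validation = {auc_drop(alg, extractor):.2f}")
```

With these values random forests is selected for all three extractors, and the training-to-validation drop is largest for IBEX and smallest for CIFE, which is the generalizability gap the abstract draws attention to.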