Literature DB >> 34244594

Impact of rescanning and repositioning on radiomic features employing a multi-object phantom in magnetic resonance imaging.

Simon Bernatz1,2,3, Yauheniya Zhdanovich4, Jörg Ackermann4, Ina Koch4, Peter J Wild5,6, Daniel Pinto Dos Santos7, Thomas J Vogl8, Benjamin Kaltenbach8, Nicolas Rosbach8.   

Abstract

Our purpose was to analyze the robustness and reproducibility of magnetic resonance imaging (MRI) radiomic features. We constructed a multi-object fruit phantom to perform MRI acquisition as scan-rescan using a 3 Tesla MRI scanner. We applied T2-weighted (T2w) half-Fourier acquisition single-shot turbo spin-echo (HASTE), T2w turbo spin-echo (TSE), T2w fluid-attenuated inversion recovery (FLAIR), T2 map and T1-weighted (T1w) TSE. Images were resampled to isotropic voxels. Fruits were segmented. The workflow was repeated by a second reader and the first reader after a pause of one month. We applied PyRadiomics to extract 107 radiomic features per fruit and sequence from seven feature classes. We calculated concordance correlation coefficients (CCC) and dynamic range (DR) to obtain measurements of feature robustness. Intraclass correlation coefficient (ICC) was calculated to assess intra- and inter-observer reproducibility. We calculated Gini scores to test the pairwise discriminative power specific for the features and MRI sequences. We depict Bland Altmann plots of features with top discriminative power (Mann-Whitney U test). Shape features were the most robust feature class. T2 map was the most robust imaging technique (robust features (rf), n = 84). HASTE sequence led to the least amount of rf (n = 20). Intra-observer ICC was excellent (≥ 0.75) for nearly all features (max-min; 99.1-97.2%). Deterioration of ICC values was seen in the inter-observer analyses (max-min; 88.7-81.1%). Complete robustness across all sequences was found for 8 features. Shape features and T2 map yielded the highest pairwise discriminative performance. Radiomics validity depends on the MRI sequence and feature class. T2 map seems to be the most promising imaging technique with the highest feature robustness, high intra-/inter-observer reproducibility and most promising discriminative power.
© 2021. The Author(s).

Entities:  

Year:  2021        PMID: 34244594      PMCID: PMC8271025          DOI: 10.1038/s41598-021-93756-x

Source DB:  PubMed          Journal:  Sci Rep        ISSN: 2045-2322            Impact factor:   4.379


Introduction

Diagnostic radiology is based on visual-semantic reporting[1]. Radiomics describes quantitative computational data analysis by transforming images into mineable data[1]. It is hypothesized that imaging data exists beyond visual perception which can be extracted to build imaging phenotypes, leading the way to non-invasive precision medicine[1-3]. A radiomics pipeline consists of specific steps[1,2,4]: (I) image acquisition and reconstruction, (II) preprocessing and segmentation of volumes of interest (VOI), (III) radiomic features extraction, (IV) statistical analysis with clinical and biological data, and (V) model development applying machine learning algorithms. Each step is prone to bias[1,5-7]. Currently there are increasing concerns about the robustness, validity and interpretability of radiomics research[5,6,8-10]. Multicenter studies deal with multiple imaging scanners and vendors with various protocols of acquisition and reconstruction[4,7,10]. There is no uniform recommendation for image pre-processing[8,11]. Image segmentation is prone to inter-observer variance[12]. Feature extraction and definition can be highly variable as research groups may use house-build software, making reproducibility and comparability of data nearly impossible[5,6,10]. Therefore, application of open-source implementations like PyRadiomics is highly recommended[5,6,8,13]. Following features extraction, numerous ways exist to reduce feature dimensionality and to build predictive models[13,14]. The image biomarker standardization initiative (IBSI) works towards standardizing the methodology[11]. Furthermore, radiomic features may not provide unique and independent information but are prone to redundancy[15]. An increasing number of studies addresses potential weaknesses of radiomics research[5,6,8,9,14]. Welch et al.[6] have demonstrated that the signature features studied in a groundbreaking work of Aerts et al.[3] might have been surrogates of tumor volume. Schwier et al. have emphasized the need of highly transparent reporting of methodology[8]. Schwier et al. have shown that the methods of image preprocessing and feature extraction highly influence the repeatability of radiomic features[8]. They have urged caution in the interpretation of radiomics studies[8]. There is ongoing debate concerning the repeatability and robustness of radiomic features[5,9,16-18]. Baeßler et al. have constructed a multi-object phantom to acquire test–retest data using three sequences and two matrix sizes to investigate the repeatability and robustness of MRI radiomic features[9]. Matrix size has not impacted repeatability and fluid-attenuated inversion recovery (FLAIR) provided the highest amount of robust features[9]. In total, 45 features have been extracted with one third having been robust across all sequences[9]. Those features have been proposed to be reliably used in future clinical studies[9]. The aim of our study was to replicate parts of the study design of Baeßler et al.[9] with novelty given by a different selection of sequences, inclusion of T2 mapping, extraction of more radiomic features and we performed discriminatory analyses of phantom-components. We aimed to tackle the analyzes of robustness and reproducibility of radiomic MRI features of Baeßler et al.[9] in another institute and with a different MR scanner to obtain temporal and geographical external validation. We applied the supposed reference software package PyRadiomics[19] to extract the quantitative imaging features.

Results

Robustness of features depends on the feature class

Figure 1 shows the fractions of robust features (CCC & DR ≥ 0.90, red) specific for the classes of features for the combined MRI sequences. Among the seven classes, shape has the highest fraction of 86.15% robust features. A fraction of 32.22% first-order features is robust. The least fraction of 28% robust features has class ngtdm. For all feature classes, the fraction of robust features rapidly decreases for increasing levels of robustness from relaxed (CCC & DR ≥ 0.85) to strict (CCC & DR ≥ 0.95).
Figure 1

Feature class impacts the amount of robust features. Concordance correlation coefficient (CCC) and dynamic range (DR) values were computed for each feature. Results depict the combined mean values of dedicated CCC and DR analysis for each acquired MRI sequence plotted for each feature class. Excellent robustness was defined as CCC & DR ≥ 0.90 (red).

Feature class impacts the amount of robust features. Concordance correlation coefficient (CCC) and dynamic range (DR) values were computed for each feature. Results depict the combined mean values of dedicated CCC and DR analysis for each acquired MRI sequence plotted for each feature class. Excellent robustness was defined as CCC & DR ≥ 0.90 (red).

T2 map yields the highest fraction of robust features

We stratified the fraction of robust features per MRI sequence and feature class, see Fig. 2A, B. T2 map yielded the highest fraction of 82.03% robust features (CCC & DR ≥ 0.90, red, Fig. 2A). Features of T2 map (dark violet bars, Fig. 2B) were robust in 100%, 87.50%; and 83.33% of the cases for glcm glszm, and first order, respectively. For the other classes, the fraction of robust features of T2 map ranged from 72.43% (gldm) to 80% (ngtdm). Shape feature class was the only feature class where T2 map did not yield the top result among the sequences with 77.00%. FLAIR was the sequence with the second highest fraction of 50.98% robust features (CCC & DR ≥ 0.90, red, Fig. 2A). FLAIR (green bars, Fig. 2B) obtained its highest fraction of robust features (75%) for glcm and glrlm, its lowest fraction (20%) for ngtdm. Compared to the other sequences, FLAIR had the least fraction (53.85%) of robust shape features. T2w TSE, T1w TSE, and HASTE (dark blue, red, light blue, Fig. 2B) yielded 100% robust shape features. Figure 2A shows the rapid decline of the fractions of robust features per sequence for increasing levels of robustness from relaxed (CCC & DR ≥ 0.85) to strict (CCC & DR ≥ 0.95). The results emphasize that, T2 map has advantages for all classes beside shape.
Figure 2

Impact of MRI sequences on the amount of robust features. Concordance correlation coefficient (CCC) and dynamic range (DR) values were computed for each feature and depicted for each MRI sequence (A) and feature class (B). The fraction of features decreases reciprocally to higher levels of robustness (CCC & DR ≥ 0.85; ≥ 0.90; ≥ 0.95) with T2 map revealing highest stability (A). We depict the distribution of excellently robust (CCC & DR ≥ 0.90) features in B. T2 map yields the highest fraction of robust features (A, B).

Impact of MRI sequences on the amount of robust features. Concordance correlation coefficient (CCC) and dynamic range (DR) values were computed for each feature and depicted for each MRI sequence (A) and feature class (B). The fraction of features decreases reciprocally to higher levels of robustness (CCC & DR ≥ 0.85; ≥ 0.90; ≥ 0.95) with T2 map revealing highest stability (A). We depict the distribution of excellently robust (CCC & DR ≥ 0.90) features in B. T2 map yields the highest fraction of robust features (A, B).

Observer performance has excellent reproducibility

The left part of Fig. 3A shows box-and-whisker plots of intra-observer ICCs of the features specific for the MRI sequences. The median values are excellent for all sequences, with a minimum median value of 0.94 and a maximum median value of 0.98 for HASTE and T2w TSE, respectively. Outlier values drop down to the minimum of 0.66 for T2w TSE. The right part of Fig. 3A shows box-and-whisker plots of intra-observer ICCs specific for the feature class. The median values of intra-observer ICC are excellent (≥ 0.95) for each feature class. Feature class shape shows preferable high median with small interquartile range. Left part of Fig. 3B shows corresponding box-and-whisker plots of inter-observer ICCs. As for intra-observer ICCs, the median values are excellent for all sequences, with a minimum median value of 0.83 and a maximum median value of 0.90 for HASTE and T2w TSE, respectively. Outlier values, however, drop down to values even below zero. Right part of Fig. 3B shows corresponding box-and-whisker plots of inter-observer ICCs specific for the feature class. The median values of inter-observer ICCs are excellent (≥ 0.95) for each feature class.
Figure 3

Inter-observer variance highly influences shape features. Box-Whisker plots for intraclass correlation coefficients are depicted (5–95 percentile) to visualize intra- (A) and inter-observer (B) reproducibility. To comprehensively visualize the effect of each feature, we performed single feature analysis with regard to the MRI sequence (left part) and feature class (right part) with outliers being depicted as dots (A, B).

Inter-observer variance highly influences shape features. Box-Whisker plots for intraclass correlation coefficients are depicted (5–95 percentile) to visualize intra- (A) and inter-observer (B) reproducibility. To comprehensively visualize the effect of each feature, we performed single feature analysis with regard to the MRI sequence (left part) and feature class (right part) with outliers being depicted as dots (A, B).

T2 map inherits the highest robustness and reproducibility of features

We stratified feature subcohorts (CCC ≥ 0.90 & intra-/inter-ICC ≥ 0.75) to propose feature sets for each MRI imaging technique with excellent levels of robustness and reproducibility (Supplementary Table 1, Table A.1). T2 map yielded the highest number of robust and reproducible features (rrf, n = 84, Table 1). FLAIR was the second highest ranked MRI sequence (rrf, n = 59). The further MRI sequences revealed rrf of 20, 26, 29 for HASTE, T2w TSE and T1w TSE, respectively (Supplementary Table 1, Table A.1). A set of eight features was found to be robust and reproducible across all MRI sequences (Table 2). Seven out of these eight features were part of the shape feature class and the further remaining feature was Imc1 of glcm feature class (Table 2). Intra-observer ICC, inter-observer ICC, CCC, and DR for each feature are visualized in supplementary Fig. 1 (Fig. B.1). Supplementary Fig. 2 (Fig. C.1) shows the Bland Altmann plots for the set of the eight features that were robust and reproducible across all MRI sequences.
Table 1

T2 map—robust and reproducible features.

FeaturesCCCDRIntra-observer ICCInter-observer ICC
shape_Maximum3DDiameter0.992052860.962516650.997166640.9867156
shape_MajorAxisLength0.997934190.979813610.9961710.99749429
shape_Elongation0.91692080.926854640.994149140.99282215
shape_Maximum2DDiameterSlice0.993834950.978221570.997616890.9936394
shape_SurfaceArea0.997563680.933739060.991859980.87503941
shape_MinorAxisLength0.992074220.966967350.996168270.99825783
shape_Maximum2DDiameterColumn0.988371190.957574630.991945080.98736261
shape_Maximum2DDiameterRow0.992098730.953956610.993869680.97679991
gldm_GrayLevelVariance0.996049330.970967070.995804580.98995297
gldm_HighGrayLevelEmphasis0.99871850.990018130.998679830.99895219
gldm_DependenceEntropy0.97686040.930692930.979188530.95287381
gldm_GrayLevelNonUniformity0.996722130.968440260.994040550.90588485
gldm_SmallDependenceEmphasis0.996184730.976846370.999419630.99714743
gldm_SmallDependenceHighGrayLevelEmphasis0.997855480.990267150.998499640.99848907
gldm_DependenceNonUniformityNormalized0.989913520.971105510.999331870.99372596
gldm_LargeDependenceEmphasis0.996044570.977686670.996633320.9956135
gldm_DependenceVariance0.980741930.958320530.994684590.99131544
gldm_LargeDependenceHighGrayLevelEmphasis0.910639230.955692320.997217490.9746251
glcm_JointAverage0.998684860.987459310.998010850.99716702
glcm_SumAverage0.998684860.987459310.998010850.99716702
glcm_JointEntropy0.995863160.97196370.998658820.99477919
glcm_ClusterShade0.98397360.960631880.995108420.99205699
glcm_MaximumProbability0.973167180.950347130.983187070.97765663
glcm_Idmn0.977084820.958085480.998439890.98992028
glcm_JointEnergy0.994721580.970352820.989919290.9924416
glcm_Contrast0.998786650.984148080.998992230.99747783
glcm_DifferenceEntropy0.998881680.983045730.999462220.99795149
glcm_InverseVariance0.997838290.981436220.999450290.99754584
glcm_DifferenceVariance0.996598930.977149620.998462320.99813253
glcm_Idn0.9703450.95411690.998156280.98941716
glcm_Idm0.997278340.981555740.999273820.99750192
glcm_Correlation0.930008150.919425470.981389670.98861346
glcm_Autocorrelation0.998828750.990711590.998695950.999001
glcm_SumEntropy0.994155340.966100840.996712680.9937416
glcm_MCC0.933029380.915618080.929707850.94261298
glcm_SumSquares0.997168540.972139910.997249180.99249056
glcm_ClusterProminence0.989318930.964917620.990778450.98964681
glcm_Imc20.978156590.922122290.975920090.9396985
glcm_Imc10.997771370.970860930.994266330.99035433
glcm_DifferenceAverage0.998487750.982961530.999371150.99702514
glcm_Id0.997266820.981398370.999302410.99748562
glcm_ClusterTendency0.996423250.96990740.996712960.99142976
firstorder_InterquartileRange0.995219940.974860540.998415540.99669589
firstorder_Uniformity0.993362790.964268640.992803670.99113126
firstorder_Median0.998248730.98490030.99991250.99978288
firstorder_Energy0.992855350.959324580.999010880.92817079
firstorder_RobustMeanAbsoluteDeviation0.997793420.976494930.998443990.99691045
firstorder_MeanAbsoluteDeviation0.998443340.976946240.998217310.99604297
firstorder_TotalEnergy0.992855350.959324580.999010880.92817079
firstorder_RootMeanSquared0.998840740.986512960.999895060.9997983
firstorder_90Percentile0.999357750.990139330.999955520.99994976
firstorder_Minimum0.96983540.921855150.926780630.8563204
firstorder_Entropy0.994247810.966417270.997003280.99412676
firstorder_Variance0.996064240.970997040.995799730.98995292
firstorder_10Percentile0.997789370.978654870.998999330.99627425
firstorder_Kurtosis0.938475160.909842930.954088770.97436669
firstorder_Mean0.998805780.986226510.999896690.99974019
glrlm_GrayLevelVariance0.99587990.9704340.995490480.98993049
glrlm_GrayLevelNonUniformityNormalized0.995430040.966479320.994135250.99112422
glrlm_RunVariance0.994732240.97340170.993806360.99548573
glrlm_GrayLevelNonUniformity0.999498040.96749280.996147490.90009412
glrlm_LongRunEmphasis0.997145090.977283210.994900070.99758722
glrlm_ShortRunHighGrayLevelEmphasis0.998878220.98961690.998630310.99910028
glrlm_ShortRunEmphasis0.998504780.98221780.998461490.99790335
glrlm_LongRunHighGrayLevelEmphasis0.992486840.97847320.998300250.99622993
glrlm_RunPercentage0.996635840.979470320.998538880.99734065
glrlm_RunEntropy0.991441070.954502540.995136930.98727089
glrlm_HighGrayLevelRunEmphasis0.998755610.989674270.998629580.99894877
glrlm_RunLengthNonUniformityNormalized0.997659110.981305630.998937790.99778324
glszm_GrayLevelVariance0.96641230.934874990.974264990.97948862
glszm_ZoneVariance0.993768130.978893860.985678310.91246949
glszm_GrayLevelNonUniformityNormalized0.974874580.939981640.993212270.98660887
glszm_SizeZoneNonUniformityNormalized0.966000870.934462050.994531470.97741752
glszm_SizeZoneNonUniformity0.979208160.923531730.997471040.77391648
glszm_LargeAreaEmphasis0.993760080.978902410.985693350.91286785
glszm_SmallAreaHighGrayLevelEmphasis0.998506070.986192180.99813230.99889261
glszm_ZonePercentage0.996015240.976232380.999389940.99740069
glszm_LargeAreaLowGrayLevelEmphasis0.957170080.969661150.943449360.94615198
glszm_HighGrayLevelZoneEmphasis0.998269720.984918030.998156220.99876604
glszm_SmallAreaEmphasis0.961652280.931477040.994013270.9749973
glszm_ZoneEntropy0.955035240.927762220.961649330.96298698
ngtdm_Complexity0.975121480.956078620.978768070.98862786
ngtdm_Contrast0.993323620.97254990.998536080.99306747
ngtdm_Busyness0.995824320.937027810.965715880.9623215

T2 map acquisition robust and reproducible features as defined by CCC & DR ≥ 0.9 and inter-/intra-ICC ≥ 0.75. CCC, concordance correlation coefficient; DR, dynamic range; firstorder, first-order features; GLCM, gray level co-occurrence matrix; GLDM, gray level difference matrix; GLRLM, gray level run length matrix; GLSZM, gray level size zone matrix; ICC, intraclass correlation coefficient; NGTDM, neighboring gray tone difference matrix. https://pyradiomics.readthedocs.io[19].

Table 2

Robust and reproducible features across all sequences.

FeaturesShape maximum 3D diameterShape major axis lengthShape elongationShape maximum 2D diameter sliceShape minoraxis lengthShape maximum 2D diameter columnShape maximum 2D diameter rowglcm Imc1
T2w TSECCC1.000.990.961.000.991.001.000.99
DR0.980.980.950.990.960.980.980.97
Intra-observer ICC1.000.991.001.000.991.001.001.00
Inter-observer ICC0.991.001.001.001.000.990.990.99
T1w TSECCC1.000.980.991.000.981.000.990.93
DR0.970.950.950.980.960.970.970.91
Intra-observer ICC1.001.001.001.001.001.001.001.00
Inter-observer ICC0.990.990.991.000.990.980.980.97
FLAIRCCC1.001.000.981.000.991.000.990.97
DR0.970.960.960.990.960.970.950.94
Intra-observer ICC1.000.990.991.000.991.000.990.99
Inter-observer ICC0.980.980.991.000.980.980.970.96
T2 mapCCC0.991.000.920.990.990.990.991.00
DR0.960.980.930.980.970.960.950.97
Intra-observer ICC1.001.000.991.001.000.990.990.99
Inter-observer ICC0.991.000.990.991.000.990.980.99
HASTECCC1.000.990.931.000.990.990.990.98
DR0.980.970.930.990.960.960.970.96
Intra-observer ICC1.000.990.991.000.990.990.990.99
Inter-observer ICC0.990.990.991.000.990.990.990.99

Across all sequences, eight features proofed to be robust and reproducible. Except of Imc1 from the GLCM feature class, all other features were shape features. CCC concordance correlation coefficient, DR dynamic range, FLAIR fluid-attenuated inversion recovery, GLCM gray level co-occurrence matrix, HASTE half-Fourier acquisition single-shot turbo spin-echo, ICC intraclass correlation coefficient, T1w T1-weighted, T2w T2-weighted, TSE turbo spin-echo.

T2 map—robust and reproducible features. T2 map acquisition robust and reproducible features as defined by CCC & DR ≥ 0.9 and inter-/intra-ICC ≥ 0.75. CCC, concordance correlation coefficient; DR, dynamic range; firstorder, first-order features; GLCM, gray level co-occurrence matrix; GLDM, gray level difference matrix; GLRLM, gray level run length matrix; GLSZM, gray level size zone matrix; ICC, intraclass correlation coefficient; NGTDM, neighboring gray tone difference matrix. https://pyradiomics.readthedocs.io[19]. Robust and reproducible features across all sequences. Across all sequences, eight features proofed to be robust and reproducible. Except of Imc1 from the GLCM feature class, all other features were shape features. CCC concordance correlation coefficient, DR dynamic range, FLAIR fluid-attenuated inversion recovery, GLCM gray level co-occurrence matrix, HASTE half-Fourier acquisition single-shot turbo spin-echo, ICC intraclass correlation coefficient, T1w T1-weighted, T2w T2-weighted, TSE turbo spin-echo.

T2 map has superior discriminative power for non-shape features

The statistical significance of the perfect results of Gini score one was pGini = 6! 6! / 12! = 1/924 ≈ 1.08E−3. In total, we computed 63,600 Gini scores for the differentiation of 120 pairs of fruits by 106 features of five sequences. We computed the Gini score for the differentiation of individual fruits. All five sequences yielded a maximal score of 120 successes. Note that, with 106 tested features the false discovery rate (FDR) of a single success, FDR = 1-(1 − pGini)106 ≈ 10.8%, was rather high. The significance of 120 successes, however, was significantly high, p120 < 1e−58, and demonstrated the predictive power of each of the five sequences. To study the predictive power of the classes, we counted their number of successes. Beside class ngtdm all classes yielded a maximal score of 120 successes. Class ngtdm failed only for the pair Kiwi3/Kiwi4. The high success score demonstrated the predictive power for each class on its own. Figure 4 shows the success rate of the sequences within individual classes. Within shape, three sequences, T1 TSE, FLAIR, and HASTE, yielded a maximal score of 120 successes (100%). T2 map and T2 TSE failed for pair Kiwi2/Kiwi3 and pair Kiwi1/Kiwi2, respectively. For three classes, gldm, glcm, and first order, T2 map yielded the maximal score of 120 successes (100%). T2 map revealed also the top result of 117 successes (97.5%) within glszm and ngtdm. The majority of failed tests occurred for pairs of identical fruit types. Within classes shape, glrlm, and gldm, all sequences obtained a 100% success rate for pairs of different fruit types. For pairs of different fruit types, success rates below 100% were exceptions (n = 4 out of 35), and the minimal success rate was 95.8% for T2 TSE in class glszm. The differentiation of fruits of identical type was more difficult. Success rates below 100% were the rule (n = 26 out of 35), and the minimal success rate was 33.3% for HASTE in class glszm.
Figure 4

Success rates of MRI sequences within individual feature classes. Maximum score of 120 successes to differentiate a total of 120 pairs of fruits equals a success rate of 100%. firstorder, first-order features; GLCM, gray level co-occurrence matrix; GLDM, gray level difference matrix; GLRLM, gray level run length matrix; GLSZM, gray level size zone matrix; NGTDM, neighboring gray tone difference matrix.

Success rates of MRI sequences within individual feature classes. Maximum score of 120 successes to differentiate a total of 120 pairs of fruits equals a success rate of 100%. firstorder, first-order features; GLCM, gray level co-occurrence matrix; GLDM, gray level difference matrix; GLRLM, gray level run length matrix; GLSZM, gray level size zone matrix; NGTDM, neighboring gray tone difference matrix.

Shape feature class and T2 map imaging technique for non-shape features yield highest discriminative performance

Some fruits were easily distinguishable by differences in size, shape, and textures. Discriminations between apples and limes had high accuracy, e.g., 478 out of 5 × 106 = 530 features were able to distinguish between apple3 and lime2. Discriminations between two fruits of identical type were more challenging, e.g., only 64 out of 5 × 106 = 530 features were able to distinguish between kiwi1 and kiwi3. For fruits of identical type, the mean number of successful features was 155 ± 50 out of 530 to be compared with the mean number 386 ± 54 out of 530 for fruits of different type. A feature with perfect sensitivity and robustness would provide optimal predictive power for each of the 120 pairwise differentiations of the sixteen fruits. Table 3 shows the top-ranked features (17 features for a cut-off value of 109 successes, see supplementary Table 2 (Table D.1) for all features and supplementary Fig. 3 (Fig. E.1) for the respective Bland Altmann plots of the top ranked features). We observe an optimal score of 120 correct discriminations only for one feature, Maximum2DDiameterSlice (class shape) of HASTE. Also, for other sequences, Maximum2DDiameterSlice achieved high ranks, two (T2 TSE, T2 map), 6 (FLAIR), and eleven (T1 TSE), among the 5 × 106 = 530 features. Among the 17 best-ranked features, eleven features are of class shape. Of the six non-shape top features five were of T2 map. Features of class shape were enriched in the set of top-ranked features, and T2 map imaging technique was enriched in the top ranked non-shape features.
Table 3

Top ranked features by number of pairwise discriminative successes based on Gini score analysis.

RankFeatureSequenceClassNo. successesRate (%)
1Maximum2DDiameterSliceHASTEShape120100.0
2Maximum2DDiameterSliceT2w TSEShape11495.0
2Maximum2DDiameterSliceT2 mapShape11495.0
4MajorAxisLengthHASTEShape11293.3
5DependenceNonUniformityNormalizedT2 mapGldm11192.5
6Maximum3DDiameterT2w TSEShape11091.7
6Maximum2DDiameterRowT2w TSEShape11091.7
6Maximum2DDiameterSliceFLAIRShape11091.7
6Maximum2DDiameterRowFLAIRShape11091.7
6MedianT2 mapFirstorder11091.7
11MajorAxisLengthT2 mapShape10990.8
11RunPercentageT2 mapglrlm10990.8
11Maximum2DDiameterSliceT1w TSEShape10990.8
11IdmT2 mapglcm10990.8
11RunPercentageFLAIRglrlm10990.8
11IdT2 mapglcm10990.8
11Maximum2DDiameterColumnT2w TSEShape10990.8

The 17 top ranked features up to a cut-off value of 109 successes are depicted to pairwise discriminate variant fruits. See supplementary Table 2 (Table D.1) for all features.

Top ranked features by number of pairwise discriminative successes based on Gini score analysis. The 17 top ranked features up to a cut-off value of 109 successes are depicted to pairwise discriminate variant fruits. See supplementary Table 2 (Table D.1) for all features.

Discussion

Radiomics is increasingly applied to perform data mining and augment image data for model building[1]. Nevertheless, data on the robustness and reproducibility of radiomic features, especially for MRI radiomics, are scarce and remain controversial[5,9,16-18,20,21]. Monocenter as well as multicenter studies dealing with the robustness and reproducibility of radiomic features obtained controversial results[5,9,16-18]. Baeßler et al. have demonstrated high vulnerability of the majority of radiomic features[9]. We applied a phantom model as proposed by Baeßler et al.[9] to acquire standard clinical routine (HASTE, T2w TSE, FLAIR, T1w TSE) and further experimental (T2 map) imaging techniques and extracted 106 radiomic features per sequence. In accordance with Baeßler et al.[9], we analyzed intra- and inter-observer reproducibility as well as robustness of radiomics features. We could reveal superiority of T2 map yielding the highest performance. FLAIR was the second best imaging sequence. We could demonstrate robustness and reproducibility of 84 features applying T2 map, 59 features applying FLAIR, and only a subset of eight features was robust and reproducible across all sequences. The highest discriminative performance was found for feature class shape and for the imaging technique T2 map for non-shape features. Baeßler et al. have proposed a subset of 15 features as reliable candidates for radiomic signatures within clinical studies[9]. They claim that all other features should be favored to be dismissed during the feature selection process to improve validity of model building[9]. We examined approximately twice as much features (106 vs 45) and could reveal approximately half as much stable features (8 vs 15)[9]. In line with Baeßler et al., our subset of eight robust features across all sequences included shape features and a feature of the GLCM feature class[9]. Nevertheless, we could not corroborate the feature set of Baeßler et al.[9]. All of our top robust features were variant to the proposed 15 features[9]. We could demonstrate that subsets of the proposed 15 robust features were transferrable to specific MRI sequences in our data set[9]. Our analyzes emphasize that one may not overstate generalizability of single center datasets[9,22]. Mapping imaging techniques enable acquisition of quantitative imaging data in contrast to the standard qualitative MR images[23]. Though mapping parameters depend on the applied field strength, they inherit the potential to serve as quantitative biomarkers[23]. We could demonstrate that T2 map has the highest potential for robust and reproducible feature extraction. In line with Baeßler et al., intra-observer variance revealed a high stability[9]. We applied a semi-automatic segmentation process which is known to reduce inter-observer variance[12]. Nevertheless, inter-observer variance remained a dominant factor reducing the amount of robust and reproducible imaging features. Multidimensional feature classes are routinely applied in radiomic research to mine data and build specific models[1]. Our study urges caution in the interpretation of radiomics study results, especially when the possibility of rapid translation into clinical routine is proclaimed[8]. We elucidate the potential of shape features to represent the most promising features. This may be interpreted in line with a recent proof of validity study of Welch et al.[6]. Welch et al. have been able to demonstrate that radiomic signature features may be surrogates of shape features only and may not yield additional pertinent for prognostication[6]. High-dimensional features may be redundant, and predictive power may be based on shape features[6]. The study has been performed employing computed tomography (CT) data[6]. Qualitative MRI data may inherit an even higher vulnerability. Our study suffers from limitations that warrant discussion. We applied a fruit phantom and a standardized phantom of defined multi-material compositions might have led to higher levels of reproducibility. To stay in line with Baeßler et al. we favored application of a multifruit model[9]. Stationary macro-object phantoms have limited comparability to human tissues and direct translation to in-vivo radiomic studies would overestimate the findings and is beyond the scope of our study. In an in-vivo setting, MRI sequences are prone to motion artefacts which might have altered the results. We acquired our scan and rescan data directly one after the other on one 3 T scanner. We cannot rule out that recalibration of the MR scanner, temporal or geographical variation might have altered the results[22]. We did design our study to acquire two measurement sets in form of test and retest data, and more repetitions could have stabilized the results. Contrary to Baeßler et al., we applied PyRadiomics to perform the feature extraction[9,19,24,25]. PyRadiomics promotes transparent multicenter research with open source codes being available[8,19]. In an in-vivo setting or a standard of care clinical scenario, the limitations described above would lead to increased variation in the VOI-definition of repeat or follow-up scans, which in turn increases the variation of all radiomic features. One would expect to see a decrease in the number of robust features across all sequences. This highlights the importance of stratifying specific robust and reproducible sequences and corresponding feature subsets to path the way for clinical translation of radiomic data augmentation in the future. In conclusion, we provide further evidence that the robustness of MRI radiomics features depends on the particular MRI sequence used. We revealed superiority of T2 map to lead to the highest amount of robust and reproducible quantitative imaging features as well having the highest discriminative performance. FLAIR was the second best sequence. Only eight out of 106 features were stable across all MR sequences, and seven out of the respective eight features were part of the shape feature class. We could not corroborate the subset of robust features of Baeßler et al. and therefore urge caution in interpreting radiomic research[8,9]. We propose the inclusion of mapping imaging techniques in the clinical routine setting to enable acquisition of robust imaging data pathing the way for multicenter multivendor research. Multicenter multivendor validation studies employing phantoms and in-vivo experiments are needed prior to translation of radiomic findings and respective models into clinical routine.

Methods

Study design

We constructed a multi-fruit phantom as proposed by Baeßler et al.[9] consisting of four onions, four limes, four kiwifruits and four apples. Image acquisition was performed as scan-rescan: repositioning in the same direction with replanning of all sequences, two measurements.

MR imaging acquisition and examination

All examinations were performed on a single 3 T scanner with a standard body-array coil (Magnetom PrismaFIT, Siemens Healthcare, Erlangen, Germany) and built-in spine phased-array coil. The sequences were adapted from the standard clinical liver sequences, including an experimental quantitative mapping imaging technique leading to a total of five variant imaging techniques: (I) T2-weighted (T2w) half-Fourier acquisition single-shot turbo spin-echo (HASTE), (II) T2w turbo spin-echo (TSE), (III) T2w fluid-attenuated inversion recovery (FLAIR), (IV) T2 map and (V) T1-weighted (T1w) TSE (Fig. 5). Details of imaging parameters of acquisition are shown in Table 4.
Figure 5

Representative images of the acquired magnetic resonance imaging sequences. FLAIR, fluid-attenuated inversion recovery; HASTE, half-Fourier acquisition single-shot turbo spin-echo; T1w, T1-weighted; T2w, T2-weighted; TSE, turbo-spin-echo.

Table 4

Magnetic resonance imaging sequence parameters.

SequenceT2w HASTET2w TSET2w FLAIRT2 MapT1w TSE
OrientationAxialAxialAxialAxialAxial
TR (ms)1000750090004000600
TE (ms)87968934; 8020
Averages12112
Flip angle115160150180161
FOV (mm2)382 × 350400 × 400400 × 400299 × 399400 × 400
Matrix (px2)280 × 256256 × 320256 × 256173 × 384240 × 320
Bandwidth (Hz)700200220220185
Slice thickness (mm)63544
Original protocolLiverLiverLiverLiverLiver

Acquisition parameters of the modified clinical routine protocols are shown. FLAIR fluid-attenuated inversion recovery, FOV field of view, HASTE half-Fourier acquisition single-shot turbo spin-echo, T1w T1-weighted, T2w T2-weighted, TE echo time, TR repetition time, TSE turbo-spin-echo.

Representative images of the acquired magnetic resonance imaging sequences. FLAIR, fluid-attenuated inversion recovery; HASTE, half-Fourier acquisition single-shot turbo spin-echo; T1w, T1-weighted; T2w, T2-weighted; TSE, turbo-spin-echo. Magnetic resonance imaging sequence parameters. Acquisition parameters of the modified clinical routine protocols are shown. FLAIR fluid-attenuated inversion recovery, FOV field of view, HASTE half-Fourier acquisition single-shot turbo spin-echo, T1w T1-weighted, T2w T2-weighted, TE echo time, TR repetition time, TSE turbo-spin-echo.

Image preprocessing and segmentation

MR images were extracted in Digital Imaging and Communications in Medicine (DICOM) format and imported into the open-source 3D Slicer software platform (http://slicer.org, version 4.9.0)[24,25]. Images were resampled to a spacing of 1 mm × 1 mm × 1 mm employing B-spline interpolation (https://www.slicer.org/wiki/Registration:Resampling, supplementary methods 2 of Griethuysen et al.[19])[25]. For the segmentation, a three-dimensional volume of interest (VOI) was defined in each fruit employing the paint tool of the segment editor[25]. Augmentation of the VOI to match the boundaries of the fruit was performed using the semi-automatic grow from seeds algorithm, known to reduce inter-observer variability[12,25,26]. By limiting manual VOI placement to the middle proportion of each fruit with a 1.5 cm diameter VOI, we did limit the consecutive growing algorithm to segment the middle portion of each fruit, thus reducing partial volume artefacts of the upper and lower boarder zones (Fig. 6). Fruits were positioned in close proximity paralleling the real world scenario of VOIs being surrounded by variant tissues. Consequently, segmentation errors were observed. Respective foci of error were manually corrected employing the brush-erase tool[9]. The segmentation workflow is shown in Fig. 6. To analyze the inter- and intra-observer variance, the segmentation workflow as well as feature extraction was additionally conducted by a second reader after initial training and by the first reader after a pause of one month, respectively.
Figure 6

Phantom design and workflow of semi-automatic segmentation. The phantom (A) and the workflow of semi-automatic segmentation are shown exemplarily for T2-weighted turbo-spin-echo acquisition (B–E). On the original image (B), we manually defined preliminary volumes of interest (C, diameter 1.5 cm). The growth from seeds algorithm was used to augment the 3D volumes (D) with subsequent manual correction of erroneous border segment sections (E). In F a representative 3D volume rendering is shown.

Phantom design and workflow of semi-automatic segmentation. The phantom (A) and the workflow of semi-automatic segmentation are shown exemplarily for T2-weighted turbo-spin-echo acquisition (B–E). On the original image (B), we manually defined preliminary volumes of interest (C, diameter 1.5 cm). The growth from seeds algorithm was used to augment the 3D volumes (D) with subsequent manual correction of erroneous border segment sections (E). In F a representative 3D volume rendering is shown.

Features extraction

We applied the open-source package PyRadiomics[19] as extension within the 3D Slicer software platform[24,25] to extract the radiomic features. Feature definitions of PyRadiomics are broadly implemented according to the IBSI definition consensus[8,11,19]. From seven feature classes we extracted all original standard features: Shape-based, First Order Statistics, Gray Level Co-occurrence Matrix (GLCM), Gray Level Run Length Matrix (GLRLM), Gray Level Size Zone Matrix (GLSZM), Gray Level Dependence Matrix (GLDM), Neighboring Gray Tone Difference Matrix (NGTDM) leading to 107 features per VOI and sequence (http://pyradiomics.readthedocs.io[19]). Default settings of PyRadiomics were used for feature extraction, i.e. original without filtering, no wavelet-based features, bin width 25, and enforced symmetrical GLCM, http://pyradiomics.readthedocs.io[3,8,19]. As we restricted the segmentation to the middle proportion of the fruits, least axis parameter was systematically biased and we excluded this feature from the analyses, leading to a total of 106 “true” features per VOI and sequence (further referred to as the total amount of features).

Evaluation of robustness and reproducibility

To ensure highest methodological transparency, we used open-source software with source codes being available online. We performed statistical calculations and analysis with Python 3.7.6[27], within Jupyter Notebook[28] with the respective package scipy (version 1.4.1)[29]. We computed concordance correlation coefficient (CCC) and dynamic range (DR) values on paired samples, x and y[30-32]. The samples x and y contained a first set and a second disjunct set of values of a feature, respectively. CCC values range from -1 to 1, where 1 refers to the perfect agreement between the two samples xi = yi, i = 1, …, n. The value CCC = -1 refers to perfectly anticorrelated pairs of samples xi = -yi, i = 1, …, n. The value DR = 0 refers to the lowest possible variability in the sets x and y, i.e., x1 = x2 =  = xn ≠ y1 = y2 =  = yn. The value DR = 1 refers to optimal reproducibility xi = yi, i = 1, …, n combined with a nonzero data range. Recent studies have defined high correlation for CCC and DR using a cut-off value of 0.9[9,32]. The choice of 0.9 as the cut-off has been based on the study of Segal et al. applying Pearson correlation measurement[33]. CCC is known to outperform Pearson correlation coefficient[30] and no consensus exists, therefore, we defined our cut-off value at 0.9, as proposed by Baeßler et al.[9]. Also, in line with Baeßler et al., we further added analyses of a relaxed and a more strict cut-off value of 0.85 and 0.95, respectively[9]. Further, we tested intra- and inter-observer reproducibility by means of intraclass correlation coefficients (ICCs)[34,35]. ICC assesses the reproducibility of measurements performed by different observers measuring the same quantity[34,35]. ICC range from − 1 to 1, where 1 refers to perfect correlation and − 1 refers to perfect anticorrelation. In accordance with Baeßler et al.[9], we defined excellent, ≥ 0.75; good, 0.60–0.74; moderate, 0.40–0.59; and poor, ≤ 0.39, reproducibility[36,37]. To correct CCC values for subtle intrareader variances, we applied the bias correction as done by Baeßler et al.: CCCcorr = CCC + (1 − intra-observer ICC)[9].

Gini scores

We applied the Mann–Whitney U test[38] to measure the predictive power of a feature to distinguish the sixteen individual objects of the phantom. The Mann–Whitney U test computes a U-parameter from the numeric ranks of the values in the union of two groups. The statistic of the U parameter describes the null hypothesis of identical distributions of both populations. We rescaled the U parameter to the area under the receiver operating characteristic curve (ROC, AUC)[39] where n1 and n2 denote the number of feature values of fruit A and B, respectively. We calculated the Gini score to measure the predictive power. Note that, a Gini Score of Gini = 100% enables a correct decision based on a single value of a feature. For features with a Gini score of Gini = 0%, any prediction would be random and the feature would give no valuable information for a decision. For an individual fruit, we considered six replicate values, one value of each of three segmentations of two scans. The Mann–Whitney U test compared the six values of a fruit with the six values of another fruit and computes a value of Gini score between zero (no predictive power) and one (perfect predictive power). We named the fruits apple1-4, lime1-4, onion1-4, and kiwi1-4. We denoted a group of features to be successful to distinguish a pair of fruits, if at last one feature in the group yielded a perfect Gini score of one. For example, a sequence yielded a maximal score of 120 successes only if for each of the 120 pairs of fruits, at least one of its 106 features was able to distinguish the two individual fruits.

General statistical analysis

For statistical analysis, the values of significance are depicted in the graphs as followed: *p < 0.05; **p < 0.01; ***p < 0.001. Further graphical illustrations and statistics were performed employing JMP 14 (SAS), Prism 6.0 (GraphPad software), Microsoft Excel (Microsoft Corporation) and Affinity Designer 1.8.5.703 (Serif (Europe) Ltd). Supplementary Figure 1. Supplementary Figure 2. Supplementary Figure 3. Supplementary Table 1. Supplementary Table 2. Supplementary Table Legends.
  33 in total

1.  Multicentre magnetic resonance texture analysis trial using reticulated foam test objects.

Authors:  R A Lerski; L R Schad; R Luypaert; A Amorison; R N Muller; L Mascaro; P Ring; A Spisni; X Zhu; A Bruno
Journal:  Magn Reson Imaging       Date:  1999-09       Impact factor: 2.546

2.  3D Slicer as an image computing platform for the Quantitative Imaging Network.

Authors:  Andriy Fedorov; Reinhard Beichel; Jayashree Kalpathy-Cramer; Julien Finet; Jean-Christophe Fillion-Robin; Sonia Pujol; Christian Bauer; Dominique Jennings; Fiona Fennessy; Milan Sonka; John Buatti; Stephen Aylward; James V Miller; Steve Pieper; Ron Kikinis
Journal:  Magn Reson Imaging       Date:  2012-07-06       Impact factor: 2.546

3.  Robustness and Reproducibility of Radiomics in Magnetic Resonance Imaging: A Phantom Study.

Authors:  Bettina Baeßler; Kilian Weiss; Daniel Pinto Dos Santos
Journal:  Invest Radiol       Date:  2019-04       Impact factor: 6.016

4.  Radiomics of CT Features May Be Nonreproducible and Redundant: Influence of CT Acquisition Parameters.

Authors:  Roberto Berenguer; María Del Rosario Pastor-Juan; Jesús Canales-Vázquez; Miguel Castro-García; María Victoria Villas; Francisco Mansilla Legorburo; Sebastià Sabater
Journal:  Radiology       Date:  2018-04-24       Impact factor: 11.105

Review 5.  Radiomics as a Quantitative Imaging Biomarker: Practical Considerations and the Current Standpoint in Neuro-oncologic Studies.

Authors:  Ji Eun Park; Ho Sung Kim
Journal:  Nucl Med Mol Imaging       Date:  2018-02-01

6.  The intraclass correlation coefficient as a measure of reliability.

Authors:  J J Bartko
Journal:  Psychol Rep       Date:  1966-08

7.  Test-retest reproducibility analysis of lung CT image features.

Authors:  Yoganand Balagurunathan; Virendra Kumar; Yuhua Gu; Jongphil Kim; Hua Wang; Ying Liu; Dmitry B Goldgof; Lawrence O Hall; Rene Korn; Binsheng Zhao; Lawrence H Schwartz; Satrajit Basu; Steven Eschrich; Robert A Gatenby; Robert J Gillies
Journal:  J Digit Imaging       Date:  2014-12       Impact factor: 4.056

8.  Decoding global gene expression programs in liver cancer by noninvasive imaging.

Authors:  Eran Segal; Claude B Sirlin; Clara Ooi; Adam S Adler; Jeremy Gollub; Xin Chen; Bryan K Chan; George R Matcuk; Christopher T Barry; Howard Y Chang; Michael D Kuo
Journal:  Nat Biotechnol       Date:  2007-05-21       Impact factor: 54.908

9.  Computational Radiomics System to Decode the Radiographic Phenotype.

Authors:  Joost J M van Griethuysen; Andriy Fedorov; Chintan Parmar; Ahmed Hosny; Nicole Aucoin; Vivek Narayan; Regina G H Beets-Tan; Jean-Christophe Fillion-Robin; Steve Pieper; Hugo J W L Aerts
Journal:  Cancer Res       Date:  2017-11-01       Impact factor: 12.701

10.  Repeatability of Multiparametric Prostate MRI Radiomics Features.

Authors:  Michael Schwier; Joost van Griethuysen; Mark G Vangel; Steve Pieper; Sharon Peled; Clare Tempany; Hugo J W L Aerts; Ron Kikinis; Fiona M Fennessy; Andriy Fedorov
Journal:  Sci Rep       Date:  2019-07-01       Impact factor: 4.379

View more
  4 in total

1.  CT-Based Radiomics Analysis for Noninvasive Prediction of Perineural Invasion of Perihilar Cholangiocarcinoma.

Authors:  Peng-Chao Zhan; Pei-Jie Lyu; Zhen Li; Xing Liu; Hui-Xia Wang; Na-Na Liu; Yuyuan Zhang; Wenpeng Huang; Yan Chen; Jian-Bo Gao
Journal:  Front Oncol       Date:  2022-06-20       Impact factor: 5.738

2.  Evaluation of the dependence of radiomic features on the machine learning model.

Authors:  Aydin Demircioğlu
Journal:  Insights Imaging       Date:  2022-02-24

Review 3.  Radiomics in prostate cancer: an up-to-date review.

Authors:  Matteo Ferro; Ottavio de Cobelli; Gennaro Musi; Francesco Del Giudice; Giuseppe Carrieri; Gian Maria Busetto; Ugo Giovanni Falagario; Alessandro Sciarra; Martina Maggi; Felice Crocetto; Biagio Barone; Vincenzo Francesco Caputo; Michele Marchioni; Giuseppe Lucarelli; Ciro Imbimbo; Francesco Alessandro Mistretta; Stefano Luzzago; Mihai Dorin Vartolomei; Luigi Cormio; Riccardo Autorino; Octavian Sabin Tătaru
Journal:  Ther Adv Urol       Date:  2022-07-04

4.  A Pipeline for the Implementation and Visualization of Explainable Machine Learning for Medical Imaging Using Radiomics Features.

Authors:  Cameron Severn; Krithika Suresh; Carsten Görg; Yoon Seong Choi; Rajan Jain; Debashis Ghosh
Journal:  Sensors (Basel)       Date:  2022-07-12       Impact factor: 3.847

  4 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.