Michał Kruczkowski1, Anna Drabik-Kruczkowska2, Anna Marciniak3,4, Martyna Tarczewska3, Monika Kosowska5, Małgorzata Szczerska6.
Abstract
Cervical cancer is one of the most common cancers, and its early diagnosis is of the greatest importance. Unfortunately, many diagnoses are based on the subjective opinions of doctors; to date, there is no general measurement method with a calibrated standard. The problem can be solved with a measurement system that fuses an optoelectronic sensor and a machine learning algorithm to provide reliable assistance for doctors at the early diagnosis stage of cervical cancer. We demonstrate preliminary research on cervical cancer assessment utilizing an optical sensor and a prediction algorithm. Since each kind of matter is characterized by a refractive index, measuring its value and detecting its changes gives information about the state of the tissue. The optical measurements provided datasets for training and validating the analyzing software. We present the data preprocessing, machine learning results utilizing four algorithms (Random Forest, eXtreme Gradient Boosting, Naïve Bayes, Convolutional Neural Networks), and an assessment of their performance in classifying tissue as healthy or sick. Our solution allows for rapid sample measurement and automatic classification of the results, constituting a potential support tool for doctors.
Year: 2022 PMID: 35260666 PMCID: PMC8904553 DOI: 10.1038/s41598-022-07723-1
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Refractive index values of cervical cells at different stages of neoplastic progression.
| Cell type | Basal | Midzone | Superficial |
|---|---|---|---|
| Normal | 1.387 | 1.372 | 1.414 |
| Cancer | 1.426 | 1.404 | 1.431 |
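The optical readout rests on the fact that the fringe spacing of a Fabry–Perot interferogram depends on the refractive index of the medium in the cavity. As an illustration (the centre wavelength and cavity length below are hypothetical, not the paper's setup), the standard free-spectral-range relation FSR ≈ λ₀²/(2nL) shows how the normal-vs-cancer index shift in the table above translates into a measurable change in fringe spacing:

```python
def fringe_spacing_nm(center_wavelength_nm, n, cavity_length_um):
    """Free spectral range of a Fabry-Perot cavity: FSR = lambda0^2 / (2 n L)."""
    cavity_length_nm = cavity_length_um * 1e3
    return center_wavelength_nm ** 2 / (2 * n * cavity_length_nm)

# Hypothetical setup: 1550 nm centre wavelength, 100 um cavity.
# Basal-layer indices from the table above: 1.387 (normal) vs 1.426 (cancer).
normal = fringe_spacing_nm(1550, 1.387, 100)
cancer = fringe_spacing_nm(1550, 1.426, 100)
print(f"normal: {normal:.3f} nm, cancer: {cancer:.3f} nm")
```

A higher refractive index compresses the fringes, so the cancerous sample shows a slightly smaller fringe spacing than the healthy one.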
Figure 1. Methodology workflow.
Figure 2. Sample interferogram.
Description of the selected features.
| Symbol | Feature | Description |
|---|---|---|
| F1 | Number of local maxima | Extraction of a list of local maxima in considered interferogram |
| F2 | Global maxima | Maximum value from the local maxima list |
| F3 | Threshold | A variable used to filter amplitude to smooth the signal (e.g. 5% of global maxima) |
| F4 | Amplitude normalization factor | Used to rescale the experimental plot relative to the simulated one, due to the different amplitude ranges; the factor was calculated as shown in Eq. |
| F5 | Local maxima distance–average | Average wavelength distance between the local maxima |
| F6 | Local maxima distance–maximum | Maximum wavelength distance between the local maxima |
| F7 | Local maxima distance–minimum | Minimum wavelength distance between the local maxima |
| F8 | Local maxima distance–median | Median wavelength distance between the local maxima |
| F9 | Dissimilarity measure | Dissimilarity between the simulated and experimental interferograms (integral calculated using Simpson's rule as shown in Eq.) |
| F10 | Chart axial shift | Global maxima shift between simulated and experimental interferogram |
| F11 | Root mean squared error (RMSE) | Difference between the simulation plot and the experimental data |
| F12 | Cavity length | Value read from the configuration of the measuring set (Fabry–Perot cavity) |
| F13 | Minimum wavelength | Minimum value for wavelength parameter |
| F14 | Maximum wavelength | Maximum value for wavelength parameter |
| F15 | Amplitude | Difference between maximum and minimum y value, where y is representative of amplitude column from input data |
| F16 | λ0 | Wavelength for maximum amplitude |
| F17 | λ0 for theoretical signal | Wavelength for maximum amplitude (for base signal) |
| F18 | Target variable | 1—cancer, 0—healthy |
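Several of the features above (F1, F2, F5–F8, F15) are simple statistics of the interferogram's local maxima. A minimal sketch of how such features could be computed with NumPy; the function name and the synthetic cosine fringe are illustrative, not the paper's code:

```python
import numpy as np

def extract_features(wavelength, amplitude):
    """Compute a subset of the interferogram features (F1, F2, F5-F8, F15)."""
    # F1: local maxima (samples strictly greater than both neighbours)
    peaks = [i for i in range(1, len(amplitude) - 1)
             if amplitude[i] > amplitude[i - 1] and amplitude[i] > amplitude[i + 1]]
    f1 = len(peaks)
    # F2: global maximum taken over the local-maxima list
    f2 = float(max(amplitude[i] for i in peaks)) if peaks else float(np.max(amplitude))
    # F5-F8: wavelength distances between consecutive local maxima
    dists = np.diff(wavelength[peaks]) if f1 > 1 else np.array([0.0])
    # F15: amplitude range (difference of max and min y values)
    f15 = float(np.max(amplitude) - np.min(amplitude))
    return {"F1": f1, "F2": f2,
            "F5": float(dists.mean()), "F6": float(dists.max()),
            "F7": float(dists.min()), "F8": float(np.median(dists)),
            "F15": f15}

# Synthetic fringe pattern: a cosine with a 10 nm period over 1500-1600 nm.
wl = np.linspace(1500, 1600, 1001)
amp = np.cos(2 * np.pi * wl / 10)
feats = extract_features(wl, amp)   # 9 interior maxima, spaced 10 nm apart
```

The remaining features (threshold filtering, dissimilarity integral, RMSE against the simulated signal) additionally require the theoretical reference interferogram, which is not reproduced here.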
Figure 3. Flowchart of data preprocessing.
Figure 4. Graphical representation of stratified 3-fold cross-validation on the prepared dataset.
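Stratified k-fold splitting keeps the healthy/cancer class ratio equal across folds, which matters for a small medical dataset. A minimal sketch with scikit-learn's `StratifiedKFold` on a synthetic, balanced stand-in dataset (the sample count, seed, and shapes are assumptions, not the paper's data):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in: 90 samples with 17 features (F1-F17) and a
# balanced binary target (F18: 1 = cancer, 0 = healthy).
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 17))
y = np.array([0, 1] * 45)

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
splits = list(skf.split(X, y))
for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    # Stratification preserves the 50/50 class ratio in every fold.
    print(f"fold {fold}: train={len(train_idx)}, test={len(test_idx)}, "
          f"cancer fraction in train={y[train_idx].mean():.2f}")
```

Each of the three folds would then train a classifier on the train indices and score it on the held-out indices, matching the per-fold rows in the results table.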
Dataset statistics (coef: the coefficient of the independent variables and the constant term in the equation).
| Symbol | coef | std error | test statistic t | P > \|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| F1 | −5.8e+04 | 2.22e+05 | −0.262 | 0.794 | −4.96e+05 | 3.8e+05 |
| F2 | −0.0012 | 0.001 | −2.161 | 0.032 | −0.002 | −9.98e−05 |
| F3 | −0.0159 | 0.035 | −0.448 | 0.654 | −0.086 | 0.054 |
| F4 | −0.0646 | 0.033 | −1.965 | 0.051 | −0.129 | 0.000 |
| F5 | 6.203e+04 | 2.33e+05 | 0.266 | 0.791 | −3.99e+05 | 5.23e+05 |
| F6 | −0.1583 | 0.654 | −0.242 | 0.809 | −1.450 | 1.133 |
| F7 | −0.0221 | 0.064 | −0.345 | 0.731 | −0.148 | 0.104 |
| F8 | −0.0091 | 0.035 | −0.264 | 0.792 | −0.077 | 0.059 |
| F9 | 0.5103 | 1.621 | 0.315 | 0.753 | −2.689 | 3.710 |
| F10 | −58.0557 | 362.914 | −0.160 | 0.873 | −774.423 | 658.312 |
| F11 | 0.0962 | 0.051 | 1.872 | 0.063 | −0.005 | 0.198 |
| F12 | 0.0167 | 0.093 | 0.180 | 0.858 | −0.167 | 0.200 |
| F13 | −0.0057 | 0.022 | −0.264 | 0.792 | −0.049 | 0.037 |
| F14 | 2.6056 | 0.805 | 3.238 | 0.001 | 1.017 | 4.194 |
| F15 | 0.0323 | 0.094 | 0.346 | 0.730 | −0.152 | 0.217 |
| F16 | 0.0004 | 0.000 | 0.775 | 0.439 | −0.001 | 0.001 |
| F17 | 7.937e−07 | 3.01e−06 | 0.264 | 0.792 | −5.15e−06 | 6.73e−06 |
| F18 | −5.8e+04 | 2.22e+05 | −0.262 | 0.794 | −4.96e+05 | 3.8e+05 |
Figure 5. A graphical representation of the evaluation measures: True Positives, False Positives, False Negatives, True Negatives.
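The four scores in the results table follow directly from the confusion-matrix counts illustrated in Fig. 5. A minimal sketch (the example counts are hypothetical, not taken from the paper's matrices):

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of the predicted positives, how many are real
    recall = tp / (tp + fn)      # of the real positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts: 8 true positives, 2 false positives,
# 1 false negative, 9 true negatives.
acc, prec, rec, f1 = classification_metrics(8, 2, 1, 9)
```

Note that a classifier with zero false positives has precision 1.0 regardless of how many positives it misses, which is why precision and recall must be read together.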
Classification results.
| Classifier | Fold | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| Random Forest | 1 | 0.97 | 0.97 | 0.98 | 0.97 |
| | 2 | 1.00 | 1.00 | 1.00 | 1.00 |
| | 3 | 1.00 | 1.00 | 1.00 | 1.00 |
| | Validation | 0.91 | 0.91 | 0.92 | 0.92 |
| XGBoost | 1 | 1.00 | 1.00 | 1.00 | 1.00 |
| | 2 | 1.00 | 1.00 | 1.00 | 1.00 |
| | 3 | 1.00 | 1.00 | 1.00 | 1.00 |
| | Validation | 0.89 | 0.90 | 0.90 | 0.89 |
| Naïve Bayes | 1 | 0.96 | 0.96 | 0.96 | 0.96 |
| | 2 | 0.95 | 0.95 | 0.95 | 0.95 |
| | 3 | 0.97 | 0.97 | 0.97 | 0.97 |
| | Validation | 0.92 | 0.93 | 0.93 | 0.92 |
| CNN | 1 | 0.78 | 1.00 | 0.61 | 0.75 |
| | 2 | 0.83 | 1.00 | 0.69 | 0.82 |
| | 3 | 0.81 | 1.00 | 0.67 | 0.80 |
| | Validation | 0.75 | 1.00 | 0.58 | 0.73 |
Figure 6. Confusion matrices for selected algorithms. A1: Random Forest, test dataset fold 1; A2: Random Forest, validation dataset; B1: XGBoost, test dataset fold 1; B2: XGBoost, validation dataset; C1: Naïve Bayes, test dataset fold 1; C2: Naïve Bayes, validation dataset; D1: CNN, test dataset fold 1; D2: CNN, validation dataset.
Average training and prediction times for the chosen algorithms.
| Algorithm | Training | Prediction |
|---|---|---|
| Random Forest | 212 ms | 15.5 ms |
| XGBoost | 21 ms | 2.08 ms |
| Naïve Bayes | 7.54 ms | 1.81 ms |
| CNN | 5320 ms | 5 ms |
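Timings like those above can be reproduced with a simple wall-clock benchmark. A sketch using scikit-learn's `GaussianNB` on synthetic data (the model choice, data shape, and seed are assumptions; absolute times depend on the hardware):

```python
import time

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the 17-feature dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 17))
y = rng.integers(0, 2, size=200)

model = GaussianNB()

t0 = time.perf_counter()
model.fit(X, y)                                # training time
train_ms = (time.perf_counter() - t0) * 1e3

t0 = time.perf_counter()
pred = model.predict(X)                        # prediction time
predict_ms = (time.perf_counter() - t0) * 1e3

print(f"training: {train_ms:.2f} ms, prediction: {predict_ms:.2f} ms")
```

In practice one would average over many repetitions (e.g. with `timeit`) rather than trust a single measurement.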