Danuta Liberda, Ewa Pięta, Katarzyna Pogoda, Natalia Piergies, Maciej Roman, Paulina Koziol, Tomasz P Wrobel, Czeslawa Paluszkiewicz, Wojciech M Kwiatek.
Abstract
Fourier transform infrared spectroscopy (FT-IR) is widely used to analyze the chemical composition of biological materials and has the potential to reveal new aspects of the molecular basis of diseases, including different types of cancer. The potential of FT-IR in cancer research lies in its capability to monitor the biochemical status of cells that undergo malignant transformation and to examine, using proper mathematical approaches, the spectral features that differentiate normal cells from cancerous ones. Such examination can be performed with chemometric tools, such as partial least squares discriminant analysis (PLS-DA) classification and partial least squares regression (PLSR); proper application of preprocessing methods in the correct sequence is crucial for success. Here, we compared several state-of-the-art methods commonly used in infrared biospectroscopy (denoising, baseline correction, and normalization) together with methods not previously used in infrared biospectroscopy classification problems: Mie extinction extended multiplicative signal correction, Eilers smoothing, and probabilistic quotient normalization. We evaluated all of these approaches and their effect on the data structure, classification, and regression capability using experimental FT-IR spectra collected from five different normal and cancerous prostate cell lines. Additionally, we tested the influence of added spectral noise. Overall, we concluded that, for the data analyzed here, baseline correction had the biggest impact on data structure and on the performance of PLS-DA and PLSR; therefore, particular attention should be given to this step of data preprocessing.
Keywords: EMSC; FT-IR spectroscopy; PLS-DA; PLSR; preprocessing; prostate cancer cells
Year: 2021 PMID: 33924045 PMCID: PMC8073124 DOI: 10.3390/cells10040953
Source DB: PubMed Journal: Cells ISSN: 2073-4409 Impact factor: 6.600
Figure 1. A scheme of the preprocessing steps: five prostate cell lines were imaged with FT-IR, then white noise was added to the original spectra. Raw data and data with added noise were preprocessed in the following order: denoising → baseline correction → normalization. Each method from one preprocessing step was combined with each method from the remaining two preprocessing steps. Taking into account the above and the number of parameters adjusted for each method, the number of combinations (data sets) was equal to 2835. All of these combinations were then used to create a classifier discriminating cell lines and a regression model of class assignments, giving more detail about the relative importance of different preprocessing factors and parameters.
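The combination scheme in the Figure 1 caption amounts to a Cartesian product over the three preprocessing steps. The following is a minimal Python sketch under stated assumptions, not the authors' code: the function name `enumerate_pipelines` is hypothetical, and the small grids reuse only a few method and parameter values that appear in the table of this record, so they do not reproduce the 2835 combinations of the study.

```python
from itertools import product

# Toy grids for illustration only; the study's full parameter grids are
# not listed in this record.
DENOISING = [
    ("fourier", 100),              # frame
    ("eilers", 6),                 # lambda
    ("savitzky_golay", (2, 15)),   # (polynomial order, frame)
    ("savitzky_golay", (3, 15)),
]
BASELINE = [
    ("second_derivative", (2, 27)),  # (polynomial order, frame)
    ("als", (6, 0.1)),               # (lambda, p)
]
NORMALIZATION = ["constant", "pqn"]


def enumerate_pipelines(denoising, baseline, normalization):
    """Return every denoising -> baseline -> normalization combination."""
    return [
        {"denoising": d, "baseline": b, "normalization": n}
        for d, b, n in product(denoising, baseline, normalization)
    ]


pipelines = enumerate_pipelines(DENOISING, BASELINE, NORMALIZATION)
print(len(pipelines))  # 4 * 2 * 2 = 16 combinations for these toy grids
```

Each dictionary describes one data set to be preprocessed and modeled; the order of the keys mirrors the fixed order of the steps.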
Figure 2. Principal component analysis exploration of the original data structure after application of three preprocessing steps. Each point corresponds to an individual spectrum from the original dataset to which a unique combination of the three steps was applied. Panels a, b, and c present the same PC projection but are colored according to a single preprocessing type: (a) denoising, (b) baseline correction, and (c) normalization. For clarity, the spectra processed with the DER baseline correction method in combination with other preprocessing steps are marked with a circle.
Figure 3. Results of PLS-DA classification: accuracy values for each combination of methods (marked with dots, with yellow corresponding to a high number of models and blue to a low number) calculated for up to 30 LVs for (a) original data and (b) data with added noise. The red circle indicates the best combination of methods, which gave very high internal accuracy with the smallest reasonable number of LVs. Comparison of internal and external validation accuracy values (for the best number of LVs chosen based on internal validation) for (c) original data and (d) data with added noise. The green circle indicates the worst combination of methods, which gave very high internal validation accuracy and low external validation accuracy. (e) Number of combinations giving accuracy higher than 0.8 for both internal and external validation (marked with a red frame in the right panel) for original and noise-added data, divided into baseline/normalization categories.
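The caption refers to the "smallest reasonable" number of LVs without stating the selection rule. One common heuristic, shown here purely as an assumption and not as the authors' criterion, is to take the fewest LVs whose internal accuracy lies within a small tolerance of the maximum. The function name, the tolerance, and the accuracy curve below are all hypothetical.

```python
def smallest_reasonable_lvs(accuracies, tolerance=0.01):
    """Return the smallest LV count (1-based) within `tolerance` of the best accuracy.

    `accuracies[i]` is the internal-validation accuracy of a model with i + 1 LVs.
    """
    if not accuracies:
        raise ValueError("accuracies must not be empty")
    best = max(accuracies)
    for n_lvs, acc in enumerate(accuracies, start=1):
        if acc >= best - tolerance:
            return n_lvs


# Made-up accuracy curve for illustration only.
curve = [0.55, 0.71, 0.86, 0.935, 0.94, 0.94, 0.93]
chosen = smallest_reasonable_lvs(curve)
print(chosen)
```

A looser tolerance favors simpler models at a small cost in internal accuracy; with a tolerance of zero the rule reduces to picking the first LV count that reaches the maximum.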
The best combination of methods gave very high internal accuracy with the smallest reasonable LVs (marked with the red circle in the left panel in Figure 3). Methods giving the best external validation values for original and raw data were marked with green color.
| Denoising | Adjusted Parameter | Value | Baseline | Adjusted Parameter | Value | Normalization | Internal | External |
|---|---|---|---|---|---|---|---|---|
| Fourier | frame | 100 | Second derivative | Poly, frame | 2, 27 | CONSTANT | 0.94 | 0.86 |
| Fourier | frame | 100 | Second derivative | Poly, frame | 2, 29 | CONSTANT | 0.94 | 0.79 |
| Fourier | frame | 100 | Second derivative | Poly, frame | 3, 27 | CONSTANT | 0.94 | 0.86 |
| Fourier | frame | 100 | Second derivative | Poly, frame | 3, 29 | CONSTANT | 0.94 | 0.79 |
| Fourier | frame | 100 | Second derivative | Poly, frame | 2, 23 | TSN | 0.94 | 0.86 |
| Fourier | frame | 100 | Second derivative | Poly, frame | 2, 25 | TSN | 0.94 | 0.86 |
| Fourier | frame | 100 | Second derivative | Poly, frame | 2, 27 | TSN | 0.94 | 0.86 |
| Fourier | frame | 100 | Second derivative | Poly, frame | 2, 29 | TSN | 0.94 | 0.86 |
| Fourier | frame | 100 | Second derivative | Poly, frame | 3, 23 | TSN | 0.94 | 0.86 |
| Fourier | frame | 100 | Second derivative | Poly, frame | 3, 25 | TSN | 0.94 | 0.86 |
| Fourier | frame | 100 | Second derivative | Poly, frame | 3, 27 | TSN | 0.94 | 0.86 |
| Fourier | frame | 100 | Second derivative | Poly, frame | 3, 29 | TSN | 0.94 | 0.86 |
| | | | | | | | | |
| Fourier | frame | 140 | ALS | λ, p | 6, 0.1 | CONSTANT | 0.91 | 0.86 |
| Fourier | frame | 220 | ALS | λ, p | 6, 0.1 | CONSTANT | 0.91 | 0.93 |
| Eilers | λ | 6 | ALS | λ, p | 6, 0.1 | CONSTANT | 0.91 | 0.86 |
| SavitzkyG | Poly, frame | 2, 15 | ALS | λ, p | 6, 0.1 | CONSTANT | 0.91 | 0.93 |
| SavitzkyG | Poly, frame | 2, 17 | ALS | λ, p | 6, 0.1 | CONSTANT | 0.91 | 0.86 |
| SavitzkyG | Poly, frame | 2, 19 | ALS | λ, p | 6, 0.1 | CONSTANT | 0.91 | 0.93 |
| SavitzkyG | Poly, frame | 2, 21 | ALS | λ, p | 6, 0.1 | CONSTANT | 0.91 | 0.93 |
| SavitzkyG | Poly, frame | 2, 23 | ALS | λ, p | 6, 0.1 | CONSTANT | 0.91 | 0.93 |
| SavitzkyG | Poly, frame | 3, 15 | ALS | λ, p | 6, 0.1 | CONSTANT | 0.91 | 0.93 |
| SavitzkyG | Poly, frame | 3, 17 | ALS | λ, p | 6, 0.1 | CONSTANT | 0.91 | 0.86 |
| SavitzkyG | Poly, frame | 3, 19 | ALS | λ, p | 6, 0.1 | CONSTANT | 0.91 | 0.93 |
| SavitzkyG | Poly, frame | 3, 21 | ALS | λ, p | 6, 0.1 | CONSTANT | 0.91 | 0.93 |
| SavitzkyG | Poly, frame | 3, 23 | ALS | λ, p | 6, 0.1 | CONSTANT | 0.91 | 0.93 |
Figure 4. Comparison of internal and external accuracy values for the best number of LVs for (a) original data and (b) noise-added data, for the baseline correction methods DER, RMie-EMSC, and ME-EMSC.
Figure 5. Comparison of RMSECV and RMSEP values for (a) original data and (b) noise-added data. Each dot on the plot represents the value for one combination of preprocessing methods. (c) Histogram of the 10% of all combinations giving the lowest RMSECV and RMSEP for the original and noise-added data.
Figure 6. Comparison of mean PLSR RMSECV and RMSEP errors calculated for all methods at each preprocessing step (the model with the optimal number of LVs allowed by CV was chosen) for (a) original data and (b) data with added noise. Error bars show the standard deviation over all models that used a given method (from the current preprocessing step) in combination with other methods (from the other preprocessing steps).
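Both error measures in Figures 5 and 6 are root-mean-square errors. They differ only in where the predictions come from: RMSECV uses predictions for samples held out during cross-validation, whereas RMSEP uses predictions for an independent test set. A minimal sketch of the shared formula follows; the helper name `rmse` and all numbers are illustrative assumptions, not values from the study.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between reference and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative values only: class assignments encoded as numbers.
y_cv_true, y_cv_pred = [1, 2, 3, 4], [1.5, 2.0, 2.5, 4.0]
rmsecv = rmse(y_cv_true, y_cv_pred)     # predictions from cross-validation folds

y_test_true, y_test_pred = [1, 5], [2.0, 4.0]
rmsep = rmse(y_test_true, y_test_pred)  # predictions on an external test set
```

A large gap between RMSECV and RMSEP for one preprocessing combination suggests that the internally validated model does not generalize to new samples.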