Nadav Bar1, Bahareh Nikparvar2, Naresh Doni Jayavelu3, Fabienne Krystin Roessler2.
Abstract
BACKGROUND: Biological data suffer from noise that is inherent in the measurements. This is particularly true for time-series gene expression measurements. Nevertheless, in order to explore cellular dynamics, scientists employ such noisy measurements in predictive and clustering tools. However, noisy data can not only obscure the genes' temporal patterns; applying predictive and clustering tools to noisy data may also yield inconsistent, and potentially incorrect, results.
Keywords: Fourier transform; Gene expression data; Network component analysis; Noise; Time-Series data; k-means Clustering
Year: 2022 PMID: 35945515 PMCID: PMC9364503 DOI: 10.1186/s12859-022-04839-z
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Fig. 1 Constrained Fourier approximation fits the gene expression data accurately. A Two examples of true signals (dotted curve), noisy data (’*’), Fourier approximation (solid) and the spline approximation (red dashed) for frequencies of (left) and (right). Spline approximations follow the noise. B The root mean squared error (RMSE) is significantly (two-sample t-test, , ) lower for the Fourier approximation than for the spline. Furthermore, C 85% of the trials were most accurately approximated (lowest RMSE) by Fourier with first and second harmonics. D Frequency analysis of the Fourier approximations: The error is low for frequencies , but increases with frequency. The RMSE of the spline approximation (red) is higher, with its mean (mean RMSE over all frequencies) significantly () higher than that of the Fourier approximation. A sustained stimulus, an impulse and a wave-like response with frequencies , and , respectively, are depicted above. E Deterioration of the noise reduction methods (expressed by the normalized sum of SSE) as the noise variance of the gene expression measurements increases. The Fourier algorithm performs better than its counterpart for all variances tested
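The idea behind panels A and B can be sketched in a few lines: fit a Fourier series with first and second harmonics to noisy samples and compare its RMSE against the true signal. This is only an illustration under stated assumptions, not the authors' constrained algorithm: the samples are assumed to be uniformly spaced over exactly one period of the fundamental (so the coefficients follow from simple projections), no constraints are applied, and the function names `fourier_fit`, `fourier_eval` and `rmse` are hypothetical.

```python
import math
import random


def fourier_fit(y, n_harmonics=2):
    """Least-squares Fourier coefficients for uniformly spaced samples.

    Assumption (illustration only): the samples span exactly one period of
    the fundamental, so the sinusoids are orthogonal on the sampling grid
    and each coefficient is a plain projection.
    """
    n = len(y)
    if n <= 2 * n_harmonics:
        raise ValueError("need more than 2 * n_harmonics samples")
    a0 = sum(y) / n
    coeffs = []
    for k in range(1, n_harmonics + 1):
        a_k = 2.0 / n * sum(v * math.cos(2 * math.pi * k * i / n) for i, v in enumerate(y))
        b_k = 2.0 / n * sum(v * math.sin(2 * math.pi * k * i / n) for i, v in enumerate(y))
        coeffs.append((a_k, b_k))
    return a0, coeffs


def fourier_eval(a0, coeffs, n):
    """Evaluate the fitted series on the same n-point grid."""
    return [
        a0 + sum(
            a * math.cos(2 * math.pi * k * i / n) + b * math.sin(2 * math.pi * k * i / n)
            for k, (a, b) in enumerate(coeffs, start=1)
        )
        for i in range(n)
    ]


def rmse(x, y):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)) / len(x))


# Synthetic demo: a two-harmonic signal plus Gaussian noise (made-up values).
n = 16
true_signal = [
    1.0 + 2.0 * math.cos(2 * math.pi * i / n) + 0.5 * math.sin(4 * math.pi * i / n)
    for i in range(n)
]
rng = random.Random(0)
noisy = [v + rng.gauss(0.0, 0.3) for v in true_signal]
a0, coeffs = fourier_fit(noisy)
denoised = fourier_eval(a0, coeffs, n)
print("RMSE noisy:   ", rmse(noisy, true_signal))
print("RMSE denoised:", rmse(denoised, true_signal))
```

Because the fit is an orthogonal projection onto a low-dimensional space that contains the true signal in this demo, the de-noised RMSE cannot exceed the noisy RMSE here. Real expression data would not satisfy that assumption exactly.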
Comparison of mean correlation coefficients between noisy gene profiles and de-noised gene profiles using ImpulseDE or our constrained Fourier approximation
| Cluster | All time points: Noisy data | All time points: ImpulseDE | All time points: Constrained Fourier | “High noise” time points: Noisy data | “High noise” time points: ImpulseDE | “High noise” time points: Constrained Fourier |
|---|---|---|---|---|---|---|
| Cluster 1 | 0.97 | 0.99 | 0.99 | 0.68 | 0.82 | 0.91 |
| Cluster 2 | 0.94 | 0.98 | 0.99 | 0.73 | 0.90 | 0.97 |
| Cluster 3 | 0.97 | 0.99 | 1.00 | 0.80 | 0.95 | 0.99 |
| Cluster 4 | 0.97 | 0.65 | 0.96 | 0.79 | 0.32 | 0.90 |
| Cluster 5 | 0.93 | 0.98 | 0.99 | 0.58 | 0.82 | 0.91 |
| Cluster 6 | 0.98 | 0.99 | 0.95 | 0.92 | 0.96 | 0.98 |
Standard deviation is shown in brackets.
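A mean correlation coefficient of the kind tabulated above can be computed as follows. This is a minimal sketch that assumes Pearson correlation and a simple pairing of profiles; the excerpt does not say which profiles are paired or which correlation measure was used, and `pearson` and `mean_correlation` are hypothetical names.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (sx * sy)


def mean_correlation(profiles_a, profiles_b):
    """Average the pairwise correlation of matching profiles (assumed pairing)."""
    pairs = list(zip(profiles_a, profiles_b))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)
```

Restricting both profiles to a subset of time points before calling `pearson` would give a value analogous to the “High noise” columns, assuming those columns are computed on a subset of samples.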
Fig. 2 Results of k-means clustering of raw (gray) and de-noised (red) synthetic expression data. A, B Six synthetic clusters, from each of which we generated 1000 signals with random additive noise of variance (A) and (B). Fourier approximation of de-noised data that was clustered (red dashed) and Fourier approximation of raw data that was clustered (gray dashed). C, D Monte Carlo of 1000 k-means simulations (see “Methods” section) on the de-noised and raw signals. The histograms describe the distribution of the SSEs for the raw (gray) and the de-noised (red) data. The mean SSE of Fourier-treated genes () was significantly lower (t-test: ) than the mean SSE of the untreated genes (). The difference in low-noise signals (here shown ) was also statistically significant (t-test: )
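The clustering experiment described in this caption can be approximated with a small k-means routine and an SSE score against the known cluster profiles. The sketch below is an assumption-laden illustration: it uses plain Lloyd iterations with random initial centroids, and it scores a run by the squared distance from each true profile to its nearest recovered centroid. The paper's exact SSE definition is given in its “Methods” section, which is not part of this excerpt, and the names `kmeans` and `centroid_sse` are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means; initial centroids are k randomly chosen points."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the previous centroid if a cluster empties
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, labels


def centroid_sse(centroids, true_profiles):
    """Sum over true profiles of the squared distance to the nearest centroid."""
    return sum(min(sq_dist(t, c) for c in centroids) for t in true_profiles)


# Synthetic demo with two made-up cluster profiles and additive Gaussian noise.
true_profiles = [[0.0, 1.0, 2.0, 1.0, 0.0], [2.0, 1.0, 0.0, 1.0, 2.0]]
rng = random.Random(1)
signals = [
    [v + rng.gauss(0.0, 0.2) for v in profile]
    for profile in true_profiles
    for _ in range(20)
]
centroids, labels = kmeans(signals, k=2)
print("SSE to true profiles:", centroid_sse(centroids, true_profiles))
```

Repeating the call with different `seed` values, once on raw and once on de-noised signals, would give two SSE distributions that can be compared in the spirit of panels C and D.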
Fig. 3 Analysis of k-means clustering of raw (gray) and de-noised (red) synthetic expression data. A, B Total size (of all six clusters) of correlation and SSE between the raw signals and the true signals (gray) and between the de-noised signals and the true signals (red). C–F analyze the performance of the clustering as a function of the number of data samples: C, D Total mean correlation and error (SSE) of expression signals from clusters of raw (gray) and de-noised (red) data, as a function of sampling frequency (linearly distributed). The difference is not statistically significant for more than 7 sample points (two-sample t-test). E, F Mean correlation and SSE of clustering of raw (gray) and de-noised (red) data, as a function of sampling frequency with a logarithmic time scale (see “Methods” section). The improvement in the clustering performance was statistically significant for 5–7 sample points. Most importantly, above 7–8 samples the improvement is not statistically significant
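The two sampling schemes compared in panels C–F can be illustrated by generating linearly and logarithmically spaced time points. This is a sketch only: the actual sampling grids are defined in the paper's “Methods” section, so the start and end times below are made up and the helper names `linear_times` and `log_times` are hypothetical.

```python
def linear_times(t_end, n):
    """n evenly spaced time points from 0 to t_end (inclusive)."""
    if n < 2:
        raise ValueError("need at least two time points")
    return [t_end * i / (n - 1) for i in range(n)]


def log_times(t_start, t_end, n):
    """n time points from t_start to t_end with a constant ratio between neighbours."""
    if n < 2 or t_start <= 0:
        raise ValueError("need n >= 2 and a positive start time")
    ratio = (t_end / t_start) ** (1.0 / (n - 1))
    return [t_start * ratio ** i for i in range(n)]


print(linear_times(10.0, 6))
print(log_times(1.0, 100.0, 5))
```

A logarithmic grid places more samples early in the response, which is one plausible reason for the two schemes to behave differently at low sample counts.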
Fig. 4 Post-processing with NCA performs better when data was treated with noise reduction (NR). A When 10 transcription factor (TF) signals were reconstructed from 3 replicates of data, the correlation between the replicates was always higher when the data was first treated with our constrained Fourier estimation. Here we show noise variance . Other variances and the GNCA-r are shown in Additional file 1: Fig. S5. Numbers beside the columns are the correlations of the 3 replicates from treated data. B, C GNCA-r reconstructed the 3 replicates from the pre-treated (solid) data significantly better () than from the noisy data (dashed). Here we show the temporal reconstruction of two arbitrary TFs
Fig. 5 Noise reduction of Listeria monocytogenes RNA-sequencing differentially expressed data. A The variation in the mRNA counts between five replicates of the important early-active regulator genes lexA and recA was significantly reduced for the first 1 h after exposure to stimuli, as reflected by B the low variance of the same genes at these early samples. Black triangles and red stars represent mean values for untreated and de-noised data, respectively. Shaded areas around mean values represent standard deviation
Fig. 6 Post-analysis of mouse T cell expression data. A The algorithm estimated the data with a 2-harmonic Fourier approximation. Shown is the mean variance of the estimated frequency between the 3 replicates of each gene (log scale). 98% of the distribution had variance less than 0.01, indicating similar estimated frequencies between experiment replicates. B Selected TF activity predictions (using NCA) of noisy data (dashed) and Fourier de-noised data (solid). Replicates of Fourier estimated data are closely correlated (data on min and max cross correlation are given in Table 1). C Over 90% (29/32) of the TF activities had closer correlation (percent) with Fourier de-noised data than with noisy data. D Noisy data had exclusively higher mean angle between the replicates than the de-noised data, indicating that replicates of NCA predictions with de-noised data are closer to linearly dependent, and are closely related
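The angle criterion in panel D can be illustrated by treating each replicate's predicted TF activity as a vector and averaging the pairwise angles. This sketch assumes the standard angle between two vectors (arccosine of the normalized dot product); the excerpt does not give the paper's exact definition, and `angle_deg` and `mean_pairwise_angle` are hypothetical names.

```python
import math
from itertools import combinations


def angle_deg(u, v):
    """Angle in degrees between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to [-1, 1] to guard against rounding just outside acos's domain.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))


def mean_pairwise_angle(replicates):
    """Mean angle over all pairs of replicate activity vectors."""
    pairs = list(combinations(replicates, 2))
    if not pairs:
        raise ValueError("need at least two replicates")
    return sum(angle_deg(u, v) for u, v in pairs) / len(pairs)
```

An angle near zero means two replicates are nearly proportional to each other, which matches the caption's reading of a small mean angle as close linear dependence.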
Cross correlation between three replicates of predictions of four TFs activities
| TF | Treated data (max) | Treated data (min) | Noisy data (max) | Noisy data (min) |
|---|---|---|---|---|
| MYB | 0.86 | 0.48 | 0.68 | 0.29 |
| YBX1 | 0.92 | 0.88 | 0.78 | 0.52 |
| TP53 | 0.91 | 0.61 | 0.66 | 0.52 |
| FOXA1 | 0.80 | 0.38 | 0.78 | 0.15 |
Fig. 7 Post-analysis (k-means clustering) of mouse T cell expression data. A k-means clustering accuracy (Silhouette, see “Methods” section) of de-noised data (red) and raw data (gray) as a function of the number of clusters tested from the real data. The difference was statistically significant (). B % improvement of k-means clustering of the de-noised data over the raw data. C The distance to centroids within clusters (calculated by within-cluster sums of point-to-centroid distances, see “Methods” section) of the de-noised data (red) and the raw data (gray) as a function of the number of clusters. The de-noised data produced more centered clusters (results significantly different ). D % improvement of the distance to centroids by using de-noised data
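The compactness measure in panel C, within-cluster sums of point-to-centroid distances, can be sketched as below. The sketch assumes squared Euclidean distance and recomputes each centroid as the mean of its members; the paper's distance metric is specified in its “Methods” section rather than here, and `within_cluster_sums` is a hypothetical name.

```python
def within_cluster_sums(points, labels):
    """Per-cluster sum of squared Euclidean distances to the cluster centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    sums = {}
    for label, members in clusters.items():
        # Centroid = coordinate-wise mean of the cluster members.
        centroid = [sum(c) / len(members) for c in zip(*members)]
        sums[label] = sum(
            sum((a - b) ** 2 for a, b in zip(p, centroid)) for p in members
        )
    return sums


# Made-up example: two points in cluster 0, one point in cluster 1.
print(within_cluster_sums([[0, 0], [2, 0], [10, 0]], [0, 0, 1]))
```

Summing the returned values gives a single compactness score per clustering, which can then be compared between raw and de-noised data for each number of clusters.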