
Real-Time Detection of State Transitions in Stochastic Signals from Biological Systems.

Max H Bergkamp^1,2, Leo J van IJzendoorn^3,2, Menno W J Prins^1,3,2,4.

Abstract

Robust analysis of signals from stochastic biomolecular processes is critical for understanding the dynamics of biological systems. Measured signals typically show multiple states with heterogeneities and a wide range of state lifetimes. Here, we present an algorithm for robust detection of state transitions in experimental time traces where the properties of the underlying states are a priori unknown. The method implements a maximum-likelihood approach to fit models in neighboring windows of data points. Multiple windows are combined to achieve a high sensitivity for state transitions with a wide range of lifetimes. The proposed maximum-likelihood multiple-windows change point detection (MM-CPD) algorithm is computationally extremely efficient and enables real-time signal analysis. By analyzing both simulated and experimental data, we demonstrate that the algorithm provides accurate change point detection in time traces with multiple heterogeneous states that are a priori unknown. A high sensitivity for a wide range of state lifetimes is achieved.


Year: 2021 PMID: 34278158 PMCID: PMC8280633 DOI: 10.1021/acsomega.1c02498

Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343

Introduction

Time-dependencies observed in biological systems are at the most basic level controlled by the dynamics of biomolecular processes. Examples are the binding and unbinding of ligands, conformational switching of proteins, addition and removal of chemical groups, modulation of enzyme activity, etc. At the level of individual molecules, transitions between states are stochastic and described by transition probabilities. The stochastic nature of molecules and their interactions carries over into stochastic properties at higher biological levels. For example, intracellular transport by the molecular motor protein kinesin is dependent on ATP binding and conformational changes of kinesin.[1] Another example is bacterial chemotaxis where run-and-tumble motion leads to a biased random walk in a preferred chemical gradient direction. The bacterial motion, controlled by flagellar motors, depends on single-molecule processes including stochastic ligand binding to receptors.[2] Dedicated experimental techniques have been developed to study the dynamics of stochastic biological systems, such as single-particle tracking,[3] optical tweezers,[4] magnetic tweezers,[4] fluorescence resonance energy transfer,[5] super-resolution microscopy,[6] and nanopores.[7] A critical aspect of the experiments is the data analysis. Here, discrete change points and corresponding state transitions need to be accurately and reliably detected in noisy time-dependent data. Furthermore, the experiments typically exhibit multiple heterogeneous states of which the properties are not a priori known, caused by the complexity of biological systems and time-dependencies of the constituent biomolecules. In addition, change points are intrinsically difficult to recognize when states have time-correlated properties that interfere with state change detection.
Several algorithms have been developed for change point detection (CPD), e.g., thresholding,[8] hidden Markov models,[9] and maximum-likelihood-based methods.[10−12] However, the algorithms typically require input parameters that sensitively depend on the experimental conditions, and the implementations of the algorithms are generally computationally demanding. Thus, there is a need to develop CPD algorithms that require minimal input parameters for generalizability and robustness and that are computationally efficient for enabling real-time data analysis. In this work, we present a CPD algorithm that can accurately detect state transitions in experimental time traces from multistate biological systems with a priori unknown state properties. The algorithm is generalizable and computationally efficient. Change points are detected by calculating the change in log-likelihood of time-shifted neighboring windows, where multiple window sizes are combined to achieve sensitivity for a wide range of state lifetimes. The performance of the maximum-likelihood multiple-windows change point detection (MM-CPD) is demonstrated using simulated and experimental time traces from a biomolecular sensing technology with single-molecule resolution. Finally, we demonstrate the improved sensitivity and speed of MM-CPD compared to alternative methods reported in the scientific literature.

Change Point Detection Methods

Data Analysis Challenge

Figure 1 illustrates the data analysis challenge for a biological system with multiple heterogeneous states. A simulated time trace is shown of a system with three states, where the states have overlapping probability distributions, different autocorrelations, and different lifetime distributions. The overlap complicates state identification, the time-correlation influences the time that is required to recognize a state, and the presence of a range of lifetimes complicates the detection of short-lived states, making the data analysis prone to misidentifications of states and state transitions.

Figure 1

Change point detection (CPD) in a biological system with multiple heterogeneous states. Illustration of the scientific problem using simulated states and time traces of a model system with three mutually exclusive states. (a) The states have overlapping probability distributions of coordinate x. (b) The traces x(t) of each state have different autocorrelation properties (0 to 10 time lag data points). (c) The states have different lifetime distributions, represented as single-exponential distributions with different characteristic lifetimes. (d) Simulated time trace x(t), where the color indicates the state of the system. The sampling rate of the time trace is 100 Hz. (e) State transitions detected by a CPD algorithm (example). The black vertical lines indicate the detected state transitions. Not all state transitions are true positives (TP), i.e., correctly detected CPs, since both false positives (FP) and false negatives (FN) are observed.

CPD Algorithm Development

Figure 2 shows a flow chart that illustrates how a CPD algorithm is developed and validated using both simulated and experimental data. Simulated data gives the opportunity to quantitatively evaluate a CPD algorithm by comparing detected change points with the known change points in the simulation. In order to generate simulated data that is a good representation of an experimental system, a simulation model needs to be developed, based on knowledge of the system, biophysical equations, and parameters derived from experimental data. By performing the quantitative evaluation, algorithm parameters can be tuned to optimize the CPD algorithm. In parallel, the algorithm is tested on experimental data. Here, the locations of true change points are not known. Therefore, quantitative evaluations do not focus on individual change points but rather on scaling relationships between experimental input parameters and extracted physical output parameters, such as lifetime distributions.

Figure 2

Flow chart for CPD algorithm development using simulated data and experimental data. The arrows indicate the information flow. Analysis of simulated data allows a quantitative evaluation of the algorithm by comparing detected change points with the true change points in the simulation. The CPD algorithm is evaluated on experimental data by comparing scaling relationships between experimental input parameters and physical output parameters. The dashed arrows indicate how the simulation parameters are adapted based on experimental parameters and results.

Quantitative Evaluation of CPD with Simulated Data

Figure 1e shows the detected state transitions after CPD. Detected CPs can be either true positives (TP), i.e., detected change points that correspond to a true change point, or false positives (FP). Missed state transitions are categorized as false negatives (FN). The CPD performance can be evaluated by calculating the F1-score,[13] F1 = 2TP/(2TP + FP + FN), which is equal to the maximum value of 1 if no false positives and no false negatives are present.
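The evaluation described above can be sketched in a few lines of Python. The text does not state how a detected change point is matched to a true one, so the tolerance `tol` (in data points), the greedy nearest-neighbor matching, and the function names below are assumptions made for illustration only.

```python
def match_change_points(detected, true, tol):
    """Count TP, FP, and FN with a one-to-one greedy matching.

    A detected change point is a true positive if an unmatched true
    change point lies within `tol` data points (assumed criterion).
    """
    unmatched = sorted(true)
    tp = 0
    for cp in sorted(detected):
        if not unmatched:
            break
        # Closest true change point that has not been matched yet.
        nearest = min(unmatched, key=lambda t: abs(t - cp))
        if abs(nearest - cp) <= tol:
            unmatched.remove(nearest)
            tp += 1
    fp = len(detected) - tp   # detections without a true counterpart
    fn = len(true) - tp       # true transitions that were missed
    return tp, fp, fn


def f1_score(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); returns 1.0 when there is nothing to detect."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0
```

For example, `match_change_points([100, 205, 400], [102, 200, 300], tol=10)` counts two true positives, one false positive, and one false negative, which gives an F1-score of 2/3.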

Maximum-Likelihood Multiple-Windows Change Point Detection (MM-CPD)

The main principle of MM-CPD is to calculate the probability of a change in distribution as a function of time. First, a distribution is assumed, and a maximum-likelihood approach is applied to calculate the distribution parameters. A distribution with a mean μ, standard deviation σ, and nearest-neighbor coupling parameter ϵ is assumed. This model is an approximation of an Ornstein–Uhlenbeck process.[12] For convenience, we introduce θ = (μ, σ, ϵ), which describes the model parameters. The model defines the probability density function of a variable x and, from it, the log-likelihood of a time trace of N data points. Maximization of this log-likelihood with respect to the model parameters provides expressions for the maximum-likelihood estimators (MLE), μ̂, ϵ̂, and σ̂. Supporting Information Section 1 provides a detailed derivation of the MLEs.

The null hypothesis assumes that the data points in two neighboring windows A and B of equal window size w originate from a single distribution with one shared set of parameters. The alternative hypothesis assumes that the data points in windows A and B come from different distributions, each with its own set of parameters. For both hypotheses, the log-likelihood can be calculated with the MLEs corresponding to the hypothesis. Calculating the log-likelihood ratio between the two hypotheses provides an expression for the likelihood of a change point, which can be expressed in terms of the individual log-likelihood functions of window A, window B, and the combined window. We refer to R as the response that relates to the probability of a change in distribution. In multidimensional systems, responses from different traces can be added. For example, the response of a two-dimensional system with traces of variables x and y is given by the sum of the responses of the x and y traces.

Figure 3 shows the time trace of the response for simulated two-dimensional time traces with multiple states. Panels 3a–c show the state and x and y time traces of the simulated data, respectively. The response is plotted as a function of time for window sizes of 20 and 80 data points (Figure 3d,e). The response time trace is a measure for the probability of a change point as a function of time.
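As a rough illustration of the response calculation, the sketch below assumes a Gaussian model in which each data point depends linearly on its predecessor, x_i = μ + ϵ(x_{i−1} − μ) + noise with standard deviation σ. This parameterization, the conditioning on the first point of each window, the pooling used for the null hypothesis, and all function names are assumptions for illustration; the exact expressions of the paper are derived in its Supporting Information Section 1 and may differ, for instance by a constant factor in R.

```python
import numpy as np


def max_loglik(prev, cur):
    """Maximized Gaussian log-likelihood of `cur` given `prev`.

    For Gaussian noise, the least-squares line through (prev, cur)
    is the maximum-likelihood fit of the assumed coupling model.
    """
    dp = prev - prev.mean()
    dc = cur - cur.mean()
    denom = np.dot(dp, dp)
    eps = np.dot(dp, dc) / denom if denom > 0 else 0.0  # coupling estimate
    resid = dc - eps * dp
    var = max(np.mean(resid ** 2), 1e-12)  # guard against zero variance
    return -0.5 * len(cur) * (np.log(2 * np.pi * var) + 1.0)


def response(x, t, w):
    """Log-likelihood ratio for windows A = x[t-w:t] and B = x[t:t+w]."""
    a, b = x[t - w:t], x[t:t + w]
    pa, ca = a[:-1], a[1:]   # (previous, current) pairs inside window A
    pb, cb = b[:-1], b[1:]   # (previous, current) pairs inside window B
    # Alternative hypothesis: separate parameters for A and B.
    ll_alt = max_loglik(pa, ca) + max_loglik(pb, cb)
    # Null hypothesis: one shared parameter set for the pooled pairs.
    ll_null = max_loglik(np.concatenate([pa, pb]), np.concatenate([ca, cb]))
    return ll_alt - ll_null


def response_trace(x, w):
    """Response as a function of time; zero where the windows do not fit."""
    x = np.asarray(x, dtype=float)
    r = np.zeros(len(x))
    for t in range(w, len(x) - w + 1):
        r[t] = response(x, t, w)
    return r
```

For a two-dimensional system, the responses of the two traces would simply be summed, e.g. `response_trace(x, w) + response_trace(y, w)`. This loop recomputes every window from scratch and is therefore much slower than an implementation that reuses log-likelihoods between window sizes.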

Figure 3

Response time traces in a simulated two-dimensional system with multiple states. (a) Simulated state as a function of time. (b, c) Simulated x and y time traces (a sampling rate of 100 Hz). The green and red boxes indicate two neighboring windows A and B with a window size of 80 data points. (d, e) Response time traces for window sizes of 20 and 80 data points, respectively. The red vertical lines indicate the detected change points, which are identified by applying a threshold (green line).

The red vertical lines in Figure 3d,e indicate the change points that are identified from the response signals by applying a threshold. Change points that are detected with a small window size typically correspond to large changes in the distribution. Applying a larger window size allows detecting smaller changes in distribution, since a larger number of data points gives a more precise estimation of the model parameters. However, states with a short lifetime, typically shorter than the window size, will be missed with the larger window size. The threshold is an important parameter that can be tuned to minimize the number of FP and FN. Typically, a higher threshold results in an increased number of FN, whereas a lower threshold gives more FP.

The MM-CPD algorithm combines the detected change points from multiple windows. The minimum window size w_min and number of windows N are input parameters that define the list of window sizes. The smallest window sizes are derived from w_min, and the sizes of windows 3 to N follow by increments of a factor of two. The increment by a factor of two is chosen for the reason of computational efficiency, since the log-likelihood of the null hypothesis for a window size w can be used for calculating the log-likelihood of the alternative hypothesis for a window size of 2 · w (the combined window of 2 · w data points in the null hypothesis is itself one window of size 2 · w in the alternative hypothesis).

Combining the change points (CPs) from multiple windows is a sequential procedure in which the CPs of the smallest window size w_1 are all accepted. Change points from the next window size w are accepted only if the distance to already accepted CPs is larger than w. The process of combining change points from multiple window sizes is performed sequentially in an ascending order, i.e., from the smallest window size to the largest window size. The final step of the MM-CPD algorithm is to perform a test on accepted CPs with a minimum distance of w to neighboring CPs. These CPs are rejected if R is lower than the threshold. This step is important to reduce the number of FP in systems with long state lifetimes.
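The sequential combination of change points can be sketched as follows. The window sizes are passed in explicitly rather than generated from w_min and N, and the function name and the dictionary input format are hypothetical. The sketch also assumes that a candidate is compared only against CPs accepted from smaller window sizes, which is one possible reading of the rule above, and it omits the final rejection test.

```python
def combine_change_points(cps_by_window):
    """Merge change points detected with several window sizes.

    `cps_by_window` maps a window size w (in data points) to the list
    of change-point indices detected with that window size.
    """
    accepted = []
    for w in sorted(cps_by_window):       # ascending window size
        earlier = list(accepted)          # CPs accepted from smaller windows
        for cp in cps_by_window[w]:
            # All CPs of the smallest window pass, since `earlier` is empty.
            if all(abs(cp - a) > w for a in earlier):
                accepted.append(cp)
    return sorted(accepted)
```

For example, `combine_change_points({10: [100, 300], 20: [105, 500], 40: [320, 520, 800]})` keeps both CPs of the smallest window, rejects 105, 320, and 520 because each lies within one window size of an accepted CP, and accepts 500 and 800.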

Results

The MM-CPD algorithm discussed in the previous section is generally applicable for CPD in multistate biological systems. Here, we will apply the algorithm to analyze simulated and experimental time traces of a continuous biomolecular sensing technology with single-molecule resolution, called biosensing by particle mobility (BPM).[14−16] The biophysical sensing methodology relies on detecting changes in the motion of tethered particles induced by reversible affinity-based interactions between a biofunctionalized particle and surface. Different molecular sensor designs have been demonstrated, all exhibiting motion pattern changes due to state switching between bound and unbound states. In this paper, we will analyze experimental data of a BPM competition assay, which is schematically drawn in Figure 4a. Here, the surface is provided with antibodies and the particles with analyte analogues. When no analyte is present in solution, particles switch frequently between unbound and bound states due to the affinity between antibodies on the surface and analyte analogues on the particle. When analyte molecules are present in solution, these bind to the antibodies, thereby inhibiting binding of the particle to the surface and increasing the mean unbound state lifetime of the particles. BPM experiments provide time traces with the heterogeneities presented in Figure 1b–d, namely, multiple states, states with different autocorrelations, and a wide range of state lifetimes. The Supporting Information Section 2 provides a more detailed explanation of the BPM system and the BPM time trace simulations.

Figure 4

MM-CPD algorithm performance studied on simulated BPM data. (a) BPM competition assay system with a dsDNA tether, analogue molecules on the particle, and detection molecules on the surface.[15] Switching between bound and unbound states is induced by reversible interactions between the analogue molecules and detection molecules. Analyte molecules can also bind to the detection molecules. Therefore, the number of switches between bound and unbound states is reduced as a function of analyte concentration. Observed motion patterns in the unbound state are circular in shape. In contrast, motion patterns in the bound state can have different shapes and sizes. Images are obtained from Yan et al.[15] (b) Simulated state and x and y time traces of a single BPM particle. State 1 refers to the unbound state in (a) and states 2, 3, and 4 are bound states with different distributions. (c) Performance of MM-CPD with a single window. The F1-score is shown as a function of the threshold and window size w_min. (d) MM-CPD with multiple windows (N = 9) gives an improved F1-score compared to the single-window approach. (e, f) F1-score as a function of the mean bound state lifetime τB and mean unbound state lifetime τU for the MM-CPD and IB-CPD.

Application to Simulated BPM Data

BPM time traces with single-exponential distributed state lifetimes were simulated and analyzed with the MM-CPD algorithm. Figure 4b shows the x, y, and state time traces for a typical BPM particle with a single unbound state and multiple bound states. Figure 4c shows the F1-score as a function of the algorithm parameters, for the case of a single window size (N = 1). The F1-score clearly depends on the threshold and on the minimum window size w_min. Given a certain w_min, an optimal threshold can be found, where a lower threshold would lead to more FP and a higher threshold to more FN. The same data set was analyzed with a multiple-windows approach (N = 9), showing that a higher F1-score is achieved by combining the change points from multiple windows (Figure 4d). To test the performance of MM-CPD for a wide range of state lifetimes, time traces with mean bound state lifetime τB and mean unbound state lifetime τU ranging from 1 to 100 s were simulated. In order to be effective for a wide range of state lifetimes, the following algorithm settings were chosen: w_min = 10, N = 9, and threshold = 20. The simulations were used to compare the performance of the MM-CPD algorithm to the information-based CPD (IB-CPD) algorithm developed by Wiggins.[12] The latter is a CPD algorithm for biophysical systems with multiple heterogeneous states. Figure 4e,f shows the F1-score as a function of τB and τU for both algorithms. Both algorithms show a high F1-score for bound states with a long lifetime. For traces with short bound state lifetimes and long unbound state lifetimes, the MM-CPD algorithm gives a significantly higher F1-score compared to the IB-CPD algorithm.
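To give a feel for such test data, the sketch below generates a strongly simplified two-state trace with single-exponential distributed lifetimes and known change points. It is not the BPM simulation of the paper (described in its Supporting Information Section 2): the one-dimensional uncorrelated Gaussian signal, the standard deviations of 1.0 and 0.3, and the function name are assumptions chosen only to keep the example short.

```python
import numpy as np


def simulate_two_state_trace(tau_b, tau_u, duration, fs=100.0, seed=0):
    """Simulate a toy trace that alternates between unbound and bound.

    tau_b, tau_u : mean bound and unbound state lifetimes in seconds
    duration     : trace length in seconds
    fs           : sampling rate in Hz
    Returns the signal x, the state per data point (0 = unbound,
    1 = bound), and the list of true change-point indices.
    """
    rng = np.random.default_rng(seed)
    n = int(duration * fs)
    state = np.zeros(n, dtype=int)
    change_points = []
    i, current = 0, 0                      # start in the unbound state
    while i < n:
        mean_lifetime = tau_b if current == 1 else tau_u
        # Exponentially distributed lifetime, at least one data point long.
        length = max(1, int(round(rng.exponential(mean_lifetime) * fs)))
        state[i:i + length] = current
        i += length
        if i < n:
            change_points.append(i)        # first index of the next state
        current = 1 - current
    # Hypothetical signal: smaller spread of x in the bound state.
    sigma = np.where(state == 1, 0.3, 1.0)
    x = rng.normal(0.0, 1.0, n) * sigma
    return x, state, change_points
```

The returned `change_points` list is the ground truth against which detected change points can be scored.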

Application to Experimental BPM Data

In order to demonstrate the performance of the algorithm on experimental data, we analyzed BPM data from Yan et al.[15] This study describes a competition assay for continuous biosensing of single-stranded DNA (ssDNA) and small molecules. In short, addition of analyte molecules decreases the probability for a particle to transition from an unbound to a bound state. The average number of switching events between bound and unbound states per particle per unit of time is referred to as the activity. The activity is obtained by fitting a Poisson distribution on the number of detected change points per particle. Figure 5a shows the activity as a function of time in a BPM assay with varying ssDNA concentrations. The system behaves as expected, since an increase in analyte concentration leads to a decrease in activity.
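A minimal sketch of the activity calculation is given below. It assumes that the Poisson fit is a maximum-likelihood fit, for which the estimated mean equals the sample mean of the counts, and it uses a normal approximation for the 99% confidence interval. The fitting procedure actually used in the paper is not specified here, and the function name is hypothetical.

```python
import math


def activity(counts, duration_s):
    """Estimate the switching activity from change-point counts per particle.

    counts     : number of detected change points for each particle
    duration_s : length of the analyzed time interval in seconds
    Returns the activity (events per particle per second) and an
    approximate 99% confidence interval.
    """
    n = len(counts)
    lam = sum(counts) / n                    # Poisson MLE: the sample mean
    half = 2.576 * math.sqrt(lam / n)        # 99% normal approximation
    low = max(lam - half, 0.0) / duration_s
    high = (lam + half) / duration_s
    return lam / duration_s, (low, high)
```

For instance, counts of 4, 6, 5, and 5 change points in a 100 s interval give an activity of 0.05 events per particle per second.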

Figure 5

CPD algorithms applied to experimental BPM data for continuous monitoring of single-stranded DNA. Comparison between MM-CPD, IB-CPD, and SMD-CPD. (a) Top panel shows the concentration of the ssDNA analyte as a function of time. The bottom panel shows the detected switching activity as a function of time with MM-CPD, IB-CPD, and SMD-CPD. The error bars indicate the 99% confidence intervals of the mean of the fitted Poisson distribution. (b) Dose–response curves of ssDNA determined with the different CPD algorithms. The curves were fitted with a Hill equation with a Hill coefficient of 1. (c) Survival curves of the bound state lifetimes determined from the detected change points with the different CPD algorithms. (d) Average CPU time required to analyze an experimental time trace of 180 s (a sampling rate of 33.7 Hz) for the different CPD algorithms implemented in MATLAB.

The activity was calculated with MM-CPD, IB-CPD, and with the second-moment divergence CPD (SMD-CPD) algorithm that was applied by Yan et al. The latter SMD-CPD algorithm was developed specifically for BPM.[14] Clearly, more change points are detected with MM-CPD and IB-CPD compared to SMD-CPD. Figure 5b shows the dose–response curves for ssDNA determined with the three CPD algorithms.

A further comparison of the CPD algorithms was performed by extracting the distributions of state lifetimes from the measured data. In a BPM assay, the bound state lifetime is determined by the dissociation properties of the molecular interactions between the particle and the surface. In the regime of single-molecular interactions, the dissociation behavior should obey a first-order rate equation with a well-defined dissociation rate constant (k) and single-exponential distributed bound state lifetimes. Here, we studied to what extent the algorithms reproduce single-exponential distributed bound state lifetimes. We used experimental BPM traces, applied the three algorithms, and classified states between consecutive change points. States were classified as a bound state, unbound state, or an undefined state by analysis of the standard deviation in x and y (see Supporting Information Section 3). Figure 5c shows cumulative distribution functions, also called survival plots, of the bound state lifetimes determined with the three different CPD algorithms. A straight line in a survival plot with a linear x-axis and a logarithmic y-axis indicates a single-exponential distribution. The figure shows that MM-CPD and IB-CPD give straight lines, while the data from SMD-CPD deviates from a straight line. This indicates that SMD-CPD detects short and long state lifetimes with nonuniform sensitivity, resulting in an under-representation of short state lifetimes in the distribution. A comparative analysis of the bound and unbound state lifetime distributions for different concentrations can be found in the Supporting Information Section 5.1.

The computational efficiencies of the three CPD algorithms are reported in Figure 5d. The data shows the average CPU time required for the analysis of experimental time traces, for CPD algorithm implementations in MATLAB. The results show that the MM-CPD is between one and two orders of magnitude faster than the IB-CPD and SMD-CPD algorithms. The increased CPU time of IB-CPD can be attributed to the segmentation algorithm that is applied in IB-CPD, which is a reiterative procedure. This leads to an analysis time that is dependent on the number of detected change points; data in the Supporting Information Section 4.2 shows that the CPU time in IB-CPD depends on the state lifetimes.
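The straight-line criterion for survival plots can be illustrated with a short sketch. It builds an empirical survival curve from a list of state lifetimes and fits a line to its logarithm; for single-exponential lifetimes the negative slope estimates the dissociation rate constant. The function names and the unweighted least-squares fit are assumptions for illustration, not the analysis used in the paper.

```python
import numpy as np


def survival_curve(lifetimes):
    """Empirical survival function: fraction of states lasting at least t."""
    t = np.sort(np.asarray(lifetimes, dtype=float))
    s = 1.0 - np.arange(len(t)) / len(t)   # runs from 1 down to 1/n
    return t, s


def fit_dissociation_rate(lifetimes):
    """Slope of ln(survival) versus time, returned as a positive rate."""
    t, s = survival_curve(lifetimes)
    slope, _intercept = np.polyfit(t, np.log(s), 1)
    return -slope
```

Systematic curvature of ln(survival) versus time, such as a deficit of short lifetimes, would show up as a poor linear fit and a rate that depends on the fitted time range.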

Discussion

From a statistical perspective, a minimum number of data points is required to detect a specific state transition with a certain reliability. The required number of data points is dependent on the significance of the change in distribution but is also influenced by time-correlation properties of the states. Detection of short-lived states is therefore theoretically limited. High sensitivity for short-lived states is desirable and can be achieved by applying a window size close to this theoretical limit. However, in time traces with multiple heterogeneous states, transitions are unique and for each transition, a different window size might be optimal. By combining detected change points from multiple windows, a high sensitivity is achieved for a wide range of state lifetimes. In particular, small window sizes are sensitive for significant state transitions between short-lived states. On the other hand, larger window sizes are sensitive for less significant state transitions between states with longer lifetimes. This explains why the multiple-windows CPD approach performs significantly better compared to the single-window approach (Figure 4c,d).

Supporting Information Section 6 gives an overview of frequently applied methods for CPD in biological time traces. These include half-amplitude thresholding,[8] maximum-likelihood-based methods,[11] and hidden Markov models.[9] Most methods rely on a defined number of states as input and therefore are not suitable for systems with a priori undefined states as presented in this paper. The IB-CPD algorithm of Wiggins[12] can be applied without a priori knowledge of the states. In comparison to the IB-CPD algorithm, the MM-CPD algorithm is more likely to detect short-lived bound states, especially when neighboring unbound states are long. This effect was clearly visible in the F1-scores of the simulated data (Figure 4e,f). In addition, the lifetime analysis in experimental BPM data showed shorter bound state lifetimes for the MM-CPD approach compared to the IB-CPD approach (Figure 5c). The difference is most likely due to the binary segmentation algorithm that is applied in IB-CPD. The algorithm starts with determining the most likely change point in a time trace. If the candidate change point is significant, it is accepted, and the trace is split into two segments. The process is repeated until no new candidate change points are accepted. This method might have difficulties detecting a short-lived bound state between two long identical unbound states, since the distributions on both sides of a candidate change point are dominated by the unbound state. In contrast, the MM-CPD approach compares two windows of data points. Thus, a short-lived bound state will result in two peaks in the response trace of a window size close to the state lifetime. In general, short-lived states that are hidden in the distribution of neighboring identical states are more likely to be detected with MM-CPD compared to IB-CPD.

The MM-CPD algorithm is designed to be robust for traces with time-correlated data points, such as BPM. Similar to the IB-CPD, the MM-CPD assumes a Gaussian distribution with a nearest-neighbor coupling parameter. Including the nearest-neighbor coupling parameter in the distribution model significantly increases the robustness of CPD in traces with time-correlated data points (Supporting Information Section 4.1).

The MM-CPD has only three algorithm parameters. The following settings were found to give good results for BPM data with two input time traces: w_min = 10, N = 9, and threshold = 20. In some cases, it might be beneficial to further tune the algorithm parameters. For example, the minimum window size can be increased in systems with only long state lifetimes. Also, the number of windows could be decreased if higher computational efficiency is desired. The threshold level might also be tuned, especially if the number of input time traces is changed, since responses from different traces are added linearly. Furthermore, it should be considered that the choice of the threshold can have a large influence on the extracted physical parameters (Supporting Information Section 5.2). Therefore, performing a quantitative evaluation with both simulated and experimental data as shown in Figure 2 is important for validation when applying the MM-CPD in other biological applications.

The performance of the MM-CPD was discussed for BPM experiments. In BPM, the detection of short-lived states is theoretically limited by the autocorrelation times of the states. Further increasing the sampling rate does not significantly improve the CPD performance, but it does decrease the computational efficiency. In other biological processes with shorter autocorrelation times, the sampling rate can be increased to improve the sensitivity for short-lived states. When autocorrelation times are long in comparison to the intersampling time, it might be beneficial to increase the minimum window size.

One of the main advantages of the MM-CPD is that the algorithm is computationally very efficient. In contrast to alternative methods, the analysis time is independent of the state lifetimes (Supporting Information Section 4.2). This leads to a predictable total analysis time, which is crucial in real-time applications. We showed that the time to analyze a two-dimensional BPM time trace of 180 s (∼6000 data points) is only ∼0.02 s in MATLAB (Figure 5d). This speed enables real-time CPD of ∼10^4 BPM time traces in parallel. Even faster biological processes, having shorter autocorrelation times measured, e.g., at a sampling rate of 1 kHz, can be analyzed in real time with MM-CPD for hundreds of time traces in parallel. We expect that the MM-CPD algorithm will allow real-time analysis with a time lag that is of the order of magnitude of the largest window size.
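The throughput figures quoted above follow from simple arithmetic, worked out in the sketch below. It assumes that the CPU time scales linearly with the number of data points and that traces are processed one after another on a single core; both are assumptions, and the function name is hypothetical.

```python
def max_parallel_traces(trace_duration_s, cpu_time_s):
    """Number of traces that can be analyzed within their own duration."""
    return trace_duration_s / cpu_time_s


# BPM case from the text: a 180 s trace is analyzed in about 0.02 s.
bpm_traces = max_parallel_traces(180.0, 0.02)

# Scaling to a 1 kHz sampling rate, assuming cost is proportional to the
# number of data points (33.7 Hz in the BPM experiments).
fast_traces = bpm_traces * 33.7 / 1000.0

print(round(bpm_traces), round(fast_traces))
```

Under these assumptions, about 9000 BPM traces (of the order of 10^4) and roughly 300 traces sampled at 1 kHz could be processed in real time, consistent with the estimates in the text.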

Conclusions

We developed a CPD algorithm for rapid and reliable detection of state transitions in experimental time traces from multistate biological systems with a priori unknown state properties and a wide range of state lifetimes. The maximum-likelihood multiple-windows change point detection (MM-CPD) algorithm is controlled by three input parameters with a clear significance. The algorithm was validated using both simulated and experimental data of a biosensing method with single-molecule resolution. The MM-CPD shows an increased sensitivity for short-lived states that are hidden in the distribution of neighboring identical states, and the algorithm is between one and two orders of magnitude faster compared to alternative methods. The computational efficiency allows CPD in thousands of time traces in parallel, for a real-time readout and statistical analysis of stochastic signals from dynamic biological systems.

16 in total

Review 1. Responding to chemical gradients: bacterial chemotaxis.

Review 2. A review of progress in single particle tracking: from methods to biophysical insights.

3. Detection of intensity change points in time-resolved single-molecule measurements.

Review 4. Protein folding studied by single-molecule FRET.

Review 5. Single-molecule force spectroscopy: optical tweezers, magnetic tweezers and atomic force microscopy.

6. Diffusive hidden Markov model characterization of DNA looping dynamics in tethered particle experiments.

7. An information-based approach to change-point analysis with applications to biophysics and cell biology.

8. Quantitative analysis of DNA-looping kinetics from tethered particle motion experiments.

9. Continuous biomarker monitoring by particle mobility sensing with single molecule resolution.

10. Multiplexed Continuous Biosensing by Single-Molecule Encoded Nanoswitches.

1. Sensing Methodology for the Rapid Monitoring of Biomolecules at Low Concentrations over Long Time Spans.