Yuzi He1,2, Keith A Burghardt1, Kristina Lerman1.
Abstract
Change point detection has many practical applications, from anomaly detection in data to scene changes in robotics; however, finding changes in high-dimensional data is an ongoing challenge. We describe a self-training, model-agnostic framework to detect changes in arbitrarily complex data. The method consists of two steps. First, it labels data as before or after a candidate change point and trains a classifier to predict these labels. The accuracy of this classifier varies for different candidate change points. Second, by modeling how the accuracy changes, we can infer the true change point and the fraction of data affected by the change (a proxy for detection confidence). We demonstrate how our framework can achieve low bias over a wide range of conditions and detect changes in high-dimensional, noisy data more accurately than alternative methods. We use the framework to identify changes in real-world data and measure their effects using regression discontinuity designs, thereby uncovering potential natural experiments, such as the effect of pandemic lockdowns on air pollution and the effect of policy changes on performance and persistence in a learning platform. Our method opens new avenues for data-driven discovery due to its flexibility, accuracy and robustness in identifying changes in data.
Keywords: Causal effect; Change point detection; High-dimensional data; Regression discontinuity design
Year: 2022 PMID: 36090462 PMCID: PMC9440658 DOI: 10.1140/epjds/s13688-022-00361-7
Source DB: PubMed Journal: EPJ Data Sci ISSN: 2193-1127 Impact factor: 3.630
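The two-step procedure described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the synthetic data, the function names (`make_data`, `accuracy_deviation`, `estimate_change_point`), and the nearest-centroid classifier are all hypothetical stand-ins chosen to keep the example dependency-free, whereas the paper uses random forest and MLP classifiers. The final step here simply takes the candidate with the largest accuracy gain over a majority-class baseline; the paper instead fits an accuracy model, which is not reproduced here.

```python
import random

def make_data(n=2000, t0=0.5, shift=3.0, seed=0):
    """Synthetic stream (assumption): two Gaussian features whose mean jumps at time t0."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        t = i / n
        mu = shift if t >= t0 else 0.0
        data.append((t, (rng.gauss(mu, 1.0), rng.gauss(mu, 1.0))))
    return data

def centroid(points):
    """Mean of a list of 2-D points."""
    k = len(points)
    return (sum(p[0] for p in points) / k, sum(p[1] for p in points) / k)

def sq_dist(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def accuracy_deviation(data, t_c, seed=0):
    """Held-out accuracy of a before/after classifier minus the majority-class baseline."""
    rng = random.Random(seed)
    # Step 1: label each observation as before (0) or after (1) the candidate t_c.
    labeled = [(x, int(t >= t_c)) for t, x in data]
    rng.shuffle(labeled)
    half = len(labeled) // 2
    train, test = labeled[:half], labeled[half:]
    # Train a nearest-centroid classifier (a stand-in for any classifier).
    c0 = centroid([x for x, y in train if y == 0])
    c1 = centroid([x for x, y in train if y == 1])
    correct = sum(int(sq_dist(x, c1) < sq_dist(x, c0)) == y for x, y in test)
    acc = correct / len(test)
    # Accuracy reachable by always predicting the larger class.
    frac_after = sum(y for _, y in test) / len(test)
    return acc - max(frac_after, 1.0 - frac_after)

def estimate_change_point(data, candidates):
    """Step 2 (simplified): pick the candidate with the largest accuracy deviation."""
    scores = {t_c: accuracy_deviation(data, t_c) for t_c in candidates}
    return max(scores, key=scores.get), scores

data = make_data()
candidates = [i / 20 for i in range(2, 19)]  # 0.10, 0.15, ..., 0.90
t_hat, scores = estimate_change_point(data, candidates)
print(f"estimated change point: {t_hat:.2f}")
```

Because the framework is model-agnostic, the centroid classifier can be swapped for any model that outputs before/after predictions; only `accuracy_deviation` would need to change.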
Figure 1. Illustrations of synthetic data (a), where observations have two features and . In (b) and (c), blue dots represent data points which satisfy and orange dots are for . (b) and (c) are for and , respectively. For fixed data size N, as increases, the number of data points in each square decreases
Table 1. A comprehensive comparison of the proposed method against two types of state-of-the-art methods, optimal segmentation and Bayesian change point detection, on synthetic data. MtChD(RF) is our method with a random forest classifier; MtChD(MLP) is our method with an MLP classifier. DP + Normal (GLR eq.) is the DP segmentation method used with a normal loss function, which is equivalent to a GLR test that assumes a multivariate normal distribution. Six combinations of optimal segmentation methods are listed. DP is the dynamic programming segmentation algorithm, BinSeg is binary segmentation, Window is window-based change point detection, and BottomUp is bottom-up segmentation. The cost functions used are RBF (RBF kernel), L1 (L1 loss function), and L2 (L2 loss function). The last four rows are for Bayesian change point detection with a uniform prior or Geo (geometric) prior. Gaussian stands for the Gaussian likelihood function, IFM is the individual feature model [30], and FullCov is the full covariance model [30]. For each method, the table reports the mean value and standard deviation of the inferred change point and, where available, the mean value and standard deviation of the inferred α. Bold values indicate change points that are closest to the correct value
| Method | 2 | 4 | 6 | 8 | 10 | 6 | 6 | 6 | 6 |
|---|---|---|---|---|---|---|---|---|---|
| | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.2 | 0.4 | 0.6 | 0.8 |
| MtChD (RF) | | | | | | | | | |
| MtChD (MLP) | 0.5262 | 0.5084 | 0.5772 | 0.5649 | 0.4095 | 0.5962 | 0.5372 | | |
| | 0.0173 | 0.0962 | 0.0569 | 0.0450 | 0.0258 | 0.0668 | 0.1315 | | |
| | 0.6249 | 0.0048 | 0.0086 | 0.0045 | 0.4906 | 0.3950 | 0.0171 | | |
| | 0.0710 | 0.0068 | 0.0080 | 0.0035 | 0.0534 | 0.1112 | 0.0202 | | |
| Naive Confusion (RF) | 0.4965 | 0.5017 | 0.4974 | 0.4975 | 0.4973 | 0.2271 | 0.4255 | 0.5235 | 0.5436 |
| | 0.0018 | 0.0019 | 0.0004 | 0.0001 | 0.0001 | 0.0382 | 0.0312 | 0.0229 | 0.0900 |
| DP + Normal | 0.5212 | 0.7238 | 0.5971 | 0.2441 | 0.4578 | 0.5885 | 0.8108 | | |
| (GLR eq.) | 0.0204 | 0.2762 | 0.3374 | 0.0377 | 0.0447 | 0.0266 | 0.0288 | | |
| DP + RBF | 0.5673 | 0.9495 | 0.3071 | 0.3740 | 0.4234 | 0.5827 | 0.8355 | | |
| | 0.0684 | 0.0679 | 0.2392 | 0.2840 | 0.1893 | 0.0246 | 0.0654 | | |
| DP + L2 | 0.9510 | 0.9875 | 0.3515 | 0.8584 | 0.5143 | 0.4451 | 0.3183 | 0.3104 | 0.2917 |
| | 0.0099 | 0.0062 | 0.2399 | 0.2734 | 0.4006 | 0.3481 | 0.4417 | 0.4252 | 0.3778 |
| DP + L1 | 0.9569 | 0.5313 | 0.5809 | 0.6053 | 0.4015 | 0.5526 | 0.1277 | 0.4916 | 0.2114 |
| | 0.0070 | 0.2660 | 0.1677 | 0.4027 | 0.3308 | 0.4467 | 0.1873 | 0.3832 | 0.3312 |
| BinSeg + RBF | 0.5701 | 0.7663 | 0.5635 | 0.3133 | 0.3850 | 0.6049 | 0.7258 | | |
| | 0.0502 | 0.3205 | 0.2190 | 0.3285 | 0.3702 | 0.1506 | 0.2715 | | |
| Window + RBF | 0.4391 | 0.5653 | 0.2960 | 0.5699 | 0.2444 | 0.4746 | 0.5654 | 0.7964 | 0.3987 |
| | 0.1364 | 0.2210 | 0.2139 | 0.1738 | 0.1012 | 0.2436 | 0.2459 | 0.2223 | 0.3159 |
| BottomUp + RBF | 0.4581 | 0.4500 | 0.6821 | 0.4947 | 0.4271 | 0.5213 | 0.4602 | 0.5861 | |
| | 0.1477 | 0.3655 | 0.2879 | 0.3144 | 0.3059 | 0.2149 | 0.2885 | 0.2953 | |
| Uniform + Gaussian | 0.5474 | 0.5429 | 0.3915 | 0.4717 | 0.5429 | 0.6171 | 0.7546 | 0.5210 | 0.5196 |
| | 0.2299 | 0.3010 | 0.1567 | 0.2265 | 0.2159 | 0.2842 | 0.2203 | 0.1549 | 0.3386 |
| Uniform + IFM | 0.9969 | 0.9942 | 0.9973 | 0.9975 | 0.9975 | 0.9986 | 0.9958 | 0.9973 | 0.9985 |
| | 0.0031 | 0.0030 | 0.0020 | 0.0015 | 0.0030 | 0.0015 | 0.0049 | 0.0026 | 0.0012 |
| Uniform + FullCov | 0.4985 | 0.5089 | 0.9986 | 0.9976 | 0.9989 | 0.9930 | 0.9280 | 0.9982 | 0.9974 |
| | 0.0002 | 0.0163 | 0.0006 | 0.0010 | 0.0009 | 0.0098 | 0.1593 | 0.0020 | 0.0038 |
| Geo + Gaussian | 0.0282 | 0.0271 | 0.0286 | 0.0323 | 0.0278 | 0.0326 | 0.0340 | 0.0312 | 0.0254 |
| | 0.0044 | 0.0018 | 0.0044 | 0.0054 | 0.0037 | 0.0063 | 0.0034 | 0.0051 | 0.0037 |
Figure 2. Example time series of synthetic images that change at when a solid circle changes to a hollow circle. From top to bottom, each row shows images with an increasing noise level and 1.0
Table 2. Change points inferred for noisy synthetic images. The true value of change point is where solid circles change into hollow circles with different levels of noise
| Noise | 0.20 | 0.40 | 0.60 | 0.80 | 1.00 |
|---|---|---|---|---|---|
| | 0.5048 | 0.5087 | 0.5253 | 0.5155 | 0.5380 |
| | 0.0047 | 0.0086 | 0.0237 | 0.0191 | 0.0398 |
| | 0.0028 | 0.0043 | 0.0027 | 0.0111 | 0.0246 |
| | 0.9612 | 0.9787 | 0.9298 | 0.9609 | 0.8781 |
| | 0.0278 | 0.0139 | 0.0083 | 0.0361 | 0.0717 |
Table 3. A comprehensive comparison of our method with previous methods on real-world datasets, COVID-19 Air and Khan Academy. We use the same abbreviations as in Table 1. For COVID-19, time is measured as the number of days since 01/01/2020. For Khan Academy, time is measured as a Unix timestamp, namely the number of seconds since midnight 01/01/1970. Correct values are roughly 80 days for COVID-19 air data, and seconds for Khan Academy data. Bold values indicate change points that are closest to the correct value
| Method | COVID Air | Khan |
|---|---|---|
| | Time (day) | Time (sec) |
| MtChD(RF) | | |
| MtChD(MLP) | 99.5820 | |
| | 99.5820 | |
| | 0.4843 | |
| | 0.3264 | |
| DP + Normal | 71.8333 | 1.3577e+09 |
| (Normal GLR eq.) | 0.3727 | 2.2059e+07 |
| DP + RBF | 37.1667 | 1.3763e+09 |
| | 25.5761 | 9.4481e+06 |
| DP + L2 | 70.1667 | 1.3679e+09 |
| | 53.8911 | 1.0014e+07 |
| BinSeg + RBF | 1.0000 | 1.3741e+09 |
| | 0.0000 | 8.9074e+06 |
| Window + RBF | 55.0000 | 1.3587e+09 |
| | 0.0000 | 1.2031e+07 |
| BottomUp + RBF | 54.0000 | 1.3528e+09 |
| | 0.8165 | 1.2960e+06 |
| Uniform + Gaussian | 96.9167 | 1.3439e+09 |
| | 37.5859 | 4.2047e+06 |
| Uniform + IFM | −0.5833 | 1.3564e+09 |
| | 0.8858 | 1.5300e+07 |
| Uniform + FullCov | 0.0000 | 1.3591e+09 |
| | 0.6455 | 1.6176e+07 |
| Geo + Gaussian | 8.1667 | 1.3396e+09 |
| | 8.9334 | 2.9504e+05 |
Figure 3. Accuracy deviation curve for COVID-19 Air data. (a) Using a random forest classifier; (b) Using an MLP classifier. The scatter points are the accuracy deviation measured on the testing set, and the solid lines are fitted using the proposed accuracy deviation model
Figure 4. Accuracy deviation curve for Khan Academy data. (a) Using a random forest classifier; (b) Using an MLP classifier
Figure 5. Performance in Khan Academy. (a) Performance vs. time in Khan Academy data for problems binned by duration and number of attempts. (b) Change of performance for binned data
Figure 6. Persistence in Khan Academy. (a) Persistence rate vs. time in Khan Academy data for long (≥ 100 sec) and short (< 100 sec) duration questions. (b) Change of persistence rate for long and short duration problems
Figure 7. Average nitrogen dioxide levels, and their change, before and after Mar. 21, 2020. (a) For Manhattan, (b) For San Francisco
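The before/after changes summarized in Figures 5 to 7 are the kind of effect a regression discontinuity design estimates once a change point is known. The following is a minimal sketch of a sharp design with time as the running variable, assuming a local linear fit on each side of the cutoff. The data are synthetic, and the function names (`fit_line`, `rdd_jump`), the bandwidth, and the simulated jump of -2.0 are illustrative assumptions; the authors' exact specification is not given in this record.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def rdd_jump(ts, ys, cutoff, bandwidth):
    """Sharp RDD estimate: gap between the two local linear fits at the cutoff."""
    # Center time on the cutoff so each intercept is the fitted value at the cutoff.
    left = [(t - cutoff, y) for t, y in zip(ts, ys)
            if cutoff - bandwidth <= t < cutoff]
    right = [(t - cutoff, y) for t, y in zip(ts, ys)
             if cutoff <= t <= cutoff + bandwidth]
    a_left, _ = fit_line(*zip(*left))
    a_right, _ = fit_line(*zip(*right))
    return a_right - a_left

# Synthetic outcome (assumption): a linear trend with a drop of 2.0 at t = 0.5.
rng = random.Random(1)
ts = [i / 1000 for i in range(1000)]
ys = [1.0 + 0.5 * t + (-2.0 if t >= 0.5 else 0.0) + rng.gauss(0.0, 0.2) for t in ts]

effect = rdd_jump(ts, ys, cutoff=0.5, bandwidth=0.2)
print(f"estimated jump at the cutoff: {effect:.3f}")
```

Fitting separate lines on each side lets a smooth underlying trend be absorbed by the slopes, so the difference in intercepts isolates the discontinuity at the detected change point.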