Joao Palotti, Raghvendra Mall, Michael Aupetit, Michael Rueschman, Meghna Singh, Aarti Sathyanarayana, Shahrad Taheri, Luis Fernandez-Luque.
Abstract
Accurately measuring sleep and its quality with polysomnography (PSG) is expensive. Actigraphy, an alternative, has proven cheap and relatively accurate. However, the largest experiments conducted to date have had only hundreds of participants. In this work, we processed the data of the recently published Multi-Ethnic Study of Atherosclerosis (MESA) Sleep study so that PSG and actigraphy data are synchronized. We propose adopting this publicly available dataset, which is at least one order of magnitude larger than any other dataset, to systematically compare existing methods for detecting sleep-wake stages, thus fostering the creation of new algorithms. We also implemented and compared state-of-the-art methods for scoring sleep-wake stages, ranging from the widely used traditional algorithms to recent machine learning approaches. Among the traditional algorithms, we identified two approaches that perform better than the algorithm implemented by the actigraphy device used in the MESA Sleep experiments. The performance of the machine learning algorithms, in terms of accuracy and F1 score, was also superior to the device's native algorithm and comparable to human annotation. Future research on new sleep-wake scoring algorithms, in particular machine learning approaches, will be greatly facilitated by the cohort used here. We exemplify this potential by showing that two particular deep-learning architectures, CNN and LSTM, among the many recently created, can achieve accuracy scores significantly higher than other methods for the same tasks.
Keywords: Epidemiology; Fatigue; Health care; Machine learning; Quality of life
Year: 2019 PMID: 31304396 PMCID: PMC6555808 DOI: 10.1038/s41746-019-0126-9
Source DB: PubMed Journal: NPJ Digit Med ISSN: 2398-6352
Results (Mean ± 95% confidence interval) for Task Night
Accuracy, specificity, precision, sensitivity, and F1 are algorithm evaluation metrics; WASO, sleep efficiency, and their MAE are sleep quality metrics.
| Method | Accuracy | Specificity | Precision | Sensitivity | F1 | WASO (min) | MAE WASO | Sleep Eff. (%) | MAE Sleep Eff. |
|---|---|---|---|---|---|---|---|---|---|
| Ground truth | 100.0 ± 0.0 | 100.0 ± 0.0 | 100.0 ± 0.0 | 100.0 ± 0.0 | 100.0 ± 0.0 | 102.1 ± 7.3 | 0.0 | 58.4 ± 1.4 | 0.0 |
| Baselines | |||||||||
| Manual annotations | | 56.5 ± 2.3 | | 94.8 ± 1.5 | | 45.8 ± 8.6 | 74.7 | | |
| Device algorithm | 76.2 ± 1.0 | 50.1 ± 1.8 | 72.6 ± 1.3 | 94.3 ± 0.6 | 81.3 ± 1.0 | | | 75.7 ± 1.0 | 17.7 |
| Always sleep | 58.4 ± 1.4 | 0.0 ± 0.0 | 58.4 ± 1.4 | | 72.8 ± 1.1 | 0.0 ± 0.0 | 102.1 | 100.0 ± 0.0 | 41.6 |
| Always wake | 41.6 ± 1.4 | | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 459.2 ± 9.0 | 357.0 | 0.0 ± 0.0 | 58.4 |
| Traditional algorithms | |||||||||
| Oakley | | 63.0 ± 1.7 | 76.8 ± 1.3 | 87.2 ± 0.9 | 81.0 ± 1.0 | | | 66.0 ± 1.1 | 10.1 |
| Scripps Clinic | 76.6 ± 1.1 | 48.8 ± 1.9 | 72.5 ± 1.4 | 95.9 ± 0.5 | | 46.3 ± 4.2 | 58.5 | 77.1 ± 1.0 | 18.9 |
| Oakley | 75.9 ± 1.0 | 49.3 ± 1.8 | 72.2 ± 1.3 | 94.4 ± 0.5 | 81.2 ± 1.0 | 53.1 ± 4.1 | 52.9 | 76.0 ± 1.0 | 17.9 |
| Cole-Kripke | 75.4 ± 1.1 | 45.0 ± 1.8 | 71.1 ± 1.4 | 96.7 ± 0.4 | 81.2 ± 1.0 | 40.2 ± 3.7 | 63.5 | 79.2 ± 1.0 | 21.0 |
| Sazonov | 75.2 ± 1.0 | | | 75.5 ± 1.4 | 76.7 ± 1.2 | 149.2 ± 7.7 | 58.7 | | |
| Oakley | 73.9 ± 1.1 | 41.2 ± 1.7 | 69.7 ± 1.4 | 96.9 ± 0.4 | 80.3 ± 1.0 | 35.9 ± 3.2 | 67.4 | 80.9 ± 0.9 | 22.7 |
| Sadeh | 73.4 ± 1.2 | 38.3 ± 1.8 | 69.1 ± 1.4 | | 80.3 ± 1.1 | 26.3 ± 3.1 | 76.5 | 83.0 ± 0.9 | 24.7 |
| Webster | 73.3 ± 1.2 | 38.2 ± 1.8 | 69.0 ± 1.4 | 98.2 ± 0.3 | 80.3 ± 1.1 | 27.5 ± 3.0 | 75.3 | 83.0 ± 0.9 | 24.7 |
| Group average | 75.1 ± 1.3 | 49.6 ± 10.4 | 72.5 ± 3.3 | 92.9 ± 6.6 | 80.4 ± 1.3 | 59.2 ± 35.4 | 61.3 ± 10.6 | 75.0 ± 8.2 | 18.6 ± 5.1 |
| Rescoring rules applied to traditional algorithms | |||||||||
| Resc. Oakley | | 68.3 ± 1.9 | 79.9 ± 1.3 | 88.1 ± 0.9 | 83.1 ± 1.0 | 93.2 ± 6.6 | | 64.4 ± 1.2 | |
| Resc. Cole-Kripke | 80.2 ± 1.0 | 65.7 ± 2.0 | 78.9 ± 1.3 | 89.9 ± 0.8 | 83.3 ± 1.0 | 83.5 ± 6.3 | 40.0 | 66.6 ± 1.2 | 10.2 |
| Resc. Scripps Clinic | 80.1 ± 1.0 | 70.4 ± 1.9 | 80.7 ± 1.3 | 86.3 ± 1.1 | 82.6 ± 1.0 | | 41.8 | | 9.1 |
| Resc. Oakley | 79.3 ± 1.0 | 59.8 ± 2.0 | 76.6 ± 1.4 | 92.8 ± 0.6 | | 65.0 ± 5.4 | 46.4 | 70.7 ± 1.1 | 13.1 |
| Resc. Sadeh | 79.1 ± 1.0 | 59.4 ± 2.0 | 76.5 ± 1.4 | | 83.1 ± 1.0 | 64.1 ± 5.7 | 49.2 | 70.9 ± 1.2 | 13.5 |
| Resc. Webster | 79.0 ± 1.0 | 58.9 ± 2.0 | 76.2 ± 1.4 | 93.1 ± 0.7 | 83.0 ± 1.0 | 63.2 ± 5.5 | 48.9 | 71.3 ± 1.2 | 13.8 |
| Resc. Oakley | 77.8 ± 1.0 | 81.6 ± 1.6 | 85.5 ± 1.3 | 73.8 ± 1.6 | 78.0 ± 1.3 | 163.9 ± 9.1 | 68.7 | 50.7 ± 1.5 | 10.8 |
| Resc. Sazonov | 68.1 ± 1.3 | | | 51.2 ± 2.1 | 62.3 ± 2.0 | 258.4 ± 11.0 | 156.7 | 34.0 ± 1.6 | 24.7 |
| Group average | 78.0 ± 3.4 | 69.3 ± 9.5 | 80.3 ± 3.6 | 83.5 ± 12.1 | 79.8 ± 6.1 | 111.8 ± 56.8 | 61.2 ± 33.3 | 61.4 ± 10.9 | 13.0 ± 4.3 |
| Machine learning algorithms | |||||||||
| Extra trees | | 68.1 ± 1.9 | | 90.4 ± 1.2 | | 85.4 ± 7.4 | | 65.8 ± 1.4 | 10.3 |
| Logistic regression | 81.5 ± 1.0 | 67.2 ± 2.0 | 79.9 ± 1.3 | | 84.1 ± 1.1 | 83.2 ± 7.5 | 45.6 | 66.3 ± 1.4 | 11.1 |
| Linear SVM | 81.4 ± 1.1 | 68.0 ± 2.0 | 80.2 ± 1.3 | 89.9 ± 1.3 | 83.8 ± 1.1 | 87.2 ± 7.8 | 45.8 | 65.5 ± 1.5 | 10.8 |
| Perceptron | 78.4 ± 1.0 | | 79.4 ± 1.3 | 83.9 ± 1.4 | 80.7 ± 1.2 | | 44.0 | | |
| Group average | 80.8 ± 2.5 | 68.1 ± 1.2 | 80.0 ± 0.6 | 88.8 ± 5.1 | 83.2 ± 2.7 | 91.5 ± 20.1 | 44.6 ± 2.3 | 64.8 ± 3.4 | 10.4 ± 1.2 |
| Rescoring rules applied to machine learning algorithms | |||||||||
| Resc. Log. Regression | | 80.7 ± 1.8 | 85.6 ± 1.2 | | | | | | |
| Resc. extra trees | 78.5 ± 1.2 | 82.0 ± 1.7 | | 74.2 ± 1.9 | 78.2 ± 1.5 | 160.4 ± 10.3 | 68.6 | 50.8 ± 1.7 | 11.0 |
| Resc. linear SVM | 78.3 ± 1.2 | 81.4 ± 1.7 | 85.8 ± 1.2 | 74.4 ± 2.0 | 77.9 ± 1.6 | 159.4 ± 10.6 | 69.6 | 51.1 ± 1.7 | 11.2 |
| Resc. perceptron | 73.4 ± 1.3 | | 85.7 ± 1.4 | 63.8 ± 2.2 | 70.8 ± 1.9 | 202.2 ± 11.3 | 104.4 | 43.7 ± 1.8 | 16.2 |
| Group average | 77.3 ± 4.1 | 82.1 ± 2.6 | 86.0 ± 0.3 | 72.1 ± 8.9 | 76.4 ± 6.0 | 168.7 ± 35.9 | 76.8 ± 29.5 | 49.5 ± 6.2 | 12.2 ± 4.2 |
| Deep-learning algorithms | |||||||||
| LSTM 100 | | 69.9 ± 2.0 | | 91.4 ± 1.1 | | 79.2 ± 7.6 | 43.9 | 65.6 ± 1.4 | 10.0 |
| CNN 100 | 82.9 ± 1.0 | 68.8 ± 2.1 | 81.3 ± 1.3 | 91.7 ± 1.2 | 85.3 ± 1.1 | 78.3 ± 7.9 | 46.7 | 66.2 ± 1.5 | 10.8 |
| LSTM 50 | 82.7 ± 1.0 | | 81.5 ± 1.3 | 90.5 ± 1.1 | 85.0 ± 1.0 | | | | |
| CNN 50 | 82.5 ± 1.0 | 67.6 ± 2.0 | 80.5 ± 1.3 | | 85.1 ± 1.1 | 75.9 ± 7.4 | 46.6 | 66.9 ± 1.4 | 11.0 |
| CNN 20 | 81.4 ± 1.0 | 66.5 ± 1.9 | 79.6 ± 1.3 | 90.9 ± 1.1 | 84.1 ± 1.1 | 81.9 ± 7.1 | 43.2 | 66.7 ± 1.4 | 10.8 |
| LSTM 20 | 81.3 ± 1.0 | 65.0 ± 1.9 | 79.0 ± 1.3 | 92.0 ± 1.0 | 84.3 ± 1.0 | 75.3 ± 6.7 | 44.5 | 68.0 ± 1.3 | 11.4 |
| Group average | 82.3 ± 0.8 | 68.0 ± 2.1 | 80.6 ± 1.2 | 91.4 ± 0.7 | 84.9 ± 0.6 | 79.4 ± 4.1 | 44.4 ± 2.2 | 66.4 ± 1.1 | 10.6 ± 0.7 |
| Rescoring rules applied to deep-learning algorithms | |||||||||
| Resc. LSTM 100 | | 77.8 ± 1.8 | 84.8 ± 1.2 | | | | | | |
| Resc. CNN 100 | 80.9 ± 1.0 | 78.3 ± 1.9 | 85.1 ± 1.2 | 81.1 ± 1.7 | 81.7 ± 1.3 | 128.1 ± 9.9 | 50.8 | 56.4 ± 1.7 | 9.3 |
| Resc. CNN 50 | 80.6 ± 1.1 | 78.2 ± 1.8 | 84.8 ± 1.3 | 80.6 ± 1.7 | 81.4 ± 1.3 | 130.0 ± 9.7 | 51.4 | 56.1 ± 1.6 | 9.3 |
| Resc. LSTM 50 | 79.9 ± 1.0 | 80.1 ± 1.7 | 85.6 ± 1.2 | 78.0 ± 1.6 | 80.4 ± 1.3 | 142.9 ± 9.8 | 55.6 | 53.8 ± 1.6 | 9.5 |
| Resc. LSTM 20 | 79.5 ± 1.1 | 79.9 ± 1.7 | 85.2 ± 1.2 | 77.5 ± 1.7 | 79.9 ± 1.4 | 145.2 ± 9.8 | 56.9 | 53.6 ± 1.6 | 9.6 |
| Resc. CNN 20 | 78.4 ± 1.1 | | | 74.5 ± 1.8 | 78.2 ± 1.5 | 158.5 ± 10.2 | 66.8 | 51.3 ± 1.7 | 10.8 |
| Group average | 80.1 ± 1.1 | 79.3 ± 1.5 | 85.2 ± 0.4 | 79.0 ± 3.0 | 80.7 ± 1.6 | 138.0 ± 13.8 | 54.8 ± 7.2 | 54.7 ± 2.3 | 9.6 ± 0.7 |
Methods within each group are sorted by their mean accuracy score. The best results for each category are marked in bold. Note that for WASO and Sleep Efficiency, the best results are the closest to the ground truth
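The column definitions above can be illustrated with a minimal sketch, assuming sleep is the positive class (coded 1), wake is the negative class (coded 0), and each label covers one 30-s epoch. The function names, the toy label sequences, and the handling of a night with no sleep epoch are assumptions made for illustration; they are not taken from the study's code.

```python
EPOCH_MINUTES = 0.5  # one epoch = 30 s


def pct(numerator, denominator):
    """Return a percentage, or 0.0 when the denominator is zero."""
    return 100.0 * numerator / denominator if denominator else 0.0


def evaluation_metrics(truth, pred):
    """Epoch-wise metrics with sleep (1) as positive and wake (0) as negative."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    precision = pct(tp, tp + fp)
    sensitivity = pct(tp, tp + fn)
    total = precision + sensitivity
    return {
        "accuracy": pct(tp + tn, len(truth)),
        "specificity": pct(tn, tn + fp),
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": 2 * precision * sensitivity / total if total else 0.0,
    }


def waso_minutes(labels):
    """Wake after sleep onset: wake epochs following the first sleep epoch.

    Assumption: if no epoch is scored as sleep, the whole period counts as wake.
    """
    onset = labels.index(1) if 1 in labels else 0
    return EPOCH_MINUTES * sum(1 for x in labels[onset:] if x == 0)


def sleep_efficiency(labels):
    """Percentage of epochs in the period that are scored as sleep."""
    return pct(sum(labels), len(labels))


# Toy sequences (hypothetical, not MESA data)
truth = [0, 0, 1, 1, 0, 1, 1, 0]
pred = [0, 1, 1, 1, 1, 1, 0, 0]
metrics = evaluation_metrics(truth, pred)
print(metrics)
print(waso_minutes(truth), sleep_efficiency(truth))
print(waso_minutes(pred), sleep_efficiency(pred))
```

Under these assumptions, an "always sleep" predictor yields a WASO of 0 and a sleep efficiency of 100%, and an "always wake" predictor yields a sleep efficiency of 0%, which agrees with the corresponding baseline rows.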
Fig. 1 Pearson’s r correlation coefficients between the results of different metrics for Task Night (shown in Table 1) (n = 41)
Results (Mean ± 95% confidence interval) for Task Night&Day
| Method | Accuracy | Specificity | Precision | Sensitivity | F1 |
|---|---|---|---|---|---|
| Baselines | |||||
| Manual annotations | | 81.4 ± 2.9 | | 98.6 ± 1.0 | |
| Device algorithm | 76.6 ± 2.8 | 68.9 ± 3.7 | 58.6 ± 3.9 | 94.0 ± 1.7 | 71.2 ± 3.2 |
| Always wake | 69.2 ± 1.8 | | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 |
| Always sleep | 30.8 ± 1.8 | 0.0 ± 0.0 | 30.8 ± 1.8 | | 46.7 ± 2.2 |
| Traditional algorithms | |||||
| Sazonov | | | | 75.5 ± 4.2 | 71.7 ± 3.9 |
| Oakley | 81.4 ± 2.3 | 79.1 ± 3.1 | 65.9 ± 4.0 | 86.8 ± 2.9 | |
| Scripps Clinic | 77.4 ± 2.9 | 69.2 ± 3.9 | 59.5 ± 4.0 | 95.9 ± 1.5 | 72.4 ± 3.2 |
| Oakley | 76.2 ± 2.7 | 68.4 ± 3.7 | 58.1 ± 3.8 | 94.1 ± 1.6 | 70.9 ± 3.1 |
| Cole-Kripke | 74.6 ± 2.9 | 64.9 ± 4.0 | 56.4 ± 3.8 | 96.6 ± 1.3 | 70.2 ± 3.2 |
| Oakley | 71.6 ± 2.9 | 60.5 ± 3.9 | 53.3 ± 3.6 | 96.7 ± 1.1 | 67.8 ± 3.1 |
| Sadeh | 70.8 ± 3.2 | 58.7 ± 4.3 | 52.7 ± 3.7 | | 67.6 ± 3.3 |
| Webster | 70.3 ± 3.2 | 58.0 ± 4.3 | 52.2 ± 3.6 | 98.1 ± 1.1 | 67.2 ± 3.2 |
| Group average | 75.6 ± 3.9 | 68.1 ± 8.4 | 58.6 ± 5.6 | 92.7 ± 6.6 | 70.2 ± 2.0 |
| Rescoring rules applied to traditional algorithms | |||||
| Resc. Oakley | | 90.8 ± 2.3 | 79.5 ± 4.4 | 75.8 ± 4.0 | 76.0 ± 3.7 |
| Resc. Scripps Clinic | 85.8 ± 2.3 | 84.9 ± 3.1 | 73.7 ± 4.5 | 87.6 ± 2.6 | |
| Resc. Oakley | 85.3 ± 2.3 | 83.6 ± 3.1 | 72.0 ± 4.3 | 89.0 ± 2.2 | 78.6 ± 3.2 |
| Resc. Cole-Kripke | 84.8 ± 2.4 | 82.0 ± 3.3 | 70.8 ± 4.4 | 91.0 ± 2.1 | 78.6 ± 3.2 |
| Resc. Oakley | 82.9 ± 2.6 | 78.3 ± 3.5 | 67.2 ± 4.3 | 93.2 ± 1.7 | 77.0 ± 3.2 |
| Resc. Sadeh | 82.6 ± 2.8 | 77.7 ± 3.8 | 67.1 ± 4.5 | 93.4 ± 1.9 | 76.9 ± 3.4 |
| Resc. Sazonov | 82.5 ± 1.9 | | | 53.5 ± 5.6 | 62.4 ± 5 |
| Resc. Webster | 82.2 ± 2.8 | 77.1 ± 3.8 | 66.4 ± 4.4 | | 76.5 ± 3.3 |
| Group average | 84.0 ± 1.4 | 83.7 ± 5.4 | 72.1 ± 4.5 | 84.7 ± 11.6 | 75.6 ± 4.5 |
| Machine learning algorithms | |||||
| Extra trees | | | 76.0 ± 4.8 | 82.3 ± 5.3 | 77.3 ± 4.8 |
| Logistic regression | 83.7 ± 2.6 | 79.1 ± 3.4 | | 94.3 ± 3.1 | |
| Linear SVM | 82.3 ± 2.7 | 76.7 ± 3.6 | 65.9 ± 4.2 | | 76.6 ± 3.5 |
| Perceptron | 79.7 ± 2.5 | 75.1 ± 3.3 | 62.7 ± 4.0 | 90.2 ± 2.6 | 72.9 ± 3.4 |
| Group average | 83.1 ± 4.6 | 79.8 ± 9.4 | 68.1 ± 9.0 | 90.5 ± 9.5 | 76.1 ± 3.5 |
| Rescoring rules applied to machine learning algorithms | |||||
| Resc. logistic regression | | 87.4 ± 2.6 | 76.3 ± 4.4 | 87.2 ± 4.4 | 79.7 ± 4.2 |
| Resc. linear SVM | 86.9 ± 2.2 | 86.1 ± 2.8 | 75.0 ± 4.5 | | |
| Resc. perceptron | 86.0 ± 2.1 | 87.4 ± 2.7 | 75.5 ± 4.5 | 82.8 ± 4.0 | 77.4 ± 3.8 |
| Resc. extra trees | 85.4 ± 2.0 | | | 64.3 ± 6.5 | 69.3 ± 6.1 |
| Group average | 86.4 ± 1.5 | 88.7 ± 5.54 | 76.9 ± 4.3 | 80.71 ± 17.81 | 76.5 ± 7.9 |
| Deep-learning algorithms | |||||
| LSTM 100 | | | | 86.4 ± 3.9 | 80.8 ± 3.6 |
| CNN 100 | 87.7 ± 2.3 | 86.6 ± 2.9 | 76.2 ± 4.4 | | |
| CNN 50 | 87.6 ± 2.2 | 87.7 ± 2.7 | 76.7 ± 4.4 | 87.4 ± 4.2 | 80.1 ± 4.0 |
| LSTM 50 | 86.4 ± 2.2 | 86.2 ± 2.9 | 74.9 ± 4.4 | 86.8 ± 3.2 | 79.1 ± 3.4 |
| CNN 20 | 85.9 ± 2.2 | 86.1 ± 2.8 | 74.0 ± 4.4 | 85.5 ± 4.1 | 77.8 ± 3.9 |
| LSTM 20 | 85.8 ± 2.2 | 87.2 ± 2.8 | 75.2 ± 4.5 | 82.5 ± 4.0 | 77.2 ± 3.8 |
| Group average | 86.9 ± 1.1 | 87.1 ± 1.1 | 75.9 ± 1.7 | 86.5 ± 2.6 | 79.3 ± 1.6 |
| Rescoring rules applied to deep-learning algorithms | |||||
| Resc. CNN 100 | | 90.9 ± 2.2 | 80.4 ± 4.4 | | |
| Resc. LSTM 100 | 87.5 ± 1.9 | 92.2 ± 2.0 | | 76.3 ± 5.0 | 77.0 ± 4.4 |
| Resc. CNN 50 | 87.2 ± 1.9 | 92.1 ± 2.1 | 81.4 ± 4.3 | 75.9 ± 5.3 | 76.4 ± 4.8 |
| Resc. LSTM 50 | 86.9 ± 1.9 | 91.5 ± 2.2 | 80.7 ± 4.4 | 75.8 ± 4.6 | 76.4 ± 4.0 |
| Resc. CNN 20 | 86.0 ± 2.0 | 92.5 ± 2.0 | 81.1 ± 4.6 | 70.6 ± 5.4 | 73.3 ± 5.0 |
| Resc. LSTM 20 | 84.7 ± 1.9 | | 82.1 ± 4.7 | 63.6 ± 5.5 | 69.3 ± 5.1 |
| Group average | 86.7 ± 1.2 | 92.1 ± 1.0 | 81.3 ± 0.7 | 73.7 ± 6.1 | 75.1 ± 3.4 |
Methods within each group are sorted by their mean accuracy score. Highest results for each category are marked in bold (n = 363)
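The "Mean ± 95% confidence interval" entries can be reproduced in spirit with a short sketch. It assumes a two-sided Student t interval on the sample standard deviation; the source does not state which interval formula was used, so this choice, the function name `mean_ci95`, and the example values are all assumptions. The lookup table only covers small samples such as the per-group averages.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def mean_ci95(values):
    """Return (mean, half-width) of a 95% t-based confidence interval."""
    n = len(values)
    df = n - 1
    if df not in T_CRIT_95:
        raise ValueError("this sketch only supports 2 to 10 values")
    mean = statistics.fmean(values)
    half_width = T_CRIT_95[df] * statistics.stdev(values) / math.sqrt(n)
    return mean, half_width


# Hypothetical accuracy scores of four methods in one group
mean, half_width = mean_ci95([80.0, 82.0, 84.0, 86.0])
print(f"{mean:.1f} ± {half_width:.1f}")
```

For the per-participant intervals (n = 363 in this table) the critical value would be close to the normal quantile of 1.96, which the small lookup table above deliberately does not cover.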
Summary statistics of the MESA Sleep dataset
| Dataset | Total | Female | Male | White | Chinese | Black | Hispanic | Age (mean ± Std.) | Min. age | Max age |
|---|---|---|---|---|---|---|---|---|---|---|
| Training | 1454 | 799 (55%) | 655 (45%) | 539 (37%) | 157 (11%) | 404 (28%) | 354 (24%) | 69.36 ± 9.18 | 55 | 94 |
| Test | 363 | 186 (51%) | 177 (49%) | 126 (35%) | 44 (12%) | 102 (28%) | 91 (25%) | 69.24 ± 8.79 | 54 | 92 |
Fig. 2 Activity counts by time for MesaID 345. Each point corresponds to the activity count measured by the actigraphy device for an interval of 30 s. The yellow lines mark the borders of the data used for Task Night, that is, the start and end of the PSG period (in this case, from 9:15 p.m. to 9:24 a.m.). The extended period before and after the use of PSG (from 9:00 p.m. to 6:00 p.m. on the next day) is the data used for Task Night&Day
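The two windows described in the caption can be sketched as slices of a 30-s epoch series. This is a minimal illustration only: the calendar date is hypothetical (only the clock times come from the caption), the activity counts are placeholders, and `epoch_starts` and `select_window` are names invented for this example.

```python
from datetime import datetime, timedelta

EPOCH = timedelta(seconds=30)


def epoch_starts(start, end):
    """Yield the start time of every 30-s epoch in [start, end)."""
    t = start
    while t < end:
        yield t
        t += EPOCH


def select_window(epochs, start, end):
    """Keep the (timestamp, count) pairs whose timestamp lies in [start, end)."""
    return [(ts, count) for ts, count in epochs if start <= ts < end]


# Hypothetical date; clock times follow the caption
recording_start = datetime(2000, 1, 1, 21, 0)   # 9:00 p.m.
recording_end = datetime(2000, 1, 2, 18, 0)     # 6:00 p.m. on the next day
psg_start = datetime(2000, 1, 1, 21, 15)        # 9:15 p.m.
psg_end = datetime(2000, 1, 2, 9, 24)           # 9:24 a.m.

# Placeholder activity counts (all zero) for every epoch of the recording
recording = [(ts, 0) for ts in epoch_starts(recording_start, recording_end)]

task_night_day = select_window(recording, recording_start, recording_end)
task_night = select_window(recording, psg_start, psg_end)
print(len(task_night_day), len(task_night))
```

With these times, Task Night&Day spans 21 h of epochs and Task Night spans the 12 h 9 min PSG period, so the second window is a strict subset of the first.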