Jens B Stephansen1,2, Alexander N Olesen1,2,3, Mads Olsen1,2,3, Aditya Ambati1, Eileen B Leary1, Hyatt E Moore1, Oscar Carrillo1, Ling Lin1, Fang Han4, Han Yan4, Yun L Sun4, Yves Dauvilliers5,6, Sabine Scholz5,6, Lucie Barateau5,6, Birgit Hogl7, Ambra Stefani7, Seung Chul Hong8, Tae Won Kim8, Fabio Pizza9,10, Giuseppe Plazzi9,10, Stefano Vandi9,10, Elena Antelmi9,10, Dimitri Perrin11, Samuel T Kuna12, Paula K Schweitzer13, Clete Kushida1, Paul E Peppard14, Helge B D Sorensen2, Poul Jennum3, Emmanuel Mignot15.
Abstract
Analysis of sleep for the diagnosis of sleep disorders such as Type-1 Narcolepsy (T1N) currently requires visual inspection of polysomnography records by trained scoring technicians. Here, we used neural networks in approximately 3,000 normal and abnormal sleep recordings to automate sleep stage scoring, producing a hypnodensity graph, a probability distribution conveying more information than classical hypnograms. Accuracy of sleep stage scoring was validated in 70 subjects assessed by six scorers. The best model performed better than any individual scorer (87% versus consensus). It also reliably scores sleep down to 5 s instead of 30 s scoring epochs. A T1N marker based on unusual sleep stage overlaps achieved a specificity of 96% and a sensitivity of 91%, validated in independent datasets. Addition of HLA-DQB1*06:02 typing increased specificity to 99%. Our method can reduce time spent in sleep clinics and automates T1N diagnosis. It also opens the possibility of diagnosing T1N using home sleep studies.
Year: 2018 PMID: 30523329 PMCID: PMC6283836 DOI: 10.1038/s41467-018-07229-3
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Individual and overall scorer performance, expressed as accuracy and Cohen’s kappa
| | Overall | Scorer 1 | Scorer 2 | Scorer 3 | Scorer 4 | Scorer 5 | Scorer 6 |
|---|---|---|---|---|---|---|---|
| Accuracy (%), biased | 81.3 ± 3.0 | 82.4 ± 6.1 | 84.6 ± 5.5 | 74.1 ± 7.9 | 85.4 ± 5.7 | 83.1 ± 9.4 | 78.3 ± 8.9 |
| Accuracy (%), unbiased | 76.0 ± 3.2 | 77.3 ± 6.3 | 79.1 ± 6.3 | 69.0 ± 8.0 | 79.7 ± 6.5 | 77.8 ± 9.6 | 72.9 ± 9.2 |
| Model accuracy (%) on consensus | n/a | 85.1 ± 4.9 | 83.8 ± 5.0 | 86.5 ± 4.3 | 84.3 ± 4.7 | 85.6 ± 4.7 | 87.0 ± 4.5 |
| t-statistic (p value) | n/a | 9.5 (3.8 × 10⁻¹⁴) | 6.6 (7.5 × 10⁻⁹) | 18.3 (6.0 × 10⁻²⁸) | 6.7 (4.7 × 10⁻⁹) | 6.4 (1.7 × 10⁻⁸) | 12.2 (7.5 × 10⁻¹⁹) |
| Cohen’s kappa, biased | 61.0 ± 6.8 | 63.6 ± 12.2 | 68.4 ± 10.5 | 45.6 ± 19.7 | 69.6 ± 13.2 | 64.5 ± 20.9 | 54.5 ± 19.8 |
| Cohen’s kappa, unbiased | 57.7 ± 6.1 | 61.3 ± 11.2 | 64.6 ± 10.3 | 43.5 ± 19.2 | 64.6 ± 13.1 | 60.9 ± 16.9 | 51.6 ± 16.7 |
| Model kappa on consensus | n/a | 74.3 ± 12.3 | 72.4 ± 12.1 | 76.0 ± 11.8 | 72.7 ± 12.0 | 74.7 ± 12.1 | 76.6 ± 12.2 |
| t-statistic (p value) | n/a | 9.5 (4.6 × 10⁻¹⁴) | 7.1 (7.9 × 10⁻¹⁰) | 15.4 (7.0 × 10⁻²⁴) | 6.6 (6.4 × 10⁻⁹) | 7.1 (9.2 × 10⁻¹⁰) | 13.2 (2.0 × 10⁻²⁰) |
Accuracy and Cohen’s kappa are each reported with (biased) and without (unbiased) the assessed scorer included in the consensus standard, in a leave-one-out fashion. Accuracy is expressed in percent; Cohen’s kappa is a ratio and therefore unitless. The t-statistics and p values are from paired t-tests comparing the unbiased predictions of each scorer with the model predictions on the same consensus.
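The biased and unbiased comparisons described in this note can be sketched as follows. This is a minimal illustration, not the authors' code: the majority-vote consensus, the tie-breaking rule, the function names and the toy stagings are all assumptions made for the example.

```python
from collections import Counter


def majority_vote(labels):
    """Most frequent label; ties go to the label seen first (an assumption)."""
    return Counter(labels).most_common(1)[0][0]


def consensus(scorings, exclude=None):
    """Per-epoch majority vote over all scorers except index `exclude`."""
    used = [s for i, s in enumerate(scorings) if i != exclude]
    return [majority_vote(epoch) for epoch in zip(*used)]


def accuracy(a, b):
    """Fraction of epochs on which two stagings agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = accuracy(a, b)
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)


# Hypothetical stagings of six epochs by four scorers (toy data).
scorings = [
    ["W", "N1", "N2", "N2", "REM", "REM"],
    ["W", "N1", "N2", "N3", "REM", "W"],
    ["W", "N2", "N2", "N3", "REM", "REM"],
    ["W", "N2", "N2", "N2", "N1", "REM"],
]

# Biased: scorer 0 also contributes to the consensus it is judged against.
biased_acc = accuracy(scorings[0], consensus(scorings))
# Unbiased: scorer 0 is left out of the consensus (leave-one-out).
unbiased_consensus = consensus(scorings, exclude=0)
unbiased_acc = accuracy(scorings[0], unbiased_consensus)
unbiased_kappa = cohens_kappa(scorings[0], unbiased_consensus)
```

In this toy case the biased accuracy is higher than the unbiased one, which matches the direction of the difference between the biased and unbiased rows of the table.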
Performance of best models, as they are described by Supplementary Table 8, on various datasets compared to the six-scorer consensus
| Test data | Best single model | Mean performance (%) | Best ensemble | Mean performance (%) |
|---|---|---|---|---|
| WSC | CC/SH/LS/LSTM/2 | 86.0 ± 5.0 | All CC | 86.4 ± 5.2 |
| SSC+KHC, no narcolepsy | CC/LH/SS/LSTM | 76.9 ± 11.1 | All CC | 77.0 ± 11.9 |
| SSC+KHC, narcolepsy | CC/LH/SS/LSTM | 68.8 ± 11.0 | All CC | 68.4 ± 12.2 |
| IS-RC | CC/LH/LS/LSTM/2 | 84.6 ± 4.6 | All models | 86.8 ± 4.3 |
All comparisons are on a by-epoch basis
Fig. 1 Accuracy per scorer and by time resolution. a The effect on scoring accuracy as the gold standard is improved. Every combination of N scorers is evaluated in an unweighted manner and the mean is calculated. Accuracy is shown with mean (solid black line) and a 95% confidence interval (gray area). b Predictive performance of the best model at different resolutions. Performance is shown as mean accuracy (solid black line) with a 95% confidence interval (gray area).
Fig. 2 Hypnodensity example evaluated by multiple scorers and different predictive models. a The figure displays the hypnodensity graph. Displayed models are, in order: multiple scorer assessment (1); ensembles as described in Supplementary Table 8: all models, those with memory (LSTM) and those without memory (FF) (2–4); single models, as described in Supplementary Table 8 (5–7). OCT is octave encoding. Color codes: white, wake; red, N1; light blue, N2; dark blue, N3; black, REM. b The 150 epochs of a recording from the AASM ISR program are analyzed by 16 models with randomly varying parameters, using the CC/SH/LS/LSTM model as a template. These data were also evaluated by 5234 ± 14 different scorers. The distribution of these is shown on top, the average model predictions are shown in the middle, and the model variance is shown at the bottom.
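The average model prediction and model variance mentioned in panel b can be illustrated with a short sketch. It assumes each model outputs one probability per stage per epoch and that the ensemble is an unweighted mean with a population variance; the source does not state these details, and the function names and toy numbers are hypothetical.

```python
from statistics import fmean, pvariance

STAGES = ("W", "N1", "N2", "N3", "REM")


def ensemble_hypnodensity(model_outputs):
    """model_outputs[m][t] holds the five stage probabilities of model m at epoch t.

    Returns the per-epoch, per-stage mean and population variance across models.
    """
    mean, var = [], []
    for t in range(len(model_outputs[0])):
        per_stage = list(zip(*(m[t] for m in model_outputs)))
        mean.append(tuple(fmean(p) for p in per_stage))
        var.append(tuple(pvariance(p) for p in per_stage))
    return mean, var


def hypnogram(hypnodensity):
    """Collapse a hypnodensity to a classical hypnogram via the most probable stage."""
    return [STAGES[max(range(len(STAGES)), key=row.__getitem__)] for row in hypnodensity]


# Two hypothetical models scoring two epochs (toy data).
model_a = [(0.9, 0.1, 0.0, 0.0, 0.0), (0.1, 0.1, 0.6, 0.1, 0.1)]
model_b = [(0.7, 0.3, 0.0, 0.0, 0.0), (0.1, 0.3, 0.4, 0.1, 0.1)]

mean, var = ensemble_hypnodensity([model_a, model_b])
stages = hypnogram(mean)
```

Taking the most probable stage discards the remaining probabilities, which is the information a hypnodensity keeps and a hypnogram loses.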
Confusion matrix displaying the relation between different targets and the ensemble estimate
| Model prediction | Consensus | Target: Wake | Target: N1 | Target: N2 | Target: N3 | Target: REM | Precision |
|---|---|---|---|---|---|---|---|
| Wake | Unweighted | 14.08% | 0.35% | 0.88% | 0.007% | 0.08% | 0.91 |
| Wake | Weighted | 16.68% | 0.15% | 0.44% | 0.003% | 0.02% | 0.96 |
| N1 | Unweighted | 1.13% | 1.78% | 3.00% | 0.002% | 0.36% | 0.28 |
| N1 | Weighted | 0.47% | 0.88% | 1.15% | 0% | 0.12% | 0.34 |
| N2 | Unweighted | 0.29% | 0.59% | 52.58% | 1.27% | 0.66% | 0.95 |
| N2 | Weighted | 0.12% | 0.25% | 56.30% | 0.34% | 0.32% | 0.98 |
| N3 | Unweighted | 0.002% | 0% | 2.13% | 4.87% | 0% | 0.70 |
| N3 | Weighted | 0% | 0% | 1.09% | 4.23% | 0% | 0.91 |
| REM | Unweighted | 0.54% | 1.17% | 0.78% | 0% | 13.45% | 0.84 |
| REM | Weighted | 0.40% | 0.73% | 0.41% | 0% | 15.86% | 0.91 |
| Sensitivity | Unweighted | 0.88 | 0.46 | 0.89 | 0.79 | 0.92 | 0.87 |
| Sensitivity | Weighted | 0.94 | 0.44 | 0.95 | 0.92 | 0.97 | 0.94 |
The targets are the unweighted consensus (top row of each pair) and the consensus weighted by the scorer agreement at each epoch (bottom row of each pair). The numbers of analyzed epochs were 53,009 (unweighted) and 36,032 (weighted).
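The precision and sensitivity entries of the confusion matrix can be recomputed from its cells. The sketch below uses the unweighted-consensus percentages from the table, with rows as model predictions and columns as targets; the function names are chosen for illustration only.

```python
STAGES = ["Wake", "N1", "N2", "N3", "REM"]

# Unweighted-consensus cells (% of epochs): rows = model prediction, columns = target.
CONFUSION = [
    [14.08, 0.35, 0.88, 0.007, 0.08],   # predicted Wake
    [1.13, 1.78, 3.00, 0.002, 0.36],    # predicted N1
    [0.29, 0.59, 52.58, 1.27, 0.66],    # predicted N2
    [0.002, 0.0, 2.13, 4.87, 0.0],      # predicted N3
    [0.54, 1.17, 0.78, 0.0, 13.45],     # predicted REM
]


def precision(matrix, i):
    """Share of epochs predicted as stage i whose target is stage i (row-wise)."""
    return matrix[i][i] / sum(matrix[i])


def sensitivity(matrix, i):
    """Share of epochs with target stage i that were predicted as stage i (column-wise)."""
    return matrix[i][i] / sum(row[i] for row in matrix)


def overall_accuracy(matrix):
    """Sum of the diagonal divided by the sum of all cells."""
    return sum(matrix[i][i] for i in range(len(matrix))) / sum(map(sum, matrix))


for i, stage in enumerate(STAGES):
    print(f"{stage}: precision {precision(CONFUSION, i):.2f}, "
          f"sensitivity {sensitivity(CONFUSION, i):.2f}")
print(f"overall accuracy {overall_accuracy(CONFUSION):.2f}")
```

Rounded to two decimals, these values reproduce the Precision column and Sensitivity row of the unweighted table, and the bottom-right 0.87 corresponds to the overall accuracy.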
Fig. 3 Examples of hypnodensity graph in subjects with and without narcolepsy. Hypnodensity, i.e., probability distribution per stage of sleep for a subject without narcolepsy (top) and a subject with narcolepsy (bottom). Color codes: white, wake; red, N1; light blue, N2; dark blue, N3; black, REM.
Descriptions of the 8 most frequently selected features
| Number | Relative selection frequency | Description |
|---|---|---|
| 1 | 1 | The time taken before 5% of the sum of the product between W, N2 and REM, calculated at every epoch, has accumulated, weighted by the total amount of this sum. |
| 2 | 0.91 | The number of nightly SOREMPs appearing throughout the recording. |
| 3 | 0.82 | The time taken before 50% of the wakefulness in a recording has accumulated, weighted by the total amount of wakefulness. |
| 4 | 0.82 | REM 6 |
| 5 | 0.68 | The maximum probability of wakefulness obtained in a recording. |
| 6 | 0.68 | The maximum value obtained of the product between the N2 and REM probability in a recording. |
| 7 | 0.68 | The time taken before 30% of the sum of the product between W and N2, calculated at every epoch, has accumulated, weighted by the total amount of this sum. |
| 8 | 0.64 | The time taken before 10% of the sum of the product between W and N1, calculated at every epoch, has accumulated, weighted by the total amount of this sum. |
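Several of these features are accumulation times or maxima of stage-probability products. The sketch below is one possible reading, not the authors' definition: it assumes "weighted by the total amount" means the running sum is normalized by its total, assumes 30 s epochs, and uses a hypothetical four-epoch hypnodensity.

```python
def stage_product(hypnodensity, stages):
    """Per-epoch product of the probabilities of the given stages."""
    products = []
    for epoch in hypnodensity:
        value = 1.0
        for stage in stages:
            value *= epoch[stage]
        products.append(value)
    return products


def time_to_fraction(values, fraction, epoch_seconds=30.0):
    """Seconds until the running sum of `values` first reaches `fraction` of its total.

    Returns None when the total is zero, i.e. the overlap never occurs.
    """
    total = sum(values)
    if total <= 0:
        return None
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if running >= fraction * total:
            return (i + 1) * epoch_seconds
    return len(values) * epoch_seconds


# Hypothetical hypnodensity: one dict of stage probabilities per epoch.
hypnodensity = [
    {"W": 0.1, "N1": 0.1, "N2": 0.7, "N3": 0.1, "REM": 0.0},
    {"W": 0.6, "N1": 0.2, "N2": 0.1, "N3": 0.0, "REM": 0.1},
    {"W": 0.1, "N1": 0.1, "N2": 0.4, "N3": 0.0, "REM": 0.4},
    {"W": 0.0, "N1": 0.0, "N2": 0.1, "N3": 0.0, "REM": 0.9},
]

# Feature 1 (as interpreted here): time until 5% of sum(W * N2 * REM) has accumulated.
feature_1 = time_to_fraction(stage_product(hypnodensity, ("W", "N2", "REM")), 0.05)
# Feature 3: time until 50% of the wakefulness has accumulated.
feature_3 = time_to_fraction(stage_product(hypnodensity, ("W",)), 0.50)
# Feature 5: maximum probability of wakefulness.
feature_5 = max(epoch["W"] for epoch in hypnodensity)
# Feature 6: maximum of the N2 * REM product.
feature_6 = max(stage_product(hypnodensity, ("N2", "REM")))
```

Products such as W × N2 × REM are nonzero only where the model assigns probability to several stages at once, so they quantify the stage overlaps that a single-stage hypnogram cannot express.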
Fig. 4 Diagnostic receiver operating characteristics curves. Diagnostic receiver operating characteristics (ROC) curves, displaying the trade-offs between sensitivity and specificity for our narcolepsy biomarker for a training sample, b testing sample, c replication sample and e high pretest sample. d–f Adding HLA to the model vastly increases specificity. Cut-off thresholds are presented for models with (red dot) and without HLA (green dot).
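The trade-off that an ROC curve displays can be made concrete with a small sketch that sweeps a cut-off over classifier scores. The scores, labels and function names here are invented for illustration and are unrelated to the study data or to how the authors chose their cut-offs.

```python
def sensitivity_specificity(scores, is_case, threshold):
    """Treat score >= threshold as a positive call and compare with the true labels."""
    tp = sum(1 for s, y in zip(scores, is_case) if y and s >= threshold)
    fn = sum(1 for s, y in zip(scores, is_case) if y and s < threshold)
    tn = sum(1 for s, y in zip(scores, is_case) if not y and s < threshold)
    fp = sum(1 for s, y in zip(scores, is_case) if not y and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(scores, is_case):
    """(1 - specificity, sensitivity) pairs, one per distinct score used as cut-off."""
    points = []
    for threshold in sorted(set(scores), reverse=True):
        sens, spec = sensitivity_specificity(scores, is_case, threshold)
        points.append((1 - spec, sens))
    return points


# Hypothetical classifier scores for 3 cases and 5 controls (toy data).
scores = [0.9, 0.8, 0.4, 0.7, 0.2, 0.1, 0.3, 0.6]
is_case = [True, True, True, False, False, False, False, False]

sens, spec = sensitivity_specificity(scores, is_case, threshold=0.5)
curve = roc_points(scores, is_case)
```

Lowering the threshold moves along the curve toward higher sensitivity and lower specificity; a single reported cut-off corresponds to one point on it.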
Fig. 5 Overall design of the study. a Pre-processing steps taken to achieve the format of data as it is used in the neural networks. One of the 5 channels is first high-pass filtered with a cut-off at 0.2 Hz, then low-pass filtered with a cut-off at 49 Hz, followed by a re-sampling to 100 Hz to ensure data homogeneity. In the case of EEG signals, a channel selection is employed to choose the channel with the least noise. The data are then encoded using either the CC or the octave encoding. b Steps taken to produce and test the automatic scoring algorithm. A part of the SSC [10, 32] and WSC [32, 33] is randomly selected, as described in Supplementary Table 1. These data are then segmented into 5-min segments and scrambled with segments from other subjects to increase batch similarity during training. A neural network is then trained until convergence (evaluated using a separate validation sample). Once trained, the networks are tested on a separate part of the SSC and WSC along with data from the IS-RC [31] and KHC [10, 34]. c Steps taken to produce and test the narcolepsy detector. Hypnodensities are extracted from data, as described in Supplementary Table 1. These data are separated into a training (60%) and a testing (40%) split. From the training split, 481 potentially relevant features, as described in Supplementary Table 9, are extracted from each hypnodensity. The prominent features are maintained using a recursive selection algorithm, and from these features a GP classifier is created. From the testing split, the same relevant features are extracted, and the GP classifier is evaluated.
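The segmentation and scrambling step in panel b can be sketched as below. The 100 Hz rate and 5-min segment length come from the caption; dropping an incomplete trailing segment, the seeding, the function names and the dummy signals are assumptions made for this example.

```python
import random

SAMPLE_RATE_HZ = 100          # sampling rate after re-sampling, per the caption
SEGMENT_SECONDS = 5 * 60      # 5-min segments, per the caption


def segment(signal, subject_id, fs=SAMPLE_RATE_HZ, seconds=SEGMENT_SECONDS):
    """Cut one recording into consecutive fixed-length segments.

    A trailing segment shorter than the full length is dropped (an assumption).
    """
    size = fs * seconds
    return [
        (subject_id, signal[start:start + size])
        for start in range(0, len(signal) - size + 1, size)
    ]


def scrambled_segments(recordings, seed=0):
    """Pool the segments of all subjects and shuffle them together."""
    pool = [
        piece
        for subject_id, signal in recordings.items()
        for piece in segment(signal, subject_id)
    ]
    random.Random(seed).shuffle(pool)
    return pool


# Dummy single-channel recordings: 650 s for subject "A", 300 s for subject "B".
recordings = {"A": list(range(65_000)), "B": [0.0] * 30_000}
pool = scrambled_segments(recordings, seed=1)
```

Mixing segments from different subjects means that consecutive training batches draw from a similar mix of recordings, which is the batch similarity the caption refers to.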
Fig. 6 Neural network strategy. a An example of the octave and the CC encoding on 10 s of EEG, EOG and EMG data. These processed data are fed into the neural networks in one of the two formats. The data in the octave encoding are offset for visualization purposes. Color scale is unitless. b Simplified network configuration, displaying how data are fed and processed through the networks. A more detailed description can be found in Supplementary Figure 3.