Gi-Ren Liu (1), Ting-Yu Lin (2), Hau-Tieng Wu (3), Yuan-Chung Sheu (4,5), Ching-Lung Liu (6), Wen-Te Liu (7), Mei-Chen Yang (8), Yung-Lun Ni (9), Kun-Ta Chou (10), Chao-Hsien Chen (6), Dean Wu (7), Chou-Chin Lan (8), Kuo-Liang Chiu (9,11), Hwa-Yen Chiu (10), Yu-Lun Lo (2).
1. Department of Mathematics, National Chen-Kung University, Tainan, Taiwan.
2. Department of Thoracic Medicine, Chang Gung Memorial Hospital, Chang Gung University, College of Medicine, Taoyuan, Taiwan.
3. Department of Mathematics and Department of Statistical Science, Duke University, Durham, North Carolina.
4. Mathematics Division, National Center for Theoretical Sciences, Taipei, Taiwan.
5. Department of Applied Mathematics, National Chiao Tung University, Hsinchu, Taiwan.
6. Division of Chest, Department of Internal Medicine, MacKay Memorial Hospital, Taipei, Taiwan.
7. Sleep Center, Shuang Ho Hospital, Taipei Medical University, New Taipei City, Taiwan.
8. Division of Pulmonary Medicine, Department of Internal Medicine, Taipei Tzu Chi Hospital, Buddhist Tzu Chi Medical Foundation, New Taipei, Taiwan.
9. Department of Pulmonary Medicine, Taichung Tzu Chi Hospital, Buddhist Tzu Chi Medical Foundation, Taichung, Taiwan.
10. Center of Sleep Medicine, Taipei Veterans General Hospital, Taipei, Taiwan.
11. School of Post-Baccalaureate Chinese Medicine, Tzu Chi University, Hualien, Taiwan.
Abstract
STUDY OBJECTIVES: Polysomnography is the gold standard for identifying sleep stages; however, technicians differ in how they apply the scoring standards. Because organizing meetings to evaluate these discrepancies and/or reach a consensus among multiple sleep centers is time-consuming, we developed an artificial intelligence system to efficiently evaluate the reliability and consistency of sleep scoring and hence the sleep center quality.

METHODS: An interpretable machine learning algorithm was used to evaluate the interrater reliability (IRR) of sleep stage annotation among sleep centers. The artificial intelligence system was trained on the annotations of raters from 1 hospital and was applied to patients from the same or other hospitals. Its predictions were compared with the experts' annotations to determine IRR. Intracenter and intercenter assessments were conducted on 679 patients without sleep apnea from 6 sleep centers in Taiwan. Centers with potential quality issues were identified by the estimated IRR.

RESULTS: In the intracenter assessment, the median accuracy ranged from 80.3% to 83.3%, with the exception of 1 hospital, which had an accuracy of 72.3%. In the intercenter assessment, the median accuracy ranged from 75.7% to 83.3% when that hospital was excluded from testing and training. The performance of the proposed method was higher for the N2, awake, and REM sleep stages than for the N1 and N3 stages. The significant IRR discrepancy of that hospital suggested a quality issue, which was confirmed by the physicians in charge of the hospital.

CONCLUSIONS: The proposed artificial intelligence system proved effective in assessing IRR and hence the sleep center quality.
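The comparison step described in METHODS (model predictions versus expert annotations, summarized as a median accuracy across patients) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the stage labels as plain strings, and the per-epoch list representation are all assumptions made for illustration. Cohen's kappa is included only as a common chance-corrected agreement measure; the abstract itself reports accuracy.

```python
from collections import Counter
from statistics import median


def epoch_accuracy(predicted, expert):
    """Fraction of epochs where the predicted stage equals the expert's stage."""
    if len(predicted) != len(expert) or not expert:
        raise ValueError("stage sequences must be non-empty and of equal length")
    matches = sum(p == e for p, e in zip(predicted, expert))
    return matches / len(expert)


def cohen_kappa(predicted, expert):
    """Chance-corrected agreement (assumption: not stated as the paper's metric)."""
    n = len(expert)
    observed = epoch_accuracy(predicted, expert)
    pred_counts = Counter(predicted)
    expert_counts = Counter(expert)
    # Agreement expected by chance from the two marginal stage distributions.
    stages = set(pred_counts) | set(expert_counts)
    expected = sum(pred_counts[s] * expert_counts[s] for s in stages) / (n * n)
    if expected == 1.0:
        # Both sequences contain a single identical stage; agreement is perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def median_accuracy(recordings):
    """Median per-patient accuracy over (predicted, expert) sequence pairs."""
    return median(epoch_accuracy(p, e) for p, e in recordings)


if __name__ == "__main__":
    # Hypothetical 8-epoch recording, for illustration only.
    predicted = ["Awake", "N1", "N2", "N2", "N3", "REM", "REM", "N2"]
    expert = ["Awake", "N2", "N2", "N2", "N3", "REM", "N1", "N2"]
    print(epoch_accuracy(predicted, expert))
    print(cohen_kappa(predicted, expert))
```

In an intercenter setting of the kind described, one would compute `median_accuracy` once per (training center, testing center) pair and look for a center whose values are consistently lower than the rest.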