Ching-Heng Lin1, Kai-Cheng Hsu2, Kory R Johnson3, Marie Luby4, Yang C Fann5. 1. Center for Information Technology, National Institutes of Health, Bethesda, MD, United States. 2. Bioinformatics Section, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, United States; Department of Neurology, National Taiwan University Hospital, Taipei, Taiwan. 3. Bioinformatics Section, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, United States. 4. Stroke Branch, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, United States. 5. Bioinformatics Section, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, United States. Electronic address: fann@ninds.nih.gov.
Abstract
INTRODUCTION: Clinicians commonly use the modified Rankin Scale (mRS) and the Barthel Index (BI) to measure clinical outcome after stroke. These scales are potential targets for machine learning models that predict stroke outcome, so the quality of the measurements is crucial for training and validating such models. The objective of this study was to apply and evaluate density-based outlier detection methods for identifying potentially incorrect measurements in multiple large stroke datasets, in order to assess measurement quality. METHODS: We applied three density-based outlier detection methods: density-based spatial clustering of applications with noise (DBSCAN), hierarchical DBSCAN (HDBSCAN), and local outlier factor (LOF), using a large dataset obtained from a nationwide prospective stroke registry in Taiwan. Each method was then tested on four different NINDS-funded stroke datasets. RESULTS: DBSCAN achieved high performance across all mRS values; the highest average accuracy was 99.2 ± 0.7 at an mRS of 4 and the lowest was 92.0 ± 4.6 at an mRS of 3. LOF achieved similar performance; however, HDBSCAN with its default parameter settings required further tuning. CONCLUSION: The density-based outlier detection methods proved promising for validating stroke outcome measures. The outlier detection algorithm developed from a large prospective registry dataset was effectively applied to four different NINDS stroke datasets with high performance. The tool developed from this detection algorithm can be further applied to real-world datasets to increase data quality in stroke outcome measures. Published by Elsevier B.V.
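As a minimal illustration of the DBSCAN noise criterion underlying the abstract's approach (this is a sketch, not the authors' implementation; the `eps` and `min_pts` values and the example scores are illustrative assumptions, and a real analysis would use a library implementation such as scikit-learn's):

```python
# Sketch of DBSCAN's noise rule for 1-D outcome scores: a point is
# flagged as an outlier ("noise") when fewer than min_pts points lie
# within distance eps of it. Values of eps/min_pts are assumptions
# chosen for illustration only.

def dbscan_noise_flags(points, eps=1.0, min_pts=3):
    """Return a list of booleans: True if the point is DBSCAN noise."""
    flags = []
    for p in points:
        # Count neighbours within eps (the point itself counts too).
        neighbours = sum(1 for q in points if abs(p - q) <= eps)
        flags.append(neighbours < min_pts)
    return flags

# Hypothetical mRS-like scores where one value is implausible:
scores = [2, 2, 3, 2, 3, 3, 2, 9]
flags = dbscan_noise_flags(scores, eps=1.0, min_pts=3)
# Only the last score (9) is isolated from the dense cluster of 2s and 3s.
```

HDBSCAN extends this idea by varying the density threshold hierarchically, and LOF instead scores each point by how much sparser its neighbourhood is than its neighbours', which is why the three methods can disagree at cluster boundaries.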