Fredrik A Dahl [1,2], Taraka Rama [3], Petter Hurlen [4], Pål H Brekke [5], Haldor Husby [6], Tore Gundersen [7], Øystein Nytrø [8], Lilja Øvrelid [9].
Affiliations: 1. Health Services Research Unit, Akershus University Hospital, Lørenskog, Norway. Fredrik.dahl@ahus.no. 2. Institute for Clinical Medicine, Campus Ahus, University of Oslo, Oslo, Norway. Fredrik.dahl@ahus.no. 3. Department of Linguistics, University of North Texas, Denton, TX, USA. 4. Division of Diagnostics and Technology, Akershus University Hospital, Lørenskog, Norway. 5. Department of Cardiology, Oslo University Hospital Rikshospitalet, Oslo, Norway. 6. Institute for Clinical Medicine, Campus Ahus, University of Oslo, Oslo, Norway. 7. Data and Analytics, Akershus University Hospital, Lørenskog, Norway. 8. Department of Computer Science, Norwegian University of Science and Technology, Trondheim, Norway. 9. Department of Informatics, University of Oslo, Oslo, Norway.
Abstract
BACKGROUND: For quality assurance purposes, machine learning models were trained to classify Norwegian radiology reports of paediatric CT examinations according to whether they describe abnormal findings.
METHODS: A radiologist classified 13,506 reports from CT scans of children, 1000 reports from CT scans of adults and 1000 reports from X-ray examinations of adults as positive or negative, according to the presence of abnormal findings. Inter-rater reliability was evaluated by comparison with a clinician's classifications of 500 reports. Test-retest reliability of the radiologist was assessed on the same 500 reports. A convolutional neural network model (CNN), a bidirectional recurrent neural network model (bi-LSTM) and a support vector machine model (SVM) were trained on a random selection of the children's data set. The models were evaluated on the remaining children's CT reports and on the adult data sets.
RESULTS: Test-retest reliability: Cohen's Kappa = 0.86 and F1 = 0.919. Inter-rater reliability: Kappa = 0.80 and F1 = 0.885. Model performance on the children's CT data was as follows. CNN: (AUC = 0.981, F1 = 0.930), bi-LSTM: (AUC = 0.978, F1 = 0.927), SVM: (AUC = 0.975, F1 = 0.912). On the adult data sets, the models had AUC around 0.95 and F1 around 0.91.
CONCLUSIONS: The models performed close to perfectly on their defined domain, and also performed convincingly on reports pertaining to a different patient group and a different modality. The models were deemed suitable for classifying radiology reports for future quality assurance purposes, where the fraction of examinations with abnormal findings for different sub-groups of patients is a parameter of interest.
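The reliability figures above (Cohen's Kappa and F1) are both computed from paired binary labels. A minimal sketch of the two calculations follows, assuming labels coded as 1 (abnormal findings) and 0 (no abnormal findings). The function names and the ten toy labels are illustrative assumptions, not the study's code or data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of reports given the same label twice.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def f1_score(reference, predicted, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(reference, predicted))
    tp = sum(r == positive and p == positive for r, p in pairs)
    fp = sum(r != positive and p == positive for r, p in pairs)
    fn = sum(r == positive and p != positive for r, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy labels for ten reports (not taken from the study).
rater_1 = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
rater_2 = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]

print("kappa:", cohen_kappa(rater_1, rater_2))
print("F1:", f1_score(rater_1, rater_2))
```

Kappa discounts the agreement that two raters would reach by chance given how often each assigns the positive label, whereas F1 treats one label set as the reference and ignores true negatives, so the two measures can diverge when positives are rare or very common.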
Keywords: Machine learning; Natural language processing; Reproducibility of results; Tomography, X-ray computed