Corrado Lanera1, Clara Minto1, Abhinav Sharma2, Dario Gregori1, Paola Berchialla3, Ileana Baldi4. 1. Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac, Thoracic and Vascular Sciences, University of Padova, Via Loredan 18, Padova 35131, Italy. 2. Department of Biological Sciences and Bioengineering (BSBE), IIT, Kanpur, India. 3. Department of Clinical and Biological Sciences, University of Torino, Via Santena 5bis, Torino 10126, Italy. 4. Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac, Thoracic and Vascular Sciences, University of Padova, Via Loredan 18, Padova 35131, Italy. Electronic address: ileana.baldi@unipd.it.
Abstract
OBJECTIVES: Despite their essential role in collecting and organizing published medical literature, indexed search engines are unable to cover all relevant knowledge. Hence, current literature recommends the inclusion of clinical trial registries in systematic reviews (SRs). This study aims to provide an automated approach to extend a search on PubMed to the ClinicalTrials.gov database, relying on text mining and machine learning techniques. STUDY DESIGN AND SETTING: The procedure starts from a literature search on PubMed. Next, it considers the training of a classifier that can identify documents with a comparable word characterization in the ClinicalTrials.gov clinical trial repository. Fourteen SRs, covering a broad range of health conditions, are used as case studies for external validation. A cross-validated support-vector machine (SVM) model was used as the classifier. RESULTS: The sensitivity was 100% in all SRs except one (87.5%), and the specificity ranged from 97.2% to 99.9%. The ability of the instrument to distinguish on-topic from off-topic articles ranged from an area under the receiver operator characteristic curve of 93.4% to 99.9%. CONCLUSION: The proposed machine learning instrument has the potential to help researchers identify relevant studies in the SR process by reducing workload, without losing sensitivity and at a small price in terms of specificity.
OBJECTIVES: Despite their essential role in collecting and organizing published medical literature, indexed search engines are unable to cover all relevant knowledge. Hence, current literature recommends the inclusion of clinical trial registries in systematic reviews (SRs). This study aims to provide an automated approach to extend a search on PubMed to the ClinicalTrials.gov database, relying on text mining and machine learning techniques. STUDY DESIGN AND SETTING: The procedure starts from a literature search on PubMed. Next, it considers the training of a classifier that can identify documents with a comparable word characterization in the ClinicalTrials.gov clinical trial repository. Fourteen SRs, covering a broad range of health conditions, are used as case studies for external validation. A cross-validated support-vector machine (SVM) model was used as the classifier. RESULTS: The sensitivity was 100% in all SRs except one (87.5%), and the specificity ranged from 97.2% to 99.9%. The ability of the instrument to distinguish on-topic from off-topic articles ranged from an area under the receiver operator characteristic curve of 93.4% to 99.9%. CONCLUSION: The proposed machine learning instrument has the potential to help researchers identify relevant studies in the SR process by reducing workload, without losing sensitivity and at a small price in terms of specificity.