Sejin Nam (1), Senator Jeong (2), Sang-Kyun Kim (3), Hong-Gee Kim (4), Victoria Ngo (5), Nansu Zong (6)

1. National Center of Excellence in Software, Chungnam National University, South Korea.
2. National Center for Medical Information & Knowledge, Korea National Institute of Health, South Korea.
3. Mibyeong Research Center, Korea Institute of Oriental Medicine, Daejeon, South Korea.
4. Biomedical Knowledge Engineering Laboratory, School of Dentistry, Seoul National University, South Korea.
5. Betty Irene Moore School of Nursing, University of California, Davis, USA.
6. Department of Biomedical Informatics, School of Medicine, University of California, San Diego, USA. Electronic address: nazong@ucsd.edu.
Abstract
OBJECTIVE: Nearly 75% of the abstracts in MEDLINE papers are presented in an unstructured format. This study aims to automate the reformatting of unstructured abstracts into the Introduction, Methods, Results, and Discussion (IMRAD) format. The quality of this reformatting depends on the features used in sentence classification. We therefore explored which linguistic features are most effective for MEDLINE papers.

METHODS: We constructed a feature set consisting of bag-of-words, linguistic features, grammatical features, and structural features. To evaluate effectiveness, defined as how well sentences can be classified using these features, we selected and constructed three datasets from the PubMed Central Open Access Subset: (1) structured abstracts (SA) for training, (2) unstructured RCT abstracts (UA-1), and (3) unstructured general abstracts (UA-2). F-score and accuracy were used to measure effectiveness at the IMRAD section level and for the overall classification.

RESULTS: Adding linguistic features improved the accuracy of abstract sentence classification by 1.2% to 35.8% across the three abstract datasets. The highest accuracies achieved were 91.7% in SA, 86.3% in UA-1, and 77.9% in UA-2. Linguistic features (15 dimensions) had fewer dimensions than bag-of-words (1541 dimensions). All representative linguistic features (n-grams, verb phrases, and noun phrases) for each section are identified in our system (available at http://abstract.bike.re.kr).

CONCLUSION: Linguistic features can be used to classify sentences in MEDLINE abstracts effectively and with a low computational burden.
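As an illustration of the two evaluation measures named in METHODS, the following minimal sketch computes overall accuracy and a per-section F-score for sentence-level IMRAD labels. It is not the authors' implementation. The abstract does not say which F-score variant was used, so the balanced F1 is assumed here, and the function names and toy labels are hypothetical.

```python
from collections import Counter

# IMRAD section labels; one label is assigned to each abstract sentence.
SECTIONS = ("Introduction", "Methods", "Results", "Discussion")


def accuracy(gold, pred):
    """Fraction of sentences whose predicted section matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f_scores(gold, pred, labels=SECTIONS):
    """Per-section F1 (assumed variant): 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted section gains a false positive
            fn[g] += 1  # gold section gains a false negative
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Toy labels invented for illustration only (not data from the study).
gold = ["Introduction", "Methods", "Methods", "Results", "Results", "Discussion"]
pred = ["Introduction", "Methods", "Results", "Results", "Results", "Discussion"]

print(accuracy(gold, pred))
print(f_scores(gold, pred))
```

In this toy case one Methods sentence is mislabeled as Results, which lowers recall for Methods and precision for Results while leaving the other two sections unaffected. That is why a section-level F-score is reported alongside overall accuracy.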