Honghan Wu, Anika Oellrich, Christine Girges, Bernard de Bono, Tim J P Hubbard, Richard J B Dobson.
Abstract
Neurodegenerative disorders such as Parkinson's and Alzheimer's disease are devastating and costly illnesses and a source of major global burden. To provide successful interventions for patients and reduce costs, both causes and pathological processes need to be understood. The ApiNATOMY project aims to contribute to our understanding of neurodegenerative disorders by manually curating and abstracting data from the vast body of literature amassed on these illnesses. As curation is labour-intensive, we aimed to speed up the process by automatically highlighting those parts of the PDF document of primary importance to the curator. Using techniques similar to those of summarisation, we developed an algorithm that relies on linguistic, semantic and spatial features. Employing this algorithm on a test set manually corrected for tool imprecision, we achieved a macro F1-measure of 0.51, an increase of 132% over the best bag-of-words baseline model. A user-based evaluation was also conducted to assess the usefulness of the methodology on 40 unseen publications; it revealed that in 85% of cases all highlighted sentences were relevant to the curation task and that in about 65% of cases the highlights were sufficient to support the knowledge curation task without consulting the full text. In conclusion, we believe that these are promising results for a step towards automating the recognition of curation-relevant sentences. Refining our approach to pre-digest papers will lead to faster processing and cost reduction in the curation process.
Database URL: https://github.com/KHP-Informatics/NapEasy
Year: 2017 PMID: 28365743 PMCID: PMC5467557 DOI: 10.1093/database/bax027
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Illustration of the individual steps of the developed pipeline.
Figure 2. The spatial distribution of goal sentences extracted from the papers in the development data set.
Figure 3. Sentence illustration of scoring and final result.
Performance results of the automated PDF highlights obtained with the described methodology
| Dataset | Corr | Micro-average Precision | Micro-average Recall | Micro-average F1-measure | Macro-average Precision | Macro-average Recall | Macro-average F1-measure |
|---|---|---|---|---|---|---|---|
| Development | no | 0.50 | 0.53 | 0.52 | 0.51 | 0.54 | 0.53 |
| Test | no | 0.25 | 0.38 | 0.30 | 0.28 | 0.39 | 0.32 |
| test_corrected | yes | 0.50 | 0.47 | 0.49 | 0.53 | 0.49 | 0.51 |
As articles originate from different journals and differ in length and highlights, we report both micro- and macro-averaged performance measures for all the assessed data sets (development, test and test_corrected).
Corr: signifies whether the data set was manually corrected before the measures were calculated.
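The difference between the two averaging schemes can be illustrated with a minimal sketch. The per-paper counts below are hypothetical and are not taken from the study. The source does not state how the macro-averaged F1-measure was derived, so the sketch assumes it is the unweighted mean of the per-paper F1 values.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts; 0.0 where undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def micro_average(per_paper):
    """Pool the counts of all papers, then compute the measures once."""
    tp = sum(c[0] for c in per_paper)
    fp = sum(c[1] for c in per_paper)
    fn = sum(c[2] for c in per_paper)
    return prf(tp, fp, fn)


def macro_average(per_paper):
    """Compute the measures per paper, then take the unweighted mean.

    Assumption: macro F1 is the mean of per-paper F1 values.
    """
    scores = [prf(*c) for c in per_paper]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))


# Hypothetical (true positives, false positives, false negatives) per paper:
# a long paper with many highlights and a short one with few.
counts = [(8, 2, 2), (1, 3, 1)]
micro = micro_average(counts)
macro = macro_average(counts)
print("micro P/R/F1:", [round(x, 2) for x in micro])
print("macro P/R/F1:", [round(x, 2) for x in macro])
```

In this toy case the long paper dominates the micro-average, while the macro-average gives both papers equal weight, which is why the two can diverge when articles differ in length and number of highlights.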
Performance results of the bag-of-words baseline approaches. Perceptron, passive aggressive classifier, kNN and random forest are four well-known binary classification algorithms; 'test' is the test data set containing 58 papers; 'test_corrected' is a set of 22 papers in which PDF processing errors were corrected; the best performances are highlighted in the table as bold values.
| Algorithm | Test Precision | Test Recall | Test F1-measure | test_corrected Precision | test_corrected Recall | test_corrected F1-measure |
|---|---|---|---|---|---|---|
| Perceptron | 0.23 | 0.20 | | 0.26 | 0.19 | |
| Passive aggressive classifier | 0.27 | 0.18 | | 0.28 | 0.16 | 0.21 |
| kNN | 0.80 | 0.01 | 0.02 | 0.80 | 0.01 | 0.02 |
| Random forest | 0.63 | 0.04 | 0.08 | 0.69 | 0.04 | 0.08 |
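To make the idea of a bag-of-words baseline concrete, the following is a minimal sketch of a perceptron that classifies sentences as highlight (1) or non-highlight (0) from token counts alone. It is not the authors' implementation: the whitespace tokeniser, the number of epochs and the four example sentences are all assumptions made for illustration.

```python
from collections import Counter


def bag_of_words(sentence):
    """Lowercased token counts (naive whitespace tokeniser, an assumption)."""
    return Counter(sentence.lower().split())


def score(weights, bias, features):
    """Linear score of a bag-of-words feature vector."""
    return bias + sum(weights[tok] * n for tok, n in features.items())


def train_perceptron(examples, epochs=10):
    """Classic perceptron updates over (sentence, label) pairs, label in {0, 1}."""
    weights, bias = Counter(), 0.0
    for _ in range(epochs):
        for sentence, label in examples:
            features = bag_of_words(sentence)
            predicted = 1 if score(weights, bias, features) > 0 else 0
            error = label - predicted  # -1, 0 or +1
            if error:
                for tok, n in features.items():
                    weights[tok] += error * n
                bias += error
    return weights, bias


def predict(model, sentence):
    weights, bias = model
    return 1 if score(weights, bias, bag_of_words(sentence)) > 0 else 0


# Hypothetical training sentences; 1 = curation-relevant highlight.
examples = [
    ("we aimed to test whether dopamine loss predicts tremor", 1),
    ("patients were recruited from two clinics", 0),
    ("our results show that dopamine loss predicts tremor", 1),
    ("the authors thank the funding agency", 0),
]
model = train_perceptron(examples)
predictions = [predict(model, s) for s, _ in examples]
print(predictions)
```

Because such a model sees only word counts, it has no access to where a sentence sits in the paper or to any semantic annotation, which is the kind of information the described methodology adds on top of lexical cues.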
Figure 4. The assessment results of a user-based evaluation in a scenario of supporting knowledge curation.