Diana Sousa, Andre Lamurias, Francisco M Couto.
Abstract
Biomedical relation extraction (RE) datasets are vital for constructing knowledge bases and for fostering the discovery of new interactions. There are several ways to create biomedical RE datasets, some more reliable than others, such as resorting to domain expert annotations. However, the emerging use of crowdsourcing platforms, such as Amazon Mechanical Turk (MTurk), can potentially reduce the cost of RE dataset construction, even if the same level of quality cannot be guaranteed. Researchers have little control over who the workers are, how they work and in what context they engage with crowdsourcing platforms. Hence, combining distant supervision with crowdsourcing can be a more reliable alternative. The crowdsourcing workers would be asked only to rectify or discard already existing annotations, which would make the process less dependent on their ability to interpret complex biomedical sentences. In this work, we use a previously created distantly supervised human phenotype-gene relations (PGR) dataset to perform crowdsourcing validation. We divided the original dataset into two annotation tasks: Task 1, 70% of the dataset annotated by one worker, and Task 2, 30% of the dataset annotated by seven workers. For Task 2, we also added an extra rater on-site and a domain expert to further assess the crowdsourcing validation quality. Here, we describe a detailed pipeline for RE crowdsourcing validation, create a new release of the PGR dataset with partial domain expert revision, and assess the quality of the MTurk platform. We applied the new dataset to two state-of-the-art deep learning systems (BiOnt and BioBERT) and compared its performance with that of the original PGR dataset, as well as with combinations of the two, achieving a 0.3494 increase in average F-measure. The code supporting our work and the new release of the PGR dataset is available at https://github.com/lasigeBioTM/PGR-crowd.
Year: 2020 PMID: 33258966 PMCID: PMC7706181 DOI: 10.1093/database/baaa104
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. The pipeline to incorporate the PGR dataset into the Amazon MTurk platform, including the design, configuration and evaluation stages.
Figure 2. Examples of the two types of relations (‘Known’ and ‘Unknown’) in the PGR dataset (partial figure from (28)). The sentence of abstract PMID:23669344 was simplified to capture more clearly the ‘Known’ relation.
The number of abstracts, phenotype and gene annotations, and known, unknown and total relations for the second release (11 March 2019) of the PGR dataset (partial table from (4))
| Abstracts | Phenotype annotations | Gene annotations | Known relations | Unknown relations | Total relations |
|---|---|---|---|---|---|
| 2657 | 9553 | 23 786 | 2480 | 5483 | 7963 |
Figure 3. An example of a HIT presented to the workers and the available options.
Summary of the crowdsourcing task criteria and associated costs
| Setting | Task 1 | Task 2 |
|---|---|---|
| Reward per assignment (USD) | 0.02 | 0.02 |
| MTurk fee (USD) | 0.01 | 0.01 |
| Number of assignments per task | 1 | 7 |
| Minimum time per assignment | 3 s | 3 s |
| Require that workers be masters to do your tasks (high-performing workers according to MTurk) | Yes | Yes |
| Number of tasks | 5574 | 2389 |
| Total cost (USD) | 167.22 | 501.69 |
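The totals in the table are consistent with multiplying the number of tasks by the assignments per task and by the per-assignment price (reward plus MTurk fee). The following is a minimal sketch of that arithmetic; the function name `task_cost` is hypothetical, and the formula is inferred from the table values rather than stated by the authors.

```python
from decimal import Decimal

def task_cost(n_tasks, assignments_per_task, reward, fee):
    """Total cost in USD, assuming every assignment costs reward + fee."""
    return n_tasks * assignments_per_task * (reward + fee)

# Values taken from the table above; Decimal avoids float rounding on cents.
REWARD = Decimal("0.02")
FEE = Decimal("0.01")

task1_cost = task_cost(5574, 1, REWARD, FEE)  # one worker per relation
task2_cost = task_cost(2389, 7, REWARD, FEE)  # seven workers per relation

print(f"Task 1: {task1_cost} USD")
print(f"Task 2: {task2_cost} USD")
```

Under this reading, the sevenfold redundancy of Task 2 is what makes it roughly three times as expensive as Task 1 despite covering fewer relations.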
Figure 4. Flowchart illustrating how to reach majority consensus, according to the answers provided by the workers plus our extra rater on-site.
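As an illustration of majority consensus over the seven workers plus the extra rater, the sketch below implements a plain majority vote. It is not a transcription of the flowchart: the label strings and the function name `majority_label` are hypothetical, and ties are deliberately left unresolved (returned as `None`) because the tie-handling rule is not described in this text.

```python
from collections import Counter

def majority_label(votes):
    """Return the most frequent label, or None when there is no unique winner.

    Assumption: a plain plurality vote; the tie-breaking rule of the
    original flowchart is not reproduced here.
    """
    if not votes:
        return None
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the top labels: left unresolved
    return ranked[0][0]

# Hypothetical answers for one relation: seven workers plus one extra rater.
votes = ["true", "true", "false", "true", "exclude", "true", "false", "true"]
print(majority_label(votes))
```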
Precision, recall, F-measure and accuracy of the application of the PGR dataset (original, new and combinations between the two) to the BiOnt and BioBERT systems
| System | Dataset configuration | Precision | Recall | F-measure | Accuracy |
|---|---|---|---|---|---|
| BiOnt | PGR original | 0.8140 | 0.3070 | 0.4459 | 0.4821 |
| BiOnt | Amazon Task 1 (train) + PGR original (test) | 0.7000 | 0.9825 | 0.8175 | 0.7024 |
| BiOnt | Amazon Task 1 (train) + Amazon/extra-rater consensus Task 2 (test) | 0.6810 | 0.9670 | 0.7992 | 0.6726 |
| BiOnt | Amazon Task 1 (train) + Expert Task 2 (test) | | 0.9721 | | |
| BiOnt | Amazon/extra-rater consensus Task 2 (train) + PGR original (test) | 0.6880 | 0.8509 | 0.7608 | 0.6369 |
| BiOnt | Expert Task 2 (train) + PGR original (test) | 0.6894 | | 0.8072 | 0.6845 |
| BioBERT | PGR original | | 0.3445 | 0.4910 | 0.5143 |
| BioBERT | Amazon Task 1 (train) + PGR original (test) | 0.6744 | 0.9856 | 0.8000 | 0.6775 |
| BioBERT | Amazon Task 1 (train) + Amazon/extra-rater consensus Task 2 (test) | 0.6700 | 0.9763 | 0.7946 | 0.6680 |
| BioBERT | Amazon Task 1 (train) + Expert (test) | 0.8103 | | | |
| BioBERT | Amazon/extra-rater consensus Task 2 (train) + PGR original (test) | 0.7315 | 0.9160 | 0.8134 | 0.7143 |
| BioBERT | Expert Task 2 (train) + PGR original (test) | 0.7857 | 0.8319 | 0.8082 | 0.7314 |
The highest scores for each metric are presented in bold.
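The F-measure values reported for the complete rows agree, up to rounding, with the harmonic mean of precision and recall. The sketch below checks this for four BiOnt rows; the function name `f_measure` and the 0.001 tolerance are choices made for this illustration, not part of the original work.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F-measure) copied from complete BiOnt rows.
biont_rows = [
    (0.8140, 0.3070, 0.4459),
    (0.7000, 0.9825, 0.8175),
    (0.6810, 0.9670, 0.7992),
    (0.6880, 0.8509, 0.7608),
]

for precision, recall, reported in biont_rows:
    computed = f_measure(precision, recall)
    # Inputs are rounded to four decimals, so allow a small tolerance.
    print(f"computed={computed:.4f} reported={reported:.4f}")
```

The first row shows why the original dataset scores poorly on F-measure: its high precision cannot compensate for a recall near 0.31.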
The inter-rater agreement score, using both Fleiss’ kappa and Krippendorff’s alpha metrics, considering only the Amazon workers, the Amazon workers plus the extra rater (on-site) and the extra rater (on-site) plus the domain expert (Task 2)
| Inter-rater agreement metric | Amazon workers | Amazon workers + extra rater (on-site) | Extra rater (on-site) + expert |
|---|---|---|---|
| Fleiss’ kappa | 0.2028 | 0.2050 | 0.6549 |
| Krippendorff’s alpha | 0.2029 | 0.2051 | 0.6550 |
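For readers unfamiliar with the metric, the following is a minimal sketch of the standard Fleiss' kappa formula for a fixed number of raters per item. The toy counts are invented for illustration and are not the PGR annotations; the function name `fleiss_kappa` and the three-category layout (true, false, excluded) are assumptions.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every row must sum to the same
    number of raters (at least two)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Expected agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    return (p_observed - p_expected) / (1 - p_expected)

# Invented example: 4 relations, 7 raters, categories (true, false, excluded).
toy_counts = [
    [5, 1, 1],
    [2, 4, 1],
    [7, 0, 0],
    [3, 3, 1],
]
print(round(fleiss_kappa(toy_counts), 4))
```

A value of 1 indicates perfect agreement and a value near 0 indicates agreement no better than chance, which puts the worker scores of about 0.20 and the extra rater plus expert score of about 0.65 into context.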
The original and final numbers both in total count and percentage, for Tasks 1 and 2, of true, false, excluded and total relations, considering the majority consensus and the domain expert numbers separately
| Dataset | Annotation source | True | False | Excluded | Total |
|---|---|---|---|---|---|
| Task 1 (70%) | Original | 1751 (31.41%) | 3823 (68.59%) | – | 5574 (100%) |
| Task 1 (70%) | Amazon workers | 4220 (75.71%) | 283 (5.08%) | 1071 (19.21%) | 4503 (80.79%) |
| Task 2 (30%) | Original | 729 (30.51%) | 1660 (69.49%) | – | 2389 (100%) |
| Task 2 (30%) | Amazon workers + extra rater (on-site) (after reaching consensus) | 1179 (49.35%) | 613 (25.66%) | 597 (24.99%) | 1792 (75.01%) |
| Task 2 (30%) | Expert | 1281 (53.62%) | 343 (14.36%) | 765 (32.02%) | 1624 (67.98%) |
The number of abstracts, phenotype and gene annotations, and true, false and total relations for the third release of the PGR dataset, which consists of the revision by the Amazon workers (Task 1) plus the domain expert revision (Task 2)
| Abstracts | Phenotype annotations | Gene annotations | True relations | False relations | Total relations |
|---|---|---|---|---|---|
| 1921 | 1943 | 2207 | 5501 | 626 | 6127 |
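The relation counts of the third release can be cross-checked against the per-task numbers reported earlier: the Amazon workers' Task 1 counts plus the expert's Task 2 counts, with excluded relations dropped. The sketch below performs that sum; the dictionary and variable names are hypothetical.

```python
# Counts copied from the per-task table (excluded relations are dropped).
task1_amazon_workers = {"true": 4220, "false": 283}
task2_expert = {"true": 1281, "false": 343}

# Third release = Task 1 worker revision + Task 2 expert revision.
third_release = {
    label: task1_amazon_workers[label] + task2_expert[label]
    for label in task1_amazon_workers
}
third_release_total = sum(third_release.values())

print(third_release, third_release_total)
```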