Qiang Wei, Yaoyun Zhang, Muhammad Amith, Rebecca Lin, Jenay Lapeyrolerie, Cui Tao, Hua Xu.
Abstract
Software tools are now essential to research and applications in the biomedical domain. However, existing software repositories are built mainly through manual curation, which is time-consuming and does not scale. In this study, we manually annotated software names in 1,120 MEDLINE abstracts and titles and used this corpus to develop and evaluate machine learning-based named entity recognition systems for biomedical software. Specifically, we proposed two feature engineering strategies: (1) domain knowledge features and (2) unsupervised word representation features derived from clustered and binarized word embeddings. Using inexact matching criteria, our best system achieved an F-measure of 91.79% for recognizing software in titles and 86.35% for recognizing software in both titles and abstracts. We then used the developed system to create a biomedical software catalog with 19,557 entries. This study demonstrates the feasibility of using natural language processing methods to automatically build a high-quality software index from biomedical literature.
Keywords: biomedical literature; biomedical software; biomedical software index; named entity recognition; natural language processing
Year: 2019 PMID: 31566474 PMCID: PMC7334865 DOI: 10.1177/1460458219869490
Source DB: PubMed Journal: Health Informatics J ISSN: 1460-4582 Impact factor: 2.681
Figure 1. Study design for automated software recognition from biomedical literature.
Figure 2. An example of annotated biomedical literature for software names.
Figure 3. An example of BIO representation of software names.
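As a rough illustration of the BIO scheme named in the Figure 3 caption, the sketch below tags each token as B (beginning of a software name), I (inside one), or O (outside any name). The function name `to_bio`, the span convention, and the software names in the example are hypothetical and are not taken from the study.

```python
from typing import List, Tuple


def to_bio(tokens: List[str], spans: List[Tuple[int, int]]) -> List[str]:
    """Tag tokens with B/I/O labels.

    spans holds (start, end) token indices of annotated software names,
    with end exclusive. This convention is an assumption for the sketch.
    """
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = "B"              # first token of the name
        for i in range(start + 1, end):
            tags[i] = "I"              # remaining tokens of the name
    return tags


# Hypothetical sentence with one two-token name and one single-token name.
tokens = ["We", "used", "Foo", "Toolkit", "and", "BarKit", "."]
spans = [(2, 4), (5, 6)]
print(list(zip(tokens, to_bio(tokens, spans))))
```

A sequence labeler trained on such tags predicts one label per token, and contiguous B/I runs are then read back as software name mentions.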
Example of features for developing the machine learning model.
| Feature type | Feature values |
|---|---|
| | StemWord=[wommbat], WordShapel=[AaAAAAA], … |
| | …, TRIGRAM0=[present+wommbat+(], …, BIGRAM-2=[we+present], …, BIGRAM0=[wommbat+(] …, BIGRAM2=[work+memori] … |
| | SentFeaLen=[6+], SEN_STARTWITH_ENUM=[FALSE], … |
| | Prefix1=[W], Prefix2=[Wo], Prefix3=[WoM], …, Suffix1=[T] |
| | Section=[ABSTRACT] |
| | DictFeaUNI-1=[TK], DictFeaUNI-0=[TK], DictFeaUNI+1=[TK], … |
| | RegCAPSMIX=[TRUE], RegEND_PUNCTATION=[FALSE], RegHAS_CAP=[TRUE], RegIS_DASH=[FALSE], … |
| | EB_0=[NEU], EB_1=[NEU], EB_2=[NEU], EB_3=[NEU], EB_4=[POS], EB_5=[NEU], … |
| | DLFeaUNI-1=[642], DLFeaUNI-0=[N], DLFeaUN+1=[382], … |
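The feature values above can be approximated with a few lines of code. The following is a minimal sketch, not the authors' implementation. It reproduces the word shape, prefix, suffix, and n-gram values shown for the token "WoMMBAT", using lowercasing as a stand-in for stemming, and then shows one common way to discretize embedding dimensions into POS/NEG/NEU labels like the `EB_*` values. The thresholding rule (per-dimension means of positive and negative values), the meaning assigned to `RegHAS_CAP`, and all function names are assumptions; the source does not specify them.

```python
from typing import Dict, List


def word_shape(token: str) -> str:
    """Map uppercase to 'A', lowercase to 'a', digits to '0'; keep the rest."""
    out = []
    for ch in token:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("0")
        else:
            out.append(ch)
    return "".join(out)


def token_features(tokens: List[str], i: int) -> Dict[str, str]:
    """Lexical features for tokens[i]; lowercasing replaces stemming (assumption)."""
    tok = tokens[i]
    low = [t.lower() for t in tokens]
    feats = {
        "WordShape": word_shape(tok),
        "Prefix1": tok[:1],
        "Prefix2": tok[:2],
        "Prefix3": tok[:3],
        "Suffix1": tok[-1:],
        # Assumed meaning: the token contains at least one capital letter.
        "RegHAS_CAP": str(any(c.isupper() for c in tok)).upper(),
    }
    if i + 1 < len(tokens):
        feats["BIGRAM0"] = low[i] + "+" + low[i + 1]
    if i >= 1 and i + 1 < len(tokens):
        feats["TRIGRAM0"] = "+".join(low[i - 1:i + 2])
    if i >= 2:
        feats["BIGRAM-2"] = low[i - 2] + "+" + low[i - 1]
    return feats


def binarize_embeddings(vectors: Dict[str, List[float]]) -> Dict[str, List[str]]:
    """Discretize each dimension into POS / NEG / NEU.

    Assumed rule: a value is POS if it is at least the mean of the positive
    values in that dimension, NEG if it is at most the mean of the negative
    values, and NEU otherwise.
    """
    dims = len(next(iter(vectors.values())))
    pos_mean, neg_mean = [], []
    for d in range(dims):
        column = [v[d] for v in vectors.values()]
        pos = [x for x in column if x > 0]
        neg = [x for x in column if x < 0]
        pos_mean.append(sum(pos) / len(pos) if pos else float("inf"))
        neg_mean.append(sum(neg) / len(neg) if neg else float("-inf"))
    return {
        word: [
            "POS" if x >= pos_mean[d] else "NEG" if x <= neg_mean[d] else "NEU"
            for d, x in enumerate(vec)
        ]
        for word, vec in vectors.items()
    }


print(token_features(["We", "present", "WoMMBAT", "("], 2))
```

Discretized features of this kind let a linear sequence model use dense embeddings as ordinary categorical features, at the cost of discarding fine-grained magnitude information.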
Summary of rules for post-processing.
| | Description |
|---|---|
| Patterns | (a) A string at the beginning of a title that is followed by a colon, hyphen, or similar punctuation could be a software name. |
| | (b) A string matching the pattern “the * software \| package \| library \| tool \| toolkit \| bundle \| browser” could be a software name. |
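The two pattern rules above lend themselves to regular expressions. The sketch below is one possible reading of them, under stated assumptions: rule (a) is limited to a single leading token followed by a colon or a spaced hyphen, and rule (b) captures a single token between "the" and one of the listed head nouns. The pattern names, function names, and example software names are hypothetical; the study's actual rules may differ.

```python
import re
from typing import List

# Rule (a), simplified: one token at the start of a title, then ":" or " - ".
TITLE_LEAD = re.compile(r"^\s*(\S+?)\s*(?::|\s-\s)")

# Rule (b), simplified: "the <token> software|package|library|tool|..."
THE_X_TOOL = re.compile(
    r"\bthe\s+(\S+)\s+(?:software|package|library|toolkit|tool|bundle|browser)\b",
    re.IGNORECASE,
)


def candidates_from_title(title: str) -> List[str]:
    """Return the leading token of a title if it matches rule (a)."""
    m = TITLE_LEAD.match(title)
    return [m.group(1)] if m else []


def candidates_from_sentence(sentence: str) -> List[str]:
    """Return every token that fills the '*' slot of rule (b)."""
    return [m.group(1) for m in THE_X_TOOL.finditer(sentence)]


print(candidates_from_title("FooTool: a hypothetical package for sequence analysis"))
print(candidates_from_sentence("We analysed the data with the FooTool package."))
```

Rules like these trade precision for recall: a phrase such as "the new software" would also match rule (b), which is consistent with the rules being described as finding strings that "could be" software names.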
Performance of software name recognition from biomedical literature (%).
| System / feature set | Matching | Precision | Recall | F-measure |
|---|---|---|---|---|
| Baseline system bioNerDS | Exact | 31.52 | 18.98 | 23.69 |
| | Inexact | 65.25 | 39.29 | 49.05 |
| Baseline feature | Exact | 81.12 | 59.38 | 68.57 |
| | Inexact | 92.27 | 67.54 | 77.99 |
| Domain knowledge feature | | | | |
| Dictionary feature | Exact | 81.07 | 59.43 | 68.59 |
| | Inexact | 92.20 | 67.59 | 78.00 |
| Orthographic feature | Exact | 80.78 | 59.93 | 68.81 |
| | Inexact | 92.15 | 68.37 | 78.50 |
| Section feature | Exact | 80.40 | 60.32 | 68.93 |
| | Inexact | 91.86 | 68.92 | 78.76 |
| Word representation feature | | | | |
| Discrete word embedding feature | Exact | 79.31 | 63.82 | 70.73 |
| | Inexact | 91.10 | 73.31 | 81.24 |
| Clustering of word embedding feature | Exact | 79.84 | 64.59 | 71.41 |
| | Inexact | 91.22 | 73.81 | 81.60 |
| Post-processing: rule (1a) | Exact | 79.28 | 64.76 | 71.29 |
| | Inexact | 91.24 | 74.53 | 82.04 |
| Post-processing: rule (1b) | Exact | 78.78 | 65.32 | 71.42 |
| | Inexact | 90.90 | 75.36 | 82.40 |
| Post-processing: rule (2) | Exact | 69.65 | 72.20 | 70.90 |
| | Inexact | 84.69 | 87.79 | 86.21 |
| Post-processing: rule (3) | Exact | 70.36 | 71.53 | 70.94 |
| | Inexact | 85.64 | 87.07 | 86.35 |
Each type of feature was added into the software recognition system incrementally.
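To make the exact and inexact rows concrete, here is a small sketch of span-level scoring. It assumes that "inexact" means a predicted span overlaps a gold span by at least one token; the source does not define the criterion, so treat that as an assumption, along with the function names. The F-measure is the usual harmonic mean of precision and recall.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 if both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def score(gold: List[Span], pred: List[Span], exact: bool = True):
    """Return (precision, recall, F-measure) under exact or overlap matching."""
    match = (lambda g, p: g == p) if exact else overlaps
    matched_pred = sum(any(match(g, p) for g in gold) for p in pred)
    matched_gold = sum(any(match(g, p) for p in pred) for g in gold)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)


# Hypothetical gold and predicted spans for one sentence.
gold = [(0, 2), (5, 6)]
pred = [(0, 1), (5, 6), (8, 9)]
print(score(gold, pred, exact=True))
print(score(gold, pred, exact=False))

# Consistency check against the last inexact row of the table above.
print(round(f_measure(85.64, 87.07), 2))
```

Applying `f_measure` to the reported precision and recall of the final inexact row (85.64 and 87.07) gives 86.35 after rounding, which agrees with the table.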
Performance of software name recognition from titles of biomedical literature (%).
| System / feature set | Matching | Precision | Recall | F-measure |
|---|---|---|---|---|
| Baseline system bioNerDS | Exact | 38.73 | 21.47 | 27.63 |
| | Inexact | 76.88 | 42.77 | 54.96 |
| Baseline feature | Exact | 90.91 | 70.51 | 79.42 |
| | Inexact | 97.11 | 75.32 | 84.84 |
| Domain knowledge feature | | | | |
| Dictionary feature | Exact | 90.98 | 71.15 | 79.86 |
| | Inexact | 96.72 | 75.64 | 84.89 |
| Orthographic feature | Exact | 87.27 | 74.68 | 80.48 |
| | Inexact | 94.38 | 80.77 | 87.05 |
| Section feature | Exact | 86.19 | 74.04 | 79.66 |
| | Inexact | 93.66 | 80.45 | 86.55 |
| Word representation feature | | | | |
| Discrete word embedding feature | Exact | 88.89 | 76.92 | 82.47 |
| | Inexact | 95.19 | 82.37 | 88.32 |
| Clustering of word embedding feature | Exact | 87.41 | 77.88 | 82.37 |
| | Inexact | 94.24 | 83.97 | 88.81 |
| Post-processing: rule (1a) | Exact | 84.25 | 78.85 | 81.46 |
| | Inexact | 94.18 | 88.14 | 91.06 |
| Post-processing: rule (1b) | Exact | 84.25 | 78.85 | 81.46 |
| | Inexact | 94.18 | 88.14 | 91.06 |
| Post-processing: rule (2) | Exact | 81.23 | 80.45 | 80.84 |
| | Inexact | 92.23 | 91.35 | 91.79 |
| Post-processing: rule (3) | Exact | 81.23 | 80.45 | 80.84 |
| | Inexact | 92.23 | 91.35 | 91.79 |
Each type of feature was added into the software recognition system incrementally.
Reasons and examples of false-positive and false-negative errors in software recognition from biomedical literature.
| Error type | Reasons | Examples |
|---|---|---|
| False positive | Similar orthographic characteristics | (a) Predicting AD conversion: comparison between prodromal AD guidelines and computer-assisted PredictAD tool. |
| | | (b) Similarly, one of |
| | Similar context | (c) The Sanger |
| | Complex syntactic structure | (d) One family of algorithms that has proven useful for disease classification is based on relative expression analysis and includes the |
| False negative | Lack of context; rare pattern | (e) The time consumption was as following: at analysis by |
| | | (f) The purpose of this work is to introduce the reader to an Addin implementation, |
| | | (g) RESULTS: A Perl script package called |
| | | (h) MSDB also contains other two subprograms: |
| | | (i) A thorough user’s guide is available within |
AD: Alzheimer’s disease; MSDB: Microsatellite Search and Building Database; SWR: search within results; SWP: sliding window plot.
Figure 4. Distribution of types of concepts misrecognized as software names.