Geraint Duck, Aleksandar Kovacevic, David L Robertson, Robert Stevens, Goran Nenadic.
Abstract
BACKGROUND: There are numerous options available to achieve various tasks in bioinformatics, but until recently, there were no tools that could systematically identify mentions of databases and tools within the literature. In this paper we explore the variability and ambiguity of database and software name mentions, and compare dictionary and machine learning approaches to their identification.
Keywords: Bioinformatics; CRF; Computational biology; Dictionary; Resource extraction; Text-mining
Year: 2015 PMID: 26131352 PMCID: PMC4485340 DOI: 10.1186/s13326-015-0026-0
Source DB: PubMed Journal: J Biomed Semantics
Sources from which the database and software name dictionary is compiled
| Type | Entries | Variants | Source |
|---|---|---|---|
| DB | 195 | 298 | databases.biomedcentral.com |
| SW | 263 | 278 | |
| PK | 799 | 799 | |
| SW | 2033 | 2087 | |
| SW | 389 | 391 | evolution.genetics.washington.edu/phylip/software.html |
| DB | 379 | 379 | |
| DB | 1452 | 1670 | |
| SW | 135 | 135 | |
| SW | 36 | 41 | |
| SW | 1149 | 1183 | en.wikipedia.org/wiki/Wiki/< |
| SW, DB | 171 | 231 | Manually added entries |
| Our dictionary (DB, SW, PK) | 6929 | 7322 | |
Note that entries and variants are not necessarily unique to a single resource list.
DB databases, SW software, PK packages; data correct and accessible as of February 28th, 2012
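The dictionary approach compared in the paper resolves names by look-up against this compiled list. As a rough illustration (not the authors' implementation), a greedy longest-match tagger over a token sequence might look like:

```python
def dictionary_match(tokens, dictionary):
    """Greedy longest-match look-up of dictionary names in a token sequence.

    `dictionary` maps name variants (as token tuples) to a type label
    (DB, SW or PK). This is an illustrative sketch, not the authors'
    exact matcher.
    """
    max_len = max((len(k) for k in dictionary), default=0)
    matches, i = [], 0
    while i < len(tokens):
        # Try the longest possible span first, then shrink
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + n])
            if span in dictionary:
                matches.append((i, i + n, dictionary[span]))
                i += n
                break
        else:
            i += 1
    return matches
```

For example, with a two-entry dictionary, `dictionary_match("We used BLAST and the Gene Ontology database".split(), {("Gene", "Ontology"): "DB", ("BLAST",): "SW"})` yields the two annotated spans.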
Token-specific orthographic features extracted by regular expressions
| Name | Description |
|---|---|
| isAcronym | token is an acronym |
| containsAllCaps | all the letters in the token are capitalised |
| isCapitalised | token is capitalised |
| containsCapLetter | token contains at least one capital letter |
| containsDigits | token contains at least one digit |
| isAllDigits | token is made up of digits only |
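These token-level features can be implemented directly as regular expressions. The patterns below are illustrative assumptions; the authors' exact expressions are not given in this excerpt:

```python
import re

# Illustrative regular expressions for the orthographic features listed above.
ORTHOGRAPHIC_FEATURES = {
    "isAcronym": re.compile(r"^[A-Z][A-Z0-9]+$"),            # e.g. "BLAST"
    "containsAllCaps": re.compile(r"^[^a-z]*[A-Z][^a-z]*$"),  # no lowercase letters
    "isCapitalised": re.compile(r"^[A-Z]"),                   # starts with a capital
    "containsCapLetter": re.compile(r"[A-Z]"),                # at least one capital
    "containsDigits": re.compile(r"\d"),                      # at least one digit
    "isAllDigits": re.compile(r"^\d+$"),                      # digits only
}

def orthographic_features(token):
    """Return the set of orthographic feature names that fire for a token."""
    return {name for name, pattern in ORTHOGRAPHIC_FEATURES.items()
            if pattern.search(token)}
```

A fully capitalised token such as "BLAST" fires the acronym and capitalisation features; a plain lowercase token fires none.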
Statistics describing the manually annotated corpora
| | Development | Test |
|---|---|---|
| Total number of documents | 60 | 25 |
| Total database and software mentions | 2416 | 1479 |
| Total unique resource mentions | 401 | 301 |
| Percentage of database mentions | 36 % | 28 % |
| Percentage of unique database mentions | 27 % | 30 % |
| Average mentions per document | 40.3 | 59.2 |
| Average unique mentions per document | 8.1 | 13.4 |
| Maximum mentions in a single document | 227 | 217 |
| Maximum unique mentions in a single document | 57 | 55 |
| Resources with only a single lexicographic mention | 201 | 147 |
Fig. 1 Top token frequencies within the manually compiled dictionary. The figure shows the most common stemmed tokens contained within all the resource names found within our manually compiled dictionary. The top token is "database", with a count of 474, followed by "ontology" with 187 instances. Note that the scale is logarithmic (log base 2) and the y-axis crosses at eight rather than zero (for aesthetic reasons). The top terms are labelled
Internal POS structure of database and software names (the development corpus)
| Pattern | Count | Frequency |
|---|---|---|
| NNP | 258 | 63.7 % |
| NNP NNP | 34 | 8.4 % |
| NNP NNP NNP | 26 | 6.4 % |
| NN | 20 | 4.9 % |
| NNP CD | 16 | 4.0 % |
| NNP NNP NNP NNP | 8 | 2.0 % |
| Other Patterns | 43 | 10.6 % |
NNP proper noun, NN singular noun, CD cardinal number
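Counting these internal POS patterns is straightforward once each resource name has been tagged. A minimal sketch (the `(token, tag)` pairs here are illustrative, not taken from the corpus):

```python
from collections import Counter

def pos_patterns(annotated_names):
    """Count internal POS-tag patterns over a list of tagged resource names.

    Each name is a list of (token, POS) pairs, e.g. output of any POS tagger.
    Returns a mapping from pattern string to (count, percentage).
    """
    counts = Counter(" ".join(tag for _, tag in name) for name in annotated_names)
    total = sum(counts.values())
    return {pat: (n, round(100.0 * n / total, 1)) for pat, n in counts.items()}
```

Applied to names tagged as, say, `[("BLAST", "NNP")]` and `[("Gene", "NNP"), ("Ontology", "NNP")]`, this yields the per-pattern counts and frequencies reported in the table.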
Evaluation results on the development and test corpora
| Development corpus | Recall (%) | Precision (%) | F-score (%) |
|---|---|---|---|
| Dictionary | 49 (47) | 38 (37) | 43 (41) |
| CRF with post-processing | 58 (52) | 76 (67) | 65 (58) |
| CRF without post-processing | 54 (49) | 78 (70) | 62 (57) |
| Test corpus | | | |
| Dictionary | 46 (44) | 46 (44) | 46 (44) |
| CRF with post-processing | 60 (54) | 83 (74) | 70 (63) |
| CRF without post-processing | 53 (45) | 71 (65) | 62 (53) |
Strict scores provided in brackets
P Precision, R Recall, F F-score; evaluation on the development (5-fold cross-validated) and test corpora
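The bracketed strict scores require exact span boundaries, while the lenient scores relax this. A sketch of span-level evaluation, assuming the lenient criterion is any overlap (the paper's exact lenient rule may differ):

```python
def span_prf(gold, predicted, strict=True):
    """Span-level precision/recall/F-score over (start, end) span sets.

    Strict scoring requires exact boundary matches; the lenient variant
    here counts any character/token overlap as a hit (an assumption).
    """
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    def hit(p):
        return p in gold if strict else any(overlaps(p, g) for g in gold)

    tp = sum(1 for p in predicted if hit(p))
    precision = tp / len(predicted) if predicted else 0.0
    # Recall is counted over gold spans matched by at least one prediction
    matched = sum(1 for g in gold
                  if any((p == g) if strict else overlaps(p, g) for p in predicted))
    recall = matched / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```

A prediction with one exact and one partially overlapping span scores 50 % strict but 100 % lenient, which is why the bracketed numbers are always lower.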
Dictionary matching results on the development corpus
| Fold | Recall (%) | Precision (%) | F-score (%) |
|---|---|---|---|
| 1 | 46 (43) | 41 (39) | 43 (41) |
| 2 | 34 (31) | 37 (34) | 36 (32) |
| 3 | 36 (34) | 24 (23) | 29 (27) |
| 4 | 55 (53) | 46 (45) | 50 (49) |
| 5 | 76 (75) | 44 (43) | 56 (55) |
| Min | 34 (31) | 24 (23) | 29 (27) |
| Max | 76 (75) | 46 (45) | 56 (55) |
| Mean | 49 (47) | 38 (37) | 43 (41) |
Note that for Fold 3, a decrease of about 8 % in F-score is observed if the LINNAEUS abbreviation detection is disabled. Strict scores provided in brackets
P Precision, R Recall, F F-score on the development set using dictionary look-up
Machine learning results with post-processing on the development corpus
| Fold | Recall (%) | Precision (%) | F-score (%) |
|---|---|---|---|
| 1 | 51 (44) | 71 (60) | 59 (51) |
| 2 | 44 (35) | 88 (71) | 59 (47) |
| 3 | 51 (44) | 76 (66) | 61 (53) |
| 4 | 65 (60) | 73 (67) | 69 (63) |
| 5 | 80 (76) | 74 (70) | 77 (73) |
| Min | 44 (35) | 71 (60) | 59 (47) |
| Max | 80 (76) | 88 (71) | 77 (73) |
| Mean | 58 (52) | 76 (67) | 65 (58) |
| Micro Avg | 56 (50) | 76 (67) | 65 (57) |
Strict scores provided in brackets
P Precision, R Recall, F F-score on the development set using machine learning with post-processing (5-fold cross-validation)
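The Mean and Micro Avg rows differ because macro averaging takes the mean of per-fold F-scores, whereas micro averaging pools the underlying counts across folds before computing a single score. A sketch with illustrative counts (not the paper's):

```python
def macro_micro_f(fold_counts):
    """Macro vs micro averaged F-score over cross-validation folds.

    `fold_counts` is a list of (tp, fp, fn) triples, one per fold.
    Macro averages the per-fold F-scores; micro pools the counts first.
    """
    def f_score(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    macro = sum(f_score(*c) for c in fold_counts) / len(fold_counts)
    tp = sum(c[0] for c in fold_counts)
    fp = sum(c[1] for c in fold_counts)
    fn = sum(c[2] for c in fold_counts)
    micro = f_score(tp, fp, fn)
    return macro, micro
```

Micro averaging gives large folds more weight, so the two values diverge whenever fold sizes or difficulties are uneven, as in the tables here.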
Machine learning results without post-processing on the development set
| Fold | Recall (%) | Precision (%) | F-score (%) |
|---|---|---|---|
| 1 | 46 (41) | 78 (69) | 58 (51) |
| 2 | 42 (35) | 89 (75) | 57 (48) |
| 3 | 45 (41) | 75 (70) | 56 (52) |
| 4 | 60 (55) | 71 (66) | 65 (60) |
| 5 | 76 (74) | 74 (72) | 75 (73) |
| Min | 42 (35) | 71 (66) | 56 (52) |
| Max | 76 (74) | 89 (75) | 75 (73) |
| Mean | 54 (49) | 78 (70) | 62 (57) |
| Micro Avg | 52 (47) | 77 (70) | 62 (56) |
P Precision, R Recall, F F-score on the development set using machine learning without post-processing (5-fold cross-validation). Strict scores provided in brackets
Combined dictionary and machine learning results on the development set
| Fold | Recall (%) | Precision (%) | F-score (%) |
|---|---|---|---|
| 1 | 56 (49) | 43 (38) | 49 (42) |
| 2 | 50 (41) | 45 (37) | 48 (39) |
| 3 | 57 (52) | 32 (29) | 41 (37) |
| 4 | 68 (64) | 45 (42) | 54 (51) |
| 5 | 87 (84) | 45 (43) | 59 (57) |
| Min | 50 (41) | 32 (29) | 41 (37) |
| Max | 87 (84) | 45 (43) | 59 (57) |
| Mean | 64 (58) | 42 (38) | 50 (45) |
P Precision, R Recall, F F-score on the development set combining the dictionary and machine learning annotations (5-fold cross-validation). Strict scores provided in brackets
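Combining the two systems raises recall at the expense of precision, consistent with simply taking the union of both annotation sets. An illustrative merge (the overlap-resolution strategy, keeping the longer of two overlapping spans, is an assumption):

```python
def combine_annotations(dict_spans, ml_spans):
    """Union of dictionary and machine-learning annotation spans.

    Any mention found by either system is kept, which raises recall but
    lowers precision. Overlapping spans are resolved by keeping the
    earlier-starting (and, on ties, longer) span.
    """
    spans = sorted(set(dict_spans) | set(ml_spans),
                   key=lambda s: (s[0], -(s[1] - s[0])))
    merged = []
    for s in spans:
        if merged and s[0] < merged[-1][1]:  # overlaps the previous kept span
            continue
        merged.append(s)
    return merged
```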
Feature impact analysis of the machine learning model without post-processing on the development set
| Feature group | Recall (%) | Precision (%) | F-score (%) |
|---|---|---|---|
| All features | 54 (49) | 78 (70) | 62 (57) |
| No lexical features | 46 (43) | 68 (62) | 54 (50) |
| No syntactic features | 53 (48) | 77 (69) | 61 (55) |
| No orthographic features | 48 (43) | 70 (62) | 55 (50) |
| No dictionary features | 49 (44) | 70 (62) | 57 (51) |
P Precision, R Recall, F F-score; comparison of feature-group contributions. Strict scores provided in brackets
Types of textual patterns and clues for identification of database and software names
| Type | Contribution to total TPs |
|---|---|
| Machine learning matches | 55.3 % |
| Heads and Hearst Patterns | 9.8 % |
| Title appearances | 0.5 % |
| References and URLs | 1.8 % |
| Version information | 0.9 % |
| Noun/verb associations | 21.4 % |
| Comparisons | 4.0 % |
| Remaining | 6.3 % |
Tables 12, 13, 14, 15, 16 and 17 each provide examples of the above classes
Example clues and phrases appearing with specific heads or in Hearst patterns
| … the stochastic |
| The |
Database and software names are in italics, the associated clue is in bold
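Clues like these (a type-indicating head noun, or a "such as" Hearst pattern) can be matched with simple regular expressions. The patterns below are illustrative assumptions, not the authors' rule set:

```python
import re

# Illustrative Hearst-style patterns: a resource-type head followed by
# "such as <Name>", or "the <Name> database/tool/program/package".
HEARST_PATTERNS = [
    re.compile(r"\b(?:tools?|databases?|programs?|packages?)\s+such\s+as\s+([A-Z]\w+)"),
    re.compile(r"\bthe\s+([A-Z]\w+)\s+(?:database|tool|program|package)\b"),
]

def hearst_matches(sentence):
    """Return candidate resource names found by the patterns above."""
    return [m.group(1) for p in HEARST_PATTERNS for m in p.finditer(sentence)]
```

For instance, "tools such as BLAST" and "the ChEMBL database" both surface the capitalised name as a candidate resource mention.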
Example phrases from title appearances
Database and software names are in italics. Notice that in each case, the name is given as the initial part of the paper’s full title (preceding the colon)
Example versioning clues
| … using |
Database and software names are in italics, the associated clue is in bold
Example expressions that functionally indicate database and software mentions
| … the |
| A typical |
Database and software names are in italics, the associated clue is in bold
Examples of comparisons between database and software names
| … the numbers of breakpoint sites by |
Database and software names are in italics, the associated clue is in bold
Example phrases with no clear or discriminative clues
| Additionally, |
| In addition, |
| The results show that |
| The structure of |
Database and software names are in italics