Hong Cui
Abstract
BACKGROUND: Large volumes of morphological descriptions of whole organisms exist as print or electronic text in human-readable form. Converting these descriptions into computer-readable formats gives new life to this valuable knowledge of biodiversity. Research in this area started 20 years ago, yet progress has been insufficient to produce an automated system that requires only minimal human intervention and works on descriptions of various plant and animal groups. This paper examines the hindering factors by identifying the mismatches between existing research and the characteristics of morphological descriptions.
Year: 2010 PMID: 20500882 PMCID: PMC2887808 DOI: 10.1186/1471-2105-11-278
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. An annotated morphological description. Text enclosed in "<>" is a tag. Bold font marks paragraph-level annotation, bold italic marks clause-level annotation, and italic marks character-level annotation. The annotation was produced by an annotation system the author created for FNA.
Figure 2. Two regular expression patterns. The first (Soderland, 1999) extracts the number of bedrooms and the rent from apartment rental ads. It takes the digit before "BR" as the number of bedrooms ($1) and the number after a "$" as the rent ($2). The pattern produces the correct result for Input 1 but a wrong result for Input 2, because $600 was the price of one room, not of four rooms. The pattern will not match or extract anything from "1 large BR $500" or "1 master BR $500." The second (Tang & Heidorn, 2007) extracts leaf blade dimensions by looking for a range between the words "blade" and "base."
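As a rough illustration of why such patterns are brittle, the two patterns can be approximated with Python regular expressions. This is only a sketch: the expressions, the function names, and the sample texts are assumptions made for illustration. They are not the original patterns from Soderland (1999) or Tang & Heidorn (2007), and the first two ads are not the figure's Input 1 and Input 2.

```python
import re

# Hypothetical approximation of the rental pattern: a digit run directly
# before "BR" (group 1), then the first number after a "$" (group 2).
RENTAL = re.compile(r"(\d+)\s*BR\b.*?\$\s*(\d+)")

# Hypothetical approximation of the leaf-blade pattern: a numeric range with
# a unit, located between the words "blade" and "base".
BLADE = re.compile(
    r"\bblade\b.*?(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*(mm|cm|dm|m)\b.*?\bbase\b"
)


def extract_rental(ad):
    """Return (bedrooms, rent) or None when the pattern does not match."""
    m = RENTAL.search(ad)
    return (int(m.group(1)), int(m.group(2))) if m else None


def extract_blade_range(text):
    """Return (low, high, unit) or None when no range lies between the cue words."""
    m = BLADE.search(text)
    return (float(m.group(1)), float(m.group(2)), m.group(3)) if m else None


if __name__ == "__main__":
    # Invented ads; the second matches but pairs 4 bedrooms with a per-room price.
    print(extract_rental("Sunny 2BR apartment near campus, $950/month"))  # (2, 950)
    print(extract_rental("4 BR house, rooms from $600 each"))             # (4, 600)
    # A word between the digit and "BR" defeats the pattern entirely.
    print(extract_rental("1 large BR $500"))                              # None
    # Invented description clause.
    print(extract_blade_range("Leaf blade ovate, 3-7 cm, base cuneate"))  # (3.0, 7.0, 'cm')
```

The second ad shows the failure mode the caption describes: the pattern matches on surface form alone, so it cannot tell a per-room price from the rent for the whole unit.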
Review of the existing annotation techniques.
| Methods | Handmade prerequisites and their reusability | Annotation Level | Results and their reusability | Scope of evaluation | Performance (*) |
|---|---|---|---|---|---|
| | Lexicon & grammar rules: | 1. Paragraph | 1. Style clues: Less reusable. | 1. FNA v. 19 | 1. Not reported |
| | Training examples: Not good for another taxon group. | Paragraph | Classification models: Less reusable. | 1500+ descriptions from FNA | Recall: 94% Precision: 97% |
| | Dictionaries, | Character | Organ names & character states: | 1. 16 descriptions | 1. Accuracy on 1 sample: 76% |
| | Extraction template & training examples: | Character, limited to these character states: leaf shape, size, color; fruit type. | Extraction patterns: Sensitive to text variations, less reusable. | 1600 FNA species | Recall: 33%-80% |
| | Annotation template & training examples: | Clause | Association rules: Reusable only within the same taxon group | 16,000 descriptions from FNA, FOC, and FNCT | Recall and precision: 80%-95% |
| | No prerequisites | 1. Clause | Organ names & character states: | FNA, FOC, & Treatise Part H | Precision: 88%-95% |
* Precision is the proportion of the computer's decisions that are correct. Recall is the proportion of all targets that the computer correctly discovers.
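The two measures defined in the footnote can be sketched in a few lines of Python. The function name and the sample annotations are hypothetical, and comparing annotations as exact-match set members is an assumption; the reviewed studies may have scored matches differently.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall from two collections of annotations.

    predicted: annotations the system produced (its "decisions").
    gold: annotations a human judged correct (the "targets").
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    # Guard against empty inputs to avoid division by zero.
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Invented character-level annotations, written as "organ:state".
    system = {"leaf:ovate", "leaf:green", "stem:erect", "fruit:red"}
    reference = {"leaf:ovate", "leaf:green", "petal:white"}
    p, r = precision_recall(system, reference)
    print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

Here two of the four system decisions are correct (precision 0.50), and two of the three targets were found (recall 0.67).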
Figure 3. The counts of new domain concepts in Part V of TIP using different-sized common word filters.
Figure 4. The counts of new domain concepts in FNA using different-sized common word filters.
Figure 5. The counts of new domain concepts in FOC using different-sized common word filters.
Figure 6. Parsing trees produced by the Stanford Parser for descriptive sentences. The first two trees contrast the incorrect parsing of a descriptive sentence in the deviated grammar with the correct parsing of a similar sentence in standard English grammar. The remaining trees contrast the incorrect parsing of 3 typical descriptive clauses in the deviated syntax with the correct parsing obtained when the correct Part of Speech (POS) tags were given to the parser. The nodes closest to the words in the parsing trees are the POS tags.
Word repetition in morphological descriptions.
| FNA | 40 | 500 | 614 | 1.228 |
| FNA | 81 | 1048 | 834 | 0.800 |
| FNA | 942 | 12500 | 1959 | 0.157 |
| Treatise Part H | 2038 | 9760 | 2583 | 0.265 |
Figure 7. An overall strategy for automated semantic annotation of morphological descriptions of various taxon groups.