Asif Ekbal, Sriparna Saha, Utpal Kumar Sikdar.
Abstract
BACKGROUND: Named entity (NE) extraction is one of the most fundamental and important tasks in biomedical information extraction. It involves identifying certain entities in text and classifying them into predefined categories. The biomedical community has not yet reached a general consensus on NE annotation, so existing systems are very difficult to compare because of corpus incompatibilities. For the same reason, the advantages of using different corpora together cannot be exploited. In the present work we address these corpus compatibility issues and use a single objective optimization (SOO) based classifier ensemble technique, which relies on the search capability of a genetic algorithm (GA), for NE extraction in biomedicine. We hypothesize that the reliability of each classifier's predictions differs among the output classes. We use the Conditional Random Field (CRF) and Support Vector Machine (SVM) frameworks to build a number of models based on different representations of the feature set and/or feature templates. Note that we tried to extract the features without using any deep domain knowledge and/or resources.
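The class-specific reliability hypothesis can be illustrated with a small sketch. The abstract does not give the chromosome encoding or the fitness function, so everything below is an assumption made for illustration: a binary mask `mask[m][c]` decides whether classifier `m` may vote for class `c`, token accuracy on a development set stands in for the F-measure, and the names `ensemble_predict`, `fitness`, `evolve`, `dev_votes` and `dev_gold` as well as the toy data are hypothetical. The classifier outputs are plain class indices rather than real CRF or SVM predictions.

```python
import random

def ensemble_predict(votes, mask, n_classes):
    """Combine the outputs of all classifiers for one token.

    votes[m] is the class index predicted by classifier m;
    mask[m][c] == 1 means classifier m is allowed to vote for class c.
    """
    tally = [0] * n_classes
    for m, c in enumerate(votes):
        tally[c] += mask[m][c]
    # Ties (including an all-zero tally) resolve to the lowest class index.
    return max(range(n_classes), key=lambda c: tally[c])

def fitness(mask, dev_votes, dev_gold, n_classes):
    """Fraction of development tokens labelled correctly (stand-in for F-measure)."""
    hits = sum(ensemble_predict(v, mask, n_classes) == g
               for v, g in zip(dev_votes, dev_gold))
    return hits / len(dev_gold)

def evolve(dev_votes, dev_gold, n_classifiers, n_classes,
           pop_size=20, generations=30, p_mut=0.05, seed=0):
    """Toy single-objective GA over binary vote masks."""
    rng = random.Random(seed)

    def score(mask):
        return fitness(mask, dev_votes, dev_gold, n_classes)

    # Start from plain majority voting plus random masks.
    population = [[[1] * n_classes for _ in range(n_classifiers)]]
    population += [[[rng.randint(0, 1) for _ in range(n_classes)]
                    for _ in range(n_classifiers)]
                   for _ in range(pop_size - 1)]

    for _ in range(generations):
        population.sort(key=score, reverse=True)
        next_gen = population[:2]                      # elitism: keep the two best
        parents = population[:max(2, pop_size // 2)]   # truncation selection
        while len(next_gen) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover followed by bit-flip mutation.
            child = [[rng.choice((a[m][c], b[m][c])) for c in range(n_classes)]
                     for m in range(n_classifiers)]
            child = [[1 - bit if rng.random() < p_mut else bit for bit in row]
                     for row in child]
            next_gen.append(child)
        population = next_gen
    return max(population, key=score)

# Hypothetical development data: 3 classifiers, classes 0 = O, 1 = protein, 2 = DNA.
dev_votes = [(1, 1, 0), (2, 0, 0), (2, 2, 1), (0, 0, 0),
             (2, 2, 1), (1, 2, 2), (2, 2, 0), (1, 1, 2)]
dev_gold = [1, 0, 1, 0, 1, 2, 0, 1]

best_mask = evolve(dev_votes, dev_gold, n_classifiers=3, n_classes=3)
print("majority voting:", fitness([[1] * 3 for _ in range(3)], dev_votes, dev_gold, 3))
print("evolved mask:   ", fitness(best_mask, dev_votes, dev_gold, 3))
```

Because the all-ones mask is placed in the initial population and the two best masks are always carried over, the evolved mask can never score below plain majority voting on the development data. Whether it generalises to unseen text is a separate question that this sketch does not address.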
Year: 2013 PMID: 24294548 PMCID: PMC3837077 DOI: 10.1186/2193-1801-2-601
Source DB: PubMed Journal: Springerplus ISSN: 2193-1801
Overall evaluation results (we report percentages) on the original corpus (Saha et al. 2013)
| Corpus | Model | Recall | Precision | F-measure |
|---|---|---|---|---|
| GENIA | Best individual classifier | 73.10 | 76.78 | 74.90 |
| GENIA | SOO based ensemble | 74.17 | 77.87 | 75.97 |
| AIMed | Best individual classifier | 94.56 | 92.66 | 93.60 |
| AIMed | SOO based ensemble | 95.65 | 94.23 | 94.93 |
| GENETAG | Best individual classifier | 95.35 | 95.31 | 95.33 |
| GENETAG | SOO based ensemble | 95.99 | 95.81 | 95.90 |
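The record does not define how the F-measure column is computed. Assuming the usual balanced F-measure (the harmonic mean of recall and precision, i.e. beta = 1), the relationship between the three columns can be sketched as follows. The function name `f_measure` is illustrative, and the rows are copied from the table above.

```python
def f_measure(recall, precision, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    Inputs and output share the same unit (percentages here).
    beta = 1 gives the balanced F-measure; this choice is an assumption,
    since the source does not state which beta was used.
    """
    if recall == 0 and precision == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# (corpus, model, recall, precision, reported F-measure) from the table above
rows = [
    ("GENIA", "Best individual classifier", 73.10, 76.78, 74.90),
    ("AIMed", "Best individual classifier", 94.56, 92.66, 93.60),
    ("GENETAG", "SOO based ensemble", 95.99, 95.81, 95.90),
]

for corpus, model, r, p, reported in rows:
    print(f"{corpus:8s} {model:28s} computed={f_measure(r, p):.3f} reported={reported:.2f}")
```

Values recomputed from the rounded recall and precision can differ from the reported figure in the last digit, because the published numbers were presumably derived from unrounded counts.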
Comparison with the existing approaches for GENETAG data set
| System | Approach used | Domain knowledge/resources | F-measure |
|---|---|---|---|
| Our system | GA based ensemble (CRF and SVM) | PoS, phrase | 94.70 |
| Song et al. (2005) | SVM | - | 66.7 |
| Bickel et al. (2004) | SVM | a dictionary | 72.1 |
| Kinoshita et al. (2005) | TnT (Brants) HMM-based part-of-speech tagger | dictionary based postprocessing | 80.9 |
| Mitsumori et al. (2005) | SVM | gene/protein name dictionary | 78.09 |
| Finkel et al. (2004) | ME + post processing | | 82.2 |
| McDonald and Pereira (2005) | CRF | | 82.4 |
| GuoDong et al. (2005) (Zhou and Su) | HMM, SVM, Ensemble technique | Post processing | 82.58 |
Comparison with the existing approaches for GENIA data set
| System | Approach used | Domain knowledge/resources | F-measure |
|---|---|---|---|
| Our system | Classifier ensemble (CRF and SVM) | POS, phrase | 76.52 |
| Zhou & Su (2004) (GuoDong and Jian) | HMM, SVM | Name alias, cascaded NEs dictionary, POS, phrase | 72.55 |
| Zhou & Su (2004) (GuoDong and Jian) | HMM, SVM | POS, phrase | 64.1 |
| Kim et al. (2005) | Two-phase model with ME and CRF | POS, phrase, rule-based component | 71.19 |
| Finkel et al. (2004) | CRF | Gazetteers, web-querying, surrounding abstracts, abbreviation handling, BNC corpus, POS | 70.06 |
| Settles (2004) | ME | POS, semantic knowledge sources of 17 lexicons | 70.00 |
| Saha et al. (2009) | ME | POS, phrase | 67.41 |
| Park et al. (2004) | ME | POS, phrase, domain-salient words using WSJ, morphological patterns, collocations from Medline | 66.91 |
| Song et al. (2004) | SVM, CRF | POS, phrase, Virtual sample | 66.28 |
| Song et al. (2004) | SVM | POS, phrase | 63.85 |
| Ponomareva et al. (2007) | HMM | POS | 65.7 |
Evaluation results of the approach on cross-corpus datasets (we report percentages); here 'FM' denotes 'F-measure'
| Approach | Training set | Test set | Recall | Precision | FM |
|---|---|---|---|---|---|
| Best Ind. Classifier | JNLPBA (protein only)+AIMed | AIMed | 83.14 | 83.19 | 83.17 |
| SOO | JNLPBA (protein only)+AIMed | AIMed | 85.10 | 85.01 | 85.05 |
| Best Ind. Classifier | JNLPBA (protein + DNA)+AIMed | AIMed | 82.17 | 84.15 | 83.15 |
| SOO | JNLPBA (protein + DNA)+AIMed | 3-fold cross validation on AIMed | 84.07 | 86.01 | 85.03 |
| Best Ind. Classifier | JNLPBA (protein only)+GENETAG | GENETAG | 89.44 | 93.07 | 91.22 |
| SOO | JNLPBA (protein only)+GENETAG | GENETAG | 91.19 | 94.98 | 93.05 |
| Best Ind. Classifier | JNLPBA (protein + DNA + RNA)+GENETAG | GENETAG | 88.70 | 93.55 | 91.06 |
| SOO | JNLPBA (protein + DNA + RNA)+GENETAG | GENETAG | 90.09 | 95.16 | 92.56 |
Evaluation results of the approach on cross-corpus datasets with non-informative sentences removed (we report percentages)
| Approach | Training set | Test set | r | p | FM |
|---|---|---|---|---|---|
| Best Individual Classifier | JNLPBA (protein only)+AIMed | AIMed | 80.58 | 84.43 | 82.46 |
| SOO Based Ensemble | JNLPBA (protein only)+AIMed | AIMed | 81.98 | 86.01 | 83.95 |
| Best Individual Classifier | JNLPBA (protein + DNA)+AIMed | AIMed | 84.66 | 83.50 | 84.08 |
| SOO Based Ensemble | JNLPBA (protein + DNA)+AIMed | AIMed | 86.07 | 85.01 | 85.54 |
| Best Individual Classifier | JNLPBA (protein only)+GENETAG | GENETAG | 91.79 | 90.61 | 91.20 |
| SOO Based Ensemble | JNLPBA (protein only)+GENETAG | GENETAG | 93.19 | 92.08 | 92.63 |
| Best Individual Classifier | JNLPBA (protein + DNA + RNA)+GENETAG | GENETAG | 93.98 | 90.67 | 92.29 |
| SOO Based Ensemble | JNLPBA (protein + DNA + RNA)+GENETAG | GENETAG | 95.09 | 92.16 | 93.60 |
Here 'r' denotes recall, 'p' precision, and 'FM' F-measure.
Orthographic features
| Feature | Example | Feature | Example |
|---|---|---|---|
| InitCap | Src | AllCaps | EBNA, LMP |
| InCap | mAb | CapMixAlpha | NFkappaB, EpoR |
| DigitOnly | 1, 123 | DigitSpecial | 12-3 |
| DigitAlpha | 2× NFkappaB, 2A | AlphaDigitAlpha | IL23R, EIA |
| Hyphen | - | CapLowAlpha | Src, Ras, Epo |
| CapsAndDigits | 32Dc13 | RomanNumeral | I, II |
| StopWord | at, in | ATGCSeq | CCGCCC, ATAGAT |
| AlphaDigit | p50, p65 | DigitCommaDigit | 1,28 |
| GreekLetter | alpha, beta | LowMixAlpha | mRNA, mAb |
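The table lists only feature names and example tokens, not their exact definitions. As a rough sketch, a few of these features could be derived from a token with regular expressions as shown below. The patterns, the short `GREEK` and `STOP_WORDS` lists, and the function name `orthographic_features` are assumptions inferred from the examples, not the authors' definitions.

```python
import re

# Illustrative word lists; the lists used by the authors are not given.
GREEK = {"alpha", "beta", "gamma", "delta", "kappa"}
STOP_WORDS = {"at", "in", "of", "the", "and"}

# One assumed pattern per feature, inferred from the example column.
PATTERNS = {
    "InitCap": re.compile(r"^[A-Z]"),               # Src
    "AllCaps": re.compile(r"^[A-Z]+$"),             # EBNA, LMP
    "DigitOnly": re.compile(r"^\d+$"),              # 1, 123
    "DigitCommaDigit": re.compile(r"^\d+,\d+$"),    # 1,28
    "AlphaDigit": re.compile(r"^[A-Za-z]+\d+$"),    # p50, p65
    "Hyphen": re.compile(r"-"),                     # any token containing '-'
    "RomanNumeral": re.compile(r"^[IVXLCDM]+$"),    # I, II (loose check)
    "ATGCSeq": re.compile(r"^[ATGC]+$"),            # CCGCCC, ATAGAT
}

def orthographic_features(token):
    """Return a dict mapping feature name to True/False for one token."""
    feats = {name: bool(pattern.search(token)) for name, pattern in PATTERNS.items()}
    feats["GreekLetter"] = token.lower() in GREEK
    feats["StopWord"] = token.lower() in STOP_WORDS
    return feats

for tok in ["Src", "EBNA", "1,28", "p50", "CCGCCC", "alpha", "in"]:
    active = sorted(name for name, on in orthographic_features(tok).items() if on)
    print(tok, active)
```

Several features can fire for the same token (for example, an all-capital nucleotide string also satisfies the assumed `AllCaps` and `InitCap` patterns), which is harmless when each feature is passed to the classifier as a separate binary indicator.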