Jörg Hakenberg, Steffen Bickel, Conrad Plake, Ulf Brefeld, Hagen Zahn, Lukas Faulstich, Ulf Leser, Tobias Scheffer.
Abstract
In task 1A of the BioCreAtIvE evaluation, systems had to be devised that recognize words and phrases forming gene or protein names in natural language sentences. We approach this problem by building a word classification system based on a sliding window approach with a Support Vector Machine (SVM), combined with pattern-based post-processing for the recognition of phrases. The performance of such a system crucially depends on the type of features chosen for consideration by the classification method, such as prefixes or suffixes, character n-grams, patterns of capitalization, or the classification of preceding or following words. We present a systematic approach to evaluate the performance of different feature sets based on recursive feature elimination (RFE). By systematically reducing the number of features used by the system, we can quantify the impact of different feature sets on the results of the word classification problem. This helps us to identify descriptive features, to learn about the structure of the problem, and to design systems that are faster and easier to understand. We observe that the SVM is robust to redundant features. RFE improves the performance by 0.7%, compared to using the complete set of attributes. Moreover, a performance that is only 2.3% below this maximum can be obtained using fewer than 5% of the features.
Year: 2005 PMID: 15960843 PMCID: PMC1869023 DOI: 10.1186/1471-2105-6-S1-S9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. System architecture. The overall system architecture, including the recursive feature elimination process.
Feature classes and their impact on prediction quality. Table of all feature classes. *: classes used in the BioCreAtIvE submission, ◦: classes implemented afterwards, partly adopted from other participants of the contest. The fourth column gives the impact of each single feature class compared to the baseline (only tokens). These figures include post-processing. The fifth column shows how precision (P) and recall (R) are affected. Letter surface clues (last rows) refer to the following features: {special, allCaps, initCap, capMix, lowMix, Idl, ddd}.
| Feature | Example | Short name | Impact | Effect on precision (P) and recall (R) |
| Token* | Sro7 | Token | = 54% | - |
| Unseen token* | | UToken | | |
| n-grams of token* | | 1G, 2G, .. | +15% | 1..4-grams, P+, R++ |
| Previous & next tokens | | P/NToken | -5% | [1,1]-window, P+, R- |
| n-grams of tokens in window | | 2PG/2NG/.. | | |
| Prefixes, suffixes | | 1P, 2P, 3P, 1S.. | ±0 | |
| Stop word | the, or | Stop | -5% | 10,000 words, P+, R- |
| POS tag | NN, DT | POS | -50% | P-, R- |
| Initial upper case* | Msp | initCap | +.5% | P=, R+ |
| All chars are upper case* | MMTV | allCaps | +.5% | P-, R+ |
| Upper case letters* | InlC, GUS | Upper | ||
| Upper case (skip first)* | MsPRP2 | Upper2 | ||
| Single capital | A | singleCap | +.5% | P+, R+ |
| Two capitals | RalGDS | twoCaps | +.5% | P+, R+ |
| Capital, then mixed letters ◦ | IgM | capMix | ||
| Lower case, then mixed ◦ | kDa | lowMix | +1% | P-, R+ |
| Special symbols* | ICAM-1 | special | ±0 | P-, R+ |
| Characters and numbers* | p50 | CharNum | ||
| Numbers* | p50, HSF1 | Number | ||
| Letters, digits, letters ◦ | H2kd | Idl | ±0 | |
| Digit, dot, digit ◦ | 5.78 | ddd | -.1% | P-, R- |
| Greek letter ◦ | alpha | greek | +.5% | P+, R- |
| Roman numeral ◦ | II, xii | roman | ±0 | R+, R- |
| Number followed by '%' ◦ | 75.0% | percentage | -.1% | P-, R- |
| DNA, RNA sequences ◦ | ACCGT | DNA, RNA | -.1% | P-, R- |
| Longest consonant chain * | Sro7 → 2 | LCC | -2% | P-, R- |
| Keyword distance* | | keyDist | -20% | P+, R- |
| Gazetteer* | | Gaz | -3% | P-, R- |
| Prev./next token is NEWGENE | | PTG, NTG | -18% | prev. only, P+, R- |
| Tokens + letter surface clues | | | +2% | P+, R- |
| Tokens + 1,2,3-grams + greek + roman + letter surface clues | | | +14% | P+, R++ |
| Tokens + 1,2,3-grams + keyDist + Gaz + LCC + special + combi + allCaps + initCap * | | | +16% | P+, R++ |
| Tokens + 1,2,3,4-grams + keyDist + Gaz + LCC + special + combi + allCaps + initCap* + lowMix ◦ | | | +18% | P+, R++ |
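To make a few of the feature classes in the table concrete, the following is a minimal sketch of how such features could be computed for one token inside a [1,1] window. The exact definitions used by the authors are not given here, so the regular expressions, the `GREEK` word list, the feature-name format (`Token=...`, `2G=...`), and all function names are illustrative assumptions.

```python
import re

# Illustrative subset only; the list actually used is not given in the source.
GREEK = {"alpha", "beta", "gamma", "delta", "kappa"}

def char_ngrams(token, n_max=4):
    """All character n-grams of the token for n = 1..n_max."""
    return [token[i:i + n] for n in range(1, n_max + 1)
            for i in range(len(token) - n + 1)]

def longest_consonant_chain(token):
    """Length of the longest run of consecutive consonant letters (LCC)."""
    best = run = 0
    for ch in token.lower():
        if ch.isalpha() and ch not in "aeiou":
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best

def surface_clues(token):
    """Binary orthographic features; the patterns are assumed definitions."""
    clues = set()
    if token[:1].isupper():
        clues.add("initCap")
    if token.isalpha() and token.isupper():
        clues.add("allCaps")
    if re.match(r"[A-Z][a-z]+[A-Z]", token):
        clues.add("capMix")
    if re.match(r"[a-z]+[A-Z]", token):
        clues.add("lowMix")
    if re.search(r"[^A-Za-z0-9]", token):
        clues.add("special")
    if re.fullmatch(r"\d+\.\d+", token):
        clues.add("ddd")
    if re.fullmatch(r"\d+(\.\d+)?%", token):
        clues.add("percentage")
    if token.lower() in GREEK:
        clues.add("greek")
    return clues

def token_features(tokens, i, n_max=3):
    """Feature set for tokens[i], including the previous and next token."""
    tok = tokens[i]
    feats = {"Token=" + tok}
    feats |= {f"{len(g)}G={g}" for g in char_ngrams(tok, n_max)}
    feats |= surface_clues(tok)
    feats.add(f"LCC={longest_consonant_chain(tok)}")
    if i > 0:
        feats.add("PToken=" + tokens[i - 1])
    if i + 1 < len(tokens):
        feats.add("NToken=" + tokens[i + 1])
    return feats
```

For example, `token_features(["the", "Sro7", "gene"], 1)` contains `Token=Sro7`, `2G=ro`, `LCC=2`, `PToken=the`, and `NToken=gene`. Representing each sample as a set of named binary features keeps the vectors sparse, which suits a linear classifier over a very large feature space.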
Figure 2. Impact of the Recursive Feature Elimination. Impact of removing, in each round, the 10% of features with the lowest weights in the weight vector. After 30 iterations, with only 4.28% of all features remaining, the f-measure has dropped by only 2%. The underlying evaluation method only considers the recognition of single tokens rather than whole phrases. The bottom line (65 iterations) shows the impact of the remaining 0.11% of all features. All values are evaluated without the post-expansion step (see text).
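The percentages in this caption follow roughly from removing 10% of the remaining features per round, which leaves a fraction of about 0.9^k after k rounds. The short sketch below works through that arithmetic; the function name is hypothetical, and the idealized geometric schedule ignores how feature counts are rounded in practice.

```python
def fraction_remaining(rounds, drop=0.10):
    """Fraction of features left after `rounds` eliminations of `drop` each."""
    return (1.0 - drop) ** rounds

for r in (30, 64, 65):
    print(f"{r} rounds: {100 * fraction_remaining(r):.2f}% of features remain")
```

Under this idealization, 64 and 65 rounds leave about 0.12% and 0.11% of the features, matching the reported values. For 30 rounds the formula gives about 4.24%, slightly below the reported 4.28%; the gap may come from rounding of integer feature counts or from a detail not stated here.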
Figure 3. Dependence of the f-measure on the number of features. Performance (precision, recall, f-measure) for different numbers of features. Starting from the full feature set, recursive feature elimination removes the features with the lowest weights in the weight vector, and we measure the performance after each round.
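The elimination loop described in these captions can be sketched as follows. This is not the authors' implementation: `train_linear` is a toy hinge-loss trainer standing in for a real linear SVM, ranking by absolute weight is an assumption (the captions only say "lowest weight"), and all names and default parameters are hypothetical.

```python
def train_linear(samples, labels, features, epochs=20, lr=0.1, reg=0.01):
    """Toy hinge-loss trainer (stand-in for a linear SVM).

    samples: list of sets of binary feature names; labels: +1 or -1.
    Returns a dict mapping each active feature to its weight.
    """
    w = {f: 0.0 for f in features}
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * sum(w[f] for f in x if f in w)
            for f in w:                      # L2 shrinkage of all weights
                w[f] *= 1.0 - lr * reg
            if margin < 1.0:                 # hinge loss is active: update
                for f in x:
                    if f in w:
                        w[f] += lr * y
    return w

def rfe(samples, labels, features, drop=0.10, min_features=1, train=train_linear):
    """Recursive feature elimination.

    Each round retrains on the active features and removes the `drop`
    fraction (at least one feature) with the smallest absolute weight.
    Returns the list of active feature sets, one entry per round.
    """
    active = list(features)
    history = [list(active)]
    while len(active) > min_features:
        w = train(samples, labels, active)
        ranked = sorted(active, key=lambda f: abs(w[f]))   # least important first
        n_drop = max(1, int(len(active) * drop))
        n_drop = min(n_drop, len(active) - min_features)
        active = ranked[n_drop:]
        history.append(list(active))
    return history
```

Evaluating precision, recall, and f-measure on held-out data after each entry of `history` would yield curves of the kind shown in Figures 2 and 3. Retraining in every round matters because the weights of the surviving features change once correlated features are removed.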
Feature classes remaining after the RFE. Examples of features and feature classes remaining after 64 iterations. In every round, we remove the 10% of all features that have the lowest weight. After 64 iterations, only 0.12% of all features remain. We show the upper, middle, and lower weighted features in this table. Highly weighted features are more likely to apply to positive samples (NEWGENE), low weighted features to negative samples. Names in bold indicate binary orthographic features and the gazetteer (Gaz), in contrast to single features, like a particular 3-gram. The feature named special in Table 1 actually consists of four parts, two of which are present in the list of top ranking features.
| Feature | Class | Weight | Feature | Class | Weight |
| | | 1.497386 | AACC | 4-gram | 0.088738 |
| insulin | Token | 0.632708 | D2-m | 4-gram | -0.022443 |
| protein | Token | 0.628168 | Stai | 4-gram | -0.082046 |
| kinase | Token | 0.608392 | mig | 3-gram | -0.083135 |
| human | Token | 0.536695 | Reve | 4-gram | -0.096548 |
| proteins | Token | 0.535368 | ing | 3-gram | -0.099499 |
| | | 0.498111 | GnT | Token | -0.099619 |
| | | 0.489201 | owl | 3-gram | -0.100996 |
| serum | Token | 0.480326 | 231 | Token | -0.104751 |
| | | 0.457806 | ZII | Token | -0.105133 |
| | | 0.438028 | had | Token | -0.106545 |
| factor | Token | 0.438028 | we | Token | -0.107104 |
| wild-type | Token | 0.389359 | [..] | ||
| | | 0.366269 | that | Token | -0.174203 |
| mutants | Token | 0.340689 | scre | 4-gram | -0.175351 |
| genes | Token | 0.340352 | OH | Token | -0.179445 |
| promoter | Token | 0.327395 | ims | 3-gram | -0.182513 |
| receptor | Token | 0.323412 | be | Token | -0.186265 |
| polymerase | Token | 0.305972 | . | Token | -0.188904 |
| complex | Token | 0.292019 | To | Token | -0.189576 |
| receptors | Token | 0.292019 | acyc | 4-gram | -0.191766 |
| c-myc | Token | 0.292019 | the | Token | -0.192838 |
| sites | Token | 0.243349 | off | Token | -0.197588 |
| mutant | Token | 0.243349 | rank | Token | -0.198915 |
| domain | Token | 0.231541 | Dar | Token | -0.205479 |
| sequences | Token | 0.216691 | ( | Token | -0.206405 |
| sequence | Token | 0.216683 | omit | 4-gram | -0.220064 |
| domain | Token | 0.215116 | nost | 4-gram | -0.223077 |
| | | 0.205077 | spit | 4-gram | -0.238335 |
| isoforms | Token | 0.194679 | | | -0.243183 |
| | | 0.179926 | oped | 4-gram | -0.246457 |
| | | 0.179394 | The | Token | -0.246535 |
| [..] | | | aged | Token | -0.253814 |
| lare | 4-gram | 0.105354 | are | Token | -0.267228 |
| bicu | 4-gram | 0.103185 | ssif | 4-gram | -0.272211 |
| bea | 3-gram | 0.100539 | encoding | Token | -0.447471 |
| [ | Token | 0.097113 | which | Token | -0.535368 |
| ntei | 4-gram | 0.093310 | activate | Token | -0.535368 |
| GTTA | 4-gram | 0.088738 | contain | Token | -0.640844 |
Rules used for the post-expansion step. The rules switch certain part-of-speech tags to NEWGENE tags. We exclude 372 particular nouns from expansion after a NEWGENE token and 222 particular nouns from expansion before one, and include only 778 particular adjectives in the expansion of noun phrases. NN*: nouns, proper nouns, plurals; JJ: adjective; CD: cardinal number; DT: determiner; '/' refers to the token itself.
| Former POS pattern | Expanded pattern | Limitation |
| NEWGENE NN* | NEWGENE NEWGENE | all but 372 particular nouns |
| NN* NEWGENE | NEWGENE NEWGENE | all but 222 particular nouns |
| JJ NEWGENE | NEWGENE NEWGENE | only 778 particular adjectives |
| NEWGENE JJ | NEWGENE NEWGENE | only 778 particular adjectives |
| NEWGENE DT NN* | NEWGENE NEWGENE NEWGENE | |
| NEWGENE CD | NEWGENE NEWGENE | |
| NN* / NEWGENE | NEWGENE NEWGENE NEWGENE | |
| NEWGENE / NN* | NEWGENE NEWGENE NEWGENE | |
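The rules in this table can be read as a tag-rewriting pass over a POS-tagged sentence. The sketch below is one possible reading, not the authors' code: it applies every rule once against the original tags (whether the rules are applied once or repeatedly is not stated), and the word lists passed as `excl_after`, `excl_before`, and `adjectives` are hypothetical placeholders for the 372 nouns, 222 nouns, and 778 adjectives, which are not listed here.

```python
GENE = "NEWGENE"

def is_noun(tag):
    """NN* covers nouns, proper nouns, and plurals (NN, NNS, NNP, NNPS)."""
    return tag.startswith("NN")

def post_expand(tokens, tags, excl_after=frozenset(),
                excl_before=frozenset(), adjectives=frozenset()):
    """Expand NEWGENE tags to neighbouring tokens in a single pass.

    Rules are matched against the original tags, so an expansion made in
    this pass does not trigger further expansions (an assumption).
    """
    old, new, n = list(tags), list(tags), len(tags)
    for i in range(n):
        if old[i] != GENE:
            continue
        if i + 1 < n:                                   # expand to the right
            t, w = old[i + 1], tokens[i + 1]
            if is_noun(t) and w.lower() not in excl_after:
                new[i + 1] = GENE                       # NEWGENE NN*
            elif t == "JJ" and w.lower() in adjectives:
                new[i + 1] = GENE                       # NEWGENE JJ
            elif t == "CD":
                new[i + 1] = GENE                       # NEWGENE CD
            elif i + 2 < n and is_noun(old[i + 2]) and (t == "DT" or w == "/"):
                new[i + 1] = new[i + 2] = GENE          # NEWGENE DT NN*, NEWGENE / NN*
        if i >= 1:                                      # expand to the left
            t, w = old[i - 1], tokens[i - 1]
            if is_noun(t) and w.lower() not in excl_before:
                new[i - 1] = GENE                       # NN* NEWGENE
            elif t == "JJ" and w.lower() in adjectives:
                new[i - 1] = GENE                       # JJ NEWGENE
            elif w == "/" and i >= 2 and is_noun(old[i - 2]):
                new[i - 1] = new[i - 2] = GENE          # NN* / NEWGENE
    return new
```

With `adjectives={"human"}`, the tagged sentence `the/DT human/JJ insulin/NEWGENE receptor/NN gene/NN` becomes `DT NEWGENE NEWGENE NEWGENE NN`: the adjective and the adjacent noun are absorbed, while `gene` stays untouched because its left neighbour was not NEWGENE in the original tagging.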
Figure 4. Recall/precision with and without post-expansion. Comparison of recall and precision before and after the post-expansion step. We use the full feature set (marked "100%" in Figure 2) for this evaluation. We obtain the different points by shifting the hyperplane in parallel, that is, by varying the decision threshold.
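Shifting the hyperplane in parallel amounts to adding a constant to every decision value before taking the sign. A minimal sketch of how each point of such a recall/precision curve could be computed is given below; the function name, the toy scores, and the convention of reporting precision 1.0 when nothing is predicted positive are assumptions.

```python
def precision_recall(scores, labels, shift=0.0):
    """Precision and recall when the decision rule is score + shift >= 0.

    scores: signed distances to the hyperplane; labels: +1 or -1.
    """
    tp = fp = fn = 0
    for s, y in zip(scores, labels):
        predicted_positive = s + shift >= 0
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif y == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy decision values, for illustration only.
scores = [1.2, 0.4, -0.3, -0.8, 0.1]
labels = [1, 1, 1, -1, -1]
for shift in (-0.5, 0.0, 0.5):
    print(shift, precision_recall(scores, labels, shift))
```

A positive shift moves the hyperplane toward the negative class, so more tokens are tagged NEWGENE: recall rises and precision tends to fall. A negative shift has the opposite effect.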
Figure 5. Error analysis. Proportions of different causes for four classes of errors. We distinguish between boundary errors and non-boundary errors (see text).