| Literature DB >> 27843447 |
Mujiono Sadikin1, Mohamad Ivan Fanany2, T Basaruddin2.
Abstract
One essential task in information extraction from the medical corpus is drug name recognition. Compared with text from other domains, medical text mining poses more challenges: largely unstructured text, fast-growing addition of new terms, a wide range of name variations for the same drug, a lack of labeled datasets and external knowledge sources, and multi-token representations of a single drug name. Although many approaches have been proposed for the task, some problems remain, with F-score performance still poor (less than 0.75). This paper presents a new treatment in data representation techniques to overcome some of those challenges. We propose three data representation techniques based on the characteristics of word distribution and word similarity obtained from word embedding training. The first technique is evaluated with a standard NN model, an MLP. The second technique involves two deep network classifiers, a DBN and an SAE. The third technique represents each sentence as a sequence and is evaluated with a recurrent NN model, an LSTM. In extracting drug name entities, the third technique gives the best F-score performance compared to the state of the art, with an average F-score of 0.8645.
Year: 2016 PMID: 27843447 PMCID: PMC5098107 DOI: 10.1155/2016/3483528
Source DB: PubMed Journal: Comput Intell Neurosci
Figure 1. Proposed approach framework of the first experiment.
Figure 2. Proposed approach framework of the second experiment.
Figure 3. Proposed approach framework of the third experiment.
Figure 4. Distribution of MedLine train dataset tokens.
The frequency distribution and drug target token position, MedLine.
| # | Σ sample | Σ frequency | Σ single token of drug entity |
|---|---|---|---|
| 1 | 28 | 8,661 | — |
| 2 | 410 | 8,510 | 50 |
| 3 | 3,563 | 8,612 | 262 |
Figure 5. Distribution of DrugBank train dataset tokens.
The frequency distribution and drug target token position, DrugBank.
| # | Σ sample | Σ frequency | Σ single token of drug entity |
|---|---|---|---|
| 1 | 27 | 33,538 | — |
| 2 | 332 | 33,463 | 33 |
| 3 | 5,501 | 33,351 | 920 |
Some of the cosine similarities between pairs of words.
| Word 1 | Word 2 | Similarity (cosine) | Remark |
|---|---|---|---|
| dilantin | tegretol | 0.75135758 | drug-drug |
| phenytoin | dilantin | 0.62360351 | drug-drug |
| phenytoin | tegretol | 0.51322415 | drug-drug |
| cholestyramine | dilantin | 0.24557819 | drug-drug |
| cholestyramine | phenytoin | 0.23701277 | drug-drug |
| administration | patients | 0.20459694 | non-drug - non-drug |
| tegretol | may | 0.11605539 | drug - non-drug |
| cholestyramine | patients | 0.08827197 | drug - non-drug |
| evaluated | end | 0.07379115 | non-drug - non-drug |
| within | controlled | 0.06111103 | non-drug - non-drug |
| cholestyramine | evaluated | 0.04024139 | drug - non-drug |
| dilantin | end | 0.02234770 | drug - non-drug |
The average Euclidean distance and cosine similarity between groups of words.
| Word group | Euclidean dist. avg | Cosine dist. avg |
|---|---|---|
| drug - non-drug | 0.096113798 | 0.194855980 |
| non-drug - non-drug | 0.094824332 | 0.604091044 |
| drug-drug | 0.093840800 | 0.617929002 |
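The grouping above rests on plain vector geometry over the trained embeddings: word pairs are compared by cosine similarity and Euclidean distance. A minimal sketch of both measures; the 3-d vectors here are illustrative stand-ins, not the paper's actual word2vec output:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Toy vectors standing in for the embeddings of three words from the table.
dilantin = [0.9, 0.1, 0.2]
tegretol = [0.8, 0.3, 0.1]
patients = [0.1, 0.9, 0.7]

print(cosine_similarity(dilantin, tegretol))  # high: drug-drug pair
print(cosine_similarity(dilantin, patients))  # lower: drug - non-drug pair
```

With real embeddings the same pattern emerges: drug-drug pairs cluster at higher cosine similarity than drug/non-drug pairs, which is what the two tables above report.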
Algorithm 1. Dataset labelling.
Sample DrugBank sentences and their drug name targets.
| Sentence | Drug position | Drug name |
|---|---|---|
| modification of surface histidine residues abolishes the cytotoxic activity of clostridium difficile toxin a | 79–107 | clostridium difficile toxin a |
| antimicrobial activity of ganoderma lucidum extract alone and in combination with some antibiotics. | 26–50 | ganoderma lucidum extract |
| on the other hand, surprisingly, green tea gallocatechins, (−)-epigallocatechin-3-o-gallate and theasinensin a, potently enhanced the promoter activity (182 and 247% activity at 1 microm, resp.). | 33–56 | green tea gallocatechins |
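The "Drug position" column gives character offsets of the mention within the sentence. Taking them as 0-based and inclusive, which matches the first row exactly, recovering the target string is a plain slice:

```python
def extract_span(sentence: str, start: int, end: int) -> str:
    """Cut a drug mention out of a sentence using inclusive character offsets."""
    return sentence[start:end + 1]

sentence = ("modification of surface histidine residues abolishes "
            "the cytotoxic activity of clostridium difficile toxin a")
print(extract_span(sentence, 79, 107))  # clostridium difficile toxin a
```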
A portion of the dataset formulated from the DrugBank sample with the first technique.
| Dataset number | Token-1 | Token-2 | Token-3 | Token-4 | Token-5 | Label |
|---|---|---|---|---|---|---|
| 1 | modification | of | surface | histidine | residues | 1 |
| 2 | of | surface | histidine | residues | abolishes | 1 |
| 3 | surface | histidine | residues | abolishes | the | 1 |
| 4 | histidine | residues | abolishes | the | cytotoxic | 1 |
| 5 | the | cytotoxic | activity | of | clostridium | 1 |
| 6 | | | | | antimicrobial | 5 |
| 7 | difficile | toxin | a | antimicrobial | activity | 1 |
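Rows like these are five-token sliding windows over the tokenized sentence. A minimal sketch of the windowing step (plain whitespace tokenization is an assumption; the paper does not spell out its tokenizer):

```python
def sliding_windows(sentence: str, size: int = 5):
    """Return every consecutive window of `size` tokens in the sentence."""
    tokens = sentence.split()
    return [tokens[i:i + size] for i in range(len(tokens) - size + 1)]

sentence = ("modification of surface histidine residues abolishes "
            "the cytotoxic activity of clostridium difficile toxin a")
for window in sliding_windows(sentence)[:3]:
    print(window)
# first window: ['modification', 'of', 'surface', 'histidine', 'residues']
```

A 14-token sentence yields ten 5-token windows, each of which becomes one labeled training instance in the table above.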
First technique of data representation and its label.
| Token-1 | Token-2 | Token-3 | Token-4 | Token-5 | Label |
|---|---|---|---|---|---|
| | “were” | “performed” | “cytochrome” | “p-450” | 2 |
| | “concentrations” | “just” | “prior” | “to” | 2 |
| | | “and” | “alpha-adrenergic” | “stimulants,” | 3 |
| | | “inhibitors,” | “concomitant” | “use” | 3 |
| | | | “should” | “be” | 4 |
| | | | “such” | “as” | 4 |
| | | | | “—” | 5 |
| | | | | “and” | 5 |
| | | | | | 6 |
| “studies” | “with” | “plenaxis” | “were” | “performed.” | 1 |
| “were” | “performed.” | “cytochrome” | “p-450” | “is” | 1 |
Second technique of data representation and its label.
| Token-1 | Token-2 | Token-3 | Token-4 | Token-5 | Label |
|---|---|---|---|---|---|
| “modification” | “of” | “surface” | “histidine” | “residues” | 1 |
| “of” | “surface” | “histidine” | “residues” | “abolishes” | 1 |
| “surface” | “histidine” | “residues” | “abolishes” | “the” | 1 |
| “histidine” | “residues” | “abolishes” | “the” | “cytotoxic” | 1 |
| “the” | “cytotoxic” | “activity” | “of” | “clostridium” | 1 |
| | | | “a” | “ | 5 |
| “difficile” | “toxin” | “a” | “ | “ | 1 |
| “a” | “atoxin” | “ | “ | “ | 1 |
| “toxic” | “ | “ | “ | “ | 1 |
Third technique of data representation and its label.
| Sent.#1 | Class | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| | Word | “drug” | “interaction” | “studies” | “with” | “plenaxis” | “were” | “performed” |
| Sent.#2 | Class | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| | Word | “cytochrome” | “p-450” | “is” | “not” | “known” | “in” | “the” |
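In this third representation every sentence is one sequence and every token carries a binary class (1 = part of a drug name), which is what the recurrent model is trained on. A minimal sketch of building the class row from a known mention (`label_sequence` is a hypothetical helper, not the paper's code):

```python
def label_sequence(tokens, mention_tokens):
    """Assign class 1 to tokens belonging to the drug mention, 0 otherwise."""
    mention = set(mention_tokens)
    return [1 if token in mention else 0 for token in tokens]

tokens = ["drug", "interaction", "studies", "with", "plenaxis", "were", "performed"]
print(label_sequence(tokens, ["plenaxis"]))  # [0, 0, 0, 0, 1, 0, 0]
```

Applied to the second example sentence, the two-token mention "cytochrome p-450" yields the class row 1 1 0 0 0 0 0 shown above.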
Dataset composition.
| Dataset | Train: all | Train: 2/3 part | Train: cluster | Test |
|---|---|---|---|---|
| MedLine | 26,500 | 10,360 | 6,673 | 5,783 |
| DrugBank | 100,100 | 2,000 | 1,326 | 1,933 |
Figure 6Full batch training error of MedLine dataset.
Figure 7Full batch training error of DrugBank dataset.
The F-score performances of the three experiment scenarios.
| MedLine | Prec | Rec | F | L | Error test |
|---|---|---|---|---|---|
| (1) | 0.3564 | 0.5450 | 0.4310 | L0 | 0.0305 |
| (2) | 0.3806 | 0.5023 | 0.4331 | L1, L2 | 0.0432 |
| (3) | 0.3773 | 0.5266 | 0.4395 | L2 | 0.0418 |
| DrugBank | Prec | Rec | F | L | Error test |
| (1) | 0.6312 | 0.5372 | 0.5805 | L0 | 0.0790 |
| (2) | 0.6438 | 0.6398 | 0.6417 | L0, L2 | 0.0802 |
| (3) | 0.6305 | 0.5380 | 0.5806 | L0 | 0.0776 |
The impact of the data representation technique on the F-score performance.
| Dataset | (1) One seq. of all sentences | | | (2) One seq. of each sentence | | |
|---|---|---|---|---|---|---|
| MedLine | Prec | Rec | F | Prec | Rec | F |
| (1) | 0.3564 | 0.5450 | 0.4310 | 0.6515 | 0.6220 | 0.6364 |
| (2) | 0.3806 | 0.5023 | 0.4331 | 0.6119 | 0.7377 | 0.6689 |
| (3) | 0.3772 | 0.5266 | 0.4395 | 0.6143 | 0.6568 | 0.6348 |
| DrugBank | Prec | Rec | F | Prec | Rec | F |
| (1) | 0.6438 | 0.5337 | 0.5836 | 0.7143 | 0.4962 | 0.5856 |
| (2) | 0.6438 | 0.6398 | 0.6417 | 0.7182 | 0.5804 | 0.6420 |
| (3) | 0.6306 | 0.5380 | 0.5807 | 0.5974 | 0.5476 | 0.5714 |
The impact of adding Wikipedia text to the word2vec training data on the F-score performance.
| Dataset | (1) One seq. of all sentences | | | (2) One seq. of each sentence | | |
|---|---|---|---|---|---|---|
| MedLine | Prec | Rec | F | Prec | Rec | F |
| (1) | 0.5661 | 0.4582 | 0.5065 | 0.6140 | 0.6495 | 0.6336 |
| (2) | 0.5661 | 0.4946 | 0.5279 | 0.5972 | 0.7454 | 0.6631 |
| (3) | 0.5714 | 0.4462 | 0.5011 | 0.6193 | 0.6927 | 0.6540 |
| DrugBank | Prec | Rec | F | Prec | Rec | F |
| (1) | 0.6778 | 0.5460 | 0.6047 | 0.6973 | 0.6107 | 0.6511 |
| (2) | 0.6776 | 0.6124 | 0.6434 | 0.6961 | 0.6736 | 0.6847 |
| (3) | 0.7173 | 0.5574 | 0.6273 | 0.6976 | 0.6193 | 0.6561 |
Experimental results of three NN models.
| Dataset | MLP | | | DBN | | | SAE | | |
|---|---|---|---|---|---|---|---|---|---|
| MedLine | Prec | Rec | F | Prec | Rec | F | Prec | Rec | F |
| (1) | 0.6515 | 0.6220 | 0.6364 | 0.5464 | 0.6866 | 0.6085 | 0.6728 | 0.6214 | 0.6461 |
| (2) | 0.5972 | 0.7454 | 0.6631 | 0.6119 | 0.7377 | 0.6689 | 0.6504 | 0.7261 | 0.6862 |
| (3) | 0.6193 | 0.6927 | 0.6540 | 0.6139 | 0.6575 | 0.6350 | 0.6738 | 0.6518 | 0.6626 |
| Average | 0.6227 | 0.6867 | 0.6512 | 0.5907 | 0.6939 | 0.6375 | 0.6657 | 0.6665 | 0.6650 |
| DrugBank | Prec | Rec | F | Prec | Rec | F | Prec | Rec | F |
| (1) | 0.6973 | 0.6107 | 0.6512 | 0.6952 | 0.5847 | 0.6352 | 0.6081 | 0.6036 | 0.6059 |
| (2) | 0.6961 | 0.6736 | 0.6847 | 0.6937 | 0.6479 | 0.6700 | 0.6836 | 0.6768 | 0.6802 |
| (3) | 0.6976 | 0.6193 | 0.6561 | 0.6968 | 0.5929 | 0.6406 | 0.6033 | 0.6050 | 0.6042 |
| Average | 0.6970 | 0.6345 | 0.6640 | 0.6952 | 0.6085 | 0.6486 | 0.6317 | 0.6285 | 0.6301 |
Figure 8The global LSTM network.
The F-score performance of the third data representation technique with RNN-LSTM.
| Dataset | Prec | Rec | F |
|---|---|---|---|
| MedLine | 1 | 0.6474 | 0.7859 |
| DrugBank | 1 | 0.8921 | 0.9430 |
| Average | | | 0.8645 |
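With precision 1 on both test sets, the per-dataset F-scores and the reported 0.8645 average follow directly from the usual harmonic-mean definition, which is easy to verify:

```python
def f_score(precision: float, recall: float) -> float:
    """F1-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f_medline = f_score(1.0, 0.6474)   # ~0.7859
f_drugbank = f_score(1.0, 0.8921)  # ~0.9430
print(round((f_medline + f_drugbank) / 2, 4))  # 0.8645
```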
The F-score performance compared to the state of the art.
| Approach | F | Remark |
|---|---|---|
| The Best of SemEval 2013 [ | 0.7150 | — |
| [ | 0.5700 | With external knowledge, ChEBI |
| [ | 0.7200 | With external knowledge, DINTO |
| [ | 0.7200 | Additional feature, BIO |
| [ | 0.6000 | Single token only |
| MLP-SentenceSequence + Wiki (average)/Ours | 0.6580 | Without external knowledge |
| DBN-SentenceSequence + Wiki (average)/Ours | 0.6430 | Without external knowledge |
| SAE-SentenceSequence + Wiki (average)/Ours | 0.6480 | Without external knowledge |
| LSTM-AllSentenceSequence + Wiki + | 0.8645 | Without external knowledge |
The best performance of 10 executions on the drug label corpus.
| Iteration | Prec | Recall | F |
|---|---|---|---|
| 1 | 0.9170 | 0.9667 | 0.9412 |
| 2 | 0.8849 | 0.9157 | 0.9000 |
| 3 | 0.9134 | 0.9619 | 0.9370 |
| 4 | 0.9298 | 0.9500 | 0.9398 |
| 5 | 0.9640 | 0.9570 | 0.9605 |
| 6 | 0.8857 | 0.9514 | 0.9178 |
| 7 | 0.9489 | 0.9689 | 0.9588 |
| 8 | 0.9622 | 0.9654 | 0.9638 |
| 9 | 0.9507 | 0.9601 | 0.9554 |
| 10 | 0.9516 | 0.9625 | 0.9570 |
| Average | 0.9308 | 0.9560 | 0.9431 |
| Min | 0.8849 | 0.9157 | 0.9000 |
| Max | 0.9640 | 0.9689 | 0.9638 |