Ehsan Emadzadeh1, Azadeh Nikfarjam2, Rachel E Ginn2, Graciela Gonzalez2.
Abstract
Finding gene functions discussed in the literature is an important task of information extraction (IE) from biomedical documents. Automated computational methodologies can significantly reduce the need for manual curation and improve the quality of other related IE systems. We propose an open-IE method for the BioCreative IV GO shared task (subtask b), which focuses on finding gene function terms [Gene Ontology (GO) terms] for different genes in an article. The proposed open-IE approach is based on distributional semantic similarity over the GO terms. The method does not require annotated data for training, which makes it highly generalizable. We achieved an F-measure of 0.26 on the test set in the official submission for the BioCreative-GO shared task, the third highest F-measure among the seven participants in the shared task. DATABASE URL: https://code.google.com/p/rainbow-nlp/
Year: 2014 PMID: 25209025 PMCID: PMC4160099 DOI: 10.1093/database/bau084
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. This diagram shows the high-level flow of the proposed system. The left column shows the steps to create semantic vectors for each GO term. The right column displays the steps for finding GO terms in a document.
The table summarizes the number of sentences in the training set that were detected by ‘Sentence Gene Matcher’ as relevant to a gene and that were also annotated as having a gene function.
| Passage type | With gene function | Total | % |
|---|---|---|---|
| front | 26 | 67 | 39 |
| title_2 | 149 | 797 | 19 |
| abstract | 225 | 1253 | 18 |
| paragraph | 1700 | 20 703 | 8 |
| fig_title_caption | 17 | 412 | 4 |
| fig_caption | 99 | 6009 | 2 |
| table_title_caption | 0 | 47 | 0 |
| title_1, title_3, title_4 | 0 | 26 | 0 |
The passage types are as follows: ‘front’ is the title of the article; ‘title_1’ refers to section headings such as ‘Introduction’; ‘title_2’ refers to section subheadings, which sometimes describe the specific topic or finding of the section; ‘title_3’ and ‘title_4’ are deeper levels of section headings; ‘abstract’ is the abstract content; ‘fig_title_caption’ is the title of a figure caption; ‘fig_caption’ is the caption of a figure; and ‘table_title_caption’ is the caption of a table.
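The ‘%’ column above appears to be the share of gene-relevant sentences of each passage type that were also annotated with a gene function. The following minimal sketch recomputes it from the two count columns, assuming the percentage is rounded to the nearest integer; the names `COUNTS` and `percent_with_function` are illustrative and do not come from the article.

```python
# Counts copied from the table: (sentences with gene function, total sentences).
COUNTS = {
    "front": (26, 67),
    "title_2": (149, 797),
    "abstract": (225, 1253),
    "paragraph": (1700, 20703),
    "fig_title_caption": (17, 412),
    "fig_caption": (99, 6009),
    "table_title_caption": (0, 47),
    "title_1, title_3, title_4": (0, 26),
}


def percent_with_function(with_function, total):
    """Share of sentences annotated with a gene function, as a whole percent."""
    # Assumption: the table rounds to the nearest integer.
    return round(100 * with_function / total)


for passage_type, (with_function, total) in COUNTS.items():
    print(passage_type, percent_with_function(with_function, total))
```

Under this reading, titles and abstracts carry a gene function far more often per sentence than paragraphs or figure captions do, even though paragraphs contribute the most annotated sentences in absolute terms.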
This table describes the different passage types that appear in the corpus, with an example for each type.
| Passage type | Description | Example |
|---|---|---|
| Front | The title of the document | Activation of ASK1, downstream MAPKK and MAPK isoforms during cardiac ischaemia |
| Abstract | The content of abstract section of the article | p38 MAPK is activated potently during cardiac ischaemia, although the precise mechanism by which it is activated is unclear. We used the isolated perfused rat heart … |
| Title_1 | Section title | ‘Introduction’, ‘Results’, ‘Discussion’ |
| Title_2 | Subsection title. | |
| Title_3 | Subsubsection title. An inline heading that appears at the beginning of a paragraph. | |
| Title_4 | An inline subheading that appears at the beginning of a paragraph. | |
Title_3 and Title_4 are similar, but we maintain the naming from the corpus to keep it consistent with the data.
Figure 2. This flowchart shows, by example, the process of finding GO terms for each gene in a given document. The example sentence category is ‘front_2’ (FAT sections). Except for the values of the n and m parameters, the process for sentences in paragraphs is the same as for FAT sentences.
Figure 3. (a) The top-left diagram depicts the change in precision, recall and F-measure with respect to mFAT (‘Front’, ‘Abstract’ and ‘Title’) when the other parameters are held constant (mParagraph = 1, nFAT = 100, nParagraph = 15). (b) The top-right diagram shows the change in performance as mParagraph varies, with mFAT = 9, nFAT = 100 and nParagraph = 15. (c) The bottom-left diagram shows the change in performance as nFAT varies, with mFAT = 3, mParagraph = 1 and nParagraph = 15. (d) The bottom-right diagram shows the change in performance as nParagraph varies, with mFAT = 3, mParagraph = 1 and nFAT = 100.
This table shows the performance of different settings on the dev-set.
| Setting | Precision | Recall | F-measure |
|---|---|---|---|
| No intersection/All sections included | 0.082 | 0.141 | |
| No intersection/Paragraph+FAT | 0.091 | 0.498 | 0.155 |
| No intersection/Paragraph | 0.092 | 0.493 | 0.155 |
| No intersection/FAT | 0.281 | 0.272 | 0.276 |
| Intersection/All section | 0.268 | 0.305 | 0.285 |
| Intersection/Paragraph last sentence+FAT | 0.346 | 0.245 | 0.287 |
| Intersection/Paragraph all sentences+FAT | 0.316 | 0.278 | 0.296 |
| Intersection/Paragraph last and first sentences+FAT | 0.348 | 0.261 | |
| Intersection/Paragraph first sentence+FAT | 0.252 | 0.298 | |
For the intersection approach, the tuning parameter values are mFAT = 9, mParagraph = 2, nParagraph = 15 and nFAT = 75. The seed of the random function used by the random indexing algorithm was fixed to ‘1234’.
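The F-measure column can be related to the other two columns with a short sketch. It assumes that ‘F-measure’ here means the balanced F1 score (the harmonic mean of precision and recall), which the text does not state explicitly; the function name `f_measure` is illustrative. The two rows used are complete rows copied from the table.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Complete rows from the table: setting -> (precision, recall, reported F-measure).
ROWS = {
    "No intersection/FAT": (0.281, 0.272, 0.276),
    "Intersection/All section": (0.268, 0.305, 0.285),
}

for setting, (precision, recall, reported) in ROWS.items():
    computed = f_measure(precision, recall)
    print(f"{setting}: computed {computed:.3f}, reported {reported:.3f}")
```

Because the reported precision and recall are themselves rounded to three decimals, a recomputed F-measure can differ from the reported one in the last digit for some rows.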
Four settings for creating semantic vectors are compared in this table: (i) using only the GO terms, (ii) using GO terms and definitions, (iii) using GO terms and synonyms and (iv) using GO terms, definitions and synonyms. For all experiments in this table, the FAT and Paragraph (first sentence only) sections are considered.
| Setting | Precision | Recall | F-measure |
|---|---|---|---|
| Create vectors with GO terms only | 0.252 | | |
| Create vectors with GO terms+definitions | 0.247 | 0.229 | 0.238 |
| Create vectors with GO terms+definitions+synonyms | 0.227 | 0.196 | 0.210 |
| Create vectors with GO terms+synonym | 0.197 | 0.189 | 0.193 |
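The four settings compared above differ only in which text fields feed the semantic vector of each GO term. The sketch below illustrates that idea with a deliberately simplified form of random indexing: each word receives a sparse random vector, and a text is represented by the sum of its word vectors. This is not the authors' implementation. The vector length, sparsity, toy term entries, identifiers such as `TERM_A` and all function names are assumptions made for illustration; only the seed value 1234 comes from the text.

```python
import math
import random

DIM = 512      # vector length (assumed; not given in the text)
NONZERO = 8    # non-zero entries per index vector (assumed)
SEED = 1234    # seed value reported for the random function


def build_index_vectors(vocabulary, dim=DIM, nonzero=NONZERO, seed=SEED):
    """Assign each word a sparse random vector with a few +1/-1 entries."""
    rng = random.Random(seed)
    index = {}
    for word in sorted(vocabulary):  # sorted so the result is reproducible
        vec = [0.0] * dim
        for pos in rng.sample(range(dim), nonzero):
            vec[pos] = rng.choice((-1.0, 1.0))
        index[word] = vec
    return index


def term_text(entry, use_definition=False, use_synonyms=False):
    """Select the text fields of a term according to the chosen setting."""
    parts = [entry["name"]]
    if use_definition:
        parts.append(entry["definition"])
    if use_synonyms:
        parts.extend(entry["synonyms"])
    return " ".join(parts)


def text_vector(text, index, dim=DIM):
    """Represent a text as the sum of the index vectors of its words."""
    vec = [0.0] * dim
    for word in text.lower().split():
        for i, value in enumerate(index.get(word, ())):
            vec[i] += value
    return vec


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_terms(sentence, terms, **settings):
    """Rank terms by similarity to a sentence, most similar first."""
    texts = {tid: term_text(entry, **settings) for tid, entry in terms.items()}
    vocabulary = set(sentence.lower().split())
    for text in texts.values():
        vocabulary.update(text.lower().split())
    index = build_index_vectors(vocabulary)
    sentence_vec = text_vector(sentence, index)
    scores = {tid: cosine(sentence_vec, text_vector(text, index))
              for tid, text in texts.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy entries with placeholder identifiers and wording (not real GO records).
TERMS = {
    "TERM_A": {"name": "protein kinase activity",
               "definition": "catalysis of phosphate transfer to a protein",
               "synonyms": ["protein kinase"]},
    "TERM_B": {"name": "dna repair",
               "definition": "restoring dna after damage",
               "synonyms": ["dna damage repair"]},
}
SENTENCE = "the kinase activity of the protein increased"

print(rank_terms(SENTENCE, TERMS))
print(rank_terms(SENTENCE, TERMS, use_definition=True, use_synonyms=True))
```

Adding definitions or synonyms lengthens the text behind each term vector, which can dilute the contribution of the term name itself. That is one plausible reading of why the richer settings do not necessarily score higher in the table, though the sketch does not establish it.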