Julien Gobeill, Imad Tbahriti, Frédéric Ehrler, Anaïs Mottaz, Anne-Lise Veuthey, Patrick Ruch.
Abstract
BACKGROUND: This paper describes and evaluates a sentence selection engine that extracts a GeneRiF (Gene Reference into Functions), as defined in ENTREZ-Gene, from a MEDLINE record. Inputs for this task include both a gene and a pointer to a MEDLINE reference. The suggested approach merges two independent sentence extraction strategies. The first strategy (LASt) uses argumentative features inspired by discourse-analysis models. The second extraction scheme (GOEx) uses an automatic text categorizer to estimate the density of Gene Ontology categories in every sentence, thus providing a full ranking of all possible candidate GeneRiFs. A combination of the two approaches is proposed, which also aims at reducing the size of the selected segment by filtering out non-content-bearing rhetorical phrases.
Year: 2008 PMID: 18426554 PMCID: PMC2352866 DOI: 10.1186/1471-2105-9-S3-S9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Example of an ENTREZ-Gene record and the corresponding GeneRiF (italic added)
| Locus - ABCA1: ATP-binding cassette, sub-family A (ABC1), member 1 |
| MEDLINE record - PMID - 12804586 |
| GeneRiF: |
Figure 1. GeneRiF distribution in titles (“ti”) and in abstracts, from the 1st to the nth sentence.
Example of an explicitly structured abstract in MEDLINE. The 4-class argumentation model supporting our experiments can have minor variations in abstracts, as illustrated by the INTRODUCTION marker in this explicitly structured abstract. Explicitly structured abstracts in MEDLINE account for less than 2% of all abstracts.
The classification results for the abstract shown in Table 2 (explicit argumentative labels are removed before classification). For each row, the attributed class is followed by the score for the class, followed by the extracted text segment. In this example, one of the RESULTS sentences (in bold) is misclassified as METHODS, while the INTRODUCTION sentence has been classified as PURPOSE.
| CONCLUSION (00160116) The highly favorable pathologic stage (RI-RII, 58%) and the fact that the majority of patients were alive and disease-free suggested a more favorable prognosis for this type of renal cell carcinoma. |
| METHODS (00160119) Tumors were classified according to well-established histologic criteria to determine stage of disease; the system proposed by Robson was used. |
| PURPOSE (00156456) In this study, we analyzed 250 renal cell carcinomas to a) determine frequency of CCRC at our Hospital and b) analyze clinical and pathologic features of CCRCs. |
| PURPOSE (00167817) Chromophobe renal cell carcinoma (CCRC) comprises 5% of neoplasms of renal tubular epithelium. CCRC may have a slightly better prognosis than clear cell carcinoma, but outcome data are limited. |
| RESULTS (00155338) Robson staging was possible in all cases, and 10 patients were stage I; 11 stage II; 10 stage III, and five stage IV. |
Confusion matrices for argumentative classification: the first column indicates the expected category, while the first row gives the category assigned by the classifier.
| Without sentence positions | | | | |
| | PURPOSE | METHODS | RESULTS | CONCLUS. |
| PURPOSE | 80.65 % | 0 % | 3.23 % | 16 % |
| METHODS | 8 % | 78 % | 8 % | 6 % |
| RESULTS | 18.58 % | 5.31 % | 52.21 % | 23.89 % |
| CONCLUS. | 18.18 % | 0 % | 2.27 % | 79.55 % |
| With sentence positions | | | | |
| | PURPOSE | METHODS | RESULTS | CONCLUS. |
| PURPOSE | 93.35 % | 0 % | 3.23 % | 3 % |
| METHODS | 3 % | 78 % | 8 % | 6 % |
| RESULTS | 12.43 % | 5.31 % | 52.21 % | 13.01 % |
| CONCLUS. | 2.27 % | 0 % | 2.27 % | 95.45 % |
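Each row of the matrices above gives, for one expected class, the percentage of sentences assigned to each class. As a minimal sketch of how such row-normalized percentages can be derived from expected and assigned labels, the following uses a hypothetical helper `row_normalized_confusion` and made-up labels; it does not reproduce the paper's data or classifier.

```python
from collections import Counter

# The four argumentative classes used in the tables above.
CLASSES = ["PURPOSE", "METHODS", "RESULTS", "CONCLUSION"]

def row_normalized_confusion(expected, predicted):
    """Return {expected_class: {assigned_class: percentage}} with rows summing to 100."""
    counts = Counter(zip(expected, predicted))
    matrix = {}
    for e in CLASSES:
        total = sum(counts[(e, p)] for p in CLASSES)
        matrix[e] = {
            p: (100.0 * counts[(e, p)] / total if total else 0.0)
            for p in CLASSES
        }
    return matrix

# Hypothetical labels, for illustration only (not taken from the paper).
expected = ["PURPOSE", "PURPOSE", "METHODS", "RESULTS",
            "RESULTS", "RESULTS", "RESULTS", "CONCLUSION"]
predicted = ["PURPOSE", "CONCLUSION", "METHODS", "RESULTS",
             "RESULTS", "PURPOSE", "CONCLUSION", "CONCLUSION"]

matrix = row_normalized_confusion(expected, predicted)
for cls in CLASSES:
    print(cls, matrix[cls])
```

The diagonal entries correspond to correctly classified sentences; off-diagonal entries in a row show where sentences of that expected class were sent instead.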
Class distribution in 1000 GeneRiFs after argumentative classification. Sets A and B are samples of GeneRiFs as in LocusLink, but Set B contains only GeneRiFs originating from the abstract.
| | Set A (%) | Set B (%) |
| PURPOSE | 41 | 22 |
| METHODS | 2 | 4 |
| RESULTS | 2 | 8 |
| CONCLUSION | 55 | 66 |
Sample distribution of the most frequent GO terms in Swiss-Prot.
| GO ID | Proportion (%) | Cumul. (%) | Term |
| 0005634 | 3.41 | 3.41 | nucleus |
| 0007165 | 3.19 | 6.60 | signal transduction |
| 0005737 | 2.75 | 9.36 | cytoplasm |
| 0005887 | 2.58 | 11.9 | integral to plasma membrane |
| 0005886 | 1.65 | 13.6 | plasma membrane |
| 0003700 | 1.48 | 15.0 | transcription factor activity |
| 0016021 | 1.48 | 16.5 | integral to membrane |
| 0005515 | 1.04 | 17.6 | protein binding |
| 0006412 | 0.88 | 18.5 | protein biosynthesis |
| 0006810 | 0.82 | 19.3 | transport |
Two sentences extracted from the abstract in Table 1 are assigned a ranked set of N Gene Ontology descriptors (N = 2). Each {sentence;category} association pair is provided with a categorization status value (CSV), which directly expresses the similarity between the sentence and the Gene Ontology category. The final density is computed by simply summing the top-N CSVs (N = 2).
| Sentences | Predicted GO categories | CSV | Density |
| Class 2 transcripts predominated in most other tissues | rna primary transcript binding | 1028 | 2056 |
| | 35s primary transcript processing | 1028 | |
| regulation of ABCA1 mRNA levels exploits the use of alternative transcription start sites | transcription | 9397 | 12819 |
| | regulation | 3422 | |
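The density computation described in the caption can be sketched as follows. The function name `density` and the dictionary layout are illustrative assumptions; only the CSV values and the top-N summation rule (N = 2) come from the table above.

```python
from typing import List, Tuple

def density(scored: List[Tuple[str, float]], top_n: int = 2) -> float:
    """Sum the top_n highest categorization status values (CSVs) of a sentence."""
    csvs = sorted((csv for _, csv in scored), reverse=True)
    return sum(csvs[:top_n])

# {sentence; category} pairs and CSVs as listed in the table above.
sentence_1 = [
    ("rna primary transcript binding", 1028),
    ("35s primary transcript processing", 1028),
]
sentence_2 = [
    ("transcription", 9397),
    ("regulation", 3422),
]

# Rank candidate sentences by decreasing density (keys are illustrative).
candidates = {"s1": sentence_1, "s2": sentence_2}
ranked = sorted(candidates, key=lambda k: density(candidates[k]), reverse=True)

print(density(sentence_1))  # 2056
print(density(sentence_2))  # 12819
print(ranked)               # ['s2', 's1']
```

Under this rule the second sentence, with a density of 12819, ranks ahead of the first as a GeneRiF candidate.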
Performance of each basic strategy and their combination. The top combination (57.29%) achieves 80.7% of the theoretical upper bound (70.96%).
| | Origin of the proposed GeneRiF | | Dice (%) |
| | Title | Abstract | |
| Baseline | 139 | 0 | 50.47 |
| LASt | 125 | 14 | 51.98 |
| LASt & shortening | 125 | 14 | 52.78 |
| GOEx | 103 | 36 | 52.36 |
| Combination | 108 | 31 | 56.14 |
| Combination & shortening | 108 | 31 | 57.29 |
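The table reports a Dice score for each strategy, but this record does not spell out how it is computed. The sketch below therefore assumes the standard set-based Dice coefficient over lowercased whitespace tokens; the tokenization and the two example strings are illustrative assumptions, not the paper's evaluation setup. The last lines recompute the ratio between the top combination and the theoretical upper bound quoted in the caption.

```python
def dice(candidate: str, reference: str) -> float:
    """Set-based Dice coefficient: 2 * |A & B| / (|A| + |B|)."""
    a = set(candidate.lower().split())
    b = set(reference.lower().split())
    if not a and not b:
        return 1.0  # two empty strings are treated as identical
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical candidate and reference GeneRiF, for illustration only.
candidate = "regulation of ABCA1 mRNA levels"
reference = ("regulation of ABCA1 mRNA levels exploits "
             "alternative transcription start sites")
score = dice(candidate, reference)  # 5 shared tokens, 5 + 10 tokens in total

# Share of the theoretical upper bound reached by the top combination.
best_dice, upper_bound = 57.29, 70.96
share = round(100 * best_dice / upper_bound, 1)

print(round(score, 4))  # 0.6667
print(share)            # 80.7
```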