Jean-Fred Fontaine, Adriano Barbosa-Silva, Martin Schaefer, Matthew R Huska, Enrique M Muro, Miguel A Andrade-Navarro.
Abstract
The biomedical literature is represented by millions of abstracts available in the Medline database. These abstracts can be queried with the PubMed interface, which provides a keyword-based Boolean search engine. This approach shows limitations in the retrieval of abstracts related to very specific topics, as it is difficult for a non-expert user to find all of the most relevant keywords related to a biomedical topic. Additionally, when searching for more general topics, the same approach may return hundreds of unranked references. To address these issues, text mining tools have been developed to help scientists focus on relevant abstracts. We have implemented the MedlineRanker webserver, which allows a flexible ranking of Medline for a topic of interest without expert knowledge. Given some abstracts related to a topic, the program deduces automatically the most discriminative words in comparison to a random selection. These words are used to score other abstracts, including those from not yet annotated recent publications, which can be then ranked by relevance. We show that our tool can be highly accurate and that it is able to process millions of abstracts in a practical amount of time. MedlineRanker is free for use and is available at http://cbdm.mdc-berlin.de/tools/medlineranker.Entities:
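The abstract does not give the exact scoring formula, but the idea of deducing discriminative words from a training set versus a background set and summing their weights can be sketched as follows. This is a minimal illustrative version using naive Bayes-style log-odds word weights (the specific weighting scheme is an assumption, not necessarily the authors' implementation):

```python
import math
from collections import Counter

def word_weights(topic_abstracts, background_abstracts, smoothing=1.0):
    """Log-odds weight per word: how much more frequently it occurs in
    topic-related abstracts than in the random background selection."""
    pos = Counter(w for a in topic_abstracts for w in set(a.lower().split()))
    bg = Counter(w for a in background_abstracts for w in set(a.lower().split()))
    n_pos, n_bg = len(topic_abstracts), len(background_abstracts)
    weights = {}
    for w in set(pos) | set(bg):
        # Smoothed document frequencies, so unseen words get finite weights.
        p_pos = (pos[w] + smoothing) / (n_pos + 2 * smoothing)
        p_bg = (bg[w] + smoothing) / (n_bg + 2 * smoothing)
        weights[w] = math.log(p_pos / p_bg)
    return weights

def score(abstract, weights):
    """Score an unseen abstract by summing the weights of its words."""
    return sum(weights.get(w, 0.0) for w in set(abstract.lower().split()))
```

Ranking Medline then amounts to computing `score` for every abstract and sorting in descending order; words with the largest positive weights correspond to the "discriminative words" table in the results page.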
Mesh:
Year: 2009 PMID: 19429696 PMCID: PMC2703945 DOI: 10.1093/nar/gkp353
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. The results page is composed of several sections. The table of significant abstracts (A), here related to microarray and protein aggregation, is sorted by ascending P-values and shows article titles and PubMed identifiers (PMIDs). Discriminative words are highlighted in titles or in abstracts, which are displayed in a popup window hyperlinked from their PMID (B). The performance of the ranking is shown using a table and the corresponding Receiver Operating Characteristic (ROC) curve plotting the sensitivity versus the false positive rate (C). The last section contains the table of discriminative words (D), which is sorted by decreasing weights (the most important words at the top).
Figure 2. Parameter estimation. The number of abstracts in the background and training sets has an impact on the ROC area for various biomedical topics. The y-axis shows the mean ROC area after leave-one-out cross-validations over 10 random background sets using 1000 training set abstracts (left column), or 10 bootstrapped training sets using the rest of Medline as background set (right column).
Number of true positives in manually evaluated sets
| Abstracts selection | Host–pathogen interactions | Phosphorylation-dependent mechanisms |
|---|---|---|
| TOP100 | 99 (99%) | 71 (71%) |
| TOP50 | 49 (98%) | 41 (82%) |
| TOP25 | 25 (100%) | 19 (76%) |
| TOP10 | 10 (100%) | 9 (90%) |
Manual validations on 200 abstracts. The ranking of two topics was manually validated. The first topic, host–pathogen interactions, was used to rank abstracts related to Arabidopsis thaliana. The second topic, phosphorylation-dependent molecular processes, was used to rank abstracts from three PPI databases (HPRD, MINT and DIP). The proportion of true positives was calculated from the manual validation of the best 100 (TOP100), 50 (TOP50), 25 (TOP25) and 10 (TOP10) abstracts.
ROC areas of various topics
| Topic | Positives | Negatives | MedlineRanker | MScanner |
|---|---|---|---|---|
| Virus contamination in Europe | 28 | 24426 | 0.99977 | 0.9075 |
| Microarray and protein aggregation | 71 | 24689 | 0.99795 | 0.8724 |
| Radiology (10) | 53 | 47772 | 0.99748 | 0.9939 |
| Text mining | 312 | 24777 | 0.99601 | 0.9560 |
| Phosphorylation-dependent processes | 136 | 24572 | 0.99421 | 0.9867 |
| Systems biology and pathway | 407 | 24609 | 0.98812 | 0.9671 |
| Microarray and cancer | 8327 | 24592 | 0.97041 | 0.9889 |
| AIDSBio (10) | 4099 | 47746 | 0.94179 | 0.9910 |
| PG07 (10) | 1611 | 47758 | 0.90237 | 0.9754 |
MedlineRanker was compared to MScanner for various topics by the mean ROC area after 10-fold cross-validations (the two columns on the right). The same numbers of abstracts in the training set (positives) and in the background set (negatives) were used by both methods.
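The ROC area used in both tables can be computed directly from scores and binary labels. One self-contained way to do this is the Mann-Whitney U formulation, which equals the probability that a randomly chosen positive abstract outscores a randomly chosen negative one (an implementation choice for illustration; the text does not specify how the areas were computed):

```python
def roc_area(scores, labels):
    """ROC area via the Mann-Whitney U statistic: the fraction of
    (positive, negative) pairs where the positive scores higher,
    counting ties as half a win."""
    pos = [s for s, lab in zip(scores, labels) if lab]
    neg = [s for s, lab in zip(scores, labels) if not lab]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A perfect ranking, where every positive outscores every negative, yields an area of 1.0; a random ranking yields about 0.5. Note this pairwise loop is O(pos x neg) and is meant for clarity; a rank-based version scales better for tables with tens of thousands of negatives.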