Erfan Younesi, Luca Toldo, Bernd Müller, Christoph M Friedrich, Natalia Novac, Alexander Scheer, Martin Hofmann-Apitius, Juliane Fluck.
Abstract
BACKGROUND: For selection and evaluation of potential biomarkers, inclusion of already published information is of utmost importance. In spite of significant advancements in text- and data-mining techniques, the vast knowledge space of biomarkers in biomedical text has remained unexplored. Existing named entity recognition approaches are not sufficiently selective for the retrieval of biomarker information from the literature. The purpose of this study was to identify textual features that enhance the effectiveness of biomarker information retrieval for different indication areas and diverse end user perspectives.
Year: 2012 PMID: 23249606 PMCID: PMC3541249 DOI: 10.1186/1472-6947-12-148
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Figure 1. An overview of the biomarker retrieval system design.
Figure 2. Example annotation of a biomarker abstract in SCAIView.
Figure 3. Incorporation and usage of biomarker terminology in SCAIView. Biomarker terminology is visualized in SCAIView as a navigable tree (left). The user query is formulated either by finding concepts through auto-complete in the Query Builder or by navigating the tree. Finally, the type of output (entities or documents) is selected from the drop-down menu.
Biomarker retrieval terminology classes and coverage of the terminology in the annotated corpora
| Class | Description | Example terms | Coverage |
|---|---|---|---|
| Clinical Management | Terms indicating clinical investigations on patients | Patient; Cohort study | 96% |
| Diagnostics | Terms representing clinical as well as molecular diagnostics | Immunohistochemistry; Emission Tomography; Microarray | 85% |
| Prognosis | Terms indicating the prediction for a patient / kind of biomarker / outcome of therapies | Surrogate end point; clinical response; biomarker; predictor | 88% |
| Statistics | Statistical methods indicating the strength of the biomarker relationship | Chi(2) test; mean +/− SD; univariate analysis; Kaplan-Meier Analysis | 48% |
| Evidence | Terms describing genetic/molecular evidence for activity of a gene | Mutation; gene amplification; polymorphism; expression | 82% |
| Antecedent | Terms expressing exposure to hazardous agents and risk factors | Smoker; susceptibility; exposure | 20% |
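The terminology classes above act as filters on retrieved abstracts. The following is a minimal sketch of that idea, assuming plain case-insensitive substring matching against the example terms listed in the table; the real system relies on SCAIView's annotation, which is not shown in this record. The names `TERMINOLOGY`, `matched_classes`, and `filter_abstracts`, as well as the two sample abstracts, are hypothetical.

```python
# Example terms copied from the terminology table; the matching logic is an
# illustrative assumption, not the SCAIView implementation.
TERMINOLOGY = {
    "Clinical Management": ["patient", "cohort study"],
    "Diagnostics": ["immunohistochemistry", "emission tomography", "microarray"],
    "Prognosis": ["surrogate end point", "clinical response", "biomarker", "predictor"],
    "Evidence": ["mutation", "gene amplification", "polymorphism", "expression"],
}


def matched_classes(abstract, terminology=TERMINOLOGY):
    """Return the set of terminology classes with at least one term in the abstract."""
    text = abstract.lower()
    return {
        cls
        for cls, terms in terminology.items()
        if any(term in text for term in terms)
    }


def filter_abstracts(abstracts, required_classes):
    """Keep only abstracts that match every class in required_classes."""
    required = set(required_classes)
    return [a for a in abstracts if required <= matched_classes(a)]


# Hypothetical sample abstracts, for illustration only.
sample_abstracts = [
    "VEGFA expression was an independent predictor of survival in NSCLC patients.",
    "We describe a new cloning vector for yeast.",
]
selected = filter_abstracts(sample_abstracts, ["Prognosis", "Evidence"])
```

Substring matching is deliberately crude: "patient" also matches "patients", which is convenient here, but it would equally match unrelated words that contain the same letters.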
Spot check of NSCLC genes/proteins for their relevance to biomarker applicability and evidence
| Gene | Abstracts | Rank (relative entropy) | Rank (frequency) | PMID | Biomarker applicability | Evidence |
|---|---|---|---|---|---|---|
| P53 | 281 | 1 | 1 | 20521348 | Prognosis | Expression |
| VEGFA | 128 | 6 | 4 | 21964530 | Prognosis | Expression |
| BCL2 | 90 | 8 | 6 | 19560836 | Survival | Expression |
| PCNA | 47 | 11 | 10 | 21495034 | Prognosis | Expression |
| VEGFC | 22 | 16 | 21 | 19758816 | Recurrence | Expression |
| PTGS2 | 33 | 25 | 15 | 20592629 | Survival | Expression |
| CDKN1B | 12 | 50 | 52 | 15483027 | Prognosis | Expression |
| DNAJB4 | 2 | 100 | 344 | 16788156 | Survival | Expression |
| CDK2 | 5 | 342 | 112 | 10817513 | Prognosis | No impact |
For the SCAIView search, the MeSH disease “Carcinoma, Non-Small-Cell Lung” was selected in combination with five biomarker retrieval classes (omitting the class Antecedent), and the retrieved genes were ranked according to relative entropy or frequency. In spot tests, the abstracts were checked for biomarker information on the respective genes.
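To illustrate why the two rankings can diverge, here is a minimal sketch. This record does not give the exact relative entropy formula used in SCAIView, so the score below is an assumption: the pointwise contribution p * log2(p / q), where p is the share of retrieved abstracts mentioning a gene and q is its share in a reference corpus. The gene names and counts are made up for illustration.

```python
import math


def rank_by_frequency(subset_counts):
    """Rank genes by the number of retrieved abstracts mentioning them (descending)."""
    return sorted(subset_counts, key=lambda g: (-subset_counts[g], g))


def rank_by_relative_entropy(subset_counts, corpus_counts, n_subset, n_corpus):
    """Rank genes by an assumed pointwise relative-entropy score p * log2(p / q).

    p: fraction of retrieved abstracts mentioning the gene
    q: fraction of reference-corpus abstracts mentioning the gene
    """
    scores = {}
    for gene, count in subset_counts.items():
        p = count / n_subset
        q = corpus_counts[gene] / n_corpus
        scores[gene] = p * math.log2(p / q)
    return sorted(scores, key=lambda g: (-scores[g], g))


# Hypothetical counts: GENE_A is common everywhere, GENE_B and GENE_C are
# enriched in the retrieved subset.
subset_counts = {"GENE_A": 50, "GENE_B": 40, "GENE_C": 5}
corpus_counts = {"GENE_A": 5000, "GENE_B": 400, "GENE_C": 6}

by_frequency = rank_by_frequency(subset_counts)
by_entropy = rank_by_relative_entropy(subset_counts, corpus_counts, 100, 10000)
```

Under this assumed score, a gene that is frequent in the whole corpus (GENE_A) leads the frequency ranking but falls to the bottom of the relative entropy ranking, because it is no more common in the retrieved subset than anywhere else.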
Figure 4. Comparative gene enrichment plots for gene/protein entities retrieved in the context of AD. Genes were ranked based on frequency (A) or relative entropy (B) and evaluated against the Alzheimer’s genes/proteins gold standard. The red color represents the retrieval of all abstracts containing human genes/proteins. The other colors indicate the retrieval rate after additive inclusion of further terminology classes.
Figure 5. Comparative gene enrichment plots for gene/protein entities retrieved in the context of MS. Genes were ranked based on frequency (A) or relative entropy (B) and evaluated against the multiple sclerosis genes/proteins gold standard. The red color represents the retrieval of all abstracts containing human genes/proteins. The other colors indicate the retrieval rate after additive inclusion of further terminology classes.
Figure 6. Frequency-based gene enrichment plots after combining additive classes in the context of AD and MS. Genes were ranked based on frequency and evaluated against the Alzheimer’s (A) or multiple sclerosis (B) gold standards. The colors indicate the retrieval rate after additive inclusion of further terminology classes.
Performance evaluation for Alzheimer’s disease
| Selection | Rank | Recall | Precision | F-score |
|---|---|---|---|---|
| Baseline: Genes / Proteins | 60 | 0.10 | 0.92 | 0.17 |
| | 230 | 0.30 | 0.73 | 0.42 |
| | 469 | 0.50 | 0.61 | 0.55 |
| | 728 | 0.67 | 0.52 | 0.59 |
| Baseline + Clinical Management | 61 | 0.10 | 0.90 | 0.17 |
| | 226 | 0.30 | 0.75 | 0.42 |
| | 465 | 0.50 | 0.61 | 0.55 |
| | 682 | 0.65 | 0.54 | 0.59 |
| Baseline + Evidence Marker | 62 | 0.10 | 0.89 | 0.17 |
| | 225 | 0.30 | 0.75 | 0.42 |
| | 464 | 0.50 | 0.61 | 0.55 |
| | 654 | 0.62 | 0.54 | 0.57 |
| Baseline + Prognosis | 63 | 0.10 | 0.87 | 0.17 |
| | 247 | 0.30 | 0.68 | 0.41 |
| | 541 | 0.50 | 0.52 | 0.51 |
| | 740 | 0.61 | 0.47 | 0.53 |
| Baseline + Diagnostics | 64 | 0.10 | 0.86 | 0.17 |
| | 237 | 0.30 | 0.71 | 0.42 |
| | 494 | 0.50 | 0.57 | 0.53 |
| | 520 | 0.52 | 0.57 | 0.54 |
| Baseline + Statistics | 64 | 0.10 | 0.86 | 0.17 |
| | 227 | 0.30 | 0.74 | 0.42 |
| | 678 | 0.50 | 0.42 | 0.45 |
| | 377 | 0.41 | 0.62 | 0.49 |
| Baseline + Clinical Management + Evidence Marker | 60 | 0.10 | 0.92 | 0.17 |
| | 224 | 0.30 | 0.75 | 0.42 |
| | 451 | 0.50 | 0.63 | 0.56 |
| | 555 | 0.57 | 0.59 | 0.58 |
| Baseline + Clinical Management + Evidence Marker + Prognosis | 61 | 0.10 | 0.90 | 0.17 |
| | 230 | 0.30 | 0.73 | 0.42 |
| | 568 | 0.50 | 0.50 | 0.50 |
| | 479 | 0.47 | 0.56 | 0.51 |
Genes were ranked based on frequency and evaluated against the Alzheimer’s gold standard. For each selection, recall, precision, F-score, and rank were estimated at 10%, 30%, and 50% recall. In addition, the maximal recall was estimated.
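The evaluation described above can be sketched as follows, assuming the F-score is the balanced harmonic mean of precision and recall (this record does not state the weighting). Under that assumption, a table row with recall 0.50 and precision 0.61 gives 2 × 0.50 × 0.61 / 1.11 ≈ 0.55, which matches the reported value. The function names and the toy gene list are hypothetical.

```python
def metrics_at_rank(ranked_genes, gold, k):
    """Precision, recall and balanced F-score for the top-k ranked genes."""
    top_k = ranked_genes[:k]
    hits = sum(1 for gene in top_k if gene in gold)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def rank_for_recall(ranked_genes, gold, target_recall):
    """Smallest rank at which recall reaches target_recall, or None if never reached."""
    hits = 0
    for rank, gene in enumerate(ranked_genes, start=1):
        if gene in gold:
            hits += 1
            if hits / len(gold) >= target_recall:
                return rank
    return None


# Toy ranking: g* are gold-standard genes, x* are not; g5 is never retrieved,
# so the maximal achievable recall is 4/5.
ranked = ["g1", "g2", "x1", "g3", "x2", "g4"]
gold = {"g1", "g2", "g3", "g4", "g5"}

rank_50 = rank_for_recall(ranked, gold, 0.5)
precision, recall, f_score = metrics_at_rank(ranked, gold, rank_50)
max_recall = metrics_at_rank(ranked, gold, len(ranked))[1]
```

Because some gold-standard genes may never be retrieved, a recall target can be unreachable; the sketch returns `None` in that case rather than a rank.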
Figure 7. Proportion of retrieved abstracts containing biomarker information for Alzheimer’s disease. The red color represents the retrieval of all abstracts containing human genes/proteins. The other colors indicate the retrieval rate after additive inclusion of further terminology classes.
Examples of articles accepted to contain biomarker information
| PMID | Gene(s) | Evidence type | Supporting text |
|---|---|---|---|
| 17387528 | ACAD8 HMGCS2 | SNP | In a European screening sample of 115 sporadic AD patients and 191 healthy control subjects, we analyzed single nucleotide polymorphisms in 28 cholesterol-related genes for association with AD. The genes HMGCS2, FDPS, RAFTLIN, ACAD8, NPC2, and ABCG1 were associated with AD at a significance level of P < or = 0.05 in this sample. |
| 17531353 | SLC17A7 | Protein expression decrease | Loss of VGLUT1 and VGLUT2 in the prefrontal cortex is correlated with cognitive decline in Alzheimer disease…We quantified VGLUT1 and VGLUT2 in the prefrontal dorsolateral cortex (Brodmann area 9) of controls and AD patients using specific antiserums. A dramatic decrease in VGLUT1 and VGLUT2 was observed in AD using Western blot |
| 19863188 | HPX SERPINF1 | Cerebrospinal fluid concentration | Five differentially-expressed proteins with potential roles in amyloid-beta metabolism and vascular and brain physiology [apolipoprotein A-1 (Apo A-1), cathepsin D (CatD), hemopexin (HPX), transthyretin (TTR), and two pigment epithelium-derived factor (PEDF) isoforms] were identified. Apo A-1, CatD and TTR were significantly reduced in the AD pool sample, while HPX and the PEDF isoforms were increased in AD CSF |
Example evidence for genes retrieved via SCAIView for Alzheimer’s disease and multiple sclerosis but not found in the corresponding gold standards.
Evaluation of genes not found in AD gold standard but retrieved using the biomarker terminology
| 400 | 158 | 241 |