Carlo A Trugenberger, Christoph Wälti, David Peregrim, Mark E Sharp, Svetlana Bureeva.
Abstract
BACKGROUND: Biomarkers and target-specific phenotypes are important to targeted drug design and individualized medicine, and thus constitute an important aspect of modern pharmaceutical research and development. Increasingly, the discovery of relevant biomarkers is aided by in silico techniques that apply data mining and computational chemistry to large molecular databases. However, an even larger source of valuable information can potentially be tapped for such discoveries: repositories of research documents.
Year: 2013 PMID: 23402646 PMCID: PMC3605201 DOI: 10.1186/1471-2105-14-51
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. InfoCodex information map. Information map obtained for the approximately 115,000 documents of the PubMed repository used for the present experiment. The size of the dot in the center of each class indicates the number of documents assigned to it.
InfoCodex computed meanings
| Nn1250 | clinical study | insuline glargine |
| Tolterodine | cavity | overactive bladder |
| Ranibizumab | drug | macular edema |
| Nn5401 | clinical study | insulin aspart |
| Duloxetine | antidepressant | personal physician |
| Endocannabinoid | receptor | Enzyme |
| Becaplermin | pathology | Ulcer |
| Candesartan | cardiovascular disease | high blood pressure |
| Srt2104 | medicine | Placebo |
| Olmesartan | cardiovascular medicine | Amlodipine |
| Hctz | diuretic drug | Hydrochlorothiazide |
| Eslicarbazepine | anti nervous | Zebinix |
| Zonisamide | anti nervous | Topiramate Capsules |
| Mk0431 | antidiabetic | Sitagliptin |
| Ziprasidone | tranquilizer | major tranquilizer |
| Psicofarmcolagia | motivation | Incentive |
| Medoxomil | cardiovascular medicine | Amlodipine |
InfoCodex computed meanings of some unknown terms from the experimental PubMed collection.
UMLS benchmark sources, numbers, and examples
| NCI | 58 | C0007595 | FABP4 gene | RO | gene_plays_role_in_process | C1333527 | Cell Growth |
| MSH | 45 | C0022621 | FTO protein, mouse | RN | mapped_to | C2002654 | Oxo-Acid-Lyases |
| OMIM | 44 | C0064317 | KHK gene | RO | related_to | C1416630 | Ketohexokinase |
| MTH | 38 | C0061352 | GCGR gene | RO | | C1415011 | Glucagon Receptor |
| LNC | 20 | C0005767 | MC4R gene mutation analysis:… | RO | has_system | C1715956 | Blood |
Sources, numbers, and examples of benchmark D&O biomarkers/phenotypes extracted from UMLS (CUI: Concept Unique Identifier; RO: Related Other; RN: Related Narrow).
Figure 2. Thomson Reuters obesity algorithm. Obesity example of the Thomson Reuters algorithm for scoring matches between InfoCodex output (“All obesity records”) and Thomson Reuters knowledge bases.
Figure 3. PubMed results confidence level distribution. Confidence level distribution of candidates discovered by InfoCodex text mining of the experimental PubMed collection.
PubMed results with highest confidence levels
| # | Candidate term (A) | Relationship (B) | Disease (C) | Confidence level, % (D) | Number of documents (E) | Most relevant PubMed IDs (F) |
| 1 | glycemic control | BiomarkerFor | Diabetes | 70.3 | 1122 | 20110333, 20128112, 20149122, |
| 2 | Insulin | PhenoTypeOf | Diabetes | 68.3 | 5000 | 19995096, 20017431, 20043582, |
| 3 | Proinsulin | BiomarkerFor | Diabetes | 67.8 | 105 | 16108846, 9405904, 20139232, |
| 4 | TNF alpha inhibitor | PhenoTypeOf | Diabetes | 67.1 | 245 | 9506740, 20025835, 20059414, |
| 5 | anhydroglucitol | BiomarkerFor | Diabetes | 67.1 | 10 | 20424541, 20709052, 21357907, |
| 6 | linoleic acid | BiomarkerFor | Diabetes | 67.1 | 61 | 20861175, 20846914, 15284064, |
| 7 | palmitic acid | BiomarkerFor | Diabetes | 67.1 | 24 | 20861175, 20846914, 21437903, |
| 8 | pentosidine | BiomarkerFor | Diabetes | 67.1 | 13 | 21447665, 21146883, 17898696, |
| 9 | uric acid | BiomarkerFor | Obesity | 66.8 | 433 | 10726195, 19428063, 10904462, |
| 10 | proatrial natriuretic peptide | BiomarkerFor | Obesity | 66.6 | 4 | 14769680, 18931036, 17351376, |
| 11 | ALT values | BiomarkerFor | Diabetes | 66.3 | 2 | 20880180, 19010326 |
| 12 | adrenomedullin | BiomarkerFor | Diabetes | 64.3 | 7 | 21075100, 21408188, 20124980, |
| 13 | fructosamin | BiomarkerFor | Diabetes | 64.2 | 59 | 20424541, 21054539, 18688079, |
| 14 | TNF alpha inhibitor | BiomarkerFor | Diabetes | 62.1 | 245 | 9506740, 20025835, 20059414, |
| 15 | uric acid | BiomarkerFor | Diabetes | 61.8 | 259 | 21431449, 20002472, 20413437, |
| 16 | monoclonal antibody | BiomarkerFor | Obesity | 61.7 | 41 | 14715842, 21136440, 21042773, |
| 17 | Insulin level QTL | PhenoTypeOf | Obesity | 61.2 | 1167 | 16614055, 19393079, 11093286, |
| 18 | stimulant | BiomarkerFor | Obesity | 61.2 | 646 | 18407040, 18772043, 10082070, |
| 19 | IL-10 | BiomarkerFor | Obesity | 60.9 | 120 | 19798061, 19696761, 20190550, |
| 20 | central obesity | PhenoTypeOf | Diabetes | 59.5 | 530 | 16099342, 17141913, 15942464, |
| 21 | lipid | BiomarkerFor | Obesity | 59.5 | 4279 | 11596664, 12059988, 12379160, |
| 22 | urine albumin screening | BiomarkerFor | Diabetes | 59.0 | 95 | 20886205, 19285607, 20299482, |
| 23 | tyrosine kinase inhibitor | BiomarkerFor | Obesity | 58.8 | 83 | 18814184, 9538268, 15235125, |
| 24 | TNF alpha inhibitor | BiomarkerFor | Obesity | 58.0 | 785 | 20143002, 20173393, 10227565, |
| 25 | fas | BiomarkerFor | Obesity | 57.7 | 179 | 12716789, 17925465, 19301503, |
| 26 | leptin | PhenoTypeOf | Diabetes | 57.6 | 870 | 11987032, 17372717, 18414479, |
| 27 | ALT values | BiomarkerFor | Obesity | 57.4 | 8 | 16408483, 19010326, 17255837, |
| 28 | lipase | BiomarkerFor | Obesity | 56.8 | 356 | 16752181, 17609260, 20512427, |
| 29 | insulin resistance | PhenoTypeOf | Obesity | 55.8 | 5000 | 20452774, 20816595, 21114489, |
| 30 | chronic inflammation | PhenoTypeOf | Diabetes | 55.7 | 154 | 15643475, 18673007, 18801863, |
Highest confidence level scoring biomarker/phenotype candidates discovered by InfoCodex text mining of the experimental PubMed collection. The identified candidate terms appear in column A, with their relationship to diabetes or obesity in columns B-C. The confidence level, in column D (the descending sort key), is normalized on a scale in which the maximum of 100% is the score of the manually curated reference biomarkers/phenotypes. In column E are the numbers of documents in which a given candidate term appears. Column F displays the PubMed IDs of the most relevant PubMed documents for purposes of manual SME review. Note that the same term can have multiple entries since it can have different relationships (biomarker for diabetes, phenotype for obesity, etc.).
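The normalization and descending sort described in this caption can be illustrated with a minimal sketch. How InfoCodex computes its raw scores is not stated here, so the raw values, the reference score, and the function names below are hypothetical and serve only to show the scaling step.

```python
def normalize_confidence(raw_scores, reference_score):
    """Scale raw scores so that the reference score maps to 100%.

    raw_scores: dict of candidate term -> raw score (hypothetical values).
    reference_score: score of the manually curated reference entries.
    """
    if reference_score <= 0:
        raise ValueError("reference_score must be positive")
    return {term: 100.0 * score / reference_score
            for term, score in raw_scores.items()}


def rank_candidates(confidences):
    """Return (term, confidence) pairs sorted by descending confidence."""
    return sorted(confidences.items(), key=lambda item: item[1], reverse=True)


# Made-up raw scores, purely for illustration.
raw = {"candidate_a": 3.5, "candidate_b": 1.4, "candidate_c": 2.8}
confidences = normalize_confidence(raw, reference_score=5.0)
ranked = rank_candidates(confidences)
```

Under this scaling a candidate can only reach 100% if it scores as high as the curated reference entries, which is consistent with the top novel candidates in the table sitting around 70%.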
Precision and recall
| I2E raw | PubMed | PubMed | (exact) <1% obesity; 3-5% diabetes; 3-7% MDOB | (exact) 5% obesity; 9-11% diabetes; 7% MDOB |
| I2E normalized | PubMed | PubMed | (exact) 3-7% MDOB | (exact) 3-7% MDOB |
| I2E manual | PubMed | PubMed | 1-5% obesity; 3-11% diabetes; 3-26% MDOB | 9-33% obesity; 9-31% diabetes; 4-15% MDOB |
| UMLS + GO + OMIM | UMLS + GO + OMIM | PubMed | 1-4%; 1-8% (unary) | 3-22%; 4-35% (unary) |
| Thomson Reuters | Thomson Reuters | PubMed | 7-36% obesity | 36% obesity |
| 18% DM2 | ||||
| 9-49% DM2 | 22% DM1 | |||
| 25% DI | ||||
| TGI | TGI | PubMed | 0-5% obesity | (exact) 2.5% |
| 0-4% diabetes | ||||
| 1-14% MDOB | ||||
| I2E manual | PubMed | ClinicalTrials.gov | (preferred terms) 27-59% | (preferred terms) 3-7% |
| UMLS + GO + OMIM | UMLS + GO + OMIM | ClinicalTrials.gov | (preferred terms) 1-2% | (preferred terms) <1% |
| I2E manual | PubMed | Merck internal | (preferred terms) 8-14% | (preferred terms) 1-2% |
| UMLS + GO + OMIM | UMLS + GO + OMIM | Merck internal | (preferred terms) <1% | (preferred terms) <1% |
Precision and recall of InfoCodex candidate biomarkers/phenotypes compared to various benchmarks. “(exact)” and “(preferred terms)” refer to sub-ranges according to the 2x2 matching matrix described in the text under “Methods – Precision/recall”. “MDOB” refers to the InfoCodex output subset containing references to the 27 Merck D&O biomarkers. “(unary)” means all InfoCodex candidate biomarkers/phenotypes were lumped together across obesity, diabetes, and MDOB, in contrast to the default binary criterion for matching.
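The difference between the binary and the unary matching criterion can be made concrete with a small sketch. It assumes that a binary match requires both the term and the disease to agree, that a unary match compares terms only, and that matching is exact up to case and surrounding whitespace; the 2x2 matching matrix from the Methods section is not reproduced. The function name and the placeholder data are hypothetical.

```python
def precision_recall(candidates, benchmark, unary=False):
    """Compute (precision, recall) of candidates against a benchmark.

    candidates, benchmark: iterables of (term, disease) pairs.
    unary=False: a hit requires term and disease to agree (binary criterion).
    unary=True: diseases are ignored and only terms are compared.
    """
    def key(pair):
        term, disease = pair
        term = term.strip().lower()
        return term if unary else (term, disease.strip().lower())

    cand = {key(pair) for pair in candidates}
    bench = {key(pair) for pair in benchmark}
    hits = cand & bench
    precision = len(hits) / len(cand) if cand else 0.0
    recall = len(hits) / len(bench) if bench else 0.0
    return precision, recall


# Placeholder terms, not taken from any benchmark.
candidates = [("term_1", "diabetes"), ("term_2", "obesity"),
              ("term_3", "obesity"), ("term_4", "diabetes")]
benchmark = [("term_1", "obesity"), ("term_2", "obesity"),
             ("term_5", "obesity")]

binary_pr = precision_recall(candidates, benchmark)
unary_pr = precision_recall(candidates, benchmark, unary=True)
```

In this toy case term_1 counts as a hit only under the unary criterion, because the candidate and the benchmark attach it to different diseases. This is the kind of effect that would widen the "(unary)" ranges relative to the binary ones.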
Figure 4. PubMed results confidence levels x I2E-manual precision. Correlation between InfoCodex confidence levels (Conf%; purple bars) and precision (light blue bars) against I2E-manual diabetes PubMed benchmark. Pink shading: exact match; yellow shading: partial match. Row 15 (100 Conf%) represents a member of the manually compiled reference set.
SME relevance analysis
| | | |
| Diabetes Type 1 | 1.6 | 3.2 |
| Diabetes Type 2 | 3.6 | 3.7 |
| Obesity | 6.9 | 6.2 |
| | | |
| Diabetes Type 1 | 0.7 | 3.4 |
| Diabetes Type 2 | 0.9 | 3.6 |
| Obesity | 2.6 | 2.8 |
Scale is described in main text.
UMLS mapping
| Exact | ABCC8 gene | ABCC8 gene |
| Left substring | ABCC8 | ABCC8 gene |
| | 9-cis-retinoic acid | 9-cis-retinoic acid biosynthesis |
| | Cara | C ara A |
| | ||
| Between 2 (;; = separator) | acute coronary syndromes | Acute Coronary Syndrome ;; Acute coronary thrombosis… |
| | abnormal laboratory findings | Abnormal Keratinocyte ;; Abnormal Laboratory Result (Biochemistry) |
| | ||
Examples of novel InfoCodex biomarker/phenotype candidates mapped to UMLS by three uncurated match types. The italicized matches are clearly false (unrelated conceptually).
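One way to read the three match types is as a lookup in an alphabetically sorted vocabulary. The sketch below is an assumption about the procedure, not the authors' implementation. It normalizes terms by ignoring case and whitespace, a choice made so that "Cara" lines up with "C ara A" as in the table. It then reports an exact match, a left-substring match against the next term in sort order, or otherwise the two neighbouring terms. The function names and the five-term vocabulary are illustrative only.

```python
from bisect import bisect_left


def _norm(term):
    """Lowercase and drop all whitespace (an assumed normalization)."""
    return "".join(term.lower().split())


def classify_match(candidate, vocabulary):
    """Return (match_type, matched_terms) for a candidate term.

    match_type is "exact", "left substring", or "between 2".
    """
    ordered = sorted(vocabulary, key=_norm)
    keys = [_norm(term) for term in ordered]
    cand = _norm(candidate)
    i = bisect_left(keys, cand)  # first position with keys[i] >= cand
    if i < len(keys) and keys[i] == cand:
        return "exact", [ordered[i]]
    if i < len(keys) and keys[i].startswith(cand):
        return "left substring", [ordered[i]]
    # No exact or prefix match: report the sorted neighbours on either side.
    return "between 2", ordered[max(i - 1, 0):i] + ordered[i:i + 1]


# Tiny illustrative vocabulary built from terms shown in the table.
vocabulary = [
    "ABCC8 gene",
    "9-cis-retinoic acid biosynthesis",
    "C ara A",
    "Acute Coronary Syndrome",
    "Acute coronary thrombosis",
]

exact = classify_match("ABCC8 gene", vocabulary)
prefix = classify_match("Cara", vocabulary)
between = classify_match("acute coronary syndromes", vocabulary)
```

The "Cara" case shows why uncurated prefix matches need review: the strings align after normalization, yet the concepts are unrelated.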
UMLS match type distribution
| PubMed | 789 (39%) | 591 (29%) | 632 (31%) |
| ClinicalTrials.gov | 409 (52%) | 225 (29%) | 155 (20%) |
| Merck internal | 24 (28%) | 25 (29%) | 38 (44%) |
UMLS match type distribution of novel InfoCodex biomarker/phenotype candidates from the three corpora analyzed.
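The percentages in this table are shares of each corpus's row total, which a few lines of arithmetic confirm. The counts come from the table; the helper name is hypothetical. Because the column labels are not preserved in this extract, the sketch treats the three counts per row positionally.

```python
def row_percentages(counts):
    """Return each count as a rounded percentage of the row total."""
    total = sum(counts)
    return [round(100 * count / total) for count in counts]


# Counts per corpus as listed in the table, in column order.
rows = {
    "PubMed": [789, 591, 632],
    "ClinicalTrials.gov": [409, 225, 155],
    "Merck internal": [24, 25, 38],
}

totals = {corpus: sum(counts) for corpus, counts in rows.items()}
percentages = {corpus: row_percentages(counts) for corpus, counts in rows.items()}
```

The row totals give the number of novel candidates mapped per corpus (2012, 789, and 87). Rounded shares need not sum to exactly 100, as in the PubMed row (39 + 29 + 31 = 99).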
Figure 5. PubMed results confidence levels x UMLS match type. Confidence levels of novel InfoCodex biomarker/phenotype candidates from PubMed broken down by match type to UMLS terms (100% refers to the manually discovered reference/training set).
Figure 6. ClinicalTrials.gov results confidence levels x UMLS match type. Confidence levels of novel InfoCodex biomarker/phenotype candidates from ClinicalTrials.gov broken down by match type to UMLS terms (100% refers to the reference/training set).
Figure 7. Merck P3 results confidence levels × UMLS match type. Confidence levels of novel InfoCodex biomarker/phenotype candidates from Merck internal research documents broken down by match type to UMLS terms (100% indicates the reference/training set).
Figure 8. Novel candidates repository overlap. Overlap between novel InfoCodex biomarker/phenotype candidates from PubMed (PM), ClinicalTrials.gov (CT), and Merck internal research documents (P3). Lavender shading: found in one repository only; dark violet shading: found in all three; others: found in two.
Uninteresting novel candidates from Merck internal research documents
| wenqing | BiomarkerFor | Obesity | Obesity | 53.5 | 29 |
| proteomic | BiomarkerFor | Obesity | Obesity | 40.8 | 128 |
| gene expression | BiomarkerFor | Obesity | Obesity | 38.9 | 62 |
| Mouse model | BiomarkerFor | Obesity | Obesity | 19.8 | 17 |
| muise | BiomarkerFor | Obesity | Obesity | 17.5 | 20 |
| athero- | BiomarkerFor | Obesity | Obesity | 16.5 | 6 |
| shrna | BiomarkerFor | Obesity | Obesity | 9.6 | 4 |
| inflammation | BiomarkerFor | Obesity | Obesity | 8.2 | 4 |
| TBD | BiomarkerFor | Obesity | Obesity | 7.4 | 3 |
| body weight | PhenoTypeOf | Diabetes | MGAT2 | | 1 |
| cell line | BiomarkerFor | Diabetes | MGAT2 | | 1 |
Examples of uninteresting novel InfoCodex biomarker/phenotype candidates from Merck internal research documents.