Leonid Zaslavsky, Tiejun Cheng, Asta Gindulyte, Siqian He, Sunghwan Kim, Qingliang Li, Paul Thiessen, Bo Yu, Evan E Bolton.
Abstract
The literature knowledge panels developed and implemented in PubChem are described. These help to uncover and summarize important relationships between chemicals, genes, proteins, and diseases by analyzing co-occurrences of terms in biomedical literature abstracts. Named entities in PubMed records are matched with chemical names in PubChem, disease names in Medical Subject Headings (MeSH), and gene/protein names in popular gene/protein information resources, and the most closely related entities are identified using statistical analysis and relevance-based sampling. Knowledge panels for the co-occurrence of chemical, disease, and gene/protein entities are included in PubChem Compound, Protein, and Gene pages, summarizing these in a compact form. Statistical methods for removing redundancy and estimating relevance scores are discussed, along with benefits and pitfalls of relying on automated (i.e., not human-curated) methods operating on data from multiple heterogeneous sources.
Keywords: PubChem; data mining; information retrieval; knowledge discovery; knowledge graph; knowledge panels; knowledge summarization; natural language processing
Year: 2021 PMID: 34322655 PMCID: PMC8311438 DOI: 10.3389/frma.2021.689059
Source DB: PubMed Journal: Front Res Metr Anal ISSN: 2504-0537
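The co-occurrence idea summarized in the abstract can be illustrated with a minimal sketch. It only counts how many records two entities share and ranks neighbors by that count; it does not reproduce the statistical analysis, redundancy removal, or relevance scoring the abstract refers to. The function name `cooccurring_neighbors`, the record IDs, and the toy annotations are all hypothetical and made up for illustration.

```python
from collections import Counter


def cooccurring_neighbors(records, query, top_n=5):
    """Rank entities by the number of records they share with `query`.

    `records` maps a record ID to the set of entity labels matched in it.
    This is a plain co-occurrence count, not a relevance score.
    """
    counts = Counter()
    for entities in records.values():
        if query in entities:
            # Count every other entity mentioned in the same record.
            counts.update(e for e in entities if e != query)
    # Sort by count (descending), then by label for a deterministic order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Toy, made-up annotations (hypothetical record IDs and matches).
toy_records = {
    "rec1": {"ibuprofen", "ptgs2", "inflammation"},
    "rec2": {"ibuprofen", "ptgs2"},
    "rec3": {"ibuprofen", "aspirin", "inflammation"},
    "rec4": {"aspirin", "ptgs2"},
}

neighbors = cooccurring_neighbors(toy_records, "ibuprofen")
print(neighbors)
```

In this toy data, "inflammation" and "ptgs2" each share two records with "ibuprofen" and "aspirin" shares one, so a raw count alone cannot separate the first two; that is one reason a relevance-based ranking is useful on real data.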
Types of literature co-occurrence panels implemented in PubChem with examples.
| Page type | Query ID | Panel type | Link |
|---|---|---|---|
| Compound | CID: 3672 | Chemical-chemical | |
| Compound | CID: 3672 | Chemical-gene | |
| Compound | CID: 3672 | Chemical-disease | |
| Target | Gene symbol: ptgs2 | Gene-chemical | |
| Target | Gene symbol: ptgs2 | Gene-gene | |
| Target | Gene symbol: ptgs2 | Gene-disease | |
FIGURE 1. Chemical-chemical co-occurrence panel for ibuprofen (CID 3672), accessible at: https://pubchem.ncbi.nlm.nih.gov/compound/3672#section=Chemical-Co-Occurrences-in-Literature.
FIGURE 3. A chemical-disease co-occurrence panel for ibuprofen (CID 3672), accessible at: https://pubchem.ncbi.nlm.nih.gov/compound/3672#section=Chemical-Disease-Co-Occurrences-in-Literature.
FIGURE 4. Information and control options in the chemical co-occurrence panel for ibuprofen.
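The two panel addresses quoted in the figure captions follow the same pattern: a compound page URL plus a `#section=` anchor. A small sketch of building them is shown here. The helper `panel_url` and the dictionary `PANEL_SECTIONS` are hypothetical names, and only the two section anchors that appear in the captions are included; anchors for the other panel types are not given in this record and are deliberately left out.

```python
PUBCHEM_COMPOUND_URL = "https://pubchem.ncbi.nlm.nih.gov/compound/{cid}#section={section}"

# Section anchors copied from the Figure 1 and Figure 3 captions only.
PANEL_SECTIONS = {
    "chemical-chemical": "Chemical-Co-Occurrences-in-Literature",
    "chemical-disease": "Chemical-Disease-Co-Occurrences-in-Literature",
}


def panel_url(cid: int, panel_type: str) -> str:
    """Return the compound-page URL for a known panel type.

    Raises KeyError for panel types whose anchor is not listed above.
    """
    section = PANEL_SECTIONS[panel_type]
    return PUBCHEM_COMPOUND_URL.format(cid=cid, section=section)


print(panel_url(3672, "chemical-chemical"))
print(panel_url(3672, "chemical-disease"))
```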
General statistics (as of February 27, 2021).
| Category | # records | # records with a matched entity in the category | Portion of records that have a matched entity | # unique identifiers |
|---|---|---|---|---|
| Active records | 32.17M | n/a | n/a | n/a |
| Active records with an annotation | 27.34M | 23.24M | 85.0% | 359.75K |
| Active records having chemical annotations | 13.91M | 11.42M | 81.8% | 294.60K |
| Active records having disease annotations | 17.43M | 17.41M | 99.9% | 8.88K |
| Active records having gene, protein or enzyme annotations | 8.73M | 6.54M | 74.9% | 56.28K |
FIGURE 5. Number of occurrences of compounds (A), genes/proteins (B), and diseases (C) in PubMed records.
Top five most mentioned chemicals in PubMed records and the number of their chemical, gene/protein, and disease neighbors.
| CID | Chemical name | # PubMed records | # Chemical neighbors (non-redundant) | # Gene/protein neighbors | # Disease neighbors |
|---|---|---|---|---|---|
| 962 | Water | 823,657 | 47,183 | 15,934 | 4,538 |
| 5793 | D-glucose | 452,960 | 23,945 | 16,959 | 4,303 |
| 977 | Oxygen | 376,484 | 27,511 | 12,300 | 4,037 |
| 702 | Ethanol | 342,100 | 29,050 | 11,464 | 4,172 |
| 5460341 | Calcium | 324,490 | 17,402 | 13,408 | 4,258 |
Top five most mentioned diseases in PubMed records and the number of their chemical, gene/protein, and disease neighbors.
| MeSH ID | Name | # PubMed records | # Chemical neighbors (non-redundant) | # Gene/protein neighbors | # Disease neighbors |
|---|---|---|---|---|---|
| D009369 | Neoplasms | 2,455,851 | 47.1K | 28.3K | 6.0K |
| D007239 | Infections | 1,050,141 | 20.2K | 20.3K | 5.6K |
| D007249 | Inflammation | 567,355 | 21.0K | 16.2K | 5.3K |
| D064420 | Drug-related side effects and adverse reactions | 554,313 | 48.4K | 16.3K | 4.7K |
| D003920 | Diabetes mellitus | 499,870 | 13.7K | 11.7K | 4.7K |
FIGURE 6. Number of non-redundant neighbors of compounds (A), genes/proteins (B), and diseases (C).
Top five most mentioned genes/proteins in PubMed records and the number of their chemical, gene/protein, and disease neighbors.
| Symbol | Name | # PubMed records | # Chemical neighbors (non-redundant) | # Gene/protein neighbors | # Disease neighbors |
|---|---|---|---|---|---|
| Ins | Insulin | 329,358 | 13.4K | 12.3K | 3.9K |
| Tnf | Tumor necrosis factor | 212,766 | 14.1K | 11.9K | 3.7K |
| cd4 | CD4 (cluster of differentiation 4) | 155,735 | 6.3K | 8.4K | 3.4K |
| Alb | Albumin | 152,666 | 15.2K | 7.6K | 3.5K |
| il6 | Interleukin 6 | 141,371 | 10.6K | 10.1K | 3.5K |
FIGURE 7. Histogram of the relevance score values.
FIGURE 8. Annotations of vitamin B12 and cobalt in PMID 33053716 “Relationship between Vitamin B12 and Cobalt Metabolism in Domestic Ruminant: An Update”.
FIGURE 9. Values of the correction factor for chemical neighbors.
FIGURE 10. Histograms of the relevance score values for D-glucose (A) in PubMed records where D-glucose is co-mentioned with cholesterol, (B) in PubMed records where D-glucose is co-mentioned with any PubChem compound.