QuanQiu Wang, Rong Xu.
Abstract
Many diseases are driven by gene-environment interactions. One important environmental factor is the metabolic output of the human gut microbiota. A comprehensive catalog of human metabolites that originate from microbes is critical for data-driven approaches to understanding how microbial metabolism contributes to human health and disease. Here we present a novel integrated approach to automatically extract and analyze microbial metabolites from 28 million published biomedical records. First, we classified 28,851,232 MEDLINE records as microbial metabolism-related or not. Second, candidate microbial metabolites were extracted from the classified texts. Third, we developed signal prioritization algorithms to further differentiate microbial metabolites from metabolites originating from other sources. Finally, we systematically analyzed the interactions between extracted microbial metabolites and human genes. A total of 11,846 metabolites were extracted from 28 million MEDLINE articles. The combined text classification and signal prioritization significantly enriched true positives among the top-ranked metabolites: manual curation of the top 100 metabolites showed a true precision of 0.55, a 38.3-fold enrichment over the precision of 0.014 for baseline extraction. More importantly, 29% of the extracted microbial metabolites have not been captured by existing databases. We performed a data-driven analysis of the interactions between the extracted microbial metabolites and human genetics. This study represents the first effort towards automatically extracting and prioritizing microbial metabolites from published biomedical literature. It can set a foundation for future work on extracting microbial metabolite relationships from the literature and can facilitate data-driven studies of how microbial metabolism contributes to human diseases.
Year: 2020 PMID: 32561832 PMCID: PMC7305201 DOI: 10.1038/s41598-020-67075-6
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Overall performance of microbial metabolite extractions from unclassified MEDLINE records, classified microbial-related MEDLINE records, and classified microbial metabolism related MEDLINE records. 172 known microbial metabolites from HMDB were used as the gold standard. The 2017 HMDB was used.
| Approach | Articles (n) | Extracted Metabolites (n) | Precision | Recall | F1 |
|---|---|---|---|---|---|
| Baseline (Unclassified articles) | 28,851,232 | 11,846 | 0.014 | 0.959 | 0.028 |
| Classified - Microbial | 42,431 | 2,346 | 0.049 | 0.674 | 0.092 |
| Classified - Microbial Metabolism | 16,728 | 2,016 | 0.055 | 0.640 | 0.101 |
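The F1 column is the harmonic mean of precision and recall. As a minimal sketch (the function name `f1_score` is illustrative and not taken from the study), the reported values can be checked against the rounded precision and recall in the table. Small differences in the third decimal place are expected because the inputs are already rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
rows = {
    "Baseline (Unclassified articles)": (0.014, 0.959, 0.028),
    "Classified - Microbial": (0.049, 0.674, 0.092),
    "Classified - Microbial Metabolism": (0.055, 0.640, 0.101),
}

for name, (p, r, reported) in rows.items():
    print(f"{name}: computed F1 = {f1_score(p, r):.3f}, reported = {reported:.3f}")
```

The comparison shows the trade-off in the table: classification reduces recall from 0.959 to 0.640 but raises precision roughly fourfold, which is what lifts F1.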
Figure 1. Precision-recall curves. Precisions and recalls were calculated using the 172 known microbial metabolites from HMDB as the evaluation dataset. Extraction was performed on 16,728 classified microbial metabolism-related articles.
Figure 2. Estimated precisions (evaluated using 172 known microbial metabolites from HMDB) and true precisions (evaluated using the combined list of 201 microbial metabolites from HMDB and from manual curation) of top-ranked metabolites at four ranking cut-offs (top 20, 50, 70 and 100 metabolites).
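Precision at a ranking cut-off k is the fraction of the top k ranked metabolites that appear in a reference set. The sketch below assumes the ranking is available as an ordered list of names. The function `precision_at_k` and all toy names are hypothetical placeholders, not the study's code or data. Scoring the same ranking against a smaller reference set (standing in for HMDB only) and a larger one (standing in for HMDB plus manual curation) illustrates why an estimated precision can understate the true precision.

```python
def precision_at_k(ranked, reference, k):
    """Fraction of the first k ranked items that are in the reference set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked[:k]
    hits = sum(1 for name in top_k if name in reference)
    return hits / len(top_k) if top_k else 0.0


# Toy ranking and reference sets (placeholders, not real metabolites).
ranked = ["m1", "m2", "m3", "m4", "m5"]
database_only = {"m1", "m4"}                   # stands in for the HMDB gold standard
database_plus_curation = {"m1", "m3", "m4"}    # adds manually curated true positives

for k in (2, 4):
    estimated = precision_at_k(ranked, database_only, k)
    true = precision_at_k(ranked, database_plus_curation, k)
    print(f"top {k}: estimated = {estimated:.2f}, true = {true:.2f}")
```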
Top 20 ranked metabolites with supporting evidence from HMDB or biomedical literature (the unique identifier number used in PubMed or PMID shown). True positives (13 out of 20) are in bold. Metabolites without supporting evidence are denoted with “NO”.
| Rank | Metabolite | Evidence | Rank | Metabolite | Evidence |
|---|---|---|---|---|---|
| 1 | | PMID 28616977 | 11 | | HMDB |
| 2 | | HMDB | 12 | | HMDB |
| 3 | phenol-formaldehyde, cross-linked, tetraethylenepentamine activated | NO | 13 | | PMID 26531326 |
| 4 | | HMDB | 14 | | HMDB |
| 5 | chymosin preparation, Escherichia coli k-12 | NO | 15 | | PMID 30189365 |
| 6 | | HMDB | 16 | | HMDB |
| 7 | | PMID 19716282 | 17 | gonyautoxin v | NO |
| 8 | | PMID 30029499 | 18 | | HMDB |
| 9 | trans-aconitic acid | NO | 19 | inulobiose | NO |
| 10 | urolithin c | PMID 19716282 | 20 | urolithin d | PMID 19716282 |
Top 20 human genes ranked by the number of associated microbial metabolites.
| Symbol | Name | Score | Symbol | Name | Score |
|---|---|---|---|---|---|
| TMPRSS11D | transmembrane serine protease 11D | 53 | TSPO | translocator protein | 30 |
| CAT | catalase | 50 | ODC1 | ornithine decarboxylase 1 | 30 |
| ALB | albumin | 47 | ACACA | acetyl-CoA carboxylase alpha | 28 |
| DECR1 | 2,4-dienoyl-CoA reductase 1 | 44 | GSR | glutathione-disulfide reductase | 27 |
| UMPS | uridine monophosphate synthetase | 32 | ACACB | acetyl-CoA carboxylase beta | 27 |
| TALDO1 | transaldolase 1 | 31 | F2 | coagulation factor II, thrombin | 26 |
| GLUL | glutamate-ammonia ligase | 31 | DPYD | dihydropyrimidine dehydrogenase | 26 |
| GLUL | glutamate-ammonia ligase | 31 | TYR | tyrosinase | 26 |
| PRDM10 | PR/SET domain 10 | 31 | UPP2 | uridine phosphorylase 2 | 26 |
| GUSB | glucuronidase beta | 30 | SORD | sorbitol dehydrogenase | 25 |
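A score of this kind can be obtained by counting the distinct metabolites associated with each gene. The following is a minimal sketch, assuming the metabolite-gene associations are available as (metabolite, gene) pairs. The function `rank_genes` and the placeholder pairs are hypothetical and do not reproduce the study's data or its association source.

```python
from collections import defaultdict


def rank_genes(pairs, top_n=20):
    """Rank genes by the number of distinct associated metabolites."""
    metabolites_by_gene = defaultdict(set)
    for metabolite, gene in pairs:
        # A set ignores repeated (metabolite, gene) pairs.
        metabolites_by_gene[gene].add(metabolite)
    scores = [(gene, len(mets)) for gene, mets in metabolites_by_gene.items()]
    # Highest score first; ties broken alphabetically by gene symbol.
    scores.sort(key=lambda item: (-item[1], item[0]))
    return scores[:top_n]


# Placeholder associations (not real data).
pairs = [
    ("metabolite_1", "GENE_A"),
    ("metabolite_2", "GENE_A"),
    ("metabolite_1", "GENE_A"),  # duplicate pair, counted once
    ("metabolite_1", "GENE_B"),
    ("metabolite_2", "GENE_C"),
    ("metabolite_3", "GENE_C"),
]

for gene, score in rank_genes(pairs):
    print(gene, score)
```

Deduplicating by gene symbol before ranking also guards against the same gene appearing in two rows of the output.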