Zheni Zeng, Yuan Yao, Zhiyuan Liu, Maosong Sun.
Abstract
To accelerate the biomedical research process, deep-learning systems are being developed to automatically acquire knowledge about molecule entities by reading large-scale biomedical data. Inspired by how humans learn deep molecule knowledge from versatile reading of both molecule structures and biomedical text, we propose a knowledgeable machine reading system that bridges both types of information in a unified deep-learning framework for comprehensive biomedical research assistance. This addresses the limitation that existing machine reading models can only process different types of data separately, and thus achieves a comprehensive and thorough understanding of molecule entities. By grasping meta-knowledge in an unsupervised fashion within and across different information sources, our system can facilitate various real-world biomedical applications, including molecular property prediction and biomedical relation extraction. Experimental results show that our system even surpasses human professionals in molecular property comprehension, and also reveal its promising potential for facilitating automatic drug discovery and documentation in the future.
Year: 2022 PMID: 35165275 PMCID: PMC8844428 DOI: 10.1038/s41467-022-28494-3
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 Conceptual diagram of knowledgeable and versatile machine reading.
Here we take salicylic acid as an example. Inspired by how humans versatilely learn meta-knowledge within and across different types of information, our machine reading system first serializes (a) molecule structures via BPE on SMILES strings, then inserts the substrings into (c) a large-scale corpus and learns (b) a fine-grained mapping between different semantic units by (d) masked language modeling. In this way, the system can perform (e) knowledgeable and versatile reading, achieving good performance on both mono-information downstream tasks and versatile reading tasks.
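The serialization step in this caption can be illustrated with a toy byte-pair encoding (BPE) routine. This is a minimal sketch, not the authors' tokenizer: the four-molecule corpus, the character-level starting alphabet, the number of merges, and the function names `learn_bpe`, `merge_pair` and `encode` are all assumptions made for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe(smiles_list, num_merges):
    """Learn BPE merge rules, starting from single characters (an assumption)."""
    corpus = [list(s) for s in smiles_list]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Most frequent pair wins; ties are broken lexicographically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges


def encode(smiles, merges):
    """Segment a SMILES string by replaying the learned merges in order."""
    symbols = list(smiles)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Toy corpus (illustrative only): salicylic acid, benzoic acid, phenol, aspirin.
toy_corpus = [
    "O=C(O)c1ccccc1O",
    "O=C(O)c1ccccc1",
    "Oc1ccccc1",
    "CC(=O)Oc1ccccc1C(=O)O",
]
merges = learn_bpe(toy_corpus, num_merges=8)
tokens = encode("O=C(O)c1ccccc1O", merges)
print(tokens)
```

The resulting substrings play the role of the "substring patterns" that are inserted into text; a production tokenizer would be trained on a far larger SMILES collection and would likely treat multi-character atoms such as `Cl` specially.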
The main experimental results on mono-information tasks and versatile reading tasks.
| Model | Molecule structure tasks | | Natural language tasks | | Versatile reading tasks | | | | |
|---|---|---|---|---|---|---|---|---|---|
| | MoleculeNet | USP-few | ChemProt | BC5CDR | S-T Acc | S-T Rec@20 | T-S Acc | T-S Rec@20 | Score |
| RXNFP | 65.37 ± 0.63 | 78.97 ± 3.93 | 36.60 ± 0.76 | 9.55 ± 3.05 | 1.58 ± 0.33 | 1.19 ± 0.28 | 2.26 ± 0.14 | 0.81 ± 0.05 | 25.48 ± 1.06 |
| BERTwo | 66.67 ± 0.29 | 33.05 ± 0.60 | 44.10 ± 5.26 | 65.69 ± 0.20 | 17.00 ± 3.06 | 0.91 ± 0.22 | 17.89 ± 2.04 | 0.74 ± 0.11 | 32.23 ± 11.6 |
| SMI-BERT | 68.61 ± 0.63 | 56.79 ± 1.40 | 46.49 ± 2.21 | 74.36 ± 0.41 | 24.68 ± 0.22 | 19.14 ± 1.05 | 22.47 ± 0.80 | 19.82 ± 0.54 | 70.08 ± 1.40 |
| Sci-BERT | 70.65 ± 0.58 | 84.50 ± 0.71 | 84.61 ± 0.58 | | 50.38 ± 1.39 | 62.11 ± 1.49 | 50.12 ± 1.67 | 68.02 ± 1.87 | 81.59 ± 0.51 |
| KV-PLM | 84.59 ± 0.59 | 54.22 ± 0.94 | 71.80 ± 1.56 | ||||||
| KV-PLM* | 68.34 ± 0.52 | 69.13 ± 0.46 | 82.39 ± 0.69 | ||||||
For versatile reading tasks, we present test accuracy and recall for both SMILES-Text retrieval and Text-SMILES retrieval on PCdes. Score stands for accuracy on the CHEMIchoice task. Boldfaced numbers indicate significant advantage over the second-best results in one-sided t-test with p-value <0.05, and underlined numbers denote no significant difference.
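The retrieval metrics named in this note can be computed from a query-by-candidate similarity matrix. The sketch below assumes that accuracy means the correct candidate is ranked first and that Rec@20 means the correct candidate appears among the top 20; the source does not spell out either definition, and the function name `retrieval_metrics` and the toy scores are hypothetical.

```python
def retrieval_metrics(scores, k=20):
    """Return (top-1 accuracy, recall@k).

    scores[i][j] is the similarity between query i and candidate j.
    Assumption: the correct candidate for query i sits at index i.
    """
    n = len(scores)
    top1_hits = 0
    topk_hits = 0
    for i, row in enumerate(scores):
        # Rank candidate indices from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if ranked[0] == i:
            top1_hits += 1
        if i in ranked[:k]:
            topk_hits += 1
    return top1_hits / n, topk_hits / n


# Toy 3x3 similarity matrix; k=2 stands in for the 20 used in the table.
toy_scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]
acc, rec_at_2 = retrieval_metrics(toy_scores, k=2)
print(acc, rec_at_2)
```

The same routine covers both directions: SMILES-Text retrieval uses structures as queries and descriptions as candidates, and Text-SMILES retrieval swaps the two.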
Experiment results on 4 MoleculeNet themes. Baseline results are cited from ref. [50].
| Model | BBBP | HIV | SIDER | TOX21 | Average |
|---|---|---|---|---|---|
| D-MPNN | 71.2 ± 3.8 | 75.0 ± 2.1 | 63.2 ± 2.3 | 68.9 ± 1.3 | 69.6 |
| RF | 71.4 ± 0.0 | 78.1 ± 0.6 | 68.4 ± 0.9 | 76.9 ± 1.5 | 73.7 |
| DMP | 78.1 ± 0.5 | 81.0 ± 0.7 | 69.2 ± 0.7 | 78.8 ± 0.5 | 76.8 |
| RXNFP | 68.49 ± 0.71 | 73.46 ± 1.03 | 54.07 ± 1.64 | 65.46 ± 0.47 | 65.37 ± 0.63 |
| BERTwo | 68.37 ± 0.48 | 69.39 ± 1.04 | 60.19 ± 1.19 | 68.75 ± 0.64 | 66.67 ± 0.29 |
| Sci-BERT | 74.94 ± 1.30 | 75.38 ± 1.18 | 60.55 ± 2.36 | 71.72 ± 0.60 | 70.65 ± 0.58 |
| SMI-BERT | 71.12 ± 2.24 | 73.88 ± 1.45 | 59.84 ± 0.88 | 69.61 ± 0.44 | 68.61 ± 0.63 |
| KV-PLM | 74.61 ± 0.92 | 74.00 ± 1.16 | 61.51 ± 1.47 | 72.71 ± 0.59 | 70.71 ± 0.32 |
| KV-PLM* | 71.97 ± 0.85 | 71.84 ± 1.36 | 59.78 ± 1.53 | 69.79 ± 0.45 | 68.34 ± 0.52 |
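
The Average column appears to be the unweighted mean of the four per-dataset scores. The short check below assumes that reading and uses three rows copied from the table, with the ± spreads omitted; it is a sanity check on the arithmetic, not a reproduction of the experiments.

```python
from statistics import mean

# Per-dataset scores (BBBP, HIV, SIDER, TOX21) copied from the table.
rows = {
    "RF": [71.4, 78.1, 68.4, 76.9],
    "RXNFP": [68.49, 73.46, 54.07, 65.46],
    "KV-PLM": [74.61, 74.00, 61.51, 72.71],
}
# Values listed in the Average column for the same rows.
reported_average = {"RF": 73.7, "RXNFP": 65.37, "KV-PLM": 70.71}

averages = {model: mean(scores) for model, scores in rows.items()}
for model, avg in averages.items():
    print(f"{model}: computed {avg:.3f}, reported {reported_average[model]}")
```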
Fig. 2 Schematic diagram for KV-PLM* finishing CHEMIchoice task.
For a given unfamiliar molecule entity, we obtain (a) versatile materials including its structure and description, from which we take the correct sentence and randomly pick wrong sentences from the pool to form four choices. (b) The molecule structure and the text of the choices are fed into KV-PLM* to obtain their representations, from which the confidence scores of the choices are calculated by cosine similarity. (c) The tokenizers for structures and biomedical text are different. In this instance, KV-PLM* successfully identifies the correct description sentence for the given substance.
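The scoring step in this caption, cosine similarity between the structure representation and each choice, can be sketched as follows. The three-dimensional vectors are made-up stand-ins for encoder outputs, and `cosine_similarity` and `pick_choice` are hypothetical helper names, not part of any released code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pick_choice(structure_vec, choice_vecs):
    """Score every choice against the structure and return the best index."""
    scores = [cosine_similarity(structure_vec, c) for c in choice_vecs]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores


# Made-up vectors: one molecule structure and four candidate descriptions.
structure_vec = [0.9, 0.1, 0.3]
choice_vecs = [
    [0.1, 0.8, 0.2],
    [0.8, 0.2, 0.4],
    [-0.5, 0.3, 0.1],
    [0.0, 0.1, 0.9],
]
best, scores = pick_choice(structure_vec, choice_vecs)
print(best, [round(s, 3) for s in scores])
```

With four choices, picking at random would be correct a quarter of the time, which gives a natural floor for reading the Score column.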
Fig. 3 Score comparison of CHEMIchoice task.
Our model surpasses human professionals, showing its promising capability of comprehending molecule structure and biomedical text. Error bars indicate standard deviation over six runs.
Fig. 4 Visualizing substring pattern embeddings using t-SNE [62].
Parts of substring pattern fingerprints are randomly chosen and processed for dimensionality reduction. Similar substring patterns are marked in the same colors. The upper one shows fingerprints from pre-trained KV-PLM*, and the lower one is from the model finetuned on PCdes.
Fig. 5 Case study for property prediction.
The molecular structures are first serialized into SMILES strings. With more SMILES sub-groups provided (in purple), the model can predict the properties more precisely.
Case study for drug discovery.
| Property query | Substances retrieval result |
|---|---|
| Anti-inflammatory | Effective: Elocalcitol<sup>a</sup> [ |
| | Unclear: Eribulin mesylate, U46619, Cholesteryl linoleate, Hallactone B, Leukotriene A4, and Npvvhffknivtprtppps |
| Antineoplastic | Effective: Rebeccamycin<sup>a</sup> [ |
| | Unclear: Trimethoprim, Cyclomontanin C, Hexamidin, Fosinopril, and Dabigatran |
| Antioxidant | Effective: Purpurin<sup>a</sup> [ |
| | Unclear: Capensinidin, 2′-Hydroxygenistein, Hydramacrophyllol A, 23566-96-3, and Olivomycin |
| Herbicide | Effective: Guanabenz<sup>a</sup> [ |
| | Unclear: Diflubenzuron, Fenhexamid, 1-Azakenpaullone, Pteroic acid, Bromhexine, and C5H11ClHgN2O2 |
| Dye | Effective: Azocarmine G, Acid roseine, Acid green 3, Basic violet 14, Evans blue, Ponceau S, CHEBI:52122<sup>a</sup> [ |
| | Unclear: Acid Green 50 parent |
| Anti-depressant | Effective: Benzphetamine, Benzylpiperazine<sup>a</sup> [ |
| | Unclear: Dimethylaniline, 2627-86-3, Reduced Pyocyanine, 1672-76-0, 261789-00-8, Dadpm, and Dextroamphetamine |
Effective substances are supported by clear documentation.
<sup>a</sup> Marks newly discovered substances, for which we list references for details.
Experiment results on ChemProt and BC5CDR.
| Model | ChemProt RE | BC5CDR NER |
|---|---|---|
| Sci-BERT | 84.61 ± 0.58 | |
| RoBERTa | 81.10 ± 0.95 | 86.93 ± 0.20 |
| BioBERT (+PubMed) | ||
| KV-PLM | 84.59 ± 0.59 | |
| KV-PLM* | | |
Underlined numbers denote the best scores with no significant difference.