
SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning.

Xinhao Li, Denis Fourches.

Abstract

Simplified molecular input line entry system (SMILES)-based deep learning models are slowly emerging as an important research topic in cheminformatics. In this study, we introduce SMILES pair encoding (SPE), a data-driven tokenization algorithm. SPE first learns a vocabulary of high-frequency SMILES substrings from a large chemical dataset (e.g., ChEMBL) and then tokenizes SMILES based on the learned vocabulary for the actual training of deep learning models. SPE augments the widely used atom-level tokenization by adding human-readable and chemically explainable SMILES substrings as tokens. Case studies show that SPE can achieve superior performance on both molecular generation and quantitative structure-activity relationship (QSAR) prediction tasks. In particular, the SPE-based generative models outperformed the atom-level tokenization model in terms of novelty, diversity, and ability to resemble the training set distribution. The performance of SPE-based QSAR prediction models was evaluated using 24 benchmark datasets, where SPE consistently either matched or outperformed atom-level and k-mer tokenization. Therefore, SPE could be a promising tokenization method for SMILES-based deep learning models. An open-source Python package, SmilesPE, was developed to implement this algorithm and is freely available at https://github.com/XinhaoLi74/SmilesPE.
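
The two stages described in the abstract (learn a vocabulary of frequent substrings, then tokenize with it) can be illustrated with a minimal sketch. It assumes a byte-pair-encoding-style procedure: start from atom-level tokens and repeatedly merge the most frequent adjacent pair. The abstract does not spell out the procedure, so that choice is an assumption, as are the simplified atom-level regular expression, the function names (atom_tokenize, learn_merges, spe_tokenize) and the parameters max_merges and min_frequency. None of this is the SmilesPE package API.

```python
import re
from collections import Counter

# Simplified atom-level pattern (an assumption, not the paper's exact regex):
# bracket atoms, two-letter halogens, single-letter atoms, ring closures, bonds.
ATOM_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%\d{2}|\d|[()=#\-+\\/:.~@*$])"
)


def atom_tokenize(smiles):
    """Split a SMILES string into atom-level tokens."""
    return ATOM_PATTERN.findall(smiles)


def merge_pair(tokens, pair):
    """Replace each non-overlapping adjacent occurrence of `pair` with one token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(corpus, max_merges=10, min_frequency=2):
    """Learn an ordered list of pair merges from a list of SMILES strings."""
    sequences = [atom_tokenize(s) for s in corpus]
    merges = []
    for _ in range(max_merges):
        counts = Counter()
        for seq in sequences:
            counts.update(zip(seq, seq[1:]))  # count adjacent token pairs
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < min_frequency:  # stop once no pair is frequent enough
            break
        merges.append(pair)
        sequences = [merge_pair(seq, pair) for seq in sequences]
    return merges


def spe_tokenize(smiles, merges):
    """Tokenize a SMILES string by applying the learned merges in order."""
    tokens = atom_tokenize(smiles)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


if __name__ == "__main__":
    toy_corpus = ["CCO", "CCN", "CCC"]  # toy data, not ChEMBL
    merges = learn_merges(toy_corpus)
    print(merges)
    print(spe_tokenize("CCO", merges))
```

Because every merge only concatenates neighbouring tokens, joining the output tokens always reproduces the input SMILES, so the learned substrings stay human-readable.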


Year:  2021        PMID: 33715361     DOI: 10.1021/acs.jcim.0c01127

Source DB:  PubMed          Journal:  J Chem Inf Model        ISSN: 1549-9596            Impact factor:   4.956


  8 in total

1.  Discovering design principles of collagen molecular stability using a genetic algorithm, deep learning, and experimental validation.

Authors:  Eesha Khare; Chi-Hua Yu; Constancio Gonzalez Obeso; Mario Milazzo; David L Kaplan; Markus J Buehler
Journal:  Proc Natl Acad Sci U S A       Date:  2022-09-26       Impact factor: 12.779

2.  Predicting protein network topology clusters from chemical structure using deep learning.

Authors:  Akshai P Sreenivasan; Philip J Harrison; Wesley Schaal; Damian J Matuszewski; Kim Kultima; Ola Spjuth
Journal:  J Cheminform       Date:  2022-07-15       Impact factor: 8.489

3.  ColGen: An end-to-end deep learning model to predict thermal stability of de novo collagen sequences.

Authors:  Chi-Hua Yu; Eesha Khare; Om Prakash Narayan; Rachael Parker; David L Kaplan; Markus J Buehler
Journal:  J Mech Behav Biomed Mater       Date:  2021-10-31

4.  InflamNat: web-based database and predictor of anti-inflammatory natural products.

Authors:  Ruihan Zhang; Shoupeng Ren; Qi Dai; Tianze Shen; Xiaoli Li; Jin Li; Weilie Xiao
Journal:  J Cheminform       Date:  2022-06-04       Impact factor: 8.489

5.  Representation of molecules for drug response prediction. (Review)

Authors:  Xin An; Xi Chen; Daiyao Yi; Hongyang Li; Yuanfang Guan
Journal:  Brief Bioinform       Date:  2022-01-17       Impact factor: 13.994

6.  A deep-learning system bridging molecule structure and biomedical text with comprehension comparable to human professionals.

Authors:  Zheni Zeng; Yuan Yao; Zhiyuan Liu; Maosong Sun
Journal:  Nat Commun       Date:  2022-02-14       Impact factor: 14.919

7.  Topology-enhanced molecular graph representation for anti-breast cancer drug selection.

Authors:  Yue Gao; Songling Chen; Junyi Tong; Xiangling Fu
Journal:  BMC Bioinformatics       Date:  2022-09-19       Impact factor: 3.307

8.  Multi-scaled self-attention for drug-target interaction prediction based on multi-granularity representation.

Authors:  Yuni Zeng; Xiangru Chen; Dezhong Peng; Lijun Zhang; Haixiao Huang
Journal:  BMC Bioinformatics       Date:  2022-08-03       Impact factor: 3.307

