Extracting Predictive Representations from Hundreds of Millions of Molecules.

Dong Chen, Jiaxin Zheng, Guo-Wei Wei, Feng Pan.

Abstract

The construction of appropriate representations remains essential for molecular predictions due to intricate molecular complexity. Additionally, it is often expensive and ethically constrained to generate labeled data for supervised learning in molecular sciences, leading to challenging small and diverse data sets. In this work, we develop a self-supervised learning approach to pretrain models from over 700 million unlabeled molecules in multiple databases. The intrinsic chemical logic learned from this approach enables the extraction of predictive representations from task-specific molecular sequences in a fine-tuned process. To understand the importance of self-supervised learning from unlabeled molecules, we assemble three models with different combinations of databases. Moreover, we propose a protocol based on data traits to automatically select the optimal model for a specific task. To validate the proposed method, we consider 10 benchmarks and 38 virtual screening data sets. Extensive validation indicates that the proposed method shows superb performance.

Year: 2021; PMID: 34723543; PMCID: PMC9358546; DOI: 10.1021/acs.jpclett.1c03058

Source DB: PubMed; Journal: J Phys Chem Lett; ISSN: 1948-7185; Impact factor: 6.888

