
Regression Clustering for Improved Accuracy and Training Costs with Molecular-Orbital-Based Machine Learning.

Lixue Cheng, Nikola B Kovachki, Matthew Welborn, Thomas F Miller.

Abstract

Machine learning (ML) in the representation of molecular-orbital-based (MOB) features has been shown to be an accurate and transferable approach to the prediction of post-Hartree-Fock correlation energies. Previous applications of MOB-ML employed Gaussian Process Regression (GPR), which provides good prediction accuracy with small training sets; however, the cost of GPR training scales cubically with the amount of data and becomes a computational bottleneck for large training sets. In the current work, we address this problem by introducing a clustering/regression/classification implementation of MOB-ML. In the first step, regression clustering (RC) is used to partition the training data to best fit an ensemble of linear regression (LR) models; in the second step, each cluster is regressed independently, using either LR or GPR; and in the third step, a random forest classifier (RFC) is trained for the prediction of cluster assignments based on MOB feature values. Upon inspection, RC is found to recapitulate chemically intuitive groupings of the frontier molecular orbitals, and the combined RC/LR/RFC and RC/GPR/RFC implementations of MOB-ML are found to provide good prediction accuracy with greatly reduced wall-clock training times. For a data set of thermalized (350 K) geometries of 7211 organic molecules of up to seven heavy atoms (QM7b-T), both RC/LR/RFC and RC/GPR/RFC reach chemical accuracy (1 kcal/mol prediction error) with only 300 training molecules, while providing 35000-fold and 4500-fold reductions in the wall-clock training time, respectively, compared to MOB-ML without clustering. The resulting models are also demonstrated to retain transferability for the prediction of large-molecule energies with only small-molecule training data. Finally, it is shown that capping the number of training data points per cluster leads to further improvements in prediction accuracy with negligible increases in wall-clock training time.
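The first step described in the abstract, regression clustering, can be illustrated with a small sketch. It alternates between fitting one linear model per cluster and reassigning each point to the model that predicts it best. This is a toy version under stated assumptions, not the authors' implementation: it uses a single scalar feature in place of the multidimensional MOB feature vector, and the names `fit_line`, `regression_clustering`, and `init_labels` are hypothetical. The second and third steps (per-cluster LR or GPR, and the random forest classifier for cluster assignment) are not shown.

```python
import random


def fit_line(points):
    """Least-squares fit of y = a*x + b to (x, y) pairs; returns (a, b)."""
    n = len(points)
    if n == 0:
        return 0.0, 0.0  # empty cluster: arbitrary placeholder model
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0.0:
        return 0.0, mean_y  # all x equal: fall back to a constant model
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def regression_clustering(points, n_clusters, init_labels=None,
                          n_iter=100, seed=0):
    """Toy 1D regression clustering (assumed k-means-style alternation).

    Returns (labels, models), where models[k] is the (slope, intercept)
    of the line fitted to the points assigned to cluster k.
    """
    rng = random.Random(seed)
    if init_labels is None:
        labels = [rng.randrange(n_clusters) for _ in points]
    else:
        labels = list(init_labels)

    def fit_all(current):
        # One independent linear fit per cluster.
        return [
            fit_line([p for p, lab in zip(points, current) if lab == k])
            for k in range(n_clusters)
        ]

    for _ in range(n_iter):
        models = fit_all(labels)
        # Reassign each point to the model with the smallest squared residual.
        new_labels = [
            min(
                range(n_clusters),
                key=lambda k: (y - (models[k][0] * x + models[k][1])) ** 2,
            )
            for x, y in points
        ]
        if new_labels == labels:
            break  # assignments stable: converged
        labels = new_labels
    return labels, fit_all(labels)
```

Calling `regression_clustering(points, 2)` on data drawn from two different lines returns one label per point and one `(slope, intercept)` pair per cluster. As with k-means, the result depends on the initial assignment and may be a local optimum, so several random restarts are usually advisable.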

Year: 2019    PMID: 31638804    DOI: 10.1021/acs.jctc.9b00884

Source DB: PubMed    Journal: J Chem Theory Comput    ISSN: 1549-9618    Impact factor: 6.006


