WooSeok Jeong, Carlo Alberto Gaggioli, Laura Gagliardi.
Abstract
We present the active learning configuration interaction (ALCI) method for multiconfigurational calculations based on large active spaces. ALCI uses an active learning procedure to find the important electronic configurations among the full configurational space generated within an active space. We tested it on the calculation of singlet-singlet excited states of acenes and pyrene using different machine learning algorithms. The ALCI method yields excitation energies within 0.2–0.3 eV of those obtained by traditional complete active-space configuration interaction (CASCI) calculations (affordable for active spaces up to 16 electrons in 16 orbitals) while including only a small fraction of the CASCI configuration space. For larger active spaces (we tested up to 26 electrons in 26 orbitals), which are not affordable with traditional CI methods, ALCI captures the trends of experimental excitation energies. Overall, ALCI provides satisfactory approximations to large active-space wave functions with up to 10 orders of magnitude fewer determinants for the systems presented here. These ALCI wave functions are promising and affordable starting points for subsequent second-order perturbation theory or pair-density functional theory calculations.
Year: 2021 PMID: 34787422 PMCID: PMC8675132 DOI: 10.1021/acs.jctc.1c00769
Source DB: PubMed Journal: J Chem Theory Comput ISSN: 1549-9618 Impact factor: 6.006
Figure 1. Molecular structure of polyacenes and pyrene. n is the number of fused benzene rings. (x, y) is the active-space size, where x is the number of active electrons (here, the number of π electrons) and y is the number of active orbitals (i.e., the number of π bonding and π* antibonding orbitals).
Figure 2. Active learning scheme for finding important configurations in iterative selected CI calculations.
Figure 3. Workflow of the active learning configuration interaction (ALCI) protocol.
Figure 4. ALCI protocol convergence in terms of excitation energy depending on the maximum level of excitations for naphthalene, anthracene, and tetracene: single excitations only vs single/double excitations. Three independent protocol calculations (as indicated by different marker types) were performed for each maximum level of excitations. Iteration zero corresponds to the RASCI (n = 2) calculation.
Figure 5. ALCI protocol convergence in terms of excitation energy depending on the use of class probability for query priority sampling for naphthalene, anthracene, and tetracene. Three independent protocol calculations (as indicated by different marker types) were performed for each case. Iteration zero corresponds to the RASCI (n = 2) calculation.
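The captions above outline an iterative scheme: label configurations by a CI-coefficient threshold, let an ML classifier propose further candidates generated by single/double excitations, run a selected CI on the queries, and repeat. The following is only a toy sketch of that kind of loop, not the authors' implementation. The active-space size, the `toy_coefficient` oracle (a stand-in for a real selected-CI calculation), the nearest-neighbour classifier, and the batch size are all hypothetical, and the class-probability priority sampling of Figure 5 is not modelled.

```python
import itertools

N_ORB, N_OCC = 8, 4   # hypothetical toy active space: 4 occupied of 8 orbitals
THRESHOLD = 0.01      # CI-coefficient threshold for an "important" configuration


def toy_coefficient(config):
    """Hypothetical stand-in for a selected-CI calculation: the weight decays
    with how far the occupied orbital indices lie above the reference."""
    excitation = sum(config) - sum(range(N_OCC))
    return 0.6 ** excitation


def excitations(config, max_level=2):
    """Configurations reachable by replacing up to max_level occupied orbitals."""
    occupied = set(config)
    virtual = [o for o in range(N_ORB) if o not in occupied]
    found = set()
    for level in range(1, max_level + 1):
        for removed in itertools.combinations(config, level):
            for added in itertools.combinations(virtual, level):
                found.add(tuple(sorted((occupied - set(removed)) | set(added))))
    return found


def predict_important(config, labelled):
    """Toy classifier: copy the label of the labelled configuration that shares
    the most occupied orbitals with the candidate (1-nearest neighbour)."""
    nearest = max(labelled, key=lambda known: len(set(known) & set(config)))
    return labelled[nearest]


def alci_toy(max_iter=20, batch=30):
    reference = tuple(range(N_OCC))
    # Iteration zero: label the reference and its single/double excitations.
    labelled = {c: toy_coefficient(c) >= THRESHOLD
                for c in sorted({reference} | excitations(reference))}
    for _ in range(max_iter):
        # Candidate pool: unlabelled excitations of the current important set.
        pool = set()
        for config, important in labelled.items():
            if important:
                pool |= excitations(config)
        pool.difference_update(labelled)
        # Query only the candidates that the classifier predicts to be important.
        queries = [c for c in sorted(pool) if predict_important(c, labelled)]
        if not queries:
            break  # nothing new predicted important: stop iterating
        for config in queries[:batch]:
            # The "oracle" labels the queried configurations; the classifier
            # sees them automatically on the next pass.
            labelled[config] = toy_coefficient(config) >= THRESHOLD
    return [c for c, important in labelled.items() if important]


important_configs = alci_toy()
print(len(important_configs), "important configurations found")
```

The point of the sketch is the division of labour: the expensive oracle is called only on configurations the classifier nominates, so most of the configuration space is never evaluated explicitly.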
ALCI Protocol Results for Tetracene with Different ML Algorithms
| ML algorithm | average number of iterations | number of important configurations | excitation energy (eV) | wall time, ML training (hh:mm:ss) | wall time, ML predictions (hh:mm:ss) | wall time, SCI calc. (hh:mm:ss) | wall time, total (hh:mm:ss) |
|---|---|---|---|---|---|---|---|
| ANN | 14.8 | 1625 | 3.88 | 07:03:12 (83.21%) | 00:00:11 (0.04%) | 01:24:45 (16.66%) | 08:28:34 |
| GP | 15.4 | 1642 | 3.96 | 03:21:24 (61.00%) | 00:04:48 (1.46%) | 02:03:28 (37.39%) | 05:30:10 |
| XGBoost | 15.6 | 1753 | 3.87 | 00:07:24 (4.08%) | 00:00:06 (0.05%) | 02:53:15 (95.61%) | 03:01:12 |
| KRC | 15.6 | 1749 | 3.90 | 00:25:03 (9.24%) | 00:00:34 (0.21%) | 04:05:05 (90.38%) | 04:31:11 |
| RF | 21.2 | 1781 | 3.88 | 00:17:48 (6.33%) | 00:00:06 (0.03%) | 04:22:34 (93.39%) | 04:41:09 |
| KNN | 25.7 | 1724 | 3.91 | 00:07:20 (1.93%) | 00:01:22 (0.36%) | 06:09:34 (97.51%) | 06:19:01 |
Results are average values of 10 independent calculations for each model that are performed to obtain better statistics.
Wall timings measure average elapsed time for both the iteration and termination steps of the ALCI protocol, not including the initialization step (i.e., DFT optimization, HF, and RASCI (n = 2) calculations). To compare the computational cost, the number of CPU cores for the calculations was limited to 5 cores (Intel i9-10980XE 3.00 GHz) if ML model training/predictions can be parallelized (i.e., for KNN, RF, and XGBoost). For ANN, a GPU (NVIDIA Quadro RTX 8000) was used. GP and KRC models were trained and used with one CPU core (Intel i9-10980XE 3.00 GHz) due to the limitation of the software. SCI calculations were performed on a single CPU core due to the limitation of the GENCI program in the GAMESS package.
Wall timing for the ML training step includes the featurization of raw data (i.e., configurations), 10-fold cross-validation for hyperparameter tuning, and retraining of an ML model with the tuned hyperparameters using all of the training data.
Total wall time is slightly larger (25–40 s) than the sum of the ML training, ML prediction, and SCI calculation times because of auxiliary processes such as transferring, saving, and loading data.
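The percentages and the overhead quoted in the notes can be recomputed from the hh:mm:ss entries. The short sketch below does this with the wall times copied from the table; the helper names are illustrative only. Because the printed times are averages rounded to whole seconds, the recomputed shares and overheads agree with the printed ones only to within rounding.

```python
def to_seconds(hms):
    """Convert an 'hh:mm:ss' string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return 3600 * hours + 60 * minutes + seconds


# (ML training, ML predictions, SCI calc., total) copied from the table above.
WALL_TIMES = {
    "ANN":     ("07:03:12", "00:00:11", "01:24:45", "08:28:34"),
    "GP":      ("03:21:24", "00:04:48", "02:03:28", "05:30:10"),
    "XGBoost": ("00:07:24", "00:00:06", "02:53:15", "03:01:12"),
    "KRC":     ("00:25:03", "00:00:34", "04:05:05", "04:31:11"),
    "RF":      ("00:17:48", "00:00:06", "04:22:34", "04:41:09"),
    "KNN":     ("00:07:20", "00:01:22", "06:09:34", "06:19:01"),
}


def breakdown(algorithm):
    """Return the percentage share of each step and the unaccounted overhead (s)."""
    train, predict, sci, total = (to_seconds(t) for t in WALL_TIMES[algorithm])
    shares = tuple(round(100 * part / total, 2) for part in (train, predict, sci))
    overhead = total - (train + predict + sci)
    return shares, overhead


for name in WALL_TIMES:
    shares, overhead = breakdown(name)
    print(f"{name:8s} training/prediction/SCI = {shares} %, overhead = {overhead} s")
```

Read this way, the table shows where each model spends its time: training dominates for ANN and GP, whereas for XGBoost, KRC, RF, and KNN more than 90% of the wall time goes to the SCI calculations.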
ALCI Protocol Results with the Optimized Input Parameters for Naphthalene, Anthracene, and Pyrene
| system | ML algorithm | threshold for CI coeff. | average number of iterations | number of important configurations | number of important Slater determinants (SDs) | excitation energy (eV) |
|---|---|---|---|---|---|---|
| naphthalene (10e, 10o) | KRC | 0.01 | 9.0 | 369 | 4104 | 4.48 |
| naphthalene (10e, 10o) | KRC | 0.005 | 8.0 | 722 | 8379 | 4.45 |
| naphthalene (10e, 10o) | ANN | 0.01 | 13.3 | 356 | 3828 | 4.50 |
| naphthalene (10e, 10o) | ANN | 0.005 | 10.3 | 698 | 7942 | 4.46 |
| naphthalene (10e, 10o) | XGBoost | 0.01 | 10.7 | 362 | 4072 | 4.48 |
| naphthalene (10e, 10o) | XGBoost | 0.005 | 9.7 | 662 | 7701 | 4.47 |
| naphthalene (10e, 10o) | CASCI (10e, 10o) | | | 8953 | 63 504 | 4.46 |
| anthracene (14e, 14o) | KRC | 0.01 | 11.3 | 1062 | 37 971 | 4.07 |
| anthracene (14e, 14o) | KRC | 0.005 | 11.7 | 2474 | 100 328 | 3.97 |
| anthracene (14e, 14o) | ANN | 0.01 | 11.7 | 923 | 23 462 | 4.10 |
| anthracene (14e, 14o) | ANN | 0.005 | 10.3 | 2353 | 78 577 | 3.98 |
| anthracene (14e, 14o) | XGBoost | 0.01 | 9.7 | 1041 | 37 278 | 4.07 |
| anthracene (14e, 14o) | XGBoost | 0.005 | 12.3 | 2328 | 97 536 | 4.01 |
| anthracene (14e, 14o) | CASCI (14e, 14o) | | | 616 227 | 11 778 624 | 3.89 |
| pyrene (16e, 16o) | KRC | 0.01 | 13.7 | 1444 | 41 961 | 4.13 |
| pyrene (16e, 16o) | KRC | 0.005 | 16.7 | 3660 | 225 039 | 3.98 |
| pyrene (16e, 16o) | ANN | 0.01 | 14.7 | 1243 | 45 457 | 4.15 |
| pyrene (16e, 16o) | ANN | 0.005 | 15.7 | 3424 | 151 508 | 3.99 |
| pyrene (16e, 16o) | XGBoost | 0.01 | 12.7 | 1407 | 40 522 | 4.14 |
| pyrene (16e, 16o) | XGBoost | 0.005 | 20.0 | 3505 | 155 884 | 4.00 |
| pyrene (16e, 16o) | CASCI (16e, 16o) | | | 5 196 627 | 165 636 900 | 3.79 |
Results are average values of three separate calculations for each model that are performed to obtain better statistics.
For the CASCI rows, the configuration and SD columns give the total number of configurations or determinants in the active space.
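The CASCI determinant totals in the table can be reproduced with a simple combinatorial count. The sketch below assumes that they are M_S = 0 Slater determinants counted without any spatial-symmetry restriction, i.e. independent choices of the α and β occupations. That assumption is not stated in the source, but it reproduces the tabulated totals. The function name is illustrative.

```python
from math import comb


def cas_determinants(n_electrons, n_orbitals):
    """Number of M_S = 0 Slater determinants in a CAS(n_electrons, n_orbitals).

    Assumes an even electron count split equally between alpha and beta spin,
    and no spatial-symmetry restriction.
    """
    n_alpha = n_electrons // 2
    # Alpha and beta occupation strings are chosen independently.
    return comb(n_orbitals, n_alpha) ** 2


for label, size in [("naphthalene", 10), ("anthracene", 14), ("pyrene", 16)]:
    print(f"{label:12s} CAS({size}e, {size}o): {cas_determinants(size, size):,} determinants")
```

The count grows roughly as the square of a central binomial coefficient, which is why CASCI becomes unaffordable beyond about 16 electrons in 16 orbitals.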
Figure 6. ALCI protocol convergence in terms of excitation energy for naphthalene, anthracene, and pyrene. Three independent calculations (as indicated by different marker types) were performed for each model. The CI coefficient threshold for important configurations is 0.01. Iteration zero corresponds to the RASCI (n = 2) calculation.
ALCI Protocol Results with the Optimized Input Parameters for Tetracene, Pentacene, and Hexacene
| system | ML algorithm | threshold for CI coeff. | average number of iterations | number of important configurations | number of important Slater determinants (SDs) | excitation energy (eV) |
|---|---|---|---|---|---|---|
| tetracene (18e, 18o) | KRC | 0.01 | 15.0 | 1759 | 53 213 | 3.86 |
| tetracene (18e, 18o) | KRC | 0.005 | 22.7 | 4788 | 251 320 | 3.74 |
| tetracene (18e, 18o) | ANN | 0.01 | 10.3 | 1491 | 32 663 | 3.91 |
| tetracene (18e, 18o) | ANN | 0.005 | 18.0 | 3944 | 179 952 | 3.73 |
| tetracene (18e, 18o) | XGBoost | 0.01 | 15.0 | 1741 | 53 125 | 3.88 |
| tetracene (18e, 18o) | XGBoost | 0.005 | 24.3 | 4642 | 234 622 | 3.75 |
| tetracene (18e, 18o) | CASCI (18e, 18o) | | | 44 152 809 | 2 363 904 400 | N/A |
| pentacene (22e, 22o) | KRC | 0.01 | 12.7 | 1793 | 31 491 | 3.50 |
| pentacene (22e, 22o) | KRC | 0.005 | 25.0 | 4780 | 216 678 | 3.46 |
| pentacene (22e, 22o) | ANN | 0.01 | 10.7 | 1713 | 22 622 | 3.44 |
| pentacene (22e, 22o) | ANN | 0.005 | 16.0 | 4195 | 175 336 | 3.48 |
| pentacene (22e, 22o) | XGBoost | 0.01 | 17.0 | 1979 | 47 645 | 3.47 |
| pentacene (22e, 22o) | XGBoost | 0.005 | 25.0 | 4941 | 231 838 | 3.46 |
| pentacene (22e, 22o) | CASCI (22e, 22o) | | | 3 241 135 527 | 497 634 306 624 | N/A |
| hexacene (26e, 26o) | KRC | 0.01 | 17.3 | 2430 | 27 468 | 2.89 |
| hexacene (26e, 26o) | KRC | 0.005 | 18.7 | 4061 | 58 237 | 3.02 |
| hexacene (26e, 26o) | ANN | 0.01 | 12.3 | 2366 | 24 760 | 2.89 |
| hexacene (26e, 26o) | ANN | 0.005 | 15.0 | 4574 | 101 015 | 3.03 |
| hexacene (26e, 26o) | XGBoost | 0.01 | 24.0 | 2670 | 51 247 | 2.93 |
| hexacene (26e, 26o) | XGBoost | 0.005 | 36.3 | 5952 | 245 534 | 3.03 |
| hexacene (26e, 26o) | CASCI (26e, 26o) | | | 241 813 226 151 | 108 172 480 360 000 | N/A |
Results are average values of three separate calculations for each model that are performed to obtain better statistics.
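For the CASCI rows, the configuration and SD columns give the total number of configurations or determinants in the active space; no CASCI excitation energy is available (N/A) for these active spaces.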
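To put the determinant counts in the table in perspective, the reduction achieved by ALCI can be expressed in orders of magnitude. The sketch below uses the total SD counts from the CASCI rows and, as one illustrative choice, the important SD counts for KRC with a threshold of 0.01; the dictionary and function names are hypothetical.

```python
from math import log10

# Total Slater determinants in the full active space (CASCI rows of the table).
TOTAL_SDS = {
    "tetracene": 2_363_904_400,
    "pentacene": 497_634_306_624,
    "hexacene": 108_172_480_360_000,
}

# Important Slater determinants kept by ALCI (KRC rows, threshold 0.01).
IMPORTANT_SDS = {
    "tetracene": 53_213,
    "pentacene": 31_491,
    "hexacene": 27_468,
}


def orders_of_magnitude_saved(system):
    """log10 of the ratio between the full and the ALCI determinant counts."""
    return log10(TOTAL_SDS[system] / IMPORTANT_SDS[system])


for system in TOTAL_SDS:
    print(f"{system:10s} reduction: about 10^{orders_of_magnitude_saved(system):.1f}")
```

For hexacene the ratio comes out between nine and ten orders of magnitude, roughly in line with the reduction of "up to 10 orders of magnitude" stated in the abstract.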
Figure 7. ALCI protocol convergence in terms of excitation energy for tetracene, pentacene, and hexacene. Three separate calculations (as indicated by different marker types) were performed for each model. The CI coefficient threshold for important configurations is 0.01. Iteration zero corresponds to the RASCI (n = 2) calculation.
Figure 8. Excitation energy for different acene lengths computed with ALCI (using KRC, XGBoost, and ANN as ML algorithms) and CASCI. The experimental results from ref (92) are also reported.
First Singlet Vertical Excited States for Acenes Determined by Different Methods
| method | naphthalene | anthracene | tetracene | pentacene | hexacene |
|---|---|---|---|---|---|
| exp. data | Lb | La | La | La | La |
| CASCI | Lb | Lb | N/A | N/A | N/A |
| CASCI + PT2 | Lb | La | N/A | N/A | N/A |
| ALCI | Lb | Lb | Lb | La | La |