Shidang Xu1, Xiaoli Liu1, Pengfei Cai1, Jiali Li1, Xiaonan Wang1, Bin Liu1,2.
Abstract
In practical applications, molecules often exist in an aggregated state, so it is valuable to predict how a molecule will behave on aggregation, for example whether it shows aggregation-induced emission (AIE) or aggregation-caused quenching (ACQ). Herein, a database of AIE/ACQ molecules reported in the literature is first established. Machine learning (ML) models trained on this database can learn the structure-property relationship and thus predict AIE/ACQ properties quickly. To this end, a multi-modal approach is proposed, multiple prediction methods are designed and compared, and an ensemble strategy is developed. First, multiple molecular descriptors are considered at the same time, major features are extracted by dimensionality reduction, and multi-modal features are synthesized. Then, several state-of-the-art methods are designed and compared to analyze their respective advantages. Finally, the ensemble strategy combines the advantages of these methods to obtain the final prediction. The reliability of this approach in an unknown molecular space is further verified with three newly designed molecules, for which model predictions and experimental outcomes are reasonably consistent. The results indicate that ML can be a powerful tool for predicting molecular properties in the aggregated state, thus accelerating the development of solid-state optical materials.
Keywords: aggregation-induced emission; machine learning; molecular design; optical properties; solid-state materials
Year: 2021 PMID: 34821473 PMCID: PMC8760175 DOI: 10.1002/advs.202101074
Source DB: PubMed Journal: Adv Sci (Weinh) ISSN: 2198-3844 Impact factor: 16.806
Figure 1. Flowchart of the proposed machine learning (ML)‐assisted prediction of AIE/ACQ properties and experimental validation of newly designed molecules.
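
The multi-modal step described in the abstract (several descriptors, dimensionality reduction, then fused features) can be illustrated with a small sketch. The source does not name the dimensionality-reduction method, so principal component analysis is an assumption here, implemented with plain power iteration to stay dependency-free. The function names and all input values are hypothetical toy data, not the authors' code or dataset.

```python
import math
import random

def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def matvec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def principal_directions(rows, k, iters=500, seed=0):
    """Top-k principal directions of centered data (power iteration + deflation)."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    cov = [[sum(r[a] * r[b] for r in rows) / (n - 1) for b in range(d)]
           for a in range(d)]
    directions = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = matvec(cov, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left to explain
                break
            v = [x / norm for x in w]
        eigval = sum(a * b for a, b in zip(v, matvec(cov, v)))
        directions.append(v)
        # Deflate: remove the direction just found from the covariance.
        cov = [[cov[a][b] - eigval * v[a] * v[b] for b in range(d)]
               for a in range(d)]
    return directions

def fuse(fingerprints, quantitative, k):
    """Reduce fingerprints to k PCA scores, then append quantitative descriptors."""
    centered = center(fingerprints)
    dirs = principal_directions(centered, k)
    scores = [[sum(a * b for a, b in zip(row, direction)) for direction in dirs]
              for row in centered]
    return [s + list(q) for s, q in zip(scores, quantitative)], dirs

# Toy, made-up inputs: 6 molecules, 8 fingerprint bits, 2 quantitative descriptors.
toy_fingerprints = [
    [1, 0, 1, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 0, 1, 0],
    [0, 1, 0, 1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 1, 0, 1],
    [1, 1, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1, 1, 0, 1],
]
toy_quantitative = [
    [0.31, 2.0], [0.28, 1.0], [0.75, 4.0],
    [0.80, 5.0], [0.35, 2.0], [0.70, 4.0],
]

fused, dirs = fuse(toy_fingerprints, toy_quantitative, k=2)
print(len(fused), len(fused[0]))  # 6 molecules, 2 PCA scores + 2 descriptors each
```

In practice the quantitative descriptors would usually be standardized before concatenation so that no single modality dominates by scale; that step is omitted here for brevity.
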
Prediction performance results of qualitative and quantitative descriptors of baseline and ensemble methods based on single‐modal strategy
| Methods | Train accuracy | Test accuracy | AUC | F1‐score |
|---|---|---|---|---|
| Logistic regression | ||||
| Morgan | 0.9791 ± 0.0264 | 0.9160 ± 0.0415 | 0.9100 ± 0.0449 | 0.9158 ± 0.0416 |
| Daylight | | | | |
| Atom‐pair | 0.9778 ± 0.0299 | 0.9103 ± 0.0514 | 0.9090 ± 0.0550 | 0.9107 ± 0.0512 |
| Topological | 0.9332 ± 0.0053 | 0.9244 ± 0.0374 | 0.9181 ± 0.0438 | 0.9242 ± 0.0376 |
| Quantitative descriptors | 0.6236 ± 0.0038 | 0.6237 ± 0.0340 | 0.5000 ± 0.0000 | 0.4797 ± 0.0419 |
| K‐nearest neighbor | ||||
| Morgan | 0.9257 ± 0.0190 | 0.8879 ± 0.0430 | 0.8729 ± 0.0474 | 0.8866 ± 0.0442 |
| Daylight | | 0.8963 ± 0.0545 | 0.8827 ± 0.0639 | 0.8952 ± 0.0559 |
| Atom‐pair | 0.9151 ± 0.0071 | 0.9048 ± 0.0469 | | 0.9056 ± 0.0465 |
| Topological | 0.9154 ± 0.0051 | | 0.9058 ± 0.0502 | |
| Quantitative descriptors | 0.9017 ± 0.0133 | 0.8740 ± 0.0684 | 0.8716 ± 0.0726 | 0.8748 ± 0.0675 |
| Gradient boost | ||||
| Morgan | 0.8914 ± 0.0156 | 0.8851 ± 0.0577 | 0.8818 ± 0.0571 | 0.8852 ± 0.0575 |
| Daylight | | 0.9017 ± 0.0461 | 0.8911 ± 0.0538 | 0.9011 ± 0.0468 |
| Atom‐pair | 0.9194 ± 0.1000 | 0.8713 ± 0.0905 | 0.8602 ± 0.1298 | 0.8579 ± 0.1287 |
| Topological | 0.8814 ± 0.0507 | 0.8795 ± 0.0711 | 0.8746 ± 0.0709 | 0.8792 ± 0.0711 |
| Quantitative descriptors | 0.9813 ± 0.0329 | | | |
| Random forest | ||||
| Morgan | 0.9878 ± 0.0047 | 0.9074 ± 0.0281 | 0.9017 ± 0.0323 | 0.9071 ± 0.0284 |
| Daylight | 0.9919 ± 0.0040 | 0.9213 ± 0.0433 | 0.9080 ± 0.0528 | 0.9204 ± 0.0444 |
| Atom‐pair | 0.9913 ± 0.0125 | 0.9271 ± 0.0402 | 0.9315 ± 0.0412 | 0.9276 ± 0.0400 |
| Topological | 0.9953 ± 0.0067 | | 0.9283 ± 0.0464 | |
| Quantitative descriptors | | 0.9326 ± 0.0475 | | 0.9325 ± 0.0482 |
| MLPClassifier | ||||
| Morgan | 0.9813 ± 0.0226 | 0.9159 ± 0.0414 | 0.9088 ± 0.0443 | 0.9156 ± 0.0418 |
| Daylight | | | | |
| Atom‐pair | 0.9784 ± 0.0296 | 0.8910 ± 0.0808 | 0.8838 ± 0.0936 | 0.8884 ± 0.0869 |
| Topological | 0.9420 ± 0.0046 | 0.9217 ± 0.0347 | 0.9170 ± 0.0383 | 0.9216 ± 0.0347 |
| Quantitative descriptors | 0.8394 ± 0.1154 | 0.8396 ± 0.0923 | 0.8228 ± 0.1482 | 0.8199 ± 0.1320 |
| Ensemble | — | 0.9274 ± 0.0416 | 0.9226 ± 0.0444 | 0.9273 ± 0.0416 |
Superscript symbol * indicates optimal results for all methods.
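
One row in the table above is worth a worked check: logistic regression on quantitative descriptors reports an AUC of exactly 0.5000 ± 0.0000 with train and test accuracy both near 0.62. That pattern is what a model produces when it assigns the same class to every molecule. The sketch below reproduces it under stated assumptions: the 624:376 class split is hypothetical (chosen only to match the reported accuracy), and the F1-score is assumed to be support-weighted, which the source does not state.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_class(y_true, y_pred, cls):
    """F1 of one class, treated as the positive class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to class frequency."""
    n = len(y_true)
    return sum(y_true.count(c) / n * f1_for_class(y_true, y_pred, c)
               for c in set(y_true))

def auc(y_true, scores):
    """Probability that a positive outranks a negative; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical class balance: 624 majority-class and 376 minority-class molecules.
y_true = [1] * 624 + [0] * 376
# A degenerate model that predicts the majority class for every molecule.
y_pred = [1] * len(y_true)

acc = accuracy(y_true, y_pred)
auc_value = auc(y_true, y_pred)
wf1 = weighted_f1(y_true, y_pred)
print(round(acc, 4), round(auc_value, 4), round(wf1, 4))  # 0.624 0.5 0.4795
```

Under these assumptions the three numbers land close to the reported 0.6237, 0.5000, and 0.4797, which is consistent with a majority-class predictor but does not prove that this is what happened in the original experiments.
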
Prediction performance results of baseline and ensemble methods based on multi‐modal strategy
| Methods | Train accuracy | Test accuracy | AUC | F1‐score |
|---|---|---|---|---|
| Logistic regression | ||||
| Morgan + Quantitative | 0.9863 ± 0.0071 | 0.9215 ± 0.0323 | 0.9130 ± 0.0354 | 0.9211 ± 0.0323 |
| Daylight + Quantitative | | | | |
| Atom‐pair + Quantitative | 0.9791 ± 0.0289 | 0.9217 ± 0.0427 | 0.9219 ± 0.0441 | 0.9221 ± 0.0426 |
| Topological + Quantitative | 0.9828 ± 0.0117 | 0.9217 ± 0.0427 | 0.9163 ± 0.0476 | 0.9215 ± 0.0427 |
| K‐nearest neighbor | ||||
| Morgan + Quantitative | 0.9185 ± 0.0200 | 0.8795 ± 0.0545 | 0.8660 ± 0.0661 | 0.8775 ± 0.0573 |
| Daylight + Quantitative | | 0.8906 ± 0.0441 | 0.8755 ± 0.0535 | 0.8893 ± 0.0455 |
| Atom‐pair + Quantitative | 0.9235 ± 0.0131 | | | |
| Topological + Quantitative | 0.9089 ± 0.0069 | 0.8936 ± 0.0557 | 0.8924 ± 0.0600 | 0.8941 ± 0.0553 |
| Gradient boost | ||||
| Morgan + Quantitative | | | | |
| Daylight + Quantitative | | 0.9157 ± 0.0360 | 0.9147 ± 0.0453 | 0.916 ± 0.0369 |
| Atom‐pair + Quantitative | | 0.9271 ± 0.0359 | 0.9261 ± 0.0392 | 0.9271 ± 0.0360 |
| Topological + Quantitative | 0.9483 ± 0.1551 | 0.8873 ± 0.1108 | 0.8931 ± 0.1011 | 0.8875 ± 0.1109 |
| Random forest | ||||
| Morgan + Quantitative | 0.9981 ± 0.0032 | 0.9410 ± 0.0411 | 0.9418 ± 0.0465 | 0.9411 ± 0.0412 |
| Daylight + Quantitative | 0.9981 ± 0.0032 | | | |
| Atom‐pair + Quantitative | | 0.9187 ± 0.0462 | 0.9214 ± 0.0509 | 0.9191 ± 0.0462 |
| Topological + Quantitative | 0.9975 ± 0.0027 | 0.9271 ± 0.0422 | 0.9309 ± 0.0467 | 0.9275 ± 0.0422 |
| MLPClassifier | ||||
| Morgan + Quantitative | | 0.9216 ± 0.0388 | 0.9154 ± 0.0397 | 0.9214 ± 0.0387 |
| Daylight + Quantitative | 0.9966 ± 0.0083 | | | |
| Atom‐pair + Quantitative | 0.9984 ± 0.0038 | 0.9188 ± 0.0420 | 0.9217 ± 0.0452 | 0.9192 ± 0.0420 |
| Topological + Quantitative | 0.9959 ± 0.0063 | 0.9190 ± 0.0502 | 0.9174 ± 0.0477 | 0.9192 ± 0.0494 |
| Ensemble | — | 0.9383 ± 0.0376 | 0.9391 ± 0.0445 | 0.9384 ± 0.0379 |
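
The Ensemble rows combine the individual classifiers into one prediction. The source does not say how the combination is done, so the sketch below uses hard majority voting purely as an assumption; the per-model predictions are made-up toy labels, and `majority_vote` is a hypothetical helper name.

```python
from collections import Counter

def majority_vote(per_model_predictions):
    """Combine aligned label lists (one per model) by taking the most common label."""
    combined = []
    for votes in zip(*per_model_predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Toy predictions for four molecules from the five model families in the table.
toy_votes = {
    "logistic_regression": ["AIE", "ACQ", "AIE", "ACQ"],
    "k_nearest_neighbor":  ["AIE", "ACQ", "ACQ", "ACQ"],
    "gradient_boost":      ["AIE", "AIE", "AIE", "ACQ"],
    "random_forest":       ["ACQ", "ACQ", "AIE", "ACQ"],
    "mlp_classifier":      ["AIE", "ACQ", "AIE", "AIE"],
}

ensemble = majority_vote(list(toy_votes.values()))
print(ensemble)  # ['AIE', 'ACQ', 'AIE', 'ACQ']
```

With an odd number of models and two classes, a tie cannot occur, which is one reason hard voting is a common default; weighted or probability-averaged voting would be equally plausible readings of the abstract.
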
Figure 2. a) Schematic illustration of multi‐modal descriptors. b) Average results of different methods. c) Confusion matrix of the ensemble method based on multi‐modal descriptors.
Figure 3. a) Structures of compounds 1–3. b–d) Plots of the maximum photoluminescence (PL) intensity of compounds 1–3 against water fraction (vol%). The insets show photographs of the compounds in 0 and 99 vol% water under UV light (365 nm) illumination. e) Results of the different state‐of‐the‐art methods compared with the ensemble strategy.