Zong-Rong Ye, I-Shou Huang, Yu-Te Chan, Zhong-Ji Li, Chen-Cheng Liao, Hao-Rong Tsai, Meng-Chi Hsieh, Chun-Chih Chang, Ming-Kang Tsai.
Abstract
Organic fluorescent molecules play critical roles in fluorescence inspection, biological probes, and labeling indicators. More than ten thousand organic fluorescent molecules were imported in this study, and a machine-learning approach was used to extract the intrinsic structural characteristics that correlate with fluorescence emission. A systematic informatics procedure was introduced, proceeding from descriptor cleaning through descriptor-space reduction to statistically meaningful regression, to build a broad and valid model for estimating the fluorescence emission wavelength. Least absolute shrinkage and selection operator (Lasso) regression coupled with a random forest model was finally reported as the numerical predictor, and it satisfies the statistical criteria. This informatics model showed predictive ability comparable to, and complementary with, the conventional time-dependent density functional theory method for emission wavelength prediction, at a fraction of the computational expense. This journal is © The Royal Society of Chemistry.
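The descriptor-selection idea behind the Lasso step can be illustrated with a minimal sketch. It assumes a plain cyclic coordinate-descent solver on synthetic data; the function names, the penalty value and the data are hypothetical and are not taken from the study, and the random-forest stage is not shown.

```python
import random

def soft_threshold(rho, lam):
    """Soft-thresholding operator used in the Lasso coordinate update."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent.

    X is a list of rows (assumed roughly centred and scaled), y a list of targets.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # residual = y - Xw, and w starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual, with w[j]'s own
            # contribution added back in.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, lam) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w

# Synthetic example (not the paper's data): 5 descriptors, only 0 and 3 matter.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(200)]
y = [3.0 * row[0] - 2.0 * row[3] + rng.gauss(0, 0.1) for row in X]

w = lasso_coordinate_descent(X, y, lam=0.5)
selected = [j for j, wj in enumerate(w) if wj != 0.0]
print("coefficients:", [round(wj, 3) for wj in w])
print("selected descriptor indices:", selected)
```

Descriptors whose coefficients are driven exactly to zero drop out of the ensemble; the surviving ones could then be passed to a second regressor such as a random forest.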
Year: 2020 PMID: 35517310 PMCID: PMC9054811 DOI: 10.1039/d0ra05014h
Source DB: PubMed Journal: RSC Adv ISSN: 2046-2069 Impact factor: 4.036
Fig. 1 (a) Emission wavelength and (b) molecular weight distribution of the 11 460 fluorescent molecules.
Fig. 2 The 2D histograms of the prediction vs. experiment comparison using the VTS series MLR models.
The statistical properties of the VTS series MLR models
| Models | # of descriptors | Adj R² | | |
|---|---|---|---|---|
| VTS-MLR | 6208 | 0.92 | 0.66 | −5143.72 |
| VTSsel-MLR | 4300 | 0.86 | 0.72 | −49790.45 |
| VTSfp-MLR | 5158 | 0.89 | 0.62 | −5.64 × 10¹⁵ |
Adj R² denotes the adjusted R² with respect to the size of the descriptor ensemble.
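The record does not state the formula used for the adjustment. As a hedged illustration, the sketch below uses the standard textbook definition, which penalises R² by the number of descriptors p relative to the number of samples n. The function name and the example numbers are hypothetical and are not values from the table.

```python
def adjusted_r2(r2, n_samples, n_descriptors):
    """Standard adjusted R²: 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_descriptors - 1
    if dof <= 0:
        raise ValueError("need more samples than descriptors + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof

# Hypothetical numbers: the same raw R² is penalised more as descriptors are added.
few = adjusted_r2(0.95, n_samples=1000, n_descriptors=10)
many = adjusted_r2(0.95, n_samples=1000, n_descriptors=100)
print(round(few, 4), round(many, 4))
```

Under this definition a model with thousands of descriptors needs a correspondingly large sample for the adjusted value to stay close to the raw R².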
Fig. 3 The calculated inertia (a, c and e) and Silhouette (b, d and f) scores with respect to the number of groups k in the K-means clustering analysis using the PCA-transformed descriptors of the VTS series ensembles.
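The two scores named in this caption can be sketched for a given cluster assignment as follows. This is a minimal illustration that assumes Euclidean distances and at least two clusters; it scores labels that are already available and does not run K-means itself. The function names and the toy points are hypothetical.

```python
import math

def inertia(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        total += sum(sum((m[d] - centroid[d]) ** 2 for d in range(dim)) for m in members)
    return total

def mean_silhouette(points, labels):
    """Average of (b - a) / max(a, b) over all points (needs >= 2 clusters)."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        dists = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                dists.setdefault(other, []).append(math.dist(p, q))
        if lab not in dists:  # singleton cluster: silhouette taken as 0
            scores.append(0.0)
            continue
        a = sum(dists[lab]) / len(dists[lab])  # mean distance within own cluster
        b = min(sum(v) / len(v) for k, v in dists.items() if k != lab)  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy data: two well-separated pairs, scored under a good and a bad labelling.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good, bad = [0, 0, 1, 1], [0, 1, 0, 1]
print(inertia(points, good), mean_silhouette(points, good))
print(inertia(points, bad), mean_silhouette(points, bad))
```

Lower inertia and a higher mean Silhouette score both indicate tighter, better-separated groups, which is why the two curves are usually read together when choosing k.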
Fig. 4 Three ternary plots of the 15 subgroups using the PCA-transformed descriptors for the VTS series ensembles. The 15 dot colors denote the subgroup distributions along the three PCA-transformed axes.
Fig. 5 (a and b) The inertia and Silhouette scores of the Lasso ensemble, respectively; (c and d) 3-dimensional plots of the PCA-transformed Lasso ensemble partitioned into 7 and 13 groups, respectively.
The statistical benchmark of Lasso-LR and Lasso-RF models
| Criterion | | | | | | |
|---|---|---|---|---|---|---|
| Ideal values | >0.6 | >0.6 | >0.5 | 0.85 ≤ | Close to | <0.1 (<0.1) |
| Lasso-LR | 0.6632 | 0.5685 (44) | 0.5800 | 0.9868 (0.9999) | 0.6627 (0.4984) | 0.0009 (0.2486) |
| Lasso-RF | 0.9227 | 0.7004 (36) | 0.6205 | 0.9919 (1.0025) | 0.8565 (0.7933) | 0.0717 (0.1402) |
Only 80% of the 11 460 samples were randomly selected as the training set for the Lasso-LR and Lasso-RF models. The remaining 20% of the samples were used as the testing set, with the MAE (in nm) shown in parentheses.
The value of k₀ denotes the slope of the predicted over experimental data through the origin (intercept equal to zero), and is the inverse k₀. Detailed information is summarized in the ESI.
R₀² denotes the correlation coefficient of k₀, and denotes the case of . See the ESI for more details.
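The quantities described in these notes can be sketched as follows: a random 80/20 split, the mean absolute error on the held-out part, and the least-squares slope of a line forced through the origin. This is an illustration under stated assumptions; the helper names, the seed and the three wavelength pairs are hypothetical, and computing the counterpart slope by swapping the two axes is an assumption about what the inverse slope refers to.

```python
import random

def split_80_20(samples, seed=0):
    """Shuffle a copy of the samples and cut it into 80% train / 20% test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]

def mean_absolute_error(experimental, predicted):
    """MAE in the same unit as the inputs (nm for emission wavelengths)."""
    return sum(abs(e - p) for e, p in zip(experimental, predicted)) / len(experimental)

def slope_through_origin(x, y):
    """Least-squares slope k of y = k * x with the intercept fixed at zero."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

# Index split for a dataset of 11 460 samples.
train_idx, test_idx = split_80_20(range(11460))
print(len(train_idx), len(test_idx))

# Hypothetical experimental and predicted wavelengths in nm.
experimental = [400.0, 500.0, 600.0]
predicted = [410.0, 495.0, 590.0]
mae = mean_absolute_error(experimental, predicted)
k0 = slope_through_origin(experimental, predicted)          # predicted over experimental
k0_swapped = slope_through_origin(predicted, experimental)  # axes swapped (assumption)
print(round(mae, 2), round(k0, 4), round(k0_swapped, 4))
```

A slope close to 1 in both directions means the predictions are neither systematically stretched nor compressed relative to experiment, which is what the 0.85 lower bound in the table guards against.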
Fig. 6 The 2D histograms of the regression results of the Lasso-LR and Lasso-RF models. The legend shows the linear equation fitted to the predicted values.
Fig. 7 Comparison of the wavelengths (in nm) predicted by the DFT (red) and Lasso-RF (blue) models. The linear fits to both sets of predictions are shown as dotted lines. The grey solid line denotes the ideal fit with slope = 1.
Fig. 8 The 20 most important descriptors identified by the Lasso-RF model.