Lorentz Jäntschi, Donatella Bálint, Sorana D Bolboacă.
Abstract
Multiple linear regression analysis is widely used to link an outcome with predictors in order to better understand the behaviour of the outcome of interest. Usually, under the assumption that the errors follow a normal distribution, the coefficients of the model are estimated by minimizing the sum of squared deviations. A new approach based on maximum likelihood estimation is proposed for finding the coefficients of linear models with two predictors without any restrictive assumptions on the distribution of the errors. The algorithm was developed, implemented, and tested as a proof of concept on fourteen sets of compounds by investigating the link between activity/property (as outcome) and structural feature information encoded by molecular descriptors (as predictors). The results on real data showed that, in all investigated cases, the power of the error is significantly different from the conventional value of two when the Gauss-Laplace distribution is used to relax the restrictive assumption of normally distributed errors. The Gauss-Laplace distribution of the error therefore could not be rejected, and the hypothesis that the power of the error from the Gauss-Laplace distribution is normally distributed also failed to be rejected.
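To make the idea of an error "power" concrete, the following is a minimal sketch of the log-likelihood of a two-predictor linear model under a Gauss-Laplace (generalized normal) error distribution. It assumes the standard generalized normal density, q / (2·α·Γ(1/q)) · exp(−(|e|/α)^q), with the scale α replaced by its maximum likelihood value for a fixed q; the paper's own parameterization and algorithms may differ. The function names (`residuals_two_predictors`, `gl_profile_log_likelihood`, `gl_std`) are hypothetical and are not taken from the paper.

```python
import math


def residuals_two_predictors(y, x1, x2, a0, a1, a2):
    """Residuals of the model y = a0 + a1*x1 + a2*x2 (illustrative helper)."""
    return [yi - (a0 + a1 * u + a2 * v) for yi, u, v in zip(y, x1, x2)]


def gl_profile_log_likelihood(residuals, q):
    """Log-likelihood under generalized normal errors with power q.

    Assumes at least one non-zero residual. The scale alpha is set to its
    maximum likelihood value for the given q: alpha**q = q * sum(|e|**q) / n.
    Returns (log_likelihood, alpha).
    """
    n = len(residuals)
    s = sum(abs(e) ** q for e in residuals)
    alpha = (q * s / n) ** (1.0 / q)
    # At the optimal alpha, sum(|e|**q) / alpha**q equals n / q.
    log_lik = n * (math.log(q) - math.log(2.0 * alpha) - math.lgamma(1.0 / q)) - n / q
    return log_lik, alpha


def gl_std(alpha, q):
    """Standard deviation implied by scale alpha and power q."""
    return alpha * math.sqrt(math.exp(math.lgamma(3.0 / q) - math.lgamma(1.0 / q)))
```

With q = 2 this reduces to the usual normal log-likelihood, and q = 1 gives the Laplace case; a full fit in the spirit of the abstract would maximize the value over a0, a1, a2, and q together, which this sketch does not attempt.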
Year: 2016 PMID: 28090215 PMCID: PMC5174750 DOI: 10.1155/2016/8578156
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238
Algorithm 1: Calculate “S” at some step “j” from (8).
Algorithm 2: Calculate “q” at some step “j” from (7).
Algorithm 3: Calculate “(W )1≤” at some step “j” from (9).
Algorithm 4: Block solves providing “(a )1≤” at some step “j” with (9).
Algorithm 5: Double loop with (9) for (7) and (8).
Algorithm 6: Contraction functional for MLR-MLE-GL.
Dataset characteristics.
| Set | Sample size | Class | Property/activity | Reference |
|---|---|---|---|---|
| 1 | 132 | Estrogens | Estrogen binding affinity—logRBA | [ |
| 2 | 37 | Carboquinone derivatives | Minimum effective dose (MED)—log(1/MED) | [ |
| 3 | 33 | Organic pollutants | Oxidative degradation—log( | [ |
| 4 | 97 | Benzotriazoles | Fish toxicity—pEC50 | [ |
| 5 | 136 | Thiophene and imidazopyridine derivatives | Inhibition of polo-like kinase 1—pIC50 | [ |
| 6 | 14 | Substituted phenylaminoethanones | Average antimicrobial activity—pMICam | [ |
| 7 | 110 | Acetylcholinesterase inhibitors | Inhibition activity—pIC50 | [ |
| 8 | 107 | Polychlorinated biphenyl ethers | 298 K supercooled liquid vapor pressures—log( | [ |
| 9 | 107 | Polychlorinated biphenyl ethers | Aqueous solubility—log( | [ |
| 10 | 47 | Para-substituted aromatic sulphonamides | Carbonic anhydrase II inhibitors—log( | [ |
Reported bivariate models.
| Set | Model under assumption of normal errors | Determination coefficient |
|---|---|---|
| 1 | −4.284 − 0.0263 · TIE + 0.0368 · TIC1 | 0.3976 |
| 2 | 7.780 − 579 · IHDMkMg + 0.049 · IHDDFMg | 0.7700 |
| 3 | −2.703 + 0.00515 · SAG + 9.703 · | 0.6859 |
| 4 | 4.110 − 0.0172 · TPSA(NO) + 0.0097 · Aeigm | 0.7161 |
| 5 | 2.5651 + 0.1899 · RDF035m + 2.9825 · Small-RSI-mol | 0.5101 |
| 6 | 0.780 + 0.0339 · 0 | 0.8357 |
| 7 | 5.446 + 0.716 · nR10 + 1.113 · N-070 | 0.6838 |
| 8 | 1.476 − 0.588 · NCl − 5.029 · 10⁻² · | 0.9880 |
| 9 | −4.080 − 0.880 · NCl + 5.996 · | 0.9619 |
| 10 | 4.055 − 0.154 · 0 | 0.7058 |
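The models in this table are of the classical kind: an intercept plus two descriptor terms, fitted by minimizing the sum of squared deviations. As an illustration only, the sketch below fits such a model by solving the 3×3 normal equations in plain Python and reports the determination coefficient. The function name `fit_two_predictor_ols` is hypothetical, it assumes the two predictors are not collinear, and it is not the authors' implementation.

```python
def fit_two_predictor_ols(y, x1, x2):
    """Least-squares fit of y = a0 + a1*x1 + a2*x2.

    Returns ([a0, a1, a2], r2), where r2 is the determination coefficient.
    Assumes the predictors are not collinear and y is not constant.
    """
    n = len(y)
    rows = [[1.0, u, v] for u, v in zip(x1, x2)]
    # Normal equations (X^T X) a = X^T y, stored as an augmented 3x4 matrix.
    m = [
        [sum(r[i] * r[j] for r in rows) for j in range(3)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(3)
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda k: abs(m[k][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for k in range(col + 1, 3):
            factor = m[k][col] / m[col][col]
            for c in range(col, 4):
                m[k][c] -= factor * m[col][c]
    # Back substitution.
    coef = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        tail = sum(m[i][j] * coef[j] for j in range(i + 1, 3))
        coef[i] = (m[i][3] - tail) / m[i][i]
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return coef, 1.0 - ss_res / ss_tot
```

Coefficients obtained this way play the role of the "classical approach" values against which the proposed approach is compared.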
Differences between values of coefficients obtained by classical linear regression approach compared to the proposed approach.
| Set | diff( | diff( | diff( | diff( | diff(LMLRGL) | diff( |
|---|---|---|---|---|---|---|
| 1 | 0.3400 | −0.00073 | −0.00315 | 0.24400 | −0.30000 | −0.00200 |
| 2 | −0.4150 | −0.00034 | −16.30000 | 0.17400 | −0.10100 | −0.00020 |
| 3 | −0.3830 | −0.28700 | 0.00009 | −0.04000 | −0.06000 | −0.00030 |
| 4 | −0.1680 | 0.00006 | 0.00007 | −0.01400 | −0.05000 | 0.00000 |
| 5 | 0.9420 | 0.34500 | −0.00880 | −0.62400 | −6.10000 | −0.00850 |
| 6 | 0.5000 | 0.00027 | 0.00078 | −0.02140 | −0.09000 | −0.00006 |
| 7 | 0.5210 | −0.10300 | 0.03490 | −0.01800 | −1.10000 | 0.00030 |
| 8 | −0.5690 | 0.00090 | −0.00330 | −0.01100 | −0.42000 | −0.00010 |
| 9 | −0.4370 | −0.27700 | −0.00020 | 0.04000 | −0.30000 | −0.00020 |
| 10 | −0.9370 | 0.01400 | −0.00700 | 0.06000 | −0.70600 | 0.49310 |
diff: difference between value obtained by classical approach and value obtained by the proposed approach.
a₀, a₁, and a₂: coefficients of the independent variables; q: power of the error (Algorithm 6 for the proposed approach).
σ: population standard deviation; LMLRGL: likelihood for multiple linear regressions under assumption of GL distribution.
Figure 1: Evolution of the power of the errors (q) by optimization iteration: (a) set 1 (converged at iteration 226); (b) set 6 (converged at iteration 154); (c) set 8 (converged at iteration 83); and (d) set 5 (converged at iteration 784).