Omer Kaspi, Abraham Yosipof, Hanoch Senderowitz.
Abstract
An important aspect of chemoinformatics and material-informatics is the use of machine learning algorithms to build Quantitative Structure Activity Relationship (QSAR) models. The RANdom SAmple Consensus (RANSAC) algorithm is a predictive modeling tool widely used in the image processing field for removing noise from datasets. RANSAC could be used as a "one stop shop" algorithm for developing and validating QSAR models, performing outlier removal, descriptor selection, model development and predictions for test set samples using an applicability domain. For "future" predictions (i.e., for samples not included in the original test set), RANSAC provides a statistical estimate of the probability of obtaining reliable predictions, i.e., predictions within a pre-defined number of standard deviations from the true values. In this work we describe the first application of RANSAC in material-informatics, focusing on the analysis of solar cells. We demonstrate that, for three datasets representing different metal oxide (MO) based solar cell libraries, RANSAC-derived models select descriptors previously shown to correlate with key photovoltaic (PV) properties and lead to good predictive statistics for these properties. These models were subsequently used to predict the properties of virtual solar cell libraries, highlighting interesting dependencies of PV properties on MO compositions.
Keywords: Material-informatics; Photovoltaics; QSAR; RANSAC; Solar Cells
Year: 2017 PMID: 29086047 PMCID: PMC5461245 DOI: 10.1186/s13321-017-0224-0
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 A possible RANSAC output where the desired model is of the first power (i.e., a straight line). The algorithm assumes that, due to internal noise, samples will not lie exactly on the model but will be normally distributed around it. Conceptually, this variance forms a “strip”; all samples that lie within its boundaries are influenced by internal noise only. Samples within the “strip” are defined as model-compatible. Samples outside the “strip” are defined as model-incompatible
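The "strip" idea in the caption can be illustrated with a minimal sketch. It assumes a straight-line model whose slope, intercept and noise standard deviation are already known, and it takes the half-width of the strip to be `k` standard deviations. The function name `classify_by_strip`, the value of `k` and the sample data are all hypothetical and are not taken from the paper.

```python
def classify_by_strip(points, slope, intercept, sigma, k=2.0):
    """Split samples into model-compatible and model-incompatible lists.

    A sample is treated as model-compatible when its residual from the line
    y = slope * x + intercept is at most k * sigma (the assumed strip half-width).
    """
    compatible, incompatible = [], []
    for x, y in points:
        residual = y - (slope * x + intercept)
        if abs(residual) <= k * sigma:
            compatible.append((x, y))
        else:
            incompatible.append((x, y))
    return compatible, incompatible


# Hypothetical samples around the line y = 2x; the last one is far from it.
samples = [(0.0, 0.1), (1.0, 1.9), (2.0, 4.3), (3.0, 9.0)]
inside, outside = classify_by_strip(samples, slope=2.0, intercept=0.0, sigma=0.25)
print("model-compatible:", inside)
print("model-incompatible:", outside)
```

With these illustrative numbers the strip half-width is 0.5, so only the sample whose residual is 3.0 falls outside it.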
Fig. 2 Description of the RANSAC algorithm as used for model construction
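Since the figure itself is not reproduced here, the following is a minimal sketch of the generic RANSAC loop for a straight-line model, written in plain Python. It is not the authors' QSAR implementation: the function names, the fixed residual threshold, the iteration count and the synthetic data are assumptions made for illustration only.

```python
import random
import statistics


def fit_line(points):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    a = sxy / sxx
    return a, my - a * mx


def ransac_line(points, n_iter=200, threshold=1.0, seed=0):
    """Return ((a, b), consensus_set) for the largest consensus set found."""
    rng = random.Random(seed)
    best = []
    for _ in range(n_iter):
        # 1. Draw a minimal random sample (two points define a line).
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # a vertical pair cannot be written as y = a*x + b
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        # 2. Collect every sample whose residual is within the threshold.
        consensus = [(x, y) for x, y in points if abs(y - (a * x + b)) <= threshold]
        # 3. Keep the largest consensus set seen so far.
        if len(consensus) > len(best):
            best = consensus
    if len(best) < 2:
        raise ValueError("no usable consensus set found")
    # 4. Refit the model on the whole consensus set.
    return fit_line(best), best


# Synthetic data: 20 samples near y = 2x + 1 plus three gross outliers.
clean = [(float(x), 2.0 * x + 1.0 + (0.2 if x % 2 == 0 else -0.2)) for x in range(20)]
outliers = [(3.0, 40.0), (10.0, -15.0), (15.0, 80.0)]
(slope, intercept), consensus = ransac_line(clean + outliers)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, consensus size={len(consensus)}")
```

The final refit uses only the consensus set, so the three gross outliers do not influence the reported slope and intercept.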
Fig. 3 A schematic representation of the PV solar cell libraries. a TiO2|Cu2O library (with Ag and Ag|Cu back contacts), b TiO2|Co3O4|MoO3 library
Descriptor ranges for the TiO2|Cu2O library (with Ag and Ag|Cu back contacts)
| Descriptor | Range |
|---|---|
| | 70.0–311.5 |
| | 249.0–596.0 |
| Ratio | 0.51–0.89 |
| BGP (eV) | 0.21–2.50 |
Descriptor ranges for the TiO2|Co3O4|MoO3 library
| Descriptor | Range |
|---|---|
| | 259.0–355.0 |
| | 30.7–245.0 |
| | 38.9–61.8 |
| Ratio | 0.08–0.43 |
| Ratio_AR | 0.38–0.86 |
Fig. 4 Boxplots of the three PV activities (JSC, VOC, and IQE). a–c Distributions of the three PV activities for the TiO2|Cu2O library (with Ag and Ag|Cu back contacts). d–f Distributions of the three PV activities for the TiO2|Co3O4|MoO3 library. The boxplots show the median values (solid horizontal line), 50th percentile values (box outline), the lower and upper quartile (whiskers, vertical lines), and outlier values (open circles)
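Activity ranges for the three libraries
| Activity | TiO2\|Cu2O, Ag back contacts | TiO2\|Cu2O, Ag\|Cu back contacts | TiO2\|Co3O4\|MoO3 |
|---|---|---|---|
| JSC | 13.9–387.8 | 17.21–405.86 | 10.35–25.7 |
| VOC | 0.06–0.35 | 0.01–0.35 | 0.03–0.62 |
| IQE | 0.11–2.56 | 0.13–2.68 | 0.05–0.31 |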
Number of model-compatible samples for the three datasets based on the RANSAC models
| | JSC | VOC | IQE |
|---|---|---|---|
| TiO2\|Cu2O library, Ag back contacts | | | |
| # Model-compatible training samples | 125/130 (96%) | 120/130 (92%) | 125/130 (96%) |
| # Model-compatible test samples | 28/32 (88%) | 32/32 (100%) | 28/32 (88%) |
| TiO2\|Cu2O library, Ag\|Cu back contacts | | | |
| # Model-compatible training samples | 131/134 (98%) | 127/134 (95%) | 129/134 (96%) |
| # Model-compatible test samples | 30/32 (94%) | 30/32 (94%) | 31/32 (97%) |
| TiO2\|Co3O4\|MoO3 library | | | |
| # Model-compatible training samples | 118/120 (98%) | 101/120 (84%) | 120/120 (100%) |
| # Model-compatible test samples | 30/30 (100%) | 24/30 (80%) | 30/30 (100%) |
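The percentages in the table above are the fraction of samples that fall inside the model's "strip", and they can be reproduced with a short sketch. The helper name `coverage_percent` is hypothetical. The counts are the test set counts from the first block of rows in the table, and rounding half up to the nearest integer is an assumption that happens to reproduce the reported values.

```python
def coverage_percent(n_compatible, n_total):
    """Percentage of model-compatible samples, rounded half up to an integer."""
    return int(100 * n_compatible / n_total + 0.5)


# Test set counts from the first block of rows in the table (JSC, VOC, IQE).
first_block_test = {"JSC": (28, 32), "VOC": (32, 32), "IQE": (28, 32)}
coverage = {name: coverage_percent(c, t) for name, (c, t) in first_block_test.items()}
print(coverage)
```

Rounding half up is written out explicitly because Python's built-in `round` rounds exact halves to the nearest even integer, which can differ from the convention used in a published table.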
RANSAC model performance for the three datasets
| Library | Activity | | | | |
|---|---|---|---|---|---|
| TiO2\|Cu2O, Ag back contacts | JSC | 0.75 (0.77) | 0.82 (0.84) | 0.69 (0.75) | 0.87 (0.89) |
| | VOC | 0.62 (0.63) | 0.65 (0.66) | 0.80 (0.80) | 0.80 (0.80) |
| | IQE | 0.71 (0.72) | 0.79 (0.82) | 0.69 (0.76) | 0.83 (0.87) |
| TiO2\|Cu2O, Ag\|Cu back contacts | JSC | 0.74 (0.78) | 0.78 (0.82) | 0.76 (0.81) | 0.84 (0.88) |
| | VOC | 0.57 (0.62) | 0.73 (0.78) | 0.62 (0.68) | 0.73 (0.78) |
| | IQE | 0.72 (0.74) | 0.78 (0.81) | 0.78 (0.82) | 0.82 (0.86) |
| TiO2\|Co3O4\|MoO3 | JSC | 0.77 (0.78) | 0.78 (0.79) | 0.82 (0.83) | 0.82 (0.83) |
| | VOC | −0.06 (0.03) | 0.25 (0.36) | 0.00 (0.10) | 0.33 (0.31) |
| | IQE | 0.85 (0.86) | 0.85 (0.86) | 0.79 (0.81) | 0.79 (0.81) |
Fig. 5 Predicted versus experimental PV properties for train set samples following the removal of outliers. a–c JSC, VOC and IQE for the TiO2|Cu2O library with Ag back contacts, d–f JSC, VOC and IQE for the TiO2|Cu2O library with Ag|Cu back contacts, g–i JSC, VOC and IQE for the TiO2|Co3O4|MoO3 library
Fig. 6 Predicted versus experimental PV properties for test set samples residing within the models' applicability domains. a–c JSC, VOC and IQE for the TiO2|Cu2O library with Ag back contacts, d–f JSC, VOC and IQE for the TiO2|Cu2O library with Ag|Cu back contacts, g–i JSC, VOC and IQE for the TiO2|Co3O4|MoO3 library
Performance of kNN and GP models, retrieved from Yosipof et al. [41]
| Library | Activity | | | | | |
|---|---|---|---|---|---|---|
| TiO2\|Cu2O, Ag back contacts | JSC | 0.92 | 0.74 | 0.92 (0.92) | 0.92 (0.92) | 0.76 (0.76) |
| | VOC | 0.78 | 0.65 | 0.89 (0.89) | 0.89 (0.89) | 0.78 (0.77) |
| | IQE | 0.91 | 0.70 | 0.87 (0.87) | 0.87 (0.87) | 0.72 (0.73) |
| TiO2\|Cu2O, Ag\|Cu back contacts | JSC | 0.92 | 0.76 | 0.89 (0.89) | 0.88 (0.89) | 0.74 (0.76) |
| | VOC | 0.74 | 0.61 | 0.56 (0.55) | 0.55 (0.54) | 0.50 (0.50) |
| | IQE | 0.9 | 0.72 | 0.91 (0.91) | 0.89 (0.89) | 0.72 (0.73) |
A comparison of model coverage, based on test set samples, between RANSAC and kNN models
| Library | Activity | RANSAC coverage (%) | kNN coverage (%)* |
|---|---|---|---|
| TiO2\|Cu2O, Ag back contacts | JSC | 88 | 91 |
| | VOC | 100 | 84 |
| | IQE | 88 | 91 |
| TiO2\|Cu2O, Ag\|Cu back contacts | JSC | 94 | 79 |
| | VOC | 94 | 79 |
| | IQE | 97 | 73 |
*The data for kNN were taken from Table 5 in Ref. [41]
RANSAC-derived models for different PV properties
| PV Property | Model |
|---|---|
Features selected for the libraries by the various methods
| Library | Activity | RANSAC | GP | |
|---|---|---|---|---|
Fig. 7 Virtual cells based on the TiO2|Cu2O solar cell libraries with Ag back contacts [a JSC (μA/cm2); b VOC (V); c IQE (%)] and with Ag|Cu back contacts [d JSC (μA/cm2); e VOC (V); f IQE (%)]. The white regions are outside the models' applicability domain
Fig. 8 Virtual cells based on the TiO2|Co3O4|MoO3 library [a JSC (μA/cm2); b VOC (V); c IQE (%)]. The white regions are outside the models' applicability domain