Simulation data for an estimation of the maximum theoretical value and confidence interval for the correlation coefficient.

Paolo Rocco, Francesco Cilurzo, Paola Minghetti, Giulio Vistoli, Alessandro Pedretti.

Abstract

The data presented in this article are related to the article titled "Molecular Dynamics as a tool for in silico screening of skin permeability" (Rocco et al., 2017) [1]. Knowledge of the confidence interval and maximum theoretical value of the correlation coefficient r can help estimate the reliability of developed predictive models, in particular when there is great variability in compiled experimental datasets. In this Data in Brief article, data from purposely designed numerical simulations are presented to show how much the maximum r value worsens as data uncertainty increases. The corresponding confidence interval of r is determined by using the Fisher r→Z transform.

Keywords:  Confidence interval around correlation coefficient; Fisher transform; Maximum theoretical correlation coefficient; Numerical simulation; Skin permeability

Year:  2017        PMID: 28795106      PMCID: PMC5540710          DOI: 10.1016/j.dib.2017.07.045

Source DB:  PubMed          Journal:  Data Brief        ISSN: 2352-3409


Specifications Table

Value of the data

When a compiled experimental dataset shows great variability, the confidence interval for the correlation coefficient r and the maximum theoretical value achievable for r indicate what to expect from a predictive model based on that set.

The numerical simulations used to generate a dataset of arbitrary average uncertainty, and to estimate a confidence interval around r and its maximum theoretical value, are easily applicable to all experimental datasets.

The data proposed here can readily be used to derive the range of r that can be pursued when the variability of a given dataset is known.

Along with well-known statistical parameters (such as r, r², q², F, SE, etc.), the confidence interval of r proposed here can become a meaningful parameter for evaluating the reliability of a given model and for understanding whether there is still room for statistical improvement.

Data

The data presented here are the maximum theoretical average values and confidence intervals for the correlation coefficient r and the determination coefficient r², as obtained through numerical simulation (Table 1). The values of r and r² correspond to different simulated levels of random error (ε) in the experimental data set.
Table 1

Maximum theoretical average values and 95% confidence interval for r and r² from numerical simulation, at different simulated levels of random error (ε) in the experimental data set.

| ε | 0.10 | 0.15 | 0.20 | 0.25 | 0.30 |
|---|---|---|---|---|---|
| Maximum average r | 0.97 | 0.94 | 0.90 | 0.86 | 0.81 |
| Confidence interval around r | 0.96–0.98 | 0.91–0.96 | 0.85–0.94 | 0.78–0.91 | 0.70–0.89 |
| Maximum average r² | 0.95 | 0.88 | 0.81 | 0.74 | 0.66 |
| Confidence interval around r² | 0.92–0.97 | 0.83–0.93 | 0.72–0.88 | 0.60–0.84 | 0.50–0.79 |

The original data on which Table 1 is based are contained in the files Reduced_set.pdf and simulation_data.xlsx. Reduced_set.pdf contains a set of 80 permeability coefficients kp [1], assembled as the intersection of Flynn's set [2] and the Fully Validated data set [3]. The file simulation_data.xlsx contains data from the numerical simulation described below.

Experimental design, materials and methods

Given a set of experimental data, y, we can assume that a perfect estimator ϕ for the set is known (in [1], the yi correspond to pkp values). ϕ is a mathematical function that correlates a set of variables {xij} with the experimental value yi, where xij represents the j-th molecular property of the i-th molecule (Eq. (1)):

$$y_i = \phi(\{x_{ij}\}) \qquad (1)$$

The correlation, based on a perfect estimator, yields a correlation coefficient r = 1.

For every yi, we introduce an error ε·cik·yi, where {cik} is a set of normally distributed pseudo-random numbers with zero mean and unit standard deviation (obtained by applying the Box-Muller transform [4] to a set of uniformly distributed random numbers); ε corresponds to the standard deviation of the errors, normalized by yi. For the k-th simulation, Eq. (1) becomes Eq. (2):

$$y_i + \varepsilon\, c_{ik}\, y_i = \phi_k(\{x_{ij}\}) \qquad (2)$$

Since ϕk is, by definition, a perfect estimator, the values of r obtained for Eq. (2) in the simulation are the maximum theoretical correlation coefficients achievable given the uncertainty introduced (ε). For each value of ε, the numerical simulation is repeated 99 times (l = 99), yielding 99 correlation coefficients rk (simulation_data.xlsx). Table 1 shows how much r and r² worsen as ε increases and confirms that the formula maximum r² ≅ (1−ε) is an approximate yet reasonable way to estimate the worsening effect of ε.

The confidence interval around r can be estimated, for each value of r, by using the Fisher r→Z transform [5] (Eq. (3)):

$$Z = \frac{1}{2}\ln\frac{1+r}{1-r} \qquad (3)$$

We apply the Fisher r→Z transform to the r values, obtaining 99 Z values. Unlike r, Z tends to a normal distribution as the number of data becomes large. Therefore, the standard deviation S can be calculated by Eq. (4). The 95% confidence interval around Z is then calculated from S, and the 95% confidence interval around r is obtained from it through the reverse transform (Eq. (5)):

$$r = \frac{e^{2Z}-1}{e^{2Z}+1} \qquad (5)$$

The confidence intervals around r and r² for different values of ε are shown in Table 1.
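The procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code. The experimental values in Reduced_set.pdf are not reproduced here, so `placeholder_y` is a made-up set of 80 evenly spaced values. The correlation is assumed to be the Pearson r between the unperturbed and perturbed values. S is assumed to be the sample standard deviation of the 99 Z values, and the 95% interval is assumed to be the mean Z ± 1.96·S. All function and variable names are hypothetical.

```python
import math
import random
import statistics


def box_muller(rng):
    """One standard normal deviate from two uniform deviates (Box-Muller)."""
    u1 = 1.0 - rng.random()  # in (0, 1], avoids log(0)
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fisher_z(r):
    """Fisher r -> Z transform."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


def inverse_fisher(z):
    """Reverse transform Z -> r."""
    return (math.exp(2.0 * z) - 1.0) / (math.exp(2.0 * z) + 1.0)


def simulate(y, eps, runs=99, seed=0):
    """Return (mean r, (lower, upper)) for relative error level eps > 0."""
    rng = random.Random(seed)
    r_values = []
    for _ in range(runs):
        # Perturb each y_i by eps * c_ik * y_i, with c_ik ~ N(0, 1).
        noisy = [yi + eps * box_muller(rng) * yi for yi in y]
        r_values.append(pearson_r(y, noisy))
    z_values = [fisher_z(r) for r in r_values]
    z_mean = statistics.fmean(z_values)
    s = statistics.stdev(z_values)  # ASSUMPTION: sample SD of the Z values
    # ASSUMPTION: normal-theory 95% interval, mean Z +/- 1.96 * S
    lower = inverse_fisher(z_mean - 1.96 * s)
    upper = inverse_fisher(z_mean + 1.96 * s)
    return statistics.fmean(r_values), (lower, upper)


# HYPOTHETICAL stand-in for the 80 pkp values of the reduced set.
placeholder_y = [1.0 + 4.0 * i / 79 for i in range(80)]

if __name__ == "__main__":
    for eps in (0.10, 0.15, 0.20, 0.25, 0.30):
        mean_r, (lower, upper) = simulate(placeholder_y, eps)
        print(f"eps={eps:.2f}  mean r={mean_r:.2f}  95% CI={lower:.2f}-{upper:.2f}")
```

Because the placeholder values differ from the actual data set, the printed numbers are not expected to match Table 1. The sketch only shows the sequence of steps: perturb the values, correlate, transform, and back-transform.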

Specifications Table

Subject area: Chemistry
More specific subject area: Chemometrics
Type of data: Table
How data was acquired: Numerical simulation
Data format: Raw, Analysed
Experimental factors: Not applicable
Experimental features: Reduced set (Reduced_set.pdf) modified by randomly generated errors.
Data source location: Not applicable
Data accessibility: Data are contained in this article and in the files Reduced_set.pdf and simulation_data.xlsx

References

[1] Paolo Rocco, Francesco Cilurzo, Paola Minghetti, Giulio Vistoli, Alessandro Pedretti. Molecular Dynamics as a tool for in silico screening of skin permeability. Eur J Pharm Sci, 2017-06-13.
