Simone Fiori1. 1. Dipartimento di Elettronica, Intelligenza Artificiale e Telecomunicazioni, Università Politecnica delle Marche, Via Brecce Bianche, 60131 Ancona, Italy. fiori@deit.univpm.it
Abstract
Bivariate statistical modeling from incomplete data is a useful statistical tool that allows one to discover the model underlying two data sets when the data in the two sets correspond neither in size nor in ordering. Such a situation may occur when the sizes of the two data sets do not match (i.e., there are "holes" in the data) or when the data sets have been acquired independently. Also, statistical modeling is useful when the amount of available data is sufficiently large to reveal relevant statistical features of the phenomenon underlying the data. We propose to tackle the problem of statistical modeling via a neural (nonlinear) system that is able to match its input-output statistic to the statistic of the available data sets. A key point of the new implementation proposed here is that it is based on look-up-table (LUT) neural systems, which provide a computationally advantageous way of implementing neural systems. A number of numerical experiments, performed on both synthetic and real-world data sets, illustrate the features of the proposed modeling procedure.
Most complex, real-world systems (possibly having
stochastic elements) and phenomena cannot be accurately described by a
mathematical model to be evaluated analytically. In this case, statistical
modeling provides a useful tool to build up a model of the phenomenon under
observation. The classical theory of stochastic modeling is based on concepts such
as Markov chains, renewal theory, and queuing theory [1, 2]. Known applications range from physics [3] to automatic speech
recognition and language translation [4], from broadband communications systems [5] to magnetic hysteresis
modeling [6] and
economics [7].

Also, statistical modeling pays off in those applications where only incomplete data are available. The problem of
incomplete data is a serious one in many fields of research. In general, data
being incomplete means that specific observations are either lost or are not
recorded exactly. Based on the definitions adopted in the insurance context
(see, e.g., [8]) there
are two main ways in which the data can be incomplete.
Censored data
Data are said to be censored when the number of observations that fall within a given set is known, but the specific values of
the observations are unknown.
Truncated data
Data are said to be truncated when observations that fall within a given set are excluded.

While the censored-data definition implies that
only the number of observations was recorded, the truncated-data definition
implies that neither the number nor the values of such observations were
recorded (e.g., in the operational risk field the second scenario is the most
common one: the truncated data refer to the recorded observations, all of which
fall above a positive threshold of a specific amount, while the missing data
identify the unrecorded observations falling below the known threshold). The
latter are usually called non-randomly missing data, to distinguish them from
the randomly missing data that may instead affect the observations that fall
over the entire range of the data and can be caused, for example, by an
inadequate data collection process.

Little and Rubin [9] defined three types of incomplete data (see also [10]).
Missing completely at random (MCAR)
This is the case when there are missing records in a data set but there is no way to guess that such records are actually missing.
Missing at random (MAR)
Cases with incomplete data differ from cases with
complete data, but the pattern of incomplete data is traceable or predictable
from other variables in the data sets. In other words, the actual variables
where data are missing are not the cause of the incomplete data: instead, the
cause of the missing data is due to some external influences.
Nonignorable
The pattern of incomplete data is not random nor is it
predictable from other variables in the data sets. In contrast to the MAR
situation, where incomplete data can be traced on the basis of other measured
variables, nonignorable missing data are only explainable by the very variables
the values of which are missing.

In practice, it is difficult to meet the MCAR assumption, while the MAR assumption is often more reasonable. The more relevant and related predictors one can include in statistical models, the more likely it is that the MAR assumption will be met.

Missing-data imputation and handling is a rapidly evolving field comprising many methods, each applicable in some specific
circumstances. When choosing a missing data handling approach, one of the
desired outcomes is maintaining (or approximating as closely as possible) the
shape of the original data distribution. Some incomplete data handling methods
are superior to others in maintaining distributions' shape. The following list
covers some of the more widely recognized approaches to handling data sets with
incomplete cases [9, 11, 12].

Listwise or casewise data deletion
If a record has missing data for any variable used in a particular analysis, omit that entire record from the analysis.

Pairwise data deletion
For bivariate modeling, compute statistics based upon the available pairwise data only.

Mean substitution
Substitute a variable's mean value, computed from the available cases, to fill in missing data values on the remaining cases.

Hot deck imputation
Identify the case most similar to the case with a missing value and substitute the similar case's value for the missing one.

Expectation maximization (EM) approach
An iterative procedure that alternates two discrete steps. In the expectation step, compute the expected value of the complete-data log likelihood. In the maximization step, substitute the expected values for the missing data obtained from the expectation step and then maximize the likelihood function as if no data were missing, obtaining new parameter estimates. Iterate through these two steps until convergence is achieved.

Maximum likelihood (ML) approach
Use the whole set of available data to generate maximum-likelihood-based sufficient statistics. Usually, such statistics consist of the variables' covariance matrix and means.

Multiple imputation
Similar to the maximum likelihood method, except that multiple imputation generates data values suitable for filling in gaps in an existing database. Typically, several databases are created in this way and the investigator then analyzes these data sets using an appropriate statistical analysis method, treating these databases as if they were based on complete-case data. The results from these analyses are then combined into a single summary finding.

Regression methods
Develop a regression equation based on complete-case data for a given variable, treating it as the outcome and using all other relevant variables as predictors. Then, for cases where the variable is missing, plug the available data into the regression equation and substitute the equation's predicted value into the data set for subsequent use.

Comparisons of these methods, presented in [9, 11, 12], show that listwise,
pairwise, and mean substitution missing data handling methods are inferior when
compared with maximum likelihood or multiple imputation methods. Regression
methods are somewhat better, but not as good as hot deck imputation or maximum
likelihood approaches. The EM method falls somewhere amidst them, as it is
generally superior to listwise, pairwise, and mean substitution approaches, but
it lacks the uncertainty component contained in the maximum likelihood and
multiple imputation methods.

In the present work, we proceed under the MAR assumption
and consider bivariate statistical modeling with a method that differs
from the ones recalled above (while partially resembling the regression
method). Statistical modeling (also known as "imputation" in the context of statistics) with incomplete data is a useful statistical tool that allows one to infer the model underlying two data sets in the case that the data in the two sets correspond to each other in an unknown way. Such a case may occur when the
sizes of the two data sets do not match (i.e., there are “holes” in the data)
or when the data have been acquired independently and a statistical analysis is
to be launched in order to discover their dependency. Also, statistical
modeling is useful when the amount of available data is sufficiently large to
show relevant statistical features of the phenomenon underlying the data.

As an example of data sets 𝒟_x and 𝒟_y of different sizes, let us consider the following case. On a population of students, the set 𝒟_x represents the set of grades attributed by a teacher at the end of an exam and the set 𝒟_y represents the set of evaluation scores anonymously attributed by the students to the teacher after class completion. About these data sets, the following observations can be made.

Not all the students might want to evaluate the teacher, so the size of the data set 𝒟_y might differ from the population size.

For some special reason, the teacher might exclude some students from grading, therefore the size of the data set 𝒟_x might differ from the population size.

Because of the anonymity of the students' evaluations, the scores in the data sets 𝒟_x and 𝒟_y come unpaired (namely, without any a priori link to each other).

It is tenable to deduce that the students who receive the worst grades will also evaluate the teacher with low scores; therefore, the model linking the values of the data sets 𝒟_x and 𝒟_y may be expected to exhibit a monotonically increasing shape. The general problem of finding a monotonic regression model is known as "isotonic regression" in statistics [13].

In the present paper, we consider tackling the problem
of statistical modeling via a neural system that is able to match its own
input-output statistic to the statistic of the available data sets. The neural
system is shown in Figure 1, where x and y denote the variables whose analytic dependency is to be modeled, 𝒟_x and 𝒟_y denote the data sets that such variables belong to, while p_x(⋅) and p_y(⋅) denote the probability density functions that
the variables x and y are drawn from. Neural systems exhibit
nonlinear signal/data processing ability as well as adaptivity. In the present
work, neural systems do not need training, so we retain the “neural” feature
of nonlinear input/output transference only.
Figure 1
The neural system used for modeling along with relevant quantities of input/output data
sets.
Unlike previous contributions such as [14, 15], which required deep
theoretical grounds, in the present contribution, we focus on a conceptually
simple algorithm and on efficient implementation issues. The key points of the
proposed method may be briefly summarized as follows.
(i) Modeling as statistic matching problem
Statistical modeling is interpreted here as an input-output statistic matching problem for
a nonlinear system (or neural network). Therefore, instead of considering the variables x ∈ 𝒟_x and y ∈ 𝒟_y as paired and looking for a nonlinear model that best fits the variables' values, we consider the cumulative statistical information carried by the data sets 𝒟_x and 𝒟_y in terms of their probability density functions. The neural system's nonlinear transference function matches the
probability distributions of the variables rather than the variables' values.
This allows the proposed statistical modeling technique to cope with the
modeling problem when the sizes of the two data sets do not match (the case of missing data) or when the data sets were collected independently, so that the pairing relationship of the values within x ∈ 𝒟_x and y ∈ 𝒟_y is unknown (or was lost during the
data-collection step).
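For a monotonically increasing model, input-output statistic matching admits a classical closed form: f = F_y⁻¹ ∘ F_x, the composition of the cumulative distribution function of the x-data with the quantile function of the y-data. The following is a minimal sketch of that reading using empirical distributions; the function names are illustrative, not the paper's implementation:

```python
import numpy as np

def fit_increasing_model(xs, ys):
    """Return a model f with F_y(f(x)) = F_x(x), i.e. f = F_y^{-1} o F_x.
    xs and ys are unpaired samples, possibly of different sizes."""
    xs = np.sort(np.asarray(xs, dtype=float))
    ys = np.asarray(ys, dtype=float)

    def f(x):
        # Empirical CDF of the x-sample evaluated at x, a level in [0, 1].
        u = np.searchsorted(xs, x, side="right") / len(xs)
        # Empirical quantile of the y-sample at that level.
        return np.quantile(ys, np.clip(u, 0.0, 1.0))

    return f

# Example: x uniform on [0, 1], y = x**2 sampled independently.
rng = np.random.default_rng(0)
model = fit_increasing_model(rng.uniform(size=5000), rng.uniform(size=4000) ** 2)
print(model(0.5))  # close to the true value 0.5**2 = 0.25
```

Because only pooled distributional information enters the fit, the two samples need not be paired or equally sized, which is exactly the incomplete-data setting discussed above.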
(ii) Simple and efficient numerical representation
The quantities of interest, as well as the learnt model, are represented in terms of
paired lists of numbers (namely, in terms of look-up tables, or LUTs). The LUTs
provide an efficient way of representing and handling the variables appearing
in the devised statistical modeling algorithm.
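Concretely, a look-up table in this sense can be held as a pair of sorted parallel lists; evaluating it reduces to a binary search followed by linear interpolation. A generic sketch (the class name and the clamping behavior at the table edges are illustrative assumptions):

```python
import bisect

class LUT:
    """Piecewise-linear look-up table over paired sorted lists."""

    def __init__(self, xs, ys):
        pairs = sorted(zip(xs, ys))          # keep the abscissae ordered
        self.xs = [p[0] for p in pairs]
        self.ys = [p[1] for p in pairs]

    def __call__(self, x):
        i = bisect.bisect_right(self.xs, x)  # binary search, O(log n)
        if i == 0:
            return self.ys[0]                # clamp below the table
        if i == len(self.xs):
            return self.ys[-1]               # clamp above the table
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        t = (x - x0) / (x1 - x0)             # linear interpolation weight
        return y0 + t * (y1 - y0)

lut = LUT([0.0, 1.0, 2.0], [0.0, 10.0, 40.0])
print(lut(1.5))  # → 25.0
```

Handling the LUT thus involves only sorting, searching, and a few algebraic operations on numbers, which is the computational advantage claimed above.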
(iii) Computational advantages
As will become clear later, the advantages of the procedure devised here are that: (1) it does not involve any computation except for LUT handling (which may be implemented by sorting/searching on lists of numbers) and a few simple algebraic operations on numbers; (2) from the viewpoint of implementation on a "number cruncher," the operations on the data sets 𝒟_x and 𝒟_y may be performed in a parallel way until the very last step.
(iv) Fundamental limitation
As an inherent restriction
of the method, the developed theory as well as its numerical implementation
requires the modeled relationship to be bijective, namely, it requires the
model to be monotonic (increasing or decreasing), while its shape is otherwise
unrestricted. As the model f(⋅) is free of any shape constraint except for
monotonicity, the related modeling procedure is free of bias effects inherently
tied to other parametric modeling methods.
(v) Interchangeability of pooled data and statistic
The individual data in the sets 𝒟_x and 𝒟_y are not used directly; only cumulative statistical functions obtained by pooling such data take part in the modeling procedure. Therefore, if the data sets of the phenomena under investigation are not available, but only their pooled cumulative statistical functions are, the modeling process considered here may nevertheless be exploited.

The paper is organized as follows. Section 2 briefly
reviews some relevant scientific literature related to the topic of the present
manuscript. Section 3 discusses the modeling problem in detail and presents
the related analytic setting and its solution; the section then describes the
required operations with look-up tables and then presents the numerical
procedure of statistical modeling corresponding to the devised problem setting
and analytic solution. Section 4 presents the results of numerical experiments
obtained with synthetic data sets as well as real-world data sets. In
particular, an experiment on synthetic data sets allows us to compare the
modeling result obtained with the method proposed here with the results
obtainable with an algorithm for random number generation proposed in [16]. The experiments with
real-world data sets allow us to gain some insights into the behavior of the
neural-system-based statistical modeling method when coping with real-world
data. In particular, the statistical results obtained on data from a polypropylene-matrix composite reinforced with natural fiber may be compared to the model presented in the previous work [17].
2. SHORT REVIEW OF STATISTICAL MODELING AND ITS APPLICATIONS
In this section, we present a short review of contributions addressing statistical modeling using pairwise interactions that share the same features as the approach presented in the present manuscript, namely information theoretical learning (ITL, [18]), as well as selected contributions that show various applications of statistical modeling. A topic that is also related to the content of the present paper is nonparametric kernel regression, although such a method looks quite different in terms of implementation, being based on matrix operators. We also refer to interesting recently published surveys for the sake of readers' convenience.

Some selected contributions drawn from the scientific literature are worth considering to get a picture of current applications of statistical modeling.
Data filling from incomplete oceanographic data sets [19]
This paper presents a self-consistent method to infer
missing data from oceanographic data series. The method presented in [19] allows one to detect the number
of statistically significant empirical orthogonal functions by a
cross-validation procedure for a complete or incomplete data set as well as the
noise level and interpolation error. Orthogonal functions may be obtained by, for
example, singular value decomposition of covariance matrices. Since for the
proposed filling and analysis method there is no need for a priori information
about the error covariance structure, the method is self-consistent and parameter-free.
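The empirical-orthogonal-function filling idea can be illustrated by a much-simplified iteration: initialize the missing entries, then alternate a truncated-SVD reconstruction with re-imputation of the gaps. This is a generic sketch under those assumptions, not the self-consistent, cross-validated method of [19]:

```python
import numpy as np

def eof_fill(data, rank=1, n_iter=200):
    """Fill NaN gaps in a data matrix by iterating a rank-truncated SVD
    (empirical orthogonal function) reconstruction over the missing entries."""
    filled = np.array(data, dtype=float)
    mask = np.isnan(filled)
    col_means = np.nanmean(filled, axis=0)
    r, c = np.nonzero(mask)
    filled[r, c] = col_means[c]              # initial guess: column means
    for _ in range(n_iter):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        approx = (u[:, :rank] * s[:rank]) @ vt[:rank, :]   # rank-r EOF model
        filled[mask] = approx[mask]          # update the missing entries only
    return filled

# A rank-1 matrix with one hole: the missing entry (true value 6.0) is recovered.
m = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, np.nan]])
print(eof_fill(m, rank=1)[2, 1])  # close to 6.0
```

The observed entries are never altered; only the gap entries are refined at each iteration, so the reconstruction stays consistent with the recorded data.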
Imputation strategies for blood pressure data [20]
Underlying or untreated blood pressure is often an outcome of interest but is
unobservable when study participants are on anti-hypertensive medications.
Untreated levels are not missing at random but would be higher among subjects
on medication. In such cases, standard methods of analysis may lead to bias.
Blood pressure data obtained at the private physician's office (“out-of-study
blood pressure data”) at the time of prescription of anti-hypertensive
medications may be used to adjust for the potential bias. In particular, in
[20], such data were
used to estimate the conditional expectation and variance of the unobserved
nonmedicated study blood pressure data. For those with no physician data,
imputation from bootstrap samples of out-of-study blood pressure data was used:
an iterative method based on the EM algorithm was employed to estimate the
unknown study parameters in a random-effects model. This model was compared in
[20] to an alternative
model for the out-of-study blood pressure data based on a theoretical truncated
normal distribution. Differences between methods were observed in the decline
in blood pressure over time in the reference group: estimated intervention
effects tended to be slightly larger using the imputation methods.
On the use of missing-data methods in psychosomatic medicine [21]
This paper summarizes recent methodological advances related to
missing data and provides an overview of maximum likelihood estimation and
multiple imputation. The paper gives an overview of missing-data theory:
brief descriptions of traditional missing data techniques are given and maximum
likelihood estimation as well as multiple imputation methods are outlined with
a great deal of details. Special attention is paid within the manuscript to an
analytic strategy that allows incorporating auxiliary variables into the
analytic model. The paper [21] concludes with an illustrative analysis using an
artificial quality-of-life data set.
Estimation of social practices diffusion processes from incomplete data [22]
Event-history analysis of the diffusion of practices in
a social system can show how actors are influenced by each other as well as by
their own characteristics. The presumption that complete data on the entire
population are essential to draw valid inferences about diffusion processes has
been a major limitation in empirical analysis and has precluded diffusion
studies in large populations. The authors examine the impacts of several forms
of incomplete data on estimation of the heterogeneous diffusion model proposed
by Strang and Tuma [23]. Left censoring causes bias, but right censoring
leads to no noteworthy problems. Extensive time aggregation biases estimates of
intrinsic propensities but not cross-case influences. Importantly, random
sampling can yield good results on diffusion processes if there are
supplementary data on influential cases outside the sample. Paper [22] concludes that the
capability of obtaining good estimates from sampled diffusion histories should
help to advance research on diffusion.
Handling missing data in educational research [24]
Recently, missing data analysis techniques have
received considerable attention in the methodological literature, where
multiple imputation and maximum likelihood estimation are recommended. The
article [24] provides
an overview of missing-data theory, maximum likelihood estimation, and multiple
imputation. Also, the paper carries out a methodological review of missing-data
reporting practices drawn from a number of applied research journals. Reference
[24] then provides a
demonstration of multiple imputation and maximum likelihood estimation using
the “longitudinal study of American youth data.”
The results indicate that explicit discussions of missing data increased substantially
between 1999 and 2003, but the use of maximum likelihood estimation or multiple
imputation was rare: the studies relied almost exclusively on listwise and
pairwise deletion.
Analysis of incomplete climate data [25]
The expectation maximization (EM) algorithm, as an
iterative regression method for incomplete data sets and for the imputation of
missing values, is taken in [25] as the starting point for the development of a
regularized EM algorithm. In contrast to the conventional EM algorithm, the
regularized EM algorithm is applicable to sets of climate data, in which the
number of variables typically exceeds the sample size. The regularized EM
algorithm is based on iterated analysis of linear regressions of variables with
missing values on variables with available values, with regression coefficients
estimated by ridge regression. Ridge regression is a regularized regression
method in which a continuous regularization parameter controls the filtering of
the noise in the data. The regularization parameter is determined by
generalized cross-validation, such as to minimize, approximately, the expected
mean-squared error of the imputed values. The regularized EM algorithm can
estimate, and exploit for the imputation of missing values, both synchronic and
diachronic covariance matrices, which may contain information on spatial,
stationary temporal covariability or cyclostationary temporal covariability.
Results of experiments reported in [25] demonstrate that the algorithm is applicable to
typical sets of climate data and that it leads to more accurate estimates of
the missing values than a conventional noniterative imputation technique.A fertile research field involving probability
modeling from incomplete data is the one of classification with labeled,
unlabeled, and labeled with constraints. A recent view of this topic emerge,
for example, from [26-28], which deal with
discriminative learning in belief-net classifiers, classification by pairwise
coupling and clustering by boosted mixture models.Some contributions drawn from the scientific
literature related to pairwise interactions (namely, on the subject of
information theoretical learning) are worthwhile considering.
Learning from examples with information theoretic criteria [29]
This paper discusses a framework for adaptive systems learning based on information theoretic criteria. A novel algorithm based on
Renyi's quadratic entropy is used to train, directly from a data set, adaptive
systems for entropy maximization or minimization. An analogy between the
computation and an information potential measuring the interactions among the
data samples (pairwise interaction) is explained. The newly proposed criterion
is tested in blind source separation (unsupervised learning) and in feature
extraction for classification (supervised learning).
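For Renyi's quadratic entropy with a Gaussian Parzen window, the information potential reduces to a double sum of Gaussian pairwise interactions, using the standard fact that the convolution of two Gaussians of width sigma is a Gaussian of width sigma*sqrt(2). A sketch of that quantity (the bandwidth choice is illustrative):

```python
import math

def information_potential(samples, sigma=1.0):
    """Renyi quadratic-entropy information potential with a Gaussian Parzen
    window: V = (1/N^2) * sum_ij G(x_i - x_j; sigma*sqrt(2))."""
    n = len(samples)
    s2 = 2.0 * (sigma * math.sqrt(2.0)) ** 2   # 2 * width^2, width = sigma*sqrt(2)
    norm = 1.0 / math.sqrt(math.pi * s2)       # Gaussian normalization constant
    total = 0.0
    for xi in samples:                         # O(N^2) pairwise interactions
        for xj in samples:
            total += norm * math.exp(-((xi - xj) ** 2) / s2)
    return total / (n * n)

def renyi_quadratic_entropy(samples, sigma=1.0):
    # H2 = -log V: spreading the samples lowers V and raises the entropy.
    return -math.log(information_potential(samples, sigma))

tight = [0.0, 0.1, 0.2]
spread = [0.0, 5.0, 10.0]
print(renyi_quadratic_entropy(tight) < renyi_quadratic_entropy(spread))  # True
```

The double loop over sample pairs is precisely the "pairwise interaction" structure, and its O(N²) cost is the computational drawback noted in [32].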
Information potential for adaptive system training [30]
This paper shows that using an entropy criterion for learning amounts to minimizing the average information content of the error signal rather than merely trying to minimize its energy. The authors of
[30] propose a
generalization of the error entropy criterion that enables the use of any order
of Renyi's entropy and any suitable kernel function in density estimation. The
equivalence between global optimization by convolution smoothing and the
convolution by the kernel in Parzen windowing is also discussed. Simulations
presented in [30]
concern time-series prediction and classification.
Joint statistical models for audio-visual fusion [31]
Audio and visual signals arriving from a common source may be detected using a
signal-level fusion technique. By comparing the mutual information between
different pairs of signals, it is possible to identify which person is speaking
a given utterance and to get rid of errant motion. To this aim, a probabilistic
multimodal generation model may be introduced and used to derive an
information-theoretic measure of cross-modal correspondence. Reference
[31] makes use of
nonparametric statistical density modeling techniques to characterize the mutual
information between signals from different domains.
Search algorithms for information-theoretic learning [32]
Research papers have proposed various ITL criteria based on Renyi's
quadratic entropy with nonparametric kernel-based density estimation as alternative
performance metrics for both supervised and unsupervised adaptive system
training. The drawback of these information-based metrics is the increased
computational complexity. The authors of [32] examine known parameter-search algorithms, like gradient-descent methods, conjugate-gradient approaches, and the Levenberg-Marquardt algorithm, and propose modifications to allow the training of systems with these ITL criteria.

Recent reviews, surveys, and essays drawn from the scientific literature about statistical modeling are worth considering here.
Statistical modeling in economics and finance [33]
This book provides a broad and rich cross-section of contemporary approaches to
statistical modeling in finance and economics and it is decision-making
oriented. Covered topics range from common tools to solutions of sophisticated
system problems and applications. In particular, a part of [33] is devoted to the
allocation of funds and risk management. Another part explains modeling aspects
of multistage stochastic programming, including a survey of parametric,
multiobjective, and dynamic programming.
Applied nonparametric regression [34]
This book brings together techniques for regression
curve smoothing involving more than one variable. It focuses on the
applications and practical problems of two central aspects of curve smoothing:
the choice of smoothing parameters and the construction of confidence bounds.
The methods covered in [34] are shown to possess numerous applications in many areas that use statistical analysis. Examples are drawn from economics, medicine, and engineering.
Review of missing data handling [35]
The authors of this manuscript frame the missing-data
problem, review methods, offer advice, and raise issues that remain unresolved.
They strive to clear up common misunderstandings regarding the MAR concept and
summarize the evidence against older procedures. Also, the authors of [35] present two general
approaches that come highly recommended, namely, maximum likelihood and
Bayesian multiple imputation. Recent developments are discussed, including some
new techniques to deal with missing data that are not MAR.
Review of randomized controlled trials in medical journals [36]
Randomized controlled trials in medicine almost always have some individuals with missing outcomes. Inadequate handling
of these missing data in the analysis can cause substantial bias in the estimates of the effects of a treatment. The
paper examines how missing outcome data are handled in randomized controlled trials in major medical journals in order to assess whether adequate steps have been taken to reduce nonresponse bias and to identify ways to improve procedures for missing data. The authors of [36] focused on trial designs, how missing outcome data were described, and the statistical methods used to deal with the missing outcome data.

A review of kernel theory/usage in adaptive systems may be found in [37].
We conclude the present review section with a list of some relevant
contributions that discuss the general question of parametric/nonparametric
modeling and of kernel-based regression techniques.
Kernel regression under monotonicity constraints [38]
The authors suggest a method for monotonizing general
kernel-type estimators, for example, local linear estimators and
Nadaraya-Watson estimators (see, e.g., [39]). The proposed approach is shown to produce smooth estimates. Implementation involves only an off-the-shelf programming routine.
The method is based on maximizing fidelity to the conventional empirical
approach, subject to monotonicity.
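The Nadaraya-Watson estimator referred to here is simply a kernel-weighted average of the observed responses. A minimal sketch with a Gaussian kernel, before any monotonization step of the kind proposed in [38] (the bandwidth value is an illustrative choice):

```python
import math

def nadaraya_watson(x_train, y_train, x, bandwidth=1.0):
    """Nadaraya-Watson kernel regression estimate at x:
    a kernel-weighted average of the observed responses y_train."""
    weights = [math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
               for xi in x_train]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("x is too far from all training points")
    return sum(w * y for w, y in zip(weights, y_train)) / total

# Noise-free linear data: by symmetry the estimate at 1.5 is exactly 1.5.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 2.0, 3.0]
print(nadaraya_watson(xs, ys, 1.5, bandwidth=0.5))
```

Note that, unlike the LUT-based procedure of the present paper, this estimator requires paired (x, y) observations.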
Comparing parametric and nonparametric fits [40]
In general, there are visible differences between a
parametric and a nonparametric regression estimate. It is therefore of interest
to compare these in order to decide whether the parametric model could be
justified. An asymptotic quantification is the distribution of the integrated
squared difference between these curves. The authors of [40] show that the standard way
of bootstrapping this statistic fails. They use and analyze a different form of
bootstrapping for this task, termed “wild bootstrap.”
Hypothesis tests for statistical dependency [41]
Determining the structure of dependencies among a
set of variables is a common task in many signal and image processing
applications, including multitarget tracking and computer vision. In this paper, the authors present an information-theoretic, machine learning approach to
problems of this type. Reference [41] casts this problem as a hypothesis test between
factorizations of variables into mutually independent subsets and shows that
the likelihood ratio can be written as sums of two sets of Kullback-Leibler
divergence terms: the first set captures the structure of the statistical
dependencies within each hypothesis, whereas the second set measures the
details of model differences between hypotheses. The authors consider the case
when the signal prior models are unknown, so that the distributions of interest
must be estimated directly from data, showing that the second set of terms is
(asymptotically) negligible and quantifying the loss in hypothesis separability
when the models are completely unknown. They demonstrate the utility of
nonparametric estimation methods, providing a framework for determining and
distinguishing between dependency structures in highly uncertain environments.
Additionally, the authors develop a machine learning approach for estimating
lower bounds on Kullback-Leibler divergence and mutual information from samples
of high-dimensional random variables for which direct density estimation is
infeasible. The authors of [41] present empirical results in the context of three
prototypical applications: association of signals generated by sources
possessing harmonic behavior, scene correspondence using video imagery, and
detection of coherent behavior among sets of moving objects.
Chance probability functions in shape retrieval and classification [42]
Several example-based systems for shape retrieval and
shape classification directly match input shapes to stored shapes, without
using class membership information to perform the matching. Reference [42] proposes a method for improving the accuracy of such systems. First, the system learns a set
of chance probability functions, which represent the probabilities of obtaining
a query shape with particular distances from each training example by chance.
The learnt functions are used at runtime to rapidly estimate the chance
probabilities of the observed distances between the actual query shape and the
database shapes. These estimated probabilities are then used as a dissimilarity
measure for shape retrieval and/or nearest-neighbor classification. The chance-probability-function learning method is parameter-free. Experimental
evaluation demonstrates that chance probabilities yield higher accuracy than
Euclidean distances, that the learnt chance probability functions support fast
matching, and that the chance-probability-functions-based system outperforms
prior systems on a standard benchmark test of retrieval accuracy.
3. STATISTICAL MODELING BY INPUT-OUTPUT STATISTIC MATCHING
The present paper completes the research program
initiated with the probability density function estimation method proposed in
[14, 43] and the neural nonlinear
transference learning method for input-output density matching discussed in
[16], which improves a previous algorithm published in [15]. In particular, the papers on probability density function learning were devoted to the development of methods that are capable
of providing analytic (i.e., smooth) approximations of probability density
functions of data sets on the basis of the available samples. Also, the
nonlinear transference learning methods were developed in order to numerically
estimate nonlinear models that transform a given probability density function
into another known probability density function, for input-output density
function matching, under the assumptions that the involved density functions
are known in analytic form. Therefore, the latter class of methods is mixed
analytic/numerical. In statistical modeling, however, the input/output
probability density functions are unknown and need to be estimated on the basis
of the available input/output data sets. As the model-inference procedure is
intrinsically numerical, it is appropriate to employ a discrete probability density function estimation algorithm. Section 3.1 below reviews the modeling problem in an
analytic fashion and sets the framework for the development of the actual
numerical learning procedures, as discussed in the subsequent sections. The
last section discusses the computational complexity issue.
3.1. The modeling problem: analytic setting and solution
A well-known effect of a nonlinear neural system is that it warps the statistical distribution of its input variable. In particular, we suppose that the system under consideration has a nonlinear structure described by the transference y = f(x), where x ∈ 𝒳 ⊆ ℝ denotes the input, having probability density function p_x(⋅), and y ∈ 𝒴 ⊆ ℝ denotes the output signal, having probability density function p_y(⋅). Under the hypothesis that the neural system transference is strictly monotonic, namely f′(x) > 0 or f′(x) < 0 for all x ∈ 𝒳, the relationship between the input distribution, the output distribution, and the system transfer function is known to be [44]:

p_y(y) = ± p_x(f⁻¹(y)) · (d/dy) f⁻¹(y),   (1)

where f⁻¹(⋅) denotes the inverse of the function f(⋅) and the sign “+” arises when the nonlinear function f(⋅) is monotonically increasing, while the sign “−” arises when the function f(⋅) is monotonically decreasing. Usually, (1) is interpreted as an analysis formula, which enables
us to compute the output distribution when the input distribution and the
system transference function are known. However, the cardinal equation
(1) may also
be interpreted as a formula that allows for designing the nonlinear system when
the input distribution p_x(⋅) is known and it is desired that the system respond according to a desired output distribution p_y(⋅). In fact, (1) may be rewritten as the differential equation

p_y(f(x)) f′(x) = ± p_x(x),   (2)

in the unknown f(⋅). In order for the above equation to make sense, we require the function p_y(⋅) to be nonzero in the domain of interest of the variable y. In general, solving (2) in the unknown f(⋅) amounts to solving a nonlinear differential equation, provided that a consistent boundary condition is specified. Let us now define the cumulative distribution functions associated to the input-output probability density functions:

P_x(x) = ∫₋∞ˣ p_x(u) du,   P_y(y) = ∫₋∞ʸ p_y(u) du.   (3)
In the case that the cumulative
distribution functions associated to the input-output probability density
functions are assumed to be known, the solution of the cardinal differential
equation may be written in closed form as shown in [15]:

f(x) = P_y⁻¹(± P_x(x) + c),   (4)

where the constant c depends on the boundary condition, P_y⁻¹(⋅) denotes the inverse of the cumulative distribution function P_y(⋅), and the sign
“±” is selected according to the desired sign of the derivative f′(⋅). We now consider the problem of boundary condition selection in the differential equation (2). The nonlinear function f(⋅) is sought such that it maps the data distribution p_x(⋅) into the distribution p_y(⋅). Therefore, a sensible choice for the constant c in the model (4) is such that the model maps the center of mass of the x distribution into the center of mass of the y distribution. If we denote by x̄ and ȳ the mean values of the x and y data sets, respectively, namely

x̄ = ∫_𝒳 u p_x(u) du,   ȳ = ∫_𝒴 v p_y(v) dv,   (5)

then the mass-center-to-mass-center-map condition above writes f(x̄) = ȳ. As P_x(x̄) ≈ 1/2 ≈ P_y(ȳ), the above condition gives rise to the following solutions for the increasing/decreasing model types separately.

Monotonically increasing model: the constant c takes on the value 0; therefore, the solution (4) to (2) writes f(x) = P_y⁻¹(P_x(x)).

Monotonically decreasing model: the constant c takes on the value 1; therefore, the solution (4) to (2) writes f(x) = P_y⁻¹(1 − P_x(x)).

The problem of sign selection in (4) is solvable by reasoning on the nature of the
physical phenomena underlying the involved data sets, as will be shown in the
section devoted to numerical experiments. The conclusion of the above problem setting and
analytic solution is that, in order to develop a fully numerical method for
statistical modeling for incomplete data, the following ingredients are of use:
a suitable numerical estimation method for cumulative distribution functions and a
suitable numerical format for function representation/handling (with particular
emphasis on numerical function inversion).
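To make the closed-form solution concrete, the following is a minimal numerical sketch of the increasing branch f(x) = P_y⁻¹(P_x(x)), using raw empirical cumulative distribution functions. The helper names and the test distributions are illustrative and not taken from the paper.

```python
import numpy as np

def empirical_cdf(samples):
    """Pointwise estimate of a CDF: sorted samples paired with ranks."""
    grid = np.sort(samples)
    cdf = np.arange(1, len(grid) + 1) / len(grid)
    return grid, cdf

def model_increasing(x_samples, y_samples, x_eval):
    """Monotonically increasing model f(x) = P_y^-1(P_x(x)); the two
    sample sets may differ in size and ordering, as in the paper."""
    gx, Px = empirical_cdf(x_samples)
    gy, Py = empirical_cdf(y_samples)
    u = np.interp(x_eval, gx, Px)      # evaluate P_x at the query points
    return np.interp(u, Py, gy)        # P_y^-1 via the swapped table

# Illustration: map normally distributed data onto exponential data.
rng = np.random.default_rng(0)
x = rng.normal(size=4771)
y = rng.exponential(size=4847)         # note: sizes intentionally differ
f = model_increasing(x, y, np.linspace(-2.0, 2.0, 50))
print(f[:3])                           # nondecreasing, nonnegative values
```

Note how the inverse P_y⁻¹ is obtained simply by swapping the two coordinate arrays of the CDF table, anticipating the look-up-table inversion of Section 3.2.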
3.2. Operations with look-up tables
As already illustrated, for example, in [15, 16], look-up tables provide a completely suitable numerical representation for functions and probability-distribution-type quantities and the related computations. A real-valued look-up table with N entries is basically a pair LUT = (x, y), where x ∈ ℝᴺ and y ∈ ℝᴺ. The entries x_k of x and y_k of y, with k ∈ {1,…, N}, are paired and provide a pointwise description of an arbitrarily shaped function. An N-size look-up
table is illustrated in Figure 2. It is worth noting that a look-up table is
inherently undirected, namely, it represents a relationship between the x and y data rather
than a dependency of the kind y = y(x) or x = x(y).
Figure 2
An N-size look-up table.
In order to handle the look-up tables for statistical modeling purposes, the following operations are of use.
Cumulative sum
On the basis of an N-point look-up table (x, y), a new look-up table may be constructed whose y-entries contain the cumulative sum of the y-entries of the look-up table (x, y). Such an operation is described by (u, v) = cumsum(x, y), where

u_k = x_k,   v_k = y₁ + y₂ + ⋯ + y_k,   k ∈ {1,…, N}.   (6)

The “cumsum” operator constructs a look-up table (u, v) that has the same size as the argument look-up table (x, y).
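A minimal sketch of the “cumsum” operator on a hypothetical three-point look-up table (Python is used here for illustration; the paper's own implementation is MATLAB-based):

```python
import numpy as np

def cumsum_lut(x, y):
    """LUT 'cumsum' operator: u_k = x_k, v_k = y_1 + ... + y_k."""
    return np.asarray(x, dtype=float), np.cumsum(y)

u, v = cumsum_lut([0.0, 1.0, 2.0], [0.2, 0.5, 0.3])
print(u, v)   # v ~ [0.2, 0.7, 1.0]: same size as the argument LUT
```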
Histogram computation
In order to numerically approximate the probability distribution of a one-dimensional numerical data set, a histogram operator is of use. Let us denote by 𝒟 such a data set (most likely represented by a vector of numbers) and by B ≥ 2 the number of bins for frequency estimation. (Needless to say, data sets here are finite.) The histogram operation may be described by (x, y) = hist(𝒟, B). The constructed look-up table has B points and is built up as follows:

x_b equals the value of the bth bin center;

y_b equals the number of data points falling within the bth bin;

b ∈ {1,…, B}.

The width of each bin is Δ = (max{𝒟} − min{𝒟})/B. The bin centers are given by min{𝒟} + Δ/2 + (b − 1)Δ. The bth bin is the interval [min{𝒟} + (b − 1)Δ, min{𝒟} + bΔ), which is open on the right.
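A sketch of the “hist” operator via NumPy's histogram routine (the five-point data set below is hypothetical; note that NumPy closes the last bin, a harmless deviation from the strictly right-open bins described above):

```python
import numpy as np

def hist_lut(data, B):
    """LUT 'hist' operator: B bin centers paired with raw counts."""
    data = np.asarray(data, dtype=float)
    counts, edges = np.histogram(data, bins=B, range=(data.min(), data.max()))
    centers = (edges[:-1] + edges[1:]) / 2.0   # min + Delta/2 + (b-1)*Delta
    return centers, counts

x, y = hist_lut([0.0, 0.1, 0.2, 0.9, 1.0], B=2)
print(x, y)   # centers 0.25 and 0.75; counts 3 and 2
```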
Function inversion
If a function is given a pointwise representation with the help of a look-up table (x, y), then its inverse function may easily be given a pointwise representation by the look-up table (y, x). Therefore, function inversion in terms of look-up-table representation does not require any computation at all.
Interpolation
A limitation of LUT-based representation of functions is that the (x, y)-pairs for the represented relationship are known only on some points. Interpolation may be invoked to make computations with LUTs on other points in the domain. Interpolation may be performed in a variety of ways. In the present context, it
is necessary to preserve the monotonicity of a function, therefore we refer here to linear interpolation only. Let us denote by 𝒟 the x-coordinate point set (most likely represented by a number vector) of size D ≥ 2, where the function represented by a look-up table (x, y) needs to be
interpolated. The interpolation operation may be described by ℓ = interp(x, y, 𝒟) and works as follows: for every d ∈ 𝒟, find the index k such that x_k ≤ d ≤ x_{k+1} and compute

ℓ_d = y_k + (y_{k+1} − y_k)(d − x_k)/(x_{k+1} − x_k).

Of course, the built-up set ℓ is of size D and is to be considered an ordered set (which will likely be implemented as a vector as well). It is worth noting that the above definitions coincide as much as possible with the implementations of the corresponding operators provided by MATLAB, in order to make the computer implementation of the learning procedure as straightforward as possible. In the same spirit, in order to indicate that every entry of a vector is added with the same scalar,
we use the usual vector-to-vector addition symbol (“+”).
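As an aside, the inversion and interpolation operators can be exercised together; NumPy's interp plays the role of the linear, monotonicity-preserving interpolation described above, and the look-up table (a pointwise sampling of f(x) = x²) is hypothetical.

```python
import numpy as np

# LUT of a monotonic function: pointwise samples of f(x) = x**2 on [0, 2].
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 1.0, 4.0])

# 'interp' operator: piecewise-linear evaluation of the LUT.
ell = np.interp([0.5, 1.5], x, y)   # evaluate f at new points
# Function inversion is free: just swap the roles of x and y.
inv = np.interp([2.5], y, x)        # evaluate f^-1 via the LUT (y, x)
print(ell, inv)
```

Linear interpolation on the swapped table remains valid precisely because the represented function is monotonic, which is why the text restricts itself to linear interpolation.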
3.3. Numerical procedure for statistical modeling
The model estimation procedure is based on a neural system, described by the input-output relationship y = f(x), where x ∈ ℝ denotes the input variable, y ∈ ℝ denotes the output variable, and f(⋅) denotes a neural transfer function. The algorithm proposed in this section retraces the algorithm proposed in [16] for
random-number generation. With respect to the algorithm proposed in [16], there are some important differences to be taken into account, which stem from the different use of the
neural system made for random-number generation and for statistical modeling. In random-number generation, the input random variable is supposed to be drawn from a
prototype distribution such as the uniform or Gaussian ones, which are simple
and symmetric. The learning algorithm devised in [16] was written according to
this assumption. In statistical modeling, however, both the input and output
distributions may be arbitrarily involved, depending on the distribution of
data within the 𝒟_x and 𝒟_y sets; therefore, the learning algorithm should be generalized here accordingly. The random-number generation procedure devised in [16] consists of two separate stages: the first stage concerns neural system adaptation via a proper learning strategy; the second
stage consists in using the learnt neural system as a generative model, in
order to produce random numbers drawn from the desired statistic. In neural
modeling, the second stage is apparently unnecessary, as the target of the
method is to infer a model underlying the pair of variables. When a neural system is used for random-number generation purposes, it is truly “directed,”
in the sense that its input is supposed to be known (as it comes from an
available prototype random number generator such as uniform or Gaussian) and
its output can be computed only by passing the input through the learnt neural
system. In statistical modeling, the variables of interest do not need to
necessarily coincide with input and output variables, as we are seeking a relationship between two variables to be “discovered.” This consideration is helpful because input and output variables may be swapped in order to obtain a model of the kind x = f⁻¹(y) as well as a model of the kind y = f(x), depending on which variable is to be considered as dependent on the other. The look-up-table-based algorithm exploited here for numerical statistical modeling implements the closed-form design solution
(4) with the help of the LUT-handling operators defined in Section 3.2. First, it is necessary to estimate the probability density functions as well as the cumulative distribution functions of the data within the 𝒟_x and 𝒟_y data sets. As already mentioned, in the interpretation of statistical modeling proposed in the present paper, the size D_x of the data set 𝒟_x and the size D_y of the data set 𝒟_y do not need to
be equal. In order to perform a histogram-based estimation of the probability
density function underlying the data via the “hist” operator, it is necessary
to choose the numbers B_x and B_y of bins for the two data sets. We used the following rule of thumb to select the bin numbers:
where ⌈⋅⌉ rounds its argument to the nearest integer towards infinity. Such a choice endows the histogram-based estimation procedure with enough bins to estimate the probability density functions of the x and y data, while the number of bins remains limited. The probability density functions of the 𝒟_x and 𝒟_y data sets may thus be numerically estimated by

(x, p_x) = hist(𝒟_x, B_x),   (y, p_y) = hist(𝒟_y, B_y).
It should be noted that the vectors p_x and p_y actually represent probability distributions only up to scale factors, as they should be normalized by the total number of samples in each set and by the bin widths. The numerical cumulative distribution functions of the 𝒟_x and 𝒟_y data sets may now be estimated by numerical integration of the numerical probability density functions, which may be achieved with the help of the “cumsum” operator applied to the look-up tables (x, p_x) and (y, p_y). It should be noted, however, that two adjustments
are necessary at this point.
(i) Lifting of the 𝒟_y-set estimated distribution

Some entries of the vector p_y may be zero (and therefore some entries of its cumulative-sum vector may be equal to others). This situation violates the hypothesis that the probability density function of the y variable differs from zero in the interval of interest. Such an occurrence should be fixed in a way that does not alter the information content of the vector p_y drastically. We propose, as a remedy, to add a small quantity, namely 1/D_y, to every entry of the vector p_y. Therefore, the normalized numerical probability density function to be integrated is

p̃_y = (p_y + (1/D_y)1) / (D_y + B_y/D_y),

where 1 denotes the vector of all ones. The entries of the vector p_x do not need any value-shifting, so the corresponding normalized numerical probability density function to be integrated is

p̃_x = p_x / D_x.

It is easy to verify that the entries of the vectors p̃_x and p̃_y sum up to 1.
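The lifting step can be verified numerically in a few lines; the normalization constant below is one reading of the sum-to-one requirement stated above, and the data are synthetic:

```python
import numpy as np

D_y, B_y = 200, 15
rng = np.random.default_rng(1)
p_y, _ = np.histogram(rng.normal(size=D_y), bins=B_y)  # raw counts; zeros possible

# Lift every entry by 1/D_y, then renormalize so the entries sum to one.
lifted = (p_y + 1.0 / D_y) / (D_y + B_y / D_y)
print(lifted.min() > 0.0, np.isclose(lifted.sum(), 1.0))
```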
(ii) Shifting of cumulative distribution functions and bin centers

The “cumsum” operator provides a cumulative-sum vector whose first entry coincides with the first entry of the summed-up look-up table, while the numerical cumulative distribution function's first entry should equal zero. Also, the “hist” operator provides the bin-center coordinates, while for numerical function representation purposes the boundary values of the bins are more convenient. For these reasons, we first define the look-up tables:

(x′, s_x) = cumsum(x, p̃_x),   (y′, s_y) = cumsum(y, p̃_y),

where p̃_x and p̃_y denote the normalized numerical probability density functions of item (i). Then, the actual look-up tables representing true numerical cumulative distribution functions are defined as follows:

(u_x, v_x) = ([min{𝒟_x}; x′ + (Δ_x/2)1], [0; s_x]),   (u_y, v_y) = ([min{𝒟_y}; y′ + (Δ_y/2)1], [0; s_y]),

where the symbol [;] denotes vector concatenation, Δ_x and Δ_y denote the bin widths, and 1 denotes the vector of all ones. The last operations make the numerical cumulative distribution function LUTs (u_x, v_x), of size B_x + 1, and (u_y, v_y), of size B_y + 1, available for the computation of the nonlinear model according to (4). If the statistical model is monotonically increasing, then the look-up table (u_x, v_x) may be used as is. Otherwise, for monotonically decreasing models, it should be replaced by (u_x, 1 − v_x) in what follows. This being understood, we proceed by developing the last procedure
steps for the monotonically-increasing-model case. The nonlinear model is to be evaluated on an ordered set of x-points denoted here as 𝒳. For consistency reasons, it should hold that 𝒳 ⊆ [min{𝒟_x}, max{𝒟_x}]. Here, we propose that the ordered set 𝒳 over which the model is to be evaluated consist of R points equally spaced within the interval [min{𝒟_x}, max{𝒟_x}], where R depends on the accuracy of the interpolation required for the model f(⋅). The quantity R acts as a fineness of partition for interpolation purposes. (Of course, the points where
the model is evaluated do not need to be equally spaced if the application
that the method is devised for needs a different choice.) The last step in the
procedure is to numerically evaluate the quantity P_x(⋅) over the set 𝒳 and then to evaluate the quantity P_y⁻¹(⋅) over the values of the function P_x(⋅), namely,

𝒴 = interp(v_y, u_y, interp(u_x, v_x, 𝒳)),   (14)

where (u_x, v_x) and (u_y, v_y) denote the numerical cumulative distribution function LUTs constructed above. Note that in the right-hand side of (14), the inverse function P_y⁻¹(⋅) appears through the swapped LUT of the function P_y(⋅). The ordered sets 𝒳 and 𝒴, both of size R, give rise to the look-up table representation (𝒳, 𝒴) for the nonlinear model f(⋅) to be designed. Two comments on the developed procedure for statistical modeling are in order. A macroscopic advantage of the procedure devised here is that it does not involve any computation except for table handling (which may be efficiently implemented by sorting/searching procedures on lists of real-valued numbers) and a few simple multiplications/divisions and additions/subtractions on real-valued numbers. From the viewpoint of computer implementation, the operations on the data sets 𝒟_x and 𝒟_y may be carried out in parallel until the very last step, represented by the operations in (14). A possible MATLAB-based implementation of the devised
procedure is reported in Appendix A.
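Appendix A (not reproduced here) provides a MATLAB implementation; as an illustrative cross-check, the whole procedure can be condensed into a short Python sketch. The square-root bin rule and all helper names below are assumptions of this sketch, not necessarily the paper's rule (7):

```python
import numpy as np

def lut_model(Dx, Dy, R=100, increasing=True):
    """Illustrative sketch of the Section 3.3 procedure: estimate both CDFs
    from lifted, normalized histograms, then evaluate the model
    f = P_y^-1 o P_x (or P_y^-1 o (1 - P_x)) on R equispaced points."""
    def cdf_lut(data):
        # Square-root bin rule: an assumption standing in for rule (7).
        B = max(2, int(np.ceil(np.sqrt(len(data)))))
        p, edges = np.histogram(data, bins=B)
        p = p + 1.0 / len(data)          # lifting: no zero bins remain
        p = p / p.sum()                  # normalization: entries sum to one
        return edges, np.concatenate(([0.0], np.cumsum(p)))  # CDF starts at 0
    ux, Px = cdf_lut(np.asarray(Dx))
    uy, Py = cdf_lut(np.asarray(Dy))
    X = np.linspace(np.min(Dx), np.max(Dx), R)
    t = np.interp(X, ux, Px)             # evaluate P_x over X
    if not increasing:
        t = 1.0 - t                      # decreasing model: the c = 1 case
    Y = np.interp(t, Py, uy)             # P_y^-1 via the swapped LUT
    return X, Y

rng = np.random.default_rng(2)
X, Y = lut_model(rng.normal(size=2354), rng.exponential(size=2544))
print(np.all(np.diff(Y) >= 0.0))
```

Only histogramming, cumulative sums, and piecewise-linear interpolation are involved, which is the source of the low computational cost discussed in Section 3.4.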
3.4. Computational complexity issues
Complexity is generally an important issue for neural
systems: complex models may have overfitted parameters (they will model random
and noisy components), whereas less-complex models may be underfitted (they
will miss some key information). Neural systems are usually tuned to the data
by controlling their complexity at the same time. A simple example is the
selection of the number of hidden neurons in a multilayer feed-forward neural
network architecture. The complexity, in this case, is expressed by the number of weights between formal neural units. In the present case, the complexity of the model (only
one neural unit, but with LUT transfer function) depends upon the dimension of
the LUT, which is determined in (7). In the present paper, thus, the complexity of the
model is not subject to any learning or optimization process but it is provided
a priori. The choice of rule (7) may thus be discussed with a view to computational complexity. The bin number should be a function of the number of available samples. It is difficult to envisage any universal a priori relationship between the number of bins and
the number of available data samples. The rule (7) was selected
empirically on the basis of a trial-and-error procedure on various data sets
with different sizes (see Section 4). Depending on the model complexity, it is to be expected that the model performance may degrade. As a thought on this topic, it might be advisable to optimize the model complexity on the basis of a measure of “smoothness” of the learnt model, based, for example, on its first- and second-order derivatives [37]. Care should be taken about model complexity selection: any data-based model complexity selection procedure needs extra calculations and will inherently increase the computational burden of modeling. A depiction of the model complexity corresponding to rule
(7) versus
the cardinality of a data set is given in Figure 3. As is readily seen, when the data set size is low, the number of partition bins increases rapidly in order to follow the increasing complexity of the data. Conversely, when the number of data samples is large, the model complexity increases slowly and thus remains limited. These observations were used as a rationale for the choice (7).
Figure 3
Model complexity versus the cardinality of a data set corresponding to rule (7).
We further tested the discussed algorithm on a
simple modeling problem where the closed-form solution is known analytically
(namely, a variant of the problem discussed with details in Section 4.1.2). The
number of bins in the estimate of the required cumulative distribution functions was selected according to rule (7). The fineness of partition R was varied between 10 and 200 in steps of 10. The discrepancy between the estimated model
and the actual model was measured as a mean-squared error. Also, the total
number of flops and the CPU time required to run the whole modeling procedure
was recorded over 500 independent trials on randomly generated data sets. In
the experiment, the data sets were of different sizes, namely, D_x = 2354 samples and D_y = 2544 samples. The obtained results are gathered in Figure 4. The results show that the quality of approximation increases with the fineness of partition, and the computational burden of the algorithm increases with the fineness of partition R as well. Perhaps the most interesting result emerging from this experiment is that the number of flops as well as the CPU effort required by the algorithm to run on the platform and within the development environment used are quite low (some 10³ flops versus some milliseconds). This makes the discussed procedure attractive from a
computational viewpoint.
Figure 4
Mean-squared error and standard deviation (a)–(b) and
average flops and CPU-time (c)–(d) versus fineness of partition for
interpolation purpose on a test problem. (Averages are computed over 500
independent trials.)
4. NUMERICAL EXPERIMENTS WITH SYNTHETIC AND REAL-WORLD DATA SETS
In this section, we apply the numerical procedure of Section 3.3 to the following:

(i) a synthetic data set with density distributions known in analytic form;

(ii) a synthetic data set related to digital images for pattern recognition purposes;

(iii) sweet cherry blossom blooming data;

(iv) statistical characterization of the mechanical properties of polypropylene composites reinforced with natural fibers;

(v) blood measurement data for trend discovery in laboratory analysis.

The algorithm consists of the set of equations discussed in Section 3.3, to be applied in the same order as they appear (see also the code in Appendix A).
4.1. Experiments on synthetic data
In the present section, we consider synthetic data
sets whose density distributions are known in analytic form. In this case, the
underlying model can also be computed by the algorithm proposed in [16], against whose results ours may be compared. We also consider synthetic data sets related to pattern
recognition on black and white images.
4.1.1. Synthetic data sets with known density distributions
Data within the sets 𝒟_x and 𝒟_y were generated according to the following probability densities:
The details on the distributions, as well as the algorithm used to generate the sample sets, were given in [16]. The selected parameter values were σ = 1, μ = 0.8, λ = 0.5, σ₁ = 0.3, μ₁ = −0.5, σ₂ = 1, and μ₂ = 0.8. The generated x-samples range in [−3, 4] while the generated y-samples range in [−5/2, 4]. The generated data distribute as shown in Figure 5. The data sets 𝒟_x and 𝒟_y were generated independently of each other, and the set sizes D_x = 4771 and D_y = 4847 differ in this experiment.
Figure 5
Probability density functions of the data sets 𝒟_x and 𝒟_y. (Experiment on synthetic data sets with known density distributions.) Note that in this example D_x ≠ D_y.
The cumulative distribution functions required by the modeling procedure are depicted in Figure 6 and the obtained model is depicted in Figure 7 along with the nonlinear function f(⋅) computed with the algorithm presented in [16]. As can be readily appreciated, the numerical model obtained with the LUT-based algorithm of Section 3.3 is in excellent agreement with the nonlinear function computed on the basis of the closed-form probability density functions p_x(⋅) and p_y(⋅), except near the “borders” of the variables' domains.
Figure 6
Cumulative distribution functions of the data sets 𝒟_x and 𝒟_y required by the
modeling procedure. (Experiment on synthetic data sets with known density
distributions.)
Figure 7
Model computed by the algorithm of Section 3.3 (solid dotted line) and exact
nonlinearity computed with the algorithm presented in [16] (solid line). (Experiment
on synthetic data sets with known density distributions.)
4.1.2. Synthetic data sets of black and white images
Shape analysis within digital images is a popular
topic in pattern recognition. In order to fit the present statistical-modeling
setting to shape analysis, let us suppose a black and white image is available
and the data sets 𝒟_x and 𝒟_y are formed by pooling the (x, y) coordinates of the black dots within the image. Here we can make use of a straightforward deduction from the closed-form expression of the nonlinear model f(⋅) given by (4): if the x-data and y-data distribute approximately in the same way (namely, p_x(⋅) ≈ p_y(⋅)) and the model is selected to be monotonically increasing, then the resulting model may only be of the kind f(x) ≈ x; that is, the model f(⋅) should closely resemble a straight line with unitary slope. In the context of shape analysis, the case that the x-data and y-data distribute approximately in the same way might denote the presence of circles or squares within the image, as in the example depicted in Figure 8. The image is corrupted by 20% white Gaussian random noise with variances of the coordinates equal to (1, 4) and with null correlation between the two noise components. The open circle is centered at (0, 0) and has radius equal to 2. The filled-in square's bottom-left corner is located at (4, 5) and its diagonal is of length . The obtained numerical model is depicted in Figure 9. As is readily seen, the model exhibits two approximately straight segments with unitary slope in the intervals [−2, 2] and [4, 7], as expected, interleaved by curved segments that are due to the noise.
Figure 8
Rightmost panel: digital black and white image with an open circle and a
filled-in square included. The image is corrupted by white Gaussian random
noise. (Graphics might warp the shape of the image elements.) Cumulative
distribution functions estimates are shown on the leftmost/central panels.
Figure 9
Model
computed by the algorithm of Section 3.3. (Experiment on synthetic data sets of
black and white images.)
4.2. Experiments on “cherry blossom” data
Cherry blossom (sakura) has been celebrated for
many centuries and occupies a quite prominent position in Japanese culture,
being a well-known and ubiquitous symbol of Japan. Sakura are represented on
all manner of consumer goods, including kimono, stationery, and dishware:
cherry blossoms are an enduring metaphor for the ephemeral nature of life and,
as such, are frequently depicted in art.There exist many dozens of different cherry-tree
varieties in Japan, most of which bloom for just a couple of days during
spring. Japan's most beloved variety of cherry blossom is the Somei Yoshino (see Figure 10). Its flowers are nearly pure white, tinged with the palest
pink, especially near the stem. The flowers bloom, and usually fall within a
week, before the leaves come out. Therefore, the trees look nearly white from
top to bottom. Other categories of cherry blossom include Yamazakura, Yaezakura,
and Shidarezakura.
Figure 10
A Somei
Yoshino cherry blossom.
Japanese people are accustomed to celebrating the time of the year when cherry blossoms bloom with cherry blossom viewing parties (hanami) under the blooming trees. Most Japanese schools and public buildings have sakura trees around them. Since the fiscal and school years both begin in April, in many parts of Japan the first day of work or school may coincide
with the cherry blossom season. The available data in the present experiment are the day (counted from January 1st) on which the cherry trees of a cherry-tree field bloom and a conventional temperature (in °F) measured on that day. The recordings span the years 1935–2002. The data are shown in Figure 11. As may be readily seen (and intuitively understood), the lower the conventional temperature is, the later the cherry trees bloom. Therefore, if we identify the 𝒟_x data set with the set of yearly blooming days since January 1st and the 𝒟_y data set with the set of conventional temperatures measured on the blooming days, the model is expected to be monotonically decreasing.
Figure 11
Cherry blossom recordings. (a)–(b) Blooming day since
January 1st and conventional temperature (in Fahrenheit degrees) (1935–2002).
(c) Data scatter plot.
The cumulative distribution functions required by the
modeling procedure are depicted in Figure 12 and the obtained model is depicted
in Figure 13. The computed model fits the data in a satisfactory way.
Figure 12
Cumulative distribution functions of the data sets 𝒟_x and 𝒟_y required by the modeling procedure.
(Experiment on cherry blossom data.)
Figure 13
Model
computed by the algorithm of Section 3.3. (Experiment on cherry blossom data.
Dotted solid line: computed model. Circles: data points.)
4.3. Reinforced composites mechanical behavior modeling: motivations and problem formulation
Synthetic fibers have been extensively used in recent years as reinforcement of plastic (polymeric) matrices. Fibers are
incorporated into plastics with the aim of improving the mechanical properties
of the polymer. The interest in the use of natural vegetal fibers as
reinforcement has kept growing mainly due to their renewable origin, low price,
and favorable environmental impact (natural fibers are biodegradable and
natural-fiber-reinforced composites can be easily recycled). Therefore,
ligno-cellulosic natural fibers represent an interesting alternative as
substitutes for traditional synthetic fibers in composite materials.On the other hand, in contrast with synthetic fibers,
whose properties can be easily and univocally determined, natural fibers are
characterized by a large dispersion of their characteristics. As a matter of
fact, the behavior and the properties of natural fibers depend on many factors
such as harvest period, weather variability, quality of soil, and climate of
the specific geographic location. The variability of the properties of natural
fibers is so high that it can also be observed among fibers belonging to the
same plant. Such variability makes it difficult to predict the mechanical
features of the resulting composite and makes it necessary to employ a statistical approach to model them. In order to describe the characteristics of
natural-fiber-reinforced composites, in [17] the use of statistical description of the quantities
of interest was proposed. In particular, the following scheme was adopted. (1) Marginal statistical characterization of free
fibers' geometrical/mechanical features: the geometrical and mechanical
features of interest of the natural fibers used to reinforce the composite may
be characterized before production. In particular, the diameter (d) of any single fiber as well as the Young elasticity modulus (tensile strength, E) and the resistance (strain strength, S) are measured.
The sets of these values are statistically characterized, namely, the probability density functions p_d(⋅), p_E(⋅), and p_S(⋅) are estimated, in order to provide a statistical description of the geometrical/mechanical features of these fibers. (2) Adoption of a free-fiber mechanical model: the mentioned geometrical/mechanical quantities, referred to the same set of fibers, are not independent. For instance, it is easily predictable that the elasticity of a free fiber depends on its diameter. In order to correlate these quantities, a popular model found in the literature is usually adopted. In particular, the Griffith model E(d) = g₁ + g₂/d is often employed because of its simple structure and because it embodies the fundamental intuition that the elasticity of a fiber is inversely proportional to its diameter. In [17], the Griffith parameters g₁ and g₂ were computed by fitting the model to the available measurements by means of a standard least-squares procedure. (3) Composite production and in-slab fibers measurement: the measured fibers were then employed to yield a
reinforced composite matrix. The production process breaks and warps the fibers within the composite: their mechanical features change but still explain the mechanical properties of the resulting composite. In order to characterize these
properties it is therefore necessary to examine the geometry of the fibers
within the composite (which appears as a transparent matter after fusion and
leaves the fibers visible, see, e.g., Figure 14): the length and diameter of
the residual fibers within the composites are measured. It is important to note
that, at this stage, just the geometric features of the residual fibers may be
actually measured via visual inspection, while the mechanical properties of the
residual fibers cannot be measured directly. Such mechanical properties are therefore
to be predicted through the use of the fitted Griffith model applied to the
diameter values of the fibers within the composite. As explained later, this
choice results to be only partially sensible.
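Since the Griffith model is linear in its parameters once x = 1/d is taken as the regressor, the standard least-squares fit mentioned in step (2) admits a closed-form solution. The following sketch illustrates this on made-up diameter/modulus pairs, not on the measured data of [17]:

```python
# Least-squares fit of the Griffith model E(d) = g_1 + g_2/d.
# With the substitution x = 1/d the model is an ordinary straight line
# in x, so the usual closed-form simple-regression formulas apply.

def fit_griffith(diameters, moduli):
    """Return (g1, g2) minimizing sum_k (E_k - g1 - g2/d_k)^2."""
    xs = [1.0 / d for d in diameters]          # regressor x = 1/d
    n = len(xs)
    mx = sum(xs) / n
    my = sum(moduli) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, moduli))
    g2 = sxy / sxx                             # slope on 1/d
    g1 = my - g2 * mx                          # intercept
    return g1, g2

def griffith(d, g1, g2):
    """Evaluate the fitted Griffith model at diameter d."""
    return g1 + g2 / d

# Hypothetical paired measurements (d in micrometers, E in MPa),
# generated here to satisfy E = 5 + 600/d exactly:
d_data = [10.0, 20.0, 30.0, 60.0]
e_data = [65.0, 35.0, 25.0, 15.0]

g1, g2 = fit_griffith(d_data, e_data)
```

Note that this fit presupposes coupled pairs (d_k, E_k); as discussed below, that assumption is exactly what fails in the present application.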
Figure 14
A photograph of a matrix composite with flax fibers clearly visible embedded
within it. (Experiment on polypropylene composite data.)
(4) Modeling of the composite's mechanical properties:
the statistical characterization of the mechanical properties of the composite
slab may now be predicted through the Halpin-Tsai model [17] which, on the basis of the
modal value of the free-fiber geometrical/mechanical features, of the set
of length/diameter values of the residual fibers within the composite, and of
a few other production data, furnishes a series of mechanical property values,
namely, E and S, of the composite. This is true under the hypothesis
that the composite features may be computed as the weighted distribution of the
mechanical values pertaining to homogeneous fibers. Namely, the Halpin-Tsai
model, in conjunction with a proper PDF-estimation technique, is used to estimate
the quantities

p_E(E) = ∫_𝒟 ∫_ℒ p_{E|d,l}(E | d, l) p_{d,l}(d, l) dl dd,
p_S(S) = ∫_𝒟 ∫_ℒ p_{S|d,l}(S | d, l) p_{d,l}(d, l) dl dd,

where p_E(⋅) and p_S(⋅) denote the distributions of the two mechanical properties of the composite, p_{E|d,l}(⋅ | ⋅, ⋅) and p_{S|d,l}(⋅ | ⋅, ⋅) denote the effect of the embedded fibers' geometry on the composite's mechanical characteristics, and p_{d,l}(⋅, ⋅) denotes the weighting geometric factor. The intervals 𝒟 and ℒ denote the value ranges of the diameters and lengths of the fibers within the composite.
(5) Marginal statistical characterization of the composite's mechanical features: a statistical characterization of the distribution of the values of E and S returns the modal values of these quantities as a final result. These predicted values may finally be compared to the
measured values, obtained after mechanical tensile/strain characterization of the composite slab, in order to validate the quality of the prediction. Such comparison showed that the results obtained by this procedure are more
accurate than the results obtained without passing through the statistical characterization of the single-fiber measures. However, the analysis also showed that the Griffith model, computed on the basis of the geometrical features of
the free fibers and employed to predict the mechanical features of the fibers embedded in the polypropylene composite from their geometrical features, is not completely reliable. The unreliability of the Griffith model in the present application may be explained by the fact that it was originally conceived for
synthetic (glass) fibers and does not take into account the different mechanical
behavior of the original fibers during composite production. Also, for large
diameter values, the Griffith model may lose physical meaning because it may
return negative values for the tensile strength. We may consider replacing the
two-parameter Griffith model with the LUT-based neural model of Section 3. The
motivations for this replacement are the following.
As the neural model has a free (monotonic) shape, no prior knowledge that might bias the solution
is required or intentionally embedded into the resulting model, except for its
monotonicity. It is worth noting that, according to basic
mechanical considerations and to the Griffith model itself, the relationship E = f(d) should be monotonically decreasing.
The standard least-squares fitting used to optimize the Griffith model (or any other
parametric model) works under the hypothesis that the measured values of E and d are available
as coupled pairs, namely, in the form (d_k, E_k), with k ranging between 1 and the total number of available measures. This might not, however, be the case: it is entirely
possible that during the measurement session a nonnegligible fraction of free
fibers cannot be fully characterized in diameter, tensile strength, and strain strength
(in particular, the last feature is rather difficult to measure due to its
inherently destructive character). Therefore, standard fitting-based modeling
does not appear to be appropriate here.
Concerning the measurement difficulties further, the measures obtained in this kind of
characterization are always somewhat unreliable, for instance, because of the
tendency of the fibers to form agglomerates and to fray, which makes it
difficult to obtain accurate measurements of their geometrical features.
Statistical characterization inherently provides a way to correctly weight and interpret
the available results.
In the present contribution, we focus on the tensile
strength prediction of reinforced composites; thus we focus primarily on
modeling the relationship E = f(d) for the free fibers. As mentioned, a set of measures of the diameter values and a set of fibers' elasticity modulus values are available for a bunch of free fibers. After the statistical distributions p_d(⋅) and p_E(⋅) are estimated, we aim at finding a model E = f(d) that transforms the probability density of the diameters into the probability density of the elastic moduli. Our basic hypothesis is that the same model also explains the
relationship between the diameters and the moduli themselves. As mentioned, no prior knowledge is required for modeling. The available data in the present experiment are the measured diameters expressed in micrometers (μm), which we identify with the data set 𝒟_x, and
the elastic moduli expressed in megapascals (MPa), which we identify with the data set 𝒟_y. The
data are shown in Figure 15. The estimated cumulative distribution functions
required by the modeling procedure are depicted in Figure 16 and the obtained
model is depicted in Figure 17. The computed model fits the data in quite a
satisfactory way. Also, comparing the model obtained with the proposed
algorithm against the model obtained in the previous work [17], it can be concluded that
the new model adheres more closely to the actual measurements.
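The distribution-matching principle behind such a model can be sketched in a few lines: with two unpaired pooled samples of possibly different sizes, a monotonically decreasing model E = f(d) is obtained by composing the empirical CDF of the diameters with the complementary empirical quantile function of the moduli, f(d) = Q_E(1 − F_d(d)). The sketch below uses synthetic numbers and plain linear interpolation; it only illustrates the principle, not the actual LUT-based algorithm of Section 3.3:

```python
# Distribution-matching sketch: build a monotonically decreasing model
# E = f(d) from two UNPAIRED pooled samples of different sizes.

from bisect import bisect_right

def ecdf(sample):
    """Return the empirical CDF of a sample as a callable F(x)."""
    s = sorted(sample)
    n = len(s)
    def F(x):
        return bisect_right(s, x) / n
    return F

def quantile(sample, q):
    """Empirical quantile with linear interpolation, q in [0, 1]."""
    s = sorted(sample)
    if q <= 0.0:
        return s[0]
    if q >= 1.0:
        return s[-1]
    pos = q * (len(s) - 1)
    i, frac = int(pos), pos - int(pos)
    return s[i] + frac * (s[i + 1] - s[i])

def decreasing_model(d_sample, e_sample):
    """Model f(d) = Q_E(1 - F_d(d)): large diameters map to low moduli."""
    F_d = ecdf(d_sample)
    def f(d):
        return quantile(e_sample, 1.0 - F_d(d))
    return f

# Unpaired pooled samples of different sizes (synthetic numbers):
d_sample = [10, 15, 20, 30, 50, 80]       # diameters
e_sample = [12, 18, 25, 33, 47, 60, 75]   # elastic moduli
f = decreasing_model(d_sample, e_sample)
```

No pairing between individual diameter and modulus values is ever used, which is exactly why unequal sample sizes are tolerated.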
Figure 15
Polypropylene composite recordings. (a)–(b) Diameters (μm)
and elastic moduli (MPa). (c) Data scatter plot.
Figure 16
Cumulative distribution functions of the data sets 𝒟_x and 𝒟_y required by the
modeling procedure. (Experiment on polypropylene composite data.)
Figure 17
Model
computed by the algorithm of Section 3.3. (Experiment on polypropylene
composite data.)
4.4. Blood measurements data sets
The 1982 annual meeting of the American Statistical
Association (ASA) was held during August in Cincinnati (OH, USA). At that
meeting, the ASA Committee on Statistical Graphics sponsored an “exposition of
statistical graphics technology.” The purpose of this activity was to more
fully inform the ASA members about the capabilities of computer graphics in
statistical work.As a common database that the statistical analysis
methods were to be tested on, a set of biomedical data was made available,
which arose in a study to develop screening methods to identify carriers of a
rare genetic disorder. Four measurements, m_1, m_2, m_3, and m_4, on blood samples were made available.
Because the disease is rare, there are only a few carriers of the disease from whom data are available. The
data came in two files, one for carriers and one for noncarriers of the
disease. The data were stripped of the patients' names and other
identifiers; otherwise, the data were made available as received by the analyst
[45]. Each vendor or
provider of statistical graphics software participating in the exposition
analyzed these data using their software and prepared tabular, graphical, and
text output illustrating the use of graphics in such analysis and summarizing
their conclusions: the purpose of the analysis was to develop a screening
procedure to detect carriers and to describe its effectiveness. In addition
to the data themselves, the following warning was provided: “experts
in the field have noted that young people tend to have higher measurements.”
We take as data set 𝒟_x the pooled set of patients' ages (expressed in years) and as data set 𝒟_y the pooled set of m_4 measures (unknown measurement unit). The m_4 measures contain “holes”; therefore, the sizes of the data sets 𝒟_x and 𝒟_y differ. The data are shown in Figure 18.
Figure 18
Blood measurements data sets: data set 𝒟_x is the pooled
set of patients' ages (years); data set 𝒟_y is the pooled
set of m_4 measures (unknown measurement unit). The m_4 data set
contains missing measures; therefore, the sizes of the data sets 𝒟_x and 𝒟_y differ.
The purpose of the modeling effort proposed here is
to quantify the trend of the m_4 measure versus patient age, under the hypothesis that such dependency is decreasing, as suggested by the analysts. The estimated cumulative distribution functions required
by the modeling procedure are depicted in Figure 19 and the obtained model is
depicted in Figure 20. The result of modeling seems to evidence a quasilinear
decreasing trend of the m_4 value versus age, in the range [22, 38] years.
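The way the "holes" are tolerated can be made concrete: missing entries are simply dropped before pooling, which merely shrinks one sample, and no pairing with age is ever required since only the two marginal distributions are matched. A minimal sketch, with hypothetical numbers (the true m_4 values and their unit are unknown):

```python
# Pooling with missing data: "holes" (None entries) are dropped, so the
# two pooled samples may end up with different sizes, which the
# pooled-statistic matching approach accepts by construction.

def pool(values):
    """Drop missing entries (None) and return the usable pooled sample."""
    return [v for v in values if v is not None]

# Hypothetical records; the m_4 list contains holes:
ages = [22, 25, 27, 30, 31, 34, 36, 38]            # years
m4   = [6.1, None, 5.4, 4.9, None, 4.1, 3.6, 3.0]  # unknown unit

age_sample = pool(ages)   # 8 usable values
m4_sample  = pool(m4)     # 6 usable values: sizes differ, by design
```

The decreasing model of the m_4 measure versus age is then built from these two pooled samples by the matching procedure of Section 3.3.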
Figure 19
Cumulative distribution functions of the data sets 𝒟_x and 𝒟_y required by the modeling procedure. (Experiment on blood measurements data.)
Figure 20
Model computed by the algorithm of Section 3.3. (Experiment on blood measurements
data.)
5. CONCLUSION
In the present paper, we discussed the problem of
bivariate statistical modeling via neural (nonlinear) systems that are able to
match their input-output statistic to the statistic of the data sets whose
mutual relationship is sought.
We proceeded under the missing-at-random assumption
for the incomplete data and considered bivariate statistical modeling via a
method that partially resembles a regression method. Statistical
modeling was given here an interpretation as a pooled-statistic matching problem
for a nonlinear system. Namely, instead of considering the variables x ∈ 𝒟_x and y ∈ 𝒟_y as paired and looking for a nonlinear model that best fits the variables' values, we consider only the cumulative information that arises by pooling the values within
the data sets 𝒟_x and 𝒟_y. The neural system's nonlinear transference thus matches the probability distributions of the variables, rather than the
variables' values themselves. As a consequence, the proposed statistical
modeling technique makes it possible to cope with the modeling problem when the sizes of
the two data sets do not match and/or when the pairing relationship of the
values within 𝒟_x and 𝒟_y is unknown. Also, as the individual data in the sets 𝒟_x and 𝒟_y are not accessed directly by the modeling procedure, even if the data sets of the phenomena under modeling are not available but only their pooled cumulative statistical distributions are, the modeling process may take place anyway. The resulting nonlinear model fits the data in a probability-density-function
transformation sense.
A key point of the method is that the quantities of interest as well as the designed model were proposed to be represented in terms of paired lists of real numbers termed LUTs, which were shown to provide an
efficient way of representing and handling the variables appearing within the
devised statistical modeling algorithm. A prominent advantage of the procedure
is the lack of hard computational requirements except for LUT handling (which
consists in sorting/searching on lists of numbers) and a few simple algebraic
operations on numbers. It was underlined that an inherent restriction of the
method is that the developed theory requires the model to be monotonic. The
model shape is otherwise unrestricted, being thus free of any other shape
constraint and of the biasing effects inherently tied to other
modeling methods.
In order to assess the numerical statistical modeling
method proposed in the present work, numerical experiments were conducted on
synthetic as well as real-world data sets, of both matching and unmatching
sizes. The proposed algorithm was applied to modeling a synthetic data
set with density distributions known in analytic form and a synthetic data set
related to digital images for pattern recognition purposes. Also, it was applied
to the modeling of sweet cherry blossom blooming data, to the statistical
characterization of the mechanical properties of polypropylene composites
reinforced with natural fibers, as well as to the modeling of blood measurement
data for trend discovery in laboratory analysis. The results of the numerical
experiments showed that the computed models fit the data in a satisfactory way
and are otherwise sensible and reasonable.
Two technical aspects of the discussed method may
raise some concerns, namely, the following.
Intrinsic discretization of the input/output function when look-up tables are used:
while the use of LUTs makes the handling of the algorithm easy, the intrinsic
discretization causes errors with respect to the continuous case. In
particular, the use of interpolation as a way to obtain better solutions might raise
some concerns about the accuracy of the function representation. The present
author has investigated in a previous paper the effect of changes in bin size/number,
for example, in the context of estimating the entropy of a random variable
when only its samples are available [46]. This problem is also discussed in the book [47].
Scalability to high-dimensional data: it would be interesting to consider the
extension of the discussed bivariate statistical modeling procedure to
multivariate statistical modeling, namely, to the case in which a many-to-one
variable model is needed. Such an extension will need a theoretical basis to
be developed as well as a set of convenient numerical representation/handling
procedures to be devised subsequently.
We are currently seeking an extension of the discussed isotonic bivariate
statistical modeling approach to a many-to-one modeling procedure. The main
challenge here is how to formulate the multivariate statistical modeling by
statistic matching. A possible solution under investigation seems to involve
partial differential equations on the modeling function based on joint
probability density functions of the “independent” variables and of the
“dependent” variable. From the technical/implementation side, the extension
of traditional (2-variable) LUTs to multivariable LUTs looks straightforward.
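As an illustration of that last point, a 2-input LUT may be represented as a rectangular grid of nodes with bilinear interpolation between them. The sketch below shows only this data structure, under the assumption of a regular rectangular grid; it says nothing about the still-open multivariate matching theory:

```python
# A 2-input LUT: node coordinates xs, ys and a table of node values,
# with bilinear interpolation between nodes and clamping at the borders.

from bisect import bisect_right

def bracket(grid, v):
    """Index i with grid[i] <= v <= grid[i+1], clamped to the grid."""
    i = bisect_right(grid, v) - 1
    return max(0, min(i, len(grid) - 2))

def lut2_eval(xs, ys, table, x, y):
    """Bilinear interpolation on a 2D LUT where table[i][j] = f(xs[i], ys[j])."""
    i, j = bracket(xs, x), bracket(ys, y)
    tx = (x - xs[i]) / (xs[i + 1] - xs[i])   # fractional position in cell
    ty = (y - ys[j]) / (ys[j + 1] - ys[j])
    return ((1 - tx) * (1 - ty) * table[i][j]
            + tx * (1 - ty) * table[i + 1][j]
            + (1 - tx) * ty * table[i][j + 1]
            + tx * ty * table[i + 1][j + 1])

# Example grid tabulating the plane f(x, y) = x + 2y:
xs = [0.0, 1.0, 2.0]
ys = [0.0, 1.0]
table = [[0.0, 2.0],
         [1.0, 3.0],
         [2.0, 4.0]]
```

Since the table is a plain nested list and lookup is a pair of binary searches, this keeps the low computational cost that the 2-variable LUTs already enjoy.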