
Analytical fuzzy approach to biological data analysis.

Weiping Zhang¹, Jingzhi Yang², Yanling Fang³, Huanyu Chen³, Yihua Mao⁴, Mohit Kumar³.

Abstract

The assessment of the physiological state of an individual requires an objective evaluation of biological data while taking into account both measurement noise and uncertainties arising from individual factors. We propose representing multi-dimensional medical data by means of an optimal fuzzy membership function. A carefully designed data model is introduced in a completely deterministic framework where uncertain variables are characterized by fuzzy membership functions. The study derives the analytical expressions of fuzzy membership functions on variables of the multivariate data model by maximizing the over-uncertainties-averaged-log-membership values of data samples around an initial guess. The analytical solution lends itself to a practical modeling algorithm that facilitates data classification. Experiments performed on the heartbeat interval data of 20 subjects verified that the proposed method is a competitive alternative to typically used pattern recognition and machine learning algorithms.


Keywords: Fuzzy membership functions; Modeling; Variational optimization

Year: 2017 PMID: 28386181 PMCID: PMC5372457 DOI: 10.1016/j.sjbs.2017.01.027

Source DB: PubMed Journal: Saudi J Biol Sci ISSN: 2213-7106 Impact factor: 4.219

Introduction

Data mining is an increasingly active area of research, driven by the abundance of data made available by modern information technology. Data mining techniques such as classification and clustering play a vital role in the development of medical decision support systems that contribute to improved healthcare quality. Medical decision-making problems inherently involve complexities and uncertainties, and researchers have therefore advocated the integration of fuzzy methodologies in medical data interpretation. The two most important features of fuzzy methodologies are the handling of uncertainties by capturing knowledge with fuzzy sets and rules, and the interpretability offered by simple linguistic if-then rules. Fuzzy approaches are commonly applied to medical data classification problems (Fan et al., 2011, Gadaras and Mikhailov, 2009, Nguyen et al., 2015, Papageorgiou, 2011, Seera and Lim, 2014). The mathematical analysis of biomedical signals is performed to construct models that identify the mappings between signal features and the patient’s state. This relationship is affected by uncertainties arising from individual factors (e.g. related to body conditions) that cannot be taken into account mathematically. Fuzzy filters have previously been proposed to alleviate the effect of uncertainties on medical data analysis (Kumar et al., 2007a, Kumar et al., 2007b, Kumar et al., 2008, Kumar et al., 2010a, Kumar et al., 2010b), wherein robust estimation algorithms are applied to design a fuzzy model that identifies the functional relation between physiological parameters and subjective rating scores.
Stochastic fuzzy modeling and analysis techniques have also been introduced to combine the advantages of Bayesian analysis and fuzzy theory for the mathematical handling of uncertainties in biomedical signal analysis (Kumar et al., 2010a, Kumar et al., 2010b, Kumar et al., 2012a, Kumar et al., 2012b). A recent work (Kumar et al., 2016a, Kumar et al., 2016b) rigorously introduced a stochastic framework for robust fuzzy filtering and analysis of signals. Although the modeling and analysis framework introduced by Kumar et al. (2016a, 2016b) is general and rests on strong mathematical foundations, it considers only signals and thus cannot be directly applied to non-signal multivariate data samples. There remains a need for automated design methods that fully exploit the uncertainty-handling capabilities of fuzzy systems. The approaches typically used to design fuzzy sets and systems include evolutionary algorithms (Alcala et al., 2009, Antonelli et al., 2012, Cococcioni et al., 2011, Gacto et al., 2010, Pulkkinen and Koivisto, 2010, Robles et al., 2009), data clustering (Celikyilmaz and Turksen, 2008, Chen and Chen, 2007, Liao et al., 2003, Oh et al., 2003), adaptive filtering (Aliasghary and Arghavani, 2012, Kumar et al., 2006, Kumar et al., 2009a, Kumar et al., 2009b, Mottaghi-Kashtiban et al., 2008, Simon, 2005), and information-theoretic concepts (Aliasghary and Arghavani, 2012, Au et al., 2006, Makrehchi et al., 2003). The determination of fuzzy membership functions remains a challenge because, owing to the nonlinearity of the problem, membership functions cannot be optimized analytically. Thus, most design methods for fuzzy membership functions lack a mathematical theory and are based on numerical algorithms that may be slow and inexact. Recently, Kumar et al. (2016a, 2016b) introduced an analytical approach for the determination of fuzzy membership functions using the variational optimization method.
The analytical approach of Kumar et al. (2016a, 2016b) allows the given modeling scenario to be incorporated mathematically into the design problem for fuzzy membership functions, and can thus potentially be extended to the medical data modeling scenario. The authors observe that the application of the fuzzy paradigm in medicine, despite being an extensively studied area, does not provide a rigorous, analytically derived methodology for interpreting medical data while mathematically taking into account both measurement noise and individuality. Medical data are multi-dimensional, and their good representation by means of fuzzy membership functions is the aim of the mathematical theory presented in this study. This text introduces a data model that takes into account both measurement noise and uncertainties arising from individuality-related factors. A multivariate data sample, represented as $y = [y_1 \cdots y_P]^T \in \mathbb{R}^P$, is assumed to be generated by the uncertain signal model displayed in Fig. 1. In this model, a scalar y is the observed value of an unknown scalar m affected by measurement noise v and uncertainty u; the uncertainty u (equal to the dot product of $G_j \in \mathbb{R}^K$ and $\alpha \in \mathbb{R}^K$) is generated by a linear combination of K different sources $(\alpha_1, \cdots, \alpha_K)$. The jth element of y is thus generated as

$$y_j = m_j + u_j + v_j,$$

where $v_j$ is the measurement noise, $u_j$ is the uncertainty affecting the model, and $m_j$ is an unobserved scalar variable. The uncertainties are assumed to be generated by linearly transforming a K-dimensional (K ⩽ P) vector as follows:

$$[u_1 \cdots u_P]^T = G\alpha, \qquad G \in \mathbb{R}^{P \times K},\ \alpha \in \mathbb{R}^K.$$

Figure 1

An uncertain signal model for a scalar y.
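The signal model of Fig. 1 can be illustrated with a short simulation. This is only a sketch under stated assumptions: the additive form y_j = m_j + G_j·α + v_j is inferred from the description, and the random draws for α and v are used purely to generate example numbers (the study itself treats these variables as fuzzy, not random). The function name `simulate_sample` and the parameter `noise_std` are hypothetical.

```python
import random

def simulate_sample(m, G, noise_std=0.1, rng=random):
    """Generate one P-dimensional sample y from the assumed additive model.

    m : list of P unobserved values m_j
    G : P x K list of lists; row j holds the coefficients G_j
    noise_std : spread of the illustrative measurement noise v_j (assumption)
    """
    K = len(G[0])
    # K shared uncertainty sources alpha_1..alpha_K (illustrative random draws)
    alpha = [rng.gauss(0.0, 1.0) for _ in range(K)]
    y = []
    for m_j, G_j in zip(m, G):
        u_j = sum(g * a for g, a in zip(G_j, alpha))  # u_j = G_j . alpha
        v_j = rng.gauss(0.0, noise_std)               # measurement noise
        y.append(m_j + u_j + v_j)
    return y

# Example: P = 3 observed dimensions driven by K = 2 uncertainty sources.
sample = simulate_sample([0.8, 0.9, 1.0],
                         [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                         noise_std=0.05, rng=random.Random(0))
```

Because all P elements share the same K sources, the elements of a sample are correlated through G, which is what lets K ⩽ P sources describe P-dimensional variability.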

Defining $G_j \in \mathbb{R}^K$ as the jth row of the transformation matrix (written as a column vector), $u_j$ can be expressed as the dot product of $G_j$ and $\alpha$, i.e., $u_j = G_j^T \alpha$. Our approach consists of (i) treating all the variables appearing in the uncertain signal model of Fig. 1 as uncertain, characterized by fuzzy membership functions; (ii) assuming that medical data, under the given status of a patient, are generated by a finite mixture of uncertain signal models of the type shown in Fig. 1; and (iii) determining the fuzzy membership functions on the variables analytically, with the help of experimentally measured data samples, using variational optimization (Kumar et al., 2016a, Kumar et al., 2016b). The approach results in a tractable solution for modeling multivariate data samples by means of fuzzy membership functions, so that medical decision support systems can be built on top of the data models. The modeling of data using a finite mixture of signal models of the type of Fig. 1 is typically considered in a stochastic setting, where variables are assumed to be random (i.e. characterized by probability distribution functions) and a Bayesian framework is commonly used for the inference of posterior distributions. The originality of this study lies in solving the modeling problem in a completely deterministic framework where fuzzy membership functions are defined over variables to characterize uncertainties about their values. The optimal shapes of the fuzzy membership functions are determined by analytically maximizing the “over uncertainties averaged log membership” values of data samples around an initial guess. The maximization problem is solved analytically using variational optimization, as suggested initially in Kumar et al. (2016a, 2016b). The contribution of this study is to derive the analytical expressions of fuzzy membership functions on variables of the multivariate data model, leading to the development of a classification algorithm.
It is demonstrated through experimental data that our approach is a competitive alternative to typically used classification algorithms, including “k-nearest neighbors”, “support vector machines”, “decision tree”, “random forest”, “AdaBoost”, “Gaussian naive Bayes”, “linear discriminant analysis”, and “quadratic discriminant analysis”. The better classification performance of our approach is attributed to the efficient modeling of the data distribution in multi-parametric space. The significance of this work is that the analytically derived expressions for fuzzy membership functions representing uncertainties associated with medical data would facilitate a system-theoretic approach to mathematically designing medical expert systems. Unlike the typically used ad hoc numerical algorithms, this would provide researchers with a mathematical theory for applying fuzzy membership functions in medicine. The remainder of this text is organized as follows. Section 2 introduces an uncertain model of multivariate data, and Section 3 provides an analytical solution for optimizing the data model. A practical algorithm for modeling multivariate data samples, based on the derived analytical solution, is stated in Section 4. Section 5 applies the proposed approach to the experimental heartbeat interval data of 20 subjects, followed by concluding remarks in Section 6.

An uncertain model of multivariate data

By an uncertain model, it is meant that the system variables are characterized by fuzzy membership functions. Although a wide range of fuzzy membership function types is available, only the following two types are chosen to model the variables, to keep the analysis in its most basic form:

Gaussian membership function (Kumar et al., 2016a, Kumar et al., 2016b)

The Gaussian membership function on a vector x ∈ Rn, with mean equal to mx and precision equal to Λx, is defined as
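A minimal sketch of such a function is given below, assuming the usual unnormalized Gaussian form with unit peak value, exp(−½ (x − m_x)ᵀ Λ_x (x − m_x)). That form is an assumption consistent with the stated mean and precision, not an expression quoted from this text; the name `gaussian_membership` and the list-of-lists matrix representation are illustrative.

```python
import math

def gaussian_membership(x, mean, precision):
    """Membership of vector x under an assumed unit-peak Gaussian form.

    x, mean   : lists of length n
    precision : n x n list of lists (the precision matrix)
    """
    d = [xi - mi for xi, mi in zip(x, mean)]
    n = len(d)
    # Quadratic form (x - mean)^T * precision * (x - mean)
    quad = sum(d[i] * precision[i][j] * d[j]
               for i in range(n) for j in range(n))
    return math.exp(-0.5 * quad)

identity = [[1.0, 0.0], [0.0, 1.0]]
at_mean = gaussian_membership([0.0, 0.0], [0.0, 0.0], identity)    # equals 1.0
off_mean = gaussian_membership([1.0, 0.0], [0.0, 0.0], identity)   # exp(-0.5)
```

Unlike a probability density, this function is not normalized to integrate to one; its maximum value, attained at the mean, is one.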

Gamma membership function (Kumar et al., 2016a, Kumar et al., 2016b)

The Gamma membership function on a non-negative scalar z can be defined as

A few examples of this type of membership function for different values of a and b are provided in Fig. 2. The parameter a is referred to as the shape parameter and b as the rate parameter (i.e. the reciprocal of the scale parameter). The peak of the membership function is located at (a − 1)/b. The skewness of the membership function decreases as the value of a increases. The Gamma membership function can alternatively be represented as

Figure 2

A few examples of Gamma membership functions (Kumar et al., 2016a, Kumar et al., 2016b).

The relations between the parameters of the two forms of Gamma membership functions are as follows:

All of the variables appearing in Fig. 1 are each carefully assigned either a Gaussian or a Gamma membership function in Definitions 3 to 8.
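The Gamma membership function described above can be sketched as follows. The sketch assumes the Gamma shape z^(a−1)·exp(−bz) rescaled so that its maximum, at z = (a − 1)/b, equals one; this rescaling matches the stated peak location, but the exact expression is an assumption, as are the name `gamma_membership` and the restriction to a > 1.

```python
import math

def gamma_membership(z, a, b):
    """Assumed unit-peak Gamma membership for a non-negative scalar z.

    a : shape parameter (this sketch requires a > 1 so the peak is interior)
    b : rate parameter (b > 0)
    """
    if a <= 1 or b <= 0:
        raise ValueError("this sketch requires a > 1 and b > 0")
    if z <= 0:
        return 0.0
    peak = (a - 1) / b
    # log of z^(a-1) * exp(-b z), minus its value at the peak
    log_mu = (a - 1) * (math.log(z) - math.log(peak)) - b * (z - peak)
    return math.exp(log_mu)

peak_value = gamma_membership(2.0, a=3.0, b=1.0)   # z at the peak (a - 1)/b = 2
side_value = gamma_membership(1.0, a=3.0, b=1.0)   # left of the peak
```

Working in the log domain avoids overflow for large a, which is a design choice of this sketch rather than something prescribed by the text.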

Fuzzy membership function on v

The fuzzy membership function on v ∈ R is defined as zero-mean Gaussian with scaled precisions as

where is the precision scaled by . The uncertainties of and are characterized by the following Gamma membership functions:

Here, , and are uncertain as well, characterized by the following Gamma membership functions:

Fuzzy membership function on y

The fuzzy membership function on , for a given , is defined as The membership function on is derived by replacing in (1) by . The multivariate fuzzy membership function on , for a given , is defined as the product of its individual elements’ membership functions as

Fuzzy membership function on m

The multivariate fuzzy membership function on m = is defined as Gaussian as

Fuzzy membership function on

The multivariate fuzzy membership function on is defined as zero-mean Gaussian with precision equal to the identity matrix as

The multivariate fuzzy membership function on is defined as zero-mean Gaussian as

where is the precision of the kth element of and is uncertain, characterized by the following Gamma membership function:

To model multivariate data samples distributed arbitrarily in the P-dimensional data space, a mixture of a finite number of uncertain signal models is considered in Definition 9.

Fuzzy membership of y as a finite mixture of uncertain signal models

The fuzzy membership function on , for a given , is defined as a mixture of different uncertain signal models as

where is the mixing proportion of the ith uncertain signal model with , and is a set of parameters defined as

where
is uncertain, characterized by the following Gaussian membership function:
is uncertain, characterized by the following Gaussian membership function:
is uncertain, characterized by the following Gamma membership function:
is uncertain, characterized by the following Gaussian membership function:
is an uncertain scalar, characterized by the following Gamma membership function:
is uncertain, characterized by the following Gamma membership function:
is uncertain, characterized by the following Gamma membership function:
is an uncertain scalar, characterized by the following Gamma membership function:

Analytical optimization of mixture of uncertain signal models

Given N data samples, , the aim is to define the multivariate fuzzy membership function on y in an “optimal” manner. The approach is to optimize the fuzzy membership function (defined on y by Definition 1) with respect to while taking into account the uncertainties of the parameters represented by the set Ω. To do so, the “optimal” membership functions on the parameters must first be determined. For this, assume that , , , , , , , and are arbitrary fuzzy membership functions on , , , ,, , and respectively. Define a function, q(Ω), as follows

Define a differential functional, , as follows

Define a differential functional, , as follows

The optimization process maximizes an objective functional, , defined as

is maximized with respect to , , , , , , , and and under the following constraints:

Fixed Integral Constraints on Membership Functions:

Unity Maximum Value Constraints on Membership Functions: The values of , and are chosen such that the maximum value of , , , , , , , and is equal to one.

Unity Sum Constraint on Mixing Proportions: .

The first term of computes the averaged log-membership value of the data samples, where the average is taken over the uncertain parameters Ω modeled by the membership function . The second term of regularizes the maximization problem toward the initial guess . The third term of regularizes the estimation of toward the initial guess .
The analytical expressions for the variational membership functions that maximize under the Fixed Integral and Unity Maximum Value Constraints are

Once the membership functions representing the uncertainties on the parameters have been optimally determined, the optimal multivariate fuzzy membership function on y = [y1 ⋯ yP] ∈ RP is defined by averaging over the uncertainties such that

where

After evaluating the integral, , the expression of the optimal membership function on y is as follows:

Finally, the constant of proportionality is chosen equal to one, resulting in

An Algorithm for multivariate data modeling

Algorithm

The analytical solution to the mixture of uncertain signal models, derived in Section 3, lends itself to Algorithm 1 for the modeling of multivariate data samples by determining membership functions on all of the variables and parameters. Algorithm 1 chooses the initial values of the parameters based on k-means clustering and eigenvalue decomposition of the sample covariance matrix.

Remark 1 (Complexity and Iterations) Algorithm 1 is based on invoking the parameter update rules (3–20). The time complexity of the algorithm, as a result of computing the inverse of a P × P matrix in update rule (10), is O(P³). After initializing the parameters, Algorithm 1 invokes a single iteration of the parameter update rules. Because the solution is derived analytically, a single iteration is sufficient for the parameters to nearly converge, provided that they are initialized carefully. However, the optimal values of C and K are determined by maximizing the average fuzzy membership value of the data samples through repeated application of the update rules.

Remark 2 (Free parameter β in Algorithm 1) Algorithm 1 has only a single free parameter, β ∈ [0, 0.5], to be chosen by the user. The maximum possible number of signal models in the mixture, Cmax, depends on the value of β. It will be demonstrated through experiments that the algorithm’s performance is not highly sensitive to the choice of β.

Data distribution modeling

The application of Algorithm 1 to given data samples results in the determination of Copt different fuzzy membership functions on the unobserved variable m, which are defined as

Let be the set of parameters returned by Algorithm 1, i.e., . Finally, a data model, constructed from using Algorithm 1, is represented by a fuzzy membership function defined as

Classification

The data modeling capability of functional can be exploited for classification. If are S different sets returned by Algorithm 1 corresponding to the data samples of S different classes, then the class label associated with a vector y could be predicted as
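The decision rule can be sketched as choosing the class whose data model assigns the highest membership value to y. The prediction equation is not reproduced above, so the argmax form is an assumption, as are the function names `predict_label` and `make_gaussian_model`. The isotropic Gaussian stand-ins and their centers are hypothetical placeholders for the per-class models that Algorithm 1 would return.

```python
import math

def predict_label(y, class_models):
    """Return the label whose membership function gives y the highest value.

    class_models maps a class label to a callable y -> membership in [0, 1].
    """
    return max(class_models, key=lambda label: class_models[label](y))

def make_gaussian_model(center, precision):
    """Hypothetical stand-in for a learned per-class data model."""
    def membership(y):
        sq_dist = sum((yi - ci) ** 2 for yi, ci in zip(y, center))
        return math.exp(-0.5 * precision * sq_dist)
    return membership

# Two illustrative classes with made-up centers in a 2-dimensional space.
models = {
    "pipetting": make_gaussian_model([0.0, 0.0], 1.0),
    "computer": make_gaussian_model([3.0, 3.0], 1.0),
}
label = predict_label([2.5, 2.8], models)
```

One model is fitted per class on that class's training samples only, so adding a new class does not require refitting the existing models.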

Demonstrations on Toy data sets

Fig. 3 shows an example of 2-dimensional data samples and a display of the fuzzy membership function (calculated using (22)) over the data space. As depicted in Fig. 3, the distribution of the samples in P-dimensional space is modeled by the fuzzy membership function. Stochastic mixture models have been extensively studied in the literature and are typically used to learn data distributions. The most commonly used Gaussian mixture models (GMM) fit the given data samples by assuming that each data sample has been generated by a stochastic mixture of a finite number of Gaussian distributions. The “Expectation Maximization” algorithm is typically used to learn Gaussian mixture models from data samples, and the number of components in the mixture can be selected using the Bayesian information criterion (BIC). Situations may arise in which GMM do not give favorable results. Fig. 4(a) is an example of data samples where Algorithm 1 is observed to perform better than GMM (together with BIC). A comparison between the color plots of the GMM-based likelihood (displayed in Fig. 4(b)) and the Algorithm 1-based fuzzy membership function (displayed in Fig. 4(c)) demonstrates the effectiveness of Algorithm 1 in modeling the distribution of data samples.
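For reference, the Bayesian information criterion used to select the number of GMM components is commonly written as below. The symbols are standard notation rather than notation from this text: $k$ is the number of free parameters of the mixture, $N$ is the number of data samples, and $\hat{L}$ is the maximized likelihood.

```latex
\mathrm{BIC} = k \ln N - 2 \ln \hat{L}
```

Among candidate mixtures, the one with the lowest BIC is selected, which penalizes additional components unless they improve the likelihood sufficiently.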

Figure 3

An example of the model learned from 2-dimensional data samples using Algorithm 1 (with β = 0.5).

Figure 4

An example of the comparison between the Gaussian mixture models and Algorithm 1 (with β = 0.5).

Heartbeat intervals classification

This section applies the proposed methodology to the experimentally recorded heartbeat intervals (referred to as R-R intervals) of 20 different subjects while they performed two different types of tasks in a chemical laboratory of Zhejiang University. One task involved manual pipetting of chemical solutions, while the other involved working with a computer. The aim is to classify the heartbeat intervals of a subject between the two tasks. The P-dimensional data samples were created from the sequence of R-R intervals (see Table 1), where RRi is the ith heartbeat interval. The R-R intervals corresponding to the first half of the task duration serve as the training data and those of the second half as the testing data. Table 2 lists the median classification accuracy over the 20 subjects, obtained on the testing data by different classification methods, for different values of the data dimension P. The higher classification accuracy of the analytical fuzzy approach in Table 2 supports the argument that the proposed approach could be an effective tool for the modeling and analysis of biomedical data.
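The construction of P-dimensional samples from an R-R sequence can be sketched as follows. The defining equation is not reproduced here, so the sliding window of P consecutive intervals is an assumption, as are the names `make_samples` and `split_halves`. The split below is by interval count rather than by task duration, which is a simplification, and the R-R values are made-up numbers in seconds.

```python
def make_samples(rr, P):
    """Build P-dimensional samples from consecutive R-R intervals.

    Assumed form: sample i is [RR_i, RR_(i+1), ..., RR_(i+P-1)].
    """
    return [rr[i:i + P] for i in range(len(rr) - P + 1)]

def split_halves(rr):
    """Split a sequence into a first half (training) and second half (testing)."""
    half = len(rr) // 2
    return rr[:half], rr[half:]

# Made-up R-R intervals (seconds) for one subject and one task.
rr_intervals = [0.81, 0.79, 0.84, 0.80, 0.78, 0.83, 0.82, 0.80]
train_rr, test_rr = split_halves(rr_intervals)
train_samples = make_samples(train_rr, P=2)
test_samples = make_samples(test_rr, P=2)
```

Windowing each half separately keeps every training sample free of intervals from the testing half.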

Table 1

A comparison of different classification algorithms with the proposed method in terms of classification accuracy on testing data.

Method	Dataset 1	Dataset 2	Dataset 3
Nearest neighbors	100%	100%	75%
Linear SVM	91%	46%	51%
RBF SVM	90%	100%	59%
Decision tree	98%	100%	80%
Random forest	98%	100%	73%
AdaBoost	93%	97%	80%
Naive Bayes	92%	97%	57%
LDA	90%	29%	52%
QDA	90%	96%	57%
Analytical fuzzy (β = 0.5)	100%	100%	82%

Table 2

The median accuracy (in %) of different algorithms in classifying the testing heartbeat intervals between two tasks performed by subjects.

Method	Median of % accuracy (P = 2)	Median of % accuracy (P = 4)	Median of % accuracy (P = 6)	Median of % accuracy (P = 8)
Nearest neighbors	87.11	90.33	91.08	92.65
Linear SVM	87.11	89.24	90.64	91.58
RBF SVM	84.07	84.17	86.99	90.11
Decision tree	84.95	87.22	88.83	89.57
Random forest	86.75	88.93	90.84	92.51
AdaBoost	88.36	90.72	91.87	92.60
Naive Bayes	87.40	89.27	91.05	92.18
LDA	88.67	90.70	91.59	92.99
QDA	88.04	88.46	90.08	90.97
Analytical fuzzy (β = 0)	88.75	91.16	92.14	93.14

Concluding remarks

The theoretical contribution of this work is to propose an analytical fuzzy approach that provides a principled basis for determining the fuzzy membership functions to handle uncertainties in a modeling problem. The theoretical results form the basis for designing an algorithm that results in an efficient modeling of the data distribution in multi-parametric space. The analytically derived expressions for fuzzy membership functions for representing uncertainties associated with biomedical data should facilitate a system theoretic approach to mathematically design the medical expert systems.
