A novel early diagnostic framework for chronic diseases with class imbalance.

Xiaohan Yuan¹, Shuyu Chen², Chuan Sun¹, Lu Yuwen¹.

Abstract

Chronic diseases are one of the most severe health issues in the world, due to their terrible clinical presentations such as long onset cycle, insidious symptoms, and various complications. Recently, machine learning has become a promising technique to assist the early diagnosis of chronic diseases. However, existing works ignore the problems of feature hiding and imbalanced class distribution in chronic disease datasets. In this paper, we present a universal and efficient diagnostic framework to alleviate the above two problems for diagnosing chronic diseases timely and accurately. Specifically, we first propose a network-limited polynomial neural network (NLPNN) algorithm to efficiently capture high-level features hidden in chronic disease datasets, which is data augmentation in terms of its feature space and can also avoid over-fitting. Then, to alleviate the class imbalance problem, we further propose an attention-empowered NLPNN algorithm to improve the diagnostic accuracy for sick cases, which is also data augmentation in terms of its sample space. We evaluate the proposed framework on nine public and two real chronic disease datasets (partly with class imbalance). Extensive experiment results demonstrate that the proposed diagnostic algorithms outperform state-of-the-art machine learning algorithms, and can achieve superior performances in terms of accuracy, recall, F1, and G_mean. The proposed framework can help to diagnose chronic diseases timely and accurately at an early stage.

Year: 2022 PMID: 35597855 PMCID: PMC9123399 DOI: 10.1038/s41598-022-12574-x

Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996

Introduction

Chronic diseases are a severe global health issue. In 2019, the World Health Organization pointed out that chronic diseases account for about 7 of the top 10 causes of death in the world[2]. Deaths caused by chronic diseases account for more than 63% of total global deaths. Common chronic diseases include heart disease, diabetes, and hypertension, which are mainly caused by unhealthy individual lifestyles[3]. Once people suffer from chronic diseases, several vital organs (e.g., eye, brain, heart, kidney) can be damaged, and serious complications affecting work and life can easily follow[4]. Patients with chronic diseases are particularly vulnerable to infectious diseases, such as the coronavirus disease 2019 (COVID-19)[5]. More than 48% of COVID-19 patients have a history of chronic diseases and are more likely to develop severe symptoms[6,7]. Additionally, chronic diseases lead to high medical expenses[8]. The Centers for Disease Control and Prevention reports that chronic diseases are leading drivers of the nation's 3.8 trillion US dollars in annual health care costs[9]. The main reason for the high fatality rate and high medical expenses is that chronic diseases have severe clinical presentations such as a long onset cycle, insidious symptoms, irreversible development, and various complications[10]. These facts show that the prevention, diagnosis, and treatment of chronic diseases need to be strengthened quickly. The early diagnosis of chronic diseases is therefore urgent and essential: it can motivate high-risk patients to change their unhealthy lifestyles, thereby reducing the incidence of complications and improving their health and quality of life. However, since the onset of chronic diseases is imperceptible and there are no obvious clinical symptoms in the early stage, it is difficult for doctors to determine a patient's risk of chronic disease.
Machine learning has recently become one of the most promising technologies for assisted disease diagnosis, owing to its autonomous learning and low error rate[11-13]. Several state-of-the-art machine learning algorithms have been widely used in the early diagnosis of different chronic diseases (e.g., chronic kidney disease, diabetes), such as support vector machines (SVM)[14], logistic regression (LR)[15], k-nearest neighbor (KNN)[16], decision trees (DT)[17], and ensembles of these algorithms[18-20]. However, existing works are mainly dedicated to data preprocessing (e.g., data regularization and feature selection) to improve the early diagnostic performance for a single chronic disease[21,22]. Besides, they ignore the problems of feature hiding and imbalanced class distribution in chronic disease datasets. Hence, these methods do little to improve the performance of the diagnostic model and are not suitable for a universal and efficient diagnosis of chronic diseases. Feature hiding means that a feature in the dataset may not be directly related to decision-making; it needs to be analyzed together with other features to obtain the features that are directly related to decision-making[23]. For example, heart rate and body mass index alone do not directly determine whether a patient has heart disease. If only the visible original features of the data are used, neither a doctor nor a machine learning model may be able to make a sound decision. Therefore, we need to expand the feature space of the data to capture its latent features related to chronic disease diagnosis. Additionally, the imbalanced class distribution of a dataset refers to a significant skew between the numbers of samples of the different classes, which is also called the class imbalance problem[24].
The dominant class is called the majority class, and the remaining classes are called minority classes. A model learned from a class-imbalanced dataset is unreliable because it focuses on identifying the majority class correctly and neglects the minority class[25,26]. In chronic disease datasets in particular, the number of sick cases (minority class) is generally lower than the number of healthy cases (majority class). However, the cost of misdiagnosing a sick case as healthy is significantly higher than the cost of misdiagnosing a healthy case as sick, since the former may cause the patient to miss the best treatment period[27]. Therefore, accurately identifying sick cases in a class-imbalanced chronic disease dataset without affecting the overall diagnostic performance is crucial and very challenging. Deep neural networks have great potential for solving various engineering problems in many fields, by extracting high-level features from data to achieve superior classification performance[28,29]. However, most deep neural network algorithms are not well suited to small-scale datasets and are prone to overfitting[30,31]. Additionally, as the collected chronic disease data are generally not abundant (i.e., small-scale datasets), some existing deep neural network algorithms cannot train a good diagnostic model for chronic diseases. Recently, the deep polynomial neural network (PNN) has received attention from researchers[32-34]. We investigate the advantages of PNN and find that it is well suited to classification tasks on small-scale datasets compared with other deep neural network algorithms. Notably, the ideal PNN is parameter-free and can iteratively reduce the training error to zero[35]. Each network node of PNN is a polynomial function of its input. Thus, PNN can represent the values of any polynomial over the input data.
In particular, similar to other deep neural network algorithms, the network architecture of PNN is constructed layer by layer, so that it represents progressively higher-level (hidden) features of the input data. In other words, PNN can hierarchically expand the feature space of its input and effectively capture features related to chronic disease diagnosis. Finally, the output layer of PNN can be constructed by solving a simple convex optimization problem. In this paper, we investigate the early diagnosis of chronic diseases. To the best of our knowledge, we are the first to study a universal and efficient diagnostic framework for chronic diseases, which can extract high-level features and address the class imbalance problem to diagnose chronic diseases in a timely and accurate manner. Specifically, to efficiently capture high-level features hidden in chronic disease datasets, we propose a network-limited PNN (NLPNN) algorithm that avoids over-fitting. NLPNN can be seen as data augmentation in terms of the feature space. Additionally, collected chronic disease datasets generally have a serious class imbalance problem, that is, the number of positive samples (sick cases) is significantly smaller than the number of negative samples (healthy cases). As a result, the PNN-based diagnostic model cannot fully learn the knowledge of sick cases, which leads to costly misdiagnosis (low recall). To alleviate this class imbalance problem, we further empower samples with attention (i.e., weight) to change the importance of each sample and propose an improved NLPNN algorithm, named attention-empowered NLPNN (AEPNN). AEPNN pays more attention to the samples that are misclassified by NLPNN, which can be regarded as data augmentation in terms of the sample space. The main contributions of this paper are summarized as follows. (1) We study a universal and efficient diagnostic framework for the timely and accurate early diagnosis of chronic diseases with small-scale datasets. (2) We propose an NLPNN algorithm that avoids over-fitting, efficiently captures high-level features hidden in chronic disease datasets, and achieves high classification accuracy. (3) We further propose an AEPNN algorithm to address the class imbalance problem, which greatly improves the recall of the diagnostic model, that is, it can accurately diagnose sick cases. (4) We evaluate and compare the proposed methods against other state-of-the-art methods on nine public and two private chronic disease datasets (partly with class imbalance); extensive experimental results demonstrate that the two proposed diagnostic models outperform state-of-the-art machine learning algorithms and achieve superior accuracy and recall.

The rest of the paper is organized as follows. We discuss related work in the “Related work” section. The “Diagnostic framework for chronic diseases” section presents the proposed algorithms, and experimental results are shown in the “Experimental results” section. Finally, the “Conclusion” section concludes this paper.

Related work

Early diagnosis of chronic diseases

Several existing machine learning algorithms have been proposed to diagnose a certain chronic disease[36-38]. Heydari et al.[36] compared the performance of various machine learning classification algorithms in the early diagnosis of type 2 diabetes. The simulation results showed that the performance of classification techniques depends on the nature and complexity of the dataset. Khan et al.[37] developed a chronic disease risk prediction framework. To reduce the impact of outliers, Alirezaei et al.[38] incorporated K-means clustering, SVM, and meta-heuristic algorithm to diagnose diabetes disease. However, they ignored the influence of data distribution and structural changes on model generalization performance. Under the premise of not changing the structure and distribution of data, the authors in[13] proposed a diagnostic model based on XGBoost for chronic kidney disease (CKD). Sekar et al.[39] used a hierarchical neural network fusion method (FHNN) for the stratified diagnosis of cardiovascular disease (CVD). However, the impact of FHNN mainly depends on the optimal choice of the sub-neural network. Some tree-based ensemble learning techniques applied to early diagnosis methods of diabetes were comprehensively studied by Tama et al.[20], and the differential performance of different classification methods was evaluated through statistical significance tests. At the same time, Altan et al.[40] also compared various machine learning algorithms for the early diagnosis of chronic obstructive pulmonary disease and proposed a deep learning model to analyze multi-channel lung sounds using statistical features of Hilbert-Huang transform, which successfully achieved high classification performance of accuracy, sensitivity, and specificity of 93.67%, 91%, and 96.33%, respectively.

Class imbalance

In medical datasets, the problem of class imbalance seriously affects the accuracy of classifiers[27,24]. In most cases, it directly leads to a high rate of misdiagnosis of the disease. This is because the class imbalance of the training data brings difficulties to the algorithm learning, and the algorithm pays more attention to the majority class[41]. However, the minority class in medical datasets (sick vs. healthy) is often more important from a data mining perspective, and it usually carries critical and useful knowledge. At present, many scholars have studied the class imbalance problem, among which there are three main methods to alleviate the class imbalance[42,43]. (1) Data-level methods: in the data preprocessing stage, re-sampling is used to reduce the size of the majority class or increase the size of the minority class (or both) to balance the training set and eliminate difference. (2) Algorithm-level methods: in the training phase, the learning algorithm is modified to be suitable for mining data with imbalanced distributions. (3) Hybrid methods: the advantages of the first two methods are combined to alleviate the adverse effects of class imbalance on the results.
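As an illustration of the data-level approach in (1), the sketch below randomly duplicates minority-class samples until both classes have the same size. It is a minimal example and not a method used in this paper; the function name `random_oversample`, the label convention (+1 for sick, -1 for healthy), and the toy data are assumptions made for illustration.

```python
import random

def random_oversample(samples, labels, minority_label=1, seed=0):
    """Duplicate randomly chosen minority samples until both classes are the same size."""
    rng = random.Random(seed)
    minority = [(x, y) for x, y in zip(samples, labels) if y == minority_label]
    majority = [(x, y) for x, y in zip(samples, labels) if y != minority_label]
    # Nothing to do if the "minority" class is empty or not actually smaller.
    if not minority or len(minority) >= len(majority):
        return list(samples), list(labels)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    xs, ys = zip(*balanced)
    return list(xs), list(ys)

# Toy data (hypothetical): five healthy cases (-1) and one sick case (+1).
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
y = [-1, -1, -1, -1, -1, 1]
Xb, yb = random_oversample(X, y)
print(yb.count(1), yb.count(-1))
```

Oversampling only repeats existing sick cases, so it balances the class counts without adding new information; algorithm-level methods instead change how the learner weighs each sample.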

Diagnostic framework for chronic diseases

Statement: I confirm that all methods were performed in accordance with the relevant guidelines and regulations.

In this section, we propose a universal and efficient diagnostic framework for the timely and accurate diagnosis of chronic diseases. The proposed framework consists of the NLPNN algorithm and the AEPNN algorithm, which alleviate the problems of feature hiding and class imbalance, respectively.

Network-limited polynomial neural network

The PNN algorithm learns a high-level polynomial feature representation of the data through a multi-layer network architecture and outputs features hierarchically[32,33]. Although the PNN algorithm has been proven to run in polynomial time, it has a limitation: the depth and width of the network cannot be controlled. Both are adaptive, and the network stops growing deeper only when the training error reaches zero[35]. In the worst case, the network can be deepened indefinitely, or the network width can become as large as the number of training samples n. This leads to severe overfitting. Hence, we present an NLPNN algorithm for the early diagnosis of chronic diseases to avoid this issue. The structure of NLPNN is shown in Fig. 1a, and the details of the NLPNN algorithm applied to chronic disease diagnosis are described below.

Figure 1

Flowchart of the proposed algorithms: (a) NLPNN; (b) AEPNN.

For the early diagnosis of chronic diseases, we denote the labeled training dataset as $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $X = \{x_1, \dots, x_n\}$ with $x_i \in \mathbb{R}^{d}$ is the set of n samples with d features; $y = (y_1, \dots, y_n)^{\top}$ is an n-dimensional column vector and $y_i \in \{-1, +1\}$, $i = 1, \dots, n$. Here, $y_i = 1$ means that the i-th sample is labeled as a sick case, and $y_i = -1$ otherwise. The M-order multivariate polynomial on the sample $x$ is written as

$$p(x) = \sum_{j=0}^{M} \sum_{\alpha : |\alpha| = j} w_{\alpha} \prod_{k=1}^{d} x_k^{\alpha_k}, \tag{1}$$

where $\alpha = (\alpha_1, \dots, \alpha_d)$ is a d-dimensional vector composed of non-negative integers and $|\alpha| = \sum_{k=1}^{d} \alpha_k$; $w_{\alpha}$ is the coefficient of a monomial of degree j. The value of each polynomial on the n samples is represented by the linear projection

$$p \mapsto \left(p(x_1), \dots, p(x_n)\right)^{\top} \in \mathbb{R}^{n}. \tag{2}$$

According to linear algebra, there are n polynomials $p_1, \dots, p_n$ such that the vectors $\left(p_i(x_1), \dots, p_i(x_n)\right)^{\top}$, $i = 1, \dots, n$, form a basis of the space $\mathbb{R}^{n}$. Therefore, there is a coefficient vector $v \in \mathbb{R}^{n}$ so that $\sum_{i=1}^{n} v_i p_i(x_j) = y_j$ for every $j$.

The network layers of PNN are constructed by solving for a basis of polynomials hierarchically, and each node calculates a linear function or a weighted product over its input. We denote the j-th node of the i-th layer as $n_j^{i}(\cdot)$, which represents a feature (original or high-level) of the input data. For the first layer, the j-th node is the degree-1 polynomial (or linear) function $n_j^{1}(x) = \langle w_j, [1\ x] \rangle$, and the vectors $\left(n_j^{1}(x_1), \dots, n_j^{1}(x_n)\right)^{\top}$ form a basis of all values obtained by polynomials of degree 1 on the training dataset. They form the columns of the matrix $F^{1}$, and we set $F := F^{1}$. So far, a single-layer network has been constructed, and its output spans all the values obtained by linear functions on the training samples.

In principle, a basis of the degree-2, 3, ..., M polynomials could be obtained in the same way. However, the basis of the degree-M multivariate polynomials is composed of $O(d^{M})$ vector elements. The size of the basis grows exponentially with the degree, which makes this approach computationally infeasible. The work in[35] indicates that any degree-m polynomial can be regarded as

$$\sum_{i} g_i(x)\, h_i(x) + k(x), \tag{3}$$

where $g_i$ and $h_i$ are degree-1 and degree-$(m-1)$ polynomials, respectively; $k$ is a polynomial of degree not greater than $m-1$.
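Since all degree-1 polynomials are spanned by the nodes at the first layer of PNN, any degree-2 polynomial can be written as

$$\sum_{i} \Big( \sum_{j} \alpha_j^{(g_i)} n_j^{1}(x) \Big) \Big( \sum_{r} \alpha_r^{(h_i)} n_r^{1}(x) \Big) + \sum_{j} \alpha_j^{(k)} n_j^{1}(x), \tag{4}$$

where $\alpha_j^{(g_i)}$, $\alpha_r^{(h_i)}$, and $\alpha_j^{(k)}$ are scalar multipliers. (4) implies that the construction of the second layer of the network is based on the first layer. The matrix $\tilde{F}^{2}$ is formed from pairwise products of the columns of $F^{1}$; together with $F^{1}$, it spans all values attainable by degree-2 polynomials, and

$$\tilde{F}^{2} = \left[ (F^{1}_{1} \circ F^{1}_{1}) \ \ (F^{1}_{1} \circ F^{1}_{2}) \ \ \cdots \ \ (F^{1}_{|F^{1}|} \circ F^{1}_{|F^{1}|}) \right], \tag{5}$$

where the symbol $\circ$ indicates the Hadamard product; $F_{1}$ refers to the first column of $F$; $|F|$ refers to the number of columns of $F$. As for the degree-1 polynomials, a column subset $F^{2}$ of $\tilde{F}^{2}$ should be found so that the columns of $[F\ F^{2}]$ are a basis of the columns of $[F\ \tilde{F}^{2}]$. The second layer of the PNN is constructed from the columns of $F^{2}$, each of which is the product of two nodes $n_i^{1}$ and $n_j^{1}$ in the first layer. The above process is then repeated, and the layers $m = 2, 3, \dots, M$ of the network are constructed successively. We represent the matrix $\tilde{F}^{m}$, written as

$$\tilde{F}^{m} = \left[ (F^{m-1}_{1} \circ F^{1}_{1}) \ \ (F^{m-1}_{1} \circ F^{1}_{2}) \ \ \cdots \ \ (F^{m-1}_{|F^{m-1}|} \circ F^{1}_{|F^{1}|}) \right]. \tag{6}$$

Thus, we find a linearly independent column subset $F^{m}$ of $\tilde{F}^{m}$ such that the columns of the matrix $[F\ F^{m}]$ are a basis of the columns of the augmented matrix $[F\ \tilde{F}^{m}]$, where the columns of $[F\ \tilde{F}^{m}]$ span the values attained by all polynomials of degree at most $m$ over the training dataset; $F$ is then updated to $[F\ F^{m}]$. In addition, the conversion of $\tilde{F}^{m}$ to $F^{m}$ is achieved by

$$F^{m} = \tilde{F}^{m} W^{m}, \tag{7}$$

where $W^{m}$ is the projection matrix. Therefore, when the M-layer network of the PNN is constructed, all the values obtained by polynomials of degree at most M over the training dataset can be spanned by the columns of the matrix $F$. In fact, $F$ stores the high-level features of the input data: the deeper the layer, the higher-level the feature.

However, for the implementation of NLPNN, we use a parameter $\Omega = [\omega_1, \dots, \omega_M]$ to pre-limit the depth and width of the network, which means that the network consists of M ($M = |\Omega|$) non-output layers and the m-th layer has at most $\omega_m$ nodes. In the first non-output layer, we use singular value decomposition on the augmented data matrix $[1\ X]$ to obtain its partial orthogonal basis, which forms the $\omega_1$ nodes (the first $\omega_1$ principal singular vectors are selected).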
In each subsequent non-output layer, a standard Orthogonal Least Squares (OLS) procedure is used to greedily select a partial orthogonal basis, namely the first $\omega_m$ features most relevant to the diagnosis of chronic disease, from the established high-level feature set $\tilde{F}^{m}$. Finally, for each layer a simple linear classifier is trained on the features $F$ accumulated up to that layer. Therefore, there are M linear classifiers in the output layer. Each linear classifier is trained by a stochastic gradient descent method, which is used to solve the regularization problem

$$\min_{w} \ \frac{1}{n} \sum_{i=1}^{n} \ell\left(\langle w, F_{i} \rangle, y_i\right) + \lambda \lVert w \rVert^{2}, \tag{8}$$

where $\ell$ is the hinge loss and $F_{i}$ represents the i-th row of the matrix $F$; $\lambda$ is the regularization factor. Then, over the set $\Lambda$ of candidate regularization factors, we check the network performance layer by layer on the validation dataset to find the optimal network layer and the best regularization factor. Finally, an optimal linear classifier is obtained by

$$w^{*} = \arg\min_{m \in \{1, \dots, M\},\ \lambda \in \Lambda} \ \mathrm{err}_{V}\left(w^{m,\lambda}\right), \tag{9}$$

where $w^{m,\lambda}$ is the solution of (8) at layer $m$ with regularization factor $\lambda$ and $\mathrm{err}_{V}$ is the error on the validation dataset; the output is this optimal classifier. The purpose of NLPNN is to adaptively find features related to diagnosis from data that have been augmented in terms of the feature space. The detailed process of NLPNN is shown in Algorithm 1, which briefly describes the entire process from the construction of the network layers to the acquisition of the output layer.
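The layer-by-layer construction described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the authors' Algorithm 1: the SVD and OLS selection steps are replaced by a plain Gram-Schmidt independence check, no classifier is trained, and the names `build_features`, `select_independent`, and `widths` are hypothetical.

```python
def hadamard(u, v):
    """Element-wise (Hadamard) product of two columns."""
    return [a * b for a, b in zip(u, v)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def select_independent(candidates, basis, width, tol=1e-9):
    """Keep up to `width` candidate columns that are not spanned by `basis`.

    `basis` holds orthonormal columns and is extended in place (Gram-Schmidt).
    Assumption: this stands in for the SVD/OLS selection described in the text.
    """
    kept = []
    for col in candidates:
        if len(kept) == width:
            break
        residual = list(col)
        for q in basis:
            c = dot(residual, q)
            residual = [r - c * qi for r, qi in zip(residual, q)]
        norm = dot(residual, residual) ** 0.5
        if norm > tol:  # the column adds a new direction
            basis.append([r / norm for r in residual])
            kept.append(col)
    return kept

def build_features(X, widths):
    """Return the columns of F for a network whose m-th layer has at most widths[m] nodes."""
    n, d = len(X), len(X[0])
    # Degree-1 candidates: the constant column plus each original feature.
    first_candidates = [[1] * n] + [[row[k] for row in X] for k in range(d)]
    basis = []
    first = select_independent(first_candidates, basis, widths[0])
    layers = [first]
    for width in widths[1:]:
        prev = layers[-1]
        # Products of previous-layer nodes with first-layer nodes raise the degree by one.
        candidates = [hadamard(p, f) for p in prev for f in first]
        layers.append(select_independent(candidates, basis, width))
    return [col for layer in layers for col in layer]

# XOR-style toy inputs: the product feature x1*x2 only appears in the second layer.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
F = build_features(X, widths=[3, 3])
print(len(F), F[-1])
```

On this toy input the first layer keeps the constant column and the two original features, and the second layer adds the single new direction `x1 * x2`; limiting `widths` caps how many such columns each layer may contribute, which is the role the width limit plays in NLPNN.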

Attention-empowered NLPNN

Some chronic disease datasets suffer from the class imbalance problem, where sick cases are generally scarce compared to healthy cases. However, the correct diagnosis of the minority sick cases is vital in a healthcare system. The reason is that the cost of misdiagnosing a sick case is much higher than that of misdiagnosing a healthy case: the latter only requires further examination, whereas the former carries a life-threatening risk. During the training phase of NLPNN, since the samples of each class in the imbalanced dataset are treated equally, the trained model tends to be biased towards the majority class and to ignore the samples (sick cases) in the minority class. Thus, NLPNN does not perform well on class imbalance problems and causes serious misdiagnosis of minority sick cases. Furthermore, for the early diagnosis of chronic diseases, although we are more concerned with the accurate diagnosis of sick cases, we cannot ignore the overall diagnostic accuracy. To alleviate the class imbalance problem, we empower the cases with attention (i.e., weight) and propose an AEPNN algorithm. AEPNN pays more attention to the cases misdiagnosed by NLPNN by changing the importance of these cases. Motivated by committee-based learning[25], AEPNN trains and combines multiple complementary NLPNN classifiers to further improve the performance of NLPNN in alleviating the class imbalance problem. The structure of AEPNN is shown in Fig. 1b. For the implementation of AEPNN, we first assign an identical initial weight to each sample in the training dataset. An NLPNN classifier $h_1$ is trained from the training dataset with the initialized weight distribution $D_1$, and the error of $h_1$ is fed back to the training samples, so that the distribution of the training samples is adjusted by $h_1$. Then, the second NLPNN classifier $h_2$ is trained from the training dataset with the weight distribution $D_2$, where the weights of the samples misdiagnosed by $h_1$ are increased in $D_2$ to make $h_2$ pay more attention to the samples that are misdiagnosed by $h_1$.
This process is repeated until $h_L$ is trained after L iterations. Finally, the predicted label is obtained through the weighted combination of all NLPNN classifiers. The main process is shown in Algorithm 2. Specifically, we denote the true label corresponding to sample $x$ as $y$, and the predicted label obtained by the NLPNN classifier as $h(x)$. The 0/1 loss function is defined as

$$\ell_{0/1}(h \mid D) = \mathbb{E}_{x \sim D}\left[\mathbb{I}\left(h(x) \neq y\right)\right] = \int \mathbb{I}\left(h(x) \neq y\right) p(x)\, dx, \tag{10}$$

where $p(x)$ represents the probability density function of $x$ following the data distribution $D$. However, $\ell_{0/1}$ has poor mathematical properties (it is non-convex and non-continuous), which makes it very difficult to optimize directly. To optimize the loss function more conveniently, we select the convex and continuously differentiable exponential loss function

$$\ell_{\exp}(h \mid D) = \mathbb{E}_{x \sim D}\left[e^{-y\, h(x)}\right] \tag{11}$$

to replace the loss function (10). Lemma 1 proves that $\ell_{\exp}$ is a consistent replacement of the loss function $\ell_{0/1}$, which means that (11) can replace (10) to update the weight of the sample and the weight of the classifier in Algorithm 2.

Lemma 1

The consistent replacement of the loss function $\ell_{0/1}(h \mid D)$ is the exponential loss function $\ell_{\exp}(h \mid D)$.

Proof

Please see Appendix 1.

In Algorithm 2, $h_1$ is obtained by applying the NLPNN classifier to the initial sample distribution $D_1$. When $h_l$ is generated based on the distribution $D_l$, the weight $\alpha_l$ of the classifier $h_l$ is obtained by minimizing the exponential loss function $\ell_{\exp}(\alpha_l h_l \mid D_l)$. From Lemma 2, we know that $\alpha_l = \frac{1}{2} \ln \frac{1 - \epsilon_l}{\epsilon_l}$ is a necessary and sufficient condition for the exponential loss function to attain its minimum value. It means that, with the weight $\alpha_l$, the classifier $h_l$ can achieve the best performance on the dataset with distribution $D_l$.

Lemma 2

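The exponential loss function $\ell_{\exp}(\alpha_l h_l \mid D_l)$ attains its minimum value at $\alpha_l = \frac{1}{2} \ln \frac{1 - \epsilon_l}{\epsilon_l}$, where $\epsilon_l = P_{x \sim D_l}\left(h_l(x) \neq y\right)$ is the weighted error of $h_l$.

Proof

Please see Appendix 2.

$H_l = \sum_{i=1}^{l} \alpha_i h_i$ is the voting result of the first l NLPNN classifiers with weights $\alpha_1, \dots, \alpha_l$, and its error can be corrected by the next classifier $h_{l+1}$. Ideally, $h_{l+1}$ can correct all errors of $H_l$ by minimizing the exponential loss $\ell_{\exp}(H_l + h_{l+1} \mid D)$. From Lemma 3, all errors of $H_l$ can be corrected by the NLPNN classifier $h_{l+1}$, which is trained based on the sample weight distribution $D_{l+1}(x) = \frac{D_l(x)\, e^{-\alpha_l\, y\, h_l(x)}}{Z_l}$, where $Z_l$ is the normalization factor that ensures that $D_{l+1}$ is a distribution.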
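The closed form for the classifier weight can be recovered in two lines. The derivation below is the standard argument for boosting with labels and predictions in $\{-1, +1\}$, given here as a sketch rather than as the content of Appendix 2; the symbols $\alpha$, $\epsilon$, $h$, and $D$ are written without the iteration index for brevity.

```latex
\begin{aligned}
\ell_{\exp}(\alpha h \mid D)
  &= \mathbb{E}_{x \sim D}\!\left[e^{-\alpha\, y\, h(x)}\right]
   = e^{-\alpha}(1-\epsilon) + e^{\alpha}\epsilon,
   \qquad \epsilon = P_{x \sim D}\left(h(x) \neq y\right), \\
\frac{\partial \ell_{\exp}}{\partial \alpha}
  &= -e^{-\alpha}(1-\epsilon) + e^{\alpha}\epsilon = 0
  \;\Longrightarrow\;
  \alpha = \frac{1}{2}\ln\frac{1-\epsilon}{\epsilon}.
\end{aligned}
```

The second derivative $e^{-\alpha}(1-\epsilon) + e^{\alpha}\epsilon$ is positive, so this stationary point is the minimum. For example, a base classifier with weighted error $\epsilon = 0.25$ receives the weight $\alpha = \frac{1}{2}\ln 3 \approx 0.549$, and the weight grows as the error shrinks.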

Lemma 3

Assume that the base classifier $h_l$ is generated based on the data distribution $D_l$, that $\alpha_l$ is the weight of the classifier $h_l$, and that $H_l = \sum_{i=1}^{l} \alpha_i h_i$; then all the false predictions of $H_l$ can be corrected through the ideal base classifier $h_{l+1}$, which is generated based on the data distribution $D_{l+1}(x) = \frac{D_l(x)\, e^{-\alpha_l\, y\, h_l(x)}}{Z_l}$.

Proof

Please see Appendix 3.

In summary, we iteratively optimize the exponential loss function by introducing two kinds of attention (the sample weights $D_l$ and the classifier weights $\alpha_l$) to achieve the superiority of AEPNN on class-imbalanced datasets.
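The two kinds of attention can be illustrated with a short boosting loop. The sketch below follows the generic AdaBoost-style update implied by Lemmas 2 and 3 and is not the authors' Algorithm 2: a weighted decision stump (`train_stump`) stands in for the NLPNN base classifier, and all function names and the toy data are assumptions.

```python
import math

def classifier_weight(eps):
    """Classifier attention: alpha = 0.5 * ln((1 - eps) / eps) for weighted error eps."""
    return 0.5 * math.log((1.0 - eps) / eps)

def reweight(w, y, preds, alpha):
    """Sample attention: raise the weights of misclassified samples, then normalize."""
    new_w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, preds)]
    z = sum(new_w)  # normalization factor
    return [wi / z for wi in new_w]

def train_stump(X, y, w):
    """Weighted decision stump; a stand-in for the NLPNN base classifier (assumption)."""
    best = None
    for k in range(len(X[0])):
        for thr in sorted({row[k] for row in X}):
            for sign in (1, -1):
                preds = [sign if row[k] > thr else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, k, thr, sign)
    _, k, thr, sign = best
    return lambda row: sign if row[k] > thr else -sign

def boost(X, y, rounds, fit=train_stump):
    n = len(X)
    w = [1.0 / n] * n  # identical initial weight for every sample
    ensemble = []
    for _ in range(rounds):
        h = fit(X, y, w)
        preds = [h(row) for row in X]
        eps = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
        if eps >= 0.5:   # no better than chance under the current weights
            break
        if eps == 0.0:   # a perfect base classifier makes the others unnecessary
            ensemble = [(1.0, h)]
            break
        alpha = classifier_weight(eps)
        ensemble.append((alpha, h))
        w = reweight(w, y, preds, alpha)
    return ensemble

def predict(ensemble, row):
    score = sum(alpha * h(row) for alpha, h in ensemble)
    return 1 if score >= 0 else -1

# One round by hand: four equally weighted samples, the last one misclassified.
new_w = reweight([0.25] * 4, [1, 1, -1, -1], [1, 1, -1, 1], classifier_weight(0.25))
print([round(v, 4) for v in new_w])
```

After the update, the single misclassified sample carries half of the total weight, so the next base classifier is pushed to get it right; this is the mechanism by which the minority sick cases that the first classifier misses gain importance in later rounds.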

Experimental results

Some DNN models are not suitable for classification tasks with the small-scale dataset due to the over-fitting problem. However, the PNN-based deep learning algorithm performs well for the early diagnosis of chronic diseases with the small-scale dataset, due to its unique network structure. We select five state-of-the-art machine learning algorithms as the baseline algorithms, i.e. SVM[44], LR[45], KNN[46], DT[47], and multi-layer perceptron (MLP)[48].

Chronic disease datasets

To verify the effectiveness of the proposed algorithms in the early diagnosis of chronic diseases, we select nine public and two private chronic disease datasets for experiments. The nine public chronic disease datasets (available from http://archive.ics.uci.edu/ml and https://www.kaggle.com/datasets) include CKD, the Pima Indian diabetes dataset (PIMA), CVD, the Heart Disease Dataset (Heart), the Framingham Heart Disease dataset (Fra_Heart), the Hepatitis dataset (Hep), and the Breast Cancer Wisconsin dataset (BCW) in the UCI Machine Learning Repository, and the Type 2 Diabetes Mellitus Dataset (T2DM) and the Gestational Diabetes Mellitus dataset (GDM) in the Tianchi Precision Medicine Competition. Such datasets are scarce and valuable, but some of them have problems such as small size, class imbalance, and missing values. The two private chronic disease datasets (the Pri_hyper dataset and the Pri_diab dataset) were collected from a district in Chongqing, China. The Pri_hyper dataset consists of the health records of hypertensive patients and healthy people. The Pri_diab dataset consists of the health records of diabetic patients and healthy people. The composition details of the selected datasets are listed in Table 1, in which the column Datasets gives the short name of the dataset; the column Features gives the number of features; the column Samples gives the number of samples; the column Positive:Negative gives the ratio of the number of positive samples to the number of negative samples; and the column Missing shows whether there are missing values in the corresponding dataset. For consistency, we randomly split each chronic disease dataset into a training dataset and a testing dataset at a ratio of 8:2 while maintaining the class distribution of the original dataset. For baseline algorithms that have to process missing values and regularize data, we fill the missing values with zeros and regularize the data. The implementation of the proposed algorithms does not require any other data preprocessing technology.
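The 8:2 split that preserves the class distribution, and the zero filling used for the baselines, can be sketched as follows. This is an illustrative reading of the protocol, not the authors' code; the function names, the use of `None` to mark a missing value, and the toy labels are assumptions.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices so that each class keeps the same share in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

def fill_missing_with_zero(rows):
    """Replace missing entries (represented here as None) with 0.0."""
    return [[0.0 if v is None else v for v in row] for row in rows]

# Toy labels (hypothetical): 10 sick cases (+1) and 40 healthy cases (-1).
labels = [1] * 10 + [-1] * 40
train, test = stratified_split(labels)
print(len(train), len(test), sum(1 for i in test if labels[i] == 1))
```

Splitting each class separately keeps the positive-to-negative ratio of Table 1 in both the training and the test set, which matters most for the heavily imbalanced datasets, where a purely random split could leave very few sick cases in the test set.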

Table 1

The composition details of chronic disease datasets.

Datasets	Features	Samples	Positive:Negative	Missing
CKD	24	400	1:0.6	No
PIMA	8	768	1:1.87	No
T2DM	40	5642	1:11.19	Yes
CVD	11	70000	1:1.001	No
Heart	13	1025	1:0.95	No
GDM	83	1000	1:1.13	Yes
Fra_Heart	15	4240	1:5.58	Yes
Hep	19	155	1:0.24	Yes
BCW	10	699	1:1.9	Yes
Pri_hyper	33	9091	1:1.13	No
Pri_diab	28	14525	1:12.78	No

Evaluation measurements

For the early diagnosis of chronic diseases, the generalization performance can be estimated on the test dataset. In addition to using the area under the receiver operating characteristic curve (AUC) to evaluate the performance of the model, we also selected the following evaluation indicators to evaluate the proposed algorithm:

$$\mathrm{Accuracy} = \frac{TP + TN}{N}, \qquad \mathrm{Specificity} = \frac{TN}{TN + FP}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP},$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$

where TP, FP, TN, and FN represent true positive, false positive, true negative, and false negative, respectively; N is the total number of samples. Accuracy represents the ratio of the number of correctly predicted samples of all classes to the total number of samples. Specificity represents the ratio of the number of correctly predicted healthy cases to the total number of healthy cases. Precision represents the ratio of the number of correctly predicted sick cases to the total number of predicted sick cases. Recall represents the ratio of the number of correctly predicted sick cases to the total number of sick cases. F1 is defined as the harmonic mean of precision and recall.
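These indicators can be computed directly from the four confusion-matrix counts, as in the sketch below. The function name and the example counts are hypothetical; the G_mean entry uses the common definition as the geometric mean of recall and specificity, which is an assumption because this section does not spell it out. The code also assumes that no denominator is zero.

```python
import math

def diagnostic_metrics(tp, fp, tn, fn):
    """Compute evaluation indicators from confusion-matrix counts (sick = positive)."""
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    specificity = tn / (tn + fp)   # correctly predicted healthy / all healthy
    precision = tp / (tp + fp)     # correctly predicted sick / all predicted sick
    recall = tp / (tp + fn)        # correctly predicted sick / all sick
    f1 = 2 * precision * recall / (precision + recall)
    g_mean = math.sqrt(recall * specificity)  # assumed definition, see lead-in
    return {
        "accuracy": accuracy,
        "specificity": specificity,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "g_mean": g_mean,
    }

# Hypothetical imbalanced test set: 10 sick and 90 healthy cases.
m = diagnostic_metrics(tp=8, fp=2, tn=88, fn=2)
print({name: round(value, 3) for name, value in m.items()})
```

In this example the accuracy is 0.96 while the recall is only 0.80, which shows why accuracy alone can hide missed sick cases on an imbalanced dataset.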

Comparison of performance

We investigate the impact of different network depths $\Delta$ and regularization factors $\lambda$ in the NLPNN model on the diagnostic performance for eleven chronic diseases, where $\Delta \in \{2, 3, 4, 5\}$ (network layers plus output layer) and $\lambda \in \{10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}\}$. To visually find the most suitable $\Delta$ and $\lambda$, we combine them into a pair $(\Delta, \lambda)$ and establish a bijection between $(\Delta, \lambda)$ and an integer index from 1 to 20, as described in Table 2. We use this index as the horizontal axis to draw the generalization performance curve of NLPNN against network depth and regularization factor. From Fig. 2, we can see that NLPNN has two advantages in the diagnosis of all chronic diseases: there is no over-fitting, and the training accuracy increases with the number of network layers (this can be observed at index values 1, 6, 11, ...). However, different $\lambda$ values affect the performance of the NLPNN algorithm, and the impact differs across chronic disease datasets.

Table 2

The bijective relationship between $(\Delta, \lambda)$ and the index used as the horizontal axis of Fig. 2.

Depth ($\Delta$)	$\lambda = 10^{-3}$	$\lambda = 10^{-2}$	$\lambda = 10^{-1}$	$\lambda = 10^{0}$	$\lambda = 10^{1}$
2	1	2	3	4	5
3	6	7	8	9	10
4	11	12	13	14	15
5	16	17	18	19	20
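The mapping in Table 2 is a row-major enumeration of the depth and regularization grid, which can be written as a pair of small functions. The function and constant names are hypothetical, and the regularization factor is represented by its base-10 exponent.

```python
DEPTHS = [2, 3, 4, 5]                 # network depth values (rows of Table 2)
LAMBDA_EXPONENTS = [-3, -2, -1, 0, 1] # lambda = 10 ** exponent (columns of Table 2)

def to_index(depth, exponent):
    """Map a (depth, lambda exponent) pair to its index in Table 2 (1 to 20)."""
    row = DEPTHS.index(depth)
    col = LAMBDA_EXPONENTS.index(exponent)
    return row * len(LAMBDA_EXPONENTS) + col + 1

def from_index(index):
    """Inverse mapping: recover (depth, lambda exponent) from an index."""
    row, col = divmod(index - 1, len(LAMBDA_EXPONENTS))
    return DEPTHS[row], LAMBDA_EXPONENTS[col]

print(to_index(2, -3), to_index(5, 1), from_index(8))
```

Indices 1, 6, 11, and 16 share the smallest regularization factor and differ only in depth, so reading a performance curve at those positions isolates the effect of adding layers.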

Figure 2

Training and test performance versus the index of the $(\Delta, \lambda)$ pair (Table 2) on eleven chronic disease datasets.

Figure 2a shows that NLPNN can achieve 100% generalization performance on the CKD dataset when the index is 1 (i.e., $\Delta = 2$ and $\lambda = 10^{-3}$). Then, as $\Delta$ increases and $\lambda$ changes, the test performance decreases somewhat, but the training and test performance both fluctuate within a range of 5%. This means that a shallow polynomial neural network model is sufficient to accurately diagnose chronic kidney disease. We can see from Fig. 2b, c, g and k that the index value has almost no effect on the diagnostic accuracy for diabetes and heart disease. In particular, for the diagnosis of hepatitis B disease (Fig. 2h), although the accuracy of the NLPNN model does not vary greatly, its specificity is unstable as the index value changes. The reason is that the Hep dataset has only 155 samples and the negative samples only account for 24% of the total samples. In addition, we can find the best output performance of NLPNN and the corresponding index value on the eleven chronic disease datasets from Fig. 2. Therefore, according to Table 2, we can find the network structure and the regularization factor with which NLPNN achieves the best performance, as shown in Table 3.

Table 3

Optimal parameter settings for different datasets.

		Network parameter
		$\Omega^{*}$	$\lambda^{*}$
Dataset	CKD	$[24]$	$10^{-3}$
	PIMA	$[9\ 9\ 9\ 9]$	$10^{-2}$
	T2DM	$[32\ 32\ 32]$	$10^{-3}$
	CVD	$[12\ 12\ 12\ 12]$	$10^{-3}$
	Heart	$[13\ 13]$	$10^{-2}$
	GDM	$[84]$	$10^{-2}$
	Fra_Heart	$[14\ 14\ 14\ 14]$	$10^{-2}$
	Hep	$[14\ 14\ 14\ 14]$	$10^{0}$
	BCW	$[11\ 11\ 11\ 11]$	$10^{-2}$
	Pri_hyper	$[32\ 32]$	$10^{-2}$
	Pri_diab	$[29\ 29\ 29]$	$10^{-1}$

Table 4 compares the generalization performance of the baseline algorithms and the NLPNN algorithm on the eleven chronic disease datasets, listing the test results under a unified standard. In general, the diagnostic accuracy of NLPNN on the eleven datasets is better than that of the baseline algorithms. In particular, for the diagnosis of chronic kidney disease and breast cancer, NLPNN achieves a generalization accuracy, recall, and F1_score of 1.0000, 1.0000, and 1.0000, respectively. NLPNN also shows a significant advantage in the diagnosis of hepatitis: its generalization accuracy (0.9310) is about 10 percentage points higher than that of the baseline algorithms (SVM: 0.8000, LR: 0.8333, KNN: 0.8000, DT: 0.8333, MLP: 0.8000).

Table 4

Performance of different algorithms.

Abbreviation	Algorithm	Acc	Re	F1_score	Abbreviation	Algorithm	Acc	Re	F1_score
CKD	SVM	0.9875	0.9800	0.9899	Fra_Heart	SVM	0.8349	0.0000	0.0000
	LR	0.9875	0.9800	0.9899		LR	0.8420	0.1000	0.1728
	KNN	0.9500	0.9200	0.9583		KNN	0.8337	0.0357	0.0662
	DT	0.9750	0.9600	0.9796		DT	0.8361	0.0643	0.1146
	MLP	0.9875	0.9800	0.9899		MLP	0.8314	0.1357	0.2099
	NLPNN	1.0000	1.0000	1.0000		NLPNN	0.8726	0.0614	0.1148
PIDD	SVM	0.7792	0.5517	0.6531	Hep	SVM	0.8000	1.0000	0.8846
	LR	0.7987	0.6207	0.6990		LR	0.8333	0.9565	0.8979
	KNN	0.7662	0.4827	0.6086		KNN	0.8000	0.9130	0.8749
	DT	0.7468	0.4310	0.5618		DT	0.8333	1.0000	0.9019
	MLP	0.6559	0.2414	0.3457		MLP	0.8000	0.9130	0.8749
	NLPNN	0.8247	0.6774	0.7568		NLPNN	0.9310	1.0000	0.9565
T2DM	SVM	0.9179	0.0000	0.0000	BCW	SVM	0.9714	0.9545	0.9545
	LR	0.9202	0.0733	0.1311		LR	0.9714	0.9545	0.9545
	KNN	0.9164	0.0092	0.0176		KNN	0.9714	0.9545	0.9545
	DT	0.9187	0.0092	0.0182		DT	0.9500	0.9545	0.9231
	MLP	0.9179	0.0092	0.0180		MLP	0.3143	1.0000	0.4783
	NLPNN	0.9232	0.0097	0.0192		NLPNN	1.0000	1.0000	1.0000
CVD	SVM	0.7244	0.6401	0.6977	Pri_hyper	SVM	0.7372	0.6162	0.6859
	LR	0.7249	0.6858	0.7125		LR	0.7350	0.6494	0.6953
	KNN	0.6354	0.5389	0.5950		KNN	0.7570	0.6765	0.7216
	DT	0.7247	0.6730	0.7085		DT	0.7224	0.4663	0.6100
	MLP	0.5385	0.9815	0.6789		MLP	0.7009	0.6210	0.6591
	NLPNN	0.7265	0.6847	0.7161		NLPNN	0.7624	0.6706	0.7266
Heart	SVM	0.8780	0.9307	0.8826	Pri_diab	SVM	0.9280	0.0000	0.0000
	LR	0.8780	0.9109	0.8804		LR	0.9315	0.0478	0.0913
	KNN	0.9024	0.8911	0.9000		KNN	0.9294	0.1244	0.2023
	DT	0.8488	0.8317	0.8442		DT	0.9325	0.0718	0.1327
	MLP	0.8878	0.8911	0.8866		MLP	0.9322	0.0861	0.1545
	NLPNN	0.9073	0.9320	0.9100		NLPNN	0.9360	0.0053	0.0106
GDM	SVM	0.6500	0.5591	0.5977
	LR	0.6150	0.5376	0.5649
	KNN	0.6200	0.4194	0.5064
	DT	0.6950	0.5269	0.6164
	MLP	0.6250	0.4839	0.5455
	NLPNN	0.7300	0.7021	0.7097

The best results for each dataset are marked in bold.
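
Several rows in the table pair a high accuracy with a recall of zero (for example, SVM on T2DM). The sketch below shows how that can happen, assuming the standard confusion-matrix definitions of accuracy, recall, and F1. The counts are made up for illustration and are not taken from any of the datasets.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, recall, F1) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, f1


# Hypothetical imbalanced test set: 920 healthy cases and 80 sick cases.
# A classifier that labels every case as healthy still reaches 92% accuracy,
# although it finds none of the sick cases.
acc, rec, f1 = classification_metrics(tp=0, fp=0, tn=920, fn=80)
print(acc, rec, f1)
```

This is why accuracy alone is a weak criterion on imbalanced data and why recall and F1 are reported alongside it.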

Figure 3 plots the ROC curves to further compare the performance of the NLPNN algorithm with that of the baseline algorithms. The AUC value of the proposed algorithm is generally higher than those of the baseline algorithms. It is also worth noting that, in the diagnosis of chronic kidney disease and breast cancer, the NLPNN model is an “ideal model” with an AUC value of 1 (Fig. 3a, i).
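
As a reminder of what an AUC of 1 means, the sketch below computes AUC through its rank interpretation: the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as one half. This is the standard definition rather than the authors' evaluation code, and the scores are invented for illustration.

```python
def auc_from_scores(labels, scores):
    """AUC as the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(pos) * len(neg))


# Every positive scores above every negative: the "ideal model" case.
perfect = auc_from_scores([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
# One positive is ranked below one negative, so one of four pairs is lost.
mixed = auc_from_scores([1, 1, 0, 0], [0.9, 0.2, 0.3, 0.1])
print(perfect, mixed)
```

An AUC of 1 therefore says that some threshold separates all sick cases from all healthy cases in the test set.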

Figure 3

ROC curves of different algorithms with the corresponding AUC values on chronic disease datasets.

In this paper, we pay attention not only to the overall accuracy of the model in diagnosing chronic diseases but, even more, to whether the model can accurately diagnose sick cases (positive samples). That is, we want the recall of the model to be as high as possible while the overall accuracy remains high. For the T2DM, CVD, Fra_Heart, and Pri_diab datasets, the ratio of correctly predicted sick cases to the total number of sick cases, i.e., the recall, is low. The reason is that these datasets suffer from class imbalance. To address this problem, the AEPNN algorithm (Algorithm 2) is proposed in the “Diagnostic framework for chronic diseases” section. Because NLPNN is a strong classifier, AEPNN does not need many individual classifiers; their number equals the number of iterations. The test performance changes as the number of NLPNN training rounds increases: although the overall diagnostic accuracy decreases slightly, the diagnostic accuracy for sick cases improves significantly. To obtain the best performance, we choose as the final number of training rounds the number of iterations that maximizes the difference between the growth rate of recall and the rate of decrease in accuracy. Figures 4, 5, 6, and 7 show the performance of the proposed algorithm on the Fra_Heart, T2DM, Pri_diab, and CVD datasets, respectively, at different iterations of NLPNN. Taken together with Table 1, these results show that the higher the class imbalance ratio of a chronic disease dataset, the more markedly AEPNN improves recall.
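
The rule for choosing the number of training rounds can be sketched as follows. This is one reading of the criterion, assuming both rates are measured relative to the first round (a single classifier). The function name and the accuracy and recall curves are hypothetical and only serve to show the mechanics.

```python
def select_num_classifiers(accuracies, recalls):
    """Pick the round that maximizes recall growth minus accuracy decrease.

    accuracies[i] and recalls[i] are test scores with i + 1 classifiers.
    Both rates are taken relative to the first entry (assumed baseline).
    """
    base_acc, base_rec = accuracies[0], recalls[0]
    if base_acc <= 0 or base_rec <= 0:
        raise ValueError("baseline accuracy and recall must be positive")
    best_round, best_margin = 1, float("-inf")
    for round_no, (acc, rec) in enumerate(zip(accuracies, recalls), start=1):
        recall_growth = (rec - base_rec) / base_rec
        accuracy_decrease = (base_acc - acc) / base_acc
        margin = recall_growth - accuracy_decrease
        if margin > best_margin:
            best_round, best_margin = round_no, margin
    return best_round, best_margin


# Made-up curves: recall rises and then falls back while accuracy drifts down.
accs = [0.90, 0.89, 0.87, 0.86, 0.80]
recs = [0.10, 0.15, 0.30, 0.31, 0.25]
best_round, margin = select_num_classifiers(accs, recs)
print(best_round)
```

On these invented curves the rule selects the fourth round, where recall has grown the most relative to the accuracy it costs.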

Figure 4

The test performance versus number of iteration on Fra_Heart dataset: (a) generalization performance; (b) performance growth rate.

Figure 5

The test performance versus number of iteration on T2DM dataset: (a) generalization performance; (b) performance growth rate.

Figure 6

The test performance versus number of iteration on Pri_diab dataset: (a) generalization performance; (b) performance growth rate.

Figure 7

The test performance versus number of iteration on CVD dataset: (a) generalization performance; (b) performance growth rate.

The generalization performance of AEPNN on the Fra_Heart dataset is shown in Fig. 4a. The performance growth rate is calculated relative to the performance with a single NLPNN classifier. From Fig. 4b, we observe that the recall has a growth rate of close to 300% when the number of NLPNN classifiers is six, which is therefore chosen as the best number of NLPNN classifiers for the diagnosis of heart disease. The most striking results are those of AEPNN on the T2DM and Pri_diab datasets. As can be seen from Figs. 5a and 6a, the recall improves significantly once the number of NLPNN classifiers exceeds four. When it reaches ten, the growth rate of the recall approaches 4000% on the T2DM dataset and 6000% on the Pri_diab dataset. Figures 5b and 6b also show that the growth rate of recall is much higher than the rate of decrease in accuracy. From Fig. 7, we can see that although the performance of AEPNN on the CVD dataset is not significantly improved, the growth rate of recall is still higher than the rate of decrease in accuracy. This indicates that the proposed algorithm is effective at improving recall. Its advantage is a lower missed-diagnosis rate for sick cases, so that more patients with chronic diseases can be treated and can control the development of the disease in time.
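
The reported growth rates can be sanity-checked against the recall values in Table 5, under the assumption that the single-classifier baseline coincides with the plain NLPNN row (the text does not state this explicitly). The helper name is illustrative.

```python
def growth_rate_percent(baseline, value):
    """Relative change of value with respect to baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# (NLPNN recall, AEPNN recall) copied from Table 5.
recall_pairs = {
    "Fra_Heart": (0.0614, 0.2456),
    "T2DM": (0.0097, 0.3883),
    "Pri_diab": (0.0053, 0.3048),
}

growth = {
    name: growth_rate_percent(base, boosted)
    for name, (base, boosted) in recall_pairs.items()
}
for name, pct in growth.items():
    print(f"{name}: recall growth of about {pct:.0f}%")
```

Under that assumption the values come out at roughly 300%, 3900%, and 5650%, in line with the "close to 300%", "approaches 4000%", and "6000%" statements in the text.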
We also quantitatively compare the generalization performance of the AEPNN and NLPNN algorithms by introducing G_mean, which is a powerful indicator for evaluating classification accuracy on class-imbalanced datasets[49]. From Table 5, we can see that AEPNN effectively improves G_mean by combining multiple NLPNNs. In particular, by combining ten NLPNNs, AEPNN increases G_mean from 0.0985 to 0.5728 on the T2DM dataset and from 0.0731 to 0.5385 on the Pri_diab dataset.

Table 5

Performance of two proposed algorithms on four datasets with class imbalance.

Abbreviation	Algorithm	Acc	Re	F1	G_mean
T2DM	NLPNN	0.9232	0.0097	0.0192	0.0985
T2DM	AEPNN (10/10)	0.8095	0.3883	0.2402	0.5728
CVD	NLPNN	0.7265	0.6847	0.7161	0.7256
CVD	AEPNN (10/10)	0.7279	0.6810	0.7160	0.7267
Fra_Heart	NLPNN	0.8726	0.0614	0.1148	0.2476
Fra_Heart	AEPNN (6/10)	0.8219	0.2456	0.2705	0.4731
Pri_diab	NLPNN	0.9360	0.0053	0.0106	0.0731
Pri_diab	AEPNN (10/10)	0.9098	0.3048	0.3032	0.5385
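
For reference, G_mean is commonly defined as the geometric mean of recall (sensitivity) and specificity. The sketch below uses that standard definition; the paper's exact formula is not reproduced in this section, and the input values are placeholders because the table does not report specificity.

```python
import math


def g_mean(recall, specificity):
    """Geometric mean of recall (sensitivity) and specificity."""
    return math.sqrt(recall * specificity)


# Placeholder values: a classifier that almost never flags sick cases gets a
# low G_mean even with perfect specificity; balancing the two rates raises it.
print(g_mean(0.01, 1.00))   # near-zero recall, perfect specificity
print(g_mean(0.40, 0.85))   # moderate recall, slightly lower specificity
```

Because the product collapses when either rate is close to zero, G_mean penalizes models that reach high accuracy only by favoring the majority class.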


Conclusion

In this paper, we have investigated a universal learning algorithm based on PNN for the early diagnosis of chronic diseases. Five state-of-the-art baseline algorithms are selected for comparison with the NLPNN algorithm. Experimental results show that NLPNN achieves the best accuracy on the nine chronic disease datasets. In particular, for the early diagnosis of chronic kidney disease and breast cancer, the generalization accuracy, recall, specificity, and AUC value of this model reach 1.000, 1.000, 1.000, and 1.000, respectively. Furthermore, an AEPNN algorithm is proposed to alleviate the class imbalance problem in chronic disease datasets. We aim to increase the probability that sick cases are accurately diagnosed, that is, to increase the recall of the model. Experiments on the four chronic disease datasets with class imbalance have confirmed the effectiveness of our model. Notably, the AEPNN model performs best on the Pri_diab dataset, which has a positive-negative sample ratio of 1:12.78, where the growth rate of its recall is close to 6000%. The proposed algorithm can effectively assist chronic disease experts in quickly screening patients with chronic diseases and can save patients the cost of further testing.
It should be pointed out that although our algorithm performs better on small-scale datasets, the PNN-based model also shows great application potential on large-scale datasets, such as protein-protein interaction prediction and disease diagnosis based on medical images. In future work, we will further investigate the PNN-based model in disease diagnosis. Although PNN can effectively capture hidden features in a parameter-free manner, it remains an open problem how to adaptively select the best hidden features from the network architecture of PNN to achieve competitive performance. Thus, we consider combining PNN with computational intelligence algorithms, such as monarch butterfly optimization (MBO), the earthworm optimization algorithm (EWA), and elephant herding optimization (EHO), to improve the performance of disease diagnosis.

23 in total

1. A tongue features fusion approach to predicting prediabetes and diabetes with machine learning.

Authors: Jun Li; Pei Yuan; Xiaojuan Hu; Jingbin Huang; Longtao Cui; Ji Cui; Xuxiang Ma; Tao Jiang; Xinghua Yao; Jiacai Li; Yulin Shi; Zijuan Bi; Yu Wang; Hongyuan Fu; Jue Wang; Yenting Lin; ChingHsuan Pai; Xiaojing Guo; Changle Zhou; Liping Tu; Jiatuo Xu
Journal: J Biomed Inform Date: 2021-02-01 Impact factor: 6.317

2. Biased Random Forest For Dealing With the Class Imbalance Problem.

Authors: Mohammed Bader-El-Den; Eleman Teitei; Todd Perry
Journal: IEEE Trans Neural Netw Learn Syst Date: 2018-11-20 Impact factor: 10.451

3. Protein-Protein Interactions Prediction via Multimodal Deep Polynomial Network and Regularized Extreme Learning Machine.

Authors: Haijun Lei; Yuting Wen; Zhuhong You; Ahmed Elazab; Ee-Leng Tan; Yujia Zhao; Baiying Lei
Journal: IEEE J Biomed Health Inform Date: 2018-06-12 Impact factor: 5.772

4. Multiple predictively equivalent risk models for handling missing data at time of prediction: With an application in severe hypoglycemia risk prediction for type 2 diabetes.

Authors: Sisi Ma; Pamela J Schreiner; Elizabeth R Seaquist; Mehmet Ugurbil; Rachel Zmora; Lisa S Chow
Journal: J Biomed Inform Date: 2020-01-28 Impact factor: 6.317

5. Improved Overlap-based Undersampling for Imbalanced Dataset Classification with Application to Epilepsy and Parkinson's Disease.

Authors: Pattaramon Vuttipittayamongkol; Eyad Elyan
Journal: Int J Neural Syst Date: 2020-07-17 Impact factor: 5.866