
A classification for complex imbalanced data in disease screening and early diagnosis.

Abstract

Imbalanced classification has drawn considerable attention in the statistics and machine learning literature. Typically, traditional classification methods often perform poorly when a severely skewed class distribution is observed, not to mention under a high-dimensional longitudinal data structure. Given the ubiquity of big data in modern health research, it is expected that imbalanced classification in disease diagnosis may encounter an additional level of difficulty that is imposed by such a complex data structure. In this article, we propose a nonparametric classification approach for imbalanced data in longitudinal and high-dimensional settings. Technically, the functional principal component analysis is first applied for feature extraction under the longitudinal structure. The univariate exponential loss function coupled with group LASSO penalty is then adopted into the classification procedure in high-dimensional settings. Along with a good improvement in imbalanced classification, our approach provides a meaningful feature selection for interpretation while enjoying a remarkably lower computational complexity. The proposed method is illustrated on the real data application of Alzheimer's disease early detection and its empirical performance in finite sample size is extensively evaluated by simulations.

Entities: Chemical

Keywords: AUC; Alzheimer's disease; brain imaging data; class imbalance; group LASSO


Year: 2022 PMID: 35603639 PMCID: PMC9541048 DOI: 10.1002/sim.9442

Source DB: PubMed Journal: Stat Med ISSN: 0277-6715 Impact factor: 2.497

INTRODUCTION

In many disease screening and early diagnosis studies, imbalanced classification is the most common challenge when a severely skewed class distribution in the data is attributed to the rarity of the disease. Traditional classification methods, such as logistic regression and machine learning models, that generally assume a balanced class distribution often perform poorly and misclassify subjects from the minority class (ie, disease) as ones from the majority (ie, health), resulting in a high false negative rate. Although it is possible to achieve a high predictive accuracy as well as a good specificity, the sensitivity is anticipated to be low due to the high false negative rate. Imbalanced classification is even more challenging when the real data structure is complex, for example, in high‐dimensional longitudinal settings. In many biomedical studies, high‐dimensional longitudinal data are often collected irregularly and sparsely, where the high‐dimensional measurements on each subject are taken repeatedly at discrete random time points and the number of measurements may vary between subjects. As a good example, in the Alzheimer's Disease Neuroimaging Initiative (ADNI) study, magnetic resonance imaging (MRI) data which are generally high‐dimensional are acquired during the scheduled follow‐up visits at 6‐month intervals, for example, months = 6, 12, 18,…, 144. However, participants may arbitrarily miss a few of their pre‐scheduled visits due to various reasons. As a result, the number of repeated measurements varies by subject, causing an irregular and sparse structure in the data. Such data intrinsic characteristics should be incorporated into the procedure of imbalanced classification, but the increasing difficulty and challenge in implementation is then expected. 
To deal with imbalanced classification, one popular approach in the literature is the data‐based approach, which aims at re‐balancing the class distribution by simply resampling the data, such as undersampling the majority class or oversampling the minority class. Nevertheless, this method may either cause a loss of information in the majority class or overuse the data from the minority class. Another popular approach is the algorithm‐based approach, which mainly depends on the choice of an appropriate inductive bias. For instance, different penalties are assigned to different classes in the support vector machine (SVM)‐based classifiers. But this type of approach often requires a thorough knowledge of the learning algorithm and the specific application domain, which may be a daunting task for analysts. A third approach is the cost‐sensitive approach, which considers the varying costs of different misclassification types; however, these costs are usually unknown in practice. Other remedies for imbalanced classification are mainly boosting‐based ensemble methods proposed in the area of data science. The boosting algorithms in these methods are basically centered around the combination of several simple classifiers/approaches in order to modify the training data sets for better prediction. To address the classification for high‐dimensional data, several approaches have been proposed over the past decade. For example, Fan and Fan proposed the features annealed independence rules (FAIR) to select the most important features via a two‐sample t‐test. Fan and Song established a maximum‐marginal‐likelihood‐type approach for feature screening. Mai and Zou developed the Kolmogorov filter, which enjoys the sure screening property, to identify statistically significant variables. The fundamental idea of this filter is to construct a specific rule for dimension reduction and use the screened features for subsequent analysis.
With application to high‐dimensional omics data, Yu and Park proposed an AUC‐based approach with penalization such as LASSO and elastic net. Nonetheless, these methods are not capable of dealing with the longitudinal and/or imbalanced structure in data. To handle the classification for longitudinal data, Tomasko et al and Marshall and Barón proposed a modified classical linear discriminant analysis using mixed‐effects models to accommodate the underlying associations over time. De La Cruz‐Mesia and Quintana considered a nonlinear hierarchical structure to accommodate the longitudinal profiles and developed a fully Bayesian approach for parameter estimation. More recently, Arribas‐Gil et al considered a semiparametric linear mixed‐effects model (SLMM) and proposed a unified estimation procedure based on a penalized EM‐type algorithm. However, these methods usually require specific distributional assumptions on biomarkers. Each of the methods above addresses only part of the issues raised by complex imbalanced data. To the best of our knowledge, there is no single approach yet that can accommodate all the aforementioned complications comprehensively. As we are motivated by the ADNI study, it is of particular interest to detect Alzheimer's disease (AD) earlier with all available patient data. Early detection and diagnosis of AD have become increasingly critical for developing future care and treatment, because early intervention with medications may slow the progression of disease and provide more opportunities for medical caregivers to gain more understanding about AD and plan for the future. To delay the onset or slow the progression of AD through timely intervention, a prognostic model that can be used for early detection is therefore urgently needed. However, the prevalence of AD in the US elderly population (aged 65 years and older) is 11%, meaning that the class distribution is expected to be skewed and imbalanced.
Indeed, we observe such a highly skewed distribution in the ADNI data. In the same dataset, we also observe that the brain imaging data, which are high‐dimensional and longitudinal, are collected irregularly and sparsely, which further escalates the challenge of classification as mentioned previously. In this article, we propose a two‐stage approach to overcome these challenges in classification for complex imbalanced data. Specifically, the techniques of functional principal component analysis (FPCA) are employed for feature extraction from longitudinal biomarkers, and then the univariate exponential loss function coupled with group LASSO penalty is used to approximate the empirical area under the receiver operating characteristic curve (ie, AUC) in high‐dimensional settings. In other words, the longitudinal data are first analyzed by FPCA with a significant reduction in their longitudinal dimension, and then the major principal components, which are treated as the extracted features, are used for classification using the proposed AUC‐type classifier with group LASSO penalty for feature selection. Finally, the block‐wise coordinate descent algorithm is adopted in the process of model estimation. Our approach can substantially improve the sensitivity, which is often very low for imbalanced data, and relieve the computational complexities under such a sophisticated data structure. For illustration, we apply our approach to ADNI data for early detection of AD. We mainly focus on the participants who are diagnosed as cognitively normal (CN) at baseline but convert to AD at a later time point. To this end, our model is trained to identify the AD patients using only the data collected right before their first diagnosis of AD. In other words, our approach can identify, at an early stage, high‐risk patients who go on to develop AD in the near future. The rest of this article is organized as follows.
In Section 2, we briefly introduce the FPCA approach which is often used for dimension reduction in functional data analysis and then present the proposed AUC‐type classification framework. In Section 3, we illustrate the proposed classification method using the ADNI data including longitudinal brain imaging data and clinical biomarker data. In Section 4, we conduct extensive simulations to evaluate the performance of the proposed method in finite sample size. Finally, conclusion and possible extensions are discussed in Section 5.

MAIN FRAMEWORK

Our method is a two‐stage approach which first involves the use of FPCA to address the longitudinal structure and then uses the proposed AUC‐type classifier coupled with group LASSO penalty to improve imbalanced classification. In this section, we briefly introduce the FPCA and empirical AUC, and then present the AUC‐type classifier with group LASSO penalty for appropriate variable selection under class imbalance.

Functional principal component analysis

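To perform FPCA on irregular and sparse longitudinal data, we adopt the version of FPCA proposed by Yao et al, referred to as Principal components Analysis through Conditional Expectation (PACE). Unlike classical FPCA, their approach is particularly useful for modeling irregular and sparse longitudinal data. PACE ensures that the functional principal component (FPC) scores extracted from the longitudinal features of each subject are well approximated even when only a few measurements are available for a subject. These FPC scores can then be treated as important features/biomarkers summarized from the longitudinal profiles of the corresponding subjects and used for classification subsequently. Assume that $X_{ij}(t)$ is the longitudinal trajectory of the $j$th predictor of the $i$th subject with $t \in \mathcal{T}$. Let $\mu_j(t)$ be its mean function and $G_j(s,t) = \mathrm{cov}\{X_{ij}(s), X_{ij}(t)\}$ denote the covariance function, which quantifies the correlation between time points $s$ and $t$. According to the spectral decomposition, the covariance function can be written as
$$G_j(s,t) = \sum_{k=1}^{\infty} \lambda_{jk}\,\phi_{jk}(s)\,\phi_{jk}(t),$$
where $\lambda_{jk}$ are nonincreasing eigenvalues, that is, $\lambda_{j1} \ge \lambda_{j2} \ge \cdots \ge 0$, and $\phi_{jk}$ are the corresponding orthonormal eigenfunctions. Using the Karhunen‐Loève (KL) expansion, $X_{ij}(t)$ can be expressed as
$$X_{ij}(t) = \mu_j(t) + \sum_{k=1}^{\infty} \xi_{ijk}\,\phi_{jk}(t),$$
where $\xi_{ijk}$ are uncorrelated random variables with mean zero and variance $\lambda_{jk}$. In practice, $X_{ij}(t)$ is usually approximated by the first $K$ eigenfunctions as
$$X_{ij}(t) \approx \mu_j(t) + \sum_{k=1}^{K} \xi_{ijk}\,\phi_{jk}(t),$$
where $K$ can be determined by the pre‐specified percentage of variance explained (PVE). Specifically, the value of $K$ is often chosen as the smallest integer such that $\sum_{k=1}^{K}\lambda_{jk} \big/ \sum_{k=1}^{\infty}\lambda_{jk} \ge$ PVE. In general, $X_{ij}(t)$ is often observed at irregular and sparse time points. Suppose $Y_{ijl}$ is a random observation of $X_{ij}(t_{ijl})$, $l = 1, \ldots, m_{ij}$; we have
$$Y_{ijl} = X_{ij}(t_{ijl}) + \varepsilon_{ijl} = \mu_j(t_{ijl}) + \sum_{k=1}^{\infty} \xi_{ijk}\,\phi_{jk}(t_{ijl}) + \varepsilon_{ijl},$$
where $\varepsilon_{ijl}$ is the measurement error with mean zero and variance $\sigma_j^2$. By applying PACE to the $j$th longitudinal predictor in the pooled data, the estimated mean function $\hat{\mu}_j(t)$, covariance function $\hat{G}_j(s,t)$, eigenvalues $\hat{\lambda}_{jk}$, eigenfunctions $\hat{\phi}_{jk}(t)$ and error variance $\hat{\sigma}_j^2$ can be obtained hierarchically. Specifically, $\hat{\mu}_j(t)$ and $\hat{G}_j(s,t)$ are first estimated using the penalized spline fit and moments approaches as described in the articles of Staniswalis and Lee and of Yao et al. Then $\hat{\lambda}_{jk}$ and $\hat{\phi}_{jk}(t)$ can be obtained from the spectral decomposition of the estimated $\hat{G}_j(s,t)$. The estimated error variance $\hat{\sigma}_j^2$ is calculated from the average difference of the middle 60% of diagonal elements between the raw and estimated covariance matrices. Finally, the FPC scores $\hat{\xi}_{ijk}$ for the $i$th subject are estimated as follows:
$$\hat{\xi}_{ijk} = \hat{\lambda}_{jk}\,\hat{\boldsymbol{\phi}}_{ijk}^{\top}\,\hat{\boldsymbol{\Sigma}}_{ij}^{-1}\,(\mathbf{Y}_{ij} - \hat{\boldsymbol{\mu}}_{ij}),$$
where $\mathbf{Y}_{ij} = (Y_{ij1}, \ldots, Y_{ijm_{ij}})^{\top}$, $\hat{\boldsymbol{\phi}}_{ijk}$ and $\hat{\boldsymbol{\mu}}_{ij}$ are $m_{ij} \times 1$ vectors holding $\hat{\phi}_{jk}$ and $\hat{\mu}_j$ evaluated at the observation times of the subject, and $\hat{\boldsymbol{\Sigma}}_{ij}$ is an $m_{ij} \times m_{ij}$ matrix whose $(l, l')$ entry is $\hat{G}_j(t_{ijl}, t_{ijl'}) + \hat{\sigma}_j^2$ if $l = l'$ and $\hat{G}_j(t_{ijl}, t_{ijl'})$ if $l \ne l'$, with $l, l' = 1, \ldots, m_{ij}$. Note that all these FPC scores can be obtained by using the fpca.sc function in the R package refund, and $K$ can be determined by setting a specific value for PVE, such as 90%, 95%, or 99%. Based on what we have observed from the simulations and real data analyses, using $K = 2$ is generally sufficient to characterize the longitudinal data and can simplify the process of extracting features from longitudinal biomarkers using FPCA. In a sensitivity study (not shown here), we notice that the classification performance of our proposed method is largely unaffected by the selection of $K$, showing only very mild differences in performance. Therefore, we adopt $K = 2$ for all simulations and real data analysis throughout the article. After obtaining these FPC scores, a classification procedure can be applied subsequently.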
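The conditional‐expectation score formula above can be illustrated with a minimal Python sketch. It assumes the mean function, eigenfunctions, eigenvalues, and error variance have already been estimated and evaluated at one subject's visit times; the function name `pace_scores` and the toy inputs are hypothetical, and the covariance is rebuilt from the first K eigen‐pairs rather than taken from a smoothed estimate as in PACE itself.

```python
import numpy as np

def pace_scores(y, mu, phi, lam, sigma2):
    """Conditional-expectation FPC scores for one subject and one predictor.

    y      : (m,) observed values at the subject's m visit times
    mu     : (m,) estimated mean function at those times
    phi    : (m, K) estimated eigenfunctions at those times
    lam    : (K,) estimated eigenvalues
    sigma2 : estimated measurement-error variance
    """
    # Assumption: covariance rebuilt from the first K eigen-pairs.
    G = (phi * lam) @ phi.T
    # Observation covariance: G on all entries, plus sigma2 on the diagonal.
    Sigma = G + sigma2 * np.eye(len(y))
    # xi_k = lam_k * phi_k^T Sigma^{-1} (y - mu), computed for all k at once.
    return lam * (phi.T @ np.linalg.solve(Sigma, y - mu))

# Hypothetical subject with three irregular visits and K = 2 components.
t = np.array([0.0, 0.5, 1.5])
phi = np.column_stack([np.ones_like(t), t])
lam = np.array([2.0, 0.5])
mu = np.zeros_like(t)
y = mu + phi @ np.array([1.0, -0.5])
print(pace_scores(y, mu, phi, lam, sigma2=0.1))
```

Because the scores are shrunk through the inverse of the observation covariance, a subject with very few visits still receives a finite score for every component, which is the property that makes this estimator suitable for sparse designs.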

Empirical AUC and its surrogate losses

The area under the receiver operating characteristic (ROC) curve, that is, the AUC, is a well‐known rank‐based statistic and is frequently used to evaluate the performance of a classifier. The AUC summarizes both the sensitivity (or true positive rate, TPR) and 1‐specificity (or the false positive rate, FPR) and reflects all possible trade‐offs between TPR and FPR by varying the decision threshold. Thus, maximizing the AUC is indeed a process of searching for an optimal threshold that leads to both optimal sensitivity and specificity. Because the AUC represents the probability that a randomly selected positive instance has a higher score than a randomly chosen negative instance, it is insensitive to class prevalence and misclassification costs under data imbalance. After extracting FPC scores from the trajectories of all biomarkers, we can combine them linearly, as in other traditional AUC‐based approaches, to improve prognostic accuracy. The ultimate goal of our study is to find the optimal linear combination of these FPC scores so that the empirical AUC is maximized even under the complex and imbalanced data structure, hence achieving optimal sensitivity and specificity. Let $\boldsymbol{\xi}^{h}_{i}$ and $\boldsymbol{\xi}^{d}_{i'}$ be $p$‐dimensional vectors containing all FPC scores for the $i$th and $i'$th subjects in the health and disease groups, respectively, where $i = 1, \ldots, n_h$, $i' = 1, \ldots, n_d$, and $n_h$ and $n_d$ denote the number of subjects in the two groups, respectively. Given any coefficient vector $\boldsymbol{\beta}$, the empirical AUC for multiple FPC scores can be estimated as follows:
$$\widehat{\mathrm{AUC}}(\boldsymbol{\beta}) = \frac{1}{n_h n_d} \sum_{i=1}^{n_h} \sum_{i'=1}^{n_d} I\!\left(\boldsymbol{\beta}^{\top}\boldsymbol{\xi}^{d}_{i'} > \boldsymbol{\beta}^{\top}\boldsymbol{\xi}^{h}_{i}\right),$$
where $I(\cdot)$ is the indicator function. However, this estimated empirical AUC cannot be used directly for classification in high‐dimensional settings because of computational concerns. Due to the discontinuity and non‐convexity of the empirical AUC, a widely used technique for circumventing the computational challenge is to approximate the empirical AUC with some pairwise convex surrogate loss function. However, this usually necessitates pairwise comparisons between positive and negative instances, resulting in quadratic computational complexity. To alleviate the computational burden associated with pairwise surrogate losses, several non‐pairwise strongly proper losses, such as the exponential loss and squared hinge loss, have been proposed and shown to be consistent with the AUC maximization task. In addition, Gao and Zhou developed a sufficient condition for AUC consistency and established the equivalence of the univariate exponential accuracy loss and the pairwise exponential surrogate accuracy loss. As a result, using the empirical AUC or the univariate exponential loss in classification is expected to be equivalent in terms of performance. Thus, we use the univariate exponential loss to develop the proposed AUC‐type classifier.
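To make the computational contrast concrete, the following Python sketch computes the empirical AUC by explicit pairwise comparison and evaluates an exponential surrogate in two ways. All function names are illustrative, and the inputs are linear scores (the FPC scores already combined with a coefficient vector) rather than the raw FPC scores. The last function relies only on the elementary identity that the average of exp(-(s_d - s_h)) over all pairs equals the product of the two group averages; it is offered as intuition for why a non‐pairwise loss can be evaluated in linear time, not as the argument used by Gao and Zhou.

```python
import numpy as np

def empirical_auc(scores_h, scores_d):
    """Share of (health, disease) pairs where the disease score is higher."""
    diff = scores_d[:, None] - scores_h[None, :]   # n_d x n_h comparisons
    return float(np.mean(diff > 0))

def pairwise_exp_loss(scores_h, scores_d):
    """Pairwise exponential surrogate: needs all n_h * n_d differences."""
    diff = scores_d[:, None] - scores_h[None, :]
    return float(np.mean(np.exp(-diff)))

def factorized_exp_loss(scores_h, scores_d):
    """Same value from two single passes, since exp(-(a - b)) = exp(-a) * exp(b)."""
    return float(np.mean(np.exp(-scores_d)) * np.mean(np.exp(scores_h)))

# Hypothetical linear scores for three healthy and two diseased subjects.
h = np.array([-1.0, 0.0, 0.5])
d = np.array([0.2, 1.0])
print(empirical_auc(h, d))
print(pairwise_exp_loss(h, d), factorized_exp_loss(h, d))
```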

The proposed AUC‐type classification framework

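In light of the established equivalence between minimizing the univariate exponential loss and maximizing the empirical AUC, the loss function used in our approach is given as follows to address the issue of class imbalance:
$$L(\boldsymbol{\beta}) = \frac{1}{n} \sum_{i=1}^{n} \exp\!\left(-y_i\,\boldsymbol{\xi}_i^{\top}\boldsymbol{\beta}\right), \qquad (1)$$
where $\boldsymbol{\xi}_i$ is a $p \times 1$ vector containing all FPC scores of the $i$th subject, $y_i$ is the corresponding response with binary outcomes, that is, $y_i = 1$ if positive and $y_i = -1$ if negative, and $n$ denotes the total number of subjects with $n = n_h + n_d$. Notice that each biomarker trajectory of a subject is summarized as a set of FPC scores. Thus, this set of scores is treated as a grouped feature. Owing to the high‐dimensional setting, we adopt the group lasso penalty proposed by Yuan and Lin to accommodate the grouping structure and perform group‐feature selection. The objective function can be written as:
$$Q(\boldsymbol{\beta}) = L(\boldsymbol{\beta}) + \lambda \sum_{j=1}^{J} \sqrt{p_j}\,\lVert \boldsymbol{\beta}_j \rVert_2, \qquad (2)$$
where $\boldsymbol{\beta}_j$ is a coefficient vector corresponding to the $j$th grouped feature, $p_j$ is the number of FPC scores within the $j$th group, $J$ is the total number of groups, $\lambda$ is the tuning parameter, and $\lVert \cdot \rVert_2$ is the $\ell_2$ norm. Here, $\sqrt{p_j}$ is used to adjust for the varying group sizes. Note that the tuning parameter $\lambda$ can be determined using a $k$‐fold cross‐validation with the empirical AUC or the univariate exponential loss, which are equivalent in terms of classification performance (see Section 2.2). Because the AUC is easy to interpret, we use the empirical AUC as the criterion for all simulations and real data analysis throughout the article. The choice of the number of folds $k$ generally involves a trade‐off between bias and variance. To be more precise, a large value of $k$ typically results in small bias but large variance when evaluating the model performance, whereas a small value of $k$ results in relatively large bias but small variance. The most commonly used values are $k$ = 3, 5, or 10. Considering the small sample size in the disease group under data imbalance, we adopt a five‐fold cross‐validation in the following analyses, which not only achieves the bias‐variance trade‐off but also generates a moderate‐sized hold‐out fold for validation. In general, one may select a proper $k$‐fold cross‐validation based on the sample size and the severity of imbalance.
To solve for the $\boldsymbol{\beta}$ that minimizes Equation (2), we employ a quadratic approximation similar to that in the article of Simon et al. Let $\boldsymbol{\eta} = \mathbf{X}\boldsymbol{\beta}$, where $\mathbf{X}$ is the $n \times p$ design matrix, and let $L'(\boldsymbol{\beta})$, $L''(\boldsymbol{\beta})$, $L'(\boldsymbol{\eta})$, $L''(\boldsymbol{\eta})$ be the gradient and Hessian of the loss function in Equation (1) with respect to $\boldsymbol{\beta}$ and $\boldsymbol{\eta}$, respectively. Using a second‐order Taylor expansion centered at the initial value $\tilde{\boldsymbol{\beta}}$, Equation (1) becomes:
$$L(\boldsymbol{\beta}) \approx L(\tilde{\boldsymbol{\beta}}) + (\boldsymbol{\beta} - \tilde{\boldsymbol{\beta}})^{\top} L'(\tilde{\boldsymbol{\beta}}) + \tfrac{1}{2} (\boldsymbol{\beta} - \tilde{\boldsymbol{\beta}})^{\top} L''(\tilde{\boldsymbol{\beta}}) (\boldsymbol{\beta} - \tilde{\boldsymbol{\beta}}) = \tfrac{1}{2} \{\mathbf{z}(\tilde{\boldsymbol{\eta}}) - \mathbf{X}\boldsymbol{\beta}\}^{\top} L''(\tilde{\boldsymbol{\eta}}) \{\mathbf{z}(\tilde{\boldsymbol{\eta}}) - \mathbf{X}\boldsymbol{\beta}\} + C(\tilde{\boldsymbol{\eta}}, \tilde{\boldsymbol{\beta}}),$$
where $\tilde{\boldsymbol{\eta}} = \mathbf{X}\tilde{\boldsymbol{\beta}}$, $\mathbf{z}(\tilde{\boldsymbol{\eta}}) = \tilde{\boldsymbol{\eta}} - L''(\tilde{\boldsymbol{\eta}})^{-1} L'(\tilde{\boldsymbol{\eta}})$, and $C(\tilde{\boldsymbol{\eta}}, \tilde{\boldsymbol{\beta}})$ consists of all terms that do not depend on $\boldsymbol{\beta}$. Then, $\boldsymbol{\beta}$ can be estimated by optimizing a penalized reweighted least squares:
$$M(\boldsymbol{\beta}) = \frac{1}{2n} \sum_{i=1}^{n} w_i \left(z_i - \boldsymbol{\xi}_i^{\top}\boldsymbol{\beta}\right)^2 + \lambda \sum_{j=1}^{J} \sqrt{p_j}\,\lVert \boldsymbol{\beta}_j \rVert_2,$$
where, for the exponential loss in Equation (1), $w_i = \exp(-y_i \tilde{\eta}_i)$ and $z_i = \tilde{\eta}_i + y_i$. The objective function $M(\boldsymbol{\beta})$ consists of a quadratic term and the group lasso penalty. The quadratic term can be viewed as squared errors in the estimated $\boldsymbol{\eta}$ between the current and previous iterations. As we aim to minimize $M(\boldsymbol{\beta})$, the estimator $\hat{\boldsymbol{\beta}}$ is viewed as a solution with the least squared error to maximize the empirical AUC. The group lasso penalty term intrinsically ensures that only a subset of “group” features are selected, thus significantly reducing the model complexity. Each $\boldsymbol{\beta}_j$ can be estimated iteratively by the block coordinate descent algorithm presented by Yuan and Lin. Specifically, to solve for the coefficient vector $\boldsymbol{\beta}_j$ for the $j$th grouped feature, we first compute the corresponding first derivative of $M(\boldsymbol{\beta})$ as:
$$\frac{\partial M(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}_j} = -\frac{1}{n}\,\mathbf{X}_j^{\top}\mathbf{W}\Big(\mathbf{z} - \sum_{j' \ne j} \mathbf{X}_{j'}\boldsymbol{\beta}_{j'} - \mathbf{X}_j\boldsymbol{\beta}_j\Big) + \lambda\sqrt{p_j}\,\mathbf{s}_j, \qquad (3)$$
where $\mathbf{X}_j$ and $\mathbf{X}_{j'}$ are the data matrices corresponding to the $j$th and $j'$th grouped features respectively, $p_j$ is the group size of the $j$th grouped feature, $\mathbf{W} = \mathrm{diag}(w_1, \ldots, w_n)$, and $\mathbf{s}_j = \boldsymbol{\beta}_j / \lVert \boldsymbol{\beta}_j \rVert_2$ if $\boldsymbol{\beta}_j \ne \mathbf{0}$, while $\mathbf{s}_j$ is any vector with $\lVert \mathbf{s}_j \rVert_2 \le 1$ if $\boldsymbol{\beta}_j = \mathbf{0}$. Next, by setting Equation (3) to zero, we can obtain $\hat{\boldsymbol{\beta}}_j$. Write $\mathbf{r}_{(-j)} = \mathbf{z} - \sum_{j' \ne j} \mathbf{X}_{j'}\boldsymbol{\beta}_{j'}$ for the partial residual. Specifically, when $\lVert n^{-1}\mathbf{X}_j^{\top}\mathbf{W}\mathbf{r}_{(-j)} \rVert_2 > \lambda\sqrt{p_j}$, we can get:
$$\hat{\boldsymbol{\beta}}_j = \Big(\frac{1}{n}\mathbf{X}_j^{\top}\mathbf{W}\mathbf{X}_j + \frac{\lambda\sqrt{p_j}}{\lVert \hat{\boldsymbol{\beta}}_j \rVert_2}\,\mathbf{I}\Big)^{-1} \frac{1}{n}\,\mathbf{X}_j^{\top}\mathbf{W}\mathbf{r}_{(-j)};$$
when $\lVert n^{-1}\mathbf{X}_j^{\top}\mathbf{W}\mathbf{r}_{(-j)} \rVert_2 \le \lambda\sqrt{p_j}$, it is easy to obtain:
$$\hat{\boldsymbol{\beta}}_j = \mathbf{0}.$$
Hence, by cycling through each group of FPC scores, simultaneous variable selection and model estimation can be achieved via the following Algorithm 1. It is worth mentioning that the above algorithm is guaranteed to converge to the global minimum of the proposed objective function when initialized with an arbitrary value for $\boldsymbol{\beta}$. The detailed convergence analysis has been thoroughly discussed by Tseng. To reduce the number of required iterations and increase the computational efficiency in high‐dimensional sparse settings, we suggest initializing $\boldsymbol{\beta}$ with a vector of small values, as we do in this article. Because the regularization uses the group lasso penalty, variable selection is conducted at the group level. Specifically, each set of FPC scores represents one longitudinal biomarker. Therefore, the scores extracted from a particular biomarker can only be all selected or all dropped, depending on whether the associated biomarker is important to the model. To speed up the computation, we employ a strategy called active‐set convergence, which has been discussed in the articles of Krishnapuram et al, Meier et al, and Friedman et al. Specifically, after the first cycle through all $J$ groups, the remaining iterations are restricted to the active set, which is updated after each cycle. The entire process stops when the active set no longer changes.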
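The estimation procedure described above can be sketched in Python under several stated assumptions; it is not a transcription of Algorithm 1. The sketch uses the exponential loss with labels coded +1/-1, omits an intercept and the active‐set strategy, starts from a zero vector, replaces the exact block solution with a proximal (majorized) gradient step on each group, and adds a step‐halving safeguard after each quadratic approximation. The names `objective` and `fit_group_lasso_exp` are hypothetical.

```python
import numpy as np

def objective(beta, X, y, groups, lam):
    """Exponential loss plus group lasso penalty with sqrt(group size) weights."""
    loss = np.mean(np.exp(-y * (X @ beta)))
    pen = sum(np.sqrt(len(g)) * np.linalg.norm(beta[g]) for g in groups)
    return loss + lam * pen

def fit_group_lasso_exp(X, y, groups, lam, n_outer=50, n_inner=200, tol=1e-8):
    n, p = X.shape
    beta = np.zeros(p)                       # assumption: zero start
    for _ in range(n_outer):
        eta = X @ beta
        w = np.exp(-y * eta)                 # weights from the Hessian in eta
        z = eta + y                          # working response
        cand = beta.copy()
        for _ in range(n_inner):             # block coordinate descent on the
            prev = cand.copy()               # penalized reweighted least squares
            for g in groups:
                Xg = X[:, g]
                resid = z - X @ cand
                grad = Xg.T @ (w * resid) / n            # minus the block gradient
                lip = np.linalg.norm(Xg.T @ (w[:, None] * Xg) / n, 2)
                u = cand[g] + grad / lip                 # gradient step on the block
                thr = lam * np.sqrt(len(g)) / lip
                nu = np.linalg.norm(u)
                # Group soft-thresholding: the whole group is kept or dropped.
                cand[g] = 0.0 if nu <= thr else (1.0 - thr / nu) * u
            if np.max(np.abs(cand - prev)) < tol:
                break
        old = objective(beta, X, y, groups, lam)
        for _ in range(30):                  # safeguard: halve the step if needed
            if objective(cand, X, y, groups, lam) <= old:
                break
            cand = 0.5 * (cand + beta)
        done = np.max(np.abs(cand - beta)) < tol
        beta = cand
        if done:
            break
    return beta

# Hypothetical data: 170 healthy, 30 diseased; group 1 is informative, group 2 is noise.
rng = np.random.default_rng(0)
y = np.r_[-np.ones(170), np.ones(30)]
X = rng.normal(size=(200, 4))
X[:, :2] += y[:, None]
groups = [[0, 1], [2, 3]]
print(fit_group_lasso_exp(X, y, groups, lam=0.1))
```

The group soft‐thresholding step is where selection happens: a group whose weighted correlation with the partial residual falls below its threshold is set to exactly zero, which mirrors the all‐or‐none selection of a biomarker's FPC scores described in the text.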

ALZHEIMER'S DISEASE DATA

Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu). The ADNI was launched in 2003 as a public‐private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD). Detailed information regarding the ADNI study and the complete protocol can be found in the articles of Mueller et al and Jack et al. In the ADNI data, participants are labeled with: cognitively normal (CN), MCI, or AD based on a series of assessments at their initial visits. These are also their states at baseline. It is expected to have repeated evaluations conducted subsequently at a 6‐month interval. Most existing studies focused on predicting the conversion from MCI to AD for individuals who were diagnosed as MCI at baseline. However, the conversion process could begin years before the onset of symptoms. In our analysis, we focus on the development of a prognostic model that can be used for early detection of AD among CN individuals. We select 267 subjects who are normal at baseline and have at least three visits. Among them, 30 subjects progress to AD at a later time, denoted as AD, and 237 subjects remain as normal, denoted as CN. The demographic information of those subjects is summarized in Table 1. It should be noted that the longitudinal data are indeed irregularly observed among participants. Specifically, each participant undergoes these assessments at different time points and has a different number of visits. The distribution of number of visits is presented in Table 2.

TABLE 1

Demographic characteristics of selected subjects

		Age (years)		Gender (%)
Group	n	Mean	SD	Male	Female
CN	237	74.5	5.6	52.7	47.3
AD	30	75.4	3.9	40.0	60.0

TABLE 2

Distribution of number of visits

	Number of subjects
Visits	CN	AD
3	68	2
4	100	4
5	13	3
6	10	5
7	13	2
8	14	4
9	10	7
10	9	3
Total	237	30

In the literature, biomarkers from different modalities have been utilized to investigate the progression of AD. Brain abnormalities detected by MRI are considered to be valid markers of AD and are widely used to predict the conversion from MCI to AD. Fluorodeoxyglucose positron emission tomography (FDG‐PET) is able to provide estimates of cerebral metabolic rates of glucose, thus revealing the pattern of regional hypometabolism which is a prominent hallmark of AD. Additionally, biomedical changes in the brain are directly presented in the cerebrospinal fluid (CSF). Hence, CSF‐based biomarkers are often employed to depict the pathological changes of AD. In this study, we mainly focus on biomarkers that are extracted from the MRI modality. All of the 3D T1‐weighted MRI images downloaded from the ADNI database for each subject are processed using Freesurfer (v6.0.0, Martinos Center for Biomedical Imaging), which is an open‐source software suite and freely available at FreeSurferWiki (https://surfer.nmr.mgh.harvard.edu/fswiki/FreeSurferWiki). The longitudinal processing pipeline in Freesurfer consists of the following steps: spatial normalization and intensity correction, Talairach registration, brain mask creation, subcortical segmentation, surface reconstruction, and cortical atlas registration and parcellation. More details about the processing framework can be found in the article of Reuter. There are 319 biomarkers in total generated by Freesurfer v6.0.0, with each corresponding to a specific region of interest (ROI) in the brain. More specifically, these ROIs consist of cortical volume, cortical thickness average, cortical surface area, and the volume estimates of a wide range of subcortical structures.
In addition to those biomarkers extracted from the brain imaging data, we also include five cognitive and functional scores which are closely associated with AD and popular in the literature: Alzheimer's Disease Assessment Scale‐Cognitive 13 items (ADAS‐Cog 13), Mini Mental State Examination (MMSE), Functional Assessment Questionnaire (FAQ), and Rey Auditory Verbal Learning Tests (RAVLT immediate score and RAVLT learning score). Besides that, other demographic and genetic variables that might be predictive of AD conversion are also included: baseline age, gender, and apolipoprotein E allele 4 (APOE4). Figure 1 presents the longitudinal trajectories of ADAS‐Cog 13 for subjects used in this study, showing the sparse and irregular characteristics of the ADNI dataset. The trends in the two plots suggest the potential of using ADAS‐Cog 13 to identify AD patients among these normal subjects at baseline.

FIGURE 1

Longitudinal trajectories of ADAS‐Cog 13 for cognitively normal subjects and AD patients

For model training, the last‐visit data of each CN subject are excluded. For AD patients, we use only the data collected before their first diagnosis of AD, so that the model is trained solely on data preceding the progression to AD. In this way, our model is capable of identifying potential AD patients before their next clinical visit. As an illustration, the data of a CN (or an AD) participant that are used for model training are shown in Figure 2 with a red box.

FIGURE 2

Clinical diagnosis of a CN subject or an AD patient over time. The red box represents the data used for model training. The blue box represents the final diagnosis used as the membership outcome

For the model evaluation, the processed data are randomly split into training and test subsets, comprising 70% and 30% of all instances respectively. A stratified sampling method is employed to ensure that both subsets have the same imbalance ratio. To deal with the longitudinal biomarkers, the PACE algorithm proposed by Yao et al is applied to obtain the corresponding FPC scores, which are then used as predictors in our model. The tuning parameter in the proposed method is selected by five‐fold cross‐validation using the empirical AUC as the criterion. For comparison purposes, logistic regression with an L1 penalty and SVM with a linear kernel are also fitted to this ADNI dataset. The results based on 500 Monte Carlo replicates are given in Table 3. It is worth noting that the class distribution is highly imbalanced in this ADNI dataset (ie, CN=237, AD=30). Both penalized logistic regression and SVM are biased towards the majority class, leading to low test‐set sensitivities of 36% and 44%, respectively. Moreover, SVM appears to overfit in the high‐dimensional setting: it fits the training data almost perfectly but performs poorly on the test data. In contrast, our proposed approach is capable of dealing with class imbalance and achieves superior classification performance, especially in terms of sensitivity, which is often considered an important measure in medical diagnosis. As shown in Table 3, the proposed framework outperforms logistic regression and linear SVM on the test set in terms of AUC and sensitivity (88% and 79%, respectively) with a slight compromise in specificity, which indicates the superiority of our method for such a complex imbalanced dataset. Finally, our approach indicates that several biomarkers selected by group LASSO seem to be associated with early detection of AD.
For example, the biomarkers with high absolute value of coefficient include: FAQ and ADAS in clinical scores; left and right postcentral gyrus, left precentral gyrus in subcortical volumes; left postcentral gyrus, right medial orbitofrontal cortex, right supramarginal gyrus, right pericalcarine cortex in cortical thicknesses. Albeit interesting, more thorough investigations from the view of neuroscience are strongly encouraged before coming to any further conclusions.
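The stratified 70%/30% split used for model evaluation can be sketched as follows. This is a minimal illustration, assuming the split is drawn separately within each class and that class sizes are rounded to the nearest integer; the function name `stratified_split` and the 0/1 label coding are hypothetical. With 237 CN and 30 AD subjects, this rounding yields the group sizes listed in Table 3.

```python
import numpy as np

def stratified_split(y, train_frac=0.7, seed=0):
    """Split indices so that each class contributes train_frac of its subjects."""
    rng = np.random.default_rng(seed)
    train_idx = []
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        n_train = int(round(train_frac * len(idx)))   # assumption: nearest integer
        train_idx.extend(idx[:n_train])
    train_idx = np.sort(np.array(train_idx))
    test_idx = np.setdiff1d(np.arange(len(y)), train_idx)
    return train_idx, test_idx

# Hypothetical labels: 0 = CN (237 subjects), 1 = AD (30 subjects).
y = np.r_[np.zeros(237, dtype=int), np.ones(30, dtype=int)]
train_idx, test_idx = stratified_split(y)
print(np.bincount(y[train_idx]), np.bincount(y[test_idx]))
```

Repeating the split with different seeds gives the Monte Carlo replicates over which performance measures are averaged.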

TABLE 3

Classification results (S.E.) for ADNI data with logistic, linear SVM and the proposed method based on 500 Monte Carlo replicates

		L1 logistic	Linear SVM	Proposed method
Training set	Sensitivity	0.601(0.297)	0.999(0.001)	0.946(0.066)
(nh=166, nd=21)	Specificity	0.999(0.001)	0.999(0.001)	0.973(0.035)
	Accuracy	0.956(0.033)	0.999(0.001)	0.970(0.035)
	AUC	0.918(0.167)	0.999(0.001)	0.976(0.033)
Test set	Sensitivity	0.362(0.199)	0.441(0.154)	0.790(0.145)
(nh=71, nd=9)	Specificity	0.996(0.008)	0.980(0.015)	0.880(0.094)
	Accuracy	0.925(0.022)	0.919(0.023)	0.870(0.084)
	AUC	0.832(0.147)	0.854(0.068)	0.880(0.091)

Abbreviations: L1 logistic, logistic regression with L1 penalty; Linear SVM, support vector machine with linear kernel; (nh, nd), number of subjects in the CN and AD groups respectively.
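As a consistency check on Table 3, overall accuracy at fixed class sizes is a weighted average of sensitivity and specificity, so the reported test‐set accuracies can be recovered from the other two rows. The short Python sketch below does this arithmetic with the rounded table values; the function name is illustrative.

```python
def accuracy_from_rates(sensitivity, specificity, n_health, n_disease):
    """Overall accuracy implied by sensitivity and specificity at fixed class sizes."""
    correct = sensitivity * n_disease + specificity * n_health
    return correct / (n_health + n_disease)

# Test-set means from Table 3 (nh = 71, nd = 9): (sensitivity, specificity, accuracy)
table3_test = {
    "L1 logistic": (0.362, 0.996, 0.925),
    "Linear SVM": (0.441, 0.980, 0.919),
    "Proposed method": (0.790, 0.880, 0.870),
}
for name, (sens, spec, reported) in table3_test.items():
    implied = accuracy_from_rates(sens, spec, 71, 9)
    print(f"{name}: implied {implied:.3f}, reported {reported:.3f}")

# A rule that labels every subject CN has sensitivity 0 and specificity 1.
print(f"All-CN rule: {accuracy_from_rates(0.0, 1.0, 71, 9):.3f}")
```

With 71 CN and 9 AD test subjects, a rule that never predicts AD is already correct for 71 of 80 subjects (about 89%), which is why accuracy alone says little under this level of imbalance and why sensitivity and AUC are the more informative rows of the table.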


SIMULATION STUDY

In this section, we conduct extensive simulations to evaluate the performance of the proposed method. Two data‐generating schemes are considered: (1) class memberships are generated by a logistic regression model; (2) class memberships are pre‐determined by the group to which a subject belongs: health or disease. For each scheme, the classification performance is further assessed under two settings: (i) a low‐dimensional setting with 3 longitudinal predictors and (ii) a high‐dimensional setting with 500 longitudinal predictors. Throughout all simulations, it is assumed that each subject has a longitudinal profile with observations measured at seven different time points, the first of which represents the baseline. We also apply two other popular methods (ie, logistic regression and SVM) at various levels of class imbalance for comparison purposes.

Class memberships determined by a logistic regression model

In the first scheme, we generate class memberships using a logistic regression model. Specifically, it is a two‐stage process. In the first stage, we assume that the $j$th longitudinal predictor for the $i$th subject is generated by a linear model that includes a subject‐specific random effect and a measurement error, each generated from a prespecified distribution. In the second stage, we convert the $j$th longitudinal predictor into a set of FPC scores using the FPCA approach, denoted as $\boldsymbol{\xi}_{ij}$. These sets of FPC scores are considered as features and then used in the subsequent classification procedure. As we extract the FPC scores using the PACE algorithm, the number of principal components is fixed at two, for simplicity, overriding the required setting for PVE. The class memberships are then assigned through the following logistic regression model:
$$\mathrm{logit}(\pi_i) = \beta_0 + \sum_{j=1}^{J} \boldsymbol{\xi}_{ij}^{\top}\boldsymbol{\beta}_j,$$
where $\pi_i$ is the probability that the $i$th subject belongs to the disease group and $\boldsymbol{\beta}_j$ is the coefficient vector corresponding to the $j$th longitudinal predictor. Notice that $\boldsymbol{\beta}_j$ is a $2 \times 1$ vector as we set the number of principal components at two for each longitudinal predictor. The intercept $\beta_0$ can be set to generate different levels of class imbalance. The membership is then coded as “health” or “disease” according to the generated binary outcome. For our analysis, low‐ and high‐dimensional settings are examined separately. For each setting, 500 Monte Carlo replicates are simulated at each imbalance ratio. For each replicate, the data of 600 subjects are generated. Among them, 300 subjects are used for model training and the remaining 300 are used as a test data set for evaluation. Low‐dimensional setting: Three (3) longitudinal predictors are simulated for each subject, each with its own parameter settings in the linear model, and a coefficient vector $\boldsymbol{\beta}_j$ is specified for each predictor to obtain class memberships using the above logistic regression model. The intercept $\beta_0$ is varied to obtain imbalance ratios $n_h/n_d$ of 3.2, 4.9, and 6.1, respectively. The classification results are presented in Table 4. In this setting, the performances of the three methods are comparable in terms of AUC and accuracy.
However, regardless of training or testing, noticeable lower sensitivities are observed in the methods of logistic regression and SVM as the imbalance ratio increases, whereas the sensitivity declines slightly with the proposed method.
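
The role of the intercept in the logistic model above can be illustrated with a minimal sketch. It is not the authors' simulation code: the coefficient values, the intercepts, the use of standard normal draws in place of FPC scores, and the 0/1 coding (1 for "disease") are all assumptions made for illustration only.

```python
import math
import random

def simulate_labels(n_subjects, intercept, coefs, rng):
    """Draw features and Bernoulli labels from a logistic model.

    Features are standard normal stand-ins for FPC scores (an assumption);
    label 1 means "disease" and label 0 means "health" (also an assumption).
    """
    labels = []
    for _ in range(n_subjects):
        x = [rng.gauss(0.0, 1.0) for _ in coefs]
        eta = intercept + sum(b * v for b, v in zip(coefs, x))
        p_disease = 1.0 / (1.0 + math.exp(-eta))
        labels.append(1 if rng.random() < p_disease else 0)
    return labels

def imbalance_ratio(labels):
    """Return n_h / n_d: health subjects divided by disease subjects."""
    n_d = sum(labels)
    n_h = len(labels) - n_d
    return n_h / n_d

rng = random.Random(2022)
# Hypothetical coefficients: 3 predictors x 2 FPC scores each.
coefs = [1.0, -0.5, 0.8, 0.3, -1.2, 0.6]
# Hypothetical intercepts; smaller values make "disease" rarer.
ratios = {b0: imbalance_ratio(simulate_labels(20000, b0, coefs, rng))
          for b0 in (-1.0, -2.0, -3.0)}
for b0, ratio in ratios.items():
    print(f"intercept={b0:+.1f}  n_h/n_d={ratio:.2f}")
```

Lowering the intercept shrinks every subject's disease probability, so the ratio of health to disease subjects grows. A target ratio such as those in Table 4 could be reached by tuning the intercept in this way.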

TABLE 4

Classification results (S.E.) of logistic regression, linear SVM and the proposed method at various imbalance ratios in low‐dimensional setting based on 500 Monte Carlo replicates

Imbalance ratio		nh/nd=3.2			nh/nd=4.9			nh/nd=6.1
		Logistic	SVM	Proposed	Logistic	SVM	Proposed	Logistic	SVM	Proposed
Training	Sensitivity	0.689	0.681	0.873	0.618	0.594	0.884	0.548	0.483	0.896
		(0.061)	(0.071)	(0.035)	(0.081)	(0.100)	(0.039)	(0.108)	(0.156)	(0.041)
	Specificity	0.941	0.945	0.856	0.965	0.970	0.867	0.981	0.987	0.882
		(0.013)	(0.015)	(0.034)	(0.009)	(0.012)	(0.034)	(0.007)	(0.008)	(0.038)
	Accuracy	0.880	0.882	0.860	0.910	0.910	0.870	0.938	0.937	0.883
		(0.019)	(0.019)	(0.025)	(0.016)	(0.016)	(0.029)	(0.013)	(0.013)	(0.034)
	AUC	0.933	0.932	0.932	0.940	0.937	0.937	0.948	0.945	0.945
		(0.016)	(0.016)	(0.016)	(0.017)	(0.018)	(0.017)	(0.017)	(0.019)	(0.018)
Test	Sensitivity	0.672	0.658	0.833	0.594	0.564	0.828	0.507	0.435	0.819
		(0.065)	(0.071)	(0.059)	(0.087)	(0.097)	(0.074)	(0.111)	(0.147)	(0.087)
	Specificity	0.933	0.935	0.840	0.959	0.964	0.857	0.975	0.981	0.871
		(0.021)	(0.021)	(0.039)	(0.015)	(0.016)	(0.037)	(0.012)	(0.013)	(0.040)
	Accuracy	0.870	0.869	0.838	0.900	0.899	0.852	0.927	0.925	0.866
		(0.018)	(0.018)	(0.026)	(0.016)	(0.016)	(0.028)	(0.015)	(0.015)	(0.032)
	AUC	0.923	0.922	0.921	0.929	0.927	0.927	0.934	0.932	0.931
		(0.017)	(0.018)	(0.018)	(0.019)	(0.019)	(0.020)	(0.020)	(0.020)	(0.021)

Abbreviation: nh, nd: number of subjects in the health and disease groups, respectively.
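
The four metrics reported in the tables can be computed as in the following sketch. The function name, the decision threshold of zero, the 0/1 label coding, and the toy scores are illustrative assumptions; the empirical AUC is computed as the share of correctly ranked (disease, health) pairs, with ties counted as one half.

```python
def classification_metrics(labels, scores, threshold=0.0):
    """Sensitivity, specificity, accuracy and empirical AUC.

    labels: 1 = disease (minority), 0 = health (majority).
    scores: larger values indicate stronger evidence of disease.
    """
    disease = [s for y, s in zip(labels, scores) if y == 1]
    health = [s for y, s in zip(labels, scores) if y == 0]
    tp = sum(s > threshold for s in disease)    # disease correctly flagged
    tn = sum(s <= threshold for s in health)    # health correctly cleared
    # Empirical AUC: fraction of (disease, health) pairs ranked correctly,
    # counting ties as one half (the Mann-Whitney statistic).
    wins = sum((d > h) + 0.5 * (d == h) for d in disease for h in health)
    return {
        "sensitivity": tp / len(disease),
        "specificity": tn / len(health),
        "accuracy": (tp + tn) / len(labels),
        "auc": wins / (len(disease) * len(health)),
    }

# Toy example: 8 health subjects and 2 disease subjects.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [-2.0, -1.5, -1.2, -1.0, -0.8, -0.5, -0.3, 0.2, -0.1, 1.0]
metrics = classification_metrics(labels, scores)
print(metrics)
```

In this toy example the accuracy is 0.8 and the AUC is 0.9375, yet the sensitivity is only 0.5. This mirrors the pattern seen for logistic regression and SVM in the tables, where high accuracy coexists with low sensitivity under class imbalance.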

High‐dimensional setting: Five hundred (500) longitudinal predictors are simulated for each subject, where the coefficients of that correspond to the th predictor are generated randomly from truncated normal distributions (): For simplicity, we assume that the first five predictors are significant, with each corresponding specified as follows: , , , , . The remaining 495 predictors are assumed to be insignificant and thus have zero coefficients. As in the low‐dimensional setting above, different levels of class imbalance (nh/nd = 3.2, 4.9, and 6.1) are assessed by assigning different values () to the intercept. The simulation results are provided in Table 5. In this setting, the proposed method performs better than the other two approaches on the test data in terms of AUC and sensitivity. Although SVM attains near‐perfect training metrics (0.999), its test AUC is only about 0.64, which suggests overfitting. Logistic regression and SVM tend to classify subjects into the majority class (ie, the health group), resulting in low sensitivity. In contrast, the proposed method achieves a much higher sensitivity at a small cost in specificity and accuracy.
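
The sparse coefficient structure described above can be sketched as follows. Because the distribution parameters and truncation bounds are not recoverable from the text, the mean, standard deviation, and bounds used here are hypothetical placeholders, as are the function names. The two coefficients per predictor follow from fixing the number of principal components at two.

```python
import random

def truncated_normal(mean, sd, low, high, rng):
    """Draw from N(mean, sd^2) restricted to [low, high] by rejection sampling."""
    while True:
        draw = rng.gauss(mean, sd)
        if low <= draw <= high:
            return draw

def grouped_coefficients(n_predictors, n_signal, n_scores, rng):
    """One coefficient group per predictor; only the first n_signal are nonzero."""
    groups = []
    for j in range(n_predictors):
        if j < n_signal:
            # Hypothetical truncated normal: mean 1, sd 0.5, kept within [0.5, 2].
            group = [truncated_normal(1.0, 0.5, 0.5, 2.0, rng)
                     for _ in range(n_scores)]
        else:
            group = [0.0] * n_scores  # insignificant predictor
        groups.append(group)
    return groups

rng = random.Random(5)
beta = grouped_coefficients(n_predictors=500, n_signal=5, n_scores=2, rng=rng)
n_features = sum(len(group) for group in beta)
active = [j for j, group in enumerate(beta) if any(v != 0.0 for v in group)]
print(n_features, active)
```

With 500 predictors and two scores each, the classifier sees 1,000 features but only 300 training subjects, and only 10 of those features carry signal. This is the regime in which a group‐wise penalty that zeroes out whole predictors is expected to help.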

TABLE 5

Classification results (S.E.) of logistic regression, linear SVM and the proposed method at various imbalance ratios in high‐dimensional setting based on 500 Monte Carlo replicates

Imbalance ratio		nh/nd=3.2			nh/nd=4.9			nh/nd=6.1
		Logistic	SVM	Proposed	Logistic	SVM	Proposed	Logistic	SVM	Proposed
Training	Sensitivity	0.705	0.999	0.900	0.659	0.999	0.905	0.549	0.999	0.899
		(0.221)	(0.001)	(0.035)	(0.293)	(0.001)	(0.052)	(0.360)	(0.001)	(0.062)
	Specificity	0.997	0.999	0.894	0.999	0.999	0.891	0.999	0.999	0.888
		(0.005)	(0.001)	(0.037)	(0.002)	(0.001)	(0.050)	(0.001)	(0.001)	(0.061)
	Accuracy	0.928	0.999	0.896	0.942	0.999	0.893	0.938	0.999	0.889
		(0.053)	(0.001)	(0.032)	(0.049)	(0.001)	(0.047)	(0.049)	(0.001)	(0.058)
	AUC	0.982	0.999	0.957	0.982	0.999	0.952	0.944	0.999	0.946
		(0.022)	(0.001)	(0.020)	(0.051)	(0.001)	(0.034)	(0.135)	(0.001)	(0.044)
Test	Sensitivity	0.412	0.221	0.791	0.262	0.109	0.724	0.174	0.063	0.686
		(0.112)	(0.058)	(0.078)	(0.122)	(0.056)	(0.122)	(0.125)	(0.043)	(0.137)
	Specificity	0.968	0.901	0.856	0.982	0.958	0.860	0.989	0.977	0.860
		(0.021)	(0.026)	(0.039)	(0.016)	(0.017)	(0.048)	(0.013)	(0.013)	(0.057)
	Accuracy	0.836	0.740	0.841	0.859	0.813	0.837	0.874	0.848	0.835
		(0.025)	(0.023)	(0.031)	(0.021)	(0.020)	(0.037)	(0.019)	(0.018)	(0.044)
	AUC	0.892	0.645	0.913	0.876	0.640	0.889	0.842	0.635	0.877
		(0.029)	(0.038)	(0.028)	(0.050)	(0.044)	(0.043)	(0.108)	(0.050)	(0.047)

Abbreviation: nh, nd: number of subjects in the health and disease groups, respectively.


Class memberships pre‐determined by health and disease groups

Unlike the previous data‐generating scheme, we generate class memberships without using any model‐based mechanisms. The longitudinal predictors are simulated for the health () and disease () groups separately: where and are the mean functions of the th longitudinal predictor for the th health and the th disease subject, respectively, and are the subject‐specific random effects generated from . The random errors and are generated from . The PACE algorithm is applied to each predictor to extract the FPC scores, which are then used as features in the proposed method. Under this data‐generating scheme, class memberships of all subjects are pre‐determined, that is, if health and if disease.

Under this scheme, we also consider low‐ and high‐dimensional settings, and the classification performance is again examined at different levels of class imbalance. Assuming a total sample size of 300, different numbers of subjects are assigned to the health and disease groups to generate various imbalance ratios, that is, (nh, nd) = (225, 75), (257, 43), and (270, 30), giving ratios nh/nd of 3, approximately 6, and 9, respectively. In each scenario, 500 Monte Carlo replicates are simulated.

Low‐dimensional setting: Three (3) longitudinal predictors are simulated for each of the subjects. For the health group, the mean is assumed to be constant across different time points. Specifically, we set: , , . For the disease group, we let , where , to reflect the progression of the disease. Three sets of used for the data generation are specified as follows: , , and .

High‐dimensional setting: Five hundred (500) longitudinal predictors are simulated for each subject. Among them, the last 475 predictors are considered insignificant and the corresponding mean functions are assumed to be the same for both the health and disease groups, that is, . For the first 25 predictors that are considered significant, the mean functions are generated differently for the health and disease groups. For the health group, the mean is assumed to be constant, that is, , , where is generated from a truncated normal distribution (): For the disease group, we let , where the coefficients that correspond to the th predictor are randomly selected, for each Monte Carlo sample, from truncated normal distributions:

The simulation results are given in Tables 6 and 7. Even under this data‐generating mechanism, the proposed approach outperforms logistic regression and SVM across the levels of class imbalance, most notably in sensitivity, in both the low‐ and high‐dimensional settings. When the class imbalance becomes more severe, the proposed method still achieves a high sensitivity, whereas a substantial drop is observed for the other two methods. For example, at (nh, nd) = (270, 30) in the high‐dimensional setting, the test sensitivity is 0.709 for the proposed method versus 0.131 for logistic regression and 0.064 for SVM. It is also worth mentioning that, in the high‐dimensional setting, the test AUCs of the proposed method are higher than those of logistic regression and SVM, with comparable or smaller SEs. This result indicates the stability of our approach in high‐dimensional settings.
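
The group‐based scheme above can be sketched for a single longitudinal predictor as follows. This is an illustration under stated assumptions, not the authors' code: the visit schedule, the probability of a missed visit, the constant health mean, the linear form of the disease mean, the standard deviations, and the 0/1 label coding are all hypothetical. Only the group sizes (270, 30) are taken from Tables 6 and 7.

```python
import random

def simulate_trajectory(mean_fn, visit_times, miss_prob, rng,
                        sd_effect=1.0, sd_error=0.5):
    """Return [(time, value), ...] for one subject with randomly missed visits."""
    effect = rng.gauss(0.0, sd_effect)  # subject-specific random effect
    kept = [t for t in visit_times if rng.random() >= miss_prob]
    if len(kept) < 2:  # guarantee at least two visits per subject
        kept = sorted(rng.sample(visit_times, 2))
    # Observed value = group mean + random effect + measurement error.
    return [(t, mean_fn(t) + effect + rng.gauss(0.0, sd_error)) for t in kept]

def health_mean(t):
    return 2.0  # hypothetical constant mean over time

def disease_mean(t):
    return 2.0 + 1.5 * t  # hypothetical linear drift to mimic progression

rng = random.Random(7)
visits = [k / 10 for k in range(11)]  # hypothetical schedule on [0, 1]
n_h, n_d = 270, 30                    # group sizes from Tables 6 and 7

# Labels are fixed by group membership (0 = health, 1 = disease, assumed coding).
subjects = []
for _ in range(n_h):
    subjects.append((0, simulate_trajectory(health_mean, visits, 0.4, rng)))
for _ in range(n_d):
    subjects.append((1, simulate_trajectory(disease_mean, visits, 0.4, rng)))

labels = [label for label, _ in subjects]
print("n_h / n_d =", labels.count(0) / labels.count(1))
print("visits per subject:", min(len(tr) for _, tr in subjects),
      "to", max(len(tr) for _, tr in subjects))
```

Each subject ends up with a different number of observations at different times, which is the irregular and sparse structure that PACE is designed to handle when extracting FPC scores.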

TABLE 6

Classification results (S.E.) with various sample sizes of health and disease groups in low‐dimensional setting based on 500 Monte Carlo replicates

Sample size		(nh,nd)=(225,75)			(nh,nd)=(257,43)			(nh,nd)=(270,30)
		Logistic	SVM	Proposed	Logistic	SVM	Proposed	Logistic	SVM	Proposed
Training	Sensitivity	0.774	0.767	0.895	0.596	0.555	0.873	0.454	0.379	0.852
		(0.097)	(0.108)	(0.049)	(0.205)	(0.250)	(0.071)	(0.226)	(0.278)	(0.086)
	Specificity	0.954	0.957	0.894	0.976	0.982	0.864	0.985	0.991	0.846
		(0.014)	(0.015)	(0.045)	(0.008)	(0.010)	(0.074)	(0.007)	(0.008)	(0.080)
	Accuracy	0.909	0.909	0.895	0.922	0.921	0.866	0.932	0.930	0.846
		(0.032)	(0.032)	(0.042)	(0.029)	(0.032)	(0.070)	(0.021)	(0.023)	(0.079)
	AUC	0.953	0.952	0.951	0.927	0.919	0.925	0.906	0.885	0.902
		(0.033)	(0.035)	(0.034)	(0.064)	(0.081)	(0.064)	(0.076)	(0.110)	(0.077)
Test	Sensitivity	0.751	0.743	0.865	0.555	0.511	0.818	0.416	0.335	0.780
		(0.099)	(0.108)	(0.059)	(0.198)	(0.235)	(0.095)	(0.214)	(0.254)	(0.119)
	Specificity	0.949	0.951	0.882	0.970	0.976	0.853	0.981	0.988	0.837
		(0.017)	(0.016)	(0.048)	(0.012)	(0.014)	(0.075)	(0.010)	(0.011)	(0.080)
	Accuracy	0.899	0.898	0.878	0.911	0.909	0.848	0.925	0.922	0.832
		(0.033)	(0.032)	(0.044)	(0.029)	(0.030)	(0.071)	(0.019)	(0.020)	(0.077)
	AUC	0.945	0.944	0.945	0.912	0.907	0.912	0.889	0.872	0.889
		(0.037)	(0.038)	(0.037)	(0.066)	(0.079)	(0.068)	(0.079)	(0.109)	(0.080)

Abbreviation: nh, nd: number of subjects in the health and disease groups, respectively.

TABLE 7

Classification results (S.E.) with various sample sizes of health and disease groups in high‐dimensional setting based on 500 Monte Carlo replicates

Sample size		(nh,nd)=(225,75)			(nh,nd)=(257,43)			(nh,nd)=(270,30)
		Logistic	SVM	Proposed	Logistic	SVM	Proposed	Logistic	SVM	Proposed
Training	Sensitivity	0.825	0.999	0.927	0.782	0.999	0.921	0.395	0.999	0.911
		(0.121)	(0.001)	(0.033)	(0.234)	(0.001)	(0.039)	(0.369)	(0.001)	(0.053)
	Specificity	0.992	0.999	0.924	0.999	0.999	0.918	0.999	0.999	0.904
		(0.007)	(0.001)	(0.032)	(0.001)	(0.001)	(0.035)	(0.001)	(0.001)	(0.053)
	Accuracy	0.950	0.999	0.924	0.968	0.999	0.918	0.939	0.999	0.905
		(0.034)	(0.001)	(0.028)	(0.034)	(0.001)	(0.032)	(0.036)	(0.001)	(0.051)
	AUC	0.987	0.999	0.972	0.993	0.999	0.968	0.842	0.999	0.956
		(0.013)	(0.001)	(0.015)	(0.024)	(0.001)	(0.019)	(0.224)	(0.001)	(0.034)
Test	Sensitivity	0.657	0.457	0.861	0.389	0.192	0.793	0.131	0.064	0.709
		(0.063)	(0.056)	(0.053)	(0.113)	(0.056)	(0.076)	(0.121)	(0.047)	(0.113)
	Specificity	0.967	0.930	0.892	0.987	0.981	0.895	0.997	0.994	0.885
		(0.015)	(0.016)	(0.031)	(0.009)	(0.008)	(0.037)	(0.004)	(0.005)	(0.054)
	Accuracy	0.890	0.812	0.884	0.901	0.868	0.881	0.910	0.901	0.868
		(0.018)	(0.020)	(0.024)	(0.015)	(0.011)	(0.031)	(0.010)	(0.006)	(0.046)
	AUC	0.947	0.832	0.951	0.922	0.807	0.931	0.777	0.762	0.897
		(0.016)	(0.029)	(0.017)	(0.032)	(0.037)	(0.026)	(0.183)	(0.049)	(0.042)

Abbreviation: nh, nd: number of subjects in the health and disease groups, respectively.


DISCUSSION

In this work, we have developed a novel classification framework for imbalanced data under a longitudinal and high‐dimensional structure. With the use of FPCA, a substantial dimension reduction is achieved for irregular and sparse longitudinal data, and no distributional assumptions on biomarkers are needed. Unlike traditional classification methods, the proposed AUC‐type classifier with a univariate exponential loss function efficiently approximates the empirical AUC, which is intrinsically robust against imbalance, and thus yields a high sensitivity without greatly impairing the overall accuracy and specificity. Coupled with the group lasso penalty, feature selection is conducted simultaneously within the classification procedure.

As early detection of AD is a recognized health care priority in the United States, we can begin to address this task by applying the proposed method to longitudinal brain imaging data together with clinical and cognitive measures. To the best of our knowledge, this is the first study in the literature that focuses on using longitudinal MRI data for the early identification of AD patients among individuals who are diagnosed as normal at baseline. The proposed method not only detects at‐risk AD patients among these baseline normal‐cognition participants but also identifies the most significant biomarkers (such as brain regions) that are associated with the development of AD, though biomarker discovery often requires further and deeper investigation.

The proposed method can handle longitudinal and high‐dimensional imaging data; however, in practice, each individual's imaging data may not always be available, because an MRI scan is typically an expensive procedure, which may keep normal individuals from undergoing the scan and result in a lack of imaging data. Even without brain imaging data, the proposed method can still perform well, as shown in the simulation study under low‐dimensional settings. Beyond longitudinal data, the proposed method can also be applied to cross‐sectional imbalanced data, for example gene expression microarray data, by simply skipping the FPCA step.

The proposed method is mainly developed for imbalanced classification in longitudinal and high‐dimensional settings, but the feature extraction process via FPCA can be time‐consuming when the longitudinal data are dense or the total number of subjects is large. This can be improved by employing other techniques of functional data analysis, for example, the natural cubic spline, which has been shown to be an easy‐to‐implement and efficient approach for both sparse longitudinal data and dense functional data. In addition, FPCA requires a pre‐specified number of basis functions, which might be critical for extracting the FPC scores. A simulation was conducted to study how to determine the number of basis functions for FPCA and how this number affects imbalanced classification (results not shown). We suggest using the minimal number of measurements among all subjects minus one as the number of basis functions to ensure that the FPC scores can be successfully obtained. It is also worth noting that the feature extraction (ie, FPC scores) by PACE can still be performed even when missing values occur in the longitudinal profiles of subjects. However, because the proposed method is supervised, the response/label must be provided for model training, as required by our proposed loss function.

Finally, it is possible to extend our approach to incorporate alternative surrogate loss functions, such as the square and squared hinge losses, for the approximation of the empirical AUC. Such an extension may potentially improve the classification performance and reduce the computational burden. Furthermore, an extension to data generated from nonlinear spaces would make the proposed method more general. As one possible solution, a kernelized transformation may be performed on the data prior to any statistical or machine learning modeling. These extensions are beyond the scope of this article and require further investigation.
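
The idea of approximating the empirical AUC with an exponential surrogate can be illustrated with a minimal sketch. It shows a generic pairwise exponential loss and its factorization into two single sums. Whether this matches the authors' exact "univariate exponential loss" is an assumption, and the function names and toy scores are hypothetical. The sketch relies on two facts: every misranked or tied pair contributes exp(-(d - h)) >= 1, so the mean surrogate is at least one minus the empirical AUC, and exp(-(d - h)) = exp(-d) * exp(h), so the double sum over pairs separates.

```python
import math

def empirical_auc(disease_scores, health_scores):
    """Share of (disease, health) pairs ranked correctly; ties count one half."""
    wins = sum((d > h) + 0.5 * (d == h)
               for d in disease_scores for h in health_scores)
    return wins / (len(disease_scores) * len(health_scores))

def pairwise_exp_loss(disease_scores, health_scores):
    """Mean of exp(-(d - h)) over all pairs: cost grows as n_d * n_h."""
    total = sum(math.exp(-(d - h))
                for d in disease_scores for h in health_scores)
    return total / (len(disease_scores) * len(health_scores))

def factorized_exp_loss(disease_scores, health_scores):
    """Same quantity as a product of two single sums: cost grows as n_d + n_h."""
    mean_d = sum(math.exp(-d) for d in disease_scores) / len(disease_scores)
    mean_h = sum(math.exp(h) for h in health_scores) / len(health_scores)
    return mean_d * mean_h

# Toy classifier scores (hypothetical): 3 disease and 6 health subjects.
disease = [0.1, 1.0, 1.8]
health = [-2.0, -1.0, -0.5, 0.3, 0.0, -1.5]

auc = empirical_auc(disease, health)
pairwise = pairwise_exp_loss(disease, health)
factorized = factorized_exp_loss(disease, health)
print(f"1 - AUC = {1 - auc:.4f}")
print(f"pairwise surrogate = {pairwise:.4f}, factorized = {factorized:.4f}")
```

The factorized form avoids looping over all pairs, which is one plausible reading of the lower computational complexity noted for the univariate loss. Replacing the exponential with a square or squared hinge loss, as suggested above, would change only the per‐pair function.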

CONFLICT OF INTEREST

The authors declare no potential conflict of interest.
