
Multivariate Functional Kernel Machine Regression and Sparse Functional Feature Selection.

Joseph Naiman, Peter Xuekun Song.

Abstract

Motivated by mobile devices that record data at a high frequency, we propose a new methodological framework for analyzing a semi-parametric regression model that allows us to study a nonlinear relationship between a scalar response and multiple functional predictors in the presence of scalar covariates. Utilizing functional principal component analysis (FPCA) and the least-squares kernel machine (LSKM) method, we substantially extend the framework of semi-parametric regression models of scalar responses on scalar predictors by allowing multiple functional predictors to enter the nonlinear model. Regularization is established for feature selection in the setting of reproducing kernel Hilbert spaces. Our method performs model fitting and variable selection on functional features simultaneously. For the implementation, we propose an effective algorithm that solves the related optimization problems by iterating between a linear mixed-effects model and a variable selection method (e.g., sparse group lasso). We show algorithmic convergence results and theoretical guarantees for the proposed methodology. We illustrate its performance through simulation experiments and an analysis of accelerometer data.

Keywords:  functional predictor; functional principal component analysis; linear mixed-effects model; mobile device; sparse group regularization; wearable device data

Year:  2022        PMID: 35205498      PMCID: PMC8871497          DOI: 10.3390/e24020203

Source DB:  PubMed          Journal:  Entropy (Basel)        ISSN: 1099-4300            Impact factor:   2.524


1. Introduction

Data captured by mobile devices have lately received much attention in the data science community. Such data are typically recorded at a high frequency, giving rise to an ample volume of information at a very fine scale, and thus present many methodological challenges in statistical modeling and data analysis. In this paper, we utilize the strength of the classical kernel machine method, which enjoys fast computation via the linear mixed-effects model, to deal with such high-frequency data through a functional data analysis approach. The motivation for our proposed framework comes from data collected by a tri-axis accelerometer. Accelerometers, worn on the hip or wrist as a way of monitoring physical activity, are becoming more and more common [1,2,3,4]. Several different accelerometers are available, such as the ActiGraph GT3X+ (ActiGraph, Pensacola, FL, USA) and the Actical (Philips Respironics, Bend, OR, USA). Raw accelerometer data are often collected as high-resolution signals with a sampling frequency ranging from 30 to 100 Hz. The commercial software on these devices provides activity counts (ACs) [2,4], which are calculated from the raw accelerometer data using proprietary algorithms. As an example from our motivating dataset, Figure 1 displays a three-dimensional time series of ACs per minute, one series per axis, from one subject wearing the GT3X+ over a period of 7 days (d).
Figure 1

Activity counts over 7 d from a tri-axis (X-, Y- and Z-axis) accelerometer of a subject.

Oftentimes, different types of summaries of the tri-axis ACs are suggested in the literature, as opposed to using all three raw functional curves [5,6,7,8]. These summary-data-based approaches may be regarded as a quick-and-dirty dimension reduction strategy that produces data of computationally manageable volume, which can then be analyzed with existing methods and software. One concern with the use of summarized data is the loss of potential fine features that can only be captured in high-resolution data. Recently, some researchers have attempted to use the entire functional AC curve through functional data analysis techniques [6,9,10]. Further details on current methods for retrieving and interpreting accelerometer data can be found in [11]. Our contribution in this paper pertains to a new framework in which tri-axis accelerometer data are used as three-dimensional correlated functional predictors in an association analysis with a potential health outcome such as the Body Mass Index (BMI). The relationship between physical activity and childhood obesity has long been a central interest of the public health sciences, and our new scalar-on-functional regression model can provide new insights into this important scientific problem. We begin with a brief review of existing functional data models, the least-squares kernel machine model, and different variable selection techniques, as a prelude to the framework of this paper.

1.1. Functional Regression

There has been much attention in recent years given to functional data analysis (FDA), where the covariates, the response, or both are functional rather than scalar in nature [12,13,14,15,16,17]. In this paper, we focus on methodology that allows us to relate multiple functional covariates to a scalar outcome in a nonlinear way in the presence of other scalar covariates. To proceed, let us introduce some notation. Let L²(T) be the class of square-integrable functions on a compact set T. This is a separable Hilbert space with inner product ⟨f, g⟩ = ∫_T f(t)g(t) dt, for f, g ∈ L²(T). Consider a probability space (Ω, F, P), where Z denotes a functional random variable that maps Ω into L²(T), namely Z: Ω → L²(T). Here P is a certain probability measure, and we assume E‖Z‖² < ∞, with ‖·‖ the norm induced by the inner product, in the rest of this paper. For convenience, we also assume that Z is mean centered, namely E{Z(t)} = 0. The class of functional linear models (FLM) (e.g., [13,14,15]) relates a functional covariate Z to a mean-centered scalar outcome y, which is also known as scalar-on-functional regression: y = ⟨b, Z⟩ + ε, where the error term ε is a mean-zero random variable uncorrelated with Z. An optimal solution of the unknown functional parameter b is typically obtained by minimizing the mean-squared error E(y − ⟨b, Z⟩)². Moreover, the mean model for the mean-centered scalar y takes the form E(y | Z) = ⟨b, Z⟩. As suggested in the literature, we may obtain an optimal estimator of b by expanding the functional predictor Z under certain basis functions. In this paper, we focus on the utility of functional principal component analysis (FPCA) to perform the decomposition of the functional Z. By the Karhunen–Loève expansion (e.g., [18,19,20]), we may write Z(t) = Σ_{j≥1} √λ_j ζ_j φ_j(t), where λ₁ ≥ λ₂ ≥ … are the eigenvalues of the covariance operator of Z with orthonormal eigenfunctions φ_j, and the loadings are given by ζ_j = ⟨Z, φ_j⟩/√λ_j. These coefficients satisfy (i) mean zero, E(ζ_j) = 0; (ii) variance one, E(ζ_j²) = 1; and (iii) uncorrelated, E(ζ_j ζ_k) = 0 for j ≠ k. Then, the mean model may be rewritten as follows: E(y | Z) = Σ_{j≥1} β_j ζ_j, (1) where the coefficients β_j = √λ_j ⟨b, φ_j⟩, j ≥ 1, are unknown due to the unknown b.
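As a concrete illustration of the FPCA decomposition above, the following sketch (our own toy example, not the paper's implementation; the authors use the R package fdapace) eigendecomposes the discretized sample covariance of mean-centered curves and forms the standardized loadings ζ_j:

```python
import numpy as np

# Toy FPCA sketch: eigendecompose the discretized sample covariance of
# mean-centered curves, then form standardized loadings zeta_j with
# mean ~0 and variance ~1 (all names and data here are illustrative).
rng = np.random.default_rng(0)
n, m = 200, 50                                # n curves on an m-point grid
t = np.linspace(0.0, 1.0, m)
basis = np.vstack([np.sin(2 * np.pi * t),
                   np.cos(2 * np.pi * t),
                   np.sin(4 * np.pi * t)])
Z = rng.standard_normal((n, 3)) @ basis       # rank-3 random curves
Zc = Z - Z.mean(axis=0)                       # mean-center, E{Z(t)} = 0
dt = t[1] - t[0]
C = Zc.T @ Zc / n                             # discretized covariance operator
evals, evecs = np.linalg.eigh(C)
evals, evecs = evals[::-1], evecs[:, ::-1]    # sort eigenpairs descending
lam = evals * dt                              # eigenvalues lambda_j
phi = evecs / np.sqrt(dt)                     # orthonormal eigenfunctions in L2
scores = Zc @ phi * dt                        # xi_ij = <Z_i, phi_j>
zeta = scores[:, :3] / np.sqrt(lam[:3])       # standardized loadings
```

By construction, the empirical variance of each column of `zeta` is one, mirroring properties (i)–(iii) above.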
Equation (1) presents a linear projection of the scalar outcome y onto the space spanned by the standardized principal components (PCs) ζ_j of the functional predictor Z. Along these lines of research, Müller and Yao (2008) proposed a class of functional additive models (FAMs) that extends Equation (1) by allowing a nonparametric form of the projection: E(y | Z) = Σ_{j≥1} f_j(ζ_j), (2) where each f_j is a fully unspecified nonlinear smooth function to be estimated. It is obvious that Müller and Yao's extension given in (2) takes an additive form over the individual coefficient (or feature) components ζ_j. Regularization is needed for both (1) and (2) in order to deal with these infinite-dimensional unknowns. One of the challenges concerning regularization for (2) lies in the technical treatment in the functional space. Müller and Yao (2008) [21] proposed truncation (or a hard threshold) of the eigenspace to retain only the leading components that explain the majority of the total variation in Z. Zhu, Yao, and Zhang (2014) [15] proposed another regularization of the functions f_j using the powerful COSSO method [22]. One advantage of this kind of regularization method is that higher-order functional principal components are allowed to be included in the fit model if they make stronger contributions to the functional relationship than the leading functional principal components. This regularization method [15] begins with an additive model of s terms, where s represents an initial degree of truncation specifying the total number of additive components to be considered. Then, COSSO simultaneously regularizes and selects the important components among the s functions f_1, …, f_s. Although the above discussion is based on a single functional predictor Z, it is appealing to extend such a framework to multiple functional predictors for a broad range of problems.
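Müller and Yao's hard-threshold truncation can be sketched in a few lines: keep the smallest number s of leading components whose eigenvalues explain a chosen fraction of the total variation (the eigenvalues and the 95% level below are made up for illustration):

```python
import numpy as np

# Hard-threshold truncation of the eigenspace: keep the smallest s leading
# components whose eigenvalues explain at least 95% of the total variation.
# The eigenvalues below are illustrative, not from any real dataset.
lam = np.array([4.0, 2.0, 1.0, 0.5, 0.3, 0.1, 0.05, 0.05])
frac = np.cumsum(lam) / lam.sum()             # cumulative variance explained
s = int(np.searchsorted(frac, 0.95)) + 1      # smallest s with frac[s-1] >= 0.95
```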
When multiple functional predictors, say Z₁, …, Z_p, are considered, it is not clear whether the above additive model specification remains suitable to handle the complexity, especially when a non-additive relationship (e.g., interactions) is of interest in understanding the association between a scalar outcome and multiple functional predictors. In effect, from the perspectives of both theoretical advances and application needs, relaxing the additive relationship is an important task in functional data analysis. Alternatively, there are some methods (e.g., [16,17]) in the literature that do not use the strategy of decomposing Z into its functional components. In this paper, we adopt the framework of kernel machine regression models to extend these methodologies to non-additive relationships between multiple functional predictors and the scalar outcome.

1.2. Least-Squares Kernel Machine

Liu, Lin, and Ghosh (2007) [23] proposed a semi-parametric regression model y_i = x_i′β + h(z_i) + ε_i for subjects i = 1, …, n, where they used the least-squares kernel machine (LSKM) to analyze multidimensional genetic pathways denoted by a vector z_i. The key feature of this model is the nonlinear relationship between the outcome y_i and the vector of gene expressions z_i, which is characterized by a nonparametric smooth function h. Under the theory of smoothing splines, the function h is assumed to lie in a reproducing kernel Hilbert space (RKHS), H_K, generated by a positive-definite kernel function K(·, ·). For ease of exposition, we suppress the bandwidth of the kernel in the following discussion. Then, both the parameter β and the function h are estimated by maximizing the scaled penalized likelihood function: J(β, h) = −(1/2) Σ_{i=1}^n {y_i − x_i′β − h(z_i)}² − (τ/2) ‖h‖²_{H_K}, (3) where τ is the tuning parameter and ‖·‖_{H_K} is the norm of the RKHS. For a function h ∈ H_K, the optimizer takes the form h(·) = Σ_{i=1}^n α_i K(·, z_i), so that ‖h‖²_{H_K} = α′Kα. Then, h = Kα, where K is an n × n matrix whose (i, j) entry is K(z_i, z_j) and α = (α₁, …, α_n)′. It is known in the literature (e.g., [23,24]) that maximizing J(β, h) in (3) turns out to be equivalent to solving the normal equations of the following linear mixed-effects model (LMM): y = Xβ + h + e, where h is an n × 1 vector of random effects with distribution N(0, (σ²/τ)K) and e is an n-dimensional vector error term with e ~ N(0, σ²I). One remarkable advantage of solving (3) through the existing numerical procedure of the LMM, as most advocated in the literature [25], is that we can determine the smoothing parameter τ as part of the estimation of the variance components of the LMM. Therefore, instead of using cross-validation or other information-based tuning methods for τ, we can solve simultaneously for all the model parameters in (3), as shown in [23]. Utilizing this numerical strength of the kernel machine regression model, we propose a semi-parametric regression model that incorporates functional principal components of functional predictors (i.e., the ζ_j) to evaluate a nonlinear relationship between a scalar outcome and multiple functional covariates in a non-additive way.
Assuming that function h belongs to an RKHS, we can use existing software packages for solving LMMs to obtain estimates of all model parameters and the smoothing parameter.

1.3. Feature Selection

To deal with high-dimensional functional principal components from functional covariates, we invoke a sparse regularization approach in the kernel machine regression model. Note that for both mean models (1) and (2), one needs to truncate the series from the Karhunen–Loève expansion; regularization helps reduce an infinite number of terms to a finite sum. To introduce some notation, we present here a brief review of the group lasso (GL) [26], the sparse group lasso (SGL) [27], and the non-negative garrote [28]; see also the series of work originating with COSSO [22]. Yuan and Lin (2006) [26] proposed the group lasso, which solves the convex optimization problem min_β ‖y − Σ_{ℓ=1}^L X_ℓ β_ℓ‖² + λ Σ_{ℓ=1}^L ‖β_ℓ‖₂, where L is the total number of groups of covariates and β_ℓ refers to the subset of coefficients associated with group ℓ. Friedman, Hastie, and Tibshirani [27] extended the group lasso to allow within-group sparsity, namely the SGL, given as min_β ‖y − Σ_{ℓ=1}^L X_ℓ β_ℓ‖² + λ₁ Σ_{ℓ=1}^L ‖β_ℓ‖₂ + λ₂ ‖β‖₁, where β = (β₁′, …, β_L′)′. The additional ℓ₁-norm penalty on β encourages individual sparsity, while the first penalty targets sparsity at the group level. It is easy to see that the group lasso is a special case of the SGL when λ₂ = 0. The non-negative garrote proposed by Breiman (1995) [28] is another useful means of variable selection. It invokes a scaled version of least-squares estimation: min_d ‖y − Z̃d‖², subject to d_j ≥ 0 for all j and Σ_j d_j ≤ t. Here, Z̃ is a matrix with columns z̃_j = x_j β̂_j, with β̂_j being the least-squares estimates obtained with no constraints. Obviously, an estimate d̂_j = 0 implies that covariate x_j is excluded from the fit model. Breiman's formulation, which turns a variable selection problem into a parameter estimation problem, will be applied to the development of feature selection on functional principal components in this paper. This paper is organized as follows. Section 2 introduces our proposed high-dimensional kernel machine regression. Section 3 outlines a simple step-by-step algorithm that is used to implement the sparse estimation method.
Section 4 concerns the asymptotic properties of our proposed sparse kernel machine regression. Section 5 provides simulation results that examine the performance of our method, with comparisons to existing methods. Section 6 illustrates the proposed method with an association analysis of the relationship between the BMI and functional accelerometer data. Section 7 contains our conclusions. Appendix A collects some key technical details, including the proofs of the theoretical results, while Appendix B presents a discussion of the model identifiability issue.
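To make the two-level sparsity of the group lasso and SGL reviewed above concrete, the following sketch (our own illustration, with hypothetical helper names) implements the SGL proximal operator: the ℓ₁ part soft-thresholds individual coefficients, and the group ℓ₂ part can zero out an entire group:

```python
import numpy as np

# Proximal operator of the sparse group lasso penalty
# lam * ( (1 - a) * sum_g ||v_g||_2 + a * ||v||_1 ):
# soft-threshold within each group (individual sparsity), then
# block-soft-threshold the group norm (group sparsity).
def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sgl_prox(v, lam, a, groups):
    out = np.zeros_like(v)
    for g in groups:                        # g: index array for one group
        u = soft(v[g], lam * a)             # within-group thresholding
        norm = np.linalg.norm(u)
        thr = lam * (1.0 - a)
        if norm > thr:                      # group survives; shrink its norm
            out[g] = (1.0 - thr / norm) * u
    return out

v = np.array([3.0, -0.2, 0.1, 0.05, -0.1, 0.02])
groups = [np.arange(0, 3), np.arange(3, 6)]
w = sgl_prox(v, lam=0.5, a=0.5, groups=groups)
# first group keeps only its large coefficient; second group is zeroed out
```

Setting `a = 0` recovers the plain group lasso, matching the special-case relationship noted above.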

2. Model and Estimation

Consider a regression analysis of a scalar outcome y on p functional covariates Z₁, …, Z_p. Let ζ_{iℓ} be the s_ℓ-element vector of functional principal component (FPC) features from the observation of the ℓth functional covariate Z_{iℓ}, and let ζ_i = (ζ_{i1}′, …, ζ_{ip}′)′ be the grand vector of all FPC features from all p functional covariates for subject i, i = 1, …, n. Clearly, the set of FPC features from each functional covariate forms a group; in total, there are p groups, with s_ℓ FPC features in group ℓ and s = Σ_{ℓ=1}^p s_ℓ. The high dimensionality of the FPC features presents the key methodological challenge in the analysis. We consider the following functional kernel machine regression (FKMR) model: y_i = x_i′β + h(ζ_i) + ε_i, (4) where β is a set of parameters for the effects of the q scalar covariates x_i, h is an s-variate smooth nonparametric function in the functional space H_K generated by a Mercer kernel K(·, ·), and the ε_i are independent error terms with mean zero. The FKMR model (4) allows for not only nonlinear, but also non-additive, relationships between the multiple functional covariates, via their FPC features ζ_i, and the scalar outcome y. The statistical task is to estimate and select the important functional covariates that are related to the outcome of interest by regularizing the FPC features within each functional covariate. To proceed, following Breiman's [28] non-negative garrote method, we introduce a new s-dimensional scaling vector δ = (δ₁, …, δ_s)′, δ_j ≥ 0, by which we define a new vector of weighted FPC features ζ̃_i = δ ∘ ζ_i via the Hadamard product (i.e., elementwise product). Note that δ is grouped and denoted by δ = (δ_{(1)}′, …, δ_{(p)}′)′, where δ_{(ℓ)} is the s_ℓ-element subvector of scales for the FPC features of functional covariate Z_ℓ. When an element, say δ_j, is equal to zero, the corresponding FPC feature is not selected into the set of important FPCs; moreover, functional covariate Z_ℓ is excluded from the FKMR model when the entire subvector δ_{(ℓ)} = 0.
We estimate the unknowns (β, h) in the FKMR model (4), as well as the scaling parameters δ, by minimizing a penalized objective function whose expression is given on the right-hand side of the following Equation (5): Q(β, h, δ) = Σ_{i=1}^n {y_i − x_i′β − h(δ ∘ ζ_i)}² + τ‖h‖²_{H_K} + λ ρ(δ), (5) where τ and λ are two tuning parameters, and the penalty ρ(δ) may be specified according to a certain regularization method. For the case of the sparse group lasso (SGL), we take ρ(δ) = (1 − γ) Σ_{ℓ=1}^p ‖δ_{(ℓ)}‖₂ + γ‖δ‖₁ with γ ∈ [0, 1]. Typically, γ is predetermined, with the factor (1 − γ) controlling the relative weight of group sparsity against the individual sparsity within each functional predictor. Meanwhile, a large tuning parameter λ removes a certain group of FPC features from the FKMR model when all elements in the subvector δ_{(ℓ)} are shrunk to zero. Given the representer-theorem form h = K_δ α, an equivalent optimization to the above (5) can be formulated as follows: min_{β, α, δ} Σ_{i=1}^n {y_i − x_i′β − (K_δ α)_i}² + τ α′K_δ α + λ ρ(δ), (6) where K_δ is an n × n matrix whose (i, j)th element is K(δ ∘ ζ_i, δ ∘ ζ_j). Lemma 1 below establishes the equivalence of the optimization solutions of (5) and (6), which is crucial in our estimation procedure; the proof of Lemma 1 is given in Appendix A.1. Theorem 1 (existence of optimizers) shows that, under mild conditions on the kernel, minimizers of (5) exist; the proof of Theorem 1 is given in Appendix A.3. Note that there may exist multiple optimal minimizers for (5); Theorem 1 ensures only the existence of optimal solutions, but provides no guarantee of uniqueness, due to the fact that (5) or (6) is a nonlinear and non-convex optimization problem. It is worth noting that in both (5) and (6), we set the bandwidth of the kernel at a fixed value because of the identifiability issue with respect to the scaling parameters δ. Refer to Appendix B for more detailed discussions of the issue of parameter identifiability.
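The garrote-style scaling inside the kernel can be made concrete with a toy Gaussian kernel (our own illustration): each FPC feature is multiplied by its scale δ_j before entering the kernel, so δ_j = 0 removes that feature, and a zero subvector δ_(ℓ) removes the whole ℓth functional covariate:

```python
import numpy as np

# Garrote-style scaling inside a Gaussian kernel (illustrative sketch):
# the weighted features are delta * zeta (Hadamard product), so a zero
# scale removes a feature, and a zero group removes a functional covariate.
def scaled_gaussian_kernel(Zeta, delta, rho):
    W = Zeta * delta                         # elementwise scaling of features
    d2 = ((W[:, None, :] - W[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / rho)

rng = np.random.default_rng(2)
Zeta = rng.standard_normal((5, 4))           # 5 subjects, s = 4 FPC features
delta = np.array([1.0, 0.0, 0.5, 0.0])       # features 2 and 4 switched off
K = scaled_gaussian_kernel(Zeta, delta, rho=4.0)
# identical to the kernel built from the two surviving (scaled) features
K_kept = scaled_gaussian_kernel(Zeta[:, [0, 2]], np.array([1.0, 0.5]), rho=4.0)
```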

3. Implementation and Algorithm

We propose an iterative algorithm to implement our estimation procedure, in which we require the differentiability of the kernel with respect to the scaling factor δ and some additional assumptions presented below in order to ensure algorithmic convergence. One part of the algorithm solves (5) for fixed δ, where the resulting minimization problem reduces to the equivalent maximization problem of the least-squares kernel machine (3), with the FPC features ζ_i replaced by δ ∘ ζ_i. As pointed out in Section 1.2, this numerical step can be executed in the same fashion as the solution of the linear mixed model, including the REML estimation of the smoothing parameter τ. The other part of the algorithm is performed for fixed (β, α, τ), where we solve a nonlinear and non-convex optimization problem to update the estimate of δ. Lemma 2 below helps us solve for the scaling parameter δ: for fixed (β, α, τ), minimizing (6) over δ reduces to a penalized nonlinear least-squares problem, labeled (7); the proof of Lemma 2 is given in Appendix A.2. Linearizing the nonlinear function in (7) around the current iterate leads to an equivalent form, (8), in which the design matrix is built from the gradient of the kernel fit with respect to δ, with its columns grouped according to the subvectors δ_{(ℓ)}. This is precisely the form of the standard sparse group regularization problem with a specific choice of penalty function ρ(δ). The convergence of the above iterative search for updating δ with fixed (β, α, τ) can be justified by the proximal Gauss–Newton method [29]; readers are referred to [30] for details on the proximal Gauss–Newton method. One of the key assumptions of the proximal Gauss–Newton method is the existence of a local minimizer. This condition is satisfied for (8) because, according to Theorem 1, there exists a global minimizer.
Algorithm 1 summarizes these iterative steps, which are shown to satisfy a descent property under the convergence of the proximal Gauss–Newton algorithm for Step 2.2.
Step 1.1. Perform FPCA (e.g., the R package fdapace) to extract the functional component features of the p functional predictors, and store them in a grand vector ζ_i for each individual subject i, i = 1, …, n.
Step 1.2. Initialize δ to be a vector of ones, which translates to mapping the original component scores to themselves. Set up a grid of candidate values for the pair of tuning parameters, and set the kernel bandwidth parameter, which may depend on s. For each pair from the grid, perform Steps 2.1–2.3 and 3.1 below.
Step 2.1. At the kth step of the algorithm, first solve the LSKM problem with δ fixed (based on a closed-form solution) to update β and α.
Step 2.2. Solve the group regularization problem (8) with (β, α, τ) fixed, using the updates from the previous iteration; at this step, the proximal Gauss–Newton algorithm produces an updated δ at convergence.
Step 2.3. Repeat Steps 2.1–2.2 until convergence.
Step 3.1. Perform cross-validation over all tuning-parameter pairs to determine the final fit.
To speed up Algorithm 1, we propose an operational variant, Algorithm 2, that avoids setting up the grid of tuning-parameter pairs and performing Step 3.1. Here are a few remarks on the two algorithms. (i) Algorithm 2 depends on good starting values in order to enjoy a fast search. (ii) The main difference between Algorithms 1 and 2 is that the tuning parameter is fixed within Algorithm 1, while it changes across iterations in Algorithm 2; similar algorithms with changing tuning parameters have been proposed in the literature, for example, for the single index model [31]. (iii) There is no guarantee that either algorithm converges to a global minimizer, and the proximal Gauss–Newton method used in the implementation can only find stationary points. Numerical solvers for the optimization problem in (5) or (6) indeed remain an open problem in the field of nonlinear and nonconvex optimization.
Step 2.1 of Algorithm 1 is performed by running the linear mixed model with the initial δ fixed from Step 1.2 of Algorithm 1 to obtain updated values of β, α, and τ. Step 2.2 is performed by solving the group regularization problem (8) through the Gauss–Newton algorithm with cross-validation-based tuning (e.g., the R package oem). We then rerun Step 2.1 using the updated δ from Step 2.2 to obtain the estimates of β and α.
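A stripped-down toy version of the alternation in Algorithm 1 can be sketched as follows. It is only a caricature of the procedure: Step A fits the kernel machine with δ fixed, Step B takes a projected proximal-gradient step on δ with a lasso penalty, a numerical gradient stands in for the analytic linearization of (7), and τ, λ, and the step size are fixed rather than tuned by REML and cross-validation:

```python
import numpy as np

# Caricature of Algorithm 1's alternation (all settings are assumptions):
# Step A: kernel machine fit of alpha with the scales delta fixed;
# Step B: projected proximal-gradient step on delta with an l1 penalty.
rng = np.random.default_rng(3)
n, s = 80, 4
Zeta = rng.standard_normal((n, s))
y = np.sin(Zeta[:, 0]) + 0.1 * rng.standard_normal(n)   # feature 1 is signal

def kern(delta):
    W = Zeta * delta
    d2 = ((W[:, None, :] - W[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / s)                  # bandwidth fixed at s

def fit_alpha(delta, tau=1.0):              # Step A: LSKM given delta
    K = kern(delta)
    return np.linalg.solve(K + tau * np.eye(n), y)

def sse(delta, alpha):                      # squared-error part of (5)
    return float(((y - kern(delta) @ alpha) ** 2).sum())

delta = np.ones(s)                          # initialize scales at one
lam, step, eps = 2.0, 0.02, 1e-5
for _ in range(30):
    alpha = fit_alpha(delta)
    grad = np.array([(sse(delta + eps * e, alpha) - sse(delta - eps * e, alpha))
                     / (2 * eps) for e in np.eye(s)])
    # Step B: gradient step, l1 shrinkage, projection onto delta >= 0
    delta = np.maximum(delta - step * grad - step * lam, 0.0)
```

The non-negativity projection mirrors the garrote constraint δ_j ≥ 0; scales driven to zero correspond to deselected FPC features.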

4. Theoretical Guarantees

Our theoretical analysis focuses on finite-sample error bounds for the estimators obtained from (5) or (6); consequently, we are able to establish estimation consistency. For simplicity, we consider a general setting of random feature vectors, so that the FPC features correspond to a special case. Along similar lines to those of [15,32], the estimation consistency is proven for the case of the SGL penalty function. We define a scaling map indexed by an s-element vector δ, which gives rise to a collection of all scaling map functions. Since each scaling map is a linear (and bounded) operator, this collection is a real vector space. To perform group-regularized estimation, we define an SGL penalty as a norm on this space for a fixed mixing weight, and the SGL regularization estimation is then formulated as a constrained optimization over this function class. Lemma 3 (basic inequality) provides the essential finite-sample inequalities that lead to the estimation consistency. We need some further notation before presenting our theoretical guarantees. Let N(ε, F, ‖·‖_n) denote the minimal covering number of a function set F under the empirical metric ‖·‖_n based on the observed random vectors; that is, the smallest number of functions (not necessarily in F) such that every function in F lies within empirical distance ε of one of them. The ε-entropy of F for the empirical metric is then H(ε, F, ‖·‖_n) = log N(ε, F, ‖·‖_n). We postulate three assumptions, including a moment condition on the error terms and a condition that the relevant eigenvalues are bounded below away from zero. Theorem 2 (consistency) states that, under Assumptions 1–3, if the two tuning parameters decay at appropriate rates, then the proposed estimator is consistent; that is, Theorem 2 implies estimation consistency under the right rates for the two tuning parameters λ and τ. Because of the potential identifiability issues explained in detail in Appendix B, the estimator may not be unique; nevertheless, the sum of the estimated parametric and nonparametric components is not too far away from the sum of their true counterparts.
Corollary 1 specializes Theorem 2 to the case in which the entropy of the RKHS can be bounded explicitly. The proofs of Theorem 2 and Corollary 1 are given in Appendix A.4 and Appendix A.5, respectively. Often, when we are interested only in a subset of functions in the RKHS (e.g., functions with norm at most one), we can substitute the full space in Corollary 1 with the subspace of interest; refer to [15] or [32], both of which considered an RKHS (i.e., a Sobolev space) with functions of norm less than or equal to one.

5. Simulation Experiments

We performed extensive simulations to investigate the performance of our proposed procedure, including the performance of SGL variable selection and its overall accuracy. Due to space limitations, we include results from two simulation experiments in this section; more results may be found in the first author's Ph.D. dissertation [30].

5.1. Setup

In the evaluation of performance accuracy, following [15], we used both the quasi-R², Q² = 1 − Σ_i(y_i − ŷ_i)² / Σ_i(y_i − ȳ)², computed on the test set, and an adjusted quasi-R² that discounts for the number of selected features; the latter is appealing for comparing estimation sparsity. There is another performance metric of interest in addition to model accuracy: performance in variable selection, summarized in terms of stability as measured by the sensitivity and specificity of both functional and feature selection in these simulation experiments. Our algorithm uses existing R packages, including emmreml, kspm, and oem. Specifically, we designed the following two simulation settings. Scenario 1: a single functional predictor with sparsity in the FPC features. Scenario 2: multiple functional predictors with sparsity in the functional predictors and with sparsity in the FPC features of the important functional predictors. Each scenario is handled using suitable penalty functions that address the designed sparsity; for example, in Scenario 2 we used a two-level variable selection penalty (e.g., SGL) to deal with the two types of sparsity in the true model. In all analyses, we used the Gaussian kernel K(z, z′) = exp(−‖z − z′‖²/ρ) in our estimation, where the bandwidth ρ was set to the number of features s, which is equivalent to dividing the feature vector by √s. This bandwidth parameter may be either estimated or set to the number of features to overcome the identifiability issue, according to [33], where theoretical justification was given for using the number of features as the bandwidth parameter in the case of the Gaussian kernel. Following [23], because of the difficulty of graphically displaying the estimated s-dimensional function h, we summarized the goodness-of-fit by regressing the true h on the estimated ĥ, with both evaluated at the design points.
From this concordance regression analysis, we may measure the goodness-of-fit of ĥ through the average intercepts, slopes, and R-squared values (the coefficient of determination) obtained over the replications. Clearly, a high-quality fit is reflected by (i) the intercept being close to zero, (ii) the slope being close to one, and (iii) the R-squared being close to one. Moreover, we graphically display the estimated function by setting all variables equal to 0.5 except the one of interest, evaluated over a grid of 100 equally spaced points on the unit interval. Such visualization of the functional estimate at each margin further facilitates the evaluation of the proposed algorithm, in addition to the results obtained from the concordance regression analyses. In all scenarios, we generated 1000 IID functional paths, of which 750 were assigned to the training set and 250 to the test set for external performance evaluation; performance accuracy is reported on the test set. We used a one-dimensional scalar covariate, drawn as independent copies across subjects, to show the flexibility of our model in a semi-parametric setting. We chose the true coefficients in the kernel machine model to be similar to those given in [23].
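The evaluation metrics can be sketched as follows, under assumed formulas (the paper's exact adjusted quasi-R² may differ): quasi-R² as 1 − SSE/SST on held-out data, an adjusted version discounting for the number d of selected features, and the concordance regression of the true h on ĥ reporting (intercept, slope, R²):

```python
import numpy as np

# Evaluation-metric sketch under assumed formulas: quasi-R^2, an assumed
# degrees-of-freedom adjustment, and the concordance regression of the
# true h on its estimate h_hat reporting (intercept, slope, R^2).
def quasi_r2(y, y_hat):
    sse = float(((y - y_hat) ** 2).sum())
    sst = float(((y - y.mean()) ** 2).sum())
    return 1.0 - sse / sst

def adjusted_quasi_r2(y, y_hat, d):        # d: number of selected features
    n = len(y)
    return 1.0 - (1.0 - quasi_r2(y, y_hat)) * (n - 1) / (n - d - 1)

def concordance(h_true, h_hat):
    A = np.column_stack([np.ones_like(h_hat), h_hat])
    (a, b), *_ = np.linalg.lstsq(A, h_true, rcond=None)
    return a, b, quasi_r2(h_true, A @ np.array([a, b]))

rng = np.random.default_rng(4)
h_hat = rng.standard_normal(250)                       # stand-in estimate
h_true = 0.02 + h_hat + 0.05 * rng.standard_normal(250)
a, b, r2 = concordance(h_true, h_hat)                  # near (0, 1, 1)
```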

5.2. Simulation in Scenario 1

In this simple scenario with a single functional predictor, we simulated data from a model with sparsity in its FPC features. To do so, we generated a single functional predictor based on the first 15 Fourier basis functions: Z(t) = Σ_{j=1}^{15} √λ_j ξ_j φ_j(t). That is, a functional predictor was created as a linear combination of the 15 basis functions, where φ_j is the jth Fourier basis function, λ_j is the jth eigenvalue of Z, and ξ_j is the jth FPC score, simulated from a normal distribution as detailed below. There were 100 sampled time points, first equally spaced over the domain and then perturbed by small deviations drawn from a normal distribution. The scores ξ_j were drawn independently as mean-zero normal variables with variances λ_j. As was done in [17], instead of directly using ξ_j, we used ζ_j = Φ(ξ_j/√λ_j), where Φ is the CDF of the standard normal; this resulted in ζ_j ~ Uniform(0, 1). We chose the second, ζ₂, and ninth, ζ₉, features as the important features in a true nonlinear, non-additive model with Gaussian noise. FPCA was performed with the R package PACE [34], producing the estimated FPC scores ξ̂_j as well as the estimated eigenvalues λ̂_j, which in turn enabled us to compute ζ̂_j = Φ(ξ̂_j/√λ̂_j), j = 1, …, 15. We applied both the LASSO and MCP penalty functions in our implementation, termed FKMRLasso and FKMRMCP, respectively. We compared the results of our method with the standard linear approach under the assumption of linear functional relationships, using both LASSO and MCP penalties, as well as with the COSSO method for functional additive regression [15] using the R package COSSO [15,34]. Since the COSSO package is built for nonparametric regression (and not partially linear models), we adopted a backfitting strategy and regressed the residuals with our estimated effect of the scalar covariate removed. In addition, we compared our method with an oracle FKMR estimator that assumed full knowledge of the true feature set containing the two true nonzero signals, ζ₂ and ζ₉. We also considered two oracle versions of our proposed algorithm, both of which used the knowledge of the true features in order to evaluate the performance of the FPCA procedure.
This evaluation is important because, once FPCA is used to obtain the features, our algorithm essentially operates in a standard regression setting with sparse covariates; thus, our proposed procedure can in principle be used in simpler cases with scalar covariates that do not involve functional predictors. In Scenario 1, owing to the highly nonlinear relationships between the FPC features and the outcome, the naive linear model performed poorly, as expected, in terms of both model selection and model consistency. The detailed simulation results for Scenario 1 can be found in the first author's Ph.D. dissertation [30]. In brief, our proposed method worked well in all respects. In this setting, COSSO also worked well in terms of model fit, but it tended to select noisy features more frequently than our proposed method, leading to more false positives.
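The Scenario-1 data generation can be sketched as follows; the domain [0, 10], the eigenvalue decay λ_j = j⁻², the jitter scale, and the toy response in ζ₂ and ζ₉ below are our assumptions for illustration, not the paper's exact choices:

```python
import numpy as np
from math import erf

# Scenario-1-style data generation (our assumptions: domain [0, 10],
# lambda_j = j^{-2}, jitter sd 0.05, toy response in zeta_2 and zeta_9).
rng = np.random.default_rng(5)
n, J, T = 750, 15, 10.0
t = np.sort(np.linspace(0.0, T, 100) + rng.normal(0.0, 0.05, 100))  # jittered grid

def fourier_basis(t, J, T):
    B = [np.ones_like(t) / np.sqrt(T)]
    for k in range(1, J):
        fn = np.sin if k % 2 else np.cos
        B.append(np.sqrt(2.0 / T) * fn(2.0 * np.pi * ((k + 1) // 2) * t / T))
    return np.vstack(B)                      # J x 100, orthonormal on [0, T]

lam = 1.0 / np.arange(1, J + 1) ** 2         # assumed eigenvalue decay
xi = rng.standard_normal((n, J)) * np.sqrt(lam)        # FPC scores xi_j
Z = xi @ fourier_basis(t, J, T)              # n functional paths on the grid
Phi = np.vectorize(lambda x: 0.5 * (1.0 + erf(x / np.sqrt(2.0))))
zeta = Phi(xi / np.sqrt(lam))                # Uniform(0, 1) features
y = 2.0 * np.cos(2.0 * np.pi * zeta[:, 1]) * zeta[:, 8] \
    + 0.2 * rng.standard_normal(n)           # toy nonlinear, non-additive model
```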

5.3. Simulation in Scenario 2

Now, we generated four functional predictors Z₁, …, Z₄ of the same form as in Scenario 1, with the basis functions, eigenvalues, and FPC scores set in the same way as those given there. It follows that ζ_{jℓ} = Φ(ξ_{jℓ}/√λ_{jℓ}), where ζ_{jℓ} is the jth Φ-transformed feature of the ℓth functional covariate. Sparsity was specified as follows: the first and second functional covariates, Z₁ and Z₂, were chosen as the important signals, with five important transformed FPC features (three features from Z₁ and two features from Z₂) related to the outcome through a nonlinear model with Gaussian noise. This model specifies both group sparsity (two of the four functional predictors) and within-group sparsity (three of the nine FPC features in Z₁ and two of the nine FPC features in Z₂). In addition, we specified non-additive relationships in the true model across the multiple functional covariates. We fit the data using the proposed methods, including FKMRLasso, FKMRGLasso, FKMRSGL, FKMRMCP, FKMRGMCP, and FKMRGMCPoracle, and the results based on 100 replicates are summarized in Table 1. For comparison, we also fit the simulated data with existing methods, including the linear model (denoted LM + penalty), COSSO functional additive regression, and the oracle method using the knowledge of the true important features, as in the simulation of Scenario 1. Regarding the goodness-of-fit in Table 1, all of our FKMR estimators outperformed the standard linear estimators across all of our penalty functions, and they outperformed COSSO for the penalties that accounted for group sparsity. In the concordance regression analysis, all intercepts were close to zero, all slopes close to one, and all R² values close to one, indicating a high goodness-of-fit for the functional estimation. COSSO tended to perform on par with the penalties that did not account for group sparsity (LASSO and MCP). It is evident that the group sparsity penalty functions (SGL, GLasso, and GMCP) clearly outperformed the methods that did not regularize the grouping of covariates (Lasso and MCP).
In addition, our FKMR estimators (except ) performed as well as the oracle estimator, both in terms of the adjusted R² and in terms of the estimate of the function h. The results also indicated that there was little difference between using a concave penalty function (MCP or GMCP) and using a convex one (GLasso or SGL).
Table 1

Goodness-of-fit and the concordance regression for Scenario 2.

Model             R²_adj   β̂       Intercept   Slope   R²
FKMRLasso         0.830    2.00    −0.062      1.01    0.848
FKMRGLasso        0.937    1.99    −0.055      1.01    0.972
FKMRSGL           0.928    2.00    −0.051      1.01    0.955
FKMRMCP           0.835    2.01    −0.062      1.01    0.856
FKMRGMCP          0.935    1.99    −0.056      1.01    0.970
FKMRGMCP (oracle) 0.911    1.99    −0.049      1.01    0.937
COSSO             0.832    –       –           –       –
LM + Lasso        0.453    –       –           –       –
LM + GLasso       0.324    –       –           –       –
LM + SGL          0.450    –       –           –       –
LM + MCP          0.513    –       –           –       –
LM + GMCP         0.307    –       –           –       –
(Intercept, Slope, and R² are from the concordance regression of h on ĥ.)
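The concordance regression used in Table 1 regresses the true surface values h on the fitted values ĥ and checks that the intercept is near zero, the slope near one, and R² near one. A minimal sketch of this diagnostic (Python for illustration; the function name is hypothetical):

```python
import numpy as np

def concordance_regression(h_true, h_hat):
    """OLS of true values on fitted values.

    Intercept near 0, slope near 1, and R^2 near 1 together
    indicate good recovery of the underlying function.
    """
    X = np.column_stack([np.ones_like(h_hat), h_hat])  # design: 1, h_hat
    beta, *_ = np.linalg.lstsq(X, h_true, rcond=None)  # OLS fit
    resid = h_true - X @ beta
    r2 = 1 - resid.var() / h_true.var()
    return beta[0], beta[1], r2

rng = np.random.default_rng(1)
h = rng.normal(size=200)
h_hat = h + 0.05 * rng.normal(size=200)  # a nearly perfect estimate
b0, b1, r2 = concordance_regression(h, h_hat)
```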
As regards the group sparsity, Table 2 indicates that all methods had high sensitivity in detecting functional signals, while the proposed FKMR methods had better specificity than both the sparse linear models and COSSO. Concerning the within-group sparsity, it is interesting to note that a bigger difference was seen depending on the type of penalty function used in feature selection. As shown in Table 3 and Table 4, a general penalty (e.g., Lasso and MCP) that does not take the grouping structure into account tended to under-select important features within a group. COSSO tended to perform well in terms of within-group sparsity. Moreover, Figure 2 shows that the FKMR method estimated the five signal functions ( and ) well.
Table 2

Sensitivity and specificity of functional selection for Scenario 2.

Model         Selection frequency (%)
              Z1^    Z2^    Z3^    Z4^
FKMRLasso     100    100    0      0
FKMRGLasso    100    100    4      4
FKMRSGL       100    100    0      0
FKMRMCP       100    100    0      0
FKMRGMCP      100    100    3      4
COSSO         100    100    5      6
LM + Lasso    100    100    19     21
LM + GLasso   94     99     7      8
LM + SGL      100    100    19     18
LM + MCP      100    100    20     19
LM + GMCP     93     99     7      8
Table 3

FPC feature selection for the signal functional covariate Z1 in Scenario 2.

Model         Selection frequency (%)
              ζ11^   ζ21^   ζ31^   ζ41^   ζ51^   ζ61^   ζ71^   ζ81^   ζ91^
FKMRLasso     100    1      97     0      0      0      0      0      0
FKMRGLasso    100    100    100    100    100    100    100    100    100
FKMRSGL       100    21     100    71     26     20     17     16     15
FKMRMCP       100    1      99     1      0      0      0      0      0
FKMRGMCP      100    100    100    100    100    100    100    100    100
COSSO         100    2      100    93     1      0      0      1      0
LM + Lasso    100    10     100    100    10     8      7      10     5
LM + GLasso   94     94     94     94     94     94     94     94     94
LM + SGL      100    12     100    100    10     8      8      11     5
LM + MCP      100    10     100    100    9      8      9      7      5
LM + GMCP     93     93     93     93     93     93     93     93     93
Table 4

FPC feature selection for the signal functional covariate Z2 in Scenario 2.

Model         Selection frequency (%)
              ζ12^   ζ22^   ζ32^   ζ42^   ζ52^   ζ62^   ζ72^   ζ82^   ζ92^
FKMRLasso     0      3      0      0      0      0      100    0      0
FKMRGLasso    100    100    100    100    100    100    100    100    100
FKMRSGL       16     100    14     7      16     23     100    15     7
FKMRMCP       0      11     0      0      0      1      100    0      0
FKMRGMCP      100    100    100    100    100    100    100    100    100
COSSO         8      97     5      5      5      15     100    3      3
LM + Lasso    17     100    14     7      16     23     100    15     6
LM + GLasso   99     99     99     99     99     99     99     99     99
LM + SGL      17     100    14     7      16     23     100    15     7
LM + MCP      17     100    13     6      16     23     100    15     8
LM + GMCP     99     99     99     99     99     99     99     99     99
Figure 2

Five marginal estimates of important feature functions with 95% shaded confidence bands evaluated at 100 grid points while holding all other components equal to in Scenario 2.
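The selection frequencies in Tables 2–4 are the per-feature proportions of replicates in which a feature was selected; sensitivity and specificity then summarize these rates over the truly important features and the noise features, respectively. A small sketch of how such a summary can be computed (Python for illustration; names are hypothetical):

```python
import numpy as np

def selection_rates(selected, truth):
    """Summarize variable selection over simulation replicates.

    selected: (n_replicates, p) boolean matrix of selection indicators.
    truth:    (p,) boolean vector marking the truly important features.
    Returns per-feature selection frequency (%), sensitivity, specificity.
    """
    freq = 100 * selected.mean(axis=0)
    sens = selected[:, truth].mean()        # rate of keeping true signals
    spec = 1 - selected[:, ~truth].mean()   # rate of dropping noise features
    return freq, sens, spec

# toy example: 2 true signals, 2 noise features, 100 replicates
truth = np.array([True, True, False, False])
selected = np.array([[1, 1, 0, 0]] * 95 + [[1, 1, 1, 0]] * 5, dtype=bool)
freq, sens, spec = selection_rates(selected, truth)
```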

6. Data Example

To show the usefulness of our proposed methodology, we analyzed data from 550 children recruited by the ELEMENT study [35], who consented to wear an actigraph (ActiGraph GT3X+; ActiGraph LLC, Pensacola, FL, USA). The wearable was to be placed on the non-dominant wrist for five to seven days without interruption. The actigraph measured tri-axis accelerometer data sampled at 30 Hz, which captured three different directions of a person's movement. The BMI was the outcome of interest, as it is a biomarker of obesity. Sex and age were confounding factors used in the analysis. Due to some missing data, our analysis only included children who wore the device properly for 85% or more of the study period, which resulted in 395 participants, consisting of 189 males and 206 females. Other studies such as [36] have excluded days of accelerometer data with more than five percent missing. The mean ± SD BMI of the study cohort was 21.5 ± 4.1, and the mean age of the study participants was 14.3 ± 2.1 y. A more detailed description of the dataset used for this paper can be found in [37]. Our primary interest was to see whether the BMI is associated with physical activity in the presence of other covariates, specifically sex and age. We preprocessed the activity counts over the 7 d of wear by taking the median in each 1 min epoch across the entire 7 d of wear. For example, since all the participants started wearing the device at 3 p.m., the first data point for each individual was the median of 7 ACs (one per day) for the 1 min epoch of 3:00–3:01 p.m. This procedure of taking medians across the same minute from different days has been considered in other applications such as [36]. See Figure 3 for an example of the resulting time series of medians derived from the AC data displayed in Figure 1.
Figure 3

The 24 h minute-by-minute medians of 7 d ACs for one subject.
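The preprocessing described above, collapsing 7 d of per-minute ACs into a single 24 h profile of minute-by-minute medians, can be sketched as follows (Python for illustration; the flat day-major array layout is an assumption):

```python
import numpy as np

def minute_medians(ac, minutes_per_day=1440):
    """Collapse a multi-day per-minute activity-count series into one
    24 h profile by taking, for each 1 min epoch, the median across days.

    ac: 1-D array of per-minute counts, length = n_days * minutes_per_day,
        stored day by day starting at the same clock time each day.
    """
    days = ac.reshape(-1, minutes_per_day)  # (n_days, minutes_per_day)
    return np.median(days, axis=0)          # one median per 1 min epoch

rng = np.random.default_rng(2)
week = rng.poisson(lam=200, size=7 * 1440).astype(float)  # 7 d of fake ACs
profile = minute_medians(week)
print(profile.shape)  # (1440,)
```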

We applied the following five models, labeled M0–M4 for convenience, to analyze the data with the 24 h median ACs as functional predictors. Let  denote the ith person's kth FPC score for functional predictor j.

M0: linear model (LM) with only the fixed features;
M1: linear model with the SGL penalty (LM + SGL) using the FPCA features;
M2: LSKM using the FPCA features;
M3: FKMR model with the SGL penalty (FKMRSGL) using the FPCA features;
M4: COSSO using the FPCA features.

In order to apply the COSSO R package directly, we used residuals in the COSSO model fit, with  and  being the estimates of the coefficients from Model M0. The BMI and age were mean centered and scaled to have a standard deviation of one, so the intercept was absent in the models. Here are some key findings from the data analyses. First, in terms of the goodness-of-fit, Table 5 suggests that M3, i.e., our proposed FKMR model with the SGL penalty, gave the best performance; its adjusted R² was nearly twice as large as that of any of the other four models. Second, it is interesting to note that neither COSSO nor FKMRSGL selected the FPC scores associated with the Z-axis. Third, as shown in Table 6, all of the FPC components chosen by COSSO were also chosen by FKMRSGL. It is worth noting that the linear model with the SGL penalty selected the largest number of FPC components, yet performed the worst in terms of the model fit.
Table 5

Goodness-of-fit for the five models used in the data analysis.

Model           Adjusted R²
M0: LM          0.07
M1: LM + SGL    0.13
M2: LSKM        0.18
M3: FKMRSGL     0.30
M4: COSSO       0.14
Table 6

Axis-specific FPC feature selection.

Model      X-axis: ζ11^ ζ21^ ζ31^ ζ41^ ζ51^ ζ61^   Y-axis: ζ12^ ζ22^ ζ32^ ζ42^ ζ52^   Z-axis: ζ13^ ζ23^ ζ33^ ζ43^
FKMRSGL
COSSO
LM + SGL

7. Conclusions

In this paper, we proposed a method to model the nonlinear relationship between multiple functional predictors and a scalar outcome in the presence of other scalar confounders. We used FPCA to decompose the functional predictors for feature extraction and used the LSKM framework to model the functional relationship between the outcome and the principal components. We developed a procedure that simultaneously selects important functional predictors and important features within the selected functional predictors, and we proposed a computationally efficient algorithm to implement our regularization method, which is easily programmed in R with the utility of multiple existing R packages. Although we focused on functional regression in this paper, the proposed method can also be applied to non-functional predictors: by using functional principal components, we essentially bypassed the infinite-dimensional problem and worked in a non-functional framework with the FPC features. Through simulations and the analysis of the ELEMENT dataset, we demonstrated that the FKMR estimator outperformed existing methods in terms of both variable selection and model fit, although the existing COSSO method did perform well in variable selection, as shown in Section 5. A technical issue pertains to identifiability limitations with regard to the bandwidth parameter and the RKHS estimator; to overcome this, we suggested fixing the bandwidth parameter (see the detailed discussion in Section 3). We established key theoretical guarantees for our proposed estimator; in the case where multiple estimators exist (and thus identifiability issues arise), the theoretical properties established in Section 4 apply to any of them. Variable selection on functional predictors presents many technical challenges, and many methodological problems remain unsolved.
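The simultaneous selection of functional predictors and of features within them corresponds to the bi-level sparsity that a sparse group lasso penalty induces. As an illustrative sketch (not the paper's actual algorithm; Python, hypothetical names), the proximal operator of that penalty combines element-wise soft-thresholding with group-wise shrinkage:

```python
import numpy as np

def prox_sparse_group_lasso(beta, groups, lam1, lam2):
    """Proximal operator of lam1 * ||b||_1 + lam2 * sum_g ||b_g||_2.

    Element-wise soft-thresholding (within-group sparsity) followed by
    group-wise shrinkage (group sparsity): entire groups can be zeroed
    out, and surviving groups can still contain zero entries.
    """
    u = np.sign(beta) * np.maximum(np.abs(beta) - lam1, 0.0)  # L1 step
    out = np.zeros_like(u)
    for g in np.unique(groups):
        idx = groups == g
        norm = np.linalg.norm(u[idx])
        if norm > lam2:                        # group survives, shrunk
            out[idx] = (1 - lam2 / norm) * u[idx]
    return out

# two groups of three coefficients; only one feature per group is strong
beta = np.array([3.0, -0.2, 0.1, 2.5, 0.05, 0.0])
groups = np.array([0, 0, 0, 1, 1, 1])
b = prox_sparse_group_lasso(beta, groups, lam1=0.3, lam2=0.5)
```

Iterating such a proximal step within a gradient or coordinate-descent scheme is one standard way to fit sparse-group-penalized objectives.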
This paper demonstrated a possible framework for regularized estimation with a bi-level sparsity of functional group sparsity and within-group sparsity. In the LSKM paper [23], it was briefly mentioned that if the relationship between the scalar outcome and the p genetic pathways is additive, the model can be modified so that each belongs to its own RKHS; it is easy to extend our method and algorithms to handle this case. For future research, an extension to longitudinal outcomes may be considered via a mixed-effects model in which are the random effects. Other useful extensions of the proposed paradigm would be along the lines of generalized linear models and Cox regression models.