Literature DB >> 35510184

Robust importance sampling for error estimation in the context of optimal Bayesian transfer learning.

Omar Maddouri1, Xiaoning Qian1,2, Francis J Alexander2, Edward R Dougherty1, Byung-Jun Yoon1,2.   

Abstract

Classification has been a major task for building intelligent systems because it enables decision-making under uncertainty. Classifier design aims at building models from training data that represent feature-label distributions, either explicitly or implicitly. In many scientific or clinical settings, training data are typically limited, which impedes the design and evaluation of accurate classifiers. Although transfer learning can improve learning in target domains by incorporating data from relevant source domains, it has received little attention for performance assessment, notably in error estimation. Here, we investigate knowledge transferability in the context of classification error estimation within a Bayesian paradigm. We introduce a class of Bayesian minimum mean-square error estimators for optimal Bayesian transfer learning, which enables rigorous evaluation of classification error under uncertainty in small-sample settings. Using Monte Carlo importance sampling, we demonstrate the outstanding performance of the proposed estimator for a broad family of classifiers that span diverse learning capabilities.
© 2022 The Authors.

Keywords:  Bayesian error estimator; OBTL; classification; error estimation; importance sampling; model uncertainty; optimal Bayesian transfer learning; transfer learning

Year:  2022        PMID: 35510184      PMCID: PMC9058919          DOI: 10.1016/j.patter.2021.100428

Source DB:  PubMed          Journal:  Patterns (N Y)        ISSN: 2666-3899


Introduction

Transfer learning (TL) provides a promising means to repurpose the data and/or scientific knowledge available in other relevant domains for new applications in a given domain. The ability to transfer relevant data/knowledge across different domains makes it practical to learn effective models in target domains with limited data. Classifier design can take advantage of TL to address the small-sample challenges we often face in various scientific applications. However, rigorous error estimators that can leverage such transferred data/knowledge for better estimation of classification error have been missing to date, which makes the design framework epistemologically incomplete. Generally, the scientific validity of any predictive model is assessed by its ability to generalize outside the observed training sample. However, the available sample is often too small in many scientific applications (e.g., biomarker discovery) to hold out sufficient data just for testing purposes, which makes the reuse of training data for both classifier design and error estimation inevitable. While various error estimation schemes exist to date, their accuracy and reliability in a small-sample setting are often questioned. For instance, Dalton and Dougherty list many classification studies of cancer gene expression data in which performance was assessed by cross-validation (CV) based on small training datasets. Analyses in Braga-Neto and Dougherty have shown that CV error estimators derived from small samples exhibit large variance, which explains the controversy across many biological studies that relied on data-driven CV. Model-based error estimation also faces practical challenges, as non-informative modeling assumptions may mislead the error estimators in case of model mismatch.
The ability for accurate error estimation based on small samples is also critical in other contexts, an example being continual learning, where a series of labeled datasets are sequentially fed to the learner as in realistic learning scenarios. In recent years, continual learning regained attention as a promising strategy for avoiding “catastrophic forgetting” that may arise when the training data are split for a series of small learning operations called tasks. Such a continual learning setting is becoming prevalent these days, where retaining the observed training data is either undesirable (confidentiality) or intractable (high-throughput systems), and developing reliable task-specific error estimators is indispensable. For instance, an intuitive approach to continual learning from a Bayesian perspective is to leverage the posterior of the current task to update the prior of the next task. However, analysis in Farquhar and Gal has shown that evaluation approaches for this prior-focused setup suffer from severe bias in realistic scenarios, particularly for finely partitioned data. Recent work in Goodfellow et al. provided a solution for test data scarcity by reusing the same test set in the context of a continuously evolving classification problem. To avoid overfitting the test data, the authors employed a reusable holdout mechanism based on the area under the receiver operating characteristic curve metric. Nevertheless, this approach remains contingent on the availability of an independent test set. For these reasons, there is a pressing need to develop novel error estimators that can effectively overcome data scarcity limitations. For assessing different classification models in the context of small-size training datasets, having an accurate error estimator with TL capabilities that can take advantage of relevant datasets in other domains would be highly beneficial. 
Such an estimator would be readily applicable to continual learning, as cross-task datasets can be seen as related source-target samples. In the next sections, we provide a brief review of the standard error estimation techniques along with prevalent TL scenarios. A more comprehensive review can be found in the supplemental information, sections 3 and 5. For unknown feature-label distributions, the classification error of a given classifier is typically estimated by leveraging a large sample collected from the true distribution. However, limiting factors, such as the excessive cost of large-scale data acquisition, often make it infeasible to collect and hold out large test sets. Consequently, the available small sample may have to be used for both training and evaluating the classifier, and researchers have strived to devise practical methods for accurate error estimation. Existing error estimation schemes can be broadly categorized into parametric and non-parametric methods. Non-parametric estimators compute the error rate by counting the misclassified points; widely used estimators include the resubstitution, CV, and bootstrap estimators. Parametric methods include the popular plug-in estimator that naively estimates the true error from an empirical model. The Bayesian minimum mean-square error estimator (BEE) proposed in Dalton and co-workers is another benchmark parametric estimator that significantly enhances robustness by computing the expected true error with respect to the posterior of the model parameters. The BEE has shown notable improvements over standard estimators, as it effectively handles the uncertainty about the underlying feature-label distribution. Recently, TL has emerged as an alternative that provides remedies for pitfalls caused by training data scarcity in a target domain by utilizing available data from different yet relevant source domains. Based on the properties of source and target domains, two scenarios of TL may arise.
The first one, commonly known as “homogeneous TL,” occurs when the source and target domains share the same feature space. The second scenario is called “heterogeneous TL” and arises when differences exist between domains in terms of their feature space or data dimensionality. In practice, the most common setting for TL, also known as domain adaptation, assumes similar families of feature-label distributions across domains. In this study, we propose a TL framework for robust estimation of classification error based on a rigorous Bayesian paradigm. To the best of our knowledge, this study is the first work on a TL-based BEE, which can significantly enhance our understanding of transferability across domains in the context of error estimation. Building on the Bayesian transfer learning framework proposed in Karbalayghareh et al., we introduce a TL-based BEE estimator that can enhance the error estimation accuracy in the target domain by utilizing the data available in a relevant source domain based on the joint prior of their feature-label distributions. We present a rigorous study of error estimation in the context of Bayesian TL and show that our proposed TL-based BEE effectively represents and exploits the relatedness (or dependency) between different domains to improve error estimation in a challenging small-sample setting, where the number of observed data points from the target domain of interest is in the range of 5–50. For applicability of the proposed TL-based BEE estimator in real-world problems for arbitrary classifiers, we introduce an efficient and robust importance sampling setup with control variates, where the importance density and the control variate function are carefully defined to reduce the variance of the estimator while keeping the overall sampling process computationally feasible and scalable. For this purpose, we utilize Laplace approximations for fast evaluation of matrix-variate confluent and Gauss hypergeometric functions.
The performance of the TL-based BEE estimator is extensively evaluated using both synthetic and real-world biological datasets. As our main focus in this study is the estimation of classification error, we consider a variety of existing classifiers with different levels of learning capabilities to demonstrate the general applicability of our TL-based BEE estimation scheme. We also show the outstanding performance of the proposed estimator relative to commonly used standard error estimation techniques.
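The standard non-parametric estimators reviewed above count misclassified points in different ways. As a hedged illustration (not the paper's implementation), the following sketch contrasts resubstitution and leave-one-out cross-validation; the nearest-centroid classifier is an assumption made only to keep the example self-contained.

```python
import numpy as np

# Illustrative sketch (not the paper's code): two non-parametric error
# estimators applied to a simple nearest-centroid classifier.

def nearest_centroid_fit(X, y):
    # One centroid per class label.
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def nearest_centroid_predict(model, X):
    classes = np.array(sorted(model))
    d = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    return classes[np.argmin(d, axis=0)]

def resubstitution_error(X, y):
    # Error counted on the same data used for training (optimistically biased).
    model = nearest_centroid_fit(X, y)
    return float(np.mean(nearest_centroid_predict(model, X) != y))

def loo_error(X, y):
    # Leave-one-out cross-validation: n refits, each tested on one held-out point.
    n = len(y)
    errs = 0
    for i in range(n):
        mask = np.arange(n) != i
        model = nearest_centroid_fit(X[mask], y[mask])
        errs += int(nearest_centroid_predict(model, X[i:i + 1])[0] != y[i])
    return errs / n
```

In the small-sample regime discussed above, both estimates are computed from the same training data, which is precisely why their variance and bias become problematic.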

Results and discussion

Overview of the proposed Bayesian error estimation via TL

We propose a class of Bayesian minimum mean-square error (MMSE) estimators for TL where the observed sample is a mixture of source and target data. The basic classification setting and a brief review of the standard BEE estimator are presented in the supplemental information, sections 2 and 4. For symbols and notations, see Table S1. Rooted in signal estimation, the BEE has been motivated by optimal filtering for functions of random variables. For a function g(X, Y) of two random variables, the optimal estimator of the filter after observing only Y, in the mean-square sense, is given by E[g(X, Y) | Y]. Replacing X with the parameter vector θ of the feature-label distribution and Y with the sample S_n (of size n) leads to the standard BEE introduced in Dalton and Dougherty: ε̂(S_n) = E_θ[ε(θ, ψ) | S_n], where ε(θ, ψ) is the true error of the designed classifier ψ. In TL, the sample S is a mixture of source and target data such that S = S_s ∪ S_t, with n = n_s + n_t, and the classifier is designed on either S_s, S_t, or S. We note that S_s and S_t are two labeled datasets from the source and target domains with sizes n_s and n_t, respectively (see Bayesian TL framework for binary classification for generation details). This requires close attention, as the TL-based BEE is valid only for classifiers that are fixed given the sample. This assumption carries limitations. For instance, classifiers that are fixed given S but not given S_t are not deterministic for every set of parameters estimated based on S_t. In this paper, we introduce the TL-based BEE defined as ε̂(S) = E_θ[ε(θ_t, ψ) | S], where θ = (θ_t, θ_s) denotes the parameter vector of the joint model formed by the target parameters θ_t and the source parameters θ_s. For a fixed classifier given S, this estimator is optimal on average in the mean-square sense and unbiased when averaged over all parameters and samples. For classification in the target domain, the posterior density reduces to the posterior of the target parameters after observing the target and source data and takes the form π*(θ_t) = π(θ_t | S_s, S_t), which is obtained by marginalizing out the source domain parameters θ_s.
Ultimately, the BEE for TL takes the form ε̂(S) = E_{θ_t}[ε(θ_t, ψ) | S_s, S_t]. For the sake of simplicity, we write ε̂(S) = ∫ ε(θ_t, ψ) π*(θ_t) dθ_t, where π*(θ_t) denotes the posterior of the target parameters after observing the hybrid sample S = S_s ∪ S_t.
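The TL-based BEE is thus an expectation of the true error over the posterior of the target parameters. As a minimal illustration of that idea (not the paper's importance-sampling scheme), the sketch below averages the exact true error of a fixed one-dimensional threshold classifier over hypothetical posterior draws of Gaussian class parameters; `posterior_draws` and the threshold rule are illustrative assumptions.

```python
import numpy as np
from math import erf, sqrt

# Hedged sketch of a Bayesian MMSE error estimate by plain Monte Carlo:
# average the true error of a FIXED classifier over draws of the
# feature-label parameters from their posterior.

def true_error_gaussian_1d(mu0, mu1, sigma, threshold):
    # Exact error of the rule "predict class 1 if x > threshold" for two
    # equally likely 1-D Gaussian classes N(mu0, sigma^2), N(mu1, sigma^2).
    Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))
    p_err0 = 1 - Phi((threshold - mu0) / sigma)  # class-0 points above threshold
    p_err1 = Phi((threshold - mu1) / sigma)      # class-1 points below threshold
    return 0.5 * (p_err0 + p_err1)

def bee_monte_carlo(posterior_draws, threshold):
    # posterior_draws: iterable of (mu0, mu1, sigma) tuples sampled from the
    # posterior of the model parameters given the (hybrid) sample.
    errs = [true_error_gaussian_1d(m0, m1, s, threshold)
            for m0, m1, s in posterior_draws]
    return float(np.mean(errs))
```

A narrower posterior (as induced by highly related source data) concentrates the draws and hence the estimate around the true error.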

Experiments and datasets

To evaluate the performance of the proposed error estimator, we consider the mean-square error (MSE) as a performance measure to understand the joint behavior of the classification error ε and its estimate ε̂. For the random vector (ε, ε̂), the MSE is defined as MSE = E[(ε − ε̂)²]. In what follows, we present an overview of the experimental setup for demonstrating the performance of the proposed TL-based BEE based on three different types of classifiers (see experimental procedures, sections 4.5 and 4.6 for more details) applied to both synthetic data and real-world biological datasets.
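As a small helper illustrating this measure, the following sketch estimates the MSE empirically from paired true errors and estimates collected over repeated runs.

```python
import numpy as np

# Empirical version of the performance measure E[(eps - eps_hat)^2],
# computed over repeated sampling runs.

def mse_deviation(true_errors, estimated_errors):
    true_errors = np.asarray(true_errors, dtype=float)
    estimated_errors = np.asarray(estimated_errors, dtype=float)
    return float(np.mean((true_errors - estimated_errors) ** 2))
```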

Bayesian TL framework for binary classification

We consider a binary classification problem in the context of supervised TL where there are two common classes in each domain. Let S_s and S_t be two labeled datasets from the source and target domains with sizes n_s and n_t, respectively. We are interested in the scenario where n_t ≪ n_s. Let n_s = n_s⁰ + n_s¹, where n_s^y denotes the size of the source data in class y. Likewise, let n_t = n_t⁰ + n_t¹, where n_t^y denotes the size of the target data in class y. We consider a d-dimensional homogeneous transfer learning scenario where the feature vectors x_s and x_t are normally distributed and separately sampled from the source and target domains, respectively: x_z ~ N(μ_z^y, (Λ_z^y)⁻¹) for z ∈ {s, t}, where μ_z^y is the mean vector in domain z for class y, and Λ_z^y is the precision matrix (inverse of the covariance) in domain z for label y. An augmented feature vector x = [x_s^T, x_t^T]^T is a joint sample point from the two related source and target domains, where X^T denotes the transpose of matrix X. This sampling is enabled through a joint prior distribution for the mean vectors and precision matrices that marginalizes out the off-diagonal block matrix Λ_st^y of the joint precision matrix. Using a Gaussian-Wishart distribution as the joint prior for the mean and precision matrices, the joint model factorizes into the prior of the precision matrices and the conditional prior of the means. For conditionally independent mean vectors given the covariances, the joint prior further factorizes across the source and target domains. The block diagonal precision matrices Λ_s^y and Λ_t^y are obtained after sampling from a predefined joint Wishart distribution as defined in Karbalayghareh et al., Λ^y ~ W(M^y, κ^y), where κ^y is a hyperparameter for the degrees of freedom that satisfies κ^y ≥ 2d and M^y is a positive definite scale matrix of the block form M^y = [[M_s^y, M_st^y], [(M_st^y)^T, M_t^y]]; M_s^y and M_t^y are also positive definite scale matrices, and M_st^y denotes the off-diagonal component that models the interaction between the source and target domains. Given Λ_z^y, and assuming normally distributed mean vectors, we get μ_z^y | Λ_z^y ~ N(m_z^y, (ν_z^y Λ_z^y)⁻¹), where m_z^y is the mean of the mean parameter and ν_z^y is a positive scalar hyperparameter. The joint prior distribution as derived in Karbalayghareh et al.
acts like a channel through which the useful knowledge transfers from the source to the target domain, causing the posterior of the target parameters of the underlying feature-label distribution to be distributed more narrowly around the true values.
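The sampling scheme above can be sketched as follows, for one class and under assumed hyperparameter values: draw a joint precision matrix from a Wishart with a block scale matrix, keep the diagonal blocks, and then draw the mean vectors from Gaussians whose precisions are scaled by ν. This is a hedged sketch of the Gaussian-Wishart prior, not the authors' code.

```python
import numpy as np
from scipy.stats import wishart

# Sketch of joint prior sampling for one class (notation as in the text):
# Lambda ~ Wishart(kappa, M) over 2d x 2d matrices with block scale
# M = [[M_s, M_st], [M_st^T, M_t]]; keep diagonal blocks Lambda_s, Lambda_t,
# then mu_z | Lambda_z ~ N(m_z, (nu_z * Lambda_z)^{-1}).

def sample_joint_prior(d, kappa, M, m_s, m_t, nu_s=1.0, nu_t=1.0, rng=None):
    rng = np.random.default_rng(rng)
    Lam = wishart.rvs(df=kappa, scale=M, random_state=rng)  # (2d x 2d) draw
    Lam_s, Lam_t = Lam[:d, :d], Lam[d:, d:]                 # diagonal blocks
    mu_s = rng.multivariate_normal(m_s, np.linalg.inv(nu_s * Lam_s))
    mu_t = rng.multivariate_normal(m_t, np.linalg.inv(nu_t * Lam_t))
    return (mu_s, Lam_s), (mu_t, Lam_t)
```

Because the off-diagonal block Λ_st couples the two diagonal blocks of the Wishart draw, the sampled source and target precisions are statistically dependent, which is exactly the channel for knowledge transfer described above.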

Synthetic datasets

To simulate and verify the extent of knowledge transferability across domains, we consider a wide range of joint prior densities that model different levels of relatedness between the source and target domains. The proposed setup is as follows. We consider a binary classification problem in the context of homogeneous TL with dimensions 2, 3, and 5. In the simulated datasets, the number of source data points per class varies between 10 and 500, and between 5 and 50 for target datasets. This mimics realistic small-sample conditions (especially in the target domain) as reported in the literature. We set up the data distributions as follows: the class-mean hyperparameters are set to m_z⁰ = 0_d and m_z¹ = ϑ · 1_d for z ∈ {s, t}, where ϑ is an adjustable scalar used to control the Bayes error in the target domain, and 0_d and 1_d are the all-zero and all-one vectors, respectively. For the scale matrices of the Wishart distributions we set M_s^y = I_d, M_t^y = I_d, and M_st^y = k · I_d, where I_d is the identity matrix of rank d. To ensure that the joint scale matrix M^y is positive definite, we require |k| < 1. As in Karbalayghareh et al., the value of k controls the amount of relatedness between the source and target domains (see experimental procedures, section 4.6, for more details). To control the level of relatedness by adjusting only k without involving other confounding factors, we use identical diagonal scale blocks across the two domains. In this setting, the correlation between the features across the source and target domains is governed by k, where small values of |k| correspond to poor relatedness between the source and target domains while larger values imply stronger relatedness. To sample from the joint prior, we first sample from a non-singular Wishart distribution to get a block-partitioned sample of the form Λ^y = [[Λ_s^y, Λ_st^y], [(Λ_st^y)^T, Λ_t^y]], from which we extract Λ_s^y and Λ_t^y. Afterward, we sample μ_z^y for z ∈ {s, t} and y ∈ {0, 1}. In our simulations we use two types of datasets: training datasets that contain samples from both domains and testing datasets that contain only samples from the target domain.
In all the simulations we consider testing datasets of 1,000 data points per class and we assume equal prior probabilities for the classes.
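To make the positive-definiteness constraint on the joint scale matrix concrete, the following sketch (with assumed identity diagonal blocks, matching the simulation setup above) builds the block matrix for an off-diagonal component k · I_d and checks which values of k keep it positive definite.

```python
import numpy as np

# Build the joint Wishart scale matrix M = [[I_d, k*I_d], [k*I_d, I_d]]
# and verify positive definiteness as a function of the relatedness k.
# The identity diagonal blocks are an assumption taken from the setup above.

def joint_scale_matrix(d, k):
    I = np.eye(d)
    top = np.hstack([I, k * I])
    bottom = np.hstack([k * I, I])
    return np.vstack([top, bottom])

def is_positive_definite(M):
    # All eigenvalues of a symmetric matrix must be strictly positive.
    return bool(np.all(np.linalg.eigvalsh(M) > 0))
```

With identity diagonal blocks, the eigenvalues of the joint matrix are 1 ± k, which is why |k| < 1 is exactly the admissible range.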

RNA sequencing datasets

To evaluate the performance of the TL-based BEE on real-world data, we consider classifying patients diagnosed with schizophrenia using transcriptomic profiles collected from psychiatric disorder studies. Based on two RNA sequencing (RNA-seq) datasets listed in Table 1, we selected the transcriptomic profiles of three genes based on a stringent feature selection procedure comprising the analysis of differential gene expression, clustering of gene-gene interactions, and statistical testing for multivariate normality. More specifically, we focus on analyzing the astrocyte-related cluster of differentiation 4, found to be significantly upregulated in subjects with schizophrenia. We select the top three hub genes that collectively satisfy Royston's multivariate normality test applied to the full datasets for both classes at a significance level of 99%. The identified genes satisfying all the aforementioned criteria are SOX9, AHCYL1, and CLDN10, with an average module centrality of 0.86 measured by the genes' module membership (kME). In addition to the normalization and quality control performed in Gandal et al., the selected features in both datasets have been further standardized to zero means and unit variances across both classes as in Karbalayghareh and co-workers.
Table 1

Independent schizophrenia RNA-seq datasets sampled from two different brain tissues

Disease          No. of samples (case / control)   Brain region     Dataset
Schizophrenia    53 / 53                           frontal cortex   syn4590909
                 262 / 293                         DLPFC            syn2759792
Total            315 / 346
We consider the dataset syn2759792, sampled from the brain dorsolateral prefrontal cortex (DLPFC) area, as the target dataset and syn4590909, sampled from the frontal cortex (FC) region, as the source dataset. Among the 555 postmortem brain samples in syn2759792, we randomly draw 5 samples per class as training data and use the remaining samples to evaluate the classification error. This process is repeated 10,000 times to estimate the average MSE deviation of the TL-based BEE from the true error. To determine the model hyperparameters, we assume shared values for case and control samples in the source and target domains. As k represents a cross-domain property, we employ the TL-based BEE to conduct an exhaustive greedy search for k in the task of estimating the true classification error by leveraging data points from a source domain dataset. In our hyperparameter tuning experiments, we consider source datasets of different sizes and retain the value of k that leads to the smallest MSE deviation from the true error across all the experiments. At each iteration, we randomly permute the source samples for statistical significance. The remaining parameters are set such that the mean of the Wishart precision matrices equals the identity matrix, which matches the normal standardization. For the mean vectors m_s^y and m_t^y, we pool all case and control samples in each domain and consider their means, respectively.
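The hyperparameter search described above can be sketched generically as a grid search that keeps the candidate with the smallest MSE deviation from the true error. The `estimate_error` callback standing in for the TL-based BEE is hypothetical.

```python
import numpy as np

# Generic sketch of the relatedness-hyperparameter search: evaluate a
# candidate grid for k and keep the value whose error estimates deviate
# least (in MSE) from the true error. `estimate_error(k)` is a stand-in
# for running the TL-based BEE over repeated random draws.

def grid_search_relatedness(candidates, true_error, estimate_error):
    best_k, best_mse = None, np.inf
    for k in candidates:
        est = np.asarray(estimate_error(k), dtype=float)
        mse = float(np.mean((est - true_error) ** 2))
        if mse < best_mse:
            best_k, best_mse = k, mse
    return best_k, best_mse
```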

Performance on synthetic datasets

We start by evaluating the performance of the proposed TL-based BEE in estimating the Bayes error, which corresponds to the true error of the quadratic discriminant analysis (QDA) classifier (see experimental procedures, section 4.5) in the target domain, for different levels of relatedness k and different size combinations of the utilized source and target datasets. In Figure 1, we investigate the behavior of the TL-based BEE when the target data are fixed while we vary the size of the source data. We show the results for d = 2 in the first column, d = 3 in the second column, and d = 5 in the last column. The rows correspond to the results for target datasets of two different sizes. The MSE curves show similar trends for all three values of d, where we can see that the deviation of the error estimate from the true error significantly decreases when highly related source data are employed. This behavior diminishes as the relatedness between the two domains decreases. Notably, using large source datasets of moderate to small relatedness does not negatively impact the performance of the estimator for low dimensions (d = 2, 3), as shown in the first and second columns of Figure 1. As the dimensionality further increases (d = 5), relying on large source datasets with moderate or poor relatedness to the target domain slightly increases the deviation of the estimated error from the true error (third column). This tiny asymptotic deviation is explained by potential undesirable effects of relying on large source datasets of modest relatedness. However, it is important to note that the proposed TL-based BEE in the context of the given Bayesian TL framework suppresses this behavior, as it does not directly depend on the source data; the information transfer occurs through the joint prior. The joint prior acts like a bridge through which the useful knowledge passes from the source to the target domain.
Effects of using source data in different TL settings (especially a non-Bayesian setting) may require further investigation. Moreover, the simulation results in the different columns show that the MSE deviation decreases as we rely on larger target datasets. However, the gain in performance from using additional source data is reduced when target data are more abundant. This is illustrated by the slope of the MSE graphs, which flattens as the target sample size increases. Finally, Figure 1 shows that, for higher dimensions, the MSE deviation tends to increase. This is expected, as increasing the dimensionality generally leads to a more difficult error estimation problem.
Figure 1

Effect of source data on the performance of the TL-based BEE for quadratic classifiers

MSE deviation from true error for Gaussian distributions with respect to source sample size. The Bayes error is fixed at 0.2 in all subfigures. For direct evaluation and higher dimensions, see Figures S2 and S3.

Next, Figure 2 shows the MSE deviation with respect to the size of the target dataset for dimensions 2, 3, and 5. The two rows correspond to source datasets of two different sizes. The performance of the TL-based BEE estimator improves with the increasing availability of target data. We can also clearly see that the MSE deviation from the true error asymptotically converges to comparable values for all relatedness levels. When highly related source data are available, the TL-based estimator yields accurate estimation results even when the target dataset is small. These results consolidate the findings in Figure 1 about the redundancy of source data in the presence of abundant target data. Across all graphs in Figure 2, we can see that a high relatedness coefficient results in a nearly constant deviation from the true error as a function of target data size, which suggests that highly related source data act almost identically to target data, regardless of the shift across the domains in terms of their means. Similar to the trends shown in Figure 1, results across the different columns of Figure 2 demonstrate that the difficulty of error estimation increases with dimensionality. This is clearly reflected in the MSE deviation from the true error in Figure 2, which shows that, as the dimension increases from d = 2 (first column) to d = 5 (last column), the MSE increases by one order of magnitude.
Figure 2

Effect of target data on the performance of the TL-based BEE for quadratic classifiers

MSE deviation from true error for Gaussian distributions with respect to target sample size. The Bayes error is fixed at 0.2 in all subfigures.

Now, we aim to investigate the effect of classification complexity on the performance of the proposed TL-based BEE. To this end, we conduct simulations in which we vary the Bayes error through a wide range of possible values and evaluate the TL-based BEE at each given Bayes error for different sizes of target data while using source datasets of a fixed size. In binary classification, the Bayes error has an upper bound specified by the true error of random classification, which is 0.5, as every data point can be randomly assigned one of the class labels. Ideally, we would vary the Bayes error across this interval as in Dalton and Dougherty. However, in our setup, we do not impose any structure on the covariance matrices, nor do we assume that they are scaled identities. This makes controlling the Bayes error much more difficult. In addition, the joint sampling setup within our Bayesian TL framework inhibits any modification of the randomized parameters. Consequently, the only practical way to adjust the Bayes error is to tune the hyperparameters that specify the means of the class mean vectors. In our experiments, we were able to fully control the Bayes error for d = 2. Achieving the same range of values for d = 3 and d = 5 was more challenging, and our implemented heuristic did not converge for high values of the Bayes error, as further adjusting the mean hyperparameters did not help in increasing it. However, we were able to vary the Bayes error for d = 3 and d = 5 within narrower ranges, sufficient for observing the trends. Figure 3 shows the MSE deviation with respect to the Bayes error for dimensions 2, 3, and 5.
Results in the first row are obtained using target datasets of size 20 and those in the second row using target datasets of size 50. We can see that the Bayesian MMSE estimator performs best when using source data of high relatedness to the target domain, as expected. For intermediate values of the Bayes error, the MSE deviation from the true error is highest, which makes this range of the Bayes error the most challenging setting for error estimation. For a Bayes error of 0.2, the MSE deviation is close to the average across all the experiments, which confirms the validity of our earlier choice of this value for investigating classification problems of moderate difficulty. We note that the TL-based BEE shifts the performance in favor of low and high Bayes error levels. Indeed, the TL-based BEE performs well in these cases because the estimated target parameters are sufficiently accurate, even with a small target sample.
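Because the covariances here are unstructured, the Bayes error has no simple closed form; a Monte Carlo estimate of it, as one might use when tuning the class-mean separation ϑ, can be sketched as follows. This is an illustrative sketch, not the authors' tuning heuristic.

```python
import numpy as np

# Monte Carlo estimate of the Bayes error for two equally likely Gaussian
# classes with arbitrary means and covariances: sample from each class and
# count points the Bayes rule (higher class density) assigns to the other.

def bayes_error_mc(mu0, S0, mu1, S1, n=200000, rng=0):
    rng = np.random.default_rng(rng)

    def log_density(X, mu, S):
        # Log of the Gaussian density up to a shared additive constant.
        d = X - mu
        Sinv = np.linalg.inv(S)
        _, logdet = np.linalg.slogdet(S)
        return -0.5 * (np.einsum('ij,jk,ik->i', d, Sinv, d) + logdet)

    X0 = rng.multivariate_normal(mu0, S0, n)
    X1 = rng.multivariate_normal(mu1, S1, n)
    err0 = np.mean(log_density(X0, mu1, S1) > log_density(X0, mu0, S0))
    err1 = np.mean(log_density(X1, mu0, S0) > log_density(X1, mu1, S1))
    return 0.5 * (err0 + err1)
```

Sweeping the mean separation and re-running this estimate is one way to locate a ϑ that yields a desired Bayes error level.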
Figure 3

Effect of the classification complexity on the performance of the TL-based BEE for quadratic classifiers

MSE deviation from QDA true error with respect to Bayes error. The source sample size is fixed in all subfigures.

In addition to investigating the effect of different relatedness levels between the source and target domains, in Figure 4 we examine the performance of the TL-based BEE for the case when the source class means are swapped between the two classes, such that they show opposite trends compared with the class means in the target domain. For this purpose, we reproduced the experiments of Figure 1 after flipping the class means of the source datasets with respect to the target classes (i.e., swapping the class-mean hyperparameters of the source domain for y ∈ {0, 1}). In the first row of Figure 4, we use the generated source datasets as observed samples from the source domain. Interestingly, the obtained results match those observed in Figure 1. This suggests that the knowledge transfer between the source and target domains in the context of the studied Bayesian TL framework does not depend on the arrangement of the class means in the source and target domains but rests only on the level of relatedness between the two domains. For verification, we intentionally used the same source datasets from the previous experiment as target datasets for estimating the TL-based BEE and plotted the obtained results in the second row of Figure 4. Clearly, the TL-based BEE veers away from the true error as we consider additional source data points. This deviation is worse with poorly related source data. These results confirm previous findings in Karbalayghareh et al. that the joint prior model in the utilized Bayesian TL framework acts like a bridge that distills the useful knowledge from the source domain and effectively transfers it to the target domain.
Figure 4

Effect of the arrangement of the class means in the source and target domains on the performance of the TL-based BEE

MSE deviation from true error with respect to source sample size. The source class means are flipped with respect to the target classes (the class-mean hyperparameters of the source domain are swapped for y ∈ {0, 1}). In the first row, the source datasets are correctly treated as source samples. In the second row, the source datasets are intentionally treated as target samples. The Bayes error is fixed at 0.2.

Results from the second set of experiments, which use a linear discriminant analysis (LDA) classifier (see experimental procedures, section 4.5), were similar to those obtained using the QDA classifier, except for some differences in the performance of the TL-based BEE with respect to the Bayes error, which we report in Figure 5 (see supplemental information, section 8, for additional results). The TL-based BEE shows similar trends for small and moderate Bayes errors when compared with the presented results obtained using the QDA classifier. A notable difference is observed for large values of the Bayes error, where the TL-based BEE shows decreased performance in terms of MSE deviation from the true error, owing to the fact that the employed LDA classifier is sub-optimal compared with the Bayes classifier. This is expected, as linear decision boundaries tend to be more sensitive to deviations from the true model parameters for highly overlapping class-conditional distributions. In our final set of experiments using synthetic datasets, we compare the performance of the proposed TL-based BEE with standard error estimators for different dimensions and various source datasets of varying relatedness to the target domain for an optimal Bayesian transfer learning (OBTL) classifier (see experimental procedures, section 4.5). In Figure 6, we show the MSE deviation with respect to different target dataset sizes.
As clearly shown, our proposed TL-based BEE outperforms all the standard error estimators by a substantial margin. In agreement with previous findings in the literature, the standard error estimators perform comparably in low dimensions, where the bootstrap may show a slight advantage. As the dimensionality increases, the performance gap between the studied estimators becomes more apparent. For example, the resubstitution estimator performs poorly in the small-sample regime, while the bootstrap estimator outperforms leave-one-out cross-validation (LOO) and CV. Furthermore, we noticed that increasing the size of the source dataset does not lead to any apparent performance improvement for the standard estimators. This is because these estimators do not depend on the source data for error estimation, as they are incapable of taking advantage of data from different yet relevant domains. In contrast, providing additional source data to the TL-based BEE considerably reduces the MSE deviation from the true error for all dimensions, as shown in Figure 6.
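To make the comparison concrete, the standard estimators referenced above (resubstitution, LOO, and the 0.632 bootstrap) can be sketched for a toy one-dimensional nearest-centroid classifier. This is an illustrative sketch, not the paper's implementation; all function names and the toy classifier are ours:

```python
import random

def train_centroid(X, y):
    """Nearest-centroid classifier: predict the class whose mean is closer."""
    n0 = max(1, sum(1 for c in y if c == 0))
    n1 = max(1, sum(1 for c in y if c == 1))
    m0 = sum(x for x, c in zip(X, y) if c == 0) / n0
    m1 = sum(x for x, c in zip(X, y) if c == 1) / n1
    return lambda x: 0 if abs(x - m0) <= abs(x - m1) else 1

def error(clf, X, y):
    """Fraction of misclassified points."""
    return sum(clf(x) != c for x, c in zip(X, y)) / len(X)

def resubstitution(X, y):
    """Train and test on the same sample (optimistically biased)."""
    return error(train_centroid(X, y), X, y)

def loo(X, y):
    """Leave-one-out cross-validation: hold out each point in turn."""
    errs = 0
    for i in range(len(X)):
        Xi, yi = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        errs += train_centroid(Xi, yi)(X[i]) != y[i]
    return errs / len(X)

def bootstrap_632(X, y, B=200, seed=0):
    """0.632 bootstrap: blend resubstitution with out-of-bag error."""
    rng = random.Random(seed)
    n, oob_errs = len(X), []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        chosen = set(idx)
        oob = [i for i in range(n) if i not in chosen]
        if not oob:
            continue
        clf = train_centroid([X[i] for i in idx], [y[i] for i in idx])
        oob_errs.append(sum(clf(X[i]) != y[i] for i in oob) / len(oob))
    e0 = sum(oob_errs) / len(oob_errs)
    return 0.368 * resubstitution(X, y) + 0.632 * e0
```

Resubstitution reuses the training data and is optimistically biased, while LOO and the bootstrap refit on resampled training sets, which is what drives their variance in small samples.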
Figure 5

Effect of the classification complexity on the performance of the TL-based BEE for linear classifiers

MSE deviation from LDA true error for Gaussian distributions with respect to Bayes error. Source sample size is set to in all subfigures. See also Figures S4 and S5.

Figure 6

Comparative analysis of the performance of the TL-based BEE with respect to standard error estimators

MSE deviation from true error with respect to target data size. The proposed TL-based BEE is compared with other widely used estimators. In all subfigures, the Bayes error is fixed at 0.2, and .


Performance on real-world RNA-seq datasets

To analyze the performance of the TL-based BEE on real-world data, we have trained a QDA classifier on a small target dataset that consists of five sample points per class extracted from syn2759792 in Table 1. Using different source datasets collected from syn4590909, we show in Figure 7A the MSE deviation of the TL-based BEE from the true error with respect to .
Figure 7

Performance of the TL-based BEE on real-world RNA-seq datasets

MSE deviation from QDA true error for normally distributed brain gene expression data with respect to the assumed relatedness level and the source sample size. (A) Gene features from the FC brain region demonstrate high relatedness with those from the DLPFC area. (B) Utilizing the data from the source domain significantly reduces the MSE of the TL-based BEE in the target domain.

For all combinations and different sizes of source datasets, the FC brain region showed high relatedness to the DLPFC brain area, where the optimal MSE deviation from the true error was obtained. Interestingly, findings in Gandal et al. also confirm that syn4590909 and syn2759792 are highly related, as independent gene expression assays for both brain regions have consistently replicated the gradient of transcriptomic severity observed for three different types of psychiatric disorders, including bipolar disorder and schizophrenia. We note that the significant decrease in the MSE deviation from the true error in Figure 7A corresponds to the boost in performance caused by increasing the relatedness level from 0.01 to larger values. This can be explained by the high relatedness between the two studied domains: assuming very poor relatedness between the domains deviates from the ground truth of high relatedness and results in a very large MSE. We show in Figure 7B the increasing gain in accuracy of the TL-based BEE in estimating the classification error after using additional labeled observations from the source domain. These results again confirm the efficacy and advantages of our TL-based error estimation scheme, compared with other standard error estimation methods, when additional data are available from different source domains that are nevertheless relevant to the target domain. From a practical perspective, our proposed TL-based BEE has the potential to facilitate the analysis of real-world datasets in the context of small-sample classification.
Challenges of designing and evaluating classifiers (e.g., for clinical diagnosis or prognosis) in a small-sample setting are prevalent in scientific studies in life sciences and physical sciences due to the formidable cost, time, and effort required for data acquisition. This is certainly the case for the example that we consider in this section, where invasive brain biopsies would be needed to get the data.

Insights gained

In this section, we summarize the insights gained from our analyses, which demonstrate the potential advantages of applying TL to the estimation of classification errors. Our results have shown that incorporating data and knowledge from relevant source domains can significantly enhance the accuracy of classification error estimation. When an appropriate source domain is identified, the efficiency of the knowledge transfer process depends, in our problem setups, on the correlation of the features across domains rather than on the class-conditional mean values of the features. From an error estimation perspective, our investigation has revealed that, unlike classifier design, the most challenging setting for error estimation arises in classification problems of moderate complexity in terms of Bayes error. When source datasets that are at least modestly relevant to the target domain of interest are available, knowledge transfer to the target domain through appropriate modeling of the joint prior can enhance both the accuracy and the reliability of the error estimation. This was validated in our current study, where the joint prior acts like a “channel” as well as a “filter,” through which useful relevant knowledge is passed from the source domain to the target domain. Our results have shown that using at least 200 data points from a relevant source domain, whose relatedness level is above 0.7, enables accurate error estimation even with small target data (fewer than 50 sample points). Using real-world biological data (RNA-seq data), we have shown that the relatedness level can be empirically determined by exploring the range of possible values.

Limitations of the study

This section discusses the limitations of our current work in modeling assumptions, computational cost, and scalability to higher dimensions. Despite the precise mathematical definition of our error estimator, accurate estimation of the classification error is contingent on whether predictive posterior densities are available in closed forms or can be approximated in an effective manner. While such densities are available for Gaussian models (e.g., assuming joint Wishart priors), one may need to derive them for different priors for non-Gaussian distributions. The computational complexity to accurately estimate the proposed TL-based BEE through direct sampling methods can be excessive and may scale poorly for higher dimensions. However, we efficiently overcame this limitation by developing a robust importance sampling setup that has shifted all the computational overhead related to the TL process from Monte Carlo sampling to the numerical evaluation of the importance likelihood. Developing similar statistical methods for TL-based BEE would be needed for different modeling assumptions. While the definition of the TL-based BEE and the proposed robust importance sampling scheme are general and applicable to higher dimensions, controlling the Bayes error for synthetic datasets for dimensions higher than 5 can be challenging, which was the main reason for choosing the dimensions in this study. However, this is not an issue in practice, as the classification complexity in real-world applications (reflected by the Bayes error) is an inherent property of a given classification problem governed by the underlying feature-label distribution, and not a design choice. Technically, the proposed TL-based BEE can be applied to classification problems based on high-dimensional features as long as the required computational resources are available. 
Furthermore, we can also consider classifier design and error estimation based on a lower-dimensional representation of the original feature space—e.g., using principal-component analysis or auto-encoders—to make the computational cost manageable.

Conclusions

In this study, we have introduced a Bayesian MMSE estimator that draws from concepts and theories in TL to enable accurate estimation of classification error in the (target) domain of interest by utilizing samples from other closely related (source) domains. We have developed an efficient and robust importance sampling setup that can be used for accurate error estimation in small-sample scenarios that often arise in many real-world scientific problems. Extensive performance analysis based on both synthetic and real-world biological data demonstrates that the proposed TL-based BEE clearly outperforms conventional estimators. In our proposed framework, Laplace approximations were used to alleviate the complexity associated with the exact evaluation of the generalized hypergeometric functions that appear in the posterior distribution of the target parameters. Beyond the Gaussian model assumed in the validation experiments, we also provide a general mathematical definition for the TL-based BEE that can be directly extended to applications with non-Gaussian distributions, where the model parameters can be inferred through Markov chain Monte Carlo (MCMC) methods. In this study, target and source domains were related through the joint prior of the model parameters, which transfers useful knowledge across domains. A key property of the proposed TL-based BEE is its ability to handle the uncertainty about the model parameters by integrating this prior with data, deducing robust estimates by accounting for all possible parameter values. A paramount practical challenge for the TL-based BEE is the identification of suitable source domains that share similar families of distributions with the target domain of interest. This is crucial, as the relatedness across domains is mathematically modeled assuming the similarity of the feature-label distributions across domains.
Furthermore, learning the joint prior for the distributions and modeling the relatedness between different domains may also present an engineering challenge. While techniques for knowledge-driven prior construction have been developed, such techniques have yet to be extended to joint prior construction for relevant domains, which is an important future research direction. Another important capability enabled by the proposed TL-based BEE is optimal data acquisition from multiple domains, which aims at maximally enhancing the error estimation capability based on a finite budget for data acquisition. For example, if one has a fixed budget to acquire additional data from either the source or target domain, what would be the most cost-effective strategy for data acquisition? In typical TL scenarios, acquiring data in the source domain may be relatively cheaper than in the target domain, although the data acquired in the target domain might be more impactful. A natural question is how one can maximize the “return on investment” for data acquisition given the available budget. Such strategies for optimal experimental design [19-24] and active learning [25-27] have been actively studied in a Bayesian paradigm that enables objective-based uncertainty quantification via the mean objective cost of uncertainty. While this is beyond the scope of the current study, it opens up interesting directions for future research.

Experimental procedures

Resource availability

Lead contact

Dr. Byung-Jun Yoon is the lead contact for this study and can be reached at bjyoon@ece.tamu.edu.

Materials availability

This study did not generate any physical materials.

Data and code availability

All RNA-seq datasets utilized in this study are publicly available. All original code has been deposited at https://github.com/omarmaddouri/TL_BEE, archived on Zenodo under https://doi.org/10.5281/zenodo.5594476, and is publicly available as of the date of publication. In addition to the proposed importance sampling estimate, we also provide an implementation of the direct evaluation using the predictive posterior density of the target parameters.

Bayesian TL for error estimation

The advantage of the mathematical formulation that underlies the proposed TL-based BEE (and also the original Bayesian TL framework in Karbalayghareh et al.) is that it articulates a unified Bayesian inference model in which a specified joint prior distribution governs the parameter vector and acts like a bridge that helps update the target posterior after observing the target and source data. From this standpoint, the derivation of the TL-based BEE depends on determining this posterior. To determine the TL-based BEE in the context of the presented Bayesian transfer learning framework, we invoke the following theorem. Theorem 1: given the target and source data, the posterior distribution of the target mean and the target precision matrix for the two classes has a Gaussian-hypergeometric-function distribution, whose constant of proportionality and updated hyperparameters are expressed in terms of the class-conditional sample means and covariances and of the confluent and Gauss matrix-variate hypergeometric functions reviewed in the supplemental information, section 6 (display equations omitted). Now, using Theorem 1 and assuming that the class-0 prior probability c and the class-conditional parameters are independent prior to observing the data, the BEE for TL is given by the expectation of the true classification error under this posterior, taken over the parameter space that contains all possible parameter values (display equations omitted).

Computing TL-based BEE for arbitrary classifiers

Computing the TL-based BEE for an arbitrary classifier involves the evaluation of the integral in (Equation 19). Even when we have an analytic expression for the true error of the studied classifier, a closed-form expression for the TL-based BEE cannot be easily derived due to the complex expression of the target posterior in the presence of the matrix-variate hypergeometric functions. With non-linear classifiers, this becomes practically impossible, as no closed-form expression exists for the true error itself. The standard way to approximate the true error in this case is to consider the test error: for a specified parameter vector, a large test set is generated from the corresponding feature-label distribution, and the performance of the classifier is evaluated on that test set. This requires sampling from the target posterior so that the integral in (Equation 19) can be approximated by a finite sum over N posterior sample points, averaging the classifier's true error across these draws. Because of the generalized confluent and Gauss hypergeometric functions in the expression of the posterior, sampling directly from it is very laborious, and the computational cost of applying MCMC methods is exorbitant, as the execution may take several weeks even on high-performance computing clusters. To address this issue, in the next section we propose an efficient self-normalized importance sampling setup with control variates that provides accurate estimates for the TL-based BEE and significantly reduces the computation time, making the proposed TL-based BEE feasible.
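The finite-sum approximation described above can be sketched in a much simpler one-dimensional setting, where the posterior over an unknown class mean is conjugate-normal (known unit variance) and the true error of a fixed threshold classifier is available analytically. This is a hedged sketch under assumed priors; none of the function names or hyperparameters come from the paper:

```python
import math
import random

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def true_error(t, mu0, mu1):
    """True error of the fixed classifier x -> 1{x > t} for unit-variance
    Gaussian classes N(mu0, 1) and N(mu1, 1) with equal priors."""
    return 0.5 * (1.0 - Phi(t - mu0)) + 0.5 * Phi(t - mu1)

def posterior_mean_draws(data, prior_mu, prior_var, rng, n_draws):
    """Conjugate normal posterior for an unknown mean with known unit variance."""
    n, xbar = len(data), sum(data) / len(data)
    post_var = 1.0 / (1.0 / prior_var + n)
    post_mu = post_var * (prior_mu / prior_var + n * xbar)
    return [rng.gauss(post_mu, math.sqrt(post_var)) for _ in range(n_draws)]

def bee_mc(data0, data1, t=0.5, n_draws=2000, seed=1):
    """Monte Carlo BEE: average the true error over posterior parameter draws."""
    rng = random.Random(seed)
    mu0s = posterior_mean_draws(data0, 0.0, 10.0, rng, n_draws)
    mu1s = posterior_mean_draws(data1, 1.0, 10.0, rng, n_draws)
    return sum(true_error(t, a, b) for a, b in zip(mu0s, mu1s)) / n_draws
```

In the paper's setting the per-draw error would itself be approximated on a large generated test set; here a closed form stands in for that step.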

Self-normalized importance sampling with control variates

Importance sampling

Importance sampling (IS) is a variance-reduction technique that provides a remedy to sampling from complex distributions. To estimate the expectation of interest, IS makes a multiplicative adjustment to the integrand to compensate for sampling from an alternative importance distribution instead of the nominal posterior. If the importance density is a positive probability density function over the parameter space, the expectation can be rewritten as an average of samples from the importance density weighted by the likelihood ratio between the two densities. Achieving an accurate IS estimate is contingent on selecting an appropriate importance density that is nearly proportional to the integrand. By analogy to Gordon and co-workers, a plausible and cogent candidate for the importance density emerges as the posterior of the target parameters upon observation of target-only data; both distributions track the same model parameters in the target domain upon observation of data. To determine this density, we require the following lemma. Lemma 1: if the observations are multivariate Gaussian and the mean and precision matrix have a Gaussian-Wishart prior, then their posterior upon observing the target data is also a Gaussian-Wishart distribution, with updated hyperparameters that depend on the sample mean and covariance matrix (display equations omitted). Using Lemma 1, we obtain the expression of the importance density, and after simplifications the TL-based BEE in (Equation 21) takes the form of an importance-weighted average in which the likelihood ratio involves the hypergeometric functions. Although the likelihood ratio has a simplified expression, computing the hypergeometric functions involves the computation of series of zonal polynomials, which is computationally expensive and not scalable to high dimensions. To mitigate this limitation, we use the Laplace approximations of these functions (see Figure S1 and supplemental information, section 6). To rectify possible disproportionalities in the likelihood ratios due to the approximations, we consider the self-normalized IS estimate, in which the importance weights are normalized to sum to one.
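A minimal sketch of self-normalized importance sampling, the core mechanism described above: samples are drawn from a tractable proposal, each is weighted by the ratio of the (possibly unnormalized) target density to the proposal density, and the weights are normalized to sum to one. This is a generic one-dimensional illustration, not the paper's matrix-variate setup:

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def snis(f, target_pdf, proposal_mu, proposal_sigma, n, seed=0):
    """Self-normalized IS estimate of E_target[f(X)].

    target_pdf may be unnormalized: the normalization constant cancels
    when the weights are divided by their sum.
    """
    rng = random.Random(seed)
    xs = [rng.gauss(proposal_mu, proposal_sigma) for _ in range(n)]
    ws = [target_pdf(x) / normal_pdf(x, proposal_mu, proposal_sigma) for x in xs]
    wsum = sum(ws)
    return sum(w * f(x) for w, x in zip(ws, xs)) / wsum
```

The self-normalization step is exactly what makes the estimator robust to the scale distortions that approximate likelihood ratios introduce, at the cost of a small bias that vanishes as n grows.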

Control variates

For more stable and efficient estimates, we further combine IS with control variates. Using control variates in conjunction with IS is a variance-reduction technique that is particularly effective when a significant portion of the expectation being estimated can be evaluated explicitly. In our case, a useful control-variates function (CVF) is one whose expectation under the importance density equals a known constant δ. Under such circumstances, a more stable estimate for the TL-based BEE subtracts from the IS estimate a weighted deviation of the CVF from δ, where the weighting coefficient β is tuned to reduce the variance of the estimate; the optimal β equals the covariance between the estimated error and the CVF divided by the variance of the CVF (see supplemental information, section 7.3, for more details). In practice, the optimal β is not known beforehand, but it is estimated from the Monte Carlo sample. It turns out that the control-variate estimate has lower variance than the plain IS estimate by a factor of 1 − ρ², where ρ denotes the correlation coefficient between the estimated error and the CVF. To select an appropriate CVF, we need to consider two criteria: first, its expectation with respect to the importance density should have an exact evaluation; second, it has to be correlated with the estimated error. A favorable candidate is the analytic true error of a linear classifier. In this study, we take as CVF the true error of an LDA classifier built from the empirical class means and the pooled covariance utilized in (Equation 26), expressed through the standard normal Gaussian cumulative distribution function (display equations omitted). It then remains only to determine the expectation of the CVF in closed form to fully define the estimation setup. After simplifications, this expectation can be written in terms of the sign function and the regularized incomplete beta function, defined through the regular univariate gamma function; details are covered in supplemental information, section 7.4. The complete specification of the CVF concludes our IS setup. We enumerate some advantages of the proposed setup over direct sampling methods.
First, the importance density is much simpler than the nominal density, which involves matrix-variate hypergeometric functions. Second, our setup successfully combines two variance-reduction techniques that together enable accurate estimation. Last, and most importantly, the independence of the generated Monte Carlo samples from the source data permits the reuse of the sampled parameters across various source datasets for fixed models. This reusability significantly reduces the computational cost of sampling from the importance density and makes the utilization of advanced MCMC methods practical, as the whole process can be accelerated by a factor of 10-20, a factor that grows with the dimensionality and the number of source datasets used (see supplemental information, sections 7.5 and 7.6, for more details). For efficient sampling from the importance density, we use Hamiltonian Monte Carlo (HMC), which has been shown to outperform standard MCMC samplers. For this purpose, we utilize the Stan software, which offers a full Bayesian statistical inference framework with HMC.
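The control-variate correction described above can be sketched generically: given Monte Carlo draws, an auxiliary function h with known mean δ is used to cancel part of the variance of the target quantity f, with the coefficient β fit from the same sample. This is an illustrative sketch in which f, h, and δ are placeholders, not quantities from the paper:

```python
def cv_estimate(f, h, delta, xs):
    """Control-variate estimator of E[f(X)] from draws xs.

    h is the control variate with known exact mean delta = E[h(X)];
    beta = cov(f, h) / var(h) minimizes the variance of the corrected
    estimate, shrinking it by a factor of (1 - rho^2) relative to the
    plain sample mean, where rho = corr(f, h).
    """
    fs = [f(x) for x in xs]
    hs = [h(x) for x in xs]
    n = len(xs)
    fbar = sum(fs) / n
    hbar = sum(hs) / n
    cov = sum((a - fbar) * (b - hbar) for a, b in zip(fs, hs)) / n
    var = sum((b - hbar) ** 2 for b in hs) / n
    beta = cov / var
    # Subtract the correlated, zero-mean correction term.
    return fbar - beta * (hbar - delta)
```

In the paper's setup, the same idea is applied to the self-normalized IS weights, with the CVF being the analytic true error of an LDA classifier whose expectation under the importance density is known in closed form.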

Classifier design

For a comprehensive evaluation of our TL-based error estimator, we design and perform a set of experiments in which the proposed estimator is applied to a collection of classifiers with different levels of learning capacity and tested under various scenarios. To separate error estimation from classifier design, we start by analyzing the performance of the TL-based BEE for fixed classifiers that do not depend on training data. This setup distinctly reveals the major characteristics of the TL-based BEE, excluding any confounding factors that may stem from classifier design and the performance of the resulting classifier. Next, we also conduct a comparative study of the TL-based BEE performance with respect to other widely used error estimators, including the resubstitution, CV, LOO, and 0.632-bootstrap estimators. As these popular data-driven estimators involve classifier design on the training data, we also consider a TL-based classifier designed on target and source data that operates in the target domain for comparison. For this, we employ the OBTL classifier introduced in Karbalayghareh et al., which shares the same Bayesian framework on which our TL-based BEE is developed. In what follows, we recall the definition of each classifier considered in our evaluations and present the details of the evaluation experiments performed in this study. In the first set of experiments, we employ a fixed quadratic classifier, assuming we know the true target parameters beforehand. For normally distributed data, this quadratic classifier also corresponds to the Bayes classifier that is optimal for the given feature-label distributions. Using QDA, we define the corresponding quadratic discriminant (display equation omitted). The error estimation problem then turns out to be an estimation of the Bayes error, which here coincides with the true error of the designed QDA classifier. Obviously, this classifier is independent of any observed sample, as it is fixed assuming known true model parameters.
Without loss of generality, we apply the TL-based BEE using labeled observations from a compound dataset compiled from the target and source domains. In the second set of experiments, we investigate the behavior of the TL-based BEE within the class of sub-optimal classifiers. To this end, we consider a linear classifier derived through LDA, defined in terms of the class means and the average covariance (display equations omitted). Our goal is then to approximate the true error of this sub-optimal classifier using TL. Next, we evaluate the performance of the TL-based BEE for the OBTL classifier, which can take advantage of both source- and target-domain data. The OBTL classifier assigns each point to the class that maximizes the effective class-conditional density in the target domain, given by the following theorem. Theorem 2: the effective class-conditional density in the target domain is given in closed form (display equations omitted).
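For intuition, the fixed QDA and LDA discriminants discussed above can be sketched in one dimension: QDA compares class-conditional Gaussian log-likelihoods with class-specific variances (a quadratic decision boundary), while LDA shares a pooled variance (a linear boundary). This is an illustrative sketch with our own function names, not the paper's multivariate definitions:

```python
import math

def gaussian_loglik(x, mu, var):
    """Log density of N(mu, var) at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)

def qda_predict(x, mu0, var0, mu1, var1, c0=0.5):
    """Quadratic discriminant: class-specific variances make the
    boundary quadratic in x."""
    g0 = gaussian_loglik(x, mu0, var0) + math.log(c0)
    g1 = gaussian_loglik(x, mu1, var1) + math.log(1.0 - c0)
    return 0 if g0 >= g1 else 1

def lda_predict(x, mu0, mu1, pooled_var, c0=0.5):
    """Linear discriminant: a shared (pooled) variance cancels the
    quadratic term, leaving a boundary linear in x."""
    g0 = gaussian_loglik(x, mu0, pooled_var) + math.log(c0)
    g1 = gaussian_loglik(x, mu1, pooled_var) + math.log(1.0 - c0)
    return 0 if g0 >= g1 else 1
```

With equal class variances and equal priors, both rules reduce to the midpoint threshold between the two means, which is why QDA coincides with the Bayes classifier for Gaussian class-conditional distributions with the true parameters plugged in.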

Simulation setup

Figure 8 provides a combined illustration of the simulation setup for all three classifiers. For rigorous evaluation of the performance of the proposed TL-based BEE, we primarily focus our experiments on assessing the impact of using different types and amounts of source data. This is enabled by the joint prior imposed over the model parameters and controlled by the relatedness coefficient that dictates the extent of interaction between the features in the two domains. For this purpose, we repeatedly conducted experiments following the flow chart in Figure 8 with different relatedness values, where the smallest value corresponds to the lowest relatedness between the two domains and the largest reflects the highest relatedness within the range of studied values.
Figure 8

Simulation diagram using synthetic data

Flow chart illustrating the simulation setup based on synthetic datasets.

In the first set of experiments, we start by drawing a joint sample for each class, as described previously. Next, we iterate over the values of the hyperparameter ϑ through a dichotomic (bisection) search to reach a desired value τ of the Bayes error. This is achieved by drawing a parameter sample, generating a test set from the corresponding joint sample, and using this test set to determine the true error of the optimal QDA classifier. If the desired Bayes error (the true error of the designed QDA) is attained, the iteration stops; otherwise, we update ϑ and reiterate. In our experiments, unless otherwise specified, we set the Bayes error to mimic a moderate level of classification complexity. This step is indeed crucial, as it maintains the same level of complexity across the experiments and guarantees a fair comparison across different levels of relatedness. We note that this procedure is valid for general covariances, as it only updates the value of the mean parameter without altering the structure of the covariances or the random mean vectors. Obviously, this approach to specifying the Bayes error keeps the Bayesian TL framework intact. However, it is not guaranteed to find values of the target parameters that correspond to the desired Bayes error, especially for high dimensions and complex classification problems (large Bayes error), as we discuss in Performance on synthetic datasets. Once the problem complexity is set and the classifier is fixed, we generate training datasets that we use to evaluate the MSE of the TL-based BEE, as depicted in Figure 8. To estimate the TL-based BEE, we employ the IS setup described previously and draw 1,000 MC samples from the importance density using the HMC sampler. In the second set of experiments, we follow a similar setup using an LDA classifier designed based on the true model parameters.
As before, we employ QDA to determine the Bayes error to maintain the same complexity level across different experiments, and we use the TL-based BEE to estimate the true error of the designed LDA classifier. In the last set of experiments on synthetic datasets, we conduct a comparative analysis using an OBTL classifier designed on training datasets generated from the model parameters specified by the Bayes error. The error estimation task in this scenario aims at approximating the true error of the designed OBTL classifier, determined using a large test set generated from the true feature-label distributions. As illustrated in Figure 8, the QDA and LDA classifiers are fixed and derived from the true model parameters, while the OBTL classifier is designed based on training datasets collected from the underlying feature-label distributions that correspond to the specified Bayes error. In all simulations, the designed classifiers are fixed given the observed samples, so the TL-based BEE estimator is safely applied. Finally, regarding the synthetic datasets, we note that the flow chart in Figure 8 is valid for all classifiers (QDA, LDA, and OBTL), and the classifier notation designates the classifier of interest in the corresponding set of experiments; for instance, in the second set of experiments, it refers to the designed LDA classifier. In addition to this in-depth analysis of the performance, behavior, and characteristics of our proposed TL-based BEE based on synthetic datasets, we also performed additional validation based on real-world biological datasets. Using RNA-seq datasets syn2759792 and syn4590909, taken from different brain regions for studying brain disorders, we train a QDA classifier using the target data from syn2759792 and leverage the source data from syn4590909 to evaluate the performance of the proposed TL-based BEE.
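The dichotomic search used above to pin the Bayes error at a target value τ can be sketched for the simplest possible case: two unit-variance Gaussian classes whose mean separation ϑ is tuned by bisection, exploiting the fact that the Bayes error decreases monotonically as the means move apart. This is an illustrative sketch; the paper's procedure operates on the full multivariate model with a test-set error in place of the closed form:

```python
import math

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bayes_error(theta):
    """Bayes error for N(-theta/2, 1) vs N(+theta/2, 1) with equal priors:
    the optimal threshold sits at 0, so the error is Phi(-theta/2)."""
    return Phi(-theta / 2.0)

def calibrate_separation(target_err, lo=0.0, hi=10.0, tol=1e-6):
    """Bisection on theta: bayes_error is monotonically decreasing,
    so we shrink [lo, hi] until the bracket is tighter than tol."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bayes_error(mid) > target_err:
            lo = mid  # error still too large: push the means further apart
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For τ = 0.2 this recovers ϑ = 2 Φ⁻¹(0.8) ≈ 1.68, and the same monotonicity argument is what makes the paper's mean-only update valid for general covariance structures.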

1.  Is cross-validation valid for small-sample microarray classification?

Authors:  Ulisses M Braga-Neto; Edward R Dougherty
Journal:  Bioinformatics       Date:  2004-02-12       Impact factor: 6.937

2.  Optimal Experimental Design for Gene Regulatory Networks in the Presence of Uncertainty.

Authors:  Roozbeh Dehghannasiri; Byung-Jun Yoon; Edward R Dougherty
Journal:  IEEE/ACM Trans Comput Biol Bioinform       Date:  2015 Jul-Aug       Impact factor: 3.710

3.  Constructing Pathway-Based Priors within a Gaussian Mixture Model for Bayesian Regression and Classification.

Authors:  Shahin Boluki; Mohammad Shahrokh Esfahani; Xiaoning Qian; Edward R Dougherty
Journal:  IEEE/ACM Trans Comput Biol Bioinform       Date:  2017-11-30       Impact factor: 3.710

4.  Controversies regarding and perspectives on clinical utility of biomarkers in hepatocellular carcinoma.

Authors:  Pei-Pei Song; Ju-Feng Xia; Yoshinori Inagaki; Kiyoshi Hasegawa; Yoshihiro Sakamoto; Norihiro Kokudo; Wei Tang
Journal:  World J Gastroenterol       Date:  2016-01-07       Impact factor: 5.742

5.  Shared molecular neuropathology across major psychiatric disorders parallels polygenic overlap.

Authors:  Michael J Gandal; Jillian R Haney; Neelroop N Parikshak; Virpi Leppa; Gokul Ramaswami; Chris Hartl; Andrew J Schork; Vivek Appadurai; Alfonso Buil; Thomas M Werge; Chunyu Liu; Kevin P White; Steve Horvath; Daniel H Geschwind
Journal:  Science       Date:  2018-02-09       Impact factor: 47.728

6.  Cancer biomarkers: can we turn recent failures into success?

Authors:  Eleftherios P Diamandis
Journal:  J Natl Cancer Inst       Date:  2010-08-12       Impact factor: 13.506

7.  Gene expression elucidates functional impact of polygenic risk for schizophrenia.

Authors:  Menachem Fromer; Panos Roussos; Solveig K Sieberts; Jessica S Johnson; David H Kavanagh; Thanneer M Perumal; Douglas M Ruderfer; Edwin C Oh; Aaron Topol; Hardik R Shah; Lambertus L Klei; Robin Kramer; Dalila Pinto; Zeynep H Gümüş; A Ercument Cicek; Kristen K Dang; Andrew Browne; Cong Lu; Lu Xie; Ben Readhead; Eli A Stahl; Jianqiu Xiao; Mahsa Parvizi; Tymor Hamamsy; John F Fullard; Ying-Chih Wang; Milind C Mahajan; Jonathan M J Derry; Joel T Dudley; Scott E Hemby; Benjamin A Logsdon; Konrad Talbot; Towfique Raj; David A Bennett; Philip L De Jager; Jun Zhu; Bin Zhang; Patrick F Sullivan; Andrew Chess; Shaun M Purcell; Leslie A Shinobu; Lara M Mangravite; Hiroyoshi Toyoshiba; Raquel E Gur; Chang-Gyu Hahn; David A Lewis; Vahram Haroutunian; Mette A Peters; Barbara K Lipska; Joseph D Buxbaum; Eric E Schadt; Keisuke Hirai; Kathryn Roeder; Kristen J Brennand; Nicholas Katsanis; Enrico Domenici; Bernie Devlin; Pamela Sklar
Journal:  Nat Neurosci       Date:  2016-09-26       Impact factor: 24.884

8.  Efficient experimental design for uncertainty reduction in gene regulatory networks.

Authors:  Roozbeh Dehghannasiri; Byung-Jun Yoon; Edward R Dougherty
Journal:  BMC Bioinformatics       Date:  2015-09-25       Impact factor: 3.169

9.  Incorporating biological prior knowledge for Bayesian learning via maximal knowledge-driven information priors.

Authors:  Shahin Boluki; Mohammad Shahrokh Esfahani; Xiaoning Qian; Edward R Dougherty
Journal:  BMC Bioinformatics       Date:  2017-12-28       Impact factor: 3.169

