
Synthetic data for design and evaluation of binary classifiers in the context of Bayesian transfer learning.

Omar Maddouri1, Xiaoning Qian1,2, Francis J Alexander2, Edward R Dougherty1, Byung-Jun Yoon1,2.   

Abstract

Transfer learning (TL) techniques can enable effective learning in data-scarce domains by allowing one to re-purpose data or scientific knowledge available in relevant source domains for predictive tasks in a target domain of interest. In this Data in Brief article, we present a synthetic dataset for binary classification in the context of Bayesian transfer learning, which can be used for the design and evaluation of TL-based classifiers. For this purpose, we consider numerous combinations of classification settings, based on which we simulate a diverse set of feature-label distributions with varying learning complexity. For each set of model parameters, we provide a pair of target and source datasets that have been jointly sampled from the underlying feature-label distributions in the target and source domains, respectively. In both domains, the data in a given class are normally distributed, and the distributions across domains are related to each other through a joint prior. To ensure the consistency of classification complexity across the provided datasets, we have controlled the Bayes error so that it stays within a range of predefined values that mimic realistic classification scenarios across different relatedness levels. The provided datasets may serve as useful resources for designing and benchmarking transfer learning schemes for binary classification, as well as for the estimation of classification error.
© 2022.


Keywords:  Bayesian transfer learning; Binary classification; Classifier design; Error estimation

Year:  2022        PMID: 35434232      PMCID: PMC9011006          DOI: 10.1016/j.dib.2022.108113

Source DB:  PubMed          Journal:  Data Brief        ISSN: 2352-3409


Specifications Table

Value of the Data

The data here provide useful resources for studying binary classification and error estimation problems from a transfer learning perspective. The relatedness across domains has been mathematically modeled, as in [2], through a joint Wishart distribution over the model parameters. This enables rigorous quantification of the relevance between the source and target domains.

The selective sampling of the model parameters in the source and target domains based on the classification complexity (Bayes error) makes the comparison of evaluation results across different dimensions and relatedness levels possible, as it preserves the simulation conditions across different experiments. Without these stringent conditions, drawing statistically meaningful conclusions from empirical analysis would be practically difficult.

The provided data are of practical value to any data-driven machine learning approach that employs transfer learning to solve binary classification problems. More specifically, the dataset can be used to design novel classifiers in the target domain based on additional data from the source domain. The large size of the provided dataset (for each configuration, there are 100,000 data points per class in each domain) will facilitate the design, validation, and evaluation of new algorithms. The wide range of values for the feature space dimensions, Bayes errors, and relatedness levels will enable a comprehensive performance assessment of new classification and error estimation methods under diverse classification settings.

In many scientific or clinical settings, training data are typically limited in the target domain (e.g., due to high data acquisition cost), which impedes the design and evaluation of accurate classifiers. Transfer learning can improve the learning outcome in the target domain by incorporating data from relevant source domain(s).
From this perspective, the optimal way to use the provided data is to consider only a few data points in the target domain when developing new machine learning methods (e.g., classifier design [2], classification error estimation [3]), and to leverage a relatively larger amount of source data to improve the machine learning task in the target domain. The bulk of the remaining target data provided in the dataset should mainly be used to estimate ground-truth metrics (i.e., the true classification error), not as training data. The provided simulation source code can be used to simulate other classification scenarios with higher feature space dimensions and/or different classification complexity levels. The detailed description of the simulation setup used to generate the current dataset can serve as a solid guideline for configuring the experimental setup when studying classification problems from a transfer learning perspective. As the transfer learning aspect involves various factors affecting classification and error estimation performance, especially due to the heterogeneity of data characteristics across domains, it is critical to maintain uniform experimental conditions across all simulations so that the obtained results can be interpreted in a manner that is accurate, valid, and statistically meaningful.
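The recommended usage above can be sketched in code. The repository's own code is Matlab; the following is a minimal Python illustration (the function name and array layout are our own assumptions, not part of the dataset): keep only a small training subset per class in the target domain and reserve the rest for ground-truth error estimation.

```python
import numpy as np

def split_target_data(target_class0, target_class1, n_train=10, seed=0):
    """Split target-domain data per the recommended usage: keep only a few
    points per class for training and reserve the rest for estimating
    ground-truth metrics (e.g., the true classification error).

    target_class0, target_class1: (N, d) arrays of target samples per class.
    n_train: number of training points kept per class (small by design).
    Returns ([train0, train1], [holdout0, holdout1]).
    """
    rng = np.random.default_rng(seed)
    train, holdout = [], []
    for data in (target_class0, target_class1):
        idx = rng.permutation(len(data))          # shuffle before splitting
        train.append(data[idx[:n_train]])         # small training subset
        holdout.append(data[idx[n_train:]])       # large ground-truth set
    return train, holdout
```

The holdout portion should only be used to approximate true errors, never for training, so that reported performance reflects the small-sample target-domain regime.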

Data Description

As illustrated in Fig. 1, the main folder Synthetic_Data_Classification_Bayesian_Transfer_Learning contains three data sub-folders (d_2, d_3, and d_5) that correspond to dimensions 2, 3, and 5, respectively. The remaining sub-folder generation_source_code contains the Matlab source code.
Fig. 1

Hierarchy of the main data repository.

In every data sub-folder (d_2, d_3, or d_5) there are 24 binary Matlab files with names encoded as Data_d_x_Bayes_x.x_n_t_x_n_s_x_alpha_x.x_nu_x.mat, where:

d_x: refers to the dimension of the feature space, where x takes values 2, 3, or 5.
Bayes_x.x: designates the classification complexity level (0.1, 0.2, 0.3, or 0.4).
n_t_x: indicates the number of target data points per class (e.g., for n_t = 100000, this string is set to n_t_100000).
n_s_x: indicates the number of source data points per class (e.g., for n_s = 100000, this string is set to n_s_100000).
alpha_x.x: indicates the relatedness level (0.1, 0.3, 0.5, 0.7, 0.9, or 0.99).
nu_x: specifies the value of the hyperparameter ν that corresponds to the degrees of freedom used to model the joint Wishart distribution (the value used in our simulations is encoded in the file name).

Each of these Matlab binary files holds four indexed data containers (cell arrays), each of which contains two cell entries (one per class). These data containers are described as follows:

D_s: source dataset per class, of dimension n_s × d.
D_t: target dataset per class, of dimension n_t × d.
param_s: parameter vector of the source domain that specifies the means and the precision matrices of the multivariate Gaussian distributions underlying the data classes, with the following cell parameters:
  mu_s: contains a d-dimensional real-valued vector per class.
  Lambda_s: contains a (d × d) positive definite precision matrix per class.
param_t: parameter vector of the target domain that specifies the means and the precision matrices of the multivariate Gaussian distributions underlying the data classes, with the following parameters:
  mu_t: contains a d-dimensional real-valued vector per class.
  Lambda_t: contains a (d × d) positive definite precision matrix per class.

The Matlab source code directory generation_source_code is structured as illustrated in Fig. 2.
For each dimension (d = 2, 3, or 5), we include a list of sub-folders Bayes_x.x that correspond to the different classification complexity levels used in the evaluation experiments. Under each Bayes_x.x folder, we dedicate a sub-folder alpha_x.x to each relatedness level.
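Because the simulation parameters are encoded directly in the file names, they can be recovered programmatically. Below is a small Python sketch of such a decoder (the dataset's own code is Matlab; the function name, and the ν value in the example used for testing, are illustrative assumptions):

```python
import re

# Regular expression mirroring the naming convention
# Data_d_x_Bayes_x.x_n_t_x_n_s_x_alpha_x.x_nu_x.mat
NAME_RE = re.compile(
    r"Data_d_(?P<d>\d+)_Bayes_(?P<bayes>[\d.]+)"
    r"_n_t_(?P<n_t>\d+)_n_s_(?P<n_s>\d+)"
    r"_alpha_(?P<alpha>[\d.]+)_nu_(?P<nu>\d+)\.mat"
)

def parse_filename(name):
    """Decode a dataset file name into its simulation parameters."""
    m = NAME_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"unrecognized file name: {name}")
    return {
        "d": int(m["d"]),                   # feature-space dimension
        "bayes_error": float(m["bayes"]),   # classification complexity
        "n_t": int(m["n_t"]),               # target points per class
        "n_s": int(m["n_s"]),               # source points per class
        "alpha": float(m["alpha"]),         # relatedness level
        "nu": int(m["nu"]),                 # Wishart degrees of freedom
    }
```

Such a decoder makes it easy to iterate over the 24 files per dimension and group results by Bayes error or relatedness level.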
Fig. 2

Structure of the Matlab source code repository.

Fig. 3 shows the content of each relatedness level directory. Under each relatedness level folder, there is a main script file named simulate_Data.m that includes the simulation settings relevant to the parameter values specified by the architecture of the parent directories. The remaining files (generate_Data.m, setup_parameters.m, test_error_QDA.m, and train_QDA.m) are shared across all the experiments and are duplicated in the different folders for distributed execution purposes.
Fig. 3

Matlab script files.


Experimental Design Materials and Methods

Bayesian transfer learning framework for binary classification

To model the synthetic data, we consider a binary classification problem in the context of supervised transfer learning where there are two classes in each domain. Let $\mathcal{D}_s$ and $\mathcal{D}_t$ be two labeled datasets from the source and target domains with sizes $N_s$ and $N_t$, respectively. Let $\mathcal{D}_s^y = \{x_s^{y,i}\}_{i=1}^{n_s^y}$, $y \in \{0, 1\}$, where $n_s^y$ denotes the size of source data in class $y$. Likewise, let $\mathcal{D}_t^y = \{x_t^{y,i}\}_{i=1}^{n_t^y}$, $y \in \{0, 1\}$, where $n_t^y$ denotes the size of target data in class $y$. With the assumption of disjoint datasets across classes, we have $\mathcal{D}_s = \mathcal{D}_s^0 \cup \mathcal{D}_s^1$ and $\mathcal{D}_t = \mathcal{D}_t^0 \cup \mathcal{D}_t^1$ with $N_s = n_s^0 + n_s^1$ and $N_t = n_t^0 + n_t^1$. We consider a $d$-dimensional homogeneous transfer learning scenario where $\mathcal{D}_s$ and $\mathcal{D}_t$ are normally distributed and separately sampled from the source and target domains, respectively:

$$x_z^y \sim \mathcal{N}\big(\mu_z^y, (\Lambda_z^y)^{-1}\big), \quad z \in \{s, t\}, \; y \in \{0, 1\},$$

where $\mu_z^y$ is a mean vector in domain $z$ for class $y$, $\Lambda_z^y$ is a $d \times d$ matrix that denotes the precision matrix (inverse of covariance) in domain $z$ for label $y$, and $\mathcal{N}$ denotes the multivariate normal distribution. An augmented feature vector $x^y$ is a joint sample point from the two related source and target domains given by

$$x^y = \begin{bmatrix} x_t^y \\ x_s^y \end{bmatrix} \sim \mathcal{N}\big(\mu^y, (\Lambda^y)^{-1}\big),$$

with

$$\mu^y = \begin{bmatrix} \mu_t^y \\ \mu_s^y \end{bmatrix}, \qquad \Lambda^y = \begin{bmatrix} \Lambda_t^y & \Lambda_{ts}^y \\ (\Lambda_{ts}^y)^T & \Lambda_s^y \end{bmatrix},$$

where $A^T$ denotes the transpose of matrix $A$. Using a Gaussian-Wishart distribution as the joint prior for the mean vectors and precision matrices, the joint model factorizes as

$$p(\mu_t^y, \mu_s^y, \Lambda^y) = p(\mu_t^y, \mu_s^y \mid \Lambda^y)\, p(\Lambda^y).$$

For conditionally independent mean vectors given the covariances, the joint prior further expands to

$$p(\mu_t^y, \mu_s^y, \Lambda^y) = p(\mu_t^y \mid \Lambda_t^y)\, p(\mu_s^y \mid \Lambda_s^y)\, p(\Lambda^y).$$

The diagonal blocks $\Lambda_t^y$ and $\Lambda_s^y$ are obtained after sampling from a predefined joint Wishart distribution as defined in [2], such that $\Lambda^y \sim \mathcal{W}(M^y, \nu)$, where $\nu$ is a hyperparameter for the degrees of freedom that satisfies $\nu > 2d - 1$, and $M^y$ is a positive definite scale matrix of the form

$$M^y = \begin{bmatrix} M_t^y & M_{ts}^y \\ (M_{ts}^y)^T & M_s^y \end{bmatrix}.$$

$M_t^y$ and $M_s^y$ are also positive definite scale matrices, and $M_{ts}^y$ denotes the off-diagonal component that models the interaction between the source and target domains. Given $\Lambda_t^y$ and $\Lambda_s^y$, and assuming normally distributed mean vectors, we get

$$\mu_z^y \mid \Lambda_z^y \sim \mathcal{N}\big(m_z^y, (\kappa_z \Lambda_z^y)^{-1}\big), \quad z \in \{s, t\},$$

where $m_z^y$ is the mean vector of the mean parameter and $\kappa_z$ is a positive scalar hyperparameter.
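To make the sampling hierarchy above concrete, the following Python sketch draws one class-conditional sample from a joint Gaussian-Wishart prior of this form using scipy. All numeric hyperparameter values (k, alpha, nu, kappa) are illustrative placeholders, not the values used to generate the published datasets, and the Matlab implementation in the repository is the authoritative one.

```python
import numpy as np
from scipy.stats import wishart

def sample_joint_prior(d=2, nu=10, k=1.0, alpha=0.5, kappa=1.0,
                       m_t=None, m_s=None, seed=0):
    """Draw one sample from the joint Gaussian-Wishart prior: a (2d x 2d)
    precision matrix from a Wishart with a block scale matrix M, then mean
    vectors given the diagonal blocks. Hyperparameter values are illustrative.
    """
    rng = np.random.default_rng(seed)
    I = np.eye(d)
    # Block scale matrix M = [[k*I, alpha*k*I], [alpha*k*I, k*I]];
    # positive definite whenever |alpha| < 1.
    M = np.block([[k * I, alpha * k * I],
                  [alpha * k * I, k * I]])
    Lam = wishart.rvs(df=nu, scale=M, random_state=rng)   # (2d x 2d) sample
    Lam_t, Lam_s = Lam[:d, :d], Lam[d:, d:]               # diagonal blocks
    m_t = np.zeros(d) if m_t is None else m_t
    m_s = np.zeros(d) if m_s is None else m_s
    # Means given precisions: mu_z ~ N(m_z, (kappa * Lam_z)^{-1})
    mu_t = rng.multivariate_normal(m_t, np.linalg.inv(kappa * Lam_t))
    mu_s = rng.multivariate_normal(m_s, np.linalg.inv(kappa * Lam_s))
    return mu_t, Lam_t, mu_s, Lam_s
```

Extracting only the diagonal blocks of the Wishart sample mirrors the text: the off-diagonal block couples the domains through the prior, while each domain's data are generated from its own $(\mu, \Lambda)$ pair.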

Synthetic datasets

In order to generate the synthetic data, we consider feature space dimensions $d = 2$, $3$, and $5$. In the simulated datasets, we set up the data distributions as follows:

$$m_t^0 = m_s^0 = \mathbf{0}_d, \qquad m_t^1 = m_s^1 = c\,\mathbf{1}_d,$$

where $c$ is an adjustable scalar used to control the Bayes error in the target domain, and $\mathbf{0}_d$ and $\mathbf{1}_d$ are all-zero and all-one vectors, respectively. For the scale matrices of the Wishart distributions we set $M_t^y = k_t I_d$ and $M_s^y = k_s I_d$, where $I_d$ is the identity matrix of rank $d$. To ensure that the joint scale matrix $M^y$ is positive definite, we set $M_{ts}^y = \alpha \sqrt{k_t k_s}\, I_d$ with $|\alpha| < 1$. As in [2], the value of $|\alpha|$ controls the amount of relatedness between the source and target domains. To fully control the level of relatedness by adjusting only $\alpha$ and without involving other confounding factors, we set $k_t = k_s$ such that $M_t^y = M_s^y$. In this setting, the correlation between the features across the source and target domains is governed by $|\alpha|$, where small values of $|\alpha|$ correspond to poor relatedness between source and target domains while larger values imply stronger relatedness. To sample from the joint prior, we first sample from a non-singular Wishart distribution to get a block-partitioned sample of the form

$$\Lambda^y = \begin{bmatrix} \Lambda_t^y & \Lambda_{ts}^y \\ (\Lambda_{ts}^y)^T & \Lambda_s^y \end{bmatrix},$$

from which we extract $\Lambda_t^y$ and $\Lambda_s^y$. Afterwards, we sample $\mu_z^y \sim \mathcal{N}\big(m_z^y, (\kappa_z \Lambda_z^y)^{-1}\big)$ for $z \in \{s, t\}$ and $y \in \{0, 1\}$. As illustrated in Fig. 4, we use two types of datasets in our simulations: training datasets that contain samples from both domains, and testing datasets that contain only samples from the target domain. While the training datasets are saved and stored in our data repository, the testing datasets are only used during simulation to calibrate a desired level of classification complexity. In all the simulations we consider testing datasets of 100,000 data points per class, and we assume equal prior probabilities for the classes. We note that for normally distributed data, the optimal classifier for the feature-label distributions, also called the Bayes classifier, is a quadratic classifier that can be determined through quadratic discriminant analysis (QDA). This Bayes classifier is defined as

$$\psi(x) = \begin{cases} 1, & \text{if } Q(x) \geq 0, \\ 0, & \text{otherwise}, \end{cases}$$

where

$$Q(x) = \frac{1}{2}\ln\frac{|\Lambda_t^1|}{|\Lambda_t^0|} - \frac{1}{2}(x - \mu_t^1)^T \Lambda_t^1 (x - \mu_t^1) + \frac{1}{2}(x - \mu_t^0)^T \Lambda_t^0 (x - \mu_t^0).$$

Its true error is called the Bayes error.
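The QDA Bayes classifier and a Monte Carlo approximation of its true error can be sketched as follows. The repository implements these steps in Matlab (train_QDA.m and test_error_QDA.m); this Python sketch, with function names and test-set size of our own choosing, assumes equal class priors and parameterizes the classes by their precision matrices.

```python
import numpy as np

def qda_discriminant(x, mu0, Lam0, mu1, Lam1):
    """Q(x) for equal class priors, parameterized by precision matrices."""
    _, logdet0 = np.linalg.slogdet(Lam0)
    _, logdet1 = np.linalg.slogdet(Lam1)
    d0, d1 = x - mu0, x - mu1
    return (0.5 * (logdet1 - logdet0)
            - 0.5 * d1 @ Lam1 @ d1
            + 0.5 * d0 @ Lam0 @ d0)

def bayes_error_estimate(mu0, Lam0, mu1, Lam1, n_test=20000, seed=0):
    """Monte Carlo estimate of the Bayes (QDA) error on a test set drawn
    from the true class-conditional Gaussians, with equal priors."""
    rng = np.random.default_rng(seed)
    errors = 0
    for y, (mu, Lam) in enumerate([(mu0, Lam0), (mu1, Lam1)]):
        X = rng.multivariate_normal(mu, np.linalg.inv(Lam), size=n_test // 2)
        preds = [1 if qda_discriminant(x, mu0, Lam0, mu1, Lam1) >= 0 else 0
                 for x in X]
        errors += sum(p != y for p in preds)
    return errors / (2 * (n_test // 2))
```

Because the classifier is derived from the true model parameters, its estimated test error approximates the Bayes error of the target-domain feature-label distribution.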
Fig. 4

Flow chart illustrating the simulation set-up for generating the synthetic datasets in this paper.

To draw data for a specific Bayes error, we start by drawing a joint sample $(\mu_t^y, \mu_s^y, \Lambda^y)$ for each class $y \in \{0, 1\}$. Next, we iterate over the values of the mean-shift hyperparameter $c$ through a dichotomic search to reach a desired value of the Bayes error. This is achieved by drawing a sample of the model parameters and then generating a test set based on the joint sample. Using this test set, we determine the true error of the optimal QDA derived from the target-domain parameters. If the desired Bayes error (the true error of the designed QDA) is attained, the iteration stops; otherwise, we update $c$ and reiterate.
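The dichotomic search over the mean shift can be sketched in a few lines. For simplicity, this Python illustration assumes identity covariances, for which the Bayes error of classes N(0, I) and N(c·1, I) with equal priors has the closed form Φ(-c·√d/2); the actual pipeline instead estimates the error empirically with a QDA classifier on a generated test set, but the bisection logic is the same.

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def find_mean_shift(target_bayes_error, d, tol=1e-6):
    """Dichotomic (bisection) search for the scalar c such that the Bayes
    error of two N(0, I) / N(c*1, I) classes with equal priors matches the
    target. error(c) = Phi(-c*sqrt(d)/2) is monotonically decreasing in c,
    and here stands in for the paper's Monte Carlo QDA error estimate.
    """
    lo, hi = 0.0, 20.0   # error(0) = 0.5 and error(20) ~ 0 bracket any target < 0.5
    while hi - lo > tol:
        c = 0.5 * (lo + hi)
        err = phi(-c * math.sqrt(d) / 2.0)
        if err > target_bayes_error:
            lo = c        # classes too close: increase the shift
        else:
            hi = c        # classes too far apart: decrease the shift
    return 0.5 * (lo + hi)
```

In the actual simulations, each error evaluation inside the loop would involve drawing a test set and scoring the optimal QDA, so the search converges to a parameter realization whose empirical Bayes error matches the desired complexity level.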

Matlab script files

simulate_Data.m:

This is the main simulation file that generates and saves the data into binary Matlab files (*.mat). First, the constants and the model hyperparameters are set. The example configuration shown corresponds to Bayes_error = 0.1 at a given relatedness level α. Next, we draw from the joint model an initial sample of the model parameters in the source and target domains to initiate the heuristic search for the model parameters that correspond to the desired Bayes_error. Afterwards, we loop over different realizations of the model parameters until we obtain the desired Bayes_error. To do so, we start with the joint sample drawn for each class. Next, we iterate over the values of the mean-shift hyperparameter, referred to in the source code as mean_shift, through a dichotomic search to reach the desired value of the Bayes error (complexity_Bayes_error in the code). This is achieved by drawing a sample of the model parameters and then generating a test set based on the joint sample. Using this test set, we determine the true error of the optimal QDA derived from the target-domain parameters (lines 16 and 19). If the desired Bayes error (the true error of the designed QDA) is attained, the iteration stops; otherwise, we update the mean_shift variable and reiterate. Once a realization of the model parameters satisfies the desired Bayes_error, target and source datasets are generated and stored in binary Matlab files.

setup_parameters.m:

This function takes as input the model hyperparameters whose values change across the simulated datasets, and uses the shared values of the remaining hyperparameters to fully characterize the feature-label distributions in the source and target domains.

generate_Data.m:

As indicated by its name, this function takes as input a specified set of model parameters of source and target domains and generates synthetic training and testing datasets.

train_QDA.m:

This function identifies the Bayes classifier in the target domain. It implements the definition of a QDA classifier designed from a predefined set of model parameters for a binary classification problem in which the two classes are a priori equally likely.

test_error_QDA.m:

This function approximates the true classification error of a QDA classifier based on a given test set. In our simulations, this function is called to determine the Bayes_error, which corresponds to the true classification error of a QDA classifier designed from the true model parameters in the target domain.

Ethics Statement

The work did not involve any human or animal subjects, nor data from social media platforms.

CRediT authorship contribution statement

Omar Maddouri: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing. Xiaoning Qian: Conceptualization, Formal analysis, Supervision, Validation, Writing – review & editing. Francis J. Alexander: Conceptualization, Formal analysis, Validation, Writing – review & editing. Edward R. Dougherty: Conceptualization, Formal analysis, Funding acquisition, Project administration, Supervision, Validation, Writing – review & editing. Byung-Jun Yoon: Conceptualization, Formal analysis, Funding acquisition, Methodology, Project administration, Resources, Supervision, Validation, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.
Subject: Applied Machine Learning
Specific subject area: Bayesian transfer learning
Type of data: Binary Matlab files; Matlab source code; Shell script
How data were acquired: Matlab simulations
Data format: Binary Matlab files (*.mat); Matlab scripts (*.m); Shell script (*.sh)
Parameters for data collection: Three feature dimensions (d ∈ {2, 3, 5}) were considered for the studied feature spaces. Four classification complexity levels (Bayes error ∈ {0.1, 0.2, 0.3, 0.4}) were considered for each dimension. Six relatedness levels (|α| ∈ {0.1, 0.3, 0.5, 0.7, 0.9, 0.99}) were used to model the relatedness between source and target domains.
Description of data collection: Source and target datasets have been generated by Matlab simulations. The feature-label distributions in the source and target domains were assumed to be multivariate Gaussian. The two domains were related to each other through a joint prior, i.e., a Wishart distribution over the precision matrices of the underlying Gaussian feature-label distributions. The classification complexity has been modeled by the Bayes error, determined via the true classification error of an optimal quadratic discriminant analysis (QDA) classifier.
Data source location: Institution: Texas A&M University; City/Town/Region: College Station, TX 77843; Country: USA
Data accessibility: Repository name: Synthetic Data for Design and Evaluation of Binary Classifiers in the Context of Bayesian Transfer Learning [1]; DOI: 10.17632/fn33cknmfx.1; Direct URL to data: https://data.mendeley.com/datasets/fn33cknmfx/1
Related research article: O. Maddouri, X. Qian, F.J. Alexander, E.R. Dougherty, B.-J. Yoon, Robust importance sampling for error estimation in the context of optimal Bayesian transfer learning, Patterns 3 (3) (2022) 100428. https://doi.org/10.1016/j.patter.2021.100428
