Literature DB >> 30677030

SNP variable selection by generalized graph domination.

Shuzhen Sun^1,2, Zhuqi Miao³, Blaise Ratcliffe², Polly Campbell^4,5, Bret Pasch⁶, Yousry A El-Kassaby², Balabhaskar Balasundaram⁷, Charles Chen¹.

Abstract

BACKGROUND: High-throughput sequencing technology has revolutionized both medical and biological research by generating exceedingly large numbers of genetic variants. The resulting datasets share a number of common characteristics that might lead to poor generalization capacity. Concerns include noise accumulated due to the large number of predictors, sparse information regarding the p≫n problem, and overfitting and model mis-identification resulting from spurious collinearity. Additionally, complex correlation patterns are present among variables. As a consequence, reliable variable selection techniques play a pivotal role in predictive analysis, generalization capability, and robustness in clustering, as well as interpretability of the derived models. METHODS AND
FINDINGS: K-dominating set, a parameterized graph-theoretic generalization model, was used to model SNP (single nucleotide polymorphism) data as a similarity network and searched for representative SNP variables. In particular, each SNP was represented as a vertex in the graph, (dis)similarity measures such as correlation coefficients or pairwise linkage disequilibrium were estimated to describe the relationship between each pair of SNPs; a pair of vertices are adjacent, i.e. joined by an edge, if the pairwise similarity measure exceeds a user-specified threshold. A minimum k-dominating set in the SNP graph was then made as the smallest subset such that every SNP that is excluded from the subset has at least k neighbors in the selected ones. The strength of k-dominating set selection in identifying independent variables, and in culling representative variables that are highly correlated with others, was demonstrated by a simulated dataset. The advantages of k-dominating set variable selection were also illustrated in two applications: pedigree reconstruction using SNP profiles of 1,372 Douglas-fir trees, and species delineation for 226 grasshopper mouse samples. A C++ source code that implements SNP-SELECT and uses Gurobi optimization solver for the k-dominating set variable selection is available (https://github.com/transgenomicsosu/SNP-SELECT).

Entities: Chemical Disease Species

Mesh：

Year: 2019 PMID： 30677030 PMCID： PMC6345469 DOI： 10.1371/journal.pone.0203242

Source DB: PubMed Journal: PLoS One ISSN： 1932-6203 Impact factor: 3.240

Introduction

With the rapid advancement of DNA sequencing technology, the volume and dimension of biological and medical data have been increasing at an unprecedented rate. Accompanying such high volume genetic data, the ‘curse of dimensionality’ has challenged the validity of statistical methods that do not scale to massive data. Statistical accuracy, model interpretability and computational efficiency could be significantly impacted, especially when the number of predictors is much greater than sample size [1]. For instance in high dimensional classification, conventional classification rules using all variables perform no better than random guess for small sample sizes [2]; and in omics data analysis where the ultimate goal is to identify a small number of predictors (biomarkers, metabolites or genes), the correlation structure among predictors in the biology of the experiment often complicates biomarker identification [3]. Sources for these unsatisfied algorithmic performance could be the result of model noise accumulation in high dimension, incidental correlation between residual errors and some predictors, and the spurious collinearity that causes over-fitting and mis-identification of models [4-6], making variable selection a practical solution for “large p small n” data [7, 8]. Magnitude and significance of linkage disequilibrium (LD) in the genome markedly varies between populations [9, 10], causing unexpected multi-collinearity that leads to unstable estimates of genetic parameters [11]. By reducing correlation in SNP predictors, Song et al. [12] and others showed that with a selected subset, comparable predictability for complex traits like grain yield and milk yield could be achieved [11, 13–16]. Results from Weigel et al. (2009) [17] further suggest that, not only compatible prediction accuracy could be derived from a much smaller, evenly spaced SNP subset, but the standard deviation of prediction accuracy reduced. Crucial to both analyzing and interpreting high dimensional SNP datasets, significant effort has been directed towards exploring variable selection processes by removing features that might be either redundant or irrelevant to the problem, for better predictability, or computational efficiency and informativeness [18]. This effort includes the logistic regression method [19], the penalized regression method [1, 20, 21], partial least squares regression (PLSR) [22], sure independence screening strategy [23], multi-stage regression methods [24], sorted l-one penalized estimation (SLOPE) via convex optimization [25], recurrent relative variable importance measure (r2VIM) [26], to name a few. However, these methods were designed to reduce variables from a statistical perspective in order to ease the process of prediction or assist GWAS (genome-wide association study) analysis, in which knowledge of phenotypic data is required. In the era of population genomics [27], many Fst-based genome-scan methods utilize large datasets such as SNP chips or genome complexity reduction approaches like RAD tags [28] and genotyping-by-sequencing (GBS) [29, 30], to estimate genetic parameters [31]. Identifying adaptive evolution and candidate genomic regions under selection is increasingly feasible, thanks to the development of sophisticated analytical tools for genome-scale polymorphism data [32-35]. Given the data volume, most of these Bayesian approaches suffer from extended computational time requirement [31] due to tedious numerical approximation procedures like Markov chain Monte Carlo (MCMC) [34] or reverse jump (RJ)-MCMC [33]. Furthermore, accurate inference of demographic parameters such as effective population sizes, migration rates, and divergence times between populations depends largely on the use of neutral marker data [36-38]. In other words, SNP variable selection methods without the use of phenotypic data are desirable for the purpose of reducing the bias caused by confounding variables, for minimizing computational load, and for avoiding the potential problem of allele frequency correlations in, for example, the Lewontin and Krakauer (LK) test [31, 39]. In this paper, we present SNP-SELECT, a variable selection algorithm based on a graph-theory approach that uses generalized dominating sets for a large volume of SNP data without the use of phenotypes. Application of graph theory to variable selection or data reduction has been seen in many data mining applications [40-42]. Typically, this involves clustering the data points into groups and using one point to represent each cluster, from which the network clustering [43] procedure would derive a much smaller number of clusters, resulting in variable selection. In our cases, data points (SNPs) are represented by vertices and an edge exists if two data points (two SNPs) are similar or related in a certain way (i.e., in LD or in correlation). We show the use of LD with an example; it is, however, important to note that the similarity criterion used to construct the network model can be based on any relationship measurement. The advantage and robustness of SNP-SELECT is also demonstrated with simulated datasets, and with empirical datasets for a Douglas-fir (Pseudotsuga menziesii) breeding population and for populations of three grasshopper mouse species (Onychomys spp.).

Material and methods

Generalized graph domination

Let G = (V,E) be a graph with vertex set V and edge set E⊆[V]2 (see [44] for basic graph theory concepts and notations). The open neighborhood of a vertex v is the set N(v) of vertices adjacent to vertex v. Note that v∉N(v) and the closed neighborhood of vertex v is denoted by N[v] = {v}∪N(v). Definition 1 [ Given a positive integer k and a graph G = (V,E), a subset of vertices D is said to be k-dominating if |N(v)∩D|≥k for every vertex v∉D. If D is a k-dominating set, then every vertex in V−D is said to be k-dominated. A minimum k-dominating set is one of smallest cardinality in the graph and this cardinality is called the k-domination number of the graph, denoted as γ(G). Note that the k-domination number of a graph increases as parameter k increases and the model becomes more restrictive as more neighbors are needed for each vertex outside the set to be k-dominated. Hence, every 2-dominating set is also a 1-dominating set, but the converse is not true. Intuitively, as the parameter k increases, we expect the k-dominating set to be a more reliable representation of the dataset as each point has at least k similar points in the k-dominating set. Hence, the choice of k must balance two conflicting criteria: solution fidelity (how well the dataset is represented) and solution size (how many data points are selected). To illustrate, graphic presentations of k-dominating sets for k = 1 and k = 2 were showed in Fig 1(A); and Fig 1(B) illuminated 1-dominating set using neural network data for the nematode, C. elegans [46, 47]. Neurons are represented by vertices in this neural network and as long as two neurons communicate with each other, an edge exists between them. The big dots in Fig 1(B) mark a 1-dominating set, and all the small dots (vertices) have at least one neighbor of the same color, which identifies the cluster.

Fig 1

(a) Illustration of 1-dominating set and 2-dominating set; (b) Illustration of 1-dominating set using the neural network data of C. elegans [46, 47]: the big nodes mark a 1-dominating set, and all the small nodes have at least 1 same color neighbor. Clustering a graph via k-dominating sets, especially with k = 1, is a popular technique in telecommunication and wireless networks [48]. If D is a 1-dominating set, then for each vertex v∈D the closed neighborhood N[v] forms a cluster that altogether cover V. Since by definition, every vertex not in the 1-dominating set has a neighbor in it and hence, is assigned to at least one cluster. Since the problem of finding a minimum k-dominating set is NP-hard [49], heuristic approaches and approximation algorithms have been proposed to find a small k-dominating set in the given graph [50]. However, the approach employed in this article to solve this combinatorial optimization problem was to formulate it as an integer program [51], implement and solve it using a state-of-the-art solver that employs a branch-and-cut algorithm with built-in primal heuristics and other presolve reductions among others. Given a positive integer k and a graph G = (V,E), the problem of finding a minimum k-dominating set can be formulated as the following linear integer program in binary variables. subject to: In any feasible solution x to this formulation, the binary variable x = 1 if and only if vertex i is included in the k-dominating set D, which is given by D = {i∈V: x = 1}. The constraints ensure that if a vertex i is excluded from the k-dominating set D, i.e. x = 0, at least k of its neighbors must be included.

Pairwise relationship between SNPs

The pairwise relationship (similarity or distance) between SNP variables primarily determines the structure of the graph G, and different ways for quantifying the pairwise relationship can influence the structure of the graph G, especially the sets of edges. Currently, many methods exist to measure the pairwise relationship of SNPs, for example, Hamming distance [52], mutual information [53, 54], allele sharing index [55, 56], and linkage disequilibrium (LD) [57-59], to name a few. We chose to use the frequently used LD approach to describe the pairwise relationship between SNP variables in this study, although the proposed approach continues to work with other similarity measures as well. The square of correlation coefficients (r2) for SNP variables were calculated to represent the values in the LD matrix (refer to [60] for the details). Since the gametic phase of haplotype frequencies for each pair of SNPs are unknown, the expectation maximization algorithm [61] was applied to infer the haplotype frequencies in the LD calculation. With a user-defined threshold (λ), an edge exists only if the pairwise relationship between the two SNPs (vertices) is greater than λ. Thus, for any given pairwise relationship measurement, as λ increases, the number of edges in the graph decreases, and consequently the number of isolated (independent) vertices in a graph can increases. For any positive integer k, an isolated vertex in the graph cannot be k-dominated by any other vertex, and must be included in any k-dominating set. In fact, this observation holds more generally for any vertex with fewer than k neighbors in the graph.

Scheme of SNP-SELECT

The details of SNP-SELECT are summarized as follows: Step 1: Construct a graph model G = (V,E): Let V be the set of all SNPs and E is initially empty; Step 2: Calculate linkage disequilibrium w for each pair of SNPs i,j∈V; Step 3: An edge between SNPs i and j is created if w>λ; Step 4: Identify isolated SNPs I←{i∈V: N(i) = ∅}; Step 5: Find a minimum k-dominating set in G−I. All experiments/analyses reported in this article were conducted on a 64-bit Linux compute node of a high performance computing cluster with 96GB RAM and Intel Xeon E5620 2.40GHz processor. The algorithm was implemented using C++, and the integer programming formulation for the minimum k-dominating set problem was solved using the GurobiTM optimizer 6.0 with default settings [62]. Given a running time limit, GurobiTM either returned an optimal solution, or a feasible solution with a gap to a lower bound on the optimal solution. Experiments/analyses reported in this study were performed with a 1-hour running time limit for GurobiTM. The solution returned by GurobiTM was used to identify the representative subset of the original dataset. In our preliminary analyses, we found that when λ is small, e.g. λ<0.2, the graph model tends to be very dense with an extremely large number of edges. When several thousands of SNPs are involved, such graphs can exceed memory limits during computation and result in a memory crash, before a feasible solution can be derived. Also, very small thresholds may not necessarily be realistic to capture similarities between SNPs. To address this issue, a stepwise search was implemented in SNP-SELECT for large SNP datasets as follows: Step 1: Construct a threshold set T = {λ1,λ2,…,λ}, where λ1>λ2>⋯>λ, and λ is the desired threshold, λ←λ, and λ−λ equals a predefined step; Let h = 1, and V1 be the set of all SNPs; Step 2: Construct G on V using λ; Step 3: Identify isolated SNPs I←{i∈V: N(i) = ∅}; Step 4: Find a minimum (or a small) k-dominating set S in G−I, let Y←S∪I; Step 5: If h = L, return Y, STOP; else V←Y, h←h+1, go to Step 2. In brief, this step-wise search of SNP-SELECT first finds a minimum k-dominating set Y1 (or the best solution available) on a graph model based on a larger threshold. Then the threshold is lowered to focus on the graph induced by Y1. The data size of current step is the output of previous step. This process is repeated until a desired low threshold is reached. The feature selection problem of large datasets is thus solved by iteratively reducing the value of threshold.

Simulation studies

To demonstrate the capacity of the k-dominating set algorithm to identify independent variables, and to select proxy variables among highly correlated ones, a simulated dataset that included 10 synthetic undirected networks with n = 1000 vertices were used to represent SNPs. In this synthetic network dataset, the pairwise relationships between SNPs (vertices), the weighted edge (w) between each pair of vertices (i,j), were generated using a uniform distribution over [0,1]. The randomly chosen edge weights, denoted by a, where , and without loss of generality assumed to be in increasing order, were assigned to edges using the following algorithm such that w Step 1: Initialize l←1; Step 2: for to n−1 Step 3: for to n Step 3: ; Step 4: ; Step 5: end-for Step 6: end-for A correlative relationship among SNP variables, or linkage disequilibrium (LD), is the non-random association between SNP alleles. The distribution of these relationships among SNPs in a given genome tends to be greater when SNPs are closely located; this correlation diminishes quickly as genomic distance between SNPs gets larger, e.g. LD decay [63]. As a result, the distribution of correlative relationships among SNPs is a mixture of a small number of highly correlated SNPs with a large number of SNPs in low correlations. Assigning edge weights in increasing order is a simple way to guarantee that only part of the vertices has low edge weights close to 0, which can be used to define the independent variables. Meanwhile, we can also identify a subset of vertices with edge weights higher than a predefined threshold within this set, which could be used to define the independent variables and highly correlated variables. A vertex i that has all neighbors with w<0.1, where j∈V and j≠i, was defined as an independent variable. The subset generated by SNP-SELECT has to include all the independent variables to confirm that the k-dominating set based approach is able to identify independent variables. Highly correlated variables were defined as a subset (P) of the variables where P⊂V, and the edge weights () within this subset are greater than a predefined threshold. In this simulation, we selected 0.8, 0.6, 0.4, and 0.3 as the predefined thresholds for the purpose of illustrating the capability of the proposed approach to select the highly correlated variables. If SNP-SELECT includes at least one of the predefined highly correlated variables, the performance of the algorithm in selecting proxy variables among the highly correlated ones is considered fulfilled.

Douglas-fir breeding populations

The Douglas-fir breeding population was established by the Ministry of Forests, Lands and Natural Resource of British Columbia, Canada in 1975 and consists of 165 full-sib families generated from structured paired-matings among 54 parents. The 1,372 individual trees used in this study consist of a subset of the full population and contains 37 full-sib families from 38 parents (see [64], for complete details). SNP genotypes for these 1,372 trees were generated using exome capture [65], resulting in 106,099 SNPs with missing ratio threshold less than 25% and minimum minor allele frequency (MAF) greater than 5% which comprises the ‘original’ data set. The average numerator relationship matrix (-matrix) of the DF dataset is known due to the structured pedigree, and was used as a baseline for comparison. We calculated the genomic estimated relatedness (matrix) using R package “rrBLUP” [66] using the mean imputation option on the original SNP dataset, as well as the five k-dominating SNP subsets. The comparison of the pedigree-based relatedness (-matrix) elements with those of the -matrices of the selected SNP subsets was performed to validate that SNP-SELECT is able to minimize the deviation of diagonal elements while obtaining comparable genetic covariance among individuals (off-diagonal elements).

Grasshopper mouse SNP data

Grasshopper mouse (genus, Onychomys) are cricetid rodents that inhabit prairies, deserts and desert grasslands throughout the western United States, northern Mexico, and south-central Canada [67]. Whereas O. leucogaster is readily distinguished based on body size, the two smaller species, O. arenicola and O. torridus, are morphologically cryptic and were treated as a single species until 1979 [68]. The SNP dataset analyzed here was generated using genotyping-by-sequencing, GBS [29], as part of a study designed to test for evidence of hybridization at a site in southwestern New Mexico where all three species come into contact [69], and at other sites in New Mexico and Arizona where O. leucogaster is sympatric with O. arenicola and O. torridus, respectively. SNPs were called using a reference-free SNP discovery protocol (UNEAK pipeline [70]), and filtered with minor allele frequency greater than 5% and missing ratio less than 10%.

Results

When the SNP-SELECT algorithm was applied to the synthetic network with k ∈ {1,2,3,4,5}, all the k-dominating sets found included the predefined independent variables. The performance of the k-dominating set model in the selection of proxy variables is presented in Fig 2, with the predefined highly correlated variable thresholds λ∈{0.8,0.7,0.6,0.5,0.4,0.3}. As shown in Fig 2(A), the definition of highly correlated variables was strict (); under this condition of few, highly connected variables, the use of larger values of either k or λ was encouraged. Also shown in Fig 2(B), 2(C) and 2(D), when relationships between variables are a mixture of high and low correlations, our results suggest the use of smaller values in k and λ to capture all relationships. By varying on k and λ, we demonstrate the flexibility and strength of SNP-SELECT in choosing proxy variables from highly correlated variables.

Fig 2

The capability of k-dominating set in selecting proxy variables among highly correlated variables.

Ten synthetic undirected networks with n = 1,000 vertices (V) were simulated. (a) highly correlated variables defined as ; (b) highly correlated variables defined as ; (c) highly correlated variables defined as ; (d) highly correlated variables defined as .

The capability of k-dominating set in selecting proxy variables among highly correlated variables.

Pedigree recovery for Douglas-fir breeding populations

The SNP-SELECT algorithm was applied to select the influential SNPs to reconstruct the known pedigree for a Douglas-fir (DF) breeding population. Four k-dominating sets (DF107, DF105, DF103, DF102) with k = 1, and λ∈{0.7,0.5,0.3,0.2} were generated. Among the four 1-dominating sets, DF103 has the best performance as shown in Table 1. To further investigate the impact of k on variable selection, another k-dominating set, DF203, with k = 2 and λ = 0.3 was generated to compare with DF103. The number of selected SNPs in DF107, DF105, DF103, DF102 and DF203 is 80,735, 67,062, 51,415, 41,539, and 68,188, respectively.

Table 1

The average difference of the upper triangle and the diagonal between pedigree-based relatedness (A-matrix) and genomic estimated relatedness (G-matrix).

The best selected-subset for pedigree reconstruction (subset DF103) is highlighted. λ is linkage disequilibrium estimate.

	k	λ	Num. ofSNP	Ave. differenceupper triangle	Ave. differencediagonal
Original Data	-	-	106,099	0.034353	0.180374
DF107	1	0.7	80,735	0.034240	0.103673
DF105	1	0.5	67,062	0.034139	0.055994
DF103	1	0.3	51,415	0.034018	0.019769
DF102	1	0.2	41,539	0.034249	0.123494
DF203	2	0.3	68,188	0.034180	0.123950
Random subset	-	-	51,415	0.034498	0.180419
COR03	-	0.3	39,768	0.034774	0.234326
LRTag03	-	0.3	51,022	0.034292	0.135324

The average difference of the upper triangle and the diagonal between pedigree-based relatedness (A-matrix) and genomic estimated relatedness (G-matrix).

The best selected-subset for pedigree reconstruction (subset DF103) is highlighted. λ is linkage disequilibrium estimate. Without variable selection, original SNP data generated an average of 18% discrepancy on the diagonal elements. The performance of the five k-dominating subsets was showed in the reduced diagonal differences from the genomic relationship matrix (-matrix) to the traditional pedigree-based average numerator relationship matrix (-matrix) (Table 1). Comparing the five k-dominating sets indicated that the DF103 subset performed best on pedigree reco, especially for the diagonal pedigree information recovery. Fig 3 further illustrates the efficiency of the DF103 subset on pedigree reconstruction, and indicates that the -matrix generated from the DF103 subset was closer to the known -matrix as compared with the original dataset’s -matrix. Additionally, we randomly selected 10 subsets with the same SNP number as DF103 from the original dataset and used the average results of these 10 subsets to represent the performance of the randomly selected subset. The results indicated that all five k-dominating sets outperformed the randomly selected subset (Table 1).

Fig 3

(a) Heatmap of the absolute difference between pedigree-based relatedness (-matrix) and genomic estimated relatedness (-matrix) generated from original data; (b) Heatmap of the absolute difference between pedigree-based relatedness (matrix) and genomic estimated relatedness (matrix) generated from DF103 subset. The color of Fig 3(B) is lighter than Fig 3(A). The lighter the color, the closer the relationship between and matrices of Douglas-fir breeding population. The effectiveness of SNP-SELECT was also examined by the conventional approach that filters for SNP variables by pairwise correlation coefficients, as well as the LRTag algorithm that applies minimum set covering for SNP selection [71]. The discrepancy between and matrices resulted from using correlation coefficient of 0.3 and λ = 0.3 was listed in Table 1 as COR03 and LRTag03, for pairwise correlation coefficient method and LRTag algorithm, respectively. Among all tests, the DF103 from SNP-SELECT remained the best SNP subset for estimating genetic relationship of Douglas-fir breeding population. Consider computing time requirement, when values of distance or pair-wise linkage disequilibria were pre-computed, SNP variable selection for SNP-SELECT could be complete in 8–10 minutes, while LRTag required about 18 hours for the same datasets.

Clustering analysis for grasshopper mouse populations

To investigate parameters influencing population genetics of grasshopper mouse populations, 85,812 SNPs were used to genotype 226 samples representing three species: O. arenicola (n = 76), O. leucogaster (n = 67), and O. torridus (n = 83), collected at 12 geographic locations (Table 2). The dataset was pre-filtered based on a maximum of 10% missing data, and minimum MAF (minor allele frequency) of 5%. The SNP-SELECT was applied to generate three SNP subsets (MICE103, MICE105 and MICE107) with k = 1 and λ∈{0.3,0.5,0.7}, respectively; the number of informative SNPs retained in MICE103, MICE105 and MICE107 was 2,144, 11,014, and 22,355, respectively. The missing data in the original dataset and the three k-dominating sets was imputed with the most frequent genotype. Before the geographic origin analysis, we split the 226 samples into 3 groups based on species identity. There were 5 sampling locations in each species group (Table 2).

Table 2

Geographic location of grasshopper mouse (Onychomys) samples.

Species	Site Name (Sample Size)
O. arenicola	Animas/Rodeo, NM (20);Pancho Villa, Chihuahua, Mex (7);Organ Mountains, NM (27);Sevilleta National Wildlife Refuge, NM (14); Hidalgo del Parral, Chihuahua, Mex (8).
O. leucogaster	Petrified Forest, AZ (13);Animas/Rodeo, NM (11);Sevilleta National Wildlife Refuge, NM (14);Felt, OK (19); Garden City, KS (10).
O. torridus	Lone Pine, CA (11);Carefree, AZ (8);Santa Rita Experimental Range, AZ (19);Animas/Rodeo, NM (28); Chiricahua Mountains, AZ (17).

The performance of the three k-dominating sets’ ability to predict the geographic origin of samples within each species was first evaluated using the k-means clustering approach in R [72]. Clustering was initiated with k = 5, random seed at 20 and nstart = 100, where nstart specifies the initial configurations, and the algorithm will report on the best one [73, 74]. The Adjusted Rand Index (ARI), a measure of agreement between clustering results and external criteria [75, 76], was used to evaluate the clustering results. As shown in Table 3, the clustering results for the largest SNP subset, MICE107, had the same performance as the original data of 85,812 SNPs in recovering the geographic origin of O. arenicola and O. torridus samples; however, MICE107 subset outperformed the original SNP data in recovering the geographic origin of O. leucogaster samples. Moreover, the clusters resulting from MICE105 and MICE103 exhibited larger ARI values than those from the original SNP data, indicative of a greater agreement and reduced errors in the clustering reached by SNP-SELECT variable selection. Overall, the MICE105 SNP subset (11,014 SNPs) demonstrated the greatest agreement among all selected subsets (Table 3).

Table 3

The adjusted rand index (ARI) shows the agreement between the computed clusters using k-means clustering algorithm and partitioning around medoids (PAM) algorithm with k = 5, using the original grasshopper mouse SNP data set and the k-dominating subsets.

ARI values listed below show the agreement measurement between original sample locations and clustering results.

Method	Dataset	SNP Number	O. arenicola	O. leucogaster	O. torridus
	Original data	85,812	0.3868	0.5981	0.5692
k-means	MICE107	22,355	0.3868	0.7158	0.5692
	MICE105	11,014	0.5256	0.7158	0.6003
	MICE103	2,144	0.3963	0.7158	0.6003
	Original data	85,812	0.0706	0.2229	0.2244
PAM	MICE107	22,355	0.0513	0.3852	0.1812
	MICE105	11,014	0.1016	0.3509	0.2445
	MICE103	2,144	0.1172	0.2902	0.2793

The adjusted rand index (ARI) shows the agreement between the computed clusters using k-means clustering algorithm and partitioning around medoids (PAM) algorithm with k = 5, using the original grasshopper mouse SNP data set and the k-dominating subsets.

ARI values listed below show the agreement measurement between original sample locations and clustering results. To confirm that the performance of SNP-SELECT was not the result of a specific clustering algorithm, the partitioning around medoids (PAM) algorithm [77] with k = 5 (random seed = 20) was performed using samples’ dissimilarity matrix of each species. To describe the dissimilarity matrix, we first define the -matrix as . Then the dissimilarity matrix is defined as . In Table 3, clusters resulting from the PAM algorithm also demonstrated that the selected subsets perform better than the original data in predicting actual sampling localities.

Discussion

Owing to technological advancement in DNA sequencing methods, life scientists are grappling with exceedingly large data sets [78]. The most obvious challenge is the large amount of genomic variation that needs to be processed and quantified in a very short time period [79]. Although various data techniques have been adopted, the resulting data sets have several characteristics that make downstream analyses challenging [80]. The common ones are: the number of variables is often much larger than the number of individuals, and data sets are usually sparse regarding relevant information, i.e. only a small subset of variables is associated with the phenotypic variation [81]. In genetic analyses using high dimensional data sets where there are more parameters than observations, penalized regression techniques are often required to ensure stable estimates [82, 83]. The estimates of SNP marker effects are strongly affected by collinearity between predictors through a “grouping effect”- groups of variables highly correlated with other groups (of variables) sporadically [84]. Such multicollinearity would further confound gene expression values obtained from RNAseq or determination of significance of SNP causality in genome-wide association (GWAS) or genomic selection (GS) studies [85-87]. As a result, multiple-step GWAS and GS analysis that includes SNP variable selection has been explored [26, 88–91]. While adoption of these methods might be an advantage when seeking functional variants associated with traits of interest, these fitness-associated SNP variables would bias inferences of gene flow, migration or dispersal [37, 92], and estimates of relatedness and inbreeding depression [93]. Without the dependency on phenotypes, SNP variable selection methods currently focus on pairwise correlations between variables (e.g. [94]). In principle, SNP variables are selected if the absolute value of a pairwise correlation (|corr(i,j)|) is less than a predefined threshold λ; or if |corr(i,j)| is no less than the given threshold, only the second variable will be selected (e.g. if |corr(i,j)|≥λ, SNP j will be selected). Here, we demonstrate the superior performance of the proposed k-dominating set variable selection over the conventional method of pairwise correlation coefficients (Table 1, COR03). As shown in Fig 3, diagonal values, indicative of the errors in estimating individuals’ genomic relationship based on markers, were minimized using SNP-SELECT. The pairwise estimates of genomic relationships (off-diagonal elements) were, however, mostly preserved (Table 1), suggesting that both the hidden and historical relatedness among individuals could still be recovered by the set of SNP variables selected by SNP-SELECT. The use of genomic markers to uncover hidden relationships and potential pedigree in open-pollinated progeny has been effective in tree breeding programs [95, 96]. Such pedigree reconstruction is a preferred method to determine the genealogical relationship among groups of related individuals, leading to improved estimation of genetic parameters [97-99]. To maximize the advantage of using dense genomic markers, VanRaden [100] derived estimates of marker-based relationships between pairs of individuals as a genomic relationship matrix (-matrix), which can be used as a substitute for the traditional pedigree-based average numerator relationship matrix (- matrix) in Henderson’s animal model [101-103]. Also, combining the -matrix and the -matrix into a single genetic relationship matrix (-matrix) has proven to be an effective approach to improve the relationship coefficients for better genetic parameter estimation [104, 105] and marker effect estimation [106], and to leverage extra phenotypic information from the non-genotyped individuals [103]. To ensure improved accuracy in such single-step methods, the - and -matrices should be compatible [107], and diagonal elements in the -matrix need to be consistent with the -matrix diagonal elements; therefore rescaling - and -matrices would reflect the mean difference between these matrices [108], a context in which using SNP markers selected by SNP-SELECT could be considerably beneficial.

Conclusions

The k-dominating set model provides a flexible and effective method for selecting informative SNPs; a C++ source code (SNP-SELECT) that uses GurobiTM Optimization solver is also released with the manuscript. This approach is scalable through the use of integer programming solvers and graph preprocessing, and can be extended to other genomic applications. Using pedigree reconstruction and cluster analysis, the capacity of SNP-SELECT was demonstrated for solving the variable selection conundrum of large datasets without any significant runtime considerations. Furthermore, SNP-SELECT does not depend on the use of LD to define threshold for edges; other similarity/distance measure would broaden its applicability beyond breeding science and ecological genetics. Future work on the algorithmic aspects of this approach could focus on the development of graph and model decomposition techniques, as well as preprocessing techniques to improve scalability in practice.

74 in total

Review 1. The power and promise of population genomics: from genotyping to genome typing.

Authors: Gordon Luikart; Phillip R England; David Tallmon; Steve Jordan; Pierre Taberlet
Journal: Nat Rev Genet Date: 2003-12 Impact factor: 53.242

2. A variable selection method for genome-wide association studies.

Authors: Qianchuan He; Dan-Yu Lin
Journal: Bioinformatics Date: 2010-10-29 Impact factor: 6.937

3. Machine learning classification procedure for selecting SNPs in genomic selection: application to early mortality in broilers.

Authors: N Long; D Gianola; G J M Rosa; K A Weigel; S Avendaño
Journal: J Anim Breed Genet Date: 2007-12 Impact factor: 2.380

4. Collective dynamics of 'small-world' networks.

Authors: D J Watts; S H Strogatz
Journal: Nature Date: 1998-06-04 Impact factor: 49.962

5. LASSO with cross-validation for genomic selection.

Authors: M Graziano Usai; Mike E Goddard; Ben J Hayes
Journal: Genet Res (Camb) Date: 2009-12 Impact factor: 1.588

Review 6. Practical issues in screening and variable selection in genome-wide association analysis.

Authors: Sungyeon Hong; Yongkang Kim; Taesung Park
Journal: Cancer Inform Date: 2015-01-14

7. Associations between Allozyme Genotypes and Quantitative Traits in Douglas-Fir [PSEUDOTSUGA MENZIESII (Mirb.) Franco].

Authors: Y A El-Kassaby
Journal: Genetics Date: 1982-05 Impact factor: 4.562

8. An eQTL biological data visualization challenge and approaches from the visualization community.

Authors: Christopher W Bartlett; Soo Yeon Cheong; Liping Hou; Jesse Paquette; Pek Yee Lum; Günter Jäger; Florian Battke; Corinna Vehlow; Julian Heinrich; Kay Nieselt; Ryo Sakai; Jan Aerts; William C Ray
Journal: BMC Bioinformatics Date: 2012-05-18 Impact factor: 3.169

9. The genomic distribution of population substructure in four populations using 8,525 autosomal SNPs.

Authors: Mark D Shriver; Giulia C Kennedy; Esteban J Parra; Heather A Lawson; Vibhor Sonpar; Jing Huang; Joshua M Akey; Keith W Jones
Journal: Hum Genomics Date: 2004-05 Impact factor: 4.639

10. SNP-based pathway enrichment analysis for genome-wide association studies.

Authors: Lingjie Weng; Fabio Macciardi; Aravind Subramanian; Guia Guffanti; Steven G Potkin; Zhaoxia Yu; Xiaohui Xie
Journal: BMC Bioinformatics Date: 2011-04-15 Impact factor: 3.169

1 in total

1. Vocal divergence is concordant with genomic evidence for strong reproductive isolation in grasshopper mice (Onychomys).

Authors: Polly Campbell; Lena Arévalo; Heather Martin; Charles Chen; Shuzhen Sun; Ashlee H Rowe; Michael S Webster; Jeremy B Searle; Bret Pasch
Journal: Ecol Evol Date: 2019-11-06 Impact factor: 2.912

1 in total