Shuzhen Sun1,2, Zhuqi Miao3, Blaise Ratcliffe2, Polly Campbell4,5, Bret Pasch6, Yousry A El-Kassaby2, Balabhaskar Balasundaram7, Charles Chen1.
Abstract
BACKGROUND: High-throughput sequencing technology has revolutionized both medical and biological research by generating exceedingly large numbers of genetic variants. The resulting datasets share a number of common characteristics that might lead to poor generalization capacity. Concerns include noise accumulated due to the large number of predictors, sparse information regarding the p≫n problem, and overfitting and model mis-identification resulting from spurious collinearity. Additionally, complex correlation patterns are present among variables. As a consequence, reliable variable selection techniques play a pivotal role in predictive analysis, generalization capability, and robustness in clustering, as well as interpretability of the derived models.
Year: 2019 PMID: 30677030 PMCID: PMC6345469 DOI: 10.1371/journal.pone.0203242
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. (a) Illustration of a 1-dominating set and a 2-dominating set; (b) illustration of a 1-dominating set using the neural network data of C. elegans [46, 47]: the large nodes mark a 1-dominating set, and every small node has at least one neighbor of the same color.
Fig 2. The capability of the k-dominating set in selecting proxy variables among highly correlated variables.
Ten synthetic undirected networks with n = 1,000 vertices (V) were simulated. (a) highly correlated variables defined as ; (b) highly correlated variables defined as ; (c) highly correlated variables defined as ; (d) highly correlated variables defined as .
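As a rough illustration of the idea behind Fig 1 and Fig 2, the sketch below links variables whose absolute correlation reaches a threshold and then picks a k-dominating set (every unselected variable has at least k selected neighbors) with a simple greedy heuristic. The greedy rule, the function names, the threshold, and the toy correlation matrix are all assumptions made for this example; they are not taken from the paper, and the heuristic does not guarantee a minimum-size set.

```python
from itertools import combinations


def correlation_graph(corr, threshold):
    """Build adjacency sets: i and j are linked when |corr[i][j]| >= threshold."""
    n = len(corr)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if abs(corr[i][j]) >= threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def greedy_k_dominating_set(adj, k=1):
    """Greedy heuristic (hypothetical helper, not the paper's solver)."""
    chosen = set()
    covered = {v: 0 for v in adj}  # number of chosen neighbors of each vertex

    def unmet(v):
        # v still needs attention: not selected and fewer than k selected neighbors
        return v not in chosen and covered[v] < k

    def gain(v):
        # selecting v settles v itself (if unmet) and helps each unmet neighbor
        return int(covered[v] < k) + sum(1 for u in adj[v] if unmet(u))

    while any(unmet(v) for v in adj):
        best = max((v for v in adj if v not in chosen), key=gain)
        chosen.add(best)
        for u in adj[best]:
            covered[u] += 1
    return chosen


def is_k_dominating(adj, chosen, k):
    """Check the defining property of a k-dominating set."""
    return all(v in chosen or len(adj[v] & chosen) >= k for v in adj)


# Toy data: variables 0-2 are mutually correlated, variable 3 is isolated.
toy_corr = [
    [1.0, 0.9, 0.9, 0.1],
    [0.9, 1.0, 0.9, 0.0],
    [0.9, 0.9, 1.0, 0.2],
    [0.1, 0.0, 0.2, 1.0],
]
toy_adj = correlation_graph(toy_corr, threshold=0.5)
proxies = greedy_k_dominating_set(toy_adj, k=1)
print(sorted(proxies), is_k_dominating(toy_adj, proxies, 1))
```

In this toy case one variable from the correlated block acts as a proxy for the other two, while the isolated variable has no neighbor to represent it and must be kept.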
The average difference of the upper triangle and the diagonal between pedigree-based relatedness (A-matrix) and genomic estimated relatedness (G-matrix).
The best selected subset for pedigree reconstruction (subset DF103) is highlighted. λ is the linkage disequilibrium estimate.
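| Subset | k | λ | Number of SNPs | Upper triangle | Diagonal |
| --- | --- | --- | --- | --- | --- |
| Original Data | - | - | 106,099 | 0.034353 | 0.180374 |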
| DF107 | 1 | 0.7 | 80,735 | 0.034240 | 0.103673 |
| DF105 | 1 | 0.5 | 67,062 | 0.034139 | 0.055994 |
| DF102 | 1 | 0.2 | 41,539 | 0.034249 | 0.123494 |
| DF203 | 2 | 0.3 | 68,188 | 0.034180 | 0.123950 |
| Random subset | - | - | 51,415 | 0.034498 | 0.180419 |
| COR03 | - | 0.3 | 39,768 | 0.034774 | 0.234326 |
| LRTag03 | - | 0.3 | 51,022 | 0.034292 | 0.135324 |
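The two summary statistics in this table can be sketched as follows. The example assumes that "average difference" means the mean absolute element-wise difference, and that the upper triangle excludes the diagonal; both are assumptions for illustration, as are the function name and the toy matrices.

```python
def mean_abs_differences(A, G):
    """Return (upper-triangle mean, diagonal mean) of |A - G|.

    A and G are square, same-sized relatedness matrices given as nested lists,
    with at least two rows. The upper triangle here excludes the diagonal
    (an assumption of this sketch).
    """
    n = len(A)
    diag = [abs(A[i][i] - G[i][i]) for i in range(n)]
    upper = [abs(A[i][j] - G[i][j]) for i in range(n) for j in range(i + 1, n)]
    return sum(upper) / len(upper), sum(diag) / len(diag)


# Toy 3x3 matrices (hypothetical values, not data from the study).
A_toy = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.25],
    [0.0, 0.25, 1.0],
]
G_toy = [
    [1.2, 0.4, 0.0],
    [0.4, 0.9, 0.35],
    [0.0, 0.35, 1.0],
]
upper_mean, diag_mean = mean_abs_differences(A_toy, G_toy)
print(round(upper_mean, 4), round(diag_mean, 4))
```

Smaller values in either statistic indicate that the marker-based relatedness is closer to the pedigree-based relatedness.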
Fig 3. (a) Heatmap of the absolute difference between pedigree-based relatedness (A-matrix) and genomic estimated relatedness (G-matrix) generated from the original data; (b) heatmap of the absolute difference between pedigree-based relatedness (A-matrix) and genomic estimated relatedness (G-matrix) generated from the DF103 subset. The color of Fig 3(b) is lighter than that of Fig 3(a). The lighter the color, the closer the agreement between the A- and G-matrices of the Douglas-fir breeding population.
Geographic location of grasshopper mouse (Onychomys) samples.
| Animas/Rodeo, NM (20); | |
| Petrified Forest, AZ (13); | |
| Lone Pine, CA (11); |
The adjusted Rand index (ARI) shows the agreement between the clusters computed using the k-means clustering algorithm and the partitioning around medoids (PAM) algorithm with k = 5, using the original grasshopper mouse SNP data set and the k-dominating subsets.
ARI values listed below show the agreement measurement between original sample locations and clustering results.
| Original data | 85,812 | 0.3868 | 0.5981 | 0.5692 | |
| MICE107 | 22,355 | 0.3868 | 0.7158 | 0.5692 | |
| MICE103 | 2,144 | 0.3963 | 0.7158 | 0.6003 | |
| Original data | 85,812 | 0.0706 | 0.2229 | 0.2244 | |
| PAM | 22,355 | 0.0513 | 0.1812 | ||
| MICE105 | 11,014 | 0.1016 | 0.3509 | 0.2445 | |
| 2,144 | 0.2902 |
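For readers unfamiliar with the ARI values reported above, the following is a minimal sketch of the standard pair-counting formula for the adjusted Rand index, written with the Python standard library only. The function name and the toy labels are assumptions for illustration; the study's own computations may have used different software.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same samples (n >= 2)."""
    n = len(labels_a)
    # Contingency counts: samples sharing a cluster in both labelings.
    joint = Counter(zip(labels_a, labels_b))
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings put every sample in one cluster.
        return 1.0
    return (sum_joint - expected) / (max_index - expected)


# Toy example: hypothetical sampling locations versus hypothetical cluster labels.
locations = ["NM", "NM", "AZ", "AZ", "CA", "CA"]
clusters = [0, 0, 1, 1, 2, 2]
print(adjusted_rand_index(locations, clusters))
```

An ARI of 1 means the two partitions agree perfectly up to relabeling, values near 0 correspond to chance-level agreement, and negative values indicate less agreement than expected by chance.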