Xaviera Alejandra López-Cortés1, Felipe Matamala1, Carlos Maldonado2, Freddy Mora-Poblete3, Carlos Alberto Scapim4.
Abstract
Analysis of population genetic variation and structure is common practice in genome-wide studies, including association mapping, ecology, and evolution studies in several crop species. In this study, two machine learning (ML) clustering methods, K-means (KM) and hierarchical clustering (HC), were combined with non-linear and linear dimensionality reduction techniques, a deep autoencoder (DeepAE) and principal component analysis (PCA), respectively, to infer population structure and individual assignment of maize inbred lines, i.e., dent field corn (n = 97) and popcorn (n = 86). The results revealed that the HC method in combination with DeepAE-based data preprocessing (DeepAE-HC) was the most effective at assigning individuals to clusters (96% correct individual assignments), whereas DeepAE-KM, PCA-HC, and PCA-KM correctly assigned 92, 89, and 81% of the lines, respectively. These findings were consistent with both the Silhouette Coefficient (SC) and Davies-Bouldin validation indexes. Notably, DeepAE-HC was also more accurate than the Bayesian clustering method implemented in InStruct. The results of this study showed that deep learning (DL)-based dimensionality reduction combined with ML clustering methods is a useful tool for determining genetically differentiated groups and assigning individuals to subpopulations in genome-wide studies without requiring prior genetic assumptions.
Keywords: deep learning; dimensionality reduction; genome-wide studies; machine learning; single-nucleotide polymorphisms
Year: 2020 PMID: 33329691 PMCID: PMC7732446 DOI: 10.3389/fgene.2020.543459
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
FIGURE 1. Overview of the deep autoencoder (DeepAE) system and population clustering analysis from a large single-nucleotide polymorphism (SNP) dataset. (A) Schematic representation of the entity system. Input data are composed of a matrix of n samples (rows) and k SNPs (columns). Through the autoencoder, categorical variables, i.e., the SNPs and samples, are represented as m-dimensional vectors. (B) Samples are projected into an m-dimensional sample entity vector space. The DeepAE learns sample features solely from the input matrix, such that similar samples cluster in close proximity. (C) The t-distributed stochastic neighbor embedding (t-SNE) representation of the sample entity matrix from the SNP dataset transformed by DeepAE. The 183 inbred lines are colored according to maize subpopulation (dent corn and popcorn). (D) Population clustering analysis through unsupervised clustering techniques.
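The clustering step in panel (D) can be illustrated with a minimal sketch in plain Python. It assumes the samples have already been reduced to low-dimensional vectors; the eight 2-D points in `sample_vectors` are hypothetical stand-ins for DeepAE or PCA output, not data from the study. The farthest-point seeding for K-means and the average linkage for hierarchical clustering are also assumptions, since the source does not state which initialization or linkage was used.

```python
import math

def farthest_point_init(points, k):
    """Deterministic seeding (assumed, not from the source): start at
    points[0], then repeatedly add the point farthest from all chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm: alternate nearest-center assignment and center update."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels

def agglomerative(points, k):
    """Bottom-up hierarchical clustering with average linkage (assumed linkage)."""
    clusters = [[i] for i in range(len(points))]

    def avg_link(a, b):
        return sum(math.dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        # Merge the pair of clusters with the smallest average pairwise distance.
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: avg_link(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[a] += clusters[b]
        del clusters[b]
    labels = [0] * len(points)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels

# Hypothetical reduced sample vectors: two well-separated groups of four.
sample_vectors = [
    (0.10, 0.20), (0.00, 0.30), (0.20, 0.10), (0.15, 0.25),
    (2.00, 2.10), (2.20, 1.90), (1.90, 2.00), (2.10, 2.20),
]

km_labels = kmeans(sample_vectors, k=2)
hc_labels = agglomerative(sample_vectors, k=2)
print("KM:", km_labels)
print("HC:", hc_labels)
```

On data this cleanly separated both methods recover the same two groups; on real SNP embeddings they can disagree, which is why the two are compared against known subpopulation labels.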
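Validation indexes for the optimal number of clusters (K) according to Silhouette coefficient (SC) and Davies–Bouldin index (DBI).
| K | SC | DBI | SC | DBI | SC | DBI | SC | DBI | SC | DBI | SC | DBI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 0.08 | 2.93 | 0.67 | 0.39 | 0.78 | 0.30 | 0.08 | 2.94 | 0.67 | 0.34 | 0.78 | 0.30 |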
| 3 | 0.05 | 3.59 | 0.61 | 0.55 | 0.74 | 0.39 | 0.07 | 2.69 | 0.65 | 0.38 | 0.73 | 0.38 |
| 4 | 0.05 | 3.58 | 0.56 | 0.55 | 0.57 | 0.59 | 0.08 | 2.52 | 0.59 | 0.49 | 0.59 | 0.55 |
| 5 | 0.04 | 3.53 | 0.56 | 0.66 | 0.51 | 0.62 | 0.08 | 2.72 | 0.52 | 0.50 | 0.56 | 0.55 |
| 6 | 0.05 | 3.79 | 0.48 | 0.69 | 0.51 | 0.61 | 0.05 | 2.95 | 0.42 | 0.54 | 0.48 | 0.58 |
| 7 | 0.05 | 3.71 | 0.47 | 0.69 | 0.49 | 0.62 | 0.05 | 2.92 | 0.46 | 0.69 | 0.47 | 0.63 |
| 8 | 0.06 | 3.37 | 0.37 | 0.81 | 0.47 | 0.67 | 0.06 | 3.35 | 0.38 | 0.64 | 0.46 | 0.63 |
| 9 | 0.06 | 3.49 | 0.39 | 0.82 | 0.44 | 0.68 | 0.06 | 3.15 | 0.37 | 0.63 | 0.44 | 0.71 |
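As a rough illustration of how the two validation indexes behave, the sketch below computes both with their standard definitions in plain Python. A higher SC and a lower DBI indicate more compact, better separated clusters. The points and the two labelings are hypothetical toy data, and Euclidean distance is an assumption; the source does not state the distance metric used.

```python
import math

def group_by_label(points, labels):
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    return clusters

def silhouette_coefficient(points, labels):
    """Mean of s(i) = (b - a) / max(a, b), where a is the mean distance to the
    other members of i's cluster and b is the smallest mean distance to any
    other cluster. Singleton clusters score 0 by convention."""
    clusters = group_by_label(points, labels)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # The distance from p to itself is 0, so divide by len(own) - 1.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items() if key != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

def davies_bouldin_index(points, labels):
    """Mean over clusters of the worst-case ratio
    (scatter_i + scatter_j) / distance(centroid_i, centroid_j)."""
    clusters = group_by_label(points, labels)
    centroids = {
        lab: tuple(sum(axis) / len(members) for axis in zip(*members))
        for lab, members in clusters.items()
    }
    scatter = {
        lab: sum(math.dist(p, centroids[lab]) for p in members) / len(members)
        for lab, members in clusters.items()
    }
    ratios = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters if j != i
        )
        for i in clusters
    ]
    return sum(ratios) / len(ratios)

# Hypothetical toy data: two well-separated groups of four points.
toy_points = [
    (0.10, 0.20), (0.00, 0.30), (0.20, 0.10), (0.15, 0.25),
    (2.00, 2.10), (2.20, 1.90), (1.90, 2.00), (2.10, 2.20),
]
good_labels = [0, 0, 0, 0, 1, 1, 1, 1]  # matches the true grouping
bad_labels = [0, 1, 0, 1, 0, 1, 0, 1]   # mixes the two groups

sc_good = silhouette_coefficient(toy_points, good_labels)
sc_bad = silhouette_coefficient(toy_points, bad_labels)
dbi_good = davies_bouldin_index(toy_points, good_labels)
dbi_bad = davies_bouldin_index(toy_points, bad_labels)
print(f"good labeling: SC={sc_good:.2f}, DBI={dbi_good:.2f}")
print(f"bad labeling:  SC={sc_bad:.2f}, DBI={dbi_bad:.2f}")
```

Running the same two functions for each candidate K and choosing the K with the highest SC and lowest DBI is one common way to select the number of clusters.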
Cross-tab analysis among subpopulations of maize (popcorn and dent corn) and clusters predicted by machine learning (ML) clustering methods in combination with dimensionality reduction techniques.
| Method | Cluster | Dent corn (n = 97) | Popcorn (n = 86) | Correct assignments |
|---|---|---|---|---|
| DeepAE-KM | Cluster 1 | 97 | 15 | 92% |
| | Cluster 2 | 0 | 71 | |
| DeepAE-HC | Cluster 1 | 97 | 8 | 96% |
| | Cluster 2 | 0 | 78 | |
| PCA-KM | Cluster 1 | 97 | 34 | 81% |
| | Cluster 2 | 0 | 52 | |
| PCA-HC | Cluster 1 | 97 | 20 | 89% |
| | Cluster 2 | 0 | 66 | |
| InStruct | Cluster 1 | 97 | 17 | 91% |
| | Cluster 2 | 0 | 69 | |
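The percentages in the cross-tab can be reproduced from the counts. The short sketch below assumes that Cluster 1 corresponds to dent corn and Cluster 2 to popcorn, so that correct assignments are the dent corn lines in Cluster 1 plus the popcorn lines in Cluster 2, divided by all 183 lines. The counts are copied from the table; the variable and function names are illustrative only.

```python
# Each entry: ((dent corn, popcorn) in Cluster 1, (dent corn, popcorn) in Cluster 2).
crosstab = {
    "DeepAE-KM": ((97, 15), (0, 71)),
    "DeepAE-HC": ((97, 8), (0, 78)),
    "PCA-KM": ((97, 34), (0, 52)),
    "PCA-HC": ((97, 20), (0, 66)),
    "InStruct": ((97, 17), (0, 69)),
}

def percent_correct(table):
    """Share of lines on the diagonal (dent corn in Cluster 1, popcorn in Cluster 2)."""
    (dent_c1, pop_c1), (dent_c2, pop_c2) = table
    total = dent_c1 + pop_c1 + dent_c2 + pop_c2
    return round(100 * (dent_c1 + pop_c2) / total)

accuracy = {method: percent_correct(table) for method, table in crosstab.items()}
for method, pct in accuracy.items():
    print(f"{method}: {pct}%")
```

For example, DeepAE-HC gives (97 + 78) / 183 ≈ 0.956, which rounds to the 96% reported in the table.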
FIGURE 2. Visualization of the genetic structure of maize inbred lines from single-nucleotide polymorphism (SNP) data, using t-distributed stochastic neighbor embedding (t-SNE) of the DeepAE representation (A) and principal component analysis (PCA) (B). Orange and blue dots represent inbred lines from the two maize subpopulations (dent corn and popcorn, respectively).