M Humberto Reyes-Valdés1, Juan Burgueño2, Sukhwinder Singh2, Octavio Martínez3, Carolina Paola Sansaloni2.
Abstract
Germplasm banks are growing in importance, number of accessions and amount of characterization data, with a strong emphasis on molecular genetic markers. In this work, we offer an integrated view of accessions and marker data in an information theory framework. The basis of this development is the mutual information between accessions and allele frequencies for molecular marker loci, which can be decomposed into allele specificities, as well as into the rarity and divergence of accessions. Formulas are provided to calculate the specificity of each marker allele with reference to its distribution across accessions; accession rarity, defined as the weighted average of the specificity of its alleles; and divergence, defined by the Kullback-Leibler formula. Although these are different measures, average rarity and average divergence are shown to be equal for any collection. These parameters can contribute to knowledge of the structure of a germplasm collection and inform decisions about the preservation of rare variants. The concepts developed here served as the basis for a core subset selection strategy called HCore, implemented in a publicly available R script. As a proof of concept, the mathematical view and tools developed in this research were applied to a large collection of Mexican wheat accessions, widely characterized by SNP markers. The most specific alleles were found to be private to a single accession, and the distribution of this parameter had its highest frequencies at low levels of specificity. Accession rarity and divergence had largely symmetrical distributions and a positive, though not strictly linear, relationship. Comparison of the HCore approach for core subset selection with three state-of-the-art methods showed it to be superior for average divergence and rarity, mean genetic distance and diversity.
The proposed approach can be used for knowledge extraction and decision making in germplasm collections of diploid, inbred or outbred species.
Year: 2018 PMID: 29489873 PMCID: PMC5831390 DOI: 10.1371/journal.pone.0193346
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
An artificial data set with four accessions, A1 to A4, marked with three biallelic loci.
| Marker | Allele | A1 | A2 | A3 | A4 |
|---|---|---|---|---|---|
| 1 | 1 | 1.0 | 0.0 | 1.0 | 0.0 |
| 1 | 2 | 0.0 | 1.0 | 0.0 | 1.0 |
| 2 | 1 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 2 | 1.0 | 1.0 | 1.0 | 0.0 |
| 3 | 1 | 0.5 | 0.0 | 0.0 | 1.0 |
| 3 | 2 | 0.5 | 1.0 | 1.0 | 0.0 |
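
The specificity, rarity and divergence measures described in the abstract can be illustrated on the artificial data set above. The following Python sketch is not the authors' R implementation. It assumes that the specificity of an allele is the average over equally weighted accessions of (p / p_mean) * log2(p / p_mean), where p_mean is the mean frequency of the allele across accessions; that rarity is the frequency-weighted sum of specificities within an accession; that divergence is the Kullback-Leibler divergence of an accession from the collection average; and that per-locus values are averaged across loci. The array layout and all function and variable names are illustrative.

```python
import numpy as np

# Allele frequencies from the artificial data set,
# indexed as freq[marker, allele, accession] (accessions A1..A4).
freq = np.array([
    [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]],   # marker 1
    [[0.0, 0.0, 0.0, 1.0], [1.0, 1.0, 1.0, 0.0]],   # marker 2
    [[0.5, 0.0, 0.0, 1.0], [0.5, 1.0, 1.0, 0.0]],   # marker 3
])


def p_log2_ratio(p, q):
    """Return p * log2(p / q) elementwise, taking 0 * log2(0) as 0."""
    out = np.zeros_like(p, dtype=float)
    mask = p > 0
    out[mask] = p[mask] * np.log2(p[mask] / q[mask])
    return out


n_accessions = freq.shape[2]

# Mean frequency of each allele across accessions (assumed equal weights).
mean_freq = freq.mean(axis=2, keepdims=True)
terms = p_log2_ratio(freq, np.broadcast_to(mean_freq, freq.shape))

# Assumed specificity: (1/t) * sum_j (p_ij / p_mean_i) * log2(p_ij / p_mean_i).
# Every allele here has a nonzero mean frequency, so the division is safe.
specificity = terms.sum(axis=2) / (n_accessions * mean_freq[:, :, 0])

# Kullback-Leibler divergence of each accession from the collection
# average, computed per locus and then averaged over loci (assumption).
divergence = terms.sum(axis=1).mean(axis=0)

# Rarity: frequency-weighted specificity of the alleles an accession
# carries, per locus, then averaged over loci (assumption).
rarity = (freq * specificity[:, :, None]).sum(axis=1).mean(axis=0)

print("specificity (marker x allele):")
print(np.round(specificity, 3))
print("rarity     A1..A4:", np.round(rarity, 3))
print("divergence A1..A4:", np.round(divergence, 3))
print("mean rarity:", rarity.mean(), "mean divergence:", divergence.mean())
```

Under these assumptions, allele 1 of marker 2, which is carried only by A4, reaches the maximum specificity of log2(4) = 2 bits, A4 is both the rarest and the most divergent accession, and the mean rarity equals the mean divergence, in line with the equality stated in the abstract.
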
Fig 1. Histograms for the different parameters estimated in the landrace wheat collection.
A: Distribution of specificity in 41,052 alleles. B: Distribution of rarity in 7,986 accessions. C: Distribution of divergence in 7,986 accessions.
Fig 2. Scatter plot of rarity against diversity in 7,986 wheat accessions.
The six wheat accessions with maximum rarity, along with their scores of Kullback-Leibler divergence based on the information of 4,126 SNP alleles, and geographic locations in Mexico.
| Accession | Rarity | Divergence | Region | State | Locality | Longitude | Latitude | Elevation (m) |
|---|---|---|---|---|---|---|---|---|
| SEEDDIV16096 | 1.161 | 1.201 | North | Chihuahua | Jicamorachi | -108.308 | 27.916 | 1746 |
| SEEDDIV15322 | 1.142 | 1.177 | Central | México | Río Frío | -98.833 | 19.317 | 2233 |
| SEEDDIV7458 | 1.037 | 1.000 | Central | Michoacán | La Zarzamora | -101.500 | 19.183 | 2200 |
| SEEDDIV5394 | 1.030 | 1.047 | Central | Michoacán | La Zarzamora | -101.500 | 19.183 | 2200 |
| SEEDDIV7464 | 1.025 | 1.030 | Central | Michoacán | La Zarzamora | -101.500 | 19.183 | 2200 |
| SEEDDIV12776 | 1.023 | 1.026 | Central | Michoacán | Toquara | -101.500 | 19.183 | 2000 |
Fig 3. Heat map for 30 common and six rare accessions, genotyped with a sample of 500 alleles.
The dendrogram on the X axis represents the 500 alleles, whereas the one on the Y axis represents the 36 accessions. Allele absence is shown in turquoise, allele presence in violet, and missing data in white.
Comparison between HCore and three other approaches for an approximate 10% core subset selection, according to several criteria.
| Method | Divergence | MR | SH | AR | LA |
|---|---|---|---|---|---|
| HCore | 0.442 | 0.438 | 7.954 | 0.985 | 59 |
| MixRep | 0.435 | 0.435 | 7.950 | 0.986 | 53 |
| REMC | 0.408 | 0.419 | 7.932 | 0.975 | 98 |
| MSTRAT | 0.406 | 0.417 | 7.931 | 0.983 | 67 |
| Random | 0.402 | 0.416 | 7.929 | 0.976 | 95 |
Divergence is the average Kullback-Leibler divergence, MR is the average modified Rogers' distance, SH is the Shannon diversity, AR is the allele richness, expressed as a proportion of the alleles present in the whole collection, and LA (lost alleles) is the number of alleles of the whole collection that are not present in the core subset. The last row presents the values of the criteria in a random sample.
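
Three of the criteria defined above (LA, AR and MR) can be sketched in Python on the four-accession artificial data set. This is an illustrative sketch rather than the evaluation code used in the study. It assumes that an allele counts as present when its frequency is above zero, that AR is one minus the lost fraction of alleles, and that the modified Rogers' distance between two accessions is the square root of the summed squared frequency differences divided by twice the number of loci. The candidate core subsets and all function names are hypothetical.

```python
from itertools import combinations

import numpy as np

# Artificial data set, indexed as freq[marker, allele, accession] (A1..A4).
freq = np.array([
    [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]],
    [[0.0, 0.0, 0.0, 1.0], [1.0, 1.0, 1.0, 0.0]],
    [[0.5, 0.0, 0.0, 1.0], [0.5, 1.0, 1.0, 0.0]],
])


def lost_alleles(freq, core):
    """Count alleles present in the whole collection but absent from the core."""
    present_all = freq.sum(axis=2) > 0
    present_core = freq[:, :, core].sum(axis=2) > 0
    return int(np.sum(present_all & ~present_core))


def allele_richness(freq, core):
    """Proportion of the collection's alleles retained in the core (assumed AR)."""
    total = int(np.sum(freq.sum(axis=2) > 0))
    return 1.0 - lost_alleles(freq, core) / total


def mean_modified_rogers(freq, core):
    """Average pairwise modified Rogers' distance among core accessions."""
    n_loci = freq.shape[0]
    dists = [
        np.sqrt(np.sum((freq[:, :, x] - freq[:, :, y]) ** 2) / (2 * n_loci))
        for x, y in combinations(core, 2)
    ]
    return float(np.mean(dists))


# Two hypothetical two-accession cores: {A1, A4} and {A2, A3}.
for core in ([0, 3], [1, 2]):
    print(core,
          "LA =", lost_alleles(freq, core),
          "AR =", round(allele_richness(freq, core), 3),
          "MR =", round(mean_modified_rogers(freq, core), 3))
```

In this toy case the core {A1, A4} keeps all six alleles, whereas {A2, A3} loses the two alleles carried only by A1 and A4, which is the kind of contrast the LA and AR columns summarize.
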
Comparison between HCore and three other approaches for a 20% core subset selection, according to several criteria.
| Method | Divergence | MR | SH | AR | LA |
|---|---|---|---|---|---|
| HCore | 0.434 | 0.434 | 7.948 | 0.988 | 46 |
| MixRep | 0.428 | 0.431 | 7.945 | 0.990 | 40 |
| REMC | 0.402 | 0.416 | 7.929 | 0.986 | 53 |
| MSTRAT | 0.402 | 0.416 | 7.929 | 0.987 | 51 |
| Random | 0.404 | 0.417 | 7.930 | 0.985 | 60 |