María Agustina Raschia, Pablo Javier Ríos, Daniel Omar Maizon, Daniel Demitrio, Mario Andrés Poli.
Abstract
Machine learning methods have been considered efficient in identifying single nucleotide polymorphisms (SNPs) underlying a trait of interest. This study aimed to construct predictive models using machine learning algorithms to identify loci that best explain the variance in milk traits of dairy cattle. Further objectives involved validating the results by comparison with previously reported relevant regions and retrieving the pathways overrepresented among the genes flanking relevant SNPs. Regression models using the XGBoost (XGB), LightGBM (LGB), and Random Forest (RF) algorithms were trained using estimated breeding values for milk production (EBVM), milk fat content (EBVF), and milk protein content (EBVP) as phenotypes and genotypes at 40417 SNPs as predictor variables. To evaluate their efficiency, metrics for actual vs. predicted values were determined in validation folds (XGB and LGB) and out-of-bag data (RF). Fewer than 4500 relevant SNPs were retrieved for each trait. Among the genes flanking them, signaling and transmembrane transporter activities were overrepresented. The models trained:
• Predicted breeding values for animals not included in the dataset.
• Were efficient in identifying a subset of SNPs explaining phenotypic variation.
The results obtained using the XGB and LGB algorithms agreed with previous results. Therefore, the proposed method could be applied in future association studies on milk traits.
Keywords: Dairy cattle; EBVF, estimated breeding values for milk fat content; EBVM, estimated breeding values for milk production; EBVP, estimated breeding values for milk protein content; Estimated breeding values; FDR, false discovery rate; GWAS, genome-wide association study; HxJ, Holstein x Jersey; LGB, LightGBM; LightGBM; MAE, mean absolute error; ML, machine learning; MSE, mean squared error; Milk fat content; Milk production; Milk protein content; RF, Random Forest; RMSE, root mean square error; Random forest; SNP, single nucleotide polymorphism; Single nucleotide polymorphisms; XGB, XGBoost; XGBoost
Year: 2022 PMID: 35637693 PMCID: PMC9144035 DOI: 10.1016/j.mex.2022.101733
Source DB: PubMed Journal: MethodsX ISSN: 2215-0161
Fig. 1. Method workflow. EBVM: estimated breeding values for milk production; EBVF: estimated breeding values for milk fat content; EBVP: estimated breeding values for milk protein content; XGB: XGBoost; LGB: LightGBM; RF: Random Forest; MAE: mean absolute error; RMSE: root mean square error.
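The MAE and RMSE used to compare actual and predicted breeding values can be illustrated with a minimal sketch. The function names and the toy values below are assumptions made for illustration only; they are not taken from the study, which does not state how the metrics were computed in code.

```python
import math


def mean_absolute_error(actual, predicted):
    """Mean of the absolute differences between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the mean of the squared differences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)


# Hypothetical toy values standing in for breeding values in one validation fold.
actual_ebv = [1.0, 2.0, 3.0, 4.0]
predicted_ebv = [1.5, 2.0, 2.0, 5.0]

mae = mean_absolute_error(actual_ebv, predicted_ebv)
rmse = root_mean_squared_error(actual_ebv, predicted_ebv)
print(f"MAE = {mae:.3f}, RMSE = {rmse:.3f}")
```

Because RMSE squares each error before averaging, it penalizes large individual errors more heavily than MAE, which is one reason the two are often reported together.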
Matching with previous results. The number and percentage of previously reported relevant and top windows for each trait that contain SNPs with positive gain in this study are indicated.
| Trait | Comparison | XGB | LGB | RF |
|---|---|---|---|---|
| EBVM | relevant windows | 40 (76.9%) | 46 (88.5%) | 40 (76.9%) |
| EBVM | top windows | 10 (100%) | 10 (100%) | 8 (80%) |
| EBVF | relevant windows | 33 (57.9%) | 33 (57.9%) | 3 (5.3%) |
| EBVF | top windows | 8 (80%) | 6 (60%) | 1 (10%) |
| EBVP | relevant windows | 44 (78.6%) | 47 (83.9%) | 27 (48.2%) |
| EBVP | top windows | 6 (60%) | 9 (90%) | 4 (40%) |
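The counts and percentages in a table of this kind could be obtained by checking, for each previously reported window, whether it contains at least one SNP with positive gain. The sketch below shows one way to do that. It assumes windows given as (chromosome, start, end) with inclusive coordinates and SNPs given as (chromosome, position, gain); the function name, these layouts, and the toy data are all hypothetical and do not come from the study.

```python
def count_matched_windows(windows, snp_gains):
    """Count windows that contain at least one SNP with positive gain.

    windows:   list of (chrom, start, end) tuples, coordinates inclusive (assumed).
    snp_gains: list of (chrom, position, gain) tuples (assumed layout).
    Returns (number of matched windows, percentage of all windows).
    """
    matched = 0
    for chrom, start, end in windows:
        if any(
            snp_chrom == chrom and start <= pos <= end and gain > 0
            for snp_chrom, pos, gain in snp_gains
        ):
            matched += 1
    pct = 100.0 * matched / len(windows) if windows else 0.0
    return matched, pct


# Hypothetical toy data, not taken from the study.
windows = [(1, 100, 200), (1, 300, 400), (2, 100, 200), (3, 500, 600)]
snps = [
    (1, 150, 0.8),  # inside the first window, positive gain
    (1, 350, 0.0),  # inside the second window, but gain is zero
    (2, 120, 0.3),  # inside the third window, positive gain
    (3, 700, 0.5),  # positive gain, but outside every window
]

matched, pct = count_matched_windows(windows, snps)
print(f"{matched} ({pct:.1f}%)")
```

The nested scan is quadratic in the worst case; for tens of thousands of SNPs, sorting positions per chromosome and using binary search would be a natural refinement.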
| Subject Area: | Bioinformatics |
| More specific subject area: | Machine learning applications in biology |
| Method name: | Construction of predictive models using machine learning algorithms for the identification of loci that best explain the variance in milk traits of dairy cattle. |
| Name and reference of original method: | B. Li, N. Zhang, Y.-G. Wang, A.W. George, A. Reverter, Y. Li, Genomic Prediction of Breeding Values Using a Subset of SNPs Identified by Three Machine Learning Methods, Front. Genet. 9 (2018) 237, doi:10.3389/fgene.2018.00237. |
| Resource availability: | N.A. |