
Genome-wide identification of major genes and genomic prediction using high-density and text-mined gene-based SNP panels in Hanwoo (Korean cattle).

Hyo Jun Lee¹, Yoon Ji Chung¹, Sungbong Jang², Dong Won Seo¹, Hak Kyo Lee³, Duhak Yoon⁴, Dajeong Lim⁵, Seung Hwan Lee¹.

Abstract

It was hypothesized that single-nucleotide polymorphisms (SNPs) extracted from text-mined genes would be more tightly related to the causal variants for each trait, and that weighting this SNP panel differentially in the GBLUP model could improve the performance of genomic prediction in cattle. Fitting two genomic relationship matrices (GRMs) as separate random effects, one constructed from the text-mined SNPs and one from the remaining SNPs of the 777K SNP set (exp_777K), gave better accuracy than fitting a single GRM built from all imputed 777K SNPs (Im_777K) for six traits (backfat thickness: +0.002; eye muscle area: +0.014; Warner-Bratzler Shear Force of semimembranosus and longissimus dorsi: +0.024 and +0.068; intramuscular fat content of semimembranosus and longissimus dorsi: +0.008 and +0.018). These results suggest that incorporating text mining into genomic prediction is worthwhile, and that further studies using text mining may yield significant results.


Year: 2020 PMID: 33264312 PMCID: PMC7710051 DOI: 10.1371/journal.pone.0241848

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

Genomic prediction, which is the first step in genomic selection, is a method for calculating genomic estimated breeding values (GEBVs) using large numbers of genetic markers, such as single-nucleotide polymorphism (SNP), covering the whole genome [1]. The genomic prediction methods that are currently applied to livestock populations use the extent of linkage disequilibrium between markers and quantitative trait loci (QTL) because high-density SNPs increase the chances of co-segregation of markers with causal mutations [2]. Genetic variation in quantitative traits could be influenced by large numbers of loci affecting any given trait with small to moderate effects. In some cases, however, there are loci with moderate to large effects due to relatively recently selected mutations [3-5]. It is difficult to capture recently selected causal mutations in genomic prediction because the linkage disequilibrium between these mutations and other markers is incomplete [6]. Therefore, it is necessary to understand the genetic processes and information related to quantitative or complex traits more fully, as well as linkage disequilibrium between causal variants and common SNPs, to increase the ability of genomic prediction models. Genomic best linear unbiased prediction (GBLUP) is a commonly used method that has been widely utilized for genomic prediction. The main assumption of the GBLUP method is that most SNPs have small effects with a normal distribution, regardless of prior biological information on the genetic architecture of the traits [7]. However, the effects of SNPs associated with quantitative traits are not always normally distributed and the effects may differ depending on the biological processes of the traits. For these reasons, it might be necessary to incorporate previous biological knowledge into the GBLUP method for more accurate genomic prediction. 
In previous studies, when selected SNP panels based on biological information were weighted differentially in the GBLUP method, higher prediction accuracy was obtained compared with the normal GBLUP [8, 9]. In addition, using causal genes or markers with prior biological knowledge resulted in much more accurate QTL discovery [10]. As mentioned in the paragraph above, it is necessary to understand the biological characteristics of complex traits from previous studies for more accurate genomic prediction. However, manually scanning previous studies to analyze biological information requires a lot of time and effort because there are many published studies in the field of animal science, and the number is expanding at an increasing rate. As of 2018, approximately 29 million papers were cited in PubMed, one of the most commonly used life science databases (https://www.nlm.nih.gov/bsd/licensee/baselinestats.html). In addition, the majority of published papers are composed of unstructured text, which is difficult to use for other studies. Therefore, it is important to use techniques to extract useful information from the textual data without spending a lot of time. Text mining is one technique for resolving this problem [11]. In the biomedical field, text mining has been used to assist studies in gene–disease associations and gene–gene associations, and to analyze clinical datasets to improve quality of health care [12-14]. In addition, text mining has been widely applied in various fields other than biomedicine, such as business and marketing [15]. However, in the field of animal breeding, studies using text mining are still rare. The application of text mining to genomic prediction could be an interesting approach to animal breeding studies. In this study, text mining was used to identify genes associated with carcass and meat quality traits, and these text-mined genes with biological information were used for genomic prediction. 
The hypothesis of this study was that SNPs extracted from text-mined genes could be in tighter linkage disequilibrium with causal variants for carcass and meat quality traits, and weighting this SNP panel differentially in the GBLUP model could improve the performance of genomic prediction in cattle.

Materials and methods

Dataset

Hanwoo (Korean cattle) populations

The Animal Care and Use Committee of the National Institute of Animal Science (NIAS), Rural Development Administration (RDA), South Korea, approved the experimental procedures, and appropriate animal health and welfare guidelines were followed. The Hanwoo were sourced from two commercial populations that differed in the phenotypes measured. The first population included 12,635 individuals (born between 2013 and 2016, with samples collected between 2017 and 2019) evaluated for the carcass traits carcass weight (CWT), eye muscle area (EMA), and backfat thickness (BF). The second population consisted of 1,039 steers evaluated for the meat quality traits Warner–Bratzler Shear Force (WBSF) and intramuscular fat content (IMF). Both populations consisted of half-sibs, derived from 339 sires in the first population and 82 sires in the second, with unrelated dams. The animals of the two populations (n = 12,635 and n = 1,039) were slaughtered at average ages of 918 and 920 days, respectively.

Three carcass traits were recorded (n = 12,635): CWT (kg), BF (mm), and EMA (cm²) were measured after a 24-hour chill at the junction of the 12th and 13th ribs. The meat quality traits (n = 1,039) comprised two traits measured in each of two muscles. The WBSF values of the longissimus dorsi muscle (D_SF) and semimembranosus muscle (S_SF) were measured according to the method described by Wheeler et al. (2000) [16]. Briefly, beef steaks 2.5 cm thick were kept in polyethylene bags for 48 hours postmortem. All of the bags were heated in a water bath at 80°C for 30 minutes, until the internal temperature of the steaks reached 70°C. The samples were stored at room temperature for 30 minutes prior to measurement. An Instron Universal WBSF testing machine (Instron Corporation, Canton, MA) with a crosshead speed of 200 mm/min and a 50-kg load cell was used to measure the WBSF.
Each sample was divided into six representative cores with a diameter of 1.27 cm and parallel to the muscle fibers. The final phenotype of the WBSF was the mean of the maximum force required to shear each core sample. The intramuscular fat contents of the longissimus dorsi muscle (D_IMF) and semimembranosus muscle (S_IMF) were measured using the microwave solvent extraction method described by AOAC International [17].

Genotyping and quality control

The genomic DNA of each animal group was extracted from longissimus thoracis muscle samples using a DNeasy Blood and Tissue Kit (Qiagen, Valencia, CA). DNA concentration and purity were determined using a NanoDrop 1000 (Thermo Fisher Scientific, Wilmington, DE). A total of 13,674 samples were genotyped using the Illumina Bovine SNP50 BeadChip, and 1,295 samples were additionally genotyped using the Illumina Bovine HD BeadChip to serve as the reference population in the imputation step. The 50K genotypes of all animals were imputed to high density (777K) using Minimac3 [18]. SNPs with an imputation r² < 0.6 were excluded in the imputation step, and SNPs on the sex chromosomes were excluded from the analysis. SNP quality control for each group was performed using PLINK 1.9 [19], with SNPs removed according to the following criteria: minor allele frequency < 0.001 for the carcass trait group and < 0.01 for the meat quality trait group; gene call rate < 0.1. In the carcass trait group, 23,415 SNPs were excluded by this step, leaving 670,080 SNPs. In the meat quality trait group, 56,477 SNPs were excluded, and 637,017 SNPs were used for the analysis. The imputed 777K SNPs of each group were annotated using the SnpEff program [20].

Text mining and gene ontology term analysis

Published papers related to CWT, WBSF, IMF, BF, and EMA were searched before text mining. The workflow of the text mining is shown in S1 Fig. First, the abstract text was collected from all papers whose abstracts or titles matched the trait-related queries. This step was performed using functions in the RISmed package of the R statistical programming language [21]. Only words consisting solely of capital letters or numbers were then retained, to filter out ordinary words that coincidentally match gene symbols (e.g., impact, pigs). Finally, only words matching the bovine gene symbols in the BioMart databases were selected for analysis. The gene symbols were obtained with the Bioconductor package biomaRt, using btaurus_gene_ensembl as the dataset [22]. SNPs contained in text-mined genes (TMG) were then extracted from the imputed 777K SNPs. Furthermore, SNPs from the intergenic regions of TMG were also extracted, because intergenic regions often contain functionally important elements such as promoters and enhancers. These two types of SNPs together were used as the text-mined SNPs. Three marker sets were used in genomic prediction: the imputed 777K SNPs (Im_777K), the imputed 777K SNPs excluding the text-mined SNPs (exp_777K), and the text-mined SNPs. The Bioconductor R package ‘clusterProfiler’ was used for Gene Ontology (GO) analysis to identify the biological processes of the TMG [23]. The −log10 of the P-value adjusted by the Bonferroni method (P.adj) was used to assess significance in the GO analysis. To visualize the differences between QTL regions obtained from Animal QTL DB [24] and text-mined regions, karyotypes were plotted using the Circos program [25].
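The word-filtering step can be illustrated with a minimal Python sketch (the study itself used R with RISmed and biomaRt). The abstracts, the toy symbol set, and the function name `mine_gene_symbols` are hypothetical, and counting every occurrence rather than once per abstract is an assumption; the source does not state which was used.

```python
import re
from collections import Counter

# Hypothetical inputs for illustration only: in the study, abstracts came from
# PubMed and gene symbols from the btaurus_gene_ensembl dataset.
abstracts = [
    "Polymorphisms in CAST and CAPN1 have a strong impact on meat tenderness.",
    "CAPN1 genotypes were compared with results reported for pigs.",
]
gene_symbols = {"CAST", "CAPN1", "IMPACT", "PIGS"}  # toy symbol set

# A word made up only of capital letters and digits.
UPPER_OR_DIGIT = re.compile(r"[A-Z0-9]+")


def mine_gene_symbols(texts, symbols):
    """Count words that are all capitals/digits and match a known gene symbol."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[A-Za-z0-9]+", text):
            # Lower-case words such as "impact" or "pigs" fail this check,
            # so they are never compared against the symbol list.
            if UPPER_OR_DIGIT.fullmatch(word) and word in symbols:
                counts[word] += 1
    return counts


gene_counts = mine_gene_symbols(abstracts, gene_symbols)
for symbol, freq in gene_counts.most_common():
    print(symbol, freq)
```

A case-insensitive match would have counted "impact" and "pigs" as the symbols IMPACT and PIGS, which is the kind of accidental hit the capital-letter filter is meant to avoid.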

Statistical analyses

Genome-wide association study (GWAS) using text-mined gene-based SNP panels

The phenotypic data on carcass and meat quality traits were pre-adjusted for fixed effects, including growing site, birth year, season, and slaughter age, using a linear model implemented in R 3.3.1 (R Foundation for Statistical Computing, Vienna, Austria). The adjusted phenotypes and the text-mined SNP panel were subsequently used for GWAS under a linear mixed model, which can be written as:

$$\mathbf{y} = \mathbf{1}\mu + \mathbf{D}\beta + \mathbf{g} + \mathbf{e}$$

where $\mathbf{y}$ is a vector of the corrected phenotypes for N individuals; $\mu$ is the overall mean and $\mathbf{1}$ is a vector of N ones; $\mathbf{D}$ is a vector of genotypes of the candidate SNP, coded as 0, 1, or 2; $\beta$ is the additive effect of the candidate SNP; $\mathbf{g}$ is a vector of random polygenic effects with covariance given by the genetic relationship matrix (GRM) constructed from Im_777K; and $\mathbf{e}$ is a vector of residuals. This model was fitted using GCTA 1.26 [26]. The GRM for the polygenic effect ($\mathbf{g}$) was constructed using all SNPs except those on the chromosome where the candidate SNP was located. The P-values were corrected for multiple testing using the Bonferroni method: the threshold for declaring a SNP significantly associated with a trait was 0.05 divided by the number of text-mined SNPs.
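As a worked example of the Bonferroni threshold, the sketch below divides 0.05 by the number of text-mined SNPs per trait. It assumes that the SNP counts listed in Table 1 are the numbers of tests actually performed for each trait, which the text does not state explicitly; the function name is illustrative.

```python
import math

# Number of text-mined SNPs per trait, taken from Table 1.
# Assumption: these counts equal the number of SNPs tested in the GWAS.
text_mined_snps = {
    "CWT": 17662,
    "WBSF": 6143,
    "IMF": 30983,
    "BF": 9335,
    "EMA": 12371,
}


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-SNP significance threshold: alpha divided by the number of tests."""
    return alpha / n_tests


thresholds = {trait: bonferroni_threshold(n) for trait, n in text_mined_snps.items()}

for trait, p in thresholds.items():
    # -log10(p) is the height of the Bonferroni line on a Manhattan plot.
    print(f"{trait}: P < {p:.2e} (-log10 P > {-math.log10(p):.2f})")
```

Traits with more text-mined SNPs (for example IMF) get a stricter per-SNP threshold than traits with fewer (for example WBSF).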

Genomic models for estimation and prediction

Three genomic models (models 1 to 3) were used to estimate genetic and residual variances and to predict genomic estimated breeding values (GEBVs). The GRMs constructed from Im_777K and exp_777K were used in models 1 and 2, respectively. The equations can be written as:

$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{g} + \mathbf{e} \quad \text{(model 1)}$$

$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{g}^{-} + \mathbf{e} \quad \text{(model 2)}$$

where $\mathbf{y}$ is the vector of observed phenotypes for N individuals, $\mathbf{X}$ is an incidence matrix for the fixed effects, and $\mathbf{b}$ is the vector of fixed effects, which included growing site, birth month, birth year, slaughter month, slaughter year, and slaughter age as covariates for all traits. In addition, the carcass traits included slaughter place and sex, while the meat quality traits included farm information (the name of the owner of the steers). In the two equations, $\mathbf{g}$ is the N vector of additive genetic effects from the GRM built with Im_777K, and $\mathbf{g}^{-}$ is the N vector of additive genetic effects from the GRM built with exp_777K. The genetic and residual effects were assumed to be normally distributed with mean zero. The variances estimated by the above two models are given by:

$$\operatorname{var}(\mathbf{y}) = \mathbf{G}\sigma^2_{g} + \mathbf{I}\sigma^2_{e} \quad \text{(model 1)}$$

$$\operatorname{var}(\mathbf{y}) = \mathbf{G}^{-}\sigma^2_{g^{-}} + \mathbf{I}\sigma^2_{e} \quad \text{(model 2)}$$

where $\mathbf{G}$ and $\mathbf{G}^{-}$ are the GRMs built with Im_777K and exp_777K, respectively, and $\mathbf{I}$ is an N × N identity matrix. In model 3, two GRMs, constructed from exp_777K and from the text-mined SNPs, were fitted jointly so that the two random effects were weighted differentially. The model can be written as:

$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{g}^{-} + \mathbf{g}_{t} + \mathbf{e} \quad \text{(model 3)}$$

where $\mathbf{y}$ is the vector of phenotypic observations and $\mathbf{g}_{t}$ is the N vector of additive genetic effects from the GRM built with the text-mined SNPs. The genetic and residual effects were again assumed to be normally distributed with mean zero. The variances estimated by model 3 are given by:

$$\operatorname{var}(\mathbf{y}) = \mathbf{G}^{-}\sigma^2_{g^{-}} + \mathbf{G}_{t}\sigma^2_{g_{t}} + \mathbf{I}\sigma^2_{e}$$

where $\mathbf{G}_{t}$ is the GRM built with the text-mined SNPs.

Variance component estimation and GBLUP

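The variance components ($\sigma^2_{g}$ and $\sigma^2_{e}$) and heritability were estimated by average information restricted maximum likelihood (AIREML) using the AIREMLF90 program of the BLUPF90 family [27]. The proportion of genomic variance explained by each model can be written as:

$$h^2 = \frac{\sigma^2_{g}}{\sigma^2_{g} + \sigma^2_{e}}$$

where, for model 3, $\sigma^2_{g}$ is the sum of the two genetic variance components ($\sigma^2_{g^{-}} + \sigma^2_{g_{t}}$).

GEBVs were predicted using the GBLUP method, and a 10-fold cross-validation scheme was used to evaluate their accuracy. Samples were divided into 10 groups of equal size. In each round of cross-validation, nine of these groups were used as the reference set and the remaining group was used as the validation set. The GEBVs for models 1 and 2 were calculated using the following mixed model equations:

$$\begin{bmatrix} \mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{Z} \\ \mathbf{Z}'\mathbf{X} & \mathbf{Z}'\mathbf{Z} + \mathbf{G}^{-1}\lambda \end{bmatrix} \begin{bmatrix} \hat{\mathbf{b}} \\ \hat{\mathbf{g}} \end{bmatrix} = \begin{bmatrix} \mathbf{X}'\mathbf{y} \\ \mathbf{Z}'\mathbf{y} \end{bmatrix}$$

where $\hat{\mathbf{g}}$ is the vector of GEBVs, with $\mathbf{g} \sim N(\mathbf{0}, \mathbf{G}\sigma^2_{g})$; $\mathbf{G}$ is the genomic relationship matrix for the individuals; $\mathbf{Z}$ is a design matrix with one column for each GEBV and one row for each phenotype (an individual without a phenotype has a column of zeros); and $\lambda$ is the shrinkage value, calculated as $\sigma^2_{e}/\sigma^2_{g}$. The GEBVs for model 3 were calculated using a linear mixed model with two random effects:

$$\begin{bmatrix} \mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{Z} & \mathbf{X}'\mathbf{Z} \\ \mathbf{Z}'\mathbf{X} & \mathbf{Z}'\mathbf{Z} + (\mathbf{G}^{-})^{-1}\lambda_{1} & \mathbf{Z}'\mathbf{Z} \\ \mathbf{Z}'\mathbf{X} & \mathbf{Z}'\mathbf{Z} & \mathbf{Z}'\mathbf{Z} + \mathbf{G}_{t}^{-1}\lambda_{2} \end{bmatrix} \begin{bmatrix} \hat{\mathbf{b}} \\ \hat{\mathbf{g}}^{-} \\ \hat{\mathbf{g}}_{t} \end{bmatrix} = \begin{bmatrix} \mathbf{X}'\mathbf{y} \\ \mathbf{Z}'\mathbf{y} \\ \mathbf{Z}'\mathbf{y} \end{bmatrix}$$

where $\hat{\mathbf{g}}^{-}$ and $\hat{\mathbf{g}}_{t}$ are the vectors of GEBVs calculated from exp_777K and the text-mined SNPs; $\mathbf{G}^{-}$ and $\mathbf{G}_{t}$ are the GRMs built with exp_777K and the text-mined SNPs; and $\lambda_{1} = \sigma^2_{e}/\sigma^2_{g^{-}}$ and $\lambda_{2} = \sigma^2_{e}/\sigma^2_{g_{t}}$ are the corresponding shrinkage values. The final GEBV of model 3 is the sum of the two GEBVs ($\hat{\mathbf{g}} = \hat{\mathbf{g}}^{-} + \hat{\mathbf{g}}_{t}$). The GRM ($\mathbf{G}$) is defined as:

$$\mathbf{G} = \frac{\mathbf{M}\mathbf{M}'}{2\sum_{j} p_{j}(1 - p_{j})}$$

where $\mathbf{M}$ contains genotypes adjusted by allele frequency and $p_{j}$ is the allele frequency for marker j [28]. All of these estimates were obtained using BLUPF90 [27]. The accuracy of the predicted breeding values was calculated as the Pearson correlation between the GEBVs and the adjusted phenotypes ($\mathbf{y}$) of the validation set:

$$r = \frac{\operatorname{cov}(\hat{\mathbf{g}}, \mathbf{y})}{\sigma_{\hat{g}}\,\sigma_{y}}$$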
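The building blocks of this procedure can be sketched in plain Python: the GRM, the 10-fold split, and the Pearson correlation used as the accuracy. This is an illustrative sketch, not the BLUPF90 implementation. It assumes the GRM takes the VanRaden-type form G = MM'/(2 Σ p_j(1 − p_j)), with M holding 0/1/2 genotypes centred by twice the allele frequency and allele frequencies estimated from the genotyped sample itself. The function names and toy genotypes are hypothetical, and solving the mixed model equations is not shown.

```python
import math
import random


def vanraden_grm(genotypes):
    """Return G = M M' / (2 * sum_j p_j * (1 - p_j)) for 0/1/2 genotypes.

    Rows are animals and columns are markers. Allele frequencies are
    estimated from the supplied genotypes (an assumption of this sketch).
    """
    n_animals = len(genotypes)
    n_markers = len(genotypes[0])
    p = [sum(row[j] for row in genotypes) / (2 * n_animals) for j in range(n_markers)]
    # M: genotypes centred by twice the allele frequency of each marker.
    M = [[row[j] - 2 * p[j] for j in range(n_markers)] for row in genotypes]
    scale = 2 * sum(pj * (1 - pj) for pj in p)
    return [
        [sum(a * b for a, b in zip(M[i], M[k])) / scale for k in range(n_animals)]
        for i in range(n_animals)
    ]


def k_fold_indices(n_samples, k=10, seed=1):
    """Shuffle sample indices and deal them into k folds of near-equal size."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    return [order[fold::k] for fold in range(k)]


def pearson(x, y):
    """Pearson correlation, used here as the prediction accuracy."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy data (hypothetical): 4 animals x 3 markers.
toy_genotypes = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [1, 2, 1]]
G = vanraden_grm(toy_genotypes)
folds = k_fold_indices(20)
```

In each cross-validation round, one element of `folds` would serve as the validation set and the other nine as the reference set; the accuracy for that round would be `pearson(gebv_validation, adjusted_phenotype_validation)`, where both arguments are placeholders for values produced by the fitted model.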

Results

Text mining and gene ontology term analysis

The queries used to search the papers and a statistical summary of the text mining are shown in Table 1. CWT had the largest number of retrieved articles (1,893), followed by IMF (1,854), WBSF (1,097), BF (602), and EMA (546). IMF yielded the largest number of mined genes (576), even though the number of articles retrieved was similar to that for CWT; it was followed by CWT (288), BF (195), EMA (167), and WBSF (156). The 30 genes that appeared most frequently in text mining for each trait are shown in Table 2. The most frequently matched bovine gene symbols were IGF1 for CWT (36 times), CAST for WBSF (110 times), SCD for IMF (105 times), MC4R for BF (35 times), and MSTN for EMA (19 times).

Table 1

Summary statistics of text mining and SNP calling.

Trait	Article	Gene	SNP	Used query
CWT	1,893	288	17,662	carcass weight[TIAB] OR dressed weight[TIAB]
WBSF	1,097	156	6,143	Warner-Bratzler Shear Force [TIAB] OR cuttability [TIAB] OR meat tenderness [TIAB]
IMF	1,854	576	30,983	intramuscular fat [TIAB]
BF	602	195	9,335	back fat [TIAB]
EMA	546	167	12,371	eye muscle area [TIAB] OR ribeye [TIAB] OR rib eye [TIAB]

Table 2

The 30 gene symbols that appeared with the highest frequency in text mining for each trait.

Trait	Symbol	Freq	Trait	Symbol	Freq	Trait	Symbol	Freq
CWT	IGF1	36	BF	MC4R	35	EMA	MSTN	19
	MSTN	28		SST	26		CAPN1	18
	MC4R	25		IGF1	24		ADIPOQ	15
	LPL	24		FTO	18		LEPR	15
	TNF	24		GAA	16		PPARGC1A	15
	BLM	20		SLA	15		DES	13
	CAPN1	19		FASN	14		POMC	13
	IGFBP2	19		IGF2	14		GHR	12
	MGA	19		BSG	11		LEP	12
	NCAPG	19		MGA	11		PIK3C3	11
	POMC	18		RBP4	10		SLA	11
	IGF2	17		UCP2	10		CAST	9
	GHR	16		SPR	9		GH1	9
	AFP	15		CSTB	8		IGF2	9
	CRH	13		FABP3	8		LRIT3	9
	DGAT1	13		IGFBP3	8		LCORL	8
	FASN	13		LSR	8		MC4R	8
	GAA	13		MAP2K6	8		RPE	8
	LCORL	13		MTTP	8		ANGPTL3	7
	TRH	13		SCD	8		CRH	7
	CAPN3	11		STAT6	8		FABP4	7
	CAST	11		TNF	8		GRP	7
	SCD	11		CTSL	7		MAP2K6	7
	ABHD5	10		EZH2	7		AGAP3	6
	ASL	10		IRS4	7		BPI	6
	GNAS	10		MARK4	7		ERG	6
	IGFBP3	10		QSOX1	7		IGF1	6
	IGFBP4	10		SLC13A5	7		ADRB3	5
	IRS1	10		TGFBR1	7		EMD	5
	STAT6	10		UCP3	7		ME1	5
WBSF	CAST	110	IMF	SCD	105
	CAPN1	104		LPL	80
	CAPN3	19		FABP4	70
	KCNJ11	18		FAS	54
	NES	17		FABP3	52
	DNAJA1	16		FASN	52
	MSTN	14		LEPR	47
	ADAMTS4	11		PPARG	38
	DGAT1	11		DGAT1	36
	HSPB1	9		MC4R	36
	SCD	8		AFP	27
	TNNT3	8		MSC	26
	UCP3	8		CAST	25
	ANGPTL3	7		PRKAG3	23
	IGFBP2	7		FTO	22
	ADAMTS5	6		SREBF1	22
	CAPN2	6		CAPN1	20
	DLK1	6		MAT2B	19
	MYOD1	6		PLIN2	17
	PRKAG3	6		RYR1	16
	STAT6	6		KLF6	15
	UCP2	6		ACACA	14
	LEP	5		ADH1C	14
	MMP2	5		GPAM	14
	APP	4		IGF2	14
	FABP4	4		PDHB	14
	GEN1	4		PPARA	14
	IGF2	4		ASIP	13
	LOX	4		MSTN	13
	MAP3K5	4		VRTN	13

CWT: carcass weight; WBSF: Warner-Bratzler Shear Force; IMF: intramuscular fat content; BF: backfat thickness; EMA: eye muscle area; Article: number of articles retrieved from PubMed; Gene: number of genes mined from the retrieved articles; SNP: number of SNPs called from the imputed 777K markers; Used query: query used to search for articles in PubMed.

In the Gene Ontology (GO) term analysis (Table 3), the TMG related to CWT, BF, and EMA were significantly associated with growth regulators and growth factors (“response to hormone”, “regulation of signaling receptor activity”, “response to endogenous stimulus”, and “response to peptide”). The WBSF-related TMG were associated with organic acids (“carboxylic acid metabolic process”, “oxoacid metabolic process”, “monocarboxylic acid biosynthetic process”, “organic acid metabolic process”, and “monocarboxylic acid metabolic process”). For IMF, biological process terms related to lipid synthesis and lipid metabolism were statistically significant (“regulation of lipid metabolic process”, “lipid metabolic process”, “fatty acid metabolic process”, and “regulation of lipid biosynthetic process”). The karyotypes of the QTL regions registered in Animal QTLdb, the text-mined regions, and the intersection of the two are shown in Fig 1. The highest percentage of intersecting regions within the text-mined regions was found for the CWT-related TMG (36.3%), and the lowest for the IMF regions (5.5%).

Table 3

The top five significant biological processes for each trait.

Trait	GO_ID	Biological process	GeneRatio	−log₁₀P.adj
CWT	GO:0009725	response to hormone	19.8%	9.5
	GO:0010469	regulation of signaling receptor activity	21.4%	8.2
	GO:0009719	response to endogenous stimulus	24.6%	7.5
	GO:0043066	negative regulation of apoptotic process	19.0%	6.9
	GO:0043069	negative regulation of programmed cell death	19.0%	6.7
WBSF	GO:0019752	carboxylic acid metabolic process	21.7%	2.6
	GO:0043436	oxoacid metabolic process	21.7%	2.4
	GO:0072330	monocarboxylic acid biosynthetic process	12.0%	2.3
	GO:0006082	organic acid metabolic process	21.7%	2.3
	GO:0032787	monocarboxylic acid metabolic process	14.5%	1.8
IMF	GO:0019216	regulation of lipid metabolic process	11.5%	12.7
	GO:0032787	monocarboxylic acid metabolic process	15.3%	12.2
	GO:0006629	lipid metabolic process	23.4%	11.5
	GO:0006631	fatty acid metabolic process	11.1%	9.4
	GO:0046890	regulation of lipid biosynthetic process	7.7%	9.3
BF	GO:0009725	response to hormone	23.4%	9.6
	GO:0032868	response to insulin	12.8%	8.2
	GO:1901700	response to oxygen-containing compound	28.7%	8.1
	GO:0009719	response to endogenous stimulus	28.7%	8.0
	GO:0043434	response to peptide hormone	12.8%	5.7
EMA	GO:1901652	response to peptide	14.1%	4.1
	GO:0032868	response to insulin	11.3%	4.0
	GO:0010243	response to organonitrogen compound	19.7%	4.0
	GO:0043434	response to peptide hormone	12.7%	3.6
	GO:0062013	positive regulation of small molecule metabolic process	9.9%	3.5

GeneRatio: gene calling rate, i.e., the ratio of genes involved in each biological process among the entire set of text-mined genes; −log₁₀P.adj: −log10 of the P-value adjusted by the Bonferroni method.

Fig 1

The karyotype of QTL regions registered in QTLDB, text-mined region, and the intersection of both regions.

Each karyotype represents the region for the trait indicated above. Percentages in parentheses beside the trait names indicate the ratio of text-mined region within QTLDB region.


Genome-wide association study (GWAS) with text-mined SNPs

The Manhattan plots for each trait are shown in Fig 2. The Bonferroni correction (0.05/number of SNPs) was used as the significance threshold in the genome-wide association study, and the SnpEff annotation was used to determine marker locations. Three significant clusters were found for CWT. The most significant marker on chromosome 4 (position: 10,710,350; P = 10^−29.6) lies in an intron of the CALCR gene. On chromosomes 6 and 14, the most significant markers lie in the LCORL–SLIT2 (position: 39,932,557; P = 10^−40.2) and PLAG1–CHCHD7 (position: 25,015,640; P = 10^−105.3) intergenic regions, respectively. Four significant genomic regions were found for BF. The most significant marker on chromosome 22 is a downstream gene variant of PPARG (position: 57,362,666; P = 10^−6.5). The most significant markers in the clusters on chromosomes 2, 13, and 23 are located in the INSIG2–EN1 (position: 70,895,063; P = 10^−5.97), APCDD1L–VAPB (position: 58,449,824; P = 10^−7.3), and BMP5–HMGCLL1 (position: 4,622,146; P = 10^−16.9) intergenic regions, respectively. Three clusters were significant for EMA. The most significant markers on chromosomes 3, 6, and 14 lie in the S100A10–THEM4 (position: 18,822,190; P = 10^−6.2), LCORL–SLIT2 (position: 39,932,557; P = 10^−12.5), and PLAG1–CHCHD7 (position: 25,015,640; P = 10^−26.6) regions, respectively. For the meat quality traits, only one marker, at position 98,540,675 on chromosome 7, was significant, for D_SF (P = 10^−7.4); it is an intron variant of the CAST gene.

Fig 2

Manhattan plots with results of genome-wide association study using text-mined SNPs for each trait.

The y-axis shows the −log10P-value of each SNP and the x-axis is the marker index. The green line is the Bonferroni-line representing 0.05/number of markers. The blue line is the suggestive-line representing 0.1/number of markers.


Variance component estimation

A statistical summary of the variance component estimation is shown in Table 4. In carcass traits, CWT showed the highest heritability (0.42) when Im_777K was used in the estimation. BF and EMA showed no difference in heritability between the three different estimation models (BF: 0.41, EMA: 0.39). In meat quality traits, the heritabilities of WBSF in the two muscle types semimembranosus and longissimus dorsi were 0.1 and 0.19, respectively, when estimated using the Im_777K panel. S_IMF and D_IMF showed heritabilities of 0.21 and 0.32, respectively, when estimated using Im_777K. All four traits showed similar heritabilities between the three models.

Table 4

Variance components estimated with different marker sets.

Trait	Value	Im_777K	exp_777K	exp_777K + tm_SNPs
CWT	σ²_u	913.66	908.35	705.05 + 171.76
	σ²_e	1287.6	1297.2	1307.3
	h²	0.42	0.41	0.4
BF	σ²_u	9.51	9.44	8.91 + 0.63
	σ²_e	13.65	13.71	13.64
	h²	0.41	0.41	0.41
EMA	σ²_u	50.37	50.04	48 + 2.43
	σ²_e	77.59	77.87	77.55
	h²	0.39	0.39	0.39
S_SF	σ²_u	0.11	0.11	0.07 + 0.04
	σ²_e	1.02	1.02	1.02
	h²	0.1	0.09	0.09
D_SF	σ²_u	0.13	0.12	0.07 + 0.04
	σ²_e	0.55	0.55	0.55
	h²	0.19	0.18	0.17
S_IMF	σ²_u	0.66	0.67	0.65 + 0.000024
	σ²_e	2.46	2.44	2.47
	h²	0.21	0.22	0.21
D_IMF	σ²_u	5.28	5.24	4.34 + 0.73
	σ²_e	11.51	11.55	11.72
	h²	0.32	0.31	0.3

Im_777K: variance components estimated with the imputed 777K SNPs; exp_777K: variance components estimated with the imputed 777K SNPs excluding the text-mined SNPs; exp_777K + tm_SNPs: variance components estimated when the two marker sets (exp_777K and text-mined SNPs) were fitted as separate genetic variances. The first genetic variance is the exp_777K component and the second is the text-mined SNP component.
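The heritabilities in Table 4 can be checked from the reported variance components. The sketch below assumes h² = σ²_u / (σ²_u + σ²_e), with the two genetic components summed for the two-GRM model; the data layout and function name are illustrative. Because the table reports rounded components, a recomputed value can differ from the printed one in the last digit for some traits (for example, D_IMF with Im_777K gives about 0.31 rather than the printed 0.32).

```python
# (genetic variance component(s), residual variance) as reported in Table 4.
table4 = {
    ("CWT", "Im_777K"): ([913.66], 1287.6),
    ("CWT", "exp_777K"): ([908.35], 1297.2),
    ("CWT", "exp_777K + tm_SNPs"): ([705.05, 171.76], 1307.3),
    ("BF", "Im_777K"): ([9.51], 13.65),
    ("BF", "exp_777K + tm_SNPs"): ([8.91, 0.63], 13.64),
    ("EMA", "Im_777K"): ([50.37], 77.59),
    ("EMA", "exp_777K + tm_SNPs"): ([48.0, 2.43], 77.55),
}


def heritability(genetic_variances, residual_variance):
    """h2 = total genetic variance / (total genetic + residual variance)."""
    total_genetic = sum(genetic_variances)
    return total_genetic / (total_genetic + residual_variance)


h2 = {key: round(heritability(*values), 2) for key, values in table4.items()}

for (trait, marker_set), value in h2.items():
    print(f"{trait:4s} {marker_set:20s} h2 = {value:.2f}")
```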

Genomic prediction

The accuracies of the GEBVs for the carcass traits (CWT, BF, EMA) and meat quality traits (WBSF, IMF) are shown in Table 5. Fitting two GRMs constructed from two different SNP panels (exp_777K + tm_SNPs) as random effects in the GBLUP model gave better accuracy than fitting one GRM built with exp_777K for all traits. For CWT, the prediction accuracy with Im_777K was 0.453, which was 0.002 higher than that of the model with exp_777K + tm_SNPs. Conversely, for BF, using exp_777K + tm_SNPs gave an accuracy of 0.421, which was 0.002 higher than that with Im_777K. EMA also showed its highest prediction accuracy (0.437) with the two GRMs from exp_777K + tm_SNPs. The accuracies of genomic prediction using two GRMs for WBSF in the semimembranosus and longissimus dorsi muscles were 0.129 and 0.189, respectively, and those for IMF were 0.168 and 0.225, respectively; all were higher than the accuracies obtained with Im_777K. To validate the effect of the text-mined SNPs in the multi-GRM model, GBLUP was additionally run with evenly-mined SNPs (em_SNPs) in place of the text-mined SNPs, together with the remaining SNPs (Table 6). For all four meat quality traits, GBLUP with tm_SNPs was more accurate than GBLUP with em_SNPs. For CWT and EMA, em_SNPs gave higher accuracy than tm_SNPs, which may indicate that these two traits are more polygenic than the other traits.

Table 5

Average correlation between the GEBV and corrected phenotypic values (y), with standard error, over the 10 validation sets for carcass traits and meat quality traits.

Trait	Im_777K	exp_777K	exp_777K + tm_SNPs
CWT	0.453 ± 0.01	0.449 ± 0.01	0.451 ± 0.01
BF	0.419 ± 0.01	0.413 ± 0.01	0.421 ± 0.01
EMA	0.423 ± 0.01	0.429 ± 0.01	0.437 ± 0.004
S_SF	0.105 ± 0.04	0.102 ± 0.02	0.129 ± 0.03
D_SF	0.121 ± 0.03	0.115 ± 0.04	0.189 ± 0.03
S_IMF	0.16 ± 0.02	0.15 ± 0.03	0.168 ± 0.02
D_IMF	0.207 ± 0.04	0.163 ± 0.03	0.225 ± 0.02
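The accuracy differences quoted in the text follow directly from Table 5, as this short sketch shows. The values are copied from the table; the variable names are illustrative.

```python
# Prediction accuracies from Table 5: (Im_777K, exp_777K, exp_777K + tm_SNPs).
table5 = {
    "CWT": (0.453, 0.449, 0.451),
    "BF": (0.419, 0.413, 0.421),
    "EMA": (0.423, 0.429, 0.437),
    "S_SF": (0.105, 0.102, 0.129),
    "D_SF": (0.121, 0.115, 0.189),
    "S_IMF": (0.160, 0.150, 0.168),
    "D_IMF": (0.207, 0.163, 0.225),
}

# Gain of the two-GRM model over the single GRM built from all imputed SNPs.
gain_vs_full = {t: round(two - full, 3) for t, (full, _, two) in table5.items()}
# Gain of the two-GRM model over the single GRM without text-mined SNPs.
gain_vs_exp = {t: round(two - exp, 3) for t, (_, exp, two) in table5.items()}

for trait in table5:
    print(f"{trait:6s} vs Im_777K: {gain_vs_full[trait]:+.3f}   "
          f"vs exp_777K: {gain_vs_exp[trait]:+.3f}")
```

CWT is the only trait for which the two-GRM model does not exceed Im_777K; the six positive differences match the gains listed in the abstract.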

Table 6

Accuracy of evenly-mined GBLUP and text-mined GBLUP.

Traits	exp_777K + tm_SNPs	exp_777K + em_SNPs
CWT	0.451 ± 0.01	0.471 ± 0.01
BF	0.421 ± 0.01	0.419 ± 0.01
EMA	0.437 ± 0.004	0.438 ± 0.01
S_SF	0.129 ± 0.03	0.099 ± 0.02
D_SF	0.189 ± 0.03	0.095 ± 0.02
S_IMF	0.168 ± 0.02	0.147 ± 0.02
D_IMF	0.225 ± 0.02	0.202 ± 0.03

exp_777K + em_SNPs: multi-GRM GBLUP fitted with the evenly-mined SNPs (em_SNPs) and the remaining SNPs.


Discussion

Biological relatedness of text-mined gene with carcass and meat quality traits

Carcass traits

The most frequently mined genes for the carcass traits included IGF1, MSTN, MC4R, SST, CAPN1, and PPARGC1A. Many previous studies have investigated the biological effects of these genes on quantitative traits. Insulin-like growth factor (IGF) plays a key role in cell differentiation, growth, and the regulation of metabolism [29]. The myostatin (MSTN) gene, also known as GDF8, encodes a member of the transforming growth factor β superfamily and is associated with the regulation of skeletal muscle mass and carcass yield in cattle [30]. The melanocortin 4 receptor (MC4R) gene plays an important role in energy balance and is associated with economic traits in beef cattle [31]. Peroxisome proliferator activated receptor gamma coactivator 1 alpha (PPARGC1A) has stood out as a candidate gene for beef fat synthesis [32]. Although somatostatin (SST) inhibits growth hormone, there has been little research on the association between the SST gene and carcass traits. This gene appears to have been mined because the abbreviation “SST” is also used with other meanings in the literature, such as “sole soft tissue”. In addition to these high-ranking genes, other genes (i.e., NCAPG, POMC, LCORL, FTO, IGF2, FABP3, LEPR, and ADIPOQ) have been found to be associated with growth-related traits in multiple breeds [33-40]. The significant genes in the GWAS results (CALCR, PLAG1, INSIG2, PPARG, BMP5, S100A10) have also been reported to be related to growth performance and adipose tissue obesity in pigs and cattle [35, 41–45]. Many other TMG also seem to be associated with growth-related traits, because the GO term results showed that the TMG for carcass traits were associated with growth regulators and growth factors.

Meat quality traits

CAST and CAPN1 were the two most frequently mined genes for WBSF. Calpain 1 (CAPN1) encodes the large subunit of a calcium-activated neutral protease (calpain), and the product of the calpastatin (CAST) gene inhibits μ- and m-calpain activity. These two proteins, as key myofibrillar proteins, mediate proteolysis during postmortem storage of carcasses and cuts of meat at refrigerated temperatures and play important roles in meat tenderness [46]. The association between CAST/CAPN1 and WBSF has been studied extensively [47-50]. For IMF, SCD, LPL, and FABP4 were the three most frequently mined genes. The stearoyl-CoA desaturase (SCD) gene encodes an enzyme involved in fatty acid biosynthesis, primarily the synthesis of oleic acid [51]. The lipoprotein lipase (LPL) gene encodes lipoprotein lipase, which provides triglyceride-derived fatty acids to adipose tissue [52]. Fatty-acid-binding protein 4 (FABP4) plays a number of important roles, including fatty acid uptake, transport, and metabolism in muscle [53]. In addition to these genes, CAPN3, KCNJ11, and DNAJA1 are known to be associated with beef tenderness [54-56], and FABP3, LEPR, FASN, and DGAT1 have been reported to be associated with IMF [57-59]. In the GO term analysis for WBSF, biological processes related to carboxylic acid biosynthesis and metabolism were significant. Carboxylic acids are organic acids that have been shown in previous studies to affect beef tenderness [60, 61]. In addition, the IMF-related TMG showed a significant association with the regulation of lipid metabolic and biosynthetic processes. These GO term results support the association of the WBSF- and IMF-related TMG with their respective traits.

When the text-mined SNPs were excluded from the Im_777K marker panel, the prediction accuracy for CWT, BF, WBSF, and IMF decreased.
In a previous simulation study, a panel that excluded QTL from the 50K SNP panel showed lower accuracy than a panel that included the QTL [2]. These results indicate that text-mined SNPs may be more strongly functionally associated with QTL for CWT, BF, WBSF, and IMF and may include markers in linkage disequilibrium with QTL for these traits. Fitting two GRMs constructed using exp_777K and text-mined SNPs in the GBLUP model as different random effects resulted in higher accuracy than fitting one GRM constructed using Im_777K for BF, EMA, WBSF, and IMF. These results are consistent with previous studies indicating that differentially weighting subsets of markers based on genomic features increases predictive ability [8]. The increase in accuracy was greater for the traits related to the longissimus dorsi muscle than for those related to the semimembranosus muscle. One of the most important factors affecting the accuracy of genomic prediction is linkage disequilibrium between common SNPs and QTL [7]. As selection for a specific trait proceeds, linkage disequilibrium between causal polymorphisms for that trait and other marker loci appears to become stronger [6]. As traits related to the semimembranosus muscle are not considered in evaluating the grade of Hanwoo cattle, selection on these traits would not have been carried out actively. Therefore, linkage disequilibrium between QTL and other markers would be weaker, and this may have been responsible for these results. In this study, the SNPs that seemed to be related to the traits were selected by text mining, and the prediction accuracy increased slightly when these SNPs were weighted differently from the other SNP panel. In the GBLUP method, the weights of the GRMs are controlled by the lambda value (σ²e/σ²u).
Because the σ²u estimated from text-mined SNPs was lower than that estimated from exp_777K, a higher lambda value was applied to the GRM built from text-mined SNPs, and this seemed to increase the prediction accuracy by giving more weight to text-mined SNPs in the model. Nevertheless, in comparisons between multi-GRM models, the accuracy for CWT and EMA decreased when tm_SNPs were used. These results may indicate that text-mined GBLUP is not effective for traits that are affected more by polygenic effects than by causal variant effects. There may be limits to the conclusion that text mining can improve prediction accuracy, since text-mined SNPs did not result in a significant improvement in prediction accuracy. However, there was a slight accuracy increase for meat quality traits, and the GO term analysis suggests that text mining can play a role in finding functional genes for complex traits. Therefore, attempts to incorporate text mining into genomic prediction seem valuable, and further studies using text mining (e.g., with other methods for weighting SNP effects) can be expected to yield significant results [62, 63]. In addition, text mining may be applied to various populations or breeds, since marker selection by text mining does not use the phenotypic or genetic information of a specific population.
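To make the role of the lambda values concrete, the following is a minimal NumPy sketch of a two-GRM BLUP, assuming the model y = Xb + u1 + u2 + e with one lambda per GRM (lambda_k = σ²e/σ²u_k). It is not the BLUPF90 analysis used in this study: the VanRaden-style GRM scaling, the function names make_grm and two_grm_blup, the simulated genotypes, and the lambda values are all assumptions chosen for illustration, and the variance components are taken as known rather than estimated.

```python
import numpy as np


def make_grm(genotypes: np.ndarray) -> np.ndarray:
    """Build a GRM from an (animals x SNPs) matrix coded 0, 1, or 2.

    Assumption: VanRaden-style scaling; the text above does not state
    which GRM formula was used in the study.
    """
    p = genotypes.mean(axis=0) / 2.0            # allele frequency per SNP
    centered = genotypes - 2.0 * p
    return centered @ centered.T / (2.0 * np.sum(p * (1.0 - p)))


def two_grm_blup(y, X, G1, G2, lam1, lam2):
    """BLUP for y = Xb + u1 + u2 + e with lam_k = sigma2_e / sigma2_uk."""
    n = len(y)
    # Phenotypic covariance divided by sigma2_e: each GRM is scaled by 1/lambda.
    V = G1 / lam1 + G2 / lam2 + np.eye(n)
    Vinv = np.linalg.inv(V)
    b = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)   # GLS fixed effects
    e_hat = Vinv @ (y - X @ b)                            # predicted residuals
    u1 = G1 @ e_hat / lam1                                # breeding values, GRM 1
    u2 = G2 @ e_hat / lam2                                # breeding values, GRM 2
    return b, u1, u2, e_hat


# Simulated stand-in data (hypothetical sizes, not the Hanwoo data).
rng = np.random.default_rng(0)
n_animals, n_snps, n_tm = 50, 200, 20
geno = rng.integers(0, 3, size=(n_animals, n_snps)).astype(float)
G_tm = make_grm(geno[:, :n_tm])     # stand-in for the text-mined SNP panel
G_exp = make_grm(geno[:, n_tm:])    # stand-in for the remaining SNPs
X = np.ones((n_animals, 1))         # intercept only
y = rng.normal(size=n_animals)

b, u_tm, u_exp, e_hat = two_grm_blup(y, X, G_tm, G_exp, lam1=4.0, lam2=2.0)
gebv = u_tm + u_exp                 # total genomic breeding value per animal
print(gebv[:5])
```

In this formulation each GRM enters the covariance scaled by 1/lambda, so the two lambda values determine how strongly the breeding values attributed to each SNP panel are shrunk. In practice the variance components would be estimated from the data (for example by REML) rather than fixed in advance.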

Conclusions

This study used text mining to extract biological information from previous papers and thereby improve the performance of genomic prediction. The results showed that text mining can be used to find genes related to specific traits, because associations between the TMG and each carcass and meat quality trait were identified in the text mining and GO term analysis results. However, a word that coincided with a gene symbol but was used with another meaning (e.g., SST) was also mined as a text-mined gene. Therefore, it will be necessary to develop further text mining methods that can resolve this problem. In the genomic prediction results, text-mined SNPs seemed to be in tighter linkage disequilibrium with QTL for BF, EMA, WBSF, and IMF. There may be limits to the conclusion that text mining can improve prediction accuracy, since text-mined SNPs did not result in a significant improvement in prediction accuracy. However, attempts to incorporate text mining into genomic prediction still seem valuable, and further studies using text mining can be expected to yield significant results, because the slight accuracy increase for meat quality traits suggests that text mining can play a role in finding functional genes for complex traits. In addition, text mining may be applied to various populations or breeds, since marker selection by text mining does not use the phenotypic or genetic information of a specific population.

The workflow of the text mining.

(TIF)

SNP information used in this study.

(ZIP)

3 Jun 2020

PONE-D-20-12127

Genome-wide identification of major genes and genomic prediction using High-Density and Text-Mined Gene-Based SNP panels in Hanwoo (Korean cattle)

PLOS ONE

Dear Dr. Lee,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jul 18 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future.
For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Shuhong Zhao, Ph.D

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note that you are reporting an analysis of a microarray, next-generation sequencing, or deep sequencing data set. PLOS requires that authors comply with field-specific standards for preparation, recording, and deposition of data in repositories appropriate to their field. Please upload these data to a stable, public repository (such as ArrayExpress, Gene Expression Omnibus (GEO), DNA Data Bank of Japan (DDBJ), NCBI GenBank, NCBI Sequence Read Archive, or EMBL Nucleotide Sequence Database (ENA)). In your revised cover letter, please provide the relevant accession numbers that may be used to access these data. For a full list of recommended repositories, see http://journals.plos.org/plosone/s/data-availability#loc-omics or http://journals.plos.org/plosone/s/data-availability#loc-sequencing.

3. We note that you have indicated that data from this study are available upon request. PLOS only allows data to be available upon request if there are legal or ethical restrictions on sharing data publicly. For more information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions.
In your revised cover letter, please address the following prompts:

a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially sensitive information, data are owned by a third-party organization, etc.) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent.

b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings as either Supporting Information files or to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories.

We will update your Data Availability statement on your behalf to reflect the information you provide.

4. PLOS requires an ORCID iD for the corresponding author in Editorial Manager on papers submitted after December 6th, 2016. Please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager. Please see the following video for instructions on linking an ORCID iD to your Editorial Manager account: https://www.youtube.com/watch?v=_xcclfuvtxQ

5. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.

Additional Editor Comments (if provided):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Partly

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A

Reviewer #2: Yes

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: Yes

Reviewer #2: No

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No

Reviewer #2: Yes

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above.
You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Lee et al. conducted both GWAS and GS in a Korean cattle population. Besides the regular genotype and phenotype information for GWAS and GS, the reported SNPs associated with six traits were also identified by a text mining technology and used in GWAS and GS. The GWAS was carried out by a regular mixed linear model and the GS was performed by a regular GBLUP model with a single random effect or two random effects. The idea is interesting and the experimental design is fine. However, the manuscript is not well written and it is hard to judge if the statistical analyses are done correctly. Therefore, I have the following concerns.

Major concerns:

Combined with the information from Table 1 and Table 4, we can find that the advantages of model 3 are bigger than the other two models when the number of text-mined SNPs is smaller. To remove the effect of the number of text-mined SNPs, I suggest adding one more experiment. Take the CWT trait as an example: please randomly select 17,662 SNPs from all SNPs as 'fake text-mined SNPs', test the accuracy of model 3 again, and add the result as the fourth column in Table 4.

The published papers which used this Korean cattle population should not be used to identify the text-mined SNPs. This should be declared.

The GWAS was only carried out on text-mined SNPs; how about the GWAS results if using all SNPs?

The description of the GWAS model is not rigorous. For example, if the GWAS was performed by GCTA, the D should be a single text-mined SNP instead of a genotype matrix of text-mined SNPs. Please correct it and double-check the method section.

The slaughter information was added to the genomic prediction model as one of the fixed effects. What is slaughter information? Also, meat quality traits were adjusted by farmer information; what is ‘farmer information’? Was the farmer information added as fixed effects or random effects in the adjusted model? All details of the adjusted model should be clearly described.

Minor concerns:

A comma should be added in a number with more than three digits.

Why were the MAF thresholds set differently for carcass traits and meat quality traits?

The number of significant digits retained should be kept the same in each table.

What is TIAB in Table 1?

It would be more straightforward if the top genes with corresponding frequencies were shown in a Table instead of Figure 1.

Some of the chromosome IDs in the Manhattan plots overlapped.

Reviewer #2: The manuscript studied the effect of text-mined SNPs on GWAS and genomic prediction in Hanwoo. A major concern I have is that only three scenarios were compared for genomic prediction, and more scenarios should be included. Some suggestions are:

1) tm_SNPs only

2) a random subset of k SNPs from Im_777K, where k is the number of markers in exp_777K

3) fit the random subset obtained in (2) and the remaining SNPs as the two components in model 3

Other comments:

The standard error in table 4-1 is very low. Can you verify the standard error?

Why are fixed effects pre-adjusted in GWAS but not in genomic prediction?

Will GWAS using text-mined SNPs give lower power? Please add more discussion.

Line 160: Is \beta for one candidate SNP in each test? Please correct it in line 158.

Line 209: Please correct it.

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

13 Jul 2020

Manuscript PONE-D-20-12127R1

Dear Dr. Anita Estes,

We appreciate the time and effort that you dedicated to providing feedback on our manuscript entitled “Genome-wide identification of major genes and genomic prediction using High-Density and Text-Mined Gene-Based SNP panels in Hanwoo (Korean cattle)”. As you commented, we revised some information to comply with PLOS ONE's data sharing policy.

1) The relevant data used in this study were named ‘Hanwoo reference population’ (carcass trait population) and ‘Hanwoo meat quality population’. In addition, none of the authors of this study have privileges in accessing the datasets.

2) We revised the contact for accessing the genotype and meat quality trait data to (http://www.nias.go.kr/english/sub/boardHtml.do?boardId=depintro, National Institute of Animal Science, Animal Genome & Bioinformatics Division, Tae Hun Kim, PhD, Director of Animal Genome & Bioinformatics Division, thkim63@korea.kr). The Data Availability statement in the manuscript was also revised.
3) All SNP positions and rs-numbers used in this study were added to the Supporting Information. Detailed information and the accession number of each SNP can be found in the Bos_taurus_UMD_3.1.1 genome information submitted to NCBI GenBank (accession: GCA_000003055.5).

Thank you again for your efforts, and please contact us if you have any further requests.

Kind regards

SeungHwan Lee

==================================================

Manuscript PONE-D-20-12127

Dear Dr. Shuhong Zhao,

Thank you for giving us the opportunity to submit a revised manuscript ‘Genome-wide identification of major genes and genomic prediction using High-Density and Text-Mined Gene-Based SNP panels in Hanwoo (Korean cattle)’ to PLOS ONE. We appreciate your careful and thorough reading of our manuscript and your valuable comments. As you commented, the reference citations and figure file names were changed to follow the PLOS ONE style templates. In addition, captions for the Supporting Information files were added at the end of our revised manuscript.

We totally agree that data sharing is important for reproducibility, replication, and validation of research. However, some of our data cannot be shared due to legal restrictions on data sharing. All datasets (phenotypes and genotypes for carcass traits) used in this study were provided by the BioGreen 21 Program (Molecular Breeding Program) of the National Institute of Animal Science, RDA. The carcass traits can be obtained at a public website (https://mtrace.go.kr/). Requests for genotype and meat quality trait data can be made to the Korea National Institute of Animal Science (https://www.nias.go.kr:3443/front/main.do) at (lim.dj@korea.kr). We apologize for our situation, and a detailed data availability statement was added to the revised manuscript.

Response to Reviewers

Reviewer #1: Lee et al. conducted both GWAS and GS in a Korean cattle population.
Besides the regular genotype and phenotype information for GWAS and GS, the reported SNPs associated with six traits were also identified by a text mining technology and used in GWAS and GS. The GWAS was carried out by a regular mixed linear model and the GS was performed by a regular GBLUP model with a single random effect or two random effects. The idea is interesting and the experimental design is fine. However, the manuscript is not well written and it is hard to judge if the statistical analyses are done correctly. Therefore, I have the following concerns.

Response: We appreciate the time and effort that you dedicated to providing feedback on our manuscript. We have incorporated most of your suggestions. Your insightful comments made our paper more valuable. Our point-by-point responses to your concerns are written in red. Thank you.

Major concerns:

Combined with the information from Table 1 and Table 4, we can find that the advantages of model 3 are bigger than the other two models when the number of text-mined SNPs is smaller. To remove the effect of the number of text-mined SNPs, I suggest adding one more experiment. Take the CWT trait as an example: please randomly select 17,662 SNPs from all SNPs as 'fake text-mined SNPs', test the accuracy of model 3 again, and add the result as the fourth column in Table 4.

Response: We appreciate the reviewer’s feedback, and we agree that the analysis using random sets should be added. However, in model 3, it is important to identify how independent the called (text-mined SNPs) and non-called (exp SNPs) regions are. If the SNPs are evenly (or randomly) extracted, both SNP sets (called and non-called) will exhibit similar patterns due to LD. In a linear model, fitting two highly correlated GRMs will be less accurate than, or similar to, fitting a GRM using all SNPs. Therefore, we thought that applying evenly extracted SNPs to model 3 would not give results much different from those we want to show for model 1.
The table below shows the results of model 3 using evenly extracted SNPs and text-mined SNPs. The second table shows the correlation between the GRMs constructed with each SNP set.

Trait    text_mined + excepted    evenly_extracted + excepted    Im_777K
CWT      0.451 ± 0.01             0.4707 ± 0.01                  0.453 ± 0.01
BF       0.421 ± 0.01             0.4187 ± 0.01                  0.419 ± 0.01
EMA      0.437 ± 0.004            0.4376 ± 0.01                  0.423 ± 0.01
S_SF     0.129 ± 0.03             0.0992 ± 0.02                  0.105 ± 0.04
D_SF     0.189 ± 0.03             0.0954 ± 0.02                  0.121 ± 0.03
S_IMF    0.168 ± 0.02             0.1471 ± 0.02                  0.16 ± 0.02
D_IMF    0.225 ± 0.02             0.202 ± 0.03                   0.207 ± 0.04

Trait    text_mined + excepted    evenly_extracted + excepted
CWT      0.586                    0.784
BF       0.576                    0.596
EMA      0.578                    0.642
WBSF     0.71                     0.981
IMF      0.908                    0.997

The published papers which used this Korean cattle population should not be used to identify the text-mined SNPs. This should be declared.

Response: Thank you for pointing this out. There was no published paper using the carcass trait population (n = 12,635) when the text mining was conducted (Sep 2019). There is one published paper using part of the meat quality population (n = 1,039); however, the abstract of that paper does not contain a gene symbol. Therefore, we can declare that published papers using this Korean cattle dataset were not included in the text mining.

The GWAS was only carried out on text-mined SNPs; how about the GWAS results if using all SNPs?

Response: Indeed, we had expected that GWAS with text-mined SNPs would be able to identify the distribution of QTLs whose effects disappeared during the selection process. However, these results were not much different from the results using all SNPs. It seems that the text-mined SNPs were not able to filter out common SNPs that have little or no effect but appear to have a high effect due to LD, because the text-mined SNPs were distributed throughout the whole genome. The figures below show the GWAS results using all SNPs and the simulation results for the QTL effect changing with selection.
[Figure panels: G0 (no selection); G10 (after 10 generations with selection)]

The description of the GWAS model is not rigorous. For example, if the GWAS was performed by GCTA, the D should be a single text-mined SNP instead of a genotype matrix of text-mined SNPs. Please correct it and double-check the method section.

Response: We apologize for the confusion. In this study, a single-marker linear mixed model was used for GWAS. Therefore, D must be a genotype vector for each SNP, not a genotype matrix. We corrected the GWAS description (line 160) to ‘D is a vector of genotype of the candidate SNPs recorded as 0, 1, or 2’. In addition, in a single-marker linear mixed model, the beta of each model is the additive effect of a single marker. Therefore, we also revised the description (line 161) to ‘beta is the additive effect of the candidate SNPs’.

The slaughter information was added to the genomic prediction model as one of the fixed effects. What is slaughter information? Also, meat quality traits were adjusted by farmer information; what is ‘farmer information’? Was the farmer information added as fixed effects or random effects in the adjusted model? All details of the adjusted model should be clearly described.

Response: Thank you for pointing out this confusion. In this study, the fixed effects (birth information, slaughter information) used in common for both populations (carcass, meat quality) were birth month, birth year, slaughter month, and slaughter year. We have the slaughter place information for the carcass trait population; therefore, we used this information as a fixed effect for carcass traits. In the case of the meat quality population, we used the farmer information as a fixed effect, because we had the name of the owner of each slaughtered steer (farmer information). To clarify these points, we rewrote (line 177) to ‘b is the vector of fixed effects, which included growing site, birth month, birth year, slaughter month, slaughter year, and slaughter age as covariates for all traits.
In addition, the carcass traits included slaughter place and sex, while the meat quality trait included farmer information (the owner’s name of steers).’

Minor concerns:

A comma should be added in a number with more than three digits.

Response: We revised all numbers with more than three digits.

Why were the MAF thresholds set differently for carcass traits and meat quality traits?

Response: You have raised an important point. We wanted to remove SNPs carried by fewer than 10 individuals, because the effects of SNPs carried by too few individuals are at risk of being overestimated in GWAS or genomic prediction. Since 10 individuals were equivalent to 0.001 of the carcass population and 0.01 of the meat quality population, different MAF thresholds were set for the two populations.

The number of significant digits retained should be kept the same in each table.

Response: The significant digits and gene ratio shown in Table 2 (Table 3 in the revised manuscript) were modified to show only one decimal place.

What is TIAB in Table 1?

Response: [TIAB] is one of the search options of the PubMed search engine. With the [TIAB] option, only papers that include the search words and numbers in the citation’s title, collection title, abstract, other abstract, or keywords are retrieved from the PubMed database. Other options and descriptions for the PubMed engine can be found at ‘https://www.ncbi.nlm.nih.gov/books/NBK3827/’.

It would be more straightforward if the top genes with corresponding frequencies were shown in a Table instead of Figure 1.

Response: Thank you for pointing this out. As you commented, all the contents of Figure 1 have been moved to Table 2 for a tidier presentation of the results.

Some of the chromosome IDs in the Manhattan plots overlapped.

Response: Revised figures that fix the overlapping chromosome IDs were re-attached.

Reviewer #2: The manuscript studied the effect of text-mined SNPs on GWAS and genomic prediction in Hanwoo.
A major concern I have is that only three scenarios were compared for genomic prediction, and more scenarios should be included. Some suggestions are:

1) tm_SNPs only

2) a random subset of k SNPs from Im_777K, where k is the number of markers in exp_777K

3) fit the random subset obtained in (2) and the remaining SNPs as the two components in model 3

Response: We appreciate the reviewer’s comment, and we agree that the analyses using tm_SNPs only and random sets should be added. In the scenario using only tm_SNPs, prediction accuracy was lower than in the other scenarios. There seem to be two reasons for these results: 1) some polygenic effects, i.e., interactions between markers, are missing, since only tm_SNPs are used for genomic prediction; 2) text mining did not identify all causal variants. We did not include these tm_SNPs results in the manuscript, because they were thought to be inevitable in genomic prediction using a low-density marker set. In addition, we conducted genomic prediction fitting evenly selected SNPs in model 3. However, the independence of the two GRMs (tm_SNP + exp_SNP) is important for the statistical power of model 3, because collinearity between the two GRMs could weaken the model. Therefore, we thought that applying evenly (or randomly) extracted SNPs to model 3 would not give results much different from those we want to show for model 1. The table below shows the results of model 3 using evenly extracted SNPs and text-mined SNPs. The second table shows the correlation between the GRMs constructed with each SNP set.
Trait    text_mined + excepted    evenly_extracted + excepted    Im_777K
CWT      0.451 ± 0.01             0.4707 ± 0.01                  0.453 ± 0.01
BF       0.421 ± 0.01             0.4187 ± 0.01                  0.419 ± 0.01
EMA      0.437 ± 0.004            0.4376 ± 0.01                  0.423 ± 0.01
S_SF     0.129 ± 0.03             0.0992 ± 0.02                  0.105 ± 0.04
D_SF     0.189 ± 0.03             0.0954 ± 0.02                  0.121 ± 0.03
S_IMF    0.168 ± 0.02             0.1471 ± 0.02                  0.16 ± 0.02
D_IMF    0.225 ± 0.02             0.202 ± 0.03                   0.207 ± 0.04

Trait    text_mined + excepted    evenly_extracted + excepted
CWT      0.586                    0.784
BF       0.576                    0.596
EMA      0.578                    0.642
WBSF     0.71                     0.981
IMF      0.908                    0.997

Other comments:

The standard error in table 4-1 is very low. Can you verify the standard error?

Response: We calculated the standard error as

$$\mathrm{SE} = \frac{1}{\sqrt{K}}\,\mathrm{sd}\left\{CV_1\left(\mathrm{cor}^{-(1)}\right), \ldots, CV_K\left(\mathrm{cor}^{-(K)}\right)\right\}$$

where $CV_n\left(\mathrm{cor}^{-(n)}\right)$ is Pearson's correlation for the test set of validation fold n. In this study, we used 10-fold cross-validation to calculate accuracy. Therefore, we calculated the standard deviation of $\left\{CV_1\left(\mathrm{cor}^{-(1)}\right), \ldots, CV_{10}\left(\mathrm{cor}^{-(10)}\right)\right\}$, and this value divided by $\sqrt{10}$ was taken as the standard error.

Why are fixed effects pre-adjusted in GWAS but not in genomic prediction?

Response: We apologize for the confusion, and thank you for pointing this problem out. The GCTA program conducts the association study in two steps in order to improve computational efficiency: in step 1, the phenotype is adjusted for the mean and covariates (fixed effects); in step 2, the adjusted phenotype is subsequently used for testing SNP association. Therefore, we pre-adjusted the phenotype to skip step 1. On the other hand, we did not pre-adjust the phenotype in genomic prediction, because the BLUPF90 program fits all fixed effects in one model. If SNPs are correlated with the fixed effects, pre-adjusting the phenotype for the fixed effects will probably cause overestimation of SNP effects. However, in this study, the environmental effect was used as a fixed effect.
In addition, since SNPs do not mutate during an animal's lifetime, there can be no interaction with the environment. Therefore, there is no difference between the two methods (pre-adjusted or not).

Will GWAS using text-mined SNPs give lower power? Please add more discussion.

Response: In the individual selection processes for breeding, it is difficult to identify QTL by GWAS as the effect of the QTL decreases. Therefore, we expected that text mining could filter out common SNPs whose effects are overestimated due to LD with causal variants. However, the GWAS using text-mined SNPs did not seem to differ significantly from the results using the 777k SNPs. This seems to be because the text-mined SNPs were distributed throughout the whole genome. Nevertheless, GWAS using text-mined SNPs still seems valuable in terms of identifying associations between candidate causal variant SNPs and phenotypes. The figures below show the GWAS results using the 777k SNPs and the simulation results for the QTL effect changing with selection.

[Figure panels: G0 (no selection); G10 (after 10 generations with selection)]

Line 160: Is \beta for one candidate SNP in each test? Please correct it in line 158.

Response: We apologize for the confusion. In this study, a single-marker linear mixed model was used for GWAS. Therefore, beta must be a single value for each SNP, not a vector of effects. We revised the GWAS description (line 161) to ‘beta is the additive effect of the candidate SNPs’. Thank you for pointing out this problem.

Line 209: Please correct it.

Response: We apologize for the error, and we thank the reviewer for this comment. We wanted to present the equation for calculating the GEBV of the reference and validation sets in a mixed model. However, there were some errors (i.e., G is the inverse of the GRM, not the GRM). Therefore, we removed this equation and added two new equations (lines 209 and 215).
The first equation is the linear mixed model with a single random effect for model1 and model2; the second equation is the mixed model with two random effects for model3. We thought this would explain our prediction models more clearly.

Submitted filename: Response.docx

14 Aug 2020

PONE-D-20-12127R1
Genome-wide identification of major genes and genomic prediction using High-Density and Text-Mined Gene-Based SNP panels in Hanwoo (Korean cattle)
PLOS ONE

Dear Dr. Lee,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Sept 15. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:
A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.
If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols We look forward to receiving your revised manuscript. Kind regards, Shuhong Zhao, Ph.D Academic Editor PLOS ONE [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation. Reviewer #1: All comments have been addressed Reviewer #2: (No Response) ********** 2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes Reviewer #2: Partly ********** 3. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes Reviewer #2: Yes ********** 4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). 
The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: (No Response) Reviewer #2: No ********** 5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes ********** 6. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: The authors have followed my recommendations to improve the manuscript. I suggest to add the information of tables and figures in point by point response to the manuscript and discuss the newly added results as well. Reviewer #2: In author's reply, authors mentioned a potential reason for the results that prediction using tm_SNP only gives lower accuracy is "text-mining didn’t identify all causal variant.". If this is the reason, GWAS using tm_SNPs may cause lower power because causal variants excluded from tm_SNPs are not tested. In terms of prediction, significant differences are rarely observed, and it looks like a significant difference is observed for CWT, but "text_mined + excepted" is worse than "evenly_extracted + excepted". I would recommend authors to run more cross validation to further verify their conclusion. 
A paired t test may also be considered. ********** 7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. 24 Aug 2020 Reviewer #1: The authors have followed my recommendations to improve the manuscript. I suggest to add the information of tables and figures in point by point response to the manuscript and discuss the newly added results as well. Thank you for this suggestion. As suggested by the reviewer, we added the evenly_SNPs + except results at table 6. Table 6. 
Accuracy of evenly-mined GBLUP and text-mined GBLUP

Traits    exp_777k + tm_SNPs    exp_777k + em_SNPs¹
CWT       0.451 ± 0.01          0.471 ± 0.01
BF        0.421 ± 0.01          0.419 ± 0.01
EMA       0.437 ± 0.004         0.438 ± 0.01
S_SF      0.129 ± 0.03          0.099 ± 0.02
D_SF      0.189 ± 0.03          0.095 ± 0.02
S_IMF     0.168 ± 0.02          0.147 ± 0.02
D_IMF     0.225 ± 0.02          0.202 ± 0.03

¹ exp_777k + em_SNPs: multi-GRM GBLUP with evenly-mined SNPs and except SNPs.

These results show that, for CWT, em_SNPs can give more accurate prediction than tm_SNPs. As with the results using whole SNPs, the polygenic nature of the trait seems to produce these results. They may indicate that text-mined GBLUP is not effective for traits that are affected more by polygenic effects than by causal variant effects. We also added this finding to the discussion (line 398). Thank you again for your comment.

Reviewer #2: In author's reply, authors mentioned a potential reason for the results that prediction using tm_SNP only gives lower accuracy is "text-mining didn’t identify all causal variant.". If this is the reason, GWAS using tm_SNPs may cause lower power because causal variants excluded from tm_SNPs are not tested. In terms of prediction, significant differences are rarely observed, and it looks like a significant difference is observed for CWT, but "text_mined + excepted" is worse than "evenly_extracted + excepted". I would recommend authors to run more cross validation to further verify their conclusion. A paired t test may also be considered.

Thank you for pointing this out. We totally agree that if text mining cannot identify all causal variants, GWAS using tm_SNPs may have lower power than GWAS using whole SNPs. However, the lower accuracy with tm_SNPs only seems to be due more to the loss of polygenic effects (e.g., gene × gene, gene × SNP) than to missing causal variants.
Even though text mining misses a small number of causal variants (variant recall decreases), it still seems a good way to filter out a large number of common SNPs (variant precision increases). Accordingly, this method may have lower power to identify the genetic architecture of target traits. However, considering its ability to remove SNPs that add noise to GWAS through LD, it seems to be a good method for finding true causal variants of target traits. In terms of the genomic prediction results, as you commented, only CWT and D_SF showed significant differences (by t-test). The reason em_SNPs showed higher accuracy for CWT seems to be that CWT is affected more by polygenic effects than by causal variant effects, owing to the characteristics of quantitative traits (or carcass traits). We added these results to Table 6, and a discussion of these results was also added (line 398). Thank you again for the time and effort that you dedicated to providing feedback on our manuscript.

Table 6. Accuracy of evenly-mined GBLUP and text-mined GBLUP

Traits    exp_777k + tm_SNPs    exp_777k + em_SNPs¹
CWT       0.451 ± 0.01          0.471 ± 0.01
BF        0.421 ± 0.01          0.419 ± 0.01
EMA       0.437 ± 0.004         0.438 ± 0.01
S_SF      0.129 ± 0.03          0.099 ± 0.02
D_SF      0.189 ± 0.03          0.095 ± 0.02
S_IMF     0.168 ± 0.02          0.147 ± 0.02
D_IMF     0.225 ± 0.02          0.202 ± 0.03

¹ exp_777k + em_SNPs: multi-GRM GBLUP with evenly-mined SNPs and except SNPs.

Submitted filename: Response_to_reviewer-Final.docx

30 Sep 2020

PONE-D-20-12127R2
Genome-wide identification of major genes and genomic prediction using High-Density and Text-Mined Gene-Based SNP panels in Hanwoo (Korean cattle)
PLOS ONE

Dear Dr. Lee,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands.
Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Please submit your revised manuscript by Oct. 15th. We look forward to receiving your revised manuscript.

Kind regards,
Shuhong Zhao, Ph.D
Academic Editor
PLOS ONE

Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1.
If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation. Reviewer #1: All comments have been addressed Reviewer #2: All comments have been addressed ********** 2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes Reviewer #2: Yes ********** 3. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes Reviewer #2: Yes ********** 4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes ********** 5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. 
Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes ********** 6. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: The manuscript has been improved and my previous concerns were addressed. I have one more comment: It is very useful but also very challenge to weight a set of genetic markers to improve prediction accuracy. A recent study developed a new method to improve the prediction accuracy by weighting the contributions of genetic markers in the prediction model (Lilin Yin et. al, Genome Biology, 2020). It should be nice if this could be discussed to compare with your strategy. Reviewer #2: I have no further comments. ********** 7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. 
PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

4 Oct 2020

Reviewer #1: The manuscript has been improved and my previous concerns were addressed. I have one more comment: It is very useful but also very challenge to weight a set of genetic markers to improve prediction accuracy. A recent study developed a new method to improve the prediction accuracy by weighting the contributions of genetic markers in the prediction model (Lilin Yin et. al, Genome Biology, 2020). It should be nice if this could be discussed to compare with your strategy.

Response: Thank you for introducing a good method. We agree that weighting genetic markers is a challenge and that further study is required. Therefore, we revised line 405 in the discussion to "Therefore, attempts to incorporate text mining into genomic predictions seem valuable and further study (i.e., other SNP effects weighting methods) using text mining can be expected to present the significant results.". In addition, the paper you recommended and the paper by Wang H et al. (Genetics Research, 2012) are cited at the end of that sentence.

Reviewer #2: I have no further comments.

Response: Thank you for the time and effort you devoted during the whole review process.

Submitted filename: Response_to_reviewer.docx

22 Oct 2020

Genome-wide identification of major genes and genomic prediction using High-Density and Text-Mined Gene-Based SNP panels in Hanwoo (Korean cattle)
PONE-D-20-12127R3

Dear Dr.
Lee, We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org. If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org. Kind regards, Shuhong Zhao, Ph.D Academic Editor PLOS ONE Additional Editor Comments (optional): Reviewers' comments: 29 Oct 2020 PONE-D-20-12127R3 Genome-wide identification of major genes and genomic prediction using High-Density and Text-Mined Gene-Based SNP panels in Hanwoo (Korean cattle) Dear Dr. Lee: I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department. If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. 
If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org. If we can help with anything else, please email us at plosone@plos.org. Thank you for submitting your work to PLOS ONE and supporting open access. Kind regards, PLOS ONE Editorial Office Staff on behalf of Mrs. Shuhong Zhao Academic Editor PLOS ONE
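
Two calculations discussed in the review correspondence above lend themselves to a short illustration: the standard error of the cross-validation accuracy (the standard deviation of the fold-wise correlations divided by √K) and the paired t test that Reviewer #2 suggested for comparing two models on the same folds. The Python sketch below is not the authors' code. The function names and the fold-wise accuracies are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, because the response does not state which form was used.

```python
import math
import statistics


def cv_standard_error(fold_accuracies):
    """Standard error of mean CV accuracy: sd of fold-wise values / sqrt(K).

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    k = len(fold_accuracies)
    if k < 2:
        raise ValueError("at least two folds are required")
    return statistics.stdev(fold_accuracies) / math.sqrt(k)


def paired_t_statistic(acc_a, acc_b):
    """Paired t statistic and degrees of freedom for fold-matched accuracies."""
    if len(acc_a) != len(acc_b):
        raise ValueError("both models need one accuracy per shared fold")
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    k = len(diffs)
    if k < 2:
        raise ValueError("at least two folds are required")
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("fold-wise differences have zero variance")
    return statistics.mean(diffs) / (sd / math.sqrt(k)), k - 1


if __name__ == "__main__":
    # Hypothetical fold-wise correlations for two models over 10 shared folds;
    # these numbers are illustrative and are not taken from the study.
    model_a = [0.46, 0.44, 0.47, 0.43, 0.45, 0.48, 0.44, 0.46, 0.45, 0.43]
    model_b = [0.48, 0.46, 0.47, 0.45, 0.48, 0.49, 0.46, 0.47, 0.48, 0.46]
    print("mean accuracy, model A:", statistics.mean(model_a))
    print("standard error, model A:", cv_standard_error(model_a))
    t_value, dof = paired_t_statistic(model_a, model_b)
    print("paired t:", t_value, "with", dof, "degrees of freedom")
```

Because cross-validation folds share most of their training animals, the fold-wise differences are not fully independent, so a t statistic computed this way is probably best read as a rough guide rather than an exact test.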

54 in total

1. The Interaction of Selection and Linkage. I. General Considerations; Heterotic Models.

Authors: R C Lewontin
Journal: Genetics Date: 1964-01 Impact factor: 4.562

2. A C/T mutation in microRNA target sites in BMP5 gene is potentially associated with fatness in pigs.

Authors: G C Shao; L F Luo; S W Jiang; C Y Deng; Y Z Xiong; F E Li
Journal: Meat Sci Date: 2010-10-30 Impact factor: 5.209

3. DISEASES: text mining and data integration of disease-gene associations.

Authors: Sune Pletscher-Frankild; Albert Pallejà; Kalliopi Tsafou; Janos X Binder; Lars Juhl Jensen
Journal: Methods Date: 2014-12-05 Impact factor: 3.608

4. Single nucleotide polymorphisms in the corticotrophin-releasing hormone and pro-opiomelancortin genes are associated with growth and carcass yield in beef cattle.

Authors: F C Buchanan; T D Thue; P Yu; D C Winkelman-Sim
Journal: Anim Genet Date: 2005-04 Impact factor: 3.169

5. Prerigor and postrigor changes in tenderness of ovine longissimus muscle.

Authors: T L Wheeler; M Koohmaraie
Journal: J Anim Sci Date: 1994-05 Impact factor: 3.159

6. Identification of KCNJ11 as a functional candidate gene for bovine meat tenderness.

Authors: Polyana C Tizioto; Gustavo Gasparin; Marcela M Souza; Mauricio A Mudadu; Luiz L Coutinho; Gerson B Mourão; Patricia Tholon; Sarah L C Meirelles; Rymer R Tullio; Antônio N Rosa; Maurício M Alencar; Sérgio R Medeiros; Fabiane Siqueira; Gelson L D Feijó; Renata T Nassu; Luciana C A Regitano
Journal: Physiol Genomics Date: 2013-10-22 Impact factor: 3.107

7. Double muscling in cattle due to mutations in the myostatin gene.

8. DGAT1, a new positional and functional candidate gene for intramuscular fat deposition in cattle.

9. Differential gene expression of fatty acid binding proteins during porcine adipogenesis.

10. Novel SNPs in the bovine ADIPOQ and PPARGC1A genes are associated with carcass traits in Hanwoo (Korean cattle).