Literature DB >> 27964730

Prior knowledge guided eQTL mapping for identifying candidate genes.

Yunli Wang1, Rene Richard2, Youlian Pan3.   

Abstract

BACKGROUND: Expression quantitative trait loci (eQTL) mapping is often used to identify genetic loci and candidate genes correlated with traits. Although usually a group of genes affect complex traits, genes in most eQTL mapping methods are considered as independent. Recently, some eQTL mapping methods have accounted for correlated genes, used biological prior knowledge and applied these in model species such as yeast or mouse. However, biological prior knowledge might be very limited for most species.
RESULTS: We proposed a data-driven prior knowledge guided eQTL mapping for identifying candidate genes. At first, quantitative trait loci (QTL) analysis was used to identify single nucleotide polymorphisms (SNP) markers that are associated with traits. Then co-expressed gene modules were generated and gene modules significantly associated with traits were selected. Prior knowledge from QTL mapping was used for eQTL mapping on the selected modules. We tested and compared prior knowledge guided eQTL mapping to the eQTL mapping with no prior knowledge in a simulation study and two barley stem rust resistance case studies. The results in simulation study and real barley case studies show that models using prior knowledge outperform models without prior knowledge. In the first case study, three gene modules were selected and one of the gene modules was enriched with defense response Gene Ontology (GO) terms. Also, one probe in the gene module is mapped to Rpg1, previously identified as resistance gene to stem rust. In the second case study, four gene modules are identified, one gene module is significantly enriched with defense response to fungus and bacterium.
CONCLUSIONS: Prior knowledge guided eQTL mapping is an effective method for identifying candidate genes. The case studies in stem rust show that this approach is robust, and outperforms methods with no prior knowledge in identifying candidate genes.

Entities:  

Keywords:  Candidate genes; Lasso; Prior knowledge; eQTL mapping

Mesh:

Year:  2016        PMID: 27964730      PMCID: PMC5155383          DOI: 10.1186/s12859-016-1387-9

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169


Background

A quantitative trait refers to a phenotype, such as disease resistance, that varies quantitatively and is attributable to multiple genes. The first step for discovering candidate genes is to identify chromosome regions associated with a particular quantitative trait through Quantitative trait loci (QTL) mapping. More recently, Expression quantitative trait loci (eQTL) mapping has been applied to identify regulatory regions for genes from transcriptome and genotype data. eQTLs are genomic loci that regulate expression in mRNAs or proteins. QTL mapping usually identifies a large region. Once QTLs or eQTLs have been identified, molecular techniques are employed to narrow down to candidate genes [1]. Traditional linkage mapping methods such as Haley-Knott regression(HK) and composite interval mapping (CIM) have been widely used for QTL mapping and recently on eQTL mapping [1]. Both HK and CIM assume that traits (QTL mapping) or genes (eQTL mapping) are not related. Association mapping methods based on the independence between genes and SNPs ignore the epistasis among genes and interaction between alleles. Least absolute shrinkage and selection operator (Lasso) generate a sparse regression model for eQTL mapping with one gene associated with a small number of SNPs. It showed that Lasso outperformed CIM and HK for eQTL mapping [2]. Recently, multi-task Lasso considering multiple correlated genes and multiple SNPs in a linear regression model has been used for eQTL mapping. Some methods have used prior knowledge to infer the associations between genes and SNPs [3]. The prior knowledge was either represented as gene pairs, SNP pairs, gene networks, or genetic interaction networks. Graph-guided fused lasso (GFLasso) used the fusion penalty to group related multi-response variables [4]. Adaptive Multi-Task Lasso assumed that genes are correlated and also used the prior knowledge of SNPs [5]. Chen et al. proposed a more efficient algorithm for GFLasso [6]. Fused Multitask Penalised Regression (FMPR) encourages the sparsity in weights for related tasks [7]. Two-graph guided multi-task Lasso allows the overlapped subnetworks of genes and SNPs, but they assume that correlation between SNPs or genes are known to infer the correlation between SNP and genes [8]. Graph-regularized dual Lasso represents the genetic interaction network and protein-protein interaction network as two graphs on a linear regression model [9]. Although these methods are appealing, they were only used on yeast and human eQTL mapping which contains rich biological knowledge. Many eQTL mapping methods were used for identifying candidate genes [10-12]. Some used linkage mapping methods such as CIM [10, 11], and others used association mapping methods such as simple linear regression – Matrix eQTL [12], or GAPIT, single locus mapping for population structure [13]. However, most these studies using eQTL mapping for identifying candidate genes do not consider the correlated gene structure and genetic interactions between SNPs, and none of them used prior knowledge in eQTL mapping. Because such prior knowledge is usually not available or not reliable for most species, many studies still use the methods based on the independence assumption between gene and SNPs. We proposed a new eQTL mapping method guided by prior knowledge for identifying candidate genes by using data-driven prior knowledge from QTLs/eQTLs in eQTL mapping. Although some advanced models [4, 5, 8, 9] are proposed and evaluated on model organism, the basic multi-task Lasso model has not been used for identifying candidate genes. We propose to use prior knowledge inferred from QTLs or eQTLs to set penalty factors for SNPs in multi-task Lasso. This method does not rely on any regulatory features of genes or SNPs. We compared eQTL mapping guided by prior knowledge with no prior knowledge. The results show that eQTL mapping guided by prior knowledge outperforms the model without knowledge. We applied our method on two case studies to identify candidate genes that are responsible for resistance to stem rust in barley.

Methods

Our method has three steps. First, we perform QTL mapping to identify a relatively large chromosomal region associated with traits. Second, gene modules significantly associated with traits are selected. Then, the prior knowledge guided eQTL mapping method is performed on selected gene modules. In the second step, weighted correlation network analysis (WGCNA) is used for finding clusters (modules) of highly correlated genes [14]. Gene modules significantly associated with traits are selected. For multiple correlated genes in selected gene modules, we used the multiple-response linear regression model LassoM [15] for eQTL mapping. Throughout this article, we use the letter “M” following the original model to represent multi-response model. Consider K genes Y= [ Y 1,…,Y ]∈R for S samples, a linear regression model for the functional mapping from M SNPs to K gene is given as Where β 0 is a vector, β∈R is a coefficient matrix, and X∈R . The objective function of LassoM: LassoM is the linear regression model which gives the minimum mean cross-validated error. We applied LassoMP by using prior knowledge on the basic model LassoM to identify candidate genes. We use letter “P” to represent prior knowledge. LassoMP uses prior knowledge in the multi-response linear regression model. From previous QTL or eQTL mapping, we learn some SNPs are strong regulators for a trait. Each SNP has a penalty factor in LassoM. Let p ∈ [ 0,1] be the penalty factor for jth SNP. For a particular SNP found from experimental results or QTLs identified from multiple experiments, their evidence is considered strong and thus p =0 (no penalty). Otherwise, we can set p =0.5 if the evidence is derived from computational result only. Also, the elastic-net penalty α is used to get a linear combination of ∥β ∥1 and . The objective function becomes: From this model, we can derive three models: LassoM, multi-response ridge (RidgeM) and multi-response elastic net(elasticM), and they differ in elastic-net penalty α: Lasso (α=1), ridge (α=0) and elastic net (α=0.5). They can be combined with prior knowledge p to generate their prior knowledge on multi-response models: LassoMP, RidgeMP and elasticMP.

Results

Simulation study

We performed the simulation study to compare LassoM and LassoMP with two other multi-task Lasso methods GFLasso [6] and FMPR [7]. GFLasso and FMPR are implemented in the R package FMPR. To demonstrate the effect of using prior knowledge, we also compared LassoM and LassoMP with RidgeM, RidgeMP, elasticM, and elasticMP. They are implemented in the R package glmnet [16]. In RidgeMP, LassoMP and elasticMP, the penalty factor is set as p =0.

Simulation data

The performance of these eight models are compared in four setups. We set the number of samples N=50, vary the number of predictors X as 100 and 500, and the number of response variables Y as 10 and 20. We generated 30 datasets for each setups and compared the average performance of these models on the generated data. The simulation data is generated using the same method in [8]. The correlation between genetic markers and between genes are simulated. We compared the performance of these models using the root-mean-squared errors (RMSE), areas under the precision and recall curve (AUC), and degree of freedom (DF). RMSE and AUC were used to compare the performance of regression models in [8, 9]. We also used the DF since it indicates the number of predictors in the regression model. In eQTL mapping, usually a small number of genetic markers are associated with genes, so lower DF means less number of genetic markers in the model. The model with lower RMSE, higher AUC, and a lower DF are preferred. For each of the 30 datasets in four setups, cross-validation is performed on eight models and the optimal parameters are chosen, the models based on the optimal parameters are used to calculate RMSE, AUC and DF using the R package ROCR [17].

Simulation results

The results of simulation study are shown in Fig. 1. Among eight models, LassoMP outperforms others in RMSE and DF, while elasticMP reaches the best performance in AUC. Specifically, LassoMP and LassoM outperform GFLasso and FMPR in RMSE and DF, and LassoMP performs better than FMPR but worse than GFLasso in AUC, but LassoM performs worse than GFLasso and FMPR in AUC.
Fig. 1

The performance of eight multi-response models in simulation study

The performance of eight multi-response models in simulation study Interestingly, the DFs of GFLasso, RidgeM and RidgeMP are always equal to the number of predictors, which means they used all predictors in the regression models. For eQTL mapping, if all genetic markers are included in a linear regression model, it might explain a large proportion of response and performs better in terms of AUC, but it violates the assumption that genes are regulated by a small number of genetic markers. Also, the number of predictors used in GFLasso, RidgeM, RidgeMP, elasticM and elasticMP are far more than the number of samples, which means these models are over-fitting. In all four setups, the number of predictors in LassoMP, LassoM and FMPR are always less than the number samples, and LassoMP achieved the lowest DF. LassoM and LassoMP are sparse models and thus need less numbers, yet most significant predictors as compared to other six models. Comparing each pair of LassoMP vs. LassoM, RidgeMP vs. RidgeM, and elasticMP vs. elasticM, prior knowledge models reached better performance in AUC, RMSE and DF. It indicates that prior knowledge reduced cross-validation errors, predict accurately the associations between predictors and response variables using less number of predictors. LassoMP achieved the lowest average RMSE and lowest DF among eight models in four setups. LassoMP reaches lower RMSE and comparable AUC with much less number of predictors than other models. From the simulation study, we conclude that LassoMP is a better model than others. We also compared the computational time of eight models used in the simulation study. The average computational time of these models in five-fold cross validation of four experimental set ups are calculated (Table 1). LassoM, LassoMP, RidgeM, RidgeMP, elasticM, elasticMP are very efficient in all conditions. FMPR and GFLasso are more computationally expensive, and the runtime of GFLasso increases dramatically with the increase of number of response variables and predictors.
Table 1

The average computational time of eight models in four simulation set ups

ModelK=10 J=100K=20 J=100K=10 J=500K=20 J=500
(seconds)(seconds)(seconds)(seconds)
FMPR22.3461.9872.96232.90
GFLasso113.38496.00947.283776.94
RidgeM0.440.820.791.67
RidgeMP0.480.870.951.70
elasticM0.470.890.621.67
elasticMP0.480.950.611.70
LassoM0.480.980.561.42
LassoMP0.501.000.561.51
The average computational time of eight models in four simulation set ups

Stem rust case study 1

In this study, we aim to perform prior knowledge guided eQTL mapping on a barley data set to identify candidate genes underlying resistance to the Puccinia gramins f.sp. tritici in barley. Rpg1 has been identified as the stem rust resistance gene in barley for many stem rust pathogen races [18].

Dataset

Our study used genotyptic and phenotypic data, and a genetic map of two barley populations “Steptoe” and “Morex” generated from [10]. The barley stem rust data sets were downloaded from Gene Network [19]. Barley phenotypic data were generated from a population of 150 F1-derived doubled haploid (DH) lines derived from the cross of Steptoe X Morex (St/Mx). Stem rust infection type was measured using numeric infection type scores [10]. The gene expression experiments were performed on a set of 139 lines of embryo-derived tissues in these 150 lines of barley [19]. The gene expression data set was derived from the Affymetrix Barley1 GeneChip, which contains 22841 probe sets. The barley genotype data set had 842 SNPs in each of 150 St/Mx DH lines. We imputed the missing data points in genotype data using PHASE [20]. After removing co-segregated SNPs, we collected 413 SNP markers from the genotype data. We removed 12 samples from genotype data and one sample from the gene expression data because of too many missing values in either genotype or gene expression data. The gene expression data set contains 22841 probe sets and 138 samples. More details of the original barley data can be found in [10].

QTL mapping

To identify candidate genes for resistance to stem rust, we used our three-step method: QTL mapping, gene module selection, and prior knowledge guided eQTL mapping. QTL mapping is performed using the R/qtl R package [21]. All QTLs with LOD (logarithm (base 10) of odds) score at p-value <0.05 level in 1000 permutation tests are identified as QTLs from linkage mapping (Table 2). The only QTL revealed for stem rust infection type 0, 1, and 3 is located on chromosome 7H 0cM, which is co-located with Rpg1 SNP marker. One QTL on chromosome 2 at 49.3cM, co-located with ABC01899-1-1-301 SNP marker, is identified to be associated with stem rust infection type 2. The Rpg1 locus identified using linkage mapping in this data set coincides with the major stem rust resistance locus [18].
Table 2

QTLs for stem rust infection types

TraitSNPChromosomeCentimorganLOD score
Stem Rust Type 0Rpg17046.14
Stem Rust Type 1Rpg17059.57
Stem Rust Type 2ABC01899-1-1-301249.33.29
Stem Rust Type 3Rpg170106.32
QTLs for stem rust infection types

Gene module selection

Seventy-eight gene co-expression modules were generated using WGCNA. To summarize the gene expression profiles of the highly correlated genes inside a module, module eigengene (ME) is calculated based on the eigenvector of the first principle component of all genes in the module. Then, the correlation between each of stem rust infection type 0, 1, 2, 3 and each ME is calculated. If we consider stem rust infection type 0,1,2,3 as independent tests, which is a strict requirement, any module with p-value is considered as significantly correlated with traits. Three modules Plum1, Skyblue and Saddlebrown (Table 3) are significantly associated with stem rust infection type 0, 1, and 3.
Table 3

Gene modules significantly associated with stem rust infection types

TraitGene modulesCorrelation coefficient (p-value)
Infection Type 0Plum10.72 (2e-23)
Saddlebrown0.4 (9e-7)
Skyblue-0.7 (1e-21)
Infection Type 1Plum10.79 (3e-30)
Saddlebrown0.45 (3e-8)
Skyblue-0,72 (1e-23)
Infection Type 2None
Infection Type 3Plum1-0.81 (1e-32)
Saddlebrown-0.45 (2e-8)
Skyblue0.76 (3e-27)
Gene modules significantly associated with stem rust infection types

Prior knowledge guided eQTL mapping

At first, eQTL mapping between genes in selected gene modules and all SNP markers was performed using the basic model LassoM. Most genes in Plum1 and Skyblue gene modules both showed strong correlations with markers Rpg1 (7H@0cm) and Sft (7H@0.7cM), but genes in the Saddlebrown module had correlations with markers on 7H 28cM. The SNP markers for genes in the Plum1 and Skyblue modules overlap with the QTL identified in QTL mapping. To evaluate LassoM and LassoMP, we compared them in terms of model selection measures: mean-squared error (MSE), and proportion of variance explained, and biological meaning measure: proportion of cis eQTL. The low mean-squared error or high proportion of variance explained indicates a better linear regression model. eQTLs that map to the approximate location of their gene-of-origin are referred to as cis eQTLs. In contrast, those that map far from the location of their gene-of-origin gene, often on different chromosomes, are referred to as trans eQTLs. The higher proportion of cis eQTL means the model identifies more eQTLs with SNPs located near associated genes. Usually the proportion of cis eQTL is used for measuring the biological meaning of eQTL mapping algorithms [2, 9]. LassoM and LassoMP were compared on the same three selected gene modules: Plum1, Skyblue and Saddlebrown. The prior knowledge used in LassoMP on selected gene modules could be markers close to QTLs or eQTLs. Since Rpg1 is the QTL marker and also one of eQTL markers identified in Plum1 and Skyblue, we use Rpg1 as the prior knowledge for these two gene modules. For Saddlebrown, we used iEst5, 452-498, and 2124-984 SNP markers as the prior knowledge to guide the LassoMP process. The comparison between LassoM and LassoMP is performed in MSE by running 30 times five-fold cross validation (Fig. 2). The results show that LassoMP significantly outperforms LassoM on these three selected gene modules.
Fig. 2

Comparison of LassoM and LassoMP on Plum1 (a), Skyblue (b), and Saddlebrown (c) gene modules

Comparison of LassoM and LassoMP on Plum1 (a), Skyblue (b), and Saddlebrown (c) gene modules Next, we compare LassoM with LassoMP in terms of Mean-Squared Error and number of predictors (degree of freedom) in the linear regression models (Table 4). LassoM has lower mean-sqaured error. In all three gene modules, LassoMP only includes less then ten genetic markers in the model. It indicates that LassoMP achieves lower MSE with a much lower number of predictors in the model compared to LassoM. LassoMP has lower proportion of variance explained than that of LassoM because it uses less number of predictors. LassoMP are a better model than LassoM in terms of model selection.
Table 4

The comparison of four methods on three gene modules

Gene modulesMethodMean# ofProportion# of# of eQTLsProportion
Squaredpredictorsof varianceeQTLswith knownof cis eQTLs
Errorexplained(%)gene locations(%)
Plum1LassoM17504112356.016216917.39
LassoMP1639614952.132432729.63
SkyblueLassoM29181514752.651269479.57
LassoMP2801753236.6854450.00
SaddlebrownLassoM29198701446.733782821.43
LassoMP2675651347.238121100.00
The comparison of four methods on three gene modules LassoMP are compared to LassoM in terms of the biological meaning (Table 4). The proportion of cis eQTLs identified from these four methods is used to measure the biological meaning. Less than 40% of probe sets in the Barley1 Affy platform are mapped to physical positions. We use 15 cM as the threshold for cis eQTL. The proportion of cis eQTLs from eQTL mapping of Plum1, Skyblue, Saddlebrown modules are shown in Table 4. LassoMP has a higher proportion of cis eQTLs than LassoM. LassoMP is a better model than LassoM in terms of model selection and biological meaning. The cis eQTLs identified from LassoMP on Plum1, Skyblue and Saddlebrown gene modules are presented in Table 5. Probes with cis eQTLs and high correlation coefficients are potential candidate genes for resistance to stem rust, such as Contig3140_at, Contig6347_at, Contig2657_at, HVSMEn0014H06r2_s_at in Plum1, and Contig12634_at, Contig6348_at, Contig18035_at, Contig3141_at in Skyblue, and Contig40_x_at, Contig3269_at, Contig5614_s_at, Contig2957_at, and Contig4028_x_at in Saddlebrown.
Table 5

cis eQTLs identified in in Plum1, Skyblue and Saddlebrown modules

ModulesProbe setsChrcMSNPChrcM
Plum1AF509747.1_at70.9Rpg170.0
Contig11996_s_at712.7Rpg170.0
Contig9996_at70.4Rpg170.0
SkyblueContig14185_at70.6Rpg170.0
Contig26418_at70.2Rpg170.0
SaddlowbrownContig10289_at724.52124-984729.2
Contig11481_at722.92124-984729.2
Contig11570_at725.92124-984729.2
Contig13623_at727.62124-984729.2
Contig18611_at729.82124-984729.2
Contig5613_at722.72124-984729.2
Contig6931_at724.22124-984729.2
Contig10289_at724.5452-498728.5
Contig11481_at722.9452-498728.5
Contig11570_at725.9452-498728.5
Contig13623_at727.6452-498728.5
Contig18611_at729.8452-498728.5
Contig5613_at722.7452-498728.5
Contig6931_at724.2452-498728.5
Contig10289_at724.5iEst5716.8
Contig11481_at722.9iEst5716.8
Contig11570_at725.9iEst5716.8
Contig13623_at727.6iEst5716.8
Contig18611_at729.8iEst5716.8
Contig5613_at722.7iEst5716.8
Contig6931_at724.2iEst5716.8
cis eQTLs identified in in Plum1, Skyblue and Saddlebrown modules

Validation

We used the alignment between the Affymetrix GeneChip Barley Genome Array and high confidence barley ensemble gene IDs provided by Ensemble Plants [22]. Based on these ensemble gene IDs, the GO annotations are retrieved from Biomart [23]. In each gene module, the GO enrichment analysis is performed using hyperGTest function in GOstats [24]. Enriched GO terms in Plum1 (Table 6) and in Skyblue modules (Additional file 1) reveals defense related functions. Plum1 includes three defense response probe sets: AF509747.1_at, HVSMEl0003E22r2_at and HS16G07u_at. AF509747.1_at represents the Rpg1 gene [10], which is a gene specific to stem rust in barley. HVSMEl0003E22r2_at and HS16G07u_at both link to the disease resistance related function–ADP binding. The Plum1 gene module is enriched with defense response, ADP binding, and cell wall related fructosyltransferase activity.
Table 6

Enriched GO terms of matched ensemble genes from probe sets in Plum1 module

GOGOID PvalueTerm
MFGO:00472070.0011,2-beta-fructan 1F-fructosyltransferase activity
MFGO:00507380.002fructosyltransferase activity
MFGO:00435310.002ADP binding
MFGO:00905990.007alpha-glucosidase activity
MFGO:00045640.007beta-fructofuranosidase activity
MFGO:00045750.007sucrose alpha-glucosidase activity
MFGO:00159260.009glucosidase activity
MFGO:00055160.029calmodulin binding
MFGO:00700010.039aspartic-type peptidase activity
MFGO:00041900.039aspartic-type endopeptidase activity
BPGO:00704170.002cellular response to cold
BPGO:00346050.004cellular response to heat
BPGO:00466850.004response to arsenic-containing substance
BPGO:00069860.013response to unfolded protein
BPGO:00359670.013cellular response to topologically incorrect protein
BPGO:00346200.013cellular response to unfolded protein
BPGO:00069520.020defense response
BPGO:00069500.022response to stress
CCGO:00095060.024plasmodesma
CCGO:00300540.024cell junction
CCGO:00550440.024symplast
CCGO:00059110.024cell-cell junction
CCGO:00057830.038endoplasmic reticulum
Enriched GO terms of matched ensemble genes from probe sets in Plum1 module The Skyblue and Plum1 modules are associated to the same genetic marker Rpg1, but they show distinct functional annotations. Two probe sets Contig8651_at and Contig8651_s_at link to ADP-sugar diphosphatase activity, ADP-ribose pyrophosphohydrolase activity, ADP-glucose pyrophosphohydrolase activity, and one probe set Contig2598_s_at shows response to sucrose, glucose, and fructose. Skyblue is also enriched with transmembrane transport.

Stem rust case study 2

The second case study on stem rust resistance investigated the resistance to stem rust pathogen Ug99 in progeny of Q21861 and SM89010 [11]. Previous studies identified a major QTL Rpg-TTKSK on 5H in seedlings samples and other loci in adult plants [25]. The experiment was designed to examine the qualitative and quantitative resistance in seedlings and adult plants in response to Pgt race TTKSK. Crossing Q21861 and SM89010 generated 75 double haploid lines and each line has one sample treated with Pgt race TTKSK-inoculation (TTKS) and another sample treated with mock-inoculation (MOCK). The trait data includes infection types and infection severity in seedlings and adult plants. The gene expression data contains 22841 probesets from four biological replicates for each parental line, and 75 TTKS samples and 75 MOCK samples in QSM lines. Genotypic data uses the QSM genetic map from transcript-derived markers (TDMs), which include 378 markers. The gene expression data was downloaded from GEO [26], and phenotype and genotype data were downloaded from [11]. In the second case study, we use the same three-step prior knowledge guided eQTL mapping method as above to identify candidate genes. Phenotype data used for QTL mapping was split into four categories: infection frequency (IF), principal component (PC), severity (SEV), and lesion size (LES), and infection coefficient (IC). IF and PC traits are phenotype data for seedling, and other two for adult plants. PC1, 2, 3, and 4 are derived from IF0, 1, 2, 3 for infection types 0, 1, 2, 3, and infection coefficient are derived from severity and lesion size. The genotypic data has 378 markers on 75 DH lines. R/qtl [21] was used for linkage mapping between phenotype and genotype (Table 7). From seedling samples, 5H@147cm is a major QTL, and some other QTLs are identified from 3H. From adult plants, 5H@141cm is the main QTL for SEV, LES and IC, a few QTLs on 7H and 2H are also identified.
Table 7

QTLs for stem rust infection in QSM population

TraitChromosomeCentimorganLOD score
IF for Infection Type 03H04.06
IF for Infection Type 05H14719.36
IF for Infection Type 15H14716.5
IF for Infection Type 25H1453.28
IF for Infection Type 33H6.84.49
IF for Infection Type 35H146.821.97
PC13H6.84.79
5H146.831.39
PC2NULL
PC3NULL
PC4NULL
SEV 7-Oct-085H141.45.73
7H76.83.49
SEV 17-Oct-082H41.52.97
5H141.46.31
SEV 10-Nov-085H1414.18
LSE 7-Oct-085H1417.3
LSE 17-Oct-083H2.723.30
5H145.425.72
LSE 10-Nov-085H72.24.71
IC 7-Oct-085H1416.54
IC 17-Oct-085H1416.88
IC 10-Nov-085H1414.2
QTLs for stem rust infection in QSM population Since this dataset used TTKS and MOCK samples in pairwise experimental design, we first identified differentially expressed genes (DEG) from all samples. At first, we identified 362 DEGs from Q21861 and 4 DEGs from SM89010, merged them into 366 DEGs in parental lines with p-value <0.05. Clearly, Q21861 contributes much more DEGs than SM89010. Also, we identified 8460 DEGs from progeny among 75 paired samples with p-value <0.05. In total, 8487 DEGs were identified from parental lines and progeny. WGCNA was used to generate gene modules from the 8487 DEGs which include 154 samples. These 154 samples include four samples from parental lines and 150 samples from 75 progeny lines in TTKS and MOCK conditions. Through the hierarchical clustering, 78 modules were generated from the DEGs. The gene modules that are significantly associated with a trait were selected. At first, gene expression values were transformed (TTKS gene expression values minus MOCK gene expression values) in 75 TTKS samples, and gene modules significantly associated with each trait were identified respectively. The correlation coefficient between a gene module and a phenotype is considered significant with p-value <0.05. Four gene modules: Saddlebrown, Darkgrey, Darkmagenta, and Blue are significantly associated with infections in seeding and five gene modules Royalblue, Lightyellow, Yellow, Sienna3, and Darkgreen significantly associated with infections in adult plants (Table 8).
Table 8

Gene modules significantly associated with infection types in seedlings and adult plants

TraitGene modulesCorrelation coefficient (p-value)
IFIF0, IF3Darkgrey-0.23(0.0)
IF2Saddlebrown-0.22(0.06)
PCPC1Darkgrey-0.24(0.04)
PC2Saddlebrown0.31(0.007)
PC3Darkmagenta0.22(0.006)
PC4Darkmagenta0.23(0.05)
Blue0.24(0.007)
SEVSEV20081007Royalblue0.25(0.03)
Lightyellow0.27(0.02)
SEV20081110Royalblue0.25(0.03)
Lightyellow0.27(0.02)
SEV20081110Royalblue0.25(0.03)
LSELSE20081007None
LSE20081017Yellow0.27(0.02)
LSE20081110Sienna30.24(0.04)
Darkgreen-0.31(0.006)
ICIC20081007Royalblue0.25(0.03)
Lightyellow0.29(0.01)
IC20081017Royalblue0.22(0.05)
Lightyellow0.28(0.02)
IC20081110None
Gene modules significantly associated with infection types in seedlings and adult plants The Darkgrey and Saddlebrown modules are significantly associated with IF and PC. Darkgrey corresponds to PC1 and Saddlebrown associates with PC2. This indicates these two gene modules capture the main infection factors. Darkgrey is negatively associated with IF0 and IF3, and Saddlebrown is negatively associated with IF2. Since IF2 and IF3 represent high infection severity, the Darkmagenta and Blue modules correspond to PC3 and PC4, and it indicates that they capture minor factors for infection. Four gene modules are positively associated with SEV, LSE and IC, but only the Darkgreen is negatively associated with LSE. We identified eQTLs in two gene modules, Saddlebrown and Darkgreen, using LassoM. Saddlowbrown is negatively associated with IF2 and Darkgreen with LSE. eQTL mapping is performed on Darkgrey, Saddlowbrown, Darkgreen and Yellow using LassoM and LassoMP. Subsequently, we compared LassoM and LassoMP on four gene modules: Darkgrey, Saddlebrown, Darkgreen, and Yellow using 5-fold cross validation (Fig. 3). Prior knowledge for each gene module using LassoMP is selected from candidate markers identified using LassoM, and the marker with lowest MSE are selected as the prior knowledge. In the Darkgrey module, 5H@144cm reached the lowest MSE, and overlaps with the QTL (5H Rpg-TTKSK QTL region) identified from QTL mapping (Table 7). The Darkgrey module corresponds to PC1 of infection types, which captures the main factor of infection in seedlings. In the Yellow module, LassoMP with genetic marker at 2H@153.5cm has a lower MSE. In other two gene modules Saddlowbrown and Darkgreen, LassoMP reaches lower MSE comparing to LassoM.
Fig. 3

The comparison of LassoM and LasooMP on Darkgrey (a), Saddlebrown (b), Darkgreen (c), and Yellow (d) gene modules

The comparison of LassoM and LasooMP on Darkgrey (a), Saddlebrown (b), Darkgreen (c), and Yellow (d) gene modules We used LassoMP as the mapping method on these four gene modules. Top 10 eQTLs identified from Darkgrey, Saddlowbrown, Darkgreen and Yellow gene modules are listed in Table 9.
Table 9

Top 10 eQTLs identified using LassoMPs on Darkgrey, Saddlowbrown, Darkgreen and Yellow gene modules

ProbesSNPLocationCorrelation
coefficient
DarkgreyContig3156_s_atHZ58F11r_at5H@145.4cm0.505
Contig3155_s_atHZ58F11r_at5H@145.4cm0.350
Contig1385_atHZ58F11r_at5H@145.4cm0.264
Contig3157_atHZ58F11r_at5H@145.4cm0.262
Contig19929_atHZ58F11r_at5H@145.4cm0.259
Contig3151_atHZ58F11r_at5H@145.4cm0.256
Contig11361_atHZ58F11r_at5H@145.4cm0.196
Contig8307_s_atHZ58F11r_at5H@145.4cm0.147
Contig6701_s_atHZ58F11r_at5H@145.4cm0.143
Contig5469_atHZ58F11r_at5H@145.4cm0.108
SaddlebrownContig2210_atContig13249_at7H@47.1cm-0.642
Contig2212_s_atContig13249_at7H@47.1cm-0.612
Contig2209_atContig13249_at7H@47.1cm-0.411
Contig2214_s_atContig13249_at7H@47.1cm-0.376
HVSMEm0003C15r2_s_atContig13249_at7H@47.1cm-0.309
Contig1637_s_atContig13249_at7H@47.1cm-0.288
Contig1637_atContig13249_at7H@47.1cm-0.284
Contig2787_s_atContig13249_at7H@47.1cm-0.284
Contig13350_atContig13249_at7H@47.1cm-0.257
EBem10_SQ002_I10_s_atContig13249_at7H@47.1cm-0.170
DarkgreenContig2170_atContig4572_at7H@83.5cm-0.807
Contig8722_atContig4572_at7H@83.5cm-0.199
Contig11240_atContig4572_at7H@83.5cm-0.149
Contig21643_atContig4572_at7H@83.5cm-0.149
Contig3886_atContig4572_at7H@83.5cm-0.144
Contig23697_atContig4572_at7H@83.5cm-0.128
Contig13049_atContig4572_at7H@83.5cm-0.106
Contig13799_atContig4572_at7H@83.5cm-0.098
Contig20_atContig4572_at7H@83.5cm-0.089
Contig1315_s_atContig4572_at7H@83.5cm0.036
YellowContig8002_atContig1791_x_at2H@153.5cm0.288
Contig20602_atContig1791_x_at2H@153.5cm0.285
Contig11328_atContig1791_x_at2H@153.5cm0.266
Contig14754_atContig1791_x_at2H@153.5cm0.227
Contig23817_atContig1791_x_at2H@153.5cm0.205
Contig8052_atContig1791_x_at2H@153.5cm0.200
Contig23584_atContig1791_x_at2H@153.5cm0.197
HV_CEa0006L03r2_atContig1791_x_at2H@153.5cm0.190
Contig10957_atContig1791_x_at2H@153.5cm0.188
HA14H02r_atContig1791_x_at2H@153.5cm0.185
Top 10 eQTLs identified using LassoMPs on Darkgrey, Saddlowbrown, Darkgreen and Yellow gene modules The enriched GO terms in Saddlebrown (Table 10), Darkgrey, Darkgreen gene modules (Additional file 2) are presented. Five probe sets (Contig2209_at, Contig2210_at, Contig2214_s_at, Contig2211_at, Contig2212_s_at) in the Saddlebrown gene module are mapped to the same UniProt gene Pathogenesis-related protein PRB1-2. Mapped ensemble genes are enriched with defense response to fungus and immune response.
Table 10

Enriched GO terms of matched ensemble genes from probe sets in Saddleborwn module

GOGOID p–valueTerm
MFGO:00084220.001beta-glucosidase activity
MFGO:00429730.001glucan endo-1, 3-beta-D-glucosidase activity
MFGO:00045530.001hydrolase activity, hydrolyzing O-glycosyl compounds
MFGO:00167980.001hydrolase activity, acting on glycosyl bonds
MFGO:00159260.003glucosidase activity
MFGO:00167870.030hydrolase activity
BPGO:00098160.004defense response to bacterium, incompatible interaction
BPGO:00098170.005defense response to fungus, incompatible interaction
BPGO:00427420.019defense response to bacterium
BPGO:00508320.021defense response to fungus
BPGO:00096170.021response to bacterium
BPGO:00059750.023carbohydrate metabolic process
BPGO:00098140.028defense response, incompatible interaction
BPGO:00096200.029response to fungus
BPGO:00450870.041innate immune response
BPGO:00069550.043immune response
BPGO:00023760.046immune system process
CCGO:00055760.002extracellular region
Enriched GO terms of matched ensemble genes from probe sets in Saddleborwn module eQTLs for Darkgrey gene modules overlaps with the major Ug99 QTL Rpg-TTKSK locus at 5H. The Darkgrey gene module is enriched with defense related oxidoreductase activity. The Darkgreen module is the only gene module which contains disease resistance genes in adult plants. The major genetic marker for the Darkgreen gene module locates at 7H@83.5cm, which is the major hotspot of differentially expressed genes. The Darkgreen module is enriched with transporter activity, water channel activity, and oxygen binding.

Discussion

Our prior knowledge guided eQTL mapping method is different from other eQTL mapping methods in three aspects: 1) using data-driven prior knowledge, 2) selecting gene modules for eQTL mapping, 3) identifying major markers for the selected gene modules. We use the data-driven prior knowledge to identify candidate genes, which is applicable to a wide range of species. Knowledge driven methods, such as Lirnet, are heavily dependent on rich biological knowledge in model organisms. In our method, modules are generated in WGCNA using hierarchical clustering, which is an unsupervised method. Also, QTL and eQTL mapping are performed purely on genotypic, transcript profiling and phenotypic data. Any biological knowledge of these genetic markers and genes are not used in the mapping process. A few studies used biological knowledge in eQTL mapping [5, 8, 9]. Their methods are applicable for resource rich organisms such as yeast, human, mouse and Arabidopsis. For other species, gene functional annotations are very limited. In our method, selection of candidate genes takes evidence in trait-gene associations and gene-SNP associations. For example, some gene expression values do not have strong associations with traits, such as AF509747.1_at, but have strong associations with SNP markers. Our method is able to find the robust gene modules significantly associated with traits and SNP markers simultaneously. One of main features of our method is to identify major markers for the selected gene modules. In simulation study, we observe that LassoM and LassoMP are sparse models comparing withe other six models. They used much less number of predictors in the regression models. The major difference between LassoMP and LassoM is LassoMP uses the prior knowledge. The influence of prior knowledge on the performance of eight models in the simulation study is presented in Additional file 3. We proposed to use LassoMP further reduce the number of predictors. LassoMPs is a sparse solution for LassoMP, and the comparison between LassoMPs and LassoMP is available in Additional file 4. From the perspective of model selection, adding prior knowledge reduces the mean-squared error and keep the same proportion of phenotype variance. Moreover, these major regulators are shown to be functionally relevant to rust infection.

Conclusions

We proposed a new prior knowledge guided eQTL mapping methods for identifying candidate genes. Our method includes three steps: 1) identifying QTLs from QTL mapping; 2) generating and selecting gene modules; 3) prior knowledge guided eQTL mapping. In simulation study, we compared the prior knowledge guided methods LassoMP with other seven multi-task algorithms. The prior knowledge guided eQTL mapping methods outperformed those without prior knowledge. Using LassoMP, in the first barley case study, we identified three gene modules and a few genes as candidate genes for resistance to stem rust, and one of them is confirmed as stem rust resistance gene [10]. In the second case study for stem rust Ug99 resistance in QSM population, we identified four gene modules significantly associated with infection in either seedling or adult plants. One of the gene modules is co-located with the major QTL Rpg-TTKSK for stem rust infection. Another gene module contains four probes mapped to the same disease resistance gene. The proposed prior knowledge guided eQTL mapping method is applicable for different experimental design and a variety of species. The first case study used all genes and the second case study used a subset of differentially expressed genes on pairwise samples. The identified modules and candidate genes are functionally relevant to rust resistance.
  21 in total

1.  R/qtl: QTL mapping in experimental crosses.

Authors:  Karl W Broman; Hao Wu; Saunak Sen; Gary A Churchill
Journal:  Bioinformatics       Date:  2003-05-01       Impact factor: 6.937

Review 2.  Genome-wide association studies for common diseases and complex traits.

Authors:  Joel N Hirschhorn; Mark J Daly
Journal:  Nat Rev Genet       Date:  2005-02       Impact factor: 53.242

3.  ROCR: visualizing classifier performance in R.

Authors:  Tobias Sing; Oliver Sander; Niko Beerenwinkel; Thomas Lengauer
Journal:  Bioinformatics       Date:  2005-08-11       Impact factor: 6.937

4.  Integrating large-scale functional genomic data to dissect the complexity of yeast regulatory networks.

Authors:  Jun Zhu; Bin Zhang; Erin N Smith; Becky Drees; Rachel B Brem; Leonid Kruglyak; Roger E Bumgarner; Eric E Schadt
Journal:  Nat Genet       Date:  2008-06-15       Impact factor: 38.330

5.  Global eQTL mapping reveals the complex genetic architecture of transcript-level variation in Arabidopsis.

Authors:  Marilyn A L West; Kyunga Kim; Daniel J Kliebenstein; Hans van Leeuwen; Richard W Michelmore; R W Doerge; Dina A St Clair
Journal:  Genetics       Date:  2006-12-18       Impact factor: 4.562

6.  Regularization Paths for Generalized Linear Models via Coordinate Descent.

Authors:  Jerome Friedman; Trevor Hastie; Rob Tibshirani
Journal:  J Stat Softw       Date:  2010       Impact factor: 6.440

7.  Quantitative and qualitative stem rust resistance factors in barley are associated with transcriptional suppression of defense regulons.

Authors:  Matthew J Moscou; Nick Lauter; Brian Steffenson; Roger P Wise
Journal:  PLoS Genet       Date:  2011-07-28       Impact factor: 5.917

8.  Triticeae resources in Ensembl Plants.

Authors:  Dan M Bolser; Arnaud Kerhornou; Brandon Walts; Paul Kersey
Journal:  Plant Cell Physiol       Date:  2014-11-27       Impact factor: 4.927

9.  The BioMart community portal: an innovative alternative to large, centralized data repositories.

Authors:  Damian Smedley; Syed Haider; Steffen Durinck; Luca Pandini; Paolo Provero; James Allen; Olivier Arnaiz; Mohammad Hamza Awedh; Richard Baldock; Giulia Barbiera; Philippe Bardou; Tim Beck; Andrew Blake; Merideth Bonierbale; Anthony J Brookes; Gabriele Bucci; Iwan Buetti; Sarah Burge; Cédric Cabau; Joseph W Carlson; Claude Chelala; Charalambos Chrysostomou; Davide Cittaro; Olivier Collin; Raul Cordova; Rosalind J Cutts; Erik Dassi; Alex Di Genova; Anis Djari; Anthony Esposito; Heather Estrella; Eduardo Eyras; Julio Fernandez-Banet; Simon Forbes; Robert C Free; Takatomo Fujisawa; Emanuela Gadaleta; Jose M Garcia-Manteiga; David Goodstein; Kristian Gray; José Afonso Guerra-Assunção; Bernard Haggarty; Dong-Jin Han; Byung Woo Han; Todd Harris; Jayson Harshbarger; Robert K Hastings; Richard D Hayes; Claire Hoede; Shen Hu; Zhi-Liang Hu; Lucie Hutchins; Zhengyan Kan; Hideya Kawaji; Aminah Keliet; Arnaud Kerhornou; Sunghoon Kim; Rhoda Kinsella; Christophe Klopp; Lei Kong; Daniel Lawson; Dejan Lazarevic; Ji-Hyun Lee; Thomas Letellier; Chuan-Yun Li; Pietro Lio; Chu-Jun Liu; Jie Luo; Alejandro Maass; Jerome Mariette; Thomas Maurel; Stefania Merella; Azza Mostafa Mohamed; Francois Moreews; Ibounyamine Nabihoudine; Nelson Ndegwa; Céline Noirot; Cristian Perez-Llamas; Michael Primig; Alessandro Quattrone; Hadi Quesneville; Davide Rambaldi; James Reecy; Michela Riba; Steven Rosanoff; Amna Ali Saddiq; Elisa Salas; Olivier Sallou; Rebecca Shepherd; Reinhard Simon; Linda Sperling; William Spooner; Daniel M Staines; Delphine Steinbach; Kevin Stone; Elia Stupka; Jon W Teague; Abu Z Dayem Ullah; Jun Wang; Doreen Ware; Marie Wong-Erasmus; Ken Youens-Clark; Amonida Zadissa; Shi-Jian Zhang; Arek Kasprzyk
Journal:  Nucleic Acids Res       Date:  2015-04-20       Impact factor: 16.971

10.  Statistical estimation of correlated genome associations to a quantitative trait network.

Authors:  Seyoung Kim; Eric P Xing
Journal:  PLoS Genet       Date:  2009-08-14       Impact factor: 5.917

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.