Literature DB >> 23979571

Selecting informative traits for multivariate quantitative trait locus mapping helps to gain optimal power.

Riyan Cheng¹, Justin Borevitz, R W Doerge.

Abstract

A major consideration in multitrait analysis is which traits should be jointly analyzed. As a common strategy, multitrait analysis is performed either on pairs of traits or on all of traits. To fully exploit the power of multitrait analysis, we propose variable selection to choose a subset of informative traits for multitrait quantitative trait locus (QTL) mapping. The proposed method is very useful for achieving optimal statistical power for QTL identification and for disclosing the most relevant traits. It is also a practical strategy to effectively take advantage of multitrait analysis when the number of traits under consideration is too large, making the usual multivariate analysis of all traits challenging. We study the impact of selection bias and the usage of permutation tests in the context of variable selection and develop a powerful implementation procedure of variable selection for genome scanning. We demonstrate the proposed method and selection procedure in a backcross population, using both simulated and real data. The extension to other experimental mapping populations is straightforward.

Entities: Disease Species

Keywords: multitrait mapping; quantitative trait locus (QTL); statistical power; variable selection

Mesh：

Substances：
Genetic Markers

Year: 2013 PMID： 23979571 PMCID： PMC3813856 DOI： 10.1534/genetics.113.155937

Source DB: PubMed Journal: Genetics ISSN： 0016-6731 Impact factor: 4.562

IT is a common practice to collect data on multiple phenotypes when conducting a quantitative trait locus (QTL) mapping experiment. Historically, multiple traits are analyzed separately (referred to as single-trait analysis) (Edwards ; Stuber ; Weller ; Hubner ). Single-trait analysis does not benefit from additional information that can be gained from the correlations between traits. Therefore, multitrait analysis has been advocated in the QTL mapping community for many years (Jiang and Zeng 1995; Korol , 1998; Ronin ; Knott and Haley 2000; Verzilli ). By accounting for information in the residual covariance of certain traits, multitrait analysis has the potential to achieve a higher statistical power for QTL detection and result in more accurate estimates than single-trait analysis (Jiang and Zeng 1995). Furthermore, multitrait analysis allows formal studies of biologically interesting hypotheses such as pleiotropy (Mangin ) and QTL-by-environment interaction (Piepho 2001). As one of the main motivations, multitrait analysis is employed to increase statistical power for QTL detection. Unfortunately, it is not always more powerful than single-trait analysis (Jiang and Zeng 1995; Korol ; Wu ). Jiang and Zeng (1995) and Korol demonstrated that the statistical power of multitrait analysis depends on both the QTL effects and the structure of the residual covariance of the traits. Moreover, in situations where the number of traits is large [e.g., expression QTL (eQTL) analysis], it may not be possible to include all the traits in one analysis, using a usual multivariate approach. The shrinkage technique (Tsai and Chen 2009) can resolve the large p, small n problem but it is computationally intensive and thus may not be feasible in QTL mapping that typically scans a large number of marker loci. A critical question is, Which traits should be analyzed in the multitrait framework? While Ronin focused on multitrait analysis of pairs of traits, Knott and Haley (2000) suggested several considerations for multitrait studies that may be difficult to exercise. To address these issues we propose variable selection (Rencher 1993, 1998) as a strategy to choose a subset of traits for multitrait analysis. The proposed approach makes the most of multitrait analysis in terms of statistical power for QTL detection and is demonstrated for backcross populations, using Hotelling’s T2 statistic (but does not depend on this test statistic), and can be extended to other populations. Since the ultimate goal of QTL mapping is to detect genomic regions that are associated with specific traits or biological processes, the proposed method can provide such information and thus facilitate interpretation of the results.

Materials, Methods, and Results

Real data and preliminary analyses

We considered expression trait (e-trait) data of 211 recombinant inbred lines (RILs) that were derived from two parental inbred Arabidopsis thaliana accessions, Bayreuth-0 (Bay-0) and Shahdara (Sha), by selfing (Loudet ; Kim 2007; West ). Affymetrix technology (Kliebenstein ) was employed to generate the microarray data (available in the ArrayExpress database with query “E-TABM-126”). Ninety-five distributed markers contributed genotypic information across the five chromosomes (West ). The maximum genetic distance between two adjacent markers was 10.944 cM, the minimum was 2.224 cM, and the median was 4.771 cM (Figure S1 in supporting information, File S1). There are >23,000 genes in the repository. Rather than looking at all of them, we focused on the expression transcripts of 16 genes (i.e., 16 e-traits; Table S1 in File S1), which are in a well-studied defense pathway, from a control environment (Wang ). Considerations for choosing a small data set include the following: (a) it is computationally easier to establish a methodology using a relatively small data set; (b) our current proposed method is most suitable for small or moderately large data although strategies can be explored to apply it to very large data (section 9 in File S1); and (c) if we study the whole set, interpretation of results will be of first importance; however, this is beyond the scope of our study. In addition to these 16 e-traits, we also considered the first 50 e-traits in the same repository when we assessed the method we proposed later, using simulations. We first employed a single-trait single-marker approach for analysis of the A. thaliana data. We calculated Hotelling’s T2 test statistic at each of the 95 markers and performed 10,000 permutations of the genotypic data to estimate the 0.05 significance threshold, which was 16.44276, adjusted for all 95 markers and for all 16 e-traits. A marker was declared to have a significant association with an e-trait only if the T2 value was a local maximum along the genome and was equal to or larger than the estimated threshold. If two markers on the same chromosome had a significant association with the trait but the test statistic at any marker between them was not smaller than the smaller test statistic at these two markers by two standard deviations of the null distribution that was estimated by the permutation test, then the marker with the smaller test statistic was ignored. This was to prevent adjacent markers from being identified as QTL purely due to linkage. With these criteria, the single-trait single-marker approach identified 11 markers (At1g11360-4, At1g31580-11, At2g03750-5, At2g14560-2, At2g17240-6, At2g42680-9, At3g61100-9, At5g10380-10, At5g44320-4, At5g45110-11, and At5g48180-4) that were associated with the 16 e-traits at genome-wide significance level 0.05 (Figure 1A; see section 4 in File S1 for more information).

Figure 1

(A–C) Mapping profiles for single-trait analysis of the 16 traits (A) and multitrait analysis of selected traits vs. that of all 16 traits (B) and selected traits (C). The horizontal lines are 0.05 significance thresholds adjusted for all the markers (and all the traits in the case of single-trait analysis). Each vertical section displays one chromosome. C shows if a trait is selected for multitrait analysis (green), if the single-trait mapping curve goes above the threshold line at the locus (yellow), and if both occur (red). We then considered joint analysis of all 16 e-traits and employed a multitrait single-marker approach for the A. thaliana data. We calculated Hotelling’s T2 test statistic at each of the 95 markers and performed 10,000 permutations of the genotypic data to estimate the 0.05 significance threshold, which was 46.33388. With the same criteria as in the single-trait analysis, the multitrait single-marker approach identified 12 markers (At1g11360-4, At1g31580-11, At2g14560-2, At2g17240-6, At2g26640-8, At2g45140-3, At3g10720-6, At3g56360-7, At5g06660-5, At5g24930-6, At5g44320-4, and At5g53940-10) that were associated with the 16 e-traits at genome-wide significance level 0.05 (Figure 1B). Among the identified markers, At2g17240-6, At3g10720-6, At5g24930-6, and At5g44320-4 are at least 10 cM from any of the flanking markers of the 16 e-trait network genes. Two of these 12 markers, At3g10720-6 and At5g24930-6, were not detected when we analyzed the traits separately. We were interested in taking a closer look at markers 10 (At1g31580-11), 27 (At2g14560-2), 42 (At3g10720-6), and 55 (At3g56360-7). Single-trait analysis disclosed only one trait to associate with marker 10 but many traits with marker 27. The multitrait analysis of the 16 e-traits detected a QTL at marker 42 while single-trait analysis did not show any such evidence. Finally, both single-trait and multitrait analyses detected a QTL at marker 55; however, the single-trait mapping curve at this marker barely crossed the threshold line (Figure 1).

Selecting traits for multitrait analysis

The addition of a trait to a multivariate analysis is not always justified in terms of the statistical power for QTL detection. Generally, unique information about QTL affecting a trait is reduced in the presence of other traits. We can use this knowledge to select informative traits for multitrait QTL analysis. In fact, traits can be chosen such that the selected ones collectively contribute most to the test statistic and attain an optimal power for QTL detection. Suppose there are p traits (y1, y2, … , y). Let be Wilks’ Λ, a test statistic commonly used in multivariate hypothesis testing, corresponding to (y1, y2, … , y) and , 0 ≤ k < p. Rencher (1993) shows thatis distributed as , where d.f.H and d.f.E are degrees of freedom for hypothesis and error, respectively. In a backcross or recombinant inbred lines where there are two possible genotypes, Hotelling’s T2-test statistic can take the place of Wilks’ Λ, and is distributed as F1,−−2, where is Hotelling’s statistic based on (y1, y2, … , y) and n is the sample size. The statistic may be used to test whether a trait is redundant in the presence of other traits and to select traits for multitrait analysis. If there is a predefined order in which the traits are tested for association with a marker, the traits can be tested one by one in that order. If, however, there is no such predefined order, a subset of the traits that collectively contribute most to the test statistic (e.g., Hotelling’s T2) can be selected for QTL mapping. In this situation, (1) or (2) does not have the expected F distribution because of selection bias, and therefore we cannot rely on F tests to select traits. Instead, model selection techniques such as stepwise procedures can be employed (Rencher 1998). An entry/stay value can be specified to determine whether a trait should be selected. Alternatively, as demonstrated later we can select a given number of traits that contribute most to the test statistic. Multitrait analysis can be performed on the resulting subset of the traits to test for the association between the marker and the traits. In the case of two traits, Cheng (2007) showed that if a QTL has an effect on trait 1 but no effect on trait 2, then the contribution of trait 2 to Hotelling’s T2 given trait 1 tends to infinity as the residual correlation between the two traits goes to 1 (or −1). In general, we have the following proposition (see section 1 in File S1 for more information). Proposition 1. If a QTL has a nonzero effect on one trait but no effect on another trait, then the power to detect the QTL goes to 100% as the residual correlation between the two traits goes to 1 (or −1). As we will see later, Proposition 1 is very useful for multitrait analysis. In many situations, we may expect a QTL to have no effect, or a relatively negligible effect, on some trait (trait 2) but an intermediate effect on another trait (trait 1). We may not have sufficient power to detect this QTL with single-trait analysis. However, joint analysis of these two traits will have an increase in power to detect it if these two traits are closely correlated. This conclusion is not limited to the case of two traits (see section 9 in File S1 for examples).

Impact of selection bias on type I error rates and statistical power

Selection bias can invalidate significance thresholds that are based on the expected F distribution of (1) or (2) and result in inflated type I error rates even in simple situations where a single marker is tested for QTL. It may be possible to empirically estimate the null distribution of the test statistic, using the permutation test (Churchill and Doerge 1994). There are typically two ways to perform the permutation test: (1) permute the phenotypic data and keep the genotypic data and (2) keep the phenotypic data and permute the genotypic data. We chose the latter here since permuting genotypic data allows the relationship between the trait and covariates (if any) to be retained, which may result in better estimation (O’Gorman 2005; Cheng and Palmer 2013). However, how to actually perform the permutation test in variable selection is not obvious. There are many ways to select traits from the permuted data to estimate significance thresholds. In this study, we investigated four intuitive methods: (A) use the same “best” traits as selected from the original data (i.e., data without permutation), (B) select the same number of best traits as selected from the original data, (C) use the same procedure as selecting best traits from the original data, and (D) select a predefined number of best traits if this number is predefined to select best traits from the original data. We assumed that in methods A, B, and C the number of traits selected from the original data was not predefined and an entry/stay value was specified for a model selection procedure to select an optimal subset of traits.

Type I error rate:

We first investigated the capability of the above four methods to control type I error rates. We employed 1000 simulations to estimate type I error rates. In each simulation, we simulated 16 traits from a multivariate normal distribution whose mean and variance–covariance were respectively equal to the sample mean and variance–covariance, after adjusting for QTL effects at marker 42 (At3g10720-6), of the actual 16 Arabidopsis thaliana e-traits as detailed previously, and used the real genotypic data at marker 42. This marker was among the markers of interest since at this marker multitrait analysis of all 16 traits identified a QTL but single-trait analysis did not (Figure 1). We performed the stepwise backward elimination procedure to select traits. For methods A–C, the entry/stay value (if needed) was 3.886 (= F0.05;1,211−2), and if the resulting subset of traits was empty, we selected the trait with the largest T2. For method D, the predefined number of selected traits was 5. We then performed multitrait analysis of the selected traits (we used single-trait analysis and multitrait analysis interchangeably when there was only one trait of interest). In each of the simulations, we permuted the genotype data 10 times (as in Cheng and Palmer 2013), which resulted in a total of 10,000 permuted data sets, and then applied the four methods to each of the permuted data sets. The thresholds were estimated from these 10,000 permutations for each of these methods. A QTL was claimed if the test statistic exceeded the significance threshold estimated by using each of the four methods at a specified significance level. Table 1 displays the estimated type I error rates and their standard errors. We can see that using the same procedure (methods C and D) for the permuted data as in the analysis of the original data controlled type I error rates at the nominal significance levels, whereas using the same traits (method A) or selecting the same number of traits (method B) as selected from the original data resulted in inflated type I error rates.

Table 1

Estimated type I error rates and standard errors

	Significance level
Method	0.1	0.05	0.01
A	0.552 (0.0157)	0.443 (0.0157)	0.231 (0.0133)
B	0.207 (0.0128)	0.136 (0.0108)	0.041 (0.0063)
C	0.098 (0.0094)	0.051 (0.0070)	0.011 (0.0033)
D	0.116 (0.0101)	0.061 (0.0076)	0.018 (0.0042)

Four methods were implemented to select an optimal subset of traits for multitrait analysis of the permuted data: (A) use the same traits as selected in the analysis of the original data, (B) select the same number of traits as in the analysis of the original data, (C) use the same procedure as selecting traits in the analysis of the original data, and (D) select a predefined number (five) of traits. Standard errors of the estimated type I error rates are given in parentheses.

Statistical power and the number of selected traits:

The above simulation study indicates that we can employ either method C or method D to get a valid significance threshold if variable selection is performed to select traits for multitrait analysis. For method C, the number of selected traits generally depends on both the data and the entry/stay value. In other words, it will differ for either a different entry/stay value or a different data sample. For method D, one question of interest is that given the total number of traits, how the number of selected traits influences the statistical power, and another question is that given the number of selected traits, how the total number of available traits influences the statistical power. To answer these questions, we considered different sets of traits: (1) the 16 e-traits described and analyzed previously, (2) the 16 e-traits plus the first 20 of the additional 50 e-traits mentioned previously, and (3) the 16 e-traits plus all the additional 50 e-traits. We did not directly use these traits. Instead, we simulated traits by taking advantage of the covariance structure as well its relationship with the QTL effects. Specifically, we simulated traits that followed a multivariate normal distribution with mean being equal to one-fifth, one-third, one-half, or one-half of the QTL effects estimated from these traits at markers 10 (At1g31580-11), 27 (At2g14560-2), 42 (At3g10720-6), or 55 (At3g56360-7) and variance–covariance being equal to the residual variance–covariance estimated from those traits at the marker. Specifying a good covariance structure and QTL effects for simulations is not always easy. Fortunately, the covariance structure and the putative QTL effects at the markers were available, so we took advantage of this information by adjusting the QTL effects to observe a visually improved pattern in the power so that the estimated power was not too large or small to disguise important findings. We simulated 1000 data sets. In the analysis of each data set we selected the best subsets of all possible numbers of traits and then estimated the significance thresholds, using 5000 permutations of the genotypic data. Figure 2 shows the estimated power at significance level 0.05 for different numbers of selected traits from all 3 sets of available traits. We can see that the statistical power tended to increase as the number of selected traits increased, and the gain in power was large at the beginning but quickly decreased and became negligible at some point. There were cases where the power attained or approximated the maximum with less than one-third of the traits and a larger number of selected traits resulted in a lower power. For a relatively small number of selected traits, the statistical power was lower if the total number of traits was larger, meaning a more serious selection bias.

Figure 2

(A–D) Statistical power estimated from 1000 replicate simulations at the significance level 0.05

A variable selection procedure

As shown previously, we could employ method C or D to appropriately determine the decision rule via the permutation test. For method C, we needed to predefine an entry/stay value; however, what could be such a good value for optimal power was in question and choice of a good entry/stay value would be complicated by noting that such a value should vary with the total number of traits and likely be any number in a region on the real line. Alternatively, we could use method D and select a number of traits such that any additional trait contributes little to the statistical power according to a certain criterion. In this spirit, we proposed the following procedure to choose an optimal number of traits at a marker locus [referred to as variable selection for the optimal power procedure (VSFOP)]: Determine the maximum number, K, of traits to be selected. Take 1000 (say) nonparametric bootstrap samples and estimate the statistical power as well as its standard error for k = 1, 2, … , K best traits, denoted by p and e, respectively. Choose the largest k*(≤ K) such that and . The maximum number K should be much smaller than the sample size to ensure the parameters are estimable and the variation of the estimates is not overly large. In practice, it may be most useful to identify a handful of traits for an extended study and therefore, in addition to saving unnecessary computational cost, the maximum number K should not be too large to facilitate biological interpretation of the findings. In our data, we simply used K = 16. To take advantage of multitrait analysis, we chose k* = 2 if k* ≤ 1. The above procedure dynamically determines the optimal number of traits at a marker. Since the number of selected traits can vary across different markers, the test statistics may not be comparable. For easy comparison across scanning loci, we proceeded with another step: Scale the test statistic at locus i (i = 1, 2, … , L), T(i), by to get a new statistic ; that is, , where v > 0 and v is the estimated threshold for locus i at a given genome-wide significance level α such that P(|T(i)| < v, i = 1, 2, … , L) ≥ 1 − α. We may impose the condition P(|T(1)| < v1), P(|T(2)| < v2) = … = P(|T(L)| < v). We applied the above procedure to the e-trait data. The selected traits at each marker are displayed in Figure 1C (see Figure 3A for the number of selected traits at each marker) and the mapping result of multitrait analysis of the selected traits is shown in Figure 1B, where v in the adjusted Hotelling’s T2 at locus i (i = 1, 2, … , 95), , was the estimated threshold for multitrait analysis of all 16 traits at genome-wide significance level 0.05. We can see that while the number of selected traits was mostly between 4 and 8, we needed only 2 traits at a large number of markers for the purpose of QTL detection. The mapping profile of selected traits was very similar to that of all traits (Figure 1B), suggesting the success of the selection procedure (see section 10 in File S1 for the results of analyzing all 66 e-traits that we considered in the simulations).

Figure 3

(A–C) Number of selected traits (A), estimated statistical power using random 50% subsamples (B), and difference in the estimated power between the proposed VSFOP procedure and any of other methods (C). Five methods were considered: (1) single-trait analysis (ST), (2) multitrait analysis of selected traits (SL), (3) multitrait analysis of all traits (AT), (4) multitrait analysis of clustered traits (CL), and (5) single-trait analysis of the first eight principal components (PC). Each vertical section displays one chromosome. It is worth noting that the proposed procedure selected only 2 traits to disclose potential QTL if there was a strong association between a marker and a trait (e.g., marker 3 and trait 3) but selected many more traits if the association between a marker (e.g., marker 42) and any trait was weak. We were interested to explore more at marker 3 about the power of multitrait analysis and the proposed procedure. Variable selection would disclose traits 2, 3, and 12 to be associated with marker 3 (Table S4 in File S1). However, single-trait analysis did not provide any such evidence for trait 2 or trait 12 (Figure 1A). Actually at marker 3, the estimated QTL effects were negligible on trait 16 but intermediate on traits 2 and 12 (section 6 in File S1). While the opposite effects on and the large correlation between traits 2 and 12 helped to disclose trait 12, the high correlation between traits 2 and 16 led to the disclosure of trait 2. The advantage of multitrait analysis at marker 3 came from the favorable configuration of the QTL effects and the residual correlation structure (see section 6 in File S1 for more examples).

Comparison with clustering and principal components:

Multitrait analysis takes advantage of correlations among traits, which may provide a higher statistical power or more accurate estimation of parameters including QTL location (Jiang and Zeng 1995). However, multitrait analysis may not be feasible under the framework of traditional multivariate analysis when the number of traits is large. In such a situation, one may cluster traits and analyze the clustered traits, using a multitrait approach (Chun and Keles 2009). Alternatively, one can consider principal component analysis, which has been proposed for QTL mapping (Weller ). We looked at these strategies, using the e-trait data we analyzed above. Recall the data contained 16 e-traits and 95 markers for 211 individuals. We took 500 random subsamples of size 106 (i.e., 50% of the total 211 individuals) from the e-trait data and then analyzed each subsample as follows: We analyzed the traits separately (ST). We determined the number of traits by the proposed VSFOP procedure, using 1000 nonparametric bootstrap samples of the subsample at a marker, and selected that number of traits at the marker, and then analyzed the selected trait using the multitrait approach (SL). We implemented multitrait analysis of all 16 traits (AT). We analyzed clustered traits separately (CL), with clusters being defined by hierarchical clustering based on the correlations between the e-traits and by a cutoff of 0.75 (section 7 in File S1). We did not reestimate the correlations using the subsample since the estimation using the total data tends to be more reliable. We analyzed the first eight principal components (PC) of the trait data separately (section 8 in File S1). Suppose the traits in the data were Y, and K was a matrix that chose the subsample; i.e., the traits in the subsample were in the form KY, and V transformed Y to its principal components via YV. Then the PC traits in the subsample were KYV. In other words, we relied on the covariance matrix estimated from the total data to calculate the principal component scores. We estimated 0.05 significance thresholds using method D, adjusted for multiplicity of the tests, using 1000 permutations of the genotypic data in the subsample (see section 5 in File S1 for more information). For each of the above approaches, a marker was defined as a positive if Hotelling’s test statistic, T2, exceeded the 0.05 threshold. Then, a positive should be almost surely due to true QTL or linkage to QTL if the estimated power was apparently >0.05. Figure 3B displays the proportion of the 500 subsamples where each marker was detected as a QTL. The difference in the estimated power between the proposed method and any other method is displayed in Figure 3C. Overall, our proposed method had a performance virtually the same as joint analysis of the 16 traits at most loci and appreciably better in several genomic regions and was more powerful than the remaining three approaches. The proposed procedure dynamically determines the number of best traits based on data and preestimated power so that it can make the most of power. Note that our method was the most advantageous at loci where a small number of traits were selected (Figure 3); typically, one trait with large QTL effects and one with negligible QTL effects were among the selected traits at those loci, and the trait with negligible QTL effects was just to help in identification of the one with large QTL effects. Multiple-trait analysis of the clustered traits tended to work better than single-trait analysis of the principal component variables but seemed to be a compromise between single-trait analysis and multitrait analysis of all 16 traits.

Discussion

In this study, we devised a strategy to perform multitrait analysis. Instead of jointly analyzing all available traits, we proposed to select a subset of informative traits and analyze the selected traits by a multitrait approach. Using simulations and real data, we showed our proposed method has the potential to achieve optimal statistical power. To our knowledge, this is the first time that variable selection has been proposed for multitrait QTL mapping. We expect our proposed method will have practical applications. For instance, we are usually interested in genetic variants underlying economically important traits such as yield, protein content, quality attributes, and disease resistance in a breeding program. These traits tend to be correlated, and multitrait analysis is preferable to single-trait analysis, as examplified by Singh , who recently studied several wheat diseases. Then, our approach is not only able to attain an optimal power in the framework of multitrait analysis but also able to best disclose traits that are relevant to identified QTL, which may not be desirably delivered by either single-trait analysis or full joint analysis (see section 9 in File S1 for more information). There are a number of situations where our proposed method can be useful. First, when the number of traits is larger than the sample size (e.g., expression data of thousands of available genes), a subset of traits has to be selected if multitrait analysis is implemented without parameter regularization since including all the traits is not feasible due to the limited degrees of freedom. Shrinkage can get around the large p, small n problem; however, the computation can be a serious problem in the framework of multivariate analysis where matrix manipulation such as inversion, in addition to choice of a regularization parameter, is typically involved. Strategies of best applying the proposed method to large amounts of data remain an interesting research topic (section 9 in File S1). Second, the proposed method is also applicable to data where the number of traits is large and we are interested in a small subset of the most relevant traits. For example, in gene expression data we may be interested to know which genes are most influenced by an eQTL (if any). While multitrait analysis typically identifies QTL associated with a group of traits without readily disclosing which traits are associated with the QTL, variable selection chooses traits that are statistically most significant and thus most likely involved in a biological process (section 9 in File S1). Third, information from variable selection can also be useful when we test other biological hypotheses such as pleiotropy. Suppose a trait is not selected (multiple rounds of selection may be required; see section 9 in File S1 for more information); then the QTL is negligible or of no effect. We do not need to look at pleiotropy if either of two traits is not selected, and we can proceed to test pleiotropy without difficulty if both of two traits are selected. Variable selection is a data reduction technique. We compared our proposed method to another data reduction approach, namely, principal component analysis (Weller ), and showed with real data that our method outperformed it in terms of statistical power. There are other disadvantages of principal component analysis. First, it is not obvious which traits are associated with the detected QTL. Second, there is a question of how many principal components are to be analyzed. In the e-trait data, it seemed appropriate to analyze the first four principal component variables but the seventh and eighth, which accounted for tiny proportions of the total variance, were among those that identified QTL (Figure S4 and Figure S5 in File S1). The statistical power would tend to be much lower at most loci if we looked only at the first four principal component variables (data not shown). One may attempt to cluster the traits based on correlations and then analyze the clustered traits via multitrait analysis. Since the power of multitrait analysis depends on both the QTL effects and the correlation structure, it is not possible for analysis of clustered traits to work best at loci of different QTL effects. In contrast, variable selection dynamically selects traits according to QTL effects and correlations among traits and multitrait analysis of selected traits mostly works well, and our proposed selection procedure can fully exploit data to provide the best power. Moreover, how to cluster traits and to determine the number of clusters will have an impact on results. In terms of statistical power for the e-trait data, analysis of clustered traits seemed to be somewhere between single-trait analysis and multitrait analysis of all the traits (Figure 3 and section 7 of File S1). We introduced the variable selection approach using statistic (1) but we do not have to rely on it to perform variable selection. The model-based maximum-likelihood ratio statistic is more flexible and allows inclusion of identified QTL as covariates in variable selection without any problem. Then we may identify multiple QTL one after another as in single-trait multiple-QTL mapping. However, since variable selection can associate different traits with different QTL, a question may be how to best include identified QTL as covariates. Other questions include how to determine suitable cutoffs to claim QTL, especially when the number of traits under consideration varies. Finally, it will also be useful to extend our proposed method to incorporate practical considerations such as genotype-by-environment interaction and polygenic variation (Singh ).

22 in total

1. Bay-0 x Shahdara recombinant inbred line population: a powerful tool for the genetic dissection of complex traits in Arabidopsis.

Authors: O. Loudet; S. Chaillou; C. Camilleri; D. Bouchez; F. Daniel-Vedele
Journal: Theor Appl Genet Date: 2002-02-13 Impact factor: 5.699

2. Induction of protein secretory pathway is required for systemic acquired resistance.

Authors: Dong Wang; Natalie D Weaver; Meenu Kesarwani; Xinnian Dong
Journal: Science Date: 2005-05-13 Impact factor: 47.728

3. Expression quantitative trait loci mapping with multivariate sparse partial least squares regression.

Authors: Hyonho Chun; Sündüz Keles
Journal: Genetics Date: 2009-03-06 Impact factor: 4.562

4. Linkage between loci of quantitative traits and marker loci: multi-trait analysis with a single marker.

Authors: Y I Ronin; V M Kirzhner; A B Korol
Journal: Theor Appl Genet Date: 1995-05 Impact factor: 5.699

5. Genomic survey of gene expression diversity in Arabidopsis thaliana.

Authors: Daniel J Kliebenstein; Marilyn A L West; Hans van Leeuwen; Kyunga Kim; R W Doerge; Richard W Michelmore; Dina A St Clair
Journal: Genetics Date: 2005-10-03 Impact factor: 4.562

6. Multiple trait analysis of genetic mapping for quantitative trait loci.

Authors: C Jiang; Z B Zeng
Journal: Genetics Date: 1995-07 Impact factor: 4.562

7. Global eQTL mapping reveals the complex genetic architecture of transcript-level variation in Arabidopsis.

Authors: Marilyn A L West; Kyunga Kim; Daniel J Kliebenstein; Hans van Leeuwen; Richard W Michelmore; R W Doerge; Dina A St Clair
Journal: Genetics Date: 2006-12-18 Impact factor: 4.562

8. The contribution of individual variables to Hotelling's T2, Wilks' lambda, and R2.

Authors: A C Rencher
Journal: Biometrics Date: 1993-06 Impact factor: 2.571

9. Integrated transcriptional profiling and linkage analysis for identification of genes underlying disease.

Authors: Norbert Hubner; Caroline A Wallace; Heike Zimdahl; Enrico Petretto; Herbert Schulz; Fiona Maciver; Michael Mueller; Oliver Hummel; Jan Monti; Vaclav Zidek; Alena Musilova; Vladimir Kren; Helen Causton; Laurence Game; Gabriele Born; Sabine Schmidt; Anita Müller; Stuart A Cook; Theodore W Kurtz; John Whittaker; Michal Pravenec; Timothy J Aitman
Journal: Nat Genet Date: 2005-02-13 Impact factor: 38.330

10. Multi-trait and multi-environment QTL analyses for resistance to wheat diseases.

Authors: Mateo V Hernandez; Jose Crossa; Pawan K Singh; Navtej S Bains; Kuldeep Singh; Indu Sharma
Journal: PLoS One Date: 2012-06-05 Impact factor: 3.240

3 in total

1. Novel Resampling Improves Statistical Power for Multiple-Trait QTL Mapping.

Authors: Riyan Cheng; R W Doerge; Justin Borevitz
Journal: G3 (Bethesda) Date: 2017-03-10 Impact factor: 3.154

2. Using Phenomic Analysis of Photosynthetic Function for Abiotic Stress Response Gene Discovery.

Authors: Tepsuda Rungrat; Mariam Awlia; Tim Brown; Riyan Cheng; Xavier Sirault; Jiri Fajkus; Martin Trtilek; Bob Furbank; Murray Badger; Mark Tester; Barry J Pogson; Justin O Borevitz; Pip Wilson
Journal: Arabidopsis Book Date: 2016-09-09

3. The roles of pleiotropy and close linkage as revealed by association mapping of yield and correlated traits of wheat (Triticum aestivum L.).

Authors: Albert W Schulthess; Jochen C Reif; Jie Ling; Jörg Plieske; Sonja Kollers; Erhard Ebmeyer; Viktor Korzun; Odile Argillier; Gunther Stiewe; Martin W Ganal; Marion S Röder; Yong Jiang
Journal: J Exp Bot Date: 2017-07-10 Impact factor: 6.992

3 in total