This paper presents systematic methods for the detection of influential individuals that affect the log odds (LOD) score curve. We derive general formulas of influence functions for profile likelihoods and introduce them into two standard quantitative trait locus detection methods-the interval mapping method and single marker analysis. Besides influence analysis on specific LOD scores, we also develop influence analysis methods on the shape of the LOD score curves. A simulation-based method is proposed to assess the significance of the influence of the individuals. These methods are shown useful in the influence analysis of a real dataset of an experimental population from an F2 mouse cross. By receiver operating characteristic analysis, we confirm that the proposed methods show better performance than existing diagnostics.
This paper presents systematic methods for the detection of influential individuals that affect the log odds (LOD) score curve. We derive general formulas of influence functions for profile likelihoods and introduce them into two standard quantitative trait locus detection methods-the interval mapping method and single marker analysis. Besides influence analysis on specific LOD scores, we also develop influence analysis methods on the shape of the LOD score curves. A simulation-based method is proposed to assess the significance of the influence of the individuals. These methods are shown useful in the influence analysis of a real dataset of an experimental population from an F2 mouse cross. By receiver operating characteristic analysis, we confirm that the proposed methods show better performance than existing diagnostics.
Quantitative trait locus (QTL) analysis is a statistical method for detecting the precise location of chromosome regions associated with a particular phenotypic trait. To date, many models have been proposed for this analysis (e.g., Lander and Botstein, 1989; Zeng, 1993; Kao and Zeng, 1997; Sen and Churchill, 2001). Among them, single marker analysis and the interval mapping method are the most widely used. In QTL detection, log odds (LOD) scores are calculated for marker loci to indicate the plausibility of the existence of QTLs (e.g., Siegmund and Yakir, 2007; Wu et al., 2007).Figure 1 shows the LOD scores computed using data on 170 F2 mice. The LOD scores at 119 marker loci are plotted as small circles. The circles on the same chromosome are linked by solid segments, and the joint lines form the LOD score curve. The highest peak appears at chromosome 16; moreover, two moderate peaks located closely on chromosome 3 are observed. For geneticists planning to clone the genes responsible for QTLs, it is important to determine whether the two moderate peaks were caused by two QTLs or by stochastic fluctuations. However, the sample sizes of experimental populations generated from genetic crosses are usually not large and normally consist of a few hundred individuals at most. In such small populations, a few observations can have a major impact on LOD score curves and lead to unstable results.
Figure 1
LOD score curves for mouse chromosomes.
LOD score curves for mouse chromosomes.The importance of such sensitivity analysis is well recognized in experimental genetics. In genetic experiments, errors in genotyping and phenotyping are inevitable, which could cause unstable results. To solve this problem efficiently, we can find those individuals having large influence on the LOD score curve, to genotype them again, and to remeasure the phenotypes. Therefore, statistical methods for the identification of influential individuals are helpful in the process of QTL detection. In general, an observation is called influential if it has a large impact on the estimates of interest. It should be distinguished from outliers that are distinct from most of the data points in a sample but not necessarily influential (Bollen and Jackman, 1990). In fact, influential individuals are not necessarily bad and need not to be removed if reliable. They may contain the most interesting information and should be examined carefully. If the reexamination reveals that the original influential data are correct, and if the assumed model fits the data well, geneticists can go to the next stage for positional cloning of the causative gene for the phenotype. If the data are found to be unreliable, researchers need to conduct another QTL analysis based on the updated dataset. However, the conventional method currently used by geneticists to find influential individuals is inadequate from a statistical viewpoint (see Section 4). The purpose of this paper is to provide systematic methods for identifying individuals that influence LOD score curves.In statistical terminology, the LOD score is nothing but a profile likelihood function based on a statistical model describing the relationship between genotypes and phenotypes. In this paper, we first present a general theory of influence functions for profile likelihoods and apply formulas to the models of QTL detection. Then, to identify individuals that affect the shape of the profile likelihood, we propose the use of the empirical influence functions (EIFs) for a linear combination of LOD scores, which is designed specifically for the shape of interest. We propose three methods for designing the linear combination coefficients: projection based on orthogonal polynomials, principal component analysis (PCA) based on the EIF matrix, and the quadratic form of the EIFs. To assess the significance of the EIFs, we also propose a simple method for providing p-values by a robustified parametric bootstrap. This method is confirmed to control false positive rates through numerical studies.This paper is organized as follows. In Section 2, the general formulas of influence functions for profile likelihoods are derived. In Section 3, EIFs are introduced into QTL analysis. In Section 4, the three methods for designing the linear combination are proposed. After that, the method for calculating p-values for significance testing is proposed. In Section 5, we apply these methods to a real dataset of an experimental mouse population. In Section 6, we examine the validity and the statistical power of the proposed methods through simulations. Finally, in Section 7, we summarize the proposed methods, discuss the results of our data analysis, and provide guidelines for application. Mathematical details are provided in the Appendix.
2 Influence functions for the profile likelihood
As previously stated, the LOD score in QTL detection is defined as a profile likelihood, because the model describing the relationship between genotypes and phenotypes (and covariates, if available) includes an unknown location parameter γ of the QTL. The details of the QTL models are given in Section 3. Here, we develop the theory of influence analysis for profile likelihood functions in a general setting, apart from the context of QTL detection.We assume that for each individual i=1,…,n, its observation record x is taken from a statistical model f(ċ;γ, where is the parameter of interest, and is a vector consisting of the other parameters. In QTL detection, the parameter space Γ is the search region for the QTLs and is a discrete or continuous set; we consider both cases.Let ℓ(γ,θ;x)=ℓ(γ,θ)=log f(x;γ,θ) be the log-likelihood function. The average log profile likelihood function M with respect to parameter γ is defined as (e.g., Murphy and van der Vaart, 2000). To develop an influence analysis on the characteristics of the profile likelihood M, we focus on the following functionals of M:(i) linear functional: ∫Γ
M(γ)dC(γ)(ii) maximum of the profile likelihood: maxM(γ)(iii) maximizer of the profile likelihood: argmaxM(γ).Let ϕ(F) be a functional of the distribution function F and let ϕ(F) be its sample version, where F is the empirical distribution function defined by the data {x}=i,…n. Let δ be the distribution function that has probability mass 1 at x. Then, the influence function of ϕ(F) is defined as the directional derivative of ϕ at F in the direction of δ: where , and denotes the derivative coefficient with respect to ε at ε=0. The EIF (Hampel et al., 1986) of ϕ(F) at the observation record x is the influence function of ϕ at F in the direction of : Here, EIF(i;ϕ) measures the influence of individual i on the statistic ϕ(F). To simplify the notation, we use EIF(ϕ) interchangeably. The basic strategy of influence analysis is identifying individuals with large absolute values of EIF(ϕ).To derive the EIFs for the functionals (i)–(iii) of M, we first express M(γ) as a functional of the empirical distribution function F. Let L(γ,θ;F)=∫ℓ(γ,θ;y)dF(y). Since , the profile likelihood function and its maximizer in 1990 can be rewritten as M(γ)=M(γ,F) and , respectively, whereThen, the EIFs of the functionals (i)–(iii) can be obtained by the following theorem. The regularity conditions and the proofs are provided in Appendix A.1.
Theorem 2.1
(i) Let C be a bounded variation function on Γ. The influence function of the linear functional ϕ1(C;F)=∫ΓM(γ,F)dC(γ) at F is provided that exists uniquely.(ii) The influence function of the maximum profile likelihood ϕ2(F)=maxM(γ,F) at F is provided that exists uniquely.(iii) Assume that Γ is a continuous set. The influence function of the maximizer of the profile likelihood ϕ3(F)=argmaxM(γ,F) at F is where . Here, the subscripts indicate partial derivatives. For example, L(γ,θ;F)=∂L(γ,θ;F)/∂θ is the gradient vector and L is the Hessian matrix.Actually, 1986 is a special case of the influence function formula for M-estimators. Its numerator and denominator are the efficient score and the Fisher information for parameter γ in the presence of the nuisance parameter θ, respectively (e.g., Murphy and van der Vaart, 2000).
3 Statistical models and influence functions
3.1 Experimental crossing data
We first review the statistical models of QTL analysis. For simplicity, we only consider the data from F2 populations, although the statistical methods developed here are applicable to all other experimental design data.In QTL analysis, data are taken from n individuals (e.g., mice). The observation record of the i-th (i=1,…,n) individual consists of a phenotype y, a genotype vector at m marker loci located at d1,…,d, and covariates u. (In this paper, we let u be a scalar for simplicity.) Moreover, we assume that a putative QTL exists that has genotype at location γ on a chromosome and affects trait y. Note that γ is an unknown parameter, and is an unobserved variable in general. We assume that except for the QTL, none of the loci affect the phenotype. In the F2 population from two strains A and B, each marker has one of the three genotypes: AA homozygous, AB heterozygous, and BB homozygous. Throughout this paper, we use to denote genotypes AA, AB, and BB, respectively.
3.2 Statistical model for the interval mapping method
Various methods and computer programs are available for QTL detection. Among them, we consider two basic methods—the interval mapping method and single marker analysis. It should be noted that influence functions can be defined in any QTL detection method that is based on LOD score. In this paper, we pick these two methods as examples. Although both methods are formulated as single QTL models, they are often applied for constructing scan statistics to decide whether a QTL exists at a particular location, even though more than one QTL may exist (Wright and Kong, 1997).Interval mapping is a classical method proposed by Lander and Botstein (1989) in the early stages of research on QTL analysis. Although many other procedures have been developed from this method, it is still used as a standard method and has been incorporated into many leading software packages (e.g., Manly and Olson, 1999; Broman et al., 2003). Here, we briefly review the statistical model for this method.The statistical model for the interval mapping method consists of two parts. One part determines the joint distribution of the genotypes at the marker loci and the putative QTL. Because the positions of the marker loci are known, once the position γ of the QTL is given (as an unknown parameter), we can calculate the joint distribution of the m+1 genotypes as a function of γ by using the stochastic structure of the linkage. Similarly, we can calculate the joint distribution P(z), and then the conditional distribution of given z can be obtained as Appendix A.2 provides details of (6) under Haldane's linkage model.The second part of the interval mapping method models the stochastic behavior of phenotype y when genotype of the QTL and covariate u are given. Various statistical models can be constructed for different types of y. Since the concept of influence analysis can easily be extended to other settings, we only consider real-valued traits and use the normal linear model.For individuals i=1,…,n, assume that where and θ=(α,β,μ,ν,σ2) are unknown parameters. Then, the probability density function of y, given (,u), becomes a three-component finite mixture of normal distributions where is the density function of the normal distribution in 1997.In model 1997, the parameters α and β indicate the additive and dominance effects of the QTL, respectively. The null hypothesis H0:α=β=0 means that no QTL affects the phenotype. Note that when H0 is true, the density in 1989 becomes unrelated to , and the QTL location parameter γ is no longer identifiable. In the rest of the paper, we refer to the density 1989 as g(y|*,u;θ) under H0.The LOD score is defined as the base 10 logarithm of the likelihood ratio (LR) for testing the null hypothesis H0 when the QTL location is γ. The test statistic is defined as a function of γ. Let be the maximum likelihood estimator (MLE) given γ, and let be the MLE under H0. As mentioned above, is independent of γ. Thus, the LOD score in the interval mapping method is defined as where , and . Because does not depend on γ, we denote it as .An EM algorithm for estimating can be found in Lander and Botstein (1989), and can be obtained by using the ordinary least-square method. A LOD score curve obtained from the interval mapping method is shown in Figure 2A.
Figure 2
The LOD score curve for chromosome 3 and influence analyses on the maximum point of the curve. (A) The solid line is the LOD score curve obtained from the interval mapping method. The LOD score curve attains maximum of 3.55 at 32.2 cM. (B) EIF of the maximum LOD score 2001 and EIF of the putative QTL location 2007 for each mouse. The mice with identification numbers are candidate influential individuals. (C) The maximum LOD score is changed by removing the two mice. (D) The location of the maximum LOD score is moved by removing the two mice.
The LOD score curve for chromosome 3 and influence analyses on the maximum point of the curve. (A) The solid line is the LOD score curve obtained from the interval mapping method. The LOD score curve attains maximum of 3.55 at 32.2 cM. (B) EIF of the maximum LOD score 2001 and EIF of the putative QTL location 2007 for each mouse. The mice with identification numbers are candidate influential individuals. (C) The maximum LOD score is changed by removing the two mice. (D) The location of the maximum LOD score is moved by removing the two mice.
3.3 Influence functions in the interval mapping method
Inside the braces of 1997, the first term is the profile likelihood with respect to γ, and the second term is the ordinary likelihood. Although both are conditional likelihoods given (,u), the general theory discussed in Section 2 is valid without any alteration. The theorem below follows immediately from Theorem 2.1 and the definition of the EIF in (2).
Theorem 3.1
(i) The empirical influence function of a linear combination of k LOD scores with the coefficient vector =(c) for individual i is where ℓ(γ,θ)=log f(y|,u;γ,θ) and ℓ(θ)=log g(y|*,u;θ). In particular,(ii) The empirical influence function of the maximum score for individual i is where .(iii) The empirical influence function of the maximizer for individual i is where the subscripts indicate partial derivatives.The derivatives with respect to γ and/or θ can be computed numerically or analytically.
3.4 Statistical model and influence functions in single marker analysis
In the interval mapping method, the putative QTL is assumed to be located in a continuous region Γ. However, when the observed markers are sufficient in number and densely located, we can assume that the QTL is one of the marker loci and use a simpler method—single marker analysis. This method is actually a special case of the interval mapping method.Assume that the possible QTL region Γ is given by the locations of the marker loci {d1,…,d}. When γ=d for some j, the conditional probability 2008 becomes Then, the statistical model in 1997 is reduced to the following single marker analysis model: where is the marker genotype observed at location dεΓ.As in the interval mapping method, the parameters α and β in model 1994 indicate the additive effect and the dominance effect, respectively. We again consider the null hypothesis H0:α=β=0. Let be the MLE given γ, and let be the MLE under H0. We can again use 1997 to define the LOD score, but γ is restricted to Γ={d1,…,d}, and . Similarly, the EIFs of the score LOD(γ) at individual i can be obtained from 1999 and its maximum is given in 2001. The EIFs for cannot be defined since takes discrete values.
4 Influence analysis on aspects of the shape
4.1 Influence analyses for the shape of LOD score curves
If we are only interested in the score LOD(γ) at a particular location γ, we can calculate EIF(LOD(γ)) for each individual i from 1999 and conclude that individuals having large absolute values |EIF(LOD(γ))| influence the LOD score. However, as stated in Section 1, attention is paid not only to the value of the score at a particular γ but also to the shape of the score curve in genetics studies.A conventional approach to find influential individuals for the shape of the LOD score curve in an experimental population generated from a genetic cross is to examine the individual genotype data in the region in question. Because of the linkage, strong positive correlations exist among the genotypes (l ≥ k ≥ m) on a linked chromosomal region. The probability that flanking markers take the same genotypes (e.g., ) is close to 1. If some individuals have recombinant genotypes at some markers in the linked region of interest, they may influence the shape of the LOD score curve. Therefore, geneticists try to identify individuals that show recombinant genotypes near that region. However, this method has some difficulties. First, because recombinant genotypes may have many different patterns, identifying all the potentially influential patterns is difficult. Second, because this method does not take phenotype and covariates into consideration, the detected individuals are not necessarily influential.In this section, to detect individuals having a large influence on a particular shape of the LOD score curve, we propose to use EIFC(), the EIF of LODC()=∑cLOD(γ) for individual i. Regardless of whether the set Γ of possible QTL locations is continuous or discrete, we are interested in the shape of the LOD score curve in the region where k chromosome positions γ1,…,γ are located. The influence matrix is defined as an n×k matrix is the n×k EIF vector of LOD(γ). We also define EIF=(EIF(LOD(γ1)),…,EIF(LOD(γ)))′, the transpose of the i-th row vector of EIF.In this approach, the choice of the coefficient vector c=(c)1≤ is crucial. For example, if we want to examine a linear increasing trend in LOD scores (LOD(γ1),LOD(γ2),LOD(γ3)) at 3 loci, then a monotone coefficient vector =(c1,c2,c3)′ with (c1Our first proposal is the use of orthogonal polynomials. Let (l=0,1,…), where (·) is the l-th power. Let . Applying the Gram–Schmidt orthonormalization process to 's, we define sequentially by The process 1997 is written as a recursion form where 0=. Note that is an orthogonal projection matrix. The vectors are orthonormal in the sense that (k=l), 0 (k ≠ 1). Hence, the LODC( are uncorrelated between different l's. Then EIFC()(i=1,…,n) can be obtained from 1999. It also can be written as . One advantage of this method is that the coefficient vectors are easy to interpret. The vectors 0,1, and 2 are corresponding to the grand mean, linearity, and curvature, respectively.Alternatives to using specific contrasts are our second and third proposals given below. Note that the matrices defined in 1997 can be used for deleting the components that are not of interest. For example, for deleting the parallel shift or linear components, we can consider a class of linear combinations of EIFs defined by a projection LODC(′=′(LOD(γ1),…,LOD(γ))′ with =1 or 2, respectively. In the following discussion, we start from this projected LOD score.Our second proposal is based on the PCA of Lu et al. (1997) and Tanaka (1994). Noting that the variance of LODC(′) is ′(′), we consider the following singular value decomposition for the standardized influence matrix: where and are orthonormal n- and k-vectors, respectively, and is the symmetric square root matrix of the Moore–Penrose pseudoinverse matrix. Multiplying both sides of 1993 by and defining :=(′)−1/2, we have . Then H′Cl becomes the coefficient vector corresponding to the l-th component, and the principal component is the corresponding influence function. We refer to as the influence score vectorin the context of influence analysis. For individual i, the EIF with respect to the l-th component can be written as .This is an exploratory approach, and as in most cases of PCA, the result of this approach is not always easy to interpret. Thus, we recommend that the number of principal components rank (), not to be large.Our third proposal is to use the coefficients maximizing EIFC(′)=′ under the condition Var(LODC(′))=′(′=1. Define =, =′, and let ′=1. Using the Cauchy–Schwarz inequality, we have Because the maximizer is =κ− for a constant κ≥0, we find that and the maximum is the square root of the quadratic form below: Note that QEIF can be rewritten in a way of the PCA-based method: . We refer to this EIF as the quadratic EIF (QEIF) hereinafter. This alternative method can be used when PCA does not give an explicable result.
4.2 Approximation of Cov
For the covariance matrix Cov, we propose an approximate value evaluated under the null hypothesis H0. It might seem better to use the covariance under the non-null model. However, in QTL analysis, the single-QTL models 1997 or 1994 are used for defining only the scanning statistic (LOD score), and each column vector EIF(γ) comes from a different statistical model for the specified γ. Hence, model-based approaches such as the use of observed information or Cook's local influence (Cook, 1986) cannot be applied to our problem. Therefore, we use the covariance under the null model as a second-best alternative.To estimate Cov, we define two n×2 matrices: =(1,u)1≤, with and . We propose an approximation for Cov as follows. The regularity conditions and the proof are provided in Appendix A.3.
Theorem 4.1
When n is large, the covariance matrix Cov of the LOD scores under the assumption of no QTL can be approximated by where
4.3 Significance of EIFs
To assess the statistical significance of the EIFs for suspected individuals, we calculate their p-values in the framework of multiple testing problem without identifying the unknown true QTL model. In this case, nonparametric resampling approaches look favorable. However, they have difficulty in estimating tail probabilities of extreme statistics. To address this problem, we propose a robustified parametric bootstrap below.First, estimate the involved parameters from the given (observed) dataset by a robust regression (such as, using the Huber estimator) for the full regression model where J is a region including all possible QTLs of interest (see, e.g., Section 5.3). The effects from QTLs outside J can be regarded as a part of ε. Second, generate datasets with the same sample size as in the dataset using the estimated model 19 and Haldane's linkage model. Then find the maximum absolute EIF from each simulated dataset. Finally, the p-value can be calculated as the upper probability of the empirical distribution of the maximum EIFs. Individuals with relatively small p-values can be declared influential.The validity of this method is examined in Section 6.1. In the next section, we will use , the standardized version of EIFC, instead of the original ones for estimating p-values. This is because an additional numerical comparison study we conducted shows that the original EIFC sometimes gives too conservative p-values when no outlier exists. The proposed method is also applicable to the interval mapping method by generating datasets with model (19) containing all marker loci on the chromosome under consideration.
5 Analysis of F2 mice data
5.1 F2 mice data
In this section, we show how our proposed methods work with a real dataset, which is available from Supporting Information. Our data are taken from n=170 F2 progeny generated from the intercross of F1 hybrids of C57BL/6J and MSM/Ms mouse strains. Extensive phenotypic variation and great genomic diversity are observed between these two strains (Takada et al., 2008). In this analysis, we consider blood adiponectin concentration (log10[ng/ml]) as a quantitative trait. Adiponectin is a key adipokine in metabolic syndrome and is important in mammalian metabolism. Genotypes of the F2 progeny were observed at m=119 marker loci (including 94 SNP markers and 25 microsatellite markers). The LOD score curves obtained by single marker analysis are depicted in Figure 1. The LOD score curve at chromosome 16 attains the maximum. In fact, the adiponectin coding gene (Adipoq, MGI:106675) is located on this chromosome. In our analysis below, we focus on chromosome 3 where six SNP marker loci are genotyped.
5.2 Data analysis on a single location
The LOD score curve obtained from the interval mapping method is shown in Figure 2A. We chose the point (, ) for further analysis because it is the maximum peak point of the LOD score curve.The EIF values of the maximum 2001 and the maximizer 2007 of the LOD score for the 170 individuals are plotted in Figure 2B. In this figure, each circle corresponds to one mouse. The horizontal axis indicates the influence of each mouse on the maximum LOD score, and the vertical axis measures the influence of all the mice on the location of the putative QTL.In order to show the extent of the influence of individuals on the LOD score, we plot the LOD score curves obtained without the specific mice and compare them with the original LOD score curve. The extreme points in the horizontal direction are No. 13 and No. 63. Accordingly, in Figure 2C, the maximum LOD score decreases and increases over a wide range of values by deleting mice No. 13 and No. 63. Figure 2D shows that deletion of mice No. 62 or No. 94 leads to a peak location shift from side to side without changing the maximum LOD score substantially. In particular, the data for No. 94 moves the location of the putative QTL by 3.2 centi-Morgans (cM). Although these mice are not detected as influential using the method in Section 4.3, we can see from this example that influential individuals may mislead us when making decisions about the existence and the locations of the QTLs.
5.3 Data analysis on aspects of the shape
For the LOD score curve on chromosome 3 in Figure 1, let us focus on the locations of the peaks and valley (γ3,γ4,γ5)=(35.4,53.8,68.4) cM, and check whether any mice determine the two-peak shape.From 1997, we can calculate a vector of to detect the influential mice that significantly affect the parallel shift, inclination, or curvature of the LOD score curve. For the curvature, vector 2=(1.04,−2.36,1.31)′ is obtained. Figure 3A depicts the SEIF obtained from the orthogonal polynomial method with 2 for the 170 individuals. Among the individuals, mouse No. 60 has the largest SEIF of 0.447. Its phenotype is 4.776, and genotypes at the 3 loci (γ3,γ4,γ5) are (0, −1, 0). This mouse appears to greatly affect the curvature of the LOD score curve (Figure 3C), and thus is worthy of attention.
Figure 3
Influence analyses for the shape of the LOD score curve. (A) SEIFs of the 170 mice obtained by the orthogonal polynomial method. Each circle indicates one mouse. (B) Scatter chart of the two influence score vectors obtained from the PCA method. The mice with identification numbers have a noticeably larger impact on the shape of the LOD score curve on chromosome 3. (C) LOD score curves without mice No. 23 and No. 60. (D) Comparison of the LOD score curves obtained from the original dataset and the changed dataset.
Influence analyses for the shape of the LOD score curve. (A) SEIFs of the 170 mice obtained by the orthogonal polynomial method. Each circle indicates one mouse. (B) Scatter chart of the two influence score vectors obtained from the PCA method. The mice with identification numbers have a noticeably larger impact on the shape of the LOD score curve on chromosome 3. (C) LOD score curves without mice No. 23 and No. 60. (D) Comparison of the LOD score curves obtained from the original dataset and the changed dataset.To assess its significance, we conduct a parametric bootstrap according to the procedure in Section 4.3. We use Huber's robust method to estimate the regression coefficients and the error variance in (19). In our study, the R function rlm() with the option psi=psi.huber is used. Under model (19) with j={3,4,5}, based on 1000 repeated simulations, the p-value for mouse No. 60 is estimated as 0.39. Model (19) with all loci on chromosome 3 (i.e., J={1,2,…,6}) gives similar result. Accordingly, No. 60 is not an influential mouse.Using the PCA approach, we again search influential individuals for the shape of the LOD score curve on chromosome 3. In this method, projection matrix =1 in 1997 is used to remove the parallel shift of the LOD score curve. The scatter plot of the two influence score vectors is shown in Figure 3B.The correlations between the eigenvectors =(h)1≤ and EIFC()=(EIFC())1≤ for l=1,2, obtained in the previous example are This result suggests that the first eigenvector indicates the inclination and the second eigenvector indicates the curvature of the LOD score curve, even though they do not work in exactly the same way as the vectors .The influence from mice No. 23 and No. 60 is shown in Figure 3C by comparing the original LOD score curve with those obtained from the dataset without these mice. The solid line connects the original LOD scores to form the curve for chromosome 3. The dotted line is obtained by removing mouse No. 23, which has the largest absolute value at the first eigenvector. We now observe that the first peak and the valley have gone down but the second peak has gone up. The dashed line indicates the LOD score curve without mouse No. 60, which has the largest value at the second eigenvector. In this case, the valley of this curve disappeared almost completely.In our dataset, mouse No. 60 is not approved as influential. However, when we change its phenotype (4.776) to 3.842, the minimum phenotype in the dataset, its SEIF becomes −0.647, and the fictional mouse is detected as influential according to its small p-value 0.02 (see the LOD score curves in Figure 3D). This numerical experiment shows that the proposed method for p-value in Section 4.3 can accomplish its goal.
6 Simulation studies
6.1 Assessing the method in Section 4.3
In this subsection, we examine the validity of our method for p-values from Section 4.3. We confirm that in the case of the single marker analysis, this method controls the false positive rate when no outlier exists, and has statistical power when outliers exist.The outline of our simulation study consists of the following steps:(i) Assume a true QTL model. Repeat the steps (ii)–(v) N times.(ii) Generate a dataset D0 from the assumed true model. Calculate T0=max|SEIF| from D0, where with EIFC given in 1999.(iii) From the assumed model, generate another dataset D1 with one outlier included. Calculate T1=max|SEIF| from D1 as in (ii).(iv) As stated in Section 4.3, fit the full model (19) to the dataset D0 by Huber's method as in Sections 4.3 and 5.3, and generate a new dataset using the estimated robust regression parameters. Calculate from .(v) Apply the same procedure as in (iv) to the data D1 to obtain the dataset and .(vi) Compare empirical distributions of T0, T1, , and based on the N iterations.Because and are samples from the model with robust estimators, the distributions of both and are expected to approximate the distribution of T0 (i.e., null distribution). As stated below, our simulation study shows that this expectation is correct. That is, it is appropriate to use (either or , in practice) instead of T0. This means that our proposal for estimating p-values is validated.As the true model in step (i), we use model (19) with four plausible settings: M0:J=∅ (no QTL), M1:J={3} (one QTL at γ3), m2:J-{3,5} (two QTLs at γ3 and γ5), and M3:J={3,4,5} (the full model). The last model is also used as the full model in steps (iv) and (v). In steps (ii)–(v), genotypes are generated with Haldane's model, sexes u are produced as a Bernoulli sequence independently of , phenotypes in D0 and D1 are generated form the true model, and phenotypes in and are generated from the full model. In step (iii), the outlier is generated as follows: genotypes () at the third, fourth, and fifth loci are set as the relatively rare genotype sequence (0, −1, 0), which seemed to have the largest effect in the dataset, and is the same as that of mouse No. 60; error ε of the phenotype is generated from N(0,(3σ)2).Figure 4 shows the simulation results for the four models. The empirical distribution functions of T and (k=0,1) are depicted as black and gray lines, respectively. We find that the distribution of and are close to each other, which means that the distribution of ( or ) is stable for the existence of outliers. We also find that and are close to T0 and distinct from T1. This indicates that when no outlier exists the false positive rate is appropriately controlled (i.e., unbiased) and when outliers exist it has statistical power in detecting influential individuals. We also tried simulations with two outliers. The results are similar and omitted.
Figure 4
Simulation results for Section 6.1. Distributions of the maximum |SEIF| and their approximations based on 1000 replicates. (A), (B), (C), and (D) are the simulation results under the models M0, M1, M2, and M3, respectively. Solid lines are the empirical distribution functions of simulated T0 and their approximations when there are no outliers. Dashed lines are empirical distribution functions of simulated T1 and when one outlier exists. (T0, , T1, are referred to as “T no outlier,” “ no outlier,” “T outlier,” “ 1 outlier” in the legends, respectively.)
Simulation results for Section 6.1. Distributions of the maximum |SEIF| and their approximations based on 1000 replicates. (A), (B), (C), and (D) are the simulation results under the models M0, M1, M2, and M3, respectively. Solid lines are the empirical distribution functions of simulated T0 and their approximations when there are no outliers. Dashed lines are empirical distribution functions of simulated T1 and when one outlier exists. (T0, , T1, are referred to as “T no outlier,” “ no outlier,” “T outlier,” “ 1 outlier” in the legends, respectively.)
6.2 Power comparisons by ROC analysis
In this subsection, we confirm the statistical power of our proposals by comparing them with existing diagnostics in regression analysis. We assess four diagnostics, the EIFC (10), the QEIF (16), Cook's D (Cook, 1977), and the standardized residual r, by a receiver operating characteristic (ROC) analysis (Fawcett, 2006). In the context of QTL analysis, Hayat et al. (2008) studied the detection power of a modified version of Cook's D (Zewotir and Galpin, 2005) in a QTL model with random effects. Since our QTL model is a fixed-effect model, we use the original Cook's D in the comparisons. We restrict our attention again to detecting individuals that influence the two-peak shape of the LOD score curve on chromosome 3.Cook's D and the standardized residual r are defined through the regression model (19) with J={3,4,5}. Note that Cook's D is the quadratic form of the EIF vector for the parameter vector ((α,β),μ,ν.In this simulation, the two-QTL model (19) with J={3,5} is used as the true model. Each dataset contains two outliers and 168 normal individuals. The outliers are designed to have specified genotypes (0, −1, 0) at the third, fourth, and fifth loci. These genotypes are the same as those used in Section 6.1. The error ε of each outlier's phenotype is simulated in the following different ways: (a) Normal distribution N(0,(2σ)2), (b) N(0,(3σ)2), (c) t-distribution with 3 df and scale parameter 2σ, and (d) Cauchy distribution with scale parameter 2σ. Note that σ is the scale parameter used in generating ε for the 168 normal individuals. In each dataset, the genotypes and sexes of the normal individuals are generated from Haldane's model and Bernoulli distribution, respectively. Their phenotypes are simulated from the two-QTL model with the parameters estimated from the adiponectin dataset.The ROC curves of the four indicators are shown in Figure 5 based on 1000 replicates. For the varying thresholds, the average rates of correctly classifying the true influential cases (detection rates), and the average rates of misclassifying the normal cases as influential cases (false positive rates) are plotted. As expected, outliers with larger σ2 have larger absolute EIFs, and are thus more easily detected. In all panels, the EIFC has the largest area under its ROC curve and hence the best average performance. As the second-best method, QEIF is shown useful when the target shape cannot be fully specified.
Figure 5
Simulation results for Section 6.1. Comparison of ROC curves. EIFC: linear combination of EIFs, QEIF: quadratic EIF, Cook: Cook's D, Resid: standardized residual r. Distribution of outliers, ε, is (A) N(0,(2σ)2), (B) N(0,(3σ)2), (C) t3 with scale 2σ, and (D) Cauchy with scale 2σ.
Simulation results for Section 6.1. Comparison of ROC curves. EIFC: linear combination of EIFs, QEIF: quadratic EIF, Cook: Cook's D, Resid: standardized residual r. Distribution of outliers, ε, is (A) N(0,(2σ)2), (B) N(0,(3σ)2), (C) t3 with scale 2σ, and (D) Cauchy with scale 2σ.
7 Discussion and guidelines
7.1 Summary and discussion
In this paper, we developed a general theory of profile likelihood function, and apply it to the linear functional of LOD score function. We also proposed methods to detect influential individuals to the shape of LOD score curves.The proposed methods have the following four remarkable features.(i) These methods focus on interactive effects of genotype and phenotype—Phenotype and genotypes are incorporated in influence analysis on LOD scores. For example, in our dataset, mouse No. 13 has genotypes (−1,−1,−1) and the minimum phenotype 3.842, but its EIF on the curvature of the LOD score curve is 0.077, which is much less than that of 0.447 for mouse No. 60. However, as shown in Section 5.3, when the phenotype works with genotypes (0, −1, 0), its influence becomes significant. The probability of genotypes (0, −1, 0) under Haldane's model is 0.014. There are also some mice with rarer genotypes, such as mouse No. 9 (−1,0,−1) and mouse No. 128 (1, 0, −1), both with probability 0.007. However, neither of these mice has significant influence on the shape of the LOD score curve. Hence, the proposed methods do not separately detect outlier phenotype or rare genotypes. Note that, using these methods, individuals with outlier phenotypes, or rare multilocus genotypes including epistasis or concordant genotypes may also be identified as influential if they change the LOD score curve significantly.(ii) The proposed approach based on the EIF can be applied to other QTL models —In this paper, two simple models are dealt with as examples. However, the proposed approach can be applied to more complicated multiple QTL models based on LOD score, such as multiple interval mapping (Kao et al., 1999).(iii) Three methods are proposed to design coefficients of linear combination of LOD scores—They can be used when we have no clear idea for choosing the coefficients. Here we give a summary on the feature of these methods in Table 2.1.(iv) A method to assess the significance of each detected individual is proposed—We proposed a simulation based method to assess the significance of detecting influential individuals, and confirmed that this method is approximately valid (i.e., controlling false positives) in the case of the single marker analysis.Three methods for designing coefficients
7.2 Practical guidelines
The proposed influence analysis methods are designed to identify individuals that change significantly the LOD score curve. The data of the identified individuals may contain observation errors. Therefore, we should reexamine them at the first. The data of detected individuals that are confirmed to be accurate and reliable are potentially informative for the gene mapping process.Influence analysis is a model-based method. It is desirable to do QTL detection and influence analysis using models that are well fitted to the data. As stated in Section 3, our proposed EIF approaches can be applied to any QTL detection models based on LOD score. When the assumed model is incorrect, it is not easy to interpret the results.On the other hand, the one-QTL model (single marker analysis or the interval mapping method) is often used in initial scan even though it may not be the true model. Even in this case, the detected individuals provide important information. They may be true influential individuals in the assumed model, or they may suggest other possibilities of QTL model such as multiple-QTL or epistasis model. In any of these cases, the detected individuals are informative and should be investigated carefully in subsequent QTL analysis.In the process of QTL analysis, aside from influence analysis, the significance of LOD scores should be checked by standard methods such as permutation test (Churchill and Doerge, 1994).The dataset and software used in this article are available from the authors or at NIG Mouse Phenotype Database http://molossinus.lab.nig.ac.jp/phenotype/index.html .
Table 1
Three methods for designing coefficients
Method
Advantage [A] and drawback [D]
Orthogonal polynomial
[A] Coefficients are easy to interpret. Detection power is high.
[D] Need to choose the degree of polynomial, or need to try various degrees (e.g., linear, quadratic, cubic, …).
PCA-based method
[A] Useful as an exploratory data analysis.
[D] Results are not necessarily clear (not easy to interpret).
Quadratic form (QEIF)
[A] Omnibus test, and easy to use.
[D] Detection power is lower than for the orthogonal polynomial.