
Accounting for uncertainty when assessing association between copy number and disease: a latent class model.

Juan R González, Isaac Subirana, Geòrgia Escaramís, Solymar Peraza, Alejandro Cáceres, Xavier Estivill, Lluís Armengol.

Abstract

BACKGROUND: Copy number variations (CNVs) may play an important role in disease risk by altering dosage of genes and other regulatory elements, which may have functional and, ultimately, phenotypic consequences. Therefore, determining whether a CNV is associated with a given disease might be relevant in understanding the genesis and progression of human diseases. Current technologies provide CNV probe signal intensities from which copy number status is inferred. Incorporating the uncertainty of CNV calling in the statistical analysis is therefore highly important. In this paper, we present a framework for assessing association between CNVs and disease in case-control studies where this uncertainty is taken into account. We also indicate how to use the model to analyze continuous traits and adjust for confounding covariates.
RESULTS: Through simulation studies, we show that our method outperforms other simple methods based on inferring the underlying CNV and assessing association using regular tests that do not propagate call uncertainty. We apply the method to a real data set in a controlled MLPA experiment showing good results. The methodology is also extended to illustrate how to analyze aCGH data.
CONCLUSION: We demonstrate that our method is robust and achieves maximal theoretical power since it accommodates uncertainty when copy number status is inferred. We have made R functions freely available.

Year: 2009 PMID： 19500389 PMCID： PMC2707368 DOI： 10.1186/1471-2105-10-172

Source DB: PubMed Journal: BMC Bioinformatics ISSN： 1471-2105 Impact factor: 3.169

Background

With the recent technological advances, various genome-wide studies have uncovered an unprecedented number of structural variants throughout the human genome [1-3], mainly in the form of copy number variations (CNVs). The considerable number of genes and other regulatory elements that fall within these variable regions make CNVs very likely to have functional and, ultimately, phenotypic consequences [4,5]. In fact, recent studies have reported a correlation between copy number of specific genes and degree of disease predisposition [6-8], indicating that identification of DNA copy number is important in understanding the genesis and progression of human diseases. Several techniques and platforms have been developed for genome-wide analysis of DNA copy number, such as array-based comparative genomic hybridization (aCGH). The goal of this approach is to identify contiguous DNA segments where copy number changes are present. The ability of aCGH to distinguish between different numbers of copies is limited, so various quantitative techniques are required for more precise, targeted analysis of genomic regions. For known CNVs, real-time PCR assays can be used to compare the copy number status of particular loci in cases and controls. Individuals are typically binned into copy number categories using pre-defined thresholds of probe signal intensity. Recently, Multiplex Ligation-dependent Probe Amplification (MLPA) [9] has also been used to quantify copy number classes. This method allows the analysis of several loci at the same time in a single assay. MLPA is usually used to identify gains or losses in test samples with respect to controls [10], but it can also be used in the context of association studies in case-control or cohort settings [11,12]. The statistical methods used in CNV-disease association studies are currently very simple.
Quantitative methods give CNV probe signal intensity measurements for each individual as a continuous variable, from which copy number status is inferred, generally using pre-defined thresholds. Differences in copy number distribution between cases and controls are then assessed using χ², Fisher or Mann-Whitney tests [6,13,14]. However, the distribution of CNV probe measurements is continuous and multimodal, meaning that signal intensity should be considered as a mixture of curves. In many instances, the curves of the underlying distributions overlap, which leads to uncertainty about the copy number of individuals whose intensities fall in the overlapping regions. Therefore, scoring copy number by binning and then assessing the association may lead to misclassification and unreliable results. Ionita-Laza et al. (2009) pointed out that it is not immediately clear how this uncertainty of CNV calling should be incorporated in the statistical analysis [15]. To overcome this difficulty in assessing association between CNVs and disease, we propose a latent class (LC) model that incorporates the uncertainty that arises when CNV calling is performed. After inferring copy number using Gaussian finite mixture distributions, or any other calling algorithm, the model assesses the relationship between the trait and a CNV using a mixture of generalized linear models. Association is then tested using a likelihood ratio procedure. We validate and compare our method with existing methods through a simulation study. We then illustrate how to test association between CNVs and the trait by using two real examples. The first is a case-control study using data from an MLPA experiment where the true copy number status is known. The second comes from a study where breast cancer cell lines are analyzed using aCGH.

Methods

Inference of copy number status

Let us assume that we observe I individuals from a given population, consisting of C mutually exclusive latent classes c = 1, ..., C (e.g., copy number status). Instead of observing these classes, we observe a surrogate variable, X, corresponding to a continuous variable arising from any quantitative method. For instance, in targeted studies using MLPA or real-time PCR, X corresponds to peak intensities for each CNV probe. In the context of a whole genome scan, one may have quantitative data from aCGH or any other platform such as Illumina or Affymetrix, where, for each probe, the variable X corresponds to a ratio of intensities. Figure 1 shows a number of possible distributions that signal intensities may have. Some variants clearly show different underlying copy number statuses with multimodal signal intensity distributions (CNV2, CNV4 and CNV6). In other cases, where the existence of different copy numbers is not clear, inferring copy number by binning the data may be difficult or unfeasible.

Figure 1

CNV quantitative measurements. Examples of CNV data showing different clustering quality and copy number status.

For each CNV variant, we are interested in classifying the subjects into the C classes using the surrogate variable X. We propose to model the unobserved latent classes using a finite mixture model with C components of the form

$$f(x \mid \Theta) = \sum_{c=1}^{C} \pi_c \, N(x \mid \eta_c, \sigma_c^2) \qquad (1)$$

where $N(\cdot \mid \eta_c, \sigma_c^2)$ is the Gaussian distribution, Θ denotes all model parameters (e.g., $\Theta = (\eta_c, \sigma_c^2)$, c = 1, ..., C), and x is the surrogate variable that corresponds to the quantitative measure of copy number status. For the component weights $\pi_c$ it holds that $\sum_{c=1}^{C} \pi_c = 1$ with $\pi_c \ge 0$. The value of C to be used is chosen by applying the Bayesian Information Criterion (BIC) [16]. This mixture model approach for calling is similar to some used for the analysis of aCGH data [17,18], where correlation among probes should be considered.

When analyzing MLPA data, it should be pointed out that in some instances, especially when there are individuals with 0 copies (see CNV2 and CNV4 in Figure 1), the intensity for a null allele is expected to be equal to 0. However, because of experimental noise, in some cases this ratio deviates slightly from the theoretical value. In our experience with hundreds of home-made MLPA probes, the value for null alleles is typically below 0.1; nevertheless, we recommend determining this parameter experimentally for each of the probes used in the MLPA experiments using the appropriate control samples. For these cases, the procedure used to estimate the parameters in (1) fails because the underlying distribution of individuals with 0 copies is not normal. In these situations we propose to fit the following mixture model to determine the latent classes

$$f(x \mid \Theta) = \pi_1 \, \mathcal{I}(x \le \tau) + \sum_{c=2}^{C} \pi_c \, N(x \mid \eta_c, \sigma_c^2) \, \mathcal{I}(x > \tau) \qquad (2)$$

where τ is given by the user, as previously indicated, ℐ denotes an indicator function, and $\sum_{c=1}^{C} \pi_c = 1$.

Finite mixture parameters can be estimated using the EM algorithm [19,20] or Newton-type procedures [20]. After fitting this finite mixture model, we can perform a goodness-of-fit test using the χ² test statistic. Then, the posterior probability that individual i with an observed value $x_i$ belongs to copy number class c is given by

$$w_{ic} = P(c \mid x_i; \hat{\Theta}) = \frac{\hat{\pi}_c \, N(x_i \mid \hat{\eta}_c, \hat{\sigma}_c^2)}{\sum_{k=1}^{C} \hat{\pi}_k \, N(x_i \mid \hat{\eta}_k, \hat{\sigma}_k^2)} \qquad (3)$$

The posterior probabilities are used to segment data by assigning each individual to a given copy number status, namely the class with maximum posterior probability (MAP).
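The calling step described by equations (1) and (3) can be illustrated with a minimal Python sketch. It is not the authors' R implementation: the function names (`fit_mixture_em`, `posterior_probs`, `map_call`, `bic`) are hypothetical, the EM updates are the textbook ones for a univariate Gaussian mixture, starting values are assumed to be supplied by the user, and the null-allele model (2) is not covered.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_probs(x, weights, means, sds):
    """Posterior probability of each class given intensity x, as in equation (3)."""
    joint = [p * normal_pdf(x, m, s) for p, m, s in zip(weights, means, sds)]
    total = sum(joint)
    return [j / total for j in joint]

def fit_mixture_em(xs, means, sds, n_iter=200):
    """Textbook EM for the Gaussian mixture (1), started from user-supplied values."""
    k = len(means)
    weights = [1.0 / k] * k
    means, sds = list(means), list(sds)
    for _ in range(n_iter):
        # E-step: posterior class probabilities for every observation
        post = [posterior_probs(x, weights, means, sds) for x in xs]
        # M-step: weighted updates of proportions, means and standard deviations
        for c in range(k):
            n_c = sum(w[c] for w in post)
            if n_c == 0.0:
                continue  # empty component: leave its parameters unchanged
            weights[c] = n_c / len(xs)
            means[c] = sum(w[c] * x for w, x in zip(post, xs)) / n_c
            var = sum(w[c] * (x - means[c]) ** 2 for w, x in zip(post, xs)) / n_c
            sds[c] = max(math.sqrt(var), 1e-6)  # floor avoids a degenerate component
    return weights, means, sds

def map_call(x, weights, means, sds):
    """Index of the class with maximum posterior probability (MAP)."""
    w = posterior_probs(x, weights, means, sds)
    return max(range(len(w)), key=w.__getitem__)

def bic(xs, weights, means, sds):
    """BIC = -2 log L + (3C - 1) log n; lower is better under this sign convention."""
    loglik = sum(
        math.log(sum(p * normal_pdf(x, m, s) for p, m, s in zip(weights, means, sds)))
        for x in xs
    )
    n_params = 3 * len(weights) - 1
    return -2.0 * loglik + n_params * math.log(len(xs))
```

Fitting the model for several values of C and keeping the fit with the lowest `bic` mimics the model selection step; the vectors returned by `posterior_probs` play the role of the weights used later in the association model.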

Latent class model

Discrete traits

Let us suppose that copy number status is associated with a binary phenotype (case-control). The association is typically assessed using a χ² test for the contingency table (Table 1). Misclassification in the table (due to uncertainty when inferring CNVs) is introduced when we assign each individual to a given class c using the maximum a posteriori probability (MAP). Thus, this problem can be seen as an association study with misclassification ("measurement error") [21]. It is well known that misclassification of covariates has important implications for parameter estimates and statistical inference [22]. Some approaches account for such error [23,24]. These are, however, based on performing validation studies in a subsample. In the present context, this is unfeasible because hundreds of genes are normally analyzed at a time, and the technology may have a different sensitivity and specificity for each of the inspected loci. Therefore, we propose to use the posterior probability of belonging to each latent class to model the degree of misclassification of copy number status. We then take this information into account in the association model.

Table 1

Contingency table of disease status and copy number category

	Copy number status

Disease	1	2	⋯	C	Total
Cases	r₁	r₂	⋯	r_C	R
Controls	s₁	s₂	⋯	s_C	S
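For reference, the χ² test on a table with the layout of Table 1 can be sketched in a few lines of Python. The function names are hypothetical, the statistic is the standard Pearson one, and the closed-form P-value used here is valid only for three copy number categories (2 degrees of freedom).

```python
import math

def chi_square_statistic(cases, controls):
    """Pearson chi-square statistic for a 2 x C table of counts (rows: cases, controls)."""
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    stat = 0.0
    for r, s in zip(cases, controls):
        column_total = r + s
        for observed, row_total in ((r, n_cases), (s, n_controls)):
            expected = row_total * column_total / n
            stat += (observed - expected) ** 2 / expected
    return stat

def p_value_2df(stat):
    """Upper tail of the chi-square distribution with 2 degrees of freedom."""
    return math.exp(-stat / 2.0)
```

Applied to the true copy number counts of Gene 1 reported in Table 4 (cases 216, 126, 18; controls 210, 75, 6), `p_value_2df(chi_square_statistic(...))` gives a value close to the P for association of 0.0027 shown there.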

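Conditioning on cluster c, we have that

$$P(y_i \mid c; \beta) = p_c^{\,y_i} (1 - p_c)^{1 - y_i} \qquad (4)$$

where $\beta = (\beta_1, \ldots, \beta_C)$, c = 1, ..., C, is our vector of parameters, and

$$p_c = \frac{\exp(\beta_c)}{1 + \exp(\beta_c)}.$$

Then, equation (4) can be rewritten as

$$P(y_i \mid c; \beta) = \frac{\exp(y_i \beta_c)}{1 + \exp(\beta_c)}.$$

Now, we consider that copy number status is measured with error (i.e., the latent class is not known). Therefore, we are modeling the probability of being an affected individual as a mixture of binomial variables, as follows:

$$P(y_i \mid x_i; \beta) = \sum_{c=1}^{C} w_{ic} \, \frac{\exp(y_i \beta_c)}{1 + \exp(\beta_c)}$$

where $w_{ic}$ is the posterior probability that individual i belongs to copy number class c, given in (3). Therefore, assuming conditional independence of case-control status, given latent class, the likelihood function for model parameters can be written as

$$L(\beta) = \prod_{i=1}^{I} \sum_{c=1}^{C} w_{ic} \, \frac{\exp(y_i \beta_c)}{1 + \exp(\beta_c)}.$$

We can then simply compute the odds ratio (OR) of belonging to class c with respect to a given reference r as

$$OR_c = \exp(\beta_c - \beta_r).$$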
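The mixture likelihood for a binary trait can be written down directly. The following Python sketch uses hypothetical names (`lc_loglik_binary`, `odds_ratio`) and assumes that the posterior weights from the calling step are already available as one probability vector per individual.

```python
import math

def lc_loglik_binary(y, w, beta):
    """Log-likelihood of the latent class model for a binary trait.

    y    : list of 0/1 disease indicators
    w    : list of posterior probability vectors, one per individual
    beta : list of class-specific log-odds (beta_1, ..., beta_C)
    """
    total = 0.0
    for yi, wi in zip(y, w):
        mixture = 0.0
        for wic, b in zip(wi, beta):
            p = 1.0 / (1.0 + math.exp(-b))          # P(case | class c)
            mixture += wic * (p if yi == 1 else 1.0 - p)
        total += math.log(mixture)
    return total

def odds_ratio(beta, c, r=0):
    """Odds ratio of class c with respect to reference class r."""
    return math.exp(beta[c] - beta[r])
```

When every weight vector is an indicator (no calling uncertainty), the function reduces to the ordinary Bernoulli log-likelihood, which is a convenient sanity check.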

Quantitative traits

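We now consider the case where our phenotype, Y, is continuous. We assume that $Y \mid c \sim N(\mu_c, \sigma^2)$. In this case, conditioning on cluster c,

$$f(y_i \mid c; \beta, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(y_i - \mu_c)^2}{2\sigma^2} \right)$$

where $\mu_c = \beta_c$. Similar to the case of discrete traits, the likelihood function for model parameters is given by

$$L(\beta, \sigma) = \prod_{i=1}^{I} \sum_{c=1}^{C} w_{ic} \, f(y_i \mid c; \beta, \sigma).$$

In this case we are interested in evaluating the difference between the mean effect of individuals with c copies and r copies. This can simply be computed as

$$\beta_c - \beta_r.$$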
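The quantitative-trait likelihood differs from the binary one only in the conditional density. A minimal Python sketch, with hypothetical function names and the posterior weights assumed to be given, is:

```python
import math

def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2) evaluated at y."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def lc_loglik_quantitative(y, w, beta, sigma):
    """Log-likelihood of the latent class model for a normally distributed trait.

    beta holds the class means and sigma is the common residual standard deviation.
    """
    total = 0.0
    for yi, wi in zip(y, w):
        mixture = sum(wic * normal_pdf(yi, b, sigma) for wic, b in zip(wi, beta))
        total += math.log(mixture)
    return total

def mean_difference(beta, c, r=0):
    """Difference in mean trait level between class c and reference class r."""
    return beta[c] - beta[r]
```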

Covariate Adjustment

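In some instances researchers are interested in assessing the effect of CNVs after adjusting for other covariates, $Z_1, \ldots, Z_K$ (usually called confounding variables). In this case, the likelihood function can be written as

$$L(\beta, \gamma, \sigma) = \prod_{i=1}^{I} \sum_{c=1}^{C} w_{ic} \, P(y_i \mid c, Z_i; \beta, \gamma, \sigma)$$

where

$$P(y_i \mid c, Z_i; \beta, \gamma) = \frac{\exp(y_i h_{ic})}{1 + \exp(h_{ic})} \qquad (9)$$

for discrete traits, and

$$P(y_i \mid c, Z_i; \beta, \gamma, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(y_i - h_{ic})^2}{2\sigma^2} \right) \qquad (10)$$

for quantitative traits. In both cases

$$h_{ic} = \beta_c + \sum_{k=1}^{K} \gamma_k Z_{ik} \qquad (11)$$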

Parameter estimation

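In this section we address parameter estimation for the general situation of having covariates and either discrete or quantitative traits. For brevity, let θ ≡ (β, γ, σ) (notice that for discrete traits σ = 1). We consider that the weights, $w_{ic}$, are known and that they are given by the surrogate variable X from equation (3). Therefore, they can be used in the log-likelihood calculation, resulting in

$$\ell(\theta) = \sum_{i=1}^{I} \log \sum_{c=1}^{C} w_{ic} \, P(y_i \mid c, Z_i, \theta).$$

Here $P(y_i \mid c, Z_i, \theta)$ is given by equations (9) and (10) for discrete and quantitative traits, respectively. The maximum likelihood estimators (MLE) of the model parameters maximize this log-likelihood function. We propose to use a Newton-Raphson procedure to find parameter estimates. The k-th component of the score, S, is given by

$$S_k = \frac{\partial \ell}{\partial \theta_k} = \sum_{i=1}^{I} \frac{\sum_{c} w_{ic} \, \partial P_{ic} / \partial \theta_k}{\sum_{c} w_{ic} \, P_{ic}}.$$

The (k, l)-th element of the Hessian, H, is

$$H_{kl} = \sum_{i=1}^{I} \left[ \frac{\sum_{c} w_{ic} \, \partial^2 P_{ic} / \partial \theta_k \partial \theta_l}{\sum_{c} w_{ic} \, P_{ic}} - \frac{\left( \sum_{c} w_{ic} \, \partial P_{ic} / \partial \theta_k \right) \left( \sum_{c} w_{ic} \, \partial P_{ic} / \partial \theta_l \right)}{\left( \sum_{c} w_{ic} \, P_{ic} \right)^2} \right]$$

where $P_{ic} = P(y_i \mid c, Z_i, \theta)$, whose derivatives follow from those of $h_{ic}$ by the chain rule. Formulae for the derivatives of $h_{ic}$ for covariates and for discrete and quantitative traits are given in the Appendix.

MLE can be used to estimate, under the multiplicative model, the OR between individuals with copy number status c with respect to a reference category (e.g., individuals with copy number status r) as

$$\widehat{OR}_c = \exp(\hat{\beta}_c - \hat{\beta}_r).$$

Similarly, when analyzing continuous traits, the estimated mean effect among individuals with c copies with respect to those with r copies is

$$\hat{\beta}_c - \hat{\beta}_r.$$

The asymptotic variance-covariance matrix of the maximum likelihood estimates of θ can be estimated using the observed information matrix, F, as

$$\widehat{\mathrm{Var}}(\hat{\theta}) = F^{-1}, \qquad F = -H(\hat{\theta}).$$

Therefore, we can compute a 95% confidence interval (CI95%) for $OR_c$ using the expression

$$\exp\!\left( \hat{\beta}_c - \hat{\beta}_r \pm z_{1-\alpha/2} \sqrt{V_{cr}} \right)$$

and for the mean difference

$$\hat{\beta}_c - \hat{\beta}_r \pm z_{1-\alpha/2} \sqrt{V_{cr}}, \qquad V_{cr} = F^{-1}_{[c,c]} + F^{-1}_{[r,r]} - 2 F^{-1}_{[c,r]},$$

where $z_{1-\alpha/2}$ denotes the (1 - α/2)-th quantile of a standard normal distribution, α is the desired type-I error, and subindex [·, ·] denotes the position in the inverse of Fisher's information matrix.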
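The estimation procedure can be sketched in Python for the simplest case, a binary trait without covariates. This is an illustration under stated assumptions rather than the released R code: all function names are hypothetical, the derivatives are worked out analytically for the logistic form of the class-conditional probability, the Newton-Raphson loop has no step-halving safeguard, and the confidence interval uses the variance of the difference of two coefficients taken from the inverse observed information.

```python
import math

def expit(b):
    return 1.0 / (1.0 + math.exp(-b))

def score_and_hessian(y, w, beta):
    """Score vector and Hessian of the mixture log-likelihood (binary trait, no covariates)."""
    n_class = len(beta)
    score = [0.0] * n_class
    hess = [[0.0] * n_class for _ in range(n_class)]
    p = [expit(b) for b in beta]
    for yi, wi in zip(y, w):
        lik = [pc if yi == 1 else 1.0 - pc for pc in p]               # P_ic
        den = sum(wc * lc for wc, lc in zip(wi, lik))
        g = [wc * lc * (yi - pc) for wc, lc, pc in zip(wi, lik, p)]   # w_ic * dP_ic/dbeta_c
        for c in range(n_class):
            score[c] += g[c] / den
            second = wi[c] * lik[c] * ((yi - p[c]) ** 2 - p[c] * (1.0 - p[c]))
            hess[c][c] += second / den
            for l in range(n_class):
                hess[c][l] -= g[c] * g[l] / den ** 2
    return score, hess

def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fit_lc_binary(y, w, n_iter=50, tol=1e-10):
    """Newton-Raphson estimates and inverse observed information F^{-1}."""
    k = len(w[0])
    beta = [0.0] * k
    for _ in range(n_iter):
        score, hess = score_and_hessian(y, w, beta)
        inv = invert(hess)
        step = [sum(inv[a][b] * score[b] for b in range(k)) for a in range(k)]
        beta = [b - s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    _, hess = score_and_hessian(y, w, beta)
    cov = invert([[-h for h in row] for row in hess])
    return beta, cov

def or_with_ci(beta, cov, c, r=0, z=1.959964):
    """Odds ratio of class c versus r with a Wald confidence interval."""
    diff = beta[c] - beta[r]
    se = math.sqrt(cov[c][c] + cov[r][r] - 2.0 * cov[c][r])
    return math.exp(diff), math.exp(diff - z * se), math.exp(diff + z * se)
```

With indicator weights built from the Gene 1 counts of Table 4 for 0 and 1 copies, the sketch returns an OR and interval that round to the 1.63 (1.16, 2.30) reported there, as expected when there is no calling uncertainty.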

Hypothesis testing

We propose to use a likelihood ratio test to assess disease association, taking the model without the copy number variable as reference. Twice the increase in the log-likelihood provides the asymptotic χ² statistic that tests $H_0: \beta_1 = \beta_2 = \ldots = \beta_C$. In many instances, we are interested in studying the trend in effect with respect to copy number status (e.g., additive model). This can be done by generalizing equation (11) in the form

$$h = D\zeta$$

where D is an I × M design matrix, and ζ is a vector of dimension M containing the model parameters. M is the total number of variables included in the model, including copy number status and confounding variables (e.g., M = C + K). For example, for a trend test on copy number status without covariates, the linear predictor reduces to an intercept plus a single slope on the number of copies, $h_{ic} = \zeta_1 + \zeta_2 c$, and the trend hypothesis on copy number status is tested using a likelihood ratio test, comparing this model with the null model. Notice that this formulation allows us to accommodate different or common effects for each latent class. In this case, parameter estimates are obtained as shown above. Formulae for the derivatives obtained in the score and Hessian, where coefficients are not shared by each latent class, are shown in the Appendix. R language functions for the methods discussed in this paper are freely available at [25].
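Once the maximized log-likelihoods of the null and alternative models are available, the likelihood ratio test itself is a short computation. The Python sketch below uses hypothetical function names and, to stay dependency-free, implements the χ² tail probability only for 1 and 2 degrees of freedom, where closed forms exist (a trend test has 1 degree of freedom; a test across three copy number classes has 2).

```python
import math

def chi2_sf(stat, df):
    """Upper tail probability of a chi-square variable for df = 1 or df = 2."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        return math.exp(-stat / 2.0)
    raise ValueError("closed form implemented only for df in {1, 2}")

def likelihood_ratio_test(loglik_null, loglik_alt, df):
    """Return the LRT statistic and its asymptotic P-value."""
    stat = max(2.0 * (loglik_alt - loglik_null), 0.0)
    return stat, chi2_sf(stat, df)
```

For other degrees of freedom a statistics library would normally be used instead of these closed forms.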

Results

Simulation study

We performed computer simulation studies to empirically examine the properties of the parameter estimators developed in the previous sections. The specific goals of these studies were: (i) to evaluate the performance of the proposed likelihood ratio trend test based on the latent class model for a number of CNV measurement distributions; (ii) to examine the effect of sample size (I) on the distributional properties of the estimators; (iii) to examine the bias and mean square error (MSE) of the estimators; (iv) to assess the accuracy of the variance and parameter estimates obtained using the observed information matrix.

Simulations were performed as follows. To study (i), we simulated a binary trait using 300 cases and 300 controls. The unobserved copy number statuses (i.e., latent classes) were simulated with 3 different copy number statuses (C = 3), with the proportion of individuals in each category set as π = (0.5, 0.4, 0.1). The trend OR was set equal to 1.5. The observed signal intensity ratios (X variable) were simulated as a finite mixture of normal distributions using different means, η, and variances, σ², to assess whether the separation of clusters and their variance affect power. To study (ii)–(iv) we simulated binary and quantitative traits. For the binary trait, simulation was performed as above but simulating various scenarios of sample size (I), OR and proportion of individuals with each copy number status, π. Again, we simulated different CNV distributions by varying η and σ². For quantitative traits, we used the same simulation procedure but the trait was simulated depending on a fixed mean trait level for the reference copy number status and a desired mean difference with respect to other copy number statuses. Next, we describe the settings for the different simulation parameters.

Sample size: We chose the values I ∈ {50, 300}. Although current studies are analyzing thousands of individuals, these values were chosen to evaluate the performance of our proposed method in moderately large samples.

Copy number status: Since we were interested in evaluating the performance of the parameter estimates, we only simulated two different copy number statuses, c ∈ {1, 2}.

Odds ratio: To assess the impact of the strength of association between the disease and CNV, we chose two values for OR, OR ∈ {1.3, 2}, in order to consider a moderate association and a strong one.

Proportion of individuals with normal copy number status: To evaluate the impact of classes with different numbers of individuals we set π ∈ {(0.8, 0.2), (0.5, 0.5)}.

Finite mixture: To assess the impact of the distribution of the intensity ratio, X, we simulated two normal distributions with the following parameters: η ∈ {1, 1.5}, which correspond to having 2 (considered as normal copy number status) and 3 copies, respectively, and σ ∈ {(0.15, 0.15), (0.15, 0.2), (0.2, 0.2)}. These scenarios also helped us to model different situations regarding misclassification, that is, how well the latent classes were separated.

We compared three different approaches. The first (NAIVE) was based on assessing association between disease and copy number status obtained using MAP from the finite mixture model (2). That is, association was assessed using a χ² test from Table 1. The second is the approach that has been used predominantly to date when analyzing this kind of data and is based on assigning CNV status using pre-defined thresholds (THRES). Association is then assessed using a χ² test. As mentioned previously, we simulated data from a mixture of two normal distributions with means of 1 and 1.5. This is equivalent to simulating individuals with 2 and 3 copies, respectively. In this situation, it is considered that individuals with intensity (or intensity ratio) greater than 1.33 correspond to individuals with 3 copies [10]. The third method is the one proposed in this paper, based on latent class (LC) using a χ² test. In order to make the results comparable, the performance of LC based on the likelihood ratio trend test was compared with that of the two other methods using a χ² trend test (i.e., 1 degree of freedom). To evaluate bias and MSE of parameter estimates, the χ² test of association was used for all three methods.

Simulation results for evaluating the performance of the likelihood ratio trend test in our proposed model are shown in Figure 2. The top figures show the power for all methods analyzed under two scenarios (other scenarios are given in Additional file 1).
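The difference between threshold calling (THRES) and MAP calling can be illustrated with a small Python simulation. This is a sketch under simplifying assumptions, not the simulation code used in the paper: the function names are hypothetical, MAP calls are made with the true mixture parameters rather than estimated ones, and only the calling error is evaluated, not the downstream association test.

```python
import math
import random

def simulate_intensities(n, weights, means, sds, seed=0):
    """Draw latent copy number classes and intensity ratios from a Gaussian mixture."""
    rng = random.Random(seed)
    classes = rng.choices(range(len(weights)), weights=weights, k=n)
    xs = [rng.gauss(means[c], sds[c]) for c in classes]
    return classes, xs

def threshold_call(x, cutoff=1.33):
    """THRES: class 1 (3 copies) if the ratio exceeds the pre-defined cutoff, else class 0."""
    return 1 if x > cutoff else 0

def map_call(x, weights, means, sds):
    """MAP: class with the largest posterior probability (known mixture parameters assumed)."""
    joint = [p * math.exp(-0.5 * ((x - m) / s) ** 2) / s
             for p, m, s in zip(weights, means, sds)]
    return max(range(len(joint)), key=joint.__getitem__)

def error_rate(true_classes, called):
    """Proportion of individuals assigned to the wrong class."""
    return sum(t != c for t, c in zip(true_classes, called)) / len(true_classes)
```

For example, with π = (0.5, 0.5), η = (1, 1.5) and σ = (0.15, 0.15), comparing `error_rate` for the two calling rules on a few thousand simulated individuals shows how a fixed cutoff of 1.33 misclassifies more individuals than MAP in that scenario.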

Figure 2

Empirical power for simulation studies. Empirical power for the three different approaches analyzed, varying the quality of clustering for underlying copy number status. Left panel is for fixed variance and varying means, while the right panel is for fixed mean and varying variances.

The left panel shows the power for each method, varying the CNV measurement distribution with regard to the mean of each latent class, η, while the right panel gives the same information but with fixed means and varying variances, σ². Figure 2 also depicts the distribution of CNV signal intensities for various scenarios. We observe that our proposed latent class model performs better in all cases, even when the distributions of copy number status are not very well separated (i.e., more uncertainty). Simulation results to evaluate parameter estimates for discrete traits are presented in Table 2 and in Table S1 and Figures S3 and S4 (see Additional file 1). Similar results and conclusions are obtained for a quantitative trait. Table 2 and Figures S3 and S4 (see Additional file 1) summarize the OR obtained by comparing individuals with 3 copies to those with 2 copies (reference category) and give the MSE for two different sample sizes, I, two different proportions of individuals with 2 copies, π, and different standard deviations for each component of the mixture, σ. Table S1 (see Additional file 1) compares different methods to compute the standard error of the ORs for the various scenarios described above. The results compare the asymptotic variance based on the observed information matrix (ASYM) with the empirical variance (EMP). Supplementary Table S1 also shows coverage and power of confidence intervals based on the three methods analyzed. As expected, when the sample size increased, the performance of the estimators of the finite-dimensional parameters improved (Table 2). In almost all scenarios, the LC method performs better than the others: LC shows less bias than NAIVE and THRES in nearly all scenarios and a lower MSE in most of them (Table 2).

Table 2

Simulation study

				Estimated odds ratio				Mean square error (×10³)

I	π	e^β	σ	SIM	NAIVE	THRES	LC	NAIVE	THRES	LC
50	0.8	1.3	(0.15,0.15)	1.23	1.17	1.15	1.20	57	87	42
50	0.8	1.3	(0.2,0.2)	1.24	1.14	1.09	1.21	107	131	114
50	0.8	1.3	(0.15,0.2)	1.28	1.18	1.15	1.24	134	148	112
50	0.8	2	(0.15,0.15)	1.60	1.40	1.28	1.48	54	85	44
50	0.8	2	(0.2,0.2)	1.82	1.36	1.29	1.52	152	158	126
50	0.8	2	(0.15,0.2)	1.89	1.42	1.33	1.57	180	253	162
50	0.5	1.3	(0.15,0.15)	1.26	1.24	1.21	1.26	39	51	32
50	0.5	1.3	(0.2,0.2)	1.32	1.28	1.25	1.35	82	79	97
50	0.5	1.3	(0.15,0.2)	1.26	1.23	1.20	1.26	66	72	60
50	0.5	2	(0.15,0.15)	2.04	1.94	1.83	2.05	40	67	34
50	0.5	2	(0.2,0.2)	2.04	1.76	1.68	2.05	107	128	92
50	0.5	2	(0.15,0.2)	2.06	1.78	1.72	1.99	87	107	71

300	0.8	1.3	(0.15,0.15)	1.30	1.25	1.18	1.30	13	32	10
300	0.8	1.3	(0.2,0.2)	1.32	1.25	1.15	1.34	27	50	29
300	0.8	1.3	(0.15,0.2)	1.30	1.22	1.16	1.29	24	42	21
300	0.8	2	(0.15,0.15)	2.01	1.87	1.49	2.01	21	120	13
300	0.8	2	(0.2,0.2)	2.03	1.70	1.36	1.99	69	203	43
300	0.8	2	(0.15,0.2)	2.03	1.62	1.38	1.86	78	189	38
300	0.5	1.3	(0.15,0.15)	1.31	1.27	1.26	1.30	7	9	5
300	0.5	1.3	(0.2,0.2)	1.30	1.23	1.22	1.30	15	17	12
300	0.5	1.3	(0.15,0.2)	1.30	1.24	1.23	1.29	12	14	9
300	0.5	2	(0.15,0.15)	2.00	1.87	1.77	2.00	11	23	5
300	0.5	2	(0.2,0.2)	2.00	1.72	1.66	2.02	36	51	15
300	0.5	2	(0.15,0.2)	2.00	1.76	1.71	1.97	26	37	10

Odds ratio (e^β) and mean square error obtained in 1,000 simulations using the three different approaches, NAIVE, THRES and LC (see text for a description of each). Results are given for different scenarios, varying the number of individuals (I), the proportion of individuals with each copy number status (π), the odds ratio (e^β), and the variance for CNV quantitative measurements (σ).

Regarding variance estimates, the estimation based on ASYM showed good performance in all scenarios (see Additional file 1, Table S1). Although ASYM slightly overestimated EMP, the bias was less pronounced for I = 300, as expected. Confidence intervals based on the LC method outperform those obtained by other methods with regard to power.

Application to real data

MLPA example

The first data set used to analyze CNV and disease was generated and kindly provided by one of the coauthors of the current work. Although the data are still unpublished, they have been made available in a blinded format for reproducing our findings using the approach presented herein, and for other validation studies. Some candidate genes were identified after performing a whole genome scan analysis using aCGH, where a pool of controls and a pool of cases were compared. In order to further investigate the relationship between the disease and the altered genes, a targeted study including several variants was designed using the MLPA technique. We obtained signal intensities of MLPA assays for 360 cases and 291 controls. Figures 3 and 4 show the intensities for cases and controls for two selected genes. In both cases, we observe 3 latent classes, corresponding to 0, 1, and 2 copies of the gene. We found that the finite mixture model fits very well (χ² goodness-of-fit test, P = 0.6615 and P = 0.4888). The main difference between these two cases is that copy number status for gene 1 can be established using a threshold method, while for the second gene this classification seems more arbitrary. As a consequence, misclassification should be taken into account when analyzing gene 2. Table 3 shows the classification of individuals as having 0, 1, or 2 copies, estimated using equation (2), against the true copy number obtained by breakpoint cloning and assessing allele presence by PCR, which unequivocally reports the exact number of copies. From the table, we can see that the finite mixture model gives a perfect classification for gene 1 and some misclassification for gene 2. Goodness-of-fit test revealed that the proposed mixture model to determine CNV status was appropriate (p = 0.6615 and p = 0.1586).

Figure 3

Figure 4

Association between Gene 2 and disease. Graphical representation of peak intensities (CNV quantitative measurement) of individuals for Gene 2 analyzed in the example. The various colors indicate copy number status inferred using our proposed finite mixture model.

Table 3

Contingency table of estimated and true copy number status for the two genes examined in the real data example.

	True copy number status

	0	1	2
Gene 1
0	426	0	0
1	0	201	0
2	0	0	24

Gene 2
0	85	0	0
1	5	287	0
2	0	73	204

Figure 3 caption: Association between Gene 1 and disease. Graphical representation of peak intensities (CNV quantitative measurement) of individuals for Gene 1 analyzed in the example. The various colors indicate copy number status inferred using our proposed finite mixture model.

Table 4 shows the ORs and their 95% CI for the two genes analyzed. The first three columns show the results obtained in the laboratory using PCR, while the other columns show the results obtained after estimating the copy number status using our proposed finite mixture model and computing the ORs using a naïve approach (i.e., assuming that there is no misclassification) and the LC model that accounts for misclassification. As we can see, the results are the same for gene 1, since no misclassification is observed (see Figure 3 and Table 3). However, for gene 2, copy number status could not be determined as easily as for gene 1. Thus, we observe a different OR estimation and, more importantly, a different P-value for association. For instance, the order of magnitude of the P-value for association between the disease and gene 2 obtained with the true copy number status (7.2 × 10^-5) is better captured by the LC model (8.4 × 10^-5) than by the NAIVE approach (2.3 × 10^-4). Regarding the OR estimates, the analysis using the true copy number status shows that individuals with one copy of gene 2 have a 54% decrease in the odds of disease with respect to individuals with 0 copies (OR = 0.46). As the 95% CI shows, this difference is statistically significant. We arrive at the same conclusion when we compare individuals with 2 copies with respect to those with 0 copies. Note that in both cases we observe that the naïve approach underestimates the OR, as shown by the simulation study.

Table 4

Association analysis of disease status and copy number category using the true copy number status and the estimated status obtained using the finite mixture proposed.

	True CN			Estimated CN
	Co	Ca	OR (CI95%)	Co	Ca	OR_naïve(CI95%)	OR_LC(CI95%)
Gene 1
0	210	216	1	210	216	1	1
1	75	126	1.63 (1.16,2.30)	75	126	1.63 (1.16,2.30)	1.63 (1.16,2.30)
2	6	18	2.92 (1.14,7.49)	6	18	2.92 (1.14,7.49)	2.92 (1.14,7.50)
P association			0.0027			0.0027	0.0023
P trend			5.0 × 10^-4			5.0 × 10^-4	5.0 × 10^-4

Gene 2
0	24	66	1	22	63	1	1
1	159	201	0.46 (0.27,0.77)	129	178	0.44 (0.26,0.75)	0.47 (0.27,0.82)
2	108	93	0.31 (0.18,0.54)	140	119	0.33 (0.19,0.57)	0.31 (0.18,0.54)
P association			7.2 × 10^-5			2.3 × 10^-4	8.4 × 10^-5
P trend			2.1 × 10^-5			1.0 × 10^-4	2.1 × 10^-5

aCGH example

The analysis of aCGH data requires additional steps to take into account the dependency across probes. Table 5 shows the four steps we recommend for the analysis of this kind of data. First, MAP should be obtained with an algorithm that considers probe correlation. We use, in particular, the CGHcall R program, which includes a mixture model to infer CNV status [18]. Second, we build blocks/regions of consecutive clones with similar signatures. To perform this step the CGHregions R library was used [26]. Third, the association between the CNV status of blocks and the trait is assessed by incorporating the uncertainty probabilities in the LC model. Fourth, corrections for multiple comparisons must be performed. We use the Benjamini-Hochberg (BH) correction [27]. This is a heuristic method that is robust against positive dependence and increasingly conservative as correlation increases [28].

Table 5

Steps used to assess association between CNVs and traits when aCGH is used.

Step 1. Use any aCGH calling procedure that provides MAP (uncertainty)

Step 2. Build blocks/regions of consecutive probes with similar signatures

Step 3. Use the signature that occurs most in a block to perform association using LC model

Step 4. Correct for multiple testing considering dependency among signatures
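Step 4 can be illustrated with a short Python sketch of the Benjamini-Hochberg adjustment. The function name is hypothetical, and the code implements the standard step-up procedure (the same adjustment that R's `p.adjust` offers under `method = "BH"`), not any code distributed with the paper.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted P-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by increasing P-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A block is then declared associated at the 5% level when its adjusted P-value is at most 0.05.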

We applied the methodology to the breast cancer data studied by Neve et al. [29], which is freely available from the Bioconductor website [30]. The data consist of CGH arrays of 1 Mb resolution [31]. The authors chose the 50 samples that could be matched to the name tokens of caArrayDB data (June 9th 2007). In this example the association between estrogen receptor positivity (dichotomous variable; 0: negative, 1: positive) and CNVs was tested. We contrasted the association as given by the LC and the NAIVE models. The original data set contained 2621 probes, which were reduced to 459 blocks after the application of the CGHcall and CGHregions functions. Table 6 shows the number of CNV blocks associated with estrogen receptor positivity for different significance levels. We observe that incorporating classification uncertainty with the LC model substantially increased the level of association, as compared to the NAIVE approach. The number of significant associations at the 5% level after applying the BH correction was 49 and 24 for the LC and NAIVE approaches, respectively.

Table 6

Number of CNV blocks (out of 459) associated with estrogen receptor positivity from 50 aCGH breast cancer cell lines.

	Significance level

	10^-6	10^-5	10^-4	10^-3	10^-2
Latent class model	1	4	27	64	117
Chi-square test	0	2	10	41	93

Results are given for different levels of association and comparing our proposed model with the naïve approach that does not consider uncertainty.


Discussion

In this paper we have shown that assessing association between CNVs and disease with methods that do not take into account the uncertainty of inferred copy number status leads to larger p-values and underestimates the model parameters. This compounds the need to increase statistical power, which is already reduced by the multiple comparison correction required for the simultaneous testing of several loci. False positives are typically controlled by a dramatic reduction in the nominal p-value, such that very low values are required to reach statistical significance. Thus, a precise computation of these values is essential in genetic association studies. Here we have proposed a latent class (LC) model that accounts for the uncertainty of assessing CNV status and also accommodates potential confounding factors. In the case of analyzing quantitative traits, we also provide formulae to further propagate call uncertainty, as other authors have proposed in another context [32]. When analyzing quantitative traits, we have assumed that the response variable follows a normal distribution, although this assumption does not hold in some instances. In this situation, one possibility is to analyze the log-transformed variable, although log transformation may not be sufficient. The model could easily be extended to fit a response variable that has any exponential family distribution (e.g. normal, gamma, Poisson). However, we have not yet implemented this option in the functions reported here. The extension of our proposed latent class model to assess survival time, possibly with right-censored data, is not trivial but could be a very interesting avenue for future investigation. The parameter estimation procedure proposed here allows the estimation of confidence intervals. The LC model was remarkably consistent with simulated data. 
In particular, we found that the p-values obtained with the LC model were more similar to the expected values than those obtained by the threshold and naïve methods. We maximize the likelihood function, assuming fixed weights for each copy number status, which accounts for possible misclassification. The main advantage of considering weights as known constants is that the Newton-Raphson procedure is much simpler and faster, and that the Hessian matrix can be obtained analytically. We confirmed that the proposed model captures very well the nature of the synthetic data and variance estimates. Interestingly, we observed that the variance estimates using MLE were also reproduced when a bootstrap procedure was used (see Additional file 1, Table S2). In the interest of generalization, one can consider maximizing the likelihood function for both model parameters and weights. In that case, an EM algorithm should be used instead. However, one should bear in mind that EM does not allow for estimation of the variance of the model parameters and is computationally expensive, which may be particularly costly if this method is used in whole genome scan settings.
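The fixed-weight Newton-Raphson idea can be sketched in Python for the simplest case, a binary trait without covariates. This is an illustration under stated assumptions rather than the authors' implementation: each copy number class c is given its own disease probability on the logit scale, the weights `w[i][c]` are treated as known call probabilities, and the names `expit`, `solve` and `fit_lc_binary` are hypothetical.

```python
import math


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_lc_binary(y, w, n_iter=25, tol=1e-10):
    """Estimate per-class disease probabilities with fixed weights w[i][c]."""
    n_classes = len(w[0])
    beta = [0.0] * n_classes  # logit-scale parameters, one per class
    for _ in range(n_iter):
        p = [expit(b) for b in beta]
        score = [0.0] * n_classes
        hess = [[0.0] * n_classes for _ in range(n_classes)]
        for yi, wi in zip(y, w):
            # Weighted class likelihoods w_ic * h_ic and their sum D_i.
            wh = [wi[c] * (p[c] if yi == 1 else 1.0 - p[c])
                  for c in range(n_classes)]
            d = sum(wh)
            # Derivative of w_ic * h_ic with respect to beta_c, divided by D_i.
            g = [wh[c] * (yi - p[c]) / d for c in range(n_classes)]
            for c in range(n_classes):
                score[c] += g[c]
                hess[c][c] += wh[c] * ((yi - p[c]) ** 2 - p[c] * (1.0 - p[c])) / d
                for k in range(n_classes):
                    hess[c][k] -= g[c] * g[k]
        step = solve(hess, score)
        beta = [b - s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return [expit(b) for b in beta]
```

With one-hot weights (calls made with certainty) the estimates reduce to the observed disease proportion in each class. The sketch has no safeguards such as step halving or a check for a singular Hessian, which a production routine would need.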

Conclusion

We have shown that the LC model can incorporate uncertainty of CNV calling in the analysis. We have also illustrated how to analyze quantitative traits as well as how to accommodate confounding variables. This is of particular importance in studies of complex diseases, where other clinical or biochemical factors need to be taken into account. The formulation can also be generalized to assess survival times or counts in longitudinal studies. The model has shown good performance when analyzing both targeted (MLPA data) and whole genome (aCGH data) studies.

Authors' contributions

JRG and IS developed the new statistical methods. JRG wrote the R functions and the main text of the manuscript and performed the simulation studies. GE and AC made abundant suggestions for developing the models. SP worked on the Gaussian mixture approach to model quantitative CNV measurements. XE reviewed the paper and revised its framework. LA and JRG identified the need for a statistical tool to measure the biological differences in allele distribution in cohorts of cases and controls, and conceived the study. All authors have read and approved the final manuscript.

Appendix

To obtain parameter estimates, we maximize the log-likelihood function where P(y| = c, Z, ) is given by equations (9) and (10) for discrete and quantitative traits, respectively. As previously mentioned, the k-th component of the score, S, is given by The k-th element of the Hessian, H, is where Herein we provide formulae for the derivatives of h for all cases discussed in this paper. Although the following expressions may appear complicated, they are straightforward to program and are included in the R functions available at .
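As a reading aid, the generic structure of the score and Hessian for a likelihood with fixed weights can be sketched as follows. The notation is an assumption and may differ from the original displays: w_{ic} is the fixed weight (call probability) of copy number class c for individual i, h_{ic}(θ) is the class-conditional likelihood of the trait, and D_i is their weighted sum.

```latex
\ell(\theta) = \sum_{i=1}^{n} \log D_i(\theta),
\qquad
D_i(\theta) = \sum_{c} w_{ic}\, h_{ic}(\theta)

S_k(\theta) = \frac{\partial \ell}{\partial \theta_k}
  = \sum_{i=1}^{n} \frac{1}{D_i} \sum_{c} w_{ic}\,
    \frac{\partial h_{ic}}{\partial \theta_k}

H_{kl}(\theta) = \frac{\partial^2 \ell}{\partial \theta_k\, \partial \theta_l}
  = \sum_{i=1}^{n} \left[
      \frac{1}{D_i} \sum_{c} w_{ic}\,
        \frac{\partial^2 h_{ic}}{\partial \theta_k\, \partial \theta_l}
      - \frac{1}{D_i^{2}}
        \left( \sum_{c} w_{ic}\, \frac{\partial h_{ic}}{\partial \theta_k} \right)
        \left( \sum_{c} w_{ic}\, \frac{\partial h_{ic}}{\partial \theta_l} \right)
    \right]
```

Under this form only the first and second derivatives of h change from one trait type to another, which is why the cases below are stated in terms of h.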

Binary Traits

Binary Traits without covariates

In this case, the h function takes the form Therefore, where and

Binary Traits with covariates

In this case, the h function takes the form Therefore, where and For covariates:

Quantitative traits

Quantitative traits without covariates and shared variance

In this case, the h function takes the form Therefore,

Quantitative traits with covariates and shared variance

In this case, the h function takes the form Therefore,

Trend test

In this situation we can write the linear predictor of equation (18) as In other words, β1 plays the role of an intercept and β2 is the slope. In this case, we consider that both β1 and β2 are shared for each latent class. In this situation, bearing in mind that , for the discrete traits, we have that and For quantitative traits, where , we have that and For the variance, we have that and
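For the discrete-trait case, the derivatives can be sketched under two assumptions that are not confirmed by the text available here: a logistic link, and a numeric score s_c (for example, the number of copies) attached to latent class c. The symbols π_c and s_c are introduced only for this illustration.

```latex
\pi_c = \frac{\exp(\beta_1 + \beta_2 s_c)}{1 + \exp(\beta_1 + \beta_2 s_c)},
\qquad
h_{ic} = \pi_c^{\,y_i} (1 - \pi_c)^{1 - y_i}

\frac{\partial h_{ic}}{\partial \beta_1} = h_{ic}\,(y_i - \pi_c),
\qquad
\frac{\partial h_{ic}}{\partial \beta_2} = s_c\, h_{ic}\,(y_i - \pi_c)

\frac{\partial^2 h_{ic}}{\partial \beta_1^2}
  = h_{ic} \left[ (y_i - \pi_c)^2 - \pi_c (1 - \pi_c) \right],
\qquad
\frac{\partial^2 h_{ic}}{\partial \beta_1\, \partial \beta_2}
  = s_c\, \frac{\partial^2 h_{ic}}{\partial \beta_1^2},
\qquad
\frac{\partial^2 h_{ic}}{\partial \beta_2^2}
  = s_c^{2}\, \frac{\partial^2 h_{ic}}{\partial \beta_1^2}
```

Because β1 and β2 are shared across classes, the trend model has two parameters regardless of the number of copy number classes.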

Additional file 1

Tables and figures for more scenarios of simulation studies.

21 in total

1. Linkage disequilibrium and heritability of copy-number polymorphisms within duplicated regions of the human genome.

Authors: Devin P Locke; Andrew J Sharp; Steven A McCarroll; Sean D McGrath; Tera L Newman; Ze Cheng; Stuart Schwartz; Donna G Albertson; Daniel Pinkel; David M Altshuler; Evan E Eichler
Journal: Am J Hum Genet Date: 2006-06-15 Impact factor: 11.025

Review 2. Structural variation in the human genome.

Authors: Lars Feuk; Andrew R Carson; Stephen W Scherer
Journal: Nat Rev Genet Date: 2006-02 Impact factor: 53.242

3. A segmentation/clustering model for the analysis of array CGH data.

Authors: F Picard; S Robin; E Lebarbier; J-J Daudin
Journal: Biometrics Date: 2007-09 Impact factor: 2.571

4. Nonparametric testing for DNA copy number induced differential mRNA gene expression.

Authors: Wessel N van Wieringen; Mark A van de Wiel
Journal: Biometrics Date: 2008-05-13 Impact factor: 2.571

5. A chromosome 8 gene-cluster polymorphism with low human beta-defensin 2 gene copy number predisposes to Crohn disease of the colon.

Authors: Klaus Fellermann; Daniel E Stange; Elke Schaeffeler; Hartmut Schmalzl; Jan Wehkamp; Charles L Bevins; Walter Reinisch; Alexander Teml; Matthias Schwab; Peter Lichter; Bernhard Radlwimmer; Eduard F Stange
Journal: Am J Hum Genet Date: 2006-07-12 Impact factor: 11.025

6. The influence of CCL3L1 gene-containing segmental duplications on HIV-1/AIDS susceptibility.

Authors: Enrique Gonzalez; Hemant Kulkarni; Hector Bolivar; Andrea Mangano; Racquel Sanchez; Gabriel Catano; Robert J Nibbs; Barry I Freedman; Marlon P Quinones; Michael J Bamshad; Krishna K Murthy; Brad H Rovin; William Bradley; Robert A Clark; Stephanie A Anderson; Robert J O'connell; Brian K Agan; Seema S Ahuja; Rosa Bologna; Luisa Sen; Matthew J Dolan; Sunil K Ahuja
Journal: Science Date: 2005-01-06 Impact factor: 47.728

7. Relative impact of nucleotide and copy number variation on gene expression phenotypes.

Authors: Barbara E Stranger; Matthew S Forrest; Mark Dunning; Catherine E Ingle; Claude Beazley; Natalie Thorne; Richard Redon; Christine P Bird; Anna de Grassi; Charles Lee; Chris Tyler-Smith; Nigel Carter; Stephen W Scherer; Simon Tavaré; Panagiotis Deloukas; Matthew E Hurles; Emmanouil T Dermitzakis
Journal: Science Date: 2007-02-09 Impact factor: 47.728

8. MLPA screening in the BRCA1 gene from 1,506 German hereditary breast cancer cases: novel deletions, frequent involvement of exon 17, and occurrence in single early-onset cases.

Authors: Stefanie Engert; Barbara Wappenschmidt; Beate Betz; Karin Kast; Michael Kutsche; Heide Hellebrand; Timm O Goecke; Marion Kiechle; Dieter Niederacher; Rita K Schmutzler; Alfons Meindl
Journal: Hum Mutat Date: 2008-07 Impact factor: 4.878

9. CGHregions: dimension reduction for array CGH data with minimal information loss.

Authors: Mark A van de Wiel; Wessel N van Wieringen
Journal: Cancer Inform Date: 2007-02-08

10. Copy number polymorphism in Fcgr3 predisposes to glomerulonephritis in rats and humans.

Authors: Timothy J Aitman; Rong Dong; Timothy J Vyse; Penny J Norsworthy; Michelle D Johnson; Jennifer Smith; Jonathan Mangion; Cheri Roberton-Lowe; Amy J Marshall; Enrico Petretto; Matthew D Hodges; Gurjeet Bhangal; Sheetal G Patel; Kelly Sheehan-Rooney; Mark Duda; Paul R Cook; David J Evans; Jan Domin; Jonathan Flint; Joseph J Boyle; Charles D Pusey; H Terence Cook
Journal: Nature Date: 2006-02-16 Impact factor: 49.962
