
A Bayesian integrative model for genetical genomics with spatially informed variable selection.

Alberto Cassese¹, Michele Guindani², Marina Vannucci³.

Abstract

We consider a Bayesian hierarchical model for the integration of gene expression levels with comparative genomic hybridization (CGH) array measurements collected on the same subjects. The approach defines a measurement error model that relates the gene expression levels to latent copy number states. In turn, the latent states are related to the observed surrogate CGH measurements via a hidden Markov model. The model further incorporates variable selection with a spatial prior based on a probit link that exploits dependencies across adjacent DNA segments. Posterior inference is carried out via Markov chain Monte Carlo stochastic search techniques. We study the performance of the model in simulations and show better results than those achieved with recently proposed alternative priors. We also show an application to data from a genomic study on lung squamous cell carcinoma, where we identify potential candidates of associations between copy number variants and the transcriptional activity of target genes. Gene ontology (GO) analyses of our findings reveal enrichments in genes that code for proteins involved in cancer. Our model also identifies a number of potential candidate biomarkers for further experimental validation.


Keywords: Bayesian hierarchical models; copy number variants; gene expression; measurement error; variable selection

Year: 2014 PMID: 25288877 PMCID: PMC4179607 DOI: 10.4137/CIN.S13784

Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351

Introduction

Copy number variants (CNVs) are chromosomal aberrations that result in an abnormal number of copies of specific DNA segments in comparison with a reference genome. Studies have reported that as much as 12% of the human genome varies in copy number.1 It is believed that some CNVs have no obvious phenotypic consequence or are merely related to normal phenotypic variations, while others may be related to genomic disorders and susceptibility to disease. For example, the amplification of a DNA segment in a gene that promotes cell replication may cause the cell to begin dividing excessively, as usually happens in cancer cells. The challenge of detecting CNVs has received a lot of attention, and several methods have been developed to infer CNVs from high-throughput array-based technologies, such as comparative genomic hybridization (CGH) and single nucleotide polymorphism arrays. These methods mostly rely on hidden Markov models (HMMs)2,3 and circular binary segmentation.4 Another question of interest is the identification of CNVs associated with biological functions and complex human diseases. Procedures commonly used include univariate tests or simple linear regression models, with multiple testing correction, to relate the normalized intensity measurements to the outcomes of interest.3,5 A stochastic partitioning method for a multivariate model has been recently developed.6 The model identifies sets of correlated gene expression levels and sets of chromosomal aberrations that jointly affect mRNA transcript abundances. A disadvantage of all such methods is that they do not infer copy number states. Indeed, the high noise level in the raw signal intensities may lead to the identification of a large number of false positives (FPs).7 An approach widely used to address this problem is to perform the analysis in two steps, first by estimating the copy number states and then using those as the true states in a subsequent association analysis. 
However, using the estimated copy numbers as if they were the true states ignores the uncertainty in the estimation process and can introduce bias. Some methods that incorporate the uncertainty in copy number estimation into the association analysis have been proposed.8,9 Here, we consider a Bayesian hierarchical model that handles CNV detection and association analysis in a unified manner, by integrating array CGH and gene expression data collected on the same set of subjects. The framework takes advantage of a recently proposed measurement error model10 that relates the gene expression levels to latent copy number states. In turn, the latent states are related to the observed surrogate CGH measurements via an HMM. The model incorporates a variable selection procedure with a prior distribution on the latent selection indicator that exploits dependencies across adjacent DNA segments. In this study, we investigate an alternative formulation of the spatially dependent variable selection prior, that is the basis of the measurement error model in Ref. 10, and show that it allows for increased flexibility, remarkably easy interpretation of the key parameters and major performance improvements. More specifically, the selection prior that we propose herein is based on a latent probit link; therefore, it can easily accommodate additional available covariate information to improve detection of significant associations. Model fitting and posterior inference are accomplished via Markov chain Monte Carlo (MCMC) stochastic search techniques. We explore the performance of this model in simulations and demonstrate an overall better performance of the model with the newly proposed prior. We also show an application to data from a genomic study on lung squamous cell carcinoma, where we identify potential candidates of associations between CNVs and the transcriptional activity of target genes. 
GO analyses of our findings reveal enrichments in genes that code for proteins involved in cancer. The rest of the article is organized as follows: in Section 2, we introduce the integrative Bayesian model and its major components. In Section 3, we report the results from a simulation study and the case study. We conclude with some remarks in Section 4.

Methods

This section is organized as follows. In Section 2.1, we review the integrative framework that we follow in the manuscript. In Section 2.2, we introduce an improved prior model for gene–CGH associations, and in Section 2.3, we describe the model for analyzing copy number aberrations.

Integrative Bayesian hierarchical model

A Bayesian hierarchical model that integrates gene expression levels with CNVs has been recently proposed.10 The model provides a unified approach for simultaneously inferring copy number states for all samples and identifying associations between sets of gene expression levels and copy number states. Let $Y_{ig}$ denote the expression measurement for gene g (g = 1,…, G) and $X_{im}$ the observed CGH measurement, ie, the normalized log2 intensity ratio, for the mth CGH probe (m = 1,…, M), in sample i (i = 1,…, n). Let Z = [Y, X] indicate the matrix of all data. In our integrative framework, the observed CGH intensities, X, are treated as surrogates for the unobserved copy number states, and an HMM accounts for the measurement error in the observed intensities. Let $\xi = [\xi_1, \ldots, \xi_M]$ be the n × M matrix of the latent copy number states. We assume that the CGH probes are ordered according to their chromosomal location and that the elements of the matrix ξ take any of four possible values:

$$\xi_{im} = \begin{cases} 1 & \text{copy number loss (fewer than two copies)} \\ 2 & \text{copy-neutral state} \\ 3 & \text{single copy gain} \\ 4 & \text{multiple copy gain.} \end{cases}$$

We assume that, given the latent states, the observed CGH measurements contain no additional information on the observed gene expression levels. Furthermore, we assume independence of the gene expression measurements, conditional on the copy number states, and independence of the CGHs, given their latent states. Hence, we factorize the likelihood into two components as

$$f(Z \mid \xi) = \prod_{g=1}^{G} f(Y_g \mid \xi) \prod_{m=1}^{M} f(X_m \mid \xi_m), \qquad (1)$$

where one component captures the latent structure underlying the CGH intensities and the other component models the association between the resulting copy number states and the gene expression levels. Such joint modeling reduces the bias that arises when the uncertainty in the CNV estimation process is ignored (ie, copy number calls are used as if they were the true states), by allowing for the simultaneous inference of CNVs and their association with gene expression.10

Modeling the association between gene expression and CNVs

The model on Y in the likelihood factorization (1) captures the association between the gene expression levels and the latent CNV states. A commonly used modeling approach assumes a linear regression model of the type

$$Y_{ig} = \mu_g + \xi_i \beta_g + \epsilon_{ig}, \qquad (2)$$

where $\mu_1, \ldots, \mu_G$ are gene-specific intercepts, $\xi_i$ is the ith row of ξ, $\beta_g = (\beta_{g1}, \ldots, \beta_{gM})^T$ is the vector of regression coefficients for gene g, and $\epsilon_{ig} \sim N(0, \sigma_g^2)$, with $\sigma_g^2$ being the gene-specific variance.6,10,11 In model (2), we seek, for each gene, a parsimonious set of CGH aberrations that most likely affect the gene expression levels. This can be seen as a variable selection problem. Let R be a G × M binary matrix representing the associations, that is, $r_{gm}$ is set to 1 if $\beta_{gm}$ in equation (2) is significantly different from zero, and is set to 0 otherwise. A common Bayesian approach to variable selection employs $r_{gm}$ to define a spike-and-slab prior (3) on $\beta_{gm}$, which places a continuous slab distribution on $\beta_{gm}$ when $r_{gm} = 1$ and $\delta_0(\cdot)$, a point mass at zero, when $r_{gm} = 0$.12–15 The prior model is completed with conjugate distributions on the error precision, $\sigma_g^{-2}$, and on the intercepts, $\mu_g$, with δ, d, $c_\mu$, and $c_\beta$ being hyperparameters to be set. A key feature in the variable selection construction above is the prior distribution on the latent selection indicator $r_{gm}$ in (3). A mixture of an independent prior, ie, a Bernoulli prior, and a dependent component accounting for dependence between adjacent DNA segments has been proposed.10 Here, we propose a spatially informed distribution based on a probit link. Contiguous regions with the same non-neutral copy number state are likely to correspond to the same DNA aberration and therefore to jointly affect the expression level of a gene. Accordingly, a spatial prior formulation explicitly assumes that the probability of selection at location m depends on the copy number states and on the selection status of its adjacent probes at positions {m − 1, m + 1}. 
A way of achieving this is to first define a probe-specific quantity, $s_{(m-1)}$, that captures information on the physical distance among probes and on the frequency of change points at position m in copy number states across all samples. This quantity is a function of d, the distance between the adjacent probes {m − 1, m}, and of D, the total length of the DNA fragment (eg, the length of the chromosome) under consideration. Comparable measures of similarity that incorporate physical distances between probes have been reported in the literature on copy number detection.3,16 In this study, we propose to model the probability that the mth probe is associated with the gth gene through a latent spatial probit regression. More specifically, we assume a prior (4) on $r_{gm}$ in which Φ indicates the c.d.f. of a standard normal distribution and Q defines a probe-level covariate (5) that quantifies the available information, with α0 and α1 being hyperparameters to be set. From equations (4) and (5), some major features of the novel prior can be recognized. First of all, the probability of selection at location m depends on the adjacent probes at positions {m − 1, m + 1}. In particular, the probability can either increase or decrease based on the selection status of the adjacent probes, that is, whether they are included or excluded from the model. Furthermore, the amount of increase or decrease depends on the relative distance between probes as well as the frequency of change points observed at each location. For comparison, it is worth noting that the probability of selection in Ref. 10 can only increase when either $r_{g(m-1)}$ or $r_{g(m+1)}$ is selected. Due to the different weighting we propose, our model ensures better false discovery control. In addition, CNVs located in regions of persistent states of aberration are more likely to be jointly associated with the expression levels of a gene, and this effect is more likely with increased proximity of the CGH probes. 
Within the literature on Bayesian variable selection in linear regression models, there is growing interest in probit-like priors of type (4) as a convenient way to incorporate external information to guide the selection of the predictors.17 The advantage of this novel formulation lies also in the interpretability of the parameters. The parameter α0 represents a baseline intercept that can be directly set according to an a priori specified “level of significance” when there are no other covariates. For example, setting α0 = 3 and lacking any other covariate information, the probability of selection is 1 − Φ(3) ≈ 0.001 under the null distribution of no association (type 1 error). Similarly, α1 is immediately interpretable as the regression coefficient that captures the strength of the association between adjacent probes. As a consequence, the probe-specific quantity $s_{(m-1)}$ has a more direct effect on the probability of selection at location m. In particular, if $s_{(m-1)} = s_{(m+1)} = 0$, prior (4) conveniently reduces to a Bernoulli distribution with success probability 1 − Φ(α0), the independent prior that is commonly used in Bayesian variable selection. Finally, the use of a spatial probit regression allows for the possibility of including further covariate information, if available, which can potentially drive the selection of relevant associations, eg, type of cancer, disease stage, probe methylation status, etc. The flexibility and ease of interpretation of the prior (4) and (5) result in simpler prior elicitation as well as improved performance with respect to previous proposals (eg, Ref. 10), as shown in Section 3.
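
The interpretation of α0 and α1 can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: the form of the neighbour covariate Q and the way it enters the probit link are assumptions made here for illustration (the function and argument names are hypothetical). Only the baseline case, in which the prior reduces to 1 − Φ(α0), is taken directly from the text.

```python
import math


def std_normal_cdf(x: float) -> float:
    """Standard normal c.d.f., computed from the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def selection_prob(alpha0, alpha1, s_prev, s_next, r_prev, r_next):
    """Prior probability that a probe is selected, given its two neighbours.

    ASSUMPTION (illustrative, not from the paper): the neighbour covariate is
    Q = s_prev * (2 * r_prev - 1) + s_next * (2 * r_next - 1), so a selected
    neighbour contributes +s and an excluded neighbour contributes -s, and the
    selection probability is 1 - Phi(alpha0 - alpha1 * Q).
    """
    q = s_prev * (2 * r_prev - 1) + s_next * (2 * r_next - 1)
    return 1.0 - std_normal_cdf(alpha0 - alpha1 * q)


# Baseline: no spatial information, so the prior is 1 - Phi(alpha0).
baseline = selection_prob(3.0, 1.0, 0.0, 0.0, 0, 0)
# Both neighbours selected versus both excluded, with weights s = 0.8.
both_in = selection_prob(3.0, 1.0, 0.8, 0.8, 1, 1)
both_out = selection_prob(3.0, 1.0, 0.8, 0.8, 0, 0)
print(f"baseline={baseline:.5f} both_in={both_in:.5f} both_out={both_out:.7f}")
```

Under this assumed form, selected neighbours raise the prior probability above the baseline and excluded neighbours lower it, which matches the qualitative behaviour described above. With α0 = 2.32 the baseline is close to 0.01.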

Modeling copy number aberrations in CGH data via HMM

The model on the CGH data in (1) is defined in terms of the emission probabilities of an underlying HMM. This choice is supported by the typically persistent state observed in copy number data, meaning that copy number losses or gains at a region are often associated with an increased probability of gains and losses at neighboring regions.3,18–20 We use a four-state HMM and assume that, conditional on the latent states, the CGH intensities are independent and normally distributed, with state-specific means and variances as

$$X_{im} \mid (\xi_{im} = j) \sim N(\eta_j, \sigma_j^2), \qquad (6)$$

where $\eta_j$ and $\sigma_j^2$, respectively, represent the expected log2 ratio and the variance for CGH probes in state j (j = 1,…, 4).19 We assume truncated normal priors for the state-specific means and gamma priors for the state-specific variance parameters. A first-order Markov model captures the dependence between states in adjacent probes as

$$P(\xi_{i(m+1)} = j \mid \xi_{im} = h) = a_{hj},$$

with $A = (a_{hj})$ being the matrix of transition probabilities with strictly positive elements (h, j = 1,…, 4) and stationary distribution, π. The initial state probabilities are also assumed to be given by π. We assume that the rows of A are independent, each following a Dirichlet distribution, Dir($a_1$, $a_2$, $a_3$, $a_4$), for some $a_j$ > 0, j = 1,…, 4.
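
To make the generative structure concrete, here is a minimal sketch of a four-state HMM with Gaussian emissions for a single sample. The emission means and standard deviations are the values reported for the simulation study in Section 3; the transition matrix and the initial distribution are hypothetical placeholders chosen for illustration and are not taken from the paper (in particular, the initial distribution is not the stationary distribution of the matrix).

```python
import random

# Emission parameters per state (1..4), as used in the simulation study.
ETA = [-0.65, 0.0, 0.65, 1.5]
SIGMA = [0.1, 0.1, 0.1, 0.2]

# HYPOTHETICAL transition matrix (each row sums to 1); not from the paper.
A = [
    [0.90, 0.08, 0.01, 0.01],
    [0.02, 0.94, 0.03, 0.01],
    [0.01, 0.08, 0.90, 0.01],
    [0.01, 0.08, 0.01, 0.90],
]

# HYPOTHETICAL initial distribution standing in for the stationary one.
INIT = [0.05, 0.85, 0.07, 0.03]


def simulate_sample(n_probes, rng):
    """Draw latent states (coded 1..4) and CGH log2 ratios along one sample."""
    states, intensities = [], []
    probs = INIT
    for _ in range(n_probes):
        j = rng.choices(range(4), weights=probs)[0]
        states.append(j + 1)
        intensities.append(rng.gauss(ETA[j], SIGMA[j]))
        probs = A[j]  # first-order Markov: next state depends on this one
    return states, intensities


rng = random.Random(0)
xi_row, x_row = simulate_sample(200, rng)
print(xi_row[:10], [round(v, 2) for v in x_row[:5]])
```

Because the diagonal of the placeholder matrix dominates, the simulated states appear in persistent runs, which is the behaviour that motivates the HMM.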

Posterior inference

For posterior inference, we rely on an MCMC stochastic search algorithm.10 Our primary interest lies in the estimation of the association matrix R and the matrix of copy number states ξ. Therefore, the remaining model parameters can be integrated out, both to simplify the sampler and to improve the mixing of the chain.13,14,21 Here, in particular, once we integrate out $\mu_g$, $\beta_g$, and $\sigma_g^2$, an MCMC algorithm can be designed as follows: (i) update R using a Metropolis algorithm by randomly selecting a subset of genes and proposing, for each gene, a change in its inclusion status by an add/delete/swap move; (ii) update ξ using a Metropolis–Hastings algorithm by randomly choosing a column and proposing new states for a subset of its elements using the current values of the transition matrix; (iii) update the emission distribution parameters, $\eta_j$ and $\sigma_j^2$, using Gibbs sampling; (iv) update the transition probability matrix, A, using a Metropolis algorithm. Metropolis–Hastings stochastic search algorithms of this type have been used extensively in the Bayesian variable selection literature.10–15 The update on R can be made more efficient by selecting at random a subset of the rows and then performing an add/delete or swap move for every row in the subset. Also, for the update on ξ, CGH probes called in copy-neutral states in more than n × p samples at the current MCMC iteration (with p set by the user) can be disregarded, since these would not be expected to be associated with changes in mRNA transcript abundance. Given the output of the MCMC, for each element of R, we can estimate its marginal posterior probability of inclusion (PPI), $p(r_{gm} = 1 \mid \text{data})$, as the proportion of iterations in which the element was set to 1. We can then select the most relevant associations by thresholding the PPIs based on some decision theoretic criterion. Finally, we can estimate each element of ξ as the modal state across the MCMC iterations.
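
The post-processing of the MCMC output can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the draws are toy values, and the thresholding rule shown is one possible decision criterion, namely the Bayesian false discovery rate that is also used in the simulation study, estimated here as the average of 1 − PPI over the selected elements.

```python
from collections import Counter


def ppi(indicator_draws):
    """Posterior probability of inclusion: share of kept draws equal to 1."""
    return sum(indicator_draws) / len(indicator_draws)


def modal_state(state_draws):
    """Copy number state visited most often across the kept draws."""
    return Counter(state_draws).most_common(1)[0][0]


def bayesian_fdr(ppis, threshold):
    """Estimated FDR among elements whose PPI is at least `threshold`."""
    selected = [p for p in ppis if p >= threshold]
    if not selected:
        return 0.0
    return sum(1.0 - p for p in selected) / len(selected)


def smallest_threshold(ppis, level=0.05):
    """Smallest observed PPI usable as a cut-off with estimated FDR <= level."""
    for t in sorted(set(ppis)):
        if bayesian_fdr(ppis, t) <= level:
            return t
    return None  # no cut-off achieves the requested level


# Toy draws for one element of R and one element of xi (after burn-in).
draws_r = [1, 0, 1, 1]
draws_xi = [2, 2, 3, 2, 1]
example_ppis = [0.99, 0.97, 0.90, 0.40, 0.05]
print(ppi(draws_r), modal_state(draws_xi), smallest_threshold(example_ppis))
```

Lowering the cut-off admits elements with smaller PPIs, so the estimated FDR can only increase as the threshold decreases; scanning the observed PPIs in ascending order therefore returns the most inclusive cut-off that still meets the target level.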

Applications

Simulation study

In this section, we assess the performance of our model on simulated data. For comparison purposes, we follow the simulation scheme in Ref. 10, which reflects the understanding that single copy number aberrations typically affect segments of DNA, and that neighboring chromosomal locations are expected to share similar copy number states. In addition, transitions to the normal diploid state are more likely than transitions between different states of copy number aberration. Accordingly, (i) we set M = 1000, G = 100, and n = 100; (ii) we initialize the matrix ξ with all elements set to 2; (iii) we select L < M columns at random in batches of adjacent columns and generate their values using a prespecified transition matrix; (iv) we randomly select half of the remaining columns and, for each column, generate 10% of its positions according to the same transition matrix; (v) we generate the elements of matrix X according to (6), fixing η1 = −0.65, η2 = 0, η3 = 0.65, η4 = 1.5 and σ1 = 0.1, σ2 = 0.1, σ3 = 0.1, σ4 = 0.2; (vi) we obtain the matrix of true associations, R, by selecting two clusters of 20 adjacent CGH probes among the L columns previously selected from X, and fix the corresponding values in R at 1, with all other elements of R set to 0; (vii) we generate the non-zero regression coefficients as $\beta_{gm} \sim N(0.5, 3^2)$; (viii) finally, we generate the gene expression levels as $Y_{ig} = \mu_g + \xi_i \beta_g + \epsilon_{ig}$, with $\mu_g \sim N(0, 0.1^2)$ and $\epsilon_{ig} \sim N(0, \sigma_\epsilon^2)$. We consider two different settings for the random error standard deviation: σ_ε ∈ {0.1, 0.5}. For setting the hyperparameters, we follow the general guidelines in similar regression models for the specification of the priors on the parameters $\mu_g$, $\beta_{gm}$, and $\sigma_g^2$ and on the HMM parameters $\eta_j$, $\sigma_j^2$, and $a_j$.13,14,19 For the hyperparameters of the probit prior (4), we set α0 = 3, which is equivalent to a prior probability of selection of approximately 0.001 when α1 = 0, and then perform a sensitivity analysis on the choice of α1. More specifically, we consider values of α1 in the set {0, 0.5, 1, 1.5, 2}. 
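
The last step of the simulation scheme, generating expression levels from the copy number states and the true association matrix, can be sketched as below. This is a toy illustration with hypothetical names and very small dimensions. The mean and standard deviation of the non-zero coefficients are left as arguments to be set according to the simulation design, and the intercept standard deviation of 0.1 is an assumed reading of that design.

```python
import random


def simulate_expression(xi, r, beta_mean, beta_sd, sigma_eps, rng):
    """Generate Y[i][g] = mu[g] + sum_m xi[i][m] * beta[g][m] + noise.

    xi: n x M matrix of copy number states; r: G x M binary association matrix.
    Coefficients are non-zero only where r[g][m] == 1.
    """
    n_genes, n_probes = len(r), len(r[0])
    mu = [rng.gauss(0.0, 0.1) for _ in range(n_genes)]  # assumed intercept sd
    beta = [
        [rng.gauss(beta_mean, beta_sd) if r[g][m] else 0.0 for m in range(n_probes)]
        for g in range(n_genes)
    ]
    y = [
        [
            mu[g]
            + sum(x_m * b_m for x_m, b_m in zip(row, beta[g]))
            + rng.gauss(0.0, sigma_eps)
            for g in range(n_genes)
        ]
        for row in xi
    ]
    return y, mu, beta


# Toy example: n = 2 samples, M = 3 probes, G = 2 genes.
xi_toy = [[2, 2, 3], [2, 1, 2]]
r_toy = [[0, 1, 0], [0, 0, 0]]  # gene 1 is associated with probe 2 only
rng = random.Random(1)
# sigma_eps = 0 removes the noise so the structure is easy to inspect.
y_toy, mu_toy, beta_toy = simulate_expression(xi_toy, r_toy, 0.5, 0.3, 0.0, rng)
print(y_toy)
```

With the noise switched off, the second gene, which has no associated probes, equals its intercept in every sample, while the first gene shifts with the copy number state of its single associated probe.
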
The results we report were obtained by running MCMC chains with 1,000,000 iterations and a burn-in of 500,000. On a 2.2 GHz dual-core Intel® Xeon® processor with 16 GB of memory, our code takes approximately 2 minutes to run 10,000 iterations. We begin the analysis of the simulation results by focusing on the inference on R. We compute an estimate of the Bayesian false discovery rate ($\mathrm{FDR}_B$) and use a threshold on the PPIs that controls the false discovery rate at the 0.05 level.22 In Table 1, we report the results in terms of specificity, sensitivity, FP counts, and false negative (FN) counts. Sensitivity is defined as the ratio of true positive (TP) counts over the total number of true connections, and specificity is defined as the ratio of true negative (TN) counts over the number of true missing connections. We also report the realized Bayesian q value, defined as $\min_{1-\mathrm{PPI} \le k} \mathrm{FDR}_B(k)$, and the Matthews correlation coefficient (MCC), calculated as

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.$$

Table 1

Simulated data: Results on specificity, sensitivity, number of false positives and false negatives, MCC, Bayesian q values, and number of detections obtained for an FDR threshold of 0.05.

σ_ε = 0.1
α₁ VALUE	α₁ = 0	α₁ = 0.5	α₁ = 1	α₁ = 1.5	α₁ = 2
Specificity	0.99999	1	1	0.99998	0.99992
Sensitivity	0.85	0.95	1	0.95	1
FP/FN	1/3	0/1	0/0	2/1	8/0
MCC	0.89596	0.97467	1	0.92709	0.84512
q-value	0.03624	0.01830	0.04425	0.02619	0.04445
# of detections	18	19	20	21	28
σ_ε = 0.5
α₁ VALUE	α₁ = 0	α₁ = 0.5	α₁ = 1	α₁ = 1.5	α₁ = 2
Specificity	0.99998	0.99999	0.99999	1	0.99992
Sensitivity	0.6	0.7	0.95	1	1
FP/FN	2/8	1/6	1/1	0/0	8/0
MCC	0.71709	0.80826	0.94999	1	0.84512
q-value	0.03198	0.04553	0.03101	0.02579	0.04444
# of detections	14	15	20	20	28
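
The MCC entries in Table 1 can be checked from the FP/FN counts alone. The sketch below assumes 20 true associations, a figure inferred here from the reported sensitivities and FN counts (for example, 3 false negatives at sensitivity 0.85), and G × M = 100 × 1000 candidate associations in total; the helper names are hypothetical.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from the four confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def counts(fp, fn, n_true=20, n_total=100 * 1000):
    """Recover (TP, TN, FP, FN) from FP and FN.

    ASSUMPTION: n_true = 20 true associations, inferred from Table 1.
    """
    tp = n_true - fn
    tn = n_total - n_true - fp
    return tp, tn, fp, fn


# FP/FN pairs for sigma_eps = 0.1, alpha1 = 0, 0.5, 1, 1.5, 2.
for fp, fn in [(1, 3), (0, 1), (0, 0), (2, 1), (8, 0)]:
    print(fp, fn, round(mcc(*counts(fp, fn)), 5))
```

Under these assumptions the recomputed values agree with the tabulated MCCs to about four decimal places, and the number of detections in each column equals TP + FP.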

The results show that values of α1 in the range [1, 1.5] lead to excellent performance, compared with both the independent prior (ie, α1 = 0) and higher values of α1, particularly for the larger value of σ_ε. Overall, our results are consistently better than those obtained with competing models.10 This can be seen from the low number of FP detections for most of the parameter values in the range considered. Also, for some values of α1, our prior (4) and (5) achieves perfect classification, with specificity and sensitivity equal to 1. To investigate the effect of the threshold on the PPIs on the selection results, in Figure 1, we report receiver operating characteristic (ROC)-type curves that display the FP counts versus the FN counts, calculated at a grid of equispaced thresholds in the interval [0.07, 1]. The plots show that our results are satisfactory across different thresholds. They also highlight the consistently worse performance of the independent prior.

Figure 1

Simulated data: The false positive (FP) and false negative (FN) counts obtained by considering different thresholds on the marginal probabilities, for (A) σε = 0.1 and (B) σε = 0.5. Threshold values are calculated as a grid of equispaced points in the range [0.07, 1].

As for the inference on ξ, Table 2 reports the misclassification counts and corresponding percent rates obtained by considering, for each element of ξ, the modal state attained at that genomic location over all MCMC iterations (after burn-in). The performance is consistently good. For the case of α1 = 0 and σ_ε = 0.1, our estimated state-specific means and standard deviations were consistent with the values used to simulate the data. We obtained similar estimates in all the other cases.

Table 2

Simulated data: Results on ξ as the number of misclassified copy number states for various values of α1.

# MISCLASSIFICATIONS (PERCENT)	α₁ = 0	α₁ = 0.5	α₁ = 1	α₁ = 1.5	α₁ = 2
Scenario with σ_ε = 0.1	60 (0.06%)	55 (0.055%)	62 (0.062%)	53 (0.053%)	56 (0.056%)
Scenario with σ_ε = 0.5	48 (0.048%)	54 (0.054%)	51 (0.051%)	49 (0.049%)	52 (0.052%)

Lung cancer study

We applied our Bayesian model to data from a study of lung squamous cell carcinoma, which we obtained from The Cancer Genome Atlas data portal (https://tcga-data.nci.nih.gov/tcga/). We used the level 2 (normalized signals) Agilent 415K array as the CGH data, and the Affymetrix HG-U133A array as the gene expression levels. We performed our analysis on the 131 samples that were available for both data types. We considered CGH probes belonging to chromosome 3, as it has been highly implicated in lung squamous cell carcinoma.23,24 We further reduced the complexity of the data by filtering out genes and CGH probes that had a relatively small coefficient of variation (smaller than 1.9 and 0.35, in absolute value, for genes and CGH probes, respectively). The resulting data consisted of G = 133 genes and M = 2,133 CGH probes. We ran our model using a setting similar to that adopted in the simulated example described in Section 3.1. The results we report below were obtained by setting α1 = 1 and α0 = 2.32 and by running the MCMC sampler with 500,000 iterations and a burn-in of 250,000. Figure 2 shows a heat-map of the highest PPIs of gene–CNV associations corresponding to the elements of the association matrix R with a PPI larger than 0.1. As expected, despite the large number of potential associations being investigated, few have relatively large PPIs. Figure 3 shows the estimated frequencies of copy number gains and losses for each of the 2,133 CGH probes considered in our analysis, as is commonly done in the literature.25–29 In the figure, single and multiple copy gains are considered together as copy number amplifications. The estimates of the state-specific means and variances were close to their theoretical values (results not shown). Based on Figure 3, we can identify 67 probes with high-frequency (>45%) amplification and 23 probes with high-frequency (>25%) deletion. 
Among the identified probes, there are 36 and 13 annotated genes for amplification and deletion, respectively. Interestingly, one of those genes (DVL3) shows both high-frequency deletion and amplification, and has been recently found to be involved in lung squamous cell carcinoma.30 Other genes detected by our method have been implicated in lung cancer, for example, EPHA6, CENTB2, and ZNF717 for high-frequency amplification, and PLD1 and ATP2C1 for high-frequency deletion.31–34

Figure 2

Case study: Heatmap of the highest PPIs of gene–CNV associations, selected by looking at the elements of the association matrix that have a PPI greater than 0.1.

Figure 3

Case study: Frequencies of estimated gains and losses among the 131 samples for the 2,133 CGH probes considered in our analysis. Red horizontal lines correspond to the 0.25 and 0.45 thresholds on the frequencies of deletion and amplification, respectively.

Our findings identify potential candidates of associations between CNVs and the transcriptional activity of target genes. In order to assess whether the identified associations have biological relevance, we performed GO analyses on the lists of selected target genes and CGH probes, by using the database for annotation visualization and integrated discovery (DAVID) tool.35 We report the detailed results of the analyses in the Supplementary Material. Figure 4 shows some of the results from the enrichment analysis of the list of selected target genes. More specifically, the upper box of the figure (labeled mRNA) reports the four most relevant molecular functions, together with the corresponding lists of target genes. In the lower box (labeled DNA), we report the lists of CGH probes that our model found to be associated with the target genes. The estimated associations between target genes and CNVs are marked by solid lines; whereas probes appearing in multiple lists are indicated by dashed lines. In Figure 5, we report similar summaries from the gene enrichment analysis of the selected CGH probes. Specifically, in this figure, the upper box shows the molecular functions enriched in the list of CGH probes, and the lower box reports the list of target genes that our model found to be associated with the CGH probes.

Figure 4

Case study: Schematic representation of a GO analysis of the target genes identified by our model, via thresholding the posterior probabilities of inclusion. The upper box (labeled mRNA) shows the four most enriched molecular functions together with the corresponding lists of target genes. The lower box (labeled DNA) reports the lists of CGH probes that our model found to be associated with the target genes.

Notes: The solid connecting lines (––––) indicate estimated associations between target genes and CNVs; dashed lines (– – –) indicate probes that appear in multiple lists.

Figure 5

Case study: Schematic representation of a GO analysis of the CGH probes identified by our model, via thresholding the posterior probabilities of inclusion. The upper box (labeled DNA) shows the four most enriched molecular functions together with the corresponding lists of CGH probes. The lower box (labeled mRNA) reports the lists of target genes that our model found to be associated with the CGH probes.

Notes: Solid connecting lines (––––) indicate the estimated associations between target genes and CNVs; dashed lines (– – –) indicate probes that appear in multiple lists.

The results from the GO analyses highlight the enrichment of genes that code for proteins with binding function, cell surface binding, or an extracellular matrix constituent in the selected target genes (Fig. 4), and the enrichment of genes that code for proteins in the signal transduction machinery, mainly with kinase activity, in the selected CGH probes (Fig. 5). In both cases, we identified genes as members of the ephrin family or NTRK, which have been shown to be altered in another study on lung adenocarcinoma.36 Ephrin receptors have been shown to have an important role in tumor growth and progression in many cancers, including lung carcinoma.37 Another relevant protein from the GO analyses is PIK3CB, phosphatidylinositol-4,5-bisphosphate 3-kinase. The PI3K/AKT1 pathway has been shown to be altered in many cancer types, and often correlates with a more aggressive form of disease.38–41 We also found proteins of the matrix metalloproteinase family, which are often involved in the induction and promotion of cancer cell migration (MMP10 and ADAM23).42,43 Combining this observation with the finding of alterations in genes that code for members of the fibrinogen family (FGA, FGB, and FGG), and in genes that code for proteins with surface- and matrix-binding properties, we may hypothesize a dysregulation of pathways involved in the acquisition of a migratory phenotype. Extracellular matrix remodeling plays an important role in cancer progression since it can facilitate the migration and invasion of tumor cells. The genes we found to be altered may play an important role in this context. Such a hypothesis is interesting, but will require further experimental investigation. 
Similar findings exist in the general literature on lung cancer in both human and mouse studies.44–49 Finally, many of the genes we identified have been reported in the literature on lung cancer, for example ASCL1, HLA-DQA1, and PROM1 among the gene expression probes, and EPHA3, PRKCI, and EPHB1 among the identified CGH probes.36,50–54

Conclusions

In this study, we have considered a recently developed Bayesian hierarchical framework for the integration of gene expression levels with CGH array measurements, collected on the same subjects. The proposed measurement error model relates the gene expression levels to latent copy number states which, in turn, are related to the observed surrogate CGH measurements via an HMM. We have investigated an alternative formulation of the spatial variable selection prior for the gene–CGH associations. Our prior exploits dependencies across adjacent DNA segments and allows for increased modeling flexibility, which has been shown to result in easily interpretable model parameters for the purpose of prior elicitation, as well as improved performance and false discovery control on simulated data. Our HMM considers four copy number aberration states, as commonly encountered in the literature.19,55,56 Once the HMM states are appropriately defined, our model can easily accommodate an additional state for the loss of both copies.3,18 We have presented an application to data from a genomic study on lung squamous cell carcinoma. Our model has identified potential candidates of associations between CNVs and the transcriptional activity of target genes. We have assessed the biological relevance of our findings through GO analyses. These have revealed enrichments in genes that code for proteins involved in cancer, such as those of the ephrin family, phosphatidylinositol-4,5-bisphosphate 3-kinase and matrix metalloproteinase family. Among these, some are already known to be involved in lung squamous cell carcinoma, while others are interesting potential candidates for further experimental validation. The approach we present can be extended to the analysis of RNA-Seq gene expression values. 
In order to appropriately take into account the nature of such data, the priors and the algorithm for posterior inference will need to be modified to accommodate the count data and a Poisson regression model. This represents an interesting avenue for future work. Supplementary Material: GO analyses. The supplementary material shows details on the GO analyses described in the paper.

