Literature DB >> 26646822

Integration Analysis of Three Omics Data Using Penalized Regression Methods: An Application to Bladder Cancer.

Silvia Pineda1,2, Francisco X Real3,4, Manolis Kogevinas5, Alfredo Carrato6, Stephen J Chanock7, Núria Malats1, Kristel Van Steen2,8.   

Abstract

Omics data integration is becoming necessary to investigate the genomic mechanisms involved in complex diseases. During the integration process, many challenges arise such as data heterogeneity, the smaller number of individuals in comparison to the number of parameters, multicollinearity, and interpretation and validation of results due to their complexity and lack of knowledge about biological processes. To overcome some of these issues, innovative statistical approaches are being developed. In this work, we propose a permutation-based method to concomitantly assess significance and correct by multiple testing with the MaxT algorithm. This was applied with penalized regression methods (LASSO and ENET) when exploring relationships between common genetic variants, DNA methylation and gene expression measured in bladder tumor samples. The overall analysis flow consisted of three steps: (1) SNPs/CpGs were selected per each gene probe within 1Mb window upstream and downstream the gene; (2) LASSO and ENET were applied to assess the association between each expression probe and the selected SNPs/CpGs in three multivariable models (SNP, CPG, and Global models, the latter integrating SNPs and CPGs); and (3) the significance of each model was assessed using the permutation-based MaxT method. We identified 48 genes whose expression levels were significantly associated with both SNPs and CPGs. Importantly, 36 (75%) of them were replicated in an independent data set (TCGA) and the performance of the proposed method was checked with a simulation study. We further support our results with a biological interpretation based on an enrichment analysis. The approach we propose allows reducing computational time and is flexible and easy to implement when analyzing several types of omics data. Our results highlight the importance of integrating omics data by applying appropriate statistical strategies to discover new insights into the complex genetic mechanisms involved in disease conditions.

Entities:  

Mesh:

Substances:

Year:  2015        PMID: 26646822      PMCID: PMC4672920          DOI: 10.1371/journal.pgen.1005689

Source DB:  PubMed          Journal:  PLoS Genet        ISSN: 1553-7390            Impact factor:   5.917


Introduction

Integrating different omics data types, such as genomics, epigenomics and transcriptomics, may provide a new strategy to discover unknown genomic mechanisms involved in complex diseases [1-3]. In cancer, tumor initiation and progression are the consequence of alterations in multiple pathways and biological processes including gene mutations, epigenetic changes, modifications in gene regulation, and environmental influences. In the process to integrate all of this information many challenges arise, among them the high dimensionality of data—since >2 omics data sets with millions of measurements are available from the same set of individuals—and the huge heterogeneity of omics data due to the different measurement scales [4]. Besides that, the data might be highly correlated, i.e. Single Nucleotide Polymorphisms (SNPs) that are in high linkage disequilibrium (LD) block or DNA CpG sites that belong to the same CpG island, contributing to multicollinearity in the analysis. Another challenge in omics data integration regards to the very small number of individuals in comparison to the number of parameters (“n << p”). In addition, interpretation and validation of omics derived results require of resources that are still lacking at present. In this rapidly evolving scenario, advanced methodological techniques are continuously emerging, demanding the development of improved data analysis tools [5-7]. Integrative omics analysis refers to the combination of at least two different types of omics data. Relationships between two sets of omics parameters such as the expression quantitative trait loci (eQTL) [2,8,9] or the methylation-QTL (methQTL) [3,10,11], have been recently reported. The approach most commonly used for this type of pairwise analysis has been univariate models (i.e., Spearman/Pearson correlation or linear regression models), assuming that the changes in gene expression levels are only affected by one parameter. Until present, the combination of >2 omics data has been less explored. Towards this end, the previously mentioned challenges are magnified and there is a lack of advanced methodologies to deal with them. Recently, we published an integrative framework as a first approach to integrate genomics, epigenomics, and transcriptomics in individuals with urothelial bladder cancer (UBC) [12]. In that work, we found that some gene expressions were co-regulated by both DNA methylation and genetic variants, both acting together in trans relationships. Therefore, the integration of multiple types of omics data by applying multivariable approaches becomes essential to understand the intricacy of the genomic mechanisms behind complex diseases and to overcome the abovementioned challenges. In this regard, previous developments are Principal Component Analysis (PCA), to reduce data dimensionality, or Canonical Correlation Analysis (CCA) to investigate the overall correlation between two sets of variables. However, these methods are descriptive or exploratory techniques rather than hypothesis-testing tools. While some statistical applications have been developed in an omics integrative framework (sparse canonical correlation analysis [13], multiple factor analysis [14], or multivariate partial least square regression [15]), none of them offers the possibility to combine >2 omics data together in the same model. The Least Absolute Shrinkage and Selection Operator (LASSO) proposed by Tibshirani in 1996 [16] and the Elastic Net (ENET) proposed by Hui Zou and Trevor Hastie in 2005 [17] are penalized regression methods that, after appropriate standardization, can model more than one type of omics data, face multicollinearity issues, and mitigate the “n << p” problem. More importantly, both methods simultaneously execute variable selection and parameter estimation, thus reducing the computation time, while the traditional methods work on the two problems separately, first selecting the relevant parameters and then computing the estimates. LASSO and ENET have already been applied to GWAS studies [18-20] as well as in the context of integrative studies [21]. One limitation of penalized regression techniques is that the penalty produces biased estimators; consequently, standard errors are not meaningful and cannot provide p-values to assess significance. Here, we propose a permutation-based approach to assess significance and we combine it with a correction for Multiple Testing (MT) using the MaxT algorithm [22]. We apply this permutation-based MaxT method with LASSO and ENET to identify relationships between common genetic variation, DNA methylation, and gene expression, all determined in UBC tumor samples. Specifically, we first built a two omics integrative model associating SNPs or CpGs with gene expression levels and, then, we integrated the three omics data to assess whether changes in gene expression levels could be confounded/modified by genetic variants and/or DNA methylation.

Material and Methods

Penalized regression methods

LASSO and ENET penalized regression methods are applied to high-dimensional problems with a large number of parameters. The penalization produces a shrinkage of the regression coefficients towards zero given a sparse model reducing the irrelevant parameters. Both methods deal with highly correlated variables though in a different way. LASSO tends to select one variable from a group of correlated features whereas ENET selects the whole group of variables, when evidence for their relevance exists. The shrunk estimators introduce a bias while reducing the variance resulting in a better precision and accuracy model and, therefore, increasing its statistical power.

Definition of the methods

Consider the standard linear regression model where y = (y 1, …y ) is the response variable and x = (x 1, …x ) j = 1, …p are the standardized predictors, the LASSO solves the l penalized regression problem, the Ridge regression [23] solves the l penalized regression problem and the ENET is the combination between the l and l penalized regression problem. For the LASSO and ENET estimates ; are defined by with the restrictions: Here, t ≥ 0 is the tuning parameter that controls the amount of shrinkage that is applied to the estimates. For the un-penalized least squares estimate, . Values of t < t 0 will lead to shrinkage towards 0; some coefficients may be exactly equal to 0. Using the Lagrangian form, this optimization problem is equivalent to (LASSO): where λ is the penalty parameter related to t. To obtain the optimal penalty, k-fold cross validation (CV) was applied [24] maximizing the penalized log-likelihood function. (ENET): where λ1, λ2 are the penalty parameters related to t. In this sense, ENET can be viewed as a penalized least squares method. With α = λ 2/(λ 1 + λ 2), solving in Eq (4) is equivalent to the following optimization problem: This expression involves a convex combination of the LASSO and ridge penalty. When α = 1 the ENET becomes ridge regression and when α =0 the ENET becomes LASSO. To obtain the optimal penalty (λ), k-fold CV selecting the best α was applied. This value was obtained using a vector of α ϵ(0.01, 0.99) by 0.01. The LASSO and ENET methods described above were applied to our data with the R package glmnet, that relies on cyclical coordinate descent, computed along a regularization path [25]. To avoid small sample size limitations in variable selection while not introducing an important bias k = 5 was used in the k-fold CV. These methods are promising in the context of high-throughput data but one of their drawbacks is that they do not provide p-values to assess statistical significance of relationships, nor give a formal assessment of the overall goodness-of-fit. Therefore, a permutation based strategy was adopted to assess significance of discovered relationships combined with a MT correction approach (MaxT algorithm [22]) building upon the statistical concept of deviance. The deviance is used to compare two models and in this case we defined it as Here loglik is the loglikelihood function, full refers to the model with the parameters selected by LASSO or ENET, and null is the model with only the intercept estimated. Thus, the interpretation would be, the higher the deviance the better the model.

Permutation-based MaxT method

MaxT algorithm of Westfall & Young [22] is a step-down FWER-controlling MT procedure. The method uses the raw p-values or directly the statistics as explained in [26]. Using this approach, the permutation needed to obtain the p-values was combined with the one needed to apply the MaxT algorithm saving computational time. In this work, we used the deviance obtained per each of the permuted LASSO/ENET model to compute the MaxT algorithm and individuals within gene expression measure were permuted, that is the dependent variable in the models. The algorithm is explained in Box 1.

Box 1. Permutation-based MaxT method

From the original data, order the deviance obtained per each observed statistics: For the bth permutation, b = 1…B Permute the n individuals of each of the vectors Y = (y 1, … y ) Compute the statistics D 1 D Compute the U = max|D|, the successive step-down procedure is: U = |D| The steps are repeated B times and the adjusted p-values are estimated by:

Discovery phase: The Spanish Bladder Cancer/EPICURO Study

70 patients with a histologically confirmed UBC were recruited in 2 hospitals during 1997–1998 as part of the pilot phase of the Spanish Bladder Cancer/EPICURO Study. According to established criteria based on tumor stage and grade for UBC, the tumors were classified as low-grade non-muscle invasive, high-grade non-muscle invasive, and muscle invasive. Three sets of omics data were obtained using fresh tumor tissue, including common genetic variation (GSE51641), DNA methylation (GSE71666), and gene expression (GSE71576). The three omics data overlapped in 27 individuals that are included in this study and comprise 44% low-grade non-muscle invasive tumors, 30% high-grade non-muscle invasive tumors and 26% muscle invasive tumors. S1 Table shows the IDs of the 27 samples used in the following analysis. The local ethics committee of the participating centers approved the study and written informed consent was obtained from all participants at the time of recruitment. Genotyping of tumor samples was performed using Illumina HumanHap 1M array. A total of 1,047,101 SNPs were determined in 46 individuals and, after the standard quality control and filter the SNPs that were in perfect LD (r 2 = 1), they resulted in 567,513 SNPs. The application of multivariable models required no missing values, so genotypes were imputed with BEAGLE 3.0 method [27]. CpG methylation data was generated using the Infinium Human Methylation 27 BeadChip Kit. At each CpG site, the methylation levels were measured with M-values using the log2 transformation of the β-values since they are more statistically valid due to a better approximation of the homoscedasticity. The initial number of CpGs in the studied array was 27,578 and after background normalization and QC, a total number of 23,034 CpGs were left for analysis. Gene expression data were obtained from 44 tumor samples using the Affymetrix DNA Microarray Human Gene 1.0 ST Array with 32,321 probes. After the application of QC, it resulted in 20,899 probes determined in 37 individuals. Further details about the preprocessing of the data and the quality control applied can be found elsewhere [12]. The three measures were annotated using the UCSC hg19, NCBI build 37 to make them comparable and homogenize their position in the genome.

Simulation study

To generate a simulation sample, the association between SNPs and/or CpGs with gene expression was broken and therefore no significant results should be observed. To do that, 10-gene expression probes were randomly selected from our discovery sample showing no correlation structure between the probes and following a multivariate normal distribution. Then, the mean (μ = 8.4) and variance (σ2 = 0.4) of all the probes together were obtained. Finally, a simulated set of gene expression probes was generated using the normal distribution obtained and considering the same sample size of the discovery phase (p = 20,899 probes and N = 27 individuals).

Replication phase: The Cancer Genome Atlas (TCGA)

UBC tumor data were obtained from The Cancer Genome Atlas (TCGA) consortium (https://tcga-data.nci.nih.gov/tcga/) to replicate our findings. Data was downloaded and processed with the TCGA-Assembler [28]. The study included only individuals with muscle invasive UBC and the tumors were profiled with genome wide 6.0 Affymetrix, RNASeqV2, and HumanMethylation450K Illumina arrays yielding data for 20,502 gene expression probes, 905,422 SNPs, and 350,271 CpGs. The total number of individuals with overlapping data from the three platforms was 238 and they were used in the replication phase of this contribution. S2 Table shows the IDs corresponding to these 238 samples.

Overall analysis flow

Penalized regression methods LASSO and ENET were applied to the discovery data in combination with the proposed permutation-based MaxT method to select the SNPs and/or CpGs associated with gene expression levels in the following multivariable models: To apply this integrative idea to our set of data the following steps were performed: (1) SNPs and CpGs that were in a 1MB window upstream and downstream were selected from each probe in the gene expression array; (2) LASSO and ENET were applied to each probe and model (SNP, CpG, and Global models) obtaining the deviance per model; and (3), the permutation-based MaxT method was applied to obtain the adjusted p-values (B = 100 permutations and significant adjusted p-value < 0.1). The scenario and workflow is represented in Fig 1.
Fig 1

Scenario and workflow of the overall analysis implemented.

The integrative framework proposed is based on three steps. Step 1 corresponds to the selection of SNPs and CpGs in 1MB window upstream and downstream from each probe in the gene expression array. Step 2 corresponds to the application of LASSO and ENET to each probe obtaining the deviance per probe. Step 3 corresponds to the permutation-based MaxT method application where gene expression levels within the individuals are permuted B = 100 times obtaining the deviance per probe.

Scenario and workflow of the overall analysis implemented.

The integrative framework proposed is based on three steps. Step 1 corresponds to the selection of SNPs and CpGs in 1MB window upstream and downstream from each probe in the gene expression array. Step 2 corresponds to the application of LASSO and ENET to each probe obtaining the deviance per probe. Step 3 corresponds to the permutation-based MaxT method application where gene expression levels within the individuals are permuted B = 100 times obtaining the deviance per probe. Subsequently, this analysis flow was applied to the simulated data set using the same criteria. In the replication scenario, we aimed at determining whether the genes that were significant in the discovery phase were also significant in the replication dataset. Therefore, the analysis was restricted to the genes found to be significant in the discovery phase considering all models (SNP, CPG and/or Global) and methods (LASSO and/or ENET). Following the pipeline shown in Fig 1, we focused on the significant genes found in the discovery phase and SNPs and CpGs were selected in 1MB window from the TCGA database, even if the SNPs and CpGs were not the same as those analyzed in the discovery phase. Second, LASSO and/or ENET were conducted to SNP, CPG, and/or Global models. Finally, the permutation-based MaxT method was applied to obtain significance and correct for multiple testing. The replication analysis was performed with the same software and criteria as in the discovery analysis.

Gene enrichment analysis

To provide a biological interpretation to the results, the entire list of the significant genes identified in the discovery phase by both LASSO and ENET, and by the three models, was used to perform a gene enrichment analysis with the bioinformatics tool DAVID [29,30]. The functional annotation clustering analysis module offered by DAVID was used. The gene term annotation is based on 14 annotation categories (Gene Ontology (GO), Biological process, GO Molecular Function, GO Cellular Component, KEGG Pathways, BioCarta Pathways, Swiss-Prot Keywords, BBID Pathways, SMART Domains, NIH Genetics Association DB, UniProt Sequence Features, COG/KOG Ontology, NCBI OMIM, InterPro Domains, and PIR Super-Family Names) collected in the DAVID tool knowledgebase (https://david.ncifcrf.gov/knowledgebase/DAVID_knowledgebase.html). The method identifies related genes by measuring the similarity of their global annotation profiles. So, the “grouping term” is based on the idea that two genes that have similar annotation profiles are functionally related. Each group term provides an enrichment score (ES) that indicates biological significance when ≥1.3 (equivalent to non-log scale 0.05). DAVID also provides a p-value to examine the significance of gene-term enrichment, which is corrected by Benjamini MT [31].

Results

Discovery phase

LASSO and ENET were applied to 20,899 gene expression probes in each of the three models. Under the conditions mentioned above, LASSO yielded 9 genes with a significant signal in the SNP model, 19 in the CpG model, and 23 in the Global model. Table 1 shows the significant genes, mapped to each probe, with its deviance and p-value. Fig 2A–2C display all the probes analyzed with their deviances represented across the genome. Detailed information about the SNPs and/or CpGs mapped to these genes is provided as Supplementary Material (S1–S6 Excel). ENET identified a lower number of significant genes: 11 in the SNP model, 6 in the CpG model, and 4 in the Global model. These results are shown in Table 2 and Fig 2D–2F. When the MT correction threshold was relaxed, ENET provided additional significant genes.
Table 1

Statistically significant genes associated with SNPs and/or CpGs selected by LASSO&Permuted based maxT algorithm.

Gene NameChromosomeModelDeviancep-value 1
AIM2 1SNPs55.80.1
CpGs61.50.06
PLA2G2A 1CpGs71.40.01
S100A9 1CpGs53.70.03
SNPs + CpGs52.40.08
HMGCS2 1CpGs53.30.02
PIGR 1CpGs75.8< 0.01
CTSE 1CpGs60.70.06
S100A2 1SNPs + CpGs58.70.04
CP 3CpGs51.10.02
TMEM45A 3SNPs + CpGs57.30.08
IGJ 4CpGs58.40.03
SNPs + CpGs59.00.09
UBD 6SNPs + CpGs75.00.07
TRIM31 6SNPs + CpGs47.10.1
PTN 7SNPs67.00.08
SNPs + CpGs92.0< 0.01
ARHGEF35 7SNPs + CpGs49.60.09
CRH 8SNPs + CpGs56.70.1
CRTAC1 10SNPs66.20.03
MSMB 10CpGs67.30.06
CRTAC1 10SNPs60.90.1
SNPs + CpGs
TNNT3 11CpGs44.90.09
SAA1 11SNPs + CpGs127.80.04
SCCN1A 12SNPs57.90.08
CpGs58.80.03
KRT5 12CpGs58.20.03
TSPAN8 12SNPs + CpGs67.20.05
MYBPC1 12SNPs + CpGs74.50.08
SLC38A4 12SNPs + CpGs51.70.08
GTSF1 12SNPs + CpGs46.70.1
OLFM4 13CpGs60.00.06
FREM2 13CpGs46.00.06
SNPs + CpGs70.20.06
IGHD 14SNPs + CpGs59.40.1
C15orf48 15CpGs49.90.02
SNPs + CpGs83.70.05
CAPNS2 16SNPs + CpGs54.90.07
KRT20 17CpGs48.40.05
SNPs + CpGs93.7< 0.01
KRT13 17CpGs53.60.02
SERPINB4 18SNPs98.4< 0.01
SNPs + CpGs68.50.03
SERPINB3 18SNPs171.6< 0.01
SNPs + CpGs162.7< 0.01
CEACAM7 19CpGs76.0< 0.01
CEACAM6 19SNPs79.60.01
CXCL17 19SNPs + CpGs46.80.1
CLIC6 21SNPs75.30.01
CpGs45.10.09
SNPs + CpGs75.30.07
GSTT1 22SNPs40.40.07

1The p-value was obtained after applying the permuted based–maxT algorithm and was therefore corrected for MT.

Fig 2

Deviance across the genome when applying LASSO and ENET to select SNPs, CpGs or both (Global model).

The dots in the figure indicate the deviance of each gene located in the corresponding position in the genome. There are a total of 20,899 gene expression probes measured. Significant genes after applying the permutation-based MaxT method are tagged. The figures represent the deviance per gene expression probe using LASSO for the SNP model (A), the CpG model (B) and the Global model (C) and using ENET for the SNP model (D), the CpG model (E) and the Global model (F).

Table 2

Statistically significant genes associated with SNPs and/or CpGs selected by ENET&Permuted based maxT algorithm.

Gene NameChromosomeModelDeviancep-value 1
REN 1CPG84.30.03
CRABP2 1CPG65.20.09
ANXA10 4SNP137.00.01
PTN 7SNP97.20.07
SNP +CPG102.50.09
MSMB 10SNP91.80.07
CPG78.90.06
MMP7 11SNP94.80.06
TCN1 11SNP88.90.07
IGF2 11SNP101.60.05
CPG92.10.04
MMP7 11SNP + CPG99.40.08
GTSF1 12SNP109.60.05
IGHD 14SNP + CPG97.50.1
SERPINB4 18SNP105.20.04
SERPINB3 18SNP171.60.02
SNP + CPG171.30.01
CEACAM6 19SNP108.40.03
NRLP2 19CPG84.20.04
CEACAM5 19CPG92.10.06
IGLJ3 22SNP97.70.05

1The p-value was obtained after applying the permuted based–maxT algorithm and corrected by MT.

Deviance across the genome when applying LASSO and ENET to select SNPs, CpGs or both (Global model).

The dots in the figure indicate the deviance of each gene located in the corresponding position in the genome. There are a total of 20,899 gene expression probes measured. Significant genes after applying the permutation-based MaxT method are tagged. The figures represent the deviance per gene expression probe using LASSO for the SNP model (A), the CpG model (B) and the Global model (C) and using ENET for the SNP model (D), the CpG model (E) and the Global model (F). 1The p-value was obtained after applying the permuted based–maxT algorithm and was therefore corrected for MT. 1The p-value was obtained after applying the permuted based–maxT algorithm and corrected by MT. Some genes overlapped among methods and models: CLIC6 was identified by the three LASSO models; AIM2 and SCNN1A came out in the SNP and CpG models; PTN, CRTAC1, SERPINB3 and SERPINB4 were identified in the SNP and Global models; and S100A9, IGJ, FREM2, C15orf48 and KRT20 emerged in the CpG and Global models. Interestingly, 15 genes showed significance in the Global model when combining 3 omics data while they were not detected when analyzing only 2 types of omics data. The overlap of genes identified by the ENET model was lower: MSMB and IGF2 were identified by the SNP and CpG models, and PTN and SERPINB3 were selected by the SNP and the Global models. When comparing the methods, an overlap between LASSO and ENET was found for four (PTN, SERPINB3, SERPINB4 and CEACAM6), one (MSMB), and three (SERPINB3, PTN and IGHD) significant genes in the SNP, CpG, and Global models, respectively. These results are displayed in Fig 3 using Venn diagrams. In the simulation study, as expected, no gene was significantly associated with any of the two methods and the three models. An example of the deviances of each gene for the SNP 303 model and LASSO method is shown in S1 Fig.
Fig 3

Venn diagrams showing the overlap between the significant genes compared by the two methods (LASSO and ENET) and models (SNPs, CpGs and Global).

(A) Number of significant genes using the LASSO method for the three models (SNP, CPG, and Global); (B) number of significant genes using the ENET method for the three models (SNP, CPG and Global); and (C) number of significant genes per model comparing the two methods (LASSO and ENET).

Venn diagrams showing the overlap between the significant genes compared by the two methods (LASSO and ENET) and models (SNPs, CpGs and Global).

(A) Number of significant genes using the LASSO method for the three models (SNP, CPG, and Global); (B) number of significant genes using the ENET method for the three models (SNP, CPG and Global); and (C) number of significant genes per model comparing the two methods (LASSO and ENET).

Replication phase

The replication study was restricted to those genes (n = 48) that showed significant results in the discovery phase and we applied the same models, methods, and criteria of analysis to the TCGA data. Overall, we were able to replicate 75% of the results: 36 out of the 48 genes yielded a significant association at least in one of the models considered. Regarding the LASSO models, we replicated 3/9 genes from the SNP models, 17/19 genes from the CPG models, and 19/23 genes from the Global models (Table 3). Regarding ENET, we replicated 3/10 genes from the SNP model, 3/6 genes from the CPG model, and 3/3 genes from the Global model (Table 4).
Table 3

Significant genes obtained by LASSO&Permuted based maxT algorithm for the three models (SNP, CPG, and Global) in the original dataset (EPICURO Study) and the replication dataset (TCGA).

Original Data (EPICURO)Validation Data (TCGA)
GeneprobesetChrStartendDevp-valueSNPs (N)CpGs (N)Devp-value 1 SNPs (N)CpGs (N)SNPs (overlap)SNPs (rep)CpGs (overlap)CpGs (rep)
SNP model SERPINB3 8023696186132243361329197171.6<0.012901030
SERPINB4 802368818613044956131150298.4<0.011501020
CEACAM6 802909819422593984227611379.60.011001000
CLIC6 806838321360416883609051975.30.01303.5E-080.91140
CRTAC1 7935535 10 99624758 99790585 66.2 0.03 18 2.4E+09 0.001 12 4 1 (LD)
GSTT1 8074980 22 24376141 24384284 40.4 0.07 16 8.3E+07 <0.001 34 4 1 (LD)
PTN 8143144713691209213702854667.00.08901010
SCNN1A 7960529126456011648652357.90.082601080
AIM2 7921434 1 159032275 159046647 55.8 0.1 6 5.7E+05 0.03 1 2 0
CPG model CEACAM7 8037053 19 42177235 42192096 76.0 < 0.01 19 1.5E+07 < 0.001 2 17 0
PIGR 7923929 1 207101869 207119811 75.8 < 0.01 21 6.4E+08 0.001 19 18 1
PLA2G2A 7913216 1 20301925 20306932 71.4 0.01 10 5.2E+09 < 0.001 57 9 0
CP 8091385 3 148890292 148939832 51.1 0.02 3 1.6E+09 < 0.001 24 1 0
HMGCS2 7919055112029062012031155553.30.02801 0 8-
KRT5 7963427 12 52908361 52914243 58.2 0.02 25 3.6E+12 < 0.001 112 24 5
C15orf48 7983478 15 45722763 45725645 49.9 0.02 7 1.5E+08 < 0.001 23 5 0
KRT13 8015323 17 39657233 39661865 53.6 0.02 8 8.2E+11 < 0.001 5 6 0
IGJ 8100827 4 71521259 71532348 58.4 0.03 2 4.2E+08 < 0.001 19 2 0
SCNN1A 7960529 12 6456011 6486523 58.8 0.03 29 2.1E+09 < 0.001 12 27 0
S100A9 7905571 1 153330330 153333502 53.7 0.04 11 5.0E+11 < 0.001 33 9 1
KRT20 8015124 17 39032141 39041495 48.4 0.05 3 5.9E+09 < 0.001 45 3 0
CTSE 7909164 1 206317459 206332103 60.7 0.06 12 3.4E+09 < 0.001 36 12 1
AIM2 7921434 1 159032275 159046647 61.5 0.06 8 4.7E+07 0.002 27 4 0
OLFM4 7969288 13 53602972 53626186 60.0 0.06 10 1.6E+10 < 0.001 47 9 6
MSMB 792752910515495535156259067.30.06701 0 60
FREM2 7968678 13 39261173 39461265 46.0 0.08 2 4.4E+07 < 0.001 13 1 0
CLIC6 8068383 21 36041688 36090519 45.1 0.09 4 1.2E+08 < 0.001 19 4 0
TNNT3 7937749 11 1940799 1959935 44.9 0.09 26 5.2E+08 < 0.001 72 22 0
Global model SERPINB3 7920285 18 61322433 61329197 162.7 <0.01 15 0 3.0E+09 <0.001 6 4 1 0 0 0
KRT20 7905571 17 39032141 39041495 93.7 <0.01 19 7 5.7E+09 <0.001 8 38 0 0 0 0
PTN 7935535 7 136912092 137028546 92.0 <0.01 12 0 2.6E+08 <0.001 0 1 0 0 0 0
SERPINB4 7938758 18 61304495 61311502 68.6 0.03 4 0 7.9E+08 <0.001 27 11 0 0 0 0
SAA1 7962559111828780818291521127.80.042017.9E+080.6015100
S100A2 7957966 1 153533587 153538306 58.7 0.04 20 7 1.0E+11 <0.001 1 5 5 0 3 0
C15orf48 7964927 15 45722763 45725645 83.7 0.05 19 6 1.7E-07 <0.001 1 6 0 0 0 0
TSPAN8 796381712715188777155177967.20.05819.9E+050.02103010
FREM2 7968678 13 39261173 39461265 70.2 0.06 14 2 2.9E+07 <0.001 3 10 3 0 1 0
CLIC6 7983478 21 36041688 36090519 75.3 0.07 25 2 1.4E+08 <0.001 21 15 0 1 (LD) 0 0
UBD 7995712 6 29523390 29527702 75.0 0.07 6 5 8.8E+08 <0.001 0 25 0 0 0 0
CAPNS2 7981724 16 55600584 55601592 54.9 0.07 8 1 5.8E+07 <0.001 10 12 0 0 0 0
MYBPC1 80236881210198874710207965774.50.082339.9E-081012020
TMEM45A 8037197 3 100211463 100296285 57.3 0.08 12 0 1.6E+08 0.001 11 19 0 1 (LD) 0 0
S100A9 8015124 1 153330330 153333502 52.5 0.08 6 4 4.9E+11 <0.001 15 24 0 1 (LD) 4 1
SLC38A4 8023696 12 47158544 47219780 51.7 0.08 15 1 1.6E+08 0.001 8 15 6 0 1 0
IGJ 80683834715212597153234859.00.09323.3E+080.003130000
ARHGEF5 8081288 7 143883177 143892791 49.6 0.09 8 0 1.2E+07 <0.001 11 8 0 1 (LD) 0 0
CRTAC1 8100827 10 99624758 99790585 60.9 0.1 7 5 3.8E+09 <0.001 7 9 1 0 3 1
IGHD 81369811410630310210631201459.40.171--------
CRH 8151092 8 67088612 67090846 56.7 0.1 3 0 9.4E+08 <0.001 7 10 0 0 0 0
TRIM31 8178330 6 30070674 30080867 47.1 0.1 23 4 5.8E+08 <0.001 0 43 0 0 0 0
CXCL17 8143144 19 42932696 42947136 46.8 0.1 3 5 7.4E+08 <0.001 8 11 0 0 0 0
GTSF1 8124650 12 54849737 54867386 46.7 0.1 2 1 2.1E+07 <0.001 18 46 2 0 1 1

1Bonferroni correction for the p-value were: 0.005 (SNP model), 0.003 (CPG model) and 0.002 (Global model); SNPs (N) and CpGs (N) are the number of SNPs and CpGs that were selected by LASSO per each gene expression probe in EPICURO data with the Illumina HumanHap 1M array and the Methylation 27k array; SNPs (overlap) and CpGs (overlap) are the number of SNPs and CpGs that were present in the TCGA data with the Genome wide 6.0 Affymetrix and the Methylation 450k array; and the SNPs (rep) and CpGs (rep) are the ones selected by LASSO in the TCGA data in common with the EPICURO data. The gene with “no p-value” is a gene that was not present in the RNASeqV2 in TCGA data.

Table 4

Significant genes obtained by ENET&Permuted based maxT algorithm for the three models (SNP, CPG, and Global) in the original dataset (EPICURO Study) and the replication dataset (TCGA).

Original Data (EPICURO)Validation Data (TCGA)
GeneprobesetChrStartendDevp-valueSNPs (N)CpGs (N)Devp-value 1 SNPs (N)CpGs (N)SNPs (overlap)SNPs (rep)CpGs (overlap)CpGs (rep)
SNP model ANXA10 8098246 4 169013707 169108891 137.0 0.01 17 1.4E+08 <0.001 13 7 1
SERPINB38023696186132243361329197171.60.02301.4E+090.083230
CEACAM68029098194225939842276113108.40.03281.4E+090.04450
SERPINB48023688186130449561311502105.20.04311.1E+080.071081 (LD)
GTSF17963817125484973754867386109.60.05191.6E+060.08792 (LD)
IGF2 7937772 11 2150348 2170833 101.6 0.05 56 3.9E+12 0.002 31 12 0
IGLJ3798173022232470302324720597.70.05183-----
MMP7 7951217 11 102391240 102401478 94.8 0.06 19 2.8E+08 0.004 10 6 1 (LD)
PTN8143144713691209213702854697.20.0724010100
MSMB792752910515495535156259091.80.077801000
TCN1794844411596202815963404188.90.0712201000
CPG model REN 7923608120412394420413546584.30.0322011220
IGF2 7937772 11 2150348 2170833 92.1 0.04 15 8.1E+12 <0.001 609 12 7
NLRP2 8031398 19 55476652 55512508 84.2 0.04 34 8.0E+08 < 0.001 10 28 2
CEACAM5 802908619422125304223443692.10.06269.3E+080.0091230
MSMB 792752910515495535156259078.90.0696.2E+070.33671
CRABP2 7921099 1 156669410 156675375 65.2 0.09 39 1.1E+10 <0.001 132 35 11
Global model SERPINB3 7920285 18 61322433 61329197 171.3 0.01 27 1 5.3E+09 <0.001 37 15 0 0 1 1
MMP7 7951217 11 102391240 102401478 99.4 0.08 62 18 2.3E+08 0.003 5 2 0 0 0 0
PTN 8143144 7 136912092 137028546 102.5 0.09 20 0 6.1E+08 <0.001 16 15 0 0 0 0
IGHD 79817241410630310210631201497.50.1356--------

1Bonferroni correction for the p-values is: 0.008 (CPG model) and 0.01 (Global model); SNPs (N) and CpGs (N) are the number of SNPs and CpGs that were selected by ENET per each gene expression probe in EPICURO data with the Illumina HumanHap 1M array and the Methylation 27k array; SNPs (overlap) and CpGs (overlap) are the number of SNPs and CpGs that were present in the TCGA data with the Genome wide 6.0 Affymetrix and the Methylation 450k array and the SNPs (rep) and CpGs (rep) are the ones selected by ENET in the TCGA data in common with the EPICURO data. The gene with no p-value is a gene that was not present in the RNASeqV2 in TCGA data.

1Bonferroni correction for the p-value were: 0.005 (SNP model), 0.003 (CPG model) and 0.002 (Global model); SNPs (N) and CpGs (N) are the number of SNPs and CpGs that were selected by LASSO per each gene expression probe in EPICURO data with the Illumina HumanHap 1M array and the Methylation 27k array; SNPs (overlap) and CpGs (overlap) are the number of SNPs and CpGs that were present in the TCGA data with the Genome wide 6.0 Affymetrix and the Methylation 450k array; and the SNPs (rep) and CpGs (rep) are the ones selected by LASSO in the TCGA data in common with the EPICURO data. The gene with “no p-value” is a gene that was not present in the RNASeqV2 in TCGA data. 1Bonferroni correction for the p-values is: 0.008 (CPG model) and 0.01 (Global model); SNPs (N) and CpGs (N) are the number of SNPs and CpGs that were selected by ENET per each gene expression probe in EPICURO data with the Illumina HumanHap 1M array and the Methylation 27k array; SNPs (overlap) and CpGs (overlap) are the number of SNPs and CpGs that were present in the TCGA data with the Genome wide 6.0 Affymetrix and the Methylation 450k array and the SNPs (rep) and CpGs (rep) are the ones selected by ENET in the TCGA data in common with the EPICURO data. The gene with no p-value is a gene that was not present in the RNASeqV2 in TCGA data. Using DAVID, 46 out of 48 genes showing significant signals in the discovery phase were annotated from 14 public categories. After enrichment analysis, 7 clusters with an ES ≥1.3 were found (S3 Table). The cluster with the highest ES (3.5) regarded to the terms “extracellular region, secreted, and signal peptide” grouping the genes OLFM4, CRTAC1, MSMB, IGJ, MMP7, IGF2, PIGR, TCN1, CXCL17, S100A9, SAA1, IGHD, CRH, CTSE, FREM2, PLA2G2A, CEACAM7, CEACAM6, CEACAM5, REN, PTN, CP. The rest of the clusters with an ES ≥1.3 were not significant after MT correction. Cluster 5 (ES = 1.4) contains 3 genes coding for keratins (KRT5, KRT13, KRT20), cytoskeletal components that are regulated during urothelial differentiation, whose expression is altered in UBC, that have been proposed as markers for the molecular taxonomy of UBC [32]. In addition, cluster 7 “EF hand and calcium ion binding” (ES = 1.3) contains multiple genes shown to play an important role in cancer (S100A9, S100A2, CAPNS2, ANXA10, CRTAC1, FREM2, MMP7, PLA2G2A), including two members of the S100A family of proteins.

Discussion

Integration analysis is an emerging area in the field of omics data analysis to find new biological insights into complex traits [33]. In this regard, our pathophysiological understanding of cancer could be improved by using innovative approaches based on omics data to identify hidden mechanisms in which multiple factors are involved. We previously analyzed the set of omics data used here following a multi-stage approach by proposing an omics integration analysis framework. The results of this previous work highlighted relevant omics trans-acting relationships in UBC [12]. Here, we propose an omics integrative analysis pipeline using LASSO and ENET, and focus on cis-acting relationships that appear to have a predominant role in the regulation of gene expression [34]. The three omics data are combined in a large input matrix and then a permutationbased MaxT method is adapted to assess the significant models while correcting for MT. In comparison with classical approaches [6,7], our strategy has several advantages, including the possibility of working with a large number of parameters, even if the sample size is small, dealing with more than one set of heterogeneous data with highly-correlated variables, and providing MT corrected p-values to assess the models’ goodness of fit. Furthermore, the results are easily interpretable due to the dimensionality reduction during the variable selection process. The expression of 48 genes was found to be significantly associated with SNPs and CpGs in UBC, pointing to new mechanisms in an intricate scenario where common genetic variants and DNA methylation regulate gene expression in cis-acting (1MB) relationships. Some of the genes were identified by the three models and by the two methods, likely underscoring the existence of true relationships. The application of LASSO and ENET as part of the aforementioned integrative analysis framework led to different results. This is not surprising, mainly for two reasons: (1) the α parameter (Eq 5) used by LASSO is always equal to 1 while ENET uses α < 1. This gives a smaller penalization and therefore more variables with β ≠ 0 were foreseen using ENET; and (2) the fact that SNPs and CpGs may be correlated, mainly when they are closely positioned in the genome, leads LASSO to select one from the set of parameters that are highly correlated while ENET forms groups of nets with these variables. In our analysis, only 4/24 (SNP model), 1/25 (CpG model) and 3/28 (Global model) genes were shared by both methods. The genes detected only by LASSO showed large deviances and borderline p-values with ENET. Waldmann et al [35] reported that ENET usually detects more true and false positive associations. In our case, this may result in an increased probability of having significant associations by chance. In turn, this can lead to reduced power. On the other hand, ENET selected some genes that were not selected by LASSO, mainly due to the correlated structure of the parameters. An example of this is displayed in Fig 4, showing that MMP7 has three correlation nets that probably are responsible for the gene selection with ENET and not with LASSO. These comparisons are shown in S4 Table.
Fig 4

Example of a correlation plot for MMP7 detected by the Global model using ENET but not using LASSO.

The bar color represents the levels of correlation from 0 (no correlation) to 1 (perfect correlation) between SNPs and CpGs that were selected for the MMP7 models. Three nets of correlated variables are the ones responsible that the gene is only selected by ENET and not by LASSO.

Example of a correlation plot for MMP7 detected by the Global model using ENET but not using LASSO.

The bar color represents the levels of correlation from 0 (no correlation) to 1 (perfect correlation) between SNPs and CpGs that were selected for the MMP7 models. Three nets of correlated variables are the ones responsible that the gene is only selected by ENET and not by LASSO. Regarding the differences between the models, 13/25 and 6/20 significant genes in the CpG and 6/20 the SNP models, respectively, were not significant in the Global model. It is reported in the literature that 10% of SNPs are associated with gene expression and DNA methylation [36,37], hence DNA methylation may confound or modify the association between SNP and gene expression. Even though this is a potential explanation, discordances resulting from sample size cannot be discarded since the penalty function is selected by CV. However, k = 5 was used to apply the k-fold CV to decrease the problem of small sample size without increasing bias. In the reverse scenario, 16 genes were selected exclusively in the Global model. Some of the genes identified had high deviances and borderline p-values, probably because the Global models increase the deviance due to the addition of more information when integrating data. For the nonsignificant genes, the explanation could be the existence of an interaction effect between SNPs and CpGs (S5 Table). This further supports the importance of integrating omics data to discover hidden information. The validity of the strategy that we have developed, and of the results obtained, is supported by the fact that 75% (36/48) of the genes identified in the discovery phase were replicated using TCGA data by applying the same strategy. This represents 64% of all gene models found since some of the genes overlap between models and approaches. Also, the null results of the simulation study indicate that the significant associations found are unlikely to be due to chance. Importantly, several of the genes that emerged from our analyses have been previously shown to be important in bladder cancer biology, including KRT20, IGF2, CTSE, ANXA10 and CRH. These genes have already been proposed for a panel of molecular markers to improve the diagnosis and follow-up of UBC as part of a 12-gene expression urine signature to identify patients suffering from UBC and predict tumor aggressiveness [38]. The five genes aforementioned were also replicated in the TCGA data. Furthermore, KRT20, IGF2, and CTSE have also been previously associated with UBC. KRT20 is a highly specific marker of umbrella cells in normal urothelium and its expression is commonly altered in papillary non muscle-invasive UBC as well as in muscle-invasive UBC. It has been proposed that the correlation between FGFR3 mutations with normal KRT20 expression pattern may indicate that the mutation occurs earlier [39]. Loss of imprinting (LOI) is a common epigenetic event in cancer and a LOI of IGF2 has been reported in UBC [40]. In our analysis, IGF2 was detected in the SNP and CPG models suggesting that both type of factors may be involved in regulating the expression levels of this gene. CTSE expression was significantly associated with progression-free survival in pTa tumors in a study of gene expression profiles in UBC [41]. We also performed a gene enrichment analysis to assess whether the significant genes had related biological functions. The cluster with the highest ES was”Extracellular region, secreted, and signal peptide”. Secreted proteins are known to play a crucial role in cell signaling and the cellular secretome has a major impact on multiple aspects of tumor cell biology (cell growth, migration, invasion, and angiogenesis) [42]. One cluster highly enriched in keratins points to the regulation of cell differentiation, known to be important in the molecular classification of UBC. In addition, some genes—including S100A9 and S100A2—were grouped under the “EF hand and calcium ion binding” term. The S100 family is composed of, at least, 24 members carrying the Ca2+ binding EF-hand motif. Expression of S100 protein family members is regulated during inflammation and carcinogenesis and has been associated with poor prognosis in patients with UBC [43]. Other studies have reported an overexpression of S100A9 in UBC tissue [44,45]. Limitations of this work are the small sample size of the discovery phase study, due to the lack of enough fresh tumor tissue from the same set of individuals, and the lack of a comparable and independent UBC patient series with the 3-omics data available to replicate our results. While the discovery EPICURO study recruited all patients with UBC, the TCGA project focused on muscle-invasive UBC. In addition, different highthroughput technologies/platforms were used in each of the studies. The SNP arrays genotyped different SNPs and, consequently, provided different genomic coverage. The TCGA used a DNA methylation array of 450k with much higher resolution than the 27k the one used in the EPICURO study. Finally, the use of different technologies to measure transcriptomics is a considerable limitation. In the EPICURO discovery phase, gene expression levels were measured with microarrays which provide relative values at probe set level, that is, for one gene different expression levels can be obtained from each mapped probe, while in the TCGA study gene expression was measured with RNA-seq which gives absolute gene expression values. These differences between data sets introduce a massive heterogeneity that makes the replication even more difficult. In spite of that, we replicated 75% of the identified genes (64% of the models) with TCGA data, providing strong support to the appropriateness of our approach and the relevance of the results obtained. Another potential limitation is the fact that tumor samples are heterogeneous regarding neoplastic cell content and stromal cell composition. Consequently, we checked the expression of all significant genes in a panel of UBC cell lines with available microarray expression data [46] and found that all but one (IGJ) are expressed in urothelial tumor cells, indicating that our analyses likely reflect genomic regulatory events in the tumor cells. It is, however, likely that relevant genomic interactions control gene expression not only in neoplastic cells but also in the stroma. Given the importance of the latter in tumor progression, further integrative omics studies using microdissected material will be highly informative. One important strength of the approach used here is the lack of need to filter by LD in SNPs, or grouping CpGs within CpG islands, when dealing with a huge number of heterogeneous and correlated parameters delivered by different arrays. This emanates from the fact that LASSO and ENET can deal with highly correlated variables while performing variable selection. By performing data reduction/filtering before applying the statistical methods, there is a chance to filter out the functional SNPs and/or CpGs and thereby lose their association with gene expression. The adaptation of a strategy that performs a permutation and the maxT algorithm to assess p-values and to correct by MT, avoiding a double permutation and therefore reducing computational time, is also worthwhile emphasizing. In this regard, the permutation-based method considers the permutation of individuals within each gene, allowing to control for the possible dependence structure between genes. In addition, the MaxT algorithm is a permutationbased FWER controlling procedure which is adapted to the correlation structure found in the data and has been shown to be asymptotically optimal under dependence [47] In summary, we demonstrate that the integration of multiple omics data types allows the identification of hidden mechanisms that were missed when analyzing single omics data types individually. There is an urgent need to develop statistical methods to fill the gap between the huge amount of data generated and the mechanistic understanding of complex diseases. Here, we present two penalized regression methods (LASSO and ENET) in combination with a permutation–based strategy (permutation-based MaxT method) to deal with common problems found in integrative analysis: heterogeneity between data types, number of individuals much smaller than the parameters to assess, multicollinearity, and sparseness to facilitate the interpretation of the results. This approach is flexible and easy to implement in different omics data and diseases as well as when considering interaction terms in the model. We contribute to the field with a methodological development and with several significant and sound molecular associations conforming part of the genetic architecture of UBC. By using this cancer as an example, we conclude that modeling the intricacy of omics data variation with appropriate statistical strategies will certainly improve our knowledge of the mechanisms involved in complex diseases.

Data availability statement

Common genetic variation (GSE51641), DNA methylation (GSE71666), and gene expression (GSE71576) data for the discovery phase are available in GEO.

SNP-LASSO model.

Information about SNPs associated with gene expression using LASSO with SNP model for the discovery and replication phase. (XLSX) Click here for additional data file.

CPG-LASSO model.

Information about CpGs associated with gene expression using LASSO with CPG model for the discovery and replication phase. (XLSX) Click here for additional data file.

Global-LASSO model.

Information about SNPs and CpGs associated with gene expression using LASSO with Global model for the discovery and replication phase. (XLSX) Click here for additional data file.

SNP-ENET model.

Information about SNPs associated with gene expression using ENET with SNP model for the discovery and replication phase. (XLSX) Click here for additional data file.

CPG-ENET model.

Information about CpGs associated with gene expression using ENET with CPG model for the discovery and replication phase. (XLSX) Click here for additional data file.

Global-ENET model.

Information about SNPs and CpGs associated with gene expression using ENET with Global model for the discovery and replication phase. (XLSX) Click here for additional data file.

Deviance across the genome when applying LASSO for the SNP model for the simulated data.

The number of genes simulated are 20,899 for 27 individuals using a multivariate normal distribution (μ = 8.4, σ2 = 0.4). No gene was significantly associated after the permutation-based MaxT algorithm. (TIF) Click here for additional data file.

IDs corresponding to the 27 samples from the EPICURO data used in this analysis.

(DOCX) Click here for additional data file.

IDs corresponding to the 238 samples from the TCGA data used in this analysis.

(DOCX) Click here for additional data file.

Functional Annotation Clustering from DAVID tool (Enrichment score ≥ 1.3).

(DOCX) Click here for additional data file.

Comparison of the deviance, p-value and SNPs and/or CpGs detected by each model between LASSO and ENET methods.

(DOCX) Click here for additional data file.

Comparison of genes selected by each model (SNP, CpG and Global model) using LASSO.

(DOCX) Click here for additional data file.
  39 in total

1.  Sparse canonical correlation analysis with application to genomic data integration.

Authors:  Elena Parkhomenko; David Tritchler; Joseph Beyene
Journal:  Stat Appl Genet Mol Biol       Date:  2009-01-06

2.  Gene expression profiling of progressive papillary noninvasive carcinomas of the urinary bladder.

Authors:  Peter J Wild; Alexander Herr; Christoph Wissmann; Robert Stoehr; Andre Rosenthal; Dirk Zaak; Ronald Simon; Ruth Knuechel; Christian Pilarsky; Arndt Hartmann
Journal:  Clin Cancer Res       Date:  2005-06-15       Impact factor: 12.531

Review 3.  Molecular biology of bladder cancer: new insights into pathogenesis and clinical diversity.

Authors:  Margaret A Knowles; Carolyn D Hurst
Journal:  Nat Rev Cancer       Date:  2015-01       Impact factor: 60.716

4.  Regularization Paths for Generalized Linear Models via Coordinate Descent.

Authors:  Jerome Friedman; Trevor Hastie; Rob Tibshirani
Journal:  J Stat Softw       Date:  2010       Impact factor: 6.440

Review 5.  Intrinsic basal and luminal subtypes of muscle-invasive bladder cancer.

Authors:  Woonyoung Choi; Bogdan Czerniak; Andrea Ochoa; Xiaoping Su; Arlene Siefker-Radtke; Colin Dinney; David J McConkey
Journal:  Nat Rev Urol       Date:  2014-06-24       Impact factor: 14.432

Review 6.  Integrative eQTL-based analyses reveal the biology of breast cancer risk loci.

Authors:  Qiyuan Li; Ji-Heui Seo; Barbara Stranger; Aaron McKenna; Itsik Pe'er; Thomas Laframboise; Myles Brown; Svitlana Tyekucheva; Matthew L Freedman
Journal:  Cell       Date:  2013-01-31       Impact factor: 41.582

7.  The presence of methylation quantitative trait loci indicates a direct genetic influence on the level of DNA methylation in adipose tissue.

Authors:  Alexander W Drong; George Nicholson; Asa K Hedman; Eshwar Meduri; Elin Grundberg; Kerrin S Small; So-Youn Shin; Jordana T Bell; Fredrik Karpe; Nicole Soranzo; Tim D Spector; Mark I McCarthy; Panos Deloukas; Mattias Rantalainen; Cecilia M Lindgren
Journal:  PLoS One       Date:  2013-02-19       Impact factor: 3.240

8.  Time to recurrence and survival in serous ovarian tumors predicted from integrated genomic profiles.

Authors:  Parminder K Mankoo; Ronglai Shen; Nikolaus Schultz; Douglas A Levine; Chris Sander
Journal:  PLoS One       Date:  2011-11-03       Impact factor: 3.240

9.  The UBC-40 Urothelial Bladder Cancer cell line index: a genomic resource for functional studies.

Authors:  Julie Earl; Daniel Rico; Enrique Carrillo-de-Santa-Pau; Benjamín Rodríguez-Santiago; Marinela Méndez-Pertuz; Herbert Auer; Gonzalo Gómez; Herbert Barton Grossman; David G Pisano; Wolfgang A Schulz; Luis A Pérez-Jurado; Alfredo Carrato; Dan Theodorescu; Stephen Chanock; Alfonso Valencia; Francisco X Real
Journal:  BMC Genomics       Date:  2015-05-22       Impact factor: 3.969

10.  The relationship between DNA methylation, genetic and expression inter-individual variation in untransformed human fibroblasts.

Authors:  James R Wagner; Stephan Busche; Bing Ge; Tony Kwan; Tomi Pastinen; Mathieu Blanchette
Journal:  Genome Biol       Date:  2014-02-20       Impact factor: 13.583

View more
  25 in total

1.  IMIX: a multivariate mixture model approach to association analysis through multi-omics data integration.

Authors:  Ziqiao Wang; Peng Wei
Journal:  Bioinformatics       Date:  2021-04-01       Impact factor: 6.937

2.  Computational challenges in modeling gene regulatory events.

Authors:  Abhijeet Pataskar; Vijay K Tiwari
Journal:  Transcription       Date:  2016-07-08

Review 3.  Molecular Aspects of Wound Healing and the Rise of Venous Leg Ulceration: Omics Approaches to Enhance Knowledge and Aid Diagnostic Discovery.

Authors:  Daniel A Broszczak; Elizabeth R Sydes; Daniel Wallace; Tony J Parker
Journal:  Clin Biochem Rev       Date:  2017-02

4.  Graph- and rule-based learning algorithms: a comprehensive review of their applications for cancer type classification and prognosis using genomic data.

Authors:  Saurav Mallik; Zhongming Zhao
Journal:  Brief Bioinform       Date:  2020-03-23       Impact factor: 11.622

5.  Transcriptomic and functional network features of lung squamous cell carcinoma through integrative analysis of GEO and TCGA data.

Authors:  Yin Li; Jie Gu; Fengkai Xu; Qiaoliang Zhu; Di Ge; Chunlai Lu
Journal:  Sci Rep       Date:  2018-10-26       Impact factor: 4.379

6.  Exploring the Key Genes and Identification of Potential Diagnosis Biomarkers in Alzheimer's Disease Using Bioinformatics Analysis.

Authors:  Wuhan Yu; Weihua Yu; Yan Yang; Yang Lü
Journal:  Front Aging Neurosci       Date:  2021-06-14       Impact factor: 5.750

7.  A Systemic Analysis of Transcriptomic and Epigenomic Data To Reveal Regulation Patterns for Complex Disease.

Authors:  Chao Xu; Ji-Gang Zhang; Dongdong Lin; Lan Zhang; Hui Shen; Hong-Wen Deng
Journal:  G3 (Bethesda)       Date:  2017-07-05       Impact factor: 3.154

8.  Small-Magnitude Effect Sizes in Epigenetic End Points are Important in Children's Environmental Health Studies: The Children's Environmental Health and Disease Prevention Research Center's Epigenetics Working Group.

Authors:  Carrie V Breton; Carmen J Marsit; Elaine Faustman; Kari Nadeau; Jaclyn M Goodrich; Dana C Dolinoy; Julie Herbstman; Nina Holland; Janine M LaSalle; Rebecca Schmidt; Paul Yousefi; Frederica Perera; Bonnie R Joubert; Joseph Wiemels; Michele Taylor; Ivana V Yang; Rui Chen; Kinjal M Hew; Deborah M Hussey Freeland; Rachel Miller; Susan K Murphy
Journal:  Environ Health Perspect       Date:  2017-03-31       Impact factor: 9.031

9.  Genomic Analysis Using Regularized Regression in High-Grade Serous Ovarian Cancer.

Authors:  Yanina Natanzon; Madalene Earp; Julie M Cunningham; Kimberly R Kalli; Chen Wang; Sebastian M Armasu; Melissa C Larson; David Dl Bowtell; Dale W Garsed; Brooke L Fridley; Stacey J Winham; Ellen L Goode
Journal:  Cancer Inform       Date:  2018-02-01

10.  Seven-CpG-based prognostic signature coupled with gene expression predicts survival of oral squamous cell carcinoma.

Authors:  David C Christiani; Sipeng Shen; Guanrong Wang; Qianwen Shi; Ruyang Zhang; Yang Zhao; Yongyue Wei; Feng Chen
Journal:  Clin Epigenetics       Date:  2017-08-24       Impact factor: 6.551

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.