Literature DB >> 26041968

Integrative analyses of cancer data: a review from a statistical perspective.

Yingying Wei1.   

Abstract

It has become increasingly common for large-scale public data repositories and clinical settings to have multiple types of data, including high-dimensional genomics, epigenomics, and proteomics data as well as survival data, measured simultaneously for the same group of biological samples, which provides unprecedented opportunities to understand cancer mechanisms from a more comprehensive scope and to develop new cancer therapies. Nevertheless, how to translate this wealth of data into biologically and clinically meaningful information remains very challenging. In this paper, I review recent developments in statistics for integrative analyses of cancer data. Topics cover meta-analysis of a homogeneous data type across multiple studies, integration of multiple heterogeneous genomic data types, survival analysis with high- or ultrahigh-dimensional genomic profiles, and cross-data-type prediction, where both predictors and responses are high- or ultrahigh-dimensional vectors. I compare existing statistical methods and comment on potential future research problems.

Keywords:  cancer genomics; high-dimensional data; integrative analysis; survival analysis; ultrahigh-dimensional data

Year:  2015        PMID: 26041968      PMCID: PMC4435444          DOI: 10.4137/CIN.S17303

Source DB:  PubMed          Journal:  Cancer Inform        ISSN: 1176-9351


Introduction

With the rapid development and decreasing cost of high-throughput technologies, cancer biology has moved from a data-poor field to a data-abundant field. To illustrate, to date, more than 1,000,000 samples have been stored in Gene Expression Omnibus1 and ArrayExpress2; meanwhile, over 2,500 sequencing samples are deposited in the ENCyclopedia of DNA elements (ENCODE) project3 and the Sequence Read Archive4. Moreover, multiple types of genomics, epigenomics, and proteomics data, together with clinical data such as survival data, are simultaneously measured for cancer patients in the Cancer Genome Project (CGP),5 the International Cancer Genome Consortium (ICGC),6 and The Cancer Genome Atlas (TCGA). Consequently, this large volume of data provides unprecedented opportunities, as well as challenges, for using integrative analysis to reveal cancer mechanisms. Novel statistical methods and theories are in urgent demand for translating the wealth of data into biologically and clinically meaningful information while avoiding the “blind men and an elephant” scenario. Over the past two decades, high-throughput biology has triggered active statistical research on high-dimensional data. Recently, to handle the real dimension of genomic data, which usually goes far beyond hundreds of variables, methods handling ultrahigh-dimensional data have emerged. The diversity of cancer data types, together with the availability of related studies on similar types of cancers, adds two further dimensions of complexity. It is of critical clinical and biological interest to understand what subtypes a cancer has, how genomic profiles and survival rates of patients vary among subtypes, whether a patient’s survival can be predicted from his or her genomic profiles, and how one type of genomic profile is correlated with another.
No doubt, the abundance and sophisticated structure of cancer data will drive a whole class of exciting statistical problems in the coming years. In this paper, I review recent developments in statistical methods for integrative analyses of cancer data. This review complements an earlier review from this year7 by taking a statistical perspective, offering a more detailed comparison of statistical methods; a broader range of topics, such as integration of a homogeneous type of genomic profile across studies and integration of genomic profiles with survival data; and comments on potential extensions of current methods (see Table 1). With cancer data becoming increasingly easy to access from public repositories,8 so that anyone interested can readily work on them, I hope the current review will arouse interest in developing new statistical tools and theories for integrative genomic analyses in cancer.
Table 1

Summary of the main reviewed methods.

NAME                        | INTEGRATION TYPE                      | CORE STATISTICAL METHOD       | REFERENCE
ComBat                      | Single data type, multiple studies    | Empirical Bayes               | 10
SVA                         | Single data type, multiple studies    | Surrogate variable analysis   | 11,12
svaseq                      | Single data type, multiple studies    | Surrogate variable analysis   | 13
RUV                         | Single data type, multiple studies    | Generalized linear model      | 14
Consistent DE               | Single data type, multiple studies    | Bayesian hierarchical model   | 15
EBarrays                    | Single data type, multiple studies    | Bayesian hierarchical model   | 16-19
XDE                         | Single data type, multiple studies    | Bayesian hierarchical model   | 20
Cormotif                    | Single data type, multiple studies    | Bayesian hierarchical model   | 21
2-Norm group bridge         | Single data type, multiple studies    | Penalized method              | 22
iCluster                    | Multiple data types, single study     | Matrix factorization          | 33
Joint Bayesian factor       | Multiple data types, single study     | Matrix factorization          | 38
JIVE                        | Multiple data types, single study     | Matrix factorization          | 42
md-module                   | Multiple data types, single study     | Matrix factorization          | 45
MDI                         | Multiple data types, single study     | Bayesian hierarchical model   | 51
Prob_GBM                    | Multiple data types, single study     | Bayesian hierarchical model   | 53
Consensus clustering        | Multiple data types, single study     | Bayesian hierarchical model   | 54
SNF                         | Multiple data types, single study     | Network fusion                | 55
Multi-attribute graph       | Multiple data types, single study     | Network fusion                | 57,58
Penalized survival          | Single data type with survival        | Penalized method              | 69-71
Network penalized survival  | Single data type with survival        | Penalized method              | 73,74
SIS survival                | Single data type with survival        | Sure independence screening   | 76
PSIS                        | Single data type with survival        | Sure independence screening   | 77
FAST                        | Single data type with survival        | Sure independence screening   | 80
Bagging survival trees      | Single data type with survival        | Bootstrap                     | 82
Survival ensembles          | Single data type with survival        | Inverse probability weighting | 85
RIST                        | Single data type with survival        | Imputation                    | 88
T_SVD                       | Multiple data types, multiple studies | Neural network                | 92
The rest of the paper is organized as follows. In the next section, I review models for integrating a single type of genomic profile across multiple studies to improve signal detection. This is followed by a section devoted to the integration of multiple types of genomic profiles. The next two sections present, respectively, the integration of genomic data with survival data and the cross-data-type prediction problem, where both the responses and the predictors are high dimensional. Finally, the last section concludes the paper.

Integration of a Single Genomic Data Type

High-throughput technologies often have low signal-to-noise ratios. Consequently, results obtained from a single study often suffer from low reproducibility, because of either small sample sizes or heterogeneity across datasets. With the rapid accumulation of related studies in public data repositories, as mentioned above, it is more cost effective to borrow information across studies to improve signal detection. Nevertheless, care must be taken when pooling datasets to account for systematic biases such as batch effects as well as study specificity.

Batch effects

Batch effects are widespread in high-throughput biology. They are artifacts unrelated to the biological variation of scientific interest. For instance, two microarray experiments on the same technical replicates processed on two different days may give different results because of factors such as room temperature or the two technicians who ran the experiments. Batch effects can substantially confound downstream analysis, especially meta-analysis across studies. Moreover, even more recent technologies such as next-generation sequencing do not eliminate batch effects.9 Therefore, correcting batch effects is crucial for valid integration across studies. For microarray data, when the batches are known, a location-and-scale adjustment method, ComBat, was developed to adjust for batch effects.10 The core idea of ComBat10 is that the observed measurement Y_gij for the expression value of gene g for sample j from batch i can be expressed as

$$Y_{gij} = \alpha_g + X_j \beta_g + \gamma_{gi} + \delta_{gi}\,\varepsilon_{gij}, \qquad (1)$$

where X_j consists of covariates of scientific interest, while γ_gi and δ_gi characterize the additive and multiplicative batch effects of batch i on gene g. After obtaining the estimators from the above linear regression, the raw data Y_gij can be adjusted to

$$Y^{*}_{gij} = \frac{Y_{gij} - \hat\alpha_g - X_j \hat\beta_g - \hat\gamma_{gi}}{\hat\delta_{gi}} + \hat\alpha_g + X_j \hat\beta_g. \qquad (2)$$

In real applications, an empirical Bayes method is applied for parameter estimation. When batches are unknown, surrogate variable analysis (SVA)11,12 was developed. The main idea is to separate the effects of the covariates of primary interest from the unmodeled artifacts. Parallel to Equation (1), the raw expression value Y_gj of gene g in sample j can now be formulated as

$$Y_{gj} = \alpha_g + X_j \beta_g + \sum_{s=1}^{S} \lambda_{gs} h_{sj} + \varepsilon_{gj}, \qquad (3)$$

where the h_s represent the unmodeled factors and are called “surrogate variables”. Once again, the basic idea is to estimate the h_s and adjust for them accordingly.
An iterative algorithm based on singular value decomposition alternates between estimating the main effects given the current estimates of the surrogate variables and estimating the surrogate variables from the residuals. For sequencing data, svaseq, the generalized version of SVA, suggested first applying a moderated log transformation to the count data or FPKM (fragments per kilobase of exon per million fragments mapped) values to account for their discrete distributions,13 thus updating Equation (3) to

$$\log(Y_{gj} + c) = \alpha_g + X_j \beta_g + \sum_{s=1}^{S} \lambda_{gs} h_{sj} + \varepsilon_{gj}, \qquad (4)$$

where c is a small positive constant. Instead of directly transforming the raw counts or FPKM values, remove unwanted variation (RUV) adopted a generalized linear model for Y with the conditional mean specified as

$$\log E[Y_{gj}] = \alpha_g + X_j \beta_g + W_j \lambda_g, \qquad (5)$$

where W_j are the unwanted factors. RUV also allows the use of negative control genes and control samples, with details listed in its online methods.14
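To make the location-and-scale adjustment concrete, here is a minimal numpy sketch in the spirit of ComBat for known batches. It uses plain method-of-moments estimates of the batch effects rather than the empirical Bayes shrinkage of the real method, omits covariates of interest, and the function name is mine.

```python
import numpy as np

def batch_adjust(Y, batches):
    """Location/scale batch adjustment sketch (ComBat-like, no shrinkage).

    Y: (genes x samples) expression matrix; batches: per-sample batch labels.
    Each gene's batch mean is pulled back to its grand mean and each
    batch's spread is rescaled.
    """
    Y = np.asarray(Y, dtype=float)
    batches = np.asarray(batches)
    grand_mean = Y.mean(axis=1, keepdims=True)        # alpha_g estimate
    adjusted = np.empty_like(Y)
    for b in np.unique(batches):
        idx = batches == b
        gamma = Y[:, idx].mean(axis=1, keepdims=True) - grand_mean  # additive effect
        delta = Y[:, idx].std(axis=1, keepdims=True, ddof=1)        # multiplicative effect
        delta[delta == 0] = 1.0
        adjusted[:, idx] = (Y[:, idx] - grand_mean - gamma) / delta + grand_mean
    return adjusted
```

After the adjustment, every batch shares the same per-gene mean, which is the minimal requirement before pooling studies.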

Hierarchical model

Differential expression detection between cancer patients and control samples is usually the first step in screening for risk genes and drug targets. However, as mentioned at the beginning of this section, gene expression microarrays suffer from noisy measurements, especially when only a small number of samples are available. Consequently, it is appealing to pool information across related studies or related cancer types to borrow strength. Specifically, within each study d = 1,…,D, we have n0 control samples and n1 cancer patients, and gene expression is measured for a total of G genes in each sample. The task is to determine whether each gene g is differentially expressed in a given study d. Hereafter, we assume that the data have already been properly normalized and adjusted for batch effects. The simplest way of pooling information is to assume that a gene is differentially expressed either in all studies or in none of them.15 However, this fails to allow genes to be differentially expressed in only a subset of studies, thus losing study specificity. A more flexible model, EBarrays,16–18 included all possible differential expression patterns in a mixture model and fitted the model with an empirical Bayes approach. A Markov-chain Monte Carlo (MCMC) algorithm was also developed for model fitting along the same line.19 EBarrays performs well when the total number of integrated studies D is small, but it encounters the barrier of exponential parameter growth when D is large, as it has to enumerate all 2^D possible patterns. XDE20 did not suffer from the exponentially growing parameter space, but its Bayesian hierarchical model assumed that each gene had the same prior probability of being differentially expressed within a given study.
To tackle the exponential growth of the parameter space while still allowing heterogeneity among genes, Cormotif21 adopted a small number of latent probability vectors to capture the correlation among studies while still being able to regenerate all 2^D differential expression patterns. Ma et al.22 considered a more general case where a response variable Y_j^(d) was available for each sample j within every study d. The task was to build a regression model Y_j^(d) = f(X_j^(d) β^(d)), where X_j^(d) is the gene expression profile for sample j within study d. Differential expression detection is a subclass of this problem, with Y being binary. The authors adopted a penalized approach to select the genes with nonzero coefficients. Although the penalty functions were designed to enforce the same set of genes to have nonzero coefficients across all studies, the magnitudes of the coefficients were allowed to vary across studies. It would be of interest to investigate a more flexible model in which the set of genes with nonzero coefficients is also allowed to vary from study to study. Despite refined methods for detecting differential expression from a single sequencing experiment, including DESeq23 and edgeR,24–26 sequencing-data versions of hierarchical models for integrating multiple studies still require development to address the discrete distributions typical of count and FPKM data. Before more tailored methods become available, one easy approach might be to apply a moderated log transformation as in svaseq13 and then use the aforementioned microarray-based methods.
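The exponential blow-up that EBarrays faces can be seen directly by enumerating the differential-expression configurations. The snippet below (names mine) lists all 2^D patterns for D studies; this is exactly the space that Cormotif compresses into a few latent probability vectors.

```python
from itertools import product

def de_patterns(D):
    """Enumerate all 2**D differential-expression configurations across
    D studies; each pattern is a tuple of 0/1 flags (1 = DE in that study)."""
    return list(product((0, 1), repeat=D))

patterns = de_patterns(3)
assert len(patterns) == 2 ** 3   # 8 configurations for 3 studies
# For D = 20 studies the full enumeration already has 2**20 = 1,048,576
# patterns, which is why motif-style summaries become necessary.
```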

Integration of Multiple Genomic Data Types

Due to the decreasing cost of high-throughput technologies, more and more studies now measure multiple heterogeneous genomic profiles simultaneously for the same set of samples (patients and controls), such as gene expression, gene mutations, copy number alterations, and DNA methylation, where each data type consists of tens of thousands of measurements. A key problem in heterogeneous data type integration is how to characterize the common structure shared by all the data types as well as the individual, data-type-specific variation. In this review, I focus on recent statistical methodology for integrative analyses of cancer data. Meanwhile, many well-developed machine learning algorithms, such as boosting,27,28 random forests,29 and support vector machines,30 have also been increasingly applied to cancer data and have shown good prediction performance, although with less interpretability. Readers may consult the corresponding review papers and the references therein for details.7,31,32 The recently developed statistical methods can in general be categorized into three classes: matrix factorization, Bayesian models, and network fusion. In many scenarios, sparsity assumptions are also incorporated for regularization, to select a more parsimonious set of features. Here, I let {X^(1), …, X^(D)} represent D different data types, where X^(d) contains the measurements of p_d genomic features on N objects for data type d.

Matrix factorization

Matrix factorization aims to decompose the variation in the datasets via lower-rank matrix approximation. Assuming that a set of “fundamental” common factors determines the values of all the original genomic features, the iCluster model was developed as33

$$X^{(d)} = W^{(d)} Z + \varepsilon^{(d)}. \qquad (6)$$

Here, Z is the K × N matrix of underlying factors; W^(d) is a p_d × K matrix containing the factor loadings specific to data type d; and ε^(d) ∼ N(0, Ψ^(d)) are the residual terms after accounting for the common factors. Sparsity was imposed on the loadings W^(d). To accommodate the different characteristics of heterogeneous data types, different types of penalty functions (the lasso penalty,34 the fused lasso penalty,35 and the elastic net penalty36) were applied to different data types. For instance, the fused lasso penalty is particularly suitable for DNA copy number data, as it accounts for spatial dependence along the genome. Treating Z as “missing data”, an Expectation–Maximization algorithm37 was applied to the penalized complete-data likelihood for model fitting. Cancer subtypes were determined by standard K-means clustering on E(Z | X). A resampling-based criterion measuring cluster reproducibility was used to choose the tuning parameters for the penalties and the number of latent factors K. Along this line, Ray et al.38 generalized the above model to the following factorization:

$$X^{(d)} = W^{(d)} \left( Z + Z^{(d)} \right) + \varepsilon^{(d)}, \qquad (7)$$

where Z represents the factor scores shared by all data types, and Z^(d) the factor scores specific to data type d. The model further assumed sparsity on both the factor scores Z and Z^(d), as well as on the factor loadings W^(d). To select the number of factors K, a finite beta–Bernoulli process was employed as an approximation to the Indian buffet process39,40 for the binary indicators of the nonzero components in Z and Z^(d). After specifying priors for all the parameters in the model, a Gibbs sampler41 was used for posterior inference.
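As a rough illustration of the shared-latent-factor idea behind iCluster and its Bayesian generalization, the sketch below (names and simplifications mine) stacks the row-centered data types and takes the top-K right singular vectors as common factor scores. The real methods add sparsity penalties or priors and a proper EM/Gibbs fit; subtypes would then come from K-means on these scores.

```python
import numpy as np

def shared_factors(data_list, K):
    """Unpenalized skeleton of a shared-latent-factor model.

    data_list: matrices of shape (features_d x samples), one per data type,
    all over the same N samples.  Row-center each matrix, stack them, and
    return the top-K right singular vectors as K x N common factor scores.
    """
    X = np.vstack([M - M.mean(axis=1, keepdims=True) for M in data_list])
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:K]
```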
Instead of sharing the same factor loadings W^(d) for both Z and Z^(d), Joint and Individual Variation Explained (JIVE) proposed a similar model in which data-type-specific loadings are also allowed for the common factors Z.42 In other words, the model can now be factored as

$$X^{(d)} = W^{(d)} Z + U^{(d)} Z^{(d)} + \varepsilon^{(d)}. \qquad (8)$$

Denoting J^(d) = W^(d) Z and A^(d) = U^(d) Z^(d), the joint structure J = [J^(1); …; J^(D)] and the individual structures A^(d), d = 1, …, D, were allowed to have different ranks. A permutation testing approach was used to select the numbers of factors. Under the orthogonality constraint between the joint structure and the individual structures, J and A^(d), d = 1, …, D, were fitted iteratively by fixing one at a time and minimizing the squared norm of the residual matrices. To induce sparsity, L1 penalties were placed on the loading matrices W^(d) and U^(d) and incorporated into the iterative estimation algorithm. Nonnegative matrix factorization (NMF) attempts to decompose a nonnegative matrix into nonnegative loadings and nonnegative factors, thus describing the non-subtractive patterns in the data.43,44 Zhang et al.45 generalized single-matrix NMF to the integrative analysis of multidimensional genomic data. After transforming the raw data into input data satisfying the nonnegativity constraints, as in Kim et al.,43 the following squared Euclidean error loss function was optimized45:

$$\min_{W \ge 0,\; H^{(d)} \ge 0} \; \sum_{d=1}^{D} \left\| X^{(d)} - W H^{(d)} \right\|_F^2. \qquad (9)$$

One drawback of the NMF decomposition lies in the time complexity of the fitting algorithms, which grows with the matrix dimensions and the iteration number t of the fitting algorithm. Consequently, for a large number of genomic features, data reduction techniques such as principal component analysis46 are required in the data preprocessing step, which may result in loss of information. Moreover, network information can be incorporated into NMF.
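A minimal version of the joint NMF objective can be fitted with standard multiplicative updates, sharing one nonnegative basis W across all data types. This sketch (function name mine) is the unregularized skeleton of the joint factorization, not the authors' implementation.

```python
import numpy as np

def joint_nmf(Xs, K, n_iter=300, eps=1e-9):
    """Joint NMF sketch: approximate each nonnegative X_d (N x p_d)
    as W @ H_d with a single shared nonnegative basis W (N x K),
    via multiplicative updates on the summed squared Euclidean loss.
    """
    rng = np.random.default_rng(0)
    N = Xs[0].shape[0]
    W = rng.random((N, K)) + eps
    Hs = [rng.random((K, X.shape[1])) + eps for X in Xs]
    for _ in range(n_iter):
        for d, X in enumerate(Xs):                 # update each H_d given W
            Hs[d] *= (W.T @ X) / (W.T @ W @ Hs[d] + eps)
        num = sum(X @ H.T for X, H in zip(Xs, Hs))  # update shared W
        den = sum(W @ H @ H.T for H in Hs) + eps
        W *= num / den
    return W, Hs
```

Because W is shared, samples are embedded in one common nonnegative factor space, which is what makes subsequent joint clustering possible.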
Network-based stratification (NBS)47 clusters tumors into subtypes according to somatic mutation profiles by minimizing, with A an adjacency matrix encoding network information and L its graph Laplacian, an objective of the form

$$\min_{W \ge 0,\; H \ge 0} \; \| X - W H \|_F^2 + \lambda \,\mathrm{tr}\!\left( H L H' \right), \qquad (10)$$

so that genes adjacent in the network receive similar loadings. As pointed out by the authors, NBS can be further generalized to integrate multiple layers of information47; thus I expect a loss function combining Equations (9) and (10). A major issue with all the factorization approaches mentioned above is that they require proper normalization across data types. Generally, different data types have different distributions, different variability, and different numbers of genomic features. For instance, without proper scaling, as pointed out by Lock et al.,42 it is very likely that “the largest data set wins”. JIVE attempts to handle this issue by first normalizing each row and then scaling across data types. On the other hand, as mentioned above, iCluster33 uses different penalty functions to account for different data features; however, it still fails to distinguish between binary, categorical, and continuous data types. The method proposed by Mo et al.48 can be viewed as a generalization of iCluster33 that incorporates different distributional assumptions while still assuming the same common latent factors for all data types. Specifically, with i indexing patients and j indexing genomic features, for a binary outcome it rephrased Equation (6) as

$$\mathrm{logit}\, P\!\left( x_{ij}^{(d)} = 1 \mid z_i \right) = \alpha_j^{(d)} + w_j^{(d)\prime} z_i. \qquad (11)$$

Similarly, for multicategory outcomes, with p_{ijc}^(d), c = 1,…,C, denoting the probability of each category, Equation (6) became

$$\log \frac{p_{ijc}^{(d)}}{p_{ijC}^{(d)}} = \alpha_{jc}^{(d)} + w_{jc}^{(d)\prime} z_i, \quad c = 1, \ldots, C-1. \qquad (12)$$

Likelihoods for count outcomes with a Poisson distribution and for continuous variables with a normal distribution can be derived accordingly. A lasso penalty was also placed on the loadings w_j^(d) for regularization.
The tuning parameters for regularization were chosen by the Bayesian information criterion (BIC), and the model was fitted by a modified Monte Carlo Newton–Raphson algorithm.49,50 A potential future research problem is how to adapt different distributional assumptions to a more flexible factorization framework such as the joint Bayesian factor model38 and JIVE.42 Moreover, beyond the Bayesian framework, how to conduct statistical inference, including significance tests and confidence intervals, for factor models, especially with penalization methods, is also an important future research problem. Another problem worth investigating in real applications, as pointed out by the referee, is the choice of the number of components or clusters K; the authors of the above models have tried resampling-based criteria measuring cluster reproducibility, permutation-based testing, the Indian buffet process, and BIC, whereas the Akaike information criterion (AIC) and Bayes factors might also be of interest.

Bayesian models

Bayesian hierarchical models are another set of popular tools for the integrative analysis of heterogeneous data types. They offer the flexibility to model data-type-specific distributions as well as various types of correlation among data types. In multiple dataset integration (MDI),51 the authors considered the case where multiple genomic data types are measured under a single biological condition for a common set of genomic features. For instance, gene expression data, protein–DNA interaction data, and protein–protein interaction data are measured simultaneously for the same group of genes. The model assumed that each data type followed a K-component mixture model. Let c_id indicate the class membership of feature i in dataset d. Then, MDI modeled the associations among datasets via the following conditional prior for the data-type-specific class memberships:

$$p(c_{i1}, \ldots, c_{iD}) \propto \prod_{d=1}^{D} \pi_{c_{id}}^{(d)} \prod_{d < d'} \left( 1 + \phi_{dd'} \, \mathbf{1}(c_{id} = c_{id'}) \right). \qquad (13)$$

Here, 1(·) is the indicator function, π^(d) are the mixture weights for dataset d, and φ_dd' characterizes the pairwise association between datasets d and d'. MDI was further extended by incorporating a feature selection step in modeling its data-type-specific distributions and was applied to gene expression, copy number variation, methylation, and microRNA data of 277 glioblastoma samples from TCGA.52 In MDI, φ_dd' describes the global association between two datasets across all features. A more flexible model might allow the association to vary from one cluster of features (genes) to another. Instead of modeling associations among different data types, Prob_GBM53 modeled the associations among patients using a patient-similarity network. It first discretized all the genomic features and concatenated them into one vector per patient. Next, for each patient, it assumed that each genomic feature was generated from a multinomial distribution whose parameters were determined by a K-dimensional Dirichlet distribution.
Consequently, the likelihood can be written in a similar fashion to that of MDI, where the φ are now determined by the binary links in the patient-similarity network. One drawback of this approach is that it requires discretization of each data type, which may lose a substantial amount of information. Bayesian consensus clustering was proposed to model the overall clustering consensus among different data types rather than pairwise associations between them. Therefore, a single overall clustering can be obtained at the patient level, enabling cancer subtype discovery.54 Denoting the overall clustering labels as C = (C_1, …, C_N), then compared with Equation (13), the data-type-specific conditional model can be formulated as

$$P(c_{id} = k \mid C_i) = \begin{cases} \alpha_d, & k = C_i, \\ (1 - \alpha_d)/(K - 1), & k \neq C_i, \end{cases} \qquad (14)$$

where α_d regulates the consensus between the clustering for dataset d and the overall clustering. So far, software has been developed with the data-type-specific distributions specified as normal. All the above models are embedded in the Bayesian framework; consequently, one main challenge lies in the computational cost of the MCMC algorithms used for model fitting. Generally speaking, compared with matrix factorization methods, Bayesian hierarchical models provide more flexibility to model data-type-specific distributions and various dependence structures. Nevertheless, it remains challenging to build models that comprehensively capture the associations among different data types, among patients, and among different clusters of genomic features.
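The MDI-style conditional prior can be evaluated directly for a single feature. The sketch below (names mine) returns the unnormalized prior weight of one label configuration, making explicit how φ > 0 up-weights agreement between data types.

```python
from itertools import combinations

def mdi_prior(labels, pis, phi):
    """Unnormalized MDI-style prior weight for one feature.

    labels[d]  : cluster label of the feature in data type d
    pis[d][k]  : mixture weight of cluster k in data type d
    phi[d][e]  : association parameter (>= 0) between data types d < e
    Weight = product of mixture weights, boosted by (1 + phi) for every
    pair of data types that agree on the label.
    """
    w = 1.0
    for d, c in enumerate(labels):
        w *= pis[d][c]
    for d, e in combinations(range(len(labels)), 2):
        if labels[d] == labels[e]:
            w *= 1.0 + phi[d][e]
    return w
```

With uniform mixture weights, an agreeing configuration is exactly (1 + φ) times more probable a priori than a disagreeing one.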

Network fusion

Another emerging approach for identifying cancer subtypes is to construct networks of patients and then cluster according to the obtained graph. Similarity network fusion (SNF)55 first constructed a similarity network of patients for each data type, where each node represents a patient and the weight on each edge indicates the similarity between two patients. Then, SNF normalized each network into a matrix P^(d) that captures the global similarities among patients, with row sums equal to 1, and a matrix S^(d) that describes only the local similarities among the K nearest neighbors of each patient. By iteratively updating

$$P^{(d)} = S^{(d)} \times \left( \frac{\sum_{d' \neq d} P^{(d')}}{D - 1} \right) \times \left( S^{(d)} \right)'$$

until convergence, SNF fused the multiple networks into a single network and used spectral clustering56 to obtain clusters of nodes (patients). Instead of building a graph for each type of data, Katenka et al.57 stacked the X^(d) into X = (X^(1), …, X^(D)). A hypothesis testing approach was used to construct an association network according to the canonical correlation between two groups of attributes. Kolar et al.58 continued along the same line and assumed that X followed a joint multivariate Gaussian distribution. Then, a penalized log likelihood was optimized to estimate the partial canonical correlations for constructing a Markov graph.59 Finally, nodes (patients) were clustered using a heuristic based on the edge weights of the obtained graph, as in Katenka et al.57 SNF lacks a rigorous probabilistic model for fusing multiple graphs, and the methods of both Katenka et al.57 and Kolar et al.58 require X^(d) to be continuous, which might not be suitable for some genomic data types such as copy number variation.
Given the burst of statistical literature on multiple-graph estimation,60–66 though usually for a single data type across multiple conditions, I expect that estimating multiple graphs constructed from multiple data types, and constructing a single graph from heterogeneous data types with data-type-specific distributions, will call for novel statistical models, methods, and theories for network research.
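A compact sketch of the SNF cross-diffusion update is given below (names and simplifications mine): each view's local kNN matrix propagates the average global similarity of the other views, and the fused network is the average of the converged views.

```python
import numpy as np

def snf(similarities, K=3, n_iter=20):
    """Similarity network fusion sketch.

    similarities: symmetric patient-similarity matrices, one per data type.
    P_d: row-normalized global similarities; S_d: row-normalized similarities
    restricted to each patient's K nearest neighbours.  Update:
    P_d <- S_d @ mean(P_other) @ S_d.T, repeated, then average the views.
    """
    def row_norm(M):
        return M / M.sum(axis=1, keepdims=True)

    def knn_mask(M, K):
        S = np.zeros_like(M)
        for i, row in enumerate(M):
            nn = np.argsort(row)[::-1][:K]      # K strongest neighbours
            S[i, nn] = row[nn]
        return row_norm(S)

    Ps = [row_norm(W) for W in similarities]
    Ss = [knn_mask(W, K) for W in similarities]
    for _ in range(n_iter):
        new = []
        for d, S in enumerate(Ss):
            others = [P for e, P in enumerate(Ps) if e != d]
            new.append(row_norm(S @ (sum(others) / len(others)) @ S.T))
        Ps = new
    return sum(Ps) / len(Ps)
```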

Integration of Genomic Data with Survival Data

One of the major goals of cancer research is to characterize the survival curves of cancer patients. Therefore, statistical methods for studying the relationship between survival data and high-dimensional genomic data are of vital clinical importance. Here, I briefly review recent developments in integrating genomic data with survival data. Let T and C denote the true underlying failure time and the censoring time. We only observe the time Y = min(T, C), and I use δ = 1(T ≤ C) to indicate an observed failure as opposed to a censored observation. X = (X_1, …, X_p) are the p-dimensional covariates. A conditionally independent censoring mechanism given the covariates is usually assumed. The goal is to reveal the dependence of the survival time T on the covariates from the censored data (Y, δ, X). The two main approaches to modeling survival data with high-dimensional genomic covariates are penalization-based variable selection methods and tree-based ensemble learning methods.

Variable selection methods

The Cox proportional hazards model67 is one of the most widely used models for survival data. It assumes that the hazard at time t for covariates x is

$$\lambda(t \mid x) = \lambda_0(t) \exp(x' \beta), \qquad (15)$$

where λ_0(t) is an unspecified baseline hazard function. Then, for model fitting, the partial likelihood68 can be derived as

$$L(\beta) = \prod_{i \in \mathcal{D}} \frac{\exp(x_i' \beta)}{\sum_{j \in R_i} \exp(x_j' \beta)}, \qquad (16)$$

where 𝒟 is the set of indices of observed failures, and R_i is the set of indices of subjects at risk at the failure time Y_i of subject i. The estimate β̂ is obtained by maximizing the log partial likelihood. For high-dimensional covariates, penalty functions for β such as the lasso penalty69 and the smoothly clipped absolute deviation penalty70,71 can be added to the log partial likelihood. For penalized variable selection methods, as well as other dimension reduction methods developed for survival data before 2009, see Witten and Tibshirani72 for a detailed review. Along the same road map, when biological pathway information is available, penalty functions have also been designed to conduct both group-level and within-group variable selection73 and to enforce smoothness of the regression coefficients of genes connected in a network.74 Parallel to the development of linear-model methods moving from high to ultrahigh dimensions, defined in Fan and Lv75 as the dimensionality growing exponentially with the sample size, sure independence screening (SIS) type methods have also been developed for survival data. Given an outcome Y and ultrahigh-dimensional covariates X = (X_1, …, X_p), SIS first screens the covariates according to their marginal correlations with Y down to a subset S and then builds a regression model for Y on the selected set using various penalized approaches. For survival data, Fan et al.76 extended marginal correlation screening to screening on the marginal utility, defined as the maximum of the partial likelihood achieved by each single covariate for the censored outcome.
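For concreteness, the (unpenalized) Cox negative log partial likelihood described above can be evaluated directly; a lasso-penalized survival fit would minimize this quantity plus λ‖β‖₁. The sketch below (function name mine) assumes no tied failure times.

```python
import numpy as np

def neg_log_partial_lik(beta, X, time, event):
    """Cox negative log partial likelihood (no ties).

    X: n x p covariates; time: observed times Y; event: 1 = observed failure.
    For each observed failure i, the risk set contains every subject whose
    observed time is >= time[i].
    """
    eta = X @ beta
    ll = 0.0
    for i in np.flatnonzero(event):
        at_risk = time >= time[i]
        ll += eta[i] - np.log(np.exp(eta[at_risk]).sum())
    return -ll
```

At β = 0 each failure contributes log of its risk-set size, which gives a quick sanity check of the implementation.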
The principled Cox sure independence screening procedure (PSIS)77 screens on the standardized marginal coefficient β̂_j / ŝe(β̂_j), where ŝe(β̂_j) is the estimated standard error of the marginal partial likelihood estimate, and further incorporates a false discovery rate control78,79 procedure to determine the cutoff threshold automatically, with theoretical justification provided. In contrast, rather than relying on the proportional hazards model, the feature aberration at survival times (FAST) statistic was developed by Gorst-Rasmussen and Scheike80 as a measure of the aberration of each covariate relative to its at-risk average. Specifically, let N_i(t) = 1(T_i ∧ C_i ≤ t, T_i ≤ C_i) be the counting process for the observed failure of subject i up to time t, and let Y_i(t) = 1(T_i ∧ C_i ≥ t) be the at-risk indicator. Then, writing Z̄_j(t) = Σ_i Y_i(t) Z_ij / Σ_i Y_i(t) for the at-risk average of the jth covariate, the FAST statistic is defined as

$$\hat d_j = \frac{1}{n} \sum_{i=1}^{n} \int_0^{\tau} \left\{ Z_{ij} - \bar Z_j(t) \right\} \mathrm{d} N_i(t). \qquad (17)$$

Theoretical justification of the sure screening property of the FAST statistic within a class of single-index hazard rate models was provided.
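When the observed failure times are distinct, the FAST statistic reduces to a simple sum over observed failures; the sketch below (names mine) accumulates, for each covariate, the failing subject's deviation from the at-risk average at that moment.

```python
import numpy as np

def fast_stat(Z, time, event):
    """FAST-style aberration statistic for each covariate.

    Z: n x p covariates; time: observed times; event: 1 = observed failure.
    For every observed failure, add the failing subject's covariate values
    minus the at-risk averages at that time, then scale by 1/n.
    """
    n, p = Z.shape
    stat = np.zeros(p)
    for i in np.flatnonzero(event):
        at_risk = time >= time[i]
        stat += Z[i] - Z[at_risk].mean(axis=0)
    return stat / n

# Screening would keep the covariates with the largest |fast_stat| values.
```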

Ensemble learning

Ensemble learning methods such as random forests29 and boosting81 are well known for their outstanding prediction accuracy. Several methods have attempted to handle the missingness caused by censoring and have thereby generalized ensemble learning to survival data. Hothorn et al.82 first drew multiple bootstrap samples83 with replacement and constructed a survival tree for each bootstrap sample. Given a new observation, its survival function is estimated by the Kaplan–Meier curve84 of all data points, across all the trees, that belong to the same leaf node as the new observation. In Hothorn et al.,85 the authors first log-transformed Y; missingness was then accounted for by adding inverse probability of censoring (IPC) weights86 to the loss function for either random forests or gradient boosting.87 Recursively imputed survival trees (RIST),88 on the other hand, impute the censored data and run extremely randomized trees (ERT), a tree-based ensemble method with a higher degree of randomization than random forests, on the imputed complete data. RIST iterates between imputing censored observations from conditional survival distributions and refitting the conditional survival distributions by pooling all the trees built on the imputed data. Despite the wide use of random forests, theoretical analyses of their consistency and asymptotics89–91 are only just emerging; at present, the theoretical properties of tree-based ensemble methods remain a significant challenge. Moreover, generalization to scenarios where the covariates consist of multiple data types, as discussed in the section “Integration of multiple genomic data types”, is also of great interest.
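The Kaplan–Meier estimator used within each leaf of the bagged survival trees can be sketched in a few lines (function name mine): at each distinct observed failure time, the survival probability is multiplied by one minus the fraction of at-risk subjects who fail.

```python
import numpy as np

def kaplan_meier(time, event):
    """Kaplan-Meier survival curve.

    At each distinct observed failure time t, multiply in (1 - d_t / n_t),
    where d_t failures occur at t and n_t subjects are still at risk.
    Returns (failure_times, S(t) evaluated just after each failure time).
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    times, surv, s = [], [], 1.0
    for t in np.unique(time[event == 1]):
        n_t = (time >= t).sum()                     # at risk just before t
        d_t = ((time == t) & (event == 1)).sum()    # failures at t
        s *= 1.0 - d_t / n_t
        times.append(t)
        surv.append(s)
    return np.array(times), np.array(surv)
```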

Cross-Data-Type Prediction

An ultimate goal of genomics is to demystify the regulatory programs of different functional genomic profiles. How is DNA methylation related to gene expression? How does transcription factor binding control gene expression? What is the relationship between chromatin status and methylation status? The core problem underlying all these questions is whether we can predict one type of genomic profile from another, where both the response and predictor variables are multivariate with at least tens of thousands of variables. Such scenarios go beyond simple or multiple linear regression, penalized approaches such as the lasso, and sure independence screening for ultrahigh dimensions, in that the response variable itself is also an ultrahigh-dimensional vector rather than a scalar. The small sample size adds another dimension of challenge for inferring the relations between tens of thousands of responses and predictors. The thresholding singular value decomposition (T_SVD) regression92 is among the very first methods to study this problem. T_SVD in effect adopted a standard single-layer neural network model to link the high-dimensional predictors x with the high-dimensional responses y. Concretely, the regression model can be formulated as y = Σ_j d_j (u_jᵀ x) v_j + ε, where u_j represents the input weights for the jth hidden-intermediate node in the neural network while v_j represents the output weights for the jth node. Consequently, a sparse orthogonal decomposition algorithm preserving sparsity in u_j and v_j was developed to estimate them iteratively. It can be seen that cross-data-type prediction will open another new field for statistical methodology and theoretical research, given that both the predictors and the responses can not only be ultrahigh dimensional but also consist of multiple data types.
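The alternating, sparsity-preserving estimation idea can be sketched with a toy low-rank regression whose singular vectors are sparsified by hard thresholding. This is an illustrative stand-in, not the T_SVD algorithm itself; the function names and the choice of a threshold relative to each update's largest entry are assumptions.

```python
import numpy as np

def hard_threshold(x, tau):
    """Zero out entries smaller in magnitude than tau (induces sparsity)."""
    out = x.copy()
    out[np.abs(out) < tau] = 0.0
    return out

def tsvd_regression(X, Y, rank=1, tau=0.2, n_iter=100):
    """Illustrative thresholded-SVD regression: approximate the coefficient
    matrix C by a low-rank expansion sum_j d_j u_j v_j^T whose singular
    vectors are sparsified by hard thresholding during power iterations."""
    M = np.linalg.lstsq(X, Y, rcond=None)[0]   # least-squares C, then sparse SVD of it
    C = np.zeros_like(M)
    for _ in range(rank):
        v = np.linalg.svd(M, full_matrices=False)[2][0]  # init: top right singular vector
        for _ in range(n_iter):                          # thresholded power iterations
            u = hard_threshold(M @ v, tau * np.max(np.abs(M @ v)))
            u /= np.linalg.norm(u) + 1e-12
            v = hard_threshold(M.T @ u, tau * np.max(np.abs(M.T @ u)))
            v /= np.linalg.norm(v) + 1e-12
        d = u @ M @ v
        C += d * np.outer(u, v)
        M = M - d * np.outer(u, v)                       # deflate before next component
    return C

# Toy check: rank-1 coefficient matrix with 3 active predictors and 3 active responses
rng = np.random.default_rng(1)
n, p, q = 200, 20, 15
u_true = np.zeros(p); u_true[:3] = 1 / np.sqrt(3)
v_true = np.zeros(q); v_true[:3] = 1 / np.sqrt(3)
C_true = 2.0 * np.outer(u_true, v_true)
X = rng.standard_normal((n, p))
Y = X @ C_true + 0.1 * rng.standard_normal((n, q))
C_hat = tsvd_regression(X, Y)
```

Because each u_j and v_j is thresholded, the fitted coefficient matrix links only a small set of predictors to a small set of responses through each "hidden node", which is what makes such fits interpretable when both sides are ultrahigh dimensional.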

Conclusions

More and more effort has been devoted to developing statistical models and methods for integrative analyses of cancer data. Nevertheless, such research is still in its infancy, with many open problems. How can systematic biases such as batch effects be detected and corrected in each new type of high-throughput technology so that meta-analysis across studies can be conducted? How can cancer subtypes be classified according to multiple genomic profiles jointly, or determined by only a subset of genomic profiles? How can a single network be constructed from multidimensional genomic profiles? How can networks constructed from different types of data be modeled jointly? How can the survival time of cancer patients be predicted from multiple types of ultrahigh-dimensional genomic profiles? How can one ultrahigh-dimensional vector be predicted from another ultrahigh-dimensional vector, one perhaps continuous and the other discrete? All these questions are of vital clinical importance for identifying risk factors and drug targets and for cancer diagnosis, survival prediction, and therapy selection toward a personalized approach. Naturally, they create an urgent demand for valid statistical methods with outstanding practical performance as well as solid theoretical foundations. I anticipate a wealth of new computationally efficient, interpretable, and robust statistical methods for integrative cancer analyses in the near future, which will thereby significantly promote cancer research and therapeutic development.
References (56 in total)

1.  Metagenes and molecular pattern discovery using matrix factorization.

Authors:  Jean-Philippe Brunet; Pablo Tamayo; Todd R Golub; Jill P Mesirov
Journal:  Proc Natl Acad Sci U S A       Date:  2004-03-11       Impact factor: 11.205

2.  Normalization of RNA-seq data using factor analysis of control genes or samples.

Authors:  Davide Risso; John Ngai; Terence P Speed; Sandrine Dudoit
Journal:  Nat Biotechnol       Date:  2014-08-24       Impact factor: 54.908

3.  Integrative analysis and variable selection with multiple high-dimensional data sets.

Authors:  Shuangge Ma; Jian Huang; Xiao Song
Journal:  Biostatistics       Date:  2011-03-16       Impact factor: 5.899

4.  Robustly detecting differential expression in RNA sequencing data using observation weights.

Authors:  Xiaobei Zhou; Helen Lindsay; Mark D Robinson
Journal:  Nucleic Acids Res       Date:  2014-04-20       Impact factor: 16.971

5.  Learning regulatory programs by threshold SVD regression.

Authors:  Xin Ma; Luo Xiao; Wing Hung Wong
Journal:  Proc Natl Acad Sci U S A       Date:  2014-10-20       Impact factor: 11.205

6.  JOINT AND INDIVIDUAL VARIATION EXPLAINED (JIVE) FOR INTEGRATED ANALYSIS OF MULTIPLE DATA TYPES.

Authors:  Eric F Lock; Katherine A Hoadley; J S Marron; Andrew B Nobel
Journal:  Ann Appl Stat       Date:  2013-03-01       Impact factor: 2.083

7.  Regularization Paths for Generalized Linear Models via Coordinate Descent.

Authors:  Jerome Friedman; Trevor Hastie; Rob Tibshirani
Journal:  J Stat Softw       Date:  2010       Impact factor: 6.440

8.  International network of cancer genome projects.

Authors:  Thomas J Hudson; Warwick Anderson; Axel Artez; et al. (The International Cancer Genome Consortium)
Journal:  Nature       Date:  2010-04-15       Impact factor: 49.962

Review 9.  The cancer genome.

Authors:  Michael R Stratton; Peter J Campbell; P Andrew Futreal
Journal:  Nature       Date:  2009-04-09       Impact factor: 49.962

10.  Differential expression analysis for sequence count data.

Authors:  Simon Anders; Wolfgang Huber
Journal:  Genome Biol       Date:  2010-10-27       Impact factor: 13.583

