Irene Sui Lan Zeng, Thomas Lumley.
Abstract
Integrated omics is becoming a new channel for investigating the complex molecular system in modern biological science and sets a foundation for systematic learning for precision medicine. The statistical/machine learning methods that have emerged in the past decade for integrated omics are not only innovative but also multidisciplinary, integrating knowledge from biology, medicine, statistics, machine learning, and artificial intelligence. Here, we review the nontrivial classes of learning methods from the statistical aspects and streamline these learning methods within the statistical learning framework. The intriguing findings from the review are that the methods used are generalizable to other disciplines with complex systematic structure, and that integrated omics is part of an integrated information science which has collated and integrated different types of information for inferences and decision making. We review the statistical learning methods of exploratory and supervised learning from 42 publications. We also discuss the strengths and limitations of the extended principal component analysis, cluster analysis, network analysis, and regression methods. Statistical techniques such as penalization to induce sparsity when there are fewer observations than features, and the Bayesian approach when there is prior knowledge to be integrated, are also included in the commentary. For completeness, currently available software and packages for omics from 23 publications are summarized in a table in the appendix.
Keywords: Statistical learnings; exploratory learning; integrated omics; network learning; regression
Year: 2018 PMID: 29497285 PMCID: PMC5824897 DOI: 10.1177/1177932218759292
Source DB: PubMed Journal: Bioinform Biol Insights ISSN: 1177-9322
Figure 1. Multiblock principal component analysis (A, B, C). The multiblock principal component analysis starts from a random global score vector t (a randomly chosen starting scale for the principal component space). Each block of data X (a different omics measurement) is regressed on t, giving the principal loading P, which represents the importance (weight) of each omics measurement variable contributing to the latent structure components. The loading P is normalized to unit length, and a new block score is formed by multiplying it with the data block X. The new block score vectors are combined into the block score matrix T. T is regressed on the global score vector t, giving the weight vector w, which is normalized to length 1. The new global score vector t for the next iteration is then calculated by multiplying the new block score matrix T by the weight w.
Figure 2. AutoSOME (Automatic clustering using Self Organized Map) (A, B, C, D). The self-organizing map (SOM) is a stochastic clustering method that reduces the number of dimensions while preserving the local topology of gene expression.
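To make the SOM idea in the Figure 2 caption concrete, the following is a minimal sketch of a plain online SOM in standard-library Python. It is not the AutoSOME pipeline, which builds further steps on top of the map. The function names, the 3 × 3 grid, and the linearly decaying learning rate and neighbourhood width are illustrative assumptions.

```python
import math
import random


def _sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_matching_unit(weights, x):
    """Index of the map node whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda k: _sq_dist(weights[k], x))


def train_som(data, rows=3, cols=3, n_epochs=30, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small rectangular SOM (hypothetical helper, assumed schedule)."""
    rng = random.Random(seed)
    dim = len(data[0])
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Random initial weights: this is the stochastic part of the method.
    weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in grid]
    n_steps = n_epochs * len(data)
    step = 0
    for _ in range(n_epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for i in order:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                  # decaying learning rate
            sigma = max(sigma0 * (1.0 - frac), 0.3)  # shrinking neighbourhood
            bmu = best_matching_unit(weights, data[i])
            for k, node in enumerate(grid):
                # Nodes close to the winner on the grid move more, which is
                # what preserves the local topology of the input space.
                h = math.exp(-_sq_dist(node, grid[bmu]) / (2.0 * sigma ** 2))
                weights[k] = [w + lr * h * (x - w)
                              for w, x in zip(weights[k], data[i])]
            step += 1
    return grid, weights


def quantization_error(weights, data):
    """Mean distance from each sample to its best-matching node."""
    total = sum(math.sqrt(_sq_dist(weights[best_matching_unit(weights, x)], x))
                for x in data)
    return total / len(data)
```

Each sample (for example, one gene's expression profile) is assigned to its best-matching node, so the high-dimensional profiles are summarized on a two-dimensional grid in which neighbouring nodes hold similar profiles.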
Figure 3. A multilayer network (A, B). Multiplex fusion algorithm[19].
Computing and analytical software (packages) for omics data sets.
| Names | Language used | Analytical functions | Include visualization | Provide public databases | Omics techniques | Designed for integrative omics analysis | Involved statistical models | Designed for human study | Web-based open source |
|---|---|---|---|---|---|---|---|---|---|
| OmicKriging | R | It is designed for predicting complex traits (quantitative and qualitative) by leveraging and integrating similarity in genetic and large-scale omics | No | NA | miRNA, mRNA, T, SNP, and other large-scale omics | Yes (subject level) | Yes. It uses an algorithm to optimize the composited similarity matrix which integrates different omics correlation matrices | Yes | No |
| TranSMART + Galaxy + MINERVA: a combination pipeline | NA | TranSMART repository provides integration of low dimensional clinical data and high-dimensional molecular data sets, with built-in data mining and analysis applications | Yes | eTRIKS, | GE, T, P, M (not specified) | Yes (subject level) | Yes. Galaxy uses the R Bioconductor packages limma | Yes | Yes. Galaxy is a Web server and cloud bench |
| OmicsAnalyzer | JAVA | As a plug-in for Cytoscape, it has the functions of mapping different data sets, estimating associations, and visualizations | Yes | No | NA | Yes (molecular ID level) | Yes | Not specified | NA |
| VANTED | JAVA | VANTED is a framework providing essential functions for systems biology. It has 7 tasks including data integration, visualization and data analysis for correlation, clustering, differential analysis, and enrichment analysis. It also computes some topological features for the network | Yes | Connect to network database: MetaCrop, KEGG, RIMAS | Not specified | Yes (molecular ID level) | Yes | Yes | No |
| The DNA Microarray Inter-omics Analysis Platform | R Bioconductor packages and custom Java solutions | It provides data process function and focuses on the integration of these 2 types: | Yes | Murine nutrigenomics data set; Normal Human Dermal Fibroblasts (NHDF) | GE, miRNA, T, L, miRNA-mRNA interaction | Yes (subject level) | Yes | Yes | Yes |
| Lemon-Tree | JAVA | It is a modular network software. It provides a function (ganesh) for model-based Gibbs sampler to infer coexpression modules and condition clusters within each module | Yes | TCGA glioblastoma expression and copy number data | T, miRNA, GE, CNA, eQTL, any others | Yes (subject level) | Yes | Yes | No |
| integrOmics | R | It provides regularized canonical correlation analysis, sparse partial least squares regression | Yes | No | M, L, C | Yes (subject level) | Yes | Yes | No |
| Mayday SeaSight | JAVA with a built-in R terminal | Mayday has the commonly used methods for array analysis. It includes cluster, differentiation analysis, and machine learning methods. It also has a terminal connection with R which facilitates usage of R functions | Yes | KEGG, MetaCyc | GE, T | SeaSight provides the integrative function for GE and next-generation sequencing data (at the experiment level) | Yes | Yes | Yes |
| DASS-GUI | C++ | It provides 2 modes: | No | No | NA | No | Yes. It uses biclustering and other data mining methods | Yes | No |
| GeneTrail2 | Optimized C++ library based on Boost, Eigen 3, and GMP | It provides differential expression tests at the identifier level and set level. It also provides multiple testing corrections. Its gene set and phenotype strategies use an optimal permutation method to reduce computing time | No | No | T, M, P, GE, miRNA | No | Yes | Yes | Yes |
| OmixAnalyzer | Java, R, Perl | It includes differential analysis ( | Yes | No | GE, EX, P (on its way) | No | Yes | Yes | Yes |
| Specmine | R | Identification of metabolites, univariate (corr, regression, ANOVA), multivariate (robust PCA, cluster), machine learning, and feature selection (classification and regression, validation) | Yes | No | M, S | No | Yes | Yes | No |
| imDEV | R and Visual Basic | It provides functions to execute multivariate R functions from Excel. It includes MDS methods (Cluster, PCA, PLS) and 2- and 3-dimensional visualizations | Yes | No | M, C | No | Yes | Yes | No |
| XMRF | R | Fitting Markov networks to a wide range of high-throughput genomics data | Yes | No | GE | No | Yes | Not specified | No |
| PathVisioRPC | Allowed access from R, Perl, Python, Java, C, C++, PHP | A Remote Procedure Call for PathVisio, which provides a communication link between the interface (PathVisio) and the statistical analytical tools (scripts). PathVisioRPC wraps PathVisio functionality into XMLRPC functions which can be implemented in many languages for execution. | Yes, it is provided from PathVisio | No | GE, M, T | No | No. PathVisio provides pathway analysis and data visualization software. It provides | Yes | Yes |
| COBRApy | Python, MATLAB | It uses constraint-based modeling to represent the complex biological process of metabolism and gene expression in a pathway. Constraint-based modeling includes a biological system constraint which is defined by the objective function, and usually linear programming is used as the analytical method | Yes | No | GE, M | No | No. It applies linear programming (machine learning) | Yes | No |
| 3Omics | Perl and PHP scripts running on a Linux-based Apache Web server | Correlation networking, coexpression, phenotyping, pathway enrichment, and GO (Gene Ontology) enrichment | Yes | PubMed database, KEGG, HumanCyc, iHOP, DAVID, Entrez Gene, OMIM, and UniProt | T, P, M | Yes (molecular ID level) | No | Yes | Yes |
| PaintOmics | Perl & Python scripts running on an Apache Web server | A joint visualization tool for transcriptomics and metabolomics | Yes | KEGG | GE, M | Yes (subject level) | No | Yes | Yes |
| COEUS | Jena, Java | It is a data integration software, a new semantic Web framework | No | UniProt, OMIM | Not specified | No | No | Yes | Yes |
| Cytoscape | JAVA | A popular tool for biological network visualization and data integration | Yes | No | All data types for biological network | No | No | Yes | Yes |
| Plug-in for Pathway Tools |  | Providing an add-on function for the pathway tools, a plug-in API for its GUI: | No | Pathway/Genome Databases (PGDBs) | GE | No | No | Yes | No |
| MGV (Mayday Graph Viewer) | JAVA, it is an extension of the platform Mayday | It provides visualizations for cluster comparison between studies, cross data sets biological pathway, gene models, and probe centric view | Yes, mainly for visualization | No | T, M, P, GE | No | No | Yes | No |
| Omix | OVL script | A customized visualization tool for metabolic network | Yes | KEGG | T, M, F | No | No | No | No |
| MVBioDataSim | R | It is a multiview genomic data simulator | No | No | GE | No | No | No | No |
| ATHENA | Implemented in C++ and uses the libGE (version 0.206) and GAlib (version 2.4.7) genetic algorithm library | Grammatical evolution neural network is used to analyze associations between single, multiple level genetic interactions and clinical outcomes. It includes (1) variable/feature selection, (2) modeling main and interaction effects predicting clinical outcomes, and (3) interpretation prepared for further bioinformatics | No | TCGA data portal (ovarian cancer) | CNA, GM, miRNA, GE, C | Yes (subject level) | Machine learning method: extension of artificial neural network | Yes | No |
Abbreviations: C, clinical data/outcomes; CNA, copy number alteration; EX, exon arrays; GE, gene expression (microarray); GM, gene methylation; L, lipidomics; M, metabolomics; NA, not available; P, proteomics; S, spectral data; T, transcriptomics.
Integration occurs at the molecular level: the input data are IDs of genes, proteins, and metabolites and are merged by these IDs; results are derived using public databases (ie, pathway enrichment analysis via information from KEGG). Integration occurs at the subject level: the input data are the original expression or sequence variables from the same subject, and the data are merged by subject ID.
Software package that has functions to integrate clinical data and omics data and provides advanced statistical techniques for integrated data analysis.
Figure 4. N-partial least squares (N-PLS) constructs a data array with responses (Y) from different omics platforms[7]. The predictor data blocks (X) are curated from another type of omics platform in the multifactorial (N) spaces.
Similarities and differences across different platforms of omics.
|  | Transcriptomics | Transcriptomics | Proteomics | Metabolomics | Common single-nucleotide polymorphism genotypes | MicroRNA expression | DNA methylation |
|---|---|---|---|---|---|---|---|
| Technology | RNA sequencing | Microarray | Mass spec. | Mass spec. | Microarray | Microarray | Microarray |
| Statistical distributions used | Log-normal distributed/Poisson distributed | Normal or log-normal distributed | Log-normal–distributed peptide intensity to form hierarchical protein abundance | Log-normal–distributed peptide intensity to form hierarchical protein abundance | Binomial distributed | Log-normal distributed/Poisson distributed | Binomial distributed |