Abstract
BACKGROUND: Diseases are complex phenotypes often arising as an emergent property of a non-linear network of genetic and epigenetic interactions. To translate this resulting state into a causal relationship with a subset of regulatory features, many experiments deploy an array of laboratory assays from multiple modalities. Often, each of these resulting datasets is large, heterogeneous, and noisy. Thus, it is non-trivial to unify these complex datasets into an interpretable phenotype. Although recent methods address this problem with varying degrees of success, they are constrained by their scopes or limitations. Therefore, an important gap in the field is the lack of a universal data harmonizer with the capability to arbitrarily integrate multi-modal datasets.
Keywords: bioinformatics; computational biology; data integration; deep learning; epigenetics; epigenomics; gene regulation; genomics; high-throughput sequencing; machine learning
Year: 2020 PMID: 32543653 PMCID: PMC7297091 DOI: 10.1093/gigascience/giaa064
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Figure 1: Genomic features affecting gene regulation are shown, along with the corresponding assays used to infer the state of the regulatory feature. We note that this is not an exhaustive list of assays available to profile regulatory features.
Figure 2: Information returned by types of functional assays targeting RNA-DNA interactions, RNA-protein interactions (including histones), quantifying RNA abundance, DNA-DNA interactions, DNA-protein interactions (including histones), and direct biochemical modifications to DNA. By probing the association of DNA with these regulatory factors, we can observe the activity fingerprint of a genome. Combined with functional outcome information such as gene expression, we can infer the flow of signals that result in a phenotype. (Methyl-Seq normally covers the entire genome, but for simplicity we show coverage at a single methylated site in the hypothetical genome. We note that the performance of some of these assays may vary depending on the experimental design and region targeted [1].)
Figure 3: Our proposed functional taxonomy of data integration methods. (A) Inter-modality and intra-modality harmonization methods exist. Data aggregation tools are separate from these methods and are not data integrators in the context of our review. With inter-modality restricted methods, custom strategies are common. (B) For inter-modality generic methods, 5 approaches are common. Mutual nearest neighbours exploits common points between single-cell datasets as references, matrix factorization operates on abundance measures to categorize data and is agnostic to data type, multivariate models attempt to account for dependent and independent variable contribution to the output, latent variable models attempt to model an unobserved factor’s contribution to the output, and deep learning optimizes a series of regressions to yield a categorical variable or generate an output. Intra-modality harmonization methods share these strategies but apply them specifically to reduce unwanted technical variation. We note that current methods use data in a processed state; raw or intermediate data (shown by the dotted lines in panel A) could be used for better harmonization.
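As an illustration of the mutual nearest neighbours idea named in panel B, the sketch below pairs points from two datasets when each is among the other's closest matches. It is a minimal example under stated assumptions, not the implementation used by any tool in this review: the function name `mutual_nearest_neighbours`, the Euclidean distance, and the parameter `k` are all hypothetical choices made for illustration.

```python
import numpy as np

def mutual_nearest_neighbours(a, b, k=1):
    """Return (i, j) pairs where a[i] and b[j] are in each other's k nearest neighbours.

    a, b: 2-D arrays (points x features) from two datasets sharing a feature space.
    This is an illustrative sketch; real tools add scaling and correction steps.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Pairwise Euclidean distances: dists[i, j] = ||a[i] - b[j]||.
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    # For each point in a, the indices of its k closest points in b.
    nn_a_to_b = np.argsort(dists, axis=1)[:, :k]
    # For each point in b, the indices of its k closest points in a.
    nn_b_to_a = np.argsort(dists, axis=0)[:k, :].T
    pairs = []
    for i in range(a.shape[0]):
        for j in nn_a_to_b[i]:
            # Keep the pair only if the relationship holds in both directions.
            if i in nn_b_to_a[j]:
                pairs.append((i, int(j)))
    return pairs
```

Pairs found this way can serve as the shared reference points between datasets; a point whose nearest neighbour does not reciprocate is left unpaired.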
Inter-modality data harmonization approaches with a restricted modality scope
| Method name | Strategy | Main advantages | Main limitations | Citation |
|---|---|---|---|---|
| MDI | Bayesian Consensus Clustering | Identifies gene clusters across datasets with specific shared characteristics. Can model time-series data | Limited to querying a small subset of genes. Trained only on array data | [ |
| RIMBANET | Bayesian MCMC | Integrates many data types simultaneously | Requires large quantities of multimodal data. Method was specifically designed for experiment | [ |
| EPIP | Ensemble boosting | Effective in unbalanced datasets | Limitations of training data reduce model effectiveness in small datasets | [ |
| EAGLE | Ensemble boosting | Uses higher-level features to buffer against overfitting | Custom genome-specific features need to be calculated for classification | [ |
| PreSTIGE | Information theory | Outputs different specificity thresholds | Biased to cell type | [ |
| TEPIC | Machine learning | Feature space improves result interpretability | Limited performance in gene-dense regions or with small sample sizes | [ |
| iOmicsPASS | Network analysis | Produces a sparse set of easily interpretable biological interactions. Effective in heterogeneous datasets | Important markers that are poorly represented in biological networks can be lost in the analysis | [ |
| LemonTree | Network analysis; Gibbs sampler; decision tree | Modular model parts for different cases | Trained on cancer data | [ |
| PANDA | Network analysis; message passing | Accounts for lack of direct regulatory element interaction | Choice of convergence parameter affects results. Results may be difficult to interpret | [ |
| PARADIGM | Network analysis; Probabilistic Graph Model | Robust to false-positive results | Training was performed on microarray data. Effectiveness in sequencing data unknown. Trained on cancer data | [ |
| IM-PET | Random forest classifier | Expected to generalize to other species | Requires assembly of 4 manually derived scores | [ |
| JEME | Random forest classifier; regression | Easily retrainable on different systems if sufficient data are available | At least 4 input data types are required | [ |
| RIPPLE | Random forest classifier; regression | Generalizable to other biological conditions and cell types | Assumes balanced data categories | [ |
| SVM-MAP | Support Vector Machine | Expected to generalize to multiple cancer types | Limited enhancer coverage in training data | [ |
| ELMER | Wilcoxon rank-sum test | Identifies upstream master regulators | Restricted to methylation arrays in cancer | [ |
| TENET | Wilcoxon rank-sum test | Expected to generalize to other biological systems | Targets group expression differences only | [ |
| RegNetDriver | Wilcoxon rank-sum test | Provides a framework to construct tissue-specific regulatory networks | Requires assembly of multiple manually derived scores from system-specific steps | [ |
Names, strategies, advantages, and limitations of each method are provided. Regarding advantages and limitations, a few major points are highlighted, and it is important to note that many of these methods are highly nuanced. A citation for reference to the original manuscript of each method is provided where full details can be obtained.
Type and number of data modalities supported by each inter-modality data harmonization approach (restricted modality scope).
| Method name | No. of modalities compatible | 3D chromosome structure | DNA methylation | Epigenetic peak data | DNA-Protein binding | DNA-RNA interactions | RNA-Protein interactions | Protein-Protein interactions | Genomics | Transcriptomics | Citation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MDI | 3 | X | X | O | X | X | X | O | X | O | [ |
| RIMBANET | 4 | X | X | X | O | X | X | O | O | O | [ |
| EPIP | 4 | O | X | O | O | X | X | X | X | O | [ |
| EAGLE | 2 | X | X | X | O | X | X | X | X | O | [ |
| PreSTIGE | 2 | X | X | O | X | X | X | X | X | O | [ |
| TEPIC | 3 | O | X | O | X | X | X | X | X | O | [ |
| iOmicsPASS | 2 | X | X | X | X | X | X | X | O | O | [ |
| LemonTree | 2 | X | X | X | X | X | X | X | O | O | [ |
| PANDA | 3 | X | X | X | O | X | X | O | X | O | [ |
| PARADIGM | 2 | X | X | X | X | X | X | X | O | O | [ |
| IM-PET | 2 | X | X | O | X | X | X | X | X | O | [ |
| JEME | 2 | X | X | O | X | X | X | X | X | O | [ |
| RIPPLE | 3 | X | X | O | O | X | X | X | X | O | [ |
| SVM-MAP | 2 | X | O | X | X | X | X | X | X | O | [ |
| ELMER | 2 | X | O | X | X | X | X | X | X | O | [ |
| TENET | 2 | X | O | X | X | X | X | X | X | O | [ |
| RegNetDriver | 5 | X | O | O | O | X | X | X | O | O | [ |
"DNA methylation" in this context refers specifically to the ratio of signal between methylated and unmethylated alleles. For simplicity, some modalities have been aggregated, e.g., transcriptomics data include both gene expression and small RNA data. Some methods are capable of handling proteomics, metabolomics, or medical images, but these are excluded because they are not a focus of this review. A link to each method is provided for easy reference.
Inter-modality data harmonization approaches with a free modality scope
| Method name | Strategy | Main advantages | Main limitations | Citation |
|---|---|---|---|---|
| DeepMF | Deep learning and non-negative matrix factorization | Robust to noise and missing data | Manual parameter tuning and prior information may be required | [ |
| JIVE | Dimensionality reduction | Identifies the global modes of variation that drive associations across and within data types | Not robust to outliers, missing values, or class imbalance | [ |
| GCCA | Generalized canonical correlation analysis | Identifies blocks of variables within datasets for correlation across datasets | Less effective if the number of observations is smaller than the number of variables or if multiple linear correlations are present between datasets. Biases towards strong variation in the data | [ |
| NetICS | Graph diffusion | Robust to frequency of aberrant genes in sample | Can only examine effects of known genes present in a defined interaction network | [ |
| DIABLO | Multivariate model and latent variable model | Captures quantitative information. Visual outputs aid interpretation | Assumes a linear relationship between the selected omics features. Parameter tuning is required | [ |
| iCluster | Latent variable model | Captures both concordant and unique alterations across data types | Sensitive to initial subset selection. Trained only on array data | [ |
| GFA | Latent variable model | Accepts data with missing values | Manual parameter tuning. Prior information may be required | [ |
| MOFA | Latent variable model and probabilistic Bayesian | Leverages multiomics to impute missing values. Single-cell version available | Assumes a linear relationship between the selected omics features. Manual parameter tuning required | [ |
| seurat | Mutual nearest neighbours | Effective in intra-modality as well as inter-modality integration. Robust to parameter changes | Restricted to single cell. Requires robust reference data | [ |
| SNF | Network analysis | Effective in small heterogeneous samples. Captures quantitative information | Does not yield quantitative data. Trained only on array data | [ |
| NMF | Non-negative matrix factorization | Accounts for complex modular structures in multimodal data | Trained only on array data | [ |
| iNMF | Non-negative matrix factorization | Stable even in heterogeneous conditions | Trained only on array data | [ |
| LIGER | Non-negative matrix factorization | Effective in intra-modality as well as inter-modality integration; effective in highly divergent datasets | Restricted to single cell | [ |
| sMBPLS | Sparse multi-block partial least-squares regression | Derives weights for modalities indicating contributions to expression | Performance is reduced with lower data dimensions | [ |
Note that seurat and LIGER are specific to single-cell data and the others are intended for bulk data. Names, strategies, advantages, and limitations of each method are provided. Regarding advantages and limitations, a few major points are highlighted. A citation for reference to the original publication of each method is provided where full details can be obtained.
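To make the non-negative matrix factorization strategy listed in the table more concrete, the following sketch factorizes a non-negative abundance matrix into two non-negative factors using the well-known multiplicative update rules. It is a plain, single-matrix illustration and does not reproduce NMF, iNMF, LIGER, or DeepMF as published; the function name `nmf`, the iteration count, and the random initialization are assumptions made for this example.

```python
import numpy as np

def nmf(x, rank, n_iter=200, seed=0, eps=1e-9):
    """Approximate a non-negative matrix x (features x samples) as w @ h.

    w (features x rank) holds feature loadings; h (rank x samples) holds
    per-sample factor weights. Uses multiplicative updates that minimise
    the squared reconstruction error while keeping both factors non-negative.
    """
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    n, m = x.shape
    # Random positive starting points (hypothetical initialization choice).
    w = rng.random((n, rank)) + eps
    h = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Each update rescales entries by a non-negative ratio, so signs never flip.
        h *= (w.T @ x) / (w.T @ w @ h + eps)
        w *= (x @ h.T) / (w @ h @ h.T + eps)
    return w, h
```

In this sketch, samples can be grouped by the factor with the largest weight in each column of `h`, which is one simple way a factorization can "categorize" data regardless of the data type behind the abundance values.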
Type and number of data modalities tested by each inter-modality data harmonization approach (free modality scope)
| Method name | No. of modalities trained | 3D chromosome structure | DNA methylation | Epigenetic peak data | DNA-Protein binding | DNA-RNA interactions | RNA-Protein interactions | Protein-Protein interactions | Genomics | Transcriptomics | Citation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepMF | 1 | X | X | X | X | X | X | X | X | O | [ |
| JIVE | 1 | X | X | X | X | X | X | X | X | O | [ |
| GCCA | 2 | X | X | X | X | X | X | X | O | O | [ |
| NetICS | 3 | X | O | X | X | X | X | X | O | O | [ |
| DIABLO | 2 | X | O | X | X | X | X | X | X | O | [ |
| iCluster | 3 | X | O | X | X | X | X | X | O | O | [ |
| GFA | 2 | X | O | X | X | X | X | X | X | O | [ |
| MOFA | 2 | X | O | X | X | X | X | X | X | O | [ |
| seurat* | 2 | X | X | O | X | X | X | X | X | O | [ |
| SNF | 2 | X | O | X | X | X | X | X | X | O | [ |
| NMF | 2 | X | O | X | X | X | X | X | X | O | [ |
| iNMF | 2 | X | O | X | X | X | X | X | X | O | [ |
| LIGER* | 2 | X | O | X | X | X | X | X | X | O | [ |
| sMBPLS | 3 | X | O | X | X | X | X | X | O | O | [ |
Note that GCCA [72], seurat [69], and LIGER [67] are specific to single-cell data and the others are intended for bulk data. “DNA methylation” in this context refers specifically to the ratio of signal between methylated and unmethylated alleles. In contrast to Table 4, the quantity of modalities represents the quantity of modalities on which the algorithm was tested and does not reflect the modalities with which the algorithm is compatible. For simplicity, some modalities have been aggregated, e.g., transcriptomics data include both gene expression and small RNA data, which gives the illusion that DeepMF [21] and JIVE [71] were trained on unimodal data. Some methods are capable of handling proteomics, metabolomics, or medical images, but these are excluded because they are not a focus of this review. A link to each method is provided for easy reference.
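The note above defines DNA methylation data as a ratio of methylated to unmethylated signal. One common way to express such a ratio is the methylated fraction of the total signal at each site, sketched below. The source does not state which formula the reviewed methods use, so the function name `methylation_beta`, the fraction form, and the optional `offset` stabilizer are assumptions for illustration only.

```python
import numpy as np

def methylation_beta(methylated, unmethylated, offset=0.0):
    """Fraction of signal that is methylated at each site, in [0, 1].

    methylated, unmethylated: per-site signal intensities or read counts.
    offset: optional constant added to the denominator to stabilise
            low-signal sites (an assumed, tunable parameter).
    """
    m = np.asarray(methylated, dtype=float)
    u = np.asarray(unmethylated, dtype=float)
    total = m + u + offset
    # Sites with no signal at all yield NaN instead of a division error.
    return np.divide(m, total, out=np.full_like(total, np.nan), where=total > 0)
```

For example, a site with 30 methylated and 10 unmethylated reads gives 0.75, while a site with no reads is reported as missing.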
Intra-modality data harmonization approaches
| Method name | Strategy | Main advantages | Main limitations | Citation |
|---|---|---|---|---|
| ComBat | Empirical Bayes | Removes batch effect in most cases | Removes biological signal in most cases | [ |
| RUV | Linear model | Effective with spike-in controls | Individual variants make specific assumptions about the data | [ |
| removeBatchEffect | Linear model | Generalizable to most transcriptomic data types | May be less effective in complex experimental designs | [ |
| SVA | Linear model | Generalizable to many cases | Assumes that feature similarities between datasets are due to biology | [ |
| mnnCorrect | Mutual nearest neighbours | Accounts for heterogeneity within sample groups | Restricted to single-cell data | [ |
| MINT | Multivariate model | Robust to overfitting and strong multidimensional technical variation | Minimum sample count requirement | [ |
| Scanorama | Mutual nearest neighbours | Scales to very large sample sizes. Robust to overcorrection | Restricted to single-cell data | [ |
| MultiCluster | Tensor decomposition | Accounts for multiple batch variables simultaneously | Restricted to 3-way variable comparisons | [ |
| zeroSum | Zero sum regression | Generalizable across different technologies and platforms | Weak or non-linear features may be masked by strong features | [ |
Batch correction is a special case of intra-modality harmonization and is included for completeness because many of the underlying strategies are applicable to broader data integration. All methods are restricted to a single data modality, transcriptomics. Names, strategies, advantages, and limitations of each method are provided. Regarding advantages and limitations, a few major points are highlighted. A citation for reference to the original publication of each method is provided where full details can be obtained.
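The simplest linear-model view of batch correction can be sketched as follows: treat batch as the only covariate, estimate each gene's per-batch mean, and replace it with the gene's overall mean. This is a deliberately reduced illustration and not the implementation of removeBatchEffect, ComBat, or any other tool in the table; the function name `center_batches` and the genes-by-samples layout are assumptions.

```python
import numpy as np

def center_batches(expr, batches):
    """Remove per-batch mean shifts from an expression matrix.

    expr: 2-D array, genes x samples.
    batches: 1-D sequence of batch labels, one per sample.
    Returns a corrected copy in which every batch shares the gene's overall mean.
    """
    expr = np.asarray(expr, dtype=float)
    batches = np.asarray(batches)
    corrected = expr.copy()
    # Overall mean of each gene across all samples.
    grand_mean = expr.mean(axis=1, keepdims=True)
    for label in np.unique(batches):
        cols = batches == label
        # Mean of each gene within this batch only.
        batch_mean = expr[:, cols].mean(axis=1, keepdims=True)
        # Subtract the batch-specific shift, then restore the overall level.
        corrected[:, cols] = expr[:, cols] - batch_mean + grand_mean
    return corrected
```

Because this sketch attributes every between-batch difference to technical variation, it would also erase genuine biological differences whenever batch and biological group coincide, which mirrors the signal-removal limitation noted in the table.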