Abhishek Subramanian, Pooya Zakeri, Mira Mousa, Halima Alnaqbi, Fatima Yousif Alshamsi, Leo Bettoni, Ernesto Damiani, Habiba Alsafar, Yvan Saeys, Peter Carmeliet.
Abstract
Multi-omics technologies are increasingly used in angiogenesis research. Yet computational methods have not been widely used for angiogenic target discovery and prioritization in this field, partly because (wet-lab) vascular biologists are insufficiently familiar with computational biology tools and the opportunities they may offer. With this review, written for vascular biologists who lack expertise in computational methods, we aspire to break down the boundaries between the two fields and to illustrate the potential of these tools for future angiogenic target discovery. We provide a comprehensive survey of currently available computational approaches that may be useful in prioritizing candidate genes, predicting associated mechanisms, and identifying their specificity to endothelial cell subtypes. We specifically highlight tools that use flexible machine learning frameworks for large-scale data integration and gene prioritization. For each purpose-oriented category of tools, we describe the underlying conceptual principles, highlight interesting applications, and discuss limitations. Finally, we discuss challenges and recommend guidelines that can help optimize the process of accurate target discovery.
Keywords: Angiogenesis; Biological networks; Functional enrichment; Gene prioritization; Single-cell multi-omics; Unsupervised and supervised data fusion
Year: 2022 PMID: 36187917 PMCID: PMC9508490 DOI: 10.1016/j.csbj.2022.09.019
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155
Fig. 1. UpSet plot showing the classification of studies characterizing single-cell EC heterogeneity with respect to the applied computational techniques. A total of 87 studies, detailed in Supplementary Table 1, characterize single-cell EC heterogeneity; the plot shows how these studies are distributed across task-specific computational techniques. Differential expression analysis of biomolecular abundances between conditions, coupled with functional enrichment techniques, is most commonly used to discover novel biological knowledge in single-cell ECs (82 studies). This is followed by biological network inference techniques, which identify novel biomolecular interactions from changes in gene expression (18 studies). Within biological network inference approaches, most studies aim to predict cell–cell communication through ligand-receptor interactions, followed by inference of gene-regulatory networks. Only one study focused on predicting varying pathway activity using genome-scale metabolic networks. Moreover, biological network inference is used only as a complement to functional enrichment techniques (see the overlap between biological network-based studies and functional enrichment). Among integration-based approaches, most studies fuse single-cell transcriptomes from multiple datasets laterally rather than fusing multiple omics data types vertically. Automated gene prioritization for the identification of AAT targets is the least explored (only 3 studies have attempted prioritization of genes). The bar plot in the bottom left compares the number of studies that use a particular technique. The bar plots at the top indicate the number of studies that have used a combination of different tools for analysis. The filled dots and lines in the matrix represent the combinations of the tools listed in the rows.
Computational tools for knowledge discovery and target prioritization.
| Description | Tools |
| --- | --- |
| Identifies enriched gene sets based on the strength of overlap between a user-defined gene list and reference gene sets | g:Profiler; Panther; Enrichr |
| Enriches gene sets based on the degree / significance of relative gene expression changes | clusterProfiler; GenePattern; GSEA tool; BIOMEX |
| Estimates variation of gene sets across samples by generating a gene sets vs samples scoring matrix | GSVA package; BIOMEX |
| | |
| Uses differentially expressed ligands and receptors to identify interactions between clusters of cells | CellTalker; iTALK; PyMINEr |
| Statistical scoring of each ligand-receptor pair based on permutation test-based filtering, non-parametric tests with a null model, or defined empirical rules | CellChat; CellPhoneDB; Giotto; ICELLNET; SingleCellSignalR |
| Uses networks of interactions between ligands, receptors, and downstream targets to prioritize ligand-receptor interactions | CCCExplorer; NicheNet |
| Helps to generate a hypergraph (a network representing many-to-many relationships) of ligands and receptors from co-expression data | scTensor |
| | |
| Prediction of activation / inhibition relationships based on co-expression of transcription factors and their targets (or transcription factor binding to target promoters) across conditions or time-dependent changes | GENIE3; SCENIC; AR1MA1; SCODE |
| | |
| Mathematical model of whole-cell metabolism that can be tailored to predict condition-specific metabolic fluxes using uptake and ‘omics’ abundance constraints | COBRA toolbox; COBRApy |
| A method to estimate the pseudo steady-state metabolic fluxes in a genome-scale metabolic reconstruction that are required to optimize the synthesis of specific metabolites | |
| Modification of the optimization solver to account for cell–cell metabolic variation | scFEA |
| | |
| Captures cell–cell correspondence by identifying shared feature associations between paired or unpaired modalities | Seurat V3; BindSC |
| Captures cell–cell correspondence by identifying conserved cluster structures between paired or unpaired modalities | Seurat V4; CiteFuse |
| Uses a Bayesian modeling framework to scale and map different modalities | BREM-SC; Clonealign |
| Uses autoencoders to identify non-linear relationships between features and modalities to make interpretations | TotalVI |
| | |
| An early integration technique, where the fusion of several data sources takes place at the raw data level | |
| An intermediate integration technique, where different data sources are fused while learning | |
| A late integration technique, where each data source is modeled separately and the results are integrated at the decision level through decision aggregation | ScanCluster |
| Reduces data dimensionality while remaining fully aware of the class labels, and can be used for classification purposes | MixOmics; MINT |
| | |
| One-class classification (OCC) aims at identifying data elements of a given class among all objects by learning mostly from a training set that contains only objects of that class | |
| Similar to one-class classification, PU learning focuses on one class. However, in PU learning, two sets of examples are assumed to be available for training: a positive set P and an unlabeled set, which is expected to contain both positive and negative examples. A binary classifier is trained in a semi-supervised manner from positive and unlabeled sample points only | GuiltyTargets; n2a-SVM; Node2vec; DeepPVP |
| Detecting disease-associated genes through ML technologies | exTasy; Endeavour; Genehound |
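The permutation-based scoring of ligand-receptor pairs listed in the table can be illustrated with a minimal sketch. It is not the procedure of any specific tool: the score (mean ligand expression in sender cells multiplied by mean receptor expression in receiver cells), the function names, the cluster names `A` and `B`, and the expression values are all assumptions chosen for illustration. Cluster labels are shuffled to build a null distribution for the score.

```python
import random


def mean(values):
    return sum(values) / len(values)


def lr_score(ligand, receptor, labels, sender, receiver):
    """Mean ligand expression in sender cells times mean receptor expression in receiver cells."""
    lig = [v for v, lab in zip(ligand, labels) if lab == sender]
    rec = [v for v, lab in zip(receptor, labels) if lab == receiver]
    return mean(lig) * mean(rec)


def lr_permutation_test(ligand, receptor, labels, sender, receiver,
                        n_perm=1000, seed=0):
    """Empirical p-value of the observed score under shuffled cluster labels."""
    rng = random.Random(seed)
    observed = lr_score(ligand, receptor, labels, sender, receiver)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # keeps cluster sizes, breaks the cell-to-cluster link
        if lr_score(ligand, receptor, shuffled, sender, receiver) >= observed:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): 5 cells in cluster A, 5 cells in cluster B.
labels = ["A"] * 5 + ["B"] * 5
ligand_expr = [5, 6, 7, 6, 6, 0, 1, 0, 0, 1]    # ligand is high in cluster A
receptor_expr = [0, 1, 0, 1, 0, 4, 5, 6, 5, 5]  # receptor is high in cluster B

score, p_value = lr_permutation_test(ligand_expr, receptor_expr, labels, "A", "B")
print(score, p_value)
```

A small p-value suggests that the A-to-B pairing of this ligand and receptor is stronger than expected if cluster membership were random. Published tools differ in how they score pairs and filter results, so this is only a conceptual illustration.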
Web-based applications for knowledge discovery and target prioritization.
| Web application |
| --- |
| g:Profiler |
| WebGestalt 2019 |
| Panther Gene List Analysis |
| Enrichr |
| WebGestalt 2019 |
| EndoDB |
| EnrichNet |
| ShinyGO |
| GeneTrail |
| TissueEnrich |
| WhichGenes |
| ClusterGrammer |
| PAGER Web APP |
| TALKLR |
| InterCellar |
| scConnect |
| CellPhoneDB |
| CellLinker |
| FlyPhoneDB |
| DIANE |
| COXPRESdb |
| GeneFriends |
| COEXPEDIA |
| SEEK |
| GeNeCK |
| Virtual Metabolic Human |
| Metabolic Atlas |
| BiGG Models |
| Fluxer |
| Escher-FBA |
| MiBiOmics |
| OmicsNet |
| ToppGene |
| PhenoPred |
| Endeavour |
| pBRIT |
| PhenoApt |
| PolySearch2 |
| PINTA |
| GeneMANIA |
| WebPropagate |
Fig. 4. Techniques for ML-based supervised fusion of attributes from various data sources. To explain multiple ML techniques with a common example, we use a representative case in which the aim is to classify genes as pro-angiogenic (+ class) or anti-angiogenic (− class) based on different attributes measured from multiple data sources. (A) Raw fusion: A supervised fusion method that first concatenates attributes from data modalities 1 and 2 (blue and orange colors) and subsequently uses the concatenated dataset for machine learning and classification. (B) Transitional fusion: Here, a structure or pattern is generated for each of modalities 1 and 2 separately, but they are integrated while learning. The integrated structure is used for classification. (C) Decision fusion: Unlike transitional fusion, the data structures are generated independently for independent learning, and only the prediction outcomes (+ or − class) are fused by majority voting. (D) Supervised deep learning for omics data integration: Deep neural networks (Box 1) are generated for each modality separately. Attributes for each modality are reconstructed and compared with the input to evaluate learning performance. The reconstructed features from each omics modality are finally concatenated, providing information on cluster labels. (E) Partial least squares-discriminant analysis (PLS-DA): PLS-DA integrates the different attributes from two modalities (blue and orange colors) into PC1 and PC2 and learns the cluster information during integration; hence, it is an example of intermediate integration. Each PLS-DA component (PC1, PC2) represents a linear combination of correlated attributes from each data source. (F) One-class support vector machine (one-class SVM): Unlike a binary SVM (Box 1), a one-class SVM classifies data points into high-density regions (large number of points, orange) or low-density regions (small number of points, blue). The support vectors are then chosen from the high-density region, depending on their distance from its center, to form a hyperplane that is farther from the origin. Based on the labelled information from the + (pro-angiogenic) class, it can predict genes that belong to the − (anti-angiogenic) class. (G) Gene prioritization by Genehound: Genehound employs a gene prioritization strategy that transforms a gene-by-phenotype matrix into a completely filled gene-by-phenotype matrix, using matrix factorization to decompose the gene (green box) and phenotype (cyan box) information into latent factors (Box 1). This completely filled matrix is used to rank and prioritize genes for each phenotype. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
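Decision fusion (panel C) can be sketched in a few lines. The example below is a toy illustration under stated assumptions: a nearest-centroid classifier stands in for whatever learner is trained per data source, three hypothetical sources are used instead of two so that a majority vote cannot tie, and all attribute values and names (`fit_nearest_centroid`, `decision_fusion`, the labels `pro` and `anti`) are invented.

```python
from collections import Counter


def centroid(rows):
    """Column-wise mean of a list of attribute vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_nearest_centroid(features, labels):
    """Learn one centroid per class from a single data source."""
    by_class = {}
    for row, label in zip(features, labels):
        by_class.setdefault(label, []).append(row)
    return {label: centroid(rows) for label, rows in by_class.items()}


def predict(model, row):
    """Assign the class whose centroid is closest to the attribute vector."""
    return min(model, key=lambda label: sq_dist(model[label], row))


def decision_fusion(models, rows_per_source):
    """Fuse only the per-source predictions, by majority vote."""
    votes = [predict(m, r) for m, r in zip(models, rows_per_source)]
    return Counter(votes).most_common(1)[0][0], votes


# Toy training data (hypothetical): four genes described by three data sources.
train_labels = ["pro", "pro", "anti", "anti"]
source_1 = [[5, 4], [6, 5], [1, 0], [0, 1]]
source_2 = [[0.9], [0.8], [0.1], [0.2]]
source_3 = [[3, 3, 3], [4, 4, 4], [0, 0, 0], [1, 1, 1]]

# Each source is modeled independently; nothing is shared during learning.
models = [fit_nearest_centroid(s, train_labels) for s in (source_1, source_2, source_3)]

# Attributes of one unseen gene, one vector per data source.
new_gene = [[5, 5], [0.2], [3, 4, 3]]
fused_label, votes = decision_fusion(models, new_gene)
print(fused_label, votes)
```

In raw fusion, by contrast, the three attribute vectors would be concatenated into one vector before a single model is trained. Here the second source disagrees with the other two, and the vote resolves the conflict at the decision level.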
Fig. 3. Techniques for unsupervised fusion of single-cell multi-omics modalities. In all figure panels, Modality 1 (red) and Modality 2 (blue) represent two omics modalities. Heatmaps represent variation in features across cells. Paired modality integrations are illustrated in green, whereas unpaired modality integrations are represented by a mixture of blue and orange. Colored dots and triangles represent different types of cells. (A) Cell-cell correlation: Cells from modalities 1 and 2 are integrated by measuring the correlation between the features from the two omics modalities. (B) Non-negative matrix factorization (NMF): NMF methods map features from two paired modalities and cell-level batch effects to latent factors (Box 1). In the figure, the number of latent factors is smaller than the original number of genes, which signifies dimensionality reduction. Cluster identities are assigned to common cells in this latent space (Box 1). (C) Manifold-based fusion: Manifold fusion methods map the input feature dimensions from modalities 1 and 2 to a low-dimensional manifold space (in the figure, 9 row-wise features are mapped to 3 dimensions). The manifolds (Box 1) generated for each paired modality are aligned with each other to identify common cells between modalities. (D) Network-based fusion: Similarity networks are generated for the unpaired modalities 1 and 2. Cells with similar feature profiles are connected to each other within each network. The connections conserved between the two networks are used for integration. (E) Statistical modeling: Statistical modeling methods identify shared clusters and common cells between paired modalities 1 and 2 by generating a probabilistic model (Box 1). Because the same prior probability distribution is used for clusters in both modalities to tune the model, shared cell-specific random effects are captured, which are useful for finding posterior cell identities. (F) Deep learning representations: Deep learning for unsupervised omics integration is performed using autoencoders (Box 1), which follow an encoder-decoder scheme. In theory, any of the methods (A to E) can be combined in the hidden layer of the autoencoder scheme to predict cell clusters. Here, the NMF method is shown as an example. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
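The idea behind cell-cell correlation (panel A) can be shown with a small sketch. It assumes that both modalities have already been summarized on the same set of features (for example, per-gene scores), which real tools achieve in different ways; the cell names, values, and the helper `match_cells` are hypothetical. Each cell in modality 1 is matched to the modality 2 cell with the highest Pearson correlation.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long feature vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    # A constant vector has no defined correlation; treat it as uninformative.
    return cov / (sx * sy) if sx and sy else 0.0


def match_cells(modality_1, modality_2):
    """Map each modality 1 cell to its best-correlated modality 2 cell."""
    matches = {}
    for cell_1, profile_1 in modality_1.items():
        matches[cell_1] = max(
            modality_2,
            key=lambda cell_2: pearson(profile_1, modality_2[cell_2]),
        )
    return matches


# Toy data (hypothetical): four shared features measured in unpaired cells.
modality_1 = {
    "m1_cell_a": [9, 8, 1, 0],
    "m1_cell_b": [0, 1, 7, 9],
}
modality_2 = {
    "m2_cell_x": [1, 0, 6, 8],
    "m2_cell_y": [7, 9, 0, 2],
}

matches = match_cells(modality_1, modality_2)
print(matches)
```

Matched pairs like these can serve as anchors for placing both modalities in a common space. Practical methods add normalization, dimensionality reduction, and filtering of unreliable matches, none of which is shown here.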
Fig. 2. Techniques for specialized mechanism discovery. The commonly used tools for mechanism prediction are based either on functional enrichment (A to C) or on biological network inference (D to F). (A) Over-representation analysis (ORA): ORA compares the fraction of genes in the observed list that overlap with a known gene set (observed) with the fraction of all genes in the organism’s genome that overlap with that gene set (expected) to identify enriched gene sets. The overlaps are indicated by Venn diagrams. (B) Gene-set enrichment analysis (GSEA): GSEA ranks genes based on differential expression between control and case samples (indicated by red dots in the Volcano plot) and subsequently uses the ranks of overlapping genes between the observed and expected cases to score the membership of a gene list in each of the known gene sets (shown as a dot plot in the figure). The statistical significance of the enrichment score per gene set is calculated using permutation tests (Box 1). (C) Gene-set variation analysis (GSVA): GSVA converts the log-normalized gene expression matrix (genes vs samples) into a GSVA score matrix (gene sets vs samples) by ranking genes per sample. (D) Cell-cell communication inference (CCI): CCI methods use information on differentially expressed (indicated by the red dots in the Volcano plot) or co-expressed ligands and receptors (indicated by the heatmap) and compare it with a database of known ligand-receptor interactions to prioritize potential ligand-receptor interactions in a given condition (indicated by a Circos plot connecting ligands to receptors). (E) Gene-regulatory network (GRN) inference: GRN inference methods use the expression profiles of transcription factors (TFs) and of their downstream target genes (indicated by heatmap vectors) to find meaningful co-expressed pairs, which are represented as a network of TF-target interactions. (F) Metabolic network inference: Active, condition-specific metabolic networks are derived by using metabolic gene expression data (heatmap) as biochemical constraints for tailoring a generic genome-scale metabolic network of an organism. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
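The observed-versus-expected comparison in ORA (panel A) is commonly evaluated with a hypergeometric (one-sided Fisher) test. The following is a minimal sketch under that assumption, not the implementation of any listed tool; the function name `ora_p_value` and the gene identifiers are hypothetical.

```python
from math import comb


def ora_p_value(study_genes, gene_set, background_genes):
    """Return (overlap, p) from a right-tailed hypergeometric test."""
    background = set(background_genes)
    study = set(study_genes) & background    # user-defined gene list
    members = set(gene_set) & background     # genes annotated to the gene set

    n_background = len(background)
    n_members = len(members)
    n_study = len(study)
    overlap = len(study & members)

    # P(X >= overlap) when n_study genes are drawn without replacement
    # from a background that contains n_members gene-set members.
    upper = min(n_members, n_study)
    tail = sum(
        comb(n_members, i) * comb(n_background - n_members, n_study - i)
        for i in range(overlap, upper + 1)
    )
    return overlap, tail / comb(n_background, n_study)


# Toy data: hypothetical gene identifiers, not real annotations.
background = [f"gene{i}" for i in range(20)]
gene_set = ["gene0", "gene1", "gene2", "gene3", "gene4"]
study_list = ["gene0", "gene1", "gene2", "gene10"]

overlap, p_value = ora_p_value(study_list, gene_set, background)
print(overlap, round(p_value, 4))
```

In this toy case 3 of the 4 study genes fall in the gene set (observed fraction 0.75), whereas the gene set covers 5 of 20 background genes (expected fraction 0.25). When many gene sets are tested, the resulting p-values would additionally need correction for multiple testing.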
Fig. 5. A potential pipeline for discovering novel anti-angiogenic targets from single-cell multi-omics datasets. This pipeline showcases a potential workflow that can seamlessly integrate the discussed techniques for anti-angiogenic target discovery. The olive green boxes represent the data fusion and knowledge discovery techniques. The light blue boxes highlight the use of unsupervised and supervised data fusion techniques for integrating heterogeneous data sources. The green box highlights the target predicted from this workflow and its subsequent follow-up with experimental validation. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)