Abstract
For many complex diseases, an earlier and more reliable diagnosis is considered a key prerequisite for developing more effective therapies to prevent or delay disease progression. Classical statistical learning approaches for specimen classification using omics data, however, often cannot provide diagnostic models with sufficient accuracy and robustness for heterogeneous diseases like cancers or neurodegenerative disorders. In recent years, new approaches for building multivariate biomarker models on omics data have been proposed, which exploit prior biological knowledge from molecular networks and cellular pathways to address these limitations. This survey provides an overview of these recent developments and compares pathway- and network-based specimen classification approaches in terms of their utility for improving model robustness, accuracy and biological interpretability. Different routes to translate omics-based multifactorial biomarker models into clinical diagnostic tests are discussed, and a previous study is presented as an example.
Keywords: biomarker modeling; cross-study analysis; machine learning; network analysis; pathway analysis
Year: 2015 PMID: 26141830 PMCID: PMC4870394 DOI: 10.1093/bib/bbv044
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Overview of pathway-based machine learning approaches for supervised classification of omics samples
| Publication | Pathway activity scoring method | Prediction method |
|---|---|---|
| Guo | Mean or median expression levels in GO modules | Decision tree classification |
| Tomfohr | Expression levels are summarized via the first eigenvector from SVD analysis | Focus on predictive feature selection via |
| Lee | Normalized sum of expression levels for CORGs within pathways | Logistic regression |
| Su | Weighted sum of LLRs for pathway members | Logistic regression or linear discriminant analysis |
| Glaab | Variance across pathway member activity is quantified and compared instead of averaged pathway activity | SVM, random forest, nearest shrunken centroid classifier or ensemble learning |
| Svenson | For each gene set and patient a ratio score is defined depending on the number of up- and down-regulated members in the gene set | Nearest centroids classification |
| Efroni | Joint scoring of pathway activity and consistency using interactions within pathways | Bayesian linear discriminant classifier |
| Vaske | Probabilistic inference is used to predict pathway activities from a factor graph model of genes and their products | Cox proportional hazard regression (survival prediction) |
| Breslin | Signal transduction pathways with directional interactions are used to define the activity of a pathway based on the average expression of its downstream target genes | No classification is performed, but contingency tables show associations between sample-wise pathway activity and clinical sample classifications |
| Kim | Hierarchical feature vectors are used, assigning a two-level hierarchical structure to the features/genes determined by their pathway membership | SVM |
Averaging and dimension reduction approaches are listed on top, whereas graph-based and hierarchical pathway activity scoring methods are listed below the bold black line.
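The averaging and dimension reduction rows of the table can be illustrated with a minimal sketch. It assumes a samples-by-genes expression matrix and a dictionary of gene sets; the function names, the toy gene identifiers and the standardization step are illustrative assumptions, not the implementations used in the cited publications.

```python
import numpy as np

def member_columns(gene_index, members):
    """Column indices of the pathway members present in the expression matrix."""
    return [gene_index[g] for g in members if g in gene_index]

def mean_pathway_activity(expr, gene_index, pathways):
    """Score each pathway per sample as the mean expression of its member genes."""
    cols = [expr[:, member_columns(gene_index, m)].mean(axis=1)
            for m in pathways.values()]
    return np.column_stack(cols)

def svd_pathway_activity(expr, gene_index, pathways):
    """Score each pathway per sample by projecting its standardized member
    genes onto the first right singular vector (SVD-based summary)."""
    cols = []
    for members in pathways.values():
        sub = expr[:, member_columns(gene_index, members)]
        # Standardize each gene; the small constant avoids division by zero.
        sub = (sub - sub.mean(axis=0)) / (sub.std(axis=0) + 1e-12)
        _, _, vt = np.linalg.svd(sub, full_matrices=False)
        cols.append(sub @ vt[0])
    return np.column_stack(cols)

# Hypothetical toy data: 3 samples, 4 genes, 2 gene sets.
genes = ["g1", "g2", "g3", "g4"]
gene_index = {g: i for i, g in enumerate(genes)}
pathways = {"pathway_A": ["g1", "g2"], "pathway_B": ["g3", "g4"]}
expr = np.array([[1.0, 3.0, 0.0, 2.0],
                 [2.0, 4.0, 1.0, 1.0],
                 [5.0, 7.0, 0.0, 0.0]])

mean_scores = mean_pathway_activity(expr, gene_index, pathways)
svd_scores = svd_pathway_activity(expr, gene_index, pathways)
```

Either score matrix has one column per pathway and can be passed to any standard classifier in place of the gene-level features. Note that the sign of an SVD-based score is arbitrary, so it should only be interpreted relative to other samples.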
Overview of network-based methods for machine learning analysis of omics data. Sequential network activity scoring and prediction methods are shown on top, whereas machine learning approaches using embedded network-based feature selection are listed below the bold black line.
| Methodology publication | Network activity/alteration scoring method | Prediction method |
|---|---|---|
| Tuck | Sample-specific gene regulatory networks are constructed and subnetwork activity is scored by summing over active interactions | Nearest neighbors, decision tree, Naïve Bayes, among others |
| Ma | Disease association is scored for genes based on gene expression data and their neighbors’ association scores in a PPI network using Markov Random Field theory | The approach is evaluated for disease gene prioritization but is applicable for predictive feature selection in combination with any prediction method |
| Chuang | Normalized gene expression data is mapped onto a protein interaction network and discriminative subnetworks are identified via a greedy search procedure | Logistic regression |
| Taylor | Hub nodes in protein interaction networks are determined and the relative gene expression of hubs with each of their interacting partners is computed to identify hubs with diverse relative expression across sample groups | Affinity propagation clustering is used to assign a probability of poor prognosis to breast cancer patients |
| Petrochilos | A random walk community detection algorithm is applied to discover modules in a molecular interaction network, and gene expression data is used to identify disease-associated modules | The approach is used to identify cancer-associated network modules and validated by scoring the enrichment of known cancer-related genes extracted from the OMIM database |
| Rapaport | Spectral decomposition of gene expression profiles is applied with respect to the eigenfunctions of a network graph, attenuating the high-frequency components of the expression profiles with respect to the graph topology | SVM |
| Li | A network-constrained regularization procedure for linear regression analysis is used to identify disease-related discriminative subnetworks | Penalized linear regression |
| Yang | Three machine learning methods for graph-guided feature selection and grouping are proposed, including a convex function and two non-convex formulations designed to reduce the estimation bias | Penalized least squares-based approach (GOSCAR: Graph octagonal shrinkage and clustering algorithm for regression) |
| Lorbert | A sparse regression approach is proposed, using the PEN penalty to favor the grouping of strongly correlated features based on pairwise similarities (e.g. derived from a molecular interaction graph) | Penalized regression (PEN penalty) |
| Vlassis | Penalized logistic regression is applied using a convex PEN penalty function (see approach by Lorbert) | Penalized logistic regression (PEN penalty with absolute feature weights) |
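The idea behind the embedded, network-constrained regression approaches in the lower part of the table can be sketched as follows. This is a deliberately simplified illustration, not the method of any cited publication: it assumes an unnormalized graph Laplacian, omits the sparsity-inducing L1 term, and uses a closed-form least-squares solution. All names, the toy network and the simulated data are hypothetical.

```python
import numpy as np

def graph_laplacian(n_genes, edges):
    """Unnormalized Laplacian L = D - A of an undirected gene network."""
    adjacency = np.zeros((n_genes, n_genes))
    for i, j in edges:
        adjacency[i, j] = adjacency[j, i] = 1.0
    return np.diag(adjacency.sum(axis=1)) - adjacency

def laplacian_penalized_regression(X, y, L, lam=1.0, ridge=1e-6):
    """Minimize ||y - X b||^2 + lam * b^T L b (+ a tiny ridge term for stability).

    The penalty b^T L b sums (b_i - b_j)^2 over network edges, so genes that
    interact in the network are pushed towards similar coefficients.
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * L + ridge * np.eye(p), X.T @ y)

# Simulated toy data: 20 samples, 3 genes; genes 0 and 1 interact.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, 0.2, 0.0]) + 0.1 * rng.normal(size=20)
L = graph_laplacian(3, edges=[(0, 1)])

beta_weak = laplacian_penalized_regression(X, y, L, lam=0.0)
beta_strong = laplacian_penalized_regression(X, y, L, lam=100.0)

# Difference between the coefficients of the two connected genes.
gap_weak = abs(beta_weak[0] - beta_weak[1])
gap_strong = abs(beta_strong[0] - beta_strong[1])
```

Increasing `lam` shrinks the difference between the coefficients of connected genes, which is how such penalties favor coherent subnetworks over isolated features. The published variants listed in the table add sparsity penalties or use different penalty functions.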
Figure 1. Example illustration of common stages during the development of omics-based diagnostic tests (simplified version of the process presented in a study by the US Institute of Medicine [97], focusing on the major steps in the pipeline). After the transition from the second to the third phase (highlighted by the lock symbol), the diagnostic test must be fully defined, validated and locked down. Many important variants and alternatives to the outlined example process exist, as well as different realizations of generic steps in the process (e.g. cases in which a test directs patient management may cover different situations, depending on whether clinicians are free to use the test result as they see fit, or whether predefined procedures have to be followed subject to contraindications and/or subject to the test results). The setup may also vary depending on whether it is known exactly how patients would have been treated had they been randomized to the opposite arm, depending on whether the test entails a treatment delay, and whether the adequate cutoff threshold for the test is uncertain.