
MAINE: a web tool for multi-omics feature selection and rule-based data exploration.

Aleksandra Gruca¹, Joanna Henzel¹, Iwona Kostorz², Tomasz Stęclik², Łukasz Wróbel¹, Marek Sikora¹,².

Abstract

SUMMARY: Patient multi-omics datasets are often characterized by high dimensionality; however, usually only a small fraction of the features is informative, that is, changes in their values are directly related to the disease outcome or patient survival. In medical sciences, in addition to a robust feature selection procedure, the ability to discover human-readable patterns in the analysed data is also desirable. To address this need, we created MAINE (Multi-omics Analysis and Exploration). The unique functionality of MAINE is the ability to discover multidimensional dependencies between the selected multi-omics features and event outcome prediction as well as patient survival probability. Learned patterns are visualized in the form of interpretable decision/survival trees and rules. AVAILABILITY: MAINE is freely available at maine.ibemag.pl as an online web application. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Year: 2021 PMID: 34954788 PMCID: PMC8896606 DOI: 10.1093/bioinformatics/btab862

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Multi-omics data are characterized by high dimensionality while, at the same time, only a small fraction of the features describing these data are informative. Therefore, the first step of multi-omics data analysis usually consists of selecting the most relevant set of features describing the available data. However, in medical sciences, it is crucial that a data analysis method is also able to discover human-readable patterns hidden in the analyzed data. To address this need, we developed MAINE, a web application for feature selection and explanatory analysis of multi-omics data. MAINE provides illustrative reports describing multidimensional dependencies between features. Our approach is based on the observation that the most interesting features are related to event outcome prediction (classification) or patient survival probability (survival analyses), and those dependencies can be represented in the form of trees and rules (Burkart and Huber, 2021; Ishwaran ; Sikora ). With the growing number of multi-omics datasets, there is an urgent need to develop applications that provide users with an intuitive interface, making the analysis available also to domain experts who do not have programming skills. Recently, several software tools have been developed that provide a wide spectrum of methods and approaches both for feature selection and for data visualization and explanation. Those tools are available either as stand-alone software packages or as web services. Typically, stand-alone software packages provide the user with extended functionality and wrapper functions built on top of existing libraries. However, using such tools requires programming skills, as well as access to computers with high computing power when analyzing larger datasets. mixOmics (Rohart ) is an example of a stand-alone R package for multi-omics data analysis. 
It provides a wrapper function for a set of statistical methodologies for analyzing high-throughput data, as well as a package for data visualization such as relevance networks, clustered image maps and circle plots. The tool, however, does not support survival analysis. IntLIM (Siddiqui ) is a tool that integrates metabolomics and transcriptomics data and is also available as an R package with a Shiny-based GUI. Here, gene-metabolite associations that are specific to a particular phenotype are uncovered by a linear modeling approach. Another example of a stand-alone application is PROMO (Netanely ), a Windows application with a fully interactive graphical user interface that runs on top of the MATLAB environment. This tool provides the user with a set of standard methods for genomic cancer data analysis, starting from data preprocessing and visualization, through clustering, decision tree generation and survival analysis, to biomarker discovery. However, most of those methods are dedicated to a single type of data, and their multi-omics functionality is limited to feature correlation analysis in two selected omics or clustering the samples based on several omic matrices simultaneously. An example of a web-based platform for multi-omics data analysis is PaintOmics 3 (Hernández-de Diego ), which provides the user with methods for feature matching, pathway enrichment analysis and pathway-based results visualization. MultiSLIDE (Ghosh ) is a web-based interactive tool that allows the identification of molecular signatures by statistical analysis and simultaneous visualization of molecular features in heatmaps of multi-omics datasets. MiBiOmics (Zoppi ) is another example of a web-based application for multi-omics data filtration, normalization and transformation. The main functionality of this tool is data exploration based on PCA and PCoA plots, and network inference based on Weighted Gene Correlation Network Analysis. 
Mergeomics (Ding ) is a tool which, after filtering omics marker redundancies, uses Marker Set Enrichment Analysis (MSEA) to summarize the enrichment of disease/trait omics markers in sets of functionally related genes, and Meta-MSEA to integrate multiple datasets of the same omics type or of multiple omics types. MAINE provides statistical and machine learning-based frameworks for explaining multidimensional dependencies between selected multi-omics attributes and event outcome or patient survival time, based on decision/survival trees and decision/survival rules. In our classification and survival reports, we focus on explaining the relation between selected attribute values and outcome prediction/survival time. Our approach differs from other tools in that it provides not only a list of relevant features but also subgroups of patients that are similar according to the criteria presented in a tree node or rule premise. The subgroups of patients can then be described by attributes and their values, and characterized by their outcome or survival time. Tree generation methods are based on the divide-and-conquer (DnC) approach and rule generation methods are based on the separate-and-conquer (SnC) approach. Both approaches make it possible to create subgroups of similar patients described by interpretable rules; however, the discovered patterns might differ because the algorithms differ. The DnC strategy does not allow an example to be covered by more than one rule, whereas the SnC approach has no such limitation. In addition, rules generated from decision trees may contain redundant features, unlike SnC-based rules, where each rule is induced separately. Moreover, our service provides a unique method for feature selection. The method is based on Rough Set Theory (RST) and Monte Carlo-based Approximate Relative Reducts (MCARR) feature selection (Riza ). 
Within the RST-MCARR approach, a minimal set of features is selected that ensures the same discernibility between examples from different decision classes as the whole (non-reduced) feature set.
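
To make the discernibility criterion concrete, the following minimal Python sketch selects a feature subset that discerns every pair of examples from different decision classes that the full feature set discerns. It is an illustration only and is not the MCARR algorithm: it uses an exhaustive greedy search instead of Monte Carlo sampling, it assumes already discretized feature values, and the function names and the toy data are hypothetical.

```python
from itertools import combinations


def discerned_pairs(rows, labels, features):
    """Pairs (i, j) with different labels that differ on at least one given feature."""
    pairs = set()
    for i, j in combinations(range(len(rows)), 2):
        if labels[i] != labels[j] and any(rows[i][f] != rows[j][f] for f in features):
            pairs.add((i, j))
    return pairs


def greedy_reduct(rows, labels):
    """Greedy stand-in for a reduct search (NOT MCARR): keep adding the feature
    that discerns the most still-undiscerned pairs until the subset discerns
    everything the full feature set discerns."""
    all_features = list(range(len(rows[0])))
    target = discerned_pairs(rows, labels, all_features)
    selected, covered = [], set()
    while covered != target:
        best = max(
            (f for f in all_features if f not in selected),
            key=lambda f: len(discerned_pairs(rows, labels, [f]) - covered),
        )
        selected.append(best)
        covered |= discerned_pairs(rows, labels, [best])
    return sorted(selected)


# Hypothetical, discretized toy data: 4 patients x 3 features.
rows = [
    [0, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 1],
]
labels = ["alive", "alive", "dead", "dead"]

reduct = greedy_reduct(rows, labels)
print(reduct)  # feature 0 alone separates the two classes here
```

In this toy table a single feature preserves the discernibility of the full set, so the other two features would be dropped. A greedy search does not guarantee a minimal subset in general.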

2 Materials and methods

The MAINE web application allows the user to submit patient data that include multi-omics experimental measurements obtained with different high-throughput platforms. Currently, the accepted data types are probe measurements from methylation assays, RNA-Seq expression data and copy number variation (CNV) data. Feature selection can be performed using either a statistical approach typical for downstream methylation/expression data analyses or the RST-MCARR approach. Three workflows for data analysis are available: (i) attribute selection and normalization are done separately for each multi-omics data type, using the statistical approach for methylation and RNA-Seq data and the RST-MCARR algorithm for CNV data; (ii) attribute selection is done separately for each data type and is based on RST-MCARR; (iii) all attributes are first combined into a single table and then attribute selection is performed using RST-MCARR. Detailed explanatory charts of the workflows are presented in Supplementary Section S1. After the feature selection process, the results can be downloaded by the user for further analyses or used as a basis to generate explanatory classification or survival reports. The second part of the MAINE application provides the user with explanatory reports showing important features and discovered patterns. To obtain highly interpretable results, we provide decision trees and rules for classification reports, and survival trees and rules for survival reports. Both the trees and the rules make it possible to divide observations (patients) into subgroups with different outcome/survivability characteristics. Both methods not only identify the variables that have a significant impact on the outcome prediction/survival time, but also model nonlinear dependencies and interactions between the variables. 
Each classification report contains decision trees generated with two different methods: rpart (Therneau ) and ctrees (Hothorn ), a set of decision rules (Gudyś ), and a list of the most important features. If patient data include survival information, it can be used in the classification report to assign the Kaplan–Meier survival function to the discovered subgroups of observations. This approach makes it possible not only to highlight the importance of particular features but also to understand the relations between the conditional features and the decision attribute. The survival report provides information on which features are related to the occurrence of an event of interest. This relation can be represented in the form of either survival trees or survival rules.
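
The separate-and-conquer strategy behind decision rule induction can be illustrated with a short Python sketch. It is a simplified sequential covering procedure written for this illustration and is not the rule induction algorithm used in MAINE. It assumes categorical feature values, scores candidate conditions by precision, and uses hypothetical function names and toy data.

```python
def covers(rule, row):
    """A rule is a list of (feature_index, value) conditions joined by AND."""
    return all(row[f] == v for f, v in rule)


def learn_one_rule(rows, positives, negatives):
    """Grow one rule: add the most precise condition until no negative is covered."""
    rule, pos, neg = [], list(positives), list(negatives)
    while neg:
        # Candidate conditions are taken from positives still covered by the rule.
        candidates = sorted(
            {(f, rows[i][f]) for i in pos for f in range(len(rows[0]))} - set(rule)
        )
        if not candidates:  # inconsistent data: cannot exclude remaining negatives
            break

        def score(cond):
            p = sum(rows[i][cond[0]] == cond[1] for i in pos)
            n = sum(rows[i][cond[0]] == cond[1] for i in neg)
            return (p / (p + n), p)  # precision first, then coverage

        best = max(candidates, key=score)
        rule.append(best)
        pos = [i for i in pos if rows[i][best[0]] == best[1]]
        neg = [i for i in neg if rows[i][best[0]] == best[1]]
    return rule


def sequential_covering(rows, labels, target):
    """Separate-and-conquer: learn a rule, set aside the positives it covers, repeat."""
    negatives = [i for i, y in enumerate(labels) if y != target]
    uncovered = [i for i, y in enumerate(labels) if y == target]
    rules = []
    while uncovered:
        rule = learn_one_rule(rows, uncovered, negatives)
        rules.append(rule)
        uncovered = [i for i in uncovered if not covers(rule, rows[i])]  # "separate"
    return rules


# Hypothetical toy data: (methylation level, expression level) per patient.
rows = [
    ("high", "low"),
    ("high", "high"),
    ("low", "low"),
    ("low", "high"),
    ("low", "high"),
]
labels = ["dead", "dead", "dead", "alive", "alive"]

rules = sequential_covering(rows, labels, target="dead")
print(rules)
```

On this toy table the procedure yields two single-condition rules, and the first patient satisfies both of them. This is the overlap that the DnC strategy rules out, because a tree assigns each example to exactly one leaf.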

3 Results and discussion

Here, we briefly present an illustrative example of data analysis with MAINE, obtained with workflow (iii). The analyzed dataset was derived from the TCGA study of Acute Myeloid Leukemia (AML; Cancer Genome Atlas Research Network, 2013). In our example, we focused on experimental data from RNA expression, DNA-methylation profiling and CNV. After filtering for the samples that had measurements for all three experimental platforms, we obtained data for 109 patients, including 51 patients with survival status alive and 58 patients with survival status dead. The input dataset included 40 571 RNA-Seq, 321 500 methylation and 19 482 CNV features. After the selection process, the number of features was reduced to 6. The number of selected features depends on the RST-MCARR parameter settings, and the selection can be made less restrictive by tuning the method parameters. The selected features were four probes from the methylation dataset: cg01686920 (targeting CRNDE, IRX5), cg26838023 (targeting LINC00028, REM1), cg20895586 (targeting RP11-126F18.2) and cg15862128 (targeting MIR1193, MIR494); one gene from the RNA-Seq dataset, CLCN5; and one gene from the CNV dataset, ACTN2. By analyzing the literature, we noticed that those genes frequently take part in pathways and processes related to tumor development or suppression. For example, the RET signaling pathway is among the pathways related to ACTN2, and various AML subtypes are dependent on expression of the RET receptor tyrosine kinase (Rudat ). Another example is MIR1193, which suppresses the proliferation and invasion of human T-cell leukemia cells by directly targeting TM9SF3 (Shen ). Interestingly, among the selected genes we can also find CLCN5 which, based on current literature reports, is not related to leukemia prognosis. Since this gene appears frequently in both the discovered trees and rules, it can be a potential target for future research on new prognostic markers in AML. 
Figure 1 presents the decision tree and the survival curves obtained as the results of our analyses. Highlighted curves are calculated on the basis of the examples assigned to the leaves labeled as 3, 6, 7 and for the entire dataset. The biological description of the results and the full classification report is provided in Supplementary Sections S2 and S3.
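
The data preparation step described above (keeping only patients measured on all three platforms and, as in workflow (iii), combining all attributes into a single table) can be sketched in a few lines of Python. This is not MAINE's implementation. The patient identifiers, the numeric values, the platform prefixes and the function names are hypothetical; only the feature names are taken from the text.

```python
def common_samples(*platforms):
    """Sorted sample IDs that have measurements in every platform table."""
    ids = set(platforms[0])
    for table in platforms[1:]:
        ids &= set(table)
    return sorted(ids)


def combine(samples, **platforms):
    """One row per sample; feature names are prefixed with the platform name."""
    combined = {}
    for sample in samples:
        row = {}
        for name, table in platforms.items():
            for feature, value in table[sample].items():
                row[f"{name}:{feature}"] = value
        combined[sample] = row
    return combined


# Hypothetical toy tables: sample ID -> {feature: value}. Values are invented.
rna = {"P1": {"CLCN5": 5.2}, "P2": {"CLCN5": 3.1}, "P3": {"CLCN5": 4.0}}
meth = {"P1": {"cg01686920": 0.81}, "P3": {"cg01686920": 0.12}}
cnv = {"P1": {"ACTN2": 0}, "P2": {"ACTN2": 1}, "P3": {"ACTN2": -1}}

shared = common_samples(rna, meth, cnv)  # P2 lacks methylation data
table = combine(shared, rna=rna, methylation=meth, cnv=cnv)
print(shared)
print(table["P1"])

# Size of the combined feature space reported for the AML example.
total_features = 40_571 + 321_500 + 19_482
```

For the AML example, the combined table in workflow (iii) therefore holds 381 553 candidate features for 109 patients before selection.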

Fig. 1.

Decision tree. Each node provides: the name of the outcome value indicating the majority class assigned to that node; the ratio of patients with the node outcome to all patients assigned to that node; and the percentage of cases assigned to the node. Survival curves are drawn for the selected (3, 6, 11) leaves of the decision tree. The red line represents the survival curve calculated for the entire dataset. Each curve shows the probability of staying alive for a certain amount of time after the treatment.
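
Survival curves of this kind are typically computed with the Kaplan–Meier product-limit estimator, S(t) = Π over event times t_i ≤ t of (1 − d_i / n_i), where d_i is the number of events at t_i and n_i is the number of patients still at risk. The sketch below is a minimal plain-Python version, not the code used by MAINE. The function name and the follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  -- follow-up time of each patient
    events -- 1 if the event (death) was observed, 0 if the patient was censored
    """
    curve, survival = [], 1.0
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical subgroup of five patients (for example, those assigned to one leaf).
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

Applying the function separately to the patients in each leaf, and once to the whole dataset, gives one curve per subgroup plus a reference curve. In the toy data the estimate drops to about 0.8, 0.6 and 0.3 at times 2, 3 and 5.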

4 Conclusions

In this work we presented MAINE, a web server that provides two main functionalities: multi-omics feature selection and classification/survival explanatory reports generation. It is the first web application enabling RST-MCARR-based feature selection of multi-omics data. MAINE not only provides a list of selected features but also supports the user in understanding multidimensional dependencies hidden in data by explaining how feature values are related to the event outcome (classification reports) or survival probability (survival reports).

Funding

This work was partially funded by the Polish National Centre for Research and Development [Grant No. STRATEGMED3/304586/5/NCBR/2017]; the Statutory Research Fund of Łukasiewicz Research Network—Institute of Innovative Technologies EMAG; and the Young Researchers funds of Department of Computer Networks and Systems, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland [Project No. 02/120/BKM21/0012]. Conflict of Interest: none declared.

Data availability statement

The data underlying this article are available in the Genomic Data Commons Data Portal at https://portal.gdc.cancer.gov and on the MAINE website at maine.ibemag.pl/#exemplaries.

9 in total

1. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia.

Authors: Timothy J Ley; Christopher Miller; Li Ding; Benjamin J Raphael; Andrew J Mungall; A Gordon Robertson; Katherine Hoadley; Timothy J Triche; Peter W Laird; Jack D Baty; Lucinda L Fulton; Robert Fulton; Sharon E Heath; Joelle Kalicki-Veizer; Cyriac Kandoth; Jeffery M Klco; Daniel C Koboldt; Krishna-Latha Kanchi; Shashikant Kulkarni; Tamara L Lamprecht; David E Larson; Ling Lin; Charles Lu; Michael D McLellan; Joshua F McMichael; Jacqueline Payton; Heather Schmidt; David H Spencer; Michael H Tomasson; John W Wallis; Lukas D Wartman; Mark A Watson; John Welch; Michael C Wendl; Adrian Ally; Miruna Balasundaram; Inanc Birol; Yaron Butterfield; Readman Chiu; Andy Chu; Eric Chuah; Hye-Jung Chun; Richard Corbett; Noreen Dhalla; Ranabir Guin; An He; Carrie Hirst; Martin Hirst; Robert A Holt; Steven Jones; Aly Karsan; Darlene Lee; Haiyan I Li; Marco A Marra; Michael Mayo; Richard A Moore; Karen Mungall; Jeremy Parker; Erin Pleasance; Patrick Plettner; Jacquie Schein; Dominik Stoll; Lucas Swanson; Angela Tam; Nina Thiessen; Richard Varhol; Natasja Wye; Yongjun Zhao; Stacey Gabriel; Gad Getz; Carrie Sougnez; Lihua Zou; Mark D M Leiserson; Fabio Vandin; Hsin-Ta Wu; Frederick Applebaum; Stephen B Baylin; Rehan Akbani; Bradley M Broom; Ken Chen; Thomas C Motter; Khanh Nguyen; John N Weinstein; Nianziang Zhang; Martin L Ferguson; Christopher Adams; Aaron Black; Jay Bowen; Julie Gastier-Foster; Thomas Grossman; Tara Lichtenberg; Lisa Wise; Tanja Davidsen; John A Demchok; Kenna R Mills Shaw; Margi Sheth; Heidi J Sofia; Liming Yang; James R Downing; Greg Eley
Journal: N Engl J Med Date: 2013-05-01 Impact factor: 91.245

2. mixOmics: An R package for 'omics feature selection and multiple data integration.

Authors: Florian Rohart; Benoît Gautier; Amrit Singh; Kim-Anh Lê Cao
Journal: PLoS Comput Biol Date: 2017-11-03 Impact factor: 4.475

3. IntLIM: integration using linear models of metabolomics and gene expression data.

Authors: Jalal K Siddiqui; Elizabeth Baskin; Mingrui Liu; Carmen Z Cantemir-Stone; Bofei Zhang; Russell Bonneville; Joseph P McElroy; Kevin R Coombes; Ewy A Mathé
Journal: BMC Bioinformatics Date: 2018-03-05 Impact factor: 3.169

4. PaintOmics 3: a web resource for the pathway analysis and visualization of multi-omics data.

Authors: Rafael Hernández-de-Diego; Sonia Tarazona; Carlos Martínez-Mira; Leandro Balzano-Nogueira; Pedro Furió-Tarí; Georgios J Pappas; Ana Conesa
Journal: Nucleic Acids Res Date: 2018-07-02 Impact factor: 16.971

5. PROMO: an interactive tool for analyzing clinically-labeled multi-omic cancer datasets.

Authors: Dvir Netanely; Neta Stern; Itay Laufer; Ron Shamir
Journal: BMC Bioinformatics Date: 2019-12-26 Impact factor: 3.169

6. miR-1193 Suppresses the Proliferation and Invasion of Human T-Cell Leukemia Cells Through Directly Targeting the Transmembrane 9 Superfamily 3 (TM9SF3).

Authors: Liyun Shen; Xingjun Du; Hongyan Ma; Shunxi Mei
Journal: Oncol Res Date: 2017-03-30 Impact factor: 5.574