SGI: Automatic clinical subgroup identification in omics datasets.

Mustafa Buyukozkan^1,2, Karsten Suhre^2,3, Jan Krumsiek^1,2.

Abstract

The 'Subgroup Identification' (SGI) toolbox provides an algorithm to automatically detect clinical subgroups of samples in large-scale omics datasets. It is based on hierarchical clustering trees in combination with a specifically designed association testing and visualization framework that can process an arbitrary number of clinical parameters and outcomes in a systematic fashion. A multi-block extension allows for the simultaneous use of multiple omics datasets on the same samples. In this paper, we first describe the functionality of the toolbox and then demonstrate its capabilities through application examples on a type 2 diabetes metabolomics study as well as two copy number variation datasets from The Cancer Genome Atlas.

AVAILABILITY: SGI is an open-source package implemented in R. Package source code and hands-on tutorials are available at https://github.com/krumsieklab/sgi. The QMdiab metabolomics data are included in the package and can be downloaded from https://doi.org/10.6084/m9.figshare.5904022.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2021 PMID: 34529048 PMCID: PMC8723155 DOI: 10.1093/bioinformatics/btab656

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

The identification of patient subgroups from high-dimensional molecular profiles has become a central approach in biomedical research, driven by the wide availability of modern ‘multi-omics’ datasets (Collisson et al.; Eddy et al.; Rouzier et al.). The central idea is that genomics, transcriptomics, proteomics, metabolomics and other deep molecular phenotypes will inherently define groups of patients that are similar with respect to disease-relevant clinical outcomes. Note that ‘outcome’ is defined here in a statistical sense and includes parameters such as sex, prevalent disease and current BMI. Recent examples of molecularly defined subgroups include the identification of subtypes of lymphoma that severely impact survival (Nowakowski and Czuczman, 2015), subtypes of various cancers identified in ‘The Cancer Genome Atlas’ (Sanchez-Vega et al.) and patient stratification in allergic diseases (Agache and Akdis, 2019). Various bioinformatics methods for the computational extraction of subgroups from omics data have been published in recent years. Most of those methods implement novel metrics for the pairwise similarity of samples in a multi-omics setting (Loh et al.; Rappoport and Shamir, 2019; Speicher and Pfeifer, 2015; Wang et al.) followed by standard clustering, or improved clustering approaches addressing the robustness of resulting subgroups (Nguyen et al.).

Here, we present the ‘SGI’ (subgroup identification) package, which implements a new method for the automatic detection of omics-based subgroups. It provides the following novel features compared to previously published methods: (i) The algorithm works hierarchically and can thus identify subgroups of any granularity in complex patient cohorts. (ii) Any sample-wise distance metric can be used for hierarchical clustering, including classical Euclidean distance, or more advanced measures such as similarity network fusion (Wang et al.) and multiple kernel learning (Speicher and Pfeifer, 2015). (iii) It can handle an arbitrary number of clinical variables of interest simultaneously. Combined with the hierarchical approach, this allows the user to explore the complex relationships between various outcomes. (iv) The algorithm is assumption-free and does not require any model fitting, since it only operates on a distance matrix across the omics samples. Moreover, the toolbox implements a comprehensive set of methods to visualize the associations for further interpretation.

2 Description

2.1 SGI method

The SGI algorithm generates a hierarchical clustering of samples and runs a two-group association test against the analyzed clinical outcomes at each branching point in the tree. An example output plot is shown in Figure 1, which will be further discussed in the application section below. The algorithm works as follows:

(i) Clustering. A dendrogram of the samples is generated using standard hierarchical clustering on the input data matrix. The function accepts any hclust object, giving the user full control over the choice of distance and linkage functions.

(ii) Generate valid cluster pairs. To avoid low-powered calculations in small clusters, the method enumerates all branching points where both left and right subclusters are above a user-defined size threshold. This results in a list of ‘valid’ cluster pairs. By default, SGI uses a threshold of 5% of the sample size.

(iii) Association analysis. Statistical tests are run with all clinical outcomes for all valid cluster pairs, i.e. the left subcluster is compared with the right subcluster at the respective branching point. SGI has built-in implementations for categorical outcomes (Fisher’s exact tests), continuous outcomes (two-sample t-tests) and survival outcomes (log-rank tests). The appropriate test is automatically determined by the toolbox based on the data type of the clinical variable. Furthermore, the user can define arbitrary association functions for more complex data types.

(iv) Multiple testing correction. Since a dendrogram clusters the samples into cascaded, non-overlapping groups, all statistical tests are strictly independent. Thus, SGI performs Bonferroni multiple testing correction by adjusting each P-value by a factor of the number of valid cluster pairs.
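The four steps can be sketched in a few lines of base R. This is a minimal illustration on simulated data and not the package's own implementation: the helper names `leaves` and `test_split`, the simulated outcomes `bmi` and `sex`, and the use of `>=` for the size threshold are assumptions made for the example, and survival outcomes (log-rank tests) are left out.

```r
set.seed(1)
n <- 60
x <- matrix(rnorm(n * 5), nrow = n)             # simulated "omics" matrix
x[1:30, ] <- x[1:30, ] + 3                      # plant two sample groups
outcomes <- data.frame(
  bmi = c(rnorm(30, 30, 2), rnorm(30, 24, 2)),  # continuous outcome (simulated)
  sex = factor(sample(c("f", "m"), n, replace = TRUE))  # categorical outcome
)

# (i) Clustering: any distance and linkage could be used here.
hc <- hclust(dist(x), method = "ward.D2")

# Sample indices below each internal node. In hc$merge, a negative entry is a
# single sample and a positive entry refers to an earlier row (merge step).
members <- vector("list", nrow(hc$merge))
leaves <- function(j) if (j < 0) -j else members[[j]]
for (i in seq_len(nrow(hc$merge))) {
  members[[i]] <- c(leaves(hc$merge[i, 1]), leaves(hc$merge[i, 2]))
}

# (ii) Valid cluster pairs: both subclusters reach the size threshold (5% of n).
min_size <- ceiling(0.05 * n)
valid <- Filter(function(i) {
  length(leaves(hc$merge[i, 1])) >= min_size &&
    length(leaves(hc$merge[i, 2])) >= min_size
}, seq_len(nrow(hc$merge)))

# (iii) Association test, chosen by the data type of the outcome.
test_split <- function(y, left, right) {
  if (is.numeric(y)) {
    t.test(y[left], y[right])$p.value           # two-sample t-test
  } else {
    grp <- rep(c("left", "right"), c(length(left), length(right)))
    fisher.test(table(y[c(left, right)], grp))$p.value
  }
}
res <- do.call(rbind, lapply(valid, function(i) {
  p <- sapply(outcomes, test_split,
              left = leaves(hc$merge[i, 1]), right = leaves(hc$merge[i, 2]))
  data.frame(node = i, outcome = names(p), p = unname(p))
}))

# (iv) Bonferroni: multiply by the number of valid cluster pairs, cap at 1.
res$p_adj <- pmin(1, res$p * length(valid))
res[res$p_adj <= 0.05, ]
```

Rows of `hc$merge` are numbered in merge order, so the last row (`n - 1`) is the top-level split of the tree.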

Fig. 1.

Application example. Blood metabolomics-based clustering of n = 356 participants of the QMdiab study. White circles in the tree indicate the left/right splitting points of the samples in the data (note that these are not centered if the subclusters are of unequal size). Markings on the tree indicate statistically significant associations of the parameter with the respective left and right subgroups at that split. Heatmap track below the tree shows individual values for selected parameters. Red circles between gaps indicate significant results for left versus right at that split and are horizontally aligned with their respective white circles on the tree. Bottom panel shows the metabolomics data matrix behind the clustering.

2.2 Visualization

Association results with multiple outcomes on a hierarchical tree are inherently complex to visualize. The SGI toolbox provides a variety of dedicated functions to visually inspect the statistical associations. These include tree visualizations of all statistically significant outcome associations, displayed at the respective branching points, and heatmaps of the actual data (Fig. 1). Moreover, the user can generate plots to inspect specific associations, e.g. boxplots of a quantitative clinical outcome between two clusters. This allows the user to obtain a quick overview of the correlation structure between the input omics dataset, the resulting patient groups and the clinical features that are analyzed. The simultaneous visualization of all data also allows the user to dissect the relationship between confounding variables, avoiding the predefined choice of a list of confounder variables to correct for.
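As an illustration of inspecting a single association, the following base-R sketch draws a boxplot of a continuous outcome for the two clusters at the top split of a tree. It uses simulated data and base graphics only; it does not use the SGI plotting functions, and the variable names (`hba1c`, `grp`) and cluster labels are hypothetical.

```r
set.seed(2)
# Simulated omics matrix with two planted groups of 20 samples each
x <- rbind(matrix(rnorm(100, mean = 0), nrow = 20),
           matrix(rnorm(100, mean = 3), nrow = 20))
# Simulated continuous clinical outcome that differs between the groups
hba1c <- c(rnorm(20, 5.5, 0.4), rnorm(20, 7.5, 0.8))

hc <- hclust(dist(x), method = "ward.D2")

# cutree(k = 2) returns the two subclusters at the top-level branching point
grp <- factor(cutree(hc, k = 2), labels = c("cluster A", "cluster B"))

boxplot(hba1c ~ grp,
        xlab = "Subcluster at top split",
        ylab = "HbA1c (simulated)")
```

The same pattern applies to deeper splits by restricting the samples to one branch before cutting again.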

2.3 Multi-omics datasets

The SGI package provides clustering functionality for the analysis of multi-omics datasets, i.e. datasets where more than one omics layer has been measured for the same samples. To this end, SGI generates a joint samples × samples distance matrix from the individual distance matrices of each omics layer. Since different omics layers will have varying numbers of variables, the respective distance values are not at comparable scales. The toolbox thus normalizes each individual distance matrix by its maximum and defines $D = \sum_{i=1}^{m} D_i / \max(D_i)$, with $D$ representing the final distance matrix, $D_1$, $D_2$, etc. the original distance matrices and $m$ the number of omics datasets. The approach was adapted from Chavent et al., where it was originally introduced to generate a Ward-like clustering. Notably, the method also works for the normalization of multi-omics contributions for other types of linkages, such as average linkage and complete linkage, and works with any distance metric. A detailed example of the multi-omics capabilities of SGI on combined plasma, urine and saliva metabolomics data can be found in the example R code of the GitHub repository.
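The normalization can be sketched in base R as follows. This is an illustration under stated assumptions rather than the package's own code: the three layers are simulated, the object names (`layers`, `d_norm`, `d_joint`) are hypothetical, and the joint matrix is assumed to be the plain sum of the max-normalized distance matrices.

```r
set.seed(3)
n <- 30
# Three simulated omics layers on the same n samples, with different
# numbers of variables and therefore different distance scales
layers <- list(
  plasma = matrix(rnorm(n * 200), nrow = n),
  urine  = matrix(rnorm(n * 20),  nrow = n),
  saliva = matrix(rnorm(n * 50),  nrow = n)
)

# One samples x samples distance matrix per layer
d_list <- lapply(layers, function(m) as.matrix(dist(m)))
sapply(d_list, max)                 # raw maxima differ strongly between layers

# Normalize each matrix by its maximum, then combine (assumed here: sum)
d_norm  <- lapply(d_list, function(d) d / max(d))
d_joint <- Reduce(`+`, d_norm)

# The joint matrix can be clustered with any linkage
hc <- hclust(as.dist(d_joint), method = "average")
```

After normalization every layer contributes distances in the range 0 to 1, so a layer with many variables no longer dominates the joint matrix.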

3 Application examples

We demonstrate the functionality of the SGI package on a plasma metabolomics dataset from the ‘QMdiab’ diabetes case/control study with 356 participants (Do et al.; Mook-Kanamori et al.). We chose Euclidean distance with Ward linkage for hierarchical clustering. Outcome parameters were type 2 diabetes diagnosis and nine anthropometric and clinical biochemistry parameters: age, sex, BMI, HbA1c, albumin, hemoglobin, LDL cholesterol, total cholesterol and skin auto-fluorescence (AF score) (Mook-Kanamori et al.). The goal was to determine clusters of study participants defined by their profiles of circulating metabolites, and how these metabolomic clusters correlate with the different clinical parameters. The resulting visualization is shown in Figure 1. The following lines of code directly generate that plot using the SGI package:

    library(sgi)

    # hierarchical clustering
    hc = hclust(dist(sgi::qmdiab_plasma), method = "ward.D2")

    # initialize SGI structure
    sg = sgi_init(hc, outcomes = sgi::qmdiab_clin)

    # run SGI
    as = sgi_run(sg)

    # generate tree plot, show results for adjusted p-values <= 0.05
    (gg_tree = plot(as, p_th = 0.05))

    # plot overview, including clinical data and metabolomics data matrix
    plot_overview(gg_tree = gg_tree, as = as,
                  outcomes = sgi::qmdiab_clin,
                  xdata = sgi::qmdiab_plasma)

In this R code, sgi::qmdiab_plasma and sgi::qmdiab_clin are data frames holding the metabolomics and clinical variables, respectively. These data frames are contained in the package.

The tree shows how metabolomic profiles separate study participants into two major groups at the top level of the tree (clusters 2 versus 3), one with a higher proportion of males, higher BMI, higher HDL-C, as well as higher AF score, and the other group with the reversed effect directions. Inside those two groups, further subgroups were identified; for example, two clusters with different proportions of diabetes, which also associate with HbA1c, age and further diabetes-related risk factors (clusters 4 versus 5).
In addition to the diabetes showcase, we ran SGI on two datasets from The Cancer Genome Atlas, in order to assess whether the algorithm can reproduce known molecular subtypes of cancer. In the first example, we recovered IDH-related mutational subgroups (Ceccarelli et al.; Sanchez-Vega et al.) using SGI on copy number variation data from low-grade glioma samples (Supplementary Material S1). The second example demonstrates how SGI identifies previously described subgroups in uterine corpus endometrial carcinoma samples (Levine et al.; Sanchez-Vega et al.), where both our groups and the originally reported groups were based on copy number variation measurements (Supplementary Material S2).

4 Conclusion

SGI provides a flexible, unbiased and data-driven way to automatically identify sample subgroups in omics profiles. It identifies and visualizes complex, hierarchical relationships for an arbitrary number of clinical outcomes in a visually intuitive way. The toolbox is easy to use, open source and comes with a series of examples in the online repository.

15 in total

1. Breast cancer molecular subtypes respond differently to preoperative chemotherapy.

Authors: Roman Rouzier; Charles M Perou; W Fraser Symmans; Nuhad Ibrahim; Massimo Cristofanilli; Keith Anderson; Kenneth R Hess; James Stec; Mark Ayers; Peter Wagner; Paolo Morandi; Chang Fan; Islam Rabiul; Jeffrey S Ross; Gabriel N Hortobagyi; Lajos Pusztai
Journal: Clin Cancer Res Date: 2005-08-15 Impact factor: 12.531

Review 2. ABC, GCB, and Double-Hit Diffuse Large B-Cell Lymphoma: Does Subtype Make a Difference in Therapy Selection?

Authors: Grzegorz S Nowakowski; Myron S Czuczman
Journal: Am Soc Clin Oncol Educ Book Date: 2015

Review 3. Integrated multi-omics approaches to improve classification of chronic kidney disease.

Authors: Sean Eddy; Laura H Mariani; Matthias Kretzler
Journal: Nat Rev Nephrol Date: 2020-05-18 Impact factor: 28.314

Review 4. Precision medicine and phenotypes, endotypes, genotypes, regiotypes, and theratypes of allergic diseases.

Authors: Ioana Agache; Cezmi A Akdis
Journal: J Clin Invest Date: 2019-03-11 Impact factor: 14.808

5. Similarity network fusion for aggregating data types on a genomic scale.

Authors: Bo Wang; Aziz M Mezlini; Feyyaz Demir; Marc Fiume; Zhuowen Tu; Michael Brudno; Benjamin Haibe-Kains; Anna Goldenberg
Journal: Nat Methods Date: 2014-01-26 Impact factor: 28.547

6. 1,5-Anhydroglucitol in saliva is a noninvasive marker of short-term glycemic control.

Authors: Dennis O Mook-Kanamori; Mohammed M El-Din Selim; Ahmed H Takiddin; Hala Al-Homsi; Khoulood A S Al-Mahmoud; Amina Al-Obaidli; Mahmoud A Zirie; Jillian Rowe; Noha A Yousri; Edward D Karoly; Thomas Kocher; Wafaa Sekkal Gherbi; Omar M Chidiac; Marjonneke J Mook-Kanamori; Sara Abdul Kader; Wadha A Al Muftah; Cindy McKeon; Karsten Suhre
Journal: J Clin Endocrinol Metab Date: 2014-01-01 Impact factor: 5.958

7. Molecular Profiling Reveals Biologically Discrete Subsets and Pathways of Progression in Diffuse Glioma.

Authors: Michele Ceccarelli; Floris P Barthel; Tathiane M Malta; Thais S Sabedot; Sofie R Salama; Bradley A Murray; Olena Morozova; Yulia Newton; Amie Radenbaugh; Stefano M Pagnotta; Samreen Anjum; Jiguang Wang; Ganiraju Manyam; Pietro Zoppoli; Shiyun Ling; Arjun A Rao; Mia Grifford; Andrew D Cherniack; Hailei Zhang; Laila Poisson; Carlos Gilberto Carlotti; Daniela Pretti da Cunha Tirapelli; Arvind Rao; Tom Mikkelsen; Ching C Lau; W K Alfred Yung; Raul Rabadan; Jason Huse; Daniel J Brat; Norman L Lehman; Jill S Barnholtz-Sloan; Siyuan Zheng; Kenneth Hess; Ganesh Rao; Matthew Meyerson; Rameen Beroukhim; Lee Cooper; Rehan Akbani; Margaret Wrensch; David Haussler; Kenneth D Aldape; Peter W Laird; David H Gutmann; Houtan Noushmehr; Antonio Iavarone; Roel G W Verhaak
Journal: Cell Date: 2016-01-28 Impact factor: 41.582

8. Integrating different data types by regularized unsupervised multiple kernel learning with application to cancer subtype discovery.

Authors: Nora K Speicher; Nico Pfeifer
Journal: Bioinformatics Date: 2015-06-15 Impact factor: 6.937

9. Ethnic and gender differences in advanced glycation end products measured by skin auto-fluorescence.

Authors: Marjonneke J Mook-Kanamori; Mohammed M El-Din Selim; Ahmed H Takiddin; Hala Al-Homsi; Khoulood A S Al-Mahmoud; Amina Al-Obaidli; Mahmoud A Zirie; Jillian Rowe; Wafaa Sekkal Gherbi; Omar M Chidiac; Sara Abdul Kader; Wadha A Al Muftah; Cindy McKeon; Karsten Suhre; Dennis O Mook-Kanamori
Journal: Dermatoendocrinol Date: 2013-04-01

10. MoDentify: phenotype-driven module identification in metabolomics networks at different resolutions.

Authors: Kieu Trinh Do; David J N-P Rasp; Gabi Kastenmüller; Karsten Suhre; Jan Krumsiek
Journal: Bioinformatics Date: 2019-02-01 Impact factor: 6.937