
hacksig: a unified and tidy R framework to easily compute gene expression signature scores.

Andrea Carenzo, Federico Pistore, Mara S Serafini, Deborah Lenoci, Armando G Licata, Loris De Cecco.

Abstract

SUMMARY: Hundreds of gene expression signatures have been developed during the last two decades. However, due to the multitude of development procedures and sometimes a lack of explanation for their implementation, it can become challenging to apply the original method on custom data. Moreover, at present, there is no unified and tidy interface to compute signature scores with different single sample enrichment methods. For these reasons, we developed hacksig, an R package intended as a unified framework to obtain single sample scores with a tidy output as well as a collection of manually curated gene signatures and methods from cancer transcriptomics literature.
AVAILABILITY AND IMPLEMENTATION: The hacksig R package is freely available on CRAN (https://CRAN.R-project.org/package=hacksig) under the MIT license. The source code can be found on GitHub at https://github.com/Acare/hacksig.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) 2022. Published by Oxford University Press.

Year:  2022        PMID: 35561166      PMCID: PMC9113261          DOI: 10.1093/bioinformatics/btac161

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.931


1 Introduction

A gene signature can be defined as a set of genes sharing a common pattern of expression in relation to a certain phenotype or biological process (Cantini et al., 2017). Over the years, several enrichment methods have been developed for microarray and RNA-seq data in order to summarize the information coming from gene sets into a single score. This can lead to a more meaningful interpretation of results, from both a biological and a clinical point of view, and it reduces the effects of the curse of dimensionality affecting genomic studies (i.e. many more variables than samples). Gene signature scoring methods can follow a number of approaches: for example, averaging the expression values of the genes in a signature (Rooney et al., 2015), or fitting a penalized regression model and then computing single sample scores as a weighted sum of the fitted model coefficients and the gene expression values (De Cecco et al.). However, it is not always straightforward to apply gene signature scoring methods from the literature directly to custom data. Sometimes details about how a method is implemented are vague and open to interpretation. Other times gene identifiers and any model coefficients must be extracted manually from .pdf files or even from images. Computing signature scores with the original publication method can therefore be time-consuming even in the best-case scenario.

Gene expression signature scores can be derived using either the original publication procedure or one of five single sample enrichment methods, all of which are collected into two distinct R packages: GSVA, which implements four methods, and singscore, which computes enrichment scores with the procedure of the same name (Hänzelmann et al., 2013; Foroutan et al., 2018). The interfaces of these R packages differ from each other and are designed to work primarily within the Bioconductor ecosystem (Huber et al.). Hence, neither of them provides a tidy output as intended by Wickham, that is, a consistent format for downstream data analysis steps such as data visualization with ggplot2 and modeling with tidymodels. Herein, we propose the R package hacksig to address the above-mentioned issues, namely to: (i) compute single sample scores for both custom and manually curated gene expression signatures, either with the original publication method or with three alternatives, namely the combined z-score, single sample GSEA (ssGSEA) and singscore (Lee et al., 2008; Barbie et al., 2009; Foroutan et al., 2018); and (ii) provide a unified, simple interface and a tidy output.
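
As a rough illustration of the two scoring approaches mentioned in the Introduction (averaging and weighted sums), the following base R sketch computes both on a toy matrix. The gene names, the signature and the coefficients are made up for the example, and the arithmetic mean is a simplification: individual publications may use other summaries or additional preprocessing.

```r
# Toy normalized expression matrix: genes as rows, samples as columns.
set.seed(1)
expr <- matrix(
  rnorm(5 * 3),
  nrow = 5,
  dimnames = list(paste0("GENE", 1:5), paste0("sample", 1:3))
)

# Hypothetical signature: gene symbols and made-up model coefficients.
sig_genes <- c("GENE1", "GENE3", "GENE4")
sig_coefs <- c(GENE1 = 0.8, GENE3 = -0.5, GENE4 = 1.2)

# Approach 1: average expression of the signature genes in each sample.
mean_score <- colMeans(expr[sig_genes, , drop = FALSE])

# Approach 2: weighted sum of coefficients and expression values,
# i.e. t(expression) %*% coefficients, returned as a named vector.
weighted_score <- drop(
  crossprod(expr[names(sig_coefs), , drop = FALSE], sig_coefs)
)
```

Both results are named numeric vectors with one score per sample.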

2 Features and methods

The current release of hacksig includes 23 cancer transcriptomics gene signatures, which were selected mainly due to our interest in the biology of the tumor microenvironment and its possible influence on response to treatment in head and neck cancer patients (Van den Bossche et al., 2022). The function get_sig_info() can be used to retrieve IDs, associated keywords, DOIs for the original publication and a brief description for each signature (see also Supplementary Table S1). Most of the package functions require a normalized gene expression matrix as the primary input argument, with genes as rows and samples as columns. Hence, both microarray and RNA-seq normalized data are supported. A list of the official HUGO gene symbols composing each implemented signature can be obtained with get_sig_genes().
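
A minimal sketch of the expected input layout follows. The expression values are invented, and the two hacksig calls are executed only if the package is installed; their output is not reproduced here.

```r
# Toy normalized expression matrix in the layout described above:
# gene symbols as row names, sample IDs as column names (values are made up).
expr <- matrix(
  c(5.1, 6.3, 4.8,
    2.2, 2.9, 3.1,
    7.4, 7.0, 6.6),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("CD8A", "GZMA", "PRF1"), c("s1", "s2", "s3"))
)

# Basic sanity checks before scoring: numeric values, unique gene symbols.
stopifnot(
  is.numeric(expr),
  !is.null(rownames(expr)),
  anyDuplicated(rownames(expr)) == 0
)

# Runs only when hacksig is available.
if (requireNamespace("hacksig", quietly = TRUE)) {
  sig_info  <- hacksig::get_sig_info()   # IDs, keywords, DOIs, descriptions
  sig_genes <- hacksig::get_sig_genes()  # gene symbols per implemented signature
}
```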

2.1 The main function

The most important function of the package is hack_sig(), which can be used to easily compute gene expression signature scores in a number of ways. Figure 1A summarizes the syntax and different choices for arguments of the function.
Fig. 1.

Syntax and output for the main function hack_sig(). (A) Possible choices for its three main arguments expr_data, signatures and method are shown. (B) An example output resulting from choosing arguments in bold is shown (i.e. a custom list of gene sets and the combined z-score method). The ellipsis represents additional arguments controlling options for the enrichment methods

If only an expression matrix is given as input, then hack_sig() will compute scores with the original procedure for all the implemented signatures, except those related to CINSARC (Chibon et al., 2010), ESTIMATE (Yoshihara et al.) and Immunophenoscore (Charoentong et al., 2017), for which dedicated functions exist (see next section). Signature scores can also be derived with one of the three single sample enrichment methods by setting the method argument to one of ‘zscore’, ‘ssgsea’ or ‘singscore’, corresponding to the combined z-score, ssGSEA and singscore, respectively. This will cause hack_sig() to compute enrichment scores with that particular procedure for all the implemented signatures. In addition, other optional arguments regarding single sample methods can be modified, such as the exponent in the running sum statistic of ssGSEA or its type of normalization.

It is possible to select just a particular group of signatures, or to compute scores for a custom list of gene sets, by means of the signatures argument, which is set to ‘all’ (i.e. all the implemented signatures) by default. If signatures is a character vector with one or more valid keywords, hack_sig() will compute scores only for signatures matching those strings, either with the original procedure or with one of the three single sample alternatives, depending on the choice of method. If signatures is a custom list of gene sets, then hack_sig() will compute scores with the procedure specified in the method argument, which cannot be set to ‘original’ in this case. If method is not specified, raw ssGSEA scores will be obtained by default for custom gene sets. In general, the result of calling hack_sig() will be a tibble (i.e. a modern reimagining of the R data.frame class) with one row per sample, a column indicating sample IDs and one column for each considered gene signature giving the corresponding scores (Fig. 1B).
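
To make the combined z-score and the shape of the output concrete, here is a base R sketch, not the package's own code. It assumes the usual definition of the combined z-score: each gene is standardized across samples, and the z-scores of the genes in a set are summed and divided by the square root of the set size. The function name combined_zscore() is hypothetical, and a plain data.frame stands in for the tibble to avoid extra dependencies.

```r
# Hypothetical helper, not part of hacksig.
combined_zscore <- function(expr, gene_sets) {
  # scale() standardizes columns, so transpose to standardize each gene.
  z <- t(scale(t(expr)))
  scores <- lapply(gene_sets, function(genes) {
    present <- intersect(genes, rownames(z))
    # Sum of z-scores over the genes found, scaled by sqrt(set size).
    colSums(z[present, , drop = FALSE]) / sqrt(length(present))
  })
  # One row per sample, an ID column, one column per gene set.
  data.frame(sample_id = colnames(expr), scores,
             row.names = NULL, check.names = FALSE)
}

set.seed(42)
expr <- matrix(rnorm(6 * 4), nrow = 6,
               dimnames = list(paste0("G", 1:6), paste0("s", 1:4)))
gene_sets <- list(sig_a = c("G1", "G2", "G3"), sig_b = c("G4", "G6"))

scores <- combined_zscore(expr, gene_sets)

# The analogous package call would be (not run here, and not checked to be
# numerically identical to the sketch above):
# hacksig::hack_sig(expr, signatures = gene_sets, method = "zscore")
```

Genes with zero variance across samples produce NaN z-scores in this sketch, and a gene set with no genes in the matrix yields NaN scores, so both cases would need handling on real data.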

2.2 Other features

Although hack_sig() can be used to compute scores for most of the gene signatures included in the package, there are three methods which, in our view, deserve their own function implementation. These are hack_cinsarc(), which implements the CINSARC classification (Chibon et al., 2010; Lesluyes and Chibon, 2020); hack_estimate(), which computes the immune, stroma, ESTIMATE and tumor purity scores as in Yoshihara et al.; and hack_immunophenoscore(), which gives immune marker scores together with the Immunophenoscore (Charoentong et al., 2017). Before computing enrichment scores, it is good practice to always check whether the genes composing a signature are well represented in the expression matrix. For this reason, we developed check_sig(), a function that returns, for every input signature, the count and proportion of its genes that are present in a gene expression matrix, as well as any missing genes. Finally, the package supports the future framework to parallelize and speed up computations, either on a local machine or on a computer cluster (Bengtsson, 2021). More details about the usage of hacksig are reported in the package vignette, available either on CRAN or from within R.
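
The coverage check described above can be sketched in base R as follows. The function signature_coverage() and its column names are hypothetical and do not reproduce the output format of the package function; the sketch only illustrates the idea of counting present and missing genes per signature.

```r
# Hypothetical helper, not part of hacksig.
signature_coverage <- function(expr, gene_sets) {
  rows <- lapply(names(gene_sets), function(id) {
    genes <- unique(gene_sets[[id]])
    missing <- setdiff(genes, rownames(expr))
    n_present <- length(genes) - length(missing)
    data.frame(
      signature_id  = id,
      n_genes       = length(genes),
      n_present     = n_present,
      frac_present  = n_present / length(genes),
      missing_genes = paste(missing, collapse = ", ")
    )
  })
  do.call(rbind, rows)
}

# Toy data: four genes measured, one signature gene ("X9") absent.
expr <- matrix(1:8, nrow = 4,
               dimnames = list(paste0("G", 1:4), c("s1", "s2")))
gene_sets <- list(sig_a = c("G1", "G2", "X9"), sig_b = c("G3", "G4"))

coverage <- signature_coverage(expr, gene_sets)
```

A low proportion of present genes suggests checking gene identifier conventions (e.g. outdated symbols) before trusting the resulting scores.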

3 Conclusions and future perspectives

The R package hacksig offers a tidy and unified framework aimed at simplifying the computation of gene signature scores with either the original methods or three single sample alternatives. We acknowledge that our implementations of enrichment methods using ranks (i.e. ssGSEA and singscore) are slower than those in the GSVA and singscore packages (see Supplementary Material). Parallelization through the R package future is supported and can decrease computation time. Nonetheless, the code for some functions might be rewritten to improve performance further. More features are planned, and we encourage users of the package to open an issue on GitHub for any signature or method they would like to see implemented.
References

1.  Validated prediction of clinical outcome in sarcomas and multiple types of cancer on the basis of a gene expression signature related to genome complexity.

Authors:  Frédéric Chibon; Pauline Lagarde; Sébastien Salas; Gaëlle Pérot; Véronique Brouste; Franck Tirode; Carlo Lucchesi; Aurélien de Reynies; Audrey Kauffmann; Binh Bui; Philippe Terrier; Sylvie Bonvalot; Axel Le Cesne; Dominique Vince-Ranchère; Jean-Yves Blay; Françoise Collin; Louis Guillou; Agnès Leroux; Jean-Michel Coindre; Alain Aurias
Journal:  Nat Med       Date:  2010-06-27       Impact factor: 53.440

2.  Molecular and genetic properties of tumors associated with local immune cytolytic activity.

Authors:  Michael S Rooney; Sachet A Shukla; Catherine J Wu; Gad Getz; Nir Hacohen
Journal:  Cell       Date:  2015-01-15       Impact factor: 41.582

3.  Pan-cancer Immunogenomic Analyses Reveal Genotype-Immunophenotype Relationships and Predictors of Response to Checkpoint Blockade.

Authors:  Pornpimol Charoentong; Francesca Finotello; Mihaela Angelova; Clemens Mayer; Mirjana Efremova; Dietmar Rieder; Hubert Hackl; Zlatko Trajanoski
Journal:  Cell Rep       Date:  2017-01-03       Impact factor: 9.423

4.  Microenvironment-driven intratumoral heterogeneity in head and neck cancers: clinical challenges and opportunities for precision medicine.

Authors:  Valentin Van den Bossche; Hannah Zaryouh; Marianela Vara-Messler; Julie Vignau; Jean-Pascal Machiels; An Wouters; Sandra Schmitz; Cyril Corbet
Journal:  Drug Resist Updat       Date:  2022-01-25       Impact factor: 18.500

5.  A Global and Integrated Analysis of CINSARC-Associated Genetic Defects.

Authors:  Tom Lesluyes; Frédéric Chibon
Journal:  Cancer Res       Date:  2020-10-06       Impact factor: 12.701

6.  Classification of gene signatures for their information value and functional redundancy.

Authors:  Laura Cantini; Laurence Calzone; Loredana Martignetti; Mattias Rydenfelt; Nils Blüthgen; Emmanuel Barillot; Andrei Zinovyev
Journal:  NPJ Syst Biol Appl       Date:  2017-12-19

7.  Single sample scoring of molecular phenotypes.

Authors:  Momeneh Foroutan; Dharmesh D Bhuva; Ruqian Lyu; Kristy Horan; Joseph Cursons; Melissa J Davis
Journal:  BMC Bioinformatics       Date:  2018-11-06       Impact factor: 3.169

8.  Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1.

Authors:  David A Barbie; Pablo Tamayo; Jesse S Boehm; So Young Kim; Susan E Moody; Ian F Dunn; Anna C Schinzel; Peter Sandy; Etienne Meylan; Claudia Scholl; Stefan Fröhling; Edmond M Chan; Martin L Sos; Kathrin Michel; Craig Mermel; Serena J Silver; Barbara A Weir; Jan H Reiling; Qing Sheng; Piyush B Gupta; Raymond C Wadlow; Hanh Le; Sebastian Hoersch; Ben S Wittner; Sridhar Ramaswamy; David M Livingston; David M Sabatini; Matthew Meyerson; Roman K Thomas; Eric S Lander; Jill P Mesirov; David E Root; D Gary Gilliland; Tyler Jacks; William C Hahn
Journal:  Nature       Date:  2009-10-21       Impact factor: 49.962

9.  GSVA: gene set variation analysis for microarray and RNA-seq data.

Authors:  Sonja Hänzelmann; Robert Castelo; Justin Guinney
Journal:  BMC Bioinformatics       Date:  2013-01-16       Impact factor: 3.169

10.  Inferring pathway activity toward precise disease classification.

Authors:  Eunjung Lee; Han-Yu Chuang; Jong-Won Kim; Trey Ideker; Doheon Lee
Journal:  PLoS Comput Biol       Date:  2008-11-07       Impact factor: 4.475
