Literature DB >> 26448683

TIN: An R Package for Transcriptome Instability Analysis.

Bjarne Johannessen¹, Anita Sveen¹, Rolf I Skotheim².

Abstract

Alternative splicing is a key regulatory mechanism for gene expression, vital for the proper functioning of eukaryotic cells. Disruption of normal pre-mRNA splicing has the potential to cause and reinforce human disease. Owing to rapid advances in high-throughput technologies, it is now possible to identify novel mRNA isoforms and detect aberrant splicing patterns on a genome scale, across large data sets. Analogous to the genomic types of instability describing cancer genomes (eg, chromosomal instability and microsatellite instability), transcriptome instability (TIN) has recently been proposed as a splicing-related genome-wide characteristic of certain solid cancers. We present the R package TIN, available from Bioconductor, which implements a set of methods for TIN analysis based on exon-level microarray expression profiles. TIN provides tools for estimating aberrant exon usage across samples and for analyzing correlation patterns between TIN and splicing factor expression levels.

Entities: Chemical Disease Gene Species

Keywords: R software; alternative splicing; exon microarray; splicing factor; transcriptome instability

Year: 2015 PMID： 26448683 PMCID： PMC4578549 DOI： 10.4137/CIN.S31363

Source DB: PubMed Journal: Cancer Inform ISSN： 1176-9351

Introduction

Cancers often harbor genomic types of instability, including chromosomal instability and microsatellite instability. However, cancer-associated variation may occur at several levels of gene regulation and, in particular, the processing of pre-mRNA into mature mRNAs is important for proper protein synthesis and cell function. Alternative pre-mRNA splicing is a major source of genetic variation in human beings, and disruption of the splicing process may cause cancer.1,2 An improved understanding of the mechanisms that cause such structural transcript variation may provide important insights into disease development and progression. Alternative splicing is regulated by splicing factors, proteins that remove certain introns from the pre-mRNA, thereby joining the exons of the mRNA together. We have recently described transcriptome instability (TIN) in cancer, a genome-wide characteristic defined by the amounts of aberrant exon usage per sample, and shown that this is strongly and nonrandomly associated with splicing factor expression levels in several cancer types.3,4 High-resolution microarrays allow for genome-wide expression profiling at the exon level, enabling the detection of alternative splicing across a large series of samples. Here, we describe TIN, an R package enabling analysis of TIN from expression data obtained by Affymetrix Human Exon 1.0 ST Arrays. A major challenge in large-scale data analysis is reproducibility. With this aim, the TIN package consists of a set of unambiguous procedures that use raw expression data (cell intensity [CEL] files) as input, which are readily accessible and easy to extend. Information on how to install the package is provided in the Supplementary File.

Methods

The TIN software package is a collection of R modules that make use of the aroma.affymetrix5 framework to analyze exon-level expression data. Starting from raw CEL files, the TIN tool applies the Finding Isoforms using Robust Multichip Analysis (FIRMA) method6 for preprocessing and alternative splicing detection. The FIRMA method is an extension of the robust multichip analysis (RMA) approach7 that not only estimates expression levels but also detects alternative splicing patterns between samples. Using the FIRMA method, the first two pre-processing steps, background correction of perfect match probes and inter-chip quantile normalization, are performed in concordance with standard RMA procedures. For the summarization step, a more general model that includes the relative change for each sample in a particular exon is introduced in the FIRMA approach to allow for alternative splicing or different levels of expression for each exon along the gene. For each exon sample combination, the FIRMA method calculates alternative splicing scores, FIRMA scores, based on whether the probes systematically deviate from the expected gene expression level. Thus, the FIRMA scores are a measure of the relative ratio between exon expression level and corresponding gene expression level. Strong positive and negative scores are indicative of differential exon inclusion and skipping, respectively. The main idea is to test the association between splicing factor expression levels and the amounts of aberrant exon usage among samples. Sample-wise total relative amounts of aberrant exon usage are recorded from exons with FIRMA scores exceeding user-defined thresholds, and the correlation between aberrant exon usage amounts and splicing factor expression levels is tested across all samples. Two methods are implemented for testing if the correlation between sample-wise aberrant exon usage amounts and splicing factor expression levels is stronger than expected by chance. First, permutations of the FIRMA scores are done for each probe set/exon across all samples, and the sample-wise amounts of aberrant exon usage are recalculated based on the permutations. If the correlation between the aberrant exon usage amounts and splicing factor expression levels is considerably lower when based on permutations compared to the original FIRMA scores, it is an indication of splicing factor expression having impact on the aberrant exon usage in the samples. Second, correlation is tested using a number of miscellaneous gene sets instead of the original set of 280 splicing factor genes. Equivalently, poorer correlation for random gene sets compared to the splicing factor set can be considered an indication that the aberrant exon usage to some extent was attributable to the expression levels of the splicing factor genes. An overview of the pipeline is outlined in Figure 1.

Figure 1

Pipeline to investigate TIN in tumor samples based on exon-level microarray data. CEL files with raw expression data is taken as input, along with gene-level expression data. The FIRMA algorithm is used to identify exon skipping and inclusion events, and user-defined thresholds (such as the upper and lower first percentile) are used for denoting exons as aberrantly spliced. The correlation between aberrant exon usage and splicing factor gene expression is evaluated and tested against random associations in two ways. First, the correlation step is carried out using permutations of the expression data at each probe set. Second, the correlation is calculated using random gene sets instead of known splicing factor genes.

Example

Five R data sets are included in the package. By issuing the following commands: data(splicingFactors) data(geneSets) data(geneAnnotation), data.frames with the three sets of data will become available. The first object is a comprehensive list of 280 splicing factor genes created by combining search results from several public annotation databases.3 Second, one of the major collections of gene sets in the Molecular Signatures Database, MSigDB,8 comprising 1,454 Gene Ontology gene sets, is included to see if the association between aberrant exon usage and gene expression levels is different in the splicing factor gene set compared to more general gene sets. Third, a list of matching gene symbols and Affymetrix transcript cluster identifiers for the full genome (core set of human genes) are provided in the annotation data set to provide easy access and enable generation of new gene sets. The main purpose behind the TIN package is to facilitate reproducibility through a consistent set of algorithms, which may be applied on real-world data; however, for educational purposes, a small toy data set is embedded in the release. Thus, preprocessed FIRMA scores for 16 samples and 10,000 randomly selected probe sets are included in the sampleSetFirmaScores object. Equivalently, gene-level expression data for the same 16 samples across the core set of human genes are provided through the sampleSetGeneSummaries object. Summary files for real gene-level expression data can be generated by using, for instance, Affymetrix Power Tools or Expression Console prior to applying the TIN package. The analysis pipeline is outlined in the following example, with expression data from 131 prostate cancers.9 The data set is publicly available from NCBI’s Gene Expression Omnibus (GEO; accession number GSE21034). fs <- firmaAnalysis(useToyData = FALSE, aromaPath = “/path/to/aroma.affymetrix”, dataSetName = “Prostate”) gs <- readGeneSummaries(useToyData = FALSE, summaryFile = “/path/to/prostate-gene-level-summary. Txt”) To use the small toy data set supplied with the package instead, load the sample data by issuing the following two commands data(sampleSetFirmaScores) data(sampleSetGeneSummaries, and copy the two objects into the fs and gs variables, respectively. tra <- aberrantExonUsage(1.0, fs) perms <- probesetPermutations(fs, quantiles) corr <- correlation(splicingFactors, gs, tra) gsc <- geneSetCorrelation(geneSets, geneAnnotation, gs, tra, 100) In the example, the lower and upper 1st percentiles are used as threshold values to score exons with deviating skipping or inclusion (Fig S1). Information on where to find documentation of the different functions is provided in the Supplementary File. Having performed FIRMA analysis and entered gene-level expression values, sample-wise amounts of aberrant exon usage are calculated. Pearson correlation between relative amounts of aberrant exon usage and splicing factor expression is obtained using tools from the WGCNA package (Bioconductor).10 To assess the association, correlation is also calculated for random permutations of the FIRMA scores at each probe set and for random sets of genes.

Visualization

The TIN package implements four different methods for visualizing the results (Fig. 2). First, a scatter plot visualizes sample-wise relative amounts of aberrant exon inclusion vs exon exclusion and optionally includes amounts calculated from random permutations of the FIRMA scores for comparison (Fig. 2A). Second, the package includes a function for comparing sample-wise correlation between splicing factor gene expression and total relative amounts of aberrant exon usage, with correlations obtained either by making permutations of the sample-wise amounts of aberrant exon usage or by using randomly generated gene sets (Fig. 2B). Third, a scatter plot that compares the amount of splicing factor genes for which expression levels are significantly positively and negatively correlated with the total relative amounts of aberrant exon usage per sample is created (Fig. 2C). This plot may also include results based on permutations of the sample-wise aberrant exon usage amounts and randomly constructed gene sets. In addition, a function for hierarchical clustering of the samples based on splicing factor expression levels is included to test for separation of samples according to aberrant exon usage amounts (Fig. 2D). Example commands for creating visualization plots are outlined below:

Figure 2

(A) Sample-wise relative amounts (blue dots) of aberrant exon inclusion (horizontal axis) and exon skipping (vertical axis) events for the 131 prostate cancers in the worked example, compared to random sample-wise amounts calculated from permuted FIRMA scores (yellow dots). (B) Correlation between estimated aberrant exon usage and splicing factor expression compared with random gene sets and permuted TIN-estimates. In the example cancer dataset, 195 of the 280 (70%) splicing factor genes had expression levels that were significantly correlated (P < 0.05; Pearson correlation; red dot; horizontal axis). This is more than expected by chance, as compared with first making 1,000 random permutations of the TIN-estimates (bar graphs in dark blue) and second by selecting 1,000 random sets of 280 genes (bar graphs in light blue). (C) Negative correlation between TIN-estimates and splicing factor expression in the example prostate cancer dataset. Inverse relationship with strong associations between TIN-estimates and expression levels of splicing factors (n = 280), with a much higher percentage of significantly negatively (horizontal axes) than positively (vertical axes) correlated splicing factor genes (red). The shift was higher than expected by chance, as demonstrated by comparing first with each of 1,000 permutations of the TIN-estimates (dark blue) and second with genes in each of 1,000 random sets of 280 genes (light blue). (D) Unsupervised hierarchical clustering analysis (Euclidean distance metrics; complete linkage) of all the 131 samples based on the expression levels of all 280 splicing factor genes. The example prostate series is separated into clusters with some samples having predominantly lower (blue) or higher (red) relative amounts of deviating exon usage than the more average sample (black).

scatterPlot(“scatter.png”, TRUE, hits, perms) correlationPlot(“correlation.png”, tra, gs, splicing-Factors, 1000, 1000) posNegCorrPlot(“posNegCorr.png”, tra, gs, splicing-Factors, 1000, 1000) clusterPlot(gs, tra, “euclidean”, “complete”, “cluster. png”) Further instructions on parameter usage and how the methods work are provided in the accompanying vignette and documentation of the package.

Results and Discussion

We have developed the TIN package (Bioconductor) to analyze TIN in cancer or other disease conditions from exon-level microarray data. By using computational tools already available to create algorithms for analyzing TIN, the package offers a framework for calculating and visualizing the correlation between sample-wise aberrant exon usage amounts and expression levels of multiple gene sets, including splicing factors. The R software has been applied to expression data from different cancer types, and we have shown that TIN is a common feature of several types of solid cancer. In most cancer types studied, we found strong and nonrandom (P < 0.001) correlations between the estimated aberrant exon usage and the expression levels of splicing factor genes.4 When analyzing multiple data sets, it is of great importance to be able to repeat and standardize computational methodology. The TIN package facilitates reproducibility through an unambiguous analysis pipeline. Supplementary File. This file contains installation guidelines, links to documentation for the TIN package, and Figure S1, a distribution plot of all FIRMA scores in the worked example.

9 in total

1. Exploration, normalization, and summaries of high density oligonucleotide array probe level data.

Authors: Rafael A Irizarry; Bridget Hobbs; Francois Collin; Yasmin D Beazer-Barclay; Kristen J Antonellis; Uwe Scherf; Terence P Speed
Journal: Biostatistics Date: 2003-04 Impact factor: 5.899

Review 2. Alternative pre-mRNA splicing regulation in cancer: pathways and programs unhinged.

Authors: Charles J David; James L Manley
Journal: Genes Dev Date: 2010-11-01 Impact factor: 11.361

3. Integrative genomic profiling of human prostate cancer.

Authors: Barry S Taylor; Nikolaus Schultz; Haley Hieronymus; Anuradha Gopalan; Yonghong Xiao; Brett S Carver; Vivek K Arora; Poorvi Kaushik; Ethan Cerami; Boris Reva; Yevgeniy Antipin; Nicholas Mitsiades; Thomas Landers; Igor Dolgalev; John E Major; Manda Wilson; Nicholas D Socci; Alex E Lash; Adriana Heguy; James A Eastham; Howard I Scher; Victor E Reuter; Peter T Scardino; Chris Sander; Charles L Sawyers; William L Gerald
Journal: Cancer Cell Date: 2010-06-24 Impact factor: 31.743

4. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.

Authors: Aravind Subramanian; Pablo Tamayo; Vamsi K Mootha; Sayan Mukherjee; Benjamin L Ebert; Michael A Gillette; Amanda Paulovich; Scott L Pomeroy; Todd R Golub; Eric S Lander; Jill P Mesirov
Journal: Proc Natl Acad Sci U S A Date: 2005-09-30 Impact factor: 11.205

Review 5. Alternative splicing in cancer: noise, functional, or systematic?

Authors: Rolf I Skotheim; Matthias Nees
Journal: Int J Biochem Cell Biol Date: 2007-03-01 Impact factor: 5.085

6. Transcriptome instability in colorectal cancer identified by exon microarray analyses: Associations with splicing factor expression levels and patient survival.

Authors: Anita Sveen; Trude H Agesen; Arild Nesbakken; Torleiv O Rognum; Ragnhild A Lothe; Rolf I Skotheim
Journal: Genome Med Date: 2011-05-27 Impact factor: 11.117

7. FIRMA: a method for detection of alternative splicing from exon array data.

Authors: E Purdom; K M Simpson; M D Robinson; J G Conboy; A V Lapuk; T P Speed
Journal: Bioinformatics Date: 2008-06-23 Impact factor: 6.937

8. WGCNA: an R package for weighted correlation network analysis.

Authors: Peter Langfelder; Steve Horvath
Journal: BMC Bioinformatics Date: 2008-12-29 Impact factor: 3.169

9. Transcriptome instability as a molecular pan-cancer characteristic of carcinomas.

Authors: Anita Sveen; Bjarne Johannessen; Manuel R Teixeira; Ragnhild A Lothe; Rolf I Skotheim
Journal: BMC Genomics Date: 2014-08-10 Impact factor: 3.969

9 in total

3 in total

1. Knockdown of EWSR1/FLI1 expression alters the transcriptome of Ewing sarcoma cells in vitro.

Authors: Jihan Wang; Wenyan Jiang; Yuzhu Yan; Chu Chen; Yan Yu; Biao Wang; Heping Zhao
Journal: J Bone Oncol Date: 2016-05-30 Impact factor: 4.072

2. A bioinformatic analysis identifies circadian expression of splicing factors and time-dependent alternative splicing events in the HD-MY-Z cell line.

Authors: Nikolai Genov; Alireza Basti; Mónica Abreu; Rosario Astaburuaga; Angela Relógio
Journal: Sci Rep Date: 2019-07-30 Impact factor: 4.379

3. Transcriptome analysis of alternative splicing in the pathogen life cycle in human foreskin fibroblasts infected with Trypanosoma cruzi.

Authors: Hyeim Jung; Seonggyun Han; Younghee Lee
Journal: Sci Rep Date: 2020-10-15 Impact factor: 4.379

3 in total