
The tspair package for finding top scoring pair classifiers in R.

Abstract

Top scoring pairs (TSPs) are pairs of genes whose relative rankings can be used to accurately classify individuals into one of two classes. TSPs have two main advantages over many standard classifiers used in gene expression studies: (i) a TSP is based on only two genes, which leads to easily interpretable and inexpensive diagnostic tests and (ii) TSP classifiers are based on gene rankings, so they are more robust to variation in technical factors or normalization than classifiers based on expression levels of individual genes. Here I describe the R package, tspair, which can be used to quickly identify and assess TSP classifiers for gene expression data.

AVAILABILITY: The R package tspair is freely available from Bioconductor: http://www.bioconductor.org.


Year: 2009 PMID: 19276151 PMCID: PMC2672632 DOI: 10.1093/bioinformatics/btp126

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Classification of patients into disease groups or subtypes is the most direct way to translate microarray technology into a clinically useful tool (Quackenbush, 2006). A small number of tests based on microarrays have even been approved for clinical use, for example, for diagnosing breast cancer subtypes (Ma et al., 2004; Marchionni et al., 2008; Paik et al., 2004; van't Veer et al., 2002). But standard microarray classifiers are based on complicated functions of many gene expression measurements. This type of classifier is hard to interpret, and its effectiveness depends critically on the platform, pre-processing and normalization steps (Quackenbush, 2006). Identifying biologically interpretable, robust and cheap classifiers based on small subsets of genes would greatly speed progress in the development of clinical tests from microarray experiments.

Top scoring pairs (TSPs) are pairs of genes that accurately classify patients into clinically relevant groups based on their ranks (Geman et al., 2004; Tan et al., 2005; Xu et al., 2005). The basic idea is to search among all pairs of genes and look for the pair whose relative ranking most consistently switches between two groups.

To understand how the classification scheme works, consider the simulated gene expression data in Figure 1. In this figure there are two groups of arrays, separated by the black line. These groups could represent healthy patients versus cancer patients, or two distinct subtypes of cancer. For all but one array in Group 1, Gene 1 has higher expression than Gene 2, and the reverse is true in Group 2. In this case, Genes 1 and 2 form a classifier based on their relative levels of expression. A new sample where the gene expression for Gene 1 was higher than the gene expression for Gene 2 would be classified as Group 1.
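The pair search described above can be illustrated with a short sketch. It is written in plain Python rather than R and does not use the tspair package; the function names, the toy data and the gene labels are hypothetical. It assumes the usual TSP score from Geman et al. (2004): the absolute difference between the two groups in the frequency with which one gene exceeds the other.

```python
from itertools import combinations

def pair_score(x_i, x_j, groups):
    """Return |P(x_i > x_j | group 0) - P(x_i > x_j | group 1)| for one gene pair."""
    freqs = []
    for g in (0, 1):
        idx = [k for k, label in enumerate(groups) if label == g]
        # Fraction of samples in group g where gene i is ranked above gene j
        freqs.append(sum(x_i[k] > x_j[k] for k in idx) / len(idx))
    return abs(freqs[0] - freqs[1])

def top_scoring_pair(expr, groups):
    """Exhaustively score every gene pair.

    expr maps a gene name to its expression values, one per sample;
    groups holds a 0/1 class label for each sample.
    """
    best_pair, best_score = None, -1.0
    for a, b in combinations(sorted(expr), 2):
        score = pair_score(expr[a], expr[b], groups)
        if score > best_score:
            best_pair, best_score = (a, b), score
    return best_pair, best_score

# Hypothetical toy data: 3 genes, 3 samples per group.
# Gene1 > Gene2 in all but one sample of group 0, and never in group 1.
expr = {
    "Gene1": [5.0, 6.1, 5.5, 2.0, 2.2, 1.9],
    "Gene2": [3.0, 3.2, 6.0, 4.0, 4.1, 4.5],
    "Gene3": [1.0, 1.1, 0.9, 1.2, 1.0, 1.1],
}
groups = [0, 0, 0, 1, 1, 1]

pair, score = top_scoring_pair(expr, groups)
print(pair, round(score, 3))
```

The number of pairs grows quadratically with the number of genes, which is why a pure exhaustive search like this one becomes expensive for datasets with tens of thousands of genes.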

Fig. 1.

An Example of a TSP. In this simulated example, the expression for Gene 1 is higher than the expression for Gene 2 for almost all of the arrays in the group on the left and this relationship reverses for the group on the right.

The TSP approach has been successfully applied to identify subtypes of sarcoma, resulting in a RT-PCR-based test that correctly classified 20 independent tumors with perfect accuracy (Price et al., 2007). This early success suggests that it may be possible to identify TSP classifiers for other important diseases and quickly develop new inexpensive diagnostic tests.

2 THE TSPAIR PACKAGE

Calculating the TSP for a gene expression dataset is relatively straightforward, but computationally intensive. I have developed an R package, tspair, that can rapidly calculate the TSP for typical gene expression datasets with tens of thousands of genes. The TSP can be calculated either in R or with an external C function, which allows for both rapid calculation and flexible development of the tspair package. The tspair package includes functions for calculating the statistical significance of a TSP by permutation test, and is fully compatible with Bioconductor expression sets. The R package is freely available from the Bioconductor web site (www.bioconductor.org).

3 AN EXAMPLE SESSION

Here I present an example session on a simple simulated dataset included in the tspair package. I calculate the TSP, assess the strength of evidence for the classifier with a permutation test, plot the output and show how to predict outcomes for a new dataset.

The main function in the tspair package is tspcalc(). This function accepts either (i) a gene expression matrix or an expression set and a group indicator vector, or (ii) an expression set object and a column number, indicating which column of the annotation data to use as the group indicator. The result is a tsp object which gives the TSP score, indices, gene expression data and group labels for the TSP. If there are multiple pairs that achieve the top score, then the tie-breaking score developed by Tan et al. (2005) is reported.

The function tspsig() can be used to calculate the significance of a TSP classifier by permutation as described in Geman et al. (2004). The class labels are permuted, a new TSP is calculated for each permutation, and the null scores are compared with the observed TSP score to calculate a P-value. Since the maximum score is calculated for each null permutation, tspsig() performs a test of the null hypothesis that no TSP classifier is better than random chance.

Once a TSP has been calculated, the tspplot() function can be used to visualize the classifier. The resulting TSP figure (Fig. 2) plots the expression for the first gene in the pair versus the expression for the second gene in the pair. The true group difference is indicated by the color of the points, and the score for the TSP classifier is shown in the title of the plot. The black 45° line indicates the classification from the TSP; the better the black line separates the colors the better the accuracy of the TSP.
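The permutation procedure described for tspsig() can be sketched as follows. This is an illustrative plain-Python version, not the package's implementation: the function names, the toy data, the number of permutations and the seed are hypothetical, and the add-one correction in the P-value is an assumption made here so that the estimate is never exactly zero.

```python
import random
from itertools import combinations

def max_pair_score(expr, groups):
    """Largest TSP score over all gene pairs; expr is a list of per-gene value lists."""
    idx = {g: [k for k, label in enumerate(groups) if label == g] for g in (0, 1)}
    best = 0.0
    for a, b in combinations(range(len(expr)), 2):
        # Frequency with which gene a exceeds gene b within each group
        f0 = sum(expr[a][k] > expr[b][k] for k in idx[0]) / len(idx[0])
        f1 = sum(expr[a][k] > expr[b][k] for k in idx[1]) / len(idx[1])
        best = max(best, abs(f0 - f1))
    return best

def permutation_p_value(expr, groups, n_perm=200, seed=0):
    """Compare the observed top score with top scores under permuted class labels."""
    rng = random.Random(seed)
    observed = max_pair_score(expr, groups)
    labels = list(groups)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # permute class labels, keeping group sizes fixed
        # The maximum is recomputed for every permutation, so the test
        # accounts for the search over all pairs.
        if max_pair_score(expr, labels) >= observed:
            exceed += 1
    # Assumption: add-one correction, so the estimated P-value is never zero
    return observed, (exceed + 1) / (n_perm + 1)

# Hypothetical toy data: 4 genes, 4 samples per group.
expr = [
    [5.0, 6.0, 5.5, 6.2, 2.0, 2.2, 1.9, 2.5],
    [3.0, 3.2, 3.1, 2.9, 4.0, 4.1, 4.5, 4.2],
    [1.0, 1.3, 0.8, 1.1, 1.2, 0.9, 1.0, 1.4],
    [7.0, 7.5, 6.9, 7.2, 7.1, 7.3, 6.8, 7.4],
]
groups = [0, 0, 0, 0, 1, 1, 1, 1]

observed, p_value = permutation_p_value(expr, groups)
print(observed, p_value)
```

Taking the maximum over all pairs inside each permutation is what makes the null hypothesis "no TSP classifier is better than chance" rather than a statement about one pre-specified pair.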

Fig. 2.

A TSP plot. A TSP plot for the simulated data example in the tspair package. The colors indicate the true groups, and the black line indicates the TSP classification. The black line is the line where expression for ‘Gene5’ equals the expression for ‘Gene338’; the classification boundary is not data-driven; it is set in advance.

A major advantage of the TSP approach is that predictions are very simple and can be easily calculated either by hand or using the built-in functionality of the tspair package. In this example, the expression value for ‘Gene5’ is greater than the expression value for ‘Gene338’ much more often for the diseased patients. In a new dataset, when the expression for ‘Gene5’ is greater than the expression for ‘Gene338’ I predict that the patient will be diseased.

The tspair package can be used to predict the outcomes of new samples based on new expression data. The new data can take the form of a new expression matrix, or an expression set object. The R function predict() searches for the TSP gene names from the original tspcalc() function call, and based on the row names or featureNames of the new dataset identifies the genes to use for prediction. If multiple TSPs are reported, the default is to predict with the TSP achieving the top tie-breaking score (Tan et al., 2005), but the user may also elect to use a different TSP for prediction. In this example, the predict() function finds the genes with labels ‘Gene5’ and ‘Gene338’ in the second dataset and calculates the TSP predictions based on the values of these two genes. The new data matrix need not be defined by a microarray; it could easily be the result of RT-PCR or any other expression assay, imported into R as a tab-delimited text file.
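The prediction rule is simple enough to write out directly. The sketch below is plain Python and is not the package's predict() method; the function name, the class labels and the expression values are hypothetical, and the choice to assign ties to the second label is an assumption made here. It looks up the two genes by name in a new dataset and compares their values sample by sample.

```python
def predict_tsp(new_expr, gene_a, gene_b, label_if_a_higher, label_otherwise):
    """Classify each sample by whether gene_a is expressed above gene_b.

    new_expr maps gene names to lists of expression values (one per sample).
    Only the two TSP genes are used; all other genes are ignored.
    """
    for gene in (gene_a, gene_b):
        if gene not in new_expr:
            raise KeyError(f"{gene} not found in the new dataset")
    return [
        # Assumption: ties (a == b) fall to label_otherwise
        label_if_a_higher if a > b else label_otherwise
        for a, b in zip(new_expr[gene_a], new_expr[gene_b])
    ]

# Hypothetical new data, e.g. from RT-PCR; values are made up for illustration.
new_expr = {
    "Gene5": [7.1, 2.0, 5.5],
    "Gene338": [3.0, 4.2, 5.5],
    "Gene9": [1.0, 1.0, 1.0],
}

predictions = predict_tsp(new_expr, "Gene5", "Gene338", "diseased", "healthy")
print(predictions)
```

Because the rule only compares two values within the same sample, it does not require the new data to be normalized in the same way as the training data.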

REFERENCES

1. Microarray analysis and tumor classification.

Authors: John Quackenbush
Journal: N Engl J Med Date: 2006-06-08 Impact factor: 91.245

2. A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer.

Authors: Soonmyung Paik; Steven Shak; Gong Tang; Chungyeul Kim; Joffre Baker; Maureen Cronin; Frederick L Baehner; Michael G Walker; Drew Watson; Taesung Park; William Hiller; Edwin R Fisher; D Lawrence Wickerham; John Bryant; Norman Wolmark
Journal: N Engl J Med Date: 2004-12-10 Impact factor: 91.245

3. Gene expression profiling predicts clinical outcome of breast cancer.

Authors: Laura J van 't Veer; Hongyue Dai; Marc J van de Vijver; Yudong D He; Augustinus A M Hart; Mao Mao; Hans L Peterse; Karin van der Kooy; Matthew J Marton; Anke T Witteveen; George J Schreiber; Ron M Kerkhoven; Chris Roberts; Peter S Linsley; René Bernards; Stephen H Friend
Journal: Nature Date: 2002-01-31 Impact factor: 49.962

4. Classifying gene expression profiles from pairwise mRNA comparisons.

Authors: Donald Geman; Christian d'Avignon; Daniel Q Naiman; Raimond L Winslow
Journal: Stat Appl Genet Mol Biol Date: 2004-08-30

5. Robust prostate cancer marker genes emerge from direct integration of inter-study microarray data.

Authors: Lei Xu; Aik Choon Tan; Daniel Q Naiman; Donald Geman; Raimond L Winslow
Journal: Bioinformatics Date: 2005-08-30 Impact factor: 6.937

6. Simple decision rules for classifying human cancers from gene expression profiles.

Authors: Aik Choon Tan; Daniel Q Naiman; Lei Xu; Raimond L Winslow; Donald Geman
Journal: Bioinformatics Date: 2005-08-16 Impact factor: 6.937

7. A two-gene expression ratio predicts clinical outcome in breast cancer patients treated with tamoxifen.

Authors: Xiao-Jun Ma; Zuncai Wang; Paula D Ryan; Steven J Isakoff; Anne Barmettler; Andrew Fuller; Beth Muir; Gayatry Mohapatra; Ranelle Salunga; J Todd Tuggle; Yen Tran; Diem Tran; Ana Tassin; Paul Amon; Wilson Wang; Wei Wang; Edward Enright; Kimberly Stecker; Eden Estepa-Sabal; Barbara Smith; Jerry Younger; Ulysses Balis; James Michaelson; Atul Bhan; Karleen Habin; Thomas M Baer; Joan Brugge; Daniel A Haber; Mark G Erlander; Dennis C Sgroi
Journal: Cancer Cell Date: 2004-06 Impact factor: 31.743

8. Highly accurate two-gene classifier for differentiating gastrointestinal stromal tumors and leiomyosarcomas.

Authors: Nathan D Price; Jonathan Trent; Adel K El-Naggar; David Cogdell; Ellen Taylor; Kelly K Hunt; Raphael E Pollock; Leroy Hood; Ilya Shmulevich; Wei Zhang
Journal: Proc Natl Acad Sci U S A Date: 2007-02-21 Impact factor: 11.205

9. Systematic review: gene expression profiling assays in early-stage breast cancer.

Authors: Luigi Marchionni; Renee F Wilson; Antonio C Wolff; Spyridon Marinopoulos; Giovanni Parmigiani; Eric B Bass; Steven N Goodman
Journal: Ann Intern Med Date: 2008-02-04 Impact factor: 25.391
