
QiSampler: evaluation of scoring schemes for high-throughput datasets using a repetitive sampling strategy on gold standards.

Jean F Fontaine, Bernhard Suter, Miguel A Andrade-Navarro.

Abstract

BACKGROUND: High-throughput biological experiments can produce a large amount of data showing little overlap with current knowledge. This may be a problem when evaluating alternative scoring mechanisms for such data according to a gold standard dataset because standard statistical tests may not be appropriate.
FINDINGS: To address this problem we have implemented the QiSampler tool that uses a repetitive sampling strategy to evaluate several scoring schemes or experimental parameters for any type of high-throughput data given a gold standard. We provide two example applications of the tool: selection of the best scoring scheme for a high-throughput protein-protein interaction dataset by comparison to a dataset derived from the literature, and evaluation of functional enrichment in a set of tumour-related differentially expressed genes from a thyroid microarray dataset.
CONCLUSIONS: QiSampler is implemented as an open source R script and a web server, which can be accessed at http://cbdm.mdc-berlin.de/tools/sampler/.


Year: 2011 PMID: 21388526 PMCID: PMC3060832 DOI: 10.1186/1756-0500-4-57

Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500

Findings

Background

Because of the large size and complexity of the data produced by high-throughput molecular technologies, their processing and analysis require computer programs and algorithms. For example, the transcriptome of a cell can be assayed with mRNA microarrays, full genomes can be quickly sequenced using deep-sequencing technologies, and data on thousands of protein-protein interactions (PPIs) can be generated using high-throughput yeast two-hybrid screening or mass spectrometry [1,2]. As these technologies produce a huge amount of data, often from samples or conditions never studied before, evaluating the significance of such results is challenging. When evaluating the results of a new high-throughput experiment, a biologist's first reaction is often to compare the experimental results to a golden set built from their own knowledge and from the literature [3,4]. This golden set allows the identification of known (previously described in the literature) and novel (not described in the literature) results in the dataset, and it allows the evaluation of different scoring schemes (e.g. experimental parameters, confidence scores or statistical tests) that can be used to prioritize the results. The systematic comparison of prioritizations from different scoring schemes may positively impact the results of a study (e.g. by helping to select appropriate confirmatory experiments) or suggest changes in experimental protocols. However, high-throughput datasets often overlap minimally with available golden sets. For example, in the case of PPI data, a golden set composed of tens of thousands of curated human PPIs can be built from public databases. While this at first sounds like a large golden set, the total interaction space contains hundreds of millions of potentially interacting protein pairs [2,5]. In cases such as this, standard statistical tests may not be appropriate for evaluating a dataset.
Therefore, alternative scoring schemes including comparisons of biological attributes such as gene expression or Gene Ontology terms have been used instead [3,6,7]. Yet, the application of these methods to a new dataset requires programming skills and specific statistical knowledge. We have implemented the QiSampler tool to systematically evaluate several scoring schemes for high-throughput experiments versus given golden sets using a sampling strategy. To demonstrate QiSampler's usefulness, we applied the algorithm to a public PPI dataset to select the experimental score that best prioritizes the data, and to a public microarray dataset to evaluate a functional enrichment in a set of differentially expressed genes.

Algorithm

The main input to the QiSampler algorithm is a table with at least three pieces of information for each experimental result, or "item": a label describing the item (e.g. gene name, PPI description, etc.), an indication of whether the item is in the golden set ("known") or not ("novel"), encoded as '1' and '0' respectively, and one or more scores for the item, each calculated using a scoring scheme that the user wishes to evaluate. The scores can be integers or decimal numbers, and are expected to correlate positively with the significance of the item. In addition to this table, the user must also provide the number of repetitions N (a positive integer) and the sampling rate SR ∈ ]0, 1], which defines the sample size S as the number of known items multiplied by SR, expressed as a whole number of items (e.g. 105 known items with SR = 0.25 give S = 26). The QiSampler algorithm works by assessing, for each column of alternative scores, whether high values are assigned preferentially to known cases. This is done by comparing the values given to known items with those given to randomly chosen novel items. The algorithm can be described as follows (see also Figure 1):

Figure 1

Flow chart of the QiSampler algorithm. The data to be processed (all known and novel items and corresponding scoring schemes), and the values of S and N are set from the inputs (see algorithm section for details).

1. select a random sample of size S from the known items in the dataset
2. select a random sample of the same size S from the novel items in the dataset
3. compute the classification performance on the random sample for a given scoring scheme
4. repeat steps 1 to 3 N times
5. compute the average classification performance over the N random samples
6. update output plots with averaged performance curves over the N repetitions
7. repeat steps 1 to 6 for each scoring scheme

Performances are summarized using four graphical plots: precision-recall, precision-cutoff, recall-cutoff, and receiver operating characteristic (ROC) curves (See Figure 2). Classification statistics are defined as follows: given a score cutoff for item selection, true positive items (TP) are defined as known items that are selected, false positive items (FP) as novel items that are selected, true negative items (TN) as novel items that are not selected, and false negative items (FN) as known items that are not selected. Then, the following classification performance measures are used: recall = TP/(TP + FN), true positive rate = recall, false positive rate = FP/(FP + TN), and precision = TP/(TP + FP) [8,9]. Examination of the graphs permits comparison of the scoring systems, and could suggest optimal cutoff values for specific applications.
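The sampling steps and classification measures described above can be illustrated with a minimal Python sketch. This is not the authors' R script: the function names `classification_rates` and `average_performance` are hypothetical, the ">= cutoff" selection rule and the rounding down of S are assumptions, the toy scores are made up, and the plotting step (step 6) is omitted. Step 7 would correspond to calling `average_performance` once per scoring scheme.

```python
import random


def classification_rates(known_scores, novel_scores, cutoff):
    """Return (recall, false positive rate, precision) at one score cutoff.

    Assumption: an item is "selected" when its score is >= cutoff.
    """
    tp = sum(s >= cutoff for s in known_scores)  # known items selected
    fn = len(known_scores) - tp                  # known items not selected
    fp = sum(s >= cutoff for s in novel_scores)  # novel items selected
    tn = len(novel_scores) - fp                  # novel items not selected
    recall = tp / (tp + fn)                      # also the true positive rate
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp) if tp + fp else None  # undefined if nothing is selected
    return recall, fpr, precision


def average_performance(known_scores, novel_scores, cutoffs, sr, n, seed=0):
    """Average the rates over n balanced random samples (steps 1 to 5)."""
    rng = random.Random(seed)
    size = int(len(known_scores) * sr)  # assumption: S is rounded down
    if not 1 <= size <= len(novel_scores):
        raise ValueError("sample size must be between 1 and the number of novel items")
    sums = {c: [0.0, 0.0, 0.0, 0] for c in cutoffs}
    for _ in range(n):  # step 4
        known_sample = rng.sample(known_scores, size)  # step 1
        novel_sample = rng.sample(novel_scores, size)  # step 2
        for c in cutoffs:  # step 3
            recall, fpr, precision = classification_rates(known_sample, novel_sample, c)
            acc = sums[c]
            acc[0] += recall
            acc[1] += fpr
            if precision is not None:
                acc[2] += precision
                acc[3] += 1
    # step 5: average; precision only over samples where it was defined
    return {
        c: (a[0] / n, a[1] / n, a[2] / a[3] if a[3] else None)
        for c, a in sums.items()
    }


# Toy scores (made up): every known item outscores every novel item.
known = [0.9, 0.8, 0.7, 0.6]
novel = [0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
result = average_performance(known, novel, cutoffs=[0.55], sr=0.5, n=20)
print(result[0.55])  # (recall, false positive rate, precision) at cutoff 0.55
```

Because the toy scores separate perfectly at the chosen cutoff, every sample gives the same rates; with real data the averaged values would vary with the cutoff and trace the curves shown in the plots.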

Figure 2

Scores comparison. These graphs produced by QiSampler show the average performance of two scores (scaled to [0,1]), used to select PPIs from the same experimental dataset [10]. Performance was averaged over 1000 repetitions with a sampling rate of 25%. Dashed lines represent randomized data. Based on the precision-recall and ROC graphs, the normalized DN score performs better than the z-score, and a cutoff close to 0.3 would produce optimal values of recall and precision.

When running QiSampler, the user can test the effect of changing the SR parameter, which defines the sample size S. This is useful to detect whether the results of the algorithm could be biased by the high scores of a few true positives. A drastic decrease in performance when changing SR from high to low (e.g. from 90% to 10%) would indicate that just a few true positive items have higher scores than the random selection. Conversely, stable performance when varying SR would indicate that most of the true positive items have comparable scores. Another user-defined parameter is the number N of repetitions of the test. The procedure should be repeated enough times to produce a good estimate of the performance (we suggest at least 100 times). Lower values may have to be used if the computational requirements are too high.

Example of application to a PPI dataset

We have downloaded data from a proteomics study where interaction partners of 75 deubiquitinating enzymes were defined using protein purification, immunoprecipitation and tandem mass spectrometry [10]. The full dataset consisted of 26,803 protein pairs evaluated for potential interaction using two different scores: the z-score and the DN confidence score, the latter introduced by the authors to score interactions using information from parallel nonreciprocal datasets. The superiority of the DN score was shown only for a few examples [10]. Here, we have used QiSampler to perform a systematic statistical evaluation of the two scoring systems in this dataset (Figure 2). The dataset included 105 known pairs that have been previously described in the literature (as defined in [10]). We selected N = 1000 repetitions and SR = 25% (equivalent to 26 pairs). The running time was approximately 40 minutes, but it can vary drastically from a few seconds to a few hours when different parameters are used (Table 1).

Table 1

Average running times on the full dataset

Sampling rate	0.25	0.75	1
Running time for 10 repetitions	00:00:04	00:00:07	00:00:11
Running time for 100 repetitions	00:00:54	00:03:30	00:04:46
Running time for 1000 repetitions	00:40:15	03:09:14	05:09:52

The full dataset contained 26,803 protein pairs including 105 known in the literature. Times were averaged over two runs and were recorded on an AMD Opteron (64 bits, 2.3 GHz) processor-based computer.

For comparison, the DN score was log transformed (logarithm base 10) and both scores (z-score and DN score) were then scaled to [0,1]. The scaled version Sscaled of a score S was defined as Sscaled = (S - Smin)/(Smax - Smin), where Smax and Smin are the maximal and minimal value of S respectively. The DN score (blue curves) showed higher precision but lower recall than the z-score (red curves). Nevertheless, the balance between precision and recall (precision-recall curve), or true positive rate and false positive rate (ROC curve) was better for the DN score, showing its superiority. Both scores were better than random controls in the four plots. Results were stable when varying the sampling rate from 10% to 100% (data not shown).
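The log transformation and scaling to [0,1] described in this section can be sketched in a few lines of Python. The helper name `min_max_scale` and the example DN values are made up for illustration and are not taken from the dataset or from the QiSampler script.

```python
import math


def min_max_scale(values):
    """Scale values to [0, 1] as (S - Smin) / (Smax - Smin)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale a constant score")
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical DN scores spanning several orders of magnitude.
dn_scores = [1.0, 10.0, 100.0, 1000.0]
log_dn = [math.log10(v) for v in dn_scores]  # base-10 log transform
scaled = min_max_scale(log_dn)               # evenly spaced between 0 and 1
print(scaled)
```

Scaling both scores to the same range lets their cutoff-based curves share one axis; it does not change the ranking of items within a score.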

Example of application to a microarray dataset

We downloaded from the Gene Expression Omnibus database [11] a microarray dataset (identifier: GSE6339) containing normalized gene expression values of human thyroid samples, and extracted data from 30 oncocytic thyroid adenoma (OTA) and 24 wild type (WT) samples [12]. OTA cells are characterized by an accumulation of mitochondria [13]. We used QiSampler to see whether genes dysregulated in OTA samples were significantly related to oxidoreductase activity, which is related to mitochondrial function [14]. Z-scores comparing expression values between OTA and WT samples were computed for 3,821 gene probes with associated Gene Ontology (GO) annotations and no more than 27 missing values (representing 50% of the samples). There were 137 (3.6%) gene probes associated with the "oxidoreductase activity" annotation (GO:0016491). We selected N = 1000 repetitions and SR = 25% (equivalent to 34 gene probes) to produce classification performance plots by QiSampler (Figure 3).
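The figures quoted in this setup can be checked with simple arithmetic. The sketch below is in Python; the helper `sample_size` is hypothetical, and rounding S down is an assumption that happens to reproduce the reported sample sizes for both example datasets.

```python
def sample_size(n_known, sr):
    """Sample size S for a sampling rate sr (assumption: rounded down)."""
    return int(n_known * sr)


n_samples = 30 + 24               # OTA plus WT samples
max_missing = n_samples // 2      # 50% of the samples
n_probes, n_known = 3821, 137     # annotated probes, oxidoreductase probes
n_novel = n_probes - n_known      # probes without the annotation
known_percent = round(100 * n_known / n_probes, 1)

print(max_missing, n_novel, known_percent)
print(sample_size(n_known, 0.25))  # microarray example
print(sample_size(105, 0.25))      # PPI example with 105 known pairs
```

The results agree with the text: a missing-value limit of 27, 3,684 novel probes, 3.6% known probes, and sample sizes of 34 and 26.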

Figure 3

Functional enrichment in differentially expressed genes. These graphs produced by QiSampler show the average performance of the z-score comparing gene expression values between 30 OTA and 24 WT samples to select gene probes related to oxidoreductase activity in a thyroid microarray gene expression dataset [12]. The performance was averaged over 1000 repetitions with a sampling rate of 25%, representing 34/137 known items and 34/3684 novel items. Dashed lines represent randomized data. The precision increases with the z-score cutoff, indicating functional enrichment in the upregulated genes. The separation of the precision-recall curve from the random curve shows that genes related to oxidoreductase activity tend to be upregulated, supporting the initial hypothesis. The shape of the ROC curve is probably due to the fact that not all the genes related to oxidoreductase activity are involved in OTA and that the set of upregulated genes is expected to include genes related to other cellular and molecular processes such as apoptosis or mitochondrial homoeostasis [13,15].

Implementation

The R script is designed for R 2.9.0 [16] and plots are generated by the ROCR package [9]. The script is used in command-line mode. To facilitate access, we also implemented the algorithm as a public web tool, programmed using JavaScript, HTML 4 and Perl 5.8.8, but it operates with some query-size limits due to restricted computing power. Web pages were tested on Linux and Windows using Firefox 3.6.8, Google Chrome 5.0, and Internet Explorer 8.0.

Discussion

QiSampler can systematically evaluate the classification performance of experimental scores in comparison to gold standards and to random controls. A given scoring scheme can be identified as relevant if it produces better classification performance than randomly generated scores. If the user provides multiple scoring schemes, their performance can be easily compared, as illustrated in Figure 2. To illustrate QiSampler's usefulness we applied it to the analysis of a high-throughput PPI dataset (26,803 protein pairs) produced by mass spectrometry experiments that has little overlap with the PPI literature (105 known interacting pairs) [10]. In the original publication, a particular score was designed for mass spectrometry data (DN score), which accounts for protein abundance and performs better than the z-score in the selection of protein spectrometry results [10]. Accordingly, QiSampler was able to reproduce the superiority of the DN score over the z-score (Figure 2). In a second application, we used QiSampler to demonstrate the enrichment in functions related to oxidoreductase activity in genes upregulated in oncocytic thyroid adenomas from a microarray dataset (Figure 3). Although QiSampler was created to process datasets with little overlap with the literature, it cannot process a dataset with very few known cases or with too few different score levels (e.g. a binary score) due to a limitation in a function of the ROCR package. We recommend running the procedure with datasets having at least 10 known cases, the more the better, and scored with continuously distributed values. The algorithm may have long run times when processing large datasets (Table 1), and due to restricted computing power the web server operates with some restrictions on the size of the query. To avoid this problem, an open-source R script is provided, which allows one to use QiSampler locally as a command-line program without restrictions on input size.
Plans for the future include an improved web server able to process large datasets. Automatic computation of optimal score cutoffs could also be useful, though different applications may require different cutoffs, e.g. giving priority to higher recall accepting poor precision (e.g. genetic disease screenings) or to higher precision accepting poor recall (e.g. identification of disease markers). Result reliability when varying the sampling rate could be automatically computed, though running QiSampler twice is sufficient to see such an effect, for example comparing sampling rates of 10% and 90%. Finally, to further simplify the use of the QiSampler tool, which requires the user to provide the scores and the identification of the known items, we will implement optional pre-computed score systems and golden standards upon request from users if these are commonly used in the research community. In conclusion, QiSampler can be used for the selection of the most useful experimental scores or parameters. Simplicity of the input format allows the use of QiSampler with various dataset types, such as PPI, gene-expression microarray, or deep sequencing datasets.

Availability and requirements

Project name: QiSampler
Project home page: http://cbdm.mdc-berlin.de/tools/sampler/
Operating system(s): platform independent
Programming language: R, Perl, HTML, and JavaScript
Other requirements: either a modern web browser or R 2.9.0 and the ROCR package
License: BSD license
Any restrictions to use by non-academics: none

List of abbreviations

PPI: protein-protein interaction.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JFF conceived of the study, designed the study, carried out implementation and statistical analysis, and drafted the manuscript. BS participated in the statistical analysis. MAAN participated in the study design and collaborated in the writing of the manuscript. All authors read and approved the final manuscript.

14 in total

Review 1. Exploiting big biology: integrating large-scale biological data for function inference.

Authors: E Marcotte; S Date
Journal: Brief Bioinform Date: 2001-12 Impact factor: 11.622

Review 2. Analyzing protein function on a genomic scale: the importance of gold-standard positives and negatives for network prediction.

Authors: Ronald Jansen; Mark Gerstein
Journal: Curr Opin Microbiol Date: 2004-10 Impact factor: 7.934

3. ROCR: visualizing classifier performance in R.

Authors: Tobias Sing; Oliver Sander; Niko Beerenwinkel; Thomas Lengauer
Journal: Bioinformatics Date: 2005-08-11 Impact factor: 6.937

4. Microarray analysis refines classification of non-medullary thyroid tumours of uncertain malignancy.

Authors: J-F Fontaine; D Mirebeau-Prunier; B Franc; S Triau; P Rodien; R Houlgatte; Y Malthièry; F Savagner
Journal: Oncogene Date: 2007-10-29 Impact factor: 9.867

Review 5. Nitric oxide: orchestrating hypoxia regulation through mitochondrial respiration and the endoplasmic reticulum stress response.

Authors: Weiming Xu; Ian G Charles; Salvador Moncada
Journal: Cell Res Date: 2005-01 Impact factor: 25.617

6. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.

Authors: Aravind Subramanian; Pablo Tamayo; Vamsi K Mootha; Sayan Mukherjee; Benjamin L Ebert; Michael A Gillette; Amanda Paulovich; Scott L Pomeroy; Todd R Golub; Eric S Lander; Jill P Mesirov
Journal: Proc Natl Acad Sci U S A Date: 2005-09-30 Impact factor: 11.205

7. NCBI GEO: archive for functional genomics data sets--10 years on.

Authors: Tanya Barrett; Dennis B Troup; Stephen E Wilhite; Pierre Ledoux; Carlos Evangelista; Irene F Kim; Maxim Tomashevsky; Kimberly A Marshall; Katherine H Phillippy; Patti M Sherman; Rolf N Muertter; Michelle Holko; Oluwabukunmi Ayanbule; Andrey Yefanov; Alexandra Soboleva
Journal: Nucleic Acids Res Date: 2010-11-21 Impact factor: 16.971

8. Large-scale mapping of human protein-protein interactions by mass spectrometry.

Authors: Rob M Ewing; Peter Chu; Fred Elisma; Hongyan Li; Paul Taylor; Shane Climie; Linda McBroom-Cerajewski; Mark D Robinson; Liam O'Connor; Michael Li; Rod Taylor; Moyez Dharsee; Yuen Ho; Adrian Heilbut; Lynda Moore; Shudong Zhang; Olga Ornatsky; Yury V Bukhman; Martin Ethier; Yinglun Sheng; Julian Vasilescu; Mohamed Abu-Farha; Jean-Philippe Lambert; Henry S Duewel; Ian I Stewart; Bonnie Kuehl; Kelly Hogue; Karen Colwill; Katharine Gladwish; Brenda Muskat; Robert Kinach; Sally-Lin Adams; Michael F Moran; Gregg B Morin; Thodoros Topaloglou; Daniel Figeys
Journal: Mol Syst Biol Date: 2007-03-13 Impact factor: 11.429

9. Death-associated protein 3 is overexpressed in human thyroid oncocytic tumours.

Authors: C Jacques; J-F Fontaine; B Franc; D Mirebeau-Prunier; S Triau; F Savagner; Y Malthiery
Journal: Br J Cancer Date: 2009-06-16 Impact factor: 7.640

10. An empirical framework for binary interactome mapping.

Authors: Kavitha Venkatesan; Jean-François Rual; Alexei Vazquez; Ulrich Stelzl; Irma Lemmens; Tomoko Hirozane-Kishikawa; Tong Hao; Martina Zenkner; Xiaofeng Xin; Kwang-Il Goh; Muhammed A Yildirim; Nicolas Simonis; Kathrin Heinzmann; Fana Gebreab; Julie M Sahalie; Sebiha Cevik; Christophe Simon; Anne-Sophie de Smet; Elizabeth Dann; Alex Smolyar; Arunachalam Vinayagam; Haiyuan Yu; David Szeto; Heather Borick; Amélie Dricot; Niels Klitgord; Ryan R Murray; Chenwei Lin; Maciej Lalowski; Jan Timm; Kirstin Rau; Charles Boone; Pascal Braun; Michael E Cusick; Frederick P Roth; David E Hill; Jan Tavernier; Erich E Wanker; Albert-László Barabási; Marc Vidal
Journal: Nat Methods Date: 2008-12-07 Impact factor: 28.547

3 in total

1. Development and application of a DNA microarray-based yeast two-hybrid system.

Authors: Bernhard Suter; Jean-Fred Fontaine; Reha Yildirimman; Tamás Raskó; Martin H Schaefer; Axel Rasche; Pablo Porras; Blanca M Vázquez-Álvarez; Jenny Russ; Kirstin Rau; Raphaele Foulle; Martina Zenkner; Kathrin Saar; Ralf Herwig; Miguel A Andrade-Navarro; Erich E Wanker
Journal: Nucleic Acids Res Date: 2012-12-28 Impact factor: 16.971

2. Alkemio: association of chemicals with biomedical topics by text and data mining.

Authors: José A Gijón-Correas; Miguel A Andrade-Navarro; Jean F Fontaine
Journal: Nucleic Acids Res Date: 2014-05-16 Impact factor: 16.971

3. Next-Generation Sequencing for Binary Protein-Protein Interactions.

Authors: Bernhard Suter; Xinmin Zhang; C Gustavo Pesce; Andrew R Mendelsohn; Savithramma P Dinesh-Kumar; Jian-Hua Mao
Journal: Front Genet Date: 2015-12-17 Impact factor: 4.599
