
RankProd 2.0: a refactored bioconductor package for detecting differentially expressed features in molecular profiling datasets.

Francesco Del Carratore¹, Andris Jankevics², Rob Eisinga³, Tom Heskes⁴, Fangxin Hong⁵, Rainer Breitling¹.

Abstract

MOTIVATION: The Rank Product (RP) is a statistical technique widely used to detect differentially expressed features in molecular profiling experiments such as transcriptomics, metabolomics and proteomics studies. An implementation of the RP and the closely related Rank Sum (RS) statistics has been available in the RankProd Bioconductor package for several years. However, several recent advances in the understanding of the statistical foundations of the method have made a complete refactoring of the existing package desirable.
RESULTS: We implemented a completely refactored version of the RankProd package, which provides a more principled implementation of the statistics for unpaired datasets. Moreover, the permutation-based P-value estimation methods have been replaced by exact methods, providing faster and more accurate results.
AVAILABILITY AND IMPLEMENTATION: RankProd 2.0 is available at Bioconductor (https://www.bioconductor.org/packages/devel/bioc/html/RankProd.html) and as part of the mzMatch pipeline (http://www.mzmatch.sourceforge.net).
CONTACT: rainer.breitling@manchester.ac.uk.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2017 PMID: 28481966 PMCID: PMC5860065 DOI: 10.1093/bioinformatics/btx292

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Finding differentially expressed molecular features when comparing different conditions plays a pivotal role in all kinds of molecular profiling studies (‘omics’). The Rank Product (RP) and the Rank Sum (RS) are two non-parametric statistics widely used to detect variables consistently upregulated (or downregulated) in replicate experiments (Breitling and Herzyk, 2005; Breitling et al., 2004). Originally developed for the analysis of gene expression microarrays, both methods are more accurate and powerful than their usual competitors in a number of different scenarios (e.g. abnormally distributed noise, heterogeneity of samples, small fraction of changed features, small sample size), as demonstrated in extensive numerical studies (Breitling and Herzyk, 2005; Jeffery et al., 2006; Koziol, 2010a, b). The main identified weakness of the RP method is its sensitivity to variable-specific measurement variance. This problem, however, has been successfully addressed by a number of variance-stabilizing normalization techniques (Breitling and Herzyk, 2005; Durbin et al., 2002; Huber et al.).

An R Bioconductor package implementing RP and the closely related RS has been available and widely used for several years (Hong et al., 2006). However, recent improvements in our understanding of the two statistics made a refactored version of the package desirable. In the old implementation, the P-values of both statistics were estimated by a permutation-based method (Hong et al., 2006). This method requires a computationally demanding number of permutations to obtain accurate results, and its estimates are particularly unreliable in the tails of the distribution, which contain the most interesting molecular features. RankProd 2.0 addresses this limitation. For the RP, the P-values are now estimated with the fast method proposed by Heskes et al. (2014), a tailor-made solution that calculates strict bounds and very accurate approximate P-values for RP analysis. For the RS, a new exact method for the evaluation of the P-values has been developed and implemented as described in Section 3.

The RP was initially introduced for the analysis of gene expression in paired datasets, specifically two-color microarrays (Breitling et al., 2004). The old RankProd package nevertheless provided an ad hoc strategy to cope with unpaired datasets. Given that unpaired datasets are increasingly common, we developed a more principled approach, described in Section 4, which provides a more reliable application of RP and RS in the analysis of unpaired datasets.
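
To make the two statistics and a permutation-based estimate concrete, the following Python sketch ranks the features within each replicate, multiplies (RP) or sums (RS) the ranks per feature, and estimates RP P-values from random rank permutations. It is an illustration only and not the RankProd code: the function names, the convention that rank 1 goes to the smallest value, the tie-breaking by input order and the pooled counting scheme are all assumptions made here.

```python
from bisect import bisect_right
import random


def rank_statistics(data):
    """Return (rank products, rank sums) for data[i][j] = feature i, replicate j.

    Assumption: rank 1 is given to the smallest value in each replicate;
    ties are broken by input order.
    """
    n, k = len(data), len(data[0])
    rp = [1] * n
    rs = [0] * n
    for j in range(k):
        order = sorted(range(n), key=lambda i: data[i][j])
        for rank, i in enumerate(order, start=1):
            rp[i] *= rank
            rs[i] += rank
    return rp, rs


def permutation_rp_pvalues(observed_rp, n, k, n_perm, seed=0):
    """Estimate P(RP <= observed) by pooling null rank products over permutations."""
    rng = random.Random(seed)
    hits = [0] * len(observed_rp)
    ranks = list(range(1, n + 1))
    for _ in range(n_perm):
        null_rp = [1] * n
        for _rep in range(k):
            rng.shuffle(ranks)  # one random replicate under the null
            for i, r in enumerate(ranks):
                null_rp[i] *= r
        null_rp.sort()
        for idx, value in enumerate(observed_rp):
            # number of null rank products that are <= the observed one
            hits[idx] += bisect_right(null_rp, value)
    return [h / (n_perm * n) for h in hits]


if __name__ == "__main__":
    toy = [[1, 1, 1], [2, 3, 2], [3, 2, 3]]  # 3 features, 3 replicates
    rp, rs = rank_statistics(toy)
    print(rp, rs)
    print(permutation_rp_pvalues(rp, n=3, k=3, n_perm=1000))
```

With n_perm permutations of N features, the smallest non-zero estimate this scheme can return is 1/(n_perm × N), which is one way to see why the extreme tail is poorly resolved by a permutation approach.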

2 P-value estimation for the RP

The P-value estimation for the RP has been studied intensively in the last few years. Koziol (2010a, 2016) approximated the distribution of the RP with a gamma distribution. This approximation proved imprecise in the tails of the distribution (Eisinga et al., 2013). Eisinga et al. (2013) derived the exact probability distribution of the RP statistic. Unfortunately, the exact calculation is time-demanding and impractical to use with large datasets. For this reason, we chose the method proposed by Heskes et al. (2014), which approximates the P-values very accurately and in a computationally fast manner. This method calculates strict upper and lower bounds for the exact P-values and takes the geometric mean of the two bounds as an extremely accurate estimate. This approach significantly speeds up the RP analysis. When considering a typical paired dataset (N = 1000 and K = 10), the computation time is now reduced by a factor of , when compared with the analysis performed with the previous approach (using 10 000 permutations).
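
For very small N and K, the null distribution of the RP can be enumerated directly, which helps to show what the exact P-value is and how a continuous gamma approximation compares with it. The Python sketch below is not the algorithm of Heskes et al. and not the package's code. The function names are hypothetical, the null model assumes that a feature's rank in each replicate is uniform on 1..N and independent across replicates, and the gamma form used here (treating −log(r/(N+1)) as approximately exponential) is an assumption that may differ in detail from Koziol's formulation.

```python
from collections import Counter
from fractions import Fraction
from math import exp, factorial, log


def exact_rp_pvalue(rho, n, k):
    """Exact P(RP <= rho) under the null, by enumerating products of k ranks in 1..n."""
    dist = Counter({1: 1})  # product of zero ranks is 1, reached in one way
    for _ in range(k):
        nxt = Counter()
        for prod, count in dist.items():
            for r in range(1, n + 1):
                nxt[prod * r] += count
        dist = nxt
    favourable = sum(c for prod, c in dist.items() if prod <= rho)
    return Fraction(favourable, n ** k)


def gamma_rp_pvalue(rho, n, k):
    """Continuous approximation (assumed form): -sum(log(r/(n+1))) ~ Gamma(k, 1).

    A small rank product corresponds to a large x, so the P-value is the
    upper tail of the Gamma(k, 1) (Erlang) distribution evaluated at x.
    """
    x = -log(rho / (n + 1) ** k)
    return exp(-x) * sum(x ** j / factorial(j) for j in range(k))


if __name__ == "__main__":
    n, k = 20, 3
    for rho in (1, 8, 100, 1000):
        exact = exact_rp_pvalue(rho, n, k)
        print(rho, float(exact), gamma_rp_pvalue(rho, n, k))
```

The number of distinct products that the enumeration has to track grows quickly with N and K, so this brute-force version is only meant for toy sizes.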

3 P-value estimation for the RS

Previously, the only method available to estimate the P-values for the RS statistic was the permutation-based approach already implemented in the RankProd package (Hong et al., 2006). Here we introduce a method for the exact calculation of the RS P-values. It is derived from the simple observation that, under the null hypothesis, the probability distribution of the RS in an experiment with N variables and K replicates is exactly the same as the probability distribution of the sum of the outcomes obtained by rolling K dice with N faces (http://mathworld.wolfram.com/Dice.html). The implementation of this approach notably speeds up the RS analysis. When considering a typical paired dataset (N = 1000 and K = 10), the computation time is now reduced by a factor of , when compared with the analysis performed with the previous approach (using 10 000 permutations). When the dataset is so large that the time needed to evaluate the exact P-values becomes unacceptable, the new package uses the exact distribution for the tails only, whereas all the other P-values are evaluated through a very accurate Gaussian approximation. The extent of the tails and the threshold used to switch between the two strategies are determined by the heuristic rule described, together with the details of the calculation, in the Supplementary Material.
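
The dice analogy can be sketched in a few lines of Python: the number of ways in which K dice with N faces give each possible sum is built up one die at a time, and the exact P-value is a cumulative count divided by N^K. This is a minimal illustration under stated assumptions, not the package's implementation. The function names are hypothetical, and the Gaussian variant uses the mean K(N+1)/2 and variance K(N²−1)/12 of the dice sum together with a continuity correction, which may differ from the approximation described in the Supplementary Material.

```python
from fractions import Fraction
from math import erf, sqrt


def rank_sum_counts(n, k):
    """counts[s] = number of ways in which k dice with n faces sum to s."""
    counts = [1]  # zero dice: the empty sum 0, reached in one way
    for _ in range(k):
        size = len(counts) + n
        new = [0] * size
        window = 0  # running sum of counts[t-n .. t-1]
        for t in range(1, size):
            if t - 1 < len(counts):
                window += counts[t - 1]
            if t - 1 - n >= 0:
                window -= counts[t - 1 - n]
            new[t] = window
        counts = new
    return counts


def exact_rs_pvalue(s, n, k):
    """Exact P(RS <= s) under the null hypothesis."""
    if s < k:
        return Fraction(0)
    counts = rank_sum_counts(n, k)
    return Fraction(sum(counts[: s + 1]), n ** k)


def gaussian_rs_pvalue(s, n, k):
    """Normal approximation to P(RS <= s), with continuity correction (assumed)."""
    mean = k * (n + 1) / 2
    sd = sqrt(k * (n * n - 1) / 12)
    z = (s + 0.5 - mean) / sd
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


if __name__ == "__main__":
    n, k = 1000, 10
    for s in (10, 50, 2000, 5005):
        print(s, float(exact_rs_pvalue(s, n, k)), gaussian_rs_pvalue(s, n, k))
```

The exact counts are kept as Python integers, so no precision is lost in the tails, where the Gaussian approximation is least informative.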

4 Application to unpaired datasets

The previous version of the RankProd package provided an ad hoc approach to analyze unpaired datasets, which consisted of considering all the possible pairs that can be obtained from the unpaired samples. In contrast, our new approach generates a user-defined number of random paired datasets and evaluates the RP (or RS) statistic for each of them. Each of these randomly paired datasets has the same size as if the experiment had originally been performed in a paired design. For each variable, the final RP (or RS) value returned is the median of all the values found during the random pairing process. The P-values are then computed as in the case of a paired experiment. A detailed description of this new approach can be found in the Supplementary Material.
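
The random pairing procedure can be sketched as follows in Python. This is a hedged illustration rather than the package's code, and several details are assumptions made here because the text leaves them to the Supplementary Material: the function names are hypothetical, each random pairing uses as many pairs as the smaller group has samples, values are taken to be on a log scale so that a pair is summarized by a difference, rank 1 goes to the most negative difference, and ties are broken by input order.

```python
import random
from statistics import median


def ranks_ascending(values):
    """Rank 1 for the smallest value; ties broken by input order."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def rank_product_paired(diffs):
    """diffs[j][i] = difference for feature i in pair j; returns one RP per feature."""
    rp = [1] * len(diffs[0])
    for pair in diffs:
        for i, r in enumerate(ranks_ascending(pair)):
            rp[i] *= r
    return rp


def median_rp_unpaired(group_a, group_b, n_pairings, seed=0):
    """Median RP over random pairings of two unpaired groups.

    group_a, group_b: lists of samples, each sample a list of N feature values.
    """
    rng = random.Random(seed)
    k = min(len(group_a), len(group_b))  # assumed size of each paired dataset
    n = len(group_a[0])
    per_pairing = []
    for _ in range(n_pairings):
        a = rng.sample(group_a, k)  # random samples, in random order
        b = rng.sample(group_b, k)
        diffs = [[x - y for x, y in zip(sa, sb)] for sa, sb in zip(a, b)]
        per_pairing.append(rank_product_paired(diffs))
    return [median(rp[i] for rp in per_pairing) for i in range(n)]


if __name__ == "__main__":
    group_a = [[0.0, 5.0, 5.0], [0.0, 6.0, 4.0], [0.0, 5.0, 6.0]]
    group_b = [[10.0, 5.0, 4.0], [10.0, 4.0, 5.0]]
    print(median_rp_unpaired(group_a, group_b, n_pairings=5))
```

Because every random pairing has the size of a paired experiment with k replicates, the median values returned here live on the same scale as a paired RP, which is what allows the P-values to be computed as in the paired case.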

5 Conclusion

The RankProd 2.0 package provides a robust and reliable implementation of the RP methods. Unpaired datasets are now handled through a new approach that significantly improves the performance of the methods. The P-value estimation for the RP is now faster and much more accurate, while for the RS we introduced a new and fast method able to evaluate the exact P-values. Full backward compatibility has been kept despite the complete refactoring. This improved implementation allows a more reliable application of these methods across the full spectrum of modern molecular profiling technologies. The new implementation of the method has also been integrated in the mzMatch pipeline (Scheltema et al., 2011).

Funding

This work was supported by the BBSRC [BB/M017702/1]; ‘Centre for synthetic biology of fine and speciality chemicals’.

Conflict of Interest: none declared.

References (12 in total)

1. A variance-stabilizing transformation for gene-expression microarray data.

Authors: B P Durbin; J S Hardin; D M Hawkins; D M Rocke
Journal: Bioinformatics Date: 2002 Impact factor: 6.937

2. Rank-based methods as a non-parametric alternative of the T-statistic for the analysis of biological microarray data.

Authors: Rainer Breitling; Pawel Herzyk
Journal: J Bioinform Comput Biol Date: 2005-10 Impact factor: 1.122

3. RankProd: a bioconductor package for detecting differentially expressed genes in meta-analysis.

Authors: Fangxin Hong; Rainer Breitling; Connor W McEntee; Ben S Wittner; Jennifer L Nemhauser; Joanne Chory
Journal: Bioinformatics Date: 2006-09-18 Impact factor: 6.937

4. Comments on the rank product method for analyzing replicated experiments.

Authors: James A Koziol
Journal: FEBS Lett Date: 2010-01-20 Impact factor: 4.124

5. PeakML/mzMatch: a file format, Java library, R library, and tool-chain for mass spectrometry data analysis.

Authors: Richard A Scheltema; Andris Jankevics; Ritsert C Jansen; Morris A Swertz; Rainer Breitling
Journal: Anal Chem Date: 2011-03-14 Impact factor: 6.986

6. The exact probability distribution of the rank product statistics for replicated experiments.

Authors: Rob Eisinga; Rainer Breitling; Tom Heskes
Journal: FEBS Lett Date: 2013-02-08 Impact factor: 4.124

7. The rank product method with two samples.

Authors: James A Koziol
Journal: FEBS Lett Date: 2010-10-14 Impact factor: 4.124

8. Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments.

Authors: Rainer Breitling; Patrick Armengaud; Anna Amtmann; Pawel Herzyk
Journal: FEBS Lett Date: 2004-08-27 Impact factor: 4.124

9. Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data.

Authors: Ian B Jeffery; Desmond G Higgins; Aedín C Culhane
Journal: BMC Bioinformatics Date: 2006-07-26 Impact factor: 3.169

10. A fast algorithm for determining bounds and accurate approximate p-values of the rank product statistic for replicate experiments.

Authors: Tom Heskes; Rob Eisinga; Rainer Breitling
Journal: BMC Bioinformatics Date: 2014-11-21 Impact factor: 3.169
