BlackSheep: A Bioconductor and Bioconda Package for Differential Extreme Value Analysis.

Lili Blumenberg^1,2,3, Emily A Kawaler^1,4,3, MacIntosh Cornwell^1,2,3, Shaleigh Smith^1, Kelly V Ruggles^2,3, David Fenyö^4,3.

Abstract

Unbiased assays such as shotgun proteomics and RNA-seq provide high-resolution molecular characterization of tumors. These assays measure molecules with highly varied distributions, making interpretation and hypothesis testing challenging. Samples with the most extreme measurements for a molecule can reveal the most interesting biological insights yet are often excluded from analysis. Furthermore, rare disease subtypes are, by definition, underrepresented in cancer cohorts. To provide a strategy for identifying molecules aberrantly enriched in small sample cohorts, we present BlackSheep, a package for nonparametric description and differential analysis of genome-wide data, available from Bioconductor (https://www.bioconductor.org/packages/release/bioc/html/blacksheepr.html) and Bioconda (https://bioconda.github.io/recipes/blksheep/README.html). BlackSheep is a complementary tool to other differential expression analysis methods, which is particularly useful when analyzing small subgroups in a larger cohort.

Keywords: differential expression; extreme values; outliers; phosphoproteomics; proteomics

Year: 2021 PMID: 34165986 PMCID: PMC8256816 DOI: 10.1021/acs.jproteome.1c00190

Source DB: PubMed Journal: J Proteome Res ISSN: 1535-3893 Impact factor: 5.370

Introduction

Proteogenomic studies characterizing cancer have been completed by many groups, several of which also conducted phosphoproteome analysis.[1−12] Outlier identification was used in a number of these studies to identify samples with aberrantly high levels of each phosphosite and phosphoprotein.[1,4,9,10,13] In these studies, outlier identification and subsequent subtype enrichment were used to interpret phosphopeptide data at the protein level and to highlight novel putative clinically relevant targets[1,4,9−11] or to nominate targets in a kinase inhibitor screen for sensitizers in drug-resistant cell lines.[13] This nonparametric method is of particular use for multiomics studies, as nonparametric approaches are more robust to the various sources of technical noise present in these data sets, which violate assumptions in parametric tests. Outlier values in a data set are often assumed to be experimental artifacts and are typically discarded prior to downstream statistical analyses. However, recurrent outliers are sometimes the most meaningful values in a data set, representing profound biological effects. In particular, when characterizing biological systems and identifying disease vulnerabilities, the largest changes in abundance are often the most revealing.[14,15] Furthermore, many diseases, including cancer, are heterogeneous, with significant molecular variability requiring highly personalized approaches for successful treatment. The current strategies for identifying characteristic molecular patterns for groups of samples are especially underpowered when studying rare disease subtypes, as their sample sizes tend to be much smaller than those for their more common counterparts. They also tend to rely on assumptions about the underlying distributions of the features in question—assumptions that are often inaccurate or discard extreme values with biological significance. 
We propose a complementary strategy using the enrichment of outlier values within subtypes for characterizing disease subtypes that could inform diagnostic panels and potentially be used in the design of personalized therapeutic strategies for individual patients.

Materials and Methods

BlackSheep is an easy-to-use package available on Bioconductor (https://www.bioconductor.org/packages/release/bioc/html/blacksheepr.html) and Bioconda (https://bioconda.github.io/recipes/blksheep/README.html). It can be used in R or Python or as a command line utility. BlackSheep has two major components: the “DEVA” (Differential Extreme Value Analysis) module for calling outliers, collapsing features by parent molecule (i.e., phosphopeptides to a protein) and differential analysis, and the “run_simulations” module for assigning p values to each outlier call. The input data is an expression matrix, structured as rows of features (genes, proteins, phosphosites, etc.) and columns of samples, and a sample annotation file is used to group samples for comparisons (Supplementary Table 1A,B). No prefiltering is necessary or recommended for DEVA. Normalization and log2 transformation of the input matrix is strongly recommended; a function for sample value normalization is provided.

Differential Extreme Value Analysis

To call outliers, the median and interquartile range (IQR) for each row are calculated. The user specifies whether to call overly abundant (i.e., up) or depleted (i.e., down) values. Outliers are defined as any value more than a multiple of the IQR above or below the median, where the multiple of the IQR is user-specified (default 1.5) (Figure 1A). Missing values are omitted from this calculation for each individual feature’s analysis. After calling outliers, there is an optional aggregation step for collapsing rows containing related features into a single row (e.g., many phosphosites collapsed into a protein). Aggregation is achieved by counting outliers and nonoutliers of all phosphosites for each protein. The output is two tables: one with outlier and nonoutlier counts per protein to be used for downstream comparisons (Supplementary Table 2A,B) and the other containing the fraction of outliers in each sample, which is helpful for visualization (Supplementary Table 2C,D).
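The outlier rule and the aggregation step described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the package's implementation: the function names `call_outliers` and `aggregate_by_parent`, the use of `None` for missing values, and the "inclusive" quartile convention are all assumptions made for the example.

```python
import statistics

def call_outliers(values, iqr_multiple=1.5, direction="up"):
    """Flag outliers in one feature row. None marks a missing value.

    Assumption: quartiles use Python's "inclusive" method; other quantile
    conventions give slightly different IQRs. Needs >= 2 observed values.
    """
    observed = [v for v in values if v is not None]  # missing values are omitted
    median = statistics.median(observed)
    q1, _, q3 = statistics.quantiles(observed, n=4, method="inclusive")
    iqr = q3 - q1
    if direction == "up":
        cutoff = median + iqr_multiple * iqr
        return [None if v is None else v > cutoff for v in values]
    cutoff = median - iqr_multiple * iqr
    return [None if v is None else v < cutoff for v in values]

def aggregate_by_parent(flags_by_feature, parent_of):
    """Count [outliers, nonoutliers] per parent molecule and sample."""
    counts = {}
    for feature, flags in flags_by_feature.items():
        parent = parent_of[feature]
        per_sample = counts.setdefault(parent, [[0, 0] for _ in flags])
        for i, flag in enumerate(flags):
            if flag is None:
                continue  # a missing value contributes to neither count
            per_sample[i][0 if flag else 1] += 1
    return counts

# Hypothetical example: two phosphosites on one protein, six samples.
flags = {
    "P_S10": call_outliers([1.0, 2.0, 3.0, 4.0, 100.0, None]),
    "P_T22": call_outliers([9.0, 1.0, None, 1.5, 8.0, 2.0], iqr_multiple=0.5),
}
counts = aggregate_by_parent(flags, {"P_S10": "P", "P_T22": "P"})
```

Dividing the outlier count by the total count in each cell of `counts` gives the per-sample outlier fraction used for visualization.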

Figure 1

BlackSheep workflow. (A) Outliers are initially identified for each feature (row) in the experimental data set. (B) Simulations and data resampling are used to assign significance values for each sample and feature. (C) Cohort comparisons identify features with enriched outliers within a sample cohort of interest.

Simulations and Outlier p Values

The second main function in the package is “run_simulations”, which uses simulations based on the observed data to create simulated samples that are used to calculate a p value for each sample for each parent molecule. For each simulated sample, the procedure is as follows. First, for each feature in the parent molecule (e.g., phosphosite on a protein), its value in a simulated sample is determined to be either “observed” or “missing”. The likelihood is based on the proportion of missing values for that feature in the actual data. This step is most important in data sets that tend to have an abundance of missing data when imputation is undesirable. Next, if it is determined to have an “observed” value, then a random value is assigned from a kernel density estimate (KDE) fit to the observed values from the associated feature. The assigned value is tested against the outlier threshold for that feature to determine the outlier status. This is repeated for all features related to the parent molecule. The frequency of outliers found in the simulated data is used to assign a p value to the observed data based on the number of outliers found for each parent molecule in each observed sample. A significance threshold is set at a user-defined alpha (default p < 0.05). The output file (Supplementary Table 3) contains a p value for outlier status for each parent molecule in each sample if it reaches significance (Figure 1B). This function is useful when the user is curious about the significance in individual samples rather than in a cohort.
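The simulation procedure can be illustrated with a small standard-library sketch. Several details here are assumptions rather than facts about “run_simulations”: the Gaussian kernel with a Scott-style bandwidth, the "count at least as large as observed" definition of the empirical p value with a +1 correction, and the helper names `make_feature`, `simulate_outlier_count`, and `outlier_p_value`.

```python
import random
import statistics

def make_feature(values, iqr_multiple=1.5):
    """Summarize one feature row (None = missing) for simulation."""
    observed = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(observed, n=4, method="inclusive")
    return {
        "observed": observed,
        "missing_rate": 1 - len(observed) / len(values),
        "cutoff": statistics.median(observed) + iqr_multiple * (q3 - q1),
        # Assumption: Scott-style bandwidth for a Gaussian KDE.
        "bandwidth": statistics.stdev(observed) * len(observed) ** -0.2,
    }

def simulate_outlier_count(features, rng):
    """Simulate one sample and return how many of its features are outliers."""
    count = 0
    for feat in features:
        if rng.random() < feat["missing_rate"]:
            continue  # this feature is "missing" in the simulated sample
        # Drawing from a Gaussian KDE: pick an observed value, add kernel noise.
        center = rng.choice(feat["observed"])
        value = rng.gauss(center, feat["bandwidth"])
        if value > feat["cutoff"]:
            count += 1
    return count

def outlier_p_value(features, observed_count, n_sim=1000, seed=0):
    """Empirical p value: share of simulated samples with >= observed_count outliers."""
    rng = random.Random(seed)
    hits = sum(
        simulate_outlier_count(features, rng) >= observed_count
        for _ in range(n_sim)
    )
    return (hits + 1) / (n_sim + 1)  # +1 correction is an assumption

# Hypothetical parent molecule with two features measured in eight samples.
rows = [
    [0.1, -0.3, 0.5, 0.0, None, 0.2, -0.1, 0.4],
    [1.0, 1.2, 0.8, None, None, 1.1, 0.9, 1.3],
]
features = [make_feature(r) for r in rows]
p = outlier_p_value(features, observed_count=2)
```

A sample would then be reported for this parent molecule only if `p` falls below the chosen alpha.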

Cohort Comparisons

Groups of samples can be compared with DEVA to identify features with an enrichment of outliers within a group. BlackSheep calculates the enrichment of outliers, whether high-expression outliers or low-expression outliers, for every group of samples identified in a user-supplied sample annotation table (Supplementary Table 1B). The analysis can be limited to a user-supplied list of genes, such as kinases or known druggable targets.[1] To calculate enrichment, first, a row-based filter is applied, removing rows where the average rate of outliers for that row is lower in the annotated group of interest (the “in-group”) than in the out-group. Second, to ensure that results are not driven by an excessively small subset of the in-group, we only keep rows that have at least one outlier value in a user-defined proportion of samples in the in-group; the proportion defaults to 0.3. Finally, DEVA performs Fisher’s exact test on counts from outlier and nonoutlier values in the in-group against the out-group. All p values are then corrected for multiple hypothesis testing using the Benjamini–Hochberg procedure.[16] Results can be output as a table of q values for all comparisons (Supplementary Table 2E,F), a table with outlier counts, p values, and q values per comparison (Supplementary Table 2G,H), or a heatmap showing outlier fraction values in each sample for rows with a significant enrichment of outliers (Figures 1C and 4).
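The two filters, the Fisher's exact test, and the Benjamini–Hochberg correction can be sketched as follows. This is a minimal illustration under stated assumptions: the test is written as one-sided (enrichment in the in-group), which the text does not specify, and the names `fisher_greater`, `deva_row`, and `benjamini_hochberg` are hypothetical.

```python
from math import comb

def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a/b: in-group outliers/nonoutliers; c/d: out-group outliers/nonoutliers.
    Returns P(in-group outliers >= a) under the hypergeometric null.
    """
    n_in = a + b
    denom = comb(a + b + c + d, n_in)
    upper = min(n_in, a + c)
    tail = sum(comb(a + c, k) * comb(b + d, n_in - k) for k in range(a, upper + 1))
    return tail / denom

def deva_row(in_counts, out_counts, min_fraction=0.3):
    """Apply both row filters, then test. Counts are per-sample (outlier, nonoutlier) pairs."""
    a = sum(o for o, _ in in_counts)
    b = sum(n for _, n in in_counts)
    c = sum(o for o, _ in out_counts)
    d = sum(n for _, n in out_counts)
    if a + b == 0 or c + d == 0:
        return None  # nothing observed in one of the groups
    if a / (a + b) < c / (c + d):
        return None  # filter 1: outlier rate lower in the in-group
    with_outlier = sum(1 for o, _ in in_counts if o > 0) / len(in_counts)
    if with_outlier < min_fraction:
        return None  # filter 2: too few in-group samples carry an outlier
    return fisher_greater(a, b, c, d)

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values (q values) in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q
```

Rows for which `deva_row` returns `None` are dropped before `benjamini_hochberg` is applied to the remaining p values.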

Figure 4

Expression of phosphosites in the Her2 signaling pathway. Z scores of relative log2 abundance of all phosphosites in the Her2 signaling pathway with FDR < 0.01 calculated by BlackSheep.

Results

Performance Evaluation on Simulated Data

To rigorously compare BlackSheep to EdgeR[17] and Limma,[18] we generated several sets of simulated data that recapitulate challenging patterns found in real-world data sets, such as molecular data from cancer samples. Often, researchers are interested in finding molecules that are enriched in a small subgroup within a cohort. To address this common problem, we created simulated cohorts of 100 samples with 400 features each, split into an in-group and out-group of 12 and 88 samples, respectively (Figure 2A,B). Features for these cohorts were pulled from Gaussian distributions with standard deviations of 1. All features for the out-group were sampled from a distribution with a mean of 0, and we varied the mean of the in-group distribution between 2, 1.5, 1, 0.5, and 0. We then tested all features for differences between the in- and out-groups using DEVA, EdgeR, and Limma (Figure 2A). DEVA outperformed the other tools for small but not large mean differences. With a mean difference of 0.5, Limma and EdgeR do not have the power to detect differences between groups of 12 and 88 samples.
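A simulated cohort of this shape can be generated with the standard library alone. The sketch below follows the description literally (every feature shifted by the same in-group mean); the function name `simulate_cohort`, the label strings, and the seed are assumptions, and the text does not say how null features were defined for the ROC analysis.

```python
import random

def simulate_cohort(in_mean, n_in=12, n_out=88, n_features=400, seed=0):
    """Return (matrix, labels): rows are features, columns are samples.

    In-group values ~ N(in_mean, 1); out-group values ~ N(0, 1).
    """
    rng = random.Random(seed)
    labels = ["in"] * n_in + ["out"] * n_out
    matrix = [
        [rng.gauss(in_mean if lab == "in" else 0.0, 1.0) for lab in labels]
        for _ in range(n_features)
    ]
    return matrix, labels

# One cohort per in-group mean listed in the text.
cohorts = {mean: simulate_cohort(mean) for mean in (2, 1.5, 1, 0.5, 0)}
```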

Figure 2

Comparison of DEVA with EdgeR and Limma for simulated samples. The top panels show feature values for samples in each group. The bottom panels show ROC curves for each tested tool. (A) 88 out-group and 12 in-group samples were generated with 400 features each by sampling from Gaussian distributions with standard deviations of 1. Out-group means are 0; in-group means are as indicated. (B) For the simulated cohort with a mean difference of 2, increasing numbers of samples were swapped between the in- and out-groups to simulate imperfect labeling or heterogeneity.

We then set out to create a data set that would simulate the heterogeneous nature and imperfect labeling of molecular cancer data. Tumors are notoriously difficult to classify. In cancer data sets, subgroups often represent mixtures of molecular groups (e.g., luminal breast cancer containing both A and B subtypes), and samples can easily be misclassified (e.g., a Her2-enriched breast cancer sample mislabeled as luminal B). Subgroups used for comparisons therefore often contain mixtures of samples with varying levels of enrichment for a given feature. Because of the high variation of that feature within a group of interest, many differential expression methods will be unable to detect that enrichment. To simulate such a situation, we used our previously generated simulated data set with an in-group mean of 2. Within that data set, we swapped increasing numbers of samples between the in- and out-groups (Figure 2B). With increasing levels of impurity between the groups, DEVA strongly outperformed the other two tools. The above pattern of performance is not specific to the group sizes used in the simulations. The same pattern was replicated with other sizes of imbalanced groups (data not shown).
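The label-swapping step used to mimic impure or mislabeled groups can be sketched as below. The function name `swap_labels`, the label strings, and the choice of which samples to swap (uniformly at random) are assumptions; the text only states that increasing numbers of samples were swapped between the groups.

```python
import random

def swap_labels(labels, n_swap, seed=0):
    """Swap the group labels of n_swap in-group and n_swap out-group samples.

    Group sizes stay the same; only group purity changes.
    """
    rng = random.Random(seed)
    in_idx = [i for i, lab in enumerate(labels) if lab == "in"]
    out_idx = [i for i, lab in enumerate(labels) if lab == "out"]
    swapped = list(labels)
    for i in rng.sample(in_idx, n_swap):
        swapped[i] = "out"
    for i in rng.sample(out_idx, n_swap):
        swapped[i] = "in"
    return swapped

# Hypothetical series of increasingly impure groupings for a 12 vs 88 cohort.
labels = ["in"] * 12 + ["out"] * 88
impure = {n: swap_labels(labels, n) for n in (0, 2, 4, 6)}
```

Each relabeled vector is then paired with the unchanged data matrix, so the in-group increasingly contains samples drawn from the out-group distribution.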

Application to Breast Cancer Cohort

Reproducibly hyperphosphorylated kinases within a specific subtype or patient cohort represent attractive targets for future drug development and repurposing.[19−24] To demonstrate the utility of BlackSheep, we applied it to a data set from a proteogenomic breast cancer study[1] to find putatively overactive kinases that are unique to a molecular subtype.[1,25,26] We also compared the results of BlackSheep to the commonly used rank-sum test on real data. We used both tools to identify differentially abundant phosphosites in Her2-enriched (Her2e) breast cancer samples as compared with all other samples (Figure 3). In this cohort, Her2e is the smallest group, comprising 12 of the 76 samples. We used the full phosphosite expression matrix (63 130 phosphosites on 9881 proteins). Results of BlackSheep and the rank-sum test were corrected for multiple hypotheses with equal stringency. At an FDR cutoff of 0.01, rank-sum identified one enriched phosphosite (ERBB2-T1240) on the Her2 (ERBB2) protein (Figure 3). The DEVA pipeline identified 10 additional phosphosites on ERBB2 as well as phosphosites on established coamplicons and modulators of Her2 signaling, such as GRB7[27,28] (Figure 4). BlackSheep does not identify features that are enriched in large fractions of samples within a cohort. A feature with consistently high or low values in a large fraction of the cohort will shift the median and widen the IQR, so those values will no longer be called outliers. For small groups within a cohort, BlackSheep is able to identify enriched features (Figure ).

Figure 3

Comparing BlackSheep and rank-sum tests. Signed log10 q values from blacksheep.deva and rank-sum tests when comparing normalized values in Her2e against all other samples using phospho data. Dotted lines indicate FDR < 0.01.

Conclusions

Several cancer types have patients that fall into rare subgroups with worse prognoses than the majority of patients (e.g., the serous subgroup in endometrial cancer or the basal-like subgroup in breast cancer). Because of the difficulty in acquiring sufficient numbers of samples, these patients are the hardest to study, yet they are the patients most in need of new therapies. Whereas standard analysis techniques are useful for finding outliers that are enriched in large subgroups of samples, these strategies often lack the power to find the same for small subgroups. BlackSheep provides a user-friendly, complementary method to delineate enrichments of outlier events in a small group of samples within a cohort. We show that BlackSheep can find enrichment of known markers for small groups of samples, such as ERBB2 and GRB7 in Her2e samples, which other commonly used analysis paradigms miss. BlackSheep is a flexible complement to other methods such as rank-sum tests, EdgeR, and Limma. Whereas the necessary cohort and group-of-interest sizes depend on the effect size a user would like to detect, our simulations show that BlackSheep is highly sensitive and specific for detecting enriched molecules in small groups of interest. BlackSheep has previously been used to successfully detect druggable target kinases in small subgroups or even single samples within small cohorts.[1,4,9,11] In the future, BlackSheep-like strategies can be applied in the clinic to design and interpret diagnostic panels applied to single tumors, to highlight targets of drugs that can be repurposed for new indications, and to devise personalized treatments by prioritizing drugs that target significant outliers in a tumor.

References (10 of 27 shown)

Review 1. Effect size, confidence interval and statistical significance: a practical guide for biologists.

Authors: Shinichi Nakagawa; Innes C Cuthill
Journal: Biol Rev Camb Philos Soc Date: 2007-11

2. Long-Term Outcomes of Imatinib Treatment for Chronic Myeloid Leukemia.

Authors: Andreas Hochhaus; Richard A Larson; François Guilhot; Jerald P Radich; Susan Branford; Timothy P Hughes; Michele Baccarani; Michael W Deininger; Francisco Cervantes; Satoko Fujihara; Christine-Elke Ortmann; Hans D Menssen; Hagop Kantarjian; Stephen G O'Brien; Brian J Druker
Journal: N Engl J Med Date: 2017-03-09 Impact factor: 91.245

3. Structure, regulation, signaling, and targeting of abl kinases in cancer.

Authors: Oliver Hantschel
Journal: Genes Cancer Date: 2012-05

Review 4. Vemurafenib: the first drug approved for BRAF-mutant cancer.

Authors: Gideon Bollag; James Tsai; Jiazhong Zhang; Chao Zhang; Prabha Ibrahim; Keith Nolop; Peter Hirth
Journal: Nat Rev Drug Discov Date: 2012-10-12 Impact factor: 84.694

5. Integrated Proteogenomic Characterization of Human High-Grade Serous Ovarian Cancer.

Authors: Hui Zhang; Tao Liu; Zhen Zhang; Samuel H Payne; Bai Zhang; Jason E McDermott; Jian-Ying Zhou; Vladislav A Petyuk; Li Chen; Debjit Ray; Shisheng Sun; Feng Yang; Lijun Chen; Jing Wang; Punit Shah; Seong Won Cha; Paul Aiyetan; Sunghee Woo; Yuan Tian; Marina A Gritsenko; Therese R Clauss; Caitlin Choi; Matthew E Monroe; Stefani Thomas; Song Nie; Chaochao Wu; Ronald J Moore; Kun-Hsing Yu; David L Tabb; David Fenyö; Vineet Bafna; Yue Wang; Henry Rodriguez; Emily S Boja; Tara Hiltke; Robert C Rivers; Lori Sokoll; Heng Zhu; Ie-Ming Shih; Leslie Cope; Akhilesh Pandey; Bing Zhang; Michael P Snyder; Douglas A Levine; Richard D Smith; Daniel W Chan; Karin D Rodland
Journal: Cell Date: 2016-06-29 Impact factor: 41.582

6. Proteogenomic insights into the biology and treatment of HPV-negative head and neck squamous cell carcinoma.

Authors: Chen Huang; Lijun Chen; Sara R Savage; Rodrigo Vargas Eguez; Yongchao Dou; Yize Li; Felipe da Veiga Leprevost; Eric J Jaehnig; Jonathan T Lei; Bo Wen; Michael Schnaubelt; Karsten Krug; Xiaoyu Song; Marcin Cieślik; Hui-Yin Chang; Matthew A Wyczalkowski; Kai Li; Antonio Colaprico; Qing Kay Li; David J Clark; Yingwei Hu; Liwei Cao; Jianbo Pan; Yuefan Wang; Kyung-Cho Cho; Zhiao Shi; Yuxing Liao; Wen Jiang; Meenakshi Anurag; Jiayi Ji; Seungyeul Yoo; Daniel Cui Zhou; Wen-Wei Liang; Michael Wendl; Pankaj Vats; Steven A Carr; D R Mani; Zhen Zhang; Jiang Qian; Xi S Chen; Alexander R Pico; Pei Wang; Arul M Chinnaiyan; Karen A Ketchum; Christopher R Kinsinger; Ana I Robles; Eunkyung An; Tara Hiltke; Mehdi Mesri; Mathangi Thiagarajan; Alissa M Weaver; Andrew G Sikora; Jan Lubiński; Małgorzata Wierzbicka; Maciej Wiznerowicz; Shankha Satpathy; Michael A Gillette; George Miles; Matthew J Ellis; Gilbert S Omenn; Henry Rodriguez; Emily S Boja; Saravana M Dhanasekaran; Li Ding; Alexey I Nesvizhskii; Adel K El-Naggar; Daniel W Chan; Hui Zhang; Bing Zhang
Journal: Cancer Cell Date: 2021-01-07 Impact factor: 31.743

7. Proteogenomic integration reveals therapeutic targets in breast cancer xenografts.

Authors: Kuan-Lin Huang; Shunqiang Li; Philipp Mertins; Song Cao; Harsha P Gunawardena; Kelly V Ruggles; D R Mani; Karl R Clauser; Maki Tanioka; Jerry Usary; Shyam M Kavuri; Ling Xie; Christopher Yoon; Jana W Qiao; John Wrobel; Matthew A Wyczalkowski; Petra Erdmann-Gilmore; Jacqueline E Snider; Jeremy Hoog; Purba Singh; Beifung Niu; Zhanfang Guo; Sam Qiancheng Sun; Souzan Sanati; Emily Kawaler; Xuya Wang; Adam Scott; Kai Ye; Michael D McLellan; Michael C Wendl; Anna Malovannaya; Jason M Held; Michael A Gillette; David Fenyö; Christopher R Kinsinger; Mehdi Mesri; Henry Rodriguez; Sherri R Davies; Charles M Perou; Cynthia Ma; R Reid Townsend; Xian Chen; Steven A Carr; Matthew J Ellis; Li Ding
Journal: Nat Commun Date: 2017-03-28 Impact factor: 14.919

8. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.

Authors: Mark D Robinson; Davis J McCarthy; Gordon K Smyth
Journal: Bioinformatics Date: 2009-11-11 Impact factor: 6.937

9. Proteogenomics connects somatic mutations to signalling in breast cancer.

Authors: Philipp Mertins; D R Mani; Kelly V Ruggles; Michael A Gillette; Karl R Clauser; Pei Wang; Xianlong Wang; Jana W Qiao; Song Cao; Francesca Petralia; Emily Kawaler; Filip Mundt; Karsten Krug; Zhidong Tu; Jonathan T Lei; Michael L Gatza; Matthew Wilkerson; Charles M Perou; Venkata Yellapantula; Kuan-lin Huang; Chenwei Lin; Michael D McLellan; Ping Yan; Sherri R Davies; R Reid Townsend; Steven J Skates; Jing Wang; Bing Zhang; Christopher R Kinsinger; Mehdi Mesri; Henry Rodriguez; Li Ding; Amanda G Paulovich; David Fenyö; Matthew J Ellis; Steven A Carr
Journal: Nature Date: 2016-05-25 Impact factor: 49.962

Review 10. Chronic myeloid leukemia: the paradigm of targeting oncogenic tyrosine kinase signaling and counteracting resistance for successful cancer therapy.

Authors: Simona Soverini; Manuela Mancini; Luana Bavaro; Michele Cavo; Giovanni Martinelli
Journal: Mol Cancer Date: 2018-02-19 Impact factor: 27.401
